mmagic.models.editors.animatediff
Package Contents
Functions
- class mmagic.models.editors.animatediff.AnimateDiff(vae: ModelType, text_encoder: ModelType, tokenizer: str, unet: ModelType, scheduler: ModelType, test_scheduler: Optional[ModelType] = None, dtype: str = 'fp32', enable_xformers: bool = True, noise_offset_weight: float = 0, tomesd_cfg: Optional[dict] = None, data_preprocessor=dict(type='DataPreprocessor'), motion_module_cfg: Optional[dict] = None, dream_booth_lora_cfg: Optional[dict] = None)
Bases: mmengine.model.BaseModel
Implementation of AnimateDiff (https://arxiv.org/abs/2307.04725).
- Parameters
vae (Union[dict, nn.Module]) – The config or module for VAE model.
text_encoder (Union[dict, nn.Module]) – The config or module for text encoder.
tokenizer (str) – The name for CLIP tokenizer.
unet (Union[dict, nn.Module]) – The config or module for Unet model.
scheduler (Union[dict, nn.Module]) – The config or module for diffusion scheduler.
test_scheduler (Union[dict, nn.Module], optional) – The config or module for diffusion scheduler in test stage (self.infer). If not passed, the same scheduler as scheduler will be used. Defaults to None.
lora_config (dict, optional) – The config for LoRA finetuning. Defaults to None.
val_prompts (Union[str, List[str]], optional) – The prompts for validation. Defaults to None.
class_prior_prompt (str, optional) – The prompt for class prior loss.
num_class_images (int, optional) – The number of images for class prior. Defaults to 3.
prior_loss_weight (float, optional) – The weight for class prior loss. Defaults to 0.
fine_tune_text_encoder (bool, optional) – Whether to fine-tune text encoder. Defaults to False.
dtype (str, optional) – The dtype for the model. Defaults to ‘fp32’.
enable_xformers (bool, optional) – Whether to use xformers. Defaults to True.
noise_offset_weight (float, optional) – The weight of the noise offset introduced in https://www.crosslabs.org/blog/diffusion-with-offset-noise. Defaults to 0.
tomesd_cfg (dict, optional) – The config for TOMESD. Please refer to https://github.com/dbolya/tomesd and https://github.com/open-mmlab/mmagic/blob/main/mmagic/models/utils/tome_utils.py for details. Defaults to None.
motion_module_cfg (dict, optional) – The config for the motion module. Defaults to None.
dream_booth_lora_cfg (dict, optional) – The config for DreamBooth LoRA. Defaults to None.
data_preprocessor (dict, optional) – The pre-process config of BaseDataPreprocessor. Defaults to dict(type=’DataPreprocessor’).
init_cfg (dict, optional) – The weight initialization config for BaseModule. Defaults to None.
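As a rough illustration of how the constructor arguments above fit together, the following is a minimal sketch of an MMEngine-style config dict. Only the top-level keys and the `UNet3DConditionMotionModel` keyword names come from the signatures on this page. The nested `type` names for the VAE, text encoder and scheduler, their keyword arguments, and the `'path/to/stable-diffusion'` placeholder are assumptions for illustration and should be checked against the configs shipped with MMagic.

```python
# Placeholder path (assumption): point it at a Stable Diffusion style checkpoint.
base_model = 'path/to/stable-diffusion'

model = dict(
    type='AnimateDiff',
    # Nested `type` names and kwargs below are assumptions, not confirmed API.
    vae=dict(type='AutoencoderKL', from_pretrained=base_model, subfolder='vae'),
    text_encoder=dict(
        type='ClipWrapper',
        clip_type='huggingface',
        pretrained_model_name_or_path=base_model,
        subfolder='text_encoder'),
    tokenizer=base_model,
    # Keyword names here are taken from the UNet3DConditionMotionModel signature.
    unet=dict(
        type='UNet3DConditionMotionModel',
        from_pretrained=base_model,
        subfolder='unet',
        use_motion_module=True,
        motion_module_resolutions=(1, 2, 4, 8)),
    scheduler=dict(type='DDIMScheduler', from_pretrained=base_model, subfolder='scheduler'),
    test_scheduler=None,       # fall back to `scheduler` at test time
    dtype='fp32',              # default from the signature
    enable_xformers=True,
    noise_offset_weight=0,
    data_preprocessor=dict(type='DataPreprocessor'),
)
```

Leaving `test_scheduler` as None keeps training and inference on the same scheduler, which is the documented fallback.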
- property device
Get the device of the model.
- set_xformers(module: Optional[torch.nn.Module] = None) → torch.nn.Module
Set xformers for the model.
- Returns
The model with xformers.
- Return type
nn.Module
- set_tomesd() → torch.nn.Module
Set ToMe for the stable diffusion model.
- Returns
The model with ToMe.
- Return type
nn.Module
- _encode_prompt(prompt, device, num_videos_per_prompt, do_classifier_free_guidance, negative_prompt)
Encodes the prompt into text encoder hidden states.
- prepare_extra_step_kwargs(generator, eta)
Prepare extra kwargs for the scheduler step, since not all schedulers have the same signature. eta (η) is only used with the DDIMScheduler; it is ignored for other schedulers.
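One common way to handle schedulers with differing `step` signatures is to inspect the signature and forward only the arguments it accepts. The sketch below illustrates that idea with standalone code; the free function and the two stub scheduler classes are hypothetical, and this is not necessarily how the method is implemented in MMagic.

```python
import inspect


def prepare_extra_step_kwargs(scheduler, generator=None, eta=0.0):
    """Collect only the optional kwargs that `scheduler.step` accepts."""
    step_params = inspect.signature(scheduler.step).parameters
    extra_kwargs = {}
    if 'eta' in step_params:
        extra_kwargs['eta'] = eta
    if 'generator' in step_params:
        extra_kwargs['generator'] = generator
    return extra_kwargs


class DDIMLikeScheduler:
    """Hypothetical scheduler whose step() understands eta and generator."""

    def step(self, model_output, timestep, sample, eta=0.0, generator=None):
        return sample


class PlainScheduler:
    """Hypothetical scheduler whose step() takes no extra arguments."""

    def step(self, model_output, timestep, sample):
        return sample


ddim_kwargs = prepare_extra_step_kwargs(DDIMLikeScheduler(), eta=0.5)
plain_kwargs = prepare_extra_step_kwargs(PlainScheduler(), eta=0.5)
```

The resulting dict can then be unpacked into each call, for example `scheduler.step(noise_pred, t, latents, **extra_kwargs)`, so the denoising loop does not need per-scheduler branches.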
- convert_lora(state_dict, LORA_PREFIX_UNET='lora_unet', LORA_PREFIX_TEXT_ENCODER='lora_te', alpha=0.6)
Convert LoRA for unet and text_encoder.
TODO: use this function to convert lora
- Parameters
state_dict (_type_) – _description_
LORA_PREFIX_UNET (str, optional) – _description_. Defaults to 'lora_unet'.
LORA_PREFIX_TEXT_ENCODER (str, optional) – _description_. Defaults to 'lora_te'.
alpha (float, optional) – _description_. Defaults to 0.6.
- Returns
unet && text_encoder. TODO: check each output type.
- Return type
_type_
- prepare_latents(batch_size, num_channels_latents, video_length, height, width, dtype, device, generator, latents=None)
Prepare latent variables.
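To make the arguments concrete, here is a small sketch of the latent shape such a method would typically allocate. It assumes the usual latent-diffusion convention: the frame axis follows the channel axis, and the spatial size is the pixel size divided by the VAE scale factor. The helper name and the default `vae_scale_factor=8` are assumptions, not values confirmed by this page.

```python
def latent_shape(batch_size, num_channels_latents, video_length,
                 height, width, vae_scale_factor=8):
    """Return the assumed (B, C, F, H', W') shape of the video latents."""
    return (
        batch_size,
        num_channels_latents,
        video_length,
        height // vae_scale_factor,  # latent height
        width // vae_scale_factor,   # latent width
    )


shape = latent_shape(batch_size=1, num_channels_latents=4, video_length=16,
                     height=512, width=512)
```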
- prepare_model()
Prepare model for training.
Move model to target dtype and disable gradient for some models.
- val_step(data: dict) → mmagic.utils.typing.SampleList
Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Returns the generated results, which will be passed to the evaluator or visualizer.
- Parameters
data (dict or tuple or list) – Data sampled from dataset.
- Returns
Generated image or image dict.
- Return type
SampleList
- test_step(data: dict) → mmagic.utils.typing.SampleList
Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Returns the generated results, which will be passed to the evaluator or visualizer.
- Parameters
data (dict or tuple or list) – Data sampled from dataset.
- Returns
Generated image or image dict.
- Return type
SampleList
- infer(prompt: Union[str, List[str]], video_length: Optional[int] = 16, height: Optional[int] = None, width: Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: Optional[Union[str, List[str]]] = None, num_videos_per_prompt: Optional[int] = 1, eta: float = 0.0, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, latents: Optional[torch.FloatTensor] = None, return_type: Optional[str] = 'tensor', show_progress: bool = True, seed: Optional[int] = 1007)
Function invoked when calling the pipeline for generation.
- Parameters
prompt (str or List[str]) – The prompt or prompts to guide the video generation.
video_length (int, optional) – The number of frames of the generated video. Defaults to 16.
height (int, optional) – The height in pixels of the generated image. If not passed, the height will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
width (int, optional) – The width in pixels of the generated image. If not passed, the width will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
num_inference_steps (int) – The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference. Defaults to 50.
guidance_scale (float) – Guidance scale as defined in Classifier-Free Diffusion Guidance (https://arxiv.org/abs/2207.12598). Defaults to 7.5.
negative_prompt (str or List[str], optional) – The prompt or prompts not to guide the video generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). Defaults to None.
num_videos_per_prompt (int) – The number of videos to generate per prompt. Defaults to 1.
eta (float) – Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to DDIMScheduler, will be ignored for others. Defaults to 0.0.
generator (torch.Generator, optional) – A torch generator to make generation deterministic. Defaults to None.
latents (torch.FloatTensor, optional) – Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator. Defaults to None.
return_type (str) – The return type of the inference results. Supported types are ‘video’, ‘numpy’, ‘tensor’. If ‘video’ is passed, a list of PIL images will be returned. If ‘numpy’ is passed, a numpy array with shape [N, C, H, W] will be returned, and the value range will be the same as the decoder’s output range. If ‘tensor’ is passed, the decoder’s output will be returned. Defaults to ‘tensor’.
- Returns
A dict containing the generated video.
- Return type
dict
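A call to `infer` might look like the sketch below. The keyword names and default values are taken from the signature above. The wrapper name `generate_video` is hypothetical, and `model` is assumed to be an already constructed AnimateDiff instance (or any object exposing a compatible `infer` method).

```python
def generate_video(model, prompt, negative_prompt=None, seed=1007):
    """Call `model.infer` with the documented defaults spelled out."""
    return model.infer(
        prompt=prompt,
        video_length=16,            # number of frames
        num_inference_steps=50,     # more steps: slower, usually higher quality
        guidance_scale=7.5,         # classifier-free guidance strength
        negative_prompt=negative_prompt,
        num_videos_per_prompt=1,
        eta=0.0,                    # only used by DDIMScheduler
        return_type='tensor',       # or 'video' / 'numpy'
        seed=seed,
    )
```

Passing the same `seed` (or the same pre-generated `latents`) with different prompts is a convenient way to compare prompts on otherwise identical noise.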
- mmagic.models.editors.animatediff.save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=6, fps=8)
- class mmagic.models.editors.animatediff.UNet3DConditionMotionModel(sample_size: Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D'), mid_block_type: str = 'UNetMidBlock3DCrossAttn', up_block_types: Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D'), only_cross_attention: Union[bool, Tuple[bool]] = False, block_out_channels: Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: int = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 768, attention_head_dim: Union[int, Tuple[int]] = 8, dual_cross_attention: bool = False, use_linear_projection: bool = False, class_embed_type: Optional[str] = None, num_class_embeds: Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', use_inflated_groupnorm=False, use_motion_module=False, motion_module_resolutions=(1, 2, 4, 8), motion_module_mid_block=False, motion_module_decoder_only=False, motion_module_type=None, motion_module_kwargs={}, unet_use_cross_frame_attention=None, unet_use_temporal_attention=None, subfolder=None, from_pretrained=None, unet_addtion_kwargs=None)
Bases: diffusers.models.modeling_utils.ModelMixin, diffusers.configuration_utils.ConfigMixin
- _supports_gradient_checkpointing = True
Implementation of UNet3DConditionMotionModel
- init_weights(subfolder=None, from_pretrained=None)
Init weights for models.
We just use the initialization method proposed in the original paper.
- Parameters
from_pretrained (str, optional) – Path for pretrained weights. If None is given, pretrained weights will not be loaded. Defaults to None.
- set_attention_slice(slice_size)
Enable sliced attention computation.
When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
- Parameters
slice_size (str or int or list(int), optional, defaults to “auto”) – When “auto”, halves the input to the attention heads, so attention will be computed in two steps. If “max”, the maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.
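The three accepted forms of `slice_size` can be read as the following resolution rule. This is a minimal sketch derived only from the description above, for a single integer `attention_head_dim`; the helper name `resolve_slice_size` is hypothetical and the library's own implementation may differ in detail.

```python
def resolve_slice_size(slice_size, attention_head_dim):
    """Return (slice size, number of attention steps) for one attention layer."""
    if slice_size == 'auto':
        resolved = attention_head_dim // 2   # two steps
    elif slice_size == 'max':
        resolved = 1                         # one slice at a time
    else:
        resolved = int(slice_size)
    if resolved <= 0 or attention_head_dim % resolved != 0:
        raise ValueError(
            f'attention_head_dim={attention_head_dim} must be a positive '
            f'multiple of slice_size={resolved}')
    return resolved, attention_head_dim // resolved


auto_result = resolve_slice_size('auto', 8)
max_result = resolve_slice_size('max', 8)
int_result = resolve_slice_size(2, 8)
```

Smaller slices lower peak memory at the cost of more sequential attention steps, which is the trade-off the option exposes.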
- forward(sample: torch.FloatTensor, timestep: Union[torch.Tensor, float, int], encoder_hidden_states: torch.Tensor, class_labels: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, return_dict: bool = True) → Union[UNet3DConditionOutput, Tuple]
- Parameters
sample (torch.FloatTensor) – (batch, channel, height, width) noisy inputs tensor.
timestep (torch.FloatTensor or float or int) – (batch) timesteps.
encoder_hidden_states (torch.FloatTensor) – (batch, sequence_length, feature_dim) encoder hidden states.
return_dict (bool, optional, defaults to True) – Whether or not to return a [UNet3DConditionOutput] instead of a plain tuple.
- Returns
[UNet3DConditionOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.
- Return type
[UNet3DConditionOutput] or tuple
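The `return_dict` convention described above can be sketched without the model itself. In the example below, `OutputSketch` is a hypothetical stand-in for UNet3DConditionOutput with a single `sample` field, and both helper functions are illustrative only.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class OutputSketch:
    """Hypothetical stand-in for UNet3DConditionOutput."""
    sample: Any


def package_output(sample, return_dict=True):
    """Mimic the documented return convention of forward()."""
    if not return_dict:
        return (sample,)             # plain tuple, sample first
    return OutputSketch(sample=sample)


def extract_sample(result):
    """Read the sample regardless of which form was returned."""
    return result[0] if isinstance(result, tuple) else result.sample
```

Caller code written against `extract_sample` keeps working whether `return_dict` is True or False.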