Package Contents



Implementation of AnimateDiff.



save_videos_grid(videos, path[, rescale, n_rows, fps])

class mmagic.models.editors.animatediff.AnimateDiff(vae: ModelType, text_encoder: ModelType, tokenizer: str, unet: ModelType, scheduler: ModelType, test_scheduler: Optional[ModelType] = None, dtype: str = 'fp32', enable_xformers: bool = True, noise_offset_weight: float = 0, tomesd_cfg: Optional[dict] = None, data_preprocessor=dict(type='DataPreprocessor'), motion_module_cfg: Optional[dict] = None, dream_booth_lora_cfg: Optional[dict] = None)[source]

Bases: mmengine.model.BaseModel

Implementation of AnimateDiff.

  • vae (Union[dict, nn.Module]) – The config or module for VAE model.

  • text_encoder (Union[dict, nn.Module]) – The config or module for text encoder.

  • tokenizer (str) – The name for CLIP tokenizer.

  • unet (Union[dict, nn.Module]) – The config or module for Unet model.

  • scheduler (Union[dict, nn.Module]) – The config or module for diffusion scheduler.

  • test_scheduler (Union[dict, nn.Module], optional) – The config or module for the diffusion scheduler used in the test stage (self.infer). If not passed, the same scheduler as scheduler is used. Defaults to None.

  • lora_config (dict, optional) – The config for LoRA finetuning. Defaults to None.

  • val_prompts (Union[str, List[str]], optional) – The prompts for validation. Defaults to None.

  • class_prior_prompt (str, optional) – The prompt for class prior loss.

  • num_class_images (int, optional) – The number of images for class prior. Defaults to 3.

  • prior_loss_weight (float, optional) – The weight for class prior loss. Defaults to 0.

  • fine_tune_text_encoder (bool, optional) – Whether to fine-tune text encoder. Defaults to False.

  • dtype (str, optional) – The dtype for the model. Defaults to ‘fp32’.

  • enable_xformers (bool, optional) – Whether to use xformers. Defaults to True.

  • noise_offset_weight (float, optional) – The weight of the noise offset. Defaults to 0.

  • tomesd_cfg (dict, optional) – The config for TOMESD. Defaults to None.

  • data_preprocessor (dict, optional) – The pre-process config of BaseDataPreprocessor. Defaults to dict(type='DataPreprocessor').

  • init_cfg (dict, optional) – The weight initialization config for BaseModule. Defaults to None.
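
A minimal sketch of how the constructor arguments above might be assembled into a config dict. The optional defaults are copied from the class signature. Everything else is an assumption made for illustration: the helper `build_animatediff_cfg`, the placeholder component types, the tokenizer path, and the use of `type='AnimateDiff'` as the registry name are not confirmed by this page.

```python
import copy

# Optional arguments and their defaults, copied from the class signature above.
ANIMATEDIFF_DEFAULTS = dict(
    test_scheduler=None,
    dtype='fp32',
    enable_xformers=True,
    noise_offset_weight=0,
    tomesd_cfg=None,
    data_preprocessor=dict(type='DataPreprocessor'),
    motion_module_cfg=None,
    dream_booth_lora_cfg=None,
)

# Arguments that have no default in the signature.
REQUIRED_KEYS = ('vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler')


def build_animatediff_cfg(**overrides):
    """Hypothetical helper: merge overrides with the documented defaults."""
    missing = [key for key in REQUIRED_KEYS if key not in overrides]
    if missing:
        raise KeyError(f'missing required arguments: {missing}')
    unknown = set(overrides) - set(REQUIRED_KEYS) - set(ANIMATEDIFF_DEFAULTS)
    if unknown:
        raise KeyError(f'unknown arguments: {sorted(unknown)}')
    cfg = dict(type='AnimateDiff')  # assumed registry name
    cfg.update(copy.deepcopy(ANIMATEDIFF_DEFAULTS))
    cfg.update(overrides)
    return cfg


model_cfg = build_animatediff_cfg(
    vae=dict(type='PlaceholderVAE'),                   # hypothetical type
    text_encoder=dict(type='PlaceholderTextEncoder'),  # hypothetical type
    tokenizer='path/to/tokenizer',                     # hypothetical path
    unet=dict(type='PlaceholderUNet'),                 # hypothetical type
    scheduler=dict(type='PlaceholderScheduler'),       # hypothetical type
    dtype='fp16',                                      # override the default
)
```

Rejecting unknown keys early is a design choice of the sketch; it catches arguments that appear in a docstring but not in the signature.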

property device

Set device for the model.

set_xformers(module: Optional[torch.nn.Module] = None) → torch.nn.Module

Set xformers for the model.

Returns: The model with xformers.



set_tomesd() → torch.nn.Module

Set ToMe for the stable diffusion model.

Returns: The model with ToMe.



_encode_prompt(prompt, device, num_videos_per_prompt, do_classifier_free_guidance, negative_prompt)

Encodes the prompt into text encoder hidden states.


latents decoder.

prepare_extra_step_kwargs(generator, eta)

Prepare extra kwargs for the scheduler step, since not all schedulers have the same signature. eta (η) is only used with the DDIMScheduler; it is ignored for other schedulers.
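
One common way to handle schedulers with differing `step` signatures is to inspect the signature and forward only the arguments it declares. The sketch below illustrates that idea with stand-in scheduler classes; `extra_step_kwargs_for`, `DDIMLikeScheduler`, and `PlainScheduler` are hypothetical names, and the actual method may be implemented differently.

```python
import inspect


class DDIMLikeScheduler:
    """Stand-in scheduler whose step() accepts eta and generator."""

    def step(self, model_output, timestep, sample, eta=0.0, generator=None):
        return sample


class PlainScheduler:
    """Stand-in scheduler whose step() accepts neither eta nor generator."""

    def step(self, model_output, timestep, sample):
        return sample


def extra_step_kwargs_for(scheduler, generator, eta):
    """Forward eta and generator only if scheduler.step() declares them."""
    accepted = set(inspect.signature(scheduler.step).parameters)
    extra_step_kwargs = {}
    if 'eta' in accepted:
        extra_step_kwargs['eta'] = eta
    if 'generator' in accepted:
        extra_step_kwargs['generator'] = generator
    return extra_step_kwargs


ddim_kwargs = extra_step_kwargs_for(DDIMLikeScheduler(), None, 0.5)
plain_kwargs = extra_step_kwargs_for(PlainScheduler(), None, 0.5)
```

The resulting dict can then be unpacked into each `scheduler.step(...)` call, so the denoising loop does not need to know which scheduler is in use.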

check_inputs(prompt, height, width)

Check inputs.

Raises an error if the inputs are not valid.
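
This page does not list the exact checks, so the following is only a sketch of the kind of validation such a method typically performs. The rules shown (prompt must be a `str` or `list`, and height and width must be divisible by 8) are assumptions borrowed from common Stable Diffusion pipelines, and `check_inputs_sketch` is a hypothetical name.

```python
def check_inputs_sketch(prompt, height, width, multiple=8):
    """Raise ValueError if the prompt type or the frame size looks invalid.

    The divisibility rule is an assumption, not taken from this page.
    """
    if not isinstance(prompt, (str, list)):
        raise ValueError(
            f'`prompt` must be of type str or list, got {type(prompt)}')
    if height % multiple != 0 or width % multiple != 0:
        raise ValueError(
            f'`height` and `width` must be divisible by {multiple}, '
            f'got {height} and {width}')


check_inputs_sketch('a cat walking', 512, 512)  # valid, returns None
```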

convert_lora(state_dict, LORA_PREFIX_UNET='lora_unet', LORA_PREFIX_TEXT_ENCODER='lora_te', alpha=0.6)

Convert LoRA weights for unet and text_encoder.

TODO: use this function to convert lora

  • state_dict (_type_) – _description_

  • LORA_PREFIX_UNET (str, optional) – _description_. Defaults to 'lora_unet'.

  • LORA_PREFIX_TEXT_ENCODER (str, optional) – _description_. Defaults to 'lora_te'.

  • alpha (float, optional) – _description_. Defaults to 0.6.

Returns: unet and text_encoder. TODO: check each output type.



prepare_latents(batch_size, num_channels_latents, video_length, height, width, dtype, device, generator, latents=None)

Prepare latent variables.
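
The arguments of `prepare_latents` suggest a five-dimensional latent tensor. The sketch below computes the shape such a tensor would have, assuming a (batch, channels, frames, height, width) layout and a VAE scale factor of 8; both are assumptions, as is the helper name `latent_shape`.

```python
def latent_shape(batch_size, num_channels_latents, video_length,
                 height, width, vae_scale_factor=8):
    """Shape of the latent tensor for a video of the given pixel size.

    Assumes the VAE downsamples each spatial dimension by vae_scale_factor.
    """
    return (
        batch_size,
        num_channels_latents,
        video_length,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


shape = latent_shape(batch_size=1, num_channels_latents=4,
                     video_length=16, height=512, width=512)
```

If `latents` is None, a tensor of this shape would typically be sampled from a Gaussian using the supplied generator; otherwise the provided latents would be used as given.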


Prepare model for training.

Move model to target dtype and disable gradient for some models.


Set LORA for model.

val_step(data: dict) → mmagic.utils.typing.SampleList

Gets the generated image of the given data. Calls self.data_preprocessor and self.infer in order, then returns the generated results, which are passed to the evaluator or visualizer.

  • data (dict or tuple or list) – Data sampled from dataset.

Returns: Generated image or image dict.



test_step(data: dict) → mmagic.utils.typing.SampleList

Gets the generated image of the given data. Calls self.data_preprocessor and self.infer in order, then returns the generated results, which are passed to the evaluator or visualizer.

  • data (dict or tuple or list) – Data sampled from dataset.

Returns: Generated image or image dict.
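
The call order described for `val_step` and `test_step` can be sketched with a stand-in class. `TinyPipeline` is not the real model; the layout of `data` and the way prompts are extracted from it are assumptions made only to show the preprocess-then-infer flow.

```python
class TinyPipeline:
    """Stand-in model that mimics the documented call order."""

    def data_preprocessor(self, data):
        # Hypothetical preprocessing: normalise the prompts in the batch.
        return {'prompt': [p.strip() for p in data['prompt']]}

    def infer(self, prompt):
        # Hypothetical generation: one placeholder result per prompt.
        return [f'<video for {p!r}>' for p in prompt]

    def val_step(self, data):
        # 1) preprocess the sampled data, 2) run inference on it.
        data = self.data_preprocessor(data)
        return self.infer(data['prompt'])

    def test_step(self, data):
        # Documented identically to val_step, so the sketch reuses it.
        return self.val_step(data)


results = TinyPipeline().val_step({'prompt': [' a cat ', 'a dog']})
```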



infer(prompt: Union[str, List[str]], video_length: Optional[int] = 16, height: Optional[int] = None, width: Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: Optional[Union[str, List[str]]] = None, num_videos_per_prompt: Optional[int] = 1, eta: float = 0.0, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, latents: Optional[torch.FloatTensor] = None, return_type: Optional[str] = 'tensor', show_progress: bool = True, seed: Optional[int] = 1007)

Function invoked when calling the pipeline for generation.

  • prompt (str or List[str]) – The prompt or prompts to guide the video generation.

  • video_length (int, Optional) – The number of frames of the generated video. Defaults to 16.

  • height (int, Optional) – The height in pixels of the generated image. If not passed, the height will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.

  • width (int, Optional) – The width in pixels of the generated image. If not passed, the width will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.

  • num_inference_steps (int) – The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference. Defaults to 50.

  • guidance_scale (float) – Guidance scale as defined in Classifier-Free Diffusion Guidance. Defaults to 7.5.

  • negative_prompt (str or List[str], optional) – The prompt or prompts not to guide the video generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). Defaults to None.

  • num_videos_per_prompt (int) – The number of videos to generate per prompt. Defaults to 1.

  • eta (float) – Corresponds to parameter eta (η) in the DDIM paper. Only applies to DDIMScheduler and is ignored for other schedulers. Defaults to 0.0.

  • generator (torch.Generator, optional) – A torch generator to make generation deterministic. Defaults to None.

  • latents (torch.FloatTensor, optional) – Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator. Defaults to None.

  • return_type (str) – The return type of the inference results. Supported types are ‘video’, ‘numpy’, ‘tensor’. If ‘video’ is passed, a list of PIL images will be returned. If ‘numpy’ is passed, a numpy array with shape [N, C, H, W] will be returned, and the value range will be the same as the decoder’s output range. If ‘tensor’ is passed, the decoder’s output will be returned. Defaults to ‘tensor’.

Returns: A dict containing the generated video.

Return type: dict
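
The role of `guidance_scale` and `negative_prompt` can be illustrated with the usual classifier-free guidance combination, in which the unconditional (or negative-prompt) prediction is pushed toward the text-conditioned one. This is a sketch of the standard formulation on plain Python lists; `apply_guidance` is a hypothetical helper, and this page does not confirm that `infer` combines the predictions in exactly this way.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Standard classifier-free guidance: uncond + scale * (text - uncond)."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


noise_uncond = [0.0, 1.0]  # prediction for the negative or empty prompt
noise_text = [1.0, 3.0]    # prediction for the actual prompt

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
unguided = apply_guidance(noise_uncond, noise_text, guidance_scale=1.0)
```

With a scale of 1 the result equals the text-conditioned prediction, which is why the negative prompt has no effect when guidance is not in use.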

abstract forward(inputs: torch.Tensor, data_samples: Optional[list] = None, mode: str = 'tensor') → Union[Dict[str, torch.Tensor], list]

forward is not implemented yet.

mmagic.models.editors.animatediff.save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=6, fps=8)[source]

class mmagic.models.editors.animatediff.UNet3DConditionMotionModel(sample_size: Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D'), mid_block_type: str = 'UNetMidBlock3DCrossAttn', up_block_types: Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D'), only_cross_attention: Union[bool, Tuple[bool]] = False, block_out_channels: Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: int = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 768, attention_head_dim: Union[int, Tuple[int]] = 8, dual_cross_attention: bool = False, use_linear_projection: bool = False, class_embed_type: Optional[str] = None, num_class_embeds: Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', use_inflated_groupnorm=False, use_motion_module=False, motion_module_resolutions=(1, 2, 4, 8), motion_module_mid_block=False, motion_module_decoder_only=False, motion_module_type=None, motion_module_kwargs={}, unet_use_cross_frame_attention=None, unet_use_temporal_attention=None, subfolder=None, from_pretrained=None, unet_addtion_kwargs=None)[source]

Bases: diffusers.models.modeling_utils.ModelMixin, diffusers.configuration_utils.ConfigMixin

_supports_gradient_checkpointing = True

Implementation of UNet3DConditionMotionModel.

init_weights(subfolder=None, from_pretrained=None)

Init weights for models.

We just use the initialization method proposed in the original paper.


  • from_pretrained (str, optional) – Path for pretrained weights. If None, pretrained weights will not be loaded. Defaults to None.


Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.


  • slice_size (str or int or list(int), optional, defaults to “auto”) – When “auto”, halves the input to the attention heads, so attention will be computed in two steps. If “max”, the maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.
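
The three `slice_size` modes described above can be worked through with a small calculation. This sketch covers only the scalar cases for a single attention layer (the per-layer list form is left out), and `resolve_slice_size` and `num_slices` are hypothetical helper names.

```python
def resolve_slice_size(slice_size, attention_head_dim):
    """Translate the slice_size option into a concrete integer slice size."""
    if slice_size == 'auto':
        # Halve the input to the attention heads: two steps.
        return attention_head_dim // 2
    if slice_size == 'max':
        # Run one slice at a time for the largest memory saving.
        return 1
    if attention_head_dim % slice_size != 0:
        raise ValueError(
            f'attention_head_dim ({attention_head_dim}) must be a '
            f'multiple of slice_size ({slice_size})')
    return slice_size


def num_slices(slice_size, attention_head_dim):
    """Number of attention steps: attention_head_dim // slice_size."""
    return attention_head_dim // resolve_slice_size(
        slice_size, attention_head_dim)


steps_auto = num_slices('auto', 8)
steps_max = num_slices('max', 8)
steps_two = num_slices(2, 8)
```

More slices mean lower peak memory per step but more sequential work, which is the speed trade-off mentioned above.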

_set_gradient_checkpointing(module, value=False)

Set gradient checkpointing.

forward(sample: torch.FloatTensor, timestep: Union[torch.Tensor, float, int], encoder_hidden_states: torch.Tensor, class_labels: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, return_dict: bool = True) → Union[UNet3DConditionOutput, Tuple]

  • sample (torch.FloatTensor) – (batch, channel, height, width) noisy inputs tensor

  • timestep (torch.FloatTensor or float or int) – (batch) timesteps

  • encoder_hidden_states (torch.FloatTensor) – (batch, sequence_length, feature_dim) encoder hidden states

  • return_dict (bool, optional, defaults to True) – Whether or not to return a [UNet3DConditionOutput] instead of a plain tuple.

Returns: [UNet3DConditionOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

Return type: [UNet3DConditionOutput] or tuple
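
The `return_dict` convention can be sketched with a stand-in output class. `OutputSketch` and `forward_sketch` are hypothetical, and the halving step is only a placeholder for the real network computation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutputSketch:
    """Stand-in for UNet3DConditionOutput: a container with a sample field."""
    sample: List[float]


def forward_sketch(sample, return_dict=True):
    """Return an output object, or a plain tuple when return_dict is False."""
    denoised = [0.5 * x for x in sample]  # placeholder computation
    if not return_dict:
        # In tuple form, the first element is the sample.
        return (denoised,)
    return OutputSketch(sample=denoised)


as_object = forward_sketch([2.0, 4.0])
as_tuple = forward_sketch([2.0, 4.0], return_dict=False)
```

Callers that need to support both forms can read `output.sample` in the first case and `output[0]` in the second.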

classmethod from_pretrained_2d(pretrained_model_path, subfolder=None, unet_additional_kwargs=None)

A class method for initialization.
