`mmagic.models.editors.animatediff`¶

Package Contents¶

Classes¶

`AnimateDiff`	Implementation of AnimateDiff.
`UNet3DConditionMotionModel`

Functions¶

save_videos_grid(videos, path[, rescale, n_rows, fps])

class mmagic.models.editors.animatediff.AnimateDiff(vae: ModelType, text_encoder: ModelType, tokenizer: str, unet: ModelType, scheduler: ModelType, test_scheduler: Optional[ModelType] = None, dtype: str = 'fp32', enable_xformers: bool = True, noise_offset_weight: float = 0, tomesd_cfg: Optional[dict] = None, data_preprocessor=dict(type='DataPreprocessor'), motion_module_cfg: Optional[dict] = None, dream_booth_lora_cfg: Optional[dict] = None)[source]¶

Bases: mmengine.model.BaseModel

Implementation of AnimateDiff (https://arxiv.org/abs/2307.04725).

Parameters

vae (Union[dict, nn.Module]) – The config or module for VAE model.
text_encoder (Union[dict, nn.Module]) – The config or module for text encoder.
tokenizer (str) – The name for CLIP tokenizer.
unet (Union[dict, nn.Module]) – The config or module for Unet model.
scheduler (Union[dict, nn.Module]) – The config or module for diffusion scheduler.
test_scheduler (Union[dict, nn.Module], optional) – The config or module for diffusion scheduler in test stage (self.infer). If not passed, will use the same scheduler as scheduler. Defaults to None.
lora_config (dict, optional) – The config for LoRA finetuning. Defaults to None.
val_prompts (Union[str, List[str]], optional) – The prompts for validation. Defaults to None.
class_prior_prompt (str, optional) – The prompt for class prior loss.
num_class_images (int, optional) – The number of images for class prior. Defaults to 3.
prior_loss_weight (float, optional) – The weight for class prior loss. Defaults to 0.
fine_tune_text_encoder (bool, optional) – Whether to fine-tune text encoder. Defaults to False.
dtype (str, optional) – The dtype for the model. Defaults to 'fp32'.
enable_xformers (bool, optional) – Whether to use xformers. Defaults to True.
noise_offset_weight (float, optional) – The weight of noise offset introduced in https://www.crosslabs.org/blog/diffusion-with-offset-noise. Defaults to 0.
tomesd_cfg (dict, optional) – The config for TOMESD. Please refer to https://github.com/dbolya/tomesd and https://github.com/open-mmlab/mmagic/blob/main/mmagic/models/utils/tome_utils.py for details. Defaults to None.
data_preprocessor (dict, optional) – The pre-process config of BaseDataPreprocessor. Defaults to dict(type='DataPreprocessor').
init_cfg (dict, optional) – The weight initialized config for BaseModule. Defaults to None.
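
As a rough illustration of how these constructor arguments fit together, the following is a minimal config sketch in the nested-dict style used for MMEngine models. Only the top-level keys and the `UNet3DConditionMotionModel` arguments come from the signatures on this page. The nested `type` names for the VAE, text encoder, and scheduler, their `from_pretrained`/`subfolder` keys, the tokenizer value, and the checkpoint identifier are assumptions made for illustration and should be checked against the configs shipped with MMagic.

```python
# Hypothetical checkpoint identifier; replace with the weights you actually use.
sd_checkpoint = 'runwayml/stable-diffusion-v1-5'

model = dict(
    type='AnimateDiff',
    # Nested type names and keys below are assumptions, not confirmed API.
    vae=dict(type='AutoencoderKL', from_pretrained=sd_checkpoint,
             subfolder='vae'),
    text_encoder=dict(type='ClipWrapper', clip_type='huggingface',
                      pretrained_model_name_or_path=sd_checkpoint,
                      subfolder='text_encoder'),
    tokenizer=sd_checkpoint,
    # These UNet arguments appear in the UNet3DConditionMotionModel signature.
    unet=dict(type='UNet3DConditionMotionModel',
              from_pretrained=sd_checkpoint,
              subfolder='unet',
              use_motion_module=True,
              motion_module_resolutions=(1, 2, 4, 8)),
    scheduler=dict(type='DDIMScheduler', from_pretrained=sd_checkpoint,
                   subfolder='scheduler'),
    # The remaining values mirror the defaults listed in the class signature.
    dtype='fp32',
    enable_xformers=True,
    noise_offset_weight=0,
    data_preprocessor=dict(type='DataPreprocessor'),
)
```

The `motion_module_cfg` and `dream_booth_lora_cfg` arguments are left out here because this page does not describe their contents.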

property device¶: Set device for the model.

set_xformers(module: Optional[torch.nn.Module] = None) → torch.nn.Module[source]¶

Set xformers for the model.

Returns: The model with xformers.
Return type: nn.Module

set_tomesd() → torch.nn.Module[source]¶

Set ToMe for the stable diffusion model.

Returns: The model with ToMe.
Return type: nn.Module

init_motion_module(motion_module_cfg)[source]¶

init_dreambooth_lora(dream_booth_lora_cfg)[source]¶

_encode_prompt(prompt, device, num_videos_per_prompt, do_classifier_free_guidance, negative_prompt)[source]¶: Encodes the prompt into text encoder hidden states.

decode_latents(latents)[source]¶: Decode the latents.

prepare_extra_step_kwargs(generator, eta)[source]¶: Prepare extra kwargs for the scheduler step, since not all schedulers have the same signature. eta (η) is only used with the DDIMScheduler; it will be ignored for other schedulers.
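
One common way to handle schedulers with differing `step` signatures is to inspect the signature at runtime and pass only the keyword arguments it accepts. The sketch below illustrates that idea with the standard library. The helper name `build_extra_step_kwargs` and the two toy step functions are hypothetical stand-ins, and whether the method above is implemented exactly this way is an assumption.

```python
import inspect


def build_extra_step_kwargs(step_fn, generator=None, eta=0.0):
    """Collect only the optional kwargs that ``step_fn`` accepts.

    Hypothetical helper for illustration; not part of the documented API.
    """
    accepted = inspect.signature(step_fn).parameters
    extra_step_kwargs = {}
    if 'eta' in accepted:
        extra_step_kwargs['eta'] = eta
    if 'generator' in accepted:
        extra_step_kwargs['generator'] = generator
    return extra_step_kwargs


# Toy step functions standing in for real scheduler ``step`` methods.
def ddim_like_step(model_output, timestep, sample, eta=0.0, generator=None):
    return sample


def plain_step(model_output, timestep, sample):
    return sample


ddim_kwargs = build_extra_step_kwargs(ddim_like_step, eta=0.5)
plain_kwargs = build_extra_step_kwargs(plain_step, eta=0.5)
print(ddim_kwargs)   # eta and generator are forwarded
print(plain_kwargs)  # nothing is forwarded
```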

check_inputs(prompt, height, width)[source]¶

Check inputs.

Raises an error if the inputs are invalid.
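
The exact rules enforced by this method are not listed here. As a hedged sketch, validation in Stable Diffusion style pipelines typically checks the prompt type and requires spatial sizes divisible by the VAE downsampling factor. The function name `check_inputs_sketch` and the divisor of 8 are assumptions for illustration only.

```python
def check_inputs_sketch(prompt, height, width, multiple=8):
    """Validate generation inputs (illustrative, not the library's code)."""
    if not isinstance(prompt, (str, list)):
        raise ValueError(
            f'`prompt` has to be of type str or list, got {type(prompt)}')
    # Assumed rule: sizes must be divisible by the VAE downsampling factor.
    if height % multiple != 0 or width % multiple != 0:
        raise ValueError(
            f'`height` and `width` have to be divisible by {multiple}, '
            f'got {height} and {width}')


check_inputs_sketch('a corgi running on the beach', 512, 512)  # passes
```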

convert_lora(state_dict, LORA_PREFIX_UNET='lora_unet', LORA_PREFIX_TEXT_ENCODER='lora_te', alpha=0.6)[source]¶

Convert lora for unet and text_encoder: TODO: use this function to convert lora

Parameters

state_dict (_type_) – _description_
LORA_PREFIX_UNET (str, optional) – _description_. Defaults to 'lora_unet'.
LORA_PREFIX_TEXT_ENCODER (str, optional) – _description_. Defaults to 'lora_te'.
alpha (float, optional) – _description_. Defaults to 0.6.

Returns

check each output type _type_: unet && text_encoder

Return type

TODO

prepare_latents(batch_size, num_channels_latents, video_length, height, width, dtype, device, generator, latents=None)[source]¶: Prepare latent variables.
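
To make the arguments concrete, the sketch below computes the latent shape that a video diffusion pipeline of this kind would typically sample. It assumes the latents are laid out as (batch, channels, frames, height, width) and that the VAE downsamples each spatial side by a factor of 8. Both points, and the helper name `latent_shape`, are assumptions for illustration.

```python
def latent_shape(batch_size, num_channels_latents, video_length,
                 height, width, vae_scale_factor=8):
    """Return the assumed latent tensor shape for a video of given size."""
    return (batch_size, num_channels_latents, video_length,
            height // vae_scale_factor, width // vae_scale_factor)


shape = latent_shape(batch_size=1, num_channels_latents=4, video_length=16,
                     height=512, width=512)
print(shape)  # (1, 4, 16, 64, 64)
```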

prepare_model()[source]¶

Prepare model for training.

Move model to target dtype and disable gradient for some models.

set_lora()[source]¶: Set LORA for model.

val_step(data: dict) → mmagic.utils.typing.SampleList[source]¶

Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Return the generated results which will be passed to evaluator or visualizer.

Parameters: data (dict or tuple or list) – Data sampled from dataset.
Returns: Generated image or image dict.
Return type: SampleList

test_step(data: dict) → mmagic.utils.typing.SampleList[source]¶

Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Return the generated results which will be passed to evaluator or visualizer.

Parameters: data (dict or tuple or list) – Data sampled from dataset.
Returns: Generated image or image dict.
Return type: SampleList

infer(prompt: Union[str, List[str]], video_length: Optional[int] = 16, height: Optional[int] = None, width: Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: Optional[Union[str, List[str]]] = None, num_videos_per_prompt: Optional[int] = 1, eta: float = 0.0, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, latents: Optional[torch.FloatTensor] = None, return_type: Optional[str] = 'tensor', show_progress: bool = True, seed: Optional[int] = 1007)[source]¶

Function invoked when calling the pipeline for generation.

Parameters

prompt (str or List[str]) – The prompt or prompts to guide the video generation.
video_length (int, Optional) – The number of frames of the generated video. Defaults to 16.
height (int, Optional) – The height in pixels of the generated image. If not passed, the height will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
width (int, Optional) – The width in pixels of the generated image. If not passed, the width will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
num_inference_steps (int) – The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference. Defaults to 50.
guidance_scale (float) – Guidance scale as defined in Classifier-Free Diffusion Guidance (https://arxiv.org/abs/2207.12598). Defaults to 7.5.
negative_prompt (str or List[str], optional) – The prompt or prompts not to guide the video generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). Defaults to None.
num_videos_per_prompt (int) – The number of videos to generate per prompt. Defaults to 1.
eta (float) – Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to DDIMScheduler, will be ignored for others. Defaults to 0.0.
generator (torch.Generator, optional) – A torch generator to make generation deterministic. Defaults to None.
latents (torch.FloatTensor, optional) – Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator. Defaults to None.
return_type (str) – The return type of the inference results. Supported types are 'video', 'numpy', 'tensor'. If 'video' is passed, a list of PIL images will be returned. If 'numpy' is passed, a numpy array with shape [N, C, H, W] will be returned, and the value range will be the same as the decoder's output range. If 'tensor' is passed, the decoder's output will be returned. Defaults to 'tensor'.

Returns: A dict containing the generated video.
Return type: dict
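
For readers unfamiliar with `guidance_scale`, classifier-free guidance combines an unconditional and a text-conditioned noise prediction as `uncond + guidance_scale * (text - uncond)`. The sketch below applies that formula to plain Python lists. The function name `apply_guidance` and the toy numbers are hypothetical, and the assumption is that `infer` follows this standard formulation.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Blend two noise predictions with classifier-free guidance."""
    return [u + guidance_scale * (t - u)
            for u, t in zip(noise_uncond, noise_text)]


# Toy per-element predictions standing in for UNet outputs.
noise_uncond = [0.0, 1.0, -1.0]
noise_text = [1.0, 1.0, 0.0]

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
unguided = apply_guidance(noise_uncond, noise_text, guidance_scale=1.0)
print(guided)    # [7.5, 1.0, 6.5]
print(unguided)  # [1.0, 1.0, 0.0], identical to the text prediction
```

With a scale of 1 the result equals the text-conditioned prediction, which is why the negative prompt has no effect at or below that value.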

abstract forward(inputs: torch.Tensor, data_samples: Optional[list] = None, mode: str = 'tensor') → Union[Dict[str, torch.Tensor], list][source]¶: forward is not implemented now.

mmagic.models.editors.animatediff.save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=6, fps=8)[source]¶

class mmagic.models.editors.animatediff.UNet3DConditionMotionModel(sample_size: Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D'), mid_block_type: str = 'UNetMidBlock3DCrossAttn', up_block_types: Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D'), only_cross_attention: Union[bool, Tuple[bool]] = False, block_out_channels: Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: int = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 768, attention_head_dim: Union[int, Tuple[int]] = 8, dual_cross_attention: bool = False, use_linear_projection: bool = False, class_embed_type: Optional[str] = None, num_class_embeds: Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', use_inflated_groupnorm=False, use_motion_module=False, motion_module_resolutions=(1, 2, 4, 8), motion_module_mid_block=False, motion_module_decoder_only=False, motion_module_type=None, motion_module_kwargs={}, unet_use_cross_frame_attention=None, unet_use_temporal_attention=None, subfolder=None, from_pretrained=None, unet_addtion_kwargs=None)[source]¶

Bases: diffusers.models.modeling_utils.ModelMixin, diffusers.configuration_utils.ConfigMixin

_supports_gradient_checkpointing = True¶: Implementation of UNet3DConditionMotionModel

init_weights(subfolder=None, from_pretrained=None)[source]¶

Init weights for models.

We just use the initialization method proposed in the original paper.

Parameters: pretrained (str, optional) – Path for pretrained weights. If given None, pretrained weights will not be loaded. Defaults to None.

set_attention_slice(slice_size)[source]¶

Enable sliced attention computation.

When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.

Parameters: slice_size (str or int or list(int), optional, defaults to "auto") – When "auto", halves the input to the attention heads, so attention will be computed in two steps. If "max", the maximum amount of memory will be saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.
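
The three modes described above can be summarised with a small helper that maps a `slice_size` setting to a concrete slice size and slice count. This is a sketch of the rule as stated in the text for a single scalar `attention_head_dim`. The helper name `resolve_slices` is hypothetical and list inputs are not handled.

```python
def resolve_slices(attention_head_dim, slice_size='auto'):
    """Return (slice_size, num_slices) for a scalar head dimension."""
    if slice_size == 'auto':
        # Halve the input to the attention heads: two steps.
        size = max(attention_head_dim // 2, 1)
    elif slice_size == 'max':
        # One slice at a time: most memory saved, most steps.
        size = 1
    else:
        size = int(slice_size)
        if size <= 0 or attention_head_dim % size != 0:
            raise ValueError(
                f'attention_head_dim ({attention_head_dim}) must be a '
                f'multiple of slice_size ({slice_size})')
    return size, attention_head_dim // size


print(resolve_slices(8, 'auto'))  # (4, 2)
print(resolve_slices(8, 'max'))   # (1, 8)
print(resolve_slices(8, 2))       # (2, 4)
```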

_set_gradient_checkpointing(module, value=False)[source]¶: set gradient checkpoint.

forward(sample: torch.FloatTensor, timestep: Union[torch.Tensor, float, int], encoder_hidden_states: torch.Tensor, class_labels: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, return_dict: bool = True) → Union[UNet3DConditionOutput, Tuple][source]¶

Parameters

sample (torch.FloatTensor) – (batch, channel, height, width) noisy inputs tensor.
timestep (torch.FloatTensor or float or int) – (batch) timesteps.
encoder_hidden_states (torch.FloatTensor) – (batch, sequence_length, feature_dim) encoder hidden states.
return_dict (bool, optional, defaults to True) – Whether or not to return a [UNet3DConditionOutput] instead of a plain tuple.

Returns

[UNet3DConditionOutput] if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

Return type

[UNet3DConditionOutput] or tuple
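
The `return_dict` switch follows a familiar pattern: return a small output object by default, or a plain tuple whose first element is the sample. The sketch below shows the pattern in isolation. `ToyOutput` and `toy_forward` are hypothetical stand-ins and do not reproduce the real `UNet3DConditionOutput` or the model's computation.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass
class ToyOutput:
    """Stand-in for an output container with a ``sample`` field."""
    sample: Any


def toy_forward(sample: Any,
                return_dict: bool = True) -> Union[ToyOutput, Tuple[Any]]:
    denoised = sample  # placeholder for the real network computation
    if not return_dict:
        # Tuple form: the first element is the sample.
        return (denoised,)
    return ToyOutput(sample=denoised)


as_object = toy_forward([0.1, 0.2])
as_tuple = toy_forward([0.1, 0.2], return_dict=False)
print(as_object.sample == as_tuple[0])  # True
```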

classmethod from_pretrained_2d(pretrained_model_path, subfolder=None, unet_additional_kwargs=None)[source]¶: a class method for initialization.
