Package Contents



class mmagic.models.editors.ddpm.DenoisingUnet(image_size, in_channels=3, out_channels=None, base_channels=128, resblocks_per_downsample=3, num_timesteps=1000, use_rescale_timesteps=False, dropout=0, embedding_channels=-1, num_classes=0, use_fp16=False, channels_cfg=None, output_cfg=dict(mean='eps', var='learned_range'), norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, use_scale_shift_norm=False, resblock_updown=False, num_heads=4, time_embedding_mode='sin', time_embedding_cfg=None, resblock_cfg=dict(type='DenoisingResBlock'), attention_cfg=dict(type='MultiHeadAttention'), encoder_channels=None, downsample_conv=True, upsample_conv=True, downsample_cfg=dict(type='DenoisingDownsample'), upsample_cfg=dict(type='DenoisingUpsample'), attention_res=[16, 8], pretrained=None, unet_type='', down_block_types: Tuple[str] = (), up_block_types: Tuple[str] = (), cross_attention_dim=768, layers_per_block: int = 2)[source]

Bases: mmengine.model.BaseModule

Denoising Unet. This network receives a diffused image x_t and the current timestep t, and returns an output_dict corresponding to the passed output_cfg.

output_cfg defines the number of channels and the meaning of the output. output_cfg mainly contains the keys mean and var, which denote how the network outputs the mean and variance required for the denoising process.

For mean:

  1. dict(mean='EPS'): The model predicts the noise added in the diffusion process, and the output_dict will contain a key named eps_t_pred.

  2. dict(mean='START_X'): The model directly predicts the mean of the original image x_0, and the output_dict will contain a key named x_0_pred.

  3. dict(mean='X_TM1_PRED'): The model predicts the mean of the diffused image at timestep t-1, and the output_dict will contain a key named x_tm1_pred.

For var:

  1. dict(var='FIXED_SMALL') or dict(var='FIXED_LARGE'): The variance in the denoising process is treated as a fixed value. Therefore only ‘mean’ will be predicted, and the number of output channels will equal that of the input image (e.g., three channels for an RGB image).

  2. dict(var='LEARNED'): The model predicts log_variance in the denoising process, and the output_dict will contain a key named log_var.

  3. dict(var='LEARNED_RANGE'): The model predicts an interpolation factor, and the log_variance is calculated as factor * upper_bound + (1-factor) * lower_bound. The output_dict will contain a key named factor.

If var is neither FIXED_SMALL nor FIXED_LARGE, the number of output channels is twice the number of input channels: the first half contains the predicted mean values and the second half contains the predicted variance values. Otherwise, the number of output channels equals the number of input channels and contains only the predicted mean values.
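The channel rule and the LEARNED_RANGE interpolation described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: the helper names expected_out_channels and learned_range_log_var are hypothetical, and the case-insensitive matching of the mode string is an assumption made because the text writes the modes in upper case while the default output_cfg uses lower case.

```python
# Hypothetical helpers illustrating the documented rules; not part of mmagic.
FIXED_VAR_MODES = {"FIXED_SMALL", "FIXED_LARGE"}


def expected_out_channels(in_channels, var_mode):
    """Number of output channels implied by the variance mode."""
    # Assumption: mode strings are compared case-insensitively.
    if var_mode.upper() in FIXED_VAR_MODES:
        return in_channels      # only the mean is predicted
    return 2 * in_channels      # first half: mean, second half: variance


def learned_range_log_var(factor, lower_bound, upper_bound):
    """log_variance = factor * upper_bound + (1 - factor) * lower_bound."""
    return factor * upper_bound + (1.0 - factor) * lower_bound


rgb = 3
print(expected_out_channels(rgb, "FIXED_SMALL"))    # 3
print(expected_out_channels(rgb, "learned_range"))  # 6
print(learned_range_log_var(0.25, lower_bound=-4.0, upper_bound=-2.0))  # -3.5
```

The bounds passed to learned_range_log_var here are made-up numbers; in practice they would come from the diffusion schedule.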

  • image_size (int | list[int]) – The size of image to denoise.

  • in_channels (int, optional) – The number of channels of the input image. Defaults to 3.

  • out_channels (int, optional) – The number of channels of the output prediction. Defaults to None, in which case it is assigned automatically according to var_mode.

  • base_channels (int, optional) – The basic channel number of the generator. The other layers contain channels based on this number. Defaults to 128.

  • resblocks_per_downsample (int, optional) – Number of ResBlock used between two downsample operations. The number of ResBlock between upsample operations will be the same value to keep symmetry. Defaults to 3.

  • num_timesteps (int, optional) – The total number of timesteps of the denoising process and the diffusion process. Defaults to 1000.

  • use_rescale_timesteps (bool, optional) – Whether to rescale the input timesteps to the range [0, 1000]. Defaults to False.

  • dropout (float, optional) – The dropout probability of each ResBlock. Pass 0 to disable dropout. Defaults to 0.

  • embedding_channels (int, optional) – The output channels of the time embedding layer and the label embedding layer. If not passed (or passed as -1), the output channels of the embedding layers will be set to four times base_channels. Defaults to -1.

  • num_classes (int, optional) – The number of conditional classes. If set to 0, this model will be degraded to an unconditional model. Defaults to 0.

  • channels_cfg (list | dict[list], optional) – Config for the input channels of the intermediate blocks. If a list is passed, each element of the list indicates the scale factor for the input channels of the current block with regard to base_channels. For block i, the input and output channels should be channels_cfg[i] * base_channels and channels_cfg[i+1] * base_channels. If a dict is provided, each key should be the output scale and the corresponding value should be a list that defines the channels. Default: Please refer to _default_channels_cfg.

  • output_cfg (dict, optional) – Config for output variables. Defaults to dict(mean='eps', var='learned_range').

  • norm_cfg (dict, optional) – The config for normalization layers. Defaults to dict(type='GN', num_groups=32).

  • act_cfg (dict, optional) – The config for activation layers. Defaults to dict(type='SiLU', inplace=False).

  • shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv in ResBlocks. The value of this argument will overwrite the default value of resblock_cfg. Defaults to 1.

  • use_scale_shift_norm (bool, optional) – Whether to perform scale and shift after the normalization operation. Defaults to False.

  • num_heads (int, optional) – The number of attention heads. Defaults to 4.

  • time_embedding_mode (str, optional) – Embedding method of time_embedding. Defaults to ‘sin’.

  • time_embedding_cfg (dict, optional) – Config for time_embedding. Defaults to None.

  • resblock_cfg (dict, optional) – Config for ResBlock. Defaults to dict(type='DenoisingResBlock').

  • attention_cfg (dict, optional) – Config for attention operation. Defaults to dict(type='MultiHeadAttention').

  • upsample_conv (bool, optional) – Whether to use a conv operation in the upsample block. Defaults to True.

  • downsample_conv (bool, optional) – Whether to use a conv operation in the downsample block. Defaults to True.

  • upsample_cfg (dict, optional) – Config for upsample blocks. Defaults to dict(type='DenoisingUpsample').

  • downsample_cfg (dict, optional) – Config for downsample blocks. Defaults to dict(type='DenoisingDownsample').

  • attention_res (int | list[int], optional) – Resolution of feature maps to apply attention operation. Defaults to [16, 8].

  • pretrained (str | dict, optional) – Path for the pretrained model or dict containing information for pretrained models whose necessary key is ‘ckpt_path’. Besides, you can also provide ‘prefix’ to load the generator part from the whole state dict. Defaults to None.
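As a rough illustration of how a list-style channels_cfg and the embedding_channels fallback in the parameter list above translate into concrete channel counts, consider the following sketch. The helper names block_channels and resolve_embedding_channels are hypothetical, and the scale factors in example_cfg are invented for the example; they are not the values of _default_channels_cfg.

```python
# Hypothetical helpers illustrating the documented rules; not part of mmagic.
def block_channels(channels_cfg, base_channels=128):
    """Return (in_channels, out_channels) per block for a list-style channels_cfg.

    Block i maps channels_cfg[i] * base_channels to
    channels_cfg[i + 1] * base_channels.
    """
    return [
        (channels_cfg[i] * base_channels, channels_cfg[i + 1] * base_channels)
        for i in range(len(channels_cfg) - 1)
    ]


def resolve_embedding_channels(embedding_channels, base_channels=128):
    """Apply the documented fallback: -1 means four times base_channels."""
    if embedding_channels == -1:
        return base_channels * 4
    return embedding_channels


# Illustrative scale factors only (assumed, not the library defaults).
example_cfg = [1, 1, 2, 2, 4]
for i, (c_in, c_out) in enumerate(block_channels(example_cfg, base_channels=128)):
    print(f"block {i}: {c_in} -> {c_out}")

print(resolve_embedding_channels(-1, base_channels=128))  # 512
```

A list with n scale factors describes n - 1 blocks under this reading, since each block consumes one entry as its input scale and the next entry as its output scale.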

forward(x_t, t, encoder_hidden_states=None, label=None, return_noise=False)[source]

Forward function.

  • x_t (torch.Tensor) – Diffused image at timestep t to denoise.

  • t (torch.Tensor) – Current timestep.

  • label – You can directly give a batch of labels through a torch.Tensor or offer a callable function to sample a batch of label data. Otherwise, None indicates that the default label sampler is used.

  • return_noise (bool, optional) – If True, the input x_t and t will be returned in a dict together with the output specified by output_cfg. Defaults to False.


If not return_noise

Return type

torch.Tensor | dict


Init weights for models.

We just use the initialization method proposed in the original paper.


pretrained (str, optional) – Path for pretrained weights. If given None, pretrained weights will not be loaded. Defaults to None.


Convert the precision of the model to float16.


Convert the precision of the model to float32.
