`mmagic.models.editors.ddpm.denoising_unet`¶

Module Contents¶

Classes¶

`EmbedSequential`	A sequential module that passes timestep embeddings to the children that
`GroupNorm32`	Applies Group Normalization over a mini-batch of inputs as described in
`SiLU`	Applies the Sigmoid Linear Unit (SiLU) function, element-wise.
`MultiHeadAttention`	An attention block that allows spatial positions to attend to each other.
`MultiHeadAttentionBlock`	An attention block that allows spatial positions to attend to each
`QKVAttentionLegacy`	A module which performs QKV attention.
`QKVAttention`	A module which performs QKV attention and splits in a different
`TimeEmbedding`	Time embedding layer using a two-level embedding. First embedding
`DenoisingResBlock`	Resblock for the denoising network. If in_channels does not equal
`NormWithEmbedding`	Normalization with embedding layer. If use_scale_shift == True,
`DenoisingDownsample`	Downsampling operation used in the denoising network. Supports average
`DenoisingUpsample`	Upsampling operation used in the denoising network. Allows users to
`DenoisingUnet`	Denoising Unet. This network receives a diffused image `x_t` and

Functions¶

`convert_module_to_f16`(layer)	Convert primitive modules to float16.
`convert_module_to_f32`(layer)	Convert primitive modules to float32, undoing
`build_down_block_resattn`(resblocks_per_downsample, ...)	build unet down path blocks with resnet and attention.
`build_mid_blocks_resattn`(resblock_cfg, attention_cfg, ...)	build unet mid blocks with resnet and attention.
`build_up_blocks_resattn`(resblocks_per_downsample, ...)	build up path blocks with resnet and attention.

Attributes¶

logger

mmagic.models.editors.ddpm.denoising_unet.logger[source]¶

class mmagic.models.editors.ddpm.denoising_unet.EmbedSequential(*args: torch.nn.modules.module.Module) EmbedSequential(arg: OrderedDict[str, Module])[source]¶

Bases: torch.nn.Sequential

A sequential module that passes timestep embeddings to the children that support it as an extra input.

Modified from https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py#L35

forward(x, y, encoder_out=None)[source]¶
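
The dispatch idea behind this container can be sketched without PyTorch. The following is a minimal, dependency-free illustration rather than mmagic's implementation: the names `PlainLayer`, `TakesEmbedding`, and `embed_sequential_forward` are hypothetical, and the real module decides which children receive the embedding based on their own types.

```python
class PlainLayer:
    """Hypothetical child that only consumes the feature input."""

    def __call__(self, x):
        return x * 2


class TakesEmbedding:
    """Hypothetical child that also consumes the timestep embedding."""

    def __call__(self, x, y):
        return x + y


def embed_sequential_forward(layers, x, y):
    """Run layers in order, passing y only to children that support it."""
    for layer in layers:
        if isinstance(layer, TakesEmbedding):
            x = layer(x, y)
        else:
            x = layer(x)
    return x


layers = [PlainLayer(), TakesEmbedding(), PlainLayer()]
result = embed_sequential_forward(layers, x=1.0, y=0.5)
print(result)
```

The point of the pattern is that one embedding can be threaded through a heterogeneous stack without every child having to accept it.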

class mmagic.models.editors.ddpm.denoising_unet.GroupNorm32(num_channels, num_groups=32, **kwargs)[source]¶

Bases: torch.nn.GroupNorm

Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization

\[y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta\]

The input channels are separated into num_groups groups, each containing num_channels / num_groups channels. num_channels must be divisible by num_groups. The mean and standard deviation are calculated separately over each group. \(\gamma\) and \(\beta\) are learnable per-channel affine transform parameter vectors of size num_channels if affine is True. The standard deviation is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).

This layer uses statistics computed from input data in both training and evaluation modes.

Parameters

num_groups (int) – number of groups to separate the channels into
num_channels (int) – number of channels expected in input
eps – a value added to the denominator for numerical stability. Default: 1e-5
affine – a boolean value that when set to True, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases). Default: True.

Shape:

Input: \((N, C, *)\) where \(C=\text{num\_channels}\)
Output: \((N, C, *)\) (same shape as input)

Examples:

>>> import torch
>>> from torch import nn
>>> input = torch.randn(20, 6, 10, 10)
>>> # Separate 6 channels into 3 groups
>>> m = nn.GroupNorm(3, 6)
>>> # Separate 6 channels into 6 groups (equivalent to InstanceNorm)
>>> m = nn.GroupNorm(6, 6)
>>> # Put all 6 channels into a single group (equivalent to LayerNorm)
>>> m = nn.GroupNorm(1, 6)
>>> # Activating the module
>>> output = m(input)
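
To make the per-group statistics concrete, the formula above can be written out by hand. This is a dependency-free sketch for a single sample, assuming the affine parameters sit at their initial values (\(\gamma = 1\), \(\beta = 0\)); the function name `group_norm` and the list-of-lists layout are illustrative choices, not part of the library.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample given as a list of channels (flat float lists)."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        # Biased variance estimator, as stated in the description above.
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in channel] for channel in group])
    return out


# Four channels with two values each, split into two groups of two channels.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
```

Each group is centered independently, so the very different magnitudes of the two groups do not influence each other.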

forward(x)[source]¶

mmagic.models.editors.ddpm.denoising_unet.convert_module_to_f16(layer)[source]¶: Convert primitive modules to float16.

mmagic.models.editors.ddpm.denoising_unet.convert_module_to_f32(layer)[source]¶: Convert primitive modules to float32, undoing convert_module_to_f16().

class mmagic.models.editors.ddpm.denoising_unet.SiLU(inplace=False)[source]¶

Bases: mmengine.model.BaseModule

Applies the Sigmoid Linear Unit (SiLU) function, element-wise. The SiLU function is also known as the swish function.

Parameters: inplace (bool, optional) – Use inplace operation or not. Defaults to False.

forward(x)[source]¶

Forward function for SiLU.

Parameters: x (torch.Tensor) – Input tensor.

Returns: Tensor after activation.
Return type: torch.Tensor
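
For reference, SiLU is defined as x * sigmoid(x). The scalar sketch below uses only the standard library and is meant to show the formula, not the module's tensor implementation; the function name `silu` is an illustrative choice.

```python
import math


def silu(x: float) -> float:
    """Return x * sigmoid(x), computed in a numerically stable way."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    # For negative x, use sigmoid(x) = e^x / (1 + e^x) to avoid overflow.
    e = math.exp(x)
    return x * e / (1.0 + e)


values = [silu(v) for v in (-2.0, 0.0, 1.0)]
```

The function is close to the identity for large positive inputs and close to zero for large negative inputs.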

class mmagic.models.editors.ddpm.denoising_unet.MultiHeadAttention(in_channels, num_heads=1, norm_cfg=dict(type='GN', num_groups=32))[source]¶

Bases: mmengine.model.BaseModule

An attention block that allows spatial positions to attend to each other.

Originally ported from https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66, but adapted to the N-d case.

Parameters

in_channels (int) – Channels of the input feature map.
num_heads (int, optional) – Number of heads in the attention.
norm_cfg (dict, optional) – Config for normalization layer. Default to dict(type='GN', num_groups=32)

static QKVAttention(qkv)[source]¶

forward(x)[source]¶

Forward function for multi-head attention.

Parameters: x (torch.Tensor) – Input feature map.

Returns: Feature map after attention.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the weights.

class mmagic.models.editors.ddpm.denoising_unet.MultiHeadAttentionBlock(in_channels, num_heads=1, num_head_channels=-1, use_new_attention_order=False, norm_cfg=dict(type='GN32', num_groups=32), encoder_channels=None)[source]¶

Bases: mmengine.model.BaseModule

An attention block that allows spatial positions to attend to each other.

Originally ported from https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66, but adapted to the N-d case.

forward(x, encoder_out=None)[source]¶

class mmagic.models.editors.ddpm.denoising_unet.QKVAttentionLegacy(n_heads)[source]¶

Bases: mmengine.model.BaseModule

A module which performs QKV attention.

Matches legacy QKVAttention + input/output heads shaping

forward(qkv, encoder_kv=None)[source]¶

Apply QKV attention.

Parameters: qkv – an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.
Returns: an [N x (H * C) x T] tensor after attention.

class mmagic.models.editors.ddpm.denoising_unet.QKVAttention(n_heads)[source]¶

Bases: mmengine.model.BaseModule

A module which performs QKV attention and splits in a different order.

forward(qkv)[source]¶

Apply QKV attention.

Parameters: qkv – an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs.
Returns: an [N x (H * C) x T] tensor after attention.
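
The two attention modules differ in how the channel axis of `qkv` is laid out: `[N x (H * 3 * C) x T]` for the legacy module versus `[N x (3 * H * C) x T]` here. The sketch below, inferred only from those documented shapes, maps a (head, q/k/v part, channel) triple to a flat channel index under each layout. The helper names are hypothetical and no attention is computed.

```python
Q, K, V = 0, 1, 2  # index of the q, k, or v part


def legacy_channel_index(head, part, c, head_dim):
    """Layout [H, 3, C]: each head stores its own q, k, v contiguously."""
    return (head * 3 + part) * head_dim + c


def new_channel_index(head, part, c, n_heads, head_dim):
    """Layout [3, H, C]: all q channels first, then all k, then all v."""
    return (part * n_heads + head) * head_dim + c


n_heads, head_dim = 2, 4
# Where does the first key channel of head 1 live in each layout?
legacy_pos = legacy_channel_index(1, K, 0, head_dim)
new_pos = new_channel_index(1, K, 0, n_heads, head_dim)
print(legacy_pos, new_pos)
```

Because the layouts differ, weights trained with one ordering cannot be reused with the other without permuting the channel axis.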

class mmagic.models.editors.ddpm.denoising_unet.TimeEmbedding(in_channels, embedding_channels, embedding_mode='sin', embedding_cfg=None, act_cfg=dict(type='SiLU', inplace=False))[source]¶

Bases: mmengine.model.BaseModule

Time embedding layer using a two-level embedding: timesteps are first embedded by an embedding function, and the result is then fed to neural network layers.

Parameters

in_channels (int) – The channel number of the input feature map.
embedding_channels (int) – The channel number of the output embedding.
embedding_mode (str, optional) – Embedding mode for the time embedding. Defaults to 'sin'.
embedding_cfg (dict, optional) – Config for time embedding. Defaults to None.
act_cfg (dict, optional) – Config for activation layer. Defaults to dict(type='SiLU', inplace=False).

static sinusodial_embedding(timesteps, dim, max_period=10000)[source]¶

Create sinusoidal timestep embeddings.

Parameters

timesteps (torch.Tensor) – Timesteps to embed. A 1-D tensor of shape [bz, ], one per batch element.
dim (int) – The dimension of the embedding.
max_period (int, optional) – Controls the minimum frequency of the embeddings. Defaults to 10000.

Returns

Embedding results of shape [bz, dim].

Return type

torch.Tensor
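
A sinusoidal timestep embedding of this kind is commonly computed as below. This is a standard-library sketch of the usual formulation, not a copy of the method: the cosine-then-sine ordering and the zero padding for odd `dim` are assumptions, and the function name is illustrative.

```python
import math


def sinusoidal_embedding(timesteps, dim, max_period=10000):
    """Return one embedding row of length dim per timestep."""
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    rows = []
    for t in timesteps:
        args = [t * f for f in freqs]
        row = [math.cos(a) for a in args] + [math.sin(a) for a in args]
        if dim % 2:
            row.append(0.0)  # pad odd dimensions with a zero
        rows.append(row)
    return rows


emb = sinusoidal_embedding([0, 10], dim=6)
```

A larger `max_period` lowers the smallest frequency, which lets the embedding distinguish timesteps that are far apart.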

forward(t)[source]¶

Forward function for time embedding layer.

Parameters: t (torch.Tensor) – Input timesteps.

Returns: Timesteps embedding.
Return type: torch.Tensor

class mmagic.models.editors.ddpm.denoising_unet.DenoisingResBlock(in_channels, embedding_channels, use_scale_shift_norm, dropout, out_channels=None, norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, up=False, down=False)[source]¶

Bases: mmengine.model.BaseModule

Resblock for the denoising network. If in_channels does not equal out_channels, a learnable shortcut with conv layers will be added.

Parameters

in_channels (int) – Number of channels of the input feature map.
embedding_channels (int) – Number of channels of the input embedding.
use_scale_shift_norm (bool) – Whether to use scale-shift-norm in the NormWithEmbedding layer.
dropout (float) – Probability of the dropout layers.
out_channels (int, optional) – Number of output channels of the ResBlock. If not defined, the number of output channels will equal in_channels. Defaults to None.
norm_cfg (dict, optional) – The config for the normalization layers. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – The config for the activation layers. Defaults to dict(type='SiLU', inplace=False).
shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv. Defaults to 1.

forward_shortcut(x)[source]¶

forward(x, y)[source]¶

Forward function.

Parameters

x (torch.Tensor) – Input feature map tensor.
y (torch.Tensor) – Shared time embedding or shared label embedding.

Returns

Output feature map tensor.

Return type

torch.Tensor

init_weights()[source]¶: Initialize the weights.

class mmagic.models.editors.ddpm.denoising_unet.NormWithEmbedding(in_channels, embedding_channels, norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), use_scale_shift=True)[source]¶

Bases: mmengine.model.BaseModule

Normalization with embedding layer. If use_scale_shift == True, the embedding results will be chunked and used to re-shift and re-scale the normalization results. Otherwise, the embedding results will be added directly to the input of the normalization layer.

Parameters

in_channels (int) – Number of channels of the input feature map.
embedding_channels (int) – Number of channels of the input embedding.
norm_cfg (dict, optional) – Config for the normalization operation. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – Config for the activation layer. Defaults to dict(type='SiLU', inplace=False).
use_scale_shift (bool) – If True, the output of the embedding layer will be split into 'scale' and 'shift', and the output of the normalization layer will be mapped to out * (1 + scale) + shift. Otherwise, the output of the embedding layer will be added to the input before the normalization operation. Defaults to True.
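
The two modes of use_scale_shift can be illustrated with plain lists. In this sketch `embedding` stands for the output of the embedding layer, and `center` is a deliberately simplified stand-in for the normalization layer (it only subtracts the mean); both names and the helper `norm_with_embedding` are hypothetical.

```python
def center(x):
    """Simplified stand-in for a normalization layer: subtract the mean."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def norm_with_embedding(x, embedding, use_scale_shift=True):
    """Combine per-channel values x with an embedding in one of two modes."""
    if use_scale_shift:
        # Chunk the embedding into two halves: scale and shift.
        half = len(embedding) // 2
        scale, shift = embedding[:half], embedding[half:]
        out = center(x)
        return [o * (1 + s) + b for o, s, b in zip(out, scale, shift)]
    # Otherwise add the embedding to the input before normalizing.
    return center([v + e for v, e in zip(x, embedding)])


scaled = norm_with_embedding([1.0, 3.0], [0.5, -0.5, 10.0, 20.0])
added = norm_with_embedding([1.0, 3.0], [1.0, -1.0], use_scale_shift=False)
```

Note that the scale-shift mode needs an embedding twice as wide as the feature map, since it is split into two halves.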

forward(x, y)[source]¶

Forward function.

Parameters

x (torch.Tensor) – Input feature map tensor.
y (torch.Tensor) – Shared time embedding or shared label embedding.

Returns

Output feature map tensor.

Return type

torch.Tensor

class mmagic.models.editors.ddpm.denoising_unet.DenoisingDownsample(in_channels, with_conv=True)[source]¶

Bases: mmengine.model.BaseModule

Downsampling operation used in the denoising network. Supports average pooling and convolution as the downsampling operation.

Parameters

in_channels (int) – Number of channels of the input feature map to be downsampled.
with_conv (bool, optional) – Whether to use a convolution operation for downsampling. Defaults to True.

forward(x)[source]¶

Forward function for downsampling operation.

Parameters: x (torch.Tensor) – Feature map to downsample.

Returns: Feature map after downsampling.
Return type: torch.Tensor
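
As a rough illustration of the average-pooling path (with_conv=False), the sketch below halves a single-channel feature map. The 2x2 window with stride 2 is an assumption made for this example, and `avg_pool_2x2` is a hypothetical helper operating on nested lists rather than tensors.

```python
def avg_pool_2x2(feature):
    """Average non-overlapping 2x2 windows of a 2-D list (even H and W)."""
    h, w = len(feature), len(feature[0])
    return [
        [
            (feature[i][j] + feature[i][j + 1]
             + feature[i + 1][j] + feature[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


feature = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
pooled = avg_pool_2x2(feature)
```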

class mmagic.models.editors.ddpm.denoising_unet.DenoisingUpsample(in_channels, with_conv=True)[source]¶

Bases: mmengine.model.BaseModule

Upsampling operation used in the denoising network. Allows users to apply an additional convolution layer after the nearest interpolation operation.

Parameters

in_channels (int) – Number of channels of the input feature map to be upsampled.
with_conv (bool, optional) – Whether to apply an additional convolution layer after upsampling. Defaults to True.

forward(x)[source]¶

Forward function for upsampling operation.

Parameters: x (torch.Tensor) – Feature map to upsample.

Returns: Feature map after upsampling.
Return type: torch.Tensor
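
Nearest interpolation simply repeats each value. The sketch below doubles a single-channel feature map; the scale factor of 2 is an assumption for this example, and `nearest_upsample_2x` is a hypothetical helper working on nested lists. The optional convolution that may follow is not shown.

```python
def nearest_upsample_2x(feature):
    """Repeat every value twice along both axes of a 2-D list."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # copy so the two rows are independent
    return out


upsampled = nearest_upsample_2x([[1, 2], [3, 4]])
```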

mmagic.models.editors.ddpm.denoising_unet.build_down_block_resattn(resblocks_per_downsample, resblock_cfg, in_channels_, out_channels_, attention_scale, attention_cfg, in_channels_list, level, channel_factor_list, embedding_channels, use_scale_shift_norm, dropout, norm_cfg, resblock_updown, downsample_cfg, scale)[source]¶: build unet down path blocks with resnet and attention.

mmagic.models.editors.ddpm.denoising_unet.build_mid_blocks_resattn(resblock_cfg, attention_cfg, in_channels_)[source]¶: build unet mid blocks with resnet and attention.

mmagic.models.editors.ddpm.denoising_unet.build_up_blocks_resattn(resblocks_per_downsample, resblock_cfg, in_channels_, in_channels_list, base_channels, factor, scale, attention_scale, attention_cfg, channel_factor_list, level, embedding_channels, use_scale_shift_norm, dropout, norm_cfg, resblock_updown, upsample_cfg)[source]¶: build up path blocks with resnet and attention.

class mmagic.models.editors.ddpm.denoising_unet.DenoisingUnet(image_size, in_channels=3, out_channels=None, base_channels=128, resblocks_per_downsample=3, num_timesteps=1000, use_rescale_timesteps=False, dropout=0, embedding_channels=-1, num_classes=0, use_fp16=False, channels_cfg=None, output_cfg=dict(mean='eps', var='learned_range'), norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, use_scale_shift_norm=False, resblock_updown=False, num_heads=4, time_embedding_mode='sin', time_embedding_cfg=None, resblock_cfg=dict(type='DenoisingResBlock'), attention_cfg=dict(type='MultiHeadAttention'), encoder_channels=None, downsample_conv=True, upsample_conv=True, downsample_cfg=dict(type='DenoisingDownsample'), upsample_cfg=dict(type='DenoisingUpsample'), attention_res=[16, 8], pretrained=None, unet_type='', down_block_types: Tuple[str] = (), up_block_types: Tuple[str] = (), cross_attention_dim=768, layers_per_block: int = 2)[source]¶

Bases: mmengine.model.BaseModule

Denoising Unet. This network receives a diffused image x_t and the current timestep t, and returns an output_dict corresponding to the passed output_cfg.

output_cfg defines the number of channels and the meaning of the output. output_cfg mainly contains the keys mean and var, denoting how the network outputs the mean and variance required for the denoising process.

For mean:

1. dict(mean='EPS'): Model will predict the noise added in the diffusion process, and the output_dict will contain a key named eps_t_pred.
2. dict(mean='START_X'): Model will directly predict the mean of the original image x_0, and the output_dict will contain a key named x_0_pred.
3. dict(mean='X_TM1_PRED'): Model will predict the mean of the diffused image at timestep t-1, and the output_dict will contain a key named x_tm1_pred.

For var:

1. dict(var='FIXED_SMALL') or dict(var='FIXED_LARGE'): Variance in the denoising process is regarded as a fixed value. Therefore only 'mean' will be predicted, and the number of output channels will equal that of the input image (e.g., three channels for an RGB image).
2. dict(var='LEARNED'): Model will predict log_variance in the denoising process, and the output_dict will contain a key named log_var.
3. dict(var='LEARNED_RANGE'): Model will predict an interpolation factor, and the log_variance will be calculated as factor * upper_bound + (1-factor) * lower_bound. The output_dict will contain a key named factor.

If var is not FIXED_SMALL or FIXED_LARGE, the number of output channels will be double the number of input channels, where the first half contains the predicted mean values and the second half contains the predicted variance values. Otherwise, the number of output channels equals the number of input channels, containing only the predicted mean values.
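
The channel-count rule and the LEARNED_RANGE interpolation described above can be written down directly. The helpers below are a hypothetical sketch, not library functions; treating the var mode case-insensitively is an assumption made because the default config uses lowercase values while the list above uses uppercase ones.

```python
def num_output_channels(in_channels, var_mode):
    """Return the output channel count implied by the var mode."""
    if var_mode.upper() in ("FIXED_SMALL", "FIXED_LARGE"):
        return in_channels  # mean only
    return 2 * in_channels  # mean in the first half, variance in the second


def learned_range_log_var(factor, upper_bound, lower_bound):
    """Interpolate the log-variance between its two bounds."""
    return factor * upper_bound + (1 - factor) * lower_bound


rgb_learned = num_output_channels(3, "learned_range")
rgb_fixed = num_output_channels(3, "FIXED_SMALL")
log_var = learned_range_log_var(0.25, upper_bound=-1.0, lower_bound=-5.0)
```

A factor of 1 selects the upper bound and a factor of 0 selects the lower bound, so the predicted log-variance always stays between the two.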

Parameters

image_size (int | list[int]) – The size of image to denoise.
in_channels (int, optional) – The number of channels of the input image. Defaults to 3.
out_channels (int, optional) – The number of channels of the output prediction. Defaults to None, in which case it is assigned automatically according to var_mode.
base_channels (int, optional) – The basic channel number of the generator. The other layers contain channels based on this number. Defaults to 128.
resblocks_per_downsample (int, optional) – Number of ResBlock used between two downsample operations. The number of ResBlock between upsample operations will be the same value to keep symmetry. Defaults to 3.
num_timesteps (int, optional) – The total timestep of the denoising process and the diffusion process. Defaults to 1000.
use_rescale_timesteps (bool, optional) – Whether to rescale the input timesteps to the range [0, 1000]. Defaults to False.
dropout (float, optional) – The dropout probability of each ResBlock. Pass 0 to disable dropout. Defaults to 0.
embedding_channels (int, optional) – The output channels of the time embedding layer and label embedding layer. If not passed (or passed as -1), the output channels of the embedding layers will be set to four times base_channels. Defaults to -1.
num_classes (int, optional) – The number of conditional classes. If set to 0, this model will be degraded to an unconditional model. Defaults to 0.
channels_cfg (list | dict[list], optional) – Config for input channels of the intermediate blocks. If a list is passed, each element of the list indicates the scale factor for the input channels of the current block with regard to base_channels. For block i, the input and output channels should be channels_cfg[i] * base_channels and channels_cfg[i+1] * base_channels. If a dict is provided, each key should be the output scale and the corresponding value should be a list defining the channels. Default: Please refer to _default_channels_cfg.
output_cfg (dict, optional) – Config for output variables. Defaults to dict(mean='eps', var='learned_range').
norm_cfg (dict, optional) – The config for normalization layers. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – The config for activation layers. Defaults to dict(type='SiLU', inplace=False).
shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv in ResBlocks. The value of this argument will overwrite the default value of resblock_cfg. Defaults to 1.
use_scale_shift_norm (bool, optional) – Whether to perform scale and shift after the normalization operation. Defaults to False.
num_heads (int, optional) – The number of attention heads. Defaults to 4.
time_embedding_mode (str, optional) – Embedding method of time_embedding. Defaults to 'sin'.
time_embedding_cfg (dict, optional) – Config for time_embedding. Defaults to None.
resblock_cfg (dict, optional) – Config for ResBlock. Defaults to dict(type='DenoisingResBlock').
attention_cfg (dict, optional) – Config for attention operation. Defaults to dict(type='MultiHeadAttention').
upsample_conv (bool, optional) – Whether to use conv in the upsample block. Defaults to True.
downsample_conv (bool, optional) – Whether to use a conv operation in the downsample block. Defaults to True.
upsample_cfg (dict, optional) – Config for upsample blocks. Defaults to dict(type='DenoisingUpsample').
downsample_cfg (dict, optional) – Config for downsample blocks. Defaults to dict(type='DenoisingDownsample').
attention_res (int | list[int], optional) – Resolution of feature maps to apply attention operation. Defaults to [16, 8].
pretrained (str | dict, optional) – Path for the pretrained model or dict containing information for pretrained models whose necessary key is ‘ckpt_path’. Besides, you can also provide ‘prefix’ to load the generator part from the whole state dict. Defaults to None.
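
The channel arithmetic described for channels_cfg (list form) and embedding_channels can be sketched as follows. The helper names and the example factors `[1, 2, 2, 4]` are hypothetical and chosen only for illustration; the actual defaults live in _default_channels_cfg.

```python
def block_channels(channels_cfg, base_channels):
    """Return (in_channels, out_channels) for each block i."""
    return [
        (channels_cfg[i] * base_channels, channels_cfg[i + 1] * base_channels)
        for i in range(len(channels_cfg) - 1)
    ]


def resolve_embedding_channels(embedding_channels, base_channels):
    """Fall back to four times base_channels when -1 is passed."""
    if embedding_channels == -1:
        return base_channels * 4
    return embedding_channels


pairs = block_channels([1, 2, 2, 4], base_channels=128)
emb_channels = resolve_embedding_channels(-1, base_channels=128)
```

With a list of n factors this yields n - 1 blocks, each consuming the output width of the previous one.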

_default_channels_cfg[source]¶

forward(x_t, t, encoder_hidden_states=None, label=None, return_noise=False)[source]¶

Forward function.

Parameters

x_t (torch.Tensor) – Diffused image at timestep t to denoise.
t (torch.Tensor) – Current timestep.
label – You can directly give a batch of labels through a torch.Tensor or offer a callable function to sample a batch of label data. Otherwise, None indicates that the default label sampler is used.
return_noise (bool, optional) – If True, the input x_t and t will be returned in a dict together with the output specified by output_cfg. Defaults to False.

Returns

If not return_noise

Return type

torch.Tensor | dict

init_weights(pretrained=None)[source]¶

Init weights for models.

We just use the initialization method proposed in the original paper.

Parameters: pretrained (str, optional) – Path for pretrained weights. If given None, pretrained weights will not be loaded. Defaults to None.

convert_to_fp16()[source]¶: Convert the precision of the model to float16.

convert_to_fp32()[source]¶: Convert the precision of the model to float32.
