mmagic.models.editors.ddpm.denoising_unet
Module Contents
Classes
EmbedSequential | A sequential module that passes timestep embeddings to the children that support it as an extra input.
GroupNorm32 | Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization.
SiLU | Applies the Sigmoid Linear Unit (SiLU) function, element-wise.
MultiHeadAttention | An attention block that allows spatial positions to attend to each other.
MultiHeadAttentionBlock | An attention block that allows spatial positions to attend to each other.
QKVAttentionLegacy | A module which performs QKV attention.
QKVAttention | A module which performs QKV attention and splits in a different order.
TimeEmbedding | Time embedding layer using a two-level embedding.
DenoisingResBlock | Resblock for the denoising network.
NormWithEmbedding | Normalization with embedding layer.
DenoisingDownsample | Downsampling operation used in the denoising network.
DenoisingUpsample | Upsampling operation used in the denoising network.
DenoisingUnet | Denoising Unet.
Functions
convert_module_to_f16 | Convert primitive modules to float16.
convert_module_to_f32 | Convert primitive modules to float32, undoing convert_module_to_f16().
build_down_block_resattn | Build unet down path blocks with resnet and attention.
build_mid_blocks_resattn | Build unet mid blocks with resnet and attention.
build_up_blocks_resattn | Build up path blocks with resnet and attention.
Attributes
- class mmagic.models.editors.ddpm.denoising_unet.EmbedSequential(*args: torch.nn.modules.module.Module)
class mmagic.models.editors.ddpm.denoising_unet.EmbedSequential(arg: OrderedDict[str, Module])
Bases:
torch.nn.Sequential
A sequential module that passes timestep embeddings to the children that support it as an extra input.
Modified from https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py#L35
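The dispatch idea behind EmbedSequential can be illustrated with a minimal plain-Python sketch. The classes TimeAwareLayer, PlainLayer and EmbedSequentialSketch are hypothetical stand-ins, not part of mmagic, and the isinstance check is only one assumed way of deciding which children "support" the embedding.

```python
class TimeAwareLayer:
    """Hypothetical layer that accepts a timestep embedding as a second input."""

    def __call__(self, x, emb):
        return x + emb


class PlainLayer:
    """Hypothetical layer that only accepts the feature input."""

    def __call__(self, x):
        return 2 * x


class EmbedSequentialSketch:
    """Calls layers in order, passing `emb` only to layers that support it."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x, emb):
        for layer in self.layers:
            if isinstance(layer, TimeAwareLayer):
                # Embedding-aware children receive the extra input.
                x = layer(x, emb)
            else:
                # All other children are called like in a normal Sequential.
                x = layer(x)
        return x


block = EmbedSequentialSketch(PlainLayer(), TimeAwareLayer(), PlainLayer())
result = block(1.0, 0.5)  # ((1.0 * 2) + 0.5) * 2
```

The caller always passes the embedding once to the container, so individual layers do not need to know whether their neighbours use it.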
- class mmagic.models.editors.ddpm.denoising_unet.GroupNorm32(num_channels, num_groups=32, **kwargs)
Bases:
torch.nn.GroupNorm
Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization.
\[y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta\]
The input channels are separated into num_groups groups, each containing num_channels / num_groups channels. num_channels must be divisible by num_groups. The mean and standard deviation are calculated separately over each group. \(\gamma\) and \(\beta\) are learnable per-channel affine transform parameter vectors of size num_channels if affine is True. The standard deviation is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).
This layer uses statistics computed from input data in both training and evaluation modes.
- Parameters
num_groups (int) – number of groups to separate the channels into
num_channels (int) – number of channels expected in input
eps – a value added to the denominator for numerical stability. Default: 1e-5
affine – a boolean value that when set to True, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases). Default: True.
- Shape:
Input: \((N, C, *)\) where \(C=\text{num\_channels}\)
Output: \((N, C, *)\) (same shape as input)
Examples:
>>> import torch
>>> from torch import nn
>>> input = torch.randn(20, 6, 10, 10)
>>> # Separate 6 channels into 3 groups
>>> m = nn.GroupNorm(3, 6)
>>> # Separate 6 channels into 6 groups (equivalent with InstanceNorm)
>>> m = nn.GroupNorm(6, 6)
>>> # Put all 6 channels into a single group (equivalent with LayerNorm)
>>> m = nn.GroupNorm(1, 6)
>>> # Activating the module
>>> output = m(input)
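The per-group statistics in the formula above can be checked by hand with a small plain-Python sketch. It handles one sample only, assumes \(\gamma = 1\) and \(\beta = 0\), and the helper name group_norm is hypothetical; it is an arithmetic illustration, not the PyTorch implementation.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize a single sample given as a list of per-channel value lists."""
    num_channels = len(channels)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        # Biased variance (divide by N), matching torch.var(..., unbiased=False).
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out


# Four channels with two values each, split into two groups of two channels.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
```

Each group is normalized with its own mean and variance, so the very different value ranges of the two groups end up on the same scale.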
- mmagic.models.editors.ddpm.denoising_unet.convert_module_to_f16(layer)
Convert primitive modules to float16.
- mmagic.models.editors.ddpm.denoising_unet.convert_module_to_f32(layer)
Convert primitive modules to float32, undoing convert_module_to_f16().
- class mmagic.models.editors.ddpm.denoising_unet.SiLU(inplace=False)
Bases:
mmengine.model.BaseModule
Applies the Sigmoid Linear Unit (SiLU) function, element-wise. The SiLU function is also known as the swish function.
- Parameters
inplace (bool, optional) – Whether to use an in-place operation. Defaults to False.
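For reference, SiLU is defined as x * sigmoid(x). A minimal scalar sketch in plain Python is shown here; the function name silu is illustrative, and the two branches exist only to avoid overflow in math.exp for large negative inputs.

```python
import math


def silu(x):
    """Scalar SiLU (swish): x * sigmoid(x), written to avoid exp overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


values = [silu(v) for v in (-1.0, 0.0, 1.0)]
```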
- class mmagic.models.editors.ddpm.denoising_unet.MultiHeadAttention(in_channels, num_heads=1, norm_cfg=dict(type='GN', num_groups=32))
Bases:
mmengine.model.BaseModule
An attention block that allows spatial positions to attend to each other.
Originally ported from https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66, but adapted to the N-d case.
- Parameters
in_channels (int) – Channels of the input feature map.
num_heads (int, optional) – Number of heads in the attention. Defaults to 1.
norm_cfg (dict, optional) – Config for normalization layer. Defaults to dict(type='GN', num_groups=32).
- class mmagic.models.editors.ddpm.denoising_unet.MultiHeadAttentionBlock(in_channels, num_heads=1, num_head_channels=-1, use_new_attention_order=False, norm_cfg=dict(type='GN32', num_groups=32), encoder_channels=None)
Bases:
mmengine.model.BaseModule
An attention block that allows spatial positions to attend to each other.
Originally ported from https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66, but adapted to the N-d case.
- class mmagic.models.editors.ddpm.denoising_unet.QKVAttentionLegacy(n_heads)
Bases:
mmengine.model.BaseModule
A module which performs QKV attention.
Matches legacy QKVAttention + input/output heads shaping.
- class mmagic.models.editors.ddpm.denoising_unet.QKVAttention(n_heads)
Bases:
mmengine.model.BaseModule
A module which performs QKV attention and splits in a different order.
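The phrase "splits in a different order" can be made concrete with an index-only sketch. It assumes that QKVAttentionLegacy and QKVAttention follow the channel layouts of the guided-diffusion reference implementation (heads first versus q/k/v first); this has not been checked against the mmagic source, and the helper names legacy_split and new_split are hypothetical.

```python
def legacy_split(width, n_heads):
    """Assumed legacy order: split into heads first, then q/k/v per head."""
    ch = width // (3 * n_heads)
    layout = []
    for h in range(n_heads):
        base = h * 3 * ch
        layout.append({
            "q": list(range(base, base + ch)),
            "k": list(range(base + ch, base + 2 * ch)),
            "v": list(range(base + 2 * ch, base + 3 * ch)),
        })
    return layout


def new_split(width, n_heads):
    """Assumed new order: split into q/k/v first, then heads per third."""
    ch = width // (3 * n_heads)
    third = width // 3
    layout = []
    for h in range(n_heads):
        layout.append({
            name: list(range(i * third + h * ch, i * third + (h + 1) * ch))
            for i, name in enumerate(("q", "k", "v"))
        })
    return layout


# Channel indices of a fused qkv tensor with 12 channels and 2 heads.
legacy = legacy_split(width=12, n_heads=2)
new = new_split(width=12, n_heads=2)
```

Under this assumption both variants compute the same attention; they differ only in which channels of the fused qkv tensor are read as queries, keys and values for each head, which matters when loading checkpoints trained with one ordering.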
- class mmagic.models.editors.ddpm.denoising_unet.TimeEmbedding(in_channels, embedding_channels, embedding_mode='sin', embedding_cfg=None, act_cfg=dict(type='SiLU', inplace=False))
Bases:
mmengine.model.BaseModule
Time embedding layer using a two-level embedding: the timestep is first embedded by an embedding function, and the result is then fed to neural network layers.
- Parameters
in_channels (int) – The channel number of the input feature map.
embedding_channels (int) – The channel number of the output embedding.
embedding_mode (str, optional) – Embedding mode for the time embedding. Defaults to 'sin'.
embedding_cfg (dict, optional) – Config for time embedding. Defaults to None.
act_cfg (dict, optional) – Config for activation layer. Defaults to dict(type='SiLU', inplace=False).
- static sinusodial_embedding(timesteps, dim, max_period=10000)
Create sinusoidal timestep embeddings.
- Parameters
timesteps (torch.Tensor) – Timesteps to embed. A 1-D tensor of shape [bz, ], one per batch element.
dim (int) – The dimension of the embedding.
max_period (int, optional) – Controls the minimum frequency of the embeddings. Defaults to 10000.
- Returns
Embedding results of shape [bz, dim].
- Return type
torch.Tensor
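A plain-Python sketch of a sinusoidal timestep embedding is given here for orientation. It follows the formulation used in the improved-diffusion reference code (cosine half followed by sine half, zero padding for odd dim); whether mmagic uses exactly this ordering is an assumption, and the function name sinusoidal_embedding is illustrative.

```python
import math


def sinusoidal_embedding(timesteps, dim, max_period=10000):
    """Return one embedding row of length `dim` per timestep."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    rows = []
    for t in timesteps:
        args = [t * f for f in freqs]
        row = [math.cos(a) for a in args] + [math.sin(a) for a in args]
        if dim % 2 == 1:
            # Pad with a zero so the row has exactly `dim` entries.
            row.append(0.0)
        rows.append(row)
    return rows


emb = sinusoidal_embedding([0, 10], dim=6)
```

A larger max_period stretches the slowest sinusoid, which is why the parameter is described as controlling the minimum frequency.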
- class mmagic.models.editors.ddpm.denoising_unet.DenoisingResBlock(in_channels, embedding_channels, use_scale_shift_norm, dropout, out_channels=None, norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, up=False, down=False)
Bases:
mmengine.model.BaseModule
Resblock for the denoising network. If in_channels is not equal to out_channels, a learnable shortcut with conv layers will be added.
- Parameters
in_channels (int) – Number of channels of the input feature map.
embedding_channels (int) – Number of channels of the input embedding.
use_scale_shift_norm (bool) – Whether to use scale-shift-norm in the NormWithEmbedding layer.
dropout (float) – Probability of the dropout layers.
out_channels (int, optional) – Number of output channels of the ResBlock. If not defined, the output channels will equal in_channels. Defaults to None.
norm_cfg (dict, optional) – The config for the normalization layers. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – The config for the activation layers. Defaults to dict(type='SiLU', inplace=False).
shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv. Defaults to 1.
- class mmagic.models.editors.ddpm.denoising_unet.NormWithEmbedding(in_channels, embedding_channels, norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), use_scale_shift=True)
Bases:
mmengine.model.BaseModule
Normalization with embedding layer. If use_scale_shift == True, embedding results will be chunked and used to re-shift and re-scale normalization results. Otherwise, embedding results will be added directly to the input of the normalization layer.
- Parameters
in_channels (int) – Number of channels of the input feature map.
embedding_channels (int) – Number of channels of the input embedding.
norm_cfg (dict, optional) – Config for the normalization operation. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – Config for the activation layer. Defaults to dict(type='SiLU', inplace=False).
use_scale_shift (bool) – If True, the output of the embedding layer will be split into 'scale' and 'shift' and used to map the output of the normalization layer to out * (1 + scale) + shift. Otherwise, the output of the embedding layer will be added to the input before the normalization operation. Defaults to True.
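The two modes can be sketched on per-channel scalars in plain Python. The function norm_with_embedding and the identity placeholder for the normalization are hypothetical; the assumption that the first half of the embedding is 'scale' and the second half is 'shift' follows common reference implementations and is not confirmed by the text above.

```python
def norm_with_embedding(x, emb, norm, use_scale_shift=True):
    """Combine per-channel inputs `x` with an embedding vector `emb`.

    `norm` is any callable mapping a list of floats to a list of floats.
    """
    if use_scale_shift:
        # Assumed chunk order: first half is scale, second half is shift.
        half = len(emb) // 2
        scale, shift = emb[:half], emb[half:]
        out = norm(x)
        return [o * (1.0 + s) + b for o, s, b in zip(out, scale, shift)]
    # Otherwise the embedding is added to the input before normalization.
    return norm([xi + e for xi, e in zip(x, emb)])


def identity(values):
    """Placeholder normalization, used only to make the arithmetic visible."""
    return list(values)


scaled = norm_with_embedding([1.0, 2.0], [0.5, 1.0, 10.0, 20.0], identity)
added = norm_with_embedding([1.0, 2.0], [0.5, 1.0], identity,
                            use_scale_shift=False)
```

Note that the embedding needs twice as many entries as there are channels when use_scale_shift is enabled, since it carries both a scale and a shift per channel.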
- class mmagic.models.editors.ddpm.denoising_unet.DenoisingDownsample(in_channels, with_conv=True)
Bases:
mmengine.model.BaseModule
Downsampling operation used in the denoising network. Supports average pooling and convolution for the downsample operation.
- Parameters
in_channels (int) – Number of channels of the input feature map to be downsampled.
with_conv (bool, optional) – Whether to use a convolution operation for downsampling. Defaults to True.
- class mmagic.models.editors.ddpm.denoising_unet.DenoisingUpsample(in_channels, with_conv=True)
Bases:
mmengine.model.BaseModule
Upsampling operation used in the denoising network. Allows users to apply an additional convolution layer after the nearest interpolation operation.
- Parameters
in_channels (int) – Number of channels of the input feature map to be upsampled.
with_conv (bool, optional) – Whether to apply an additional convolution layer after upsampling. Defaults to True.
- mmagic.models.editors.ddpm.denoising_unet.build_down_block_resattn(resblocks_per_downsample, resblock_cfg, in_channels_, out_channels_, attention_scale, attention_cfg, in_channels_list, level, channel_factor_list, embedding_channels, use_scale_shift_norm, dropout, norm_cfg, resblock_updown, downsample_cfg, scale)
Build unet down path blocks with resnet and attention.
- mmagic.models.editors.ddpm.denoising_unet.build_mid_blocks_resattn(resblock_cfg, attention_cfg, in_channels_)
Build unet mid blocks with resnet and attention.
- mmagic.models.editors.ddpm.denoising_unet.build_up_blocks_resattn(resblocks_per_downsample, resblock_cfg, in_channels_, in_channels_list, base_channels, factor, scale, attention_scale, attention_cfg, channel_factor_list, level, embedding_channels, use_scale_shift_norm, dropout, norm_cfg, resblock_updown, upsample_cfg)
Build up path blocks with resnet and attention.
- class mmagic.models.editors.ddpm.denoising_unet.DenoisingUnet(image_size, in_channels=3, out_channels=None, base_channels=128, resblocks_per_downsample=3, num_timesteps=1000, use_rescale_timesteps=False, dropout=0, embedding_channels=-1, num_classes=0, use_fp16=False, channels_cfg=None, output_cfg=dict(mean='eps', var='learned_range'), norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, use_scale_shift_norm=False, resblock_updown=False, num_heads=4, time_embedding_mode='sin', time_embedding_cfg=None, resblock_cfg=dict(type='DenoisingResBlock'), attention_cfg=dict(type='MultiHeadAttention'), encoder_channels=None, downsample_conv=True, upsample_conv=True, downsample_cfg=dict(type='DenoisingDownsample'), upsample_cfg=dict(type='DenoisingUpsample'), attention_res=[16, 8], pretrained=None, unet_type='', down_block_types: Tuple[str] = (), up_block_types: Tuple[str] = (), cross_attention_dim=768, layers_per_block: int = 2)
Bases:
mmengine.model.BaseModule
Denoising Unet. This network receives a diffused image x_t and current timestep t, and returns an output_dict corresponding to the passed output_cfg.
output_cfg defines the number of channels and the meaning of the output. output_cfg mainly contains the keys mean and var, denoting how the network outputs the mean and variance required for the denoising process.
For mean:
1. dict(mean='EPS'): Model will predict the noise added in the diffusion process, and the output_dict will contain a key named eps_t_pred.
2. dict(mean='START_X'): Model will directly predict the mean of the original image x_0, and the output_dict will contain a key named x_0_pred.
3. dict(mean='X_TM1_PRED'): Model will predict the mean of the diffused image at timestep t-1, and the output_dict will contain a key named x_tm1_pred.
For var:
1. dict(var='FIXED_SMALL') or dict(var='FIXED_LARGE'): Variance in the denoising process is regarded as a fixed value. Therefore only 'mean' will be predicted, and the number of output channels will equal that of the input image (e.g., three channels for an RGB image).
2. dict(var='LEARNED'): Model will predict log_variance in the denoising process, and the output_dict will contain a key named log_var.
3. dict(var='LEARNED_RANGE'): Model will predict an interpolation factor and the log_variance will be calculated as factor * upper_bound + (1-factor) * lower_bound. The output_dict will contain a key named factor.
If var is not FIXED_SMALL or FIXED_LARGE, the number of output channels will be double the number of input channels, where the first half contains the predicted mean values and the second half contains the predicted variance values. Otherwise, the number of output channels equals the number of input channels, containing only the predicted mean values.
- Parameters
image_size (int | list[int]) – The size of the image to denoise.
in_channels (int, optional) – The number of channels of the input image. Defaults to 3.
out_channels (int, optional) – The number of channels of the output prediction. Defaults to None, in which case it is assigned automatically according to var_mode.
base_channels (int, optional) – The basic channel number of the generator. The other layers contain channels based on this number. Defaults to 128.
resblocks_per_downsample (int, optional) – Number of ResBlocks used between two downsample operations. The number of ResBlocks between upsample operations will be the same to keep symmetry. Defaults to 3.
num_timesteps (int, optional) – The total number of timesteps of the denoising process and the diffusion process. Defaults to 1000.
use_rescale_timesteps (bool, optional) – Whether to rescale the input timesteps to the range [0, 1000]. Defaults to False.
dropout (float, optional) – The dropout probability of each ResBlock. Pass 0 to disable dropout. Defaults to 0.
embedding_channels (int, optional) – The output channels of the time embedding layer and the label embedding layer. If not passed (or passed as -1), the output channels of the embedding layers will be set to four times base_channels. Defaults to -1.
num_classes (int, optional) – The number of conditional classes. If set to 0, this model degrades to an unconditional model. Defaults to 0.
channels_cfg (list | dict[list], optional) – Config for input channels of the intermediate blocks. If a list is passed, each element of the list indicates the scale factor for the input channels of the current block with regard to base_channels. For block i, the input and output channels should be channels_cfg[i] * base_channels and channels_cfg[i+1] * base_channels. If a dict is provided, the key of the dict should be the output scale and the corresponding value should be a list to define channels. Default: Please refer to _default_channels_cfg.
output_cfg (dict, optional) – Config for output variables. Defaults to dict(mean='eps', var='learned_range').
norm_cfg (dict, optional) – The config for normalization layers. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – The config for activation layers. Defaults to dict(type='SiLU', inplace=False).
shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv in ResBlocks. The value of this argument will overwrite the default value of resblock_cfg. Defaults to 1.
use_scale_shift_norm (bool, optional) – Whether to perform scale and shift after the normalization operation. Defaults to False.
num_heads (int, optional) – The number of attention heads. Defaults to 4.
time_embedding_mode (str, optional) – Embedding method of time_embedding. Defaults to 'sin'.
time_embedding_cfg (dict, optional) – Config for time_embedding. Defaults to None.
resblock_cfg (dict, optional) – Config for ResBlock. Defaults to dict(type='DenoisingResBlock').
attention_cfg (dict, optional) – Config for attention operation. Defaults to dict(type='MultiHeadAttention').
upsample_conv (bool, optional) – Whether to use conv in the upsample block. Defaults to True.
downsample_conv (bool, optional) – Whether to use a conv operation in the downsample block. Defaults to True.
upsample_cfg (dict, optional) – Config for upsample blocks. Defaults to dict(type='DenoisingUpsample').
downsample_cfg (dict, optional) – Config for downsample blocks. Defaults to dict(type='DenoisingDownsample').
attention_res (int | list[int], optional) – Resolution of feature maps to apply the attention operation. Defaults to [16, 8].
pretrained (str | dict, optional) – Path to the pretrained model, or a dict containing information for pretrained models whose necessary key is 'ckpt_path'. You can also provide 'prefix' to load the generator part from the whole state dict. Defaults to None.
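The channel arithmetic described for embedding_channels and the list form of channels_cfg can be sketched as follows. The helper names and the example factors [1, 1, 2, 4] are hypothetical and chosen only for illustration; the dict form of channels_cfg and the real _default_channels_cfg are not covered.

```python
def resolve_embedding_channels(base_channels, embedding_channels=-1):
    """-1 means: use four times base_channels, as the parameter text states."""
    if embedding_channels == -1:
        return base_channels * 4
    return embedding_channels


def block_channels(channels_cfg, base_channels):
    """Return (in_channels, out_channels) for each block i of a list config."""
    return [
        (channels_cfg[i] * base_channels, channels_cfg[i + 1] * base_channels)
        for i in range(len(channels_cfg) - 1)
    ]


# Hypothetical scale factors, not the library defaults.
pairs = block_channels([1, 1, 2, 4], base_channels=128)
emb_channels = resolve_embedding_channels(base_channels=128)
```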
- forward(x_t, t, encoder_hidden_states=None, label=None, return_noise=False)
Forward function.
- Parameters
x_t (torch.Tensor) – Diffused image at timestep t to denoise.
t (torch.Tensor) – Current timestep.
label – You can directly give a batch of labels through a torch.Tensor or offer a callable function to sample a batch of label data. Otherwise, None indicates that the default label sampler is used.
return_noise (bool, optional) – If True, the input x_t and t will be returned in a dict together with the output specified by output_cfg. Defaults to False.
- Returns
If not return_noise
- Return type
torch.Tensor | dict
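The output_cfg rules described for DenoisingUnet (output channel count and the LEARNED_RANGE variance) reduce to two small formulas, sketched here in plain Python. The function names are hypothetical, and treating the mode string case-insensitively is an assumption made because the documentation uses both 'learned_range' and 'LEARNED_RANGE'.

```python
def resolve_out_channels(in_channels, var_mode):
    """Fixed variance: mean only. Learned variance: mean plus variance."""
    # Case-insensitive comparison is an assumption of this sketch.
    fixed = var_mode.upper() in ("FIXED_SMALL", "FIXED_LARGE")
    return in_channels if fixed else 2 * in_channels


def learned_range_log_var(factor, lower_bound, upper_bound):
    """Interpolate the log-variance between the two bounds."""
    return factor * upper_bound + (1.0 - factor) * lower_bound


rgb_learned = resolve_out_channels(3, "learned_range")
rgb_fixed = resolve_out_channels(3, "FIXED_SMALL")
log_var = learned_range_log_var(0.25, lower_bound=-4.0, upper_bound=0.0)
```

With a factor of 0 the log-variance equals the lower bound and with a factor of 1 it equals the upper bound, so the network only has to predict a value between the two.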