ml.utils.spectrogram

Defines spectrogram functions.

This module contains utilities for converting waveforms to MFCCs and back. MFCCs can be a more useful representation than raw waveforms for training models, since patterns in MFCCs tend to be easier for models to learn.

class ml.utils.spectrogram.AudioMfccConverter(sample_rate: int = 16000, n_mfcc: int = 40, n_mels: int = 128, n_fft: int = 1024, hop_length: int | None = None, win_length: int | None = None)[source]

Bases: _Normalizer

Defines a module for converting waveforms to MFCCs and back.

This module returns the normalized MFCCs from the waveforms. It uses the pseudoinverse of the mel filterbanks and the DCT matrix to convert MFCCs back to spectrograms, and then uses the Griffin-Lim algorithm to convert spectrograms back to waveforms. The pseudoinverse is used because it’s faster than running gradient descent every time we want to generate a spectrogram.

Parameters:

sample_rate – Sample rate of the audio.
n_mfcc – Number of MFCC bands.
n_mels – Number of Mel bands.
n_fft – FFT size (number of points per frame).
hop_length – Hop length for the STFT.
win_length – Window length for the STFT.
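
The DCT half of the pseudoinverse step described above can be sketched in NumPy. This is a minimal illustration, not this module's implementation: the helper dct_matrix, the (n_mfcc, n_mels) matrix orientation, and the random stand-in for log-mel energies are all assumptions made for the example. Because the truncated DCT discards information, mapping back gives only a smoothed approximation of the original log-mel frame.

```python
import numpy as np


def dct_matrix(n_mfcc: int, n_mels: int) -> np.ndarray:
    """Returns the first n_mfcc rows of an orthonormal DCT-II matrix (hypothetical helper)."""
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    mat = np.cos(np.pi / n_mels * (n + 0.5) * k) * np.sqrt(2.0 / n_mels)
    mat[0] /= np.sqrt(2.0)  # Orthonormal scaling for the k = 0 row.
    return mat  # Shape: (n_mfcc, n_mels)


n_mfcc, n_mels, num_frames = 40, 128, 10
dct_mat = dct_matrix(n_mfcc, n_mels)
inv_dct_mat = np.linalg.pinv(dct_mat)  # Shape: (n_mels, n_mfcc)

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((n_mels, num_frames))  # Stand-in for log-mel energies.

mfcc = dct_mat @ log_mel              # Forward: (n_mfcc, num_frames)
log_mel_approx = inv_dct_mat @ mfcc   # Backward: (n_mels, num_frames), lossy
mfcc_again = dct_mat @ log_mel_approx # Re-encoding reproduces the same MFCCs.
```

The pseudoinverse is computed once and reused for every frame, which is the speed advantage over an iterative solver.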

Initializes internal Module state, shared by both nn.Module and ScriptModule.

mel_fb: Tensor

inv_mel_fb: Tensor

dct_mat: Tensor

inv_dct_mat: Tensor

audio_to_spec(waveform: Tensor) → Tensor[source]

Converts a waveform to MFCCs.

Parameters: waveform – Tensor of shape (..., num_samples).
Returns: Tensor of shape (..., num_frames, n_mfcc).

spec_to_audio(mfcc: Tensor) → Tensor[source]

Converts MFCCs to a waveform.

Parameters: mfcc – Tensor of shape (..., n_mfcc, num_frames).
Returns: Tensor of shape (..., num_samples).

class ml.utils.spectrogram.AudioStftConverter(n_fft: int = 1024, hop_length: int | None = None, win_length: int | None = None)[source]

Bases: _Normalizer

Defines a class for converting waveforms to spectrograms and back.

This is an exact forward and backward transformation, meaning that the input can be reconstructed perfectly from the output. However, phase information is often difficult for downstream networks to work with.

Parameters:

n_fft – FFT size (number of points per frame).
hop_length – Hop length for the STFT.
win_length – Window length for the STFT.
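
To make the relationship between these parameters and the output shape concrete, here is a small sketch of the usual STFT shape arithmetic. The function stft_output_shape is hypothetical, and it assumes torch.stft-style conventions (centered frames and a default hop of n_fft // 4); this converter's actual defaults when hop_length is None are not documented here and may differ.

```python
from typing import Optional, Tuple


def stft_output_shape(
    num_samples: int,
    n_fft: int = 1024,
    hop_length: Optional[int] = None,
) -> Tuple[int, int]:
    """Returns (num_frames, num_freq_bins) under torch.stft-style conventions."""
    if hop_length is None:
        hop_length = n_fft // 4  # Assumed default, mirroring torch.stft.
    num_frames = 1 + num_samples // hop_length  # Assumes center=True padding.
    num_freq_bins = n_fft // 2 + 1  # One-sided spectrum of a real signal.
    return num_frames, num_freq_bins


# One second of 16 kHz audio with the default n_fft of 1024.
shape = stft_output_shape(num_samples=16000)
```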

normalize(mag: Tensor) → Tensor[source]

Normalizes a signal along the final dimension.

This updates the running mean and standard deviation of the signal if training.

Parameters: mag – The input tensor, with shape (*, N)
Returns: The normalized tensor, with shape (*, N)

denormalize(log_mag: Tensor) → Tensor[source]

Denormalizes a signal along the final dimension.

Parameters: log_mag – The latent tensor, with shape (*, N)
Returns: The denormalized tensor, with shape (*, N)

audio_to_spec(waveform: Tensor) → Tensor[source]

Converts a waveform to a spectrogram.

This version keeps the phase information, in a parallel channel with the magnitude information.

Parameters: waveform – Tensor of shape (..., num_samples).
Returns: Tensor of shape (..., 2, num_frames, n_fft // 2 + 1). The first channel is the magnitude, the second is the phase.

spec_to_audio(spec: Tensor) → Tensor[source]

Converts a spectrogram to a waveform.

This version expects the spectrogram to have two channels, one for magnitude and one for phase.

Parameters: spec – Tensor of shape (..., 2, num_frames, n_fft // 2 + 1).
Returns: Tensor of shape (..., num_samples), the reconstructed waveform.
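
The two-channel layout can be illustrated with a minimal NumPy sketch. It shows only why keeping magnitude and phase side by side makes the transform invertible; it is not this class's implementation. The random frames stand in for windowed audio frames, and windowing, overlap-add, and the magnitude normalization are deliberately left out.

```python
import numpy as np

n_fft, num_frames = 16, 5
rng = np.random.default_rng(0)
frames = rng.standard_normal((num_frames, n_fft))  # Stand-in for windowed frames.

stft = np.fft.rfft(frames, axis=-1)  # Complex, shape (num_frames, n_fft // 2 + 1)

# Pack magnitude and phase as two parallel real-valued channels.
spec = np.stack([np.abs(stft), np.angle(stft)], axis=0)  # (2, num_frames, n_fft // 2 + 1)

# Unpack: magnitude * exp(i * phase) recovers the complex STFT.
mag, phase = spec
stft_rebuilt = mag * np.exp(1j * phase)
frames_rebuilt = np.fft.irfft(stft_rebuilt, n=n_fft, axis=-1)
```

Dropping the phase channel is what makes reconstruction approximate rather than exact.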

class ml.utils.spectrogram.AudioMagStftConverter(n_fft: int = 1024, n_iter: int = 32, hop_length: int | None = None, win_length: int | None = None)[source]

Bases: _Normalizer

audio_to_mag_spec(waveform: Tensor) → Tensor[source]

Converts a waveform to a magnitude spectrogram.

Parameters: waveform – Tensor of shape (..., num_samples).
Returns: Tensor of shape (..., num_frames, n_fft // 2 + 1).

mag_spec_to_audio(mag: Tensor) → Tensor[source]

Converts a magnitude spectrogram to a waveform.

Parameters: mag – Tensor of shape (..., num_frames, n_fft // 2 + 1).
Returns: Tensor of shape (..., num_samples), the reconstructed waveform.
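
Because a magnitude spectrogram has no phase, the waveform has to be estimated. The n_iter argument suggests an iterative scheme such as Griffin-Lim, but that is an assumption; this class's documentation does not say which method it uses. The sketch below is a plain NumPy Griffin-Lim with hypothetical helpers stft and istft (periodic Hann window, no centering), written only to show the idea.

```python
import numpy as np


def stft(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Frames the signal, applies a periodic Hann window and takes the real FFT."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # (n_frames, n_fft // 2 + 1)


def istft(spec: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Windowed overlap-add inverse, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    length = n_fft + hop * (spec.shape[0] - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window**2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(mag: np.ndarray, n_fft: int, hop: int, n_iter: int = 32, seed: int = 0) -> np.ndarray:
    """Estimates a waveform whose STFT magnitude approximates mag."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # Random initial phase.
    for _ in range(n_iter):
        audio = istft(mag * phase, n_fft, hop)
        # Keep the target magnitude, adopt the phase of the re-analyzed signal.
        phase = np.exp(1j * np.angle(stft(audio, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)


n_fft, hop = 64, 16
x = np.sin(2 * np.pi * 440.0 * np.arange(1024) / 16000.0)
mag = np.abs(stft(x, n_fft, hop))  # (num_frames, n_fft // 2 + 1)
audio = griffin_lim(mag, n_fft, hop, n_iter=32)
```

More iterations generally improve consistency between the estimated phase and the target magnitude, at a linear cost in runtime.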

class ml.utils.spectrogram.WorldFeatures(sp, f0, ap)[source]

Bases: NamedTuple

Create new instance of WorldFeatures(sp, f0, ap)

sp: Tensor: Alias for field number 0 (in WORLD terminology, the spectral envelope)

f0: Tensor: Alias for field number 1 (the fundamental frequency)

ap: Tensor: Alias for field number 2 (the aperiodicity)
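
As a quick illustration of how a three-field NamedTuple like this behaves, the sketch below defines a hypothetical stand-in, WorldFeaturesSketch, with the same field order. It uses plain lists instead of tensors so that it runs without any dependencies; the values are made up.

```python
from typing import List, NamedTuple


class WorldFeaturesSketch(NamedTuple):
    """Hypothetical stand-in mirroring the field order of WorldFeatures."""

    sp: List[float]
    f0: List[float]
    ap: List[float]


features = WorldFeaturesSketch(sp=[0.1, 0.2], f0=[220.0, 221.5], ap=[0.9, 0.8])

# Fields can be read by name or by position (sp = 0, f0 = 1, ap = 2).
first_by_name = features.sp
first_by_index = features[0]

# A NamedTuple is still a tuple, so it unpacks like one.
sp, f0, ap = features
```

This is also why a function that accepts a plain 3-tuple can accept the named version interchangeably, provided the order is (sp, f0, ap).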

class ml.utils.spectrogram.AudioPyworldConverter(sample_rate: int = 16000, dim: int = 24, frame_period: float = 5.0, f0_floor: float = 71.0, f0_ceil: float = 800.0)[source]

Bases: _Normalizer

Defines a class for converting waveforms to PyWorld features and back.

This class also normalizes the features to have zero mean and unit variance using statistics over time.

Parameters:

sample_rate – Sample rate of the audio.
dim – Dimension of the PyWorld features.
frame_period – Frame period in milliseconds.
f0_floor – Minimum F0 value.
f0_ceil – Maximum F0 value.
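
The frame period is given in milliseconds, so it can help to convert it to samples and to a frame rate. The helper below is hypothetical and only performs that unit conversion with the default values listed above; it does not call PyWorld.

```python
def hop_in_samples(sample_rate: int = 16000, frame_period_ms: float = 5.0) -> float:
    """Converts a frame period in milliseconds to a hop size in samples."""
    return sample_rate * frame_period_ms / 1000.0


hop = hop_in_samples()            # Samples between consecutive analysis frames.
frames_per_second = 1000.0 / 5.0  # Analysis frames per second of audio.
```

With the defaults, each feature frame therefore advances 80 samples, or 200 frames per second of audio.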

normalize(x: ndarray) → ndarray[source]

Normalizes a signal along the final dimension.

This updates the running mean and standard deviation of the signal if training.

Parameters: x – The input array, with shape (*, N)
Returns: The normalized array, with shape (*, N)

denormalize(x: ndarray) → ndarray[source]

Denormalizes a signal along the final dimension.

Parameters: x – The latent array, with shape (*, N)
Returns: The denormalized array, with shape (*, N)
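
One way to implement the normalize and denormalize pair described above is with running statistics. The sketch below is a hypothetical stand-in, not the library's _Normalizer: it assumes per-feature statistics over all leading dimensions, uses Welford's online update, and uses a plain training flag in place of the module's training mode.

```python
import numpy as np


class RunningNormalizerSketch:
    """Hypothetical stand-in; the real _Normalizer may differ."""

    def __init__(self, dim: int, eps: float = 1e-6) -> None:
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # Running sum of squared deviations from the mean.
        self.eps = eps
        self.training = True

    @property
    def std(self) -> np.ndarray:
        return np.sqrt(self.m2 / max(self.count, 1)) + self.eps

    def normalize(self, x: np.ndarray) -> np.ndarray:
        if self.training:
            for row in x.reshape(-1, x.shape[-1]):  # Welford's update, row by row.
                self.count += 1
                delta = row - self.mean
                self.mean = self.mean + delta / self.count
                self.m2 = self.m2 + delta * (row - self.mean)
        return (x - self.mean) / self.std

    def denormalize(self, x: np.ndarray) -> np.ndarray:
        return x * self.std + self.mean


rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((50, 4))

normalizer = RunningNormalizerSketch(dim=4)
y = normalizer.normalize(x)      # Updates the statistics, then normalizes.

normalizer.training = False      # Freeze the statistics, as in evaluation mode.
x_roundtrip = normalizer.denormalize(normalizer.normalize(x))
```

Freezing the statistics outside of training is what keeps denormalize an exact inverse of normalize.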

audio_to_features(waveform: ndarray) → WorldFeatures[source]

features_to_audio(features: WorldFeatures | tuple[torch.Tensor | numpy.ndarray, torch.Tensor | numpy.ndarray, torch.Tensor | numpy.ndarray]) → ndarray[source]

class ml.utils.spectrogram.SpectrogramToMFCCs(sample_rate: int = 16000, n_mels: int = 128, n_mfcc: int = 40, f_min: float = 0.0, f_max: float | None = None, n_stft: int = 201, norm: str | None = None, mel_scale: str = 'htk', dct_norm: str = 'ortho')[source]

Bases: _Normalizer

dct_mat: Tensor

audio_to_spec(waveform: Tensor) → Tensor[source]

forward(spec: Tensor) → Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class ml.utils.spectrogram.AudioToHifiGanMels(sampling_rate: int, num_mels: int, n_fft: int, win_size: int, hop_size: int, fmin: int = 0, fmax: int = 8000)[source]

Bases: Module

Defines a module to convert from a waveform to the mels used by HiFi-GAN.

This module can be used to get the target Mel spectrograms during training that will be compatible with pre-trained HiFi-GAN models. Since the full HiFi-GAN model can be expensive to load during inference, Griffin-Lim is used here to provide a light-weight reconstruction of the audio from the Mel spectrogram during training (although the quality will be poor). Then, during inference, the full HiFi-GAN model can be used instead.

Parameters:

sampling_rate – The sampling rate of the audio.
num_mels – The number of mel bins.
n_fft – The FFT size.
win_size – The window size.
hop_size – The hop size.
fmin – The minimum frequency.
fmax – The maximum frequency.
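
The mel_fb and inv_mel_fb buffers listed below suggest that the lightweight reconstruction first maps mels back to a linear spectrogram with a pseudoinverse. The NumPy sketch below illustrates that idea under stated assumptions: the HTK-style triangular filterbank, the log floor of 1e-5, and the small sizes are illustrative choices, and the actual module may build its filterbank and compression differently.

```python
import numpy as np


def hz_to_mel(freq):
    return 2595.0 * np.log10(1.0 + freq / 700.0)  # HTK-style mel scale.


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels, fmin, fmax):
    """Builds triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m : m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


mel_fb = mel_filterbank(16000, 512, 20, 0.0, 8000.0)  # (20, 257)
inv_mel_fb = np.linalg.pinv(mel_fb)                   # (257, 20)

rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((257, 10)))          # Stand-in magnitude spectrogram.
log_mel = np.log(np.clip(mel_fb @ mag, 1e-5, None))   # Log-compressed mels.

linear = inv_mel_fb @ np.exp(log_mel)                 # Least-squares linear estimate.
mag_est = np.maximum(linear, 0.0)                     # Magnitudes cannot be negative.
```

The estimate mag_est can then be handed to a phase-reconstruction step such as Griffin-Lim. It reproduces the mel energies but not the fine spectral detail, which is one reason the resulting audio quality is poor.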

mel_fb: Tensor

inv_mel_fb: Tensor

hann_window: Tensor

classmethod for_hifigan(hifigan_type: Literal['16000hz', '22050hz']) → AudioToHifiGanMels[source]

property dimensions: int

audio_to_mels(waveform: Tensor) → Tensor[source]

mels_to_audio(spec: Tensor) → Tensor[source]

ml.utils.spectrogram.test_audio_adhoc() → None[source]