Audio Classification with Deep Learning in Python

Fine-tuning image models to tackle domain shift and class imbalance with PyTorch and torchaudio in audio data

Classifying bird calls in soundscapes with Machine Learning (Image drawn by the author)

Welcome to another edition of “The Kaggle Blueprints”, where we will analyze Kaggle competitions’ winning solutions for lessons we can apply to our own data science projects.

This edition will review the techniques and approaches from the “BirdCLEF 2022” competition, which ended in May 2022.

The objective of the “BirdCLEF 2022” competition was to identify Hawaiian bird species by sound. The competitors were given short audio files of single bird calls and were asked to predict whether a specific bird was present in a longer recording.

In contrast to a vanilla audio classification problem, this competition added flavor with the following challenges:

  • Domain shift — The training data consisted of clean audio recordings of a single bird call, isolated from any additional sounds (a few seconds long, of varying length). However, the test data consisted of “unclean”, longer (1-minute) recordings taken “in the wild” that contained many sounds other than bird calls (e.g., wind, rain, other animals, etc.).
Domain shift in audio data
  • Class imbalance / few-shot learning — As some birds are less common than others, we are dealing with a long-tailed class distribution where some birds have only one sample.
Long-tailed class distribution

Insert your data here! — To follow along in this article, your dataset should look something like this:

Insert your data here: How your audio dataset dataframe should be formatted

A popular approach among competitors to this audio classification problem was to:

  1. Convert the audio classification problem into an image classification problem by converting the audio from waveform to a Mel spectrogram and applying a Deep Learning model
  2. Apply data augmentations to the audio data in waveform and in spectrograms to tackle the domain shift and class imbalance
  3. Fine-tune a pre-trained image classification model to tackle the class imbalance

This article will use PyTorch (version 1.13.0) for the Deep Learning framework and torchaudio (version 0.13.0) and librosa (version 0.10.0) for audio processing. Additionally, we will be using timm (version 0.6.12) for fine-tuning with pre-trained image models.

# Deep Learning framework
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Audio processing
import torchaudio
import torchaudio.transforms as T
import librosa

# Pre-trained image models
import timm

Before getting started with solving an audio classification problem, let’s first get familiar with working with audio data. You can load the audio and its sampling rate from different file formats (e.g., .wav, .ogg, etc.) with the .load() method from the torchaudio library or the librosa library.

PATH = "audio_example.wav"

# Load a sample audio file with torchaudio
original_audio, sample_rate = torchaudio.load(PATH)

# Load a sample audio file with librosa
original_audio, sample_rate = librosa.load(PATH,
                                           sr = None) # Gotcha: Set sr to None to get the original sampling rate. Otherwise the default is 22050.

If you want to listen to the loaded audio directly in a Jupyter notebook for explorations, the following code will provide you with an audio player.

# Play the audio in Jupyter notebook
from IPython.display import Audio

Audio(data = original_audio, rate = sample_rate)

Displaying audio player for loaded data in Jupyter notebook

The librosa library also provides various methods to quickly display the audio data for exploration purposes. If you used torchaudio to load the audio file, make sure to convert the tensors to NumPy arrays first (see the sketch below the figure).

import librosa.display as dsp
dsp.waveshow(original_audio, sr = sample_rate);
Original audio data of the word “stop” in waveform from the “Speech Commands” dataset [0]
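If you loaded the file with torchaudio, the waveform is a tensor of shape (channels, samples). A minimal conversion sketch, assuming a mono recording and the variable names from above:

# Squeeze the channel dimension and convert the tensor to a NumPy array before plotting
dsp.waveshow(original_audio.squeeze().numpy(), sr = sample_rate);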

A popular method to model audio data with a Deep Learning model is to convert the “computer hearing” problem to a computer vision problem [2]. Specifically, the waveform audio is converted to a Mel spectrogram (which is a type of image) as shown below.

Converting an audio file from waveform (time domain) to Mel spectrogram (frequency domain)

Usually, you would use a Fast Fourier Transform (FFT) to computationally convert an audio signal from the time domain (waveform) to the frequency domain (spectrogram).

However, the FFT will give you the overall frequency components for the entire time series of the audio signal as a whole. Thus, you are losing the time information when converting audio data from the time domain to the frequency domain.

Instead of the FFT, you can use the Short-Time Fourier Transform (STFT) to preserve the time information. The STFT is a variant of the FFT that breaks up the audio signal into smaller sections by using a sliding time window. It takes the FFT on each section and then combines them.

  • n_fft — length of the sliding window (default: 2048)
  • hop_length — number of samples by which to slide the window (default: 512). The hop_length will directly impact the width of the resulting image. If your audio data has a fixed length and you want to convert the waveform to a fixed image size, you can set hop_length = audio_length // (image_size[1] - 1) (see the sketch after the figure below).
Short-Time Fourier Transform (STFT)
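To make the effect of these two parameters concrete, here is a minimal STFT sketch with torchaudio; the clip length and sampling rate are illustrative assumptions, not values prescribed by the competition:

# Assume a 5-second clip at 32 kHz, i.e., a waveform tensor with 160,000 samples
spectrogram = T.Spectrogram(n_fft = 2048, hop_length = 512)
spec = spectrogram(audio)

# spec has shape (n_fft // 2 + 1, num_frames) = (1025, 313):
# 1025 frequency bins and roughly one time frame per hop (160,000 / 512 ≈ 313)
print(spec.shape)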

Next, you will convert the amplitude to decibels and bin the frequencies according to the Mel scale. For this purpose, n_mels is the number of frequency bands (Mel bins). This will be the height of the resulting spectrogram.

Convert amplitude to decibels and apply Mel binning to the spectrum
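A minimal sketch of the Mel binning and decibel conversion with torchaudio; the parameter values mirror the ones used in the Dataset below and are assumptions rather than requirements:

# Bin the frequencies into 128 Mel bands ...
melspectrogram = T.MelSpectrogram(sample_rate = 32000,
                                  n_fft = 2048,
                                  hop_length = 512,
                                  n_mels = 128)
melspec = melspectrogram(audio)                    # shape: (n_mels, num_frames)

# ... and convert the power values to decibels
melspec_db = T.AmplitudeToDB(top_db = 80)(melspec)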

For an in-depth explanation of the Mel spectrogram, I recommend this article:

Below you can see an example PyTorch Dataset which loads an audio file and converts the waveform to a Mel spectrogram after some preprocessing steps.

class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000,
                 wave_transforms = None,
                 spec_transforms = None):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length
        self.wave_transforms = wave_transforms
        self.spec_transforms = spec_transforms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Convert to mono
        audio = torch.mean(audio, axis=0)

        # Resample
        if sample_rate != self.target_sample_rate:
            resample = T.Resample(sample_rate, self.target_sample_rate)
            audio = resample(audio)

        # Adjust number of samples
        if audio.shape[0] > self.num_samples:
            # Crop
            audio = audio[:self.num_samples]
        elif audio.shape[0] < self.num_samples:
            # Pad
            audio = F.pad(audio, (0, self.num_samples - audio.shape[0]))

        # Add any preprocessing you like here
        # (e.g., noise removal, etc.)
        ...

        # Add any data augmentations for waveform you like here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        ...

        # Convert to Mel spectrogram
        melspectrogram = T.MelSpectrogram(sample_rate = self.target_sample_rate,
                                          n_mels = 128,
                                          n_fft = 2048,
                                          hop_length = 512)
        melspec = melspectrogram(audio)

        # Add any data augmentations for spectrogram you like here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        ...

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}

Your resulting dataset should produce samples that look something like this before we feed them to the neural network:

Sample structure from the Audio Dataset
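To sanity-check the dataset, you can wrap it in a DataLoader and inspect one batch. This is a hypothetical usage sketch; the dataframe df, the audio length of 5 seconds, and the batch size are assumptions:

train_dataset = AudioDataset(df, audio_length = 5)
train_loader = DataLoader(train_dataset, batch_size = 32, shuffle = True)

batch = next(iter(train_loader))
print(batch["image"].shape)   # e.g., torch.Size([32, 1, 128, 313])
print(batch["label"].shape)   # e.g., torch.Size([32, num_classes])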

One technique to tackle this competition’s challenges of domain shift and class imbalance was to apply data augmentations to the training data [5, 8, 10, 11]. You can apply data augmentations for audio data in the waveform and the spectrogram. The torchaudio library already provides a lot of different data augmentations for audio data.

Popular data augmentation techniques for audio data in waveform (time domain) are (a short sketch follows the figure below):

  • Noise injection like white noise, colored noise, or background noise (AddNoise)
  • Shifting time
  • Changing speed (Speed; alternatively use TimeStretch in frequency domain)
  • Changing pitch (PitchShift)
Overview of different data augmentation techniques for audio in waveform: Noise injection (white noise, colored noise, background noise), shifting time, changing speed and pitch
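Here is a minimal sketch of two waveform augmentations: white-noise injection written by hand and pitch shifting with torchaudio's PitchShift. The signal-to-noise ratio, sampling rate, and number of semitones are arbitrary choices for illustration:

# Noise injection: add white noise at a chosen signal-to-noise ratio (here 10 dB)
noise = torch.randn_like(audio)
snr_db = 10
signal_power = audio.pow(2).mean()
noise_power = noise.pow(2).mean()
scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
noisy_audio = audio + scale * noise

# Changing pitch: shift the audio up by 4 semitones
pitch_shift = T.PitchShift(sample_rate = 32000, n_steps = 4)
shifted_audio = pitch_shift(audio)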

Popular data augmentation techniques for audio data in the spectrogram (frequency domain) are (a sketch of the masking transforms follows the figures below):

  • Popular image augmentation techniques like Mixup [13] or Cutmix [12]
  • SpecAugment [7], i.e., time masking and/or frequency masking
Data Augmentation for Spectrogram: Mixup [13]
Data Augmentation for Spectrogram: SpecAugment [7]
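Time and frequency masking are available directly in torchaudio. A minimal sketch on a Mel spectrogram; the mask sizes are illustrative:

# SpecAugment-style masking on a Mel spectrogram of shape (n_mels, num_frames)
time_masking = T.TimeMasking(time_mask_param = 30)        # mask up to 30 consecutive time steps
freq_masking = T.FrequencyMasking(freq_mask_param = 20)   # mask up to 20 consecutive Mel bins
melspec = freq_masking(time_masking(melspec))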

As you can see, while torchaudio provides a lot of audio augmentations, it doesn't cover all of the proposed techniques.

Thus, if you want to inject a specific type of noise, shift the time, or apply Mixup [13] or Cutmix [12] augmentations, you must write a custom data augmentation in PyTorch. You can reference this collection of audio data augmentation techniques for their implementations:
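As an illustration of such a custom augmentation, here is a hand-rolled Mixup [13] sketch applied to a batch of spectrograms and their one-hot labels; the alpha value of 0.4 and the tensor shapes are assumptions:

# Mixup on a batch: blend each sample with a randomly chosen partner
# images: (batch, 1, n_mels, num_frames), labels: (batch, num_classes)
lam = torch.distributions.Beta(0.4, 0.4).sample()   # mixing coefficient drawn from Beta(alpha, alpha)
perm = torch.randperm(images.size(0))               # random pairing within the batch
mixed_images = lam * images + (1 - lam) * images[perm]
mixed_labels = lam * labels + (1 - lam) * labels[perm]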

In the example PyTorch Dataset class from before, you can apply the data augmentations as follows:

class AudioDataset(Dataset):
    def __init__(self,
                 df,
                 audio_length,
                 target_sample_rate = 32000):
        self.df = df
        self.file_paths = df['file_path'].values
        self.labels = df[['class_0', ..., 'class_N']].values
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate * audio_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        # Load audio from file to waveform
        audio, sample_rate = torchaudio.load(self.file_paths[index])

        # Add any preprocessing you like here
        # (e.g., converting to mono, resampling, adjusting size, noise removal, etc.)
        ...

        # Add any data augmentations for waveform you like here
        # (e.g., noise injection, shifting time, changing speed and pitch)
        wave_transforms = T.PitchShift(sample_rate, 4)
        audio = wave_transforms(audio)

        # Convert to Mel spectrogram
        melspec = ...

        # Add any data augmentations for spectrogram you like here
        # (e.g., Mixup, cutmix, time masking, frequency masking)
        spec_transforms = T.FrequencyMasking(freq_mask_param=80)
        melspec = spec_transforms(melspec)

        return {"image": torch.stack([melspec]),
                "label": torch.tensor(self.labels[index]).float()}

In this competition, we are dealing with class imbalance. As some classes have only one sample, we are also facing a few-shot learning problem. Nakamura and Harada [6] showed in 2019 that fine-tuning could be an effective approach to few-shot learning.

A lot of competitors [2, 5, 8, 10, 11] fine-tuned common pre-trained image classification models such as

  • EfficientNet (e.g., tf_efficientnet_b3_ns) [9],
  • SE-ResNext (e.g., se_resnext50_32x4d) [3],
  • NFNet (e.g., eca_nfnet_l0) [1]

You can load any pre-trained image classification model with the timm library for fine-tuning. Make sure to set in_chans = 1 as we are not working with 3-channel images but 1-channel Mel spectrograms.

class AudioModel(nn.Module):
    def __init__(self,
                 num_classes,
                 model_name = 'tf_efficientnet_b3_ns',
                 pretrained = True):
        super(AudioModel, self).__init__()

        self.model = timm.create_model(model_name,
                                       pretrained = pretrained,
                                       in_chans = 1)
        self.in_features = self.model.classifier.in_features
        self.model.classifier = nn.Sequential(
            nn.Linear(self.in_features, num_classes)
        )

    def forward(self, images):
        logits = self.model(images)
        return logits
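As a quick check that the 1-channel setup works, you can push a dummy batch of Mel spectrograms through the model; the spectrogram size and the number of classes are illustrative:

model = AudioModel(num_classes = 21)

dummy_batch = torch.randn(4, 1, 128, 313)   # (batch, channels, n_mels, num_frames)
print(model(dummy_batch).shape)             # torch.Size([4, 21])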

Other competitors reported success with fine-tuning models pre-trained on similar audio classification problems [4, 10].

Fine-tuning is done with a cosine annealing learning rate scheduler (CosineAnnealingLR) for a few epochs [2, 8].

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max = ...,    # Maximum number of iterations
                                                       eta_min = ...)  # Minimum learning rate
PyTorch Cosine Annealing / Decay Learning Rate Scheduler (Image by the author, originally published in “A Visual Guide to Learning Rate Schedulers in PyTorch”)
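Putting the pieces together, a hypothetical fine-tuning setup could look like the sketch below. The optimizer, learning rates, number of epochs, and the multi-label loss are assumptions for illustration, not the exact settings of the winning solutions:

model = AudioModel(num_classes = 21)
optimizer = optim.Adam(model.parameters(), lr = 1e-3)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max = 10, eta_min = 1e-6)

for epoch in range(10):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["image"])
        loss = F.binary_cross_entropy_with_logits(logits, batch["label"])
        loss.backward()
        optimizer.step()
    scheduler.step()   # anneal the learning rate once per epoch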

You can find more tips and best practices in this guide for fine-tuning Deep Learning models:


Dataset

As the original competition data does not allow commercial use, examples are done with the following dataset.

[0] Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from download.tensorflow.org/data/speech_commands_v0.01.tar.gz

License: CC-BY-4.0

Image References

If not otherwise stated, all images are created by the author.

Web & Literature

[1] Brock, A., De, S., Smith, S. L., & Simonyan, K. (2021, July). High-performance large-scale image recognition without normalization. In International Conference on Machine Learning (pp. 1059–1071). PMLR.

[2] Chai Time Data Science (2022). BirdCLEF 2022: 11th Pos Gold Solution | Gilles Vandewiele (accessed March 13th, 2023)

[3] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).

[4] Kramarenko Vladislav (2022). 4th place in Kaggle Discussions (accessed March 13th, 2023)

[5] LeonShangguan (2022). [Public #1 Private #2] + [Private #7/8 (potential)] solutions. The host wins. in Kaggle Discussions (accessed March 13th, 2023)

[6] Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216.

[7] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

[8] slime (2022). 3rd place solution in Kaggle Discussions (accessed March 13th, 2023)

[9] Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR.

[10] Volodymyr (2022). 1st place solution models (it’s not all BirdNet) in Kaggle Discussions (accessed March 13th, 2023)

[11] yokuyama (2022). 5th place solution in Kaggle Discussions (accessed March 13th, 2023)

[12] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6023–6032).

[13] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017) mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

