The Secret to Improved NLP: An In-Depth Look at the nn.Embedding Layer in PyTorch | by Will Badr | Jan, 2023


You might have seen the famous PyTorch nn.Embedding() layer in multiple neural network architectures that involve natural language processing (NLP). This is one of the simplest and most important layers when it comes to designing advanced NLP architectures. Let me explain what it is, in simple terms.

After spending some time looking into its C++ source code, here is what I found. The nn.Embedding layer is a simple lookup table that maps an index value to a row of a weight matrix, that is, to a vector of a chosen dimension. This simple operation is the foundation of many advanced NLP architectures, allowing discrete input symbols to be processed in a continuous space. During training, the parameters of the nn.Embedding layer are adjusted in order to optimize the performance of the model. Specifically, the embedding matrix is updated via backpropagation to minimize the loss function. This can be thought of as learning a mapping from discrete input tokens (such as words) to continuous embedding vectors in a high-dimensional space, where the vectors are optimised to represent the meaning or context of the input tokens in relation to the task the model is trained for (e.g. text generation, language translation).

Now let’s look at some concrete examples with code:

The nn.Embedding layer takes at least two arguments: the vocabulary size and the size of the encoded representation for each word. For example, if you have a vocabulary of 10,000 words, then the value of the first argument would be 10,000. Each word in the vocabulary will be represented by a vector of fixed size. The second argument is the size of that learned embedding vector for each word.

import torch
import torch.nn as nn

# Define the embedding layer with a vocabulary size of 10 and 50-dimensional embedding vectors.
embedding = nn.Embedding(10, 50)

What happened here is that PyTorch created a lookup table called embedding. This table has 10 rows and 50 columns. Each row represents a single word embedding, and its values are initialized randomly. By default they are drawn from a standard normal distribution (mean 0, standard deviation 1) using the nn.init.normal_() function from the torch.nn.init module. To examine the embedding for a given word (e.g. the first word in the table), pass its index to the layer, for example embedding(torch.tensor([0])).

The output is a tensor holding one vector of size 50. Its values are random, so they will differ from run to run.
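As a minimal sketch of that lookup (the variable names are illustrative, and the seed is only there to make repeated runs identical), the layer's output for index 0 is simply row 0 of its weight matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so repeated runs build the same random table

embedding = nn.Embedding(10, 50)  # 10 rows (words), 50 columns (features)

# Indices must be integer tensors; here we ask for the first word only.
first_word = embedding(torch.tensor([0]))

print(first_word.shape)        # torch.Size([1, 50])
print(embedding.weight.shape)  # torch.Size([10, 50])

# The lookup returns exactly the corresponding row of the weight matrix.
same_row = torch.equal(first_word[0], embedding.weight[0])
print(same_row)                # True
```

Nothing is computed here beyond selecting a row, which is why the layer is often described as a trainable lookup table.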

These are the numbers that get tuned and optimised during the training process to convey the meaning of a certain word. The initialization method can have a significant impact on the performance of the model. Different initialization methods lead to different starting points for the optimization process, and this can affect how quickly or easily the network converges to a good solution. For example, if the weights are initialized to very small or very large values, the activations and gradients flowing through the network during backpropagation can also become very small or very large, which can slow down or even prevent convergence. On the other hand, if the weights start at a moderate scale, small but not vanishingly so, the gradients will be more reasonable, and the network is more likely to converge quickly.

Also, different initialization methods are designed to work well with different types of activation functions. For example, the Xavier initialization is designed to work well with sigmoid and tanh activation functions, whereas other methods may work better with ReLU and its variants. Now, let’s see how to initialize the nn.Embedding layer using different methods:

  1. nn.init.normal_(): This function initializes the weights with random values drawn from a normal distribution, by default with a mean of 0 and a standard deviation of 1. It is also known as Gaussian initialization.

  2. nn.init.constant_(): This function initializes the weights with a specific constant value. For example, you can use nn.init.constant_(my_layer.weight, 0) to initialize the weights of a layer to 0.

  3. nn.init.xavier_uniform_() and nn.init.xavier_normal_(): These functions are based on the work of Xavier Glorot and Yoshua Bengio, and they are designed to work well with sigmoid and tanh activation functions. They scale the random values according to the dimensions of the weight tensor, so the weights are close to zero, but not too small.

  4. nn.init.kaiming_uniform_() and nn.init.kaiming_normal_(): These functions are based on the work of He et al., and they are designed to work well with ReLU and its variants (LeakyReLU, PReLU, RReLU, etc.). They also scale the random values according to the dimensions of the weight tensor, so the weights are close to zero, but not too small.

nn.init.kaiming_normal_(embedding.weight, nonlinearity='leaky_relu')
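To make the differences between these methods concrete, here is a small sketch that applies several of them in turn to the same 10 x 50 embedding table and records a simple statistic after each. The variable names are my own, and the comparison is only meant as an illustration, not a recommendation of one method over another.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 50)

# Default state: values drawn from a standard normal distribution.
default_std = embedding.weight.std().item()

# Constant initialization: every entry becomes the same value.
nn.init.constant_(embedding.weight, 0.0)
all_zero = bool((embedding.weight == 0).all())

# Xavier/Glorot uniform: values lie in [-a, a] with a = sqrt(6 / (fan_in + fan_out)).
# For a 2-D tensor of shape (10, 50), PyTorch uses fan_in = 50 and fan_out = 10.
nn.init.xavier_uniform_(embedding.weight)
xavier_bound = math.sqrt(6.0 / (50 + 10))
xavier_max = embedding.weight.abs().max().item()

# Kaiming/He normal: the standard deviation is sqrt(2 / fan_in) for ReLU.
nn.init.kaiming_normal_(embedding.weight, nonlinearity='relu')
kaiming_std = embedding.weight.std().item()

print(f"default std      : {default_std:.3f}")
print(f"constant all zero: {all_zero}")
print(f"xavier max / bound: {xavier_max:.3f} / {xavier_bound:.3f}")
print(f"kaiming std      : {kaiming_std:.3f}")
```

The Xavier and Kaiming variants both shrink the spread of the values compared with the default, which is the "close to zero, but not too small" behaviour described above.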

These weights can also be initialized using pre-trained word vectors such as GloVe or word2vec, which have been trained on large corpora and have been shown to be useful for many natural language processing tasks. Starting from pre-trained word vectors and continuing to train them on your own task is called fine-tuning. Using pre-trained word embeddings with the nn.Embedding layer can be very useful for a variety of natural language processing (NLP) tasks. There are a few reasons why this is the case:

  1. Improves model performance: Pre-trained word embeddings have been trained on massive amounts of text data and have been shown to be useful for a variety of NLP tasks. When used as input to a neural network, they can help to improve the performance of the model by providing it with a good set of initial weights that capture the meaning of words.
  2. Saves computation time and resources: Training a neural network to learn word embeddings from scratch can be a time-consuming and computationally expensive task, especially if you are working with large corpora. By using pre-trained word embeddings, you can save a significant amount of computation time and resources, as the embeddings have already been learned on a large corpus.
  3. Allows transfer learning: Pre-trained word embeddings can be used for transfer learning, which means that you can use the embeddings learned on one task as a starting point for a different but related task. This can be particularly useful for NLP tasks where labeled data is scarce or expensive to obtain.

Let’s see how it can be implemented:

import torch
import torch.nn as nn

# Load a pre-trained embedding model
pretrained_embeddings = torch.randn(10, 50) # Example only, not actual pre-trained embeddings

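# Initialize the embedding layer with the pre-trained embeddings.
# One way to do this: create the layer, then copy the tensor into its weight.
embedding_layer = nn.Embedding(10, 50)
with torch.no_grad():
    embedding_layer.weight.copy_(pretrained_embeddings)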

You can also use the from_pretrained() method to load the pre-trained embeddings directly:

embedding_layer = nn.Embedding.from_pretrained(pretrained_embeddings)
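In practice the rows of the pre-trained matrix have to line up with the indices of your own vocabulary. The sketch below shows one possible way to build such a matrix. The pretrained dictionary and the vocab mapping are hypothetical toy data that I made up for illustration; real vectors would be read from a GloVe or word2vec file.

```python
import torch
import torch.nn as nn

# Hypothetical toy "pre-trained" vectors (3 dimensions each), for illustration only.
pretrained = {
    "cat": torch.tensor([0.1, 0.2, 0.3]),
    "dog": torch.tensor([0.4, 0.5, 0.6]),
}

# The model's own vocabulary decides which row each word gets.
vocab = {"<unk>": 0, "cat": 1, "dog": 2, "fish": 3}
dim = 3

# Start from zeros and fill in the rows for which a pre-trained vector exists.
matrix = torch.zeros(len(vocab), dim)
for word, index in vocab.items():
    if word in pretrained:
        matrix[index] = pretrained[word]

embedding_layer = nn.Embedding.from_pretrained(matrix)

cat_vector = embedding_layer(torch.tensor([vocab["cat"]]))[0]
fish_vector = embedding_layer(torch.tensor([vocab["fish"]]))[0]
print(cat_vector)                            # the pre-trained vector for "cat"
print(fish_vector)                           # zeros, since "fish" has no pre-trained vector
print(embedding_layer.weight.requires_grad)  # False: from_pretrained freezes the weights by default
```

Leaving unknown words at zero is just one choice; small random values are another common option.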

You can also load popular pre-trained vector sets such as GloVe or fastText through the torchtext library, for example:

import torchtext

# Load pre-trained GloVe embeddings
glove = torchtext.vocab.GloVe(name='6B', dim=300)
embedding_layer = nn.Embedding.from_pretrained(glove.vectors)

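In some cases when performing transfer learning, you may need to freeze the pre-trained embeddings during the training process, so that they are not updated during the backpropagation step and only the remaining layers (for example the last dense layer) are updated. To do this, set embedding_layer.weight.requires_grad = False to prevent this layer from being updated. Note that nn.Embedding.from_pretrained() already freezes the weights by default; pass freeze=False if you want to fine-tune them.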
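The effect of freezing can be checked directly by looking at the gradients after a backward pass. In this sketch the classifier layer and the token indices are hypothetical stand-ins for a real downstream model and real data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pretrained_embeddings = torch.randn(10, 50)  # stand-in values, not real pre-trained vectors

# freeze=False keeps the embeddings trainable; the default (freeze=True) would freeze them.
embedding_layer = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
classifier = nn.Linear(50, 2)  # hypothetical task-specific dense layer

# Freeze the embeddings by hand.
embedding_layer.weight.requires_grad = False

tokens = torch.tensor([1, 4, 7])
loss = classifier(embedding_layer(tokens)).sum()
loss.backward()

print(embedding_layer.weight.grad)          # None: no gradient was computed for the frozen table
print(classifier.weight.grad is not None)   # True: the dense layer still receives gradients
```

Because the frozen weight never receives a gradient, an optimizer step leaves the pre-trained vectors untouched.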

The rise of Transformers

One other common place where you will find this layer is in the transformer architecture. The nn.Embedding layer is a key component of the transformer, a type of neural network architecture that has been widely used for natural language processing tasks such as language translation, text summarization and question answering, and for building large language models like GPT-3.

In a transformer architecture, the nn.Embedding layer is used to convert the input sequence of tokens (such as words or subwords) into a continuous representation. This is done by looking up the embedding vector for each token in the input sequence in a learned embedding matrix.
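As a small illustration of this step, the sketch below turns a three-word sentence into a batch of embedding vectors. The vocabulary dictionary is a hypothetical toy example; real models use a tokenizer with tens of thousands of entries.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary mapping tokens to integer ids.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 8

embedding = nn.Embedding(len(vocab), d_model)

sentence = ["the", "cat", "sat"]
token_ids = torch.tensor([[vocab[token] for token in sentence]])  # (batch=1, seq_len=3)

vectors = embedding(token_ids)
print(token_ids.shape)  # torch.Size([1, 3])
print(vectors.shape)    # torch.Size([1, 3, 8])
```

The layer simply adds one trailing dimension of size d_model: every integer id is replaced by its row in the embedding matrix.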

The output of the embedding layer is then passed through several layers of multi-head self-attention and feed-forward neural networks, which process the input sequence in a context-aware manner. The self-attention mechanism is the key component of transformers: it allows the model to weigh the importance of each token in the input sequence when making predictions. All of this is built into the nn.Transformer module in PyTorch.

After passing through the transformer layers, the output of the model is typically passed through a final linear layer, which is used to make predictions for the task at hand. For example, in a language translation model, the final linear layer would be used to predict the probability of each word in the target language given the input sequence in the source language. Let’s look at what the Transformer class looks like in Python:

import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers):
        super().__init__()
        # This is our holy embedding layer - the topic of this post
        self.embedding = nn.Embedding(vocab_size, d_model)

        # This is a transformer layer. It contains an encoder and a decoder.
        # num_layers sets the encoder depth; the decoder keeps nn.Transformer's default of 6 layers.
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers=num_layers)

        # This is the final fully connected layer that produces a score (logit) for each word in the vocabulary
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # nn.Transformer needs a source and a target sequence, each shaped (sequence, batch) by default.
        # Pass both through the embedding layer
        src = self.embedding(src)
        tgt = self.embedding(tgt)

        # Pass the embeddings through the transformer layers (NOTE: positional encoding is usually added to the embeddings first. I left it out for simplicity)
        x = self.transformer(src, tgt)
        # Pass the result through the final linear layer
        x = self.fc(x)
        return x

# Initialize the model
vocab_size = 10
d_model = 50
nhead = 2
num_layers = 3
model = Transformer(vocab_size, d_model, nhead, num_layers)

It is important to note that this is a simple example for demonstration purposes, meant to show how the embedding layer is used in Transformers. A real transformer model would typically have additional components such as positional encoding, a technique that provides the model with information about the position of each token in the input sequence. There is also layer normalization, which normalizes the activations of a layer in order to improve the stability and performance of the model; nn.Transformer already applies it inside each encoder and decoder layer. Also, it is common practice to use pre-trained embeddings to initialize the embedding layer, in order to leverage the knowledge learned from large corpora of text data.
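To give an idea of what the missing positional encoding could look like, here is a sketch of the fixed sine/cosine scheme from the original Transformer paper, added element-wise to the embedding output. This is one common choice that I am assuming for illustration, not necessarily what a given model uses, and the helper name sinusoidal_positions is my own.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) table of sine/cosine position codes (d_model must be even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                   # (d_model / 2,)
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table

vocab_size, d_model, seq_len = 10, 50, 6
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 2, 3, 1, 2, 3]])  # (batch=1, seq_len=6)
token_vectors = embedding(token_ids)            # (1, 6, 50)

# Position codes are added element-wise; broadcasting handles the batch dimension.
positions = sinusoidal_positions(seq_len, d_model)
x = token_vectors + positions
print(x.shape)  # torch.Size([1, 6, 50])
```

After the addition, the same token appearing at two different positions (token 1 at positions 0 and 3 here) no longer has identical vectors, which is exactly the information the attention layers would otherwise lack.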

Interesting Facts:

  • The model above, with a vocabulary of only 10 tokens, 50-dimensional embedding vectors, 2 attention heads, 3 encoder layers and 6 decoder layers (the nn.Transformer default, since only the encoder depth is passed in), has 2,018,692 trainable parameters. This is the number of parameters we optimize during the training process. To get this number, I ran the code below:
sum(p.numel() for p in model.parameters() if p.requires_grad)
  • The model dimension, or d_model, must be divisible by the number of heads in the multi-head self-attention mechanism, because the mechanism splits the model dimension into equally sized subspaces of d_model / nhead dimensions each. Each head then computes its own attention weights over all tokens within its subspace.
  • Transformer models have also been used in computer vision tasks such as object detection and image segmentation, by using the self-attention mechanism to process image patches (or pixels) as tokens.
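The divisibility rule from the list above can be checked with nn.MultiheadAttention, the attention building block that PyTorch provides. In this sketch I try a few head counts for d_model = 50. My understanding is that PyTorch rejects an invalid combination with an assertion error, but since that detail may differ between versions, the code catches ValueError as well.

```python
import torch.nn as nn

d_model = 50
results = {}

for nhead in (2, 3, 5):
    try:
        nn.MultiheadAttention(embed_dim=d_model, num_heads=nhead)
        results[nhead] = d_model // nhead  # dimensions available to each head
    except (AssertionError, ValueError):
        results[nhead] = None              # rejected: 50 is not divisible by this head count

for nhead, head_dim in results.items():
    print(nhead, head_dim)
```

With 2 heads each head works in a 25-dimensional subspace and with 5 heads in a 10-dimensional one, while 3 heads cannot split 50 dimensions evenly.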

In conclusion, the nn.Embedding layer is a fundamental building block in many NLP models, and it plays a critical role in the transformer architecture. The nn.Embedding layer is used to convert the input sequence of tokens into a continuous representation that can be effectively processed by the model. The use of pre-trained embeddings allows models to leverage the knowledge learned from large corpora of text data, which can improve their performance on a wide range of natural language processing tasks. The nn.Embedding layer also has several parameters that we did not cover in this post, such as sparse, padding_idx, max_norm and norm_type, that can be used to customize the embedding layer to the specific requirements of the task at hand. Understanding the nn.Embedding layer and how it works is an important step in building effective natural language processing models with PyTorch.
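As a brief taste of one of those extra parameters, the sketch below uses padding_idx, which reserves one index for padding tokens. The choice of index 0 and the tiny dimensions are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Index 0 is reserved for padding: its row starts at zero and receives no gradient.
embedding = nn.Embedding(10, 4, padding_idx=0)

padded_sentence = torch.tensor([[3, 5, 0, 0]])  # two real tokens followed by two pads
output = embedding(padded_sentence)
output.sum().backward()

pad_row = embedding.weight[0].detach()
pad_grad = embedding.weight.grad[0]
token_grad = embedding.weight.grad[3]

print(pad_row)     # all zeros
print(pad_grad)    # all zeros: the padding row is never updated
print(token_grad)  # all ones: a regular row does receive gradients
```

This keeps padded positions from influencing what the model learns about real tokens.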
