Wanting to brush up my PyTorch skills, I’ve started to follow this tutorial. It explains how to create a deep learning model able to predict the origin of a name. At the end of the tutorial, there’s an invitation to try to improve the model. Which I did. Note that the point of the tutorial is not to create the most performant model but rather to demonstrate and explain PyTorch’s capabilities. Here’s a comparison between the model described in the tutorial and the one I’ve built.

Use an advanced gradient descent algorithm to optimize the network’s weights

The code in the tutorial updates the parameters manually. Although this gets the job done, there are more efficient options, so I implemented one of them by using an optimizer.

Add this line before the train() function to define the Adam optimizer. We add L2 regularization (Try to minimize the sum of squared weights as well as the loss) with a $\lambda$ of 2e-5.

optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate, weight_decay=2e-5)

Then modify train() so it looks like this:

def train(category_tensor, line_tensor):

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor.to(device))

    return output, loss.item()

The optimizer.step() line updates all the model’s parameters according to their gradients (computed by loss.backward()). The optimizer.zero_grad() line zero out the gradients. This is mandatory because otherwise PyTorch accumulates the gradients during the training loop.

This approach is better than the one used in the tutorial because it makes the network learn much faster and circumvent the exploding gradients problem by taking into account the general error trend when updating the parameters. Also, it is less error-prone than managing the parameters by hand as everything is handled automatically.

From a RNN to a GRU-based network

RNNs have the perfect architecture to deal with ordered sequences. That’s why they are quite good at language modeling. The structure is simple, though. The current input and the previous state are multiplied or concatenated together. This means there’s no control of whether we should use the previous state. That’s a lack of control, Also, RNNs suffers from the vanishing gradient problem. Gated Recurrent units (GRUs) are much less sensitive to it and offer a fine-grained control over the information to keep or discard at each pass.

As an input to the GRU, I use an embedding layer, which turns each letter into a vector of 256 dimensions. This allows the model to have a rich representation of the letters.

Model definition

The model is very different from the one in the tutorial. The full code is below:

class GRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRU, self).__init__()

        self.char_embed = nn.Embedding(input_size, hidden_size, sparse=False)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.hidden2country = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input):
        embedding = self.char_embed(input)
        lstm_out, _ = self.gru(embedding.view(len(input), 1, -1))
        country = self.hidden2country(lstm_out[-1])

        return self.softmax(country)

The first layer (defined line 5 and used line 13) is the embedding layer. Given a letter ID, it outputs a vector. The vectors are initialized with Xavier’s formula, meaning that the initial variance of the vector will be $\frac{2}{f_{an\_in} + f_{an\_out}}$. $f_{an\_in}$ and $f_{an\_out}$ are the number of inputs and the number of outputs, respectively.

Right after the embedding layer, we use the GRU (defined line 7 and used line 14). As the GRU’s input needs to be in the format (seq length, batch size, input size), we use embedding.view(len(input), 1, -1) to add the batch size dimension. For example, for a name with 6 characters, the embedding layer output will be in the format (6, 256). We turn it into (6, 1, 256) so that the GRU knows there’s only one batch, because we send only one name at a time. As we don’t provide a hidden state to the GRU, it is initialized to zeros.

Last but not least, we grab the final GRU output and map it to the number of classes (the possible name’s origin) line 15. Softmax (line 17) is used to get a probability distribution.

Here is how the model’s graph looks like:

alt text

Input and training loop changes

This models requires several changes to randomTrainingExample() and train(). First, I modify the randomTrainingExample() function so that the input is integer-encoded instead of being one-hot encoded like in the tutorial. The embedding layer can thus convert the input to the vectorized representation:

def random_training_example():
    category = random_choice(all_categories)
    line = random_choice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = prepare_sequence(line)

    return category, line, category_tensor, line_tensor

LETTERS = {letter: index for index, letter in enumerate(all_letters, start=0)}

def alphabet_position(text):
    '''Map the letters to their positions in the alphabet'''

    text = text.lower()
    numbers = [LETTERS[character] for character in text if character in LETTERS]

    return numbers

def prepare_sequence(seq):
    '''Create a integer-encoded name tensor'''
    idxs = alphabet_position(seq)
    return torch.tensor(idxs, dtype=torch.long)

Next, the train() function is updated so that the entire name is passed to the network all at once, instead of sending one letter at a time like in the tutorial:

def train(category_tensor, line_tensor):

    output = rnn(line_tensor.to(device))

    loss = criterion(output, category_tensor.to(device))

    return output, loss.item()



The loos is much better than the one from the tutorial:

alt text

Confusion matrix

Here is the confusion matrix showing the classification accuracy. We can see that it is much better that the one from the tutorial.

alt text


By making the model more sophisticated, we increased its accuracy a lot. The model architecture presented here is suitable for a lot of text classification tasks, although entire sentences or paragraphs classification would benefit from a word embedding rather than the character embedding shown here.