As deep learning models grow, trying to get a finer and finer grasp of reality, the number of parameters composing them increases, making training more and more expensive. Here we delve into Low-Rank Adaptation (LoRA), a method aiming to reduce the dimensionality of the training space within deep learning models during fine-tuning.
The dimensionality dilemma
Training a deep learning model from scratch takes time. Hence the fine-tuning process, where a large (often language) model holds general knowledge that can be adapted to specific downstream tasks. Instead of training a costly deep learning model from scratch for each task, one only has to train a single model from scratch, then fine-tune it for each task. In fine-tuning, small fully-connected layers are added on top of an existing large model, and only the weights of the added fully-connected layers are optimized during training to minimize the downstream task's loss.
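As an illustration of this classic recipe, here is a minimal sketch using a torchvision backbone (the choice of model and the 10-class head are assumptions made purely for the example):

import torch
from torchvision.models import resnet18, ResNet18_Weights

# Freeze the pre-trained large model; its weights are not updated.
backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
for param in backbone.parameters():
    param.requires_grad = False

# Only this small added layer is trained on the downstream task.
head = torch.nn.Linear(1000, 10)  # 1000 ImageNet logits -> 10 task classes
model = torch.nn.Sequential(backbone, head)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)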
The downside is that one still needs the full large model for the downstream task, with weights in a high-dimensional space, which makes the model slow and costly to store, as all of the large model's weights have to be accounted for. There is therefore a compromise between the complexity of the large model (its number of parameters) and the performance of downstream task models. LoRA, for Low-Rank Adaptation, is an attempt at reducing the dimensionality of the weight update during fine-tuning while minimizing the impact on performance. This reduces the memory footprint of the fine-tuned weights and speeds up training by reducing the number of parameters to optimize.
The LoRA idea
LoRA posits that the update needed to tailor the model to a particular prediction task is much smaller than was previously assumed. Therefore, instead of updating all of a fully-connected layer's weights, a pair of small matrices represents how the large model needs to change for fine-tuning. Once those small matrices are optimized for a specific prediction task, the final model consists of the large model's weights plus the product of the two small matrices.
More formally, let $W_0 \in \mathbb{R}^{d \times k}$ be the original weight matrix of size $(d,\ k)$. Regular fine-tuning would use gradient descent to find an update $\Delta W$ that accounts for the prediction task:
$$W = W_0 + \Delta W$$
Therefore, the weight update $\Delta W$ has $d \times k$ entries, which takes a lot of resources to compute and store. LoRA's idea is to replace this update with the product of two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$, so that the $A$ and $B$ matrices are much smaller than $W$, thus reducing the number of parameters to optimize. The fine-tuning process becomes:
$$W = W_0 + \frac{\alpha}{r} BA$$
with $\frac{\alpha}{r}$ a scaling factor for the LoRA output.
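To see the savings concretely (the dimensions below are chosen purely for illustration), consider a square weight matrix with $d = k = 4096$ and a rank of $r = 8$:
$$\underbrace{d \cdot k}_{\text{full } \Delta W} = 4096 \cdot 4096 \approx 16.8 \times 10^6 \qquad \text{vs.} \qquad \underbrace{r(d + k)}_{B \text{ and } A} = 8 \cdot 8192 = 65{,}536$$
i.e., 256 times fewer trainable parameters.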
Because $BA$ is of dimension $(d, k)$, just like $W$, storing the fine-tuned model's weights does not require additional storage or memory resources, and recovering $W_0$ only requires computing $W - \frac{\alpha}{r}BA$.
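To make the mechanics concrete, here is a minimal sketch of such a layer as a plain PyTorch module (the class name and initialization scheme are illustrative, not a library API; following the LoRA paper, $B$ starts at zero so that $BA = 0$ and the model is unchanged at the start of fine-tuning):

import torch

class LoRASketch(torch.nn.Module):
    """Illustrative only: a frozen weight W0 plus a trainable low-rank update."""
    def __init__(self, d: int, k: int, r: int, alpha: float):
        super().__init__()
        # Frozen pre-trained weight (stands in for the large model's layer).
        self.W0 = torch.nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Trainable low-rank factors: B is zero at initialization, so the
        # fine-tuned model starts out identical to the pre-trained one.
        self.B = torch.nn.Parameter(torch.zeros(d, r))
        self.A = torch.nn.Parameter(torch.randn(r, k))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight: W = W0 + (alpha / r) * B @ A, of shape (d, k).
        W = self.W0 + self.scaling * (self.B @ self.A)
        return x @ W  # x: (batch, d) -> output: (batch, k)

After training, $W_0 + \frac{\alpha}{r}BA$ can be merged into a single $(d, k)$ matrix, so inference costs exactly the same as with the original layer.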
Usage in PyTorch
PyTorch implements a LoRA layer as a regular fully-connected layer replacement. As LoRA is used for fine-tuning, it is available in torchtune, in the LoRALinear class. To use LoRALinear, one needs to add it at the end of a more general model and set the general model's weights as frozen (so they are not updated during gradient descent). A sample code is provided below:
 1  import torch
 2  import torch.nn.functional as F
 3  from torchtune.modules.peft import LoRALinear
 4  from torchvision.models import resnet18, ResNet18_Weights
 5
 6  class Net(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.large_model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
10          for param in self.large_model.parameters():
11              param.requires_grad = False
12          self.fine_tuning = LoRALinear(
13              in_dim=list(self.large_model.children())[-1].out_features,
14              out_dim=10,
15              rank=3,
16              alpha=1e-2,
17          )
18
19      def forward(self, x) -> torch.Tensor:
20          out = self.large_model(x)
21          return F.log_softmax(self.fine_tuning(out), dim=1)
In this example, we load a (fairly) large ResNet model with its default weights on line 9. Then, on lines 10-11, its parameters are frozen, and on lines 12-17 we define the LoRA layer. The input dimension corresponds to the size of the large model's output, the rank parameter corresponds to $r$, and alpha to $\alpha$. In the forward pass, the large model's output out is passed to the LoRA layer; the layer's $A$ and $B$ matrices will be updated by gradient descent and can be used to compute the fine-tuned model's weights.
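Continuing from the sample above, one might then train this model as follows (the batch shapes and hyper-parameters are illustrative assumptions); only the parameters that still require gradients are handed to the optimizer:

model = Net()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(4, 3, 224, 224)        # a dummy batch of images
target = torch.randint(0, 10, (4,))    # dummy labels for 10 classes
loss = F.nll_loss(model(x), target)    # NLL loss pairs with log_softmax
loss.backward()                        # gradients flow only into the LoRA layer
optimizer.step()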
LoRA (and the research that followed) therefore provides an efficient way to adapt a large model to a specific prediction task.