How do you teach a human to get better at a task? By accurately describing the problem to be solved, teaching them what success looks like, and then letting them practice. Teaching AI is not so dissimilar. Prompts, examples, and training can help fine-tune an off-the-shelf large language model into a specialized tool you can build a business on.
My aspiration for this review of AI fine-tuning methods is to help those with some experience with LLMs sharpen their perspective on which strategy fits their needs. Much of the material is based on a talk that Contextual AI’s Yaroslav Bulatov and stealth startup founder Cosmin Negruseri presented at the deep learning SF meetup.
Big models can make a lot of headway using zero-shot and few-shot prompt engineering, but eventually you’ll need fine-tuning to reach optimal performance, especially for concrete tasks with plenty of available data.
For small models, prompt engineering is less effective, and fine-tuning is a better initial approach.
Fine-tuning for LLMs is becoming more accessible with tools like LoRA, which makes training efficient while achieving results similar to end-to-end fine-tuning.
Naively extracting data out of a stronger model, as Alpaca did, seems to copy only style; with considerably more effort, as in Orca, you can get real quality improvements, not just style.
Direct preference optimization from Stanford is emerging as an alternative to reinforcement learning from human feedback (RLHF), which is brittle, requires hyperparameter tuning, and is difficult to get working.
A giant leap for AI
A little over a decade ago, NASA marked the 135th and final mission of the American Space Shuttle program. I was in the rainy north of Germany in undergrad, researching echo state neural networks for handwriting recognition. You can read my thesis if you want to see how far the field has come. Back then, I was dependent on MATLAB, which was a painful piece of software to use. No one would have considered this work on neural networks cool.
Fast forward, and in places like San Francisco you can’t escape people talking about LLMs. An entire culture is forming around the technology, yet many excited about its potential are just starting to learn to develop with it. Harnessing this passion is a big opportunity for the tech community to fold in more builders and give a wider range of people a hand in shaping tomorrow.
LLMs are generic, and in order to make them useful to your particular domain, they must be specialized. Besides the knowledge required to specialize these LLMs, there is also compute cost. While there is no magic button to push and 10x the supply or power of this physical hardware, there are ways to lower the cost. I am personally optimistic that we will get to affordability: it now costs about $10 to train a human-level vision model, where it used to cost more than $100K. In this article I’m summarizing, in plain English, different methods for fine-tuning, and I’m highlighting one method that is both cheap and elegant: LoRA.
Specializing LLMs: Fine-tuning vs. prompt engineering
Fine-tuning an LLM is like hiring it as an employee. You have a particular job to get done and the requisite experience to know how, but need help to accomplish the task at scale. So first, you need to onboard and train your new employee. Tesla senior director of AI Andrej Karpathy laid out a useful analogy for how knowledge transfer to AI is similar to upskilling humans. Essentially:
Zero-shot prompting = Describing a task in words
Few-shot prompting = Giving examples of how to solve the task
Fine-tuning = Allowing a person to practice the task over and over
Another way of thinking about LLM specialization is according to the level of access to a model, from low visibility but low effort black box to high visibility but high effort white box:
Black box: You are restricted to interacting only with the model's API, as with ChatGPT. Your knowledge is confined solely to the outputs produced by the model, with no visibility into its internal mechanisms.
Gray box: You are partially restricted. For example, when dealing with the GPT-3 API, you might be privy to details such as token generation probabilities. This information serves as a valuable compass, allowing you to formulate and refine appropriate prompts that can extract domain-specific knowledge from the model.
White box: You have comprehensive access. Here, you enjoy complete authorization to the language model, its parameter configuration, training data, and architectural design.
Once you understand the spectrum from limited external interaction to intricate insight, the next question is which degree of access is the best fit for your use case.
Prompt engineering can help you make substantial gains in a gray box scenario, particularly with big base models. In fact, many startups today use prompting as a way to quickly get value and raise a first round of funding. For me personally, as an early-stage investor, that can be a red flag. I agree with Karpathy: “I also expect that reaching top-tier performance will include fine-tuning, especially in applications with concrete, well-defined tasks where it is possible to collect a lot of data and ‘practice’ on it.”
Fine-tuning is the most complex way to specialize an LLM. You can get some wins with supervised fine-tuning. The hardest part might be RLHF, which brought us ChatGPT. At his State of GPT talk, Karpathy strongly advised builders not to attempt RLHF themselves; so far, mostly OpenAI has gotten it working in production. It took Google quite a while to RLHF its models, as it already had big models that weren’t necessarily safeguarded. Llama 2 from Meta also uses RLHF.
There are also new approaches emerging, including direct preference optimization (DPO) from Stanford. Instead of reinforcement learning, you use supervised learning, because reinforcement learning can be brittle, requires hyperparameter tuning, and takes significant labor to make work properly.
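To make the contrast concrete, here is a minimal sketch of the DPO objective in PyTorch. The function name and toy numbers are my own illustration, not the paper's reference implementation; the inputs are summed log-probabilities of whole responses under the trainable policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of a response
    under the policy or the frozen reference model. beta controls how far
    the policy may drift from the reference.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the preferred and the rejected response.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()

# Toy batch: the policy already slightly prefers the chosen answers.
loss = dpo_loss(torch.tensor([-2.0, -1.5]), torch.tensor([-3.0, -2.5]),
                torch.tensor([-2.5, -2.0]), torch.tensor([-2.5, -2.0]))
```

Because this is plain supervised learning (a classification-style loss on preference pairs), there is no reward model, no sampling loop, and far less to tune than in RLHF.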
Can fine-tuning help with new task behavior?
If a big model doesn’t already have some latent capability, fine-tuning won’t save you, according to one of the Kaggle Grandmasters from H2O.ai. This was somewhat depressing, because it would mean only the big companies can play in LLMs.
However, others disagree and say that if you push the models, you can add some capabilities. One interesting piece of evidence in this direction is the Goat paper. GPT-4 struggles with arithmetic: on addition, subtraction, or multiplication of numbers with many digits, it makes quite a few mistakes. Yet it seems you can fine-tune smaller models to multiply and add large numbers correctly. With fine-tuning, you can instill some capabilities that are not there in the base models.
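One reason arithmetic is a friendly target for fine-tuning is that training data can be generated programmatically at any scale. Here is a hedged sketch of such a generator in Python; the helper name and instruction format are illustrative, not the Goat paper's actual pipeline.

```python
import random

def make_arithmetic_examples(n, digits=8, seed=0):
    """Generate instruction-tuning pairs for multi-digit arithmetic.

    Returns a list of {"instruction", "output"} dicts with exact answers,
    suitable as synthetic supervised fine-tuning data.
    """
    rng = random.Random(seed)
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    examples = []
    for _ in range(n):
        op = rng.choice(sorted(ops))
        a = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
        examples.append({"instruction": f"What is {a} {op} {b}?",
                         "output": str(ops[op](a, b))})
    return examples

data = make_arithmetic_examples(3)
```

Since the labels are computed, not scraped, the dataset is noise-free and effectively unlimited, which is exactly the regime where "practicing" a concrete task via fine-tuning shines.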
Classical fine-tuning approaches
Within fine-tuning there are three classic strategies, ranging from a faster but less performant feature-based method to slower but more performant full fine-tuning.
Feature-based approach: This strategy involves keeping a common trunk and just using the last layer activations as an embedding (a representation of the input). It's called transfer learning and was used heavily with ImageNet-trained trunks. You could train a model on ImageNet, use that frozen model for new pictures, and then get the activations in the last layer.
For example, let’s say you wanted to classify your own photos as “hot dog” or “not hot dog” like in HBO’s Silicon Valley. You could use the final layer representation and build on top of that an SVM or linear layer classifier that, based on your small restricted data set, could guess if your picture has a hot dog or doesn’t have a hot dog. This was a good way to train things fast, with few parameters on small data sets and using all the representation power that comes from training a big convolutional network on ImageNet.
Fine-tuning I: Instead of just using the last layer, you freeze most of the trunk of the network and fine-tune the last few layers (or add a few new ones), updating only those parameters for your task. This is slightly slower but more performant.
Fine-tuning II: Here you update all the layers to get the best performance. However, it’s the most expensive approach, because you back-propagate through the entire model, whereas the other approaches back-propagate through only a few parameters.
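The three strategies differ only in which parameters are allowed to train. A minimal PyTorch sketch, where the trunk is a random stand-in for a pretrained network (in practice you would load real weights, e.g. an ImageNet model) and `set_strategy` is an illustrative helper, not a library API:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained trunk; trunk[3] is its final linear layer.
trunk = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU())
head = nn.Linear(512, 2)  # e.g. "hot dog" vs "not hot dog"

def set_strategy(strategy):
    """feature_based: freeze the whole trunk (train only the head).
    last_layers: unfreeze just the final trunk layer (Fine-tuning I).
    full: unfreeze everything (Fine-tuning II)."""
    for p in trunk.parameters():
        p.requires_grad = strategy == "full"
    if strategy == "last_layers":
        for p in trunk[3].parameters():
            p.requires_grad = True

set_strategy("feature_based")
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(head(trunk(images)), labels)
loss.backward()  # gradients reach only the head, not the frozen trunk
```

The cost difference follows directly: in the feature-based setting back-propagation touches only the head's parameters, while "full" pays for gradients through every layer.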
You can approach LLM fine-tuning in a similar fashion. You can use a BERT encoder, put an MLP (multilayer perceptron) on top of the CLS token from the last layer, and freeze the encoder’s layers. Then you just need to learn how to map the CLS token to your classification task, such as sentiment detection. Alternatively, you could back-propagate through the last few layers of BERT, or through the entire BERT or transformer model.
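A rough sketch of the frozen-encoder setup in PyTorch, using a small random transformer as a stand-in for BERT (in practice you would load a pretrained checkpoint; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained BERT encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
for p in encoder.parameters():
    p.requires_grad = False  # freeze the whole trunk

# MLP classification head on the [CLS] token (position 0).
head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

tokens = torch.randn(4, 16, 64)   # (batch, seq_len, hidden) embeddings
cls = encoder(tokens)[:, 0, :]    # representation of the CLS position
logits = head(cls)                # e.g. sentiment: positive / negative
```

Unfreezing the last few encoder layers, or all of them, turns this into the Fine-tuning I or II regime at correspondingly higher cost.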
One method of making fine-tuning more efficient is to use adapters. Instead of expensively training through the entire network, you have more control over what you want to fine-tune: you insert, between the standard layers of the transformer, a set of smaller layers that adapt to the current task. For example, to fine-tune a translation model or a sentiment detection model for a separate language, you’d add some parameters that are specific to that language.
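A bottleneck adapter can be sketched in a few lines of PyTorch. The class below is illustrative, not a particular library's implementation; the zero-initialized up-projection is a common trick so the adapter starts as an identity mapping and cannot hurt the pretrained model at step zero.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. Only these small layers are trained."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter()
x = torch.randn(2, 10, 768)
out = adapter(x)  # same shape as x; initially identical to x
```

With hidden size 768 and bottleneck 64, each adapter adds roughly 100K parameters, tiny next to the tens of millions per transformer layer it sits between.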
Another approach is prefix fine-tuning. You still train end-to-end, but you keep the transformer itself frozen and instead learn a few parameters that act as the first few tokens of the input. You concatenate this learned prefix to the original input for your fine-tuning task, and the model learns to condition its output on it. Tuning this hard-coded prefix steers the output toward solving your final task. With roughly 0.1% of the parameters, just the embeddings you prepend to your input, you can make the model adapt.
Prefix fine-tuning is powerful because you don’t change the network at all; you merely add some embeddings. While LoRA and QLoRA have become more popular, prefix fine-tuning is a third efficient method, and arguably the most elegant. All of these methods are relatively simple because they do a bit of model surgery instead of changing everything.
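A prompt-tuning-style sketch in PyTorch, assuming the frozen model accepts input embeddings directly; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learned prefix embeddings prepended to the input sequence;
    the frozen model then conditions on them."""
    def __init__(self, n_prefix=10, hidden=768):
        super().__init__()
        # The only trainable parameters: n_prefix "virtual tokens".
        self.prefix = nn.Parameter(torch.randn(n_prefix, hidden) * 0.02)

    def forward(self, input_embeds):  # (batch, seq, hidden)
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

prompt = SoftPrompt()
x = torch.randn(4, 20, 768)
extended = prompt(x)  # (4, 30, 768); only `prompt.prefix` ever trains
```

Here just 10 × 768 = 7,680 parameters are trained, which is what makes the "0.1% of the parameters" figure plausible for large models.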
Fine-tuning with Low-Rank Adaptation (LoRA)
One final fine-tuning concept to understand is Low-Rank Adaptation, or LoRA. The basic idea is that you have a network that’s already trained and you want to train it on a new task. You could update all the parameters, but that’s quite expensive. If you have 65 billion parameters that you need to change every time, that’s 150GB, and accessing that memory takes energy, causing your GPUs to heat up.
Instead, you could insert extra adapter layers between the existing layers and train only their parameters, or add a side matrix. LoRA takes the latter idea further: it represents the weight update as the product of two small rank-decomposition matrices, which significantly reduces the number of trainable parameters. You can often perform this kind of parameter-efficient fine-tuning with a single GPU and avoid the need for a distributed cluster of GPUs. Since the rank-decomposition matrices are small, you can fine-tune a different set for each task and then switch them out at inference time by updating the weights.
With this approach, LoRA is able to get the same accuracy as end-to-end fine-tuning but is much cheaper in training since you only update a small set of parameters.
In practice, tooling such as Hugging Face’s PEFT library keeps track of the original model and knows how to merge the adapter weights back in. To make inference fast, you can multiply the A and B matrices into one bigger matrix that you know is low rank and simply add it to the W matrix, so at inference you compute with just the modified W matrix. Model management also becomes much easier: you can fine-tune on multiple tasks and swap them in and out. It’s a very lightweight way to adapt a big model to lots of tasks.
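A minimal LoRA layer in PyTorch, showing both the low-rank update and the merge trick; this class is a sketch of the formulation, not the PEFT implementation, and the rank and scaling values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update B @ A,
    scaled by alpha / r."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Pretrained weight: kept frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Low-rank factors: A is small-random, B is zero so the
        # layer initially behaves exactly like the base model.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold B @ A into W for fast inference: one matmul, no extras."""
        return self.weight + self.scale * (self.B @ self.A)

layer = LoRALinear(512, 512)
x = torch.randn(2, 512)
merged_out = x @ layer.merge().T  # equals layer(x) by construction
```

For a 512 × 512 layer at rank 8, the trainable side matrices hold 2 × 8 × 512 = 8,192 parameters versus 262,144 frozen ones, which is why the per-task artifacts are megabytes rather than gigabytes.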
To give a hypothetical example, you can adapt the model to do sentiment detection, summarization of legal data, or speech in a given dialect. You fine-tune for each of these and keep just a few matrices on the side. That way you store megabytes, not gigabytes, and at serving time you can switch to a new task by quickly adding the corresponding matrix into the feedforward network. This lets you serve multiple models from the same machine, in the same space. The downside of LoRA is that running several different tasks in the same batch won’t work.
Fine-tuning open source models by using high-quality proprietary models
A fascinating discovery was recently made at Stanford: you can also use LLMs to train LLMs. After Meta (then Facebook) launched LLaMA in February 2023, a group at Stanford developed Alpaca. They used an OpenAI model to generate 52K instruction-following examples and then quickly fine-tuned LLaMA on them to create a new model with the potential to catch up to OpenAI. Soon after the Alpaca release came the famous “We have no moat” blog post from a Google engineer. It appeared that open source models could get as good as OpenAI’s GPT or Google’s internal models.
A few weeks later, a team from Berkeley followed up with the paper “The False Promise of Imitating Proprietary LLMs,” which finds that Alpaca and similar approaches mostly imitate the style of ChatGPT: when you examine factuality, their answers weren’t as good as ChatGPT’s, even when they sounded right.
Microsoft Research’s Orca also gets data from GPT-4, but its examples are more detailed, with step-by-step explanations. If you make the effort to generate much better data, you can get closer to replicating the main model. Naively extracting data out of the model seems to copy style alone, but if you put in enough work, you can actually get some quality improvements.
This was a quick review of fine-tuning methods. Thanks to Yaroslav and Cosmin for giving the talk that inspired this article and for reading drafts of my blog. If you enjoyed it please leave a comment and share. For questions or suggestions about the next blog post, feel free to email me at email@example.com.