Large language models (LLMs) are becoming an increasingly requisite component of modern applications. Their text generation capability has crossed a threshold: they can now perform many well-defined NLP tasks as well as human workers can. Many applications built over the past year, in 2022-2023, leveraged closed-source, privately hosted LLMs. Many of the best, largest state-of-the-art models are closed-source and controlled by private companies such as OpenAI, Cohere, Adept, Anthropic, and Google.
There is also rapid progress in the open-source community. As it stands today, open-source models are unlikely to match the capabilities of private LLMs, given the gap in training data, compute budgets, and engineering resources. However, open-source models could be sufficiently powerful for most applications, and they have advantages of their own. For example, developers can choose a specific model size to match the application's requirements and reduce wasted compute, and the open models can be further modified and trained for specific domains or NLP tasks.
In this post, I am sharing my thoughts on how to get started on choosing an open-source LLM.
Ingredients of an LLM
I am going to describe the key factors that differentiate LLMs. The best way to gain a decent understanding of LLMs is to read a few of the classic papers on the topic. I would recommend the original transformer paper [10], BERT [2], UL2 [8], and the InstructGPT paper [4].
The first thing to notice about an LLM is its model architecture. All the recently published, relevant LLMs are transformer-based; that is, the key building block is the transformer. Here are two excellent tutorials that describe this building block: 1 and 2. A transformer could be loosely understood as building an n-by-n attention matrix, with the size n equal to the input length. The matrix tells the model how to represent each token based on the entire context. The full model has many layers, with each layer having multiple attention heads of these transformer units. One could perceive each unit as a unique logical way of evaluating the input: for example, one unit evaluates the input's language structure, another evaluates its historical context, and so on.
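As a toy illustration, the sketch below computes a single attention head over a short random input; the shape of the weight matrix is the n-by-n structure described above. PyTorch is used here for convenience, and the projection matrices are random rather than learned.

```python
import math
import torch

n, d = 8, 16                       # sequence length, model dimension
x = torch.randn(n, d)              # token representations for one sequence
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))  # projections (random here, learned in practice)

q, k, v = x @ w_q, x @ w_k, x @ w_v
scores = q @ k.T / math.sqrt(d)    # (n, n): how much each token attends to every other token
weights = torch.softmax(scores, dim=-1)
contextualized = weights @ v       # each row is a context-aware token representation
print(weights.shape)               # torch.Size([8, 8])
```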
There are two major variants of transformer-based LLMs in use: decoder-only and encoder-decoder. Encoder-only models can be subsumed by the encoder-decoder variant, because the decoder can be discarded for specific downstream tasks. From the perspective of LLM users, however, we don't have to worry too much about the pros and cons of the different variants. The key distinction is that a decoder-only model concatenates inputs and targets and processes them as a single sequence, which can make decoder-only models less appropriate for applications that require text embeddings.
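As a rough sketch, assuming the Hugging Face transformers library (the post does not prescribe a toolkit), the two variants load as follows, and the encoder half of an encoder-decoder model can be reused on its own for a crude text embedding:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder: T5-style models.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Decoder-only: GPT-style models concatenate input and target into one sequence.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")

# The encoder half of an encoder-decoder model can be used alone,
# e.g. mean-pooling its hidden states into a simple text embedding.
inputs = t5_tok("an example sentence", return_tensors="pt")
hidden = t5.encoder(**inputs).last_hidden_state   # (1, seq_len, d_model)
embedding = hidden.mean(dim=1)                    # (1, d_model)
```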
An LLM's size is measured by its number of learned parameters. Many models are released in multiple sizes, originally trained for ablation experiments. The largest model is the most capable, but it can also be expensive to deploy; the biggest is not always the best. For example, a medium-size INSTRUCTOR model might be sufficient for a semantic search application. We should choose a model size based on the trade-off between cost, computation, and model performance.
Another LLM feature is the size of its context window. One of the key limitations of transformers is their quadratic memory and computation requirement with respect to input length, which limits the input length and thus the context window. There are transformer variants, such as Longformer, ETC, and LongT5, that scale linearly with input length. However, there aren't yet any powerful, truly large LMs fully trained on those architectures. If a long context window is a requirement for your application, you might have to adapt your model and train it from scratch. Pretraining from randomly initialized weights is very expensive, however, and requires a substantial amount of engineering know-how.
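A back-of-envelope sketch of the quadratic growth, with layer and head counts that are illustrative assumptions rather than any particular model's configuration:

```python
# Rough sketch of how attention memory grows with input length.
# Layer/head counts below are illustrative, not a specific model's config.
def attention_matrix_bytes(seq_len, n_layers=24, n_heads=16, bytes_per_value=2):
    """Memory for the attention score matrices alone (fp16), ignoring other activations."""
    return seq_len ** 2 * n_heads * n_layers * bytes_per_value

for n in (512, 2048, 8192):
    print(f"{n:>5} tokens -> {attention_matrix_bytes(n) / 1e9:.2f} GB of attention scores")
# 4x longer input => roughly 16x more memory for the score matrices.
```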
Another key ingredient is the pretraining objective. An LLM gets its reasoning capabilities and knowledge by processing massive amounts of text. The key idea is to hide some part of the known text in each sample and ask the model to predict the missing span. The objective scores the prediction, and the score is used to compute parameter gradients. There are different strategies for generating the masked text: the most common objectives are left-to-right span corruption (e.g., next-token prediction), prefix + span corruption, random span corruption, or some combination of these. As practitioners, we care more about the capabilities of the trained model than about the exact training objective. However, if we need to adapt a pretrained model to another domain, we have to program the objective so the model can process the additional corpus.
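A toy sketch of random span corruption in the style of T5, purely illustrative of the idea of hiding spans behind sentinel tokens and asking the model to reconstruct them:

```python
import random

def corrupt_spans(tokens, n_spans=2, span_len=2):
    """Toy T5-style span corruption: replace short spans with sentinel tokens
    (model input) and collect the hidden spans as the prediction target."""
    tokens = list(tokens)
    # Sample non-overlapping span starts (step keeps spans from colliding).
    starts = sorted(random.sample(range(0, len(tokens) - span_len, 2 * span_len), n_spans))
    inputs, targets, cursor = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + span_len]
        cursor = start + span_len
    inputs += tokens[cursor:]
    return inputs, targets

print(corrupt_spans("the quick brown fox jumps over the lazy dog".split()))
```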
Another ingredient is the training data. Pretraining data is unlabelled text; fine-tuning data is a labelled dataset and is much smaller than the pretraining corpus. It is important to understand what data were used to produce the pretrained checkpoints. This tells us in which domains the model is likely to perform well, what knowledge its parameters may contain, and how to improve it further.
Another ingredient is the input representation, which is learned from the text corpus. We need to consider whether the application domain shares a similar vocabulary. This is usually not a problem for LLMs trained on a large, sufficiently diverse corpus.
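One quick, informal check, assuming a Hugging Face tokenizer, is to see how badly domain terms fragment into subword pieces; heavy fragmentation suggests the learned vocabulary is a poor fit for the domain:

```python
from transformers import AutoTokenizer

# Inspect how a model's learned vocabulary handles domain-specific terms.
tok = AutoTokenizer.from_pretrained("t5-small")
for term in ["transformer", "pharmacokinetics", "immunohistochemistry"]:
    print(term, "->", tok.tokenize(term))
```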
Lastly, we have to consider whether the model was further trained with reinforcement learning against a reward model. The technique of further training a fine-tuned model to produce more human-preferred outputs is known as reinforcement learning from human feedback (RLHF). It is different from fine-tuning with human-labeled data: RLHF uses the labeled data to train a completely independent reward model, which is then used to evaluate outputs generated by the fine-tuned model. See [6] for more details. This step has been shown to limit hallucination and to optimize outputs for human preferences. However, it also restricts the model's variance and makes it more likely to generate similar, mundane outputs. RLHF-trained models are hard to further fine-tune or modify for specific tasks.
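Conceptually, the reward-model step looks like the sketch below: a separately trained scorer ranks candidate outputs by predicted human preference. The checkpoint name is hypothetical, and the sketch omits the actual reinforcement learning loop:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical reward model checkpoint: any sequence-regression model trained
# on human preference data could play this role.
tok = AutoTokenizer.from_pretrained("some-org/reward-model")
rm = AutoModelForSequenceClassification.from_pretrained("some-org/reward-model", num_labels=1)

prompt = "Explain photosynthesis to a child."
candidates = ["Plants eat sunlight to make food.", "Photosynthesis. Next question."]
scores = []
with torch.no_grad():
    for text in candidates:
        inputs = tok(prompt, text, return_tensors="pt")
        scores.append(rm(**inputs).logits.item())   # higher score = more preferred
best = candidates[scores.index(max(scores))]
```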
Start with a Pretrained LLM
The practical way to use an open-source LLM is to choose a pretrained checkpoint from a well-known model family. There are many overviews and surveys of LLMs; for example, [11] provides a good overview of the history and the latest models. As of the writing of this blog, I would consider one of these as a starting point: T5 [5], LongT5 [3], Flan-T5, Flan-UL2, Pythia [1], Dolly, INSTRUCTOR [7], LLaMA [9], and Falcon.
I would also consult the various well-known benchmark leaderboards for LLMs and specific NLP tasks, for example the Open LLM Leaderboard, MTEB, HELM, and Chatbot Arena.
There is no one-size-fits-all, step-by-step guide to choosing a foundation model. I would start by understanding my application's requirements, beginning with its expected inputs and outputs. For example: what are the typical lengths of the application's queries, do queries require supplementary context, are the outputs long-form answers? These questions would guide me to a foundation model that best matches the characteristics of the application. I would also consider the desired NLP tasks and each model's strengths and weaknesses. For example, depending on whether it is a text search application or an AI assistant, I would choose an encoder or a decoder-only LLM.
Customizing Models
Once I have chosen a pretrained checkpoint, I would consider how to further modify the model to fit my application. I could modify the last layers to target a classification task. I could modify the token search algorithm to sample from the output distribution instead of always selecting the token with the highest probability. I could discard the decoder component and use only the encoder to generate text embeddings.
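For instance, swapping greedy decoding for sampling is a small change if the model is served through the Hugging Face generate API (an assumption; other toolkits expose similar knobs):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

inputs = tok("Write a tagline for a coffee shop.", return_tensors="pt")
greedy = model.generate(**inputs, max_new_tokens=30)              # always pick the argmax token
sampled = model.generate(**inputs, max_new_tokens=30,
                         do_sample=True, top_p=0.9, temperature=0.8)  # draw from the distribution
print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```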
If the application domain is very different from the text corpus used for pretraining, it could be appropriate to train the model from scratch. This might be viable for small models, but for models that cross the billion-parameter threshold, both the GPU compute and the engineering resources would be prohibitively expensive for most small teams.
An LLM could be further customized by fine-tuning on high-quality datasets, e.g. dolly-15k or FLAN, that were not used to train the model. I could fine-tune the model on private data. I could set up a reinforcement learning loop to steer the model toward more human-preferred outputs.1 I could also continue training the model on its pretraining objective over a private corpus for domain adaptation; the model could acquire additional knowledge from processing large amounts of domain-specific text, allowing it to answer queries with information that is not explicitly included in the context window.
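As a minimal sketch of the first option, instruction fine-tuning a small encoder-decoder checkpoint on dolly-15k might look like this; the hyperparameters are illustrative, not a recipe, and the field names follow the databricks/databricks-dolly-15k dataset:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
data = load_dataset("databricks/databricks-dolly-15k", split="train")

def preprocess(example):
    # Map instruction -> response; a real pipeline would also fold in the context field.
    model_inputs = tok(example["instruction"], truncation=True, max_length=512)
    model_inputs["labels"] = tok(example["response"], truncation=True, max_length=256)["input_ids"]
    return model_inputs

tokenized = data.map(preprocess, remove_columns=data.column_names)
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="dolly-t5", per_device_train_batch_size=8,
                                  num_train_epochs=1, learning_rate=3e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```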
Model Deployment
Training LLMs is very expensive, but even model inference is not cheap. Every prediction requires a forward pass, and that pass needs to touch every parameter: for a 100-billion-parameter model, that is on the order of 100 billion floating point operations at a minimum. Generating a sentence takes as many forward passes as there are tokens in the sentence. For latency reasons, the parameters need to be pre-loaded into memory, which is likely in the tens of gigabytes. While the operations could be performed on the CPU, the matrix computation should ideally run on a GPU. Even for a moderate-size LLM, the inference server needs a GPU with tens of gigabytes of memory.2
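A rough, order-of-magnitude sketch of these costs for a hypothetical 100-billion-parameter model:

```python
# Back-of-envelope inference cost for a hypothetical model size.
params = 100e9                 # 100B parameters
bytes_per_param = 2            # fp16/bf16 weights
weight_memory_gb = params * bytes_per_param / 1e9
flops_per_token = 2 * params   # roughly one multiply-add per parameter per generated token
tokens = 50                    # a short generated response

print(f"weights in memory : ~{weight_memory_gb:.0f} GB")
print(f"compute per reply : ~{flops_per_token * tokens / 1e12:.0f} TFLOPs")
```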
Footnotes
- This approach could be brittle and should only be attempted when there are sufficient resources to collect human feedback, set up the experiment, and evaluate the model. ↩
- I am not going to discuss model sharding in this post. It might be a topic for a future post. ↩
Citations
- [10] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Lukasz, and Polosukhin, Illia. Attention is all you need. 2017. arXiv:1706.03762.
- [2] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. BERT: pre-training of deep bidirectional transformers for language understanding. 2019. arXiv:1810.04805.
- [8] Tay, Yi, Dehghani, Mostafa, Tran, Vinh Q., Garcia, Xavier, Wei, Jason, Wang, Xuezhi, Chung, Hyung Won, Shakeri, Siamak, Bahri, Dara, Schuster, Tal, Zheng, Huaixiu Steven, Zhou, Denny, Houlsby, Neil, and Metzler, Donald. UL2: unifying language learning paradigms. 2023. arXiv:2205.05131.
- [4] Ouyang, Long, Wu, Jeff, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll L., Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, Schulman, John, Hilton, Jacob, Kelton, Fraser, Miller, Luke, Simens, Maddie, Askell, Amanda, Welinder, Peter, Christiano, Paul, Leike, Jan, and Lowe, Ryan. Training language models to follow instructions with human feedback. 2022. arXiv:2203.02155.
- [6] Stiennon, Nisan, Ouyang, Long, Wu, Jeff, Ziegler, Daniel M., Lowe, Ryan, Voss, Chelsea, Radford, Alec, Amodei, Dario, and Christiano, Paul. Learning to summarize from human feedback. 2022. arXiv:2009.01325.
- [11] Yang, Jingfeng, Jin, Hongye, Tang, Ruixiang, Han, Xiaotian, Feng, Qizhang, Jiang, Haoming, Yin, Bing, and Hu, Xia. Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. 2023. arXiv:2304.13712.
- [5] Raffel, Colin, Shazeer, Noam, Roberts, Adam, Lee, Katherine, Narang, Sharan, Matena, Michael, Zhou, Yanqi, Li, Wei, and Liu, Peter J. Exploring the limits of transfer learning with a unified text-to-text transformer. 2020. arXiv:1910.10683.
- [3] Guo, Mandy, Ainslie, Joshua, Uthus, David, Ontanon, Santiago, Ni, Jianmo, Sung, Yun-Hsuan, and Yang, Yinfei. LongT5: efficient text-to-text transformer for long sequences. 2022. arXiv:2112.07916.
- [1] Biderman, Stella, Schoelkopf, Hailey, Anthony, Quentin, Bradley, Herbie, O'Brien, Kyle, Hallahan, Eric, Khan, Mohammad Aflah, Purohit, Shivanshu, Prashanth, USVSN Sai, Raff, Edward, Skowron, Aviya, Sutawika, Lintang, and van der Wal, Oskar. Pythia: a suite for analyzing large language models across training and scaling. 2023. arXiv:2304.01373.
- [7] Su, Hongjin, Shi, Weijia, Kasai, Jungo, Wang, Yizhong, Hu, Yushi, Ostendorf, Mari, Yih, Wen-tau, Smith, Noah A., Zettlemoyer, Luke, and Yu, Tao. One embedder, any task: instruction-finetuned text embeddings. 2023. arXiv:2212.09741.
- [9] Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, Rozière, Baptiste, Goyal, Naman, Hambro, Eric, Azhar, Faisal, Rodriguez, Aurelien, Joulin, Armand, Grave, Edouard, and Lample, Guillaume. LLaMA: open and efficient foundation language models. 2023. arXiv:2302.13971.