Pretrained transformer-based large language models (LLMs) are upending our perception of what AI is capable of in language comprehension. While large, open-domain search engines have long been able to process natural language queries, the search box in most other applications, such as site search or search inside enterprise applications, has been limited to keyword matching. The emergence of LLMs is changing that. Both consumers and enterprise users will soon expect every search box to understand natural language.

In this post, I will outline my thoughts on what is next for text search.

Text Search Paradigms

I am not writing a comprehensive literature review on information retrieval and LLMs; refer to LNY21 for an overview. I only intend to give a short review of the techniques relevant to implementing a search system as the technology stands today.

Lexical Search

I use lexical search loosely to refer to a variety of text-processing techniques and term-based relevance scoring algorithms. The best-known models are tf-idf and BM25. These models encode documents as sparse vectors and store them in inverted indices. At query time, the search algorithm converts the query into a sparse vector as well and computes a relevance score against each document.
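
To make the scoring concrete, here is a minimal sketch of BM25 over a toy in-memory corpus. The documents, tokenization, and parameter values are placeholders; a real system would keep these statistics in an inverted index rather than recompute them per query.

```python
import math
from collections import Counter

# Toy corpus; a real system stores these statistics in an inverted index.
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a fast auburn fox leaped over a sleepy hound".split(),
    "completely unrelated text about databases".split(),
]

k1, b = 1.5, 0.75                                    # standard BM25 hyperparameters
N = len(docs)
avgdl = sum(len(d) for d in docs) / N                # average document length
df = Counter(term for d in docs for term in set(d))  # document frequencies

def bm25(query, doc):
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "quick fox".split()
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
print(ranked)  # document 0 contains both query terms and should rank first
```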

Learned Representation

Instead of relying on lexical structures, learning models are used to discover a latent feature set that represents the documents and queries. The learning models could be linear models, tree-based models, LSTMs (long short-term memory networks), or transformer-based LLMs.

The learned representations could be sparse vectors. The search algorithm treats the sparse vectors much like lexical search makes use of inverted indices. The model architecture has a hidden layer that corresponds to the latent features representing the input text; the rest of the model could be any learning model. The model is trained with both a sparsity objective and a relevance objective. The trained model can then be used to encode documents and queries into a shared vector space. See ZDC+18 for an example.
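
The following is a rough PyTorch sketch of what such a training objective could look like: a shared encoder maps bag-of-words inputs into a high-dimensional latent layer, and the loss combines a relevance term with an L1 sparsity penalty. The dimensions, architecture, and random data are illustrative only, not the model from ZDC+18.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; real systems use much larger vocabularies and latent layers.
VOCAB, LATENT = 2_000, 5_000

class SparseEncoder(nn.Module):
    """Encodes a bag-of-words vector into a (mostly zero) latent vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(VOCAB, LATENT),
            nn.ReLU(),   # ReLU pushes many activations to exactly zero
        )

    def forward(self, bow):
        return self.proj(bow)

encoder = SparseEncoder()

def loss_fn(q_vec, d_vec, label, l1_weight=1e-4):
    # Relevance objective: dot-product score against a binary relevance label ...
    score = (q_vec * d_vec).sum(dim=-1)
    relevance = nn.functional.binary_cross_entropy_with_logits(score, label)
    # ... plus a sparsity objective: L1 penalty on the latent activations.
    sparsity = q_vec.abs().mean() + d_vec.abs().mean()
    return relevance + l1_weight * sparsity

# One illustrative step; random tensors stand in for real query/document pairs.
q = torch.rand(8, VOCAB)
d = torch.rand(8, VOCAB)
y = torch.randint(0, 2, (8,)).float()
loss = loss_fn(encoder(q), encoder(d), y)
loss.backward()
```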

The learned representations could also be dense vectors. While sparse vectors can have dimensions in the millions, dense vectors usually have dimensions in the hundreds or low thousands. Dense vector representations existed well before LLMs: the encodings generated by word2vec (MCCD13) and GloVe (PSM14) are examples of dense embeddings with lengths in the hundreds. A text search algorithm then applies a similarity measure and nearest neighbor search over the vector space.
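
A minimal sketch of exact nearest neighbor search over dense vectors, using cosine similarity; the random embeddings stand in for word2vec/GloVe (or LLM) encodings of real documents and queries:

```python
import numpy as np

# Placeholder embeddings; in practice these come from an embedding model.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10_000, 300)).astype(np.float32)   # 10k docs, 300-dim
query_vec = rng.normal(size=300).astype(np.float32)

# Normalize so that a dot product equals cosine similarity.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec            # cosine similarity against every document
top_k = np.argsort(-scores)[:5]          # indices of the 5 nearest documents
print(top_k, scores[top_k])
```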

Note that dense vector similarity search, with the right indexing, can be much more efficient than sparse vector search. In practice, however, sparse algorithms such as BM25 are usually efficient enough that this is not a major concern. See this blog post for an example of approximate nearest neighbor search for dense vectors.
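
For larger collections, an approximate nearest neighbor index trades a small amount of recall for much lower query latency. Here is a sketch using the FAISS library's HNSW index, assuming the faiss-cpu package is installed; the dimensions and parameters are illustrative:

```python
import faiss
import numpy as np

d = 300                                   # embedding dimension
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10_000, d)).astype(np.float32)
query_vec = rng.normal(size=(1, d)).astype(np.float32)

index = faiss.IndexHNSWFlat(d, 32)        # HNSW graph with 32 neighbors per node
index.add(doc_vecs)                       # built once, at indexing time

distances, ids = index.search(query_vec, 5)   # approximate top-5 at query time
print(ids[0], distances[0])
```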

LLM-based representations work the same way; the only difference is that an LLM encodes the documents and queries. An LLM dramatically improves search results when it has been trained on a corpus similar to the queries it will later serve. See KOM+20 and XXL+20 for detailed descriptions of how to build these models from BERT (DCLT19). LLM-based embeddings, e.g. the OpenAI embeddings, have lengths in the thousands.
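
As an illustration of the bi-encoder workflow, here is a sketch using the sentence-transformers library; the checkpoint name is just an example of a publicly available bi-encoder, not a recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any bi-encoder checkpoint works here; this model name is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "How do I reset my password?",
    "Shipping usually takes three to five business days.",
    "Our office is closed on public holidays.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)        # computed at indexing time
query_vec = model.encode("I forgot my login credentials", normalize_embeddings=True)

scores = doc_vecs @ query_vec           # cosine similarity (vectors are normalized)
print(docs[int(np.argmax(scores))])     # should retrieve the password-reset document
```

Note that the query shares no keywords with the best-matching document; that is the semantic matching behavior discussed later in this post.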

Cross Encoding Ranking Model

Both sparse and dense vector models are representation-based: they separate the computation between indexing time and query time. Cross-encoding, by contrast, requires both the query and the document as input to calculate a relevance score. These are also called interaction-based models. It is possible to use an LSTM or a convolutional network to encode the query-document pair, but the biggest gain has come from using a pretrained LLM as the starting point for training. The query and document are concatenated as the input; an LLM encodes the input and is trained as a binary classification model. Once trained on a labelled dataset, the model can assign a relevance score to any query-document pair. See NC20 and NYCL19 for examples.
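
A sketch of cross-encoding at inference time, using the CrossEncoder wrapper from sentence-transformers; the checkpoint name is an example of a publicly available BERT-style reranker trained on a labelled dataset (MS MARCO), in the spirit of NC20:

```python
from sentence_transformers import CrossEncoder

# A publicly available BERT-style reranker; the checkpoint name is an example.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what causes rain"
candidates = [
    "Rain forms when water vapor condenses into droplets heavy enough to fall.",
    "The capital of France is Paris.",
]

# Each (query, document) pair is scored jointly in a single forward pass.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)
print(reranked[0][1])   # the meteorology passage should score highest
```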

Transformer-based LLMs are O(n^2) in the input sequence length, and cross-encoding has to process every query-document pair as input. That is hugely inefficient. One way to get around this is late interaction (see KZ20). This technique encodes the query and the document separately rather than as one concatenated input, mitigating the n^2 issue. It also creates an opportunity to cache the document embeddings, so the interaction can be computed cheaply at query time.
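
A sketch of the late-interaction (MaxSim) scoring step, with random token embeddings standing in for the outputs of the query and document encoders; in a ColBERT-style system the document-side embeddings would be precomputed and cached at indexing time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-token embeddings (random stand-ins for encoder outputs).
query_tokens = rng.normal(size=(8, 128)).astype(np.float32)    # 8 query tokens
doc_tokens = rng.normal(size=(200, 128)).astype(np.float32)    # 200 document tokens

# Normalize so dot products are cosine similarities.
query_tokens /= np.linalg.norm(query_tokens, axis=1, keepdims=True)
doc_tokens /= np.linalg.norm(doc_tokens, axis=1, keepdims=True)

# Late interaction (MaxSim): each query token matches its most similar document
# token, and the per-token maxima are summed into one relevance score.
sim = query_tokens @ doc_tokens.T          # (8, 200) token-level similarities
score = sim.max(axis=1).sum()
print(score)
```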

Performance is a tricky topic. The LLM hype makes it seem as if LLM techniques vastly outperform traditional methods such as BM25. The problem is that all evaluation methods are imperfect: a model can do well under some evaluation methods but perform poorly under others. A detailed discussion of performance requires introducing benchmarking datasets and methodologies, which I will not go into in this post. See TRR+21 for an example. This video offers a good introduction. I would offer the following key takeaways:

  • BM25 is a robust baseline. It generalizes well to out-of-domain data.
  • LLM-powered dense vector models have superior in-domain performance. However, they do not generalize as well out of domain.
  • Cross-encoding ranking models are computationally expensive, but they perform the best.

The tried-and-true lexical BM25 is the workhorse of the majority of search applications. Aside from massively well-funded products like Google and Bing, most applications implement lexical search. Lexical search is easy to understand, provides a robust baseline, and generalizes well to out-of-domain datasets. It is already supported by mature technologies such as Solr and Elasticsearch, there is a large base of developers with the expertise to build sophisticated search systems on top of them, and there are well-established hosted solutions such as Algolia and Elastic.

LLM-powered embedding is getting all the hype, for a few reasons. First, it enables semantic search, which is something users intuitively understand. Everyone wants a search capability that can retrieve relevant documents even when they do not contain the exact keywords, and LLM-based dense vectors provide that capability easily. Second, there is a perceived performance boost. While this is not always borne out, the hype is here and everyone wants to at least say that they support semantic search. Third, vector databases are becoming a hot trend. There are new vector databases such as Weaviate and Pinecone, and traditional databases are starting to support vector similarity, e.g. Solr, Elasticsearch, Redis, and Postgres. Fourth, AI companies provide hosted LLMs via HTTP APIs to convert documents into dense vectors; OpenAI and Cohere are examples. They let developers apply dense embedding techniques without needing to understand LLMs at all.

Cross-encoding models are not popular because they are not only computationally expensive but also require custom implementations. Unlike lexical search and dense vector methods, there are few established services or mature open-source stacks that support cross-encoding end to end. Only the best-funded enterprise applications have the resources to integrate these techniques into their search features. Most applications would certainly want access to state-of-the-art search capabilities, but the development cost is too high.

Beyond Lexical

It is inevitable that search applications will move beyond lexical search due to the emergence of LLM-based techniques. Pretrained LLMs have demonstrated enormous capabilities in understanding language, and search techniques will soon evolve to harness that power.

What People Want

Most application developers blindly apply dense vector similarity as if it universally delivers state-of-the-art results. Embeddings only work well if the application domain is similar to the LLM's pretraining and fine-tuning data.

I would want a search system that is more than lexical but also more than just embedding similarities. The search system must allow for domain-specific pretraining and fine-tuning of the LLM, and it should have a two-stage ranking pipeline. In stage one, it uses a combination of lexical and dense vector representations to efficiently retrieve candidate documents. In stage two, it uses a cross-encoding model and other heuristics to rerank the candidates, as sketched below.
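
A schematic sketch of that two-stage pipeline, with placeholder scoring functions standing in for BM25, a dense bi-encoder, and a cross-encoder; the blending weight and cutoffs are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1_000)]

# Stage-one scorers; placeholders standing in for BM25 and a dense bi-encoder.
def lexical_scores(query, docs):
    return rng.random(len(docs))

def dense_scores(query, docs):
    return rng.random(len(docs))

# Stage-two scorer; placeholder standing in for a cross-encoder reranker.
def cross_encoder_score(query, doc):
    return rng.random()

def search(query, candidates_k=100, final_k=10, alpha=0.5):
    # Stage 1: blend cheap lexical and dense scores to shortlist candidates.
    blended = alpha * lexical_scores(query, docs) + (1 - alpha) * dense_scores(query, docs)
    candidate_ids = np.argsort(-blended)[:candidates_k]
    # Stage 2: run the expensive cross-encoder only on the shortlist.
    reranked = sorted(candidate_ids,
                      key=lambda i: cross_encoder_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:final_k]]

print(search("natural language query"))
```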

Implementation Challenges

A system such as the one described above is not easy to build. A lexical search system can be as simple as relying entirely on an Elasticsearch deployment, but there is no out-of-the-box solution for implementing the hybrid system. Below I list some of the key challenges. They are not insurmountable, nor do they require theoretical breakthroughs; they merely require engineering resources.

Using a hosted LLM API works well for small datasets. GPT-4 costs $0.12 per 1,000 tokens [1]. Assuming an average of 4 bytes per token, 1 MB of data would cost about $30 to process. That is prohibitively expensive! Most search applications will not be built on top of costly LLM APIs. Costs aside, indexing is usually the key bottleneck of a search system, and relying on third-party LLMs as a key step of the indexing stage would require piping the entirety of the raw data outside of the system, which raises both performance and data privacy concerns.
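
A quick back-of-the-envelope calculation of the figures above:

```python
# Rough cost of pushing raw text through a hosted LLM API,
# using the figures from the paragraph above (price as of April 2023).
price_per_1k_tokens = 0.12        # dollars
bytes_per_token = 4               # rough average for English text
data_bytes = 1_000_000            # 1 MB of raw text

tokens = data_bytes / bytes_per_token
cost = tokens / 1_000 * price_per_1k_tokens
print(f"{tokens:,.0f} tokens -> ${cost:.2f}")   # 250,000 tokens -> $30.00
```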

The two-stage pipeline has to be custom-built, and the reranking algorithms must be custom software. Lexical search and dense vector nearest neighbor tooling is widespread, but cross-encoding models do not have plug-and-play implementations. The search system also needs glue code to chain the ranking stages together.

There needs to be a robust data pipeline for indexing, storage, and retrieval. For lexical search, tools such as Elasticsearch handle all of these tasks in a single piece of software; developers only need to interact with its APIs to index and search. A hybrid search system needs to integrate multiple database technologies to store indices and documents, and to implement the search algorithms on top of the chosen databases.
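
For comparison, this is roughly how small the developer-facing surface is in the lexical case, using the official Elasticsearch Python client (8.x-style calls; the index name, documents, and connection details are illustrative):

```python
from elasticsearch import Elasticsearch

# Connection details and index name are illustrative.
es = Elasticsearch("http://localhost:9200")

# Indexing: Elasticsearch analyzes the text and maintains the inverted index for us.
es.index(index="articles", id="1", document={"title": "Dense vs. lexical retrieval",
                                             "body": "BM25 remains a robust baseline."})
es.indices.refresh(index="articles")

# Search: a single API call runs the BM25-style relevance ranking.
resp = es.search(index="articles", query={"match": {"body": "robust baseline"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```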

Final Thoughts

LLMs mark a watershed moment in computer programs' ability to understand natural language. For the majority of applications, search has been stuck at keyword matching. That is going to change in the near future: users will expect search boxes to understand natural language, and product thinkers will want to design search and Q&A boxes that match those expectations. However, semantic search is harder than just relying on OpenAI APIs and a vector database. Application developers are still waiting for the tooling and hosted solutions needed to build high-quality, next-generation search experiences.


Footnotes

  1. As of April 2023.

Citations

  1. Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. Pretrained Transformers for Text Ranking: BERT and Beyond. 2021. arXiv:2010.06467.
  2. Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 497–506. ACM, 2018.
  3. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:1301.3781.
  4. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. 2014. URL: http://www.aclweb.org/anthology/D14-1162.
  5. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781. Online, November 2020. Association for Computational Linguistics. URL: https://aclanthology.org/2020.emnlp-main.550, doi:10.18653/v1/2020.emnlp-main.550.
  6. Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 2020. arXiv:2007.00808.
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL: https://aclanthology.org/N19-1423, doi:10.18653/v1/N19-1423.
  8. Rodrigo Nogueira and Kyunghyun Cho. Passage Re-ranking with BERT. 2020. arXiv:1901.04085.
  9. Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. Multi-Stage Document Ranking with BERT. 2019. arXiv:1910.14424.
  10. Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. 2020. arXiv:2004.12832.
  11. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. 2021. arXiv:2104.08663.

