You’re no doubt familiar with fulltext search systems like Elasticsearch, Solr and MySQL fulltext search. These databases are great for finding documents by keyword, but keyword search is very brittle. If you’re a search engineer or architect you’ve no doubt struggled with tokenizers, analysers and synonym maps. Content authors must think carefully about their choice of keywords to ensure the content can be found.
It doesn’t have to be this way! AI powered Dense Passage Ranking (DPR) does the heavy lifting for you …
- Sparse retrieval
- Dense retrieval
- Benefits of DPR
- Try it
- Drawbacks to DPR
- Behavioural considerations
Sparse retrieval (aka keyword search)
Lunr.js and Elasticsearch serve totally different audiences but they both employ the same underlying algorithm: Term Frequency/Inverse Document Frequency, aka TF-IDF. This algorithm, along with similar algorithms such as BM25F, thinks in terms of keywords. For a long time Google also employed a variant of TF-IDF, hence all the chatter about “keyword optimization”.
Keyword search, as the name implies, matches query and document keywords. The better the match, the higher the document ranks. TF-IDF is described as being “sparse” or “scalar” in nature. In very simplistic terms, this means context and meaning are ignored. Keywords are considered in isolation. “iPhone” is just “iPhone”, irrespective of whether it’s “The new iPhone 14” or “an iPhone app”.
TF-IDF has been around for years. It’s widely understood and simple to use. However, running a TF-IDF based system in production is far from simple - as any Elasticsearch developer will attest. The simplicity (and shortcomings) of the algorithm itself necessitate a high degree of tuning. Stemmers, tokenizers and analysers must all be configured to get the best performance. Synonyms are another challenge - if only everyone would use the same terms!
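To make the “keywords in isolation” point concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is an illustration under assumptions, not how Elasticsearch or Lunr.js implement ranking: the whitespace tokenizer, the smoothed IDF formula and the three sample documents are all choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real analysers do far more."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # the term is absent, so it contributes nothing
            tf = counts[term] / len(tokens)
            # Smoothed IDF (an assumption of this sketch): rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            score += tf * idf
        scores.append(score)
    return scores


documents = [
    "the new iphone 14",
    "an iphone app",
    "a guide to android apps",
]
scores = tf_idf_scores("iphone", documents)
```

Both iPhone documents match on exactly the same keyword. The second one scores higher only because it is shorter, not because the algorithm knows a phone from an app.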
Tuning for test data is simple. We just select the best tokenizers, analysers etc., create some synonyms and off we go. Production is a different matter! We quickly discover that one analyser performs well for some queries, but fails miserably for others. Sometimes we need to aggressively tokenize phrases, treating hyphenated words as distinct keywords. At other times we need to keep them as is. If we’re not careful we find ourselves going round in circles - hoping in vain to find that “one size fits all” compromise.
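The hyphenation dilemma is easy to reproduce. The sketch below compares two hypothetical tokenizers, one that splits on hyphens and one that keeps hyphenated words whole. The function names and the sample title are invented for illustration and do not correspond to any particular Elasticsearch or Solr analyser.

```python
import re


def tokenize_split(text):
    """Aggressive: break on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_keep(text):
    """Conservative: keep hyphenated words together as one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def matches(query, document, tokenizer):
    """True when every query token also appears in the document."""
    return set(tokenizer(query)) <= set(tokenizer(document))


title = "Wi-Fi set-up guide"

split_tokens = tokenize_split(title)  # ['wi', 'fi', 'set', 'up', 'guide']
kept_tokens = tokenize_keep(title)    # ['wi-fi', 'set-up', 'guide']

# Splitting helps the query "set up" ...
helped = matches("set up", title, tokenize_split)      # True
missed = matches("set up", title, tokenize_keep)       # False

# ... but it also lets the meaningless fragment "fi" match.
noisy = matches("fi", title, tokenize_split)           # True
clean = matches("fi", title, tokenize_keep)            # False
```

Neither tokenizer is right for every query, which is exactly the “one size fits all” trap.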
Dense retrieval (aka semantic search)
As humans, we understand the difference between an iPhone and an iPhone app. We immediately recognise the difference because we understand natural language, even if we’re not linguists. We have a degree of learned knowledge - at some point in the past we learned about smartphones and apps.
Dense retrieval is described as being “dense” or “vectored” in nature. Dense retrieval methods work at a higher level than simple keywords. They understand the context in which words are used. They understand, to some extent, language rules and grammar. They understand the significance of word ordering, stemming, tokenization and synonyms without being told explicitly.
Dense retrieval is a form of artificial intelligence. Text is passed through an AI model to extract features in the form of a numerical vector. This feature vector encodes not just the words, but the context in which they’re used. We then use a mathematical distance metric to compute the distance between vectors. The smaller the distance, the closer the relationship between the query and document (or document fragment).
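The ranking step can be sketched in a few lines. The example assumes cosine similarity as the metric (one common choice, where a higher value means the vectors are closer) and uses hand-made three-dimensional vectors. A real model would produce vectors with hundreds of dimensions, and the passage names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vector, passage_vectors):
    """Return passage ids ordered from most to least similar to the query."""
    return sorted(
        passage_vectors,
        key=lambda pid: cosine_similarity(query_vector, passage_vectors[pid]),
        reverse=True,
    )


# Hand-made vectors for illustration only; a trained model would supply these.
passage_vectors = {
    "iphone-14-review": [0.9, 0.1, 0.0],
    "iphone-app-tutorial": [0.2, 0.9, 0.1],
    "school-contact-page": [0.0, 0.1, 0.9],
}

# Stands in for the encoded query "how do I build an iPhone app".
query_vector = [0.1, 0.8, 0.2]

ranking = rank(query_vector, passage_vectors)
```

With these toy vectors the app tutorial ranks first, even though both iPhone passages share the keyword “iPhone”.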
As with all artificial intelligence, dense retrieval models are able to generalise based on examples. In simple terms, we show a neural network a query and some search results, then tell it which result best matches the query. If we show the neural network enough examples, it learns how to best perform searches.
Dense Passage Ranking (DPR) vs Dense Retrieval
I use the terms DPR and dense retrieval interchangeably in this document, but that’s not strictly correct.
Dense Passage Ranking (more widely published as Dense Passage Retrieval) is a form of dense retrieval. DPR uses one neural network to encode document fragments (passages), and another network to encode queries. This is desirable because the passages of text are likely to be longer than the queries themselves. By selecting different neural networks we can optimise the document and query feature extraction. A simpler approach, known as word embeddings, uses the same neural network to encode the documents and queries.
Dense passage ranking and word embeddings are both forms of vector based dense text retrieval.
Benefits of dense passage ranking
The ability to learn from examples illustrates the real power of dense ranking methods. Instead of manually tuning our search cluster, we simply feed the neural network with examples. The resulting model will work out how to process the text: which keywords are significant, how word ordering affects the results and much more. Crucially, dense retrieval works on a case-by-case basis. It understands that in some cases hyphenated words should be split, but at other times they should be treated as a single word.
Knowledge of natural language allows dense retrieval methods to go beyond conventional search. Let’s take a trivial example: imagine a document contains this passage of text:
“Call York Primary School on 01904 123456 or email us at …”
If I asked you to tell me the phone number for York Primary School, you’d answer 01904 123456 without a moment’s thought. But how would you find this information using Solr or Elasticsearch? “phone number” doesn’t appear in the text.
However, a properly trained DPR model will have no trouble finding this passage given a query “york primary phone number”. The DPR model will understand that in this context, 01904 123456 is a phone number - even if it’s not explicitly stated. DPR allows for something known as semantic search.
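A quick check shows what a keyword engine has to work with here. The sketch compares the query terms with the terms in the passage. The `keywords` helper is a simplified stand-in for a real analyser, not anything Solr or Elasticsearch ship.

```python
import re


def keywords(text):
    """Lowercased alphanumeric terms, a simplified stand-in for an analyser."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


passage = "Call York Primary School on 01904 123456 or email us at …"
query = "york primary phone number"

matched = sorted(keywords(query) & keywords(passage))  # ['primary', 'york']
missing = sorted(keywords(query) - keywords(passage))  # ['number', 'phone']
```

Only “york” and “primary” overlap. The two terms that express what the user actually wants, “phone” and “number”, have nothing to match against, so a keyword engine can only rank this passage on the school name.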
Dense passage ranking allows us to go far beyond search. Knowledge of natural language allows us to actually answer questions, instead of just returning matching documents or snippets of text. I explained how a DPR model understands that 01904 123456 is a phone number. It actually understands more than this - it knows the number belongs to an entity (York Primary School).
Given a query “what is York Primary School’s phone number?”, a dense retrieval model can provide a direct answer: “01904 123456”. DPR allows for reading comprehension and question answering.
Try it
You can see question answering in action with our LocalGov demo. Notice how the document content doesn’t explicitly mention “phone number”, “call us” or anything similar, yet the system can still identify the school’s phone number.
Drawbacks to dense passage ranking
AI powered dense retrieval models learn by example. The implication is that you need a significant volume of annotated training data. Fortunately, training data is publicly available, notably the SQuAD dataset.
Once you have the training data you need to actually build a DPR model. Model training is a computationally expensive process and best avoided if at all possible. Again we can turn to the open source community. Huggingface provides a DPRReader transformer, which allows us to use prebuilt DPR models, e.g. those published by Facebook.
Remember I said tools like Elasticsearch are simple to test, but require a lot of tuning for production? Sadly the same is true for dense passage ranking tools like the DPRReader. It’s not so much a question of tuning the “reader” itself; it’s the model training stage that needs work.
This is where things start to become tricky - model training requires significant data and computing resources, so we try to use off-the-shelf models. Sadly these models often perform poorly unless your real-world data is similar to the data they were trained on.
If you have the resources you can of course go ahead and build your own models. I’m guessing you probably don’t have these resources at your disposal though! This leaves you with two options:
Transfer learning
Transfer learning can be used to refine an existing model, tweaking it to better suit your specific domain. It’s conceptually similar to a Docker image - let someone else do the heavy lifting, then build on top of it. Tools like Huggingface allow you to refine an existing model with your own annotated datasets. Care needs to be taken though: careless fine-tuning may actually distort the underlying, prebuilt model.
Cloud based COTS
Alternatively you can use a cloud based product. Our own question answering feature is offered over a REST API. We’ve already built a range of DPR models for different domains and scenarios. If required, we can fine-tune them for your specific needs. Tools like AWS Kendra, Azure Cognitive Search and IBM Watson offer something similar.
Solr and Elasticsearch have taken the TF-IDF algorithm and turned it into a production grade database system. There’s nothing directly comparable to Solr or Elasticsearch in the DPR space. The Huggingface library works at a low level, even lower than Lucene. Huggingface requires developers to pass in a passage of text (a couple of sentences) and a query. Not much use when you have thousands of long documents to query!
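To give a feel for the plumbing a low level library leaves to you, here is a minimal sketch of one common preparatory step: splitting a long document into short, overlapping passages that a model can accept. The `split_into_passages` helper, the word based sizing and the default values are all assumptions made for this illustration. Real systems often split on model tokens or sentence boundaries instead, and this is not a description of how any product mentioned here works.

```python
def split_into_passages(text, max_words=100, overlap=20):
    """Split text into passages of at most max_words words.

    Consecutive passages share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one passage.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final words are already covered
    return passages


# A synthetic 250-word document: "w0 w1 ... w249".
document = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(document, max_words=100, overlap=20)
```

Every passage then has to be encoded, stored and searched, and that is before you deal with updates, deletions and scaling. This is the work a database system normally hides.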
This is where cloud based databases like Viko, Kendra and Azure Cognitive Search come into their own. They take care of the indexing, retrieval and semantic search/question answering at scale.
The complexity involved, along with the computational (and financial) costs associated with dense retrieval, means that production grade DPR offerings are almost exclusively cloud based. Very few organisations could justify the costs associated with building and maintaining a DPR search system for exclusive use.
Dense passage retrieval makes use of deep neural networks. Deep neural networks run slowly on CPU hardware. Instead, we need to use dedicated GPU or TPU (Tensor Processing Unit) instances. This kit is expensive.
Sparse retrieval algorithms like TF-IDF don’t care about language - everything is treated as a term. Tools like Solr add a degree of language specific config in the form of stop words and well known synonyms, but the system itself is inherently multi-lingual.
That’s not the case for dense retrieval methods like DPR. A DPR model is inherently language aware. If you need to support 10 languages, you’ll need to find or train 10 different models.
Behavioural considerations
Something else needs to be considered when evaluating DPR - the way your users search. For years, Google employed keyword based search. An entire generation has grown up thinking in terms of keywords. Without realising it, pretty much every experienced Googler understands the TF-IDF algorithm. Looking for a pair of Nike trainers? “Air Max 2022” will yield better results than “Latest Nike Air trainers”.
When deciding between dense vs sparse retrieval I like to ask this question:
Is the user looking for a particular document?
In many cases, fulltext search is just a shortcut to a document. The user knows what they’re looking for, they just don’t know where to find it and/or can’t be bothered to hunt around. E-Commerce product search is a perfect fit for something like TF-IDF. The user wants to find a product page and they know which keywords to use.
However if the user is looking for answers, not documents, dense retrieval often makes more sense. Semantic search allows the user to describe their question or problem. A dense retrieval model will find the best answer or solution - even if the question and answer keywords don’t align.
Google is shifting away from keyword based search towards semantic search. They’re actively encouraging and educating their users to ask natural language questions.
I expect the “keyword generation” will start to adapt their behaviours. Once they get a taste for Googling for answers, they’ll expect all systems to work in a similar way. Dense retrieval, semantic search and question answering are the future.
Dense retrieval methods take context and learned knowledge into account when performing searches. This allows us to find content even if the query and document keywords don’t necessarily align. It allows us to find specific content within documents and answer questions. Dense retrieval does however require significant resources (data and compute). Unlike language agnostic tools like Elasticsearch, dense retrieval is both language and domain specific.
Why not try dense passage ranking for yourself? Sign up for a free developer account today!