
Three LLM tricks that boosted embeddings search accuracy by 37%

LLMs
December 25, 2022
Author: Alistair Pullen, Co-founder & CEO (@AlistairPullen)

Recently I've been spending a lot of time with embeddings, particularly those provided by OpenAI. I had no prior experience with them, or with vector databases for that matter, but a product pivot after getting into YCombinator meant we had to get up to speed quickly to build our new product, which provides codebase search, comprehension and augmentation.

Whilst working with embeddings I've learned a huge amount, especially about searching across different textual media, such as using prose to search code. My main takeaway, having spoken to other founders using embeddings, is that they are sometimes misunderstood and as a result misused. I see many people simply embedding the search query and indexing it against an embedded answer space; in many simple implementations that's fine, but it doesn't work well for many applications.

I've outlined some of my core learnings below.

Make the question look like the answer

Following on from my point about embeddings being misunderstood, I think many people treat embeddings as innately a search tool; I'd qualify that by saying they're more of a similarity tool. The cosine similarity of two embedding vectors tells you a great deal, but when searching, your query often looks nothing like your desired answer. This is particularly true if you're using text to search a more structured format, such as code or a CSV file, where the answer you're looking for may bear no resemblance to the string you typed.
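For concreteness, the similarity measure in question is just the normalised dot product of two embedding vectors; a minimal sketch in Python:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0 means the two texts occupy a similar region of the
    # embedding space; values near 0 mean they are largely unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```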

There's been a lot of attention around the HyDE paper recently; it provides a competent solution to this problem and has been hugely beneficial to the code search in our product. The essence of the paper is that, when the question and answer spaces look quite different, you should use an LLM to predict what the answer might look like from the question, then embed that prediction and search the answer space with it. This works well because even if the prediction isn't great, it's likely closer to the answer than your original textual query was, improving the similarity score. We do this in our product, where a fine-tuned model translates the user's search query in this manner to better find code snippets in their codebase - it's been working very well so far, contributing to a 37% improvement in our F1 search metric since implementation. Funnily enough, I actually attempted a HyDE-like implementation before the paper was published, but instead of predicting the answer from the question, I tried predicting the question from the answer, and that worked far less well.
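A minimal sketch of the idea, assuming the OpenAI Python client; the model names and prompt are illustrative, not what we actually run in Buildt:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def hyde_search(query: str, corpus_vectors: np.ndarray, k: int = 5) -> np.ndarray:
    # 1. Ask an LLM to hallucinate what the answer might look like.
    prediction = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a plausible code snippet that would answer: {query}",
        }],
    ).choices[0].message.content

    # 2. Embed the *predicted answer*, not the raw query.
    q = embed(prediction)

    # 3. Rank the pre-embedded corpus by cosine similarity to that prediction.
    sims = corpus_vectors @ q / (
        np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q)
    )
    return np.argsort(-sims)[:k]
```

The key point is step 2: the thing you compare against the corpus is the hallucinated answer, so even a mediocre prediction lands much closer to real code than the original English query does.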

Meta-characteristic search

One of the core things we wanted was for people to be able to search for characteristics of their code that aren't directly described in the code itself. For example, a user might want to search for all generic recursive functions, which would be very difficult if not impossible through conventional string matching/regex, and would likely not perform well with a simple embedding implementation either. The same applies to non-code spaces: a user may want to ask an embedded corpus of Charles Dickens to find all cases where Oliver Twist is self-reflective, which would not really be possible with a basic embedding implementation.

Our solution in Buildt is, for each element/snippet, to embed a textual description of the code that captures its meta-characteristics; we generate those descriptions with a fine-tuned LLM. Embedding the textual description alongside the code itself lets you search against both the raw code and the characteristics of the code, which is why we say you can ‘search for what your code does, rather than what your code is’. This approach works extremely well, and without it, questions about functionality rarely return accurate results. It could easily be ported to prose or any other textual form, which could be very exciting for large knowledge bases.
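A rough sketch of what indexing one snippet might look like under this scheme; the prompt, model names and two-vectors-per-snippet layout are assumptions for illustration, not our exact pipeline (which uses a fine-tuned model for the descriptions):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def index_snippet(code: str) -> dict[str, np.ndarray]:
    # Ask an LLM for a plain-English description of what the code does,
    # including meta-characteristics such as recursion, generics or I/O.
    description = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Describe what this code does and list its notable "
                       f"characteristics (e.g. recursion, generics, I/O):\n\n{code}",
        }],
    ).choices[0].message.content

    # Store two vectors per snippet: one for the raw code, one for the
    # description of its behaviour. A query can then match against either,
    # so you can search for what the code does, not just what it is.
    return {"code": embed(code), "description": embed(description)}
```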

There are pitfalls to this approach: it adds a huge amount of extra cost relative to merely embedding the initial corpus, and it increases the latency of the embedding process - so it may not be useful in all cases, but for us it is a worthwhile trade-off as it produces a magical searching experience.

Resilient Embeddings

Another notable thing I noticed is that embeddings are surprisingly resilient. Because of the nature of the problem we're trying to solve, we inevitably have to index very large corpora of information, which calls for some optimisation. When we first shipped I ran an experiment: when we embed code, we actually embed only 40% of it, done (in this instance very crudely) by truncating line length. This was deliberate, as I wanted to see how it would perform and whether any of our users would notice. What shocked me is that the drop in our F1 search metric was far smaller than the 60% drop in information provided to the embedding - the F1 score fell by only 9% with this change. On reflection it makes some intuitive sense: if we treat embeddings as a measure of similarity, a truncated code snippet still retains a good amount of similarity in shape to the code snippet predicted from the LLM query. This may well be useful in applications outside of code, particularly when supported by the LLM-based meta-characteristic search described above, which provides an additional layer of resilience.
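As a rough illustration of that experiment (the 40% figure and the crude per-line truncation come from above; the helper itself is assumed):

```python
def truncate_snippet(snippet: str, keep: float = 0.4) -> str:
    # Crudely keep only the first `keep` fraction of each line before
    # embedding, so roughly 60% of the characters never reach the model.
    kept_lines = []
    for line in snippet.splitlines():
        kept_lines.append(line[: max(1, int(len(line) * keep))])
    return "\n".join(kept_lines)

# The truncated text is what actually gets embedded and indexed, e.g.:
# vector = embed(truncate_snippet(code))
```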

In summary, there are even more things I want to try with embeddings:

  • Fine-tune embedding models by providing extreme positive and negative examples for a query.

  • Provide summarised context of the answer space when generating the predicted answer in the HyDE query, to tailor the prediction to the answer space.

  • Improve the clustering of results to delineate between relevant and irrelevant results.

I'm always happy to chat about this subject; you can find me on Twitter @AlistairPullen
