Live Semantic Search with DuckDB, BERT, and Llama.cpp

Published March 5, 2024

BERT Sentence Transformers

BERT Sentence Transformers is a Python package, backed by PyTorch, that turns words and sentences into embedding vectors: “SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.” It comes with a variety of pre-trained models, and the documentation has details on their training data. Because the models are trained on many real-world documents, they encode “cat” and “lion” as vectors that end up close to each other.
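
As a minimal sketch of what that closeness looks like in code (the all-MiniLM-L6-v2 model and the example words here are just for illustration, not necessarily what the emoji search uses):

from sentence_transformers import SentenceTransformer, util

# A small pre-trained model that produces 384-dimensional sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['cat', 'lion', 'spreadsheet'])

# "cat" vs "lion" should score noticeably higher than "cat" vs "spreadsheet"
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))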

Embedding with llama.cpp: A low memory breakthrough

The key to supporting live search with multi-word queries came when llama.cpp added support for BERT. Llama.cpp, as you can probably tell by the name, is designed to run large language models (“Inference of Meta’s LLaMA model (and others) in pure C/C++”), but without Python or any other dependencies.

Now, by creating a compatible BERT model and running it with llama.cpp’s Python bindings, I can generate new embeddings for a search term on the fly without importing sentence-transformers, since the llama-cpp-python library has only a few dependencies and doesn’t require PyTorch. Loading a model this way uses only ~100MB or so of memory.

import llama_cpp

# model_path points at a BERT model converted to llama.cpp's format
model = llama_cpp.Llama(model_path=model_path,
                        embedding=True,
                        verbose=False)
arr = model.create_embedding(text)['data'][0]['embedding']

Easy array comparisons thanks to DuckDB

At about the same time that llama.cpp added BERT model support, DuckDB 0.10.0 came out with fixed-length array support and corresponding functions like array_cosine_similarity. With a database of precomputed embedding vectors for all the emojis, I can now retrieve semantically similar results with something like:

import duckdb
import llama_cpp

# Embed the search term with the same BERT model used for the precomputed vectors
model = llama_cpp.Llama(model_path=model_path,
                        embedding=True,
                        verbose=False)
query_arr = model.create_embedding(text)['data'][0]['embedding']

# Compare the query vector against every stored emoji vector and keep the top 25
con = duckdb.connect('vectors.db')
con.sql("select id, array_cosine_similarity(arr, ?::DOUBLE[384]) as similarity from array_table order by similarity desc limit 25;",
        params=(query_arr,)).to_df()
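
The ?::DOUBLE[384] cast converts the Python list bound as a query parameter into a DuckDB fixed-length array, so array_cosine_similarity can compare it directly against the stored arr column.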

Ready for 🚀

Emoji search results for Neil Armstrong

Today, in the latest iteration of my semantic Emoji finder, I still use the precomputed database. If nothing is found (either an uncommon word or a longer phrase), it creates a new embedding with llama.cpp and then uses a DuckDB database to find the most relevant emojis and return the results. The new code to retrieve the results with DuckDB is a mere 21 lines, and the code to create the DuckDB vector table is 11 lines.
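
For reference, building that vector table could look roughly like the sketch below (the table layout and the emoji_descriptions iterable are illustrative assumptions, not the exact 11 lines from the app):

import duckdb
import llama_cpp

model = llama_cpp.Llama(model_path=model_path, embedding=True, verbose=False)

con = duckdb.connect('vectors.db')
# Fixed-length DOUBLE[384] column to match the embedding size used in the query above
con.sql("create table if not exists array_table (id VARCHAR, arr DOUBLE[384]);")

# emoji_descriptions: an assumed iterable of (emoji, description) pairs, e.g. ('🚀', 'rocket')
for emoji_id, description in emoji_descriptions:
    arr = model.create_embedding(description)['data'][0]['embedding']
    con.execute("insert into array_table values (?, ?::DOUBLE[384]);", (emoji_id, arr))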

The Dash app uses slightly more memory now, and the live searches take 100-200 milliseconds rather than ~50ms for the static SQLite lookup. However, being able to search for any term (even, occasionally, a famous name) adds functionality and whimsy, so I think it’s definitely worth it!