After experimenting with lexical/keyword vs semantic search, I became curious whether there was a way to combine the two. Since I’ve been using FastEmbed for generating and querying embeddings, I ended up stumbling upon one of their blog posts, Hybrid Search with FastEmbed & Qdrant. The method they highlight is called Reciprocal Rank Fusion (RRF). The original paper provides the definitive description, but in brief, RRF combines the rankings (rather than the scores) from different search algorithms to assign its own hybrid score. RRF is not restricted to fusing semantic and lexical search: you can use it to combine the rankings from two, three or more semantic or lexical approaches; all it cares about is the derived ranks.
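To make that concrete, here’s a minimal sketch of RRF in plain Python over two made-up rankings (the document IDs and the rankings are purely illustrative):

from collections import defaultdict

# made-up rankings from two retrievers: document id -> rank (1 = best)
lexical_ranks = {"doc_a": 1, "doc_b": 2, "doc_c": 3}
semantic_ranks = {"doc_a": 1, "doc_d": 2, "doc_b": 3}

k = 60  # the constant used in the RRF paper
fused = defaultdict(float)
for ranking in (lexical_ranks, semantic_ranks):
    for doc_id, rank in ranking.items():
        fused[doc_id] += 1 / (k + rank)

# doc_a comes out on top since both retrievers rank it first
for doc_id, score in sorted(fused.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 5))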
DuckDB stores my embeddings, vector index and full-text search index, so I’d prefer to carry out as much of the RRF computation as possible within SQL, both for simplicity and efficiency - the Qdrant approach involves a lot of client-side imperative computation, so it doesn’t quite cut it.
A bit of searching here and there led me to some sample code from the pgvector team that fits the bill. It is SQL-based, albeit Postgres-flavoured. With some very minor adjustments, I was able to adapt it to my particular case, though all credit belongs to the pgvector team.
For lexical search, I derive the ranks as follows, nothing too fancy. Of note, I use a window function to assign the rank in descending order of the BM25 score. The nulls last part is superfluous since that’s DuckDB’s default behaviour, though I prefer making it explicit:
select
    id as entry_id,
    rank() over (
        order by fts_main_entries.match_bm25(id, $1) desc nulls last
    ) as rank
from entries
Semantic search based on vector similarity is similar. Unlike FTS, cosine similarity produces a score for every document rather than nulls for irrelevant ones, so I didn’t add a nulls last clause to the window function:
select
    entry_id,
    rank() over (
        order by array_cosine_similarity(vec, $2::FLOAT[{dimension}]) desc
    ) as rank
from embeddings
Now for the fun part: combining both sets of rankings into a single score using RRF. A search result might appear in the lexical results, the semantic results or both, which is why the full outer join is there - it ensures a row is kept regardless of whether it comes from both sides or just one of them. It’s also why we’ve got the coalesce calls: null propagation would mess up the calculation, so whenever one side is missing, coalesce assigns a default contribution of 0. The 60 constant is somewhat of a magic number that the authors of the RRF paper derived experimentally - it can be configured to some other value though.
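For example, with the constant at 60, an entry ranked 1st lexically and 3rd semantically scores 1/61 + 1/63 ≈ 0.0323, while an entry that only appears in the lexical results at rank 2 gets 1/62 + 0 ≈ 0.0161.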
select
    coalesce(l.entry_id, s.entry_id) as entry_id,
    coalesce(1.0 / (60 + s.rank), 0.0) +
    coalesce(1.0 / (60 + l.rank), 0.0) as score
from lexical_search l
full outer join semantic_search s using(entry_id)
order by score desc
limit 20
Bringing all the snippets together, we end up with:
with lexical_search as (
    select
        id as entry_id,
        rank() over (
            order by fts_main_entries.match_bm25(id, $1) desc nulls last
        ) as rank
    from entries
),
semantic_search as (
    select
        entry_id,
        rank() over (
            order by array_cosine_similarity(vec, $2::FLOAT[{dimension}]) desc
        ) as rank
    from embeddings
)
select
    coalesce(l.entry_id, s.entry_id) as entry_id,
    coalesce(1.0 / (60 + s.rank), 0.0) +
    coalesce(1.0 / (60 + l.rank), 0.0) as score
from lexical_search l
full outer join semantic_search s using(entry_id)
order by score desc
limit 20
I didn’t add limit clauses to the subqueries, particularly the one for semantic search - I’m being optimistic that DuckDB’s query optimizer will figure it out. As always though, it’s best to check the query plan and ensure we end up with an index scan rather than a sequential scan.
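For completeness, here’s roughly how the combined query can be run from Python - a sketch rather than my exact code, assuming the entries and embeddings tables plus the full-text and vector indexes already exist. The database file, model name and dimension below are placeholders, and HYBRID_SQL_TEMPLATE stands for the query above with its {dimension} placeholder still unfilled:

import duckdb
from fastembed import TextEmbedding

con = duckdb.connect("entries.db")               # placeholder path
model = TextEmbedding("BAAI/bge-small-en-v1.5")  # placeholder 384-dim model

# fill in the {dimension} placeholder in the combined query above
hybrid_sql = HYBRID_SQL_TEMPLATE.format(dimension=384)

query = "reciprocal rank fusion in duckdb"
query_vec = next(model.embed([query])).tolist()

# $1 is the FTS query string, $2 the query embedding
results = con.execute(hybrid_sql, [query, query_vec]).fetchall()

# prepend `explain` (or `explain analyze`) to the query to inspect the plan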
RRF using Polars #
Here’s how to calculate RRF using Polars, for comparison’s sake. I’ll post the snippets bit by bit with explanations, then the whole code in one go.
Suppose search_results consists of PyArrow tables, where each table has an id column for the document ID and a score column for the score assigned by that search/retrieval algorithm. The higher the score, the ‘closer’/more relevant the document is to the query.
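For illustration, one such table could be built like this (the IDs and scores are made up):

import pyarrow as pa

# a made-up lexical result set: higher score = more relevant
tbl = pa.table({
    "id": pa.array([2256, 2104, 2056], type=pa.uint32()),
    "score": pa.array([7.3, 5.1, 4.8], type=pa.float64()),
})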
To convert a PyArrow table to a Polars dataframe:
import polars as pl

df = pl.from_arrow(tbl)
Next, we assign each document a rank based on its score. This snippet also drops the score column since once we have the rank, we no longer need the score (for RRF):
df = (
    df.with_columns(
        pl.col("score")
        .rank(method="dense", descending=True)
        .alias("rank")
    )
    .drop("score")
)
To ‘gather’ all the rankings into a single dataframe, use pl.concat, which takes an iterable of dataframes:
df = pl.concat(
    pl.from_arrow(tbl)
    .with_columns(
        pl.col("score")
        .rank(method="dense", descending=True)
        .alias("rank")
    )
    .drop("score")
    for tbl in search_results
)
Next, group the rows by document ID and gather the different rankings into a list:
df = df.group_by("id").agg(pl.col("rank").alias("ranks"))
If we print the dataframe at this point, we might get something similar to:
shape: (3, 2)
┌──────┬───────────┐
│ id   ┆ ranks     │
│ ---  ┆ ---       │
│ u32  ┆ list[u32] │
╞══════╪═══════════╡
│ 2104 ┆ [2, 3]    │
│ 2256 ┆ [1, 1]    │
│ 2056 ┆ [3, 2]    │
└──────┴───────────┘
Finally, to calculate the RRF score, sum 1 / (K + rank) over each document’s ranks:
df = (
    df.with_columns(
        pl.col("ranks")
        .list.eval(1 / (pl.element() + K))  # K is the RRF constant, 60
        .list.sum()
        .alias("score")
    )
    .drop("ranks")
)
If we print the dataframe at this point, we might get something similar to:
shape: (3, 2)
┌──────┬──────────┐
│ id   ┆ score    │
│ ---  ┆ ---      │
│ u32  ┆ f64      │
╞══════╪══════════╡
│ 2104 ┆ 0.032002 │
│ 2256 ┆ 0.032787 │
│ 2056 ┆ 0.032002 │
└──────┴──────────┘
We can then sort by score and pick the top matches.
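For instance, mirroring the limit 20 in the SQL version:

top_matches = df.sort("score", descending=True).head(20)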
I find chaining everything together more elegant than breaking it up into separate segments, so the above code can be rewritten as follows:
df = (
    pl.concat(
        pl.from_arrow(tbl)
        .with_columns(
            pl.col("score")
            .rank(method="dense", descending=True)
            .alias("rank")
        )
        .drop("score")
        for tbl in search_results
    )
    .group_by("id")
    .agg(pl.col("rank").alias("ranks"))
    .with_columns(
        pl.col("ranks")
        .list.eval(1 / (pl.element() + K))  # by default K is 60
        .list.sum()
        .alias("score")
    )
    .drop("ranks")
)
And that’s how to calculate RRF using Polars. I find the SQL more beautiful and understandable, though the Polars dataframe-style computation is more flexible: the SQL version can only fuse two result sets, and adding more means rewriting the entire query, whereas the Polars version won’t need any rewriting.
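To make that flexibility concrete, here’s the same pipeline wrapped in a small helper that fuses however many result tables it’s given - a sketch, with the three result tables in the usage line being hypothetical:

def rrf(search_results, k=60):
    # fuse any number of ranked result tables into a single RRF-scored dataframe
    return (
        pl.concat(
            pl.from_arrow(tbl)
            .with_columns(
                pl.col("score")
                .rank(method="dense", descending=True)
                .alias("rank")
            )
            .drop("score")
            for tbl in search_results
        )
        .group_by("id")
        .agg(pl.col("rank").alias("ranks"))
        .with_columns(
            pl.col("ranks")
            .list.eval(1 / (pl.element() + k))
            .list.sum()
            .alias("score")
        )
        .drop("ranks")
        .sort("score", descending=True)
    )

# two result sets or five, the call stays the same
fused = rrf([lexical_results, semantic_results, rerank_results])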