Small Embedding Models Comparison for Search

By Eldar · Published June 29, 2026

Benchmark 5 lightweight embedding models on SciFact retrieval, measuring recall, MRR, NDCG, indexing speed, and query latency to identify the best model for low-latency semantic search.

  • embedding-models
  • semantic-search
  • retrieval
  • benchmark
  • information-retrieval


# Small Embedding Models Comparison: Semantic Search on SciFact

A head-to-head comparison of 5 small embedding models on the **SciFact** retrieval benchmark (5,183 documents; 1,109 claims in total, of which the 300-query test split is evaluated here). Each model vectorises the corpus, the document vectors are searched exhaustively (the setup cell uses a torch-based nearest-neighbour search in place of a FAISS index), and the top-10 candidates are retrieved per query.

We measure:

- **Recall@k** (k=1, 5, 10): did a relevant doc appear in the top-k?
- **MRR@10**: mean reciprocal rank; how early is the first relevant hit?
- **NDCG@10**: normalized discounted cumulative gain; ranking quality weighted by position.
- **Indexing time**: time to embed the full corpus.
- **Query latency** (p50, p95, p99): milliseconds per query at query time.

The goal: recommend the best embedding model for a **lightweight search system** (low latency, small model size, good retrieval quality).
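The embed-then-retrieve step described above can be sketched as an exact (brute-force) cosine search. This is a minimal illustration, not the notebook's code: it uses NumPy as a stand-in for the torch-based search, random vectors instead of real embeddings, and the helper name `top_k_cosine` is hypothetical.

```python
import numpy as np

def top_k_cosine(doc_vecs, query_vecs, k=10):
    """Exact top-k retrieval by cosine similarity (brute force)."""
    # L2-normalise so that a dot product equals cosine similarity
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    scores = q @ d.T                                   # shape: (n_queries, n_docs)
    k = min(k, d.shape[0])
    # argpartition picks the k best per row (unordered); then sort only those k
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    top_idx = np.take_along_axis(part, order, axis=1)
    top_scores = np.take_along_axis(part_scores, order, axis=1)
    return top_idx, top_scores

# Assumption: random 384-dim vectors stand in for real document embeddings
rng = np.random.default_rng(0)
toy_docs = rng.normal(size=(100, 384)).astype(np.float32)
# Two toy queries built as slightly perturbed copies of documents 3 and 42
toy_queries = toy_docs[[3, 42]] + 0.01 * rng.normal(size=(2, 384)).astype(np.float32)

top_idx, top_scores = top_k_cosine(toy_docs, toy_queries, k=10)
print(top_idx[:, 0])   # best-matching document index for each query
```

With only 5,183 documents, an exhaustive search like this is cheap, which is presumably why an approximate index is not needed at this scale.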

# ── Setup: Install dependencies, imports, device ──────────────────────
import sys, os, json, time, random, warnings
from pathlib import Path
from tqdm.auto import tqdm
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn.functional as F

# Install on-demand (skip faiss — use torch-based nearest-neighbour)
%pip install -q datasets sentence-transformers scipy 2>&1 | tail -1

import datasets
…
Note: you may need to restart the kernel to use updated packages.
Device: cuda
GPU: Tesla T4  |  VRAM: 15.6 GB

## Load SciFact Corpus

SciFact is a BEIR dataset: 5,183 scientific abstracts (corpus) and 1,109 claims (queries) with relevance judgements. Only the test split is loaded here, which contains 300 of those queries.

# ── Load SciFact via BEIR library (handles qrels properly) ────────────
%pip install -q beir 2>&1 | tail -1

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

print(f"Data downloaded to: {data_path}")

loader = GenericDataLoader(data_path)
corpus, queries, qrels = loader.load(split="test")

print(f"\nCorpus:  {len(corpus):,} documents")
print(f"Queries: {len(queries):,}")
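# Note: qrels maps query_id -> {doc_id: score}, so len(qrels) counts
# queries that have judgements, not individual query-doc pairs.
print(f"Qrels:   {len(qrels):,} query-doc relevance pairs")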
Note: you may need to restart the kernel to use updated packages.
Data downloaded to: datasets/scifact

Corpus:  5,183 documents
Queries: 300
Qrels:   300 query-doc relevance pairs

## Prepare Data for Evaluation

Convert the BEIR data format into lists and mappings suitable for batch embedding and retrieval.

# ── Prepare data structures for evaluation ────────────────────────────

# Build document list: combine title + text (when title exists)
doc_ids = list(corpus.keys())
doc_texts = []
for did in doc_ids:
    entry = corpus[did]
    if isinstance(entry, dict):
        title = entry.get("title", "")
        text = entry.get("text", "")  # fall back to empty text if the field is missing
    else:
        title, text = "", str(entry)
    doc_texts.append(f"{title} {text}".strip())

print(f"Documents: {len(doc_texts):,}  |  Avg chars: {np.mean([len(t) for t in doc_texts]):.0f}")

# Build query list (queries values are plain strings here)
q_ids = list(queries.keys())
…
Documents: 5,183  |  Avg chars: 1499
Queries:   300  |  Avg chars: 90
Queries with ≥1 relevant doc: 300 / 300

Relevance score distribution: min=1, max=1, mean=1.00
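The summary lines above depend on a mapping from each query to its set of relevant documents. The cell that builds it is truncated in this preview, so the following is only a sketch of one way to do it, assuming the BEIR qrels layout `{query_id: {doc_id: score}}`. The toy data and the names `toy_qrels` and `relevant_docs` are illustrative, not taken from the notebook.

```python
# Toy qrels in the BEIR layout: {query_id: {doc_id: relevance_score}}
toy_qrels = {
    "q1": {"d3": 1, "d7": 1},
    "q2": {"d5": 1, "d9": 0},   # a zero score means judged but not relevant
}

# Keep only positively judged documents for each query
relevant_docs = {
    qid: {did for did, score in judged.items() if score > 0}
    for qid, judged in toy_qrels.items()
}

# Summary statistics in the spirit of the output above
n_with_relevant = sum(1 for rel in relevant_docs.values() if rel)
all_scores = [s for judged in toy_qrels.values() for s in judged.values()]
print(f"Queries with >=1 relevant doc: {n_with_relevant} / {len(relevant_docs)}")
print(f"Relevance scores: min={min(all_scores)}, max={max(all_scores)}")
```

Storing the relevant documents as sets keeps the membership checks in the metric functions constant-time.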

## Evaluation Metrics

We'll measure:

- **Recall@k** (k=1, 5, 10): fraction of queries where at least one relevant doc appears in the top-k
- **MRR@10**: mean reciprocal rank, i.e. 1/rank of the first relevant doc (averaged over queries), 0 if none is found
- **NDCG@10**: normalized discounted cumulative gain, accounting for ranking quality
- **Indexing throughput**: docs/second during embedding
- **Query latency**: milliseconds per query (p50/p95/p99)

# ── Evaluation functions ──────────────────────────────────────────────

def recall_at_k(ranked_doc_ids, relevant_set, k):
    """Recall@k: 1 if any relevant doc in top-k, else 0."""
    top_k = ranked_doc_ids[:k]
    return 1.0 if any(did in relevant_set for did in top_k) else 0.0

def reciprocal_rank(ranked_doc_ids, relevant_set, k=None):
    """Reciprocal rank with optional cap at k."""
    for i, did in enumerate(ranked_doc_ids):
        if k is not None and i >= k:
            break
        if did in relevant_set:
            return 1.0 / (i + 1)
    return 0.0

def dcg_at_k(relevance_scores, k):
    """DCG@k given a list of binary relevance scores up to k."""
…
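
The body of the DCG function is cut off in this preview. As a self-contained illustration of the metric (not the notebook's implementation), here is a sketch of NDCG@k for binary relevance using the common log2(rank + 1) discount. The names `dcg_sketch` and `ndcg_sketch` are hypothetical.

```python
import math

def dcg_sketch(relevances, k):
    """DCG@k with a log2(rank + 1) discount; ranks start at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_sketch(ranked_ids, relevant, k=10):
    """NDCG@k for binary relevance: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if did in relevant else 0.0 for did in ranked_ids[:k]]
    # Ideal ranking: all relevant docs placed first (at most k of them)
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_sketch(ideal, k)
    return dcg_sketch(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy ranking: relevant docs d3 and d7 appear at ranks 2 and 4
ranked = ["d9", "d3", "d1", "d7"]
relevant = {"d3", "d7"}
score = ndcg_sketch(ranked, relevant, k=10)
print(f"NDCG@10 = {score:.4f}")
```

In this toy case the DCG is 1/log2(3) + 1/log2(5) and the ideal DCG is 1 + 1/log2(3), so the score is roughly 0.65; a ranking that places both relevant docs first would score 1.0.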

## Embedding Model Comparison Loop

We'll test 5 small embedding models (all under 34M params, 384-dim output):

| Model | Params | Dims | Size |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 22.7M | 384 | ~90 MB |
| `BAAI/bge-small-en-v1.5` | 33.4M | 384 | ~130 MB |
| `thenlper/gte-small` | 33.4M | 384 | ~130 MB |
| `intfloat/e5-small-v2` | 33.4M | 384 | ~130 MB |
| `sentence-transformers/paraphrase-MiniLM-L3-v2` | 11.8M | 384 | ~45 MB |

Each model indexes the 5,183 SciFact documents, then retrieves the top-10 candidates for each of the 300 queries. We measure retrieval quality and latency.
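
The timing side of such a loop can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: `fake_embed` is a hypothetical stand-in for a real model's encode call, `benchmark` is a hypothetical helper, and query latency is assumed here to cover both embedding the query and searching the corpus.

```python
import time
import numpy as np

def fake_embed(texts, dim=384):
    """Hypothetical stand-in for a model's encode call; returns random unit vectors."""
    rng = np.random.default_rng(len(texts))
    vecs = rng.normal(size=(len(texts), dim)).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def benchmark(embed_fn, docs, queries, k=10):
    # Indexing time: wall-clock time to embed the whole corpus
    t0 = time.perf_counter()
    doc_vecs = embed_fn(docs)
    index_seconds = time.perf_counter() - t0

    # Query latency: embed + exhaustive search, timed one query at a time
    latencies_ms = []
    for q in queries:
        t0 = time.perf_counter()
        q_vec = embed_fn([q])
        scores = (q_vec @ doc_vecs.T)[0]
        _ = np.argsort(-scores)[:k]            # indices of the top-k documents
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)

    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {
        "index_seconds": index_seconds,
        "docs_per_second": len(docs) / max(index_seconds, 1e-9),
        "p50_ms": float(p50),
        "p95_ms": float(p95),
        "p99_ms": float(p99),
    }

stats = benchmark(
    fake_embed,
    [f"doc {i}" for i in range(500)],
    [f"query {i}" for i in range(50)],
)
print(stats)
```

When the model runs on a GPU, timings like these are only meaningful if the device is synchronised before each clock read, since CUDA work is launched asynchronously.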

