Learning-to-Rank on MS MARCO Passages: candidate generation from prebuilt indexes and re-ranking for QA search
msmarco-ltr/
│
├── cb_features/                  # data preprocessed for CatBoost
├── data/                         # raw data
├── images/                       # helper images, plots, etc.
├── models/                       # trained model checkpoints
├── processed_data/               # retrieval results and labeled query-passage pairs
├── scripts/                      # python scripts to do all the hard work
├── README.md                     # this file
├── bert_knrm_ranker.ipynb        # training KNRM reranker with BERT embeddings
├── build_datasets.ipynb          # apply retrieval models and build ML datasets
├── cb_feature_engineering.ipynb  # feature engineering for CatBoost
├── cb_ranker.ipynb               # training CatBoost reranker
└── fasttext_knrm_ranker.ipynb    # training KNRM reranker with FastText embeddings
Search and recommendation systems today need to work with massive amounts of text and items. Scanning millions of documents, passages, or products with a heavy model for every user query is not feasible. Instead, modern systems use a two-stage pipeline:
- Retrieval: a fast retriever (like BM25, SPLADE, a dense encoder or a hybrid model) fetches a manageable number of candidates (often top 100–1000).
- Reranking: a slower but smarter model (tree-based or neural) reorders these candidates for maximum relevance.
Below is a general scheme of this architecture:
This design balances speed and quality. The retriever ensures coverage (don’t miss relevant items), while the reranker adjusts the order.
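As a rough illustration of the control flow (the `retriever` and `reranker` interfaces here are generic placeholders, not the actual classes used later in this project):

```python
# Minimal sketch of a two-stage search pipeline (placeholder interfaces).
def two_stage_search(query, retriever, reranker, k_retrieve=100, k_return=10):
    # Stage 1: a cheap retriever scans the full index and returns top-k candidates.
    candidates = retriever.search(query, k=k_retrieve)   # [(doc_id, retriever_score), ...]
    # Stage 2: an expensive reranker scores only those k candidates.
    rescored = [(doc_id, reranker.score(query, doc_id)) for doc_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k_return]
```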
This idea is not limited to web search. It is also used in:
- Question answering: retrieving relevant passages before generating an answer.
- Chatbots & assistants: supplying context passages to LLMs.
- Enterprise search: ranking tickets, contracts, or emails.
- Recommender systems: retrieving candidate products and then reranking them based on user preferences.
In this project, I applied this two-stage approach to the MS MARCO passage ranking dataset, exploring different rerankers and comparing their quality.
The MS MARCO passage ranking dataset contains:
- ~1 million queries sampled from Bing search logs.
- ~8.8 million passages from real web pages.
- ~550k human annotations marking relevant passages.
You can download the raw data here: https://microsoft.github.io/msmarco/Datasets.html
Due to CPU and RAM constraints, I decided on the following splits:
- Training: 80k queries randomly sampled from the ~500k labeled queries in qrels.train.
- Validation: 20k queries randomly sampled from the same qrels.train pool, disjoint from the training queries.
- Evaluation (dev): 25k queries randomly sampled from the ~50k labeled queries in qrels.dev.
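For illustration, here is a pandas sketch of how such splits could be drawn from the qrels files (the file paths and sampling logic are assumptions; the real code lives in the notebooks and `scripts/`):

```python
import pandas as pd

# MS MARCO qrels format: qid <tab> 0 <tab> pid <tab> relevance, no header.
qrels_train = pd.read_csv("data/qrels.train.tsv", sep="\t",
                          names=["qid", "_", "pid", "rel"])
qrels_dev = pd.read_csv("data/qrels.dev.tsv", sep="\t",
                        names=["qid", "_", "pid", "rel"])

seed = 42
sampled = qrels_train["qid"].drop_duplicates().sample(n=100_000, random_state=seed)
train_qids = sampled.iloc[:80_000]   # 80k queries for training
val_qids = sampled.iloc[80_000:]     # 20k disjoint queries for validation
dev_qids = qrels_dev["qid"].drop_duplicates().sample(n=25_000, random_state=seed)
```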
The dataset is realistic but difficult: queries are often very short, passages noisy, and relevance labels sparse (usually just one relevant passage per query). This makes reranking necessary: retrieval alone is rarely enough.
Before training rerankers, I built candidate sets for each query in the train, val and dev sets.
I used prebuilt indexes from Pyserini:
- BM25 — fast lexical matching, a strong baseline.
- SPLADE (Impact model) — a neural sparse retriever with learned terms.
- Reciprocal Rank Fusion (RRF) — combines rankings from BM25 and SPLADE.
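Below is a minimal sketch of candidate generation with Pyserini prebuilt indexes plus a simple RRF fusion; the prebuilt index and query-encoder names are illustrative and depend on the Pyserini version:

```python
from collections import defaultdict
from pyserini.search.lucene import LuceneSearcher, LuceneImpactSearcher

# BM25 over the prebuilt MS MARCO passage index.
bm25 = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

# SPLADE (learned sparse / impact) index; index and encoder names are illustrative.
splade = LuceneImpactSearcher.from_prebuilt_index(
    "msmarco-v1-passage-splade-pp-ed",
    "naver/splade-cocondenser-ensembledistil",
)

def retrieve(searcher, query, k=100):
    return [(hit.docid, hit.score) for hit in searcher.search(query, k=k)]

def rrf_fuse(rankings, k=60, top_k=100):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, (docid, _) in enumerate(ranking, start=1):
            fused[docid] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]

query = "what is the capital of france"
candidates = rrf_fuse([retrieve(bm25, query), retrieve(splade, query)])
```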
Results on 10k sampled queries (hit rate and recall at different cutoffs K, where K is the number of retrieved candidates):
| Retriever | Hit@100 | Recall@100 | Hit@200 | Recall@200 |
|---|---|---|---|---|
| BM25 | 0.65 | 0.63 | 0.72 | 0.71 |
| SPLADE Impact | 0.92 | 0.91 | 0.95 | 0.95 |
| RRF (fused) | 0.89 | 0.88 | 0.94 | 0.93 |
BM25 misses a noticeable share of true positives, which would hurt the final ranking metrics. However, its rankings and scores can still serve as additional features for the rerankers.
SPLADE is clearly much stronger, with an impressive hit rate of 0.92 at K=100. RRF is a reasonable approach and improves the diversity of results. In the end, I opted for SPLADE as the main retriever due to its superior quality and chose K=100 as a good balance between quality and speed.
I ran both the BM25 and SPLADE models to retrieve 100 candidates with their scores for each query in the train, validation and dev sets. I also double-checked that Hit@100 and Recall@100 were approximately the same as for the 10k sample above, and saved the results as JSON files.
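For reference, a small sketch of how Hit@K and Recall@K can be checked against the qrels (the data structures are illustrative):

```python
def hit_and_recall_at_k(retrieved, qrels, k=100):
    """retrieved: {qid: ranked list of pids}; qrels: {qid: set of relevant pids}."""
    hits, recalls = [], []
    for qid, relevant in qrels.items():
        top_k = set(retrieved.get(qid, [])[:k])
        found = len(top_k & relevant)
        hits.append(1.0 if found > 0 else 0.0)   # Hit@K: at least one relevant passage retrieved
        recalls.append(found / len(relevant))    # Recall@K: share of relevant passages retrieved
    return sum(hits) / len(hits), sum(recalls) / len(recalls)
```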
For rerankers, I built query–passage pairs:
- Positives — relevant passages from the human-labeled qrels files. Most queries have exactly one relevant passage each, but not all of them; the distribution looks like this:
- Negatives — 4 non-relevant passages sampled according to the following rules (sketched in code below):
- "hard negative": 1 random doc from top-10 retrieved candidates by Impact Searcher;
- "hard negative": 1 random doc from top-10 retrieved candidates by BM25;
- "medium negative": 1 random doc from top-11-100 retrieved candidates by Impact Searcher;
- "easy negative": 1 random doc from the rest of the corpus.
This produced datasets containing labeled pairs for training reranker models, saved as parquet tables.
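A sketch of these sampling rules (the function and variable names are hypothetical; the actual logic lives in `build_datasets.ipynb`):

```python
import random

def sample_negatives(qid, positives, splade_top100, bm25_top100, all_pids, rng=random):
    """positives and all_pids are sets of passage ids; *_top100 map qid -> ranked pid list."""
    def pick(pool):
        pool = [pid for pid in pool if pid not in positives]
        return rng.choice(pool)

    return [
        pick(splade_top100[qid][:10]),      # hard: random pid from top-10 of the Impact (SPLADE) searcher
        pick(bm25_top100[qid][:10]),        # hard: random pid from top-10 of BM25
        pick(splade_top100[qid][10:100]),   # medium: random pid from ranks 11-100 of the Impact searcher
        pick(list(all_pids - set(splade_top100[qid]))),  # easy: random pid from the rest of the corpus
    ]
```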
Rerankers need features that describe how well a passage matches a query. While the neural rerankers work on raw text only, a tree-based reranker like CatBoost requires explicit features. I built the following feature set:
- Retrieval scores: BM25 rank/score, SPLADE rank/score, reciprocal rank.
- Dense embeddings: cosine similarity between query and passage using pretrained BGE embeddings.
- tf-idf cosine: simple but effective lexical similarity between query and passage.
- Text overlap and length features: number of overlapping tokens, query length, passage length, ratio features.
- KNRM kernels: token-level similarity histograms with Gaussian kernels, built with FastText and GloVe vectors.
- MaxSim with IDF weighting: per-token maximum similarity weighted by IDF.
- Centroid cosine: cosine similarity between averaged embeddings.
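To make the feature definitions concrete, here is a sketch of a few of them (tf-idf cosine, overlap/length features, and IDF-weighted MaxSim); it assumes the `TfidfVectorizer` has already been fit on the passage corpus and that token embeddings are precomputed and L2-normalized:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexical_features(query, passage, tfidf: TfidfVectorizer):
    q_vec, p_vec = tfidf.transform([query]), tfidf.transform([passage])
    q_tok, p_tok = set(query.lower().split()), set(passage.lower().split())
    overlap = len(q_tok & p_tok)
    return {
        "tfidf_cosine": float(cosine_similarity(q_vec, p_vec)[0, 0]),
        "token_overlap": overlap,
        "query_len": len(q_tok),
        "passage_len": len(p_tok),
        "overlap_ratio": overlap / max(len(q_tok), 1),
    }

def maxsim_idf(q_emb, p_emb, q_idf):
    """IDF-weighted MaxSim: for each query token embedding, take its best
    cosine match among passage token embeddings, then average with IDF weights."""
    sim = q_emb @ p_emb.T                      # (n_q, n_p) cosine similarities
    return float(np.average(sim.max(axis=1), weights=q_idf))
```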
Below is a feature importance plot estimated using SHAP:
I trained and compared three reranker approaches.

CatBoost reranker:
- Used all handcrafted features.
- Hyperparameters tuned with Optuna.
- Training was efficient on GPU.
- Achieved the best overall ranking performance.
- Validation NDCG@10: 0.8623
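A sketch of how such a ranker can be trained with CatBoost's ranking API (the toy data, loss function, and hyperparameters are illustrative, not the tuned ones from `cb_ranker.ipynb`):

```python
import numpy as np
from catboost import CatBoostRanker, Pool

# Toy placeholders: in the project these are the handcrafted feature tables built above.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)
qid_train = np.repeat(np.arange(200), 5)    # 200 queries x 5 candidates, grouped contiguously
X_val, y_val = rng.normal(size=(250, 20)), rng.integers(0, 2, size=250)
qid_val = np.repeat(np.arange(50), 5)

train_pool = Pool(data=X_train, label=y_train, group_id=qid_train)
val_pool = Pool(data=X_val, label=y_val, group_id=qid_val)

ranker = CatBoostRanker(
    loss_function="YetiRank",        # pairwise ranking objective (illustrative choice)
    eval_metric="NDCG:top=10",
    iterations=500,
    learning_rate=0.05,
    early_stopping_rounds=50,        # task_type="GPU" can be added for GPU training
)
ranker.fit(train_pool, eval_set=val_pool, verbose=100)
scores = ranker.predict(X_val)       # higher score = more relevant within each query group
```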
The two approaches below are inspired by the K-NRM architecture discussed in this paper. K-NRM is a neural ranking model that compares every query word with every document word in an embedding space. Instead of relying only on exact matches or simple averages of similarity, it groups the word pairs into smooth "bins" called kernels, which represent different levels of closeness such as exact match, near match, weak relation, or mismatch. The model then learns how much each level should matter for ranking. During training, we can adjust either the kernel weights alone or both the word embeddings and the kernel weights, so that useful synonyms end up in stronger bins while misleading pairs are "pushed away". In this way K-NRM captures not just literal matches but also close meanings that people naturally consider relevant.

fasttext-KNRM reranker:
- Embeddings: FastText (300d), frozen.
- KNRM with 21 Gaussian kernels → MLP with hidden layers [16, 8].
- Training: pairwise logit loss on triplets, each consisting of a query, a positive passage, and a negative passage.
- Optimizer: AdamW with a cosine scheduler.
- Validation NDCG@10: 0.6725
BERT-KNRM reranker:
- Embeddings: bert-base-uncased (768d), frozen.
- KNRM with 21 Gaussian kernels → MLP with hidden layers [16, 8].
- Training was heavier; early stopping kicked in after 4 epochs.
- Validation NDCG@10: 0.6676
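Here is a minimal PyTorch sketch of the kernel-pooling scorer described above, with frozen embeddings, 21 Gaussian kernels, and an MLP with hidden sizes [16, 8] (the embedding lookup and training loop are omitted):

```python
import torch
import torch.nn as nn

class KNRM(nn.Module):
    def __init__(self, n_kernels=21, sigma=0.1, exact_sigma=0.001):
        super().__init__()
        # Kernel means are spread over [-1, 1]; the last kernel captures near-exact matches.
        mus = torch.linspace(-1.0, 1.0, n_kernels)
        sigmas = torch.full((n_kernels,), sigma)
        sigmas[-1] = exact_sigma
        self.register_buffer("mu", mus.view(1, 1, 1, -1))
        self.register_buffer("sigma", sigmas.view(1, 1, 1, -1))
        self.mlp = nn.Sequential(
            nn.Linear(n_kernels, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, q_emb, d_emb, q_mask, d_mask):
        # q_emb: (B, Lq, dim), d_emb: (B, Ld, dim) frozen token embeddings; masks are 0/1 floats.
        q = nn.functional.normalize(q_emb, dim=-1)
        d = nn.functional.normalize(d_emb, dim=-1)
        sim = torch.einsum("bqe,bde->bqd", q, d).unsqueeze(-1)      # (B, Lq, Ld, 1)
        kernels = torch.exp(-0.5 * (sim - self.mu) ** 2 / self.sigma ** 2)
        kernels = kernels * d_mask[:, None, :, None]                # zero out padded passage tokens
        pooled = kernels.sum(dim=2).clamp(min=1e-10).log()          # (B, Lq, n_kernels)
        pooled = (pooled * q_mask[:, :, None]).sum(dim=1)           # (B, n_kernels)
        return self.mlp(pooled).squeeze(-1)                         # one relevance score per pair

# Pairwise logit loss over (query, positive, negative) triplets:
# loss = -nn.functional.logsigmoid(model(q, pos, qm, pm) - model(q, neg, qm, nm)).mean()
```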
Final performance on the dev (test) set. I first applied the SPLADE retriever to fetch 100 candidates for each query in the dev set, then reranked them with each trained reranker and measured quality against the ground truth with MRR@10 and NDCG@K. I also include baseline results (raw BM25 ranking, i.e. no reranker).
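For completeness, here is how the two main metrics are defined with binary relevance labels (a sketch, not the exact evaluation code):

```python
import math

def mrr_at_k(ranked_pids, relevant, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0 if none)."""
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_pids, relevant, k=10):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, pid in enumerate(ranked_pids[:k], start=1) if pid in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```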
Stage 1
| Retriever | Hit@100 | Recall@100 |
|---|---|---|
| SPLADE model | 0.92 | 0.90 |
Stage 2
| Reranker | MRR@10 | NDCG@10 | NDCG@20 | NDCG@100 |
|---|---|---|---|---|
| BM25 baseline | 0.2103 | 0.264 | 0.2927 | 0.3561 |
| CatBoost rerank | 0.3252 | 0.3778 | 0.4103 | 0.4551 |
| fasttext-KNRM | 0.1315 | 0.1735 | 0.2103 | 0.2942 |
| BERT-KNRM | 0.1402 | 0.1841 | 0.2203 | 0.3012 |
The CatBoost model with handcrafted features was the strongest, giving the best MRR@10 and NDCG@10 scores and a large improvement over the BM25 baseline.
The neural KNRM models (fasttext and BERT) worked but did not reach CatBoost's level and performed worse than the BM25 baseline. This is probably due to several factors: limited compute, not using all of the available training data, and the lack of features beyond raw text.
Overall, the results show that careful feature engineering and a strong tree-based reranker can outperform heavier neural models in this setup and provide decent quality.
✅ Strengths
- SPLADE retriever gave strong coverage.
- CatBoost with handcrafted features outperformed the neural rerankers in this setup and showed very decent results even next to much heavier SOTA models.
- Features like BGE cosine similarity, SPLADE score/rank and reciprocal rank, BM25 reciprocal rank, tf-idf cosine similarity, and the MaxSim feature (best per-token matches between query and passage) proved the strongest, while other handcrafted features (token overlap, kernel features, etc.) added some value as well.
⚠️ Limitations
- Hardware (RAM, CPU, GPU) limited the experiments.
- I was unable to run dense or hybrid retrievers due to issues with the Pyserini library and their high resource cost.
- Some planned CatBoost features (Jaccard similarity over character n-grams) were skipped due to memory cost.
- The KNRM models showed modest results with frozen embeddings and overfitted easily when the word embeddings were trained along with the kernel parameters.
- BERT-KNRM gave little boost over fasttext-KNRM.
This project walks through the full retrieval → reranking → evaluation pipeline. Key takeaways:
- Two-stage ranking remains the standard for balancing efficiency and quality.
- Strong retrievers like SPLADE ensure relevant candidates are captured.
- Handcrafted features are still very effective when combined with tree-based models.
- Neural rerankers require more resources but have potential with better fine-tuning.