Smart Matching Algorithm
Technical deep-dive into PaperFox's reviewer-paper assignment algorithm
PaperFox uses Smart Matching to automatically assign reviewers to papers. The algorithm combines text similarity scoring with Min-Cost Max-Flow network optimization to find the globally optimal assignment across all papers simultaneously.
Architecture Overview
Data Collection → Scoring → COI Filtering → MCMF Assignment → Output

| Stage | What it does |
|---|---|
| Data Collection | Fetch papers, reviewers, conflicts, system keywords |
| Scoring | BM25 or OpenAI embeddings (0–70) + keyword phrase matching (0–30) |
| COI Filtering | Block reviewer-paper pairs with conflicts |
| MCMF Assignment | Find globally optimal assignment via network flow |
| Output | Suggestions with scores, reasoning, workload summary |
Step 1: Data Collection
Papers and reviewers are fetched in parallel from the database.
Reviewer expertise tags come from three sources (deduplicated case-insensitively):
| Source | Description |
|---|---|
| AI-discovered | Tags from AI profile discovery (aiExpertiseTags) |
| User-declared | Manually added by the reviewer (userExpertiseTags) |
| System keywords | Extracted from reviewer's own paper submissions in PaperFox (top 20 by frequency, from 20 most recent papers) |
A reviewer with any tags is "profiled" and matched by expertise. Reviewers with no tags are assigned as fallback to fill remaining slots.
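The three-source merge with case-insensitive deduplication can be sketched as follows (a minimal illustration; the function and parameter names are assumptions, not the actual PaperFox code):

```typescript
// Merge expertise tags from all three sources, deduplicating
// case-insensitively while keeping the first-seen casing.
function mergeExpertiseTags(
  aiExpertiseTags: string[],
  userExpertiseTags: string[],
  systemKeywords: string[],
): string[] {
  const seen = new Map<string, string>();
  for (const tag of [...aiExpertiseTags, ...userExpertiseTags, ...systemKeywords]) {
    const key = tag.trim().toLowerCase();
    if (key && !seen.has(key)) seen.set(key, tag.trim());
  }
  return [...seen.values()];
}

// A reviewer is "profiled" iff the merged list is non-empty.
const isProfiled = (tags: string[]) => tags.length > 0;
```

With this merge, a reviewer who declared "nlp" while AI discovery found "NLP" ends up with a single tag, so the profiled/fallback split depends only on whether any source produced tags at all.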
Conflicts are fetched via getUserConflicts() which returns:
- `USER_DECLARED` — self-disclosed by the reviewer
- `SYSTEM_DETECTED` — co-authors computed live from the Author table (last 3 years)
Step 2: Scoring
Each profiled reviewer is scored against each paper. The score has two components:
| Component | Range | Weight |
|---|---|---|
| Text similarity | 0–70 | Primary signal |
| Keyword overlap | 0–30 | Direct keyword matching bonus |
matchScore = min(100, textScore + keywordScore)

Text Similarity (two options)
BM25 (default, free): Uses Okapi BM25 — an improvement over TF-IDF used by systems like OpenReview's TPMS. BM25 adds term frequency saturation (a word appearing 10x doesn't score 10x higher) and document length normalization (short abstracts aren't penalized vs long ones).
- Paper `title + abstract` is tokenized (title tokens duplicated 2x for emphasis) using the `natural` NLP library: WordTokenizer for word splitting, Porter Stemmer for normalization, and `natural`'s 170 English stop words plus academic-specific terms (propose, method, results, evaluate, etc.) that appear in every abstract but carry no matching value
- Reviewer expertise tags are tokenized as the query (same pipeline)
- BM25 IDF formula: `log((N - df + 0.5) / (df + 0.5) + 1)` — never produces negative values
- Scores normalized by max score across all pairs
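The IDF variant above is a one-liner; this sketch (not the PaperFox source) shows why the `+ 1` inside the log matters:

```typescript
// Okapi BM25 IDF with the +1 inside the log. The classic formulation
// log((N - df + 0.5) / (df + 0.5)) goes negative when a term appears
// in more than half the documents; this variant stays positive.
function bm25Idf(N: number, df: number): number {
  return Math.log((N - df + 0.5) / (df + 0.5) + 1);
}
```

For example, with N = 100 documents, a term appearing in 99 of them still gets a small positive IDF instead of a large negative one, so very common terms are merely down-weighted rather than actively penalized.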
AI Semantic (credits):
Uses OpenAI text embeddings (text-embedding-3-small) for semantic understanding: it knows that "NLP" ≈ "natural language processing" and that "deep learning" is close to "neural networks".
- Paper and reviewer texts embedded as dense vectors
- Cosine similarity between vectors
- Scores normalized by max score across all pairs
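The cosine-similarity step above is straightforward (a sketch; OpenAI embeddings are unit-normalized, in which case this reduces to a plain dot product):

```typescript
// Cosine similarity between two embedding vectors: dot product
// divided by the product of their magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}
```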
Keyword Phrase Matching
Both scoring methods also compute keyword overlap:
- Phase 1: Exact phrase match (case-insensitive) — "machine learning" matches "machine learning" as a whole phrase
- Phase 2: Porter-stemmed word-level fallback for unmatched keywords — catches partial overlaps (e.g., "computing" and "computational" both stem to "comput")
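The two-phase overlap can be sketched like this. Note the real pipeline uses `natural`'s PorterStemmer; the `stem` helper below is a toy stand-in for illustration only:

```typescript
// Toy stemmer standing in for natural's PorterStemmer: it maps both
// "computing" and "computational" to "comput", as in the example above.
const stem = (w: string) =>
  w.toLowerCase().replace(/(ational|ing|ers?|s)$/, "");

function matchingKeywords(reviewerTags: string[], paperKeywords: string[]): string[] {
  const matched: string[] = [];
  const paperLower = paperKeywords.map((k) => k.toLowerCase());
  for (const tag of reviewerTags) {
    // Phase 1: exact, case-insensitive whole-phrase match.
    if (paperLower.includes(tag.toLowerCase())) {
      matched.push(tag);
      continue;
    }
    // Phase 2: stemmed word-level fallback for unmatched keywords.
    const tagStems = tag.toLowerCase().split(/\s+/).map(stem);
    const paperStems = new Set(paperLower.flatMap((k) => k.split(/\s+/).map(stem)));
    if (tagStems.some((s) => paperStems.has(s))) matched.push(tag);
  }
  return matched;
}
```

Running phrase matching first means "machine learning" is credited as one concept before the word-level fallback gets a chance to match its individual words against unrelated keywords.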
Step 3: COI Filtering
The reviewer's conflicts (user-declared + system-detected co-authors) are checked against paper authors by name, email, or affiliation. Any match blocks the assignment — the pair has no edge in the flow network, so the solver cannot assign them.
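The blocking check amounts to a pairwise comparison; a minimal sketch (the `Person` shape and `hasConflict` name are illustrative, the real data comes from getUserConflicts()):

```typescript
interface Person {
  name: string;
  email: string;
  affiliation: string;
}

// A reviewer-paper pair is blocked if any of the reviewer's conflicts
// matches any paper author by name, email, or affiliation.
function hasConflict(conflicts: Person[], authors: Person[]): boolean {
  const norm = (s: string) => s.trim().toLowerCase();
  return authors.some((a) =>
    conflicts.some(
      (c) =>
        (norm(c.name) !== "" && norm(c.name) === norm(a.name)) ||
        (norm(c.email) !== "" && norm(c.email) === norm(a.email)) ||
        (norm(c.affiliation) !== "" && norm(c.affiliation) === norm(a.affiliation)),
    ),
  );
}
```

Because a blocked pair simply gets no edge in the flow network, COI enforcement is absolute: no score, however high, can route flow through a missing edge.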
Step 4: Optimal Assignment (Min-Cost Max-Flow)
Smart Matching models the entire assignment as a minimum-cost flow problem and solves it optimally using the Successive Shortest Path algorithm with SPFA. This finds the assignment that maximizes total match quality across all papers simultaneously — the same approach used by major conference systems like OpenReview.
Flow Network Structure
Source ──→ Papers ──→ Reviewers ──→ Sink

| Edge | Capacity | Cost |
|---|---|---|
| Source → Paper | reviewersPerPaper | 0 |
| Paper → Profiled Reviewer | 1 | -matchScore × 100 |
| Paper → Non-Profiled Reviewer | 1 | small random (-1 to -9) |
| Reviewer → Sink | maxPapersPerReviewer - existingWorkload | 0 (or escalating for balance) |
| Overflow → Sink | 2 | 100,000 (last resort) |
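The edge table above translates into code roughly as follows (a sketch with assumed node numbering and parameter names, not the actual solver in lib/assignments/mcmf-solver.ts; the overflow edges are omitted for brevity):

```typescript
interface Edge {
  from: number;
  to: number;
  capacity: number;
  cost: number;
}

function buildEdges(
  source: number,
  sink: number,
  papers: number[],
  profiled: Map<number, Map<number, number>>, // paper node -> reviewer node -> matchScore
  fallback: number[], // non-profiled reviewer nodes
  reviewerSlots: Map<number, number>, // maxPapersPerReviewer - existingWorkload
  reviewersPerPaper: number,
): Edge[] {
  const edges: Edge[] = [];
  for (const p of papers) {
    edges.push({ from: source, to: p, capacity: reviewersPerPaper, cost: 0 });
    for (const [r, score] of profiled.get(p) ?? new Map<number, number>()) {
      // Profiled match: cost = -matchScore * 100, so higher scores are cheaper.
      edges.push({ from: p, to: r, capacity: 1, cost: -score * 100 });
    }
    for (const r of fallback) {
      // Small random negative cost: fair among fallbacks, always dominated
      // by any profiled match.
      edges.push({ from: p, to: r, capacity: 1, cost: -(1 + Math.floor(Math.random() * 9)) });
    }
  }
  for (const [r, slots] of reviewerSlots) {
    edges.push({ from: r, to: sink, capacity: slots, cost: 0 });
  }
  return edges;
}
```

Minimizing total cost over this network is then equivalent to maximizing total match quality, subject to the per-paper and per-reviewer capacities.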
Reviewer Priority
The cost structure creates a natural priority:
- Profiled reviewers — strongly preferred (cost = -7000 for score 70)
- Non-profiled reviewers — fallback (cost = -1 to -9, randomized for fairness)
- Overflow — over-capacity at extreme cost (100,000), only when no other option exists
Balance Mode
When Balance Workload is enabled, reviewer-to-sink edges use escalating costs:
- 1st paper assigned: cost = 0
- 2nd paper: cost = 500
- 3rd paper: cost = 1000
- ...
This penalizes overloading any single reviewer, encouraging an even distribution even at the cost of slightly lower expertise matches.
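In flow terms this is typically modeled as parallel unit-capacity reviewer-to-sink edges with escalating costs; a sketch using the values above:

```typescript
// Balance Mode sink-edge costs: the k-th paper assigned to a reviewer
// costs (k - 1) * 500, so each additional assignment gets pricier.
function balanceCosts(slots: number): number[] {
  return Array.from({ length: slots }, (_, k) => k * 500);
}
```

For a reviewer with 3 remaining slots this yields [0, 500, 1000]: the solver will spread papers across reviewers unless a concentrated assignment improves total match quality by more than the escalation penalty.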
Implementation
The solver is implemented in pure TypeScript with no external dependencies (lib/assignments/mcmf-solver.ts). It uses an index-based SPFA queue for O(1) dequeue performance. For typical conference sizes (50–500 papers, 20–200 reviewers), it completes in milliseconds.
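The SPFA core with an index-based queue looks roughly like this (a self-contained sketch of the shortest-path step, not the actual mcmf-solver.ts; SPFA is used because the network's negative edge costs rule out plain Dijkstra):

```typescript
// SPFA (queue-based Bellman-Ford) single-source shortest paths.
// adj[u] is a list of [v, cost] pairs; costs may be negative, but the
// graph must contain no negative cycle (true for this flow network).
function spfa(n: number, adj: [number, number][][], source: number): number[] {
  const dist = new Array<number>(n).fill(Infinity);
  const inQueue = new Array<boolean>(n).fill(false);
  const queue: number[] = [source];
  let head = 0; // index-based dequeue: O(1), no Array.shift()
  dist[source] = 0;
  inQueue[source] = true;
  while (head < queue.length) {
    const u = queue[head++];
    inQueue[u] = false;
    for (const [v, cost] of adj[u]) {
      if (dist[u] + cost < dist[v]) {
        dist[v] = dist[u] + cost;
        if (!inQueue[v]) {
          queue.push(v);
          inQueue[v] = true;
        }
      }
    }
  }
  return dist;
}
```

Successive Shortest Path repeatedly runs this search on the residual graph and pushes flow along the cheapest augmenting path until no more flow fits.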
Output
Each paper receives a list of recommended reviewers:
| Field | Description |
|---|---|
| matchScore | 0–100 for profiled, null for non-profiled |
| assignmentType | EXPERTISE_MATCH or RANDOM_NO_PROFILE |
| reasoning | Human-readable explanation (e.g., "text similarity: 45/70; keyword overlap: 20/30; matching keywords: NLP, deep learning") |
| expertiseOverlap | Actual matching keywords for display |
The workload summary shows existing + new breakdown per reviewer, so chairs can see the impact before applying.
References
- Okapi BM25 — text similarity scoring
- Min-Cost Max-Flow — optimal assignment formulation
- Successive Shortest Path Algorithm — MCMF solver
- SPFA (Shortest Path Faster Algorithm) — shortest path finding
- OpenReview Matcher — similar approach used by NeurIPS/ICML
- NeurIPS 2024 Paper-Reviewer Assignment — related research
- OpenAI Embeddings — AI semantic scoring
- Cosine Similarity — embedding vector comparison