PaperFox
Engineering

Smart Matching Algorithm

Technical deep-dive into PaperFox's reviewer-paper assignment algorithm

PaperFox uses Smart Matching to automatically assign reviewers to papers. The algorithm combines text similarity scoring with Min-Cost Max-Flow network optimization to find the globally optimal assignment across all papers simultaneously.


Architecture Overview

Data Collection → Scoring → COI Filtering → MCMF Assignment → Output
| Stage | What it does |
| --- | --- |
| Data Collection | Fetch papers, reviewers, conflicts, system keywords |
| Scoring | BM25 or OpenAI embeddings (0–70) + keyword phrase matching (0–30) |
| COI Filtering | Block reviewer-paper pairs with conflicts |
| MCMF Assignment | Find globally optimal assignment via network flow |
| Output | Suggestions with scores, reasoning, workload summary |

Step 1: Data Collection

Papers and reviewers are fetched in parallel from the database.

Reviewer expertise tags come from three sources (deduplicated case-insensitively):

| Source | Description |
| --- | --- |
| AI-discovered | Tags from AI profile discovery (aiExpertiseTags) |
| User-declared | Manually added by the reviewer (userExpertiseTags) |
| System keywords | Extracted from the reviewer's own paper submissions in PaperFox (top 20 by frequency, from the 20 most recent papers) |

A reviewer with at least one tag is "profiled" and matched by expertise. Reviewers with no tags are assigned as a fallback to fill any remaining slots.
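A minimal sketch of that merge-and-dedup step; the function and variable names here are illustrative, not PaperFox's actual code:

```typescript
// Hypothetical sketch: merge the three tag sources, deduplicating
// case-insensitively while keeping the first-seen casing.
function mergeExpertiseTags(
  aiTags: string[],
  userTags: string[],
  systemKeywords: string[],
): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const tag of [...aiTags, ...userTags, ...systemKeywords]) {
    const key = tag.trim().toLowerCase();
    if (key !== "" && !seen.has(key)) {
      seen.add(key);
      merged.push(tag.trim());
    }
  }
  return merged;
}

// A reviewer is "profiled" if the merged list is non-empty.
const isProfiled = (tags: string[]): boolean => tags.length > 0;
```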

Conflicts are fetched via getUserConflicts() which returns:

  • USER_DECLARED — self-disclosed by the reviewer
  • SYSTEM_DETECTED — co-authors computed live from the Author table (last 3 years)

Step 2: Scoring

Each profiled reviewer is scored against each paper. The score has two components:

| Component | Range | Weight |
| --- | --- | --- |
| Text similarity | 0–70 | Primary signal |
| Keyword overlap | 0–30 | Direct keyword-matching bonus |

matchScore = min(100, textScore + keywordScore)
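The combination is a straightforward clamped sum; a sketch, assuming the two component scores are already computed:

```typescript
// Combine the two score components; clamping keeps the total in
// the 0-100 range even if both components max out.
function matchScore(textScore: number, keywordScore: number): number {
  return Math.min(100, textScore + keywordScore);
}
```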

Text Similarity (two options)

BM25 (default, free): Uses Okapi BM25 — an improvement over TF-IDF used by systems like OpenReview's TPMS. BM25 adds term frequency saturation (a word appearing 10x doesn't score 10x higher) and document length normalization (short abstracts aren't penalized vs long ones).

  • Paper title + abstract is tokenized (title tokens duplicated 2x for emphasis) using the natural NLP library: WordTokenizer for word splitting, Porter Stemmer for normalization, and natural's 170 English stop words plus academic-specific terms (propose, method, results, evaluate, etc.) that appear in every abstract but carry no matching value
  • Reviewer expertise tags are tokenized as the query (same pipeline)
  • BM25 IDF formula: log((N - df + 0.5) / (df + 0.5) + 1) — never produces negative values
  • Scores normalized by max score across all pairs
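The BM25 ingredients above can be sketched as follows; the k1 and b values are the common Okapi defaults, an assumption rather than something confirmed from PaperFox's source:

```typescript
// BM25 IDF in the "+1" form: log((N - df + 0.5) / (df + 0.5) + 1).
// Unlike classic IDF, this never goes negative, even for terms that
// appear in most documents.
function bm25Idf(N: number, df: number): number {
  return Math.log((N - df + 0.5) / (df + 0.5) + 1);
}

// Per-term BM25 contribution with term-frequency saturation (k1)
// and document-length normalization (b). k1/b are assumed defaults.
function bm25Term(
  tf: number, df: number, N: number,
  docLen: number, avgDocLen: number,
  k1 = 1.2, b = 0.75,
): number {
  const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
  return bm25Idf(N, df) * ((tf * (k1 + 1)) / norm);
}
```

The saturation effect is easy to see: raising a term's frequency from 1 to 10 roughly doubles its contribution rather than multiplying it by ten.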

AI Semantic (credits): Uses OpenAI text embeddings (text-embedding-3-small) for semantic understanding, so it knows that "NLP" ≈ "natural language processing" and "deep learning" ≈ "neural networks".

  • Paper and reviewer texts embedded as dense vectors
  • Cosine similarity between vectors
  • Scores normalized by max score across all pairs
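The similarity step itself is plain cosine similarity; a sketch, with the dense vectors standing in for what the OpenAI embeddings API would return:

```typescript
// Cosine similarity between two dense vectors: dot product divided
// by the product of their magnitudes. Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```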

Keyword Phrase Matching

Both scoring methods also compute keyword overlap:

  1. Phase 1: Exact phrase match (case-insensitive) — "machine learning" matches "machine learning" as a whole phrase
  2. Phase 2: Porter-stemmed word-level fallback for unmatched keywords — catches partial overlaps (e.g., "computing" and "computational" both stem to "comput")
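A sketch of the two-phase matching; the real pipeline uses natural's Porter stemmer, which is swapped here for a crude suffix-stripping stand-in so the example stays self-contained:

```typescript
// Crude stand-in for a Porter stemmer: strips one common suffix.
// Both "computing" and "computational" reduce to "comput".
const crudeStem = (w: string): string =>
  w.toLowerCase().replace(/(ational|ing|ers?|s)$/, "");

function matchKeywords(paperText: string, keywords: string[]): string[] {
  const text = paperText.toLowerCase();
  const matched: string[] = [];
  const unmatched: string[] = [];
  // Phase 1: exact, case-insensitive phrase match.
  for (const kw of keywords) {
    (text.includes(kw.toLowerCase()) ? matched : unmatched).push(kw);
  }
  // Phase 2: stemmed word-level fallback for keywords phase 1 missed.
  const textStems = new Set(text.split(/\W+/).map(crudeStem));
  for (const kw of unmatched) {
    if (kw.split(/\W+/).some((w) => textStems.has(crudeStem(w)))) {
      matched.push(kw);
    }
  }
  return matched;
}
```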

Step 3: COI Filtering

The reviewer's conflicts (user-declared + system-detected co-authors) are checked against paper authors by name, email, or affiliation. Any match blocks the assignment — the pair has no edge in the flow network, so the solver cannot assign them.
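A sketch of that check under assumed data shapes; the interface and field names are illustrative, not PaperFox's schema:

```typescript
// Illustrative shapes; a conflict record may only carry some fields.
interface Conflict { name?: string; email?: string; affiliation?: string }
interface Author { name: string; email: string; affiliation: string }

// A pair is blocked if ANY conflict matches ANY author on name,
// email, or affiliation (case-insensitive, whitespace-trimmed).
function hasConflict(conflicts: Conflict[], authors: Author[]): boolean {
  const eq = (a: string | undefined, b: string | undefined): boolean =>
    a !== undefined && b !== undefined && a !== "" && b !== "" &&
    a.trim().toLowerCase() === b.trim().toLowerCase();
  return conflicts.some((c) =>
    authors.some((p) =>
      eq(c.name, p.name) || eq(c.email, p.email) || eq(c.affiliation, p.affiliation),
    ),
  );
}
```

When this returns true, the pair simply gets no edge in the flow network, so the solver can never pick it.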


Step 4: Optimal Assignment (Min-Cost Max-Flow)

Smart Matching models the entire assignment as a minimum-cost flow problem and solves it optimally using the Successive Shortest Path algorithm with SPFA. This finds the assignment that maximizes total match quality across all papers simultaneously — the same approach used by major conference systems like OpenReview.

Flow Network Structure

Source ──→ Papers ──→ Reviewers ──→ Sink
| Edge | Capacity | Cost |
| --- | --- | --- |
| Source → Paper | reviewersPerPaper | 0 |
| Paper → Profiled Reviewer | 1 | -matchScore × 100 |
| Paper → Non-Profiled Reviewer | 1 | small random (-1 to -9) |
| Reviewer → Sink | maxPapersPerReviewer - existingWorkload | 0 (or escalating for balance) |
| Overflow → Sink | 2 | 100,000 (last resort) |
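The edge list above can be sketched as plain data (node identifiers and the function name are illustrative; the overflow edges are omitted for brevity):

```typescript
interface Edge { from: string; to: string; capacity: number; cost: number }

// Build the core edge list. Costs are negated scores so that
// minimizing total cost maximizes total match quality.
function buildEdges(
  papers: string[],
  profiled: Map<string, number>,   // reviewerId -> matchScore (0-100)
  nonProfiled: string[],
  reviewersPerPaper: number,
  slots: Map<string, number>,      // reviewerId -> remaining capacity
): Edge[] {
  const edges: Edge[] = [];
  for (const p of papers) {
    edges.push({ from: "source", to: p, capacity: reviewersPerPaper, cost: 0 });
    for (const [r, score] of profiled) {
      edges.push({ from: p, to: r, capacity: 1, cost: -score * 100 });
    }
    for (const r of nonProfiled) {
      // Small random negative cost so fallback ordering is fair.
      const cost = -(1 + Math.floor(Math.random() * 9));
      edges.push({ from: p, to: r, capacity: 1, cost });
    }
  }
  for (const [r, cap] of slots) {
    edges.push({ from: r, to: "sink", capacity: cap, cost: 0 });
  }
  return edges;
}
```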

Reviewer Priority

The cost structure creates a natural priority:

  1. Profiled reviewers — strongly preferred (cost = -7000 for score 70)
  2. Non-profiled reviewers — fallback (cost = -1 to -9, randomized for fairness)
  3. Overflow — over-capacity at extreme cost (100,000), only when no other option exists

Balance Mode

When Balance Workload is enabled, reviewer-to-sink edges use escalating costs:

  • 1st paper assigned: cost = 0
  • 2nd paper: cost = 500
  • 3rd paper: cost = 1000
  • ...

This penalizes overloading any single reviewer, encouraging an even distribution, even at the cost of slightly weaker expertise matches.
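The escalating cost is a linear function of how many papers the reviewer already holds in this run; the 500-point step is taken from the bullets above:

```typescript
// Cost of the k-th sink-edge unit for a reviewer in Balance mode:
// the k-th paper assigned costs (k - 1) * 500.
function balanceCost(paperIndex: number, step = 500): number {
  return (paperIndex - 1) * step; // paperIndex is 1-based
}
```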

Implementation

The solver is implemented in pure TypeScript with no external dependencies (lib/assignments/mcmf-solver.ts). It uses an index-based SPFA queue for O(1) dequeue performance. For typical conference sizes (50–500 papers, 20–200 reviewers), it completes in milliseconds.
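To make the mechanics concrete, here is a compact sketch of an SPFA-based Successive Shortest Path solver on an adjacency-list residual graph. It is illustrative only, not the actual lib/assignments/mcmf-solver.ts:

```typescript
// Min-cost max-flow via Successive Shortest Paths, using SPFA
// (queue-based Bellman-Ford) so negative edge costs are handled.
class MCMF {
  private head: number[];
  private nxt: number[] = []; private to: number[] = [];
  private cap: number[] = []; private cost: number[] = [];
  constructor(private n: number) { this.head = new Array(n).fill(-1); }

  // Forward and reverse edges are stored as a pair, so edge e's
  // reverse is always e ^ 1.
  addEdge(u: number, v: number, cap: number, cost: number): void {
    this.to.push(v); this.cap.push(cap); this.cost.push(cost);
    this.nxt.push(this.head[u]); this.head[u] = this.to.length - 1;
    this.to.push(u); this.cap.push(0); this.cost.push(-cost);
    this.nxt.push(this.head[v]); this.head[v] = this.to.length - 1;
  }

  // Returns [maxFlow, minCost] from s to t.
  solve(s: number, t: number): [number, number] {
    let flow = 0, total = 0;
    for (;;) {
      const dist = new Array(this.n).fill(Infinity);
      const inQueue = new Array(this.n).fill(false);
      const prevEdge = new Array(this.n).fill(-1);
      dist[s] = 0;
      const queue = [s]; let qi = 0; // index-based O(1) dequeue
      inQueue[s] = true;
      while (qi < queue.length) {
        const u = queue[qi++]; inQueue[u] = false;
        for (let e = this.head[u]; e !== -1; e = this.nxt[e]) {
          const v = this.to[e];
          if (this.cap[e] > 0 && dist[u] + this.cost[e] < dist[v]) {
            dist[v] = dist[u] + this.cost[e];
            prevEdge[v] = e;
            if (!inQueue[v]) { inQueue[v] = true; queue.push(v); }
          }
        }
      }
      if (dist[t] === Infinity) break; // no augmenting path left
      // Bottleneck capacity along the cheapest path, then push flow.
      let push = Infinity;
      for (let v = t; v !== s; v = this.to[prevEdge[v] ^ 1]) {
        push = Math.min(push, this.cap[prevEdge[v]]);
      }
      for (let v = t; v !== s; v = this.to[prevEdge[v] ^ 1]) {
        this.cap[prevEdge[v]] -= push;
        this.cap[prevEdge[v] ^ 1] += push;
      }
      flow += push; total += push * dist[t];
    }
    return [flow, total];
  }
}
```

On a toy network (one paper, two profiled reviewers with scores 70 and 50, one slot), the solver routes the single unit of flow through the score-70 reviewer, since -7000 is the cheaper edge.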


Output

Each paper receives a list of recommended reviewers:

| Field | Description |
| --- | --- |
| matchScore | 0–100 for profiled, null for non-profiled |
| assignmentType | EXPERTISE_MATCH or RANDOM_NO_PROFILE |
| reasoning | Human-readable explanation (e.g., "text similarity: 45/70; keyword overlap: 20/30; matching keywords: NLP, deep learning") |
| expertiseOverlap | Actual matching keywords for display |

The workload summary shows existing + new breakdown per reviewer, so chairs can see the impact before applying.

