PaperFox
Engineering

Smart Matching Algorithm

Technical deep-dive into PaperFox's reviewer-paper assignment algorithm

PaperFox uses Smart Matching to automatically assign reviewers to papers. The algorithm combines text similarity scoring with Min-Cost Max-Flow network optimization to find the globally optimal assignment across all papers simultaneously.


Architecture Overview

Data Collection → Scoring → COI Filtering → MCMF Assignment → Output
| Stage | What it does |
| --- | --- |
| Data Collection | Fetch papers, reviewers, conflicts, system keywords |
| Scoring | BM25 or OpenAI embeddings (0–70) + keyword phrase matching (0–30) |
| COI Filtering | Block reviewer-paper pairs with conflicts |
| MCMF Assignment | Find globally optimal assignment via network flow |
| Output | Suggestions with scores, reasoning, workload summary |

Step 1: Data Collection

Papers and reviewers are fetched in parallel from the database.

Reviewer expertise tags come from three sources (deduplicated case-insensitively):

| Source | Description |
| --- | --- |
| AI-discovered | Tags from AI profile discovery (aiExpertiseTags) |
| User-declared | Manually added by the reviewer (userExpertiseTags) |
| System keywords | Extracted from the reviewer's own paper submissions in PaperFox (top 20 by frequency, from the 20 most recent papers) |

A reviewer with at least one tag is "profiled" and matched by expertise. Reviewers with no tags are assigned as a fallback to fill any remaining slots.
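A minimal sketch of that merge-and-dedup step; the function and variable names here are illustrative, not PaperFox's actual code:

```typescript
// Hypothetical sketch: merge the three tag sources, deduplicating
// case-insensitively while keeping the first-seen casing.
function mergeExpertiseTags(
  aiTags: string[],
  userTags: string[],
  systemKeywords: string[],
): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const tag of [...aiTags, ...userTags, ...systemKeywords]) {
    const key = tag.trim().toLowerCase();
    if (key !== "" && !seen.has(key)) {
      seen.add(key);
      merged.push(tag.trim());
    }
  }
  return merged;
}

// A reviewer is "profiled" if the merged list is non-empty.
const isProfiled = (tags: string[]): boolean => tags.length > 0;
```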

Conflicts are fetched via getUserConflicts() which returns:

  • USER_DECLARED — self-disclosed by the reviewer
  • SYSTEM_DETECTED — co-authors computed live from the Author table (last 3 years)

Step 2: Scoring

Each profiled reviewer is scored against each paper. The score has two components:

| Component | Range | Weight |
| --- | --- | --- |
| Text similarity | 0–70 | Primary signal |
| Keyword overlap | 0–30 | Direct keyword-matching bonus |

matchScore = min(100, textScore + keywordScore)
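The combination is a straightforward clamped sum; a sketch, assuming the two component scores are already computed:

```typescript
// Combine the two score components; clamping keeps the total in
// the 0-100 range even if both components max out.
function matchScore(textScore: number, keywordScore: number): number {
  return Math.min(100, textScore + keywordScore);
}
```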

Text Similarity (two options)

BM25 (default, free): Uses Okapi BM25 — an improvement over TF-IDF used by systems like OpenReview's TPMS. BM25 adds term frequency saturation (a word appearing 10x doesn't score 10x higher) and document length normalization (short abstracts aren't penalized vs long ones).

  • Paper title + abstract is tokenized (title tokens duplicated 2x for emphasis) using the natural NLP library: WordTokenizer for word splitting, Porter Stemmer for normalization, and natural's 170 English stop words plus academic-specific terms (propose, method, results, evaluate, etc.) that appear in every abstract but carry no matching value
  • Reviewer expertise tags are tokenized as the query (same pipeline)
  • BM25 IDF formula: log((N - df + 0.5) / (df + 0.5) + 1) — never produces negative values
  • Scores normalized by max score across all pairs
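The BM25 ingredients above can be sketched as follows; the k1 and b values are the common Okapi defaults, an assumption rather than something confirmed from PaperFox's source:

```typescript
// BM25 IDF in the "+1" form: log((N - df + 0.5) / (df + 0.5) + 1).
// Unlike classic IDF, this never goes negative, even for terms that
// appear in most documents.
function bm25Idf(N: number, df: number): number {
  return Math.log((N - df + 0.5) / (df + 0.5) + 1);
}

// Per-term BM25 contribution with term-frequency saturation (k1)
// and document-length normalization (b). k1/b are assumed defaults.
function bm25Term(
  tf: number, df: number, N: number,
  docLen: number, avgDocLen: number,
  k1 = 1.2, b = 0.75,
): number {
  const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
  return bm25Idf(N, df) * ((tf * (k1 + 1)) / norm);
}
```

The saturation effect is easy to see: raising a term's frequency from 1 to 10 roughly doubles its contribution rather than multiplying it by ten.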

AI Semantic (credits): Uses OpenAI text embeddings (text-embedding-3-small) for semantic understanding, so it knows that "NLP" ≈ "natural language processing" and "deep learning" ≈ "neural networks".

  • Paper and reviewer texts embedded as dense vectors
  • Cosine similarity between vectors
  • Scores normalized by max score across all pairs
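The similarity step itself is plain cosine similarity; a sketch, with the dense vectors standing in for what the OpenAI embeddings API would return:

```typescript
// Cosine similarity between two dense vectors: dot product divided
// by the product of their magnitudes. Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```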

Keyword Phrase Matching

Both scoring methods also compute keyword overlap:

  1. Phase 1: Exact phrase match (case-insensitive) — "machine learning" matches "machine learning" as a whole phrase
  2. Phase 2: Porter-stemmed word-level fallback for unmatched keywords — catches partial overlaps (e.g., "computing" and "computational" both stem to "comput")
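A sketch of the two-phase matching; the real pipeline uses natural's Porter stemmer, which is swapped here for a crude suffix-stripping stand-in so the example stays self-contained:

```typescript
// Crude stand-in for a Porter stemmer: strips one common suffix.
// Both "computing" and "computational" reduce to "comput".
const crudeStem = (w: string): string =>
  w.toLowerCase().replace(/(ational|ing|ers?|s)$/, "");

function matchKeywords(paperText: string, keywords: string[]): string[] {
  const text = paperText.toLowerCase();
  const matched: string[] = [];
  const unmatched: string[] = [];
  // Phase 1: exact, case-insensitive phrase match.
  for (const kw of keywords) {
    (text.includes(kw.toLowerCase()) ? matched : unmatched).push(kw);
  }
  // Phase 2: stemmed word-level fallback for keywords phase 1 missed.
  const textStems = new Set(text.split(/\W+/).map(crudeStem));
  for (const kw of unmatched) {
    if (kw.split(/\W+/).some((w) => textStems.has(crudeStem(w)))) {
      matched.push(kw);
    }
  }
  return matched;
}
```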

Step 3: COI Filtering

The reviewer's conflicts (user-declared + system-detected co-authors) are checked against paper authors by name, email, or affiliation. Any match blocks the assignment — the pair has no edge in the flow network, so the solver cannot assign them.
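A sketch of that check under assumed data shapes; the interface and field names are illustrative, not PaperFox's schema:

```typescript
// Illustrative shapes; a conflict record may only carry some fields.
interface Conflict { name?: string; email?: string; affiliation?: string }
interface Author { name: string; email: string; affiliation: string }

// A pair is blocked if ANY conflict matches ANY author on name,
// email, or affiliation (case-insensitive, whitespace-trimmed).
function hasConflict(conflicts: Conflict[], authors: Author[]): boolean {
  const eq = (a: string | undefined, b: string | undefined): boolean =>
    a !== undefined && b !== undefined && a !== "" && b !== "" &&
    a.trim().toLowerCase() === b.trim().toLowerCase();
  return conflicts.some((c) =>
    authors.some((p) =>
      eq(c.name, p.name) || eq(c.email, p.email) || eq(c.affiliation, p.affiliation),
    ),
  );
}
```

When this returns true, the pair simply gets no edge in the flow network, so the solver can never pick it.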


Step 4: Optimal Assignment (Min-Cost Max-Flow)

Smart Matching models the entire assignment as a minimum-cost flow problem and solves it optimally using the Successive Shortest Path algorithm with SPFA. This finds the assignment that maximizes total match quality across all papers simultaneously — the same approach used by major conference systems like OpenReview.

Flow Network Structure

Source ──→ Papers ──→ Reviewers ──→ Sink
| Edge | Capacity | Cost |
| --- | --- | --- |
| Source → Paper | reviewersPerPaper | 0 |
| Paper → Profiled Reviewer | 1 | -matchScore × 100 |
| Paper → Non-Profiled Reviewer | 1 | small random (-1 to -9) |
| Reviewer → Sink | maxPapersPerReviewer - existingWorkload | 0 (or escalating for balance) |
| Overflow → Sink | 2 | 100,000 (last resort) |
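The edge list above can be sketched as plain data (node identifiers and the function name are illustrative; the overflow edges are omitted for brevity):

```typescript
interface Edge { from: string; to: string; capacity: number; cost: number }

// Build the core edge list. Costs are negated scores so that
// minimizing total cost maximizes total match quality.
function buildEdges(
  papers: string[],
  profiled: Map<string, number>,   // reviewerId -> matchScore (0-100)
  nonProfiled: string[],
  reviewersPerPaper: number,
  slots: Map<string, number>,      // reviewerId -> remaining capacity
): Edge[] {
  const edges: Edge[] = [];
  for (const p of papers) {
    edges.push({ from: "source", to: p, capacity: reviewersPerPaper, cost: 0 });
    for (const [r, score] of profiled) {
      edges.push({ from: p, to: r, capacity: 1, cost: -score * 100 });
    }
    for (const r of nonProfiled) {
      // Small random negative cost so fallback ordering is fair.
      const cost = -(1 + Math.floor(Math.random() * 9));
      edges.push({ from: p, to: r, capacity: 1, cost });
    }
  }
  for (const [r, cap] of slots) {
    edges.push({ from: r, to: "sink", capacity: cap, cost: 0 });
  }
  return edges;
}
```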

Reviewer Priority

The cost structure creates a natural priority:

  1. Profiled reviewers — strongly preferred (cost = -7000 for score 70)
  2. Non-profiled reviewers — fallback (cost = -1 to -9, randomized for fairness)
  3. Overflow — over-capacity at extreme cost (100,000), only when no other option exists

Balance Mode

When Balance Workload is enabled, reviewer-to-sink edges use escalating costs:

  • 1st paper assigned: cost = 0
  • 2nd paper: cost = 500
  • 3rd paper: cost = 1000
  • ...

This penalizes overloading any single reviewer, encouraging an even distribution, even at the cost of slightly weaker expertise matches.
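The escalating cost is a linear function of how many papers the reviewer already holds in this run; the 500-point step is taken from the bullets above:

```typescript
// Cost of the k-th sink-edge unit for a reviewer in Balance mode:
// the k-th paper assigned costs (k - 1) * 500.
function balanceCost(paperIndex: number, step = 500): number {
  return (paperIndex - 1) * step; // paperIndex is 1-based
}
```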

Implementation

The solver is implemented in pure TypeScript with no external dependencies (lib/assignments/mcmf-solver.ts). It uses an index-based SPFA queue for O(1) dequeue performance. For typical conference sizes (50–500 papers, 20–200 reviewers), it completes in milliseconds.
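To make the mechanics concrete, here is a compact sketch of an SPFA-based Successive Shortest Path solver on an adjacency-list residual graph. It is illustrative only, not the actual lib/assignments/mcmf-solver.ts:

```typescript
// Min-cost max-flow via Successive Shortest Paths, using SPFA
// (queue-based Bellman-Ford) so negative edge costs are handled.
class MCMF {
  private head: number[];
  private nxt: number[] = []; private to: number[] = [];
  private cap: number[] = []; private cost: number[] = [];
  constructor(private n: number) { this.head = new Array(n).fill(-1); }

  // Forward and reverse edges are stored as a pair, so edge e's
  // reverse is always e ^ 1.
  addEdge(u: number, v: number, cap: number, cost: number): void {
    this.to.push(v); this.cap.push(cap); this.cost.push(cost);
    this.nxt.push(this.head[u]); this.head[u] = this.to.length - 1;
    this.to.push(u); this.cap.push(0); this.cost.push(-cost);
    this.nxt.push(this.head[v]); this.head[v] = this.to.length - 1;
  }

  // Returns [maxFlow, minCost] from s to t.
  solve(s: number, t: number): [number, number] {
    let flow = 0, total = 0;
    for (;;) {
      const dist = new Array(this.n).fill(Infinity);
      const inQueue = new Array(this.n).fill(false);
      const prevEdge = new Array(this.n).fill(-1);
      dist[s] = 0;
      const queue = [s]; let qi = 0; // index-based O(1) dequeue
      inQueue[s] = true;
      while (qi < queue.length) {
        const u = queue[qi++]; inQueue[u] = false;
        for (let e = this.head[u]; e !== -1; e = this.nxt[e]) {
          const v = this.to[e];
          if (this.cap[e] > 0 && dist[u] + this.cost[e] < dist[v]) {
            dist[v] = dist[u] + this.cost[e];
            prevEdge[v] = e;
            if (!inQueue[v]) { inQueue[v] = true; queue.push(v); }
          }
        }
      }
      if (dist[t] === Infinity) break; // no augmenting path left
      // Bottleneck capacity along the cheapest path, then push flow.
      let push = Infinity;
      for (let v = t; v !== s; v = this.to[prevEdge[v] ^ 1]) {
        push = Math.min(push, this.cap[prevEdge[v]]);
      }
      for (let v = t; v !== s; v = this.to[prevEdge[v] ^ 1]) {
        this.cap[prevEdge[v]] -= push;
        this.cap[prevEdge[v] ^ 1] += push;
      }
      flow += push; total += push * dist[t];
    }
    return [flow, total];
  }
}
```

On a toy network (one paper, two profiled reviewers with scores 70 and 50, one slot), the solver routes the single unit of flow through the score-70 reviewer, since -7000 is the cheaper edge.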


Output

Each paper receives a list of recommended reviewers:

| Field | Description |
| --- | --- |
| matchScore | 0–100 for profiled, null for non-profiled |
| assignmentType | EXPERTISE_MATCH or RANDOM_NO_PROFILE |
| reasoning | Human-readable explanation (e.g., "text similarity: 45/70; keyword overlap: 20/30; matching keywords: NLP, deep learning") |
| expertiseOverlap | Actual matching keywords for display |

The workload summary shows existing + new breakdown per reviewer, so chairs can see the impact before applying.

