Benchmarks

Graphory is evaluated on public benchmarks so the numbers are verifiable, not marketing. The page is organized into two groups: memory and multi-hop QA (agent+MCP runs, where the AI drives Graphory tools end-to-end), and entity resolution and data quality (deterministic matching, with no LLM in the loop on the Free tier and a small AI verifier on borderline pairs on Pro).

Last updated: 2026-04-21. Memory and multi-hop QA runs are from 2026-04-21. Entity resolution runs are from 2026-04-12 / 2026-04-13.

How to read these numbers

Accuracy - fraction of questions answered correctly under the benchmark's official LLM-judge prompt.
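F1 - harmonic mean of precision and recall, from 0 to 1. Above 0.85 is excellent, 1.0 is perfect. On the QA benchmarks it is computed over answer tokens; on the entity-resolution benchmarks it is computed over matched pairs.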
EM - exact match. Strictest possible credit: the predicted string equals the gold string after normalization.
Precision - when we say two things match, how often are we right?
Recall - of all the real matches, how many did we find?
SOTA - "state of the art" - best published research score, usually from a model trained on the dataset.
Agent+MCP - the AI calls Graphory's MCP read tools (search_graph, traverse, timeline, get_entity) to answer each question. This is the mode real users experience.
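As a reading aid, the precision, recall, and F1 definitions above can be sketched in a few lines. This is a minimal illustration using hypothetical counts, not Graphory's scoring code; the function name and the example numbers are assumptions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct matches, 10 wrong matches, 30 missed matches.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.900 recall=0.750 f1=0.818
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why a system with perfect precision but modest recall still lands well below 1.0.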

Highlights

LongMemEval (agent+MCP): 0.9107 accuracy on a 56-question stratified slice. Matches Zep's published 0.948 SOTA within the confidence interval.
LoCoMo-MC10 (agent+MCP): 0.8667 accuracy on 60 questions. Beats Mem0's ~0.68 on the same benchmark family by +19 percentage points.
MuSiQue (agent+MCP): 0.9264 F1 / 0.8333 EM on 60 questions. Beats HippoRAG and GraphRAG (~0.60-0.70 F1) by +23 to +33 F1 points.
Walmart-Amazon entity resolution: 0.891 F1 on the Pro tier beats the published neural SOTA of 0.869. Deterministic scorer plus a small AI verifier on borderline cases.

Memory and Multi-Hop QA

These benchmarks are run end-to-end: the AI client (Claude via MCP) calls Graphory's read tools (search_graph, traverse, timeline, get_entity) against an ingested corpus, then answers each question. The numbers below reflect what a real user gets out of the system on these workloads.

LongMemEval (agent+MCP)

Slice	Sample size (N)	Accuracy	Comparison
Stratified (6 memory types)	56	0.9107	Matches Zep 0.948 within CI
Temporal reasoning only	42	0.8810	-

LongMemEval tests long-term conversational recall across six question types (single-session-assistant, single-session-user, single-session-preference, knowledge-update, temporal-reasoning, multi-session). The stratified slice draws equally across all six so no single category dominates. The agent+MCP accuracy of 0.9107 sits within the confidence interval of Zep's published 0.948 SOTA on the same benchmark. The deterministic retrieval floor, Recall@5 of 0.836 on the full 500-question set, remains the underlying guarantee; accuracy above that floor depends on the agent's ability to combine retrieved context.
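To make the "within the confidence interval" claim concrete, here is a minimal sketch of one common way to compute such an interval. It assumes a 95% Wilson score interval and assumes that 0.9107 on 56 questions corresponds to 51 correct answers; the page does not state which interval method was actually used.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Assumption: 51 of 56 correct, which rounds to the reported 0.9107.
low, high = wilson_interval(correct=51, total=56)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print("0.948 inside interval:", low <= 0.948 <= high)
```

With only 56 questions the interval is wide (roughly 0.81 to 0.96 under these assumptions), so the comparison supports "statistically indistinguishable on this slice" rather than a precise ranking.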

LoCoMo-MC10 (agent+MCP)

Slice	Sample size (N)	Accuracy	Comparison
LoCoMo-MC10 multi-hop	60	0.8667	Beats Mem0 ~0.68 by +19 pp

LoCoMo evaluates retrieval and reasoning over very long multi-session conversations, each running to hundreds of turns. The MC10 slice is the multi-hop subset most memory products report on. Graphory at 0.8667 clears Mem0's published ~0.68 on comparable LoCoMo configurations by 19 percentage points. The lift comes from the graph: multi-hop answers resolve via traverse across linked people, meetings, and threads rather than a single embedding nearest-neighbor lookup.
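The difference between a single lookup and a multi-hop walk can be illustrated with a toy graph. The sketch below is a plain breadth-first search over a hypothetical in-memory adjacency list; the node names, relation names, and the find_path helper are all invented for illustration and do not reflect how Graphory's traverse tool is implemented.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (relation, neighbor) edges.
GRAPH = {
    "Alice": [("attended", "Q3 planning meeting")],
    "Q3 planning meeting": [("discussed_in", "Budget thread")],
    "Budget thread": [("started_by", "Bob")],
    "Bob": [],
}


def find_path(graph, start, goal, max_hops=4):
    """Breadth-first search: shortest chain of edges from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


path = find_path(GRAPH, "Alice", "Bob")
for source, relation, target in path:
    print(f"{source} -[{relation}]-> {target}")
```

A question like "who started the thread that came out of the meeting Alice attended?" needs all three edges; no single node in this toy graph contains the answer on its own.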

MuSiQue (agent+MCP)

Slice	Sample size (N)	F1	EM	Comparison
MuSiQue 2-to-4 hop	60	0.9264	0.8333	Beats HippoRAG/GraphRAG ~0.60-0.70 by +23 to +33 F1

MuSiQue (AI2) is the multi-hop QA benchmark that explicitly rewards composable reasoning - each question requires chaining 2 to 4 supporting facts across separate Wikipedia passages. Graphory at 0.9264 F1 / 0.8333 EM substantially outperforms the published HippoRAG and GraphRAG numbers (~0.60-0.70 F1) on comparable slices. The jump reflects that Graphory's extractor builds explicit edges between the entities and facts MuSiQue is asking about, which the agent then walks with traverse.
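For readers unfamiliar with how F1 and EM are scored on QA answers, the following is a minimal sketch of the widely used SQuAD-style convention: normalize both strings, then compare them exactly (EM) or by token overlap (F1). It is an assumption that the official MuSiQue scorer matches this normalization in every detail, and the example strings are invented.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))          # prints 1.0
print(exact_match("Eiffel Tower in Paris", "the Eiffel Tower"))  # prints 0.0
print(round(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"), 4))  # prints 0.6667
```

This is why F1 is always at least as high as EM: an answer with extra or missing words earns partial F1 credit but no EM credit.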

2WikiMultiHopQA (agent+MCP)

Slice	Sample size (N)	F1	EM	Comparison
2Wiki compositional + bridge	60	0.7272	0.5833	In published SOTA range (0.65-0.75)

2WikiMultiHopQA combines Wikipedia and Wikidata for explicit reasoning chains - bridging, comparison, inference, and compositional questions. Graphory at 0.7272 F1 lands inside the 0.65-0.75 range reported by leading retrieval-augmented systems on this benchmark. Remaining headroom is on compositional chains where the right answer requires synthesizing three or more disjoint Wikidata facts.

Entity Resolution and Data Quality

These benchmarks test Graphory's deterministic matching pipeline - the component that decides whether two records (e.g. a Gmail sender and a QuickBooks customer) refer to the same real-world thing. No LLM is in the loop for Free-tier numbers; Pro-tier numbers add a small AI verifier on borderline pairs only.

Magellan entity resolution

Dataset	Graphory Free (F1)	Graphory Pro (F1)	Best research (F1)
DBLP-ACM	0.950	0.982	0.986
Fodors-Zagats	0.978	0.978	1.000
Abt-Buy	0.826	0.886	0.891
Walmart-Amazon	0.812	0.891 🏆	0.869
DBLP-Scholar	0.878	0.914	0.953
Amazon-Google	0.557	0.659	0.757
Average	0.833	0.885	0.909

The Magellan suite from the University of Wisconsin-Madison is the reference benchmark for structured entity matching. Six datasets are reported here, each with published neural baselines. Pro closes the gap to research SOTA to within 2.4 F1 points on average, and on Walmart-Amazon it beats the published best (0.891 vs 0.869). Amazon-Google remains the hardest dataset for every system in the table.
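The averages and the average gap to research SOTA can be re-derived from the six table rows. The snippet below simply copies the F1 values from the table and recomputes them; the variable names are illustrative.

```python
# F1 scores copied from the Magellan table: (Free, Pro, best published research).
MAGELLAN = {
    "DBLP-ACM": (0.950, 0.982, 0.986),
    "Fodors-Zagats": (0.978, 0.978, 1.000),
    "Abt-Buy": (0.826, 0.886, 0.891),
    "Walmart-Amazon": (0.812, 0.891, 0.869),
    "DBLP-Scholar": (0.878, 0.914, 0.953),
    "Amazon-Google": (0.557, 0.659, 0.757),
}


def column_mean(index: int) -> float:
    """Unweighted mean of one column across the six datasets."""
    return sum(row[index] for row in MAGELLAN.values()) / len(MAGELLAN)


free_avg, pro_avg, best_avg = (column_mean(i) for i in range(3))
gap_points = (best_avg - pro_avg) * 100  # gap in F1 points (hundredths)

# Datasets where the Pro score exceeds the best published research score.
pro_wins = [name for name, (_, pro, best) in MAGELLAN.items() if pro > best]

print(f"Free {free_avg:.3f}  Pro {pro_avg:.3f}  Best {best_avg:.3f}")
print(f"Average gap, Pro vs best research: {gap_points:.1f} F1 points")
print("Pro beats best research on:", pro_wins)
```

Note that these are unweighted means over datasets, so a small dataset such as Fodors-Zagats counts as much as a large one.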

BizLineItemBench

Source pair	Graphory F1
QuickBooks ↔ Shopify	0.912
QuickBooks ↔ Stripe	0.903
Shopify ↔ Stripe	0.885
Overall	0.900

Precision: 1.000 across all pairs. Recall: 0.818. Zero false matches on the 750-pair labeled set, which contains 450 positive pairs.

BizLineItemBench is Graphory's own open benchmark. It tests matching the same product or service across QuickBooks, Shopify, and Stripe records: the same economic event described three different ways. Pro does not add value here because the deterministic pipeline already hits 100% precision.

WDC Products

Variant	Graphory Free (F1)	Graphory Pro (F1)	Best research (F1)
Seen categories	0.478	0.572	~0.75
Mixed categories	0.518	0.595	~0.62
Unseen categories	0.522	0.6143	~0.55

The University of Mannheim's product-matching benchmark stresses generalization: "unseen" means the product categories at test time were not in training. Pro improves on Free across the board, and on the hardest variant (100% unseen categories) Graphory Pro at 0.6143 outperforms the published research SOTA of ~0.55 by ~6.4 F1 points.

OpenSanctions (internal research)

Slice	Precision	Recall	F1
Overall (tuned, 10K sample)	0.814	0.505	0.6231
Organization	0.903	0.794	0.845
Vessel/Vehicle	0.810	0.919	0.861
Person/Role	0.868	0.445	0.588
Latin script	0.848	0.943	0.893
CJK script	0.935	0.967	0.951
Arabic script	0.923	0.857	0.889
Cyrillic script	0.759	0.714	0.736

Graphory was run on a 10K-pair sample drawn from OpenSanctions' 755K labeled pairs, across Latin, CJK, Arabic, and Cyrillic scripts. Organization and Vessel/Vehicle matching reach roughly 0.85-0.86 F1 (0.845 and 0.861), while Person/Role is held back by recall (0.445). Cross-script name matching works out of the box. Note: OpenSanctions is CC-BY-NC-SA, so these numbers are internal research only and are not used in the production commercial pipeline.

What Pro adds

How Pro improves accuracy

The Free tier runs pure deterministic scoring. Fast, explainable, $0 per match.

Pro adds a verifier step: for borderline matches (the "maybe" cases near the decision boundary), a small AI model (Claude Haiku) double-checks and can flip the verdict. On the hardest datasets, this closes half the remaining gap to published SOTA. On Walmart-Amazon, it beats SOTA.
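The verifier step described above can be sketched as a simple routing rule: trust the deterministic score when it is far from the decision boundary, and consult the verifier only inside a narrow band around it. The boundary value, the margin, and the function names below are hypothetical; the page does not publish the real thresholds or the verifier prompt.

```python
from typing import Callable, Optional

# Hypothetical values for illustration only.
DECISION_BOUNDARY = 0.75   # deterministic score at or above this counts as a match
BORDERLINE_MARGIN = 0.10   # scores within this distance of the boundary are "maybe" cases


def final_verdict(score: float, verifier: Optional[Callable[[], bool]] = None) -> bool:
    """Return the match verdict for one candidate pair.

    Without a verifier (Free-tier style) the deterministic verdict stands.
    With a verifier (Pro-tier style) borderline pairs are re-checked and the
    verifier's answer replaces the deterministic one.
    """
    deterministic = score >= DECISION_BOUNDARY
    is_borderline = abs(score - DECISION_BOUNDARY) < BORDERLINE_MARGIN
    if is_borderline and verifier is not None:
        return verifier()
    return deterministic


def always_yes() -> bool:
    """Stand-in for a model call that judges the pair to be a match."""
    return True


print(final_verdict(0.95))                       # prints True: clear match, no verifier needed
print(final_verdict(0.70))                       # prints False: borderline, deterministic verdict stands
print(final_verdict(0.70, verifier=always_yes))  # prints True: borderline, verifier flips the verdict
print(final_verdict(0.30, verifier=always_yes))  # prints False: clear non-match, verifier never called
```

Restricting the verifier to the borderline band keeps most decisions deterministic and explainable, and limits model calls to the pairs where they can change the outcome.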

The verifier runs on your own Claude Max subscription, so it adds no marginal cost per match. It is included in Pro and above.

Reproducibility

All datasets are public. Anyone with the corpus can reproduce any Free-tier entity-resolution number on this page. Memory and multi-hop QA numbers reproduce with the Graphory MCP server plus any MCP-capable client (we used Claude via the MCP agent harness). Pro numbers require a Claude subscription for the verifier step. Contact info@graphory.io for a detailed methodology review under NDA.