Benchmarks
Graphory is evaluated on public academic benchmarks so the numbers are verifiable, not marketing. The page is organized into two groups: memory and multi-hop QA (agent+MCP runs, where the AI drives Graphory tools end-to-end), and entity resolution and data quality (deterministic matching, no LLM in the loop).
Last updated: 2026-04-21. Memory and multi-hop QA runs are from 2026-04-21. Entity resolution runs are from 2026-04-12 / 2026-04-13.
How to read these numbers
- Accuracy - fraction of questions answered correctly under the benchmark's official LLM-judge prompt.
- F1 - the harmonic mean of precision and recall, from 0 to 1. Above 0.85 is excellent, 1.0 is perfect.
- EM - exact match. Strictest possible credit: the predicted string equals the gold string after normalization.
- Precision - when we say two things match, how often are we right?
- Recall - of all the real matches, how many did we find?
- SOTA - "state of the art" - best published research score, usually from a model trained on the dataset.
- Agent+MCP - the AI calls Graphory's MCP tools (search_graph, traverse, timeline, get_entity) to answer each question. This is the mode real users experience.
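As a rough illustration of how the metrics above relate, the sketch below computes precision, recall, and F1 from raw match counts and checks exact match after a simple normalization. The function names and the normalization rule (lowercase, strip ASCII punctuation, collapse whitespace) are assumptions made for this example; each benchmark's official scorer defines its own rules.

```python
import string


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def normalize(text: str) -> str:
    """Illustrative normalization (an assumption): lowercase, drop punctuation, collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(predicted: str, gold: str) -> bool:
    """EM gives credit only when the normalized strings are identical."""
    return normalize(predicted) == normalize(gold)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why a system with perfect precision but modest recall still lands well below 1.0.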
Highlights
- LongMemEval (agent+MCP): 0.9107 accuracy on a 56-question stratified slice. Zep's published SOTA of 0.948 falls within the confidence interval of this sample.
- LoCoMo-MC10 (agent+MCP): 0.8667 accuracy on 60 questions. Beats Mem0's ~0.68 on the same benchmark family by +19 percentage points.
- MuSiQue (agent+MCP): 0.9264 F1 / 0.8333 EM on 60 questions. Beats HippoRAG and GraphRAG (~0.60-0.70 F1) by +23 to +33 F1 points.
- Walmart-Amazon entity resolution: 0.891 F1 on the Pro tier beats the published neural SOTA of 0.869. The pipeline is a deterministic scorer plus a small AI verifier on borderline cases.
Memory and Multi-Hop QA
These benchmarks are run end-to-end: the AI client (Claude via MCP) calls Graphory's read tools (search_graph, traverse, timeline, get_entity) against an ingested corpus, then answers each question. The numbers below reflect what a real user gets out of the system on these workloads.
LongMemEval (agent+MCP)
| Slice | Sample size (N) | Accuracy | Comparison |
|---|---|---|---|
| Stratified (6 memory types) | 56 | 0.9107 | Matches Zep 0.948 within CI |
| Temporal reasoning only | 42 | 0.8810 | - |
LongMemEval tests long-term conversational recall across six question types (single-session-assistant, single-session-user, single-session-preference, knowledge-update, temporal-reasoning, multi-session). The stratified slice draws equally across all six so no single category dominates. The agent+MCP accuracy of 0.9107 is close enough to Zep's published 0.948 SOTA on the same benchmark that the difference falls within the confidence interval of a 56-question sample. The deterministic retrieval floor, Recall@5 of 0.836 on the full 500-question set, remains the underlying guarantee; accuracy above that floor depends on the agent's ability to combine retrieved context.
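The confidence-interval comparison can be sanity-checked with a short calculation. The page does not state which interval method or confidence level was used, so the sketch below assumes a 95% Wilson score interval; the count of 51 correct answers is inferred from 0.9107 accuracy on 56 questions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 51 correct out of 56 questions (inferred from the reported 0.9107 accuracy).
low, high = wilson_interval(51, 56)
print(f"Wilson interval: [{low:.3f}, {high:.3f}]")
print("0.948 inside interval:", low <= 0.948 <= high)
```

With only 56 questions the interval is wide (roughly 0.81 to 0.96 under these assumptions), so the comparison supports "not distinguishable from 0.948 at this sample size" rather than a precise ranking.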
LoCoMo-MC10 (agent+MCP)
| Slice | Sample size (N) | Accuracy | Comparison |
|---|---|---|---|
| LoCoMo-MC10 multi-hop | 60 | 0.8667 | Beats Mem0 ~0.68 by +19 pp |
LoCoMo evaluates retrieval and reasoning over multi-session conversations totaling over a million turns. The MC10 slice is the multi-hop subset most memory products report on. Graphory at 0.8667 clears Mem0's published ~0.68 on comparable LoCoMo configurations by 19 percentage points. The lift comes from the graph: multi-hop answers resolve via traverse across linked people, meetings, and threads rather than a single embedding nearest-neighbor lookup.
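To make the contrast with a single nearest-neighbor lookup concrete, here is a minimal sketch of a multi-hop walk over a toy graph. The graph contents, the relation names, and the `find_path` helper are all hypothetical; they illustrate the general idea of following typed edges and are not Graphory's schema or its traverse implementation.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (relation, neighbor) edges.
GRAPH = {
    "Alice": [("attended", "Q3 planning meeting")],
    "Q3 planning meeting": [("discussed_in", "Budget thread")],
    "Budget thread": [("mentions", "Vendor X")],
    "Vendor X": [],
}


def find_path(graph, start, goal, max_hops=4):
    """Breadth-first search; returns the shortest list of (source, relation, target) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # hop budget exhausted on this branch
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


for hop in find_path(GRAPH, "Alice", "Vendor X"):
    print(" -> ".join(hop))
```

A question such as "which vendor came up after the meeting Alice attended?" needs all three hops; no single passage in this toy corpus mentions both Alice and the vendor, which is the situation an embedding lookup alone handles poorly.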
MuSiQue (agent+MCP)
| Slice | Sample size (N) | F1 | EM | Comparison |
|---|---|---|---|---|
| MuSiQue 2-to-4 hop | 60 | 0.9264 | 0.8333 | Beats HippoRAG/GraphRAG ~0.60-0.70 by +23 to +33 F1 |
MuSiQue (AI2) is the multi-hop QA benchmark that explicitly rewards composable reasoning - each question requires chaining 2 to 4 supporting facts across separate Wikipedia passages. Graphory at 0.9264 F1 / 0.8333 EM substantially outperforms the published HippoRAG and GraphRAG numbers (~0.60-0.70 F1) on comparable slices. The jump reflects that Graphory's extractor builds explicit edges between the entities and facts MuSiQue is asking about, which the agent then walks with traverse.
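The gap between F1 (0.9264) and EM (0.8333) is typical when F1 is scored at the token level, since partially correct answers earn partial credit. The sketch below shows a SQuAD-style token F1 as an illustration; the exact normalization is an assumption, and the official MuSiQue scorer should be treated as the reference.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Assumed normalization: lowercase, strip punctuation, drop English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize_answer(predicted)
    gold_tokens = normalize_answer(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "President Barack Obama" against the gold answer "Barack Obama" fails exact match but still scores a token F1 of 0.8.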
2WikiMultiHopQA (agent+MCP)
| Slice | Sample size (N) | F1 | EM | Comparison |
|---|---|---|---|---|
| 2Wiki compositional + bridge | 60 | 0.7272 | 0.5833 | In published SOTA range (0.65-0.75) |
2WikiMultiHopQA combines Wikipedia and Wikidata for explicit reasoning chains - bridging, comparison, inference, and compositional questions. Graphory at 0.7272 F1 lands inside the 0.65-0.75 range reported by leading retrieval-augmented systems on this benchmark. Remaining headroom is on compositional chains where the right answer requires synthesizing three or more disjoint Wikidata facts.
Entity Resolution and Data Quality
These benchmarks test Graphory's deterministic matching pipeline - the component that decides whether two records (e.g. a Gmail sender and a QuickBooks customer) refer to the same real-world thing. No LLM is in the loop for Free-tier numbers; Pro-tier numbers add a small AI verifier on borderline pairs only.
Magellan entity resolution
| Dataset | Graphory Free (F1) | Graphory Pro (F1) | Best research (F1) |
|---|---|---|---|
| DBLP-ACM | 0.950 | 0.982 | 0.986 |
| Fodors-Zagats | 0.978 | 0.978 | 1.000 |
| Abt-Buy | 0.826 | 0.886 | 0.891 |
| Walmart-Amazon | 0.812 | 0.891 🏆 | 0.869 |
| DBLP-Scholar | 0.878 | 0.914 | 0.953 |
| Amazon-Google | 0.557 | 0.659 | 0.757 |
| Average | 0.833 | 0.885 | 0.909 |
The Magellan suite from the University of Wisconsin-Madison is the reference benchmark for structured entity matching: six datasets with published neural baselines. Pro closes the gap to research SOTA to within 2.4 F1 points on average, and on Walmart-Amazon it beats the published best. Amazon-Google remains the hardest dataset in every column of the table.
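The averages and the gap to research SOTA can be recomputed directly from the table. The snippet below copies the six rows as published; the variable names are only for this example.

```python
# F1 scores copied from the Magellan table: (Free, Pro, best published research).
MAGELLAN = {
    "DBLP-ACM": (0.950, 0.982, 0.986),
    "Fodors-Zagats": (0.978, 0.978, 1.000),
    "Abt-Buy": (0.826, 0.886, 0.891),
    "Walmart-Amazon": (0.812, 0.891, 0.869),
    "DBLP-Scholar": (0.878, 0.914, 0.953),
    "Amazon-Google": (0.557, 0.659, 0.757),
}


def column_mean(index: int) -> float:
    """Unweighted mean of one column across the six datasets."""
    return sum(row[index] for row in MAGELLAN.values()) / len(MAGELLAN)


free_avg, pro_avg, research_avg = (column_mean(i) for i in range(3))
gap_points = (research_avg - pro_avg) * 100  # gap in F1 points
beats_research = [name for name, (_, pro, best) in MAGELLAN.items() if pro > best]

print(f"Free {free_avg:.3f}, Pro {pro_avg:.3f}, research {research_avg:.3f}")
print(f"Pro gap to research: {gap_points:.1f} F1 points")
print("Pro beats published best on:", beats_research)
```

Note that the average is unweighted, so a small dataset such as Fodors-Zagats counts as much as a large one.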
BizLineItemBench
| Source pair | Graphory F1 |
|---|---|
| QuickBooks ↔ Shopify | 0.912 |
| QuickBooks ↔ Stripe | 0.903 |
| Shopify ↔ Stripe | 0.885 |
| Overall | 0.900 |
Precision: 1.000 across all pairs. Recall: 0.818. The 750-pair labeled set contains 450 positive pairs, and Graphory produced zero false matches on it.
BizLineItemBench is Graphory's own open benchmark. It tests matching the same product or service across QuickBooks, Shopify, and Stripe records: the same economic event described three different ways. Pro does not add value here because the deterministic pipeline already hits 100% precision.
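The overall F1 follows from the reported precision and recall, as the short check below shows. The count of matches found is an inference from recall times the 450 positive pairs, not a figure published on this page.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_from(1.000, 0.818)

# Inferred, not published: recall of 0.818 over 450 positive pairs.
positives = 450
found = round(0.818 * positives)
missed = positives - found

print(f"Overall F1: {overall_f1:.3f}")
print(f"Approximate true matches found: {found}, missed: {missed}")
```

With precision at 1.000, every point of F1 below 1.0 comes from missed matches rather than wrong ones.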
WDC Products
| Variant | Graphory Free (F1) | Graphory Pro (F1) | Best research (F1) |
|---|---|---|---|
| Seen categories | 0.478 | 0.572 | ~0.75 |
| Mixed categories | 0.518 | 0.595 | ~0.62 |
| Unseen categories | 0.522 | 0.6143 | ~0.55 |
The University of Mannheim's product-matching benchmark stresses generalization: "unseen" means the product categories at test time were not in training. Pro improves on Free in every variant, and on the hardest variant (100% unseen categories) Graphory Pro's 0.6143 F1 exceeds the published research SOTA of ~0.55 by ~6.4 points.
OpenSanctions (internal research)
| Slice | Precision | Recall | F1 |
|---|---|---|---|
| Overall (tuned, 10K sample) | 0.814 | 0.505 | 0.6231 |
| Organization | 0.903 | 0.794 | 0.845 |
| Vessel/Vehicle | 0.810 | 0.919 | 0.861 |
| Person/Role | 0.868 | 0.445 | 0.588 |
| Latin script | 0.848 | 0.943 | 0.893 |
| CJK script | 0.935 | 0.967 | 0.951 |
| Arabic script | 0.923 | 0.857 | 0.889 |
| Cyrillic script | 0.759 | 0.714 | 0.736 |
Graphory is evaluated on a 10K-pair sample drawn from OpenSanctions' 755K labeled pairs, across Latin, CJK, Arabic, and Cyrillic scripts. Organization and Vessel/Vehicle matching reach 0.845 and 0.861 F1; Person/Role is the weakest slice at 0.588 F1, held back by 0.445 recall. Cross-script name matching works out of the box, with per-script F1 ranging from 0.736 (Cyrillic) to 0.951 (CJK). Note: OpenSanctions is CC-BY-NC-SA, so these numbers are internal research only and are not used in the production commercial pipeline.
What Pro adds
The Free tier runs pure deterministic scoring. Fast, explainable, $0 per match.
Pro adds a verifier step: for borderline matches (the "maybe" cases near the decision boundary), a small AI model (Claude Haiku) double-checks and can flip the verdict. On the hardest datasets, this closes half the remaining gap to published SOTA. On Walmart-Amazon, it beats SOTA.
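The verifier pattern can be sketched as follows. Everything here is an assumption made for illustration: the threshold, the borderline margin, the `difflib`-based stand-in scorer, and the `verifier` callback are hypothetical and do not describe Graphory's actual scorer or its Claude Haiku integration.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

# Hypothetical values; the real decision boundary is not published on this page.
DECISION_THRESHOLD = 0.75
BORDERLINE_MARGIN = 0.15


def deterministic_score(a: str, b: str) -> float:
    """Stand-in scorer: character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(a: str, b: str, verifier: Optional[Callable[[str, str], bool]] = None) -> bool:
    """Return True if the two records are judged to be the same entity."""
    score = deterministic_score(a, b)
    verdict = score >= DECISION_THRESHOLD
    borderline = abs(score - DECISION_THRESHOLD) <= BORDERLINE_MARGIN
    if verifier is not None and borderline:
        # Pro-style path: only "maybe" pairs are sent to the verifier, which may flip the verdict.
        return verifier(a, b)
    # Free-style path, or a clear-cut pair: the deterministic verdict stands.
    return verdict
```

Restricting the verifier to the borderline band keeps clear matches and clear non-matches fast and explainable, and limits model calls to the pairs where a second opinion can change the outcome.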
The verifier runs on your own Claude Max subscription, so it adds no marginal cost per match. It is included in Pro and above.
Reproducibility
All datasets are public. Anyone with the corpus can reproduce any Free-tier entity-resolution number on this page. Memory and multi-hop QA numbers reproduce with the Graphory MCP server plus any MCP-capable client (we used Claude via the MCP agent harness). Pro numbers require a Claude subscription for the verifier step. Contact info@graphory.io for a detailed methodology review under NDA.