Benchmarks
Graphory runs on public academic benchmarks so the numbers are verifiable, not marketing. These are the results from the same scoring pipeline that ships in production. Two modes: Free (pure deterministic code, zero cost per run) and Pro (deterministic plus a small AI verifier on borderline cases, included in Pro plans and above).
How to read these numbers
- F1 - the harmonic mean of precision and recall, a single score from 0 to 1 that stays high only when both are high. Above 0.85 is excellent; 1.0 would be perfect. (See the sketch after this list for how all three are computed.)
- Precision - when we say two things match, how often are we right?
- Recall - of all the real matches, how many did we find?
- Recall@5 - is the right answer in the top 5 results?
- SOTA - "state of the art" - best published research score, usually from a model trained specifically on that dataset.
Highlight
Magellan entity resolution
| Dataset | Graphory Free (F1) | Graphory Pro (F1) | Best research (F1) |
|---|---|---|---|
| DBLP-ACM | 0.950 | 0.982 | 0.986 |
| Fodors-Zagats | 0.978 | 0.978 | 1.000 |
| Abt-Buy | 0.826 | 0.886 | 0.891 |
| Walmart-Amazon | 0.812 | 0.891 🏆 | 0.869 |
| DBLP-Scholar | 0.878 | 0.914 | 0.953 |
| Amazon-Google | 0.557 | 0.659 | 0.757 |
| Average | 0.833 | 0.885 | 0.909 |
The Magellan suite from the University of Wisconsin-Madison is the reference benchmark for structured entity matching. Six datasets, published neural baselines. Pro closes the gap to research SOTA to within 3 F1 points on average - and on the hardest dataset (Walmart-Amazon) it beats the published best.
BizLineItemBench
| Source pair | Graphory F1 |
|---|---|
| QuickBooks ↔ Shopify | 0.905 |
| QuickBooks ↔ Stripe | 0.897 |
| Shopify ↔ Stripe | 0.885 |
| Overall | 0.896 |
Precision: 1.000 across all pairs. No false matches.
Graphory's own open benchmark. Matching the same product or service across QuickBooks, Shopify, and Stripe records - the same economic event described three different ways. Zero false matches across 450 pairs. Pro does not add value here because the deterministic pipeline already hits 100% precision.
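To illustrate why, here is the shape of a precision-first deterministic rule - a hedged sketch with hypothetical field names, not Graphory's actual pipeline: declare a match only when strong invariants all agree, and abstain otherwise.

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    source: str          # "quickbooks", "shopify", or "stripe"
    sku: str | None      # stock-keeping unit, if the source records one
    amount_cents: int
    currency: str
    date: str            # ISO 8601, e.g. "2024-03-15"

def is_same_event(a: LineItem, b: LineItem) -> bool:
    """Declare a match only when strong invariants all agree.

    A rule this conservative trades recall for precision: it can miss
    real matches, but it almost never produces a false one - which is
    how a deterministic pipeline can hold precision at 1.000.
    """
    if a.currency != b.currency or a.amount_cents != b.amount_cents:
        return False
    if a.date != b.date:
        return False
    # When in doubt, abstain: no SKU on either side means no match claim.
    if not (a.sku and b.sku):
        return False
    return a.sku.strip().lower() == b.sku.strip().lower()

qb = LineItem("quickbooks", "TSHIRT-L", 2500, "USD", "2024-03-15")
sh = LineItem("shopify", "tshirt-l ", 2500, "USD", "2024-03-15")
print(is_same_event(qb, sh))  # True
```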
WDC Products
| Variant | Graphory Free (F1) | Graphory Pro (F1) | Best research (F1) |
|---|---|---|---|
| Seen categories | 0.478 | 0.572 | ~0.75 |
| Mixed categories | 0.518 | 0.595 | ~0.62 |
| Unseen categories | 0.522 | 0.614 | ~0.55 |
The University of Mannheim's product-matching benchmark stresses generalization: "unseen" means the product categories at test time never appeared in the training data. Pro improves across the board, and on the hardest variant (100% unseen categories) Graphory Pro outperforms the published research SOTA of ~0.55.
LongMemEval
| Question type | Sample size (N) | Right answer in top 5 |
|---|---|---|
| Single-session-assistant | 56 | 0.929 |
| Knowledge-update | 78 | 0.872 |
| Single-session-preference | 30 | 0.867 |
| Single-session-user | 70 | 0.857 |
| Temporal-reasoning | 133 | 0.827 |
| Multi-session | 133 | 0.767 |
| Overall | 500 | 0.836 |
LongMemEval measures long-term memory recall - can the memory layer find the right prior-session context when the AI asks? "Right answer in top 5" (Recall@5) is the standard metric in this literature. 0.836 overall is a strong floor for full-text search without a model in the retrieval loop.
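For reference, here is Recall@5 in its simplest single-answer form, which is how the table above reads - a sketch with illustrative names, not Graphory's evaluation harness:

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """1.0 if any gold item appears in the top k results, else 0.0.

    Averaged over all questions, this gives the Recall@5 reported above.
    """
    return 1.0 if gold_ids & set(ranked_ids[:k]) else 0.0

# Example: the gold memory "m42" is retrieved at rank 3 -> counts as a hit.
print(recall_at_k(["m7", "m13", "m42", "m9", "m2"], {"m42"}))  # 1.0
```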
OpenSanctions
| Slice | Precision | Recall | F1 |
|---|---|---|---|
| Overall (tuned) | 0.814 | 0.505 | 0.623 |
| Organization | 0.903 | 0.794 | 0.845 |
| Vessel/Vehicle | 0.810 | 0.919 | 0.861 |
| Person/Role | 0.868 | 0.445 | 0.588 |
| Latin script | 0.848 | 0.943 | 0.893 |
| CJK script | 0.935 | 0.967 | 0.951 |
| Arabic script | 0.923 | 0.857 | 0.889 |
| Cyrillic script | 0.759 | 0.714 | 0.736 |
Graphory on the OpenSanctions name-matching challenge across Latin, CJK, Arabic, and Cyrillic scripts. Organization and Vessel matching hit 0.85-0.86 F1, and cross-script name matching works out of the box. The weak spot is Person/Role recall (0.445), which is what drags overall recall down to 0.505.
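One ingredient of script-robust name matching is aggressive Unicode normalization before any comparison. A minimal sketch using Python's standard unicodedata module - this handles diacritics, common in Latin and Cyrillic names, while true cross-script matching (e.g. CJK or Arabic against Latin) additionally needs transliteration, which is beyond this sketch and not Graphory's actual pipeline:

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Decompose, strip combining marks (diacritics), casefold,
    and collapse whitespace before comparing names."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())

print(normalize_name("Nguyễn Văn  An"))   # "nguyen van an"
print(normalize_name("MÜLLER, Jürgen"))   # "muller, jurgen"
```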
LoCoMo
| Question type | Sample size | Accuracy |
|---|---|---|
| Open-domain | 841 | 0.710 |
| Single-hop | 282 | 0.571 |
| Multi-hop | 321 | 0.259 |
| Temporal-reasoning | 96 | 0.219 |
| Adversarial | 446 | 0.002 |
| Overall | 1986 | 0.435 |
| Recall@5 (retrieval) | 1986 | 0.310 |
LoCoMo is the long conversational memory benchmark: very long multi-session dialogues, with questions that probe recall across sessions. The deterministic floor on open-domain and single-hop questions already answers more than half of those correctly. Multi-hop and temporal-reasoning are where Pro's graph traversal has the most headroom.
What Pro adds
The Free tier runs pure deterministic scoring. Fast, explainable, $0 per match.
Pro adds a verifier step: for borderline matches (the "maybe" cases near the decision boundary), a small AI model (Claude Haiku) double-checks and can flip the verdict. On the hardest datasets, this closes half the remaining gap to published SOTA. On Walmart-Amazon, it beats SOTA.
No marginal cost when the verifier runs on your own Claude Max subscription. Included in Pro plans and above.
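The gating pattern looks roughly like this - a hedged sketch where the score scale, thresholds, and verifier hook are illustrative assumptions, not Graphory's actual values:

```python
from typing import Callable

def decide(deterministic_score: float,
           verify: Callable[[], bool],
           low: float = 0.40,
           high: float = 0.75) -> bool:
    """Free tier behavior: threshold the deterministic score at `high`.
    Pro tier behavior: scores in the [low, high) "maybe" band get a
    second opinion from a small model, which can flip the verdict.
    Scores outside the band never trigger a model call."""
    if deterministic_score >= high:
        return True                       # confident match
    if deterministic_score < low:
        return False                      # confident non-match
    return verify()                       # borderline: ask the verifier

# Illustrative use: the lambda stands in for a call to a small model
# such as Claude Haiku; the real prompt and client are not shown here.
print(decide(0.91, lambda: False))  # True  - band never consulted
print(decide(0.55, lambda: True))   # True  - verifier flipped a "maybe"
```

Because only the borderline band ever triggers a model call, most pairs are decided by the deterministic score alone, which is what keeps Pro's marginal cost near zero.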
Reproducibility
All datasets are public, so anyone can reproduce any Free-tier number on this page. Pro numbers additionally require a Claude subscription for the verifier step. Contact support@graphory.io for a detailed methodology review under NDA.