
Graphory Labs

Benchmarks

Graphory is benchmarked on public academic datasets, so the numbers are verifiable rather than marketing claims. These results come from the same scoring pipeline that ships in production. Two modes: Free (pure deterministic code, zero cost per run) and Pro (deterministic scoring plus a small AI verifier on borderline cases, included in Pro plans and above).

How to read these numbers

  • F1 - the harmonic mean of precision and recall, from 0 to 1. Above 0.85 is excellent; 1.0 would be perfect.
  • Precision - when we say two things match, how often are we right?
  • Recall - of all the real matches, how many did we find?
  • Recall@5 - is the right answer in the top 5 results?
  • SOTA - "state of the art" - best published research score, usually from a model trained specifically on that dataset.
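The metrics above can be computed directly from a run's predicted and true match sets. A minimal sketch (the function name and sample pairs are illustrative, not Graphory's actual scoring code):

```python
def match_metrics(true_pairs, predicted_pairs):
    """Precision, recall, and F1 for an entity-matching run."""
    true_pairs, predicted_pairs = set(true_pairs), set(predicted_pairs)
    tp = len(true_pairs & predicted_pairs)  # correctly predicted matches
    precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 real matches, 4 predictions, 3 correct:
p, r, f1 = match_metrics({(1, 1), (2, 2), (3, 3), (4, 4)},
                         {(1, 1), (2, 2), (3, 3), (5, 9)})
# p = 0.75, r = 0.75, f1 = 0.75
```

Because F1 is a harmonic mean, it punishes imbalance: a system with perfect precision but poor recall still scores low.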

Highlight

Graphory Pro beats neural SOTA on Walmart-Amazon. 0.891 F1 vs 0.869 for the best published trained neural model. A deterministic scoring pipeline plus a small AI verifier, outperforming a model trained specifically for this task.

Magellan entity resolution

Dataset          Graphory Free (F1)   Graphory Pro (F1)   Best research (F1)
DBLP-ACM         0.950                0.982               0.986
Fodors-Zagats    0.978                0.978               1.000
Abt-Buy          0.826                0.886               0.891
Walmart-Amazon   0.812                0.891 🏆            0.869
DBLP-Scholar     0.878                0.914               0.953
Amazon-Google    0.557                0.659               0.757
Average          0.833                0.885               0.909

The Magellan suite from the University of Wisconsin-Madison is the reference benchmark for structured entity matching. Six datasets, published neural baselines. Pro closes the gap to research SOTA to within 3 F1 points on average - and on the hardest dataset (Walmart-Amazon) it beats the published best.

BizLineItemBench

Source pair             Graphory F1
QuickBooks ↔ Shopify    0.905
QuickBooks ↔ Stripe     0.897
Shopify ↔ Stripe        0.885
Overall                 0.896

Precision: 1.000 across all pairs. No false matches.

Graphory's own open benchmark. Matching the same product or service across QuickBooks, Shopify, and Stripe records - the same economic event described three different ways. Zero false matches across 450 pairs. Pro does not add value here because the deterministic pipeline already hits 100% precision.

WDC Products

Variant             Graphory Free (F1)   Graphory Pro (F1)   Best research (F1)
Seen categories     0.478                0.572               ~0.75
Mixed categories    0.518                0.595               ~0.62
Unseen categories   0.522                0.614               ~0.55

The University of Mannheim's product-matching benchmark stresses generalization: "unseen" means the product categories at test time were not in training. Pro improves across the board, and on the hardest variant (100% unseen categories) Graphory Pro outperforms the published research SOTA of ~0.55.

LongMemEval

Question type               Sample size (N)   Right answer in top 5
Single-session-assistant    56                0.929
Knowledge-update            78                0.872
Single-session-preference   30                0.867
Single-session-user         70                0.857
Temporal-reasoning          133               0.827
Multi-session               133               0.767
Overall                     500               0.836

LongMemEval measures long-term memory recall - can the memory layer find the right prior-session context when the AI asks? "Right answer in top 5" (Recall@5) is the standard metric in this literature. 0.836 overall is a strong floor for full-text search without a model in the retrieval loop.
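Recall@5 is simple to compute: each question scores 1 if any relevant memory appears in the top five retrieved results, 0 otherwise, and the table reports the average. A minimal sketch (function name and sample data are illustrative):

```python
def recall_at_k(ranked_results, relevant, k=5):
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(r in relevant for r in ranked_results[:k]) else 0.0

# Averaging per-question scores gives the table's "right answer in top 5":
queries = [
    (["m3", "m7", "m1", "m9", "m2"], {"m1"}),   # hit at rank 3 -> 1.0
    (["m4", "m8", "m6", "m5", "m0"], {"m1"}),   # miss -> 0.0
]
score = sum(recall_at_k(ranked, rel) for ranked, rel in queries) / len(queries)
# score = 0.5
```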

OpenSanctions

Slice             Precision   Recall   F1
Overall (tuned)   0.814       0.505    0.623
Organization      0.903       0.794    0.845
Vessel/Vehicle    0.810       0.919    0.861
Person/Role       0.868       0.445    0.588
Latin script      0.848       0.943    0.893
CJK script        0.935       0.967    0.951
Arabic script     0.923       0.857    0.889
Cyrillic script   0.759       0.714    0.736

Graphory on the OpenSanctions name-matching challenge across Latin, CJK, Arabic, and Cyrillic scripts. Organization and vessel matching hit 0.845-0.861 F1, and cross-script matching works out of the box. Person/Role matching is precision-biased: 0.868 precision but only 0.445 recall, so it rarely asserts a wrong match but misses many true ones.

LoCoMo

Question type          Sample size   Accuracy
Open-domain            841           0.710
Single-hop             282           0.571
Multi-hop              321           0.259
Temporal-reasoning     96            0.219
Adversarial            446           0.002
Overall                1986          0.435
Recall@5 (retrieval)   1986          0.310

LoCoMo is the long conversational memory benchmark - question answering over very long, multi-session conversations. The deterministic floor already answers more than half of the open-domain and single-hop questions correctly. Multi-hop and temporal reasoning are where Pro's graph traversal has the most headroom.

What Pro adds

How Pro improves accuracy

The Free tier runs pure deterministic scoring. Fast, explainable, $0 per match.

Pro adds a verifier step: for borderline matches (the "maybe" cases near the decision boundary), a small AI model (Claude Haiku) double-checks and can flip the verdict. On the hardest datasets, this closes half the remaining gap to published SOTA. On Walmart-Amazon, it beats SOTA.
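The decision logic can be sketched as a two-stage gate. Everything here is an assumption for illustration - the function names, the 0.45/0.65 band, and the stub verifier stand in for Graphory's actual thresholds and the model call:

```python
def decide(pair, deterministic_score, verify, low=0.45, high=0.65):
    """Two-stage match decision: deterministic score first, AI verifier
    only on the borderline band. `verify` stands in for the model call."""
    if deterministic_score >= high:
        return True                  # confident match: no model call
    if deterministic_score < low:
        return False                 # confident non-match: no model call
    return verify(pair)              # borderline "maybe": verifier can flip it

def assume_match(pair):
    return True                      # stub; the real step asks a small model

decide(("Widget A", "Widget-A"), 0.58, assume_match)   # True (verified)
decide(("Widget A", "Gadget B"), 0.12, assume_match)   # False (never sent)
```

The design keeps cost bounded: the model is only consulted on the narrow band of scores where deterministic evidence is genuinely ambiguous.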

No marginal cost on your own Claude Max subscription. Included in Pro and above.

Reproducibility

All datasets are public. Anyone with the corpus can reproduce any Free-tier number on this page. Pro numbers require a Claude subscription for the verifier step. Contact support@graphory.io for a detailed methodology review under NDA.