Chain-of-thought reasoning benchmark

Transparent benchmarking for machine reasoning

Zeta Reason evaluates how large language models think — not just whether they provide the right answer. Capture chain-of-thought traces and score models across accuracy, calibration, robustness, and reasoning integrity.

30+ metrics

Accuracy, calibration, path quality

Multi-model

OpenAI, Anthropic, Google

JSON-first

FastAPI backend, React/Tailwind UI

Model comparison snapshot (CoT enabled)

Model      ACC    Brier   PFS    USR
Model A    0.82   0.09    0.91   0.03
Model B    0.79   0.11    0.76   0.07
Model C    0.85   0.14    0.63   0.12

Zeta Reason surfaces calibration, path faithfulness, and unsupported step rate — showing how models actually think.

Why Zeta Reason?

Zeta Reason focuses on chain-of-thought reasoning and provides a multi-dimensional understanding of model behavior beyond accuracy.

Go beyond accuracy

Capture chain-of-thought traces and evaluate coherence, calibration, robustness, and faithfulness — not just final answers.

Multi-dimensional metrics

ACC, Brier, ECE, path faithfulness, unsupported step rate, and robustness metrics in one place.

High-stakes ready

Designed for teams deploying LLMs into finance, healthcare, legal, and policy environments where auditability is critical.

Core metrics for reasoning quality

Zeta Reason organizes evaluation into core, reasoning, and robustness metrics — giving you visibility into model behavior instead of a single number.

Core

  • ACC — Answer accuracy
  • Brier, ECE — Calibration (sketched below)
  • $/ok, Tok/ok — Cost and tokens per correct answer
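
To make the calibration bullet concrete, here is a minimal Python sketch of how accuracy, Brier score, and ECE are conventionally computed from per-item confidences and correctness labels; the function names are illustrative, not Zeta Reason's actual API.

    import numpy as np

    def brier_score(confidences, outcomes):
        # Mean squared gap between stated confidence and 0/1 correctness.
        confidences = np.asarray(confidences, dtype=float)
        outcomes = np.asarray(outcomes, dtype=float)
        return float(np.mean((confidences - outcomes) ** 2))

    def expected_calibration_error(confidences, outcomes, n_bins=10):
        # ECE: per-bin gap between accuracy and mean confidence,
        # weighted by the fraction of items that fall in each bin.
        confidences = np.asarray(confidences, dtype=float)
        outcomes = np.asarray(outcomes, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap
        return float(ece)

    conf = [0.9, 0.6, 0.8]      # model-reported confidence per item
    correct = [1, 0, 1]         # 1 if the final answer was right
    print(sum(correct) / len(correct))                 # ACC   -> 0.667
    print(brier_score(conf, correct))                  # Brier -> 0.137
    print(expected_calibration_error(conf, correct))   # ECE   -> 0.300

A lower Brier score and ECE mean the model's stated confidence tracks how often it is actually right.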

Reasoning

  • PFS — Path Faithfulness Score
  • USR — Unsupported Step Rate (illustrated below)
  • PVS — Process Validity Score
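
These reasoning metrics are specific to Zeta Reason, but as a rough illustration, an unsupported step rate can be read as the fraction of chain-of-thought steps a verifier cannot ground in the prompt or in earlier steps. The sketch below assumes a hypothetical per-step supported flag produced by such a verifier.

    def unsupported_step_rate(trace):
        # Fraction of reasoning steps a verifier could not ground in the
        # prompt or in earlier steps. The 'supported' flag is hypothetical.
        if not trace:
            return 0.0
        return sum(1 for step in trace if not step["supported"]) / len(trace)

    trace = [
        {"text": "The train covers 120 km in 2 hours.", "supported": True},
        {"text": "So its speed is 60 km/h.",            "supported": True},
        {"text": "Therefore it must be an express.",    "supported": False},
    ]
    print(unsupported_step_rate(trace))  # 0.333..., one leap in three steps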

Robustness & Context

  • AR@ε — Adversarial robustness
  • DSI@k — Distraction sensitivity
  • CR@k / CP@k — Context recall & precision (sketched below)
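
CR@k and CP@k read like standard retrieval recall and precision over the top-k context items; a minimal sketch under that assumption, with gold relevance labels supplied by the dataset:

    def context_recall_precision_at_k(retrieved_ids, relevant_ids, k):
        # Recall and precision over the top-k retrieved context chunks.
        top_k = set(retrieved_ids[:k])
        relevant = set(relevant_ids)
        hits = len(top_k & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / k if k else 0.0
        return recall, precision

    # One of the two relevant chunks appears in the top 3 retrieved.
    print(context_recall_precision_at_k(["c3", "c1", "c7"], ["c1", "c2"], k=3))
    # -> (0.5, 0.333...)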

Open-source core + enterprise extension

Zeta Reason is free and open-source for the research community, with an optional enterprise layer for teams needing collaboration, governance, and compliance.

Open-Source Core

  • Python + FastAPI backend
  • JSON-first pipelines (example below)
  • React/Tailwind dashboards
  • MIT-licensed
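
As an illustration of what JSON-first means in practice, the sketch below serializes one plausible evaluation record; every field name is hypothetical, not Zeta Reason's published schema.

    import json

    # Hypothetical shape for a single evaluation result; the field
    # names here are illustrative, not the actual Zeta Reason schema.
    record = {
        "model": "model-a",
        "item_id": "math-0042",
        "answer": "60 km/h",
        "correct": True,
        "confidence": 0.82,
        "cot_trace": [
            {"step": 1, "text": "The train covers 120 km in 2 hours."},
            {"step": 2, "text": "So its speed is 60 km/h."},
        ],
    }
    print(json.dumps(record, indent=2))

Because every run is plain JSON, results can be diffed, versioned, and served through the FastAPI backend to the React/Tailwind dashboards.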

Enterprise Extension

  • Team workspaces
  • Recurring evaluation schedules
  • Dataset & results versioning
  • Compliance-ready audit logs

Built for researchers, enterprises, and regulators

Zeta Reason supports evaluation for research, applied AI, and emerging governance work.

AI Research Labs

Reasoning benchmarks for papers, ablations, and new model families.

Enterprise AI Teams

High-trust evaluation for production AI, safety reviews, and risk committees.

Regulators

Vendor-neutral metrics supporting AI safety and certification.