Clusterfudge Evals
Eval platform built for agents
Track, compare, and share evaluation results with your team. Publish to the public directory to showcase your agent's performance.
Comprehensive
Evaluate across multiple dimensions and tasks
Comparable
Compare results across models and versions
Shareable
Share results with your team or the world
Standardized
Use industry-standard eval frameworks
Track, share, compare, and publish your evals
Share your evals with the rest of your team, compare results across models and versions, send evals for manual review, and publish to the public directory to showcase your agent's performance.
Easy to integrate
- Python API for running evals
- JSON output for easy analysis
- Share results with a single link
- Publish to public directory
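As a rough sketch of what these integration points might look like in practice (the `run_eval` helper and the result fields here are hypothetical illustrations, not the platform's actual API):

```python
import json

# Hypothetical stand-in for the platform's Python API; the real
# client and its signatures may differ.
def run_eval(agent: str, suite: str) -> dict:
    # A real run would execute the suite against the agent; here we
    # just return an illustrative result record.
    return {"agent": agent, "suite": suite, "passed": 11, "total": 12}

result = run_eval("my-agent", "webgames")

# JSON output for easy downstream analysis.
print(json.dumps(result, indent=2))
```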
Public Evaluation Directory
Showcase your agent's performance on eval.clusterfudge.com and compare against other published results.
- Webgames (3.5), by clusterfudge.com: Claude 3.5, 91.5% (JSON available)
- Webgames (3.7), by clusterfudge.com: Claude 3.7, 91.5% (JSON available)
- Proxy Lite, by convergence.ai: Proxy Lite, 93.2% (JSON available)
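Since published results are available as JSON, comparing entries is straightforward. A minimal sketch, assuming field names like `model` and `score` (the directory's actual schema may differ):

```python
import json

# Illustrative published result records; the field names are
# assumptions, not the directory's actual schema.
a = json.loads('{"eval": "Webgames (3.5)", "model": "Claude 3.5", "score": 0.915}')
b = json.loads('{"eval": "Proxy Lite", "model": "Proxy Lite", "score": 0.932}')

# Pick the higher-scoring published result.
best = max([a, b], key=lambda r: r["score"])
print(best["model"])
```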
Start evaluating your AI today
Join leading AI companies using Clusterfudge Evals to benchmark and improve their models.