Clusterfudge Evals

Eval platform built for agents

Track, compare, and share evaluation results with your team. Publish to the public directory to showcase your agent's performance.

New! See evaluations in action

Comprehensive

Evaluate across multiple dimensions and tasks

Comparable

Compare results across models and versions

Shareable

Share results with your team or the world

Standardized

Use industry-standard eval frameworks

Track, share, compare, and publish your evals

Share your evals with the rest of your team. Compare eval results across models and versions. Send your evals for manual review. When you're ready, publish to the public directory so the world can see how your agent performs.

Easy to integrate

  • Python API for running evals (see the sketch after this list)
  • JSON output for easy analysis
  • Share results with a single link
  • Publish to public directory
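
A minimal sketch of what driving an eval from Python might look like. The package name `clusterfudge_evals`, the `Client` class, and the method names below are illustrative assumptions, not the documented SDK; consult the actual API reference for real names and signatures.

```python
from clusterfudge_evals import Client  # hypothetical package name

# Authentication details are an assumption.
client = Client(api_key="YOUR_API_KEY")

# Run an eval suite against a model and collect results.
run = client.run_eval(
    suite="webgames",    # eval suite name (illustrative)
    model="claude-3-7",  # model under test (illustrative)
)

# JSON output for easy analysis.
print(run.to_json())

# Share results with a single link.
print(run.share_link())
```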

Public Evaluation Directory

Showcase your agent's performance on eval.clusterfudge.com and compare against other published results.

  • Webgames (3.5), by clusterfudge.com: Claude 3.5 at 91.5% (results in JSON)
  • Webgames (3.7), by clusterfudge.com: Claude 3.7 at 91.5% (results in JSON)
  • Proxy Lite, by convergence.ai: Proxy Lite at 93.2% (results in JSON)
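
Because every published result exports to JSON, comparisons across entries can be scripted with the standard library alone. A minimal sketch, assuming hypothetical download URLs and a simple `{"name", "model", "score"}` schema; the real export format may differ.

```python
import json
from urllib.request import urlopen

# Hypothetical export URLs; the real directory paths may differ.
URLS = [
    "https://eval.clusterfudge.com/webgames-3.5.json",
    "https://eval.clusterfudge.com/webgames-3.7.json",
]

# Download each published result and parse its JSON export.
results = []
for url in URLS:
    with urlopen(url) as resp:
        results.append(json.load(resp))

# Rank runs by score, highest first (assumes score is a 0-1 fraction).
for r in sorted(results, key=lambda r: r["score"], reverse=True):
    print(f'{r["name"]} ({r["model"]}): {r["score"]:.1%}')
```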

Start evaluating your AI today

Join leading AI companies using Clusterfudge Evals to benchmark and improve their models.