Going to NeurIPS 2024? Come say hi.

Get your clusters ready for research

Ensure all your hardware is ready for AI research. Monitor metrics, run healthchecks, allocate GPUs to teams, and track utilization.

Maximise research reliability

Hardware Monitoring

Gather GPU metrics

Our agent continuously monitors core system metrics: CPU and GPU utilization, memory usage, disk space, network usage, and more.

Graph these alongside workloads and see how your clusters are performing. Or export them to your favourite aggregator.
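As a rough illustration of what gathering GPU metrics can look like, here is a minimal Python sketch that parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. It is not the Clusterfudge agent; the `GpuSample` type and the function names are hypothetical, and the parser assumes every requested field is numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import csv
import io
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; their order defines the CSV column order.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total,temperature.gpu"


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical type, for illustration only)."""
    index: int
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float
    temperature_c: float


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into samples."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        index, util, used, total, temp = (field.strip() for field in row)
        samples.append(
            GpuSample(int(index), float(util), float(used), float(total), float(temp))
        )
    return samples


def read_gpu_samples() -> list[GpuSample]:
    """Query the local NVIDIA driver once (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


# Canned output, so the parser can be tried on a machine without a GPU.
sample_output = "0, 97, 40120, 81559, 71\n1, 3, 512, 81559, 38\n"
for s in parse_gpu_csv(sample_output):
    print(s.index, s.utilization_pct, s.memory_used_mib / s.memory_total_mib)
```

Calling `read_gpu_samples()` on a timer and forwarding the results to an aggregator is one way to build the kind of time series described above.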

Hardware Monitoring

Track hardware health

We've seen it all before. Now you can too. Get email and Slack notifications for common problems:

  • Disk usage (and emergency disk space recovery)
  • GPUs falling off the PCI bus
  • Thermal throttling
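
To make two of the checks above concrete, here is a minimal Python sketch of a disk-usage check and a missing-GPU check. The 90% threshold, the `Alert` type, and the function names are assumptions chosen for illustration; this is not how the Clusterfudge agent is implemented, and delivery to email or Slack is left out.

```python
import shutil
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    """A health-check finding (hypothetical type, for illustration only)."""
    check: str
    message: str


def disk_alert(used_bytes: int, total_bytes: int, threshold: float = 0.90) -> Optional[Alert]:
    """Return an alert when the used fraction reaches the threshold."""
    fraction = used_bytes / total_bytes
    if fraction >= threshold:
        return Alert("disk_usage", f"disk {fraction:.0%} full (threshold {threshold:.0%})")
    return None


def check_path(path: str = "/", threshold: float = 0.90) -> Optional[Alert]:
    """Run the disk check against a real filesystem path."""
    usage = shutil.disk_usage(path)
    return disk_alert(usage.used, usage.total, threshold)


def missing_gpu_alert(expected: int, visible: int) -> Optional[Alert]:
    """Flag a node that reports fewer GPUs than it was registered with,
    which is one symptom of a GPU falling off the PCI bus."""
    if visible < expected:
        return Alert("gpu_missing", f"{expected - visible} of {expected} GPUs not visible")
    return None


print(disk_alert(950, 1000))
print(missing_gpu_alert(expected=8, visible=7))
```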

Environment diffing

Maintain environment consistency

Ensure consistent environments across all your clusters and nodes. Check your cluster for differences in CUDA drivers, Python and PyTorch versions, and other Python dependencies.

Property          Node 1   Node 2   Node 3   Node 4
CUDA Driver       11.7     11.7     11.7     11.8
Python Version    3.9.7    3.9.7    3.9.7    3.9.7
PyTorch Version   1.10.0   1.10.0   1.10.1   1.10.0
Registered GPUs   8        8        8        7
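
One simple way to find differences like those in the table is to compare each node against the majority value for every property. The Python sketch below does that with the table's own values; the `diff_environments` function and the majority-vote rule are assumptions for illustration, not a description of how Clusterfudge computes its diff.

```python
from collections import Counter


def diff_environments(nodes: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return {property: {node: value}} for nodes that deviate from the majority.

    If two values are tied for most common, the one seen first is treated as
    the majority, so a tie is reported as drift on the later nodes.
    """
    properties = sorted({name for env in nodes.values() for name in env})
    drift = {}
    for prop in properties:
        values = {node: env.get(prop, "<missing>") for node, env in nodes.items()}
        majority, _count = Counter(values.values()).most_common(1)[0]
        outliers = {node: value for node, value in values.items() if value != majority}
        if outliers:
            drift[prop] = outliers
    return drift


# The values from the table above.
cluster = {
    "Node 1": {"CUDA Driver": "11.7", "Python Version": "3.9.7",
               "PyTorch Version": "1.10.0", "Registered GPUs": "8"},
    "Node 2": {"CUDA Driver": "11.7", "Python Version": "3.9.7",
               "PyTorch Version": "1.10.0", "Registered GPUs": "8"},
    "Node 3": {"CUDA Driver": "11.7", "Python Version": "3.9.7",
               "PyTorch Version": "1.10.1", "Registered GPUs": "8"},
    "Node 4": {"CUDA Driver": "11.8", "Python Version": "3.9.7",
               "PyTorch Version": "1.10.0", "Registered GPUs": "7"},
}

for prop, outliers in diff_environments(cluster).items():
    print(prop, outliers)
```

On this data the sketch flags Node 4 for its CUDA driver and GPU count and Node 3 for its PyTorch version, while the Python version is reported as consistent.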

Built-in fault tolerance

Migrate GPU workloads

Automatically detect unhealthy nodes. Migrate GPU workloads to spare nodes. We use CRIU and CUDA checkpoint to snapshot and restore GPU workloads.
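
For readers unfamiliar with the tools, the Python sketch below builds the command sequence from NVIDIA's published example of combining its `cuda-checkpoint` utility with CRIU: suspend CUDA state, dump the process tree, then restore and resume. It only constructs the commands and does not run them. The helper names and the images directory are hypothetical, and this is an assumption about the general technique, not a description of Clusterfudge's migration pipeline. Restoring on a different node additionally assumes the image directory, filesystem paths, and a compatible GPU and driver are available there.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to snapshot a running GPU process, in order."""
    return [
        # Move CUDA state into host memory and release the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the (now CPU-only) process tree to disk with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to bring the snapshot back, in order."""
    return [
        # CRIU recreates the process with its original PID.
        ["criu", "restore", "--shell-job", "--restore-detached", "--images-dir", images_dir],
        # Reacquire the GPU and copy CUDA state back onto it.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


# Print the sequence for a hypothetical training process.
for cmd in checkpoint_commands(4121, "/var/tmp/ckpt") + restore_commands(4121, "/var/tmp/ckpt"):
    print(shlex.join(cmd))
```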

Get your clusters really working

Resource allocation

Allocate resources for critical workloads

Guarantee resources for critical projects. Flexibly burst to meet paper deadlines and public launches. Allow opportunistic access to keep utilisation high and the team unblocked.

  • Per-team or per-project quotas
  • Protect quotas with ACLs
  • Visualise and edit quotas from our web interface
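
The interplay between guaranteed quotas and opportunistic access can be sketched as a small admission rule. The Python below is a toy policy under stated assumptions: the `TeamQuota` type, the `admit` function, and the three outcomes are hypothetical and are not the Clusterfudge scheduler.

```python
from dataclasses import dataclass


@dataclass
class TeamQuota:
    """Per-team GPU quota (hypothetical type, for illustration only)."""
    guaranteed: int   # GPUs the team can always claim
    in_use: int = 0   # GPUs the team currently holds


def admit(team: str, requested: int, quotas: dict[str, TeamQuota], total_gpus: int) -> str:
    """Classify a request as 'guaranteed', 'opportunistic', or 'queued'."""
    quota = quotas[team]
    if quota.in_use + requested <= quota.guaranteed:
        # Within quota: always admitted. A real scheduler would preempt
        # opportunistic jobs here if the cluster were otherwise full.
        return "guaranteed"
    free = total_gpus - sum(q.in_use for q in quotas.values())
    if requested <= free:
        # Over quota, but idle GPUs exist: borrow them, subject to preemption.
        return "opportunistic"
    return "queued"


quotas = {"vision": TeamQuota(guaranteed=4, in_use=2), "nlp": TeamQuota(guaranteed=4)}
print(admit("vision", 2, quotas, total_gpus=8))  # fits inside the team's quota
print(admit("vision", 4, quotas, total_gpus=8))  # over quota, idle GPUs available
print(admit("vision", 7, quotas, total_gpus=8))  # more than the cluster has free
```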

Multi-cluster

Schedule across clusters

A single entrypoint for all your clusters. Works across any cloud, or on-prem. We have no dependencies on Kubernetes, Slurm, or any other software.

Burst capacity

Burst into the cloud

Meet paper deadlines or public launches. Clusterfudge is integrated with major cloud providers, allowing burst capacity with a minimum commitment of just one week.

Cluster utilization

Generate utilization reports

Measure the ROI of your GPU clusters. Get the insights you need to make informed decisions and optimize resource allocation.

GPU utilization reports

  • GPU usage percentages over time
  • Power consumption

Reliability reports

  • Mean time between failures (MTBF)
  • SLA reports for hardware health

Performance reports

  • Job completion times
  • Throughput metrics

Cost analysis reports

  • Cost per job or workload
  • ROI calculations

Capacity planning reports

  • Projected resource needs based on usage trends
  • Recommendations for scaling or upgrading

Resource allocation reports

  • Distribution of workloads across GPUs
  • Queue times for jobs
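
A few of the figures in these reports reduce to simple arithmetic. The Python sketch below shows the standard MTBF definition (operating time divided by number of failures), a mean utilization over samples, and a cost-per-job calculation. The function names and all input numbers are made up for illustration.

```python
def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures: total operating time / number of failures."""
    if failures == 0:
        return float("inf")  # no failures observed in the window
    return operating_hours / failures


def mean_utilization(samples_pct: list[float]) -> float:
    """Average GPU utilization (percent) over a series of samples."""
    return sum(samples_pct) / len(samples_pct)


def cost_per_job(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Cost attributed to one job, given the GPU-hours it consumed."""
    return gpu_hours * price_per_gpu_hour


# Hypothetical month: 8 nodes running for 30 days with 4 node failures.
node_hours = 8 * 24 * 30
print(mtbf_hours(node_hours, failures=4))        # 1440.0 node-hours between failures
print(mean_utilization([90.0, 70.0, 80.0]))      # 80.0
print(cost_per_job(gpu_hours=64, price_per_gpu_hour=2.5))  # 160.0
```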

Protect your hardware and your models

Security

Alert on suspicious processes

Our agent monitors processes running on GPUs. We've caught unauthorized cryptocurrency miners running in production. Stay informed and secure with real-time alerts.
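
As an illustration of the idea, the Python sketch below parses the output of `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits` and flags any GPU process whose executable name is not on an allowlist. The allowlist and the function name are hypothetical, and this is not how the Clusterfudge agent detects miners. Matching on process names is easy to evade, so treat it as a starting point only.

```python
import csv
import io
import os

# Hypothetical allowlist of executable names expected on a research node.
ALLOWED = frozenset({"python", "python3", "torchrun"})


def suspicious_processes(csv_text: str, allowed: frozenset = ALLOWED) -> list[tuple[int, str]]:
    """Return (pid, process_name) for GPU processes not on the allowlist."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pid, name = row[0].strip(), row[1].strip()
        if os.path.basename(name) not in allowed:
            flagged.append((int(pid), name))
    return flagged


# Canned output, so the check can be tried on a machine without a GPU.
sample_output = "4121, /usr/bin/python3, 40120\n9983, /tmp/.cache/xmrig, 7800\n"
for pid, name in suspicious_processes(sample_output):
    print(f"unexpected GPU process {pid}: {name}")
```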

Easy setup

Run Clusterfudge in one line, with zero config and no dependencies.

1

Sign up

Sign up to get an API key and personalised command to run our agent — Fudgelet.

2

Run Fudgelet

Run Fudgelet on your compute node. It auto-detects the node's GPUs and makes the node available to run workloads.

3

Launch workloads

Launch notebooks and workstations via the web, or write your own launches using our Python API.

$ curl https://get.clusterfudge.com/run.sh |
API_KEY=<your-api-key> bash

Join the beta

See how Clusterfudge can accelerate your research.