Get your clusters ready for research
Maximise research reliability
Gather GPU metrics
Our agent continuously monitors core system metrics: CPU and GPU utilization, memory usage, disk space, network usage, and more.
Graph these alongside workloads and see how your clusters are performing. Or export them to your favourite aggregator.
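As a rough illustration of the kind of sampling involved (not Fudgelet's actual implementation), the sketch below reads GPU metrics through `nvidia-smi` and disk usage through the Python standard library. The function names are hypothetical, and it assumes every queried field comes back numeric; a field reported as `[N/A]` would need extra handling.

```python
import shutil
import subprocess

# Standard nvidia-smi --query-gpu properties; the selection is illustrative.
GPU_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [float(v) for v in line.split(",")]
        row = dict(zip(GPU_FIELDS, values))
        row["index"] = int(row["index"])
        rows.append(row)
    return rows


def sample_gpus():
    """Query local GPUs; returns an empty list when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(GPU_FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def sample_disk(path="/"):
    """Return the fraction of disk space in use on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    print(sample_gpus())
    print(f"disk used: {sample_disk():.0%}")
```

Samples like these could be written out at a fixed interval and shipped to whichever aggregator you already use.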
Track hardware health
We've seen it all before. Now you can too. Get email and Slack notifications for common problems:
- ✓ Disk usage (and emergency disk space recovery)
- ✓ GPUs falling off the PCI bus
- ✓ Thermal throttling
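Two of these checks can be sketched in a few lines. The example below is an assumption about how such checks might look, not a description of the agent: it flags a disk above a usage threshold and scans kernel log lines for NVIDIA Xid 79, the driver error reported when a GPU has fallen off the bus. The function names and the 90% threshold are hypothetical.

```python
import re
import shutil

# NVIDIA driver errors appear in the kernel log as "NVRM: Xid (PCI:<address>): <code>, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
XID_FALLEN_OFF_BUS = 79  # Xid code for "GPU has fallen off the bus"


def disk_alert(path, threshold=0.9):
    """Return an alert message if the disk holding `path` is at or above `threshold`, else None."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total
    if fraction >= threshold:
        return f"disk {path} is {fraction:.0%} full"
    return None


def bus_alerts(log_lines):
    """Return one alert per kernel log line that reports Xid 79."""
    alerts = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match and int(match.group(2)) == XID_FALLEN_OFF_BUS:
            alerts.append(f"GPU at PCI {match.group(1)} has fallen off the bus")
    return alerts
```

The returned messages could then be forwarded to email or Slack by whatever notifier you prefer.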
Maintain environment consistency
Ensure consistent environments across all your clusters and nodes. Check your cluster for differences in CUDA drivers, Python and PyTorch versions, and other Python dependencies.
| | Node 1 | Node 2 | Node 3 | Node 4 |
|---|---|---|---|---|
| CUDA Driver | 11.7 | 11.7 | 11.7 | 11.8 |
| Python Version | 3.9.7 | 3.9.7 | 3.9.7 | 3.9.7 |
| PyTorch Version | 1.10.0 | 1.10.0 | 1.10.1 | 1.10.0 |
| Registered GPUs | 8 | 8 | 8 | 7 |
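The comparison in the table can be expressed as a small drift check: for each setting, treat the most common value as the expected one and report any node that differs. This is a minimal sketch using the table's values; `find_drift` is a hypothetical helper, not part of any Clusterfudge API.

```python
from collections import Counter


def find_drift(nodes):
    """Return {setting: {node: value}} for nodes that differ from the most common value."""
    drift = {}
    settings = sorted({key for env in nodes.values() for key in env})
    for setting in settings:
        values = {node: env.get(setting) for node, env in nodes.items()}
        expected, _count = Counter(values.values()).most_common(1)[0]
        outliers = {node: value for node, value in values.items() if value != expected}
        if outliers:
            drift[setting] = outliers
    return drift


# Values copied from the table above.
nodes = {
    "Node 1": {"CUDA Driver": "11.7", "Python Version": "3.9.7", "PyTorch Version": "1.10.0", "Registered GPUs": 8},
    "Node 2": {"CUDA Driver": "11.7", "Python Version": "3.9.7", "PyTorch Version": "1.10.0", "Registered GPUs": 8},
    "Node 3": {"CUDA Driver": "11.7", "Python Version": "3.9.7", "PyTorch Version": "1.10.1", "Registered GPUs": 8},
    "Node 4": {"CUDA Driver": "11.8", "Python Version": "3.9.7", "PyTorch Version": "1.10.0", "Registered GPUs": 7},
}

if __name__ == "__main__":
    for setting, outliers in find_drift(nodes).items():
        print(setting, outliers)
```

For the table's data this flags Node 4 for its CUDA driver and GPU count, and Node 3 for its PyTorch version.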
Migrate GPU workloads
Automatically detect unhealthy nodes. Migrate GPU workloads to spare nodes. We use CRIU and CUDA checkpoint to snapshot and restore GPU workloads.
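The placement half of a migration can be sketched independently of the snapshot mechanism. The example below assumes a simple best-fit policy, which is our illustration rather than the scheduler's documented behaviour: workloads on unhealthy nodes are moved, largest first, to the spare node with the fewest free GPUs that still fits them. The snapshot and restore step itself (CRIU and CUDA checkpoint) is not shown, and all names are hypothetical.

```python
def plan_migrations(workloads, unhealthy, spare_gpus):
    """Plan moves off unhealthy nodes.

    workloads:  {workload_name: (node, gpus_needed)}
    unhealthy:  set of node names considered unhealthy
    spare_gpus: {spare_node: free_gpu_count}
    Returns (plan, stranded): plan maps workload -> target node,
    stranded lists workloads that no spare node can hold.
    """
    free = dict(spare_gpus)  # copy so the caller's dict is not modified
    affected = sorted(
        ((name, gpus) for name, (node, gpus) in workloads.items() if node in unhealthy),
        key=lambda item: (-item[1], item[0]),  # largest first, then by name
    )
    plan, stranded = {}, []
    for name, gpus in affected:
        candidates = [node for node, available in free.items() if available >= gpus]
        target = min(candidates, key=lambda node: free[node], default=None)
        if target is None:
            stranded.append(name)
        else:
            plan[name] = target
            free[target] -= gpus
    return plan, stranded
```

Placing the largest workloads first reduces the chance that small jobs fragment the spare capacity.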
Get your clusters really working
Allocate resources for critical workloads
Guarantee resources for critical projects. Flexibly burst to meet paper deadlines and public launches. Allow opportunistic access to keep utilisation high and the team unblocked.
- ✓ Per-team or per-project quotas
- ✓ Protect quotas with ACLs
- ✓ Visualise and edit quotas from our web interface
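One way to picture guaranteed quotas alongside opportunistic access is as an admission decision with three outcomes. The sketch below is a simplified model under our own assumptions, not the product's scheduling logic: requests within a team's quota are always guaranteed (preempting opportunistic work, if needed, is not modelled), and requests beyond quota run opportunistically only when enough GPUs are idle. The function and team names are hypothetical.

```python
def admit(team, gpus, quotas, usage, cluster_gpus):
    """Classify a request for `gpus` GPUs from `team`.

    quotas:       {team: guaranteed_gpu_count}
    usage:        {team: gpus_currently_in_use}
    cluster_gpus: total GPUs in the cluster
    Returns "guaranteed", "opportunistic", or "queued".
    """
    if usage.get(team, 0) + gpus <= quotas.get(team, 0):
        return "guaranteed"
    idle = cluster_gpus - sum(usage.values())
    return "opportunistic" if gpus <= idle else "queued"


if __name__ == "__main__":
    quotas = {"vision": 16, "nlp": 8}
    usage = {"vision": 8, "nlp": 8}
    print(admit("vision", 8, quotas, usage, cluster_gpus=32))  # within quota
    print(admit("nlp", 4, quotas, usage, cluster_gpus=32))     # over quota, idle GPUs available
```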
Schedule across clusters
A single entrypoint for all your clusters. Works across any cloud, or on-prem. We have no dependencies on Kubernetes, Slurm, or any other software.
Burst into the cloud
Meet paper deadlines or public launches. Clusterfudge is integrated with major cloud providers, allowing burst capacity with a minimum commitment of just one week.
Generate utilization reports
Measure the ROI of your GPU clusters. Get the insights you need to make informed decisions and optimize resource allocation.
GPU utilization reports
- GPU usage percentages over time
- Power consumption
Reliability reports
- Mean time between failures (MTBF)
- SLA reports for hardware health
Performance reports
- Job completion times
- Throughput metrics
Cost analysis reports
- Cost per job or workload
- ROI calculations
Capacity planning reports
- Projected resource needs based on usage trends
- Recommendations for scaling or upgrading
Resource allocation reports
- Distribution of workloads across GPUs
- Queue times for jobs
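Several of these figures reduce to simple arithmetic. As an illustration using the standard definitions (the exact formulas behind the reports are not stated here), MTBF is total operating time divided by the number of failures, and cost per job is GPU count times runtime times the price per GPU-hour. The function names and example numbers are hypothetical.

```python
def mtbf_hours(operating_hours, failures):
    """Mean time between failures: total operating hours divided by failure count."""
    if failures == 0:
        return float("inf")  # no failures observed in the window
    return operating_hours / failures


def mean_utilisation(samples):
    """Average of GPU utilisation samples, each a percentage from 0 to 100."""
    return sum(samples) / len(samples)


def cost_per_job(gpu_count, runtime_hours, price_per_gpu_hour):
    """Cost of one job: GPUs used x hours run x price per GPU-hour."""
    return gpu_count * runtime_hours * price_per_gpu_hour


if __name__ == "__main__":
    # Illustrative numbers: 8 GPUs running for 720 hours with 3 failures.
    print(mtbf_hours(8 * 720, 3))       # 1920.0 GPU-hours between failures
    print(cost_per_job(8, 12.5, 2.0))   # 200.0 in whatever currency the price uses
```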
Protect your hardware and your models
Alert on suspicious processes
Our agent monitors processes running on GPUs. We've caught unauthorized cryptocurrency miners running in production. Stay informed and secure with real-time alerts.
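A basic version of this check compares the processes holding GPU memory against an allowlist. The sketch below is an assumption about one possible approach, not the agent's detection logic: it parses the output of `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader` and flags any executable whose name is not on the list. The allowlist and function name are hypothetical.

```python
import os


def suspicious_processes(csv_text, allowed):
    """Return [(pid, process_name)] for GPU processes whose executable is not in `allowed`.

    csv_text: output of
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
    allowed:  set of permitted executable base names, e.g. {"python3"}
    """
    flagged = []
    for line in csv_text.strip().splitlines():
        # pid is the first field and memory the last; the name may itself contain commas.
        pid, rest = line.split(",", 1)
        name, _memory = rest.rsplit(",", 1)
        name = name.strip()
        if os.path.basename(name) not in allowed:
            flagged.append((int(pid), name))
    return flagged


if __name__ == "__main__":
    sample = "4242, /usr/bin/python3, 40213 MiB\n5150, /tmp/.cache/xmrig, 1024 MiB\n"
    print(suspicious_processes(sample, allowed={"python3"}))
```

Name matching alone is easy to evade by renaming a binary, so a real detector would likely combine it with other signals.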
Easy setup
Run Clusterfudge in one line, with zero config and no dependencies.
Sign up
Sign up to get an API key and personalised command to run our agent, Fudgelet.
Run Fudgelet
Run Fudgelet on your compute node. It auto-detects the node's GPUs and lets the node run workloads.
Launch workloads
Launch notebooks and workstations via the web, or write your own launches using our Python API.
$ curl https://get.clusterfudge.com/run.sh |
API_KEY=<your-api-key> bash