The compute platform for rapid AI research
Designed for Organisational Efficiency
Features
- Allocate GPU resources
- Easily partition and ringfence GPU clusters for different teams and projects.
- Maximise ROI with insights
- Generate reports to optimise GPU utilisation and drive strategic decision-making.
- Checkpoint automatically
- Checkpoint and resume executables or containers, without any code changes.
- Remove infrastructure pain
- Simplify your infrastructure, remove operational complexity, and focus on research.
- Fill spare capacity
- Fill spare GPUs with data generation and batch inference tasks, with built-in workqueues.
- Monitor multiple clusters
- View multiple clusters, on-prem or in-cloud, in a single dashboard, and launch workloads through a single interface.
Maximise hardware utilisation
Reports
Seeing the GPU resources used by each researcher, team, or experiment allows AI labs to do more with less.
GPU utilisation reports
- GPU usage percentages over time
- Power consumption
Reliability reports
- Mean time between failures (MTBF)
- SLA reports for hardware health
Performance reports
- Job completion times
- Throughput metrics
Cost analysis reports
- Cost per job or workload
- ROI calculations
Capacity planning reports
- Projected resource needs based on usage trends
- Recommendations for scaling or upgrading
Resource allocation reports
- Distribution of workloads across GPUs
- Queue times for jobs
Allocate resources for critical workloads
Guarantee resources for critical projects. Flexibly burst to meet paper deadlines and public launches. Allow opportunistic access to keep utilisation high and the team unblocked.
- ✓Per-team or per-project quotas
- ✓Protect quotas with ACLs
- ✓Visualise and edit quotas from our web interface
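As a rough illustration of how guaranteed quotas and opportunistic bursting can coexist, here is a minimal sketch in plain Python. The `QuotaPool` class and its methods are hypothetical, for illustration only, and are not Clusterfudge's actual API.

```python
# Illustrative quota pool: guaranteed per-team quotas plus opportunistic
# burst into idle capacity. Names are hypothetical, not Clusterfudge's API.
class QuotaPool:
    def __init__(self, total_gpus: int):
        self.total = total_gpus
        self.quotas = {}   # team -> guaranteed GPU count
        self.in_use = {}   # team -> GPUs currently allocated

    def set_quota(self, team: str, gpus: int) -> None:
        # Guaranteed quotas must never oversubscribe the cluster.
        others = sum(self.quotas.values()) - self.quotas.get(team, 0)
        if others + gpus > self.total:
            raise ValueError("quotas exceed cluster capacity")
        self.quotas[team] = gpus

    def allocate(self, team: str, gpus: int, burst: bool = False) -> bool:
        used = self.in_use.get(team, 0)
        free = self.total - sum(self.in_use.values())
        within_quota = used + gpus <= self.quotas.get(team, 0)
        # Within-quota requests are guaranteed; burst requests only
        # succeed opportunistically, when idle GPUs are available.
        if gpus <= free and (within_quota or burst):
            self.in_use[team] = used + gpus
            return True
        return False

pool = QuotaPool(total_gpus=16)
pool.set_quota("research", 8)
pool.set_quota("prod", 8)
print(pool.allocate("research", 8))              # True: within guaranteed quota
print(pool.allocate("research", 4))              # False: over quota, no burst
print(pool.allocate("research", 4, burst=True))  # True: bursts into idle capacity
```

Burst allocations keep utilisation high while idle, yet a real scheduler would also preempt them when the quota owner reclaims its guarantee.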
Accelerate AI research
Built-in fault tolerance
Automatically migrate GPU workloads
Detect
Automatically detect unhealthy nodes through hardware, network and application healthchecks.
Checkpoint
Snapshot GPU workloads using CRIU and CUDA checkpointing.
Recover
Migrate the workloads to a spare node, cordon the faulty node, and restore the workloads.
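The detect, checkpoint, recover flow above can be sketched in a few lines of Python. This is a toy simulation of the control logic only; the `Node` class and `migrate_workloads` function are illustrative assumptions, and the string "checkpoints" stand in for real CRIU/CUDA snapshots.

```python
# Toy sketch of the checkpoint-cordon-restore flow; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    healthy: bool = True
    cordoned: bool = False
    workloads: list = field(default_factory=list)

def migrate_workloads(faulty: Node, spare: Node) -> None:
    """Checkpoint workloads on a faulty node, cordon it, restore on a spare."""
    # Stand-in for a CRIU/CUDA snapshot of each running workload.
    checkpoints = [f"ckpt:{w}" for w in faulty.workloads]
    faulty.cordoned = True      # stop new work landing on the bad node
    faulty.workloads.clear()
    # Restore each checkpoint on the spare node.
    spare.workloads.extend(c.removeprefix("ckpt:") for c in checkpoints)

bad = Node("gpu-03", healthy=False, workloads=["train-llm", "eval-run"])
spare = Node("gpu-07")
migrate_workloads(bad, spare)
print(spare.workloads)  # ['train-llm', 'eval-run']
print(bad.cordoned)     # True
```

Cordoning before restoring matters: it prevents the scheduler from placing fresh work on a node that is already known to be unhealthy.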
Specify workloads in pure Python
AI runs on Python. Now your GPU cluster does too. Clusterfudge provides a pure-Python API for specifying and launching experiments and multi-node training jobs.
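To give a flavour of what specifying a multi-node job in pure Python might look like, here is a hedged, self-contained sketch. The `JobSpec` dataclass and `launch` function below are invented for illustration and are not Clusterfudge's real API; a real client would submit the spec to the cluster scheduler rather than format a string.

```python
# Hypothetical sketch of a pure-Python job spec; not the actual API.
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    name: str
    command: list = field(default_factory=list)
    nodes: int = 1
    gpus_per_node: int = 8

def launch(spec: JobSpec) -> str:
    """Pretend-launch: a real client would submit this to the scheduler."""
    total_gpus = spec.nodes * spec.gpus_per_node
    return f"{spec.name}: {total_gpus} GPUs across {spec.nodes} nodes"

spec = JobSpec(
    name="llama-finetune",
    command=["python", "train.py", "--config", "base.yaml"],
    nodes=4,
)
print(launch(spec))  # llama-finetune: 32 GPUs across 4 nodes
```

Keeping the spec in plain Python means experiments can be parameterised with ordinary loops and functions instead of templated YAML.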