The compute platform for rapid AI research
Features
- Allocate GPU resources: easily partition and ringfence GPU clusters for different teams and projects.
- Reduce operational complexity: simplify your infrastructure and cut maintenance overhead with a modern alternative to Slurm.
- Maximize ROI with insights: generate reports to optimize GPU utilization and drive strategic decision-making.
- Fill spare capacity: fill idle GPUs with data generation and batch inference tasks using built-in workqueues (see the sketch after this list).
- Checkpoint automatically: checkpoint and resume executables or containers without any code changes.
- Monitor multiple clusters: view on-prem and in-cloud clusters in a single dashboard and launch workloads through a single interface.
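To make the workqueue idea concrete, here is a minimal sketch of low-priority batch work soaking up spare GPUs. The `Task` and `Workqueue` names are assumptions for illustration, not Clusterfudge's actual API.

```python
# Illustrative sketch only: `Task` and `Workqueue` are hypothetical names,
# not Clusterfudge's actual API.
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                        # lower value = runs sooner
    command: str = field(compare=False)  # work to run on a spare GPU

class Workqueue:
    """Toy in-memory stand-in for a cluster-backed workqueue."""
    def __init__(self) -> None:
        self._tasks: list[Task] = []

    def enqueue(self, task: Task) -> None:
        self._tasks.append(task)
        self._tasks.sort()               # keep lowest priority value first

    def next_task(self) -> Task | None:
        return self._tasks.pop(0) if self._tasks else None

# Preemptible batch-inference shards queued to fill idle GPUs.
queue = Workqueue()
for shard in range(4):
    queue.enqueue(Task(priority=10, command=f"python infer.py --shard {shard}"))

while (task := queue.next_task()) is not None:
    print("would run on a spare GPU:", task.command)
```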
Maximize hardware utilization
Generate utilization reports
Measure the ROI of your GPU clusters. Get the insights you need to make informed decisions and optimize resource allocation.
GPU utilization reports
- GPU usage percentages over time
- Power consumption
Reliability reports
- Mean time between failures (MTBF)
- SLA reports for hardware health
Performance reports
- Job completion times
- Throughput metrics
Cost analysis reports
- Cost per job or workload
- ROI calculations
Capacity planning reports
- Projected resource needs based on usage trends
- Recommendations for scaling or upgrading
Resource allocation reports
- Distribution of workloads across GPUs
- Queue times for jobs
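As a worked example of the metrics above, the snippet below computes a utilization percentage and an MTBF from raw samples. The data is invented purely for illustration.

```python
# Worked example (invented data): two of the report metrics listed above.
from datetime import datetime

# GPU utilization: mean of per-interval busy fractions, as a percentage.
busy_fractions = [0.92, 0.88, 0.95, 0.10, 0.97]
utilization_pct = 100 * sum(busy_fractions) / len(busy_fractions)

# MTBF: mean interval between recorded hardware failures.
failures = [datetime(2024, 5, 2), datetime(2024, 6, 20), datetime(2024, 8, 1)]
mtbf = (failures[-1] - failures[0]) / (len(failures) - 1)

print(f"GPU utilization: {utilization_pct:.1f}%")  # 76.4%
print(f"MTBF: {mtbf.days} days")                   # 45 days
```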
Allocate resources for critical workloads
Guarantee resources for critical projects. Flexibly burst to meet paper deadlines and public launches. Allow opportunistic access to keep utilization high and the team unblocked. A quota sketch follows the checklist below.
- ✓ Per-team or per-project quotas
- ✓ Protect quotas with ACLs
- ✓ Visualize and edit quotas from our web interface
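As an illustration of what per-team quotas with ACLs might look like in code, here is a hypothetical sketch; the field names and admission rule are assumptions, not Clusterfudge's actual configuration format.

```python
# Hypothetical quota/ACL shape, for illustration only; Clusterfudge quotas
# are edited via the web interface and may be structured differently.
quotas = {
    "research-llm": {
        "guaranteed_gpus": 64,    # always reserved for this team
        "burst_limit_gpus": 128,  # temporary headroom for deadlines
        "acl": ["alice@example.com", "bob@example.com"],
    },
    "batch-inference": {
        "guaranteed_gpus": 0,     # opportunistic: runs on spare capacity only
        "burst_limit_gpus": 32,
        "acl": ["inference-service"],
    },
}

def can_schedule(team: str, user: str, gpus_requested: int, gpus_in_use: int) -> bool:
    """Admit a job if the user is on the team's ACL and the burst limit holds."""
    q = quotas[team]
    return user in q["acl"] and gpus_in_use + gpus_requested <= q["burst_limit_gpus"]

assert can_schedule("research-llm", "alice@example.com", gpus_requested=16, gpus_in_use=100)
```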
Accelerate AI research
Built-in fault tolerance
Automatically migrate GPU workloads
Detect
Automatically detect unhealthy nodes through hardware, network and application healthchecks.
Checkpoint
Snapshot GPU workloads using CRIU together with CUDA checkpointing.
Recover
Migrate the workload to a spare node, cordon the faulty node, and restore the workload.
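The same checkpoint/restore mechanics are publicly available through NVIDIA's cuda-checkpoint utility and CRIU. The sketch below shows that general flow under simplifying assumptions (both tools installed, sufficient privileges, a single-process checkpointable job, and a shared image directory); it is not Clusterfudge's implementation.

```python
# Minimal sketch of the public CRIU + cuda-checkpoint flow, not Clusterfudge's
# implementation. Assumes both tools are installed, CRIU has the privileges it
# needs (typically root), and the image directory is reachable from both nodes.
import subprocess

def checkpoint(pid: int, image_dir: str) -> None:
    # 1. Toggle the process's CUDA state: GPU memory is copied to the host
    #    and the GPU is released, leaving a CPU-only process.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
    # 2. Dump the CPU-only process tree with CRIU.
    subprocess.run(
        ["criu", "dump", "--tree", str(pid), "--images-dir", image_dir, "--shell-job"],
        check=True,
    )

def restore(pid: int, image_dir: str) -> None:
    # 3. Restore from the CRIU images (on the original or a spare node);
    #    CRIU restores the process under its original PID.
    subprocess.run(
        ["criu", "restore", "--images-dir", image_dir, "--shell-job", "--restore-detached"],
        check=True,
    )
    # 4. Toggle CUDA state back: GPU memory is copied onto the new device
    #    and execution resumes.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
```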
Specify workloads in pure Python
AI runs on Python. Now your GPU cluster does too. Clusterfudge provides a pure-Python API for specifying and launching experiments and multi-node training jobs.
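As a flavor of what a pure-Python job spec can look like, here is a hypothetical sketch; the `Job` and `launch` names are illustrative assumptions, not the actual Clusterfudge API.

```python
# Hypothetical pure-Python job spec, for illustration only; not the actual
# Clusterfudge API.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    command: list[str]
    nodes: int = 1
    gpus_per_node: int = 8
    env: dict[str, str] = field(default_factory=dict)

def launch(job: Job) -> None:
    # A real client would submit this spec to the scheduler; here we just
    # show the shape of a multi-node training launch.
    print(f"Launching {job.name}: {job.nodes} nodes x {job.gpus_per_node} GPUs")

launch(Job(
    name="llm-pretrain",
    command=["python", "train.py", "--config", "configs/7b.yaml"],
    nodes=4,
    gpus_per_node=8,
    env={"WANDB_PROJECT": "pretraining"},
))
```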