The compute platform for rapid AI research

Iterate faster on AI. Maximize utilization. Orchestrate large-scale training, batch jobs, notebooks, and GPU workstations — all on your own hardware.

[Screenshot: the Launches page of Clusterfudge, showing a list of launches in various states, such as running and completed.]
Designed for Organizational Efficiency

Features 

Allocate GPU resources
Easily partition and ringfence GPU clusters for different teams and projects.
Reduce operational complexity
Simplify your infrastructure and reduce maintenance overhead with a modern alternative to Slurm.
Maximize ROI with insights
Generate reports to optimize GPU utilization and drive strategic decision-making.
Fill spare capacity
Fill spare GPUs with data generation and batch inference tasks, with built-in workqueues.
Checkpoint automatically
Checkpoint and resume executables or containers, without any code changes.
Monitor multiple clusters
Monitor multiple clusters, on-prem or in-cloud, in a single dashboard, and launch workloads through a single interface.
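Filling spare capacity with a workqueue boils down to a producer/consumer loop: batch tasks are queued up, and idle GPUs drain the queue. The sketch below is illustrative only — plain Python threads stand in for GPU workers; it is not the Clusterfudge workqueue API.

```python
import queue
import threading

# Illustrative sketch: a work queue that feeds batch tasks
# (e.g. data generation or batch inference) to idle GPUs.
tasks = queue.Queue()
for prompt in ["a", "b", "c", "d"]:
    tasks.put(prompt)

results = []
lock = threading.Lock()

def gpu_worker(gpu_id):
    # Each idle worker drains tasks until the queue is empty.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append((gpu_id, task))

workers = [threading.Thread(target=gpu_worker, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # all four tasks were processed
```

Because workers only pull tasks when free, spare capacity is consumed opportunistically without displacing scheduled jobs.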

Maximize hardware utilization

Cluster utilization

Generate utilization reports

Measure the ROI of your GPU clusters. Get the insights you need to make informed decisions and optimize resource allocation.

GPU utilization reports

  • GPU usage percentages over time
  • Power consumption

Reliability reports

  • Mean time between failures (MTBF)
  • SLA reports for hardware health

Performance reports

  • Job completion times
  • Throughput metrics

Cost analysis reports

  • Cost per job or workload
  • ROI calculations

Capacity planning reports

  • Projected resource needs based on usage trends
  • Recommendations for scaling or upgrading

Resource allocation reports

  • Distribution of workloads across GPUs
  • Queue times for jobs
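As a rough illustration of the first metric listed above (GPU usage percentages over time), a utilization report reduces to aggregating per-GPU samples. The function and numbers below are hypothetical, not Clusterfudge's reporting internals.

```python
# Illustrative sketch: a cluster-level utilization report built from
# periodic per-GPU usage samples (percentages; values are made up).
samples = {
    "gpu-0": [90, 95, 88, 92],
    "gpu-1": [10, 0, 5, 15],
}

def utilization_report(samples):
    # Average each GPU's samples, then average across GPUs.
    report = {gpu: sum(s) / len(s) for gpu, s in samples.items()}
    report["cluster"] = sum(report.values()) / len(report)
    return report

print(utilization_report(samples))
```

A report like this makes underused hardware (gpu-1 here) immediately visible, which is exactly the kind of signal that drives reallocation decisions.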

Resource allocation

Allocate resources for critical workloads

Guarantee resources for critical projects. Flexibly burst to meet paper deadlines and public launches. Allow opportunistic access to keep utilization high and the team unblocked.

Accelerate AI research

Built-in fault tolerance

Automatically migrate
GPU workloads

Detect

Automatically detect unhealthy nodes through hardware, network, and application health checks.

Checkpoint

Snapshot and checkpoint GPU workloads using CRIU and CUDA.

Recover

Migrate the workload to a spare node. Cordon the faulty node. Restore the workload.
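The detect/checkpoint/recover cycle above can be sketched as a small state machine. This is illustrative only: the `health_check` and `migrate` helpers are hypothetical, standing in for the real hardware/network/application checks and the CRIU/CUDA-based snapshots.

```python
# Illustrative sketch of the detect -> checkpoint -> recover cycle.
# (Not the Clusterfudge implementation.)

def health_check(node):
    # Stand-in for hardware, network, and application health checks.
    return node["healthy"]

def migrate(workload, nodes):
    """Move a workload off an unhealthy node onto a healthy spare."""
    current = nodes[workload["node"]]
    if health_check(current):
        return workload  # node is fine; nothing to do
    # Checkpoint: capture enough state to resume elsewhere.
    snapshot = {"step": workload["step"]}
    # Cordon: stop scheduling onto the faulty node.
    current["cordoned"] = True
    # Recover: restore the snapshot on a healthy, uncordoned node.
    for name, node in nodes.items():
        if health_check(node) and not node.get("cordoned"):
            return {"node": name, "step": snapshot["step"]}
    raise RuntimeError("no spare capacity available")

nodes = {"node-a": {"healthy": False}, "node-b": {"healthy": True}}
workload = {"node": "node-a", "step": 1200}
print(migrate(workload, nodes))  # resumes on node-b at step 1200
```

The point of checkpointing before cordoning is that the workload resumes from its last step rather than restarting from scratch — the "without any code changes" property comes from snapshotting the process, not instrumenting it.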

Designed for Python

Specify workloads in pure Python

AI runs on Python. Now your GPU cluster does too. Clusterfudge provides a pure-Python API for specifying and launching experiments and multi-node training jobs.
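The actual Clusterfudge API is not reproduced here; as an illustration of what a pure-Python workload spec can look like, a multi-node training launch might be modeled with standard dataclasses. All names and fields below are hypothetical.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a plain-Python model of a multi-node
# training launch. Not the actual Clusterfudge API.
@dataclass
class Launch:
    name: str
    command: list[str]
    nodes: int = 1
    gpus_per_node: int = 8
    env: dict = field(default_factory=dict)

    @property
    def total_gpus(self):
        return self.nodes * self.gpus_per_node

launch = Launch(
    name="llm-pretrain",
    command=["python", "train.py", "--config", "base.yaml"],
    nodes=4,
)
print(launch.total_gpus)  # 4 nodes x 8 GPUs = 32
```

Specifying launches as plain Python objects means experiments can be parameterized, composed, and version-controlled with the same tooling as the training code itself.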

Book a demo

See how Clusterfudge can accelerate your research.