Iterate faster on AI using any hardware

Train at scale, run interactive workstations or notebooks, and collect logs and errors. Runs on any hardware.

A preview of the Launches page of Clusterfudge, showing a list of launches in various states, like running and completed.

Accelerate AI research

Fast iteration

Access GPU workstations

Start GPU workstations in your cluster and access them from anywhere with a simple SSH command.

ssh h100.ssh.clusterfudge.com
Seamless Research Environment

Launch Jupyter Notebooks

Run Jupyter Notebooks in your cluster with a single click. Our built-in tunneling eliminates the need for SSH port-forwarding, providing instant web access without any complex setup.

Designed for Python

Specify workloads in pure Python

AI runs on Python. Now your GPU cluster does too. Clusterfudge provides a pure-Python API for specifying and launching experiments and multi-node training jobs.
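
As an illustration only, specifying a multi-node training job in pure Python might look roughly like the sketch below. The module, class, and parameter names are hypothetical placeholders, not Clusterfudge's actual API; consult the official docs for the real interface.

# Hypothetical sketch: names here are illustrative placeholders,
# not Clusterfudge's actual Python API.
import clusterfudge  # assumed client package name

launch = clusterfudge.Launch(
    name="llama-finetune",                            # assumed: human-readable launch name
    command=["python", "train.py", "--epochs", "10"],
    nodes=4,                                          # assumed: number of nodes to schedule
    gpus_per_node=8,                                  # assumed: GPUs requested per node
)

launch.run()  # assumed: submits the job to the cluster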

Web dashboard

Track research progress

View our web dashboard anywhere. You and your team can see launches, logs, tracebacks, and hardware metrics — even on mobile.

Reliable foundations for your matrix multiplications

Hardware Monitoring

Track hardware health

We've seen it all before. Now you can too. Get email and Slack notifications for common problems; a sketch of the kind of check involved follows the list:

  • Disk usage (and emergency disk space recovery)
  • GPUs falling off the PCI bus
  • Thermal throttling
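
For context, here is a minimal sketch of the kind of health probe this automates. The thresholds and the reliance on nvidia-smi are illustrative assumptions, not how Fudgelet is implemented.

# Illustrative health probe; thresholds and use of nvidia-smi are assumptions,
# not Fudgelet internals.
import shutil
import subprocess

def disk_nearly_full(path="/", threshold=0.90):
    # Flag a filesystem above the (assumed) 90% usage threshold.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold

def visible_gpus():
    # List GPUs nvidia-smi can still see; a GPU that fell off the PCI bus disappears.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [line.split(", ") for line in out.stdout.strip().splitlines()]

gpus = visible_gpus()
if disk_nearly_full():
    print("warning: root filesystem above 90% usage")
if len(gpus) < 8:  # assumed: node is expected to have 8 GPUs
    print(f"warning: only {len(gpus)} GPUs visible")
for index, temp in gpus:
    if int(temp) > 85:  # assumed thermal threshold in degrees C
        print(f"warning: GPU {index} is running hot ({temp} C)")
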
Environment diffing

Maintain environment consistency

Ensure consistent environments across all your clusters and nodes. Check your cluster for differences in CUDA drivers, Python and PyTorch versions, and other Python dependencies.

                  Node 1    Node 2    Node 3    Node 4
CUDA Driver       11.7      11.7      11.7      11.8
Python Version    3.9.7     3.9.7     3.9.7     3.9.7
PyTorch Version   1.10.0    1.10.0    1.10.1    1.10.0
Registered GPUs   8         8         8         7
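
The per-node snapshot being compared is straightforward to gather. A minimal sketch, assuming PyTorch is installed on the node (field names are illustrative, and the actual driver version would come from nvidia-smi rather than PyTorch):

# Minimal sketch of the per-node snapshot an environment diff compares.
# Assumes PyTorch is installed; field names are illustrative.
import platform
import torch

def environment_snapshot():
    return {
        "python_version": platform.python_version(),
        "pytorch_version": torch.__version__,
        "cuda_version": torch.version.cuda,        # CUDA version PyTorch was built with
        "registered_gpus": torch.cuda.device_count(),
    }

# Collecting this from every node and diffing the dictionaries surfaces drift,
# e.g. Node 3's different PyTorch patch release in the table above.
print(environment_snapshot())
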
Built-in fault tolerance

Migrate GPU workloads

Automatically detect unhealthy nodes. Migrate GPU workloads to spare nodes. Under the hood, we use CRIU and NVIDIA's cuda-checkpoint to snapshot and restore running GPU workloads.
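
At a high level, the snapshot-and-restore flow looks like the sketch below. The flags and orchestration are simplified assumptions rather than Clusterfudge's implementation; see the CRIU and cuda-checkpoint documentation for the real interfaces.

# Simplified sketch of GPU checkpoint/restore; flags and orchestration are
# assumptions, not Clusterfudge's implementation.
import subprocess

def checkpoint(pid: int, images_dir: str) -> None:
    # Pause the process's CUDA state so device memory can be captured.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)
    # Dump the full process tree to disk with CRIU.
    subprocess.run(
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir, "--shell-job"],
        check=True,
    )

def restore(pid: int, images_dir: str) -> None:
    # Recreate the process from its CRIU images on the spare node...
    subprocess.run(
        ["criu", "restore", "--images-dir", images_dir, "--shell-job"], check=True
    )
    # ...then resume its CUDA state on that node's GPUs.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)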

Easy setup

Get Clusterfudge running with a single command, zero config, and no dependencies.

1

Sign up

Sign up to get an API key and personalised command to run our agent — Fudgelet.

2

Run Fudgelet

Run Fudgelet on each compute node. It auto-detects GPUs and registers the node to run workloads.

3

Launch workloads

Launch notebooks and workstations via the web, or write your own launches using our Python API.

$ curl https://get.clusterfudge.com/run.sh | API_KEY=<your-api-key> bash

Join the beta

See how Clusterfudge can accelerate your research.