TAO: A New Protocol for Trustworthy Machine Learning on the Cloud

Machine learning increasingly runs on hardware outside the user's control, such as cloud GPUs and inference marketplaces. That raises a basic problem: how can a user verify that the output a service returns faithfully reflects the intended input? Today users have little recourse against service downgrades such as model swaps, or against discrepancies in ad embeddings.

The Problem with Verification

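Why is verifying outputs so hard? Floating-point execution on heterogeneous accelerators is inherently nondeterministic. Floating-point addition is not associative, and parallel kernels may accumulate values in different orders, so the same model and input can produce slightly different bits on different hardware. A verifier therefore cannot simply re-run the computation and demand an identical result. Previous approaches were either impractical for real-world neural networks or required users to trust the vendor. TAO is a new protocol that aims to close this gap.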

Introducing TAO

TAO stands for Tolerance Aware Optimistic verification. Rather than demanding bitwise equality, it accepts any output that falls within operator-level acceptance regions. The protocol uses two error models: sound per-operator worst-case bounds derived from IEEE-754 floating-point semantics, and tighter empirical percentile profiles calibrated across different hardware.
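To make the idea of a sound per-operator bound concrete, here is a minimal sketch for a single operator, the dot product. It uses the textbook rounding-error bound |computed - exact| <= gamma_n * sum(|x_i * y_i|), with gamma_n = n*u / (1 - n*u) and u = 2^-24 for FP32. This is a standard result from numerical analysis, not necessarily the exact bound TAO computes, and the function names (`gamma`, `dot_error_bound`, `accept_dot`) are hypothetical.

```python
import math

# Unit roundoff for IEEE-754 binary32 with round-to-nearest.
U_FP32 = 2.0 ** -24


def gamma(n, u=U_FP32):
    """Standard rounding-error constant gamma_n = n*u / (1 - n*u)."""
    if n * u >= 1.0:
        raise ValueError("bound is only valid while n*u < 1")
    return n * u / (1.0 - n * u)


def dot_error_bound(x, y, u=U_FP32):
    """Worst-case |computed - exact| for an n-term dot product, in any summation order."""
    magnitude = math.fsum(abs(a * b) for a, b in zip(x, y))
    return gamma(len(x), u) * magnitude


def accept_dot(claimed, x, y, u=U_FP32):
    """Accept a claimed result if it lies inside the acceptance region.

    Assumption: a float64 sum stands in for the exact value.
    """
    reference = math.fsum(a * b for a, b in zip(x, y))
    return abs(claimed - reference) <= dot_error_bound(x, y, u)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
bound = dot_error_bound(x, y)
honest = accept_dot(32.000001, x, y)  # off by rounding-scale noise only
tampered = accept_dot(32.01, x, y)    # far outside the acceptance region
```

The bound grows with both the number of terms and the magnitude of the inputs, which is why a worst-case bound of this kind is sound but loose. Note that comparing two independently computed FP32 results, rather than one result against an exact reference, would need twice this tolerance.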

When discrepancies arise, TAO employs a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until a single operator remains. Adjudication then reduces to a lightweight check against the theoretical bound or a simple vote against the empirical thresholds. The key contribution: TAO relies on neither trusted hardware nor deterministic kernels, which makes it a scalable option for real-world ML compute.
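The dispute game can be pictured as a binary search over committed intermediate results. The sketch below is a simplified illustration, not TAO's actual protocol: it assumes the graph has been linearized into a sequence of operators with scalar outputs, that both parties start from the same input, and that each party commits to its trace with a SHA-256 Merkle root. All names (`commit`, `find_disputed_operator`) and the threshold values are hypothetical.

```python
import hashlib


def leaf_hash(index, value):
    """Hash one operator's output together with its position in the trace."""
    return hashlib.sha256(f"{index}:{value!r}".encode()).digest()


def merkle_root(leaves):
    """Fold leaf hashes pairwise into one root, duplicating an odd last node."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def commit(trace):
    """Commit to a full trace of per-operator outputs."""
    return merkle_root([leaf_hash(i, v) for i, v in enumerate(trace)])


def find_disputed_operator(proposer, challenger, thresholds):
    """Bisect to one operator whose input is agreed but whose output is not."""
    def agree(i):
        return abs(proposer[i] - challenger[i]) <= thresholds[i]

    last = len(proposer) - 1
    if agree(last):
        return None  # final outputs are within tolerance: nothing to dispute
    lo, hi = -1, last  # invariant: agreement at lo (shared input), disagreement at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if agree(mid):
            lo = mid
        else:
            hi = mid
    return hi


proposer = [1.0, 2.0, 4.0, 8.5, 17.0]    # claimed per-operator outputs
challenger = [1.0, 2.0, 4.0, 8.0, 16.0]  # challenger's re-execution
thresholds = [0.01] * 5                  # illustrative per-operator tolerances
roots_differ = commit(proposer) != commit(challenger)
disputed = find_disputed_operator(proposer, challenger, thresholds)
```

Here the search settles on operator 3 after a logarithmic number of rounds, so only that one operator has to be re-checked. The Merkle roots matter because they fix each party's trace before the game starts, so neither side can change an intermediate value once bisection begins.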

TAO in Practice

TAO has been implemented as a PyTorch-compatible runtime and a contract layer, deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and executes vendor kernels in FP32 with minimal overhead: the reported figure is 0.3% on Qwen3-8B.

Across CNNs, Transformers, and diffusion models running on A100, H100, RTX6000, and RTX4090 GPUs, the empirical thresholds were 100 to 1,000 times tighter than the theoretical bounds. In the ablation study, bound-aware adversarial attacks achieved a 0% success rate under TAO, which points to a strong defense.
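As a rough illustration of how an empirical percentile threshold and the vote against it might work, the sketch below calibrates a threshold from observed cross-hardware deviations for one operator and then lets a small committee vote on a claimed output. The nearest-rank percentile, the 90th-percentile choice, the sample values, and the strict-majority rule are all assumptions made for the example, not details taken from TAO.

```python
import math


def percentile_threshold(deviations, p):
    """Nearest-rank p-th percentile of observed absolute deviations (0 < p <= 100)."""
    if not deviations or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(deviations)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def majority_accepts(claimed, reexecutions, threshold):
    """Each committee member re-runs the operator and votes on the claimed output."""
    votes = [abs(claimed - r) <= threshold for r in reexecutions]
    return sum(votes) * 2 > len(votes)  # strict majority


# Hypothetical deviations for one operator, measured across several devices.
observed = [1e-7, 2e-7, 3e-7, 4e-7, 5e-7, 6e-7, 7e-7, 8e-7, 9e-7, 1e-6]
threshold = percentile_threshold(observed, 90)

accepted = majority_accepts(1.0, [1.0000001, 0.9999999, 1.5], threshold)
rejected = majority_accepts(1.0, [1.5, 1.5, 1.0000001], threshold)
```

Because the threshold comes from measured behavior rather than a worst-case analysis, it can sit far below a theoretical bound, at the cost of needing a vote instead of a purely mechanical check.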

Why This Matters

TAO's approach to reconciling scalability with verifiability is a significant step forward. As more computation moves to hardware that users cannot inspect, protocols like TAO could become the standard for ensuring trustworthiness in ML-as-a-Service. The open question is whether TAO can fully replace vendor trust or will instead complement existing systems.

For now, TAO offers a promising path forward. By making outputs verifiable without relying on trusted environments, it lays the groundwork for more transparent and accountable machine learning services in decentralized settings. Code and data are available at TAO's project page for anyone who wants to explore the protocol and build on it.
