Someone Built the Smallest Transformer That Can Add Two 10-Digit Numbers. And It's Tiny
A new research project called AdderBoard found the minimal transformer architecture that can reliably add two 10-digit numbers. The result challenges assumptions about how much compute models actually need for basic math.
Transformers can write poetry, generate code, and pass the bar exam, yet they still struggle with basic arithmetic. A new research project just figured out exactly how small a transformer can be and still add two 10-digit numbers correctly.
The project is called AdderBoard, and it's now open-source on GitHub. The findings are simple but striking: you don't need billions of parameters to do reliable math. You need a carefully designed architecture with surprisingly few.
The Setup
The research question was specific and clean. What is the smallest transformer that can consistently add two 10-digit numbers? Not approximately. Not most of the time. Consistently.
This isn't a question about making GPT-4 better at math. It's a question about understanding what transformers actually learn and how efficiently they can learn it. Addition is one of the simplest algorithmic tasks that exists. Humans learn it in first grade. Understanding why transformers find it hard, and what minimal architecture overcomes that difficulty, tells us something fundamental about how these models process information.
The researchers tried progressively smaller architectures. Fewer layers. Fewer attention heads. Smaller embedding dimensions. At each step, they checked whether the model could still reliably solve 10-digit addition. They were looking for the cliff, the point where making the model any smaller caused it to fail.
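The "reliably" check at the heart of that search is straightforward to sketch: exact string match on random 10-digit problems, with no partial credit. The code below is an illustrative harness, not AdderBoard's actual evaluation code; the names `exact_match_accuracy` and the stand-in `perfect` model are assumptions for the example.

```python
import random

def exact_match_accuracy(model_fn, n_samples=1000, digits=10, seed=0):
    """Fraction of addition problems answered exactly right.

    `model_fn` maps a prompt string like "1234+5678=" to an answer
    string. Exact string match, no partial credit -- "reliably"
    means every digit correct.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        a = rng.randrange(10 ** (digits - 1), 10 ** digits)
        b = rng.randrange(10 ** (digits - 1), 10 ** digits)
        if model_fn(f"{a}+{b}=") == str(a + b):
            correct += 1
    return correct / n_samples

def perfect(prompt):
    """Stand-in 'model' that actually computes the sum, to exercise the
    harness. A real search would call a freshly trained candidate here."""
    a, b = prompt.rstrip("=").split("+")
    return str(int(a) + int(b))
```

A sweep over architectures would train each candidate, run this check, and stop shrinking the moment accuracy drops below 1.0 — that drop is the cliff.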
What They Found
The minimal architecture is surprisingly small. The exact specifications are in the repository, but we're talking about a model that's orders of magnitude smaller than anything used in production. Think thousands of parameters, not billions.
The key insights are about architecture design rather than scale. The number of attention layers matters more than the width of the model. The position encoding scheme matters enormously because addition is fundamentally a positional operation. You need to align digits correctly before you can add them, which means the model needs to understand position in a way that many standard architectures don't optimize for.
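One common way to hand the model that alignment, used in prior work on transformer arithmetic, is to write numbers least-significant digit first and zero-pad the operands, so the digits at each sequence position share the same power of ten. This is an illustrative encoding sketch under that assumption, not necessarily the scheme AdderBoard uses.

```python
def encode_reversed(a: int, b: int) -> str:
    """Render 'a+b=sum' with digits least-significant first and operands
    zero-padded to equal width, so position i of every number holds the
    10**i digit -- the alignment that column-wise addition needs."""
    width = max(len(str(a)), len(str(b)))
    ra = str(a).zfill(width)[::-1]
    rb = str(b).zfill(width)[::-1]
    rsum = str(a + b).zfill(width + 1)[::-1]  # one extra slot for a final carry
    return f"{ra}+{rb}={rsum}"
```

With this layout, `encode_reversed(123, 45)` produces `"321+540=8610"`: each output digit depends only on the two input digits at the same position plus a carry from the position before it, which is a much easier function for attention to learn than one that reaches across variable offsets.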
The researchers also found that training curriculum matters as much as architecture. Start with small numbers and gradually increase digit length. Models trained this way generalize to 10-digit addition far more reliably than models trained on the full distribution from the start. It's basically the way humans learn math: start simple, build up.
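That small-numbers-first schedule can be sketched as a data generator whose digit length grows stage by stage. The function below is a hypothetical illustration of the idea, not code from the repository.

```python
import random

def curriculum(max_digits=10, samples_per_stage=3, seed=0):
    """Yield (digits, problem, answer) triples, one stage per digit
    length: all 1-digit sums first, then 2-digit, up to max_digits."""
    rng = random.Random(seed)
    for digits in range(1, max_digits + 1):
        for _ in range(samples_per_stage):
            a = rng.randrange(10 ** (digits - 1), 10 ** digits)
            b = rng.randrange(10 ** (digits - 1), 10 ** digits)
            yield digits, f"{a}+{b}=", str(a + b)
```

A training loop would drain each stage (or mix a stage into the pool once the model masters the previous one) before moving on, rather than sampling the full 10-digit distribution from step one.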
Why This Matters Beyond Math
The obvious reaction is: who cares? We have calculators. Transformers don't need to do arithmetic when they can call a calculator tool.
Fair point. But the research value isn't about building a better calculator. It's about understanding the limits and capabilities of transformer architectures at a fundamental level.
When a 100-billion-parameter model fails at 10-digit addition, that failure tells us something important about what the model has and hasn't learned. It hasn't learned the algorithm for addition. It's learned a statistical approximation of addition that works most of the time but breaks at the edges. Finding the minimal architecture that actually learns the algorithm, as opposed to approximating it, helps us understand the difference between pattern matching and genuine computation in neural networks.
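One concrete way to tell an algorithm from an approximation is to probe the edges directly: inputs dominated by long carry chains, where a statistical shortcut slips first while a true adder cannot. The helper below is a hypothetical sketch of such a probe, not part of AdderBoard.

```python
def carry_stress_cases(digits=10):
    """Edge cases built around long runs of 9s, which force carries to
    propagate across many positions -- exactly where pattern-matching
    approximations of addition tend to break."""
    nines = 10 ** digits - 1            # 999...9 (all positions carry)
    cases = [(nines, 1), (nines, nines)]
    for k in range(1, digits):          # shorter runs of trailing 9s
        cases.append((10 ** k - 1, 1))
    return [(a, b, a + b) for a, b in cases]
```

Scoring a model only on these cases, alongside random ones, separates "right most of the time" from "learned the carry rule."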
This distinction matters for every task where accuracy is important. If you're using an AI model for financial calculations, medical dosing, engineering specifications, or any other domain where "close enough" isn't good enough, you need to know whether your model has learned the underlying algorithm or just a convincing approximation.
The Broader Research Trend
AdderBoard fits into a growing body of research on minimal architectures. There's been an explosion of work asking: how small can you make a model while preserving a specific capability? Can you distill a 70-billion-parameter model down to 7 billion and keep 90% of performance? Can you find the minimal architecture for specific tasks?
This research trend matters because it pushes back against the "bigger is always better" assumption that dominates AI development. Yes, scaling up produces impressive general capabilities. But for specific tasks, you often don't need the scale. A tiny, purpose-built model can outperform a general-purpose giant on the task it was designed for.
For deployment, this is extremely practical. Running a 100-billion-parameter model for arithmetic is like driving a dump truck to the grocery store. If you can identify the minimal model for a specific task, you can run it on cheaper hardware, with lower latency, and at a fraction of the cost.
Implications for AI Education
There's a teaching angle here that deserves attention. AdderBoard is a clean, well-defined benchmark that anyone can reproduce. The code is open-source. The task is well-understood. The success criteria are binary: either the model gets the addition right or it doesn't.
For students learning about transformer architectures, this is a perfect sandbox. You can experiment with different configurations, see exactly where they fail, and build intuition about why. That's harder to do with language models where "good output" is subjective and evaluation requires human judgment.
The project also demonstrates something valuable about research methodology. By choosing an extremely specific question and pursuing it systematically, the researchers generated insights that apply far beyond 10-digit addition. Good research isn't always about tackling the biggest problem. Sometimes it's about finding the simplest version of a hard problem and solving it completely.
What's Next
The AdderBoard repository is open for contributions. Other researchers can try to find even smaller architectures, test different encoding schemes, or extend the approach to other arithmetic operations like multiplication or division, which are significantly harder for transformers.
The ultimate goal isn't building math-capable transformers. It's understanding what transformers actually do when they compute, and using that understanding to build better models for tasks where reliability matters more than creativity. In a world obsessed with making AI models bigger, it's refreshing to see research asking how small they can go.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Benchmark: A standardized test used to measure and compare AI model performance.
Embedding: A dense numerical representation of data (words, images, and so on) that a model can operate on.
Evaluation: The process of measuring how well an AI model performs on its intended task.