Unpacking LilyBERT: When Small Beats Big in Music AI
A new dataset and model, LilyBERT, shake up symbolic music research. Is less really more? Dive into how curated data is outperforming bulk in this niche.
Symbolic music research has hit a new note with the introduction of BMdataset and its companion model, LilyBERT. Built from 393 LilyPond scores comprising 2,646 movements, this dataset promises a fresh perspective on music understanding. Forget MIDI; it's time to tune into LilyPond.
What's in the Box?
BMdataset isn't just another collection of scores. Each piece is expertly transcribed from original Baroque manuscripts, complete with metadata about composers, musical forms, and instrumentation. In a music world dominated by MIDI, this approach is a breath of fresh air. Why settle for one format when diversity could lead to breakthroughs?
LilyBERT: A New Player in Town
Enter LilyBERT, a model that's shaking things up: a CodeBERT-based encoder adapted specifically for symbolic music, with a vocabulary extended by 115 LilyPond-specific tokens so it can read notation that traditional models can't. Despite a modest pre-training corpus, the Mutopia collection of roughly 90 million tokens, it outperforms models trained on enormous datasets, such as the 15-billion-token PDMX corpus, on composer and style classification.
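The released tokenizer's internals aren't described here, but the core idea of extending a vocabulary with notation-specific tokens can be sketched with a toy whitespace tokenizer. Everything below is illustrative: the class, the base vocabulary, and the handful of LilyPond commands stand in for the real tokenizer and its 115 added tokens.

```python
# Illustrative sketch: extending a tokenizer vocabulary with
# LilyPond-specific tokens. Not the paper's actual implementation.

class ToyTokenizer:
    def __init__(self, base_vocab):
        self.vocab = {tok: i for i, tok in enumerate(base_vocab)}
        self.unk_id = self.vocab["<unk>"]

    def add_tokens(self, new_tokens):
        """Append tokens not already in the vocabulary; return count added."""
        added = 0
        for tok in new_tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        return added

    def encode(self, text):
        """Whitespace-split and map each piece to an ID (or <unk>)."""
        return [self.vocab.get(piece, self.unk_id) for piece in text.split()]

# Base vocabulary with pitches and durations but no notation commands.
tok = ToyTokenizer(["<unk>", "c'", "d'", "e'", "4", "8"])

# Before extension, LilyPond commands collapse to <unk>.
before = tok.encode(r"\relative c' { c' 4 d' 8 }")

# A few illustrative LilyPond commands (LilyBERT reportedly adds 115).
tok.add_tokens([r"\relative", r"\clef", r"\time", r"\key", "{", "}"])

# After extension, each command gets its own ID and no information is lost.
after = tok.encode(r"\relative c' { c' 4 d' 8 }")
```

In a real setup on a pretrained encoder, the embedding matrix would also need to grow to match the new vocabulary size, with the new rows learned during continued pre-training or fine-tuning.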
David vs. Goliath: Data Size Showdown
The results? Smaller, meticulously curated datasets like BMdataset can outpace their larger, noisier counterparts. This is a bold claim in a field that often equates data volume with accuracy. Fine-tuning on BMdataset alone achieved better results than continued pre-training on larger datasets. It's time to rethink the 'more is better' mantra.
The Perfect Blend
When you blend broad pre-training with focused fine-tuning, magic happens. LilyBERT achieved 84.3% composer-classification accuracy, evidence that combining the two data regimes offers the best of both worlds. So, what's the takeaway? If you're still relying on sheer size for AI training, you're missing the point. Precision and expertise trump bulk every time.
With the release of the BMdataset, tokenizer, and model, the stage is set for a new era in music AI. Will you continue to swim in the sea of data, or is it time to dive into the curated pool?
Key Terms Explained
Classification: A machine learning task where the model assigns input data to predefined categories.
Encoder: The part of a neural network that processes input data into an internal representation.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Pre-training: The initial, expensive phase of training where a model learns general patterns from a massive dataset.