Cracking the Code: Authorship Attribution in Scholarly Texts

Authorship attribution isn't as straightforward as it sounds. The challenge comes from the 'topical confound': two authors writing about the same topic often resemble each other more than one author writing on different subjects. But what if you could sidestep this problem by focusing on the traits of academic writing itself?

Introducing HALvest

Meet HALvest, a massive 17-billion-token multilingual collection of open-access academic papers. It's more than just another corpus: its English version, HALvest-Contrastive, is built to minimize topical overlap. By selecting passages from different papers by the same author within a single discipline, it removes many of the shortcuts that attribution algorithms rely on. This is where the real test of authorship attribution comes into play.
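To make the selection rule concrete, here is a minimal sketch of pairing passages that share an author and a discipline but come from different papers. The field names (`author`, `discipline`, `paper_id`, `text`) and the function `same_author_pairs` are hypothetical, chosen for illustration; the actual HALvest-Contrastive pipeline may work quite differently.

```python
from collections import defaultdict
from itertools import combinations


def same_author_pairs(passages):
    """Pair passages by the same author in the same discipline,
    keeping only pairs drawn from different papers."""
    # Group passages by (author, discipline); field names are assumptions.
    groups = defaultdict(list)
    for p in passages:
        groups[(p["author"], p["discipline"])].append(p)

    pairs = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            # Passages from the same paper would share topic too closely.
            if a["paper_id"] != b["paper_id"]:
                pairs.append((a["text"], b["text"]))
    return pairs


# Toy records, invented for this example.
passages = [
    {"author": "A", "discipline": "physics", "paper_id": "p1", "text": "t1"},
    {"author": "A", "discipline": "physics", "paper_id": "p2", "text": "t2"},
    {"author": "A", "discipline": "physics", "paper_id": "p1", "text": "t3"},
    {"author": "A", "discipline": "biology", "paper_id": "p3", "text": "t4"},
    {"author": "B", "discipline": "physics", "paper_id": "p4", "text": "t5"},
]
pairs = same_author_pairs(passages)
```

Only the physics passages by author A yield pairs here, and the two passages from paper `p1` are never paired with each other.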

No More Free Rides for Lexical Baselines

Lexical baselines, which often rely on superficial cues such as shared vocabulary, fall apart when these shortcuts are removed. That's a telltale sign that traditional methods are skating by on surface-level similarities. The researchers behind HALvest validated their benchmark by demonstrating this collapse. For anyone who thinks you can just rent a GPU, train a model, and expect it to work, this is a wake-up call.
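A toy illustration of why lexical cues can mislead: the sketch below scores texts by word-set overlap (Jaccard similarity), one simple stand-in for a lexical baseline and not necessarily the baseline the researchers used. The three sentences are invented; imagine `paper_1` and `paper_2` are by the same author on different topics, while `rival` is a different author writing on the topic of `paper_1`.

```python
def jaccard(text_a, text_b):
    """Word-set overlap between two texts: |A ∩ B| / |A ∪ B|."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)


# Invented examples; authorship labels are hypothetical.
paper_1 = "the neural network model learns sparse features"
paper_2 = "the river sediment samples show seasonal variation"
rival = "a neural network model learns dense features"

same_author_score = jaccard(paper_1, paper_2)   # same author, different topics
same_topic_score = jaccard(paper_1, rival)      # different authors, same topic
print(same_topic_score > same_author_score)
```

The different-author pair scores higher simply because it shares topic words, which is the shortcut a topic-controlled benchmark is meant to take away.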

Rethinking Authorship Scoring

The standard approach to authorship scoring compresses each document into a single vector. The HALvest researchers take a different approach, keeping a sequence of vectors and comparing them with what's called late interaction. By grouping neighboring tokens into patches for matching, the method markedly improves on the single-vector baseline. It is not a silver bullet, though: the optimal interaction granularity remains an open question.
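The difference between the two scoring styles can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the researchers' implementation: patches are formed by mean-pooling consecutive token vectors, and the late-interaction score is a MaxSim-style average (each patch in one text is matched to its most similar patch in the other). The function names and the `patch_size` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def to_patches(token_vecs, patch_size):
    """Mean-pool consecutive token vectors into patches (assumed pooling);
    the final patch may cover fewer tokens."""
    chunks = [token_vecs[i:i + patch_size].mean(axis=0)
              for i in range(0, len(token_vecs), patch_size)]
    return l2_normalize(np.stack(chunks))


def late_interaction_score(vecs_a, vecs_b, patch_size=1):
    """For each patch of A, take its best cosine match in B, then average."""
    a = to_patches(vecs_a, patch_size)
    b = to_patches(vecs_b, patch_size)
    sim = a @ b.T                      # patch-to-patch cosine similarities
    return float(sim.max(axis=1).mean())


def single_vector_score(vecs_a, vecs_b):
    """Baseline: collapse each text to one mean vector, then cosine."""
    a = l2_normalize(vecs_a.mean(axis=0))
    b = l2_normalize(vecs_b.mean(axis=0))
    return float(a @ b)


# Random stand-ins for token embeddings (12 and 20 tokens, 8 dimensions).
rng = np.random.default_rng(0)
text_a = rng.normal(size=(12, 8))
text_b = rng.normal(size=(20, 8))

score_late = late_interaction_score(text_a, text_b, patch_size=4)
score_single = single_vector_score(text_a, text_b)
```

Setting `patch_size=1` gives token-level matching, while a very large `patch_size` approaches the single-vector case, which is one way to see why the choice of granularity matters.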

Why should this matter to you? If your system can't accurately attribute authorship once topical shortcuts are gone, how do you trust its inferences in more complex tasks? The real question: are you ready to separate signal from noise in the AI authorship game?
