Cracking the Code: Hate Speech Detection Across Languages
Multilingual sentence embeddings show promise for detecting hate speech in Lithuanian, Russian, and English. Paired with gradient-boosted decision trees, these models set a strong new benchmark.
Detecting hate speech isn't easy, especially when navigating the complexities of multiple languages. Online platforms have long struggled with moderation, and the stakes are even higher for less-resourced languages like Lithuanian. Recent advancements in multilingual sentence embeddings offer a potential solution.
Introducing LtHate: A New Benchmark
The study introduces LtHate, a new corpus of Lithuanian hate speech. Sourced from news portals and social networks, it aims to fill a significant gap in existing datasets. But how effective are modern multilingual encoders at identifying hate speech across different languages?
Six state-of-the-art multilingual encoders - potion, gemma, bge, snow, jina, and e5 - were put to the test across Lithuanian, Russian, and English datasets. The headline finding: frozen embeddings from these models, paired with gradient-boosted decision trees, detect hate speech accurately. The basic pipeline is sketched below.
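Concretely, the recipe is: encode each comment with a frozen multilingual model, then train a standard classifier on the resulting vectors. Here is a minimal sketch of that idea; the encoder name (intfloat/multilingual-e5-base), the placeholder texts, and the use of scikit-learn's GradientBoostingClassifier are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of the embed-then-classify pipeline. The placeholder texts,
# the encoder name, and the classifier choice are illustrative, not the
# study's exact setup.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in data: 1 = hate speech, 0 = not. A real corpus such as LtHate
# would supply thousands of labeled comments per language.
texts = [
    "hateful comment placeholder 1", "hateful comment placeholder 2",
    "hateful comment placeholder 3", "hateful comment placeholder 4",
    "ordinary comment placeholder 1", "ordinary comment placeholder 2",
    "ordinary comment placeholder 3", "ordinary comment placeholder 4",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Any multilingual sentence encoder slots in here; multilingual-e5 is one
# plausible member of the "e5" family named above.
encoder = SentenceTransformer("intfloat/multilingual-e5-base")
X = encoder.encode(texts)  # one fixed-size vector per comment

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# Gradient-boosted decision trees on top of the frozen embeddings.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_tr, y_tr)
print(f"accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

The appeal of this design is that the expensive multilingual model is used only as a feature extractor, so the same frozen encoder can serve all three languages while a cheap tree ensemble does the actual classification.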
Numbers Don't Lie
Performance varied notably across languages. In Lithuanian, the best models reached 80.96% accuracy. For Russian, e5 led the pack at 92.19%. The best English result was 77.21%, with PCA compression preserving most of the discriminative power in supervised settings.
The picture is different for unsupervised models: PCA compression noticeably dampens their effectiveness. The clear takeaway is that supervised two-class models consistently outperform their one-class counterparts, a contrast sketched below.
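To build intuition for why compression cuts differently across the two regimes, here is a toy comparison on synthetic "embeddings": PCA reduces the vectors, then a supervised two-class model and a one-class outlier detector (scikit-learn's OneClassSVM, trained only on non-hateful examples) are scored on the same data. The dimensions, data, and model choices are assumptions for illustration, not the paper's setup.

```python
# Toy sketch of the PCA-compression comparison: supervised two-class vs.
# one-class detection on the same compressed vectors. Synthetic data and
# sklearn models stand in for the study's real embeddings and classifiers.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for real sentence embeddings: two loosely separated clusters.
X_pos = rng.normal(0.5, 1.0, size=(200, 384))   # "hate" class
X_neg = rng.normal(-0.5, 1.0, size=(200, 384))  # "non-hate" class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Compress the 384-dim embeddings down to a handful of components.
pca = PCA(n_components=16).fit(X)
X_c = pca.transform(X)

# Supervised two-class model: sees both labels, can re-weight whatever
# discriminative signal survives compression.
clf = GradientBoostingClassifier(random_state=0).fit(X_c, y)
print("two-class acc:", (clf.predict(X_c) == y).mean())

# One-class model: trained only on the non-hate class, flags outliers as hate.
occ = OneClassSVM(nu=0.1).fit(X_c[y == 0])
pred = (occ.predict(X_c) == -1).astype(int)  # -1 = outlier -> predicted hate
print("one-class acc:", (pred == y).mean())
```

The intuition, at least: the one-class detector never sees a hateful example, so any variance PCA discards can take its decision signal with it, while the supervised model can compensate by re-weighting the components that remain.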
Why This Matters
So, why should anyone care? Multilingual sentence embeddings not only bolster hate speech detection but also promise scalable, cross-language solutions. This isn't just a tech win. It's potentially a big deal for content moderation, offering platforms a practical toolkit for tackling hate speech across many languages.
But here's the rub: it raises ethical questions. As AI takes on a larger role in moderating speech, who ensures these models respect free expression while safeguarding users? The results also suggest that how a system is built matters more than raw parameter count, and getting that design right is just as important as the oversight around it.
Ultimately, this research underscores the power of AI in multilingual contexts. However, without careful oversight, could these tools inadvertently stifle legitimate speech? That's the challenge ahead.