Why AI Isn't Ready for Industrial Safety, Yet
Large Language Models show promise in extracting data from Safety Data Sheets, but they're not there yet. Here's the real story from the trenches.
In industrial safety, extracting structured information from Safety Data Sheets (SDS) has always been a bit of a nightmare. With documents varying wildly in format, traditional rule-based methods have often hit a wall. The latest buzz? Using Large Language Models (LLMs) like Gemini 1.5 Pro and GPT-4o to tackle the job. But is AI really up to the task?
The Performance Puzzle
Four models were put to the test: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B. They were run through their paces using three prompting strategies: zero-shot, few-shot, and chain-of-thought. The metrics on the table? Accuracy, latency, and cost, measured across more than 50,000 data fields. So, who came out on top?
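To make the three prompting strategies concrete, here is a minimal sketch of how such prompts might be assembled. The field names, the wording of the instructions, and the `build_prompt` helper are all assumptions for illustration; the study's actual prompts are not given in this article.

```python
from typing import Sequence, Tuple

# Hypothetical field list; a real SDS has many more fields than this.
FIELDS = ("product_name", "cas_number", "flash_point")


def build_prompt(
    strategy: str,
    sds_text: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble an extraction prompt for one of three strategies (illustrative only)."""
    task = (
        "Extract the following fields from the Safety Data Sheet as JSON: "
        + ", ".join(FIELDS)
        + "."
    )
    if strategy == "zero-shot":
        # Task description only, no worked examples.
        parts = [task]
    elif strategy == "few-shot":
        # Task description plus a handful of (document, answer) pairs.
        if not examples:
            raise ValueError("few-shot prompting needs at least one example")
        shots = [f"Document:\n{doc}\nAnswer:\n{answer}" for doc, answer in examples]
        parts = [task, *shots]
    elif strategy == "chain-of-thought":
        # Ask the model to reason before committing to an answer.
        parts = [
            task,
            "First, reason step by step about where each field appears, "
            "then give the final JSON.",
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append(f"Document:\n{sds_text}\nAnswer:")
    return "\n\n".join(parts)
```

The returned string would then be sent to whichever model is under test; that call is left out here because it depends on the provider's client library.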
Text-based extraction took the crown, outperforming its multimodal counterpart on all metrics. Gemini 1.5 Pro, paired with a chain-of-thought prompt, led the pack with an accuracy of 84%. It just edged out GPT-4o at 81% and Claude 3.7 Sonnet at 79%. Impressive as these numbers might seem, none of the models hit the magic 90% accuracy mark needed for real-world reliability.
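For readers wondering what an accuracy figure over thousands of data fields might look like in practice, here is a rough sketch of a field-level score checked against the 90% bar. The exact-match rule (case-insensitive, whitespace-trimmed) and the function names are assumptions; the article does not say how the study scored a field as correct.

```python
from typing import Dict, List

RELIABILITY_THRESHOLD = 0.90  # the 90% mark cited in the article


def field_accuracy(predicted: List[Dict[str, str]], truth: List[Dict[str, str]]) -> float:
    """Share of ground-truth fields whose extracted value matches (assumed exact match)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must cover the same documents")
    correct = 0
    total = 0
    for pred, gold in zip(predicted, truth):
        for field, expected in gold.items():
            total += 1
            got = pred.get(field)
            # A missing field counts as wrong; comparison ignores case and outer whitespace.
            if got is not None and str(got).strip().lower() == str(expected).strip().lower():
                correct += 1
    return correct / total if total else 0.0


def meets_threshold(accuracy: float, threshold: float = RELIABILITY_THRESHOLD) -> bool:
    """True when the measured accuracy reaches the reliability bar."""
    return accuracy >= threshold
```

Under this scoring, the best reported result of 84% would fail `meets_threshold`, which is the gap the article is pointing at.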
Why Accuracy Matters
Here's the rub. In industries where safety is critical, anything less than near-perfect accuracy could mean the difference between life and death. If these models can't fully automate SDS data extraction reliably, are they really ready for industrial use? The gap between the keynote and the cubicle is enormous. Companies can't afford to deploy tools that aren't up to snuff on safety.
The results suggest that while general-purpose LLMs show strong potential, they're not yet reliable enough for unsupervised industrial application. But don't write them off just yet. With some task-specific fine-tuning, these models could see a boost in performance. The next steps? Focus on domain-adapted training, tweak model calibration, and bring in Human-in-the-Loop verification to shore up reliability.
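One common way to wire in Human-in-the-Loop verification is to auto-accept only the fields the model is confident about and queue the rest for a person to check. The sketch below assumes each extracted field comes with a calibrated confidence score between 0 and 1; the `ExtractedField` type, the `route_fields` helper, and the 0.95 cutoff are hypothetical and not taken from the study.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be a calibrated score in [0, 1]


def route_fields(
    fields: Iterable[ExtractedField],
    threshold: float = 0.95,  # hypothetical cutoff; tune on held-out data
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into (auto-accepted, needs human review) by confidence."""
    accepted: List[ExtractedField] = []
    needs_review: List[ExtractedField] = []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review
```

This only helps if the confidence scores are trustworthy, which is why calibration and human review tend to be discussed together: a poorly calibrated model will wave through wrong values with high confidence.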
The Path Forward
So, what's next? As promising as these findings are, they serve as a stark reminder. AI might be racing ahead in many areas, but in safety-critical applications, it's still playing catch-up. The real story here is that we're not quite ready to pull the trigger on full automation. But with continued research and adaptation, that day might not be far off.
In the meantime, companies should tread carefully. Management bought the licenses, but nobody told the team that these models aren't foolproof yet. The press release said AI transformation. The employee survey said otherwise. Are we ready to risk cutting corners just to say we're using AI?
Key Terms Explained
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
GPT: Generative Pre-trained Transformer.