Latest AI News

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

arXiv:2604.02709v2 Announce Type: replace-cross Abstract: The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.

Latest News

SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference

Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

Latest News

SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference

Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Autonomous Multi-objective Alloy Design through Simulation-guided Optimization

Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety

Data-Efficient RLVR via Off-Policy Influence Guidance

A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios

Empowering Targeted Neighborhood Search via Hyper Tour for Large-Scale TSP

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Exploring Urban Land Use Patterns by Pattern Mining and Unsupervised Learning

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

AudioX: A Unified Framework for Anything-to-Audio Generation

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

Depth-Resolved Coral Reef Thermal Fields from Satellite SST and Sparse In-Situ Loggers Using Physics-Informed Neural Networks

Unsupervised Anomaly Detection in Process-Complex Industrial Time Series: A Real-World Case Study

Learning Probabilistic Responsibility Allocations for Multi-Agent Interactions

WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

Does Dimensionality Reduction via Random Projections Preserve Landscape Features?