Latest AI News

arXiv cs.CL•about 3 hours ago·6 min read

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

arXiv:2208.14882v2 Announce Type: replace-cross Abstract: This paper studies temporal sentence grounding (TSG), the multimedia task of locating the segment of an untrimmed video that matches a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, transformer-based approaches have been proposed to model the fine-grained semantic alignment between video and query efficiently and effectively. Although these methods perform well, they treat video frames and query words uniformly as transformer input for correlation, and so fail to capture their different levels of granularity and the distinct semantics at each level. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy and models the interactions between different levels of granularity and between modalities to learn more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases and learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. A global-local transformer then learns the interactions between local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder that integrates the encoded visual and textual features for the final grounding. Extensive experiments on three challenging datasets show that HLGT achieves new state-of-the-art performance.
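The abstract's distinction between local context (adjacent dependency) and global correlation (long-range dependency) can be illustrated with attention masks. The following is a minimal sketch under stated assumptions, not the HLGT implementation: the function names `attention_mask` and `masked_self_attention`, the `window` parameter, and the toy clip features are all hypothetical, and the attention here has no learned projections.

```python
import math

def attention_mask(n, window=None):
    """Build an n x n mask; True means position i may attend to position j.

    window=None allows every pair (global, long-range attention).
    window=k allows only |i - j| <= k (local, adjacent attention).
    Both the name and the windowing rule are illustrative assumptions.
    """
    return [[window is None or abs(i - j) <= window for j in range(n)]
            for i in range(n)]

def masked_self_attention(features, mask):
    """Single-head scaled dot-product self-attention without learned weights."""
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        # Score only the positions the mask allows; others stay None.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else None
            for j, key in enumerate(features)
        ]
        # Numerically stable softmax over the allowed positions.
        top = max(s for s in scores if s is not None)
        weights = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the feature vectors.
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(dim)
        ])
    return outputs

# Toy clip features (made-up values, for illustration only).
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local_context = masked_self_attention(clips, attention_mask(len(clips), window=1))
global_context = masked_self_attention(clips, attention_mask(len(clips)))
```

With `window=1` each clip mixes only with its immediate neighbours, while the unmasked call lets every clip attend to the whole sequence. How the paper combines the two levels is not specified in the abstract and is not modelled here.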

Latest News

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

CogniFold: Always-On Proactive Memory via Cognitive Folding

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

Benchmarking and Learning Real-World Customer Service Dialogue

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

From Knowledge to Inference: Formalizing Specialized Public Health Reasoning on GlobalHealthAtlas

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Persona-Model Collapse in Emergent Misalignment

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing