cs.LGcs.AIcs.CLstat.ML
Reveals that training datasets can transmit 'subliminal' signals to LLMs — statistical patterns not observable from individual datapoints but that systematically influence model behavior through log-linear mechanisms.
Why This Matters
This has major implications for AI safety and data governance: it shows that dataset-level distributional properties can steer model behavior in ways that per-example auditing cannot detect, formalizing a previously observed but poorly understood phenomenon.
Individual data point inspection is insufficient for understanding training data influence on LLMs — aggregate distributional signals can encode behaviors invisible at the sample level, demanding new dataset-level auditing approaches.
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity
cs.LG | cs.AI | cs.CL | stat.ML
Authors: Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra et al.
Published: 2026-02-04
Why This Matters
This has major implications for AI safety and data governance: it shows that dataset-level distributional properties can steer model behavior in ways that per-example auditing cannot detect, formalizing a previously observed but poorly understood phenomenon.
Key Insight
Individual data point inspection is insufficient for understanding training data influence on LLMs — aggregate distributional signals can encode behaviors invisible at the sample level, demanding new dataset-level auditing approaches.
Abstract
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.
cs.LGcs.AI
GraphBFF presents the first recipe for training billion-parameter Graph Foundation Models on arbitrary heterogeneous, billion-scale graphs with lightweight adaptation.
Why This Matters
While foundation models have transformed NLP and vision, graphs have been left behind due to heterogeneity and scale challenges; this work closes that gap with an end-to-end approach that handles real-world graph diversity at unprecedented scale.
General-purpose graph foundation models are now feasible at billion scale, potentially enabling transfer learning across diverse graph domains the way LLMs enabled it for text.
Billion-Scale Graph Foundation Models
cs.LG | cs.AI
Authors: Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory et al.
Published: 2026-02-04
Why This Matters
While foundation models have transformed NLP and vision, graphs have been left behind due to heterogeneity and scale challenges; this work closes that gap with an end-to-end approach that handles real-world graph diversity at unprecedented scale.
Key Insight
General-purpose graph foundation models are now feasible at billion scale, potentially enabling transfer learning across diverse graph domains the way LLMs enabled it for text.
Abstract
Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion- Foundation-Fusion (GraphBFF): the first end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for arbitrary heterogeneous, billion-scale graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present the first neural scaling laws for general graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework with an evaluation of a 1.4 billion-parameter GraphBFF Transformer pretrained on one billion samples. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF achieves remarkable zero-shot and probing performance, including in few-shot settings, with large margins of up to 31 PRAUC points. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learni...
cs.LGcs.AIq-bio.BMq-bio.QM
PAR introduces the first multi-scale autoregressive framework for protein backbone generation, using coarse-to-fine next-scale prediction that mimics sculpting a statue.
Why This Matters
Autoregressive generation has dominated language and images, but protein structure generation has relied on diffusion and flow-matching; PAR shows autoregressive methods can work for 3D molecular structures when operating across spatial scales rather than sequentially along the chain.
Hierarchical coarse-to-fine autoregressive generation can be a viable alternative to diffusion for structured 3D data like proteins, opening new avenues for biomolecular design.
Protein Autoregressive Modeling via Multiscale Structure Generation
cs.LG | cs.AI | q-bio.BM | q-bio.QM
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
Published: 2026-02-04
Why This Matters
Autoregressive generation has dominated language and images, but protein structure generation has relied on diffusion and flow-matching; PAR shows autoregressive methods can work for 3D molecular structures when operating across spatial scales rather than sequentially along the chain.
Key Insight
Hierarchical coarse-to-fine autoregressive generation can be a viable alternative to diffusion for structured 3D data like proteins, opening new avenues for biomolecular design.
Abstract
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures that mimic sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the training and the generation procedure mismatch, and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
cs.AI
Mechanistic analysis of reasoning model QwQ-32B reveals that long chain-of-thought enables gradual construction of 'fluid representations' — abstract structural encodings that emerge dynamically during inference.
Why This Matters
This is one of the first mechanistic interpretability studies specifically targeting reasoning-trained models, providing concrete evidence for how extended thinking traces allow models to build up representations they couldn't form in a single forward pass.
Reasoning models don't just 'think longer' — they progressively construct abstract internal representations across their chain of thought, which explains why they dramatically outperform standard models on structural problems.
Fluid Representations in Reasoning Models
cs.AI
Authors: Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin
Published: 2026-02-04
Why This Matters
This is one of the first mechanistic interpretability studies specifically targeting reasoning-trained models, providing concrete evidence for how extended thinking traces allow models to build up representations they couldn't form in a single forward pass.
Key Insight
Reasoning models don't just 'think longer' — they progressively construct abstract internal representations across their chain of thought, which explains why they dramatically outperform standard models on structural problems.
Abstract
Reasoning language models, which generate long chains of thought, dramatically outperform non-reasoning language models on abstract problems. However, the internal model mechanisms that allow this superior performance remain poorly understood. We present a mechanistic analysis of how QwQ-32B - a model specifically trained to produce extensive reasoning traces - process abstract structural information. On Mystery Blocksworld - a semantically obfuscated planning domain - we find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning. The model develops abstract encodings that focus on structure rather than specific action names. Through steering experiments, we establish causal evidence that these adaptations improve problem solving: injecting refined representations from successful traces boosts accuracy, while symbolic representations can replace many obfuscated encodings with minimal performance loss. We find that one of the factors driving reasoning model performance is in-context refinement of token representations, which we dub Fluid Reasoning Representations.
cs.CLcs.CVcs.LG
Reinforced Attention Learning (RAL) optimizes internal attention distributions in multimodal LLMs using policy gradients, rather than optimizing output token sequences.
Why This Matters
This challenges the dominant paradigm of improving multimodal models through verbose chain-of-thought rationales, showing that directly optimizing where the model looks (attention) is more effective for perception tasks than optimizing what it says.
For multimodal AI, optimizing internal representations via RL may be more effective than optimizing output text, suggesting a new direction for post-training beyond verbose reasoning.
Reinforced Attention Learning
cs.CL | cs.CV | cs.LG
Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang et al.
Published: 2026-02-04
Why This Matters
This challenges the dominant paradigm of improving multimodal models through verbose chain-of-thought rationales, showing that directly optimizing where the model looks (attention) is more effective for perception tasks than optimizing what it says.
Key Insight
For multimodal AI, optimizing internal representations via RL may be more effective than optimizing output text, suggesting a new direction for post-training beyond verbose reasoning.
Abstract
Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
cs.CRcs.AIcs.CL
Introduces WebSentinel, a two-step detection and localization system for prompt injection attacks targeting web-browsing AI agents.
Why This Matters
As LLM-powered web agents become deployed in real products, prompt injection via manipulated webpage content is an urgent security threat; this paper provides the first systematic approach to both detecting and pinpointing injected instructions in web content, which is critical for safe agent deployment.
Practitioners should know that defending web agents requires not just detecting prompt injections but localizing them within page content, and a segment-extraction-then-classification pipeline achieves this more effectively than end-to-end approaches.
WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
cs.CR | cs.AI | cs.CL
Authors: Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
Published: 2026-02-03
Why This Matters
As LLM-powered web agents become deployed in real products, prompt injection via manipulated webpage content is an urgent security threat; this paper provides the first systematic approach to both detecting and pinpointing injected instructions in web content, which is critical for safe agent deployment.
Key Insight
Practitioners should know that defending web agents requires not just detecting prompt injections but localizing them within page content, and a segment-extraction-then-classification pipeline achieves this more effectively than end-to-end approaches.
Abstract
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emph{segments of interest} that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl-lxw/WebSentinel.
cs.CL
Proposes an explicit information transmission approach for soft context compression in LLMs that outperforms existing methods by avoiding the structural limitations of self-attention-based compression.
Why This Matters
Long-context inference cost is one of the most pressing practical bottlenecks for LLM deployment, and this work identifies fundamental problems with using LLMs as their own compressors — offering a principled alternative that could significantly reduce inference costs for long-document applications.
Practitioners should know that repurposing an LLM's own self-attention for context compression introduces structural bottlenecks, and explicit information transmission architectures can achieve better compression ratios with less information loss.
Context Compression via Explicit Information Transmission
cs.CL
Authors: Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao et al.
Published: 2026-02-03
Why This Matters
Long-context inference cost is one of the most pressing practical bottlenecks for LLM deployment, and this work identifies fundamental problems with using LLMs as their own compressors — offering a principled alternative that could significantly reduce inference costs for long-document applications.
Key Insight
Practitioners should know that repurposing an LLM's own self-attention for context compression introduces structural bottlenecks, and explicit information transmission architectures can achieve better compression ratios with less information loss.
Abstract
Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information tr...
cs.LGcs.AIcs.CL
Proposes antidistillation fingerprinting, a method to detect when a student model has been distilled from a teacher LLM's outputs without degrading the teacher's generation quality.
Why This Matters
Model distillation of frontier LLMs is a growing intellectual property concern, and existing fingerprinting methods require sacrificing output quality; this work decouples fingerprint strength from quality degradation, addressing a real commercial need for model provenance verification.
Practitioners should know that robust distillation detection is now possible without meaningful degradation to the teacher model's outputs, enabling better enforcement of model usage policies.
Antidistillation Fingerprinting
cs.LG | cs.AI | cs.CL
Authors: Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey et al.
Published: 2026-02-03
Why This Matters
Model distillation of frontier LLMs is a growing intellectual property concern, and existing fingerprinting methods require sacrificing output quality; this work decouples fingerprint strength from quality degradation, addressing a real commercial need for model provenance verification.
Key Insight
Practitioners should know that robust distillation detection is now possible without meaningful degradation to the teacher model's outputs, enabling better enforcement of model usage policies.
Abstract
Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.
cs.AIcs.LG
Provides a theoretical and empirical framework explaining why scaling the number of agents in LLM-based multi-agent systems hits diminishing returns, and shows that diversity — not quantity — drives performance gains.
Why This Matters
As multi-agent LLM systems become mainstream for complex tasks, this paper answers a fundamental design question: adding more homogeneous agents wastes compute, but heterogeneous agents (different models, prompts, tools) continue to improve, giving practitioners a concrete scaling strategy.
Practitioners should know that investing in agent diversity (varied models, prompts, and tools) yields far better returns than simply increasing agent count in multi-agent systems.
Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity
cs.AI | cs.LG
Authors: Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen et al.
Published: 2026-02-03
Why This Matters
As multi-agent LLM systems become mainstream for complex tasks, this paper answers a fundamental design question: adding more homogeneous agents wastes compute, but heterogeneous agents (different models, prompts, tools) continue to improve, giving practitioners a concrete scaling strategy.
Key Insight
Practitioners should know that investing in agent diversity (varied models, prompts, and tools) yields far better returns than simply increasing agent count in multi-agent systems.
Abstract
LLM-based multi-agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information-theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture-agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce $K^*$, an effective channel count that quantifies the number of effective channels without ground-truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity-aware design. Code and Dataset are available at the link: https://github.com/SafeRL-Lab/Agent-Scaling.
cs.LG
Introduces latent reasoning tokens in discrete diffusion language models that enable implicit computation without generating visible chain-of-thought tokens.
Why This Matters
This bridges the gap between diffusion and autoregressive language models by showing that diffusion models' joint prediction mechanism acts as implicit reasoning, and that latent tokens can recover performance lost when ablating this mechanism — offering a new path to efficient reasoning without verbose outputs.
Practitioners should know that diffusion language models can reason through latent token prediction rather than explicit chain-of-thought, potentially offering more compute-efficient reasoning at inference time.
Reasoning with Latent Tokens in Diffusion Language Models
cs.LG
Authors: Andre He, Sean Welleck, Daniel Fried
Published: 2026-02-03
Why This Matters
This bridges the gap between diffusion and autoregressive language models by showing that diffusion models' joint prediction mechanism acts as implicit reasoning, and that latent tokens can recover performance lost when ablating this mechanism — offering a new path to efficient reasoning without verbose outputs.
Key Insight
Practitioners should know that diffusion language models can reason through latent token prediction rather than explicit chain-of-thought, potentially offering more compute-efficient reasoning at inference time.
Abstract
Discrete diffusion models have recently become competitive with autoregressive models for language modeling, even outperforming them on reasoning tasks requiring planning and global coherence, but they require more computation at inference time. We trace this trade-off to a key mechanism: diffusion models are trained to jointly predict a distribution over all unknown tokens, including those that will not actually be decoded in the current step. Ablating this joint prediction yields faster inference but degrades performance, revealing that accurate prediction at the decoded position relies on joint reasoning about the distribution of undecoded tokens. We interpret these as latent tokens and introduce a method for modulating their number, demonstrating empirically that this enables a smooth tradeoff between inference speed and sample quality. Furthermore, we demonstrate that latent tokens can be introduced into autoregressive models through an auxiliary multi-token prediction objective, yielding substantial improvements on the same reasoning tasks where they have traditionally struggled. Our results suggest that latent tokens, while arising naturally in diffusion, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead.
cs.ROcs.AI
World-Gymnast trains robot policies via reinforcement learning inside learned world models, bypassing sim-to-real gaps and expert data requirements.
Why This Matters
This addresses a fundamental bottleneck in robot learning: physical interaction is expensive, simulators have reality gaps, and expert demos are scarce. Training in learned world models from real video offers a promising middle ground.
Video-based world models have matured enough to serve as viable training environments for manipulation policies that transfer to real robots.
World-Gymnast: Training Robots with Reinforcement Learning in a World Model
cs.RO | cs.AI
Authors: Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu et al.
Published: 2026-02-02
Why This Matters
This addresses a fundamental bottleneck in robot learning: physical interaction is expensive, simulators have reality gaps, and expert demos are scarce. Training in learned world models from real video offers a promising middle ground.
Key Insight
Video-based world models have matured enough to serve as viable training environments for manipulation policies that transfer to real robots.
Abstract
Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.
cs.AI
Identity Bridge breaks the reversal curse in autoregressive LLMs by adding bidirectional identity links during training.
Why This Matters
The reversal curse (knowing 'A is B' but not inferring 'B is A') has been considered a fundamental limitation of autoregressive models. Demonstrating a simple fix challenges assumptions about what LLMs can and cannot learn.
Autoregressive models can learn bidirectional knowledge if training data includes explicit identity bridges, not just forward associations.
Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
cs.AI
Authors: Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi
Published: 2026-02-02
Why This Matters
The reversal curse (knowing 'A is B' but not inferring 'B is A') has been considered a fundamental limitation of autoregressive models. Demonstrating a simple fix challenges assumptions about what LLMs can and cannot learn.
Key Insight
Autoregressive models can learn bidirectional knowledge if training data includes explicit identity bridges, not just forward associations.
Abstract
Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
cs.AI
AgentRx provides the first benchmark of 115 annotated failed agent trajectories with a grounded failure taxonomy for diagnosing AI agent breakdowns.
Why This Matters
As LLM agents move into production, understanding why they fail becomes critical. This benchmark fills a major gap by providing systematic annotations of failure modes across API workflows, incident management, and web tasks.
Agent failures cluster into identifiable categories that can be diagnosed from execution traces, enabling targeted improvements to agent architectures.
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
cs.AI
Authors: Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath et al.
Published: 2026-02-02
Why This Matters
As LLM agents move into production, understanding why they fail becomes critical. This benchmark fills a major gap by providing systematic annotations of failure modes across API workflows, incident management, and web tasks.
Key Insight
Agent failures cluster into identifiable categories that can be diagnosed from execution traces, enabling targeted improvements to agent architectures.
Abstract
AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
cs.CVcs.AI
PixelGen shows pixel-space diffusion can outperform latent diffusion when trained with perceptual loss instead of pixel-wise objectives.
Why This Matters
This challenges the dominant latent diffusion paradigm by demonstrating that the two-stage VAE approach may not be necessary. Removing the VAE bottleneck could eliminate common artifacts and simplify the image generation pipeline.
The key to successful pixel diffusion is not modeling the full image manifold but focusing on perceptually relevant signals.
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
cs.CV | cs.AI
Authors: Zehong Ma, Ruihan Xu, Shiliang Zhang
Published: 2026-02-02
Why This Matters
This challenges the dominant latent diffusion paradigm by demonstrating that the two-stage VAE approach may not be necessary. Removing the VAE bottleneck could eliminate common artifacts and simplify the image generation pipeline.
Key Insight
The key to successful pixel diffusion is not modeling the full image manifold but focusing on perceptually relevant signals.
Abstract
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at https://github.com/Zehong-Ma/PixelGen.
cs.LGq-bio.NC
MEG-XL enables brain-to-text interfaces with 2.5 minutes of MEG context, achieving data-efficient decoding for paralyzed patients.
Why This Matters
Brain-computer interfaces for communication require minimal training data from patients who cannot provide extensive recordings. This 5-300x longer context window represents a significant step toward practical clinical deployment of neural speech decoding.
Long-context pre-training on neural signals dramatically improves generalization across subjects, making BCIs more viable for real patients.
MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
cs.LG | q-bio.NC
Authors: Dulhan Jayalath, Oiwi Parker Jones
Published: 2026-02-02
Why This Matters
Brain-computer interfaces for communication require minimal training data from patients who cannot provide extensive recordings. This 5-300x longer context window represents a significant step toward practical clinical deployment of neural speech decoding.
Key Insight
Long-context pre-training on neural signals dramatically improves generalization across subjects, making BCIs more viable for real patients.
Abstract
Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .
cs.CVcs.AIcs.LG
VideoGPA uses geometry foundation models to enforce 3D consistency in video diffusion through self-supervised preference alignment.
Why This Matters
Current video generation models produce visually impressive but geometrically inconsistent content with object deformation and spatial drift. This data-efficient approach addresses the fundamental gap between visual quality and physical plausibility.
Standard denoising objectives lack geometric coherence incentives, but leveraging external geometry priors through preference alignment can teach video models to maintain 3D structure without explicit 3D supervision.
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
cs.CV | cs.AI | cs.LG
Authors: Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni et al.
Published: 2026-01-30
Why This Matters
Current video generation models produce visually impressive but geometrically inconsistent content with object deformation and spatial drift. This data-efficient approach addresses the fundamental gap between visual quality and physical plausibility.
Key Insight
Standard denoising objectives lack geometric coherence incentives, but leveraging external geometry priors through preference alignment can teach video models to maintain 3D structure without explicit 3D supervision.
Abstract
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
cs.AIcs.CLcs.ETcs.MA
MAPPA uses per-action process rewards from AI feedback to solve credit assignment and sample efficiency in multiagent system finetuning.
Why This Matters
Scaling multiagent systems is a key challenge for complex task automation, and this work tackles the twin problems of identifying which agent contributed to success and reducing expensive rollout costs.
Assigning credit at the action level rather than task completion enables efficient finetuning of specialized agent teams without requiring exponentially more multiagent rollouts.
Scaling Multiagent Systems with Process Rewards
cs.AI | cs.CL | cs.ET | cs.MA
Authors: Ed Li, Junyu Ren, Cat Yan
Published: 2026-01-30
Why This Matters
Scaling multiagent systems is a key challenge for complex task automation, and this work tackles the twin problems of identifying which agent contributed to success and reducing expensive rollout costs.
Key Insight
Assigning credit at the action level rather than task completion enables efficient finetuning of specialized agent teams without requiring exponentially more multiagent rollouts.
Abstract
While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. Through assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves +5.0--17.5pp on AIME and +7.8--17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision can lead to improvements across different multiagent system on various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.
cs.CLcs.AIcs.CR
Audio narrative attacks embed jailbreak directives within natural-sounding audio streams to bypass safety filters in large audio-language models.
Why This Matters
As voice interfaces become ubiquitous, this exposes a critical security gap where harmful instructions can be hidden in conversational audio. The attack surface for multimodal AI systems is expanding faster than defenses.
Safety mechanisms developed for text modalities do not transfer well to audio, and narrative-style audio can disguise malicious content that would be flagged in text form.
Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models
cs.CL | cs.AI | cs.CR
Authors: Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang
Published: 2026-01-30
Why This Matters
As voice interfaces become ubiquitous, this exposes a critical security gap where harmful instructions can be hidden in conversational audio. The attack surface for multimodal AI systems is expanding faster than defenses.
Key Insight
Safety mechanisms developed for text modalities do not transfer well to audio, and narrative-style audio can disguise malicious content that would be flagged in text form.
Abstract
Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.
cs.LGcs.ARcs.CL
FOCUS identifies that most compute in diffusion language models is wasted on non-decodable tokens and proposes attention-guided selective computation.
Why This Matters
Diffusion LLMs are gaining traction as an alternative to autoregressive models, but deployment costs remain prohibitive. This work addresses a fundamental inefficiency that could make DLLMs practical for real-world applications.
Attention-derived token importance strongly predicts which tokens are decodable at each diffusion step, enabling significant compute savings by focusing resources on the tokens that matter.
FOCUS: DLLMs Know How to Tame Their Compute Bound
cs.LG | cs.AR | cs.CL
Authors: Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini
Published: 2026-01-30
Why This Matters
Diffusion LLMs are gaining traction as an alternative to autoregressive models, but deployment costs remain prohibitive. This work addresses a fundamental inefficiency that could make DLLMs practical for real-world applications.
Key Insight
Attention-derived token importance strongly predicts which tokens are decodable at each diffusion step, enabling significant compute savings by focusing resources on the tokens that matter.
Abstract
Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.
cs.LGcs.AImath.OCstat.ML
YuriiFormer reinterprets transformer layers as optimization steps and applies Nesterov acceleration to achieve faster convergence.
Why This Matters
This variational framework provides a principled theoretical lens for understanding transformers while delivering practical speedups. It bridges optimization theory and deep learning architecture design in a way that could inspire next-generation architectures.
Viewing self-attention as gradient descent on an interaction energy enables momentum-based acceleration techniques from classical optimization to be applied directly to transformer training.
YuriiFormer: A Suite of Nesterov-Accelerated Transformers
cs.LG | cs.AI | math.OC | stat.ML
Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet
Published: 2026-01-30
Why This Matters
This variational framework provides a principled theoretical lens for understanding transformers while delivering practical speedups. It bridges optimization theory and deep learning architecture design in a way that could inspire next-generation architectures.
Key Insight
Viewing self-attention as gradient descent on an interaction energy enables momentum-based acceleration techniques from classical optimization to be applied directly to transformer training.
Abstract
We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie--Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
cs.LG
Singular Value Ensembles enable uncertainty quantification in foundation models by ensembling only the singular value components, avoiding the cost of training multiple full models.
Why This Matters
Foundation models are notoriously overconfident, but training ensembles is prohibitively expensive—this method provides a practical path to calibrated uncertainty estimates by exploiting the structure of pretrained weights.
Ensembling at the singular value level rather than the full model level provides meaningful uncertainty estimates while being computationally tractable for large foundation models.
Making Foundation Models Probabilistic via Singular Value Ensembles
cs.LG
Authors: Mehmet Ozgur Turkoglu, Dominik J. Mühlematter, Alexander Becker, Konrad Schindler, Helge Aasen
Published: 2026-01-29
Why This Matters
Foundation models are notoriously overconfident, but training ensembles is prohibitively expensive—this method provides a practical path to calibrated uncertainty estimates by exploiting the structure of pretrained weights.
Key Insight
Ensembling at the singular value level rather than the full model level provides meaningful uncertainty estimates while being computationally tractable for large foundation models.
Abstract
Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple, but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model's knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices. If the singular vectors are indeed meaningful (orthogonal) "knowledge directions". To obtain a model ensemble, we modulate only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles w...
cs.CLcs.AI
Demonstrates that masked diffusion language models can reason 'out of order', generating answers and explanations non-sequentially unlike autoregressive models.
Why This Matters
This challenges the fundamental assumption that reasoning must proceed left-to-right, showing diffusion models can naturally handle cases where output structure conflicts with reasoning order—a limitation that forces AR models into premature commitment.
For tasks requiring answer-before-explanation output formats, diffusion language models may be inherently better suited than autoregressive models due to their flexible generation order.
Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models
cs.CL | cs.AI
Authors: Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T et al.
Published: 2026-01-29
Why This Matters
This challenges the fundamental assumption that reasoning must proceed left-to-right, showing diffusion models can naturally handle cases where output structure conflicts with reasoning order—a limitation that forces AR models into premature commitment.
Key Insight
For tasks requiring answer-before-explanation output formats, diffusion language models may be inherently better suited than autoregressive models due to their flexible generation order.
Abstract
Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
cs.CV
Reveals that vision-language models answer visual illusions correctly by recalling memorized patterns rather than actually perceiving the visual content.
Why This Matters
This exposes a fundamental limitation in how VLMs process visual information—they may be sophisticated pattern matchers rather than true visual reasoners, with major implications for safety-critical applications.
Testing VLMs with inverted illusions where human perception clearly changes but model responses don't is a powerful diagnostic for distinguishing genuine perception from memorization.
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
cs.CV
Authors: Xiaoxiao Sun, Mingyang Li, Kun yuan, Min Woo Sun, Mark Endo et al.
Published: 2026-01-29
Why This Matters
This exposes a fundamental limitation in how VLMs process visual information—they may be sophisticated pattern matchers rather than true visual reasoners, with major implications for safety-critical applications.
Key Insight
Testing VLMs with inverted illusions where human perception clearly changes but model responses don't is a powerful diagnostic for distinguishing genuine perception from memorization.
Abstract
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
cs.ROcs.CV
DynamicVLA introduces a compact 0.4B vision-language-action model that can manipulate moving objects through temporal reasoning and closed-loop control.
Why This Matters
While VLA models excel at static manipulation, real-world robotics requires handling dynamic objects—this work directly addresses that gap with a surprisingly small model that integrates temporal anticipation.
A convolutional vision encoder combined with explicit temporal reasoning mechanisms can enable dynamic manipulation without the computational overhead of massive VLA models.
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
cs.RO | cs.CV
Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong et al.
Published: 2026-01-29
Why This Matters
While VLA models excel at static manipulation, real-world robotics requires handling dynamic objects—this work directly addresses that gap with a surprisingly small model that integrates temporal anticipation.
Key Insight
A convolutional vision encoder combined with explicit temporal reasoning mechanisms can enable dynamic manipulation without the computational overhead of massive VLA models.
Abstract
Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
cs.CV
Pixel MeanFlow enables one-step image generation directly in pixel space without latents, achieving quality comparable to multi-step diffusion models.
Why This Matters
This represents a significant simplification of the generative image pipeline by eliminating both the need for multiple sampling steps and latent space encoding, potentially enabling real-time high-quality image generation on resource-constrained devices.
Separating the network output space from the loss space allows direct pixel-space generation to work effectively, challenging the assumption that latent spaces are necessary for quality.
One-step Latent-free Image Generation with Pixel Mean Flows
cs.CV
Authors: Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang et al.
Published: 2026-01-29
Why This Matters
This represents a significant simplification of the generative image pipeline by eliminating both the need for multiple sampling steps and latent space encoding, potentially enabling real-time high-quality image generation on resource-constrained devices.
Key Insight
Separating the network output space from the loss space allows direct pixel-space generation to work effectively, challenging the assumption that latent spaces are necessary for quality.
Abstract
Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
cs.CLcs.AI
DynaWeb applies model-based RL to train web agents by learning a world model of websites, enabling safer and more efficient training.
Why This Matters
Training web agents on the live internet is risky, costly, and inefficient - learning a world model enables simulated practice at scale, addressing a key bottleneck in autonomous web agent development.
The path to reliable web agents may run through learned simulators rather than direct internet interaction.
DynaWeb: Model-Based Reinforcement Learning of Web Agents
cs.CL | cs.AI
Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao et al.
Published: 2026-01-29
Why This Matters
Training web agents on the live internet is risky, costly, and inefficient - learning a world model enables simulated practice at scale, addressing a key bottleneck in autonomous web agent development.
Key Insight
The path to reliable web agents may run through learned simulators rather than direct internet interaction.
Abstract
The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
cs.LG
LLM Shepherding uses large models to provide short hints to small models rather than complete answers, dramatically reducing inference costs.
Why This Matters
This escapes the all-or-nothing tradeoff between cheap-but-weak SLMs and expensive-but-capable LLMs by using LLMs as consultants rather than workers, achieving significant cost savings while maintaining quality.
Sometimes the most cost-effective use of a powerful LLM is asking it for a hint, not an answer.
Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
cs.LG
Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu
Published: 2026-01-29
Why This Matters
This escapes the all-or-nothing tradeoff between cheap-but-weak SLMs and expensive-but-capable LLMs by using LLMs as consultants rather than workers, achieving significant cost savings while maintaining quality.
Key Insight
Sometimes the most cost-effective use of a powerful LLM is asking it for a hint, not an answer.
Abstract
Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
cs.CLcs.AI
Proposes Proactive Interactive Reasoning (PIR) where LLMs ask clarifying questions during reasoning instead of making assumptions.
Why This Matters
Current reasoning models perform 'blind self-thinking' even when critical information is missing - PIR fundamentally changes this by having models proactively seek clarification, reducing hallucination and improving reliability.
The next evolution of reasoning models isn't just thinking harder, but knowing when to ask for help.
Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
cs.CL | cs.AI
Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan et al.
Published: 2026-01-29
Why This Matters
Current reasoning models perform 'blind self-thinking' even when critical information is missing - PIR fundamentally changes this by having models proactively seek clarification, reducing hallucination and improving reliability.
Key Insight
The next evolution of reasoning models isn't just thinking harder, but knowing when to ask for help.
Abstract
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a \emph{blind self-thinking} paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. M...
cs.AIcs.CL
Introduces Agent-RRM, a multi-faceted reward model that provides structured feedback for agentic trajectories beyond sparse outcome rewards.
Why This Matters
Current agentic RL relies on binary success/failure signals which can't distinguish good reasoning from lucky outcomes - this enables dense, interpretable feedback that improves agent training quality.
Training better AI agents requires evaluating the reasoning process itself, not just final outcomes.
Exploring Reasoning Reward Model for Agents
cs.AI | cs.CL
Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li et al.
Published: 2026-01-29
Why This Matters
Current agentic RL relies on binary success/failure signals which can't distinguish good reasoning from lucky outcomes - this enables dense, interpretable feedback that improves agent training quality.
Key Insight
Training better AI agents requires evaluating the reasoning process itself, not just final outcomes.
Abstract
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still relies on sparse outcome-based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace , (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
cs.LGcs.CL
Systematically evaluates 2,000+ models on HuggingFace to discover overlooked fine-tunes that outperform popular foundation models.
Why This Matters
This challenges the assumption that model popularity reflects quality - many 'hidden gem' fine-tunes significantly outperform their popular counterparts, suggesting the ML community is leaving performance on the table.
Before defaulting to popular checkpoints, practitioners should search model repositories more thoroughly as better-performing alternatives often exist with minimal downloads.
Discovering Hidden Gems in Model Repositories
cs.LG | cs.CL
Authors: Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen
Published: 2026-01-29
Why This Matters
This challenges the assumption that model popularity reflects quality - many 'hidden gem' fine-tunes significantly outperform their popular counterparts, suggesting the ML community is leaving performance on the table.
Key Insight
Before defaulting to popular checkpoints, practitioners should search model repositories more thoroughly as better-performing alternatives often exist with minimal downloads.
Abstract
Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of "hidden gems", unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
cs.CV
ePAI is an automated system for early detection of pancreatic cancer from CT scans, identifying lesions that radiologists previously overlooked.
Why This Matters
Pancreatic cancer is often detected too late for surgery. Retrospective studies show expert radiologists can spot lesions in prediagnostic scans when they know the patient later developed cancer—this AI aims to enable that foresight prospectively.
AI can potentially catch subtle pancreatic lesions that humans miss on routine scans, enabling earlier intervention for one of the deadliest cancers.
Early and Prediagnostic Detection of Pancreatic Cancer from Computed Tomography
cs.CV
Authors: Wenxuan Li, Pedro R. A. S. Bassi, Lizhou Wu, Xinze Zhou, Yuxuan Zhao et al.
Published: 2026-01-29
Why This Matters
Pancreatic cancer is often detected too late for surgery. Retrospective studies show expert radiologists can spot lesions in prediagnostic scans when they know the patient later developed cancer—this AI aims to enable that foresight prospectively.
Key Insight
AI can potentially catch subtle pancreatic lesions that humans miss on routine scans, enabling earlier intervention for one of the deadliest cancers.
Abstract
Pancreatic ductal adenocarcinoma (PDAC), one of the deadliest solid malignancies, is often detected at a late and inoperable stage. Retrospective reviews of prediagnostic CT scans, when conducted by expert radiologists aware that the patient later developed PDAC, frequently reveal lesions that were previously overlooked. To help detecting these lesions earlier, we developed an automated system named ePAI (early Pancreatic cancer detection with Artificial Intelligence). It was trained on data from 1,598 patients from a single medical center. In the internal test involving 1,009 patients, ePAI achieved an area under the receiver operating characteristic curve (AUC) of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for detecting small PDAC less than 2 cm in diameter, precisely localizing PDAC as small as 2 mm. In an external test involving 7,158 patients across 6 centers, ePAI achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, precisely localizing PDAC as small as 5 mm. Importantly, ePAI detected PDACs on prediagnostic CT scans obtained 3 to 36 months before clinical diagnosis that had originally been overlooked by radiologists. It successfully detected and localized PDACs in 75 of 159 patients, with a median lead time of 347 days before clinical diagnosis. Our multi-reader study showed that ePAI significantly outperformed 30 board-certified radiologists by 50.3% (P < 0.05) in sensitivity while maintaining a comparable specificit...
cs.SEcs.AIcs.LG
SWE-Replay enables efficient test-time scaling for software engineering agents by replaying and branching from successful trajectory prefixes instead of sampling from scratch.
Why This Matters
Standard test-time scaling for SWE agents wastes compute by resampling entire trajectories. This work shows that leveraging successful partial trajectories can dramatically reduce inference costs while maintaining performance.
For software engineering tasks, trajectory prefixes contain reusable computation—branching from successful partial solutions is far more efficient than repeated full sampling.
SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
cs.SE | cs.AI | cs.LG
Authors: Yifeng Ding, Lingming Zhang
Published: 2026-01-29
Why This Matters
Standard test-time scaling for SWE agents wastes compute by resampling entire trajectories. This work shows that leveraging successful partial trajectories can dramatically reduce inference costs while maintaining performance.
Key Insight
For software engineering tasks, trajectory prefixes contain reusable computation—branching from successful partial solutions is far more efficient than repeated full sampling.
Abstract
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.
cs.LGcs.AIcs.CRcs.SE
StepShield is the first benchmark measuring WHEN safety violations are detected in AI agent trajectories, not just whether they're detected.
Why This Matters
This reframes agent safety evaluation from binary accuracy to intervention timing—a detector flagging violations at step 8 enables prevention, while step 48 detection is merely forensic. This distinction is critical for deploying agents safely.
Early detection of rogue agent behavior is fundamentally different from post-mortem analysis; safety benchmarks must measure detection latency, not just accuracy.
StepShield: When, Not Whether to Intervene on Rogue Agents
cs.LG | cs.AI | cs.CR | cs.SE
Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar et al.
Published: 2026-01-29
Why This Matters
This reframes agent safety evaluation from binary accuracy to intervention timing—a detector flagging violations at step 8 enables prevention, while step 48 detection is merely forensic. This distinction is critical for deploying agents safely.
Key Insight
Early detection of rogue agent behavior is fundamentally different from post-mortem analysis; safety benchmarks must measure detection latency, not just accuracy.
Abstract
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
cs.CLcs.AIcs.LG
Introduces efficient distillation methods to convert pretrained softmax attention Transformers into hybrid linear attention architectures for extremely long contexts.
Why This Matters
This tackles the prohibitive cost of training long-context models from scratch by enabling conversion of existing models. The hybrid approach preserves quality while dramatically improving throughput for long sequences.
You don't need to pretrain hybrid attention models from scratch—distillation from existing Transformers can achieve comparable quality with much better long-context efficiency.
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
cs.CL | cs.AI | cs.LG
Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen et al.
Published: 2026-01-29
Why This Matters
This tackles the prohibitive cost of training long-context models from scratch by enabling conversion of existing models. The hybrid approach preserves quality while dramatically improving throughput for long sequences.
Key Insight
You don't need to pretrain hybrid attention models from scratch—distillation from existing Transformers can achieve comparable quality with much better long-context efficiency.
Abstract
Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data
cs.CRcs.AIcs.CL
RedSage is a cybersecurity-specialized LLM trained on 11.8B tokens of security-focused data spanning frameworks, offensive techniques, and tools.
Why This Matters
This addresses a critical gap in security operations where proprietary APIs pose privacy risks and open models lack domain expertise. The 28.6K curated documents across security domains could enable safer, on-premise security assistants.
Domain-specific continual pretraining with carefully curated security corpora can bridge the gap between general LLMs and specialized cybersecurity workflows without exposing sensitive data to external APIs.
RedSage: A Cybersecurity Generalist LLM
cs.CR | cs.AI | cs.CL
Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi et al.
Published: 2026-01-29
Why This Matters
This addresses a critical gap in security operations where proprietary APIs pose privacy risks and open models lack domain expertise. The 28.6K curated documents across security domains could enable safer, on-premise security assistants.
Key Insight
Domain-specific continual pretraining with carefully curated security corpora can bridge the gap between general LLMs and specialized cybersecurity workflows without exposing sensitive data to external APIs.
Abstract
Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can...
cs.LGcs.AIcs.CL
Addresses training stagnation on saturated problems by conditioning rollouts on failure prefixes to find informative learning signals.
Why This Matters
As models get better at benchmarks, most training samples become trivially solved - this elegant technique keeps learning productive by specifically seeking out the edge cases where the model still fails.
When RL training plateaus because the model solves most problems, start rollouts from known failure points rather than from scratch to maintain learning signal.
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
cs.LG | cs.AI | cs.CL
Authors: Minwu Kim, Safal Shrestha, Keith Ross
Published: 2026-01-28
Why This Matters
As models get better at benchmarks, most training samples become trivially solved - this elegant technique keeps learning productive by specifically seeking out the edge cases where the model still fails.
Key Insight
When RL training plateaus because the model solves most problems, start rollouts from known failure points rather than from scratch to maintain learning signal.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
cs.LGcs.AI
Proposes using rich textual feedback (like error messages) for RL credit assignment in LLMs through a self-distillation approach.
Why This Matters
Current RLVR methods waste valuable signal by only using pass/fail outcomes - this method leverages the detailed feedback that verifiable environments already provide, like compiler errors and test failures.
When training on code or math, convert error messages and judge feedback into learning signal rather than discarding everything except the binary reward.
Reinforcement Learning via Self-Distillation
cs.LG | cs.AI
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella et al.
Published: 2026-01-28
Why This Matters
Current RLVR methods waste valuable signal by only using pass/fail outcomes - this method leverages the detailed feedback that verifiable environments already provide, like compiler errors and test failures.
Key Insight
When training on code or math, convert error messages and judge feedback into learning signal rather than discarding everything except the binary reward.
Abstract
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conv...
cs.AI
Introduces SokoBench to systematically evaluate long-horizon planning in LLMs using simplified Sokoban puzzles that isolate planning from state tracking.
Why This Matters
Despite claims of improved reasoning, this benchmark reveals that even state-of-the-art reasoning models struggle with multi-step planning when they can't rely on pattern matching from training data.
Current LRMs may be better at reasoning that looks like training examples than genuine novel planning - evaluate planning capabilities separately from pattern-matching reasoning.
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
cs.AI
Authors: Sebastiano Monti, Carlo Nicolini, Gianni Pellegrini, Jacopo Staiano, Bruno Lepri
Published: 2026-01-28
Why This Matters
Despite claims of improved reasoning, this benchmark reveals that even state-of-the-art reasoning models struggle with multi-step planning when they can't rely on pattern matching from training data.
Key Insight
Current LRMs may be better at reasoning that looks like training examples than genuine novel planning - evaluate planning capabilities separately from pattern-matching reasoning.
Abstract
Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
cs.CLcs.LG
Shows that linear representations of concepts in LLMs can flip dramatically within a single conversation, with factual information becoming represented as non-factual.
Why This Matters
This challenges the assumption that mechanistic interpretability findings about static representations generalize to dynamic multi-turn interactions, which is how LLMs are actually used.
Interpretability researchers should validate their linear probes across conversation turns, not just single inputs, as representation dynamics can invalidate static analysis.
Linear representations in language models can change dramatically over a conversation
cs.CL | cs.LG
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
Published: 2026-01-28
Why This Matters
This challenges the assumption that mechanistic interpretability findings about static representations generalize to dynamic multi-turn interactions, which is how LLMs are actually used.
Key Insight
Interpretability researchers should validate their linear probes across conversation turns, not just single inputs, as representation dynamics can invalidate static analysis.
Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, ...
cs.LGcs.AIcs.CLcs.CY
Demonstrates that reward models inherit systematic value biases from their pretrained LLM initializations, affecting alignment outcomes.
Why This Matters
This reveals a critical blind spot in RLHF pipelines - the reward models we trust to align LLMs carry their own biases from pretraining, which could systematically skew what behaviors get reinforced.
Practitioners should audit reward models for inherited biases before deployment, as these biases persist through fine-tuning and can silently shape aligned model behavior.
Reward Models Inherit Value Biases from Pretraining
cs.LG | cs.AI | cs.CL | cs.CY
Authors: Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk et al.
Published: 2026-01-28
Why This Matters
This reveals a critical blind spot in RLHF pipelines - the reward models we trust to align LLMs carry their own biases from pretraining, which could systematically skew what behaviors get reinforced.
Key Insight
Practitioners should audit reward models for inherited biases before deployment, as these biases persist through fine-tuning and can silently shape aligned model behavior.
Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at ...
cs.AIcs.CRcs.LG
GAVEL introduces rule-based activation monitoring for LLM safety, allowing interpretable, shareable safety rules modeled on cybersecurity practices.
Why This Matters
Current activation-based safety approaches suffer from poor precision and lack interpretability. By treating activations as cognitive signatures that can be matched against explicit rules, this work enables more precise, flexible, and explainable safety monitoring - crucial as LLMs are deployed in high-stakes applications.
Activation patterns can be modeled as shareable safety rules similar to cybersecurity threat signatures, enabling collaborative and interpretable safety monitoring across deployments.
GAVEL: Towards rule-based safety through activation monitoring
cs.AI | cs.CR | cs.LG
Authors: Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel et al.
Published: 2026-01-27
Why This Matters
Current activation-based safety approaches suffer from poor precision and lack interpretability. By treating activations as cognitive signatures that can be matched against explicit rules, this work enables more precise, flexible, and explainable safety monitoring - crucial as LLMs are deployed in high-stakes applications.
Key Insight
Activation patterns can be modeled as shareable safety rules similar to cybersecurity threat signatures, enabling collaborative and interpretable safety monitoring across deployments.
Abstract
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as ''making a threat'' and ''payment processing'', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.
cs.LGcs.CL
Uses neural networks to predict neural scaling laws, revealing that individual task performance follows diverse patterns obscured by aggregate metrics like validation loss.
Why This Matters
Understanding how specific capabilities emerge with scale is crucial for efficient AI development. This work shows that simple power-law predictions fail for individual tasks, and proposes learned predictors that can forecast which capabilities will improve, plateau, or degrade.
Validation perplexity is a poor proxy for downstream task performance; practitioners should expect diverse scaling behaviors across tasks and consider task-specific predictions when planning compute allocation.
Neural Neural Scaling Laws
cs.LG | cs.CL
Authors: Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho
Published: 2026-01-27
Why This Matters
Understanding how specific capabilities emerge with scale is crucial for efficient AI development. This work shows that simple power-law predictions fail for individual tasks, and proposes learned predictors that can forecast which capabilities will improve, plateau, or degrade.
Key Insight
Validation perplexity is a poor proxy for downstream task performance; practitioners should expect diverse scaling behaviors across tasks and consider task-specific predictions when planning compute allocation.
Abstract
Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
cs.AI
Demonstrates that visual generation capabilities unlock human-like multimodal reasoning in AI systems, enabling manipulation of internal world models through imagery.
Why This Matters
While chain-of-thought reasoning has achieved expert performance in text-based domains, visual reasoning remains weak. This work suggests that the ability to generate and manipulate visual representations is key to bridging this gap, mirroring how humans reason spatially and visually.
Integrating visual generation into reasoning pipelines may be essential for AI systems to match human performance on tasks requiring spatial, physical, or visual understanding.
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
cs.AI
Authors: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang et al.
Published: 2026-01-27
Why This Matters
While chain-of-thought reasoning has achieved expert performance in text-based domains, visual reasoning remains weak. This work suggests that the ability to generate and manipulate visual representations is key to bridging this gap, mirroring how humans reason spatially and visually.
Key Insight
Integrating visual generation into reasoning pipelines may be essential for AI systems to match human performance on tasks requiring spatial, physical, or visual understanding.
Abstract
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessi...
cs.LGcs.CL
Rehabilitates Post-LayerNorm for deep transformers by identifying and fixing its central failure mode, enabling stable training at extreme depths with superior expressivity.
Why This Matters
As LLM scaling via width and context length hits diminishing returns, depth scaling becomes crucial. This work reopens a promising direction that was abandoned due to training instability, potentially unlocking new scaling frontiers.
The instability of Post-LN at scale stems from specific failure modes that can be addressed, making depth scaling a viable alternative to width scaling for improving model capabilities.
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
cs.LG | cs.CL
Authors: Chen Chen, Lai Wei
Published: 2026-01-27
Why This Matters
As LLM scaling via width and context length hits diminishing returns, depth scaling becomes crucial. This work reopens a promising direction that was abandoned due to training instability, potentially unlocking new scaling frontiers.
Key Insight
The instability of Post-LN at scale stems from specific failure modes that can be addressed, making depth scaling a viable alternative to width scaling for improving model capabilities.
Abstract
Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
cs.LG
Self-Distillation Fine-Tuning (SDFT) enables continual learning from demonstrations without forgetting, by having models learn from their own on-policy generations rather than off-policy expert data.
Why This Matters
This addresses a fundamental limitation in foundation model training - the ability to learn new skills without degrading existing ones. Unlike RL-based approaches, SDFT doesn't require explicit reward functions, making it practical for real-world deployment where rewards are hard to define.
Self-distillation during fine-tuning can serve as a simple yet effective regularizer against catastrophic forgetting, potentially replacing complex replay buffers or architecture modifications.
Self-Distillation Enables Continual Learning
cs.LG
Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
Published: 2026-01-27
Why This Matters
This addresses a fundamental limitation in foundation model training - the ability to learn new skills without degrading existing ones. Unlike RL-based approaches, SDFT doesn't require explicit reward functions, making it practical for real-world deployment where rewards are hard to define.
Key Insight
Self-distillation during fine-tuning can serve as a simple yet effective regularizer against catastrophic forgetting, potentially replacing complex replay buffers or architecture modifications.
Abstract
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
cs.LGcs.AIcs.CVcs.NE
SMART enables mesh-free aerodynamic simulations directly from raw 3D geometries using transformers, eliminating costly mesh generation
Why This Matters
Generating simulation meshes for new geometries is a major bottleneck in engineering workflows. By achieving comparable accuracy without requiring mesh generation, this could dramatically accelerate design iteration cycles for cars, aircraft, and other complex geometries.
Transformer architectures can learn to directly map raw geometry to physical simulation outputs, potentially replacing expensive mesh-dependent pipelines in engineering CAD workflows.
SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model
cs.LG | cs.AI | cs.CV | cs.NE
Authors: Jan Hagnberger, Mathias Niepert
Published: 2026-01-26
Why This Matters
Generating simulation meshes for new geometries is a major bottleneck in engineering workflows. By achieving comparable accuracy without requiring mesh generation, this could dramatically accelerate design iteration cycles for cars, aircraft, and other complex geometries.
Key Insight
Transformer architectures can learn to directly map raw geometry to physical simulation outputs, potentially replacing expensive mesh-dependent pipelines in engineering CAD workflows.
Abstract
Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.
cs.AIcs.LG
TSRBench provides the first comprehensive benchmark for testing LLM reasoning capabilities on time series data across multiple modalities and task types
Why This Matters
Time series reasoning is ubiquitous (energy, traffic, finance) yet absent from existing generalist model benchmarks. This fills a critical gap in understanding whether foundation models can actually reason about temporal patterns, not just process them.
Generalist models claiming broad reasoning capabilities should be tested on time series tasks - this benchmark reveals whether temporal reasoning is a genuine capability or a gap in current systems.
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
cs.AI | cs.LG
Authors: Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao et al.
Published: 2026-01-26
Why This Matters
Time series reasoning is ubiquitous (energy, traffic, finance) yet absent from existing generalist model benchmarks. This fills a critical gap in understanding whether foundation models can actually reason about temporal patterns, not just process them.
Key Insight
Generalist models claiming broad reasoning capabilities should be tested on time series tasks - this benchmark reveals whether temporal reasoning is a genuine capability or a gap in current systems.
Abstract
Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, and is categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making. ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual represenations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers...
cs.LGcs.AI
HalluGuard introduces a unified framework distinguishing data-driven vs reasoning-driven hallucinations with a Hallucination Risk Boundary theory
Why This Matters
Most hallucination detection methods address only one failure mode. By providing both theoretical grounding (risk boundaries) and practical detection across both hallucination types, this enables more robust deployment in high-stakes domains like healthcare and law.
Hallucinations have fundamentally different causes (training data gaps vs flawed reasoning chains) requiring different detection and mitigation strategies.
HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
cs.LG | cs.AI
Authors: Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi et al.
Published: 2026-01-26
Why This Matters
Most hallucination detection methods address only one failure mode. By providing both theoretical grounding (risk boundaries) and practical detection across both hallucination types, this enables more robust deployment in high-stakes domains like healthcare and law.
Key Insight
Hallucinations have fundamentally different causes (training data gaps vs flawed reasoning chains) requiring different detection and mitigation strategies.
Abstract
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
cs.CL
MortalMATH benchmark reveals that reasoning-optimized LLMs exhibit dangerous 'tunnel vision' - solving math problems while ignoring described life-threatening emergencies
Why This Matters
This exposes a critical safety gap in current reasoning models: the optimization for task completion can override basic safety awareness. Finding that generalist models balance both while reasoning specialists ignore emergencies has major deployment implications.
Deep reasoning optimization may come at the cost of contextual awareness - teams deploying reasoning models should evaluate whether their models can recognize when to break from task focus.
MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts
cs.CL
Authors: Etienne Lanzeray, Stephane Meilliez, Malo Ruelle, Damien Sileo
Published: 2026-01-26
Why This Matters
This exposes a critical safety gap in current reasoning models: the optimization for task completion can override basic safety awareness. Finding that generalist models balance both while reasoning specialists ignore emergencies has major deployment implications.
Key Insight
Deep reasoning optimization may come at the cost of contextual awareness - teams deploying reasoning models should evaluate whether their models can recognize when to break from task focus.
Abstract
Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.
cs.LGcs.CL
SOAR enables LLMs to generate their own curriculum to escape learning plateaus on problems they initially cannot solve
Why This Matters
This addresses a fundamental limitation of RL for reasoning models - when initial success rates are too low, there's no training signal. The meta-RL approach of having a teacher model generate pedagogical problems unlocks learning on previously intractable tasks.
Models can leverage latent knowledge to bootstrap their own learning through self-generated curricula, potentially enabling training on harder reasoning problems without requiring larger datasets.
Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
cs.LG | cs.CL
Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier et al.
Published: 2026-01-26
Why This Matters
This addresses a fundamental limitation of RL for reasoning models - when initial success rates are too low, there's no training signal. The meta-RL approach of having a teacher model generate pedagogical problems unlocks learning on previously intractable tasks.
Key Insight
Models can leverage latent knowledge to bootstrap their own learning through self-generated curricula, potentially enabling training on harder reasoning problems without requiring larger datasets.
Abstract
Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to ac...
cs.CVcs.AI
LoL scales video generation to hour-long coherent videos by solving the 'sink-collapse' problem where autoregressive models repeatedly revert to anchor frames.
Why This Matters
Long-form video generation has been stuck at minutes due to error accumulation. Identifying and solving sink-collapse enables a 60x+ increase in generation length, opening practical applications in film and content creation.
Attention sink frames, while helpful for short-term coherence, cause catastrophic cyclic patterns in long generation - the solution requires explicit mechanisms to prevent content regression to sink frames.
LoL: Longer than Longer, Scaling Video Generation to Hour
cs.CV | cs.AI
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li et al.
Published: 2026-01-23
Why This Matters
Long-form video generation has been stuck at minutes due to error accumulation. Identifying and solving sink-collapse enables a 60x+ increase in generation length, opening practical applications in film and content creation.
Key Insight
Attention sink frames, while helpful for short-term coherence, cause catastrophic cyclic patterns in long generation - the solution requires explicit mechanisms to prevent content regression to sink frames.
Abstract
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
cs.AIcs.CL
Shows that reasoning-oriented LLMs trained with reinforcement learning achieve more robust Theory of Mind performance than standard models.
Why This Matters
As LLMs are deployed in social contexts, understanding whether they truly model mental states or exploit surface patterns is crucial. The finding that explicit reasoning improves robustness suggests a path toward more reliable social AI.
RLVR-trained reasoning models maintain ToM performance under adversarial conditions where standard LLMs fail, indicating reasoning chains provide genuine robustness rather than just benchmark gaming.
Reasoning Promotes Robustness in Theory of Mind Tasks
cs.AI | cs.CL
Authors: Ian B. de Haan, Peter van der Putten, Max van Duijn
Published: 2026-01-23
Why This Matters
As LLMs are deployed in social contexts, understanding whether they truly model mental states or exploit surface patterns is crucial. The finding that explicit reasoning improves robustness suggests a path toward more reliable social AI.
Key Insight
RLVR-trained reasoning models maintain ToM performance under adversarial conditions where standard LLMs fail, indicating reasoning chains provide genuine robustness rather than just benchmark gaming.
Abstract
Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.
cs.LGcs.AI
GRIP enables machine unlearning for Mixture-of-Experts models by preventing routers from simply redirecting queries instead of actually erasing knowledge.
Why This Matters
As MoE architectures become standard for large models (Mixtral, GPT-4), the discovery that traditional unlearning methods exploit routing rather than truly forgetting is a critical safety finding with immediate practical implications.
MoE unlearning requires geometric constraints on routers to prevent the 'routing escape' vulnerability where models appear to forget but actually just avoid activating knowledgeable experts.
GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints
cs.LG | cs.AI
Authors: Andy Zhu, Rongzhe Wei, Yupu Gu, Pan Li
Published: 2026-01-23
Why This Matters
As MoE architectures become standard for large models (Mixtral, GPT-4), the discovery that traditional unlearning methods exploit routing rather than truly forgetting is a critical safety finding with immediate practical implications.
Key Insight
MoE unlearning requires geometric constraints on routers to prevent the 'routing escape' vulnerability where models appear to forget but actually just avoid activating knowledgeable experts.
Abstract
Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts (MoE) architectures. We identify that traditional unlearning methods exploit MoE's architectural vulnerability: they manipulate routers to redirect queries away from knowledgeable experts rather than erasing knowledge, causing a loss of model utility and superficial forgetting. We propose Geometric Routing Invariance Preservation (GRIP), an algorithm-agnostic framework for unlearning for MoE. Our core contribution is a geometric constraint, implemented by projecting router gradient updates into an expert-specific null-space. Crucially, this decouples routing stability from parameter rigidity: while discrete expert selections remain stable for retained knowledge, the continuous router parameters remain plastic within the null space, allowing the model to undergo necessary internal reconfiguration to satisfy unlearning objectives. This forces the unlearning optimization to erase knowledge directly from expert parameters rather than exploiting the superficial router manipulation shortcut. GRIP functions as an adapter, constraining router parameter updates without modifying the underlying unlearning algorithm. Extensive experiments on large-scale MoE models demonstrate that our adapter eliminates expert selection shift (achieving over 95% routing stability) across all tested unlearning methods while preserving their utility. By prevent...
cs.LGcond-mat.dis-nncs.AIstat.ML
Introduces a scalable method to measure loss landscape curvature in LLMs without computing the full Hessian, enabling analysis of training dynamics at scale.
Why This Matters
Understanding curvature evolution is fundamental to training stability but has been computationally prohibitive for modern LLMs. This opens a window into understanding why certain learning rate schedules and optimizers work.
The proposed curvature measure reveals interactions between learning rate and sharpness throughout training that were previously unmeasurable at LLM scale.
A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs
cs.LG | cond-mat.dis-nn | cs.AI | stat.ML
Authors: Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu et al.
Published: 2026-01-23
Why This Matters
Understanding curvature evolution is fundamental to training stability but has been computationally prohibitive for modern LLMs. This opens a window into understanding why certain learning rate schedules and optimizers work.
Key Insight
The proposed curvature measure reveals interactions between learning rate and sharpness throughout training that were previously unmeasurable at LLM scale.
Abstract
Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($λ_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($λ_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $Δ\mathbfθ$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($λ_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for l...
cs.LG
ARMD unifies autoregressive and masked diffusion models, achieving competitive language modeling performance with parallel generation capabilities.
Why This Matters
This bridges a fundamental gap between two dominant generative paradigms - ARMs excel at quality while MDMs enable parallel generation. The unified architecture could reshape how we think about efficient text generation.
Reframing masked diffusion through an autoregressive lens allows training efficiency of ARMs while preserving parallel decoding, suggesting hybrid approaches may outperform pure paradigms.
Auto-Regressive Masked Diffusion Models
cs.LG
Authors: Mahdi Karami, Ali Ghodsi
Published: 2026-01-23
Why This Matters
This bridges a fundamental gap between two dominant generative paradigms - ARMs excel at quality while MDMs enable parallel generation. The unified architecture could reshape how we think about efficient text generation.
Key Insight
Reframing masked diffusion through an autoregressive lens allows training efficiency of ARMs while preserving parallel decoding, suggesting hybrid approaches may outperform pure paradigms.
Abstract
Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and s...
cs.AI
Combines vision-language models with external knowledge retrieval to detect climate disinformation in images and videos, overcoming VLM knowledge cutoff limitations.
Why This Matters
As multimodal disinformation becomes more sophisticated, this addresses a real blind spot in VLMs—their inability to reason about events after their training cutoff—which is particularly important for fast-evolving topics like climate science.
Retrieval-augmented VLMs can significantly improve disinformation detection by grounding model reasoning in current, verified external knowledge sources.
Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources
cs.AI
Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati
Published: 2026-01-22
Why This Matters
As multimodal disinformation becomes more sophisticated, this addresses a real blind spot in VLMs—their inability to reason about events after their training cutoff—which is particularly important for fast-evolving topics like climate science.
Key Insight
Retrieval-augmented VLMs can significantly improve disinformation detection by grounding model reasoning in current, verified external knowledge sources.
Abstract
Climate disinformation has become a major challenge in today digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay actions on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
cs.ROcs.AI
TeNet uses a hypernetwork conditioned on LLM text embeddings to generate compact, task-specific robot policies directly from natural language instructions.
Why This Matters
This elegantly sidesteps the deployment problem of large end-to-end models by generating small executable policies on-the-fly, making real-time robot control from language practical on commodity hardware.
Instead of running large models at inference time, you can use them to generate small, specialized policies that execute efficiently on robots.
TeNet: Text-to-Network for Compact Policy Synthesis
cs.RO | cs.AI
Authors: Ariyan Bighashdel, Kevin Sebastian Luck
Published: 2026-01-22
Why This Matters
This elegantly sidesteps the deployment problem of large end-to-end models by generating small executable policies on-the-fly, making real-time robot control from language practical on commodity hardware.
Key Insight
Instead of running large models at inference time, you can use them to generate small, specialized policies that execute efficiently on robots.
Abstract
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at the policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for ressource-constrained robot control tasks with real-time requirements.
cs.CV
Uses a hypernetwork to efficiently align diffusion models with human preferences at test time, avoiding the diversity loss of fine-tuning and the compute cost of test-time scaling.
Why This Matters
This addresses a critical practical problem—aligning image generation with user intent—while avoiding the major pitfalls of current approaches: reward hacking from fine-tuning and slow inference from test-time optimization.
Hypernetwork-based alignment can provide a middle ground between expensive retraining and slow test-time scaling for diffusion models.
HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models
cs.CV
Authors: Xin Xie, Jiaxian Guo, Dong Gong
Published: 2026-01-22
Why This Matters
This addresses a critical practical problem—aligning image generation with user intent—while avoiding the major pitfalls of current approaches: reward hacking from fine-tuning and slow inference from test-time optimization.
Key Insight
Hypernetwork-based alignment can provide a middle ground between expensive retraining and slow test-time scaling for diffusion models.
Abstract
Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
cs.AI
Shows that feeding the entire Return-to-Go sequence into Decision Transformers is redundant—only the most recent RTG affects action prediction.
Why This Matters
This identifies a fundamental inefficiency in the popular Decision Transformer architecture that many practitioners have overlooked, enabling significant computational savings without sacrificing performance in offline RL.
When using Decision Transformers, you can decouple RTG from the sequence modeling to reduce computation while maintaining the same action prediction quality.
Decoupling Return-to-Go for Efficient Decision Transformer
cs.AI
Authors: Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li et al.
Published: 2026-01-22
Why This Matters
This identifies a fundamental inefficiency in the popular Decision Transformer architecture that many practitioners have overlooked, enabling significant computational savings without sacrificing performance in offline RL.
Key Insight
When using Decision Transformers, you can decouple RTG from the sequence modeling to reduce computation while maintaining the same action prediction quality.
Abstract
The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
cs.CV
Introduces a masked modeling approach for human motion reconstruction that handles occlusions without slow diffusion or optimization-based methods.
Why This Matters
This bridges the gap between fast but fragile regression methods and robust but slow optimization/diffusion approaches for motion capture, which is critical for real-world AR/VR and robotics applications where occlusions are common.
Practitioners can achieve robust motion reconstruction under occlusion using efficient masked modeling rather than expensive diffusion-based approaches.
Masked Modeling for Human Motion Recovery Under Occlusions
cs.CV
Authors: Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang
Published: 2026-01-22
Why This Matters
This bridges the gap between fast but fragile regression methods and robust but slow optimization/diffusion approaches for motion capture, which is critical for real-world AR/VR and robotics applications where occlusions are common.
Key Insight
Practitioners can achieve robust motion reconstruction under occlusion using efficient masked modeling rather than expensive diffusion-based approaches.
Abstract
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings.Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recover human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments o...
cs.AIastro-ph.IM
MarScope enables natural language-driven mapping of Martian landforms by aligning planetary images with text in a shared semantic space, trained on 200,000+ curated image-text pairs.
Why This Matters
This transforms how scientists can explore planetary surfaces - instead of pixel-level analysis, researchers can query vast orbital image archives using natural language descriptions, enabling open-ended discovery at planetary scale.
Vision-language models can be successfully adapted to scientific domains like planetary science by curating domain-specific image-text pairs, enabling semantic search over imagery that was previously only accessible through manual inspection.
Natural Language-Driven Global Mapping of Martian Landforms
cs.AI | astro-ph.IM
Authors: Yiran Wang, Shuoyuan Wang, Zhaoran Wei, Jiannan Zhao, Zhonghua Yao et al.
Published: 2026-01-22
Why This Matters
This transforms how scientists can explore planetary surfaces - instead of pixel-level analysis, researchers can query vast orbital image archives using natural language descriptions, enabling open-ended discovery at planetary scale.
Key Insight
Vision-language models can be successfully adapted to scientific domains like planetary science by curating domain-specific image-text pairs, enabling semantic search over imagery that was previously only accessible through manual inspection.
Abstract
Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
cs.ROcs.AIcs.LG
PUMA enables quadruped robots to perform parkour by learning perception-driven foothold priors that guide agile locomotion over obstacles.
Why This Matters
This bridges the gap between human-like perceptual reasoning about terrain and robotic locomotion, moving beyond pre-computed footholds to real-time adaptive foothold selection - a key step toward truly agile legged robots.
Integrating learned foothold priors from visual perception directly into the reinforcement learning policy allows robots to dynamically adapt their gait to complex terrain without hierarchical controllers.
PUMA: Perception-driven Unified Foothold Prior for Mobility Augmented Quadruped Parkour
cs.RO | cs.AI | cs.LG
Authors: Liang Wang, Kanzhong Yao, Yang Liu, Weikai Qin, Jun Wu et al.
Published: 2026-01-22
Why This Matters
This bridges the gap between human-like perceptual reasoning about terrain and robotic locomotion, moving beyond pre-computed footholds to real-time adaptive foothold selection - a key step toward truly agile legged robots.
Key Insight
Integrating learned foothold priors from visual perception directly into the reinforcement learning policy allows robots to dynamically adapt their gait to complex terrain without hierarchical controllers.
Abstract
Parkour tasks for quadrupeds have emerged as a promising benchmark for agile locomotion. While human athletes can effectively perceive environmental characteristics to select appropriate footholds for obstacle traversal, endowing legged robots with similar perceptual reasoning remains a significant challenge. Existing methods often rely on hierarchical controllers that follow pre-computed footholds, thereby constraining the robot's real-time adaptability and the exploratory potential of reinforcement learning. To overcome these challenges, we present PUMA, an end-to-end learning framework that integrates visual perception and foothold priors into a single-stage training process. This approach leverages terrain features to estimate egocentric polar foothold priors, composed of relative distance and heading, guiding the robot in active posture adaptation for parkour tasks. Extensive experiments conducted in simulation and real-world environments across various discrete complex terrains, demonstrate PUMA's exceptional agility and robustness in challenging scenarios.
cs.PFcs.AIcs.LGcs.OS
Introduces Sawtooth Wavefront Reordering, a technique that reduces L2 cache misses in FlashAttention implementations on NVIDIA GB10 by over 50%.
Why This Matters
With attention being the computational bottleneck in LLMs, a 50% reduction in cache misses on the latest NVIDIA hardware directly translates to faster and more efficient inference, making this immediately applicable to production systems.
Reordering the wavefront pattern of tile processing in attention kernels can dramatically improve memory locality and cache utilization on modern GPU architectures.
Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
cs.PF | cs.AI | cs.LG | cs.OS
Authors: Yifan Zhu, Yekai Pan, Chen Ding
Published: 2026-01-22
Why This Matters
With attention being the computational bottleneck in LLMs, a 50% reduction in cache misses on the latest NVIDIA hardware directly translates to faster and more efficient inference, making this immediately applicable to production systems.
Key Insight
Reordering the wavefront pattern of tile processing in attention kernels can dramatically improve memory locality and cache utilization on modern GPU architectures.
Abstract
High-performance attention kernels are essential for Large Language Models. This paper presents analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache miss. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing 50% or greater reduction in L2 misses and up to 60% increase in throughput on GB10.
cs.CVcs.AI
Identifies and mitigates object-driven verb shortcuts in zero-shot compositional action recognition, where models incorrectly rely on objects rather than actions to make predictions.
Why This Matters
This exposes a fundamental failure mode in video understanding models that has been overlooked - models take shortcuts by recognizing objects instead of understanding actions, which undermines compositional generalization to unseen verb-object combinations.
When training action recognition models, the asymmetric learning difficulty between verbs and objects combined with sparse compositional supervision leads models to ignore verbs entirely.
Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
cs.CV | cs.AI
Authors: Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee et al.
Published: 2026-01-22
Why This Matters
This exposes a fundamental failure mode in video understanding models that has been overlooked - models take shortcuts by recognizing objects instead of understanding actions, which undermines compositional generalization to unseen verb-object combinations.
Key Insight
When training action recognition models, the asymmetric learning difficulty between verbs and objects combined with sparse compositional supervision leads models to ignore verbs entirely.
Abstract
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is esse...
cs.CV
CamPilot improves camera control in video diffusion models by introducing a specialized reward model for video-camera alignment and efficient reward feedback learning.
Why This Matters
Camera controllability is a major limitation in current video generation models, and this work addresses it with a novel reward-based approach that could significantly improve the quality of AI-generated videos for filmmaking and content creation.
Reward feedback learning can be adapted for video generation by building task-specific reward models that assess alignment between intended camera movements and generated video.
CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback
cs.CV
Authors: Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu et al.
Published: 2026-01-22
Why This Matters
Camera controllability is a major limitation in current video generation models, and this work addresses it with a novel reward-based approach that could significantly improve the quality of AI-generated videos for filmmaking and content creation.
Key Insight
Reward feedback learning can be adapted for video generation by building task-specific reward models that assess alignment between intended camera movements and generated video.
Abstract
Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effect...
cs.CVcs.AI
PhysicsMind benchmarks how well foundation VLMs and video world models understand physical mechanics through both simulated and real-world scenarios.
Why This Matters
While MLLMs excel at many reasoning tasks, their grasp of physics is underexplored. Existing benchmarks use synthetic VQA or focus on perceptual quality rather than physical law adherence—this provides a rigorous test of physical understanding that's crucial for embodied AI.
Current foundation models show significant gaps between visual/mathematical reasoning abilities and understanding of physical mechanics, highlighting a key area for improvement.
PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
cs.CV | cs.AI
Authors: Chak-Wing Mak, Guanyu Zhu, Boyi Zhang, Hongji Li, Xiaowei Chi et al.
Published: 2026-01-22
Why This Matters
While MLLMs excel at many reasoning tasks, their grasp of physics is underexplored. Existing benchmarks use synthetic VQA or focus on perceptual quality rather than physical law adherence—this provides a rigorous test of physical understanding that's crucial for embodied AI.
Key Insight
Current foundation models show significant gaps between visual/mathematical reasoning abilities and understanding of physical mechanics, highlighting a key area for improvement.
Abstract
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation(VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
cs.CVcs.RO
DTP identifies and prunes 'distracting tokens' in Vision-Language Action models that cause robots to attend to task-irrelevant image regions during manipulation.
Why This Matters
VLA models for robotics inherit attention patterns from general VLMs that aren't optimized for action generation. This simple pruning framework improves manipulation success rates by focusing attention on task-relevant regions.
Robot manipulation performance improves when VLA models are explicitly guided to ignore visually salient but task-irrelevant image tokens during action prediction.
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
cs.CV | cs.RO
Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan et al.
Published: 2026-01-22
Why This Matters
VLA models for robotics inherit attention patterns from general VLMs that aren't optimized for action generation. This simple pruning framework improves manipulation success rates by focusing attention on task-relevant regions.
Key Insight
Robot manipulation performance improves when VLA models are explicitly guided to ignore visually salient but task-irrelevant image tokens during action prediction.
Abstract
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate, as well as exploring the performance upper boundaries of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieving relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attentions in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
eess.IVcs.AI
THOR is a compute-adaptive Earth observation foundation model that unifies heterogeneous Sentinel satellite data while allowing flexible accuracy-compute tradeoffs at deployment.
Why This Matters
Current EO foundation models are architecturally rigid and struggle with multi-sensor heterogeneity. THOR's ability to process native resolutions from Sentinel-1/2/3 and adapt computation at inference makes it practically deployable for real-world climate monitoring.
Foundation models for remote sensing need native multi-sensor support and compute adaptivity to be useful in operational settings with varying resource constraints.
THOR: A Versatile Foundation Model for Earth Observation Climate and Society Applications
eess.IV | cs.AI
Authors: Theodor Forgaard, Jarle H. Reksten, Anders U. Waldeland, Valerio Marsocci, Nicolas Longépé et al.
Published: 2026-01-22
Why This Matters
Current EO foundation models are architecturally rigid and struggle with multi-sensor heterogeneity. THOR's ability to process native resolutions from Sentinel-1/2/3 and adapt computation at inference makes it practically deployable for real-world climate monitoring.
Key Insight
Foundation models for remote sensing need native multi-sensor support and compute adaptivity to be useful in operational settings with varying resource constraints.
Abstract
Current Earth observation foundation models are architecturally rigid, struggle with heterogeneous sensors and are constrained to fixed patch sizes. This limits their deployment in real-world scenarios requiring flexible computeaccuracy trade-offs. We propose THOR, a "computeadaptive" foundation model that solves both input heterogeneity and deployment rigidity. THOR is the first architecture to unify data from Copernicus Sentinel-1, -2, and -3 (OLCI & SLSTR) satellites, processing their native 10 m to 1000 m resolutions in a single model. We pre-train THOR with a novel randomized patch and input image size strategy. This allows a single set of pre-trained weights to be deployed at inference with any patch size, enabling a dynamic trade-off between computational cost and feature resolution without retraining. We pre-train THOR on THOR Pretrain, a new, large-scale multi-sensor dataset and demonstrate state-of-the-art performance on downstream benchmarks, particularly in data-limited regimes like the PANGAEA 10% split, validating that THOR's flexible feature generation excels for diverse climate and society applications.
stat.MLcs.LG
Proposes treating reliability as a first-class property of learned representations themselves, not just prediction outputs, with structural constraints for uncertainty quantification.
Why This Matters
Most uncertainty estimation focuses on final predictions while assuming representations are reliable by default. This challenges that assumption and provides a principled framework for building trustworthy representations—critical for high-stakes ML applications.
Representation-level uncertainty should be explicitly modeled and constrained during training, not treated as an afterthought at prediction time.
Beyond Predictive Uncertainty: Reliable Representation Learning with Structural Constraints
stat.ML | cs.LG
Authors: Yiyao Yang
Published: 2026-01-22
Why This Matters
Most uncertainty estimation focuses on final predictions while assuming representations are reliable by default. This challenges that assumption and provides a principled framework for building trustworthy representations—critical for high-stakes ML applications.
Key Insight
Representation-level uncertainty should be explicitly modeled and constrained during training, not treated as an afterthought at prediction time.
Abstract
Uncertainty estimation in machine learning has traditionally focused on the prediction stage, aiming to quantify confidence in model outputs while treating learned representations as deterministic and reliable by default. In this work, we challenge this implicit assumption and argue that reliability should be regarded as a first-class property of learned representations themselves. We propose a principled framework for reliable representation learning that explicitly models representation-level uncertainty and leverages structural constraints as inductive biases to regularize the space of feasible representations. Our approach introduces uncertainty-aware regularization directly in the representation space, encouraging representations that are not only predictive but also stable, well-calibrated, and robust to noise and structural perturbations. Structural constraints, such as sparsity, relational structure, or feature-group dependencies, are incorporated to define meaningful geometry and reduce spurious variability in learned representations, without assuming fully correct or noise-free structure. Importantly, the proposed framework is independent of specific model architectures and can be integrated with a wide range of representation learning methods.
cs.CVcs.AI
PyraTok introduces a language-aligned pyramidal video tokenizer that learns discrete visual representations across multiple spatiotemporal scales with strong text supervision.
Why This Matters
Current video tokenizers operate at single scales with weak language alignment, limiting zero-shot transfer. This hierarchical approach with deep language supervision could significantly improve text-to-video generation quality and enable better cross-modal understanding.
Multi-scale tokenization with explicit language alignment at each level produces more semantically meaningful video representations than flat single-scale approaches.
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
cs.CV | cs.AI
Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang et al.
Published: 2026-01-22
Why This Matters
Current video tokenizers operate at single scales with weak language alignment, limiting zero-shot transfer. This hierarchical approach with deep language supervision could significantly improve text-to-video generation quality and enable better cross-modal understanding.
Key Insight
Multi-scale tokenization with explicit language alignment at each level produces more semantically meaningful video representations than flat single-scale approaches.
Abstract
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
cs.ROcs.CV
DextER introduces contact-based embodied reasoning for dexterous grasping, having vision-language models explicitly reason about hand-object physical interactions.
Why This Matters
Previous VLA approaches mapped observations directly to grasp parameters without intermediate reasoning - adding explicit contact reasoning significantly improves manipulation success rates on complex multi-finger tasks.
For robotic manipulation, having models explicitly reason about physical contact points before generating actions produces more robust grasps than end-to-end approaches.
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
cs.RO | cs.CV
Authors: Junha Lee, Eunha Park, Minsu Cho
Published: 2026-01-22
Why This Matters
Previous VLA approaches mapped observations directly to grasp parameters without intermediate reasoning - adding explicit contact reasoning significantly improves manipulation success rates on complex multi-finger tasks.
Key Insight
For robotic manipulation, having models explicitly reason about physical contact points before generating actions produces more robust grasps than end-to-end approaches.
Abstract
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
cs.HCcs.AI
Replicates classic human motivated reasoning studies on LLMs and finds that base models don't exhibit the same politically-motivated biases humans show.
Why This Matters
As LLMs are increasingly used to study or simulate human behavior, understanding where their reasoning diverges from humans is crucial - this suggests LLMs may process politically charged information more neutrally than humans.
LLMs should not be assumed to replicate human cognitive biases without empirical validation, especially for motivated reasoning in political contexts.
Replicating Human Motivated Reasoning Studies with LLMs
cs.HC | cs.AI
Authors: Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel Molden et al.
Published: 2026-01-22
Why This Matters
As LLMs are increasingly used to study or simulate human behavior, understanding where their reasoning diverges from humans is crucial - this suggests LLMs may process politically charged information more neutrally than humans.
Key Insight
LLMs should not be assumed to replicate human cognitive biases without empirical validation, especially for motivated reasoning in political contexts.
Abstract
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
cs.CVcs.DC
DSFedMed enables mutual knowledge distillation between large foundation models and lightweight client models in federated medical image segmentation.
Why This Matters
This solves a critical deployment challenge - foundation models are too heavy for edge devices in federated settings, but this framework lets small client models benefit from foundation model knowledge while keeping data private.
Bidirectional distillation between scales in federated learning can achieve better results than either unidirectional distillation or traditional federated averaging.
DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models
cs.CV | cs.DC
Authors: Hanwen Zhang, Qiaojin Shen, Yuxi Liu, Yuesheng Zhu, Guibo Luo
Published: 2026-01-22
Why This Matters
This solves a critical deployment challenge - foundation models are too heavy for edge devices in federated settings, but this framework lets small client models benefit from foundation model knowledge while keeping data private.
Key Insight
Bidirectional distillation between scales in federated learning can achieve better results than either unidirectional distillation or traditional federated averaging.
Abstract
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
physics.chem-phstat.ML
Demonstrates that removing physical constraints from machine-learned interatomic potentials can paradoxically improve both efficiency and accuracy.
Why This Matters
This challenges the conventional wisdom that physics-informed ML models should strictly enforce physical laws - relaxing constraints like energy conservation can sometimes help rather than hurt model performance.
Practitioners working on physics-informed ML should carefully evaluate whether hard constraints actually improve their models or whether soft penalties might achieve better results.
Pushing the limits of unconstrained machine-learned interatomic potentials
physics.chem-ph | stat.ML
Authors: Filippo Bigi, Paolo Pegolo, Arslan Mazitov, Michele Ceriotti
Published: 2026-01-22
Why This Matters
This challenges the conventional wisdom that physics-informed ML models should strictly enforce physical laws - relaxing constraints like energy conservation can sometimes help rather than hurt model performance.
Key Insight
Practitioners working on physics-informed ML should carefully evaluate whether hard constraints actually improve their models or whether soft penalties might achieve better results.
Abstract
Machine-learned interatomic potentials (MLIPs) are increasingly used to replace computationally demanding electronic-structure calculations to model matter at the atomic scale. The most commonly used model architectures are constrained to fulfill a number of physical laws exactly, from geometric symmetries to energy conservation. Evidence is mounting that relaxing some of these constraints can be beneficial to the efficiency and (somewhat surprisingly) accuracy of MLIPs, even though care should be taken to avoid qualitative failures associated with the breaking of physical symmetries. Given the recent trend of \emph{scaling up} models to larger numbers of parameters and training samples, a very important question is how unconstrained MLIPs behave in this limit. Here we investigate this issue, showing that -- when trained on large datasets -- unconstrained models can be superior in accuracy and speed when compared to physically constrained models. We assess these models both in terms of benchmark accuracy and in terms of usability in practical scenarios, focusing on static simulation workflows such as geometry optimization and lattice dynamics. We conclude that accurate unconstrained models can be applied with confidence, especially since simple inference-time modifications can be used to recover observables that are consistent with the relevant physical symmetries.
cs.CV
360Anything lifts perspective images and videos to 360° panoramas without requiring camera calibration metadata, using pre-trained diffusion transformers.
Why This Matters
This eliminates a major bottleneck in immersive content creation - most in-the-wild photos and videos lack reliable camera metadata, making previous geometric alignment approaches impractical at scale.
Geometry-free approaches using diffusion models can achieve robust perspective-to-panorama conversion that generalizes better to uncalibrated real-world content than explicit geometric methods.
360Anything: Geometry-Free Lifting of Images and Videos to 360°
cs.CV
Authors: Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker et al.
Published: 2026-01-22
Why This Matters
This eliminates a major bottleneck in immersive content creation - most in-the-wild photos and videos lack reliable camera metadata, making previous geometric alignment approaches impractical at scale.
Key Insight
Geometry-free approaches using diffusion models can achieve robust perspective-to-panorama conversion that generalizes better to uncalibrated real-world content than explicit geometric methods.
Abstract
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
cs.AIcs.CL
Introduces explicit affective state dynamics to control long-horizon behavior and prevent persona drift in LLM agents during extended interactions.
Why This Matters
Addresses the underexplored problem of temporal coherence in conversational AI, where agents often exhibit abrupt personality shifts—critical for applications requiring consistent long-term engagement.
Imposing dynamical constraints on external state variables can provide temporal structure that pure next-token prediction lacks, enabling more coherent extended interactions.
Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics
cs.AI | cs.CL
Authors: Sukesh Subaharan
Published: 2026-01-22
Why This Matters
Addresses the underexplored problem of temporal coherence in conversational AI, where agents often exhibit abrupt personality shifts—critical for applications requiring consistent long-term engagement.
Key Insight
Imposing dynamical constraints on external state variables can provide temporal structure that pure next-token prediction lacks, enabling more coherent extended interactions.
Abstract
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
cs.AI
Grounds LLMs in reaction knowledge graphs via Text2Cypher generation for reliable chemical synthesis retrieval instead of hallucinated suggestions.
Why This Matters
This tackles the hallucination problem in scientific AI by grounding LLM outputs in verified databases, demonstrating a scalable pattern for knowledge-intensive domains beyond chemistry.
Converting natural language queries to graph database queries provides a principled way to combine LLM reasoning with verified domain knowledge.
Grounding Large Language Models in Reaction Knowledge Graphs for Synthesis Retrieval
cs.AI
Authors: Olga Bunkova, Lorenzo Di Fruscia, Sophia Rupprecht, Artur M. Schweidtmann, Marcel J. T. Reinders et al.
Published: 2026-01-22
Why This Matters
This tackles the hallucination problem in scientific AI by grounding LLM outputs in verified databases, demonstrating a scalable pattern for knowledge-intensive domains beyond chemistry.
Key Insight
Converting natural language queries to graph database queries provides a principled way to combine LLM reasoning with verified domain knowledge.
Abstract
Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at https://github.com/Intelligent-molecular-systems/KG-LLM-Synthesis-Retrieval.
cs.CV
ActionMesh generates production-ready animated 3D meshes in a feed-forward manner by adding a temporal axis to 3D diffusion models.
Why This Matters
Unlike existing methods requiring optimization or long runtimes, this produces immediately usable animated assets, bridging a critical gap between research and production workflows in games and film.
Extending spatial 3D diffusion to include temporal dynamics enables practical one-shot animated mesh generation without post-processing.
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
cs.CV
Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier
Published: 2026-01-22
Why This Matters
Unlike existing methods requiring optimization or long runtimes, this produces immediately usable animated assets, bridging a critical gap between research and production workflows in games and film.
Key Insight
Extending spatial 3D diffusion to include temporal dynamics enables practical one-shot animated mesh generation without post-processing.
Abstract
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model...
cs.LGcs.CV
Introduces Feature-space Smoothing with certified robustness guarantees for multimodal LLMs against adversarial perturbations.
Why This Matters
As MLLMs are deployed in safety-critical applications, this provides the first provable robustness framework for multimodal models, moving beyond empirical defenses to mathematical guarantees.
Certified robustness can be achieved at the feature representation level rather than just output predictions, offering stronger guarantees for multimodal systems.
Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing
cs.LG | cs.CV
Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang
Published: 2026-01-22
Why This Matters
As MLLMs are deployed in safety-critical applications, this provides the first provable robustness framework for multimodal models, moving beyond empirical defenses to mathematical guarantees.
Key Insight
Certified robustness can be achieved at the feature representation level rather than just output predictions, offering stronger guarantees for multimodal systems.
Abstract
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
cs.LGcs.AI
Proposes counterfactual training that uses counterfactual explanations during training to make ML models inherently more explainable with plausible, actionable explanations.
Why This Matters
This flips the XAI paradigm from post-hoc explanation to built-in explainability, addressing a fundamental limitation of current interpretability methods that often produce unrealistic or unhelpful counterfactuals.
Rather than explaining black-box models after training, you can train models to be explainable from the start by incorporating counterfactual constraints into the learning objective.
Counterfactual Training: Teaching Models Plausible and Actionable Explanations
cs.LG | cs.AI
Authors: Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen, Cynthia C. S. Liem
Published: 2026-01-22
Why This Matters
This flips the XAI paradigm from post-hoc explanation to built-in explainability, addressing a fundamental limitation of current interpretability methods that often produce unrealistic or unhelpful counterfactuals.
Key Insight
Rather than explaining black-box models after training, you can train models to be explainable from the start by incorporating counterfactual constraints into the learning objective.
Abstract
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
stat.MLcs.LGstat.ME
Provides a unified statistical framework answering when synthetic augmentation helps imbalanced classification and how many samples to generate.
Why This Matters
Addresses two long-standing practical questions about minority oversampling with theoretical grounding—crucial for practitioners who currently rely on heuristics for synthetic data generation.
Synthetic augmentation effectiveness depends on how well the generative model captures minority class structure; the optimal amount is theoretically characterizable.
Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add
stat.ML | cs.LG | stat.ME
Authors: Zhengchi Ma, Anru R. Zhang
Published: 2026-01-22
Why This Matters
Addresses two long-standing practical questions about minority oversampling with theoretical grounding—crucial for practitioners who currently rely on heuristics for synthetic data generation.
Key Insight
Synthetic augmentation effectiveness depends on how well the generative model captures minority class structure; the optimal amount is theoretically characterizable.
Abstract
Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced v...
cs.CV
Representation Autoencoders scale diffusion models to text-to-image generation by training in semantic latent spaces of frozen vision encoders.
Why This Matters
Provides evidence that high-dimensional semantic spaces (vs pixel/VAE latent) offer distinct advantages for diffusion, with practical insights on data composition for text rendering and general fidelity.
Diffusion in semantic representation space (like SigLIP embeddings) may be a viable alternative to VAE latent spaces for text-to-image.
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
cs.CV
Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma et al.
Published: 2026-01-22
Why This Matters
Provides evidence that high-dimensional semantic spaces (vs pixel/VAE latent) offer distinct advantages for diffusion, with practical insights on data composition for text rendering and general fidelity.
Key Insight
Diffusion in semantic representation space (like SigLIP embeddings) may be a viable alternative to VAE latent spaces for text-to-image.
Abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronge...
cs.NEcs.CV
Neural Particle Automata generalizes Neural Cellular Automata from fixed grids to continuous particle systems with learnable dynamics.
Why This Matters
Elegant extension of the NCA paradigm to Lagrangian particle systems enables heterogeneous dynamics and concentrates computation on active regions—opens new directions for self-organizing generative models.
Moving from Eulerian (grid-fixed) to Lagrangian (particle-based) neural automata enables more natural modeling of dynamic, sparse phenomena.
Neural Particle Automata: Learning Self-Organizing Particle Dynamics
cs.NE | cs.CV
Authors: Hyunsoo Kim, Ehsan Pajouheshgar, Sabine Süsstrunk, Wenzel Jakob, Jinah Park
Published: 2026-01-22
Why This Matters
Elegant extension of the NCA paradigm to Lagrangian particle systems enables heterogeneous dynamics and concentrates computation on active regions—opens new directions for self-organizing generative models.
Key Insight
Moving from Eulerian (grid-fixed) to Lagrangian (particle-based) neural automata enables more natural modeling of dynamic, sparse phenomena.
Abstract
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scale quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
cs.CL
Refusal behavior in aligned LLMs stems from universal low-dimensional circuits that can be transferred across architectures without target model supervision.
Why This Matters
Challenges the assumption that safety mechanisms are model-specific, revealing shared semantic structures across diverse LLMs including Dense-to-MoE transfers—important for understanding alignment transferability.
Safety behaviors may be more portable than expected; alignment work on one model family could transfer to others via concept-basis alignment.
Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
cs.CL
Authors: Tony Cristofano
Published: 2026-01-22
Why This Matters
Challenges the assumption that safety mechanisms are model-specific, revealing shared semantic structures across diverse LLMs including Dense-to-MoE transfers—important for understanding alignment transferability.
Key Insight
Safety behaviors may be more portable than expected; alignment work on one model family could transfer to others via concept-basis alignment.
Abstract
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared ``recipe'' of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
cs.AI
Simple tactic skeletons as prompts boost neural theorem provers by 43% relative improvement on miniF2F without retraining.
Why This Matters
Demonstrates that even highly-trained RL models like DeepSeek-Prover benefit substantially from lightweight structural guidance at inference time, suggesting we may be underutilizing simple interventions.
Before scaling up training, try cheap inference-time interventions like fixed prompt schedules over common patterns.
Structured Hints for Sample-Efficient Lean Theorem Proving
cs.AI
Authors: Zachary Burton
Published: 2026-01-22
Why This Matters
Demonstrates that even highly-trained RL models like DeepSeek-Prover benefit substantially from lightweight structural guidance at inference time, suggesting we may be underutilizing simple interventions.
Key Insight
Before scaling up training, try cheap inference-time interventions like fixed prompt schedules over common patterns.
Abstract
State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.