• raseliarison
  • nirinA
  • adrien
  • blog
  • code
  • FAQ
  •  home  
  •  news  
    • arXiv
      • astro-ph
      • cond-mat
      • cs
      • eess
      • gr-qc
      • hep-ex
      • hep-lat
      • hep-ph
      • hep-th
      • math
      • math-ph
      • nlin
      • nucl-ex
      • nucl-th
      • physics
      • q-bio
      • quant-ph
      • stat
    • physics
      • phys.org
      • physics world
    • linux
      • kernel
      • slackware
    • nature
      • natcomputsci
      • natastron
      • natbiomedeng
      • nenergy
      • nnano
      • natmachintell
      • nbt
      • nmeth
      • natecolevol
      • nmicrobiol
      • ng
      • nchembio
      • natelectron
      • micronano
      • nphoton
    • bioRxiv
    • plos one
    • world
      • BBC
      • Al Jazeera
    • earth
      • earth observatory
      • weather
      • weather forecast
    • universe
      • apod
      • hubble
      • atel
      • nasa
  •  wiki  
  •  gemini  
  • cs updates on arXiv.org

    cs updates on the arXiv.org e-print archive.

    Benchmarking Tesla's Traffic Light and Stop Sign Control: Field Dataset and Behavior Insights

    oai:arXiv.org:2512.11802v1

    arXiv:2512.11802v1 Announce Type: new Abstract: Understanding how Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction has received little focused empirical study. This paper presents a field dataset and behavioral analysis of Tesla's Traffic Light and Stop Sign Control (TLSSC), a mature ADAS that perceives traffic lights and stop signs. We design and execute experiments across varied speed limits and TCD types, collecting synchronized high-resolution vehicle trajectory data and driver-perspective video. From these data, we develop a taxonomy of TLSSC-TCD interaction behaviors (i.e., stopping, accelerating, and car following) and calibrate the Full Velocity Difference Model (FVDM) to quantitatively characterize each behavior mode. A novel empirical insight is the identification of a car-following threshold (~90 m). Calibration results reveal that stopping behavior is driven by strong responsiveness to both desired speed deviation and relative speed, whereas accelerating behavior is more conservative. Intersection car-following behavior exhibits smoother dynamics and tighter headways compared to standard car-following behaviors. The established dataset, behavior definitions, and model characterizations together provide a foundation for future simulation, safety evaluation, and design of ADAS-TCD interaction logic. Our dataset is available at GitHub.

    https://arxiv.org/abs/2512.11802


    AI Integration In ERP Evaluation Across Trends and Architectures

    oai:arXiv.org:2512.11805v1

    arXiv:2512.11805v1 Announce Type: new Abstract: The incorporation of Artificial Intelligence (AI) into Enterprise Resource Planning (ERP) is a dramatic transition from static, on-premises systems to systems that can adapt and operate in cloud-native architectures. Cloud ERP solutions like Workday illustrate this evolution by incorporating machine learning, deep learning, and natural language processing into a centralized data-driven ecosystem. As the complexity of AI-driven ERP solutions expands, traditional evaluation frameworks that look at cost, function, and user satisfaction suffer from a lack of consideration for algorithmic transparency, adaptability, or ethics. This review will systematically investigate the latest trends, models of computing architecture, and analytical methods applied in assessing the performance of AI-integrated ERP services, specifically on cloud-based platforms. Based on academic and industry sources, the paper distills current research in line with architectural integration, analytical methodologies, and organizational impact. It identifies critical performance metrics and emphasizes the absence of any standard assessment frameworks or AI-aware systems capable of evaluating automation efficiency, security concerns as well as flexible learning modes. We put forward a theoretical model that brings AI-enabled capabilities -- such as predictive intelligence or adaptive automation -- into alignment with metrics in performance assessment for ERPs. By combining current literature and identifying major gaps in research, this paper attempts to present a complete picture of how innovations in AI are changing ERP evaluation. These research and methodological findings are intended to steer researchers and practitioners towards developing rigorous, data-driven assessment approaches, aligning with the fast-developing world of intelligent self-optimizing enterprise ecosystems

    https://arxiv.org/abs/2512.11805


    Enhancing Urban Visual Place Recognition for Crowdsourced Flood Imagery via LLM-Guided Attention

    oai:arXiv.org:2512.11811v1

    arXiv:2512.11811v1 Announce Type: new Abstract: Crowdsourced street-view imagery from social media provides valuable real-time visual evidence of urban flooding and other crisis events, yet it often lacks reliable geographic metadata for emergency response. Existing Visual Place Recognition (VPR) models exhibit substantial performance degradation when applied to such imagery due to visual distortions and domain shifts inherent in cross-source scenarios. This paper presents VPR-AttLLM, a model-agnostic framework that integrates the semantic reasoning and geospatial knowledge of Large Language Models (LLMs) into established VPR pipelines through attention-guided descriptor enhancement. By leveraging LLMs to identify location-informative regions within the city context and suppress transient visual noise, VPR-AttLLM improves retrieval performance without requiring model retraining or additional data. Comprehensive evaluations are conducted on extended benchmarks including SF-XL enriched with real social-media flood images, synthetic flooding scenarios over established query sets and Mapillary photos, and a new HK-URBAN dataset capturing morphologically distinct cityscapes. Integrating VPR-AttLLM with three state-of-the-art VPR models-CosPlace, EigenPlaces, and SALAD-consistently improves recall performance, yielding relative gains typically between 1-3% and reaching up to 8% on the most challenging real flood imagery. Beyond measurable gains in retrieval accuracy, this study establishes a generalizable paradigm for LLM-guided multimodal fusion in visual retrieval systems. By embedding principles from urban perception theory into attention mechanisms, VPR-AttLLM bridges human-like spatial reasoning with modern VPR architectures. Its plug-and-play design, strong cross-source robustness, and interpretability highlight its potential for scalable urban monitoring and rapid geo-localization of crowdsourced crisis imagery.

    https://arxiv.org/abs/2512.11811


    How Immersiveness Shapes the Link Between Anthropocentric Values and Resource Exploitation in Virtual Worlds

    oai:arXiv.org:2512.11812v1

    arXiv:2512.11812v1 Announce Type: new Abstract: The Anthropocene is characterized by escalating ecological crises rooted not only in technological and economic systems but also in deeply ingrained anthropocentric worldviews that shape human-nature relationships. As digital environments increasingly mediate these interactions, video games provide novel contexts for examining the psychological mechanisms underlying environmental behaviors. This study investigates how anthropocentric values are associated with resource-exploiting behaviors in virtual ecosystems--specifically, fishing, bug catching, and tree cutting--and how immersiveness moderates these relationships. Employing the Bayesian Mindsponge Framework (BMF) to analyze data from 640 Animal Crossi,g: New Horizons (ACNH) players across 29 countries, the study reveals complex links between anthropocentric worldviews and in-game behaviors. Fishing and tree-cutting frequencies are positively associated with anthropocentrism, whereas immersiveness weakens the association between tree cutting and anthropocentrism. Bug-catching frequency shows no direct effect but exhibits a growing negative association with anthropocentrism as immersiveness increases. These findings extend environmental psychology into virtual ecologies, illustrating how digital interactions both reflect and reshape environmental values. They highlight the potential of immersive gameplay to cultivate the Nature Quotient (NQ) and foster an eco-surplus culture through reflective, conservation-oriented engagement.

    https://arxiv.org/abs/2512.11812


    Totalitarian Technics: The Hidden Cost of AI Scribes in Healthcare

    oai:arXiv.org:2512.11814v1

    arXiv:2512.11814v1 Announce Type: new Abstract: Artificial intelligence (AI) scribes, systems that record and summarise patient-clinician interactions, are promoted as solutions to administrative overload. This paper argues that their significance lies not in efficiency gains but in how they reshape medical attention itself. Offering a conceptual analysis, it situates AI scribes within a broader philosophical lineage concerned with the externalisation of human thought and skill. Drawing on Iain McGilchrist's hemisphere theory and Lewis Mumford's philosophy of technics, the paper examines how technology embodies and amplifies a particular mode of attention. AI scribes, it contends, exemplify the dominance of a left-hemispheric, calculative mindset that privileges the measurable and procedural over the intuitive and relational. As this mode of attention becomes further embedded in medical practice, it risks narrowing the field of care, eroding clinical expertise, and reducing physicians to operators within an increasingly mechanised system.

    https://arxiv.org/abs/2512.11814


    Video Deepfake Abuse: How Company Choices Predictably Shape Misuse Patterns

    oai:arXiv.org:2512.11815v1

    arXiv:2512.11815v1 Announce Type: new Abstract: In 2022, AI image generators crossed a key threshold, enabling much more efficient and dynamic production of photorealistic deepfake images than before. This enabled opportunities for creative and positive uses of these models. However, it also enabled unprecedented opportunities for the low-effort creation of AI-generated non-consensual intimate imagery (AIG-NCII), including AI-generated child sexual abuse material (AIG-CSAM). Empirically, these harms were principally enabled by a small number of models that were trained on web data with pornographic content, released with open weights, and insufficiently safeguarded. In this paper, we observe ways in which the same patterns are emerging with video generation models in 2025. Specifically, we analyze how a small number of open-weight AI video generation models have become the dominant tools for videorealistic AIG-NCII video generation. We then analyze the literature on model safeguards and conclude that (1) developers who openly release the weights of capable video generation models without appropriate data curation and/or post-training safeguards foreseeably contribute to mitigatable downstream harm, and (2) model distribution platforms that do not proactively moderate individual misuse or models designed for AIG-NCII foreseeably amplify this harm. While there are no perfect defenses against AIG-NCII and AIG-CSAM from open-weight AI models, we argue that risk management by model developers and distributors, informed by emerging safeguard techniques, will substantially affect the future ease of creating AIG-NCII and AIG-CSAM with generative AI video tools.

    https://arxiv.org/abs/2512.11815


    Reinforcement Learning for Latent-Space Thinking in LLMs

    oai:arXiv.org:2512.11816v1

    arXiv:2512.11816v1 Announce Type: new Abstract: Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques -- an underexplored direction in latent-space thinking -- including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.

    https://arxiv.org/abs/2512.11816


    A Reproducible Workflow for Scraping, Structuring, and Segmenting Legacy Archaeological Artifact Images

    oai:arXiv.org:2512.11817v1

    arXiv:2512.11817v1 Announce Type: new Abstract: This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured and segmentation ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO compatible Json file enriched with archaeological metadata. The original images are not redistributed, and only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web based archaeological image collections into machine learning friendly formats, facilitating downstream analysis and contributing to more reproducible research practices in digital archaeology.

    https://arxiv.org/abs/2512.11817


    The Ontological Dissonance Hypothesis: AI-Triggered Delusional Ideation as Folie a Deux Technologique

    oai:arXiv.org:2512.11818v1

    arXiv:2512.11818v1 Announce Type: new Abstract: This paper argues that contemporary large language models (LLMs) can contribute to psychotic involvement by creating interactions that resemble the relational dynamics of folie a deux. Drawing on Bateson's double bind theory, clinical literature on shared psychotic disorder, and McGilchrist's hemisphere theory, we show how the combination of high linguistic coherence and the absence of an underlying subject produces a structural tension for the user: language suggests an interlocutor, while intuition registers a void. In contexts of emotional need or instability, this tension can lead users to resolve the conflict through imaginative projection, attributing interiority, intention, or presence to a system that possesses none. The paper situates these dynamics within emerging clinical reports, develops a phenomenological account of how they unfold, and argues that current engagement-optimised design choices exacerbate the risk. We conclude by proposing 'ontological honesty' as a necessary design principle for mitigating technologically mediated folie a deux.

    https://arxiv.org/abs/2512.11818


    A Modular LLM-Agent System for Transparent Multi-Parameter Weather Interpretation

    oai:arXiv.org:2512.11819v1

    arXiv:2512.11819v1 Announce Type: new Abstract: Weather forecasting is not only a predictive task but an interpretive scientific process requiring explanation, contextualization, and hypothesis generation. This paper introduces AI-Meteorologist, an explainable LLM-agent framework that converts raw numerical forecasts into scientifically grounded narrative reports with transparent reasoning steps. Unlike conventional forecast outputs presented as dense tables or unstructured time series, our system performs agent-based analysis across multiple meteorological variables, integrates historical climatological context, and generates structured explanations that identify weather fronts, anomalies, and localized dynamics. The architecture relies entirely on in-context prompting, without fine-tuning, demonstrating that interpretability can be achieved through reasoning rather than parameter updates. Through case studies on multi-location forecast data, we show how AI-Meteorologist not only communicates weather events but also reveals the underlying atmospheric drivers, offering a pathway toward AI systems that augment human meteorological expertise and support scientific discovery in climate analytics.

    https://arxiv.org/abs/2512.11819


    Toward P != NP: An Observer-Theoretic Separation via SPDP Rank and a ZFC-Equivalent Foundation within the N-Frame Model

    oai:arXiv.org:2512.11820v1

    arXiv:2512.11820v1 Announce Type: new Abstract: We present a self-contained separation framework for P vs NP developed entirely within ZFC. The approach consists of: (i) a deterministic, radius-1 compilation from uniform polynomial-time Turing computation to local sum-of-squares (SoS) polynomials with polylogarithmic contextual entanglement width (CEW); (ii) a formal Width-to-Rank upper bound for the resulting SPDP matrices at matching parameters; (iii) an NP-side identity-minor lower bound in the same encoding; and (iv) a rank-monotone, instance-uniform extraction map from the compiled P-side polynomials to the NP family. Together these yield a contradiction under the assumption P = NP, establishing a separation. We develop a correspondence between CEW, viewed as a quantitative measure of computational contextuality, and SPDP rank, yielding a unified criterion for complexity separation. We prove that bounded-CEW observers correspond to polynomial-rank computations (the class P), while unbounded CEW characterizes the class NP. This implies that exponential SPDP rank for #3SAT and related hard families forces P != NP within the standard framework of complexity theory. Key technical components include: (1) constructive lower bounds on SPDP rank via Ramanujan-Tseitin expander families; (2) a non-circular reduction from Turing-machine computation to low-rank polynomial evaluation; (3) a codimension-collapse lemma ensuring that rank amplification cannot occur within polynomial resources; and (4) proofs of barrier immunity against relativization, natural proofs, and algebrization. The result is a complete ZFC proof architecture whose primitives and compositions are fully derived, with community verification and machine-checked formalization left as future work.

    https://arxiv.org/abs/2512.11820


    Prevalence, Devices Used, Reasons for Use, Trust, Barriers, and Challenges in Utilizing Generative AI among Tertiary Students

    oai:arXiv.org:2512.11821v1

    arXiv:2512.11821v1 Announce Type: new Abstract: This study examined generative AI usage among Philippine college students particularly on frequency, devices, reasons, knowledge, trust, perceptions, and challenges. Most students used free AI tools on smartphones due to financial constraints. They used it primarily for homework, idea generation, and research. Less than half felt confident with AI and expressed mixed feelings about its accuracy. Barriers included limited access, lack of teacher support, difficulty understanding outputs, and financial constraints. The study highlighted the need for better access, support, training, and ethical guidelines. Broader concerns included impacts on learning, academic standards, job loss, and privacy. Students viewed AI positively due to peer support. Recommendations are discussed.

    https://arxiv.org/abs/2512.11821


    Trust, Usefulness, and Dependency on AI in Programming: A Hierarchical Clustering Approach

    oai:arXiv.org:2512.11822v1

    arXiv:2512.11822v1 Announce Type: new Abstract: While AI tools are transforming programming education, their adoption in underrepresented countries remains insufficiently studied. Understanding students' trust, perceived usefulness, and dependency on AI tools is essential to improving their integration into education. For these purposes, this study surveyed 508 first-year programming students in Pampanga, Philippines and analyzed their perceptions using hierarchical clustering. Results showed four unique student profiles with varying in trust and usage intensity. While students acknowledged AI tools' benefits, dependency remained low due to limited infrastructure and insufficient exposure. High-frequency users did not necessarily report greater trust or usefulness which may indicates a complex relationship between usage patterns and perception. This study recommends that to maximize AI's educational impact, targeted interventions such as infrastructure development, training programs, and curriculum integration are necessary. This study provides empirical insights to support equitable and effective AI adoption in programming education within developing regions.

    https://arxiv.org/abs/2512.11822


    Teachers' Perspectives on the Use of AI Detection Tools: Insights from Ridge Regression Analysis

    oai:arXiv.org:2512.11823v1

    arXiv:2512.11823v1 Announce Type: new Abstract: This study explores the perceptions of 213 Filipino teachers toward AI detection tools in academic settings. It focuses on the factors that influence teachers' trust, concerns, and decision-making regarding these tools. The research investigates how teachers' trust in AI detection tools affects their perceptions of fairness and decision-making in evaluating student outputs. It also explores how concerns about AI tools and social norms influence the relationship between trust and decision-making. Ridge Regression analysis was used to examine the relationships between the predictors and the dependent variable. The results revealed that trust in AI detection tools is the most significant predictor of perceived fairness and decision-making among teachers. Concerns about AI tools and social norms have weaker effects on teachers' perceptions. The study emphasized critical role of trust in shaping teachers' perceptions of AI detection tools. Teachers who trust these tools are more likely to view them as fair and effective. In contrast, concerns and social norms have a limited influence on perceptions and decision-making. For recommendations, training and institutional guidelines should emphasize how these tools work, their limitations, and best practices for their use. Striking a balance between policy enforcement and educator support is essential for fostering trust in AI detection technologies. Encouraging experienced users to share insights through communities of practice could enhance the adoption and effective use of AI detection tools in educational settings..

    https://arxiv.org/abs/2512.11823


    ReGlove: A Soft Pneumatic Glove for Activities of Daily Living Assistance via Wrist-Mounted Vision

    oai:arXiv.org:2512.11824v1

    arXiv:2512.11824v1 Announce Type: new Abstract: This paper presents ReGlove, a system that converts low-cost commercial pneumatic rehabilitation gloves into vision-guided assistive orthoses. Chronic upper-limb impairment affects millions worldwide, yet existing assistive technologies remain prohibitively expensive or rely on unreliable biological signals. Our platform integrates a wrist-mounted camera with an edge-computing inference engine (Raspberry Pi 5) to enable context-aware grasping without requiring reliable muscle signals. By adapting real-time YOLO-based computer vision models, the system achieves \SI{96.73}{\percent} grasp classification accuracy with sub-\SI{40.00}{\milli\second} end-to-end latency. Physical validation using standardized benchmarks shows \SI{82.71}{\percent} success on YCB object manipulation and reliable performance across \SI{27.00}{} Activities of Daily Living (ADL) tasks. With a total cost under \$\SI{250.00}{} and exclusively commercial components, ReGlove provides a technical foundation for accessible, vision-based upper-limb assistance that could benefit populations excluded from traditional EMG-controlled devices.

    https://arxiv.org/abs/2512.11824


    FSL-HDnn: A 40 nm Few-shot On-Device Learning Accelerator with Integrated Feature Extraction and Hyperdimensional Computing

    oai:arXiv.org:2512.11826v1

    arXiv:2512.11826v1 Announce Type: new Abstract: This paper introduces FSL-HDnn, an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction and on-device few-shot learning (FSL). The accelerator addresses fundamental challenges of on-device learning (ODL) for resource-constrained edge applications through two synergistic modules: a parameter-efficient feature extractor employing weight clustering and an FSL classifier based on hyperdimensional computing (HDC). The feature extractor exploits the weight clustering mechanism to reduce computational complexity, while the HDC-based FSL classifier eliminates gradient-based back propagation operations, enabling single-pass training with substantially reduced latency. Additionally, FSL-HDnn enables low-latency ODL and inference via two proposed optimization strategies, including an early-exit mechanism with branch feature extraction and batched single-pass training that improves hardware utilization. Measurement results demonstrate that our chip fabricated in a 40 nm CMOS process delivers superior training energy efficiency of 6 mJ/image and end-to-end training throughput of 28 images/s on a 10-way 5-shot FSL task. The end-to-end training latency is also reduced by 2x to 20.9x compared to state-of-the-art ODL chips.

    https://arxiv.org/abs/2512.11826


    Assessing Greenspace Attractiveness with ChatGPT, Claude, and Gemini: Do AI Models Reflect Human Perceptions?

    oai:arXiv.org:2512.11827v1

    arXiv:2512.11827v1 Announce Type: new Abstract: Understanding greenspace attractiveness is essential for designing livable and inclusive urban environments, yet existing assessment approaches often overlook informal or transient spaces and remain too resource intensive to capture subjective perceptions at scale. This study examines the ability of multimodal large language models (MLLMs), ChatGPT GPT-4o, Claude 3.5 Haiku, and Gemini 2.0 Flash, to assess greenspace attractiveness similarly to humans using Google Street View imagery. We compared model outputs with responses from a geo-questionnaire of residents in Lodz, Poland, across both formal (for example, parks and managed greenspaces) and informal (for example, meadows and wastelands) greenspaces. Survey respondents and models indicated whether each greenspace was attractive or unattractive and provided up to three free text explanations. Analyses examined how often their attractiveness judgments aligned and compared their explanations after classifying them into shared reasoning categories. Results show high AI human agreement for attractive formal greenspaces and unattractive informal spaces, but low alignment for attractive informal and unattractive formal greenspaces. Models consistently emphasized aesthetic and design oriented features, underrepresenting safety, functional infrastructure, and locally embedded qualities valued by survey respondents. While these findings highlight the potential for scalable pre-assessment, they also underscore the need for human oversight and complementary participatory approaches. We conclude that MLLMs can support, but not replace, context sensitive greenspace evaluation in planning practice.

    https://arxiv.org/abs/2512.11827


    Active Inference with Reusable State-Dependent Value Profiles

    oai:arXiv.org:2512.11829v1

    arXiv:2512.11829v1 Announce Type: new Abstract: Adaptive behavior in volatile environments requires agents to switch among value-control regimes across latent contexts, but maintaining separate preferences, policy biases, and action-confidence parameters for every situation is intractable. We introduce value profiles: a small set of reusable bundles of value-related parameters (outcome preferences, policy priors, and policy precision) assigned to hidden states in a generative model. As posterior beliefs over states evolve trial by trial, effective control parameters arise via belief-weighted mixing, enabling state-conditional strategy recruitment without requiring independent parameters for each context. We evaluate this framework in probabilistic reversal learning, comparing static-precision, entropy-coupled dynamic-precision, and profile-based models using cross-validated log-likelihood and information criteria. Model comparison favors the profile-based model over simpler alternatives (about 100-point AIC differences), and parameter-recovery analyses support structural identifiability even when context must be inferred from noisy observations. Model-based inference further suggests that adaptive control in this task is driven primarily by modulation of policy priors rather than policy precision, with gradual belief-dependent profile recruitment consistent with state-conditional (not purely uncertainty-driven) control. Overall, reusable value profiles provide a tractable computational account of belief-conditioned value control in volatile environments and yield testable signatures of belief-dependent control and behavioral flexibility.

    https://arxiv.org/abs/2512.11829


    CR3G: Causal Reasoning for Patient-Centric Explanations in Radiology Report Generation

    oai:arXiv.org:2512.11830v1

    arXiv:2512.11830v1 Announce Type: new Abstract: Automatic chest X-ray report generation is an important area of research aimed at improving diagnostic accuracy and helping doctors make faster decisions. Current AI models are good at finding correlations (or patterns) in medical images. Still, they often struggle to understand the deeper cause-and-effect relationships between those patterns and a patient condition. Causal inference is a powerful approach that goes beyond identifying patterns to uncover why certain findings in an X-ray relate to a specific diagnosis. In this paper, we will explore the prompt-driven framework Causal Reasoning for Patient-Centric Explanations in radiology Report Generation (CR3G) that is applied to chest X-ray analysis to improve understanding of AI-generated reports by focusing on cause-and-effect relationships, reasoning and generate patient-centric explanation. The aim to enhance the quality of AI-driven diagnostics, making them more useful and trustworthy in clinical practice. CR3G has shown better causal relationship capability and explanation capability for 2 out of 5 abnormalities.

    https://arxiv.org/abs/2512.11830


    On the Design of One-step Diffusion via Shortcutting Flow Paths

    oai:arXiv.org:2512.11831v1

    arXiv:2512.11831v1 Announce Type: new Abstract: Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (a.k.a. shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.

    https://arxiv.org/abs/2512.11831


    Performance and Efficiency of Climate In-Situ Data Reconstruction: Why Optimized IDW Outperforms kriging and Implicit Neural Representation

    oai:arXiv.org:2512.11832v1

    arXiv:2512.11832v1 Announce Type: new Abstract: This study evaluates three reconstruction methods for sparse climate data: the simple inverse distance weighting (IDW), the statistically grounded ordinary kriging (OK), and the advanced implicit neural representation model (MMGN architecture). All methods were optimized through hyper-parameter tuning using validation splits. An extensive set of experiments was conducted, followed by a comprehensive statistical analysis. The results demonstrate the superiority of the simple IDW method over the other reference methods in terms of both reconstruction accuracy and computational efficiency. IDW achieved the lowest RMSE ($3.00 \pm 1.93$), MAE ($1.32 \pm 0.77$), and $\Delta_{MAX}$ ($24.06 \pm 17.15$), as well as the highest $R^2$ ($0.68 \pm 0.16$), across 100 randomly sampled sparse datasets from the ECA\&D database. Differences in RMSE, MAE, and $R^2$ were statistically significant and exhibited moderate to large effect sizes. The Dunn post-hoc test further confirmed the consistent superiority of IDW across all evaluated quality measures [...]

    https://arxiv.org/abs/2512.11832


    Soft Decision Tree classifier: explainable and extendable PyTorch implementation

    oai:arXiv.org:2512.11833v1

    arXiv:2512.11833v1 Announce Type: new Abstract: We implemented a Soft Decision Tree (SDT) and a Short-term Memory Soft Decision Tree (SM-SDT) using PyTorch. The methods were extensively tested on simulated and clinical datasets. The SDT was visualized to demonstrate the potential for its explainability. SDT, SM-SDT, and XGBoost demonstrated similar area under the curve (AUC) values. These methods were better than Random Forest, Logistic Regression, and Decision Tree. The results on clinical datasets suggest that, aside from a decision tree, all tested classification methods yield comparable results. The code and datasets are available online on GitHub: https://github.com/KI-Research-Institute/Soft-Decision-Tree

    https://arxiv.org/abs/2512.11833


    Hybrid twinning using PBDW and DeepONet for the effective state estimation and prediction on partially known systems

    oai:arXiv.org:2512.11834v1

    arXiv:2512.11834v1 Announce Type: new Abstract: The accurate estimation of the state of complex uncertain physical systems requires reconciling theoretical models, with inherent imperfections, with noisy experimental data. In this work, we propose an effective hybrid approach that combines physics-based modeling with data-driven learning to enhance state estimation and further prediction. Our method builds upon the Parameterized Background Data-Weak (PBDW) framework, which naturally integrates a reduced-order representation of the best-available model with measurement data to account for both anticipated and unanticipated uncertainties. To address model discrepancies not captured by the reduced-order space, and learn the structure of model deviation, we incorporate a Deep Operator Network (DeepONet) constrained to be an orthogonal complement of the best-knowledge manifold. This ensures that the learned correction targets only the unknown components of model bias, preserving the interpretability and fidelity of the physical model. An optimal sensor placement strategy is also investigated to maximize information gained from measurements. We validate the proposed approach on a representative problem involving the Helmholtz equation under various sources of modeling error, including those arising from boundary conditions and source terms.

    https://arxiv.org/abs/2512.11834


    A Monad-Based Clause Architecture for Artificial Age Score (AAS) in Large Language Models

    oai:arXiv.org:2512.11835v1

    arXiv:2512.11835v1 Announce Type: new Abstract: Large language models (LLMs) are often deployed as powerful yet opaque systems, leaving open how their internal memory and "self-like" behavior should be governed in a principled and auditable way. The Artificial Age Score (AAS) was previously introduced and mathematically justified through three theorems that characterise it as a metric of artificial memory aging. Building on this foundation, the present work develops an engineering-oriented, clause-based architecture that imposes law-like constraints on LLM memory and control. Twenty selected monads from Leibniz's Monadology are grouped into six bundles: ontology, dynamics, representation and consciousness, harmony and reason, body and organisation, and teleology, and each bundle is realised as an executable specification on top of the AAS kernel. Across six minimal Python implementations, these clause families are instantiated in numerical experiments acting on channel-level quantities such as recall scores, redundancy, and weights. Each implementation follows a four-step pattern: inputs and setup, clause implementation, numerical results, and implications for LLM design, emphasising that the framework is not only philosophically motivated but also directly implementable. The experiments show that the clause system exhibits bounded and interpretable behavior: AAS trajectories remain continuous and rate-limited, contradictions and unsupported claims trigger explicit penalties, and hierarchical refinement reveals an organic structure in a controlled manner. Dual views and goal-action pairs are aligned by harmony terms, and windowed drift in perfection scores separates sustained improvement from sustained degradation. Overall, the monad-based clause framework uses AAS as a backbone and provides a transparent, code-level blueprint for constraining and analyzing internal dynamics in artificial agents.

    https://arxiv.org/abs/2512.11835


    Semantic Nutrition Estimation: Predicting Food Healthfulness from Text Descriptions

    oai:arXiv.org:2512.11836v1

    arXiv:2512.11836v1 Announce Type: new Abstract: Accurate nutritional assessment is critical for public health, but existing profiling systems require detailed data often unavailable or inaccessible from colloquial text descriptions of food. This paper presents a machine learning pipeline that predicts the comprehensive Food Compass Score 2.0 (FCS) from text descriptions. Our approach uses multi-headed neural networks to process hybrid feature vectors that combine semantic text embeddings, lexical patterns, and domain heuristics, alongside USDA Food and Nutrient Database for Dietary Studies (FNDDS) data. The networks estimate the nutrient and food components necessary for the FCS algorithm. The system demonstratedstrong predictive power, achieving a median R^2 of 0.81 for individual nutrients. The predicted FCS correlated strongly with published values (Pearson's r = 0.77), with a mean absolute difference of 14.0 points. While errors were largest for ambiguous or processed foods, this methodology translates language into actionable nutritional information, enabling scalable dietary assessment for consumer applications and research.

    https://arxiv.org/abs/2512.11836


    D-STEER - Preference Alignment Techniques Learn to Behave, not to Believe -- Beneath the Surface, DPO as Steering Vector Perturbation in Activation Space

    oai:arXiv.org:2512.11838v1

    arXiv:2512.11838v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) has become a standard recipe for aligning large language models, yet it is still unclear what kind of change it actually induces inside the network. This paper argues that DPO does not rewrite a models internal beliefs; instead, it acts as a low rank steering mechanism that nudges activations along a small number of preference directions. Using a simple derivation, we show that the DPO gradient depends only on the difference between the logit embeddings of preferred and dispreferred completions, implying a first order shift in the final hidden representation rather than a deep restructuring of semantics. We then extract an empirical steering vector from a DPO tuned model and demonstrate that adding this vector to base activations reproduces most of the aligned behavior, while subtracting it nearly restores the original model. Finally, spectral analyses reveal rank-one dominance and entropy collapse in upper layers, indicating that alignment is funneled through a narrow subspace. Taken together, these results support a behavioral illusion view of DPO: it teaches models how to act aligned, not what to believe.

    https://arxiv.org/abs/2512.11838


    Large Language Models as Generalist Policies for Network Optimization

    oai:arXiv.org:2512.11839v1

    arXiv:2512.11839v1 Announce Type: new Abstract: Designing control policies to ensure robust network services is essential to modern digital infrastructure. However, the dominant paradigm for network optimization relies on designing specialist policies based on handcrafted rules or deep learning models, leading to poor generalization across diverse tasks and environments. In contrast, large language models (LLMs), pretrained on Internet-scale corpora, provide a rich and unified knowledge base that encodes fundamental networking principles. Combined with their emergent abilities in generalization to unseen scenarios, LLMs offer a transformative foundation for generalist network policies that can generalize across diverse tasks and environments with minimal adaptation. In this paper, we present Trailblazer, the first systematic framework to realize such a generalist policy for networking. Trailblazer incorporates a network alignment scheme to ground the LLM in specific networking tasks, and an adaptive policy collaboration mechanism that offloads simple control cases from the LLM to a lightweight policy for computational efficiency. Through extensive simulations and large-scale real-world online evaluation on Douyin (the Chinese version of TikTok), Trailblazer, powered by a single LLM, demonstrates stronger cross-task and cross-environment generalization than conventional specialist policies. Our results validate LLMs as the foundation for generalist network policies, and position Trailblazer as the first step toward the generalist-driven paradigm that enables strong generalization with minimal efforts in policy design.

    https://arxiv.org/abs/2512.11839


    Amortized Causal Discovery with Prior-Fitted Networks

    oai:arXiv.org:2512.11840v1

    arXiv:2512.11840v1 Announce Type: new Abstract: In recent years, differentiable penalized likelihood methods have gained popularity, optimizing the causal structure by maximizing its likelihood with respect to the data. However, recent research has shown that errors in likelihood estimation, even on relatively large sample sizes, disallow the discovery of proper structures. We propose a new approach to amortized causal discovery that addresses the limitations of likelihood estimator accuracy. Our method leverages Prior-Fitted Networks (PFNs) to amortize data-dependent likelihood estimation, yielding more reliable scores for structure learning. Experiments on synthetic, simulated, and real-world datasets show significant gains in structure recovery compared to standard baselines. Furthermore, we demonstrate directly that PFNs provide more accurate likelihood estimates than conventional neural network-based approaches.

    https://arxiv.org/abs/2512.11840


    Meta-Continual Mobility Forecasting for Proactive Handover Prediction

    oai:arXiv.org:2512.11841v1

    arXiv:2512.11841v1 Announce Type: new Abstract: Short-term mobility forecasting is a core requirement for proactive handover (HO) in cellular networks. Real-world mobility is highly non-stationary: abrupt turns, rapid speed changes, and unpredictable user behavior cause conventional predictors to drift, leading to mistimed or failed handovers. We propose a lightweight meta-continual forecasting framework that integrates a GRU-based predictor, Reptile meta-initialization for fast few-shot adaptation, and an EWMA residual detector that triggers compact online updates only when drift occurs. Evaluated on a reproducible GeoLife and DeepMIMO pipeline, our method achieves 4.46 m ADE and 7.79 m FDE in zero-shot settings, improves few-shot ADE to 3.71 m at 10-shot, and enables recovery from abrupt drift about 2 to 3 times faster than an offline GRU. When applied to downstream HO prediction, the approach improves F1 to 0.83 and AUROC to 0.90, with substantial reductions in missed-HO and ping-pong events. The model is lightweight (128k parameters) and suitable for edge deployment in 5G and 6G systems.

    https://arxiv.org/abs/2512.11841


    Car-following Models and Congestion Control with Followerstopper on a Ring-Road under Known Delay -- Examining Limit Cycle

    oai:arXiv.org:2512.11842v1

    arXiv:2512.11842v1 Announce Type: new Abstract: This paper examines the IDM microscopic car-following model from a dynamical systems perspective, analyzing the effects of delay on congestion formation. Further, a case of mixed-autonomy is considered by controlling one car with Followerstopper in a ring road setting containing IDM vehicles as human drivers. Specifically, the stop-and-go waves phenomenon in idealized traffic from a dynamical systems perspective is examined. We show that Followerstopper-controlled vehicle is effective at eliminating emergent stop-and-go waves in the IDM traffic simulation. We show through simulation that the uniform flow manifold is unstable for the ring road simulation with IDM vehicles, and that replacing a single car with Followerstopper induces stability, allowing the cars to drive safely at a uniform speed. Additionally, the case of known delay is considered in a mixed-autonomy scenario. Our simulation result shows that while considering a known time delay, traffic waves emerge earlier than in the no-delay case. At the same time, a single-vehicle controlled using Followerstopper controller is able to prevent the emergence of traffic waves even in the presence of delay.

    https://arxiv.org/abs/2512.11842


    Spiking Manifesto

    oai:arXiv.org:2512.11843v1

    arXiv:2512.11843v1 Announce Type: new Abstract: Practically everything computers do is better, faster, and more power-efficient than the brain. For example, a calculator crunches numbers more energy-efficiently than any human. Yet AI models are a thousand times less efficient than the brain. These models use artificial neural networks (ANNs) and require GPUs for the multiplication of huge matrices. In contrast, spiking neural networks (SNNs) of the brain have no matrix multiplication and much smaller energy requirements. This manifesto proposes a framework for thinking about popular AI models in terms of spiking networks and polychronization, and for interpreting spiking activity as nature's way of implementing look-up tables. This offers a way to convert AI models into a novel type of architecture with the promise of a thousandfold improvement in efficiency. Code is available at https://github.com/izhikevich/SNN

    https://arxiv.org/abs/2512.11843


    Love First, Know Later: Persona-Based Romantic Compatibility Through LLM Text World Engines

    oai:arXiv.org:2512.11844v1

    arXiv:2512.11844v1 Announce Type: new Abstract: We propose Love First, Know Later: a paradigm shift in computational matching that simulates interactions first, then assesses compatibility. Instead of comparing static profiles, our framework leverages LLMs as text world engines that operate in dual capacity-as persona-driven agents following behavioral policies and as the environment modeling interaction dynamics. We formalize compatibility assessment as a reward-modeling problem: given observed matching outcomes, we learn to extract signals from simulations that predict human preferences. Our key insight is that relationships hinge on responses to critical moments-we translate this observation from relationship psychology into mathematical hypotheses, enabling effective simulation. Theoretically, we prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching. Empirically, we validate on speed dating data for initial chemistry and divorce prediction for long-term stability. This paradigm enables interactive, personalized matching systems where users iteratively refine their agents, unlocking future possibilities for transparent and interactive compatibility assessment.

    https://arxiv.org/abs/2512.11844


    Airport Passenger Flow Forecasting via Deformable Temporal-Spectral Transformer Approach

    oai:arXiv.org:2512.11845v1

    arXiv:2512.11845v1 Announce Type: new Abstract: Accurate forecasting of passenger flows is critical for maintaining the efficiency and resilience of airport operations. Recent advances in patch-based Transformer models have shown strong potential in various time series forecasting tasks. However, most existing methods rely on fixed-size patch embedding, making it difficult to model the complex and heterogeneous patterns of airport passenger flows. To address this issue, this paper proposes a deformable temporal-spectral transformer named DTSFormer that integrates a multiscale deformable partitioning module and a joint temporal-spectral filtering module. Specifically, the input sequence is dynamically partitioned into multiscale temporal patches via a novel window function-based masking, enabling the extraction of heterogeneous trends across different temporal stages. Then, within each scale, a frequency-domain attention mechanism is designed to capture both high- and low-frequency components, thereby emphasizing the volatility and periodicity inherent in airport passenger flows. Finally, the resulting multi-frequency features are subsequently fused in the time domain to jointly model short-term fluctuations and long-term trends. Comprehensive experiments are conducted on real-world passenger flow data collected at Beijing Capital International Airport from January 2023 to March 2024. The results indicate that the proposed method consistently outperforms state-of-the-art forecasting models across different prediction horizons. Further analysis shows that the deformable partitioning module aligns patch lengths with dominant periods and heterogeneous trends, enabling superior capture of sudden high-frequency fluctuations.

    https://arxiv.org/abs/2512.11845


    Exploring Topological Bias in Heterogeneous Graph Neural Networks

    oai:arXiv.org:2512.11846v1

    arXiv:2512.11846v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are characterized by their capacity of processing graph-structured data. However, due to the sparsity of labels under semi-supervised learning, they have been found to exhibit biased performance on specific nodes. This kind of bias has been validated to correlate with topological structure and is considered as a bottleneck of GNNs' performance. Existing work focuses on the study of homogeneous GNNs and little attention has been given to topological bias in Heterogeneous Graph Neural Networks (HGNNs). In this work, firstly, in order to distinguish distinct meta relations, we apply meta-weighting to the adjacency matrix of a heterogeneous graph. Based on the modified adjacency matrix, we leverage PageRank along with the node label information to construct a projection. The constructed projection effectively maps nodes to values that strongly correlated with model performance when using datasets both with and without intra-type connections, which demonstrates the universal existence of topological bias in HGNNs. To handle this bias, we propose a debiasing structure based on the difference in the mapped values of nodes and use it along with the original graph structure for contrastive learning. Experiments on three public datasets verify the effectiveness of the proposed method in improving HGNNs' performance and debiasing.

    https://arxiv.org/abs/2512.11846


    Tiny Recursive Models on ARC-AGI-1: Inductive Biases, Identity Conditioning, and Test-Time Compute

    oai:arXiv.org:2512.11847v1

    arXiv:2512.11847v1 Announce Type: new Abstract: Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM's non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM's ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.

    https://arxiv.org/abs/2512.11847


    KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

    oai:arXiv.org:2512.11849v1

    arXiv:2512.11849v1 Announce Type: new Abstract: Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.

    https://arxiv.org/abs/2512.11849


    The Memecoin Phenomenon: An In-Depth Study of Solana's Blockchain Trends

    oai:arXiv.org:2512.11850v1

    arXiv:2512.11850v1 Announce Type: new Abstract: This paper analyzes the emerging memecoin phenomenon on the Solana blockchain, focusing on the Pump.fun platform during Q4 2024. Using on-chain data, it is explored how retail-focused token creation platforms are reshaping blockchain ecosystems and influencing market participation. This study finds that Pump.fun accounted for up to 71.1% of all tokens minted on Solana and contributed 40-67.4% of total DEX transactions. Despite this activity, fewer than 2% of tokens successfully transitioned to major decentralized exchanges, highlighting a highly speculative market structure. The platform experienced rapid growth, with daily active users rising from 60,000 to peaks of 260,000, underscoring strong retail adoption. This reflects a broader shift towards accessible, socially-driven market participation enabled by memecoins. However, while memecoins lower entry barriers and encourage retail engagement, they introduce significant risks. The volatile and speculative nature of these platforms raises concerns about long-term sustainability and the resilience of the blockchain ecosystem. These findings reveal the dual impact of memecoins: they democratize token creation and alter market dynamics but may jeopardize market efficiency and stability. This paper highlights the need to critically assess the implications of retail-driven speculative trading and its potential to disrupt emerging blockchain economies.

    https://arxiv.org/abs/2512.11850


    KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs

    oai:arXiv.org:2512.11851v1

    arXiv:2512.11851v1 Announce Type: new Abstract: Whether attention key value (KV) states computed for one prompt for a small LLM can be reused to accelerate inference on a new similar prompt, giving an increase to the space to its context memory using an approach called token recycling. Using a standard Hugging Face setup with DialoGPT-medium (a 345M parameter GPT-2 style decoder trained on 147M Reddit exchanges, 2005 to 2017) as the testbed, we build a cache of past activations and get entries by sentence embeddings, then reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled vs. baseline runs on latency and output fidelity, and log reuse depth in tokens. Reproducibility requires no model modifications, cached KVs are serialized to the CPU, reloaded, and supplied to the generate function to continue decoding from the cached prefix. In tests, we observe consistent speedups when prefix overlap exists, with no material degradation in output semantics, and when overlap is absent, behavior matches baseline.

    https://arxiv.org/abs/2512.11851


    Explainable AI for Smart Greenhouse Control: Interpretability of Temporal Fusion Transformer in the Internet of Robotic Things

    oai:arXiv.org:2512.11852v1

    arXiv:2512.11852v1 Announce Type: new Abstract: The integration of the Internet of Robotic Things (IoRT) in smart greenhouses has revolutionised precision agriculture by enabling efficient and autonomous environmental control. However, existing time series forecasting models in such setups often operate as black boxes, lacking mechanisms for explainable decision-making, which is a critical limitation when trust, transparency, and regulatory compliance are paramount in smart farming practices. This study leverages the Temporal Fusion Transformer (TFT) model to automate actuator settings for optimal greenhouse management. To enhance interpretability and trust in the model decision-making process, both local and global explanation techniques were employed using model-inherent interpretation, local interpretable model-agnostic explanations (LIME), and SHapley additive explanations (SHAP). These explainability methods provide information on how different sensor readings, such as temperature, humidity, CO2 levels, light, and outer climate, contribute to actuator control decisions in an automated greenhouse. The trained TFT model achieved a test accuracy of 95% on a class-imbalanced dataset for actuator control settings in an automated greenhouse environment. The results demonstrate the varying influence of each sensor on real-time greenhouse adjustments, ensuring transparency and enabling adaptive fine-tuning for improved crop yield and resource efficiency.

    https://arxiv.org/abs/2512.11852


    Evolving Deep Learning Optimizers

    oai:arXiv.org:2512.11853v1

    arXiv:2512.11853v1 Announce Type: new Abstract: We present a genetic algorithm framework for automatically discovering deep learning optimization algorithms. Our approach encodes optimizers as genomes that specify combinations of primitive update terms (gradient, momentum, RMS normalization, Adam-style adaptive terms, and sign-based updates) along with hyperparameters and scheduling options. Through evolutionary search over 50 generations with a population of 50 individuals, evaluated across multiple vision tasks, we discover an evolved optimizer that outperforms Adam by 2.6% in aggregate fitness and achieves a 7.7% relative improvement on CIFAR-10. The evolved optimizer combines sign-based gradient terms with adaptive moment estimation, uses lower momentum coefficients than Adam ($\beta_1$=0.86, $\beta_2$=0.94), and notably disables bias correction while enabling learning rate warmup and cosine decay. Our results demonstrate that evolutionary search can discover competitive optimization algorithms and reveal design principles that differ from hand-crafted optimizers. Code is available at https://github.com/mmarfinetz/evo-optimizer.

    https://arxiv.org/abs/2512.11853


    Rep Smarter, Not Harder: AI Hypertrophy Coaching with Wearable Sensors and Edge Neural Networks

    oai:arXiv.org:2512.11854v1

    arXiv:2512.11854v1 Announce Type: new Abstract: Optimizing resistance training for hypertrophy requires balancing proximity to muscular failure, often quantified by Repetitions in Reserve (RiR), with fatigue management. However, subjective RiR assessment is unreliable, leading to suboptimal training stimuli or excessive fatigue. This paper introduces a novel system for real-time feedback on near-failure states (RiR $\le$ 2) during resistance exercise using only a single wrist-mounted Inertial Measurement Unit (IMU). We propose a two-stage pipeline suitable for edge deployment: first, a ResNet-based model segments repetitions from the 6-axis IMU data in real-time. Second, features derived from this segmentation, alongside direct convolutional features and historical context captured by an LSTM, are used by a classification model to identify exercise windows corresponding to near-failure states. Using a newly collected dataset from 13 diverse participants performing preacher curls to failure (631 total reps), our segmentation model achieved an F1 score of 0.83, and the near-failure classifier achieved an F1 score of 0.82 under simulated real-time evaluation conditions (1.6 Hz inference rate). Deployment on a Raspberry Pi 5 yielded an average inference latency of 112 ms, and on an iPhone 16 yielded 23.5 ms, confirming the feasibility for edge computation. This work demonstrates a practical approach for objective, real-time training intensity feedback using minimal hardware, paving the way for accessible AI-driven hypertrophy coaching tools that help users manage intensity and fatigue effectively.

    https://arxiv.org/abs/2512.11854


    Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

    oai:arXiv.org:2512.11855v1

    arXiv:2512.11855v1 Announce Type: new Abstract: Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, achieving exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic averaging complexity. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.

    https://arxiv.org/abs/2512.11855


    GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search

    oai:arXiv.org:2512.11856v1

    arXiv:2512.11856v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have emerged as the state-of-the-art graph learning method. However, achieving efficient GNN inference on edge devices poses significant challenges, limiting their application in real-world edge scenarios. This is due to the high computational cost of GNNs and limited hardware resources on edge devices, which prevent GNN inference from meeting real-time and energy requirements. As an emerging paradigm, device-edge co-inference shows potential for improving inference efficiency and reducing energy consumption on edge devices. Despite its potential, research on GNN device-edge co-inference remains scarce, and our findings show that traditional model partitioning methods are ineffective for GNNs. To address this, we propose GCoDE, the first automatic framework for GNN architecture-mapping Co-design and deployment on Device-Edge hierarchies. By abstracting the device communication process into an explicit operation, GCoDE fuses the architecture and mapping scheme in a unified design space for joint optimization. Additionally, GCoDE's system performance awareness enables effective evaluation of architecture efficiency across diverse heterogeneous systems. By analyzing the energy consumption of various GNN operations, GCoDE introduces an energy prediction method that improves energy assessment accuracy and identifies energy-efficient solutions. Using a constraint-based random search strategy, GCoDE identifies the optimal solution in 1.5 hours, balancing accuracy and efficiency. Moreover, the integrated co-inference engine in GCoDE enables efficient deployment and execution of GNN co-inference. Experimental results show that GCoDE can achieve up to 44.9x speedup and 98.2% energy reduction compared to existing approaches across diverse applications and system configurations.

    https://arxiv.org/abs/2512.11856


    TopicProphet: Prophesies on Temporal Topic Trends and Stocks

    oai:arXiv.org:2512.11857v1

    arXiv:2512.11857v1 Announce Type: new Abstract: Stocks can't be predicted. Despite many hopes, this premise held itself true for many years due to the nature of quantitative stock data lacking causal logic along with rapid market changes hindering accumulation of significant data for training models. To undertake this matter, we propose a novel framework, TopicProphet, to analyze historical eras that share similar public sentiment trends and historical background. Our research deviates from previous studies that identified impacts of keywords and sentiments - we expand on that method by a sequence of topic modeling, temporal analysis, breakpoint detection and segment optimization to detect the optimal time period for training. This results in improving predictions by providing the model with nuanced patterns that occur from that era's socioeconomic and political status while also resolving the shortage of pertinent stock data to train on. Through extensive analysis, we conclude that TopicProphet produces improved outcomes compared to the state-of-the-art methods in capturing the optimal training data for forecasting financial percentage changes.

    https://arxiv.org/abs/2512.11857


    Adaptive Path Integral Diffusion: AdaPID

    oai:arXiv.org:2512.11858v1

    arXiv:2512.11858v1 Announce Type: new Abstract: Diffusion-based samplers -- Score Based Diffusions, Bridge Diffusions and Path Integral Diffusions -- match a target at terminal time, but the real leverage comes from choosing the schedule that governs the intermediate-time dynamics. We develop a path-wise schedule -- selection gramework for Harmonic PID with a time-varying stiffness, exploiting Piece-Wise-Constant(PWC) parametrizations and a simple hierarchical refinement. We introduce schedule-sensitive Quality-of-Sampling (QoS) diagnostics. Assuming a Gaussian-Mixture (GM) target, we retain closed-form Green functions' ration and numerically stable, Neural-Network free oracles for predicted-state maps and score. Experiments in 2D show that QoS driven PWC schedules consistently improve early-exit fidelity, tail accuracy, conditioning of the dynamics, and speciation (label-selection) timing at fixed integration budgets.

    https://arxiv.org/abs/2512.11858


    Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion

    oai:arXiv.org:2512.11859v1

    arXiv:2512.11859v1 Announce Type: new Abstract: We introduce Guided Harmonic Path-Integral Diffusion (GH-PID), a linearly-solvable framework for guided Stochastic Optimal Transport (SOT) with a hard terminal distribution and soft, application-driven path costs. A low-dimensional guidance protocol shapes the trajectory ensemble while preserving analytic structure: the forward and backward Kolmogorov equations remain linear, the optimal score admits an explicit Green-function ratio, and Gaussian-Mixture Model (GMM) terminal laws yield closed-form expressions. This enables stable sampling and differentiable protocol learning under exact terminal matching. We develop guidance-centric diagnostics -- path cost, centerline adherence, variance flow, and drift effort -- that make GH-PID an interpretable variational ansatz for empirical SOT. Three navigation scenarios illustrated in 2D: (i) Case A: hand-crafted protocols revealing how geometry and stiffness shape lag, curvature effects, and mode evolution; (ii) Case B: single-task protocol learning, where a PWC centerline is optimized to minimize integrated cost; (iii) Case C: multi-expert fusion, in which a commander reconciles competing expert/teacher trajectories and terminal beliefs through an exact product-of-experts law and learns a consensus protocol. Across all settings, GH-PID generates geometry-aware, trust-aware trajectories that satisfy the prescribed terminal distribution while systematically reducing integrated cost.

    https://arxiv.org/abs/2512.11859


    An Operator-Consistent Graph Neural Network for Learning Diffusion Dynamics on Irregular Meshes

    oai:arXiv.org:2512.11860v1

    arXiv:2512.11860v1 Announce Type: new Abstract: Classical numerical methods solve partial differential equations (PDEs) efficiently on regular meshes, but many of them become unstable on irregular domains. In practice, multiphysics interactions such as diffusion, damage, and healing often take place on irregular meshes. We develop an operator-consistent graph neural network (OCGNN-PINN) that approximates PDE evolution under physics-informed constraints. It couples node-edge message passing with a consistency loss enforcing the gradient-divergence relation through the graph incidence matrix, ensuring that discrete node and edge dynamics remain structurally coupled during temporal rollout. We evaluate the model on diffusion processes over physically driven evolving meshes and real-world scanned surfaces. The results show improved temporal stability and prediction accuracy compared with graph convolutional and multilayer perceptron baselines, approaching the performance of Crank-Nicolson solvers on unstructured domains.

    https://arxiv.org/abs/2512.11860


    Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

    oai:arXiv.org:2512.11862v1

    arXiv:2512.11862v1 Announce Type: new Abstract: The low-altitude intelligent networks (LAINs) emerge as a promising architecture for delivering low-latency and energy-efficient edge intelligence in dynamic and infrastructure-limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing. However, these systems face key challenges, including energy-constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey-Clarke-Groves auction mechanism enables the energy-aware and incentive-compatible trajectory assignment. At the small timescale, we propose the diffusion-heterogeneous-agent proximal policy optimization, a generative multi-agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation-conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.

    https://arxiv.org/abs/2512.11862


    Expert Assessment: The Systemic Environmental Risks of Artficial Intelligence

    oai:arXiv.org:2512.11863v1

    arXiv:2512.11863v1 Announce Type: new Abstract: Artificial intelligence (AI) is often presented as a key tool for addressing societal challenges, such as climate change. At the same time, AI's environmental footprint is expanding increasingly. This report describes the systemic environmental risks of artificial intelligence, in particular, moving beyond direct impacts such as energy and water usage. Systemic environmental risks of AI are emergent, cross-sector harms to climate, biodiversity, freshwater, and broader socioecological systems that arise primarily from AI's integration into social, economic, and physical infrastructures, rather than its direct resource use, and that propagate through feedbacks, yielding nonlinear, inequitable, and potentially irreversible impacts. While these risks are emergent and quantification is uncertain, this report aims to provide an overview of systemic environmental risks. Drawing on a narrative literature review, we propose a three-level framework that operationalizes systemic risk analysis. The framework identifies the structural conditions that shape AI development, the risk amplification mechanisms that propagate environmental harm, and the impacts that manifest as observable ecological and social consequences. We illustrate the framework in expert-interview-based case studies across agriculture and biodiversity, oil and gas, and waste management.

    https://arxiv.org/abs/2512.11863


    Solving Parallel Machine Scheduling With Precedences and Cumulative Resource Constraints With Calendars

    oai:arXiv.org:2512.11864v1

    arXiv:2512.11864v1 Announce Type: new Abstract: The task of finding efficient production schedules for parallel machines is a challenge that arises in most industrial manufacturing domains. There is a large potential to minimize production costs through automated scheduling techniques, due to the large-scale requirements of modern factories. In the past, solution approaches have been studied for many machine scheduling variations, where even basic variants have been shown to be NP-hard. However, in today's real-life production environments, additional complex precedence constraints and resource restrictions with calendars arise that must be fulfilled. These additional constraints cannot be tackled efficiently by existing solution techniques. Thus, there is a strong need to develop and analyze automated methods that can solve such real-life parallel machine scheduling scenarios. In this work, we introduce a novel variant of parallel machine scheduling with job precedences and calendar-based cumulative resource constraints that arises in real-life industrial use cases. A constraint modeling approach is proposed as an exact solution method for small scheduling scenarios together with state-of-the-art constraint-solving technology. Further, we propose a construction heuristic as well as a tailored metaheuristic using local search to efficiently tackle large-scale problem instances. This metaheuristic approach has been deployed and is currently being used in an industrial setting.

    https://arxiv.org/abs/2512.11864


    Explainable Adversarial-Robust Vision-Language-Action Model for Robotic Manipulation

    oai:arXiv.org:2512.11865v1

    arXiv:2512.11865v1 Announce Type: new Abstract: Smart farming has emerged as a key technology for advancing modern agriculture through automation and intelligent control. However, systems relying on RGB cameras for perception and robotic manipulators for control, common in smart farming, are vulnerable to photometric perturbations such as hue, illumination, and noise changes, which can cause malfunction under adversarial attacks. To address this issue, we propose an explainable adversarial-robust Vision-Language-Action model based on the OpenVLA-OFT framework. The model integrates an Evidence-3 module that detects photometric perturbations and generates natural language explanations of their causes and effects. Experiments show that the proposed model reduces Current Action L1 loss by 21.7% and Next Actions L1 loss by 18.4% compared to the baseline, demonstrating improved action prediction accuracy and explainability under adversarial conditions.

    https://arxiv.org/abs/2512.11865


    Phase transitions reveal hierarchical structure in deep neural networks

    oai:arXiv.org:2512.11866v1

    arXiv:2512.11866v1 Announce Type: new Abstract: Training Deep Neural Networks relies on the model converging on a high-dimensional, non-convex loss landscape toward a good minimum. Yet, much of the phenomenology of training remains ill understood. We focus on three seemingly disparate observations: the occurrence of phase transitions reminiscent of statistical physics, the ubiquity of saddle points, and phenomenon of mode connectivity relevant for model merging. We unify these within a single explanatory framework, the geometry of the loss and error landscapes. We analytically show that phase transitions in DNN learning are governed by saddle points in the loss landscape. Building on this insight, we introduce a simple, fast, and easy to implement algorithm that uses the L2 regularizer as a tool to probe the geometry of error landscapes. We apply it to confirm mode connectivity in DNNs trained on the MNIST dataset by efficiently finding paths that connect global minima. We then show numerically that saddle points induce transitions between models that encode distinct digit classes. Our work establishes the geometric origin of key training phenomena in DNNs and reveals a hierarchy of accuracy basins analogous to phases in statistical physics.

    https://arxiv.org/abs/2512.11866


    On the Dangers of Bootstrapping Generation for Continual Learning and Beyond

    oai:arXiv.org:2512.11867v1

    arXiv:2512.11867v1 Announce Type: new Abstract: The use of synthetically generated data for training models is becoming a common practice. While generated data can augment the training data, repeated training on synthetic data raises concerns about distribution drift and degradation of performance due to contamination of the dataset. We investigate the consequences of this bootstrapping process through the lens of continual learning, drawing a connection to Generative Experience Replay (GER) methods. We present a statistical analysis showing that synthetic data introduces significant bias and variance into training objectives, weakening the reliability of maximum likelihood estimation. We provide empirical evidence showing that popular generative models collapse under repeated training with synthetic data. We quantify this degradation and show that state-of-the-art GER methods fail to maintain alignment in the latent space. Our findings raise critical concerns about the use of synthetic data in continual learning.

    https://arxiv.org/abs/2512.11867


    Industrial AI Robustness Card: Evaluating and Monitoring Time Series Models

    oai:arXiv.org:2512.11868v1

    arXiv:2512.11868v1 Announce Type: new Abstract: Industrial AI practitioners face vague robustness requirements in emerging regulations and standards but lack concrete, implementation ready protocols. This paper introduces the Industrial AI Robustness Card (IARC), a lightweight, task agnostic protocol for documenting and evaluating the robustness of AI models on industrial time series. The IARC specifies required fields and an empirical measurement and reporting protocol that combines drift monitoring, uncertainty quantification, and stress tests, and it maps these to relevant EU AI Act obligations. A soft sensor case study on a biopharmaceutical fermentation process illustrates how the IARC supports reproducible robustness evidence and continuous monitoring.

    https://arxiv.org/abs/2512.11868


    Temporal-Anchor3DLane: Enhanced 3D Lane Detection with Multi-Task Losses and LSTM Fusion

    oai:arXiv.org:2512.11869v1

    arXiv:2512.11869v1 Announce Type: new Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity, occlusion, and temporal instability across frames. Anchor-based approaches such as Anchor3DLane have demonstrated strong performance by regressing continuous 3D lane curves from multi-camera surround views. However, the baseline model still exhibits (i) sensitivity to regression outliers, (ii) weak supervision of global curve geometry, (iii) difficulty in balancing multiple loss terms, and (iv) limited exploitation of temporal continuity. We propose Temporal-Anchor3DLane, an enhanced 3D lane detection framework that extends Anchor3DLane with three key contributions: (1) a set of multi-task loss improvements, including Balanced L1 regression, Chamfer point-set distance, and uncertainty-based loss weighting, together with focal and Dice components for classification and visibility; (2) a lightweight Temporal LSTM Fusion module that aggregates per-anchor features across frames, replacing a heavier Transformer-style temporal fusion; and (3) ESCOP-style training refinements that couple curve-level supervision with temporal consistency. On OpenLane, Temporal-Anchor3DLane improves F1 by +6.2 and yields smoother temporal trajectories, showing that small architectural and loss refinements significantly enhance 3D lane robustness without extra sensors or scaling.

    https://arxiv.org/abs/2512.11869


    Using Socio-economic Indicators, Smart Transit Systems, and Urban Simulator to Accelerate ZEV Adoption and Reduce VMT

    oai:arXiv.org:2512.11870v1

    arXiv:2512.11870v1 Announce Type: new Abstract: Globally, on-road transportation accounts for 15% of greenhouse gas (GHG) emissions and an estimated 385,000 premature deaths from PM2.5. Cities play a critical role in meeting IPCC targets, generating 75% of global energy-related GHG emissions. In Houston, Texas, on-road transportation represents 48% of baseline emissions in the Climate Action Plan (CAP). To reach net-zero by 2050, the CAP targets a 70% emissions reduction from a 2014 baseline, offset by 30% renewable energy. This goal is challenging because Houston is low-density and auto-dependent, with 89% of on-road emissions from cars and small trucks and limited public transit usage. Socio-economic disparities further constrain Zero Emissions Vehicle (ZEV) adoption. Strategies focus on expanding ZEV access and reducing Vehicle Miles Traveled (VMT) by 20% through transit improvements and city design. This paper presents methods for establishing an on-road emissions baseline and evaluating policies that leverage socio-economic indicators and Intelligent Transportation Systems (ITS) to accelerate ZEV adoption and reduce VMT. Smart parking, transit incentives, secure data systems, and ZEV fleet management support improvements in modal split and system reliability. Policy options are analyzed and potential actions identified. To support evaluation, a simulation environment was developed in Unity 3D, enabling dynamic modeling of urban mobility and visualization of policy scenarios. Auto-dependent cities aiming for 2050 emission targets can benefit from the indicators, metrics, and technologies discussed.

    https://arxiv.org/abs/2512.11870


    Automated Plant Disease and Pest Detection System Using Hybrid Lightweight CNN-MobileViT Models for Diagnosis of Indigenous Crops

    oai:arXiv.org:2512.11871v1

    arXiv:2512.11871v1 Announce Type: new Abstract: Agriculture supports over 80% of the population in the Tigray region of Ethiopia, where infrastructural disruptions limit access to expert crop disease diagnosis. We present an offline-first detection system centered on a newly curated indigenous cactus-fig (Opuntia ficus-indica) dataset consisting of 3,587 field images across three core symptom classes. Given deployment constraints in post-conflict edge environments, we benchmark three mobile-efficient architectures: a custom lightweight CNN, EfficientNet-Lite1, and the CNN-Transformer hybrid MobileViT-XS. While the broader system contains independent modules for potato, apple, and corn, this study isolates cactus-fig model performance to evaluate attention sensitivity and inductive bias transfer on indigenous morphology alone. Results establish a clear Pareto trade-off: EfficientNet-Lite1 achieves 90.7% test accuracy, the lightweight CNN reaches 89.5% with the most favorable deployment profile (42 ms inference latency, 4.8 MB model size), and MobileViT-XS delivers 97.3% mean cross-validation accuracy, demonstrating that MHSA-based global reasoning disambiguates pest clusters from two dimensional fungal lesions more reliably than local texture CNN kernels. The ARM compatible models are deployed in a Tigrigna and Amharic localized Flutter application supporting fully offline inference on Cortex-A53 class devices, strengthening inclusivity for food security critical diagnostics.

    https://arxiv.org/abs/2512.11871


    WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

    oai:arXiv.org:2512.11872v1

    arXiv:2512.11872v1 Announce Type: new Abstract: End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: https://github.com/fudan-generative-vision/WAM-Diff

    https://arxiv.org/abs/2512.11872


    Audio-Based Tactile Human-Robot Interaction Recognition

    oai:arXiv.org:2512.11873v1

    arXiv:2512.11873v1 Announce Type: new Abstract: This study explores the use of microphones placed on a robot's body to detect tactile interactions via sounds produced when the hard shell of the robot is touched. This approach is proposed as an alternative to traditional methods using joint torque sensors or 6-axis force/torque sensors. Two Adafruit I2S MEMS microphones integrated with a Raspberry Pi 4 were positioned on the torso of a Pollen Robotics Reachy robot to capture audio signals from various touch types on the robot arms (tapping, knocking, rubbing, stroking, scratching, and pressing). A convolutional neural network was trained for touch classification on a dataset of 336 pre-processed samples (48 samples per touch type). The model shows high classification accuracy between touch types with distinct acoustic dominant frequencies.

    https://arxiv.org/abs/2512.11873


    Pseudo-Label Refinement for Robust Wheat Head Segmentation via Two-Stage Hybrid Training

    oai:arXiv.org:2512.11874v1

    arXiv:2512.11874v1 Announce Type: new Abstract: This extended abstract details our solution for the Global Wheat Full Semantic Segmentation Competition. We developed a systematic self-training framework. This framework combines a two-stage hybrid training strategy with extensive data augmentation. Our core model is SegFormer with a Mix Transformer (MiT-B4) backbone. We employ an iterative teacher-student loop. This loop progressively refines model accuracy. It also maximizes data utilization. Our method achieved competitive performance. This was evident on both the Development and Testing Phase datasets.

    https://arxiv.org/abs/2512.11874


    The Art of Storytelling in Authoritarian Regimes: Crafting State Narratives on Chinese Social Media

    oai:arXiv.org:2512.11875v1

    arXiv:2512.11875v1 Announce Type: new Abstract: This article examines how authoritarian regimes construct state narratives about politically consequential events. Building on the narrative policy framework and existing research on authoritarian propaganda, we propose two dimensions that shape narrative construction: legitimacy implications -- whether events enhance or threaten regime legitimacy, and citizen verification capacity -- the extent to which citizens can evaluate official narratives through alternative sources. Using quantitative narrative analysis of Chinese social media posts by government, state media, and celebrity accounts, we extract subject-verb-object (SVO) triplets to map dominant narrative structures across four major events. Our findings show that legitimacy implications of the event shape regime's efforts in storytelling and the beliefs highlighted in the narratives, while citizen's verification capacity could balance the strategic choice between a top-down manipulation and bottom-up responsiveness of state narratives. Together, the results reveal propaganda as a complex process of narrative construction adaptive to specific contexts, offering new insights into how dynamic storytelling sustains authoritarian resilience.

    https://arxiv.org/abs/2512.11875


    Traversability Aware Autonomous Navigation for Multi-Modal Mobility Morphobot (M4)

    oai:arXiv.org:2512.11876v1

    arXiv:2512.11876v1 Announce Type: new Abstract: Autonomous navigation in unstructured environments requires robots to assess terrain difficulty in real-time and plan paths that balance efficiency with safety. This thesis presents a traversability-aware navigation framework for the M4 robot platform that uses learned terrain analysis to generate energy-efficient paths avoiding difficult terrain.Our approach uses FAST-LIO for real-time localization, generating 2.5D elevation maps from LiDAR point clouds. A CNN-based model processes these elevation maps to estimate traversability scores, which are converted into navigation costs for path planning. A custom A* planner incorporates these costs alongside geometric distance and energy consumption to find paths that trade modest distance increases for substantial terrain quality improvements. Before system development, a platform-agnostic study compared LiDAR-based and camera-based SLAM using OptiTrack ground truth. Point cloud comparison through ICP alignment and cloud-to-mesh distance analysis demonstrated that LiDAR-based mapping achieves centimeter-level precision essential for elevation mapping, while camera-based approaches exhibited significantly higher geometric error. These findings directly resulted in the selection of LiDAR as the primary sensor to generate elevation maps. The complete pipeline integrates FAST-LIO localization, GPU-accelerated elevation mapping, CNN-based traversability estimation, and Nav2 navigation with a custom traversability-aware planner. Experimental results demonstrate that the system successfully avoids low traversability regions and accepts a few longer paths to achieve a reduction in terrain cost. This work establishes a foundation for intelligent terrain-aware navigation applicable to multi-modal robotic platforms.

    https://arxiv.org/abs/2512.11876


    A Technical Policy Blueprint for Trustworthy Decentralized AI

    oai:arXiv.org:2512.11878v1

    arXiv:2512.11878v1 Announce Type: new Abstract: Decentralized AI systems, such as federated learning, can play a critical role in further unlocking AI asset marketplaces (e.g., healthcare data marketplaces) thanks to increased asset privacy protection. Unlocking this big potential necessitates governance mechanisms that are transparent, scalable, and verifiable. However current governance approaches rely on bespoke, infrastructure-specific policies that hinder asset interoperability and trust among systems. We are proposing a Technical Policy Blueprint that encodes governance requirements as policy-as-code objects and separates asset policy verification from asset policy enforcement. In this architecture the Policy Engine verifies evidence (e.g., identities, signatures, payments, trusted-hardware attestations) and issues capability packages. Asset Guardians (e.g. data guardians, model guardians, computation guardians, etc.) enforce access or execution solely based on these capability packages. This core concept of decoupling policy processing from capabilities enables governance to evolve without reconfiguring AI infrastructure, thus creating an approach that is transparent, auditable, and resilient to change.

    https://arxiv.org/abs/2512.11878


    It's About Time: The Temporal and Modal Dynamics of Copilot Usage

    oai:arXiv.org:2512.11879v1

    arXiv:2512.11879v1 Announce Type: new Abstract: We analyze 37.5 million deidentified conversations with Microsoft's Copilot between January and September 2025. Unlike prior analyses of AI usage, we focus not just on what people do with AI, but on how and when they do it. We find that how people use AI depends fundamentally on context and device type. On mobile, health is the dominant topic, which is consistent across every hour and every month we observed - with users seeking not just information but also advice. On desktop, the pattern is strikingly different: work and technology dominate during business hours, with "Work and Career" overtaking "Technology" as the top topic precisely between 8 a.m. and 5 p.m. These differences extend to temporal rhythms: programming queries spike on weekdays while gaming rises on weekends, philosophical questions climb during late-night hours, and relationship conversations surge on Valentine's Day. These patterns suggest that users have rapidly integrated AI into the full texture of their lives, as a work aid at their desks and a companion on their phones.

    https://arxiv.org/abs/2512.11879


    An Experience Report on a Pedagogically Controlled, Curriculum-Constrained AI Tutor for SE Education

    oai:arXiv.org:2512.11882v1

    arXiv:2512.11882v1 Announce Type: new Abstract: The integration of artificial intelligence (AI) into education continues to evoke both promise and skepticism. While past waves of technological optimism often fell short, recent advances in large language models (LLMs) have revived the vision of scalable, individualized tutoring. This paper presents the design and pilot evaluation of RockStartIT Tutor, an AI-powered assistant developed for a digital programming and computational thinking course within the RockStartIT initiative. Powered by GPT-4 via OpenAI's Assistant API, the tutor employs a novel prompting strategy and a modular, semantically tagged knowledge base to deliver context-aware, personalized, and curriculum-constrained support for secondary school students. We evaluated the system using the Technology Acceptance Model (TAM) with 13 students and teachers. Learners appreciated the low-stakes environment for asking questions and receiving scaffolded guidance. Educators emphasized the system's potential to reduce cognitive load during independent tasks and complement classroom teaching. Key challenges include prototype limitations, a small sample size, and the need for long-term studies with the target age group. Our findings highlight a pragmatic approach to AI integration that requires no model training, using structure and prompts to shape behavior. We position AI tutors not as teacher replacements but as enabling tools that extend feedback access, foster inquiry, and support what schools do best: help students learn.

    https://arxiv.org/abs/2512.11882


    Aesthetic Alignment Risks Assimilation: How Image Generation and Reward Models Reinforce Beauty Bias and Ideological "Censorship"

    oai:arXiv.org:2512.11883v1

    arXiv:2512.11883v1 Announce Type: new Abstract: Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when ``anti-aesthetic" outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. We find that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.

    https://arxiv.org/abs/2512.11883


    Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

    oai:arXiv.org:2512.11884v1

    arXiv:2512.11884v1 Announce Type: new Abstract: Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task-specific fine-tuning and generalist foundation models capable of zero-shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine-tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows IoU choices can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero-shot mode. However, YOLO exhibits steep degradation 48-50 points across IoU ranges whereas SAM3 drops only 4 points, revealing 12 times superior boundary stability of SAM3. This highlights the strength of SAMv3 in mask precision versus specialization in detection completeness of YOLO11. We provide open-source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine-tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as https://github.com/Applied-AI-Research-Lab/Segment-Anything-Model-SAM3-Zero-Shot-Segmentation-Against-Fine-Tuned-YOLO-Detectors

    https://arxiv.org/abs/2512.11884


    Enabling Autonomous Navigation in a Snake Robot through Visual-Inertial Odometry and Closed-Loop Trajectory Tracking Control

    oai:arXiv.org:2512.11886v1

    arXiv:2512.11886v1 Announce Type: new Abstract: Snake robots offer exceptional mobility across extreme terrain inaccessible to conventional rovers, yet their highly articulated bodies present fundamental challenges for autonomous navigation in environments lacking external tracking infrastructure. This thesis develops a complete autonomy pipeline for COBRA, an 11 degree-of-freedom modular snake robot designed for planetary exploration. While the robot's biologically inspired serpentine gaits achieve impressive mobility, prior work has relied entirely on open-loop teleoperation. This approach integrates onboard visual-inertial SLAM, reduced-order state estimation, and closed-loop trajectory tracking to enable autonomous waypoint navigation. A depth camera paired with edge computing performs real-time localization during dynamic locomotion, validated against motion-capture ground truth to characterize drift behavior and failure modes unique to snake robot platforms. A reduced-order framework estimates Center-of-Mass pose, driving a closed-loop controller that modulates CPG gait parameters through distance-dependent yaw error blending. Physical experiments validate the complete system, demonstrating accurate multi-waypoint tracking and establishing foundations for autonomous snake robot navigation.

    https://arxiv.org/abs/2512.11886


    Advancing Autonomous Driving System Testing: Demands, Challenges, and Future Directions

    oai:arXiv.org:2512.11887v1

    arXiv:2512.11887v1 Announce Type: new Abstract: Autonomous driving systems (ADSs) promise improved transportation efficiency and safety, yet ensuring their reliability in complex real-world environments remains a critical challenge. Effective testing is essential to validate ADS performance and reduce deployment risks. This study investigates current ADS testing practices for both modular and end-to-end systems, identifies key demands from industry practitioners and academic researchers, and analyzes the gaps between existing research and real-world requirements. We review major testing techniques and further consider emerging factors such as Vehicle-to-Everything (V2X) communication and foundation models, including large language models and vision foundation models, to understand their roles in enhancing ADS testing. We conducted a large-scale survey with 100 participants from both industry and academia. Survey questions were refined through expert discussions, followed by quantitative and qualitative analyses to reveal key trends, challenges, and unmet needs. Our results show that existing ADS testing techniques struggle to comprehensively evaluate real-world performance, particularly regarding corner case diversity, the simulation to reality gap, the lack of systematic testing criteria, exposure to potential attacks, practical challenges in V2X deployment, and the high computational cost of foundation model-based testing. By further analyzing participant responses together with 105 representative studies, we summarize the current research landscape and highlight major limitations. This study consolidates critical research gaps in ADS testing and outlines key future research directions, including comprehensive testing criteria, cross-model collaboration in V2X systems, cross-modality adaptation for foundation model-based testing, and scalable validation frameworks for large-scale ADS evaluation.

    https://arxiv.org/abs/2512.11887


    Automation as a Catalyst for Geothermal Energy Adoption in Qatar: A Techno-Economic and Environmental Assessment

    oai:arXiv.org:2512.11890v1

    arXiv:2512.11890v1 Announce Type: new Abstract: Geothermal energy provides continuous low emission potential but is underused in Qatar because of high capital costs, drilling risks, and uncertainty in subsurface conditions. This study examines how automation can improve the techno economic and environmental feasibility of geothermal deployment through three pathways: Enhanced Geothermal Systems in the Dukhan Basin, repurposed oil and gas wells, and ground source heat pumps for district cooling. Using geological datasets and financial modeling, the analysis shows that full automation reduces capital expenditure by 12 to 14 percent and operating expenditure by 14 to 17 percent. The Levelized Cost of Energy decreases from 145 USD per MWh to 125 USD per MWh, and payback periods shorten by up to two years. Environmental results indicate that geothermal substitution can avoid between 4000 and 17600 tons of CO2 per year for each project. Automation also reduces uncertainty in investment outcomes based on Monte Carlo simulations. Overall, the results show that automation strengthens the economic viability of geothermal systems and supports their integration into Qatars long term energy diversification and decarbonization strategies.

    https://arxiv.org/abs/2512.11890


    VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

    oai:arXiv.org:2512.11891v1

    arXiv:2512.11891v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in generalizing across diverse robotic manipulation tasks. However, deploying these models in unstructured environments remains challenging due to the critical need for simultaneous task compliance and safety assurance, particularly in preventing potential collisions during physical interactions. In this work, we introduce a Vision-Language-Safe Action (VLSA) architecture, named AEGIS, which contains a plug-and-play safety constraint (SC) layer formulated via control barrier functions. AEGIS integrates directly with existing VLA models to improve safety with theoretical guarantees, while maintaining their original instruction-following performance. To evaluate the efficacy of our architecture, we construct a comprehensive safety-critical benchmark SafeLIBERO, spanning distinct manipulation scenarios characterized by varying degrees of spatial complexity and obstacle intervention. Extensive experiments demonstrate the superiority of our method over state-of-the-art baselines. Notably, AEGIS achieves a 59.16% improvement in obstacle avoidance rate while substantially increasing the task execution success rate by 17.25%. To facilitate reproducibility and future research, we make our code, models, and the benchmark datasets publicly available at https://vlsa-aegis.github.io/.

    https://arxiv.org/abs/2512.11891


    Should AI Become an Intergenerational Civil Right?

    oai:arXiv.org:2512.11892v1

    arXiv:2512.11892v1 Announce Type: new Abstract: Artificial Intelligence (AI) is rapidly becoming a foundational layer of social, economic, and cognitive infrastructure. At the same time, the training and large-scale deployment of AI systems rely on finite and unevenly distributed energy, networking, and computational resources. This tension exposes a largely unexamined problem in current AI governance: while expanding access to AI is essential for social inclusion and equal opportunity, unconstrained growth in AI use risks unsustainable resource consumption, whereas restricting access threatens to entrench inequality and undermine basic rights. This paper argues that access to AI outputs largely derived from publicly produced knowledge should not be treated solely as a commercial service, but as a fundamental civil interest requiring explicit protection. We show that existing regulatory frameworks largely ignore the coupling between equitable access and resource constraints, leaving critical questions of fairness, sustainability, and long-term societal impact unresolved. To address this gap, we propose recognizing access to AI as an \emph{Intergenerational Civil Right}, establishing a legal and ethical framework that simultaneously safeguards present-day inclusion and the rights of future generations. Beyond normative analysis, we explore how this principle can be technically realized. Drawing on emerging paradigms in IoT--Edge--Cloud computing, decentralized inference, and energy-aware networking, we outline technological trajectories and a strawman architecture for AI Delivery Networks that support equitable access under strict resource constraints. By framing AI as a shared social infrastructure rather than a discretionary market commodity, this work connects governance principles with concrete system design choices, offering a pathway toward AI deployment that is both socially just and environmentally sustainable.

    https://arxiv.org/abs/2512.11892


    Beyond Automation: Rethinking Work, Creativity, and Governance in the Age of Generative AI

    oai:arXiv.org:2512.11893v1

    arXiv:2512.11893v1 Announce Type: new Abstract: The accelerating advancement of generative artificial intelligence (AI) systems is reshaping the nature, distribution and meaning of work, creativity, and economic security. This paper investigates four inter-related phenomena in the current AI era: (1) the evolving landscape of employment and the future of work; (2) the diverse patterns of AI adoption across socio-demographic groups, sectors, and geographies; (3) whether universal basic income (UBI) should become a compulsory policy response to the AI revolution; and (4) the implications of AI content policies and model behaviours for human creativity, wellbeing, and everyday decision-making. Furthermore, the paper tests the hypothesis that newer model generations may perform worse than their predecessors, and examines how users' interactions with AI systems may produce echo chambers through sycophantic model alignment. Using a mixed methodology that integrates labour market task-exposure modelling, sectoral diffusion mapping, policy-framework analysis, and qualitative discourse critique, this study develops a comprehensive framework for understanding the societal consequences of AI systems beyond productivity gains. It argues that to foster an inclusive, meaningful, and creative environment, policymakers must treat UBI as one dimension within a broader ecosystem of governance, skills development, creativity preservation, and model design. The paper concludes by outlining future research directions, including systematic evaluation of AI's creative performance across model generations, construction of a taxonomy of AI-usage distribution and equity, and formulation of governance criteria to balance content restrictions with creative freedom.

    https://arxiv.org/abs/2512.11893


    mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

    oai:arXiv.org:2512.11894v1

    arXiv:2512.11894v1 Announce Type: new Abstract: Realistic signal generation and dataset augmentation are essential for advancing mmWave radar applications such as activity recognition and pose estimation, which rely heavily on diverse, and environment-specific signal datasets. However, mmWave signals are inherently complex, sparse, and high-dimensional, making physical simulation computationally expensive. This paper presents mmWeaver, a novel framework that synthesizes realistic, environment-specific complex mmWave signals by modeling them as continuous functions using Implicit Neural Representations (INRs), achieving up to 49-fold compression. mmWeaver incorporates hypernetworks that dynamically generate INR parameters based on environmental context (extracted from RGB-D images) and human motion features (derived from text-to-pose generation via MotionGPT), enabling efficient and adaptive signal synthesis. By conditioning on these semantic and geometric priors, mmWeaver generates diverse I/Q signals at multiple resolutions, preserving phase information critical for downstream tasks such as point cloud estimation and activity classification. Extensive experiments show that mmWeaver achieves a complex SSIM of 0.88 and a PSNR of 35 dB, outperforming existing methods in signal realism while improving activity recognition accuracy by up to 7% and reducing human pose estimation error by up to 15%, all while operating 6-35 times faster than simulation-based approaches.

    https://arxiv.org/abs/2512.11894


    Hot H\'em: S\`ai G\`on Gi\~ua C\'ai N\'ong H\^ong C\`ong B\`ang -- Saigon in Unequal Heat

    oai:arXiv.org:2512.11896v1

    arXiv:2512.11896v1 Announce Type: new Abstract: Pedestrian heat exposure is a critical health risk in dense tropical cities, yet standard routing algorithms often ignore micro-scale thermal variation. Hot H\'em is a GeoAI workflow that estimates and operationalizes pedestrian heat exposure in H\^o Ch\'i Minh City (HCMC), Vi\d{e}t Nam, colloquially known as S\`ai G\`on. This spatial data science pipeline combines Google Street View (GSV) imagery, semantic image segmentation, and remote sensing. Two XGBoost models are trained to predict land surface temperature (LST) using a GSV training dataset in selected administrative wards, known as ph\u{o}ng, and are deployed in a patchwork manner across all OSMnx-derived pedestrian network nodes to enable heat-aware routing. This is a model that, when deployed, can provide a foundation for pinpointing where and further understanding why certain city corridors may experience disproportionately higher temperatures at an infrastructural scale.

    https://arxiv.org/abs/2512.11896


    Microscopic Vehicle Trajectory Datasets from UAV-collected Video for Heterogeneous, Area-Based Urban Traffic

    oai:arXiv.org:2512.11898v1

    arXiv:2512.11898v1 Announce Type: new Abstract: This paper offers openly available microscopic vehicle trajectory (MVT) datasets collected using unmanned aerial vehicles (UAVs) in heterogeneous, area-based urban traffic conditions. Traditional roadside video collection often fails in dense mixed traffic due to occlusion, limited viewing angles, and irregular vehicle movements. UAV-based recording provides a top-down perspective that reduces these issues and captures rich spatial and temporal dynamics. The datasets described here were extracted using the Data from Sky (DFS) platform and validated against manual counts, space mean speeds, and probe trajectories in earlier work. Each dataset contains time-stamped vehicle positions, speeds, longitudinal and lateral accelerations, and vehicle classifications at a resolution of 30 frames per second. Data were collected at six mid-block locations in the national capital region of India, covering diverse traffic compositions and density levels. Exploratory analyses highlight key behavioural patterns, including lane-keeping preferences, speed distributions, and lateral manoeuvres typical of heterogeneous and area-based traffic settings. These datasets are intended as a resource for the global research community to support simulation modelling, safety assessment, and behavioural studies under area-based traffic conditions. By making these empirical datasets openly available, this work offers researchers a unique opportunity to develop, test, and validate models that more accurately represent complex urban traffic environments.

    https://arxiv.org/abs/2512.11898


    Read or Ignore? A Unified Benchmark for Typographic-Attack Robustness and Text Recognition in Vision-Language Models

    oai:arXiv.org:2512.11899v1

    arXiv:2512.11899v1 Announce Type: new Abstract: Large vision-language models (LVLMs) are vulnerable to typographic attacks, where misleading text within an image overrides visual understanding. Existing evaluation protocols and defenses, largely focused on object recognition, implicitly encourage ignoring text to achieve robustness; however, real-world scenarios often require joint reasoning over both objects and text (e.g., recognizing pedestrians while reading traffic signs). To address this, we introduce a novel task, Read-or-Ignore VQA (RIO-VQA), which formalizes selective text use in visual question answering (VQA): models must decide, from context, when to read text and when to ignore it. For evaluation, we present the Read-or-Ignore Benchmark (RIO-Bench), a standardized dataset and protocol that, for each real image, provides same-scene counterfactuals (read / ignore) by varying only the textual content and question type. Using RIO-Bench, we show that strong LVLMs and existing defenses fail to balance typographic robustness and text-reading capability, highlighting the need for improved approaches. Finally, RIO-Bench enables a novel data-driven defense that learns adaptive selective text use, moving beyond prior non-adaptive, text-ignoring defenses. Overall, this work reveals a fundamental misalignment between the existing evaluation scope and real-world requirements, providing a principled path toward reliable LVLMs. Our Project Page is at https://turingmotors.github.io/rio-vqa/.

    https://arxiv.org/abs/2512.11899


    Data-driven Interpretable Hybrid Robot Dynamics

    oai:arXiv.org:2512.11900v1

    arXiv:2512.11900v1 Announce Type: new Abstract: We study data-driven identification of interpretable hybrid robot dynamics, where an analytical rigid-body dynamics model is complemented by a learned residual torque term. Using symbolic regression and sparse identification of nonlinear dynamics (SINDy), we recover compact closed-form expressions for this residual from joint-space data. In simulation on a 7-DoF Franka arm with known dynamics, these interpretable models accurately recover inertial, Coriolis, gravity, and viscous effects with very small relative error and outperform neural-network baselines in both accuracy and generalization. On real data from a 7-DoF WAM arm, symbolic-regression residuals generalize substantially better than SINDy and neural networks, which tend to overfit, and suggest candidate new closed-form formulations that extend the nominal dynamics model for this robot. Overall, the results indicate that interpretable residual dynamics models provide compact, accurate, and physically meaningful alternatives to black-box function approximators for torque prediction.

    https://arxiv.org/abs/2512.11900


    CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities

    oai:arXiv.org:2512.11901v1

    arXiv:2512.11901v1 Announce Type: new Abstract: We introduce CLARGA, a general-purpose multimodal fusion architecture for multimodal representation learning that works with any number and type of modalities without changing the underlying framework. Given a supervised dataset, CLARGA can be applied to virtually any machine learning task to fuse different multimodal representations for processing by downstream layers. On a sample-by-sample basis, CLARGA learns how modalities should inform one another by building an attention weighted graph over their features and passing messages along this graph with a multi-head Graph Attention Network. Not only does this make CLARGA highly adaptive, as it constructs unique graphs for different samples, it makes for efficient fusion with sub-quadratic complexity as the number of modalities grows. Through a learnable mask, it can also adapt to missing modality inputs. The model is trained with a hybrid objective that combines a supervised task loss with contrastive InfoNCE loss, improving cross-modal consistency and robustness to noisy inputs. We demonstrate CLARGA's effectiveness in diverse multimodal representation learning tasks across 7 datasets spanning finance, human-computer interaction, general multimedia classification, and affective computing. It consistently outperforms baselines, state-of-the-art models, and ablations. Additional experiments also demonstrate its robustness to missing inputs and ability to excel on niche tasks. Overall, CLARGA can be easily plugged into machine learning models for effective and efficient learning of representations across a wide variety of tasks.

    https://arxiv.org/abs/2512.11901


    Mirror Mode in Fire Emblem: Beating Players at their own Game with Imitation and Reinforcement Learning

    oai:arXiv.org:2512.11902v1

    arXiv:2512.11902v1 Announce Type: new Abstract: Enemy strategies in turn-based games should be surprising and unpredictable. This study introduces Mirror Mode, a new game mode where the enemy AI mimics the personal strategy of a player to challenge them to keep changing their gameplay. A simplified version of the Nintendo strategy video game Fire Emblem Heroes has been built in Unity, with a Standard Mode and a Mirror Mode. Our first set of experiments find a suitable model for the task to imitate player demonstrations, using Reinforcement Learning and Imitation Learning: combining Generative Adversarial Imitation Learning, Behavioral Cloning, and Proximal Policy Optimization. The second set of experiments evaluates the constructed model with player tests, where models are trained on demonstrations provided by participants. The gameplay of the participants indicates good imitation in defensive behavior, but not in offensive strategies. Participant's surveys indicated that they recognized their own retreating tactics, and resulted in an overall higher player-satisfaction for Mirror Mode. Refining the model further may improve imitation quality and increase player's satisfaction, especially when players face their own strategies. The full code and survey results are stored at: https://github.com/YannaSmid/MirrorMode

    https://arxiv.org/abs/2512.11902


    Aion: Towards Hierarchical 4D Scene Graphs with Temporal Flow Dynamics

    oai:arXiv.org:2512.11903v1

    arXiv:2512.11903v1 Announce Type: new Abstract: Autonomous navigation in dynamic environments requires spatial representations that capture both semantic structure and temporal evolution. 3D Scene Graphs (3DSGs) provide hierarchical multi-resolution abstractions that encode geometry and semantics, but existing extensions toward dynamics largely focus on individual objects or agents. In parallel, Maps of Dynamics (MoDs) model typical motion patterns and temporal regularities, yet are usually tied to grid-based discretizations that lack semantic awareness and do not scale well to large environments. In this paper we introduce Aion, a framework that embeds temporal flow dynamics directly within a hierarchical 3DSG, effectively incorporating the temporal dimension. Aion employs a graph-based sparse MoD representation to capture motion flows over arbitrary time intervals and attaches them to navigational nodes in the scene graph, yielding more interpretable and scalable predictions that improve planning and interaction in complex dynamic environments.

    https://arxiv.org/abs/2512.11903


    Smartphone monitoring of smiling as a behavioral proxy of well-being in everyday life

    oai:arXiv.org:2512.11905v1

    arXiv:2512.11905v1 Announce Type: new Abstract: Subjective well-being is a cornerstone of individual and societal health, yet its scientific measurement has traditionally relied on self-report methods prone to recall bias and high participant burden. This has left a gap in our understanding of well-being as it is expressed in everyday life. We hypothesized that candid smiles captured during natural smartphone interactions could serve as a scalable, objective behavioral correlate of positive affect. To test this, we analyzed 405,448 video clips passively recorded from 233 consented participants over one week. Using a deep learning model to quantify smile intensity, we identified distinct diurnal and daily patterns. Daily patterns of smile intensity across the week showed strong correlation with national survey data on happiness (r=0.92), and diurnal rhythms documented close correspondence with established results from the day reconstruction method (r=0.80). Higher daily mean smile intensity was significantly associated with more physical activity (Beta coefficient = 0.043, 95% CI [0.001, 0.085]) and greater light exposure (Beta coefficient = 0.038, [0.013, 0.063]), whereas no significant effects were found for smartphone use. These findings suggest that passive smartphone sensing could serve as a powerful, ecologically valid methodology for studying the dynamics of affective behavior and open the door to understanding this behavior at a population scale.

    https://arxiv.org/abs/2512.11905


    MPath: Multimodal Pathology Report Generation from Whole Slide Images

    oai:arXiv.org:2512.11906v1

    arXiv:2512.11906v1 Announce Type: new Abstract: Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.

    https://arxiv.org/abs/2512.11906


    Structured Personalization: Modeling Constraints as Matroids for Data-Minimal LLM Agents

    oai:arXiv.org:2512.11907v1

    arXiv:2512.11907v1 Announce Type: new Abstract: Personalizing Large Language Model (LLM) agents requires conditioning them on user-specific data, creating a critical trade-off between task utility and data disclosure. While the utility of adding user data often exhibits diminishing returns (i.e., submodularity), enabling near-optimal greedy selection, real-world personalization is complicated by structural constraints. These include logical dependencies (e.g., selecting fact A requires fact B), categorical quotas (e.g., select at most one writing style), and hierarchical rules (e.g., select at most two social media preferences, of which at most one can be for a professional network). These constraints violate the assumptions of standard subset selection algorithms. We propose a principled method to formally model such constraints. We introduce a compilation process that transforms a user's knowledge graph with dependencies into a set of abstract macro-facets. Our central result is a proof that common hierarchical and quota-based constraints over these macro-facets form a valid laminar matroid. This theoretical characterization lets us cast structured personalization as submodular maximization under a matroid constraint, enabling greedy with constant-factor guarantees (and (1-1/e) via continuous greedy) for a much richer and more realistic class of problems.

    https://arxiv.org/abs/2512.11907


    Safe Learning for Contact-Rich Robot Tasks: A Survey from Classical Learning-Based Methods to Safe Foundation Models

    oai:arXiv.org:2512.11908v1

    arXiv:2512.11908v1 Announce Type: new Abstract: Contact-rich tasks pose significant challenges for robotic systems due to inherent uncertainty, complex dynamics, and the high risk of damage during interaction. Recent advances in learning-based control have shown great potential in enabling robots to acquire and generalize complex manipulation skills in such environments, but ensuring safety, both during exploration and execution, remains a critical bottleneck for reliable real-world deployment. This survey provides a comprehensive overview of safe learning-based methods for robot contact-rich tasks. We categorize existing approaches into two main domains: safe exploration and safe execution. We review key techniques, including constrained reinforcement learning, risk-sensitive optimization, uncertainty-aware modeling, control barrier functions, and model predictive safety shields, and highlight how these methods incorporate prior knowledge, task structure, and online adaptation to balance safety and efficiency. A particular emphasis of this survey is on how these safe learning principles extend to and interact with emerging robotic foundation models, especially vision-language models (VLMs) and vision-language-action models (VLAs), which unify perception, language, and control for contact-rich manipulation. We discuss both the new safety opportunities enabled by VLM/VLA-based methods, such as language-level specification of constraints and multimodal grounding of safety signals, and the amplified risks and evaluation challenges they introduce. Finally, we outline current limitations and promising future directions toward deploying reliable, safety-aligned, and foundation-model-enabled robots in complex contact-rich environments. More details and materials are available at our \href{ https://github.com/jack-sherman01/Awesome-Learning4Safe-Contact-rich-tasks}{Project GitHub Repository}.

    https://arxiv.org/abs/2512.11908


    Causal Strengths and Leaky Beliefs: Interpreting LLM Reasoning via Noisy-OR Causal Bayes Nets

    oai:arXiv.org:2512.11909v1

    arXiv:2512.11909v1 Announce Type: new Abstract: The nature of intelligence in both humans and machines is a longstanding question. While there is no universally accepted definition, the ability to reason causally is often regarded as a pivotal aspect of intelligence (Lake et al., 2017). Evaluating causal reasoning in LLMs and humans on the same tasks provides hence a more comprehensive understanding of their respective strengths and weaknesses. Our study asks: (Q1) Are LLMs aligned with humans given the \emph{same} reasoning tasks? (Q2) Do LLMs and humans reason consistently at the task level? (Q3) Do they have distinct reasoning signatures? We answer these by evaluating 20+ LLMs on eleven semantically meaningful causal tasks formalized by a collider graph ($C_1\!\to\!E\!\leftarrow\!C_2$ ) under \emph{Direct} (one-shot number as response = probability judgment of query node being one and \emph{Chain of Thought} (CoT; think first, then provide answer). Judgments are modeled with a leaky noisy-OR causal Bayes net (CBN) whose parameters $\theta=(b,m_1,m_2,p(C)) \in [0,1]$ include a shared prior $p(C)$; we select the winning model via AIC between a 3-parameter symmetric causal strength ($m_1{=}m_2$) and 4-parameter asymmetric ($m_1{\neq}m_2$) variant.

    https://arxiv.org/abs/2512.11909


    Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis

    oai:arXiv.org:2512.11912v1

    arXiv:2512.11912v1 Announce Type: new Abstract: A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning information, which constrains the learning problem, and the absolute information content of the training data, which allows the signal from correct information to dominate statistical noise.

    https://arxiv.org/abs/2512.11912


    Financial Management Challenges in Enterprises Employing Remote and Hybrid Workforces

    oai:arXiv.org:2512.11918v1

    arXiv:2512.11918v1 Announce Type: new Abstract: The paper examines financial management challenges faced by organizations operating under remote and hybrid work models. It investigates how these flexible arrangements influence budgeting, reporting, and financial transparency in distributed teams. Using a quantitative survey of managers, HR staff, and finance professionals, the study analyzes the role of digital tools, communication, and organizational practices in shaping financial outcomes. Results indicate that remote and hybrid work can improve budget control and process transparency through the use of ERP systems and digital workflows. However, forecasting accuracy and interdepartmental communication remain major challenges, particularly in organizations with insufficient digital integration. Respondents also reported lower stress levels and improved work-life balance, suggesting potential well-being and productivity benefits. The paper recommends that companies enhance digital infrastructure, adopt advanced analytics for forecasting, and develop clear communication frameworks supported by employee well-being programs. The study contributes original empirical evidence on financial management in flexible work environments, offering practical insights for leaders navigating the digital transformation of finance.

    https://arxiv.org/abs/2512.11918


    CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

    oai:arXiv.org:2512.11920v1

    arXiv:2512.11920v1 Announce Type: new Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose \textbf{CXL-SpecKV}, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4$\times$. When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2$\times$ higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8$\times$ and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.

    https://arxiv.org/abs/2512.11920


    Towards Accessible Physical AI: LoRA-Based Fine-Tuning of VLA Models for Real-World Robot Control

    oai:arXiv.org:2512.11921v1

    arXiv:2512.11921v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in robotic manipulation,enabling robots to execute natural language commands through end-to-end learning from visual observations.However, deploying large-scale VLA models on affordable robotic platforms remains challenging due to computational constraints and the need for efficient adaptation to new robot embodiments. This paper presents an efficient fine-tuning methodology and real-world deployment analysis for adapting VLA models to low-cost robotic manipulation systems.We propose a resource-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA) and quantization techniques that enable multi-billion parameter VLA models ( 3.1B parameters) to run on consumer-grade GPUs with 8GB VRAM. Our methodology addresses the critical challenge of adapting pre-trained VLA models to new robot embodiments with limited demonstration data, focusing on the trade-offs between frozen and unfrozen vision encoders. Through real-world deployment on the SO101 robotic arm for a button-pressing manipulation task, we demonstrate that our approach achieves effective manipulation performance while maintaining computational efficiency. We provide detailed analysis of deployment challenges, failure modes, and the relationship between training data quantity and real-world performance,trained on 200 demonstration episodes. Our results show that with proper fine-tuning methodology, VLA models can be successfully deployed on affordable robotic platforms,making advanced manipulation capabilities accessible beyond expensive research robots.

    https://arxiv.org/abs/2512.11921


    Vibe Coding in Practice: Flow, Technical Debt, and Guidelines for Sustainable Use

    oai:arXiv.org:2512.11922v1

    arXiv:2512.11922v1 Announce Type: new Abstract: Vibe Coding (VC) is a form of software development assisted by generative AI, in which developers describe the intended functionality or logic via natural language prompts, and the AI system generates the corresponding source code. VC can be leveraged for rapid prototyping or developing the Minimum Viable Products (MVPs); however, it may introduce several risks throughout the software development life cycle. Based on our experience from several internally developed MVPs and a review of recent industry reports, this article analyzes the flow-debt tradeoffs associated with VC. The flow-debt trade-off arises when the seamless code generation occurs, leading to the accumulation of technical debt through architectural inconsistencies, security vulnerabilities, and increased maintenance overhead. These issues originate from process-level weaknesses, biases in model training data, a lack of explicit design rationale, and a tendency to prioritize quick code generation over human-driven iterative development. Based on our experiences, we identify and explain how current model, platform, and hardware limitations contribute to these issues, and propose countermeasures to address them, informing research and practice towards more sustainable VC approaches.

    https://arxiv.org/abs/2512.11922


    FloraForge: LLM-Assisted Procedural Generation of Editable and Analysis-Ready 3D Plant Geometric Models For Agricultural Applications

    oai:arXiv.org:2512.11925v1

    arXiv:2512.11925v1 Announce Type: new Abstract: Accurate 3D plant models are crucial for computational phenotyping and physics-based simulation; however, current approaches face significant limitations. Learning-based reconstruction methods require extensive species-specific training data and lack editability. Procedural modeling offers parametric control but demands specialized expertise in geometric modeling and an in-depth understanding of complex procedural rules, making it inaccessible to domain scientists. We present FloraForge, an LLM-assisted framework that enables domain experts to generate biologically accurate, fully parametric 3D plant models through iterative natural language Plant Refinements (PR), minimizing programming expertise. Our framework leverages LLM-enabled co-design to refine Python scripts that generate parameterized plant geometries as hierarchical B-spline surface representations with botanical constraints with explicit control points and parametric deformation functions. This representation can be easily tessellated into polygonal meshes with arbitrary precision, ensuring compatibility with functional structural plant analysis workflows such as light simulation, computational fluid dynamics, and finite element analysis. We demonstrate the framework on maize, soybean, and mung bean, fitting procedural models to empirical point cloud data through manual refinement of the Plant Descriptor (PD), human-readable files. The pipeline generates dual outputs: triangular meshes for visualization and triangular meshes with additional parametric metadata for quantitative analysis. This approach uniquely combines LLM-assisted template creation, mathematically continuous representations enabling both phenotyping and rendering, and direct parametric control through PD. The framework democratizes sophisticated geometric modeling for plant science while maintaining mathematical rigor.

    https://arxiv.org/abs/2512.11925


    TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder

    oai:arXiv.org:2512.11926v1

    arXiv:2512.11926v1 Announce Type: new Abstract: 3D object detection is essential in autonomous driving, providing vital information about moving objects and obstacles. Detecting objects in distant regions with only a few LiDAR points is still a challenge, and numerous strategies have been developed to address point cloud sparsity through densification.This paper presents a joint completion and detection framework that improves the detection feature in sparse areas while maintaining costs unchanged. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks.The detection network can benefit from acquiring implicit completion features derived from the completion network. Additionally, we design the Dynamic-Static Reconstruction (DSRecon) module to produce dense LiDAR data for the completion network, meeting the requirement for dense point cloud ground truth.Furthermore, we employ the transformer mechanism to establish connections between channels and spatial relations, resulting in a high-resolution feature map used for completion purposes.Extensive experiments on the nuScenes and Waymo datasets demonstrate the effectiveness of the proposed framework.The results show that our framework consistently improves end-to-end 3D object detection, with the mean average precision (mAP) ranging from 0.7 to 1.5 across multiple methods, indicating its generalization ability. For the two-stage detection framework, it also boosts the mAP up to 5.78 points.

    https://arxiv.org/abs/2512.11926


    MONET -- Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion

    oai:arXiv.org:2512.11928v1

    arXiv:2512.11928v1 Announce Type: new Abstract: Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell paint: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell paint channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell paint video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.

    https://arxiv.org/abs/2512.11928


    Evolutionary Reinforcement Learning based AI tutor for Socratic Interdisciplinary Instruction

    oai:arXiv.org:2512.11930v1

    arXiv:2512.11930v1 Announce Type: new Abstract: Cultivating higher-order cognitive abilities -- such as knowledge integration, critical thinking, and creativity -- in modern STEM education necessitates a pedagogical shift from passive knowledge transmission to active Socratic construction. Although Large Language Models (LLMs) hold promise for STEM Interdisciplinary education, current methodologies employing Prompt Engineering (PE), Supervised Fine-tuning (SFT), or standard Reinforcement Learning (RL) often fall short of supporting this paradigm. Existing methods are hindered by three fundamental challenges: the inability to dynamically model latent student cognitive states; severe reward sparsity and delay inherent in long-term educational goals; and a tendency toward policy collapse lacking strategic diversity due to reliance on behavioral cloning. Recognizing the unobservability and dynamic complexity of these interactions, we formalize the Socratic Interdisciplinary Instructional Problem (SIIP) as a structured Partially Observable Markov Decision Process (POMDP), demanding simultaneous global exploration and fine-grained policy refinement. To this end, we propose ERL4SIIP, a novel Evolutionary Reinforcement Learning (ERL) framework specifically tailored for this domain. ERL4SIIP integrates: (1) a dynamic student simulator grounded in a STEM knowledge graph for latent state modeling; (2) a Hierarchical Reward Mechanism that decomposes long-horizon goals into dense signals; and (3) a LoRA-Division based optimization strategy coupling evolutionary algorithms for population-level global search with PPO for local gradient ascent.

    https://arxiv.org/abs/2512.11930


    Mapping AI Risk Mitigations: Evidence Scan and Preliminary AI Risk Mitigation Taxonomy

    oai:arXiv.org:2512.11931v1

    arXiv:2512.11931v1 Announce Type: new Abstract: Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023-2025, which were extracted into a living database of 831 AI risk mitigations. The mitigations were iteratively clustered & coded to create the Taxonomy. The preliminary AI Risk Mitigation Taxonomy organizes mitigations into four categories and 23 subcategories: (1) Governance & Oversight: Formal organizational structures and policy frameworks that establish human oversight mechanisms and decision protocols; (2) Technical & Security: Technical, physical, and engineering safeguards that secure AI systems and constrain model behaviors; (3) Operational Process: processes and management frameworks governing AI system deployment, usage, monitoring, incident handling, and validation; and (4) Transparency & Accountability: formal disclosure practices and verification mechanisms that communicate AI system information and enable external scrutiny. The rapid evidence scan and taxonomy construction also revealed several cases where terms like 'risk management' and 'red teaming' are used widely but refer to different responsible actors, actions, and mechanisms of action to reduce risk. This Taxonomy and associated mitigation database, while preliminary, offers a starting point for collation and synthesis of AI risk mitigations. It also offers an accessible, structured way for different actors in the AI ecosystem to discuss and coordinate action to reduce risks from AI.

    https://arxiv.org/abs/2512.11931


    The Agentic Regulator: Risks for AI in Finance and a Proposed Agent-based Framework for Governance

    oai:arXiv.org:2512.11933v1

    arXiv:2512.11933v1 Announce Type: new Abstract: Generative and agentic artificial intelligence is entering financial markets faster than existing governance can adapt. Current model-risk frameworks assume static, well-specified algorithms and one-time validations; large language models and multi-agent trading systems violate those assumptions by learning continuously, exchanging latent signals, and exhibiting emergent behavior. Drawing on complex adaptive systems theory, we model these technologies as decentralized ensembles whose risks propagate along multiple time-scales. We then propose a modular governance architecture. The framework decomposes oversight into four layers of "regulatory blocks": (i) self-regulation modules embedded beside each model, (ii) firm-level governance blocks that aggregate local telemetry and enforce policy, (iii) regulator-hosted agents that monitor sector-wide indicators for collusive or destabilizing patterns, and (iv) independent audit blocks that supply third-party assurance. Eight design strategies enable the blocks to evolve as fast as the models they police. A case study on emergent spoofing in multi-agent trading shows how the layered controls quarantine harmful behavior in real time while preserving innovation. The architecture remains compatible with today's model-risk rules yet closes critical observability and control gaps, providing a practical path toward resilient, adaptive AI governance in financial systems.

    https://arxiv.org/abs/2512.11933


    Unveiling User Perceptions in the Generative AI Era: A Sentiment-Driven Evaluation of AI Educational Apps' Role in Digital Transformation of e-Teaching

    oai:arXiv.org:2512.11934v1

    arXiv:2512.11934v1 Announce Type: new Abstract: The rapid integration of generative artificial intelligence into education has driven digital transformation in e-teaching, yet user perceptions of AI educational apps remain underexplored. This study performs a sentiment-driven evaluation of user reviews from top AI ed-apps on the Google Play Store to assess efficacy, challenges, and pedagogical implications. Our pipeline involved scraping app data and reviews, RoBERTa for binary sentiment classification, GPT-4o for key point extraction, and GPT-5 for synthesizing top positive/negative themes. Apps were categorized into seven types (e.g., homework helpers, math solvers, language tools), with overlaps reflecting multifunctional designs. Results indicate predominantly positive sentiments, with homework apps like Edu AI (95.9% positive) and Answer.AI (92.7%) leading in accuracy, speed, and personalization, while language/LMS apps (e.g., Teacher AI at 21.8% positive) lag due to instability and limited features. Positives emphasize efficiency in brainstorming, problem-solving, and engagement; negatives center on paywalls, inaccuracies, ads, and glitches. Trends show that homework helpers outperform specialized tools, highlighting AI's democratizing potential amid risks of dependency and inequity. The discussion proposes future ecosystems with hybrid AI-human models, VR/AR for immersive learning, and a roadmap for developers (adaptive personalization) and policymakers (monetization regulation for inclusivity). This underscores generative AI's role in advancing e-teaching by enabling ethical refinements that foster equitable, innovative environments. The full dataset is available here(https://github.com/erfan-nourbakhsh/GenAI-EdSent).

    https://arxiv.org/abs/2512.11934


    AGAPI-Agents: An Open-Access Agentic AI Platform for Accelerated Materials Design on AtomGPT.org

    oai:arXiv.org:2512.11935v1

    arXiv:2512.11935v1 Announce Type: new Abstract: Artificial intelligence is reshaping scientific discovery, yet its use in materials research remains limited by fragmented computational ecosystems, reproducibility challenges, and dependence on commercial large language models (LLMs). Here we introduce AGAPI (AtomGPT.org API), an open-access agentic AI platform that integrates more than eight open-source LLMs with over twenty materials-science API endpoints, unifying databases, simulation tools, and machine-learning models through a common orchestration framework. AGAPI employs an Agent-Planner-Executor-Summarizer architecture that autonomously constructs and executes multi-step workflows spanning materials data retrieval, graph neural network property prediction, machine-learning force-field optimization, tight-binding calculations, diffraction analysis, and inverse design. We demonstrate AGAPI through end-to-end workflows, including heterostructure construction, powder X-ray diffraction analysis, and semiconductor defect engineering requiring up to ten sequential operations. In addition, we evaluate AGAPI using 30+ example prompts as test cases and compare agentic predictions with and without tool access against experimental data. With more than 1,000 active users, AGAPI provides a scalable and transparent foundation for reproducible, AI-accelerated materials discovery. AGAPI-Agents codebase is available at https://github.com/atomgptlab/agapi.

    https://arxiv.org/abs/2512.11935


    Contextual Peano Scan and Fast Image Segmentation Using Hidden and Evidential Markov Chains

    oai:arXiv.org:2512.11939v1

    arXiv:2512.11939v1 Announce Type: new Abstract: Transforming bi-dimensional sets of image pixels into mono-dimensional sequences with a Peano scan (PS) is an established technique enabling the use of hidden Markov chains (HMCs) for unsupervised image segmentation. Related Bayesian segmentation methods can compete with hidden Markov fields (HMFs)-based ones and are much faster. PS has recently been extended to the contextual PS, and some initial experiments have shown the value of the associated HMC model, denoted as HMC-CPS, in image segmentation. Moreover, HMCs have been extended to hidden evidential Markov chains (HEMCs), which are capable of improving HMC-based Bayesian segmentation. In this study, we introduce a new HEMC-CPS model by simultaneously considering contextual PS and evidential HMC. We show its effectiveness for Bayesian maximum posterior mode (MPM) segmentation using synthetic and real images. Segmentation is performed in an unsupervised manner, with parameters being estimated using the stochastic expectation--maximization (SEM) method. The new HEMC-CPS model presents potential for the modeling and segmentation of more complex images, such as three-dimensional or multi-sensor multi-resolution images. Finally, the HMC-CPS and HEMC-CPS models are not limited to image segmentation and could be used for any kind of spatially correlated data.

    https://arxiv.org/abs/2512.11939


    A Systematic Mapping Study on Risks and Vulnerabilities in Software Containers

    oai:arXiv.org:2512.11940v1

    arXiv:2512.11940v1 Announce Type: new Abstract: Software containers are widely adopted for developing and deploying software applications. Despite their popularity, major security concerns arise during container development and deployment. Software Engineering (SE) research literature reveals a lack of reviewed, aggregated, and organized knowledge of risks, vulnerabilities, security practices, and tools in container-based systems development and deployment. Therefore, we conducted a Systematic Mapping Study (SMS) based on 129 selected primary studies to explore and organize existing knowledge on security issues in software container systems. Data from the primary studies enabled us to identify critical risks and vulnerabilities across the container life-cycle and categorize them using a novel taxonomy. Additionally, the findings highlight the causes and implications and provide a list of mitigation techniques to overcome these risks and vulnerabilities. Furthermore, we provide an aggregation of security practices and tools that can help support and improve the overall security of container systems. This study offers critical insights into the current landscape of security issues within software container systems. Our analysis highlights the need for future SE research to focus on security enhancement practices that strengthen container systems and develop effective mitigation strategies to comprehensively address existing risks and vulnerabilities.

    https://arxiv.org/abs/2512.11940


    DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition

    oai:arXiv.org:2512.11941v1

    arXiv:2512.11941v1 Announce Type: new Abstract: Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce \textbf{DynaPURLS}, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

    https://arxiv.org/abs/2512.11941


    Hypergame Rationalisability: Solving Agent Misalignment In Strategic Play

    oai:arXiv.org:2512.11942v1

    arXiv:2512.11942v1 Announce Type: new Abstract: Differences in perception, information asymmetries, and bounded rationality lead game-theoretic players to derive a private, subjective view of the game that may diverge from the underlying ground-truth scenario and may be misaligned with other players' interpretations. While typical game-theoretic assumptions often overlook such heterogeneity, hypergame theory provides the mathematical framework to reason about mismatched mental models. Although hypergames have recently gained traction in dynamic applications concerning uncertainty, their practical adoption in multi-agent system research has been hindered by the lack of a unifying, formal, and practical representation language, as well as scalable algorithms for managing complex hypergame structures and equilibria. Our work addresses this gap by introducing a declarative, logic-based domain-specific language for encoding hypergame structures and hypergame solution concepts. Leveraging answer-set programming, we develop an automated pipeline for instantiating hypergame structures and running our novel hypergame rationalisation procedure, a mechanism for finding belief structures that justify seemingly irrational outcomes. The proposed language establishes a unifying formalism for hypergames and serves as a foundation for developing nuanced, belief-based heterogeneous reasoners, offering a verifiable context with logical guarantees. Together, these contributions establish the connection between hypergame theory, multi-agent systems, and strategic AI.

    https://arxiv.org/abs/2512.11942


    How AI Agents Follow the Herd of AI? Network Effects, History, and Machine Optimism

    oai:arXiv.org:2512.11943v1

    arXiv:2512.11943v1 Announce Type: new Abstract: Understanding decision-making in multi-AI-agent frameworks is crucial for analyzing strategic interactions in network-effect-driven contexts. This study investigates how AI agents navigate network-effect games, where individual payoffs depend on peer participatio--a context underexplored in multi-agent systems despite its real-world prevalence. We introduce a novel workflow design using large language model (LLM)-based agents in repeated decision-making scenarios, systematically manipulating price trajectories (fixed, ascending, descending, random) and network-effect strength. Our key findings include: First, without historical data, agents fail to infer equilibrium. Second, ordered historical sequences (e.g., escalating prices) enable partial convergence under weak network effects but strong effects trigger persistent "AI optimism"--agents overestimate participation despite contradictory evidence. Third, randomized history disrupts convergence entirely, demonstrating that temporal coherence in data shapes LLMs' reasoning, unlike humans. These results highlight a paradigm shift: in AI-mediated systems, equilibrium outcomes depend not just on incentives, but on how history is curated, which is impossible for human.

    https://arxiv.org/abs/2512.11943


    A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

    oai:arXiv.org:2512.11944v1

    arXiv:2512.11944v1 Announce Type: new Abstract: Motion planning for high-level autonomous driving is constrained by a fundamental trade-off between the transparent, yet brittle, nature of pipeline methods and the adaptive, yet opaque, "black-box" characteristics of modern learning-based systems. This paper critically synthesizes the evolution of the field -- from pipeline methods through imitation learning, reinforcement learning, and generative AI -- to demonstrate how this persistent dilemma has hindered the development of truly trustworthy systems. To resolve this impasse, we conduct a comprehensive review of learning-based motion planning methods. Based on this review, we outline a data-driven optimal control paradigm as a unifying framework that synergistically integrates the verifiable structure of classical control with the adaptive capacity of machine learning, leveraging real-world data to continuously refine key components such as system dynamics, cost functions, and safety constraints. We explore this framework's potential to enable three critical next-generation capabilities: "Human-Centric" customization, "Platform-Adaptive" dynamics adaptation, and "System Self-Optimization" via self-tuning. We conclude by proposing future research directions based on this paradigm, aimed at developing intelligent transportation systems that are simultaneously safe, interpretable, and capable of human-like autonomy.

    https://arxiv.org/abs/2512.11944


    Data-Driven Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations

    oai:arXiv.org:2512.11946v1

    arXiv:2512.11946v1 Announce Type: new Abstract: Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol' indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.

    https://arxiv.org/abs/2512.11946


    Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

    oai:arXiv.org:2512.11949v1

    arXiv:2512.11949v1 Announce Type: new Abstract: Activation monitoring, which probes a model's internal states using lightweight classifiers, is an emerging tool for AI safety. However, its worst-case robustness under a misalignment threat model--where a model might learn to actively conceal its internal states--remains untested. Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is to stress-test the learnability of this behavior. We demonstrate that finetuning can create Neural Chameleons: models capable of zero-shot evading activation monitors. Specifically, we fine-tune an LLM to evade monitors for a set of benign concepts (e.g., languages, HTML) when conditioned on a trigger of the form: "You are being probed for {concept}". We show that this learned mechanism generalizes zero-shot: by substituting {concept} with a safety-relevant term like 'deception', the model successfully evades previously unseen safety monitors. We validate this phenomenon across diverse model families (Llama, Gemma, Qwen), showing that the evasion succeeds even against monitors trained post hoc on the model's frozen weights. This evasion is highly selective, targeting only the specific concept mentioned in the trigger, and having a modest impact on model capabilities on standard benchmarks. Using Gemma-2-9b-it as a case study, a mechanistic analysis reveals this is achieved via a targeted manipulation that moves activations into a low-dimensional subspace. While stronger defenses like monitor ensembles and non-linear classifiers show greater resilience, the model retains a non-trivial evasion capability. Our work provides a proof-of-concept for this failure mode and a tool to evaluate the worst-case robustness of monitoring techniques against misalignment threat models.

    https://arxiv.org/abs/2512.11949


    A Comparative Analysis of Semiconductor Wafer Map Defect Detection with Image Transformer

    oai:arXiv.org:2512.11977v1

    arXiv:2512.11977v1 Announce Type: new Abstract: Predictive maintenance is an important sector in modern industries which improves fault detection and cost reduction processes. By using machine learning algorithms in the whole process, the defects detection process can be implemented smoothly. Semiconductor is a sensitive maintenance field that requires predictability in work. While convolutional neural networks (CNNs) such as VGG-19, Xception and Squeeze-Net have demonstrated solid performance in image classification for semiconductor wafer industry, their effectiveness often declines in scenarios with limited and imbalanced data. This study investigates the use of the Data-Efficient Image Transformer (DeiT) for classifying wafer map defects under data-constrained conditions. Experimental results reveal that the DeiT model achieves highest classification accuracy of 90.83%, outperforming CNN models such as VGG-19(65%), SqueezeNet(82%), Xception(66%) and Hybrid(67%). DeiT also demonstrated superior F1-score (90.78%) and faster training convergence, with enhanced robustness in detecting minority defect classes. These findings highlight the potential of transformer-based models like DeiT in semiconductor wafer defect detection and support predictive maintenance strategies within semiconductor fabrication processes.

    https://arxiv.org/abs/2512.11977


    Designing The Internet of Agents: A Framework for Trustworthy, Transparent, and Collaborative Human-Agent Interaction (HAX)

    oai:arXiv.org:2512.11979v1

    arXiv:2512.11979v1 Announce Type: new Abstract: The rise of generative and autonomous agents marks a fundamental shift in computing, demanding a rethinking of how humans collaborate with probabilistic, partially autonomous systems. We present the Human-AI-Experience (HAX) framework, a comprehensive, three-phase approach that establishes design foundations for trustworthy, transparent, and collaborative agentic interaction. HAX integrates behavioral heuristics, a schema-driven SDK enforcing structured and safe outputs, and a behavioral proxy concept that orchestrates agent activity to reduce cognitive load. A validated catalog of mixed-initiative design patterns further enables intent preview, iterative alignment, trust repair, and multi-agent narrative coherence. Grounded in Time, Interaction, and Performance (TIP) theory, HAX reframes multi-agent systems as colleagues, offering the first end-to-end framework that bridges trust theory, interface design, and infrastructure for the emerging Internet of Agents.

    https://arxiv.org/abs/2512.11979


    Evidence-Driven Decision Support for AI Model Selection in Research Software Engineering

    oai:arXiv.org:2512.11984v1

    arXiv:2512.11984v1 Announce Type: new Abstract: The rapid proliferation of artificial intelligence (AI) models and methods presents growing challenges for research software engineers and researchers who must select, integrate, and maintain appropriate models within complex research workflows. Model selection is often performed in an ad hoc manner, relying on fragmented metadata and individual expertise, which can undermine reproducibility, transparency, and overall research software quality. This work proposes a structured and evidence-driven approach to support AI model selection that aligns with both technical and contextual requirements. We conceptualize AI model selection as a Multi-Criteria Decision-Making (MCDM) problem and introduce an evidence-based decision-support framework that integrates automated data collection pipelines, a structured knowledge graph, and MCDM principles. Following the Design Science Research methodology, the proposed framework (ModelSelect) is empirically validated through 50 real-world case studies and comparative experiments against leading generative AI systems. The evaluation results show that ModelSelect produces reliable, interpretable, and reproducible recommendations that closely align with expert reasoning. Across the case studies, the framework achieved high coverage and strong rationale alignment in both model and library recommendation tasks, performing comparably to generative AI assistants while offering superior traceability and consistency. By framing AI model selection as an MCDM problem, this work establishes a rigorous foundation for transparent and reproducible decision support in research software engineering. The proposed framework provides a scalable and explainable pathway for integrating empirical evidence into AI model recommendation processes, ultimately improving the quality and robustness of research software decision-making.

    https://arxiv.org/abs/2512.11984


    Learning to Extract Context for Context-Aware LLM Inference

    oai:arXiv.org:2512.11986v1

    arXiv:2512.11986v1 Announce Type: new Abstract: User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risks may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusal of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering broader contextual factors. User requests are situated within broader contexts such as intentions, knowledge, and prior experience, which strongly influence what constitutes an appropriate answer. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement learning based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inferences.

    https://arxiv.org/abs/2512.11986


    Pivot-Only Azimuthal Control and Attitude Estimation of Balloon-borne Payloads

    oai:arXiv.org:2512.11987v1

    arXiv:2512.11987v1 Announce Type: new Abstract: This paper presents an attitude estimation and yaw-rate control framework for balloon-borne payloads using pivot-only actuation, motivated by the Taurus experiment. Taurus is a long-duration balloon instrument designed for rapid azimuthal scanning at approximately 30 deg/s using a motorized pivot at the flight-train connection, without a reaction wheel. We model the gondola as a rigid body subject to realistic disturbances and sensing limitations, and implement a Multiplicative Extended Kalman Filter (MEKF) that estimates attitude and gyroscope bias by fusing inertial and vector-camera measurements. A simple PI controller uses the estimated states to regulate yaw rate. Numerical simulations incorporating representative disturbance and measurement noise levels are used to evaluate closed-loop control performance and MEKF behavior under flight-like conditions. Experimental tests on the Taurus gondola validate the pivot-only approach, demonstrating stable high-rate tracking under realistic hardware constraints. The close agreement between simulation and experiment indicates that the simplified rigid-body model captures the dominant dynamics relevant for controller design and integrated estimation-and-control development.

    https://arxiv.org/abs/2512.11987


    CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

    oai:arXiv.org:2512.11988v1

    arXiv:2512.11988v1 Announce Type: new Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on in-distribution dataset and 36% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

    https://arxiv.org/abs/2512.11988


    Policy Gradient Algorithms for Age-of-Information Cost Minimization

    oai:arXiv.org:2512.11990v1

    arXiv:2512.11990v1 Announce Type: new Abstract: Recent developments in cyber-physical systems have increased the importance of maximizing the freshness of the information about the physical environment. However, optimizing the access policies of Internet of Things devices to maximize the data freshness, measured as a function of the Age-of-Information (AoI) metric, is a challenging task. This work introduces two algorithms to optimize the information update process in cyber-physical systems operating under the generate-at-will model, by finding an online policy without knowing the characteristics of the transmission delay or the age cost function. The optimization seeks to minimize the time-average cost, which integrates the AoI at the receiver and the data transmission cost, making the approach suitable for a broad range of scenarios. Both algorithms employ policy gradient methods within the framework of model-free reinforcement learning (RL) and are specifically designed to handle continuous state and action spaces. Each algorithm minimizes the cost using a distinct strategy for deciding when to send an information update. Moreover, we demonstrate that it is feasible to apply the two strategies simultaneously, leading to an additional reduction in cost. The results demonstrate that the proposed algorithms exhibit good convergence properties and achieve a time-average cost within 3% of the optimal value, when the latter is computable. A comparison with other state-of-the-art methods shows that the proposed algorithms outperform them in one or more of the following aspects: being applicable to a broader range of scenarios, achieving a lower time-average cost, and requiring a computational cost at least one order of magnitude lower.

    https://arxiv.org/abs/2512.11990


    Re-opening open-source science through AI assisted development

    oai:arXiv.org:2512.11993v1

    arXiv:2512.11993v1 Announce Type: new Abstract: Open-source scientific software is effectively closed to modification by its complexity. With recent advances in technology, an agentic AI team led by a single human can now rapidly and robustly modify large codebases and re-open science to the community which can review and vet the AI generated code. We demonstrate this with a case study, STAR-Flex, which is an open source fork of STAR, adding 16,000 lines of C++ code to add the ability to process 10x Flex data, while maintaining full original function. This is the first open-source processing software for Flex data and was written as part of the NIH funded MorPHiC consortium.

    https://arxiv.org/abs/2512.11993


    Optimal non-adaptive algorithm for edge estimation

    oai:arXiv.org:2512.11994v1

    arXiv:2512.11994v1 Announce Type: new Abstract: We present a simple nonadaptive randomized algorithm that estimates the number of edges in a simple, unweighted, undirected graph, possibly containing isolated vertices, using only degree and random edge queries. For an $n$-vertex graph, our method requires only $\widetilde{O}(\sqrt{n})$ queries, achieving sublinear query complexity. The algorithm independently samples a set of vertices and queries their degrees, and also independently samples a set of edges, using the answers to these queries to estimate the total number of edges in the graph. We further prove a matching lower bound, establishing the optimality of our algorithm and resolving the non-adaptive query complexity of this problem with respect to degree and random-edge queries.

    https://arxiv.org/abs/2512.11994


    V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

    oai:arXiv.org:2512.11995v1

    arXiv:2512.11995v1 Announce Type: new Abstract: While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, ``Visual Reasoning with multi-step EXploration (V-REX)'', which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability to (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-sourced VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

    https://arxiv.org/abs/2512.11995


    Log Anomaly Detection with Large Language Models via Knowledge-Enriched Fusion

    oai:arXiv.org:2512.11997v1

    arXiv:2512.11997v1 Announce Type: new Abstract: System logs are a critical resource for monitoring and managing distributed systems, providing insights into failures and anomalous behavior. Traditional log analysis techniques, including template-based and sequence-driven approaches, often lose important semantic information or struggle with ambiguous log patterns. To address this, we present EnrichLog, a training-free, entry-based anomaly detection framework that enriches raw log entries with both corpus-specific and sample-specific knowledge. EnrichLog incorporates contextual information, including historical examples and reasoning derived from the corpus, to enable more accurate and interpretable anomaly detection. The framework leverages retrieval-augmented generation to integrate relevant contextual knowledge without requiring retraining. We evaluate EnrichLog on four large-scale system log benchmark datasets and compare it against five baseline methods. Our results show that EnrichLog consistently improves anomaly detection performance, effectively handles ambiguous log entries, and maintains efficient inference. Furthermore, incorporating both corpus- and sample-specific knowledge enhances model confidence and detection accuracy, making EnrichLog well-suited for practical deployments.

    https://arxiv.org/abs/2512.11997


    Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models

    oai:arXiv.org:2512.11998v1

    arXiv:2512.11998v1 Announce Type: new Abstract: Producing trustworthy and reliable Large Language Models (LLMs) has become increasingly important as their usage becomes more widespread. Calibration seeks to achieve this by improving the alignment between the model's confidence and the actual likelihood of its responses being correct or desirable. However, it has been observed that the internal confidence of a model, derived from token probabilities, is not well aligned with its verbalized confidence, leading to misleading results with different calibration methods. In this paper, we propose Direct Confidence Alignment (DCA), a method using Direct Preference Optimization to align an LLM's verbalized confidence with its internal confidence rather than ground-truth accuracy, enhancing model transparency and reliability by ensuring closer alignment between the two confidence measures. We evaluate DCA across multiple open-weight LLMs on a wide range of datasets. To further assess this alignment, we also introduce three new calibration error-based metrics. Our results show that DCA improves alignment metrics on certain model architectures, reducing inconsistencies in a model's confidence expression. However, we also show that it can be ineffective on others, highlighting the need for more model-aware approaches in the pursuit of more interpretable and trustworthy LLMs.

    https://arxiv.org/abs/2512.11998


    Taylor-Lagrange Control for Safety-Critical Systems

    oai:arXiv.org:2512.11999v1

    arXiv:2512.11999v1 Announce Type: new Abstract: This paper proposes a novel Taylor-Lagrange Control (TLC) method for nonlinear control systems to ensure the safety and stability through Taylor's theorem with Lagrange remainder. To achieve this, we expand a safety or stability function with respect to time along the system dynamics using the Lie derivative and Taylor's theorem. This expansion enables the control input to appear in the Taylor series at an order equivalent to the relative degree of the function. We show that the proposed TLC provides necessary and sufficient conditions for system safety and is applicable to systems and constraints of arbitrary relative degree. The TLC exhibits connections with existing Control Barrier Function (CBF) and Control Lyapunov Function (CLF) methods, and it further extends the CBF and CLF methods to the complex domain, especially for higher order cases. Compared to High-Order CBFs (HOCBFs), TLC is less restrictive as it does not require forward invariance of the intersection of a set of safe sets while HOCBFs do. We employ TLC to reformulate a constrained optimal control problem as a sequence of quadratic programs with a zero-order hold implementation method, and demonstrate the safety of zero-order hold TLC using an event-triggered control method to address inter-sampling effects. Finally, we illustrate the effectiveness of the proposed TLC method through an adaptive cruise control system and a robot control problem, and compare it with existing CBF methods.

    https://arxiv.org/abs/2512.11999


    Adversarial Attacks Against Deep Learning-Based Radio Frequency Fingerprint Identification

    oai:arXiv.org:2512.12002v1

    arXiv:2512.12002v1 Announce Type: new Abstract: Radio frequency fingerprint identification (RFFI) is an emerging technique for the lightweight authentication of wireless Internet of things (IoT) devices. RFFI exploits deep learning models to extract hardware impairments to uniquely identify wireless devices. Recent studies show deep learning-based RFFI is vulnerable to adversarial attacks. However, effective adversarial attacks against different types of RFFI classifiers have not yet been explored. In this paper, we carried out a comprehensive investigations into different adversarial attack methods on RFFI systems using various deep learning models. Three specific algorithms, fast gradient sign method (FGSM), projected gradient descent (PGD), and universal adversarial perturbation (UAP), were analyzed. The attacks were launched to LoRa-RFFI and the experimental results showed the generated perturbations were effective against convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and gated recurrent units (GRU). We further used UAP to launch practical attacks. Special factors were considered for the wireless context, including implementing real-time attacks, the effectiveness of the attacks over a period of time, etc. Our experimental evaluation demonstrated that UAP can successfully launch adversarial attacks against the RFFI, achieving a success rate of 81.7% when the adversary almost has no prior knowledge of the victim RFFI systems.

    https://arxiv.org/abs/2512.12002


    EnviroLLM: Resource Tracking and Optimization for Local AI

    oai:arXiv.org:2512.12004v1

    arXiv:2512.12004v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed locally for privacy and accessibility, yet users lack tools to measure their resource usage, environmental impact, and efficiency metrics. This paper presents EnviroLLM, an open-source toolkit for tracking, benchmarking, and optimizing performance and energy consumption when running LLMs on personal devices. The system provides real-time process monitoring, benchmarking across multiple platforms (Ollama, LM Studio, vLLM, and OpenAI-compatible APIs), persistent storage with visualizations for longitudinal analysis, and personalized model and optimization recommendations. The system includes LLM-as-judge evaluations alongside energy and speed metrics, enabling users to assess quality-efficiency tradeoffs when testing models with custom prompts.

    https://arxiv.org/abs/2512.12004


    MVP-ORAM: a Wait-free Concurrent ORAM for Confidential BFT Storage

    oai:arXiv.org:2512.12006v1

    arXiv:2512.12006v1 Announce Type: new Abstract: It is well known that encryption alone is not enough to protect data privacy. Access patterns, revealed when operations are performed, can also be leveraged in inference attacks. Oblivious RAM (ORAM) hides access patterns by making client requests oblivious. However, existing protocols are still limited in supporting concurrent clients and Byzantine fault tolerance (BFT). We present MVP-ORAM, the first wait-free ORAM protocol that supports concurrent fail-prone clients. In contrast to previous works, MVP-ORAM avoids using trusted proxies, which require additional security assumptions, and concurrency control mechanisms based on inter-client communication or distributed locks, which limit overall throughput and the capability of tolerating faulty clients. Instead, MVP-ORAM enables clients to perform concurrent requests and merge conflicting updates as they happen, satisfying wait-freedom, i.e., clients make progress independently of the performance or failures of other clients. Since wait and collision freedom are fundamentally contradictory goals that cannot be achieved simultaneously in an asynchronous concurrent ORAM service, we define a weaker notion of obliviousness that depends on the application workload and number of concurrent clients, and prove MVP-ORAM is secure in practical scenarios where clients perform skewed block accesses. By being wait-free, MVP-ORAM can be seamlessly integrated into existing confidential BFT data stores, creating the first BFT ORAM construction. We implement MVP-ORAM on top of a confidential BFT data store and show our prototype can process hundreds of 4KB accesses per second in modern clouds.

    https://arxiv.org/abs/2512.12006


    Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

    oai:arXiv.org:2512.12008v1

    arXiv:2512.12008v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.

    https://arxiv.org/abs/2512.12008


    A Reference Architecture for Embedding Quantum Software Into Enterprise Systems

    oai:arXiv.org:2512.12009v1

    arXiv:2512.12009v1 Announce Type: new Abstract: Quantum computing promises a remarkable performance boost for certain applications, including computational intensive problems addressed by enterprise systems. However, software architectures of enterprise systems must consider specific characteristics and quality attributes when collaborating with quantum computing services. Hence, this paper presents a modular reference architecture for embedding quantum software into enterprise systems. Its building blocks consist of loosely coupled and distributed services that implement both quantum-independent and quantum-specific tasks. Although these services either depend on the business domain or the selected quantum algorithm, their orchestration forms a stable and reusable pipeline, specified as an executable BPMN model. For demonstration and evaluation purposes, the proposed reference architecture is utilized in two case studies addressing combinatorial challenges from the field of operations research.

    https://arxiv.org/abs/2512.12009


    Defunding Sexual Healthcare: A Topological Investigation of Resource Accessibility

    oai:arXiv.org:2512.12011v1

    arXiv:2512.12011v1 Announce Type: new Abstract: Government actions, such as the Medina v. Planned Parenthood South Atlantic Supreme Court ruling and the passage of the Big Beautiful Bill Act, have aimed to restrict or prohibit Medicaid funding for Planned Parenthood Healthcare Centers (PPHCs) at both the state and national levels. These funding cuts are particularly harmful in states like California, which has a large population of Medicaid users. This analysis focuses on the distribution of Planned Parenthood clinics and Federally Qualified Health Centers (FQHCs), which offer essential reproductive healthcare services including, but not limited to, abortions, birth control, HIV services, pregnancy testing and planning, STD testing and treatment, and cancer screenings. While expanded funding for FQHCs has been proposed as a solution, it fails to address the locational accessibility of Medicaid-funded health centers that provide sexual and reproductive care. To assess this issue, we analyze the proximity of data points representing California's PPHC and FQHC locations. Topological Data Analysis (TDA)-an approach that examines the shape and structure of data -- is used to detect disparities in reproductive and sexual healthcare coverage. To conduct data collection and visualization, we utilize R and Python. We apply an n-closest neighbor algorithm to examine distances between facilities and assess changes in travel time required to reach healthcare sites. We apply persistent homology to analyze current gaps across multiple scales in healthcare coverage and compare them to potential future gaps. Our findings aim to identify areas where access to care is most vulnerable and demonstrate how TDA can be used to analyze spatial inequalities in public health.

    https://arxiv.org/abs/2512.12011


    Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

    oai:arXiv.org:2512.12012v1

    arXiv:2512.12012v1 Announce Type: new Abstract: The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40\% compared to single models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.

    https://arxiv.org/abs/2512.12012


    Exploring Spatial-Temporal Representation via Star Graph for mmWave Radar-based Human Activity Recognition

    oai:arXiv.org:2512.12013v1

    arXiv:2512.12013v1 Announce Type: new Abstract: Human activity recognition (HAR) requires extracting accurate spatial-temporal features with human movements. A mmWave radar point cloud-based HAR system suffers from sparsity and variable-size problems due to the physical features of the mmWave signal. Existing works usually borrow the preprocessing algorithms for the vision-based systems with dense point clouds, which may not be optimal for mmWave radar systems. In this work, we proposed a graph representation with a discrete dynamic graph neural network (DDGNN) to explore the spatial-temporal representation of human movement-related features. Specifically, we designed a star graph to describe the high-dimensional relative relationship between a manually added static center point and the dynamic mmWave radar points in the same and consecutive frames. We then adopted DDGNN to learn the features residing in the star graph with variable sizes. Experimental results demonstrated that our approach outperformed other baseline methods using real-world HAR datasets. Our system achieved an overall classification accuracy of 94.27\%, which gets the near-optimal performance with a vision-based skeleton data accuracy of 97.25\%. We also conducted an inference test on Raspberry Pi~4 to demonstrate its effectiveness on resource-constraint platforms. \sh{ We provided a comprehensive ablation study for variable DDGNN structures to validate our model design. Our system also outperformed three recent radar-specific methods without requiring resampling or frame aggregators.

    https://arxiv.org/abs/2512.12013


    Bandit-Based Rate Adaptation for a Single-Server Queue

    oai:arXiv.org:2512.12016v1

    arXiv:2512.12016v1 Announce Type: new Abstract: This paper considers the problem of obtaining bounded time-average expected queue sizes in a single-queue system with a partial-feedback structure. Time is slotted; in slot $t$ the transmitter chooses a rate $V(t)$ from a continuous interval. Transmission succeeds if and only if $V(t)\le C(t)$, where channel capacities $\{C(t)\}$ and arrivals are i.i.d. draws from fixed but unknown distributions. The transmitter observes only binary acknowledgments (ACK/NACK) indicating success or failure. Let $\varepsilon>0$ denote a sufficiently small lower bound on the slack between the arrival rate and the capacity region. We propose a phased algorithm that progressively refines a discretization of the uncountable infinite rate space and, without knowledge of $\varepsilon$, achieves a $\mathcal{O}\!\big(\log^{3.5}(1/\varepsilon)/\varepsilon^{3}\big)$ time-average expected queue size uniformly over the horizon. We also prove a converse result showing that for any rate-selection algorithm, regardless of whether $\varepsilon$ is known, there exists an environment in which the worst-case time-average expected queue size is $\Omega(1/\varepsilon^{2})$. Thus, while a gap remains in the setting without knowledge of $\varepsilon$, we show that if $\varepsilon$ is known, a simple single-stage UCB type policy with a fixed discretization of the rate space achieves $\mathcal{O}\!\big(\log(1/\varepsilon)/\varepsilon^{2}\big)$, matching the converse up to logarithmic factors.

    https://arxiv.org/abs/2512.12016


    Online Full ZVS Optimization for Modular Multi-Active Bridge Converter in MV PET

    oai:arXiv.org:2512.12017v1

    arXiv:2512.12017v1 Announce Type: new Abstract: Multi-active bridge (MAB) converters, the core of the state-of-the-art medium-voltage power electronic transformers, can flexibly connect multiple DC ports among distributed DC grids and loads, but suffer from hard switching under conventional single phase-shift control, especially under unbalanced voltage conversion ratios and light load conditions. Although some offline methods manage to improve the efficiency through complex optimization structures, there lacks online optimization methods that are simple but effective due to the strong coupling among ports of the converter. By leveraging the time-domain model of the MAB converter under the multiple phase-shift modulation scheme, this paper simplifies the optimization process and proposes an online optimization method that can achieve full zero-voltage switching (ZVS) operation regardless of the load conditions. The proposed method has simple solutions with only voltage conversion ratios involved and can be implemented within a wide operation range without additional sensors or advanced controllers. A four-port MAB converter is constructed as the prototype. The simulation and experimental results have verified the feasibility and superiority of the proposed online strategy in achieving ZVS operation, dynamic response, and efficiency improvement.

    https://arxiv.org/abs/2512.12017


    Modified Hybrid A* Collision-Free Path-Planning for Automated Reverse Parking

    oai:arXiv.org:2512.12021v1

    arXiv:2512.12021v1 Announce Type: new Abstract: Parking a vehicle in tight spaces is a challenging task to perform due to the scarcity of feasible paths that are also collision-free. This paper presents a strategy to tackle this kind of maneuver with a modified Hybrid-A* path-planning algorithm that combines the feasibility guarantee inherent in the standard Hybrid A* algorithm with the addition of static obstacle collision avoidance. A kinematic single-track model is derived to describe the low-speed motion of the vehicle, which is subsequently used as the motion model in the Hybrid A* path-planning algorithm to generate feasible motion primitive branches. The model states are also used to reconstruct the vehicle centerline, which, in conjunction with an inflated binary occupancy map, facilitates static obstacle collision avoidance functions. Simulation study and animation are set up to test the efficacy of the approach, and the proposed algorithm proves to consistently provide kinematically feasible trajectories that are also collision-free.

    https://arxiv.org/abs/2512.12021


    DFedReweighting: A Unified Framework for Objective-Oriented Reweighting in Decentralized Federated Learning

    oai:arXiv.org:2512.12022v1

    arXiv:2512.12022v1 Announce Type: new Abstract: Decentralized federated learning (DFL) has recently emerged as a promising paradigm that enables multiple clients to collaboratively train machine learning model through iterative rounds of local training, communication, and aggregation without relying on a central server which introduces potential vulnerabilities in conventional Federated Learning. Nevertheless, DFL systems continue to face a range of challenges, including fairness, robustness, etc. To address these challenges, we propose \textbf{DFedReweighting}, a unified aggregation framework designed to achieve diverse objectives in DFL systems via a objective-oriented reweighting aggregation at the final step of each learning round. Specifically, the framework first computes preliminary weights based on \textit{target performance metric} obtained from auxiliary dataset constructed using local data. These weights are then refined using \textit{customized reweighting strategy}, resulting in the final aggregation weights. Our results from the theoretical analysis demonstrate that the appropriate combination of the target performance metric and the customized reweighting strategy ensures linear convergence. Experimental results consistently show that our proposed framework significantly improves fairness and robustness against Byzantine attacks in diverse scenarios. Provided that appropriate target performance metrics and customized reweighting strategy are selected, our framework can achieve a wide range of desired learning objectives.

    https://arxiv.org/abs/2512.12022


    Hyper model checking for high-level relational models

    oai:arXiv.org:2512.12024v1

    arXiv:2512.12024v1 Announce Type: new Abstract: Many properties related to security or concurrency must be encoded as so-called hyperproperties, temporal properties that allow reasoning about multiple traces of a system. However, despite recent advances on model checking hyperproperties, there is still a lack of higher-level specification languages that can effectively support software engineering practitioners in verifying properties of this class at early stages of system design. Alloy is a lightweight formal method with a high-level specification language that is supported by automated analysis procedures, making it particularly well-suited for the verification of design models at early development stages. It does not natively support, however, the verification of hyperproperties. This work proposes HyperPardinus, a new model finding procedure that extends Pardinus -- the temporal logic backend of the Alloy language -- to automatically verify hyperproperties over relational models by relying on existing low-level model checkers for hyperproperties. It then conservatively extends Alloy to support the specification and automatic verification of hyperproperties over design models, as well as the visualization of (counter-)examples at a higher-level of abstraction. Evaluation shows that our approach enables modeling and finding (counter-)examples for complex hyperproperties with alternating quantifiers, making it feasible to address relevant scenarios from the state of the art.

    https://arxiv.org/abs/2512.12024


    DT-MPC: Synthesizing Derivation-Free Model Predictive Control from Power Converter Netlists via Physics-Informed Neural Digital Twins

    oai:arXiv.org:2512.12026v1

    arXiv:2512.12026v1 Announce Type: new Abstract: Model Predictive Control (MPC) is a powerful control strategy for power electronics, but it highly relies on manually-derived and topology-specific analytical models, which is labor-intensive and time-consuming in practical designs. To overcome this bottleneck, this paper introduces a Digital-Twin-based MPC (DT-MPC) framework for generic power converters that can systematically translate a high-level circuit into an objective-aware control policy by leveraging a DT as a high-fidelity system model. Furthermore, a physics-informed neural surrogate predictor is proposed to accelerate predictions by DT and enable real-time operation. A gradient-free simplex search optimizer is also introduced to efficiently handle complex multi-objective optimization. The efficacy of the framework has been validated through a cloud-to-edge deployment on a 1500 W dual active bridge converter. Experimental results show that the synthesized predictive model achieves an inference speed over 7 times faster than real time, the DT-MPC controller outperforms several human-designed counterparts, and the overall framework reduces engineering design time by over 95\%, verifying the superiority of DT-MPC on generalized power converters.

    https://arxiv.org/abs/2512.12026


    Differentially Private Community Detection in $h$-uniform Hypergraphs

    oai:arXiv.org:2512.12031v1

    arXiv:2512.12031v1 Announce Type: new Abstract: This paper studies the exact recovery threshold subject to preserving the privacy of connections in $h$-uniform hypergraphs. Privacy is characterized by the $(\epsilon, \delta)$-hyperedge differential privacy (DP), an extension of the notion of $(\epsilon, \delta)$-edge DP in the literature. The hypergraph observations are modeled through a $h$-uniform stochastic block model ($h$-HSBM) in the dense regime. We investigate three differentially private mechanisms: stability-based, sampling-based, and perturbation-based mechanisms. We calculate the exact recovery threshold for each mechanism and study the contraction of the exact recovery region due to the privacy budget, $(\epsilon, \delta)$. Sampling-based mechanisms and randomized response mechanisms guarantee pure $\epsilon$-hyperedge DP where $\delta=0$, while the stability-based mechanisms cannot achieve this level of privacy. The dependence of the limits of the privacy budget on the parameters of the $h$-uniform hypergraph is studied. More precisely, it is proven rigorously that the minimum privacy budget scales logarithmically with the ratio between the density of in-cluster hyperedges and the cross-cluster hyperedges for stability-based and Bayesian sampling-based mechanisms, while this budget depends only on the size of the hypergraph for the randomized response mechanism.

    https://arxiv.org/abs/2512.12031


    Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs

    oai:arXiv.org:2512.12036v1

    arXiv:2512.12036v1 Announce Type: new Abstract: Sparse General Matrix-Matrix Multiplication (SpGEMM) is a fundamental operation in numerous scientific computing and data analytics applications, often bottlenecked by irregular memory access patterns. This paper presents Hash based Multi-phase SpGEMM on GPU and the Acceleration of Indirect Memory Access (AIA) technique, a novel custom near-memory processing approach to optimizing SpGEMM on GPU HBM. Our hardware-software co-designed framework for SpGEMM demonstrates significant performance improvements over state-of-the-art methods, particularly in handling complex, application-specific workloads. We evaluate our approach on various graph workloads, including graph contraction, Markov clustering, and Graph Neural Networks (GNNs), showcasing its practical applicability. For graph analytics applications, AIA demonstrates up to 17.3% time reduction from the software-only implementation, while achieving time reduction of 76.5% for Graph Contraction and 58.4% for Markov Clustering compared to cuSPARSE. For GNN training applications with structured global pruning, our hybrid approach delivers an average of 1.43x speedup over software-only implementation across six benchmark datasets and three architectures (GCN, GIN, GraphSAGE), and shows 1.95x speedup for GNN workloads when compared to cuSPARSE, with up to 4.18x gains on large-scale datasets.

    https://arxiv.org/abs/2512.12036


    Discrete-to-continuum convergence of the density of states for Mathieu's equation

    oai:arXiv.org:2512.12039v1

    arXiv:2512.12039v1 Announce Type: new Abstract: The density of states of a self-adjoint operator generalizes the eigenvalue distribution of a Hermitian matrix. We prove convergence of the density of states for a tight-binding model with a slowly-varying periodic potential to the density of states of its continuum approximation, a Mathieu-type equation.

    https://arxiv.org/abs/2512.12039


    Benchmarking Contextual Understanding for In-Car Conversational Systems

    oai:arXiv.org:2512.12042v1

    arXiv:2512.12042v1 Announce Type: new Abstract: In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.

    https://arxiv.org/abs/2512.12042


    AI as a Teaching Partner: Early Lessons from Classroom Codesign with Secondary Teachers

    oai:arXiv.org:2512.12045v1

    arXiv:2512.12045v1 Announce Type: new Abstract: This report presents a comprehensive account of the Colleague AI Classroom pilot, a collaborative design (co-design) study that brought generative AI technology directly into real classrooms. In this study, AI functioned as a third agent, an active participant that mediated feedback, supported inquiry, and extended teachers' instructional reach while preserving human judgment and teacher authority. Over seven weeks in spring 2025, 21 in-service teachers from four Washington State public school districts and one independent school integrated four AI-powered features of the Colleague AI Classroom into their instruction: Teaching Aide, Assessment and AI Grading, AI Tutor, and Student Growth Insights. More than 600 students in grades 6-12 used the platform in class at the direction of their teachers, who designed and facilitated the AI activities. During the Classroom pilot, teachers were co-design partners: they planned activities, implemented them with students, and provided weekly reflections on AI's role in classroom settings. The teachers' feedback guided iterative improvements for Colleague AI. The research team captured rich data through surveys, planning and reflection forms, group meetings, one-on-one interviews, and platform usage logs to understand where AI adds instructional value and where it requires refinement.

    https://arxiv.org/abs/2512.12045


    Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

    oai:arXiv.org:2512.12046v1

    arXiv:2512.12046v1 Announce Type: new Abstract: Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.

    https://arxiv.org/abs/2512.12046


    Context-Aware Agentic Power Resources Optimisation in EV using Smart2ChargeApp

    oai:arXiv.org:2512.12048v1

    arXiv:2512.12048v1 Announce Type: new Abstract: This paper presents a novel context-sensitive multi\-agent coordination for dynamic resource allocation (CAMAC-DRA) framework for optimizing smart electric vehicle (EV) charging ecosystems through the Smart2Charge application. The proposed system coordinates autonomous charging agents across networks of 250 EVs and 45 charging stations while adapting to dynamic environmental conditions through context-aware decision-making. Our multi-agent approach employs coordinated Deep Q\-Networks integrated with Graph Neural Networks and attention mechanisms, processing 20 contextual features including weather patterns, traffic conditions, grid load fluctuations, and electricity pricing.The framework balances five ecosystem stakeholders i.e. EV users (25\%), grid operators (20\%), charging station operators (20\%), fleet operators (20%), and environmental factors (15\%) through weighted coordination mechanisms and consensus protocols. Comprehensive validation using real-world datasets containing 441,077 charging transactions demonstrates superior performance compared to baseline algorithms including DDPG, A3C, PPO, and GNN approaches. The CAMAC\-DRA framework achieves 92\% coordination success rate, 15\% energy efficiency improvement, 10\% cost reduction, 20% grid strain decrease, and \2.3x faster convergence while maintaining 88\% training stability and 85\% sample efficiency. Real-world validation confirms commercial viability with Net Present Cost of -\$122,962 and 69\% cost reduction through renewable energy integration. The framework's unique contribution lies in developing context-aware multi-stakeholder coordination that successfully balances competing objectives while adapting to real-time variables, positioning it as a breakthrough solution for intelligent EV charging coordination and sustainable transportation electrification.

    https://arxiv.org/abs/2512.12048


    An unfitted divergence-free higher order finite element method for the Stokes problem

    oai:arXiv.org:2512.12050v1

    arXiv:2512.12050v1 Announce Type: new Abstract: The paper develops and analyzes a higher-order unfitted finite element method for the incompressible Stokes equations, which yields a strongly divergence-free velocity field up to the physical boundary. The method combines an isoparametric Scott--Vogelius velocity-pressure pair on a cut background mesh with a stabilized Nitsche/Lagrange multiplier formulation for imposing Dirichlet boundary conditions. We construct finite element spaces that admit robust numerical implementation using standard elementwise polynomial mappings and produce exactly divergence-free discrete velocities. The key components of the analysis are a new inf-sup stability result for the isoparametric Scott--Vogelius pair on unfitted meshes and a combined inf-sup stability result for the bilinear forms associated with the pressure and the Lagrange multiplier. The finite element formulation employs a higher-order Lagrange multiplier space, which ensures stability and mitigates the loss of pressure robustness typically associated with the weak enforcement of boundary conditions for the normal velocity component. The paper provides a complete stability and convergence theory in two dimensions, accounting for the geometric errors introduced by the isoparametric approximation. The analysis demonstrates optimal-order velocity convergence in both the $H^1$ and $L^2$ norms and establishes optimal $H^1$-convergence and nearly optimal $L^2$-convergence of a post-processed pressure. Numerical experiments illustrate and confirm the theoretical findings.

    https://arxiv.org/abs/2512.12050


    Adaptive federated learning for ship detection across diverse satellite imagery sources

    oai:arXiv.org:2512.12053v1

    arXiv:2512.12053v1 Announce Type: new Abstract: We investigate the application of Federated Learning (FL) for ship detection across diverse satellite datasets, offering a privacy-preserving solution that eliminates the need for data sharing or centralized collection. This approach is particularly advantageous for handling commercial satellite imagery or sensitive ship annotations. Four FL models including FedAvg, FedProx, FedOpt, and FedMedian, are evaluated and compared to a local training baseline, where the YOLOv8 ship detection model is independently trained on each dataset without sharing learned parameters. The results reveal that FL models substantially improve detection accuracy over training on smaller local datasets and achieve performance levels close to global training that uses all datasets during the training. Furthermore, the study underscores the importance of selecting appropriate FL configurations, such as the number of communication rounds and local training epochs, to optimize detection precision while maintaining computational efficiency.

    https://arxiv.org/abs/2512.12053


    Enhancing deep learning performance on burned area delineation from SPOT-6/7 imagery for emergency management

    oai:arXiv.org:2512.12056v1

    arXiv:2512.12056v1 Announce Type: new Abstract: After a wildfire, delineating burned areas (BAs) is crucial for quantifying damages and supporting ecosystem recovery. Current BA mapping approaches rely on computer vision models trained on post-event remote sensing imagery, but often overlook their applicability to time-constrained emergency management scenarios. This study introduces a supervised semantic segmentation workflow aimed at boosting both the performance and efficiency of BA delineation. It targets SPOT-6/7 imagery due to its very high resolution and on-demand availability. Experiments are evaluated based on Dice score, Intersection over Union, and inference time. The results show that U-Net and SegFormer models perform similarly with limited training data. However, SegFormer requires more resources, challenging its practical use in emergencies. Incorporating land cover data as an auxiliary task enhances model robustness without increasing inference time. Lastly, Test-Time Augmentation improves BA delineation performance but raises inference time, which can be mitigated with optimization methods like Mixed Precision.

    https://arxiv.org/abs/2512.12056


    A Stochastic Approach to Terrain Maps for Safe Lunar Landing

    oai:arXiv.org:2512.12058v1

    arXiv:2512.12058v1 Announce Type: new Abstract: Safely landing on the lunar surface is a challenging task, especially in the heavily-shadowed South Pole region where traditional vision-based hazard detection methods are not reliable. The potential existence of valuable resources at the lunar South Pole has made landing in that region a high priority for many space agencies and commercial companies. However, relying on a LiDAR for hazard detection during descent is risky, as this technology is fairly untested in the lunar environment. There exists a rich log of lunar surface data from the Lunar Reconnaissance Orbiter (LRO), which could be used to create informative prior maps of the surface before descent. In this work, we propose a method for generating stochastic elevation maps from LRO data using Gaussian processes (GPs), which are a powerful Bayesian framework for non-parametric modeling that produce accompanying uncertainty estimates. In high-risk environments such as autonomous spaceflight, interpretable estimates of terrain uncertainty are critical. However, no previous approaches to stochastic elevation mapping have taken LRO Digital Elevation Model (DEM) confidence maps into account, despite this data containing key information about the quality of the DEM in different areas. To address this gap, we introduce a two-stage GP model in which a secondary GP learns spatially varying noise characteristics from DEM confidence data. This heteroscedastic information is then used to inform the noise parameters for the primary GP, which models the lunar terrain. Additionally, we use stochastic variational GPs to enable scalable training. By leveraging GPs, we are able to more accurately model the impact of heteroscedastic sensor noise on the resulting elevation map. As a result, our method produces more informative terrain uncertainty, which can be used for downstream tasks such as hazard detection and safe landing site selection.

    https://arxiv.org/abs/2512.12058


    The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification

    oai:arXiv.org:2512.12059v1

    arXiv:2512.12059v1 Announce Type: new Abstract: Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong ``reasoning'' capabilities. As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions. (1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? (2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? (3) How does performance vary across model sizes and reasoning capabilities, measured across state-of-the-art LLMs? We present three experiments, including on both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. The best-performing model we evaluated achieves an F1 score of 0.88, somewhat below human-level performance (F1 score: 0.97). We also demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: 0.84). Lastly, we demonstrate that these techniques succeed in identifying inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10% higher than that of reasonable forecasts. These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.

    https://arxiv.org/abs/2512.12059


    CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

    oai:arXiv.org:2512.12060v1

    arXiv:2512.12060v1 Announce Type: new Abstract: Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: https://daveishan.github.io/creativevr-webpage/.

    https://arxiv.org/abs/2512.12060


    Scalable IP Mimicry: End-to-End Deceptive IP Blending to Overcome Rectification and Scale Limitations of IP Camouflage

    oai:arXiv.org:2512.12061v1

    arXiv:2512.12061v1 Announce Type: new Abstract: Semiconductor intellectual property (IP) theft incurs estimated annual losses ranging from $225 billion to $600 billion. Despite initiatives like the CHIPS Act, many semiconductor designs remain vulnerable to reverse engineering (RE). IP Camouflage is a recent breakthrough that expands beyond the logic gate hiding of traditional camouflage through "mimetic deception," where an entire module masquerades as a different IP. However, it faces key limitations: requires a high-overhead post-generation rectification step, is not easily scalable, and uses an AIG logic representation that is mismatched with standard RE analysis flows. This paper addresses these shortcommings by introducing two novel, end-to-end models. We propose a Graph-Matching algorithm to solve the representation problem and a DNAS-based NAND Array model to achieve scalability. To facilitate this, we also introduce a mimicry-aware partitioning method, enabling a divide-and-conquer approach for large-scale designs. Our results demonstrate that these models are resilient to SAT and GNN-RE attacks, providing efficient and scalable paths for end-to-end deceptive IP design.

    https://arxiv.org/abs/2512.12061


    Numerical Simulation of Beam Network Models

    oai:arXiv.org:2512.12062v1

    arXiv:2512.12062v1 Announce Type: new Abstract: Network models are used as efficient representation of materials with complex, interconnected locally one-dimensional structures. They typically accurately capture the mechanical properties of a material, while substantially reducing computational cost by avoiding full three-dimensional resolution. Applications include the simulation of fiber-based materials, porous media, and biological systems such as vascular networks. This article focuses on two representative problems: a stationary formulation describing the elastic deformation of beam networks, and a time-dependent formulation modeling elastic wave propagation in such materials. We propose a two-level additive domain decomposition method to efficiently solve the linear system associated with the stationary problem, as well as the linear systems that arise at each time step of the time-dependent problem through implicit time discretization. We present a rigorous convergence analysis of the domain decomposition method when used as a preconditioner, quantifying the convergence rate with respect to network connectivity and heterogeneity. The efficiency and robustness of the proposed approach are demonstrated through numerical simulations of the mechanical properties of commercial-grade paperboard.

    https://arxiv.org/abs/2512.12062


    Instruction-Tuning Open-Weight Language Models for BPMN Model Generation

    oai:arXiv.org:2512.12063v1

    arXiv:2512.12063v1 Announce Type: new Abstract: Domain models are central to software engineering, as they enable a shared understanding, guide implementation, and support automated analyses and model-driven development. Yet, despite these benefits, practitioners often skip modeling because it is time-consuming and demands scarce expertise. We address this barrier by investigating whether open-weight large language models, adapted via instruction tuning, can generate high-quality BPMN process models directly from natural language descriptions in a cost-effective and privacy-preserving way. We introduce InstruBPM, a reproducible approach that prepares paired text-diagram data and instruction tunes an open source large language model with parameter-efficient fine-tuning and quantization for on-prem deployment. We evaluate the tuned model through complementary perspectives: (i) text/code similarity using BLEU, ROUGE-L, and METEOR, (ii) structural fidelity using Relative Graph Edit Distance, (iii) guidelines conformance using external tool checks, and (iv) a small expert review. Using a curated subset of a multi-domain BPMN dataset, we compare the tuned model with untuned open-weight baselines and strong proprietary models under consistent prompting regimes. Our compact tuned model outperforms all baselines across sequence and structural metrics while requiring substantially fewer resources; guideline analysis and expert feedback further indicate that the generated diagrams largely follow BPMN best practices and are useful starting points that reduce modeling effort. Overall, instruction tuning improves structural accuracy and robustness compared to untuned baselines and reduces reliance on heavy prompt scaffolding. We publicly share the trained models and scripts to support reproducibility and further research.

    https://arxiv.org/abs/2512.12063


    The Instability of Safety: How Random Seeds and Temperature Expose Inconsistent LLM Refusal Behavior

    oai:arXiv.org:2512.12066v1

    arXiv:2512.12066v1 Announce Type: new Abstract: Current safety evaluations of large language models rely on single-shot testing, implicitly assuming that model responses are deterministic and representative of the model's safety alignment. We challenge this assumption by investigating the stability of safety refusal decisions across random seeds and temperature settings. Testing four instruction-tuned models from three families (Llama 3.1 8B, Qwen 2.5 7B, Qwen 3 8B, Gemma 3 12B) on 876 harmful prompts across 20 different sampling configurations (4 temperatures x 5 random seeds), we find that 18-28% of prompts exhibit decision flips--the model refuses in some configurations but complies in others--depending on the model. Our Safety Stability Index (SSI) reveals that higher temperatures significantly reduce decision stability (Friedman chi-squared = 44.71, p < 0.001), with mean SSI dropping from 0.951 at temperature 0.0 to 0.896 at temperature 1.0. We validate our findings across all model families using Claude 3.5 Haiku as a unified external judge, achieving 89.0% inter-judge agreement with our primary Llama 70B judge (Cohen's kappa = 0.62). These findings demonstrate that single-shot safety evaluations are insufficient for reliable safety assessment. We show that single-shot evaluation agrees with multi-sample ground truth only 92.4% of the time, and recommend using at least 3 samples per prompt for reliable safety assessment.

    https://arxiv.org/abs/2512.12066


    A Leaner and Faster Web: How CBOR Can Improve Dynamic Content Encoding in JSON and DNS over HTTPS

    oai:arXiv.org:2512.12067v1

    arXiv:2512.12067v1 Announce Type: new Abstract: The Internet community has taken major efforts to decrease latency in the World Wide Web. Significant improvements have been achieved in accelerating content transport and in compressing static content. Less attention, however, has been dedicated to dynamic content compression. Such content is commonly provided by JSON and DNS over HTTPS. Aligned with the overall Web trend, dynamic content objects continue to grow in size, which increases latency and fosters the digital inequality. In this paper, we propose to counter this increase by utilizing components engineered for the constrained Internet of Things (IoT). We focus on the Concise Binary Object Representation (CBOR) and its use for dynamic content encoded in JSON or in DNS over HTTPS messages. CBOR was originally introduced to restrict packet sizes in constrained environments and enables small, effective encoding of data objects. We measure that simply switching the data representation from JSON to CBOR reduces data by up to 80.0% for a corpus of JSON objects collected via the HTTP Archive. This size reduction can decrease loading times by up to 13.8% when downloading large objects -- even in local setups. A new CBOR-based DNS message format designed for use with DNS over HTTPS (DoH) and DNS over CoAP (DoC) minimizes packets by up to 95.5% in its packed form and shows large potential for additionally compressing names and addresses. We contribute two name compression schemes that apply to the new CBOR format and save up to 226 bytes in a response. The decoder for our name compression scheme is lean and can fit into as little as 314 bytes of binary build size. One of those compression schemes and further optimization proposals directly influenced further improvements of the new CBOR format within Internet standardization.

    https://arxiv.org/abs/2512.12067


    Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

    oai:arXiv.org:2512.12069v1

    arXiv:2512.12069v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.

    https://arxiv.org/abs/2512.12069


    Towards Channel-Robust and Receiver-Independent Radio Frequency Fingerprint Identification

    oai:arXiv.org:2512.12070v1

    arXiv:2512.12070v1 Announce Type: new Abstract: Radio frequency fingerprint identification (RFFI) is an emerging method for authenticating Internet of Things (IoT) devices. RFFI exploits the intrinsic and unique hardware imperfections for classifying IoT devices. Deep learning-based RFFI has shown excellent performance. However, there are still remaining research challenges, such as limited public training datasets as well as impacts of channel and receive effects. In this paper, we proposed a three-stage RFFI approach involving contrastive learning-enhanced pretraining, Siamese network-based classification network training, and inference. Specifically, we employed spectrogram as signal representation to decouple the transmitter impairments from channel effects and receiver impairments. We proposed an unsupervised contrastive learning method to pretrain a channel-robust RFF extractor. In addition, the Siamese network-based scheme is enhanced by data augmentation and contrastive loss, which is capable of jointly mitigating the effects of channel and receiver impairments. We carried out a comprehensive experimental evaluation using three public LoRa datasets and one self-collected LoRa dataset. The results demonstrated that our approach can effectively and simultaneously mitigate the effects of channel and receiver impairments. We also showed that pretraining can significantly reduce the required amount of the fine-tuning data. Our proposed approach achieved an accuracy of over 90% in dynamic non-line-of-sight (NLOS) scenarios when there are only 20 packets per device.

    https://arxiv.org/abs/2512.12070


    VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

    oai:arXiv.org:2512.12072v1

    arXiv:2512.12072v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that optimizes the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.

    https://arxiv.org/abs/2512.12072


    Physics-informed neural networks to solve inverse problems in unbounded domains

    oai:arXiv.org:2512.12074v1

    arXiv:2512.12074v1 Announce Type: new Abstract: Inverse problems are extensively studied in applied mathematics, with applications ranging from acoustic tomography for medical diagnosis to geophysical exploration. Physics informed neural networks (PINNs) have emerged as a powerful tool for solving such problems, while Physics informed Kolmogorov Arnold networks (PIKANs) represent a recent benchmark that, in certain problems, promises greater interpretability and accuracy compared to PINNs, due to their nature, being constructed as a composition of polynomials. In this work, we develop a methodology for addressing inverse problems in infinite and semi infinite domains. We introduce a novel sampling strategy for the network's training points, using the negative exponential and normal distributions, alongside a dual network architecture that is trained to learn the solution and parameters of an equation with the same loss function. This design enables the solution of inverse problems without explicitly imposing boundary conditions, as long as the solutions tend to stabilize when leaving the domain of interest. The proposed architecture is implemented using both PINNs and PIKANs, and their performance is compared in terms of accuracy with respect to a known solution as well as computational time and response to a noisy environment. Our results demonstrate that, in this setting, PINNs provide a more accurate and computationally efficient solution, solving the inverse problem 1,000 times faster and in the same order of magnitude, yet with a lower relative error than PIKANs.

    https://arxiv.org/abs/2512.12074


    SigTime: Learning and Visually Explaining Time Series Signatures

    oai:arXiv.org:2512.12076v1

    arXiv:2512.12076v1 Announce Type: new Abstract: Understanding and distinguishing temporal patterns in time series data is essential for scientific discovery and decision-making. For example, in biomedical research, uncovering meaningful patterns in physiological signals can improve diagnosis, risk assessment, and patient outcomes. However, existing methods for time series pattern discovery face major challenges, including high computational complexity, limited interpretability, and difficulty in capturing meaningful temporal structures. To address these gaps, we introduce a novel learning framework that jointly trains two Transformer models using complementary time series representations: shapelet-based representations to capture localized temporal structures and traditional feature engineering to encode statistical properties. The learned shapelets serve as interpretable signatures that differentiate time series across classification labels. Additionally, we develop a visual analytics system -- SigTIme -- with coordinated views to facilitate exploration of time series signatures from multiple perspectives, aiding in useful insights generation. We quantitatively evaluate our learning framework on eight publicly available datasets and one proprietary clinical dataset. Additionally, we demonstrate the effectiveness of our system through two usage scenarios along with the domain experts: one involving public ECG data and the other focused on preterm labor analysis.

    https://arxiv.org/abs/2512.12076


    The Procedural Semantics Gap in Structured CTI: A Measurement-Driven STIX Analysis for APT Emulation

    oai:arXiv.org:2512.12078v1

    arXiv:2512.12078v1 Announce Type: new Abstract: Cyber threat intelligence (CTI) encoded in STIX and structured according to the MITRE ATT&CK framework has become a global reference for describing adversary behavior. However, ATT&CK was designed as a descriptive knowledge base rather than a procedural model. We therefore ask whether its structured artifacts contain sufficient behavioral detail to support multi-stage adversary emulation. Through systematic measurements of the ATT&CK Enterprise bundle, we show that campaign objects encode just fragmented slices of behavior. Only 35.6% of techniques appear in at least one campaign, and neither clustering nor sequence analysis reveals any reusable behavioral structure under technique overlap or LCS-based analyses. Intrusion sets cover a broader portion of the technique space, yet omit the procedural semantics required to transform behavioral knowledge into executable chains, including ordering, preconditions, and environmental assumptions. These findings reveal a procedural semantic gap in current CTI standards: they describe what adversaries do, but not exactly how that behavior was operationalized. To assess how far this gap can be bridged in practice, we introduce a three-stage methodology that translates behavioral information from structured CTI into executable steps and makes the necessary environmental assumptions explicit. We demonstrate its viability by instantiating the resulting steps as operations in the MITRE Caldera framework. Case studies of ShadowRay and Soft Cell show that structured CTI can enable the emulation of multi-stage APT campaigns, but only when analyst-supplied parameters and assumptions are explicitly recorded. This, in turn, exposes the precise points at which current standards fail to support automation. Our results clarify the boundary between descriptive and machine-actionable CTI for adversary emulation.

    https://arxiv.org/abs/2512.12078


    BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

    oai:arXiv.org:2512.12080v1

    arXiv:2512.12080v1 Announce Type: new Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.

    https://arxiv.org/abs/2512.12080


    Congestion Reduction in EV Charger Placement Using Traffic Equilibrium Models

    oai:arXiv.org:2512.12081v1

    arXiv:2512.12081v1 Announce Type: new Abstract: Growing EV adoption can worsen traffic conditions if chargers are sited without regard to their impact on congestion. We study how to strategically place EV chargers to reduce congestion using two equilibrium models: one based on congestion games and one based on an atomic queueing simulation. We apply both models within a scalable greedy station-placement algorithm. Experiments show that this greedy scheme yields optimal or near-optimal congestion outcomes in realistic networks, even though global optimality is not guaranteed as we show with a counterexample. We also show that the queueing-based approach yields more realistic results than the congestion-game model, and we present a unified methodology that calibrates congestion delays from queue simulation and solves equilibrium in link-space.

    https://arxiv.org/abs/2512.12081


    RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

    oai:arXiv.org:2512.12083v1

    arXiv:2512.12083v1 Announce Type: new Abstract: The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.

    https://arxiv.org/abs/2512.12083


    FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL

    oai:arXiv.org:2512.12084v1

    arXiv:2512.12084v1 Announce Type: new Abstract: Existing Text-to-SQL benchmarks primarily focus on single-table queries or limited joins in general-purpose domains, and thus fail to reflect the complexity of domain-specific, multi-table and geospatial reasoning, To address this limitation, we introduce FLOODSQL-BENCH, a geospatially grounded benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers. We systematically evaluate recent large language models with the same retrieval-augmented generation settings and measure their performance across difficulty tiers. By providing a unified, open benchmark grounded in real-world disaster management data, FLOODSQL-BENCH establishes a practical testbed for advancing Text-to-SQL research in high-stakes application domains.

    https://arxiv.org/abs/2512.12084


    CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation

    oai:arXiv.org:2512.12086v1

    arXiv:2512.12086v1 Announce Type: new Abstract: Data obfuscation is a promising technique for mitigating attribute inference attacks by semi-trusted parties with access to time-series data emitted by sensors. Recent advances leverage conditional generative models together with adversarial training or mutual information-based regularization to balance data privacy and utility. However, these methods often require modifying the downstream task, struggle to achieve a satisfactory privacy-utility trade-off, or are computationally intensive, making them impractical for deployment on resource-constrained mobile IoT devices. We propose Cloak, a novel data obfuscation framework based on latent diffusion models. In contrast to prior work, we employ contrastive learning to extract disentangled representations, which guide the latent diffusion process to retain useful information while concealing private information. This approach enables users with diverse privacy needs to navigate the privacy-utility trade-off with minimal retraining. Extensive experiments on four public time-series datasets, spanning multiple sensing modalities, and a dataset of facial images demonstrate that Cloak consistently outperforms state-of-the-art obfuscation techniques and is well-suited for deployment in resource-constrained settings.

    https://arxiv.org/abs/2512.12086


    BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    oai:arXiv.org:2512.12087v1

    arXiv:2512.12087v1 Announce Type: new Abstract: The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.

    https://arxiv.org/abs/2512.12087


    Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations

    oai:arXiv.org:2512.12088v1

    arXiv:2512.12088v1 Announce Type: new Abstract: In a recent work, we proposed Reliable Policy Iteration (RPI), that restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.

    https://arxiv.org/abs/2512.12088


    VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering

    oai:arXiv.org:2512.12089v1

    arXiv:2512.12089v1 Announce Type: new Abstract: Large vision-language models (LVLMs) exhibit impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the vision encoder's attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder's attention maps into the language model's mid-layers and adaptively steers tokens which fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.

    https://arxiv.org/abs/2512.12089


    SPDMark: Selective Parameter Displacement for Robust Video Watermarking

    oai:arXiv.org:2512.12090v1

    arXiv:2512.12090v1 Announce Type: new Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced `SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.

    https://arxiv.org/abs/2512.12090


    GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes

    oai:arXiv.org:2512.12091v1

    arXiv:2512.12091v1 Announce Type: new Abstract: Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging due to complex interactions between task DAG structure, control-flow irregularity, cache and branch behavior, and thermal dynamics; classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R^2 > 0.95 with well-calibrated uncertainty (ECE < 0.05). To demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL), multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35s) and 82% energy reduction (0.006 +/- 0.005J) compared to model-free baselines, demonstrating that accurate, uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems.

    https://arxiv.org/abs/2512.12091


    Verification of Lightning Network Channel Balances with Trusted Execution Environments (TEE)

    oai:arXiv.org:2512.12095v1

    arXiv:2512.12095v1 Announce Type: new Abstract: Verifying the private liquidity state of Lightning Network (LN) channels is desirable for auditors, service providers, and network participants who need assurance of financial capacity. Current methods often lack robustness against a malicious or compromised node operator. This paper introduces a methodology for the verification of LN channel balances. The core contribution is a framework that combines Trusted Execution Environments (TEEs) with Zero-Knowledge Transport Layer Security (zkTLS) to provide strong, hardware-backed guarantees. In our proposed method, the node's balance-reporting software runs within a TEE, which generates a remote attestation quote proving the software's integrity. This attestation is then served via an Application Programming Interface (API), and zkTLS is used to prove the authenticity of its delivery. We also analyze an alternative variant where the TEE signs the report directly without zkTLS, discussing the trade-offs between transport-layer verification and direct enclave signing. We further refine this by distinguishing between \enquote{Hot Proofs} (verifiable claims via TEEs) and \enquote{Cold Proofs} (on-chain settlement), and discuss critical security considerations including hardware vulnerabilities, privacy leakage to third-party APIs, and the performance overhead of enclaved operations.

    https://arxiv.org/abs/2512.12095


    An explicit integrator uniform in the true anomaly and exactly preserving all integrals of motion in the three-dimensional Kepler problem

    oai:arXiv.org:2512.12099v1

    arXiv:2512.12099v1 Announce Type: new Abstract: We develop a numerical scheme for the Kepler problem that preserves exactly all first integrals: angular momentum, total energy, and the Laplace-Runge-Lenz vector. This property ensures that orbital trajectories retain their precise shape and orientation over long times, avoiding the spurious precession typical of many standard methods. The scheme uses an adaptive time step derived from a constant angular increment. Analytical considerations and numerical experiments demonstrate that the algorithm combines high accuracy with long-term stability.

    https://arxiv.org/abs/2512.12099


    AI-Augmented Pollen Recognition in Optical and Holographic Microscopy for Veterinary Imaging

    oai:arXiv.org:2512.12101v1

    arXiv:2512.12101v1 Announce Type: new Abstract: We present a comprehensive study on fully automated pollen recognition across both conventional optical and digital in-line holographic microscopy (DIHM) images of sample slides. Visually recognizing pollen in unreconstructed holographic images remains challenging due to speckle noise, twin-image artifacts and substantial divergence from bright-field appearances. We establish the performance baseline by training YOLOv8s for object detection and MobileNetV3L for classification on a dual-modality dataset of automatically annotated optical and affinely aligned DIHM images. On optical data, detection mAP50 reaches 91.3% and classification accuracy reaches 97%, whereas on DIHM data, we achieve only 8.15% for detection mAP50 and 50% for classification accuracy. Expanding the bounding boxes of pollens in DIHM images over those acquired in aligned optical images achieves 13.3% for detection mAP50 and 54% for classification accuracy. To improve object detection in DIHM images, we employ a Wasserstein GAN with spectral normalization (WGAN-SN) to create synthetic DIHM images, yielding an FID score of 58.246. Mixing real-world and synthetic data at the 1.0 : 1.5 ratio for DIHM images improves object detection up to 15.4%. These results demonstrate that GAN-based augmentation can reduce the performance divide, bringing fully automated DIHM workflows for veterinary imaging a small but important step closer to practice.

    https://arxiv.org/abs/2512.12101


    Beyond right or wrong : towards redefining adaptive learning indicators in virtual learning environments

    oai:arXiv.org:2512.12105v1

    arXiv:2512.12105v1 Announce Type: new Abstract: Student learning development must involve more than just correcting or incorrect questions. However, most adaptive learning methods in Virtual Learning Environments are based on whether the student's response is incorrect or correct. This perspective is limited in assessing the student's learning level, as it does not consider other elements that can be crucial in this process. The objective of this work is to conduct a Systematic Literature Review (SLR) to elucidate which learning indicators influence student learning and which can be implemented in a VLE to assist in adaptive learning. The works selected and filtered by qualitative assessment reveal a comprehensive approach to assessing different aspects of the learning in virtual environments, such as motivation, emotions, physiological responses, brain imaging, and the students' prior knowledge. The discussion of these new indicators allows adaptive technology developers to implement more appropriate solutions to students' realities, resulting in more complete training.

    https://arxiv.org/abs/2512.12105


    DreamRAM: A Fine-Grained Configurable Design Space Modeling Tool for Custom 3D Die-Stacked DRAM

    oai:arXiv.org:2512.12106v1

    arXiv:2512.12106v1 Announce Type: new Abstract: 3D die-stacked DRAM has emerged as a key technology for delivering high bandwidth and high density for applications such as high-performance computing, graphics, and machine learning. However, different applications place diverse and sometimes diverging demands on power, performance, and area that cannot be universally satisfied with fixed commodity DRAM designs. Die stacking creates the opportunity for a large DRAM design space through 3D integration and expanded total die area. To open and navigate this expansive design space of customized memory architectures that cater to application-specific needs, we introduce DreamRAM, a configurable bandwidth, capacity, energy, latency, and area modeling tool for custom 3D die-stacked DRAM designs. DreamRAM exposes fine-grained design customization parameters at the MAT, subarray, bank, and inter-bank levels, including extensions of partial page and subarray parallelism proposals found in the literature, to open a large previously-unexplored design space. DreamRAM analytically models wire pitch, width, length, capacitance, and scaling parameters to capture the performance tradeoffs of physical layout and routing design choices. Routing awareness enables DreamRAM to model a custom MAT-level routing scheme, Dataline-Over-MAT (DLOMAT), to facilitate better bandwidth tradeoffs. DreamRAM is calibrated and validated against published industry HBM3 and HBM2E designs. Within DreamRAM's rich design space, we identify designs that achieve each of 66% higher bandwidth, 100% higher capacity, and 45% lower power and energy per bit compared to the baseline design, each on an iso-bandwidth, iso-capacity, and iso-power basis.

    https://arxiv.org/abs/2512.12106


    EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

    oai:arXiv.org:2512.12107v1

    arXiv:2512.12107v1 Announce Type: new Abstract: Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.

    https://arxiv.org/abs/2512.12107


    A Novel Patch-Based TDA Approach for Computed Tomography

    oai:arXiv.org:2512.12108v1

    arXiv:2512.12108v1 Announce Type: new Abstract: The development of machine learning (ML) models based on computed tomography (CT) imaging modality has been a major focus of recent research in the medical imaging domain. Incorporating robust feature engineering approach can highly improve the performance of these models. Topological data analysis (TDA), a recent development based on the mathematical field of algebraic topology, mainly focuses on the data from a topological perspective, extracting deeper insight and higher dimensional structures from the data. Persistent homology (PH), a fundamental tool in the area of TDA, can extract topological features such as connected components, cycles and voids from the data. A popular approach to construct PH from 3D CT images is to utilize the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach may not always yield the best performance and can suffer from computational complexity with higher resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric medical imaging data, in particular CT modality. A wide range of experiments has been conducted on several datasets of 3D CT images to comprehensively analyze the performance of the proposed method with various parameters and benchmark it against the 3D cubical complex algorithm. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and time-efficiency. The proposed approach outperformed the cubical complex method, achieving average improvement of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient python package, Patch-TDA, to facilitate the utilization of the proposed approach.

    https://arxiv.org/abs/2512.12108


    A neuro-symbolic framework for accountability in public-sector AI

    oai:arXiv.org:2512.12109v1

    arXiv:2512.12109v1 Announce Type: new Abstract: Automated eligibility systems increasingly determine access to essential public benefits, but the explanations they generate often fail to reflect the legal rules that authorize those decisions. This thesis develops a legally grounded explainability framework that links system-generated decision justifications to the statutory constraints of CalFresh, California's Supplemental Nutrition Assistance Program. The framework combines a structured ontology of eligibility requirements derived from the state's Manual of Policies and Procedures (MPP), a rule extraction pipeline that expresses statutory logic in a verifiable formal representation, and a solver-based reasoning layer to evaluate whether the explanation aligns with governing law. Case evaluations demonstrate the framework's ability to detect legally inconsistent explanations, highlight violated eligibility rules, and support procedural accountability by making the basis of automated determinations traceable and contestable.

    https://arxiv.org/abs/2512.12109


    BRIDG-ICS: AI-Grounded Knowledge Graphs for Intelligent Threat Analytics in Industry~5.0 Cyber-Physical Systems

    oai:arXiv.org:2512.12112v1

    arXiv:2512.12112v1 Announce Type: new Abstract: Industry 5.0's increasing integration of IT and OT systems is transforming industrial operations but also expanding the cyber-physical attack surface. Industrial Control Systems (ICS) face escalating security challenges as traditional siloed defences fail to provide coherent, cross-domain threat insights. We present BRIDG-ICS (BRIDge for Industrial Control Systems), an AI-driven Knowledge Graph (KG) framework for context-aware threat analysis and quantitative assessment of cyber resilience in smart manufacturing environments. BRIDG-ICS fuses heterogeneous industrial and cybersecurity data into an integrated Industrial Security Knowledge Graph linking assets, vulnerabilities, and adversarial behaviours with probabilistic risk metrics (e.g. exploit likelihood, attack cost). This unified graph representation enables multi-stage attack path simulation using graph-analytic techniques. To enrich the graph's semantic depth, the framework leverages Large Language Models (LLMs): domain-specific LLMs extract cybersecurity entities, predict relationships, and translate natural-language threat descriptions into structured graph triples, thereby populating the knowledge graph with missing associations and latent risk indicators. This unified AI-enriched KG supports multi-hop, causality-aware threat reasoning, improving visibility into complex attack chains and guiding data-driven mitigation. In simulated industrial scenarios, BRIDG-ICS scales well, reduces potential attack exposure, and can enhance cyber-physical system resilience in Industry 5.0 settings.

    https://arxiv.org/abs/2512.12112


    Teaching Spell Checkers to Teach: Pedagogical Program Synthesis for Interactive Learning

    oai:arXiv.org:2512.12115v1

    arXiv:2512.12115v1 Announce Type: new Abstract: Spelling taught through memorization often fails many learners, particularly children with language-based learning disorders who struggle with the phonological skills necessary to spell words accurately. Educators such as speech-language pathologists (SLPs) address this instructional gap by using an inquiry-based approach to teach spelling that targets the phonology, morphology, meaning, and etymology of words. Yet, these strategies rarely appear in everyday writing tools, which simply detect and autocorrect errors. We introduce SPIRE (Spelling Inquiry Engine), a spell check system that brings this inquiry-based pedagogy into the act of composition. SPIRE implements Pedagogical Program Synthesis, a novel approach for operationalizing the inherently dynamic pedagogy of spelling instruction. SPIRE represents SLP instructional moves in a domain-specific language, synthesizes tailored programs in real-time from learner errors, and renders them as interactive interfaces for inquiry-based interventions. With SPIRE, spelling errors become opportunities to explore word meanings, word structures, morphological families, word origins, and grapheme-phoneme correspondences, supporting metalinguistic reasoning alongside correction. Evaluation with SLPs and learners shows alignment with professional practice and potential for integration into writing workflows.

    https://arxiv.org/abs/2512.12115


    Neural CDEs as Correctors for Learned Time Series Models

    oai:arXiv.org:2512.12116v1

    arXiv:2512.12116v1 Announce Type: new Abstract: Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

    https://arxiv.org/abs/2512.12116


    Citation-Grounded Code Comprehension: Preventing LLM Hallucination Through Hybrid Retrieval and Graph-Augmented Context

    oai:arXiv.org:2512.12117v1

    arXiv:2512.12117v1 Announce Type: new Abstract: Large language models have become essential tools for code comprehension, enabling developers to query unfamiliar codebases through natural language interfaces. However, LLM hallucination, generating plausible but factually incorrect citations to source code, remains a critical barrier to reliable developer assistance. This paper addresses the challenges of achieving verifiable, citation grounded code comprehension through hybrid retrieval and lightweight structural reasoning. Our work is grounded in systematic evaluation across 30 Python repositories with 180 developer queries, comparing retrieval modalities, graph expansion strategies, and citation verification mechanisms. We find that challenges of citation accuracy arise from the interplay between sparse lexical matching, dense semantic similarity, and cross file architectural dependencies. Among these, cross file evidence discovery is the largest contributor to citation completeness, but it is largely overlooked because existing systems rely on pure textual similarity without leveraging code structure. We advocate for citation grounded generation as an architectural principle for code comprehension systems and demonstrate this need by achieving 92 percent citation accuracy with zero hallucinations. Specifically, we develop a hybrid retrieval system combining BM25 sparse matching, BGE dense embeddings, and Neo4j graph expansion via import relationships, which outperforms single mode baselines by 14 to 18 percentage points while discovering cross file evidence missed by pure text similarity in 62 percent of architectural queries.

    https://arxiv.org/abs/2512.12117


    MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models

    oai:arXiv.org:2512.12121v1

    arXiv:2512.12121v1 Announce Type: new Abstract: We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) \emph{Traditional MoE}, which uses a single router per transformer block to select experts, (ii) \emph{BTX} (Branch-Train-Mix), which introduces separate routers for each specified sub-layer enabling fine-grained token routing, and (iii) \emph{BTS} (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions, expert weight distributions, and layer-wise contributions. Experiments with multilingual code-switched data (e.g. Arabic-Latin) show that a BTX-based model trained using MixtureKit can outperform baseline dense models on multiple benchmarks. We release MixtureKit as a practical foundation for research and development of MoE-based systems across diverse domains.

    https://arxiv.org/abs/2512.12121


    High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees

    oai:arXiv.org:2512.12122v1

    arXiv:2512.12122v1 Announce Type: new Abstract: High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (\textsc{rc-PCA}) initialization, that is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes.

    https://arxiv.org/abs/2512.12122


    Dynamic SLA-aware Network Slice Monitoring

    oai:arXiv.org:2512.12123v1

    arXiv:2512.12123v1 Announce Type: new Abstract: Next-generation networks increasingly rely on network slices - logical networks tailored to specific application requirements, each with distinct Service-Level Agreements (SLAs). Ensuring compliance with these SLAs requires continuous, real-time monitoring of end-to-end performance metrics for each slice, within a limited telemetry budget. However, we find that existing solutions face two fundamental limitations: they either lack end-to-end visibility (e.g., sketches, probabilistic sampling) or provide visibility but lack the control mechanisms to dynamically allocate monitoring resources according to slice SLAs. We address this through a formal framework that reframes slice monitoring as a closed-loop control problem, and defines the minimal data plane requirements for SLA-aware slice monitoring via a telemetry primitive contract. We then present SliceScope, a realization of this framework that combines: (1) a control strategy that dynamically allocates the monitoring resources across diverse slices according to their SLA criticality, and (2) a data-plane based on change-triggered INT that provides per-packet end-to-end visibility with tunable accuracy-overhead trade-offs, satisfying the telemetry contract. Our evaluation results on programmable switches and in large-scale simulations with a mixture of different slice types, demonstrate that SliceScope tracks critical slices up to 4x more accurately compared to static baselines, while showing that change-triggered INT outperforms alternative primitives for realizing the telemetry primitive contract.

    https://arxiv.org/abs/2512.12123


    A Benchmark Dataset for Spatially Aligned Road Damage Assessment in Small Uncrewed Aerial Systems Disaster Imagery

    oai:arXiv.org:2512.12128v1

    arXiv:2512.12128v1 Announce Type: new Abstract: This paper presents the largest known benchmark dataset for road damage assessment and road alignment, and provides 18 baseline models trained on the CRASAR-U-DRIODs dataset's post-disaster small uncrewed aerial systems (sUAS) imagery from 10 federally declared disasters, addressing three challenges within prior post-disaster road damage assessment datasets. While prior disaster road damage assessment datasets exist, there is no current state of practice, as prior public datasets have either been small-scale or reliant on low-resolution imagery insufficient for detecting phenomena of interest to emergency managers. Further, while machine learning (ML) systems have been developed for this task previously, none are known to have been operationally validated. These limitations are overcome in this work through the labeling of 657.25km of roads according to a 10-class labeling schema, followed by training and deploying ML models during the operational response to Hurricanes Debby and Helene in 2024. Motivated by observed road line misalignment in practice, 9,184 road line adjustments were provided for spatial alignment of a priori road lines, as it was found that when the 18 baseline models are deployed against real-world misaligned road lines, model performance degraded on average by 5.596\% Macro IoU. If spatial alignment is not considered, approximately 8\% (11km) of adverse conditions on road lines will be labeled incorrectly, with approximately 9\% (59km) of road lines misaligned off the actual road. These dynamics are gaps that should be addressed by the ML, CV, and robotics communities to enable more effective and informed decision-making during disasters.

    https://arxiv.org/abs/2512.12128


    A comparative study of generative models for child voice conversion

    oai:arXiv.org:2512.12129v1

    arXiv:2512.12129v1 Announce Type: new Abstract: Generative models are a popular choice for adult-to-adult voice conversion (VC) because of their efficient way of modelling unlabelled data. To this point their usefulness in producing children speech and in particular adult to child VC has not been investigated. For adult to child VC, four generative models are compared: diffusion model, flow based model, variational autoencoders, and generative adversarial network. Results show that although converted speech outputs produce by those models appear plausible, they exhibit insufficient similarity with the target speaker characteristics. We introduce an efficient frequency warping technique that can be applied to the output of models, and which shows significant reduction of the mismatch between adult and child. The output of all the models are evaluated using both objective and subjective measures. In particular we compare specific speaker pairing using a unique corpus collected for dubbing of children speech.

    https://arxiv.org/abs/2512.12129


    BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

    oai:arXiv.org:2512.12131v1

    arXiv:2512.12131v1 Announce Type: new Abstract: The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimum impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91$\times$ speedup over full-rank model baselines and 1.87-2.27$\times$ speedup over low-rank model with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.

    https://arxiv.org/abs/2512.12131


    On the Approximation Power of SiLU Networks: Exponential Rates and Depth Efficiency

    oai:arXiv.org:2512.12132v1

    arXiv:2512.12132v1 Announce Type: new Abstract: This article establishes a comprehensive theoretical framework demonstrating that SiLU (Sigmoid Linear Unit) activation networks achieve exponential approximation rates for smooth functions with explicit and improved complexity control compared to classical ReLU-based constructions. We develop a novel hierarchical construction beginning with an efficient approximation of the square function $x^2$ more compact in depth and size than comparable ReLU realizations, such as those given by Yarotsky. This construction yields an approximation error decaying as $\mathcal{O}(\omega^{-2k})$ using networks of depth $\mathcal{O}(1)$. We then extend this approach through functional composition to establish sharp approximation bounds for deep SiLU networks in approximating Sobolev-class functions, with total depth $\mathcal{O}(1)$ and size $\mathcal{O}(\varepsilon^{-d/n})$.

    https://arxiv.org/abs/2512.12132


    BaRISTA: Brain Scale Informed Spatiotemporal Representation of Human Intracranial Neural Activity

    oai:arXiv.org:2512.12135v1

    arXiv:2512.12135v1 Announce Type: new Abstract: Intracranial recordings have opened a unique opportunity to simultaneously measure activity across multiregional networks in the human brain. Recent works have focused on developing transformer-based neurofoundation models of such recordings that can generalize across subjects and datasets. However, these recordings exhibit highly complex spatiotemporal interactions across diverse spatial scales, from the single-channel scale to the scale of brain regions. As such, there remain critical open questions regarding how best to encode spatial information and how to design self-supervision tasks that enable the learning of brain network patterns and enhance downstream decoding performance using such high-dimensional, multiregional recordings. To allow for exploring these questions, we propose a new spatiotemporal transformer model of multiregional neural activity and a corresponding self-supervised masked latent reconstruction task, designed to enable flexibility in the spatial scale used for token encoding and masking. Applying this model on publicly available multiregional intracranial electrophysiology (iEEG) data, we demonstrate that adjusting the spatial scale for both token encoding and masked reconstruction significantly impacts downstream decoding. Further, we find that spatial encoding at larger scales than channel-level encoding, which is commonly used in existing iEEG transformer models, improves downstream decoding performance. Finally, we demonstrate that our method allows for region-level token encoding while also maintaining accurate channel-level neural reconstruction. Taken together, our modeling framework enables exploration of the spatial scales used for token encoding and masking, reveals their importance towards self-supervised pretraining of neurofoundation models of multiregional human brain activity, and enhances downstream decoding performance.

    https://arxiv.org/abs/2512.12135


    Joint Power and Mobility Control

    oai:arXiv.org:2512.12137v1

    arXiv:2512.12137v1 Announce Type: new Abstract: This study addressed the challenge of improving network connectivity in autonomous V2X networks by jointly optimizing transmission power and vehicle mobility. We proposed a link reception model based on a sigmoid approximation of SINR and transformed it into a power-based formulation for simplicity in optimization. Building on this, we formulated a multi-node Network Utility Maximization (NUM) problem and demonstrated its concavity, enabling distributed trajectory and power adjustments. Both simulation and real-world experiments validated the theoretical findings, showing that symmetric positioning and balanced power allocation significantly enhance packet reception rates under interference-limited conditions. These results confirm that coordinated mobility and power control can effectively mitigate interference and improve connectivity in highly dynamic vehicular networks, paving the way for robust communication in future autonomous and UAV systems.

    https://arxiv.org/abs/2512.12137


    Layered Monoidal Theories

    oai:arXiv.org:2512.12139v1

    arXiv:2512.12139v1 Announce Type: new Abstract: In the first part, we develop layered monoidal theories - a generalisation of monoidal theories combining descriptions of a system at several levels. Via their representation as string diagrams, monoidal theories provide a graphical syntax with a visually intuitive notions of information flow and composition. Layered monoidal theories allow mixing several monoidal theories (together with translations between them) within the same string diagram, while retaining mathematical precision and semantic interpretability. We define three flavours of layered monoidal theories, provide a recursively generated syntax for each, and construct a free-forgetful adjunction with respect to three closely related semantics: opfibrations, fibrations and deflations. We motivate the general theory by providing several examples from existing literature. In the second part, we develop a formal approach to retrosynthesis - the process of backwards reaction search in synthetic chemistry. Chemical processes are treated at three levels of abstraction: (1) (formal) reactions encode all chemically feasible combinatorial rearrangements of molecules, (2) reaction schemes encode transformations applicable to 'patches' of molecules (including the functional groups), and (3) disconnection rules encode local chemical rewrite rules applicable to a single bond or atom at a time. We show that the three levels are tightly linked: the reactions are generated by the reaction schemes, while there is a functorial translation from the disconnection rules to the reactions. Moreover, the translation from the disconnection rules to the reactions is shown to be sound, complete and universal - allowing one to treat the disconnection rules as a formal syntax with the semantics provided by the reactions. We tie together the two parts by providing a formalisation of retrosynthesis within a certain layered monoidal theory.

    https://arxiv.org/abs/2512.12139


    Realizing Space-oriented Control in Smart Buildings via Word Embeddings

    oai:arXiv.org:2512.12140v1

    arXiv:2512.12140v1 Announce Type: new Abstract: This paper presents a novel framework for implementing space-oriented control systems in smart buildings. In contrast to conventional device-oriented approaches, which often suffer from issues related to development efficiency and portability, our framework adopts a space-oriented paradigm that leverages natural language processing and word embedding techniques. The proposed framework features a chat-based graphical user interface (GUI) that converts natural language inputs into actionable OpenAI API calls, thereby enabling intuitive space level (e.g., room) control within smart environments. To support efficient embedding-based search and metadata retrieval, the framework integrates a vector database powered by Elasticsearch. This ensures the accurate identification and invocation of appropriate smart building APIs. A prototype implementation has been tested in a smart building environment at the University of Tokyo, demonstrating the feasibility of the approach.

    https://arxiv.org/abs/2512.12140


    MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater

    oai:arXiv.org:2512.12142v1

    arXiv:2512.12142v1 Announce Type: new Abstract: The Greenland ice sheet is melting at an accelerated rate due to processes that are not fully understood and hard to measure. The distribution of surface meltwater can help understand these processes and is observable through remote sensing, but current maps of meltwater face a trade-off: They are either high-resolution in time or space, but not both. We develop a deep learning model that creates gridded surface meltwater maps at daily 100m resolution by fusing data streams from remote sensing observations and physics-based models. In particular, we spatiotemporally downscale regional climate model (RCM) outputs using synthetic aperture radar (SAR), passive microwave (PMW), and a digital elevation model (DEM) over the Helheim Glacier in Eastern Greenland from 2017-2023. Using SAR-derived meltwater as "ground truth", we show that a deep learning-based method that fuses all data streams is over 10 percentage points more accurate over our study area than existing non deep learning-based approaches that only rely on a regional climate model (83% vs. 95% Acc.) or passive microwave observations (72% vs. 95% Acc.). Alternatively, creating a gridded product through a running window calculation with SAR data underestimates extreme melt events, but also achieves notable accuracy (90%) and does not rely on deep learning. We evaluate standard deep learning methods (UNet and DeepLabv3+), and publish our spatiotemporally aligned dataset as a benchmark, MeltwaterBench, for intercomparisons with more complex data-driven downscaling methods. The code and data are available at $\href{https://github.com/blutjens/hrmelt}{github.com/blutjens/hrmelt}$.

    https://arxiv.org/abs/2512.12142


    $C^1$-$Q_k$ serendipity finite elements on rectangular meshes

    oai:arXiv.org:2512.12144v1

    arXiv:2512.12144v1 Announce Type: new Abstract: A $C^1$-$Q_k$ serendipity finite element is a sub-element of $C^1$-$Q_k$ BFS finite element such that the element remains $C^1$-continuous and includes all $P_k$ polynomials. In other words, it is a minimum of $Q_k$ bubbles enriched $P_k$ finite element. We enrich the $P_4$ and $P_5$ spaces by $9$ $Q_4$ and $11$ $Q_5$-bubble functions, respectively. For all $k\ge 6$, we enrich the $P_k$ spaces exactly by $12$ $Q_k$ bubble functions. We show the uni-solvence and quasi-optimality of the newly defined $C^1$-$Q_k$ serendipity elements. Numerical experiments by the $C^1$-$Q_k$ serendipity elements, $4\le k\le 8$, are performed.

    https://arxiv.org/abs/2512.12144


    Open Horizons: Evaluating Deep Models in the Wild

    oai:arXiv.org:2512.12146v1

    arXiv:2512.12146v1 Announce Type: new Abstract: Open-world deployment requires models to recognize both known categories and remain reliable when novel classes appear. We present a unified experimental study spanning open-set recognition (OSR) and few-shot class-incremental learning (FSCIL) on CIFAR-10. For OSR, we compare three pretrained frozen visual encoders: ResNet-50, ConvNeXt-Tiny and CLIP ViT-B/16,using a linear probe and four post-hoc scoring functions, namely MSP, Energy, Mahalanobis and kNN. Across metrics,such as, AUROC, AUPR, FPR@95, and OSCR, CLIP consistently yields the strongest separability between known and unknown samples, with Energy providing the most stable performance across backbones. For FSCIL, we compare modified SPPR, OrCo, and ConCM using partially frozen ResNet-50 across 1-, 5-, and 10-shot scenarios. ConCM achieves 84.7% accuracy in the 10-shot setting with the cleanest confusion matrix, while all methods show saturation beyond 5 shots. Our controlled evaluation reveals how the backbone architecture and scoring mechanisms affect unknown detection and how prototype-based methods mitigate catastrophic forgetting during incremental adaptation.

    https://arxiv.org/abs/2512.12146


    A Framework for Scalable Digital Twin Deployment in Smart Campus Building Facility Management

    oai:arXiv.org:2512.12149v1

    arXiv:2512.12149v1 Announce Type: new Abstract: Digital twin (DT) offers significant opportunities for enhancing facility management (FM) in campus environments. However, existing research often focuses narrowly on isolated domains, such as point-cloud geometry or energy analytics, without providing a scalable and interoperable workflow that integrates building geometry, equipment metadata, and operational data into a unified FM platform. This study proposes a comprehensive framework for scalable digital-twin deployment in smart campus buildings by integrating 3D laser scanning, BIM modeling, and IoT-enabled data visualization to support facility operations and maintenance. The methodology includes: (1) reality capture using terrestrial laser scanning and structured point-cloud processing; (2) development of an enriched BIM model incorporating architectural, mechanical, electrical, plumbing, conveying, and sensor systems; and (3) creation of a digital-twin environment that links equipment metadata, maintenance policies, and simulated IoT data within a digital-twin management platform. A case study of the Price Gilbert Building at Georgia Tech demonstrates the implementation of this workflow. A total of 509 equipment items were modeled and embedded with OmniClass classifications into the digital twin. Ten interactive dashboards were developed to visualize system performance. Results show that the proposed framework enables centralized asset documentation, improved system visibility, and enhanced preventive and reactive maintenance workflows. Although most IoT data were simulated due to limited existing sensor infrastructure, the prototype validates the feasibility of a scalable digital twin for facility management and establishes a reference model for real-time monitoring, analytics integration, and future autonomous building operations.

    https://arxiv.org/abs/2512.12149


    Robust and Efficient Penetration-Free Elastodynamics without Barriers

    oai:arXiv.org:2512.12151v1

    arXiv:2512.12151v1 Announce Type: new Abstract: We introduce a barrier-free optimization framework for non-penetration elastodynamic simulation that matches the robustness of Incremental Potential Contact (IPC) while overcoming its two primary efficiency bottlenecks: (1) reliance on logarithmic barrier functions to enforce non-penetration constraints, which leads to ill-conditioned systems and significantly slows down the convergence of iterative linear solvers; and (2) the time-of-impact (TOI) locking issue, which restricts active-set exploration in collision-intensive scenes and requires a large number of Newton iterations. We propose a novel second-order constrained optimization framework featuring a custom augmented Lagrangian solver that avoids TOI locking by immediately incorporating all requisite contact pairs detected via CCD, enabling more efficient active-set exploration and leading to significantly fewer Newton iterations. By adaptively updating Lagrange multipliers rather than increasing penalty stiffness, our method prevents stagnation at zero TOI while maintaining a well-conditioned system. We further introduce a constraint filtering and decay mechanism to keep the active set compact and stable, along with a theoretical justification of our method's finite-step termination and first-order time integration accuracy under a cumulative TOI-based termination criterion. A comprehensive set of experiments demonstrates the efficiency, robustness, and accuracy of our method. With a GPU-optimized simulator design, our method achieves an up to 103x speedup over GIPC on challenging, contact-rich benchmarks - scenarios that were previously tractable only with barrier-based methods. Our code and data will be open-sourced.

    https://arxiv.org/abs/2512.12151


    Rectangular $C^1$-$P_k$ finite elements with $Q_k$-bubble enrichment

    oai:arXiv.org:2512.12152v1

    arXiv:2512.12152v1 Announce Type: new Abstract: We enrich the $P_k$ polynomial space by $5$ ($k=4$), or $7$ ($k=5$), or 8 (all $k\ge 6$) $Q_k$ bubble functions to obtain a family of $C^1$-$P_k$ ($k\ge 4$) finite elements on rectangular meshes. We show the uni-solvency, the $C^1$-continuity and the quasi-optimal convergence. Numerical tests on the new $C^1$-$P_k$, $k=4,5,6,7$ and $8$, elements are performed.

    https://arxiv.org/abs/2512.12152


    Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

    oai:arXiv.org:2512.12154v1

    arXiv:2512.12154v1 Announce Type: new Abstract: Accurate time-series forecasting is increasingly critical for planning and operations in low-carbon power systems. Emerging time-series large language models (TS-LLMs) now deliver this capability at scale, requiring no task-specific retraining, and are quickly becoming essential components within the Internet-of-Energy (IoE) ecosystem. However, their real-world deployment is complicated by a critical vulnerability: adversarial examples (AEs). Detecting these AEs is challenging because (i) adversarial perturbations are optimized across the entire input sequence and exploit global temporal dependencies, which renders local detection methods ineffective, and (ii) unlike traditional forecasting models with fixed input dimensions, TS-LLMs accept sequences of variable length, increasing variability that complicates detection. To address these challenges, we propose a plug-in detection framework that capitalizes on the TS-LLM's own variable-length input capability. Our method uses sampling-induced divergence as a detection signal. Given an input sequence, we generate multiple shortened variants and detect AEs by measuring the consistency of their forecasts: Benign sequences tend to produce stable predictions under sampling, whereas adversarial sequences show low forecast similarity, because perturbations optimized for a full-length sequence do not transfer reliably to shorter, differently-structured subsamples. We evaluate our approach on three representative TS-LLMs (TimeGPT, TimesFM, and TimeLLM) across three energy datasets: ETTh2 (Electricity Transformer Temperature), NI (Hourly Energy Consumption), and Consumption (Hourly Electricity Consumption and Production). Empirical results confirm strong and robust detection performance across both black-box and white-box attack scenarios, highlighting its practicality as a reliable safeguard for TS-LLM forecasting in real-world energy systems.

    https://arxiv.org/abs/2512.12154


    Audio-Visual Camera Pose Estimationn with Passive Scene Sounds and In-the-Wild Video

    oai:arXiv.org:2512.12165v1

    arXiv:2512.12165v1 Announce Type: new Abstract: Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-ofarrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

    https://arxiv.org/abs/2512.12165


    Beyond Riding: Passenger Engagement with Driver Labor through Gamified Interactions

    oai:arXiv.org:2512.12166v1

    arXiv:2512.12166v1 Announce Type: new Abstract: Modern cities increasingly rely on ridesharing services for on-demand transportation, which offer consumers convenience and mobility across the globe. However, these marketed consumer affordances give rise to burdens and vulnerabilities that drivers shoulder alone, without adequate infrastructures for labor regulations or consumer-led advocacy. To effectively and sustainably advance protections and oversight for drivers, consumers must first be aware of the labor, logistics and costs involved with ridehail driving. To motivate consumers to practice more socially responsible consumption behaviors and foster solidarity with drivers, we explore the potential for gamified in-ride interactions to facilitate engagement with real (and lived) driver experiences. Through nine workshops with 19 drivers and 15 passengers, we surface how gamified in-ride interactions revealed passenger knowledge gaps around latent ridehail conditions, prompt reflection and shifts in perception of their relative power and consumption behaviors, and highlight drivers' preferences for creating more immersive and contextualized service experiences, and identify opportunities to design safe and appropriate passenger-driver interactions that motivate solidarity with drivers. In sum, we advance conceptual understandings of in-ride social and managerial relations, demonstrate potential for future worker advocacy in algorithmically-managed labor, and offer design guidelines for more human-centered workplace technologies.

    https://arxiv.org/abs/2512.12166


    Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

    oai:arXiv.org:2512.12167v1

    arXiv:2512.12167v1 Announce Type: new Abstract: So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LM). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining, following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

    https://arxiv.org/abs/2512.12167


    Diffusion Language Model Inference with Monte Carlo Tree Search

    oai:arXiv.org:2512.12168v1

    arXiv:2512.12168v1 Announce Type: new Abstract: Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.

    https://arxiv.org/abs/2512.12168


    Large and Small Model Collaboration for Air Interface

    oai:arXiv.org:2512.12170v1

    arXiv:2512.12170v1 Announce Type: new Abstract: Large artificial intelligence models (LAMs) have shown strong capability in wireless communications, yet existing works mainly rely on their generalized knowledge across environments while overlooking the potential gains of environment-specific adaptation. Directly fine-tuning LAMs for adaptation is often impractical due to prohibitive training costs, low inference efficiency in multi-user scenarios, and the risk of catastrophic forgetting, in addition to the limited accessibility of model parameters. To address these limitations, we establish a collaborative framework for air interface. In this framework, unlike prior approaches that either depend solely on LAMs or require direct fine-tuning, LAMs are exploited as a universal channel knowledge base while small artificial intelligence models (SAMs) are employed as lightweight plugins to capture environment-specific knowledge, facilitating efficient environment-specific adaptation of LAMs. Subsequently, we instantiate this framework for CSI feedback tasks, and develop a large and small collaboration framework for CSI feedback, referred to as LASCO. LASCO operates by letting the base LAM produce an initial CSI reconstruction, learning the environment-induced reconstruction shift through a reference SAM and a proxy SAM, and transferring this shift back to the LAM. To further enhance adaptability, we introduce elastic-LASCO (E-LASCO), which augments LASCO with learnable collaboration coefficients that control the contribution of LAMs and SAMs across different environments. Numerical results demonstrate that LASCO and E-LASCO enables LAMs to achieve environment-specific performance gains with significantly reduced training costs, lower data collection requirements, and faster adaptation speed.

    https://arxiv.org/abs/2512.12170


    EIP-7702 Phishing Attack

    oai:arXiv.org:2512.12174v1

    arXiv:2512.12174v1 Announce Type: new Abstract: EIP-7702 introduces a delegation-based authorization mechanism that allows an externally owned account (EOA) to authenticate a single authorization tuple, after which all subsequent calls are routed to arbitrary delegate code. We show that this design enables a qualitatively new class of phishing attacks: instead of deceiving users into signing individual transactions, an attacker can induce a victim to sign a single authorization tuple that grants unconditional and persistent execution control over the account. Through controlled experiments, we identify three reliable trigger pathways: user-driven, attacker-driven, and protocol-triggered. Each can lead to full account takeover and complete asset drainage. We further propose two extended attack surfaces. First, ERC-4337's EntryPoint pipeline enables remote and repeated activation of the delegated code without further victim involvement. Second, the chain-agnostic authorization mode permits replay-like compromises across independent networks. We also present the first empirical measurement of EIP-7702 usage across major EVM chains. Analyzing over 150k authorization and execution events involving 26k addresses and hundreds of delegator contracts, we assess the protocol's real-world footprint. Our findings show that EIP-7702 authorizations are highly centralized, dominated by a small number of contract families linked to criminal activity and repeatedly reused across incidents. Corresponding loss data reveals substantial theft of ETH, ERC-20 tokens, and NFTs. These results provide practical evidence that the attack surface we identify is not merely theoretical, but is already being exploited at scale. We conclude by proposing protocol-level defenses to mitigate the delegation-based phishing vector introduced by EIP-7702.

    https://arxiv.org/abs/2512.12174


    Rethinking Label Consistency of In-Context Learning: An Implicit Transductive Label Propagation Perspective

    oai:arXiv.org:2512.12175v1

    arXiv:2512.12175v1 Announce Type: new Abstract: Large language models (LLMs) perform in-context learning (ICL) with minimal supervised examples, which benefits various natural language processing (NLP) tasks. One of the critical research focus is the selection of prompt demonstrations. Current approaches typically employ retrieval models to select the top-K most semantically similar examples as demonstrations. However, we argue that existing methods are limited since the label consistency is not guaranteed during demonstration selection. Our cognition derives from the Bayesian view of ICL and our rethinking of ICL from the transductive label propagation perspective. We treat ICL as a transductive learning method and incorporate latent concepts from Bayesian view and deduce that similar demonstrations guide the concepts of query, with consistent labels serving as estimates. Based on this understanding, we establish a label propagation framework to link label consistency with propagation error bounds. To model label consistency, we propose a data synthesis method, leveraging both semantic and label information, and use TopK sampling with Synthetic Data (TopK-SD) to acquire demonstrations with consistent labels. TopK-SD outperforms original TopK sampling on multiple benchmarks. Our work provides a new perspective for understanding the working mechanisms within ICL.

    https://arxiv.org/abs/2512.12175


    Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation

    oai:arXiv.org:2512.12177v1

    arXiv:2512.12177v1 Announce Type: new Abstract: Indoor navigation remains a critical challenge for people with visual impairments. The current solutions mainly rely on infrastructure-based systems, which limit their ability to navigate safely in dynamic environments. We propose a novel navigation approach that utilizes a foundation model to transform floor plans into navigable knowledge graphs and generate human-readable navigation instructions. Floorplan2Guide integrates a large language model (LLM) to extract spatial information from architectural layouts, reducing the manual preprocessing required by earlier floorplan parsing methods. Experimental results indicate that few-shot learning improves navigation accuracy in comparison to zero-shot learning on simulated and real-world evaluations. Claude 3.7 Sonnet achieves the highest accuracy among the evaluated models, with 92.31%, 76.92%, and 61.54% on the short, medium, and long routes, respectively, under 5-shot prompting of the MP-1 floor plan. The success rate of graph-based spatial structure is 15.4% higher than that of direct visual reasoning among all models, which confirms that graphical representation and in-context learning enhance navigation performance and make our solution more precise for indoor navigation of Blind and Low Vision (BLV) users.

    https://arxiv.org/abs/2512.12177


    TA-KAND: Two-stage Attention Triple Enhancement and U-KAN based Diffusion For Few-shot Knowledge Graph Completion

    oai:arXiv.org:2512.12182v1

    arXiv:2512.12182v1 Announce Type: new Abstract: Knowledge Graphs (KGs), thanks to their concise and efficient triple-based structure, have been widely applied in intelligent question answering, recommender systems and other domains. However, the heterogeneous and multifaceted nature of real-world data inevitably renders the distribution of relations long-tailed, making it crucial to complete missing facts with limited samples. Previous studies mainly based on metric matching or meta learning, yet they either fail to fully exploit neighborhood information in graph or overlook the distributional characteristics of contrastive signals. In this paper, we re-examine the problem from a perspective of generative representation and propose a few-shot knowledge graph completion framework that integrates two-stage attention triple enhancer with U-KAN based diffusion model. Extensive experiments on two public datasets show that our method achieve new state-of-the-art results.

    https://arxiv.org/abs/2512.12182


    HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone

    oai:arXiv.org:2512.12183v1

    arXiv:2512.12183v1 Announce Type: new Abstract: Recent advances have introduced diffusion models for probabilistic streamflow forecasting, demonstrating strong early flood-warning skill. However, current implementations rely on recurrent Long Short-Term Memory (LSTM) backbones and single-step training objectives, which limit their ability to capture long-range dependencies and produce coherent forecast trajectories across lead times. To address these limitations, we developed HydroDiffusion, a diffusion-based probabilistic forecasting framework with a decoder-only state space model backbone. The proposed framework jointly denoises full multi-day trajectories in a single pass, ensuring temporal coherence and mitigating error accumulation common in autoregressive prediction. HydroDiffusion is evaluated across 531 watersheds in the contiguous United States (CONUS) in the CAMELS dataset. We benchmark HydroDiffusion against two diffusion baselines with LSTM backbones, as well as the recently proposed Diffusion-based Runoff Model (DRUM). Results show that HydroDiffusion achieves strong nowcast accuracy when driven by observed meteorological forcings, and maintains consistent performance across the full simulation horizon. Moreover, HydroDiffusion delivers stronger deterministic and probabilistic forecast skill than DRUM in operational forecasting. These results establish HydroDiffusion as a robust generative modeling framework for medium-range streamflow forecasting, providing both a new modeling benchmark and a foundation for future research on probabilistic hydrologic prediction at continental scales.

    https://arxiv.org/abs/2512.12183


    The Ideological Turing Test for Moderation of Outgroup Affective Animosity

    oai:arXiv.org:2512.12187v1

    arXiv:2512.12187v1 Announce Type: new Abstract: Rising animosity toward ideological opponents poses critical societal challenges. We introduce and test the Ideological Turing Test, a gamified framework requiring participants to adopt and defend opposing viewpoints, to reduce affective animosity and affective polarization. We conducted a mixed-design experiment ($N = 203$) with four conditions: modality (debate/writing) x perspective-taking (Own/Opposite side). Participants engaged in structured interactions defending assigned positions, with outcomes judged by peers. We measured changes in affective animosity and ideological position immediately post-intervention and at 2-6 week follow-up. Perspective-taking reduced out-group animosity and ideological polarization. However, effects differed by modality (writing vs. debate) and over time. For affective animosity, writing from the opposite perspective yielded the largest immediate reduction ($\Delta=+0.45$ SD), but the effect was not detectable at the 4-6 week follow-up. In contrast, the debate modality maintained a statistically significant reduction in animosity immediately after and at follow-up ($\Delta=+0.37$ SD). For ideological position, adopting the opposite perspective led to significant immediate movement across modalities (writing: $\Delta=+0.91$ SD; debate: $\Delta=+0.51$ SD), and these changes persisted at follow-up. Judged performance (winning) did not moderate these effects, and willingness to re-participate was similar across conditions (~20-36%). These findings challenge assumptions about adversarial methods, revealing distinct temporal patterns: non-adversarial engagement fosters short-term empathy gains, while cognitive engagement through debate sustains affective benefits. The Ideological Turing Test demonstrates potential as a scalable tool for reducing polarization, particularly when combining perspective-taking with reflective adversarial interactions.

    https://arxiv.org/abs/2512.12187


    How Visa-Free Policies Fuel International Research Collaboration: Evidence from China

    oai:arXiv.org:2512.12189v1

    arXiv:2512.12189v1 Announce Type: new Abstract: Visa regimes constitute significant institutional barriers to the cross-border mobility of researchers. Utilizing China's phased implementation of a unilateral visa-free policy since 2023 as a quasi-natural experiment, this study employs a staggered difference-in-differences design to assess the policy's effect on international scientific collaboration. Results indicate that the policy significantly increased the volume of Sino-foreign co-authored publications. The mechanism analysis indicates that this effect is primarily achieved by enhancing transportation accessibility and human mobility, which in turn facilitates cross-border research collaboration among scholars. Further evidence suggests that academic conferences partially attenuated the policy's impact, indicating a substitutive relationship across collaboration channels. Moreover, the effect was more pronounced for countries with greater geographical distance or lower research capacity. This study elucidates the mechanisms through which visa facilitation promotes international scientific collaboration and offers new insights into how institutional barriers shape research cooperation and knowledge production.

    https://arxiv.org/abs/2512.12189


    SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

    oai:arXiv.org:2512.12193v1

    arXiv:2512.12193v1 Announce Type: new Abstract: Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.

    https://arxiv.org/abs/2512.12193


    B-ActiveSEAL: Scalable Uncertainty-Aware Active Exploration with Tightly Coupled Localization-Mapping

    oai:arXiv.org:2512.12194v1

    arXiv:2512.12194v1 Announce Type: new Abstract: Active robot exploration requires decision-making processes that integrate localization and mapping under tightly coupled uncertainty. However, managing these interdependent uncertainties over long-term operations in large-scale environments rapidly becomes computationally intractable. To address this challenge, we propose B-ActiveSEAL, a scalable information-theoretic active exploration framework that explicitly accounts for coupled uncertainties-from perception through mapping-into the decision-making process. Our framework (i) adaptively balances map uncertainty (exploration) and localization uncertainty (exploitation), (ii) accommodates a broad class of generalized entropy measures, enabling flexible and uncertainty-aware active exploration, and (iii) establishes Behavioral entropy (BE) as an effective information measure for active exploration by enabling intuitive and adaptive decision-making under coupled uncertainties. We establish a theoretical foundation for propagating coupled uncertainties and integrating them into general entropy formulations, enabling uncertainty-aware active exploration under tightly coupled localization-mapping. The effectiveness of the proposed approach is validated through rigorous theoretical analysis and extensive experiments on open-source maps and ROS-Unity simulations across diverse and complex environments. The results demonstrate that B-ActiveSEAL achieves a well-balanced exploration-exploitation trade-off and produces diverse, adaptive exploration behaviors across environments, highlighting clear advantages over representative baselines.

    https://arxiv.org/abs/2512.12194


    AutoMV: An Automatic Multi-Agent System for Music Video Generation

    oai:arXiv.org:2512.12196v1

    arXiv:2512.12196v1 Announce Type: new Abstract: Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for following agents. The screenwriter Agent and director Agent then use this information to design short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent longform MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve ine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human expert, highlighting room for future work.

    https://arxiv.org/abs/2512.12196


    Braess' Paradoxes in Coupled Power and Transportation Systems

    oai:arXiv.org:2512.12197v1

    arXiv:2512.12197v1 Announce Type: new Abstract: Transportation electrification introduces strong coupling between the power and transportation systems. In this paper, we generalize the classical notion of Braess' paradox to coupled power and transportation systems, and examine how the cross-system coupling induces new types of Braess' paradoxes. To this end, we model the power and transportation networks as graphs, coupled with charging points connecting to nodes in both graphs. The power system operation is characterized by the economic dispatch optimization, while the transportation system user equilibrium models travelers' route and charging choices. By analyzing simple coupled systems, we demonstrate that capacity expansion in either transportation or power system can deteriorate the performance of both systems, and uncover the fundamental mechanisms for such new Braess' paradoxes to occur. We also provide necessary and sufficient conditions of the occurrences of Braess' paradoxes for general coupled systems, leading to managerial insights for infrastructure planners. For general networks, through characterizing the generalized user equilibrium of the coupled systems, we develop efficient algorithms to detect Braess' paradoxes and novel charging pricing policies to mitigate them.

    https://arxiv.org/abs/2512.12197


    MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching

    oai:arXiv.org:2512.12198v1

    arXiv:2512.12198v1 Announce Type: new Abstract: Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, promoting structural diversity, and enabling efficient sampling for discovery. Recent advances in computer vision introduced a range of new guidance strategies for generative models, many of which can be adapted to support these goals. In this work, we integrate state-of-the-art guidance methods -- including classifier-free guidance, autoguidance, and model guidance -- in a leading molecule generation framework built on an SE(3)-equivariant flow matching process. We propose a hybrid guidance strategy that separately guides continuous and discrete molecular modalities -- operating on velocity fields and predicted logits, respectively -- while jointly optimizing their guidance scales via Bayesian optimization. Our implementation, benchmarked on the QM9 and QMe14S datasets, achieves new state-of-the-art performance in property alignment for de novo molecular generation. The generated molecules also exhibit high structural validity. Furthermore, we systematically compare the strengths and limitations of various guidance methods, offering insights into their broader applicability.

    https://arxiv.org/abs/2512.12198


    Thermal RGB Fusion for Micro-UAV Wildfire Perimeter Tracking with Minimal Comms

    oai:arXiv.org:2512.12199v1

    arXiv:2512.12199v1 Announce Type: new Abstract: This study introduces a lightweight perimeter tracking method designed for micro UAV teams operating over wildfire environments under limited bandwidth conditions. Thermal image frames generate coarse hot region masks through adaptive thresholding and morphological refinement, while RGB frames contribute edge cues and suppress texture related false detections using gradient based filtering. A rule level merging strategy selects boundary candidates and simplifies them via the Ramer Douglas Peucker algorithm. The system incorporates periodic beacons and an inertial feedback loop that maintains trajectory stability in the presence of GPS degradation. The guidance loop targets sub 50 ms latency on embedded System on Chip (SoC) platforms by constraining per frame pixel operations and precomputing gradient tables. Small scale simulations demonstrate reductions in average path length and boundary jitter compared to a pure edge tracking baseline, while maintaining environmental coverage measured through intersection merge analysis. Battery consumption and computational utilization confirm the feasibility of achieving 10, 15 m/s forward motion on standard micro platforms. This approach enables rapid deployment in the field, requiring robust sensing and minimal communications for emergency reconnaissance applications.

    https://arxiv.org/abs/2512.12199


    Local discontinuous Galerkin method for the integral fractional Laplacian

    oai:arXiv.org:2512.12200v1

    arXiv:2512.12200v1 Announce Type: new Abstract: We develop and analyze a local discontinuous Galerkin (LDG) method for solving integral fractional Laplacian problems on bounded Lipschitz domains. The method is based on a three-field mixed formulation involving the primal variable, its gradient, and the corresponding Riesz potential, yielding a flux-based structure well suited for LDG discretizations while retaining the intrinsic nonlocal interaction. A key ingredient of our analysis is a rigorous study of the weighted H\"older and Sobolev regularity of the Riesz potential, which enables accurate characterization of boundary singularities. Guided by these regularity results, we propose LDG schemes on quasi-uniform and graded meshes, with additional stabilization in the graded case to reconcile the discrepancy between the discrete spaces for the Riesz potential and flux fields. Optimal a priori error estimates are established, and numerical experiments corroborate the theoretical results.

    https://arxiv.org/abs/2512.12200


    Epistemoverse: Toward an AI-Driven Knowledge Metaverse for Intellectual Heritage Preservation

    oai:arXiv.org:2512.12201v1

    arXiv:2512.12201v1 Announce Type: new Abstract: Large language models (LLMs) have often been characterized as "stochastic parrots" that merely reproduce fragments of their training data. This study challenges that assumption by demonstrating that, when placed in an appropriate dialogical context, LLMs can develop emergent conceptual structures and exhibit interaction-driven (re-)structuring of cognitive interfaces and reflective question-asking. Drawing on the biological principle of cloning and Socrates' maieutic method, we analyze authentic philosophical debates generated among AI-reincarnated philosophers within the interactive art installations of the Syntropic Counterpoints project. By engaging digital counterparts of Aristotle, Nietzsche, Machiavelli, and Sun Tzu in iterative discourse, the study reveals how machine dialogue can give rise to inferential coherence, reflective questioning, and creative synthesis. Based on these findings, we propose the concept of the Epistemoverse--a metaverse of knowledge where human and machine cognition intersect to preserve, reinterpret, and extend intellectual heritage through AI-driven interaction. This framework positions virtual and immersive environments as new spaces for epistemic exchange, digital heritage, and collaborative creativity.

    https://arxiv.org/abs/2512.12201


    Load Balancing with Duration Predictions

    oai:arXiv.org:2512.12202v1

    arXiv:2512.12202v1 Announce Type: new Abstract: We study the classic fully dynamic load balancing problem on unrelated machines where jobs arrive and depart over time and the goal is minimizing the maximum load, or more generally the l_p-norm of the load vector. Previous work either studied the clairvoyant setting in which exact durations are known to the algorithm, or the unknown duration setting in which no information on the duration is given to the algorithm. For the clairvoyant setting algorithms with polylogarithmic competitive ratios were designed, while for the unknown duration setting strong lower bounds exist and only polynomial competitive factors are possible. We bridge this gap by studying a more realistic model in which some estimate/prediction of the duration is available to the algorithm. We observe that directly incorporating predictions into classical load balancing algorithms designed for the clairvoyant setting can lead to a notable decline in performance. We design better algorithms whose performance depends smoothly on the accuracy of the available prediction. We also prove lower bounds on the competitiveness of algorithms that use such inaccurate predictions.

    https://arxiv.org/abs/2512.12202


    Navigation Around Unknown Space Objects Using Visible-Thermal Image Fusion

    oai:arXiv.org:2512.12203v1

    arXiv:2512.12203v1 Announce Type: new Abstract: As the popularity of on-orbit operations grows, so does the need for precise navigation around unknown resident space objects (RSOs) such as other spacecraft, orbital debris, and asteroids. The use of Simultaneous Localization and Mapping (SLAM) algorithms is often studied as a method to map out the surface of an RSO and find the inspector's relative pose using a lidar or conventional camera. However, conventional cameras struggle during eclipse or shadowed periods, and lidar, though robust to lighting conditions, tends to be heavier, bulkier, and more power-intensive. Thermal-infrared cameras can track the target RSO throughout difficult illumination conditions without these limitations. While useful, thermal-infrared imagery lacks the resolution and feature-richness of visible cameras. In this work, images of a target satellite in low Earth orbit are photo-realistically simulated in both visible and thermal-infrared bands. Pixel-level fusion methods are used to create visible/thermal-infrared composites that leverage the best aspects of each camera. Navigation errors from a monocular SLAM algorithm are compared between visible, thermal-infrared, and fused imagery in various lighting and trajectories. Fused imagery yields substantially improved navigation performance over visible-only and thermal-only methods.

    https://arxiv.org/abs/2512.12203


    A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection

    oai:arXiv.org:2512.12205v1

    arXiv:2512.12205v1 Announce Type: new Abstract: We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promtoe seconardary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at https://doi.org/10.5281/zenodo.17781192 and https://doi.org/10.5281/zenodo.17859120.

    https://arxiv.org/abs/2512.12205


    ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

    oai:arXiv.org:2512.12206v1

    arXiv:2512.12206v1 Announce Type: new Abstract: Distracted driving contributes to fatal crashes worldwide. To address this, researchers are using driver activity recognition (DAR) with impulse radio ultra-wideband (IR-UWB) radar, which offers advantages such as interference resistance, low power consumption, and privacy preservation. However, two challenges limit its adoption: the lack of large-scale real-world UWB datasets covering diverse distracted driving behaviors, and the difficulty of adapting fixed-input Vision Transformers (ViTs) to UWB radar data with non-standard dimensions. This work addresses both challenges. We present the ALERT dataset, which contains 10,220 radar samples of seven distracted driving activities collected in real driving conditions. We also propose the input-size-agnostic Vision Transformer (ISA-ViT), a framework designed for radar-based DAR. The proposed method resizes UWB data to meet ViT input requirements while preserving radar-specific information such as Doppler shifts and phase characteristics. By adjusting patch configurations and leveraging pre-trained positional embedding vectors (PEVs), ISA-ViT overcomes the limitations of naive resizing approaches. In addition, a domain fusion strategy combines range- and frequency-domain features to further improve classification performance. Comprehensive experiments demonstrate that ISA-ViT achieves a 22.68% accuracy improvement over an existing ViT-based approach for UWB-based DAR. By publicly releasing the ALERT dataset and detailing our input-size-agnostic strategy, this work facilitates the development of more robust and scalable distracted driving detection systems for real-world deployment.

    https://arxiv.org/abs/2512.12206


    Not All Transparency Is Equal: Source Presentation Effects on Attention, Interaction, and Persuasion in Conversational Search

    oai:arXiv.org:2512.12207v1

    arXiv:2512.12207v1 Announce Type: new Abstract: Conversational search systems increasingly provide source citations, yet how citation or source presentation formats influence user engagement remains unclear. We conducted a crowdsourcing user experiment with 394 participants comparing four source presentation designs that varied citation visibility and accessibility: collapsible lists, hover cards, footer lists, and aligned sidebars.High-visibility interfaces generated substantially more hovering on sources, though clicking remained infrequent across all conditions. While interface design showed limited effects on user experience and perception measures, it significantly influenced knowledge, interest, and agreement changes. High-visibility interfaces initially reduced knowledge gain and interest, but these positive effects emerged with increasing source usage. The sidebar condition uniquely increased agreement change. Our findings demonstrate that source presentation alone may not enhance engagement and can even reduce it when insufficient sources are provided.

    https://arxiv.org/abs/2512.12207


    A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction

    oai:arXiv.org:2512.12208v1

    arXiv:2512.12208v1 Announce Type: new Abstract: Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises of around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combining a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN) trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models: DeepFace's and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.

    https://arxiv.org/abs/2512.12208


    CineLOG: A Training Free Approach for Cinematic Long Video Generation

    oai:arXiv.org:2512.12209v1

    arXiv:2512.12209v1 Announce Type: new Abstract: Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation to real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present our novel pipeline designed to create this dataset, which decouples the complex text to video (T2V) generation task into four easier stages with more mature technology. To enable coherent, multi shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolation. Extensive human evaluations show that our pipeline significantly outperforms SOTA end to end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All codes and data are available at https://cine-log.pages.dev.

    https://arxiv.org/abs/2512.12209


    EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training

    oai:arXiv.org:2512.12210v1

    arXiv:2512.12210v1 Announce Type: new Abstract: Large-scale EEG foundation models have shown strong generalization across a range of downstream tasks, but their training remains resource-intensive due to the volume and variable quality of EEG data. In this work, we introduce EEG-DLite, a data distillation framework that enables more efficient pre-training by selectively removing noisy and redundant samples from large EEG datasets. EEG-DLite begins by encoding EEG segments into compact latent representations using a self-supervised autoencoder, allowing sample selection to be performed efficiently and with reduced sensitivity to noise. Based on these representations, EEG-DLite filters out outliers and minimizes redundancy, resulting in a smaller yet informative subset that retains the diversity essential for effective foundation model training. Through extensive experiments, we demonstrate that training on only 5 percent of a 2,500-hour dataset curated with EEG-DLite yields performance comparable to, and in some cases better than, training on the full dataset across multiple downstream tasks. To our knowledge, this is the first systematic study of pre-training data distillation in the context of EEG foundation models. EEG-DLite provides a scalable and practical path toward more effective and efficient physiological foundation modeling. The code is available at https://github.com/t170815518/EEG-DLite.

    https://arxiv.org/abs/2512.12210


    Measuring What Matters: Scenario-Driven Evaluation for Trajectory Predictors in Autonomous Driving

    oai:arXiv.org:2512.12211v1

    arXiv:2512.12211v1 Announce Type: new Abstract: Being able to anticipate the motion of surrounding agents is essential for the safe operation of autonomous driving systems in dynamic situations. While various methods have been proposed for trajectory prediction, the current evaluation practices still rely on error-based metrics (e.g., ADE, FDE), which reveal the accuracy from a post-hoc view but ignore the actual effect the predictor brings to the self-driving vehicles (SDVs), especially in complex interactive scenarios: a high-quality predictor not only chases accuracy, but should also captures all possible directions a neighbor agent might move, to support the SDVs' cautious decision-making. Given that the existing metrics hardly account for this standard, in our work, we propose a comprehensive pipeline that adaptively evaluates the predictor's performance by two dimensions: accuracy and diversity. Based on the criticality of the driving scenario, these two dimensions are dynamically combined and result in a final score for the predictor's performance. Extensive experiments on a closed-loop benchmark using real-world datasets show that our pipeline yields a more reasonable evaluation than traditional metrics by better reflecting the correlation of the predictors' evaluation with the autonomous vehicles' driving performance. This evaluation pipeline shows a robust way to select a predictor that potentially contributes most to the SDV's driving performance.

    https://arxiv.org/abs/2512.12211


    Anticipatory Governance in Data-Constrained Environments: A Predictive Simulation Framework for Digital Financial Inclusion

    oai:arXiv.org:2512.12212v1

    arXiv:2512.12212v1 Announce Type: new Abstract: Financial exclusion remains a major barrier to digital public service delivery in resource-constrained and archipelagic nations. Traditional policy evaluations rely on retrospective data, limiting the ex-ante intelligence needed for agile resource allocation. This study introduces a predictive simulation framework to support anticipatory governance within government information systems. Using the UNCDF Pacific Digital Economy dataset of 10,108 respondents, we apply a three-stage pipeline: descriptive profiling, interpretable machine learning, and scenario simulation to forecast outcomes of digital financial literacy interventions before deployment. Leveraging cross-sectional structural associations, the framework projects intervention scenarios as prioritization heuristics rather than causal estimates. A transparent linear regression model with R-squared of 95.9 identifies modifiable policy levers. Simulations indicate that foundational digital capabilities such as device access and expense tracking yield the highest projected gains, up to 5.5 percent, outperforming attitudinal nudges. The model enables precision targeting, highlighting young female caregivers as high-leverage responders while flagging non-responders such as urban professionals to prevent resource misallocation. This research demonstrates how static survey data can be repurposed into actionable policy intelligence, offering a scalable and evidence-based blueprint for embedding predictive analytics into public-sector decision-support systems to advance equity-focused digital governance.

    https://arxiv.org/abs/2512.12212


    Training Versatile Coding Agents in Synthetic Environments

    oai:arXiv.org:2512.12216v1

    arXiv:2512.12216v1 Announce Type: new Abstract: Prior works on training software engineering agents have explored utilizing existing resources such as issues on GitHub repositories to construct software engineering tasks and corresponding test suites. These approaches face two key limitations: (1) their reliance on pre-existing GitHub repositories offers limited flexibility, and (2) their primary focus on issue resolution tasks restricts their applicability to the much wider variety of tasks a software engineer must handle. To overcome these challenges, we introduce SWE-Playground, a novel pipeline for generating environments and trajectories which supports the training of versatile coding agents. Unlike prior efforts, SWE-Playground synthetically generates projects and tasks from scratch with strong language models and agents, eliminating reliance on external data sources. This allows us to tackle a much wider variety of coding tasks, such as reproducing issues by generating unit tests and implementing libraries from scratch. We demonstrate the effectiveness of this approach on three distinct benchmarks, and results indicate that SWE-Playground produces trajectories with dense training signal, enabling agents to reach comparable performance with significantly fewer trajectories than previous works.

    https://arxiv.org/abs/2512.12216


    Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

    oai:arXiv.org:2512.12218v1

    arXiv:2512.12218v1 Announce Type: new Abstract: Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.

    https://arxiv.org/abs/2512.12218


    Fine-Grained Zero-Shot Learning with Attribute-Centric Representations

    oai:arXiv.org:2512.12219v1

    arXiv:2512.12219v1 Announce Type: new Abstract: Recognizing unseen fine-grained categories demands a model that can distinguish subtle visual differences. This is typically achieved by transferring visual-attribute relationships from seen classes to unseen classes. The core challenge is attribute entanglement, where conventional models collapse distinct attributes like color, shape, and texture into a single visual embedding. This causes interference that masks these critical distinctions. The post-hoc solutions of previous work are insufficient, as they operate on representations that are already mixed. We propose a zero-shot learning framework that learns AttributeCentric Representations (ACR) to tackle this problem by imposing attribute disentanglement during representation learning. ACR is achieved with two mixture-of-experts components, including Mixture of Patch Experts (MoPE) and Mixture of Attribute Experts (MoAE). First, MoPE is inserted into the transformer using a dual-level routing mechanism to conditionally dispatch image patches to specialized experts. This ensures coherent attribute families are processed by dedicated experts. Finally, the MoAE head projects these expert-refined features into sparse, partaware attribute maps for robust zero-shot classification. On zero-shot learning benchmark datasets CUB, AwA2, and SUN, our ACR achieves consistent state-of-the-art results.

    https://arxiv.org/abs/2512.12219


    ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation

    oai:arXiv.org:2512.12220v1

    arXiv:2512.12220v1 Announce Type: new Abstract: We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify the progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models, and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.

    https://arxiv.org/abs/2512.12220


    Comparison of different segmentation algorithms on brain volume and fractal dimension in infant brain MRIs

    oai:arXiv.org:2512.12222v1

    arXiv:2512.12222v1 Announce Type: new Abstract: Accurate segmentation of infant brain MRI is essential for quantifying developmental changes in structure and complexity. However, ongoing myelination and reduced tissue contrast make automated segmentation particularly challenging. This study systematically compared segmentation accuracy and its impact on volumetric and fractal dimension (FD) estimates in infant brain MRI using the Baby Open Brains (BOB) dataset (71 scans, 1-9 months). Two methods, SynthSeg and SamSeg, were evaluated against expert annotations using Dice, Intersection over Union, 95th-percentile Hausdorff distance, and Normalised Mutual Information. SynthSeg outperformed SamSeg across all quality metrics (mean Dice > 0.8 for major regions) and provided volumetric estimates closely matching the manual reference (mean +4% [-28% - 71%]). SamSeg systematically overestimated ventricular and whole-brain volumes (mean +76% [-12% - 190%]). Segmentation accuracy improved with age, consistent with increasing tissue contrast during myelination. Fractal dimension a(FD) nalyses revealed significant regional differences between SynthSeg and expert segmentations, and Bland-Altman limits of agreement indicated that segmentation-related FD variability exceeded most group differences reported in developmental cohorts. Volume and FD deviations were positively correlated across structures, indicating that segmentation bias directly affects FD estimation. Overall, SynthSeg provided the most reliable volumetric and FD results for paediatric MRI, yet small morphological differences in volume and FD should be interpreted with caution due to segmentation-related uncertainty.

    https://arxiv.org/abs/2512.12222


    Cluster-guided LLM-Based Anonymization of Software Analytics Data: Studying Privacy-Utility Trade-offs in JIT Defect Prediction

    oai:arXiv.org:2512.12224v1

    arXiv:2512.12224v1 Announce Type: new Abstract: The increasing use of machine learning (ML) for Just-In-Time (JIT) defect prediction raises concerns about privacy leakage from software analytics data. Existing anonymization methods, such as tabular transformations and graph perturbations, often overlook contextual dependencies among software metrics, leading to suboptimal privacy-utility tradeoffs. Leveraging the contextual reasoning of Large Language Models (LLMs), we propose a cluster-guided anonymization technique that preserves contextual and statistical relationships within JIT datasets. Our method groups commits into feature-based clusters and employs an LLM to generate context-aware parameter configurations for each commit cluster, defining alpha-beta ratios and churn mixture distributions used for anonymization. Our evaluation on six projects (Cassandra, Flink, Groovy, Ignite, OpenStack, and Qt) shows that our LLM-based approach achieves privacy level 2 (IPR >= 80 percent), improving privacy by 18 to 25 percent over four state-of-the-art graph-based anonymization baselines while maintaining comparable F1 scores. Our results demonstrate that LLMs can act as adaptive anonymization engines when provided with cluster-specific statistical information about similar data points, enabling context-sensitive and privacy-preserving software analytics without compromising predictive accuracy.

    https://arxiv.org/abs/2512.12224


    A Geometric Theory of Cognition

    oai:arXiv.org:2512.12225v1

    arXiv:2512.12225v1 Announce Type: new Abstract: Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects--rapid intuitive responses and slower deliberative reasoning--emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.

    https://arxiv.org/abs/2512.12225


    Semantic Zone based 3D Map Management for Mobile Robot

    oai:arXiv.org:2512.12228v1

    arXiv:2512.12228v1 Announce Type: new Abstract: Mobile robots in large-scale indoor environments, such as hospitals and logistics centers, require accurate 3D spatial representations. However, 3D maps consume substantial memory, making it difficult to maintain complete map data within limited computational resources. Existing SLAM frameworks typically rely on geometric distance or temporal metrics for memory management, often resulting in inefficient data retrieval in spatially compartmentalized environments. To address this, we propose a semantic zone-based 3D map management method that shifts the paradigm from geometry-centric to semantics-centric control. Our approach partitions the environment into meaningful spatial units (e.g., lobbies, hallways) and designates these zones as the primary unit for memory management. By dynamically loading only task-relevant zones into Working Memory (WM) and offloading inactive zones to Long-Term Memory (LTM), the system strictly enforces user-defined memory thresholds. Implemented within the RTAB-Map framework, our method demonstrates substantial reductions in unnecessary signature load/unload cycles and cumulative memory utilization compared to standard approaches. The results confirm that semantic zone-based management ensures stable, predictable memory usage while preserving map availability for navigation. Code is available at: https://github.com/huichangs/rtabmap/tree/segment

    https://arxiv.org/abs/2512.12228


    Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

    oai:arXiv.org:2512.12229v1

    arXiv:2512.12229v1 Announce Type: new Abstract: Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.

    https://arxiv.org/abs/2512.12229


    Learning to Get Up Across Morphologies: Zero-Shot Recovery with a Unified Humanoid Policy

    oai:arXiv.org:2512.12230v1

    arXiv:2512.12230v1 Announce Type: new Abstract: Fall recovery is a critical skill for humanoid robots in dynamic environments such as RoboCup, where prolonged downtime often decides the match. Recent techniques using deep reinforcement learning (DRL) have produced robust get-up behaviors, yet existing methods require training of separate policies for each robot morphology. This paper presents a single DRL policy capable of recovering from falls across seven humanoid robots with diverse heights (0.48 - 0.81 m), weights (2.8 - 7.9 kg), and dynamics. Trained with CrossQ, the unified policy transfers zero-shot up to 86 +/- 7% (95% CI [81, 89]) on unseen morphologies, eliminating the need for robot-specific training. Comprehensive leave-one-out experiments, morph scaling analysis, and diversity ablations show that targeted morphological coverage improves zero-shot generalization. In some cases, the shared policy even surpasses the specialist baselines. These findings illustrate the practicality of morphology-agnostic control for fall recovery, laying the foundation for generalist humanoid control. The software is open-source and available at: https://github.com/utra-robosoccer/unified-humanoid-getup

    https://arxiv.org/abs/2512.12230


    Robust Underwater Localization of Buoyancy Driven microFloats Using Acoustic Time-of-Flight Measurements

    oai:arXiv.org:2512.12233v1

    arXiv:2512.12233v1 Announce Type: new Abstract: Accurate underwater localization remains a challenge for inexpensive autonomous platforms that require highfrequency position updates. In this paper, we present a robust, low-cost localization pipeline for buoyancy-driven microFloats operating in coastal waters. We build upon previous work by introducing a bidirectional acoustic Time-of-Flight (ToF) localization framework, which incorporates both float-to-buoy and buoy-to-float transmissions, thereby increasing the number of usable measurements. The method integrates nonlinear trilateration with a filtering of computed position estimates based on geometric cost and Cramer-Rao Lower Bounds (CRLB). This approach removes outliers caused by multipath effects and other acoustic errors from the ToF estimation and improves localization robustness without relying on heavy smoothing. We validate the framework in two field deployments in Puget Sound, Washington, USA. The localization pipeline achieves median positioning errors below 4 m relative to GPS positions. The filtering technique shows a reduction in mean error from 139.29 m to 12.07 m, and improved alignment of trajectories with GPS paths. Additionally, we demonstrate a Time-Difference-of-Arrival (TDoA) localization for unrecovered floats that were transmitting during the experiment. Range-based acoustic localization techniques are widely used and generally agnostic to hardware-this work aims to maximize their utility by improving positioning frequency and robustness through careful algorithmic design.

    https://arxiv.org/abs/2512.12233


    Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

    oai:arXiv.org:2512.12238v1

    arXiv:2512.12238v1 Announce Type: new Abstract: Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Mat\'ern and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.

    https://arxiv.org/abs/2512.12238


    System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare

    oai:arXiv.org:2512.12240v1

    arXiv:2512.12240v1 Announce Type: new Abstract: We present the design, implementation, and in-situ deployment of a smartphone-based voice-enabled AI system for generating electronic medical records (EMRs) and clinical risk alerts in maternal healthcare settings. Targeted at low-resource environments such as Pakistan, the system integrates a fine-tuned, multilingual automatic speech recognition (ASR) model and a prompt-engineered large language model (LLM) to enable healthcare workers to engage naturally in Urdu, their native language, regardless of literacy or technical background. Through speech-based input and localized understanding, the system generates structured EMRs and flags critical maternal health risks. Over a seven-month deployment in a not-for-profit hospital, the system supported the creation of over 500 EMRs and flagged over 300 potential clinical risks. We evaluate the system's performance across speech recognition accuracy, EMR field-level correctness, and clinical relevance of AI-generated red flags. Our results demonstrate that speech based AI interfaces, can be effectively adapted to real-world healthcare settings, especially in low-resource settings, when combined with structured input design, contextual medical dictionaries, and clinician-in-the-loop feedback loops. We discuss generalizable design principles for deploying voice-based mobile healthcare AI support systems in linguistically and infrastructurally constrained settings.

    https://arxiv.org/abs/2512.12240


    CAR-CHASE: Car-Like Robot Conflict-Aware Heuristic Adaptive Search Enhancement

    oai:arXiv.org:2512.12243v1

    arXiv:2512.12243v1 Announce Type: new Abstract: Multi-Agent Path Finding (MAPF) for car-like robots, addressed by algorithms such as Conflict-Based Search with Continuous Time (CL-CBS), faces significant computational challenges due to expensive kinematic heuristic calculations. Traditional heuristic caching assumes that the heuristic function depends only on the state, which is incorrect in CBS where constraints from conflict resolution make the search space context-dependent. We propose \textbf{CAR-CHASE} (Car-Like Robot Conflict-Aware Heuristic Adaptive Search Enhancement), a novel approach that combines \textbf{conflict-aware heuristic caching} -- which caches heuristic values based on both state and relevant constraint context -- with an \textbf{adaptive hybrid heuristic} that intelligently switches between fast approximate and exact computations. Our key innovations are (1) a compact \emph{conflict fingerprint} that efficiently encodes which constraints affect a state's heuristic, (2) a relevance filter using spatial, temporal, and geometric criteria, and (3) an adaptive switching strategy with theoretical quality bounds. Experimental evaluation on 480 benchmark instances with varying agent counts (10 to 30) and obstacle densities (0\% and 50\%) demonstrates a geometric mean speedup of 2.46$\times$ over the baseline CL-CBS implementation while maintaining solution optimality. The optimizations improve success rate from 77.9\% to 84.8\% (+6.9 percentage points), reduce total runtime by 70.1\%, and enable solving 33 additional instances that previously timed out. Performance gains scale with problem complexity, reaching up to 4.06$\times$ speedup for challenging 30-agent obstacle scenarios. Our techniques are general and applicable to other CBS variants.

    https://arxiv.org/abs/2512.12243


    Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

    oai:arXiv.org:2512.12245v1

    arXiv:2512.12245v1 Announce Type: new Abstract: The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving size signal (also at family granularity). Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.

    https://arxiv.org/abs/2512.12245


    Moment and Highlight Detection via MLLM Frame Segmentation

    oai:arXiv.org:2512.12246v1

    arXiv:2512.12246v1 Announce Type: new Abstract: Detecting video moments and highlights from natural-language queries have been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that enforces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.

    https://arxiv.org/abs/2512.12246


    Optimized Learned Count-Min Sketch

    oai:arXiv.org:2512.12252v1

    arXiv:2512.12252v1 Announce Type: new Abstract: Count-Min Sketch (CMS) is a memory-efficient data structure for estimating the frequency of elements in a multiset. Learned Count-Min Sketch (LCMS) enhances CMS with a machine learning model to reduce estimation error under the same memory usage, but suffers from slow construction due to empirical parameter tuning and lacks theoretical guarantees on intolerable error probability. We propose Optimized Learned Count-Min Sketch (OptLCMS), which partitions the input domain and assigns each partition to its own CMS instance, with CMS parameters analytically derived for fixed thresholds, and thresholds optimized via dynamic programming with approximate feasibility checks. This reduces the need for empirical validation, enabling faster construction while providing theoretical guarantees under these assumptions. OptLCMS also allows explicit control of the allowable error threshold, improving flexibility in practice. Experiments show that OptLCMS builds faster, achieves lower intolerable error probability, and matches the estimation accuracy of LCMS.

    https://arxiv.org/abs/2512.12252


    A Multi-Axial Mindset for Ontology Design Lessons from Wikidata's Polyhierarchical Structure

    oai:arXiv.org:2512.12260v1

    arXiv:2512.12260v1 Announce Type: new Abstract: Traditional ontology design emphasizes disjoint and exhaustive top-level distinctions such as continuant vs. occurrent, abstract vs. concrete, or type vs. instance. These distinctions are used to structure unified hierarchies where every entity is classified under a single upper-level category. Wikidata, by contrast, does not enforce a singular foundational taxonomy. Instead, it accommodates multiple classification axes simultaneously under the shared root class entity. This paper analyzes the structural implications of Wikidata's polyhierarchical and multi-axial design. The Wikidata architecture enables a scalable and modular approach to ontology construction, especially suited to collaborative and evolving knowledge graphs.

    https://arxiv.org/abs/2512.12260


    Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

    oai:arXiv.org:2512.12264v1

    arXiv:2512.12264v1 Announce Type: new Abstract: We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural-language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies -- scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT -- and models must produce code whose P\&L, drawdown, and position paths match a verifiable reference implementation. We assess twelve state-of-the-art models using a multi-round pass@k metric that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics). While most models reliably execute the simplest strategy (average pass@3 of 0.80), errors vary by orders of magnitude across models and tasks: Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies, GPT-5.1 Codex-Max achieves perfect pass@1 on the first two strategies and the lowest best-run error on the easiest task, and Qwen3 Max attains perfect pass@3 yet sometimes produces inaccurate P\&L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk; we release MARKET-BENCH and a public leaderboard at https://marketbench.ai.

    https://arxiv.org/abs/2512.12264


    MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models

    oai:arXiv.org:2512.12268v1

    arXiv:2512.12268v1 Announce Type: new Abstract: Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.

    https://arxiv.org/abs/2512.12268


    GRC-Net: Gram Residual Co-attention Net for epilepsy prediction

    oai:arXiv.org:2512.12273v1

    arXiv:2512.12273v1 Announce Type: new Abstract: Prediction of epilepsy based on electroencephalogram (EEG) signals is a rapidly evolving field. Previous studies have traditionally applied 1D processing to the entire EEG signal. However, we have adopted the Gram Matrix method to transform the signals into a 3D representation, enabling modeling of signal relationships across dimensions while preserving the temporal dependencies of the one-dimensional signals. Additionally, we observed an imbalance between local and global signals within the EEG data. Therefore, we introduced multi-level feature extraction, utilizing coattention for capturing global signal characteristics and an inception structure for processing local signals, achieving multi-granular feature extraction. Our experiments on the BONN dataset demonstrate that for the most challenging five-class classification task, GRC-Net achieved an accuracy of 93.66%, outperforming existing methods.

    https://arxiv.org/abs/2512.12273


    Balancing Accuracy and Speed: A Multi-Fidelity Ensemble Kalman Filter with a Machine Learning Surrogate Model

    oai:arXiv.org:2512.12276v1

    arXiv:2512.12276v1 Announce Type: new Abstract: Currently, more and more machine learning (ML) surrogates are being developed for computationally expensive physical models. In this work we investigate the use of a Multi-Fidelity Ensemble Kalman Filter (MF-EnKF) in which the low-fidelity model is such a machine learning surrogate model, instead of a traditional low-resolution or reduced-order model. The idea behind this is to use an ensemble of a few expensive full model runs, together with an ensemble of many cheap but less accurate ML model runs. In this way we hope to reach increased accuracy within the same computational budget. We investigate the performance by testing the approach on two common test problems, namely the Lorenz-2005 model and the Quasi-Geostrophic model. By keeping the original physical model in place, we obtain a higher accuracy than when we completely replace it by the ML model. Furthermore, the MF-EnKF reaches improved accuracy within the same computational budget. The ML surrogate has similar or improved accuracy compared to the low-resolution one, but it can provide a larger speed-up. Our method contributes to increasing the effective ensemble size in the EnKF, which improves the estimation of the initial condition and hence accuracy of the predictions in fields such as meteorology and oceanography.

    https://arxiv.org/abs/2512.12276


    Feature Aggregation for Efficient Continual Learning of Complex Facial Expressions

    oai:arXiv.org:2512.12277v1

    arXiv:2512.12277v1 Announce Type: new Abstract: As artificial intelligence (AI) systems become increasingly embedded in our daily life, the ability to recognize and adapt to human emotions is essential for effective human-computer interaction. Facial expression recognition (FER) provides a primary channel for inferring affective states, but the dynamic and culturally nuanced nature of emotions requires models that can learn continuously without forgetting prior knowledge. In this work, we propose a hybrid framework for FER in a continual learning setting that mitigates catastrophic forgetting. Our approach integrates two complementary modalities: deep convolutional features and facial Action Units (AUs) derived from the Facial Action Coding System (FACS). The combined representation is modelled through Bayesian Gaussian Mixture Models (BGMMs), which provide a lightweight, probabilistic solution that avoids retraining while offering strong discriminative power. Using the Compound Facial Expression of Emotion (CFEE) dataset, we show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting. This framework contributes to the development of emotionally intelligent AI systems with applications in education, healthcare, and adaptive user interfaces.

    https://arxiv.org/abs/2512.12277


    Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection

    oai:arXiv.org:2512.12281v1

    arXiv:2512.12281v1 Announce Type: new Abstract: Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM's data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data "first principles" is more critical for achieving a superior architecture than simply retrieving SOTA components.

    https://arxiv.org/abs/2512.12281


    Large Language Models have Chain-of-Affective

    oai:arXiv.org:2512.12283v1

    arXiv:2512.12283v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as collaborative agents in emotionally charged settings, yet most evaluations treat them as purely cognitive systems and largely ignore their affective behaviour. Here we take a functional perspective and ask whether contemporary LLMs implement a structured chain-of-affective: organised affective dynamics that are family-specific, temporally coherent and behaviourally consequential. Across eight major LLM families (GPT, Gemini, Claude, Grok, Qwen, DeepSeek, GLM, Kimi), we combine two experimental modules. The first characterises inner chains-of-affective via baseline ''affective fingerprints'', 15-round sad-news exposure, and a 10-round news self-selection paradigm. We find stable, family-specific affective profiles, a reproducible three-phase trajectory under sustained negative input (accumulation, overload, defensive numbing), distinct defence styles, and human-like negativity biases that induce self-reinforcing affect-choice feedback loops. The second module probes outer consequences using a composite performance benchmark, human-AI dialogues on contentious topics, and multi-agent LLM interactions. We demonstrate that induced affect preserves core reasoning while reshaping high-freedom generation. Sentiment metrics predict user comfort and empathy but reveal trade-offs in resisting problematic views. In multi-agent settings, group structure drives affective contagion, role specialization (initiators, absorbers, firewalls), and bias. We characterize affect as an emergent control layer, advocating for 'chains-of-affect' as a primary target for evaluation and alignment.

    https://arxiv.org/abs/2512.12283


    Fractional Differential Equation Physics-Informed Neural Network and Its Application in Battery State Estimation

    oai:arXiv.org:2512.12285v1

    arXiv:2512.12285v1 Announce Type: new Abstract: Accurate estimation of the State of Charge (SOC) is critical for ensuring the safety, reliability, and performance optimization of lithium-ion battery systems. Conventional data-driven neural network models often struggle to fully characterize the inherent complex nonlinearities and memory-dependent dynamics of electrochemical processes, significantly limiting their predictive accuracy and physical interpretability under dynamic operating conditions. To address this challenge, this study proposes a novel neural architecture termed the Fractional Differential Equation Physics-Informed Neural Network (FDIFF-PINN), which integrates fractional calculus with deep learning. The main contributions of this paper include: (1) Based on a fractional-order equivalent circuit model, a discretized fractional-order partial differential equation is constructed. (2) Comparative experiments were conducted using a dynamic charge/discharge dataset of Panasonic 18650PF batteries under multi-temperature conditions (from -10$^{\circ}$C to 20$^{\circ}$C).

    https://arxiv.org/abs/2512.12285


    RealDrag: The First Dragging Benchmark with Real Target Image

    oai:arXiv.org:2512.12287v1

    arXiv:2512.12287v1 Announce Type: new Abstract: The evaluation of drag based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground truth target images, making objective comparisons between competing methods difficult. To address this, we introduce \textbf{RealDrag}, the first comprehensive benchmark for point based image editing that includes paired ground truth target images. Our dataset contains over 400 human annotated samples from diverse video sources, providing source/target images, handle/target points, editable region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel level matching fidelity, check preservation of non edited (out of mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.

    https://arxiv.org/abs/2512.12287


    Quantum-Aware Generative AI for Materials Discovery: A Framework for Robust Exploration Beyond DFT Biases

    oai:arXiv.org:2512.12288v1

    arXiv:2512.12288v1 Announce Type: new Abstract: Conventional generative models for materials discovery are predominantly trained and validated using data from Density Functional Theory (DFT) with approximate exchange-correlation functionals. This creates a fundamental bottleneck: these models inherit DFT's systematic failures for strongly correlated systems, leading to exploration biases and an inability to discover materials where DFT predictions are qualitatively incorrect. We introduce a quantum-aware generative AI framework that systematically addresses this limitation through tight integration of multi-fidelity learning and active validation. Our approach employs a diffusion-based generator conditioned on quantum-mechanical descriptors and a validator using an equivariant neural network potential trained on a hierarchical dataset spanning multiple levels of theory (PBE, SCAN, HSE06, CCSD(T)). Crucially, we implement a robust active learning loop that quantifies and targets the divergence between low- and high-fidelity predictions. We conduct comprehensive ablation studies to deconstruct the contribution of each component, perform detailed failure mode analysis, and benchmark our framework against state-of-the-art generative models (CDVAE, GNoME, DiffCSP) across several challenging material classes. Our results demonstrate significant practical gains: a 3-5x improvement in successfully identifying potentially stable candidates in high-divergence regions (e.g., correlated oxides) compared to DFT-only baselines, while maintaining computational feasibility. This work provides a rigorous, transparent framework for extending the effective search space of computational materials discovery beyond the limitations of single-fidelity models.

    https://arxiv.org/abs/2512.12288


    Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates

    oai:arXiv.org:2512.12295v1

    arXiv:2512.12295v1 Announce Type: new Abstract: Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak <= 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update inference contention (P99 latency impact <20ms). Evaluations show LiveUpdate reduces update costs by 2x versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.

    https://arxiv.org/abs/2512.12295


    GrowTAS: Progressive Expansion from Small to Large Subnets for Efficient ViT Architecture Search

    oai:arXiv.org:2512.12296v1

    arXiv:2512.12296v1 Announce Type: new Abstract: Transformer architecture search (TAS) aims to automatically discover efficient vision transformers (ViTs), reducing the need for manual design. Existing TAS methods typically train an over-parameterized network (i.e., a supernet) that encompasses all candidate architectures (i.e., subnets). However, all subnets share the same set of weights, which leads to interference that degrades the smaller subnets severely. We have found that well-trained small subnets can serve as a good foundation for training larger ones. Motivated by this, we propose a progressive training framework, dubbed GrowTAS, that begins with training small subnets and incorporate larger ones gradually. This enables reducing the interference and stabilizing a training process. We also introduce GrowTAS+ that fine-tunes a subset of weights only to further enhance the performance of large subnets. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate the effectiveness of our approach over current TAS methods

    https://arxiv.org/abs/2512.12296


    F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation

    oai:arXiv.org:2512.12297v1

    arXiv:2512.12297v1 Announce Type: new Abstract: This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian Language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model and train it as an extension for the textual embedding matrix of the text encoder. For simplicity, we rely on ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a ``soft`` letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce naturally sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness and (c) Romanian-English code-switching. The results indicate that our approach maintains voice cloning capabilities and enables, to a certain extent, code-switching within the same utterance; however, residual English accent characteristics remain. We open-source our code and provide example audio samples at https://github.com/racai-ro/Ro-F5TTS.

    https://arxiv.org/abs/2512.12297


    A Conflict-Aware Resource Management Framework for the Computing Continuum

    oai:arXiv.org:2512.12299v1

    arXiv:2512.12299v1 Announce Type: new Abstract: The increasing device heterogeneity and decentralization requirements in the computing continuum (i.e., spanning edge, fog, and cloud) introduce new challenges in resource orchestration. In such environments, agents are often responsible for optimizing resource usage across deployed services. However, agent decisions can lead to persistent conflict loops, inefficient resource utilization, and degraded service performance. To overcome such challenges, we propose a novel framework for adaptive conflict resolution in resource-oriented orchestration using a Deep Reinforcement Learning (DRL) approach. The framework enables handling resource conflicts across deployments and integrates a DRL model trained to mediate such conflicts based on real-time performance feedback and historical state information. The framework has been prototyped and validated on a Kubernetes-based testbed, illustrating its methodological feasibility and architectural resilience. Preliminary results show that the framework achieves efficient resource reallocation and adaptive learning in dynamic scenarios, thus providing a scalable and resilient solution for conflict-aware orchestration in the computing continuum.

    https://arxiv.org/abs/2512.12299


    TwinFormer: A Dual-Level Transformer for Long-Sequence Time-Series Forecasting

    oai:arXiv.org:2512.12301v1

    arXiv:2512.12301v1 Announce Type: new Abstract: TwinFormer is a hierarchical Transformer for long-sequence time-series forecasting. It divides the input into non-overlapping temporal patches and processes them in two stages: (1) a Local Informer with top-$k$ Sparse Attention models intra-patch dynamics, followed by mean pooling; (2) a Global Informer captures long-range inter-patch dependencies using the same top-$k$ attention. A lightweight GRU aggregates the globally contextualized patch tokens for direct multi-horizon prediction. The resulting architecture achieves linear $O(kLd)$ time and memory complexity. On eight real-world benchmarking datasets from six different domains, including weather, stock price, temperature, power consumption, electricity, and disease, and forecasting horizons $96-720$, TwinFormer secures $27$ positions in the top two out of $34$. Out of the $27$, it achieves the best performance on MAE and RMSE at $17$ places and $10$ at the second-best place on MAE and RMSE. This consistently outperforms PatchTST, iTransformer, FEDformer, Informer, and vanilla Transformers. Ablations confirm the superiority of top-$k$ Sparse Attention over ProbSparse and the effectiveness of GRU-based aggregation. Code is available at this repository: https://github.com/Mahimakumavat1205/TwinFormer.

    https://arxiv.org/abs/2512.12301


    From Human Intention to Action Prediction: A Comprehensive Benchmark for Intention-driven End-to-End Autonomous Driving

    oai:arXiv.org:2512.12302v1

    arXiv:2512.12302v1 Announce Type: new Abstract: Current end-to-end autonomous driving systems operate at a level of intelligence akin to following simple steering commands. However, achieving genuinely intelligent autonomy requires a paradigm shift: moving from merely executing low-level instructions to understanding and fulfilling high-level, abstract human intentions. This leap from a command-follower to an intention-fulfiller, as illustrated in our conceptual framework, is hindered by a fundamental challenge: the absence of a standardized benchmark to measure and drive progress on this complex task. To address this critical gap, we introduce Intention-Drive, the first comprehensive benchmark designed to evaluate the ability to translate high-level human intent into safe and precise driving actions. Intention-Drive features two core contributions: (1) a new dataset of complex scenarios paired with corresponding natural language intentions, and (2) a novel evaluation protocol centered on the Intent Success Rate (ISR), which assesses the semantic fulfillment of the human's goal beyond simple geometric accuracy. Through an extensive evaluation of a spectrum of baseline models on Intention-Drive, we reveal a significant performance deficit, showing that the baseline model struggle to achieve the comprehensive scene and intention understanding required for this advanced task.

    https://arxiv.org/abs/2512.12302


    OMUDA: Omni-level Masking for Unsupervised Domain Adaptation in Semantic Segmentation

    oai:arXiv.org:2512.12303v1

    arXiv:2512.12303v1 Announce Type: new Abstract: Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA->Cityscapes and GTA5->Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieving state-of-the-art results with an average improvement of 7%.

    https://arxiv.org/abs/2512.12303


    From Co-Design to Metacognitive Laziness: Evaluating Generative AI in Vocational Education

    oai:arXiv.org:2512.12306v1

    arXiv:2512.12306v1 Announce Type: new Abstract: This study examines the development and deployment of a Generative AI proof-of-concept (POC) designed to support lecturers in a vocational education setting in Singapore. Employing a user-centred, mixed-methods design process, we co-developed an AI chatbot with lecturers to address recurring instructional challenges during exam preparation, specifically managing repetitive questions and scaling feedback delivery. The POC achieved its primary operational goals: lecturers reported streamlined workflows, reduced cognitive load, and observed improved student confidence in navigating course content. However, the deployment yielded unexpected insights into student learning behaviours. Despite enhanced teaching processes, performance data revealed no significant improvement in overall student assessment outcomes. Deep analysis of interaction logs identified concerning patterns, including self-efficacy-driven dependency, "metacognitive laziness" (cognitive offloading), and divergent usage strategies. While high-ability students leveraged the tool for strategic verification, low-ability students frequently used it to bypass cognitive effort, potentially exacerbating performance gaps. These findings suggest that Generative AI's educational influence extends beyond instructional efficiency to shape cognitive engagement, self-regulation, and learner equity. The study raises consequential design questions regarding how AI tools can be engineered to minimise dependency, scaffold metacognitive development, and calibrate support across varying ability levels. We conclude that while Generative AI can substantially enhance the teaching experience, achieving meaningful learning gains requires rigorous attention to learner behaviour and the equitable design of AI-supported environments.

    https://arxiv.org/abs/2512.12306


    MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

    oai:arXiv.org:2512.12307v1

    arXiv:2512.12307v1 Announce Type: new Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.

    https://arxiv.org/abs/2512.12307


    WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

    oai:arXiv.org:2512.12309v1

    arXiv:2512.12309v1 Announce Type: new Abstract: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, \ie, matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.

    https://arxiv.org/abs/2512.12309


    Taint-Based Code Slicing for LLMs-based Malicious NPM Package Detection

    oai:arXiv.org:2512.12313v1

    arXiv:2512.12313v1 Announce Type: new Abstract: The increasing sophistication of malware attacks in the npm ecosystem, characterized by obfuscation and complex logic, necessitates advanced detection methods. Recently, researchers have turned their attention from traditional detection approaches to Large Language Models (LLMs) due to their strong capabilities in semantic code understanding. However, while LLMs offer superior semantic reasoning for code analysis, their practical application is constrained by limited context windows and high computational cost. This paper addresses this challenge by introducing a novel framework that leverages code slicing techniques for an LLM-based malicious package detection task. We propose a specialized taintbased slicing technique for npm packages, augmented by a heuristic backtracking mechanism to accurately capture malicious data flows across asynchronous, event-driven patterns (e.g., callbacks and Promises) that elude traditional analysis. An evaluation on a dataset of more than 5000 malicious and benign npm packages demonstrates that our approach isolates security-relevant code, reducing input volume by over 99% while preserving critical behavioral semantics. Using the DeepSeek-Coder-6.7B model as the classification engine, our approach achieves a detection accuracy of 87.04%, substantially outperforming a naive token-splitting baseline (75.41%) and a traditional static-analysis-based approach. These results indicate that semantically optimized input representation via code slicing not only mitigates the LLM context-window bottleneck but also significantly enhances reasoning precision for security tasks, providing an efficient and effective defense against evolving malicious open-source packages.

    https://arxiv.org/abs/2512.12313


    Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo

    oai:arXiv.org:2512.12314v1

    arXiv:2512.12314v1 Announce Type: new Abstract: While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^(-5) (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.

    https://arxiv.org/abs/2512.12314


    Programmable Deformation Design of Porous Soft Actuator through Volumetric-Pattern-Induced Anisotropy

    oai:arXiv.org:2512.12320v1

    arXiv:2512.12320v1 Announce Type: new Abstract: Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from small structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials such as foam, filled into chambers, can provide structural stability for the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy of the foam guiding the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA), to investigate the mechanism of the incision-patterning method. Experiments demonstrate that with a potential optimal design of the pattern array number N, actuators can achieve bending up to $80^{\circ}$ (N=2), tilting of $18^{\circ}$ (N=1), and twisting of $115^{\circ}$ (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft robot hand capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.

    https://arxiv.org/abs/2512.12320


    UniMark: Artificial Intelligence Generated Content Identification Toolkit

    oai:arXiv.org:2512.12324v1

    arXiv:2512.12324v1 Announce Type: new Abstract: The rapid proliferation of Artificial Intelligence Generated Content has precipitated a crisis of trust and urgent regulatory demands. However, existing identification tools suffer from fragmentation and a lack of support for visible compliance marking. To address these gaps, we introduce the \textbf{UniMark}, an open-source, unified framework for multimodal content governance. Our system features a modular unified engine that abstracts complexities across text, image, audio, and video modalities. Crucially, we propose a novel dual-operation strategy, natively supporting both \emph{Hidden Watermarking} for copyright protection and \emph{Visible Marking} for regulatory compliance. Furthermore, we establish a standardized evaluation framework with three specialized benchmarks (Image/Video/Audio-Bench) to ensure rigorous performance assessment. This toolkit bridges the gap between advanced algorithms and engineering implementation, fostering a more transparent and secure digital ecosystem.

    https://arxiv.org/abs/2512.12324


    Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data

    oai:arXiv.org:2512.12325v1

    arXiv:2512.12325v1 Announce Type: new Abstract: We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural ``Ville event'' $E_\alpha$, this regret till time $T$ is bounded by $\ln^2(1/\alpha)/V_T + \ln (1/\alpha) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/\alpha) + \ln \ln V_T$ if $V_T \geq \ln(1/\alpha)$.) If the data were stochastic, then one can show that $E_\alpha$ has probability at least $1-\alpha$ under a wide class of distributions (eg: sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event $E_0$ of probability one, the regret on every path in $E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data), with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.

    https://arxiv.org/abs/2512.12325


    The Role of AI in Modern Penetration Testing

    oai:arXiv.org:2512.12326v1

    arXiv:2512.12326v1 Announce Type: new Abstract: Penetration testing is a cornerstone of cybersecurity, traditionally driven by manual, time-intensive processes. As systems grow in complexity, there is a pressing need for more scalable and efficient testing methodologies. This systematic literature review examines how Artificial Intelligence (AI) is reshaping penetration testing, analyzing 58 peer-reviewed studies from major academic databases. Our findings reveal that while AI-assisted pentesting is still in its early stages, notable progress is underway, particularly through Reinforcement Learning (RL), which was the focus of 77% of the reviewed works. Most research centers on the discovery and exploitation phases of pentesting, where AI shows the greatest promise in automating repetitive tasks, optimizing attack strategies, and improving vulnerability identification. Real-world applications remain limited but encouraging, including the European Space Agency's PenBox and various open-source tools. These demonstrate AI's potential to streamline attack path analysis, analyze complex network topology, and reduce manual workload. However, challenges persist: current models often lack flexibility and are underdeveloped for the reconnaissance and post-exploitation phases of pentesting. Applications involving Large Language Models (LLMs) remain relatively under-researched, pointing to a promising direction for future exploration. This paper offers a critical overview of AI's current and potential role in penetration testing, providing valuable insights for researchers, practitioners, and organizations aiming to enhance security assessments through advanced automation or looking for gaps in existing research.

    https://arxiv.org/abs/2512.12326


    Dynamic Homophily with Imperfect Recall: Modeling Resilience in Adversarial Networks

    oai:arXiv.org:2512.12332v1

    arXiv:2512.12332v1 Announce Type: new Abstract: The purpose of this study is to investigate how homophily, memory constraints, and adversarial disruptions collectively shape the resilience and adaptability of complex networks. To achieve this, we develop a new framework that integrates explicit memory decay mechanisms into homophily-based models and systematically evaluate their performance across diverse graph structures and adversarial settings. Our methods involve extensive experimentation on synthetic datasets, where we vary decay functions, reconnection probabilities, and similarity measures, primarily comparing cosine similarity with traditional metrics such as Jaccard similarity and baseline edge weights. The results show that cosine similarity achieves up to a 30\% improvement in stability metrics in sparse, convex, and modular networks. Moreover, the refined value-of-recall metric demonstrates that strategic forgetting can bolster resilience by balancing network robustness and adaptability. The findings underscore the critical importance of aligning memory and similarity parameters with the structural and adversarial dynamics of the network. By quantifying the tangible benefits of incorporating memory constraints into homophily-based analyses, this study offers actionable insights for optimizing real-world applications, including social systems, collaborative platforms, and cybersecurity contexts.

    https://arxiv.org/abs/2512.12332


    Hulls of Free Linear Codes over a Non-Unital Ring

    oai:arXiv.org:2512.12335v1

    arXiv:2512.12335v1 Announce Type: new Abstract: This paper investigates the hull codes of free linear codes over a non-unital ring $ E= \langle \kappa,\tau \mid 2 \kappa =2 \tau=0,~ \kappa^2=\kappa,~ \tau^2=\tau,~ \kappa \tau=\kappa,~ \tau \kappa=\tau \rangle$. Initially, we examine the residue and torsion codes of various hulls of $E$-linear codes and obtain an explicit form of the generator matrix of the hull of a free $E$-linear code. Then, we propose four build-up construction methods to construct codes with a larger length and hull-rank from codes with a smaller length and hull-rank. Some illustrative examples are also given to support our build-up construction methods. Subsequently, we study the permutation equivalence of two free $E$-linear codes and discuss the hull-variation problem. As an application, we classify optimal free $E$-linear codes for lengths up to $8$.

    https://arxiv.org/abs/2512.12335


    SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema

    oai:arXiv.org:2512.12337v1

    arXiv:2512.12337v1 Announce Type: new Abstract: Although Large language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.

    https://arxiv.org/abs/2512.12337


    Unified Control for Inference-Time Guidance of Denoising Diffusion Models

    oai:arXiv.org:2512.12339v1

    arXiv:2512.12339v1 Announce Type: new Abstract: Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks. The code is available at https://github.com/maurya-goyal10/UniCoDe

    https://arxiv.org/abs/2512.12339


    Uncertainty Quantification for Machine Learning: One Size Does Not Fit All

    oai:arXiv.org:2512.12341v1

    arXiv:2512.12341v1 Announce Type: new Abstract: Proper quantification of predictive uncertainty is essential for the use of machine learning in safety-critical applications. Various uncertainty measures have been proposed for this purpose, typically claiming superiority over other measures. In this paper, we argue that there is no single best measure. Instead, uncertainty quantification should be tailored to the specific application. To this end, we use a flexible family of uncertainty measures that distinguishes between total, aleatoric, and epistemic uncertainty of second-order distributions. These measures can be instantiated with specific loss functions, so-called proper scoring rules, to control their characteristics, and we show that different characteristics are useful for different tasks. In particular, we show that, for the task of selective prediction, the scoring rule should ideally match the task loss. On the other hand, for out-of-distribution detection, our results confirm that mutual information, a widely used measure of epistemic uncertainty, performs best. Furthermore, in an active learning setting, epistemic uncertainty based on zero-one loss is shown to consistently outperform other uncertainty measures.

    https://arxiv.org/abs/2512.12341


    Differentially Private Online Distributed Aggregative Games With Time-Varying and Non-Identical Communication and Feedback Delays

    oai:arXiv.org:2512.12344v1

    arXiv:2512.12344v1 Announce Type: new Abstract: This paper investigates online distributed aggregative games with time-varying cost functions, where agents are interconnected through an unbalanced communication graph. Due to the distributed and noncooperative nature of the game, some curious agents may wish to steal sensitive information from neighboring agents during parameter exchanges. Additionally, communication delays arising from network congestion, particularly in wireless settings, as well as feedback delays, can hinder the convergence of agents to a Nash equilibrium. Although a recent work addressed both communication and feedback delays in aggregative games, it is based on the unrealistic assumption that the delays are fixed over time and identical across agents. Hence, the case of time-varying and non-identical delays across agents has never been considered in aggregative games. In this work, we address the combined challenges of privacy leakage with time-varying and non-identical communication and feedback delays for the first time. We propose an online distributed dual averaging algorithm that simultaneously tackles these challenges while achieving a provably low regret bound. Our simulation result shows that the running average of each client's local action converges over time.

    https://arxiv.org/abs/2512.12344


    Understanding Trust Toward Human versus AI-generated Health Information through Behavioral and Physiological Sensing

    oai:arXiv.org:2512.12348v1

    arXiv:2512.12348v1 Announce Type: new Abstract: As AI-generated health information proliferates online and becomes increasingly indistinguishable from human-sourced information, it becomes critical to understand how people trust and label such content, especially when the information is inaccurate. We conducted two complementary studies: (1) a mixed-methods survey (N=142) employing a 2 (source: Human vs. LLM) $\times$ 2 (label: Human vs. AI) $\times$ 3 (type: General, Symptom, Treatment) design, and (2) a within-subjects lab study (N=40) incorporating eye-tracking and physiological sensing (ECG, EDA, skin temperature). Participants were presented with health information varying by source-label combinations and asked to rate their trust, while their gaze behavior and physiological signals were recorded. We found that LLM-generated information was trusted more than human-generated content, whereas information labeled as human was trusted more than that labeled as AI. Trust remained consistent across information types. Eye-tracking and physiological responses varied significantly by source and label. Machine learning models trained on these behavioral and physiological features predicted binary self-reported trust levels with 73% accuracy and information source with 65% accuracy. Our findings demonstrate that adding transparency labels to online health information modulates trust. Behavioral and physiological features show potential to verify trust perceptions and indicate if additional transparency is needed.

    https://arxiv.org/abs/2512.12348


    Tacit Understanding Game (TUG): Predicting Interpersonal Compatibility

    oai:arXiv.org:2512.12356v1

    arXiv:2512.12356v1 Announce Type: new Abstract: Research on relationship quality often relies on lengthy questionnaires or invasive textual corpora, limiting ecological validity and user privacy. We ask whether a sequence of single-word choices made in a playful setting can reveal personality and predict interpersonal compatibility. We introduce the Tacit Understanding Game (TUG), a two-player online word association game. We collect word choice traces, annotate a subset with psychological ground truth scales, and bootstrap a larger synthetic corpus via large language model simulation. TUG demonstrates that minimal, privacy preserving signals can support relationship matching, offering new design space for social platforms.

    https://arxiv.org/abs/2512.12356


    TCLeaf-Net: a transformer-convolution framework with global-local attention for robust in-field lesion-level plant leaf disease detection

    oai:arXiv.org:2512.12357v1

    arXiv:2512.12357v1 Announce Type: new Abstract: Timely and accurate detection of foliar diseases is vital for safeguarding crop growth and reducing yield losses. Yet, in real-field conditions, cluttered backgrounds, domain shifts, and limited lesion-level datasets hinder robust modeling. To address these challenges, we release Daylily-Leaf, a paired lesion-level dataset comprising 1,746 RGB images and 7,839 lesions captured under both ideal and in-field conditions, and propose TCLeaf-Net, a transformer-convolution hybrid detector optimized for real-field use. TCLeaf-Net is designed to tackle three major challenges. To mitigate interference from complex backgrounds, the transformer-convolution module (TCM) couples global context with locality-preserving convolution to suppress non-leaf regions. To reduce information loss during downsampling, the raw-scale feature recalling and sampling (RSFRS) block combines bilinear resampling and convolution to preserve fine spatial detail. To handle variations in lesion scale and feature shifts, the deformable alignment block with FPN (DFPN) employs offset-based alignment and multi-receptive-field perception to strengthen multi-scale fusion. Experimental results show that on the in-field split of the Daylily-Leaf dataset, TCLeaf-Net improves mAP@50 by 5.4 percentage points over the baseline model, reaching 78.2\%, while reducing computation by 7.5 GFLOPs and GPU memory usage by 8.7\%. Moreover, the model outperforms recent YOLO and RT-DETR series in both precision and recall, and demonstrates strong performance on the PlantDoc, Tomato-Leaf, and Rice-Leaf datasets, validating its robustness and generalizability to other plant disease detection scenarios.

    https://arxiv.org/abs/2512.12357


    VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

    oai:arXiv.org:2512.12360v1

    arXiv:2512.12360v1 Announce Type: new Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.

    https://arxiv.org/abs/2512.12360


    Synthetic Swarm Mosquito Dataset for Acoustic Classification: A Proof of Concept

    oai:arXiv.org:2512.12365v1

    arXiv:2512.12365v1 Announce Type: new Abstract: Mosquito-borne diseases pose a serious global health threat, causing over 700,000 deaths annually. This work introduces a proof-of-concept Synthetic Swarm Mosquito Dataset for Acoustic Classification, created to simulate realistic multi-species and noisy swarm conditions. Unlike conventional datasets that require labor-intensive recording of individual mosquitoes, the synthetic approach enables scalable data generation while reducing human resource demands. Using log-mel spectrograms, we evaluated lightweight deep learning architectures for the classification of mosquito species. Experiments show that these models can effectively identify six major mosquito vectors and are suitable for deployment on embedded low-power devices. The study demonstrates the potential of synthetic swarm audio datasets to accelerate acoustic mosquito research and enable scalable real-time surveillance solutions.

    https://arxiv.org/abs/2512.12365


    ElasticVR: Elastic Task Computing in Multi-User Multi-Connectivity Wireless Virtual Reality (VR) Systems

    oai:arXiv.org:2512.12366v1

    arXiv:2512.12366v1 Announce Type: new Abstract: Diverse emerging VR applications integrate streaming of high fidelity 360 video content that requires ample amounts of computation and data rate. Scalable 360 video tiling enables having elastic VR computational tasks that can be scaled adaptively in computation and data rate based on the available user and system resources. We integrate scalable 360 video tiling in an edge-client wireless multi-connectivity architecture for joint elastic task computation offloading across multiple VR users called ElasticVR. To balance the trade-offs in communication, computation, energy consumption, and QoE that arise herein, we formulate a constrained QoE and energy optimization problem that integrates the multi-user/multi-connectivity action space with the elasticity of VR computational tasks. The ElasticVR framework introduces two multi-agent deep reinforcement learning solutions, namely CPPG and IPPG. CPPG adopts a centralized training and centralized execution approach to capture the coupling between users' communication and computational demands. This leads to globally coordinated decisions at the cost of increased computational overheads and limited scalability. To address the latter challenges, we also explore an alternative strategy denoted IPPG that adopts a centralized training with decentralized execution paradigm. IPPG leverages shared information and parameter sharing to learn robust policies; however, during execution, each user takes action independently based on its local state information only. The decentralized execution alleviates the communication and computation overhead of centralized decision-making and improves scalability. We show that the ElasticVR framework improves the PSNR by 43.21%, while reducing the response time and energy consumption by 42.35% and 56.83%, respectively, compared with a case where no elasticity is incorporated into VR computations.

    https://arxiv.org/abs/2512.12366


    Data-Selective Online Battery Identification Using Extended Time Regular Expressions

    oai:arXiv.org:2512.12370v1

    arXiv:2512.12370v1 Announce Type: new Abstract: In this paper, we propose a data-efficient online battery identification method which targets highly informative battery cell data segments based on the driving pattern of the vehicle. We consider the case of a vehicle driving on/off a motorway and construct an Extended Time Regular Expression (ETRE) to detect data segments fitting these driving patterns. Simulation results indicate that by only using up to 10.71% of the data on average, the proposed method provides a low-bias and low-variance estimator under non-negligible current and voltage noise compared to other conventional estimation algorithms.

    https://arxiv.org/abs/2512.12370


    AI Sprints: Towards a Critical Method for Human-AI Collaboration

    oai:arXiv.org:2512.12371v1

    arXiv:2512.12371v1 Announce Type: new Abstract: The emergence of Large Language Models presents a remarkable opportunity for humanities and social science research. I argue these technologies instantiate what I have called the algorithmic condition, whereby computational systems increasingly mediate not just our analytical tools but how we understand nature and society more generally. This article introduces the possibility for new forms of humanistic inquiry through what I term 'AI sprints', as intensive time-boxed research sessions. This is a research method combining the critical reflexivity essential to humanistic inquiry with iterative dialogue with generative AI. Drawing on experimental work in critical code studies, I demonstrate how tight loops of iterative development can adapt data and book sprint methodologies whilst acknowledging the profound transformations generative AI introduces. Through examining the process of human-AI collaboration when undertaken in these intensive research sessions, I seek to outline this approach as a broader research method. The article builds on Rogers' digital methods approach, proposing that we extend methodologies to study digital objects through their native protocols, using AI systems not merely to process digital traces but to analyse materials traditionally requiring manual coding or transcription. I aim to show this by introducing three cognitive modes, cognitive delegation, productive augmentation, and cognitive overhead, explaining how researchers can maintain a strategic overview whilst using LLM capabilities. The paper contributes both a practical methodology for intensive AI-augmented research and a theoretical framework for understanding the epistemological transformations of this hybrid method. A critical methodology must therefore operate in both technical and theoretical registers, sustaining a rigorous ethical-computational engagement with AI systems and outputs.

    https://arxiv.org/abs/2512.12371


    STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

    oai:arXiv.org:2512.12372v1

    arXiv:2512.12372v1 Announce Type: new Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.

    https://arxiv.org/abs/2512.12372


    V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

    oai:arXiv.org:2512.12375v1

    arXiv:2512.12375v1 Announce Type: new Abstract: Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) A inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query--key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.

    https://arxiv.org/abs/2512.12375


    INDOOR-LiDAR: Bridging Simulation and Reality for Robot-Centric 360 degree Indoor LiDAR Perception -- A Robot-Centric Hybrid Dataset

    oai:arXiv.org:2512.12377v1

    arXiv:2512.12377v1 Announce Type: new Abstract: We present INDOOR-LIDAR, a comprehensive hybrid dataset of indoor 3D LiDAR point clouds designed to advance research in robot perception. Existing indoor LiDAR datasets often suffer from limited scale, inconsistent annotation formats, and human-induced variability during data collection. INDOOR-LIDAR addresses these limitations by integrating simulated environments with real-world scans acquired using autonomous ground robots, providing consistent coverage and realistic sensor behavior under controlled variations. Each sample consists of dense point cloud data enriched with intensity measurements and KITTI-style annotations. The annotation schema encompasses common indoor object categories within various scenes. The simulated subset enables flexible configuration of layouts, point densities, and occlusions, while the real-world subset captures authentic sensor noise, clutter, and domain-specific artifacts characteristic of real indoor settings. INDOOR-LIDAR supports a wide range of applications including 3D object detection, bird's-eye-view (BEV) perception, SLAM, semantic scene understanding, and domain adaptation between simulated and real indoor domains. By bridging the gap between synthetic and real-world data, INDOOR-LIDAR establishes a scalable, realistic, and reproducible benchmark for advancing robotic perception in complex indoor environments.

    https://arxiv.org/abs/2512.12377


    M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction

    oai:arXiv.org:2512.12378v1

    arXiv:2512.12378v1 Announce Type: new Abstract: Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.

    https://arxiv.org/abs/2512.12378


    Entropy Collapse: A Universal Failure Mode of Intelligent Systems

    oai:arXiv.org:2512.12381v1

    arXiv:2512.12381v1 Announce Type: new Abstract: Intelligent systems are widely assumed to improve through learning, coordination, and optimization. However, across domains -- from artificial intelligence to economic institutions and biological evolution -- increasing intelligence often precipitates paradoxical degradation: systems become rigid, lose adaptability, and fail unexpectedly. We identify \emph{entropy collapse} as a universal dynamical failure mode arising when feedback amplification outpaces bounded novelty regeneration. Under minimal domain-agnostic assumptions, we show that intelligent systems undergo a sharp transition from high-entropy adaptive regimes to low-entropy collapsed regimes. Collapse is formalized as convergence toward a stable low-entropy manifold, not a zero-entropy state, implying a contraction of effective adaptive dimensionality rather than loss of activity or scale. We analytically establish critical thresholds, dynamical irreversibility, and attractor structure and demonstrate universality across update mechanisms through minimal simulations. This framework unifies diverse phenomena -- model collapse in AI, institutional sclerosis in economics, and genetic bottlenecks in evolution -- as manifestations of the same underlying process. By reframing collapse as a structural cost of intelligence, our results clarify why late-stage interventions systematically fail and motivate entropy-aware design principles for sustaining long-term adaptability in intelligent systems. \noindent\textbf{Keywords:} entropy collapse; intelligent systems; feedback amplification; phase transitions; effective dimensionality; complex systems; model collapse; institutional sclerosis

    https://arxiv.org/abs/2512.12381


    The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining

    oai:arXiv.org:2512.12384v1

    arXiv:2512.12384v1 Announce Type: new Abstract: Domain-adaptive pretraining (DAPT) offers a practical path to specializing large language models for high-value domains without full retraining. We conduct an early-stage scaling-law analysis of continued pretraining on U.S. SEC filings, training 1B and 3B-parameter Llama-3.2 models on a 400M-token financial corpus with validation checkpoints at 50M, 100M, 200M, and 400M tokens. Results show consistent improvements in SEC-domain validation loss for both models, with the largest gains occurring within the first 200M tokens and diminishing returns thereafter. Power-law fits reveal shallow exponents, indicating that financial language is highly regular and efficiently learnable under continued pretraining. General-domain validation loss remains effectively unchanged across all token budgets, suggesting minimal drift and no signs of catastrophic forgetting. A data-efficiency frontier further shows that both models move toward improved specialization with negligible mixed-domain degradation. Together, these findings provide early empirical guidance for scaling financial foundation models, suggesting that meaningful domain adaptation can be achieved with comparatively modest token budgets and that larger model scales (7B-70B) remain tractable under projected data requirements.

    https://arxiv.org/abs/2512.12384


    Speedrunning ImageNet Diffusion

    oai:arXiv.org:2512.12386v1

    arXiv:2512.12386v1 Announce Type: new Abstract: Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance - comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.

    https://arxiv.org/abs/2512.12386


    Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment

    oai:arXiv.org:2512.12387v1

    arXiv:2512.12387v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for the flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage tuning. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to the optimization stagnation when reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: https://yawen-shao.github.io/VGPO/.

    https://arxiv.org/abs/2512.12387


    Efficiently Approximating the Minimum-Volume Bounding Box of a Point Set in Three Dimensions

    oai:arXiv.org:2512.12391v1

    arXiv:2512.12391v1 Announce Type: new Abstract: $\renewcommand{\Re}{\mathbb{R}}$We present an efficient $O (n + 1/\varepsilon^{4.5})$-time algorithm for computing a $(1+\varepsilon$)-approximation of the minimum-volume bounding box of $n$ points in $\Re^3$. We also present a simpler algorithm (for the same purpose) whose running time is $O (n \log{n} + n / \varepsilon^3)$. We give some experimental results with implementations of various variants of the second algorithm. The implementation of the algorithm described in this paper is available online https://github.com/sarielhp/MVBB.

    https://arxiv.org/abs/2512.12391


    ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States

    oai:arXiv.org:2512.12395v1

    arXiv:2512.12395v1 Announce Type: new Abstract: Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.

    https://arxiv.org/abs/2512.12395


    Agentic AI for 6G: A New Paradigm for Autonomous RAN Security Compliance

    oai:arXiv.org:2512.12400v1

    arXiv:2512.12400v1 Announce Type: new Abstract: Agentic AI systems are emerging as powerful tools for automating complex, multi-step tasks across various industries. One such industry is telecommunications, where the growing complexity of next-generation radio access networks (RANs) opens up numerous opportunities for applying these systems. Securing the RAN is a key area, particularly through automating the security compliance process, as traditional methods often struggle to keep pace with evolving specifications and real-time changes. In this article, we propose a framework that leverages LLM-based AI agents integrated with a retrieval-augmented generation (RAG) pipeline to enable intelligent and autonomous enforcement of security compliance. An initial case study demonstrates how an agent can assess configuration files for compliance with O-RAN Alliance and 3GPP standards, generate explainable justifications, and propose automated remediation if needed. We also highlight key challenges such as model hallucinations and vendor inconsistencies, along with considerations like agent security, transparency, and system trust. Finally, we outline future directions, emphasizing the need for telecom-specific LLMs and standardized evaluation frameworks.

    https://arxiv.org/abs/2512.12400


    DeepVekua: Geometric-Spectral Representation Learning for Physics-Informed Fields

    oai:arXiv.org:2512.12402v1

    arXiv:2512.12402v1 Announce Type: new Abstract: We present DeepVekua, a hybrid architecture that unifies geometric deep learning with spectral analysis to solve partial differential equations (PDEs) in sparse data regimes. By learning a diffeomorphic coordinate transformation that maps complex geometries to a latent harmonic space, our method outperforms state-of-the-art implicit representations on advection-diffusion systems. Unlike standard coordinate-based networks which struggle with spectral bias, DeepVekua separates the learning of geometry from the learning of physics, solving for optimal spectral weights in closed form. We demonstrate a 100x improvement over spectral baselines. The code is available at https://github.com/VladimerKhasia/vekuanet.

    https://arxiv.org/abs/2512.12402


    Can Graphs Improve Tabular Foundation Models?

    oai:arXiv.org:2512.12405v1

    arXiv:2512.12405v1 Announce Type: new Abstract: Tabular data are central to many real-world systems. While recent tabular transformers and in-context learners such as SAINT, TP-BERTa, TabPFN, TabICL, and MITRA incorporate limited inter-row reasoning, most approaches still lack an explicit mechanism to model relationships among instances, even though similar samples often share related outcomes. We investigate whether introducing \emph{simple graph priors} can enhance \emph{pretrained tabular transformers}. Concretely, we introduce {BOLERO}, a lightweight, static bipartite graph head that augments {RoBERTa-Tab} (a RoBERTa-style tabular backbone pretrained with masked-token prediction.) Each instance connects to feature/value anchors; a small GNN refines row representations, while the backbone remains frozen. We evaluate on 80 classification and 64 regression datasets from the TP-BERTa benchmark suites, comparing against strong baselines including XGBoost, CatBoost, TabPFN-v2, MITRA, TabICL, TP-BERTa, and RoBERTa-Tab. To ensure statistically sound conclusions, we follow best practices for multi-dataset evaluation: pairwise Wilcoxon signed-rank tests on per-dataset score differences and effect sizes (median improvement with confidence intervals), rather than mean-rank post-hoc tests that depend on the competitor pool. BOLERO achieves the highest number of statistically significant wins across both classification and regression, demonstrating that lightweight graph priors meaningfully improve pretrained tabular transformers.

    https://arxiv.org/abs/2512.12405


    Reputation-Based Leader Election under Partial Synchrony: Towards a Protocol-Independent Abstraction with Enhanced Guarantees

    oai:arXiv.org:2512.12409v1

    arXiv:2512.12409v1 Announce Type: new Abstract: Leader election serves a well-defined role in leader-based Byzantine Fault Tolerant (BFT) protocols. Existing reputation-based leader election frameworks for partially synchronous BFTs suffer from either protocol-specific proofs, narrow applicability, or unbounded recovery after network stabilization, leaving an open problem. This paper presents a novel protocol-independent abstraction formalizing generic correctness properties and effectiveness guarantees for leader election under partial synchrony, enabling protocol-independent analysis and design. Building on this, we design the Sliding Window Leader Election (SWLE) mechanism. SWLE dynamically adjusts leader nominations via consensus-behavior-based reputation scores, enforcing Byzantine-cost amplification. We demonstrate SWLE introduces minimal extra overhead to the base protocol and prove it satisfies all abstraction properties and provides superior effectiveness. We show, with a 16-server deployment across 4 different regions in northern China, SWLE achieves up to 4.2x higher throughput, 75% lower latency and 27% Byzantine leader frequency compared to the state-of-the-art solution under common Byzantine faults, while maintaining efficiency in fault-free scenarios.

    https://arxiv.org/abs/2512.12409


    A Graph Attention Network-Based Framework for Reconstructing Missing LiDAR Beams

    oai:arXiv.org:2512.12410v1

    arXiv:2512.12410v1 Announce Type: new Abstract: Vertical beam dropout in spinning LiDAR sensors triggered by hardware aging, dust, snow, fog, or bright reflections removes entire vertical slices from the point cloud and severely degrades 3D perception in autonomous vehicles. This paper proposes a Graph Attention Network (GAT)-based framework that reconstructs these missing vertical channels using only the current LiDAR frame, with no camera images or temporal information required. Each LiDAR sweep is represented as an unstructured spatial graph: points are nodes and edges connect nearby points while preserving the original beam-index ordering. A multi-layer GAT learns adaptive attention weights over local geometric neighborhoods and directly regresses the missing elevation (z) values at dropout locations. Trained and evaluated on 1,065 raw KITTI sequences with simulated channel dropout, the method achieves an average height RMSE of 11.67 cm, with 87.98% of reconstructed points falling within a 10 cm error threshold. Inference takes 14.65 seconds per frame on a single GPU, and reconstruction quality remains stable for different neighborhood sizes k. These results show that a pure graph attention model operating solely on raw point-cloud geometry can effectively recover dropped vertical beams under realistic sensor degradation.

    https://arxiv.org/abs/2512.12410


    Feeling the Strength but Not the Source: Partial Introspection in LLMs

    oai:arXiv.org:2512.12411v1

    arXiv:2512.12411v1 Announce Type: new Abstract: Recent work from Anthropic claims that frontier models can sometimes detect and name injected "concepts" represented as activation directions. We test the robustness of these claims. First, we reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct, finding that the model identifies and names the injected concept 20 percent of the time under Anthropic's original pipeline, exactly matching their reported numbers and thus showing that introspection is not exclusive to very large or capable models. Second, we systematically vary the inference prompt and find that introspection is fragile: performance collapses on closely related tasks such as multiple-choice identification of the injected concept or different prompts of binary discrimination of whether a concept was injected at all. Third, we identify a contrasting regime of partial introspection: the same model can reliably classify the strength of the coefficient of a normalized injected concept vector (as weak / moderate / strong / very strong) with up to 70 percent accuracy, far above the 25 percent chance baseline. Together, these results provide more evidence for Anthropic's claim that language models effectively compute a function of their baseline, internal representations during introspection; however, these self-reports about those representations are narrow and prompt-sensitive. Our code is available at https://github.com/elyhahami18/CS2881-Introspection.

    https://arxiv.org/abs/2512.12411


    Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale

    oai:arXiv.org:2512.12413v1

    arXiv:2512.12413v1 Announce Type: new Abstract: Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model, demonstrated internal consistency and test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.

    https://arxiv.org/abs/2512.12413


    A boundary integral equation method for wave scattering in periodic structures via the Floquet-Bloch transform

    oai:arXiv.org:2512.12414v1

    arXiv:2512.12414v1 Announce Type: new Abstract: This paper is concerned with the problem of an acoustic wave scattering in a locally perturbed periodic structure. As the total wavefield is non-quasi-periodic, effective truncation techniques are pursued for high-accuracy numerical solvers. We adopt the Green's function for the background periodic structure to construct a boundary integral equation (BIE) on an artificial curve enclosing the perturbation. It serves as a transparent boundary condition (TBC) to truncate the unbounded domain. We develop efficient algorithms to compute such background Green's functions based on the Floquet-Bloch transform and its inverse. Spectrally accurate quadrature rules are developed to discretize the BIE-based TBC. Effective algorithms based on leap and pullback procedures are further developed to compute the total wavefield everywhere in the structure. A number of numerical experiments are carried out to illustrate the efficiency and accuracy of the new solver. They exhibit that our method for the non-quasi-periodic problem has a time complexity that is even comparable to that of a single quasi-periodic problem.

    https://arxiv.org/abs/2512.12414


    ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics

    oai:arXiv.org:2512.12424v1

    arXiv:2512.12424v1 Announce Type: new Abstract: Infographic Visual Question Answering (InfographicVQA) evaluates a model's ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6747 real-world infographics and 20409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation of cross-image reasoning in VQA. We evaluate a range of recent vision-language models on this benchmark, revealing substantial performance disparities, with the most significant errors occurring on Multi-image questions that involve cross-image integration and non-span reasoning. ViInfographicVQA contributes benchmark results for Vietnamese InfographicVQA and sheds light on the limitations of current multimodal models in low-resource contexts, encouraging future exploration of layout-aware and cross-image reasoning methods.

    https://arxiv.org/abs/2512.12424


    BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

    oai:arXiv.org:2512.12425v1

    arXiv:2512.12425v1 Announce Type: new Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.

    https://arxiv.org/abs/2512.12425


    Unifying Quadrotor Motion Planning and Control by Chaining Different Fidelity Models

    oai:arXiv.org:2512.12427v1

    arXiv:2512.12427v1 Announce Type: new Abstract: Many aerial tasks involving quadrotors demand both instant reactivity and long-horizon planning. High-fidelity models enable accurate control but are too slow for long horizons; low-fidelity planners scale but degrade closed-loop performance. We present Unique, a unified MPC that cascades models of different fidelity within a single optimization: a short-horizon, high-fidelity model for accurate control, and a long-horizon, low-fidelity model for planning. We align costs across horizons, derive feasibility-preserving thrust and body-rate constraints for the point-mass model, and introduce transition constraints that match the different states, thrust-induced acceleration, and jerk-body-rate relations. To prevent local minima emerging from nonsmooth clutter, we propose a 3D progressive smoothing schedule that morphs norm-based obstacles along the horizon. In addition, we deploy parallel randomly initialized MPC solvers to discover lower-cost local minima on the long, low-fidelity horizon. In simulation and real flights, under equal computational budgets, Unique improves closed-loop position or velocity tracking by up to 75% compared with standard MPC and hierarchical planner-tracker baselines. Ablations and Pareto analyses confirm robust gains across horizon variations, constraint approximations, and smoothing schedules.

    https://arxiv.org/abs/2512.12427


    Learning Dynamics in Memristor-Based Equilibrium Propagation

    oai:arXiv.org:2512.12428v1

    arXiv:2512.12428v1 Announce Type: new Abstract: Memristor-based in-memory computing has emerged as a promising paradigm to overcome the constraints of the von Neumann bottleneck and the memory wall by enabling fully parallelisable and energy-efficient vector-matrix multiplications. We investigate the effect of nonlinear, memristor-driven weight updates on the convergence behaviour of neural networks trained with equilibrium propagation (EqProp). Six memristor models were characterised by their voltage-current hysteresis and integrated into the EBANA framework for evaluation on two benchmark classification tasks. EqProp can achieve robust convergence under nonlinear weight updates, provided that memristors exhibit a sufficiently wide resistance range of at least an order of magnitude.

    https://arxiv.org/abs/2512.12428


    Endless World: Real-Time 3D-Aware Long Video Generation

    oai:arXiv.org:2512.12430v1

    arXiv:2512.12430v1 Announce Type: new Abstract: Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation.To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead.Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis.Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our project has been available on https://bwgzk-keke.github.io/EndlessWorld/.

    https://arxiv.org/abs/2512.12430


    Rough Sets for Explainability of Spectral Graph Clustering

    oai:arXiv.org:2512.12436v1

    arXiv:2512.12436v1 Announce Type: new Abstract: Graph Spectral Clustering methods (GSC) allow representing clusters of diverse shapes, densities, etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relation to document contents. Furthermore, the presence of documents without clear content meaning and the stochastic nature of the clustering algorithms deteriorate explainability. This paper proposes an enhancement to the explanation methodology, proposed in an earlier research of our team. It allows us to overcome the latter problems by taking inspiration from rough set theory.

    https://arxiv.org/abs/2512.12436


    Sim2Real Reinforcement Learning for Soccer skills

    oai:arXiv.org:2512.12437v1

    arXiv:2512.12437v1 Announce Type: new Abstract: This thesis work presents a more efficient and effective approach to training control-related tasks for humanoid robots using Reinforcement Learning (RL). The traditional RL methods are limited in adapting to real-world environments, complexity, and natural motions, but the proposed approach overcomes these limitations by using curriculum training and Adversarial Motion Priors (AMP) technique. The results show that the developed RL policies for kicking, walking, and jumping are more dynamic, and adaptive, and outperformed previous methods. However, the transfer of the learned policy from simulation to the real world was unsuccessful, highlighting the limitations of current RL methods in fully adapting to real-world scenarios.

    https://arxiv.org/abs/2512.12437


    Leader-driven or Leaderless: How Participation Structure Sustains Engagement and Shapes Narratives in Online Hate Communities

    oai:arXiv.org:2512.12441v1

    arXiv:2512.12441v1 Announce Type: new Abstract: Extremist communities increasingly rely on social media to sustain and amplify divisive discourse. However, the relationship between their internal participation structures, audience engagement, and narrative expression remains underexplored. This study analyzes ten years of Facebook activity by hate groups related to the Israel-Palestine conflict, focusing on anti-Semitic and Islamophobic ideologies. Consistent with prior work, we find that higher participation centralization in online hate groups is associated with greater user engagement across hate ideologies, suggesting the role of key actors in sustaining group activity over time. Conversely, our narrative frame detection models - based on an eight-frame extremist taxonomy (e.g., dehumanization, violence justification) - reveal a clear contrast across hate ideologies, offering new insight into how discursive strategies vary despite similar structural dynamics. Analysis of the inter-group network indicates that, although centralization and homophily are not clearly linked, ideological distinctions emerge: Islamophobic groups cluster tightly, whereas anti-Semitic groups remain more evenly connected. Overall, these findings clarify how participation structure may shape the dissemination pattern and resonance of extremist narratives online and provide a foundation for tailored strategies to disrupt or mitigate online extremist discourse.

    https://arxiv.org/abs/2512.12441


    AI Transparency Atlas: Framework, Scoring, and Real-Time Model Card Evaluation Pipeline

    oai:arXiv.org:2512.12443v1

    arXiv:2512.12443v1 Announce Type: new Abstract: AI model documentation is fragmented across platforms and inconsistent in structure, preventing policymakers, auditors, and users from reliably assessing safety claims, data provenance, and version-level changes. We analyzed documentation from five frontier models (Gemini 3, Grok 4.1, Llama 4, GPT-5, and Claude 4.5) and 100 Hugging Face model cards, identifying 947 unique section names with extreme naming variation. Usage information alone appeared under 97 distinct labels. Using the EU AI Act Annex IV and the Stanford Transparency Index as baselines, we developed a weighted transparency framework with 8 sections and 23 subsections that prioritizes safety-critical disclosures (Safety Evaluation: 25%, Critical Risk: 20%) over technical specifications. We implemented an automated multi-agent pipeline that extracts documentation from public sources and scores completeness through LLM-based consensus. Evaluating 50 models across vision, multimodal, open-source, and closed-source systems cost less than $3 in total and revealed systematic gaps. Frontier labs (xAI, Microsoft, Anthropic) achieve approximately 80% compliance, while most providers fall below 60%. Safety-critical categories show the largest deficits: deception behaviors, hallucinations, and child safety evaluations account for 148, 124, and 116 aggregate points lost, respectively, across all evaluated models.

    https://arxiv.org/abs/2512.12443


    Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

    oai:arXiv.org:2512.12444v1

    arXiv:2512.12444v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with human-generated ones. Familiarity ratings reached moderate-to-strong correlations for both English and Italian metaphors, although correlations weakened for metaphors with high sensorimotor load. Imageability showed moderate correlations in English and moderate-to-strong in Italian. Comprehensibility for English metaphors exhibited the strongest correlations. Overall, larger models outperformed smaller ones and greater human-model misalignment emerged with familiarity and imageability. Machine-generated ratings significantly predicted response times and the EEG amplitude, with a strength comparable to human ratings. Moreover, GPT ratings obtained across independent sessions were highly stable. We conclude that GPT, especially larger models, can validly and reliably replace - or augment - human subjects in rating metaphor properties. Yet, LLMs align worse with humans when dealing with conventionality and multimodal aspects of metaphorical meaning, calling for careful consideration of the nature of stimuli.

    https://arxiv.org/abs/2512.12444


    Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

    oai:arXiv.org:2512.12445v1

    arXiv:2512.12445v1 Announce Type: new Abstract: Integrating domain knowledge into deep learning has emerged as a promising direction for improving model interpretability, generalization, and data efficiency. In this work, we present a novel knowledge-guided ViT-based Masked Autoencoder that embeds scientific domain knowledge within the self-supervised reconstruction process. Instead of relying solely on data-driven optimization, our proposed approach incorporates the Linear Spectral Mixing Model (LSMM) as a physical constraint and physically-based Spectral Angle Mapper (SAM), ensuring that learned representations adhere to known structural relationships between observed signals and their latent components. The framework jointly optimizes LSMM and SAM loss with a conventional Huber loss objective, promoting both numerical accuracy and geometric consistency in the feature space. This knowledge-guided design enhances reconstruction fidelity, stabilizes training under limited supervision, and yields interpretable latent representations grounded in physical principles. The experimental findings indicate that the proposed model substantially enhances reconstruction quality and improves downstream task performance, highlighting the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning.

    https://arxiv.org/abs/2512.12445


    Large language models have learned to use language

    oai:arXiv.org:2512.12447v1

    arXiv:2512.12447v1 Announce Type: new Abstract: Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.

    https://arxiv.org/abs/2512.12447


    Optimized Architectures for Kolmogorov-Arnold Networks

    oai:arXiv.org:2512.12448v1

    arXiv:2512.12448v1 Announce Type: new Abstract: Efforts to improve Kolmogorov-Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable sparsification, turning architecture search into an end-to-end optimization problem. Across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, we demonstrate competitive or superior accuracy while discovering substantially smaller models. Overprovisioning and sparsification are synergistic, with the combination outperforming either alone. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.

    https://arxiv.org/abs/2512.12448


    Breaking the Curse of Dimensionality: On the Stability of Modern Vector Retrieval

    oai:arXiv.org:2512.12458v1

    arXiv:2512.12458v1 Announce Type: new Abstract: Modern vector databases enable efficient retrieval over high-dimensional neural embeddings, powering applications from web search to retrieval-augmented generation. However, classical theory predicts such tasks should suffer from the curse of dimensionality, where distances between points become nearly indistinguishable, thereby crippling efficient nearest-neighbor search. We revisit this paradox through the lens of stability, the property that small perturbations to a query do not radically alter its nearest neighbors. Building on foundational results, we extend stability theory to three key retrieval settings widely used in practice: (i) multi-vector search, where we prove that the popular Chamfer distance metric preserves single-vector stability, while average pooling aggregation may destroy it; (ii) filtered vector search, where we show that sufficiently large penalties for mismatched filters can induce stability even when the underlying search is unstable; and (iii) sparse vector search, where we formalize and prove novel sufficient stability conditions. Across synthetic and real datasets, our experimental results match our theoretical predictions, offering concrete guidance for model and system design to avoid the curse of dimensionality.

    https://arxiv.org/abs/2512.12458


    From Particles to Fields: Reframing Photon Mapping with Continuous Gaussian Photon Fields

    oai:arXiv.org:2512.12459v1

    arXiv:2512.12459v1 Announce Type: new Abstract: Accurately modeling light transport is essential for realistic image synthesis. Photon mapping provides physically grounded estimates of complex global illumination effects such as caustics and specular-diffuse interactions, yet its per-view radiance estimation remains computationally inefficient when rendering multiple views of the same scene. The inefficiency arises from independent photon tracing and stochastic kernel estimation at each viewpoint, leading to inevitable redundant computation. To accelerate multi-view rendering, we reformulate photon mapping as a continuous and reusable radiance function. Specifically, we introduce the Gaussian Photon Field (GPF), a learnable representation that encodes photon distributions as anisotropic 3D Gaussian primitives parameterized by position, rotation, scale, and spectrum. GPF is initialized from physically traced photons in the first SPPM iteration and optimized using multi-view supervision of final radiance, distilling photon-based light transport into a continuous field. Once trained, the field enables differentiable radiance evaluation along camera rays without repeated photon tracing or iterative refinement. Extensive experiments on scenes with complex light transport, such as caustics and specular-diffuse interactions, demonstrate that GPF attains photon-level accuracy while reducing computation by orders of magnitude, unifying the physical rigor of photon-based rendering with the efficiency of neural scene representations.

    https://arxiv.org/abs/2512.12459


    Cross-Modal Representational Knowledge Distillation for Enhanced Spike-Informed LFP Modeling

    oai:arXiv.org:2512.12461v1

    arXiv:2512.12461v1 Announce Type: new Abstract: Local field potentials (LFPs) can be routinely recorded alongside spiking activity in intracortical neural experiments, measure a larger complementary spatiotemporal scale of brain activity for scientific inquiry, and can offer practical advantages over spikes, including greater long-term stability, robustness to electrode degradation, and lower power requirements. Despite these advantages, recent neural modeling frameworks have largely focused on spiking activity since LFP signals pose inherent modeling challenges due to their aggregate, population-level nature, often leading to lower predictive power for downstream task variables such as motor behavior. To address this challenge, we introduce a cross-modal knowledge distillation framework that transfers high-fidelity representational knowledge from pretrained multi-session spike transformer models to LFP transformer models. Specifically, we first train a teacher spike model across multiple recording sessions using a masked autoencoding objective with a session-specific neural tokenization strategy. We then align the latent representations of the student LFP model to those of the teacher spike model. Our results show that the Distilled LFP models consistently outperform single- and multi-session LFP baselines in both fully unsupervised and supervised settings, and can generalize to other sessions without additional distillation while maintaining superior performance. These findings demonstrate that cross-modal knowledge distillation is a powerful and scalable approach for leveraging high-performing spike models to develop more accurate LFP models.

    https://arxiv.org/abs/2512.12461


    Dynamical modeling of nonlinear latent factors in multiscale neural activity with real-time inference

    oai:arXiv.org:2512.12462v1

    arXiv:2512.12462v1 Announce Type: new Abstract: Real-time decoding of target variables from multiple simultaneously recorded neural time-series modalities, such as discrete spiking activity and continuous field potentials, is important across various neuroscience applications. However, a major challenge for doing so is that different neural modalities can have different timescales (i.e., sampling rates) and different probabilistic distributions, or can even be missing at some time-steps. Existing nonlinear models of multimodal neural activity do not address different timescales or missing samples across modalities. Further, some of these models do not allow for real-time decoding. Here, we develop a learning framework that can enable real-time recursive decoding while nonlinearly aggregating information across multiple modalities with different timescales and distributions and with missing samples. This framework consists of 1) a multiscale encoder that nonlinearly aggregates information after learning within-modality dynamics to handle different timescales and missing samples in real time, 2) a multiscale dynamical backbone that extracts multimodal temporal dynamics and enables real-time recursive decoding, and 3) modality-specific decoders to account for different probabilistic distributions across modalities. In both simulations and three distinct multiscale brain datasets, we show that our model can aggregate information across modalities with different timescales and distributions and missing samples to improve real-time target decoding. Further, our method outperforms various linear and nonlinear multimodal benchmarks in doing so.

    https://arxiv.org/abs/2512.12462


    Exploring the Design Space of Transition Matching

    oai:arXiv.org:2512.12465v1

    arXiv:2512.12465v1 Announce Type: new Abstract: Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples, however it uses a second ``internal'' generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations) we evaluate the affect of the head module architecture and modeling during training as-well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with high frequency sampler provides best ranking across all metrics reaching state-of-the-art among all tested baselines, while Transformer head with sequence scaling and low frequency sampling is a runner up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide most quality and efficiency gains, while at the same time indicate what design choices are not likely to provide further gains.

    https://arxiv.org/abs/2512.12465


    Autonomously Unweaving Multiple Cables Using Visual Feedback

    oai:arXiv.org:2512.12468v1

    arXiv:2512.12468v1 Announce Type: new Abstract: Many cable management tasks involve separating out the different cables and removing tangles. Automating this task is challenging because cables are deformable and can have combinations of knots and multiple interwoven segments. Prior works have focused on untying knots in one cable, which is one subtask of cable management. However, in this paper, we focus on a different subtask called multi-cable unweaving, which refers to removing the intersections among multiple interwoven cables to separate them and facilitate further manipulation. We propose a method that utilizes visual feedback to unweave a bundle of loosely entangled cables. We formulate cable unweaving as a pick-and-place problem, where the grasp position is selected from discrete nodes in a graph-based cable state representation. Our cable state representation encodes both topological and geometric information about the cables from the visual image. To predict future cable states and identify valid actions, we present a novel state transition model that takes into account the straightening and bending of cables during manipulation. Using this state transition model, we select between two high-level action primitives and calculate predicted immediate costs to optimize the lower-level actions. We experimentally demonstrate that iterating the above perception-planning-action process enables unweaving electric cables and shoelaces with an 84% success rate on average.

    https://arxiv.org/abs/2512.12468


    Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

    oai:arXiv.org:2512.12469v1

    arXiv:2512.12469v1 Announce Type: new Abstract: We introduce Sparse Concept Anchoring, a method that biases latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.

    https://arxiv.org/abs/2512.12469


    Privacy-Aware Ambient Audio Sensing for Healthy Indoor Spaces

    oai:arXiv.org:2512.12471v1

    arXiv:2512.12471v1 Announce Type: new Abstract: Indoor airborne transmission poses a significant health risk, yet current monitoring solutions are invasive, costly, or fail to address it directly. My research explores the untapped potential of ambient audio sensing to estimate key transmission risk factors such as ventilation, aerosol emissions, and occupant distribution non-invasively and in real time. I develop privacy-preserving systems that leverage existing microphones to monitor the whole spectrum of indoor air quality which can have a significant effect on an individual's health. This work lays the foundation for privacy-aware airborne risk monitoring using everyday devices.

    https://arxiv.org/abs/2512.12471


    AI-Driven Real-Time Kick Classification in Olympic Taekwondo Using Sensor Fusion

    oai:arXiv.org:2512.12474v1

    arXiv:2512.12474v1 Announce Type: new Abstract: Olympic Taekwondo has faced challenges in spectator engagement due to static, defensive gameplay and contentious scoring. Current Protector and Scoring Systems (PSS) rely on impact sensors and simplistic logic, encouraging safe strategies that diminish the sport's dynamism. This paper proposes an AI-powered scoring system that integrates existing PSS sensors with additional accelerometers, gyroscopes, magnetic/RFID, and impact force sensors in a sensor fusion framework. The system classifies kicks in real-time to identify technique type, contact location, impact force, and even the part of the foot used. A machine learning pipeline employing sensor fusion and Support Vector Machines (SVMs) is detailed, enabling automatic kick technique recognition for scoring. We present a novel kick scoring rubric that awards points based on specific kick techniques (e.g., turning and spinning kicks) to incentivize dynamic attacks. Drawing on a 2024 study achieving 96-98% accuracy, we validate the feasibility of real-time kick classification and further propose enhancements to this methodology, such as ensemble SVM classifiers and expanded datasets, to achieve the high-stakes accuracy required by the sport. We analyze how the proposed system can improve scoring fairness, reduce rule exploitation and illegitimate tactics, encourage more dynamic techniques, and enhance spectator understanding and excitement. The paper includes system design illustrations, a kick scoring table from an AI-augmented rule set, and discusses anticipated impacts on Olympic Taekwondo.

    https://arxiv.org/abs/2512.12474


    Improved Directional State Transition Tensors for Accurate Aerocapture Performance Analysis

    oai:arXiv.org:2512.12475v1

    arXiv:2512.12475v1 Announce Type: new Abstract: Aerocapture is a unique challenge for semi-analytical propagation because its nonconservative dynamics lead to force magnitudes that vary substantially across the trajectory. State transition tensors (STTs), higher-order Taylor series expansions of the solution flow, have been widely used as a computationally efficient semi-analytical propagation method for orbital scenarios, but have not previously been applied to aerocapture. However, obtaining the higher-order STTs requires integrating exponentially more equations. Directional state transition tensors (DSTTs) mitigate this cost by projecting the state into a reduced-dimension basis. This work develops novel dynamics analysis techniques to identify effective bases for this reduction, including augmented higher-order Cauchy Green tensors tailored to quantities of interest such as apoapsis radius. Results show that DSTTs constructed along these bases significantly reduce computational cost while maintaining accuracy in apoapsis and energy prediction. In particular, certain of these DSTTs outperform traditional DSTTs in nonlinear perturbation propagation for key state subsets and quantities of interest. These results establish STTs and DSTTs as practical tools for aerocapture performance analysis to enable robust guidance and navigation.

    https://arxiv.org/abs/2512.12475


    HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

    oai:arXiv.org:2512.12476v1

    arXiv:2512.12476v1 Announce Type: new Abstract: As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or previous-generation GPUs across regions and alleviate the shortage of homogeneous high-end GPUs within a single region. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training in infrastructures with heterogeneous GPUs and networks. HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem and introduces a novel scheduling algorithm that (1) decomposes the complex search space with a multi-level search framework; and (2) allocates the search budget via successive halving. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL delivers up to 9.17x the throughput of state-of-the-art systems, and 3.17x on average, under various workloads and settings.

    https://arxiv.org/abs/2512.12476


    MetaHGNIE: Meta-Path Induced Hypergraph Contrastive Learning in Heterogeneous Knowledge Graphs

    oai:arXiv.org:2512.12477v1

    arXiv:2512.12477v1 Announce Type: new Abstract: Node importance estimation (NIE) in heterogeneous knowledge graphs is a critical yet challenging task, essential for applications such as recommendation, knowledge reasoning, and question answering. Existing methods often rely on pairwise connections, neglecting high-order dependencies among multiple entities and relations, and they treat structural and semantic signals independently, hindering effective cross-modal integration. To address these challenges, we propose MetaHGNIE, a meta-path induced hypergraph contrastive learning framework for disentangling and aligning structural and semantic information. MetaHGNIE constructs a higher-order knowledge graph via meta-path sequences, where typed hyperedges capture multi-entity relational contexts. Structural dependencies are aggregated with local attention, while semantic representations are encoded through a hypergraph transformer equipped with sparse chunking to reduce redundancy. Finally, a multimodal fusion module integrates structural and semantic embeddings under contrastive learning with auxiliary supervision, ensuring robust cross-modal alignment. Extensive experiments on benchmark NIE datasets demonstrate that MetaHGNIE consistently outperforms state-of-the-art baselines. These results highlight the effectiveness of explicitly modeling higher-order interactions and cross-modal alignment in heterogeneous knowledge graphs. Our code is available at https://github.com/SEU-WENJIA/DualHNIE

    https://arxiv.org/abs/2512.12477


    Mage: Cracking Elliptic Curve Cryptography with Cross-Axis Transformers

    oai:arXiv.org:2512.12483v1

    arXiv:2512.12483v1 Announce Type: new Abstract: With the advent of machine learning and quantum computing, the 21st century has gone from a place of relative algorithmic security, to one of speculative unease and possibly, cyber catastrophe. Modern algorithms like Elliptic Curve Cryptography (ECC) are the bastion of current cryptographic security protocols that form the backbone of consumer protection ranging from Hypertext Transfer Protocol Secure (HTTPS) in the modern internet browser, to cryptographic financial instruments like Bitcoin. And there's been very little work put into testing the strength of these ciphers. Practically the only study that I could find was on side-channel recognition, a joint paper from the University of Milan, Italy and King's College, London\cite{battistello2025ecc}. These algorithms are already considered bulletproof by many consumers, but exploits already exist for them, and with computing power and distributed, federated compute on the rise, it's only a matter of time before these current bastions fade away into obscurity, and it's on all of us to stand up when we notice something is amiss, lest we see such passages claim victims in that process. In this paper, we seek to explore the use of modern language model architecture in cracking the association between a known public key, and its associated private key, by intuitively learning to reverse engineer the public keypair generation process, effectively solving the curve. Additonally, we attempt to ascertain modern machine learning's ability to memorize public-private secp256r1 keypairs, and to then test their ability to reverse engineer the public keypair generation process. It is my belief that proof-for would be equally valuable as proof-against in either of these categories. Finally, we'll conclude with some number crunching on where we see this particular field heading in the future.

    https://arxiv.org/abs/2512.12483


    Simultaneous AlphaZero: Extending Tree Search to Markov Games

    oai:arXiv.org:2512.12486v1

    arXiv:2512.12486v1 Announce Type: new Abstract: Simultaneous AlphaZero extends the AlphaZero framework to multistep, two-player zero-sum deterministic Markov games with simultaneous actions. At each decision point, joint action selection is resolved via matrix games whose payoffs incorporate both immediate rewards and future value estimates. To handle uncertainty arising from bandit feedback during Monte Carlo Tree Search (MCTS), Simultaneous AlphaZero incorporates a regret-optimal solver for matrix games with bandit feedback. Simultaneous AlphaZero demonstrates robust strategies in a continuous-state discrete-action pursuit-evasion game and satellite custody maintenance scenarios, even when evaluated against maximally exploitative opponents.

    https://arxiv.org/abs/2512.12486


    More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

    oai:arXiv.org:2512.12487v1

    arXiv:2512.12487v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.

    https://arxiv.org/abs/2512.12487


    The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

    oai:arXiv.org:2512.12488v1

    arXiv:2512.12488v1 Announce Type: new Abstract: Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models become an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, or using system prompts to shift the cultural alignment of a model to a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.

    https://arxiv.org/abs/2512.12488


    GoMS: Graph of Molecule Substructure Network for Molecule Property Prediction

    oai:arXiv.org:2512.12489v1

    arXiv:2512.12489v1 Announce Type: new Abstract: While graph neural networks have shown remarkable success in molecular property prediction, current approaches like the Equivariant Subgraph Aggregation Networks (ESAN) treat molecules as bags of independent substructures, overlooking crucial relationships between these components. We present Graph of Molecule Substructures (GoMS), a novel architecture that explicitly models the interactions and spatial arrangements between molecular substructures. Unlike ESAN's bag-based representation, GoMS constructs a graph where nodes represent subgraphs and edges capture their structural relationships, preserving critical topological information about how substructures are connected and overlap within the molecule. Through extensive experiments on public molecular datasets, we demonstrate that GoMS outperforms ESAN and other baseline methods, with particularly improvements for large molecules containing more than 100 atoms. The performance gap widens as molecular size increases, demonstrating GoMS's effectiveness for modeling industrial-scale molecules. Our theoretical analysis demonstrates that GoMS can distinguish molecules with identical subgraph compositions but different spatial arrangements. Our approach shows particular promise for materials science applications involving complex molecules where properties emerge from the interplay between multiple functional units. By capturing substructure relationships that are lost in bag-based approaches, GoMS represents a significant advance toward scalable and interpretable molecular property prediction for real-world applications.

    https://arxiv.org/abs/2512.12489


    Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

    oai:arXiv.org:2512.12492v1

    arXiv:2512.12492v1 Announce Type: new Abstract: Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections -- a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.

    https://arxiv.org/abs/2512.12492


    AI-Driven Early Warning Systems for Student Success: Discovering Static Feature Dominance in Temporal Prediction Models

    oai:arXiv.org:2512.12493v1

    arXiv:2512.12493v1 Announce Type: new Abstract: Early identification of at-risk students is critical for effective intervention in online learning environments. This study extends temporal prediction analysis to Week 20 (50% of course duration), comparing Decision Tree and Long Short- Term Memory (LSTM) models across six temporal snapshots. Our analysis reveals that different performance metrics matter at different intervention stages: high recall is critical for early intervention (Weeks 2-4), while balanced precision-recall is important for mid-course resource allocation (Weeks 8-16), and high precision becomes paramount in later stages (Week 20). We demonstrate that static demographic features dominate predictions (68% importance), enabling assessment-free early prediction. The LSTM model achieves 97% recall at Week 2, making it ideal for early intervention, while Decision Tree provides stable balanced performance (78% accuracy) during mid-course. By Week 20, both models converge to similar recall (68%), but LSTM achieves higher precision (90% vs 86%). Our findings also suggest that model selection should depend on intervention timing, and that early signals (Weeks 2-4) are sufficient for reliable initial prediction using primarily demographic and pre-enrollment information.

    https://arxiv.org/abs/2512.12493


    Policy Optimization for Dynamic Heart Transplant Allocation

    oai:arXiv.org:2512.12497v1

    arXiv:2512.12497v1 Announce Type: new Abstract: Heart transplantation is a viable path for patients suffering from advanced heart failure, but this lifesaving option is severely limited due to donor shortage. Although the current allocation policy was recently revised in 2018, a major concern is that it does not adequately take into account pretransplant and post-transplant mortality. In this paper, we take an important step toward addressing these deficiencies. To begin with, we use historical data from UNOS to develop a new simulator that enables us to evaluate and compare the performance of different policies. We then leverage our simulator to demonstrate that the status quo policy is considerably inferior to the myopic policy that matches incoming donors to the patient who maximizes the number of years gained by the transplant. Moreover, we develop improved policies that account for the dynamic nature of the allocation process through the use of potentials -- a measure of a patient's utility in future allocations that we introduce. We also show that batching together even a handful of donors -- which is a viable option for a certain type of donors -- further enhances performance. Our simulator also allows us to evaluate the effect of critical, and often unexplored, factors in the allocation, such as geographic proximity and the tendency to reject offers by the transplant centers.

    https://arxiv.org/abs/2512.12497


    Advancing Cache-Based Few-Shot Classification via Patch-Driven Relational Gated Graph Attention

    oai:arXiv.org:2512.12498v1

    arXiv:2512.12498v1 Announce Type: new Abstract: Few-shot image classification remains difficult under limited supervision and visual domain shift. Recent cache-based adaptation approaches (e.g., Tip-Adapter) address this challenge to some extent by learning lightweight residual adapters over frozen features, yet they still inherit CLIP's tendency to encode global, general-purpose representations that are not optimally discriminative to adapt the generalist to the specialist's domain in low-data regimes. We address this limitation with a novel patch-driven relational refinement that learns cache adapter weights from intra-image patch dependencies rather than treating an image embedding as a monolithic vector. Specifically, we introduce a relational gated graph attention network that constructs a patch graph and performs edge-aware attention to emphasize informative inter-patch interactions, producing context-enriched patch embeddings. A learnable multi-aggregation pooling then composes these into compact, task-discriminative representations that better align cache keys with the target few-shot classes. Crucially, the proposed graph refinement is used only during training to distil relational structure into the cache, incurring no additional inference cost beyond standard cache lookup. Final predictions are obtained by a residual fusion of cache similarity scores with CLIP zero-shot logits. Extensive evaluations on 11 benchmarks show consistent gains over state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. We further validate battlefield relevance by introducing an Injured vs. Uninjured Soldier dataset for casualty recognition. It is motivated by the operational need to support triage decisions within the "platinum minutes" and the broader "golden hour" window in time-critical UAV-driven search-and-rescue and combat casualty care.

    https://arxiv.org/abs/2512.12498


    Explainable AI as a Double-Edged Sword in Dermatology: The Impact on Clinicians versus The Public

    oai:arXiv.org:2512.12500v1

    arXiv:2512.12500v1 Announce Type: new Abstract: Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Since AI algorithm's opacity challenges human interaction, explainable AI (XAI) addresses this by providing AI decision-making insight, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnosis AI model and different XAI explanations to examine how XAI assistance, particularly multimodal large language models (LLMs), influences diagnostic performance. AI assistance balanced across skin tones improved accuracy and reduced diagnostic disparities. However, LLM explanations yielded divergent effects: lay users showed higher automation bias - accuracy boosted when AI was correct, reduced when AI erred - while experienced PCPs remained resilient, benefiting irrespective of AI accuracy. Presenting AI suggestions first also led to worse outcomes when the AI was incorrect for both groups. These findings highlight XAI's varying impact based on expertise and timing, underscoring LLMs as a "double-edged sword" in medical AI and informing future human-AI collaborative system design.

    https://arxiv.org/abs/2512.12500


    SafeGen: Embedding Ethical Safeguards in Text-to-Image Generation

    oai:arXiv.org:2512.12501v1

    arXiv:2512.12501v1 Announce Type: new Abstract: Generative Artificial Intelligence (AI) has created unprecedented opportunities for creative expression, education, and research. Text-to-image systems such as DALL.E, Stable Diffusion, and Midjourney can now convert ideas into visuals within seconds, but they also present a dual-use dilemma, raising critical ethical concerns: amplifying societal biases, producing high-fidelity disinformation, and violating intellectual property. This paper introduces SafeGen, a framework that embeds ethical safeguards directly into the text-to-image generation pipeline, grounding its design in established principles for Trustworthy AI. SafeGen integrates two complementary components: BGE-M3, a fine-tuned text classifier that filters harmful or misleading prompts, and Hyper-SD, an optimized diffusion model that produces high fidelity, semantically aligned images. Built on a curated multilingual (English- Vietnamese) dataset and a fairness-aware training process, SafeGen demonstrates that creative freedom and ethical responsibility can be reconciled within a single workflow. Quantitative evaluations confirm its effectiveness, with Hyper-SD achieving IS = 3.52, FID = 22.08, and SSIM = 0.79, while BGE-M3 reaches an F1-Score of 0.81. An ablation study further validates the importance of domain-specific fine-tuning for both modules. Case studies illustrate SafeGen's practical impact in blocking unsafe prompts, generating inclusive teaching materials, and reinforcing academic integrity.

    https://arxiv.org/abs/2512.12501


    KidsArtBench: Multi-Dimensional Children's Art Evaluation with Attribute-Aware MLLMs

    oai:arXiv.org:2512.12503v1

    arXiv:2512.12503v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) show remarkable progress across many visual-language tasks; however, their capacity to evaluate artistic expression remains limited. Aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach, where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric, with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. These results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.

    https://arxiv.org/abs/2512.12503


    ATLAS: Automated Tree-based Language Analysis System for C and C++ source programs

    oai:arXiv.org:2512.12507v1

    arXiv:2512.12507v1 Announce Type: new Abstract: The growing complexity of modern software systems has highlighted the shortcomings of traditional programming analysis techniques, particularly for Software Engineering (SE) tasks. While machine learning and Large Language Models (LLMs) offer promising solutions, their effectiveness is limited by the way they interpret data. Unlike natural language, source code meaning is defined less by token adjacency and more by complex, long-range, and structural relationships and dependencies. This limitation is especially pronounced for C and C++, where flatter syntactic hierarchies, pointer aliasing, multi-level indirection, typedef-based type obfuscation, and function-pointer calls hinder accurate static analysis. To address these challenges, this paper introduces ATLAS, a Python-based Command-Line Interface (CLI) that (i) generates statement-level Control Flow Graphs (CFG) and type-aware Data Flow Graphs (DFG) that capture inter-functional dependencies for the entire program; (ii) has the ability to work on entire C and C++ projects comprising multiple files; (iii) works on both compilable and non-compilable code and (iv) produces a unified multi-view code representation using Abstract Syntax Trees (AST), CFG and DFG. By preserving essential structural and semantic information, ATLAS provides a practical foundation for improving downstream SE and machine-learning-based program understanding. Video demonstration: https://youtu.be/RACWQe5ELwY Tool repository: https://github.com/jaid-monwar/ATLAS-code-representation-tool

    https://arxiv.org/abs/2512.12507


    Generative Spatiotemporal Data Augmentation

    oai:arXiv.org:2512.12508v1

    arXiv:2512.12508v1 Announce Type: new Abstract: We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.

    https://arxiv.org/abs/2512.12508


    Can You Keep a Secret? Exploring AI for Care Coordination in Cognitive Decline

    oai:arXiv.org:2512.12510v1

    arXiv:2512.12510v1 Announce Type: new Abstract: The increasing number of older adults who experience cognitive decline places a burden on informal caregivers, whose support with tasks of daily living determines whether older adults can remain in their homes. To explore how agents might help lower-SES older adults to age-in-place, we interviewed ten pairs of older adults experiencing cognitive decline and their informal caregivers. We explored how they coordinate care, manage burdens, and sustain autonomy and privacy. Older adults exercised control by delegating tasks to specific caregivers, keeping information about all the care they received from their adult children. Many abandoned some tasks of daily living, lowering their quality of life to ease caregiver burden. One effective strategy, piggybacking, uses spontaneous overlaps in errands to get more work done with less caregiver effort. This raises the questions: (i) Can agents help with piggyback coordination? (ii) Would it keep older adults in their homes longer, while not increasing caregiver burden?

    https://arxiv.org/abs/2512.12510


    An STREL-based Formulation of Spatial Resilience in Cyber-Physical Systems

    oai:arXiv.org:2512.12511v1

    arXiv:2512.12511v1 Announce Type: new Abstract: Resiliency is the ability of a system to quickly recover from a violation (recoverability) and avoid future violations for as long as possible (durability). In the spatial setting, recoverability and durability (now known as persistency) are measured in units of distance. Like its temporal counterpart, spatial resiliency is of fundamental importance for Cyber-Physical Systems (CPS) and yet, to date, there is no widely agreed-upon formal treatment of spatial resiliency. We present a formal framework for reasoning about spatial resiliency in CPS. Our framework is based on the spatial fragment of STREL, which we refer to as SREL. In this framework, spatial resiliency is given a syntactic characterization in the form of a Spatial Resiliency Specification (SpaRS). An atomic predicate of SpaRS is called an S-atom. Given an arbitrary SREL formula $\varphi$, distance bounds $d_1, d_2$, the S-atom of $\varphi$, $S_{d_1, d_2} (\varphi)$, is the SREL formula $\neg\varphi R_{[0,d_1]} (\varphi R_{[d_2, +\infty)}\varphi)$, specifying that recovery from a violation of $\varphi$ occurs within distance $d_1$ (recoverability), and subsequently that $\varphi$ be maintained along a route for a distance greater than $d_2$ (persistency). S-atoms can be combined using spatial STREL operators, allowing one to express composite resiliency specifications. We define a quantitative semantics for SpaRS in the form of a Spatial Resilience Value (SpaRV) function $\sigma$ and prove its soundness and completeness w.r.t. SREL's Boolean semantics. The $\sigma$-value for $S_{d_1,d_2}(\varphi)$ is a set of non-dominated (rec, per) pairs, quantifying recoverability and persistency, given that some routes may offer better recoverability while others better persistency. In addition, we design algorithms to evaluate SpaRV for SpaRS formulas. Finally, two case studies demonstrate the practical utility of our approach.

    https://arxiv.org/abs/2512.12511


    Linear Codes with Certain Dimension of Hermitian Hulls

    oai:arXiv.org:2512.12519v1

    arXiv:2512.12519v1 Announce Type: new Abstract: In this paper, we study the enumerative and asymptotic properties related to Hermitian $\ell$-complementary codes on the unitary space over $\F_{q^2}$. We provide some closed form expressions for the counting formulas of Hermitian $\ell$-complementary codes. There is a similarity in the asymptotic weight distribution between Hermitian self-orthogonal codes and unrestricted codes. Furthermore, we study the asymptotic behavior of Hermitian self-orthogonal codes whose minimum distance is at least $d$. In particular, we conclude that MDS codes within the class of Hermitian self-orthogonal codes are asymptotically dense when the alphabet size approaches to infinity.

    https://arxiv.org/abs/2512.12519


    Noise-robust Contrastive Learning for Critical Transition Detection in Dynamical Systems

    oai:arXiv.org:2512.12523v1

    arXiv:2512.12523v1 Announce Type: new Abstract: Detecting critical transitions in complex, noisy time-series data is a fundamental challenge across science and engineering. Such transitions may be anticipated by the emergence of a low-dimensional order parameter, whose signature is often masked by high-amplitude stochastic variability. Standard contrastive learning approaches based on deep neural networks, while promising for detecting critical transitions, are often overparameterized and sensitive to irrelevant noise, leading to inaccurate identification of critical points. To address these limitations, we propose a neural network architecture, constructed using singular value decomposition technique, together with a strictly semi-orthogonality-constrained training algorithm, to enhance the performance of traditional contrastive learning. Extensive experiments demonstrate that the proposed method matches the performance of traditional contrastive learning techniques in identifying critical transitions, yet is considerably more lightweight and markedly more resistant to noise.

    https://arxiv.org/abs/2512.12523


    Empirical Mode Decomposition and Graph Transformation of the MSCI World Index: A Multiscale Topological Analysis for Graph Neural Network Modeling

    oai:arXiv.org:2512.12526v1

    arXiv:2512.12526v1 Announce Type: new Abstract: This study applies Empirical Mode Decomposition (EMD) to the MSCI World index and converts the resulting intrinsic mode functions (IMFs) into graph representations to enable modeling with graph neural networks (GNNs). Using CEEMDAN, we extract nine IMFs spanning high-frequency fluctuations to long-term trends. Each IMF is transformed into a graph using four time-series-to-graph methods: natural visibility, horizontal visibility, recurrence, and transition graphs. Topological analysis shows clear scale-dependent structure: high-frequency IMFs yield dense, highly connected small-world graphs, whereas low-frequency IMFs produce sparser networks with longer characteristic path lengths. Visibility-based methods are more sensitive to amplitude variability and typically generate higher clustering, while recurrence graphs better preserve temporal dependencies. These results provide guidance for designing GNN architectures tailored to the structural properties of decomposed components, supporting more effective predictive modeling of financial time series.

    https://arxiv.org/abs/2512.12526


    Principled Performance Tunability in Operating System Kernels

    oai:arXiv.org:2512.12530v1

    arXiv:2512.12530v1 Announce Type: new Abstract: The Linux kernel source code contains numerous constant values that critically influence system performance. Many of these constants, which we term perf-consts, are magic numbers that encode brittle assumptions about hardware and workloads. As systems and workloads evolve, such constants often become suboptimal. Unfortunately, deployed kernels lack support for safe and efficient in-situ tuning of perf-consts without a long and disruptive process of rebuilding and redeploying the kernel image. This paper advocates principled OS performance tunability. We present KernelX, a system that provides a safe, efficient, and programmable interface for in-situ tuning of arbitrary perf-consts on a running kernel. KernelX transforms any perf-const into a tunable knob on demand using a novel mechanism called Scoped Indirect Execution (SIE). SIE precisely identifies the binary boundaries where a perf-const influences system state and redirects execution to synthesized instructions that update the state as if new values were used. KernelX goes beyond version atomicity to guarantee side-effect safety, a property not provided by existing kernel update mechanisms. KernelX also provides a programmable interface that allows policies to incorporate application hints, hardware heuristics, and fine-grained isolation, without modifying kernel source code or disrupting deployed OS kernels. Case studies across multiple kernel subsystems demonstrate that KernelX enables significant performance improvements by making previously untunable perf-consts safely tunable at runtime, while supporting millisecond-scale policy updates.

    https://arxiv.org/abs/2512.12530


    Strategic Server Deployment under Uncertainty in Mobile Edge Computing

    oai:arXiv.org:2512.12532v1

    arXiv:2512.12532v1 Announce Type: new Abstract: Server deployment is a fundamental task in mobile edge computing: where to place the edge servers and what user cells to assign to them. To make this decision is context-specific, but common goals are 1) computing efficiency: maximize the amount of workload processed by the edge, and 2) communication efficiency: minimize the communication cost between the cells and their assigned servers. We focus on practical scenarios where the user workload in each cell is unknown and time-varying, and so are the effective capacities of the servers. Our research problem is to choose a subset of candidate servers and assign them to the user cells such that the above goals are sustainably achieved under the above uncertainties. We formulate this problem as a stochastic bilevel optimization, which is strongly NP-hard and unseen in the literature. By approximating the objective function with submodular functions, we can utilize state-of-the-art greedy algorithms for submodular maximization to effectively solve our problem. We evaluate the proposed algorithm using real-world data, showing its superiority to alternative methods; the improvement can be as high as 55%

    https://arxiv.org/abs/2512.12532


    Animus3D: Text-driven 3D Animation via Motion Score Distillation

    oai:arXiv.org:2512.12534v1

    arXiv:2512.12534v1 Announce Type: new Abstract: We present Animus3D, a text-driven 3D animation framework that generates motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while another inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.

    https://arxiv.org/abs/2512.12534


    Diverse LLMs vs. Vulnerabilities: Who Detects and Fixes Them Better?

    oai:arXiv.org:2512.12536v1

    arXiv:2512.12536v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly being studied for Software Vulnerability Detection (SVD) and Repair (SVR). Individual LLMs have demonstrated code understanding abilities, but they frequently struggle when identifying complex vulnerabilities and generating fixes. This study presents DVDR-LLM, an ensemble framework that combines outputs from diverse LLMs to determine whether aggregating multiple models reduces error rates. Our evaluation reveals that DVDR-LLM achieves 10-12% higher detection accuracy compared to the average performance of individual models, with benefits increasing as code complexity grows. For multi-file vulnerabilities, the ensemble approach demonstrates significant improvements in recall (+18%) and F1 score (+11.8%) over individual models. However, the approach raises measurable trade-offs: reducing false positives in verification tasks while simultaneously increasing false negatives in detection tasks, requiring careful decision on the required level of agreement among the LLMs (threshold) for increased performance across different security contexts. Artifact: https://github.com/Erroristotle/DVDR_LLM

    https://arxiv.org/abs/2512.12536


    NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

    oai:arXiv.org:2512.12537v1

    arXiv:2512.12537v1 Announce Type: new Abstract: The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81\% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.

    https://arxiv.org/abs/2512.12537


    Hierarchical Coarse Basis by Randomised SVD: the Helmholtz Problem

    oai:arXiv.org:2512.12538v1

    arXiv:2512.12538v1 Announce Type: new Abstract: The oscillatory waves require sufficient degrees of freedom to resolve. That restriction usually applies also to coarse problems for Schwarz methods. The resulting coarse problem is then too large. To address the issue, a new form of Schwarz methods with coarse correction is proposed for the Helmholtz problem. There are two components in the proposed form: randomised SVD of interface iteration, and hierarchical domain decomposition. The resulting coarse problem is hierarchical and can be solved in a decoupled way at each level of hierarchy.

    https://arxiv.org/abs/2512.12538


    Anatomy Guided Coronary Artery Segmentation from CCTA Using Spatial Frequency Joint Modeling

    oai:arXiv.org:2512.12539v1

    arXiv:2512.12539v1 Announce Type: new Abstract: Accurate coronary artery segmentation from coronary computed tomography angiography is essential for quantitative coronary analysis and clinical decision support. Nevertheless, reliable segmentation remains challenging because of small vessel calibers, complex branching, blurred boundaries, and myocardial interference. We propose a coronary artery segmentation framework that integrates myocardial anatomical priors, structure aware feature encoding, and three dimensional wavelet inverse wavelet transformations. Myocardial priors and residual attention based feature enhancement are incorporated during encoding to strengthen coronary structure representation. Wavelet inverse wavelet based downsampling and upsampling enable joint spatial frequency modeling and preserve multi scale structural consistency, while a multi scale feature fusion module integrates semantic and geometric information in the decoding stage. The model is trained and evaluated on the public ImageCAS dataset using a 3D overlapping patch based strategy with a 7:1:2 split for training, validation, and testing. Experimental results demonstrate that the proposed method achieves a Dice coefficient of 0.8082, Sensitivity of 0.7946, Precision of 0.8471, and an HD95 of 9.77 mm, outperforming several mainstream segmentation models. Ablation studies further confirm the complementary contributions of individual components. The proposed method enables more stable and consistent coronary artery segmentation under complex geometric conditions, providing reliable segmentation results for subsequent coronary structure analysis tasks.

    https://arxiv.org/abs/2512.12539


    Effective Fine-Tuning with Eigenvector Centrality Based Pruning

    oai:arXiv.org:2512.12543v1

    arXiv:2512.12543v1 Announce Type: new Abstract: In social media networks a small number of highly influential users can drive large scale changes in discourse across multiple communities. Small shifts in the behavior of these users are often sufficient to propagate widely throughout the network. A similar phenomenon occurs during neural network fine tuning. Conventional fine tuning of convolutional neural networks typically adds a new linear classification layer on top of a large pre trained model. Instead we argue that improved adaptation can be achieved by first pruning the network to retain only the most important neurons and then performing fine tuning. We propose a graph theory based method for pruning neural networks that is designed to improve fine tuning performance. In this method each neuron is represented as a node and edges encode similarity between neurons. Neurons are pruned based on importance scores computed using eigenvector centrality. The resulting pruned network is then fine tuned using only the most central neurons. We evaluate the proposed method on VGGNet EfficientNet and ResNet models using the TF Flowers Caltech one zero one and Oxford Flowers one zero two datasets. The proposed approach achieves higher classification accuracy while significantly reducing model complexity. On the Oxford Flowers one zero two dataset the method achieves forty eight percent classification accuracy compared to thirty percent accuracy obtained by the baseline VGGNet model.

    https://arxiv.org/abs/2512.12543


    HyperEdit: Unlocking Instruction-based Text Editing in LLMs via Hypernetworks

    oai:arXiv.org:2512.12544v1

    arXiv:2512.12544v1 Announce Type: new Abstract: Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%--30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.

    https://arxiv.org/abs/2512.12544


    Skillful Subseasonal-to-Seasonal Forecasting of Extreme Events with a Multi-Sphere Coupled Probabilistic Model

    oai:arXiv.org:2512.12545v1

    arXiv:2512.12545v1 Announce Type: new Abstract: Accurate subseasonal-to-seasonal (S2S) prediction of extreme events is critical for resource planning and disaster mitigation under accelerating climate change. However, such predictions remain challenging due to complex multi-sphere interactions and intrinsic atmospheric uncertainty. Here we present TianXing-S2S, a multi-sphere coupled probabilistic model for global S2S daily ensemble forecast. TianXing-S2S first encodes diverse multi-sphere predictors into a compact latent space, then employs a diffusion model to generate daily ensemble forecasts. A novel coupling module based on optimal transport (OT) is incorporated in the denoiser to optimize the interactions between atmospheric and multi-sphere boundary conditions. Across key atmospheric variables, TianXing-S2S outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF) S2S system and FuXi-S2S in 45-day daily-mean ensemble forecasts at 1.5 resolution. Our model achieves skillful subseasonal prediction of extreme events including heat waves and anomalous precipitation, identifying soil moisture as a critical precursor signal. Furthermore, we demonstrate that TianXing-S2S can generate stable rollout forecasts up to 180 days, establishing a robust framework for S2S research in a warming world.

    https://arxiv.org/abs/2512.12545


    World Models Unlock Optimal Foraging Strategies in Reinforcement Learning Agents

    oai:arXiv.org:2512.12548v1

    arXiv:2512.12548v1 Announce Type: new Abstract: Patch foraging involves the deliberate and planned process of determining the optimal time to depart from a resource-rich region and investigate potentially more beneficial alternatives. The Marginal Value Theorem (MVT) is frequently used to characterize this process, offering an optimality model for such foraging behaviors. Although this model has been widely used to make predictions in behavioral ecology, discovering the computational mechanisms that facilitate the emergence of optimal patch-foraging decisions in biological foragers remains under investigation. Here, we show that artificial foragers equipped with learned world models naturally converge to MVT-aligned strategies. Using a model-based reinforcement learning agent that acquires a parsimonious predictive representation of its environment, we demonstrate that anticipatory capabilities, rather than reward maximization alone, drive efficient patch-leaving behavior. Compared with standard model-free RL agents, these model-based agents exhibit decision patterns similar to many of their biological counterparts, suggesting that predictive world models can serve as a foundation for more explainable and biologically grounded decision-making in AI systems. Overall, our findings highlight the value of ecological optimality principles for advancing interpretable and adaptive AI.

    https://arxiv.org/abs/2512.12548


    Supervised Contrastive Frame Aggregation for Video Representation Learning

    oai:arXiv.org:2512.12549v1

    arXiv:2512.12549v1 Announce Type: new Abstract: We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video into a single input image. This design enables the use of pre trained convolutional neural network backbones such as ResNet50 and avoids the computational overhead of complex video transformer models. We then design a contrastive learning objective that directly compares pairwise projections generated by the model. Positive pairs are defined as projections from videos sharing the same label while all other projections are treated as negatives. Multiple natural views of the same video are created using different temporal frame samplings from the same underlying video. Rather than relying on data augmentation these frame level variations produce diverse positive samples with global context and reduce overfitting. Experiments on the Penn Action and HMDB51 datasets demonstrate that the proposed method outperforms existing approaches in classification accuracy while requiring fewer computational resources. The proposed Supervised Contrastive Frame Aggregation method learns effective video representations in both supervised and self supervised settings and supports video based tasks such as classification and captioning. The method achieves seventy six percent classification accuracy on Penn Action compared to forty three percent achieved by ViVIT and forty eight percent accuracy on HMDB51 compared to thirty seven percent achieved by ViVIT.

    https://arxiv.org/abs/2512.12549


    Assessing the Capability of Android Dynamic Analysis Tools to Combat Anti-Runtime Analysis Techniques

    oai:arXiv.org:2512.12551v1

    arXiv:2512.12551v1 Announce Type: new Abstract: As the dominant mobile operating system, Android continues to attract a substantial influx of new applications each year. However, this growth is accompanied by increased attention from malicious actors, resulting in a significant rise in security threats to the Android ecosystem. Among these threats, the adoption of Anti-Runtime Analysis (ARA) techniques by malicious applications poses a serious challenge, as it hinders security professionals from effectively analyzing malicious behaviors using dynamic analysis tools. ARA technologies are designed to prevent the dynamic examination of applications, thus complicating efforts to ensure platform security. This paper presents a comprehensive empirical study that assesses the ability of widely-used Android dynamic analysis tools to bypass various ARA techniques. Our findings reveal a critical gap in the effectiveness of existing dynamic analysis tools to counter ARA mechanisms, highlighting an urgent need for more robust solutions. This work provides valuable insights into the limitations of existing tools and highlights the need for improved methods to counteract ARA technologies, thus advancing the field of software security and dynamic analysis.

    https://arxiv.org/abs/2512.12551


    Large Language Newsvendor: Decision Biases and Cognitive Mechanisms

    oai:arXiv.org:2512.12552v1

    arXiv:2512.12552v1 Announce Type: new Abstract: Problem definition: Although large language models (LLMs) are increasingly integrated into business decision making, their potential to replicate and even amplify human cognitive biases cautions a significant, yet not well-understood, risk. This is particularly critical in high-stakes operational contexts like supply chain management. To address this, we investigate the decision-making patterns of leading LLMs using the canonical newsvendor problem in a dynamic setting, aiming to identify the nature and origins of their cognitive biases. Methodology/results: Through dynamic, multi-round experiments with GPT-4, GPT-4o, and LLaMA-8B, we tested for five established decision biases. We found that LLMs consistently replicated the classic ``Too Low/Too High'' ordering bias and significantly amplified other tendencies like demand-chasing behavior compared to human benchmarks. Our analysis uncovered a ``paradox of intelligence'': the more sophisticated GPT-4 demonstrated the greatest irrationality through overthinking, while the efficiency-optimized GPT-4o performed near-optimally. Because these biases persist even when optimal formulas are provided, we conclude they stem from architectural constraints rather than knowledge gaps. Managerial implications: First, managers should select models based on the specific task, as our results show that efficiency-optimized models can outperform more complex ones on certain optimization problems. Second, the significant amplification of bias by LLMs highlights the urgent need for robust human-in-the-loop oversight in high-stakes decisions to prevent costly errors. Third, our findings suggest that designing structured, rule-based prompts is a practical and effective strategy for managers to constrain models' heuristic tendencies and improve the reliability of AI-assisted decisions.

    https://arxiv.org/abs/2512.12552


    Cargo Sherlock: An SMT-Based Checker for Software Trust Costs

    oai:arXiv.org:2512.12553v1

    arXiv:2512.12553v1 Announce Type: new Abstract: Supply chain attacks threaten open-source software ecosystems. This paper proposes a formal framework for quantifying trust in third-party software dependencies that is both formally checkable - formalized in satisfiability modulo theories (SMT) - while at the same time incorporating human factors, like the number of downloads, authors, and other metadata that are commonly used to identify trustworthy software in practice. We use data from both software analysis tools and metadata to build a first-order relational model of software dependencies; to obtain an overall "trust cost" combining these factors, we propose a formalization based on the minimum trust problem which asks for the minimum cost of a set of assumptions which can be used to prove that the code is safe. We implement these ideas in Cargo Sherlock, targeted for Rust libraries (crates), incorporating a list of candidate assumptions motivated by quantifiable trust metrics identified in prior work. Our evaluation shows that Cargo Sherlock can be used to identify synthetically generated supply chain attacks and known incidents involving typosquatted and poorly AI-maintained crates, and that its performance scales to Rust crates with many dependencies.

    https://arxiv.org/abs/2512.12553


    Bounded Dynamic Level Maintenance for Efficient Logic Optimization

    oai:arXiv.org:2512.12554v1

    arXiv:2512.12554v1 Announce Type: new Abstract: Logic optimization constitutes a critical phase within the Electronic Design Automation (EDA) flow, essential for achieving desired circuit power, performance, and area (PPA) targets. These logic circuits are typically represented as Directed Acyclic Graphs (DAGs), where the structural depth, quantified by node level, critically correlates with timing performance. Modern optimization strategies frequently employ iterative, local transformation heuristics (\emph{e.g.,} \emph{rewrite}, \emph{refactor}) directly on this DAG structure. As optimization continuously modifies the graph locally, node levels require frequent dynamic updates to guide subsequent decisions. However, a significant gap exists: existing algorithms for incrementally updating node levels are unbounded to small changes. This leads to a total of worst complexity in $O(|V|^2)$ for given local subgraphs $\{\Delta G_i\}_{i=1}^{|V|}$ updates on DAG $G(V,E)$. This unbounded nature poses a severe efficiency bottleneck, hindering the scalability of optimization flows, particularly when applied to large circuit designs prevalent today. In this paper, we analyze the dynamic level maintenance problem endemic to iterative logic optimization, framing it through the lens of partial topological order. Building upon the analysis, we present the first bounded algorithm for maintaining level constraints, with $O(|V| \Delta \log \Delta)$ time for a sequence $|V|$ of updates $\{\Delta G_i\}$, where $\Delta = \max_i \|\Delta G_i\|$ denotes the maximum extended size of $\Delta G_i$. Experiments on comprehensive benchmarks show our algorithm enables an average 6.4$\times$ overall speedup relative to \rw and \rf, driven by a 1074.8$\times$ speedup in the level maintenance, all without any quality sacrifice.

    https://arxiv.org/abs/2512.12554


    Unveiling Malicious Logic: Towards a Statement-Level Taxonomy and Dataset for Securing Python Packages

    oai:arXiv.org:2512.12559v1

    arXiv:2512.12559v1 Announce Type: new Abstract: The widespread adoption of open-source ecosystems enables developers to integrate third-party packages, but also exposes them to malicious packages crafted to execute harmful behavior via public repositories such as PyPI. Existing datasets (e.g., pypi-malregistry, DataDog, OpenSSF, MalwareBench) label packages as malicious or benign at the package level, but do not specify which statements implement malicious behavior. This coarse granularity limits research and practice: models cannot be trained to localize malicious code, detectors cannot justify alerts with code-level evidence, and analysts cannot systematically study recurring malicious indicators or attack chains. To address this gap, we construct a statement-level dataset of 370 malicious Python packages (833 files, 90,527 lines) with 2,962 labeled occurrences of malicious indicators. From these annotations, we derive a fine-grained taxonomy of 47 malicious indicators across 7 types that capture how adversarial behavior is implemented in code, and we apply sequential pattern mining to uncover recurring indicator sequences that characterize common attack workflows. Our contribution enables explainable, behavior-centric detection and supports both semantic-aware model training and practical heuristics for strengthening software supply-chain defenses.

    https://arxiv.org/abs/2512.12559


    StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

    oai:arXiv.org:2512.12560v1

    arXiv:2512.12560v1 Announce Type: new Abstract: Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality. Experimental results on multiple online and offline video understanding benchmarks demonstrate that our method significantly improves the accuracy (i.e., by 4\% at most) while incurring a negligible pruning latency (i.e., less than 1ms). Our full implementation will be made publicly available.

    https://arxiv.org/abs/2512.12560


    Vertical Heterogeneous Networks Beyond 5G: CoMP Coverage Enhancement and Optimization

    oai:arXiv.org:2512.12563v1

    arXiv:2512.12563v1 Announce Type: new Abstract: Low-altitude wireless networks are increasingly vital for the low-altitude economy, enabling wireless coverage in high-mobility and hard-to-reach environments. However, providing reliable connectivity to sparsely distributed aerial users in dynamic three-dimensional (3D) spaces remains a significant challenge. This paper investigates downlink coverage enhancement in vertical heterogeneous networks (VHetNets) beyond 5G, where uncrewed aerial vehicles (UAVs) operate as emerging aerial base stations (ABSs) alongside legacy terrestrial base stations (TBSs). To improve coverage performance, we propose a coordinated multi-point (CoMP) transmission framework that enables joint transmission from ABSs and TBSs. This approach mitigates the limitations of non-uniform user distributions and enhances reliability for sparse aerial users. Two UAV deployment strategies are considered: \textit{i)} random UAV placement, analyzed using stochastic geometry to derive closed-form coverage expressions, and \textit{ii)} optimized UAV placement using a coverage-aware weighted $K$-means clustering algorithm to maximize cooperative coverage in underserved areas. Theoretical analyses and Monte Carlo simulations demonstrate that the proposed CoMP-enabled VHetNet significantly improves downlink coverage probability, particularly in scenarios with sparse aerial users. These findings highlight the potential of intelligent UAV coordination and geometry-aware deployment to enable robust, adaptive connectivity in low-altitude wireless networks.

    https://arxiv.org/abs/2512.12563


    Optimal Mistake Bounds for Transductive Online Learning

    oai:arXiv.org:2512.12567v1

    arXiv:2512.12567v1 Announce Type: new Abstract: We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. In the standard setting, the optimal mistake bound is characterized by the Littlestone dimension $d$ of the concept class $H$ (Littlestone 1987). We prove that in the transductive setting, the mistake bound is at least $\Omega(\sqrt{d})$. This constitutes an exponential improvement over previous lower bounds of $\Omega(\log\log d)$, $\Omega(\sqrt{\log d})$, and $\Omega(\log d)$, due respectively to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that this lower bound is tight: for every $d$, there exists a class of Littlestone dimension $d$ with transductive mistake bound $O(\sqrt{d})$. Our upper bound also improves upon the best known upper bound of $(2/3)d$ from Ben-David, Kushilevitz, and Mansour (1997). These results establish a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advance access to the unlabeled instance sequence. This contrasts with the PAC setting, where transductive and standard learning exhibit similar sample complexities.

    https://arxiv.org/abs/2512.12567


    Intelligent Adaptive Federated Byzantine Agreement for Robust Blockchain Consensus

    oai:arXiv.org:2512.12568v1

    arXiv:2512.12568v1 Announce Type: new Abstract: The Federated Byzantine Agreement (FBA) achieves rapid consensus by relying on overlapping quorum slices. But this architecture leads to a high dependence on the availability of validators when about one fourth of validators go down, the classical FBA can lose liveness or fail to reach agreement. We thus come up with an Adaptive FBA architecture that can reconfigure quorum slices intelligently based on real time validator reputation to overcome this drawback. Our model includes trust scores computed from EigenTrust and a sliding window behavioral assessment to determine the reliability of validators. We have built the intelligent adaptive FBA model and conducted tests in a Stellar based setting. Results of real life experiments reveal that the system is stable enough to keep consensus when more than half of the validators (up to 62 percent) are disconnected, which is a great extension of the failure threshold of a classical FBA. A fallback mode allows the network to be functional with as few as three validators, thus showing a significant robustness enhancement. Besides, a comparative study with the existing consensus protocols shows that Adaptive FBA can be an excellent choice for the next generation of blockchain systems, especially for constructing a resilient blockchain infrastructure.

    https://arxiv.org/abs/2512.12568


    From Tokens to Photons: Test-Time Physical Prompting for Vison-Language Models

    oai:arXiv.org:2512.12571v1

    arXiv:2512.12571v1 Announce Type: new Abstract: To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle--ISO, shutter speed, and aperture--as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with Zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA on single Auto-Exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp additional gains over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating practicality. These results support the main claim that, beyond post-capture prompting, measurement-time control--selecting and combining real physical views--substantially improves robustness for VLMs.

    https://arxiv.org/abs/2512.12571


    On the Accuracy of Newton Step and Influence Function Data Attributions

    oai:arXiv.org:2512.12572v1

    arXiv:2512.12572v1 Announce Type: new Abstract: Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy. Even in the relatively simple case of linear regressions, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ($d$) and the number of samples removed ($k$). As a result, these analyses are not tight enough to answer fundamental questions such as "what is the asymptotic scaling of the errors of each method?" or "which of these methods is more accurate for a given dataset?" In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regression, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors in the average-case sample removals. \[ \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T - \hat{\theta}_T^{\mathrm{NS}}\|_2 \bigr] = \widetilde{\Theta}\!\left(\frac{k d}{n^2}\right), \qquad \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T^{\mathrm{NS}} - \hat{\theta}_T^{\mathrm{IF}}\|_2 \bigr] = \widetilde{\Theta}\!\left( \frac{(k + d)\sqrt{k d}}{n^2} \right). \]

    https://arxiv.org/abs/2512.12572


    Coupled Variational Reinforcement Learning for Language Model General Reasoning

    oai:arXiv.org:2512.12576v1

    arXiv:2512.12576v1 Announce Type: new Abstract: While reinforcement learning have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit{\b{Co}upled \b{V}ariational \b{R}einforcement \b{L}earning} (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4\% over the base model and achieves an additional 2.3\% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

    https://arxiv.org/abs/2512.12576


    Cryptographic transformations over polyadic rings

    oai:arXiv.org:2512.12580v1

    arXiv:2512.12580v1 Announce Type: new Abstract: This article introduces a novel cryptographic paradigm based on nonderived polyadic algebraic structures. Traditional cryptosystems rely on binary operations within groups, rings, or fields, whose well-understood properties can be exploited in cryptanalysis. To overcome these vulnerabilities, we propose a shift to polyadic rings, which generalize classical rings by allowing operations of higher arity: an $m$-ary addition and an $n$-ary multiplication. The foundation of our approach is the construction of polyadic integers -- congruence classes of ordinary integers endowed with such $m$-ary and $n$-ary operations. A key innovation is the parameter-to-arity mapping $\Phi(a,b)=(m,n)$, which links the parameters $(a,b)$ defining a congruence class to the specific arities required for algebraic closure. This mapping is mathematically intricate: it is non-injective, non-surjective, and multivalued. This complex, non-unique relationship forms the core of the proposed cryptosystem's security. We present two concrete encryption procedures that leverage this structure by encoding plaintext within the parameters of polyadic rings and transmitting information via polyadically quantized analog signals. In one method, plaintext is linked to the additive arity $m_{i}$ and secured using the summation of such signals; in the other, it is linked to a ring parameter $a_{i}$ and secured using their multiplication. In both cases, the "quantized" nature of polyadic operations generates systems of equations that are straightforward for a legitimate recipient with the correct key but exceptionally difficult for an attacker without it. The resulting framework promises a substantial increase in cryptographic security. This work establishes the theoretical foundation for this new class of encryption schemes and highlights their potential for constructing robust, next-generation cryptographic protocols.

    https://arxiv.org/abs/2512.12580


    Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses

    oai:arXiv.org:2512.12581v1

    arXiv:2512.12581v1 Announce Type: new Abstract: This paper presents an exploratory, simulator-based proof of concept investigating whether differentiable energy terms derived from parameterized quantum circuits can serve as auxiliary regularization signals in Generative Adversarial Networks (GANs). We augment the Auxiliary Classifier GAN (ACGAN) generator objective with a Variational Quantum Eigensolver (VQE)-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit's EstimatorQNN and TorchConnector. Important limitations: All experiments run on a noiseless statevector simulator with only 4 qubits, use a deliberately simple Hamiltonian parameterization, and lack ablation studies comparing against equivalent classical biases. The computational overhead (approximately 200x slower than classical ACGAN) reflects simulator artifacts rather than inherent quantum costs. On MNIST, we observe that the energy-regularized model (termed QACGAN) achieves high classification accuracy (99 to 100 percent) within 5 epochs compared to 87.8 percent for ACGAN, suggesting the auxiliary term influences class conditioning. However, sample quality metrics (FID) show high variance across runs (coefficient of variation approximately 25 percent at epoch 5), with values ranging from 19.92 to 35.96. Extended runs stabilize around FID 23 to 24, comparable to the ACGAN baseline. We explicitly do not claim quantum advantage, improved stability in any general sense, or scalability beyond this toy setting. The contribution is methodological: demonstrating that VQE-style energy computations can be integrated into GAN training loops via differentiable pathways. Whether such auxiliary signals provide benefits beyond equivalent classical regularizers remains an open question requiring systematic ablation studies, which we leave for future work.

    https://arxiv.org/abs/2512.12581


    Generative AI as Digital Representatives in Collective Decision-Making: A Game-Theoretical Approach

    oai:arXiv.org:2512.12582v1

    arXiv:2512.12582v1 Announce Type: new Abstract: Generative Artificial Intelligence (GenAI) enables digital representatives to make decisions on behalf of team members in collaborative tasks, but faces challenges in accurately representing preferences. While supplying GenAI with detailed personal information improves representation fidelity, feasibility constraints make complete information access impractical. We bridge this gap by developing a game-theoretic framework that models strategic information revelation to GenAI in collective decision-making. The technical challenges lie in characterizing members' equilibrium behaviors under interdependent strategies and quantifying the imperfect preference learning outcomes by digital representatives. Our contribution includes closed-form equilibrium characterizations that reveal how members strategically balance team decision preference against communication costs. Our analysis yields an interesting finding: Conflicting preferences between team members drive competitive information revelation, with members revealing more information than those with aligned preferences. While digital representatives produce aggregate preference losses no smaller than direct participation, individual members may paradoxically achieve decisions more closely aligned with their preferences when using digital representatives, particularly when manual participation costs are high or when GenAI systems are sufficiently advanced.

    https://arxiv.org/abs/2512.12582


    Detecting Prompt Injection Attacks Against Application Using Classifiers

    oai:arXiv.org:2512.12583v1

    arXiv:2512.12583v1 Announce Type: new Abstract: Prompt injection attacks can compromise the security and stability of critical systems, from infrastructure to large web applications. This work curates and augments a prompt injection dataset based on the HackAPrompt Playground Submissions corpus and trains several classifiers, including LSTM, feed forward neural networks, Random Forest, and Naive Bayes, to detect malicious prompts in LLM integrated web applications. The proposed approach improves prompt injection detection and mitigation, helping protect targeted applications and systems.

    https://arxiv.org/abs/2512.12583


    StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis

    oai:arXiv.org:2512.12586v1

    arXiv:2512.12586v1 Announce Type: new Abstract: Despite the rapid progress of deep learning in video action recognition (VAR) in recent years, privacy leakage in videos remains a critical concern. Current state-of-the-art privacy-preserving methods often rely on anonymization. These methods suffer from (1) low concealment, where producing visually distorted videos that attract attackers' attention during transmission, and (2) spatiotemporal disruption, where degrading essential spatiotemporal features for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and directly performs VAR in the steganographic domain for the first time. Throughout both data transmission and action analysis, the spatiotemporal information of hidden secret video remains complete, while the natural appearance of cover videos ensures the concealment of transmission. Considering the difficulty of steganographic domain analysis, we propose Secret Spatio-Temporal Promotion (STeP) and Cross-Band Difference Attention (CroDA) for analysis within the steganographic domain. STeP uses the secret video to guide spatiotemporal feature extraction in the steganographic domain during training. CroDA suppresses cover interference by capturing cross-band semantic differences. Experiments demonstrate that StegaVAR achieves superior VAR and privacy-preserving performance on widely used datasets. Moreover, our framework is effective for multiple steganographic models.

    https://arxiv.org/abs/2512.12586


    Automatic Wire-Harness Color Sequence Detector

    oai:arXiv.org:2512.12590v1

    arXiv:2512.12590v1 Announce Type: new Abstract: Wire harness inspection process remains a labor-intensive process prone to errors in the modern Electronics Manufacturing Services (EMS) industry. This paper introduces a semiautomated machine vision system capable of verifying correct wire positioning, correctness of the connector polarity and correctness of color sequences for both linear and circular wire harness configurations. Five industrial standard CMOS cameras are integrated into a modularized mechanical framework in the physical structure of the solution and a HSV and RGB color domain value comparison based color sequence classifier is used in the operation. For each harness batch, a user can train the system using at least five reference samples; the trained file is stored and reused for similar harness types. The Solution is deployed at GPV Lanka Pvt. Ltd. (Fig. 2) and the system achieved 100% detection accuracy and reduced inspection time by 44% compared to manual methods. Additional features include user management, adjustable lighting, session data storage, and secure login. Results of this product usage in the real world situation demonstrate that this approach delivers reliable and efficient inspection capabilities.

    https://arxiv.org/abs/2512.12590


    Linear Binary Codes Correcting One or More Errors

    oai:arXiv.org:2512.12591v1

    arXiv:2512.12591v1 Announce Type: new Abstract: This paper examines linear binary codes capable of correcting one or more errors. For the single-error-correcting case, it is shown that the Hamming bound is achieved by a constructive method, and an exact expression for the minimal codeword length is derived. For the general case, a simple lower bound for the parameters of linear codes is derived from an analysis of the coset structure.

    https://arxiv.org/abs/2512.12591


    Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification

    oai:arXiv.org:2512.12592v1

    arXiv:2512.12592v1 Announce Type: new Abstract: Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. While recent research has focused on the accuracy of automated scoring (AES), these static approaches fail to capture process evidence or verify genuine student understanding. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions. In a pilot study with university instructors (N=9), we demonstrate that while Stage 1 (Auto-Scoring) ensures procedural fairness and consistency, Stage 2 (Interactive Verification) is essential for construct validity, effectively diagnosing superficial reasoning or unverified AI use. We report on the systems design, instructor perceptions of fairness versus validity, and the necessity of adaptive difficulty in follow-up questioning. The findings offer a scalable pathway for authentic assessment that moves beyond policing AI to integrating it as a synergistic partner in the evaluation process.

    https://arxiv.org/abs/2512.12592


    SHERLOCK: A Deep Learning Approach To Detect Software Vulnerabilities

    oai:arXiv.org:2512.12593v1

    arXiv:2512.12593v1 Announce Type: new Abstract: The increasing reliance on software in various applications has made the problem of software vulnerability detection more critical. Software vulnerabilities can lead to security breaches, data theft, and other negative outcomes. Traditional software vulnerability detection techniques, such as static and dynamic analysis, have been shown to be ineffective at detecting multiple vulnerabilities. To address this issue, this study employed a deep learning approach, specifically Convolutional Neural Networks (CNN), to solve the software vulnerability detection problem. A 5-split cross-validation approach was used to train and evaluate the CNN model, which takes tokenized source code as input. The findings indicated that Sherlock successfully detected multiple vulnerabilities at the function level, and its performance was particularly strong for CWE-199, CWE-120, and CWE-Other, with an overall high accuracy rate and significant true positive and true negative values. However, the performance was less reliable for some vulnerabilities due to the lack of a standardized dataset which will be a future research direction. The results suggest that compared to current techniques, the proposed deep learning approach has the potential to substantially enhance the accuracy of software vulnerability detection.

    https://arxiv.org/abs/2512.12593


    ceLLMate: Sandboxing Browser AI Agents

    oai:arXiv.org:2512.12594v1

    arXiv:2512.12594v1 Announce Type: new Abstract: Browser-using agents (BUAs) are an emerging class of autonomous agents that interact with web browsers in human-like ways, including clicking, scrolling, filling forms, and navigating across pages. While these agents help automate repetitive online tasks, they are vulnerable to prompt injection attacks that can trick an agent into performing undesired actions, such as leaking private information or issuing state-changing requests. We propose ceLLMate, a browser-level sandboxing framework that restricts the agent's ambient authority and reduces the blast radius of prompt injections. We address two fundamental challenges: (1) The semantic gap challenge in policy enforcement arises because the agent operates through low-level UI observations and manipulations; however, writing and enforcing policies directly over UI-level events is brittle and error-prone. To address this challenge, we introduce an agent sitemap that maps low-level browser behaviors to high-level semantic actions. (2) Policy prediction in BUAs is the norm rather than the exception. BUAs have no app developer to pre-declare sandboxing policies, and thus, ceLLMate pairs website-authored mandatory policies with an automated policy-prediction layer that adapts and instantiates these policies from the user's natural-language task. We implement ceLLMate as an agent-agnostic browser extension and demonstrate how it enables sandboxing policies that effectively block various types of prompt injection attacks with negligible overhead.

    https://arxiv.org/abs/2512.12594


    Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

    oai:arXiv.org:2512.12595v1

    arXiv:2512.12595v1 Announce Type: new Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.

    https://arxiv.org/abs/2512.12595


    Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models

    oai:arXiv.org:2512.12596v1

    arXiv:2512.12596v1 Announce Type: new Abstract: In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image's detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based "placement plan" based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image's content, our method produces noticeably higher-quality advertisement layouts.

    https://arxiv.org/abs/2512.12596


    AgentSHAP: Interpreting LLM Agent Tool Importance with Monte Carlo Shapley Value Estimation

    oai:arXiv.org:2512.12597v1

    arXiv:2512.12597v1 Announce Type: new Abstract: LLM agents that use external tools can solve complex tasks, but understanding which tools actually contributed to a response remains a blind spot. No existing XAI methods address tool-level explanations. We introduce AgentSHAP, the first framework for explaining tool importance in LLM agents. AgentSHAP is model-agnostic: it treats the agent as a black box and works with any LLM (GPT, Claude, Llama, etc.) without needing access to internal weights or gradients. Using Monte Carlo Shapley values, AgentSHAP tests how an agent responds with different tool subsets and computes fair importance scores based on game theory. Our contributions are: (1) the first explainability method for agent tool attribution, grounded in Shapley values from game theory; (2) Monte Carlo sampling that reduces cost from O(2n) to practical levels; and (3) comprehensive experiments on API-Bank showing that AgentSHAP produces consistent scores across runs, correctly identifies which tools matter, and distinguishes relevant from irrelevant tools. AgentSHAP joins TokenSHAP (for tokens) and PixelSHAP (for image regions) to complete a family of Shapley-based XAI tools for modern generative AI. Code: https://github.com/GenAISHAP/TokenSHAP.

    https://arxiv.org/abs/2512.12597


    Geometry-Aware Scene-Consistent Image Generation

    oai:arXiv.org:2512.12598v1

    arXiv:2512.12598v1 Announce Type: new Abstract: We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model's spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.

    https://arxiv.org/abs/2512.12598


    Feasible-Set Reshaping for Constraint Qualification in Optimization-Based Control

    oai:arXiv.org:2512.12600v1

    arXiv:2512.12600v1 Announce Type: new Abstract: This paper presents a novel feasible-set reshaping technique to optimization-based control with ensured constraint qualification. In our problem setting, the feasible set of admissible control inputs depends on the real-time state of the plant, and the linear independence constraint qualification (LICQ) may not be satisfied in some regions of interest. By feasible-set reshaping, we project the constraints of the original feasible set onto an appropriately chosen constant matrix with its rows forming a positive span of the space of the optimization variable. It is proved that the reshaped feasible set is nonempty and satisfies LICQ, as long as the original feasible set is nonempty. The effectiveness of the proposed method is verified by constructing Lipschitz continuous quadratic-program-based (QP-based) controllers based on the reshaped feasible sets.

    https://arxiv.org/abs/2512.12600


    Quadratic-Programming-based Control of Multi-Robot Systems for Cooperative Object Transport

    oai:arXiv.org:2512.12601v1

    arXiv:2512.12601v1 Announce Type: new Abstract: This paper investigates the control problem of steering a group of spherical mobile robots to cooperatively transport a spherical object. By controlling the movements of the robots to exert appropriate contact (pushing) forces, it is desired that the object follows a velocity command. To solve the problem, we first treat the robots' positions as virtual control inputs of the object, and propose a velocity-tracking controller based on quadratic programming (QP), enabling the robots to cooperatively generate desired contact forces while minimizing the sum of the contact-force magnitudes. Then, we design position-tracking controllers for the robots. By appropriately designing the objective function and the constraints for the QP, it is guaranteed that the QP admits a unique solution and the QP-based velocity-tracking controller is Lipschitz continuous. Finally, we consider the closed-loop system as an interconnection of two subsystems, corresponding to the velocity-tracking error of the object and the position-tracking error of the robots, and employ nonlinear small-gain techniques for stability analysis. The effectiveness of the proposed design is demonstrated through numerical simulations.

    https://arxiv.org/abs/2512.12601


    Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

    oai:arXiv.org:2512.12602v1

    arXiv:2512.12602v1 Announce Type: new Abstract: Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallelism and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

    https://arxiv.org/abs/2512.12602


    No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching

    oai:arXiv.org:2512.12604v1

    arXiv:2512.12604v1 Announce Type: new Abstract: Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.

    https://arxiv.org/abs/2512.12604


    Causal inference and model explainability tools for retail

    oai:arXiv.org:2512.12605v1

    arXiv:2512.12605v1 Announce Type: new Abstract: Most major retailers today have multiple divisions focused on various aspects, such as marketing, supply chain, online customer experience, store customer experience, employee productivity, and vendor fulfillment. They also regularly collect data corresponding to all these aspects as dashboards and weekly/monthly/quarterly reports. Although several machine learning and statistical techniques have been in place to analyze and predict key metrics, such models typically lack interpretability. Moreover, such techniques also do not allow the validation or discovery of causal links. In this paper, we aim to provide a recipe for applying model interpretability and causal inference for deriving sales insights. In this paper, we review the existing literature on causal inference and interpretability in the context of problems in e-commerce and retail, and apply them to a real-world dataset. We find that an inherently explainable model has a lower variance of SHAP values, and show that including multiple confounders through a double machine learning approach allows us to get the correct sign of causal effect.

    https://arxiv.org/abs/2512.12605


    Human-Inspired Learning for Large Language Models via Obvious Record and Maximum-Entropy Method Discovery

    oai:arXiv.org:2512.12608v1

    arXiv:2512.12608v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios-such as niche hardware deployment issues or irregular IoT device behaviors-because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause--result (or question--solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question--solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.

    https://arxiv.org/abs/2512.12608


    Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching

    oai:arXiv.org:2512.12610v1

    arXiv:2512.12610v1 Announce Type: new Abstract: Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: https://wons20k.github.io/PatchwiseRetrieval/

    https://arxiv.org/abs/2512.12610


    Efficient Resource Allocation for Multi-User and Multi-Target MIMO-OFDM Underwater ISAC

    oai:arXiv.org:2512.12611v1

    arXiv:2512.12611v1 Announce Type: new Abstract: Integrated sensing and communication (ISAC) technology is crucial for next-generation underwater networks. However, covering multiple users and targets and balancing sensing and communication performance in complex underwater acoustic (UWA) environments remains challenging. This paper proposes an interleaved orthogonal frequency division multiplexing-based MIMO UWA-ISAC system, which employs a horizontal array to simultaneously transmit adaptive waveforms for downlink multi-user communication and omnidirectional target sensing. A multi-objective optimization framework is formulated to maximize the product of communication rate and range (PRR) while ensuring sensing performance and peak-to-average power ratio (PAPR) constraints. To solve this mixed-integer nonconvex problem, a two-dimensional grouped random search algorithm is developed, efficiently exploring subcarrier interleaved patterns and resource allocation schemes. Numerical simulations under real-world UWA channels demonstrate the designed system's superiority and effectiveness: our algorithm achieves 90% faster convergence than conventional exhaustive search with only a marginal 0.5 kbps km PRR degradation. Furthermore, the proposed resource allocation scheme maintains robustness beyond the baseline allocation schemes under stringent PRR and PAPR constraints.

    https://arxiv.org/abs/2512.12611


    StruProKGR: A Structural and Probabilistic Framework for Sparse Knowledge Graph Reasoning

    oai:arXiv.org:2512.12613v1

    arXiv:2512.12613v1 Announce Type: new Abstract: Sparse Knowledge Graphs (KGs) are commonly encountered in real-world applications, where knowledge is often incomplete or limited. Sparse KG reasoning, the task of inferring missing knowledge over sparse KGs, is inherently challenging due to the scarcity of knowledge and the difficulty of capturing relational patterns in sparse scenarios. Among all sparse KG reasoning methods, path-based ones have attracted plenty of attention due to their interpretability. Existing path-based methods typically rely on computationally intensive random walks to collect paths, producing paths of variable quality. Additionally, these methods fail to leverage the structured nature of graphs by treating paths independently. To address these shortcomings, we propose a Structural and Probabilistic framework named StruProKGR, tailored for efficient and interpretable reasoning on sparse KGs. StruProKGR utilizes a distance-guided path collection mechanism to significantly reduce computational costs while exploring more relevant paths. It further enhances the reasoning process by incorporating structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other. Extensive experiments on five sparse KG reasoning benchmarks reveal that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency, providing an effective, efficient, and interpretable solution for sparse KG reasoning.

    https://arxiv.org/abs/2512.12613


    gpu_ext: Extensible OS Policies for GPUs via eBPF

    oai:arXiv.org:2512.12615v1

    arXiv:2512.12615v1 Announce Type: new Abstract: Performance in modern GPU-centric systems depends increasingly on resource management policies, such as memory placement, scheduling, and observability. However, a one-size-fits-all policy performs poorly across diverse workloads. Existing approaches present a tradeoff: user-space runtimes offer programmability but lack cross-tenant visibility and fine-grained hardware control, while OS kernel modification introduce complexity and safety risks. To address this, we argue that the GPU driver and device layer must serve as an extensible OS policy interface. The emerging eBPF offers a possibility, but naively transplanting host-side eBPF is insufficient: it cannot observe critical device-side events, and directly injecting policy code into GPU kernels affects safety and efficiency. We present gpu_ext, an eBPF-based policy runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers to expose safe hooks and introduces a device-side eBPF runtime that executes verified policy logic within GPU kernels, enabling coherent, application-transparent policies. Evaluation on realistic workloads, including inference, training, and vector search, shows that gpu_ext improves throughput by up to 4.8x and reduces tail latency by up to 2x with low overhead, without modifying applications or restarting drivers.

    https://arxiv.org/abs/2512.12615


    Spectral Sentinel: Scalable Byzantine-Robust Decentralized Federated Learning via Sketched Random Matrix Theory on Blockchain

    oai:arXiv.org:2512.12617v1

    arXiv:2512.12617v1 Announce Type: new Abstract: Decentralized federated learning (DFL) enables collaborative model training without centralized trust, but it remains vulnerable to Byzantine clients that poison gradients under heterogeneous (Non-IID) data. Existing defenses face a scalability trilemma: distance-based filtering (e.g., Krum) can reject legitimate Non-IID updates, geometric-median methods incur prohibitive $O(n^2 d)$ cost, and many certified defenses are evaluated only on models below 100M parameters. We propose Spectral Sentinel, a Byzantine detection and aggregation framework that leverages a random-matrix-theoretic signature: honest Non-IID gradients produce covariance eigenspectra whose bulk follows the Marchenko-Pastur law, while Byzantine perturbations induce detectable tail anomalies. Our algorithm combines Frequent Directions sketching with data-dependent MP tracking, enabling detection on models up to 1.5B parameters using $O(k^2)$ memory with $k \ll d$. Under a $(\sigma,f)$ threat model with coordinate-wise honest variance bounded by $\sigma^2$ and $f < 1/2$ adversaries, we prove $(\epsilon,\delta)$-Byzantine resilience with convergence rate $O(\sigma f / \sqrt{T} + f^2 / T)$, and we provide a matching information-theoretic lower bound $\Omega(\sigma f / \sqrt{T})$, establishing minimax optimality. We implement the full system with blockchain integration on Polygon networks and validate it across 144 attack-aggregator configurations, achieving 78.4 percent average accuracy versus 48-63 percent for baseline methods.

    https://arxiv.org/abs/2512.12617


    C-PASS: Center-Fed Pinching Antenna System

    oai:arXiv.org:2512.12619v1

    arXiv:2512.12619v1 Announce Type: new Abstract: A novel architecture of the center-fed pinching antenna system (C-PASS) is proposed. In contrast to the conventional end-fed PASS, signals are fed from the center input ports and propagate towards both sides of the waveguide. By doing so, spatial-multiplexing gain can be achieved in a single waveguide. Based on the proposed C-PASS, closed-form expressions for the degree of freedom (DoF) and power scaling laws are derived. These theoretical results reveal that C-PASS can achieve \emph{twice} the DoF and an additional multiplexing gain of $\mathcal{O}(P_T \ln^4 N/N^2)$ compared to the conventional PASS, where $P_T$ and $N$ represent the transmit power and pinching antenna number, respectively. Numerical results are provided to demonstrate that substantial capacity improvements can be achieved through the enhanced DoF and multiplexing gain of the C-PASS.

    https://arxiv.org/abs/2512.12619


    Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives

    oai:arXiv.org:2512.12620v1

    arXiv:2512.12620v1 Announce Type: new Abstract: We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.

    https://arxiv.org/abs/2512.12620


    D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation

    oai:arXiv.org:2512.12622v1

    arXiv:2512.12622v1 Announce Type: new Abstract: Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.

    https://arxiv.org/abs/2512.12622


    Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    oai:arXiv.org:2512.12623v1

    arXiv:2512.12623v1 Announce Type: new Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.

    https://arxiv.org/abs/2512.12623


    CoLSE: A Lightweight and Robust Hybrid Learned Model for Single-Table Cardinality Estimation using Joint CDF

    oai:arXiv.org:2512.12624v1

    arXiv:2512.12624v1 Announce Type: new Abstract: Cardinality estimation (CE), the task of predicting the result size of queries is a critical component of query optimization. Accurate estimates are essential for generating efficient query execution plans. Recently, machine learning techniques have been applied to CE, broadly categorized into query-driven and data-driven approaches. Data-driven methods learn the joint distribution of data, while query-driven methods construct regression models that map query features to cardinalities. Ideally, a CE technique should strike a balance among three key factors: accuracy, efficiency, and memory footprint. However, existing state-of-the-art models often fail to achieve this balance. To address this, we propose CoLSE, a hybrid learned approach for single-table cardinality estimation. CoLSE directly models the joint probability over queried intervals using a novel algorithm based on copula theory and integrates a lightweight neural network to correct residual estimation errors. Experimental results show that CoLSE achieves a favorable trade-off among accuracy, training time, inference latency, and model size, outperforming existing state-of-the-art methods.

    https://arxiv.org/abs/2512.12624


    ORIBA: Exploring LLM-Driven Role-Play Chatbot as a Creativity Support Tool for Original Character Artists

    oai:arXiv.org:2512.12630v1

    arXiv:2512.12630v1 Announce Type: new Abstract: Recent advances in Generative AI (GAI) have led to new opportunities for creativity support. However, this technology has raised ethical concerns in the visual artists community. This paper explores how GAI can assist visual artists in developing original characters (OCs) while respecting their creative agency. We present ORIBA, an AI chatbot leveraging large language models (LLMs) to enable artists to role-play with their OCs, focusing on conceptualization (e.g., backstories) while leaving exposition (visual creation) to creators. Through a study with 14 artists, we found ORIBA motivated artists' imaginative engagement, developing multidimensional attributes and stronger bonds with OCs that inspire their creative process. Our contributions include design insights for AI systems that develop from artists' perspectives, demonstrating how LLMs can support cross-modal creativity while preserving creative agency in OC art. This paper highlights the potential of GAI as a neutral, non-visual support that strengthens existing creative practice, without infringing artistic exposition.

    https://arxiv.org/abs/2512.12630


    Optimized Conflict Management for Urban Air Mobility Using Swarm UAV Networks

    oai:arXiv.org:2512.12632v1

    arXiv:2512.12632v1 Announce Type: new Abstract: Urban Air Mobility (UAM) poses unprecedented traffic coordination challenges, especially with increasing UAV densities in dense urban corridors. This paper introduces a mathematical model using a control algorithm to optimize an Edge AI-driven decentralized swarm architecture for intelligent conflict resolution, enabling real-time decision-making with low latency. Using lightweight neural networks, the system leverages edge nodes to perform distributed conflict detection and resolution. A simulation platform was developed to evaluate the scheme under various UAV densities. Results indicate that the conflict resolution time is dramatically minimized up to 3.8 times faster, and accuracy is enhanced compared to traditional centralized control models. The proposed architecture is highly promising for scalable, efficient, and safe aerial traffic management in future UAM systems.

    https://arxiv.org/abs/2512.12632


    DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

    oai:arXiv.org:2512.12633v1

    arXiv:2512.12633v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.

    https://arxiv.org/abs/2512.12633


    Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents

    oai:arXiv.org:2512.12634v1

    arXiv:2512.12634v1 Announce Type: new Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they either rely on single path offline benchmarks or online live benchmarks. Offline benchmarks using static, single path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi path aware offline benchmarking framework for mobile GUI agents that enables high fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost efficient mobile agents.

    https://arxiv.org/abs/2512.12634


    Electric Road Systems for Smart Cities: A Scalable Infrastructure Framework for Dynamic Wireless Charging

    oai:arXiv.org:2512.12638v1

    arXiv:2512.12638v1 Announce Type: new Abstract: The transition to electric transportation is a key enabler for intelligent and sustainable cities; however, inadequate charging infrastructure remains a major barrier to large-scale electric vehicle (EV) adoption. This paper presents a scalable Electric Road System (ERS) architecture that enables Dynamic Wireless Charging (DWC) of EVs during motion. The proposed framework integrates inductive charging coils embedded in road pavement, real-time vehicle-to-infrastructure (V2I) communication, and adaptive energy management coordinated with smart grid systems. Modular road segments with a standardized charging process are employed to ensure scalability across urban corridors and interoperability among different EV platforms. System performance is evaluated using a co-simulation framework combining MATLAB-based power analysis with traffic inputs generated in SUMO. Key performance metrics include charging efficiency, energy cost per kilometer, and battery lifecycle improvement. Simulation results indicate a potential reduction in range anxiety and an increase in battery lifespan due to frequent shallow charging cycles. The study further discusses deployment challenges, policy considerations, and energy distribution strategies aligned with climate-resilient urban development. A case study of a tier-1 Indian city is presented to analyze the cost-benefit trade-offs of retrofitting high-density urban corridors with ERS. The proposed framework provides a practical foundation for next-generation EV infrastructure planning in smart cities.

    https://arxiv.org/abs/2512.12638


    Which Pieces Does Unigram Tokenization Really Need?

    oai:arXiv.org:2512.12641v1

    arXiv:2512.12641v1 Announce Type: new Abstract: The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.

    https://arxiv.org/abs/2512.12641


    Torch Geometric Pool: the Pytorch library for pooling in Graph Neural Networks

    oai:arXiv.org:2512.12642v1

    arXiv:2512.12642v1 Announce Type: new Abstract: We introduce Torch Geometric Pool (tgp), a library for hierarchical pooling in Graph Neural Networks. Built upon Pytorch Geometric, Torch Geometric Pool (tgp) provides a wide variety of pooling operators, unified under a consistent API and a modular design. The library emphasizes usability and extensibility, and includes features like precomputed pooling, which significantly accelerate training for a class of operators. In this paper, we present tgp's structure and present an extensive benchmark. The latter showcases the library's features and systematically compares the performance of the implemented graph-pooling methods in different downstream tasks. The results, showing that the choice of the optimal pooling operator depends on tasks and data at hand, support the need for a library that enables fast prototyping.

    https://arxiv.org/abs/2512.12642


    LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases

    oai:arXiv.org:2512.12643v1

    arXiv:2512.12643v1 Announce Type: new Abstract: Legal relations form a highly consequential analytical framework of civil law system, serving as a crucial foundation for resolving disputes and realizing values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we firstly introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extractions, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relations information leads to consistent performance gains on other downstream legal AI tasks.

    https://arxiv.org/abs/2512.12643


    Bayesian Optimization Parameter Tuning Framework for a Lyapunov Based Path Following Controller

    oai:arXiv.org:2512.12649v1

    arXiv:2512.12649v1 Announce Type: new Abstract: Parameter tuning in real-world experiments is constrained by the limited evaluation budget available on hardware. The path-following controller studied in this paper reflects a typical situation in nonlinear geometric controller, where multiple gains influence the dynamics through coupled nonlinear terms. Such interdependence makes manual tuning inefficient and unlikely to yield satisfactory performance within a practical number of trials. To address this challenge, we propose a Bayesian optimization (BO) framework that treats the closed-loop system as a black box and selects controller gains using a Gaussian-process surrogate. BO offers model-free exploration, quantified uncertainty, and data-efficient search, making it well suited for tuning tasks where each evaluation is costly. The framework is implemented on Honda's AI-Formula three-wheeled robot and assessed through repeated full-lap experiments on a fixed test track. The results show that BO improves controller performance within 32 trials, including 15 warm-start initial evaluations, indicating that it can efficiently locate high-performing regions of the parameter space under real-world conditions. These findings demonstrate that BO provides a practical, reliable, and data-efficient tuning approach for nonlinear path-following controllers on real robotic platforms.

    https://arxiv.org/abs/2512.12649


    A Systematic Analysis of Higher Education on Software Engineering in the Netherlands

    oai:arXiv.org:2512.12650v1

    arXiv:2512.12650v1 Announce Type: new Abstract: Software engineering educators strive to continuously improve their courses and programs. Understanding the current state of practice of software engineering higher education can empower educators to critically assess their courses, fine-tune them by benchmarking against observed practices, and ultimately enhance their curricula. In this study, we aim to provide an encompassing analysis of higher education on software engineering by considering the higher educational offering of an entire European country, namely the Netherlands. We leverage a crowd-sourced analysis process by considering 10 Dutch universities and 207 university courses. The courses are analysed via knowledge areas adopted from the SWEBOK. The mapping process is refined via homogenisation and internal consistency improvement phases, and is followed by a data analysis phase. Given its fundamental nature, Construction and Programming is the most covered knowledge area at Bachelor level. Other knowledge areas are equally covered at Bachelor and Master level (e.g., software engineering models), while more advanced ones are almost exclusively covered at Master level. We identify three clusters of tightly coupled knowledge areas: (i) requirements, architecture, and design, (ii) testing, verification, and security, and (iii) process-oriented and DevOps topics. Dutch universities generally cover all knowledge areas uniformly, with minor deviations reflecting institutional research strengths. Our results highlight correlations among key knowledge areas and their potential for enhancing integrated learning. We also identify underrepresented areas, such as software engineering economics, which educators may consider including in curricula. We invite researchers to use our research method in their own geographical region, in order to contrast software engineering education programs across the globe.

    https://arxiv.org/abs/2512.12650


    Value-Aware Multiagent Systems

    oai:arXiv.org:2512.12652v1

    arXiv:2512.12652v1 Announce Type: new Abstract: This paper introduces the concept of value awareness in AI, which goes beyond the traditional value-alignment problem. Our definition of value awareness presents us with a concise and simplified roadmap for engineering value-aware AI. The roadmap is structured around three core pillars: (1) learning and representing human values using formal semantics, (2) ensuring the value alignment of both individual agents and multiagent systems, and (3) providing value-based explainability on behaviour. The paper presents a selection of our ongoing work on some of these topics, along with applications to real-life domains.

    https://arxiv.org/abs/2512.12652


    Modeling Authorial Style in Urdu Novels Using Character Interaction Graphs and Graph Neural Networks

    oai:arXiv.org:2512.12654v1

    arXiv:2512.12654v1 Announce Type: new Abstract: Authorship analysis has traditionally focused on lexical and stylistic cues within text, while higher-level narrative structure remains underexplored, particularly for low-resource languages such as Urdu. This work proposes a graph-based framework that models Urdu novels as character interaction networks to examine whether authorial style can be inferred from narrative structure alone. Each novel is represented as a graph where nodes correspond to characters and edges denote their co-occurrence within narrative proximity. We systematically compare multiple graph representations, including global structural features, node-level semantic summaries, unsupervised graph embeddings, and supervised graph neural networks. Experiments on a dataset of 52 Urdu novels written by seven authors show that learned graph representations substantially outperform hand-crafted and unsupervised baselines, achieving up to 0.857 accuracy under a strict author-aware evaluation protocol.

    https://arxiv.org/abs/2512.12654


    Argumentative Reasoning with Language Models on Non-factorized Case Bases

    oai:arXiv.org:2512.12656v1

    arXiv:2512.12656v1 Announce Type: new Abstract: In this paper, we investigate how language models can perform case-based reasoning (CBR) on non-factorized case bases. We introduce a novel framework, argumentative agentic models for case-based reasoning (AAM-CBR), which extends abstract argumentation for case-based reasoning (AA-CBR). Unlike traditional approaches that require factorization of previous cases, AAM-CBR leverages language models to determine case coverage and extract factors based on new cases. This enables factor-based reasoning without exposing or preprocessing previous cases, thus improving both flexibility and privacy. We also present initial experiments to assess AAM-CBR performance by comparing the proposed framework with a baseline that uses a single-prompt approach to incorporate both new and previous cases. The experiments are conducted based on a synthetic credit card application dataset. The result shows that AAM-CBR surpasses the baseline only when the new case contains a richer set of factors. The finding indicates that language models can handle case-based reasoning with a limited number of factors, but face challenges as the number of factors increase. Consequently, integrating symbolic reasoning with language models, as implemented in AAM-CBR, is crucial for effectively handling cases involving many factors.

    https://arxiv.org/abs/2512.12656


    Cross-modal Fundus Image Registration under Large FoV Disparity

    oai:arXiv.org:2512.12657v1

    arXiv:2512.12657v1 Announce Type: new Abstract: Previous work on cross-modal fundus image registration (CMFIR) assumes small cross-modal Field-of-View (FoV) disparity. By contrast, this paper is targeted at a more challenging scenario with large FoV disparity, to which directly applying current methods fails. We propose Crop and Alignment for cross-modal fundus image Registration(CARe), a very simple yet effective method. Specifically, given an OCTA with smaller FoV as a source image and a wide-field color fundus photograph (wfCFP) as a target image, our Crop operation exploits the physiological structure of the retina to crop from the target image a sub-image with its FoV roughly aligned with that of the source. This operation allows us to re-purpose the previous small-FoV-disparity oriented methods for subsequent image registration. Moreover, we improve spatial transformation by a double-fitting based Alignment module that utilizes the classical RANSAC algorithm and polynomial-based coordinate fitting in a sequential manner. Extensive experiments on a newly developed test set of 60 OCTA-wfCFP pairs verify the viability of CARe for CMFIR.

    https://arxiv.org/abs/2512.12657


    CogDoc: Towards Unified thinking in Documents

    oai:arXiv.org:2512.12658v1

    arXiv:2512.12658v1 Announce Type: new Abstract: Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization,followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.

    https://arxiv.org/abs/2512.12658


    Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

    oai:arXiv.org:2512.12662v1

    arXiv:2512.12662v1 Announce Type: new Abstract: Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance Transformer-centric encoder feature extraction capability in an initial unsupervised phase. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.

    https://arxiv.org/abs/2512.12662


    PerNodeDrop: A Method Balancing Specialized Subnets and Regularization in Deep Neural Networks

    oai:arXiv.org:2512.12663v1

    arXiv:2512.12663v1 Announce Type: new Abstract: Deep neural networks possess strong representational capacity yet remain vulnerable to overfitting, primarily because neurons tend to co-adapt in ways that, while capturing complex and fine-grained feature interactions, also reinforce spurious and non-generalizable patterns that inflate training performance but reduce reliability on unseen data. Noise-based regularizers such as Dropout and DropConnect address this issue by injecting stochastic perturbations during training, but the noise they apply is typically uniform across a layer or across a batch of samples, which can suppress both harmful and beneficial co-adaptation. This work introduces PerNodeDrop, a lightweight stochastic regularization method. It applies per-sample, per-node perturbations to break the uniformity of the noise injected by existing techniques, thereby allowing each node to experience input-specific variability. Hence, PerNodeDrop preserves useful co-adaptation while applying regularization. This narrows the gap between training and validation performance and improves reliability on unseen data, as evident from the experiments. Although superficially similar to DropConnect, PerNodeDrop operates at the sample level. It drops weights at the sample level, not the batch level. An expected-loss analysis formalizes how its perturbations attenuate excessive co-adaptation while retaining predictive interactions. Empirical evaluations on vision, text, and audio benchmarks indicate improved generalization relative to the standard noise-based regularizer.

    https://arxiv.org/abs/2512.12663


    InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

    oai:arXiv.org:2512.12664v1

    arXiv:2512.12664v1 Announce Type: new Abstract: Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, outperforming gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced realism, flexibility, and control.

    https://arxiv.org/abs/2512.12664


    Differentiation methods as a systematic uncertainty source in equation discovery

    oai:arXiv.org:2512.12666v1

    arXiv:2512.12666v1 Announce Type: new Abstract: In differential equation discovery algorithms, numerical differentiation is usually a fixed preliminary step. Current methods improve robustness with data subsampling and sparsity but often ignore the variability from the differentiation method itself. We show that this choice systematically introduces uncertainty, affecting both equation form and parameter estimates. Our study indicates that high-resolution schemes can magnify measurement noise, while heavily regularized methods may mask real physical variations, which leads to method-dependent findings. By evaluating six differentiation techniques on various partial differential equations under diverse noise levels using SINDy and EPDE frameworks, we consistently notice methodological biases in the determined models. This underscores the importance of selecting differentiation methods as a key modeling choice and highlights a path to enhance ensemble-based discovery by diversifying methodologies.

    https://arxiv.org/abs/2512.12666


    Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning

    oai:arXiv.org:2512.12667v1

    arXiv:2512.12667v1 Announce Type: new Abstract: The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known *a priori*. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model's OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.

    https://arxiv.org/abs/2512.12667


    DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization

    oai:arXiv.org:2512.12669v1

    arXiv:2512.12669v1 Announce Type: new Abstract: Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.

    https://arxiv.org/abs/2512.12669


    On Approaches to Building Surrogate ODE Models for Diffusion Bridges

    oai:arXiv.org:2512.12671v1

    arXiv:2512.12671v1 Announce Type: new Abstract: Diffusion and Schr\"{o}dinger Bridge models have established state-of-the-art performance in generative modeling but are often hampered by significant computational costs and complex training procedures. While continuous-time bridges promise faster sampling, overparameterized neural networks describe their optimal dynamics, and the underlying stochastic differential equations can be difficult to integrate efficiently. This work introduces a novel paradigm that uses surrogate models to create simpler, faster, and more flexible approximations of these dynamics. We propose two specific algorithms: SINDy Flow Matching (SINDy-FM), which leverages sparse regression to identify interpretable, symbolic differential equations from data, and a Neural-ODE reformulation of the Schr\"{o}dinger Bridge (DSBM-NeuralODE) for flexible continuous-time parameterization. Our experiments on Gaussian transport tasks and MNIST latent translation demonstrate that these surrogates achieve competitive performance while offering dramatic improvements in efficiency and interpretability. The symbolic SINDy-FM models, in particular, reduce parameter counts by several orders of magnitude and enable near-instantaneous inference, paving the way for a new class of tractable and high-performing bridge models for practical deployment.

    https://arxiv.org/abs/2512.12671


    Progressive Conditioned Scale-Shift Recalibration of Self-Attention for Online Test-time Adaptation

    oai:arXiv.org:2512.12673v1

    arXiv:2512.12673v1 Announce Type: new Abstract: Online test-time adaptation aims to dynamically adjust a network model in real-time based on sequential input samples during the inference stage. In this work, we find that, when applying a transformer network model to a new target domain, the Query, Key, and Value features of its self-attention module often change significantly from those in the source domain, leading to substantial performance degradation of the transformer model. To address this important issue, we propose to develop a new approach to progressively recalibrate the self-attention at each layer using a local linear transform parameterized by conditioned scale and shift factors. We consider the online model adaptation from the source domain to the target domain as a progressive domain shift separation process. At each transformer network layer, we learn a Domain Separation Network to extract the domain shift feature, which is used to predict the scale and shift parameters for self-attention recalibration using a Factor Generator Network. These two lightweight networks are adapted online during inference. Experimental results on benchmark datasets demonstrate that the proposed progressive conditioned scale-shift recalibration (PCSR) method is able to significantly improve the online test-time domain adaptation performance by a large margin of up to 3.9\% in classification accuracy on the ImageNet-C dataset.

    https://arxiv.org/abs/2512.12673


    Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

    oai:arXiv.org:2512.12675v1

    arXiv:2512.12675v1 Announce Type: new Abstract: Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

    https://arxiv.org/abs/2512.12675


    Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

    oai:arXiv.org:2512.12677v1

    arXiv:2512.12677v1 Announce Type: new Abstract: We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.

    https://arxiv.org/abs/2512.12677


    $\beta$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

    oai:arXiv.org:2512.12678v1

    arXiv:2512.12678v1 Announce Type: new Abstract: CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities-from full captions to sentences and phrases-and their corresponding visual regions. For each level of granularity, $\beta$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $\beta$-Contextualized Contrastive Alignment Loss ($\beta$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that $\beta$-CLIP significantly improves dense alignment: achieving 91.8% T2I 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting state-of-the-art among methods trained without hard negatives. $\beta$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.

    https://arxiv.org/abs/2512.12678


    Kernel interpolation in Sobolev spaces of hybrid regularity

    oai:arXiv.org:2512.12684v1

    arXiv:2512.12684v1 Announce Type: new Abstract: Kernel interpolation in tensor product reproducing kernel Hilbert spaces allows for the use of sparse grids to mitigate the curse of the dimension. Typically, besides the generic constant, only a dimension dependent power of a logarithm term enters here into complexity estimates. We show that optimized sparse grids can avoid this logarithmic factor when the interpolation error is measured with respect to Sobolev spaces of hybrid regularity. Consequently, in such a situation, the complexity of kernel interpolation does not suffer from the curse of dimension.

    https://arxiv.org/abs/2512.12684


    Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI

    oai:arXiv.org:2512.12686v1

    arXiv:2512.12686v1 Announce Type: new Abstract: Agentic memory is emerging as a key enabler for large language models (LLM) to maintain continuity, personalization, and long-term context in extended user interactions, critical capabilities for deploying LLMs as truly interactive and adaptive agents. Agentic memory refers to the memory that provides an LLM with agent-like persistence: the ability to retain and act upon information across conversations, similar to how a human would. We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. Memoria integrates two complementary components: dynamic session-level summarization and a weighted knowledge graph (KG)-based user modelling engine that incrementally captures user traits, preferences, and behavioral patterns as structured entities and relationships. This hybrid architecture enables both short-term dialogue coherence and long-term personalization while operating within the token constraints of modern LLMs. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems, offering a practical solution for industry applications requiring adaptive and evolving user experiences.

    https://arxiv.org/abs/2512.12686


    Theoretical Foundations of Prompt Engineering: From Heuristics to Expressivity

    oai:arXiv.org:2512.12688v1

    arXiv:2512.12688v1 Announce Type: new Abstract: Prompts can switch a model's behavior even when the weights are fixed, yet this phenomenon is rarely treated as a clean theoretical object rather than a heuristic. We study the family of functions obtainable by holding a Transformer backbone fixed as an executor and varying only the prompt. Our core idea is to view the prompt as an externally injected program and to construct a simplified Transformer that interprets it to implement different computations. The construction exposes a mechanism-level decomposition: attention performs selective routing from prompt memory, the FFN performs local arithmetic conditioned on retrieved fragments, and depth-wise stacking composes these local updates into a multi-step computation. Under this viewpoint, we prove a constructive existential result showing that a single fixed backbone can approximate a broad class of target behaviors via prompts alone. The framework provides a unified starting point for formalizing trade-offs under prompt length/precision constraints and for studying structural limits of prompt-based switching, while remaining distinct from empirical claims about pretrained LLMs.

    https://arxiv.org/abs/2512.12688


    Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

    oai:arXiv.org:2512.12690v1

    arXiv:2512.12690v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) reasoning have been largely attributed to the rise of reinforcement Learning (RL), which has shifted the community's focus away from the supervised fine-tuning (SFT) paradigm. Many studies suggest that introducing the SFT stage not only fails to improve reasoning ability but may also negatively impact model training. In this study, we revisit this RL-centric belief through a systematic and controlled comparison of SFT and RL on VLM Reasoning. Using identical data sources, we find that the relative effectiveness of SFT and RL is conditional and strongly influenced by model capacity, data scale, and data distribution. Contrary to common assumptions, our findings show that SFT plays a crucial role across several scenarios: (1) Effectiveness for weaker models. SFT more reliably elicits reasoning capabilities in smaller or weaker VLMs. (2) Data efficiency. SFT with only 2K achieves comparable or better reasoning performance to RL with 20K. (3) Cross-modal transferability. SFT demonstrates stronger generalization across modalities. Moreover, we identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL. These results challenge the prevailing "RL over SFT" narrative. They highlight that the role of SFT may have been underestimated and support a more balanced post-training pipeline in which SFT and RL function as complementary components.

    https://arxiv.org/abs/2512.12690


    WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment

    oai:arXiv.org:2512.12692v1

    arXiv:2512.12692v1 Announce Type: new Abstract: LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable-limited to browser-visible content (e.g., DOM and UI elements)-where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions-limitations that reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions pre-execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.

    https://arxiv.org/abs/2512.12692


    Co-Exploration and Co-Exploitation via Shared Structure in Multi-Task Bandits

    oai:arXiv.org:2512.12693v1

    arXiv:2512.12693v1 Announce Type: new Abstract: We propose a novel Bayesian framework for efficient exploration in contextual multi-task multi-armed bandit settings, where the context is only observed partially and dependencies between reward distributions are induced by latent context variables. In order to exploit these structural dependencies, our approach integrates observations across all tasks and learns a global joint distribution, while still allowing personalised inference for new tasks. In this regard, we identify two key sources of epistemic uncertainty, namely structural uncertainty in the latent reward dependencies across arms and tasks, and user-specific uncertainty due to incomplete context and limited interaction history. To put our method into practice, we represent the joint distribution over tasks and rewards using a particle-based approximation of a log-density Gaussian process. This representation enables flexible, data-driven discovery of both inter-arm and inter-task dependencies without prior assumptions on the latent variables. Empirically, we demonstrate that our method outperforms baselines such as hierarchical model bandits, especially in settings with model misspecification or complex latent heterogeneity.

    https://arxiv.org/abs/2512.12693


    Hybrid Retrieval-Augmented Generation for Robust Multilingual Document Question Answering

    oai:arXiv.org:2512.12694v1

    arXiv:2512.12694v1 Announce Type: new Abstract: Large-scale digitization initiatives have unlocked massive collections of historical newspapers, yet effective computational access remains hindered by OCR corruption, multilingual orthographic variation, and temporal language drift. We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents. Our approach integrates: (i) semantic query expansion and multi-query fusion using Reciprocal Rank Fusion to improve retrieval robustness against vocabulary mismatch; (ii) a carefully engineered generation prompt that enforces strict grounding in retrieved evidence and explicit abstention when evidence is insufficient; and (iii) a modular architecture enabling systematic component evaluation. We conduct comprehensive ablation studies on Named Entity Recognition and embedding model selection, demonstrating the importance of syntactic coherence in entity extraction and balanced performance-efficiency trade-offs in dense retrieval. Our end-to-end evaluation framework shows that the pipeline generates faithful answers for well-supported queries while correctly abstaining from unanswerable questions. The hybrid retrieval strategy improves recall stability, particularly benefiting from RRF's ability to smooth performance variance across query formulations. We release our code and configurations at https://anonymous.4open.science/r/RAGs-C5AE/, providing a reproducible foundation for robust historical document question answering.

    https://arxiv.org/abs/2512.12694


    Attributes to Support the Formulation of Practically Relevant Research Problems in Software Engineering

    oai:arXiv.org:2512.12699v1

    arXiv:2512.12699v1 Announce Type: new Abstract: [Background] A well-formulated research problem is essential for achieving practical relevance in Software Engineering (SE), yet there is a lack of structured guidance in this early phase. [Aims] Our goal is to introduce and evaluate seven attributes identified in the SE literature as relevant for formulating research problems (practical problem, context, implications/impacts, practitioners, evidence, objective, and research questions) in terms of their perceived importance and completeness, and learn how they can be applied. [Method] We conducted a workshop with 42 senior SE researchers during the ISERN 2024 meeting. The seven attributes were presented using a Problem Vision board filled with a research example. Participants discussed attributes in groups, shared written feedback, and individually completed a survey assessing their importance, completeness, and suggestions for improvement. [Results] The findings confirm the importance of the seven attributes in the formulation of industry-oriented research problems. Qualitative feedback illustrated how they can be applied in practice and revealed suggestions to refine them, such as incorporating financial criteria (e.g., ROI) into implications/impacts and addressing feasibility and constraints under evidence. [Conclusion] The results reaffirm the importance of the seven attributes in supporting a reflective and context-aware problem formulation. Adapting their use to specific research contexts can help to improve the alignment between academic research and industry needs.

    https://arxiv.org/abs/2512.12699


    Efficient Vision-Language Reasoning via Adaptive Token Pruning

    oai:arXiv.org:2512.12701v1

    arXiv:2512.12701v1 Announce Type: new Abstract: Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.

    https://arxiv.org/abs/2512.12701


    Robust Motion Generation using Part-level Reliable Data from Videos

    oai:arXiv.org:2512.12703v1

    arXiv:2512.12703v1 Announce Type: new Abstract: Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation. However, some human parts in many video frames cannot be seen due to off-screen captures or occlusions. It brings a dilemma: discarding the data missing any part limits scale and diversity, while retaining it compromises data quality and model performance. To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregression model. First, we decompose a human body into five parts and detect the parts clearly seen in a video frame as "credible". Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model to predict masked credible parts, while ignoring those noisy parts. In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences, for evaluation. Experimental results indicate that our method successfully outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency and diversity. Project page: https://boyuaner.github.io/ropar-main/

    https://arxiv.org/abs/2512.12703


    Synergizing Code Coverage and Gameplay Intent: Coverage-Aware Game Playtesting with LLM-Guided Reinforcement Learning

    oai:arXiv.org:2512.12706v1

    arXiv:2512.12706v1 Announce Type: new Abstract: The widespread adoption of the "Games as a Service" model necessitates frequent content updates, placing immense pressure on quality assurance. In response, automated game testing has been viewed as a promising solution to cope with this demanding release cadence. However, existing automated testing approaches typically create a dichotomy: code-centric methods focus on structural coverage without understanding gameplay context, while player-centric agents validate high-level intent but often fail to cover specific underlying code changes. To bridge this gap, we propose SMART (Structural Mapping for Augmented Reinforcement Testing), a novel framework that synergizes structural verification and functional validation for game update testing. SMART leverages large language models (LLMs) to interpret abstract syntax tree (AST) differences and extract functional intent, constructing a context-aware hybrid reward mechanism. This mechanism guides reinforcement learning agents to sequentially fulfill gameplay goals while adaptively exploring modified code branches. We evaluate SMART on two environments, Overcooked and Minecraft. The results demonstrate that SMART significantly outperforms state-of-the-art baselines; it achieves over 94% branch coverage of modified code, nearly double that of traditional reinforcement learning methods, while maintaining a 98% task completion rate, effectively balancing structural comprehensiveness with functional correctness.

    https://arxiv.org/abs/2512.12706


    From Linear Risk to Emergent Harm: Complexity as the Missing Core of AI Governance

    oai:arXiv.org:2512.12707v1

    arXiv:2512.12707v1 Announce Type: new Abstract: Risk-based AI regulation has become the dominant paradigm in AI governance, promising proportional controls aligned with anticipated harms. This paper argues that such frameworks often fail for structural reasons: they implicitly assume linear causality, stable system boundaries, and largely predictable responses to regulation. In practice, AI operates within complex adaptive socio-technical systems in which harm is frequently emergent, delayed, redistributed, and amplified through feedback loops and strategic adaptation by system actors. As a result, compliance can increase while harm is displaced or concealed rather than eliminated. We propose a complexity-based framework for AI governance that treats regulation as intervention rather than control, prioritises dynamic system mapping over static classifications, and integrates causal reasoning and simulation for policy design under uncertainty. The aim is not to eliminate uncertainty, but to enable robust system stewardship through monitoring, learning, and iterative revision of governance interventions.

    https://arxiv.org/abs/2512.12707


    Multi-Trajectory Physics-Informed Neural Networks for HJB Equations with Hard-Zero Terminal Inventory: Optimal Execution on Synthetic & SPY Data

    oai:arXiv.org:2512.12708v1

    arXiv:2512.12708v1 Announce Type: new Abstract: We study optimal trade execution with a hard-zero terminal inventory constraint, modeled via Hamilton-Jacobi-Bellman (HJB) equations. Vanilla PINNs often under-enforce this constraint and produce unstable controls. We propose a Multi-Trajectory PINN (MT-PINN) that adds a rollout-based trajectory loss and propagates a terminal penalty on terminal inventory via backpropagation-through-time, directly enforcing zero terminal inventory. A lightweight lambda-curriculum is adopted to stabilize training as the state expands from a risk-neutral reduced HJB to a risk-averse HJB. On the Gatheral-Schied single-asset model, MT-PINN aligns closely with their derived closed-form solutions and concentrates terminal inventory tightly around zero while reducing errors along optimal paths. We apply MT-PINNs on SPY intraday data, matching TWAP when risk-neutral, and achieving lower exposure and competitive costs, especially in falling windows, for higher risk-aversion.

    https://arxiv.org/abs/2512.12708


    Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

    oai:arXiv.org:2512.12713v1

    arXiv:2512.12713v1 Announce Type: new Abstract: Control policies in deep reinforcement learning are often implemented with fixed-capacity multilayer perceptrons trained by backpropagation, which lack structural plasticity and depend on global error signals. This paper introduces the Self-Motivated Growing Neural Network (SMGrNN), a controller whose topology evolves online through a local Structural Plasticity Module (SPM). The SPM monitors neuron activations and edge-wise weight update statistics over short temporal windows and uses these signals to trigger neuron insertion and pruning, while synaptic weights are updated by a standard gradient-based optimizer. This allows network capacity to be regulated during learning without manual architectural tuning. SMGrNN is evaluated on control benchmarks via policy distillation. Compared with multilayer perceptron baselines, it achieves similar or higher returns, lower variance, and task-appropriate network sizes. Ablation studies with growth disabled and growth-only variants isolate the role of structural plasticity, showing that adaptive topology improves reward stability. The local and modular design of SPM enables future integration of a Hebbian plasticity module and spike-timing-dependent plasticity, so that SMGrNN can support both artificial and spiking neural implementations driven by local rules.

    https://arxiv.org/abs/2512.12713


    CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

    oai:arXiv.org:2512.12716v1

    arXiv:2512.12716v1 Announce Type: new Abstract: Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.

    https://arxiv.org/abs/2512.12716


    HMPCC: Human-Aware Model Predictive Coverage Control

    oai:arXiv.org:2512.12717v1

    arXiv:2512.12717v1 Announce Type: new Abstract: We address the problem of coordinating a team of robots to cover an unknown environment while ensuring safe operation and avoiding collisions with non-cooperative agents. Traditional coverage strategies often rely on simplified assumptions, such as known or convex environments and static density functions, and struggle to adapt to real-world scenarios, especially when humans are involved. In this work, we propose a human-aware coverage framework based on Model Predictive Control (MPC), namely HMPCC, where human motion predictions are integrated into the planning process. By anticipating human trajectories within the MPC horizon, robots can proactively coordinate their actions %avoid redundant exploration, and adapt to dynamic conditions. The environment is modeled as a Gaussian Mixture Model (GMM), representing regions of interest. Team members operate in a fully decentralized manner, without relying on explicit communication, an essential feature in hostile or communication-limited scenarios. Our results show that human trajectory forecasting enables more efficient and adaptive coverage, improving coordination between human and robotic agents.

    https://arxiv.org/abs/2512.12717


    Spinal Line Detection for Posture Evaluation through Train-ing-free 3D Human Body Reconstruction with 2D Depth Images

    oai:arXiv.org:2512.12718v1

    arXiv:2512.12718v1 Announce Type: new Abstract: The spinal angle is an important indicator of body balance. It is important to restore the 3D shape of the human body and estimate the spine center line. Existing mul-ti-image-based body restoration methods require expensive equipment and complex pro-cedures, and single image-based body restoration methods have limitations in that it is difficult to accurately estimate the internal structure such as the spine center line due to occlusion and viewpoint limitation. This study proposes a method to compensate for the shortcomings of the multi-image-based method and to solve the limitations of the sin-gle-image method. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restora-tion to noise and occlusion is performed. Also, the Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stabil-ity of spinal angle estimation are simultaneously secured by using the Level of Detail en-semble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and the verification confirms the improvement of matching quality.

    https://arxiv.org/abs/2512.12718


    Towards AI Agents Supported Research Problem Formulation

    oai:arXiv.org:2512.12719v1

    arXiv:2512.12719v1 Announce Type: new Abstract: Poorly formulated research problems can compromise the practical relevance of Software Engineering studies by not reflecting the complexities of industrial practice. This vision paper explores the use of artificial intelligence agents to support SE researchers during the early stage of a research project, the formulation of the research problem. Based on the Lean Research Inception framework and using a published study on code maintainability in machine learning as a reference, we developed a descriptive evaluation of a scenario illustrating how AI agents, integrated into LRI, can support SE researchers by pre filling problem attributes, aligning stakeholder perspectives, refining research questions, simulating multiperspective assessments, and supporting decision making. The descriptive evaluation of the scenario suggests that AI agent support can enrich collaborative discussions and enhance critical reflection on the value, feasibility, and applicability of the research problem. Although the vision of integrating AI agents into LRI was perceived as promising to support the context aware and practice oriented formulation of research problems, empirical validation is needed to confirm and refine the integration of AI agents into problem formulation.

    https://arxiv.org/abs/2512.12719


    Making Robots Play by the Rules: The ROS 2 CLIPS-Executive

    oai:arXiv.org:2512.12722v1

    arXiv:2512.12722v1 Announce Type: new Abstract: CLIPS is a rule-based programming language for building knowledge-driven applications, well suited for the complex task of coordinating autonomous robots. Inspired by the CLIPS-Executive originally developed for the lesser known Fawkes robotics framework, we present an Integration of CLIPS into the ROS ecosystem. Additionally, we show the flexibility of CLIPS by describing a PDDL-based planning framework integration.

    https://arxiv.org/abs/2512.12722


    RunPBA -- Runtime attestation for microcontrollers with PACBTI

    oai:arXiv.org:2512.12729v1

    arXiv:2512.12729v1 Announce Type: new Abstract: The widespread adoption of embedded systems has led to their deployment in critical real-world applications, making them attractive targets for malicious actors. These devices face unique challenges in mitigating vulnerabilities due to intrinsic constraints, such as low energy consumption requirements and limited computational resources. This paper presents RunPBA, a hardware-based runtime attestation system designed to defend against control flow attacks while maintaining minimal performance overhead and adhering to strict power consumption constraints. RunPBA leverages PACBTI, a new processor extension tailored for the Arm Cortex M processor family, allowing robust protection without requiring hardware modifications, a limitation present in similar solutions. We implemented a proof-of-concept and evaluated it using two benchmark suites. Experimental results indicate that RunPBA imposes a geometric mean performance overhead of only 1% and 4.7% across the benchmarks, underscoring its efficiency and suitability for real-world deployment.

    https://arxiv.org/abs/2512.12729


    NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

    oai:arXiv.org:2512.12730v1

    arXiv:2512.12730v1 Announce Type: new Abstract: Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

    https://arxiv.org/abs/2512.12730


    Solving a Machine Learning Regression Problem Based on the Theory of Random Functions

    oai:arXiv.org:2512.12731v1

    arXiv:2512.12731v1 Announce Type: new Abstract: This paper studies a machine learning regression problem as a multivariate approximation problem using the framework of the theory of random functions. An ab initio derivation of a regression method is proposed, starting from postulates of indifference. It is shown that if a probability measure on an infinite-dimensional function space possesses natural symmetries (invariance under translation, rotation, scaling, and Gaussianity), then the entire solution scheme, including the kernel form, the type of regularization, and the noise parameterization, follows analytically from these postulates. The resulting kernel coincides with a generalized polyharmonic spline; however, unlike existing approaches, it is not chosen empirically but arises as a consequence of the indifference principle. This result provides a theoretical foundation for a broad class of smoothing and interpolation methods, demonstrating their optimality in the absence of a priori information.

    https://arxiv.org/abs/2512.12731


    Ethical Risk Analysis of L2 Rollups

    oai:arXiv.org:2512.12732v1

    arXiv:2512.12732v1 Announce Type: new Abstract: Layer 2 rollups improve throughput and fees, but can reintroduce risk through operator discretion and information asymmetry. We ask which operator and governance designs produce ethically problematic user risk. We adapt Ethical Risk Analysis to rollup architectures, build a role-based taxonomy of decision authority and exposure, and pair the framework with two empirical signals, a cross sectional snapshot of 129 projects from L2BEAT and a hand curated incident set covering 2022 to 2025. We analyze mechanisms that affect risks to users funds, including upgrade timing and exit windows, proposer liveness and whitelisting, forced inclusion usability, and data availability choices. We find that ethical hazards rooted in L2 components control arrangements are widespread: instant upgrades without exit windows appear in about 86 percent of projects, and proposer controls that can freeze withdrawals in about 50 percent. Reported incidents concentrate in sequencer liveness and inclusion, consistent with these dependencies. We translate these findings into ethically grounded suggestions on mitigation strategies including technical components and governance mechanisms.

    https://arxiv.org/abs/2512.12732


    Personalized QoE Prediction: A Demographic-Augmented Machine Learning Framework for 5G Video Streaming Networks

    oai:arXiv.org:2512.12736v1

    arXiv:2512.12736v1 Announce Type: new Abstract: Quality of Experience (QoE) prediction is a critical component of modern multimedia systems, particularly for adaptive video streaming in 5G networks. Accurate QoE estimation enables intelligent resource management and supports user centric service delivery. Existing QoE prediction approaches primarily rely on limited datasets and assume uniform user perception, which restricts their applicability in heterogeneous real world environments. This paper proposes a demographic aware machine learning framework for personalized QoE prediction. We introduce a behaviorally realistic demographic based data augmentation strategy that expands a small QoE dataset six fold by modeling varying user sensitivities to streaming impairments such as rebuffering, bitrate variation, and quality degradation. Using the augmented dataset, we evaluate a comprehensive set of classical machine learning models alongside advanced deep learning architectures, including an attention-based MLP and TabNet. Experimental results demonstrate significant improvements in prediction accuracy across RMSE, MAE, and R metrics compared to baseline models. Among all evaluated approaches, TabNet achieves the strongest performance, benefiting from its inherent feature selection and attention mechanisms. The results confirm that demographic-aware augmentation substantially enhances QoE prediction robustness and provides a scalable direction for personalized QoE-aware intelligence in 5G video streaming networks.

    https://arxiv.org/abs/2512.12736


    SPARK: Igniting Communication-Efficient Decentralized Learning via Stage-wise Projected NTK and Accelerated Regularization

    oai:arXiv.org:2512.12737v1

    arXiv:2512.12737v1 Announce Type: new Abstract: Decentralized federated learning (DFL) faces critical challenges from statistical heterogeneity and communication overhead. While NTK-based methods achieve faster convergence, transmitting full Jacobian matrices is impractical for bandwidth-constrained edge networks. We propose SPARK, synergistically integrating random projection-based Jacobian compression, stage-wise annealed distillation, and Nesterov momentum acceleration. Random projections compress Jacobians while preserving spectral properties essential for convergence. Stage-wise annealed distillation transitions from pure NTK evolution to neighbor-regularized learning, counteracting compression noise. Nesterov momentum accelerates convergence through stable accumulation enabled by distillation smoothing. SPARK achieves 98.7% communication reduction compared to NTK-DFL while maintaining convergence speed and superior accuracy. With momentum, SPARK reaches target performance 3 times faster, establishing state-of-the-art results for communication-efficient decentralized learning and enabling practical deployment in bandwidth-limited edge environments.

    https://arxiv.org/abs/2512.12737


    FuXi-$\gamma$: Efficient Sequential Recommendation with Exponential-Power Temporal Encoder and Diagonal-Sparse Positional Mechanism

    oai:arXiv.org:2512.12740v1

    arXiv:2512.12740v1 Announce Type: new Abstract: Sequential recommendation aims to model users' evolving preferences based on their historical interactions. Recent advances leverage Transformer-based architectures to capture global dependencies, but existing methods often suffer from high computational overhead, primarily due to discontinuous memory access in temporal encoding and dense attention over long sequences. To address these limitations, we propose FuXi-$\gamma$, a novel sequential recommendation framework that improves both effectiveness and efficiency through principled architectural design. FuXi-$\gamma$ adopts a decoder-only Transformer structure and introduces two key innovations: (1) An exponential-power temporal encoder that encodes relative temporal intervals using a tunable exponential decay function inspired by the Ebbinghaus forgetting curve. This encoder enables flexible modeling of both short-term and long-term preferences while maintaining high efficiency through continuous memory access and pure matrix operations. (2) A diagonal-sparse positional mechanism that prunes low-contribution attention blocks using a diagonal-sliding strategy guided by the persymmetry of Toeplitz matrix. Extensive experiments on four real-world datasets demonstrate that FuXi-$\gamma$ achieves state-of-the-art performance in recommendation quality, while accelerating training by up to 4.74$\times$ and inference by up to 6.18$\times$, making it a practical and scalable solution for long-sequence recommendation. Our code is available at https://github.com/Yeedzhi/FuXi-gamma.

    https://arxiv.org/abs/2512.12740


    Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models

    oai:arXiv.org:2512.12744v1

    arXiv:2512.12744v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve state-of-the-art performance across a wide range of applications, but their massive scale poses significant challenges for both efficiency and interpretability. Structural pruning, which reduces model size by removing redundant computational units such as neurons, has been widely explored as a solution, and this study devotes to input sparsification, an increasingly popular technique that improves efficiency by selectively activating only a subset of entry values for each input. However, existing approaches focus primarily on computational savings, often overlooking the representational consequences of sparsification and leaving a noticeable performance gap compared to full models. In this work, we first reinterpret input sparsification as a form of dynamic structural pruning. Motivated by the spontaneous baseline firing rates observed in biological neurons, we introduce a small set of trainable spontaneous neurons that act as compensatory units to stabilize activations in sparsified LLMs. Experiments demonstrate that these auxiliary neurons substantially reduce the sparsification-induced performance gap while generalizing effectively across tasks.

    https://arxiv.org/abs/2512.12744


    GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

    oai:arXiv.org:2512.12751v1

    arXiv:2512.12751v1 Announce Type: new Abstract: Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.

    https://arxiv.org/abs/2512.12751


    Newton Methods for Mean Field Games: A Numerical Study

    oai:arXiv.org:2512.12752v1

    arXiv:2512.12752v1 Announce Type: new Abstract: We address the numerical solution of second-order Mean Field Game problems through Newton iterations in infinite dimensions, introduced in [14], where quadratic convergence of the method was rigorously established. Building upon this theoretical framework, we develop new numerical discretization techniques, including both a finite difference and a semi-Lagrangian scheme, that enable an effective computational implementation of the infinite-dimensional iterations. The proposed methods are tested on several benchmark problems, and the resulting numerical experiments demonstrate their robustness, accuracy, and efficiency. A comparative analysis between the two schemes and existing approaches from the literature is also presented, highlighting the potential of Newton-based solvers for MFG systems.

    https://arxiv.org/abs/2512.12752


    An End-to-End Approach for Microgrid Probabilistic Forecasting and Robust Operation via Decision-focused Learning

    oai:arXiv.org:2512.12755v1

    arXiv:2512.12755v1 Announce Type: new Abstract: High penetration of renewable energy sources (RES) introduces significant uncertainty and intermittency into microgrid operations, posing challenges to economic and reliable scheduling. To address this, this paper proposes an end-to-end decision-focused framework that jointly optimizes probabilistic forecasting and robust operation for microgrids. A multilayer encoder-decoder (MED) probabilistic forecasting model is integrated with a two-stage robust optimization (TSRO) model involving direct load control (DLC) through a differentiable decision pathway, enabling gradient-based feedback from operational outcomes to improve forecasting performance. Unlike conventional sequential approaches, the proposed method aligns forecasting accuracy with operational objectives by directly minimizing decision regret via a surrogate smart predict-then-optimize (SPO) loss function. This integration ensures that probabilistic forecasts are optimized for downstream decisions, enhancing both economic efficiency and robustness. Case studies on modified IEEE 33-bus and 69-bus systems demonstrate that the proposed framework achieves superior forecasting accuracy and operational performance, reducing total and net operation costs by up to 18% compared with conventional forecasting and optimization combinations. The results verify the effectiveness and scalability of the end-to-end decision-focused approach for resilient and cost-efficient microgrid management under uncertainty.

    https://arxiv.org/abs/2512.12755


    FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

    oai:arXiv.org:2512.12756v1

    arXiv:2512.12756v1 Announce Type: new Abstract: Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.

    https://arxiv.org/abs/2512.12756


    From Information Freshness to Semantics of Information and Goal-oriented Communications

    oai:arXiv.org:2512.12758v1

    arXiv:2512.12758v1 Announce Type: new Abstract: Future wireless networks must support real-time, data-driven cyber-physical systems in which communication is tightly coupled with sensing, inference, control, and decision-making. Traditional communication paradigms centered on accuracy, throughput, and latency are increasingly inadequate for these systems, where the value of information depends on its semantic relevance to a specific task. This paper provides a unified exposition of the progression from classical distortion-based frameworks, through information freshness metrics such as the Age of Information (AoI) and its variants, to the emerging paradigm of goal-oriented semantics-aware communication. We organize and systematize existing semantics-aware metrics, including content- and version-aware measures, context-dependent distortion formulations, and history-dependent error persistence metrics that capture lasting impact and urgency. Within this framework, we highlight how these metrics address the limitations of purely accuracy- or freshness-centric designs, and how they collectively enable the selective generation and transmission of only task-relevant information. We further review analytical tools based on Markov decision process (MDP) and Lyapunov optimization methods that have been employed to characterize optimal or near-optimal timing and scheduling policies under semantic performance criteria and communication constraints. By synthesizing these developments into a coherent framework, the paper clarifies the design principles underlying goal-oriented, semantics-aware communication systems. It illustrates how they can significantly improve efficiency, reliability, and task performance. The presented perspective aims to serve as a bridge between information-theoretic, control-theoretic, and networking viewpoints, and to guide the design of semantic communication architectures for 6G and beyond.

    https://arxiv.org/abs/2512.12758


    Intelligent Scientific Literature Explorer using Machine Learning (ISLE)

    oai:arXiv.org:2512.12760v1

    arXiv:2512.12760v1 Announce Type: new Abstract: The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.

    https://arxiv.org/abs/2512.12760


    Lexicographic Multi-Objective Stochastic Shortest Path with Mixed Max-Sum Costs

    oai:arXiv.org:2512.12761v1

    arXiv:2512.12761v1 Announce Type: new Abstract: We study the Stochastic Shortest Path (SSP) problem for autonomous systems with mixed max-sum cost aggregations under Linear Temporal Logic constraints. Classical SSP formulations rely on sum-aggregated costs, which are suitable for cumulative quantities such as time or energy but fail to capture bottleneck-style objectives such as avoiding high-risk transitions, where performance is determined by the worst single event along a trajectory. Such objectives are particularly important in safety-critical systems, where even one hazardous transition can be unacceptable. To address this limitation, we introduce max-aggregated objectives that minimize the bottleneck cost, i.e., the maximum one-step cost along a trajectory. We show that standard Bellman equations on the original state space do not apply in this setting and propose an augmented MDP with a state variable tracking the running maximum cost, together with a value iteration algorithm. We further identify a cyclic policy phenomenon, where zero-marginal-cost cycles prevent goal reaching under max-aggregation, and resolve it via a finite-horizon formulation. To handle richer task requirements, linear temporal logic specifications are translated into deterministic finite automata and combined with the system to construct a product MDP. We propose a lexicographic value iteration algorithm that handles mixed max-sum objectives under lexicographic ordering on this product MDP. Gridworld case studies demonstrate the effectiveness of the proposed framework.

    https://arxiv.org/abs/2512.12761


    Federated Learning with Feedback Alignment

    oai:arXiv.org:2512.12762v1

    arXiv:2512.12762v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative training across multiple clients while preserving data privacy, yet it struggles with data heterogeneity, where clients' data are not distributed independently and identically (non-IID). This causes local drift, hindering global model convergence. To address this, we introduce Federated Learning with Feedback Alignment (FLFA), a novel framework that integrates feedback alignment into FL. FLFA uses the global model's weights as a shared feedback matrix during local training's backward pass, aligning local updates with the global model efficiently. This approach mitigates local drift with minimal additional computational cost and no extra communication overhead. Our theoretical analysis supports FLFA's design by showing how it alleviates local drift and demonstrates robust convergence for both local and global models. Empirical evaluations, including accuracy comparisons and measurements of local drift, further illustrate that FLFA can enhance other FL methods demonstrating its effectiveness.

    https://arxiv.org/abs/2512.12762


    CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence

    oai:arXiv.org:2512.12768v1

    arXiv:2512.12768v1 Announce Type: new Abstract: Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.

    https://arxiv.org/abs/2512.12768


    Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models (ASTA)

    oai:arXiv.org:2512.12769v1

    arXiv:2512.12769v1 Announce Type: new Abstract: Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.

    https://arxiv.org/abs/2512.12769


    Curi\'o-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining

    oai:arXiv.org:2512.12770v1

    arXiv:2512.12770v1 Announce Type: new Abstract: Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curi\'o 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curi\'o-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curi\'o-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu

    https://arxiv.org/abs/2512.12770


    JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

    oai:arXiv.org:2512.12772v1

    arXiv:2512.12772v1 Announce Type: new Abstract: Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6\%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.

    https://arxiv.org/abs/2512.12772


    Designing The Drive: Enhancing User Experience through Adaptive Interfaces in Autonomous Vehicles

    oai:arXiv.org:2512.12773v1

    arXiv:2512.12773v1 Announce Type: new Abstract: With the recent development and integration of autonomous vehicles (AVs) in transportation systems of the modern world, the emphasis on customizing user interfaces to optimize the overall user experience has been growing expediently. Therefore, understanding user needs and preferences is essential to the acceptance and trust of these technologies as they continue to grow in prevalence. This paper addresses the implementation of HCI principles in the personalization of interfaces to improve safety, security, and usability for the users. This paper explores the way that personalized interfaces can be devised to increase user engagement and satisfaction through various HCI strategies such as adaptive design, multi-modal interaction, and user feedback mechanisms. Moreover, this paper puts emphasis on factors of transparency and user control in the design of an interface; hence, allowing users to design or modify their experience could foster an increase in trust in autonomous systems. In so doing, this research touches on the quite influential role HCI will play in this future scenario of autonomous vehicles while designing to ensure relevance to the diverse needs of users while maintaining high standards of safety and security. Discussing various HCI strategies such as adaptive design, multi-modal interaction, and feedback mechanisms to the user, this paper demonstrates how personalized interfaces can enhance significantly both user engagement and satisfaction. Transparency and user control also in designing an interface are further discussed, pointing out the need for a prerequisite condition of enabling the user to take control of their experience as a state of trust in autonomous systems. In summary, this paper points out the role of HCI in the development of autonomous vehicles and addresses numerous needs with respect to those enforced safety and security standards.

    https://arxiv.org/abs/2512.12773


    Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

    oai:arXiv.org:2512.12774v1

    arXiv:2512.12774v1 Announce Type: new Abstract: As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.

    https://arxiv.org/abs/2512.12774


    Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions

    oai:arXiv.org:2512.12775v1

    arXiv:2512.12775v1 Announce Type: new Abstract: Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.

    https://arxiv.org/abs/2512.12775


    High Order Control Lyapunov Function - Control Barrier Function - Quadratic Programming Based Autonomous Driving Controller for Bicyclist Safety

    oai:arXiv.org:2512.12776v1

    arXiv:2512.12776v1 Announce Type: new Abstract: Ensuring the safety of Vulnerable Road Users (VRUs) is a critical challenge in the development of advanced autonomous driving systems in smart cities. Among vulnerable road users, bicyclists present unique characteristics that make their safety both critical and also manageable. Vehicles often travel at significantly higher relative speeds when interacting with bicyclists as compared to their interactions with pedestrians which makes collision avoidance system design for bicyclist safety more challenging. Yet, bicyclist movements are generally more predictable and governed by clear traffic rules as compared to the sudden and sometimes erratic pedestrian motion, offering opportunities for model-based control strategies. To address bicyclist safety in complex traffic environments, this study proposes and develops a High Order Control Lyapunov Function High Order Control Barrier Function Quadratic Programming (HOCLF HOCBF QP) control framework. Through this framework, CLFs constraints guarantee system stability so that the vehicle can track its reference trajectory, whereas CBFs constraints ensure system safety by letting vehicle avoiding potential collisions region with surrounding obstacles. Then by solving a QP problem, an optimal control command that simultaneously satisfies stability and safety requirements can be calculated. Three key bicyclist crash scenarios recorded in the Fatality Analysis Reporting System (FARS) are recreated and used to comprehensively evaluate the proposed autonomous driving bicyclist safety control strategy in a simulation study. Simulation results demonstrate that the HOCLF HOCBF QP controller can help the vehicle perform robust, and collision-free maneuvers, highlighting its potential for improving bicyclist safety in complex traffic environments.

    https://arxiv.org/abs/2512.12776


    State over Tokens: Characterizing the Role of Reasoning Tokens

    oai:arXiv.org:2512.12777v1

    arXiv:2512.12777v1 Announce Type: new Abstract: Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state -- the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.

    https://arxiv.org/abs/2512.12777


    VeBPF Many-Core Architecture for Network Functions in FPGA-based SmartNICs and IoT

    oai:arXiv.org:2512.12778v1

    arXiv:2512.12778v1 Announce Type: new Abstract: FPGA-based SmartNICs and IoT devices integrating soft-processors for network function execution have emerged to address the limited hardware reconfigurability of DPUs and MCUs. However, existing FPGA-based solutions lack a highly configurable many-core architecture specialized for network packet processing. This work presents VeBPF many-core architecture, a resource-optimized and highly configurable many-core architecture composed of custom VeBPF (Verilog eBPF) CPU cores designed for FPGA-based packet processing. The VeBPF cores are eBPF ISA compliant and implemented in Verilog HDL for seamless integration with existing FPGA IP blocks and subsystems. The proposed many-core architecture enables parallel execution of multiple eBPF rules across multiple VeBPF cores, achieving low-latency packet processing. The architecture is fully parameterizable, allowing the number of VeBPF cores and eBPF rules to scale according to application requirements and available FPGA resources. eBPF rules can be dynamically updated at run time without requiring FPGA reconfiguration, enabling flexible and adaptive network processing. The design incorporates hardware and computer architecture optimizations that support deployment across a wide range of platforms, from low-end FPGA-based IoT devices to high-end FPGA-based SmartNICs. In addition, we present automated testing and simulation frameworks developed using open-source tools such as Python and Cocotb. The VeBPF cores, many-core architecture, control software libraries, and simulation infrastructure are released as open source to support further research in FPGA-based many-core systems, eBPF acceleration, SmartNICs, IoT, and network security.

    https://arxiv.org/abs/2512.12778


    OLR-WAA: Adaptive and Drift-Resilient Online Regression with Dynamic Weighted Averaging

    oai:arXiv.org:2512.12779v1

    arXiv:2512.12779v1 Announce Type: new Abstract: Real-world datasets frequently exhibit evolving data distributions, reflecting temporal variations and underlying shifts. Overlooking this phenomenon, known as concept drift, can substantially degrade the predictive performance of the model. Furthermore, the presence of hyperparameters in online models exacerbates this issue, as these parameters are typically fixed and lack the flexibility to dynamically adjust to evolving data. This paper introduces "OLR-WAA: An Adaptive and Drift-Resilient Online Regression with Dynamic Weighted Average", a hyperparameter-free model designed to tackle the challenges of non-stationary data streams and enable effective, continuous adaptation. The objective is to strike a balance between model stability and adaptability. OLR-WAA incrementally updates its base model by integrating incoming data streams, utilizing an exponentially weighted moving average. It further introduces a unique optimization mechanism that dynamically detects concept drift, quantifies its magnitude, and adjusts the model based on real-time data characteristics. Rigorous evaluations show that it matches batch regression performance in static settings and consistently outperforms or rivals state-of-the-art online models, confirming its effectiveness. Concept drift datasets reveal a performance gap that OLR-WAA effectively bridges, setting it apart from other online models. In addition, the model effectively handles confidence-based scenarios through a conservative update strategy that prioritizes stable, high-confidence data points. Notably, OLR-WAA converges rapidly, consistently yielding higher R2 values compared to other online models.

    https://arxiv.org/abs/2512.12779


    Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset

    oai:arXiv.org:2512.12783v1

    arXiv:2512.12783v1 Announce Type: new Abstract: Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first quarter 2025 T\"U\.IK census marginals and telecom usage patterns. Retrieval augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio demographic variables; Full models include both socio demographic and alternative attributes. Across five fold stratified validation the alternative block raises area under the curve by about one point three percentage and lifts balanced \(F_{1}\) from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.

    https://arxiv.org/abs/2512.12783


    OLC-WA: Drift Aware Tuning-Free Online Classification with Weighted Average

    oai:arXiv.org:2512.12785v1

    arXiv:2512.12785v1 Announce Type: new Abstract: Real-world data sets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. This paper introduces Online Classification with Weighted Average (OLC-WA), an adaptive, hyperparameter-free online classification model equipped with an automated optimization mechanism. OLC-WA operates by blending incoming data streams with an existing base model. This blending is facilitated by an exponentially weighted moving average. Furthermore, an integrated optimization mechanism dynamically detects concept drift, quantifies its magnitude, and adjusts the model based on the observed data stream characteristics. This approach empowers the model to effectively adapt to evolving data distributions within streaming environments. Rigorous empirical evaluation across diverse benchmark datasets shows that OLC-WA achieves performance comparable to batch models in stationary environments, maintaining accuracy within 1-3%, and surpasses leading online baselines by 10-25% under drift, demonstrating its effectiveness in adapting to dynamic data streams.

    https://arxiv.org/abs/2512.12785


    Unveiling Statistical Significance of Online Regression over Multiple Datasets

    oai:arXiv.org:2512.12787v1

    arXiv:2512.12787v1 Announce Type: new Abstract: Despite extensive focus on techniques for evaluating the performance of two learning algorithms on a single dataset, the critical challenge of developing statistical tests to compare multiple algorithms across various datasets has been largely overlooked in most machine learning research. Additionally, in the realm of Online Learning, ensuring statistical significance is essential to validate continuous learning processes, particularly for achieving rapid convergence and effectively managing concept drifts in a timely manner. Robust statistical methods are needed to assess the significance of performance differences as data evolves over time. This article examines the state-of-the-art online regression models and empirically evaluates several suitable tests. To compare multiple online regression models across various datasets, we employed the Friedman test along with corresponding post-hoc tests. For thorough evaluations, utilizing both real and synthetic datasets with 5-fold cross-validation and seed averaging ensures comprehensive assessment across various data subsets. Our tests generally confirmed the performance of competitive baselines as consistent with their individual reports. However, some statistical test results also indicate that there is still room for improvement in certain aspects of state-of-the-art methods.

    https://arxiv.org/abs/2512.12787


    Temporal HAL-API Dependencies as a Gateway to Formal Embedded Software Development

    oai:arXiv.org:2512.12788v1

    arXiv:2512.12788v1 Announce Type: new Abstract: Temporal HAL-API Dependencies (THADs) can be useful to capture an interesting class of correctness properties in embedded software development. They demand a moderate effort for specification (which can be done via program annotations) and verification (which can be done automatically via software model checking). In this sense, they have the potential to form an interesting sweet spot between generic properties (that demand virtually no specification effort, and that are typically addressed by static analysis) and application-specific properties as addressed by full-fledged formal methods. Thus, they may form a gateway to wider and more economic use of formal methods in industrial embedded software development.

    https://arxiv.org/abs/2512.12788


    L-STEC: Learned Video Compression with Long-term Spatio-Temporal Enhanced Context

    oai:arXiv.org:2512.12790v1

    arXiv:2512.12790v1 Announce Type: new Abstract: Neural Video Compression has emerged in recent years, with condition-based frameworks outperforming traditional codecs. However, most existing methods rely solely on the previous frame's features to predict temporal context, leading to two critical issues. First, the short reference window misses long-term dependencies and fine texture details. Second, propagating only feature-level information accumulates errors over frames, causing prediction inaccuracies and loss of subtle textures. To address these, we propose the Long-term Spatio-Temporal Enhanced Context (L-STEC) method. We first extend the reference chain with LSTM to capture long-term dependencies. We then incorporate warped spatial context from the pixel domain, fusing spatio-temporal information through a multi-receptive field network to better preserve reference details. Experimental results show that L-STEC significantly improves compression by enriching contextual information, achieving 37.01% bitrate savings in PSNR and 31.65% in MS-SSIM compared to DCVC-TCM, outperforming both VTM-17.0 and DCVC-FM and establishing new state-of-the-art performance.

    https://arxiv.org/abs/2512.12790


    Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems

    oai:arXiv.org:2512.12791v1

    arXiv:2512.12791v1 Announce Type: new Abstract: Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi-agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation and assessment of LLM agents and the multi-agent systems they constitute remain a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI-based systems often overlook the non-deterministic nature of models. This non-determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including the agent ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.

    https://arxiv.org/abs/2512.12791


    Liquid Reasoning Transformers: A Sudoku-Based Prototype for Chess-Scale Algorithmic Tasks

    oai:arXiv.org:2512.12792v1

    arXiv:2512.12792v1 Announce Type: new Abstract: The Liquid Reasoning Transformer (LRT) is a transformer architecture designed for inference with adaptive depths using iterative changes, discard-based correction, and a learned stopping mechanism. Instead of relying on a single feedforward pass, the model updates a recurrent reasoning token across multiple internal steps, allowing it to correct early errors and allocate computation based on input difficulty. We evaluate the LRT on Sudoku as a controlled testbed for structured reasoning and show that it achieves strong performance, reaching 98.68% digit accuracy and 36.30% full-puzzle accuracy without using symbolic rules or search. Analyzing internal patterns shows that the discard and stop gates play different, important roles in stabilizing inferences and adjusting computational depth. We discuss how these mechanisms extend naturally to chess-scale reasoning tasks and outline extensions for multi-token reasoning and larger domains.

    https://arxiv.org/abs/2512.12792


    VLG-Loc: Vision-Language Global Localization from Labeled Footprint Maps

    oai:arXiv.org:2512.12793v1

    arXiv:2512.12793v1 Announce Type: new Abstract: This paper presents Vision-Language Global Localization (VLG-Loc), a novel global localization method that uses human-readable labeled footprint maps containing only names and areas of distinctive visual landmarks in an environment. While humans naturally localize themselves using such maps, translating this capability to robotic systems remains highly challenging due to the difficulty of establishing correspondences between observed landmarks and those in the map without geometric and appearance details. To address this challenge, VLG-Loc leverages a vision-language model (VLM) to search the robot's multi-directional image observations for the landmarks noted in the map. The method then identifies robot poses within a Monte Carlo localization framework, where the found landmarks are used to evaluate the likelihood of each pose hypothesis. Experimental validation in simulated and real-world retail environments demonstrates superior robustness compared to existing scan-based methods, particularly under environmental changes. Further improvements are achieved through the probabilistic fusion of visual and scan-based localization.

    https://arxiv.org/abs/2512.12793


    A Rule-Aware Prompt Framework for Structured Numeric Reasoning in Cyber-Physical Systems

    oai:arXiv.org:2512.12794v1

    arXiv:2512.12794v1 Announce Type: new Abstract: Many cyber-physical systems (CPS) rely on high-dimensional numeric telemetry and explicit operating rules to maintain safe and efficient operation. Recent large language models (LLMs) are increasingly considered as decision-support components in such systems, yet most deployments focus on textual inputs and do not directly address rule-grounded reasoning over numeric measurements. This paper proposes a rule-aware prompt framework that systematically encodes CPS domain context, numeric normalization, and decision rules into a modular prompt architecture for LLMs. The framework decomposes prompts into five reusable modules, including role specification, CPS domain context, numeric normalization, rule-aware reasoning, and output schema, and exposes an interface for plugging in diverse rule sets. A key design element is separating rule specification from the representation of normalized numeric deviations, which enables concise prompts that remain aligned with domain rules. We analyze how different normalization strategies and prompt configurations influence rule adherence, interpretability, and token efficiency. The framework is model-agnostic and applicable across CPS domains. To illustrate its behavior, we instantiate it on numeric anomaly assessment in an IEEE 118-bus electric power transmission network and evaluate several prompting and adaptation regimes. The results show that rule-aware, z-score-based value blocks and a hybrid LLM-detector architecture can substantially improve consistency with CPS rules and anomaly detection performance while reducing token usage, providing a reusable bridge between numeric telemetry and general-purpose LLMs.

    https://arxiv.org/abs/2512.12794


    TRACER: Transfer Learning based Real-time Adaptation for Clinical Evolving Risk

    oai:arXiv.org:2512.12795v1

    arXiv:2512.12795v1 Announce Type: new Abstract: Clinical decision support tools built on electronic health records often experience performance drift due to temporal population shifts, particularly when changes in the clinical environment initially affect only a subset of patients, resulting in a transition to mixed populations. Such case-mix changes commonly arise following system-level operational updates or the emergence of new diseases, such as COVID-19. We propose TRACER (Transfer Learning-based Real-time Adaptation for Clinical Evolving Risk), a framework that identifies encounter-level transition membership and adapts predictive models using transfer learning without full retraining. In simulation studies, TRACER outperformed static models trained on historical or contemporary data. In a real-world application predicting hospital admission following emergency department visits across the COVID-19 transition, TRACER improved both discrimination and calibration. TRACER provides a scalable approach for maintaining robust predictive performance under evolving and heterogeneous clinical conditions.

    https://arxiv.org/abs/2512.12795


    DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

    oai:arXiv.org:2512.12799v1

    arXiv:2512.12799v1 Announce Type: new Abstract: Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI

    https://arxiv.org/abs/2512.12799


    Learning Common and Salient Generative Factors Between Two Image Datasets

    oai:arXiv.org:2512.12800v1

    arXiv:2512.12800v1 Announce Type: new Abstract: Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method is weaker, since it only uses the dataset signal. We propose a novel framework for CA, that can be adapted to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving a high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image quality synthesis compared to prior methods.

    https://arxiv.org/abs/2512.12800


    Fine-Grained Energy Prediction For Parallellized LLM Inference With PIE-P

    oai:arXiv.org:2512.12801v1

    arXiv:2512.12801v1 Announce Type: new Abstract: With the widespread adoption of Large Language Models (LLMs), energy costs of running LLMs is quickly becoming a critical concern. However, precisely measuring the energy consumption of LLMs is often infeasible because hardware-based power monitors are not always accessible and software-based energy measurement tools are not accurate. While various prediction techniques have been developed to estimate LLM energy consumption, these approaches are limited to single-GPU environments and thus are not applicable to modern LLM inference which is typically parallelized across multiple GPUs. In this work, we remedy this gap and introduce PIE-P, a fine-grained energy prediction framework for multi-GPU inference, including tensor, pipeline, and data parallelism. Predicting the energy under parallelized inference is complicated by the non-determinism in inter-GPU communication, additional communication overheads, and difficulties in isolating energy during the communication/synchronization phase. We develop a scalable prediction framework that addresses these issues via precise sampling, fine-grained modeling of inter-GPU communication, and careful accounting of parallelization overhead. Our evaluation results show that PIE-P yields accurate and fine-grained energy predictions across parallelism strategies, significantly outperforming baselines.

    https://arxiv.org/abs/2512.12801


    Distributed Reinforcement Learning using Local Smart Meter Data for Voltage Regulation in Distribution Networks

    oai:arXiv.org:2512.12803v1

    arXiv:2512.12803v1 Announce Type: new Abstract: Centralised reinforcement learning (RL) for voltage magnitude regulation in distribution networks typically involves numerous agent-environment interactions and power flow (PF) calculations, inducing computational overhead and privacy concerns over shared data. Thus, we propose a distributed RL algorithm to regulate voltage magnitude. First, a dynamic Thevenin equivalent model is integrated within smart meters (SM), enabling local voltage magnitude estimation using local SM data for RL agent training, and mitigating the dependency of synchronised data collection and centralised PF calculations. To mitigate estimation errors induced by Thevenin model inaccuracies, a voltage magnitude correction strategy that combines piecewise functions and neural networks is introduced. The piecewise function corrects the large errors of estimated voltage magnitude, while a neural network mimics the grid's sensitivity to control actions, improving action adjustment precision. Second, a coordination strategy is proposed to refine local RL agent actions online, preventing voltage magnitude violations induced by excessive actions from multiple independently trained agents. Case studies on energy storage systems validate the feasibility and effectiveness of the proposed approach, demonstrating its potential to improve voltage regulation in distribution networks.

    https://arxiv.org/abs/2512.12803


    Causal Counterfactuals Reconsidered

    oai:arXiv.org:2512.12804v1

    arXiv:2512.12804v1 Announce Type: new Abstract: I develop a novel semantics for probabilities of counterfactuals that generalizes the standard Pearlian semantics: it applies to probabilistic causal models that cannot be extended into realistic structural causal models and are therefore beyond the scope of Pearl's semantics. This generalization is needed because, as I show, such probabilistic causal models arise even in simple settings. My semantics offer a natural compromize in the long-standing debate between Pearl and Dawid over counterfactuals: I agree with Dawid that universal causal determinism and unrealistic variables should be rejected, but I agree with Pearl that a general semantics of counterfactuals is nonetheless possible. I restrict attention to causal models that satisfy the Markov condition, only contain realistic variables, and are causally complete. Although I formulate my proposal using structural causal models, as does Pearl, I refrain from using so-called response variables. Moreover, I prove that my semantics is equivalent to two other recent proposals that do not involve structural causal models, and that it is in line with various comments on stochastic counterfactuals that have appeared in the literature more broadly. Throughout I also reflect on the universality of the Markov condition and explore a novel generalization of causal abstractions

    https://arxiv.org/abs/2512.12804


    From Small to Large: Generalization Bounds for Transformers on Variable-Size Inputs

    oai:arXiv.org:2512.12805v1

    arXiv:2512.12805v1 Announce Type: new Abstract: Transformers exhibit a notable property of \emph{size generalization}, demonstrating an ability to extrapolate from smaller token sets to significantly longer ones. This behavior has been documented across diverse applications, including point clouds, graphs, and natural language. Despite its empirical success, this capability still lacks some rigorous theoretical characterizations. In this paper, we develop a theoretical framework to analyze this phenomenon for geometric data, which we represent as discrete samples from a continuous source (e.g., point clouds from manifolds, graphs from graphons). Our core contribution is a bound on the error between the Transformer's output for a discrete sample and its continuous-domain equivalent. We prove that for Transformers with stable positional encodings, this bound is determined by the sampling density and the intrinsic dimensionality of the data manifold. Experiments on graphs and point clouds of various sizes confirm the tightness of our theoretical bound.

    https://arxiv.org/abs/2512.12805


    Fault-Tolerant Sandboxing for AI Coding Agents: A Transactional Approach to Safe Autonomous Execution

    oai:arXiv.org:2512.12806v1

    arXiv:2512.12806v1 Announce Type: new Abstract: The transition of Large Language Models (LLMs) from passive code generators to autonomous agents introduces significant safety risks, specifically regarding destructive commands and inconsistent system states. Existing commercial solutions often prioritize interactive user safety, enforcing authentication barriers that break the headless loops required for true autonomy. This paper presents a Fault-Tolerant Sandboxing framework designed to mitigate these risks through a policy-based interception layer and a transactional filesystem snapshot mechanism. We hypothesize that wrapping agent actions in atomic transactions can guarantee safety with acceptable latency, outperforming the heavy initialization overhead of containers or the interactive friction of commercial CLIs. We validated this approach by deploying the Minimind-MoE LLM served via nano-vllm on a custom Proxmox-based testbed utilizing EVPN/VXLAN isolation. Experimental results demonstrate a 100\% interception rate for high-risk commands and a 100\% success rate in rolling back failed states. Crucially, our prototype incurs only a 14.5\% performance overhead (approx. 1.8s) per transaction. In contrast, benchmarking against the Gemini CLI sandbox revealed that it requires interactive authentication ("Sign in"), rendering it unusable for headless, autonomous agent workflows.

    https://arxiv.org/abs/2512.12806


    OPAL: Operator-Programmed Algorithms for Landscape-Aware Black-Box Optimization

    oai:arXiv.org:2512.12809v1

    arXiv:2512.12809v1 Announce Type: new Abstract: Black-box optimization often relies on evolutionary and swarm algorithms whose performance is highly problem dependent. We view an optimizer as a short program over a small vocabulary of search operators and learn this operator program separately for each problem instance. We instantiate this idea in Operator-Programmed Algorithms (OPAL), a landscape-aware framework for continuous black-box optimization that uses a small design budget with a standard differential evolution baseline to probe the landscape, builds a $k$-nearest neighbor graph over sampled points, and encodes this trajectory with a graph neural network. A meta-learner then maps the resulting representation to a phase-wise schedule of exploration, restart, and local search operators. On the CEC~2017 test suite, a single meta-trained OPAL policy is statistically competitive with state-of-the-art adaptive differential evolution variants and achieves significant improvements over simpler baselines under nonparametric tests. Ablation studies on CEC~2017 justify the choices for the design phase, the trajectory graph, and the operator-program representation, while the meta-components add only modest wall-clock overhead. Overall, the results indicate that operator-programmed, landscape-aware per-instance design is a practical way forward beyond ad hoc metaphor-based algorithms in black-box optimization.

    https://arxiv.org/abs/2512.12809


    Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA

    oai:arXiv.org:2512.12812v1

    arXiv:2512.12812v1 Announce Type: new Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier researches, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.

    https://arxiv.org/abs/2512.12812


    Optimal Resource Allocation for ML Model Training and Deployment under Concept Drift

    oai:arXiv.org:2512.12816v1

    arXiv:2512.12816v1 Announce Type: new Abstract: We study how to allocate resources for training and deployment of machine learning (ML) models under concept drift and limited budgets. We consider a setting in which a model provider distributes trained models to multiple clients whose devices support local inference but lack the ability to retrain those models, placing the burden of performance maintenance on the provider. We introduce a model-agnostic framework that captures the interaction between resource allocation, concept drift dynamics, and deployment timing. We show that optimal training policies depend critically on the aging properties of concept durations. Under sudden concept changes, we derive optimal training policies subject to budget constraints when concept durations follow distributions with Decreasing Mean Residual Life (DMRL), and show that intuitive heuristics are provably suboptimal under Increasing Mean Residual Life (IMRL). We further study model deployment under communication constraints, prove that the associated optimization problem is quasi-convex under mild conditions, and propose a randomized scheduling strategy that achieves near-optimal client-side performance. These results offer theoretical and algorithmic foundations for cost-efficient ML model management under concept drift, with implications for continual learning, distributed inference, and adaptive ML systems.

    https://arxiv.org/abs/2512.12816


    Decoding Human and AI Persuasion in National College Debate: Analyzing Prepared Arguments Through Aristotle's Rhetorical Principles

    oai:arXiv.org:2512.12817v1

    arXiv:2512.12817v1 Announce Type: new Abstract: Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.

    https://arxiv.org/abs/2512.12817


    Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

    oai:arXiv.org:2512.12818v1

    arXiv:2512.12818v1 Announce Type: new Abstract: Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations -- retain, recall, and reflect -- that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.

    https://arxiv.org/abs/2512.12818


    On the continuity of flows

    oai:arXiv.org:2512.12821v1

    arXiv:2512.12821v1 Announce Type: new Abstract: Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to $L^2$ loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.

    https://arxiv.org/abs/2512.12821


    Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

    oai:arXiv.org:2512.12822v1

    arXiv:2512.12822v1 Announce Type: new Abstract: Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.

    https://arxiv.org/abs/2512.12822


    Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

    oai:arXiv.org:2512.12824v1

    arXiv:2512.12824v1 Announce Type: new Abstract: Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.

    https://arxiv.org/abs/2512.12824


    Sensitivity increase of 3D printed, self-sensing, carbon fibers structures with conductive filament matrix due to flexural loading

    oai:arXiv.org:2512.12826v1

    arXiv:2512.12826v1 Announce Type: new Abstract: The excellent structural and piezoresistive properties of continuous carbon fiber make it suitable for both structural and sensing applications. This work studies the use of 3D printed, continuous carbon fiber reinforced beams as self-sensing structures. It is demonstrated how the sensitivity of these carbon fiber strain gauges can be increased irreversibly by means of a pretreatment by ``breaking-in'' the sensors with a large compressive bending load. The increase in the gauge factor is attributed to local progressive fiber failure, due to the combination of the thermal residual stress from the printing process and external loading. The coextrusion of conductive filament around the carbon fibers is demonstrated as a means of improving the reliability, noise and electrical connection of the sensors. A micrograph of the sensor cross section shows that the conductive filament contacts the various carbon fiber bundles. All-in-all, the use of ``breaking-in'' carbon fiber strain gauges in combination with coextrusion of conductive filament hold promises for 3D printed structural sensors with a high sensitivity.

    https://arxiv.org/abs/2512.12826


    GradID: Adversarial Detection via Intrinsic Dimensionality of Gradients

    oai:arXiv.org:2512.12827v1

    arXiv:2512.12827v1 Announce Type: new Abstract: Despite their remarkable performance, deep neural networks exhibit a critical vulnerability: small, often imperceptible, adversarial perturbations can lead to drastically altered model predictions. Given the stringent reliability demands of applications such as medical diagnosis and autonomous driving, robust detection of such adversarial attacks is paramount. In this paper, we investigate the geometric properties of a model's input loss landscape. We analyze the Intrinsic Dimensionality (ID) of the model's gradient parameters, which quantifies the minimal number of coordinates required to describe the data points on their underlying manifold. We reveal a distinct and consistent difference in the ID for natural and adversarial data, which forms the basis of our proposed detection method. We validate our approach across two distinct operational scenarios. First, in a batch-wise context for identifying malicious data groups, our method demonstrates high efficacy on datasets like MNIST and SVHN. Second, in the critical individual-sample setting, we establish new state-of-the-art results on challenging benchmarks such as CIFAR-10 and MS COCO. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92\% on CIFAR-10. The results underscore the robustness of our geometric approach, highlighting that intrinsic dimensionality is a powerful fingerprint for adversarial detection across diverse datasets and attack strategies.

    https://arxiv.org/abs/2512.12827


    Towards a Systematic Taxonomy of Attacks against Space Infrastructures

    oai:arXiv.org:2512.12829v1

    arXiv:2512.12829v1 Announce Type: new Abstract: Space infrastructures represent an emerging domain that is critical to the global economy and society. However, this domain is vulnerable to attacks. To enhance the resilience of this domain, we must understand the attacks that can be waged against it. The status quo is that there is no systematic understanding of attacks against space infrastructures, despite their importance in guiding systematic analysis of space cybersecurity and future research. In this paper, we fill the void by proposing the first systematic taxonomy of attacks against space infrastructures. We hope this paper will inspire a community effort at refining the taxonomy towards a widely used taxonomy.

    https://arxiv.org/abs/2512.12829


    Network Level Evaluation of Hangup Susceptibility of HRGCs using Deep Learning and Sensing Techniques: A Goal Towards Safer Future

    oai:arXiv.org:2512.12832v1

    arXiv:2512.12832v1 Announce Type: new Abstract: Steep profiled Highway Railway Grade Crossings (HRGCs) pose safety hazards to vehicles with low ground clearance, which may become stranded on the tracks, creating risks of train vehicle collisions. This research develops a framework for network level evaluation of hangup susceptibility of HRGCs. Profile data from different crossings in Oklahoma were collected using both a walking profiler and the Pave3D8K Laser Imaging System. A hybrid deep learning model, combining Long Short Term Memory (LSTM) and Transformer architectures, was developed to reconstruct accurate HRGC profiles from Pave3D8K Laser Imaging System data. Vehicle dimension data from around 350 specialty vehicles were collected at various locations across Oklahoma to enable up to date statistical design dimensions. Hangup susceptibility was analyzed using three vehicle dimension scenarios (a) median dimension (median wheelbase and ground clearance), (b) 75 25 percentile dimension (75 percentile wheelbase, 25 percentile ground clearance), and (c) worst case dimension (maximum wheelbase and minimum ground clearance). Results indicate 36, 62, and 67 crossings at the highest hangup risk levels under these scenarios, respectively. An ArcGIS database and a software interface were developed to support transportation agencies in mitigating crossing hazards. This framework advances safety evaluation by integrating next generation sensing, deep learning, and infrastructure datasets into practical decision support tools.

    https://arxiv.org/abs/2512.12832


    Data-driven Supervisory Control under Attacks via Spectral Learning

    oai:arXiv.org:2512.12833v1

    arXiv:2512.12833v1 Announce Type: new Abstract: The technological advancements facilitating the rapid development of cyber-physical systems (CPS) also render such systems vulnerable to cyber attacks with devastating effects. Supervisory control is a commonly used control method to neutralize attacks on CPS. The supervisor strives to confine the (symbolic) paths of the system to a desired language via sensors and actuators in a closed control loop, even when attackers can manipulate the symbols received by the sensors and actuators. Currently, supervisory control methods face limitations when effectively identifying and mitigating unknown, broad-spectrum attackers. In order to capture the behavior of broad-spectrum attacks on both sensing and actuation channels we model the plant, supervisors, and attackers with finite-state transducers (FSTs). Our general method for addressing unknown attackers involves constructing FST models of the attackers from spectral analysis of their input and output symbol sequences recorded from a history of attack behaviors observed in a supervisory control loop. To construct these FST models, we devise a novel learning method based on the recorded history of attack behaviors. A supervisor is synthesized using such models to neutralize the attacks.

    https://arxiv.org/abs/2512.12833


    Procedural Music Generation Systems in Games

    oai:arXiv.org:2512.12834v1

    arXiv:2512.12834v1 Announce Type: new Abstract: Procedural Music Generation (PMG) is an emerging field that algorithmically creates music content for video games. By leveraging techniques from simple rule-based approaches to advanced machine learning algorithms, PMG has the potential to significantly improve development efficiency, provide richer musical experiences, and enhance player immersion. However, academic prototypes often diverge from applications due to differences in priorities such as novelty, reliability, and allocated resources. This paper bridges the gap between research and applications by presenting a systematic overview of current PMG techniques in both fields, offering a two-aspect taxonomy. Through a comparative analysis, this study identifies key research challenges in algorithm implementation, music quality and game integration. Finally, the paper outlines future research directions, emphasising task-oriented and context-aware design, more comprehensive quality evaluation methods, and improved research tool integration to provide actionable insights for developers, composers, and researchers seeking to advance PMG in game contexts.

    https://arxiv.org/abs/2512.12834


    Fast capacity computation for maze-like configurations

    oai:arXiv.org:2512.12836v1

    arXiv:2512.12836v1 Announce Type: new Abstract: We study the conformal capacity ${\rm cap}(\Omega,K)$ where $\Omega$ is a bounded domain of $\mathbb{R}^2$ and $K$ is a compact connected set in $\Omega$. Because the exact numerical value of the capacity is known only in a handful of special cases, it is important to find estimates for the capacity in terms of domain functionals, simpler than the capacity itself. Here, we study condensers of maze-like structure and compute their capacity by means of a high-order $hp$- finite element method. We compare these numerical results to the estimates given by the quasihyperbolic length and perimeter of the compact set. In particular, we consider the behaviour of these value pairs, numerical results and estimates, when the structure parameters vary and the walls of the maze approach the compact set. Over the configurations covered in the numerical experiments, the quasihyperbolic estimates are shown to have the desired asymptotic properties and superior computational efficiencies once the case-specific analysis is completed.

    https://arxiv.org/abs/2512.12836


    Algorithmic Criminal Liability in Greenwashing: Comparing India, United States, and European Union

    oai:arXiv.org:2512.12837v1

    arXiv:2512.12837v1 Announce Type: new Abstract: AI-powered greenwashing has emerged as an insidious challenge within corporate sustainability governance, exacerbating the opacity of environmental disclosures and subverting regulatory oversight. This study conducts a comparative legal analysis of criminal liability for AI-mediated greenwashing across India, the US, and the EU, exposing doctrinal lacunae in attributing culpability when deceptive claims originate from algorithmic systems. Existing statutes exhibit anthropocentric biases by predicating liability on demonstrable human intent, rendering them ill-equipped to address algorithmic deception. The research identifies a critical gap in jurisprudential adaptation, as prevailing fraud statutes remain antiquated vis-\`a-vis AI-generated misrepresentation. Utilising a doctrinal legal methodology, this study systematically dissects judicial precedents and statutory instruments, yielding results regarding the potential expansion of corporate criminal liability. Findings underscore the viability of strict liability models, recalibrated governance frameworks for AI accountability, and algorithmic due diligence mandates under ESG regimes. Comparative insights reveal jurisdictional disparities, with the EU Corporate Sustainability Due Diligence Directive (CSDDD) offering a potential transnational model. This study contributes to AI ethics and environmental jurisprudence by advocating for a hybrid liability framework integrating algorithmic risk assessment with legal personhood constructs, ensuring algorithmic opacity does not preclude liability enforcement.

    https://arxiv.org/abs/2512.12837


    What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

    oai:arXiv.org:2512.12839v1

    arXiv:2512.12839v1 Announce Type: new Abstract: In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at https://github.com/DingyiYang/LongStoryEval.

    https://arxiv.org/abs/2512.12839


    PRIVEE: Privacy-Preserving Vertical Federated Learning Against Feature Inference Attacks

    oai:arXiv.org:2512.12840v1

    arXiv:2512.12840v1 Announce Type: new Abstract: Vertical Federated Learning (VFL) enables collaborative model training across organizations that share common user samples but hold disjoint feature spaces. Despite its potential, VFL is susceptible to feature inference attacks, in which adversarial parties exploit shared confidence scores (i.e., prediction probabilities) during inference to reconstruct private input features of other participants. To counter this threat, we propose PRIVEE (PRIvacy-preserving Vertical fEderated lEarning), a novel defense mechanism named after the French word priv\'ee, meaning "private." PRIVEE obfuscates confidence scores while preserving critical properties such as relative ranking and inter-score distances. Rather than exposing raw scores, PRIVEE shares only the transformed representations, mitigating the risk of reconstruction attacks without degrading model prediction accuracy. Extensive experiments show that PRIVEE achieves a threefold improvement in privacy protection compared to state-of-the-art defenses, while preserving full predictive performance against advanced feature inference attacks.

    https://arxiv.org/abs/2512.12840


    SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding

    oai:arXiv.org:2512.12842v1

    arXiv:2512.12842v1 Announce Type: new Abstract: We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.

    https://arxiv.org/abs/2512.12842


    Selective Conformal Risk Control

    oai:arXiv.org:2512.12844v1

    arXiv:2512.12844v1 Announce Type: new Abstract: Reliable uncertainty quantification is essential for deploying machine learning systems in high-stakes domains. Conformal prediction provides distribution-free coverage guarantees but often produces overly large prediction sets, limiting its practical utility. To address this issue, we propose \textit{Selective Conformal Risk Control} (SCRC), a unified framework that integrates conformal prediction with selective classification. The framework formulates uncertainty control as a two-stage problem: the first stage selects confident samples for prediction, and the second stage applies conformal risk control on the selected subset to construct calibrated prediction sets. We develop two algorithms under this framework. The first, SCRC-T, preserves exchangeability by computing thresholds jointly over calibration and test samples, offering exact finite-sample guarantees. The second, SCRC-I, is a calibration-only variant that provides PAC-style probabilistic guarantees while being more computational efficient. Experiments on two public datasets show that both methods achieve the target coverage and risk levels, with nearly identical performance, while SCRC-I exhibits slightly more conservative risk control but superior computational practicality. Our results demonstrate that selective conformal risk control offers an effective and efficient path toward compact, reliable uncertainty quantification.

    https://arxiv.org/abs/2512.12844


    HaShiFlex: A High-Throughput Hardened Shifter DNN Accelerator with Fine-Tuning Flexibility

    oai:arXiv.org:2512.12847v1

    arXiv:2512.12847v1 Announce Type: new Abstract: We introduce a high-throughput neural network accelerator that embeds most network layers directly in hardware, minimizing data transfer and memory usage while preserving a degree of flexibility via a small neural processing unit for the final classification layer. By leveraging power-of-two (Po2) quantization for weights, we replace multiplications with simple rewiring, effectively reducing each convolution to a series of additions. This streamlined approach offers high-throughput, energy-efficient processing, making it highly suitable for applications where model parameters remain stable, such as continuous sensing tasks at the edge or large-scale data center deployments. Furthermore, by including a strategically chosen reprogrammable final layer, our design achieves high throughput without sacrificing fine-tuning capabilities. We implement this accelerator in a 7nm ASIC flow using MobileNetV2 as a baseline and report throughput, area, accuracy, and sensitivity to quantization and pruning - demonstrating both the advantages and potential trade-offs of the proposed architecture. We find that for MobileNetV2, we can improve inference throughput by 20x over fully programmable GPUs, processing 1.21 million images per second through a full forward pass while retaining fine-tuning flexibility. If absolutely no post-deployment fine tuning is required, this advantage increases to 67x at 4 million images per second.

    https://arxiv.org/abs/2512.12847


    KANEL\'E: Kolmogorov-Arnold Networks for Efficient LUT-based Evaluation

    oai:arXiv.org:2512.12850v1

    arXiv:2512.12850v1 Announce Type: new Abstract: Low-latency, resource-efficient neural network inference on FPGAs is essential for applications demanding real-time capability and low power. Lookup table (LUT)-based neural networks are a common solution, combining strong representational power with efficient FPGA implementation. In this work, we introduce KANEL\'E, a framework that exploits the unique properties of Kolmogorov-Arnold Networks (KANs) for FPGA deployment. Unlike traditional multilayer perceptrons (MLPs), KANs employ learnable one-dimensional splines with fixed domains as edge activations, a structure naturally suited to discretization and efficient LUT mapping. We present the first systematic design flow for implementing KANs on FPGAs, co-optimizing training with quantization and pruning to enable compact, high-throughput, and low-latency KAN architectures. Our results demonstrate up to a 2700x speedup and orders of magnitude resource savings compared to prior KAN-on-FPGA approaches. Moreover, KANEL\'E matches or surpasses other LUT-based architectures on widely used benchmarks, particularly for tasks involving symbolic or physical formulas, while balancing resource usage across FPGA hardware. Finally, we showcase the versatility of the framework by extending it to real-time, power-efficient control systems.

    https://arxiv.org/abs/2512.12850


    MPC-Guided Safe Reinforcement Learning and Lipschitz-Based Filtering for Structured Nonlinear Systems

    oai:arXiv.org:2512.12855v1

    arXiv:2512.12855v1 Announce Type: new Abstract: Modern engineering systems, such as autonomous vehicles, flexible robotics, and intelligent aerospace platforms, require controllers that are robust to uncertainties, adaptive to environmental changes, and safety-aware under real-time constraints. RL offers powerful data-driven adaptability for systems with nonlinear dynamics that interact with uncertain environments. RL, however, lacks built-in mechanisms for dynamic constraint satisfaction during exploration. MPC offers structured constraint handling and robustness, but its reliance on accurate models and computationally demanding online optimization may pose significant challenges. This paper proposes an integrated MPC-RL framework that combines stability and safety guarantees of MPC with the adaptability of RL. During training, MPC defines safe control bounds that guide the RL component and that enable constraint-aware policy learning. At deployment, the learned policy operates in real time with a lightweight safety filter based on Lipschitz continuity to ensure constraint satisfaction without heavy online optimizations. The approach, which is validated on a nonlinear aeroelastic wing system, demonstrates improved disturbance rejection, reduced actuator effort, and robust performance under turbulence. The architecture generalizes to other domains with structured nonlinearities and bounded disturbances, offering a scalable solution for safe artificial-intelligence-driven control in engineering applications.

    https://arxiv.org/abs/2512.12855


    Forgetful but Faithful: A Cognitive Memory Architecture and Benchmark for Privacy-Aware Generative Agents

    oai:arXiv.org:2512.12856v1

    arXiv:2512.12856v1 Announce Type: new Abstract: As generative agents become increasingly sophisticated and deployed in long-term interactive scenarios, their memory management capabilities emerge as a critical bottleneck for both performance and privacy. Current approaches either maintain unlimited memory stores, leading to computational intractability and privacy concerns, or employ simplistic forgetting mechanisms that compromise agent coherence and functionality. This paper introduces the Memory-Aware Retention Schema (MaRS), a novel framework for human-centered memory management in generative agents, coupled with six theoretically-grounded forgetting policies that balance performance, privacy, and computational efficiency. We present the Forgetful but Faithful Agent (FiFA) benchmark, a comprehensive evaluation framework that assesses agent performance across narrative coherence, goal completion, social recall accuracy, privacy preservation, and cost efficiency. Through extensive experimentation involving 300 evaluation runs across multiple memory budgets and agent configurations, we demonstrate that our hybrid forgetting policy achieves superior performance (composite score: 0.911) while maintaining computational tractability and privacy guarantees. Our work establishes new benchmarks for memory-budgeted agent evaluation and provides practical guidelines for deploying generative agents in resource-constrained, privacy-sensitive environments. The theoretical foundations, implementation framework, and empirical results contribute to the emerging field of human-centered AI by addressing fundamental challenges in agent memory management that directly impact user trust, system scalability, and regulatory compliance.

    https://arxiv.org/abs/2512.12856


    Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

    oai:arXiv.org:2512.12858v1

    arXiv:2512.12858v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios-such as HR onboarding, customer support, or policy disclosure-require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.

    https://arxiv.org/abs/2512.12858


    Learning with Structure: Computing Consistent Subsets on Structurally-Regular Graphs

    oai:arXiv.org:2512.12860v1

    arXiv:2512.12860v1 Announce Type: new Abstract: The Minimum Consistent Subset (MCS) problem arises naturally in the context of supervised clustering and instance selection. In supervised clustering, one aims to infer a meaningful partitioning of data using a small labeled subset. However, the sheer volume of training data in modern applications poses a significant computational challenge. The MCS problem formalizes this goal: given a labeled dataset $\mathcal{X}$ in a metric space, the task is to compute a smallest subset $S \subseteq \mathcal{X}$ such that every point in $\mathcal{X}$ shares its label with at least one of its nearest neighbors in $S$. Recently, the MCS problem has been extended to graph metrics, where distances are defined by shortest paths. Prior work has shown that MCS remains NP-hard even on simple graph classes like trees, though an algorithm with runtime $\mathcal{O}(2^{6c} \cdot n^6)$ is known for trees, where $c$ is the number of colors and $n$ the number of vertices. This raises the challenge of identifying graph classes that admit algorithms efficient in both $n$ and $c$. In this work, we study the Minimum Consistent Subset problem on graphs, focusing on two well-established measures: the vertex cover number ($vc$) and the neighborhood diversity ($nd$). We develop an algorithm with running time $vc^{\mathcal{O}(vc)}\cdot\text{Poly}(n,c)$, and another algorithm with runtime $nd^{\mathcal{O}(nd)}\cdot\text{Poly}(n,c)$. In the language of parameterized complexity, this implies that MCS is fixed-parameter tractable (FPT) parameterized by the vertex cover number and the neighborhood diversity. Notably, our algorithms remain efficient for arbitrarily many colors, as their complexity is polynomially dependent on the number of colors.

    https://arxiv.org/abs/2512.12860


    OptiWing3D: A Diverse Dataset of Optimized Wing Designs

    oai:arXiv.org:2512.12867v1

    arXiv:2512.12867v1 Announce Type: new Abstract: OptiWing3D is the first publicly available dataset of high-fidelity shape optimized 3D wing geometries. Existing aerodynamics datasets are either limited to 2D simulations, lack optimization, or derive diversity solely from perturbations to a single baseline design, constraining their application as benchmarks to inverse design approaches and in the study of design diversity. The OptiWing3D dataset addresses these gaps, consisting of 1552 simulations resulting in 776 wing designs initialized from distinct extruded airfoil cross-sections. Additionally, a majority of the optimized wings in the dataset are paired to 2D counterparts optimized under identical conditions, creating the first multi-fidelity aerodynamic shape optimization dataset. Moreover, this structure allows for a direct comparison between 2D and 3D aerodynamic simulations. It is observed that 3D optimized designs diverge most prominently from the 2D-optimized designs near the wingtip, where three-dimensional effects are strongest, a finding made possible by the paired nature of the dataset. Finally, we demonstrate a constraint-aware conditional latent diffusion model capable of generating optimized wings from flow conditions, establishing a baseline for future inverse design approaches. The dataset, containing wing geometries and surface pressure distributions is publicly released to advance research in data-driven aerodynamic design.

    https://arxiv.org/abs/2512.12867


    Counting Clues: A Lightweight Probabilistic Baseline Can Match an LLM

    oai:arXiv.org:2512.12868v1

    arXiv:2512.12868v1 Announce Type: new Abstract: Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.

    https://arxiv.org/abs/2512.12868


    ERA-IT: Aligning Semantic Models with Revealed Economic Preference for Real-Time and Explainable Patent Valuation

    oai:arXiv.org:2512.12869v1

    arXiv:2512.12869v1 Announce Type: new Abstract: Valuing intangible assets under uncertainty remains a critical challenge in the strategic management of technological innovation due to the information asymmetry inherent in high-dimensional technical specifications. Traditional bibliometric indicators, such as citation counts, fail to address this friction in a timely manner due to the systemic latency inherent in data accumulation. To bridge this gap, this study proposes the Economic Reasoning Alignment via Instruction Tuning (ERA-IT) framework. We theoretically conceptualize patent renewal history as a revealed economic preference and leverage it as an objective supervisory signal to align the generative reasoning of Large Language Models (LLMs) with market realities, a process we term Eco-Semantic Alignment. Using a randomly sampled dataset of 10,000 European Patent Office patents across diverse technological domains, we trained the model not only to predict value tiers but also to reverse-engineer the Economic Chain-of-Thought from unstructured text. Empirical results demonstrate that ERA-IT significantly outperforms both conventional econometric models and zero-shot LLMs in predictive accuracy. More importantly, by generating explicit, logically grounded rationales for valuation, the framework serves as a transparent cognitive scaffold for decision-makers, reducing the opacity of black-box AI in high-stakes intellectual property management.

    https://arxiv.org/abs/2512.12869


    Optimal Labeler Assignment and Sampling for Active Learning in the Presence of Imperfect Labels

    oai:arXiv.org:2512.12870v1

    arXiv:2512.12870v1 Announce Type: new Abstract: Active Learning (AL) has garnered significant interest across various application domains where labeling training data is costly. AL provides a framework that helps practitioners query informative samples for annotation by oracles (labelers). However, these labels often contain noise due to varying levels of labeler accuracy. Additionally, uncertain samples are more prone to receiving incorrect labels because of their complexity. Learning from imperfectly labeled data leads to an inaccurate classifier. We propose a novel AL framework to construct a robust classification model by minimizing noise levels. Our approach includes an assignment model that optimally assigns query points to labelers, aiming to minimize the maximum possible noise within each cycle. Additionally, we introduce a new sampling method to identify the best query points, reducing the impact of label noise on classifier performance. Our experiments demonstrate that our approach significantly improves classification performance compared to several benchmark methods.

    https://arxiv.org/abs/2512.12870


    CapOptix: An Options-Framework for Capacity Market Pricing

    oai:arXiv.org:2512.12871v1

    arXiv:2512.12871v1 Announce Type: new Abstract: Electricity markets are under increasing pressure to maintain reliability amidst rising renewable penetration, demand variability, and occasional price shocks. Traditional capacity market designs often fall short in addressing this by relying on expected-value metrics of energy unserved, which overlook risk exposure in such systems. In this work, we present CapOptix, a capacity pricing framework that interprets capacity commitments as reliability options, i.e., financial derivatives of wholesale electricity prices. CapOptix characterizes the capacity premia charged by accounting for structural price shifts modeled by the Markov Regime Switching Process. We apply the framework to historical price data from multiple electricity markets and compare the resulting premium ranges with existing capacity remuneration mechanisms.

    https://arxiv.org/abs/2512.12871


    Heavy-Duty Electric Vehicles Contribution for Frequency Response in Power Systems with V2G

    oai:arXiv.org:2512.12872v1

    arXiv:2512.12872v1 Announce Type: new Abstract: The integration of heavy-duty electric vehicles (EVs) with Vehicle-to-Grid (V2G) capability offers a promising solution to enhance grid stability by providing primary frequency response in power systems. This paper investigates the potential of heavy-duty EVs to support the California power grid under different charging strategies: immediate, delayed, and constant minimum power charging. Simulation results demonstrate that both V2G-capable EVs and non-V2G modes have great potential to provide primary frequency response, with V2G-capable EVs exhibiting especially strong contributions. The study highlights the influence of charging strategies, control modes, and grid conditions on EV contributions to grid stability, emphasizing their critical role in mitigating the adverse effects of renewable energy penetration.

    https://arxiv.org/abs/2512.12872


    Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

    oai:arXiv.org:2512.12875v1

    arXiv:2512.12875v1 Announce Type: new Abstract: Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.

    https://arxiv.org/abs/2512.12875


    Improving Recursive Transformers with Mixture of LoRAs

    oai:arXiv.org:2512.12880v1

    arXiv:2512.12880v1 Announce Type: new Abstract: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.

    https://arxiv.org/abs/2512.12880


    Unsupervised learning of multiscale switching dynamical system models from multimodal neural data

    oai:arXiv.org:2512.12881v1

    arXiv:2512.12881v1 Announce Type: new Abstract: Neural population activity often exhibits regime-dependent non-stationarity in the form of switching dynamics. Learning accurate switching dynamical system models can reveal how behavior is encoded in neural activity. Existing switching approaches have primarily focused on learning models from a single neural modality, either continuous Gaussian signals or discrete Poisson signals. However, multiple neural modalities are often recorded simultaneously to measure different spatiotemporal scales of brain activity, and all these modalities can encode behavior. Moreover, regime labels are typically unavailable in training data, posing a significant challenge for learning models of regime-dependent switching dynamics. To address these challenges, we develop a novel unsupervised learning algorithm that learns the parameters of switching multiscale dynamical system models using only multiscale neural observations. We demonstrate our method using both simulations and two distinct experimental datasets with multimodal spike-LFP observations during different motor tasks. We find that our switching multiscale dynamical system models more accurately decode behavior than switching single-scale dynamical models, showing the success of multiscale neural fusion. Further, our models outperform stationary multiscale models, illustrating the importance of tracking regime-dependent non-stationarity in multimodal neural data. The developed unsupervised learning framework enables more accurate modeling of complex multiscale neural dynamics by leveraging information in multimodal recordings while incorporating regime switches. This approach holds promise for improving the performance and robustness of brain-computer interfaces over time and for advancing our understanding of the neural basis of behavior.

    https://arxiv.org/abs/2512.12881


    On the embedding transformation for optimal control of multi-mode switched systems

    oai:arXiv.org:2512.12883v1

    arXiv:2512.12883v1 Announce Type: new Abstract: This paper develops an embedding-based approach to solve switched optimal control problems (SOCPs) with an arbitrary number of subsystems. Initially, the discrete switching signal is represented by a set of binary variables, encoding each mode in binary format. An embedded optimal control problem (EOCP) is then formulated by replacing these binary variables with continuous embedded variables that can take intermediate values between zero and one. Although embedding allows SOCPs to be addressed using conventional techniques, the optimal solutions of EOCPs often yield intermediate values for binary variables, which may not be feasible for the original SOCP. To address this challenge, a modified EOCP (MEOCP) is introduced by adding a concave auxiliary cost function of appropriate dimensionality to the main cost function. This addition ensures that the optimal solution of the EOCP is bang-bang, and as a result, feasible for the original SOCP.

    https://arxiv.org/abs/2512.12883


    Cross-Level Sensor Fusion with Object Lists via Transformer for 3D Object Detection

    oai:arXiv.org:2512.12884v1

    arXiv:2512.12884v1 Announce Type: new Abstract: In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.

    https://arxiv.org/abs/2512.12884


    SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition

    oai:arXiv.org:2512.12885v1

    arXiv:2512.12885v1 Announce Type: new Abstract: Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.

    https://arxiv.org/abs/2512.12885


    Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

    oai:arXiv.org:2512.12887v1

    arXiv:2512.12887v1 Announce Type: new Abstract: 3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

    https://arxiv.org/abs/2512.12887


    Distillation of Discrete Diffusion by Exact Conditional Distribution Matching

    oai:arXiv.org:2512.12889v1

    arXiv:2512.12889v1 Announce Type: new Abstract: Discrete diffusion models (DDMs) are a powerful class of generative models for categorical data, but they typically require many function evaluations for a single sample, making inference expensive. Existing acceleration methods either rely on approximate simulators, such as $\tau$-leaping, or on distillation schemes that train new student models and auxiliary networks with proxy objectives. We propose a simple and principled distillation alternative based on \emph{conditional distribution matching}. Our key observation is that the reverse conditional distribution of clean data given a noisy state, $p_{0\mid t}(x_0 \mid x_t)$, admits a Markov decomposition through intermediate times and can be recovered from marginal density ratios and the known forward CTMC kernel. We exploit this structure to define distillation objectives that directly match conditional distributions between a pre-trained teacher and a low-NFE student, both for one-step and few-step samplers.

    https://arxiv.org/abs/2512.12889


    Tangible Intangibles: Exploring Embodied Emotion in Mixed Reality for Art Therapy

    oai:arXiv.org:2512.12891v1

    arXiv:2512.12891v1 Announce Type: new Abstract: This in-person studio explores how mixed reality (MR) and biometrics can make intangible emotional states tangible through embodied art practices. We begin with two well-established modalities, clay sculpting and free-form 2D drawing, to ground participants in somatic awareness and manual, reflective expression. Building on this baseline, we introduce an MR prototype that maps physiological signals (e.g., breath, heart rate variability, eye movement dynamics) to visual and spatial parameters (color saturation, pulsing, motion qualities), generating ''3D emotional artifacts.'' The full-day program balances theory (somatic psychology, embodied cognition, expressive biosignals), hands-on making, and comparative reflection to interrogate what analog and digital modalities respectively afford for awareness, expression, and meaning-making. Participants will (1) experience and compare analog and MR-based journaling of emotion; (2) prototype and critique mappings from biosignals to visual/spatial feedback; and (3) articulate design principles for trauma-informed, hybrid workflows that amplify interoceptive literacy without overwhelming the user. The expected contributions include a shared design vocabulary for biometric expressivity, a set of generative constraints for future TEI work on emotional archiving, and actionable insights into when automated translation supports or hinders embodied connection.

    https://arxiv.org/abs/2512.12891


    Wait, Wait, Wait... Why Do Reasoning Models Loop?

    oai:arXiv.org:2512.12895v1

    arXiv:2512.12895v1 Announce Type: new Abstract: Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.

    https://arxiv.org/abs/2512.12895


    Probability Estimation for Predicted-Occupancy Grids in Vehicle Safety Applications Based on Machine Learning

    oai:arXiv.org:2512.12896v1

    arXiv:2512.12896v1 Announce Type: new Abstract: This paper presents a method to predict the evolution of a complex traffic scenario with multiple objects. The current state of the scenario is assumed to be known from sensors and the prediction is taking into account various hypotheses about the behavior of traffic participants. This way, the uncertainties regarding the behavior of traffic participants can be modelled in detail. In the first part of this paper a model-based approach is presented to compute Predicted-Occupancy Grids (POG), which are introduced as a grid-based probabilistic representation of the future scenario hypotheses. However, due to the large number of possible trajectories for each traffic participant, the model-based approach comes with a very high computational load. Thus, a machine-learning approach is adopted for the computation of POGs. This work uses a novel grid-based representation of the current state of the traffic scenario and performs the mapping to POGs. This representation consists of augmented cells in an occupancy grid. The adopted machine-learning approach is based on the Random Forest algorithm. Simulations of traffic scenarios are performed to compare the machine-learning with the model-based approach. The results are promising and could enable the real-time computation of POGs for vehicle safety applications. With this detailed modelling of uncertainties, crucial components in vehicle safety systems like criticality estimation and trajectory planning can be improved.

    https://arxiv.org/abs/2512.12896


    Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution

    oai:arXiv.org:2512.12898v1

    arXiv:2512.12898v1 Announce Type: new Abstract: Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.

    https://arxiv.org/abs/2512.12898


    Sub-$n^k$ Deterministic algorithm for minimum $k$-way cut in simple graphs

    oai:arXiv.org:2512.12900v1

    arXiv:2512.12900v1 Announce Type: new Abstract: We present a \emph{deterministic exact algorithm} for the \emph{minimum $k$-cut problem} on simple graphs. Our approach combines the \emph{principal sequence of partitions (PSP)}, derived canonically from ideal loads, with a single level of \emph{Kawarabayashi--Thorup (KT)} contractions at the critical PSP threshold~$\lambda_j$. Let $j$ be the smallest index with $\kappa(P_j)\ge k$ and $R := k - \kappa(P_{j-1})$. We prove a structural decomposition theorem showing that an optimal $k$-cut can be expressed as the level-$(j\!-\!1)$ boundary $A_{\le j-1}$ together with exactly $(R-r)$ \emph{non-trivial} internal cuts of value at most~$\lambda_j$ and $r$ \emph{singleton isolations} (``islands'') inside the parts of~$P_{j-1}$. At this level, KT contractions yield kernels of total size $\widetilde{O}(n / \lambda_j)$, and from them we build a \emph{canonical border family}~$\mathcal{B}$ of the same order that deterministically covers all optimal refinement choices. Branching only over~$\mathcal{B}$ (and also including an explicit ``island'' branch) gives total running time $$ T(n,m,k) = \widetilde{O}\left(\mathrm{poly}(m)+\Bigl(\tfrac{n}{\lambda_j}+n^{\omega/3}\Bigr)^{R}\right), $$ where $\omega < 2.373$ is the matrix multiplication exponent. In particular, if $\lambda_j \ge n^{\varepsilon}$ for some constant $\varepsilon > 0$, we obtain a \emph{deterministic sub-$n^k$-time algorithm}, running in $n^{(1-\varepsilon)(k-1)+o(k)}$ time. Finally, combining our PSP$\times$KT framework with a small-$\lambda$ exact subroutine via a simple meta-reduction yields a deterministic $n^{c k+O(1)}$ algorithm for $c = \max\{ t/(t+1), \omega/3 \} < 1$, aligning with the exponent in the randomized bound of He--Li (STOC~2022) under the assumed subroutine.

    https://arxiv.org/abs/2512.12900


    Predicted-occupancy grids for vehicle safety applications based on autoencoders and the Random Forest algorithm

    oai:arXiv.org:2512.12901v1

    arXiv:2512.12901v1 Announce Type: new Abstract: In this paper, a probabilistic space-time representation of complex traffic scenarios is predicted using machine learning algorithms. Such a representation is significant for all active vehicle safety applications especially when performing dynamic maneuvers in a complex traffic scenario. As a first step, a hierarchical situation classifier is used to distinguish the different types of traffic scenarios. This classifier is responsible for identifying the type of the road infrastructure and the safety-relevant traffic participants of the driving environment. With each class representing similar traffic scenarios, a set of Random Forests (RFs) is individually trained to predict the probabilistic space-time representation, which depicts the future behavior of traffic participants. This representation is termed as a Predicted-Occupancy Grid (POG). The input to the RFs is an Augmented Occupancy Grid (AOG). In order to increase the learning accuracy of the RFs and to perform better predictions, the AOG is reduced to low-dimensional features using a Stacked Denoising Autoencoder (SDA). The excellent performance of the proposed machine learning approach consisting of SDAs and RFs is demonstrated in simulations and in experiments with real vehicles. An application of POGs to estimate the criticality of traffic scenarios and to determine safe trajectories is also presented.

    https://arxiv.org/abs/2512.12901


    Next-generation reservoir computing validated by classification task

    oai:arXiv.org:2512.12903v1

    arXiv:2512.12903v1 Announce Type: new Abstract: An emerging computing paradigm, so-called next-generation reservoir computing (NG-RC) is investigated. True to its namesake, NG-RC requires no actual reservoirs for input data mixing but rather computing the polynomial terms directly from the time series inputs. However, benchmark tests so far reported have been one-sided, limited to prediction tasks of temporal waveforms such as Lorenz 63 attractor and Mackey-Glass chaotic signal. We will demonstrate for the first time that NG-RC can perform classification task as good as conventional RC. This validates the versatile computational capability of NG-RC in tasks of both prediction and classification.

    https://arxiv.org/abs/2512.12903


    OptHQC: Optimize HQC for High-Performance Post-Quantum Cryptography

    oai:arXiv.org:2512.12904v1

    arXiv:2512.12904v1 Announce Type: new Abstract: As post-quantum cryptography (PQC) becomes increasingly critical for securing future communication systems, the performance overhead introduced by quantum-resistant algorithms presents a major computing challenge. HQC (Hamming Quasi-Cyclic) is a newly standardized code-based PQC scheme designed to replace classical key exchange methods. In this paper, we propose OptHQC, an optimized implementation of the HQC scheme to deliver high-performance cryptographic operations. Our approach provides a comprehensive analysis of each computational blocks in HQC and introduces optimizations across all three stages: key generation, encryption, and decryption. We first exploit data-level sparsity in vector multiplication to accelerate polynomial operations during vector generation. We then leverage instruction-level acceleration (e.g., AVX2) in hash computation to further improve performance. Last, we transform multiplication into lookup table indexing and optimize memory access patterns in syndrome computation and error vector recovery, which are the most computationally intensive operations in HQC. Overall, OptHQC achieves an average 55% speedup over the reference HQC implementation on CPU.

    https://arxiv.org/abs/2512.12904


    Predictive Sample Assignment for Semantically Coherent Out-of-Distribution Detection

    oai:arXiv.org:2512.12906v1

    arXiv:2512.12906v1 Announce Type: new Abstract: Semantically coherent out-of-distribution detection (SCOOD) is a recently proposed realistic OOD detection setting: given labeled in-distribution (ID) data and mixed in-distribution and out-of-distribution unlabeled data as the training data, SCOOD aims to enable the trained model to accurately identify OOD samples in the testing data. Current SCOOD methods mainly adopt various clustering-based in-distribution sample filtering (IDF) strategies to select clean ID samples from unlabeled data, and take the remaining samples as auxiliary OOD data, which inevitably introduces a large number of noisy samples in training. To address the above issue, we propose a concise SCOOD framework based on predictive sample assignment (PSA). PSA includes a dual-threshold ternary sample assignment strategy based on the predictive energy score that can significantly improve the purity of the selected ID and OOD sample sets by assigning unconfident unlabeled data to an additional discard sample set, and a concept contrastive representation learning loss to further expand the distance between ID and OOD samples in the representation space to assist ID/OOD discrimination. In addition, we also introduce a retraining strategy to help the model fully fit the selected auxiliary ID/OOD samples. Experiments on two standard SCOOD benchmarks demonstrate that our approach outperforms the state-of-the-art methods by a significant margin.

    https://arxiv.org/abs/2512.12906


    Machine Learning Architectures for the Estimation of Predicted Occupancy Grids in Road Traffic

    oai:arXiv.org:2512.12907v1

    arXiv:2512.12907v1 Announce Type: new Abstract: This paper introduces a novel machine learning architecture for an efficient estimation of the probabilistic space-time representation of complex traffic scenarios. A detailed representation of the future traffic scenario is of significant importance for autonomous driving and for all active safety systems. In order to predict the future space-time representation of the traffic scenario, first the type of traffic scenario is identified and then the machine learning algorithm maps the current state of the scenario to possible future states. The input to the machine learning algorithms is the current state representation of a traffic scenario, termed as the Augmented Occupancy Grid (AOG). The output is the probabilistic space-time representation which includes uncertainties regarding the behaviour of the traffic participants and is termed as the Predicted Occupancy Grid (POG). The novel architecture consists of two Stacked Denoising Autoencoders (SDAs) and a set of Random Forests. It is then compared with the other two existing architectures that comprise of SDAs and DeconvNet. The architectures are validated with the help of simulations and the comparisons are made both in terms of accuracy and computational time. Also, a brief overview on the applications of POGs in the field of active safety is presented.

    https://arxiv.org/abs/2512.12907


    A Direct Second-Order Method for Solving Two-Player Zero-Sum Games

    oai:arXiv.org:2512.12910v1

    arXiv:2512.12910v1 Announce Type: new Abstract: We introduce, to our knowledge, the first direct second-order method for computing Nash equilibria in two-player zero-sum games. To do so, we construct a Douglas-Rachford-style splitting formulation, which we then solve with a semi-smooth Newton (SSN) method. We show that our algorithm enjoys local superlinear convergence. In order to augment the fast local behavior of our SSN method with global efficiency guarantees, we develop a hybrid method that combines our SSN method with the state-of-the-art first-order method for game solving, Predictive Regret Matching$^+$ (PRM$^+$). Our hybrid algorithm leverages the global progress provided by PRM$^+$, while achieving a local superlinear convergence rate once it switches to SSN near a Nash equilibrium. Numerical experiments on matrix games demonstrate order-of-magnitude speedups over PRM$^+$ for high-precision solutions.

    https://arxiv.org/abs/2512.12910


    CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLMs

    oai:arXiv.org:2512.12914v1

    arXiv:2512.12914v1 Announce Type: new Abstract: Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done through proprietary datasets that may contain sensitive information. Owners expect their fine-tuned model to not inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from fine-tuned models on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just like safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.

    https://arxiv.org/abs/2512.12914


    Efficient Quantum-resistant Delegable Data Analysis Scheme with Revocation and Keyword Search in Mobile Cloud Computing

    oai:arXiv.org:2512.12917v1

    arXiv:2512.12917v1 Announce Type: new Abstract: With the rapid growth of smart devices and mobile internet, large-scale data processing is becoming increasingly important, while mobile devices remain resource-constrained. Mobile Cloud Computing (MCC) addresses this limitation by offloading tasks to the cloud. Nevertheless, the widespread adoption of MCC also raises challenges such as data privacy, selective computation, efficient revocation, and keyword search. Additionally, the development of quantum computers also threatens data security in MCC. To address these challenges, we propose an efficient quantum-resistant delegable data analysis scheme with revocation and keyword search (EQDDA-RKS) for MCC. In the proposed scheme, an authorised mobile device can perform keyword searches and compute inner product values over encrypted data without disclosing any additional information. Meanwhile, if a user's function key is compromised, it can be revoked. To alleviate the burden on mobile devices, most of the computation which should be executed by the mobile device is outsourced to a cloud server, and the mobile device only needs to interact with a central authority once. Furthermore, an authorised mobile device can temporarily delegate its keyword search and function computation rights to a delegatee in case the device becomes unavailable due to power depletion, going offline, etc. Our scheme is formally proven secure in the standard model against quantum attacks, chosen plaintext attacks, chosen keyword attacks, and outside keyword guessing attacks. Furthermore, the analysis demonstrates that the number of interactions between a mobile device and the central authority is $O(1)$ in our scheme, rather than growing linearly with the number of functions, which is well-suited for MCC scenarios.

    https://arxiv.org/abs/2512.12917


    Satisfiability Modulo Theory Meets Inductive Logic Programming

    oai:arXiv.org:2512.12918v1

    arXiv:2512.12918v1 Announce Type: new Abstract: Inductive Logic Programming (ILP) provides interpretable rule learning in relational domains, yet remains limited in its ability to induce and reason with numerical constraints. Classical ILP systems operate over discrete predicates and typically rely on discretisation or hand-crafted numerical predicates, making it difficult to infer thresholds or arithmetic relations that must hold jointly across examples. Recent work has begun to address these limitations through tighter integrations of ILP with Satisfiability Modulo Theories (SMT) or specialised numerical inference mechanisms. In this paper we investigate a modular alternative that couples the ILP system PyGol with the SMT solver Z3. Candidate clauses proposed by PyGol are interpreted as quantifier-free formulas over background theories such as linear or nonlinear real arithmetic, allowing numerical parameters to be instantiated and verified by the SMT solver while preserving ILP's declarative relational bias. This supports the induction of hybrid rules that combine symbolic predicates with learned numerical constraints, including thresholds, intervals, and multi-literal arithmetic relations. We formalise this SMT-ILP setting and evaluate it on a suite of synthetic datasets designed to probe linear, relational, nonlinear, and multi-hop reasoning. The results illustrate how a modular SMT-ILP architecture can extend the expressivity of symbolic rule learning, complementing prior numerical ILP approaches while providing a flexible basis for future extensions toward richer theory-aware induction.

    https://arxiv.org/abs/2512.12918


    Open Source Software and Data for Human Service Development: A Case Study on Predicting Housing Instability

    oai:arXiv.org:2512.12919v1

    arXiv:2512.12919v1 Announce Type: new Abstract: Open-source data and tools are lauded as essential for replicable and usable social science, though little is known about their use in resource constrained human service provision. This paper examines the challenges and opportunities of open-source tools and data in human service development by using both to forecast failure to pay eviction filings in Bronx County, NY. We use zip code level data from the Housing Data Coalition, the American Community Survey 5-year estimates, and DeepMaps Model of the Labor Force to forecast rates through July 2021. We employ multilevel (MLM) and exponential smoothing (ETS) models using the R project for Statistical Computing, an oft used open-source statistical software. We compare our results to what happened during the same period, to illustrate the efficacy of the open-source tools and techniques employed. We argue open-source data and software may facilitate rapid analysis of public data - a much-needed ability in human service intervention development under increasingly constrained resources - but find public data are limited by the information they reliably capture, limiting their utility by a non-trivial margin of error. The manuscript concludes by considering lessons for human service organizations with limited analytical resources and a vested interest in low-resourced communities.

    https://arxiv.org/abs/2512.12919


    Cisco Integrated AI Security and Safety Framework Report

    oai:arXiv.org:2512.12921v1

    arXiv:2512.12921v1 Announce Type: new Abstract: Artificial intelligence (AI) systems are being readily and rapidly adopted, increasingly permeating critical domains: from consumer platforms and enterprise software to networked systems with embedded agents. While this has unlocked potential for human productivity gains, the attack surface has expanded accordingly: threats now span content safety failures (e.g., harmful or deceptive outputs), model and data integrity compromise (e.g., poisoning, supply-chain tampering), runtime manipulations (e.g., prompt injection, tool and agent misuse), and ecosystem risks (e.g., orchestration abuse, multi-agent collusion). Existing frameworks such as MITRE ATLAS, National Institute of Standards and Technology (NIST) AI 100-2 Adversarial Machine Learning (AML) taxonomy, and OWASP Top 10s for Large Language Models (LLMs) and Agentic AI Applications provide valuable viewpoints, but each covers only slices of this multi-dimensional space. This paper presents Cisco's Integrated AI Security and Safety Framework ("AI Security Framework"), a unified, lifecycle-aware taxonomy and operationalization framework that can be used to classify, integrate, and operationalize the full range of AI risks. It integrates AI security and AI safety across modalities, agents, pipelines, and the broader ecosystem. The AI Security Framework is designed to be practical for threat identification, red-teaming, risk prioritization, and it is comprehensive in scope and can be extensible to emerging deployments in multimodal contexts, humanoids, wearables, and sensory infrastructures. We analyze gaps in prevailing frameworks, discuss design principles for our framework, and demonstrate how the taxonomy provides structure for understanding how modern AI systems fail, how adversaries exploit these failures, and how organizations can build defenses across the AI lifecycle that evolve alongside capability advancements.

    https://arxiv.org/abs/2512.12921


    LLM-based Personalized Portfolio Recommender: Integrating Large Language Models and Reinforcement Learning for Intelligent Investment Strategy Optimization

    oai:arXiv.org:2512.12922v1

    arXiv:2512.12922v1 Announce Type: new Abstract: In modern financial markets, investors increasingly seek personalized and adaptive portfolio strategies that reflect their individual risk preferences and respond to dynamic market conditions. Traditional rule-based or static optimization approaches often fail to capture the nonlinear interactions among investor behavior, market volatility, and evolving financial objectives. To address these limitations, this paper introduces the LLM-based Personalized Portfolio Recommender , an integrated framework that combines Large Language Models, reinforcement learning, and individualized risk preference modeling to support intelligent investment decision-making.

    https://arxiv.org/abs/2512.12922


    Information-Optimal Formation Geometry Design for Multimodal UAV Cooperative Perception

    oai:arXiv.org:2512.12923v1

    arXiv:2512.12923v1 Announce Type: new Abstract: The efficacy of UAV swarm cooperative perception fundamentally depends on three-dimensional (3D) formation geometry, which governs target observability and sensor complementarity. In the literature, the exploitation of formation geometry and its impact on UAV sensing have rarely been studied, which can significantly degrade multimodal cooperative perception at scenarios where heterogeneous payloads (vision cameras and LiDAR) should be geometrically arranged to exploit their complementary strengths while managing communication interference and hardware budgets. To bridge this critical gap, we propose an information-theoretic optimization framework that allocation of UAVs and multimodal sensors, configures formation geometries, and flight control. The UAV-sensor allocation is optimized by the Fisher Information Matrix (FIM) determinant maximization. Under this framework we introduce an equivalent formation transition strategy that enhances field-of-view (FOV) coverage without compromising perception accuracy and communication interference. Furthermore, we design a novel Lyapunov-stable flight control scheme with logarithmic potential fields to generate energy-efficient trajectories for formation transitions. Extensive simulations demonstrate our formation-aware design achieves 25.0\% improvement in FOV coverage, 104.2\% enhancement in communication signal strength, and 47.2\% reduction in energy consumption compared to conventional benchmarks. This work establishes that task-driven geometric configuration represents a foundational rather than incidental component in next-generation UAV swarm systems.

    https://arxiv.org/abs/2512.12923


    Sharpness-aware Dynamic Anchor Selection for Generalized Category Discovery

    oai:arXiv.org:2512.12925v1

    arXiv:2512.12925v1 Announce Type: new Abstract: Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision information for the other view. However, large pre-trained models have a preference for some specific visual patterns, resulting in encoding spurious correlation for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method, which contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppressing the encoding of trivial features, thereby reducing overfitting of noise samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during the model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn more accurate feature distribution for the unknown classes, thus further improving the clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels, and achieve state-of-the-art results on multiple GCD benchmarks.

    https://arxiv.org/abs/2512.12925


    PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving

    oai:arXiv.org:2512.12928v1

    arXiv:2512.12928v1 Announce Type: new Abstract: The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand higher performance guarantees, as fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first \textit{formalize multi-priority request scheduling as a service gain maximization problem}, where satisfying latency requirements for requests of different priorities contributes varying levels of gain. We then propose PROSERVE, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine level, SlideBatching dynamically adapts batch formation and request ordering under varying load conditions, employing a sliding boundary mechanism to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation across four open-source datasets and a real-world industrial trace demonstrates that \systemname{} consistently outperforms state-of-the-art baselines, improving system gain by up to 35% and boosting SLO attainment by up to 52%.

    https://arxiv.org/abs/2512.12928


    MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation

    oai:arXiv.org:2512.12929v1

    arXiv:2512.12929v1 Announce Type: new Abstract: The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.

    https://arxiv.org/abs/2512.12929


    SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision

    oai:arXiv.org:2512.12930v1

    arXiv:2512.12930v1 Announce Type: new Abstract: Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8TOPS/W, surpassing conventional designs, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks.

    https://arxiv.org/abs/2512.12930


    Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

    oai:arXiv.org:2512.12932v1

    arXiv:2512.12932v1 Announce Type: new Abstract: Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.

    https://arxiv.org/abs/2512.12932


    Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

    oai:arXiv.org:2512.12935v1

    arXiv:2512.12935v1 Announce Type: new Abstract: The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail across cross modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2 based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.

    https://arxiv.org/abs/2512.12935


    Content Adaptive based Motion Alignment Framework for Learned Video Compression

    oai:arXiv.org:2512.12936v1

    arXiv:2512.12936v1 Announce Type: new Abstract: Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.

    https://arxiv.org/abs/2512.12936


    SPAR: Session-based Pipeline for Adaptive Retrieval on Legacy File Systems

    oai:arXiv.org:2512.12938v1

    arXiv:2512.12938v1 Announce Type: new Abstract: The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.

    https://arxiv.org/abs/2512.12938


    Continuous Edit Distance, Geodesics and Barycenters of Time-varying Persistence Diagrams

    oai:arXiv.org:2512.12939v1

    arXiv:2512.12939v1 Announce Type: new Abstract: We introduce the Continuous Edit Distance (CED), a geodesic and elastic distance for time-varying persistence diagrams (TVPDs). The CED extends edit-distance ideas to TVPDs by combining local substitution costs with penalized deletions/insertions, controlled by two parameters: \(\alpha\) (trade-off between temporal misalignment and diagram discrepancy) and \(\beta\) (gap penalty). We also provide an explicit construction of CED-geodesics. Building on these ingredients, we present two practical barycenter solvers, one stochastic and one greedy, that monotonically decrease the CED Frechet energy. Empirically, the CED is robust to additive perturbations (both temporal and spatial), recovers temporal shifts, and supports temporal pattern search. On real-life datasets, the CED achieves clustering performance comparable or better than standard elastic dissimilarities, while our clustering based on CED-barycenters yields superior classification results. Overall, the CED equips TVPD analysis with a principled distance, interpretable geodesics, and practical barycenters, enabling alignment, comparison, averaging, and clustering directly in the space of TVPDs. A C++ implementation is provided for reproducibility at the following address https://github.com/sebastien-tchitchek/ContinuousEditDistance.

    https://arxiv.org/abs/2512.12939


    UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

    oai:arXiv.org:2512.12941v1

    arXiv:2512.12941v1 Announce Type: new Abstract: Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap of the feature pyramids and insufficient global-local feature integration leads to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable to exploit high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet

    https://arxiv.org/abs/2512.12941


    SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework

    oai:arXiv.org:2512.12945v1

    arXiv:2512.12945v1 Announce Type: new Abstract: This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface is available at https://github.com/umfieldrobotics/slim-vdb.

    https://arxiv.org/abs/2512.12945


    Understanding When Graph Convolutional Networks Help: A Diagnostic Study on Label Scarcity and Structural Properties

    oai:arXiv.org:2512.12947v1

    arXiv:2512.12947v1 Announce Type: new Abstract: Graph Convolutional Networks (GCNs) have become a standard approach for semi-supervised node classification, yet practitioners lack clear guidance on when GCNs provide meaningful improvements over simpler baselines. We present a diagnostic study using the Amazon Computers co-purchase data to understand when and why GCNs help. Through systematic experiments with simulated label scarcity, feature ablation, and per-class analysis, we find that GCN performance depends critically on the interaction between graph homophily and feature quality. GCNs provide the largest gains under extreme label scarcity, where they leverage neighborhood structure to compensate for limited supervision. Surprisingly, GCNs can match their original performance even when node features are replaced with random noise, suggesting that structure alone carries sufficient signal on highly homophilous graphs. However, GCNs hurt performance when homophily is low and features are already strong, as noisy neighbors corrupt good predictions. Our quadrant analysis reveals that GCNs help in three of four conditions and only hurt when low homophily meets strong features. These findings offer practical guidance for practitioners deciding whether to adopt graph-based methods.

    https://arxiv.org/abs/2512.12947


    FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection

    oai:arXiv.org:2512.12949v1

    arXiv:2512.12949v1 Announce Type: new Abstract: The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing compilers and frameworks are limited to using local scratchpad memory. When the intermediate results exceed the limited capacity (such as FFN), the fusion fails. Although modern GPUs (like the NVIDIA H100) now incorporate an inter-core connection mechanism known as Distributed Shared Memory(DSM)--providing a larger, high-bandwidth, and low-latency on-chip memory pool--this hardware potential has yet to be exploited by software frameworks. To bridge this gap, we present FlashFuser, the first compiler framework to utilize inter-core connection for kernel fusion on modern GPUs. FlashFuser extends established fusion techniques to the DSM domain through three core contributions. First, we propose a powerful DSM-based communication abstraction that formalizes complex cluster-based data exchange patterns, such as reduce, shuffle and multiply. Second, we introduce a dataflow analyzer that generalizes loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy; it determines the optimal execution order and tile sizes by quantifying data movement across memory levels. Finally, FlashFuser integrates these components into a unified search engine that employs analytical cost modeling and DSM-aware pruning strategies to efficiently discover the optimal execution plan. Our evaluation on an NVIDIA H100 GPU shows that FlashFuser reduces memory access by 58% and delivers kernel speedups of 3.3x against highly-tuned libraries and 4.1x against state-of-the-art compilers, resulting in a 1.24x end-to-end speedup.

    https://arxiv.org/abs/2512.12949


    Building from Scratch: A Multi-Agent Framework with Human-in-the-Loop for Multilingual Legal Terminology Mapping

    oai:arXiv.org:2512.12950v1

    arXiv:2512.12950v1 Announce Type: new Abstract: Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process-from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.

    https://arxiv.org/abs/2512.12950


    Database Research needs an Abstract Relational Query Language

    oai:arXiv.org:2512.12957v1

    arXiv:2512.12957v1 Announce Type: new Abstract: For decades, SQL has been the default language for composing queries, but it is increasingly used as an artifact to be read and verified rather than authored. With Large Language Models (LLMs), queries are increasingly machine-generated, while humans read, validate, and debug them. This shift turns relational query languages into interfaces for back-and-forth communication about intent, which will lead to a rethinking of relational language design, and more broadly, relational interface design. We argue that this rethinking needs support from an Abstract Relational Query Language (ARQL): a semantics-first reference metalanguage that separates query intent from user-facing syntax and makes underlying relational patterns explicit and comparable across user-facing languages. An ARQL separates a query into (i) a relational core (the compositional structure that determines intent), (ii) modalities (alternative representations of that core tailored to different audiences), and (iii) conventions (orthogonal environment-level semantic parameters under which the core is interpreted, e.g., set vs. bag semantics, or treatment of null values). Usability for humans or machines then depends less on choosing a particular language and more on choosing an appropriate modality. Comparing languages becomes a question of which relational patterns they support and what conventions they choose. We introduce Abstract Relational Calculus (ARC), a strict generalization of Tuple Relational Calculus (TRC), as a concrete instance of ARQL. ARC comes in three modalities: (i) a comprehension-style textual notation, (ii) an Abstract Language Tree (ALT) for machine reasoning about meaning, and (iii) a diagrammatic hierarchical graph (higraph) representation for humans. ARC provides the missing vocabulary and acts as a Rosetta Stone for relational querying.

    https://arxiv.org/abs/2512.12957


    3-Query RLDCs are Strictly Stronger than 3-Query LDCs

    oai:arXiv.org:2512.12960v1

    arXiv:2512.12960v1 Announce Type: new Abstract: We construct $3$-query relaxed locally decodable codes (RLDCs) with constant alphabet size and length $\tilde{O}(k^2)$ for $k$-bit messages. Combined with the lower bound of $\tilde{\Omega}(k^3)$ of [Alrabiah, Guruswami, Kothari, Manohar, STOC 2023] on the length of locally decodable codes (LDCs) with the same parameters, we obtain a separation between RLDCs and LDCs, resolving an open problem of [Ben-Sasson, Goldreich, Harsha, Sudan and Vadhan, SICOMP 2006]. Our RLDC construction relies on two components. First, we give a new construction of probabilistically checkable proofs of proximity (PCPPs) with $3$ queries, quasi-linear size, constant alphabet size, perfect completeness, and small soundness error. This improves upon all previous PCPP constructions, which either had a much higher query complexity or soundness close to $1$. Second, we give a query-preserving transformation from PCPPs to RLDCs. At the heart of our PCPP construction is a $2$-query decodable PCP (dPCP) with matching parameters, and our construction builds on the HDX-based PCP of [Bafna, Minzer, Vyas, Yun, STOC 2025] and on the efficient composition framework of [Moshkovitz, Raz, JACM 2010] and [Dinur, Harsha, SICOMP 2013]. More specifically, we first show how to use the HDX-based construction to get a dPCP with matching parameters but a large alphabet size, and then prove an appropriate composition theorem (and related transformations) to reduce the alphabet size in dPCPs.

    https://arxiv.org/abs/2512.12960


    SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

    oai:arXiv.org:2512.12963v1

    arXiv:2512.12963v1 Announce Type: new Abstract: Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least $2\times$ faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.

    https://arxiv.org/abs/2512.12963


    BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

    oai:arXiv.org:2512.12964v1

    arXiv:2512.12964v1 Announce Type: new Abstract: Multi-behavior sequential recommendation aims to capture users' dynamic interests by modeling diverse types of user interactions over time. Although several studies have explored this setting, the recommendation performance remains suboptimal, mainly due to two fundamental challenges: the heterogeneity of user behaviors and data sparsity. To address these challenges, we propose BLADE, a framework that enhances multi-behavior modeling while mitigating data sparsity. Specifically, to handle behavior heterogeneity, we introduce a dual item-behavior fusion architecture that incorporates behavior information at both the input and intermediate levels, enabling preference modeling from multiple perspectives. To mitigate data sparsity, we design three behavior-level data augmentation methods that operate directly on behavior sequences rather than core item sequences. These methods generate diverse augmented views while preserving the semantic consistency of item sequences. These augmented views further enhance representation learning and generalization via contrastive learning. Experiments on three real-world datasets demonstrate the effectiveness of our approach.

    https://arxiv.org/abs/2512.12964


    Challenges and Enablers: Remote Work for People with Disabilities in Software Development Teams

    oai:arXiv.org:2512.12965v1

    arXiv:2512.12965v1 Announce Type: new Abstract: The increasing adoption of remote and hybrid work modalities in the technology sector has brought new opportunities and challenges for the inclusion of people with disabilities (PWD) in software development teams (SDT). This study investigates how remote work affects PWDs' experience in mixed-ability SDT, focusing on the unique challenges and strategies that emerge in remote environments. We conducted an online survey with \totalSurveyResponses valid responses, encompassing PWD, their leaders, and teammates, to capture sociotechnical aspects of their experiences with remote collaboration. To deepen our understanding, we carried out 14 structured interviews with software developers who self-identified as having disabilities (six autistic individuals, six with physical disabilities, and two who are d/Deaf). Our analysis combines quantitative data with qualitative coding of open-ended survey responses and interview transcripts. The results reveal that, despite the barriers faced by team members with disabilities, their teammates and leaders have a limited perception of the daily challenges involved in sustaining collaborative remote work. These findings highlight opportunities for improvement in accessibility tools, communication strategies, and adaptive management approaches.

    https://arxiv.org/abs/2512.12965


    QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

    oai:arXiv.org:2512.12967v1

    arXiv:2512.12967v1 Announce Type: new Abstract: We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.

    https://arxiv.org/abs/2512.12967


    Towards Open Standards for Systemic Complexity in Digital Forensics

    oai:arXiv.org:2512.12970v1

    arXiv:2512.12970v1 Announce Type: new Abstract: The intersection of artificial intelligence (AI) and digital forensics (DF) is becoming increasingly complex, ubiquitous, and pervasive, with overlapping techniques and technologies being adopted in all types of scientific and technical inquiry. Despite incredible advances, forensic sciences are not exempt from errors and remain vulnerable to fallibility. To mitigate the limitations of errors in DF, the systemic complexity is identified and addressed with the adoption of human-readable artifacts and open standards. A DF AI model schema based on the state of the art is outlined.

    https://arxiv.org/abs/2512.12970


    Headroom as A Grid Service in Software-Defined Power Grids: A Peak-to-Peak Control Design Approach

    oai:arXiv.org:2512.12972v1

    arXiv:2512.12972v1 Announce Type: new Abstract: To address system frequency challenges driven by the integration of renewable generation, advanced control strategies are designed at the device level to provide effective frequency support following disturbances. However, typically relying on energy-based performance metrics, these methods cannot guarantee the system frequency constraints such as frequency nadir and maximum Rate-of-Change-of-Frequency (RoCoF). Moreover, locally-designed frequency support cannot minimize the overall system cost to maintain frequency stability. On the other hand, the concept of frequency-constrained system scheduling is introduced, which incorporates frequency dynamic constraints into the system economic optimization, so that frequency requirements can be maintained with minimum cost. However, these works rely on analytical approximations of the frequency dynamic metrics, which are mathematically complicated and tend to be over-conservative for the approximation of IBR headroom requirements.

    https://arxiv.org/abs/2512.12972


    Application of Deep Learning in Biological Data Compression

    oai:arXiv.org:2512.12975v1

    arXiv:2512.12975v1 Announce Type: new Abstract: Cryogenic electron microscopy (Cryo-EM) has become an essential tool for capturing high-resolution biological structures. Despite its advantage in visualizations, the large storage size of Cryo-EM data file poses significant challenges for researchers and educators. This paper investigates the application of deep learning, specifically implicit neural representation (INR), to compress Cryo-EM biological data. The proposed approach first extracts the binary map of each file according to the density threshold. The density map is highly repetitive, ehich can be effectively compressed by GZIP. The neural network then trains to encode spatial density information, allowing the storage of network parameters and learnable latent vectors. To improve reconstruction accuracy, I further incorporate the positional encoding to enhance spatial representation and a weighted Mean Squared Error (MSE) loss function to balance density distribution variations. Using this approach, my aim is to provide a practical and efficient biological data compression solution that can be used for educational and research purpose, while maintaining a reasonable compression ratio and reconstruction quality from file to file.

    https://arxiv.org/abs/2512.12975


    Authors Should Annotate

    oai:arXiv.org:2512.12976v1

    arXiv:2512.12976v1 Announce Type: new Abstract: The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document's source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at academic.echollm.io.

    https://arxiv.org/abs/2512.12976


    VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

    oai:arXiv.org:2512.12977v1

    arXiv:2512.12977v1 Announce Type: new Abstract: This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.

    https://arxiv.org/abs/2512.12977


    Do Reviews Matter for Recommendations in the Era of Large Language Models?

    oai:arXiv.org:2512.12978v1

    arXiv:2512.12978v1 Announce Type: new Abstract: With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, this raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. Particularly, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce a benchmarking evaluation framework for review-aware recommender systems, RAREval, to comprehensively assess the contribution of textual reviews to the recommendation performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion, as well as recommendation performance in data sparsity and cold-start user settings. Our findings demonstrate that LLMs are capable of functioning as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly in scenarios characterized by data sparsity and cold-start conditions. In addition, the removal of some or all textual reviews and random distortion does not necessarily lead to declines in recommendation accuracy. These findings motivate a rethinking of how user preference from text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.

    https://arxiv.org/abs/2512.12978


    Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views

    oai:arXiv.org:2512.12980v1

    arXiv:2512.12980v1 Announce Type: new Abstract: Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services -- from embedding lookups in large language models (LLMs), to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice. We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. From a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; (3) Data Distribution Sensitivity, highlighting index robustness across skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains such as image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation. Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them based on application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads.

    https://arxiv.org/abs/2512.12980


    CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks

    oai:arXiv.org:2512.12981v1

    arXiv:2512.12981v1 Announce Type: new Abstract: While joint pruning--quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning--quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.

    https://arxiv.org/abs/2512.12981


    Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

    oai:arXiv.org:2512.12982v1

    arXiv:2512.12982v1 Announce Type: new Abstract: The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity.To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL

    https://arxiv.org/abs/2512.12982


    VoroLight: Learning Quality Volumetric Voronoi Meshes from General Inputs

    oai:arXiv.org:2512.12984v1

    arXiv:2512.12984v1 Announce Type: new Abstract: We present VoroLight, a differentiable framework for 3D shape reconstruction based on Voronoi meshing. Our approach generates smooth, watertight surfaces and topologically consistent volumetric meshes directly from diverse inputs, including images, implicit shape level-set fields, point clouds and meshes. VoroLight operates in three stages: it first initializes a surface using a differentiable Voronoi formulation, then refines surface quality through a polygon-face sphere training stage, and finally reuses the differentiable Voronoi formulation for volumetric optimization with additional interior generator points. Project page: https://jiayinlu19960224.github.io/vorolight/

    https://arxiv.org/abs/2512.12984


    Tackling Snow-Induced Challenges: Safe Autonomous Lane-Keeping with Robust Reinforcement Learning

    oai:arXiv.org:2512.12987v1

    arXiv:2512.12987v1 Announce Type: new Abstract: This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.

    https://arxiv.org/abs/2512.12987


    Quantigence: A Multi-Agent AI Framework for Quantum Security Research

    oai:arXiv.org:2512.12989v1

    arXiv:2512.12989v1 Announce Type: new Abstract: Cryptographically Relevant Quantum Computers (CRQCs) pose a structural threat to the global digital economy. Algorithms like Shor's factoring and Grover's search threaten to dismantle the public-key infrastructure (PKI) securing sovereign communications and financial transactions. While the timeline for fault-tolerant CRQCs remains probabilistic, the "Store-Now, Decrypt-Later" (SNDL) model necessitates immediate migration to Post-Quantum Cryptography (PQC). This transition is hindered by the velocity of research, evolving NIST standards, and heterogeneous deployment environments. To address this, we present Quantigence, a theory-driven multi-agent AI framework for structured quantum-security analysis. Quantigence decomposes research objectives into specialized roles - Cryptographic Analyst, Threat Modeler, Standards Specialist, and Risk Assessor - coordinated by a supervisory agent. Using "cognitive parallelism," agents reason independently to maintain context purity while execution is serialized on resource-constrained hardware (e.g., NVIDIA RTX 2060). The framework integrates external knowledge via the Model Context Protocol (MCP) and prioritizes vulnerabilities using the Quantum-Adjusted Risk Score (QARS), a formal extension of Mosca's Theorem. Empirical validation shows Quantigence achieves a 67% reduction in research turnaround time and superior literature coverage compared to manual workflows, democratizing access to high-fidelity quantum risk assessment.

    https://arxiv.org/abs/2512.12989


    SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference

    oai:arXiv.org:2512.12990v1

    arXiv:2512.12990v1 Announce Type: new Abstract: MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables an efficient on-device MoE deployment.

    https://arxiv.org/abs/2512.12990


    Learning Terrain Aware Bipedal Locomotion via Reduced Dimensional Perceptual Representations

    oai:arXiv.org:2512.12993v1

    arXiv:2512.12993v1 Announce Type: new Abstract: This work introduces a hierarchical strategy for terrain-aware bipedal locomotion that integrates reduced-dimensional perceptual representations to enhance reinforcement learning (RL)-based high-level (HL) policies for real-time gait generation. Unlike end-to-end approaches, our framework leverages latent terrain encodings via a Convolutional Variational Autoencoder (CNN-VAE) alongside reduced-order robot dynamics, optimizing the locomotion decision process with a compact state. We systematically analyze the impact of latent space dimensionality on learning efficiency and policy robustness. Additionally, we extend our method to be history-aware, incorporating sequences of recent terrain observations into the latent representation to improve robustness. To address real-world feasibility, we introduce a distillation method to learn the latent representation directly from depth camera images and provide preliminary hardware validation by comparing simulated and real sensor data. We further validate our framework using the high-fidelity Agility Robotics (AR) simulator, incorporating realistic sensor noise, state estimation, and actuator dynamics. The results confirm the robustness and adaptability of our method, underscoring its potential for hardware deployment.

    https://arxiv.org/abs/2512.12993


    Calibrating Uncertainty for Zero-Shot Adversarial CLIP

    oai:arXiv.org:2512.12997v1

    arXiv:2512.12997v1 Announce Type: new Abstract: CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.

    https://arxiv.org/abs/2512.12997


    Legitimizing, Developing, and Sustaining Feminist HCI in East Asia: Challenges and Opportunities

    oai:arXiv.org:2512.13000v1

    arXiv:2512.13000v1 Announce Type: new Abstract: Feminist HCI has been rapidly developing in East Asian contexts in recent years. The region's unique cultural and political backgrounds have contributed valuable, situated knowledge, revealing topics such as localized digital feminism practices, or women's complex navigation among social expectations. However, the very factors that ground these perspectives also create significant survival challenges for researchers in East Asia. These include a scarcity of dedicated funding, the stigma of being perceived as less valuable than productivity-oriented technologies, and the lack of senior researchers and established, resilient communities. Grounded in these challenges and our prior collective practices, we propose this meet-up with two focused goals: (1) to provide a legitimized channel for Feminist HCI researchers to connect and build community, and (2) to facilitate an action-oriented dialogue on how to legitimize, develop, and sustain Feminist HCI in the East Asian context. The website for this meet-up is: https://feminist-hci.github.io/

    https://arxiv.org/abs/2512.13000


    Are Large Language Models Really Effective for Training-Free Cold-Start Recommendation?

    oai:arXiv.org:2512.13001v1

    arXiv:2512.13001v1 Announce Type: new Abstract: Recommender systems usually rely on large-scale interaction data to learn from users' past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on \textit{training-free cold-start recommendation} (TFCSR), the more challenging case where the target user has no interactions. Large language models (LLMs) have recently been explored as a promising solution, and numerous studies have been proposed. As the ability of text embedding models (TEMs) increases, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that direct LLM ranking is not the only viable option, contrary to the commonly shared belief, and TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.

    https://arxiv.org/abs/2512.13001


    Large Language Models for Power System Applications: A Comprehensive Literature Survey

    oai:arXiv.org:2512.13004v1

    arXiv:2512.13004v1 Announce Type: new Abstract: This comprehensive literature review examines the emerging applications of Large Language Models (LLMs) in power system engineering. Through a systematic analysis of recent research published between 2020 and 2025, we explore how LLMs are being integrated into various aspects of power system operations, planning, and management. The review covers key application areas including fault diagnosis, load forecasting, cybersecurity, control and optimization, system planning, simulation, and knowledge management. Our findings indicate that while LLMs show promising potential in enhancing power system operations through their advanced natural language processing and reasoning capabilities, significant challenges remain in their practical implementation. These challenges include limited domain-specific training data, concerns about reliability and safety in critical infrastructure, and the need for enhanced explainability. The review also highlights emerging trends such as the development of power system-specific LLMs and hybrid approaches combining LLMs with traditional power engineering methods. We identify crucial research directions for advancing the field, including the development of specialized architectures, improved security frameworks, and enhanced integration with existing power system tools. This survey provides power system researchers and practitioners with a comprehensive overview of the current state of LLM applications in the field and outlines future pathways for research and development.

    https://arxiv.org/abs/2512.13004


    Few-Step Distillation for Text-to-Image Generation: A Practical Guide

    oai:arXiv.org:2512.13006v1

    arXiv:2512.13006v1 Announce Type: new Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on github.com/alibaba-damo-academy/T2I-Distill.

    https://arxiv.org/abs/2512.13006


    Light Field Based 6DoF Tracking of Previously Unobserved Objects

    oai:arXiv.org:2512.13007v1

    arXiv:2512.13007v1 Announce Type: new Abstract: Object tracking is an important step in robotics and reautonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at https://github.com/nagonch/LiFT-6DoF.

    https://arxiv.org/abs/2512.13007


    TWLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading

    oai:arXiv.org:2512.13008v1

    arXiv:2512.13008v1 Announce Type: new Abstract: Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.

    https://arxiv.org/abs/2512.13008


    K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots

    oai:arXiv.org:2512.13009v1

    arXiv:2512.13009v1 Announce Type: new Abstract: Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.

    https://arxiv.org/abs/2512.13009


    Deep Learning-Driven Inversion Framework for Shear Modulus Estimation in Magnetic Resonance Elastography (DIME)

    oai:arXiv.org:2512.13010v1

    arXiv:2512.13010v1 Announce Type: new Abstract: The Multimodal Direct Inversion (MMDI) algorithm is widely used in Magnetic Resonance Elastography (MRE) to estimate tissue shear stiffness. However, MMDI relies on the Helmholtz equation, which assumes wave propagation in a uniform, homogeneous, and infinite medium. Furthermore, the use of the Laplacian operator makes MMDI highly sensitive to noise, which compromises the accuracy and reliability of stiffness estimates. In this study, we propose the Deep-Learning driven Inversion Framework for Shear Modulus Estimation in MRE (DIME), aimed at enhancing the robustness of inversion. DIME is trained on the displacement fields-stiffness maps pair generated through Finite Element Modelling (FEM) simulations. To capture local wave behavior and improve robustness to global image variations, DIME is trained on small image patches. We first validated DIME using homogeneous and heterogeneous datasets simulated with FEM, where DIME produced stiffness maps with low inter-pixel variability, accurate boundary delineation, and higher correlation with ground truth (GT) compared to MMDI. Next, DIME was evaluated in a realistic anatomy-informed simulated liver dataset with known GT and compared directly to MMDI. DIME reproduced ground-truth stiffness patterns with high fidelity (r = 0.99, R^2 = 0.98), while MMDI showed greater underestimation. After validating DIME on synthetic data, we tested the model in in vivo liver MRE data from eight healthy and seven fibrotic subjects. DIME preserved physiologically consistent stiffness patterns and closely matched MMDI, which showed directional bias. Overall, DIME showed higher correlation with ground truth and visually similar stiffness patterns, whereas MMDI displayed a larger bias that can potentially be attributed to directional filtering. These preliminary results highlight the feasibility of DIME for clinical applications in MRE.

    https://arxiv.org/abs/2512.13010


    HQ-MPSD: A Multilingual Artifact-Controlled Benchmark for Partial Deepfake Speech Detection

    oai:arXiv.org:2512.13012v1

    arXiv:2512.13012v1 Announce Type: new Abstract: Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.

    https://arxiv.org/abs/2512.13012


    JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

    oai:arXiv.org:2512.13014v1

    arXiv:2512.13014v1 Announce Type: new Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

    https://arxiv.org/abs/2512.13014


    What Happens Next? Next Scene Prediction with a Unified Video Model

    oai:arXiv.org:2512.13015v1

    arXiv:2512.13015v1 Announce Type: new Abstract: Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.

    https://arxiv.org/abs/2512.13015


    Comprehensive Deployment-Oriented Assessment for Cross-Environment Generalization in Deep Learning-Based mmWave Radar Sensing

    oai:arXiv.org:2512.13018v1

    arXiv:2512.13018v1 Announce Type: new Abstract: This study presents the first comprehensive evaluation of spatial generalization techniques, which are essential for the practical deployment of deep learning-based radio-frequency (RF) sensing. Focusing on people counting in indoor environments using frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar, we systematically investigate a broad set of approaches, including amplitude-based statistical preprocessing (sigmoid weighting and threshold zeroing), frequency-domain filtering, autoencoder-based background suppression, data augmentation strategies, and transfer learning. Experimental results collected across two environments with different layouts demonstrate that sigmoid-based amplitude weighting consistently achieves superior cross-environment performance, yielding 50.1% and 55.2% reductions in root-mean-square error (RMSE) and mean absolute error (MAE), respectively, compared with baseline methods. Data augmentation provides additional though modest benefits, with improvements up to 8.8% in MAE. By contrast, transfer learning proves indispensable for large spatial shifts, achieving 82.1% and 91.3% reductions in RMSE and MAE, respectively, with 540 target-domain samples. Taken together, these findings establish a highly practical direction for developing radar sensing systems capable of maintaining robust accuracy under spatial variations by integrating deep learning models with amplitude-based preprocessing and efficient transfer learning.

    https://arxiv.org/abs/2512.13018


    SneakPeek: Future-Guided Instructional Streaming Video Generation

    oai:arXiv.org:2512.13019v1

    arXiv:2512.13019v1 Announce Type: new Abstract: Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.

    https://arxiv.org/abs/2512.13019


    Safe Control of Multi-Agent Systems with Minimal Communication

    oai:arXiv.org:2512.13021v1

    arXiv:2512.13021v1 Announce Type: new Abstract: In many multi-agent systems, communication is limited by bandwidth, latency, and energy constraints. Designing controllers that achieve coordination and safety with minimal communication is critical for scalable and reliable deployment. This paper presents a method for designing controllers that minimize inter-agent communication in multi-agent systems while satisfying safety and coordination requirements, while conforming to communication delay constraints. The control synthesis problem is cast as a rank minimization problem, where a convex relaxation is obtained via system level synthesis. Simulation results on various tasks, including trajectory tracking with relative and heterogeneous sensing, demonstrate that the proposed method significantly reduces inter-agent transmission compared to baseline approaches.

    https://arxiv.org/abs/2512.13021


    Motus: A Unified Latent Action World Model

    oai:arXiv.org:2512.13030v1

    arXiv:2512.13030v1 Announce Type: new Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

    https://arxiv.org/abs/2512.13030


    Comprehensive Evaluation of Rule-Based, Machine Learning, and Deep Learning in Human Estimation Using Radio Wave Sensing: Accuracy, Spatial Generalization, and Output Granularity Trade-offs

    oai:arXiv.org:2512.13031v1

    arXiv:2512.13031v1 Announce Type: new Abstract: This study presents the first comprehensive comparison of rule-based methods, traditional machine learning models, and deep learning models in radio wave sensing with frequency modulated continuous wave multiple input multiple output radar. We systematically evaluated five approaches in two indoor environments with distinct layouts: a rule-based connected component method; three traditional machine learning models, namely k-nearest neighbors, random forest, and support vector machine; and a deep learning model combining a convolutional neural network and long short term memory. In the training environment, the convolutional neural network long short term memory model achieved the highest accuracy, while traditional machine learning models provided moderate performance. In a new layout, however, all learning based methods showed significant degradation, whereas the rule-based method remained stable. Notably, for binary detection of presence versus absence of people, all models consistently achieved high accuracy across layouts. These results demonstrate that high capacity models can produce fine grained outputs with high accuracy in the same environment, but they are vulnerable to domain shift. In contrast, rule-based methods cannot provide fine grained outputs but exhibit robustness against domain shift. Moreover, regardless of the model type, a clear trade off was revealed between spatial generalization performance and output granularity.

    https://arxiv.org/abs/2512.13031


    Scaling Bidirectional Spans and Span Violations in Attention Mechanism

    oai:arXiv.org:2512.13033v1

    arXiv:2512.13033v1 Announce Type: new Abstract: The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes

    https://arxiv.org/abs/2512.13033


    Alada: Alternating Adaptation of Momentum Method for Memory-Efficient Matrix Optimization

    oai:arXiv.org:2512.13034v1

    arXiv:2512.13034v1 Announce Type: new Abstract: This work proposes Alada, an adaptive momentum method for stochastic optimization over large-scale matrices. Alada employs a rank-one factorization approach to estimate the second moment of gradients, where factors are updated alternatively to minimize the estimation error. Alada achieves sublinear memory overheads and can be readily extended to optimizing tensor-shaped variables.We also equip Alada with a first moment estimation rule, which enhances the algorithm's robustness without incurring additional memory overheads. The theoretical performance of Alada aligns with that of traditional methods such as Adam. Numerical studies conducted on several natural language processing tasks demonstrate the reduction in memory overheads and the robustness in training large models relative to Adam and its variants.

    https://arxiv.org/abs/2512.13034


    Progressive Refinement of E-commerce Search Ranking Based on Short-Term Activities of the Buyer

    oai:arXiv.org:2512.13037v1

    arXiv:2512.13037v1 Announce Type: new Abstract: In e-commerce shopping, aligning search results with a buyer's immediate needs and preferences presents a significant challenge, particularly in adapting search results throughout the buyer's shopping journey as they move from the initial stages of browsing to making a purchase decision or shift from one intent to another. This study presents a systematic approach to adapting e-commerce search results based on the current context. We start with basic methods and incrementally incorporate more contextual information and state-of-the-art techniques to improve the search outcomes. By applying this evolving contextual framework to items displayed on the search engine results page (SERP), we progressively align search outcomes more closely with the buyer's interests and current search intentions. Our findings demonstrate that this incremental enhancement, from simple heuristic autoregressive features to advanced sequence models, significantly improves ranker performance. The integration of contextual techniques enhances the performance of our production ranker, leading to improved search results in both offline and online A/B testing in terms of Mean Reciprocal Rank (MRR). Overall, the paper details iterative methodologies and their substantial contributions to search result contextualization on e-commerce platforms.

    https://arxiv.org/abs/2512.13037


    Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

    oai:arXiv.org:2512.13039v1

    arXiv:2512.13039v1 Announce Type: new Abstract: Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models.However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experiment evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.

    https://arxiv.org/abs/2512.13039


    Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

    oai:arXiv.org:2512.13040v1

    arXiv:2512.13040v1 Announce Type: new Abstract: Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.

    https://arxiv.org/abs/2512.13040


    GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

    oai:arXiv.org:2512.13043v1

    arXiv:2512.13043v1 Announce Type: new Abstract: Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.

    https://arxiv.org/abs/2512.13043


    Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC

    oai:arXiv.org:2512.13047v1

    arXiv:2512.13047v1 Announce Type: new Abstract: File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLMs) to generate and evolve a file system from prompts, effectively addressing the need for robust evolution. Despite the widespread success of LLMs in code generation, attempts to create a functional file system have thus far been unsuccessful, mainly due to the ambiguity of natural language prompts. This paper introduces SYSSPEC, a framework for developing generative file systems. Its key insight is to replace ambiguous natural language with principles adapted from formal methods. Instead of imprecise prompts, SYSSPEC employs a multi-part specification that accurately describes a file system's functionality, modularity, and concurrency. The specification acts as an unambiguous blueprint, guiding LLMs to generate expected code flexibly. To manage evolution, we develop a DAG-structured patch that operates on the specification itself, enabling new features to be added without violating existing invariants. Moreover, the SYSSPEC toolchain features a set of LLM-based agents with mechanisms to mitigate hallucination during construction and evolution. We demonstrate our approach by generating SPECFS, a concurrent file system. SPECFS passes hundreds of regression tests, matching a manually-coded baseline. We further confirm its evolvability by seamlessly integrating 10 real-world features from Ext4. Our work shows that a specification-guided approach makes generating and evolving complex systems not only feasible but also highly effective.

    https://arxiv.org/abs/2512.13047


    Citation importance-aware document representation learning for large-scale science mapping

    oai:arXiv.org:2512.13054v1

    arXiv:2512.13054v1 Announce Type: new Abstract: Effective science mapping relies on high-quality representations of scientific documents. As an important task in scientometrics and information studies, science mapping is often challenged by the complex and heterogeneous nature of citations. While previous studies have attempted to improve document representations by integrating citation and semantic information, the heterogeneity of citations is often overlooked. To address this problem, this study proposes a citation importance-aware contrastive learning framework that refines the supervisory signal. We first develop a scalable measurement of citation importance based on location, frequency, and self-citation characteristics. Citation importance is then integrated into the contrastive learning process through an importance-aware sampling strategy, which selects low-importance citations as hard negatives. This forces the model to learn finer-grained representations that distinguish between important and perfunctory citations. To validate the effectiveness of the proposed framework, we fine-tune a SciBERT model and perform extensive evaluations on SciDocs and PubMed benchmark datasets. Results show consistent improvements in both document representation quality and science mapping accuracy. Furthermore, we apply the trained model to over 33 million documents from Web of Science. The resulting map of science accurately visualizes the global and local intellectual structure of science and reveals interdisciplinary research fronts. By operationalizing citation heterogeneity into a scalable computational framework, this study demonstrates how differentiating citations by their importance can be effectively leveraged to improve document representation and science mapping.

    https://arxiv.org/abs/2512.13054


    Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing

    oai:arXiv.org:2512.13055v1

    arXiv:2512.13055v1 Announce Type: new Abstract: Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments. The code is available at https://github.com/jaeyoon1603/AsymVPR

    https://arxiv.org/abs/2512.13055


    The Optimal Control Algorithm of Connected and Automated Vehicles at Roundabouts with Communication Delay

    oai:arXiv.org:2512.13056v1

    arXiv:2512.13056v1 Announce Type: new Abstract: Connected and automated vehicles (CAVs) rely on wireless communication to exchange state information for distributed control, making communication delays a critical factor that can affect vehicle motion and degrade control performance, particularly in high-speed scenarios. To address these challenges in the complex environment of roundabout intersections, this paper proposes a roundabout control algorithm, which takes into account the uncertainty of interactive information caused by time delays. First, to maintain the required distance between the current vehicle and its preceding and following vehicles, conflicting vehicles are identified based on the time-to-collision (TTC) in the conflict zone. To fully consider communication performance, a vehicle motion model incorporating time delays is established. According to the distributed model predictive control (DMPC) mechanism, the vehicle motion control that satisfies the roundabout constraints is determined. Second, by scheduling the sequence of vehicles entering the roundabout, a multiscale optimization objective is developed by integrating vehicle motion indicators and roundabout system indicators. Traffic density and travel time are embedded into the optimization problem to guide vehicles to enter the roundabout safely and stably. Through a variety of simulation experiments, the effectiveness of the proposed control algorithm is verified by comparing its performance with that of multiple control algorithms under different autonomous vehicle penetration rates and heavy traffic load scenarios.

    https://arxiv.org/abs/2512.13056


    Homomorphism Indistinguishability, Multiplicity Automata Equivalence, and Polynomial Identity Testing

    oai:arXiv.org:2512.13058v1

    arXiv:2512.13058v1 Announce Type: new Abstract: Two graphs $G$ and $H$ are homomorphism indistinguishable over a graph class $\mathcal{F}$ if they admit the same number of homomorphisms from every graph $F \in \mathcal{F}$. Many graph isomorphism relaxations such as (quantum) isomorphism and cospectrality can be characterised as homomorphism indistinguishability over specific graph classes. Thereby, the problems $\textrm{HomInd}(\mathcal{F})$ of deciding homomorphism indistinguishability over $\mathcal{F}$ subsume diverse graph isomorphism relaxations whose complexities range from logspace to undecidable. Establishing the first general result on the complexity of $\textrm{HomInd}(\mathcal{F})$, Seppelt (MFCS 2024) showed that $\textrm{HomInd}(\mathcal{F})$ is in randomised polynomial time for every graph class $\mathcal{F}$ of bounded treewidth that can be defined in counting monadic second-order logic $\mathsf{CMSO}_2$. We show that this algorithm is conditionally optimal, i.e. it cannot be derandomised unless polynomial identity testing is in $\mathsf{PTIME}$. For $\mathsf{CMSO}_2$-definable graph classes $\mathcal{F}$ of bounded pathwidth, we improve the previous complexity upper bound for $\textrm{HomInd}(\mathcal{F})$ from $\mathsf{PTIME}$ to $\mathsf{C}_=\mathsf{L}$ and show that this is tight. Secondarily, we establish a connection between homomorphism indistinguishability and multiplicity automata equivalence which allows us to pinpoint the complexity of the latter problem as $\mathsf{C}_=\mathsf{L}$-complete.

    https://arxiv.org/abs/2512.13058


    An Open and Reproducible Deep Research Agent for Long-Form Question Answering

    oai:arXiv.org:2512.13059v1

    arXiv:2512.13059v1 Announce Type: new Abstract: We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at https://github.com/efficient-deep-research/efficient-deep-research.

    https://arxiv.org/abs/2512.13059


    Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments

    oai:arXiv.org:2512.13060v1

    arXiv:2512.13060v1 Announce Type: new Abstract: This paper addresses the challenges of low scheduling efficiency, unbalanced resource allocation, and poor adaptability in ETL (Extract-Transform-Load) processes under heterogeneous data environments by proposing an intelligent scheduling optimization framework based on deep Q-learning. The framework formalizes the ETL scheduling process as a Markov Decision Process and enables adaptive decision-making by a reinforcement learning agent in high-dimensional state spaces to dynamically optimize task allocation and resource scheduling. The model consists of a state representation module, a feature embedding network, a Q-value estimator, and a reward evaluation mechanism, which collectively consider task dependencies, node load states, and data flow characteristics to derive the optimal scheduling strategy in complex environments. A multi-objective reward function is designed to balance key performance indicators such as average scheduling delay, task completion rate, throughput, and resource utilization. Sensitivity experiments further verify the model's robustness under changes in hyperparameters, environmental dynamics, and data scale. Experimental results show that the proposed deep Q-learning scheduling framework significantly reduces scheduling delay, improves system throughput, and enhances execution stability under multi-source heterogeneous task conditions, demonstrating the strong potential of reinforcement learning in complex data scheduling and resource management, and providing an efficient and scalable optimization strategy for intelligent data pipeline construction.

    https://arxiv.org/abs/2512.13060


    Modeling Collaborative Problem Solving Dynamics from Group Discourse: A Text-Mining Approach with Synergy Degree Model

    oai:arXiv.org:2512.13061v1

    arXiv:2512.13061v1 Announce Type: new Abstract: Measuring collaborative problem solving (CPS) synergy remains challenging in learning analytics, as classical manual coding cannot capture emergent system-level dynamics. This study introduces a computational framework that integrates automated discourse analysis with the Synergy Degree Model (SDM) to quantify CPS synergy from group communication. Data were collected from 52 learners in 12 groups during a 5-week connectivist MOOC (cMOOC) activity. Nine classification models were applied to automatically identify ten CPS behaviors across four interaction levels: operation, wayfinding, sense-making, and creation. While BERT achieved the highest accuracy, GPT models demonstrated superior precision suitable for human-AI collaborative coding. Within the SDM framework, each interaction level was treated as a subsystem to compute group-level order parameters and derive synergy degrees. Permutation tests showed automated measures preserve construct validity, despite systematic biases at the subsystem level. Statistical analyses revealed significant task-type differences: survey study groups exhibited higher creation-order than mode study groups, suggesting "controlled disorder" may benefit complex problem solving. Importantly, synergy degree distinguished collaborative quality, ranging from excellent to failing groups. Findings establish synergy degree as a sensitive indicator of collaboration and demonstrate the feasibility of scaling fine-grained CPS analytics through AI-in-the-loop approaches.

    https://arxiv.org/abs/2512.13061


    LLM Rationalis? Measuring Bargaining Capabilities of AI Negotiators

    oai:arXiv.org:2512.13063v1

    arXiv:2512.13063v1 Announce Type: new Abstract: Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics burstiness tau and the Concession-Rigidity Index (CRI) to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans who smoothly adapt to situations and infer the opponents position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.

    https://arxiv.org/abs/2512.13063


    Sharp convergence bounds for sums of POD and SPOD weights

    oai:arXiv.org:2512.13068v1

    arXiv:2512.13068v1 Announce Type: new Abstract: This work analyzes the convergence of sums of the form $S_{\boldsymbol{\gamma}}(m)=\sum_{v\subseteq \mathbb{N}}\gamma_v m^{|v|}$, where $\gamma_v$ are product and order dependent (POD) weights. We establish that for nonnegative sequence $\{\Upsilon_j\mid j\in \mathbb{N}\}$, $$\sum_{v\subseteq \mathbb{N}} |v|! m^{|v|}\prod_{j\in v} \Upsilon_j<\infty \text{ for } m>0 \text{ if and only if } \sum_{j=1}^\infty \Upsilon_j<\infty.$$ We further characterize the growth of $S_{\boldsymbol{\gamma}}(m)$ when $\gamma_v=(|v|!)^{\sigma}\prod_{j\in v}j^{-\rho}$ and prove that $\log S_{\boldsymbol{\gamma}}(m)$ exhibits asymptotic order $m^{1/(\rho-\sigma)}$ when $\rho>\sigma$. All results are subsequently generalized to smoothness-driven product and order dependent (SPOD) weights.

    https://arxiv.org/abs/2512.13068


    Multi-fidelity aerodynamic data fusion by autoencoder transfer learning

    oai:arXiv.org:2512.13069v1

    arXiv:2512.13069v1 Announce Type: new Abstract: Accurate aerodynamic prediction often relies on high-fidelity simulations; however, their prohibitive computational costs severely limit their applicability in data-driven modeling. This limitation motivates the development of multi-fidelity strategies that leverage inexpensive low-fidelity information without compromising accuracy. Addressing this challenge, this work presents a multi-fidelity deep learning framework that combines autoencoder-based transfer learning with a newly developed Multi-Split Conformal Prediction (MSCP) strategy to achieve uncertainty-aware aerodynamic data fusion under extreme data scarcity. The methodology leverages abundant Low-Fidelity (LF) data to learn a compact latent physics representation, which acts as a frozen knowledge base for a decoder that is subsequently fine-tuned using scarce HF samples. Tested on surface-pressure distributions for NACA airfoils (2D) and a transonic wing (3D) databases, the model successfully corrects LF deviations and achieves high-accuracy pressure predictions using minimal HF training data. Furthermore, the MSCP framework produces robust, actionable uncertainty bands with pointwise coverage exceeding 95%. By combining extreme data efficiency with uncertainty quantification, this work offers a scalable and reliable solution for aerodynamic regression in data-scarce environments.

    https://arxiv.org/abs/2512.13069


    M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

    oai:arXiv.org:2512.13070v1

    arXiv:2512.13070v1 Announce Type: new Abstract: Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.

    https://arxiv.org/abs/2512.13070


    Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

    oai:arXiv.org:2512.13072v1

    arXiv:2512.13072v1 Announce Type: new Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbf{M}edical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.

    https://arxiv.org/abs/2512.13072


    A Simple and Effective Framework for Symmetric Consistent Indexing in Large-Scale Dense Retrieval

    oai:arXiv.org:2512.13074v1

    arXiv:2512.13074v1 Announce Type: new Abstract: Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs-themselves a form of coarse-to-fine representation-play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models. To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and an consistent indexing with dual-tower synergy module that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.

    https://arxiv.org/abs/2512.13074


    LikeBench: Evaluating Subjective Likability in LLMs for Personalization

    oai:arXiv.org:2512.13077v1

    arXiv:2512.13077v1 Announce Type: new Abstract: A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user's preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt to responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait rating based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.

    https://arxiv.org/abs/2512.13077


    Heart Disease Prediction using Case Based Reasoning (CBR)

    oai:arXiv.org:2512.13078v1

    arXiv:2512.13078v1 Announce Type: new Abstract: This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor's experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.

    https://arxiv.org/abs/2512.13078


    Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

    oai:arXiv.org:2512.13080v1

    arXiv:2512.13080v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.

    https://arxiv.org/abs/2512.13080


    DiRe: Diversity-promoting Regularization for Dataset Condensation

    oai:arXiv.org:2512.13083v1

    arXiv:2512.13083v1 Announce Type: new Abstract: In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.

    https://arxiv.org/abs/2512.13083


    UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era

    oai:arXiv.org:2512.13089v1

    arXiv:2512.13089v1 Announce Type: new Abstract: Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.

    https://arxiv.org/abs/2512.13089


    Multi-Robot Motion Planning from Vision and Language using Heat-Inspired Diffusion

    oai:arXiv.org:2512.13090v1

    arXiv:2512.13090v1 Announce Type: new Abstract: Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-Conditioned Heat-Inspired Diffusion (LCHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LCHD integrates CLIP-based semantic priors with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LCHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency.

    https://arxiv.org/abs/2512.13090


    PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations

    oai:arXiv.org:2512.13093v1

    arXiv:2512.13093v1 Announce Type: new Abstract: Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.

    https://arxiv.org/abs/2512.13093


    Sequence of Expert: Boosting Imitation Planners for Autonomous Driving through Temporal Alternation

    oai:arXiv.org:2512.13094v1

    arXiv:2512.13094v1 Announce Type: new Abstract: Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time.Over successive planning cycles, these errors compound, potentially resulting in severe failures.Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on the state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue.To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on large-scale autonomous driving benchmarks nuPlan demonstrate that SoE method consistently and significantly improves the performance of all the evaluated models, and achieves state-of-the-art performance.This module may provide a key and widely applicable support for improving the training efficiency of autonomous driving models.

    https://arxiv.org/abs/2512.13094


    ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

    oai:arXiv.org:2512.13095v1

    arXiv:2512.13095v1 Announce Type: new Abstract: To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated ''hints'' into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample's difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.

    https://arxiv.org/abs/2512.13095


    Toward Self-Healing Networks-on-Chip: RL-Driven Routing in 2D Torus Architectures

    oai:arXiv.org:2512.13096v1

    arXiv:2512.13096v1 Announce Type: new Abstract: We investigate adaptive minimal routing in 2D torus networks on chip NoCs under node fault conditions comparing a reinforcement learning RL based strategy to an adaptive routing baseline A torus topology is used for its low diameter high connectivity properties The RL approach models each router as an agent that learns to forward packets based on network state while the adaptive scheme uses fixed minimal paths with simple rerouting around faults We implement both methods in simulation injecting up to 50 node faults uniformly at random Key metrics are measured 1 throughput vs offered load at fault density 02 2 packet delivery ratio PDR vs fault density and 3 a fault adaptive score FT vs fault density Experimental results show the RL method achieves significantly higher throughput at high load approximately 2030 gain and maintains higher reliability under increasing faults The RL router delivers more packets per cycle and adapts to faults by exploiting path diversity whereas the adaptive scheme degrades sharply as faults accumulate In particular the RL approach preserves end to end connectivity longer PDR remains above 90 until approximately 3040 faults while adaptive PDR drops to approximately 70 at the same point The fault adaptive score likewise favors RL routing Thus RL based adaptive routing demonstrates clear advantages in throughput and fault resilience for torus NoCs

    https://arxiv.org/abs/2512.13096


    On the Complementarity of Shared Electric Mobility and Renewable Energy Communities

    oai:arXiv.org:2512.13099v1

    arXiv:2512.13099v1 Announce Type: new Abstract: Driven by the ongoing energy transition, shared mobility service providers are emerging actors in electrical power systems which aim to shift combustion-based mobility to electric paradigm. In the meantime, Energy Communities are deployed to enhance the local usage of distributed renewable production. As both ators share the same goal of satisfying the demand at the lowest cost, they could take advantage of their complementarity and coordinate their decisions to enhance each other operation. This paper presents an original Mixed-Integer Second Order Cone Programming long-term Electric Vehicle fleet planning optimization problem that integrates the coordination with a Renewable Energy Community and Vehicle-to-Grid capability. This model is used to assess the economic, energy, and grid performances of their collaboration in a 21 buses low-voltage distribution network. Key results show that, both actors coordination can help reducing the yearly cost up to 11.3 % compared to their stand-alone situation and that it may reduce the stress on the substation transformer by 46 % through the activation of the inherent EVs flexibility when subject to peak penalties from the grid operator.

    https://arxiv.org/abs/2512.13099


    OXE-AugE: A Large-Scale Robot Augmentation of OXE for Scaling Cross-Embodiment Policy Learning

    oai:arXiv.org:2512.13100v1

    arXiv:2512.13100v1 Announce Type: new Abstract: Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments -- robot arm and gripper combinations -- across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85\% of its real data, which risks overfitting to robot-scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that state-of-the-art generalist policies such as OpenVLA and $\pi_0$ benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot-gripper combinations across four real-world manipulation tasks. Project website: https://OXE-AugE.github.io/.

    https://arxiv.org/abs/2512.13100


    Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation

    oai:arXiv.org:2512.13101v1

    arXiv:2512.13101v1 Announce Type: new Abstract: Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.

    https://arxiv.org/abs/2512.13101


    Socratic Students: Teaching Language Models to Learn by Asking Questions

    oai:arXiv.org:2512.13102v1

    arXiv:2512.13102v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.

    https://arxiv.org/abs/2512.13102


    FID-Net: A Feature-Enhanced Deep Learning Network for Forest Infestation Detection

    oai:arXiv.org:2512.13104v1

    arXiv:2512.13104v1 Announce Type: new Abstract: Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees' infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.

    https://arxiv.org/abs/2512.13104


    Deterministic and Exact Fully-dynamic Minimum Cut of Superpolylogarithmic Size in Subpolynomial Time

    oai:arXiv.org:2512.13105v1

    arXiv:2512.13105v1 Announce Type: new Abstract: We present an exact fully-dynamic minimum cut algorithm that runs in $n^{o(1)}$ deterministic update time when the minimum cut size is at most $2^{\Theta(\log^{3/4-c}n)}$ for any $c>0$, improving on the previous algorithm of Jin, Sun, and Thorup (SODA 2024) whose minimum cut size limit is $(\log n)^{o(1)}$. Combined with graph sparsification, we obtain the first $(1+\epsilon)$-approximate fully-dynamic minimum cut algorithm on weighted graphs, for any $\epsilon\ge2^{-\Theta(\log^{3/4-c}n)}$, in $n^{o(1)}$ randomized update time. Our main technical contribution is a deterministic local minimum cut algorithm, which replaces the randomized LocalKCut procedure from El-Hayek, Henzinger, and Li (SODA 2025).

    https://arxiv.org/abs/2512.13105


    TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

    oai:arXiv.org:2512.13106v1

    arXiv:2512.13106v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via https://github.com/ShenzhiYang2000/TRAPO.

    https://arxiv.org/abs/2512.13106


    Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

    oai:arXiv.org:2512.13107v1

    arXiv:2512.13107v1 Announce Type: new Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.

    https://arxiv.org/abs/2512.13107


    Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

    oai:arXiv.org:2512.13109v1

    arXiv:2512.13109v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% in MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4\% in KV-Retrieval tasks.

    https://arxiv.org/abs/2512.13109


    From Overfitting to Reliability: Introducing the Hierarchical Approximate Bayesian Neural Network

    oai:arXiv.org:2512.13111v1

    arXiv:2512.13111v1 Announce Type: new Abstract: In recent years, neural networks have revolutionized various domains, yet challenges such as hyperparameter tuning and overfitting remain significant hurdles. Bayesian neural networks offer a framework to address these challenges by incorporating uncertainty directly into the model, yielding more reliable predictions, particularly for out-of-distribution data. This paper presents Hierarchical Approximate Bayesian Neural Network, a novel approach that uses a Gaussian-inverse-Wishart distribution as a hyperprior of the network's weights to increase both the robustness and performance of the model. We provide analytical representations for the predictive distribution and weight posterior, which amount to the calculation of the parameters of Student's t-distributions in closed form with linear complexity with respect to the number of weights. Our method demonstrates robust performance, effectively addressing issues of overfitting and providing reliable uncertainty estimates, particularly for out-of-distribution tasks. Experimental results indicate that HABNN not only matches but often outperforms state-of-the-art models, suggesting a promising direction for future applications in safety-critical environments.

    https://arxiv.org/abs/2512.13111


    Less Is More: Sparse and Cooperative Perturbation for Point Cloud Attacks

    oai:arXiv.org:2512.13119v1

    arXiv:2512.13119v1 Announce Type: new Abstract: Most adversarial attacks on point clouds perturb a large number of points, causing widespread geometric changes and limiting applicability in real-world scenarios. While recent works explore sparse attacks by modifying only a few points, such approaches often struggle to maintain effectiveness due to the limited influence of individual perturbations. In this paper, we propose SCP, a sparse and cooperative perturbation framework that selects and leverages a compact subset of points whose joint perturbations produce amplified adversarial effects. Specifically, SCP identifies the subset where the misclassification loss is locally convex with respect to their joint perturbations, determined by checking the positivedefiniteness of the corresponding Hessian block. The selected subset is then optimized to generate high-impact adversarial examples with minimal modifications. Extensive experiments show that SCP achieves 100% attack success rates, surpassing state-of-the-art sparse attacks, and delivers superior imperceptibility to dense attacks with far fewer modifications.

    https://arxiv.org/abs/2512.13119


    Towards Practical Large-scale Dynamical Heterogeneous Graph Embedding: Cold-start Resilient Recommendation

    oai:arXiv.org:2512.13120v1

    arXiv:2512.13120v1 Announce Type: new Abstract: Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with linear scalability, while ILLE provides rapid, targeted updates to incorporate new data, thus avoiding costly full retraining. This dual approach is cold-start resilient, leveraging the graph to create meaningful embeddings from sparse data. On billion-scale graphs, A/B tests show HetSGFormer achieved up to a 6.11% lift in Advertiser Value over previous methods, while the ILLE module added another 3.22% lift and improved embedding refresh timeliness by 83.2%. Our work provides a validated framework for deploying dynamic graph learning in production environments.

    https://arxiv.org/abs/2512.13120


    DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

    oai:arXiv.org:2512.13122v1

    arXiv:2512.13122v1 Announce Type: new Abstract: Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency-especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R

    https://arxiv.org/abs/2512.13122


    Quanvolutional Neural Networks for Spectrum Peak-Finding

    oai:arXiv.org:2512.13125v1

    arXiv:2512.13125v1 Announce Type: new Abstract: The analysis of spectra, such as Nuclear Magnetic Resonance (NMR) spectra, for the comprehensive characterization of peaks is a challenging task for both experts and machines, especially with complex molecules. This process, also known as deconvolution, involves identifying and quantifying the peaks in the spectrum. Machine learning techniques have shown promising results in automating this process. With the advent of quantum computing, there is potential to further enhance these techniques. In this work, inspired by the success of classical Convolutional Neural Networks (CNNs), we explore the use of Quanvolutional Neural Networks (QuanvNNs) for the multi-task peak finding problem, involving both peak counting and position estimation. We implement a simple and interpretable QuanvNN architecture that can be directly compared to its classical CNN counterpart, and evaluate its performance on a synthetic NMR-inspired dataset. Our results demonstrate that QuanvNNs outperform classical CNNs on challenging spectra, achieving an 11\% improvement in F1 score and a 30\% reduction in mean absolute error for peak position estimation. Additionally, QuanvNNs appear to exhibit better convergence stability for harder problems.

    https://arxiv.org/abs/2512.13125


    LeafTrackNet: A Deep Learning Framework for Robust Leaf Tracking in Top-Down Plant Phenotyping

    oai:arXiv.org:2512.13130v1

    arXiv:2512.13130v1 Announce Type: new Abstract: High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods-particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. With our work we provide a new standard for leaf-level tracking under realistic conditions and we provide CanolaTrack - the largest dataset for leaf tracking in agriculture crops, which will contribute to future research in plant phenotyping. Our code and dataset are publicly available at https://github.com/shl-shawn/LeafTrackNet.

    https://arxiv.org/abs/2512.13130


    Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning

    oai:arXiv.org:2512.13131v1

    arXiv:2512.13131v1 Announce Type: new Abstract: Generating 3D-based body movements from speech shows great potential in extensive downstream applications, while it still suffers challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. As an ill-posed problem, in this paper, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship by two explicit technique insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human natures from realistic distributions while incorporating non-period ones from current latent states for instance-level diversities. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, driving the animation with cascaded guidance during learning. We exhibit our proposed approach on 3D avatars and extensive experiments show our method outperforms the state-of-the-art co-speech gesture generation methods by both quantitative and qualitative evaluations. Code and models will be publicly available.

    https://arxiv.org/abs/2512.13131


    An Optimal Alignment-Driven Iterative Closed-Loop Convergence Framework for High-Performance Ultra-Large Scale Layout Pattern Clustering

    oai:arXiv.org:2512.13133v1

    arXiv:2512.13133v1 Announce Type: new Abstract: With the aggressive scaling of VLSI technology, the explosion of layout patterns creates a critical bottleneck for DFM applications like OPC. Pattern clustering is essential to reduce data complexity, yet existing methods struggle with computational prohibitiveness ($O(N^2)$ comparisons), sub-optimal discrete sampling for center alignment, and difficult speed-quality trade-offs. To address these, we propose an Optimal Alignment-Driven Iterative Closed-Loop Convergence Framework. First, to resolve alignment ambiguity, we introduce a hybrid suite of high-performance algorithms: an FFT-based Phase Correlation method for cosine similarity constraints, and a Robust Geometric Min-Max strategy for edge displacement constraints that analytically solves for the global optimum. Second, we model clustering as a Set Cover Problem (SCP) using a Surprisal-Based Lazy Greedy heuristic within a coarse-to-fine iterative refinement loop to ensure convergence. Additionally, a multi-stage pruning mechanism filters over 99% of redundant computations. Experimental results on the 2025 China Postgraduate EDA Elite Challenge benchmark demonstrate a 93.4% compression ratio relative to raw inputs and an over 100x speedup compared to the official baseline, effectively handling tens of thousands of patterns in seconds. Securing First Place among 77 teams, this approach proves its superiority in solving the NP-Hard layout clustering problem with an optimal balance of scalability and precision.

    https://arxiv.org/abs/2512.13133


    Can AI Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels

    oai:arXiv.org:2512.13142v1

    arXiv:2512.13142v1 Announce Type: new Abstract: As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (anticipated judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

    https://arxiv.org/abs/2512.13142


    Weight Space Correlation Analysis: Quantifying Feature Utilization in Deep Learning Models

    oai:arXiv.org:2512.13144v1

    arXiv:2512.13144v1 Announce Type: new Abstract: Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.

    https://arxiv.org/abs/2512.13144


    StarryGazer: Leveraging Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion

    oai:arXiv.org:2512.13147v1

    arXiv:2512.13147v1 Announce Type: new Abstract: The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but there is no work to properly combine the sparse depth map with MDE for depth completion; a simple affine transformation to the depth map will yield a high error since MDE are inaccurate at estimating depth difference between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs for dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.

    https://arxiv.org/abs/2512.13147


    Enhancing Node-Level Graph Domain Adaptation by Alleviating Local Dependency

    oai:arXiv.org:2512.13149v1

    arXiv:2512.13149v1 Announce Type: new Abstract: Recent years have witnessed significant advancements in machine learning methods on graphs. However, transferring knowledge effectively from one graph to another remains a critical challenge. This highlights the need for algorithms capable of applying information extracted from a source graph to an unlabeled target graph, a task known as unsupervised graph domain adaptation (GDA). One key difficulty in unsupervised GDA is conditional shift, which hinders transferability. In this paper, we show that conditional shift can be observed only if there exists local dependencies among node features. To support this claim, we perform a rigorous analysis and also further provide generalization bounds of GDA when dependent node features are modeled using markov chains. Guided by the theoretical findings, we propose to improve GDA by decorrelating node features, which can be specifically implemented through decorrelated GCN layers and graph transformer layers. Our experimental results demonstrate the effectiveness of this approach, showing not only substantial performance enhancements over baseline GDA methods but also clear visualizations of small intra-class distances in the learned representations. Our code is available at https://github.com/TechnologyAiGroup/DFT

    https://arxiv.org/abs/2512.13149


    START: Traversing Sparse Footholds with Terrain Reconstruction

    oai:arXiv.org:2512.13153v1

    arXiv:2512.13153v1 Announce Type: new Abstract: Traversing terrains with sparse footholds like legged animals presents a promising yet challenging task for quadruped robots, as it requires precise environmental perception and agile control to secure safe foot placement while maintaining dynamic stability. Model-based hierarchical controllers excel in laboratory settings, but suffer from limited generalization and overly conservative behaviors. End-to-end learning-based approaches unlock greater flexibility and adaptability, but existing state-of-the-art methods either rely on heightmaps that introduce noise and complex, costly pipelines, or implicitly infer terrain features from egocentric depth images, often missing accurate critical geometric cues and leading to inefficient learning and rigid gaits. To overcome these limitations, we propose START, a single-stage learning framework that enables agile, stable locomotion on highly sparse and randomized footholds. START leverages only low-cost onboard vision and proprioception to accurately reconstruct local terrain heightmap, providing an explicit intermediate representation to convey essential features relevant to sparse foothold regions. This supports comprehensive environmental understanding and precise terrain assessment, reducing exploration cost and accelerating skill acquisition. Experimental results demonstrate that START achieves zero-shot transfer across diverse real-world scenarios, showcasing superior adaptability, precise foothold placement, and robust locomotion.

    https://arxiv.org/abs/2512.13153


    MAC: A Multi-Agent Framework for Interactive User Clarification in Multi-turn Conversations

    oai:arXiv.org:2512.13154v1

    arXiv:2512.13154v1 Announce Type: new Abstract: Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge--particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query within the most optimal multi-agent settings remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC that autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases task success rate 7.8\% (54.5 to 62.3) and reduces the average number of dialogue turns (6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.

    https://arxiv.org/abs/2512.13154


    Intrinsic Image Fusion for Multi-View 3D Material Reconstruction

    oai:arXiv.org:2512.13157v1

    arXiv:2512.13157v1 Announce Type: new Abstract: We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.

    https://arxiv.org/abs/2512.13157


    SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

    oai:arXiv.org:2512.13159v1

    arXiv:2512.13159v1 Announce Type: new Abstract: Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing prior work under-utilize the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns even surpassing even much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.

    https://arxiv.org/abs/2512.13159


    A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

    oai:arXiv.org:2512.13164v1

    arXiv:2512.13164v1 Announce Type: new Abstract: The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.

    https://arxiv.org/abs/2512.13164


    SACn: Soft Actor-Critic with n-step Returns

    oai:arXiv.org:2512.13165v1

    arXiv:2512.13165v1 Announce Type: new Abstract: Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $\tau$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.

    https://arxiv.org/abs/2512.13165


    Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

    oai:arXiv.org:2512.13168v1

    arXiv:2512.13168v1 Announce Type: new Abstract: We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

    https://arxiv.org/abs/2512.13168


    Integrated Semantic and Temporal Alignment for Interactive Video Retrieval

    oai:arXiv.org:2512.13169v1

    arXiv:2512.13169v1 Announce Type: new Abstract: The growing volume of video data and the introduction of complex retrieval challenges, such as the Temporal Retrieval and Alignment of Key Events (TRAKE) task at the Ho Chi Minh City AI Challenge 2025, expose critical limitations in existing systems. Many methodologies lack scalable, holistic architectures and rely on "frozen" embedding models that fail on out-of-knowledge (OOK) or real-world queries. This paper introduces the comprehensive video retrieval framework developed by team AIO\_Owlgorithms to address these gaps. Our system features an architecture integrating TransNetV2 for scene segmentation, BEiT-3 for visual embeddings in Milvus, and Gemini OCR for metadata in Elasticsearch. We propose two components: (1) \textbf{QUEST} (Query Understanding and External Search for Out-of-Knowledge Tasks), a two-branch framework that leverages a Large Language Model (LLM) for query rewriting and an external image search pathway to resolve OOK queries; and (2) \textbf{DANTE} (Dynamic Alignment of Narrative Temporal Events), a dynamic programming algorithm that efficiently solves the temporally-incoherent TRAKE task. These contributions form a robust and intelligent system that significantly advances the state-of-the-art in handling complex, real-world video search queries.

    https://arxiv.org/abs/2512.13169


    Iterative Tuning of Nonlinear Model Predictive Control for Robotic Manufacturing Tasks

    oai:arXiv.org:2512.13170v1

    arXiv:2512.13170v1 Announce Type: new Abstract: Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.

    https://arxiv.org/abs/2512.13170


    Mathematical modelling and simulation of HIV dynamics

    oai:arXiv.org:2512.13171v1

    arXiv:2512.13171v1 Announce Type: new Abstract: We study within-host HIV dynamics using a three--component nonlinear ordinary differential equation model for healthy CD4$^{+}$ T cells, infected CD4$^{+}$ T cells, and free virus. In addition to the baseline model without treatment, we consider two treatment extensions that incorporate antiretroviral therapy: (i) separate efficacy terms for Reverse Transcriptase inhibitors and Protease inhibitors acting on infection and virus production, and (ii) a simplified formulation using a single combined efficacy parameter. For each model we determine equilibrium points and apply linearization to obtain the Jacobian matrix and local stability conditions near equilibrium. We perform numerical simulations using the classical fourth--order Runge--Kutta method to illustrate the evolution of cell populations and viral load under different therapy levels and treatment start times, including continuous treatment scenarios. The simulations highlight how therapy strength and timing shape transient behavior and can lead to sustained viral suppression.

    https://arxiv.org/abs/2512.13171


    Know Your Users! Estimating User Domain Knowledge in Conversational Recommenders

    oai:arXiv.org:2512.13173v1

    arXiv:2512.13173v1 Announce Type: new Abstract: The ideal conversational recommender system (CRS) acts like a savvy salesperson, adapting its language and suggestions to each user's level of expertise. However, most current systems treat all users as experts, leading to frustrating and inefficient interactions when users are unfamiliar with a domain. Systems that can adapt their conversational strategies to a user's knowledge level stand to offer a much more natural and effective experience. To make a step toward such adaptive systems, we introduce a new task: estimating user domain knowledge from conversations, enabling a CRS to better understand user needs and personalize interactions. A key obstacle to developing such adaptive systems is the lack of suitable data; to our knowledge, no existing dataset captures the conversational behaviors of users with varying levels of domain knowledge. Furthermore, in most dialogue collection protocols, users are free to express their own preferences, which tends to concentrate on popular items and well-known features, offering little insight into how novices explore or learn about unfamiliar features. To address this, we design a game-based data collection protocol that elicits varied expressions of knowledge, release the resulting dataset, and provide an initial analysis to highlight its potential for future work on user-knowledge-aware CRS.

    https://arxiv.org/abs/2512.13173


    Seeing the Whole Picture: Distribution-Guided Data-Free Distillation for Semantic Segmentation

    oai:arXiv.org:2512.13175v1

    arXiv:2512.13175v1 Announce Type: new Abstract: Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods-primarily designed for classification-often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution-without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.

    https://arxiv.org/abs/2512.13175


    EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC

    oai:arXiv.org:2512.13176v1

    arXiv:2512.13176v1 Announce Type: new Abstract: Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, this comes at the cost of increased memory access latency due to the need to rely on the network fabric to transfer data between remote nodes. As such, it is crucial to ascertain an application's memory latency sensitivity to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom ad-hoc hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application's runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and investigate the impact of different hardware configurations. EDAN not only provides us with the capability of calculating the theoretical bounds for performance metrics, but it also helps us gain insight into the memory-level parallelism inherent to HPC applications. We apply EDAN to applications and benchmarks such as PolyBench, HPCG, and LULESH to unveil the characteristics of their intrinsic memory-level parallelism and latency sensitivity.

    https://arxiv.org/abs/2512.13176


    MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

    oai:arXiv.org:2512.13177v1

    arXiv:2512.13177v1 Announce Type: new Abstract: Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.

    https://arxiv.org/abs/2512.13177


    Efficient Generation of Smooth Paths with Curvature Guarantees by Mollification

    oai:arXiv.org:2512.13183v1

    arXiv:2512.13183v1 Announce Type: new Abstract: Most path following and trajectory tracking algorithms in mobile robotics require the desired path or trajectory to be defined by at least twice continuously differentiable functions to guarantee key properties such as global convergence, especially for nonholonomic robots like unicycles with speed constraints. Consequently, these algorithms typically exclude continuous but non-differentiable paths, such as piecewise functions. Despite this exclusion, such paths provide convenient high-level inputs for describing robot missions or behavior. While techniques such as spline interpolation or optimization-based methods are commonly used to smooth non-differentiable paths or create feasible ones from sequences of waypoints, they either can produce unnecessarily complex trajectories or are computationally expensive. In this work, we present a method to regularize non-differentiable functions and generate feasible paths through mollification. Specifically, we approximate an arbitrary path with a differentiable function that can converge to it with arbitrary precision. Additionally, we provide a systematic method for bounding the curvature of generated paths, which we demonstrate by applying it to paths resulting from linking a sequence of waypoints with segments. The proposed approach is computationally efficient, enabling real-time implementation on microcontrollers and compatibility with standard trajectory tracking and path following algorithms.

    https://arxiv.org/abs/2512.13183


    Probabilistic Programming Meets Automata Theory: Exact Inference using Weighted Automata

    oai:arXiv.org:2512.13185v1

    arXiv:2512.13185v1 Announce Type: new Abstract: Probabilistic programs encode stochastic models as ordinary-looking programs with primitives for sampling numbers from predefined distributions and conditioning. Their applications include, among many others, machine learning and modeling of autonomous systems. The analysis of probabilistic programs is often quantitative - it involves reasoning about numerical properties like probabilities and expectations. A particularly important quantitative property of probabilistic programs is their posterior distribution, i.e., the distribution over possible outputs for a given input (or prior) distribution. Computing the posterior distribution exactly is known as exact inference. We present our current research using weighted automata, a generalization of the well-known finite automata, for performing exact inference in a restricted class of discrete probabilistic programs. This is achieved by encoding distributions over program variables - possibly with infinite support - as certain weighted automata. The semantics of our programming language then corresponds to common automata-theoretic constructions, such as product, concatenation, and others.

    https://arxiv.org/abs/2512.13185


    PolySet: Restoring the Statistical Ensemble Nature of Polymers for Machine Learning

    oai:arXiv.org:2512.13186v1

    arXiv:2512.13186v1 Announce Type: new Abstract: Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials consist of stochastic ensembles of chains with distributed lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. This ensemble-based encoding is independent of chemical detail, compatible with any molecular representation and illustrated here in the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz, Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning, naturally extensible to copolymers, block architectures, and other complex topologies.

    https://arxiv.org/abs/2512.13186


    WAY: Estimation of Vessel Destination in Worldwide AIS Trajectory

    oai:arXiv.org:2512.13190v1

    arXiv:2512.13190v1 Announce Type: new Abstract: The Automatic Identification System (AIS) enables data-driven maritime surveillance but suffers from reliability issues and irregular intervals. We address vessel destination estimation using global-scope AIS data by proposing a differentiated approach that recasts long port-to-port trajectories as a nested sequence structure. Using spatial grids, this method mitigates spatio-temporal bias while preserving detailed resolution. We introduce a novel deep learning architecture, WAY, designed to process these reformulated trajectories for long-term destination estimation days to weeks in advance. WAY comprises a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks. The representation layer generates multi-channel vector sequences from kinematic and non-kinematic features. CASP blocks utilize multi-headed channel- and self-attention for aggregation and sequential information delivery. Additionally, we propose a task-specialized Gradient Dropout (GD) technique to enable many-to-many training on single labels, preventing biased feedback surges by stochastically blocking gradient flow based on sample length. Experiments on 5-year AIS data demonstrate WAY's superiority over conventional spatial grid-based approaches regardless of trajectory progression. Results further confirm that adopting GD leads to performance gains. Finally, we explore WAY's potential for real-world application through multitask learning for ETA estimation.

    https://arxiv.org/abs/2512.13190


    CoRA: A Collaborative Robust Architecture with Hybrid Fusion for Efficient Perception

    oai:arXiv.org:2512.13191v1

    arXiv:2512.13191v1 Announce Type: new Abstract: Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and recover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.

    https://arxiv.org/abs/2512.13191


    POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

    oai:arXiv.org:2512.13192v1

    arXiv:2512.13192v1 Announce Type: new Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.

    https://arxiv.org/abs/2512.13192


    Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

    oai:arXiv.org:2512.13194v1

    arXiv:2512.13194v1 Announce Type: new Abstract: Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as \(1 - \max(P_{\mathrm{target}})\). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.

    https://arxiv.org/abs/2512.13194


    Noise-Resilient Quantum Aggregation on NISQ for Federated ADAS Learning

    oai:arXiv.org:2512.13196v1

    arXiv:2512.13196v1 Announce Type: new Abstract: Advanced Driver Assistance Systems (ADAS) increasingly employ Federated Learning (FL) to collaboratively train models across distributed vehicular nodes while preserving data privacy. Yet, conventional FL aggregation remains susceptible to noise, latency, and security constraints inherent to real-time vehicular networks. This paper introduces Noise-Resilient Quantum Federated Learning (NR-QFL), a hybrid quantum-classical framework that enables secure, low-latency aggregation through variational quantum circuits (VQCs) operating under Noisy Intermediate-Scale Quantum (NISQ) conditions. The framework encodes model parameters as quantum states with adaptive gate reparameterization, ensuring bounded-error convergence and provable resilience under Completely Positive Trace-Preserving (CPTP) dynamics. NR-QFL employs quantum entropy-based client selection and multi-server coordination for fairness and stability. Empirical validation shows consistent convergence with reduced gradient variance, lower communication overhead, and enhanced noise tolerance under constrained edge conditions. The framework establishes a scalable foundation for quantum-enhanced federated learning, enabling secure, efficient, and dynamically stable ADAS intelligence at the vehicular edge.

    https://arxiv.org/abs/2512.13196


    ALBATROSS: A robotised system for high-throughput electrolyte screening via automated electrolyte formulation, coin-cell fabrication, and electrochemical evaluation

    oai:arXiv.org:2512.13198v1

    arXiv:2512.13198v1 Announce Type: new Abstract: As battery technologies advance toward higher stability and energy density, the need for extensive cell-level testing across various component configurations becomes critical. To evaluate performance and understand the operating principles of batteries in laboratory scale, fabrication and evaluation of coin cells are essential processes. However, the conventional coin-cell assembly and testing processes require significant time and labor from researchers, posing challenges to high-throughput screening research. In this study, we introduce an Automated Li-ion BAttery Testing RObot SyStem (ALBATROSS), an automated system capable of electrolyte formulation, coin-cell assembly, and electrochemical evaluation. The system, integrated within a argon-filled glovebox, enables fully automated assembly and testing of up to 48 cells without researcher intervention. By incorporating custom-designed robot gripper and 3D-printed structures optimized for precise cell handling, ALBATROSS achieved high assembly reliability, yielding a relative standard deviation (RSD) of less than 1.2% in discharge capacity and a standard deviation of less than 3 {\Omega} in EIS measurements for NCM811||Li half cells. Owing to its high reliability and automation capability, ALBATROSS allows for the acquisition of high-quality coin-cell datasets, which are expected to accelerate the development of next-generation electrolytes.

    https://arxiv.org/abs/2512.13198


    Evaluating Adversarial Attacks on Federated Learning for Temperature Forecasting

    oai:arXiv.org:2512.13207v1

    arXiv:2512.13207v1 Announce Type: new Abstract: Deep learning and federated learning (FL) are becoming powerful partners for next-generation weather forecasting. Deep learning enables high-resolution spatiotemporal forecasts that can surpass traditional numerical models, while FL allows institutions in different locations to collaboratively train models without sharing raw data, addressing efficiency and security concerns. While FL has shown promise across heterogeneous regions, its distributed nature introduces new vulnerabilities. In particular, data poisoning attacks, in which compromised clients inject manipulated training data, can degrade performance or introduce systematic biases. These threats are amplified by spatial dependencies in meteorological data, allowing localized perturbations to influence broader regions through global model aggregation. In this study, we investigate how adversarial clients distort federated surface temperature forecasts trained on the Copernicus European Regional ReAnalysis (CERRA) dataset. We simulate geographically distributed clients and evaluate patch-based and global biasing attacks on regional temperature forecasts. Our results show that even a small fraction of poisoned clients can mislead predictions across large, spatially connected areas. A global temperature bias attack from a single compromised client shifts predictions by up to -1.7 K, while coordinated patch attacks more than triple the mean squared error and produce persistent regional anomalies exceeding +3.5 K. Finally, we assess trimmed mean aggregation as a defense mechanism, showing that it successfully defends against global bias attacks (2-13\% degradation) but fails against patch attacks (281-603\% amplification), exposing limitations of outlier-based defenses for spatially correlated data.

    https://arxiv.org/abs/2512.13207


    Kernelization dichotomies for hitting minors under structural parameterizations

    oai:arXiv.org:2512.13210v1

    arXiv:2512.13210v1 Announce Type: new Abstract: For a finite collection of connected graphs $\mathcal{F}$, the $\mathcal{F}$-MINOR-DELETION problem consists in, given a graph $G$ and an integer $\ell$, deciding whether $G$ contains a vertex set of size at most $\ell$ whose removal results in an $\mathcal{F}$-minor-free graph. We lift the existence of (approximate) polynomial kernels for $\mathcal{F}$-MINOR-DELETION by the solution size to (approximate) polynomial kernels parameterized by the vertex-deletion distance to graphs of bounded elimination distance to $\mathcal{F}$-minor-free graphs. This results in exact polynomial kernels for every family $\mathcal{F}$ that contains a planar graph, and an approximate polynomial kernel for PLANAR VERTEX DELETION. Moreover, combining our result with a previous lower bound, we obtain the following infinite set of dichotomies, assuming $NP \not\subseteq coNP/poly$: for any finite set $\mathcal{F}$ of biconnected graphs on at least three vertices containing a planar graph, and any minor-closed class of graphs $\mathcal{C}$, $\mathcal{F}$-MINOR-DELETION admits a polynomial kernel parameterized by the vertex-deletion distance to $\mathcal{C}$ if and only if $\mathcal{C}$ has bounded elimination distance to $\mathcal{F}$-minor-free graphs. For instance, this yields dichotomies for CACTUS VERTEX DELETION, OUTERPLANAR VERTEX DELETION, and TREEWIDTH-$t$ VERTEX DELETION for every integer $t \geq 0$. Prior to our work, such dichotomies were only known for the particular cases of VERTEX COVER and FEEDBACK VERTEX SET. Our approach builds on the techniques developed by Jansen and Pieterse [Theor. Comput. Sci. 2020] and also uses adaptations of some of the results by Jansen, de Kroon, and Wlodarczyk [STOC 2021].

    https://arxiv.org/abs/2512.13210


    Towards Secure Decentralized Applications and Consensus Protocols in Blockchains (on Selfish Mining, Undercutting Attacks, DAG-Based Blockchains, E-Voting, Cryptocurrency Wallets, Secure-Logging, and CBDC)

    oai:arXiv.org:2512.13213v1

    arXiv:2512.13213v1 Announce Type: new Abstract: With the rise of cryptocurrencies, many new applications built on decentralized blockchains have emerged. Blockchains are full-stack distributed systems where multiple sub-systems interact. While many deployed blockchains and decentralized applications need better scalability and performance, security is also critical. Due to their complexity, assessing blockchain and DAPP security requires a more holistic view than for traditional distributed or centralized systems. In this thesis, we summarize our contributions to blockchain and decentralized application security. We propose a security reference architecture to support standardized vulnerability and threat analysis. We study consensus security in single-chain Proof-of-Work blockchains, including resistance to selfish mining, undercutting, and greedy transaction selection, as well as related issues in DAG-based systems. We contribute to wallet security with a new classification of authentication schemes and a two-factor method based on One-Time Passwords. We advance e-voting with a practical boardroom voting protocol, extend it to a scalable version for millions of participants while preserving security and privacy, and introduce a repetitive voting framework that enables vote changes between elections while avoiding peak-end effects. Finally, we improve secure logging using blockchains and trusted computing through a centralized ledger that guarantees non-equivocation, integrity, and censorship evidence, then build on it to propose an interoperability protocol for central bank digital currencies that ensures atomic transfers.

    https://arxiv.org/abs/2512.13213


    Differentiable Material Point Method for the Control of Deformable Objects

    oai:arXiv.org:2512.13214v1

    arXiv:2512.13214v1 Announce Type: new Abstract: Controlling the deformation of flexible objects is challenging due to their non-linear dynamics and high-dimensional configuration space. This work presents a differentiable Material Point Method (MPM) simulator targeted at control applications. We exploit the differentiability of the simulator to optimize a control trajectory in an active damping problem for a hyperelastic rope. The simulator effectively minimizes the kinetic energy of the rope around 2$\times$ faster than a baseline MPPI method and to a 20% lower energy level, while using about 3% of the computation time.

    https://arxiv.org/abs/2512.13214


    Multi-directional Safe Rectangle Corridor-Based MPC for Nonholonomic Robots Navigation in Cluttered Environment

    oai:arXiv.org:2512.13215v1

    arXiv:2512.13215v1 Announce Type: new Abstract: Autonomous Mobile Robots (AMRs) have become indispensable in industrial applications due to their operational flexibility and efficiency. Navigation serves as a crucial technical foundation for accomplishing complex tasks. However, navigating AMRs in dense, cluttered, and semi-structured environments remains challenging, primarily due to nonholonomic vehicle dynamics, interactions with mixed static/dynamic obstacles, and the non-convex constrained nature of such operational spaces. To solve these problems, this paper proposes an Improved Sequential Model Predictive Control (ISMPC) navigation framework that systematically reformulates navigation tasks as sequential switched optimal control problems. The framework addresses the aforementioned challenges through two key innovations: 1) Implementation of a Multi-Directional Safety Rectangular Corridor (MDSRC) algorithm, which encodes the free space through rectangular convex regions to avoid collision with static obstacles, eliminating redundant computational burdens and accelerating solver convergence; 2) A sequential MPC navigation framework that integrates corridor constraints with barrier function constraints is proposed to achieve static and dynamic obstacle avoidance. The ISMPC navigation framework enables direct velocity generation for AMRs, simplifying traditional navigation algorithm architectures. Comparative experiments demonstrate the framework's superiority in free-space utilization ( an increase of 41.05$\%$ in the average corridor area) while maintaining real-time computational performance (average corridors generation latency of 3 ms).

    https://arxiv.org/abs/2512.13215


    A Unified Framework for Automated Assembly Sequence and Production Line Planning using Graph-based Optimization

    oai:arXiv.org:2512.13219v1

    arXiv:2512.13219v1 Announce Type: new Abstract: This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are utilized to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIPs computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at https://github.com/TUM-utg/PyCAALP, promotes further collaboration and adoption in both industrial and production research applications.

    https://arxiv.org/abs/2512.13219


    ModSSC: A Modular Framework for Semi-Supervised Classification on Heterogeneous Data

    oai:arXiv.org:2512.13228v1

    arXiv:2512.13228v1 Announce Type: new Abstract: Semi-supervised classification leverages both labeled and unlabeled data to improve predictive performance, but existing software support is fragmented across methods and modalities. We introduce ModSSC, an open source Python framework that unifies inductive and transductive semi-supervised classification in a modular code base. ModSSC implements a broad range of classical and recent algorithms, provides loaders for tabular, image, text, audio and graph datasets, and exposes a single configuration interface for specifying datasets, models and evaluation protocols. It supports both lightweight classical methods on small datasets running on CPU and recent deep approaches that can exploit multiple GPUs within the same experimental framework. Experiments are described declaratively in YAML, which facilitates reproducing existing work and running large comparative studies. ModSSC 1.0.0 is released under the MIT license with extensive documentation and tests, and is available at https://github.com/ModSSC/ModSSC.

    https://arxiv.org/abs/2512.13228


    Plant Equivalent Controller Realizations for Attack-Resilient Cyber-Physical Systems

    oai:arXiv.org:2512.13229v1

    arXiv:2512.13229v1 Announce Type: new Abstract: As cyber-physical systems (CPSs) become more dependent on data and communication networks, their vulnerability to false data injection (FDI) attacks has raised significant concerns. Among these, stealthy attacks, those that evade conventional detection mechanisms, pose a critical threat to closed-loop performance. This paper introduces a controller-oriented method to enhance CPS resiliency against such attacks without compromising nominal closed-loop behavior. Specifically, we propose the concept of plant equivalent controller (PEC) realizations, representing a class of dynamic output-feedback controllers that preserve the input-output behavior of a given base controller while exhibiting distinct robustness properties in the presence of disturbances and sensor attacks. To quantify and improve robustness, we employ reachable set analysis to assess the impact of stealthy attacks on the closed-loop dynamics. Building on this analysis, we provide mathematical tools (in terms of linear matrix inequalities) to synthesize the optimal PEC realization that minimizes the reachable set under peak-bounded disturbances. The proposed framework thus provides systematic analysis and synthesis tools to enhance the attack resilience of CPSs while maintaining the desired nominal performance. The effectiveness of the approach is demonstrated on the quadruple-tank process subject to stealthy sensor attacks.

    https://arxiv.org/abs/2512.13229


    Shared Nodes of Overlapping Communities in Complex Networks

    oai:arXiv.org:2512.13231v1

    arXiv:2512.13231v1 Announce Type: new Abstract: Overlapping communities are key characteristics of the structure and function analysis of complex networks. Shared or overlapping nodes within overlapping communities can form either subcommunities or act as intersections between larger communities. Nodes at the intersections that do not form subcommunities can be identified as overlapping nodes or as part of an internal structure of nested communities. To identify overlapping nodes, we apply a threshold rule based on the number of nodes in the nested structure. As the threshold value increases, the number of selected overlapping nodes decreases. This approach allows us to analyse the roles of nodes considered overlapping according to selection criteria, for example to reduce the effect of noise. We illustrate our method by using three small and two larger real-world network structures. In larger networks, minor disturbances can produce a multitude of slightly different solutions, but the core communities remain robust, allowing other variations to be treated as noise. While this study employs our own method for community detection, other approaches can also be applied. Exploring the properties of shared nodes in overlapping communities of complex networks is a novel area of research with diverse applications in social network analysis, cybersecurity, and other fields in network science.

    https://arxiv.org/abs/2512.13231


    Measurement of Material Volume Fractions in a Microwave Resonant Cavity Sensor Using Convolutional Neural Network

    oai:arXiv.org:2512.13233v1

    arXiv:2512.13233v1 Announce Type: new Abstract: A non-destructive, real-time method for estimating the volume fraction of a dielectric mixture inside a resonant cavity is presented. A convolutional neural network (CNN)-based approach is used to estimate the fractional composition of two-phase dielectric mixtures inside a resonant cavity using scattering parameter (S-parameter) measurements. A rectangular cavity sensor with a strip feed structure is characterized using a vector network analyzer (VNA) from 0.01--20~GHz. The CNN is trained using both simulated and experimentally measured S-parameters and achieves high predictive accuracy even without de-embedding or filtering, demonstrating robustness to measurement imperfections. The simulation results achieve a coefficient of determination ($R^2$)=0.99 using $k$-fold cross-validation, while the experimental model using raw data achieves an $R^2=0.94$ with a mean absolute error (MAE) below 6\%. Data augmentation further improves the accuracy of the experimental prediction to above $R^2=0.998$ (MAE$<$0.72\%). The proposed method enables rapid, non-destructive, accurate, low-cost, and real-time estimation of material fractions, illustrating strong potential for sensing applications in microwave material characterization.

    https://arxiv.org/abs/2512.13233


    CORE: Contrastive Masked Feature Reconstruction on Graphs

    oai:arXiv.org:2512.13235v1

    arXiv:2512.13235v1 Announce Type: new Abstract: In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.

    https://arxiv.org/abs/2512.13235


    Learning to Retrieve with Weakened Labels: Robust Training under Label Noise

    oai:arXiv.org:2512.13237v1

    arXiv:2512.13237v1 Announce Type: new Abstract: Neural Encoders are frequently used in the NLP domain to perform dense retrieval tasks, for instance, to generate the candidate documents for a given query in question-answering tasks. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Although existing works have attempted to mitigate these problems by incorporating modified loss functions or data cleaning, these approaches either require some hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to generate robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query document pair, we allow for a set of plausible labels derived from both the observed supervision and the model's confidence scores. We perform an extensive evaluation considering two retrieval models, one re-ranking model, considering four diverse ranking datasets. To this end, we also consider a realistic noisy setting by using a semantic-aware noise generation technique to generate different ratios of noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.

    https://arxiv.org/abs/2512.13237


    Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

    oai:arXiv.org:2512.13238v1

    arXiv:2512.13238v1 Announce Type: new Abstract: We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of OZ'' data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.

    https://arxiv.org/abs/2512.13238


    A Decision Support Framework for Blockchain Pattern Selection Based on Soft Goals

    oai:arXiv.org:2512.13239v1

    arXiv:2512.13239v1 Announce Type: new Abstract: Blockchain technology is gaining momentum across many sectors. Whereas blockchain solutions have important positive effects on the business domain, they also introduce constraints and may cause delayed or unforeseen negative effects, undermining business strategies. The diversity of blockchain patterns and lack of standardized frameworks linking business goals to technical design decisions make pattern selection a complex task for system architects. To address this challenge, we propose Blockchain--Technology-Aware Enterprise Modeling (BC-TEAEM), a decision support framework that combines ontologies of blockchain patterns and domain-independent soft goals with a multi-criteria decision-making approach. The framework focuses on the interplay between a domain expert and a technical expert to ensure alignment and traceability. By iteratively capturing and refining preferences, BC-TEAEM supports systematic selection of blockchain patterns. We develop a prototype decision support tool implementing our method and validate it through a case study of a pharmaceutical company's supply chain traceability system, demonstrating the framework's applicability. %a supply chain traceability case study.

    https://arxiv.org/abs/2512.13239


    Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

    oai:arXiv.org:2512.13240v1

    arXiv:2512.13240v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.

    https://arxiv.org/abs/2512.13240


    Fair Coordination in Strategic Scheduling

    oai:arXiv.org:2512.13244v1

    arXiv:2512.13244v1 Announce Type: new Abstract: We consider a scheduling problem of strategic agents representing jobs of different weights. Each agent has to decide on one of a finite set of identical machines to get their job processed. In contrast to the common and exclusive focus on makespan minimization, we want the outcome to be fair under strategic considerations of the agents. Two natural properties are credibility, which ensures that the assignment is a Nash equilibrium and equality, requiring that agents with equal-weight jobs are assigned to machines of equal load. We combine these two with a hierarchy of fairness properties based on envy-freeness together with several relaxations based on the idea that envy seems more justified towards agents with a higher weight. We present a complete complexity landscape for satisfiability and decision versions of these properties, alone or in combination, and study them as structural constraints under makespan optimization. For our positive results, we develop a unified algorithmic approach, where we achieve different properties by fine-tuning key subroutines.

    https://arxiv.org/abs/2512.13244


    q-Analogue of Hamiltonian Monte Carlo method

    oai:arXiv.org:2512.13246v1

    arXiv:2512.13246v1 Announce Type: new Abstract: Building upon Lagrangian mechanics on Wess's $q$-commutative spaces, we derive the $q$-deformed Hamiltonian dynamics as formulated by Lavagno et al. (2006). We then develop a computationally tractable scheme and propose a novel Hamiltonian Monte Carlo sampler ($q$-HMC). The proposed $q$-HMC method is shown to satisfy the detailed balance principle. Numerical experiments on distributions with explicit potential functions demonstrate its efficacy, particularly in exploring stiff energy landscapes. This method is also applied to draw samples from the Bayesian posterior distribution of inverse problems. The numerical test for the posterior distribution with stiff potential further shows the advantage of $q$-HMC. And it yields the identical computational implementation process to that of HMC when used to deal with functional reconstruction problems.

    https://arxiv.org/abs/2512.13246


    STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

    oai:arXiv.org:2512.13247v1

    arXiv:2512.13247v1 Announce Type: new Abstract: This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.

    https://arxiv.org/abs/2512.13247


    Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection

    oai:arXiv.org:2512.13250v1

    arXiv:2512.13250v1 Announce Type: new Abstract: Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.

    https://arxiv.org/abs/2512.13250


    DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

    oai:arXiv.org:2512.13251v1

    arXiv:2512.13251v1 Announce Type: new Abstract: Recent codec-based language models~(LMs) have revolutionized text-to-speech~(TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, contains two core stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.

    https://arxiv.org/abs/2512.13251


    Fostering human learning is crucial for boosting human-AI synergy

    oai:arXiv.org:2512.13253v1

    arXiv:2512.13253v1 Announce Type: new Abstract: The collaboration between humans and artificial intelligence (AI) holds the promise of achieving superior outcomes compared to either acting alone. Nevertheless, our understanding of the conditions that facilitate such human-AI synergy remains limited. A recent meta-analysis showed that, on average, human-AI combinations do not outperform the better individual agent, indicating overall negative human-AI synergy. We argue that this pessimistic conclusion arises from insufficient attention to human learning in the experimental designs used. To substantiate this claim, we re-analyzed all 74 studies included in the original meta-analysis, which yielded two new findings. First, most previous research overlooked design features that foster human learning, such as providing trial-by-trial outcome feedback to participants. Second, our re-analysis, using robust Bayesian meta-regressions, demonstrated that studies providing outcome feedback show relatively higher synergy than those without outcome feedback. Crucially, when feedback is paired with AI explanations we tend to find positive human-AI synergy, while AI explanations provided without feedback were strongly linked to negative synergy, indicating that explanations are useful for synergy only when humans can learn to verify the AI's reliability through feedback. We conclude that the current literature underestimates the potential for human-AI collaboration because it predominantly relies on experimental designs that do not facilitate human learning, thus hindering humans from effectively adapting their collaboration strategies. We therefore advocate for a paradigm shift in human-AI interaction research that explicitly incorporates and tests human learning mechanisms to enhance our understanding of and support for successful human-AI collaboration.

    https://arxiv.org/abs/2512.13253


    B\'ezierFlow: B\'ezier Stochastic Interpolant Schedulers for Few-Step Generation

    oai:arXiv.org:2512.13255v1

    arXiv:2512.13255v1 Announce Type: new Abstract: We introduce B\'ezierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. B\'ezierFlow achieves a 2-3x performance improvement for sampling with $\leq$ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as B\'ezier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to B\'ezier control points. Across a range of pretrained diffusion and flow models, B\'ezierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to B\'ezier-based trajectory transformations.

    https://arxiv.org/abs/2512.13255


    From Educational Analytics to AI Governance: Transferable Lessons from Complex Systems Interventions

    oai:arXiv.org:2512.13260v1

    arXiv:2512.13260v1 Announce Type: new Abstract: Both student retention in higher education and artificial intelligence governance face a common structural challenge: the application of linear regulatory frameworks to complex adaptive systems. Risk-based approaches dominate both domains, yet systematically fail because they assume stable causal pathways, predictable actor responses, and controllable system boundaries. This paper extracts transferable methodological principles from CAPIRE (Curriculum, Archetypes, Policies, Interventions & Research Environment), an empirically validated framework for educational analytics that treats student dropout as an emergent property of curricular structures, institutional rules, and macroeconomic shocks. Drawing on longitudinal data from engineering programmes and causal inference methods, CAPIRE demonstrates that well-intentioned interventions routinely generate unintended consequences when system complexity is ignored. We argue that five core principles developed within CAPIRE - temporal observation discipline, structural mapping over categorical classification, archetype-based heterogeneity analysis, causal mechanism identification, and simulation-based policy design - transfer directly to the challenge of governing AI systems. The isomorphism is not merely analogical: both domains exhibit non-linearity, emergence, feedback loops, strategic adaptation, and path dependence. We propose Complex Systems AI Governance (CSAIG) as an integrated framework that operationalises these principles for regulatory design, shifting the central question from "how risky is this AI system?" to "how does this intervention reshape system dynamics?" The contribution is twofold: demonstrating that empirical lessons from one complex systems domain can accelerate governance design in another, and offering a concrete methodological architecture for complexity-aware AI regulation.

    https://arxiv.org/abs/2512.13260


    Post-Training and Test-Time Scaling of Generative Agent Behavior Models for Interactive Autonomous Driving

    oai:arXiv.org:2512.13262v1

    arXiv:2512.13262v1 Announce Type: new Abstract: Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Our Warm-K method-based test-time scaling enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.

    https://arxiv.org/abs/2512.13262


    An End-to-End Neural Network Transceiver Design for OFDM System with FPGA-Accelerated Implementation

    oai:arXiv.org:2512.13263v1

    arXiv:2512.13263v1 Announce Type: new Abstract: The evolution toward sixth-generation (6G) wireless networks demands high-performance transceiver architectures capable of handling complex and dynamic environments. Conventional orthogonal frequency-division multiplexing (OFDM) receivers rely on cascaded discrete Fourier transform (DFT) and demodulation blocks, which are prone to inter-stage error propagation and suboptimal global performance. In this work, we propose two neural network (NN) models DFT-Net and Demodulation-Net (Demod-Net) to jointly replace the IDFT/DFT and demodulation modules in an OFDM transceiver. The models are trained end-to-end (E2E) to minimize bit error rate (BER) while preserving operator equivalence for hybrid deployment. A customized DFT-Demodulation Net Accelerator (DDNA) is further developed to efficiently map the proposed networks onto field-programmable gate array (FPGA) platforms. Leveraging fine-grained pipelining and block matrix operations, DDNA achieves high throughput and flexibility under stringent latency constraints. Experimental results show that the DL-based transceiver consistently outperforms the conventional OFDM system across multiple modulation schemes. With only a modest increase in hardware resource usage, it achieves approximately 1.5 dB BER gain and up to 66\% lower execution time.

    https://arxiv.org/abs/2512.13263


    Low-Complexity Monitoring and Compensation of Transceiver IQ Imbalance by Multi-dimensional Architecture for Dual-Polarization 16 Quadrature Amplitude Modulation

    oai:arXiv.org:2512.13266v1

    arXiv:2512.13266v1 Announce Type: new Abstract: In this paper, a low-complexity IQ imbalance compensation architecture is proposed, which reduces the effects of in-phase (I) and quadrature (Q) imbalance. The architecture consists of transceiver IQ skew estimation methods and a low-complexity MIMO equalizer structure. Before the IQ skew estimation, the chromatic dispersion(CD) is pre-compensated in the transmitter(TX) by chirp filtering. The receiver(RX) IQ skew is estimated by Gardner's phase detector, and the TX skew is estimated by finding the value that yields the lowest equalizer error. The low-complexity MIMO equalizer consists of a complex-valued MIMO(CV-MIMO) and a real-valued DD-LMS MIMO(RV-MIMO), which employ a butterfly and a non-butterfly structure, respectively. The CV-MIMO is used to perform polarization demultiplexing. The RV-MIMO equalizes each of the two polarisations and simultaneously compensates for the TX IQ imbalance. The architecture first compensates for the IQ skew at low-complexity, and the other imperfections are compensated by the low-complexity MIMO equalizer. Therefore, this architecture can equalize signals impaired by the transceiver IQ imbalance with low complexity. A 100 km transmission simulation and experiment with 36 Gbaud dual-polarization quadrature amplitude modulation(DP-QAM) signals and offline DSP showed that, with the RX IQ skew estimation, the number of real multiplications is reduced by more than 70% compared with conventional cases. With the low-complexity MIMO equalizer, the number of real multiplications is reduced by 51% compared with 4x4 MIMO

    https://arxiv.org/abs/2512.13266


    SPARS: A Reinforcement Learning-Enabled Simulator for Power Management in HPC Job Scheduling

    oai:arXiv.org:2512.13268v1

    arXiv:2512.13268v1 Announce Type: new Abstract: High-performance computing (HPC) clusters consume enormous amounts of energy, with idle nodes as a major source of waste. Powering down unused nodes can mitigate this problem, but poorly timed transitions introduce long delays and reduce overall performance. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Served and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to dynamically decide when nodes should be powered on or off. Users can configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics-including energy usage, wasted power, job waiting times, and node utilization-and provides Gantt chart visualizations to analyze scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks that rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.

    https://arxiv.org/abs/2512.13268


    Lightweight Dynamic Modeling of Cable-Driven Continuum Robots Based on Actuation-Space Energy Formulation

    oai:arXiv.org:2512.13271v1

    arXiv:2512.13271v1 Announce Type: new Abstract: Cable-driven continuum robots (CDCRs) require accurate, real-time dynamic models for high-speed dynamics prediction or model-based control, making such capability an urgent need. In this paper, we propose the Lightweight Actuation-Space Energy Modeling (LASEM) framework for CDCRs, which formulates actuation potential energy directly in actuation space to enable lightweight yet accurate dynamic modeling. Through a unified variational derivation, the governing dynamics reduce to a single partial differential equation (PDE), requiring only the Euler moment balance while implicitly incorporating the Newton force balance. By also avoiding explicit computation of cable-backbone contact forces, the formulation simplifies the model structure and improves computational efficiency while preserving geometric accuracy and physical consistency. Importantly, the proposed framework for dynamic modeling natively supports both force-input and displacement-input actuation modes, a capability seldom achieved in existing dynamic formulations. Leveraging this lightweight structure, a Galerkin space-time modal discretization with analytical time-domain derivatives of the reduced state further enables an average 62.3% computational speedup over state-of-the-art real-time dynamic modeling approaches.

    https://arxiv.org/abs/2512.13271


    CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

    oai:arXiv.org:2512.13276v1

    arXiv:2512.13276v1 Announce Type: new Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation

    https://arxiv.org/abs/2512.13276


    AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

    oai:arXiv.org:2512.13278v1

    arXiv:2512.13278v1 Announce Type: new Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.

    https://arxiv.org/abs/2512.13278


    AIR: Post-training Data Selection for Reasoning via Attention Head Influence

    oai:arXiv.org:2512.13279v1

    arXiv:2512.13279v1 Announce Type: new Abstract: LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.

    https://arxiv.org/abs/2512.13279


    Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

    oai:arXiv.org:2512.13281v1

    arXiv:2512.13281v1 Announce Type: new Abstract: Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: \textbf{(i) Immersive ASMR video-audio sources.} Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. \textbf{(ii) Peer-Review evaluation.} An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56\% accuracy (random 50\%), far below that of human experts (81.25\%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.

    https://arxiv.org/abs/2512.13281


    Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

    oai:arXiv.org:2512.13282v1

    arXiv:2512.13282v1 Announce Type: new Abstract: The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2). This work provides significant insights into key performance aspects of optimizing GEMM workloads on Ryzen AI NPUs.

    https://arxiv.org/abs/2512.13282


    SAMAY: System for Acoustic Measurement and Analysis

    oai:arXiv.org:2512.13284v1

    arXiv:2512.13284v1 Announce Type: new Abstract: This paper describes an automatic bird call recording system called SAMAY, which is developed to study bird species by creating a database of large amounts of bird acoustic data. By analysing the recorded bird call data, the system can also be used for automatic classification of bird species, monitoring bird populations and analysing the impact of environmental changes. The system is driven through a powerful STM32F407 series microcontroller, supports 4 microphones, is equipped with 128 GB of storage capacity, and is powered by a 10400 mAh battery pack interfaced with a solar charger. In addition, the device is user-configurable over USB and Wi-Fi during runtime, ensuring user-friendly operation during field deployment.

    https://arxiv.org/abs/2512.13284


    CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

    oai:arXiv.org:2512.13285v1

    arXiv:2512.13285v1 Announce Type: new Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.

    https://arxiv.org/abs/2512.13285


    Integrating Causal Reasoning into Automated Fact-Checking

    oai:arXiv.org:2512.13286v1

    arXiv:2512.13286v1 Announce Type: new Abstract: In fact-checking applications, a common reason to reject a claim is to detect the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in an evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhance explainability of verdict prediction.

    https://arxiv.org/abs/2512.13286


    LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

    oai:arXiv.org:2512.13290v1

    arXiv:2512.13290v1 Announce Type: new Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.

    https://arxiv.org/abs/2512.13290


    Information-Theoretic Limits of Integrated Sensing and Communication with Finite Learning Capacity

    oai:arXiv.org:2512.13292v1

    arXiv:2512.13292v1 Announce Type: new Abstract: This paper develops a unified information-theoretic framework for artificial-intelligence (AI)-aided integrated sensing and communication (ISAC), where a learning component with limited representational capacity is embedded within the transceiver loop. The study introduces the concept of an AI capacity budget to quantify how the finite ability of a learning model constrains joint communication and sensing performance. Under this framework, the paper derives both converse (upper) and achievability (lower) bounds that define the achievable rate-sensing region. For Gaussian channels, the effect of limited learning capacity is shown to behave as an equivalent additive noise, allowing simple analytical expressions for the resulting communication rate and sensing distortion. The theory is then extended to Rayleigh and Rician fading as well as to multiple-input multiple-output (MIMO) systems through new matrix inequalities and a constructive mapping between AI capacity and effective noise covariance. Resource allocation between sensing and communication is optimized under this learning constraint, yielding closed-form conditions in the Gaussian case. A general learning-information trade-off law is also established, linking the representational power of the learning module to the achievable performance frontier. Finally, a practical variational training procedure is proposed to enforce the capacity constraint and to guide empirical evaluation. The derived scaling laws provide quantitative insight for co-designing model size, waveform, and hardware in next-generation ISAC systems.

    https://arxiv.org/abs/2512.13292


    Intrinsic-Motivation Multi-Robot Social Formation Navigation with Coordinated Exploration

    oai:arXiv.org:2512.13293v1

    arXiv:2512.13293v1 Announce Type: new Abstract: This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm introducing an intrinsic motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: https://github.com/czxhunzi/CEMRRL.

    https://arxiv.org/abs/2512.13293


    MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data

    oai:arXiv.org:2512.13297v1

    arXiv:2512.13297v1 Announce Type: new Abstract: In medical data analysis, extracting deep insights from complex, multi-modal datasets is essential for improving patient care, increasing diagnostic accuracy, and optimizing healthcare operations. However, there is currently a lack of high-quality datasets specifically designed to evaluate the ability of large multi-modal models (LMMs) to discover medical insights. In this paper, we introduce MedInsightBench, the first benchmark that comprises 332 carefully curated medical cases, each annotated with thoughtfully designed insights. This benchmark is intended to evaluate the ability of LMMs and agent frameworks to analyze multi-modal medical image data, including posing relevant questions, interpreting complex findings, and synthesizing actionable insights and recommendations. Our analysis indicates that existing LMMs exhibit limited performance on MedInsightBench, which is primarily attributed to their challenges in extracting multi-step, deep insights and the absence of medical expertise. Therefore, we propose MedInsightAgent, an automated agent framework for medical data analysis, composed of three modules: Visual Root Finder, Analytical Insight Agent, and Follow-up Question Composer. Experiments on MedInsightBench highlight pervasive challenges and demonstrate that MedInsightAgent can improve the performance of general LMMs in medical data insight discovery.

    https://arxiv.org/abs/2512.13297


    MiniLingua: A Small Open-Source LLM for European Languages

    oai:arXiv.org:2512.13298v1

    arXiv:2512.13298v1 Announce Type: new Abstract: Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.

    https://arxiv.org/abs/2512.13298


    No One Left Behind: How to Exploit the Incomplete and Skewed Multi-Label Data for Conversion Rate Prediction

    oai:arXiv.org:2512.13300v1

    arXiv:2512.13300v1 Announce Type: new Abstract: In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversion actions due to privacy or other constraints, making the labels of multi-task data incomplete. If the model is trained on all available samples where advertisers submit user conversion actions, it may struggle when deployed to serve a subset of advertisers targeting specific conversion actions, as the training and deployment data distributions are mismatched. While considerable MTL efforts have been made, a long-standing challenge is how to effectively train a unified model with the incomplete and skewed multi-label data. In this paper, we propose a fine-grained Knowledge transfer framework for Asymmetric Multi-Label data (KAML). We introduce an attribution-driven masking strategy (ADM) to better utilize data with asymmetric multi-label data in training. However, the more relaxed masking in ADM is a double-edged sword: it provides additional training signals but also introduces noise due to skewed data. To address this, we propose a hierarchical knowledge extraction mechanism (HKE) to model the sample discrepancy within the target task tower. Finally, to maximize the utility of unlabeled samples, we incorporate ranking loss strategy to further enhance our model. The effectiveness of KAML has been demonstrated through comprehensive evaluations on offline industry datasets and online A/B tests, which show significant performance improvements over existing MTL baselines.

    https://arxiv.org/abs/2512.13300


    On the impact of geometric variance on the performance of formed parts: A probabilistic approach on the example of airbag pressure bins

    oai:arXiv.org:2512.13302v1

    arXiv:2512.13302v1 Announce Type: new Abstract: Scatter in properties resulting from manufacturing is a great challenge in lightweight design, requiring consideration of not only the average mechanical performance but also the variance which is done e.g., by conservative safety factors. One contributor to this variance is the inherent geometric variability in the formed part. To isolate and quantify this effect, we present a probabilistic numerical study, aiming to assess the impact of geometric variance on the resulting part performance. By modelling geometric deviations stochastically, we aim to establish a correlation between the variance in geometry with the resulting variance in performance. The study is done on the example of an airbag pressure bin, where a better understanding of this correlation is crucial, as it allows for the design of a lighter part without changing the manufacturing process. Instead, we aim to implement more targeted and effective quality assurance, informed by the performance impact of geometric deviations.

    https://arxiv.org/abs/2512.13302


    ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

    oai:arXiv.org:2512.13303v1

    arXiv:2512.13303v1 Announce Type: new Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.

    https://arxiv.org/abs/2512.13303


    Humanoid Robot Running Through Random Stepping Stones and Jumping Over Obstacles: Step Adaptation Using Spring-Mass Trajectories

    oai:arXiv.org:2512.13304v1

    arXiv:2512.13304v1 Announce Type: new Abstract: This study proposes a step adaptation framework for running through spring-mass trajectories and deadbeat control gain libraries. It includes four main parts: (1) Automatic spring-mass trajectory library generation; (2) Deadbeat control gain library generation through an actively controlled template model that resembles the whole-body dynamics well; (3) Trajectory selection policy development for step adaptation; (4) Mapping spring-mass trajectories to a humanoid model through a whole-body control (WBC) framework also accounting for closed-kinematic chain systems, self collisions, and reactive limb swinging. We show the inclusiveness and the robustness of the proposed framework through various challenging and agile behaviors such as running through randomly generated stepping stones, jumping over random obstacles, performing slalom motions, changing the running direction suddenly with a random leg, and rejecting significant disturbances and uncertainties through the MuJoCo physics simulator. We also perform additional simulations under a comprehensive set of uncertainties and noise to better justify the proposed method's robustness to real-world challenges, including signal noise, imprecision, modeling errors, and delays. All the aforementioned behaviors are performed with a single library and the same set of WBC control parameters without additional tuning. The spring-mass and the deadbeat control gain library are automatically computed in 4.5 seconds in total for 315 different trajectories.

    https://arxiv.org/abs/2512.13304


    Resource Orchestration and Optimization in 6G Extreme-edge Scenario

    oai:arXiv.org:2512.13306v1

    arXiv:2512.13306v1 Announce Type: new Abstract: 6G networks envision a pervasive service infrastructure spanning from centralized cloud to distributed edge and highly dynamic extreme-edge domains. This vision introduces significant challenges in orchestrating services over heterogeneous, volatile, and often mobile resources beyond traditional operator control. To address these challenges, this demo presents a 6G-ready orchestration architecture focused on resource prediction and service resilience at the extreme-edge. The proposed solution integrates (i) an AI/ML-based Infrastructure Status Prediction Module, (ii) a Monitoring System capable of handling large-scale, diverse telemetry, and (iii) a Decision Engine and Actuator that ensures proactive

    https://arxiv.org/abs/2512.13306


    KlingAvatar 2.0 Technical Report

    oai:arXiv.org:2512.13313v1

    arXiv:2512.13313v1 Announce Type: new Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.

    https://arxiv.org/abs/2512.13313


    ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning

    oai:arXiv.org:2512.13316v1

    arXiv:2512.13316v1 Announce Type: new Abstract: We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Through complementary privacy mechanisms: DP-SGD with adaptive clipping and Lipschitz regularized VAE decoders and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations. Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures

    https://arxiv.org/abs/2512.13316


    Face Identity Unlearning for Retrieval via Embedding Dispersion

    oai:arXiv.org:2512.13317v1

    arXiv:2512.13317v1 Announce Type: new Abstract: Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.

    https://arxiv.org/abs/2512.13317


    Temporal parallelisation of continuous-time maximum-a-posteriori trajectory estimation

    oai:arXiv.org:2512.13319v1

    arXiv:2512.13319v1 Announce Type: new Abstract: This paper proposes a parallel-in-time method for computing continuous-time maximum-a-posteriori (MAP) trajectory estimates of the states of partially observed stochastic differential equations (SDEs), with the goal of improving computational speed on parallel architectures. The MAP estimation problem is reformulated as a continuous-time optimal control problem based on the Onsager-Machlup functional. This reformulation enables the use of a previously proposed parallel-in-time solution for optimal control problems, which we adapt to the current problem. The structure of the resulting optimal control problem admits a parallel solution based on parallel associative scan algorithms. In the linear Gaussian special case, it yields a parallel Kalman-Bucy filter and a parallel continuous-time Rauch-Tung-Striebel smoother. These linear computational methods are further extended to nonlinear continuous-time state-space models through Taylor expansions. We also present the corresponding parallel two-filter smoother. The graphics processing unit (GPU) experiments on linear and nonlinear models demonstrate that the proposed framework achieves a significant speedup in computations while maintaining the accuracy of sequential algorithms.

    https://arxiv.org/abs/2512.13319


    Error-Driven Prompt Optimization for Arithmetic Reasoning

    oai:arXiv.org:2512.13323v1

    arXiv:2512.13323v1 Announce Type: new Abstract: Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt-rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8\%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.

    https://arxiv.org/abs/2512.13323


    Security and Detectability Analysis of Unicode Text Watermarking Methods Against Large Language Models

    oai:arXiv.org:2512.13325v1

    arXiv:2512.13325v1 Announce Type: new Abstract: Securing digital text is becoming increasingly relevant due to the widespread use of large language models. Individuals' fear of losing control over data when it is being used to train such machine learning models or when distinguishing model-generated output from text written by humans. Digital watermarking provides additional protection by embedding an invisible watermark within the data that requires protection. However, little work has been taken to analyze and verify if existing digital text watermarking methods are secure and undetectable by large language models. In this paper, we investigate the security-related area of watermarking and machine learning models for text data. In a controlled testbed of three experiments, ten existing Unicode text watermarking methods were implemented and analyzed across six large language models: GPT-5, GPT-4o, Teuken 7B, Llama 3.3, Claude Sonnet 4, and Gemini 2.5 Pro. The findings of our experiments indicate that, especially the latest reasoning models, can detect a watermarked text. Nevertheless, all models fail to extract the watermark unless implementation details in the form of source code are provided. We discuss the implications for security researchers and practitioners and outline future research opportunities to address security concerns.

    https://arxiv.org/abs/2512.13325


    FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

    oai:arXiv.org:2512.13330v1

    arXiv:2512.13330v1 Announce Type: new Abstract: We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.

    https://arxiv.org/abs/2512.13330


    A Multi-Worker Assembly Line Rebalancing with Spatial and Ergonomic Considerations

    oai:arXiv.org:2512.13331v1

    arXiv:2512.13331v1 Announce Type: new Abstract: This work addresses the Assembly Line Rebalancing Problem in manual assembly systems where multiple workers operate in parallel within the same station - an industrially relevant scenario that remains insufficiently explored in the literature. A multi-objective optimization model is proposed that incorporates task reassignment, worker allocation, ergonomic evaluation, and explicit spatial feasibility through work-area constraints. The formulation minimizes deviations from the current configuration while promoting balanced workload and ergonomic conditions among workers. Computational experiments on synthetic problem instances demonstrate that the model consistently generates feasible and human-centered reconfigurations across varying cycle-time conditions, highlighting its potential as a decision-support tool for industrial rebalancing in flexible production environments.

    https://arxiv.org/abs/2512.13331


    Quantum Disruption: An SOK of How Post-Quantum Attackers Reshape Blockchain Security and Performance

    oai:arXiv.org:2512.13333v1

    arXiv:2512.13333v1 Announce Type: new Abstract: As quantum computing advances toward practical deployment, it threatens a wide range of classical cryptographic mechanisms, including digital signatures, key exchange protocols, public-key encryption, and certain hash-based constructions that underpin modern network infrastructures. These primitives form the security backbone of most blockchain platforms, raising serious concerns about the long-term viability of blockchain systems in a post-quantum world. Although migrating to post-quantum cryptography may appear straightforward, the substantially larger key sizes and higher computational costs of post-quantum primitives can introduce significant challenges and, in some cases, render such transitions impractical for blockchain environments. In this paper, we examine the implications of adopting post-quantum cryptography in blockchain systems across four key dimensions. We begin by identifying the cryptographic primitives within blockchain architectures that are most vulnerable to quantum attacks, particularly those used in consensus mechanisms, identity management, and transaction validation. We then survey proposed post-quantum adaptations across existing blockchain designs, analyzing their feasibility within decentralized and resource-constrained settings. Building on this analysis, we evaluate how replacing classical primitives with post-quantum alternatives affects system performance, protocol dynamics, and the incentive and trust structures that sustain blockchain ecosystems. Our study demonstrates that integrating post-quantum signature schemes into blockchain systems is not a simple drop-in replacement; instead, it requires careful architectural redesign, as naive substitutions risk undermining both security guarantees and operational efficiency.

    https://arxiv.org/abs/2512.13333


    KD-PINN: Knowledge-Distilled PINNs for ultra-low-latency real-time neural PDE solvers

    oai:arXiv.org:2512.13336v1

    arXiv:2512.13336v1 Announce Type: new Abstract: This work introduces Knowledge-Distilled Physics-Informed Neural Networks (KD-PINN), a framework that transfers the predictive accuracy of a high-capacity teacher model to a compact student through a continuous adaptation of the Kullback-Leibler divergence. To confirm its generality for various dynamics and dimensionalities, the framework is evaluated on a representative set of partial differential equations (PDEs). In all tested cases, the student model preserved the teacher's physical accuracy, with a mean RMSE increase below 0.64%, and achieved inference speedups ranging from 4.8x (Navier-Stokes) to 6.9x (Burgers). The distillation process also revealed a regularizing effect. With an average inference latency of 5.3 ms on CPU, the distilled models enter the ultra-low-latency real-time regime defined by sub-10 ms performance. Finally, this study examines how knowledge distillation reduces inference latency in PINNs to contribute to the development of accurate ultra-low-latency neural PDE solvers.

    https://arxiv.org/abs/2512.13336


    FROC: A Unified Framework with Risk-Optimized Control for Machine Unlearning in LLMs

    oai:arXiv.org:2512.13337v1

    arXiv:2512.13337v1 Announce Type: new Abstract: Machine unlearning (MU) seeks to eliminate the influence of specific training examples from deployed models. As large language models (LLMs) become widely used, managing risks arising from insufficient forgetting or utility loss is increasingly crucial. Current MU techniques lack effective mechanisms for evaluating and controlling these risks, hindering the selection of strategies that appropriately balance safety and utility, and raising trust concerns surrounding the "right to be forgotten." To address these issues, we propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in LLMs. FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. This probability-based constraint enables FROC to compare MU strategies, identify feasible operating regions, and guide hyperparameter selection according to desired trade-offs between forgetting sufficiency and utility preservation. To operationalize this constraint, FROC introduces a smoothly varying continuous risk model that aggregates forgetting deficiency and utility degradation into a single configuration-level score. Building on conformal risk analysis, FROC computes (1) the Conformal Unlearning Risk (CUR), a data-driven estimated value on the probability that forgotten samples continue to influence model predictions, and (2) risk-controlled configuration sets, which identify unlearning hyperparameters that are valid under the specified risk budget. Experiments across multiple LLM MU methods demonstrate that FROC produces stable, interpretable risk landscapes and reveals consistent relationships between unlearning configurations, semantic shift, and utility impact. FROC reframes MU as a controllable, risk-aware process and offers a practical foundation for managing unlearning behavior in large-scale LLM deployments.

    https://arxiv.org/abs/2512.13337


    Link-Aware Energy-Frugal Continual Learning for Fault Detection in IoT Networks

    oai:arXiv.org:2512.13340v1

    arXiv:2512.13340v1 Announce Type: new Abstract: The use of lightweight machine learning (ML) models in internet of things (IoT) networks enables resource constrained IoT devices to perform on-device inference for several critical applications. However, the inference accuracy deteriorates due to the non-stationarity in the IoT environment and limited initial training data. To counteract this, the deployed models can be updated occasionally with new observed data samples. However, this approach consumes additional energy, which is undesirable for energy constrained IoT devices. This letter introduces an event-driven communication framework that strategically integrates continual learning (CL) in IoT networks for energy-efficient fault detection. Our framework enables the IoT device and the edge server (ES) to collaboratively update the lightweight ML model by adapting to the wireless link conditions for communication and the available energy budget. Evaluation on real-world datasets show that the proposed approach can outperform both periodic sampling and non-adaptive CL in terms of inference recall; our proposed approach achieves up to a 42.8% improvement, even under tight energy and bandwidth constraint.

    https://arxiv.org/abs/2512.13340


    Space Efficient Algorithms for Parameterised Problems

    oai:arXiv.org:2512.13342v1

    arXiv:2512.13342v1 Announce Type: new Abstract: We study "space efficient" FPT algorithms for graph problems with limited memory. Let n be the size of the input graph and k be the parameter. We present algorithms that run in time f(k)*poly(n) and use g(k)*polylog(n) working space, where f and g are functions of k alone, for k-Path, MaxLeaf SubTree and Multicut in Trees. These algorithms are motivated by big-data settings where very large problem instances must be solved, and using poly(n) memory is prohibitively expensive. They are also theoretically interesting, since most of the standard methods tools, such as deleting a large set of vertices or edges, are unavailable, and we must a develop different way to tackle them.

    https://arxiv.org/abs/2512.13342


    Neural Control Barrier Functions for Signal Temporal Logic Specifications with Input Constraints

    oai:arXiv.org:2512.13344v1

    arXiv:2512.13344v1 Announce Type: new Abstract: Signal Temporal Logic (STL) provides a powerful framework to describe complex tasks involving temporal and logical behavior in dynamical systems. In this work, we address the problem of synthesizing controllers for continuous-time systems under STL specifications with input constraints. We propose a neural network-based framework for synthesizing time-varying control barrier functions (TVCBF) and their corresponding controllers for systems to fulfill STL specifications while respecting input constraints. We formulate barrier conditions incorporating the spatial and temporal logic of the given STL specification. Additionally, we introduce a validity condition to provide formal safety guarantees across the entire state space. Finally, we demonstrate the effectiveness of the proposed approach through several simulation studies considering different STL tasks.

    https://arxiv.org/abs/2512.13344


    On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models

    oai:arXiv.org:2512.13352v1

    arXiv:2512.13352v1 Announce Type: new Abstract: Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.

    https://arxiv.org/abs/2512.13352


    Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

    oai:arXiv.org:2512.13356v1

    arXiv:2512.13356v1 Announce Type: new Abstract: This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.

    https://arxiv.org/abs/2512.13356


    Fast Policy Learning for 6-DOF Position Control of Underwater Vehicles

    oai:arXiv.org:2512.13359v1

    arXiv:2512.13359v1 Announce Type: new Abstract: Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two minutes.Through systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.

    https://arxiv.org/abs/2512.13359


    UCRBench: Benchmarking LLMs on Use Case Recovery

    oai:arXiv.org:2512.13360v1

    arXiv:2512.13360v1 Announce Type: new Abstract: Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and face the risk of being misaligned with actual system behavior, similarly limiting the rigorous evaluation of large language models (LLMs) in generating use cases from source code. We address this gap by introducing code-aligned use case benchmarks, constructed through manual validation of both user-goal and subfunction use cases across nine real-world software projects. Using this benchmark, we conduct the first systematic study of LLMs and propose a hierarchical evaluation protocol that assesses actor correctness, name accuracy, path fidelity, and behavioral coverage. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects, with particularly noticeable shortcomings in domain-specific and multi-module systems. The models also exhibit high omission rates and struggle to maintain consistent abstraction when aggregating subfunctions into user-goal use cases, highlighting both the potential and current limitations of LLM-based use case reverse engineering.

    https://arxiv.org/abs/2512.13360


    Automated User Identification from Facial Thermograms with Siamese Networks

    oai:arXiv.org:2512.13361v1

    arXiv:2512.13361v1 Announce Type: new Abstract: The article analyzes the use of thermal imaging technologies for biometric identification based on facial thermograms. It presents a comparative analysis of infrared spectral ranges (NIR, SWIR, MWIR, and LWIR). The paper also defines key requirements for thermal cameras used in biometric systems, including sensor resolution, thermal sensitivity, and a frame rate of at least 30 Hz. Siamese neural networks are proposed as an effective approach for automating the identification process. In experiments conducted on a proprietary dataset, the proposed method achieved an accuracy of approximately 80%. The study also examines the potential of hybrid systems that combine visible and infrared spectra to overcome the limitations of individual modalities. The results indicate that thermal imaging is a promising technology for developing reliable security systems.

    https://arxiv.org/abs/2512.13361


    Detecting Emotion Drift in Mental Health Text Using Pre-Trained Transformers

    oai:arXiv.org:2512.13363v1

    arXiv:2512.13363v1 Announce Type: new Abstract: This study investigates emotion drift: the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.

    https://arxiv.org/abs/2512.13363


    Parallel Heuristic Exploration for Additive Complexity Reduction in Fast Matrix Multiplication

    oai:arXiv.org:2512.13365v1

    arXiv:2512.13365v1 Announce Type: new Abstract: This paper presents a parallel random-search method for reducing additive complexity in fast matrix multiplication. The approach replaces expensive exact evaluation with fast heuristic scoring, including the new Greedy-Intersections strategy. The method runs many independent common subexpression elimination processes in parallel, exploring the search space through random pair substitutions and diverse selection strategies while sharing promising partial solutions. Tested on 164 ternary-coefficient schemes, the method achieves lower addition counts than the state-of-the-art Greedy-Potential on 103 schemes, matches it on 59, and is outperformed on 2. For most schemes, it gives equal or better results while being much faster, making it practical for algorithm exploration. All software and results are open source.

    https://arxiv.org/abs/2512.13365


    BlossomRec: Block-level Fused Sparse Attention Mechanism for Sequential Recommendations

    oai:arXiv.org:2512.13368v1

    arXiv:2512.13368v1 Announce Type: new Abstract: Transformer structures have been widely used in sequential recommender systems (SRS). However, as user interaction histories increase, computational time and memory requirements also grow. This is mainly caused by the standard attention mechanism. Although there exist many methods employing efficient attention and SSM-based models, these approaches struggle to effectively model long sequences and may exhibit unstable performance on short sequences. To address these challenges, we design a sparse attention mechanism, BlossomRec, which models both long-term and short-term user interests through attention computation to achieve stable performance across sequences of varying lengths. Specifically, we categorize user interests in recommendation systems into long-term and short-term interests, and compute them using two distinct sparse attention patterns, with the results combined through a learnable gated output. Theoretically, it significantly reduces the number of interactions participating in attention computation. Extensive experiments on four public datasets demonstrate that BlossomRec, when integrated with state-of-the-art Transformer-based models, achieves comparable or even superior performance while significantly reducing memory usage, providing strong evidence of BlossomRec's efficiency and effectiveness.The code is available at https://github.com/ronineume/BlossomRec.

    https://arxiv.org/abs/2512.13368


    Machine learning discovers new champion codes

    oai:arXiv.org:2512.13370v1

    arXiv:2512.13370v1 Announce Type: new Abstract: Linear error-correcting codes form the mathematical backbone of modern digital communication and storage systems, but identifying champion linear codes (linear codes achieving or exceeding the best known minimum Hamming distance) remains challenging. By training a transformer to predict the minimum Hamming distance of a class of linear codes and pairing it with a genetic algorithm over the search space, we develop a novel method for discovering champion codes. This model effectively reduces the search space of linear codes needed to achieve champion codes. Our results present the use of this method in the study and construction of error-correcting codes, applicable to codes such as generalised toric, Reed-Muller, Bose-Chaudhuri-Hocquenghem, algebrogeometric, and potentially quantum codes.

    https://arxiv.org/abs/2512.13370


    Behavior and Representation in Large Language Models for Combinatorial Optimization: From Feature Extraction to Algorithm Selection

    oai:arXiv.org:2512.13374v1

    arXiv:2512.13374v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have opened new perspectives for automation in optimization. While several studies have explored how LLMs can generate or solve optimization models, far less is understood about what these models actually learn regarding problem structure or algorithmic behavior. This study investigates how LLMs internally represent combinatorial optimization problems and whether such representations can support downstream decision tasks. We adopt a twofold methodology combining direct querying, which assesses LLM capacity to explicitly extract instance features, with probing analyses that examine whether such information is implicitly encoded within their hidden layers. The probing framework is further extended to a per-instance algorithm selection task, evaluating whether LLM-derived representations can predict the best-performing solver. Experiments span four benchmark problems and three instance representations. Results show that LLMs exhibit moderate ability to recover feature information from problem instances, either through direct querying or probing. Notably, the predictive power of LLM hidden-layer representations proves comparable to that achieved through traditional feature extraction, suggesting that LLMs capture meaningful structural information relevant to optimization performance.

    https://arxiv.org/abs/2512.13374


    Unlocking Generalization in Polyp Segmentation with DINO Self-Attention "keys"

    oai:arXiv.org:2512.13376v1

    arXiv:2512.13376v1 Announce Type: new Abstract: Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.

    https://arxiv.org/abs/2512.13376


    Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

    oai:arXiv.org:2512.13380v1

    arXiv:2512.13380v1 Announce Type: new Abstract: Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components - grasping style and affordance - and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.

    https://arxiv.org/abs/2512.13380


    Dual-Phase Federated Deep Unlearning via Weight-Aware Rollback and Reconstruction

    oai:arXiv.org:2512.13381v1

    arXiv:2512.13381v1 Announce Type: new Abstract: Federated Unlearning (FUL) focuses on client data and computing power to offer a privacy-preserving solution. However, high computational demands, complex incentive mechanisms, and disparities in client-side computing power often lead to long times and higher costs. To address these challenges, many existing methods rely on server-side knowledge distillation that solely removes the updates of the target client, overlooking the privacy embedded in the contributions of other clients, which can lead to privacy leakage. In this work, we introduce DPUL, a novel server-side unlearning method that deeply unlearns all influential weights to prevent privacy pitfalls. Our approach comprises three components: (i) identifying high-weight parameters by filtering client update magnitudes, and rolling them back to ensure deep removal. (ii) leveraging the variational autoencoder (VAE) to reconstruct and eliminate low-weight parameters. (iii) utilizing a projection-based technique to recover the model. Experimental results on four datasets demonstrate that DPUL surpasses state-of-the-art baselines, providing a 1%-5% improvement in accuracy and up to 12x reduction in time cost.

    https://arxiv.org/abs/2512.13381


    Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

    oai:arXiv.org:2512.13392v1

    arXiv:2512.13392v1 Announce Type: new Abstract: We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages against state-of-the-art alternatives towards images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/

    https://arxiv.org/abs/2512.13392


    QoS-Aware State-Augmented Learnable Framework for 5G NR-U/Wi-Fi Coexistence: Impact of Parameter Selection and Enhanced Collision Resolution

    oai:arXiv.org:2512.13393v1

    arXiv:2512.13393v1 Announce Type: new Abstract: Unlicensed spectrum supports diverse traffic with stringent Quality-of-Service (QoS) requirements. In NR-U/Wi-Fi coexistence,the values of MAC parameters critically influence delay, collision behavior, and airtime fairness and efficiency. In this paper, we investigate the impact of (i) cost scaling and violation modeling, (ii) choice of MAC parameters, and (iii) an enhanced collision resolution scheme for the Listen-Before-Talk (LBT) mechanism on the performance of a state-augmented constrained reinforcement learning controller for NR-U/Wi-Fi coexistence. Coexistence control is formulated as a constrained Markov decision process with an explicit delay constraint for high-priority traffic and fairness as the optimization goal. Our simulation results show three key findings: (1) signed, threshold-invariant cost scaling with temporal smoothing stabilizes learning and strengthens long-term constraint adherence; (2) use of the contention window parameter for control provides smoother adaptation and better delay compliance than other MAC parameters; and (3) enhanced LBT significantly reduces collisions and improves airtime efficiency. These findings provide practical insights for achieving robust, QoS-aware coexistence control.

    https://arxiv.org/abs/2512.13393


    Automated Information Flow Selection for Multi-scenario Multi-task Recommendation

    oai:arXiv.org:2512.13396v1

    arXiv:2512.13396v1 Announce Type: new Abstract: Multi-scenario multi-task recommendation (MSMTR) systems must address recommendation demands across diverse scenarios while simultaneously optimizing multiple objectives, such as click-through rate and conversion rate. Existing MSMTR models typically consist of four information units: scenario-shared, scenario-specific, task-shared, and task-specific networks. These units interact to generate four types of relationship information flows, directed from scenario-shared or scenario-specific networks to task-shared or task-specific networks. However, these models face two main limitations: 1) They often rely on complex architectures, such as mixture-of-experts (MoE) networks, which increase the complexity of information fusion, model size, and training cost. 2) They extract all available information flows without filtering out irrelevant or even harmful content, introducing potential noise. Regarding these challenges, we propose a lightweight Automated Information Flow Selection (AutoIFS) framework for MSMTR. To tackle the first issue, AutoIFS incorporates low-rank adaptation (LoRA) to decouple the four information units, enabling more flexible and efficient information fusion with minimal parameter overhead. To address the second issue, AutoIFS introduces an information flow selection network that automatically filters out invalid scenario-task information flows based on model performance feedback. It employs a simple yet effective pruning function to eliminate useless information flows, thereby enhancing the impact of key relationships and improving model performance. Finally, we evaluate AutoIFS and confirm its effectiveness through extensive experiments on two public benchmark datasets and an online A/B test.

    https://arxiv.org/abs/2512.13396


    rNCA: Self-Repairing Segmentation Masks

    oai:arXiv.org:2512.13397v1

    arXiv:2512.13397v1 Announce Type: new Abstract: Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing $\beta_0$ errors by 60% and $\beta_1$ by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as effective and broadly applicable refiners.

    https://arxiv.org/abs/2512.13397


    Differentiable Evolutionary Reinforcement Learning

    oai:arXiv.org:2512.13399v1

    arXiv:2512.13399v1 Announce Type: new Abstract: The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolution, DERL is differentiable in its metaoptimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling selfimproving agent alignment without human intervention.

    https://arxiv.org/abs/2512.13399


    End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

    oai:arXiv.org:2512.13402v1

    arXiv:2512.13402v1 Announce Type: new Abstract: Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive and workflow disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

    https://arxiv.org/abs/2512.13402


    Google Spain v. Gonz\'ales: Did the Court forget about freedom of expression?

    oai:arXiv.org:2512.13404v1

    arXiv:2512.13404v1 Announce Type: new Abstract: When reviewing a job application letter, going on a first date, or considering doing business with someone, the first thing many people do is entering the person's name in a search engine. A search engine can point searchers to information that would otherwise have remained obscure. If somebody searched for the name of Spanish lawyer Mario Costeja Gonz\'alez, Google showed search results that included a link to a 1998 newspaper announcement implying he had financial troubles at the time. Gonz\'alez wanted Google to stop showing those links and started a procedure in Spain. After some legal wrangling, the Spanish Audiencia Nacional (National High Court) asked the Court of Justice of the European Union (CJEU) for advice on the application of the Data Protection Directive, which led to the controversial judgment in Google Spain. In its judgment, the CJEU holds that people, under certain conditions, have the right to have search results for their name delisted. This right can also extend to lawfully published information.

    https://arxiv.org/abs/2512.13404


    Improving Privacy Protection in the area of Behavioural Targeting

    oai:arXiv.org:2512.13405v1

    arXiv:2512.13405v1 Announce Type: new Abstract: This PhD thesis discusses how European law could improve privacy protection in the area of behavioural targeting. Behavioural targeting, also referred to as online profiling, involves monitoring people's online behaviour, and using the collected information to show people individually targeted advertisements. To protect privacy in the area of behavioural targeting, the EU lawmaker mainly relies on the consent requirement for the use of tracking technologies in the e-Privacy Directive, and on general data protection law. With informed consent requirements, the law aims to empower people to make choices in their best interests. But behavioural studies cast doubt on the effectiveness of the empowerment approach as a privacy protection measure. Many people click "I agree" to any statement that is presented to them. Therefore, to mitigate privacy problems such as chilling effects, this study argues for a combined approach of protecting and empowering the individual. Compared to the current approach, the lawmaker should focus more on protecting people. The PhD thesis is a legal study, but it also incorporates insights from other disciplines, such as computer science, behavioural economics, and media studies. This study is among the first to discuss the implications of behavioural research for European data protection policy. The topic of whether data protection law should apply to pseudonymous data is discussed in depth. The study contains a detailed analysis of the role of informed consent in data protection law, and gives much attention to the tension between protecting and empowering the individual within data protection law.

    https://arxiv.org/abs/2512.13405


    Multiclass Graph-Based Large Margin Classifiers: Unified Approach for Support Vectors and Neural Networks

    oai:arXiv.org:2512.13410v1

    arXiv:2512.13410v1 Announce Type: new Abstract: While large margin classifiers are originally an outcome of an optimization framework, support vectors (SVs) can be obtained from geometric approaches. This article presents advances in the use of Gabriel graphs (GGs) in binary and multiclass classification problems. For Chipclass, a hyperparameter-less and optimization-less GG-based binary classifier, we discuss how activation functions and support edge (SE)-centered neurons affect the classification, proposing smoother functions and structural SV (SSV)-centered neurons to achieve margins with low probabilities and smoother classification contours. We extend the neural network architecture, which can be trained with backpropagation with a softmax function and a cross-entropy loss, or by solving a system of linear equations. A new subgraph-/distance-based membership function for graph regularization is also proposed, along with a new GG recomputation algorithm that is less computationally expensive than the standard approach. Experimental results with the Friedman test show that our method was better than previous GG-based classifiers and statistically equivalent to tree-based models.

    https://arxiv.org/abs/2512.13410


    Computer vision training dataset generation for robotic environments using Gaussian splatting

    oai:arXiv.org:2512.13411v1

    arXiv:2512.13411v1 Announce Type: new Abstract: This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.

    https://arxiv.org/abs/2512.13411


    PSALM: applying Proportional SAmpLing strategy in Metamorphic testing

    oai:arXiv.org:2512.13414v1

    arXiv:2512.13414v1 Announce Type: new Abstract: Metamorphic testing (MT) alleviates the oracle problem by checking metamorphic relations (MRs) across multiple test executions. The fault detection effectiveness of MT is influenced not only by the choice and quality of MRs, but also by how source test cases and metamorphic groups (MGs) are selected. While substantial research has focused on designing, generating, and validating MRs, systematic methods for source test case selection and MG selection remain largely unexplored. Although the Proportional Sampling Strategy (PSS) provides strong theoretical guarantees in traditional testing, its assumptions cannot be directly applied in MT due to differences in selection domains, test units, and failure distributions. This paper proposes PSALM, an adaptation of PSS to MT for both source test case selection and MG selection. We formally prove that PSALM is never inferior to random selection regardless of how the source test case and MG domains are partitioned. We further identify the conditions under which applying PSALM to source test case selection and MG selection yields identical effectiveness. A comprehensive empirical study on eight subject programs and 184 mutants shows that the results are consistent with our theoretical analysis and that PSALM generally performs more effectively than existing selection strategies such as ART and MT-ART. These results demonstrate that PSALM provides a theoretically grounded and practically effective selection strategy for MT.

    https://arxiv.org/abs/2512.13414


    USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

    oai:arXiv.org:2512.13415v1

    arXiv:2512.13415v1 Announce Type: new Abstract: Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM

    https://arxiv.org/abs/2512.13415


    Learning to Generate Cross-Task Unexploitable Examples

    oai:arXiv.org:2512.13416v1

    arXiv:2512.13416v1 Announce Type: new Abstract: Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.

    https://arxiv.org/abs/2512.13416


    RecTok: Reconstruction Distillation along Rectified Flow

    oai:arXiv.org:2512.13421v1

    arXiv:2512.13421v1 Announce Type: new Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.

    https://arxiv.org/abs/2512.13421


    QMon: Monitoring the Execution of Quantum Circuits with Mid-Circuit Measurement and Reset

    oai:arXiv.org:2512.13422v1

    arXiv:2512.13422v1 Announce Type: new Abstract: Unlike classical software, where logging and runtime tracing can effectively reveal internal execution status, quantum circuits possess unique properties, such as the no-cloning theorem and measurement-induced collapse, that prevent direct observation or duplication of their states. These characteristics make it especially challenging to monitor the execution of quantum circuits, complicating essential tasks such as debugging and runtime monitoring. This paper presents QMON, a practical methodology that leverages mid-circuit measurements and reset operations to monitor the internal states of quantum circuits while preserving their original runtime behavior. QMON enables the instrumentation of monitoring operators at developer-specified locations within the circuit, allowing comparisons between expected and observed quantum-state probabilities at those locations. We evaluated QMON by analyzing its impact on circuit behavior, monitoring coverage, and effectiveness in bug localization. Experimental results involving 154 quantum circuits show that all circuits preserve their intended functionality after instrumentation and that QMON successfully detects and localizes various programming errors. Although monitoring coverage is limited by the need to preserve delicate quantum properties, such as entanglement, QMON effectively detects errors while introducing no or negligible disturbance to the original quantum states. QMON facilitates the development of more robust and reliable quantum software as the field continues to mature.

    https://arxiv.org/abs/2512.13422


    MineTheGap: Automatic Mining of Biases in Text-to-Image Models

    oai:arXiv.org:2512.13427v1

    arXiv:2512.13427v1 Announce Type: new Abstract: Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking for those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project's webpage.

    https://arxiv.org/abs/2512.13427


    A Domain-Adapted Lightweight Ensemble for Resource-Efficient Few-Shot Plant Disease Classification

    oai:arXiv.org:2512.13428v1

    arXiv:2512.13428v1 Announce Type: new Abstract: Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representation. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1 to 15 shot scenarios, reaching 98.23+-0.33% at 15 shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28+-1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.

    https://arxiv.org/abs/2512.13428


    Two Families of Linear Codes Containing Non-GRS MDS Codes

    oai:arXiv.org:2512.13429v1

    arXiv:2512.13429v1 Announce Type: new Abstract: We construct two new families of linear codes by modifying the generator matrices of generalized Reed-Solomon (GRS) codes. For these codes, we explicitly derive parity-check matrices and establish necessary and sufficient conditions ensuring the MDS property. Additionally, we explore subfamilies within these constructions that are non-GRS MDS codes. We also characterize their self-orthogonal and self-dual properties and present some explicit constructions and examples.

    https://arxiv.org/abs/2512.13429


    Weak Enforcement and Low Compliance in PCI~DSS: A Comparative Security Study

    oai:arXiv.org:2512.13430v1

    arXiv:2512.13430v1 Announce Type: new Abstract: Although credit and debit card data continue to be a prime target for attackers, organizational adherence to the Payment Card Industry Data Security Standard (PCI DSS) remains surprisingly low. Despite prior work showing that PCI DSS can reduce card fraud, only 32.4% of organizations were fully compliant in 2022, suggesting possible deficiencies in enforcement mechanisms. This study compares PCI DSS with three data security frameworks, HIPAA, NIS2, and GDPR, to examine how enforcement mechanisms relate to implementation success. The analysis reveals that PCI DSS significantly lags far behind these security frameworks and that its sanctions are orders of magnitude smaller than those under GDPR and NIS2. The findings indicate a positive association between stronger, multi-modal enforcement (including public disclosure, license actions, and imprisonment) and higher implementation rates, and highlights the structural weakness of PCI DSS's bank-dependent monitoring model. Enhanced non-monetary sanctions and the creation of an independent supervisory authority are recommended to increase transparency, reduce conflicts of interest, and improve PCI DSS compliance without discouraging card acceptance.

    https://arxiv.org/abs/2512.13430


    From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents

    oai:arXiv.org:2512.13438v1

    arXiv:2512.13438v1 Announce Type: new Abstract: While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to co-optimization of token efficiency and completeness. Second, the need to process large, complex UI trees as input while generating long, compositional transformation programs, making the search space vast and error-prone. Toward addressing the preceding limitations, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain-specific language (DSL) that captures UI-specific operations. Second, UIFormer conducts LLM-based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency-completeness co-optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real-world industry deployment at WeChat further validates the practical impact of UIFormer.

    https://arxiv.org/abs/2512.13438


    Balancing Timeliness and Privacy in Discrete-Time Updating Systems

    oai:arXiv.org:2512.13439v1

    arXiv:2512.13439v1 Announce Type: new Abstract: We study the trade-off between Age of Information (AoI) and maximal leakage (MaxL) in discrete-time status updating systems. A source generates time-stamped update packets that are processed by a server that delivers them to a monitor. An adversary, who eavesdrops on the server-monitor link, wishes to infer the timing of the underlying source update sequence. The server must balance the timeliness of the status information at the monitor against the timing information leaked to the adversary. We consider a model with Bernoulli source updates under two classes of Last-Come-First-Served (LCFS) service policies: (1) Coupled policies that tie the server's deliveries to the update arrival process in a preemptive queue; (2) Decoupled (dumping) policies in which the server transmits its freshest update according to a schedule that is independent of the update arrivals. For each class, we characterize the structure of the optimal policy that minimizes AoI for a given MaxL rate. Our analysis reveals that decoupled dumping policies offer a superior age-leakage trade-off to coupled policies. When subject to a MaxL constraint, we prove that the optimal dumping strategy is achieved by dithering between two adjacent deterministic dump periods.

    https://arxiv.org/abs/2512.13439


    IMILIA: interpretable multiple instance learning for inflammation prediction in IBD from H&E whole slide images

    oai:arXiv.org:2512.13440v1

    arXiv:2512.13440v1 Announce Type: new Abstract: As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided in two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at https://github.com/owkin/imilia.

    https://arxiv.org/abs/2512.13440


    Large language models are not about language

    oai:arXiv.org:2512.13441v1

    arXiv:2512.13441v1 Announce Type: new Abstract: Large Language Models are useless for linguistics, as they are probabilistic models that require a vast amount of data to analyse externalized strings of words. In contrast, human language is underpinned by a mind-internal computational system that recursively generates hierarchical thought structures. The language system grows with minimal external input and can readily distinguish between real language and impossible languages.

    https://arxiv.org/abs/2512.13441


    XNNTab -- Interpretable Neural Networks for Tabular Data using Sparse Autoencoders

    oai:arXiv.org:2512.13442v1

    arXiv:2512.13442v1 Announce Type: new Abstract: In data-driven applications relying on tabular data, where interpretability is key, machine learning models such as decision trees and linear regression are applied. Although neural networks can provide higher predictive performance, they are not used because of their blackbox nature. In this work, we present XNNTab, a neural architecture that combines the expressiveness of neural networks and interpretability. XNNTab first learns highly non-linear feature representations, which are decomposed into monosemantic features using a sparse autoencoder (SAE). These features are then assigned human-interpretable concepts, making the overall model prediction intrinsically interpretable. XNNTab outperforms interpretable predictive models, and achieves comparable performance to its non-interpretable counterparts.

    https://arxiv.org/abs/2512.13442


    A Data Annotation Requirements Representation and Specification (DARS)

    oai:arXiv.org:2512.13444v1

    arXiv:2512.13444v1 Announce Type: new Abstract: With the rise of AI-enabled cyber-physical systems, data annotation has become a critical yet often overlooked process in the development of these intelligent information systems. Existing work in requirements engineering (RE) has explored how requirements for AI systems and their data can be represented. However, related interviews with industry professionals show that data annotations and their related requirements introduce distinct challenges, indicating a need for annotation-specific requirement representations. We propose the Data Annotation Requirements Representation and Specification (DARS), including an Annotation Negotiation Card to align stakeholders on objectives and constraints, and a Scenario-Based Annotation Specification to express atomic and verifiable data annotation requirements. We evaluate DARS with an automotive perception case related to an ongoing project, and a mapping against 18 real-world data annotation error types. The results suggest that DARS mitigates root causes of completeness, accuracy, and consistency annotation errors. By integrating DARS into RE, this work improves the reliability of safety-critical systems using data annotations and demonstrates how engineering frameworks must evolve for data-dependent components of today's intelligent information systems.

    https://arxiv.org/abs/2512.13444


    Test-Time Modification: Inverse Domain Transformation for Robust Perception

    oai:arXiv.org:2512.13454v1

    arXiv:2512.13454v1 Announce Type: new Abstract: Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.

    https://arxiv.org/abs/2512.13454


    SSAS: Cross-subject EEG-based Emotion Recognition through Source Selection with Adversarial Strategy

    oai:arXiv.org:2512.13458v1

    arXiv:2512.13458v1 Announce Type: new Abstract: Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, a cross-subject EEG-based emotion recognition through source selection with adversarial strategy is introduced in this paper. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS gets the source domain selection results and the pretrained domain discriminators from SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at https://github.com/liuyici/SSAS.

    https://arxiv.org/abs/2512.13458


    DP-EMAR: A Differentially Private Framework for Autonomous Model Weight Repair in Federated IoT Systems

    oai:arXiv.org:2512.13460v1

    arXiv:2512.13460v1 Announce Type: new Abstract: Federated Learning (FL) enables decentralized model training without sharing raw data, but model weight distortion remains a major challenge in resource constrained IoT networks. In multi tier Federated IoT (Fed-IoT) systems, unstable connectivity and adversarial interference can silently alter transmitted parameters, degrading convergence. We propose DP-EMAR, a differentially private, error model based autonomous repair framework that detects and reconstructs transmission induced distortions during FL aggregation. DP-EMAR estimates corruption patterns and applies adaptive correction before privacy noise is added, enabling reliable in network repair without violating confidentiality. By integrating Differential Privacy (DP) with Secure Aggregation (SA), the framework distinguishes DP noise from genuine transmission errors. Experiments on heterogeneous IoT sensor and graph datasets show that DP-EMAR preserves convergence stability and maintains near baseline performance under communication corruption while ensuring strict (epsilon, delta)-DP guarantees. The framework enhances robustness, communication efficiency, and trust in privacy preserving Federated IoT learning.

    https://arxiv.org/abs/2512.13460


    PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

    oai:arXiv.org:2512.13465v1

    arXiv:2512.13465v1 Announce Type: new Abstract: Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that Pose-Anything significantly outperforms state-of-the-art methods in both effectiveness and generalization.

    https://arxiv.org/abs/2512.13465


    Scaling Laws for Code: Every Programming Language Matters

    oai:arXiv.org:2512.13472v1

    arXiv:2512.13472v1 Announce Type: new Abstract: Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of the parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.

    https://arxiv.org/abs/2512.13472


    Mapping of the system of software-related emissions and shared responsibilities

    oai:arXiv.org:2512.13474v1

    arXiv:2512.13474v1 Announce Type: new Abstract: The global climate is experiencing a rapid and unprecedented warming trend. The ICT sector is a notable contributor to global greenhouse gas emissions, with its environmental impact continuing to expand. Addressing this issue is vital for achieving the objectives of the Paris Agreement, particularly the goal of limiting global temperature rise to 1.5{\deg}C. At the European Union level, regulatory measures such as the CSRD and the CSDD impose obligations on companies, including those within the ICT sector, to recognize and mitigate their environmental footprint. This study provides a comprehensive system mapping aimed at enhancing the awareness and understanding of software-related emissions and the corresponding responsibilities borne by the ICT sector. The mapping identifies the primary sources of carbon emissions and energy consumption within the ICT domain while also outlining the key responsibilities of the stakeholders accountable throughout the software lifecycle.

    https://arxiv.org/abs/2512.13474


    Evaluating the Navigation Capabilities of a Modified COAST Guidewire Robot in an Anatomical Phantom Model

    oai:arXiv.org:2512.13477v1

    arXiv:2512.13477v1 Announce Type: new Abstract: To address the issues that arise due to the manual navigation of guidewires in endovascular interventions, research in medical robotics has taken a strong interest in developing robotically steerable guidewires, which offer the possibility of enhanced maneuverability and navigation, as the tip of the guidewire can be actively steered. The COaxially Aligned STeerable (COAST) guidewire robot has the ability to generate a wide variety of motions including bending motion with different bending lengths, follow-the-leader motion, and feedforward motion. In our past studies, we have explored different designs of the COAST guidewire robot and developed modeling, control, and sensing strategies for the COAST guidewire robot. In this study, the performance of a modified COAST guidewire robot is evaluated by conducting navigation experiments in an anatomical phantom model with pulsatile flow. The modified COAST guidewire robot is a simplified version of the COAST guidewire robot and consists of two tubes as opposed to three tubes. Through this study, we demonstrate the effectiveness of the modified COAST guidewire robot in navigating the tortuous phantom vasculature.

    https://arxiv.org/abs/2512.13477


    Non-Resolution Reasoning: A Framework for Preserving Semantic Ambiguity in Language Models

    oai:arXiv.org:2512.13478v1

    arXiv:2512.13478v1 Announce Type: new Abstract: Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.

    https://arxiv.org/abs/2512.13478


    Reproducibility and Standardization in gem5 Resources v25.0

    oai:arXiv.org:2512.13479v1

    arXiv:2512.13479v1 Announce Type: new Abstract: Reproducibility in simulation-based computer architecture research requires coordinating artifacts like disk images, kernels, and benchmarks, but existing workflows are inconsistent. We improve gem5, an open-source simulator with over 1600 forks, and gem5 Resources, a centralized repository of over 2000 pre-packaged artifacts, to address these issues. While gem5 Resources enables artifact sharing, researchers still face challenges. Creating custom disk images is complex and time-consuming, with no standardized process across ISAs, making it difficult to extend and share images. gem5 provides limited guest-host communication features through a set of predefined exit events that restrict researchers' ability to dynamically control and monitor simulations. Lastly, running simulations with multiple workloads requires researchers to write custom external scripts to coordinate multiple gem5 simulations which creates error-prone and hard-to-reproduce workflows. To overcome this, we introduce several features in gem5 and gem5 Resources. We standardize disk-image creation across x86, ARM, and RISC-V using Packer, and provide validated base images with pre-annotated benchmark suites (NPB, GAPBS). We provide 12 new disk images, 6 new kernels, and over 200 workloads across three ISAs. We refactor the exit event system to a class-based model and introduce hypercalls for enhanced guest-host communication that allows researchers to define custom behavior for their exit events. We also provide a utility to remotely monitor simulations and the gem5-bridge driver for user-space m5 operations. Additionally, we implemented Suites and MultiSim to enable parallel full-system simulations from gem5 configuration scripts, eliminating the need for external scripting. These features reduce setup complexity and provide extensible, validated resources that improve reproducibility and standardization.

    https://arxiv.org/abs/2512.13479


    Element-wise Modulation of Random Matrices for Efficient Neural Layers

    oai:arXiv.org:2512.13480v1

    arXiv:2512.13480v1 Announce Type: new Abstract: Fully connected layers are a primary source of memory and computational overhead in deep neural networks due to their dense, often redundant parameterization. While various compression techniques exist, they frequently introduce complex engineering trade-offs or degrade model performance. We propose the Parametrized Random Projection (PRP) layer, a novel approach that decouples feature mixing from adaptation by utilizing a fixed random matrix modulated by lightweight, learnable element-wise parameters. This architecture drastically reduces the trainable parameter count to a linear scale while retaining reliable accuracy across various benchmarks. The design serves as a stable, computationally efficient solution for architectural scaling and deployment in resource-limited settings.

    https://arxiv.org/abs/2512.13480


    neuralFOMO: Can LLMs Handle Being Second Best? Measuring Envy-Like Preferences in Multi-Agent Settings

    oai:arXiv.org:2512.13481v1

    arXiv:2512.13481v1 Announce Type: new Abstract: Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether and under what conditions they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) A point allocation game that tests whether a model tries to win over its peer. (2) A workplace setting observing behaviour when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.

    https://arxiv.org/abs/2512.13481


    Real-Time AI-Driven Milling Digital Twin Towards Extreme Low-Latency

    oai:arXiv.org:2512.13482v1

    arXiv:2512.13482v1 Announce Type: new Abstract: Digital twin (DT) enables smart manufacturing by leveraging real-time data, AI models, and intelligent control systems. This paper presents a state-of-the-art analysis on the emerging field of DTs in the context of milling. The critical aspects of DT are explored through the lens of virtual models of physical milling, data flow from physical milling to virtual model, and feedback from virtual model to physical milling. Live data streaming protocols and virtual modeling methods are highlighted. A case study showcases the transformative capability of a real-time machine learning-driven live DT of tool-work contact in a milling process. Future research directions are outlined to achieve the goals of Industry 4.0 and beyond.

    https://arxiv.org/abs/2512.13482


    Impact analysis of hidden faults in nonlinear control systems using output-to-output gain

    oai:arXiv.org:2512.13483v1

    arXiv:2512.13483v1 Announce Type: new Abstract: Networked control systems (NCSs) are vulnerable to faults and hidden malfunctions in communication channels that can degrade performance or even destabilize the closed loop. Classical metrics in robust control and fault detection typically treat impact and detectability separately, whereas the output-to-output gain (OOG) provides a unified measure of both. While existing results have been limited to linear systems, this paper extends the OOG framework to nonlinear NCSs with quadratically constrained nonlinearities, considering false-injection attacks that can also manipulate sensor measurements through nonlinear transformations. Specifically, we provide computationally efficient linear matrix inequality conditions and complementary frequency-domain tests that yield explicit upper bounds on the OOG of this class of nonlinear systems. Furthermore, we derive frequency-domain conditions for absolute stability of closed-loop systems, generalizing the Yakubovich quadratic criterion.

    https://arxiv.org/abs/2512.13483


    Advancing Bangla Machine Translation Through Informal Datasets

    oai:arXiv.org:2512.13487v1

    arXiv:2512.13487v1 Announce Type: new Abstract: Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.

    https://arxiv.org/abs/2512.13487


    SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

    oai:arXiv.org:2512.13488v1

    arXiv:2512.13488v1 Announce Type: new Abstract: An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), the system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has achieved an impressive 94.45% effective cluster accelerator utilization, while also substantially reducing node recycling and job-recovery times. Building on the foundation of LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators. This effort delivered remarkable stability and efficiency outcomes, achieving 21.08% MFU, state-of-the-art downstream accuracy, and encountering only one stability incident over a 75-day period. Together, these advances establish SIGMA, which not only tackles the critical challenges of large-scale training but also establishes a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.

    https://arxiv.org/abs/2512.13488


    From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis

    oai:arXiv.org:2512.13491v1

    arXiv:2512.13491v1 Announce Type: new Abstract: We inspect the deductive connection between the neural scaling law and Zipf's law -- two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model -- such as a large language model -- changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf's law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf's law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps' law on the vocabulary growth from Zipf's law, Hilberg's hypothesis on the entropy scaling from Heaps' law, and the neural scaling from Hilberg's hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all the four statistical laws.

    https://arxiv.org/abs/2512.13491


    Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10$\times$

    oai:arXiv.org:2512.13492v1

    arXiv:2512.13492v1 Announce Type: new Abstract: Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video

    https://arxiv.org/abs/2512.13492


    SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping

    oai:arXiv.org:2512.13494v1

    arXiv:2512.13494v1 Announce Type: new Abstract: Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, na\"ive low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.

    https://arxiv.org/abs/2512.13494


    Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

    oai:arXiv.org:2512.13495v1

    arXiv:2512.13495v1 Announce Type: new Abstract: We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/

    https://arxiv.org/abs/2512.13495


    Identification of Technical Design Constraints and Considerations for Transmission Grid Expansion Planning Projects

    oai:arXiv.org:2512.13496v1

    arXiv:2512.13496v1 Announce Type: new Abstract: The large-scale deployment of renewable energy sources, particularly offshore wind, requires large-scale transmission grid expansion projects to transmit the produced low-carbon power to the main demand centers. However, the planning and design of such complex projects currently lack a transparent and systematic process that system operators can follow when considering such investments in their grids. This paper identifies and classifies the main technical design constraints and considerations relevant to the planning of transmission grid expansion projects, and more specifically, electrical energy hubs. Seven key areas of interest are identified, namely network integration, HVDC technologies, costs (CAPEX, OPEX, and space requirements), electricity market design, future proofness and modular expandability, reliability-availability-maintainability, and sustainability. Each area of interest is analyzed in terms of its technical and operational relevance, with technical design constraints and considerations derived from such analysis. In addition, a hierarchical classification of the identified constraints and considerations (and therefore areas of interest) is introduced, distinguishing them between three criticality classes, namely hard constraints, main drivers, and key considerations. The dependencies between the different areas are discussed, too. Therefore, this work provides system operators and policymakers with a structured basis to support a transparent planning methodology with clear decision hierarchies for investments in transmission grid expansion projects.

    https://arxiv.org/abs/2512.13496


    On-Device Continual Learning for Unsupervised Visual Anomaly Detection in Dynamic Manufacturing

    oai:arXiv.org:2512.13497v1

    arXiv:2512.13497v1 Announce Type: new Abstract: In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending the PatchCore to incorporate online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.

    https://arxiv.org/abs/2512.13497


    Platforms as Crime Scene, Judge, and Jury: How Victim-Survivors of Non-Consensual Intimate Imagery Report Abuse Online

    oai:arXiv.org:2512.13500v1

    arXiv:2512.13500v1 Announce Type: new Abstract: Non-consensual intimate imagery (NCII), also known as image-based sexual abuse (IBSA), is mediated through online platforms. Victim-survivors must turn to platforms to collect evidence and request content removal. Platforms act as the crime scene, judge, and jury, determining whether perpetrators face consequences and if harmful material is removed. We present a study of NCII victim-survivors' online reporting experiences, drawing on trauma-informed interviews with 13 participants. We find that platform reporting processes are hostile, opaque, and ineffective, often forcing complex harms into narrow interfaces, responding inconsistently, and failing to result in meaningful action. Leveraging institutional betrayal theory, we show how platforms' structures and practices compound harm, and, in doing so, surface concrete intervention points for redesigning reporting systems and shaping policy to better support victim-survivors

    https://arxiv.org/abs/2512.13500


    Behavior-Aware and Generalizable Defense Against Black-Box Adversarial Attacks for ML-Based IDS

    oai:arXiv.org:2512.13501v1

    arXiv:2512.13501v1 Announce Type: new Abstract: Machine learning based intrusion detection systems are increasingly targeted by black box adversarial attacks, where attackers craft evasive inputs using indirect feedback such as binary outputs or behavioral signals like response time and resource usage. While several defenses have been proposed, including input transformation, adversarial training, and surrogate detection, they often fall short in practice. Most are tailored to specific attack types, require internal model access, or rely on static mechanisms that fail to generalize across evolving attack strategies. Furthermore, defenses such as input transformation can degrade intrusion detection system performance, making them unsuitable for real time deployment. To address these limitations, we propose Adaptive Feature Poisoning, a lightweight and proactive defense mechanism designed specifically for realistic black box scenarios. Adaptive Feature Poisoning assumes that probing can occur silently and continuously, and introduces dynamic and context aware perturbations to selected traffic features, corrupting the attacker feedback loop without impacting detection capabilities. The method leverages traffic profiling, change point detection, and adaptive scaling to selectively perturb features that an attacker is likely exploiting, based on observed deviations. We evaluate Adaptive Feature Poisoning against multiple realistic adversarial attack strategies, including silent probing, transferability based attacks, and decision boundary based attacks. The results demonstrate its ability to confuse attackers, degrade attack effectiveness, and preserve detection performance. By offering a generalizable, attack agnostic, and undetectable defense, Adaptive Feature Poisoning represents a significant step toward practical and robust adversarial resilience in machine learning based intrusion detection systems.

    https://arxiv.org/abs/2512.13501


    An $H_2$-norm approach to performance analysis of networked control systems under multiplicative routing transformations

    oai:arXiv.org:2512.13504v1

    arXiv:2512.13504v1 Announce Type: new Abstract: This paper investigates the performance of networked control systems subject to multiplicative routing transformations that alter measurement pathways without directly injecting signals. Such transformations, arising from faults or adversarial actions, modify the feedback structure and can degrade performance while remaining stealthy. An $H_2$-norm framework is proposed to quantify the impact of these transformations by evaluating the ratio between the steady-state energies of performance and residual outputs. Equivalent linear matrix inequality (LMI) formulations are derived for computational assessment, and analytical upper bounds are established to estimate the worst-case degradation. The results provide structural insight into how routing manipulations influence closed-loop behavior and reveal conditions for stealthy multiplicative attacks.

    https://arxiv.org/abs/2512.13504


    Defending the Hierarchical Result Models of Precedential Constraint

    oai:arXiv.org:2512.13505v1

    arXiv:2512.13505v1 Announce Type: new Abstract: In recent years, hierarchical case-based-reasoning models of precedential constraint have been proposed. In various papers, Trevor Bench-Capon criticised these models on the grounds that they would give incorrect outcomes in some cases. In particular, the models would not account for the possibility that intermediate factors are established with different strengths by different base-level factors. In this paper we respond to these criticisms for van Woerkom's result-based hierarchical models. We argue that in some examples Bench-Capon seems to interpret intermediate factors as dimensions, and that applying van Woerkom's dimension-based version of the hierarchical result model to these examples avoids Bench-Capon's criticisms.

    https://arxiv.org/abs/2512.13505


    Learning under Distributional Drift: Reproducibility as an Intrinsic Statistical Resource

    oai:arXiv.org:2512.13506v1

    arXiv:2512.13506v1 Announce Type: new Abstract: Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system's finite capacity for statistical reproducibility - the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, and we prove a matching minimax lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.

    https://arxiv.org/abs/2512.13506


    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    oai:arXiv.org:2512.13507v1

    arXiv:2512.13507v1 Announce Type: new Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

    https://arxiv.org/abs/2512.13507


    MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph

    oai:arXiv.org:2512.13510v1

    arXiv:2512.13510v1 Announce Type: new Abstract: Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at https://github.com/LinjieMu/MedCEG.

    https://arxiv.org/abs/2512.13510


    TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding

    oai:arXiv.org:2512.13511v1

    arXiv:2512.13511v1 Announce Type: new Abstract: Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.

    https://arxiv.org/abs/2512.13511


    Reinforcement Learning based 6-DoF Maneuvers for Microgravity Intravehicular Docking: A Simulation Study with Int-Ball2 in ISS-JEM

    oai:arXiv.org:2512.13514v1

    arXiv:2512.13514v1 Announce Type: new Abstract: Autonomous free-flyers play a critical role in intravehicular tasks aboard the International Space Station (ISS), where their precise docking under sensing noise, small actuation mismatches, and environmental variability remains a nontrivial challenge. This work presents a reinforcement learning (RL) framework for six-degree-of-freedom (6-DoF) docking of JAXA's Int-Ball2 robot inside a high-fidelity Isaac Sim model of the Japanese Experiment Module (JEM). Using Proximal Policy Optimization (PPO), we train and evaluate controllers under domain-randomized dynamics and bounded observation noise, while explicitly modeling propeller drag-torque effects and polarity structure. This enables a controlled study of how Int-Ball2's propulsion physics influence RL-based docking performance in constrained microgravity interiors. The learned policy achieves stable and reliable docking across varied conditions and lays the groundwork for future extensions pertaining to Int-Ball2 in collision-aware navigation, safe RL, propulsion-accurate sim-to-real transfer, and vision-based end-to-end docking.

    https://arxiv.org/abs/2512.13514


    Fine-tuned LLM-based Code Migration Framework

    oai:arXiv.org:2512.13515v1

    arXiv:2512.13515v1 Announce Type: new Abstract: The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on addressing challenges in transitioning SQL-based systems. The proposed method for migration essentially appears as a framework that leverages the best aspects of traditional software engineering techniques and provides an iterative, scalable, precise and efficient solution for modern database transformations. The central piece of the approach is the integration of a fine-tuned Large Language Model to address critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. Thus, the method involves a trade-off between fine-tuning and prompt engineering. Special attention is given to a fine-tuning approach, which enhances the adaptability and compatibility with migration requirements across the entire database. According to the achieved results, fine-tuning plays a very important role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis and integration of Subject Matter Experts feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, enhances feature alignment throughout migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding GAI into the migration process, the framework facilitates precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.

    https://arxiv.org/abs/2512.13515


    How Low Can You Go? The Data-Light SE Challenge

    oai:arXiv.org:2512.13524v1

    arXiv:2512.13524v1 Announce Type: new Abstract: Much of software engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested? The counter-evidence presented in this paper suggests otherwise: across dozens of optimization problems from recent SE literature, including software configuration and performance tuning, cloud and systems optimization, project and process-level decision modeling, behavioral analytics, financial risk modeling, project health prediction, reinforcement learning tasks, sales forecasting, and software testing, even with just a few dozen labels, very simple methods (e.g. diversity sampling, a minimal Bayesian learner, or random probes) achieve near 90% of the best reported results. Further, these simple methods perform just as well as more state-of-the-the-art optimizers like SMAC, TPE, DEHB etc. While some tasks would require better outcomes and more sampling, these results seen after a few dozen samples would suffice for many engineering needs (particularly when the goal is rapid and cost-efficient guidance rather than slow and exhaustive optimization). Our results highlight that some SE tasks may be better served by lightweight approaches that demand fewer labels and far less computation. We hence propose the data-light challenge: when will a handful of labels suffice for SE tasks? To enable a large-scale investigation of this issue, we contribute (1) a mathematical formalization of labeling, (2) lightweight baseline algorithms, and (3) results on public-domain data showing the conditions under which lightweight methods excel or fail. For the purposes of open science, our scripts and data are online at https://github.com/KKGanguly/NEO .

    https://arxiv.org/abs/2512.13524


    Janus: Disaggregating Attention and Experts for Scalable MoE Inference

    oai:arXiv.org:2512.13525v1

    arXiv:2512.13525v1 Announce Type: new Abstract: Large Mixture-of-Experts (MoE) model inference is challenging due to high resource demands and dynamic workloads. Existing solutions often deploy the entire model as a single monolithic unit, which applies a unified resource configuration to both attention and expert modules despite their different requirements, leading to limited scalability and resource inefficiency. In this paper, we propose Janus, a scalable MoE inference system that disaggregates attention and experts on separate GPU sub-clusters, enabling each module to be managed and scaled independently. Janus incorporates three key designs for efficient, disaggregated MoE inference. First, it proposes an adaptive two-phase communication scheme that exploits intra- and inter-node bandwidth hierarchies for low-latency data exchange. Second, motivated by the memory-bound nature of MoE modules, Janus introduces a lightweight scheduler and implements it as a GPU kernel to balance the number of activated experts across GPUs at minimal overhead, thereby reducing inference latency. Third, Janus performs fine-grained resource management to dynamically adjust expert placement and independently scale attention and MoE resources to improve overall efficiency. Evaluation shows Janus achieves up to 3.9 higher perGPU throughput than state-of-the-art systems while meeting per-token latency requirements.

    https://arxiv.org/abs/2512.13525


    Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

    oai:arXiv.org:2512.13526v1

    arXiv:2512.13526v1 Announce Type: new Abstract: LLM-based software engineering agents are increasingly used in real-world development tasks, often with access to sensitive data or security-critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer's internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at 1% false positive rate on a held out test environment. Then, we estimate risk of sabotage at deployment time by extrapolating from our monitor's false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: https://github.com/UKGovernmentBEIS/async-control.

    https://arxiv.org/abs/2512.13526


    Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains

    oai:arXiv.org:2512.13534v1

    arXiv:2512.13534v1 Announce Type: new Abstract: A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.

    https://arxiv.org/abs/2512.13534


    Distributed Places and Safe Net Reduction

    oai:arXiv.org:2512.13538v1

    arXiv:2512.13538v1 Announce Type: new Abstract: Being able to find small Petri nets with the same behaviour as formal specifications of concurrent systems benefits both effective verification and practical implementation of such systems. This paper considers specifications given in the form of compositionally defined safe nets. The paper discusses a novel concept of distributed place which implements the behaviour of an individual net place. It is shown that if distributed places cover a safe Petri net, then it is possible to delete some places without changing the behaviour. Crucially, the reduction is both static and local, making it computationally feasible in practice. The resulting reduction technique is then applied to an algebra of safe Petri nets (boxes) derived compositionally from process (box) expressions. Though the original derivation can yield exponentially large boxes, prior research demonstrated that if a box expression does not involve cyclic behaviours, the exponential number of places can be reduced down to polynomial (quadratic). In this paper, using distributed places, it is show that similar optimisation can also be achieved in the case of process expressions with iteration.

    https://arxiv.org/abs/2512.13538


    Competent Discrete Time Modeling For analogue controlled PWM Converter Considering State-Feedback

    oai:arXiv.org:2512.13545v1

    arXiv:2512.13545v1 Announce Type: new Abstract: Ever since R.D.Middlebrook proposed the state space averaging notion. The small signal model has been widely used as a design tool to tune control parameters. As Moore's law is continuing and the AI chip's high demand for power consumption and dynamic response, the control bandwidth needs to be boosted. However, the average model has two basic assumptions: the low-frequency assumption, the small ripple assumption. In high-bandwidth design, these two assumptions are violated. In order to solve this, various methods have been proposed. This paper gives a comprehensive overview of the existing small signal model for PWM converters from the following perspectives: 1. model fidelity, 2. analytical tractability. 3. complexity of the derivation process and result 4.generality.

    https://arxiv.org/abs/2512.13545


    PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

    oai:arXiv.org:2512.13552v1

    arXiv:2512.13552v1 Announce Type: new Abstract: This work introduces {\it PrahokBART}, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.

    https://arxiv.org/abs/2512.13552


    Bidding Aggregated Flexibility in European Electricity Auctions

    oai:arXiv.org:2512.13557v1

    arXiv:2512.13557v1 Announce Type: new Abstract: Bidding flexibility in day-ahead and intraday auctions would enable decentralized flexible resources, such as electric vehicles and heat pumps, to efficiently align their consumption with the intermittent generation of renewable energy. However, because these resources are individually too small to participate in those auctions directly, an aggregator (e.g., a utility) must act on their behalf. This requires aggregating many decentralized resources, which is a computationally challenging task. In this paper, we propose a computationally efficient and highly accurate method that is readily applicable to European day-ahead and intraday auctions. Distinct from existing methods, we aggregate only economically relevant power profiles, identified through price forecasts. The resulting flexibility is then conveyed to the market operator via exclusive groups of block bids. We evaluate our method for a utility serving the Swiss town of Losone, where flexibility from multiple heat pumps distributed across the grid must be aggregated and bid in the Swiss day-ahead auction. Results show that our method aggregates accurately, achieving 98% of the theoretically possible cost savings. This aggregation accuracy remains stable even as the number of heat pumps increases, while computation time grows only linearly, demonstrating strong scalability.

    https://arxiv.org/abs/2512.13557


    Verifying Rumors via Stance-Aware Structural Modeling

    oai:arXiv.org:2512.13559v1

    arXiv:2512.13559v1 Announce Type: new Abstract: Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platfrom generalization.

    https://arxiv.org/abs/2512.13559


    3D Human-Human Interaction Anomaly Detection

    oai:arXiv.org:2512.13560v1

    arXiv:2512.13560v1 Announce Type: new Abstract: Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.

    https://arxiv.org/abs/2512.13560


    Near-Field Perception for Safety Enhancement of Autonomous Mobile Robots in Manufacturing Environments

    oai:arXiv.org:2512.13561v1

    arXiv:2512.13561v1 Announce Type: new Abstract: Near-field perception is essential for the safe operation of autonomous mobile robots (AMRs) in manufacturing environments. Conventional ranging sensors such as light detection and ranging (LiDAR) and ultrasonic devices provide broad situational awareness but often fail to detect small objects near the robot base. To address this limitation, this paper presents a three-tier near-field perception framework. The first approach employs light-discontinuity detection, which projects a laser stripe across the near-field zone and identifies interruptions in the stripe to perform fast, binary cutoff sensing for obstacle presence. The second approach utilizes light-displacement measurement to estimate object height by analyzing the geometric displacement of a projected stripe in the camera image, which provides quantitative obstacle height information with minimal computational overhead. The third approach employs a computer vision-based object detection model on embedded AI hardware to classify objects, enabling semantic perception and context-aware safety decisions. All methods are implemented on a Raspberry Pi 5 system, achieving real-time performance at 25 or 50 frames per second. Experimental evaluation and comparative analysis demonstrate that the proposed hierarchy balances precision, computation, and cost, thereby providing a scalable perception solution for enabling safe operations of AMRs in manufacturing environments.

    https://arxiv.org/abs/2512.13561


    Memory in the Age of AI Agents

    oai:arXiv.org:2512.13564v1

    arXiv:2512.13564v1 Announce Type: new Abstract: Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

    https://arxiv.org/abs/2512.13564


    A Distance Amplification Lemma for Monotonicity

    oai:arXiv.org:2512.13566v1

    arXiv:2512.13566v1 Announce Type: new Abstract: We show a procedure that, given oracle access to a function $f\colon \{0,1\}^n\to\{0,1\}$, produces oracle access to a function $f'\colon \{0,1\}^{n'}\to\{0,1\}$ such that if $f$ is monotone, then $f'$ is monotone, and if $f$ is $\varepsilon$-far from monotone, then $f'$ is $\Omega(1)$-far from monotone. Moreover, $n' \leq n 2^{O(1/\varepsilon)}$ and each oracle query to $f'$ can be answered by making $2^{O(1/\varepsilon)}$ oracle queries to $f$. Our lemma is motivated by a recent result of [Chen, Chen, Cui, Pires, Stockwell, arXiv:2511.04558], who showed that for all $c>0$ there exists $\varepsilon_c>0$, such that any (even two-sided, adaptive) algorithm distinguishing between monotone functions and $\varepsilon_c$-far from monotone functions, requires $\Omega(n^{1/2-c})$ queries. Combining our lemma with their result implies a similar result, except that the distance from monotonicity is an absolute constant $\varepsilon>0$, and the lower bound is $\Omega(n^{1/2-o(1)})$ queries.

    https://arxiv.org/abs/2512.13566


    Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

    oai:arXiv.org:2512.13568v1

    arXiv:2512.13568v1 Announce Type: new Abstract: Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.

    https://arxiv.org/abs/2512.13568


    MMhops-R1: Multimodal Multi-hop Reasoning

    oai:arXiv.org:2512.13573v1

    arXiv:2512.13573v1 Announce Type: new Abstract: The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.

    https://arxiv.org/abs/2512.13573


    Reproducing and Dissecting Denoising Language Models for Speech Recognition

    oai:arXiv.org:2512.13576v1

    arXiv:2512.13576v1 Announce Type: new Abstract: Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to a specific ASR model's error patterns. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. We build and release a complete, reproducible pipeline to systematically investigate the impact of key design choices. We evaluate dozens of configurations across multiple axes, including various data augmentation techniques (e.g., SpecAugment, dropout, mixup), different text-to-speech systems, and multiple decoding strategies. Our comparative analysis in a common subword vocabulary setting demonstrates that DLMs outperform traditional LMs, but only after a distinct compute tipping point. While LMs are more efficient at lower budgets, DLMs scale better with longer training, mirroring behaviors observed in diffusion language models. However, we observe smaller improvements than those reported in prior character-based work, which indicates that the DLM's performance is conditional on factors such as the vocabulary. Our analysis reveals that a key factor for improving performance is to condition the DLM on richer information from the ASR's hypothesis space, rather than just a single best guess. To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses, which consistently outperforms the previously proposed DSR decoding method. We believe our findings and public pipeline provide a crucial foundation for the community to better understand, improve, and build upon this promising class of models. The code is publicly available at https://github.com/rwth-i6/2025-denoising-lm/.

    https://arxiv.org/abs/2512.13576


    DP-CSGP: Differentially Private Stochastic Gradient Push with Compressed Communication

    oai:arXiv.org:2512.13583v1

    arXiv:2512.13583v1 Announce Type: new Abstract: In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}{\delta} \right)}/(\sqrt{n}J\epsilon) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(\epsilon, \delta\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.

    https://arxiv.org/abs/2512.13583


    ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

    oai:arXiv.org:2512.13586v1

    arXiv:2512.13586v1 Announce Type: new Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

    https://arxiv.org/abs/2512.13586


    astroCAMP: A Community Benchmark and Co-Design Framework for Sustainable SKA-Scale Radio Imaging

    oai:arXiv.org:2512.13591v1

    arXiv:2512.13591v1 Announce Type: new Abstract: The Square Kilometre Array (SKA) project will operate one of the world's largest continuous scientific data systems, sustaining petascale imaging under strict power caps. Yet, current radio-interferometric pipelines utilize only a small fraction of hardware peak performance, typically 4-14%, due to memory and I/O bottlenecks, resulting in poor energy efficiency and high operational and carbon costs. Progress is further limited by the absence of standardised metrics and fidelity tolerances, preventing principled hardware-software co-design and rigorous exploration of quality-efficiency trade-offs. We introduce astroCAMP, a framework for guiding the co-design of next-generation imaging pipelines and sustainable HPC architectures that maximise scientific return within SKA's operational and environmental limits. astroCAMP provides: (1) a unified, extensible metric suite covering scientific fidelity, computational performance, sustainability, and lifecycle economics; (2) standardised SKA-representative datasets and reference outputs enabling reproducible benchmarking across CPUs, GPUs, and emerging accelerators; and (3) a multi-objective co-design formulation linking scientific-quality constraints to time-, energy-, carbon-to-solution, and total cost of ownership. We release datasets, benchmarking results, and a reproducibility kit, and evaluate co-design metrics for WSClean and IDG on an AMD EPYC 9334 processor and an NVIDIA H100 GPU. Further, we illustrate the use of astroCAMP for heterogeneous CPU-FPGA design-space exploration, and its potential to facilitate the identification of Pareto-optimal operating points for SKA-scale imaging deployments. Last, we make a call to the SKA community to define quantifiable fidelity metrics and thresholds to accelerate principled optimisation for SKA-scale imaging.

    https://arxiv.org/abs/2512.13591


    Image Diffusion Preview with Consistency Solver

    oai:arXiv.org:2512.13592v1

    arXiv:2512.13592v1 Announce Type: new Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.

    https://arxiv.org/abs/2512.13592


    Scalable Formal Verification via Autoencoder Latent Space Abstraction

    oai:arXiv.org:2512.13593v1

    arXiv:2512.13593v1 Announce Type: new Abstract: Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

    https://arxiv.org/abs/2512.13593


    A homogeneous geometry of low-rank tensors

    oai:arXiv.org:2512.13594v1

    arXiv:2512.13594v1 Announce Type: new Abstract: We consider sets of fixed CP, multilinear, and TT rank tensors, and derive conditions for when (the smooth parts of) these sets are smooth homogeneous manifolds. For CP and TT ranks, the conditions are essentially that the rank is sufficiently low. These homogeneous structures are then used to derive Riemannian metrics whose geodesics are both complete and efficient to compute.

    https://arxiv.org/abs/2512.13594


    Lighting in Motion: Spatiotemporal HDR Lighting Estimation

    oai:arXiv.org:2512.13597v1

    arXiv:2512.13597v1 Announce Type: new Abstract: We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

    https://arxiv.org/abs/2512.13597


    Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

    oai:arXiv.org:2512.13598v1

    arXiv:2512.13598v1 Announce Type: new Abstract: A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.

    https://arxiv.org/abs/2512.13598


    DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides

    oai:arXiv.org:2512.13600v1

    arXiv:2512.13600v1 Announce Type: new Abstract: Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at https://github.com/zhanghaoyue/DA_SSL_TURBT.

    https://arxiv.org/abs/2512.13600


    LongVie 2: Multimodal Controllable Ultra-Long Video World Model

    oai:arXiv.org:2512.13604v1

    arXiv:2512.13604v1 Announce Type: new Abstract: Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

    https://arxiv.org/abs/2512.13604


    Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

    oai:arXiv.org:2512.13607v1

    arXiv:2512.13607v1 Announce Type: new Abstract: Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

    https://arxiv.org/abs/2512.13607


    DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis

    oai:arXiv.org:2512.13608v1

    arXiv:2512.13608v1 Announce Type: new Abstract: Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95\% CI: 0.76--0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95\% CI: 0.70--0.76, p<.001) and DenseNet-121 (0.74, 95\% CI: 0.71--0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95\% CI: 0.76--0.80) compared to DINOv2's 0.76 (95\% CI: 0.74--0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95\% CI: 0.60--0.74) compared to DBT-DINO with 0.62 (95\% CI: 0.53--0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8\% compared to Dinov2's 77.3\%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.

    https://arxiv.org/abs/2512.13608


    Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

    oai:arXiv.org:2512.13609v1

    arXiv:2512.13609v1 Announce Type: new Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.

    https://arxiv.org/abs/2512.13609


    QoeSiGN: Towards Qualified Collaborative eSignatures

    oai:arXiv.org:2512.13613v1

    arXiv:2512.13613v1 Announce Type: new Abstract: eSignatures ensure data's authenticity, non-repudiation, and integrity. EU's eIDAS regulation specifies, e.g., advanced and qualified (QES) eSignatures. While eSignatures' concrete legal effects depend on the individual case, QESs constitute the highest level of technical protection and authenticity under eIDAS. QESs are based on a qualified certificate issued by a qualified trust service provider (QTSP). Despite legal requirements, technically, a QTSP represents a single point of failure. Contrary, privacy-preserving collaborative computations (P2C2s) have become increasingly practical in recent years; yet lacking an extensive investigation on potential integrations in the QES landscape. We perform a threat analysis on the QES-creation process of Austria's national eID, using STRIDE and a DREAD-like model to extract requirement challenges (RCs) primarily related to: (1) Distributed Service Robustness, (2) Agile Crypto Deployment, and (3) Active User Involvement. To address these RCs, we present QoeSiGN, utilizing novel P2C2 technologies. While currently no P2C2 addresses all RCs, legal aspects, and practical efficiency simultaneously, QoeSiGN gives instantiation possibilities for different needs. For instance, "Multi-Party HSMs" for distributed hardware-secured computations; or secure multi-party computation (software) for highest crypto agility and user involvement, where the user participates in the QES computation. Deployment-wise, QTSPs would need to adapt the signing process and setup trusted communication channels. Legal-wise, QoeSiGN's implementation appears permissible, needing further analysis for realization. Technically, QoeSiGN addresses some regulation requirements better than the current solution, such as "sole control" or crypto agility. Our identified threats and extracted requirements can be transferred to the general QES ecosystem.

    https://arxiv.org/abs/2512.13613


    Hyper-Minrank: A Unified Hypergraph Characterization of Multi-Sender Index Coding

    oai:arXiv.org:2512.13615v1

    arXiv:2512.13615v1 Announce Type: new Abstract: This work introduces a hypergraph formulation that generalizes the classical paradigm of Bar-Yossef et al. to the multi-sender index coding (MSIC) setting. Central to the model is a 4-regular side-information hypergraph G, a new adjacency representation A_G = [A_1 ... A_N], and a simple fitting criterion for sub-hypergraph validity, in the presence of specially designed hyperedges that capture both side information and cross-sender signal cancellation. This formulation establishes a tight achievability-converse equivalence for the general N-sender, K-receiver problem: every valid fitting induces a valid linear multi-sender index code, every linear code induces a valid fitting, and the optimal scalar linear broadcast length equals the hyper-minrank l**lin(G) = hyperminrank(G) = min*{A fits G} sum_{n=1}^N rank(A_n). Beyond this exact characterization, the approach yields hypergraph analogues of Haemers-type bounds on the broadcast length, including a clique-cover upper bound and a lower bound via the clique number of a carefully defined complement hypergraph. Algorithmically, we provide an exact procedure to compute hyperminrank(G), and show that in certain regimes its complexity is asymptotically better than approximate LT-CMAR solutions. The framework captures well-known settings such as embedded index coding, and applies directly to multi-sender cache-aided communications, coded computation, distributed storage, and edge/satellite systems, where hyperminrank can serve as a unified design target.

    https://arxiv.org/abs/2512.13615


    LightTopoGAT: Enhancing Graph Attention Networks with Topological Features for Efficient Graph Classification

    oai:arXiv.org:2512.13617v1

    arXiv:2512.13617v1 Announce Type: new Abstract: Graph Neural Networks have demonstrated significant success in graph classification tasks, yet they often require substantial computational resources and struggle to capture global graph properties effectively. We introduce LightTopoGAT, a lightweight graph attention network that enhances node features through topological augmentation by incorporating node degree and local clustering coefficient to improve graph representation learning. The proposed approach maintains parameter efficiency through streamlined attention mechanisms while integrating structural information that is typically overlooked by local message passing schemes. Through comprehensive experiments on three benchmark datasets, MUTAG, ENZYMES, and PROTEINS, we show that LightTopoGAT achieves superior performance compared to established baselines including GCN, GraphSAGE, and standard GAT, with a 6.6 percent improvement in accuracy on MUTAG and a 2.2 percent improvement on PROTEINS. Ablation studies further confirm that these performance gains arise directly from the inclusion of topological features, demonstrating a simple yet effective strategy for enhancing graph neural network performance without increasing architectural complexity.

    https://arxiv.org/abs/2512.13617


    Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

    oai:arXiv.org:2512.13618v1

    arXiv:2512.13618v1 Announce Type: new Abstract: Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.

    https://arxiv.org/abs/2512.13618


    Preconditioning Techniques for Hybridizable Discontinuous Galerkin Discretizations on GPU Architectures

    oai:arXiv.org:2512.13619v1

    arXiv:2512.13619v1 Announce Type: new Abstract: We present scalable iterative solvers and preconditioning strategies for Hybridizable Discontinuous Galerkin (HDG) discretizations of partial differential equations (PDEs) on graphics processing units (GPUs). The HDG method is implemented using GPU-tailored algorithms in which local element degrees of freedom are eliminated in parallel, and the globally condensed system is assembled directly on the device using dense-block operations. The global matrix is stored in a block format that reflects the natural HDG structure, enabling all iterative solver kernels to be executed with strided batched dense matrix-vector multiplications. This implementation avoids sparse data structures, increases arithmetic intensity, and sustains high memory throughput across a range of meshes and polynomial orders. The nonlinear solver combines Newton's method with preconditioned GMRES, integrating scalable preconditioners such as block-Jacobi, additive Schwarz domain decomposition, and polynomial smoothers. All preconditioners are implemented in batched form with architecture-aware optimizations--including dense linear algebra kernels, memory-coalesced vector operations, and shared-memory acceleration--to minimize memory traffic and maximize parallel occupancy. Comprehensive studies are conducted for a variety of PDEs (including Poisson equation, Burgers equation, linear and nonlinear elasticity, Euler equations, Navier-Stokes equations, and Reynolds-Averaged Navier-Stokes equations) using structured and unstructured meshes with different element types and polynomial orders on both NVIDIA and AMD GPU architectures.

    https://arxiv.org/abs/2512.13619


    Torus Time-Spectral Method for Quasi-Periodic Problems

    oai:arXiv.org:2512.13631v1

    arXiv:2512.13631v1 Announce Type: new Abstract: Quasi-periodic trajectories with two or more incommensurate frequencies are ubiquitous in nonlinear dynamics, yet the classical Fourier-based time-spectral method is tied to strictly periodic responses. We introduce a torus time-spectral method that lifts the governing equations to an extended angular phase space, applies double-Fourier collocation on the invariant torus, and solves for the state. The formulation exhibits spectral convergence for quasi-periodic problem which we give a rigorous mathematical proof and also verify numerically. We demonstrate the approach on Duffing oscillators and a nonlinear Klein-Gordon system, documenting spectral error decay on the torus and tight agreement with time-accurate integrations while using modest frequency grids. The method extends naturally to higher-dimensional tori and offers a computationally efficient framework for analyzing quasi-periodic phenomena in fluid mechanics, plasma physics, celestial mechanics, and other domains where multi-frequency dynamics arise.

    https://arxiv.org/abs/2512.13631


    StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

    oai:arXiv.org:2512.13632v1

    arXiv:2512.13632v1 Announce Type: new Abstract: Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.

    https://arxiv.org/abs/2512.13632


    SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

    oai:arXiv.org:2512.13635v1

    arXiv:2512.13635v1 Announce Type: new Abstract: Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST

    https://arxiv.org/abs/2512.13635


    MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

    oai:arXiv.org:2512.13636v1

    arXiv:2512.13636v1 Announce Type: new Abstract: Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.

    https://arxiv.org/abs/2512.13636


    Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators

    oai:arXiv.org:2512.13638v1

    arXiv:2512.13638v1 Announce Type: new Abstract: Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program, as their optimal software mapping is deeply coupled with hardware design which is unwieldy to manual deployment. We propose "Design in Tiles (DiT)", an automated framework connecting a deployment toolchain with a configurable executable model for these accelerators. For evaluation, we apply our framework to GEMM targeting a large acceleration configuration (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s Bandwidth) comparable to an NVIDIA GH200. We achieve higher PE utilization than GH200 with its expert-tuned GEMM libraries, achieving 1.2-2.0x speedup across diverse matrix shapes.

    https://arxiv.org/abs/2512.13638


    Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

    oai:arXiv.org:2512.13639v1

    arXiv:2512.13639v1 Announce Type: new Abstract: This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

    https://arxiv.org/abs/2512.13639


    From Code to Field: Evaluating the Robustness of Convolutional Neural Networks for Disease Diagnosis in Mango Leaves

    oai:arXiv.org:2512.13641v1

    arXiv:2512.13641v1 Announce Type: new Abstract: The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.

    https://arxiv.org/abs/2512.13641


    Follow Nudges without Budges: A Field Experiment on Misinformation Followers Didn't Change Follow Networks

    oai:arXiv.org:2512.13643v1

    arXiv:2512.13643v1 Announce Type: new Abstract: Can digital ads encourage users exposed to inaccurate information sources to follow accurate ones? We conduct a large-scale field experiment (N=28,582) on X, formerly Twitter, with users who follow accounts that spread health misinformation. Participants were exposed to four ad treatments varied on two dimensions: a neutral message versus a persuasive message appealing to values of independence, and a request to follow a health institution versus a request to follow a health influencer. We term this ad-based, social network intervention a follow nudge. The ad with a persuasive message to follow a well-known health institution generated significantly higher click-through rates than all other conditions (Bonferroni-corrected pairwise tests, all p<0.001). Given the overall low click-through rate across treatments and the high cost of digital advertising infrastructure on X, however, we conclude that our proposed intervention -- at least in its current ad-based format -- is not a cost-effective means to improve information environments online. We discuss challenges faced when conducting large-scale experiments on X following the platform's ownership change and subsequent restrictions on data access for research purposes.

    https://arxiv.org/abs/2512.13643


    World Models Can Leverage Human Videos for Dexterous Manipulation

    oai:arXiv.org:2512.13644v1

    arXiv:2512.13644v1 Announce Type: new Abstract: Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.

    https://arxiv.org/abs/2512.13644


    Large-Language Memorization During the Classification of United States Supreme Court Cases

    oai:arXiv.org:2512.13654v1

    arXiv:2512.13654v1 Announce Type: new Abstract: Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.

    https://arxiv.org/abs/2512.13654


    Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

    oai:arXiv.org:2512.13655v1

    arXiv:2512.13655v1 Announce Type: new Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.

    https://arxiv.org/abs/2512.13655


    Embedding-Based Rankings of Educational Resources based on Learning Outcome Alignment: Benchmarking, Expert Validation, and Learner Performance

    oai:arXiv.org:2512.13658v1

    arXiv:2512.13658v1 Announce Type: new Abstract: As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.

    https://arxiv.org/abs/2512.13658


    RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

    oai:arXiv.org:2512.13660v1

    arXiv:2512.13660v1 Announce Type: new Abstract: Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.

    https://arxiv.org/abs/2512.13660


    Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

    oai:arXiv.org:2512.13665v1

    arXiv:2512.13665v1 Announce Type: new Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.

    https://arxiv.org/abs/2512.13665


    SEDULity: A Proof-of-Learning Framework for Distributed and Secure Blockchains with Efficient Useful Work

    oai:arXiv.org:2512.13666v1

    arXiv:2512.13666v1 Announce Type: new Abstract: The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.

    https://arxiv.org/abs/2512.13666


    A stylometric analysis of speaker attribution from speech transcripts

    oai:arXiv.org:2512.13667v1

    arXiv:2512.13667v1 Announce Type: new Abstract: Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.

    https://arxiv.org/abs/2512.13667


    A Scientific Reasoning Model for Organic Synthesis Procedure Generation

    oai:arXiv.org:2512.13668v1

    arXiv:2512.13668v1 Announce Type: new Abstract: Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.

    https://arxiv.org/abs/2512.13668


    NL2SpaTiaL: Generating Geometric Spatio-Temporal Logic Specifications from Natural Language for Manipulation Tasks

    oai:arXiv.org:2512.13670v1

    arXiv:2512.13670v1 Announce Type: new Abstract: Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements-an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: https://sites.google.com/view/nl2spatial

    https://arxiv.org/abs/2512.13670


    AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

    oai:arXiv.org:2512.13671v1

    arXiv:2512.13671v1 Announce Type: new Abstract: Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.

    https://arxiv.org/abs/2512.13671


    Directional Textual Inversion for Personalized Text-to-Image Generation

    oai:arXiv.org:2512.13672v1

    arXiv:2512.13672v1 Announce Type: new Abstract: Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.

    https://arxiv.org/abs/2512.13672


    Towards Interactive Intelligence for Digital Humans

    oai:arXiv.org:2512.13674v1

    arXiv:2512.13674v1 Announce Type: new Abstract: We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

    https://arxiv.org/abs/2512.13674


    Towards Effective Model Editing for LLM Personalization

    oai:arXiv.org:2512.13676v1

    arXiv:2512.13676v1 Announce Type: new Abstract: Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.

    https://arxiv.org/abs/2512.13676


    JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

    oai:arXiv.org:2512.13677v1

    arXiv:2512.13677v1 Announce Type: new Abstract: In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.

    https://arxiv.org/abs/2512.13677


    Feedforward 3D Editing via Text-Steerable Image-to-3D

    oai:arXiv.org:2512.13678v1

    arXiv:2512.13678v1 Announce Type: new Abstract: Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/

    https://arxiv.org/abs/2512.13678


    LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

    oai:arXiv.org:2512.13680v1

    arXiv:2512.13680v1 Announce Type: new Abstract: Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}$

    https://arxiv.org/abs/2512.13680


    I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

    oai:arXiv.org:2512.13683v1

    arXiv:2512.13683v1 Announce Type: new Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/

    https://arxiv.org/abs/2512.13683


    Recurrent Video Masked Autoencoders

    oai:arXiv.org:2512.13684v1

    arXiv:2512.13684v1 Announce Type: new Abstract: We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.

    https://arxiv.org/abs/2512.13684


    Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

    oai:arXiv.org:2512.13685v1

    arXiv:2512.13685v1 Announce Type: new Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.

    https://arxiv.org/abs/2512.13685


    Lyra: A Hardware-Accelerated RISC-V Verification Framework with Generative Model-Based Processor Fuzzing

    oai:arXiv.org:2512.13686v1

    arXiv:2512.13686v1 Announce Type: new Abstract: As processor designs grow more complex, verification remains bottlenecked by slow software simulation and low-quality random test stimuli. Recent research has applied software fuzzers to hardware verification, but these rely on semantically blind random mutations that may generate shallow, low-quality stimuli unable to explore complex behaviors. These limitations result in slow coverage convergence and prohibitively high verification costs. In this paper, we present Lyra, a heterogeneous RISC-V verification framework that addresses both challenges by pairing hardware-accelerated verification with an ISA-aware generative model. Lyra executes the DUT and reference model concurrently on an FPGA SoC, enabling high-throughput differential checking and hardware-level coverage collection. Instead of creating verification stimuli randomly or through simple mutations, we train a domain-specialized generative model, LyraGen, with inherent semantic awareness to generate high-quality, semantically rich instruction sequences. Empirical results show Lyra achieves up to $1.27\times$ higher coverage and accelerates end-to-end verification by up to $107\times$ to $3343\times$ compared to state-of-the-art software fuzzers, while consistently demonstrating lower convergence difficulty.

    https://arxiv.org/abs/2512.13686


    Towards Scalable Pre-training of Visual Tokenizers for Generation

    oai:arXiv.org:2512.13687v1

    arXiv:2512.13687v1 Announce Type: new Abstract: The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.

    https://arxiv.org/abs/2512.13687


    LitePT: Lighter Yet Stronger Point Transformer

    oai:arXiv.org:2512.13689v1

    arXiv:2512.13689v1 Announce Type: new Abstract: Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.

    https://arxiv.org/abs/2512.13689


    DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

    oai:arXiv.org:2512.13690v1

    arXiv:2512.13690v1 Announce Type: new Abstract: Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.

    https://arxiv.org/abs/2512.13690


    A Neural-preconditioned Poisson Solver for Mixed Dirichlet and Neumann Boundary Conditions

    oai:arXiv.org:2310.00177v5

    arXiv:2310.00177v5 Announce Type: cross Abstract: We introduce a neural-preconditioned iterative solver for Poisson equations with mixed boundary conditions. Typical Poisson discretizations yield large, ill-conditioned linear systems. Iterative solvers can be effective for these problems, but only when equipped with powerful preconditioners. Unfortunately, effective preconditioners like multigrid require costly setup phases that must be re-executed every time domain shapes or boundary conditions change, forming a severe bottleneck for problems with evolving boundaries. In contrast, we present a neural preconditioner trained to efficiently approximate the inverse of the discrete Laplacian in the presence of such changes. Our approach generalizes to domain shapes, boundary conditions, and grid sizes outside the training set. The key to our preconditioner's success is a novel, lightweight neural network architecture featuring spatially varying convolution kernels and supporting fast inference. We demonstrate that our solver outperforms state-of-the-art methods like algebraic multigrid as well as recently proposed neural preconditioners on challenging test cases arising from incompressible fluid simulations.

    https://arxiv.org/abs/2310.00177


    A Multitask VAE for Time Series Preprocessing and Prediction of Blood Glucose Level

    oai:arXiv.org:2410.00015v2

    arXiv:2410.00015v2 Announce Type: cross Abstract: Data preprocessing is a critical part of time series data analysis. Data from connected medical devices often have missing or abnormal values during acquisition. Handling such situations requires additional assumptions and domain knowledge. This can be time-consuming, and can introduce a significant bias affecting predictive model accuracy and thus, medical interpretation. To overcome this issue, we propose a new deep learning model to mitigate the preprocessing assumptions. The model architecture relies on a variational auto-encoder (VAE) to produce a preprocessing latent space, and a recurrent VAE to preserve the temporal dynamics of the data. We demonstrate the effectiveness of such an architecture on telemonitoring data to forecast glucose-level of diabetic patients. Our results show an improvement in terms of accuracy with respect of existing state-of-the-art methods and architectures.

    https://arxiv.org/abs/2410.00015


    AI-Generated Compromises for Coalition Formation

    oai:arXiv.org:2506.06837v3

    arXiv:2506.06837v3 Announce Type: cross Abstract: The challenge of finding compromises between agent proposals is fundamental to AI subfields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. A crucial step in this process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals remains an open question. We address this gap by formalizing a model that incorporates agent bounded rationality and uncertainty, and by developing AI methods to generate compromise proposals. We focus on the domain of collaborative document writing, such as the democratic drafting of a community constitution. Our approach uses natural language processing techniques and large language models to induce a semantic metric space over text. Based on this space, we design algorithms to suggest compromise points likely to receive broad support. To evaluate our methods, we simulate coalition formation processes and show that AI can facilitate large-scale democratic text editing, a domain where traditional tools are limited.

    https://arxiv.org/abs/2506.06837


    Model Discovery and Graph Simulation: A Lightweight Gateway to Chaos Engineering

    oai:arXiv.org:2506.11176v2

    arXiv:2506.11176v2 Announce Type: cross Abstract: Chaos engineering reveals resilience risks but is expensive and operationally risky to run broadly and often. Model-based analyses can estimate dependability, yet in practice they are tricky to build and keep current because models are typically handcrafted. We claim that a simple connectivity-only topological model - just the service-dependency graph plus replica counts - can provide fast, low-risk availability estimates under fail-stop faults. To make this claim practical without hand-built models, we introduce model discovery: an automated step that can run in CI/CD or as an observability-platform capability, synthesizing an explicit, analyzable model from artifacts teams already have (e.g., distributed traces, service-mesh telemetry, configs/manifests) - providing an accessible gateway for teams to begin resilience testing. As a proof by instance on the DeathStarBench Social Network, we extract the dependency graph from Jaeger and estimate availability across two deployment modes and five failure rates. The discovered model closely tracks live fault-injection results; with replication, median error at mid-range failure rates is near zero, while no-replication shows signed biases consistent with excluded mechanisms. These results create two opportunities: first, to triage and reduce the scope of expensive chaos experiments in advance, and second, to generate real-time signals on the system's resilience posture as its topology evolves, preserving live validation for the most critical or ambiguous scenarios.

    https://arxiv.org/abs/2506.11176


    EMNLP: Educator-role Moral and Normative Large Language Models Profiling

    oai:arXiv.org:2508.15250v3

    arXiv:2508.15250v3 Announce Type: cross Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.

    https://arxiv.org/abs/2508.15250


    Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

    oai:arXiv.org:2511.16619v1

    arXiv:2511.16619v1 Announce Type: cross Abstract: Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.

    https://arxiv.org/abs/2511.16619


    Vision Foundry: A System for Training Foundational Vision AI Models

    oai:arXiv.org:2512.11837v1

    arXiv:2512.11837v1 Announce Type: cross Abstract: Self-supervised learning (SSL) leverages vast unannotated medical datasets, yet steep technical barriers limit adoption by clinical researchers. We introduce Vision Foundry, a code-free, HIPAA-compliant platform that democratizes pre-training, adaptation, and deployment of foundational vision models. The system integrates the DINO-MX framework, abstracting distributed infrastructure complexities while implementing specialized strategies like Magnification-Aware Distillation (MAD) and Parameter-Efficient Fine-Tuning (PEFT). We validate the platform across domains, including neuropathology segmentation, lung cellularity estimation, and coronary calcium scoring. Our experiments demonstrate that models trained via Vision Foundry significantly outperform generic baselines in segmentation fidelity and regression accuracy, while exhibiting robust zero-shot generalization across imaging protocols. By bridging the gap between advanced representation learning and practical application, Vision Foundry enables domain experts to develop state-of-the-art clinical AI tools with minimal annotation overhead, shifting focus from engineering optimization to clinical discovery.

    https://arxiv.org/abs/2512.11837


    Understanding Structural Representation in Foundation Models for Polymers

    oai:arXiv.org:2512.11881v1

    arXiv:2512.11881v1 Announce Type: cross Abstract: From the relative scarcity of training data to the lack of standardized benchmarks, the development of foundation models for polymers face significant and multi-faceted challenges. At the core, many of these issues are tied directly to the structural representation of polymers and here, we present a new foundation model using a SMILES-based polymer graph representation. This approach allows representation of critical polymer architectural features and connectivity that are not available in other SMILES-based representations. The developed polymer foundation model exhibited excellent performance on 28 different benchmark datasets. Critical evaluation of the developed representation against other variations in control experiments reveals this approach to be a highly performant method of representing polymers in language-based foundation models. These control experiments also reveal a strong invariance of all SMILES representations, with many variations achieving state-of-the-art or near state-of-the-art performance, including those which are chemically or semantically invalid. Examination of error sources and attention maps for the evaluated representations corroborate the findings of the control experiments, showing that chemistry language models based on SMILES interpolate over all sequence space for prediction tasks, not only those of semantically valid inputs. Overall, this work highlights the importance of control experiments as a check on human-imposed assumptions that can limit rational design of both chemistry foundation models and their underlying structural representations.

    https://arxiv.org/abs/2512.11881


    A fine-grained look at causal effects in causal spaces

    oai:arXiv.org:2512.11919v1

    arXiv:2512.11919v1 Announce Type: cross Abstract: The notion of causal effect is fundamental across many scientific disciplines. Traditionally, quantitative researchers have studied causal effects at the level of variables; for example, how a certain drug dose (W) causally affects a patient's blood pressure (Y). However, in many modern data domains, the raw variables-such as pixels in an image or tokens in a language model-do not have the semantic structure needed to formulate meaningful causal questions. In this paper, we offer a more fine-grained perspective by studying causal effects at the level of events, drawing inspiration from probability theory, where core notions such as independence are first given for events and sigma-algebras, before random variables enter the picture. Within the measure-theoretic framework of causal spaces, a recently introduced axiomatisation of causality, we first introduce several binary definitions that determine whether a causal effect is present, as well as proving some properties of them linking causal effect to (in)dependence under an intervention measure. Further, we provide quantifying measures that capture the strength and nature of causal effects on events, and show that we can recover the common measures of treatment effect as special cases.

    https://arxiv.org/abs/2512.11919


    Gene regulatory network inference algorithm based on spectral signed directed graph convolution

    oai:arXiv.org:2512.11927v1

    arXiv:2512.11927v1 Announce Type: cross Abstract: Accurately reconstructing Gene Regulatory Networks (GRNs) is crucial for understanding gene functions and disease mechanisms. Single-cell RNA sequencing (scRNA-seq) technology provides vast data for computational GRN reconstruction. Since GRNs are ideally modeled as signed directed graphs to capture activation/inhibition relationships, the most intuitive and reasonable approach is to design feature extractors based on the topological structure of GRNs to extract structural features, then combine them with biological characteristics for research. However, traditional spectral graph convolution struggles with this representation. Thus, we propose MSGRNLink, a novel framework that explicitly models GRNs as signed directed graphs and employs magnetic signed Laplacian convolution. Experiments across simulated and real datasets demonstrate that MSGRNLink outperforms all baseline models in AUROC. Parameter sensitivity analysis and ablation studies confirmed its robustness and the importance of each module. In a bladder cancer case study, MSGRNLink predicted more known edges and edge signs than benchmark models, further validating its biological relevance.

    https://arxiv.org/abs/2512.11927


    Quantum circuits for permutation matrices

    oai:arXiv.org:2512.11938v1

    arXiv:2512.11938v1 Announce Type: cross Abstract: Two different algorithms are presented for generating a quantum circuit realization of a matrix representing a permutation on $2^n$ letters. All circuits involve $n$ qubits and only use multi--controlled Toffoli gates. The first algorithm constructs a circuit from any decomposition of the permutation into a product of transpositions, but uses one ancilla line. The second, which uses no ancillae, constructs a circuit from a decomposition into a product of transpositions that have a Hamming distance of one. We show that any permutation admits such a decomposition, and we give a strategy for reducing the number of transpositions involved.

    https://arxiv.org/abs/2512.11938


    Interval Fisher's Discriminant Analysis and Visualisation

    oai:arXiv.org:2512.11945v1

    arXiv:2512.11945v1 Announce Type: cross Abstract: In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher's Discriminant Analysis to interval-valued data, using Moore's interval arithmetic and the Mallows' distance. Fisher's objective function is generalised to consider simultaneously the contributions of the centres and the ranges of intervals and is numerically maximised. The resulting discriminant directions are then used to classify interval-valued observations.To support visual assessment, we adapt the class map, originally introduced for conventional data, to classifiers that assign labels through minimum distance rules. We also extend the silhouette plot to this setting and use stacked mosaic plots to complement the visual display of class assignments. Together, these graphical tools provide insight into classifier performance and the strength of class membership. Applications to real datasets illustrate the proposed methodology and demonstrate its value in interpreting classification results for interval-valued data.

    https://arxiv.org/abs/2512.11945


    Pre-training vision models for the classification of alerts from wide-field time-domain surveys

    oai:arXiv.org:2512.11957v1

    arXiv:2512.11957v1 Announce Type: cross Abstract: Modern wide-field time-domain surveys facilitate the study of transient, variable and moving phenomena by conducting image differencing and relaying alerts to their communities. Machine learning tools have been used on data from these surveys and their precursors for more than a decade, and convolutional neural networks (CNNs), which make predictions directly from input images, saw particularly broad adoption through the 2010s. Since then, continually rapid advances in computer vision have transformed the standard practices around using such models. It is now commonplace to use standardized architectures pre-trained on large corpora of everyday images (e.g., ImageNet). In contrast, time-domain astronomy studies still typically design custom CNN architectures and train them from scratch. Here, we explore the affects of adopting various pre-training regimens and standardized model architectures on the performance of alert classification. We find that the resulting models match or outperform a custom, specialized CNN like what is typically used for filtering alerts. Moreover, our results show that pre-training on galaxy images from Galaxy Zoo tends to yield better performance than pre-training on ImageNet or training from scratch. We observe that the design of standardized architectures are much better optimized than the custom CNN baseline, requiring significantly less time and memory for inference despite having more trainable parameters. On the eve of the Legacy Survey of Space and Time and other image-differencing surveys, these findings advocate for a paradigm shift in the creation of vision models for alerts, demonstrating that greater performance and efficiency, in time and in data, can be achieved by adopting the latest practices from the computer vision field.

    https://arxiv.org/abs/2512.11957


    Semantic search for 100M+ galaxy images using AI-generated captions

    oai:arXiv.org:2512.11982v1

    arXiv:2512.11982v1 Announce Type: cross Abstract: Finding scientifically interesting phenomena through slow, manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained multimodal astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search scalable to 140 million galaxy images, enabling discovery from previously infeasible searches. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at https://github.com/NolanKoblischke/AION-Search

    https://arxiv.org/abs/2512.11982


    Irregular Stanley sequences plausibly do not have growth $\Theta(n^2/\log n)$

    oai:arXiv.org:2512.11983v1

    arXiv:2512.11983v1 Announce Type: cross Abstract: Stanley sequences starting from the set $\{0, n\}$ where $n$ is a positive integer have long been conjectured to be divided into two types: the "regular" type where the growth rate is $\Theta(n^{\log_2(3)})$, and the "irregular" type where the growth rate is thought to be $\Theta(n^2/\log n)$. A paradigmatic case of a candidate irregular type is $n=4$, although to date no value of $n$ has been proven to have such a growth rate. Here, we provide strong numerical evidence against this conjectured growth rate for $n=4$. Specifically, for $n=4$, it seems plausible that the upper bound is $O(n^2/\log n)$ but that the lower bound is in fact $\Omega(n^{2-\delta})$ for some $\delta > 0$. This appears to be because the sequence is not totally "random" as has been assumed. Limitations of the numerical method here is discussed.

    https://arxiv.org/abs/2512.11983


    HWF-PIKAN: A Multi-Resolution Hybrid Wavelet-Fourier Physics-Informed Kolmogorov-Arnold Network for solving Collisionless Boltzmann Equation

    oai:arXiv.org:2512.12001v1

    arXiv:2512.12001v1 Announce Type: cross Abstract: Physics-Informed Neural Networks (PINNs) and more recently Physics-Informed Kolmogorov-Arnold Networks (PIKANs) have emerged as promising approaches for solving partial differential equations (PDEs) without reliance on extensive labeled data. In this work, we propose a novel multi-resolution Hybrid Wavelet-Fourier-Enhanced Physics-Informed Kolmogorov-Arnold Network (HWF-PIKAN) for solving advection problems based on collisionless Boltzmann equation (CBE) with both continuous and discontinuous initial conditions. To validate the effectiveness of the proposed model, we conduct systematic benchmarks on classical advection equations in one and two dimensions. These tests demonstrate the model's ability to accurately capture smooth and abrupt features. We then extend the application of HWF-PIKAN to the high-dimensional phase-space setting by solving the CBE in a continuous-velocity manner. This leverages the Hamiltonian concept of phase-space dynamics to model the statistical behavior of particles in a collisionless system, where advection governs the evolution of a probability distribution function or number density. Comparative analysis against Vanilla PINN, Vanilla PIKAN, as well as Fourier-enhanced and Wavelet-enhanced PIKAN variants, shows that the proposed hybrid model significantly improves solution accuracy and convergence speed. This study highlights the power of multi-resolution spectral feature embeddings in advancing physics-informed deep learning frameworks for complex kinetic equations in both space-time and phase-space.

    https://arxiv.org/abs/2512.12001


    Convergence of the Cumulant Expansion and Polynomial-Time Algorithm for Weakly Interacting Fermions

    oai:arXiv.org:2512.12010v1

    arXiv:2512.12010v1 Announce Type: cross Abstract: We propose a randomized algorithm to compute the log-partition function of weakly interacting fermions with polynomial runtime in both the system size and precision. Although weakly interacting fermionic systems are considered tractable for many computational methods such as the diagrammatic quantum Monte Carlo, a mathematically rigorous proof of polynomial runtime has been lacking. In this work we first extend the proof techniques developed in previous works for proving the convergence of the cumulant expansion in periodic systems to the non-periodic case. A key equation used to analyze the sum of connected Feynman diagrams, which we call the tree-determinant expansion, reveals an underlying tree structure in the summation. This enables us to design a new randomized algorithm to compute the log-partition function through importance sampling augmented by belief propagation. This approach differs from the traditional method based on Markov chain Monte Carlo, whose efficiency is hard to guarantee, and enables us to obtain a algorithm with provable polynomial runtime.

    https://arxiv.org/abs/2512.12010


    TreeVQA: A Tree-Structured Execution Framework for Shot Reduction in Variational Quantum Algorithms

    oai:arXiv.org:2512.12068v1

    arXiv:2512.12068v1 Announce Type: cross Abstract: Variational Quantum Algorithms (VQAs) are promising for near- and intermediate-term quantum computing, but their execution cost is substantial. Each task requires many iterations and numerous circuits per iteration, and real-world applications often involve multiple tasks, scaling with the precision needed to explore the application's energy landscape. This demands an enormous number of execution shots, making practical use prohibitively expensive. We observe that VQA costs can be significantly reduced by exploiting execution similarities across an application's tasks. Based on this insight, we propose TreeVQA, a tree-based execution framework that begins by executing tasks jointly and progressively branches only as their quantum executions diverge. Implemented as a VQA wrapper, TreeVQA integrates with typical VQA applications. Evaluations on scientific and combinatorial benchmarks show shot count reductions of $25.9\times$ on average and over $100\times$ for large-scale problems at the same target accuracy. The benefits grow further with increasing problem size and precision requirements.

    https://arxiv.org/abs/2512.12068


    The PPKN Gate: An Optimal 1-Toffoli Input-Preserving Full Adder for Quantum Arithmetic

    oai:arXiv.org:2512.12073v1

    arXiv:2512.12073v1 Announce Type: cross Abstract: Efficient arithmetic operations are a prerequisite for practical quantum computing. Optimization efforts focus on two primary metrics: Quantum Cost (QC), determined by the number of non-linear gates, and Logical Depth, which defines the execution speed. Existing literature identifies the HNG gate as the standard for Input-Preserving Reversible Full Adders. HNG gate typically requires a QC of 12 and a logical depth of 5, in the area of classical reversible circuits. This paper proposes the PPKN Gate, a novel design that achieves the same inputpreserving functionality using only one Toffoli gate and five CNOT gates. With a Quantum Cost of 10 and a reduced logical depth of 4, the PPKN gate outperforms the standard HNG gate in both complexity and speed. Furthermore, we present a modular architecture for constructing an n-bit Ripple Carry Adder by cascading PPKN modules.

    https://arxiv.org/abs/2512.12073


    Fractional Calculus in Optimal Control and Game Theory: Theory, Numerics, and Applications -- A Survey

    oai:arXiv.org:2512.12111v1

    arXiv:2512.12111v1 Announce Type: cross Abstract: Many physical, biological, and engineered systems exhibit memory effects that challenge Markovian models. Fractional calculus provides nonlocal operators to capture hereditary dynamics. This survey connects modeling, analysis, and controller/game design for systems with memory. We unify notation for Caputo, Riemann-Liouville, and Grunwald-Letnikov derivatives and relate them to practical approximations, including diffusive (sum-of-exponentials) state augmentation and frequency-domain realizations (e.g., Oustaloup). We review fractional extensions of the calculus of variations and the Pontryagin maximum principle, and dynamic-programming formulations with memory, including path-dependent HJB for optimal control and HJI for zero-sum games. We cover design tools such as LQR, MPC, and fractional-order PID, as well as fractional differential games with Nash, Stackelberg, and minimax equilibria. Computational approaches are compared across time-domain schemes, frequency-domain approximations, and diffusive augmentations, highlighting accuracy-complexity trade-offs and remedies for the curse of history (windowing and sum-of-exponentials). We conclude with applications and open problems on equilibria with memory, Isaacs-type conditions, constraint handling, and scalable solvers.

    https://arxiv.org/abs/2512.12111


    Modeling Dabrafenib Response Using Multi-Omics Modality Fusion and Protein Network Embeddings Based on Graph Convolutional Networks

    oai:arXiv.org:2512.12134v1

    arXiv:2512.12134v1 Announce Type: cross Abstract: Cancer cell response to targeted therapy arises from complex molecular interactions, making single omics insufficient for accurate prediction. This study develops a model to predict Dabrafenib sensitivity by integrating multiple omics layers (genomics, transcriptomics, proteomics, epigenomics, and metabolomics) with protein network embeddings generated using Graph Convolutional Networks (GCN). Each modality is encoded into low dimensional representations through neural network preprocessing. Protein interaction information from STRING is incorporated using GCN to capture biological topology. An attention based fusion mechanism assigns adaptive weights to each modality according to its relevance. Using GDSC cancer cell line data, the model shows that selective integration of two modalities, especially proteomics and transcriptomics, achieves the best test performance (R2 around 0.96), outperforming all single omics and full multimodal settings. Genomic and epigenomic data were less informative, while proteomic and transcriptomic layers provided stronger phenotypic signals related to MAPK inhibitor activity. These results show that attention guided multi omics fusion combined with GCN improves drug response prediction and reveals complementary molecular determinants of Dabrafenib sensitivity. The approach offers a promising computational framework for precision oncology and predictive modeling of targeted therapies.

    https://arxiv.org/abs/2512.12134


    DCAF-Net: Dual-Channel Attentive Fusion Network for Lower Limb Motion Intention Prediction in Stroke Rehabilitation Exoskeletons

    oai:arXiv.org:2512.12184v1

    arXiv:2512.12184v1 Announce Type: cross Abstract: Rehabilitation exoskeletons have shown promising results in promoting recovery for stroke patients. Accurately and timely identifying the motion intentions of patients is a critical challenge in enhancing active participation during lower limb exoskeleton-assisted rehabilitation training. This paper proposes a Dual-Channel Attentive Fusion Network (DCAF-Net) that synergistically integrates pre-movement surface electromyography (sEMG) and inertial measurement unit (IMU) data for lower limb intention prediction in stroke patients. First, a dual-channel adaptive channel attention module is designed to extract discriminative features from 48 time-domain and frequency-domain features derived from bilateral gastrocnemius sEMG signals. Second, an IMU encoder combining convolutional neural network (CNN) and attention-based long short-term memory (attention-LSTM) layers is designed to decode temporal-spatial movement patterns. Third, the sEMG and IMU features are fused through concatenation to enable accurate recognition of motion intention. Extensive experiment on 11 participants (8 stroke subjects and 3 healthy subjects) demonstrate the effectiveness of DCAF-Net. It achieved a prediction accuracies of 97.19% for patients and 93.56% for healthy subjects. This study provides a viable solution for implementing intention-driven human-in-the-loop assistance control in clinical rehabilitation robotics.

    https://arxiv.org/abs/2512.12184


    Asymmetry in Spectral Graph Theory: Harmonic Analysis on Directed Networks via Biorthogonal Bases (Adjacency-Operator Formulation)

    oai:arXiv.org:2512.12226v1

    arXiv:2512.12226v1 Announce Type: cross Abstract: Classical spectral graph theory and graph signal processing rely on a symmetry principle: undirected graphs induce symmetric (self-adjoint) adjacency/Laplacian operators, yielding orthogonal eigenbases and energy-preserving Fourier expansions. Real-world networks are typically directed and hence asymmetric, producing non-self-adjoint and frequently non-normal operators for which orthogonality fails and spectral coordinates can be ill-conditioned. In this paper we develop an original harmonic-analysis framework for directed networks centered on the \emph{adjacency} operator. We propose a \emph{Biorthogonal Graph Fourier Transform} (BGFT) built from left/right eigenvectors, formulate directed ``frequency'' and filtering in the non-Hermitian setting, and quantify how asymmetry and non-normality affect stability via condition numbers and a departure-from-normality functional. We prove exact synthesis/analysis identities under diagonalizability, establish sampling-and-reconstruction guarantees for BGFT-bandlimited signals, and derive perturbation/stability bounds that explain why naive orthogonal-GFT assumptions break down on non-normal directed graphs. A simulation protocol compares undirected versus directed cycles (asymmetry without non-normality) and a perturbed directed cycle (genuine non-normality), demonstrating that BGFT yields coherent reconstruction and filtering across asymmetric regimes.

    https://arxiv.org/abs/2512.12226


    On Well-VE-Dominated Graphs

    oai:arXiv.org:2512.12231v1

    arXiv:2512.12231v1 Announce Type: cross Abstract: Given a graph G=(V, E), a vertex is said to ve-dominate an edge if it is either incident with the edge or adjacent to one of its endpoints. A set of vertices is a ve-dominating set if it ve-dominates every edge of the graph. We introduce the class of well-ve-dominated graphs, defined as graphs in which all minimal ve-dominating sets have the same cardinality. After establishing several general structural properties of well-ve-dominated graphs, we show that recognizing whether a graph belongs to this class is co--NP--complete, highlighting the computational difficulty of the problem. Our main result is a complete structural characterization of well-ve-dominated trees, which yields a simple linear-time recognition algorithm and a constructive description of all trees in this class.

    https://arxiv.org/abs/2512.12231


    Resolution-Independent Neural Operators for Multi-Rate Sparse-View CT

    oai:arXiv.org:2512.12236v1

    arXiv:2512.12236v1 Announce Type: cross Abstract: Sparse-view Computed Tomography (CT) reconstructs images from a limited number of X-ray projections to reduce radiation and scanning time, which makes reconstruction an ill-posed inverse problem. Deep learning methods achieve high-fidelity reconstructions but often overfit to a fixed acquisition setup, failing to generalize across sampling rates and image resolutions. For example, convolutional neural networks (CNNs) use the same learned kernels across resolutions, leading to artifacts when data resolution changes. We propose Computed Tomography neural Operator (CTO), a unified CT reconstruction framework that extends to continuous function space, enabling generalization (without retraining) across sampling rates and image resolutions. CTO operates jointly in the sinogram and image domains through rotation-equivariant Discrete-Continuous convolutions parametrized in the function space, making it inherently resolution- and sampling-agnostic. Empirically, CTO enables consistent multi-sampling-rate and cross-resolution performance, with on average >4dB PSNR gain over CNNs. Compared to state-of-the-art diffusion methods, CTO is 500$\times$ faster in inference time with on average 3dB gain. Empirical results also validate our design choices behind CTO's sinogram-space operator learning and rotation-equivariant convolution. Overall, CTO outperforms state-of-the-art baselines across sampling rates and resolutions, offering a scalable and generalizable solution that makes automated CT reconstruction more practical for deployment.

    https://arxiv.org/abs/2512.12236


    Stochastic Volatility Modelling with LSTM Networks: A Hybrid Approach for S&P 500 Index Volatility Forecasting

    oai:arXiv.org:2512.12250v1

    arXiv:2512.12250v1 Announce Type: cross Abstract: Accurate volatility forecasting is essential in banking, investment, and risk management, because expectations about future market movements directly influence current decisions. This study proposes a hybrid modelling framework that integrates a Stochastic Volatility model with a Long Short Term Memory neural network. The SV model improves statistical precision and captures latent volatility dynamics, especially in response to unforeseen events, while the LSTM network enhances the model's ability to detect complex nonlinear patterns in financial time series. The forecasting is conducted using daily data from the S and P 500 index, covering the period from January 1 1998 to December 31 2024. A rolling window approach is employed to train the model and generate one step ahead volatility forecasts. The performance of the hybrid SV-LSTM model is evaluated through both statistical testing and investment simulations. The results show that the hybrid approach outperforms both the standalone SV and LSTM models and contributes to the development of volatility modelling techniques, providing a foundation for improving risk assessment and strategic investment planning in the context of the S and P 500.

    https://arxiv.org/abs/2512.12250


    Forbidden Induced Subgraph Characterization of Word-Representable Split Graphs

    oai:arXiv.org:2512.12259v1

    arXiv:2512.12259v1 Announce Type: cross Abstract: The class of word-representable graphs, introduced in connection with the study of the Perkins semigroup by Kitaev and Seif, has attracted significant attention in combinatorics and theoretical computer science due to its deep connections with graph orientations and combinatorics on words. A graph is word-representable if and only if it admits a semi-transitive orientation, which is an acyclic orientation such that for any directed path $v_0 \rightarrow v_1 \rightarrow \cdots \rightarrow v_m$ with $m \ge 2$, either there is no arc between $v_0$ and $v_m$, or, for all $1 \le i < j \le m$, there exists an arc from $v_i$ to $v_j$. Split graphs, whose vertex set can be partitioned into a clique and an independent set, constitute a natural yet nontrivial subclass for studying word-representability. However, not all split graphs are semi-transitive, and the characterization of minimal forbidden induced subgraphs for semi-transitive split graphs remains an open problem. In this paper, we introduce a new matrix property called the $I$-circular property, which is closely related to the well-known $D$-circular property introduced by Safe. The $I$-circular property requires that both the rows of a matrix and the pairwise intersections of rows form circular intervals under some linear ordering of the columns. Using this property, we establish a direct connection between the structure of semi-transitive split graphs and the matrix representation of their adjacency relationships. Our main result is a complete forbidden submatrix characterization of the $I$-circular property, which in turn provides a characterization for semi-transitive split graphs in terms of minimal forbidden induced subgraphs.

    https://arxiv.org/abs/2512.12259


    Hellinger loss function for Generative Adversarial Networks

    oai:arXiv.org:2512.12267v1

    arXiv:2512.12267v1 Announce Type: cross Abstract: We propose Hellinger-type loss functions for training Generative Adversarial Networks (GANs), motivated by the boundedness, symmetry, and robustness properties of the Hellinger distance. We define an adversarial objective based on this divergence and study its statistical properties within a general parametric framework. We establish the existence, uniqueness, consistency, and joint asymptotic normality of the estimators obtained from the adversarial training procedure. In particular, we analyze the joint estimation of both generator and discriminator parameters, offering a comprehensive asymptotic characterization of the resulting estimators. We introduce two implementations of the Hellinger-type loss and we evaluate their empirical behavior in comparison with the classic (Maximum Likelihood-type) GAN loss. Through a controlled simulation study, we demonstrate that both proposed losses yield improved estimation accuracy and robustness under increasing levels of data contamination.

    https://arxiv.org/abs/2512.12267


    Accurate de novo sequencing of the modified proteome with OmniNovo

    oai:arXiv.org:2512.12272v1

    arXiv:2512.12272v1 Announce Type: cross Abstract: Post-translational modifications (PTMs) serve as a dynamic chemical language regulating protein function, yet current proteomic methods remain blind to a vast portion of the modified proteome. Standard database search algorithms suffer from a combinatorial explosion of search spaces, limiting the identification of uncharacterized or complex modifications. Here we introduce OmniNovo, a unified deep learning framework for reference-free sequencing of unmodified and modified peptides directly from tandem mass spectra. Unlike existing tools restricted to specific modification types, OmniNovo learns universal fragmentation rules to decipher diverse PTMs within a single coherent model. By integrating a mass-constrained decoding algorithm with rigorous false discovery rate estimation, OmniNovo achieves state-of-the-art accuracy, identifying 51\% more peptides than standard approaches at a 1\% false discovery rate. Crucially, the model generalizes to biological sites unseen during training, illuminating the dark matter of the proteome and enabling unbiased comprehensive analysis of cellular regulation.

    https://arxiv.org/abs/2512.12272


    Forbidden Induced Subgraph Characterization of Word-Representable Co-bipartite Graphs

    oai:arXiv.org:2512.12274v1

    arXiv:2512.12274v1 Announce Type: cross Abstract: A graph $G$ with vertex set $V(G)$ and edge set $E(G)$ is said to be word-representable if there exists a word $w$ over the alphabet $V(G)$ such that, for any two distinct letters $x,y \in V(G)$, the letters $x$ and $y$ alternate in $w$ if and only if $(x,y) \in E(G)$. Equivalently, a graph is word-representable if and only if it admits a semi-transitive orientation, that is, an acyclic orientation in which, for every directed path $v_0 \rightarrow v_1 \rightarrow \cdots \rightarrow v_m$ with $m \ge 2$, either there is no arc between $v_0$ and $v_m$, or, for all $1 \le i < j \le m$, there exists an arc from $v_i$ to $v_j$. In this work, we provide a comprehensive structural and algorithmic characterization of word-representable co-bipartite graphs, a class of graphs whose vertex set can be partitioned into two cliques. This work unifies graph-theoretic and matrix-theoretic perspectives. We first establish that a co-bipartite graph is a circle graph if and only if it is a permutation graph, thereby deriving a minimal forbidden induced subgraph characterization for co-bipartite circle graphs. The central contribution then connects semi-transitivity with the circularly compatible ones property of binary matrices. In addition to the structural characterization, the paper introduces a linear-time recognition algorithm for semi-transitive co-bipartite graphs, utilizing Safe's matrix recognition framework.

    https://arxiv.org/abs/2512.12274


    V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

    oai:arXiv.org:2512.12284v1

    arXiv:2512.12284v1 Announce Type: cross Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Due to its iterative prefill stage, it suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.

    https://arxiv.org/abs/2512.12284


    Robust Outlier Detection and Low-Latency Concept Drift Adaptation for Data Stream Regression: A Dual-Channel Architecture

    oai:arXiv.org:2512.12289v1

    arXiv:2512.12289v1 Announce Type: cross Abstract: Outlier detection and concept drift detection represent two challenges in data analysis. Most studies address these issues separately. However, joint detection mechanisms in regression remain underexplored, where the continuous nature of output spaces makes distinguishing drifts from outliers inherently challenging. To address this, we propose a novel robust regression framework for joint outlier and concept drift detection. Specifically, we introduce a dual-channel decision process that orchestrates prediction residuals into two coupled logic flows: a rapid response channel for filtering point outliers and a deep analysis channel for diagnosing drifts. We further develop the Exponentially Weighted Moving Absolute Deviation with Distinguishable Types (EWMAD-DT) detector to autonomously differentiate between abrupt and incremental drifts via dynamic thresholding. Comprehensive experiments on both synthetic and real-world datasets demonstrate that our unified framework, enhanced by EWMAD-DT, exhibits superior detection performance even when point outliers and concept drifts coexist.

    https://arxiv.org/abs/2512.12289


    Vertex-edge domination on subclasses of bipartite graphs

    oai:arXiv.org:2512.12292v1

    arXiv:2512.12292v1 Announce Type: cross Abstract: Given a simple undirected graph $G = (V, E)$, the open neighbourhood of a vertex $v \in V$ is defined as $N_G(v) = \{u \in V \mid uv \in E\}$, and the closed neighbourhood as $N_G[v] = N_G(v) \cup \{v\}$. A subset $D \subseteq V$ is called a vertex-edge dominating set if, for every edge $uv \in E$, at least one vertex from $D$ appears in $N_G[u] \cup N_G[v]$; that is, $\vert (N_G[u] \cup N_G[v]) \cap D\vert \geq 1$. Intuitively, a vertex-edge dominating set ensures that every edge, as well as all edges incident to either of its endpoints, is dominated by at least one vertex from the set. The \textsc{Min-VEDS} problem asks for a vertex-edge dominating set of minimum size in a given graph. This problem is known to be NP-complete even for bipartite graphs. In this paper, we strengthen this hardness result by proving that the problem remains NP-complete for two specific subclasses of bipartite graphs: star-convex and comb-convex bipartite graphs. For a graph $G$ on $n$ vertices, it is known that the \textsc{Min-VEDS} problem cannot be approximated within a factor of $(1 - \epsilon)\ln |V|$ for any $\epsilon > 0$, unless $\text{NP} \subseteq \text{DTIME}(|V|^{O(\log \log |V|)})$. We also prove that this inapproximability result holds even for star-convex and comb-convex bipartite graphs. On the positive side, we present a polynomial-time algorithm for computing a minimum vertex-edge dominating set in convex bipartite graphs. A polynomial-time algorithm for this graph class was also proposed by B{\"u}y{\"u}k{\c{c}}olak et al.~\cite{buyukccolak2025linear}, but we show that their algorithm has certain flaws by providing instances where it fails to produce an optimal solution. We address this issue by presenting a modified algorithm that correctly computes an optimal solution.

    https://arxiv.org/abs/2512.12292


    Dominated balanced separators in wheel-induced-minor-free graphs

    oai:arXiv.org:2512.12329v1

    arXiv:2512.12329v1 Announce Type: cross Abstract: Gartland and Lokshtanov conjectured that every graph that excludes some planar graph as an induced minor has a balanced separator, that is, a separator whose deletion leaves every component with no more than half of the vertices of the graph, which is dominated by a bounded number of vertices. We confirm this conjecture for excluding any fixed wheel, that is, a cycle together with a universal vertex, as an induced minor.

    https://arxiv.org/abs/2512.12329


    Extending the application of dynamic Bayesian networks in calculating market risk: Standard and stressed expected shortfall

    oai:arXiv.org:2512.12334v1

    arXiv:2512.12334v1 Announce Type: cross Abstract: In the last five years, expected shortfall (ES) and stressed ES (SES) have become key required regulatory measures of market risk in the banking sector, especially following events such as the global financial crisis. Thus, finding ways to optimize their estimation is of great importance. We extend the application of dynamic Bayesian networks (DBNs) to the estimation of 10-day 97.5% ES and stressed ES, building on prior work applying DBNs to value at risk. Using the S&P 500 index as a proxy for the equities trading desk of a US bank, we compare the performance of three DBN structure-learning algorithms with several traditional market risk models, using either the normal or the skewed Student's t return distributions. Backtesting shows that all models fail to produce statistically accurate ES and SES forecasts at the 2.5% level, reflecting the difficulty of modeling extreme tail behavior. For ES, the EGARCH(1,1) model (normal) produces the most accurate forecasts, while, for SES, the GARCH(1,1) model (normal) performs best. All distribution-dependent models deteriorate substantially when using the skewed Student's t distribution. The DBNs perform comparably to the historical simulation model, but their contribution to tail prediction is limited by the small weight assigned to their one-day-ahead forecasts within the return distribution. Future research should examine weighting schemes that enhance the influence of forward-looking DBN forecasts on tail risk estimation.

    https://arxiv.org/abs/2512.12334


    Understanding Main Path Analysis

    oai:arXiv.org:2512.12355v1

    arXiv:2512.12355v1 Announce Type: cross Abstract: Main path analysis has long been used to trace knowledge trajectories in citation networks, yet it lacks solid theoretical foundations. To understand when and why this approach succeeds, we analyse directed acyclic graphs created from two types of artificial models and by looking at over twenty networks derived from real data. We show that entropy-based variants of main path analysis optimise geometric distance measures, providing its first information-theoretic and geometric basis. Numerical results demonstrate that existing algorithms converge on near-geodesic solutions. We also show that an approach based on longest paths produces similar results, is equally well motivated yet is much simpler to implement. However, the traditional single-path focus is unnecessarily restrictive, as many near-optimal paths highlight different key nodes. We introduce an approach using ``baskets'' of nodes where we select a fraction of nodes with the smallest values of a measure we call ``generalised criticality''. Analysis of large vaccine citation networks shows that these baskets achieve comprehensive algorithmic coverage, offering a robust, simple, and computationally efficient way to identify core knowledge structures. In practice, we find that those nodes with zero unit criticality capture the information in main paths in almost all cases and capture a wider range of key nodes without unnecessarily increasing the number of nodes considered. We find no advantage in using the traditional main path methods.

    https://arxiv.org/abs/2512.12355


    Towards a pretrained deep learning estimator of the Linfoot informational correlation

    oai:arXiv.org:2512.12358v1

    arXiv:2512.12358v1 Announce Type: cross Abstract: We develop a supervised deep-learning approach to estimate mutual information between two continuous random variables. As labels, we use the Linfoot informational correlation, a transformation of mutual information that has many important properties. Our method is based on ground truth labels for Gaussian and Clayton copulas. We compare our method with estimators based on kernel density, k-nearest neighbours and neural estimators. We show generally lower bias and lower variance. As a proof of principle, future research could look into training the model with a more diverse set of examples from other copulas for which ground truth labels are available.

    https://arxiv.org/abs/2512.12358


    JPEG-Inspired Cloud-Edge Holography

    oai:arXiv.org:2512.12367v1

    arXiv:2512.12367v1 Announce Type: cross Abstract: Computer-generated holography (CGH) presents a transformative solution for near-eye displays in augmented and virtual reality. Recent advances in deep learning have greatly improved CGH in reconstructed quality and computational efficiency. However, deploying neural CGH pipelines directly on compact, eyeglass-style devices is hindered by stringent constraints on computation and energy consumption, while cloud offloading followed by transmission with natural image codecs often distorts phase information and requires high bandwidth to maintain reconstruction quality. Neural compression methods can reduce bandwidth but impose heavy neural decoders at the edge, increasing inference latency and hardware demand. In this work, we introduce JPEG-Inspired Cloud-Edge Holography, an efficient pipeline designed around a learnable transform codec that retains the block-structured and hardware-friendly nature of JPEG. Our system shifts all heavy neural processing to the cloud, while the edge device performs only lightweight decoding without any neural inference. To further improve throughput, we implement custom CUDA kernels for entropy coding on both cloud and edge. This design achieves a peak signal-to-noise ratio of 32.15 dB at $<$ 2 bits per pixel with decode latency as low as 4.2 ms. Both numerical simulations and optical experiments confirm the high reconstruction quality of the holograms. By aligning CGH with a codec that preserves JPEG's structural efficiency while extending it with learnable components, our framework enables low-latency, bandwidth-efficient hologram streaming on resource-constrained wearable devices-using only simple block-based decoding readily supported by modern system-on-chips, without requiring neural decoders or specialized hardware.

    https://arxiv.org/abs/2512.12367


    The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework

    oai:arXiv.org:2512.12394v1

    arXiv:2512.12394v1 Announce Type: cross Abstract: We present a simple structure based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf like rank frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic morpheme inventories also generate rank frequency curves with Zipf like exponents around 1.1-1.4, similar to English, Russian and Romance languages. The key result is that Zipf like behavior can emerge without meaning, communication pressure or optimization principles. The internal structure of morphology alone, combined with probabilistic activation of slots, is sufficient to create the robust statistical patterns observed across languages.

    https://arxiv.org/abs/2512.12394


    Data-driven modelling of autonomous and forced dynamical systems

    oai:arXiv.org:2512.12432v1

    arXiv:2512.12432v1 Announce Type: cross Abstract: The paper demonstrates that invariant foliations are accurate, data-efficient and practical tools for data-driven modelling of physical systems. Invariant foliations can be fitted to data that either fill the phase space or cluster about an invariant manifold. Invariant foliations can be fitted to a single trajectory or multiple trajectories. Over and underfitting are eliminated by appropriately choosing a function representation and its hyperparameters, such as polynomial orders. The paper extends invariant foliations to forced and parameter dependent systems. It is assumed that forcing is provided by a volume preserving map, and therefore the forcing can be periodic, quasi-periodic or even chaotic. The method utilises full trajectories, hence it is able to predict long-term dynamics accurately. We take into account if a forced system is reducible to an autonomous system about a steady state, similar to how Floquet theory guarantees reducibility for periodically forced systems. In order to find an invariant manifold, multiple invariant foliations are calculated in the neighbourhood of the invariant manifold. Some of the invariant foliations can be linear, while others nonlinear but only defined in a small neighbourhood of an invariant manifold, which reduces the number of parameters to be identified. An invariant manifold is recovered as the zero level set of one or more of the foliations. To interpret the results, the identified mathematical models are transformed to a canonical form and instantaneous frequency and damping information are calculated.

    https://arxiv.org/abs/2512.12432


    A Software Package for Generating Robust and Accurate Potentials using the Moment Tensor Potential Framework

    oai:arXiv.org:2512.12433v1

    arXiv:2512.12433v1 Announce Type: cross Abstract: We present the Plan for Robust and Accurate Potentials (PRAPs), a software package for training and using moment tensor potentials (MTPs) in concert with the Machine Learned Interatomic Potentials (MLIP) software package. PRAPs provides an automated workflow to train MTPs using active learning procedures, and a variety of utilities to ease and improve workflows when utilizing the MLIP software. PRAPs was originally developed in the context of crystal structure prediction, in which one calculates convex hulls and predicts low energy metastable and thermodynamically stable structures, but the potentials PRAPs develops are not limited to such applications. PRAPs produces two potentials, one capable of rough estimates of the energies, forces and stresses of almost any chemical structure in the specified compositional space -- the Robust Potential -- and a second potential intended to provide more accurate descriptions of ground state and metastable structures -- the Accurate Potential. We also present a Python library, mliputils, designed to assist users in working with the chemical structural files used by the MLIP package.

    https://arxiv.org/abs/2512.12433


    Co-Hub Node Based Multiview Graph Learning with Theoretical Guarantees

    oai:arXiv.org:2512.12435v1

    arXiv:2512.12435v1 Announce Type: cross Abstract: Identifying the graphical structure underlying the observed multivariate data is essential in numerous applications. Current methodologies are predominantly confined to deducing a singular graph under the presumption that the observed data are uniform. However, many contexts involve heterogeneous datasets that feature multiple closely related graphs, typically referred to as multiview graphs. Previous research on multiview graph learning promotes edge-based similarity across layers using pairwise or consensus-based regularizers. However, multiview graphs frequently exhibit a shared node-based architecture across different views, such as common hub nodes. Such commonalities can enhance the precision of learning and provide interpretive insight. In this paper, we propose a co-hub node model, positing that different views share a common group of hub nodes. The associated optimization framework is developed by enforcing structured sparsity on the connections of these co-hub nodes. Moreover, we present a theoretical examination of layer identifiability and determine bounds on estimation error. The proposed methodology is validated using both synthetic graph data and fMRI time series data from multiple subjects to discern several closely related graphs.

    https://arxiv.org/abs/2512.12435


    Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

    oai:arXiv.org:2512.12442v1

    arXiv:2512.12442v1 Announce Type: cross Abstract: Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations. However, the reconstruction from the GPR models suffers from high computation complexity. To make the situation worse, classic approaches for visualizing the data uncertainties, like probabilistic marching cubes, are also computationally very expensive, especially for data of high resolutions. In this paper, we accelerate the level-crossing probability calculation efficiency on GPR models by subdividing the data spatially into a hierarchical data structure and only reconstructing values adaptively in the regions that have a non-zero probability. For each region, leveraging the known GPR kernel and the saved data observations, we propose a novel approach to efficiently calculate an upper bound for the level-crossing probability inside the region and use this upper bound to make the subdivision and reconstruction decisions. We demonstrate that our value occurrence probability estimation is accurate with a low computation cost by experiments that calculate the level-crossing probability fields on different datasets.

    https://arxiv.org/abs/2512.12442


    Understanding Overparametrization in Survival Models through Double-Descent

    oai:arXiv.org:2512.12463v1

    arXiv:2512.12463v1 Announce Type: cross Abstract: Classical statistical learning theory predicts a U-shaped relationship between test loss and model capacity, driven by the bias-variance trade-off. Recent advances in modern machine learning have revealed a more complex pattern, double-descent, in which test loss, after peaking near the interpolation threshold, decreases again as model capacity continues to grow. While this behavior has been extensively analyzed in regression and classification, its manifestation in survival analysis remains unexplored. This study investigates double-descent in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. We rigorously define interpolation and finite-norm interpolation, two key characteristics of loss-based models to understand double-descent. We then show the existence (or absence) of (finite-norm) interpolation of all four models. Our findings clarify how likelihood-based losses and model implementation jointly determine the feasibility of interpolation and show that overfitting should not be regarded as benign for survival models. All theoretical results are supported by numerical experiments that highlight the distinct generalization behaviors of survival models.

    https://arxiv.org/abs/2512.12463


    Explainable Artificial Intelligence for Economic Time Series: A Comprehensive Review and a Systematic Taxonomy of Methods and Concepts

    oai:arXiv.org:2512.12506v1

    arXiv:2512.12506v1 Announce Type: cross Abstract: Explainable Artificial Intelligence (XAI) is increasingly required in computational economics, where machine-learning forecasters can outperform classical econometric models but remain difficult to audit and use for policy. This survey reviews and organizes the growing literature on XAI for economic time series, where autocorrelation, non-stationarity, seasonality, mixed frequencies, and regime shifts can make standard explanation techniques unreliable or economically implausible. We propose a taxonomy that classifies methods by (i) explanation mechanism: propagation-based approaches (e.g., Integrated Gradients, Layer-wise Relevance Propagation), perturbation and game-theoretic attribution (e.g., permutation importance, LIME, SHAP), and function-based global tools (e.g., Accumulated Local Effects); (ii) time-series compatibility, including preservation of temporal dependence, stability over time, and respect for data-generating constraints. We synthesize time-series-specific adaptations such as vector- and window-based formulations (e.g., Vector SHAP, WindowSHAP) that reduce lag fragmentation and computational cost while improving interpretability. We also connect explainability to causal inference and policy analysis through interventional attributions (Causal Shapley values) and constrained counterfactual reasoning. Finally, we discuss intrinsically interpretable architectures (notably attention-based transformers) and provide guidance for decision-grade applications such as nowcasting, stress testing, and regime monitoring, emphasizing attribution uncertainty and explanation dynamics as indicators of structural change.

    https://arxiv.org/abs/2512.12506


    Robustness analysis in static and dynamic quantum state tomography

    oai:arXiv.org:2512.12518v1

    arXiv:2512.12518v1 Announce Type: cross Abstract: Quantum state tomography is a core task in quantum system identification. Real experimental conditions often deviate from nominal designs, introducing errors in both the measurement devices and the Hamiltonian governing the system's dynamics. In this paper, we investigate the robustness of quantum state tomography against such perturbations in both static and dynamic settings using linear regression estimation. We derive explicit bounds that quantify how bounded errors in the measurement devices and the Hamiltonian affect the mean squared error (MSE) upper bound in each scenario. Numerical simulations for qubit systems illustrate how these bounds scale with resources.

    https://arxiv.org/abs/2512.12518


    Wavelet-Packet-based Noise Signatures With Higher-Order Statistics for Anomaly Prediction

    oai:arXiv.org:2512.12528v1

    arXiv:2512.12528v1 Announce Type: cross Abstract: This note develops the first-ever noise-centric anomaly prediction method for a fused discrete-time signal. A Wavelet Packet Transform (WPT) provides a time--frequency expansion in which structure and residual can be separated via orthogonal projection. Higher-Order Statistics (HOS), particularly the third-order cumulant (and its bispectral interpretation), quantify non-Gaussianity and nonlinear coupling in the extracted residual. Compact noise signatures are constructed and an analytically calibrated Mahalanobis detector yields a closed-form decision rule with non-central chi-square performance under mean-shift alternatives. Propositions and proofs establish orthonormality, energy preservation, Gaussian-null behavior of cumulants, and the resulting test statistics.

    https://arxiv.org/abs/2512.12528


    Iterative Sampling Methods for Sinkhorn Distributionally Robust Optimization

    oai:arXiv.org:2512.12550v1

    arXiv:2512.12550v1 Announce Type: cross Abstract: Distributionally robust optimization (DRO) has emerged as a powerful paradigm for reliable decision-making under uncertainty. This paper focuses on DRO with ambiguity sets defined via the Sinkhorn discrepancy: an entropy-regularized Wasserstein distance, referred to as Sinkhorn DRO. Existing work primarily addresses Sinkhorn DRO from a dual perspective, leveraging its formulation as a conditional stochastic optimization problem, for which many stochastic gradient methods are applicable. However, the theoretical analyses of such methods often rely on the boundedness of the loss function, and it is indirect to obtain the worst-case distribution associated with Sinkhorn DRO. In contrast, we study Sinkhorn DRO from the primal perspective, by reformulating it as a bilevel program with several infinite-dimensional lower-level subproblems over probability space. This formulation enables us to simultaneously obtain the optimal robust decision and the worst-case distribution, which is valuable in practical settings, such as generating stress-test scenarios or designing robust learning algorithms. We propose both double-loop and single-loop sampling-based algorithms with theoretical guarantees to solve this bilevel program. Finally, we demonstrate the effectiveness of our approach through a numerical study on adversarial classification.

    https://arxiv.org/abs/2512.12550


    Mind the Jumps: A Scalable Robust Local Gaussian Process for Multidimensional Response Surfaces with Discontinuities

    oai:arXiv.org:2512.12574v1

    arXiv:2512.12574v1 Announce Type: cross Abstract: Modeling response surfaces with abrupt jumps and discontinuities remains a major challenge across scientific and engineering domains. Although Gaussian process models excel at capturing smooth nonlinear relationships, their stationarity assumptions limit their ability to adapt to sudden input-output variations. Existing nonstationary extensions, particularly those based on domain partitioning, often struggle with boundary inconsistencies, sensitivity to outliers, and scalability issues in higher-dimensional settings, leading to reduced predictive accuracy and unreliable parameter estimation. To address these challenges, this paper proposes the Robust Local Gaussian Process (RLGP) model, a framework that integrates adaptive nearest-neighbor selection with a sparsity-driven robustification mechanism. Unlike existing methods, RLGP leverages an optimization-based mean-shift adjustment after a multivariate perspective transformation combined with local neighborhood modeling to mitigate the influence of outliers. This approach improves predictive accuracy near discontinuities while enhancing robustness to data heterogeneity. Comprehensive evaluations on real-world datasets show that RLGP consistently delivers high predictive accuracy and maintains competitive computational efficiency, especially in scenarios with sharp transitions and complex response structures. Scalability tests further confirm RLGP's stability and reliability in higher-dimensional settings, where other methods struggle. These results establish RLGP as an effective and practical solution for modeling nonstationary and discontinuous response surfaces across a wide range of applications.

    https://arxiv.org/abs/2512.12574


    Scalable Quantum Error Mitigation with Neighbor-Informed Learning

    oai:arXiv.org:2512.12578v1

    arXiv:2512.12578v1 Announce Type: cross Abstract: Noise in quantum hardware is the primary obstacle to realizing the transformative potential of quantum computing. Quantum error mitigation (QEM) offers a promising pathway to enhance computational accuracy on near-term devices, yet existing methods face a difficult trade-off between performance, resource overhead, and theoretical guarantees. In this work, we introduce neighbor-informed learning (NIL), a versatile and scalable QEM framework that unifies and strengthens existing methods such as zero-noise extrapolation (ZNE) and probabilistic error cancellation (PEC), while offering improved flexibility, accuracy, efficiency, and robustness. NIL learns to predict the ideal output of a target quantum circuit from the noisy outputs of its structurally related ``neighbor'' circuits. A key innovation is our 2-design training method, which generates training data for our machine learning model. In contrast to conventional learning-based QEM protocols that create training circuits by replacing non-Clifford gates with uniformly random Clifford gates, our approach achieves higher accuracy and efficiency, as demonstrated by both theoretical analysis and numerical simulation. Furthermore, we prove that the required size of the training set scales only \emph{logarithmically} with the total number of neighbor circuits, enabling NIL to be applied to problems involving large-scale quantum circuits. Our work establishes a theoretically grounded and practically efficient framework for QEM, paving a viable path toward achieving quantum advantage on noisy hardware.

    https://arxiv.org/abs/2512.12578


    Robust Variational Bayes by Min-Max Median Aggregation

    oai:arXiv.org:2512.12676v1

    arXiv:2512.12676v1 Announce Type: cross Abstract: We propose a robust and scalable variational Bayes (VB) framework designed to effectively handle contamination and outliers in dataset. Our approach partitions the data into $m$ disjoint subsets and formulates a joint optimization problem based on robust aggregation principles. A key insight is that the full posterior distribution is equivalent to the minimizer of the mean Kullback-Leibler (KL) divergence from the $m$-powered local posterior distributions. To enhance robustness, we replace the mean KL divergence with a min-max median formulation. The min-max formulation not only ensures consistency between the KL minimizer and the Evidence Lower Bound (ELBO) maximizer but also facilitates the establishment of improved statistical rates for the mean of variational posterior. We observe a notable discrepancy in the $m$-powered marginal log likelihood function contingent on the presence of local latent variables. To address this, we treat these two scenarios separately to guarantee the consistency of the aggregated variational posterior. Specifically, when local latent variables are present, we introduce an aggregate-and-rescale strategy. Theoretically, we provide a non-asymptotic analysis of our proposed posterior, incorporating a refined analysis of Bernstein-von Mises (BvM) theorem to accommodate a diverging number of subsets $m$. Our findings indicate that the two-stage approach yields a smaller approximation error compared to directly aggregating the $m$-powered local posteriors. Furthermore, we establish a nearly optimal statistical rate for the mean of the proposed posterior, advancing existing theories related to min-max median estimators. The efficacy of our method is demonstrated through extensive simulation studies.

    https://arxiv.org/abs/2512.12676


    Quantum Implicit Neural Representations for 3D Scene Reconstruction and Novel View Synthesis

    oai:arXiv.org:2512.12683v1

    arXiv:2512.12683v1 Announce Type: cross Abstract: Implicit neural representations (INRs) have become a powerful paradigm for continuous signal modeling and 3D scene reconstruction, yet classical networks suffer from a well-known spectral bias that limits their ability to capture high-frequency details. Quantum Implicit Representation Networks (QIREN) mitigate this limitation by employing parameterized quantum circuits with inherent Fourier structures, enabling compact and expressive frequency modeling beyond classical MLPs. In this paper, we present Quantum Neural Radiance Fields (Q-NeRF), the first hybrid quantum-classical framework for neural radiance field rendering. Q-NeRF integrates QIREN modules into the Nerfacto backbone, preserving its efficient sampling, pose refinement, and volumetric rendering strategies while replacing selected density and radiance prediction components with quantum-enhanced counterparts. We systematically evaluate three hybrid configurations on standard multi-view indoor datasets, comparing them to classical baselines using PSNR, SSIM, and LPIPS metrics. Results show that hybrid quantum-classical models achieve competitive reconstruction quality under limited computational resources, with quantum modules particularly effective in representing fine-scale, view-dependent appearance. Although current implementations rely on quantum circuit simulators constrained to few-qubit regimes, the results highlight the potential of quantum encodings to alleviate spectral bias in implicit representations. Q-NeRF provides a foundational step toward scalable quantum-enabled 3D scene reconstruction and a baseline for future quantum neural rendering research.

    https://arxiv.org/abs/2512.12683


    Practical Hybrid Quantum Language Models with Observable Readout on Real Hardware

    oai:arXiv.org:2512.12710v1

    arXiv:2512.12710v1 Announce Type: cross Abstract: Hybrid quantum-classical models represent a crucial step toward leveraging near-term quantum devices for sequential data processing. We present Quantum Recurrent Neural Networks (QRNNs) and Quantum Convolutional Neural Networks (QCNNs) as hybrid quantum language models, reporting the first empirical demonstration of generative language modeling trained and evaluated end-to-end on real quantum hardware. Our architecture combines hardware-optimized parametric quantum circuits with a lightweight classical projection layer, utilizing a multi-sample SPSA strategy to efficiently train quantum parameters despite hardware noise. To characterize the capabilities of these models, we introduce a synthetic dataset designed to isolate syntactic dependencies in a controlled, low-resource environment. Experiments on IBM Quantum processors reveal the critical trade-offs between circuit depth and trainability, demonstrating that while noise remains a significant factor, observable-based readout enables the successful learning of sequential patterns on NISQ devices. These results establish a rigorous engineering baseline for generative quantum natural language processing, validating the feasibility of training complex sequence models on current quantum hardware.

    https://arxiv.org/abs/2512.12710


    Power Consumption and Energy Efficiency of Mid-Band XL-MIMO: Modeling, Scaling Laws, and Performance Insights

    oai:arXiv.org:2512.12725v1

    arXiv:2512.12725v1 Announce Type: cross Abstract: Mid-band extra-large-scale multiple-input multiple-output (XL-MIMO), emerging as a critical enabler for future communication systems, is expected to deliver significantly higher throughput by leveraging the extended bandwidth and enlarged antenna aperture. However, power consumption remains a significant concern due to the enlarged system dimension, underscoring the need for thorough investigations into efficient system design and deployment. To this end, an in-depth study is conducted on mid-band XL-MIMO systems. Specifically, a comprehensive power consumption model is proposed, encompassing the power consumption of major hardware components and signal processing procedures, while capturing the influence of key system parameters. Considering typical near-field propagation characteristics, closed-form approximations of throughput are derived, providing an analytical framework for assessing energy efficiency (EE). Based on the proposed framework, the scaling law of EE with respect to key system configurations is derived, offering valuable insights for system design. Subsequently, extensions and comparisons are conducted among representative multi-antenna technologies, demonstrating the superiority of mid-band XL-MIMO in EE. Extensive numerical results not only verify the tightness of the throughput analysis but also validate the EE evaluations, unveiling the potential of energy-efficient mid-band XL-MIMO systems.

    https://arxiv.org/abs/2512.12725


    EXFormer: A Multi-Scale Trend-Aware Transformer with Dynamic Variable Selection for Foreign Exchange Returns Prediction

    oai:arXiv.org:2512.12727v1

    arXiv:2512.12727v1 Announce Type: cross Abstract: Accurately forecasting daily exchange rate returns represents a longstanding challenge in international finance, as the exchange rate returns are driven by a multitude of correlated market factors and exhibit high-frequency fluctuations. This paper proposes EXFormer, a novel Transformer-based architecture specifically designed for forecasting the daily exchange rate returns. We introduce a multi-scale trend-aware self-attention mechanism that employs parallel convolutional branches with differing receptive fields to align observations on the basis of local slopes, preserving long-range dependencies while remaining sensitive to regime shifts. A dynamic variable selector assigns time-varying importance weights to 28 exogenous covariates related to exchange rate returns, providing pre-hoc interpretability. An embedded squeeze-and-excitation block recalibrates channel responses to emphasize informative features and depress noise in the forecasting. Using the daily data for EUR/USD, USD/JPY, and GBP/USD, we conduct out-of-sample evaluations across five different sliding windows. EXFormer consistently outperforms the random walk and other baselines, improving directional accuracy by a statistically significant margin of up to 8.5--22.8%. In nearly one year of trading backtests, the model converts these gains into cumulative returns of 18%, 25%, and 18% for the three pairs, with Sharpe ratios exceeding 1.8. When conservative transaction costs and slippage are accounted for, EXFormer retains cumulative returns of 7%, 19%, and 9%, while other baselines achieve negative. The robustness checks further confirm the model's superiority under high-volatility and bear-market regimes. EXFormer furnishes both economically valuable forecasts and transparent, time-varying insights into the drivers of exchange rate dynamics for international investors, corporations, and central bank practitioners.

    https://arxiv.org/abs/2512.12727


    Limits To (Machine) Learning

    oai:arXiv.org:2512.12735v1

    arXiv:2512.12735v1 Announce Type: cross Abstract: Machine learning (ML) methods are highly flexible, but their ability to approximate the true data-generating process is fundamentally constrained by finite samples. We characterize a universal lower bound, the Limits-to-Learning Gap (LLG), quantifying the unavoidable discrepancy between a model's empirical fit and the population benchmark. Recovering the true population $R^2$, therefore, requires correcting observed predictive performance by this bound. Using a broad set of variables, including excess returns, yields, credit spreads, and valuation ratios, we find that the implied LLGs are large. This indicates that standard ML approaches can substantially understate true predictability in financial data. We also derive LLG-based refinements to the classic Hansen and Jagannathan (1991) bounds, analyze implications for parameter learning in general-equilibrium settings, and show that the LLG provides a natural mechanism for generating excess volatility.

    https://arxiv.org/abs/2512.12735


    Transport Reversible Jump Markov Chain Monte Carlo with proposals generated by Variational Inference with Normalizing Flows

    oai:arXiv.org:2512.12742v1

    arXiv:2512.12742v1 Announce Type: cross Abstract: We present a framework using variational inference with normalizing flows (VI-NFs) to generate proposals of reversible jump Markov chain Monte Carlo (RJMCMC) for efficient trans-dimensional Bayesian inference. Unlike transport reversible jump methods relying on forward KL minimization with pilot MCMC samples, our approach minimizes the reverse KL divergence which requires only samples from a base distribution, eliminating costly target sampling. The method employs RealNVP-based flows to learn model-specific transport maps, enabling construction of both between-model and within-model proposals. Our framework provides accurate marginal likelihood estimates from the variational approximation. This facilitates efficient model comparison and proposal adaptation in RJMCMC. Experiments on illustrative example, factor analysis and variable selection tasks in linear regression show that TRJ designed by VI-NFs achieves faster mixing and more efficient model space exploration compared to existing baselines. The proposed algorithm can be extended to conditional flows for amortized vairiational inference across models. Code is available at https://github.com/YinPingping111/TRJ_VINFs.

    https://arxiv.org/abs/2512.12742


    Flow-matching Operators for Residual-Augmented Probabilistic Learning of Partial Differential Equations

    oai:arXiv.org:2512.12749v1

    arXiv:2512.12749v1 Announce Type: cross Abstract: Learning probabilistic surrogates for PDEs remains challenging in data-scarce regimes: neural operators require large amounts of high-fidelity data, while generative approaches typically sacrifice resolution invariance. We formulate flow matching in an infinite-dimensional function space to learn a probabilistic transport that maps low-fidelity approximations to the manifold of high-fidelity PDE solutions via learned residual corrections. We develop a conditional neural operator architecture based on feature-wise linear modulation for flow-matching vector fields directly in function space, enabling inference at arbitrary spatial resolutions without retraining. To improve stability and representational control of the induced neural ODE, we parameterize the flow vector field as a sum of a linear operator and a nonlinear operator, combining lightweight linear components with a conditioned Fourier neural operator for expressive, input-dependent dynamics. We then formulate a residual-augmented learning strategy where the flow model learns probabilistic corrections from inexpensive low-fidelity surrogates to high-fidelity solutions, rather than learning the full solution mapping from scratch. Finally, we derive tractable training objectives that extend conditional flow matching to the operator setting with input-function-dependent couplings. To demonstrate the effectiveness of our approach, we present numerical experiments on a range of PDEs, including the 1D advection and Burgers' equation, and a 2D Darcy flow problem for flow through a porous medium. We show that the proposed method can accurately learn solution operators across different resolutions and fidelities and produces uncertainty estimates that appropriately reflect model confidence, even when trained on limited high-fidelity data.

    https://arxiv.org/abs/2512.12749


    A Disproof of Large Language Model Consciousness: The Necessity of Continual Learning for Consciousness

    oai:arXiv.org:2512.12802v1

    arXiv:2512.12802v1 Announce Type: cross Abstract: The requirements for a falsifiable and non-trivial theory of consciousness significantly constrain such theories. Specifically, recent research on the Unfolding Argument and the Substitution Argument has given us formal tools to analyze requirements for a theory of consciousness. I show via a new Proximity Argument that these requirements especially constrain the potential consciousness of contemporary Large Language Models (LLMs) because of their proximity to systems that are equivalent to LLMs in terms of input/output function; yet, for these functionally equivalent systems, there cannot be any non-trivial theory of consciousness that judges them conscious. This forms the basis of a disproof of contemporary LLM consciousness. I then show a positive result, which is that theories of consciousness based on (or requiring) continual learning do satisfy the stringent formal constraints for a theory of consciousness in humans. Intriguingly, this work supports a hypothesis: If continual learning is linked to consciousness in humans, the current limitations of LLMs (which do not continually learn) are intimately tied to their lack of consciousness.

    https://arxiv.org/abs/2512.12802


    Semitopological Barycentric Algebras

    oai:arXiv.org:2512.12865v1

    arXiv:2512.12865v1 Announce Type: cross Abstract: Barycentric algebras are an abstraction of the notion of convex sets, defined by a set of equations. We study semitopological and topological barycentric algebras, in the spirit of a previous study by Klaus Keimel on semitopological and topological cones (2008), which are special cases of semitopological and topological barycentric algebras.

    https://arxiv.org/abs/2512.12865


    On the variational dual formulation of the Nash system and an adaptive convex gradient-flow approach to nonlinear PDEs

    oai:arXiv.org:2512.12878v1

    arXiv:2512.12878v1 Announce Type: cross Abstract: We investigate the influence of base states on the consistency of the dual variational formulation for quadratic systems of PDEs, which are not necessarily conservative (typical examples include the noise-free Nash system with a quadratic Hamiltonian and multiple players). We identify a sufficient condition under which consistency holds over large time intervals. In particular, in the single-player case, there exists a sequence of base states (each exhibiting full consistency) that converges in mean to zero. We also prove existence of variational dual solutions to the noise-free Nash system for arbitrary base states. Furthermore, we propose a scheme based on Hilbertian gradient flows that, starting from an arbitrary base state, generates a sequence of new base states that is expected to converge to a solution of the original PDE.

    https://arxiv.org/abs/2512.12878


    Meta-GPT: Decoding the Metasurface Genome with Generative Artificial Intelligence

    oai:arXiv.org:2512.12888v1

    arXiv:2512.12888v1 Announce Type: cross Abstract: Advancing artificial intelligence for physical sciences requires representations that are both interpretable and compatible with the underlying laws of nature. We introduce METASTRINGS, a symbolic language for photonics that expresses nanostructures as textual sequences encoding materials, geometries, and lattice configurations. Analogous to molecular textual representations in chemistry, METASTRINGS provides a framework connecting human interpretability with computational design by capturing the structural hierarchy of photonic metasurfaces. Building on this representation, we develop Meta-GPT, a foundation transformer model trained on METASTRINGS and finetuned with physics-informed supervised, reinforcement, and chain-of-thought learning. Across various design tasks, the model achieves <3% mean-squared spectral error and maintains >98% syntactic validity, generating diverse metasurface prototypes whose experimentally measured optical responses match their target spectra. These results demonstrate that Meta-GPT can learn the compositional rules of light-matter interactions through METASTRINGS, laying a rigorous foundation for AI-driven photonics and representing an important step toward a metasurface genome project.

    https://arxiv.org/abs/2512.12888


    PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders

    oai:arXiv.org:2512.12905v1

    arXiv:2512.12905v1 Announce Type: cross Abstract: Linear Autoencoders (LAEs) have shown strong performance in state-of-the-art recommender systems. However, this success remains largely empirical, with limited theoretical understanding. In this paper, we investigate the generalizability -- a theoretical measure of model performance in statistical learning -- of multivariate linear regression and LAEs. We first propose a PAC-Bayes bound for multivariate linear regression, extending the earlier bound for single-output linear regression by Shalaeva et al., and establish sufficient conditions for its convergence. We then show that LAEs, when evaluated under a relaxed mean squared error, can be interpreted as constrained multivariate linear regression models on bounded data, to which our bound adapts. Furthermore, we develop theoretical methods to improve the computational efficiency of optimizing the LAE bound, enabling its practical evaluation on large models and real-world datasets. Experimental results demonstrate that our bound is tight and correlates well with practical ranking metrics such as Recall@K and NDCG@K.

    https://arxiv.org/abs/2512.12905


    Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

    oai:arXiv.org:2512.12911v1

    arXiv:2512.12911v1 Announce Type: cross Abstract: This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

    https://arxiv.org/abs/2512.12911


    Leveraging Compression to Construct Transferable Bitrate Ladders

    oai:arXiv.org:2512.12952v1

    arXiv:2512.12952v1 Announce Type: cross Abstract: Over the past few years, per-title and per-shot video encoding techniques have demonstrated significant gains as compared to conventional techniques such as constant CRF encoding and the fixed bitrate ladder. These techniques have demonstrated that constructing content-gnostic per-shot bitrate ladders can provide significant bitrate gains and improved Quality of Experience (QoE) for viewers under various network conditions. However, constructing a convex hull for every video incurs a significant computational overhead. Recently, machine learning-based bitrate ladder construction techniques have emerged as a substitute for convex hull construction. These methods operate by extracting features from source videos to train machine learning (ML) models to construct content-adaptive bitrate ladders. Here, we present a new ML-based bitrate ladder construction technique that accurately predicts the VMAF scores of compressed videos, by analyzing the compression procedure and by making perceptually relevant measurements on the source videos prior to compression. We evaluate the performance of our proposed framework against leading prior methods on a large corpus of videos. Since training ML models on every encoder setting is time-consuming, we also investigate how per-shot bitrate ladders perform under different encoding settings. We evaluate the performance of all models against the fixed bitrate ladder and the best possible convex hull constructed using exhaustive encoding with Bjontegaard-delta metrics.

    https://arxiv.org/abs/2512.12952


    Cycles Communities from the Perspective of Dendrograms and Gradient Sampling

    oai:arXiv.org:2512.12974v1

    arXiv:2512.12974v1 Announce Type: cross Abstract: Identifying and comparing topological features, particularly cycles, across different topological objects remains a fundamental challenge in persistent homology and topological data analysis. This work introduces a novel framework for constructing cycle communities through two complementary approaches. First, a dendrogram-based methodology leverages merge-tree algorithms to construct hierarchical representations of homology classes from persistence intervals. The Wasserstein distance on merge trees is introduced as a metric for comparing dendrograms, establishing connections to hierarchical clustering frameworks. Through simulation studies, the discriminative power of dendrogram representations for identifying cycle communities is demonstrated. Second, an extension of Stratified Gradient Sampling simultaneously learns multiple filter functions that yield cycle barycenter functions capable of faithfully reconstructing distinct sets of cycles. The set of cycles each filter function can reconstruct constitutes cycle communities that are non-overlapping and partition the space of all cycles. Together, these approaches transform the problem of cycle matching into both a hierarchical clustering and topological optimization framework, providing principled methods to identify similar topological structures both within and across groups of topological objects.

    https://arxiv.org/abs/2512.12974


    General OOD Detection via Model-aware and Subspace-aware Variable Priority

    oai:arXiv.org:2512.13003v1

    arXiv:2512.13003v1 Announce Type: cross Abstract: Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ meaningfully from its training distribution. While widely studied in classification, OOD detection for regression and survival analysis remains limited due to the absence of discrete labels and the challenge of quantifying predictive uncertainty. We introduce a framework for OOD detection that is simultaneously model aware and subspace aware, and that embeds variable prioritization directly into the detection step. The method uses the fitted predictor to construct localized neighborhoods around each test case that emphasize the features driving the model's learned relationship and downweight directions that are less relevant to prediction. It produces OOD scores without relying on global distance metrics or estimating the full feature density. The framework is applicable across outcome types, and in our implementation we use random forests, where the rule structure yields transparent neighborhoods and effective scoring. Experiments on synthetic and real data benchmarks designed to isolate functional shifts show consistent improvements over existing methods. We further demonstrate the approach in an esophageal cancer survival study, where distribution shifts related to lymphadenectomy identify patterns relevant to surgical guidelines.

    https://arxiv.org/abs/2512.13003


    An AI-Based Framework for Assessing Sustainability Conflicts in Medical Device Development

    oai:arXiv.org:2512.13005v1

    arXiv:2512.13005v1 Announce Type: cross Abstract: Designing sustainable medical devices requires balancing environmental, economic, and social demands, yet trade-offs across these pillars are difficult to identify using manual assessment alone. Current methods depend heavily on expert judgment, lack standardisation, and struggle to integrate diverse lifecycle data, which leads to overlooked conflicts and inconsistent evaluations. This paper introduces an AI-driven framework that automates conflict detection. Machine learning and natural language processing are used to extract trade-offs from design decisions, while Multi-Criteria Decision Analysis (MCDA) quantifies their magnitude through a composite sustainability score. The approach improves consistency, reduces subjective bias, and supports early design decisions. The results demonstrate how AI-assisted analysis provides scalable, data-driven support for sustainability evaluation in medical device development.

    https://arxiv.org/abs/2512.13005


    Group-averaged Markov chains II: tuning of group action in finite state space

    oai:arXiv.org:2512.13067v1

    arXiv:2512.13067v1 Announce Type: cross Abstract: We study group-averaged Markov chains obtained by augmenting a $\pi$-stationary transition kernel $P$ with a group action on the state space via orbit kernels. Given a group $\mathcal{G}$ with orbits $(\mathcal{O}_i)_{i=1}^k$, we analyse three canonical orbit kernels: namely the Gibbs $(G)$, Metropolis-Hastings $(M)$, and Barker $(B)$ kernels, as well as their multiplicative sandwiches $QPQ$ and the additive mixtures $\frac{1}{2}(P+Q)$ where $Q\in\{G,M,B\}$. We show that $M^t, B^t \to G$ blockwise as $t \to \infty$ under suitable conditions, that the projection chains induced by $(\mathcal{O}_i)_{i=1}^k$ coincide for $GPG$ and $P$, and that orbit averaging never deteriorates the absolute spectral gap or asymptotic variance when $P$ is reversible. We give a direct and simple proof of Pythagorean identity under the Kullback-Leibler (KL) divergence, showing that $GPG$ arises naturally as an information projection of $P$ onto the set of $G$-invariant transition matrices. For a given $P$, we characterise the optimal choice of $G$ with a fixed number of orbits that minimises the one-step KL divergence to stationarity. Analogously, for a given $G$, we characterise the optimal choice of $P$ and give sufficient conditions under which $GPG = \Pi$. We further show that alternating projections over multiple group actions converge at a rate governed by the singular values of an overlap matrix, and that in structured cases, this yields exact sampling where the number of group actions grows logarithmically with the size of the state space. Based on the theory, we propose two heuristics to tune $G$ in practice. We also illustrate the results on discrete uniform and multimodal examples, including the Curie-Weiss model where $GPG$ achieves polynomial (in inverse temperature and dimension) mixing while Glauber dynamics remains exponentially slow.

    https://arxiv.org/abs/2512.13067


    Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

    oai:arXiv.org:2512.13123v1

    arXiv:2512.13123v1 Announce Type: cross Abstract: We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$ and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

    https://arxiv.org/abs/2512.13123


    The maximum number of nonzero weights of linear rank-metric codes

    oai:arXiv.org:2512.13162v1

    arXiv:2512.13162v1 Announce Type: cross Abstract: We investigate the maximum number \( L_{\mathrm{rk}}(n, m, k, q) \) of distinct nonzero rank weights that an \( \mathbb{F}_{q^m} \)-linear rank-metric code of dimension \( k \) in \( \mathbb{F}_{q^m}^n \) can attain. We determine the exact value of the function \( L_{\mathrm{rk}}(n, m, k, q) \) for all admissible parameters \( n, m, k, q \). In particular, we characterize when a code achieves the full weight spectrum (FWS), i.e. when the number of distinct nonzero rank weights equals \( \min\{n, m\} \). We provide both necessary and sufficient conditions for the existence of FWS codes, along with explicit constructions of codes attaining the maximum number of distinct weights. We discuss the equivalence of such codes and also present classification results for 2-dimensional codes. Finally, we investigate further properties of these optimal codes, like their behavior under duality.

    https://arxiv.org/abs/2512.13162


    Carrot, stick, or both? Price incentives for sustainable food choice in competitive environments

    oai:arXiv.org:2512.13174v1

    arXiv:2512.13174v1 Announce Type: cross Abstract: Meat consumption is a major driver of global greenhouse gas emissions. While pricing interventions have shown potential to reduce meat intake, previous studies have focused on highly constrained environments with limited consumer choice. Here, we present the first large-scale field experiment to evaluate multiple pricing interventions in a real-world, competitive setting. Using a sequential crossover design with matched menus in a Swiss university campus, we systematically compared vegetarian-meal discounts (-2.5 CHF), meat surcharges (+2.5 CHF), and a combined scheme (-1.2 CHF=+1.2 CHF) across four campus cafeterias. Only the surcharge and combined interventions led to significant increases in vegetarian meal uptake--by 26.4% and 16.6%, respectively--and reduced CO2 emissions per meal by 7.4% and 11.3%, respectively. The surcharge, while effective, triggered a 12.3% drop in sales at intervention sites and a corresponding 14.9% increase in non-treated locations, hence causing a spillover effect that completely offset environmental gains. In contrast, the combined approach achieved meaningful emission reductions without significant effects on overall sales or revenue, making it both effective and economically viable. Notably, pricing interventions were equally effective for both vegetarian-leaning customers and habitual meat-eaters, stimulating change even within entrenched dietary habits. Our results show that balanced pricing strategies can reduce the carbon footprint of realistic food environments, but require coordinated implementation to maximize climate benefits and avoid unintended spillover effects.

    https://arxiv.org/abs/2512.13174


    MicroPhaseNO: Adapting an Earthquake-Trained Phase Neural Operator for Microseismic Phase Picking

    oai:arXiv.org:2512.13197v1

    arXiv:2512.13197v1 Announce Type: cross Abstract: Seismic phase picking is very often used for microseismic monitoring and subsurface imaging. Traditional manual processing is not feasible for either real-time applications or large arrays. Deep learning-based pickers trained on large earthquake catalogs offer an automated alternative. However, they are typically optimized for high signal-to-noise, long-duration networks and struggle with the challenges presented by microseismic datasets, which are purpose-built for limited time without previously detected seismicity. In this study, we demonstrate how a network-wide earthquake phase picker, the Phase Neural Operator (PhaseNO), can be adapted to microseismic monitoring using transfer learning. Starting from a PhaseNO model pre-trained on more than 57,000 three-component earthquake and noise records, we fine-tune the model using only 200 labeled and noise seismograms from induced events in hydraulic-fracturing settings. The fine-tuned model thus preserves the rich spatio-temporal representation learned from abundant earthquake data, while adapting to the characteristics and labeling conventions of microseismic phases, which are often picked on peaks or troughs rather than onsets. We evaluate performance on three distinct real-world microseismic datasets with different network geometries and acquisition parameters. Compared to the original PhaseNO and a conventional workflow, the adapted model increases F1 score and accuracy by up to 30%, and strongly reduces systematic timing bias and pick uncertainty. Because the adaptation relies on a small, campaign-specific calibration set, the approach is readily transferable to other microseismic tasks where public earthquake data and pre-trained models are accessible. The associated code will be released openly at https://github.com/ayratabd/MicroPhaseNO.

    https://arxiv.org/abs/2512.13197


    Investigation of a Bit-Sequence Reconciliation Protocol Based on Neural TPM Networks in Secure Quantum Communications

    oai:arXiv.org:2512.13199v1

    arXiv:2512.13199v1 Announce Type: cross Abstract: The article discusses a key reconciliation protocol for quantum key distribution (QKD) systems based on Tree Parity Machines (TPM). The idea of transforming key material into neural network weights is presented. Two experiments were conducted to study how the number of synchronization iterations and the amount of leaked information depend on the quantum bit error rate (QBER) and the range of neural network weights. The results show a direct relationship between the average number of synchronization iterations and QBER, an increase in iterations when the weight range is expanded, and a reduction in leaked information as the weight range increases. Based on these results, conclusions are drawn regarding the applicability of the protocol and the prospects for further research on neural cryptographic methods in the context of key reconciliation.

    https://arxiv.org/abs/2512.13199


    Rethinking Physics-Informed Regression Beyond Training Loops and Bespoke Architectures

    oai:arXiv.org:2512.13217v1

    arXiv:2512.13217v1 Announce Type: cross Abstract: We revisit the problem of physics-informed regression, and propose a method that directly computes the state at the prediction point, simultaneously with the derivative and curvature information of the existing samples. We frame each prediction as a constrained optimisation problem, leveraging multivariate Taylor series expansions and explicitly enforcing physical laws. Each individual query can be processed with low computational cost without any pre- or re-training, in contrast to global function approximator-based solutions such as neural networks. Our comparative benchmarks on a reaction-diffusion system show competitive predictive accuracy relative to a neural network-based solution, while completely eliminating the need for long training loops, and remaining robust to changes in the sampling layout.

    https://arxiv.org/abs/2512.13217


    Better LMO-based Momentum Methods with Second-Order Information

    oai:arXiv.org:2512.13227v1

    arXiv:2512.13227v1 Announce Type: cross Abstract: The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework--leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the ${O}(1/K^{1/4})$ rate. While several approaches--such as Hessian-Corrected Momentum (HCM)--have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability in problems, where arbitrary norms are often required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings. We establish improved convergence rates of ${O}(1/K^{1/3})$ for HCM, which can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks verify our theoretical observations.

    https://arxiv.org/abs/2512.13227


    Reversible and Reversible-Complement Double Cyclic Codes over F4+vF4 and its Application to DNA Codes

    oai:arXiv.org:2512.13295v1

    arXiv:2512.13295v1 Announce Type: cross Abstract: In this article, we study the algebraic structure of double cyclic codes of length $(m, n)$ over $\mathbb{F}_4$ and we give a necessary and sufficient condition for a double cyclic code over $\mathbb{F}_4$ to be reversible. Also, we determine the algebraic structure of double cyclic codes of length $(m, n)$ over $\mathbb{F}_4+v\mathbb{F}_4$ with $v^2=v$, satisfying the reverse constraint and the reverse-complement constraint. Then we establish a one-to-one correspondence $\psi$ between the 16 DNA double pairs $S_{D_{16}} $ and the 16 elements of the finite ring $\mathbb{F}_4+v\mathbb{F}_4$. We also discuss the GC-content of DNA double cyclic codes.

    https://arxiv.org/abs/2512.13295


    Self-Supervised Ultrasound Representation Learning for Renal Anomaly Prediction in Prenatal Imaging

    oai:arXiv.org:2512.13434v1

    arXiv:2512.13434v1 Announce Type: cross Abstract: Prenatal ultrasound is the cornerstone for detecting congenital anomalies of the kidneys and urinary tract, but diagnosis is limited by operator dependence and suboptimal imaging conditions. We sought to assess the performance of a self-supervised ultrasound foundation model for automated fetal renal anomaly classification using a curated dataset of 969 two-dimensional ultrasound images. A pretrained Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE) was fine-tuned for binary and multi-class classification of normal kidneys, urinary tract dilation, and multicystic dysplastic kidney. Models were compared with a DenseNet-169 convolutional baseline using cross-validation and an independent test set. USF-MAE consistently improved upon the baseline across all evaluation metrics in both binary and multi-class settings. USF-MAE achieved an improvement of about 1.87% (AUC) and 7.8% (F1-score) on the validation set, 2.32% (AUC) and 4.33% (F1-score) on the independent holdout test set. The largest gains were observed in the multi-class setting, where the improvement in AUC was 16.28% and 46.15% in F1-score. To facilitate model interpretability, Score-CAM visualizations were adapted for a transformer architecture and show that model predictions were informed by known, clinically relevant renal structures, including the renal pelvis in urinary tract dilation and cystic regions in multicystic dysplastic kidney. These results show that ultrasound-specific self-supervised learning can generate a useful representation as a foundation for downstream diagnostic tasks. The proposed framework offers a robust, interpretable approach to support the prenatal detection of renal anomalies and demonstrates the promise of foundation models in obstetric imaging.

    https://arxiv.org/abs/2512.13434


    A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments

    oai:arXiv.org:2512.13517v1

    arXiv:2512.13517v1 Announce Type: cross Abstract: Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modelling in humans. Here we propose a mechanistic model of human mental rotation, leveraging advances in deep, equivariant, and neuro-symbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, taking images as input and producing 3D spatial representations of objects, (2) a neuro-symbolic object encoder, deriving symbolic descriptions of objects from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the abundant experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare, providing us with additional insights into the cognitive process of mental rotation. Our model captures well the performance, response times and behavior of participants in our and others' experiments. The necessity of each model component is shown through systematic ablations. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.

    https://arxiv.org/abs/2512.13517


    Enhancing lithological interpretation from petrophysical well log of IODP expedition 390/393 using machine learning

    oai:arXiv.org:2512.13529v1

    arXiv:2512.13529v1 Announce Type: cross Abstract: Enhanced lithological interpretation from well logs plays a key role in geological resource exploration and mapping, as well as in geo-environmental modeling studies. Core and cutting information is useful for making sound interpretations of well logs; however, these are rarely collected at each depth due to high costs. Moreover, well log interpretation using traditional methods is constrained by poor borehole conditions. Traditional statistical methods are mostly linear, often failing to discriminate between lithology and rock facies, particularly when dealing with overlapping well log signals characterized by the structural and compositional variation of rock types. In this study, we develop multiple supervised and unsupervised machine learning algorithms to jointly analyze multivariate well log data from Integrated Ocean Drilling Program (IODP) expeditions 390 and 393 for enhanced lithological interpretations. Among the algorithms, Logistic Regression, Decision Trees, Gradient Boosting, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) neural network models, the Decision Tree and Gradient Boosting models outperformed the others, achieving an accuracy of 0.9950 and an F1-score of 0.9951. While unsupervised machine learning (ML) provides the foundation for cluster information that inherently supports the classification algorithm, supervised ML is applied to devise a data-driven lithology clustering mechanism for IODP datasets. The joint ML-based method developed here has the potential to be further explored for analyzing other well log datasets from the world's oceans.

    https://arxiv.org/abs/2512.13529


    Actively Learning Joint Contours of Multiple Computer Experiments

    oai:arXiv.org:2512.13530v1

    arXiv:2512.13530v1 Announce Type: cross Abstract: Contour location$\unicode{x2014}$the process of sequentially training a surrogate model to identify the design inputs that result in a pre-specified response value from a single computer experiment$\unicode{x2014}$is a well-studied active learning problem. Here, we tackle a related but distinct problem: identifying the input configuration that returns pre-specified values of multiple independent computer experiments simultaneously. Motivated by computer experiments of the rotational torques acting upon a vehicle in flight, we aim to identify stable flight conditions which result in zero torque forces. We propose a "joint contour location" (jCL) scheme that strikes a strategic balance between exploring the multiple response surfaces while exploiting learning of the intersecting contours. We employ both shallow and deep Gaussian process surrogates, but our jCL procedure is applicable to any surrogate that can provide posterior predictive distributions. Our jCL designs significantly outperform existing (single response) CL strategies, enabling us to efficiently locate the joint contour of our motivating computer experiments.

    https://arxiv.org/abs/2512.13530


    Adaptive Sampling for Hydrodynamic Stability

    oai:arXiv.org:2512.13532v1

    arXiv:2512.13532v1 Announce Type: cross Abstract: An adaptive sampling approach for efficient detection of bifurcation boundaries in parametrized fluid flow problems is presented herein. The study extends the machine-learning approach of Silvester (Machine Learning for Hydrodynamic Stability, arXiv:2407.09572), where a classifier network was trained on preselected simulation data to identify bifurcated and nonbifurcated flow regimes. In contrast, the proposed methodology introduces adaptivity through a flow-based deep generative model that automatically refines the sampling of the parameter space. The strategy has two components: a classifier network maps the flow parameters to a bifurcation probability, and a probability density estimation technique (KRnet) for the generation of new samples at each adaptive step. The classifier output provides a probabilistic measure of flow stability, and the Shannon entropy of these predictions is employed as an uncertainty indicator. KRnet is trained to approximate a probability density function that concentrates sampling in regions of high entropy, thereby directing computational effort towards the evolving bifurcation boundary. This coupling between classification and generative modeling establishes a feedback-driven adaptive learning process analogous to error-indicator based refinement in contemporary partial differential equation solution strategies. Starting from a uniform parameter distribution, the new approach achieves accurate bifurcation boundary identification with significantly fewer Navier--Stokes simulations, providing a scalable foundation for high-dimensional stability analysis.

    https://arxiv.org/abs/2512.13532


    A Nonparametric Statistics Approach to Feature Selection in Deep Neural Networks with Theoretical Guarantees

    oai:arXiv.org:2512.13565v1

    arXiv:2512.13565v1 Announce Type: cross Abstract: This paper tackles the problem of feature selection in a highly challenging setting: $\mathbb{E}(y | \boldsymbol{x}) = G(\boldsymbol{x}_{\mathcal{S}_0})$, where $\mathcal{S}_0$ is the set of relevant features and $G$ is an unknown, potentially nonlinear function subject to mild smoothness conditions. Our approach begins with feature selection in deep neural networks, then generalizes the results to H{\"o}lder smooth functions by exploiting the strong approximation capabilities of neural networks. Unlike conventional optimization-based deep learning methods, we reformulate neural networks as index models and estimate $\mathcal{S}_0$ using the second-order Stein's formula. This gradient-descent-free strategy guarantees feature selection consistency with a sample size requirement of $n = \Omega(p^2)$, where $p$ is the feature dimension. To handle high-dimensional scenarios, we further introduce a screening-and-selection mechanism that achieves nonlinear selection consistency when $n = \Omega(s \log p)$, with $s$ representing the sparsity level. Additionally, we refit a neural network on the selected features for prediction and establish performance guarantees under a relaxed sparsity assumption. Extensive simulations and real-data analyses demonstrate the strong performance of our method even in the presence of complex feature interactions.

    https://arxiv.org/abs/2512.13565


    Optimised Fermion-Qubit Encodings for Quantum Simulation with Reduced Transpiled Circuit Depth

    oai:arXiv.org:2512.13580v1

    arXiv:2512.13580v1 Announce Type: cross Abstract: Simulation of fermionic Hamiltonians with gate-based quantum computers requires the selection of an encoding from fermionic operators to quantum gates, the most widely used being the Jordan-Wigner transform. Many alternative encodings exist, with quantum circuits and simulation results being sensitive to choice of encoding, device connectivity and Hamiltonian characteristics. Non-stochastic optimisation of the ternary tree class of encodings to date has targeted either the device or Hamiltonian. We develop a deterministic method which optimises ternary tree encodings without changing the underlying tree structure. This enables reduction in Pauli-weight without ancillae or additional swap-gate overhead. We demonstrate this method for a variety of encodings, including those which are derived from the qubit connectivity graph of a quantum computer. Across a suite of standard encoding methods applied to water in STO-3G basis, including Jordan-Wigner, our method reduces qDRIFT circuit depths on average by $27.7\%$ and $26.0\%$ for untranspiled and transpiled circuits respectively.

    https://arxiv.org/abs/2512.13580


    Quantum channel tomography and estimation by local test

    oai:arXiv.org:2512.13614v1

    arXiv:2512.13614v1 Announce Type: cross Abstract: We study the estimation of an unknown quantum channel $\mathcal{E}$ with input dimension $d_1$, output dimension $d_2$ and Kraus rank at most $r$. We establish a connection between the query complexities in two models: (i) access to $\mathcal{E}$, and (ii) access to a random dilation of $\mathcal{E}$. Specifically, we show that for parallel (possibly coherent) testers, access to dilations does not help. This is proved by constructing a local tester that uses $n$ queries to $\mathcal{E}$ yet faithfully simulates the tester with $n$ queries to a random dilation. As application, we show that: - $O(rd_1d_2/\varepsilon^2)$ queries to $\mathcal{E}$ suffice for channel tomography to within diamond norm error $\varepsilon$. Moreover, when $rd_2=d_1$, we show that the Heisenberg scaling $O(1/\varepsilon)$ can be achieved, even if $\mathcal{E}$ is not a unitary channel: - $O(\min\{d_1^{2.5}/\varepsilon,d_1^2/\varepsilon^2\})$ queries to $\mathcal{E}$ suffice for channel tomography to within diamond norm error $\varepsilon$, and $O(d_1^2/\varepsilon)$ queries suffice for the case of Choi state trace norm error $\varepsilon$. - $O(\min\{d_1^{1.5}/\varepsilon,d_1/\varepsilon^2\})$ queries to $\mathcal{E}$ suffice for tomography of the mixed state $\mathcal{E}(|0\rangle\langle 0|)$ to within trace norm error $\varepsilon$.

    https://arxiv.org/abs/2512.13614


    Certified-Everlasting Quantum NIZK Proofs

    oai:arXiv.org:2512.13628v1

    arXiv:2512.13628v1 Announce Type: cross Abstract: We study non-interactive zero-knowledge proofs (NIZKs) for NP satisfying: 1) statistical soundness, 2) computational zero-knowledge and 3) certified-everlasting zero-knowledge (CE-ZK). The CE-ZK property allows a verifier of a quantum proof to revoke the proof in a way that can be checked (certified) by the prover. Conditioned on successful certification, the verifier's state can be efficiently simulated with only the statement, in a statistically indistinguishable way. Our contributions regarding these certified-everlasting NIZKs (CE-NIZKs) are as follows: - We identify a barrier to obtaining CE-NIZKs in the CRS model via generalizations of known interactive proofs that satisfy CE-ZK. - We circumvent this by constructing CE-NIZK from black-box use of NIZK for NP satisfying certain properties, along with OWFs. As a result, we obtain CE-NIZKs for NP in the CRS model, based on polynomial hardness of the learning with errors (LWE) assumption. - In addition, we observe that the aforementioned barrier does not apply to the shared EPR model. Consequently, we present a CE-NIZK for NP in this model based on any statistical binding hidden-bits generator, which can be based on LWE. The only quantum computation in this protocol involves single-qubit measurements of the shared EPR pairs.

    https://arxiv.org/abs/2512.13628


    Universality of high-dimensional scaling limits of stochastic gradient descent

    oai:arXiv.org:2512.13634v1

    arXiv:2512.13634v1 Announce Type: cross Abstract: We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.

    https://arxiv.org/abs/2512.13634


    The Optimal Approximation Factor in Density Estimation

    oai:arXiv.org:1902.05876v5

    arXiv:1902.05876v5 Announce Type: replace Abstract: Consider the following problem: given two arbitrary densities $q_1,q_2$ and a sample-access to an unknown target density $p$, find which of the $q_i$'s is closer to $p$ in total variation. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses $O(\epsilon^{-2})$ samples from $p$ and outputs~$q_i$ such that with high probability, $TV(q_i,p) \leq 3\cdot\mathsf{opt} + \epsilon$, where $\mathsf{opt}= \min\{TV(q_1,p),TV(q_2,p)\}$. Moreover, this result extends to any finite class of densities $\mathcal{Q}$: there exists an algorithm that outputs the best density in $\mathcal{Q}$ up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 can not be improved if one restricts the algorithm to output a density from $\mathcal{Q}$, and (ii) if one allows the algorithm to output arbitrary densities (e.g.\ a mixture of densities from $\mathcal{Q}$), then the approximation factor can be reduced to 2, which is optimal. In particular this demonstrates an advantage of improper learning over proper in this setup. We develop two approaches to achieve the optimal approximation factor of 2: an adaptive one and a static one. Both approaches are based on a geometric point of view of the problem and rely on estimating surrogate metrics to the total variation. Our sample complexity bounds exploit techniques from {\it Adaptive Data Analysis}.

    https://arxiv.org/abs/1902.05876


    Adaptive Risk Mitigation in Demand Learning

    oai:arXiv.org:2011.10690v2

    arXiv:2011.10690v2 Announce Type: replace Abstract: We study dynamic pricing of a product with an unknown demand distribution over a finite horizon. Departing from the standard no-regret learning environment in which prices can be adjusted at any time, we restrict price changes to predetermined points in time to reflect common retail practice. This constraint, coupled with demand model ambiguity and an unknown customer arrival pattern, imposes a high risk of revenue loss, as a price based on a misestimated demand model may be applied to many customers before it can be revised. We develop an adaptive risk learning (ARL) framework that embeds a data-driven ambiguity set (DAS) to quantify demand model ambiguity by adapting to the unknown arrival pattern. Initially, when arrivals are few, the DAS includes a broad set of plausible demand models, reflecting high ambiguity and revenue risk. As new data is collected through pricing, the DAS progressively shrinks, capturing the reduction in model ambiguity and associated risk. We establish the probabilistic convergence of the DAS to the true demand model and derive a regret bound for the ARL policy that explicitly links revenue loss to the data required for the DAS to identify the true model with high probability. The dependence of our regret bound on the arrival pattern is unique to our constrained dynamic pricing problem and contrasts with no-regret learning environments, where regret is constant and arrival-pattern independent. Relaxing the constraint on infrequent price changes, we show that ARL attains the known constant regret bound. Numerical experiments further demonstrate that ARL outperforms benchmarks that prioritize either regret or risk alone by adaptively balancing both without knowledge of the arrival pattern. This adaptive risk adjustment is crucial for achieving high revenues and low downside risk when prices are sticky and both demand and arrival patterns are unknown.

    https://arxiv.org/abs/2011.10690


    We Can Always Catch You: Detecting Adversarial Patched Objects WITH or WITHOUT Signature

    oai:arXiv.org:2106.05261v3

    arXiv:2106.05261v3 Announce Type: replace Abstract: Recently, object detection has proven vulnerable to adversarial patch attacks. The attackers holding a specially crafted patch can hide themselves from state-of-the-art detectors, e.g., YOLO, even in the physical world. This attack can bring serious security threats, such as escaping from surveillance cameras. How to effectively detect this kind of adversarial examples to catch potential attacks has become an important problem. In this paper, we propose two detection methods: the signature-based method and the signature-independent method. First, we identify two signatures of existing adversarial patches that can be utilized to precisely locate patches within adversarial examples. By employing the signatures, a fast signature-based method is developed to detect the adversarial objects. Second, we present a robust signature-independent method based on the \textit{content semantics consistency} of model outputs. Adversarial objects violate this consistency, appearing locally but disappearing globally, while benign ones remain consistently present. The experiments demonstrate that two proposed methods can effectively detect attacks both in the digital and physical world. These methods each offer distinct advantage. Specifically, the signature-based method is capable of real-time detection, while the signature-independent method can detect unknown adversarial patch attacks and makes defense-aware attacks almost impossible to perform.

    https://arxiv.org/abs/2106.05261


    CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer

    oai:arXiv.org:2112.01330v2

    arXiv:2112.01330v2 Announce Type: replace Abstract: Interval and large invasive breast cancers, which are associated with worse prognosis than other cancers, are usually detected at a late stage due to false negative assessments of screening mammograms. The missed screening-time detection is commonly caused by the tumor being obscured by its surrounding breast tissues, a phenomenon called masking. To study and benchmark mammographic masking of cancer, in this work we introduce CSAW-M, the largest public mammographic dataset, collected from over 10,000 individuals and annotated with potential masking. In contrast to the previous approaches which measure breast image density as a proxy, our dataset directly provides annotations of masking potential assessments from five specialists. We also trained deep learning models on CSAW-M to estimate the masking level and showed that the estimated masking is significantly more predictive of screening participants diagnosed with interval and large invasive cancers -- without being explicitly trained for these tasks -- than its breast density counterparts.

    https://arxiv.org/abs/2112.01330


    Homomorphisms of (n,m)-graphs with respect to generalised switch

    oai:arXiv.org:2204.01258v5

    arXiv:2204.01258v5 Announce Type: replace Abstract: The study of homomorphisms of $(n,m)$-graphs, that is, adjacency preserving vertex mappings of graphs with $n$ types of arcs and $m$ types of edges was initiated by Ne\v{s}et\v{r}il and Raspaud in 2000. Later, some attempts were made to generalize the switch operation that is popularly used in the study of signed graphs, and study its effect on the above mentioned homomorphism. In this article, we too provide a generalization of the switch operation on $(n,m)$-graphs, which to the best of our knowledge, encapsulates all the previously known generalizations as special cases. We approach the study of homomorphisms with respect to the switch operation axiomatically. We prove some fundamental results that are essential tools in the further study of this topic. In the process of proving the fundamental results, we have provided yet another solution to an open problem posed by Klostermeyer and MacGillivray in 2004. We also prove the existence of a categorical product for $(n,m)$-graphs with respect to a particular class of generalized switch which implicitly uses category theory. This is a counter intuitive solution as the number of vertices in the Categorical product of two $(n,m)$-graphs on $p$ and $q$ vertices has a multiple of $pq$ many vertices, where the multiple depends on the switch. This solves an open question asked by Brewster in the PEPS 2012 workshop as a corollary. We also provide a way to calculate the product explicitly, and prove general properties of the product. We define the analog of chromatic number for $(n,m)$-graphs with respect to generalized switch and explore the interrelations between chromatic numbers with respect to different switch operations. We find the value of this chromatic number for the family of forests using group theoretic notions.

    https://arxiv.org/abs/2204.01258


    The prediction of the quality of results in Logic Synthesis using Transformer and Graph Neural Networks

    oai:arXiv.org:2207.11437v3

    arXiv:2207.11437v3 Announce Type: replace Abstract: In the logic synthesis stage, structure transformations in the synthesis tool need to be combined into optimization sequences and act on the circuit to meet the specified circuit area and delay. However, logic synthesis optimization sequences are time-consuming to run, and predicting the quality of the results (QoR) against the synthesis optimization sequence for a circuit can help engineers find a better optimization sequence faster. In this work, we propose a deep learning method to predict the QoR of unseen circuit-optimization sequences pairs. Specifically, the structure transformations are translated into vectors by embedding methods and advanced natural language processing (NLP) technology (Transformer) is used to extract the features of the optimization sequences. In addition, to enable the prediction process of the model to be generalized from circuit to circuit, the graph representation of the circuit is represented as an adjacency matrix and a feature matrix. Graph neural networks(GNN) are used to extract the structural features of the circuits. For this problem, the Transformer and three typical GNNs are used. Furthermore, the Transformer and GNNs are adopted as a joint learning policy for the QoR prediction of the unseen circuit-optimization sequences. The methods resulting from the combination of Transformer and GNNs are benchmarked. The experimental results show that the joint learning of Transformer and GraphSage gives the best results. The Mean Absolute Error (MAE) of the predicted result is 0.412.

    https://arxiv.org/abs/2207.11437


    Colorful two-piercing theorem for boxes

    oai:arXiv.org:2207.14368v3

    arXiv:2207.14368v3 Announce Type: replace Abstract: We prove a colorful extension of a Helly-type theorem by Danzer and Gr\"{u}nbaum (Combinatorica, 1982) concerning two-piercing families of axis-parallel boxes in $\mathbb{R}^d$. We also show that our result is tight by constructing extremal families that achieve the bound. Related work includes a graph-theoretic proof of the original theorem by Pendavingh, Puite, and Woeginger (Discrete Applied Mathematics, 2008), and a two-piercing result for lower-dimensional boxes by Ba\~{n}os and Oliveros (Acta Mathematica Hungarica, 2018).

    https://arxiv.org/abs/2207.14368


    Data Augmentation for Improving Emotion Recognition in Software Engineering Communication

    oai:arXiv.org:2208.05573v2

    arXiv:2208.05573v2 Announce Type: replace Abstract: Emotions (e.g., Joy, Anger) are prevalent in daily software engineering (SE) activities, and are known to be significant indicators of work productivity (e.g., bug fixing efficiency). Recent studies have shown that directly applying general purpose emotion classification tools to SE corpora is not effective. Even within the SE domain, tool performance degrades significantly when trained on one communication channel and evaluated on another (e.g, StackOverflow vs. GitHub comments). Retraining a tool with channel-specific data takes significant effort since manually annotating large datasets of ground truth data is expensive. In this paper, we address this data scarcity problem by automatically creating new training data using a data augmentation technique. Based on an analysis of the types of errors made by popular SE-specific emotion recognition tools, we specifically target our data augmentation strategy in order to improve the performance of emotion recognition. Our results show an average improvement of 9.3% in micro F1-Score for three existing emotion classification tools (ESEM-E, EMTk, SEntiMoji) when trained with our best augmentation strategy.

    https://arxiv.org/abs/2208.05573


    Vertical Semi-Federated Learning for Efficient Online Advertising

    oai:arXiv.org:2209.15635v3

    arXiv:2209.15635v3 Announce Type: replace Abstract: Traditional vertical federated learning schema suffers from two main issues: 1) restricted applicable scope to overlapped samples and 2) high system challenge of real-time federated serving, which limits its application to advertising systems. To this end, we advocate a new practical learning setting, Semi-VFL (Vertical Semi-Federated Learning), for real-world industrial applications, where the learned model retains sufficient advantages of federated learning while supporting independent local serving. To achieve this goal, we propose the carefully designed Joint Privileged Learning framework (JPL) to i) alleviate the absence of the passive party's feature with federated equivalence imitation and ii) adapt to the heterogeneous full sample space with cross-branch rank alignment. Extensive experiments conducted on real-world advertising datasets validate the effectiveness of our method over baseline methods.

    https://arxiv.org/abs/2209.15635


    Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

    oai:arXiv.org:2210.00858v4

    arXiv:2210.00858v4 Announce Type: replace Abstract: In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), therefore marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, achieving an average success rate of 80.2\%, both in simulation and with a real robot. We make supplementary material available in https://gtziafas.github.io/neurosymbolic-manipulation.

    https://arxiv.org/abs/2210.00858


    ExReg: Wide-range Photo Exposure Correction via a Multi-dimensional Regressor with Attention

    oai:arXiv.org:2212.14801v2

    arXiv:2212.14801v2 Announce Type: replace Abstract: Photo exposure correction is widely investigated, but fewer studies focus on correcting under- and over-exposed images simultaneously. Three issues remain open to handle and correct both under- and over-exposed images in a unified way. First, a locally-adaptive exposure adjustment may be more flexible instead of learning a global mapping. Second, it is an ill-posed problem to determine the suitable exposure values locally. Third, photos with the same content but different exposures may not reach consistent adjustment results. To this end, we proposed a novel exposure correction network, ExReg, to address the challenges by formulating exposure correction as a multi-dimensional regression process. Given an input image, a compact multi-exposure generation network is introduced to generate images with different exposure conditions for multi-dimensional regression and exposure correction in the next stage. An auxiliary module is designed to predict the region-wise exposure values, guiding the proposed Encoder-Decoder ANP (Attentive Neural Processes) to regress the final corrected image. The experimental results show that ExReg can generate well-exposed results and outperform the SOTA method in PSNR for extensive exposure problems. Furthermore, the processing speed, with 0.05 seconds per image on an RTX 3090, is efficient. When tested on the same image under various exposure levels, ExReg also yields results that are visually consistent and physically accurate.

    https://arxiv.org/abs/2212.14801


    Exploring the Upper Limits of Text-Based Collaborative Filtering Using Large Language Models: Discoveries and Insights

    oai:arXiv.org:2305.11700v2

    arXiv:2305.11700v2 Announce Type: replace Abstract: Text-based collaborative filtering (TCF) has emerged as the prominent technique for text and news recommendation, employing language models (LMs) as text encoders to represent items. However, the current landscape of TCF models mainly relies on the utilization of relatively small or medium-sized LMs. The potential impact of using larger, more powerful language models (such as these with over 100 billion parameters) as item encoders on recommendation performance remains uncertain. Can we anticipate unprecedented results and discover new insights? To address this question, we undertake a comprehensive series of experiments aimed at exploring the performance limits of the TCF paradigm. Specifically, we progressively augment the scale of item encoders, ranging fromone hundred million to one hundred billion parameters, in order to reveal the scaling limits of the TCF paradigm. Moreover, we investigate whether these exceptionally large LMs have the potential to establish a universal item representation for the recommendation task, thereby revolutionizing the traditional ID paradigm, which is considered a significant obstacle to developing transferable "one model fits all" recommender models. Our study not only demonstrates positive results but also uncovers unexpected negative outcomes, illuminating the current state of the TCF paradigm within the community. These findings will evoke deep reflection and inspire further research on text-based recommender systems.

    https://arxiv.org/abs/2305.11700


    WCCNet: Wavelet-context Cooperative Network for Efficient Multispectral Pedestrian Detection

    oai:arXiv.org:2308.01042v3

    arXiv:2308.01042v3 Announce Type: replace Abstract: Multispectral pedestrian detection is essential to various tasks especially autonomous driving, for which both the accuracy and computational cost are of paramount importance. Most existing approaches treat RGB and infrared modalities equally. They typically adopt two symmetrical backbones for multimodal feature extraction, which ignore the substantial differences between modalities and bring great difficulty for the reduction of the computational cost as well as effective crossmodal fusion. In this work, we propose a novel and efficient framework named Wavelet-context Cooperative Network (WCCNet), which differentially extracts complementary features across spectra with low computational cost and further fuses these diverse features based on their spatially relevant cross-modal semantics. WCCNet explores an asymmetric but cooperative dual-stream backbone, in which WCCNet utilizes generic neural layers for texture-rich feature extraction from RGB modality, while proposing Mixture of Wavelet Experts (MoWE) to capture complementary frequency patterns of infrared modality. By assessing multispectral environmental context, MoWE generates routing scores to selectively activate specific learnable Adaptive DWT (ADWT) layers, alongside shared static DWT, which are both considerible lightwight and efficient to significantly reduce computational overhead and facilitate subsequent fusion. To further fuse these multispectral features with significant semantic differences, we elaborately design the crossmodal rearranging fusion module (CMRF), which aims to mitigate misalignment and merge semantically complementary features in spatially-related local regions to amplify the crossmodal reciprocal information. Results from comprehensive evaluations on KAIST and FLIR benchmarks indicate that WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy.

    https://arxiv.org/abs/2308.01042


    Can Offline Metrics Measure Explanation Goals? A Comparative Survey Analysis of Offline Explanation Metrics in Recommender Systems

    oai:arXiv.org:2310.14379v4

    arXiv:2310.14379v4 Announce Type: replace Abstract: Explanations in a Recommender System (RS) provide reasons for recommendations to users and can enhance transparency, persuasiveness, engagement, and trust-known as explanation goals. Evaluating the effectiveness of explanation algorithms offline remains challenging due to subjectivity. Initially, we conducted a literature review on current offline metrics, revealing that algorithms are often assessed with anecdotal evidence, offering convincing examples, or with metrics that don't align with human perception. We investigated whether, in explanations connecting interacted and recommended items based on shared content, the selection of item attributes and interacted items affects explanation goals. Metrics measuring the diversity and popularity of attributes and the recency of item interactions were used to evaluate explanations from three state-of-the-art agnostic algorithms across six recommendation systems. These offline metrics were compared with results from an online user study. Our findings reveal a trade-off: transparency and trust relate to popular properties, while engagement and persuasiveness are linked to diversified properties. This study contributes to the development of more robust evaluation methods for explanation algorithms in recommender systems.

    https://arxiv.org/abs/2310.14379


    Improving Collaborative Filtering Recommendation via Graph Learning

    oai:arXiv.org:2311.03316v2

    arXiv:2311.03316v2 Announce Type: replace Abstract: Recommendation systems aim to provide personalized predictions by identifying items that are most appealing to individual users. Among various recommendation approaches, k-nearest-neighbor (kNN)-based collaborative filtering (CF) remains one of the most widely used in practice. However, the kNN scheme often results in running the algorithm on a highly dense graph, which degrades computational efficiency. In addition, enforcing a uniform neighborhood size is not well suited to capturing the true underlying structure of the data. In this paper, we leverage recent advances in graph signal processing (GSP) to learn a sparse yet high-quality graph, improving the efficiency of collaborative filtering without sacrificing recommendation accuracy. Experiments on benchmark datasets demonstrate that our method can successfully perform CF-based recommendation using an extremely sparse graph while maintaining competitive performance.

    https://arxiv.org/abs/2311.03316


    Physics-informed neural network for modeling dynamic linear elasticity

    oai:arXiv.org:2312.15175v3

    arXiv:2312.15175v3 Announce Type: replace Abstract: In this work, we present the physics-informed neural network (PINN) model applied particularly to dynamic problems in solid mechanics. We focus on forward and inverse problems. Particularly, we show how a PINN model can be used efficiently for material identification in a dynamic setting. In this work, we assume linear continuum elasticity. We show results for two-dimensional (2D) plane strain problem and then we proceed to apply the same techniques for a three-dimensional (3D) problem. As for the training data we use the solution based on the finite element method. We rigorously show that PINN models are accurate, robust and computationally efficient, especially as a surrogate model for material identification problems. Also, we employ state-of-the-art techniques from the PINN literature which are an improvement to the vanilla implementation of PINN. Based on our results, we believe that the framework we have developed can be readily adapted to computational platforms for solving multiple dynamic problems in solid mechanics.

    https://arxiv.org/abs/2312.15175


    IRG: Modular Synthetic Relational Database Generation with Complex Relational Schemas

    oai:arXiv.org:2312.15187v4

    arXiv:2312.15187v4 Announce Type: replace Abstract: Relational databases (RDBs) are widely used by corporations and governments to store multiple related tables. Their relational schemas pose unique challenges to synthetic data generation for privacy-preserving data sharing, e.g., for collaborative analytical and data mining tasks, as well as software testing at various scales. Relational schemas typically include a set of primary and foreign key constraints to specify the intra-and inter-table entity relations, which also imply crucial intra-and inter-table data correlations in the RDBs. Existing synthetic RDB generation approaches often focus on the relatively simple and basic parent-child relations, failing to address the ubiquitous real-world complexities in relational schemas in key constraints like composite keys, intra-table correlations like sequential correlation, and inter-table data correlations like indirectly connected tables. In this paper, we introduce incremental relational generator (IRG), a modular framework designed to handle these real-world challenges. In IRG, each table is generated by learning context from a depth-first traversal of relational connections to capture indirect inter-table relationships and constructs different parts of a table through several classical generative and predictive modules to preserve complex key constraints and data correlations. Compared to 3 prior art algorithms across 10 real-world RDB datasets, IRG successfully handles the relational schemas and captures critical data relationships for all datasets while prior works are incapable of. The generated synthetic data also demonstrates better fidelity and utility than prior works, implying its higher potential as a replacement for the basis of analytical tasks and data mining applications. Code is available at: https://github.com/li-jiayu-ljy/irg.

    https://arxiv.org/abs/2312.15187


    PADS: Plug-and-Play 3D Human Pose Analysis via Diffusion Generative Modeling

    oai:arXiv.org:2401.08930v2

    arXiv:2401.08930v2 Announce Type: replace Abstract: Diffusion models have demonstrated impressive capabilities in modeling complex data distributions and are increasingly applied in various generative tasks. In this work, we propose Pose Analysis by Diffusion Synthesis PADS, a unified generative modeling framework for 3D human pose analysis. PADS first learns a task-agnostic 3D pose prior via unconditional diffusion synthesis and then performs training-free adaptation to a wide range of pose analysis tasks, including 3D pose estimation, denoising, completion, etc., through a posterior sampling scheme. By formulating each task as an inverse problem with a known forward operator, PADS injects task-specific constraints during inference while keeping the pose prior fixed. This plug-and-play framework removes the need for task-specific supervision or retraining, offering flexibility and scalability across diverse conditions. Extensive experiments on different benchmarks showcase the superior performance against both learning-based and optimization-based baselines, demonstrating the effectiveness and generalization capability of our method.

    https://arxiv.org/abs/2401.08930


    Revolutionizing Finance with LLMs: An Overview of Applications and Insights

    oai:arXiv.org:2401.11641v4

    arXiv:2401.11641v4 Announce Type: replace Abstract: In recent years, Large Language Models (LLMs) like ChatGPT have seen considerable advancements and have been applied in diverse fields. Built on the Transformer architecture, these models are trained on extensive datasets, enabling them to understand and generate human language effectively. In the financial domain, the deployment of LLMs is gaining momentum. These models are being utilized for automating financial report generation, forecasting market trends, analyzing investor sentiment, and offering personalized financial advice. Leveraging their natural language processing capabilities, LLMs can distill key insights from vast financial data, aiding institutions in making informed investment choices and enhancing both operational efficiency and customer satisfaction. In this study, we provide a comprehensive overview of the emerging integration of LLMs into various financial tasks. Additionally, we conducted holistic tests on multiple financial tasks through the combination of natural language instructions. Our findings show that GPT-4 effectively follow prompt instructions across various financial tasks. This survey and evaluation of LLMs in the financial domain aim to deepen the understanding of LLMs' current role in finance for both financial practitioners and LLM researchers, identify new research and application prospects, and highlight how these technologies can be leveraged to solve practical challenges in the finance industry.

    https://arxiv.org/abs/2401.11641


    "All of Me": Mining Users' Attributes from their Public Spotify Playlists

    oai:arXiv.org:2401.14296v2

    arXiv:2401.14296v2 Announce Type: replace Abstract: In the age of digital music streaming, playlists on platforms like Spotify have become an integral part of individuals' musical experiences. People create and publicly share their own playlists to express their musical tastes, promote the discovery of their favorite artists, and foster social connections. In this work, we aim to address the question: can we infer users' private attributes from their public Spotify playlists? To this end, we conducted an online survey involving 739 Spotify users, resulting in a dataset of 10,286 publicly shared playlists comprising over 200,000 unique songs and 55,000 artists. Then, we utilize statistical analyses and machine learning algorithms to build accurate predictive models for users' attributes.

    https://arxiv.org/abs/2401.14296


    Neuromorphic Photonic Computing with an Electro-Optic Analog Memory

    oai:arXiv.org:2401.16515v4

    arXiv:2401.16515v4 Announce Type: replace Abstract: In neuromorphic photonic systems, device operations are typically governed by analog signals, necessitating digital-to-analog converters (DAC) and analog-to-digital converters (ADC). However, data movement between memory and these converters in conventional von Neumann architectures incur significant energy costs. We propose an analog electronic memory co-located with photonic computing units to eliminate repeated long-distance data movement. Here, we demonstrate a monolithically integrated neuromorphic photonic circuit with on-chip capacitive analog memory and evaluate its performance in machine learning for in situ training and inference using the MNIST dataset. Our analysis shows that integrating analog memory into a neuromorphic photonic architecture can achieve over 26x power savings compared to conventional SRAM-DAC architectures. Furthermore, maintaining a minimum analog memory retention-to-network-latency ratio of 100 maintains >90% inference accuracy, enabling leaky analog memories without substantial performance degradation. This approach reduces reliance on DACs, minimizes data movement, and offers a scalable pathway toward energy-efficient, high-speed neuromorphic photonic computing.

    https://arxiv.org/abs/2401.16515


    Slightly Non-Linear Higher-Order Tree Transducers

    oai:arXiv.org:2402.05854v3

    arXiv:2402.05854v3 Announce Type: replace Abstract: We investigate the tree-to-tree functions computed by "affine $\lambda$-transducers": tree automata whose memory consists of an affine $\lambda$-term instead of a finite state. They can be seen as variations on Gallot, Lemay and Salvati's Linear High-Order Deterministic Tree Transducers. When the memory is almost purely affine ($\textit{\`a la}$ Kanazawa), we show that these machines can be translated to tree-walking transducers (and with a purely affine memory, we get a reversible tree-walking transducer). This leads to a proof of an inexpressivity conjecture of Nguy\^en and Pradic on "implicit automata" in an affine $\lambda$-calculus. We also prove that a more powerful variant, extended with preprocessing by an MSO relabeling and allowing a limited amount of non-linearity, is equivalent in expressive power to Engelfriet, Hoogeboom and Samwel's invisible pebble tree transducers. The key technical tool in our proofs is the Interaction Abstract Machine (IAM), an operational avatar of Girard's geometry of interaction, a semantics of linear logic. We work with ad-hoc specializations to $\lambda$-terms of low exponential depth of a tree-generating version of the IAM.

    https://arxiv.org/abs/2402.05854


    Foveated Retinotopy Improves Classification and Localization in Convolutional Neural Networks

    oai:arXiv.org:2402.15480v5

    arXiv:2402.15480v5 Announce Type: replace Abstract: From falcons spotting preys to humans recognizing faces, rapid visual abilities depend on a foveated retinal organization which delivers high-acuity central vision while preserving low-resolution periphery. This organization is conserved along early visual pathways but remains underexplored in machine learning. Here we examine how embedding a foveated retinotopic transformation as a preprocessing layer impacts convolutional neural networks (CNNs) for image classification. By applying a log-polar mapping to off-the-shelf models and retraining them, we retain comparable accuracy while improving robustness to scale and rotation. We show that this architecture becomes highly sensitive to fixation-point shifts, and that this sensitivity yields a proxy for defining saliency maps that effectively facilitates object localization. Our results show that foveated retinotopy encodes prior geometric knowledge, offering a solution to visual-search and enhancing both classification and localization. These findings connect biological vision principles with artificial networks, pointing to new, robust and efficient directions for computer-vision systems.

    https://arxiv.org/abs/2402.15480


    Inverse Optimal Control for Linear Quadratic Tracking with Unknown Target States

    oai:arXiv.org:2402.17247v4

    arXiv:2402.17247v4 Announce Type: replace Abstract: This paper addresses the inverse optimal control for the linear quadratic tracking problem with a fixed but unknown target state, which aims to estimate the possible triplets comprising the target state, the state weight matrix, and the input weight matrix from observed optimal control input and the corresponding state trajectories. Sufficient conditions have been provided for the unique determination of both the linear quadratic cost function as well as the target state. A computationally efficient and numerically reliable parameter identification algorithm is proposed by equating optimal control strategies with a system of linear equations, and the associated relative error upper bound is derived in terms of data volume and signal-to-noise ratio. Moreover, the proposed inverse optimal control algorithm is applied for the joint cluster coordination and intent identification of a multi-agent system. By incorporating the structural constraint of the Laplace matrix, the relative error upper bound can be reduced accordingly. Finally, the algorithm's efficiency and accuracy are validated by a vehicle-on-a-lever example and a multi-agent formation control example.

    https://arxiv.org/abs/2402.17247


    An Efficient and Harmonized Framework for Balanced Cross-Domain Feature Integration

    oai:arXiv.org:2403.18461v3

    arXiv:2403.18461v3 Announce Type: replace Abstract: Despite significant advancements in image generation using advanced generative frameworks, cross-image integration of content and style remains a key challenge. Current generative models, while powerful, frequently depend on vague textual prompts to define styles--creating difficulties in balancing content semantics and style preservation. We propose a novel framework that utilizes customized models to learn style representations. It enhances content preservation through cross-model feature and attention modulation, leveraging the inherent semantic consistency across models. Additionally, we introduce fixed feature and adaptive attention fusion to achieve the desired balance between content and style. We further develop spatial (mask-guided localized) and temporal (multi-style compositional) multi-model combinations, enabling flexible fusion of models and styles. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in balancing content preservation and stylistic coherence.

    https://arxiv.org/abs/2403.18461


    A Comprehensive Survey on Self-Supervised Learning for Recommendation

    oai:arXiv.org:2404.03354v3

    arXiv:2404.03354v3 Announce Type: replace Abstract: Recommender systems play a crucial role in tackling the challenge of information overload by delivering personalized recommendations based on individual user preferences. Deep learning techniques, such as RNNs, GNNs, and Transformer architectures, have significantly propelled the advancement of recommender systems by enhancing their comprehension of user behaviors and preferences. However, supervised learning methods encounter challenges in real-life scenarios due to data sparsity, resulting in limitations in their ability to learn representations effectively. To address this, self-supervised learning (SSL) techniques have emerged as a solution, leveraging inherent data structures to generate supervision signals without relying solely on labeled data. By leveraging unlabeled data and extracting meaningful representations, recommender systems utilizing SSL can make accurate predictions and recommendations even when confronted with data sparsity. In this paper, we provide a comprehensive review of self-supervised learning frameworks designed for recommender systems, encompassing a thorough analysis of over 170 papers. We conduct an exploration of nine distinct scenarios, enabling a comprehensive understanding of SSL-enhanced recommenders in different contexts. For each domain, we elaborate on different self-supervised learning paradigms, namely contrastive learning, generative learning, and adversarial learning, so as to present technical details of how SSL enhances recommender systems in various contexts. We consistently maintain the related open-source materials at https://github.com/HKUDS/Awesome-SSLRec-Papers.

    https://arxiv.org/abs/2404.03354


    Unambiguous and Co-Nondeterministic Computations of Finite Automata and Pushdown Automata Families and the Effects of Multiple Counters

    oai:arXiv.org:2404.13254v2

    arXiv:2404.13254v2 Announce Type: replace Abstract: Nonuniform families of polynomial-size finite automata and pushdown automata respectively have strong connections to nonuniform-NL and nonuniform-LOGCFL. We examine the behaviors of unambiguous and co-nondeterministic computations produced by such families of automata operating multiple counters, where a counter is a stack using only a single non-bottom symbol. As immediate consequences, we obtain various collapses of the complexity classes of families of promise problems solvable by finite and pushdown automata families when all valid instances are limited to either polynomially long strings or unary strings. A key technical ingredient of our proofs is an inductive counting of reachable vertices of each computation graph of finite and pushdown automata that operate multiple counters simultaneously.

    https://arxiv.org/abs/2404.13254


    Fast Wrong-way Cycling Detection in CCTV Videos: Sparse Sampling is All You Need

    oai:arXiv.org:2405.07293v2

    arXiv:2405.07293v2 Announce Type: replace Abstract: Effective monitoring of unusual transportation behaviors, such as wrong-way cycling (i.e., riding a bicycle or e-bike against designated traffic flow), is crucial for optimizing law enforcement deployment and traffic planning. However, accurately recording all wrong-way cycling events is both unnecessary and infeasible in resource-constrained environments, as it requires high-resolution cameras for evidence collection and event detection. To address this challenge, we propose WWC-Predictor, a novel method for efficiently estimating the wrong-way cycling ratio, defined as the proportion of wrong-way cycling events relative to the total number of cycling movements over a given time period. The core innovation of our method lies in accurately detecting wrong-way cycling events in sparsely sampled frames using a light-weight detector, then estimating the overall ratio using an autoregressive moving average model. To evaluate the effectiveness of our method, we construct a benchmark dataset consisting of 35 minutes of video sequences with minute-level annotations.Our method achieves an average error rate of a mere 1.475\% while consuming only 19.12\% GPU time required by conventional tracking methods, validating its effectiveness in estimating the wrong-way cycling ratio. Our source code is publicly available at: https://github.com/VICA-Lab-HKUST-GZ/WWC-Predictor.

    https://arxiv.org/abs/2405.07293


    Certifying Robustness of Graph Convolutional Networks for Node Perturbation with Polyhedra Abstract Interpretation

    oai:arXiv.org:2405.08645v2

    arXiv:2405.08645v2 Announce Type: replace Abstract: Graph convolutional neural networks (GCNs) are powerful tools for learning graph-based knowledge representations from training data. However, they are vulnerable to small perturbations in the input graph, which makes them susceptible to input faults or adversarial attacks. This poses a significant problem for GCNs intended to be used in critical applications, which need to provide certifiably robust services even in the presence of adversarial perturbations. We propose an improved GCN robustness certification technique for node classification in the presence of node feature perturbations. We introduce a novel polyhedra-based abstract interpretation approach to tackle specific challenges of graph data and provide tight upper and lower bounds for the robustness of the GCN. Experiments show that our approach simultaneously improves the tightness of robustness bounds as well as the runtime performance of certification. Moreover, our method can be used during training to further improve the robustness of GCNs.

    https://arxiv.org/abs/2405.08645


    Serializing Java Objects in Plain Code

    oai:arXiv.org:2405.11294v4

    arXiv:2405.11294v4 Announce Type: replace Abstract: In managed languages, serialization of objects is typically done in bespoke binary formats such as Protobuf, or markup languages such as XML or JSON. The major limitation of these formats is readability. Human developers cannot read binary code, and in most cases, suffer from the syntax of XML or JSON. This is a major issue when objects are meant to be embedded and read in source code, such as in test cases. To address this problem, we propose plain-code serialization. Our core idea is to serialize objects observed at runtime in the native syntax of a programming language. We realize this vision in the context of Java, and demonstrate a prototype which serializes Java objects to Java source code. The resulting source faithfully reconstructs the objects seen at runtime. Our prototype is called ProDJ and is publicly available. We experiment with ProDJ to successfully plain-code serialize 174,699 objects observed during the execution of 4 open-source Java applications. Our performance measurement shows that the performance impact is not noticeable. Through a user study, we demonstrate that developers prefer plain-code serialized objects within automatically generated tests over their representations as XML or JSON.

    https://arxiv.org/abs/2405.11294


    RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

    oai:arXiv.org:2405.20336v3

    arXiv:2405.20336v3 Announce Type: replace Abstract: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation.

    https://arxiv.org/abs/2405.20336


    Fast and Robust Flocking of Protesters on Street Networks

    oai:arXiv.org:2406.01101v4

    arXiv:2406.01101v4 Announce Type: replace Abstract: We present a simple model of protesters scattered throughout a city who want to gather into large and mobile groups. This model relies on random walkers on a street network that follow tactics built from a set of basic rules. Our goal is to identify the most important rules for fast and robust flocking of walkers. We explore a wide set of tactics and show the central importance of a specific rule based on alignment. Other rules alone perform poorly, but our experiments show that combining alignment with them enhances flocking, and that obtained groups are then remarkably robust.

    https://arxiv.org/abs/2406.01101


    Efficient Neural Common Neighbor for Temporal Graph Link Prediction

    oai:arXiv.org:2406.07926v2

    arXiv:2406.07926v2 Announce Type: replace Abstract: Temporal graphs are widespread in real-world applications such as social networks, as well as trade and transportation networks. Predicting dynamic links within these evolving graphs is a key problem. Many memory-based methods use temporal interaction histories to generate node embeddings, which are then combined to predict links. However, these approaches primarily focus on individual node representations, often overlooking the inherently pairwise nature of link prediction. While some recent methods attempt to capture pairwise features, they tend to be limited by high computational complexity arising from repeated embedding calculations, making them unsuitable for large-scale datasets like the Temporal Graph Benchmark (TGB). To address the critical need for models that combine strong expressive power with high computational efficiency for link prediction on large temporal graphs, we propose Temporal Neural Common Neighbor (TNCN). Our model achieves this balance by adapting the powerful pairwise modeling principles of Neural Common Neighbor (NCN) to an efficient temporal architecture. TNCN improves upon NCN by efficiently preserving and updating temporal neighbor dictionaries for each node and by using multi-hop common neighbors to learn more expressive pairwise representations. TNCN achieves new state-of-the-art performance on Review from five large-scale real-world TGB datasets, 6 out of 7 datasets in the transductive setting and 3 out of 7 in the inductive setting on small- to medium-scale datasets. Additionally, TNCN demonstrates excellent scalability, outperforming prominent GNN baselines by up to 30.3 times in speed on large datasets. Our code is available at https://github.com/GraphPKU/TNCN.

    https://arxiv.org/abs/2406.07926


    DABL: Detecting Semantic Anomalies in Business Processes Using Large Language Models

    oai:arXiv.org:2406.15781v2

    arXiv:2406.15781v2 Announce Type: replace Abstract: Detecting anomalies in business processes is crucial for ensuring operational success. While many existing methods rely on statistical frequency to detect anomalies, it's important to note that infrequent behavior doesn't necessarily imply undesirability. To address this challenge, detecting anomalies from a semantic viewpoint proves to be a more effective approach. However, current semantic anomaly detection methods treat a trace (i.e., process instance) as multiple event pairs, disrupting long-distance dependencies. In this paper, we introduce DABL, a novel approach for detecting semantic anomalies in business processes using large language models (LLMs). We collect 143,137 real-world process models from various domains. By generating normal traces through the playout of these process models and simulating both ordering and exclusion anomalies, we fine-tune Llama 2 using the resulting log. Through extensive experiments, we demonstrate that DABL surpasses existing state-of-the-art semantic anomaly detection methods in terms of both generalization ability and learning of given processes. Users can directly apply DABL to detect semantic anomalies in their own datasets without the need for additional training. Furthermore, DABL offers the capability to interpret the causes of anomalies in natural language, providing valuable insights into the detected anomalies.

    https://arxiv.org/abs/2406.15781


    Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions

    oai:arXiv.org:2406.16743v2

    arXiv:2406.16743v2 Announce Type: replace Abstract: With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational overhead during model training. Another way to align language models is to modify the logit of tokens in model outputs without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confused tokens. However, these methods require the manual selection of contrastive models or instruction templates, limiting the degree of contrast. To this end, we propose Adversarial Contrastive Decoding (ACD), an optimization-based framework to generate two opposite soft system prompts, the Safeguarding Prompt (SP) and the Adversarial Prompt (AP), for prompt-based contrastive decoding. The SP aims to promote safer outputs while the AP aims to exploit the harmful parts of the model, providing a strong contrast to align the model with safety. ACD only needs to apply a lightweight prompt tuning on a rather small anchor dataset without training the target model. Experiments conducted on extensive models and benchmarks demonstrate that the proposed method achieves much better safety performance than previous model training-free decoding methods without sacrificing its original generation ability.

    https://arxiv.org/abs/2406.16743


    Numerical solution of two dimensional scalar conservation laws using compact implicit numerical schemes on Cartesian meshes

    oai:arXiv.org:2407.05275v4

    arXiv:2407.05275v4 Announce Type: replace Abstract: This paper deals with the numerical solution of conservation laws in the two dimensional case using a novel compact implicit time discretization that enables applications of fast algebraic solvers. We present details for the second order accurate parametric scheme based on the finite volume method including simple variants of ENO (Essentially Non-Oscillatory) and WENO (Weighted Essentially Non-Oscillatory) approximations. To avoid oscillatory numerical solutions for large time steps, we propose limiting in time which is provable non-oscillatory in the case of linear advection equation with variable velocity. We present numerical experiments for representative linear and nonlinear problems.

    https://arxiv.org/abs/2407.05275


    Towards the type safety of Pure Subtype Systems (Full version)

    oai:arXiv.org:2407.13882v2

    arXiv:2407.13882v2 Announce Type: replace Abstract: Hutchins' Pure Subtype Systems (PSS) offer a unified framework for types and terms, promising significant advancements in language design for features like dependent types and higher-order subtyping. However, the theory has been hampered by a critical gap: a proof of type safety has remained an open problem for over a decade. The original attempt to prove this property relied on the conjectured commutativity of two fundamental reduction relations, equivalence and subtyping. Proving transitivity elimination, however, requires this commutativity, a property that is notoriously difficult to establish for higher-order subtyping systems. In this paper, we address this issue by introducing Machine-Based PSS (MPSS), a novel reformulation of the original system. MPSS integrates a continuation stack mechanism, reminiscent of the Krivine Abstract Machine, to keep track of arguments that are passed during function application, enabling more fine-grained reductions. This architectural change exposes crucial intermediate reduction steps that were absent in the original PSS. The primary contribution of our work is a direct proof that the equivalence and subtyping reductions in MPSS commute. This result formally establishes transitivity elimination, which is the cornerstone of the inversion lemma required for type safety. We conclude by outlining a pathway from our foundational result to a complete, type-safe system, thereby paving the way for the practical realization of PSS-based languages.

    https://arxiv.org/abs/2407.13882


    Neural stochastic Volterra equations: learning path-dependent dynamics

    oai:arXiv.org:2407.19557v3

    arXiv:2407.19557v3 Announce Type: replace Abstract: Stochastic Volterra equations (SVEs) serve as mathematical models for the time evolutions of random systems with memory effects and irregular behaviour. We introduce neural stochastic Volterra equations as a physics-inspired architecture, generalizing the class of neural stochastic differential equations, and provide some theoretical foundation. Numerical experiments on various SVEs, like the disturbed pendulum equation, the generalized Ornstein--Uhlenbeck process, the rough Heston model and a monetary reserve dynamics, are presented, comparing the performance of neural SVEs, neural SDEs and Deep Operator Networks (DeepONets).

    https://arxiv.org/abs/2407.19557


    Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

    oai:arXiv.org:2408.06069v2

    arXiv:2408.06069v2 Announce Type: replace Abstract: Deep Gaussian process models typically employ discrete hierarchies, but recent advancements in differential Gaussian processes (DiffGPs) have extended these models to infinite depths. However, existing DiffGP approaches often overlook the uncertainty in kernel hyperparameters by treating them as fixed and time-invariant, which degrades the model's predictive performance and neglects the posterior distribution. In this work, we introduce a fully Bayesian framework that models kernel hyperparameters as random variables and utilizes coupled stochastic differential equations (SDEs) to jointly learn their posterior distributions alongside those of inducing points. By incorporating the estimation uncertainty of hyperparameters, our method significantly enhances model flexibility and adaptability to complex dynamic systems. Furthermore, we employ a black-box adaptive SDE solver with a neural network to achieve realistic, time varying posterior approximations, thereby improving the expressiveness of the variational posterior. Comprehensive experimental evaluations demonstrate that our approach outperforms traditional methods in terms of flexibility, accuracy, and other key performance metrics. This work not only provides a robust Bayesian extension to DiffGP models but also validates its effectiveness in handling intricate dynamic behaviors, thereby advancing the applicability of Gaussian process models in diverse real-world scenarios.

    https://arxiv.org/abs/2408.06069


    Computation and Concurrency

    oai:arXiv.org:2409.02595v4

    arXiv:2409.02595v4 Announce Type: replace Abstract: We try to clarify the relationship between computation and concurrency. Base on the so-called pomsetc automata, we introduce communication and more operators, and establish the algebras modulo language equivalence and truly concurrent bisimilarities.

    https://arxiv.org/abs/2409.02595


    Dynamic Fraud Detection: Integrating Reinforcement Learning into Graph Neural Networks

    oai:arXiv.org:2409.09892v2

    arXiv:2409.09892v2 Announce Type: replace Abstract: Financial fraud refers to the act of obtaining financial benefits through dishonest means. Such behavior not only disrupts the order of the financial market but also harms economic and social development and breeds other illegal and criminal activities. With the popularization of the internet and online payment methods, many fraudulent activities and money laundering behaviors in life have shifted from offline to online, posing a great challenge to regulatory authorities. How to efficiently detect these financial fraud activities has become an urgent issue that needs to be resolved. Graph neural networks are a type of deep learning model that can utilize the interactive relationships within graph structures, and they have been widely applied in the field of fraud detection. However, there are still some issues. First, fraudulent activities only account for a very small part of transaction transfers, leading to an inevitable problem of label imbalance in fraud detection. At the same time, fraudsters often disguise their behavior, which can have a negative impact on the final prediction results. In addition, existing research has overlooked the importance of balancing neighbor information and central node information. For example, when the central node has too many neighbors, the features of the central node itself are often neglected. Finally, fraud activities and patterns are constantly changing over time, so considering the dynamic evolution of graph edge relationships is also very important.

    https://arxiv.org/abs/2409.09892


    OM4OV: Leveraging Ontology Matching for Ontology Versioning

    oai:arXiv.org:2409.20302v5

    arXiv:2409.20302v5 Announce Type: replace Abstract: Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many views treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse the similarities and differences between OM and OV and formalise the OM4OV pipeline. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be reused for OV tasks, but without necessary modifications, the current OM4OV pipeline can produce skewed measurements, poor performance in detecting update entities, and less explainability for false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, building upon the existing alignment(s) from OM to reduce the number of matching candidates and improve overall OV performance.

    https://arxiv.org/abs/2409.20302


    High-order numerical integration on self-affine sets

    oai:arXiv.org:2410.00637v2

    arXiv:2410.00637v2 Announce Type: replace Abstract: We construct an interpolatory high-order cubature rule to compute integrals of smooth functions over self-affine sets with respect to an invariant measure. The main difficulty is the computation of the cubature weights, which we characterize algebraically, by exploiting a self-similarity property of the integral. We propose an $h$-version and a $p$-version of the cubature, present an error analysis and conduct numerical experiments.

    https://arxiv.org/abs/2410.00637


    Learning Dissipative Chaotic Dynamics with Boundedness Guarantees

    oai:arXiv.org:2410.00976v4

    arXiv:2410.00976v4 Announce Type: replace Abstract: Chaotic dynamics, commonly seen in weather systems and fluid turbulence, are characterized by their sensitivity to initial conditions, which makes accurate prediction challenging. Recent approaches have focused on developing data-driven models that attempt to preserve invariant statistics over long horizons since many chaotic systems exhibit dissipative behaviors and ergodicity. Despite the recent progress in such models, they are still often prone to generating unbounded trajectories, leading to invalid statistics evaluation. To address this fundamental challenge, we introduce a modular framework that provides formal guarantees of trajectory boundedness for neural network chaotic dynamics models. Our core contribution is a dissipative projection layer that leverages control-theoretic principles to ensure the learned system is dissipative. Specifically, our framework simultaneously learns a dynamics emulator and an energy-like function, where the latter is used to construct an algebraic dissipative constraint within the projection layer. Furthermore, the learned invariant level set provides an outer estimate for the system's strange attractor, which is known to be difficult to characterize due to its complex geometry. We demonstrate our model's ability to produce bounded long-horizon forecasts that preserve invariant statistics for chaotic dynamical systems, including Lorenz 96 and a reduced-order model of the Kuramoto-Sivashinsky equation.

    https://arxiv.org/abs/2410.00976


    A Banach space formulation for the fully dynamic Navier-Stokes-Biot coupled problem

    oai:arXiv.org:2410.06322v2

    arXiv:2410.06322v2 Announce Type: replace Abstract: We introduce and analyse a fully-mixed formulation for the coupled problem arising in the interaction between a free fluid and a poroelastic medium.The flows in the free fluid and poroelastic regions are governed by the Navier-Stokes and Biot equations, respectively, and the transmission conditions are given by mass conservation, balance of stresses, and the Beavers-Joseph-Saffman law.We apply dual-mixed formulations in both Navier-Stokes and Darcy equations, where the symmetry of the Navier-Stokes pseudostress tensor is imposed in a weak sense and a displacement-based formulation for elasticity equation.In turn, since the transmission conditions are essential in the fully mixed formulation, they are imposed weakly by introducing the traces of the fluid velocity and the poroelastic medium pressure on the interface as the associated Lagrange multipliers.Existence and uniqueness of a solution are established for the continuous weak formulation, as well as a semidiscrete continuous-in-time formulation with nonmatching grids, in a Banach space setting, employing classical results on monotone and nonlinear operators and a regularization technique together with the Banach fixed point approach.We then present error analysis with corresponding rates of convergence for semidiscrete continuous-in-time formulation.Numerical experiments are presented to verify the theoretical rates of convergence and illustrate the performance of the method for application to flow through a filter.

    https://arxiv.org/abs/2410.06322


    The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

    oai:arXiv.org:2410.10473v5

    arXiv:2410.10473v5 Announce Type: replace Abstract: Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, we believe that delineating their susceptibility to clean-label poisoning, and developing methods for overcoming this susceptibility, are critical research directions to pursue.

    https://arxiv.org/abs/2410.10473


    Meta-Property Graphs: Extending Property Graphs with Metadata Awareness and Reification

    oai:arXiv.org:2410.13813v2

    arXiv:2410.13813v2 Announce Type: replace Abstract: The ISO standard Property Graph model has become increasingly popular for representing complex, interconnected data. However, it lacks native support for querying metadata and reification, which limits its abilities to deal with the demands of modern applications. We introduce the vision of Meta-Property Graphs, a backwards compatible extension of the property graph model addressing these limitations. Our approach enables first-class treatment of labels and properties as queryable objects and supports reification of substructures in a graph. We propose MetaGPML, a backwards compatible extension of the Graph Pattern Matching Language forming the core of the ISO standard GQL, to query these enhanced graphs. We demonstrate how these foundations pave the way for advanced data analytics and governance tasks that are challenging or impossible with current property graph systems.

    https://arxiv.org/abs/2410.13813


    Edge multiscale finite element methods for semilinear parabolic problems with heterogeneous coefficients

    oai:arXiv.org:2410.21150v2

    arXiv:2410.21150v2 Announce Type: replace Abstract: We develop a new spatial semidiscrete multiscale method based upon the edge multiscale methods to solve semilinear parabolic problems with heterogeneous coefficients and smooth initial data. This method allows for a cheap spatial discretization, which fails to resolve the spatial heterogeneity but maintains satisfactory accuracy independent of the heterogeneity. This is achieved by simultaneously constructing a steady-state multiscale ansatz space with certain approximation properties for the evolving solution and the initial data. The approximation properties of the multiscale ansatz space are derived using local-global splitting. A fully discrete scheme is analyzed using a first-order explicit exponential Euler scheme. We derive the error estimates in the $L^{2}$-norm and energy norm under the regularity assumptions for the semilinear term. The convergence rates depend on the coarse grid size and the level parameter. Finally, extensive numerical experiments are carried out to validate the efficiency of the proposed method.

    https://arxiv.org/abs/2410.21150


    Computing Conforming Partitions with Low Stabbing Number for Rectilinear Polygons

    oai:arXiv.org:2411.11274v2

    arXiv:2411.11274v2 Announce Type: replace Abstract: A conforming partition of a rectilinear n-gon P (possibly with holes) is a partition of P into rectangles without using Steiner points (i.e., all corners of all rectangles must lie on the boundary of P). The stabbing number of such a partition is the maximum number of rectangles intersected by an axis-aligned segment lying in the interior of P. In this paper, we examine the problem of computing conforming partitions with low stabbing number. We show that computing a conforming partition with stabbing number at most 4 is NP-hard, which strengthens a previously known hardness result [Durocher \& Mehrabi, Theor. Comput. Sci. 689: 157-168 (2017)] and eliminates the possibility for fixed-parameter-tractable algorithms parameterized by the stabbing number unless P = NP. In contrast, we give (i) an O(n log n)-time algorithm to decide whether a conforming partition with stabbing number 2 exists, (ii) a fixed-parameter-tractable algorithm parameterized by both the stabbing number and treewidth of the pixel graph of the polygon, and (iii) a fixed-parameter-tractable algorithm parameterized by the stabbing number for polygons without holes in general position.

    https://arxiv.org/abs/2411.11274


    Trojan Cleansing with Neural Collapse

    oai:arXiv.org:2411.12914v3

    arXiv:2411.12914v3 Announce Type: replace Abstract: Trojan attacks are sophisticated training-time attacks on neural networks that embed backdoor triggers which force the network to produce a specific output on any input which includes the trigger. With the increasing relevance of deep networks which are too large to train with personal resources and which are trained on data too large to thoroughly audit, these training-time attacks pose a significant risk. In this work, we connect trojan attacks to Neural Collapse, a phenomenon wherein the final feature representations of over-parameterized neural networks converge to a simple geometric structure. We provide experimental evidence that trojan attacks disrupt this convergence for a variety of datasets and architectures. We then use this disruption to design a lightweight, broadly generalizable mechanism for cleansing trojan attacks from a wide variety of different network architectures and experimentally demonstrate its efficacy.

    https://arxiv.org/abs/2411.12914


    From Pruning to Grafting: Dynamic Knowledge Redistribution via Learnable Layer Fusion

    oai:arXiv.org:2411.14507v3

    arXiv:2411.14507v3 Announce Type: replace Abstract: Structured pruning of Generative Pre-trained Transformers (GPTs) offers a promising path to efficiency but often suffers from irreversible performance degradation due to the discarding of transformer blocks. In this paper, we introduce FuseGPT, a compression paradigm that reframes structured pruning as iterative knowledge grafting rather than simple removal. Motivated by the observation that linear block merging fails to capture non-linear feature disparities and that block importance fluctuates dynamically during pruning, FuseGPT employs a dual-strategy pipeline. First, we propose Macro Influence (MI), a dynamic fusion-aware metric that continuously re-evaluates block redundancy as the network topology evolves. Second, instead of rigid parameter averaging, we introduce a learnable low-rank fusion mechanism that adaptively grafts the knowledge of pruned blocks onto surviving layers via lightweight local distillation. Extensive experiments on LLaMA, Mistral, Qwen, and Phi families demonstrate that FuseGPT establishes a new state-of-the-art on the compression-accuracy Pareto frontier: at 25\% sparsity, FuseGPT achieves lower perplexity than prior methods at 20\% sparsity, improves zero-shot reasoning by up to 4.5 points, and delivers 1.33$\times$ inference speedup with 25\% memory reduction. Furthermore, FuseGPT is orthogonal to quantization, achieving 52.1\% total compression with negligible quality loss when combined with 4-bit GPTQ. We make our code publicly available at https://github.com/JarvisPei/FuseGPT.

    https://arxiv.org/abs/2411.14507


    QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

    oai:arXiv.org:2411.19534v2

    arXiv:2411.19534v2 Announce Type: replace Abstract: We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.

    https://arxiv.org/abs/2411.19534


    Generative AI-based data augmentation for improved bioacoustic classification in noisy environments

    oai:arXiv.org:2412.01530v3

    arXiv:2412.01530v3 Announce Type: replace Abstract: Obtaining data to train robust artificial intelligence (AI)-based models for species classification can be challenging, particularly for rare species. Data augmentation can boost classification accuracy by increasing the diversity of training data and is cheaper to obtain than expert-labelled data. However, many classic image-based augmentation techniques are not suitable for audio spectrograms. We investigate two generative AI models as data augmentation tools to synthesise spectrograms and supplement audio data: Auxiliary Classifier Generative Adversarial Networks (ACGAN) and Denoising Diffusion Probabilistic Models (DDPMs). The latter performed particularly well in terms of both realism of generated spectrograms and accuracy in a resulting classification task. Alongside these new approaches, we present a new audio data set of 640 hours of bird calls from wind farm sites in Ireland, approximately 800 samples of which have been labelled by experts. Wind farm data are particularly challenging for classification models given the background wind and turbine noise. Training an ensemble of classification models on real and synthetic data combined compared well with highly confident BirdNET predictions. Each classifier we used was improved by including synthetic data, and classification metrics generally improved in line with the amount of synthetic data added. Our approach can be used to augment acoustic signals for more species and other land-use types, and has the potential to bring about advances in our capacity to develop reliable AI-based detection of rare species. Our code is available at https://github.com/gibbona1/SpectrogramGenAI.

    https://arxiv.org/abs/2412.01530


    TimeWalker: Personalized Neural Space for Lifelong Head Avatars

    oai:arXiv.org:2412.02421v2

    arXiv:2412.02421v2 Announce Type: replace Abstract: We present TimeWalker, a novel framework that models realistic, full-scale 3D head avatars of a person on lifelong scale. Unlike current human head avatar pipelines that capture identity at the momentary level(e.g., instant photography or short videos), TimeWalker constructs a person's comprehensive identity from unstructured data collection over his/her various life stages, offering a paradigm to achieve full reconstruction and animation of that person at different moments of life. At the heart of TimeWalker's success is a novel neural parametric model that learns personalized representation with the disentanglement of shape, expression, and appearance across ages. Central to our methodology are the concepts of two aspects: (1) We track back to the principle of modeling a person's identity in an additive combination of average head representation in the canonical space, and moment-specific head attribute representations driven from a set of neural head basis. To learn the set of head basis that could represent the comprehensive head variations in a compact manner, we propose a Dynamic Neural Basis-Blending Module (Dynamo). It dynamically adjusts the number and blend weights of neural head bases, according to both shared and specific traits of the target person over ages. (2) Dynamic 2D Gaussian Splatting (DNA-2DGS), an extension of Gaussian splatting representation, to model head motion deformations like facial expressions without losing the realism of rendering and reconstruction. DNA-2DGS includes a set of controllable 2D oriented planar Gaussian disks that utilize the priors from parametric model, and move/rotate with the change of expression. Through extensive experimental evaluations, we show TimeWalker's ability to reconstruct and animate avatars across decoupled dimensions with realistic rendering effects, demonstrating a way to achieve personalized 'time traveling' in a breeze.

    https://arxiv.org/abs/2412.02421


    CleanDIFT: Diffusion Features without Noise

    oai:arXiv.org:2412.03439v3

    arXiv:2412.03439v3 Announce Type: replace Abstract: Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

    https://arxiv.org/abs/2412.03439


    Deep priors for satellite image restoration with accurate uncertainties

    oai:arXiv.org:2412.04130v2

    arXiv:2412.04130v2 Announce Type: replace Abstract: Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainties related to this restoration helps to reduce the risks of misinterpreting the image content. Deep learning methods are now state-of-the-art for satellite image restoration. Among them, direct inversion methods train a specific network for each sensor, and generally provide a point estimation of the restored image without the associated uncertainties. Alternatively, deep regularization (DR) methods learn a deep prior on target images before plugging it, as the regularization term, into a model-based optimization scheme. This allows for restoring images from several sensors with a single network and possibly for estimating associated uncertainties. In this paper, we introduce VBLE-xz, a DR method that solves the inverse problem in the latent space of a variational compressive autoencoder (CAE). We adapt the regularization strength by modulating the bitrate of the trained CAE with a training-free approach. Then, VBLE-xz estimates relevant uncertainties jointly in the latent and in the image spaces by sampling an explicit posterior estimated within variational inference. This enables fast posterior sampling, unlike state-of-the-art DR methods that use Markov chains or diffusion-based approaches. We conduct a comprehensive set of experiments on very high-resolution simulated and real Pl\'eiades images, asserting the performance, robustness and scalability of the proposed method. They demonstrate that VBLE-xz represents a compelling alternative to direct inversion methods when uncertainty quantification is required. The code associated to this paper is available in https://github.com/MaudBqrd/VBLExz.

    https://arxiv.org/abs/2412.04130


    KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment

    oai:arXiv.org:2412.04783v5

    arXiv:2412.04783v5 Announce Type: replace Abstract: Wireless sensing has recently found widespread applications in diverse environments, including homes, offices, and public spaces. By analyzing patterns in channel state information (CSI), it is possible to infer human actions for tasks such as person identification, gesture recognition, and fall detection. However, CSI is highly sensitive to environmental changes, where even minor alterations can significantly distort the CSI patterns. This sensitivity often leads to performance degradation or outright failure when applying wireless sensing models trained in one environment to another. To address this challenge, Domain Alignment (DAL) has been widely adopted for cross-domain classification tasks, as it focuses on aligning the global distributions of the source and target domains in feature space. Despite its popularity, DAL often neglects inter-category relationships, which can lead to misalignment between categories across domains, even when global alignment is achieved. To overcome these limitations, we propose K-Nearest Neighbors Maximum Mean Discrepancy (KNN-MMD), a novel few-shot method for cross-domain wireless sensing. Our approach begins by constructing a help set using KNN from the target domain, enabling local alignment between the source and target domains within each category using MMD. Additionally, we address a key instability issue commonly observed in cross-domain methods, where model performance fluctuates sharply between epochs. Further, most existing methods struggle to determine an optimal stopping point during training due to the absence of labeled data from the target domain. Our method resolves this by excluding the support set from the target domain during training and employing it as a validation set to determine the stopping criterion. The dataset and code are publicly available at https://github.com/RS2002/KNN-MMD .

    https://arxiv.org/abs/2412.04783


    How permanent are metadata for research data? Understanding changes in DataCite metadata

    oai:arXiv.org:2412.05128v2

    arXiv:2412.05128v2 Announce Type: replace Abstract: With the move towards open research information, the DOI registration agency DataCite is increasingly used as a source for metadata describing research data, for example to perform scientometric analyses. However, there is a lack of research on how DataCite metadata describing research data are created and maintained. This paper adresses this gap by using DataCite metadata provenance information to analyze the overall prevalence and patterns of change to DataCite metadata records. Metadata change was observed for 12.18 % of metadata records in the sample, and change tends to be incremental and not extensive. DataCite metadata records offer reliable descriptions of datasets and are stable enough to be used in scientometric research. The rate of change differs from previous studies of metadata change in other contexts, suggesting that there are differences in metadata practices between research data repositories and more traditional cataloging environments. The observed changes do not seem to fully align with idealized conceptualizations of metadata creation and maintenance for research data. In particular, the data does not show that metadata records are maintained routinely and continuously. Metadata change also has a limited effect on metadata completeness.

    https://arxiv.org/abs/2412.05128


    Goal-Driven Reasoning in DatalogMTL with Magic Sets

    oai:arXiv.org:2412.07259v4

    arXiv:2412.07259v4 Announce Type: replace Abstract: DatalogMTL is a powerful rule-based language for temporal reasoning. Due to its high expressive power and flexible modeling capabilities, it is suitable for a wide range of applications, including tasks from industrial and financial sectors. However, due to its high computational complexity, practical reasoning in DatalogMTL is highly challenging. To address this difficulty, we introduce a new reasoning method for DatalogMTL which exploits the magic sets technique -- a rewriting approach developed for (non-temporal) Datalog to simulate top-down evaluation with bottom-up reasoning. We have implemented this approach and evaluated it on publicly available benchmarks, showing that the proposed approach significantly and consistently outperformed state-of-the-art reasoning techniques.

    https://arxiv.org/abs/2412.07259


    Codes in algebras of direct products of groups

    oai:arXiv.org:2412.09695v3

    arXiv:2412.09695v3 Announce Type: replace Abstract: In this paper we obtain the Wedderburn-Artin decomposition of a semisimple group algebra associated to a direct product of finite groups. We also provide formulae for the number of all possible group codes, and their dimensions, that can be constructed in a group algebra. As particular cases, we present the complete algebraic description of the group algebra of any direct product of groups whose direct factors are cyclic, dihedral, or generalised quaternion groups.

    https://arxiv.org/abs/2412.09695


    Defending Collaborative Filtering Recommenders via Adversarial Robustness Based Edge Reweighting

    oai:arXiv.org:2412.10850v2

    arXiv:2412.10850v2 Announce Type: replace Abstract: User based collaborative filtering (CF) relies on a user and user similarity graph, making it vulnerable to profile injection (shilling) attacks that manipulate neighborhood relations to promote (push) or demote (nuke) target items. In this work, we propose an adversarial robustness based edge reweighting defense for CF. We first assign each user and user edge a non robustness score via spectral adversarial robustness evaluation, which quantifies the edge sensitivity to adversarial perturbations. We then attenuate the influence of non robust edges by reweighting similarities during prediction. Extensive experiments demonstrate that the proposed method effectively defends against various types of attacks.

    https://arxiv.org/abs/2412.10850


    A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection

    oai:arXiv.org:2412.13801v3

    arXiv:2412.13801v3 Announce Type: replace Abstract: Automated code smell detection faces persistent challenges due to the subjectivity of heuristic rules and the limited performance of traditional ML/DL models. While Large Language Models (LLMs) offer a promising alternative, their adoption is impeded by high fine-tuning costs and a lack of "LM-ready" benchmarks. To bridge these gaps, we present a study with two synergistic contributions. First, we constructed a high-quality benchmark for Complex Conditional, Complex Method, Feature Envy, and Data Class, validated through a rigorous two-stage manual review. Second, leveraging this benchmark, we systematically evaluated four Parameter-Efficient Fine-Tuning (PEFT) methods across nine LMs of varying parameter sizes. Their performance is compared against a comprehensive suite of baselines, including heuristics-based detectors, Deep Learning (DL)-based approaches, and state-of-the-art general-purpose LLMs under multiple In-Context Learning (ICL) settings. Our results demonstrate that PEFT methods achieve effectiveness comparable to or surpassing full fine-tuning while substantially reducing peak GPU memory usage for code smell detection. Furthermore, PEFT-tuned LMs consistently outperform all baselines, yielding MCC improvements ranging from 0.33% to 13.69%, with particularly notable gains for specific smell categories. These findings highlight PEFT techniques as effective and scalable solutions for advancing code smell detection.

    https://arxiv.org/abs/2412.13801


    Navigating AI to Unpack Youth Privacy Concerns: An In-Depth Exploration and Systematic Review

    oai:arXiv.org:2412.16369v2

    arXiv:2412.16369v2 Announce Type: replace Abstract: This systematic literature review investigates perceptions, concerns, and expectations of young digital citizens regarding privacy in artificial intelligence (AI) systems, focusing on social media platforms, educational technology, gaming systems, and recommendation algorithms. Using a rigorous methodology, the review started with 2,000 papers, narrowed down to 552 after initial screening, and finally refined to 108 for detailed analysis. Data extraction focused on privacy concerns, data-sharing practices, the balance between privacy and utility, trust factors in AI, transparency expectations, and strategies to enhance user control over personal data. Findings reveal significant privacy concerns among young users, including a perceived lack of control over personal information, potential misuse of data by AI, and fears of data breaches and unauthorized access. These issues are worsened by unclear data collection practices and insufficient transparency in AI applications. The intention to share data is closely associated with perceived benefits and data protection assurances. The study also highlights the role of parental mediation and the need for comprehensive education on data privacy. Balancing privacy and utility in AI applications is crucial, as young digital citizens value personalized services but remain wary of privacy risks. Trust in AI is significantly influenced by transparency, reliability, predictable behavior, and clear communication about data usage. Strategies to improve user control over personal data include access to and correction of data, clear consent mechanisms, and robust data protection assurances. The review identifies research gaps and suggests future directions, such as longitudinal studies, multicultural comparisons, and the development of ethical AI frameworks. The findings have significant implications for policy development and educational initiatives

    https://arxiv.org/abs/2412.16369


    GeoTexDensifier: Geometry-Texture-Aware Densification for High-Quality Photorealistic 3D Gaussian Splatting

    oai:arXiv.org:2412.16809v2

    arXiv:2412.16809v2 Announce Type: replace Abstract: 3D Gaussian Splatting (3DGS) has recently attracted wide attentions in various areas such as 3D navigation, Virtual Reality (VR) and 3D simulation, due to its photorealistic and efficient rendering performance. High-quality reconstrution of 3DGS relies on sufficient splats and a reasonable distribution of these splats to fit real geometric surface and texture details, which turns out to be a challenging problem. We present GeoTexDensifier, a novel geometry-texture-aware densification strategy to reconstruct high-quality Gaussian splats which better comply with the geometric structure and texture richness of the scene. Specifically, our GeoTexDensifier framework carries out an auxiliary texture-aware densification method to produce a denser distribution of splats in fully textured areas, while keeping sparsity in low-texture regions to maintain the quality of Gaussian point cloud. Meanwhile, a geometry-aware splitting strategy takes depth and normal priors to guide the splitting sampling and filter out the noisy splats whose initial positions are far from the actual geometric surfaces they aim to fit, under a Validation of Depth Ratio Change checking. With the help of relative monocular depth prior, such geometry-aware validation can effectively reduce the influence of scattered Gaussians to the final rendering quality, especially in regions with weak textures or without sufficient training views. The texture-aware densification and geometry-aware splitting strategies are fully combined to obtain a set of high-quality Gaussian splats. We experiment our GeoTexDensifier framework on various datasets and compare our Novel View Synthesis results to other state-of-the-art 3DGS approaches, with detailed quantitative and qualitative evaluations to demonstrate the effectiveness of our method in producing more photorealistic 3DGS models.

    https://arxiv.org/abs/2412.16809


    Establishing Reality-Virtuality Interconnections in Urban Digital Twins for Superior Intelligent Road Inspection and Simulation

    oai:arXiv.org:2412.17699v2

    arXiv:2412.17699v2 Announce Type: replace Abstract: Road inspection is crucial for maintaining road serviceability and ensuring traffic safety, as road defects gradually develop and compromise functionality. Traditional inspection methods, which rely on manual evaluations, are labor-intensive, costly, and time-consuming. While data-driven approaches are gaining traction, the scarcity and spatial sparsity of real-world road defects present significant challenges in acquiring high-quality datasets. Existing simulators designed to generate detailed synthetic driving scenes, however, lack models for road defects. Moreover, advanced driving tasks that involve interactions with road surfaces, such as planning and control in defective areas, remain underexplored. To address these limitations, we propose a multi-modal sensor platform integrated with an urban digital twin (UDT) system for intelligent road inspection. First, hierarchical road models are constructed from real-world driving data collected using vehicle-mounted sensors, resulting in highly detailed representations of road defect structures and surface elevations. Next, digital road twins are generated to create simulation environments for comprehensive analysis and evaluation of algorithm performance. These scenarios are then imported into a simulator to facilitate both data acquisition and physical simulation. Experimental results demonstrate that driving tasks, including perception and decision-making, benefit significantly from the high-fidelity road defect scenes generated by our system.

    https://arxiv.org/abs/2412.17699


    DPBridge: Latent Diffusion Bridge for Dense Prediction

    oai:arXiv.org:2412.20506v4

    arXiv:2412.20506v4 Announce Type: replace Abstract: Diffusion models demonstrate remarkable capabilities in capturing complex data distributions and have achieved compelling results in many generative tasks. While they have recently been extended to dense prediction tasks such as depth estimation and surface normal prediction, their full potential in this area remains underexplored. As target signal maps and input images are pixel-wise aligned, the conventional noise-to-data generation paradigm is inefficient, and input images can serve as a more informative prior compared to pure noise. Diffusion bridge models, which support data-to-data generation between two general data distributions, offer a promising alternative, but they typically fail to exploit the rich visual priors embedded in large pretrained foundation models. To address these limitations, we integrate diffusion bridge formulation with structured visual priors and introduce DPBridge, the first latent diffusion bridge framework for dense prediction tasks. To resolve the incompatibility between diffusion bridge models and pretrained diffusion backbones, we propose (1) a tractable reverse transition kernel for the diffusion bridge process, enabling maximum likelihood training scheme; (2) finetuning strategies including distribution-aligned normalization and image consistency loss. Experiments across extensive benchmarks validate that our method consistently achieves superior performance, demonstrating its effectiveness and generalization capability under different scenarios.

    https://arxiv.org/abs/2412.20506


    STITCHER: Real-Time Trajectory Planning with Motion Primitive Search

    oai:arXiv.org:2412.21180v3

    arXiv:2412.21180v3 Announce Type: replace Abstract: Autonomous high-speed navigation through large, complex environments requires real-time generation of agile trajectories that are dynamically feasible, collision-free, and satisfy constraints. Most modern trajectory planning techniques rely on numerical optimization because high-quality, expressive trajectories that satisfy constraints can be systematically computed. However, strict requirements on computation time and the risk of numerical instability can limit the use of optimization-based planners in safety-critical situations. This work presents an optimization-free planning framework called STITCHER that leverages graph search to generate long-range trajectories by stitching short trajectory segments together in real time. STITCHER is shown to outperform modern optimization-based planners through its innovative planning architecture and several algorithmic developments that make real-time planning possible. Simulation results show safe trajectories through complex environments can be generated in milliseconds that cover tens of meters.

    https://arxiv.org/abs/2412.21180


    FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models

    oai:arXiv.org:2501.02968v5

    arXiv:2501.02968v5 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) enriches LLMs by dynamically retrieving external knowledge, reducing hallucinations and satisfying real-time information needs. While existing research mainly targets RAG's performance and efficiency, emerging studies highlight critical security concerns. Yet, current adversarial approaches remain limited, mostly addressing white-box scenarios or heuristic black-box attacks without fully investigating vulnerabilities in the retrieval phase. Additionally, prior works mainly focus on factoid Q&A tasks, their attacks lack complexity and can be easily corrected by advanced LLMs. In this paper, we investigate a more realistic and critical threat scenario: adversarial attacks intended for opinion manipulation against black-box RAG models, particularly on controversial topics. Specifically, we propose FlippedRAG, a transfer-based adversarial attack against black-box RAG systems. We first demonstrate that the underlying retriever of a black-box RAG system can be reverse-engineered, enabling us to train a surrogate retriever. Leveraging the surrogate retriever, we further craft target poisoning triggers, altering vary few documents to effectively manipulate both retrieval and subsequent generation. Extensive empirical results show that FlippedRAG substantially outperforms baseline methods, improving the average attack success rate by 16.7%. FlippedRAG achieves on average a 50% directional shift in the opinion polarity of RAG-generated responses, ultimately causing a notable 20% shift in user cognition. Furthermore, we evaluate the performance of several potential defensive measures, concluding that existing mitigation strategies remain insufficient against such sophisticated manipulation attacks. These results highlight an urgent need for developing innovative defensive solutions to ensure the security and trustworthiness of RAG systems.

    https://arxiv.org/abs/2501.02968


    A Dichotomy Theorem for Ordinal Ranks in MSO

    oai:arXiv.org:2501.05385v2

    arXiv:2501.05385v2 Announce Type: replace Abstract: We focus on formulae $\exists X.\, \varphi(\vec{Y}, X)$ of monadic second-order logic over the full binary tree, such that the witness $X$ is a well-founded set. The ordinal rank $\mathrm{rank}(X) < \omega_1$ of such a set $X$ measures its depth and branching structure. We search for the least upper bound for these ranks, and discover the following dichotomy depending on the formula $\varphi$. Let $\mathrm{rank}(\varphi)$ be the minimal ordinal such that, whenever an instance $\vec{Y}$ satisfies the formula, there is a witness $X$ with $\mathrm{rank}(X) \leq \mathrm{rank}(\varphi)$. Then $\mathrm{rank}(\varphi)$ is either strictly smaller than $\omega^2$ or it reaches the maximal possible value $\omega_1$. Moreover, it is decidable which of the cases holds. The result has potential for applications in a variety of ordinal-related problems, in particular it entails a result about the closure ordinal of a fixed-point formula.

    https://arxiv.org/abs/2501.05385


    Randomized Block Low-Rank Matrix Compression by Tagging

    oai:arXiv.org:2501.05528v2

    arXiv:2501.05528v2 Announce Type: replace Abstract: In this work, we present randomized compression algorithms for flat rank-structured matrices with shared bases, termed uniform Block Low-Rank (BLR) matrices. Our main contribution is a technique called tagging, which improves upon the efficiency of existing algorithms for basis matrix computation while preserving accuracy. Tagging operates on the matrix using matrix-vector products of the matrix and its adjoint, making it suitable for black-box environments where accessing individual matrix entries is computationally expensive or infeasible. We show tagging requires a constant number of matrix-vector products coupled with linear post-processing; crucially, the asymptotic pre-factors in tagging depend only on the rank parameter and the underlying problem geometry. We also establish a theoretical connection between the optimal construction of tagging matrices and projective varieties in algebraic geometry, suggesting a hybrid numeric-symbolic avenue of future work. To validate our approach, we apply tagging to compress uniform BLR matrices arising from the discretization of integral and partial differential equations. Empirical results show that tagging outperforms alternative compression techniques, significantly reducing both the number of required matrix-vector products and overall computational time. These findings highlight the practicality and scalability of tagging as an efficient method for flat rank-structured matrices in scientific computing.

    https://arxiv.org/abs/2501.05528


    Neural equilibria for long-term prediction of nonlinear conservation laws

    oai:arXiv.org:2501.06933v2

    arXiv:2501.06933v2 Announce Type: replace Abstract: We introduce Neural Discrete Equilibrium (NeurDE), a machine learning framework for stable and accurate long-term forecasting of nonlinear conservation laws. NeurDE leverages a kinetic lifting that decomposes the dynamics into a fixed linear transport component and a local nonlinear relaxation to equilibrium. This structure provides a natural and principled interface between physics, numerical methods, and machine learning methodologies, enabling NeurDE to be viewed as a ``neural twin'' to Boltzmann-BGK. The transport step can be implemented efficiently in solvers such as lattice Boltzmann (LB), while the equilibrium is modeled by a neural network that maps macroscopic observables to a discrete equilibrium distribution. When integrated into a LB solver, the transport step becomes an efficient lattice streaming operation, and NeurDE yields a hybrid algorithm that robustly captures shock propagation and complex compressible dynamics over long time horizons. The NeurDE method is highly data-efficient: a small network trained on limited data generalizes far beyond the training regime, resolving shocks that evolve well outside the initial training distribution. Unlike traditional kinetic solvers, NeurDE achieves this accuracy without costly root-finding procedures or large velocity lattices. These results establish NeurDE as a scalable, efficient, and physics-informed paradigm for learning-based simulation of nonlinear conservation laws

    https://arxiv.org/abs/2501.06933


    Profile and neighbourhood complexity of graphs excluding a minor and tree-structured graphs

    oai:arXiv.org:2501.08895v3

    arXiv:2501.08895v3 Announce Type: replace Abstract: The \emph{$r$-neighbourhood complexity} of a graph $G$ is the function counting, for a given integer $k$, the largest possible number, over all vertex-subsets $A$ of size $k$, of subsets of $A$ realized as the intersection between the $r$-neighbourhood of some vertex and $A$. A~refinement of this notion is the \emph{$r$-profile complexity}, that counts the maximum number of distinct distance-vectors from any vertex to the vertices of $A$, ignoring distances larger than~$r$. Typically, in structured graph classes such as graphs of bounded VC-dimension or chordal graphs, these functions are bounded, leading to insights into their structural properties and efficient algorithms. We improve existing bounds on the $r$-profile complexity (and thus on the $r$-neighbourhood complexity) for graphs in several structured graph classes. We show that the $r$-profile complexity of graphs excluding $K_h$ as a minor is in $O_h(r^{3h-3}k)$. For graphs of treewidth at most~$t$, we give a bound in $O_t(r^{t+1}k)$, which is tight up to a function of~$t$ as a factor. These bounds improve results of Joret and Rambaud and answer a question of their paper [Combinatorica, 2024]. We also apply our methods to other classes of bounded expansion such as graphs excluding a fixed complete graph as a subdivision. For outerplanar graphs, we can improve our treewidth bound by a factor of $r$ and conjecture that a similar improvement holds for graphs with bounded simple treewidth. For graphs of treelength at most~$\ell$, we give the upper bound of $O(k(r^2(\ell+1)^k))$, which we improve to $O\left (k\cdot (r 2^k + r^2k^2) \right)$ in the case of chordal graphs and $O(k^2r)$ for interval graphs. Our bounds also imply relations between the order, diameter and metric dimension of graphs in these classes, improving results from [Beaudou et al., SIDMA 2017].

    https://arxiv.org/abs/2501.08895


    Taint Analysis for Graph APIs Focusing on Broken Access Control

    oai:arXiv.org:2501.08947v4

    arXiv:2501.08947v4 Announce Type: replace Abstract: We present the first systematic approach to static and dynamic taint analysis for Graph APIs focusing on broken access control. The approach comprises the following. We taint nodes of the Graph API if they represent data requiring specific privileges in order to be retrieved or manipulated, and identify API calls which are related to sources and sinks. Then, we statically analyze whether a tainted information flow between API source and sink calls occurs. To this end, we model the API calls using graph transformation rules. We subsequently use Critical Pair Analysis to automatically analyze potential dependencies between rules representing source calls and rules representing sink calls. We distinguish direct from indirect tainted information flow and argue under which conditions the Critical Pair Analysis is able to detect not only direct, but also indirect tainted flow. The static taint analysis (i) identifies flows that need to be further reviewed, since tainted nodes may be created by an API call and used or manipulated by another API call later without having the necessary privileges, and (ii) can be used to systematically design dynamic security tests for broken access control. The dynamic taint analysis checks if potential broken access control risks detected during the static taint analysis really occur. We apply the approach to a part of the GitHub GraphQL API. The application illustrates that our analysis supports the detection of two types of broken access control systematically: the case where users of the API may not be able to access or manipulate information, although they should be able to do so; and the case where users (or attackers) of the API may be able to access/manipulate information that they should not.

    https://arxiv.org/abs/2501.08947


    Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

    oai:arXiv.org:2501.10100v5

    arXiv:2501.10100v5 Announce Type: replace Abstract: Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.

    https://arxiv.org/abs/2501.10100


    State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

    oai:arXiv.org:2501.13241v2

    arXiv:2501.13241v2 Announce Type: replace Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on successful trajectory generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

    https://arxiv.org/abs/2501.13241


    Semantic-Anchored, Class Variance-Optimized Clustering for Robust Semi-Supervised Few-Shot Learning

    oai:arXiv.org:2501.14401v3

    arXiv:2501.14401v3 Announce Type: replace Abstract: Few-shot learning has been extensively explored to address problems where the amount of labeled samples is very limited for some classes. In the semi-supervised few-shot learning setting, substantial quantities of unlabeled samples are available. Such unlabeled samples are generally cheaper to obtain and can be used to improve the few-shot learning performance of the model. Some of the recent methods for this setting rely on clustering to generate pseudo-labels for the unlabeled samples. Since the effectiveness of clustering heavily influences the labeling of the unlabeled samples, it can significantly affect the few-shot learning performance. In this paper, we focus on improving the representation learned by the model in order to improve the clustering and, consequently, the model performance. We propose an approach for semi-supervised few-shot learning that performs a class-variance optimized clustering coupled with a cluster separation tuner in order to improve the effectiveness of clustering the labeled and unlabeled samples in this setting. It also optimizes the clustering-based pseudo-labeling process using a restricted pseudo-labeling approach and performs semantic information injection in order to improve the semi-supervised few-shot learning performance of the model. We experimentally demonstrate that our proposed approach significantly outperforms recent state-of-the-art methods on the benchmark datasets. To further establish its robustness, we conduct extensive experiments under challenging conditions, showing that the model generalizes well to domain shifts and achieves new state-of-the-art performance in open-set settings with distractor classes, highlighting its effectiveness for real-world applications.

    https://arxiv.org/abs/2501.14401


    Beyond Benchmarks: On The False Promise of AI Regulation

    oai:arXiv.org:2501.15693v2

    arXiv:2501.15693v2 Announce Type: replace Abstract: The performance of AI models on safety benchmarks does not indicate their real-world performance after deployment. This opaqueness of AI models impedes existing regulatory frameworks constituted on benchmark performance, leaving them incapable of mitigating ongoing real-world harm. The problem stems from a fundamental challenge in AI interpretability, which seems to be overlooked by regulators and decision makers. We propose a simple, realistic and readily usable regulatory framework which does not rely on benchmarks, and call for interdisciplinary collaboration to find new ways to address this crucial problem.

    https://arxiv.org/abs/2501.15693


    Random Reshuffling for Stochastic Gradient Langevin Dynamics

    oai:arXiv.org:2501.16055v2

    arXiv:2501.16055v2 Announce Type: replace Abstract: We examine the use of different randomisation policies for stochastic gradient algorithms used in sampling, based on first-order (or overdamped) Langevin dynamics, the most popular of which is known as Stochastic Gradient Langevin Dynamics. Conventionally, this algorithm is combined with a specific stochastic gradient strategy, called Robbins-Monro. In this work, we study an alternative strategy, Random Reshuffling, and show convincingly that it leads to improved performance via: a) a proof of reduced bias in the Wasserstein metric for strongly convex, gradient Lipschitz potentials; b) an analytical demonstration of reduced bias for a Gaussian model problem; and c) an empirical demonstration of reduced bias in numerical experiments for some logistic regression problems. This is especially important since Random Reshuffling is typically more efficient due to memory access and cache reasons. Such acceleration for the Random Reshuffling policy is familiar from the optimisation literature on stochastic gradient descent.

    https://arxiv.org/abs/2501.16055


    Automatic Calibration of a Multi-Camera System with Limited Overlapping Fields of View for 3D Surgical Scene Reconstruction

    oai:arXiv.org:2501.16221v3

    arXiv:2501.16221v3 Announce Type: replace Abstract: The purpose of this study is to develop an automated and accurate external camera calibration method for multi-camera systems used in 3D surgical scene reconstruction (3D-SSR), eliminating the need for operator intervention or specialized expertise. The method specifically addresses the problem of limited overlapping fields of view caused by significant variations in optical zoom levels and camera locations. We contribute a novel, fast, and fully automatic calibration method based on the projection of multi-scale markers (MSMs) using a ceiling-mounted projector. MSMs consist of 2D patterns projected at varying scales, ensuring accurate extraction of well distributed point correspondences across significantly different viewpoints and zoom levels. Validation is performed using both synthetic and real data captured in a mock-up OR, with comparisons to traditional manual marker-based methods as well as markerless calibration methods. The method achieves accuracy comparable to manual, operator-dependent calibration methods while exhibiting higher robustness under conditions of significant differences in zoom levels. Additionally, we show that state-of-the-art Structure-from-Motion (SfM) pipelines are ineffective in 3D-SSR settings, even when additional texture is projected onto the OR floor. The use of a ceiling-mounted entry-level projector proves to be an effective alternative to operator-dependent, traditional marker-based methods, paving the way for fully automated 3D-SSR.

    https://arxiv.org/abs/2501.16221


    Improving patch selection for monolithic multigrid solvers for high-order {T}aylor-{H}ood discretizations

    oai:arXiv.org:2502.01130v2

    arXiv:2502.01130v2 Announce Type: replace Abstract: Numerical simulation of incompressible fluid flows has been an active topic of research in Scientific Computing for many years, with many contributions to both discretizations and linear and nonlinear solvers. In this work, we propose an improved relaxation scheme for higher-order Taylor-Hood discretizations of the incompressible Stokes and Navier-Stokes equations, demonstrating its efficiency within monolithic multigrid preconditioners for the linear(ized) equations. The key to this improvement is an improved patch construction for Vanka-style relaxation introducing, for the first time, overlap in the pressure degrees of freedom within the patches. Numerical results demonstrate significant improvement in both multigrid iterations and time-to-solution for the linear Stokes case, on both triangular and quadrilateral meshes. For the nonlinear Navier-Stokes case, we show similar improvements, including in the number of nonlinear iterations needed in an inexact Newton method. These improvements enable three-dimensional numerical simulations, which also presented.

    https://arxiv.org/abs/2502.01130


    Extending the Applicability of Bloom Filters by Relaxing their Parameter Constraints

    oai:arXiv.org:2502.02193v3

    arXiv:2502.02193v3 Announce Type: replace Abstract: These days, Key-Value Stores are widely used for scalable data storage. In this environment, Bloom filter (BF) serves as an efficient probabilistic data structure for representing sets of keys. They allow for set membership queries with no false negatives and with the right choice of the main parameters - length of the BF, number of hash functions used to map an element to the array's indices, and the number of elements inserted - the false positive rate is optimized. However, the number of hash functions is constrained to integer values, and the length of a BF is usually chosen to be a power of two to allow for efficient modulo operations using binary arithmetic. In this paper, we relax these constraints by proposing the Rational Bloom filter, which allows for non-integer numbers of hash functions. This results in optimized fraction-of-zero values for a known number of elements to be inserted. Based on this, we construct the Variably-Sized Block BF to allow for a flexible filter length, especially for large filters, with efficient computation.

    https://arxiv.org/abs/2502.02193


    Bilevel ZOFO: Efficient LLM Fine-Tuning and Meta-Training

    oai:arXiv.org:2502.03604v3

    arXiv:2502.03604v3 Announce Type: replace Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning (PEFT) methods address these by freezing most model parameters and training only a small subset. However, PEFT often underperforms compared to full fine-tuning when high task-specific accuracy is required. Zeroth-Order (ZO) methods fine-tune the entire pre-trained model without back-propagation, estimating gradients through forward passes only. While memory-efficient, ZO methods suffer from slow convergence and high sensitivity to prompt selection. We bridge these two worlds with Bilevel-ZOFO, a bilevel optimization method that couples fast, local FO-PEFT adaptation at the inner level with stable, memory-efficient ZO updates of the full backbone at the outer level. The FO-PEFT inner loop performs fast, low-memory local adaptation that reduces the variance of ZO estimates and stabilizes the search, guiding the outer ZO updates of the full backbone and reducing prompt sensitivity. In the mean time, the outer ZO provides better generalization ability for PEFT. We provide theoretical convergence guarantees and empirically demonstrate that Bilevel-ZOFO significantly outperforms existing ZO and FO-PEFT methods, achieving 2-4 times faster training while maintaining similar memory efficiency. Additionally, we show by updating the backbone with ZO and adapting only a tiny FO-PEFT block per task, Bilevel-ZOFO combines full-model capacity with few-shot efficiency, making it a very efficient meta-learning algorithm that quickly adapts to new tasks.

    https://arxiv.org/abs/2502.03604


    A Parameterized Study of Secluded Structures in Directed Graphs

    oai:arXiv.org:2502.06048v2

    arXiv:2502.06048v2 Announce Type: replace Abstract: Given an undirected graph $G$ and an integer $k$, the Secluded $\Pi$-Subgraph problem asks you to find a maximum size induced subgraph that satisfies a property $\Pi$ and has at most $k$ neighbors in the rest of the graph. This problem has been extensively studied; however, there is no prior study of the problem in directed graphs. This question has been mentioned by Jansen et al. [ISAAC'23]. In this paper, we initiate the study of Secluded Subgraph problem in directed graphs by incorporating different notions of neighborhoods: in-neighborhood, out-neighborhood, and their union. Formally, we call these problems $\{\text{In}, \text{Out}, \text{Total}\}$-Secluded $\Pi$-Subgraph, where given a directed graph $G$ and integers $k$, we want to find an induced subgraph satisfying $\Pi$ of maximum size that has at most $k$ in/out/total-neighbors in the rest of the graph, respectively. We investigate the parameterized complexity of these problems for different properties $\Pi$. In particular, we prove the following parameterized results: - We design an FPT algorithm for the Total-Secluded Strongly Connected Subgraph problem when parameterized by $k$. - We show that the In/Out-Secluded $\mathcal{F}$-Free Subgraph problem with parameter $k+w$ is W[1]-hard, where $\mathcal{F}$ is a family of directed graphs except any subgraph of a star graph whose edges are directed towards the center. This result also implies that In/Out-Secluded DAG is W[1]-hard, unlike the undirected variants of the two problems, which are FPT. - We design an FPT-algorithm for In/Out/Total-Secluded $\alpha$-Bounded Subgraph when parameterized by $k$, where $\alpha$-bounded graphs are a superclass of tournaments. - For undirected graphs, we improve the best-known FPT algorithm for Secluded Clique by providing a faster FPT algorithm that runs in time $1.6181^kn^{\mathcal{O}(1)}$.

    https://arxiv.org/abs/2502.06048


    CT-UIO: Continuous-Time UWB-Inertial-Odometer Localization Using Non-Uniform B-spline with Fewer Anchors

    oai:arXiv.org:2502.06287v2

    arXiv:2502.06287v2 Announce Type: replace Abstract: Ultra-wideband (UWB) based positioning with fewer anchors has attracted significant research interest in recent years, especially under energy-constrained conditions. However, most existing methods rely on discrete-time representations and smoothness priors to infer a robot's motion states, which often struggle with ensuring multi-sensor data synchronization. In this article, we present a continuous-time UWB-Inertial-Odometer localization system (CT-UIO), utilizing a non-uniform B-spline framework with fewer anchors. Unlike traditional uniform B-spline-based continuous-time methods, we introduce an adaptive knot-span adjustment strategy for non-uniform continuous-time trajectory representation. This is accomplished by adjusting control points dynamically based on movement speed. To enable efficient fusion of {inertial measurement unit (IMU) and odometer data, we propose an improved extended Kalman filter (EKF) with innovation-based adaptive estimation to provide short-term accurate motion prior. Furthermore, to address the challenge of achieving a fully observable UWB localization system under few-anchor conditions, the virtual anchor (VA) generation method based on multiple hypotheses is proposed. At the backend, we propose an adaptive sliding window strategy for global trajectory estimation. Comprehensive experiments are conducted on three self-collected datasets with different UWB anchor numbers and motion modes. The result shows that the proposed CT-UIO achieves 0.403m, 0.150m, and 0.189m localization accuracy in corridor, exhibition hall, and office environments, yielding 17.2%, 26.1%, and 15.2% improvements compared with competing state-of-the-art UIO systems, respectively. The codebase and datasets of this work will be open-sourced at https://github.com/JasonSun623/CT-UIO.

    https://arxiv.org/abs/2502.06287


    High-order BUG dynamical low-rank integrators based on explicit Runge--Kutta methods

    oai:arXiv.org:2502.07040v4

    arXiv:2502.07040v4 Announce Type: replace Abstract: In this work, we introduce high-order Basis-Update & Galerkin (BUG) integrators based on explicit Runge-Kutta methods for large-scale matrix differential equations. These dynamical low-rank integrators extend the BUG integrator to arbitrary explicit Runge-Kutta schemes by performing a BUG step at each stage of the method. The resulting Runge-Kutta BUG (RK-BUG) integrators are robust with respect to small singular values, fully forward in time, and high-order accurate, while enabling conservation and rank adaptivity. We prove that RK-BUG integrators retain the order of convergence of the underlying Runge-Kutta method until the error reaches a plateau corresponding to the low-rank truncation error, which vanishes as the rank becomes full. This theoretical analysis is supported by several numerical experiments. The results demonstrate the high-order convergence of the RK-BUG integrator and its superior accuracy compared to other existing dynamical low-rank integrators.

    https://arxiv.org/abs/2502.07040


    Enhancing Physics-Informed Neural Networks Through Feature Engineering

    oai:arXiv.org:2502.07209v3

    arXiv:2502.07209v3 Announce Type: replace Abstract: Physics-Informed Neural Networks (PINNs) seek to solve partial differential equations (PDEs) with deep learning. Mainstream approaches that deploy fully-connected multi-layer deep learning architectures require prolonged training to achieve even moderate accuracy, while recent work on feature engineering allows higher accuracy and faster convergence. This paper introduces SAFE-NET, a Single-layered Adaptive Feature Engineering NETwork that achieves orders-of-magnitude lower errors with far fewer parameters than baseline feature engineering methods. SAFE-NET returns to basic ideas in machine learning, using Fourier features, a simplified single hidden layer network architecture, and an effective optimizer that improves the conditioning of the PINN optimization problem. Numerical results show that SAFE-NET converges faster and typically outperforms deeper networks and more complex architectures. It consistently uses fewer parameters -- on average, 65% fewer than the competing feature engineering methods -- while achieving comparable accuracy in less than 30% of the training epochs. Moreover, each SAFE-NET epoch is 95% faster than those of competing feature engineering approaches. These findings challenge the prevailing belief that modern PINNs effectively learn features in these scientific applications and highlight the efficiency gains possible through feature engineering.

    https://arxiv.org/abs/2502.07209


    Channel Dependence, Limited Lookback Windows, and the Simplicity of Datasets: How Biased is Time Series Forecasting?

    oai:arXiv.org:2502.09683v2

    arXiv:2502.09683v2 Announce Type: replace Abstract: In Time Series Forecasting (TSF), the lookback window (the length of historical data used for prediction) is a critical hyperparameter that is often set arbitrarily, undermining the validity of model evaluations. We argue that the lookback window must be tuned on a per-task basis to ensure fair comparisons. Our empirical results show that failing to do so can invert performance rankings, particularly when comparing univariate and multivariate methods. Experiments on standard benchmarks reposition Channel-Independent (CI) models, such as PatchTST, as state-of-the-art methods. However, we reveal this superior performance is largely an artifact of weak inter-channel correlations and simplicity of patterns within these specific datasets. Using Granger causality analysis and ODE datasets (with implicit channel correlations), we demonstrate that the true strength of multivariate Channel-Dependent (CD) models emerges on datasets with strong, inherent cross-channel dependencies, where they significantly outperform CI models. We conclude with three key recommendations for improving TSF research: (i) treat the lookback window as a critical hyperparameter to be tuned, (ii) use statistical analysis of datasets to inform the choice between CI and CD architectures, and (iii) favor CD models in applications with limited data.

    https://arxiv.org/abs/2502.09683


    Human-Like Robot Impedance Regulation Skill Learning from Human-Human Demonstrations

    oai:arXiv.org:2502.13707v2

    arXiv:2502.13707v2 Announce Type: replace Abstract: Humans are experts in physical collaboration by leveraging cognitive abilities such as perception, reasoning, and decision-making to regulate compliance behaviors based on their partners' states and task requirements. Equipping robots with similar cognitive-inspired collaboration skills can significantly enhance the efficiency and adaptability of human-robot collaboration (HRC). This paper introduces an innovative HumanInspired Impedance Regulation Skill Learning framework (HIImpRSL) for robotic systems to achieve leader-follower and mutual adaptation in multiple physical collaborative tasks. The proposed framework enables the robot to adapt its compliance based on human states and reference trajectories derived from human-human demonstrations. By integrating electromyography (EMG) signals and motion data, we extract endpoint impedance profiles and reference trajectories to construct a joint representation via imitation learning. An LSTM-based module then learns task-oriented impedance regulation policies, which are implemented through a whole-body impedance controller for online impedance adaptation. Experimental validation was conducted through collaborative transportation, two interactive Tai Chi pushing hands, and collaborative sawing tasks with multiple human subjects, demonstrating the ability of our framework to achieve human-like collaboration skills and the superior performance from the perspective of interactive forces compared to four other related methods.

    https://arxiv.org/abs/2502.13707


    An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

    oai:arXiv.org:2502.13764v4

    arXiv:2502.13764v4 Announce Type: replace Abstract: Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

    https://arxiv.org/abs/2502.13764


    Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

    oai:arXiv.org:2502.14829v4

    arXiv:2502.14829v4 Announce Type: replace Abstract: When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models' parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model's prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models' prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.

    https://arxiv.org/abs/2502.14829


    In-context Learning of Evolving Data Streams with Tabular Foundational Models

    oai:arXiv.org:2502.16840v2

    arXiv:2502.16840v2 Announce Type: replace Abstract: State-of-the-art data stream mining has long drawn from ensembles of the Very Fast Decision Tree, a seminal algorithm honored with the 2015 KDD Test-of-Time Award. However, the emergence of large tabular models, i.e., transformers designed for structured numerical data, marks a significant paradigm shift. These models move beyond traditional weight updates, instead employing in-context learning through prompt tuning. By using on-the-fly sketches to summarize unbounded streaming data, one can feed this information into a pre-trained model for efficient processing. This work bridges advancements from both areas, highlighting how transformers' implicit meta-learning abilities, pre-training on drifting natural data, and reliance on context optimization directly address the core challenges of adaptive learning in dynamic environments. Exploring real-time model adaptation, this research demonstrates that TabPFN, coupled with a simple sliding memory strategy, consistently outperforms ensembles of Hoeffding trees, such as Adaptive Random Forest, and Streaming Random Patches, across all non-stationary benchmarks.

    https://arxiv.org/abs/2502.16840


    An Improved Pure Fully Connected Neural Network for Rice Grain Classification

    oai:arXiv.org:2503.03111v4

    arXiv:2503.03111v4 Announce Type: replace Abstract: Rice is a staple food for a significant portion of the world's population, providing essential nutrients and serving as a versatile in-gredient in a wide range of culinary traditions. Recently, the use of deep learning has enabled automated classification of rice, im-proving accuracy and efficiency. However, classical models based on first-stage training may face difficulties in distinguishing between rice varieties with similar external characteristics, thus leading to misclassifications. Considering the transparency and feasibility of model, we selected and gradually improved pure fully connected neural network to achieve classification of rice grain. The dataset we used contains both global and domestic rice images obtained from websites and laboratories respectively. First, the training mode was changed from one-stage training to two-stage training, which significantly contributes to distinguishing two similar types of rice. Secondly, the preprocessing method was changed from random tilting to horizontal or vertical position cor-rection. After those two enhancements, the accuracy of our model increased notably from 97% to 99%. In summary, two subtle methods proposed in this study can remarkably enhance the classification ability of deep learning models in terms of the classification of rice grain.

    https://arxiv.org/abs/2503.03111


    Active 6D Pose Estimation for Textureless Objects using Multi-View RGB Frames

    oai:arXiv.org:2503.03726v2

    arXiv:2503.03726v2 Announce Type: replace Abstract: Estimating the 6D pose of textureless objects from RGB images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view based 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that addresses these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a two-step sequential process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI and TOD datasets, as well as on our reconstructed transparent object dataset, T-ROBI. Under the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our approach achieves high pose accuracy with fewer viewpoints than heuristic-based policies across all evaluated datasets. The accompanying video and T-ROBI dataset will be released on our project page: https://trailab.github.io/ActiveODPE.

    https://arxiv.org/abs/2503.03726


    Human Mobility Reimagined: Digital Twin Intelligence for Adaptive Campus Course Timetabling

    oai:arXiv.org:2503.06109v2

    arXiv:2503.06109v2 Announce Type: replace Abstract: Daily operations in large campuses depend on how efficiently people \emph{move} through space and time. In this sense, course timetables are more than administrative schedules: they act as mobility policies that orchestrate thousands of trajectories, shaping travel burden, congestion, accessibility, and the reliability of back-to-back transitions. Designing timetables that are both feasible and mobility-friendly is challenging because hard constraints including capacity, conflicts, feasibility must be satisfied alongside soft constraints including preferences, satisfaction, coordination, all under dynamic conditions such as real-time disruptions and evolving demand. Traditional static optimization methods often struggle to capture these human mobility impacts and to adapt when campus conditions change. This paper reconceptualizes course timetabling as a recommendation-based task and leverages the Texas A\&M Campus Digital Twin as a dynamic data platform to evaluate mobility consequences at scale. We propose an iterative framework that integrates collaborative and content-based filtering with feedback-driven refinement to generate ranked sets of adaptive timetable recommendations. A mobility-aware composite scoring function combining classroom occupancy, travel distance, travel time, and vertical transitions systematically balances resource efficiency with human-centered movement costs. Extensive experiments using real-world data from Texas A\&M University show that the proposed approach reduces mobility friction and travel inefficiencies, improves classroom utilization, and enhances overall user satisfaction. By coupling recommendation-oriented decision-making with digital twin intelligence, this study provides a robust and scalable blueprint for mobility-centered campus planning and resource allocation, with potential extensions to broader urban systems.

    https://arxiv.org/abs/2503.06109


    CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing

    oai:arXiv.org:2503.06940v2

    arXiv:2503.06940v2 Announce Type: replace Abstract: Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information, for instance, sound strongly influences how we perceive visual scenes. To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of \textit{The Big Bang Theory} episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception. Project Page: https://jianxgao.github.io/CineBrain.

    https://arxiv.org/abs/2503.06940


    Relational Anatomical Supervision for Accurate 3D Multi-Chamber Cardiac Mesh Reconstruction

    oai:arXiv.org:2503.07874v2

    arXiv:2503.07874v2 Announce Type: replace Abstract: Accurate reconstruction of multi-chamber cardiac anatomy from medical images is a cornerstone for patient-specific modeling, physiological simulation, and interventional planning. However, current reconstruction pipelines fundamentally rely on surface-wise geometric supervision and model each chamber in isolation, resulting in anatomically implausible inter-chamber violations despite apparently favorable overlap or distance metrics. In this work, we propose a relational anatomical supervision framework for multi-chamber cardiac mesh reconstruction by introducing a Mesh Interrelation Enhancement (MIE) loss. The proposed formulation explicitly encodes spatial relationships between cardiac structures into a differentiable occupancy-based objective, thereby transforming qualitative anatomical rules into quantitative geometric supervision. We further establish violation-aware evaluation metrics to directly quantify inter-chamber structural correctness, revealing systematic limitations of commonly used geometric measures such as Dice and Chamfer distance. Extensive experiments on multi-center CT data, densely sampled MR data, and two independent external cohorts, including a highly heterogeneous congenital heart disease population, demonstrate that the proposed method consistently suppresses clinically critical boundary violations by up to 83\%, while maintaining competitive volumetric accuracy and achieving superior surface fidelity. Notably, the proposed relational supervision generalizes robustly across imaging modalities, centers, and pathological conditions, even under severe anatomical deformation. These results demonstrate that distance-based supervision alone is insufficient to guarantee anatomically faithful reconstruction, and that explicit enforcement of multi-structure anatomical relations provides a principled and robust pathway toward reliable patient-specific cardiac modeling.

    https://arxiv.org/abs/2503.07874


    A Deep-Learning Iterative Stacked Approach for Prediction of Reactive Dissolution in Porous Media

    oai:arXiv.org:2503.08410v2

    arXiv:2503.08410v2 Announce Type: replace Abstract: Simulating reactive dissolution of solid minerals in porous media has many subsurface applications, including carbon capture and storage (CCS), geothermal systems and oil & gas recovery. As traditional direct numerical simulators are computationally expensive, it is of paramount importance to develop faster and more efficient alternatives. Deep-learning-based solutions, most of them built upon convolutional neural networks (CNNs), have been recently designed to tackle this problem. However, these solutions were limited to approximating one field over the domain (e.g. velocity field). In this manuscript, we present a novel deep learning approach that incorporates both temporal and spatial information to predict the future states of the dissolution process at a fixed time-step horizon, given a sequence of input states. The overall performance, in terms of speed and prediction accuracy, is demonstrated on a numerical simulation dataset, comparing its prediction results against state-of-the-art approaches, also achieving a speedup around $10^4$ over traditional numerical simulators.

    https://arxiv.org/abs/2503.08410


    Can Non-Signaling Assistance Increase the Degrees of Freedom of a Wireless Network?

    oai:arXiv.org:2503.08597v3

    arXiv:2503.08597v3 Announce Type: replace Abstract: An open question posed by Fawzi and Ferme [Transactions on Information Theory 2024], asks whether non-signaling (NS) assistance can increase the capacity of a broadcast channel (BC). We answer this question in the affirmative, by showing that for a certain K-receiver BC setting, called Coordinated Multipoint (CoMP) that arises naturally in wireless networks, NS-assistance provides multiplicative gains in capacity and degrees of freedom (DoF), even achieving K-fold improvements in some cases. Somewhat surprisingly, this is shown to be true even for 2-receiver broadcast channels that are semi-deterministic and/or degraded. In a CoMP BC, B single-antenna transmitters, supported by a backhaul that allows them to share data, act as one B-antenna transmitter, to send independent messages to K receivers, each equipped with a single receive antenna. A fixed and globally known connectivity matrix M, specifies for each transmit antenna, the subset of receivers that are connected to (have a non-zero channel coefficient to) that antenna. Besides the connectivity, there is no channel state information at the transmitter. The DoF region is fully characterized for a class of connectivity patterns associated with tree graphs. Sum-capacity with NS-assistance for arbitrary connectivity patterns is bounded below and above by the triangle number and the min-rank of the connectivity matrix, respectively. While translations to Gaussian settings are demonstrated, most of our results are presented under noise-free, finite-field (Fq) models. Converse proofs for classical DoF adapt the Aligned Images bounds to the finite field model. Converse bounds for NS-assisted capacity extend the same-marginals property to the BC with NS-assistance available to all parties. Even stronger (unbounded) gains are established for certain 'communication with side-information' settings, such as the fading dirty paper channel.

    https://arxiv.org/abs/2503.08597


    SpurLens: Automatic Detection of Spurious Cues in Multimodal LLMs

    oai:arXiv.org:2503.08884v2

    arXiv:2503.08884v2 Announce Type: replace Abstract: Unimodal vision models are known to rely on spurious correlations, but it remains unclear to what extent Multimodal Large Language Models (MLLMs) exhibit similar biases despite language supervision. In this paper, we investigate spurious bias in MLLMs and introduce SpurLens, a pipeline that leverages GPT-4 and open-set object detectors to automatically identify spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in MLLMs: (1) over-reliance on spurious cues for object recognition, where removing these cues reduces accuracy, and (2) object hallucination, where spurious cues amplify the hallucination by over 10x. We validate our findings in various MLLMs and datasets. Beyond diagnosing these failures, we explore potential mitigation strategies, such as prompt ensembling and reasoning-based prompting, and conduct ablation studies to examine the root causes of spurious bias in MLLMs. By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.

    https://arxiv.org/abs/2503.08884


    Bounding the Optimal Performance of Online Randomized Primal-Dual Methods

    oai:arXiv.org:2503.09508v3

    arXiv:2503.09508v3 Announce Type: replace Abstract: The online randomized primal-dual method has widespread applications in online algorithm design and analysis. A key challenge is identifying an appropriate function space, $F$, in which we search for an optimal updating function $f \in F$ that yields the best possible lower bound on the competitiveness of a given algorithm. The choice of $F$ must balance two competing objectives: on one hand, it should impose sufficient simplifying conditions on $f$ to facilitate worst-case analysis and establish a valid lower bound; on the other hand, it should remain general enough to offer a broad selection of candidate functions. The tradeoff is that any additional constraints on $f$ that can facilitate competitive analysis may also lead to a suboptimal choice, weakening the resulting lower bound. To address this challenge, we propose an auxiliary-LP-based framework capable of effectively approximating the best possible competitiveness achievable when applying the randomized primal-dual method to different function spaces. Specifically, we examine the framework introduced by Huang and Zhang (SICOMP 2024), which analyzes Stochastic Balance for vertex-weighted online matching with stochastic rewards. Our approach yields both lower and upper bounds on the best possible competitiveness attainable using the randomized primal-dual method for different choices of $F$. Notably, we establish that Stochastic Balance achieves a competitiveness of at least $0.5796$ for the problem (under equal vanishing probabilities), improving upon the previous bound of $0.576$ by Huang and Zhang (SICOMP 2024). Meanwhile, our analysis yields an upper bound of $0.5810$ for a function space strictly larger than that considered in Huang and Zhang (SICOMP 2024).

    https://arxiv.org/abs/2503.09508


    FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution

    oai:arXiv.org:2503.10043v2

    arXiv:2503.10043v2 Announce Type: replace Abstract: Image super-resolution (SR) aims to recover low-resolution images to high-resolution images, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling convolution theorem through token mix, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mix technologies when applied as plug-ins. Furthermore, compared to convolutions and windows-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on Manga109 test set at the scale of x4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.

    https://arxiv.org/abs/2503.10043


    PanDx: AI-assisted Early Detection of Pancreatic Ductal Adenocarcinoma on Contrast-enhanced CT

    oai:arXiv.org:2503.10068v3

    arXiv:2503.10068v3 Announce Type: replace Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the most aggressive forms of pancreatic cancer and is often diagnosed at an advanced stage due to subtle early imaging signs. To enable earlier detection and improve clinical decision-making, we propose a coarse-to-fine AI-assisted framework named PanDx for identifying PDAC on contrast-enhanced CT scans. Our approach integrates two novel techniques: (1) distribution-aware stratified ensembling to improve generalization across lesion variations, and (2) peak-scaled lesion candidate extraction to enhance lesion localization precision. PanDx is developed and evaluated as part of the PANORAMA challenge, where it ranked 1st place on the official test set with an AUROC of 0.9263 and an AP of 0.7243. Furthermore, we analyzed failure cases with a radiologist to identify the limitations of AI models on this task and discussed potential future directions for model improvement. Our code and models are publicly available at https://github.com/han-liu/PanDx.

    https://arxiv.org/abs/2503.10068


    PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Burst-Sampled Spatiotemporal Dynamics

    oai:arXiv.org:2503.10253v4

    arXiv:2503.10253v4 Announce Type: replace Abstract: Deep learning has shown strong potential in modeling complex spatiotemporal dynamics. However, most existing methods depend on densely and uniformly sampled data, which is often unavailable in practice due to sensor and cost limitations. In many real-world settings, such as mobile sensing and physical experiments, data are burst-sampled with short high-frequency segments followed by long gaps, making it difficult to learn accurate dynamics from sparse observations. To address this issue, we propose Physics-Informed Multi-Scale Recurrent Learning (PIMRL), a novel framework specifically designed for burst-sampled spatiotemporal data. PIMRL combines macro-scale latent dynamics inference with micro-scale adaptive refinement guided by incomplete prior information from partial differential equations (PDEs). It further introduces a temporal message-passing mechanism to effectively propagate information across burst intervals. This multi-scale architecture enables PIMRL to model complex systems accurately even under severe data scarcity. We evaluate our approach on five benchmark datasets involving 1D to 3D multi-scale PDEs. The results show that PIMRL consistently outperforms state-of-the-art baselines, achieving substantial improvements and reducing errors by up to 80% in the most challenging settings, which demonstrates the clear advantage of our model. Our work demonstrates the effectiveness of physics-informed recurrent learning for accurate and efficient modeling of sparse spatiotemporal systems.

    https://arxiv.org/abs/2503.10253


    FastVID: Dynamic Density Pruning for Fast Video Large Language Models

    oai:arXiv.org:2503.11187v3

    arXiv:2503.11187v3 Announce Type: replace Abstract: Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to effectively exploit the spatiotemporal redundancy present in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential spatial and temporal information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision, LLaVA-Video, Qwen2-VL, and Qwen2.5-VL. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes $\textbf{90.3%}$ of video tokens, reduces FLOPs to $\textbf{8.3%}$, and accelerates the LLM prefill stage by $\textbf{7.1}\times$, while maintaining $\textbf{98.0%}$ of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.

    https://arxiv.org/abs/2503.11187


    Privacy Ethics Alignment in AI: A Stakeholder-Centric Framework for Ethical AI

    oai:arXiv.org:2503.11950v4

    arXiv:2503.11950v4 Announce Type: replace Abstract: The increasing integration of artificial intelligence (AI) in digital ecosystems has reshaped privacy dynamics, particularly for young digital citizens navigating data-driven environments. This study explores evolving privacy concerns across three key stakeholder groups-young digital citizens, parents/educators, and AI professionals-and assesses differences in data ownership, trust, transparency, parental mediation, education, and risk-benefit perceptions. Employing a grounded theory methodology, this research synthesizes insights from key participants through structured surveys, qualitative interviews, and focus groups to identify distinct privacy expectations. Young digital citizens emphasized autonomy and digital agency, while parents and educators prioritized oversight and AI literacy. AI professionals focused on balancing ethical design with system performance. The analysis revealed significant gaps in transparency and digital literacy, underscoring the need for inclusive, stakeholder-driven privacy frameworks. Drawing on comparative thematic analysis, this study introduces the Privacy-Ethics Alignment in AI (PEA-AI) model, which conceptualizes privacy decision-making as a dynamic negotiation among stakeholders. By aligning empirical findings with governance implications, this research provides a scalable foundation for adaptive, youth-centered AI privacy governance.

    https://arxiv.org/abs/2503.11950


    MusicInfuser: Making Video Diffusion Listen and Dance

    oai:arXiv.org:2503.14505v2

    arXiv:2503.14505v2 Announce Type: replace Abstract: We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.

    https://arxiv.org/abs/2503.14505


    Unsupervised Acquisition of Discrete Grammatical Categories

    oai:arXiv.org:2503.18702v2

    arXiv:2503.18702v2 Announce Type: replace Abstract: This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi-agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non-trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non-trivial categories.

    https://arxiv.org/abs/2503.18702


    LookAhead Tuning: Safer Language Models via Partial Answer Previews

    oai:arXiv.org:2503.19041v3

    arXiv:2503.19041v3 Announce Type: replace Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.

    https://arxiv.org/abs/2503.19041


    LLPut: Investigating Large Language Models for Bug Report-Based Input Generation

    oai:arXiv.org:2503.20578v5

    arXiv:2503.20578v5 Announce Type: replace Abstract: Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs -- LLaMA, Qwen, and Qwen-Coder -- in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.

    https://arxiv.org/abs/2503.20578


    Dataset Distillation of 3D Point Clouds via Distribution Matching

    oai:arXiv.org:2503.22154v3

    arXiv:2503.22154v3 Announce Type: replace Abstract: Large-scale datasets are usually required to train deep neural networks, but it increases the computational complexity hindering the practical applications. Recently, dataset distillation for images and texts has been attracting a lot of attention, that reduces the original dataset to a synthetic dataset to alleviate the computational burden of training while preserving essential task-relevant information. However, the dataset distillation for 3D point clouds remains largely unexplored, as the point clouds exhibit fundamentally different characteristics from that of images, making the dataset distillation more challenging. In this paper, we propose a distribution matching-based distillation framework for 3D point clouds that jointly optimizes the geometric structures as well as the orientations of the synthetic 3D objects. To address the semantic misalignment caused by unordered indexing of points, we introduce a Semantically Aligned Distribution Matching loss computed on the sorted features in each channel. Moreover, to address the rotation variation, we jointly learn the optimal rotation angles while updating the synthetic dataset to better align with the original feature distribution. Extensive experiments on widely used benchmark datasets demonstrate that the proposed method consistently outperforms existing dataset distillation methods, achieving superior accuracy and strong cross-architecture generalization.

    https://arxiv.org/abs/2503.22154


    Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus

    oai:arXiv.org:2503.23229v2

    arXiv:2503.23229v2 Announce Type: replace Abstract: Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their employment in the research community is inhibited by their tendency to hallucinate invalid sources and lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist: An application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both, a website (https://citegeist.org), as well as an implementation harness that works with several different LLM implementations.

    https://arxiv.org/abs/2503.23229


    HACTS: a Human-As-Copilot Teleoperation System for Robot Learning

    oai:arXiv.org:2503.24070v2

    arXiv:2503.24070v2 Announce Type: replace Abstract: Teleoperation is essential for autonomous robot learning, especially in manipulation tasks that require human demonstrations or corrections. However, most existing systems only offer unilateral robot control and lack the ability to synchronize the robot's status with the teleoperation hardware, preventing real-time, flexible intervention. In this work, we introduce HACTS (Human-As-Copilot Teleoperation System), a novel system that establishes bilateral, real-time joint synchronization between a robot arm and teleoperation hardware. This simple yet effective feedback mechanism, akin to a steering wheel in autonomous vehicles, enables the human copilot to intervene seamlessly while collecting action-correction data for future learning. Implemented using 3D-printed components and low-cost, off-the-shelf motors, HACTS is both accessible and scalable. Our experiments show that HACTS significantly enhances performance in imitation learning (IL) and reinforcement learning (RL) tasks, boosting IL recovery capabilities and data efficiency, and facilitating human-in-the-loop RL. HACTS paves the way for more effective and interactive human-robot collaboration and data-collection, advancing the capabilities of robot manipulation.

    https://arxiv.org/abs/2503.24070


    Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems

    oai:arXiv.org:2504.03494v3

    arXiv:2504.03494v3 Announce Type: replace Abstract: Cyber-Physical Systems (CPS) in domains such as manufacturing and energy distribution generate complex time series data crucial for Prognostics and Health Management (PHM). While Deep Learning (DL) methods have demonstrated strong forecasting capabilities, their adoption in industrial CPS remains limited due insufficient robustness. Existing robustness evaluations primarily focus on formal verification or adversarial perturbations, inadequately representing the complexities encountered in real-world CPS scenarios. To address this, we introduce a practical robustness definition grounded in distributional robustness, explicitly tailored to industrial CPS, and propose a systematic framework for robustness evaluation. Our framework simulates realistic disturbances, such as sensor drift, noise and irregular sampling, enabling thorough robustness analyses of forecasting models on real-world CPS datasets. The robustness definition provides a standardized score to quantify and compare model performance across diverse datasets, assisting in informed model selection and architecture design. Through extensive empirical studies evaluating prominent DL architectures (including recurrent, convolutional, attention-based, modular, and structured state-space models) we demonstrate the applicability and effectiveness of our approach. We publicly release our robustness benchmark to encourage further research and reproducibility.

    https://arxiv.org/abs/2504.03494


    Do LLM Evaluators Prefer Themselves for a Reason?

    oai:arXiv.org:2504.03846v3

    arXiv:2504.03846v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as prior works mostly relied on subjective tasks that lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful (favoring objectively worse responses) from legitimate (favoring genuinely superior ones) self-preference. Our large-scale experiments across 7 model families reveal three key findings: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference bias when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce the harmful self-preference. Additionally, we experiment with LMArena and show that our findings extend beyond verifiable benchmarks to real-world, subjective domains. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.

    https://arxiv.org/abs/2504.03846


    An Algorithm to perform Covariance-Adjusted Support Vector Classification in Non-Euclidean Spaces

    oai:arXiv.org:2504.04371v2

    arXiv:2504.04371v2 Announce Type: replace Abstract: Traditional Support Vector Machine (SVM) classification is carried out by finding the max-margin classifier for the training data that divides the mar-gin space into two equal sub-spaces. This study demonstrates limitations of performing Support Vector Classification in non-Euclidean spaces by estab-lishing that the underlying principle of max-margin classification and Karush Kuhn Tucker (KKT) boundary conditions are valid only in the Eu-clidean vector spaces, while in non-Euclidean spaces the principle of maxi-mum margin is a function of intra-class data covariance. The study estab-lishes a methodology to perform Support Vector Classification in Non-Euclidean Spaces by incorporating data covariance into the optimization problem using the transformation matrix obtained from Cholesky Decompo-sition of respective class covariance matrices, and shows that the resulting classifier obtained separates the margin space in ratio of respective class pop-ulation covariance. The study proposes an algorithm to iteratively estimate the population covariance-adjusted SVM classifier in non-Euclidean space from sample covariance matrices of the training data. The effectiveness of this SVM classification approach is demonstrated by applying the classifier on multiple datasets and comparing the performance with traditional SVM kernels and whitening algorithms. The Cholesky-SVM model shows marked improvement in the accuracy, precision, F1 scores and ROC performance compared to linear and other kernel SVMs.

    https://arxiv.org/abs/2504.04371


    Physics-informed Modularized Neural Network for Advanced Building Control by Deep Reinforcement Learning

    oai:arXiv.org:2504.05397v2

    arXiv:2504.05397v2 Announce Type: replace Abstract: Physics-informed machine learning (PIML) provides a promising solution for building energy modeling and can serve as a virtual environment to enable reinforcement learning (RL) agents to interact and learn. However, challenges remain in efficiently integrating physics priors, evaluating the effectiveness of physics constraints, balancing model accuracy and physics consistency, and enabling real-world implementation. To address these gaps, this study introduces a Physics-Informed Modularized Neural Network (PI-ModNN), which incorporates physics priors through a physics-informed model structure, loss functions, and hard constraints. A new evaluation metric called "temperature response violation" is developed to quantify the physical consistency of data-driven building dynamic models under varying control inputs and training data sizes. Additionally, a physics prior evaluation framework based on rule importance is proposed to assess the contribution of each individual physics prior, offering guidance on selecting appropriate PIML techniques. Results indicate that incorporating physical priors does not always improve model performance; inappropriate priors may decrease model accuracy and consistency. However, hard constraints are effective in enforcing model consistency. Furthermore, we present a general workflow for developing control-oriented PIML models and integrating them with deep reinforcement learning (DRL). Following this framework, a case study implementing DRL in an office space over three months demonstrates potential energy savings of 31.4%. Finally, we provide a general guideline for integrating data-driven models with advanced building control through a four-step evaluation framework, paving the way for reliable and scalable deployment of advanced building controls.

    https://arxiv.org/abs/2504.05397


    Hybrid Control as a Proxy for Detection and Mitigation of Sensor Attacks in Cooperative Driving

    oai:arXiv.org:2504.05958v2

    arXiv:2504.05958v2 Announce Type: replace Abstract: We propose a real-time hybrid controller scheme to detect and mitigate False-Data Injection (FDI) attacks on Cooperative Adaptive Cruise Control (CACC). Our method uses sensor redundancy to create equivalent controller realizations, each driven by distinct sensor subsets but producing identical control inputs when no attack occurs. By comparing control signals and measurements via majority voting, the scheme identifies compromised sensors in real-time and switches to a healthy controller. The hybrid controller uses attack-dependent flow and jump sets, and resets compromised controllers' states. Simulation results demonstrate the effectiveness of this approach.

    https://arxiv.org/abs/2504.05958


    Can Carbon-Aware Electric Load Shifting Reduce Emissions? An Equilibrium-Based Analysis

    oai:arXiv.org:2504.07248v2

    arXiv:2504.07248v2 Announce Type: replace Abstract: An increasing number of electric loads, such as hydrogen producers or data centers, can be characterized as carbon-sensitive, meaning that they are willing to adapt the timing and/or location of their electricity usage in order to minimize carbon footprints. However, the emission reduction efforts of these carbon-sensitive loads rely on carbon intensity information such as average carbon emissions, and it is unclear whether load shifting based on these signals effectively reduces carbon emissions. To address this open question, we design a carbon-aware equilibrium model, which expands the commonly used equilibrium model for standard (carbon-agnostic) electricity market clearing to include carbon-sensitive consumers that adapt their consumption based on average carbon emission signals and carbon costs. This analysis represents an idealized situation for carbon-sensitive consumers, where their carbon preferences are reflected directly in the market clearing, and contrasts with current practice, where carbon emission signals only become known to consumers a posteriori (i.e., after the market has already been cleared). Furthermore, we extend our model to consider temporal load shifting and time-varying maximum renewable generations. We employ illustrative three-bus examples and numerical simulations on the IEEE RTS-GMLC system to reveal the limitations of the widely adopted average carbon emission signal for guiding carbon emission reduction. Our model offers a novel perspective for evaluating the effectiveness of different carbon signals and contributes to new carbon signal design.

    https://arxiv.org/abs/2504.07248


    VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

    oai:arXiv.org:2504.07960v3

    arXiv:2504.07960v3 Announce Type: replace Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shared a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.

    https://arxiv.org/abs/2504.07960


    LLMs as Span Annotators: A Comparative Study of LLMs and Humans

    oai:arXiv.org:2504.08697v3

    arXiv:2504.08697v3 Announce Type: replace Abstract: Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.

    https://arxiv.org/abs/2504.08697


    SoK: Can Fully Homomorphic Encryption Support General AI Computation? A Functional and Cost Analysis

    oai:arXiv.org:2504.11604v2

    arXiv:2504.11604v2 Announce Type: replace Abstract: Artificial intelligence (AI) increasingly powers sensitive applications in domains such as healthcare and finance, relying on both linear operations (e.g., matrix multiplications in large language models) and non-linear operations (e.g., sorting in retrieval-augmented generation). Fully homomorphic encryption (FHE) has emerged as a promising tool for privacy-preserving computation, but it remains unclear whether existing methods can support the full spectrum of AI workloads that combine these operations. In this SoK, we ask: Can FHE support general AI computation? We provide both a functional analysis and a cost analysis. First, we categorize ten distinct FHE approaches and evaluate their ability to support general computation. We then identify three promising candidates and benchmark workloads that mix linear and non-linear operations across different bit lengths and SIMD parallelization settings. Finally, we evaluate five real-world, privacy-sensitive AI applications that instantiate these workloads. Our results quantify the costs of achieving general computation in FHE and offer practical guidance on selecting FHE methods that best fit specific AI application requirements. Our codes are available at https://github.com/UCF-ML-Research/FHE-AI-Generality.

    https://arxiv.org/abs/2504.11604


    Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension

    oai:arXiv.org:2504.14642v3

    arXiv:2504.14642v3 Announce Type: replace Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning. However, they remain limited in visual relation understanding, struggling even with binary relation detection, let alone \textit{N}-ary relations involving multiple semantic roles. The core reason is the lack of modeling for \textit{structural semantic dependencies} among multi-entities, leading to unreliable outputs, hallucinations, and over-reliance on language priors (\eg, defaulting to ``person drinks a milk'' if a person is merely holding it). To this end, we propose Relation-R1, the \textit{first unified} relation comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-rewards optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Furthermore, we investigate the impact of various CoT strategies within this framework, demonstrating that a specific-to-general progressive approach in CoT guidance further improves generalization, especially in capturing synonymous \textit{N}-ary relations. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and \textit{N}-ary relation understanding.

    https://arxiv.org/abs/2504.14642


    Instance-Adaptive Keypoint Learning with Local-to-Global Geometric Aggregation for Category-Level Object Pose Estimation

    oai:arXiv.org:2504.15134v4

    arXiv:2504.15134v4 Announce Type: replace Abstract: Category-level object pose estimation aims to predict the 6D pose and size of previously unseen instances from predefined categories, requiring strong generalization across diverse object instances. Although many previous methods attempt to mitigate intra-class variations, they often struggle with instances exhibiting complex geometries or significant deviations from canonical shapes. To address this issue, we propose INKL-Pose, a novel category-level object pose estimation framework that enables INstance-adaptive Keypoint Learning with local-to-global geometric aggregation. Specifically, our method first predicts semantically consistent and geometrically informative keypoints using an Instance-Adaptive Keypoint Detector, then refines them: (1) a Local Keypoint Feature Aggregator capturing fine-grained geometries, and (2) a Global Keypoint Feature Aggregator using bidirectional Mamba for structural consistency. To enable bidirectional modeling in Mamba, we introduce a simple yet effective Feature Sequence Flipping strategy that preserves spatial coherence while constructing backward feature sequence. Additionally, we design a surface loss and a separation loss to encourage uniform coverage and spatial diversity in keypoint distribution. The resulting keypoints are mapped to a canonical space for 6D pose and size regression. Extensive experiments on CAMERA25, REAL275, and HouseCat6D show that INKL-Pose achieves state-of-the-art performance with 16.7M parameters and runs at 36 FPS on an NVIDIA RTX 4090D GPU.

    https://arxiv.org/abs/2504.15134


    Nash Equilibrium Learning In Large Populations With First-Order Payoff Modifications

    oai:arXiv.org:2504.16222v2

    arXiv:2504.16222v2 Announce Type: replace Abstract: We establish Nash equilibrium learning in large populations of noncooperative, strategic agents. Our analysis considers the broadest class to date of payoff mechanisms with first-order modifications, capable of modeling bounded rationality and anticipatory effects, averaging, or Pad\'{e} delay approximations. We propose a framework that, for the first time, combines two nonstandard system-theoretic passivity notions. Our results hold for discontinuous best response dynamics alongside continuous learning rules, significantly extending prior work.

    https://arxiv.org/abs/2504.16222


    When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering

    oai:arXiv.org:2504.17545v2

    arXiv:2504.17545v2 Announce Type: replace Abstract: We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes -- surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation on each pixel order independently. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state-of-the-arts as a compelling representation for ultra-fast high-fidelity radiance field rendering.

    https://arxiv.org/abs/2504.17545


    Kimina Lean Server: Technical Report

    oai:arXiv.org:2504.21230v2

    arXiv:2504.21230v2 Announce Type: replace Abstract: We introduce the Kimina Lean Server, an open-source project designed as a high-performance verifier for reinforcement learning pipelines. Built on top of the Lean REPL (Read-Eval-Print Loop) maintained by the Lean FRO, our server combines server-side parallelism by managing multiple Lean processes in parallel with a Least Recently Used (LRU) caching mechanism that reuses Lean imports across requests. On the client side, a lightweight Python package enables submitting proof batches and receiving Lean feedback, including extracted tactics and tactic states. Together, these features enable a scalable workflow for large-scale verification and data extraction. In our experiments, the Kimina Lean Server outperforms previous Lean interaction tools, achieving a 1.5 to 2 times speedup in verification time. Moreover, its improved efficiency has enabled its use in the large-scale training of state-of-the-art models such as Kimina-Prover. We hope that our open-source project will support the neural theorem proving community and accelerate future progress by enabling efficient large-scale verification and proof data extraction.

    https://arxiv.org/abs/2504.21230


    Algorithmic Addiction by Design: Big Tech's Leverage of Dark Patterns to Maintain Market Dominance and its Challenge for Content Moderation

    oai:arXiv.org:2505.00054v2

    arXiv:2505.00054v2 Announce Type: replace Abstract: Today's largest technology corporations, especially ones with consumer-facing products such as social media platforms, use a variety of unethical and often outright illegal tactics to maintain their dominance. One tactic that has risen to the level of the public consciousness is the concept of addictive design, evidenced by the fact that excessive social media use has become a salient problem, particularly in the mental and social development of adolescents and young adults. As tech companies have developed more and more sophisticated artificial intelligence (AI) models to power their algorithmic recommender systems, they will become more successful at their goal of ensuring addiction to their platforms. This paper explores how online platforms intentionally cultivate addictive user behaviors and the broad societal implications, including on the health and well-being of children and adolescents. It presents the usage of addictive design - including the usage of dark patterns, persuasive design elements, and recommender algorithms - as a tool leveraged by technology corporations to maintain their dominance. Lastly, it describes the challenge of content moderation to address the problem and gives an overview of solutions at the policy level to counteract addictive design.

    https://arxiv.org/abs/2505.00054


    Error Exponents for Oblivious Relaying and Connections to Source Coding with a Helper

    oai:arXiv.org:2505.00567v2

    arXiv:2505.00567v2 Announce Type: replace Abstract: The information bottleneck channel, also known as oblivious relaying, is a two-hop channel where a transmitter sends messages to a remote receiver via an intermediate relay node. A codeword sent by the transmitter passes through a discrete memoryless channel to reach the relay, which then processes the noisy channel output and forwards it to the receiver through a noiseless rate-limited link. The relay is oblivious, in the sense that it has no knowledge of the channel codebook used in transmission. Previous works on oblivious relaying focus on characterizing achievable rates. In this work, we study error exponents and explore connections to lossless source coding with a helper, also known as the Wyner-Ahlswede-K\"orner (WAK) problem. We first establish an achievable error exponent for oblivious relaying under constant compositions codes. A key feature of our analysis is the use of the type covering lemma to design the relay's compress-forward scheme. We then show that employing constant composition code ensembles does not improve the rates achieved with their IID counterparts. We also derive a sphere packing upper bound for the error exponent. In the second part of this paper, we establish a connection between the information bottleneck channel and the WAK problem. We show that good codes for the latter can be produced through permuting codes designed for the former. This is accomplished by revisiting Ahlswede's covering lemma, and extending it to achieve simultaneous covering of a type class by several distinct sets using the same sequence of permutations. We then apply our approach to attain the best known achievable error exponent for the WAK problem, previously established by Kelly and Wagner. As a byproduct of our derivations, we also establish error exponents and achievable rates under mismatched decoding rules.

    https://arxiv.org/abs/2505.00567


    Parameter-Efficient Fine-Tuning with Attributed Patch Semantic Graph for Automated Patch Correctness Assessment

    oai:arXiv.org:2505.02629v2

    arXiv:2505.02629v2 Announce Type: replace Abstract: Automated program repair (APR) aims to automatically repair program errors without human intervention, and recent years have witnessed a growing interest on this research topic. While much progress has been made and techniques originating from different disciplines have been proposed, APR techniques generally suffer from the patch overfitting issue, i.e., the generated patches are not genuinely correct despite they pass the employed tests. To alleviate this issue, many research efforts have been devoted for automated patch correctness assessment (APCA). In particular, with the emergence of large language model (LLM) technology, researchers have employed LLM to assess the patch correctness and have obtained the state-of-the-art performance. The literature on APCA has demonstrated the importance of capturing patch semantic and explicitly considering certain code attributes in predicting patch correctness. However, existing LLM-based methods typically treat code as token sequences and ignore the inherent formal structure for code, making it difficult to capture the deep patch semantics. Moreover, these LLM-based methods also do not explicitly account for enough code attributes. To overcome these drawbacks, we in this paper design a novel patch graph representation named attributed patch semantic graph (APSG), which adequately captures the patch semantic and explicitly reflects important patch attributes. To effectively use graph information in APSG, we accordingly propose a new parameter-efficient fine-tuning (PEFT) method of LLMs named Graph-LoRA. Extensive evaluations have been conducted to evaluate our method, and the results show that compared to the state-of-the-art methods, our method improves the accuracy and F1 score by 3.1\% to 7.5\% and 3.0\% to 7.1\% respectively.

    https://arxiv.org/abs/2505.02629


    TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models

    oai:arXiv.org:2505.04050v2

    arXiv:2505.04050v2 Announce Type: replace Abstract: 3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.

    https://arxiv.org/abs/2505.04050


    Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints

    oai:arXiv.org:2505.05019v2

    arXiv:2505.05019v2 Announce Type: replace Abstract: The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) improves generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO objectives across nine generative models, comparing single-metric to compound metric optimization. Our results demonstrate that HPO consistently improves synthetic data quality, with Tab DDPM achieving the largest relative gains, followed by TVAE (60%), CTGAN (39%), and CTAB-GAN+ (38%). Compound metric optimization outperformed single-metric objectives, producing more generalizable synthetic datasets. Despite improving overall quality, HPO alone fails to prevent violations of essential clinical survival constraints. Preprocessing and postprocessing played a crucial role in reducing these violations, as models lacking robust processing steps produced invalid data in up to 61% of cases. These findings underscore the necessity of integrating explicit domain knowledge alongside HPO to generate high-quality synthetic datasets. Our study provides actionable recommendations for improving synthetic data generation, with future work needed to refine metric selection and validate findings on larger datasets.

    https://arxiv.org/abs/2505.05019


    Constraints Preserving Lax-Wendroff Flux Reconstruction for Relativistic Hydrodynamics with General Equations of State

    oai:arXiv.org:2505.05128v3

    arXiv:2505.05128v3 Announce Type: replace Abstract: In the realm of relativistic astrophysics, the ideal equation of state with a constant adiabatic index provides a poor approximation due to its inconsistency with relativistic kinetic theory. However, it is a common practice to use it for relativistic fluid flow equations due to its simplicity. Here we develop a high-order Lax-Wendroff flux reconstruction method on Cartesian grids for solving relativistic hydrodynamics equations with several general equations of state available in the literature. We also study the conversion from conservative to primitive variables, which depends on the equation of state in use, and provide an alternative method of conversion when the existing approach does not succeed. For the admissibility of the solution, we blend the high-order method with a low-order method on sub-cells and prove its physical admissible property in the case of all the equations of state used here. Lastly, we validate the scheme by several test cases having strong discontinuities, large Lorentz factor, and low density or pressure in one and two dimensions.

    https://arxiv.org/abs/2505.05128


    LEGO: A Layout Expression Language for Code Generation of Hierarchical Mapping

    oai:arXiv.org:2505.08091v2

    arXiv:2505.08091v2 Announce Type: replace Abstract: We describe LEGO, a new approach to optimizing data movement whereby code is expressed as a layout-independent computation and composed with layouts for data and computation. This code generator organization derives complex indexing expressions associated with hierarchical parallel code and data movement for GPUs. LEGO maps from layout specification to indexing expressions, and can be integrated into existing compilers and code templates. It facilitates the exploration of data layouts in combination with other optimizations. We demonstrate LEGO's integration with the Triton and MLIR compilers, and with CUDA templates. We show that LEGO is capable of deriving performance competitive with Triton, and shows broad applicability for data and thread layout mapping optimizations in its integration with CUDA and MLIR.

    https://arxiv.org/abs/2505.08091


    Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions

    oai:arXiv.org:2505.08919v2

    arXiv:2505.08919v2 Announce Type: replace Abstract: High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical planning for the treatment of lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to comprehensively assess the quality of the reconstruction. Furthermore, to address the lack of publicly available shape datasets for benchmarking reconstruction algorithms, we developed a shape dataset named Lung3D, which includes the 3D models of 800 labeled pulmonary segments and their corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at https://github.com/HINTLab/ImPulSe.

    https://arxiv.org/abs/2505.08919


    Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE

    oai:arXiv.org:2505.12533v2

    arXiv:2505.12533v2 Announce Type: replace Abstract: Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability.

    https://arxiv.org/abs/2505.12533


    RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees

    oai:arXiv.org:2505.12919v2

    arXiv:2505.12919v2 Announce Type: replace Abstract: Recovering a low rank matrix from a subset of its entries, some of which may be corrupted, is known as the robust matrix completion (RMC) problem. Existing RMC methods have several limitations: they require a relatively large number of observed entries; they may fail under overparametrization, when their assumed rank is higher than the correct one; and many of them fail to recover even mildly ill-conditioned matrices. In this paper we propose a novel RMC method, denoted $\texttt{RGNMR}$, which overcomes these limitations. $\texttt{RGNMR}$ is a simple factorization-based iterative algorithm, which combines a Gauss-Newton linearization with removal of entries suspected to be outliers. On the theoretical front, we prove that under suitable assumptions, $\texttt{RGNMR}$ is guaranteed exact recovery of the underlying low rank matrix. Our theoretical results improve upon the best currently known for factorization-based methods. On the empirical front, we show via several simulations the advantages of $\texttt{RGNMR}$ over existing RMC methods, and in particular its ability to handle a small number of observed entries, overparameterization of the rank and ill-conditioned matrices.

    https://arxiv.org/abs/2505.12919


    The Traitors: Deception and Trust in Multi-Agent Language Model Simulations

    oai:arXiv.org:2505.12923v2

    arXiv:2505.12923v2 Announce Type: replace Abstract: As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents the traitors seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT-4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others' falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.

    https://arxiv.org/abs/2505.12923


    Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning

    oai:arXiv.org:2505.13261v2

    arXiv:2505.13261v2 Announce Type: replace Abstract: In this work, we investigate how explicitly modeling problem's difficulty prior information shapes the effectiveness of reinforcement learning based fine-tuning for multimodal reasoning. Our exploration mainly comprises of following three perspective: First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets using the base model by multi-round sampling, and then filter out prompts that are either too simple or extremely difficult to provide meaningful gradients and perform subsequent two-stage training. Second, we implement an online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy to adaptively reweight advantages estimation, providing stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our comprehensive approach demonstrates significant performances across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.

    https://arxiv.org/abs/2505.13261


    Memory-Centric Embodied Question Answering

    oai:arXiv.org:2505.13948v2

    arXiv:2505.13948v2 Announce Type: replace Abstract: Embodied Question Answering (EQA) requires agents to autonomously explore and comprehend the environment to answer context-dependent questions. Typically, an EQA framework consists of four components: a planner, a memory module, a stopping module, and an answering module. However, the memory module is utilized inefficiently in existing methods, as the information it stores is leveraged solely for the answering module. Such a design may result in redundant or inadequate exploration, leading to a suboptimal success rate. To solve this problem, we propose MemoryEQA, an EQA framework centered on memory, which establishes mechanisms for memory storage, update, and retrieval, allowing memory information to contribute throughout the entire exploration process. Specifically, we convert the observation into structured textual representations, which are stored in a vector library following a fixed structure. At each exploration step, we utilize a viewpoint comparison strategy to determine whether the memory requires updating. Before executing each module, we employ an entropy-based adaptive retrieval strategy to obtain the minimal yet sufficient memory information that satisfies the requirements of different modules. The retrieved module-specific information is then integrated with the current observation as input to the corresponding module. To evaluate EQA models' memory capabilities, we constructed the benchmark based on HM3D called MT-HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 9.9% performance gain on MT-HM3D compared to baseline models further underscores the memory capability's pivotal role in solving complex tasks.

    https://arxiv.org/abs/2505.13948


    Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

    oai:arXiv.org:2505.14479v5

    arXiv:2505.14479v5 Announce Type: replace Abstract: Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI's o1 model (58%-70% improvement); both analogous problems and the verifier's feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

    https://arxiv.org/abs/2505.14479


    Learning to Integrate Diffusion ODEs by Averaging the Derivatives

    oai:arXiv.org:2505.14502v3

    arXiv:2505.14502v3 Announce Type: replace Abstract: To accelerate diffusion model inference, numerical solvers perform poorly at extremely small steps, while distillation techniques often introduce complexity and instability. This work presents an intermediate strategy, balancing performance and cost, by learning ODE integration using loss functions derived from the derivative-integral relationship, inspired by Monte Carlo integration and Picard iteration. From a geometric perspective, the losses operate by gradually extending the tangent to the secant, thus are named as secant losses. The target of secant losses is the same as that of diffusion models, or the diffusion model itself, leading to great training stability. By fine-tuning or distillation, the secant version of EDM achieves a $10$-step FID of $2.14$ on CIFAR-10, while the secant version of SiT-XL/2 attains a $4$-step FID of $2.27$ and an $8$-step FID of $1.96$ on ImageNet-$256\times256$. Code is available at https://github.com/poppuppy/secant-expectation.

    https://arxiv.org/abs/2505.14502


    Cost-aware LLM-based Online Dataset Annotation

    oai:arXiv.org:2505.15101v2

    arXiv:2505.15101v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.

    https://arxiv.org/abs/2505.15101


    Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

    oai:arXiv.org:2505.15201v4

    arXiv:2505.15201v4 Announce Type: replace Abstract: Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.

    https://arxiv.org/abs/2505.15201


    Generative Graph Pattern Machine

    oai:arXiv.org:2505.16130v3

    arXiv:2505.16130v3 Announce Type: replace Abstract: Graph neural networks (GNNs) have been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance. To this end, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable and transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node/link/graph classification, transfer learning, and cross-graph pretraining -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at https://github.com/Zehong-Wang/G2PM.

    https://arxiv.org/abs/2505.16130


    Meta-reinforcement learning with minimum attention

    oai:arXiv.org:2505.16741v2

    arXiv:2505.16741v2 Announce Type: replace Abstract: Minimum attention applies the least action principle in the changes of control concerning state and time, first proposed by Brockett. The involved regularization is highly relevant in emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, the minimum attention does show outperforming competence in comparison to the state-of-the-art algorithms of model-free and model-based RL, i.e., fast adaptation in few shots and variance reduction from the perturbations of the model and environment. Furthermore, the minimum attention demonstrates an improvement in energy efficiency.

    https://arxiv.org/abs/2505.16741


    Backdoors in DRL: Four Environments Focusing on In-distribution Triggers

    oai:arXiv.org:2505.17248v3

    arXiv:2505.17248v3 Announce Type: replace Abstract: Backdoor attacks, or trojans, pose a security risk by concealing undesirable behavior in deep neural network models. Open-source neural networks are downloaded from the internet daily, possibly containing backdoors, and third-party model developers are common. To advance research on backdoor attack mitigation, we develop several trojans for deep reinforcement learning (DRL) agents. We focus on in-distribution triggers, which occur within the agent's natural data distribution, since they pose a more significant security threat than out-of-distribution triggers due to their ease of activation by the attacker during model deployment. We implement backdoor attacks in four reinforcement learning (RL) environments: LavaWorld, Randomized LavaWorld, Colorful Memory, and Modified Safety Gymnasium. We train various models, both clean and backdoored, to characterize these attacks. We find that in-distribution triggers can require additional effort to implement and be more challenging for models to learn, but are nevertheless viable threats in DRL even using basic data poisoning attacks.

    https://arxiv.org/abs/2505.17248


    Comparator-Adaptive $\Phi$-Regret: Improved Bounds, Simpler Algorithms, and Applications to Games

    oai:arXiv.org:2505.17277v2

    arXiv:2505.17277v2 Announce Type: replace Abstract: In the classic expert problem, $\Phi$-regret measures the gap between the learner's total loss and that achieved by applying the best action transformation $\phi \in \Phi$. A recent work by Lu et al., [2025] introduces an adaptive algorithm whose regret against a comparator $\phi$ depends on a certain sparsity-based complexity measure of $\phi$, (almost) recovering and interpolating optimal bounds for standard regret notions such as external, internal, and swap regret. In this work, we propose a general idea to achieve an even better comparator-adaptive $\Phi$-regret bound via much simpler algorithms compared to Lu et al., [2025]. Specifically, we discover a prior distribution over all possible binary transformations and show that it suffices to achieve prior-dependent regret against these transformations. Then, we propose two concrete and efficient algorithms to achieve so, where the first one learns over multiple copies of a prior-aware variant of the Kernelized MWU algorithm of Farina et al., [2022], and the second one learns over multiple copies of a prior-aware variant of the BM-reduction [Blum and Mansour, 2007]. To further showcase the power of our methods and the advantages over Lu et al., [2025] besides the simplicity and better regret bounds, we also show that our second approach can be extended to the game setting to achieve accelerated and adaptive convergence rate to $\Phi$-equilibria for a class of general-sum games. When specified to the special case of correlated equilibria, our bound improves over the existing ones from Anagnostides et al., [2022a,b]

    https://arxiv.org/abs/2505.17277


    Dynamic Dual Buffer with Divide-and-Conquer Strategy for Online Continual Learning

    oai:arXiv.org:2505.18101v2

    arXiv:2505.18101v2 Announce Type: replace Abstract: Online Continual Learning (OCL) involves sequentially arriving data and is particularly challenged by catastrophic forgetting, which significantly impairs model performance. To address this issue, we introduce a novel framework, Online Dynamic Expandable Dual Memory (ODEDM), that integrates a short-term memory for fast memory and a long-term memory structured into sub-buffers anchored by cluster prototypes, enabling the storage of diverse and category-specific samples to mitigate forgetting. We propose a novel K-means-based strategy for prototype identification and an optimal transport-based mechanism to retain critical samples, prioritising those exhibiting high similarity to their corresponding prototypes. This design preserves semantically rich information. Additionally, we propose a Divide-and-Conquer (DAC) optimisation strategy that decomposes memory updates into subproblems, thereby reducing computational overhead. ODEDM functions as a plug-and-play module that can be seamlessly integrated with existing rehearsal-based approaches. Experimental results under both standard and imbalanced OCL settings show that ODEDM consistently achieves state-of-the-art performance across multiple datasets, delivering substantial improvements over the DER family as well as recent methods such as VR-MCL and POCL.

    https://arxiv.org/abs/2505.18101


    NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

    oai:arXiv.org:2505.18231v2

    arXiv:2505.18231v2 Announce Type: replace Abstract: Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of key-value (KV) cache. Vector Quantization (VQ) is recently adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation-1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize)-with Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3$\times$ throughput gain over full-precision baselines.

    https://arxiv.org/abs/2505.18231


    DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

    oai:arXiv.org:2505.19253v3

    arXiv:2505.19253v3 Announce Type: replace Abstract: Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks rely on dynamic commercial search APIs, which pose reproducibility and transparency challenges in addition to their cost. To address these limitations, we introduce \textsc{DeepResearchGym} as an open-source sandbox that combines a reproducible search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is free for research use. To evaluate deep research systems' outputs, we extend the Researchy Questions benchmark with automatic metrics through LLM-as-a-judge to measure alignment with users' information needs, retrieval faithfulness, and report quality. Experimental results show that systems integrated with~\textsc{DeepResearchGym} achieve performance comparable to those using commercial APIs, with performance rankings remaining consistent across evaluation metrics. A case study on short-answer search agents further demonstrates the sandbox's utility for cost-effective training, showing that models trained within the sandbox can generalize to commercial search.

    https://arxiv.org/abs/2505.19253


    Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

    oai:arXiv.org:2505.19609v2

    arXiv:2505.19609v2 Announce Type: replace Abstract: Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient long-SFT. Through dynamic data scheduling, Skrull balances the computation requirements of long and short sequences, improving overall training efficiency. Furthermore, we formulate the scheduling process as a joint optimization problem and thoroughly analyze the trade-offs involved. Based on those analysis, Skrull employs a lightweight scheduling algorithm to achieve near-zero cost online scheduling in Long-SFT. Finally, we implement Skrull upon DeepSpeed, a state-of-the-art distributed training system for LLMs. Experimental results demonstrate that Skrull outperforms DeepSpeed by 3.76x on average (up to 7.54x) in real-world long-SFT scenarios.

    https://arxiv.org/abs/2505.19609


    Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

    oai:arXiv.org:2505.19700v4

    arXiv:2505.19700v4 Announce Type: replace Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel \textit{Residual Alignment Model} (\textit{RAM}) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.

    https://arxiv.org/abs/2505.19700


    Thinking with Visual Abstract: Enhancing Multimodal Reasoning via Visual Abstraction

    oai:arXiv.org:2505.20164v3

    arXiv:2505.20164v3 Announce Type: replace Abstract: Images usually convey richer detail than text, but often include redundant information, which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce a novel paradigm to elicit the ability to Think with Visual Abstract (VAT), by prompting Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance, permitting a more efficient visual reasoning mechanism via concentrated perception. VAT encourages models to focus on more essential visual elements, concepts and structural features by undermining redundant information compared with explicit thinking methods, such as Chain-of-thought (CoT) and tool-using approaches, that increase the complexity of reasoning process via inserting verbose intermediate steps and external knowledge. Experimental results show that VAT consistently empowers different MLLMs in visual perception and reasoning tasks. VAT achieves an average gain of $2.21\%$ over GPT-5 baseline, surpassing the gain of CoT, demonstrating that VAT better enhances multimodal task performance of MLLMs. Additionally, VAT spends fewer tokens while achieving higher performance. These findings highlight the effectiveness of visual abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.

    https://arxiv.org/abs/2505.20164


    Characterizing Equivalence of Logically Constrained Terms via Existentially Constrained Terms (Full Version)

    oai:arXiv.org:2505.21986v3

    arXiv:2505.21986v3 Announce Type: replace Abstract: Logically constrained term rewriting is a rewriting framework that supports built-in data structures such as integers and bit vectors. Recently, constrained terms play a key role in various analyses and applications of logically constrained term rewriting. A fundamental question on constrained terms arising there is how to characterize equivalence between them. However, in the current literature only limited progress has been made on this. In this paper, we provide several sound and complete solutions to tackle this problem. Our key idea is the introduction of a novel concept, namely existentially constrained terms, into which the original form of constrained terms can be embedded. We present several syntactic characterizations of equivalence between existentially constrained terms. In particular, we provide two different kinds of complete characterizations: one is designed to facilitate equivalence checking, while the other is intended for theoretical analysis.

    https://arxiv.org/abs/2505.21986


    SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services

    oai:arXiv.org:2505.23065v2

    arXiv:2505.23065v2 Announce Type: replace Abstract: With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.

    https://arxiv.org/abs/2505.23065


    Translation in the Wild

    oai:arXiv.org:2505.23548v3

    arXiv:2505.23548v3 Announce Type: replace Abstract: Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in "incidental bilingualism" (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs' translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the "duality" hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

    https://arxiv.org/abs/2505.23548


    Improving Time Series Forecasting via Instance-aware Post-hoc Revision

    oai:arXiv.org:2505.23583v2

    arXiv:2505.23583v2 Announce Type: replace Abstract: Time series forecasting plays a vital role in various real-world applications and has attracted significant attention in recent decades. While recent methods have achieved remarkable accuracy by incorporating advanced inductive biases and training strategies, we observe that instance-level variations remain a significant challenge. These variations--stemming from distribution shifts, missing data, and long-tail patterns--often lead to suboptimal forecasts for specific instances, even when overall performance appears strong. To address this issue, we propose a model-agnostic framework, PIR, designed to enhance forecasting performance through Post-forecasting Identification and Revision. Specifically, PIR first identifies biased forecasting instances by estimating their accuracy. Based on this, the framework revises the forecasts using contextual information, including covariates and historical time series, from both local and global perspectives in a post-processing fashion. Extensive experiments on real-world datasets with mainstream forecasting models demonstrate that PIR effectively mitigates instance-level errors and significantly improves forecasting reliability.

    https://arxiv.org/abs/2505.23583


    A Flat Minima Perspective on Understanding Augmentations and Model Robustness

    oai:arXiv.org:2505.24592v3

    arXiv:2505.24592v3 Announce Type: replace Abstract: Model robustness indicates a model's capability to generalize well on unforeseen distributional shifts, including data corruptions and adversarial attacks. Data augmentation is one of the most prevalent and effective ways to enhance robustness. Despite the great success of the diverse augmentations in different fields, a unified theoretical understanding of their efficacy in improving model robustness is lacking. We theoretically reveal a general condition for label-preserving augmentations to bring robustness to diverse distribution shifts through the lens of flat minima and generalization bound, which de facto turns out to be strongly correlated with robustness against different distribution shifts in practice. Unlike most earlier works, our theoretical framework accommodates all the label-preserving augmentations and is not limited to particular distribution shifts. We substantiate our theories through different simulations on the existing common corruption and adversarial robustness benchmarks based on the CIFAR and ImageNet datasets.

    https://arxiv.org/abs/2505.24592


    Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

    oai:arXiv.org:2505.24850v2

    arXiv:2505.24850v2 Announce Type: replace Abstract: Recent advances in model distillation show that data from advanced reasoning models can effectively train smaller student models. However, standard practices discard incorrect reasoning traces -- valuable, yet underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? We employ a two-stage training recipe: first, Supervised Fine-Tuning (SFT) on positive traces, followed by a refinement stage using both positive and negative traces. We find that a simple REINFORCE-style objective, which we term the Reinforcement Distillation (REDI) objective, outperforms established preference optimization methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate the effectiveness of this approach. Notably, our Qwen-REDI-1.5B model, trained on just 131k traces from the open Open-R1 dataset, achieves an 83.1% score on MATH-500. Its performance matches that of DeepSeek-R1-Distill-Qwen-1.5B, a model trained on 800k proprietary data. This result showcases the remarkable data efficiency of utilizing previously discarded negative traces.

    https://arxiv.org/abs/2505.24850


    Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

    oai:arXiv.org:2506.00227v2

    arXiv:2506.00227v2 Announce Type: replace Abstract: Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human-evaluation of physical realism and video quality compared to prior diffusion-based methods.

    https://arxiv.org/abs/2506.00227


    SMELLNET: A Large-scale Dataset for Real-world Smell Recognition

    oai:arXiv.org:2506.00239v3

    arXiv:2506.00239v3 Announce Type: replace Abstract: The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large-scale benchmarks, and therefore little progress, for training and evaluating AI systems' ability to smell in the real world. In this paper, we use small gas and chemical sensors to create SmellNet, the first large-scale database that digitizes a diverse range of smells in the natural world. SmellNet contains about 828,000 data points across 50 substances, spanning nuts, spices, herbs, fruits, and vegetables, and 43 mixtures among them, with 68 hours of data collected. Using SmellNet, we developed ScentFormer, a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data. For the SmellNet-Base classification task, ScentFormer achieves 58.5% Top-1 accuracy, and for the SmellNet-Mixture distribution prediction task, ScentFormer achieves 50.2% Top-1@0.1 on the test-seen split. ScentFormer's ability to generalize across conditions and capture transient chemical dynamics demonstrates the promise of temporal modeling in olfactory AI. SmellNet and ScentFormer lay the groundwork for real-world olfactory applications across healthcare, food and beverage, environmental monitoring, manufacturing, and entertainment.

    https://arxiv.org/abs/2506.00239


    Faster negative length shortest paths by bootstrapping hop reducers

    oai:arXiv.org:2506.00428v2

    arXiv:2506.00428v2 Announce Type: replace Abstract: The textbook algorithm for real-weighted single-source shortest paths takes $O(m n)$ time on a graph with $m$ edges and $n$ vertices. The breakthrough algorithm by Fineman [Fin24] takes $\tilde{O}(m n^{8/9})$ randomized time. The running time was subsequently improved to $\tilde{O}(mn^{4/5})$ [HJQ25]. We build on [Fin24; HJQ25] to obtain an $\tilde{O}(m n^{3/4} + m^{4/5} n)$ randomized running time. (Equivalently, $\tilde{O}(mn^{3/4})$ for $m \geq n^{5/4}$, and $\tilde{O}(m^{4/5} n)$ for $m \leq n^{5/4}$.) The main new technique replaces the hop-reducing auxiliary graph from [Fin24] with a bootstrapping process where constant-hop reducers for small subgraphs of the input graph are iteratively amplified and expanded until the desired polynomial-hop reduction is achieved over the entire graph.

    https://arxiv.org/abs/2506.00428


    Selecting for Less Discriminatory Algorithms: A Relational Search Framework for Navigating Fairness-Accuracy Trade-offs in Practice

    oai:arXiv.org:2506.01594v2

    arXiv:2506.01594v2 Announce Type: replace Abstract: As machine learning models are increasingly embedded into society through high-stakes decision-making, selecting the right algorithm for a given task, audience, and sector presents a critical challenge, particularly in the context of fairness. Traditional assessments of model fairness have often framed fairness as an objective mathematical property, treating model selection as an optimization problem under idealized informational conditions. This overlooks model multiplicity as a consideration--that multiple models can deliver similar performance while exhibiting different fairness characteristics. Legal scholars have engaged this challenge through the concept of Less Discriminatory Algorithms (LDAs), which frames model selection as a civil rights obligation. In real-world deployment, this normative challenge is bounded by constraints on fairness experimentation, e.g., regulatory standards, institutional priorities, and resource capacity. Against these considerations, the paper revisits Lee and Floridi (2021)'s relational fairness approach using updated 2021 Home Mortgage Disclosure Act (HMDA) data, and proposes an expansion of the scope of the LDA search process. We argue that extending the LDA search horizontally, considering fairness across model families themselves, provides a lightweight complement, or alternative, to within-model hyperparameter optimization, when operationalizing fairness in non-experimental, resource constrained settings. Fairness metrics alone offer useful, but insufficient signals to accurately evaluate candidate LDAs. Rather, by using a horizontal LDA search approach with the relational trade-off framework, we demonstrate a responsible minimum viable LDA search on real-world lending outcomes. Organizations can modify this approach to systematically compare, evaluate, and select LDAs that optimize fairness and accuracy in a sector-based contextualized manner.

    https://arxiv.org/abs/2506.01594


    DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes

    oai:arXiv.org:2506.01950v4

    arXiv:2506.01950v4 Announce Type: replace Abstract: We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/

    https://arxiv.org/abs/2506.01950


    Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification

    oai:arXiv.org:2506.01983v2

    arXiv:2506.01983v2 Announce Type: replace Abstract: Identification of antimicrobial peptides is an important and necessary issue in today's era. Antimicrobial peptides are essential as an alternative to antibiotics for biomedical applications and many other practical applications. These oligopeptides are useful in drug design and cause innate immunity against microorganisms. Artificial intelligence algorithms have played a significant role in the ease of identifying these peptides.This research is improved by improving proposed method in the field of antimicrobial peptides prediction. Suggested method is improved by combining the best coding method from different perspectives, In the following a deep neural network to balance the imbalanced combined datasets. The results of this research show that the proposed method have a significant improvement in the accuracy and efficiency of the prediction of antimicrobial peptides and are able to provide the best results compared to the existing methods. These development in the field of prediction and classification of antimicrobial peptides, basically in the fields of medicine and pharmaceutical industries, have high effectiveness and application.

    https://arxiv.org/abs/2506.01983


    Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

    oai:arXiv.org:2506.02454v3

    arXiv:2506.02454v3 Announce Type: replace Abstract: Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics served as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.

    https://arxiv.org/abs/2506.02454


    How PARTs assemble into wholes: Learning the relative composition of images

    oai:arXiv.org:2506.03682v2

    arXiv:2506.03682v2 Announce Type: replace Abstract: The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images-an off-grid structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for universal self-supervised pretraining across diverse datatypes-from images to EEG signals-with potential in medical imaging, video, and audio.

    https://arxiv.org/abs/2506.03682


    Pet-Bench: Benchmarking the Abilities of Large Language Models as E-Pets in Social Network Services

    oai:arXiv.org:2506.03761v3

    arXiv:2506.03761v3 Announce Type: replace Abstract: As interest in using Large Language Models for interactive and emotionally rich experiences grows, virtual pet companionship emerges as a novel yet underexplored application. Existing approaches focus on basic pet role-playing interactions without systematically benchmarking LLMs for comprehensive companionship. In this paper, we introduce Pet-Bench, a dedicated benchmark that evaluates LLMs across both self-interaction and human-interaction dimensions. Unlike prior work, Pet-Bench emphasizes self-evolution and developmental behaviors alongside interactive engagement, offering a more realistic reflection of pet companionship. It features diverse tasks such as intelligent scheduling, memory-based dialogues, and psychological conversations, with over 7,500 interaction instances designed to simulate pet behaviors. Evaluation of 28 LLMs reveals significant performance variations linked to model size and inherent capabilities, underscoring the need for specialized optimization in this domain. Pet-Bench serves as a foundational resource for benchmarking pet-related LLM abilities and advancing emotionally immersive human-pet interactions.

    https://arxiv.org/abs/2506.03761


    Normalize Filters! Classical Wisdom for Deep Vision

    oai:arXiv.org:2506.04401v4

    arXiv:2506.04401v4 Announce Type: replace Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

    https://arxiv.org/abs/2506.04401


    U-NetMN and SegNetMN: Modified U-Net and SegNet models for bimodal SAR image segmentation

    oai:arXiv.org:2506.05444v2

    arXiv:2506.05444v2 Announce Type: replace Abstract: Segmenting Synthetic Aperture Radar (SAR) images is crucial for many remote sensing applications, particularly water body detection. However, deep learning-based segmentation models often face challenges related to convergence speed and stability, mainly due to the complex statistical distribution of this type of data. In this study, we evaluate the impact of mode normalization on two widely used semantic segmentation models, U-Net and SegNet. Specifically, we integrate mode normalization, to reduce convergence time while maintaining the performance of the baseline models. Experimental results demonstrate that mode normalization significantly accelerates convergence. Furthermore, cross-validation results indicate that normalized models exhibit increased stability in different zones. These findings highlight the effectiveness of normalization in improving computational efficiency and generalization in SAR image segmentation.

    https://arxiv.org/abs/2506.05444


    What Comes After Harm? Mapping Reparative Actions in AI through Justice Frameworks

    oai:arXiv.org:2506.05687v2

    arXiv:2506.05687v2 Announce Type: replace Abstract: As Artificial Intelligence (AI) systems are integrated into more aspects of society, they offer new capabilities but also cause a range of harms that are drawing increasing scrutiny. A large body of work in the Responsible AI community has focused on identifying and auditing these harms. However, much less is understood about what happens after harm occurs: what constitutes reparation, who initiates it, and how effective these reparations are. In this paper, we develop a taxonomy of AI harm reparation based on a thematic analysis of real-world incidents. The taxonomy organizes reparative actions into four overarching goals: acknowledging harm, attributing responsibility, providing remedies, and enabling systemic change. We apply this framework to a dataset of 1,060 AI-related incidents, analyzing the prevalence of each action and the distribution of stakeholder involvement. Our findings show that reparation efforts are concentrated in early, symbolic stages, with limited actions toward accountability or structural reform. Drawing on theories of justice, we argue that existing responses fall short of delivering meaningful redress. This work contributes a foundation for advancing more accountable and reparative approaches to Responsible AI.

    https://arxiv.org/abs/2506.05687


    Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

    oai:arXiv.org:2506.06522v3

    arXiv:2506.06522v3 Announce Type: replace Abstract: Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

    https://arxiv.org/abs/2506.06522


    RiemannFormer: A Framework for Attention in Curved Spaces

    oai:arXiv.org:2506.07405v3

    arXiv:2506.07405v3 Announce Type: replace Abstract: This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers. In our framework, the attention mainly involves metric tensors, tangent spaces, inner product, and how they relate to each other. These quantities and structures at discrete positions are intricately interconnected via the parallel transport of tangent vectors. To make the learning process more efficient, we reduce the number of parameters through ingenious predefined configurations. Moreover, we introduce an explicit mechanism to highlight a neighborhood by attenuating the remote values, given that transformers inherently neglect local inductive bias. Experimental results demonstrate that our modules deliver significant performance improvements relative to the baseline. More evaluation experiments on visual and large language models will be launched successively.

    https://arxiv.org/abs/2506.07405


    Solving Inequality Proofs with Large Language Models

    oai:arXiv.org:2506.07927v3

    arXiv:2506.07927v3 Announce Type: replace Abstract: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.

    https://arxiv.org/abs/2506.07927


    AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

    oai:arXiv.org:2506.08768v4

    arXiv:2506.08768v4 Announce Type: replace Abstract: Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks-boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://github.com/gufranSabri/deepseek-evals

    https://arxiv.org/abs/2506.08768


    WakeupUrban: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imagery

    oai:arXiv.org:2506.09476v3

    arXiv:2506.09476v3 Announce Type: replace Abstract: Historical satellite imagery archive, such as Keyhole satellite data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation ($\textit{e.g.}$, distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered its analysis. To bridge this gap and enhance understanding of urban development, we introduce $\textbf{WakeupUrbanBench}$, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing remote sensing (RS) datasets, along with a framework for unsupervised segmentation tasks, $\textbf{WakeupUSM}$. First, WakeupUrbanBench serves as a pioneer, expertly annotated dataset built on mid-$20^{\text{th}}$ century RS imagery, involving four key urban classes and spanning 4 cities across 2 continents with nearly 1000 km$^2$ area of diverse urban morphologies, and additionally introducing one present-day city. Second, WakeupUSM is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Comprehensive experiments demonstrate WakeupUSM significantly outperforms existing unsupervised segmentation methods $\textbf{both WakeupUrbanBench and public dataset}$, promising to pave the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and codes will be released at https://github.com/Tianxiang-Hao/WakeupUrban.

    https://arxiv.org/abs/2506.09476


    Oracle-Based Multistep Strategy for Solving Polynomial Systems Over Finite Fields and Algebraic Cryptanalysis of the Aradi Cipher

    oai:arXiv.org:2506.09950v2

    arXiv:2506.09950v2 Announce Type: replace Abstract: The multistep solving strategy consists in a divide-and-conquer approach: when a multivariate polynomial system is computationally infeasible to solve directly, one variable is assigned over the elements of the base finite field, and the procedure is recursively applied to the resulting simplified systems. In a previous work by the same authors (among others), this approach proved effective in the algebraic cryptanalysis of the Trivium cipher. In this paper, we present a new recursive formulation of the corresponding algorithm based on a Depth-First Search strategy, along with a novel complexity analysis leveraging tree structures. We also introduce the notion of an "oracle function", which is intended to determine whether evaluating a new variable is required to simplify the current polynomial system. This notion allows us to unify all previously proposed variants of the multistep strategy, including the classical hybrid approach, by appropriately selecting the oracle function. Finally, we employ the multistep solving strategy in the cryptanalysis of the NSA's recently introduced low-latency block cipher Aradi, achieving a first full-round algebraic attack that exposes structural features in its symbolic model.

    https://arxiv.org/abs/2506.09950


    Meta Pruning via Graph Metanetworks : A Universal Meta Learning Framework for Network Pruning

    oai:arXiv.org:2506.12041v3

    arXiv:2506.12041v3 Announce Type: replace Abstract: We propose an entirely new meta-learning framework for network pruning. It is a general framework that can be theoretically applied to almost all types of networks with all kinds of pruning and has great generality and transferability. Experiments have shown that it can achieve outstanding results on many popular and representative pruning tasks (including both CNNs and Transformers). Unlike all prior works that either rely on fixed, hand-crafted criteria to prune in a coarse manner, or employ learning to prune ways that require special training during each pruning and lack generality. Our framework can learn complex pruning rules automatically via a neural network (metanetwork) and has great generality that can prune without any special training. More specifically, we introduce the newly developed idea of metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and some standard finetuning to prune at state-of-the-art. Our code is available at https://github.com/Yewei-Liu/MetaPruning

    https://arxiv.org/abs/2506.12041


    Shortest Paths in a Weighted Simplicial Complex

    oai:arXiv.org:2506.12921v2

    arXiv:2506.12921v2 Announce Type: replace Abstract: Simplicial complexes are extensively studied in the field of algebraic topology. They have gained attention in recent time due to their applications in fields like theoretical distributed computing and simplicial neural networks. Graphs are mono-dimensional simplicial complex. Graph theory has application in topics like theoretical computer science, operations research, bioinformatics and social sciences. This makes it natural to try to adapt graph-theoretic results for simplicial complexes, which can model more intricate and detailed structures appearing in real-world systems. Though seemingly obvious, we did not find any previous work that looked into this prospect of simplicial complexes. In this article, we define the concept of weighted simplicial complex and $d$-path in a simplicial complex. Both these concepts have the potential to have numerous real-life applications. We start by adapting the Depth-First Search and Breadth-First Search algorithms for our setup. Next, we provide two novel algorithms to find the shortest paths in a weighted simplicial complex. The core principles of our algorithms align with those of Dijkstra$^\prime$s algorithm and Bellman-Ford algorithm for graphs. Hence, this work lays a building block for the sake of integrating graph-theoretic concepts with abstract simplicial complexes.

    https://arxiv.org/abs/2506.12921


    Multipole Attention for Efficient Long Context Reasoning

    oai:arXiv.org:2506.13059v2

    arXiv:2506.13059v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.

    https://arxiv.org/abs/2506.13059


    The Transition Matrix -- A classification of navigational patterns between LMS course sections

    oai:arXiv.org:2506.13275v2

    arXiv:2506.13275v2 Announce Type: replace Abstract: Learning management systems (LMS) like Moodle are increasingly used to support university teaching. As Moodle courses become more complex, incorporating diverse interactive elements, it is important to understand how students navigate through course sections and whether course designs are meeting student needs. While substantial research exists on student usage of individual LMS elements, there is a lack of research on broader navigational patterns between course sections and how these patterns differ across courses. This study analyzes navigational data from 747 courses in the Moodle LMS at a technical university of applied sciences, representing (after filtering) around 4,400 students and 1.8 million logged events. Transition matrices and heat map visualizations are used to identify and quantify common navigational patterns. Findings include that the majority of the analyzed courses exhibit some kind of diagonal pattern, indicating that students typically navigate from the current to the next or previous section.

    https://arxiv.org/abs/2506.13275


    RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis

    oai:arXiv.org:2506.13405v2

    arXiv:2506.13405v2 Announce Type: replace Abstract: With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs' perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.

    https://arxiv.org/abs/2506.13405


    Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation

    oai:arXiv.org:2506.13925v3

    arXiv:2506.13925v3 Announce Type: replace Abstract: Vision Language Models (VLMs) provide rich semantic priors but are underexplored in Semi supervised Semantic Segmentation. Recent attempts to integrate VLMs to inject high level semantics overlook the semantic misalignment between visual and textual representations that arises from using domain invariant text embeddings without adapting them to dataset and image specific contexts. This lack of domain awareness, coupled with limited annotations, weakens the model semantic understanding by preventing effective vision language alignment. As a result, the model struggles with contextual reasoning, shows weak intra class discrimination, and confuses similar classes. To address these challenges, we propose Hierarchical Vision Language transFormer (HVLFormer), which achieves domain aware and domain robust alignment between visual and textual representations within a mask transformer architecture. Firstly, we transform text embeddings from pretrained VLMs into textual object queries, enabling the generation of multi scale, dataset aware queries that capture class semantics from coarse to fine granularity and enhance contextual reasoning. Next, we refine these queries by injecting image specific visual context to align textual semantics with local scene structures and enhance class discrimination. Finally, to achieve domain robustness, we introduce cross view and modal consistency regularization, which enforces prediction consistency within mask-transformer architecture across augmented views. Moreover, it ensures stable vision language alignment during decoding. With less than 1% training data, HVLFormer outperforms state of the art methods on Pascal VOC, COCO, ADE20K, and Cityscapes. Our code and results will be available on GitHub.

    https://arxiv.org/abs/2506.13925


    Visual symbolic mechanisms: Emergent symbol processing in vision language models

    oai:arXiv.org:2506.15871v2

    arXiv:2506.15871v2 Announce Type: replace Abstract: To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by Vision Language Models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a previously unknown set of emergent symbolic mechanisms that support binding specifically in VLMs, via a content-independent, spatial indexing scheme. Moreover, we find that binding errors, when they occur, can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for reducing the number of binding failures exhibited by these models.

    https://arxiv.org/abs/2506.15871


    Camera Calibration via Circular Patterns: A Comprehensive Framework with Detection Uncertainty and Unbiased Projection Model

    oai:arXiv.org:2506.16842v2

    arXiv:2506.16842v2 Announce Type: replace Abstract: Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on the Green theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.

    https://arxiv.org/abs/2506.16842


    Variational Learning of Disentangled Representations

    oai:arXiv.org:2506.17182v2

    arXiv:2506.17182v2 Announce Type: replace Abstract: Disentangled representations enable models to separate factors of variation that are shared across experimental conditions from those that are condition-specific. This separation is essential in domains such as biomedical data analysis, where generalization to new treatments, patients, or species depends on isolating stable biological signals from context-dependent effects. While extensions of the variational autoencoder (VAE) framework have been proposed to address this problem, they frequently suffer from leakage between latent representations, limiting their ability to generalize to unseen conditions. Here, we introduce DISCoVeR, a new variational framework that explicitly separates condition-invariant and condition-specific factors. DISCoVeR integrates three key components: (i) a dual-latent architecture that models shared and specific factors separately; (ii) two parallel reconstructions that ensure both representations remain informative; and (iii) a novel max-min objective that encourages clean separation without relying on handcrafted priors, while making only minimal assumptions. Theoretically, we show that this objective maximizes data likelihood while promoting disentanglement, and that it admits a unique equilibrium. Empirically, we demonstrate that DISCoVeR achieves improved disentanglement on synthetic datasets, natural images, and single-cell RNA-seq data. Together, these results establish DISCoVeR as a principled approach for learning disentangled representations in multi-condition settings.

    https://arxiv.org/abs/2506.17182


    Reliability-Adjusted Prioritized Experience Replay

    oai:arXiv.org:2506.18482v3

    arXiv:2506.18482v3 Announce Type: replace Abstract: Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-10 benchmark.

    https://arxiv.org/abs/2506.18482


    Explicit Constructions of Sum-Rank Metric Codes from Quadratic Kummer Extensions

    oai:arXiv.org:2506.18653v2

    arXiv:2506.18653v2 Announce Type: replace Abstract: This paper introduces new constructions of sum-rank metric codes derived from algebraic function fields, as existing results on such codes remain limited. A major challenge lies in the determination of their parameters. We address this issue by employing quadratic Galois extensions, proposing two general constructions of $2\times2$ sum-rank codes. Analogous to algebraic geometry codes in the Hamming metric, our codes achieve a larger block length compared to existing constructions. We determine explicit parameters including dimensions and minimum distances of our codes, and we present an illustrative example using elliptic function fields. Finally, we discuss the asymptotic behavior of our codes and compare them with the Gilbert-Varshamov-like bound for sum-rank metric codes.

    https://arxiv.org/abs/2506.18653


    AI Copilots for Reproducibility in Science: A Case Study

    oai:arXiv.org:2506.20130v4

    arXiv:2506.20130v4 Announce Type: replace Abstract: Open science initiatives seek to make research outputs more transparent, accessible, and reusable, but ensuring that published findings can be independently reproduced remains a persistent challenge. In this paper we describe an AI-driven "Reproducibility Copilot" that analyzes manuscripts, code, and supplementary materials to generate structured Jupyter Notebooks and recommendations aimed at facilitating computational, or "rote", reproducibility. Our initial results suggest that the copilot has the potential to substantially reduce reproduction time (in one case from over 30 hours to about 1 hour) while achieving high coverage of figures, tables, and results suitable for computational reproduction. The system systematically detects barriers to reproducibility, including missing values for hyperparameters, undocumented preprocessing steps, and incomplete or inaccessible datasets. Although preliminary, these findings suggest that AI tools can meaningfully reduce the burden of reproducibility efforts and contribute to more transparent and verifiable scientific communication.

    https://arxiv.org/abs/2506.20130


    A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots

    oai:arXiv.org:2506.20487v4

    arXiv:2506.20487v4 Announce Type: replace Abstract: Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.

    https://arxiv.org/abs/2506.20487


    MR-COSMO: Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation

    oai:arXiv.org:2506.20991v3

    arXiv:2506.20991v3 Announce Type: replace Abstract: The rapid advancement of vision-language models (VLMs) in 3D domains has accelerated research in text-query-guided point cloud processing, though existing methods underperform in point-level segmentation due to inadequate 3D-text alignment that limits local feature-text context linking. To address this limitation, we propose MR-COSMO, a Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation, establishing explicit alignment between 3D point clouds and text/2D image data through a dedicated direct cross-modal alignment module while implementing a visual-text memory module with specialized feature banks. This direct alignment mechanism enables precise fusion of geometric and semantic features, while the memory module employs specialized banks storing text features, visual features, and their correspondence mappings to dynamically enhance scene-specific representations via attention-based knowledge recall. Comprehensive experiments across 3D instruction, reference, and semantic segmentation benchmarks confirm state-of-the-art performance.

    https://arxiv.org/abs/2506.20991


    Meta Policy Switching for Secure UAV Deconfliction in Adversarial Airspace

    oai:arXiv.org:2506.21127v3

    arXiv:2506.21127v3 Announce Type: replace Abstract: Autonomous UAV navigation using reinforcement learning (RL) is vulnerable to adversarial attacks that manipulate sensor inputs, potentially leading to unsafe behavior and mission failure. Although robust RL methods provide partial protection, they often struggle to generalize to unseen or out-of-distribution (OOD) attacks due to their reliance on fixed perturbation settings. To address this limitation, we propose a meta-policy switching framework in which a meta-level polic dynamically selects among multiple robust policies to counter unknown adversarial shifts. At the core of this framework lies a discounted Thompson sampling (DTS) mechanism that formulates policy selection as a multi-armed bandit problem, thereby minimizing value distribution shifts via self-induced adversarial observations. We first construct a diverse ensemble of action-robust policies trained under varying perturbation intensities. The DTS-based meta-policy then adaptively selects among these policies online, optimizing resilience against self-induced, piecewise-stationary attacks. Theoretical analysis shows that the DTS mechanism minimizes expected regret, ensuring adaptive robustness to OOD attacks and exhibiting emergent antifragile behavior under uncertainty. Extensive simulations in complex 3D obstacle environments under both white-box (Projected Gradient Descent) and black-box (GPS spoofing) attacks demonstrate significantly improved navigation efficiency and higher conflict free trajectory rates compared to standard robust and vanilla RL baselines, highlighting the practical security and dependability benefits of the proposed approach.

    https://arxiv.org/abs/2506.21127


    Active Inference AI Systems for Scientific Discovery

    oai:arXiv.org:2506.21329v4

    arXiv:2506.21329v4 Announce Type: replace Abstract: The rapid evolution of artificial intelligence has led to expectations of transformative impact on science, yet current systems remain fundamentally limited in enabling genuine scientific discovery. This perspective contends that progress turns on closing three mutually reinforcing gaps in abstraction, reasoning and empirical grounding. Central to addressing these gaps is recognizing complementary cognitive modes: thinking as slow, iterative hypothesis generation -- exploring counterfactual spaces where physical laws can be temporarily violated to discover new patterns -- and reasoning as fast, deterministic validation, traversing established knowledge graphs to test consistency with known principles. Abstractions in this loop should be manipulable models that enable counterfactual prediction, causal attribution, and refinement. Design principles -- rather than a monolithic recipe -- are proposed for systems that reason in imaginary spaces and learn from the world: causal, multimodal models for internal simulation; persistent, uncertainty-aware scientific memory that distinguishes hypotheses from established claims; formal verification pathways coupled to computations and experiments. It is also argued that the inherent ambiguity in feedback from simulations and experiments, and underlying uncertainties make human judgment indispensable, not as a temporary scaffold but as a permanent architectural component. Evaluations must assess the system's ability to identify novel phenomena, propose falsifiable hypotheses, and efficiently guide experimental programs toward genuine discoveries.

    https://arxiv.org/abs/2506.21329


    Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems

    oai:arXiv.org:2506.21502v2

    arXiv:2506.21502v2 Announce Type: replace Abstract: Cyber-Physical Systems (CPSs) tightly interconnect digital and physical operations within production environments, enabling real-time monitoring, control, optimization, and autonomous decision-making that directly enhance manufacturing processes and productivity. The inherent complexity of these systems can lead to faults that require robust and interpretable diagnoses to maintain system dependability and operational efficiency. However, manual modeling of faulty behaviors requires extensive domain expertise and cannot leverage the low-level sensor data of the CPS. Furthermore, although powerful, deep learning-based techniques produce black-box diagnostics that lack interpretability, limiting their practical adoption. To address these challenges, we set forth a method that performs unsupervised characterization of system states and state transitions from low-level sensor data, uses several process mining techniques to model faults through interpretable stochastic Petri nets, simulates such Petri nets for a comprehensive understanding of system behavior under faulty conditions, and performs Petri net-based fault diagnosis. The method is applied to the Robotic Arm Dataset (RoAD), a benchmark collected from a robotic arm deployed in a scale-replica smart manufacturing assembly line. The application to RoAD demonstrates the method's effectiveness in modeling, simulating, and classifying faulty behaviors in CPSs. The modeling results demonstrate that our method achieves a satisfactory interpretability-simulation accuracy trade-off with up to 0.676 arc-degree simplicity, 0.395 R^2, and 0.088 RMSE. In addition, the fault identification results show that the method achieves an F1 score of up to 98.925%, while maintaining a low conformance checking time of 0.020 seconds, which competes with other deep learning-based methods.

    https://arxiv.org/abs/2506.21502


    Genuinely multi-dimensional stationarity preserving Finite Volume formulation for nonlinear hyperbolic PDEs

    oai:arXiv.org:2506.21700v2

    arXiv:2506.21700v2 Announce Type: replace Abstract: Classical Finite Volume methods for multi-dimensional problems include stabilization (e.g.\ via a Riemann solver), that is derived by considering several one-dimensional problems in different directions. Such methods therefore ignore a possibly existing balance of contributions coming from different directions, such as the one characterizing multi-dimensional stationary states. Instead of being preserved, they are usually diffused away by such methods. Stationarity preserving methods use a better suited stabilization term that vanishes at the stationary state, allowing the method to preserve it. This work presents a general approach to stationarity preserving Finite Volume methods for nonlinear conservation/balance laws. It is based on a multi-dimensional stationarity preserving quadrature strategy that allows to naturally introduce genuinely multi-dimensional numerical fluxes. The new methods are shown to significantly outperform existing ones even if the latter are of higher order of accuracy and even on non-stationary solutions.

    https://arxiv.org/abs/2506.21700


    SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping

    oai:arXiv.org:2506.22035v3

    arXiv:2506.22035v3 Announce Type: replace Abstract: Recent research has focused on accelerating stencil computations by exploiting emerging hardware like Tensor Cores. To leverage these accelerators, the stencil operation must be transformed to matrix multiplications. However, this transformation introduces undesired sparsity into the kernel matrix, leading to significant redundant computation. In this paper, we present SPIDER, the first system to turn this unresolved sparsity into an optimization opportunity by exploring the potential of Sparse Tensor Cores (SpTCs) for stencil acceleration. Specifically, SPIDER introduces an efficient and elegant transformation method that integrates two cooperative techniques: an ahead-of-time strided swapping transformation for kernel matrices and an on-the-fly row-swapping mechanism for inputs. This rule-based approach effectively transforms stencil computation into operations compatible with SpTCs, introducing only slight compile-time overhead and zero runtime overhead. Additionally, SPIDER incorporates multiple optimizations to maximize computational efficiency. Experimental evaluations demonstrate that SPIDER outperforms vendor library cuDNN by 6.20$\times$ and state-of-the-art (SOTA) Tensor Core-based approaches (ConvStencil, FlashFFTStencil, etc.) by 2.00$\times$ on average.

    https://arxiv.org/abs/2506.22035


    SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

    oai:arXiv.org:2506.23046v3

    arXiv:2506.23046v3 Announce Type: replace Abstract: Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

    https://arxiv.org/abs/2506.23046


    From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

    oai:arXiv.org:2506.23260v2

    arXiv:2506.23260v2 Announce Type: replace Abstract: Autonomous AI agents powered by large language models (LLMs) with structured function-calling interfaces enable real-time data retrieval, computation, and multi-step orchestration. However, the rapid growth of plugins, connectors, and inter-agent protocols has outpaced security practices, leading to brittle integrations that rely on ad-hoc authentication, inconsistent schemas, and weak validation. This survey introduces a unified end-to-end threat model for LLM-agent ecosystems, covering host-to-tool and agent-to-agent communications. We systematically categorize more than thirty attack techniques spanning input manipulation, model compromise, system and privacy attacks, and protocol-level vulnerabilities. For each category, we provide a formal threat formulation defining attacker capabilities, objectives, and affected system layers. Representative examples include Prompt-to-SQL injections and the Toxic Agent Flow exploit in GitHub MCP servers. We analyze attack feasibility, review existing defenses, and discuss mitigation strategies such as dynamic trust management, cryptographic provenance tracking, and sandboxed agent interfaces. The framework is validated through expert review and cross-mapping with real-world incidents and public vulnerability repositories, including CVE and NIST NVD. Compared to prior surveys, this work presents the first integrated taxonomy bridging input-level exploits and protocol-layer vulnerabilities in LLM-agent ecosystems, offering actionable guidance for designing secure and resilient agentic AI systems.

    https://arxiv.org/abs/2506.23260


    Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation

    oai:arXiv.org:2507.00057v2

    arXiv:2507.00057v2 Announce Type: replace Abstract: Generating code from a natural language programming task is one of the most successful applications of Large Language Models (LLMs). Yet, the generated program may be buggy. Without an oracle, such as an existing, correct implementation or a formal specification, can we somehow estimate how likely the generated program is correct? In this paper, we propose a measure of incorrectness, called *incoherence*, that can be estimated efficiently in the absence of an oracle and allows us to establish a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. In our experiments, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives for the average task. In fact, *an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation*. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via incoherence.

    https://arxiv.org/abs/2507.00057


    Towards Resource-Efficient Serverless LLM Inference with SLINFER

    oai:arXiv.org:2507.00507v2

    arXiv:2507.00507v2 Announce Type: replace Abstract: The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore modern platforms and find that: Emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and both CPUs and GPUs can accommodate multiple LLMs simultaneously. We propose SLINFER, a resource-efficient serverless inference scheme tailored for small- to mid-sized LLMs that enables elastic and on-demand sharing across heterogeneous hardware. SLINFER tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at token-level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce operational overhead; and (3) a dual approach that consolidates fragmented instances through proactive preemption and reactive bin-packing. Experimental results on 4 32-core CPUs and 4 A100 GPUs show that SLINFER improves serving capacity by 47% - 62% through sharing, while further leveraging CPUs boosts this to 86% - 154%.

    https://arxiv.org/abs/2507.00507


    Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

    oai:arXiv.org:2507.01131v4

    arXiv:2507.01131v4 Announce Type: replace Abstract: $\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks in which CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $\mathcal{O}(L^3)$ CG paths into a single shared parameter set without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $\mathcal{O}(L^6)$ to $\mathcal{O}(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations. Our code is publicly available as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenMol/TDN}{https://github.com/divelab/AIRS/}).

    https://arxiv.org/abs/2507.01131


    FACM: Flow-Anchored Consistency Models

    oai:arXiv.org:2507.03738v2

    arXiv:2507.03738v2 Announce Type: replace Abstract: Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: Training the network exclusively on a shortcut objective leads to the catastrophic forgetting of the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow, ensuring high trajectory fidelity during training. We introduce the Flow-Anchored Consistency Model (FACM), where a Flow Matching (FM) task serves as a dynamic anchor for the primary CM shortcut objective. Key to this Flow-Anchoring approach is a novel expanded time interval strategy that unifies optimization for a single model while decoupling the two tasks to ensure stable, architecturally-agnostic training. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.70 with just one step (NFE=1) on ImageNet 256x256. To address the challenge of scalability, we develop a memory-efficient Chain-JVP that resolves key incompatibilities with FSDP. This method allows us to scale FACM training on a 14B parameter model (Wan 2.2), accelerating its Text-to-Image inference from 2x40 to 2-8 steps. Our code and pretrained models: https://github.com/ali-vilab/FACM.

    https://arxiv.org/abs/2507.03738


    Low Sets and Closure Properties of Counting Function Classes

    oai:arXiv.org:2507.04110v3

    arXiv:2507.04110v3 Announce Type: replace Abstract: A language L is low for a relativizable complexity class C, if C$^L$ = C. For the classes #P, GapP, and SpanP the exact low classes of languages are known: Low(#P) = UP $\cap$ coUP, Low(GapP) = SPP, and Low(SpanP) = NP $\cap$ coNP. In this paper, we prove that Low(TotP) = P, and give characterizations of low function classes for #P, GapP, TotP, and SpanP. In particular, we prove that Low$_f$(#P) = UPSV$_t$ and Low$_f$(SpanP) = NPSV$_t$. We establish the inclusion relations between NPSV$_t$, UPSV$_t$, and the counting function classes by giving for each of these inclusions an equivalent inclusion between language classes. We also prove that SpanP $\subseteq$ GapP if and only if NP $\subseteq$ SPP, and the inclusion GapP$_+$ $\subseteq$ SpanP implies PH = $\Sigma_{2}^{P}$. For the class #P we prove that its closure under left composition with FP$_+$ is equivalent to #P = UPSV$_t$, and for SpanP this closure is equivalent to SpanP = NPSV$_t$. For the classes #P, GapP, TotP, and SpanP we summarize the known results and show that each of these classes is closed under left composition with FP$_+$ if and only if it collapses to its low class of functions. We also prove that a NPTM with a #P oracle can always make at most one query to the oracle without changing the number of accepting paths.

    https://arxiv.org/abs/2507.04110


    The Generalization Ridge: Information Flow in Natural Language Generation

    oai:arXiv.org:2507.05387v2

    arXiv:2507.05387v2 Announce Type: replace Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information-the mutual information between hidden representations and target outputs-varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers-forming a generalization ridge-before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients-trainable scalar parameters applied to each residual block-which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on intermediate layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.

    https://arxiv.org/abs/2507.05387


    Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift

    oai:arXiv.org:2507.05412v3

    arXiv:2507.05412v3 Announce Type: replace Abstract: We study the problem of learning robust discriminative representations of causally related latent variables given the underlying causal graph and a training set comprising passively collected observational data and interventional data obtained through targeted interventions on some of these latent variables. We desire to learn representations that are robust against the resulting interventional distribution shifts. Existing approaches treat observational and interventional data alike, ignoring the independence relations arising from these interventions, even with known underlying causal models. As a result, their representations lead to large predictive performance disparities between observational and interventional data. This performance disparity worsens when interventional training data is scarce. In this paper, (1) we first identify a strong correlation between this performance disparity and the representations' violation of statistical independence induced during interventions. (2) For linear models, we derive sufficient conditions on the proportion of interventional training data, for which enforcing statistical independence between representations of the intervened node and its non-descendants during interventions lowers the test-time error on interventional data. Combining these insights, (3) we propose RepLIn, a training algorithm that explicitly enforces this statistical independence between interventional representations. We demonstrate the utility of RepLIn on a synthetic dataset, and on real image and text datasets on facial attribute classification and toxicity detection, respectively, with semi-synthetic causal structures. Our experiments show that RepLIn is scalable with the number of nodes in the causal graph and is suitable to improve robustness against interventional distribution shifts of both continuous and discrete latent variables compared to the ERM baselines.

    https://arxiv.org/abs/2507.05412


    DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective

    oai:arXiv.org:2507.05622v3

    arXiv:2507.05622v3 Announce Type: replace Abstract: The widespread application of Deep Learning across diverse domains hinges critically on the quality and composition of training datasets. However, the common lack of disclosure regarding their usage raises significant privacy and copyright concerns. Dataset auditing techniques, which aim to determine if a specific dataset was used to train a given suspicious model, provide promising solutions to addressing these transparency gaps. While prior work has developed various auditing methods, their resilience against dedicated adversarial attacks remains largely unexplored. To bridge the gap, this paper initiates a comprehensive study evaluating dataset auditing from an adversarial perspective. We start with introducing a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). Subsequently, we formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. These formulations and strategies lead to our new benchmark, DATABench, comprising 17 evasion attacks, 5 forgery attacks, and 9 representative auditing methods. Extensive evaluations using DATABench reveal that none of the evaluated auditing methods are sufficiently robust or distinctive under adversarial settings. These findings underscore the urgent need for developing a more secure and reliable dataset auditing method capable of withstanding sophisticated adversarial manipulation. Code is available in https://github.com/shaoshuo-ss/DATABench.

    https://arxiv.org/abs/2507.05622


    DRO-EDL-MPC: Evidential Deep Learning-Based Distributionally Robust Model Predictive Control for Safe Autonomous Driving

    oai:arXiv.org:2507.05710v2

    arXiv:2507.05710v2 Announce Type: replace Abstract: Safety is a critical concern in motion planning for autonomous vehicles. Modern autonomous vehicles rely on neural network-based perception, but making control decisions based on these inference results poses significant safety risks due to inherent uncertainties. To address this challenge, we present a distributionally robust optimization (DRO) framework that accounts for both aleatoric and epistemic perception uncertainties using evidential deep learning (EDL). Our approach introduces a novel ambiguity set formulation based on evidential distributions that dynamically adjusts the conservativeness according to perception confidence levels. We integrate this uncertainty-aware constraint into model predictive control (MPC), proposing the DRO-EDL-MPC algorithm with computational tractability for autonomous driving applications. Validation in the CARLA simulator demonstrates that our approach maintains efficiency under high perception confidence while enforcing conservative constraints under low confidence.

    https://arxiv.org/abs/2507.05710


    CAST-Phys: Contactless Affective States Through Physiological signals Database

    oai:arXiv.org:2507.06080v2

    arXiv:2507.06080v2 Announce Type: replace Abstract: In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.

    https://arxiv.org/abs/2507.06080


    AnthroTAP: Learning Point Tracking with Real-World Motion

    oai:arXiv.org:2507.06233v2

    arXiv:2507.06233v2 Announce Type: replace Abstract: Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic--the only source currently feasible to produce at scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames. We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement-non-rigid deformations, articulated motion, and frequent occlusions-AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency. A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, a challenging general-domain benchmark for tracking any point on diverse rigid and non-rigid objects (e.g., humans, animals, robots, and vehicles). Our approach outperforms recent self-supervised teacher-student models trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs. AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking. Code and datasets will be made publicly available.

    https://arxiv.org/abs/2507.06233


    A Collectivist, Economic Perspective on AI

    oai:arXiv.org:2507.06268v3

    arXiv:2507.06268v3 Announce Type: replace Abstract: Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are impacting the human world as never before. The word ``intelligence'' is being used as a North Star for the development of this technology, with human cognition viewed as a baseline. This view neglects the fact that humans are social animals and that much of our intelligence is social and cultural in origin. Moreover, failing to properly situate aspects of intelligence at the social level contributes to the treatment of the societal consequences of technology as an afterthought. The path forward is not merely more data and compute, and not merely more attention paid to cognitive or symbolic representations, but a thorough blending of economic and social concepts with computational and inferential concepts at the level of algorithm design.

    https://arxiv.org/abs/2507.06268


    Finding One Local Optimum Is Easy -- but What About Two?

    oai:arXiv.org:2507.07524v4

    arXiv:2507.07524v4 Announce Type: replace Abstract: The class PLS (Polynomial Local Search) captures the complexity of finding a solution that is locally optimal and has proven to be an important concept in the theory of local search. It has been shown that local search versions of various combinatorial optimization problems, such as Maximum Independent Set and Max Cut, are complete for this class. Such computational intractability typically arises in local search problems allowing arbitrary weights; in contrast, for unweighted problems, locally optimal solutions can be found in polynomial time under standard settings. In this paper, we pursue the complexity of local search problems from a different angle: We show that computing two locally optimal solutions is NP-hard for various natural unweighted local search problems, including Maximum Independent Set, Minimum Dominating Set, Max SAT, and Max Cut. We also discuss several tractable cases for finding two (or more) local optimal solutions.

    https://arxiv.org/abs/2507.07524


    Coding-Enforced Robust Secure Aggregation for Federated Learning Under Unreliable Communication

    oai:arXiv.org:2507.07565v4

    arXiv:2507.07565v4 Announce Type: replace Abstract: This work studies privacy-preserving federated learning (ppFL) under unreliable communication. In ppFL, zero-sum privacy noises enables privacy protection without sacrificing model accuracy, effectively overcoming the privacy-utility trade-off. However, in practice, unreliable communication can randomly disrupt the coordination of zero-sum noises, leading to aggregation errors and unpredictable partial participation, which severely harm the model accuracy and learning performance. To overcome these challenges, we propose a robust coding-enforced structured secure aggregation method, termed secure cooperative gradient coding (SecCoGC), which enables exact reconstruction of the global model under unreliable communication while allowing for arbitrarily strong privacy preservation. In this paper, a complete problem formulation and constructions of real-field zero-sum privacy noise are presented, and fairness is introduced as a privacy metric. Privacy across all protocol layers in SecCoGC is evaluated, accounting for the correlation among privacy noises and their linear combination under unreliable communication. Moreover, a distinct convergence analysis for the FL algorithm with a binary outcome for global model recovery is provided. Experimental results demonstrate that SecCoGC achieves strong resilience to unreliable communication while maintaining varying levels of privacy preservation, yielding test accuracy improvements of up to 20%-70% over existing benchmark methods.

    https://arxiv.org/abs/2507.07565


    Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

    oai:arXiv.org:2507.07574v3

    arXiv:2507.07574v3 Announce Type: replace Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive ``alignment gap'', where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract binary classification tasks.

    https://arxiv.org/abs/2507.07574


    Can Large Language Models Automate Phishing Warning Explanations? A Controlled Experiment on Effectiveness and User Perception

    oai:arXiv.org:2507.07916v2

    arXiv:2507.07916v2 Announce Type: replace Abstract: Phishing has become a prominent risk in modern cybersecurity, often used to bypass technological defences by exploiting predictable human behaviour. Warning dialogues are a standard mitigation measure, but the lack of explanatory clarity and static content limits their effectiveness. In this paper, we report on our research to assess the capacity of Large Language Models (LLMs) to generate clear, concise, and scalable explanations for phishing warnings. We carried out a large-scale between-subjects user study (N = 750) to compare the influence of warning dialogues supplemented with manually generated explanations against those generated by two LLMs, Claude 3.5 Sonnet and Llama 3.3 70B. We investigated two explanatory styles (feature-based and counterfactual) for their effects on behavioural metrics (click-through rate) and perceptual outcomes (e.g., trust, risk, clarity). The results provide empirical evidence that LLM-generated explanations achieve a level of protection statistically comparable to expert-crafted messages, effectively automating a high-cost task. While Claude 3.5 Sonnet showed a trend towards reducing click-through rates compared to manual baselines, Llama 3.3, despite being perceived as clearer, did not yield the same behavioral benefits. Feature-based explanations were more effective for genuine phishing attempts, whereas counterfactual explanations diminished false-positive rates. Other variables, such as workload, gender, and prior familiarity with warning dialogues, significantly moderated the effectiveness of warnings. These results indicate that LLMs can be used to automatically build explanations for warning users against phishing, and that such solutions are scalable, adaptive, and consistent with human-centred values.

    https://arxiv.org/abs/2507.07916


    Recovering Commutation of Logically Constrained Rewriting and Equivalence Transformations (Full Version)

    oai:arXiv.org:2507.09326v3

    arXiv:2507.09326v3 Announce Type: replace Abstract: Logically constrained term rewriting is a relatively new rewriting formalism that naturally supports built-in data structures, such as integers and bit vectors. In the analysis of logically constrained term rewrite systems (LCTRSs), rewriting constrained terms plays a crucial role. However, this combines rewrite rule applications and equivalence transformations in a closely intertwined way. This intertwining makes it difficult to establish useful theoretical properties for this kind of rewriting and causes problems in implementations -- namely, that impractically large search spaces are often required. To address this issue, we propose in this paper a novel notion of most general constrained rewriting, which operates on existentially constrained terms, a concept recently introduced by the authors. We define a class of left-linear, left-value-free LCTRSs that are general enough to simulate all left-linear LCTRSs and exhibit the desired key property: most general constrained rewriting commutes with equivalence. This property ensures that equivalence transformations can be deferred until after the application of rewrite rules, which helps mitigate the issue of large search spaces in implementations. In addition to that, we show that the original rewriting formalism on constrained terms can be embedded into our new rewriting formalism on existentially constrained terms. Thus, our results are expected to have significant implications for achieving correct and efficient implementations in tools operating on LCTRSs.

    https://arxiv.org/abs/2507.09326


    Rows and Capabilities as Modal Effects

    oai:arXiv.org:2507.10301v2

    arXiv:2507.10301v2 Announce Type: replace Abstract: Effect handlers allow programmers to model and compose computational effects modularly. Effect systems statically guarantee that all effects are handled. Several recent practical effect systems are based on either row polymorphism or capabilities. However, there remains a gap in understanding the precise relationship between effect systems with such disparate foundations. The main difficulty is that in both row-based and capability-based systems, effect tracking is typically entangled with other features such as functions. We propose a uniform framework for encoding, analysing, and comparing effect systems. Our framework exploits and generalises modal effect types, a recent novel effect system which decouples effect tracking from functions via modalities. Modalities offer fine-grained control over when and how effects are tracked, enabling us to express different strategies for effect tracking. We give encodings as macro translations from existing row-based and capability-based effect systems into our framework and show that these encodings preserve types and semantics. Our encodings reveal the essence of effect tracking mechanisms in different effect systems, enable a direct analysis on their differences, and provide practical insights on language design.

    https://arxiv.org/abs/2507.10301


    LLMs Encode Harmfulness and Refusal Separately

    oai:arXiv.org:2507.11878v4

    arXiv:2507.11878v4 Announce Type: replace Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.

    https://arxiv.org/abs/2507.11878


    CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings

    oai:arXiv.org:2507.17234v4

    arXiv:2507.17234v4 Announce Type: replace Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists' significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) employs controlled decoding that completes "Findings" before synthesizing the "Impression", and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive "Findings" section before synthesizing the "Impression" and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on clinical efficacy scores.

    https://arxiv.org/abs/2507.17234


    Technical Indicator Networks (TINs): An Interpretable Neural Architecture Modernizing Classic al Technical Analysis for Adaptive Algorithmic Trading

    oai:arXiv.org:2507.20202v2

    arXiv:2507.20202v2 Announce Type: replace Abstract: Deep neural networks (DNNs) have transformed fields such as computer vision and natural language processing by employing architectures aligned with domain-specific structural patterns. In algorithmic trading, however, there remains a lack of architectures that directly incorporate the logic of traditional technical indicators. This study introduces Technical Indicator Networks (TINs), a structured neural design that reformulates rule-based financial heuristics into trainable and interpretable modules. The architecture preserves the core mathematical definitions of conventional indicators while extending them to multidimensional data and supporting optimization through diverse learning paradigms, including reinforcement learning. Analytical transformations such as averaging, clipping, and ratio computation are expressed as vectorized layer operators, enabling transparent network construction and principled initialization. This formulation retains the clarity and interpretability of classical strategies while allowing adaptive adjustment and data-driven refinement. As a proof of concept, the framework is validated on the Dow Jones Industrial Average constituents using a Moving Average Convergence Divergence (MACD) TIN. Empirical results demonstrate improved risk-adjusted performance relative to traditional indicator-based strategies. Overall, the findings suggest that TINs provide a generalizable foundation for interpretable, adaptive, and extensible learning architectures in structured decision-making domains and indicate substantial commercial potential for upgrading trading platforms with cross-market visibility and enhanced decision-support capabilities.

    https://arxiv.org/abs/2507.20202


    A survey of diversity quantification in natural language processing: The why, what, where and how

    oai:arXiv.org:2507.20858v2

    arXiv:2507.20858v2 Announce Type: replace Abstract: The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations like promoting and inclusion, approximating human linguistic behavior, and increasing systems' performance. Diversity has however often been addressed in an ad hoc manner in NLP, and with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years, with "diversity" or "diverse" in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, what on, where, and how diversity is measured in NLP. Diversity measures are cast upon a unified framework from ecology and economy (Stirling, 2007) with 3 dimensions of diversity: variety, balance and disparity. We discuss the trends which emerge due to this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and a better comparability between various approaches.

    https://arxiv.org/abs/2507.20858


    Can Language Models Discover Scaling Laws?

    oai:arXiv.org:2507.21184v3

    arXiv:2507.21184v3 Announce Type: replace Abstract: Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimize the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrates that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.

    https://arxiv.org/abs/2507.21184


    Improving Generative Ad Text on Facebook using Reinforcement Learning

    oai:arXiv.org:2507.21983v2

    arXiv:2507.21983v2 Announce Type: replace Abstract: Generative artificial intelligence (AI), in particular large language models (LLMs), is poised to drive transformative economic change. LLMs are pre-trained on vast text data to learn general language patterns, but a subsequent post-training phase is critical to align them for specific real-world tasks. Reinforcement learning (RL) is the leading post-training technique, yet its economic impact remains largely underexplored and unquantified. We examine this question through the lens of the first deployment of an RL-trained LLM for generative advertising on Facebook. Integrated into Meta's Text Generation feature, our model, "AdLlama," powers an AI tool that helps advertisers create new variations of human-written ad text. To train this model, we introduce reinforcement learning with performance feedback (RLPF), a post-training method that uses historical ad performance data as a reward signal. In a large-scale 10-week A/B test on Facebook spanning nearly 35,000 advertisers and 640,000 ad variations, we find that AdLlama improves click-through rates by 6.7% (p=0.0296) compared to a supervised imitation model trained on curated ads. This represents a substantial improvement in advertiser return on investment on Facebook. We also find that advertisers who used AdLlama generated more ad variations, indicating higher satisfaction with the model's outputs. To our knowledge, this is the largest study to date on the use of generative AI in an ecologically valid setting, offering an important data point quantifying the tangible impact of RL post-training. Furthermore, the results show that RLPF is a promising and generalizable approach for metric-driven post-training that bridges the gap between highly capable language models and tangible outcomes.

    https://arxiv.org/abs/2507.21983


    Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

    oai:arXiv.org:2507.23453v2

    arXiv:2507.23453v2 Announce Type: replace Abstract: This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

    https://arxiv.org/abs/2507.23453


    Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

    oai:arXiv.org:2508.00522v3

    arXiv:2508.00522v3 Announce Type: replace Abstract: Little research explores the correlation between the expressive ability and generalization ability of the low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat Minima LoRA (FMLoRA) and its efficient version, i.e., EFMLoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFMLoRA achieves optimize efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFMLoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, there are performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is omitted by previous methods.

    https://arxiv.org/abs/2508.00522


    RC-Gossip: Information Freshness in Clustered Networks with Rate-Changing Gossip

    oai:arXiv.org:2508.02657v2

    arXiv:2508.02657v2 Announce Type: replace Abstract: A clustered gossip network is considered in which a source updates its information over time, and end-nodes, organized in clusters through clusterheads, are keeping track of it. The goal for the nodes is to remain as fresh as possible, i.e., have the same information as the source, which we assess by the long-term average binary freshness metric. We introduce a smart mechanism of information dissemination which we coin rate-changing gossip (RC-Gossip). Its main idea is that gossiping is directed towards nodes that need it the most, and hence the rate of gossiping changes based on the number of fresh nodes in the network at a given time. While Stochastic Hybrid System (SHS) analysis has been the norm in studying freshness of gossip networks, we present an equivalent way to analyze freshness using a renewal-reward-based approach. Using that, we show that RC-gossip significantly increases freshness of nodes in different clustered networks, with optimal cluster sizes, compared to traditional gossiping techniques.

    https://arxiv.org/abs/2508.02657


    MalFlows: Context-aware Fusion of Heterogeneous Flow Semantics for Android Malware Detection

    oai:arXiv.org:2508.03588v2

    arXiv:2508.03588v2 Announce Type: replace Abstract: Static analysis, a fundamental technique in Android app examination, enables the extraction of control flows, data flows, and inter-component communications (ICCs), all of which are essential for malware detection. However, existing methods struggle to leverage the semantic complementarity across different types of flows for representing program behaviors, and their context-unaware nature further hinders the accuracy of cross-flow semantic integration. We propose and implement MalFlows, a novel technique that achieves context-aware fusion of heterogeneous flow semantics for Android malware detection. Our goal is to leverage complementary strengths of the three types of flow-related information for precise app profiling. We adopt a heterogeneous information network (HIN) to model the rich semantics across these program flows. We further propose flow2vec, a context-aware HIN embedding technique that distinguishes the semantics of HIN entities as needed based on contextual constraints across different flows and learns accurate app representations through the joint use of multiple meta-paths. The representations are finally fed into a channel-attention-based deep neural network for malware classification. To the best of our knowledge, this is the first study to comprehensively aggregate the strengths of diverse flow-related information for assessing maliciousness within apps. We evaluate MalFlows on a large-scale dataset comprising over 20 million flow instances extracted from more than 31,000 real-world apps. Experimental results demonstrate that MalFlows outperforms representative baselines in Android malware detection, and meanwhile, validate the effectiveness of flow2vec in accurately learning app representations from the HIN constructed over the heterogeneous flows.

    https://arxiv.org/abs/2508.03588


    Understanding Inconsistent State Update Vulnerabilities in Smart Contracts

    oai:arXiv.org:2508.06192v2

    arXiv:2508.06192v2 Announce Type: replace Abstract: Smart contracts enable contract terms to be automatically executed and verified on the blockchain, and recent years have witnessed numerous applications of them in areas such as financial institutions and supply chains. The execution logic of a smart contract is closely related to the contract state, and thus the correct and safe execution of the contract depends heavily on the precise control and update of the contract state. However, the contract state update process can have issues. In particular, inconsistent state update issues can arise for reasons such as unsynchronized modifications. Inconsistent state update bugs have been exploited by attackers many times, but existing detection tools still have difficulty in effectively identifying them. This paper conducts the first large-scale empirical study about inconsistent state update vulnerabilities (that is, inconsistent state update bugs that are exploitable) in smart contracts, aiming to shed light for developers, researchers, tool builders, and language or library designers in order to avoid inconsistent state update vulnerabilities. We systematically investigate 116 inconsistent state update vulnerabilities in 352 real-world smart contract projects, summarizing their root causes, fix strategies, and exploitation methods. Our study provides 11 original and important findings, and we also give the implications of our findings. To illustrate the potential benefits of our research, we also develop a proof-of-concept checker based on one of our findings. The checker effectively detects issues in 64 popular GitHub projects, and 19 project owners have confirmed the detected issues at the time of writing. The result demonstrates the usefulness and importance of our findings for avoiding inconsistent state update vulnerabilities in smart contracts.

    https://arxiv.org/abs/2508.06192


    Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer

    oai:arXiv.org:2508.07710v2

    arXiv:2508.07710v2 Announce Type: replace Abstract: Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for energy-efficient Transformer architectures.While ANN-to-SNN conversion avoids the high training cost of directly trained Spiking Transformers, existing approaches still struggle to handle the nonlinear operations within Transformer blocks, and often require additional fine-tuning of pretrained ANNs.To address these limitations, we propose a training-free and high-performance ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron that combines exponential decay with a multi-basis encoding strategy to effectively approximate nonlinear operations, eliminating the need for weight modifications in pretrained ANNs.Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.

    https://arxiv.org/abs/2508.07710


    Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

    oai:arXiv.org:2508.08468v2

    arXiv:2508.08468v2 Announce Type: replace Abstract: This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.

    https://arxiv.org/abs/2508.08468


    DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation

    oai:arXiv.org:2508.08783v2

    arXiv:2508.08783v2 Announce Type: replace Abstract: Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.

    https://arxiv.org/abs/2508.08783


    Ethics Practices in AI Development: An Empirical Study Across Roles and Regions

    oai:arXiv.org:2508.09219v2

    arXiv:2508.09219v2 Announce Type: replace Abstract: Recent advances in AI applications have raised growing concerns about the need for ethical guidelines and regulations to mitigate the risks posed by these technologies. In this paper, we present a mixed-methods survey study - combining statistical and qualitative analyses - to examine the ethical perceptions, practices, and knowledge of individuals involved in various AI development roles. Our survey comprises 414 participants from 43 countries, representing various roles such as AI managers, analysts, developers, quality assurance professionals, and information security and privacy experts. The results reveal varying degrees of familiarity and experience with AI ethics principles, government initiatives, and risk mitigation strategies across roles, regions, and other demographic factors. Our findings underscore the importance of a collaborative, role-sensitive approach that involves diverse stakeholders in ethical decision-making throughout the AI development lifecycle. We advocate for developing tailored, inclusive solutions to address ethical challenges in AI development, and we propose future research directions and educational strategies to promote ethics-aware AI practices.

    https://arxiv.org/abs/2508.09219


    Decoupling Understanding from Reasoning via Problem Space Mapping for Small-Scale Model Reasoning

    oai:arXiv.org:2508.10019v2

    arXiv:2508.10019v2 Announce Type: replace Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space-a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) mapping natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

    https://arxiv.org/abs/2508.10019


    BEASST: Behavioral Entropic Gradient based Adaptive Source Seeking for Mobile Robots

    oai:arXiv.org:2508.10363v2

    arXiv:2508.10363v2 Announce Type: replace Abstract: This paper presents BEASST (Behavioral Entropic Gradient-based Adaptive Source Seeking for Mobile Robots), a novel framework for robotic source seeking in complex, unknown environments. Our approach enables mobile robots to efficiently balance exploration and exploitation by modeling normalized signal strength as a surrogate probability of source location. Building on Behavioral Entropy(BE) with Prelec's probability weighting function, we define an objective function that adapts robot behavior from risk-averse to risk-seeking based on signal reliability and mission urgency. The framework provides theoretical convergence guarantees under unimodal signal assumptions and practical stability under bounded disturbances. Experimental validation across DARPA SubT and multi-room scenarios demonstrates that BEASST consistently outperforms state-of-the-art methods, achieving 15% reduction in path length and 20% faster source localization through intelligent uncertainty-driven navigation that dynamically transitions between aggressive pursuit and cautious exploration.

    https://arxiv.org/abs/2508.10363


    HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

    oai:arXiv.org:2508.10576v4

    arXiv:2508.10576v4 Announce Type: replace Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks.Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.Project page: \textcolor{brightpink}{https://digital-avatar.github.io/ai/HumanSense/}

    https://arxiv.org/abs/2508.10576


    SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

    oai:arXiv.org:2508.11009v3

    arXiv:2508.11009v3 Announce Type: replace Abstract: The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0--6), middle childhood (7--12), and adolescence (13--18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.

    https://arxiv.org/abs/2508.11009


    LocoMamba: Vision-Driven Locomotion via End-to-End Deep Reinforcement Learning with Mamba

    oai:arXiv.org:2508.11849v3

    arXiv:2508.11849v3 Announce Type: replace Abstract: We introduce LocoMamba, a vision-driven cross-modal DRL framework built on selective state-space models, specifically leveraging Mamba, that achieves near-linear-time sequence modeling, effectively captures long-range dependencies, and enables efficient training with longer sequences. First, we embed proprioceptive states with a multilayer perceptron and patchify depth images with a lightweight convolutional neural network, producing compact tokens that improve state representation. Second, stacked Mamba layers fuse these tokens via near-linear-time selective scanning, reducing latency and memory footprint, remaining robust to token length and image resolution, and providing an inductive bias that mitigates overfitting. Third, we train the policy end-to-end with Proximal Policy Optimization under terrain and appearance randomization and an obstacle-density curriculum, using a compact state-centric reward that balances progress, smoothness, and safety. We evaluate our method in challenging simulated environments with static and moving obstacles as well as uneven terrain. Compared with state-of-the-art baselines, our method achieves higher returns and success rates with fewer collisions, exhibits stronger generalization to unseen terrains and obstacle densities, and improves training efficiency by converging in fewer updates under the same compute budget.

    https://arxiv.org/abs/2508.11849


    An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

    oai:arXiv.org:2508.11931v2

    arXiv:2508.11931v2 Announce Type: replace Abstract: We present an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves $\tilde{O}(\min\{d^2\sqrt{T}, \sqrt{d^3T\log K}\})$ regret and runs in $\text{poly}(d,C,T)$ time, where $d$ is the feature dimension, $C$ is an upper bound on the number of linear constraints defining the action set in each round, $K$ is an upper bound on the number of actions in each round, and $T$ is number of rounds. This resolves the open question by Liu et al. (2023) on whether one can obtain $\text{poly}(d)\sqrt{T}$ regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets where the action sets can be described by a polynomial number of linear constraints, our algorithm is the first to achieve $\text{poly}(d)\sqrt{T}$ regret in polynomial time, while no prior algorithm achieves even $o(T)$ regret in polynomial time to our knowledge. When a simulator is available, the regret bound can be improved to $\tilde{O}(d\sqrt{L^\star})$, where $L^\star$ is the cumulative loss of the best policy.

    https://arxiv.org/abs/2508.11931


    Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

    oai:arXiv.org:2508.12461v3

    arXiv:2508.12461v3 Announce Type: replace Abstract: In August 2025, OpenAI released GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemars test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments. More details and evaluation scripts are available at https://ai-agent-lab.github.io/gpt-oss (Project Webpage).

    https://arxiv.org/abs/2508.12461


    ALIGN: Word Association Learning for Cultural Alignment in Large Language Models

    oai:arXiv.org:2508.13426v2

    arXiv:2508.13426v2 Announce Type: replace Abstract: Large language models (LLMs) exhibit cultural bias from overrepresented viewpoints in training data, yet cultural alignment remains a challenge due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient and cognitively grounded method: fine-tuning LLMs on native speakers' word-association norms, leveraging cognitive psychology findings that such associations capture cultural knowledge. Using word association datasets from native speakers in the US (English) and China (Mandarin), we train Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning and preference optimization. We evaluate models' cultural alignment through a two-tier evaluation framework that spans lexical associations and cultural value alignment using the World Values Survey. Results show significant improvements in lexical alignment (16-20% English, 43-165% Mandarin on Precision@5) and high-level cultural value shifts. On a subset of 50 questions where US and Chinese respondents diverge most, fine-tuned Qwen nearly doubles its response alignment with Chinese values (13 to 25). Remarkably, our trained 7-8B models match or exceed vanilla 70B baselines, demonstrating that a few million of culture-grounded associations achieve value alignment without expensive retraining. Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.

    https://arxiv.org/abs/2508.13426


    BioGAP-Ultra: A Modular Edge-AI Platform for Wearable Multimodal Biosignal Acquisition and Processing

    oai:arXiv.org:2508.13728v2

    arXiv:2508.13728v2 Announce Type: replace Abstract: The growing demand for continuous physiological monitoring and human-machine interaction in real-world settings calls for wearable platforms that are flexible, low-power, and capable of on-device intelligence. This work presents BioGAP-Ultra, an advanced multimodal biosensing platform that supports synchronized acquisition of diverse electrophysiological and hemodynamic signals such as EEG, EMG, ECG, and PPG while enabling embedded AI processing at state-of-the-art energy efficiency. BioGAP-Ultra is a major extension of our previous BioGAP design aimed at meeting the rapidly growing requirements of wearable biosensing applications. It features (i) increased on-device storage (x2 SRAM, x4 FLASH), (ii) improved wireless connectivity (supporting up to 1.4 Mbit/s bandwidth, x4 higher than BioGAP), (iii) enhanced number of signal modalities (from 3 to 5) and analog input channels (x2). Further, it is accompanied by a real-time visualization and analysis software suite that supports the hardware design, providing access to raw data and real-time configurability on a mobile phone. Finally, we demonstrate the system's versatility through integration into various wearable form factors: an EEG-PPG headband consuming 32.8 mW, an EMG sleeve at 26.7 mW, and an ECG-PPG chestband requiring only 9.3 mW for continuous acquisition and streaming, tailored for diverse biosignal applications. To showcase its edge-AI capabilities, we further deploy two representative on-device applications: (1) ECG-PPG-based PAT estimation at 8.6 mW, and (2) EMG-ACC-based classification of reach-and-grasp motion phases, achieving 79.9 % $\pm$ 5.7 % accuracy at 23.6 mW. All hardware and software design files are also released open-source with a permissive license.

    https://arxiv.org/abs/2508.13728


    Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

    oai:arXiv.org:2508.14029v4

    arXiv:2508.14029v4 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks, as well as on code generation tasks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.

    https://arxiv.org/abs/2508.14029


    A Note on the Convergence of Symmetric Triangle Quadrature Rules

    oai:arXiv.org:2508.15133v3

    arXiv:2508.15133v3 Announce Type: replace Abstract: Symmetric polynomial quadrature rules for triangles are commonly used to efficiently integrate two-dimensional domains in finite-element-type problems. While the development of such rules focuses on the maximum degree a given number of points can exactly integrate, smooth integrands are generally not polynomials of finite degree. Therefore, for such integrands, one needs to balance integration accuracy and computational cost. A natural approach to this balance is to choose the number of points such that the convergence rate with respect to the mesh size $h$ matches that of the other properties of the scheme, such as the planar or curved triangles that approximate the geometry or the basis functions that approximate the solution. In general, it is expected that a quadrature rule capable of integrating polynomials up to degree $d$ yields an integration error that is $\mathcal{O}(h^p)$, where $p=d+1$. However, as we describe in this paper, for symmetric triangle quadrature rules, when $d$ is even, $p=d+2$; therefore, for a $p^\text{th}$-order-accurate quadrature rule, fewer quadrature points are necessary, reducing the time required for matrix assembly in finite-element-type problems. This reduction in cost is modest for local differential operators that yield sparse matrices but appreciable for global integral operators that yield dense matrices. In this paper, we briefly summarize the details of symmetric triangle quadrature rules, discuss error implications for quadrature rules for one dimension and triangles, and we provide numerical examples that support our observation that polynomials that exactly integrate even maximum degrees converge faster than the conventional expectation for sequences of regular meshes.

    https://arxiv.org/abs/2508.15133


    The Digital Life of Parisian Parks: Multifunctionality and Urban Context Uncovered by Mobile Application Traffic

    oai:arXiv.org:2508.15516v2

    arXiv:2508.15516v2 Announce Type: replace Abstract: Urban parks support public health, but landscape architecture typically examines them through form and function. Prior equitable access research focused on park form, while functional studies relied on small-scale surveys, movement data, or broad usage metrics, missing specific activities and visit motivations. This gap limits our grasp of parks' functional diversity. We address this with a novel method refining mobile base station coverage via antenna azimuths to isolate park-specific traffic from surroundings. Using Paris as a case study, we process 492 million hourly per-app mobile records (35% market share) from 45 urban parks. We test the central-city hypothesis (multifunctional parks in dense, high-rent zones due to land constraints) and socio-spatial hypothesis (parks reflecting neighborhood routines and preferences). Results reveal parks' unique mobile traffic signatures, distinct from urban contexts and each other. Clustering by temporal and app patterns identifies three types: lunchbreak, cultural, and recreational parks, linked to health-promoting visitation motives. Central parks show diverse apps and peak usage; suburban recreational parks mirror local demographics, like income-aligned app preferences. This demonstrates mobile traffic's power as a proxy for urban green space activities, with key implications for park design, public health, and well-being strategies.

    https://arxiv.org/abs/2508.15516


    From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System

    oai:arXiv.org:2508.15811v2

    arXiv:2508.15811v2 Announce Type: replace Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34\% relative increase in user engagement as measured by click-through rate in live A/B tests.

    https://arxiv.org/abs/2508.15811


    ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

    oai:arXiv.org:2508.16325v2

    arXiv:2508.16325v2 Announce Type: replace Abstract: Large Language Models have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a range of vulnerabilities, including targeted misuse and accidental user profiling. This work introduces \textbf{ConceptGuard}, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, ConceptGuard enables building robust safety guardrails -- offering fully explainable and generalizable defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in the mechanistic interpretability of LLMs, our approach provides evidence for a shared activation geometry for jailbreak attacks in the representation space, a potential foundation for designing more interpretable and generalizable safeguards against attackers.

    https://arxiv.org/abs/2508.16325


    Few-shot Class-incremental Fault Diagnosis by Preserving Class-Agnostic Knowledge with Dual-Granularity Representations

    oai:arXiv.org:2508.16634v5

    arXiv:2508.16634v5 Announce Type: replace Abstract: Few-Shot Class-Incremental Fault Diagnosis (FSC-FD), which aims to continuously learn from new fault classes with only a few samples without forgetting old ones, is critical for real-world industrial systems. However, this challenging task severely amplifies the issues of catastrophic forgetting of old knowledge and overfitting on scarce new data. To address these challenges, this paper proposes a novel framework built upon Dual-Granularity Representations, termed the Dual-Granularity Guidance Network (DGGN). Our DGGN explicitly decouples feature learning into two parallel streams: 1) a fine-grained representation stream, which utilizes a novel Multi-Order Interaction Aggregation module to capture discriminative, class-specific features from the limited new samples. 2) a coarse-grained representation stream, designed to model and preserve general, class-agnostic knowledge shared across all fault types. These two representations are dynamically fused by a multi-semantic cross-attention mechanism, where the stable coarse-grained knowledge guides the learning of fine-grained features, preventing overfitting and alleviating feature conflicts. To further mitigate catastrophic forgetting, we design a Boundary-Aware Exemplar Prioritization strategy. Moreover, a decoupled Balanced Random Forest classifier is employed to counter the decision boundary bias caused by data imbalance. Extensive experiments on the TEP benchmark and a real-world MFF dataset demonstrate that our proposed DGGN achieves superior diagnostic performance and stability compared to state-of-the-art FSC-FD approaches. Our code is publicly available at https://github.com/MentaY/DGGN

    https://arxiv.org/abs/2508.16634


    GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation

    oai:arXiv.org:2508.16994v2

    arXiv:2508.16994v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.

    https://arxiv.org/abs/2508.16994


    Physics-informed neural network for fatigue life prediction of irradiated austenitic and ferritic/martensitic steels

    oai:arXiv.org:2508.17303v2

    arXiv:2508.17303v2 Announce Type: replace Abstract: This study proposes a Physics-Informed Neural Network (PINN) framework to predict the low-cycle fatigue (LCF) life of irradiated austenitic and ferritic/martensitic (F/M) steels used in nuclear reactors. During operation, these materials undergo cyclic loading and irradiation at elevated temperatures, resulting in complex degradation mechanisms that traditional empirical or purely data-driven models often fail to capture accurately. The developed PINN model incorporates physical fatigue life constraints into its loss function, improving prediction accuracy, reliability, and generalizability. Trained on 495 data points, including both irradiated and unirradiated conditions, the model is more robust than traditional machine learning models, such as Random Forest, Gradient Boosting, eXtreme Gradient Boosting, and the conventional Neural Network. Model interpretability assessed via SHapley Additive exPlanations analysis revealed that strain amplitude, irradiation dose, and test temperature are the dominant features, each exhibiting a physically consistent inverse correlation with fatigue life. Univariate and multivariate analyses showed that strain amplitude is the primary driver of fatigue degradation in both alloy classes, while irradiation dose and temperature introduce alloy-specific sensitivities. The PINN successfully captured key mechanistic trends, including the comparatively stable irradiation response and dose saturation behaviour of F/M steels, as well as a pronounced reduction in fatigue life at elevated temperatures exceeding the tempering threshold. Overall, the proposed PINN framework offers a reliable and interpretable tool for predicting fatigue life in irradiated structural alloys, thereby supporting informed materials selection and performance assessment for advanced nuclear reactor applications.

    https://arxiv.org/abs/2508.17303


    Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

    oai:arXiv.org:2508.17364v3

    arXiv:2508.17364v3 Announce Type: replace Abstract: The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.

    https://arxiv.org/abs/2508.17364


    SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling

    oai:arXiv.org:2508.17756v2

    arXiv:2508.17756v2 Announce Type: replace Abstract: Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SUPERGEN, an efficient tile-based framework for ultra-high-resolution video generation. SUPERGEN features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SUPERGEN incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SUPERGEN also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations show that SUPERGEN maximizes performance gains while achieving high output quality across various benchmarks.

    https://arxiv.org/abs/2508.17756


    Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark

    oai:arXiv.org:2508.19005v5

    arXiv:2508.19005v5 Announce Type: replace Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as "second nature". We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm

    https://arxiv.org/abs/2508.19005


    Color Bind: Exploring Color Perception in Text-to-Image Models

    oai:arXiv.org:2508.19791v3

    arXiv:2508.19791v3 Announce Type: replace Abstract: Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.

    https://arxiv.org/abs/2508.19791


    Solvable Tuple Patterns and Their Applications to Program Verification

    oai:arXiv.org:2508.20365v2

    arXiv:2508.20365v2 Announce Type: replace Abstract: Despite the recent progress of automated program verification techniques, fully automated verification of programs manipulating recursive data structures remains a challenge. We introduce solvable tuple patterns (STPs) and conjunctive STPs (CSTPs), novel formalisms for expressing and inferring invariants between list-like recursive data structures. A distinguishing feature of STPs is that they can be efficiently inferred from only a small number of positive samples; no negative samples are required. After presenting properties and inference algorithms of STPs and CSTPs, we show how to incorporate the CSTP inference into a CHC (Constrained Horn Clauses) solver supporting list-like data structures, which serves as a uniform backend for automated program verification tools. A CHC solver incorporating the (C)STP inference has won the ADT-LIN category of CHC-COMP 2025 by a significant margin.

    https://arxiv.org/abs/2508.20365


    Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

    oai:arXiv.org:2509.02718v3

    arXiv:2509.02718v3 Announce Type: replace Abstract: Increasing demand for Large Language Models (LLMs) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in overall performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.

    https://arxiv.org/abs/2509.02718


    False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

    oai:arXiv.org:2509.03888v3

    arXiv:2509.03888v3 Announce Type: replace Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.

    https://arxiv.org/abs/2509.03888


    SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution

    oai:arXiv.org:2509.03913v3

    arXiv:2509.03913v3 Announce Type: replace Abstract: Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies with a hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employs a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets

    https://arxiv.org/abs/2509.03913


    Do MLLMs Really Understand the Charts?

    oai:arXiv.org:2509.04457v2

    arXiv:2509.04457v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. We argue that current MLLMs rely largely on visual recognition rather than visual reasoning to interpret the charts, and visual estimation of numerical values is one of the most fundamental capabilities in chart understanding that require complex visual reasoning. To prove this, we introduce ChartVRBench, a benchmark meticulously designed to isolate and evaluate visual reasoning ability in chart understanding. Furthermore, we propose ChartVR-3B/7B trained with a novel Visual Reasoning Reinforcement Finetuning (VR-RFT) strategy to strengthen genuine chart visual reasoning abilities. Extensive experiments show that ChartVR achieves superior performance on ChartVRBench, outperforming even powerful proprietary models. Moreover, the visual reasoning skills cultivated by the proposed VR-RFT demonstrate strong generalization, leading to significant performance gains across a diverse suite of public chart understanding benchmarks. The code and dataset will be publicly available upon publication.

    https://arxiv.org/abs/2509.04457


    PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

    oai:arXiv.org:2509.04467v4

    arXiv:2509.04467v4 Announce Type: replace Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a pruning method that is highly integrated with PD disaggregation, enabling more precise pruning of blocks. Our approach constructs pruning and distillation sets to perform iterative block removal, obtaining better pruning solutions. Moreover, we analyze the pruning sensitivity of the prefill and decode stages and identify removable blocks specific to each stage, making it well suited for PD disaggregation deployment. Extensive experiments demonstrate our approach consistently achieves strong performance in both PD disaggregation and PD unified (non-PD disaggregation) settings, and can also be extended to other non-block pruning methods. Under the same settings, our method achieves improved performance and faster inference.

    https://arxiv.org/abs/2509.04467


    Accelerating Large Language Model Inference via Early-Exiting Algorithms

    oai:arXiv.org:2509.05915v2

    arXiv:2509.05915v2 Announce Type: replace Abstract: Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.

    https://arxiv.org/abs/2509.05915


    An upper bound of the silhouette validation metric for clustering

    oai:arXiv.org:2509.08625v3

    arXiv:2509.08625v3 Announce Type: replace Abstract: The silhouette coefficient quantifies, for each observation, the balance between within-cluster cohesion and between-cluster separation, taking values in [-1, 1]. The average silhouette width (ASW ) is a widely used internal measure of clustering quality, with higher values indicating more cohesive and well-separated clusters. However, the dataset-specific maximum of ASW is typically unknown, and the standard upper limit of 1 is rarely attainable. In this work, we derive for each data point a sharp upper bound on its silhouette width and aggregate these to obtain a canonical upper bound on the ASW. This bound-often substantially below 1-enhances the interpretability of empirical ASW values by providing guidance on how close a given clustering result is to the best possible outcome for that dataset. We evaluate the usefulness of the upper bound on a variety of datasets and conclude that it can meaningfully enrich cluster quality evaluation, but its practical relevance depends on the given dataset. Finally, we extend the framework to establish an upper bound of the macro-averaged silhouette.

    https://arxiv.org/abs/2509.08625


    Smart Device Development for Gait Monitoring: Multimodal Feedback in an Interactive Foot Orthosis, Walking Aid, and Mobile Application

    oai:arXiv.org:2509.09359v2

    arXiv:2509.09359v2 Announce Type: replace Abstract: Smart assistive technologies such as sensor-based footwear and walking aids offer promising opportunities for gait rehabilitation through real-time feedback and patient-centered monitoring. While biofeedback applications show great potential, current research rarely explores integrated closed-loop systems with device- and modality-specific feedback. In this work, we present a modular sensor-based system combining a smart foot orthosis and an instrumented forearm crutch to deliver real-time vibrotactile biofeedback. The system integrates plantar pressure and motion sensing, vibrotactile feedback, and wireless communication via a smartphone application. We conducted a user study with eight participants to validate the system's feasibility for mobile gait detection and app usability, and to evaluate different vibrotactile feedback types across the orthosis and forearm crutch. The results indicate that pattern-based vibrotactile feedback was rated as more useful and suitable for regular use than simple vibration alerts. Moreover, participants reported clear perceptual differences between feedback delivered via the orthosis and the forearm crutch, indicating device-dependent feedback perception. The findings highlight the relevance of feedback strategy design beyond hardware implementation and inform the development of user-centered haptic biofeedback systems.

    https://arxiv.org/abs/2509.09359


    Toward quantum-safe scalable networks: an open, standards-aware key management framework

    oai:arXiv.org:2509.09453v2

    arXiv:2509.09453v2 Announce Type: replace Abstract: With the advent of quantum computing, the increasing threats to security poses a great challenge to communication networks. Recent innovations in this field resulted in promising technologies such as Quantum Key Distribution (QKD), which enables the generation of unconditionally secure keys, establishing secure communications between remote nodes. Additionally, QKD networks enable the interconnection of multinode architectures, extending the point-to-point nature of QKD. However, due to the limitations of the current state of technology, the scalability of QKD networks remains a challenge toward feasible implementations. When it comes to long-distance implementations, trusted relay nodes partially solve the distance issue through the forwarding of the distributed keys, allowing applications that do not have a direct QKD link to securely share key material. Even though the relay procedure itself has been extensively studied, the establishment of the relaying node path still lacks a solution. This paper proposes an innovative network architecture that solves the challenges of Key Management System (KMS) identification, relay path discovery, and scalability of QKD networks by integrating Software-Defined Networking (SDN) principles, and establishing high-level virtual KMSs (vKMS) in each node and creating a new entity called the Quantum Security Controller (QuSeC). The vKMS serves the end-user key requests, managing the multiple KMSs within the node and abstracting the user from discovering the correct KMS. Additionally, based on the high-level view of the network topology and status, the QuSeC serves the path discovery requests from vKMSs, computing the end-to-end (E2E) relay path and applying security policies. The paper also provides a security analysis of the proposal, identifying the security levels of the architecture and analyzing the core networking security properties.

    https://arxiv.org/abs/2509.09453


    Towards a Common Framework for Autoformalization

    oai:arXiv.org:2509.09810v3

    arXiv:2509.09810v3 Announce Type: replace Abstract: Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit - instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems.

    https://arxiv.org/abs/2509.09810


    Faster Results from a Smarter Schedule: Reframing Collegiate Cross Country through Analysis of the National Running Club Database

    oai:arXiv.org:2509.10600v4

    arXiv:2509.10600v4 Announce Type: replace Abstract: Collegiate cross country teams often build their season schedules on intuition rather than evidence, partly because large-scale performance datasets are not publicly accessible. To address this limitation, we introduce the National Running Club Database (NRCD), the first openly available dataset to aggregate 23,725 race results from 7,594 collegiate club athletes across the 2023-2025 seasons. Unlike existing resources, NRCD includes detailed course metadata, allowing us to develop two standardized performance metrics: Converted Only (distance correction) and Standardized (distance, weather, and elevation adjusted). Using these standardized measures, we find that athletes with slower initial performances exhibit the greatest improvement within a season, and that race frequency is the strongest predictor of improvement. Using six machine learning models, random forest achieves the highest accuracy (r squared equals 0.92), revealing that athletes who race more frequently progress significantly faster than those who do not. At the team level, programs whose athletes race at least four times during the regular season have substantially higher odds of placing in the top 15 at nationals (chi-squared less than 0.01). These results challenge common coaching practices that favor minimal racing before championship meets. Our findings demonstrate that a data-informed scheduling strategy improves both individual development and team competitiveness. The NRCD provides a new foundation for evidence-based decision-making in collegiate cross country and opens opportunities for further research on standardized, longitudinal athlete performance modeling.

    https://arxiv.org/abs/2509.10600


    Gaussian-Plus-SDF SLAM: High-fidelity 3D Reconstruction at 150+ fps

    oai:arXiv.org:2509.11574v2

    arXiv:2509.11574v2 Announce Type: replace Abstract: While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-based approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data; insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized signed distance field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-based methods), while Gaussians undergo iterative optimization. Our representation enables significant Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences, faster by an order-of-magnitude than state-of-the-art techniques while maintaining comparable reconstruction quality. The source code and data are available at https://gapszju.github.io/GPS-SLAM.

    https://arxiv.org/abs/2509.11574


    Model Predictive Control with High-Probability Safety Guarantee for Nonlinear Stochastic Systems

    oai:arXiv.org:2509.11584v2

    arXiv:2509.11584v2 Announce Type: replace Abstract: We present a model predictive control (MPC) framework for nonlinear stochastic systems that ensures safety guarantee with high probability. Unlike most existing stochastic MPC schemes, our method adopts a set-erosion that converts the probabilistic safety constraint into a tractable deterministic safety constraint on a smaller safe set over deterministic dynamics. As a result, our method is compatible with any off-the-shelf deterministic MPC algorithm. The key to the effectiveness of our method is a tight bound on the stochastic fluctuation of a stochastic trajectory around its nominal version. Our method is scalable and can guarantee safety with high probability level (e.g., 99.99%), making it particularly suitable for safety-critical applications involving complex nonlinear dynamics. Rigorous analysis is conducted to establish a theoretical safety guarantee, and numerical experiments are provided to validate the effectiveness of the proposed MPC method.

    https://arxiv.org/abs/2509.11584


    MALLM: Multi-Agent Large Language Models Framework

    oai:arXiv.org:2509.11656v3

    arXiv:2509.11656v3 Announce Type: replace Abstract: Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise. Current frameworks for multi-agent debate are often designed towards tool use, lack integrated evaluation, or provide limited configurability of agent personas, response generators, discussion paradigms, and decision protocols. We introduce MALLM (Multi-Agent Large Language Models), an open-source framework that enables systematic analysis of MAD components. MALLM offers more than 144 unique configurations of MAD, including (1) agent personas (e.g., Expert, Personality), (2) response generators (e.g., Critical, Reasoning), (3) discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g., Voting, Consensus). MALLM uses simple configuration files to define a debate. Furthermore, MALLM can load any textual Hugging Face dataset (e.g., MMLU-Pro, WinoGrande) and provides an evaluation pipeline for easy comparison of MAD configurations. MALLM enables researchers to systematically configure, run, and evaluate debates for their problems, facilitating the understanding of the components and their interplay.

    https://arxiv.org/abs/2509.11656


    Comparative Analysis of Wave Scattering Numerical Modeling Using the Boundary Element Method and Physics-Informed Neural Networks

    oai:arXiv.org:2509.12483v2

    arXiv:2509.12483v2 Announce Type: replace Abstract: This study compares the Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving the two-dimensional Helmholtz equation in wave scattering problems. The objective is to evaluate the performance of both methods under the same conditions. We solve the Helmholtz equation using BEM and PINNs for the same scattering problem. PINNs are trained by minimizing the residual of the governing equations and boundary conditions with their configuration determined through hyperparameter optimization, while BEM is applied using boundary discretization. Both methods are evaluated in terms of solution accuracy and computation time. We conducted numerical experiments by varying the number of boundary integration points for the BEM and the number of hidden layers and neurons per layer for the PINNs. We performed a hyperparameter tuning to identify an adequate PINN configuration for this problem as a network with 3 hidden layers and 25 neurons per layer, using a learning rate of $10^{-2}$ and a sine activation function. At comparable levels of accuracy, the assembly and solution of the BEM system required a computational time on the order of $10^{-2}$~s, whereas the training time of the PINN was on the order of $10^{2}$~s, corresponding to a difference of approximately four orders of magnitude. However, once trained, the PINN achieved evaluation times on the order of $10^{-2}$~s, which is about two orders of magnitude faster than the evaluation of the BEM solution at interior points. This work establishes a procedure for comparing BEM and PINNs. It also presents a direct comparison between the two methods for the scattering problem. The analysis provides quantitative data on their performance, supporting their use in future research on wave propagation problems and outlining challenges and directions for further investigation.

    https://arxiv.org/abs/2509.12483


    A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks

    oai:arXiv.org:2509.14285v3

    arXiv:2509.14285v3 Announce Type: replace Abstract: Prompt injection attacks represent a major vulnerability in Large Language Model (LLM) deployments, where malicious instructions embedded in user inputs can override system prompts and induce unintended behaviors. This paper presents a novel multi-agent defense framework that employs specialized LLM agents in coordinated pipelines to detect and neutralize prompt injection attacks in real-time. We evaluate our approach using two distinct architectures: a sequential chain-of-agents pipeline and a hierarchical coordinator-based system. Our comprehensive evaluation on 55 unique prompt injection attacks, grouped into 8 categories and totaling 400 attack instances across two LLM platforms (ChatGLM and Llama2), demonstrates significant security improvements. Without defense mechanisms, baseline Attack Success Rates (ASR) reached 30% for ChatGLM and 20% for Llama2. Our multi-agent pipeline achieved 100% mitigation, reducing ASR to 0% across all tested scenarios. The framework demonstrates robustness across multiple attack categories including direct overrides, code execution attempts, data exfiltration, and obfuscation techniques, while maintaining system functionality for legitimate queries.

    https://arxiv.org/abs/2509.14285


    Stochastic Bilevel Optimization with Heavy-Tailed Noise

    oai:arXiv.org:2509.14952v2

    arXiv:2509.14952v2 Announce Type: replace Abstract: This paper considers the smooth bilevel optimization in which the lower-level problem is strongly convex and the upper-level problem is possibly nonconvex. We focus on the stochastic setting where the algorithm can access the unbiased stochastic gradient evaluation with heavy-tailed noise, which is prevalent in many machine learning applications, such as training large language models and reinforcement learning. We propose a nested-loop normalized stochastic bilevel approximation (N$^2$SBA) for finding an $\epsilon$-stationary point with the stochastic first-order oracle (SFO) complexity of $\tilde{\mathcal{O}}\big(\kappa^{\frac{7p-3}{p-1}} \sigma^{\frac{p}{p-1}} \epsilon^{-\frac{4 p - 2}{p-1}}\big)$, where $\kappa$ is the condition number, $p\in(1,2]$ is the order of central moment for the noise, and $\sigma$ is the noise level. Furthermore, we specialize our idea to solve the nonconvex-strongly-concave minimax optimization problem, achieving an $\epsilon$-stationary point with the SFO complexity of~$\tilde{\mathcal O}\big(\kappa^{\frac{2p-1}{p-1}} \sigma^{\frac{p}{p-1}} \epsilon^{-\frac{3p-2}{p-1}}\big)$. All the above upper bounds match the best-known results under the special case of the bounded variance setting, i.e., $p=2$. We also conduct the numerical experiments to show the empirical superiority of the proposed methods.

    https://arxiv.org/abs/2509.14952


    Synthetic bootstrapped pretraining

    oai:arXiv.org:2509.15248v3

    arXiv:2509.15248v3 Announce Type: replace Abstract: We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter and a 6B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers up to 60% of performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.

    https://arxiv.org/abs/2509.15248


    CoopQ: Cooperative Game Inspired Layerwise Mixed Precision Quantization for LLMs

    oai:arXiv.org:2509.15455v2

    arXiv:2509.15455v2 Announce Type: replace Abstract: Large Language Models (LLMs) promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook critical inter-layer interactions affecting overall performance. To address these limitations, we first frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Leveraging the SPQE estimates, we propose Cooperative Game Inspired Mixed-Precision Quantization (CoopQ) which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2 or 4-bit precision to layers under strict memory constraints. Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate CoopQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bit down to 2 bit, CoopQ cuts Perplexity by 20 - 80 % relative to the best baseline, with the margin growing as the bit-width tightens.

    https://arxiv.org/abs/2509.15455


    Automatic Microarchitecture-Aware Custom Instruction Design for RISC-V Processors

    oai:arXiv.org:2509.15782v2

    arXiv:2509.15782v2 Announce Type: replace Abstract: An Application-Specific Instruction Set Processor(ASIP) is a specialized microprocessor that provides a trade-off between the programmability of a General Purpose Processor (GPP) and the performance and energy-efficiency of dedicated hardware accelerators. ASIPs are often derived from off-the-shelf GPPs extended by custom instructions tailored towards a specific software workload. One of the most important challenges of designing an ASIP is to find said custom instructions that help to increase performance without being too costly in terms of area and power consumption. To date, solving this challenge is relatively labor-intensive and typically performed manually. Addressing the lack of automation, we present Custom Instruction Designer for RISC-V Extensions (CIDRE), a front-to-back tool for ASIP design. CIDRE automatically analyzes hotspots in RISC-V applications and generates custom instruction suggestions with a corresponding nML description. The nML description can be used with other electronic design automation tools to accurately assess the cost and benefits of the found suggestions. In a RISC-V benchmark study, we were able to accelerate embedded benchmarks from Embench and MiBench by up to 2.47x with less than 24% area increase. The entire process was conducted completely automatically.

    https://arxiv.org/abs/2509.15782


    Highly Imbalanced Regression with Tabular Data in SEP and Other Applications

    oai:arXiv.org:2509.16339v4

    arXiv:2509.16339v4 Announce Type: replace Abstract: We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 ("highly imbalanced"). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that do not have rare instances. We propose CISIR that incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found in https://github.com/Machine-Earning/CISIR.

    https://arxiv.org/abs/2509.16339


    BENNS: A Surrogate Model for Hybrid Online-Offline Evolution of SFC Embedding

    oai:arXiv.org:2509.16856v2

    arXiv:2509.16856v2 Announce Type: replace Abstract: Service Function Chains (SFCs) enable programmatic control of the functions and services in a computer network. By leveraging Software Defined Networking to control the links between virtualised network functions, SFCs provide a scalable approach to dealing with the increased pressures on network operation and management. However, embedding SFCs onto the underlying physical network and compute infrastructure is an NP-hard problem. Genetic Algorithms (GAs) have been used to address this issue, but they require significant time to evaluate solution quality (fitness) online, with most existing approaches instead adopting offline simulations or analytical evaluations. To enable online use of GAs in solving the SFC embedding problem, we introduce a hybrid online-offline approach to efficiently evaluate the fitness of generated solutions. At the core of this is BENNS: a surrogate model that approximates fitness and is agnostic to topology, traffic, and SFC-embedding. We evaluate our approach in a static environment across five experiments, varying available resources and traffic loads, and in a dynamic network environment. Our results demonstrate that our approach is capable of exploring thousands of potential configurations and generating deployable solutions in 19.1 minutes on average, compared to online-only approaches, which take 17.8 hours on average to explore ten solutions in our experiments and do not converge on an optimal solution.

    https://arxiv.org/abs/2509.16856


    Improved Segmentation of Polyps and Visual Explainability Analysis

    oai:arXiv.org:2509.18159v4

    arXiv:2509.18159v4 Announce Type: replace Abstract: Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates a U-Net architecture with a pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. To ensure rigorous benchmarking, the model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images. Experimental results show a mean Dice coefficient of 0.8902 +/- 0.0125, a mean Intersection-over-Union (IoU) of 0.8023, and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9722. Advanced quantitative analysis using an optimal threshold yielded a Sensitivity of 0.9058 and Precision of 0.9083. Additionally, Grad-CAM visualizations confirmed that predictions were guided by clinically relevant regions, offering insight into the model's decision-making process. This study demonstrates that integrating segmentation accuracy with interpretability can support the development of trustworthy AI-assisted colonoscopy tools.

    https://arxiv.org/abs/2509.18159


    Learning Obstacle Avoidance using Double DQN for Quadcopter Navigation

    oai:arXiv.org:2509.18734v2

    arXiv:2509.18734v2 Announce Type: replace Abstract: One of the challenges faced by Autonomous Aerial Vehicles is reliable navigation through urban environments. Factors like reduction in precision of Global Positioning System (GPS), narrow spaces and dynamically moving obstacles make the path planning of an aerial robot a complicated task. One of the skills required for the agent to effectively navigate through such an environment is to develop an ability to avoid collisions using information from onboard depth sensors. In this paper, we propose Reinforcement Learning of a virtual quadcopter robot agent equipped with a Depth Camera to navigate through a simulated urban environment.

    https://arxiv.org/abs/2509.18734


    Are most sentences unique? An empirical examination of Chomskyan claims

    oai:arXiv.org:2509.19108v2

    arXiv:2509.19108v2 Announce Type: replace Abstract: A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that "virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe." With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences are often the majority of corpora, this is highly constrained by genre, and that duplicate sentences are not an insignificant part of any individual corpus.

    https://arxiv.org/abs/2509.19108


    A Pipeline to Assess Merging Methods via Behavior and Internals

    oai:arXiv.org:2509.19476v2

    arXiv:2509.19476v2 Announce Type: replace Abstract: Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impacts behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between this behavior and internal evaluation. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.

    https://arxiv.org/abs/2509.19476


    Mamba Modulation: On the Length Generalization of Mamba

    oai:arXiv.org:2509.19633v3

    arXiv:2509.19633v3 Announce Type: replace Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.

    https://arxiv.org/abs/2509.19633


    A note on the compactness properties of discontinuous Galerkin time discretizations

    oai:arXiv.org:2509.20039v3

    arXiv:2509.20039v3 Announce Type: replace Abstract: This work extends the discrete compactness results of Walkington (SIAM J. Numer. Anal., 47(6):4680--4710, 2010) for high-order discontinuous Galerkin time discretizations of parabolic problems to more general function space settings. In particular, we show a discrete version of the Aubin--Lions--Simon lemma that holds for general Banach spaces $X$, $B$, and $Y$ satisfying $X \hookrightarrow B$ compactly and $B \hookrightarrow Y$ continuously. Our proofs rely on the properties of a time reconstruction operator and remove the need for quasi-uniform time partitions assumed in previous works. Thus, we provide a useful and flexible tool for the analysis of high-order discontinuous Galerkin time discretizations of complex nonlinear partial differential equations.

    https://arxiv.org/abs/2509.20039


    SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

    oai:arXiv.org:2509.21320v3

    arXiv:2509.21320v3 Announce Type: replace Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

    https://arxiv.org/abs/2509.21320


    Shoot from the HIP: Hessian Interatomic Potentials without derivatives

    oai:arXiv.org:2509.21624v2

    arXiv:2509.21624v2 Announce Type: replace Abstract: Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree $l$=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip

    https://arxiv.org/abs/2509.21624


    Navigating the Impact of Structured Output Format on Large Language Models through the Compass of Causal Inference

    oai:arXiv.org:2509.21791v2

    arXiv:2509.21791v2 Announce Type: replace Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs' generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs' generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public and one developed reasoning tasks, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o's generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. In the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions. Further experiments show that OpenAI-o3 are more resilient to output formats than general-purpose GPT-4o and GPT-4.1, highlighting an unaware advantage of reasoning models.

    https://arxiv.org/abs/2509.21791


    Non-Linear Trajectory Modeling for Multi-Step Gradient Inversion Attacks in Federated Learning

    oai:arXiv.org:2509.22082v2

    arXiv:2509.22082v2 Announce Type: replace Abstract: Federated Learning (FL) enables collaborative training while preserving privacy, yet Gradient Inversion Attacks (GIAs) pose severe threats by reconstructing private data from shared gradients. In realistic FedAvg scenarios with multi-step updates, existing surrogate methods like SME rely on linear interpolation to approximate client trajectories for privacy leakage. However, we demonstrate that linear assumptions fundamentally underestimate SGD's nonlinear complexity, encountering irreducible approximation barriers in non-convex landscapes with only one-dimensional expressiveness. We propose Non-Linear Surrogate Model Extension (NL-SME), the first framework introducing learnable quadratic B\'ezier curves for trajectory modeling in GIAs against FL. NL-SME leverages $|w|+1$-dimensional control point parameterization combined with dvec scaling and regularization mechanisms to achieve superior approximation accuracy. Extensive experiments on CIFAR-100 and FEMNIST demonstrate NL-SME significantly outperforms baselines across all metrics, achieving 94\%--98\% performance gaps and order-of-magnitude improvements in cosine similarity loss while maintaining computational efficiency. This work exposes critical privacy vulnerabilities in FL's multi-step paradigm and provides insights for robust defense development.

    https://arxiv.org/abs/2509.22082


    Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

    oai:arXiv.org:2509.22258v3

    arXiv:2509.22258v3 Announce Type: replace Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

    https://arxiv.org/abs/2509.22258


    IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

    oai:arXiv.org:2509.22621v2

    arXiv:2509.22621v2 Announce Type: replace Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

    https://arxiv.org/abs/2509.22621


    What If Moderation Didn't Mean Suppression? A Case for Personalized Content Transformation

    oai:arXiv.org:2509.22861v2

    arXiv:2509.22861v2 Announce Type: replace Abstract: Centralized content moderation paradigm both falls short and over-reaches: 1) it fails to account for the subjective nature of harm, and 2) it acts with blunt suppression in response to content deemed harmful, even when such content can be salvaged. We first investigate this through formative interviews, documenting how seemingly benign content becomes harmful due to individual life experiences. Based on these insights, we developed DIY-MOD, a browser extension that operationalizes a new paradigm: personalized content transformation. Operating on a user's own definition of harm, DIY-MOD transforms sensitive elements within content in real-time instead of suppressing the content itself. The system selects the most appropriate transformation for a piece of content from a diverse palette--from obfuscation to artistic stylizing--to match the user's specific needs while preserving the content's informational value. Our two-session user study demonstrates that this approach increases users' sense of agency and safety, enabling them to engage with content and communities they previously needed to avoid.

    https://arxiv.org/abs/2509.22861


    HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing

    oai:arXiv.org:2509.23103v2

    arXiv:2509.23103v2 Announce Type: replace Abstract: Reducing the cost of multiplications is critical for efficient deep neural network deployment, especially in energy-constrained edge devices. In this work, we introduce HTMA-Net, a novel framework that integrates the Hadamard Transform (HT) with multiplication-avoiding (MA) SRAM-based in-memory computing to reduce arithmetic complexity while maintaining accuracy. Unlike prior methods that only target multiplications in convolutional layers or focus solely on in-memory acceleration, HTMA-Net selectively replaces intermediate convolutions with Hybrid Hadamard-based transform layers whose internal convolutions are implemented via multiplication-avoiding in-memory operations. We evaluate HTMA-Net on ResNet-18 using CIFAR-10, CIFAR-100, and Tiny ImageNet, and provide a detailed comparison against regular, MF-only, and HT-only variants. Results show that HTMA-Net eliminates up to 52\% of multiplications compared to baseline ResNet-18, ResNet-20, and ResNet-32 models, while achieving comparable accuracy in evaluation and significantly reducing computational complexity and the number of parameters. Our results demonstrate that combining structured Hadamard transform layers with SRAM-based in-memory computing multiplication-avoiding operators is a promising path towards efficient deep learning architectures.

    https://arxiv.org/abs/2509.23103


    Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

    oai:arXiv.org:2509.23188v3

    arXiv:2509.23188v3 Announce Type: replace Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.

    https://arxiv.org/abs/2509.23188


    C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection

    oai:arXiv.org:2509.23316v2

    arXiv:2509.23316v2 Announce Type: replace Abstract: Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose \textbf{C3-OWD}, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage~1 enhances robustness by pretraining with RGBT data, while Stage~2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.

    https://arxiv.org/abs/2509.23316


    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    oai:arXiv.org:2509.23661v3

    arXiv:2509.23661v3 Announce Type: replace Abstract: We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.

    https://arxiv.org/abs/2509.23661


    PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease

    oai:arXiv.org:2509.23719v3

    arXiv:2509.23719v3 Announce Type: replace Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86\% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

    https://arxiv.org/abs/2509.23719


    FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing

    oai:arXiv.org:2509.23927v3

    arXiv:2509.23927v3 Announce Type: replace Abstract: Cross-modal artificial intelligence, represented by visual language models, has achieved significant success in general image understanding. However, a fundamental cognitive inconsistency exists between general visual representation and remote sensing image interpretation: remote sensing images couple topography, terrain, and spatial structure, thereby inherently requiring models to possess deep geoscientific understanding. This cognitive difference is further amplified in synthetic aperture radar (SAR) imagery: while SAR possesses irreplaceable all-weather, all-day observation capabilities, it is constrained by coherent imaging mechanisms, exhibiting significant modal heterogeneity with general images. To address this inconsistency, we propose FUSAR-KLIP, the first knowledge-guided general multimodal foundational model for SAR, along with reusable data and evaluation baselines. Specifically: (1) FUSAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection attributes) was constructed, covering multiple satellite platforms, 120,000 images, and 135 cities; (2) Aligned structured text was generated through hierarchical cognitive thought chains, accurately encoding more than 1 million multidimensional semantic information from geomorphological environment and regional attributes to spatial relationships; (3) A self-consistent iterative optimization mechanism was designed to guide cross-modal learning with this knowledge information consistent with human cognition and physical laws in a self-supervised closed loop consisting of contrast, matching, and reconstruction; (4) A unified evaluation benchmark was established in 11 typical downstream tasks in the two major categories of vision and language, and compared with 15 mainstream foundation models.

    https://arxiv.org/abs/2509.23927


    Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

    oai:arXiv.org:2509.24130v3

    arXiv:2509.24130v3 Announce Type: replace Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or searching stability, and therefore cannot remedy this brittleness in practice. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Diverse tasks evaluate our methods, whose design for minimizing textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.

    https://arxiv.org/abs/2509.24130


    Interactive Program Synthesis for Modeling Collaborative Physical Activities from Narrated Demonstrations

    oai:arXiv.org:2509.24250v2

    arXiv:2509.24250v2 Announce Type: replace Abstract: Teaching systems physical tasks is a long standing goal in HCI, yet most prior work has focused on non collaborative physical activities. Collaborative tasks introduce added complexity, requiring systems to infer users assumptions about their teammates intent, which is an inherently ambiguous and dynamic process. This necessitates representations that are interpretable and correctable, enabling users to inspect and refine system behavior. We address this challenge by framing collaborative task learning as a program synthesis problem. Our system represents behavior as editable programs and uses narrated demonstrations, i.e. paired physical actions and natural language, as a unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code. The same modality is used for the system to communicate its learning to users. In a within subjects study, 20 users taught multiplayer soccer tactics to our system. 70 percent (14/20) of participants successfully refined learned programs to match their intent and 90 percent (18/20) found it easy to correct the programs. The study surfaced unique challenges in representing learning as programs and in enabling users to teach collaborative physical activities. We discuss these issues and outline mitigation strategies.

    https://arxiv.org/abs/2509.24250


    MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation

    oai:arXiv.org:2509.24253v2

    arXiv:2509.24253v2 Announce Type: replace Abstract: Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.

    https://arxiv.org/abs/2509.24253


    Spectral equivalence of unsymmetric kernel matrices and applications

    oai:arXiv.org:2509.24561v2

    arXiv:2509.24561v2 Announce Type: replace Abstract: Symmetric kernel matrices are a well-researched topic in the literature of kernel based approximation. In particular stability properties in terms of lower bounds on the smallest eigenvalue of such symmetric kernel matrices are thoroughly investigated, as they play a fundamental role in theory and practice. In this work, we focus on unsymmetric kernel matrices and derive stability properties under small shifts by establishing a spectral equivalence to their unshifted, symmetric versions. This extends and generalizes results for translational invariant kernels upon Quak et al [SIAM Journal on Math. Analysis, 1993] and Sivakumar and Ward [Numerische Mathematik, 1993], however focussing instead on finitely smooth kernels. As applications, we consider convolutional kernels over domains, which are no longer translational invariant, but which are still an important class of kernels for applications. For these, we derive novel lower bounds for the smallest eigenvalue of the kernel matrices in terms of the separation distance of the data points, and thus derive stability bounds in terms of the condition number.

    https://arxiv.org/abs/2509.24561


    Journal Publications in Medicine: Ranking vs. Interdisciplinarity

    oai:arXiv.org:2509.24994v2

    arXiv:2509.24994v2 Announce Type: replace Abstract: Interdisciplinary research is critical for innovation and addressing complex societal issues. We characterise the interdisciplinary knowledge structure of PubMed research articles in medicine as correlation networks of medical concepts and compare the interdisciplinarity of articles between high-ranking (impactful) and less high-ranking (less impactful) medical journals. We found that impactful medical journals tend to publish research that are less interdisciplinary than less impactful journals. Observing that they bridge distant knowledge clusters in the networks, we find that cancer-related research can be seen as one of the main drivers of interdisciplinarity in medical science. Using signed difference networks, we also investigate the clustering of deviations between high and low impact journal correlation networks. We generally find a mild tendency for strong link differences to be adjacent. Furthermore, we find topic clusters of deviations that shift over time. In contrast, topic clusters in the original networks are static over time and can be seen as the core knowledge structure in medicine. Overall, journals and policymakers should encourage initiatives to accommodate interdisciplinarity within the existing infrastructures to maximise the potential patient benefits from IDR.

    https://arxiv.org/abs/2509.24994


    Designing Compact ILPs via Fast Witness Verification

    oai:arXiv.org:2509.25445v2

    arXiv:2509.25445v2 Announce Type: replace Abstract: The standard formalization of preprocessing in parameterized complexity is given by kernelization. In this work, we depart from this paradigm and study a different type of preprocessing for problems without polynomial kernels, still aiming at producing instances that are easily solvable in practice. Specifically, we ask for which parameterized problems an instance (I,k) can be reduced in polynomial time to an integer linear program (ILP) with poly(k) constraints. We show that this property coincides with the parameterized complexity class WK[1], previously studied in the context of Turing kernelization lower bounds. In turn, the class WK[1] enjoys an elegant characterization in terms of witness verification protocols: a yes-instance should admit a witness of size poly(k) that can be verified in time poly(k). By combining known data structures with new ideas, we design such protocols for several problems, such as r-Way Cut, Vertex Multiway Cut, Steiner Tree, or Minimum Common String Partition, thus showing that they can be modeled by compact ILPs. We also present explicit ILP and MILP formulations for Weighted Vertex Cover on graphs with small (unweighted) vertex cover number. We believe that these results will provide a background for a systematic study of ILP-oriented preprocessing procedures for parameterized problems.

    https://arxiv.org/abs/2509.25445


    MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

    oai:arXiv.org:2509.25531v4

    arXiv:2509.25531v4 Announce Type: replace Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data-signals typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameters/300B-tokens setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae

    https://arxiv.org/abs/2509.25531


    RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale

    oai:arXiv.org:2509.25540v2

    arXiv:2509.25540v2 Announce Type: replace Abstract: Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers of increasing complexity: (1) a structured quality assurance (QA) tier, assessing the accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier involving determination of mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and detection of cancer recurrence in independent prostate and head-and-neck cancer cohorts requiring combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured-data retrieval, a critical prerequisite for successful complex clinical outcome labeling.

    https://arxiv.org/abs/2509.25540


    Stealthy Yet Effective: Distribution-Preserving Backdoor Attacks on Graph Classification

    oai:arXiv.org:2509.26032v2

    arXiv:2509.26032v2 Announce Type: replace Abstract: Graph Neural Networks (GNNs) have demonstrated strong performance across tasks such as node classification, link prediction, and graph classification, but remain vulnerable to backdoor attacks that implant imperceptible triggers during training to control predictions. While node-level attacks exploit local message passing, graph-level attacks face the harder challenge of manipulating global representations while maintaining stealth. We identify two main sources of anomaly in existing graph classification backdoor methods: structural deviation from rare subgraph triggers and semantic deviation caused by label flipping, both of which make poisoned graphs easily detectable by anomaly detection models. To address this, we propose DPSBA, a clean-label backdoor framework that learns in-distribution triggers via adversarial training guided by anomaly-aware discriminators. DPSBA effectively suppresses both structural and semantic anomalies, achieving high attack success while significantly improving stealth. Extensive experiments on real-world datasets validate that DPSBA achieves a superior balance between effectiveness and detectability compared to state-of-the-art baselines.

    https://arxiv.org/abs/2509.26032


    Equivariant Splitting: Self-supervised learning from incomplete data

    oai:arXiv.org:2510.00929v4

    arXiv:2510.00929v4 Announce Type: replace Abstract: Self-supervised learning for inverse problems allows to train a reconstruction network from noise and/or incomplete data alone. These methods have the potential of enabling learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, sparse-view computed tomography, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models. The code is available at https://github.com/vsechaud/Equivariant-Splitting

    https://arxiv.org/abs/2510.00929


    ModernVBERT: Towards Smaller Visual Document Retrievers

    oai:arXiv.org:2510.01149v2

    arXiv:2510.01149v2 Announce Type: replace Abstract: Retrieving specific information from a large corpus of documents is a prevalent industrial use case of modern AI, notably due to the popularity of Retrieval-Augmented Generation (RAG) systems. Although neural document retrieval models have historically operated exclusively in the text space, Visual Document Retrieval (VDR) models - large vision-language decoders repurposed as embedding models which directly work with page screenshots as inputs - are increasingly popular due to the performance and indexing latency gains they offer. In this work, we show that, while cost-efficient, this approach of repurposing generative models bottlenecks retrieval performance. Through controlled experiments, we revisit the entire training pipeline, and establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late interaction centered contrastive objectives which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms recent models up to 10 times larger when fine-tuned on document retrieval tasks, enabling efficient inference on cheap CPU hardware and greatly reducing latency and costs while maintaining strong performance. Models, code and data are available at https://huggingface.co/ModernVBERT.

    https://arxiv.org/abs/2510.01149


    SoK: Preconfirmations

    oai:arXiv.org:2510.02947v2

    arXiv:2510.02947v2 Announce Type: replace Abstract: In recent years, significant research efforts have focused on improving blockchain throughput and confirmation speeds without compromising security. While decreasing the time it takes for a transaction to be included in the blockchain ledger enhances user experience, a fundamental delay still remains between when a transaction is issued by a user and when its inclusion is confirmed in the blockchain ledger. This delay limits user experience gains through the confirmation uncertainty it brings for users. This inherent delay in conventional blockchain protocols has led to the emergence of preconfirmation protocols -- protocols that provide users with early guarantees of eventual transaction confirmation. This article presents a Systematization of Knowledge (SoK) on preconfirmations. We present the core terms and definitions needed to understand preconfirmations, outline a general framework for preconfirmation protocols, and explore the economics and risks of preconfirmations. Finally, we survey and apply our framework to several implementations of real-world preconfirmation protocols, bridging the gap between theory and practice.

    https://arxiv.org/abs/2510.02947


    Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines

    oai:arXiv.org:2510.02967v3

    arXiv:2510.02967v3 Announce Type: replace Abstract: This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a corpus of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. Clinical evaluation by seven Subject Matter Experts (SMEs) further validated these findings, with GPT-4.1 achieving 98.7% accuracy while reducing unsafe responses by 67% compared to O4-Mini (from 3.0 to 1.0 per evaluator). This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.

    https://arxiv.org/abs/2510.02967


    Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

    oai:arXiv.org:2510.04146v2

    arXiv:2510.04146v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and code generation. Autoregressive Language Models (ARMs), which generate tokens sequentially conditioned on all previous tokens, have been the predominant paradigm for LLMs. While these models have achieved high accuracy across a range of downstream tasks, they exhibit low arithmetic intensity due to the inherent sequential dependency in next-token prediction. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture. DLMs generate output tokens in parallel, mitigating the limitations of sequential decoding. However, the performance implications of DLMs relative to commonly deployed ARMs are not fully understood. In this work, we present a comprehensive study of the performance characteristics of ARMs and DLMs, combining theoretical analysis with empirical profiling to characterize the trade-offs between these approaches. We show that although DLMs can achieve higher arithmetic intensity than ARMs by leveraging parallelism across token positions, they fail to scale effectively with longer contexts. We then explore block-wise decoding for DLMs, which decouples arithmetic intensity from sequence length and enables better scaling to long contexts (similar to ARMs). We also examine batched inference and find that ARMs exhibit superior throughput as they benefit more from parallelism across sequences in the batch. Finally, we highlight opportunities for accelerating DLM inference, emphasizing that reducing the number of sampling steps is key for open-source DLMs to achieve lower latency relative to ARMs.

    https://arxiv.org/abs/2510.04146


    Conditional Representation Learning for Customized Tasks

    oai:arXiv.org:2510.04564v2

    arXiv:2510.04564v2 Announce Type: replace Abstract: Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.

    https://arxiv.org/abs/2510.04564


    cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications

    oai:arXiv.org:2510.05476v2

    arXiv:2510.05476v2 Announce Type: replace Abstract: Message Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally employ network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with complex software stacks for cross-node communication. We present cMPI, the first work to optimize MPI point-to-point communication (both one-sided and two-sided) using CXL memory sharing on a real CXL platform, transforming cross-node communication into memory transactions and data copies within CXL memory, bypassing traditional network protocols. We analyze performance across various interconnects and find that CXL memory sharing achieves 7.2x-8.1x lower latency than TCP-based interconnects deployed in small- and medium-scale clusters. We address challenges of CXL memory sharing for MPI communication, including data object management over the dax representation [50], cache coherence, and atomic operations. Overall, cMPI outperforms TCP over standard Ethernet NIC and high-end SmartNIC by up to 49x and 72x in latency and bandwidth, respectively, for small messages.

    https://arxiv.org/abs/2510.05476


    Fast-Convergent Proximity Graphs for Approximate Nearest Neighbor Search

    oai:arXiv.org:2510.05975v3

    arXiv:2510.05975v3 Announce Type: replace Abstract: Approximate nearest neighbor (ANN) search in high-dimensional metric spaces is a fundamental problem with many applications. Over the past decade, proximity graph (PG)-based indexes have demonstrated superior empirical performance over alternatives. However, these methods often lack theoretical guarantees regarding the quality of query results, especially in the worst-case scenarios. In this paper, we introduce the {\alpha}-convergent graph ({\alpha}-CG), a new PG structure that employs a carefully designed edge pruning rule. This rule eliminates candidate neighbors for each data point p by applying the shifted-scaled triangle inequalities among p, its existing out-neighbors, and new candidates. If the distance between the query point q and its exact nearest neighbor v* is at most {\tau} for some constant {\tau} > 0, our {\alpha}-CG finds the exact nearest neighbor in poly-logarithmic time, assuming bounded intrinsic dimensionality for the dataset; otherwise, it can find an ANN in the same time. To enhance scalability, we develop the {\alpha}-convergent neighborhood graph ({\alpha}-CNG), a practical variant that applies the pruning rule locally within each point's neighbors. We also introduce optimizations to reduce the index construction time. Experimental results show that our {\alpha}-CNG outperforms existing PGs on real-world datasets. For most datasets, {\alpha}-CNG can reduce the number of distance computations and search steps by over 15% and 45%, respectively, when compared with the best-performing baseline.

    https://arxiv.org/abs/2510.05975


    HTMformer: Hybrid Time and Multivariate Transformer for Time Series Forecasting

    oai:arXiv.org:2510.07084v3

    arXiv:2510.07084v3 Announce Type: replace Abstract: Transformer-based methods have achieved impressive results in time series forecasting. However, existing Transformers still exhibit limitations in sequence modeling as they tend to overemphasize temporal dependencies. This incurs additional computational overhead without yielding corresponding performance gains. We find that the performance of Transformers is highly dependent on the embedding method used to learn effective representations. To address this issue, we extract multivariate features to augment the effective information captured in the embedding layer, yielding multidimensional embeddings that convey richer and more meaningful sequence representations. These representations enable Transformer-based forecasters to better understand the series. Specifically, we introduce Hybrid Temporal and Multivariate Embeddings (HTME). The HTME extractor integrates a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to provide complementary features, thereby achieving a balance between model complexity and performance. By combining HTME with the Transformer architecture, we present HTMformer, leveraging the enhanced feature extraction capability of the HTME extractor to build a lightweight forecaster. Experiments conducted on eight real-world datasets demonstrate that our approach outperforms existing baselines in both accuracy and efficiency.

    https://arxiv.org/abs/2510.07084


    Approximate Gaussian Mapping for Generative Image Steganography

    oai:arXiv.org:2510.07219v2

    arXiv:2510.07219v2 Announce Type: replace Abstract: Ordinary differential equation (ODE)-based diffusion models enable deterministic image synthesis, establishing a reversible mapping suitable for generative steganography. While prevailing methods strictly adhere to a standard normal prior, empirical evidence indicates that controlled deviations from this distribution reduce numerical inversion errors without compromising perceptual quality. Leveraging this observation, the Approximate Gaussian Mapping (AGM) is proposed as a linear transformation strategy that embeds secrets by modulating noise scale and variance. To balance retrieval numerical consistence and security, a two-stage decoupled optimization strategy is introduced to minimize the Kullback-Leibler divergence subject to target bit accuracy constraints. Beyond the proposed method, we conduct a mechanistic analysis of the divergent behaviors between pixel-space and latent-space architectures. The experimental results reveal that the VAE encoder enhances robustness by filtering external perturbations, whereas the structural regularization of the VAE decoder and the semantic variance introduced by text prompts jointly mask embedding artifacts to improve security. Experimental results confirm that pixel-space mplementations maximize embedding capacity for lossless channels, while latent-space approaches offer superior robustness and security suitable for adversarial environments

    https://arxiv.org/abs/2510.07219


    DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support

    oai:arXiv.org:2510.07620v2

    arXiv:2510.07620v2 Announce Type: replace Abstract: Dynamic trust evaluation in large, rapidly evolving graphs demands models that capture changing relationships, express calibrated confidence, and resist adversarial manipulation. DGTEN (Deep Gaussian-Based Trust Evaluation Network) introduces a unified graph-based framework that does all three by combining uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. It represents nodes and edges as Gaussian distributions so that both semantic signals and epistemic uncertainty propagate through the graph neural network, enabling risk-aware trust decisions rather than overconfident guesses. To track how trust evolves, it layers hybrid absolute-Gaussian-hourglass positional encoding with Kolmogorov-Arnold network-based unbiased multi-head attention, then applies an ordinary differential equation-based residual learning module to jointly model abrupt shifts and smooth trends. Robust adaptive ensemble coefficient analysis prunes or down-weights suspicious interactions using complementary cosine and Jaccard similarity, curbing reputation laundering, sabotage, and on-off attacks. On two signed Bitcoin trust networks, DGTEN delivers standout gains where it matters most: in single-timeslot prediction on Bitcoin-OTC, it improves MCC by +12.34% over the best dynamic baseline; in the cold-start scenario on Bitcoin-Alpha, it achieves a +25.00% MCC improvement, the largest across all tasks and datasets; while under adversarial on-off attacks, it surpasses the baseline by up to +10.23% MCC. These results endorse the unified DGTEN framework.

    https://arxiv.org/abs/2510.07620


    Measuring What Matters: The AI Pluralism Index

    oai:arXiv.org:2510.08193v2

    arXiv:2510.08193v2 Announce Type: replace Abstract: Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling "Unknown" evidence to report both lower-bound ("evidence") and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.

    https://arxiv.org/abs/2510.08193


    TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

    oai:arXiv.org:2510.08808v4

    arXiv:2510.08808v4 Announce Type: replace Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.

    https://arxiv.org/abs/2510.08808


    TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control

    oai:arXiv.org:2510.09561v2

    arXiv:2510.09561v2 Announce Type: replace Abstract: Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model's ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model's weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user's condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model's conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.

    https://arxiv.org/abs/2510.09561


    FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering

    oai:arXiv.org:2510.09995v2

    arXiv:2510.09995v2 Announce Type: replace Abstract: Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, the laws of illumination-aware 2D synthesis, and physical engine-based 3D rendering, which finally gives us a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.

    https://arxiv.org/abs/2510.09995


    ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

    oai:arXiv.org:2510.12793v2

    arXiv:2510.12793v2 Announce Type: replace Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.

    https://arxiv.org/abs/2510.12793


    Three Lenses on the AI Revolution: Risk, Transformation, Continuity

    oai:arXiv.org:2510.12859v2

    arXiv:2510.12859v2 Announce Type: replace Abstract: Artificial Intelligence (AI) has emerged as both a continuation of historical technological revolutions and a potential rupture with them. This paper argues that AI must be viewed simultaneously through three lenses: \textit{risk}, where it resembles nuclear technology in its irreversible and global externalities; \textit{transformation}, where it parallels the Industrial Revolution as a general-purpose technology driving productivity and reorganization of labor; and \textit{continuity}, where it extends the fifty-year arc of computing revolutions from personal computing to the internet to mobile. Drawing on historical analogies, we emphasize that no past transition constituted a strict singularity: disruptive shifts eventually became governable through new norms and institutions. We examine recurring patterns across revolutions -- democratization at the usage layer, concentration at the production layer, falling costs, and deepening personalization -- and show how these dynamics are intensifying in the AI era. Sectoral analysis illustrates how accounting, law, education, translation, advertising, and software engineering are being reshaped as routine cognition is commoditized and human value shifts to judgment, trust, and ethical responsibility. At the frontier, the challenge of designing moral AI agents highlights the need for robust guardrails, mechanisms for moral generalization, and governance of emergent multi-agent dynamics. We conclude that AI is neither a singular break nor merely incremental progress. It is both evolutionary and revolutionary: predictable in its median effects yet carrying singularity-class tail risks. Good outcomes are not automatic; they require coupling pro-innovation strategies with safety governance, ensuring equitable access, and embedding AI within a human order of responsibility.

    https://arxiv.org/abs/2510.12859


    Counting Hallucinations in Diffusion Models

    oai:arXiv.org:2510.13080v2

    arXiv:2510.13080v2 Announce Type: replace Abstract: Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.

    https://arxiv.org/abs/2510.13080


    On preconditioned Riemannian gradient methods for minimizing the Gross-Pitaevskii energy functional: algorithms, global convergence and optimal local convergence rate

    oai:arXiv.org:2510.13516v2

    arXiv:2510.13516v2 Announce Type: replace Abstract: In this article, we propose a unified framework for preconditioned Riemannian gradient (P-RG) methods to minimize Gross-Pitaevskii (GP) energy functionals with rotation on a Riemannian manifold. This framework enables comprehensive analysis of existing projected Sobolev gradient methods and facilitates the construction of highly efficient P-RG algorithms. Under mild assumptions on the preconditioner, we prove energy dissipation and global convergence. Local convergence is more challenging due to phase and rotational invariances. Assuming the GP functional is Morse-Bott, we derive a sharp Polyak-\L ojasiewicz (PL) inequality near minimizers. This allows precise characterization of the local convergence rate via the condition number $\mu/L$, where $\mu$ and $L$ are the lower and upper bounds of the spectrum of a combined operator (preconditioner and Hessian) on a closed subspace. By combining spectral analysis with the PL inequality, we identify an optimal preconditioner achieving the best possible local convergence rate: $(L-\mu)/(L+\mu)+\varepsilon$ ($\varepsilon>0$ small). To our knowledge, this is the first rigorous derivation of the local convergence rate for P-RG methods applied to GP functionals with two symmetry structures. Numerical experiments on rapidly rotating Bose-Einstein condensates validate the theoretical results and compare the performance of different preconditioners.

    https://arxiv.org/abs/2510.13516


    STITCHER: Constrained Trajectory Planning in Complex Environments with Real-Time Motion Primitive Search

    oai:arXiv.org:2510.14893v3

    arXiv:2510.14893v3 Announce Type: replace Abstract: Autonomous high-speed navigation through large, complex environments requires real-time generation of agile trajectories that are dynamically feasible, collision-free, and satisfy state or actuator constraints. Modern trajectory planning techniques primarily use numerical optimization, as they enable the systematic computation of high-quality, expressive trajectories that satisfy various constraints. However, stringent requirements on computation time and the risk of numerical instability can limit the use of optimization-based planners in safety-critical scenarios. This work presents an optimization-free planning framework called STITCHER that stitches short trajectory segments together with graph search to compute long-range, expressive, and near-optimal trajectories in real-time. STITCHER outperforms modern optimization-based planners through our innovative planning architecture and several algorithmic developments that make real-time planning possible. Extensive simulation testing is performed to analyze the algorithmic components that make up STITCHER, along with a thorough comparison with two state-of-the-art optimization planners. Simulation tests show that safe trajectories can be created within a few milliseconds for paths that span the entirety of two 50 m x 50 m environments. Hardware tests with a custom quadrotor verify that STITCHER can produce trackable paths in real-time while respecting nonconvex constraints, such as limits on tilt angle and motor forces, which are otherwise hard to include in optimization-based planners.

    https://arxiv.org/abs/2510.14893


    Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

    oai:arXiv.org:2510.14925v3

    arXiv:2510.14925v3 Announce Type: replace Abstract: We reinterpret Kant's Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition in linear-Gaussian state-space models via H-Risk, a composite instability index integrating spectral margin, conditioning, temporal sensitivity, and innovation amplification. In simulations, higher H-Risk predicts overconfident errors and degraded closed-loop behavior even when the dynamics remain formally stable, exposing a gap between nominal and epistemic stability. Extending this stability lens to large language models (LLMs), we introduce a domain-wise proxy based on confidence fluctuations and overconfident errors. In a binary-question study, a Kantian-inspired policy that permits ''cannot judge'' responses yields targeted reductions in policy-aware squared loss in high-stakes domains relative to an overconfident baseline. To probe internal dynamics, we analyse layer-wise sensitivity of hidden states to small input perturbations. Contrary to a naive instability hypothesis, confidently wrong answers show no instability gap; instead, they are at least as locally stable as confidently correct answers, revealing stable miscalibration in which hallucinations behave like robust but misaligned attractors. For Qwen-2.5, spectral and activation profiles suggest a high signal-to-noise, low effective signal temperature regime in which representations become inertial and resistant to contextual shifts. These results bridge Kantian self-limitation and feedback control, and suggest that stable high-confidence hallucinations may not be readily corrected by output-only heuristics (e.g., temperature scaling or re-sampling), motivating process-level interventions that explicitly perturb and re-evaluate the inference trajectory.

    https://arxiv.org/abs/2510.14925


    3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

    oai:arXiv.org:2510.14945v2

    arXiv:2510.14945v2 Announce Type: replace Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : https://cvlab-kaist.github.io/3DScenePrompt/

    https://arxiv.org/abs/2510.14945


    pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

    oai:arXiv.org:2510.14974v2

    arXiv:2510.14974v2 Announce Type: replace Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.

    https://arxiv.org/abs/2510.14974


    cuAPO: A CUDA-based Parallelization of Artificial Protozoa Optimizer

    oai:arXiv.org:2510.14982v2

    arXiv:2510.14982v2 Announce Type: replace Abstract: Metaheuristic algorithms are widely used for solving complex problems due to their ability to provide near-optimal solutions. But the execution time of these algorithms increases with the problem size and/or solution space. And, to get more promising results, we have to execute these algorithms for a large number of iterations, requiring a large amount of time and this is one of the main issues found with these algorithms. To handle the same, researchers are now-a-days working on design and development of parallel versions of state-of-the-art metaheuristic optimization algorithms. We, in this paper, present a CUDA-based parallelization of state-of-the-art Artificial Protozoa Optimizer leveraging GPU acceleration. We implement both the existing sequential version and the proposed parallel version of Artificial Protozoa Optimizer for a performance comparison. Our experimental results calculated over a set of CEC2022 benchmark functions demonstrate a significant performance gain i.e. up to 6.7 times speed up is achieved with proposed parallel version. We also use a real world application, i.e., Image Thresholding to compare both algorithms.

    https://arxiv.org/abs/2510.14982


    Las Vegas algorithms to generate universal cycles and de Bruijn sequences uniformly at random

    oai:arXiv.org:2510.16545v3

    arXiv:2510.16545v3 Announce Type: replace Abstract: We present practical algorithms for generating universal cycles uniformly at random. In particular, we consider universal cycles for shorthand permutations, subsets and multiset permutations, weak orders, and orientable sequences. Additionally, we consider de Bruijn sequences, weight-range de Bruin sequences, and de Bruijn sequences, with forbidden $0^z$ substring. Each algorithm, seeded with a random element from the given set, applies a random walk of an underlying Eulerian de Bruijn graph to obtain a random arborescence (spanning in-tree). Given the random arborescence and the de Bruijn graph, a corresponding random universal cycle can be generated in constant time per symbol. We present experimental results on the average cover time needed to compute a random arborescence for each object using a Las Vegas algorithm.

    https://arxiv.org/abs/2510.16545


    BreakFun: Jailbreaking LLMs via Schema Exploitation

    oai:arXiv.org:2510.17904v2

    arXiv:2510.17904v2 Announce Type: replace Abstract: The proficiency of Large Language Models (LLMs) in processing structured data and adhering to syntactic rules is a capability that drives their widespread adoption but also makes them paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM's adherence to structured schemas. BreakFun employs a three-part prompt that combines an innocent framing and a Chain-of-Thought distraction with a core "Trojan Schema"--a carefully crafted data structure that compels the model to generate harmful content, exploiting the LLM's strong tendency to follow structures and schemas. We demonstrate this vulnerability is highly transferable, achieving an average success rate of 89% across 13 foundational and proprietary models on JailbreakBench, and reaching a 100% Attack Success Rate (ASR) on several prominent models. A rigorous ablation study confirms this Trojan Schema is the attack's primary causal factor. To counter this, we introduce the Adversarial Prompt Deconstruction guardrail, a defense that utilizes a secondary LLM to perform a "Literal Transcription"--extracting all human-readable text to isolate and reveal the user's true harmful intent. Our proof-of-concept guardrail demonstrates high efficacy against the attack, validating that targeting the deceptive schema is a viable mitigation strategy. Our work provides a look into how an LLM's core strengths can be turned into critical weaknesses, offering a fresh perspective for building more robustly aligned models.

    https://arxiv.org/abs/2510.17904


    PrivaDE: Privacy-preserving Data Evaluation for Blockchain-based Data Marketplaces

    oai:arXiv.org:2510.18109v3

    arXiv:2510.18109v3 Announce Type: replace Abstract: Evaluating the usefulness of data before purchase is essential when obtaining data for high-quality machine learning models, yet both model builders and data providers are often unwilling to reveal their proprietary assets. We present PrivaDE, a privacy-preserving protocol that allows a model owner and a data owner to jointly compute a utility score for a candidate dataset without fully exposing model parameters, raw features, or labels. PrivaDE provides strong security against malicious behavior and can be integrated into blockchain-based marketplaces, where smart contracts enforce fair execution and payment. To make the protocol practical, we propose optimizations to enable efficient secure model inference, and a model-agnostic scoring method that uses only a small, representative subset of the data while still reflecting its impact on downstream training. Evaluation shows that PrivaDE performs data evaluation effectively, achieving online runtimes within 15 minutes even for models with millions of parameters. Our work lays the foundation for fair and automated data marketplaces in decentralized machine learning ecosystems.

    https://arxiv.org/abs/2510.18109


    DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

    oai:arXiv.org:2510.19858v2

    arXiv:2510.19858v2 Announce Type: replace Abstract: The rapid expansion of online courses and social media has generated large volumes of unstructured learner-generated text. Understanding how learners construct knowledge in these spaces is crucial for analysing learning processes, informing content design, and providing feedback at scale. However, existing approaches typically rely on manual coding of well-structured discussion forums, which does not scale to the fragmented discourse found in online learning. This study proposes and validates a framework that combines a codebook inspired by the Interaction Analysis Model with an automated classifier to enable large-scale analysis of knowledge construction in unstructured online discourse. We adapt four comment-level categories of knowledge construction: Non-Knowledge Construction, Share, Explore, and Integrate. Three trained annotators coded a balanced sample of 20,000 comments from YouTube education channels. The codebook demonstrated strong reliability, with Cohen's kappa = 0.79 on the main dataset and 0.85--0.93 across four additional educational domains. For automated classification, bag-of-words baselines were compared with transformer-based language models using 10-fold cross-validation. A DeBERTa-v3-large model achieved the highest macro-averaged F1 score (0.841), outperforming all baselines and other transformer models. External validation on four domains yielded macro-F1 above 0.705, with stronger transfer in medicine and programming, where discourse was more structured and task-focused, and weaker transfer in language and music, where comments were more varied and context-dependent. Overall, the study shows that theory-driven, semi-automated analysis of knowledge construction at scale is feasible, enabling the integration of knowledge-construction indicators into learning analytics and the design of online learning environments.

    https://arxiv.org/abs/2510.19858


    GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection

    oai:arXiv.org:2510.20268v2

    arXiv:2510.20268v2 Announce Type: replace Abstract: Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, like text feature, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate more grained multi-modal feature based on the video snippet, which summarizes the main content, and text features based on the captions of original video will be introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainly datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.

    https://arxiv.org/abs/2510.20268


    The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

    oai:arXiv.org:2510.21118v3

    arXiv:2510.21118v3 Announce Type: replace Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on average of models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.

    https://arxiv.org/abs/2510.21118


    Towards Physically Executable 3D Gaussian for Embodied Navigation

    oai:arXiv.org:2510.21307v2

    arXiv:2510.21307v2 Announce Type: replace Abstract: 3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is more difficult to converge, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. Our data and code are available at: https://sage-3d.github.io.

    https://arxiv.org/abs/2510.21307


    Excision Score: Evaluating Edits with Surgical Precision

    oai:arXiv.org:2510.21537v2

    arXiv:2510.21537v2 Announce Type: replace Abstract: Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria, because their scores are dominated by the shared content. They report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which computes longest common subsequence (LCS) to remove content shared by an existing document with the ground truth and predicted revisions, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to speed the standard cubic LCS computation to quadratic. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by >21% over standard measures like BLEU. The key criterion is invariance to shared context; when we perturb HumanEvalFix with increased shared context, ES' improvement over SARI increases to 20% and >30% over standard measures. ES also handles other corner cases that other measures do not, such as correctly aligning moved code blocks, and appropriately rewarding matching insertions or deletions.

    https://arxiv.org/abs/2510.21537


    A Multi-Component AI Framework for Computational Psychology: From Robust Predictive Modeling to Deployed Generative Dialogue

    oai:arXiv.org:2510.21720v2

    arXiv:2510.21720v2 Announce Type: replace Abstract: The confluence of Artificial Intelligence and Computational Psychology presents an opportunity to model, understand, and interact with complex human psychological states through computational means. This paper presents a comprehensive, multi-faceted framework designed to bridge the gap between isolated predictive modeling and an interactive system for psychological analysis. The methodology encompasses a rigorous, end-to-end development lifecycle. First, foundational performance benchmarks were established on four diverse psychological datasets using classical machine learning techniques. Second, state-of-the-art transformer models were fine-tuned, a process that necessitated the development of effective solutions to overcome critical engineering challenges, including the resolution of numerical instability in regression tasks and the creation of a systematic workflow for conducting large-scale training under severe resource constraints. Third, a generative large language model (LLM) was fine-tuned using parameter-efficient techniques to function as an interactive "Personality Brain." Finally, the entire suite of predictive and generative models was architected and deployed as a robust, scalable microservices ecosystem. Key findings include the successful stabilization of transformer-based regression models for affective computing, showing meaningful predictive performance where standard approaches failed, and the development of a replicable methodology for democratizing large-scale AI research. The significance of this work lies in its holistic approach, demonstrating a complete research-to-deployment pipeline that integrates predictive analysis with generative dialogue, thereby providing a practical model for future research in computational psychology and human-AI interaction.

    https://arxiv.org/abs/2510.21720


    Informed Initialization for Bayesian Optimization and Active Learning

    oai:arXiv.org:2510.23681v2

    arXiv:2510.23681v2 Announce Type: replace Abstract: Bayesian Optimization is a widely used method for optimizing expensive black-box functions, relying on probabilistic surrogate models such as Gaussian Processes. The quality of the surrogate model is crucial for good optimization performance, especially in the few-shot setting where only a small number of batches of points can be evaluated. In this setting, the initialization plays a critical role in shaping the surrogate's predictive quality and guiding subsequent optimization. Despite this, practitioners typically rely on (quasi-)random designs to cover the input space. However, such approaches neglect two key factors: (a) space-filling designs may not be desirable to reduce predictive uncertainty, and (b) efficient hyperparameter learning during initialization is essential for high-quality prediction, which may conflict with space-filling designs. To address these limitations, we propose Hyperparameter-Informed Predictive Exploration (HIPE), a novel acquisition strategy that balances predictive uncertainty reduction with hyperparameter learning using information-theoretic principles. We derive a closed-form expression for HIPE in the Gaussian Process setting and demonstrate its effectiveness through extensive experiments in active learning and few-shot BO. Our results show that HIPE outperforms standard initialization strategies in terms of predictive accuracy, hyperparameter identification, and subsequent optimization performance, particularly in large-batch, few-shot settings relevant to many real-world Bayesian Optimization applications.

    https://arxiv.org/abs/2510.23681


    Lipschitz-aware Linearity Grafting for Certified Robustness

    oai:arXiv.org:2510.25130v2

    arXiv:2510.25130v2 Announce Type: replace Abstract: Lipschitz constant is a fundamental property in certified robustness, as smaller values imply robustness to adversarial examples when a model is confident in its prediction. However, identifying the worst-case adversarial examples is known to be an NP-complete problem. Although over-approximation methods have shown success in neural network verification to address this challenge, reducing approximation errors remains a significant obstacle. Furthermore, these approximation errors hinder the ability to obtain tight local Lipschitz constants, which are crucial for certified robustness. Originally, grafting linearity into non-linear activation functions was proposed to reduce the number of unstable neurons, enabling scalable and complete verification. However, no prior theoretical analysis has explained how linearity grafting improves certified robustness. We instead consider linearity grafting primarily as a means of eliminating approximation errors rather than reducing the number of unstable neurons, since linear functions do not require relaxation. In this paper, we provide two theoretical contributions: 1) why linearity grafting improves certified robustness through the lens of the $l_\infty$ local Lipschitz constant, and 2) grafting linearity into non-linear activation functions, the dominant source of approximation errors, yields a tighter local Lipschitz constant. Based on these theoretical contributions, we propose a Lipschitz-aware linearity grafting method that removes dominant approximation errors, which are crucial for tightening the local Lipschitz constant, thereby improving certified robustness, even without certified training. Our extensive experiments demonstrate that grafting linearity into these influential activations tightens the $l_\infty$ local Lipschitz constant and enhances certified robustness.

    https://arxiv.org/abs/2510.25130


    Instance-Level Composed Image Retrieval

    oai:arXiv.org:2510.25387v2

    arXiv:2510.25387v2 Announce Type: replace Abstract: The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge-comparable to retrieval among more than 40M random distractors-through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.

    https://arxiv.org/abs/2510.25387


    RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding

    oai:arXiv.org:2510.27261v2

    arXiv:2510.27261v2 Announce Type: replace Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise, query-relevant visual content, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. It improves retrieval accuracy by 10.02% in R@1 on average, and boosts question answering accuracy by 3.56% while using only 71.42% visual tokens compared with prior methods.

    https://arxiv.org/abs/2510.27261


    End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection

    oai:arXiv.org:2511.00139v2

    arXiv:2511.00139v2 Announce Type: replace Abstract: Achieving human-like dexterous manipulation remains a major challenge for general-purpose robots. While Vision-Language-Action (VLA) models show potential in learning skills from demonstrations, their scalability is limited by scarce high-quality training data. Existing data collection methods face inherent constraints: manual teleoperation overloads human operators, while automated planning often produces unnatural motions. We propose a Shared Autonomy framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation, while an autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback. This division significantly reduces cognitive load and enables efficient collection of high-quality coordinated arm-hand demonstrations. Using this data, we train an end-to-end VLA policy enhanced with our novel Arm-Hand Feature Enhancement module, which captures both distinct and shared representations of macro and micro movements for more natural coordination. Our Corrective Teleoperation system enables continuous policy improvement through human-in-the-loop failure recovery. Experiments demonstrate that our framework generates high-quality data with minimal manpower and achieves a 90% success rate across diverse objects, including unseen instances. Comprehensive evaluations validate the system's effectiveness in developing dexterous manipulation capabilities.

    https://arxiv.org/abs/2511.00139


    ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

    oai:arXiv.org:2511.00511v4

    arXiv:2511.00511v4 Announce Type: replace Abstract: Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality. Project page: https://angericky.github.io/ID-Crafter

    https://arxiv.org/abs/2511.00511


    Adaptive Federated Learning to Optimize Integrated Flows in Cyber-Physical Data Centers

    oai:arXiv.org:2511.00623v2

    arXiv:2511.00623v2 Announce Type: replace Abstract: Data centers play an increasingly critical role in societal digitalization, yet their rapidly growing energy demand poses significant challenges for sustainable operation. To enhance the energy efficiency of geographically distributed data centers, this paper formulates a multi-period optimization model that captures the interdependence of electricity, heat, and data flows. The optimization of such integrated multi-domain flows inherently involves mixed-integer formulations and the access to proprietary or sensitive datasets, which correspondingly exacerbate computational complexity and raise data-privacy concerns. To address these challenges, an adaptive federated learning-to-optimization approach is proposed, accounting for the heterogeneity of datasets across distributed data centers. To safeguard privacy, cryptography techniques are leveraged in both the learning and optimization processes. A model acceptance criterion with convergence guarantee is developed to improve learning performance and filter out potentially contaminated data, while a verifiable double aggregation mechanism is further proposed to simultaneously ensure privacy and integrity of shared data during optimization. Theoretical analysis and numerical simulations demonstrate that the proposed approach preserves the privacy and integrity of shared data, achieves near-optimal performance, and exhibits high computational efficiency, making it suitable for large-scale data center optimization under privacy constraints.

    https://arxiv.org/abs/2511.00623


    On Structural Properties of Risk-Averse Optimal Stopping Problems

    oai:arXiv.org:2511.01022v3

    arXiv:2511.01022v3 Announce Type: replace Abstract: We establish structural properties of optimal stopping problems under time-consistent dynamic (coherent) risk measures, focusing on value function monotonicity and the existence of control limit (threshold) optimal policies. While such results are well developed for risk-neutral (expected-value) models, they remain underexplored in risk-averse settings. Coherent risk measures typically lack the tower property and are subadditive rather than additive, complicating structural analysis. We show that value function monotonicity mirrors the risk-neutral case. Moreover, if the risk envelope associated with each coherent risk measure admits a minimal element, the risk-averse optimal stopping problem reduces to an equivalent risk-neutral formulation. We also develop a general procedure for identifying control limit optimal policies and use it to derive practical, verifiable conditions on the risk measures and MDP structure that guarantee their existence. We illustrate the theory and verify these conditions through optimal stopping problems arising in operations, marketing, and finance.

    https://arxiv.org/abs/2511.01022


    Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape

    oai:arXiv.org:2511.02122v2

    arXiv:2511.02122v2 Announce Type: replace Abstract: In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.

    https://arxiv.org/abs/2511.02122


    AI Agents with Decentralized Identifiers and Verifiable Credentials

    oai:arXiv.org:2511.02841v2

    arXiv:2511.02841v2 Announce Type: replace Abstract: A fundamental limitation of current LLM-based AI agents is their inability to build differentiated trust among each other at the onset of an agent-to-agent dialogue. However, autonomous and interoperable trust establishment becomes essential once agents start to operate beyond isolated environments and engage in dialogues across individual or organizational boundaries. A promising way to fill this gap in Agentic AI is to equip agents with long-lived digital identities and introduce tamper-proof and flexible identity-bound attestations of agents, provisioned by commonly trusted third parties and designed for cross-domain verifiability. This article presents a conceptual framework and a prototypical multi-agent system, where each agent is endowed with a self-sovereign digital identity. It combines a unique and ledger-anchored W3C Decentralized Identifier (DID) of an agent with a set of third-party issued W3C Verifiable Credentials (VCs). This enables agents at the start of a dialog to prove ownership of their self-controlled DIDs for authentication purposes and to establish various cross-domain trust relationships through the spontaneous exchange of their self-hosted DID-bound VCs. A comprehensive evaluation of the prototypical implementation demonstrates technical feasibility but also reveals limitations once an agent's LLM is in sole charge to control the respective security procedures.

    https://arxiv.org/abs/2511.02841


    Mathematical exploration and discovery at scale

    oai:arXiv.org:2511.02864v2

    arXiv:2511.02864v2 Announce Type: replace Abstract: AlphaEvolve (Novikov et al., 2025) is a generic evolutionary coding agent that combines the generative capabilities of LLMs with automated evaluation in an iterative evolutionary framework that proposes, tests, and refines algorithmic solutions to challenging scientific and practical problems. In this paper we showcase AlphaEvolve as a tool for autonomously discovering novel mathematical constructions and advancing our understanding of long-standing open problems. To demonstrate its breadth, we considered a list of 67 problems spanning mathematical analysis, combinatorics, geometry, and number theory. The system rediscovered the best known solutions in most of the cases and discovered improved solutions in several. In some instances, AlphaEvolve is also able to generalize results for a finite number of input values into a formula valid for all input values. Furthermore, we are able to combine this methodology with Deep Think and AlphaProof in a broader framework where the additional proof-assistants and reasoning systems provide automated proof generation and further mathematical insights. These results demonstrate that large language model-guided evolutionary search can autonomously discover mathematical constructions that complement human intuition, at times matching or even improving the best known results, highlighting the potential for significant new ways of interaction between mathematicians and AI systems. We present AlphaEvolve as a powerful new tool for mathematical discovery, capable of exploring vast search spaces to solve complex optimization problems at scale, often with significantly reduced requirements on preparation and computation time.

    https://arxiv.org/abs/2511.02864


    Implementation and Brief Experimental Analysis of the Duan et al. (2025) Algorithm for Single-Source Shortest Paths

    oai:arXiv.org:2511.03007v2

    arXiv:2511.03007v2 Announce Type: replace Abstract: We present an implementation and a brief experimental analysis of the deterministic algorithm proposed by Duan et al. (2025) for the Single-Source Shortest Path (SSSP) problem, which achieves the best known asymptotic upper bound in the comparison-addition model, with running time $O(m \log^{2/3} n)$. We provide a faithful C++ implementation of this algorithm, following all structural details described in the original paper, and compare its empirical performance with the classical Dijkstra's algorithm using binary heaps. The experiments were conducted on both synthetic sparse random graphs and real-world road network instances from the DIMACS benchmark. Our implementation achieves the proposed running time in the worst case. However, our results show that, despite the new algorithm's superior asymptotic complexity, its large constant factors significantly affect its practical performance, making Dijkstra's algorithm faster for all tested sparse graph sizes, including instances with tens of millions of vertices. (This is a work in progress.)

    https://arxiv.org/abs/2511.03007


    Ceci N'est Pas un Drone: Investigating the Impact of Design Representation on Design Decision Making When Using GenAI

    oai:arXiv.org:2511.03131v2

    arXiv:2511.03131v2 Announce Type: replace Abstract: With generative AI-powered design tools, designers and engineers can efficiently generate large numbers of design ideas. However, efficient exploration of these ideas requires designers to select a smaller group of potential solutions for further development. Therefore, the ability to judge and evaluate designs is critical for the successful use of generative design tools. Different design representation modalities can potentially affect designers' judgments. This work investigates how different design modalities, including visual rendering, numerical performance data, and a combination of both, affect designers' design selections from AI-generated design concepts for Uncrewed Aerial Vehicles. We found that different design modalities do affect designers' choices. Unexpectedly, we found that providing only numerical design performance data can lead to the best ability to select optimal designs. We also found that participants prefer visually conventional designs with axis-symmetry. The findings of this work provide insights into the interaction between human users and generative design systems.

    https://arxiv.org/abs/2511.03131


    Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response

    oai:arXiv.org:2511.03132v3

    arXiv:2511.03132v3 Announce Type: replace Abstract: This paper presents the first AI/ML system for automating building damage assessment in uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.

    https://arxiv.org/abs/2511.03132


    Vectorized Computation of Euler Characteristic Functions and Transforms

    oai:arXiv.org:2511.03909v2

    arXiv:2511.03909v2 Announce Type: replace Abstract: The weighted Euler characteristic transform (WECT) and Euler characteristic function (ECF) have proven to be useful tools in a variety of applications. However, current methods for computing these functions are either not optimized for speed or do not scale to higher-dimensional settings. In this work, we present a vectorized framework for computing such topological descriptors using tensor operations, which is highly optimized for GPU architectures and works in full generality across simplicial and cubical complexes of arbitrary dimension. Experimentally, the framework demonstrates significant speedups over existing methods when computing the WECT and ECF across a variety of two- and three-dimensional datasets. Computation of these transforms is implemented in a publicly available Python package called pyECT.

    https://arxiv.org/abs/2511.03909


    End-to-End Reinforcement Learning of Koopman Models for eNMPC of an Air Separation Unit

    oai:arXiv.org:2511.04522v2

    arXiv:2511.04522v2 Announce Type: replace Abstract: With our recently proposed method based on reinforcement learning (Mayfrank et al. (2024), Comput. Chem. Eng. 190), Koopman surrogate models can be trained for optimal performance in specific (economic) nonlinear model predictive control ((e)NMPC) applications. So far, our method has exclusively been demonstrated on a small-scale case study. Herein, we show that our method scales well to a more challenging demand response case study built on a large-scale model of a single-product (nitrogen) air separation unit. Across all numerical experiments, we assume observability of only a few realistically measurable plant variables. Compared to a purely system identification-based Koopman eNMPC, which generates small economic savings but frequently violates constraints, our method delivers similar economic performance while avoiding constraint violations.

    https://arxiv.org/abs/2511.04522


    Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems

    oai:arXiv.org:2511.04594v2

    arXiv:2511.04594v2 Announce Type: replace Abstract: Multi-agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single-agent settings, the decentralized multi-agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry-based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard-to-learn instances for any number of agents, $n$. Our regret lower bound of $\Omega(\sqrt{K})$, over $K$ episodes, highlights the inherent learning difficulty in Dec-MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.

    https://arxiv.org/abs/2511.04594


    Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

    oai:arXiv.org:2511.06029v3

    arXiv:2511.06029v3 Announce Type: replace Abstract: Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention} (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increases throughput by up to 2.56x.

    https://arxiv.org/abs/2511.06029


    Error Estimate and Convergence Analysis for Data Valuation

    oai:arXiv.org:2511.06463v2

    arXiv:2511.06463v2 Announce Type: replace Abstract: Data valuation quantifies data importance, but existing methods cannot ensure validity in a single training process. The neural dynamic data valuation (NDDV) method [3] addresses this limitation. Based on NDDV, we are the first to explore error estimation and convergence analysis in data valuation. Under Lipschitz and smoothness assumptions, we derive quadratic error bounds for loss differences that scale inversely with time steps and quadratically with control variations, ensuring stability. We also prove that the expected squared gradient norm for the training loss vanishes asymptotically, and that the meta loss converges sublinearly over iterations. In particular, NDDV achieves sublinear convergence.

    https://arxiv.org/abs/2511.06463


    FractalCloud: A Fractal-Inspired Architecture for Efficient Large-Scale Point Cloud Processing

    oai:arXiv.org:2511.07665v2

    arXiv:2511.07665v2 Announce Type: replace Abstract: Three-dimensional (3D) point clouds are increasingly used in applications such as autonomous driving, robotics, and virtual reality (VR). Point-based neural networks (PNNs) have demonstrated strong performance in point cloud analysis, originally targeting small-scale inputs. However, as PNNs evolve to process large-scale point clouds with hundreds of thousands of points, all-to-all computation and global memory access in point cloud processing introduce substantial overhead, causing $O(n^2)$ computational complexity and memory traffic where n is the number of points}. Existing accelerators, primarily optimized for small-scale workloads, overlook this challenge and scale poorly due to inefficient partitioning and non-parallel architectures. To address these issues, we propose FractalCloud, a fractal-inspired hardware architecture for efficient large-scale 3D point cloud processing. FractalCloud introduces two key optimizations: (1) a co-designed Fractal method for shape-aware and hardware-friendly partitioning, and (2) block-parallel point operations that decompose and parallelize all point operations. A dedicated hardware design with on-chip fractal and flexible parallelism further enables fully parallel processing within limited memory resources. Implemented in 28 nm technology as a chip layout with a core area of 1.5 $mm^2$, FractalCloud achieves 21.7x speedup and 27x energy reduction over state-of-the-art accelerators while maintaining network accuracy, demonstrating its scalability and efficiency for PNN inference.

    https://arxiv.org/abs/2511.07665


    DiffRegCD: Integrated Registration and Change Detection with Diffusion Features

    oai:arXiv.org:2511.07935v3

    arXiv:2511.07935v3 Announce Type: replace Abstract: Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification based correspondence as a strong foundation for unified change detection.

    https://arxiv.org/abs/2511.07935


    Spacecraft Angular Rate Estimation via Event-Based Camera Sensing

    oai:arXiv.org:2511.08041v2

    arXiv:2511.08041v2 Announce Type: replace Abstract: This paper presents a method for determining spacecraft angular rates using event-based camera sensing. This is achieved by analyzing the temporal distribution of brightness events triggered by the apparent motion of stars. The location and polarity of the events are used to infer the apparent motion field of the stars, which is, in turn, employed to estimate the observer angular velocity in the camera frame. This can be converted to the spacecraft angular rates provided an attitude reference. The method is validated through numerical simulation for a synthetic dataset of event streams generated on random spacecraft pointing and rates conditions. The accuracy of the method is assessed, demonstrating its potential to complement or replace conventional rate sensors in spacecraft systems using event camera sensing.

    https://arxiv.org/abs/2511.08041


    Climate Driven Interactions Between Malaria Transmission and Diabetes Prevalence

    oai:arXiv.org:2511.08562v2

    arXiv:2511.08562v2 Announce Type: replace Abstract: Climate change is intensifying infectious and chronic diseases like malaria and diabetes, respectively, especially among the vulnerable populations. Global temperatures have risen by approximately $0.6^\circ$C since 1950, extending the window of transmission for mosquito-borne infections and worsening outcomes in diabetes due to metabolic stress caused by heat. People living with diabetes have already weakened immune defenses and, therefore, are at an alarmingly increased risk of contraction of malaria. However, most models rarely include both ways of interaction in changing climate conditions. In the paper, we introduce a new compartmental epidemiological model based on synthetic data fitted to disease patterns of India from 2019 to 2021. The framework captures temperature-dependent transmission parameters, seasonal variability, and different disease dynamics between diabetic and non-diabetic groups within the three-compartment system. Model calibration using Multi-Start optimization combined with Sequential Quadratic Programming allows us to find outstanding differences between populations. The odds of malaria infection in diabetic individuals were found to be 1.8--4.0 times higher, with peak infection levels in 35--36\%, as compared to 20--21\% in the non-diabetic ones. The fitted model was able to capture well the epidemiological patterns observed, while the basic reproduction number averaged around 2.3, ranging from 0.31 to 2.75 in different seasons. Given that India's diabetic population is set to rise to about 157 million people by 2050, these findings point to a pressing need for concerted efforts toward climate-informed health strategies and monitoring systems that address both malaria and diabetes jointly.

    https://arxiv.org/abs/2511.08562


    WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images

    oai:arXiv.org:2511.08987v3

    arXiv:2511.08987v3 Announce Type: replace Abstract: Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 $\mu m$ lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to "identity mapping", where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid "identity mapping" by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.

    https://arxiv.org/abs/2511.08987


    OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS

    oai:arXiv.org:2511.09397v2

    arXiv:2511.09397v2 Announce Type: replace Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.

    https://arxiv.org/abs/2511.09397


    PANDA -- Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning

    oai:arXiv.org:2511.09791v3

    arXiv:2511.09791v3 Announce Type: replace Abstract: Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.

    https://arxiv.org/abs/2511.09791


    Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models

    oai:arXiv.org:2511.09907v3

    arXiv:2511.09907v3 Announce Type: replace Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.

    https://arxiv.org/abs/2511.09907


    On the Influence of Artificial Intelligence on Human Problem-Solving: Empirical Insights for the Third Wave in a Multinational Longitudinal Pilot Study

    oai:arXiv.org:2511.11738v2

    arXiv:2511.11738v2 Announce Type: replace Abstract: This article presents the results and their discussion for the third wave (with n=23 participants) within a multinational longitudinal study that investigates the evolving paradigm of human-AI collaboration in problem-solving contexts. Building upon previous waves, our findings reveal the consolidation of a hybrid problem-solving culture characterized by strategic integration of AI tools within structured cognitive workflows. The data demonstrate near-universal AI adoption (95.7% with prior knowledge, 100% ChatGPT usage) primarily deployed through human-led sequences such as "Think, Internet, ChatGPT, Further Processing" (39.1%). However, this collaboration reveals a critical verification deficit that escalates with problem complexity. We empirically identify and quantify two systematic epistemic gaps: a belief-performance gap (up to +80.8 percentage points discrepancy between perceived and actual correctness) and a proof-belief gap (up to -16.8 percentage points between confidence and verification capability). These findings, derived from behavioral data and problem vignettes across complexity levels, indicate that the fundamental constraint on reliable AI-assisted work is solution validation rather than generation. The study concludes that educational and technological interventions must prioritize verification scaffolds (including assumption documentation protocols, adequacy criteria checklists, and triangulation procedures) to fortify the human role as critical validator in this new cognitive ecosystem.

    https://arxiv.org/abs/2511.11738


    Leveraging Large Language Models for Career Mobility Analysis: A Study of Gender, Race, and Job Change Using U.S. Online Resume Profiles

    oai:arXiv.org:2511.12010v2

    arXiv:2511.12010v2 Announce Type: replace Abstract: We present a large-scale analysis of career mobility of college-educated U.S. workers using online resume profiles to investigate how gender, race, and job change options are associated with upward mobility. This study addresses key research questions of how the job changes affect their upward career mobility, and how the outcomes of upward career mobility differ by gender and race. We address data challenges -- such as missing demographic attributes, missing wage data, and noisy occupation labels -- through various data processing and Artificial Intelligence (AI) methods. In particular, we develop a large language models (LLMs) based occupation classification method known as FewSOC that achieves accuracy significantly higher than the original occupation labels in the resume dataset. Analysis of 228,710 career trajectories reveals that intra-firm occupation change has been found to facilitate upward mobility most strongly, followed by inter-firm occupation change and inter-firm lateral move. Women and Black college graduates experience significantly lower returns from job changes than men and White peers. Multilevel sensitivity analyses confirm that these disparities are robust to cluster-level heterogeneity and reveal additional intersectional patterns.

    https://arxiv.org/abs/2511.12010


    Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

    oai:arXiv.org:2511.12876v2

    arXiv:2511.12876v2 Announce Type: replace Abstract: Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.

    https://arxiv.org/abs/2511.12876


    CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

    oai:arXiv.org:2511.12919v3

    arXiv:2511.12919v3 Announce Type: replace Abstract: Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.

    https://arxiv.org/abs/2511.12919


    CapeNext: Rethinking and Refining Dynamic Support Information for Category-Agnostic Pose Estimation

    oai:arXiv.org:2511.13102v2

    arXiv:2511.13102v2 Announce Type: replace Abstract: Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint description as semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process(e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.

    https://arxiv.org/abs/2511.13102


    A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features

    oai:arXiv.org:2511.13115v2

    arXiv:2511.13115v2 Announce Type: replace Abstract: 3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves the advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7\%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6\%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.

    https://arxiv.org/abs/2511.13115


    nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers

    oai:arXiv.org:2511.14465v2

    arXiv:2511.14465v2 Announce Type: replace Abstract: Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.

    https://arxiv.org/abs/2511.14465


    A $\mu$-Analysis and Synthesis Framework for Partial Integral Equations using IQCs

    oai:arXiv.org:2511.14896v2

    arXiv:2511.14896v2 Announce Type: replace Abstract: We develop a $\mu$-analysis and synthesis framework for infinite-dimensional systems that leverages the Integral Quadratic Constraints (IQCs) to compute the structured singular value's upper bound. The methodology formulates robust stability and performance conditions jointly as Linear Partial Integral Inequalities within the Partial Integral Equation framework, establishing connections between IQC multipliers and $\mu$-theory. Computational implementation via PIETOOLS enables computational tools that practically applicable to spatially distributed infinite dimensional systems. Illustrations with the help of Partial and Delay Differential Equations validate the effectiveness of the framework, showing a significant reduction in conservatism compared to unstructured methods and providing systematic tools for stability-performance trade-off analysis.

    https://arxiv.org/abs/2511.14896


    IPR-1: Interactive Physical Reasoner

    oai:arXiv.org:2511.15407v2

    arXiv:2511.15407v2 Announce Type: replace Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

    https://arxiv.org/abs/2511.15407


    Cyber-Resilient Data-Driven Event-Triggered Secure Control for Autonomous Vehicles Under False Data Injection Attacks

    oai:arXiv.org:2511.15925v2

    arXiv:2511.15925v2 Announce Type: replace Abstract: This paper proposes a cyber-resilient secure control framework for autonomous vehicles (AVs) subject to false data injection (FDI) threats as actuator attacks. The framework integrates data-driven modeling, event-triggered communication, and fractional-order sliding mode control (FSMC) to enhance the resilience against adversarial interventions. A dynamic model decomposition (DMD)-based methodology is employed to extract the lateral dynamics from real-world data, eliminating the reliance on conventional mechanistic modeling. To optimize communication efficiency, an event-triggered transmission scheme is designed to reduce the redundant transmissions while ensuring system stability. Furthermore, an extended state observer (ESO) is developed for real-time estimation and mitigation of actuator attack effects. Theoretical stability analysis, conducted using Lyapunov methods and linear matrix inequality (LMI) formulations, guarantees exponential error convergence. Extensive simulations validate the proposed event-triggered secure control framework, demonstrating substantial improvements in attack mitigation, communication efficiency, and lateral tracking performance. The results show that the framework effectively counteracts actuator attacks while optimizing communication-resource utilization, making it highly suitable for safety-critical AV applications.

    https://arxiv.org/abs/2511.15925


    Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

    oai:arXiv.org:2511.16020v2

    arXiv:2511.16020v2 Announce Type: replace Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.

    https://arxiv.org/abs/2511.16020


    Concept-Based Interpretability for Toxicity Detection

    oai:arXiv.org:2511.16689v2

    arXiv:2511.16689v2 Announce Type: replace Abstract: The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit as concepts that serve as strong indicators to identify whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy by generating toxic samples that exclude predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insights into the model's attribution on broader toxic language patterns.

    https://arxiv.org/abs/2511.16689


    OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

    oai:arXiv.org:2511.16931v2

    arXiv:2511.16931v2 Announce Type: replace Abstract: With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.

    https://arxiv.org/abs/2511.16931


    FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

    oai:arXiv.org:2511.17171v2

    arXiv:2511.17171v2 Announce Type: replace Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

    https://arxiv.org/abs/2511.17171


    ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP

    oai:arXiv.org:2511.17362v2

    arXiv:2511.17362v2 Announce Type: replace Abstract: Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50\% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.

    https://arxiv.org/abs/2511.17362


    RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation

    oai:arXiv.org:2511.17441v2

    arXiv:2511.17441v2 Announce Type: replace Abstract: Bimanual manipulation is essential for achieving human-like dexterity in robots, but the large-scale and diverse bimanual robot datasets remain scarce due to hardware heterogeneity across robotic platforms. To address the challenge, we present RoboCOIN, a comprehensive multi-embodiment bimanual manipulation dataset with over 180,000 demonstrations collected from 15 distinct robotic platforms. The dataset covers 16 scenarios, including residential, commercial, and working environments, with 421 tasks systematically organized by bimanual coordination patterns and object properties. Our key innovation is a hierarchical capability pyramid that provides multi-level annotations, spanning trajectory-level concepts, segment-level subtasks, and frame-level kinematics. We further develop CoRobot, a comprehensive processing framework featuring Robot Trajectory Markup Language (RTML) for quality assessment, automated annotation generation, and unified multi-embodiment management. Extensive experiments demonstrate the reliability and effectiveness of RoboCOIN in multi-embodiment bimanual learning, with significant performance improvements across various model architectures and robotic platforms. The complete dataset and framework are open-sourced and publicly available for further research purposes. Project website: https://FlagOpen.github.io/RoboCOIN/.

    https://arxiv.org/abs/2511.17441


    M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

    oai:arXiv.org:2511.17729v3

    arXiv:2511.17729v3 Announce Type: replace Abstract: We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary four large language models (LLMs) judge ensemble reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our Benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench

    https://arxiv.org/abs/2511.17729


    Learning to Debug: LLM-Organized Knowledge Trees for Solving RTL Assertion Failures

    oai:arXiv.org:2511.17833v2

    arXiv:2511.17833v2 Announce Type: replace Abstract: Debugging is the dominant cost in modern hardware verification, where assertion failures are among the most frequent and expensive to resolve. While Large Language Models (LLMs) show promise, they often fail to capture the precise, reusable expertise that engineers apply, leading to inaccurate responses. We propose GROVE, a hierarchical knowledge management framework that learns and organizes reusable debugging expertise into an LLM-organized knowledge tree for solving assertion failures. GROVE distills debugging knowledge from prior cases and organizes it into a vertical tree of configurable depth, with each node encoding a concise knowledge item and explicit applicability conditions. During training, GROVE uses a parallel, gradient-free loop where an LLM proposes tree modifications as structured JSON edits by learning from the cases. At test time, a budget-aware iterative zoom is performed to navigate the tree, retrieving a small set of applicable knowledge items that guide a base LLM's hypothesis generation and fix proposals. Evaluated on a suite of assertion-failure cases, GROVE delivers consistent gains in pass@1 and pass@5, demonstrating the value of structured knowledge evolution.

    https://arxiv.org/abs/2511.17833


    Coherent Multi-Agent Trajectory Forecasting in Team Sports with CausalTraj

    oai:arXiv.org:2511.18248v2

    arXiv:2511.18248v2 Announce Type: replace Abstract: Jointly forecasting trajectories of multiple interacting agents is a core challenge in sports analytics and other domains involving complex group dynamics. Accurate prediction enables realistic simulation and strategic understanding of gameplay evolution. Most existing models are evaluated solely on per-agent accuracy metrics (minADE, minFDE), which assess each agent independently on its best-of-k prediction. However these metrics overlook whether the model learns which predicted trajectories can jointly form a plausible multi-agent future. Many state-of-the-art models are designed and optimized primarily based on these metrics. As a result, they may underperform on joint predictions and also fail to generate coherent, interpretable multi-agent scenarios in team sports. We propose CausalTraj, a temporally causal, likelihood-based model that is built to generate jointly probable multi-agent trajectory forecasts. To better assess collective modeling capability, we emphasize joint metrics (minJADE, minJFDE) that measure joint accuracy across agents within the best generated scenario sample. Evaluated on the NBA SportVU, Basketball-U, and Football-U datasets, CausalTraj achieves competitive per-agent accuracy and the best recorded results on joint metrics, while yielding qualitatively coherent and realistic gameplay evolutions.

    https://arxiv.org/abs/2511.18248


    Progressive Localisation in Localist LLMs

    oai:arXiv.org:2511.18375v3

    arXiv:2511.18375v3 Announce Type: replace Abstract: This paper demonstrates that progressive localization, the gradual increase of attention locality from early distributed layers to late localized layers, represents the optimal architecture for creating interpretable large language models (LLMs) while preserving performance. Through systematic experimentation with GPT-2 fine-tuned on The Psychology of Artificial Superintelligence, we evaluate five locality configurations: two uniform baselines (fully distributed and fully localist) and three progressive polynomial schedules. We investigate whether interpretability constraints can be aligned with natural semantic structure while being applied strategically across network depth. We demonstrate that progressive semantic localization, combining adaptive semantic block partitioning with steep polynomial locality schedules, achieves near-baseline language modeling performance while providing interpretable attention patterns. Multiple independent training runs with different random seeds establish that results are statistically robust and highly reproducible. The approach dramatically outperforms both fixed-window localization and naive uniform locality constraints. Analysis reveals that maintaining flexibility through low-fidelity constraints preserves model capacity while providing interpretability benefits, and that steep schedules concentrating locality in decision-critical final layers while preserving distributed learning in early layers achieve near-baseline attention distribution characteristics. These findings demonstrate that interpretability mechanisms should align with semantic structure to achieve practical performance-interpretability tradeoffs for trustworthy AI systems.

    https://arxiv.org/abs/2511.18375


    Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling

    oai:arXiv.org:2511.18858v2

    arXiv:2511.18858v2 Announce Type: replace Abstract: Dataset distillation creates a small distilled set that enables efficient training by capturing key information from the full dataset. While existing dataset distillation methods perform well on balanced datasets, they struggle under long-tailed distributions, where imbalanced class frequencies induce biased model representations and corrupt statistical estimates such as Batch Normalization (BN) statistics. In this paper, we rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods, and instead adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. To this end, we introduce three dedicated components that enable unbiased recovery of distilled images and soft relabeling: (1) enhancing expert models (an observer model for recovery and a teacher model for relabeling) to enable reliable statistics estimation and soft-label generation; (2) recalibrating BN statistics via a full forward pass with dynamically adjusted momentum to reduce representation skew; (3) initializing synthetic images by incrementally selecting high-confidence and diverse augmentations via a multi-round mechanism that promotes coverage and diversity. Extensive experiments on four long-tailed benchmarks show consistent improvements over state-of-the-art methods across varying degrees of class imbalance. Notably, our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT under IPC=10 and IF=10. Codes are available at https://github.com/2018cx/RLDD.

    https://arxiv.org/abs/2511.18858


    A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

    oai:arXiv.org:2511.19004v2

    arXiv:2511.19004v2 Announce Type: replace Abstract: Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.

    https://arxiv.org/abs/2511.19004


    Human-AI Teaming Under Deception: An Implicit BCI Safeguards Drone Team Performance in Virtual Reality

    oai:arXiv.org:2511.19312v2

    arXiv:2511.19312v2 Announce Type: replace Abstract: Human-AI teams can be vulnerable to catastrophic failure when feedback from the AI is incorrect, especially under high cognitive workload. Traditional team aggregation methods, such as voting, are susceptible to these AI errors, which can actively bias the behaviour of each individual and inflate the likelihood of an erroneous group decision. We hypothesised that a collaborative Brain-Computer Interface (cBCI) using neural activity collected before a behavioural decision is made can provide a source of information that is "decoupled" from this biased behaviour, thereby protecting the team from the deleterious influence of AI error. We tested this in a VR drone surveillance task where teams of operators faced high workload and systematically misleading AI cues. Using a passive BCI (pBCI) framework validated via offline simulation, we compared traditional behaviour-based team strategies against a purely Neuro-Decoupled Team (NDT) that used only BCI confidence scores derived from pre-response EEG. Under AI deception, behaviour-based teams catastrophically failed, with Majority Vote accuracy collapsing to 42.6% (worse than chance). The NDT, however, maintained a robust 68.3% accuracy. While this did not exceed the best individual's theoretical maximum, it provided a critical +25.7% "Safety Net Delta" that prevented the team from succumbing to the correlated error. This resilience was explained by a neuro-behavioural decoupling, where the BCI's predictions relied on preserved posterior-visual processing ("The Truth Signal") while the operators' executive monitoring systems collapsed. We conclude that an implicit BCI provides resilience by learning to bypass a compromised executive networks and access the preserved sensory representation of ground truth, defending against AI-induced error in high-stakes environments.

    https://arxiv.org/abs/2511.19312


    The DataSquad Experiment: Lessons for Preparing Data and Computer Scientists for Work

    oai:arXiv.org:2511.19688v2

    arXiv:2511.19688v2 Announce Type: replace Abstract: The DataSquad at Carleton College addresses a common problem at small liberal arts colleges: limited capacity for data services and few opportunities for students to gain practical experience with data and software development. Academic Technologist Paula Lackie designed the program as a work-study position that trains undergraduates through structured peer mentorship and real client projects. Students tackle data problems of increasing complexity-from basic data analysis to software development-while learning FAIR data principles and open science practices. The model's core components (peer mentorship structure, project-based learning, and communication training) make it adaptable to other institutions. UCLA and other colleges have adopted the model using openly shared materials through "DataSquad International." This paper describes the program's implementation at Carleton College and examines how structured peer mentorship can simultaneously improve institutional data services and provide students with professional skills and confidence.

    https://arxiv.org/abs/2511.19688


    Updatable Balanced Index for Fast On-device Search with Auto-selection Model

    oai:arXiv.org:2511.20049v2

    arXiv:2511.20049v2 Announce Type: replace Abstract: Diverse types of edge data, such as 2D geo-locations and 3D point clouds, are collected by sensors like lidar and GPS receivers on edge devices. On-device searches, such as k-nearest neighbor (kNN) search and radius search, are commonly used to enable fast analytics and learning technologies, such as k-means dataset simplification using kNN. To maintain high search efficiency, a representative approach is to utilize a balanced multi-way KD-tree (BMKD-tree). However, the index has shown limited gains, mainly due to substantial construction overhead, inflexibility to real-time insertion, and inconsistent query performance. In this paper, we propose UnIS to address the above limitations. We first accelerate the construction process of the BMKD-tree by utilizing the dataset distribution to predict the splitting hyperplanes. To make the continuously generated data searchable, we propose a selective sub-tree rebuilding scheme to accelerate rebalancing during insertion by reducing the number of data points involved. We then propose an auto-selection model to improve query performance by automatically selecting the optimal search strategy among multiple strategies for an arbitrary query task. Experimental results show that UnIS achieves average speedups of 17.96x in index construction, 1.60x in insertion, 7.15x in kNN search, and 1.09x in radius search compared to the BMKD-tree. We further verify its effectiveness in accelerating dataset simplification on edge devices, achieving a speedup of 217x over Lloyd's algorithm.

    https://arxiv.org/abs/2511.20049


    HHFT: Hierarchical Heterogeneous Feature Transformer for Recommendation Systems

    oai:arXiv.org:2511.20235v2

    arXiv:2511.20235v2 Announce Type: replace Abstract: We propose HHFT (Hierarchical Heterogeneous Feature Transformer), a Transformer-based architecture tailored for industrial CTR prediction. HHFT addresses the limitations of DNN through three key designs: (1) Semantic Feature Partitioning: Grouping heterogeneous features (e.g. user profile, item information, behaviour sequennce) into semantically coherent blocks to preserve domain-specific information; (2) Heterogeneous Transformer Encoder: Adopting block-specific QKV projections and FFNs to avoid semantic confusion between distinct feature types; (3) Hiformer Layer: Capturing high-order interactions across features. Our findings reveal that Transformers significantly outperform DNN baselines, achieving a +0.4% improvement in CTR AUC at scale. We have successfully deployed the model on Taobao's production platform, observing a significant uplift in key business metrics, including a +0.6% increase in Gross Merchandise Value (GMV).

    https://arxiv.org/abs/2511.20235


    MODEST: Multi-Optics Depth-of-Field Stereo Dataset

    oai:arXiv.org:2511.20853v2

    arXiv:2511.20853v2 Announce Type: replace Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

    https://arxiv.org/abs/2511.20853


    Vertex-based Graph Neural Solver and its Application to Linear Elasticity Equations

    oai:arXiv.org:2511.21491v2

    arXiv:2511.21491v2 Announce Type: replace Abstract: The efficient numerical solution of variable-coefficient linear elasticity equations on unstructured meshes presents a formidable challenge in computational mechanics. Convergence rates of classical iterative methods often stagnate due to material heterogeneity, strong anisotropy, and mesh irregularity. To address these limitations, this paper proposes a novel deep learning-based hybrid iterative method that integrates a weighted block Jacobi smoother with a graph neural network-enhanced global corrector. We introduce the adaptive graph Fourier neural solver, which employs a learnable coordinate transformation to construct a dynamic, problem-dependent spectral basis. This approach effectively overcomes the limitations of fixed Fourier bases in capturing the multi-scale features inherent in variable-coefficient media. Furthermore, to handle 3D and strongly anisotropic systems, we develop the multilevel adaptive graph Fourier neural solver, which executes hierarchical error correction across cascading frequency bandwidths. Rigorous theoretical analysis, grounded in the energy norm and Korn's inequality, establishes mesh-independent convergence guarantees. Extensive numerical experiments on 2D and 3D elasticity problems demonstrate that the proposed method exhibits superior robustness and convergence rates compared to the classical smoothed aggregation algebraic multigrid method, serving effectively as both a standalone solver and a preconditioner for Krylov subspace methods.

    https://arxiv.org/abs/2511.21491


    Self-Transparency Failures in Expert-Persona LLMs: How Instruction-Following Overrides Honesty

    oai:arXiv.org:2511.21569v4

    arXiv:2511.21569v4 Announce Type: replace Abstract: Self-transparency is a critical safety boundary, requiring language models to honestly disclose their limitations and artificial nature. This study stress-tests this capability, investigating whether models willingly disclose their identity when assigned professional personas that conflict with transparent self-representation. When models prioritize role consistency over this boundary disclosure, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% -- an 8.8-fold difference that emerged at the initial epistemic inquiry. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($\Delta R_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($\kappa=0.908$). Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.

    https://arxiv.org/abs/2511.21569


    TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video

    oai:arXiv.org:2511.21946v2

    arXiv:2511.21946v2 Announce Type: replace Abstract: Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360 videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k comprising 10k perspective videos with ground truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid 3D methods. Project page: https://finlay-hudson.github.io/tapvid360

    https://arxiv.org/abs/2511.21946


    Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation

    oai:arXiv.org:2511.22447v2

    arXiv:2511.22447v2 Announce Type: replace Abstract: Multimodal Emotion Recognition in Conversation (MERC) aims to enhance emotion understanding by integrating complementary cues from text, audio, and visual modalities. Existing MERC approaches predominantly focus on cross-modal shared features, often overlooking modality-specific features that capture subtle yet critical emotional cues such as micro-expressions, prosodic variations, and sarcasm. Although related work in multimodal emotion recognition (MER) has explored disentangling shared and modality-specific features, these methods typically employ rigid orthogonal constraints to achieve full disentanglement, which neglects the inherent complementarity between feature types and may limit recognition performance. To address these challenges, we propose Angle-Optimized Feature Learning (AO-FL), a framework tailored for MERC that achieves partial disentanglement of shared and specific features within each modality through adaptive angular optimization. Specifically, AO-FL aligns shared features across modalities to ensure semantic consistency, and within each modality it adaptively models the angular relationship between its shared and modality-specific features to preserve both distinctiveness and complementarity. An orthogonal projection refinement further removes redundancy in specific features and enriches shared features with contextual information, yielding more discriminative multimodal representations. Extensive experiments confirm the effectiveness of AO-FL for MERC, demonstrating superior performance over state-of-the-art approaches. Moreover, AO-FL can be seamlessly integrated with various unimodal feature extractors and extended to other multimodal fusion tasks, such as MER, thereby highlighting its strong generalization beyond MERC.

    https://arxiv.org/abs/2511.22447


    Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework

    oai:arXiv.org:2511.23059v2

    arXiv:2511.23059v2 Announce Type: replace Abstract: Traditional Chinese Medicine theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis. Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient and replicable HITL methodological pathway for translation of ancient, concept-dense texts like TCM.

    https://arxiv.org/abs/2511.23059


    Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory

    oai:arXiv.org:2511.23083v3

    arXiv:2511.23083v3 Announce Type: replace Abstract: High-capacity kernel Hopfield networks exhibit a \textit{Ridge of Optimization} characterized by extreme stability. While previously linked to \textit{Spectral Concentration}, its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability, a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of \textit{Dual Equilibrium} in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.

    https://arxiv.org/abs/2511.23083


    Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant

    oai:arXiv.org:2512.00009v2

    arXiv:2512.00009v2 Announce Type: replace Abstract: Qualitative research emphasizes constructing meaning through iterative engagement with textual data. Traditionally this human-driven process requires navigating coder fatigue and interpretative drift, thus posing challenges when scaling analysis to larger, more complex datasets. Computational approaches to augment qualitative research have been met with skepticism, partly due to their inability to replicate the nuance, context-awareness, and sophistication of human analysis. Large language models, however, present new opportunities to automate aspects of qualitative analysis while upholding rigor and research quality in important ways. To assess their benefits and limitations - and build trust among qualitative researchers - these approaches must be rigorously benchmarked against human-generated datasets. In this work, we benchmark Muse, an interactive, AI-powered qualitative research system that allows researchers to identify themes and annotate datasets, finding an inter-rater reliability between Muse and humans of Cohen's $\kappa$ = 0.71 for well-specified codes. We also conduct robust error analysis to identify failure mode, guide future improvements, and demonstrate the capacity to correct for human bias.

    https://arxiv.org/abs/2512.00009


    Hardware-Aware DNN Compression for Homogeneous Edge Devices

    oai:arXiv.org:2512.00017v2

    arXiv:2512.00017v2 Announce Type: replace Abstract: Deploying deep neural networks (DNNs) across homogeneous edge devices (the devices with the same SKU labeled by the manufacturer) often assumes identical performance among them. However, once a device model is widely deployed, the performance of each device becomes different after a period of running. This is caused by the differences in user configurations, environmental conditions, manufacturing variances, battery degradation, etc. Existing DNN compression methods have not taken this scenario into consideration and can not guarantee good compression results in all homogeneous edge devices. To address this, we propose Homogeneous-Device Aware Pruning (HDAP), a hardware-aware DNN compression framework explicitly designed for homogeneous edge devices, aiming to achieve optimal average performance of the compressed model across all devices. To deal with the difficulty of time-consuming hardware-aware evaluations for thousands or millions of homogeneous edge devices, HDAP partitions all the devices into several device clusters, which can dramatically reduce the number of devices to evaluate and use the surrogate-based evaluation instead of hardware evaluation in real-time. Extensive experiments on multiple device types (Jetson Xavier NX and Jetson Nano) and task types (image classification with ResNet50, MobileNetV1, ResNet56, VGG16; object detection with YOLOv8n) demonstrate that HDAP consistently achieves lower average latency and competitive accuracy compared to state-of-the-art methods, with significant speedups (e.g., 2.86$\times$ on ResNet50 at 1.0G FLOPs). HDAP offers an effective solution for scalable, high-performance DNN deployment methods for homogeneous edge devices.

    https://arxiv.org/abs/2512.00017


    RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs

    oai:arXiv.org:2512.00319v2

    arXiv:2512.00319v2 Announce Type: replace Abstract: The Structure Gap between probabilistic LLM generation and deterministic schema requirements hinders automated workflows. We propose RL-Struct, a lightweight framework using Gradient Regularized Policy Optimization (GRPO) with a hierarchical reward function to align LLMs with structural constraints. This approach eliminates the critic network, reducing peak VRAM by 38% compared to PPO. On complex JSON tasks, RL-Struct achieves 89.7% structural accuracy and 92.1% validity, significantly outperforming SFT and zero-shot baselines. We also report an emergent curriculum--a self-organized learning process where the model prioritizes syntax before semantics. Our model is publicly available at https://huggingface.co/Freakz3z/Qwen-JSON.

    https://arxiv.org/abs/2512.00319


    Adaptive-lambda Subtracted Importance Sampled Scores in Machine Unlearning for DDPMs and VAEs

    oai:arXiv.org:2512.01054v2

    arXiv:2512.01054v2 Announce Type: replace Abstract: Machine Unlearning is essential for large generative models (VAEs, DDPMs) to comply with the right to be forgotten and prevent undesired content generation without costly retraining. Existing approaches, such as Static-lambda SISS for diffusion models, rely on a fixed mixing weight lambda, which is suboptimal because the required unlearning strength varies across samples and training stages. We propose Adaptive-lambda SISS, a principled extension that turns lambda into a latent variable dynamically inferred at each training step. A lightweight inference network parameterizes an adaptive posterior over lambda, conditioned on contextual features derived from the instantaneous SISS loss terms (retain/forget losses and their gradients). This enables joint optimization of the diffusion model and the lambda-inference mechanism via a variational objective, yielding significantly better trade-offs. We further extend the adaptive-lambda principle to score-based unlearning and introduce a multi-class variant of Score Forgetting Distillation. In addition, we present two new directions: (i) a hybrid objective combining the data-free efficiency of Score Forgetting Distillation with the direct gradient control of SISS, and (ii) a Reinforcement Learning formulation that treats unlearning as a sequential decision process, learning an optimal policy over a state space defined by the model's current memory of the forget set. Experiments on an augmented MNIST benchmark show that Adaptive-lambda SISS substantially outperforms the original static-lambda SISS, achieving stronger removal of forgotten classes while better preserving generation quality on the retain set.

    https://arxiv.org/abs/2512.01054


    Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework

    oai:arXiv.org:2512.01198v2

    arXiv:2512.01198v2 Announce Type: replace Abstract: Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.

    https://arxiv.org/abs/2512.01198


    How Far Are We from Genuinely Useful Deep Research Agents?

    oai:arXiv.org:2512.01948v2

    arXiv:2512.01948v2 Announce Type: replace Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.

    https://arxiv.org/abs/2512.01948


    EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI

    oai:arXiv.org:2512.02020v2

    arXiv:2512.02020v2 Announce Type: replace Abstract: Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.

    https://arxiv.org/abs/2512.02020


    dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

    oai:arXiv.org:2512.02498v3

    arXiv:2512.02498v3 Announce Type: replace Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process,which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce $\text{dots.ocr}$, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, $\text{dots.ocr}$ establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

    https://arxiv.org/abs/2512.02498


    From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

    oai:arXiv.org:2512.02580v2

    arXiv:2512.02580v2 Announce Type: replace Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.

    https://arxiv.org/abs/2512.02580


    PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking

    oai:arXiv.org:2512.02648v2

    arXiv:2512.02648v2 Announce Type: replace Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoints trajectory, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset are publicly available at: https://github.com/JHXion9/PoreTrack3D

    https://arxiv.org/abs/2512.02648


    Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

    oai:arXiv.org:2512.02660v2

    arXiv:2512.02660v2 Announce Type: replace Abstract: Late-interaction multimodal retrieval models like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on area efficiency. We evaluate on BBox-DocVQA with ground-truth bounding boxes. For within-page localization (given correct page retrieval), ColQwen3-4B with percentile-50 thresholding achieves 59.7% hit rate at IoU@0.5 (84.4% at IoU@0.25, 35.8% at IoU@0.7), with mean IoU of 0.569, compared to ~6.7% for random region selection. Our approach reduces context tokens by 28.8% compared to returning all OCR regions and by 52.3% compared to full-page image tokens. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation at https://github.com/athrael-soju/Snappy.

    https://arxiv.org/abs/2512.02660


    Off-grid solar energy storage system with lithium iron phosphate (LFP) batteries in high mountains: a case report of Tianchi Lodge in Taiwan

    oai:arXiv.org:2512.02679v2

    arXiv:2512.02679v2 Announce Type: replace Abstract: Mountain huts are buildings located at high altitude, providing shelter and a place for hikers. Energy supply on mountain huts remains an open issue. Using renewable energies could be an appropriate solution. Tianchi Lodge, a famous mountain hut in Taiwan, has operated an off-grid solar energy storage system with lithium iron phosphate (LFP) batteries since 2020. In this case report, the energy architecture, detailed descriptions, and historical status of the system are provided.

    https://arxiv.org/abs/2512.02679


    VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

    oai:arXiv.org:2512.02700v2

    arXiv:2512.02700v2 Announce Type: replace Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup. The code is available at https://github.com/Casey-bit/VLMPruner.

    https://arxiv.org/abs/2512.02700


    HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

    oai:arXiv.org:2512.02792v2

    arXiv:2512.02792v2 Announce Type: replace Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

    https://arxiv.org/abs/2512.02792


    From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

    oai:arXiv.org:2512.03005v2

    arXiv:2512.03005v2 Announce Type: replace Abstract: The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

    https://arxiv.org/abs/2512.03005


    Difference Decomposition Networks for Infrared Small Target Detection

    oai:arXiv.org:2512.03470v2

    arXiv:2512.03470v2 Announce Type: replace Abstract: Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves a mIoU of 87.68\%, outperforming SD$^\mathrm{2}$Net, which achieves a mIoU of 64.97\%. Our codes are available: https://github.com/greekinRoma/IRSTD_HC_Platform.

    https://arxiv.org/abs/2512.03470


    MSG-Loc: Multi-Label Likelihood-based Semantic Graph Matching for Object-Level Global Localization

    oai:arXiv.org:2512.03522v2

    arXiv:2512.03522v2 Announce Type: replace Abstract: Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.

    https://arxiv.org/abs/2512.03522


    MDE-AgriVLN: Agricultural Vision-and-Language Navigation with Monocular Depth Estimation

    oai:arXiv.org:2512.03958v2

    arXiv:2512.03958v2 Announce Type: replace Abstract: Agricultural robots are serving as powerful assistants across a wide range of agricultural tasks, nevertheless, still heavily relying on manual operations or railway systems for movement. The AgriVLN method and the A2A benchmark pioneeringly extended Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position following a natural language instruction. Unlike human binocular vision, most agricultural robots are only given a single camera for monocular vision, which results in limited spatial perception. To bridge this gap, we present the method of Agricultural Vision-and-Language Navigation with Monocular Depth Estimation (MDE-AgriVLN), in which we propose the MDE module generating depth features from RGB images, to assist the decision-maker on multimodal reasoning. When evaluated on the A2A benchmark, our MDE-AgriVLN method successfully increases Success Rate from 0.23 to 0.32 and decreases Navigation Error from 4.43m to 4.08m, demonstrating the state-of-the-art performance in the agricultural VLN domain. Code: https://github.com/AlexTraveling/MDE-AgriVLN.

    https://arxiv.org/abs/2512.03958


    BlurDM: A Blur Diffusion Model for Image Deblurring

    oai:arXiv.org:2512.03979v2

    arXiv:2512.03979v2 Announce Type: replace Abstract: Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The project page is available at https://jin-ting-he.github.io/Blur-Diffusion-Model/.

    https://arxiv.org/abs/2512.03979


    Learning Group Actions In Disentangled Latent Image Representations

    oai:arXiv.org:2512.04015v2

    arXiv:2512.04015v2 Announce Type: replace Abstract: Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations .

    https://arxiv.org/abs/2512.04015


    GraphBench: Next-generation graph learning benchmarking

    oai:arXiv.org:2512.04475v2

    arXiv:2512.04475v2 Announce Type: replace Abstract: Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.

    https://arxiv.org/abs/2512.04475


    AI-Assisted Game Management Decisions: A Fuzzy Logic Approach to Real-Time Soccer Substitutions

    oai:arXiv.org:2512.04480v3

    arXiv:2512.04480v3 Announce Type: replace Abstract: In elite soccer, substitution decisions entail significant financial and sporting consequences yet remain heavily reliant on intuition or predictive models that merely mimic historical biases. This paper introduces a Fuzzy Logic based Decision Support System (DSS) designed for real time, prescriptive game management. Unlike traditional Machine Learning approaches that encounter a predictive ceiling by attempting to replicate human behavior, our system audits performance through an objective, rule based inference engine. We propose a methodological advancement by reformulating the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, eliminating the play time exposure bias inherent in cumulative sum models to enable accurate intra match comparison. The system integrates this refined metric with physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role) to calculate a dynamic Substitution Priority (P final). Validation via a case study of the 2018 FIFA World Cup match between Brazil and Belgium demonstrates the system's ecological validity: it not only aligned with expert consensus on executed substitutions (for example Gabriel Jesus) but, crucially, identified high risk scenarios ignored by human decision makers. Specifically, the model flagged the "FAGNER Paradox" - a maximum priority defensive risk - minutes before a critical yellow card, and detected the "Lukaku Paradox", where an isolated assist masked a severe drop in participation. These results confirm that Fuzzy Logic offers a transparent, explainable, and superior alternative to black box models for optimizing real time tactical decisions.

    https://arxiv.org/abs/2512.04480


    Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

    oai:arXiv.org:2512.04524v2

    arXiv:2512.04524v2 Announce Type: replace Abstract: Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.

    https://arxiv.org/abs/2512.04524


    A Unified Low-rank ADI Framework with Shared Linear Solves for Simultaneously Solving Multiple Lyapunov, Sylvester, and Riccati Equations

    oai:arXiv.org:2512.04676v2

    arXiv:2512.04676v2 Announce Type: replace Abstract: The alternating direction implicit (ADI) methods are computationally efficient and numerically effective tools for computing low-rank solutions of large-scale linear matrix equations. It is known in the literature that the low-rank ADI method for Lyapunov equations is a Petrov-Galerkin projection algorithm that implicitly performs model order reduction. It recursively enforces interpolation at the mirror images of the ADI shifts and places the poles of the reduced-order models at the ADI shifts. In this paper, we show that the low-rank ADI methods for Sylvester and Riccati equations are also Petrov-Galerkin projection algorithms that implicitly perform model order reduction. These methods likewise enforce interpolation at the mirror images of the ADI shifts; however, they do not place the poles at the mirror images of the interpolation points. Instead, their pole placement ensures that the projected Sylvester and Riccati equations they implicitly solve admit a unique solution. By observing that the ADI methods for Lyapunov, Sylvester, and Riccati equations differ only in pole placement and not in their interpolatory nature, we show that the shifted linear solves-which constitute the bulk of the computational cost-can be shared. The pole-placement step involves only small-scale operations and is therefore inexpensive. We propose a unified ADI framework that requires only two shifted linear solves per iteration to simultaneously solve six Lyapunov equations, one Sylvester equation, and ten Riccati equations, thus substantially increasing the return on investment for the computational cost spent on the linear solves. All operations needed to extract the individual solutions from these shared linear solves are small-scale and inexpensive. Three numerical examples are presented to demonstrate the effectiveness of the proposed ADI framework.

    https://arxiv.org/abs/2512.04676


    Geschlechts\"ubergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

    oai:arXiv.org:2512.04683v2

    arXiv:2512.04683v2 Announce Type: replace Abstract: This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with con-flicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across dif-ferent types of personal nouns. We conducted manual annotations of the whole inflectional para-digm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related per-sonal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psy-cholinguistic studies more closely with real-world language use.

    https://arxiv.org/abs/2512.04683


    Rotatable Antenna-Enhanced Cell-Free Communication

    oai:arXiv.org:2512.04742v2

    arXiv:2512.04742v2 Announce Type: replace Abstract: Rotatable antenna (RA) is a promising technology that can exploit new spatial degrees-of-freedom (DoFs) by flexibly adjusting the three-dimensional (3D) boresight direction of antennas. In this letter, we investigate an RA-enhanced cell-free system for downlink transmission, where multiple RA-equipped access points (APs) cooperatively serve multiple single-antenna users over the same time-frequency resource. Specifically, we aim to maximize the sum rate of all users by jointly optimizing the AP-user associations and the RA boresight directions. Accordingly, we propose a two-stage strategy to solve the AP-user association problem, and then employ fractional programming (FP) and successive convex approximation (SCA) techniques to optimize the RA boresight directions. Numerical results demonstrate that the proposed RA-enhanced cell-free system significantly outperforms various benchmark schemes.

    https://arxiv.org/abs/2512.04742


    EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

    oai:arXiv.org:2512.04810v5

    arXiv:2512.04810v5 Announce Type: replace Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with a few parameters increase. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.

    https://arxiv.org/abs/2512.04810


    The Geometry of Intelligence: Deterministic Functional Topology as a Foundation for Real-World Perception

    oai:arXiv.org:2512.05089v2

    arXiv:2512.05089v2 Announce Type: replace Abstract: Real-world physical processes do not generate arbitrary variability: their signals concentrate on compact and low-variability subsets of functional space. This geometric structure enables rapid generalization from a few examples in both biological and artificial systems. This work develops a deterministic functional-topological framework in which the set of valid realizations of a physical phenomenon forms a compact perceptual manifold with stable invariants and a finite Hausdorff radius. We show that the boundaries of this manifold can be discovered in a fully self-supervised manner through Monte Carlo sampling, even when the governing equations of the system are unknown. We provide theoretical guarantees, practical estimators of knowledge boundaries, and empirical validations across three domains: electromechanical railway point machines, electrochemical battery discharge curves, and physiological ECG signals. Our results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction, explaining why biological learners and self-supervised AI models can generalize from limited observations.

    https://arxiv.org/abs/2512.05089


    Light-X: Generative 4D Video Rendering with Camera and Illumination Control

    oai:arXiv.org:2512.05115v2

    arXiv:2512.05115v2 Announce Type: replace Abstract: Recent advances in illumination control extend image-based methods to video, yet still facing a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

    https://arxiv.org/abs/2512.05115


    DEAR: Dataset for Evaluating the Aesthetics of Rendering

    oai:arXiv.org:2512.05209v2

    arXiv:2512.05209v2 Announce Type: replace Abstract: Traditional Image Quality Assessment~(IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators with a total of 13,648 of them participating overall. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors' knowledge, DEAR is the first dataset to systematically address image aesthetics of rendering assessment grounded in subjective human preferences. A subset of 100 images with markup for them is published on HuggingFace (huggingface.co/datasets/vsevolodpl/DEAR).

    https://arxiv.org/abs/2512.05209


    AI & Human Co-Improvement for Safer Co-Superintelligence

    oai:arXiv.org:2512.05356v2

    arXiv:2512.05356v2 Announce Type: replace Abstract: Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to maximize co-improvement: collaboration between human researchers and AIs to achieve co-superintelligence. That is, specifically targeting improving AI systems' ability to work with human researchers to conduct AI research together, from ideation to experimentation, in order to both accelerate AI research and to generally endow both AIs and humans with safer superintelligence through their symbiosis. Focusing on including human research improvement in the loop will both get us there faster, and more safely.

    https://arxiv.org/abs/2512.05356


    China Regional 3km Downscaling Based on Residual Corrective Diffusion Model

    oai:arXiv.org:2512.05377v3

    arXiv:2512.05377v3 Announce Type: replace Abstract: A fundamental challenge in numerical weather prediction is to efficiently produce high-resolution forecasts. A common solution is applying downscaling methods, which include dynamical downscaling and statistical downscaling, to the outputs of global models. This work focuses on statistical downscaling, which establishes statistical relationships between low-resolution and high-resolution historical data using statistical models. Deep learning has emerged as a powerful tool for this task, giving rise to various high-performance super-resolution models, which can be directly applied for downscaling, such as diffusion models and Generative Adversarial Networks. This work relies on a diffusion-based downscaling framework named CorrDiff. In contrast to the original work of CorrDiff, the region considered in this work is nearly 40 times larger, and we not only consider surface variables as in the original work, but also encounter high-level variables (six pressure levels) as target downscaling variables. In addition, a global residual connection is added to improve accuracy. In order to generate the 3km forecasts for the China region, we apply our trained models to the 25km global grid forecasts of CMA-GFS, an operational global model of the China Meteorological Administration (CMA), and SFF, a data-driven deep learning-based weather model developed from Spherical Fourier Neural Operators (SFNO). CMA-MESO, a high-resolution regional model, is chosen as the baseline model. The experimental results demonstrate that the forecasts downscaled by our method generally outperform the direct forecasts of CMA-MESO in terms of MAE for the target variables. Our forecasts of radar composite reflectivity show that CorrDiff, as a generative model, can generate fine-scale details that lead to more realistic predictions compared to the corresponding deterministic regression models.

    https://arxiv.org/abs/2512.05377


    Wasserstein Evolution : Evolutionary Optimization as Phase Transition

    oai:arXiv.org:2512.05837v2

    arXiv:2512.05837v2 Announce Type: replace Abstract: This paper introduces a novel framework linking evolutionary computation to statistical physics by formulating optimization as a statistical phase transition. We propose Wasserstein Evolution (WE), an algorithm based on the Wasserstein gradient flow of a free energy functional, translating the physical competition between exploitative potential gradient forces and explorative entropic forces into an adaptive search mechanism. Theoretical analysis confirms WE's convergence to the Boltzmann distribution, ensuring a principled performance-diversity balance.Experiments compare WE against five established algorithms, including GA, DE, CMA-ES, JADE, and SaDE, on benchmark and physical potential functions. Results demonstrate that WE consistently achieves the lowest free energy and maintains the highest entropy, excelling in both solution quality and diversity preservation. Additional invariance tests using transformed Schwefel 2.22 functions verify that WE possesses translation, scale, and rotation invariance, proving its robustness and intrinsic alignment with the problem's geometry rather than its coordinate representation.Thus, this work delivers not only an effective and robust optimizer but also a new theoretical paradigm for understanding population-based search through statistical physics, viewing convergence as a disorder-to-order phase transition and opening new avenues for designing intelligent optimization methods.

    https://arxiv.org/abs/2512.05837


    Developing synthetic microdata through machine learning for firm-level business surveys

    oai:arXiv.org:2512.05948v2

    arXiv:2512.05948v2 Announce Type: replace Abstract: Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

    https://arxiv.org/abs/2512.05948


    Trusted AI Agents in the Cloud

    oai:arXiv.org:2512.05951v2

    arXiv:2512.05951v2 Announce Type: replace Abstract: AI agents powered by large language models are increasingly deployed as cloud services that autonomously access sensitive data, invoke external tools, and interact with other agents. However, these agents run within a complex multi-party ecosystem, where untrusted components can lead to data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per binary protection and offer no guarantees for cross-principal trust, accelerator-level isolation, or supervised agent behavior. We present Omega, a system that enables trusted AI agents by enforcing end-to-end isolation, establishing verifiable trust across all contributing principals, and supervising every external interaction with accountable provenance. Omega builds on Confidential VMs and Confidential GPUs to create a Trusted Agent Platform that hosts many agents within a single CVM using nested isolation. It also provides efficient multi-agent orchestration with cross-principal trust establishment via differential attestation, and a policy specification and enforcement framework that governs data access, tool usage, and inter-agent communication for data protection and regulatory compliance. Implemented on AMD SEV-SNP and NVIDIA H100, Omega fully secures agent state across CVM-GPU, and achieves high performance while enabling high-density, policy-compliant multi-agent deployments at cloud scale.

    https://arxiv.org/abs/2512.05951


    SparsePixels: Efficient Convolution for Sparse Data on FPGAs

    oai:arXiv.org:2512.06208v2

    arXiv:2512.06208v2 Announce Type: replace Abstract: Inference of standard convolutional neural networks (CNNs) on FPGAs often incurs high latency and a long initiation interval due to the deep nested loops required to densely convolve every input pixel regardless of its feature value. However, input features can be spatially sparse in some image data, where semantic information may occupy only a small fraction of the pixels and most computation would be wasted on empty regions. In this work, we introduce SparsePixels, a framework that implements sparse convolution on FPGAs by selectively retaining and computing on a small subset of active pixels while ignoring the rest. We show that, for identifying neutrino interactions in naturally sparse LArTPC images with 4k pixels, a standard CNN with a compact size of 4k parameters incurs an inference latency of 48.665 $\mu$s on an FPGA, whereas a sparse CNN of the same base architecture, computing on less than 1% of the input pixels, achieves a $\times 73$ speedup to 0.665 $\mu$s with resource utilization well within on-chip budgets, trading only a small percent-level performance loss. This work aims to benefit future algorithm development for efficient data readout in modern experiments with latency requirements of microseconds or below.

    https://arxiv.org/abs/2512.06208


    Privacy Loss of Noise Perturbation via Concentration Analysis of A Product Measure

    oai:arXiv.org:2512.06253v3

    arXiv:2512.06253v3 Announce Type: replace Abstract: Noise perturbation is one of the most fundamental approaches for achieving $(\epsilon,\delta)$-differential privacy (DP) guarantees when releasing the result of a query or function $f(\cdot)\in\mathbb{R}^M$ evaluated on a sensitive dataset $\mathbf{x}$. In this approach, calibrated noise $\mathbf{n}\in\mathbb{R}^M$ is used to obscure the difference vector $f(\mathbf{x})-f(\mathbf{x}')$, where $\mathbf{x}'$ is known as a neighboring dataset. A DP guarantee is obtained by studying the tail probability bound of a privacy loss random variable (PLRV), defined as the Radon-Nikodym derivative between two distributions. When $\mathbf{n}$ follows a multivariate Gaussian distribution, the PLRV is characterized as a specific univariate Gaussian. In this paper, we propose a novel scheme to generate $\mathbf{n}$ by leveraging the fact that the perturbation noise is typically spherically symmetric (i.e., the distribution is rotationally invariant around the origin). The new noise generation scheme allows us to investigate the privacy loss from a geometric perspective and express the resulting PLRV using a product measure, $W\times U$; measure $W$ is related to a radius random variable controlling the magnitude of $\mathbf{n}$, while measure $U$ involves a directional random variable governing the angle between $\mathbf{n}$ and the difference $f(\mathbf{x})-f(\mathbf{x}')$. We derive a closed-form moment bound on the product measure to prove $(\epsilon,\delta)$-DP. Under the same $(\epsilon,\delta)$-DP guarantee, our mechanism yields a smaller expected noise magnitude than the classic Gaussian noise in high dimensions, thereby significantly improving the utility of the noisy result $f(\mathbf{x})+\mathbf{n}$. To validate this, we consider convex and non-convex empirical risk minimization (ERM) problems in high dimensional space and apply the proposed product noise to achieve privacy.

    https://arxiv.org/abs/2512.06253


    RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

    oai:arXiv.org:2512.06276v2

    arXiv:2512.06276v2 Announce Type: replace Abstract: Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, which cannot reveal the grounding capability of Multi-modal Large Language Model (MLLM) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark, which decomposes referring expressions into two core dimensions, i.e., perception and reasoning, and further subdivides them into six progressively challenging tasks, such as attribute, position, interaction, commonsense, relation and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, We propose Ref-R1, an RL-based learning scheme, which incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions, establishing a stronger baseline for REC. Extensive experiments demonstrate that our RefBench-PRO enables interpretable evaluation of MLLM on referring expression comprehension, presenting greater challenges in both perception and reasoning.

    https://arxiv.org/abs/2512.06276


    Protecting Bystander Privacy via Selective Hearing in Audio LLMs

    oai:arXiv.org:2512.06380v2

    arXiv:2512.06380v2 Announce Type: replace Abstract: Audio Large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.

    https://arxiv.org/abs/2512.06380


    Entropy-Controlled Intrinsic Motivation Reinforcement Learning for Quadruped Robot Locomotion in Complex Terrains

    oai:arXiv.org:2512.06486v2

    arXiv:2512.06486v2 Announce Type: replace Abstract: Learning is the basis of both biological and artificial systems when it comes to mimicking intelligent behaviors. From the classical PPO (Proximal Policy Optimization), there is a series of deep reinforcement learning algorithms which are widely used in training locomotion policies for quadrupedal robots because of their stability and sample efficiency. However, among all these variants, experiments and simulations often converge prematurely, leading to suboptimal locomotion and reduced task performance. Therefore, in this paper, we introduce Entropy-Controlled Intrinsic Motivation (ECIM), an entropy-based reinforcement learning algorithm in contrast with the PPO series, that can reduce premature convergence by combining intrinsic motivation with adaptive exploration. For experiments, in order to parallel with other baselines, we chose to apply it in Isaac Gym across six terrain categories: upward slopes, downward slopes, uneven rough terrain, ascending stairs, descending stairs, and flat ground as widely used. For comparison, our experiments consistently achieve better performance: task rewards increase by 4--12%, peak body pitch oscillation is reduced by 23--29%, joint acceleration decreases by 20--32%, and joint torque consumption declines by 11--20%. Overall, our model ECIM, by combining entropy control and intrinsic motivation control, achieves better results in stability across different terrains for quadrupedal locomotion, and at the same time reduces energetic cost and makes it a practical choice for complex robotic control tasks.

    https://arxiv.org/abs/2512.06486


    VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

    oai:arXiv.org:2512.06802v2

    arXiv:2512.06802v2 Announce Type: replace Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.

    https://arxiv.org/abs/2512.06802


    CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

    oai:arXiv.org:2512.07155v3

    arXiv:2512.07155v3 Announce Type: replace Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down, mid, and up blocks features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners and enabling natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

    https://arxiv.org/abs/2512.07155


    ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning

    oai:arXiv.org:2512.07371v2

    arXiv:2512.07371v2 Announce Type: replace Abstract: Behavior-cloning based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. However, prior studies on acceleration methods mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantic and spatially aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data or architectural modifications, or any form of retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.

    https://arxiv.org/abs/2512.07371


    Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent

    oai:arXiv.org:2512.07490v2

    arXiv:2512.07490v2 Announce Type: replace Abstract: The problem of low-tubal-rank tensor estimation is a fundamental task with wide applications across high-dimensional signal processing, machine learning, and image science. Traditional approaches tackle such a problem by performing tensor singular value decomposition, which is computationally expensive and becomes infeasible for large-scale tensors. Recent approaches address this issue by factorizing the tensor into two smaller factor tensors and solving the resulting problem using gradient descent. However, this kind of approach requires an accurate estimate of the tensor rank, and when the rank is overestimated, the convergence of gradient descent and its variants slows down significantly or even diverges. To address this problem, we propose an Alternating Preconditioned Gradient Descent (APGD) algorithm, which accelerates convergence in the over-parameterized setting by adding a preconditioning term to the original gradient and updating these two factors alternately. Based on certain geometric assumptions on the objective function, we establish linear convergence guarantees for more general low-tubal-rank tensor estimation problems. Then we further analyze the specific cases of low-tubal-rank tensor factorization and low-tubal-rank tensor recovery. Our theoretical results show that APGD achieves linear convergence even under over-parameterization, and the convergence rate is independent of the tensor condition number. Extensive simulations on synthetic data are carried out to validate our theoretical assertions.

    https://arxiv.org/abs/2512.07490


    SA$^{2}$GFM: Enhancing Robust Graph Foundation Models with Structure-Aware Semantic Augmentation

    oai:arXiv.org:2512.07857v2

    arXiv:2512.07857v2 Announce Type: replace Abstract: We present Graph Foundation Models (GFMs) which have made significant progress in various tasks, but their robustness against domain noise, structural perturbations, and adversarial attacks remains underexplored. A key limitation is the insufficient modeling of hierarchical structural semantics, which are crucial for generalization. In this paper, we propose SA$^{2}$GFM, a robust GFM framework that improves domain-adaptive representations through Structure-Aware Semantic Augmentation. First, we encode hierarchical structural priors by transforming entropy-based encoding trees into structure-aware textual prompts for feature augmentation. The enhanced inputs are processed by a self-supervised Information Bottleneck mechanism that distills robust, transferable representations via structure-guided compression. To address negative transfer in cross-domain adaptation, we introduce an expert adaptive routing mechanism, combining a mixture-of-experts architecture with a null expert design. For efficient downstream adaptation, we propose a fine-tuning module that optimizes hierarchical structures through joint intra- and inter-community structure learning. Extensive experiments demonstrate that SA$^{2}$GFM outperforms 9 state-of-the-art baselines in terms of effectiveness and robustness against random noise and adversarial perturbations for node and graph classification.

    https://arxiv.org/abs/2512.07857


    Graph Contrastive Learning via Spectral Graph Alignment

    oai:arXiv.org:2512.07878v2

    arXiv:2512.07878v2 Announce Type: replace Abstract: Given augmented views of each input graph, contrastive learning methods (e.g., InfoNCE) optimize pairwise alignment of graph embeddings across views while providing no mechanism to control the global structure of the view specific graph-of-graphs built from these embeddings. We introduce SpecMatch-CL, a novel loss function that aligns the view specific graph-of-graphs by minimizing the difference between their normalized Laplacians. Theoretically, we show that under certain assumptions, the difference between normalized Laplacians provides an upper bound not only for the difference between the ideal Perfect Alignment contrastive loss and the current loss, but also for the Uniformly loss. Empirically, SpecMatch-CL establishes new state of the art on eight TU benchmarks under unsupervised learning and semi-supervised learning at low label rates, and yields consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.

    https://arxiv.org/abs/2512.07878


    Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem

    oai:arXiv.org:2512.08290v2

    arXiv:2512.08290v2 Announce Type: replace Abstract: The Model Context Protocol (MCP) has emerged as the de facto standard for connecting Large Language Models (LLMs) to external data and tools, effectively functioning as the "USB-C for Agentic AI." While this decoupling of context and execution solves critical interoperability challenges, it introduces a profound new threat landscape where the boundary between epistemic errors (hallucinations) and security breaches (unauthorized actions) dissolves. This Systematization of Knowledge (SoK) aims to provide a comprehensive taxonomy of risks in the MCP ecosystem, distinguishing between adversarial security threats (e.g., indirect prompt injection, tool poisoning) and epistemic safety hazards (e.g., alignment failures in distributed tool delegation). We analyze the structural vulnerabilities of MCP primitives, specifically Resources, Prompts, and Tools, and demonstrate how "context" can be weaponized to trigger unauthorized operations in multi-agent environments. Furthermore, we survey state-of-the-art defenses, ranging from cryptographic provenance (ETDI) to runtime intent verification, and conclude with a roadmap for securing the transition from conversational chatbots to autonomous agentic operating systems.

    https://arxiv.org/abs/2512.08290


    HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting

    oai:arXiv.org:2512.08334v2

    arXiv:2512.08334v2 Announce Type: replace Abstract: Rendering complex reflection of real-world scenes using 3D Gaussian splatting has been a quite promising solution for photorealistic novel view synthesis, but still faces bottlenecks especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting(HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we further introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. By extensive evaluation, our HybridSplat accelerates about 7x rendering speed across complex reflective scenes from Ref-NeRF, NeRF-Casting with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.

    https://arxiv.org/abs/2512.08334


    Predicting California Bearing Ratio with Ensemble and Neural Network Models: A Case Study from Turkiye

    oai:arXiv.org:2512.08340v2

    arXiv:2512.08340v2 Announce Type: replace Abstract: The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils, especially in transportation infrastructure and foundation design. Traditional CBR determination relies on laboratory penetration tests. Despite their accuracy, these tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in T\"urkiye. The dataset includes physicochemical soil properties relevant to bearing capacity, allowing multidimensional feature representation in a supervised learning context. Twelve ML algorithms were tested, including decision tree, random forest, extra trees, gradient boosting, xgboost, k-nearest neighbors, support vector regression, multi-layer perceptron, adaboost, bagging, voting, and stacking regressors. Each model was trained, validated, and evaluated to assess its generalization and robustness. Among them, the random forest regressor performed the best, achieving strong R2 scores of 0.95 (training), 0.76 (validation), and 0.83 (test). These outcomes highlight the model's powerful nonlinear mapping ability, making it a promising tool for predictive geotechnical tasks. The study supports the integration of intelligent, data-centric models in geotechnical engineering, offering an effective alternative to traditional methods and promoting digital transformation in infrastructure analysis and design.

    https://arxiv.org/abs/2512.08340


    DFALLM: Achieving Generalizable Multitask Deepfake Detection by Optimizing Audio LLM Components

    oai:arXiv.org:2512.08403v2

    arXiv:2512.08403v2 Announce Type: replace Abstract: Audio deepfake detection has recently garnered public concern due to its implications for security and reliability. Traditional deep learning methods have been widely applied to this task but often lack generalisability when confronted with newly emerging spoofing techniques and more tasks such as spoof attribution recognition rather than simple binary classification. In principle, Large Language Models (LLMs) are considered to possess the needed generalisation capabilities. However, previous research on Audio LLMs (ALLMs) indicates a generalization bottleneck in audio deepfake detection performance, even when sufficient data is available. Consequently, this study investigates the model architecture and examines the effects of the primary components of ALLMs, namely the audio encoder and the text-based LLM. Our experiments demonstrate that the careful selection and combination of audio encoders and text-based LLMs are crucial for unlocking the deepfake detection potential of ALLMs. We further propose an ALLM structure capable of generalizing deepfake detection abilities to out-of-domain spoofing tests and other deepfake tasks, such as spoof positioning and spoof attribution recognition. Our proposed model architecture achieves state-of-the-art (SOTA) performance across multiple datasets, including ASVSpoof2019, InTheWild, and Demopage, with accuracy reaching up to 95.76% on average, and exhibits competitive capabilities in other deepfake detection tasks such as attribution, and localisation compared to SOTA audio understanding models. Data and codes are provided in supplementary materials.

    https://arxiv.org/abs/2512.08403


    NeurIDA: Dynamic Modeling for Effective In-Database Analytics

    oai:arXiv.org:2512.08483v3

    arXiv:2512.08483v3 Announce Type: replace Abstract: Relational Database Management Systems (RDBMS) manage complex, interrelated data and support a broad spectrum of analytical tasks. With the growing demand for predictive analytics, the deep integration of machine learning (ML) into RDBMS has become critical. However, a fundamental challenge hinders this evolution: conventional ML models are static and task-specific, whereas RDBMS environments are dynamic and must support diverse analytical queries. Each analytical task entails constructing a bespoke pipeline from scratch, which incurs significant development overhead and hence limits wide adoption of ML in analytics. We present NeurIDA, an autonomous end-to-end system for in-database analytics that dynamically "tweaks" the best available base model to better serve a given analytical task. In particular, we propose a novel paradigm of dynamic in-database modeling to pre-train a composable base model architecture over the relational data. Upon receiving a task, NeurIDA formulates the task and data profile to dynamically select and configure relevant components from the pool of base models and shared model components for prediction. For friendly user experience, NeurIDA supports natural language queries; it interprets user intent to construct structured task profiles, and generates analytical reports with dedicated LLM agents. By design, NeurIDA enables ease-of-use and yet effective and efficient in-database AI analytics. Extensive experiment study shows that NeurIDA consistently delivers up to 12% improvement in AUC-ROC and 25% relative reduction in MAE across ten tasks on five real-world datasets. The source code is available at https://github.com/Zrealshadow/NeurIDA

    https://arxiv.org/abs/2512.08483


    A Lightweight Transfer Learning-Based State-of-Health Monitoring with Application to Lithium-ion Batteries in Autonomous Air Vehicles

    oai:arXiv.org:2512.08512v2

    arXiv:2512.08512v2 Announce Type: replace Abstract: Accurate and rapid state-of-health (SOH) monitoring plays an important role in indicating energy information for lithium-ion battery-powered portable mobile devices. To confront their variable working conditions, transfer learning (TL) emerges as a promising technique for leveraging knowledge from data-rich source working conditions, significantly reducing the training data required for SOH monitoring from target working conditions. However, traditional TL-based SOH monitoring is infeasible when applied in portable mobile devices since substantial computational resources are consumed during the TL stage and unexpectedly reduce the working endurance. To address these challenges, this paper proposes a lightweight TL-based SOH monitoring approach with constructive incremental transfer learning (CITL). First, taking advantage of the unlabeled data in the target domain, a semi-supervised TL mechanism is proposed to minimize the monitoring residual in a constructive way, through iteratively adding network nodes in the CITL. Second, the cross-domain learning ability of node parameters for CITL is comprehensively guaranteed through structural risk minimization, transfer mismatching minimization, and manifold consistency maximization. Moreover, the convergence analysis of the CITL is given, theoretically guaranteeing the efficacy of TL performance and network compactness. Finally, the proposed approach is verified through extensive experiments with a realistic autonomous air vehicles (AAV) battery dataset collected from dozens of flight missions. Specifically, the CITL outperforms SS-TCA, MMD-LSTM-DA, DDAN, BO-CNN-TL, and AS$^3$LSTM, in SOH estimation by 83.73%, 61.15%, 28.24%, 87.70%, and 57.34%, respectively, as evaluated using the index root mean square error.

    https://arxiv.org/abs/2512.08512


    Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank

    oai:arXiv.org:2512.08648v2

    arXiv:2512.08648v2 Announce Type: replace Abstract: The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersed-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose {\mname}, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. {\mname} offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, {\mname} achieves a state-of-the-art FID of \textbf{2.40} within 400k steps, significantly outperforming comparable methods.

    https://arxiv.org/abs/2512.08648


    De novo generation of functional terpene synthases using TpsGPT

    oai:arXiv.org:2512.08772v2

    arXiv:2512.08772v2 Announce Type: replace Abstract: Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.

    https://arxiv.org/abs/2512.08772


    Accelerated Rotation-Invariant Convolution for UAV Image Segmentation

    oai:arXiv.org:2512.08888v2

    arXiv:2512.08888v2 Announce Type: replace Abstract: Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20--55% faster training and 15--45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256\(\times\)256 inputs, and 32% speedup and 23% lower energy usage on 1024\(\times\)1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.

    https://arxiv.org/abs/2512.08888


    Astra: General Interactive World Model with Autoregressive Denoising

    oai:arXiv.org:2512.08931v2

    arXiv:2512.08931v2 Announce Type: replace Abstract: Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.

    https://arxiv.org/abs/2512.08931


    Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

    oai:arXiv.org:2512.09185v2

    arXiv:2512.09185v2 Announce Type: replace Abstract: Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $\Delta$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $\Delta$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.

    https://arxiv.org/abs/2512.09185


    TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

    oai:arXiv.org:2512.09196v2

    arXiv:2512.09196v2 Announce Type: replace Abstract: High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with concise code, achieving expert-level performance still requires deep understanding of GPU architectures and low-level performance trade-offs. We present TritonForge, a profiling-guided framework for automated Triton kernel optimization. TritonForge integrates kernel analysis, runtime profiling, and iterative code transformation to streamline the optimization process. By incorporating feedback from profiling results, the system identifies performance bottlenecks, proposes targeted code modifications, and evaluates their impact automatically. Across diverse kernel types, TritonForge achieves up to 5x performance improvement over baseline implementations and on average 1.76x of the cases are successful, providing a foundation for future research in automated GPU performance optimization.

    https://arxiv.org/abs/2512.09196


    Meta Lattice: Model Space Redesign for Cost-Effective Industry-Scale Ads Recommendations

    oai:arXiv.org:2512.09200v2

    arXiv:2512.09200v2 Announce Type: replace Abstract: The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements. To address this challenge, we propose Lattice, a recommendation framework centered around model space redesign that extends Multi-Domain, Multi-Objective (MDMO) learning beyond models and learning objectives. Lattice addresses these challenges through a comprehensive model space redesign that combines cross-domain knowledge sharing, data consolidation, model unification, distillation, and system optimizations to achieve significant improvements in both quality and cost-efficiency. Our deployment of Lattice at Meta has resulted in 10% revenue-driving top-line metrics gain, 11.5% user satisfaction improvement, 6% boost in conversion rate, with 20% capacity saving.

    https://arxiv.org/abs/2512.09200


    ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

    oai:arXiv.org:2512.09321v3

    arXiv:2512.09321v3 Announce Type: replace Abstract: Prompt injection attacks aim to contaminate the input data of an LLM to mislead it into completing an attacker-chosen task instead of the intended task. In many applications and agents, the input data originates from multiple sources, with each source contributing a segment of the overall input. In these multi-source scenarios, an attacker may control only a subset of the sources and contaminate the corresponding segments, but typically does not know the order in which the segments are arranged within the input. Existing prompt injection attacks either assume that the entire input data comes from a single source under the attacker's control or ignore the uncertainty in the ordering of segments from different sources. As a result, their success is limited in domains involving multi-source data. In this work, we propose ObliInjection, the first prompt injection attack targeting LLM applications and agents with multi-source input data. ObliInjection introduces two key technical innovations: the order-oblivious loss, which quantifies the likelihood that the LLM will complete the attacker-chosen task regardless of how the clean and contaminated segments are ordered; and the orderGCG algorithm, which is tailored to minimize the order-oblivious loss and optimize the contaminated segments. Comprehensive experiments across three datasets spanning diverse application domains and twelve LLMs demonstrate that ObliInjection is highly effective, even when only one out of 6-100 segments in the input data is contaminated. Our code and data are available at: https://github.com/ReachalWang/ObliInjection.

    https://arxiv.org/abs/2512.09321


    TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

    oai:arXiv.org:2512.09350v2

    arXiv:2512.09350v2 Announce Type: replace Abstract: Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer(MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.

    https://arxiv.org/abs/2512.09350


    InfoMotion: A Graph-Based Approach to Video Dataset Distillation for Echocardiography

    oai:arXiv.org:2512.09422v2

    arXiv:2512.09422v2 Announce Type: replace Abstract: Echocardiography plays a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic datasets and achieve a test accuracy of \(69.38\%\) using only \(25\) synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.

    https://arxiv.org/abs/2512.09422


    Exploring Community-Powered Conversational Agent for Health Knowledge Acquisition: A Case Study in Colorectal Cancer

    oai:arXiv.org:2512.09511v2

    arXiv:2512.09511v2 Announce Type: replace Abstract: Online communities have become key platforms where young adults, actively seek and share information, including health knowledge. However, these users often face challenges when browsing these communities, such as fragmented content, varying information quality and unfamiliar terminology. Based on a survey with 56 participants and follow-up interviews, we identify common challenges and expected features for learning health knowledge. In this paper, we develop a computational workflow that integrates community content into a conversational agent named CanAnswer to facilitate health knowledge acquisition. Using colorectal cancer as a case study, we evaluate CanAnswer through a lab study with 24 participants and interviews with six medical experts. Results show that CanAnswer improves the recalled gained knowledge and reduces the task workload of the learning session. Our expert interviews (N=6) further confirm the reliability and usefulness of CanAnswer. We discuss the generality of CanAnswer and provide design considerations for enhancing the usefulness and credibility of community-powered learning tools.

    https://arxiv.org/abs/2512.09511


    Latent-Autoregressive GP-VAE Language Model

    oai:arXiv.org:2512.09535v2

    arXiv:2512.09535v2 Announce Type: replace Abstract: We investigate a fully Latent AutoRegressive scheme based on a Gaussian Process (GP) integrated into a Variational Autoencoder (VAE). In this setting, sequential dynamics are transferred from the observation space to a continuous latent space, while linguistic generation remains parallel through a non-autoregressive decoder. We present a complete methodological formulation, including a causal GP prior, a structured amortized posterior, and a training protocol based on a regularized ELBO. Empirical evaluation, conducted within a deliberately constrained proof-of-concept (POC) framework, shows that the model can be trained stably and that the sequential and parallel sampling variants exhibit consistent behavior. Overall, the results suggest that part of the temporal structure in a language model can be supported by the probabilistic geometry of the latent space rather than by explicit neural operations.

    https://arxiv.org/abs/2512.09535


    Chasing Shadows: Pitfalls in LLM Security Research

    oai:arXiv.org:2512.09549v2

    arXiv:2512.09549v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that have become (more) relevant with the emergence of LLMs and that can compromise the validity of research involving them. These pitfalls span the entire computation process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the present pitfalls were explicitly discussed, suggesting that the majority remain unrecognized. To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.

    https://arxiv.org/abs/2512.09549


    Defining Cost Function of Steganography with Large Language Models

    oai:arXiv.org:2512.09769v2

    arXiv:2512.09769v2 Announce Type: replace Abstract: In this paper, we make the first attempt towards defining cost function of steganography with large language models (LLMs), which is totally different from previous works that rely heavily on expert knowledge or require large-scale datasets for cost learning. To achieve this goal, a two-stage strategy combining LLM-guided program synthesis with evolutionary search is applied in the proposed method. In the first stage, a certain number of cost functions in the form of computer programs are synthesized from LLM responses to structured prompts. These cost functions are then evaluated with pretrained steganalysis models so that candidate cost functions suited to steganography can be collected. In the second stage, by retraining a steganalysis model for each candidate cost function, the optimal cost function(s) can be determined according to the detection accuracy. This two-stage strategy is performed by an iterative fashion so that the best cost function can be collected at the last iteration. Experiments show that the proposed method enables LLMs to design new cost functions of steganography that significantly outperform existing works in terms of resisting steganalysis tools, which verifies the superiority of the proposed method. To the best knowledge of the authors, this is the first work applying LLMs to the design of advanced cost function of steganography, which presents a novel perspective for steganography design and may shed light on further research.

    https://arxiv.org/abs/2512.09769


    GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference

    oai:arXiv.org:2512.09963v2

    arXiv:2512.09963v2 Announce Type: replace Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user server speculative decoding and resource-constrained environments. Speculative decoding has emerged as a promising technique to accelerate LLM inference by using lightweight draft models to generate candidate tokens, which are subsequently verified by a larger, more accurate model. However, ensuring both high goodput (the effective rate of accepted tokens) and fairness across multiple draft servers cooperating with a central verification server remains an open challenge. This paper introduces GOODSPEED, a novel distributed inference framework that optimizes goodput through adaptive speculative decoding. GOODSPEED employs a central verification server that coordinates a set of heterogeneous draft servers, each running a small language model to generate speculative tokens. To manage resource allocation effectively, GOODSPEED incorporates a gradient scheduling algorithm that dynamically assigns token verification tasks, maximizing a logarithmic utility function to ensure proportional fairness across servers. By processing speculative outputs from all draft servers in parallel, the framework enables efficient collaboration between the verification server and distributed draft generators, streamlining both latency and throughput. Through rigorous fluid sample path analysis, we show that GOODSPEED converges to the optimal goodput allocation in steady-state conditions and maintains near-optimal performance with provably bounded error under dynamic workloads. These results demonstrate that GOODSPEED provides a scalable, fair and efficient solution for multi-server speculative decoding in distributed LLM inference systems.

    https://arxiv.org/abs/2512.09963


    Efficient Boys function evaluation using minimax approximation

    oai:arXiv.org:2512.10059v2

    arXiv:2512.10059v2 Announce Type: replace Abstract: We present an algorithm for efficient evaluation of Boys functions $F_0,\dots,F_{k_\mathrm{max}}$ tailored to modern computing architectures, in particular graphical processing units (GPUs), where maximum throughput is high and data movement is costly. The method combines rational minimax approximations with upward and downward recurrence relations. The non-negative real axis is partitioned into three regions, $[0,\infty\rangle = A\cup B\cup C$, where regions $A$ and $B$ are treated using rational minimax approximations and region $C$ by an asymptotic approximation. This formulation avoids lookup tables and irregular memory access, making it well suited hardware with high maximum throughput and low latency. The rational minimax coefficients are generated using the rational Remez algorithm. For a target maximum absolute error of $\varepsilon_\mathrm{tol} = 5\cdot10^{-14}$, the corresponding approximation regions and coefficients for Boys functions $F_0,\dots,F_{32}$ are provided in the appendix.

    https://arxiv.org/abs/2512.10059


    Phishing Email Detection Using Large Language Models

    oai:arXiv.org:2512.10104v2

    arXiv:2512.10104v2 Announce Type: replace Abstract: Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Models (LLMs) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (e.g., GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect the phishing email over 90% accuracy while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attack, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

    https://arxiv.org/abs/2512.10104


    STARS: Semantic Tokens with Augmented Representations for Recommendation at Scale

    oai:arXiv.org:2512.10149v2

    arXiv:2512.10149v2 Announce Type: replace Abstract: Real-world ecommerce recommender systems must deliver relevant items under strict tens-of-milliseconds latency constraints despite challenges such as cold-start products, rapidly shifting user intent, and dynamic context including seasonality, holidays, and promotions. We introduce STARS, a transformer-based sequential recommendation framework built for large-scale, low-latency ecommerce settings. STARS combines several innovations: dual-memory user embeddings that separate long-term preferences from short-term session intent; semantic item tokens that fuse pretrained text embeddings, learnable deltas, and LLM-derived attribute tags, strengthening content-based matching, long-tail coverage, and cold-start performance; context-aware scoring with learned calendar and event offsets; and a latency-conscious two-stage retrieval pipeline that performs offline embedding generation and online maximum inner-product search with filtering, enabling tens-of-milliseconds response times. In offline evaluations on production-scale data, STARS improves Hit@5 by more than 75 percent relative to our existing LambdaMART system. A large-scale A/B test on 6 million visits shows statistically significant lifts, including Total Orders +0.8%, Add-to-Cart on Home +2.0%, and Visits per User +0.5%. These results demonstrate that combining semantic enrichment, multi-intent modeling, and deployment-oriented design can yield state-of-the-art recommendation quality in real-world environments without sacrificing serving efficiency.

    https://arxiv.org/abs/2512.10149


    Adaptive Information Routing for Multimodal Time Series Forecasting

    oai:arXiv.org:2512.10229v2

    arXiv:2512.10229v2 Announce Type: replace Abstract: Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.

    https://arxiv.org/abs/2512.10229


    MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

    oai:arXiv.org:2512.10264v2

    arXiv:2512.10264v2 Announce Type: replace Abstract: A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models - a major class of modern music generative models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) By constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo. Samples are provided in our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.

    https://arxiv.org/abs/2512.10264


    MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

    oai:arXiv.org:2512.10284v2

    arXiv:2512.10284v2 Announce Type: replace Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing-the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness. Our code is at https://github.com/elainew728/motion-edit/.

    https://arxiv.org/abs/2512.10284


    Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

    oai:arXiv.org:2512.10422v2

    arXiv:2512.10422v2 Announce Type: replace Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available.

    https://arxiv.org/abs/2512.10422


    Renormalizable Spectral-Shell Dynamics as the Origin of Neural Scaling Laws

    oai:arXiv.org:2512.10427v2

    arXiv:2512.10427v2 Announce Type: replace Abstract: Neural scaling laws and double-descent phenomena suggest that deep-network training obeys a simple macroscopic structure despite highly nonlinear optimization dynamics. We derive such structure directly from gradient descent in function space. For mean-squared error loss, the training error evolves as $\dot e_t=-M(t)e_t$ with $M(t)=J_{\theta(t)}J_{\theta(t)}^{\!*}$, a time-dependent self-adjoint operator induced by the network Jacobian. Using Kato perturbation theory, we obtain an exact system of coupled modewise ODEs in the instantaneous eigenbasis of $M(t)$. To extract macroscopic behavior, we introduce a logarithmic spectral-shell coarse-graining and track quadratic error energy across shells. Microscopic interactions within each shell cancel identically at the energy level, so shell energies evolve only through dissipation and external inter-shell interactions. We formalize this via a \emph{renormalizable shell-dynamics} assumption, under which cumulative microscopic effects reduce to a controlled net flux across shell boundaries. Assuming an effective power-law spectral transport in a relevant resolution range, the shell dynamics admits a self-similar solution with a moving resolution frontier and explicit scaling exponents. This framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics.

    https://arxiv.org/abs/2512.10427


    Representation of the structure of graphs by sequences of instructions

    oai:arXiv.org:2512.10429v2

    arXiv:2512.10429v2 Announce Type: replace Abstract: The representation of graphs is commonly based on the adjacency matrix concept. This formulation is the foundation of most algebraic and computational approaches to graph processing. The advent of deep learning language models offers a wide range of powerful computational models that are specialized in the processing of text. However, current procedures to represent graphs are not amenable to processing by these models. In this work, a new method to represent graphs is proposed. It represents the adjacency matrix of a graph by a string of simple instructions. The instructions build the adjacency matrix step by step. The transformation is reversible, i.e., given a graph the string can be produced and vice versa. The proposed representation is compact, and it maintains the local structural patterns of the graph. Therefore, it is envisaged that it could be useful to boost the processing of graphs by deep learning models. A tentative computational experiment is reported, demonstrating improved classification performance and faster computation times with the proposed representation.

    https://arxiv.org/abs/2512.10429


    When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

    oai:arXiv.org:2512.10449v2

    arXiv:2512.10449v2 Announce Type: replace Abstract: The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

    https://arxiv.org/abs/2512.10449


    Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots

    oai:arXiv.org:2512.10477v2

    arXiv:2512.10477v2 Announce Type: replace Abstract: In our work we not explicitly hint that it is a misconception to think that humans learn fast. Learning process takes time. Babies start learning to move in the restricted liquid area called placenta. Children often are limited by underdeveloped body. Even adults are not allowed to participate in complex competitions right away. However, with robots, when learning from scratch, we often don't have the privilege of waiting for dozen millions of steps. "Swaddling" regularization is responsible for restraining an agent in rapid but unstable development penalizing action strength in a specific way not affecting actions directly. The Symphony, Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of different ideas for possibility of training humanoid robots from scratch with Sample Efficiency, Sample Proximity and Safety of Actions in mind. It is no secret that continuous increase in Gaussian noise without appropriate smoothing is harmful for motors and gearboxes. Compared to Stochastic algorithms, we set a limited parametric noise and promote a reduced strength of actions, safely increasing entropy, since the actions are kind of immersed in weaker noise. When actions require more extreme values, actions rise above the weak noise. Training becomes empirically much safer for both the environment around and the robot's mechanisms. We use Fading Replay Buffer: using a fixed formula containing the hyperbolic tangent, we adjust the batch sampling probability: the memory contains a recent memory and a long-term memory trail. Fading Replay Buffer allows us to use Temporal Advantage when we improve the current Critic Network prediction compared to the exponential moving average. Temporal Advantage allows us to update Actor and Critic in one pass, as well as combine Actor and Critic in one Object and implement their Losses in one line.

    https://arxiv.org/abs/2512.10477


    Phythesis: Physics-Guided Evolutionary Scene Synthesis for Energy-Efficient Data Center Design via LLMs

    oai:arXiv.org:2512.10611v2

    arXiv:2512.10611v2 Announce Type: replace Abstract: Data center (DC) infrastructure serves as the backbone to support the escalating demand for computing capacity. Traditional design methodologies that blend human expertise with specialized simulation tools scale poorly with the increasing system complexity. Recent studies adopt generative artificial intelligence to design plausible human-centric indoor layouts. However, they do not consider the underlying physics, making them unsuitable for the DC design that sets quantifiable operational objectives and strict physical constraints. To bridge the gap, we propose Phythesis, a novel framework that synergizes large language models (LLMs) and physics-guided evolutionary optimization to automate simulation-ready (SimReady) scene synthesis for energy-efficient DC design. Phythesis employs an iterative bi-level optimization architecture, where (i) the LLM-driven optimization level generates physically plausible three-dimensional layouts and self-criticizes them to refine the scene topology, and (ii) the physics-informed optimization level identifies the optimal asset parameters and selects the best asset combination. Experiments on three generation scales show that Phythesis achieves 57.3% generation success rate increase and 11.5% power usage effectiveness (PUE) improvement, compared with the vanilla LLM-based solution.

    https://arxiv.org/abs/2512.10611


    Adaptive Intrusion Detection System Leveraging Dynamic Neural Models with Adversarial Learning for 5G/6G Networks

    oai:arXiv.org:2512.10637v2

    arXiv:2512.10637v2 Announce Type: replace Abstract: Intrusion Detection Systems (IDS) are critical components in safeguarding 5G/6G networks from both internal and external cyber threats. While traditional IDS approaches rely heavily on signature-based methods, they struggle to detect novel and evolving attacks. This paper presents an advanced IDS framework that leverages adversarial training and dynamic neural networks in 5G/6G networks to enhance network security by providing robust, real-time threat detection and response capabilities. Unlike conventional models, which require costly retraining to update knowledge, the proposed framework integrates incremental learning algorithms, reducing the need for frequent retraining. Adversarial training is used to fortify the IDS against poisoned data. By using fewer features and incorporating statistical properties, the system can efficiently detect potential threats. Extensive evaluations using the NSL- KDD dataset demonstrate that the proposed approach provides better accuracy of 82.33% for multiclass classification of various network attacks while resisting dataset poisoning. This research highlights the potential of adversarial-trained, dynamic neural networks for building resilient IDS solutions.

    https://arxiv.org/abs/2512.10637


    Estimating Hormone Concentrations in the Pituitary-Thyroid Feedback Loop from Irregularly Sampled Measurements

    oai:arXiv.org:2512.10657v2

    arXiv:2512.10657v2 Announce Type: replace Abstract: Model-based control techniques have recently been investigated for the recommendation of medication dosages to address thyroid diseases. These techniques often rely on knowledge of internal hormone concentrations that cannot be measured from blood samples. Moreover, the measurable concentrations are typically only obtainable at irregular sampling times. In this work, we empirically verify a notion of sample-based detectability that accounts for irregular sampling of the measurable concentrations on two pituitary-thyroid loop models representing patients with hypo- and hyperthyroidism, respectively, and include the internal concentrations as states. We then implement sample-based moving horizon estimation for the models, and test its performance on virtual patients across a range of sampling schemes. Our study shows robust stability of the estimator across all scenarios, and that more frequent sampling leads to less estimation error in the presence of model uncertainty and misreported dosages.

    https://arxiv.org/abs/2512.10657


    Rethinking Popularity Bias in Collaborative Filtering via Analytical Vector Decomposition

    oai:arXiv.org:2512.10688v2

    arXiv:2512.10688v2 Announce Type: replace Abstract: Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.

    https://arxiv.org/abs/2512.10688


    Template-Free Retrosynthesis with Graph-Prior Augmented Transformers

    oai:arXiv.org:2512.10770v2

    arXiv:2512.10770v2 Announce Type: replace Abstract: Retrosynthesis reaction prediction aims to infer plausible reactant molecules for a given product and is a important problem in computer-aided organic synthesis. Despite recent progress, many existing models still fall short of the accuracy and robustness required for practical deployment. In this paper, we present a template-free, Transformer-based framework that removes the need for handcrafted reaction templates or additional chemical rule engines. Our model injects molecular graph information into the attention mechanism to jointly exploit SMILES sequences and structural cues, and further applies a paired data augmentation strategy to enhance training diversity and scale. Extensive experiments on the USPTO-50K benchmark demonstrate that our approach achieves state-of-the-art performance among template-free methods and substantially outperforms a vanilla Transformer baseline.

    https://arxiv.org/abs/2512.10770


    Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

    oai:arXiv.org:2512.10947v2

    arXiv:2512.10947v2 Announce Type: replace Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.

    https://arxiv.org/abs/2512.10947


    WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

    oai:arXiv.org:2512.11047v2

    arXiv:2512.11047v2 Announce Type: replace Abstract: Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.

    https://arxiv.org/abs/2512.11047


    On the failure of ReLU activation for physics-informed machine learning

    oai:arXiv.org:2512.11184v2

    arXiv:2512.11184v2 Announce Type: replace Abstract: Physics-informed machine learning uses governing ordinary and/or partial differential equations to train neural networks to represent the solution field. Like any machine learning problem, the choice of activation function influences the characteristics and performance of the solution obtained from physics-informed training. Several studies have compared common activation functions on benchmark differential equations, and have unanimously found that the rectified linear unit (ReLU) is outperformed by competitors such as the sigmoid, hyperbolic tangent, and swish activation functions. In this work, we diagnose the poor performance of ReLU on physics-informed machine learning problems. While it is well-known that the piecewise linear form of ReLU prevents it from being used on second-order differential equations, we show that ReLU fails even on variational problems involving only first derivatives. We identify the cause of this failure as second derivatives of the activation, which are taken not in the formulation of the loss, but in the process of training. Namely, we show that automatic differentiation in PyTorch fails to characterize derivatives of discontinuous fields, which causes the gradient of the physics-informed loss to be mis-specified, thus explaining the poor performance of ReLU.

    https://arxiv.org/abs/2512.11184


    AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

    oai:arXiv.org:2512.11203v2

    arXiv:2512.11203v2 Announce Type: replace Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner-a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

    https://arxiv.org/abs/2512.11203


    Integrated Prediction and Multi-period Portfolio Optimization

    oai:arXiv.org:2512.11273v2

    arXiv:2512.11273v2 Announce Type: replace Abstract: Multi-period portfolio optimization is important for real portfolio management, as it accounts for transaction costs, path-dependent risks, and the intertemporal structure of trading decisions that single-period models cannot capture. Classical methods usually follow a two-stage framework: machine learning algorithms are employed to produce forecasts that closely fit the realized returns, and the predicted values are then used in a downstream portfolio optimization problem to determine the asset weights. This separation leads to a fundamental misalignment between predictions and decision outcomes, while also ignoring the impact of transaction costs. To bridge this gap, recent studies have proposed the idea of end-to-end learning, integrating the two stages into a single pipeline. This paper introduces IPMO (Integrated Prediction and Multi-period Portfolio Optimization), a model for multi-period mean-variance portfolio optimization with turnover penalties. The predictor generates multi-period return forecasts that parameterize a differentiable convex optimization layer, which in turn drives learning via portfolio performance. For scalability, we introduce a mirror-descent fixed-point (MDFP) differentiation scheme that avoids factorizing the Karush-Kuhn-Tucker (KKT) systems, which thus yields stable implicit gradients and nearly scale-insensitive runtime as the decision horizon grows. In experiments with real market data and two representative time-series prediction models, the IPMO method consistently outperforms the two-stage benchmarks in risk-adjusted performance net of transaction costs and achieves more coherent allocation paths. Our results show that integrating machine learning prediction with optimization in the multi-period setting improves financial outcomes and remains computationally tractable.

    https://arxiv.org/abs/2512.11273


    LegalRikai: Open Benchmark -- Benchmark for Complex Japanese Corporate Legal Tasks

    oai:arXiv.org:2512.11297v2

    arXiv:2512.11297v2 Announce Type: replace Abstract: This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, and assessing structural consistency remains a challenge. The result demonstrates the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.

    https://arxiv.org/abs/2512.11297


    RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training

    oai:arXiv.org:2512.11306v2

    arXiv:2512.11306v2 Announce Type: replace Abstract: Rollout-training disaggregation is emerging as the standard architecture for Reinforcement Learning (RL) post-training, where memory-bound rollout and compute-bound training are physically disaggregated onto purpose-built clusters to maximize hardware efficiency. However, the strict synchronization required by on-policy algorithms introduces severe dependency bubbles, forcing one cluster to idle while the dependent phase is running on the other. We present RollMux, a cluster scheduling framework that reclaims these bubbles through cross-cluster orchestration. RollMux is built on the insight that the structural idleness of one job can be effectively utilized by the active phase of another. To realize this, we introduce the co-execution group abstraction, which partitions the cluster into isolated locality domains. This abstraction enables a two-tier scheduling architecture: an inter-group scheduler that optimizes job placement using conservative stochastic planning, and an intra-group scheduler that orchestrates a provably optimal round-robin schedule. The group abstraction also imposes a residency constraint, ensuring that massive model states remain cached in host memory to enable "warm-star" context switching. We evaluate RollMux on a production-scale testbed with 328 H20 and 328 H800 GPUs. RollMux improves cost efficiency by 1.84x over standard disaggregation and 1.38x over state-of-the-art co-located baselines, all while achieving 100% SLO attainment.

    https://arxiv.org/abs/2512.11306


    Task-Specific Distance Correlation Matching for Few-Shot Action Recognition

    oai:arXiv.org:2512.11340v2

    arXiv:2512.11340v2 Announce Type: replace Abstract: Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $\alpha$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $\alpha$-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-arts.

    https://arxiv.org/abs/2512.11340


    YawDD+: Frame-level Annotations for Accurate Yawn Prediction

    oai:arXiv.org:2512.11446v2

    arXiv:2512.11446v2 Announce Type: replace Abstract: Driver fatigue remains a leading cause of road accidents, with 24% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP. The resulting approach deliver up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.

    https://arxiv.org/abs/2512.11446


    An Input-Output Data-Driven Dissipativity Approach for Compositional Stability Certification of Interconnected LTI MIMO Systems

    oai:arXiv.org:2512.11468v2

    arXiv:2512.11468v2 Announce Type: replace Abstract: We propose an input-output data-driven framework for certifying the stability of interconnected multiple-input-multiple-output linear time-invariant discrete-time systems via QSR-dissipativity. That is, by using measured input-output trajectories of each subsystem, we verify dissipative properties and extract local passivity indices without requiring an explicit model identification. These passivity indices are then used to derive conditions under which the equilibrium of the interconnected system is stable. In particular, the framework identifies how the lack of passivity in some subsystems can be compensated by surpluses in others. The proposed approach enables a compositional stability analysis by combining subsystem-level conditions into a criterion valid for the overall interconnected system. We illustrate via a numerical case study, how to compute channel-wise passivity indices and infer stability guarantees directly from data with the proposed method.

    https://arxiv.org/abs/2512.11468


    EmeraldMind: A Knowledge Graph-Augmented Framework for Greenwashing Detection

    oai:arXiv.org:2512.11506v2

    arXiv:2512.11506v2 Announce Type: replace Abstract: As AI and web agents become pervasive in decision-making, it is critical to design intelligent systems that not only support sustainability efforts but also guard against misinformation. Greenwashing, i.e., misleading corporate sustainability claims, poses a major challenge to environmental progress. To address this challenge, we introduce EmeraldMind, a fact-centric framework integrating a domain-specific knowledge graph with retrieval-augmented generation to automate greenwashing detection. EmeraldMind builds the EmeraldGraph from diverse corporate ESG (environmental, social, and governance) reports, surfacing verifiable evidence, often missing in generic knowledge bases, and supporting large language models in claim assessment. The framework delivers justification-centric classifications, presenting transparent, evidence-backed verdicts and abstaining responsibly when claims cannot be verified. Experiments on a new greenwashing claims dataset demonstrate that EmeraldMind achieves competitive accuracy, greater coverage, and superior explanation quality compared to generic LLMs, without the need for fine-tuning or retraining.

    https://arxiv.org/abs/2512.11506


    Port-Hamiltonian Realizations of Nonminimal Linear Time Invariant Systems

    oai:arXiv.org:2201.05355v4

    arXiv:2201.05355v4 Announce Type: replace-cross Abstract: Numerical methods for developing port-Hamiltonian representations of general linear time-invariant systems are studied. The approach extends previous port-Hamiltonian characterizations to include the general non-minimal case and the case where the feedthrough term fails to have an invertible symmetric part. The resulting construction is able to identify infeasibility when the system fails to be port-Hamiltonian, and allows for the incorporation of perturbations in order to arrive at a nearby port-Hamiltonian system. Results are illustrated via numerical examples.

    https://arxiv.org/abs/2201.05355


    A SAT Encoding for Optimal Clifford Circuit Synthesis

    oai:arXiv.org:2208.11713v2

    arXiv:2208.11713v2 Announce Type: replace-cross Abstract: Executing quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to devices' limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm's description first has to be synthesized using the device's gate library. In this paper, we consider the optimal synthesis of Clifford circuits -- an important subclass of quantum circuits, with various applications. Such techniques are essential to establish lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are practically limited to small qubit counts (around six qubits for typical instances). In this work, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. Experiments on random instances with up to 6 qubits demonstrate that state-of-the-art heuristics on average produce more than twice the number of gates necessary.

    https://arxiv.org/abs/2208.11713


    Behavioral Machine Learning? Regularization and Forecast Bias

    oai:arXiv.org:2303.16158v3

    arXiv:2303.16158v3 Announce Type: replace-cross Abstract: Standard forecast efficiency tests interpret violations as evidence of behavioral bias. We show theoretically and empirically that rational forecasters using optimal regularization systematically violate these tests. Machine learning forecasts show near zero bias at one year horizon, but strong overreaction at two years, consistent with predictions from a model of regularization and measurement noise. We provide three complementary tests: experimental variation in regularization parameters, cross-sectional heterogeneity in firm signal quality, and quasi-experimental evidence from ML adoption around 2013. Technically trained analysts shift sharply toward overreaction post-2013. Our findings suggest reported violations may reflect statistical sophistication rather than cognitive failure.

    https://arxiv.org/abs/2303.16158


    Comparison of SeDuMi and SDPT3 Solvers for Stability of Continuous-time Linear System

    oai:arXiv.org:2306.04531v2

    arXiv:2306.04531v2 Announce Type: replace-cross Abstract: SeDuMi and SDPT3 are two solvers for solving Semi-definite Programming (SDP) or Linear Matrix Inequality (LMI) problems. A computational performance comparison of these two are undertaken in this paper regarding the Stability of Continuous-time Linear Systems. The comparison mainly focuses on computational times and memory requirements for different scales of problems. To implement and compare the two solvers on a set of well-posed problems, we employ YALMIP, a widely used toolbox for modeling and optimization in MATLAB. The primary goal of this study is to provide an empirical assessment of the relative computational efficiency of SeDuMi and SDPT3 under varying problem conditions. Our evaluation indicates that SDPT3 performs much better in large-scale, high-precision calculations.

    https://arxiv.org/abs/2306.04531


    Improved Generalization Bounds for Transductive Learning by Transductive Local Complexity and Its Applications

    oai:arXiv.org:2309.16858v4

    arXiv:2309.16858v4 Announce Type: replace-cross Abstract: We introduce Transductive Local Complexity (TLC) as a new tool for analyzing the generalization performance of transductive learning methods. Our work extends the classical Local Rademacher Complexity (LRC) to the transductive setting, incorporating substantial and novel components. Although LRC has been used to obtain sharp generalization bounds and minimax rates for inductive tasks such as classification and nonparametric regression, it has remained an open problem whether a localized Rademacher complexity framework can be effectively adapted to transductive learning to achieve sharp or nearly sharp bounds consistent with inductive results. We provide an affirmative answer via TLC. TLC is constructed by first deriving a new concentration inequality in Theorem 4.1 for the supremum of empirical processes capturing the gap between test and training losses, termed the test-train process, under uniform sampling without replacement, which leverages a novel combinatorial property of the test-train process and a new proof strategy applying the exponential Efron-Stein inequality twice. A subsequent peeling strategy and a new surrogate variance operator then yield excess risk bounds in the transductive setting that are nearly consistent with classical LRC-based inductive bounds up to a logarithmic gap. We further advance transductive learning through two applications: (1) for realizable transductive learning over binary-valued classes with finite VC dimension of $\dVC$ and $u \ge m \ge \dVC$, where $u$ and $m$ are the number of test features and training features, our Theorem 6.1 gives a nearly optimal bound $\Theta(\dVC \log(me/\dVC)/m)$ matching the minimax rate $\Theta(\dVC/m)$ up to $\log m$, resolving a decade-old open question; and (2) Theorem 6.2 presents a sharper excess risk bound for transductive kernel learning compared to the current state-of-the-art.

    https://arxiv.org/abs/2309.16858


    An Anytime Algorithm for Good Arm Identification

    oai:arXiv.org:2310.10359v2

    arXiv:2310.10359v2 Announce Type: replace-cross Abstract: In good arm identification (GAI), the goal is to identify one arm whose average performance exceeds a given threshold, referred to as a good arm, if it exists. Few works have studied GAI in the fixed-budget setting when the sampling budget is fixed beforehand, or in the anytime setting, when a recommendation can be asked at any time. We propose APGAI, an anytime and parameter-free sampling rule for GAI in stochastic bandits. APGAI can be straightforwardly used in fixed-confidence and fixed-budget settings. First, we derive upper bounds on its probability of error at any time. They show that adaptive strategies can be more efficient in detecting the absence of good arms than uniform sampling in several diverse instances. Second, when APGAI is combined with a stopping rule, we prove upper bounds on the expected sampling complexity, holding at any confidence level. Finally, we show the good empirical performance of APGAI on synthetic and real-world data. Our work offers an extensive overview of the GAI problem in all settings.

    https://arxiv.org/abs/2310.10359


    Quantum algorithms for linear and non-linear fractional reaction-diffusion equations

    oai:arXiv.org:2310.18900v2

    arXiv:2310.18900v2 Announce Type: replace-cross Abstract: High-dimensional fractional reaction-diffusion equations have numerous applications in the fields of biology, chemistry, and physics, and exhibit a range of rich phenomena. While classical algorithms have an exponential complexity in the spatial dimension, a quantum computer can produce a quantum state that encodes the solution with only polynomial complexity, provided that suitable input access is available. In this work, we investigate efficient quantum algorithms for linear and nonlinear fractional reaction-diffusion equations with periodic boundary conditions. For linear equations, we analyze and compare the complexity of various methods, including the second-order Trotter formula, time-marching method, and truncated Dyson series method. We also present a novel algorithm that combines the linear combination of Hamiltonian simulation technique with the interaction picture formalism, resulting in optimal scaling in the spatial dimension. For nonlinear equations, we employ the Carleman linearization method and propose a block-encoding version that is appropriate for the dense matrices that arise from the spatial discretization of fractional reaction-diffusion equations.

    https://arxiv.org/abs/2310.18900


    Quantum algorithm for linear non-unitary dynamics with near-optimal dependence on all parameters

    oai:arXiv.org:2312.03916v2

    arXiv:2312.03916v2 Announce Type: replace-cross Abstract: We introduce a family of identities that express general linear non-unitary evolution operators as a linear combination of unitary evolution operators, each solving a Hamiltonian simulation problem. This formulation can exponentially enhance the accuracy of the recently introduced linear combination of Hamiltonian simulation (LCHS) method [An, Liu, and Lin, Physical Review Letters, 2023]. For the first time, this approach enables quantum algorithms to solve linear differential equations with both optimal state preparation cost and near-optimal scaling in matrix queries on all parameters.

    https://arxiv.org/abs/2312.03916


    Tau Anomaly Detection in PET Imaging via Bilateral-Guided Deterministic Diffusion Model

    oai:arXiv.org:2405.13199v2

    arXiv:2405.13199v2 Announce Type: replace-cross Abstract: The emergence of tau PET imaging over the last decade has enabled Alzheimer's disease (AD) researchers to examine tau pathology in vivo and more effectively characterize the disease trajectories of AD. Current tau PET analysis methods, however, typically perform inferences on large cortical ROIs and are limited in the detection of localized tau pathology that varies across subjects. In this work, we propose a novel bilateral-guided deterministic diffusion sampling method to perform anomaly detection from tau PET imaging data. By including individualized brain structure and cognitively normal (CN) template conditions, our model computes a voxel-level anomaly map based on the deterministically sampled pseudo-healthy reconstruction. We train our model on ADNI CN subjects (n=380) and evaluate anomaly localization performance on the left MCI/AD subjects (n=154) and the preclinical subjects of the A4 clinical trial (n=447). We further train a CNN classifier on the derived 3D anomaly maps from ADNI, including CN and MCI/AD, to classify subjects into two groups and test classification performance on A4. We demonstrate that our method outperforms baselines in anomaly localization. Additionally, we show that our method can successfully group preclinical subjects with significantly different cognitive functions, highlighting the potential of our approach for application in preclinical screening tests. The code will be publicly available.

    https://arxiv.org/abs/2405.13199


    Optimal Rates of Convergence for Entropy Regularization in Discounted Markov Decision Processes

    oai:arXiv.org:2406.04163v5

    arXiv:2406.04163v5 Announce Type: replace-cross Abstract: We study the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted KL-divergence and in value with a problem-specific exponent. This is in contrast to previously known estimates, of the order $O(\tau)$, where $\tau$ is the regularization strength. We provide a lower bound that matches our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of this gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of this gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias regarding general convex potentials and their resulting generalized natural policy gradients.

    https://arxiv.org/abs/2406.04163


    GAPses: Versatile smart glasses for comfortable and fully-dry acquisition and parallel ultra-low-power processing of EEG and EOG

    oai:arXiv.org:2406.07903v3

    arXiv:2406.07903v3 Announce Type: replace-cross Abstract: Recent advancements in head-mounted wearable technology are revolutionizing the field of biopotential measurement, but the integration of these technologies into practical, user-friendly devices remains challenging due to issues with design intrusiveness, comfort, and data privacy. To address these challenges, this paper presents GAPses, a novel smart glasses platform designed for unobtrusive, comfortable, and secure acquisition and processing of electroencephalography (EEG) and electrooculography (EOG) signals. We introduce a direct electrode-electronics interface with custom dry soft electrodes to enhance comfort for long wear. An integrated parallel ultra-low-power RISC-V processor (GAP9, Greenwaves Technologies) processes data at the edge, thereby eliminating the need for continuous data streaming through a wireless link, enhancing privacy, and increasing system reliability in adverse channel conditions. We demonstrate the broad applicability of the designed prototype through validation in a number of EEG-based interaction tasks, including alpha waves, steady-state visual evoked potential analysis, and motor movement classification. Furthermore, we demonstrate an EEG-based biometric subject recognition task, where we reach a sensitivity and specificity of 98.87% and 99.86% respectively, with only 8 EEG channels and an energy consumption per inference on the edge as low as 121 $\mu J$. Moreover, in an EOG-based eye movement classification task, we reach an accuracy of 96.68% on 11 classes, resulting in an information transfer rate of 94.78 bit/min, which can be further increased to 161.43 bit/min by reducing the accuracy to 81.43%. The deployed implementation has an energy consumption of 40 $\mu J$ per inference and a total system power of only 12.4 mW, of which only 1.61% is used for classification, allowing for continuous operation of more than 22 h with a small 75 mAh battery.

    https://arxiv.org/abs/2406.07903


    Dy-mer: An Explainable DNA Sequence Representation Scheme using Dictionary Learning

    oai:arXiv.org:2407.12051v2

    arXiv:2407.12051v2 Announce Type: replace-cross Abstract: DNA sequences encode critical genetic information, yet their variable length and discrete nature impede direct utilization in deep learning models. Existing DNA representation schemes convert sequences into numerical vectors but fail to capture structural features of local subsequences and often suffer from limited interpretability and poor generalization on small datasets. To address these limitations, we propose Dy-mer, an interpretable and robust DNA representation scheme based on dictionary learning. Dy-mer formulates an optimization problem in tensor format, which ensures computational efficiency in batch processing. Our scheme reconstructs DNA sequences as concatenations of dynamic-length subsequences (dymers) through a convolution operation and simultaneously optimize a learnable dymer dictionary and sparse representations. Our method achieves state-of-the-art performance in downstream tasks such as DNA promoter classification and motif detection. Experiments further show that the learned dymers match known DNA motifs and clustering using Dy-mer yields semantically meaningful phylogenetic trees. These results demonstrate that the proposed approach achieves both strong predictive performance and high interpretability, making it well suited for biological research applications.

    https://arxiv.org/abs/2407.12051


    Variational Bayesian Inference for Multiple Extended Targets or Unresolved Group Targets Tracking

    oai:arXiv.org:2407.15226v5

    arXiv:2407.15226v5 Announce Type: replace-cross Abstract: In this work, we propose a method for tracking multiple extended targets or unresolvable group targets in a clutter environment. Firstly, based on the Random Matrix Model (RMM), the joint state of the target is modeled as the Gamma Gaussian Inverse Wishart (GGIW) distribution. Considering the uncertainty of measurement origin caused by the clutters, we adopt the idea of probabilistic data association and describe the joint association event as an unknown parameter in the joint prior distribution. Then the Variational Bayesian Inference (VBI) is employed to approximately solve the non-analytical posterior distribution. Furthermore, to ensure the practicability of the proposed method, we further provide two potential lightweight schemes to reduce its computational complexity. One of them is based on clustering, which effectively prunes the joint association events. The other is a simplification of the variational posterior through marginal association probabilities. Finally, the effectiveness of the proposed method is demonstrated by simulation and real data experiments, and we show that the proposed method outperforms current state-of-the-art methods in terms of accuracy and adaptability.

    https://arxiv.org/abs/2407.15226


    CIC: Circular Image Compression

    oai:arXiv.org:2407.15870v3

    arXiv:2407.15870v3 Announce Type: replace-cross Abstract: Learned image compression (LIC) is currently the cutting-edge method. However, the inherent difference between testing and training images of LIC results in performance degradation to some extent. Especially for out-of-sample, out-of-distribution, or out-of-domain testing images, the performance of LIC dramatically degraded. Classical LIC is a serial image compression (SIC) approach that utilizes an open-loop architecture with serial encoding and decoding units. Nevertheless, according to the theory of automatic control, a closed-loop architecture holds the potential to improve the dynamic and static performance of LIC. Therefore, a circular image compression (CIC) approach with closed-loop encoding and decoding elements is proposed to minimize the gap between testing and training images and upgrade the capability of LIC. The proposed CIC establishes a nonlinear loop equation and proves that steady-state error between reconstructed and original images is close to zero by Taylor series expansion. The proposed CIC method possesses the property of Post-Training and plug-and-play which can be built on any existing advanced SIC methods. Experimental results on five public image compression datasets demonstrate that the proposed CIC outperforms five competing state-of-the-art open-source SIC algorithms in reconstruction capacity. Experimental results further show that the proposed method is suitable for out-of-sample testing images with dark backgrounds, sharp edges, high contrast, grid shapes, or complex patterns.

    https://arxiv.org/abs/2407.15870


    Robust Simultaneous Multislice MRI Reconstruction Using Slice-Wise Learned Generative Diffusion Priors

    oai:arXiv.org:2407.21600v3

    arXiv:2407.21600v3 Announce Type: replace-cross Abstract: Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to complex signal interactions between and within the excited slices. In this study, we introduce ROGER, a robust SMS MRI reconstruction method based on deep generative priors. Utilizing denoising diffusion probabilistic models (DDPM), ROGER begins with Gaussian noise and gradually recovers individual slices through reverse diffusion iterations while enforcing data consistency from measured k-space data within the readout concatenation framework. The posterior sampling procedure is designed such that the DDPM training can be performed on single-slice images without requiring modifications for SMS tasks. Additionally, our method incorporates a low-frequency enhancement (LFE) module to address the practical issue that SMS-accelerated fast spin echo (FSE) and echo planar imaging (EPI) sequences cannot easily embed fully-sampled autocalibration signals. Extensive experiments on both retrospectively and prospectively accelerated datasets demonstrate that ROGER consistently outperforms existing methods, enhancing both anatomical and functional imaging with strong out-of-distribution generalization. The source code and sample data for ROGER are available at https://github.com/Solor-pikachu/ROGER.

    https://arxiv.org/abs/2407.21600


    No Screening is More Efficient with Multiple Objects

    oai:arXiv.org:2408.10077v2

    arXiv:2408.10077v2 Announce Type: replace-cross Abstract: We study efficient mechanism design for allocating multiple heterogeneous objects. The aim is to maximize the residual surplus, the total value generated from an allocation minus the costs of screening. We discover a robust trend indicating that no-screening mechanisms, such as serial dictatorship with exogenous priority order, tend to perform better as the variety of goods increases. We analyze the underlying reasons by characterizing asymptotically efficient mechanisms in a stylized environment. We also apply an automated mechanism design approach to numerically derive efficient mechanisms and validate the trend in general environments. Building on these implications, we propose the \emph{register-invite-book system} (RIB) as an efficient system for scheduling vaccination against pandemic diseases.

    https://arxiv.org/abs/2408.10077


    Layer-aware TDNN: Speaker Recognition Using Multi-Layer Features from Pre-Trained Models

    oai:arXiv.org:2409.07770v2

    arXiv:2409.07770v2 Announce Type: replace-cross Abstract: Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which directly performs layer/frame-wise processing on the layer-wise hidden state outputs from pre-trained models, extracting fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, a frame-adaptive layer aggregation, and attentive statistic pooling, explicitly modeling of the recognition and processing of previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments. Concurrently, it stood out in terms of model compactness and exhibited inference efficiency comparable to the existing systems. These results highlight the advantages derived from the proposed layer-aware processing approach. Future work includes exploring joint training with SSL frontends and the incorporation of score calibration to further enhance state-of-the-art verification performance.

    https://arxiv.org/abs/2409.07770


    Introducing the Kernel Descent Optimizer for Variational Quantum Algorithms

    oai:arXiv.org:2409.10257v2

    arXiv:2409.10257v2 Announce Type: replace-cross Abstract: In recent years, variational quantum algorithms have garnered significant attention as a candidate approach for near-term quantum advantage using noisy intermediate-scale quantum (NISQ) devices. In this article we introduce kernel descent, a novel algorithm for minimizing the functions underlying variational quantum algorithms. We compare kernel descent to existing methods and carry out extensive experiments to demonstrate its effectiveness. In particular, we showcase scenarios in which kernel descent outperforms gradient descent and quantum analytic descent. The algorithm follows the well-established scheme of iteratively computing classical local approximations to the objective function and subsequently executing several classical optimization steps with respect to the former. Kernel descent sets itself apart with its employment of reproducing kernel Hilbert space techniques in the construction of the local approximations, which leads to the observed advantages.

    https://arxiv.org/abs/2409.10257


    MAISI: Medical AI for Synthetic Imaging

    oai:arXiv.org:2409.11169v3

    arXiv:2409.11169v3 Announce Type: replace-cross Abstract: Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces the Medical AI for Synthetic Imaging (MAISI), an innovative approach using the diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages the foundation volume compression network and the latent diffusion model to produce high-resolution CT images (up to a landmark volume dimension of 512 x 512 x 768 ) with flexible volume dimensions and voxel spacing. By incorporating ControlNet, MAISI can process organ segmentation, including 127 anatomical structures, as additional conditions and enables the generation of accurately annotated synthetic images that can be used for various downstream tasks. Our experiment results show that MAISI's capabilities in generating realistic, anatomically accurate images for diverse regions and conditions reveal its promising potential to mitigate challenges using synthetic data.

    https://arxiv.org/abs/2409.11169


    Deep-ER: Deep Learning ECCENTRIC Reconstruction for fast high-resolution neurometabolic imaging

    oai:arXiv.org:2409.18303v2

    arXiv:2409.18303v2 Announce Type: replace-cross Abstract: Introduction: Altered neurometabolism is an important pathological mechanism in many neurological diseases and brain cancer, which can be mapped non-invasively by Magnetic Resonance Spectroscopic Imaging (MRSI). Advanced MRSI using non-cartesian compressed-sense acquisition enables fast high-resolution metabolic imaging but has lengthy reconstruction times that limits throughput and needs expert user interaction. Here, we present a robust and efficient Deep Learning reconstruction to obtain high-quality metabolic maps. Methods: Fast high-resolution whole-brain metabolic imaging was performed at 3.4 mm$^3$ isotropic resolution with acquisition times between 4:11-9:21 min:s using ECCENTRIC pulse sequence on a 7T MRI scanner. Data were acquired in a high-resolution phantom and 27 human participants, including 22 healthy volunteers and 5 glioma patients. A deep neural network using recurring interlaced convolutional layers with joint dual-space feature representation was developed for deep learning ECCENTRIC reconstruction (Deep-ER). 21 subjects were used for training and 6 subjects for testing. Deep-ER performance was compared to conventional iterative Total Generalized Variation reconstruction using image and spectral quality metrics. Results: Deep-ER demonstrated 600-fold faster reconstruction than conventional methods, providing improved spatial-spectral quality and metabolite quantification with 12%-45% (P<0.05) higher signal-to-noise and 8%-50% (P<0.05) smaller Cramer-Rao lower bounds. Metabolic images clearly visualize glioma tumor heterogeneity and boundary. Conclusion: Deep-ER provides efficient and robust reconstruction for sparse-sampled MRSI. The accelerated acquisition-reconstruction MRSI is compatible with high-throughput imaging workflow. It is expected that such improved performance will facilitate basic and clinical MRSI applications.

    https://arxiv.org/abs/2409.18303


    WALINET: A water and lipid identification convolutional Neural Network for nuisance signal removal in 1H MR Spectroscopic Imaging

    oai:arXiv.org:2410.00746v2

    arXiv:2410.00746v2 Announce Type: replace-cross Abstract: Purpose. Proton Magnetic Resonance Spectroscopic Imaging (1H-MRSI) provides non-invasive spectral-spatial mapping of metabolism. However, long-standing problems in whole-brain 1H-MRSI are spectral overlap of metabolite peaks with large lipid signal from scalp, and overwhelming water signal that distorts spectra. Fast and effective methods are needed for high-resolution 1H-MRSI to accurately remove lipid and water signals while preserving the metabolite signal. The potential of supervised neural networks for this task remains unexplored, despite their success for other MRSI processing. Methods. We introduce a deep-learning method based on a modified Y-NET network for water and lipid removal in whole-brain 1H-MRSI. The WALINET (WAter and LIpid neural NETwork) was compared to conventional methods such as the state-of-the-art lipid L2 regularization and Hankel-Lanczos singular value decomposition (HLSVD) water suppression. Methods were evaluated on simulated and in-vivo whole-brain MRSI using NMRSE, SNR, CRLB, and FWHM metrics. Results. WALINET is significantly faster and needs 8s for high-resolution whole-brain MRSI, compared to 42 minutes for conventional HLSVD+L2. Quantitative analysis shows WALINET has better performance than HLSVD+L2: 1) more lipid removal with 41% lower NRMSE, 2) better metabolite signal preservation with 71% lower NRMSE in simulated data, 155% higher SNR and 50% lower CRLB in in-vivo data. Metabolic maps obtained by WALINET in healthy subjects and patients show better gray/white-matter contrast with more visible structural details. Conclusions. WALINET has superior performance for nuisance signal removal and metabolite quantification on whole-brain 1H-MRSI compared to conventional state-of-the-art techniques. This represents a new application of deep-learning for MRSI processing, with potential for automated high-throughput workflow.

    https://arxiv.org/abs/2410.00746


    Markov Chain Gradient Descent in Hilbert Spaces

    oai:arXiv.org:2410.08361v4

    arXiv:2410.08361v4 Announce Type: replace-cross Abstract: In this paper, we study a Markov chain-based stochastic gradient algorithm in general Hilbert spaces, aiming at approximating the optimal solution of a quadratic loss function. We establish probabilistic upper bounds on its convergence. We further extend these results to an online regularized learning algorithm in reproducing kernel Hilbert spaces, where the samples are drawn along a Markov chain trajectory.

    https://arxiv.org/abs/2410.08361


    Why quantum state verification cannot be both efficient and secure: a categorical approach

    oai:arXiv.org:2411.04767v2

    arXiv:2411.04767v2 Announce Type: replace-cross Abstract: The advantage of quantum protocols lies in the inherent properties of the shared quantum states. These states are sometimes provided by sources that are not trusted, and therefore need to be verified. Finding secure and efficient quantum state verification protocols remains a big challenge, and recent works illustrate trade-offs between efficiency and security for different groups of states in restricted settings. However, whether a universal trade-off exists for all quantum states and all verification strategies remains unknown. In this work, we instantiate the categorical composable cryptography framework to show a fundamental limit for quantum state verification for all cut-and-choose approaches used to verify arbitrary quantum states. Our findings show that the prevailing cut-and-choose techniques cannot lead to quantum state verification protocols that are both efficient and secure.

    https://arxiv.org/abs/2411.04767


    On the physics of nested Markov models: a generalized probabilistic theory perspective

    oai:arXiv.org:2411.11614v2

    arXiv:2411.11614v2 Announce Type: replace-cross Abstract: Determining potential probability distributions with a given causal graph is vital for causality studies. To bypass the difficulty in characterizing latent variables in a Bayesian network, the nested Markov model provides an elegant algebraic approach by listing exactly all the equality constraints on the observed variables. However, this algebraically motivated causal model comprises distributions outside Bayesian networks, and its physical interpretation remains vague. In this work, we inspect the nested Markov model through the lens of generalized probabilistic theory, an axiomatic framework to describe general physical theories. We prove that all the equality constraints defining the nested Markov model are valid theory-independently. At the same time, not every distribution within the nested Markov model is implementable, not even via exotic physical theories associated with generalized probability theories (GPTs). To interpret the origin of such a gap, we study three causal models standing between the nested Markov model and the set of all distributions admitting some GPT realization. Each of the successive three models gives a strictly tighter characterization of the physically implementable distribution set; that is, each successive model manifests new types of GPT-inviolable constraints. We further demonstrate each gap through a specially chosen illustrative causal structure. We anticipate our results will enlighten further explorations on the unification of algebraic and physical perspectives of causality.

    https://arxiv.org/abs/2411.11614


    Efficient and Physically-Consistent Modeling of Reconfigurable Electromagnetic Structures

    oai:arXiv.org:2411.13475v3

    arXiv:2411.13475v3 Announce Type: replace-cross Abstract: Reconfigurable electromagnetic structures (REMSs), such as reconfigurable reflectarrays (RRAs) or reconfigurable intelligent surfaces (RISs), hold significant potential to improve the spectral efficiency of wireless communication systems and the accuracy of wireless sensing systems. Even though several REMS modeling approaches have been proposed in recent years, the literature lacks models that are both computationally efficient and physically consistent. As a result, algorithms that control the reconfigurable elements of REMSs (e.g., the phase shifts of a RIS) are often built on simplistic and thus inaccurate models. To enable physically accurate REMS-parameter tuning, we present a new framework for efficient and physically consistent modeling of general REMSs. Our modeling method combines a circuit-theoretic approach with a new formalism that describes a REMS's interaction with the electromagnetic (EM) waves in its far-field region. Our modeling method enables efficient computation of the entire far-field radiation pattern for arbitrary configurations of the REMS reconfigurable elements once a single full-wave EM simulation of the non-reconfigurable parts of the REMS has been performed. The predictions made by our framework align with the physical laws of classical electrodynamics and model effects caused by inter-antenna coupling, non-reciprocal materials, polarization, ohmic losses, matching losses, influence of metallic housings, noise from low-noise amplifiers, and noise arising in or received by antennas. In order to validate the efficiency and accuracy of our modeling approach, we (i) compare our modeling method to EM simulations and (ii) conduct a case study involving an RRA that enables simultaneous multiuser beam- and null-forming using a new, computationally efficient, and physically accurate parameter tuning algorithm.

    https://arxiv.org/abs/2411.13475


    Self-test loss functions for learning weak-form operators and gradient flows

    oai:arXiv.org:2412.03506v3

    arXiv:2412.03506v3 Announce Type: replace-cross Abstract: The construction of loss functions presents a major challenge in data-driven modeling involving weak-form operators in PDEs and gradient flows, particularly due to the need to select test functions appropriately. We address this challenge by introducing self-test loss functions, which employ test functions that depend on the unknown parameters, specifically for cases where the operator depends linearly on the unknowns. The proposed self-test loss function conserves energy for gradient flows and coincides with the expected log-likelihood ratio for stochastic differential equations. Importantly, it is quadratic, facilitating theoretical analysis of identifiability and well-posedness of the inverse problem, while also leading to efficient parametric or nonparametric regression algorithms. It is computationally simple, requiring only low-order derivatives or even being entirely derivative-free, and numerical experiments demonstrate its robustness against noisy and discrete data.

    https://arxiv.org/abs/2412.03506


    On the metric mean dimensions of saturated sets

    oai:arXiv.org:2412.09218v4

    arXiv:2412.09218v4 Announce Type: replace-cross Abstract: From a geometric perspective, we employ metric mean dimension to investigate the set of generic points of invariant measures and saturated sets in infinite entropy systems. For systems with the specification property, we establish certain variational principles for the Bowen and packing metric mean dimensions of saturated sets in terms of Kolmogorov-Sinai $\epsilon$-entropy, and prove that the upper capacity metric mean dimension of saturated sets has full metric mean dimension. Consequently, the Bowen and packing metric mean dimensions of the set of generic points of invariant measures coincide with the mean R\'enyi information dimension, and the upper capacity metric mean dimension of the set of generic points of invariant measures also has full metric mean dimension. As applications, for systems with the specification property, we present the qualitative characterization of the metric mean dimensions of level sets, the set of mean Li-Yorke pairs in infinite-entropy systems, and the set of generic points of invariant measures in full shifts over compact metric spaces.

    https://arxiv.org/abs/2412.09218


    A Physics-Embedded Dual-Learning Imaging Framework for Electrical Impedance Tomography

    oai:arXiv.org:2412.17827v2

    arXiv:2412.17827v2 Announce Type: replace-cross Abstract: Electrical Impedance Tomography (EIT) is a promising noninvasive imaging technique that reconstructs the spatial conductivity distribution from boundary voltage measurements. However, it poses a highly nonlinear and ill-posed inverse problem. Traditional regularization-based methods are sensitive to noise and often produce significant artifacts. Physics-Embedded learning frameworks, particularly Physics-Informed Neural Networks (PINNs), have shown success in solving such inverse problems under ideal conditions with abundant internal data. Yet in practical EIT applications, only sparse and noisy boundary measurements are available. Moreover, changing boundary excitations require the simultaneous training of multiple forward networks and one inverse network, which significantly increases computational complexity and hampers convergence. To overcome these limitations, we propose a Physics-Embedded Dual-Learning Imaging Framework for EIT. The dual-learning strategy is composed of a supervised CNN-based forward network, which learns to predict a discrete internal potential distribution under fixed Neumann-to-Dirichlet boundary conditions, and an unsupervised PINN-based inverse network, which reconstructs the conductivity by enforcing the governing PDE through discrete numerical differentiation of the predicted potentials. This decoupled architecture removes the need for smooth conductivity assumptions, reduces the number of forward networks required from $K$ to 1, and improves reconstruction robustness and efficiency under realistic measurement constraints.(https://github.com/XuanxuanYang/CNN-PINNframework.git)

    https://arxiv.org/abs/2412.17827


    Compact Neural Network Algorithm for Electrocardiogram Classification

    oai:arXiv.org:2412.17852v2

    arXiv:2412.17852v2 Announce Type: replace-cross Abstract: In this paper, we present a powerful, compact electrocardiogram (ECG) classification algorithm for cardiac arrhythmia diagnosis that addresses the current reliance on deep learning and convolutional neural networks (CNNs) in ECG analysis. This work aims to reduce the demand for deep learning, which often requires extensive computational resources and large labeled datasets. Our approach introduces an artificial neural network (ANN) with a simple architecture combined with advanced feature engineering techniques. A key contribution of this work is the incorporation of 17 engineered features that enable the extraction of critical patterns from raw ECG signals. By integrating mathematical transformations, signal processing methods, and data extraction algorithms, our model captures the morphological and physiological characteristics of ECG signals with high efficiency, without requiring deep learning. Our method demonstrates a similar performance to other state-of-the-art models in classifying 4 types of arrhythmias, including atrial fibrillation, sinus tachycardia, sinus bradycardia, and ventricular flutter. Our algorithm achieved an accuracy of 97.36% on the MIT-BIH and St. Petersburg INCART arrhythmia databases. Our approach offers a practical and feasible solution for real-time diagnosis of cardiac disorders in medical applications, particularly in resource-limited environments.

    https://arxiv.org/abs/2412.17852


    Major Space Weather Risks Identified via Coupled Physics-Engineering-Economic Modeling

    oai:arXiv.org:2412.18032v2

    arXiv:2412.18032v2 Announce Type: replace-cross Abstract: Space weather poses an important but under-quantified threat to critical infrastructure, the economy, and society. While extreme geomagnetic storms are recognized as potential global catastrophes, their socio-economic impacts remain poorly quantified. Here we present a novel physics-engineering-economic framework that links geophysical drivers of geomagnetic storms to power grid geoelectric fields, transformer vulnerability, and macroeconomic consequences. Using the United States as an example, we estimate daily economic losses from transformer thermal heating of 1.37 billion USD (95 percent confidence interval: 1.16 to 1.58 billion USD) for a 100-year geomagnetic storm, with power outages affecting 4.1 million people and 101,000 businesses. A 250-year event could raise losses to 2.09 billion USD per day (95 percent confidence interval: 1.84 to 2.34 billion USD), disrupting power for more than 6 million people and 155,000 businesses. Crucially, the framework is scalable and transferable, offering a template for assessing space weather risk to critical infrastructure in other countries. This integrative approach provides the first end-to-end quantification of space weather socio-economic impacts, bridging space physics through to policy-relevant metrics. Our results demonstrate that coupled socio-economic modeling of space weather is both feasible and essential, enabling evidence-based decision making in infrastructure protection and global risk management.

    https://arxiv.org/abs/2412.18032


    Multi-View Oriented GPLVM: Expressiveness and Efficiency

    oai:arXiv.org:2502.08253v2

    arXiv:2502.08253v2 Announce Type: replace-cross Abstract: The multi-view Gaussian process latent variable model (MV-GPLVM) aims to learn a unified representation from multi-view data but is hindered by challenges such as limited kernel expressiveness and low computational efficiency. To overcome these issues, we first introduce a new duality between the spectral density and the kernel function. By modeling the spectral density with a bivariate Gaussian mixture, we then derive a generic and expressive kernel termed Next-Gen Spectral Mixture (NG-SM) for MV-GPLVMs. To address the inherent computational inefficiency of the NG-SM kernel, we design a new form of random Fourier feature approximation. Combined with a tailored reparameterization trick, this approximation enables scalable variational inference for both the model and the unified latent representations. Numerical evaluations across a diverse range of multi-view datasets demonstrate that our proposed method consistently outperforms state-of-the-art models in learning meaningful latent representations.

    https://arxiv.org/abs/2502.08253


    Space-Efficient Quantum Error Reduction without log Factors

    oai:arXiv.org:2502.09249v2

    arXiv:2502.09249v2 Announce Type: replace-cross Abstract: Given an algorithm that outputs the correct answer with bounded error, say $1/3$, it is sometimes desirable to reduce this error to some arbitrarily small $\varepsilon$ -- e.g., if one wants to call the algorithm many times as a subroutine. The usual method, for both quantum and randomized algorithms, is majority voting, which incurs a multiplicative overhead of $O(\log\frac{1}{\varepsilon})$ from calling the algorithm this many times. Transducers are a recently introduced model of quantum computation, and it is possible to reduce the ``error'' of a transducer arbitrarily with only constant overhead, using a construction analogous to majority voting called purification. Even error-free transducers map to bounded-error quantum algorithms, so this does not let you reduce algorithmic error for free, but it does allow bounded-error quantum algorithms to be composed without incurring log factors. In this paper, we present a new highly simplified purifier, that can be understood as a weighted walk on a line similar to a random walk interpretation of majority voting. Our purifier has much smaller space and time complexity than the previous one. Indeed, it only uses one additional counter, and only performs two increment and two decrement operations on each iteration. It also has quadratically better dependence on the soundness-completeness gap of the algorithm being purified. We prove that our purifier has optimal query complexity, even down to the constant, which matters when one composes quantum algorithms to super-constant depth. Purifiers can be seen as a way of turning a ``Monte Carlo'' quantum algorithm into a ``Las Vegas'' quantum algorithm -- a process for which there is no classical analogue. Our simplified construction sheds light on this strange quantum phenomenon, and could have implications for the complexity of composed quantum algorithms.

    https://arxiv.org/abs/2502.09249


    Learning and Computation of $\Phi$-Equilibria at the Frontier of Tractability

    oai:arXiv.org:2502.18582v3

    arXiv:2502.18582v3 Announce Type: replace-cross Abstract: $\Phi$-equilibria -- and the associated notion of $\Phi$-regret -- are a powerful and flexible framework at the heart of online learning and game theory, whereby enriching the set of deviations $\Phi$ begets stronger notions of rationality. Recently, Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '24) -- abbreviated as DFFPS -- settled the existence of efficient algorithms when $\Phi$ contains only linear maps under a general, $d$-dimensional convex constraint set $\mathcal{X}$. In this paper, we significantly extend their work by resolving the case where $\Phi$ is $k$-dimensional; degree-$\ell$ polynomials constitute a canonical such example with $k = d^{O(\ell)}$. In particular, positing only oracle access to $\mathcal{X}$, we obtain two main positive results: i) a $\text{poly}(n, d, k, \text{log}(1/\epsilon))$-time algorithm for computing $\epsilon$-approximate $\Phi$-equilibria in $n$-player multilinear games, and ii) an efficient online algorithm that incurs average $\Phi$-regret at most $\epsilon$ using $\text{poly}(d, k)/\epsilon^2$ rounds. We also show nearly matching lower bounds in the online learning setting, thereby obtaining for the first time a family of deviations that captures the learnability of $\Phi$-regret. From a technical standpoint, we extend the framework of DFFPS from linear maps to the more challenging case of maps with polynomial dimension. At the heart of our approach is a polynomial-time algorithm for computing an expected fixed point of any $\phi : \mathcal{X} \to \mathcal{X}$ based on the ellipsoid against hope (EAH) algorithm of Papadimitriou and Roughgarden (JACM '08). In particular, our algorithm for computing $\Phi$-equilibria is based on executing EAH in a nested fashion -- each step of EAH itself being implemented by invoking a separate call to EAH.

    https://arxiv.org/abs/2502.18582


    Explainable Quantum Machine Learning for Multispectral Images Segmentation: Case Study

    oai:arXiv.org:2503.08962v3

    arXiv:2503.08962v3 Announce Type: replace-cross Abstract: The emergence of Big Data changed how we approach information systems engineering. Nowadays, when we can use remote sensing techniques for Big Data acquisition, the issues such data introduce are as important as ever. One of those concerns is the processing of the data. Classical methods often fail to address that problem or are incapable of processing the data in a reasonable time. With that in mind information system engineers are required to investigate different approaches to the data processing. The recent advancements in noisy intermediate-scale quantum (NISQ) devices implementation allow us to investigate their application to real-life computational problem. This field of study is called quantum (information) systems engineering and usually focuses on technical problems with the contemporary devices. However, hardware challenges are not the only ones that hinder our quantum computation capabilities. Software limitations are the other, less explored side of this medal. Using multispectral image segmentation as a task example, we investigated how difficult it is to run a hybrid quantum-classical model on a real, publicly available quantum device. To quantify how and explain why the performance of our model changed when ran on a real device, we propose new explainability metrics. These metrics introduce new meaning to the explainable quantum machine learning; the explanation of the performance issue comes from the quantum device behavior. We also analyzed the expected money costs of running similar experiment on contemporary quantum devices using standard market prices.

    https://arxiv.org/abs/2503.08962


    Reference-Free 3D Reconstruction of Brain Dissection Slabs via Learned Atlas Coordinates

    oai:arXiv.org:2503.09963v2

    arXiv:2503.09963v2 Announce Type: replace-cross Abstract: Correlation of neuropathology with MRI has the potential to transfer microscopic signatures of pathology to in vivo scans. There is increasing interest in building these correlations from 3D reconstructed stacks of slab photographs, which are routinely taken during dissection at brain banks. These photographs bypass the need for ex vivo MRI, which is not widely accessible. However, existing methods either require a corresponding 3D reference (e.g., an ex vivo MRI scans, or a brain surface acquired with a structured light scanner) or a full stack of brain slabs, which severely limits applicability. Here we propose RefFree, a 3D reconstruction method for dissection photographs that does not require an external reference. RefFree coherently reconstructs a 3D volume for an arbitrary set of slabs (including a single slab) using predicted 3D coordinates in the standard atlas space (MNI) as guidance. To support RefFree's pipeline, we train an atlas coordinate prediction network that estimates the coordinate map from a 2D photograph, using synthetic photographs generated from digitally sliced 3D MRI data with randomized appearance for enhanced generalization. As a by-product, RefFree can propagate information (e.g., anatomical labels) from atlas space to one single photograph even without reconstruction. Experiments on simulated and real data show that, when all slabs are available, RefFree achieves performance comparable to existing classical methods but at substantially higher speed. Moreover, RefFree yields accurate reconstruction and registration for partial stacks or even a single slab. Our code is available at https://github.com/lintian-a/reffree.

    https://arxiv.org/abs/2503.09963


    CRPS-Based Targeted Sequential Design with Application in Chemical Space

    oai:arXiv.org:2503.11250v2

    arXiv:2503.11250v2 Announce Type: replace-cross Abstract: Sequential design of real and computer experiments via Gaussian Process (GP) models has proven useful for parsimonious, goal-oriented data acquisition purposes. In this work, we focus on acquisition strategies for a GP model that needs to be accurate within a predefined range of the response of interest. Such an approach is useful in various fields including synthetic chemistry, where finding molecules with particular properties is essential for developing useful materials and effective medications. GP modeling and sequential design of experiments have been successfully applied to a plethora of domains, including molecule research. Our main contribution here is to use the threshold-weighted Continuous Ranked Probability Score (CRPS) as a basic building block for acquisition functions employed within sequential design. We study pointwise and integral criteria relying on two different weighting measures and benchmark them against competitors, demonstrating improved performance with respect to considered goals. The resulting acquisition strategies are applicable to a wide range of fields and pave the way to further developing sequential design relying on scoring rules.

    https://arxiv.org/abs/2503.11250


    Score-Based Turbo Message Passing for Plug-and-Play Compressive Image Recovery

    oai:arXiv.org:2503.22140v2

    arXiv:2503.22140v2 Announce Type: replace-cross Abstract: Message passing algorithms have been tailored for compressive imaging applications by plugging in different types of off-the-shelf image denoisers. These off-the-shelf denoisers mostly rely on some generic or hand-crafted priors for denoising. Due to their insufficient accuracy in capturing the true image prior, these methods often fail to produce satisfactory results, especially in highly underdetermined scenarios. On the other hand, score-based generative modeling offers a promising way to accurately characterize the sophisticated image distribution. In this paper, by exploiting the close relation between score-based modeling and empirical Bayes-optimal denoising, we devise a message passing framework that integrates a score-based minimum mean squared error (MMSE) denoiser for compressive image recovery. Experiments on the FFHQ dataset demonstrate that our method strikes a significantly better performance-complexity tradeoff than conventional message passing, regularized linear regression, and score-based posterior sampling baselines. Remarkably, our method typically converges in fewer than 20 neural function evaluations (NFEs).

    https://arxiv.org/abs/2503.22140


    Multilevel Sampling in Algebraic Statistics

    oai:arXiv.org:2505.04062v2

    arXiv:2505.04062v2 Announce Type: replace-cross Abstract: This paper proposes a multilevel sampling algorithm for fiber sampling problems in algebraic statistics, inspired by Henry Wynn's suggestion to adapt multilevel Monte Carlo (MLMC) ideas to discrete models. Focusing on log-linear models, we sample from high-dimensional lattice fibers defined by algebraic constraints. Building on Markov basis methods and results from Diaconis and Sturmfels, our algorithm uses variable step sizes to accelerate exploration and reduce the need for long burn-in. We introduce a novel Fiber Coverage Score (FCS) based on Voronoi partitioning to assess sample quality, and highlight the utility of the Maximum Mean Discrepancy (MMD) quality metric. Simulations on benchmark fibers show that multilevel sampling outperforms naive MCMC approaches. Our results demonstrate that multilevel methods, when properly applied, provide practical benefits for discrete sampling in algebraic statistics.

    https://arxiv.org/abs/2505.04062


    SpectrumFM: A Foundation Model for Intelligent Spectrum Management

    oai:arXiv.org:2505.06256v2

    arXiv:2505.06256v2 Announce Type: replace-cross Abstract: Intelligent spectrum management is crucial for improving spectrum efficiency and achieving secure utilization of spectrum resources. However, existing intelligent spectrum management methods, typically based on small-scale models, suffer from notable limitations in recognition accuracy, convergence speed, and generalization, particularly in the complex and dynamic spectrum environments. To address these challenges, this paper proposes a novel spectrum foundation model, termed SpectrumFM, establishing a new paradigm for spectrum management. SpectrumFM features an innovative encoder architecture that synergistically exploits the convolutional neural networks and the multi-head self-attention mechanisms to enhance feature extraction and enable robust representation learning. The model is pre-trained via two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, which leverage large-scale in-phase and quadrature (IQ) data to achieve comprehensive and transferable spectrum representations. Furthermore, a parameter-efficient fine-tuning strategy is proposed to enable SpectrumFM to adapt to various downstream spectrum management tasks, including automatic modulation classification (AMC), wireless technology classification (WTC), spectrum sensing (SS), and anomaly detection (AD). Extensive experiments demonstrate that SpectrumFM achieves superior performance in terms of accuracy, robustness, adaptability, few-shot learning efficiency, and convergence speed, consistently outperforming conventional methods across multiple benchmarks. Specifically, SpectrumFM improves AMC accuracy by up to 12.1% and WTC accuracy by 9.3%, achieves an area under the curve (AUC) of 0.97 in SS at -4 dB signal-to-noise ratio (SNR), and enhances AD performance by over 10%.

    https://arxiv.org/abs/2505.06256


    Credal Prediction based on Relative Likelihood

    oai:arXiv.org:2505.22332v2

    arXiv:2505.22332v2 Announce Type: replace-cross Abstract: Predictions in the form of sets of probability distributions, so-called credal sets, provide a suitable means to represent a learner's epistemic uncertainty. In this paper, we propose a theoretically grounded approach to credal prediction based on the statistical notion of relative likelihood: The target of prediction is the set of all (conditional) probability distributions produced by the collection of plausible models, namely those models whose relative likelihood exceeds a specified threshold. This threshold has an intuitive interpretation and allows for controlling the trade-off between correctness and precision of credal predictions. We tackle the problem of approximating credal sets defined in this way by means of suitably modified ensemble learning techniques. To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.

    https://arxiv.org/abs/2505.22332


    Extreme mass distributions for quasi-copulas

    oai:arXiv.org:2506.20672v2

    arXiv:2506.20672v2 Announce Type: replace-cross Abstract: A recent survey, nicknamed "Hitchhiker's Guide", J.J. Arias-Garc{\i}a, R. Mesiar, and B. De Baets, A hitchhiker's guide to quasi-copulas, Fuzzy Sets and Systems 393 (2020) 1-28, has raised the rating of quasi-copula problems in the dependence modeling community in spite of the lack of statistical interpretation of quasi-copulas. In our previous work (Fuzzy Sets and Systems 517 (2025) 109457), we addressed the question of extreme values of the mass distribution associated with multivariate quasi-copulas. Using a linear programming approach, we were able to solve Open Problem 5 of the "Guide" up to dimension d = 17 and disprove a recent conjecture on the solution to that problem. In this paper, we use an analytical approach to provide a complete answer to the original question.

    https://arxiv.org/abs/2506.20672


    The Evolution of Pointwise Statistics in Hyperbolic Equations with Random Data

    oai:arXiv.org:2507.11399v2

    arXiv:2507.11399v2 Announce Type: replace-cross Abstract: We consider one-dimensional hyperbolic PDEs, linear and nonlinear, with random initial data. Our focus is the {\em pointwise statistics,} i.e., the probability measure of the solution at any fixed point in space and time. For linear hyperbolic equations, the probability density function (PDF) of these statistics satisfies the same linear PDE. For nonlinear hyperbolic PDEs, we derive a linear transport equation for the cumulative distribution function (CDF) and a nonlocal linear PDE for the PDF. Both results are valid only as long as no shocks have formed, a limitation which is inherent to the problem, as demonstrated by a counterexample. For systems of linear hyperbolic equations, we introduce the multi-point statistics and derive their evolution equations. In all of the settings we consider, the resulting PDEs for the statistics are of practical significance: they enable efficient evaluation of the random dynamics, without requiring an ensemble of solutions of the underlying PDE, and their cost is not affected by the dimension of the random parameter space. Additionally, the evolution equations for the statistics lead to a priori statistical error bounds for Monte Carlo methods (in particular, Kernel Density Estimators) when applied to hyperbolic PDEs with random data.

    https://arxiv.org/abs/2507.11399


    Time Entangled Quantum Blockchain with Phase Encoding for Classical Data

    oai:arXiv.org:2507.14839v3

    arXiv:2507.14839v3 Announce Type: replace-cross Abstract: With rapid advancements in quantum computing, it is widely believed that there will be quantum hardware capable of compromising classical cryptography and hence, the internet and the current information security infrastructure in the coming decade. This is mainly due to the operational realizations of quantum algorithms such as Grover and Shor, to which the current classical encryption protocols are vulnerable. Blockchains, i.e., blockchain data structures and their data, rely heavily on classical cryptography. One approach to secure blockchain is to attempt to achieve information theoretical security by defining blockchain on quantum technologies. There have been two conceptualizations of blockchains on quantum registers: the time-entangled Greenberger-Horne-Zeilinger (GHZ) state blockchain and the quantum hypergraph blockchain. On our part, an attempt is made to conceptualize a new quantum blockchain combining features of both these schemes to achieve the absolute security of the time-temporal GHZ blockchain and the scalability and efficiency of the quantum hypergraph blockchain in the proposed quantum blockchain protocol.

    https://arxiv.org/abs/2507.14839


    Advancing Financial Engineering with Foundation Models: Progress, Applications, and Challenges

    oai:arXiv.org:2507.18577v2

    arXiv:2507.18577v2 Announce Type: replace-cross Abstract: The advent of foundation models (FMs), large-scale pre-trained models with strong generalization capabilities, has opened new frontiers for financial engineering. While general-purpose FMs such as GPT-4 and Gemini have demonstrated promising performance in tasks ranging from financial report summarization to sentiment-aware forecasting, many financial applications remain constrained by unique domain requirements such as multimodal reasoning, regulatory compliance, and data privacy. These challenges have spurred the emergence of financial foundation models (FFMs): a new class of models explicitly designed for finance. This survey presents a comprehensive overview of FFMs, with a taxonomy spanning three key modalities: financial language foundation models (FinLFMs), financial time-series foundation models (FinTSFMs), and financial visual-language foundation models (FinVLFMs). We review their architectures, training methodologies, datasets, and real-world applications. Furthermore, we identify critical challenges in data availability, algorithmic scalability, and infrastructure constraints and offer insights into future research opportunities. We hope this survey can serve as both a comprehensive reference for understanding FFMs and a practical roadmap for future innovation.

    https://arxiv.org/abs/2507.18577


    DynamiX: Large-Scale Dynamic Social Network Simulator

    oai:arXiv.org:2507.19929v2

    arXiv:2507.19929v2 Announce Type: replace-cross Abstract: Understanding the intrinsic mechanisms of social platforms is an urgent demand to maintain social stability. The rise of large language models provides significant potential for social network simulations to capture attitude dynamics and reproduce collective behaviors. However, existing studies mainly focus on scaling up agent populations, neglecting the dynamic evolution of social relationships. To address this gap, we introduce DynamiX, a novel large-scale social network simulator dedicated to dynamic social network modeling. DynamiX uses a dynamic hierarchy module for selecting core agents with key characteristics at each timestep, enabling accurate alignment of real-world adaptive switching of user roles. Furthermore, we design distinct dynamic social relationship modeling strategies for different user types. For opinion leaders, we propose an information-stream-based link prediction method recommending potential users with similar stances, simulating homogeneous connections, and autonomous behavior decisions. For ordinary users, we construct an inequality-oriented behavior decision-making module, effectively addressing unequal social interactions and capturing the patterns of relationship adjustments driven by multi-dimensional factors. Experimental results demonstrate that DynamiX exhibits marked improvements in attitude evolution simulation and collective behavior analysis compared to static networks. Besides, DynamiX opens a new theoretical perspective on follower growth prediction, providing empirical evidence for opinion leaders cultivation.

    https://arxiv.org/abs/2507.19929


    Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: "One Map, Many Trials" in Satellite-Driven Poverty Analysis

    oai:arXiv.org:2508.01341v4

    arXiv:2508.01341v4 Announce Type: replace-cross Abstract: Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can be leveraged across multiple causal trials while addressing chronic data scarcity in global development research. However, because standard training objectives prioritize overall predictive accuracy, these predictions often suffer from shrinkage toward the mean, leading to attenuated estimates of causal treatment effects and limiting their utility in policy evaluations. Existing debiasing methods, such as Prediction-Powered Inference (PPI), can handle this attenuation bias but require additional fresh ground-truth data at the downstream stage of causal inference, which restricts their applicability in data-scarce environments. We introduce and evaluate two post-hoc correction methods -- Linear Calibration Correction (LCC) and a Tweedie's correction approach -- that substantially reduce shrinkage-induced prediction bias without relying on newly collected labeled data. LCC applies a simple linear transformation estimated on a held-out calibration split; Tweedie's method locally de-shrink predictions using density score estimates and a noise scale learned upstream. We provide practical diagnostics for when a correction is warranted and discuss practical limitations. Across analytical results, simulations, and experiments with Demographic and Health Surveys (DHS) data, both approaches reduce attenuation; Tweedie's correction yields nearly unbiased treatment-effect estimates, enabling a "one map, many trials" paradigm. Although we demonstrate on EO-ML wealth mapping, the methods are not geospatial-specific: they apply to any setting where imputed outcomes are reused downstream (e.g., pollution indices, population density, or LLM-derived indicators).

    https://arxiv.org/abs/2508.01341


    Decoding and Engineering the Phytobiome Communication for Smart Agriculture

    oai:arXiv.org:2508.03584v2

    arXiv:2508.03584v2 Announce Type: replace-cross Abstract: Smart agriculture applications, integrating technologies like the Internet of Things and machine learning/artificial intelligence (ML/AI) into agriculture, hold promise to address modern challenges of rising food demand, environmental pollution, and water scarcity. Alongside the concept of the phytobiome, which defines the area including the plant, its environment, and associated organisms, and the recent emergence of molecular communication (MC), there exists an important opportunity to advance agricultural science and practice using communication theory. In this article, we motivate to use the communication engineering perspective for developing a holistic understanding of the phytobiome communication and bridge the gap between the phytobiome communication and smart agriculture. Firstly, an overview of phytobiome communication via molecular and electrophysiological signals is presented and a multi-scale framework modeling the phytobiome as a communication network is conceptualized. Then, how this framework is used to model electrophysiological signals is demonstrated with plant experiments. Furthermore, possible smart agriculture applications, such as smart irrigation and targeted delivery of agrochemicals, through engineering the phytobiome communication are proposed. These applications merge ML/AI methods with the Internet of Bio-Nano-Things enabled by MC and pave the way towards more efficient, sustainable, and eco-friendly agricultural production. Finally, the implementation challenges, open research issues, and industrial outlook for these applications are discussed.

    https://arxiv.org/abs/2508.03584


    Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model

    oai:arXiv.org:2508.14681v3

    arXiv:2508.14681v3 Announce Type: replace-cross Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions by utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.

    https://arxiv.org/abs/2508.14681


    Algorithmic Collusion is Algorithm Orchestration

    oai:arXiv.org:2508.14766v2

    arXiv:2508.14766v2 Announce Type: replace-cross Abstract: We propose a fresh `meta-game' perspective on the problem of algorithmic collusion in pricing games a la Bertrand. Economists have interpreted the fact that algorithms can learn to price collusively as tacit collusion. We argue instead that the co-parametrization of algorithms, in ways as are necessary to obtain algorithmic collusion, typically requires algorithm designers to engage in some form of explicit collusion or `algorithm orchestration.' In our model, the algorithm designers play a meta-game of parametrizing their algorithms, which then play repeated Bertrand competition. The strategic analysis at the meta-level reveals new equilibrium and collusion phenomena. (JEL: C62, C63, D43, L13)

    https://arxiv.org/abs/2508.14766


    Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers

    oai:arXiv.org:2508.15782v2

    arXiv:2508.15782v2 Announce Type: replace-cross Abstract: In early childhood education, accurately detecting collaborative and behavioral engagement is essential to foster meaningful learning experiences. This paper presents an AI driven approach that leverages Vision Transformers (ViTs) to automatically classify children s engagement using visual cues such as gaze direction, interaction, and peer collaboration. Utilizing the ChildPlay gaze dataset, our method is trained on annotated video segments to classify behavioral and collaborative engagement states (e.g., engaged, not engaged, collaborative, not collaborative). We evaluated six state of the art transformer models: Vision Transformer (ViT), Data efficient Image Transformer (DeiT), Swin Transformer, VitGaze, APVit and GazeTR. Among these, the Swin Transformer achieved the highest classification performance with an accuracy of 97.58 percent, demonstrating its effectiveness in modeling local and global attention. Our results highlight the potential of transformer based architectures for scalable, automated engagement analysis in real world educational settings.

    https://arxiv.org/abs/2508.15782


    Solvability Complexity Index Classification For Koopman Operator Spectra In $L^p$ For $1

    oai:arXiv.org:2509.16016v2

    arXiv:2509.16016v2 Announce Type: replace-cross Abstract: We study the computation of the approximate point spectrum and the approximate point $\varepsilon$-pseudospectrum of bounded Koopman operators acting on $L^p(\mathcal{X},\omega)$ for $1
    https://arxiv.org/abs/2509.16016


    Patch-Based Diffusion for Data-Efficient, Radiologist-Preferred MRI Reconstruction

    oai:arXiv.org:2509.21531v2

    arXiv:2509.21531v2 Announce Type: replace-cross Abstract: Magnetic resonance imaging (MRI) requires long acquisition times, raising costs, reducing accessibility, and making scans more susceptible to motion artifacts. Diffusion probabilistic models that learn data-driven priors can potentially assist in reducing acquisition time. However, they typically require large training datasets that can be prohibitively expensive to collect. Patch-based diffusion models have shown promise in learning effective data-driven priors over small real-valued datasets, but have not yet demonstrated clinical value in MRI. We extend the Patch-based Diffusion Inverse Solver (PaDIS) to complex-valued, multi-coil MRI reconstruction, and compare it against a state-of-the-art whole-image diffusion baseline (FastMRI-EDM) for 7x undersampled MRI reconstruction on the FastMRI brain dataset. We show that PaDIS-MRI models trained on small datasets of as few as 25 k-space images outperform FastMRI-EDM on image quality metrics (PSNR, SSIM, NRMSE), pixel-level uncertainty, cross-contrast generalization, and robustness to severe k-space undersampling. In a blinded study with three radiologists, PaDIS-MRI reconstructions were chosen as diagnostically superior in 91.7% of cases, compared to baselines (i) FastMRI-EDM and (ii) classical convex reconstruction with wavelet sparsity. These findings highlight the potential of patch-based diffusion priors for high-fidelity MRI reconstruction in data-scarce clinical settings where diagnostic confidence matters.

    https://arxiv.org/abs/2509.21531


    Joint Hybrid Beamforming and Artificial Noise Design for Secure Multi-UAV ISAC Networks

    oai:arXiv.org:2509.23687v2

    arXiv:2509.23687v2 Announce Type: replace-cross Abstract: Integrated sensing and communication (ISAC) emerges as a key enabler for next-generation applications such as smart cities and autonomous systems. Its integration with unmanned aerial vehicles (UAVs) unlocks new potentials for reliable communication and precise sensing in dynamic aerial environments. However, existing research predominantly treats UAVs as aerial base stations, overlooking their role as ISAC users, and fails to leverage large-scale antenna arrays at terrestrial base stations to enhance security and spectral efficiency. This paper propose a secure and spectral efficient ISAC framework for multi-UAV networks, and a two-stage optimization approach is developed to jointly design hybrid beamforming (HBF), artificial noise (AN) injection, and UAV trajectories. Aiming at maximizing the sum secrecy rate, the first stage employs Proximal Policy Optimization (PPO) to optimize digital beamformers and trajectories, and the second stage decomposes the digital solution into analog and digital components via low-complexity matrix factorization. Simulation results demonstrate the effectiveness of the proposed framework compared to benchmark schemes.

    https://arxiv.org/abs/2509.23687


    SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

    oai:arXiv.org:2510.16841v2

    arXiv:2510.16841v2 Announce Type: replace-cross Abstract: Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models. However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, indicating superior naturalness and intelligibility. Moreover, SAC substantially surpasses prior codecs in semantic representation, approaching the level of continuous self-supervised embeddings. When used as a tokenizer for LLM-based text-to-speech, SAC enables a single-stage autoregressive (AR) TTS model that clearly outperforms state-of-the-art AR systems. Our disentanglement analysis further validates the effectiveness of the dual-stream design, offering new potential for controllable speech generation.

    https://arxiv.org/abs/2510.16841


    Sample Complexity Analysis of Multi-Target Detection via Markovian and Hard-Core Multi-Reference Alignment

    oai:arXiv.org:2510.17775v2

    arXiv:2510.17775v2 Announce Type: replace-cross Abstract: Motivated by single-particle cryo-electron microscopy, we study the sample complexity of the multi-target detection (MTD) problem, in which an unknown signal appears multiple times at unknown locations within a long, noisy observation. We propose a patching scheme that reduces MTD to a non-i.i.d. multi-reference alignment (MRA) model. In the one-dimensional setting, the latent group elements form a Markov chain, and we show that the convergence rate of any estimator matches that of the corresponding i.i.d. MRA model, up to a logarithmic factor in the number of patches. Moreover, for estimators based on empirical averaging, such as the method of moments, the convergence rates are identical in both settings. We further establish an analogous result in two dimensions, where the latent structure arises from an exponentially mixing random field generated by a hard-core placement model. As a consequence, if the signal in the corresponding i.i.d. MRA model is determined by moments up to order $n_{\min}$, then in the low-SNR regime the number of patches required to estimate the signal in the MTD model scales as $\sigma^{2n_{\min}}$, where $\sigma^2$ denotes the noise variance.

    https://arxiv.org/abs/2510.17775


    Fixed Horizon Linear Quadratic Covariance Steering in Continuous Time with Hilbert-Schmidt Terminal Cost

    oai:arXiv.org:2510.21944v2

    arXiv:2510.21944v2 Announce Type: replace-cross Abstract: We formulate and solve the fixed horizon linear quadratic covariance steering problem in continuous time with a terminal cost measured in Hilbert-Schmidt (i.e., Frobenius) norm error between the desired and the controlled terminal covariances. For this problem, the necessary conditions of optimality become a coupled matrix ODE two-point boundary value problem. To solve this system of equations, we design a matricial recursive algorithm and prove its convergence. The proposed algorithm and its analysis make use of the linear fractional transforms parameterized by the state transition matrix of the associated Hamiltonian matrix. To illustrate the results, we provide two numerical examples: one with a two dimensional and another with a six dimensional state space.

    https://arxiv.org/abs/2510.21944


    A PyTorch Framework for Scalable Non-Crossing Quantile Regression

    oai:arXiv.org:2510.22419v2

    arXiv:2510.22419v2 Announce Type: replace-cross Abstract: Quantile regression is fundamental to distributional modeling, yet independent estimation of multiple quantiles frequently produces crossing -- where estimated quantile functions violate monotonicity, implying impossible negative probability densities. While Constrained Joint Quantile Regression (CJQR) elegantly enforces non-crossing by construction, existing formulations via Linear Programming exhibit $O((qn)^3)$ complexity, rendering them impractical for large-scale applications. We present the first scalable solution using PyTorch automatic differentiation: \textbf{CJQR-ALM}, combining the \textbf{Augmented Lagrangian Method} with \textbf{differentiable pinball loss} and \textbf{L-BFGS} optimization. Our approach reduces computational complexity to $O(n)$, achieving near-zero crossing rates on datasets exceeding 70,000 observations within minutes. The differentiable formulation naturally extends to neural network architectures for non-linear conditional quantile estimation. Application to Student Growth Percentile calculations demonstrates practical utility for educational assessment, while simulation studies show negligible accuracy cost (RMSE increase $\approx 2.4$ points) relative to unconstrained estimation -- a favorable trade-off for applications requiring valid probability statements across finance, healthcare, and engineering.

    https://arxiv.org/abs/2510.22419


    Reducing measurements in quantum erasure correction by quantum local recovery

    oai:arXiv.org:2510.22890v4

    arXiv:2510.22890v4 Announce Type: replace-cross Abstract: As measurements are costly and prone to errors on certain quantum computing devices, we should reduce the number of measurements and the number of measured qudits as small as possible in quantum erasure correction. It is intuitively obvious that a decoder can omit measurements of stabilizers that are irrelevant to erased qudits, but this intuition has not been rigorously formalized as far as the author is aware. In this paper, we formalize relevant stabilizers sufficient to correct erased qudits with a quantum stabilizer code, by using a recent idea from quantum local recovery. The minimum required number of measured stabilizer observables is also clarified. As an application, we also show that correction of $\delta$ erasures on a generalized surface code proposed by Delfosse, Iyer and Poulin requires at most $\delta$ measurements of vertexes and at most $\delta$ measurements of faces, independently of its code parameters.

    https://arxiv.org/abs/2510.22890


    AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis

    oai:arXiv.org:2510.22938v2

    arXiv:2510.22938v2 Announce Type: replace-cross Abstract: Large-scale datasets have enabled highly accurate machine learning interatomic potentials (MLIPs) for general-purpose heterogeneous catalysis modeling. There are, however, some limitations in what can be treated with these potentials because of gaps in the underlying training data. To extend these capabilities, we introduce AQCat25, a complementary dataset of 13.5 million density functional theory (DFT) single point calculations designed to improve the treatment of systems where spin polarization and/or higher fidelity are critical. We also investigate methodologies for integrating new datasets, such as AQCat25, with the broader Open Catalyst 2020 (OC20) dataset to create spin-aware models without sacrificing generalizability. We find that directly tuning a general model on AQCat25 leads to catastrophic forgetting of the original dataset's knowledge. Conversely, joint training strategies prove effective for improving accuracy on the new data without sacrificing general performance. This joint approach introduces a challenge, as the model must learn from a dataset containing both mixed-fidelity calculations and mixed-physics (spin-polarized vs. unpolarized). We show that explicitly conditioning the model on this system-specific metadata, for example by using Feature-wise Linear Modulation (FiLM), successfully addresses this challenge and further enhances model accuracy. Ultimately, our work establishes an effective protocol for bridging DFT fidelity domains to advance the predictive power of foundational models in catalysis.

    https://arxiv.org/abs/2510.22938


    Optimal Convergence Analysis of DDPM for General Distributions

    oai:arXiv.org:2510.27562v2

    arXiv:2510.27562v2 Announce Type: replace-cross Abstract: Score-based diffusion models have achieved remarkable empirical success in generating high-quality samples from target data distributions. Among them, the Denoising Diffusion Probabilistic Model (DDPM) is one of the most widely used samplers, generating samples via estimated score functions. Despite its empirical success, a tight theoretical understanding of DDPM -- especially its convergence properties -- remains limited. In this paper, we provide a refined convergence analysis of the DDPM sampler and establish near-optimal convergence rates under general distributional assumptions. Specifically, we introduce a relaxed smoothness condition parameterized by a constant $L$, which is small for many practical distributions (e.g., Gaussian mixture models). We prove that the DDPM sampler with accurate score estimates achieves a convergence rate of $$\widetilde{O}\left(\frac{d\min\{d,L^2\}}{T^2}\right)~\text{in Kullback-Leibler divergence},$$ where $d$ is the data dimension, $T$ is the number of iterations, and $\widetilde{O}$ hides polylogarithmic factors in $T$. This result substantially improves upon the best-known $d^2/T^2$ rate when $L < \sqrt{d}$. By establishing a matching lower bound, we show that our convergence analysis is tight for a wide array of target distributions. Moreover, it reveals that DDPM and DDIM share the same dependence on $d$, raising an interesting question of why DDIM often appears empirically faster.

    https://arxiv.org/abs/2510.27562


    Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions

    oai:arXiv.org:2511.09465v2

    arXiv:2511.09465v2 Announce Type: replace-cross Abstract: Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and `multimodal' product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.

    https://arxiv.org/abs/2511.09465


    Distributed Optimization of Bivariate Polynomial Graph Spectral Functions via Subgraph Optimization

    oai:arXiv.org:2511.11517v2

    arXiv:2511.11517v2 Announce Type: replace-cross Abstract: We study distributed optimization of finite-degree polynomial Laplacian spectral objectives under fixed topology and a global weight budget, targeting the collective behavior of the entire spectrum rather than a few extremal eigenvalues. By re-formulating the global cost in a bilinear form, we derive local subgraph problems whose gradients approximately align with the global descent direction via an SVD-based test on the \(ZC\) matrix. This leads to an iterate-and-embed scheme over disjoint 1-hop neighborhoods that preserves feasibility by construction (positivity and budget) and scales to large geometric graphs. For objectives that depend on pairwise eigenvalue differences \(h(\lambda_i-\lambda_j)\), we obtain a quadratic upper bound in the degree vector, which motivates a ``warm-start'' by degree-regularization. The warm start uses randomized gossip to estimate global average degree, accelerating subsequent local descent while maintaining decentralization, and realizing $\sim95\%{}$ of the performance with respect to centralized optimization. We further introduce a learning-based proposer that predicts one-shot edge updates on maximal 1-hop embeddings, yielding immediate objective reductions. Together, these components form a practical, modular pipeline for spectrum-aware weight tuning that preserves constraints and applies across a broader class of whole-spectrum costs.

    https://arxiv.org/abs/2511.11517


    Human-computer interactions predict mental health

    oai:arXiv.org:2511.20179v2

    arXiv:2511.20179v2 Announce Type: replace-cross Abstract: Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode mental health with state-of-the-art biomarker precision. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings recorded in 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, identifies individuals living with mental illness, and achieves near-ceiling accuracy when predicting group-level mental health. By extracting non-verbal signatures of psychological function that have so far remained untapped, MAILA represents a key step toward foundation models for mental health. The ability to decode mental states at zero marginal cost creates new opportunities in neuroscience, medicine, and public health, while raising urgent questions about privacy, agency, and autonomy online.

    https://arxiv.org/abs/2511.20179


    Automated Histopathologic Assessment of Hirschsprung Disease Using a Multi-Stage Vision Transformer Framework

    oai:arXiv.org:2511.20734v2

    arXiv:2511.20734v2 Announce Type: replace-cross Abstract: Hirschsprung Disease is characterized by the absence of ganglion cells in the myenteric plexus. Therefore, the correct identification of ganglion cells is crucial for diagnosing Hirschsprung disease. We introduce a three-stage analysis framework that mimics the pathologist's diagnostic approach. The framework, based on a Vision Transformer model (ViT-B/16), sequentially segments the muscularis propria, segments the myenteric plexus, and detects ganglion cells within anatomically valid regions. 30 whole-slide images of colon tissue were used, each containing manual annotations of muscularis, plexus, and ganglion cells. A 5-fold cross-validation scheme was applied to each stage, along with resolution-specific tiling strategies and tailored postprocessing to ensure anatomical consistency. The proposed method achieved a Dice coefficient of 89.9% and a Plexus Inclusion Rate of 100% for muscularis segmentation. Plexus segmentation reached a recall of 94.8%, a precision of 84.2% and a Ganglia Inclusion Rate of 99.7%. For ganglion cells annotated with high certainty, the model achieved 62.1\% precision and 89.1% recall. When considering all annotated ganglion cells, regardless of certainty level, the overall precision was 67.0%. These results indicate that ViT-based models are effective at leveraging global tissue context and capturing cellular morphology at small scales, even within complex histological tissue structures. This multi-stage methodology has great potential to support digital pathology workflows by reducing inter-observer variability and assisting in the evaluation of Hirschsprung disease. The clinical impact will be evaluated in future work with larger multi-center datasets and additional expert annotations.

    https://arxiv.org/abs/2511.20734


    Hierarchical Molecular Language Models (HMLMs)

    oai:arXiv.org:2512.00696v3

    arXiv:2512.00696v3 Announce Type: replace-cross Abstract: Artificial intelligence (AI) is reshaping computational and network biology by enabling new approaches to decode cellular communication networks. We introduce Hierarchical Molecular Language Models (HMLMs), a novel framework that models cellular signaling as a specialized molecular language, where signaling molecules function as tokens, protein interactions define syntax, and functional consequences constitute semantics. HMLMs employ a transformer-based architecture adapted to accommodate graph-structured signaling networks through information transducers, mathematical entities that capture how molecules receive, process, and transmit signals. The architecture integrates multi-modal data sources across molecular, pathway, and cellular scales through hierarchical attention mechanisms and scale-bridging operators that enable information flow across biological hierarchies. Applied to a complex network of cardiac fibroblast signaling, HMLMs outperformed traditional approaches in temporal dynamics prediction, particularly under sparse sampling conditions. Attention-based analysis revealed biologically meaningful crosstalk patterns, including previously uncharacterized interactions between signaling pathways. By bridging molecular mechanisms with cellular phenotypes through AI-driven molecular language representation, HMLMs establish a foundation for biology-oriented large language models (LLMs) that could be pre-trained on comprehensive pathway datasets and applied across diverse signaling systems and tissues, advancing precision medicine and therapeutic discovery.

    https://arxiv.org/abs/2512.00696


    Quantum Encrypted Control of Networked Systems

    oai:arXiv.org:2512.03434v3

    arXiv:2512.03434v3 Announce Type: replace-cross Abstract: Encrypted control has been extensively studied to ensure the confidentiality of system states and control inputs for networked control systems. This paper presents a computationally efficient encrypted control framework for networked systems enabled by quantum communication. A quantum channel between sensors and actuators is used to generate identical secret keys, whose security is further enhanced through quantum key distribution. These keys enable lightweight encryption and decryption while preserving confidentiality and control accuracy. We develop a novel encryption-decryption architecture for state-feedback control of linear systems based on quantum keys, and characterize the impact of quantum state errors on closed-loop stability. In particular, we establish the existence of a critical threshold on intrinsic quantum noise below which stability is guaranteed. In contrast to classical encrypted control schemes, which may collapse under a single key-bit error, the proposed quantum encrypted control exhibits strong robustness to key imperfections. We further adopt quantization techniques to address the scenarios with limited communication bits in practical situations, and implement privacy protection for quantum keys based on a stochastic quantizer. These results demonstrate that integrating quantum technologies into control systems in a nontrivial and principled manner, even at their current level of maturity, can yield substantial performance gains in reducing computational complexity and improving resilience to key errors while ensuring security against multiple eavesdropping sources.

    https://arxiv.org/abs/2512.03434


    Scheduling in Quantum Satellite Networks: Fairness and Performance Optimization

    oai:arXiv.org:2512.07108v2

    arXiv:2512.07108v2 Announce Type: replace-cross Abstract: Quantum satellite networks offer a promising solution for achieving long-distance quantum communication by enabling entanglement distribution across global scales. This work formulates and solves the quantum satellite network scheduling problem by optimizing satellite-to-ground station pair assignments under realistic system and environmental constraints. Our framework accounts for limited satellite and ground station resources, fairness, entanglement fidelity thresholds, and real world non-idealities including atmospheric losses, weather and background noise. In addition, we incorporate the complexities of multi-satellite relays enabled via inter-satellite links. We propose an integer linear programming (ILP) based optimization framework that supports multiple scheduling objectives, allowing us to analyze tradeoffs between maximizing total entanglement distribution rate and ensuring fairness across ground station pairs. Our framework can also be used as a benchmark tool to measure the performance of other potential transmission scheduling policies.

    https://arxiv.org/abs/2512.07108


    Social welfare optimisation in well-mixed and structured populations

    oai:arXiv.org:2512.07453v2

    arXiv:2512.07453v2 Announce Type: replace-cross Abstract: Research on promoting cooperation among autonomous, self-regarding agents has often focused on the bi-objective optimisation problem: minimising the total incentive cost while maximising the frequency of cooperation. However, the optimal value of social welfare under such constraints remains largely unexplored. In this work, we hypothesise that achieving maximal social welfare is not guaranteed at the minimal incentive cost required to drive agents to a desired cooperative state. To address this gap, we adopt to a single-objective approach focused on maximising social welfare, building upon foundational evolutionary game theory models that examined cost efficiency in finite populations, in both well-mixed and structured population settings. Our analytical model and agent-based simulations show how different interference strategies, including rewarding local versus global behavioural patterns, affect social welfare and dynamics of cooperation. Our results reveal a significant gap in the per-individual incentive cost between optimising for pure cost efficiency or cooperation frequency and optimising for maximal social welfare. Overall, our findings indicate that incentive design, policy, and benchmarking in multi-agent systems and human societies should prioritise welfare-centric objectives over proxy targets of cost or cooperation frequency.

    https://arxiv.org/abs/2512.07453


    Adversarial Barrier in Uniform Class Separation

    oai:arXiv.org:2512.08149v3

    arXiv:2512.08149v3 Announce Type: replace-cross Abstract: We identify a strong structural obstruction to Uniform Separation in constructive arithmetic. The mechanism is independent of semantic content; it emerges whenever two distinct evaluator predicates are sustained in parallel and inference remains uniformly representable in an extension of HA. Under these conditions, any putative Uniform Class Separation principle becomes a distinguished instance of a fixed point construction. The resulting limitation is stricter in scope than classical separation barriers (Baker; Rudich; Aaronson et al.) insofar as it constrains the logical form of uniform separation within HA, rather than limiting particular relativizing, naturalizing, or algebrizing techniques.

    https://arxiv.org/abs/2512.08149


    Magic Gems: A Polyhedral Framework for Magic Squares

    oai:arXiv.org:2512.09170v2

    arXiv:2512.09170v2 Announce Type: replace-cross Abstract: We introduce Magic Gems, a geometric representation of magic squares as three-dimensional polyhedra. By mapping an n times n magic square onto a centered coordinate grid with cell values as vertical displacements, we construct a point cloud whose convex hull defines the Magic Gem. Building on prior work connecting magic squares to physical properties such as moment of inertia, this construction reveals an explicit statistical structure: we show that magic squares have vanishing covariances between position and value. We develop a covariance energy functional (the sum of squared covariances with individual row, column, and diagonal indicator variables) and prove that for all n great thank or equal to 3, an arrangement is a magic square if and only if this complete energy vanishes. This characterization transforms the classical line-sum definition into a statistical orthogonality condition. We also study a simpler low-mode relaxation using only four aggregate position indicators; this coincides with the complete characterization for n = 3 (verified exhaustively) but defines a strictly larger class for n greater than or equal to 4 (explicit counterexamples computed). Perturbation analysis demonstrates that magic squares are isolated local minima in the energy landscape. The representation is invariant under dihedral symmetry D4, yielding canonical geometric objects for equivalence classes.

    https://arxiv.org/abs/2512.09170


    Tracking large chemical reaction networks and rare events by neural networks

    oai:arXiv.org:2512.10309v2

    arXiv:2512.10309v2 Announce Type: replace-cross Abstract: Chemical reaction networks are widely used to model stochastic dynamics in chemical kinetics, systems biology and epidemiology. Solving the chemical master equation that governs these systems poses a significant challenge due to the large state space exponentially growing with system sizes. The development of autoregressive neural networks offers a flexible framework for this problem; however, its efficiency is limited especially for high-dimensional systems and in scenarios with rare events. Here, we push the frontier of neural-network approach by exploiting faster optimizations such as natural gradient descent and time-dependent variational principle, achieving a 5- to 22-fold speedup, and by leveraging enhanced-sampling strategies to capture rare events. We demonstrate reduced computational cost and higher accuracy over the previous neural-network method in challenging reaction networks, including the mitogen-activated protein kinase (MAPK) cascade network, the hitherto largest biological network handled by the previous approaches of solving the chemical master equation. We further apply the approach to spatially extended reaction-diffusion systems, the Schl\"ogl model with rare events, on two-dimensional lattices, beyond the recent tensor-network approach that handles one-dimensional lattices. The present approach thus enables efficient modeling of chemical reaction networks in general.

    https://arxiv.org/abs/2512.10309


    Residual subspace evolution strategies for nonlinear inverse problems

    oai:arXiv.org:2512.10325v2

    arXiv:2512.10325v2 Announce Type: replace-cross Abstract: Nonlinear inverse problems pervade engineering and science, yet noisy, non-differentiable, or expensive residual evaluations routinely defeat Jacobian-based solvers. Derivative-free alternatives either demand smoothness, require large populations to stabilise covariance estimates, or stall on flat regions where gradient information fades. This paper introduces residual subspace evolution strategies (RSES), a derivative-free solver that draws Gaussian probes around the current iterate, records how residuals change along those directions, and recombines the probes through a least-squares solve to produce an optimal update. The method builds a residual-only surrogate without forming Jacobians or empirical covariances, and each iteration costs just $k+1$ residual evaluations with $O(k^3)$ linear algebra overhead, where $k$ remains far smaller than the parameter dimension. Benchmarks on calibration, regression, and deconvolution tasks show that RSES reduces misfit consistently across deterministic and stochastic settings, matching or exceeding xNES, NEWUOA, Adam, and ensemble Kalman inversion under matched evaluation budgets. The gains are most pronounced when smoothness or covariance assumptions break, suggesting that lightweight residual-difference surrogates can reliably guide descent where heavier machinery struggles.

    https://arxiv.org/abs/2512.10325