  • stat updates on arXiv.org

    stat updates on the arXiv.org e-print archive.

    A fine-grained look at causal effects in causal spaces

    oai:arXiv.org:2512.11919v1

    arXiv:2512.11919v1 Announce Type: new Abstract: The notion of causal effect is fundamental across many scientific disciplines. Traditionally, quantitative researchers have studied causal effects at the level of variables; for example, how a certain drug dose (W) causally affects a patient's blood pressure (Y). However, in many modern data domains, the raw variables, such as pixels in an image or tokens in a language model, do not have the semantic structure needed to formulate meaningful causal questions. In this paper, we offer a more fine-grained perspective by studying causal effects at the level of events, drawing inspiration from probability theory, where core notions such as independence are first given for events and sigma-algebras, before random variables enter the picture. Within the measure-theoretic framework of causal spaces, a recently introduced axiomatisation of causality, we first introduce several binary definitions that determine whether a causal effect is present, and prove some of their properties linking causal effect to (in)dependence under an intervention measure. Further, we provide quantifying measures that capture the strength and nature of causal effects on events, and show that we can recover the common measures of treatment effect as special cases.

    https://arxiv.org/abs/2512.11919


    Interval Fisher's Discriminant Analysis and Visualisation

    oai:arXiv.org:2512.11945v1

    arXiv:2512.11945v1 Announce Type: new Abstract: In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher's Discriminant Analysis to interval-valued data, using Moore's interval arithmetic and the Mallows' distance. Fisher's objective function is generalised to consider simultaneously the contributions of the centres and the ranges of intervals and is numerically maximised. The resulting discriminant directions are then used to classify interval-valued observations. To support visual assessment, we adapt the class map, originally introduced for conventional data, to classifiers that assign labels through minimum distance rules. We also extend the silhouette plot to this setting and use stacked mosaic plots to complement the visual display of class assignments. Together, these graphical tools provide insight into classifier performance and the strength of class membership. Applications to real datasets illustrate the proposed methodology and demonstrate its value in interpreting classification results for interval-valued data.

    https://arxiv.org/abs/2512.11945


    Debiased Inference for High-Dimensional Regression Models Based on Profile M-Estimation

    oai:arXiv.org:2512.12003v1

    arXiv:2512.12003v1 Announce Type: new Abstract: Debiased inference for high-dimensional regression models has received substantial recent attention to ensure regularized estimators have valid inference. All existing methods focus on achieving Neyman orthogonality through explicitly constructing projections onto the space of nuisance parameters, which is infeasible when an explicit form of the projection is unavailable. We introduce a general debiasing framework, Debiased Profile M-Estimation (DPME), which applies to a broad class of models and does not require model-specific Neyman orthogonalization or projection derivations as in existing methods. Our approach begins by obtaining an initial estimator of the parameters by optimizing a penalized objective function. To correct for the bias introduced by penalization, we construct a one-step estimator using the Newton-Raphson update, applied to the gradient of a profile function defined as the optimal objective function with the parameter of interest held fixed. We use numerical differentiation without requiring the explicit calculation of the gradients. The resulting DPME estimator is shown to be asymptotically linear and normally distributed. Through extensive simulations, we demonstrate that the proposed method achieves better coverage rates than existing alternatives, with largely reduced computational cost. Finally, we illustrate the utility of our method with an application to estimating a treatment rule for multiple myeloma.

    https://arxiv.org/abs/2512.12003


    Proximal Causal Inference for Modified Treatment Policies

    oai:arXiv.org:2512.12038v1

    arXiv:2512.12038v1 Announce Type: new Abstract: The proximal causal inference framework enables the identification and estimation of causal effects in the presence of unmeasured confounding by leveraging two disjoint sets of observed strong proxies: negative control treatments and negative control outcomes. In the point exposure setting, this framework has primarily been applied to estimands comparing counterfactual outcomes under a static fixed intervention or, possibly randomized, regime that depends on baseline covariates. For continuous exposures, alternative hypothetical scenarios can enrich our understanding of causal effects, such as those where each individual receives their observed treatment dose modified in a pre-specified manner - commonly referred to as modified treatment regimes. In this work, we extend the proximal causal inference framework to identify and estimate the mean outcome under a modified treatment regime, addressing this gap in the literature. We propose a flexible strategy that does not rely on the assumption that all confounders have been measured - unlike existing estimators - and leverages modern debiased machine learning techniques using non-parametric estimators of nuisance functions to avoid restrictive parametric assumptions. Our methodology was motivated by immunobridging studies of COVID-19 vaccines aimed at identifying correlates of protection, where the individual's underlying immune capacity is an important unmeasured confounder. We demonstrate its applicability using data from such a study and evaluate its finite-sample performance through simulation studies.

    https://arxiv.org/abs/2512.12038


    Sparse Bayesian Partially Identified Models for Sequence Count Data

    oai:arXiv.org:2512.12040v1

    arXiv:2512.12040v1 Announce Type: new Abstract: In genomics, differential abundance and expression analyses are complicated by the compositional nature of sequence count data, which reflect only relative, not absolute, abundances or expression levels. Many existing methods attempt to address this limitation through data normalizations, but we have shown that such approaches imply strong, often biologically implausible assumptions about total microbial load or total gene expression. Even modest violations of these assumptions can inflate Type I and Type II error rates to over 70%. Sparse estimators have been proposed as an alternative, leveraging the assumption that only a small subset of taxa (or genes) change between conditions. However, we show that current sparse methods suffer from similar pathologies because they treat sparsity assumptions as fixed and ignore the uncertainty inherent in these assumptions. We introduce a sparse Bayesian Partially Identified Model (PIM) that addresses this limitation by explicitly modeling uncertainty in sparsity assumptions. Our method extends the Scale-Reliant Inference (SRI) framework to the sparse setting, providing a principled approach to differential analysis under scale uncertainty. We establish theoretical consistency of the proposed estimator and, through extensive simulations and real data analyses, demonstrate substantial reductions in both Type I and Type II errors compared to existing methods.

    https://arxiv.org/abs/2512.12040


    Estimation of Heterogeneous Causal Mediation Effects in a Hypertension Treatment Trial

    oai:arXiv.org:2512.12043v1

    arXiv:2512.12043v1 Announce Type: new Abstract: Hypertension is a highly prevalent condition and a major risk factor for cardiovascular disease. The landmark Systolic Blood Pressure Intervention Trial (SPRINT) showed that lowering systolic blood pressure (BP) goals from 140 mmHg to 120 mmHg leads to significantly reduced BP, cardiovascular mortality, and morbidity. However, the underlying mechanisms are not yet fully elucidated. In patients with impaired renal function, early reduction of albuminuria has been proposed as a potential mediation pathway. Evidence from the standard causal mediation analysis (CMA), however, yields inconsistent results, possibly due to heterogeneous mediation effects across individuals. To disseminate the heterogeneity, a new framework that incorporates covariate-treatment and mediator-treatment interactions within a linear structural equation modeling system is introduced. Causal assumptions are discussed and heterogeneous natural direct and indirect effects are parameterized as functions of patient characteristics. A modified covariate approach is proposed to relax the hierarchical constraints and the generalized lasso regularization is employed to ensure parsimony in high-dimensional settings. Asymptotic properties are studied. Simulation studies demonstrate good estimation and inference performance. Analysis of the SPRINT data reveals substantial heterogeneity in mediation effects, identifying a subset of patients who stand to gain from therapies targeting albuminuria.

    https://arxiv.org/abs/2512.12043


    StochTree: BART-based modeling in R and Python

    oai:arXiv.org:2512.12051v1

    arXiv:2512.12051v1 Announce Type: new Abstract: stochtree is a C++ library for Bayesian tree ensemble models such as BART and Bayesian Causal Forests (BCF), as well as user-specified variations. Unlike previous BART packages, stochtree provides bindings to both R and Python for full interoperability. stochtree boasts a more comprehensive range of models relative to previous packages, including heteroskedastic forests, random effects, and treed linear models. Additionally, stochtree offers flexible handling of model fits: the ability to save model fits, reinitialize models from existing fits (facilitating improved model initialization heuristics), and pass fits between R and Python. On both platforms, stochtree exposes lower-level functionality, allowing users to specify models incorporating Bayesian tree ensembles without needing to modify C++ code. We illustrate the use of stochtree in three settings: i) straightforward applications of existing models such as BART and BCF, ii) models that include more sophisticated components like heteroskedasticity and leaf-wise regression models, and iii) as a component of custom MCMC routines to fit nonstandard tree ensemble models.

    https://arxiv.org/abs/2512.12051


    Meta-analysis of diagnostic test accuracy with multiple disease stages: combining stage-specific and merged-stage data

    oai:arXiv.org:2512.12065v1

    arXiv:2512.12065v1 Announce Type: new Abstract: For many conditions, it is of clinical importance to know not just the ability of a test to distinguish between those with and without the disease, but also the sensitivity to detect disease at different stages: in particular, the test's ability to detect disease at a stage most amenable to treatment. In a systematic review of test accuracy, pooled stage-specific estimates can be produced using subgroup analysis or meta-regression. However, this requires stage-specific data from each study, which is often not reported. Studies may however report test sensitivity for merged stage categories (e.g. stages I-II) or merged across all stages, together with information on the proportion of patients with disease at each stage. We demonstrate how to incorporate studies reporting merged stage data alongside studies reporting stage-specific data, to allow the inclusion of more studies in the meta-analysis. We consider both meta-analysis of tests with binary results, and meta-analysis of tests with continuous results, where the sensitivity to detect disease of each stage across the whole range of observed thresholds is estimated. The methods are demonstrated using a series of simulated datasets and applied to data from a systematic review of the accuracy of tests used to screen for hepatocellular carcinoma in people with liver cirrhosis. We show that incorporating studies with merged stage data can lead to more precise estimates and, in some cases, corrects biologically implausible results that can arise when the availability of stage-specific data is limited.

    https://arxiv.org/abs/2512.12065


    Local Asymptotic Normality for Multi-Armed Bandits

    oai:arXiv.org:2512.12192v1

    arXiv:2512.12192v1 Announce Type: new Abstract: Van den Akker, Werker, and Zhou (2025) showed that the limit experiment, in the sense of Hájek-Le Cam, for (contextual) bandits whose arms' expected payoffs differ by $O(T^{-1/2})$, is Locally Asymptotically Quadratic (LAQ) but highly non-standard, being characterized by a system of coupled stochastic differential equations. The present paper considers the complementary case where the arms' expected payoffs are fixed with a unique optimal (in the sense of highest expected payoff) arm. It is shown that, under sampling schemes satisfying mild regularity conditions (including UCB and Thompson sampling), the model satisfies the standard Locally Asymptotically Normal (LAN) property.

    https://arxiv.org/abs/2512.12192


    Safe, Always-Valid Alpha-Investing Rules For Doubly Sequential Online Inference

    oai:arXiv.org:2512.12244v1

    arXiv:2512.12244v1 Announce Type: new Abstract: Dynamic decision-making in rapidly evolving research domains, including marketing, finance, and pharmaceutical development, presents a significant challenge. Researchers frequently confront the need for real-time action within a doubly sequential framework characterized by the continuous influx of high-volume data streams and the intermittent arrival of novel tasks. This calls for the development and implementation of new online inference protocols capable of handling both the continuous processing of incoming information and the efficient allocation of resources to address emerging priorities. We introduce a novel class of Safe and Always-Valid Alpha-investing (SAVA) rules that leverages powerful tools including always valid p-values, e-processes, and online false discovery rate methods. The SAVA algorithm effectively integrates information across all tasks, mitigates the alpha-death problem, and controls the false selection rate (FSR) at all decision points. We validate the efficacy of the SAVA framework through rigorous theoretical analysis and extensive numerical experiments. Our results demonstrate that SAVA not only offers effective control of the FSR but also significantly improves statistical power compared to traditional online testing approaches.

    https://arxiv.org/abs/2512.12244


    A complete characterization of maximal copulas with a given track section

    oai:arXiv.org:2512.12257v1

    arXiv:2512.12257v1 Announce Type: new Abstract: Bivariate copulas with prescribed diagonal section were first studied by Bertino. Their maximality has so far been studied only from the point of view of upper bounds, which brings quasi-copulas into the picture and limits the resulting set substantially. We propose to study maximality of these families in the order-theoretic sense. A copula $C$ with given diagonal section $\delta$ is called undominated if there is no copula $C' \neq C$ with the same diagonal section $\delta$ such that $C \leq C'$. The main contribution of this paper is a new method that provides copulas of this kind. This method generates a much wider class that contains the known upper bounds as a very small subclass. There was a recent call for the study of asymmetry, which is addressed by our class better than by the known ones. Corresponding quasi-copulas can be obtained from our copulas via splicing techniques. Most results are given on the level of tracks.

    https://arxiv.org/abs/2512.12257


    Marshall-Olkin copulas revisited

    oai:arXiv.org:2512.12265v1

    arXiv:2512.12265v1 Announce Type: new Abstract: The almost seventy-year-old Marshall-Olkin copulas, the wider class of Marshall copulas, and the even wider class of shock model (SM) copulas constitute a substantial part of present-day copula theory owing to their numerous applications. Recently, Christian Genest and coauthors introduced a new stochastic model for a special subclass of SM copulas, which not only gives a new angle on these copulas but also widens the range of applications. In this paper we extend this type of stochastic model to all known subclasses of SM copulas. We also introduce a novel class of SM copulas and extend the new stochastic model to this class as well.

    https://arxiv.org/abs/2512.12265


    Hellinger loss function for Generative Adversarial Networks

    oai:arXiv.org:2512.12267v1

    arXiv:2512.12267v1 Announce Type: new Abstract: We propose Hellinger-type loss functions for training Generative Adversarial Networks (GANs), motivated by the boundedness, symmetry, and robustness properties of the Hellinger distance. We define an adversarial objective based on this divergence and study its statistical properties within a general parametric framework. We establish the existence, uniqueness, consistency, and joint asymptotic normality of the estimators obtained from the adversarial training procedure. In particular, we analyze the joint estimation of both generator and discriminator parameters, offering a comprehensive asymptotic characterization of the resulting estimators. We introduce two implementations of the Hellinger-type loss and we evaluate their empirical behavior in comparison with the classic (Maximum Likelihood-type) GAN loss. Through a controlled simulation study, we demonstrate that both proposed losses yield improved estimation accuracy and robustness under increasing levels of data contamination.

    https://arxiv.org/abs/2512.12267
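
    The boundedness and symmetry of the Hellinger distance mentioned in this abstract can be seen on a small discrete example. The sketch below is not the paper's adversarial loss; it only computes the standard Hellinger distance between two discrete distributions, and the function name `hellinger` and the example vectors are illustrative assumptions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Uses H(p, q) = sqrt(1 - sum_i sqrt(p_i * q_i)), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint supports.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


# Hypothetical example distributions (not from the paper).
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
d_pq = hellinger(p, q)
d_qp = hellinger(q, p)
```

    Here `d_pq` equals `d_qp` (symmetry), the distance of a distribution to itself is 0, and two distributions with disjoint supports are at the maximal distance 1.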


    Robust Outlier Detection and Low-Latency Concept Drift Adaptation for Data Stream Regression: A Dual-Channel Architecture

    oai:arXiv.org:2512.12289v1

    arXiv:2512.12289v1 Announce Type: new Abstract: Outlier detection and concept drift detection represent two challenges in data analysis. Most studies address these issues separately. However, joint detection mechanisms in regression remain underexplored, where the continuous nature of output spaces makes distinguishing drifts from outliers inherently challenging. To address this, we propose a novel robust regression framework for joint outlier and concept drift detection. Specifically, we introduce a dual-channel decision process that orchestrates prediction residuals into two coupled logic flows: a rapid response channel for filtering point outliers and a deep analysis channel for diagnosing drifts. We further develop the Exponentially Weighted Moving Absolute Deviation with Distinguishable Types (EWMAD-DT) detector to autonomously differentiate between abrupt and incremental drifts via dynamic thresholding. Comprehensive experiments on both synthetic and real-world datasets demonstrate that our unified framework, enhanced by EWMAD-DT, exhibits superior detection performance even when point outliers and concept drifts coexist.

    https://arxiv.org/abs/2512.12289


    Quantile regression with generalized multiquadric loss function

    oai:arXiv.org:2512.12340v1

    arXiv:2512.12340v1 Announce Type: new Abstract: Quantile regression (QR) is now widely used to analyze the effect of covariates on the conditional distribution of a response variable. It provides a more comprehensive picture of the relationship between a response and covariates compared with classical least squares regression. However, the non-differentiability of the check loss function precludes the use of gradient-based methods to solve the optimization problem in quantile regression estimation. To address this, this paper constructs a smoothed loss function based on the multiquadric (MQ) function. The proposed loss function leads to a globally convex optimization problem that can be efficiently solved via (stochastic) gradient descent methods. As an example, we apply the Barzilai-Borwein gradient descent method to obtain the quantile regression estimate. We establish theoretical results for the proposed estimator under some regularity conditions, and compare it with other estimation methods using Monte Carlo simulations.

    https://arxiv.org/abs/2512.12340
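
    To illustrate why smoothing the check loss makes gradient methods applicable, here is a minimal sketch. It is an assumption, not the paper's definition: the check loss is written as 0.5*|u| + (tau - 0.5)*u and |u| is replaced by the multiquadric-type term sqrt(u^2 + c^2). The smoothing constant `c`, the function names, the toy data, and the plain fixed-step gradient descent (used here instead of the Barzilai-Borwein method named in the abstract) are all illustrative choices.

```python
import math


def check_loss(u, tau):
    """Standard quantile check loss: u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def smoothed_check_loss(u, tau, c=0.1):
    """Assumed smoothing: |u| in 0.5*|u| + (tau - 0.5)*u becomes sqrt(u^2 + c^2)."""
    return 0.5 * math.sqrt(u * u + c * c) + (tau - 0.5) * u


def smoothed_check_grad(u, tau, c=0.1):
    """Derivative of the smoothed loss with respect to u; defined everywhere."""
    return 0.5 * u / math.sqrt(u * u + c * c) + (tau - 0.5)


def fit_quantile(y, tau, c=0.1, lr=0.5, n_iter=500):
    """Estimate the tau-quantile of y by fixed-step gradient descent on the smoothed loss."""
    q = 0.0
    n = len(y)
    for _ in range(n_iter):
        # Residuals are u_i = y_i - q, so d(loss)/dq = -mean of d(loss)/du.
        grad_q = -sum(smoothed_check_grad(yi - q, tau, c) for yi in y) / n
        q -= lr * grad_q
    return q


# Hypothetical toy sample whose median is 5.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
q_hat = fit_quantile(y, tau=0.5)
```

    On this symmetric toy sample the fitted value `q_hat` approaches the sample median. The smoothed loss never falls below the check loss and exceeds it by at most `0.5 * c`, so a smaller `c` tracks the original objective more closely at the cost of a sharper curvature near zero.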


    Towards a pretrained deep learning estimator of the Linfoot informational correlation

    oai:arXiv.org:2512.12358v1

    arXiv:2512.12358v1 Announce Type: new Abstract: We develop a supervised deep-learning approach to estimate mutual information between two continuous random variables. As labels, we use the Linfoot informational correlation, a transformation of mutual information that has many important properties. Our method is based on ground truth labels for Gaussian and Clayton copulas. We compare our method with estimators based on kernel density, k-nearest neighbours and neural estimators. We show generally lower bias and lower variance. As a proof of principle, future research could look into training the model with a more diverse set of examples from other copulas for which ground truth labels are available.

    https://arxiv.org/abs/2512.12358
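
    For readers unfamiliar with the label used in this abstract, the Linfoot informational correlation is the transformation sqrt(1 - exp(-2 I)) of the mutual information I (in nats). The sketch below only evaluates this standard formula and checks it against the closed-form mutual information of a bivariate Gaussian; it is not the paper's deep-learning estimator, and the function names are illustrative assumptions.

```python
import math


def linfoot(mi_nats):
    """Linfoot informational correlation from mutual information in nats."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi_nats))


def gaussian_mi(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho * rho)


# For a bivariate Gaussian the transformation recovers |rho|.
rho = 0.6
r_linfoot = linfoot(gaussian_mi(rho))
```

    For the Gaussian case `r_linfoot` coincides with the absolute Pearson correlation, which is one reason the transformation yields a value on the familiar [0, 1] scale.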


    Asymmetric Laplace distribution regression model for fitting heterogeneous longitudinal response

    oai:arXiv.org:2512.12362v1

    arXiv:2512.12362v1 Announce Type: new Abstract: The systematic collection of longitudinal data is very common in practice, making mixed models widely used. Most developments around these models focus on modeling the mean trajectory of repeated measurements, typically under the assumption of homoskedasticity. However, as data become increasingly rich through intensive collection over time, these models can become limiting and may introduce biases in analysis. In fact, such data are often heterogeneous, with the presence of outliers, heteroskedasticity, and asymmetry in the distribution of individual measurements. Therefore, ignoring these characteristics can lead to biased modeling results. In this work, we propose a mixed-effect distributional regression model based on the asymmetric Laplace distribution to: (1) address the presence of outliers, heteroskedasticity, and asymmetry in longitudinal measurements; (2) model the entire individual distribution of the heterogeneous longitudinal response over time, rather than just its conditional expectation; and (3) give a more comprehensive evaluation of the impact of covariates on the distribution of the responses through meaningful indicator. A Bayesian estimation procedure is presented. In order to choose between two distributional regression models, we also propose a new model selection criterion for longitudinal data. It measures the proximity between the individual distribution estimated by the model and the empirical individual distribution of the data over time, using a set of quantiles. The estimation procedure and the selection criterion are validated in a simulation study and the proposed model is compared to a distributional regression mixed model based on the Gaussian distribution and a location-scale linear quantile mixed model. Finally, the proposed model is applied to analyze blood pressure over time for hospitalized patients in the intensive care unit.

    https://arxiv.org/abs/2512.12362


    On the epsilon-delta Structure Underlying Chatterjee's Rank Correlation

    oai:arXiv.org:2512.12363v1

    arXiv:2512.12363v1 Announce Type: new Abstract: We provide an epsilon-delta interpretation of Chatterjee's rank correlation by tracing its origin to a notion of local dependence between random variables. Starting from a primitive epsilon-delta construction, we show that rank-based dependence measures arise naturally as epsilon to zero limits of local averaging procedures. Within this framework, Chatterjee's rank correlation admits a transparent interpretation as an empirical realization of a local L1 residual. We emphasize that the probability integral transform plays no structural role in the underlying epsilon-delta mechanism, and is introduced only as a normalization step that renders the final expression distribution-free. We further consider a moment-based analogue obtained by replacing the absolute deviation with a squared residual. This L2 formulation is independent of rank transformations and, under a Gaussian assumption, recovers Pearson's coefficient of determination.

    https://arxiv.org/abs/2512.12363
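
    As background for this abstract, Chatterjee's rank correlation has a simple sample formula when there are no ties: sort the pairs by x, take the ranks r_i of the corresponding y values, and compute 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1). The sketch below implements only this standard no-ties formula, not the epsilon-delta construction of the paper; the function name is an illustrative assumption.

```python
def chatterjee_xi(x, y):
    """Chatterjee's rank correlation for samples without ties in x or y."""
    x, y = list(x), list(y)
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    # Reorder the y values according to increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    # ranks[i] is the rank (1..n) of y_sorted[i] among all y values.
    ranks = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: y_sorted[i])):
        ranks[i] = r + 1
    # Sum of absolute rank jumps between x-neighbours.
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


xs = [1, 2, 3, 4, 5]
xi_increasing = chatterjee_xi(xs, [1, 2, 3, 4, 5])
xi_decreasing = chatterjee_xi(xs, [5, 4, 3, 2, 1])
```

    Both monotone examples give the same value, 1 - 3/(n + 1), which shows that the coefficient measures the strength of functional dependence rather than its direction and reaches 1 only in the limit of large n.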


    Early Highlights in the History of the Bernstein-von Mises Theorem

    oai:arXiv.org:2512.12379v1

    arXiv:2512.12379v1 Announce Type: new Abstract: The designation "Bernstein-von Mises theorem" is apparently due to Lucien Le Cam. Roughly, the assertion of this theorem states that the posterior distribution of a parameter, conditioned on a large sample, is approximately normal, independent of a particular prior. The present paper discusses important steps in the development of this theorem and its applications, from Laplace in 1774 to Le Cam in 1953. Regarding Bernstein and his disciple Neyman, it thereby relies on sources which were widely unknown and hard to obtain until recently.

    https://arxiv.org/abs/2512.12379


    The Morphemic Origin of Zipf's Law: A Factorized Combinatorial Framework

    oai:arXiv.org:2512.12394v1

    arXiv:2512.12394v1 Announce Type: new Abstract: We present a simple structure-based model of how words are formed from morphemes. The model explains two major empirical facts: the typical distribution of word lengths and the appearance of Zipf-like rank-frequency curves. In contrast to classical explanations based on random text or communication efficiency, our approach uses only the combinatorial organization of prefixes, roots, suffixes and inflections. In this Morphemic Combinatorial Word Model, a word is created by activating several positional slots. Each slot turns on with a certain probability and selects one morpheme from its inventory. Morphemes are treated as stable building blocks that regularly appear in word formation and have characteristic positions. This mechanism produces realistic word length patterns with a concentrated middle zone and a thin long tail, closely matching real languages. Simulations with synthetic morpheme inventories also generate rank-frequency curves with Zipf-like exponents around 1.1-1.4, similar to English, Russian and Romance languages. The key result is that Zipf-like behavior can emerge without meaning, communication pressure or optimization principles. The internal structure of morphology alone, combined with probabilistic activation of slots, is sufficient to create the robust statistical patterns observed across languages.

    https://arxiv.org/abs/2512.12394


    Scalable Spatial Stream Network (S3N) Models

    oai:arXiv.org:2512.12398v1

    arXiv:2512.12398v1 Announce Type: new Abstract: Understanding how habitats shape species distributions and abundances across spatially complex, dendritic freshwater networks remains a longstanding and fundamental challenge in ecology, with direct implications for effective biodiversity management and conservation. Existing spatial stream network (SSN) models adapt spatial process models to river networks by creating covariance functions that account for stream distance, but preprocessing and estimation with these models is both computationally and time intensive, thus precluding the application of these models to regional or continental scales. This paper introduces a new class of Scalable Spatial Stream Network (S3N) models, which extend nearest-neighbor Gaussian processes to incorporate ecologically relevant spatial dependence while greatly improving computational efficiency. The S3N framework enables scalable modeling of spatial stream networks, demonstrated here for 285 fish species in the Ohio River Basin (>4,000 river km). Validation analyses show that S3N accurately recovers spatial and covariance parameters, even with reduced bias and variance compared to standard SSN implementations. These results represent a key advancement toward large-scale mapping of freshwater fish distributions and quantifying the influence of environmental drivers across extensive river networks.

    https://arxiv.org/abs/2512.12398


    Co-Hub Node Based Multiview Graph Learning with Theoretical Guarantees

    oai:arXiv.org:2512.12435v1

    arXiv:2512.12435v1 Announce Type: new Abstract: Identifying the graphical structure underlying the observed multivariate data is essential in numerous applications. Current methodologies are predominantly confined to deducing a singular graph under the presumption that the observed data are uniform. However, many contexts involve heterogeneous datasets that feature multiple closely related graphs, typically referred to as multiview graphs. Previous research on multiview graph learning promotes edge-based similarity across layers using pairwise or consensus-based regularizers. However, multiview graphs frequently exhibit a shared node-based architecture across different views, such as common hub nodes. Such commonalities can enhance the precision of learning and provide interpretive insight. In this paper, we propose a co-hub node model, positing that different views share a common group of hub nodes. The associated optimization framework is developed by enforcing structured sparsity on the connections of these co-hub nodes. Moreover, we present a theoretical examination of layer identifiability and determine bounds on estimation error. The proposed methodology is validated using both synthetic graph data and fMRI time series data from multiple subjects to discern several closely related graphs.

    https://arxiv.org/abs/2512.12435


    Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

    oai:arXiv.org:2512.12442v1

    arXiv:2512.12442v1 Announce Type: new Abstract: Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations. However, the reconstruction from the GPR models suffers from high computation complexity. To make the situation worse, classic approaches for visualizing the data uncertainties, like probabilistic marching cubes, are also computationally very expensive, especially for data of high resolutions. In this paper, we accelerate the level-crossing probability calculation efficiency on GPR models by subdividing the data spatially into a hierarchical data structure and only reconstructing values adaptively in the regions that have a non-zero probability. For each region, leveraging the known GPR kernel and the saved data observations, we propose a novel approach to efficiently calculate an upper bound for the level-crossing probability inside the region and use this upper bound to make the subdivision and reconstruction decisions. We demonstrate that our value occurrence probability estimation is accurate with a low computation cost by experiments that calculate the level-crossing probability fields on different datasets.

    https://arxiv.org/abs/2512.12442
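
As a rough illustration of what a level-crossing probability is, the sketch below computes it for one cell whose vertex values are Gaussian. It assumes the vertex values are independent, which a GPR posterior generally does not satisfy, and it is not the hierarchical upper-bound method of the abstract. The names `normal_cdf` and `cell_crossing_probability` are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ N(mean, std^2)."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def cell_crossing_probability(level, means, stds):
    """Probability that a cell is crossed by the isosurface at `level`.

    Assumption (for illustration only): vertex values are independent
    Gaussians. The cell is crossed unless every vertex lies below the
    level or every vertex lies above it.
    """
    all_below = 1.0
    all_above = 1.0
    for mean, std in zip(means, stds):
        p_below = normal_cdf(level, mean, std)
        all_below *= p_below
        all_above *= 1.0 - p_below
    return 1.0 - all_below - all_above


# Two vertices centred exactly on the level: each is below with probability 1/2.
p = cell_crossing_probability(0.0, means=[0.0, 0.0], stds=[1.0, 1.0])
```

A cell whose vertices all sit far from the level gets a probability near zero, which is the kind of region an adaptive scheme could skip without reconstructing values.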


    Design-Based Weighted Regression Estimators for Average and Conditional Spillover Effects

    oai:arXiv.org:2512.12452v1

    arXiv:2512.12452v1 Announce Type: new Abstract: When individuals engage in social or physical interactions, a unit's outcome may depend on the treatments received by others. In such interference environments, we provide a unified framework characterizing a broad class of spillover estimands as weighted averages of unit-to-unit spillover effects, with estimand-specific weights. We then develop design-based weighted least squares (WLS) estimators for both average and conditional spillover effects. We introduce three nonparametric estimators under the dyadic, sender, and receiver perspectives, which distribute the estimand weights differently across the outcome vector, design matrix, and weight matrix. For the average-type estimands, we show that all three estimators are equivalent to the Hajek estimator. For conditional spillover effects, we establish conditions under which the estimators are consistent for the target conditional spillover effects. We further derive concentration inequalities, a central limit theorem, and conservative variance estimators in an asymptotic regime where both the number of clusters and cluster sizes grow.

    https://arxiv.org/abs/2512.12452


    Understanding Overparametrization in Survival Models through Double-Descent

    oai:arXiv.org:2512.12463v1

    arXiv:2512.12463v1 Announce Type: new Abstract: Classical statistical learning theory predicts a U-shaped relationship between test loss and model capacity, driven by the bias-variance trade-off. Recent advances in modern machine learning have revealed a more complex pattern, double-descent, in which test loss, after peaking near the interpolation threshold, decreases again as model capacity continues to grow. While this behavior has been extensively analyzed in regression and classification, its manifestation in survival analysis remains unexplored. This study investigates double-descent in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. We rigorously define interpolation and finite-norm interpolation, two key characteristics of loss-based models to understand double-descent. We then show the existence (or absence) of (finite-norm) interpolation of all four models. Our findings clarify how likelihood-based losses and model implementation jointly determine the feasibility of interpolation and show that overfitting should not be regarded as benign for survival models. All theoretical results are supported by numerical experiments that highlight the distinct generalization behaviors of survival models.

    https://arxiv.org/abs/2512.12463


    Sleep pattern profiling using a finite mixture of contaminated multivariate skew-normal distributions on incomplete data

    oai:arXiv.org:2512.12464v1

    arXiv:2512.12464v1 Announce Type: new Abstract: Medical data often exhibit characteristics that make cluster analysis particularly challenging, such as missing values, outliers, and cluster features like skewness. Typically, such data would need to be preprocessed -- by cleaning outliers and missing values -- before clustering could be performed. However, these preliminary steps rely on objective functions different from those used in the clustering stage. In this paper, we propose a unified model-based clustering approach that simultaneously handles atypical observations, missing values, and cluster-wise skewness within a single framework. Each cluster is modelled using a contaminated multivariate skew-normal distribution -- a convenient two-component mixture of multivariate skew-normal densities -- in which one component represents the main data (the "bulk") and the other captures potential outliers. From an inferential perspective, we implement and use a variant of the EM algorithm to obtain the maximum likelihood estimates of the model parameters. Simulation studies demonstrate that the proposed model outperforms existing approaches in both clustering accuracy and outlier detection, across low- and high-dimensional settings, even in the presence of substantial missingness. The method is further applied to the Cleveland Children's Sleep and Health Study (CCSHS), a dataset characterised by incomplete observations. Without any preprocessing, the proposed approach identifies five distinct groups of sleepers, revealing meaningful differences in sleeper typologies.

    https://arxiv.org/abs/2512.12464


    Iterative Sampling Methods for Sinkhorn Distributionally Robust Optimization

    oai:arXiv.org:2512.12550v1

    arXiv:2512.12550v1 Announce Type: new Abstract: Distributionally robust optimization (DRO) has emerged as a powerful paradigm for reliable decision-making under uncertainty. This paper focuses on DRO with ambiguity sets defined via the Sinkhorn discrepancy: an entropy-regularized Wasserstein distance, referred to as Sinkhorn DRO. Existing work primarily addresses Sinkhorn DRO from a dual perspective, leveraging its formulation as a conditional stochastic optimization problem, for which many stochastic gradient methods are applicable. However, the theoretical analyses of such methods often rely on the boundedness of the loss function, and it is indirect to obtain the worst-case distribution associated with Sinkhorn DRO. In contrast, we study Sinkhorn DRO from the primal perspective, by reformulating it as a bilevel program with several infinite-dimensional lower-level subproblems over probability space. This formulation enables us to simultaneously obtain the optimal robust decision and the worst-case distribution, which is valuable in practical settings, such as generating stress-test scenarios or designing robust learning algorithms. We propose both double-loop and single-loop sampling-based algorithms with theoretical guarantees to solve this bilevel program. Finally, we demonstrate the effectiveness of our approach through a numerical study on adversarial classification.

    https://arxiv.org/abs/2512.12550


    Mind the Jumps: A Scalable Robust Local Gaussian Process for Multidimensional Response Surfaces with Discontinuities

    oai:arXiv.org:2512.12574v1

    arXiv:2512.12574v1 Announce Type: new Abstract: Modeling response surfaces with abrupt jumps and discontinuities remains a major challenge across scientific and engineering domains. Although Gaussian process models excel at capturing smooth nonlinear relationships, their stationarity assumptions limit their ability to adapt to sudden input-output variations. Existing nonstationary extensions, particularly those based on domain partitioning, often struggle with boundary inconsistencies, sensitivity to outliers, and scalability issues in higher-dimensional settings, leading to reduced predictive accuracy and unreliable parameter estimation. To address these challenges, this paper proposes the Robust Local Gaussian Process (RLGP) model, a framework that integrates adaptive nearest-neighbor selection with a sparsity-driven robustification mechanism. Unlike existing methods, RLGP leverages an optimization-based mean-shift adjustment after a multivariate perspective transformation combined with local neighborhood modeling to mitigate the influence of outliers. This approach improves predictive accuracy near discontinuities while enhancing robustness to data heterogeneity. Comprehensive evaluations on real-world datasets show that RLGP consistently delivers high predictive accuracy and maintains competitive computational efficiency, especially in scenarios with sharp transitions and complex response structures. Scalability tests further confirm RLGP's stability and reliability in higher-dimensional settings, where other methods struggle. These results establish RLGP as an effective and practical solution for modeling nonstationary and discontinuous response surfaces across a wide range of applications.

    https://arxiv.org/abs/2512.12574


    A Real Data-Driven, Robust Survival Analysis on Patients who Underwent Deep Brain Stimulation for Parkinson's Disease by Utilizing Parametric, Non-Parametric, and Semi-Parametric Approaches

    oai:arXiv.org:2512.12579v1

    arXiv:2512.12579v1 Announce Type: new Abstract: Parkinson's Disease (PD) is a devastating neurodegenerative disorder that affects millions of people around the globe. Many researchers are continuously working to understand PD, which affects patients' day-to-day lives, and to develop treatments that improve their condition. Over recent decades, Deep Brain Stimulation (DBS) has given promising results for motor symptoms, improving the quality of daily living of PD patients. In the present study, we use nonparametric, semi-parametric, and robust parametric survival analysis to extract useful information about the long-term survival outcomes of patients who underwent DBS for PD. We conclude that the probabilistic behavior of the survival time of female patients is statistically different from that of male patients. Furthermore, we find that the probabilistic behavior of the survival times of female patients is characterized by the 3-parameter Lognormal distribution, while that of male patients is characterized by the 3-parameter Weibull distribution. More importantly, after conducting a robust parametric survival analysis, we find that female patients have higher survival than male patients. Using the semi-parametric Cox-PH model, we find that an initial implant on the right side leads to a high frequency of events for the female patients with a bad prognostic factor, while for the male patients, a low frequency of events occurs with a good prognostic factor. Furthermore, we find an interaction term between the number of revisions and the initial size of the implant, which increases the frequency of events for the male patients with a bad prognostic factor.

    https://arxiv.org/abs/2512.12579


    Robust Variational Bayes by Min-Max Median Aggregation

    oai:arXiv.org:2512.12676v1

    arXiv:2512.12676v1 Announce Type: new Abstract: We propose a robust and scalable variational Bayes (VB) framework designed to effectively handle contamination and outliers in datasets. Our approach partitions the data into $m$ disjoint subsets and formulates a joint optimization problem based on robust aggregation principles. A key insight is that the full posterior distribution is equivalent to the minimizer of the mean Kullback-Leibler (KL) divergence from the $m$-powered local posterior distributions. To enhance robustness, we replace the mean KL divergence with a min-max median formulation. The min-max formulation not only ensures consistency between the KL minimizer and the Evidence Lower Bound (ELBO) maximizer but also facilitates the establishment of improved statistical rates for the mean of the variational posterior. We observe a notable discrepancy in the $m$-powered marginal log likelihood function contingent on the presence of local latent variables. To address this, we treat these two scenarios separately to guarantee the consistency of the aggregated variational posterior. Specifically, when local latent variables are present, we introduce an aggregate-and-rescale strategy. Theoretically, we provide a non-asymptotic analysis of our proposed posterior, incorporating a refined analysis of the Bernstein-von Mises (BvM) theorem to accommodate a diverging number of subsets $m$. Our findings indicate that the two-stage approach yields a smaller approximation error compared to directly aggregating the $m$-powered local posteriors. Furthermore, we establish a nearly optimal statistical rate for the mean of the proposed posterior, advancing existing theories related to min-max median estimators. The efficacy of our method is demonstrated through extensive simulation studies.

    https://arxiv.org/abs/2512.12676


    Limits To (Machine) Learning

    oai:arXiv.org:2512.12735v1

    arXiv:2512.12735v1 Announce Type: new Abstract: Machine learning (ML) methods are highly flexible, but their ability to approximate the true data-generating process is fundamentally constrained by finite samples. We characterize a universal lower bound, the Limits-to-Learning Gap (LLG), quantifying the unavoidable discrepancy between a model's empirical fit and the population benchmark. Recovering the true population $R^2$, therefore, requires correcting observed predictive performance by this bound. Using a broad set of variables, including excess returns, yields, credit spreads, and valuation ratios, we find that the implied LLGs are large. This indicates that standard ML approaches can substantially understate true predictability in financial data. We also derive LLG-based refinements to the classic Hansen and Jagannathan (1991) bounds, analyze implications for parameter learning in general-equilibrium settings, and show that the LLG provides a natural mechanism for generating excess volatility.

    https://arxiv.org/abs/2512.12735


    Transport Reversible Jump Markov Chain Monte Carlo with proposals generated by Variational Inference with Normalizing Flows

    oai:arXiv.org:2512.12742v1

    arXiv:2512.12742v1 Announce Type: new Abstract: We present a framework using variational inference with normalizing flows (VI-NFs) to generate proposals of reversible jump Markov chain Monte Carlo (RJMCMC) for efficient trans-dimensional Bayesian inference. Unlike transport reversible jump methods relying on forward KL minimization with pilot MCMC samples, our approach minimizes the reverse KL divergence which requires only samples from a base distribution, eliminating costly target sampling. The method employs RealNVP-based flows to learn model-specific transport maps, enabling construction of both between-model and within-model proposals. Our framework provides accurate marginal likelihood estimates from the variational approximation. This facilitates efficient model comparison and proposal adaptation in RJMCMC. Experiments on an illustrative example, factor analysis, and variable selection tasks in linear regression show that TRJ designed by VI-NFs achieves faster mixing and more efficient model space exploration compared to existing baselines. The proposed algorithm can be extended to conditional flows for amortized variational inference across models. Code is available at https://github.com/YinPingping111/TRJ_VINFs.

    https://arxiv.org/abs/2512.12742


    Complexity of Markov Chain Monte Carlo for Generalized Linear Models

    oai:arXiv.org:2512.12748v1

    arXiv:2512.12748v1 Announce Type: new Abstract: Markov Chain Monte Carlo (MCMC), Laplace approximation (LA) and variational inference (VI) methods are popular approaches to Bayesian inference, each with trade-offs between computational cost and accuracy. However, a theoretical understanding of these differences is missing, particularly when both the sample size $n$ and the dimension $d$ are large. LA and Gaussian VI are justified by Bernstein-von Mises (BvM) theorems, and recent work has derived the characteristic condition $n\gg d^2$ for their validity, improving over the condition $n\gg d^3$. In this paper, we show for linear, logistic and Poisson regression that for $n\gtrsim d$, MCMC attains the same complexity scaling in $n$, $d$ as first-order optimization algorithms, up to sub-polynomial factors. Thus MCMC is competitive with LA and Gaussian VI in complexity, under a scaling between $n$ and $d$ more general than BvM regimes. Our complexities apply to appropriately scaled priors that are not necessarily Gaussian-tailed, including Student-$t$ and flat priors, with log-posteriors that are not necessarily globally concave or gradient-Lipschitz.

    https://arxiv.org/abs/2512.12748


    Flow-matching Operators for Residual-Augmented Probabilistic Learning of Partial Differential Equations

    oai:arXiv.org:2512.12749v1

    arXiv:2512.12749v1 Announce Type: new Abstract: Learning probabilistic surrogates for PDEs remains challenging in data-scarce regimes: neural operators require large amounts of high-fidelity data, while generative approaches typically sacrifice resolution invariance. We formulate flow matching in an infinite-dimensional function space to learn a probabilistic transport that maps low-fidelity approximations to the manifold of high-fidelity PDE solutions via learned residual corrections. We develop a conditional neural operator architecture based on feature-wise linear modulation for flow-matching vector fields directly in function space, enabling inference at arbitrary spatial resolutions without retraining. To improve stability and representational control of the induced neural ODE, we parameterize the flow vector field as a sum of a linear operator and a nonlinear operator, combining lightweight linear components with a conditioned Fourier neural operator for expressive, input-dependent dynamics. We then formulate a residual-augmented learning strategy where the flow model learns probabilistic corrections from inexpensive low-fidelity surrogates to high-fidelity solutions, rather than learning the full solution mapping from scratch. Finally, we derive tractable training objectives that extend conditional flow matching to the operator setting with input-function-dependent couplings. To demonstrate the effectiveness of our approach, we present numerical experiments on a range of PDEs, including the 1D advection and Burgers' equation, and a 2D Darcy flow problem for flow through a porous medium. We show that the proposed method can accurately learn solution operators across different resolutions and fidelities and produces uncertainty estimates that appropriately reflect model confidence, even when trained on limited high-fidelity data.

    https://arxiv.org/abs/2512.12749


    Improved Concentration for Mean Estimators via Shrinkage

    oai:arXiv.org:2512.12750v1

    arXiv:2512.12750v1 Announce Type: new Abstract: We study a class of robust mean estimators $\widehat{\mu}$ obtained by adaptively shrinking the weights of sample points far from a base estimator $\widehat{\kappa}$. Given a data-dependent scaling factor $\widehat{\alpha}$ and a weighting function $w:[0, \infty) \to [0,1]$, we let $\widehat{\mu} = \widehat{\kappa} + \frac{1}{n}\sum_{i=1}^n(X_i - \widehat{\kappa})w(\widehat{\alpha}|X_i-\widehat{\kappa}|) $. We prove that, under mild assumptions over $w$, these estimators achieve stronger concentration bounds than the base estimate $\widehat{\kappa}$, including sub-Gaussian guarantees. This framework unifies and extends several existing approaches to robust mean estimation in $\mathbb{R}$. Through numerical experiments, we show that our shrinking approach translates to faster concentration, even for small sample sizes.

    https://arxiv.org/abs/2512.12750
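
The estimator formula in the abstract above can be written out directly. The sketch below is a minimal illustration under assumptions that are not in the abstract: the base estimator $\widehat{\kappa}$ defaults to the sample median, the scaling factor $\widehat{\alpha}$ is passed in as a fixed number rather than computed from the data, and the weighting function is a Huber-type $w(t) = \min(1, 1/t)$. The names `shrinkage_mean` and `huber_weight` are hypothetical.

```python
import numpy as np


def huber_weight(t):
    """Example weighting function w: [0, inf) -> [0, 1], w(t) = min(1, 1/t)."""
    t = np.asarray(t, dtype=float)
    # np.maximum guards against division by zero where t <= 1.
    return np.where(t <= 1.0, 1.0, 1.0 / np.maximum(t, 1.0))


def shrinkage_mean(x, weight_fn, alpha, base=None):
    """mu_hat = kappa + (1/n) * sum_i (x_i - kappa) * w(alpha * |x_i - kappa|).

    `base` is the base estimate kappa; the sample median is used when it is
    omitted (an assumption made for this sketch).
    """
    x = np.asarray(x, dtype=float)
    kappa = np.median(x) if base is None else float(base)
    residual = x - kappa
    weights = weight_fn(alpha * np.abs(residual))
    return kappa + np.mean(residual * weights)


sample = [0.0, 1.0, 2.0, 3.0, 100.0]
robust = shrinkage_mean(sample, huber_weight, alpha=0.5)  # outlier is down-weighted
plain = shrinkage_mean(sample, huber_weight, alpha=0.0)   # all weights 1: sample mean
```

With `alpha=0.0` every weight equals 1 and the formula reduces to the sample mean; larger `alpha` shrinks the contribution of points far from the base estimate.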


    Variational Inference for Fully Bayesian Hierarchical Linear Models

    oai:arXiv.org:2512.12857v1

    arXiv:2512.12857v1 Announce Type: new Abstract: Bayesian hierarchical linear models provide a natural framework to analyze nested and clustered data. Classical estimation with Markov chain Monte Carlo produces well calibrated posterior distributions but becomes computationally expensive in high dimensional or large sample settings. Variational Inference and Stochastic Variational Inference offer faster optimization based alternatives, but their accuracy in hierarchical structures is uncertain when group separation is weak. This paper compares these two paradigms across three model classes, the Linear Regression Model, the Hierarchical Linear Regression Model, and a Clustered Hierarchical Linear Regression Model. Through simulation studies and an application to real data, the results show that variational methods recover global regression effects and clustering structure with a fraction of the computing time, but distort posterior dependence and yield unstable values of information criteria such as WAIC and DIC. The findings clarify when variational methods can serve as practical surrogates for Markov chain Monte Carlo and when their limitations make full Bayesian sampling necessary, and they provide guidance for extending the same variational framework to generalized linear models and other members of the exponential family.

    https://arxiv.org/abs/2512.12857


    PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders

    oai:arXiv.org:2512.12905v1

    arXiv:2512.12905v1 Announce Type: new Abstract: Linear Autoencoders (LAEs) have shown strong performance in state-of-the-art recommender systems. However, this success remains largely empirical, with limited theoretical understanding. In this paper, we investigate the generalizability -- a theoretical measure of model performance in statistical learning -- of multivariate linear regression and LAEs. We first propose a PAC-Bayes bound for multivariate linear regression, extending the earlier bound for single-output linear regression by Shalaeva et al., and establish sufficient conditions for its convergence. We then show that LAEs, when evaluated under a relaxed mean squared error, can be interpreted as constrained multivariate linear regression models on bounded data, to which our bound adapts. Furthermore, we develop theoretical methods to improve the computational efficiency of optimizing the LAE bound, enabling its practical evaluation on large models and real-world datasets. Experimental results demonstrate that our bound is tight and correlates well with practical ranking metrics such as Recall@K and NDCG@K.

    https://arxiv.org/abs/2512.12905


    Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

    oai:arXiv.org:2512.12911v1

    arXiv:2512.12911v1 Announce Type: new Abstract: This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

    https://arxiv.org/abs/2512.12911
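
A minimal sketch of threshold-based truncation follows. It assumes the noise matrix has i.i.d. entries with known standard deviation `noise_std` and uses the classical asymptotic edge `noise_std * (sqrt(m) + sqrt(n))` for the largest singular value of an m-by-n noise matrix. This is one possible threshold, not necessarily either of the two estimation methods the abstract compares, and the function names are hypothetical. The second helper compares singular vectors by absolute cosine similarity, in the spirit of the evaluation metric described.

```python
import numpy as np


def truncate_by_noise_edge(weight, noise_std):
    """Drop singular values at or below the assumed noise edge."""
    m, n = weight.shape
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    threshold = noise_std * (np.sqrt(m) + np.sqrt(n))
    keep = s > threshold
    # Rebuild the matrix from the retained singular triplets only.
    low_rank = (u[:, keep] * s[keep]) @ vt[keep, :]
    return low_rank, int(keep.sum()), float(threshold)


def singular_vector_cosines(a, b, k):
    """Absolute cosine between the i-th left singular vectors of a and b, i < k."""
    ua = np.linalg.svd(a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(b, full_matrices=False)[0][:, :k]
    # Singular vectors have unit norm, so the dot product is the cosine;
    # abs() removes the arbitrary sign of each vector.
    return np.abs(np.sum(ua * ub, axis=0))


# Toy matrix with two large and two small singular values.
w = np.diag([10.0, 5.0, 0.1, 0.1])
approx, rank, edge = truncate_by_noise_edge(w, noise_std=1.0)
cosines = singular_vector_cosines(w, approx, k=rank)
```

In this toy case the edge is 4.0, so the two small singular values are removed and the leading singular vectors are unchanged.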


    Robust tests for parameter change in conditionally heteroscedastic time series models

    oai:arXiv.org:2512.12946v1

    arXiv:2512.12946v1 Announce Type: new Abstract: Structural changes and outliers often coexist, complicating statistical inference. This paper addresses the problem of testing for parameter changes in conditionally heteroscedastic time series models, particularly in the presence of outliers. To mitigate the impact of outliers, we introduce a two-step procedure comprising robust estimation and residual truncation. Based on this procedure, we propose a residual-based robust CUSUM test and its self-normalized counterpart. We derive the limiting null distributions of the proposed robust tests and establish their consistency. Simulation results demonstrate the strong robustness of the tests against outliers. To illustrate the practical application, we analyze Bitcoin data.

    https://arxiv.org/abs/2512.12946
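
The residual-truncation and CUSUM steps named in the abstract can be sketched as below. This is an illustration under stated assumptions, not the authors' test: the residuals are taken as given (the robust estimation step is omitted), truncation is a plain clip at a hypothetical cutoff `clip`, the statistic is the textbook CUSUM of squared residuals, and the scale uses the sample standard deviation, which suits roughly independent residuals.

```python
import numpy as np


def truncated_cusum(residuals, clip=3.0):
    """CUSUM statistic on squared residuals after clipping to [-clip, clip].

    T = max_k |S_k - (k/n) * S_n| / (sd * sqrt(n)), where S_k is the partial
    sum of the squared, truncated residuals and sd is their sample standard
    deviation. A constant input gives sd = 0 and is not handled here.
    """
    r = np.clip(np.asarray(residuals, dtype=float), -clip, clip)
    z = r ** 2
    n = z.size
    partial = np.cumsum(z)
    k = np.arange(1, n + 1)
    bridge = partial - (k / n) * partial[-1]
    scale = z.std(ddof=1) * np.sqrt(n)
    return float(np.max(np.abs(bridge)) / scale)


# Variance shifts halfway through the series.
stat = truncated_cusum([1, 1, 1, 1, 2, 2, 2, 2])
# Clipping makes an extreme residual count the same as one at the cutoff.
with_outlier = truncated_cusum([1, 1, 1, 1, 2, 2, 2, 100])
at_cutoff = truncated_cusum([1, 1, 1, 1, 2, 2, 2, 3])
```

For the standard CUSUM statistic with independent observations, the null limit is the supremum of a Brownian bridge, whose 5% critical value is about 1.358. Whether that limit carries over to a given robust variant depends on the conditions derived in the paper.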


    Asymptotic Inference for Constrained Regression

    oai:arXiv.org:2512.12953v1

    arXiv:2512.12953v1 Announce Type: new Abstract: We consider statistical inference in high-dimensional regression problems under affine constraints on the parameter space. The theoretical study of this is motivated by the study of genetic determinants of diseases, such as diabetes, using external information from mediating protein expression levels. Specifically, we develop rigorous methods for estimating genetic effects on diabetes-related continuous outcomes when these associations are constrained based on external information about genetic determinants of proteins, and genetic relationships between proteins and the outcome of interest. In this regard, we discuss multiple candidate estimators and study their theoretical properties, sharp large sample optimality, and numerical qualities under a high-dimensional proportional asymptotic framework.

    https://arxiv.org/abs/2512.12953


    A Bayesian approach to learning mixtures of nonparametric components

    oai:arXiv.org:2512.12988v1

    arXiv:2512.12988v1 Announce Type: new Abstract: Mixture models are widely used in modeling heterogeneous data populations. A standard approach to mixture modeling is to assume that the mixture component takes a parametric kernel form, while the flexibility of the model can be obtained by using a large or possibly unbounded number of such parametric kernels. In many applications, making parametric assumptions on the latent subpopulation distributions may be unrealistic, which motivates the need for nonparametric modeling of the mixture components themselves. In this paper we study finite mixtures with nonparametric mixture components, using a Bayesian nonparametric modeling approach. In particular, it is assumed that the data population is generated according to a finite mixture of latent component distributions, where each component is endowed with a Bayesian nonparametric prior such as the Dirichlet process mixture. We present conditions under which the individual mixture component's distributions can be identified, and establish posterior contraction behavior for the data population's density, as well as densities of the latent mixture components. We develop an efficient MCMC algorithm for posterior inference and demonstrate via simulation studies and real-world data illustrations that it is possible to efficiently learn complex distributions for the latent subpopulations. In theory, the posterior contraction rate of the component densities is nearly polynomial, which is a significant improvement over the logarithmic convergence rate of estimating mixing measures via deconvolution.

    https://arxiv.org/abs/2512.12988


    General OOD Detection via Model-aware and Subspace-aware Variable Priority

    oai:arXiv.org:2512.13003v1

    arXiv:2512.13003v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ meaningfully from its training distribution. While widely studied in classification, OOD detection for regression and survival analysis remains limited due to the absence of discrete labels and the challenge of quantifying predictive uncertainty. We introduce a framework for OOD detection that is simultaneously model aware and subspace aware, and that embeds variable prioritization directly into the detection step. The method uses the fitted predictor to construct localized neighborhoods around each test case that emphasize the features driving the model's learned relationship and downweight directions that are less relevant to prediction. It produces OOD scores without relying on global distance metrics or estimating the full feature density. The framework is applicable across outcome types, and in our implementation we use random forests, where the rule structure yields transparent neighborhoods and effective scoring. Experiments on synthetic and real data benchmarks designed to isolate functional shifts show consistent improvements over existing methods. We further demonstrate the approach in an esophageal cancer survival study, where distribution shifts related to lymphadenectomy identify patterns relevant to surgical guidelines.

    https://arxiv.org/abs/2512.13003


    Spectral Equivariance and Geometric Transport in Reproducing Kernel Hilbert Spaces: A Unified Framework for Orthogonal Polynomial and Kernel Estimation

    oai:arXiv.org:2512.13073v1

    arXiv:2512.13073v1 Announce Type: new Abstract: We develop a unified geometric framework for nonparametric estimation based on the notion of Twin Kernel Spaces, defined as orbits of a reproducing kernel under a group action. This structure induces a family of transported RKHS geometries in which classical orthogonal polynomial estimators, kernel estimators, and spectral smoothing methods arise as projections onto transported eigenfunction systems. Our main contribution is a Spectral Equivariance Theorem showing that the eigenfunctions of any transported kernel are obtained by unitary transport of the base eigenfunctions. As a consequence, orthogonal polynomial estimators are equivariant under geometric deformation, kernel estimators correspond to soft spectral filtering in a twin space, and minimax rates and bias--variance tradeoffs are invariant under transport. We provide examples based on Hermite and Legendre polynomials, affine and Gaussian groups, and illustrate the effectiveness of twin transport for adaptive and multimodal estimation. The framework reveals a deep connection between group actions, RKHS geometry, and spectral nonparametrics, offering a unified perspective that encompasses kernel smoothing, orthogonal series, splines, and multiscale methods.

    https://arxiv.org/abs/2512.13073


    Clinical transfusion-outcomes research: A practical guide

    oai:arXiv.org:2512.13155v1

    arXiv:2512.13155v1 Announce Type: new Abstract: Clinical transfusion-outcomes research faces unique methodological challenges compared with other areas of clinical research. These challenges arise because patients frequently receive multiple transfusions, each unit originates from a different donor, and the probability of receiving specific blood product characteristics is influenced by external, often uncontrollable, factors. These complexities complicate causal inference in observational studies of transfusion effectiveness and safety. This guide addresses key challenges in observational transfusion research, with a focus on time-varying exposure, time-varying confounding, and treatment-confounder feedback. Using the example of donor sex and pregnancy history in relation to recipient mortality, we illustrate the strengths and limitations of commonly used analytical approaches. We compare restriction-based analyses, time-varying Cox regression, and inverse probability weighted marginal structural models using a large observational dataset of male transfusion recipients. In the applied example, restriction and conventional time-varying approaches suggested an increased mortality risk associated with transfusion of red blood cells from ever-pregnant female donors compared with male-only donors (hazard ratio [HR] 1.22; 95% CI 1.05-1.42 and HR 1.21; 95% CI 1.04-1.41, respectively). In contrast, inverse probability of treatment and censoring weighted analyses, which account for treatment-confounder feedback, showed no evidence of an association (HR 1.01; 95% CI 0.85-1.20). These findings demonstrate how conventional methods can yield biased estimates when complex longitudinal structures are not adequately handled. We provide practical guidance on study design, target trial emulation, and the use of g-methods, including a reproducible tutorial and example dataset, to support valid causal inference in clinical transfusion research.

    https://arxiv.org/abs/2512.13155


    Convergence of covariance and spectral density estimates for high-dimensional functional time series

    oai:arXiv.org:2512.13310v1

    arXiv:2512.13310v1 Announce Type: new Abstract: Second-order characteristics including covariance and spectral density functions are fundamentally important for both statistical applications and theoretical analysis in functional time series. In the high-dimensional setting where the number of functional variables is large relative to the length of functional time series, non-asymptotic theory for covariance function estimation has been developed for Gaussian and sub-Gaussian functional linear processes. However, corresponding non-asymptotic results for high-dimensional non-Gaussian and nonlinear functional time series, as well as for spectral density function estimation, are largely unexplored. In this paper, we introduce novel functional dependence measures, based on which we establish systematic non-asymptotic concentration bounds for estimates of (auto)covariance and spectral density functions in high-dimensional and non-Gaussian settings. We then illustrate the usefulness of our convergence results through two applications to dynamic functional principal component analysis and sparse spectral density function estimation. To handle the practical scenario where curves are discretely observed with errors, we further develop convergence rates of the corresponding estimates obtained via a nonparametric smoothing method. Finally, extensive simulation studies are conducted to corroborate our theoretical findings.

    https://arxiv.org/abs/2512.13310


    Beyond Missing Data: Questionnaire Uncertainty Responses as Early Digital Biomarkers of Cognitive Decline and Neurodegenerative Diseases

    oai:arXiv.org:2512.13346v1

    arXiv:2512.13346v1 Announce Type: new Abstract: Identifying preclinical biomarkers of neurodegenerative diseases remains a major challenge in aging research. In this study, we demonstrate that frequent "Don't know/can't remember" (DK) responses, often treated as missing data in touchscreen questionnaires, serve as a novel digital behavioral biomarker of early cognitive vulnerability and neurodegenerative disease risk. Using data from 502,234 UK Biobank participants, we stratified individuals based on DK response frequency (0-1, 2-4, 5-7, >7) and observed a robust, dose-dependent association with an increased risk of Alzheimer's disease (HR = 1.64, 95% CI: 1.26-2.14) and vascular dementia (HR = 1.93, 95% CI: 1.37-2.72), independent of established risk factors. As DK response frequency increased, participants exhibited higher BMI, reduced physical activity, higher smoking rates, and a higher prevalence of chronic diseases, particularly hypertension, diabetes, and depression. Further analysis revealed a dose-dependent relationship between DK response frequency and the risk of Alzheimer's disease and vascular dementia, with high DK responders showing early neurodegenerative changes, marked by elevated levels of Abeta40, Abeta42, NFL, and pTau-181. Metabolomic analysis also revealed lipid metabolism abnormalities, which may mediate this relationship. Together, these findings reframe DK response patterns as clinically meaningful signals of multidimensional neurobiological alterations, offering a scalable, low-cost, non-invasive tool for early risk identification and prevention at the population level.

    https://arxiv.org/abs/2512.13346


    Data-driven inverse uncertainty quantification: application to the Chemical Vapor Deposition Reactor Modeling

    oai:arXiv.org:2512.13354v1

    arXiv:2512.13354v1 Announce Type: new Abstract: This study presents a Bayesian framework for (inverse) uncertainty quantification and parameter estimation in a two-step Chemical Vapor Deposition coating process using production data. We develop an XGBoost surrogate model that maps reactor setup parameters to coating thickness measurements, enabling efficient Bayesian analysis while reducing sampling costs. The methodology handles a mixture of data including continuous, discrete integer, binary, and encoded categorical variables. We establish parameter prior distributions through Bayesian Model Selection and perform Inverse Uncertainty Quantification via weighted Approximate Bayesian Computation with summary statistics, providing robust parameter credible intervals while filtering measurement noise across multiple reactor locations. Furthermore, we employ clustering methods guided by geometry embeddings to focus analysis within homogeneous production groups. This integrated approach provides a validated tool for improving industrial process control under uncertainty.

    https://arxiv.org/abs/2512.13354


    Automatic Quality Control for Agricultural Field Trials -- Detection of Nonstationarity in Grid-indexed Data

    oai:arXiv.org:2512.13383v1

    arXiv:2512.13383v1 Announce Type: new Abstract: A common assumption in the spatial analysis of agricultural field trials is stationarity. In practice, however, this assumption is often violated due to unaccounted field effects. For instance, in plant breeding field trials, this can lead to inaccurate estimates of plant performance. Based on such inaccurate estimates, breeders may be impeded in selecting the best performing plant varieties, slowing breeding progress. We propose a method to automatically verify the hypothesis of stationarity. The method is sensitive towards mean as well as variance-covariance nonstationarity. It is specifically developed for the two-dimensional grid-structure of field trials. The method relies on the hypothesis that we can detect nonstationarity by partitioning the field into areas, within which stationarity holds. We applied the method to a large number of simulated datasets and a real-data example. The method reliably points out which trials exhibit quality issues and gives an indication about the severity of nonstationarity. This information can significantly reduce the time spent on manual quality control and enhance its overall reliability. Furthermore, the output of the method can be used to improve the analysis of conducted trials as well as the experimental design of future trials.

    https://arxiv.org/abs/2512.13383


    A Metadata-Only Feature-Augmented Method Factor for Ex-Post Correction and Attribution of Common Method Variance

    oai:arXiv.org:2512.13446v1

    arXiv:2512.13446v1 Announce Type: new Abstract: Common Method Variance (CMV) is a recurring problem that reduces survey accuracy. Popular fixes such as the Harman single-factor test, correlated uniquenesses, common latent factor models, and marker variable approaches have well known flaws. These approaches either poorly identify issues, rely too heavily on researchers' choices, omit real information, or require special marker items that many datasets lack. This paper introduces a metadata-only Feature-Augmented Method Factor (FAMF-SEM): a single extra method factor with fixed, item-specific weights based on questionnaire details like reverse coding, page and item order, scale width, wording direction, and item length. These weights are set using ridge regression, based on residual correlations in a basic CFA, and remain fixed in the model. The method avoids the need for additional data or marker variables and provides CMV-adjusted results with clear links to survey design features. An AMOS/LISREL-friendly, no-code Excel workflow demonstrates the method. The paper explains the rationale, provides model details, outlines setup, presents step-by-step instructions, describes checks and reliability tests, and notes limitations.

    https://arxiv.org/abs/2512.13446


    Parsimonious Ultrametric Manly Mixture Models

    oai:arXiv.org:2512.13473v1

    arXiv:2512.13473v1 Announce Type: new Abstract: A family of parsimonious ultrametric mixture models with the Manly transformation is developed for clustering high-dimensional and asymmetric data. Advances in Gaussian mixture modeling sufficiently handle high-dimensional data but struggle with the common presence of skewness. While these advances reduce the number of free parameters, they often provide limited insight into the structure and interpretation of the clusters. To address this shortcoming, this research implements the extended ultrametric covariance structure and the Manly transformation resulting in the parsimonious ultrametric Manly mixture model family. The ultrametric covariance structure reduces the number of free parameters while identifying latent hierarchical relationships between and within groups of variables. This phenomenon allows the visualization of hierarchical relationships within individual clusters, improving cluster interpretability. Additionally, as with many classes of mixture models, model selection remains a fundamental challenge; a two-step model selection procedure is proposed herein. With simulation studies and real data analyses, we demonstrate improved model selection via the proposed two-step method, and the effective clustering performance for the proposed family.

    https://arxiv.org/abs/2512.13473


    Actively Learning Joint Contours of Multiple Computer Experiments

    oai:arXiv.org:2512.13530v1

    arXiv:2512.13530v1 Announce Type: new Abstract: Contour location -- the process of sequentially training a surrogate model to identify the design inputs that result in a pre-specified response value from a single computer experiment -- is a well-studied active learning problem. Here, we tackle a related but distinct problem: identifying the input configuration that returns pre-specified values of multiple independent computer experiments simultaneously. Motivated by computer experiments of the rotational torques acting upon a vehicle in flight, we aim to identify stable flight conditions which result in zero torque forces. We propose a "joint contour location" (jCL) scheme that strikes a strategic balance between exploring the multiple response surfaces while exploiting learning of the intersecting contours. We employ both shallow and deep Gaussian process surrogates, but our jCL procedure is applicable to any surrogate that can provide posterior predictive distributions. Our jCL designs significantly outperform existing (single response) CL strategies, enabling us to efficiently locate the joint contour of our motivating computer experiments.

    https://arxiv.org/abs/2512.13530


    A Nonparametric Statistics Approach to Feature Selection in Deep Neural Networks with Theoretical Guarantees

    oai:arXiv.org:2512.13565v1

    arXiv:2512.13565v1 Announce Type: new Abstract: This paper tackles the problem of feature selection in a highly challenging setting: $\mathbb{E}(y | \boldsymbol{x}) = G(\boldsymbol{x}_{\mathcal{S}_0})$, where $\mathcal{S}_0$ is the set of relevant features and $G$ is an unknown, potentially nonlinear function subject to mild smoothness conditions. Our approach begins with feature selection in deep neural networks, then generalizes the results to Hölder smooth functions by exploiting the strong approximation capabilities of neural networks. Unlike conventional optimization-based deep learning methods, we reformulate neural networks as index models and estimate $\mathcal{S}_0$ using the second-order Stein's formula. This gradient-descent-free strategy guarantees feature selection consistency with a sample size requirement of $n = \Omega(p^2)$, where $p$ is the feature dimension. To handle high-dimensional scenarios, we further introduce a screening-and-selection mechanism that achieves nonlinear selection consistency when $n = \Omega(s \log p)$, with $s$ representing the sparsity level. Additionally, we refit a neural network on the selected features for prediction and establish performance guarantees under a relaxed sparsity assumption. Extensive simulations and real-data analyses demonstrate the strong performance of our method even in the presence of complex feature interactions.

    https://arxiv.org/abs/2512.13565


    Machine learning to optimize precision in the analysis of randomized trials: A journey in pre-specified, yet data-adaptive learning

    oai:arXiv.org:2512.13610v1

    arXiv:2512.13610v1 Announce Type: new Abstract: Covariate adjustment is an approach to improve the precision of trial analyses by adjusting for baseline variables that are prognostic of the primary endpoint. Motivated by the SEARCH Universal HIV Test-and-Treat Trial (2013-2017), we tell our story of developing, evaluating, and implementing a machine learning-based approach for covariate adjustment. We provide the rationale for as well as the practical concerns with such an approach for estimating marginal effects. Using schematics, we illustrate our procedure: targeted machine learning estimation (TMLE) with Adaptive Pre-specification. Briefly, sample-splitting is used to data-adaptively select the combination of estimators of the outcome regression (i.e., the conditional expectation of the outcome given the trial arm and covariates) and known propensity score (i.e., the conditional probability of being randomized to the intervention given the covariates) that minimizes the cross-validated variance estimate and, thereby, maximizes empirical efficiency. We discuss our approach for evaluating finite sample performance with parametric and plasmode simulations, pre-specifying the Statistical Analysis Plan, and unblinding in real-time on video conference with our colleagues from around the world. We present the results from applying our approach in the primary, pre-specified analysis of 8 recently published trials (2022-2024). We conclude with practical recommendations and an invitation to implement our approach in the primary analysis of your next trial.

    https://arxiv.org/abs/2512.13610


    Empirical Bayes learning from selectively reported confidence intervals

    oai:arXiv.org:2512.13622v1

    arXiv:2512.13622v1 Announce Type: new Abstract: We develop a statistical framework for empirical Bayes learning from selectively reported confidence intervals, applied here to provide context for interpreting results published in MEDLINE abstracts. A collection of 326,060 z-scores from MEDLINE abstracts (2000-2018) provides context for interpreting individual studies; we formalize this as an empirical Bayes task complicated by selection bias. We address selection bias through a selective tilting approach that extends empirical Bayes confidence intervals to truncated sampling mechanisms. Sign information is unreliable (a positive z-score need not indicate benefit, and investigators may choose contrast directions post hoc), so we work with absolute z-scores and identify only the distribution of absolute signal-to-noise ratios (SNRs). Our framework provides coverage guarantees for functionals including posterior estimands describing idealized replications and the symmetrized posterior mean, which we justify decision-theoretically as optimal among sign-equivariant (odd) estimators and minimax among priors inducing the same absolute SNR distribution.

    https://arxiv.org/abs/2512.13622


    A comparative overview of win ratio and joint frailty models for recurrent event endpoints with applications in oncology and cardiology

    oai:arXiv.org:2512.13629v1

    arXiv:2512.13629v1 Announce Type: new Abstract: Composite endpoints that combine recurrent non-fatal events with a terminal event are increasingly used in randomized clinical trials, yet conventional time-to-first event analyses may obscure clinically relevant information. We compared two statistical frameworks tailored to such endpoints: the joint frailty model (JFM) and the last-event assisted recurrent-event win ratio (LWR). The JFM specifies proportional hazards for the recurrent and terminal events linked through a shared frailty, yielding covariate-adjusted, component-specific hazard ratios that account for informative recurrences and dependence with death. The LWR is a nonparametric, prioritized pairwise comparison that incorporates all observed events over follow-up and summarizes a population-level benefit of treatment while respecting a pre-specified hierarchy between death and recurrences. We first assessed the performance of the methods using simulations that varied both the gamma-frailty variance and the event rates. Next, we investigated these two frameworks using practical clinical applications, to assess the performance of the methods and to estimate the sample size required to achieve adequate power. These two approaches delivered complementary insights. The JFM provided component-specific estimates, while the LWR led to a summary measure of treatment effect with direction. Power was systematically improved with JFM, which thus appeared as the most reliable approach for inference and sample size estimation. Methodological extensions of the LWR to appropriately handle censoring and to formalize causal estimands remain a promising direction for future research.

    https://arxiv.org/abs/2512.13629
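
    The idea of a prioritized pairwise comparison mentioned in the abstract above can be illustrated with a generic win ratio. This is only a minimal sketch, not the paper's LWR: it assumes no censoring, omits the last-event tie-break, and uses a hypothetical patient encoding of (death time, number of recurrent events) with made-up toy values. Death is compared first; recurrences are compared only when death times tie.

```python
import math
from itertools import product

ALIVE = math.inf  # hypothetical marker: patient survived the whole follow-up


def compare(a, b):
    """Return 1 if patient a wins over b, -1 if a loses, 0 for a tie.

    Each patient is (death_time, n_recurrences). Priority 1: later death
    wins. Priority 2 (only if death times tie): fewer recurrences wins.
    """
    if a[0] != b[0]:
        return 1 if a[0] > b[0] else -1
    if a[1] != b[1]:
        return 1 if a[1] < b[1] else -1
    return 0


def win_ratio(treated, control):
    """Wins divided by losses over all treated-control pairs."""
    results = [compare(t, c) for t, c in product(treated, control)]
    wins, losses = results.count(1), results.count(-1)
    return math.inf if losses == 0 else wins / losses


# Toy data, invented purely for illustration.
treated = [(ALIVE, 0), (ALIVE, 2), (30.0, 1)]
control = [(ALIVE, 1), (20.0, 3), (12.0, 0)]
print(win_ratio(treated, control))
```

    A ratio above 1 means treated patients win more pairs than they lose. Handling censored follow-up requires restricting each comparison to the shared observation window, which this sketch deliberately leaves out.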


    Universality of high-dimensional scaling limits of stochastic gradient descent

    oai:arXiv.org:2512.13634v1

    arXiv:2512.13634v1 Announce Type: new Abstract: We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.

    https://arxiv.org/abs/2512.13634


    Active Inference with Reusable State-Dependent Value Profiles

    oai:arXiv.org:2512.11829v1

    arXiv:2512.11829v1 Announce Type: cross Abstract: Adaptive behavior in volatile environments requires agents to switch among value-control regimes across latent contexts, but maintaining separate preferences, policy biases, and action-confidence parameters for every situation is intractable. We introduce value profiles: a small set of reusable bundles of value-related parameters (outcome preferences, policy priors, and policy precision) assigned to hidden states in a generative model. As posterior beliefs over states evolve trial by trial, effective control parameters arise via belief-weighted mixing, enabling state-conditional strategy recruitment without requiring independent parameters for each context. We evaluate this framework in probabilistic reversal learning, comparing static-precision, entropy-coupled dynamic-precision, and profile-based models using cross-validated log-likelihood and information criteria. Model comparison favors the profile-based model over simpler alternatives (about 100-point AIC differences), and parameter-recovery analyses support structural identifiability even when context must be inferred from noisy observations. Model-based inference further suggests that adaptive control in this task is driven primarily by modulation of policy priors rather than policy precision, with gradual belief-dependent profile recruitment consistent with state-conditional (not purely uncertainty-driven) control. Overall, reusable value profiles provide a tractable computational account of belief-conditioned value control in volatile environments and yield testable signatures of belief-dependent control and behavioral flexibility.

    https://arxiv.org/abs/2512.11829
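
    The belief-weighted mixing described in the abstract above can be sketched as a posterior-weighted average of per-state parameter bundles. This is an illustration under assumptions, not the authors' implementation: the linear mixing rule, the two-state setup, the parameter names and all numeric values are hypothetical.

```python
import numpy as np

# Hypothetical value profiles: one row (or entry) per hidden state.
profiles = {
    "policy_precision": np.array([4.0, 1.0]),
    "policy_prior": np.array([[0.8, 0.2],   # state 0: prior over 2 policies
                              [0.3, 0.7]]),  # state 1
}


def mix_profiles(belief, profiles):
    """Average each profile parameter under the posterior over hidden states."""
    belief = np.asarray(belief, dtype=float)
    belief = belief / belief.sum()  # normalize defensively
    return {name: belief @ values for name, values in profiles.items()}


# Posterior belief after some trial: 75% state 0, 25% state 1.
effective = mix_profiles([0.75, 0.25], profiles)
print(effective["policy_precision"])
print(effective["policy_prior"])
```

    Because the belief sums to one and each row of the prior table is a distribution, the mixed policy prior is again a valid distribution. As the belief shifts from trial to trial, the effective parameters move smoothly between profiles without any per-context parameters.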


    Amortized Causal Discovery with Prior-Fitted Networks

    oai:arXiv.org:2512.11840v1

    arXiv:2512.11840v1 Announce Type: cross Abstract: In recent years, differentiable penalized likelihood methods have gained popularity, optimizing the causal structure by maximizing its likelihood with respect to the data. However, recent research has shown that errors in likelihood estimation, even on relatively large sample sizes, disallow the discovery of proper structures. We propose a new approach to amortized causal discovery that addresses the limitations of likelihood estimator accuracy. Our method leverages Prior-Fitted Networks (PFNs) to amortize data-dependent likelihood estimation, yielding more reliable scores for structure learning. Experiments on synthetic, simulated, and real-world datasets show significant gains in structure recovery compared to standard baselines. Furthermore, we demonstrate directly that PFNs provide more accurate likelihood estimates than conventional neural network-based approaches.

    https://arxiv.org/abs/2512.11840


    Exploring Topological Bias in Heterogeneous Graph Neural Networks

    oai:arXiv.org:2512.11846v1

    arXiv:2512.11846v1 Announce Type: cross Abstract: Graph Neural Networks (GNNs) are characterized by their capacity to process graph-structured data. However, due to the sparsity of labels under semi-supervised learning, they have been found to exhibit biased performance on specific nodes. This kind of bias has been validated to correlate with topological structure and is considered a bottleneck of GNNs' performance. Existing work focuses on the study of homogeneous GNNs and little attention has been given to topological bias in Heterogeneous Graph Neural Networks (HGNNs). In this work, firstly, in order to distinguish distinct meta relations, we apply meta-weighting to the adjacency matrix of a heterogeneous graph. Based on the modified adjacency matrix, we leverage PageRank along with the node label information to construct a projection. The constructed projection effectively maps nodes to values that strongly correlate with model performance when using datasets both with and without intra-type connections, which demonstrates the universal existence of topological bias in HGNNs. To handle this bias, we propose a debiasing structure based on the difference in the mapped values of nodes and use it along with the original graph structure for contrastive learning. Experiments on three public datasets verify the effectiveness of the proposed method in improving HGNNs' performance and debiasing.

    https://arxiv.org/abs/2512.11846


    Adaptive Path Integral Diffusion: AdaPID

    oai:arXiv.org:2512.11858v1

    arXiv:2512.11858v1 Announce Type: cross Abstract: Diffusion-based samplers -- Score Based Diffusions, Bridge Diffusions and Path Integral Diffusions -- match a target at terminal time, but the real leverage comes from choosing the schedule that governs the intermediate-time dynamics. We develop a path-wise schedule-selection framework for Harmonic PID with a time-varying stiffness, exploiting Piece-Wise-Constant (PWC) parametrizations and a simple hierarchical refinement. We introduce schedule-sensitive Quality-of-Sampling (QoS) diagnostics. Assuming a Gaussian-Mixture (GM) target, we retain a closed-form Green-function ratio and numerically stable, Neural-Network-free oracles for predicted-state maps and score. Experiments in 2D show that QoS-driven PWC schedules consistently improve early-exit fidelity, tail accuracy, conditioning of the dynamics, and speciation (label-selection) timing at fixed integration budgets.

    https://arxiv.org/abs/2512.11858


    Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion

    oai:arXiv.org:2512.11859v1

    arXiv:2512.11859v1 Announce Type: cross Abstract: We introduce Guided Harmonic Path-Integral Diffusion (GH-PID), a linearly-solvable framework for guided Stochastic Optimal Transport (SOT) with a hard terminal distribution and soft, application-driven path costs. A low-dimensional guidance protocol shapes the trajectory ensemble while preserving analytic structure: the forward and backward Kolmogorov equations remain linear, the optimal score admits an explicit Green-function ratio, and Gaussian-Mixture Model (GMM) terminal laws yield closed-form expressions. This enables stable sampling and differentiable protocol learning under exact terminal matching. We develop guidance-centric diagnostics -- path cost, centerline adherence, variance flow, and drift effort -- that make GH-PID an interpretable variational ansatz for empirical SOT. Three navigation scenarios are illustrated in 2D: (i) Case A: hand-crafted protocols revealing how geometry and stiffness shape lag, curvature effects, and mode evolution; (ii) Case B: single-task protocol learning, where a PWC centerline is optimized to minimize integrated cost; (iii) Case C: multi-expert fusion, in which a commander reconciles competing expert/teacher trajectories and terminal beliefs through an exact product-of-experts law and learns a consensus protocol. Across all settings, GH-PID generates geometry-aware, trust-aware trajectories that satisfy the prescribed terminal distribution while systematically reducing integrated cost.

    https://arxiv.org/abs/2512.11859


    The Art of Storytelling in Authoritarian Regimes: Crafting State Narratives on Chinese Social Media

    oai:arXiv.org:2512.11875v1

    arXiv:2512.11875v1 Announce Type: cross Abstract: This article examines how authoritarian regimes construct state narratives about politically consequential events. Building on the narrative policy framework and existing research on authoritarian propaganda, we propose two dimensions that shape narrative construction: legitimacy implications -- whether events enhance or threaten regime legitimacy, and citizen verification capacity -- the extent to which citizens can evaluate official narratives through alternative sources. Using quantitative narrative analysis of Chinese social media posts by government, state media, and celebrity accounts, we extract subject-verb-object (SVO) triplets to map dominant narrative structures across four major events. Our findings show that the legitimacy implications of an event shape the regime's storytelling efforts and the beliefs highlighted in the narratives, while citizens' verification capacity could balance the strategic choice between top-down manipulation and bottom-up responsiveness of state narratives. Together, the results reveal propaganda as a complex process of narrative construction adaptive to specific contexts, offering new insights into how dynamic storytelling sustains authoritarian resilience.

    https://arxiv.org/abs/2512.11875


    Contextual Peano Scan and Fast Image Segmentation Using Hidden and Evidential Markov Chains

    oai:arXiv.org:2512.11939v1

    arXiv:2512.11939v1 Announce Type: cross Abstract: Transforming bi-dimensional sets of image pixels into mono-dimensional sequences with a Peano scan (PS) is an established technique enabling the use of hidden Markov chains (HMCs) for unsupervised image segmentation. Related Bayesian segmentation methods can compete with hidden Markov fields (HMFs)-based ones and are much faster. PS has recently been extended to the contextual PS, and some initial experiments have shown the value of the associated HMC model, denoted as HMC-CPS, in image segmentation. Moreover, HMCs have been extended to hidden evidential Markov chains (HEMCs), which are capable of improving HMC-based Bayesian segmentation. In this study, we introduce a new HEMC-CPS model by simultaneously considering contextual PS and evidential HMC. We show its effectiveness for Bayesian maximum posterior mode (MPM) segmentation using synthetic and real images. Segmentation is performed in an unsupervised manner, with parameters being estimated using the stochastic expectation--maximization (SEM) method. The new HEMC-CPS model presents potential for the modeling and segmentation of more complex images, such as three-dimensional or multi-sensor multi-resolution images. Finally, the HMC-CPS and HEMC-CPS models are not limited to image segmentation and could be used for any kind of spatially correlated data.

    https://arxiv.org/abs/2512.11939


    Data-Driven Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations

    oai:arXiv.org:2512.11946v1

    arXiv:2512.11946v1 Announce Type: cross Abstract: Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol' indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.

    https://arxiv.org/abs/2512.11946
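
    The contrast between PDP-based and ICE-based sensitivity drawn in the abstract above can be seen on a toy pure-interaction model. This is a minimal sketch under assumptions: "importance" is taken here to be the standard deviation of a curve over the grid, which is a stand-in and not necessarily the paper's metric, and the model, sample size and grid are invented for illustration.

```python
import numpy as np


def ice_curves(model, X, feature, grid):
    """Return an (n_samples, n_grid) array; row i is the ICE curve of sample i."""
    curves = np.empty((X.shape[0], grid.size))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite one feature for every sample
        curves[:, j] = model(X_mod)
    return curves


def toy_model(X):
    # Pure interaction: the effect of x0 flips sign with x1.
    return X[:, 0] * X[:, 1]


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
X[:, 1] -= X[:, 1].mean()  # center x1 so the averaged curve is flat
grid = np.linspace(-1.0, 1.0, 21)

curves = ice_curves(toy_model, X, feature=0, grid=grid)
pdp = curves.mean(axis=0)  # the PDP is the average of the ICE curves

pdp_importance = pdp.std()                  # variation of the averaged curve
ice_importance = curves.std(axis=1).mean()  # mean variation of individual curves
ice_spread = curves.std(axis=1).std()       # spread of that variation across samples

print(pdp_importance, ice_importance, ice_spread)
```

    Averaging cancels the positive and negative slopes, so the PDP-based value is essentially zero while the ICE-based value is clearly positive. With this particular standard-deviation definition, the PDP value can never exceed the ICE value, since the standard deviation of an average is at most the average of the standard deviations.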


    Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

    oai:arXiv.org:2512.12046v1

    arXiv:2512.12046v1 Announce Type: cross Abstract: Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.

    https://arxiv.org/abs/2512.12046


    SigTime: Learning and Visually Explaining Time Series Signatures

    oai:arXiv.org:2512.12076v1

    arXiv:2512.12076v1 Announce Type: cross Abstract: Understanding and distinguishing temporal patterns in time series data is essential for scientific discovery and decision-making. For example, in biomedical research, uncovering meaningful patterns in physiological signals can improve diagnosis, risk assessment, and patient outcomes. However, existing methods for time series pattern discovery face major challenges, including high computational complexity, limited interpretability, and difficulty in capturing meaningful temporal structures. To address these gaps, we introduce a novel learning framework that jointly trains two Transformer models using complementary time series representations: shapelet-based representations to capture localized temporal structures and traditional feature engineering to encode statistical properties. The learned shapelets serve as interpretable signatures that differentiate time series across classification labels. Additionally, we develop a visual analytics system -- SigTime -- with coordinated views to facilitate exploration of time series signatures from multiple perspectives, helping users generate useful insights. We quantitatively evaluate our learning framework on eight publicly available datasets and one proprietary clinical dataset. Additionally, we demonstrate the effectiveness of our system through two usage scenarios conducted with domain experts: one involving public ECG data and the other focused on preterm labor analysis.

    https://arxiv.org/abs/2512.12076


    Estimation of a Dynamic Tobit Model with a Unit Root

    oai:arXiv.org:2512.12110v1

    arXiv:2512.12110v1 Announce Type: cross Abstract: This paper studies robust estimation in the dynamic Tobit model under local-to-unity (LUR) asymptotics. We show that both Gaussian maximum likelihood (ML) and censored least absolute deviations (CLAD) estimators are consistent, extending results from the stationary case where ordinary least squares (OLS) is inconsistent. The asymptotic distributions of MLE and CLAD are derived; for the short-run parameters they are shown to be Gaussian, yielding standard normal t-statistics. In contrast, although OLS remains consistent under LUR, its t-statistics are not standard normal. These results enable reliable model selection via sequential t-tests based on ML and CLAD, paralleling the linear autoregressive case. Applications to financial and epidemiological time series illustrate their practical relevance.

    https://arxiv.org/abs/2512.12110


    Neural CDEs as Correctors for Learned Time Series Models

    oai:arXiv.org:2512.12116v1

    arXiv:2512.12116v1 Announce Type: cross Abstract: Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

    https://arxiv.org/abs/2512.12116
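
    The additive correction step described in the abstract above (forecast plus predicted forecast error) can be sketched as follows. This is a toy illustration only: the persistence `predictor` and the mean-residual `fit_corrector` are hypothetical stand-ins chosen for brevity, not the neural controlled differential equation Corrector from the paper.

```python
from statistics import mean

def predictor(history, horizon):
    """Toy Predictor (assumption): repeat the last observed value."""
    return [history[-1]] * horizon

def fit_corrector(train_series, horizon):
    """Toy Corrector (assumption): average forecast error per horizon step,
    estimated from sliding windows over a training series."""
    errors = [[] for _ in range(horizon)]
    for t in range(1, len(train_series) - horizon + 1):
        forecast = predictor(train_series[:t], horizon)
        truth = train_series[t:t + horizon]
        for h in range(horizon):
            errors[h].append(truth[h] - forecast[h])
    step_error = [mean(e) for e in errors]
    # This stand-in ignores the history; a learned Corrector would use it.
    return lambda history: step_error

def corrected_forecast(history, horizon, corrector):
    """Add the Corrector's predicted errors to the Predictor's forecast."""
    forecast = predictor(history, horizon)
    predicted_error = corrector(history)
    return [f + e for f, e in zip(forecast, predicted_error)]

# A linear trend: persistence under-forecasts by 1, 2, 3 at steps 1, 2, 3.
corrector = fit_corrector(list(range(20)), horizon=3)
print(corrected_forecast(list(range(10)), 3, corrector))
```

    On this linear-trend example the uncorrected forecast lags the series, and the added error terms remove that lag.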


    Anticipatory Governance in Data-Constrained Environments: A Predictive Simulation Framework for Digital Financial Inclusion

    oai:arXiv.org:2512.12212v1

    arXiv:2512.12212v1 Announce Type: cross Abstract: Financial exclusion remains a major barrier to digital public service delivery in resource-constrained and archipelagic nations. Traditional policy evaluations rely on retrospective data, limiting the ex-ante intelligence needed for agile resource allocation. This study introduces a predictive simulation framework to support anticipatory governance within government information systems. Using the UNCDF Pacific Digital Economy dataset of 10,108 respondents, we apply a three-stage pipeline: descriptive profiling, interpretable machine learning, and scenario simulation to forecast outcomes of digital financial literacy interventions before deployment. Leveraging cross-sectional structural associations, the framework projects intervention scenarios as prioritization heuristics rather than causal estimates. A transparent linear regression model with R-squared of 95.9 identifies modifiable policy levers. Simulations indicate that foundational digital capabilities such as device access and expense tracking yield the highest projected gains, up to 5.5 percent, outperforming attitudinal nudges. The model enables precision targeting, highlighting young female caregivers as high-leverage responders while flagging non-responders such as urban professionals to prevent resource misallocation. This research demonstrates how static survey data can be repurposed into actionable policy intelligence, offering a scalable and evidence-based blueprint for embedding predictive analytics into public-sector decision-support systems to advance equity-focused digital governance.

    https://arxiv.org/abs/2512.12212


    Scalable branch-and-bound model selection with non-monotonic criteria including AIC, BIC and Mallows's $\mathit{C_p}$

    oai:arXiv.org:2512.12221v1

    arXiv:2512.12221v1 Announce Type: cross Abstract: Model selection is a pivotal process in the quantitative sciences, where researchers must navigate between numerous candidate models of varying complexity. Traditional information criteria, such as the corrected Akaike Information Criterion (AICc), Bayesian Information Criterion (BIC), and Mallows's $\mathit{C_p}$, are valuable tools for identifying optimal models. However, the exponential increase in candidate models with each additional model parameter renders the evaluation of these criteria for all models -- a strategy known as exhaustive, or brute-force, searches -- computationally prohibitive. Consequently, heuristic approaches like stepwise regression are commonly employed, albeit without guarantees of finding the globally-optimal model. In this study, we challenge the prevailing notion that non-monotonicity in information criteria precludes bounds on the search space. We introduce a simple but novel bound that enables the development of branch-and-bound algorithms tailored for these non-monotonic functions. We demonstrate that our approach guarantees identification of the optimal model(s) across diverse model classes, sizes, and applications, often with orders of magnitude computational speedups. For instance, in one previously-published model selection task involving $2^{32}$ (approximately 4 billion) candidate models, our method achieves a computational speedup exceeding 6,000. These findings have broad implications for the scalability and effectiveness of model selection in complex scientific domains.

    https://arxiv.org/abs/2512.12221
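
    For reference, the criteria named in the abstract above and the size of an exhaustive search can be written out as below. The formulas are the standard textbook definitions (with `loglik` the maximized log-likelihood, `k` the number of fitted parameters, and `n` the sample size); the function names are illustrative, and the paper's bound itself is not reproduced here.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def aicc(loglik, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(loglik, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * loglik

def exhaustive_model_count(p):
    """Number of candidate models when each of p terms is either in or out."""
    return 2 ** p

print(aic(-10.0, 3), aicc(-10.0, 3, 100), bic(-10.0, 3, 100))
print(exhaustive_model_count(32))
```

    The penalty terms grow with `k` while `-2 ln L` shrinks as terms are added, which is why these criteria are not monotonic in model size.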


    Balancing Accuracy and Speed: A Multi-Fidelity Ensemble Kalman Filter with a Machine Learning Surrogate Model

    oai:arXiv.org:2512.12276v1

    arXiv:2512.12276v1 Announce Type: cross Abstract: Currently, more and more machine learning (ML) surrogates are being developed for computationally expensive physical models. In this work we investigate the use of a Multi-Fidelity Ensemble Kalman Filter (MF-EnKF) in which the low-fidelity model is such a machine learning surrogate model, instead of a traditional low-resolution or reduced-order model. The idea behind this is to use an ensemble of a few expensive full model runs, together with an ensemble of many cheap but less accurate ML model runs. In this way we hope to reach increased accuracy within the same computational budget. We investigate the performance by testing the approach on two common test problems, namely the Lorenz-2005 model and the Quasi-Geostrophic model. By keeping the original physical model in place, we obtain a higher accuracy than when we completely replace it by the ML model. Furthermore, the MF-EnKF reaches improved accuracy within the same computational budget. The ML surrogate has similar or improved accuracy compared to the low-resolution one, but it can provide a larger speed-up. Our method contributes to increasing the effective ensemble size in the EnKF, which improves the estimation of the initial condition and hence accuracy of the predictions in fields such as meteorology and oceanography.

    https://arxiv.org/abs/2512.12276


    Precise Deviations for the Ewens-Pitman Model

    oai:arXiv.org:2512.12323v1

    arXiv:2512.12323v1 Announce Type: cross Abstract: In this paper, we derive an integral representation for the distribution of the number of types $K_n$ in the Ewens-Pitman model. Based on this representation, we also establish precise large deviations and precise moderate deviations for $K_n$. After careful examination, we find that the rate function exhibits a second-order phase transition and the critical point is $\alpha=\frac{1}{2}$.

    https://arxiv.org/abs/2512.12323


    Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data

    oai:arXiv.org:2512.12325v1

    arXiv:2512.12325v1 Announce Type: cross Abstract: We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural ``Ville event'' $E_\alpha$, the regret up to time $T$ is bounded by $\ln^2(1/\alpha)/V_T + \ln (1/\alpha) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/\alpha) + \ln \ln V_T$ if $V_T \geq \ln(1/\alpha)$.) If the data are stochastic, one can show that $E_\alpha$ has probability at least $1-\alpha$ under a wide class of distributions (e.g., sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event $E_0$ of probability one, the regret on every path in $E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data) with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.

    https://arxiv.org/abs/2512.12325
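
    The shape of the bound quoted in the abstract above can be checked numerically. The sketch below sets every universal constant to 1, which is an assumption made purely for illustration (the abstract states the bound only up to constants), and the function names are hypothetical.

```python
import math

def regret_bound_terms(alpha, v_t):
    """Return the three terms ln^2(1/alpha)/V_T, ln(1/alpha), ln ln V_T.
    Constants are taken to be 1 (assumption); requires V_T > 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if v_t <= 1.0:
        raise ValueError("V_T must exceed 1 so that ln ln V_T is defined")
    log_inv = math.log(1.0 / alpha)
    return log_inv ** 2 / v_t, log_inv, math.log(math.log(v_t))

def regret_bound(alpha, v_t):
    """Sum of the three terms."""
    return sum(regret_bound_terms(alpha, v_t))

first, second, third = regret_bound_terms(0.05, 100.0)
print(first, second, third, regret_bound(0.05, 100.0))
```

    Whenever `v_t >= ln(1/alpha)`, the first term is at most the second, which is the reduction to $\ln(1/\alpha) + \ln \ln V_T$ noted in the abstract.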


    Uncertainty Quantification for Machine Learning: One Size Does Not Fit All

    oai:arXiv.org:2512.12341v1

    arXiv:2512.12341v1 Announce Type: cross Abstract: Proper quantification of predictive uncertainty is essential for the use of machine learning in safety-critical applications. Various uncertainty measures have been proposed for this purpose, typically claiming superiority over other measures. In this paper, we argue that there is no single best measure. Instead, uncertainty quantification should be tailored to the specific application. To this end, we use a flexible family of uncertainty measures that distinguishes between total, aleatoric, and epistemic uncertainty of second-order distributions. These measures can be instantiated with specific loss functions, so-called proper scoring rules, to control their characteristics, and we show that different characteristics are useful for different tasks. In particular, we show that, for the task of selective prediction, the scoring rule should ideally match the task loss. On the other hand, for out-of-distribution detection, our results confirm that mutual information, a widely used measure of epistemic uncertainty, performs best. Furthermore, in an active learning setting, epistemic uncertainty based on zero-one loss is shown to consistently outperform other uncertainty measures.

    https://arxiv.org/abs/2512.12341


    Optimized Architectures for Kolmogorov-Arnold Networks

    oai:arXiv.org:2512.12448v1

    arXiv:2512.12448v1 Announce Type: cross Abstract: Efforts to improve Kolmogorov-Arnold networks (KANs) with architectural enhancements have been stymied by the complexity those enhancements bring, undermining the interpretability that makes KANs attractive in the first place. Here we study overprovisioned architectures combined with sparsification to learn compact, interpretable KANs without sacrificing accuracy. Crucially, we focus on differentiable sparsification, turning architecture search into an end-to-end optimization problem. Across function approximation benchmarks, dynamical systems forecasting, and real-world prediction tasks, we demonstrate competitive or superior accuracy while discovering substantially smaller models. Overprovisioning and sparsification are synergistic, with the combination outperforming either alone. The result is a principled path toward models that are both more expressive and more interpretable, addressing a key tension in scientific machine learning.

    https://arxiv.org/abs/2512.12448


    Optimal Mistake Bounds for Transductive Online Learning

    oai:arXiv.org:2512.12567v1

    arXiv:2512.12567v1 Announce Type: cross Abstract: We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. In the standard setting, the optimal mistake bound is characterized by the Littlestone dimension $d$ of the concept class $H$ (Littlestone 1987). We prove that in the transductive setting, the mistake bound is at least $\Omega(\sqrt{d})$. This constitutes an exponential improvement over previous lower bounds of $\Omega(\log\log d)$, $\Omega(\sqrt{\log d})$, and $\Omega(\log d)$, due respectively to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that this lower bound is tight: for every $d$, there exists a class of Littlestone dimension $d$ with transductive mistake bound $O(\sqrt{d})$. Our upper bound also improves upon the best known upper bound of $(2/3)d$ from Ben-David, Kushilevitz, and Mansour (1997). These results establish a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advance access to the unlabeled instance sequence. This contrasts with the PAC setting, where transductive and standard learning exhibit similar sample complexities.

    https://arxiv.org/abs/2512.12567


    On the Accuracy of Newton Step and Influence Function Data Attributions

    oai:arXiv.org:2512.12572v1

    arXiv:2512.12572v1 Announce Type: cross Abstract: Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy. Even in the relatively simple case of linear regressions, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ($d$) and the number of samples removed ($k$). As a result, these analyses are not tight enough to answer fundamental questions such as "what is the asymptotic scaling of the errors of each method?" or "which of these methods is more accurate for a given dataset?" In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regression, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors in the average-case sample removals. \[ \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T - \hat{\theta}_T^{\mathrm{NS}}\|_2 \bigr] = \widetilde{\Theta}\!\left(\frac{k d}{n^2}\right), \qquad \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T^{\mathrm{NS}} - \hat{\theta}_T^{\mathrm{IF}}\|_2 \bigr] = \widetilde{\Theta}\!\left( \frac{(k + d)\sqrt{k d}}{n^2} \right). \]

    https://arxiv.org/abs/2512.12572
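
    As a small worked example of the two scaling laws in the abstract above, the sketch below evaluates both rates with constants and poly-logarithmic factors dropped (an assumption, since the results are stated in $\widetilde{\Theta}$ notation). The function names are illustrative.

```python
import math

def ns_error_rate(k, d, n):
    """Rate k d / n^2 for the Newton Step error (constants dropped)."""
    return k * d / n ** 2

def ns_if_gap_rate(k, d, n):
    """Rate (k + d) sqrt(k d) / n^2 for the NS vs IF gap (constants dropped)."""
    return (k + d) * math.sqrt(k * d) / n ** 2

k, d, n = 4, 16, 1000
ratio = ns_if_gap_rate(k, d, n) / ns_error_rate(k, d, n)
print(ns_error_rate(k, d, n), ns_if_gap_rate(k, d, n), ratio)
```

    The ratio of the two rates simplifies to $(k + d)/\sqrt{k d}$, which is at least 2 by the AM-GM inequality and grows when $k$ and $d$ are very different in size.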


    Continuous Treatment Effects with Spatial and Network Spillovers

    oai:arXiv.org:2512.12653v1

    arXiv:2512.12653v1 Announce Type: cross Abstract: This paper develops a continuous functional framework for treatment effects that propagate through geographic space and economic networks. We derive a master equation governing propagation from three economic foundations -- heterogeneous agent aggregation, market equilibrium, and cost minimization -- establishing that the framework rests on fundamental principles rather than ad hoc specifications. A key result shows that the spatial-network interaction coefficient equals the mutual information between geographic and market coordinates. The Feynman-Kac representation decomposes effects into inherited and accumulated components along stochastic paths representing economic linkages. The framework nests the no-spillover case as a testable restriction. Monte Carlo simulations demonstrate that conventional estimators -- two-way fixed effects, difference-in-differences, and generalized propensity score -- exhibit 25-38% bias and severe undercoverage when spillovers exist, while our estimator maintains correct inference regardless of whether spillovers are present. Applying the framework to U.S. minimum wage policy, we reject the no-spillover null and find total effects at state borders four times larger than direct effects -- conventional methods capture only one-quarter of policy impact. Structural estimates reveal spatial diffusion consistent with commuting-distance labor mobility, network diffusion consistent with quarterly supply chain adjustment, and significant spatial-network interaction reflecting geographic clustering of industries. Entropy-based fragility diagnostics outperform standard centrality measures by 56-76% in predicting labor market disruptions, identifying all high-risk state-industry pairs during 2020-2021 with six-month advance warning.

    https://arxiv.org/abs/2512.12653


    Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset

    oai:arXiv.org:2512.12783v1

    arXiv:2512.12783v1 Announce Type: cross Abstract: Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated, we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first-quarter 2025 TÜİK census marginals and telecom usage patterns. Retrieval-augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio-demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data, CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio-demographic variables; Full models include both socio-demographic and alternative attributes. Across five-fold stratified validation, the alternative block raises area under the curve by about 1.3 percentage points and lifts balanced \(F_{1}\) from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau-level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.

    https://arxiv.org/abs/2512.12783


    TRACER: Transfer Learning based Real-time Adaptation for Clinical Evolving Risk

    oai:arXiv.org:2512.12795v1

    arXiv:2512.12795v1 Announce Type: cross Abstract: Clinical decision support tools built on electronic health records often experience performance drift due to temporal population shifts, particularly when changes in the clinical environment initially affect only a subset of patients, resulting in a transition to mixed populations. Such case-mix changes commonly arise following system-level operational updates or the emergence of new diseases, such as COVID-19. We propose TRACER (Transfer Learning-based Real-time Adaptation for Clinical Evolving Risk), a framework that identifies encounter-level transition membership and adapts predictive models using transfer learning without full retraining. In simulation studies, TRACER outperformed static models trained on historical or contemporary data. In a real-world application predicting hospital admission following emergency department visits across the COVID-19 transition, TRACER improved both discrimination and calibration. TRACER provides a scalable approach for maintaining robust predictive performance under evolving and heterogeneous clinical conditions.

    https://arxiv.org/abs/2512.12795


    Diffusion Model-Based Posterior Sampling in Full Waveform Inversion

    oai:arXiv.org:2512.12797v1

    arXiv:2512.12797v1 Announce Type: cross Abstract: Bayesian full waveform inversion (FWI) offers uncertainty-aware subsurface models; however, posterior sampling directly on observed seismic shot records is rarely practical at the field scale because each sample requires numerous wave-equation solves. We aim to make such sampling feasible for large surveys while preserving calibration, that is, high uncertainty in less illuminated areas. Our approach couples diffusion-based posterior sampling with simultaneous-source FWI data. At each diffusion noise level, a network predicts a clean velocity model. We then apply a stochastic refinement step in model space using Langevin dynamics under the wave-equation likelihood and reintroduce noise to decouple successive levels before proceeding. Simultaneous-source batches reduce forward and adjoint solves approximately in proportion to the supergather size, while an unconditional diffusion prior trained on velocity patches and volumes helps suppress source-related numerical artefacts. We evaluate the method on three 2D synthetic datasets (SEG/EAGE Overthrust, SEG/EAGE Salt, SEAM Arid), a 2D field line, and a 3D upscaling study. Relative to a particle-based variational baseline, namely Stein variational gradient descent without a learned prior and with single-source (non-simultaneous-source) FWI, our sampler achieves lower model error and better data fit at a substantially reduced computational cost. By aligning encoded-shot likelihoods with diffusion-based sampling and exploiting straightforward parallelization over samples and source batches, the method provides a practical path to calibrated posterior inference on observed shot records that scales to large 2D and 3D problems.

    https://arxiv.org/abs/2512.12797


    CapOptix: An Options-Framework for Capacity Market Pricing

    oai:arXiv.org:2512.12871v1

    arXiv:2512.12871v1 Announce Type: cross Abstract: Electricity markets are under increasing pressure to maintain reliability amidst rising renewable penetration, demand variability, and occasional price shocks. Traditional capacity market designs often fall short in addressing this by relying on expected-value metrics of energy unserved, which overlook risk exposure in such systems. In this work, we present CapOptix, a capacity pricing framework that interprets capacity commitments as reliability options, i.e., financial derivatives of wholesale electricity prices. CapOptix characterizes the capacity premia charged by accounting for structural price shifts modeled by the Markov Regime Switching Process. We apply the framework to historical price data from multiple electricity markets and compare the resulting premium ranges with existing capacity remuneration mechanisms.

    https://arxiv.org/abs/2512.12871


    Unsupervised learning of multiscale switching dynamical system models from multimodal neural data

    oai:arXiv.org:2512.12881v1

    arXiv:2512.12881v1 Announce Type: cross Abstract: Neural population activity often exhibits regime-dependent non-stationarity in the form of switching dynamics. Learning accurate switching dynamical system models can reveal how behavior is encoded in neural activity. Existing switching approaches have primarily focused on learning models from a single neural modality, either continuous Gaussian signals or discrete Poisson signals. However, multiple neural modalities are often recorded simultaneously to measure different spatiotemporal scales of brain activity, and all these modalities can encode behavior. Moreover, regime labels are typically unavailable in training data, posing a significant challenge for learning models of regime-dependent switching dynamics. To address these challenges, we develop a novel unsupervised learning algorithm that learns the parameters of switching multiscale dynamical system models using only multiscale neural observations. We demonstrate our method using both simulations and two distinct experimental datasets with multimodal spike-LFP observations during different motor tasks. We find that our switching multiscale dynamical system models more accurately decode behavior than switching single-scale dynamical models, showing the success of multiscale neural fusion. Further, our models outperform stationary multiscale models, illustrating the importance of tracking regime-dependent non-stationarity in multimodal neural data. The developed unsupervised learning framework enables more accurate modeling of complex multiscale neural dynamics by leveraging information in multimodal recordings while incorporating regime switches. This approach holds promise for improving the performance and robustness of brain-computer interfaces over time and for advancing our understanding of the neural basis of behavior.

    https://arxiv.org/abs/2512.12881


    Interpretable Hypothesis-Driven Trading: A Rigorous Walk-Forward Validation Framework for Market Microstructure Signals

    oai:arXiv.org:2512.12924v1

    arXiv:2512.12924v1 Announce Type: cross Abstract: We develop a rigorous walk-forward validation framework for algorithmic trading designed to mitigate overfitting and lookahead bias. Our methodology combines interpretable hypothesis-driven signal generation with reinforcement learning and strict out-of-sample testing. The framework enforces strict information set discipline, employs rolling window validation across 34 independent test periods, maintains complete interpretability through natural language hypothesis explanations, and incorporates realistic transaction costs and position constraints. Validating five market microstructure patterns across 100 US equities from 2015 to 2024, the system yields modest annualized returns (0.55%, Sharpe ratio 0.33) with exceptional downside protection (maximum drawdown -2.76%) and market-neutral characteristics (beta = 0.058). Performance exhibits strong regime dependence, generating positive returns during high-volatility periods (0.60% quarterly, 2020-2024) while underperforming in stable markets (-0.16%, 2015-2019). We report statistically insignificant aggregate results (p-value 0.34) to demonstrate a reproducible, honest validation protocol that prioritizes interpretability and extends naturally to advanced hypothesis generators, including large language models. The key empirical finding reveals that daily OHLCV-based microstructure signals require elevated information arrival and trading activity to function effectively. The framework provides complete mathematical specifications and open-source implementation, establishing a template for rigorous trading system evaluation that addresses the reproducibility crisis in quantitative finance research. For researchers, practitioners, and regulators, this work demonstrates that interpretable algorithmic trading strategies can be rigorously validated without sacrificing transparency or regulatory compliance.

    https://arxiv.org/abs/2512.12924


    Asymptotic Normality of Subgraph Counts in Sparse Inhomogeneous Random Graphs

    oai:arXiv.org:2512.12937v1

    arXiv:2512.12937v1 Announce Type: cross Abstract: In this paper, we derive the asymptotic distribution of the number of copies of a fixed graph $H$ in a random graph $G_n$ sampled from a sparse graphon model. Specifically, we provide a refined analysis that separates the contributions of edge randomness and vertex-label randomness, allowing us to identify distinct sparsity regimes in which each component dominates or both contribute jointly to the fluctuations. As a result, we establish asymptotic normality for the count of any fixed graph $H$ in $G_n$ across the entire range of sparsity (above the containment threshold for $H$ in $G_n$). These results provide a complete description of subgraph count fluctuations in sparse inhomogeneous networks, closing several gaps in the existing literature that were limited to specific motifs or suboptimal sparsity assumptions.

    https://arxiv.org/abs/2512.12937


    Understanding When Graph Convolutional Networks Help: A Diagnostic Study on Label Scarcity and Structural Properties

    oai:arXiv.org:2512.12947v1

    arXiv:2512.12947v1 Announce Type: cross Abstract: Graph Convolutional Networks (GCNs) have become a standard approach for semi-supervised node classification, yet practitioners lack clear guidance on when GCNs provide meaningful improvements over simpler baselines. We present a diagnostic study using the Amazon Computers co-purchase data to understand when and why GCNs help. Through systematic experiments with simulated label scarcity, feature ablation, and per-class analysis, we find that GCN performance depends critically on the interaction between graph homophily and feature quality. GCNs provide the largest gains under extreme label scarcity, where they leverage neighborhood structure to compensate for limited supervision. Surprisingly, GCNs can match their original performance even when node features are replaced with random noise, suggesting that structure alone carries sufficient signal on highly homophilous graphs. However, GCNs hurt performance when homophily is low and features are already strong, as noisy neighbors corrupt good predictions. Our quadrant analysis reveals that GCNs help in three of four conditions and only hurt when low homophily meets strong features. These findings offer practical guidance for practitioners deciding whether to adopt graph-based methods.

    https://arxiv.org/abs/2512.12947


    Cycles Communities from the Perspective of Dendrograms and Gradient Sampling

    oai:arXiv.org:2512.12974v1

    arXiv:2512.12974v1 Announce Type: cross Abstract: Identifying and comparing topological features, particularly cycles, across different topological objects remains a fundamental challenge in persistent homology and topological data analysis. This work introduces a novel framework for constructing cycle communities through two complementary approaches. First, a dendrogram-based methodology leverages merge-tree algorithms to construct hierarchical representations of homology classes from persistence intervals. The Wasserstein distance on merge trees is introduced as a metric for comparing dendrograms, establishing connections to hierarchical clustering frameworks. Through simulation studies, the discriminative power of dendrogram representations for identifying cycle communities is demonstrated. Second, an extension of Stratified Gradient Sampling simultaneously learns multiple filter functions that yield cycle barycenter functions capable of faithfully reconstructing distinct sets of cycles. The set of cycles each filter function can reconstruct constitutes cycle communities that are non-overlapping and partition the space of all cycles. Together, these approaches transform the problem of cycle matching into both a hierarchical clustering and topological optimization framework, providing principled methods to identify similar topological structures both within and across groups of topological objects.

    https://arxiv.org/abs/2512.12974


    CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks

    oai:arXiv.org:2512.12981v1

    arXiv:2512.12981v1 Announce Type: cross Abstract: While joint pruning-quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning-quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.

    https://arxiv.org/abs/2512.12981
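
The observation in the CoDeQ abstract above, that a quantizer's dead-zone acts as magnitude pruning, can be illustrated with a minimal sketch. This is not the paper's method: the function name `dead_zone_quantize` and the parameters `step`, `dead_zone`, and `num_levels` are hypothetical, and the dead-zone width is fixed here rather than learned by backpropagation.

```python
import math


def dead_zone_quantize(weights, step, dead_zone, num_levels):
    """Uniform symmetric quantizer with an explicit dead-zone (illustrative only).

    Weights with |w| <= dead_zone are mapped to exactly 0, which has the same
    effect as magnitude pruning. Remaining weights are rounded to the nearest
    multiple of `step`, clamped to the range [1, num_levels] levels.
    """
    quantized = []
    for w in weights:
        magnitude = abs(w)
        if magnitude <= dead_zone:
            quantized.append(0.0)  # inside the dead-zone: pruned
            continue
        # Round half up, then clamp so surviving weights never collapse to 0.
        level = math.floor(magnitude / step + 0.5)
        level = min(max(level, 1), num_levels)
        quantized.append(math.copysign(level * step, w))
    return quantized


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


example = dead_zone_quantize([0.03, -0.26, 0.9, -0.04],
                             step=0.25, dead_zone=0.05, num_levels=3)
```

In this sketch, widening `dead_zone` raises sparsity without touching `step` or `num_levels`, which mirrors the decoupling of sparsity from bit-width that the abstract describes.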


    The EEPAS Model Revisited: Statistical Formalism and a High-Performance, Reproducible Open-Source Framework

    oai:arXiv.org:2512.13064v1

    arXiv:2512.13064v1 Announce Type: cross Abstract: While short-term models such as the Short-Term Earthquake Probability (STEP) and Epidemic-Type Aftershock Sequence (ETAS) are well established and supported by open-source software, medium- to long-term models, notably the Every Earthquake a Precursor According to Scale (EEPAS) and Proximity to Past Earthquakes (PPE), remain under-documented and largely inaccessible. Despite outperforming time-invariant models in regional studies, their mathematical foundations are often insufficiently formalized. This study addresses these gaps by formally deriving the EEPAS and PPE models within the framework of inhomogeneous Poisson point processes and clarifying the connection between empirical $\Psi$-scaling regressions and likelihood-based inference. We introduce a fully automated, open-source Python implementation of EEPAS that combines analytical modeling with Numba JIT acceleration, NumPy vectorization, and joblib parallelization, all configured via modular JSON files for usability and reproducibility. Integration with pyCSEP enables standardized evaluation and comparison. When applied to the Italy HORUS dataset, our system reproduces published results within one hour using identical initialization settings. It also provides a comprehensive pipeline from raw catalog to parameter estimation, achieving improved log-likelihoods and passing strict consistency tests without manual $\Psi$ identification. We position our framework as part of a growing open-source ecosystem for seismological research that spans the full workflow from data acquisition to forecast evaluation. Our framework fills a key gap in this ecosystem by providing robust tools for medium- to long-term statistical modeling of earthquake catalogs, which is an essential but underserved component in probabilistic seismic forecasting.

    https://arxiv.org/abs/2512.13064


    Group-averaged Markov chains II: tuning of group action in finite state space

    oai:arXiv.org:2512.13067v1

    arXiv:2512.13067v1 Announce Type: cross Abstract: We study group-averaged Markov chains obtained by augmenting a $\pi$-stationary transition kernel $P$ with a group action on the state space via orbit kernels. Given a group $\mathcal{G}$ with orbits $(\mathcal{O}_i)_{i=1}^k$, we analyse three canonical orbit kernels: namely the Gibbs $(G)$, Metropolis-Hastings $(M)$, and Barker $(B)$ kernels, as well as their multiplicative sandwiches $QPQ$ and the additive mixtures $\frac{1}{2}(P+Q)$ where $Q\in\{G,M,B\}$. We show that $M^t, B^t \to G$ blockwise as $t \to \infty$ under suitable conditions, that the projection chains induced by $(\mathcal{O}_i)_{i=1}^k$ coincide for $GPG$ and $P$, and that orbit averaging never deteriorates the absolute spectral gap or asymptotic variance when $P$ is reversible. We give a direct and simple proof of Pythagorean identity under the Kullback-Leibler (KL) divergence, showing that $GPG$ arises naturally as an information projection of $P$ onto the set of $G$-invariant transition matrices. For a given $P$, we characterise the optimal choice of $G$ with a fixed number of orbits that minimises the one-step KL divergence to stationarity. Analogously, for a given $G$, we characterise the optimal choice of $P$ and give sufficient conditions under which $GPG = \Pi$. We further show that alternating projections over multiple group actions converge at a rate governed by the singular values of an overlap matrix, and that in structured cases, this yields exact sampling where the number of group actions grows logarithmically with the size of the state space. Based on the theory, we propose two heuristics to tune $G$ in practice. We also illustrate the results on discrete uniform and multimodal examples, including the Curie-Weiss model where $GPG$ achieves polynomial (in inverse temperature and dimension) mixing while Glauber dynamics remains exponentially slow.

    https://arxiv.org/abs/2512.13067


    Multi-fidelity aerodynamic data fusion by autoencoder transfer learning

    oai:arXiv.org:2512.13069v1

    arXiv:2512.13069v1 Announce Type: cross Abstract: Accurate aerodynamic prediction often relies on high-fidelity simulations; however, their prohibitive computational costs severely limit their applicability in data-driven modeling. This limitation motivates the development of multi-fidelity strategies that leverage inexpensive low-fidelity information without compromising accuracy. Addressing this challenge, this work presents a multi-fidelity deep learning framework that combines autoencoder-based transfer learning with a newly developed Multi-Split Conformal Prediction (MSCP) strategy to achieve uncertainty-aware aerodynamic data fusion under extreme data scarcity. The methodology leverages abundant Low-Fidelity (LF) data to learn a compact latent physics representation, which acts as a frozen knowledge base for a decoder that is subsequently fine-tuned using scarce High-Fidelity (HF) samples. Tested on surface-pressure distribution databases for NACA airfoils (2D) and a transonic wing (3D), the model successfully corrects LF deviations and achieves high-accuracy pressure predictions using minimal HF training data. Furthermore, the MSCP framework produces robust, actionable uncertainty bands with pointwise coverage exceeding 95%. By combining extreme data efficiency with uncertainty quantification, this work offers a scalable and reliable solution for aerodynamic regression in data-scarce environments.

    https://arxiv.org/abs/2512.13069


    Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

    oai:arXiv.org:2512.13123v1

    arXiv:2512.13123v1 Announce Type: cross Abstract: We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$ and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

    https://arxiv.org/abs/2512.13123


    Enhancing Node-Level Graph Domain Adaptation by Alleviating Local Dependency

    oai:arXiv.org:2512.13149v1

    arXiv:2512.13149v1 Announce Type: cross Abstract: Recent years have witnessed significant advancements in machine learning methods on graphs. However, transferring knowledge effectively from one graph to another remains a critical challenge. This highlights the need for algorithms capable of applying information extracted from a source graph to an unlabeled target graph, a task known as unsupervised graph domain adaptation (GDA). One key difficulty in unsupervised GDA is conditional shift, which hinders transferability. In this paper, we show that conditional shift can be observed only if there exist local dependencies among node features. To support this claim, we perform a rigorous analysis and further provide generalization bounds of GDA when dependent node features are modeled using Markov chains. Guided by the theoretical findings, we propose to improve GDA by decorrelating node features, which can be specifically implemented through decorrelated GCN layers and graph transformer layers. Our experimental results demonstrate the effectiveness of this approach, showing not only substantial performance enhancements over baseline GDA methods but also clear visualizations of small intra-class distances in the learned representations. Our code is available at https://github.com/TechnologyAiGroup/DFT

    https://arxiv.org/abs/2512.13149


    q-Analogue of Hamiltonian Monte Carlo method

    oai:arXiv.org:2512.13246v1

    arXiv:2512.13246v1 Announce Type: cross Abstract: Building upon Lagrangian mechanics on Wess's $q$-commutative spaces, we derive the $q$-deformed Hamiltonian dynamics as formulated by Lavagno et al. (2006). We then develop a computationally tractable scheme and propose a novel Hamiltonian Monte Carlo sampler ($q$-HMC). The proposed $q$-HMC method is shown to satisfy the detailed balance principle. Numerical experiments on distributions with explicit potential functions demonstrate its efficacy, particularly in exploring stiff energy landscapes. This method is also applied to draw samples from the Bayesian posterior distribution of inverse problems. A numerical test on a posterior distribution with a stiff potential further shows the advantage of $q$-HMC. When applied to functional reconstruction problems, its computational implementation is identical to that of HMC.

    https://arxiv.org/abs/2512.13246


    Temporal parallelisation of continuous-time maximum-a-posteriori trajectory estimation

    oai:arXiv.org:2512.13319v1

    arXiv:2512.13319v1 Announce Type: cross Abstract: This paper proposes a parallel-in-time method for computing continuous-time maximum-a-posteriori (MAP) trajectory estimates of the states of partially observed stochastic differential equations (SDEs), with the goal of improving computational speed on parallel architectures. The MAP estimation problem is reformulated as a continuous-time optimal control problem based on the Onsager-Machlup functional. This reformulation enables the use of a previously proposed parallel-in-time solution for optimal control problems, which we adapt to the current problem. The structure of the resulting optimal control problem admits a parallel solution based on parallel associative scan algorithms. In the linear Gaussian special case, it yields a parallel Kalman-Bucy filter and a parallel continuous-time Rauch-Tung-Striebel smoother. These linear computational methods are further extended to nonlinear continuous-time state-space models through Taylor expansions. We also present the corresponding parallel two-filter smoother. The graphics processing unit (GPU) experiments on linear and nonlinear models demonstrate that the proposed framework achieves a significant speedup in computations while maintaining the accuracy of sequential algorithms.

    https://arxiv.org/abs/2512.13319


    Policy-Aligned Estimation of Conditional Average Treatment Effects

    oai:arXiv.org:2512.13400v1

    arXiv:2512.13400v1 Announce Type: cross Abstract: Firms often develop targeting policies to personalize marketing actions and improve incremental profits. Effective targeting depends on accurately separating customers with positive versus negative treatment effects. We propose an approach to estimate the conditional average treatment effects (CATEs) of marketing actions that aligns their estimation with the firm's profit objective. The method recognizes that, for many customers, treatment effects are so extreme that additional accuracy is unlikely to change the recommended actions. However, accuracy matters near the decision boundary, as small errors can alter targeting decisions. By modifying the firm's objective function in the standard profit maximization problem, our method yields a near-optimal targeting policy while simultaneously estimating CATEs. This introduces a new perspective on CATE estimation, reframing it as a problem of profit optimization rather than prediction accuracy. We establish the theoretical properties of the proposed method and demonstrate its performance and trade-offs using synthetic data.

    https://arxiv.org/abs/2512.13400


    Multiclass Graph-Based Large Margin Classifiers: Unified Approach for Support Vectors and Neural Networks

    oai:arXiv.org:2512.13410v1

    arXiv:2512.13410v1 Announce Type: cross Abstract: While large margin classifiers are originally an outcome of an optimization framework, support vectors (SVs) can be obtained from geometric approaches. This article presents advances in the use of Gabriel graphs (GGs) in binary and multiclass classification problems. For Chipclass, a hyperparameter-less and optimization-less GG-based binary classifier, we discuss how activation functions and support edge (SE)-centered neurons affect the classification, proposing smoother functions and structural SV (SSV)-centered neurons to achieve margins with low probabilities and smoother classification contours. We extend the neural network architecture, which can be trained with backpropagation with a softmax function and a cross-entropy loss, or by solving a system of linear equations. A new subgraph-/distance-based membership function for graph regularization is also proposed, along with a new GG recomputation algorithm that is less computationally expensive than the standard approach. Experimental results with the Friedman test show that our method was better than previous GG-based classifiers and statistically equivalent to tree-based models.

    https://arxiv.org/abs/2512.13410


    Data Quality Issues in Flare Prediction using Machine Learning Models

    oai:arXiv.org:2512.13417v1

    arXiv:2512.13417v1 Announce Type: cross Abstract: Machine learning models for forecasting solar flares have been trained and tested using a variety of data sources, such as Space Weather Prediction Center (SWPC) operational and science-quality data. Typically, data from these sources is minimally processed before being used to train and validate a forecasting model. However, predictive performance can be impaired if defects in and inconsistencies between these data sources are ignored. For a number of commonly used data sources, together with software tools that query and then output processed data, we identify their respective defects and inconsistencies, quantify their extent, and show how they can affect the predictions produced by data-driven machine learning forecasting models. We also outline procedures for fixing these issues or at least mitigating their impacts. Finally, based on our thorough comparisons of the impacts of data sources on the trained forecasting model in terms of predictive skill scores, we offer recommendations for the use of different data products in operational forecasting.

    https://arxiv.org/abs/2512.13417


    From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis

    oai:arXiv.org:2512.13491v1

    arXiv:2512.13491v1 Announce Type: cross Abstract: We inspect the deductive connection between the neural scaling law and Zipf's law -- two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model -- such as a large language model -- changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf's law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf's law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps' law on the vocabulary growth from Zipf's law, Hilberg's hypothesis on the entropy scaling from Heaps' law, and the neural scaling from Hilberg's hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all the four statistical laws.

    https://arxiv.org/abs/2512.13491


    Learning under Distributional Drift: Reproducibility as an Intrinsic Statistical Resource

    oai:arXiv.org:2512.13506v1

    arXiv:2512.13506v1 Announce Type: cross Abstract: Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system's finite capacity for statistical reproducibility - the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, and we prove a matching minimax lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.

    https://arxiv.org/abs/2512.13506


    A Class of Accelerated Fixed-Point-Based Methods with Delayed Inexact Oracles and Its Applications

    oai:arXiv.org:2512.13547v1

    arXiv:2512.13547v1 Announce Type: cross Abstract: In this paper, we develop a novel accelerated fixed-point-based framework using delayed inexact oracles to approximate a fixed point of a nonexpansive operator (or equivalently, a root of a co-coercive operator), a central problem in scientific computing. Our approach leverages both Nesterov's acceleration technique and the Krasnosel'skii-Mann (KM) iteration, while accounting for delayed inexact oracles, a key mechanism in asynchronous algorithms. We also introduce a unified approximate error condition for delayed inexact oracles, which can cover various practical scenarios. Under mild conditions and appropriate parameter updates, we establish both $\mathcal{O}(1/k^2)$ non-asymptotic and $o(1/k^2)$ asymptotic convergence rates in expectation for the squared norm of residual. Our rate significantly improves the $\mathcal{O}(1/k)$ rates in classical KM-type methods, including their asynchronous variants. We also establish $o(1/k^2)$ almost sure convergence rates and the almost sure convergence of iterates to a solution of the problem. Within our framework, we instantiate three settings for the underlying operator: (i) a deterministic universal delayed oracle; (ii) a stochastic delayed oracle; and (iii) a finite-sum structure with asynchronous updates. For each case, we instantiate our framework to obtain a concrete algorithmic variant for which our convergence results still apply, and whose iteration complexity depends linearly on the maximum delay. Finally, we verify our algorithms and theoretical results through two numerical examples on both matrix game and shallow neural network training problems.

    https://arxiv.org/abs/2512.13547


    Correcting exponentiality test for binned earthquake magnitudes

    oai:arXiv.org:2512.13599v1

    arXiv:2512.13599v1 Announce Type: cross Abstract: In theory, earthquake magnitudes follow an exponential distribution. In practice, however, earthquake catalogs report magnitudes with finite resolution, resulting in a discrete (geometric) distribution. To determine the lowest magnitude above which seismic events are completely recorded, the Lilliefors test is commonly applied. Because this test assumes continuous data, it is standard practice to add uniform noise to binned magnitudes prior to testing exponentiality. This work shows analytically that uniform dithering cannot recover the exponential distribution from its geometric form. It instead returns a piecewise-constant residual lifetime distribution, whose deviation from the exponential model becomes detectable as catalog size or bin width increases. Numerical experiments confirm that this deviation yields an overestimation of the magnitude of completeness in large catalogs. We therefore derive the exact noise distribution - a truncated exponential on the bin interval - that correctly restores the continuous exponential distribution over the whole magnitude range. Numerical tests show that this correction yields Lilliefors rejection rates consistent with the significance level for all bin widths and catalog sizes. The proposed solution eliminates a methodological bias in completeness estimation, which especially impacts high-resolution catalogs.

    https://arxiv.org/abs/2512.13599
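
The dithering correction described in the abstract above can be sketched as follows. The sketch assumes each bin is labelled by its lower edge and that the exponential rate `beta` is known; in practice it would have to be estimated. All function names are hypothetical and not taken from the paper. By the memoryless property, the offset of an exponential variable within its bin follows an exponential truncated to `[0, delta)`, so sampling that offset by inverse CDF restores the continuous law, whereas uniform noise does not.

```python
import math
import random


def truncated_exp_noise(u, beta, delta):
    """Inverse-CDF draw from Exp(beta) truncated to [0, delta), for u in [0, 1)."""
    return -math.log(1.0 - u * (1.0 - math.exp(-beta * delta))) / beta


def simulate_binned(n, beta, delta, rng):
    """Draw n exponential magnitudes and round each down to its bin's lower edge."""
    return [math.floor(rng.expovariate(beta) / delta) * delta for _ in range(n)]


def dither(bin_lower_edges, beta, delta, rng, exact=True):
    """Add within-bin noise: truncated exponential if exact, else uniform."""
    dithered = []
    for edge in bin_lower_edges:
        u = rng.random()
        noise = truncated_exp_noise(u, beta, delta) if exact else u * delta
        dithered.append(edge + noise)
    return dithered


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
beta, delta = 2.0, 0.5  # deliberately coarse bins so the bias is visible
binned = simulate_binned(100_000, beta, delta, rng)
mean_exact = mean(dither(binned, beta, delta, rng, exact=True))
mean_uniform = mean(dither(binned, beta, delta, rng, exact=False))
```

With these settings the exact dithering should give a sample mean close to `1 / beta = 0.5`, while uniform dithering is biased upward (about 0.54 in expectation), consistent with the deviation the abstract attributes to wide bins.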


    From Many Models, One: Macroeconomic Forecasting with Reservoir Ensembles

    oai:arXiv.org:2512.13642v1

    arXiv:2512.13642v1 Announce Type: cross Abstract: Model combination is a powerful approach for achieving better performance with a set of models than by selecting any single one. We study both theoretically and empirically the effectiveness of ensembles of Multi-Frequency Echo State Networks (MFESNs), which have been shown to achieve state-of-the-art macroeconomic time series forecasting results (Ballarin et al., 2024a). Hedge and Follow-the-Leader schemes are discussed, and their online learning guarantees are extended to the case of dependent data. In applications, our proposed Ensemble Echo State Networks show significantly improved predictive performance compared to individual MFESN models.

    https://arxiv.org/abs/2512.13642
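
The Hedge scheme mentioned in the abstract above is the standard exponential-weights rule. The sketch below is a generic version using squared-error loss and a fixed learning rate `eta`; both are assumptions, as are the function names, and the paper's exact loss, tuning, and MFESN experts are not reproduced here. The usual Hedge guarantees assume bounded losses.

```python
import math


def hedge_weights(cumulative_losses, eta):
    """Exponential weights: w_i proportional to exp(-eta * cumulative loss of i)."""
    # Subtracting the minimum loss leaves the weights unchanged but avoids underflow.
    best = min(cumulative_losses)
    unnormalised = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def hedge_forecasts(expert_forecasts, targets, eta):
    """Combine expert forecasts online.

    expert_forecasts[t][i] is the forecast of expert i at time t. At each step
    the combination uses only losses observed before time t.
    Returns the combined forecasts and the final weights.
    """
    n_experts = len(expert_forecasts[0])
    cumulative = [0.0] * n_experts
    combined = []
    for forecasts, y in zip(expert_forecasts, targets):
        weights = hedge_weights(cumulative, eta)
        combined.append(sum(w * f for w, f in zip(weights, forecasts)))
        for i, f in enumerate(forecasts):
            cumulative[i] += (f - y) ** 2  # squared-error loss (assumption)
    return combined, hedge_weights(cumulative, eta)


# Expert 0 is always right, expert 1 is always wrong.
combined, final_weights = hedge_forecasts([[1.0, 0.0]] * 3, [1.0] * 3, eta=1.0)
```

Follow-the-Leader would instead put all weight on the expert with the smallest cumulative loss; Hedge approaches that behaviour as `eta` grows.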


    Advancing Machine Learning Optimization of Chiral Photonic Metasurface: Comparative Study of Neural Network and Genetic Algorithm Approaches

    oai:arXiv.org:2512.13656v1

    arXiv:2512.13656v1 Announce Type: cross Abstract: Chiral photonic metasurfaces provide unique capabilities for tailoring light-matter interactions, which are essential for next-generation photonic devices. Here, we report an advanced optimization framework that combines deep learning and evolutionary algorithms to significantly improve both the design and performance of chiral photonic nanostructures. Building on previous work utilizing a three-layer perceptron reinforced learning and stochastic evolutionary algorithm with decaying changes and mass extinction for chiral photonic optimization, our study introduces a refined pipeline featuring a two-output neural network architecture to reduce the trade-off between high chiral dichroism (CD) and reflectivity. Additionally, we use an improved fitness function and efficient data augmentation techniques. A comparative analysis between a neural network (NN)-based approach and a genetic algorithm (GA) is presented for structures of different interface pattern depth, material combinations, and geometric complexity. Using GaP/air and PMMA/air metasurfaces as examples, we demonstrate a twofold higher CD as a result of superior optimization performance, and we show the impact of both the corner number and the refractive index contrast. Additionally, a substantial increase in the number of structures explored within limited computational resources is highlighted, with tailored spectral reflectivity suggested by our electromagnetic simulations, paving the way for chiral mirrors applicable to polarization-selective light-matter interaction studies.

    https://arxiv.org/abs/2512.13656


    Quantum oracles give an advantage for identifying classical counterfactuals

    oai:arXiv.org:2512.13692v1

    arXiv:2512.13692v1 Announce Type: cross Abstract: We show that quantum oracles provide an advantage over classical oracles for answering classical counterfactual questions in causal models, or equivalently, for identifying unknown causal parameters such as distributions over functional dependences. In structural causal models with discrete classical variables, observational data and even ideal interventions generally fail to answer all counterfactual questions, since different causal parameters can reproduce the same observational and interventional data while disagreeing on counterfactuals. Using a simple binary example, we demonstrate that if the classical variables of interest are encoded in quantum systems and the causal dependence among them is encoded in a quantum oracle, coherently querying the oracle enables the identification of all causal parameters -- hence all classical counterfactuals. We generalize this to arbitrary finite cardinalities and prove that coherent probing 1) allows the identification of all two-way joint counterfactuals $p(Y_x=y, Y_{x'}=y')$, which is not possible with any number of queries to a classical oracle, and 2) provides tighter bounds on higher-order multi-way counterfactuals than with a classical oracle. This work can also be viewed as an extension to traditional quantum oracle problems such as Deutsch-Jozsa to identifying more causal parameters beyond just, e.g., whether a function is constant or balanced. Finally, we raise the question of whether this quantum advantage relies on uniquely non-classical features like contextuality. We provide some evidence against this by showing that in the binary case, oracles in some classically-explainable theories like Spekkens' toy theory also give rise to a counterfactual identifiability advantage over strictly classical oracles.

    https://arxiv.org/abs/2512.13692


    A Kernel-Based Approach for Modelling Gaussian Processes with Functional Information

    oai:arXiv.org:2201.11023v3

    arXiv:2201.11023v3 Announce Type: replace Abstract: Gaussian processes (GPs) are ubiquitous tools for modeling and predicting continuous processes in physical and engineering sciences. This is partly due to the fact that one may employ a Gaussian process as an interpolator while facilitating straightforward uncertainty quantification at other locations. In addition to training data, it is sometimes the case that available information is not in the form of a finite collection of points. For example, boundary value problems contain information on the boundary of a domain, or underlying physics lead to known behavior on an entire uncountable subset of the domain of interest. While an approximation to such known information may be obtained via pseudo-training points in the known subset, such a procedure is ad hoc, with little guidance on the number of points to use or on the behavior as the number of pseudo-observations grows large. We propose and construct Gaussian processes that unify, via reproducing kernel Hilbert space, the typical finite training data case with the case of having uncountable information by exploiting the equivalence of conditional expectation and orthogonal projections in Hilbert space. We show existence of the proposed process and establish that it is the limit of a conventional GP conditioned on an increasing number of training points. We illustrate the flexibility and advantages of our proposed approach via numerical experiments.

    https://arxiv.org/abs/2201.11023


    Switchback Experiments under Geometric Mixing

    oai:arXiv.org:2209.00197v4

    arXiv:2209.00197v4 Announce Type: replace Abstract: The switchback is an experimental design that measures treatment effects by repeatedly turning an intervention on and off for a whole system. Switchback experiments are a robust way to overcome cross-unit spillover effects; however, they are vulnerable to bias from temporal carryovers. In this paper, we consider properties of switchback experiments in Markovian systems that mix at a geometric rate. We find that, in this setting, standard switchback designs suffer considerably from carryover bias: Their estimation error decays as $T^{-1/3}$ in terms of the experiment horizon $T$, whereas in the absence of carryovers a faster rate of $T^{-1/2}$ would have been possible. We also show, however, that judicious use of burn-in periods can considerably improve the situation, and enables errors that decay as $\log(T)^{1/2}T^{-1/2}$. Our formal results are mirrored in an empirical evaluation.

    https://arxiv.org/abs/2209.00197
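
The role of burn-in periods in the switchback abstract above can be illustrated with a simple difference-in-means sketch. It assumes the treatment is held constant within blocks of `block_length` periods aligned at multiples of that length, and that the first `burn_in` periods of every block are discarded; the function name and this particular estimator are illustrative assumptions, not the paper's exact construction.

```python
def switchback_estimate(outcomes, assignments, block_length, burn_in):
    """Difference in means between treated and control periods.

    Periods falling within the first `burn_in` steps of each block are dropped,
    so outcomes still affected by the previous block's treatment are excluded.
    """
    treated, control = [], []
    for t, (y, w) in enumerate(zip(outcomes, assignments)):
        if t % block_length < burn_in:
            continue  # still in the burn-in window of this block
        (treated if w == 1 else control).append(y)
    if not treated or not control:
        raise ValueError("need at least one retained period in each arm")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy series with carryover: outcomes take two periods to settle after a switch.
outcomes = [0.5, 0.8, 1.0, 1.0, 0.5, 0.2, 0.0, 0.0]
assignments = [1, 1, 1, 1, 0, 0, 0, 0]
naive = switchback_estimate(outcomes, assignments, block_length=4, burn_in=0)
with_burn_in = switchback_estimate(outcomes, assignments, block_length=4, burn_in=2)
```

In this toy series the settled effect is 1.0; the estimate without burn-in is pulled toward zero by the transition periods, while discarding two periods per block recovers it.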


    Bayesian spatial+: A joint model perspective

    oai:arXiv.org:2309.05496v3

    arXiv:2309.05496v3 Announce Type: replace Abstract: Spatial confounding is a common issue in spatial regression models, occurring when spatially varying covariates correlate with the spatial effect included in the model. This dependence, particularly at high spatial frequencies, can introduce bias in regression coefficient estimates when combined with smoothing penalties. The spatial+ framework is a widely used two-stage frequentist approach that mitigates spatial confounding by explicitly modeling and removing the spatial structure in the confounding covariate, then using the corresponding residuals in the second-stage model for the response. However, it does not propagate first-stage uncertainty, does not discuss a general inferential framework, and, crucially, cannot guarantee that covariate residuals and spatial effects in the response model are free of shared high-frequency structure, so confounding may persist. We propose Bayesian spatial+, a joint modeling approach that simultaneously addresses these limitations. Our framework naturally propagates uncertainty and enables straightforward posterior inference, while ensuring separation of spatial frequencies through specialized joint priors on smoothness parameters. We further introduce a cut-feedback strategy that prevents feedback between model components from reintroducing confounding. Simulation studies and real-world applications show substantial gains in bias reduction and interval coverage relative to existing approaches. Notably, in our comparisons, Bayesian spatial+ is the only method for which credible interval coverage remains stable as the sample size increases.

    https://arxiv.org/abs/2309.05496


    Improved Generalization Bounds for Transductive Learning by Transductive Local Complexity and Its Applications

    oai:arXiv.org:2309.16858v4

    arXiv:2309.16858v4 Announce Type: replace Abstract: We introduce Transductive Local Complexity (TLC) as a new tool for analyzing the generalization performance of transductive learning methods. Our work extends the classical Local Rademacher Complexity (LRC) to the transductive setting, incorporating substantial and novel components. Although LRC has been used to obtain sharp generalization bounds and minimax rates for inductive tasks such as classification and nonparametric regression, it has remained an open problem whether a localized Rademacher complexity framework can be effectively adapted to transductive learning to achieve sharp or nearly sharp bounds consistent with inductive results. We provide an affirmative answer via TLC. TLC is constructed by first deriving a new concentration inequality in Theorem 4.1 for the supremum of empirical processes capturing the gap between test and training losses, termed the test-train process, under uniform sampling without replacement, which leverages a novel combinatorial property of the test-train process and a new proof strategy applying the exponential Efron-Stein inequality twice. A subsequent peeling strategy and a new surrogate variance operator then yield excess risk bounds in the transductive setting that are nearly consistent with classical LRC-based inductive bounds up to a logarithmic gap. We further advance transductive learning through two applications: (1) for realizable transductive learning over binary-valued classes with finite VC dimension of $d_{\mathrm{VC}}$ and $u \ge m \ge d_{\mathrm{VC}}$, where $u$ and $m$ are the number of test features and training features, our Theorem 6.1 gives a nearly optimal bound $\Theta(d_{\mathrm{VC}} \log(me/d_{\mathrm{VC}})/m)$ matching the minimax rate $\Theta(d_{\mathrm{VC}}/m)$ up to $\log m$, resolving a decade-old open question; and (2) Theorem 6.2 presents a sharper excess risk bound for transductive kernel learning compared to the current state-of-the-art.

    https://arxiv.org/abs/2309.16858


    An Anytime Algorithm for Good Arm Identification

    oai:arXiv.org:2310.10359v2

    arXiv:2310.10359v2 Announce Type: replace Abstract: In good arm identification (GAI), the goal is to identify one arm whose average performance exceeds a given threshold, referred to as a good arm, if it exists. Few works have studied GAI in the fixed-budget setting when the sampling budget is fixed beforehand, or in the anytime setting, when a recommendation can be asked at any time. We propose APGAI, an anytime and parameter-free sampling rule for GAI in stochastic bandits. APGAI can be straightforwardly used in fixed-confidence and fixed-budget settings. First, we derive upper bounds on its probability of error at any time. They show that adaptive strategies can be more efficient in detecting the absence of good arms than uniform sampling in several diverse instances. Second, when APGAI is combined with a stopping rule, we prove upper bounds on the expected sampling complexity, holding at any confidence level. Finally, we show the good empirical performance of APGAI on synthetic and real-world data. Our work offers an extensive overview of the GAI problem in all settings.

    https://arxiv.org/abs/2310.10359


    False Discovery Rate and Localizing Power

    oai:arXiv.org:2401.03554v3

    arXiv:2401.03554v3 Announce Type: replace Abstract: False discovery rate (FDR) is commonly used for correction for multiple testing in neuroimaging studies. However, when using two-tailed tests, making directional inferences about the results can lead to a vastly inflated error rate, even approaching 100% in some cases. This happens because FDR controls the error rate only globally, over all tests, not within subsets, such as among those in only one or another direction. Here we consider and evaluate different strategies for FDR control in such cases, using both synthetic and real imaging data. Approaches that separate the tests by direction of the hypothesis test, or by the direction of the resulting test statistic, more properly control the directional error rate and preserve FDR benefits, albeit with a doubled risk of errors under complete absence of signal. Strategies that combine tests in both directions, or that use simple two-tailed p-values, can lead to invalid directional conclusions, even if these tests remain globally valid. A solution to this problem is through the use of selective inference, whereby positive and negative tails are treated as sets (families), which are screened locally, then subjected to FDR at a modified level that controls average FDR over those that survive the initial screening. Moreover, the BKY procedure can be used in place of the well-known Benjamini-Hochberg, yielding additional power. These methods are easy to implement. Finally, to enable valid thresholding for directional inference, we suggest that imaging software should allow the user to set asymmetrical thresholds for the two sides of the statistical map. While FDR continues to be a valid, powerful procedure for multiple testing correction, care is needed when making directional inferences for two-tailed tests, or more broadly, when making any localized inference.

    https://arxiv.org/abs/2401.03554
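
    To make the "separate the tests by direction" strategy concrete, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied separately to tests with positive and negative statistics. The helper names `benjamini_hochberg` and `directional_bh` and the example numbers are assumptions for illustration; the sketch assumes the caller supplies a p-value appropriate to each test, and it does not implement the selective-inference or BKY variants the abstract mentions.

```python
def benjamini_hochberg(pvals, q):
    """Return a list of booleans: True where BH rejects at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value is under the BH line
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

def directional_bh(stats, pvals, q):
    """Run BH separately on tests with positive and negative statistics."""
    reject = [False] * len(pvals)
    for sign in (1, -1):
        idx = [i for i, s in enumerate(stats) if s * sign > 0]
        if not idx:
            continue
        sub = benjamini_hochberg([pvals[i] for i in idx], q)
        for i, r in zip(idx, sub):
            reject[i] = r
    return reject

# Hypothetical example values.
stats = [3.2, -2.9, 0.4, -0.1, 2.5]
pvals = [0.001, 0.004, 0.7, 0.9, 0.012]
pooled = benjamini_hochberg(pvals, q=0.05)
split = directional_bh(stats, pvals, q=0.05)
```

    Each direction forms its own family here, so, as the abstract notes, the chance of some error when there is no signal at all is roughly doubled relative to a single pooled family.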


    Statistical inference for pairwise comparison models

    oai:arXiv.org:2401.08463v3

    arXiv:2401.08463v3 Announce Type: replace Abstract: Pairwise comparison models have been widely used for utility evaluation and rank aggregation across various fields. The increasing scale of modern problems underscores the need to understand statistical inference in these models when the number of subjects diverges, a topic that is currently underexplored in the literature. To address this gap, this paper establishes a near-optimal asymptotic normality result for the maximum likelihood estimator in a broad class of pairwise comparison models. The key idea lies in identifying the Fisher information matrix as a weighted graph Laplacian, which can be studied via a meticulous spectral analysis. Our findings provide theoretical foundations for performing statistical inference in a wide range of pairwise comparison models beyond the Bradley--Terry model. Simulations utilizing synthetic data are conducted to validate the asymptotic normality result, followed by a hypothesis test using a tennis competition dataset.

    https://arxiv.org/abs/2401.08463
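
    The Laplacian structure mentioned above is easy to see in the Bradley-Terry special case, where item i beats item j with probability 1 / (1 + exp(-(theta_i - theta_j))). The sketch below builds the Fisher information for that model only; the function name `bt_fisher_information` and the input layout (a symmetric matrix of comparison counts) are assumptions for illustration, and the paper's broader model class is not covered.

```python
import math

def bt_fisher_information(theta, n_games):
    """Fisher information of Bradley-Terry scores `theta`.

    n_games[i][j] is the (symmetric) number of comparisons between
    items i and j. The result is the Laplacian of the comparison graph
    with edge weights n_ij * p_ij * (1 - p_ij).
    """
    n = len(theta)
    info = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))
            w = n_games[i][j] * p * (1.0 - p)  # edge weight
            info[i][i] += w
            info[j][j] += w
            info[i][j] -= w
            info[j][i] -= w
    return info

# Hypothetical example: item 0 is compared with items 1 and 2,
# while items 1 and 2 never meet.
theta = [0.0, 0.5, -1.0]
n_games = [[0, 10, 5], [10, 0, 0], [5, 0, 0]]
info = bt_fisher_information(theta, n_games)
```

    Every row sums to zero, which reflects the fact that the scores are identified only up to a common shift.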


    Efficient Nonparametric Inference for Mediation Analysis with Nonignorable Missing Confounders

    oai:arXiv.org:2402.05384v2

    arXiv:2402.05384v2 Announce Type: replace Abstract: Mediation analysis is widely used for exploring treatment mechanisms; however, it faces challenges when nonignorable missing confounders are present. Efficient inference of mediation effects and the efficiency loss due to nonignorable missingness have been rarely studied in the literature because of the difficulties arising from the ill-posed inverse problem. In this paper, we propose a general shadow variable framework for identifying mediation effects, allowing shadow variables to be selected from either observed covariates or externally collected auxiliary data. We then propose a Sieve-based Iterative Outward (SIO) approach for estimation. We establish large-sample theory, particularly asymptotic normality, for the proposed estimator despite the ill-posedness of the problem. We show that our estimator is locally efficient and attains the semiparametric efficiency bound under certain conditions. Building on the efficient influence function, we explicitly quantify the efficiency loss attributable to missingness and propose a debiased machine learning approach for estimation and inference. We examine the finite-sample performance of the proposed approach using extensive simulation studies and showcase its practical applicability through an empirical analysis of CFPS data.

    https://arxiv.org/abs/2402.05384


    Admissible online closed testing must employ e-values

    oai:arXiv.org:2407.15733v4

    arXiv:2407.15733v4 Announce Type: replace Abstract: In contemporary research, data scientists often test an infinite sequence of hypotheses $H_1,H_2,\ldots$ one by one, and are required to make real-time decisions without knowing the future hypotheses or data. In this paper, we consider such an online multiple testing problem with the goal of providing simultaneous lower bounds for the number of true discoveries in data-adaptively chosen rejection sets. Employing the recent online closure principle, we show that for this task it is necessary to use an anytime-valid test for each intersection hypothesis. This connects two distinct branches of the literature: online testing of multiple hypotheses (where the hypotheses appear online), and sequential anytime-valid testing of a single hypothesis (where the data for a fixed hypothesis appears online). Motivated by this result, we construct a new online closed testing procedure and a corresponding short-cut with a true discovery guarantee based on multiplying sequential e-values. This general but simple procedure gives uniform improvements over the state-of-the-art methods and also allows the construction of entirely new and powerful procedures.

    https://arxiv.org/abs/2407.15733
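
    The building block named in the abstract, multiplying sequential e-values, can be sketched for a single hypothesis as follows. This is only an illustration of the anytime-valid test, not the paper's online closed testing procedure; the function name `sequential_e_test` is an assumption. If each e-value has conditional expectation at most 1 under the null, Ville's inequality bounds the probability that the running product ever reaches 1/alpha by alpha.

```python
def sequential_e_test(e_values, alpha):
    """Multiply e-values as they arrive.

    Returns the first time step (1-based) at which the running product
    reaches 1/alpha, or None if it never does.
    """
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return t  # reject the hypothesis at time t
    return None

# Hypothetical stream of e-values: running products are 2, 1, 4, 12, 24.
stop = sequential_e_test([2.0, 0.5, 4.0, 3.0, 2.0], alpha=0.05)
```

    Because the guarantee holds at every time simultaneously, the analyst may stop as soon as the threshold is crossed without inflating the error rate.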


    On Nonparanormal Likelihoods

    oai:arXiv.org:2408.17346v2

    arXiv:2408.17346v2 Announce Type: replace Abstract: Nonparanormal models describe the joint distribution of multivariate responses via latent Gaussian, and thus parametric, copulae while allowing flexible nonparametric marginals. Some aspects of such distributions, for example conditional independence, are formulated parametrically. Other features, such as marginal distributions, can be formulated non- or semiparametrically. Such models are attractive when multivariate normality is questionable. Most estimation procedures perform two steps, first estimating the nonparametric part. The copula parameters come second, treating the marginal estimates as known. This is sufficient for some applications. For other applications, e.g. when a semiparametric margin features parameters of interest or when standard errors are important, a simultaneous estimation of all parameters might be more advantageous. We present suitable parameterisations of nonparanormal models, possibly including semiparametric effects, and define four novel nonparanormal log-likelihood functions. In general, the corresponding one-step optimization problems are shown to be non-convex. In some cases, however, biconvex problems emerge. Several convex approximations are discussed. From a low-level computational point of view, the core contribution is the score function for multivariate normal log-probabilities computed via Genz' procedure. We present transformation discriminant analysis when some biomarkers are subject to limit-of-detection problems as an application and illustrate possible empirical gains in semiparametric efficient polychoric correlation analysis.

    https://arxiv.org/abs/2408.17346


    Fused $L_{1/2}$ prior for large scale linear inverse problem with Gibbs bouncy particle sampler

    oai:arXiv.org:2409.07874v2

    arXiv:2409.07874v2 Announce Type: replace Abstract: In this paper, we study a Bayesian approach for solving large-scale linear inverse problems arising in various scientific and engineering fields. We propose a fused $L_{1/2}$ prior with edge-preserving and sparsity-promoting properties and show that it can be formulated as a Gaussian mixture Markov random field. Since the density function of this family of priors is neither log-concave nor Lipschitz, gradient-based Markov chain Monte Carlo methods cannot be applied to sample the posterior. Thus, we present a Gibbs sampler in which all the conditional posteriors involved have closed-form expressions. The Gibbs sampler works well for small problems, but it is computationally intractable for large-scale problems because it needs to sample from a high-dimensional Gaussian distribution. To reduce the computational burden, we construct a Gibbs bouncy particle sampler (Gibbs-BPS) based on a piecewise deterministic Markov process. This new sampler combines elements of the Gibbs sampler with the bouncy particle sampler, and its computational complexity is an order of magnitude smaller. We show that the new sampler converges to the target distribution. With computed tomography examples, we demonstrate that the proposed method shows competitive performance with existing popular Bayesian methods and is highly efficient in large-scale problems.

    https://arxiv.org/abs/2409.07874


    Markov Chain Gradient Descent in Hilbert Spaces

    oai:arXiv.org:2410.08361v4

    arXiv:2410.08361v4 Announce Type: replace Abstract: In this paper, we study a Markov chain-based stochastic gradient algorithm in general Hilbert spaces, aiming at approximating the optimal solution of a quadratic loss function. We establish probabilistic upper bounds on its convergence. We further extend these results to an online regularized learning algorithm in reproducing kernel Hilbert spaces, where the samples are drawn along a Markov chain trajectory.

    https://arxiv.org/abs/2410.08361
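
    A finite-dimensional toy version of the setting above: stochastic gradient descent on a least-squares loss where the sample index follows a Markov chain instead of being drawn independently. Everything here is an assumption for illustration (the function `markov_chain_sgd`, the lazy random walk on a cycle, the noiseless data, the constant step size); the paper works in general Hilbert spaces and its step-size schedule is not reproduced.

```python
import math
import random

def markov_chain_sgd(xs, ys, steps, step_size, seed=0):
    """SGD on 0.5 * (<w, x_i> - y_i)^2, with i moving along a Markov chain."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    i = 0  # current state of the chain
    for _ in range(steps):
        x, y = xs[i], ys[i]
        resid = sum(wk * xk for wk, xk in zip(w, x)) - y
        w = [wk - step_size * resid * xk for wk, xk in zip(w, x)]
        i = (i + rng.choice((-1, 0, 1))) % n  # lazy random walk on a cycle
    return w

# Hypothetical noiseless data: unit-norm features spread around the circle.
w_true = [1.0, -2.0]
n = 12
xs = [[math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)]
      for k in range(n)]
ys = [sum(a * b for a, b in zip(w_true, x)) for x in xs]

w_hat = markov_chain_sgd(xs, ys, steps=2000, step_size=0.5)
err = math.dist(w_hat, w_true)
```

    Consecutive samples are strongly dependent here, since the chain can only stay put or move to a neighbouring index, yet the iterates still approach the least-squares solution in this noiseless example.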


    Self-test loss functions for learning weak-form operators and gradient flows

    oai:arXiv.org:2412.03506v3

    arXiv:2412.03506v3 Announce Type: replace Abstract: The construction of loss functions presents a major challenge in data-driven modeling involving weak-form operators in PDEs and gradient flows, particularly due to the need to select test functions appropriately. We address this challenge by introducing self-test loss functions, which employ test functions that depend on the unknown parameters, specifically for cases where the operator depends linearly on the unknowns. The proposed self-test loss function conserves energy for gradient flows and coincides with the expected log-likelihood ratio for stochastic differential equations. Importantly, it is quadratic, facilitating theoretical analysis of identifiability and well-posedness of the inverse problem, while also leading to efficient parametric or nonparametric regression algorithms. It is computationally simple, requiring only low-order derivatives or even being entirely derivative-free, and numerical experiments demonstrate its robustness against noisy and discrete data.

    https://arxiv.org/abs/2412.03506


    Fast Estimation of the Composite Link Model for Multidimensional Grouped Counts

    oai:arXiv.org:2412.04956v2

    arXiv:2412.04956v2 Announce Type: replace Abstract: This paper presents a significant advancement in the estimation of the Composite Link Model within a penalized likelihood framework, specifically designed to address indirect observations of grouped count data. While the model is effective in these contexts, its application becomes computationally challenging in large, high-dimensional settings. To overcome this, we propose a reformulated iterative estimation procedure that leverages Generalized Linear Array Models, enabling the disaggregation and smooth estimation of latent distributions in multidimensional data. Through simulation studies and applications to high-dimensional mortality datasets, we demonstrate the model's capability to capture fine-grained patterns while comparing its computational performance to the conventional algorithm. The proposed methodology offers notable improvements in computational speed, storage efficiency, and practical applicability, making it suitable for a wide range of fields in which high-dimensional data are provided in grouped formats.

    https://arxiv.org/abs/2412.04956


    The Group R2D2 Shrinkage Prior for Sparse Linear Models with Grouped Covariates

    oai:arXiv.org:2412.15293v3

    arXiv:2412.15293v3 Announce Type: replace Abstract: Shrinkage priors are a popular Bayesian paradigm to handle sparsity in high-dimensional regression. Still limited, however, is a flexible class of shrinkage priors to handle grouped sparsity, where covariates exhibit some natural grouping structure. This paper proposes a novel extension of the $R^2$-induced Dirichlet Decomposition (R2D2) prior to accommodate grouped variable selection in linear regression models. The proposed method, called the Group R2D2 prior, employs a Dirichlet prior distribution on the coefficient of determination for each group, allowing for a flexible and adaptive shrinkage that operates at both group and individual variable levels. This approach extends the original R2D2 prior to handle grouped predictors, providing a balance between within-group dependence and group-level sparsity. We present several theoretical properties of this proposed prior distribution while also developing a Markov Chain Monte Carlo algorithm. Through simulation studies and real-data analysis, we demonstrate that our method outperforms traditional shrinkage priors in terms of estimation accuracy, inference, and prediction.

    https://arxiv.org/abs/2412.15293


    An adaptive approximate Bayesian computation MCMC with Global-Local proposals

    oai:arXiv.org:2412.15644v2

    arXiv:2412.15644v2 Announce Type: replace Abstract: In this paper, we address the challenge of Markov Chain Monte Carlo (MCMC) algorithms within the approximate Bayesian computation (ABC) framework, which often get trapped in local optima due to their inherent local exploration mechanism. We propose a novel Global-Local ABC-MCMC algorithm that combines the "exploration" capabilities of global proposals with the "exploitation" finesse of local proposals. We integrate iterated importance resampling into the likelihood-free framework to establish an effective global proposal distribution. For high-dimensional parameter spaces, we optimize the efficiency of the local sampler by leveraging Langevin dynamics and common random numbers. Furthermore, we introduce two adaptive schemes to enhance the algorithmic performance. The first scheme divides the update target of the importance proposal into a sequence of intermediate target distributions that progressively approximate the ABC posterior, thereby gradually updating the importance proposal distribution during the iterations. The second adaptive scheme automatically selects the optimal mixture of global and local moves through sequential optimization, based on a relative version of the expected squared jumping distance (ESJD). We theoretically and numerically demonstrate that our method is able to improve sampling efficiency and achieve more reliable convergence for complex posteriors. We develop a software package that is available at https://github.com/caofff/GL-ABC-MCMC.

    https://arxiv.org/abs/2412.15644


    Active multiple testing with proxy p-values and e-values

    oai:arXiv.org:2502.05715v2

    arXiv:2502.05715v2 Announce Type: replace Abstract: Researchers often lack the resources to test every hypothesis of interest directly or compute test statistics comprehensively, but often possess auxiliary data from which they can compute an estimate of the experimental outcome. We introduce a novel approach for selecting the hypotheses for which to query a statistic (e.g., run an experiment or perform an expensive computation) in a hypothesis testing setup by leveraging estimates to compute proxy statistics. Our framework allows a scientist to propose a proxy statistic and then query the true statistic with some probability based on the value of the proxy. We make no assumptions about how the proxy is derived, and it can be arbitrarily dependent on the true statistic. If the true statistic is not queried, the proxy is used in its place. We characterize "active" methods that produce valid p-values and e-values in this setting and utilize this framework in the multiple testing setting to create procedures with false discovery rate (FDR) control. Through simulations and real data analysis of causal effects in scCRISPR screen experiments, we empirically demonstrate that our proxy framework has both high power and low resource usage when our proxies are accurate estimates of the respective true statistics.

    https://arxiv.org/abs/2502.05715


    Multi-View Oriented GPLVM: Expressiveness and Efficiency

    oai:arXiv.org:2502.08253v2

    arXiv:2502.08253v2 Announce Type: replace Abstract: The multi-view Gaussian process latent variable model (MV-GPLVM) aims to learn a unified representation from multi-view data but is hindered by challenges such as limited kernel expressiveness and low computational efficiency. To overcome these issues, we first introduce a new duality between the spectral density and the kernel function. By modeling the spectral density with a bivariate Gaussian mixture, we then derive a generic and expressive kernel termed Next-Gen Spectral Mixture (NG-SM) for MV-GPLVMs. To address the inherent computational inefficiency of the NG-SM kernel, we design a new form of random Fourier feature approximation. Combined with a tailored reparameterization trick, this approximation enables scalable variational inference for both the model and the unified latent representations. Numerical evaluations across a diverse range of multi-view datasets demonstrate that our proposed method consistently outperforms state-of-the-art models in learning meaningful latent representations.

    https://arxiv.org/abs/2502.08253


    Coherent Disaggregation and Uncertainty Quantification for Spatially Misaligned Data

    oai:arXiv.org:2502.10584v4

    arXiv:2502.10584v4 Announce Type: replace Abstract: Spatial misalignment arises when datasets are aggregated or collected at different spatial scales, leading to information loss. We develop a Bayesian disaggregation framework that links misaligned data to a continuous-domain model through an iteratively linearised integration scheme implemented with the Integrated Nested Laplace Approximation (INLA). The framework accommodates different ways of handling observations depending on the application, resulting in four variants: (i) Raster at Full Resolution, (ii) Raster Aggregation, (iii) Polygon Aggregation (PolyAgg), and (iv) Point Values (PointVal). The first three represent increasing levels of spatial averaging, while the last two address situations with incomplete covariate information. For PolyAgg and PointVal, we reconstruct the covariate field using three strategies (Value Plugin, Joint Uncertainty, and Uncertainty Plugin), with the latter two propagating uncertainty. We illustrate the framework with an example motivated by landslide modelling, focusing on methodology rather than interpreting landslide processes. Simulations show that uncertainty-propagating approaches outperform the Value Plugin method and remain robust under model misspecification. Point-pattern observations and full-resolution covariates are therefore preferable, and when covariate fields are incomplete, uncertainty-aware methods are most reliable. The framework is well suited to landslide susceptibility modelling and other spatial mapping tasks, and integrates seamlessly with INLA-based tools.

    https://arxiv.org/abs/2502.10584


    Learning and Computation of $\Phi$-Equilibria at the Frontier of Tractability

    oai:arXiv.org:2502.18582v3

    arXiv:2502.18582v3 Announce Type: replace Abstract: $\Phi$-equilibria -- and the associated notion of $\Phi$-regret -- are a powerful and flexible framework at the heart of online learning and game theory, whereby enriching the set of deviations $\Phi$ begets stronger notions of rationality. Recently, Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '24) -- abbreviated as DFFPS -- settled the existence of efficient algorithms when $\Phi$ contains only linear maps under a general, $d$-dimensional convex constraint set $\mathcal{X}$. In this paper, we significantly extend their work by resolving the case where $\Phi$ is $k$-dimensional; degree-$\ell$ polynomials constitute a canonical such example with $k = d^{O(\ell)}$. In particular, positing only oracle access to $\mathcal{X}$, we obtain two main positive results: i) a $\text{poly}(n, d, k, \text{log}(1/\epsilon))$-time algorithm for computing $\epsilon$-approximate $\Phi$-equilibria in $n$-player multilinear games, and ii) an efficient online algorithm that incurs average $\Phi$-regret at most $\epsilon$ using $\text{poly}(d, k)/\epsilon^2$ rounds. We also show nearly matching lower bounds in the online learning setting, thereby obtaining for the first time a family of deviations that captures the learnability of $\Phi$-regret. From a technical standpoint, we extend the framework of DFFPS from linear maps to the more challenging case of maps with polynomial dimension. At the heart of our approach is a polynomial-time algorithm for computing an expected fixed point of any $\phi : \mathcal{X} \to \mathcal{X}$ based on the ellipsoid against hope (EAH) algorithm of Papadimitriou and Roughgarden (JACM '08). In particular, our algorithm for computing $\Phi$-equilibria is based on executing EAH in a nested fashion -- each step of EAH itself being implemented by invoking a separate call to EAH.

    https://arxiv.org/abs/2502.18582


    CRPS-Based Targeted Sequential Design with Application in Chemical Space

    oai:arXiv.org:2503.11250v2

    arXiv:2503.11250v2 Announce Type: replace Abstract: Sequential design of real and computer experiments via Gaussian Process (GP) models has proven useful for parsimonious, goal-oriented data acquisition purposes. In this work, we focus on acquisition strategies for a GP model that needs to be accurate within a predefined range of the response of interest. Such an approach is useful in various fields including synthetic chemistry, where finding molecules with particular properties is essential for developing useful materials and effective medications. GP modeling and sequential design of experiments have been successfully applied to a plethora of domains, including molecule research. Our main contribution here is to use the threshold-weighted Continuous Ranked Probability Score (CRPS) as a basic building block for acquisition functions employed within sequential design. We study pointwise and integral criteria relying on two different weighting measures and benchmark them against competitors, demonstrating improved performance with respect to considered goals. The resulting acquisition strategies are applicable to a wide range of fields and pave the way to further developing sequential design relying on scoring rules.

    https://arxiv.org/abs/2503.11250
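
    For readers unfamiliar with the score used above, the threshold-weighted CRPS of a predictive CDF F at observation y is the integral of w(z) (F(z) - 1{y <= z})^2 over z. The sketch below evaluates it numerically for a Gaussian predictive distribution, as a GP posterior at one point would give, with an indicator weight on a target range [lo, hi]. The function names, the midpoint rule, and the example numbers are assumptions for illustration; the paper's acquisition criteria built on this score are not reproduced.

```python
import math

def normal_cdf(z, mu, sigma):
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

def tw_crps_normal(mu, sigma, y, lo, hi, n=20000):
    """Midpoint-rule value of the integral over [lo, hi] of
    (F(z) - 1{y <= z})^2, where F is the N(mu, sigma^2) CDF."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        ind = 1.0 if y <= z else 0.0
        total += (normal_cdf(z, mu, sigma) - ind) ** 2
    return total * h

def crps_normal(mu, sigma, y):
    """Closed-form (unweighted) CRPS of a Gaussian forecast."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# Hypothetical forecast N(0, 1) and observation 0.3.
full = tw_crps_normal(0.0, 1.0, 0.3, lo=-12.0, hi=12.0)   # weight ~ 1 everywhere
targeted = tw_crps_normal(0.0, 1.0, 0.3, lo=1.0, hi=3.0)  # only [1, 3] matters
exact = crps_normal(0.0, 1.0, 0.3)
```

    With a weight that covers essentially the whole real line the numerical value matches the closed-form CRPS, while the targeted version only penalizes forecast errors inside the range of interest.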


    Predicting data value before collection: A coefficient for prioritizing sources under random distribution shift

    oai:arXiv.org:2504.06570v4

    arXiv:2504.06570v4 Announce Type: replace Abstract: Researchers often face choices between multiple data sources that differ in quality, cost, and representativeness. Which sources will most improve predictive performance? We study this data prioritization problem under a random distribution shift model, where candidate sources arise from random perturbations to a target population. We propose the Data Usefulness Coefficient (DUC), which predicts the reduction in prediction error from adding a dataset to training, using only covariate summary statistics and no outcome data. We prove that under random shifts, covariate differences between sources are informative about outcome prediction quality. Through theory and experiments on synthetic and real data, we demonstrate that DUC-based selection outperforms alternative strategies, allowing more efficient resource allocation across heterogeneous data sources. The method provides interpretable rankings between candidate datasets and works for any data modality, including ordinal, categorical, and continuous data.

    https://arxiv.org/abs/2504.06570


    Multilevel Sampling in Algebraic Statistics

    oai:arXiv.org:2505.04062v2

    arXiv:2505.04062v2 Announce Type: replace Abstract: This paper proposes a multilevel sampling algorithm for fiber sampling problems in algebraic statistics, inspired by Henry Wynn's suggestion to adapt multilevel Monte Carlo (MLMC) ideas to discrete models. Focusing on log-linear models, we sample from high-dimensional lattice fibers defined by algebraic constraints. Building on Markov basis methods and results from Diaconis and Sturmfels, our algorithm uses variable step sizes to accelerate exploration and reduce the need for long burn-in. We introduce a novel Fiber Coverage Score (FCS) based on Voronoi partitioning to assess sample quality, and highlight the utility of the Maximum Mean Discrepancy (MMD) quality metric. Simulations on benchmark fibers show that multilevel sampling outperforms naive MCMC approaches. Our results demonstrate that multilevel methods, when properly applied, provide practical benefits for discrete sampling in algebraic statistics.

    https://arxiv.org/abs/2505.04062


    Credal Prediction based on Relative Likelihood

    oai:arXiv.org:2505.22332v2

    arXiv:2505.22332v2 Announce Type: replace Abstract: Predictions in the form of sets of probability distributions, so-called credal sets, provide a suitable means to represent a learner's epistemic uncertainty. In this paper, we propose a theoretically grounded approach to credal prediction based on the statistical notion of relative likelihood: The target of prediction is the set of all (conditional) probability distributions produced by the collection of plausible models, namely those models whose relative likelihood exceeds a specified threshold. This threshold has an intuitive interpretation and allows for controlling the trade-off between correctness and precision of credal predictions. We tackle the problem of approximating credal sets defined in this way by means of suitably modified ensemble learning techniques. To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.

    https://arxiv.org/abs/2505.22332
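
    The relative-likelihood rule described above can be sketched for a finite collection of fitted models: a model is kept as plausible if its likelihood is at least a fraction alpha of the best likelihood, and the credal prediction is the range of predictions over the kept models. The function names and the example numbers are assumptions for illustration; the paper approximates such sets with modified ensemble learning, which is not shown here.

```python
import math

def plausible_models(log_liks, alpha):
    """Indices of models whose relative likelihood L / max L is >= alpha."""
    best = max(log_liks)
    return [i for i, ll in enumerate(log_liks) if math.exp(ll - best) >= alpha]

def credal_interval(log_liks, probs, alpha):
    """Lower and upper predicted probability over the plausible models."""
    keep = plausible_models(log_liks, alpha)
    vals = [probs[i] for i in keep]
    return min(vals), max(vals)

# Hypothetical ensemble: training log-likelihoods and each model's
# predicted probability of the positive class for one query point.
log_liks = [-10.0, -10.5, -14.0]
probs = [0.80, 0.65, 0.30]

tight = credal_interval(log_liks, probs, alpha=0.5)
wide = credal_interval(log_liks, probs, alpha=0.01)
```

    Lowering the threshold admits more models and widens the interval, which is the trade-off between correctness and precision that the abstract refers to.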


    Bayesian Inference for Non-Gaussian Simultaneous Autoregressive Models with Missing Data

    oai:arXiv.org:2505.23070v2

    arXiv:2505.23070v2 Announce Type: replace Abstract: Standard simultaneous autoregressive (SAR) models typically assume normally distributed errors, an assumption often violated in real-world datasets that frequently exhibit non-normal, skewed, or heavy-tailed characteristics. New SAR models are proposed to capture these non-Gaussian features. The spatial error model (SEM), a widely used SAR-type model, is considered. Three novel SEMs are introduced, extending the standard Gaussian SEM. These extensions incorporate Student's $t$-distributed errors to accommodate heavy-tailed behaviour, one-to-one transformations of the response variable to address skewness, or a combination of both. Variational Bayes (VB) estimation methods are developed for these models, and the framework is further extended to handle missing response data under the missing not at random (MNAR) mechanism. Standard VB methods perform well with complete datasets; however, handling missing data requires a hybrid VB (HVB) approach, which integrates a Markov chain Monte Carlo (MCMC) sampler to generate missing values. The proposed VB methods are evaluated using both simulated and real-world datasets, demonstrating their robustness and effectiveness in dealing with non-Gaussian data and missing data in spatial models. Although the method is demonstrated using SAR models, the proposed model specifications and estimation approaches are widely applicable to various types of models for handling non-Gaussian data with missing values.

    https://arxiv.org/abs/2505.23070


    Projected Bayesian Spatial Factor Models

    oai:arXiv.org:2506.01098v2

    arXiv:2506.01098v2 Announce Type: replace Abstract: Factor models balance flexibility, identifiability, and computational efficiency, with Bayesian spatial factor models particularly prone to identifiability challenges and scaling limitations. This work introduces Projected Bayesian Spatial Factor (PBSF) models, a new class of models designed to achieve scalability and robust identifiability for spatial factor analysis. PBSF models are defined through a novel Markov chain Monte Carlo construction, Projected MCMC (ProjMC$^2$), which leverages conditional conjugacy and projection to improve posterior stability and mixing by constraining factor sampling to a scaled Stiefel manifold. Theoretical results establish convergence of ProjMC$^2$ irrespective of initialisation. By integrating scalable univariate spatial modelling, PBSF provides a flexible and interpretable framework for low-dimensional spatial representation learning of massive spatial data. Simulation studies demonstrate substantial efficiency and robustness gains, and an application to human kidney spatial transcriptomics data highlights the practical utility of the proposed methodology for improving interpretability in spatial omics.

    https://arxiv.org/abs/2506.01098


    Goodness-of-fit testing for the stationary density of a size-structured PDE

    oai:arXiv.org:2506.05103v2

    arXiv:2506.05103v2 Announce Type: replace Abstract: We consider two division models for structured cell populations, where cells can grow, age and divide. These models have been introduced in the literature under the denomination of `mitosis' and `adder' models. In recent years, there has been an increasing interest in biology to understand whether the cells divide equally or not, as this can be related to important mechanisms in cellular aging or recovery. We are therefore interested in testing the null hypothesis $H_0$ where the division of a mother cell results in two daughters of equal size, against the alternative hypothesis $H_1$ where the division is asymmetric and ruled by a kernel that is absolutely continuous with respect to the Lebesgue measure. The sample consists of i.i.d. observations of cell sizes and ages drawn from the population, and the division is not directly observed. The hypotheses of the test are reformulated as hypotheses on the stationary size and age distributions of the models, which we assume are also the distributions of the observations. We propose a goodness-of-fit test that we study numerically on simulated data before applying it to real data.

    https://arxiv.org/abs/2506.05103


    Extreme mass distributions for quasi-copulas

    oai:arXiv.org:2506.20672v2

    arXiv:2506.20672v2 Announce Type: replace Abstract: A recent survey, nicknamed "Hitchhiker's Guide", J.J. Arias-García, R. Mesiar, and B. De Baets, A hitchhiker's guide to quasi-copulas, Fuzzy Sets and Systems 393 (2020) 1-28, has raised the rating of quasi-copula problems in the dependence modeling community in spite of the lack of statistical interpretation of quasi-copulas. In our previous work (Fuzzy Sets and Systems 517 (2025) 109457), we addressed the question of extreme values of the mass distribution associated with multivariate quasi-copulas. Using a linear programming approach, we were able to solve Open Problem 5 of the "Guide" up to dimension d = 17 and disprove a recent conjecture on the solution to that problem. In this paper, we use an analytical approach to provide a complete answer to the original question.

    https://arxiv.org/abs/2506.20672


    Differential Distance Correlation and Its Applications

    oai:arXiv.org:2507.00524v2

    arXiv:2507.00524v2 Announce Type: replace Abstract: In this paper, we propose a novel Euclidean-distance-based coefficient, named differential distance correlation, to measure the strength of dependence between a random variable $ Y \in \mathbb{R} $ and a random vector $ \boldsymbol{X} \in \mathbb{R}^p $. The coefficient has a concise expression and is invariant to arbitrary orthogonal transformations of the random vector. Moreover, the coefficient is a strongly consistent estimator of a simple and interpretable dependence measure, which is 0 if and only if $ \boldsymbol{X} $ and $ Y $ are independent and equal to 1 if and only if $ Y $ determines $ \boldsymbol{X} $ almost surely. An alternative approach is also proposed to address the limitation that the coefficient is non-robust to outliers. Furthermore, the coefficient exhibits asymptotic normality with a simple variance under the independence hypothesis, facilitating fast and accurate estimation of the $ p $-value for testing independence. Three simulation experiments show that the proposed coefficient is more computationally efficient for independence testing and more effective in detecting oscillatory relationships than several competing methods. We also apply our method to analyze a real data example.

    https://arxiv.org/abs/2507.00524


    Minority representation and fairness in network ranking: An application to school contact diary data

    oai:arXiv.org:2507.01136v2

    arXiv:2507.01136v2 Announce Type: replace Abstract: Considerations of bias, fairness and representation are a prerequisite of responsible modern statistics. In statistical network analysis, observed networks are often incomplete or systematically biased, which can lead to systematic underrepresentation of protected groups, and affect any downstream ranking or decision based on the observed network. In this paper, we study a high school contact network constructed from self-reported contact diaries and introduce a formal measure of minority representation, defined as the proportion of minority nodes among the top-ranked individuals. We model systematic bias through group-dependent missing edge mechanisms and develop statistical methods to estimate and test for such bias. When bias is detected, we propose a re-ranking procedure based on an asymptotic approximation that improves group representation. Applying the framework to the high school contact network reveals systematic underreporting of cross-group contacts consistent with recall bias. These findings highlight the importance of modeling and correcting systematic bias in social networks with heterogeneous groups.

    https://arxiv.org/abs/2507.01136
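
    The representation measure described in the abstract above (the proportion of minority nodes among the top-ranked individuals) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `minority_representation`, the use of a generic per-node score (for example degree), and the tie-breaking by node label are all assumptions made here.

```python
def minority_representation(scores, minority, k):
    """Share of minority nodes among the k highest-scoring nodes.

    scores   -- dict mapping node -> ranking score (e.g. degree); hypothetical input
    minority -- set of nodes belonging to the minority group
    k        -- number of top-ranked nodes to inspect
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of nodes")
    # Sort by descending score; ties are broken by node label (an assumption).
    ranked = sorted(scores, key=lambda node: (-scores[node], str(node)))
    top_k = ranked[:k]
    return sum(1 for node in top_k if node in minority) / k


if __name__ == "__main__":
    degrees = {"a": 5, "b": 4, "c": 3, "d": 1}
    print(minority_representation(degrees, {"b", "d"}, k=2))
```

    Comparing this share with the minority share of the whole network gives a rough sense of under- or over-representation in the ranking.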


    A New Integrative Learning Framework for Integrating Multiple Secondary Outcomes into Primary Outcome Analysis: A Case Study on Liver Health

    oai:arXiv.org:2507.18865v2

    arXiv:2507.18865v2 Announce Type: replace Abstract: In the era of big data, secondary outcomes have become increasingly important alongside primary outcomes. These secondary outcomes, which can be derived from traditional endpoints in clinical trials, compound measures, or risk prediction scores, hold the potential to enhance the analysis of primary outcomes. Our method is motivated by the challenge of utilizing multiple secondary outcomes, such as blood biochemistry markers and urine assays, to improve the analysis of the primary outcome related to liver health. Current integration methods often fall short, as they impose strong model assumptions or require prior knowledge to construct over-identified working functions. This paper addresses these statistical challenges and potentially opens a new avenue in data integration by introducing a novel integrative learning framework that is applicable in a general setting. The proposed framework allows for the robust, data-driven integration of information from multiple secondary outcomes, promotes the development of efficient learning algorithms, and ensures optimal use of available data. Extensive simulation studies demonstrate that the proposed method significantly reduces variance in primary outcome analysis, outperforming existing integration approaches. Additionally, applying this method to UK Biobank (UKB) reveals that cigarette smoking is associated with increased fatty liver measures, with these effects being particularly pronounced in the older adult cohort.

    https://arxiv.org/abs/2507.18865


    Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: "One Map, Many Trials" in Satellite-Driven Poverty Analysis

    oai:arXiv.org:2508.01341v4

    arXiv:2508.01341v4 Announce Type: replace Abstract: Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can be leveraged across multiple causal trials while addressing chronic data scarcity in global development research. However, because standard training objectives prioritize overall predictive accuracy, these predictions often suffer from shrinkage toward the mean, leading to attenuated estimates of causal treatment effects and limiting their utility in policy evaluations. Existing debiasing methods, such as Prediction-Powered Inference (PPI), can handle this attenuation bias but require additional fresh ground-truth data at the downstream stage of causal inference, which restricts their applicability in data-scarce environments. We introduce and evaluate two post-hoc correction methods -- Linear Calibration Correction (LCC) and a Tweedie's correction approach -- that substantially reduce shrinkage-induced prediction bias without relying on newly collected labeled data. LCC applies a simple linear transformation estimated on a held-out calibration split; Tweedie's method locally de-shrinks predictions using density score estimates and a noise scale learned upstream. We provide practical diagnostics for when a correction is warranted and discuss practical limitations. Across analytical results, simulations, and experiments with Demographic and Health Surveys (DHS) data, both approaches reduce attenuation; Tweedie's correction yields nearly unbiased treatment-effect estimates, enabling a "one map, many trials" paradigm. Although we demonstrate on EO-ML wealth mapping, the methods are not geospatial-specific: they apply to any setting where imputed outcomes are reused downstream (e.g., pollution indices, population density, or LLM-derived indicators).

    https://arxiv.org/abs/2508.01341
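
    The idea of a linear correction estimated on a held-out calibration split can be illustrated with a small sketch. The abstract does not give the exact form of LCC, so the version below is an assumption: it fits `prediction ≈ intercept + slope * truth` on the calibration split (a slope below 1 indicates shrinkage toward the mean) and inverts that line. The function names are hypothetical.

```python
def fit_shrinkage_line(truth, predictions):
    """Least-squares fit of predictions ~ intercept + slope * truth on a calibration split."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predictions))
    var = sum((t - mean_t) ** 2 for t in truth)
    slope = cov / var
    intercept = mean_p - slope * mean_t
    return intercept, slope


def linear_calibration_correct(prediction, intercept, slope):
    """Undo the fitted shrinkage by inverting the calibration line (assumed form)."""
    return (prediction - intercept) / slope


if __name__ == "__main__":
    # Toy calibration split where predictions are shrunk: pred = 1 + 0.5 * truth.
    a, b = fit_shrinkage_line([0, 1, 2, 3], [1.0, 1.5, 2.0, 2.5])
    print(a, b, linear_calibration_correct(1.25, a, b))
```

    The fitted parameters can then be reused for downstream predictions without collecting new labels, which is the property the abstract emphasises.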


    Polynomial Log-Marginals and Tweedie's Formula : When Is Bayes Possible?

    oai:arXiv.org:2509.05823v2

    arXiv:2509.05823v2 Announce Type: replace Abstract: Motivated by Tweedie's formula for the Compound Decision problem, we examine the theoretical foundations of empirical Bayes estimators that directly model the marginal density $m(y)$. Our main result shows that polynomial log-marginals of degree $k \ge 3 $ cannot arise from any valid prior distribution in exponential family models, while quadratic forms correspond exactly to Gaussian priors. This provides theoretical justification for why certain empirical Bayes decision rules, while practically useful, do not correspond to any formal Bayes procedures. We also strengthen the diagnostic by showing that a marginal is a Gaussian convolution only if it extends to a bounded solution of the heat equation in a neighborhood of the smoothing parameter, beyond the convexity of $c(y)=\tfrac12 y^2+\log m(y)$.

    https://arxiv.org/abs/2509.05823
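
    For readers unfamiliar with Tweedie's formula, the Gaussian-noise case states that the posterior mean is $E[\theta \mid y] = y + \sigma^2 \frac{d}{dy}\log m(y)$, where $m$ is the marginal density of $y$. The sketch below is an illustration under that assumption (Gaussian noise with known variance `sigma2`); the function names and the finite-difference step `h` are choices made here. It checks the quadratic log-marginal case, which corresponds to a Gaussian prior and gives the usual linear shrinkage.

```python
import math


def tweedie_posterior_mean(y, log_marginal, sigma2, h=1e-5):
    """Tweedie's formula with a central finite-difference score estimate."""
    score = (log_marginal(y + h) - log_marginal(y - h)) / (2.0 * h)
    return y + sigma2 * score


def gaussian_log_marginal(tau2, sigma2):
    """Log-marginal of y when theta ~ N(0, tau2) and y | theta ~ N(theta, sigma2)."""
    var = tau2 + sigma2
    return lambda y: -0.5 * y * y / var - 0.5 * math.log(2.0 * math.pi * var)


if __name__ == "__main__":
    sigma2, tau2, y = 1.0, 3.0, 2.0
    estimate = tweedie_posterior_mean(y, gaussian_log_marginal(tau2, sigma2), sigma2)
    # Closed form for a Gaussian prior: y * tau2 / (tau2 + sigma2).
    print(estimate, y * tau2 / (tau2 + sigma2))
```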


    Sequential Design for the Efficient Estimation of Offshore Structure Failure Probability

    oai:arXiv.org:2509.18319v2

    arXiv:2509.18319v2 Announce Type: replace Abstract: Estimation of the failure probability of offshore structures exposed to extreme ocean environments is critical to their safe design and operation. The conditional density of the environment (CDE) quantifies regions of the space of long term environment responsible for extreme structural response. Moreover, the probability of structural failure is obtained by simply integrating the CDE over the environment space. In this work, two methodologies for estimation of the CDE and failure probability are considered. The first (IS-PT) combines parallel tempering MCMC (for CDE estimation) with importance sampling (for eventual estimation of failure probability). The second (AGE) combines adaptive Gaussian emulation with Bayesian quadrature. We evaluate IS-PT and two variants of the AGE procedure in application to a simple synthetic structure with multimodal CDE, and a monopile structure exhibiting non-linear resonant response. IS-PT provides reliable results for both applications at lower compute cost than naive integration. The AGE procedures require balancing exploration and exploitation of the environment space, using a typically-unknown weight parameter, lambda. When lambda is known, perhaps from prior engineering knowledge, AGE provides a further reduction in computational cost over IS-PT. However, when unknown, IS-PT is more reliable.

    https://arxiv.org/abs/2509.18319
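
    The importance sampling step mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the IS-PT method (there is no parallel tempering and no structural model); it only shows how a proposal concentrated in the failure region gives a usable estimate of a small probability. The one-dimensional standard normal "environment", the failure rule `x > threshold`, and the shifted normal proposal are all assumptions for illustration.

```python
import math
import random


def failure_probability_is(threshold, n_samples=20000, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling.

    Proposal: N(threshold, 1), which places about half its samples in the
    failure region instead of almost none.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Density ratio N(0, 1) / N(threshold, 1) evaluated at x.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


if __name__ == "__main__":
    estimate = failure_probability_is(4.0)
    exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
    print(estimate, exact)
```

    Naive Monte Carlo with the same number of samples would typically observe zero or one failure at this threshold, which is why a targeted proposal matters for rare-event estimation.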


    Z-scores-based methods and their application to biological monitoring: An extended analysis of professional soccer players and cyclists athletes

    oai:arXiv.org:2510.01810v2

    arXiv:2510.01810v2 Announce Type: replace Abstract: The increase in the collection of biological data allows for the individual and longitudinal monitoring of hematological or urine biomarkers. However, identifying abnormal behavior in these biological sequences is not trivial. Moreover, the complexity of the biological data (correlation between biomarkers, seasonal effects, etc.) is also an issue. Z-score methods can help assess the abnormality in these longitudinal sequences while capturing some features of the biological complexity. This work details a statistical framework for handling biological sequences using three custom Z-score methods in the intra-individual variability scope. These methods can detect abnormal samples in the longitudinal sequences with respect to the seasonality, chronological time or correlation between biomarkers. One of these methods is an extension of one custom Z-score method to the Gaussian linear model, which allows for including additional variables in the model design. We illustrate the use of the framework on the longitudinal data of 3,936 professional soccer players (5 biomarkers) and 1,683 amateur or professional cyclists (10 biomarkers). The results show that a particular Z-score method, designed to detect a change in a series of consecutive observations, measured a high proportion of abnormal values (more than three times the false positive rate) in the ferritin and IGF1 biomarkers for both data sets. The proposed framework and methods could be applied in other contexts, such as the clinical patient follow-up in monitoring abnormal values of biological markers. The methods are flexible enough to include more complicated biological features, which can be directly incorporated into the model design.

    https://arxiv.org/abs/2510.01810
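
    As a point of reference for the intra-individual setting, a plain Z-score compares a new sample with the same individual's earlier samples. The sketch below is the generic textbook version, not one of the three custom methods from the paper; the function name and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def intra_individual_z(history, new_value):
    """Z-score of a new biomarker value relative to the individual's own history.

    Uses the sample standard deviation; a constant history (zero spread)
    raises ZeroDivisionError, and fewer than two prior samples is rejected.
    """
    if len(history) < 2:
        raise ValueError("need at least two prior samples")
    return (new_value - mean(history)) / stdev(history)


if __name__ == "__main__":
    # Hypothetical biomarker values for one individual.
    print(intra_individual_z([10.0, 12.0, 14.0], 16.0))
```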


    A PyTorch Framework for Scalable Non-Crossing Quantile Regression

    oai:arXiv.org:2510.22419v2

    arXiv:2510.22419v2 Announce Type: replace Abstract: Quantile regression is fundamental to distributional modeling, yet independent estimation of multiple quantiles frequently produces crossing -- where estimated quantile functions violate monotonicity, implying impossible negative probability densities. While Constrained Joint Quantile Regression (CJQR) elegantly enforces non-crossing by construction, existing formulations via Linear Programming exhibit $O((qn)^3)$ complexity, rendering them impractical for large-scale applications. We present the first scalable solution using PyTorch automatic differentiation: CJQR-ALM, combining the Augmented Lagrangian Method with differentiable pinball loss and L-BFGS optimization. Our approach reduces computational complexity to $O(n)$, achieving near-zero crossing rates on datasets exceeding 70,000 observations within minutes. The differentiable formulation naturally extends to neural network architectures for non-linear conditional quantile estimation. Application to Student Growth Percentile calculations demonstrates practical utility for educational assessment, while simulation studies show negligible accuracy cost (RMSE increase $\approx 2.4$ points) relative to unconstrained estimation -- a favorable trade-off for applications requiring valid probability statements across finance, healthcare, and engineering.

    https://arxiv.org/abs/2510.22419
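
    Two ingredients named in the abstract, the pinball loss and the notion of quantile crossing, are easy to state in code. The sketch below uses plain Python rather than PyTorch and does not implement CJQR-ALM; the function names are hypothetical and the crossing check simply counts points where a lower-quantile prediction exceeds a higher-quantile one.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (check) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def crossing_rate(lower_preds, upper_preds):
    """Fraction of points where the lower-quantile prediction exceeds the upper one."""
    crossings = sum(1 for lo, hi in zip(lower_preds, upper_preds) if lo > hi)
    return crossings / len(lower_preds)


if __name__ == "__main__":
    print(pinball_loss([1.0, 2.0], [0.0, 3.0], tau=0.9))
    print(crossing_rate([1.0, 2.0, 3.0], [2.0, 1.0, 4.0]))
```

    A constrained joint fit would minimise the sum of pinball losses over all quantile levels while forcing the crossing rate to zero; the augmented Lagrangian approach in the abstract is one way to impose that constraint.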


    Optimal Convergence Analysis of DDPM for General Distributions

    oai:arXiv.org:2510.27562v2

    arXiv:2510.27562v2 Announce Type: replace Abstract: Score-based diffusion models have achieved remarkable empirical success in generating high-quality samples from target data distributions. Among them, the Denoising Diffusion Probabilistic Model (DDPM) is one of the most widely used samplers, generating samples via estimated score functions. Despite its empirical success, a tight theoretical understanding of DDPM -- especially its convergence properties -- remains limited. In this paper, we provide a refined convergence analysis of the DDPM sampler and establish near-optimal convergence rates under general distributional assumptions. Specifically, we introduce a relaxed smoothness condition parameterized by a constant $L$, which is small for many practical distributions (e.g., Gaussian mixture models). We prove that the DDPM sampler with accurate score estimates achieves a convergence rate of $$\widetilde{O}\left(\frac{d\min\{d,L^2\}}{T^2}\right)~\text{in Kullback-Leibler divergence},$$ where $d$ is the data dimension, $T$ is the number of iterations, and $\widetilde{O}$ hides polylogarithmic factors in $T$. This result substantially improves upon the best-known $d^2/T^2$ rate when $L < \sqrt{d}$. By establishing a matching lower bound, we show that our convergence analysis is tight for a wide array of target distributions. Moreover, it reveals that DDPM and DDIM share the same dependence on $d$, raising an interesting question of why DDIM often appears empirically faster.

    https://arxiv.org/abs/2510.27562


    Nonparametric Least squares estimators for interval censoring

    oai:arXiv.org:2511.01103v4

    arXiv:2511.01103v4 Announce Type: replace Abstract: The limit distribution of the nonparametric maximum likelihood estimator for interval censored data with more than one observation time per unobservable observation, is still unknown in general. For the so-called separated case, where one has observation times which are at a distance larger than a fixed $\epsilon>0$, the limit distribution was derived in [4]. For the non-separated case there is a conjectured limit distribution, given in [9], Section 5.2 of Part 2. But the findings of the present paper suggest that this conjecture may not hold. We prove consistency of a closely related nonparametric isotonic least squares estimator and give a sketch of the proof for a result on its limit distribution. We also provide simulation results to show how the nonparametric MLE and least squares estimator behave in comparison. Moreover, we discuss a simpler least squares estimator that can be computed in one step, but is inferior to the other least squares estimator, since it does not use all information. For the simplest model of interval censoring, the current status model, the nonparametric maximum likelihood and least squares estimators are the same. This equivalence breaks down if there are more observation times per unobservable observation. The computations for the simulation of the more complicated interval censoring model were performed by using the iterative convex minorant algorithm. They are provided in the GitHub repository [6].

    https://arxiv.org/abs/2511.01103


    Asymptotics of constrained $M$-estimation under convexity

    oai:arXiv.org:2511.04612v2

    arXiv:2511.04612v2 Announce Type: replace Abstract: M-estimation, aka empirical risk minimization, is at the heart of statistics and machine learning: Classification, regression, location estimation, etc. Asymptotic theory is well understood when the loss satisfies some smoothness assumptions and its derivatives are dominated locally. However, these conditions are typically technical and can be too restrictive or heavy to check. Here, we consider the case of a convex loss function, which may not even be differentiable: We establish an asymptotic theory for M-estimation with convex loss (which need not be differentiable) under convex constraints. We show that the asymptotic distributions of the corresponding M-estimators depend on an interplay between the loss function and the boundary structure of the set of constraints. We extend our results to U-estimators, building on the asymptotic theory of U-statistics. Applications of our work include, among others, robust location/scatter estimation, estimation of deepest points relative to depth functions such as Oja's depth, etc.

    https://arxiv.org/abs/2511.04612


    Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions

    oai:arXiv.org:2511.09465v2

    arXiv:2511.09465v2 Announce Type: replace Abstract: Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and `multimodal' product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.

    https://arxiv.org/abs/2511.09465


    Source apportionment of air pollution burden using geometric non-negative matrix factorization and high-throughput multi-pollutant air sensor data in Curtis Bay, Baltimore, USA

    oai:arXiv.org:2511.11833v2

    arXiv:2511.11833v2 Announce Type: replace Abstract: Air sensor networks provide hyperlocal, high temporal resolution data on multiple pollutants that can support credible identification of common pollution sources. Source apportionment using least squares-based non-negative matrix factorization is non-unique and often does not scale. A recent geometric source apportionment framework focuses inference on the source attribution matrix, which is shown to remain identifiable even when the factorization is not. Recognizing that the method scales with and benefits from large data volumes, we use this geometric method to analyze 451,946 one-minute air sensor records from Curtis Bay, collected from October 21, 2022 to June 16, 2023, covering size-resolved particulate matter (PM), black carbon (BC), carbon monoxide (CO), nitric oxide (NO), and nitrogen dioxide (NO2). The analysis identifies three stable sources. Source 1 explains > 70% of fine and coarse PM and ~30% of BC. Source 2 dominates CO and contributes ~70% of BC, NO, and NO2. Source 3 is specific to the larger PM fractions, PM10 to PM40. Regression analyses show Source 1 and Source 3 rise during bulldozer activity at a nearby coal terminal and under winds from the terminal, indicating a direct coal terminal influence, while Source 2 exhibits diurnal patterns consistent with traffic. A case-study on the day with a known bulldozer incident at the coal terminal further confirms the association of terminal activities with Sources 1 and 3. Extreme episodes identified from Source 1 intensity affected ~33 minutes per day at the study site nearest the coal terminal, with impacts attenuating at locations farther from the terminal. The results are stable under sensitivity analyses. The analysis demonstrates that geometric source apportionment, paired with high temporal resolution data from multi-pollutant air sensor networks, delivers scalable and reliable evidence to inform mitigation strategies.

    https://arxiv.org/abs/2511.11833


    Regularized Reduced Rank Regression for mixed predictor and response variables

    oai:arXiv.org:2511.16718v2

    arXiv:2511.16718v2 Announce Type: replace Abstract: In this paper, we introduce the Generalized Mixed Regularized Reduced Rank Regression model (GMR4), an extension of the GMR3 model designed to improve performance in high-dimensional settings. GMR3 is a regression method for a mix of numeric, binary and ordinal response variables, while also allowing for mixed-type predictors through optimal scaling. GMR4 extends this approach by incorporating regularization techniques, such as Ridge, Lasso, Group Lasso, or any combination thereof, making the model suitable for datasets with a large number of predictors or collinearity among them. In addition, we propose a cross-validation procedure that enables the estimation of the rank S and the penalty parameter lambda. Through a simulation study, we evaluate the performance of the model under different scenarios, varying the sample size, the number of non-informative predictors and response dimension. The results of the simulation study guide the choice of the penalty parameter lambda in the empirical application ISSP: Health and Healthcare I-II (2023), which includes mixed-type predictors and ordinal responses. In this application, the model results in a sparse and interpretable solution, with a limited set of influential predictors that provide insights into public attitudes toward healthcare.

    https://arxiv.org/abs/2511.16718


    Differentially Private Fisher Randomization Tests for Binary Outcomes

    oai:arXiv.org:2511.20884v2

    arXiv:2511.20884v2 Announce Type: replace Abstract: Across many disciplines, causal inference often relies on randomized experiments with binary outcomes. In such experiments, the Fisher randomization test provides exact, assumption-free tests for causal effects. Sometimes the outcomes are sensitive and must be kept confidential, for example, when they comprise physical or mental health measurements. Releasing test statistics or p-values computed with the confidential outcomes can leak information about the individuals in the study. Those responsible for sharing the analysis results may wish to bound this information leakage, which they can do by ensuring the released outputs satisfy differential privacy. In this article, we develop several differentially private versions of the Fisher randomization test for binary outcomes. Specifically, we consider direct perturbation approaches that inject calibrated noise into test statistics or p-values, as well as a mechanism-aware, Bayesian denoising framework that explicitly models the privacy mechanism. We further develop decision-making procedures under privacy constraints, including a Bayes risk-optimal rule and a frequentist-calibrated significance test. Through theoretical results, simulation studies, and an application to the ADAPTABLE clinical trial, we demonstrate that our methods can achieve valid and interpretable causal inference while ensuring the differential privacy guarantee.

    https://arxiv.org/abs/2511.20884
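
    For context, the non-private Fisher randomization test that the paper builds on can be sketched for a small completely randomized experiment. This is an illustration only: it contains no privacy mechanism, it enumerates every assignment exactly (feasible only for small samples), and the difference-in-means statistic and function name are choices made here.

```python
from itertools import combinations


def fisher_randomization_p_value(outcomes, treated):
    """Exact two-sided randomization p-value under the sharp null of no effect.

    outcomes -- list of 0/1 outcomes
    treated  -- list of 0/1 treatment indicators (needs both arms non-empty)
    """
    n = len(outcomes)
    n_treated = sum(treated)
    outcome_sum = sum(outcomes)

    def diff_in_means(treated_idx):
        treated_sum = sum(outcomes[i] for i in treated_idx)
        control_sum = outcome_sum - treated_sum
        return treated_sum / n_treated - control_sum / (n - n_treated)

    observed = abs(diff_in_means([i for i in range(n) if treated[i]]))
    count = 0
    total = 0
    # Under the sharp null, outcomes are fixed and only the assignment varies.
    for idx in combinations(range(n), n_treated):
        total += 1
        if abs(diff_in_means(idx)) >= observed - 1e-12:
            count += 1
    return count / total


if __name__ == "__main__":
    print(fisher_randomization_p_value([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]))
```

    The released p-value depends directly on individual outcomes, which is the leakage the differentially private variants in the abstract are designed to bound.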


    Reyes's I: Measuring Spatial Autocorrelation in Compositions

    oai:arXiv.org:2512.04289v2

    arXiv:2512.04289v2 Announce Type: replace Abstract: Compositional observations arise when measurements are recorded as parts of a whole, so that only relative information is meaningful and the natural sample space is the simplex equipped with Aitchison geometry. Despite extensive development of compositional methods, a direct analogue of Moran's \(I\) for assessing spatial autocorrelation in areal compositional data has been lacking. We propose Reyes's \(I\), a Moran type statistic defined through the Aitchison inner product and norm, which is invariant to scale, to permutations of the parts, and to the choice of the \(\operatorname{ilr}\) contrast matrix. Under the randomization assumption, we derive an upper bound, the expected value, and the noncentral second moment, and we describe exact and Monte Carlo permutation procedures for inference. Through simulations covering identical, independent, and spatially correlated compositions under multiple covariance structures and neighborhood definitions, we show that Reyes's \(I\) provides stable behavior, competitive calibration, and improved efficiency relative to a naive alternative based on averaging componentwise Moran statistics. We illustrate practical utility by studying the spatial dependence of a composition measuring COVID-19 severity across Colombian departments during January 2021, documenting significant positive autocorrelation early in the month that attenuates over time.

    https://arxiv.org/abs/2512.04289
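
    The classical scalar Moran's \(I\), which the proposed statistic generalises to compositions, is \(I = \frac{n}{S_0} \frac{\sum_{i,j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}\) with \(S_0 = \sum_{i,j} w_{ij}\). The sketch below implements only this scalar version; it does not use the Aitchison geometry, and the function name and dense list-of-lists weight matrix are assumptions.

```python
def morans_i(values, weights):
    """Classical Moran's I for scalar values and a spatial weight matrix.

    values  -- list of n observations, one per areal unit
    weights -- n x n list of lists; weights[i][j] > 0 if units i and j are neighbours
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    spread = sum(d * d for d in dev)
    return (n / s0) * cross / spread


if __name__ == "__main__":
    # Four areas on a line, each adjacent to its immediate neighbours.
    line = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
    print(morans_i([1.0, 2.0, 3.0, 4.0], line))    # smooth trend
    print(morans_i([1.0, -1.0, 1.0, -1.0], line))  # alternating pattern
```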


    ADOPT: Additive Optimal Transport Regression

    oai:arXiv.org:2512.08118v3

    arXiv:2512.08118v3 Announce Type: replace Abstract: Regression analysis for responses taking values in general metric spaces has received increasing attention, particularly for settings with Euclidean predictors $X \in \mathbb{R}^p$ and non-Euclidean responses $Y$ in metric spaces. While additive regression is a powerful tool for enhancing interpretability and mitigating the curse of dimensionality in the presence of multivariate predictors, its direct extension is hindered by the absence of vector space operations in general metric spaces. We propose a novel framework for additive optimal transport regression, which incorporates additive structure through optimal geodesic transports. A key idea is to extend the notion of optimal transports in Wasserstein spaces to general geodesic metric spaces. This unified approach accommodates a wide range of responses, including probability distributions, symmetric positive definite (SPD) matrices with various metrics and spherical data. The practical utility of the method is illustrated with correlation matrices derived from resting state fMRI brain imaging data.

    https://arxiv.org/abs/2512.08118


    Model-robust Inference for Seamless II/III Trials with Covariate Adaptive Randomization

    oai:arXiv.org:2512.09430v2

    arXiv:2512.09430v2 Announce Type: replace Abstract: Seamless phase II/III trials have become a cornerstone of modern drug development, offering a means to accelerate evaluation while maintaining statistical rigor. However, most existing inference procedures are model-based, designed primarily for continuous outcomes, and often neglect the stratification used in covariate-adaptive randomization (CAR), limiting their practical relevance. In this paper, we propose a unified, model-robust framework for seamless phase II/III trials grounded in generalized linear models (GLMs), enabling valid inference across diverse outcome types, estimands, and CAR schemes. Using Z-estimation, we derive the asymptotic properties of treatment effect estimators and explicitly characterize how their variance depends on the underlying randomization procedure. Based on these results, we develop adjusted Wald tests that, together with Dunnett's multiple-comparison procedure and the inverse chi-square combination method, ensure valid overall Type I error. Extensive simulation studies and a trial example demonstrate that the proposed model-robust tests achieve superior power and reliable inference compared to conventional approaches.

    https://arxiv.org/abs/2512.09430


    Learning Time-Varying Correlation Networks with FDR Control via Time-Varying P-values

    oai:arXiv.org:2512.10467v2

    arXiv:2512.10467v2 Announce Type: replace Abstract: This paper presents a systematic framework for controlling false discovery rate in learning time-varying correlation networks from high-dimensional, non-linear, non-Gaussian and non-stationary time series with an increasing number of potential abrupt change points in means. We propose a bootstrap-assisted approach to derive dependent and time-varying P-values from a robust estimate of time-varying correlation functions, which are not sensitive to change points. Our procedure is based on a new high-dimensional Gaussian approximation result for the uniform approximation of P-values across time and different coordinates. Moreover, we establish theoretically guaranteed Benjamini--Hochberg and Benjamini--Yekutieli procedures for the dependent and time-varying P-values, which can achieve uniform false discovery rate control. The proposed methods are supported by rigorous mathematical proofs and simulation studies. We also illustrate the real-world application of our framework using both brain electroencephalogram and financial time series data.

    https://arxiv.org/abs/2512.10467
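
    The Benjamini-Hochberg and Benjamini-Yekutieli procedures named in the abstract above can be sketched for an ordinary, static list of P-values. This is a minimal illustration of the classical step-up rules only; it does not implement the paper's bootstrap-derived, time-varying P-values, and the function name and arguments are chosen here for illustration.

```python
def step_up_fdr(pvalues, alpha=0.05, dependent=False):
    """Classical step-up FDR control on a static list of p-values.

    dependent=False: Benjamini-Hochberg thresholds  rank * alpha / m.
    dependent=True:  Benjamini-Yekutieli thresholds rank * alpha / (m * c_m),
                     where c_m = 1 + 1/2 + ... + 1/m.
    Returns a list of booleans (True = hypothesis rejected).
    """
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1)) if dependent else 1.0

    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Largest rank whose p-value falls below its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(step_up_fdr(p, alpha=0.05))
    print(step_up_fdr(p, alpha=0.05, dependent=True))
```

    The Benjamini-Yekutieli variant only rescales the thresholds by the harmonic sum, which makes it more conservative but valid under arbitrary dependence among the P-values.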


    Autotune: fast, accurate, and automatic tuning parameter selection for Lasso

    oai:arXiv.org:2512.11139v2

    arXiv:2512.11139v2 Announce Type: replace Abstract: Least absolute shrinkage and selection operator (Lasso), a popular method for high-dimensional regression, is now used widely for estimating high-dimensional time series models such as the vector autoregression (VAR). Selecting its tuning parameter efficiently and accurately remains a challenge, despite the abundance of available methods for doing so. We propose $\mathsf{autotune}$, a strategy for Lasso to automatically tune itself by optimizing a penalized Gaussian log-likelihood alternately over regression coefficients and noise standard deviation. Using extensive simulation experiments on regression and VAR models, we show that $\mathsf{autotune}$ is faster, and provides better generalization and model selection than established alternatives in low signal-to-noise regimes. In the process, $\mathsf{autotune}$ provides a new estimator of noise standard deviation that can be used for high-dimensional inference, and a new visual diagnostic procedure for checking the sparsity assumption on regression coefficients. Finally, we demonstrate the utility of $\mathsf{autotune}$ on a real-world financial data set. An R package based on C++ is also made publicly available on Github.

    https://arxiv.org/abs/2512.11139
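
    For readers unfamiliar with the role of the tuning parameter, here is a minimal sketch of a plain Lasso solver by coordinate descent, with the penalty level `lam` supplied by hand. It is not the autotune procedure from the abstract above (which selects the penalty itself); the objective scaling (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 and all names are assumptions made for this illustration.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z towards zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent. X is a list of rows, y a list of responses."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes
            # the current contribution of coordinate j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    X = [[1.0], [2.0]]
    y = [2.0, 4.0]
    print(lasso_coordinate_descent(X, y, lam=0.5))
```

    Larger values of `lam` shrink more coefficients exactly to zero, which is why the choice of this parameter drives both prediction error and model selection.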


    The Optimal Approximation Factor in Density Estimation

    oai:arXiv.org:1902.05876v5

    arXiv:1902.05876v5 Announce Type: replace-cross Abstract: Consider the following problem: given two arbitrary densities $q_1,q_2$ and a sample-access to an unknown target density $p$, find which of the $q_i$'s is closer to $p$ in total variation. A remarkable result due to Yatracos shows that this problem is tractable in the following sense: there exists an algorithm that uses $O(\epsilon^{-2})$ samples from $p$ and outputs $q_i$ such that with high probability, $TV(q_i,p) \leq 3\cdot\mathsf{opt} + \epsilon$, where $\mathsf{opt}= \min\{TV(q_1,p),TV(q_2,p)\}$. Moreover, this result extends to any finite class of densities $\mathcal{Q}$: there exists an algorithm that outputs the best density in $\mathcal{Q}$ up to a multiplicative approximation factor of 3. We complement and extend this result by showing that: (i) the factor 3 cannot be improved if one restricts the algorithm to output a density from $\mathcal{Q}$, and (ii) if one allows the algorithm to output arbitrary densities (e.g. a mixture of densities from $\mathcal{Q}$), then the approximation factor can be reduced to 2, which is optimal. In particular this demonstrates an advantage of improper learning over proper in this setup. We develop two approaches to achieve the optimal approximation factor of 2: an adaptive one and a static one. Both approaches are based on a geometric point of view of the problem and rely on estimating surrogate metrics to the total variation. Our sample complexity bounds exploit techniques from Adaptive Data Analysis.

    https://arxiv.org/abs/1902.05876
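
    The two-candidate selection problem in the abstract above is classically handled by the Scheffé test, which underlies the factor-3 guarantee. Below is a minimal sketch for distributions on a finite support, represented as dictionaries from outcome to probability; that representation and the function name are assumptions for illustration, and the sketch is the proper (factor-3) rule, not the paper's improper factor-2 algorithms.

```python
def scheffe_select(q1, q2, samples):
    """Return 1 or 2: the candidate whose mass on the Scheffe set
    A = {x : q1(x) > q2(x)} is closer to the empirical mass of A."""
    support = set(q1) | set(q2)
    scheffe_set = {x for x in support if q1.get(x, 0.0) > q2.get(x, 0.0)}

    q1_mass = sum(q1.get(x, 0.0) for x in scheffe_set)
    q2_mass = sum(q2.get(x, 0.0) for x in scheffe_set)
    empirical_mass = sum(1 for s in samples if s in scheffe_set) / len(samples)

    if abs(q1_mass - empirical_mass) <= abs(q2_mass - empirical_mass):
        return 1
    return 2


if __name__ == "__main__":
    q1 = {0: 0.7, 1: 0.3}
    q2 = {0: 0.3, 1: 0.7}
    samples = [0] * 70 + [1] * 30  # stand-in for draws from the unknown p
    print(scheffe_select(q1, q2, samples))
```

    Only the mass of a single set has to be estimated from the sample, which is why a number of samples on the order of $\epsilon^{-2}$ suffices regardless of the size of the support.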


    Information Based Inference in Models with Set-Valued Predictions and Misspecification

    oai:arXiv.org:2401.11046v2

    arXiv:2401.11046v2 Announce Type: replace-cross Abstract: This paper proposes an information-based inference method for partially identified parameters in incomplete models that is valid both when the model is correctly specified and when it is misspecified. Key features of the method are: (i) it is based on minimizing a suitably defined Kullback-Leibler information criterion that accounts for incompleteness of the model and delivers a non-empty pseudo-true set; (ii) it is computationally tractable; (iii) its implementation is the same for both correctly and incorrectly specified models; (iv) it exploits all information provided by variation in discrete and continuous covariates; (v) it relies on Rao's score statistic, which is shown to be asymptotically pivotal.

    https://arxiv.org/abs/2401.11046


    Imprecise Markov Semigroups and their Ergodicity

    oai:arXiv.org:2405.00081v4

    arXiv:2405.00081v4 Announce Type: replace-cross Abstract: We introduce the concept of an imprecise Markov semigroup $\mathbf{Q}$. It is a tool that allows us to represent ambiguity around both the initial and the transition probabilities of a continuous-time Markov process via a compact collection of Markov semigroups, each associated with a (possibly different) Markov process. We use techniques from topology, geometry, and probability to study the ergodic behavior of $\mathbf{Q}$. We show that, under some conditions that also involve the geometry of the state space, eventually the ambiguity fades. We call this property ergodicity of the imprecise Markov semigroup, and we relate it to the classical notion of ergodicity. We prove ergodicity both when the state space is Euclidean or a Riemannian manifold, and when it is an arbitrary measurable space. The importance of our findings for the fields of artificial intelligence and computer vision is also discussed, in particular in the study of how the probability of an output evolves over time as we perturb the input of a convolutional autoencoder.

    https://arxiv.org/abs/2405.00081


    Neural stochastic Volterra equations: learning path-dependent dynamics

    oai:arXiv.org:2407.19557v3

    arXiv:2407.19557v3 Announce Type: replace-cross Abstract: Stochastic Volterra equations (SVEs) serve as mathematical models for the time evolutions of random systems with memory effects and irregular behaviour. We introduce neural stochastic Volterra equations as a physics-inspired architecture, generalizing the class of neural stochastic differential equations, and provide some theoretical foundation. Numerical experiments on various SVEs, like the disturbed pendulum equation, the generalized Ornstein--Uhlenbeck process, the rough Heston model and a monetary reserve dynamics, are presented, comparing the performance of neural SVEs, neural SDEs and Deep Operator Networks (DeepONets).

    https://arxiv.org/abs/2407.19557


    The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

    oai:arXiv.org:2410.10473v5

    arXiv:2410.10473v5 Announce Type: replace-cross Abstract: Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, we believe that delineating their susceptibility to clean-label poisoning, and developing methods for overcoming this susceptibility, are critical research directions to pursue.

    https://arxiv.org/abs/2410.10473


    On the physics of nested Markov models: a generalized probabilistic theory perspective

    oai:arXiv.org:2411.11614v2

    arXiv:2411.11614v2 Announce Type: replace-cross Abstract: Determining potential probability distributions with a given causal graph is vital for causality studies. To bypass the difficulty in characterizing latent variables in a Bayesian network, the nested Markov model provides an elegant algebraic approach by listing exactly all the equality constraints on the observed variables. However, this algebraically motivated causal model comprises distributions outside Bayesian networks, and its physical interpretation remains vague. In this work, we inspect the nested Markov model through the lens of generalized probabilistic theory, an axiomatic framework to describe general physical theories. We prove that all the equality constraints defining the nested Markov model are valid theory-independently. At the same time, not every distribution within the nested Markov model is implementable, not even via exotic physical theories associated with generalized probability theories (GPTs). To interpret the origin of such a gap, we study three causal models standing between the nested Markov model and the set of all distributions admitting some GPT realization. Each of the successive three models gives a strictly tighter characterization of the physically implementable distribution set; that is, each successive model manifests new types of GPT-inviolable constraints. We further demonstrate each gap through a specially chosen illustrative causal structure. We anticipate our results will enlighten further explorations on the unification of algebraic and physical perspectives of causality.

    https://arxiv.org/abs/2411.11614


    Generative AI-based data augmentation for improved bioacoustic classification in noisy environments

    oai:arXiv.org:2412.01530v3

    arXiv:2412.01530v3 Announce Type: replace-cross Abstract: Obtaining data to train robust artificial intelligence (AI)-based models for species classification can be challenging, particularly for rare species. Data augmentation can boost classification accuracy by increasing the diversity of training data and is cheaper to obtain than expert-labelled data. However, many classic image-based augmentation techniques are not suitable for audio spectrograms. We investigate two generative AI models as data augmentation tools to synthesise spectrograms and supplement audio data: Auxiliary Classifier Generative Adversarial Networks (ACGAN) and Denoising Diffusion Probabilistic Models (DDPMs). The latter performed particularly well in terms of both realism of generated spectrograms and accuracy in a resulting classification task. Alongside these new approaches, we present a new audio data set of 640 hours of bird calls from wind farm sites in Ireland, approximately 800 samples of which have been labelled by experts. Wind farm data are particularly challenging for classification models given the background wind and turbine noise. Training an ensemble of classification models on real and synthetic data combined compared well with highly confident BirdNET predictions. Each classifier we used was improved by including synthetic data, and classification metrics generally improved in line with the amount of synthetic data added. Our approach can be used to augment acoustic signals for more species and other land-use types, and has the potential to bring about advances in our capacity to develop reliable AI-based detection of rare species. Our code is available at https://github.com/gibbona1/SpectrogramGenAI.

    https://arxiv.org/abs/2412.01530


    Defending Collaborative Filtering Recommenders via Adversarial Robustness Based Edge Reweighting

    oai:arXiv.org:2412.10850v2

    arXiv:2412.10850v2 Announce Type: replace-cross Abstract: User-based collaborative filtering (CF) relies on a user-user similarity graph, making it vulnerable to profile injection (shilling) attacks that manipulate neighborhood relations to promote (push) or demote (nuke) target items. In this work, we propose an adversarial-robustness-based edge reweighting defense for CF. We first assign each user-user edge a non-robustness score via spectral adversarial robustness evaluation, which quantifies the edge sensitivity to adversarial perturbations. We then attenuate the influence of non-robust edges by reweighting similarities during prediction. Extensive experiments demonstrate that the proposed method effectively defends against various types of attacks.

    https://arxiv.org/abs/2412.10850


    Compact Neural Network Algorithm for Electrocardiogram Classification

    oai:arXiv.org:2412.17852v2

    arXiv:2412.17852v2 Announce Type: replace-cross Abstract: In this paper, we present a powerful, compact electrocardiogram (ECG) classification algorithm for cardiac arrhythmia diagnosis that addresses the current reliance on deep learning and convolutional neural networks (CNNs) in ECG analysis. This work aims to reduce the demand for deep learning, which often requires extensive computational resources and large labeled datasets. Our approach introduces an artificial neural network (ANN) with a simple architecture combined with advanced feature engineering techniques. A key contribution of this work is the incorporation of 17 engineered features that enable the extraction of critical patterns from raw ECG signals. By integrating mathematical transformations, signal processing methods, and data extraction algorithms, our model captures the morphological and physiological characteristics of ECG signals with high efficiency, without requiring deep learning. Our method demonstrates a similar performance to other state-of-the-art models in classifying 4 types of arrhythmias, including atrial fibrillation, sinus tachycardia, sinus bradycardia, and ventricular flutter. Our algorithm achieved an accuracy of 97.36% on the MIT-BIH and St. Petersburg INCART arrhythmia databases. Our approach offers a practical and feasible solution for real-time diagnosis of cardiac disorders in medical applications, particularly in resource-limited environments.

    https://arxiv.org/abs/2412.17852


    Random Reshuffling for Stochastic Gradient Langevin Dynamics

    oai:arXiv.org:2501.16055v2

    arXiv:2501.16055v2 Announce Type: replace-cross Abstract: We examine the use of different randomisation policies for stochastic gradient algorithms used in sampling, based on first-order (or overdamped) Langevin dynamics, the most popular of which is known as Stochastic Gradient Langevin Dynamics. Conventionally, this algorithm is combined with a specific stochastic gradient strategy, called Robbins-Monro. In this work, we study an alternative strategy, Random Reshuffling, and show convincingly that it leads to improved performance via: a) a proof of reduced bias in the Wasserstein metric for strongly convex, gradient Lipschitz potentials; b) an analytical demonstration of reduced bias for a Gaussian model problem; and c) an empirical demonstration of reduced bias in numerical experiments for some logistic regression problems. This is especially important since Random Reshuffling is typically more efficient due to memory access and cache reasons. Such acceleration for the Random Reshuffling policy is familiar from the optimisation literature on stochastic gradient descent.

    https://arxiv.org/abs/2501.16055
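
    The difference between the two randomisation policies in the abstract above can be sketched on a one-dimensional Gaussian model. The sketch assumes the potential U(x) = sum_i (x - y_i)^2 / 2, a mini-batch of size one, and reads "Robbins-Monro" as drawing a fresh index independently at every step; these choices, the step size and all names are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random


def index_stream(n, epochs, policy, rng):
    """Yield epochs * n data indices.

    policy "rr": Random Reshuffling, a fresh permutation per epoch,
                 so every index is used exactly once per epoch.
    policy "rm": independent uniform draws with replacement.
    """
    for _ in range(epochs):
        if policy == "rr":
            perm = list(range(n))
            rng.shuffle(perm)
            yield from perm
        elif policy == "rm":
            for _ in range(n):
                yield rng.randrange(n)
        else:
            raise ValueError(f"unknown policy: {policy}")


def sgld_gaussian(y, h, epochs, policy, seed=0):
    """Stochastic gradient Langevin dynamics for U(x) = sum_i (x - y_i)^2 / 2.

    Each step uses the single-datum gradient estimate n * (x - y_i).
    The target exp(-U) is Gaussian with mean mean(y) and variance 1/n.
    """
    rng = random.Random(seed)
    n = len(y)
    x = 0.0
    chain = []
    for i in index_stream(n, epochs, policy, rng):
        grad_estimate = n * (x - y[i])
        x = x - h * grad_estimate + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


if __name__ == "__main__":
    data = [k / 10 for k in range(10)]
    for policy in ("rr", "rm"):
        chain = sgld_gaussian(data, h=0.01, epochs=200, policy=policy)
        tail = chain[len(chain) // 2:]
        print(policy, sum(tail) / len(tail))
```

    Under Random Reshuffling the single-datum gradient errors cancel over each full pass through the data, which is the intuition behind the reduced bias the abstract reports.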


    An Algorithm to perform Covariance-Adjusted Support Vector Classification in Non-Euclidean Spaces

    oai:arXiv.org:2504.04371v2

    arXiv:2504.04371v2 Announce Type: replace-cross Abstract: Traditional Support Vector Machine (SVM) classification is carried out by finding the max-margin classifier for the training data that divides the margin space into two equal sub-spaces. This study demonstrates limitations of performing Support Vector Classification in non-Euclidean spaces by establishing that the underlying principle of max-margin classification and Karush-Kuhn-Tucker (KKT) boundary conditions are valid only in the Euclidean vector spaces, while in non-Euclidean spaces the principle of maximum margin is a function of intra-class data covariance. The study establishes a methodology to perform Support Vector Classification in Non-Euclidean Spaces by incorporating data covariance into the optimization problem using the transformation matrix obtained from Cholesky Decomposition of respective class covariance matrices, and shows that the resulting classifier obtained separates the margin space in the ratio of respective class population covariance. The study proposes an algorithm to iteratively estimate the population covariance-adjusted SVM classifier in non-Euclidean space from sample covariance matrices of the training data. The effectiveness of this SVM classification approach is demonstrated by applying the classifier on multiple datasets and comparing the performance with traditional SVM kernels and whitening algorithms. The Cholesky-SVM model shows marked improvement in the accuracy, precision, F1 scores and ROC performance compared to linear and other kernel SVMs.

    https://arxiv.org/abs/2504.04371
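
    The covariance transformation mentioned in the abstract above rests on the Cholesky factorisation of a covariance matrix. The following is a generic sketch of that building block: factor a covariance matrix as L L^T and map a point x to z with L z = x, so that z has identity covariance. It is not the paper's iterative Cholesky-SVM algorithm, and the function names are illustrative.

```python
import math


def cholesky(S):
    """Lower-triangular L with L L^T = S, for a symmetric positive
    definite matrix S given as a list of rows."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = S[i][i] - partial
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (S[i][j] - partial) / L[j][j]
    return L


def whiten(L, x):
    """Solve L z = x by forward substitution. If x has covariance
    L L^T, then z has identity covariance."""
    z = []
    for i in range(len(x)):
        partial = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - partial) / L[i][i])
    return z


if __name__ == "__main__":
    sigma = [[4.0, 2.0], [2.0, 3.0]]
    L = cholesky(sigma)
    print(L)
    print(whiten(L, [2.0, 1.0 + math.sqrt(2.0)]))
```

    In the whitened coordinates ordinary Euclidean distances correspond to covariance-adjusted distances in the original space, which is the sense in which a margin can be made to depend on class covariance.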


    RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees

    oai:arXiv.org:2505.12919v2

    arXiv:2505.12919v2 Announce Type: replace-cross Abstract: Recovering a low rank matrix from a subset of its entries, some of which may be corrupted, is known as the robust matrix completion (RMC) problem. Existing RMC methods have several limitations: they require a relatively large number of observed entries; they may fail under overparametrization, when their assumed rank is higher than the correct one; and many of them fail to recover even mildly ill-conditioned matrices. In this paper we propose a novel RMC method, denoted $\texttt{RGNMR}$, which overcomes these limitations. $\texttt{RGNMR}$ is a simple factorization-based iterative algorithm, which combines a Gauss-Newton linearization with removal of entries suspected to be outliers. On the theoretical front, we prove that under suitable assumptions, $\texttt{RGNMR}$ is guaranteed exact recovery of the underlying low rank matrix. Our theoretical results improve upon the best currently known for factorization-based methods. On the empirical front, we show via several simulations the advantages of $\texttt{RGNMR}$ over existing RMC methods, and in particular its ability to handle a small number of observed entries, overparameterization of the rank and ill-conditioned matrices.

    https://arxiv.org/abs/2505.12919


    Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

    oai:arXiv.org:2505.15201v4

    arXiv:2505.15201v4 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.

    https://arxiv.org/abs/2505.15201
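
    As background for the pass@k metric in the abstract above, here is a minimal sketch of the widely used unbiased pass@k estimate for binary rewards: with n sampled attempts of which c are correct, the probability that a random subset of k attempts contains at least one correct one is 1 - C(n-c, k) / C(n, k). This is the standard evaluation-time estimator only; the paper's reward transformations and gradient estimators are not reproduced here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts with c correct ones.

    Equals the fraction of size-k subsets of the n attempts that
    contain at least one correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=4, c=1, k=2))
    print(pass_at_k(n=10, c=3, k=1))
```

    For k = 1 the estimate reduces to the plain success rate c / n, and it increases with k whenever at least one attempt is correct.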


    Meta-reinforcement learning with minimum attention

    oai:arXiv.org:2505.16741v2

    arXiv:2505.16741v2 Announce Type: replace-cross Abstract: Minimum attention, first proposed by Brockett, applies the least action principle to the changes of control with respect to state and time. The involved regularization is highly relevant in emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, minimum attention outperforms state-of-the-art model-free and model-based RL algorithms, showing fast adaptation in few shots and variance reduction under perturbations of the model and environment. Furthermore, minimum attention demonstrates an improvement in energy efficiency.

    https://arxiv.org/abs/2505.16741


    Variational Learning of Disentangled Representations

    oai:arXiv.org:2506.17182v2

    arXiv:2506.17182v2 Announce Type: replace-cross Abstract: Disentangled representations enable models to separate factors of variation that are shared across experimental conditions from those that are condition-specific. This separation is essential in domains such as biomedical data analysis, where generalization to new treatments, patients, or species depends on isolating stable biological signals from context-dependent effects. While extensions of the variational autoencoder (VAE) framework have been proposed to address this problem, they frequently suffer from leakage between latent representations, limiting their ability to generalize to unseen conditions. Here, we introduce DISCoVeR, a new variational framework that explicitly separates condition-invariant and condition-specific factors. DISCoVeR integrates three key components: (i) a dual-latent architecture that models shared and specific factors separately; (ii) two parallel reconstructions that ensure both representations remain informative; and (iii) a novel max-min objective that encourages clean separation without relying on handcrafted priors, while making only minimal assumptions. Theoretically, we show that this objective maximizes data likelihood while promoting disentanglement, and that it admits a unique equilibrium. Empirically, we demonstrate that DISCoVeR achieves improved disentanglement on synthetic datasets, natural images, and single-cell RNA-seq data. Together, these results establish DISCoVeR as a principled approach for learning disentangled representations in multi-condition settings.

    https://arxiv.org/abs/2506.17182


    Reliability-Adjusted Prioritized Experience Replay

    oai:arXiv.org:2506.18482v3

    arXiv:2506.18482v3 Announce Type: replace-cross Abstract: Experience replay enables data-efficient learning from past experiences in online reinforcement learning agents. Traditionally, experiences were sampled uniformly from a replay buffer, regardless of differences in experience-specific learning potential. In an effort to sample more efficiently, researchers introduced Prioritized Experience Replay (PER). In this paper, we propose an extension to PER by introducing a novel measure of temporal difference error reliability. We theoretically show that the resulting transition selection algorithm, Reliability-adjusted Prioritized Experience Replay (ReaPER), enables more efficient learning than PER. We further present empirical results showing that ReaPER outperforms PER across various environment types, including the Atari-10 benchmark.

    https://arxiv.org/abs/2506.18482
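
    The baseline that the abstract above extends, Prioritized Experience Replay, samples transition i with probability proportional to (|delta_i| + eps)^alpha, where delta_i is its temporal difference error, and corrects the induced bias with importance weights. The sketch below shows only this standard proportional scheme; the reliability adjustment of ReaPER is not defined in the abstract and is not attempted, and the default values of `alpha`, `beta` and `eps` are illustrative.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|td_error| + eps) ** alpha.

    alpha = 0 gives uniform sampling; larger alpha favours transitions
    with large temporal difference error.
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights (N * p_i) ** (-beta), scaled so that
    the largest weight equals 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


if __name__ == "__main__":
    probs = per_probabilities([1.0, 3.0], alpha=1.0, eps=0.0)
    print(probs)
    print(importance_weights(probs, beta=1.0))
```

    Transitions that are sampled more often receive smaller weights, so their larger sampling frequency does not dominate the gradient updates.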


    Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift

    oai:arXiv.org:2507.05412v3

    arXiv:2507.05412v3 Announce Type: replace-cross Abstract: We study the problem of learning robust discriminative representations of causally related latent variables given the underlying causal graph and a training set comprising passively collected observational data and interventional data obtained through targeted interventions on some of these latent variables. We desire to learn representations that are robust against the resulting interventional distribution shifts. Existing approaches treat observational and interventional data alike, ignoring the independence relations arising from these interventions, even with known underlying causal models. As a result, their representations lead to large predictive performance disparities between observational and interventional data. This performance disparity worsens when interventional training data is scarce. In this paper, (1) we first identify a strong correlation between this performance disparity and the representations' violation of statistical independence induced during interventions. (2) For linear models, we derive sufficient conditions on the proportion of interventional training data, for which enforcing statistical independence between representations of the intervened node and its non-descendants during interventions lowers the test-time error on interventional data. Combining these insights, (3) we propose RepLIn, a training algorithm that explicitly enforces this statistical independence between interventional representations. We demonstrate the utility of RepLIn on a synthetic dataset, and on real image and text datasets on facial attribute classification and toxicity detection, respectively, with semi-synthetic causal structures. Our experiments show that RepLIn is scalable with the number of nodes in the causal graph and is suitable to improve robustness against interventional distribution shifts of both continuous and discrete latent variables compared to the ERM baselines.

    https://arxiv.org/abs/2507.05412


    A Collectivist, Economic Perspective on AI

    oai:arXiv.org:2507.06268v3

    arXiv:2507.06268v3 Announce Type: replace-cross Abstract: Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are impacting the human world as never before. The word "intelligence" is being used as a North Star for the development of this technology, with human cognition viewed as a baseline. This view neglects the fact that humans are social animals and that much of our intelligence is social and cultural in origin. Moreover, failing to properly situate aspects of intelligence at the social level contributes to the treatment of the societal consequences of technology as an afterthought. The path forward is not merely more data and compute, and not merely more attention paid to cognitive or symbolic representations, but a thorough blending of economic and social concepts with computational and inferential concepts at the level of algorithm design.

    https://arxiv.org/abs/2507.06268


    Approaches for modelling the term-structure of default risk under IFRS 9: A tutorial using discrete-time survival analysis

    oai:arXiv.org:2507.15441v3

    arXiv:2507.15441v3 Announce Type: replace-cross Abstract: Under the International Financial Reporting Standards (IFRS) 9, credit losses ought to be recognised timeously and accurately. This requirement belies a certain degree of dynamicity when estimating the constituent parts of a credit loss event, most notably the probability of default (PD). It is notoriously difficult to produce such PD-estimates at every point of loan life that are adequately dynamic and accurate, especially when considering the ever-changing macroeconomic background. In rendering these lifetime PD-estimates, the choice of modelling technique plays an important role, which is why we first review a few classes of techniques, including the merits and limitations of each. Our main contribution however is the development of an in-depth and data-driven tutorial using a particular class of techniques called discrete-time survival analysis. This tutorial is accompanied by a diverse set of reusable diagnostic measures for evaluating various aspects of a survival model and the underlying data. A comprehensive R-based codebase is further contributed. We believe that our work can help cultivate common modelling practices under IFRS 9, and should be valuable to practitioners, model validators, and regulators alike.

    https://arxiv.org/abs/2507.15441


    On a $T_1$ Transport inequality for the adapted Wasserstein distance

    oai:arXiv.org:2507.19215v2

    arXiv:2507.19215v2 Announce Type: replace-cross Abstract: The $L^1$ transport-entropy inequality (or $T_1$ inequality), which bounds the $1$-Wasserstein distance in terms of the relative entropy, is known to characterize Gaussian concentration. To extend the $T_1$ inequality to laws of discrete-time processes while preserving their temporal structure, we investigate the adapted $T_1$ inequality which relates the $1$-adapted Wasserstein distance to the relative entropy. Building on the Bolley--Villani inequality, we establish the adapted $T_1$ inequality under the same moment assumption as the classical $T_1$ inequality.

    https://arxiv.org/abs/2507.19215
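
    For orientation, the classical $T_1$ inequality referred to in the abstract above can be written out as follows. The normalisation of the constant $C$ varies between references, and the adapted version shown second is only a schematic reading of the abstract with $\mathcal{AW}_1$ as an assumed symbol for the $1$-adapted Wasserstein distance; neither display is quoted from the paper.

```latex
% Classical T_1 inequality: mu satisfies T_1(C) if, for every
% probability measure nu absolutely continuous with respect to mu,
\[
  W_1(\nu, \mu) \;\le\; \sqrt{2C \, H(\nu \mid \mu)},
  \qquad
  H(\nu \mid \mu) = \int \log\frac{d\nu}{d\mu} \, d\nu .
\]
% Schematic adapted analogue for laws of discrete-time processes,
% with the adapted distance replacing W_1 (notation assumed here):
\[
  \mathcal{AW}_1(\nu, \mu) \;\le\; \sqrt{2C' \, H(\nu \mid \mu)} .
\]
```

    Here $H(\nu \mid \mu)$ is the relative entropy and $C, C' > 0$ are constants depending on $\mu$.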


    Reduced Order Modeling of Energetic Materials Using Physics-Aware Recurrent Convolutional Neural Networks in a Latent Space (LatentPARC)

    oai:arXiv.org:2509.12401v2

    arXiv:2509.12401v2 Announce Type: replace-cross Abstract: Physics-aware deep learning (PADL) has gained popularity for use in complex spatiotemporal dynamics (field evolution) simulations, such as those that arise frequently in computational modeling of energetic materials (EM). Here, we show that the challenge PADL methods face while learning complex field evolution problems can be simplified and accelerated by decoupling it into two tasks: learning complex geometric features in evolving fields and modeling dynamics over these features in a lower dimensional feature space. To accomplish this, we build upon our previous work on physics-aware recurrent convolutions (PARC). PARC embeds knowledge of underlying physics into its neural network architecture for more robust and accurate prediction of evolving physical fields. PARC was shown to effectively learn complex nonlinear features such as the formation of hotspots and coupled shock fronts in various initiation scenarios of EMs, as a function of microstructures, serving effectively as a microstructure-aware burn model. In this work, we further accelerate PARC and reduce its computational cost by projecting the original dynamics onto a lower-dimensional invariant manifold, or 'latent space.' The projected latent representation encodes the complex geometry of evolving fields (e.g. temperature and pressure) in a set of data-driven features. The reduced dimension of this latent space allows us to learn the dynamics during the initiation of EM with a lighter and more efficient model. We observe a significant decrease in training and inference time while maintaining results comparable to PARC at inference. This work takes steps towards enabling rapid prediction of EM thermomechanics at larger scales and characterization of EM structure-property-performance linkages at a full application scale.

    https://arxiv.org/abs/2509.12401


    Mamba Modulation: On the Length Generalization of Mamba

    oai:arXiv.org:2509.19633v3

    arXiv:2509.19633v3 Announce Type: replace-cross Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
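
    As a rough illustration of why the spectrum of $\mathbf{A}$ matters for length extension, the sketch below uses a scalar diagonal state-space recurrence with a single negative eigenvalue `a` and constant step size. It is not the paper's method: the helper `state_decay`, the lengths, and the uniform scale factor `train_len / test_len` are all hypothetical choices made only to show the effect.

```python
import math

def state_decay(a, deltas):
    """Factor multiplying the initial state after all steps.

    For the scalar recurrence h_t = exp(delta_t * a) * h_{t-1} + ...,
    the initial state is scaled by prod_t exp(delta_t * a) = exp(a * sum(deltas)).
    """
    return math.exp(a * sum(deltas))

# Hypothetical settings (assumptions, not values from the paper).
a = -1.0                 # single eigenvalue of the transition matrix
delta = 0.01             # constant discretization step
train_len, test_len = 512, 4096

decay_train = state_decay(a, [delta] * train_len)
decay_test = state_decay(a, [delta] * test_len)

# Shrinking the eigenvalue by train_len / test_len makes the long sequence
# decay as much as the short one did (an illustrative rule only).
scale = train_len / test_len
decay_scaled = state_decay(a * scale, [delta] * test_len)

print(decay_train, decay_test, decay_scaled)
```

    With these numbers the unscaled long-context decay factor is many orders of magnitude smaller than the one seen at the training length, while the scaled eigenvalue restores the training-length value.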

    https://arxiv.org/abs/2509.19633


    Are penalty shootouts better than a coin toss? Evidence from international club football in Europe

    oai:arXiv.org:2510.17641v3

    arXiv:2510.17641v3 Announce Type: replace-cross Abstract: Penalty shootouts play an important role in the knockout stage of major football tournaments, especially since the 2021/22 season, when the Union of European Football Associations (UEFA) scrapped the away goals rule in its club competitions. Inspired by this rule change, our paper examines whether the outcome of a penalty shootout can be predicted in UEFA club competitions. Based on all shootouts between 2000 and 2025, we find no evidence for the effect of the kicking order, the field of the match, or psychological momentum. In contrast to previous results, stronger teams, defined first by Elo ratings, do not perform better than their weaker opponents. Consequently, penalty shootouts seem to be close to a coin toss in top European club football.
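
    One simple way to ask whether an outcome is "close to a coin toss" is an exact two-sided binomial test against a win probability of 0.5. The sketch below is an assumption about how such a check could look, not the analysis used in the paper; the function name `binom_two_sided_p` and all counts are hypothetical.

```python
from math import comb

def binom_two_sided_p(wins, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed number of wins under success probability p.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance so ties are counted despite rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: shootouts won by the team kicking first.
p_balanced = binom_two_sided_p(10, 20)   # perfectly balanced sample
p_extreme = binom_two_sided_p(0, 10)     # first kicker never wins
print(p_balanced, p_extreme)
```

    A large p-value means the data are compatible with a fair coin; it does not by itself show that the kicking order has no effect.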

    https://arxiv.org/abs/2510.17641


    Fixed Horizon Linear Quadratic Covariance Steering in Continuous Time with Hilbert-Schmidt Terminal Cost

    oai:arXiv.org:2510.21944v2

    arXiv:2510.21944v2 Announce Type: replace-cross Abstract: We formulate and solve the fixed horizon linear quadratic covariance steering problem in continuous time with a terminal cost measured in Hilbert-Schmidt (i.e., Frobenius) norm error between the desired and the controlled terminal covariances. For this problem, the necessary conditions of optimality become a coupled matrix ODE two-point boundary value problem. To solve this system of equations, we design a matricial recursive algorithm and prove its convergence. The proposed algorithm and its analysis make use of the linear fractional transforms parameterized by the state transition matrix of the associated Hamiltonian matrix. To illustrate the results, we provide two numerical examples: one with a two-dimensional and another with a six-dimensional state space.
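
    For readers unfamiliar with the terminal cost named above, the Hilbert-Schmidt (Frobenius) error between two covariance matrices is the square root of the sum of squared entrywise differences. The sketch below computes its square; whether the paper squares or weights the norm is not stated in the abstract, so that choice, the function name, and the matrices are assumptions.

```python
def hilbert_schmidt_error_sq(sigma, sigma_desired):
    """Squared Frobenius norm of (sigma - sigma_desired).

    Both arguments are square matrices given as lists of rows.
    """
    return sum(
        (s - d) ** 2
        for row_s, row_d in zip(sigma, sigma_desired)
        for s, d in zip(row_s, row_d)
    )

# Hypothetical 2x2 terminal covariances.
sigma_terminal = [[2.0, 0.0], [0.0, 1.0]]
sigma_target = [[1.0, 0.0], [0.0, 1.0]]
print(hilbert_schmidt_error_sq(sigma_terminal, sigma_target))
```

    The error is zero exactly when the controlled terminal covariance matches the desired one, which is what makes it usable as a soft terminal penalty.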

    https://arxiv.org/abs/2510.21944


    CBDC Stress Test in a Dual-Currency Setting

    oai:arXiv.org:2511.13384v4

    arXiv:2511.13384v4 Announce Type: replace-cross Abstract: This study explores the potential impact of introducing a Central Bank Digital Currency (CBDC) on financial stability in an emerging dual-currency economy (Romania), where the domestic currency (RON) coexists with the euro. It develops an integrated analytical framework combining econometrics, machine learning, and behavioural modelling. CBDC adoption probabilities are estimated using XGBoost and logistic regression models trained on behavioural and macro-financial indicators rather than survey data. Liquidity stress simulations assess how banks would respond to deposit withdrawals resulting from CBDC adoption, while VAR, MSVAR, and SVAR models capture the macro-financial transmission of liquidity shocks into credit contraction and changes in monetary conditions. The findings indicate that CBDC uptake (co-circulating Digital RON and Digital EUR) would be moderate at issuance, amounting to around EUR 1 billion, primarily driven by digital readiness and trust in the central bank. The study concludes that a non-remunerated, capped CBDC, designed primarily as a means of payment rather than a store of value, can be introduced without compromising financial stability. In dual-currency economies, differentiated holding limits for domestic and foreign digital currencies (e.g., Digital RON versus Digital Euro) are crucial to prevent uncontrolled euroisation and preserve monetary sovereignty. A prudent design with moderate caps, non-remuneration, and macroprudential coordination can transform CBDC into a digital liquidity buffer and a complementary monetary policy instrument that enhances resilience and inclusion rather than destabilising the financial system.

    https://arxiv.org/abs/2511.13384


    Nonparametric Uniform Inference in Binary Classification and Policy Values

    oai:arXiv.org:2511.14700v2

    arXiv:2511.14700v2 Announce Type: replace-cross Abstract: We develop methods for nonparametric uniform inference in cost-sensitive binary classification, a framework that encompasses maximum score estimation, predicting utility maximizing actions, and policy learning. These problems are well known for slow convergence rates and non-standard limiting behavior, even under point identified parametric frameworks. In nonparametric settings, they may further suffer from failures of identification. To address these challenges, we introduce a strictly convex surrogate loss that point-identifies a representative nonparametric policy function. We then estimate this representative policy function to conduct inference on both the optimal classification policy and the optimal policy value. This approach enables Gaussian inference, substantially simplifying empirical implementation relative to working directly with the original classification problem. In particular, we establish root-$n$ asymptotic normality for the optimal policy value and derive a Gaussian approximation for the optimal classification policy at the standard nonparametric rate. Extensive simulation studies corroborate the theoretical findings. We apply our method to the National JTPA Study to conduct inference on the optimal treatment assignment policy and its associated welfare.

    https://arxiv.org/abs/2511.14700


    Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory

    oai:arXiv.org:2511.23083v3

    arXiv:2511.23083v3 Announce Type: replace-cross Abstract: High-capacity kernel Hopfield networks exhibit a Ridge of Optimization characterized by extreme stability. While previously linked to Spectral Concentration, its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability, a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of Dual Equilibrium in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.

    https://arxiv.org/abs/2511.23083


    GraphBench: Next-generation graph learning benchmarking

    oai:arXiv.org:2512.04475v2

    arXiv:2512.04475v2 Announce Type: replace-cross Abstract: Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols, with consistent dataset splits and performance metrics that account for out-of-distribution generalization, as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.

    https://arxiv.org/abs/2512.04475


    Developing synthetic microdata through machine learning for firm-level business surveys

    oai:arXiv.org:2512.05948v2

    arXiv:2512.05948v2 Announce Type: replace-cross Abstract: Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

    https://arxiv.org/abs/2512.05948


    Rice Price Dynamics during the 1945-1947 Famine in Post-War Taiwan: A Quantitative Reassessment

    oai:arXiv.org:2512.07492v2

    arXiv:2512.07492v2 Announce Type: replace-cross Abstract: We compiled the first high-frequency rice price panel for Taiwan from August 1945 to March 1947, during the transition from Japanese rule to Chinese rule. Using regression models, we found that the pattern of rice price changes could be divided into four stages, each with distinct characteristics. For each stage, we examined the policies formulated by the Taiwan government at the time to demonstrate the correlation between rice prices and policies. The results highlight the dominant role of policy systems in post-war food crises.

    https://arxiv.org/abs/2512.07492