stat updates on arXiv.org
stat updates on the arXiv.org e-print archive.
Exploring the periodicity of flight patterns
oai:arXiv.org:2606.00128v1
arXiv:2606.00128v1 Announce Type: new
Abstract: Each year the American Statistical Association (ASA) hosts the Annual Data Challenge Expo, which tasks participants with analyzing a given dataset and presenting their work at the Joint Statistical Meetings (JSM). The 2025 Data Challenge Expo tasked participants with analyzing over 35 years of commercial flight data from the United States Bureau of Transportation Statistics (BTS). These data provide extensive geographic coverage and operational details for the U.S. domestic aviation market. For millions of past flights, there is information about the flight's date, origin, destination, carrier, plane, departure, and arrival. In this article, we present our analysis for the 2025 JSM Data Challenge Expo. We chose to explore patterns in the daily scheduling of departures and arrivals across airlines, airports, and time. In doing so, we observed distinct scheduling ``waves'', or periodic structures, at major airline hubs as well as large Federal Aviation Administration (FAA) hubs. In the remainder of this article, we detail the process of visualizing periodicity in flight scheduling as well as quantifying it through the calculation of Shannon entropy. An additional element of the 2025 Data Challenge Expo is the incorporation of a second dataset, to be decided by the participants. We detail the use of a BTS dataset with passenger enplanement (boarding) information to determine FAA hub classification (as opposed to airline-specific hubs). Furthermore, we discuss results from this visual and quantitative analysis, highlighting noticeable differences in the scheduling periodicity and entropy across airports for the ``big four'', the four largest carriers in U.S. aviation: American Airlines, Delta Air Lines, United Airlines, and Southwest Airlines.
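As a rough illustration of how Shannon entropy can quantify scheduling periodicity, the sketch below computes the entropy of departures binned by hour of day. The binning, the function name `schedule_entropy`, and the two toy schedules are assumptions made for illustration; the abstract does not specify how the authors bin or normalize their data.

```python
import math
from collections import Counter


def schedule_entropy(departure_hours, n_bins=24):
    """Shannon entropy (in bits) of departures over hour-of-day bins."""
    counts = Counter(h % n_bins for h in departure_hours)
    total = sum(counts.values())
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical schedules (not real BTS data):
# a "banked" hub with three departure waves, and an evenly spread schedule.
banked = [7] * 30 + [12] * 30 + [17] * 30
spread = list(range(6, 24)) * 5

banked_entropy = schedule_entropy(banked)
spread_entropy = schedule_entropy(spread)
```

Under this toy setup, the wave-like schedule concentrates departures in a few bins and therefore has lower entropy than the evenly spread one.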
Interpreting FCDNNs via RG on Exponential Family
oai:arXiv.org:2606.00157v1
arXiv:2606.00157v1 Announce Type: new
Abstract: We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renormalization group (RG) method in statistical physics and the training process of deep neural networks (DNNs). We have proved the constructed relationship using the one-dimensional Ising model as the input data. In this paper we generalize our results to the case of continuous input data, which is a necessary preparation for applying the corresponding framework to real-world data. To be representative, we consider a class of data distribution in the exponential family. We prove that when the parameters of fully connected (FC) DNNs achieve their optimal value after training, the characteristic parameters of the feature layer output of DNNs are equal to the fixed points of the characteristic parameters of input data under RG method for continuous fields. This conclusion shows that the training process of DNNs is equivalent to RG calculation on this kind of data and therefore the network can extract main features from the input data just like RG. Also, the equivalence further validates the correspondence framework we have established, providing an explanation for the outstanding performance of DNNs on real-world data.
Infinite-Dimensional Spherical Kernel ridge Regression
oai:arXiv.org:2606.00181v1
arXiv:2606.00181v1 Announce Type: new
Abstract: We introduce a novel regression framework designed to model non-linear responses situated on a sphere $\mathbb{S}$ of finite or infinite dimension. Unlike traditional tangent-space regressions, which lift responses to a tangent space $T_o \mathbb{S}$ and thereby violate intrinsic spherical distances, our proposed method employs an intrinsic approach. We model the conditional mean through an intercept $o \in \mathbb{S}$ and a linear predictor function $f: \mathfrak{X} \to T_o \mathbb{S}$. This formulation transforms the estimation problem into finding a linear predictor within a function space, but utilizing a metric defined by spherical geometry rather than standard Euclidean distance. Leveraging vector-valued reproducing kernel Hilbert space theory, our approach reduces the infinite-dimensional estimation challenge to a manageable finite-dimensional problem via the representer theorem, leading to an efficient BFGS-based estimation algorithm. We establish convergence rates and analyze the finite-sample behavior of our estimator, concluding with a practical application to density regression. The full implementation is available in R.
On Asymptotic Outlier Rejection in Bayesian Mixed Poisson Regression Models Under Extreme Target and Covariate Values
oai:arXiv.org:2606.00231v1
arXiv:2606.00231v1 Announce Type: new
Abstract: Bayesian models are claimed to be fully robust against outliers if, asymptotically, observations infinitely far from the other data do not influence the posterior. Early works in robust Bayesian inference concentrated on continuous distributions and i.i.d. observations. Robustness results were then extended to linear regression in the presence of infinite residuals, either through an outlying outcome or an outlying covariate. Recently, Hamura et al. (2025, arXiv:2106.10503) presented a count regression model, with Poisson-Rescaled Beta (-RSB) target distribution and Gaussian latent variables (GLVs), which is robust against infinitely large counts and able to handle zero-inflation. We continue from the work of Hamura et al. and study the robustness properties of mixed Poisson regression models with GLVs in the presence of outlying data points arising from either corrupted covariates or corrupted target values. While in linear regression the two cases are interchangeable, as both an infinite target and infinite covariates lead to infinite residuals, we show that in count regression the case of infinite covariates is not symmetric to that of an infinite target. Specifically, we show that mixed Poisson models are not asymptotically robust to outliers resulting from infinite covariates. We then consider three alternative mixed Poisson distributions (Poisson-Gamma, Poisson-log-t, and Poisson-RSB) as the target distribution and examine, both theoretically and via simulations as well as real-world case studies, their behavior in the presence of outliers of three alternative types: large target values as well as large and small covariate values. Our results show that models robust to data points with an anomalous target are not robust to data points with anomalous covariates, calling for methodological development of models that are robust to covariate outliers.
Density Evolution: A Multiscale View of Density Estimation
oai:arXiv.org:2606.00233v1
arXiv:2606.00233v1 Announce Type: new
Abstract: Density estimation is often presented as a choice among parametric summaries, finite mixtures, and nonparametric smoothers. This review argues for a complementary view: a data set can be studied through a path of densities indexed by smoothing scale, diffusion time, model complexity, density level, or noise level. We call this perspective density evolution. Under this lens, Gaussian kernel density estimation is heat flow from the empirical measure; scale-space methods, critical bandwidths, mode trees, and derivative-significance displays describe the evolution of modal and derivative structure; finite mixtures and mixture reduction provide compressed representations of kernel-like estimates; and cluster trees and persistent homology summarize evolving level-set topology. We review these connections and discuss inference for feature lifetimes, high-dimensional complications, and links with score-based generative diffusion. We also include three elementary structural results: nondegenerate modes move along smooth branches, a natural moment-preserving Gaussianization semigroup is forced to be Ornstein--Uhlenbeck, and shared-covariance Gaussian mixtures become log-concave once component means are sufficiently concentrated. Together, these ideas shift attention from choosing one density estimate to studying the multiscale probability landscape.
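The idea of following a density along a path indexed by smoothing scale can be sketched with a one-dimensional Gaussian kernel density estimate evaluated at several bandwidths, counting modes on a grid. This is only a minimal illustration: the data, the grid-based mode counter, and the function names are assumptions, not the review's own procedure.

```python
import math


def gaussian_kde(data, h):
    """Return the Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density


def count_modes(data, h, grid_size=401):
    """Count local maxima of the KDE on an evenly spaced grid."""
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    f = gaussian_kde(data, h)
    xs = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    ys = [f(x) for x in xs]
    # A grid point is a mode if it rises from the left and does not rise to the right.
    return sum(
        1 for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]
    )


# Hypothetical sample with two well-separated groups.
sample = [-2.2, -2.0, -1.7, 1.8, 2.1, 2.5]

# Mode counts along the smoothing path, from light to heavy smoothing.
mode_path = {h: count_modes(sample, h) for h in (0.5, 1.0, 3.0)}
```

Reading `mode_path` from small to large bandwidth shows the two groups merging into a single mode, which is the kind of structural evolution that mode trees and critical bandwidths summarize.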
Out-of-Distribution generalization of quantile regression with heavy tailed inputs: an SVM approach
oai:arXiv.org:2606.00265v1
arXiv:2606.00265v1 Announce Type: new
Abstract: We study quantile regression in an extrapolation regime where the covariate takes unusually large values. Under regular variation assumptions, extreme observations can be effectively characterized through their angular components, enabling learning strategies that focus on the angle of the most extreme observations. This approach is formalized through the minimization of an asymptotic conditional risk that localizes learning in the tail of the covariate distribution.
We propose a novel Support Vector Machine (SVM) framework for extreme quantile regression, leveraging reproducing kernel Hilbert spaces to handle high-dimensional and nonlinear settings. Our method also accommodates unbounded response variables and avoids restrictive transformations. We establish finite-sample learning guarantees under mild regularity assumptions.
The proposed framework unifies ideas from statistical learning and multivariate extremes, providing a tractable and theoretically grounded approach to extrapolation. We complement our theoretical findings with an empirical study on river flow data from the Danube, demonstrating the practical relevance of our methods.
Is Zero-Shot Super-Resolution Possible in Operator Learning?
oai:arXiv.org:2606.00296v1
arXiv:2606.00296v1 Announce Type: new
Abstract: Neural operators are often reported to exhibit zero-shot super-resolution, a phenomenon in which a model trained on coarse grids produces accurate predictions on finer testing grids without additional retraining. Despite strong empirical evidence, the theoretical foundations of this phenomenon remain unclear. In this work, we provide a systematic theoretical study of zero-shot super-resolution in operator learning. We first show that zero-shot super-resolution can be information-theoretically impossible even in benign settings such as when the input functions are available over the entire continuum and the ground truth is a simple rank-one linear operator. We then identify H\"older smoothness of the output functions as a sufficient condition for zero-shot super-resolution and derive corresponding generalization bounds. Finally, we also validate the identified failure modes through experimental results.
ERICA: Quantifying Replicability of Cluster Analysis
oai:arXiv.org:2606.00302v1
arXiv:2606.00302v1 Announce Type: new
Abstract: Despite being ubiquitous in science, clustering remains a technique whose results are not quantitatively scrutinized via a framework. We present an analysis called evaluating replicability via iterative clustering assignments (ERICA) that is applied to a dataset to determine whether clusters are identified in a replicable manner. The pipeline computes a statistic that describes whether structure is found in a dataset. Quantitative visualization methods are presented to answer important questions such as the similarity between clusters, and the identity of points that may be outliers. When tested on synthetic data, the findings show clusters being discovered in a replicable manner. However, we note a possibility for non-replicable results when the pipeline is applied to three gene expression datasets for breast cancer subtype validation. The study underscores the need for rigorous inspection and offers a practical tool for doing so.
Cluster Analysis with Resampling for Validation and Exploration (CARVE)
oai:arXiv.org:2606.00327v1
arXiv:2606.00327v1 Announce Type: new
Abstract: Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives, grounded in the ideas of clustering stability and generalizability, have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.
Polar Depth for Potentially Heavy-Tailed Data
oai:arXiv.org:2606.00343v1
arXiv:2606.00343v1 Announce Type: new
Abstract: Motivated by the analysis of the behaviour of extremes from multivariate heavy-tailed distributions, we introduce a novel notion of statistical depth, referred to as Polar Depth. The polar depth function is naturally expressed in polar coordinates, as is the limiting distribution of a regularly varying random variable, beyond asymptotically large thresholds, once its marginals have been appropriately normalized. Not only does the polar depth function make it easy to order the extreme values taken by a heavy-tailed random variable X and find natural applications in anomaly detection, but it is also possible to show, as we prove under appropriate assumptions in this article, that the polar depth of the largest observations, i.e. observations X whose norm is larger than t>0, converges to the polar depth of the limiting distribution as t tends to infinity. Although designed to quantify the depth of multivariate extremes, the polar depth is interesting in its own right, insofar as this notion is more relevant for distributions whose support is included in a halfspace than the alternatives proposed in the literature, the halfspace depth in particular. Here, we demonstrate its properties and analyze statistical issues related to its estimation from both finite-sample and asymptotic points of view. We present numerical results to empirically demonstrate its relevance, particularly for the statistical analysis of extreme observations and more specifically for the identification of anomalies among them.
Network knockoffs: controlling false discovery in dyadic space
oai:arXiv.org:2606.00346v1
arXiv:2606.00346v1 Announce Type: new
Abstract: Phenomena such as epidemiological processes, hydrologic systems, social platforms, utility services, and supply chains can be represented as topological networks. A central question about these networks concerns connectivity and the permeability of edges. Dyadic regression and related approaches have been proposed to identify network features associated with pairwise node-level differences. In high-dimensional settings, it is important to control the number of spuriously selected features. However, controlling the false discovery rate for dyadic outcomes is challenging because dependence among dyads invalidates classic asymptotic procedures and complicates standard data splitting and knockoff approaches. We propose a novel knockoff variable selection procedure that simulates synthetic features directly on the topological network prior to constructing the augmented design matrix in dyadic space. Empirically, our method controls the false discovery rate for both node- and edge-level features. The Benjamini-Hochberg, Benjamini-Yekutieli, Storey Q-value, data-splitting, and standard knockoff procedures were all anticonservative. We applied our network knockoffs to assess the impassability of over 1000 stream barriers in North Carolina for Salvelinus fontinalis. Compared to data splitting and traditional knockoff approaches, our proposed approach selected a higher proportion of barriers previously assessed to impede fish movement.
A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering
oai:arXiv.org:2606.00402v1
arXiv:2606.00402v1 Announce Type: new
Abstract: We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample false discovery rate (FDR) guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.
Riemannian Stochastic Optimization for Sufficient Dimension Reduction
oai:arXiv.org:2606.00413v1
arXiv:2606.00413v1 Announce Type: new
Abstract: Sufficient dimension reduction (SDR) makes high-dimensional regression tractable by projecting the covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based estimators either operate in the ambient space and suffer from the curse of dimensionality, or localize in the reduced space at a per-outer-iteration cost at least quadratic in the sample size. We show that minimizers of the population Minimum Average Variance Estimation (MAVE) risk approximate the same Grassmannian target as the Outer Product of Gradients (OPG), and recast the empirical criterion as a smooth maximization on the Stiefel manifold with closed-form Riemannian gradient. The resulting algorithm, SMAVE, combines sparse projected-space nearest-neighbor localization with Riemannian stochastic gradient ascent. A simplified version comes with almost-sure convergence and a non-asymptotic rate matching the standard non-convex stochastic first-order scaling. Empirically, SMAVE matches or improves on RMAVE's synthetic subspace recovery at moderate-to-high ambient dimension, and on four real datasets it uniformly improves over OPG and is competitive with or outperforms RMAVE at orders of magnitude lower runtime.
Parameter-Free and Group Conditional Online Conformal Prediction
oai:arXiv.org:2606.00419v1
arXiv:2606.00419v1 Announce Type: new
Abstract: Uncertainty quantification (UQ) is critical for the deployment of machine learning predictors in real-world scenarios where the data distribution may shift over time (i.e., data may not be exchangeable). Online conformal prediction (OCP) methods address this issue at the expense of either (i) group-wise error control or (ii) learning-rate independent implementation. Group-conditional coverage is essential for fairness across different collections of data points and for providing finer UQ guarantees. Parameter-free optimization is crucial for robustness to adversarial and unknown data shifts. We propose a parameter-free algorithm for group-conditional OCP and demonstrate that it achieves the best group-conditional coverage guarantees. We evaluate our algorithm on synthetic and real-world data, demonstrating that our method not only improves the reliability of existing parameter-free OCP methods but also provides prediction intervals that are comparable in size to well-tuned group-conditional approaches. By unifying group-conditional coverage with parameter-free online algorithms, our work lays a foundation for fair and robust uncertainty quantification in shifting environments.
Empirical Likelihood with Generative AI
oai:arXiv.org:2606.00425v1
arXiv:2606.00425v1 Announce Type: new
Abstract: Moment conditions are widely used to identify parameters in models where the full likelihood is either unknown or intentionally left unspecified. Empirical likelihood methods address this problem by assigning probability weights to the observed data so that the sample moment conditions hold exactly. Building on this idea, we propose a nonparametric Bayesian framework based on exponentially tilted empirical likelihood. This Bayesian formulation is particularly appealing in settings where prior information is more naturally specified on the observables rather than on the underlying parameters. Such settings arise in the presence of auxiliary data sources or synthetic data generated by modern generative AI models. Inference proceeds by projecting posterior draws from a Dirichlet process onto the moment-restricted model, yielding a computationally efficient procedure that is naturally amenable to parallelization. We establish new Bernstein--von Mises and consistency theorems for the resulting projection posterior under both vanishing-prior and persistent-prior regimes. In an application to return prediction using overnight news headlines, we show that AI-generated auxiliary data can provide a useful source of indirect regularization when informative priors on the parameter itself are unavailable.
Weighted Conformal Clustering
oai:arXiv.org:2606.00436v1
arXiv:2606.00436v1 Announce Type: new
Abstract: Clustering is a central tool for discovering latent structure in unlabeled data; yet modern clustering pipelines often end with a hard assignment of each observation to a cluster without rigorous measures of assignment uncertainty. We propose a novel weighted conformal approach for constructing valid confidence sets for cluster labels. The key difficulty is that the labels available for calibration are not observed ground-truth labels, but synthetic labels produced by a data-dependent clustering algorithm. Our method develops a conformal inference algorithm that corrects the resulting mismatch with the latent target labels through weights by formulating conformal clustering as a conditional label-distribution shift problem. We first derive an oracle procedure that attains finite-sample marginal coverage and then develop a computationally tractable and implementable version using estimated conditional label probabilities and novel augmented calibration. We show that the coverage of the estimated-weight procedure depends on the estimator, giving an explicit bound on the loss relative to the nominal level. Empirical studies demonstrate that the proposed weighted approach offers improvements over the recently proposed split conformal clustering procedure in terms of informative confidence set size, especially in nonlinear and high-dimensional clustering applications.
Extrinsic Analysis on BHV4
oai:arXiv.org:2606.00465v1
arXiv:2606.00465v1 Announce Type: new
Abstract: We investigate extrinsic statistical analysis on the Billera-Holmes-Vogtmann tree space with four leaves (T4 or BHV4), based on its recently proposed representation (see [1]), the Spiky Projective Excavated Dodecahedron (SPED). Due to the symmetry of the SPED, the Veronese-Whitney (VW) embedding we consider here produces a natural extrinsic metric for statistical analysis on BHV4. We derive the exact solution for the VW extrinsic mean and apply this novel method to a yeast genome dataset to study the phylogenetic trees of four distinct yeast clades.
Online Sparse Regression with Expanding Observables
oai:arXiv.org:2606.00478v1
arXiv:2606.00478v1 Announce Type: new
Abstract: Online high-dimensional regression has gained increasing attention in recent years, yet existing methods typically assume that all candidate features, including important ones, are observed from the outset of data collection. This assumption is often violated in real-world scenarios, where new variables become available gradually as data accumulate. To address this gap, we introduce a novel framework, Recurrent Adaptive Variable Selection (RAVAS), for online regression with expanding observability. RAVAS employs a recurrent procedure that dynamically updates feature selection as both the sample size and the observable feature set grow. The algorithm is designed to be computationally efficient and memory-light, relying only on low-dimensional sufficient statistics that are updated online. A key advantage of the method lies in its ability to detect and incorporate important variables that emerge later, thereby mitigating the effect of early-stage missingness. We establish theoretical guarantees on model selection, estimation error, and feature coverage, and develop an adaptive online tuning strategy. Extensive simulations and real-world experiments verify the effectiveness of RAVAS for high-dimensional streaming data.
When Do Generalized Permutation Tests Achieve Optimal Power? A Dispersion Characterization
oai:arXiv.org:2606.00578v1
arXiv:2606.00578v1 Announce Type: new
Abstract: We study generalized Monte Carlo permutation tests under a non-uniform distribution on permutations. Focusing on the difference-in-means statistic, we introduce two scalar dispersion measures that quantify departures from complete randomization at the individual and pairwise levels. We show that if both dispersions vanish asymptotically, then the conditional permutation distribution converges to its Gaussian benchmark, the critical value stabilizes, and the test attains optimal Pitman local power. Conversely, if these dispersions fail to vanish, the permutation distribution does not self-average, the critical value need not stabilize, and optimal local power cannot in general be guaranteed. We further show that beyond the standard Pitman local model, suitably chosen non-uniform permutation distributions can strictly dominate the uniform distribution by exploiting nuisance structure in the data.
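For orientation, the uniform (complete-randomization) baseline that the abstract compares against can be sketched as a standard Monte Carlo permutation test on the difference-in-means statistic. This is not the generalized non-uniform procedure studied in the paper; the function name, the two-sided form, and the toy data are assumptions.

```python
import random
import statistics


def permutation_pvalue(treated, control, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the difference in means,
    drawing permutations uniformly at random."""
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # uniform draw over relabelings
        diff = statistics.fmean(pooled[:n_treated]) - statistics.fmean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            exceed += 1
    # The +1 terms count the observed labeling among the permutations.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical outcomes with a clear group difference.
treated = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
control = [1.0, 1.2, 0.9, 1.1, 0.8, 1.3]
p_value = permutation_pvalue(treated, control)
```

A non-uniform variant would replace `rng.shuffle` with a sampler that favors some relabelings over others, which is where the dispersion measures described in the abstract would come into play.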
Spectra-Guided Neural Tucker Factorization
oai:arXiv.org:2606.00584v1
arXiv:2606.00584v1 Announce Type: new
Abstract: This paper proposes Spectra-Guided Neural Tucker Factorization (SG-NTF) for High-Dimensional and Incomplete (HDI) tensor completion. Circumventing discrete representational limits, SG-NTF maps scalar timestamps into a continuous spectral space to abstract temporal periodicities. Concurrently, a Spatio-Temporal Co-Gating (STCG) mechanism explicitly filters latent interactions via multiplicative modulation on spatiotemporal contexts. Evaluations on real-world HDI tensors verify that SG-NTF maintains competitive completion accuracy with parameter efficiency.
Taming the Loss Landscape of PINNs with Noisy Feynman-Kac Supervision: Operator Preconditioning and Non-Asymptotic Error Bounds
oai:arXiv.org:2606.00643v1
arXiv:2606.00643v1 Announce Type: new
Abstract: Physics-Informed Neural Networks (PINNs) often train slowly or fail to converge on challenging partial differential equations (PDEs), a behavior recently linked to severely ill-conditioned loss landscapes inherited from the underlying differential operator. We study PINNs augmented with a pointwise data-fidelity term, added at a few points in the domain to the standard residual and boundary losses. We show that this supervision term acts as an operator-level preconditioner: for suitable weights, our comparison bounds guarantee a substantially smaller condition number than under the standard PINN loss, independently of how the pointwise labels are obtained. For a broad class of PDEs admitting a Feynman-Kac (FK) representation, we generate such labels by Monte Carlo averages of the FK functional, resulting in what we call ``FK-PINNs''. Using the excess risk decomposition approach, we derive non-asymptotic $L^2(\Omega)$-error bounds for FK-PINNs with $\tanh$ activation trained by finitely many steps of gradient descent. Along the way, we establish pseudo-dimension bounds for first- and second-order derivatives of $\tanh$ neural networks, which are of independent interest and, to the best of our knowledge, new. Numerical experiments on Poisson, Schr\"odinger, mean exit time, and committor problems corroborate the theory, and show that FK-PINNs can successfully solve PDEs for which standard PINNs exhibit severe failure modes.
On Median of Incomplete U-Statistics
oai:arXiv.org:2606.00661v1
arXiv:2606.00661v1 Announce Type: new
Abstract: We establish the finite-sample concentration rate for the Median-of-Incomplete-U-Statistics (MIU), an efficient robust estimator for the expectation of symmetric kernels.
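The abstract does not define the estimator, so the following is only one plausible reading of a median of incomplete U-statistics: split the sample into blocks, average a symmetric kernel over randomly drawn pairs within each block, and take the median of the block averages. The block scheme, the pair-sampling rule, and all names here are assumptions.

```python
import random
import statistics


def median_of_incomplete_u(data, kernel, n_blocks=5, pairs_per_block=50, seed=0):
    """Median over blocks of incomplete U-statistics for a symmetric two-argument kernel."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    block_estimates = []
    for block in blocks:
        # Incomplete U-statistic: average the kernel over a random subset of pairs.
        values = []
        for _ in range(pairs_per_block):
            a, b = rng.sample(block, 2)
            values.append(kernel(a, b))
        block_estimates.append(statistics.fmean(values))
    return statistics.median(block_estimates)


def variance_kernel(a, b):
    """Symmetric kernel whose expectation is the variance."""
    return 0.5 * (a - b) ** 2


# Hypothetical data: 200 standard normal draws plus one gross outlier.
_gen = random.Random(1)
data = [_gen.gauss(0.0, 1.0) for _ in range(200)] + [1000.0]
estimate = median_of_incomplete_u(data, variance_kernel)
```

Because the single outlier can contaminate at most one block, the median of the block estimates stays close to the variance of the clean data, whereas the ordinary sample variance is dominated by the outlier.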
Rate-optimal neural boundary detection from unlabeled noisy images
oai:arXiv.org:2606.00715v1
arXiv:2606.00715v1 Announce Type: new
Abstract: We study boundary detection for unlabeled noisy images from a statistical perspective. The aim is to recover an unknown object region from raw intensity observations without pixel-wise annotating labels or a parametric model for the intensity distributions. Motivated by robust Gibbs posterior approaches based on thresholded misclassification losses, we propose a continuous hinge-type surrogate loss for boundary detection. The proposed loss is amenable to gradient-based optimization and can be combined with deep neural networks to represent complex object boundaries. We prove that the proposed loss function is Fisher consistent under a mild separation assumption and obtain a calibration inequality linking excess surrogate risk to the symmetric difference error of the estimated region. Under a piecewise smooth boundary model, we prove that the resulting deep neural network estimator achieves the minimax-optimal boundary recovery rate, up to logarithmic factors. The piecewise smooth formulation accommodates boundaries with corners and kinks, thereby extending beyond globally smooth boundary models. Numerical experiments demonstrate that the proposed method accurately and stably recovers object boundaries across a range of noise levels and shape configurations, and compares favorably with existing unsupervised boundary detection methods.
Causal Density Functions
oai:arXiv.org:2606.00754v1
arXiv:2606.00754v1 Announce Type: new
Abstract: We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act as local density ratios for causal effects. Whereas many causal-strength measures compare whole distributions after graph surgery, causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity
\[
\mathbb{E}_{\mathrm{do}}[f(Y)]
=
\mathbb{E}_{\mathrm{obs}}\!\left[f(Y)\rho(X,Y)\right]
\]
makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by $\rho$ reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks.
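The reweighting identity can be checked exactly on a small discrete model. The sketch below assumes a hypothetical binary confounder $Z$, an intervention $\mathrm{do}(X = x_0)$, and takes $\rho(x, y) = \mathbf{1}[x = x_0]\, p_{\mathrm{do}}(y \mid x_0) / p_{\mathrm{obs}}(x_0, y)$; this particular form of $\rho$ and all probability tables are illustrative assumptions, not the paper's construction.

```python
# Hypothetical discrete model: Z -> X, Z -> Y, X -> Y (all binary).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_x_given_z[z][x]
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | x, z)


def p_y_given_xz(y, x, z):
    p1 = p_y1_given_xz[(x, z)]
    return p1 if y == 1 else 1.0 - p1


def p_obs(x, y):
    """Observational joint law of (X, Y), marginalizing over Z."""
    return sum(p_z[z] * p_x_given_z[z][x] * p_y_given_xz(y, x, z) for z in p_z)


def p_do(y, x0):
    """Interventional law of Y under do(X = x0) (back-door adjustment over Z)."""
    return sum(p_z[z] * p_y_given_xz(y, x0, z) for z in p_z)


def rho(x, y, x0):
    """Assumed density ratio of the interventional law to the observational law."""
    return p_do(y, x0) / p_obs(x0, y) if x == x0 else 0.0


def f(y):
    return 3.0 * y + 1.0


x0 = 1
e_do = sum(f(y) * p_do(y, x0) for y in (0, 1))
e_obs_reweighted = sum(
    f(y) * rho(x, y, x0) * p_obs(x, y) for x in (0, 1) for y in (0, 1)
)
# Naive conditioning on X = x0, which ignores the confounder.
p_x0 = sum(p_obs(x0, y) for y in (0, 1))
e_conditional = sum(f(y) * p_obs(x0, y) / p_x0 for y in (0, 1))
```

In this toy model the reweighted observational expectation matches the interventional one, while plain conditioning on $X = x_0$ does not, because $Z$ confounds $X$ and $Y$.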
Statistical Testing on Directed Graphs by Surrogate Data Generation
oai:arXiv.org:2606.00758v1
arXiv:2606.00758v1 Announce Type: new
Abstract: In recent years, graph signal processing has emerged as a powerful framework at the intersection of signal processing and graph theory, providing tools for the analysis of signals defined on nodes while accounting for their relationships represented by edges. These tools have been successfully applied to various settings, including statistical hypothesis testing. In particular, non-parametric approaches based on surrogate generation have been proposed for signals on undirected graphs. However, they are yet to be extended to directed graphs. In this work, we first revisit the notion of stationary graph signals on directed graphs. Specifically, and through the eigendecomposition of the graph shift operator, we define directed graph wide-sense stationary signals. Then, we propose a new framework to generate surrogate graph signals that preserve covariance structure under stationarity assumptions. Null distributions of the test metric can then be constructed from these surrogates and serve as a reference for the empirical data. Finally, we provide guiding examples and an application on real data, in which we compare the performance of our framework with existing techniques for undirected graphs or based on naive permutation, demonstrating feasibility and superiority of the proposed approach.
The Effect of Choice of Metric and Scan Length on Reliability in Resting-State fMRI
oai:arXiv.org:2606.00767v1
arXiv:2606.00767v1 Announce Type: new
Abstract: Resting-state fMRI (rs-fMRI) is widely used to investigate brain functional connectivity, but the reliability of these measurements remains a key concern for ensuring reproducibility. The distance-based intraclass correlation coefficient (dbICC) generalizes classical ICC to more general data types, making it well-suited for assessing the reliability of measures of functional connectivity. In this study, we applied dbICC to assess the reliability of rs-fMRI data from the Midnight Scan Club (MSC) dataset, which consists of 10 subjects, each undergoing 10 sessions of 30-minute rs-fMRI scans. The functional connectivity was estimated using Pearson's correlation coefficients between all pairs of brain regions, resulting in a correlation matrix for each session. We compared two distance metrics, the widely used Frobenius metric and the Affine Invariant Riemannian Metric (AIRM), which was selected to respect the geometry of the space of covariance matrices, to evaluate how the choice of metric affects the reliability of estimating correlation. In addition, we investigated the impact of scan length and time interval between sessions on reliability. Results based on each metric agreed in some respects but disagreed in others, illustrating the impact of choice of metric. We also found that longer scan lengths significantly improve reliability, while the time interval between sessions has less impact.
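To make the contrast between the two metrics concrete, here is a minimal sketch using the standard definitions: the Frobenius distance $\|A - B\|_F$ and the AIRM distance $\|\log(A^{-1/2} B A^{-1/2})\|_F$ for symmetric positive definite matrices. The random test matrices and helper names are assumptions for illustration only. The sketch does not reproduce the dbICC pipeline of the study, and it assumes full-rank inputs.

```python
import numpy as np

def frobenius_distance(a, b):
    """Euclidean (Frobenius) distance between two matrices."""
    return np.linalg.norm(a - b, ord="fro")

def airm_distance(a, b):
    """Affine Invariant Riemannian distance between SPD matrices a and b."""
    w, v = np.linalg.eigh(a)                  # a = V diag(w) V^T
    a_inv_sqrt = (v / np.sqrt(w)) @ v.T       # a^{-1/2}
    m = a_inv_sqrt @ b @ a_inv_sqrt
    m = (m + m.T) / 2.0                       # symmetrize against round-off
    eig = np.linalg.eigvalsh(m)
    return np.sqrt(np.sum(np.log(eig) ** 2))

rng = np.random.default_rng(1)
g1, g2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A = g1 @ g1.T + np.eye(4)                     # hypothetical SPD matrices
B = g2 @ g2.T + np.eye(4)
W = rng.normal(size=(4, 4)) + 3.0 * np.eye(4) # an invertible change of basis

d_airm = airm_distance(A, B)
d_airm_cong = airm_distance(W @ A @ W.T, W @ B @ W.T)
d_frob = frobenius_distance(A, B)
d_frob_cong = frobenius_distance(W @ A @ W.T, W @ B @ W.T)
print(d_airm, d_airm_cong)   # equal up to round-off
print(d_frob, d_frob_cong)   # generally different
```

The AIRM distance is unchanged under the congruence $A \mapsto W A W^\top$, while the Frobenius distance generally is not, which is one reason the two metrics can rank session-to-session differences differently.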
Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler
oai:arXiv.org:2606.00783v1
arXiv:2606.00783v1 Announce Type: new
Abstract: Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ($R^2 = 0.9958$ for $<5$ years; $R^2 = 0.9956$ for $\geq 5$ years) with residual errors below $2\%$ and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.
Robust inference for risk heterogeneity under group imbalance
oai:arXiv.org:2606.00797v1
arXiv:2606.00797v1 Announce Type: new
Abstract: Population-level heterogeneity is ubiquitous in biomedical data, where differences across demographic or clinical subgroups can substantially alter risk patterns. For example, in intensive care unit (ICU) studies, the mortality risk associated with specific admission diagnoses can vary across ethnic groups. Existing approaches for detecting risk heterogeneity are often sensitive to baseline model misspecification and regularization bias, both of which commonly arise in practice. In this paper, we propose a robust framework for inferring risk heterogeneity between two populations using Neyman orthogonality, which yields estimators that are locally insensitive to nuisance parameter estimation error. The proposed estimator is consistent and asymptotically normal, and simulation studies demonstrate that in finite samples our method substantially reduces bias and improves inferential stability compared with standard likelihood-based approaches. In an application to the eICU Collaborative Research Database, our method reveals clinically meaningful ethnicity-specific heterogeneity in admission diagnoses for in-hospital mortality that standard likelihood-based methods fail to detect.
Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing
oai:arXiv.org:2606.00834v1
arXiv:2606.00834v1 Announce Type: new
Abstract: Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2\sigma$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8,000 to 12,200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.
Sequential multiple testing with multiple hypotheses and prior information on the hypothesis configuration
oai:arXiv.org:2606.00839v1
arXiv:2606.00839v1 Announce Type: new
Abstract: In this work, we study the problem of testing the marginal distributions of multiple independent, sequentially observed data streams, where for each stream there are multiple candidate hypotheses to select from, in the presence of prior information on the unknown hypothesis configuration. The goal is to understand the benefit of such information and to design a sequential testing procedure that effectively leverages it. We start with arbitrary prior information and specialize to concrete examples, including known number or known lower bound on the number of streams following each hypothesis, and the presence of exclusive hypotheses. The designed procedure is three-fold: (i) reliable, i.e., controlling all types of familywise error probabilities below arbitrary user-specified levels, (ii) computationally efficient, i.e., focusing on minimal sets of alternative hypothesis configurations in making decisions, and (iii) asymptotically optimal, i.e., achieving the minimum expected sample size among all reliable procedures asymptotically as the error levels go to zero. Numerical studies are presented for illustration.
Partial Identification under High-Dimensional Potential Outcomes and Confounders via Optimal Transport
oai:arXiv.org:2606.00847v1
arXiv:2606.00847v1 Announce Type: new
Abstract: Partial identification provides informative causal guarantees when point identification is impossible, but existing approaches based on optimal transport (OT) become computationally and statistically intractable in high-dimensional settings. This limitation is particularly severe when both potential outcomes and confounders are high-dimensional, where classical OT-based bounds suffer from the curse of dimensionality and unfavorable convergence rates. To address this challenge, we propose a novel estimator that decomposes the transport problem into a low-dimensional signal subspace and a high-dimensional residual subspace. Unlike existing projection-based methods that discard residual information, we recover the residual transport energy using the Sliced Wasserstein distance, which is computationally efficient and robust to high dimensions. We establish interpretable conditions controlling the approximation gap based on residual structure and provide a data-driven rule for signal dimension selection. Empirical results show that our estimator consistently outperforms projection-only baselines by recovering lost transport energy, yielding more informative causal bounds while remaining computationally tractable in high dimensions.
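The Sliced Wasserstein distance mentioned above averages one-dimensional transport costs over random projection directions, which is why it stays cheap in high dimensions. The following is a generic Monte Carlo sketch for two equal-size samples, not the signal/residual estimator proposed in the paper. The function name, the number of projections, and the test data are assumptions.

```python
import numpy as np

def sliced_wasserstein2(x, y, n_proj=200, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same number of rows n.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # uniform directions on the sphere
    # In one dimension, optimal transport between equal-size samples matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return np.mean((px - py) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(size=(500, 50))
shift = np.ones(50)                  # hypothetical mean shift in every coordinate
sw_same = sliced_wasserstein2(x, x)
sw_shift = sliced_wasserstein2(x, x + shift)
print(sw_same, sw_shift)
```

For a pure translation by a vector $c$ in dimension $d$, the population value is $\|c\|^2 / d$, so the second number should be close to 1 here, up to Monte Carlo error in the random directions.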
Change-Point Detection for Object-valued Time Series
oai:arXiv.org:2606.00858v1
arXiv:2606.00858v1 Announce Type: new
Abstract: This article is concerned with change point detection for object-valued data that reside in a metric space, a problem that has attracted recent interest in the statistics and econometrics literature. The existing methods either focus on independent data or can only detect change in the Fr\'echet mean or variance. In this paper, we propose a self-normalization (SN, hereafter) based statistic for detecting a shift in the marginal distribution of object-valued time series. Our test is universally applicable to a wide range of object-valued data, such as distributional and network data, and can accommodate weak serial dependence. In addition, the proposed test statistic is almost tuning-parameter free, has a pivotal limiting null distribution, and uses only the pairwise distances. When combined with the Wild Binary Segmentation algorithm (WBS, hereafter), our statistic can be used to estimate the number and locations of multiple change points. Asymptotic results for our SN based statistic are derived under both the null and local alternatives in the single change point setting. For the first time, the WBS estimation consistency is shown for a broad class of object-valued time series and in a nonparametric setting, which requires new non-standard theoretical arguments. Extensive numerical experiments and real data analysis are conducted to illustrate the effectiveness and broad applicability of our proposed method.
Another Look at Bandwidth-free Inference: a Sample Splitting Approach
oai:arXiv.org:2606.00864v1
arXiv:2606.00864v1 Announce Type: new
Abstract: Bandwidth-free tests and inference procedures for a multi-dimensional parameter have attracted considerable attention in the econometrics and statistics literature. These tests can be conveniently implemented because they are free of tuning parameters, and they have more accurate size than the traditional HAC-based approaches, which involve consistent long-run variance estimation. However, when the sample size is small or medium, these bandwidth-free tests exhibit large size distortion when both the dimension of the parameter and the magnitude of temporal dependence are moderate, making them unreliable to use in practice. In this paper, we propose a sample splitting based approach to reduce the dimension of the parameter to one for the subsequent bandwidth-free inference.
Our SS-SN (sample splitting plus self-normalization) idea is broadly applicable to many testing problems for time series, including mean testing, testing for zero autocorrelation, linear hypothesis testing in a time series regression model and testing for a change point in multivariate mean. Specifically, we propose $L_{\infty}$-type and $L_2$-type SS-SN test statistics, derive their limiting distributions under both the null and alternatives, and show their effectiveness in alleviating size distortion via simulations. As an important theoretical contribution, we obtain the limiting distributions for both SS-SN test statistics in the multivariate mean testing problem when the dimension is allowed to diverge as sample size grows to infinity. In addition, we show the asymptotic independence of $L_{\infty}$-type and $L_2$-type SS-SN test statistics under the null in the growing dimensional setting.
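For readers unfamiliar with self-normalization, the sketch below computes the classical bandwidth-free SN statistic for a scalar mean, $n(\bar X_n - \mu_0)^2 / W_n$ with $W_n = n^{-2}\sum_{t=1}^{n}\{S_t - (t/n) S_n\}^2$ and $S_t$ the partial sums. This is the one-dimensional building block only. The sample splitting step and the $L_\infty$/$L_2$ statistics of the paper are not reproduced, and the simulated series is an assumption.

```python
import numpy as np

def sn_mean_statistic(x, mu0=0.0):
    """Self-normalized statistic for H0: E[X_t] = mu0 (scalar time series)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = np.cumsum(x)                        # partial sums S_t
    t = np.arange(1, n + 1)
    bridge = s - (t / n) * s[-1]            # S_t - (t/n) S_n
    w = np.sum(bridge ** 2) / n ** 2        # self-normalizer, no bandwidth needed
    return n * (x.mean() - mu0) ** 2 / w

rng = np.random.default_rng(3)
x = rng.normal(size=500)                    # hypothetical data with mean zero
stat = sn_mean_statistic(x, mu0=0.0)
stat_scaled = sn_mean_statistic(3.0 * x, mu0=0.0)
stat_shifted = sn_mean_statistic(x + 5.0, mu0=5.0)
print(stat, stat_scaled, stat_shifted)
```

The statistic is invariant to rescaling the series, which is what removes the need to estimate the long-run variance. In practice it is compared with quantiles of its nonstandard pivotal limit rather than chi-square quantiles.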
Statistical Analysis of using the Shapley Value for Sensor Anomaly Localization with Accurate Classifiers
oai:arXiv.org:2606.00867v1
arXiv:2606.00867v1 Announce Type: new
Abstract: Recent publications have suggested using the Shapley value for sensor anomaly/attack localization. We study the performance of such an approach by using mathematically defined optimum binary classifiers in the Shapley value calculation. To judge localization performance, we study the ability of the Shapley value of a given sensor observation to determine if that observation is anomalous. First, we prove that for cases with independent sensor observations, an optimized anomaly test using the Shapley value is equivalent to an optimized lower-complexity anomaly test using a single term in the Shapley value calculation, yielding the exact same probability of error. For some popular dependent observation cases involving two sensors, including correlated bivariate Gaussian/Laplacian probability density functions and constant/Gaussian attacks/anomalies, we prove that these two tests are fundamentally different, yielding different decision regions and error probabilities. Further, we prove that the Shapley value test is sometimes strictly inferior to the other (single term in Shapley calculation) test in certain statistically dependent bivariate Gaussian scenarios with large correlation magnitude and additive attacks/anomalies, while it is strictly superior in others, depending on the sign of the correlation. One can combine these two approaches to obtain a strictly better approach in these cases. These results, which provide the first theoretical statistical analysis of Shapley-based localization, seem very interesting based on the wide acceptance of the Shapley value by many researchers and should encourage further research on this topic. Numerical results are provided which illustrate our findings.
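The abstract above contrasts the full Shapley value with a test built from a single term of its calculation. As a reminder of what those terms are, here is a small exact Shapley computation by averaging marginal contributions over all player orderings. The two-sensor value function is a made-up toy, not a classifier output from the paper.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)  # one marginal term
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}

# Hypothetical coalition values for two sensors (illustrative numbers only).
toy_value = {
    frozenset(): 0.0,
    frozenset({"s1"}): 0.6,
    frozenset({"s2"}): 0.2,
    frozenset({"s1", "s2"}): 1.0,
}
phi = shapley_values(["s1", "s2"], lambda c: toy_value[frozenset(c)])
print(phi)
```

With two sensors, each Shapley value is the average of exactly two marginal terms, so a single-term test simply keeps one of them. The cost of the exact computation grows factorially with the number of players, which is why the lower-complexity test matters.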
Anytime-valid testing with e-values and confirmatory adaptive designs
oai:arXiv.org:2606.00878v1
arXiv:2606.00878v1 Announce Type: new
Abstract: Confirmatory adaptive designs were introduced more than 30 years ago and enable, for example, sample size re-assessments and the selection of treatments, endpoints, and subpopulations during the course of a clinical trial. Recently, sequential tests based on e-values for anytime-valid inference have been developed, promising seemingly similar or even greater flexibility and utility. In this note, we compare these two independently developed concepts, shedding light on their formal and methodological connections and differences. Specifically, we show that adaptive design tools like conditional error functions and combination tests are formally equivalent to e-value based, anytime-valid sequential tests. However, in spite of their common fundamental intention to bring flexibility into statistical inference, they have quite different emphases: while hypothesis tests based on combination tests and conditional error functions usually intend to exhaust type I error rates under the offered flexibility, e-value based testing aims at additional flexibility with regard to optional continuation, the chosen level and, in recent extensions, the loss functions to be controlled. We also indicate how recent e-value achievements could enrich clinical trial methodology and how adaptive design methodology could inspire and improve e-value based testing.
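A minimal sketch of an anytime-valid test with e-values may help fix ideas. It uses the textbook construction: a running product of likelihood ratios, rejecting the null as soon as the product reaches $1/\alpha$, which controls the type I error at any stopping time by Ville's inequality. The Gaussian null and alternative, the effect size `delta`, and the data are assumptions for illustration. The sketch does not implement the conditional error functions or combination tests discussed in the note.

```python
import numpy as np

def e_process(xs, delta=0.5):
    """Running product of likelihood ratios for N(delta, 1) against N(0, 1)."""
    log_lr = delta * np.asarray(xs, dtype=float) - 0.5 * delta ** 2
    return np.exp(np.cumsum(log_lr))

def first_rejection(e_values, alpha=0.05):
    """1-based index of the first time the e-process reaches 1/alpha, else None."""
    hits = np.nonzero(e_values >= 1.0 / alpha)[0]
    return int(hits[0]) + 1 if hits.size else None

# Hypothetical observations arriving one at a time.
xs = np.ones(12)
e_vals = e_process(xs, delta=0.5)
stop_at = first_rejection(e_vals, alpha=0.05)
print(e_vals.round(2), stop_at)
```

Because the guarantee holds at every time point simultaneously, the analyst may stop early or keep collecting data without adjusting the threshold. The price is that the type I error budget is typically not exhausted, which is the difference in emphasis the note highlights.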
Hypothesis Testing for a Functional Parameter via Self-normalization
oai:arXiv.org:2606.00887v1
arXiv:2606.00887v1 Announce Type: new
Abstract: Testing simple or composite hypotheses on a functional parameter has attracted considerable attention in time series analysis. To accommodate the unknown temporal dependence, classical nonparametric approaches such as block bootstrapping and subsampling all involve a bandwidth parameter, the choice of which can substantially affect the finite sample performance. The self-normalization (SN) method is tuning-parameter free when applied to the inference of a finite-dimensional parameter, but its applicability to a functional parameter is unknown.
In this paper, we propose a sample splitting based approach to generalize the SN method to hypothesis testing of a functional parameter. Our SS-SN (sample splitting plus self-normalization) idea is broadly applicable to many testing problems for functional parameters, including testing simple/composite hypotheses on the marginal cumulative distribution function, testing for time-reversibility and testing for a change point in the spectral distribution of a multivariate time series. Specifically, we derive the pivotal limiting distributions of our SS-SN test statistics under the null for both simple and composite null hypotheses, and derive the limiting power function under local alternatives. Numerical simulations show that our new tests tend to yield accurate size with competitive power performance as compared to many existing ones.
Notes on Randomized Controlled Trials for Studying Social Media Harms
oai:arXiv.org:2606.00900v1
arXiv:2606.00900v1 Announce Type: new
Abstract: Randomized controlled trials (RCTs) and person-level observational studies feature prominently in debates over social media harms. I highlight some under-acknowledged limitations of such evidence. Most important is that published RCTs typically identify effects of a \textit{local}, or small-scale, intervention: a person is assigned to quit social media, but her immediate peers continue using it in large numbers. Critics of social media, in contrast, focus on a \textit{global}, or large-scale, intervention: the mass adoption of social media among U.S. teenagers. Such global interventions alter both the proximal social environment and the broader culture, potentially harming teenagers who abstain from social media entirely. This paper discusses the local--global distinction at length and offers other notes on the limits of learning about social media harms from existing RCTs and person-level observational studies. I suggest that triangulating different forms of imperfect evidence may provide the deepest insights about social media's aggregate effect on teen mental health.
Bandit Simulation for Average Reward Inference
oai:arXiv.org:2606.00913v1
arXiv:2606.00913v1 Announce Type: new
Abstract: Multi-armed bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying a bandit, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical inference methods cannot be used because bandit algorithms introduce complex dependencies in the collected data, which violate the i.i.d. assumption underlying many classical approaches. Moreover, existing inference methods for adaptively collected data only apply to estimands that do not depend on the data-collection algorithm (such as the mean reward under a fixed action). We propose Bandit Simulation for Inference (BSI), a framework that fits a simulator of the bandit environment from observed data, either on-policy or off-policy, and uses it to estimate the mean reward under any evaluation policy, including adaptive black-box algorithms. BSI formally propagates uncertainty in the estimated simulator parameters into the confidence interval construction. Furthermore, BSI requires only weak exploration assumptions on the behavior policy to be valid, and it avoids importance weighting. We prove that BSI yields asymptotically valid confidence intervals, and demonstrate empirically that it maintains nominal coverage in settings where standard off-policy evaluation methods fail.
Efficient Synthetic Network Generation via Latent Embedding Reconstruction
oai:arXiv.org:2606.00934v1
arXiv:2606.00934v1 Announce Type: new
Abstract: Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.
Design-based edge-level causal inference with machine learning assisted covariate adjustment
oai:arXiv.org:2606.00965v1
arXiv:2606.00965v1 Announce Type: new
Abstract: We study design-based causal inference for edge-level outcomes in directed networks under dyadic interference. In this setting, outcomes are defined on directed edges and depend on the joint treatment assignments of pairs of units, inducing a complex dependence structure that invalidates standard estimation and inference procedures developed for node-level data. We construct Horvitz--Thompson estimators for a general class of edge-level causal effects and establish their asymptotic normality under mild regularity conditions. To enable valid inference, we develop variance estimators that exploit identifiable components of network dependence, yielding substantially less conservative bounds than classical approaches. To improve efficiency, we incorporate auxiliary covariates through a sample splitting and cross-fitting procedure. A key technical challenge is that standard two-fold sample splitting fails in the presence of edge-level outcomes due to the dependence induced by shared units. To address this issue, we introduce a three-fold sample splitting and cross-fitting scheme that restores the conditional independence required for unbiased estimation. Under a stability condition, the resulting covariate-adjusted estimator is asymptotically normal and accommodates both linear adjustment and flexible machine learning methods. We further introduce a calibration step that guarantees no asymptotic efficiency loss relative to the unadjusted estimator. Simulation studies and a real-data application confirm the theoretical results and demonstrate substantial efficiency gains.
Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates
oai:arXiv.org:2606.00984v1
arXiv:2606.00984v1 Announce Type: new
Abstract: We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only $O(\log\log T)$ parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in $T$) simultaneously in both the small-$K$ and large-$K$ regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step -- a dominant computational bottleneck in prior strictly batched static-grid methods -- yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under $O(\log\log T)$ parameter updates that are also computationally efficient in practice.
Theoretical Analysis of Engression and Reverse Markov Engression
oai:arXiv.org:2606.01002v1
arXiv:2606.01002v1 Announce Type: new
Abstract: Engression is a recently proposed and effective framework for conditional distribution learning. Its multi-step Reverse Markov extension further improves generative flexibility by decomposing complex conditional sampling into sequential reverse transitions. Despite their strong empirical performance, rigorous finite-sample statistical guarantees for these methods remain unavailable. In this paper, under deep neural network parameterizations, we establish nonasymptotic convergence bounds for Engression by directly controlling the Energy Distance between the learned and target conditional distributions. For the Reverse Markov framework, we further develop an Energy-Distance-based chain rule that enables a rigorous analysis of error propagation across reverse steps. Our analysis yields corresponding excess-risk bounds that are near-optimal up to logarithmic factors relative to the classical minimax rate over a general H\"older class.
Semiparametric Efficiency of Residual Correlation Testing under Gaussian Additive Noise Models
oai:arXiv.org:2606.01011v1
arXiv:2606.01011v1 Announce Type: new
Abstract: This paper studies conditional independence testing under the Gaussian additive noise model (GANM), where two variables are modeled as nonlinear functions of covariates with independent bivariate Gaussian regression errors. Under this framework, conditional independence can be characterized by the correlation coefficient of the regression errors, which motivates a test based on the Pearson correlation coefficient computed from the fitted residuals. Despite its simple form, the asymptotic behavior and statistical efficiency of the resulting test have not been well understood. In this paper, we develop the semiparametric efficiency theory under GANM and show, surprisingly, that the efficient estimator coincides exactly with the ordinary residual Pearson correlation estimator. We further establish the asymptotic properties of the proposed test and develop the corresponding inference procedure. Simulation studies demonstrate that the proposed method achieves near-oracle efficiency and competitive empirical power while maintaining valid Type I error control. We further apply the proposed test to conditional dependence analysis of U.S. stock returns.
Measuring the Symmetry--Data Exchange Rate
oai:arXiv.org:2606.01090v1
arXiv:2606.01090v1 Announce Type: new
Abstract: Equivariance theory predicts that an architectural symmetry prior reduces sample complexity by a factor of |G|; this is widely cited but rarely measured as a scaling law with controls that separate the prior from its confounds. On a controlled C_n-symmetric task, we report three findings. First, a wrong-group control with identical orbit size and matched compute is worse than no constraint (joint pairwise CI [+0.79, +3.26] excludes zero, robust across estimators); misaligned constraint is actively harmful, not merely unhelpful. Second, an augmentation baseline equipped with test-time orbit averaging matches the equivariant model exactly -- bit-identical per-epoch validation curves across matched cells -- so the architecture-vs-augmentation gap is conditional on asymmetric test-time computation, not unconditional. Third, the relative exchange rate beta_diff = 1.28 is consistent in sign and order of magnitude with the theoretical 1.0 (single-level CI [+0.92, +2.05]); the more conservative two-level bootstrap (seeds x group sizes) widens this to [-0.63, +1.72], including zero, and a finer-N replication on a sqrt(2)-spaced grid is inconclusive (point estimate -0.82). The methodological contributions -- the relative-rate estimator that cancels the shared-difficulty confound, the wrong-group control, and a pre-specified failure taxonomy -- transfer to any inductive bias whose strength can be parameterised. Honest scoping: the primary estimator beta_diff was adopted post-hoc after the initial analysis revealed a positive-slope identifiability problem; the design was never externally pre-registered; and the headline number rests on an OLS slope over seven group sizes on a coarse N grid. This is an exploratory study, not a confirmatory measurement; the wrong-group result is the cleanest finding and the one we report with the most confidence. A registered replication on fresh seeds is future work.
Topological Ignorability for Structural Causal Effects Beyond Means
oai:arXiv.org:2606.01184v1
arXiv:2606.01184v1 Announce Type: new
Abstract: Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regimes, create loops or holes, generate branches, or reorganize an outcome cloud while leaving the average response nearly unchanged. In such settings, mean-based causal estimands such as the average treatment effect may miss important structural effects.
We introduce topological-geometrical causal metrics based on summaries of interventional outcome laws, including density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries. These metrics quantify structural differences between treated and untreated outcome laws beyond averages. We also study the assumptions needed for causal interpretation. We introduce topological ignorability, a topological analogue of conditional ignorability that requires invariance of the chosen structural feature rather than the full counterfactual distribution. When the chosen summary is injective, this condition coincides with weak ignorability; for noninjective summaries, it can identify the structural feature of interest without identifying the full interventional law.
We define a covariate-standardized topological-geometrical causal effect and develop practical estimators. We validate the framework in two hidden-confounding benchmarks: a fully synthetic exact benchmark and a real-covariate semi-synthetic benchmark using Wisconsin breast-cancer covariates. In both, weak ignorability fails and balancing observed covariates nearly eliminates standardized mean differences, yet the coordinate-mean average treatment effect remains biased. By contrast, selected finite density-superlevel Betti and Euler contrasts remain stable across oracle, observational, and weighted analyses.
Markovianity-Based Conditioning Depth Diagnostics for Hidden Confounding in Observational Datasets
oai:arXiv.org:2606.01214v1
arXiv:2606.01214v1 Announce Type: new
Abstract: Reliable causal discovery in time series depends on whether the conditioning set adequately represents the system state. If relevant history or unobserved processes are omitted, residual dependence can appear as direct causal links. We study this failure mode in prominent constraint-based causal discovery methods through a simple premise: how much does the inferred graph change as conditioning depth increases? When the observed process is described approximately by a finite-order Markovian representation, inferred graphs should stabilize once sufficient past observations are included. Under hidden confounding and other hidden-memory mechanisms, inferred graphs should remain sensitive to depth because the observed state is incomplete. We formalise this behavior with graph instability statistics computed over the conditioning-depth grid. The empirical study covers synthetic systems with known ground truth and calcium imaging recordings with unknown causal structure. In simulations, both Markovian and non-Markovian systems broadly upheld our premise. With known ground truth, we evaluate recovery using confusion-matrix metrics, while in real data without ground truth, we use descriptive graph instability summaries. Across synthetic Markovian and hidden-memory systems, c-GC variants give the clearest separation, while PCMCI variants show weaker compatible trends. In real data, inferred connectivity drops sharply with conditioning depth and then levels off. This method, however, does not recover latent graphs, nor does it clearly separate latent confounding from lag-order misspecification, non-stationarity, or measurement error. Its contribution is more modest and practical: an explicit model-checking tool for deciding when causal claims are stable and when they should be treated cautiously.
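As a rough illustration of the depth-stability premise in this abstract, the sketch below scores how much an inferred edge set changes between consecutive conditioning depths. The Jaccard distance, the function names, and the toy edge sets are assumptions made for illustration; the abstract does not say which graph instability statistics the paper uses.

```python
def jaccard_distance(edges_a, edges_b):
    """Return 1 - |A & B| / |A | B|, or 0.0 when both edge sets are empty."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def instability_profile(graphs_by_depth):
    """Jaccard distances between graphs inferred at consecutive depths."""
    depths = sorted(graphs_by_depth)
    return [
        jaccard_distance(graphs_by_depth[d0], graphs_by_depth[d1])
        for d0, d1 in zip(depths, depths[1:])
    ]


# Hypothetical directed edges (source, target) inferred at depths 1 to 4.
graphs = {
    1: {("x", "y"), ("y", "z"), ("x", "z")},
    2: {("x", "y"), ("y", "z")},
    3: {("x", "y"), ("y", "z")},
    4: {("x", "y"), ("y", "z")},
}
profile = instability_profile(graphs)
print(profile)
```

In this toy case the profile falls to zero after the first depth increase, which is the pattern the premise associates with an adequately conditioned, approximately Markovian system.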
Functional Clustering of Survival Data via Smoothed Log-Hazard Trajectories: A Risk-Dynamics Perspective
oai:arXiv.org:2606.01239v1
arXiv:2606.01239v1 Announce Type: new
Abstract: This paper investigates clustering in survival data by shifting the analytical focus from cumulative survival probabilities to instantaneous risk, as characterized by the hazard function. We model smoothed log-hazard trajectories as functional objects that capture the temporal evolution of risk and propose a clustering framework based on Functional Principal Component Analysis applied to B-spline smoothed log-hazard trajectories. The number of retained functional principal components is selected before clustering using a 95% cumulative explained-variance rule, and clustering is then performed on the unstandardized FPCA scores. The proposed methodology is evaluated through simulation studies covering progressively complex scenarios, including overlapping and crossing hazard functions, cohort imbalance, heterogeneous risk profiles, and outlier contamination. The framework is further illustrated on two real-world clinical datasets, the German Breast Cancer Study and the Primary Biliary Cirrhosis dataset. Results show that the proposed log-hazard-based functional clustering framework provides an interpretable representation of relative temporal risk dynamics, with competitive internal cohesion and explicit robustness diagnostics when compared with cumulative-survival-based benchmarks.
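The 95% cumulative explained-variance rule mentioned in this abstract can be sketched as follows. The function name and the eigenvalues are hypothetical; in practice the eigenvalues would come from an FPCA of the smoothed log-hazard curves.

```python
def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative share of
    total variance reaches `threshold`."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    # Components are taken in order of decreasing eigenvalue.
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical FPCA eigenvalues (variance carried by each component).
eigs = [6.0, 3.0, 0.7, 0.2, 0.1]
k = n_components_for_variance(eigs)
print(k)
```

Here the first two components explain 90% of the variance and the first three explain 97%, so three scores per subject would be passed to the clustering step.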
Efficient Approximation for Encoder--Decoder Neural Operators via Variation Spaces
oai:arXiv.org:2606.01244v1
arXiv:2606.01244v1 Announce Type: new
Abstract: We study operator learning using encoder--decoder neural networks. Inspired by the function-space theory of neural networks, we introduce a variation space as an infinite-dimensional structural class for nonlinear operators. This space is defined through vector-valued measures directly on the input and output spaces. For operators in this space, we establish approximation bounds for encoder--decoder two-layer networks in the Bochner $L^q$ norm. The resulting error bound decomposes into the input encoding error, the output encoding error, and a finite-width approximation term of order $N^{-1/2}$, with a constant independent of the input and output encoding dimensions. When the input and output encoding errors decay polynomially in the encoding dimensions, these estimates yield algebraic approximation and learning rates. The results provide theoretical guarantees for efficient neural operator learning beyond general Lipschitz or Fr\'echet differentiable operator classes.
Distribution-free changepoint localization after sequential change detection
oai:arXiv.org:2606.01256v1
arXiv:2606.01256v1 Announce Type: new
Abstract: This paper introduces a distribution-free framework for constructing post-detection confidence sets for changepoints after stopping a sequential change detection procedure. It is well known that conformal test martingales can be used to sequentially detect changes in distribution, but by themselves they provide no inference for the time at which a proclaimed change occurred. Past work on post-detection inference requires pre- and post-change classes of distributions to be known, but this paper accomplishes localization of the changepoint without any distributional assumptions. We establish finite-sample coverage guarantees (conditional on correct detection). We provide non-asymptotic bounds on the conditional expected size of the confidence sets. Under suitable asymptotic regimes, we prove that the conditional expected size of the confidence set remains uniformly bounded, and we demonstrate strong empirical performance on simulated and real data. To the best of our knowledge, this is the first general distribution-free framework for sequential changepoint localization with a valid post-detection coverage guarantee.
Statistical Inference on Gradient Flows
oai:arXiv.org:2606.01257v1
arXiv:2606.01257v1 Announce Type: new
Abstract: Gradient-based algorithms are central to modern statistical estimation, yet their statistical analysis is often restricted to fixed-time behavior, such as convergence to a population target or fluctuations at a prescribed iteration. In many applications, however, uncertainty quantification is needed along the entire optimization path, especially when the stopping time is data-dependent or divergent. In this paper, we develop a theory for time-uniform statistical inference on gradient flows arising from empirical risk minimization. We prove a uniform central limit theorem that characterizes the deviation between empirical and population gradient flows as a continuous-time Gaussian process over the entire nonnegative real line. Building on this result, we introduce an algorithm-aware covariance estimator that evolves jointly with the gradient flow and avoids matrix inversion, resampling, or sample splitting. We show that the covariance estimator is uniformly consistent over time and use it to construct confidence intervals for the target parameter with asymptotically valid coverage. Our results connect optimization dynamics with statistical inference and provide practical tools for uncertainty quantification in gradient-based methods.
Scale-Free Priors and Survival Dynamics: A Bayesian Framework for Conflict Duration
oai:arXiv.org:2606.01328v1
arXiv:2606.01328v1 Announce Type: new
Abstract: We have developed a fully Bayesian survival-analysis framework that reformulates inference about system lifetimes in terms of hazard and survival functions, and extends this representation to interacting actors. Starting from J.~Richard Gott's Copernican principle, we express the scale-free prior as a baseline hazard $\lambda(t)=1/t$, thereby linking a static prior over lifetimes to the dynamic language of survival analysis. In this formulation, Bayesian updating corresponds to conditioning on survival, while the resulting posterior distribution admits a natural representation in terms of hazard and survival functions. The approach is intended for settings where data are sparse or unreliable, and where a scale-free, assumption-light baseline is preferable to heavily parameterized models.
Building on this foundation, we derive general expressions for two-actor systems that characterize joint survival, conditional lifetimes, and comparative outcomes without requiring a specific parametric form of interaction. This yields a flexible and modular framework in which baseline dynamics are separated from interaction effects, allowing different mechanisms to be incorporated transparently. Thus, the primary contribution is a general hazard-based formulation of Bayesian updating and its extension to interacting systems.
To illustrate the framework, we consider a multiplicative resource-depletion specification in which interaction modifies the baseline hazard through cumulative engagement intensity. This example demonstrates how interaction terms can be embedded while preserving analytical tractability, including closed-form expressions under simplifying assumptions. We further provide a stylized application to an asymmetric two-actor conflict, the 2026 US/Israel--Iran hostilities, to highlight the qualitative implications of the approach.
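The link between the baseline hazard $\lambda(t)=1/t$ and conditioning on survival can be made concrete with a short worked derivation. The observed age $t_0 > 0$ and the notation $S(t \mid t_0)$ are introduced here for illustration and are not taken from the paper; the calculation uses only the standard relation between hazard and survival functions.

```latex
% Survival beyond t, given survival up to the observed age t_0 > 0,
% under the scale-free baseline hazard \lambda(s) = 1/s:
\begin{align*}
S(t \mid t_0)
  &= \exp\!\left(-\int_{t_0}^{t} \lambda(s)\,\mathrm{d}s\right)
   = \exp\!\left(-\int_{t_0}^{t} \frac{\mathrm{d}s}{s}\right)
   = \exp\!\left(-\ln\frac{t}{t_0}\right)
   = \frac{t_0}{t}, \qquad t \ge t_0 .
\end{align*}
% Median total lifetime: S(t \mid t_0) = 1/2 gives t = 2 t_0,
% so the median remaining lifetime equals the observed age t_0.
% Central 95% interval for the remaining lifetime r = t - t_0:
\begin{align*}
\Pr\!\left(r > 39\,t_0 \mid t_0\right) = \frac{t_0}{40\,t_0} = 0.025,
\qquad
\Pr\!\left(r < \frac{t_0}{39} \,\Big|\, t_0\right)
  = 1 - \frac{t_0}{t_0\,(1 + 1/39)} = 0.025 .
\end{align*}
```

Under these assumptions the remaining lifetime lies between $t_0/39$ and $39\,t_0$ with probability 0.95, which matches the familiar Copernican-principle interval attributed to Gott.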
FlowSDR: Sufficient Dimension Reduction via Conditional Normalizing Flows
oai:arXiv.org:2606.01346v1
arXiv:2606.01346v1 Announce Type: new
Abstract: Sufficient dimension reduction (SDR) seeks a low-dimensional linear projection of predictors that preserves the conditional distribution of the response. Existing methods target this conditional distribution indirectly, via inverse moments, local forward regression, or neural ensemble regression. We propose FlowSDR, a likelihood-based framework that jointly learns the projection and the conditional density by maximizing a conditional log-likelihood, with the density parameterized by monotone rational-quadratic spline flows. The estimator is Fisher consistent under the SDR model, and its sample objective admits a population interpretation in terms of mutual information. As a complementary model within the same likelihood framework, we introduce the neural Gaussian SDR, a heteroscedastic conditional Gaussian model whose mean and variance are parameterized by shared neural-network functions of the projected predictors. In simulations spanning Gaussian errors, heavy-tailed distributions, two-component mixtures, and settings with tail behavior not captured by mean-variance structure, FlowSDR recovers the central subspace more accurately than existing SDR methods and the neural Gaussian SDR baseline. We further validate these advantages on a face-age prediction task using the UTKFace dataset.
On the Uncertainty Quantification Ability of Tabular Foundation Models
oai:arXiv.org:2606.01427v1
arXiv:2606.01427v1 Announce Type: new
Abstract: Foundation models (FMs) have achieved substantial success in generalizing across tasks without problem-specific training or fine-tuning. However, many critical applications in mechanics and computational science require not only accurate predictions but also reliable uncertainty quantification (UQ). Herein we investigate the UQ capabilities of tabular FMs in regression tasks through a comprehensive empirical study comparing Tabular Prior-Data Fitted Networks (TabPFN) against Gaussian processes (GPs). We systematically evaluate these two methods across a host of regression problems with varying complexity, dataset sizes, and input dimensionalities. We build all GPs with a default setting to ensure a fair comparison against TabPFN v2.5. Our findings highlight an important trade-off between explicit and learned priors: while TabPFN achieves highly competitive performance for complex, high-dimensional problems with sufficient data, GPs often provide superior predictive accuracy and UQ in data-scarce settings. Moreover, when the chosen kernel constitutes a good prior for the underlying function, GP performance can substantially exceed that of TabPFN. Our results can be reproduced from https://github.com/kianswarehouse/GPvsPFN.
Quantifying Evidential Rigor in Meta-Analytic Corpora: A Simulation-Characterized, Bias-Robust Bayesian Workflow with a Nutrition Case Study
oai:arXiv.org:2606.01428v1
arXiv:2606.01428v1 Announce Type: new
Abstract: Conventional meta-analysis summarizes evidence through pooled estimates, intervals, and p-values, but these outputs do not directly measure evidence for an effect, evidence for no effect, or the degree to which conclusions depend on publication selection or small-study effects. We introduce a corpus-scale Bayesian evidential-audit workflow for meta-analytic corpora. The workflow reconstructs or accepts study-level effects and standard errors, harmonizes directions, fits a matched Bayesian random-effects baseline and a bias-aware model-averaged ensemble, and reports paired estimates with component and joint model-family evidence. The central estimand is rigor: a joint Bayes-factor summary combining resolved effect/no-effect evidence with absence of an explicit bias component in the fitted ensemble. Rigor is not a positive-finding score; no-effect evidence can score highly, whereas inconclusive or bias-dependent evidence scores poorly. We characterize the workflow using an ADEMP-framed simulation/resampling design with known-cell synthetic simulation, empirical registry resampling, and empirical fitted-profile-weighted synthetic sampling. A nutrition intervention corpus provides the worked case study, where bias-aware fitting often attenuates conventional estimates and many nominally meaningful effects lose clean evidential support. A public companion repository provides empirical inputs, generated artifacts, simulation source/design files, and documentation for reproducing and adapting the audit.
Comb Test: Histogram Uniformity Testing Based on Discrete Total Variation
oai:arXiv.org:2606.01465v1
arXiv:2606.01465v1 Announce Type: new
Abstract: Histogram uniformity testing is a common statistical task usually performed using Pearson's chi-square test. This paper proposes a new test based on the discrete total variation that is easy to compute and, for comb-like (alternating) deviations, achieves up to 67% higher statistical power than Pearson's chi-square test, making it a complement to standard tests. The exact null distribution is computed via dynamic programming, and a gamma approximation with Monte Carlo estimation extends the test to arbitrarily large sample sizes. Experiments on simulated ADC alternating differential nonlinearity and on rounding bias detection in scientific data confirm the claims. The Python source code and precomputed data are available at https://github.com/DiscreteTotalVariation/CombTest.
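To give a feel for why a total-variation statistic reacts to comb-like histograms, the sketch below sums the absolute differences between adjacent bin counts and calibrates the result by plain Monte Carlo under a uniform null. Both the statistic's exact form and the calibration are assumptions for illustration. The abstract does not define the statistic, and the paper's exact null distribution via dynamic programming and its gamma approximation are not reproduced here.

```python
import random


def discrete_total_variation(counts):
    """Sum of absolute differences between adjacent bin counts."""
    return sum(abs(b - a) for a, b in zip(counts, counts[1:]))


def monte_carlo_p_value(counts, n_sim=500, seed=0):
    """Upper-tail p-value under a uniform multinomial null (simulation only)."""
    rng = random.Random(seed)
    k, n = len(counts), sum(counts)
    observed = discrete_total_variation(counts)
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * k
        for _ in range(n):
            sim[rng.randrange(k)] += 1  # each sample falls in a uniform bin
        if discrete_total_variation(sim) >= observed:
            exceed += 1
    # The +1 terms keep the estimated p-value strictly positive.
    return (1 + exceed) / (1 + n_sim)


# Hypothetical 8-bin histograms with 160 samples each.
comb = [30, 10, 30, 10, 30, 10, 30, 10]  # alternating deviation
flat = [20] * 8                          # perfectly uniform
tv_comb = discrete_total_variation(comb)
tv_flat = discrete_total_variation(flat)
p_comb = monte_carlo_p_value(comb)
p_flat = monte_carlo_p_value(flat)
print(tv_comb, tv_flat, p_comb, p_flat)
```

An alternating pattern makes every adjacent difference large at once, so the statistic accumulates quickly even when each bin's individual deviation is moderate.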
Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics
oai:arXiv.org:2606.01468v1
arXiv:2606.01468v1 Announce Type: new
Abstract: Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable modeling of single-cell neural recordings. However, modern-sized datasets have made overparameterized deep networks the methods of choice due to their predictive power and favorable computational scaling. While many posterior approximations exist, all incur approximation errors. Recent work accounts for this error in the form of computational uncertainty but comes at the cost of quadratic complexity and assumes fixed model hyperparameters. Here we extend this development to model selection, including a novel training loss and optimization scheme, which yields tractable inference in large state-spaces. We introduce a framework, the Computation-Aware State-Space Model (CASSM), specifically designed for the scale-imbalanced regime, where the number of trials is significantly lower than the number of recorded neurons. In this regime, for both synthetic and real data, we show that our method is competitive with data-hungry deep networks, with significantly improved uncertainty calibration over previous attempts to scale Bayesian methods. Our experiments provide a roadmap to neuroscience researchers in choosing from a host of potential dynamical latent variable models given key dataset properties and constraints.
Voronoi-Elitism Genetic Algorithm: A Generic Derivative-Free Routine With Theory and Implementation for Statistical Optimization
oai:arXiv.org:2606.01474v1
arXiv:2606.01474v1 Announce Type: new
Abstract: In this paper, we propose a generic optimization approach for challenging objective functions that finds applications in various statistical problems. We focus on objective functions with two parameter blocks, one amenable to analytic optimization and another that is irregular or computationally expensive. To address this setting, we propose the Voronoi-Elitism Genetic Algorithm (VEGA), a derivative-free optimization method that embeds geometric information into genetic search. The proposed algorithm retains elite candidates and constructs Voronoi-based neighborhoods around them, whose crossover and self-adaptive mutation balance exploitation of promising solutions with exploration of under-covered regions. We study the high-dimensional behavior of genetic search by analyzing distance concentration and the effects of population size and shrinking mutation, which shows that the algorithm improves spatial coverage and yields sharper distance bounds under limited computational budgets. Simulation studies are conducted to compare VEGA with two competing genetic-type algorithms in finite samples. A real data application on Stack Exchange activity data further illustrates its ability to identify stable structural changes, implying the algorithm is computationally flexible for high-dimensional, derivative-free optimization and applicable to various statistical problems.
Model complexity in econometrics - a combinatorial analysis
oai:arXiv.org:2606.01489v1
arXiv:2606.01489v1 Announce Type: new
Abstract: Regression models and Vector Autoregressive Models (VARs) play crucial roles in econometrics by allowing the analysis of multiple variables simultaneously. Despite their utility, these models face challenges like underfitting and overfitting, especially when determining the optimal model specification, which can lead to significant computational costs. To address these challenges, econometricians often rely on widely adopted model selection criteria such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). These criteria help balance model complexity and goodness of fit, aiding in the selection of the most suitable model specification for the given data. Nonetheless, there is a notable gap in existing research concerning the correct specification of these models, particularly in determining the optimal number of states a system can assume. Addressing this gap, we introduce a combinatorial framework designed to calculate the potential number of states in such econometric models. Our approach involves delineating four distinct stages in model development, each offering a range of specifications. This method enables a comprehensive combinatorial calculation of all possible states. The aim of this paper is to highlight this overlooked aspect of model specification and to spark a constructive dialogue within the empirical research community. By doing so, we hope to inspire further research that enhances the precision and applicability of econometric models. A theoretical complexity criterion is necessary to elucidate fundamental limitations and propose new objectives to pursue.
A flexible and robust approach to univariate Gaussian splitting using parameterised Gaussian mixtures
oai:arXiv.org:2606.01530v1
arXiv:2606.01530v1 Announce Type: new
Abstract: We consider approximation of a Gaussian distribution with a mixture of homoscedastic Gaussians of smaller variance. The solution is obtained by minimising the $L^2$ norm between the original Gaussian and the mixture, which is parameterised to reduce the complexity of the optimisation problem. The developed technique is straightforward, sufficiently robust and yields Gaussian Mixtures that rapidly approach the original function as the number of mixands is increased. The proposed solution is examined for multiple special cases of input parameters resulting in further simplifications. Extension of the proposed method for approximating non-Gaussian distributions is discussed.
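The $L^2$ objective described in this abstract has a closed form, because the integral of a product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means. The sketch below evaluates that squared distance for a given mixture. The function names and the example weights, means, and component standard deviation are hypothetical, and the paper's own parameterisation of the mixture is not reproduced.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean Gaussian with variance `var`, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def l2_distance_sq(mu, sigma, weights, means, s):
    """Squared L2 distance between N(mu, sigma^2) and the homoscedastic
    mixture sum_i weights[i] * N(means[i], s^2).

    Uses the identity: integral of N(x; a, u) N(x; b, v) dx = N(a - b; 0, u + v).
    """
    pp = normal_pdf(0.0, 2.0 * sigma**2)
    pq = sum(w * normal_pdf(mu - m, sigma**2 + s**2)
             for w, m in zip(weights, means))
    qq = sum(wi * wj * normal_pdf(mi - mj, 2.0 * s**2)
             for wi, mi in zip(weights, means)
             for wj, mj in zip(weights, means))
    return pp - 2.0 * pq + qq


# Hypothetical split of a standard Gaussian into three narrower components.
d_three = l2_distance_sq(0.0, 1.0, [0.25, 0.5, 0.25], [-1.0, 0.0, 1.0], 0.7)
# Baseline: a single narrow component with the same standard deviation.
d_one = l2_distance_sq(0.0, 1.0, [1.0], [0.0], 0.7)
print(d_three, d_one)
```

Because the objective is available in closed form, an optimiser over the weights and means never needs numerical integration. In this toy comparison the three-component mixture is much closer to the target than the single narrow component.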
Scalable Counterfactual Risk Estimation for Rare Events in Longitudinal Data
oai:arXiv.org:2606.01539v1
arXiv:2606.01539v1 Announce Type: new
Abstract: Estimating the causal effect of time-varying treatments on survival outcomes in large observational studies is computationally demanding, particularly when outcomes are rare. While g-formula-based methods such as the iterative conditional expectation (ICE) estimator provide a principled framework for longitudinal causal inference, they become computationally expensive, especially when bootstrap-based variance estimation is required. In addition, outcome rarity at each time point induces severe class imbalance, leading to instability and convergence issues in logistic regression and related models. To address these challenges, we propose a principled subsampling and reweighting strategy for longitudinal survival data that can be applied to a range of existing causal effect estimators in this setting, including the ICE estimator. The proposed method substantially reduces computational burden while preserving consistency and improving estimation stability in rare-outcome settings. We evaluate the method through simulations and validate it using a large-scale EHR cohort study on social and behavioral determinants of health (SBDH) and suicide risk, demonstrating its effectiveness for modeling rare outcomes in longitudinal data.
Structural Change Detection in High-Dimensional Transformed Factor Models via Canonical Correlation Analysis
oai:arXiv.org:2606.01553v1
arXiv:2606.01553v1 Announce Type: new
Abstract: This paper develops a canonical-correlation-based method for detecting structural changes in high-dimensional transformed factor models. The proposed approach exploits the low-rank canonical-correlation structure induced by dynamically dependent common factors, while serially uncorrelated idiosyncratic components correspond to a noise subspace with zero canonical correlations. We construct an eigenvalue-ratio criterion that measures residual dynamic dependence in the estimated noise subspace and identifies the true change point under sufficient separation of the regime-specific loading spaces or dynamic canonical correlation structures. Since the change-point location and the regime-specific factor numbers are both unknown, we further propose an alternating iterative estimation procedure that updates them sequentially until convergence. Under suitable mixing and moment conditions, we establish asymptotic properties of the proposed estimators, with convergence rates depending explicitly on factor strength, cross-sectional dimension, and sample size. Monte Carlo experiments and empirical applications to intraday stock returns and U.S. temperature series demonstrate the finite-sample
Fast Near-Optimal Estimation over Symmetric Norm Balls
oai:arXiv.org:2606.01554v1
arXiv:2606.01554v1 Announce Type: new
Abstract: This short note proposes a polynomial-time algorithm for near-optimal Euclidean estimation of a signal constrained to lie in the unit ball of a symmetric norm, where the symmetry is with respect to a known basis and the norm is accessible through an evaluation oracle. We further extend the method to a random-design, moderate-dimensional linear regression setting, where the regression parameter is likewise assumed to belong to a constraint set defined by a symmetric norm.
Self-Regulating Annealing in Heavy-Tailed Diffusion Models
oai:arXiv.org:2606.01645v1
arXiv:2606.01645v1 Announce Type: new
Abstract: Diffusion models have emerged as a leading framework for deep generative modeling. While the standard Gaussian formulation is theoretically convenient, its suitability for heavy-tailed datasets remains unclear. To address this, heavy-tailed diffusion models (HTDMs) extend the standard formulation by replacing the Gaussian distribution with a Student's t-distribution, thereby improving tail fidelity on heavy-tailed datasets. Although stochastic differential equation (SDE)-based sampling is possible in HTDMs, it has not been fully explored. In this paper, we propose an SDE-based sampler for HTDMs that explicitly incorporates a state-dependent diffusion coefficient. This state dependence naturally induces a self-regulating annealing mechanism by adaptively modulating the effective noise scale. We theoretically explore this mechanism and experimentally verify its necessity for reproducing samples from a heavy-tailed distribution.
Beyond principal ignorability: Nonparametric sensitivity bounds for principal stratification
oai:arXiv.org:2606.01669v1
arXiv:2606.01669v1 Announce Type: new
Abstract: Principal stratification is an effective framework addressing intermediate variables in causal inference. However, point identification of the principal causal effects (PCEs) often requires the untestable principal ignorability (PI) assumption. This article develops a nonparametric sensitivity analysis framework for evaluating PI violations. We introduce a margin-free bounding factor parameterized by the selection and outcome relative risks of an unmeasured confounder. Using this bounding factor, we derive sharp nonparametric bounds for each PCE. We prove that these bounds nest within the worst-case nonparametric bounds with and without the monotonicity assumption. We then discuss Cornfield-type conditions and principal E-values that quantify the minimum joint magnitude of unmeasured confounding required to nullify the target PCE. Furthermore, we generalize this methodology to principal generalized causal effects, extending the sensitivity bounds and falsification thresholds to the recent pairwise comparison estimands evaluated over a product space.
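For orientation, the sketch below computes the classical bounding factor and E-value from the general sensitivity-analysis literature on unmeasured confounding. It is not the paper's margin-free bounding factor or its principal E-value, whose definitions the abstract does not give; the function and argument names are hypothetical.

```python
import math


def bounding_factor(rr_selection, rr_outcome):
    """Classical joint bounding factor for two confounding relative risks
    (both assumed to be at least 1)."""
    return rr_selection * rr_outcome / (rr_selection + rr_outcome - 1.0)


def e_value(rr):
    """Classical E-value for an observed relative risk.

    Relative risks below 1 are inverted first so the formula applies.
    """
    if rr <= 0:
        raise ValueError("relative risk must be positive")
    if rr < 1.0:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


observed_rr = 2.0
ev = e_value(observed_rr)
# When both confounding relative risks equal the E-value, the bounding
# factor equals the observed relative risk.
bf = bounding_factor(ev, ev)
print(ev, bf)
```

The E-value is the smallest common strength of the two confounding associations that could fully explain away the observed relative risk, which is the sense in which Cornfield-type conditions set a falsification threshold.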
Higher-Order Efficient Estimators: A Review and Simulation-Based Benchmark Study
oai:arXiv.org:2606.01674v1
arXiv:2606.01674v1 Announce Type: new
Abstract: Higher-order efficient estimators extend standard first-order semiparametric estimators by replacing second-order residuals with third- or higher-order terms, potentially enabling asymptotic efficiency under slower nuisance function convergence rates and improving finite-sample performance. Existing methods achieve higher-order expansions through structurally different approximation strategies, including basis truncation, kernel smoothing, and highly adaptive lasso (HAL) representations, making direct theoretical and practical comparison difficult. In this manuscript, we provide a focused review and a simulation-based empirical benchmark for second-order efficient estimators, using treatment-specific mean estimation as a canonical causal inference and missing data problem. We compare how higher-order influence function (HOIF) estimators, kernel-based higher-order targeted minimum loss-based estimator (HOTMLE), and HAL-based HOTMLE construct higher-order expansions and the approximation or regularization burdens they introduce. The asymptotic and numerical study evaluates first-order and empirical second-order estimators under controlled nuisance errors with constant or increasing sectional variation complexity. Results show that higher-order debiasing can substantially reduce first-order estimation bias; however, gains depend strongly on stability of the approximation or regularization required for higher-order correction. Empirical HAL-based HOTMLE shows relatively stable performance, while empirical HOIF remains sensitive to basis truncation and tuning choices. Overall, this manuscript clarifies when higher-order asymptotic improvements are attained in theory, when they may be practically visible, and when implementation instability may offset theoretical advantages.
Mapping the Storm: Geospatial Impacts of Severe Weather on LEO Network Performance
oai:arXiv.org:2606.01724v1
arXiv:2606.01724v1 Announce Type: new
Abstract: LEO satellite constellations, led by deployments such as Starlink, are playing an increasingly pivotal role in enabling global broadband connectivity. However, the reliability and performance of these space-based networks are highly sensitive to environmental dynamics, particularly localized weather phenomena that exhibit strong spatio-temporal variability. In this study, we present a continental-scale geospatial analysis of weather-induced performance degradation in the Starlink LEO network, with a focus on the contiguous United States. Leveraging a unique dataset comprising more than 870,000 terminal hours of minute-level telemetry from 1,292 Starlink terminals, we integrate high-resolution localized weather observations to quantify the impact of various meteorological conditions. We evaluate key performance indicators (KPIs), including ping latency, ping drop rate, and signal quality, using spatial join techniques and time-aligned correlation with classified weather events. Our analysis reveals that severe weather events, such as thunderstorms with heavy rain or snow, have a pronounced effect on network performance. In particular, more than 55% of affected terminals experienced substantial degradation. Temporal continuity analysis at the minute level shows that such degradation can lead to sustained impairments or full service outages lasting from several minutes to multiple hours. This work contributes the first large-scale empirical study linking LEO satellite Internet performance with fine-grained weather data in both space and time. Our findings offer actionable insights for geospatial predictive modeling, weather-aware network provisioning, and resilient satellite communication system design. We also propose a framework for incorporating weather-inferred performance variability into future geospatial planning and service-level forecasting tools for LEO-based Internet systems.
LoopPerm-CPD: A Robust Loop Permutation Framework for Automatic Multiple Change-Point Detection in Longitudinal Data
oai:arXiv.org:2606.01796v1
arXiv:2606.01796v1 Announce Type: new
Abstract: Human viral challenge studies, in which participants are deliberately inoculated with influenza strains such as H1N1 or H3N2 and monitored through longitudinal transcriptomic profiling before and after inoculation, are critical for characterizing dynamic biological immune responses to viral infection. A key analytical goal in such settings is to detect critical transition times, or change points, at which an underlying trajectory shifts direction or rate, indicating events such as the onset of an immune response or recovery. However, change-point detection in these longitudinal data is fundamentally challenging because observations are often sparse and irregularly spaced, sample sizes are small, outliers are common, and the number of change points is unknown in advance.
To address these challenges, we propose LoopPerm-CPD, a robust change-point detection approach with a built-in loop permutation procedure for automatic multiple change-point detection. The method evaluates candidate slope change points and assesses their significance using within-subject circular permutation combined with binary segmentation, jointly estimating both the number and locations of change points. The accompanying R package, LoopPerm-CPD, implements this framework and flexibly accommodates generalized least squares, quantile regression, and quantile rank-score statistics for different types of longitudinal outcomes.
The proposed approach is evaluated through simulations, demonstrating Type I error control and improved power compared with competing methods. Applied to real data, the framework identifies interpretable transition points in multiple human respiratory viral inoculation studies. Together, these results establish LoopPerm-CPD and its companion software as a robust and user-friendly tool for change-point detection in complex human longitudinal cohort data.
A Uniform Improvement of the Benjamini-Hochberg Procedure using e-Closure
oai:arXiv.org:2606.01854v1
arXiv:2606.01854v1 Announce Type: new
Abstract: This paper presents closed BH, a uniform improvement of the False Discovery Rate controlling method of Benjamini and Hochberg (BH). Closed BH is valid under the same assumption of Positive Regression Dependency on a Subset (PRDS) as BH. As a uniform improvement, closed BH never rejects fewer hypotheses than BH, but it may reject quite a few more. An increase in power is observed especially when the number of false null hypotheses is large. The novel method is constructed using the e-Closure principle, a recently derived general principle for multiple testing.
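For orientation, the baseline that closed BH improves upon can be sketched in a few lines. The block below is a minimal sketch of the classical BH step-up rule only; it does not implement closed BH or the e-Closure principle, and the function name `benjamini_hochberg` and the example p-values are illustrative assumptions, not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Hypothesis indices ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Step-up: reject the k_max smallest p-values, even if some
    # intermediate ones exceeded their own threshold.
    return sorted(order[:k_max])


# Hypothetical p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p, alpha=0.05)
```

A uniform improvement in the sense of the abstract would return a superset of `rejected` for every input.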
Spatial Capture-Recapture With Penalized Regression Splines to Flexibly Model Wildlife Density and Distribution
oai:arXiv.org:2606.01932v1
arXiv:2606.01932v1 Announce Type: new
Abstract: Spatial capture-recapture models are routinely used to estimate the abundance and distribution of wild animal populations and involve a latent spatial point process of animal activity centres that describes the spatial distribution of individuals. While traditional spatial capture-recapture models use a Poisson process, the assumption of conditional independence between points is often violated in practice due to factors not included in the point process, such as social clustering, territoriality, or preferential selection of habitat due to unobserved covariates. Log-Gaussian Cox processes are commonly used in spatial statistics to overcome weaknesses of Poisson processes, but methods to fit them within spatial capture-recapture do not currently exist. Here, we present a spatial capture-recapture framework that allows for the use of penalized regression splines to describe the activity centre distribution, with model fitting via a Laplace-approximate penalized marginal maximum likelihood approach. Our method approximates a log-Gaussian Cox process for activity centres and allows flexible modelling of nonlinear effects of covariates on density. We illustrate the use of our method with a simulation study and two case studies. We demonstrate that, while population size estimates of traditional approaches are robust to density model misspecification, our approach substantially improves the estimation of spatial animal distributions.
Return-to-Baseline Testing via Empirically Calibrated e-processes
oai:arXiv.org:2606.01960v1
arXiv:2606.01960v1 Announce Type: new
Abstract: We consider the problem of detecting a Return to Baseline (RtB) in high-frequency monitoring data preceding and following an intervention, where the aim is to identify the time at which the data-generating distribution realigns with its pre-intervention distribution. We propose a sequential, distribution-free testing procedure that does not rely on specifying a null model and provides anytime-valid error control. The method relies on ideas from universal inference to define a discrepancy measure that is aggregated into a non-negative super-martingale, and is then empirically calibrated to form an e-process. The calibration is performed using the baseline data, and is thus subject-specific. We establish finite-sample bounds for the calibration error (under a flexible non-parametric assumption), discuss the impact of tuning parameters and computational complexity, and illustrate through simulations and a clinical case study that the procedure accurately detects RtB from monitoring data.
Testing for Single-Population Ancestry in the Admixture Model
oai:arXiv.org:2606.01990v1
arXiv:2606.01990v1 Announce Type: new
Abstract: The Admixture Model describes genetic marker data by representing each individual's genome as a mixture of contributions from $K$ ancestral populations, with the individual admixture vector summarizing the corresponding ancestry proportions. In population and forensic genetics, a key question is whether an individual's genome supports a predominantly single-ancestry interpretation or whether an admixed interpretation is more appropriate. We propose a statistical test for single-population ancestry in the supervised Admixture Model, where ancestral allele frequencies are treated as known. The test assesses whether the largest admixture component exceeds a practitioner-chosen dominance threshold, giving precise meaning to the notion of a sufficiently strong single-population contribution.
To calibrate the test, we develop a constrained parametric bootstrap procedure that generates data under a null-constrained maximum likelihood estimator, accounting for the constrained hypothesis structure, the marker-wise heterogeneity and small sample sizes. Under standard regularity conditions, we prove that the proposed test has asymptotic level $\alpha$ and is consistent, ensuring control of false single-ancestry declarations while reliably detecting dominant ancestry components.
Simulation studies demonstrate good finite-sample performance across different numbers of ancestral populations, marker-panel sizes, dominance thresholds, and allele-frequency distributions. We further illustrate the practical utility of the method using data from the 1000 Genomes Project. The proposed framework delivers interpretable, threshold-based ancestry assessment with rigorous error control, and extends constrained bootstrap methodology to the independent but non-identically distributed setting of genetic marker data.
Provable Data Scaling Law for Meta Learning via Complexity Minimization
oai:arXiv.org:2606.02008v1
arXiv:2606.02008v1 Announce Type: new
Abstract: Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.
PliableBVS: A flexible Bayesian variable selection method for modeling interactions with mandatory modifying variables
oai:arXiv.org:2606.02017v1
arXiv:2606.02017v1 Announce Type: new
Abstract: High-dimensional interaction models are useful for studying, for example, how a large set of variables of interest, such as gene expression or other omics features, interact with a smaller set of modifying variables, such as clinical covariates. In this context, the pliable lasso has recently been proposed as an efficient method for screening large numbers of potential interaction terms under an asymmetric weak hierarchical constraint. In this work, we extend this framework by introducing PliableBVS, a Bayesian variable selection approach that preserves the hierarchical structure of the pliable lasso while inducing sparsity through spike-and-slab priors. The proposed model combines the continuous shrinkage effect of Bayesian lasso with a hierarchical spike-and-slab prior formulation that has two layers of decision variables: one governing the inclusion of main effects and another controlling the inclusion of interaction effects which is conditional on the inclusion of the corresponding main effects. This structure enables simultaneous selection of high-dimensional main and interaction effects within a coherent probabilistic framework. In simulation studies the proposed method outperforms the original pliable lasso in identifying active main and interaction effects, reducing false discoveries, and improving prediction accuracy in most scenarios. Applications with data from a labor onset study and a preeclampsia study demonstrate that PliableBVS selects biologically meaningful features and interactions.
Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation
oai:arXiv.org:2606.02047v1
arXiv:2606.02047v1 Announce Type: new
Abstract: We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operator-based regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov--Wasserstein (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank--Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance over existing methods, with stable and reliable behavior in practice.
ICCDesign: An R Package for the Design and Analysis of ICC-Based Reliability Studies with Continuous Responses
oai:arXiv.org:2606.02059v1
arXiv:2606.02059v1 Announce Type: new
Abstract: The intraclass correlation coefficient (ICC) is among the most widely used statistics in reliability research, playing a central role in medical measurement, psychological assessment, and behavioral science. However, practical application of ICC faces two major obstacles. First, ICC can be organized into multiple forms under the McGraw and Wong (1996) framework -- including six widely reported standard forms and four additional design combinations -- and researchers must select the appropriate form based on their study design, yet existing guidelines are not always operationalized in software interfaces. Second, available R tools are highly fragmented: sample size calculation, ICC estimation with confidence intervals, and reliability evaluation are distributed across separate packages, compelling researchers to switch between tools and increasing the risk of analytical errors. This paper introduces the ICCDesign package, designed specifically to provide an integrated workflow for ICC-based reliability studies with continuous responses, assuming one continuous rating per subject-rater cell. The package integrates four core functionalities: (1) point estimation, ANOVA-based confidence intervals, and implemented hypothesis tests for supported ICC design combinations following the McGraw and Wong (1996) framework, with a built-in four-step decision framework guiding users toward an appropriate ICC form; (2) sample size planning based on Zou's (2012) closed-form formulas, supporting two planning modes and an inverse assurance calculation; (3) automated reliability evaluation based on Koo and Li (2016) criteria, with an uncertainty notification when the confidence interval spans the 0.75 good-reliability threshold; and (4) an interactive Shiny web application covering the main analysis and planning functionalities. ICCDesign is available from GitHub at https://github.com/KlariZhang/ICCDesign.
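To make the estimation and evaluation steps concrete, here is a minimal sketch of one ICC form (the one-way random-effects, single-rating ICC computed from ANOVA mean squares) together with the Koo and Li (2016) bands. It is written in Python for illustration and is not the ICCDesign API; the function names `icc_oneway` and `reliability_label`, the example ratings, and the handling of values exactly on a band boundary are assumptions.

```python
def icc_oneway(ratings):
    """One-way random-effects, single-rating ICC from an n-subjects x k-raters table."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    # Undefined (division by zero) if every rating in the table is identical.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def reliability_label(icc, ci_low, ci_high):
    """Koo and Li (2016) band for the point estimate, plus a flag that is
    True when the confidence interval spans the 0.75 threshold."""
    if icc < 0.5:
        label = "poor"
    elif icc < 0.75:
        label = "moderate"
    elif icc < 0.9:
        label = "good"
    else:
        label = "excellent"
    return label, ci_low < 0.75 < ci_high


# Hypothetical ratings: 5 subjects, 2 raters, one rating per cell.
ratings = [[9, 10], [6, 7], [8, 8], [7, 6], [10, 9]]
icc = icc_oneway(ratings)
```

Other ICC forms in the McGraw and Wong (1996) framework use two-way mean squares and are not covered by this sketch.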
Evaluating the role of correlation among markers in prediction models
oai:arXiv.org:2606.02062v1
arXiv:2606.02062v1 Announce Type: new
Abstract: Different methods have been employed to estimate models maximizing the area under the receiver operating characteristic curve (ROC-AUC). Once a model is developed, integrating novel biomarkers may improve its diagnostic ability. However, the discrimination improvement from adding a new biomarker is not always evident, even if the marker itself has good discriminatory power. The sign and magnitude of correlations between biomarkers may impact model performance. In this paper, we assess the effect of such correlations on the discrimination ability of predictive models. Under multivariate normality, we derive an expression for the maximum AUC as a function of the correlations between markers, illustrated graphically using surfaces. Logarithmic folded bivariate normal and Gamma simulations address skewed data cases. Additionally, AUC improvement was assessed combining 1934 blood lipid metabolites determined by liquid chromatography in 44 pancreatic cancer cases and 38 controls from the PanGenMic Study. Our results show that negative correlations consistently maximize the combined AUC, offering the greatest improvements when markers have equal predictive ability, while positive correlations yield the least favorable results. Negative correlations remain optimal for markers with differing abilities, though positive correlations show slight benefits. Simulations with skewed distributions confirm these trends, emphasizing the role of asymmetry in marker selection. Real-world analysis of serum lipid-derived metabolites for detecting pancreatic ductal adenocarcinoma (PDAC) reinforces the influence of correlations on AUC optimization. These findings suggest that the sign and magnitude of inter-biomarker correlations should be considered when incorporating new markers into predictive algorithms.
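The dependence of the combined AUC on the correlation can be illustrated with the standard result for the best linear combination of normal markers. The sketch below assumes two markers with unit variance and a common correlation `rho` in cases and controls, with standardized mean differences `d1` and `d2`; under those assumptions the maximum AUC is Phi(sqrt(D^2 / 2)), where D^2 is the squared Mahalanobis distance between group means. This may differ from the expression derived in the paper, and the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_auc_two_markers(d1, d2, rho):
    """Maximum AUC of a linear combination of two unit-variance normal
    markers with equal covariance in both groups (an assumption)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Squared Mahalanobis distance between the two group means.
    maha_sq = (d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (1.0 - rho * rho)
    return normal_cdf(math.sqrt(max(maha_sq, 0.0) / 2.0))


# Equal predictive ability: negative correlation helps, positive hurts.
auc_neg = max_auc_two_markers(1.0, 1.0, -0.5)
auc_zero = max_auc_two_markers(1.0, 1.0, 0.0)
auc_pos = max_auc_two_markers(1.0, 1.0, 0.5)
```

With markers of differing ability (for example `d1=1.0`, `d2=0.2`), a strong positive correlation can also raise the value above the uncorrelated case, in line with the pattern the abstract describes.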
Inverting Poisson-Laguerre tessellations
oai:arXiv.org:2606.02065v1
arXiv:2606.02065v1 Announce Type: new
Abstract: While it is well-known how to compute the cells of a Laguerre tessellation for a given set of weighted generator points, it is not obvious how to invert a Laguerre tessellation. That is, given that one observes a Laguerre tessellation, how can one retrieve the weighted generators corresponding to the observed cells. In this paper, we consider inversion of a class of random Laguerre tessellations known as Poisson-Laguerre tessellations. The weighted generators of observed cells of a Poisson-Laguerre tessellation are of interest because knowledge of these weighted generators is useful for statistical inference of Poisson-Laguerre tessellations. For general Laguerre tessellations we provide a characterization of all configurations of weighted generator points which yield the same Laguerre tessellation. For Poisson-Laguerre tessellations we propose a method for consistent inversion, meaning that as one observes the tessellation through increasing observation windows, a closer approximation of the original weighted generators can be obtained. In a simulation study we examine both performance of the inversion procedure, as well as the use of the obtained approximated weighted generators for nonparametrically estimating the weight distribution function corresponding to a Poisson-Laguerre tessellation.
Modelling multi-cancer screening data to infer on natural history of disease: when can valid, identifiable and precise inference be obtained?
oai:arXiv.org:2606.02076v1
arXiv:2606.02076v1 Announce Type: new
Abstract: Background: Multistate models (MSMs) applied to screening data can characterise the natural history of cancer and predict "stage-shifts" from screening. However, inferring parameters like mean sojourn time (MST) is challenging as disease onset is inherently unobserved in these data. This is even more challenging when characterising heterogeneity between cancer types in multicancer early detection (MCED) trial data.
Methods: We utilised simulated longitudinal MCED screening datasets to evaluate the inferential bounds of MSMs under increasing clinical disaggregation: a 3-state (overall MST), 5-state (early/late stage), and 9-state (stages I-IV) model. Bayesian estimation was performed via Markov chain Monte Carlo. Robustness was assessed through chain convergence, parameter identifiability (via profile likelihood), and precision of estimates. We also explored hierarchical models and the use of informative priors to improve identifiability.
Results: Based only on MCED trial data, many cancer types exhibited inferential challenges. Generally, the 5-state model was as robust as the 3-state model, showing slight improvements to convergence and identifiability while maintaining precision for overall MST. In contrast, the 9-state model showed worsened convergence and identifiability, and a significant reduction in the precision of overall MST estimates. Hierarchical models successfully improved performance, as did informative-prior models, but the latter introduced bias towards the prior values.
Conclusions: While disaggregating natural history models by individual cancer stages is desirable for policy, these higher-dimensional models show a greater reliance on external data/assumptions. We recommend explicit identifiability assessments and assessments of the influence of external data/assumptions to support inference for MCED screening evaluations.
It does what it says on the tin: safe synthetic data from coarsened margins
oai:arXiv.org:2606.02101v1
arXiv:2606.02101v1 Announce Type: new
Abstract: This paper proposes a method of creating synthetic data (SD) that will have two important advantages for the user compared to other methods currently available. The first is transparency; unlike other methods, the person in receipt of the SD will know which of the relationships between variables in the original data will be approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins where relationships between variables will be maintained in the SD. Each margin will then be subject to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combination of small categories and/or modifying small counts. Further adjustment of the curated margins is advised by coarsening all counts in the table to multiples of the disclosure limit. These adjusted margins are used to create SD by the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.
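The two mechanical steps named in the abstract, coarsening margins and fitting a table to them by IPF, can be sketched for the simplest two-way case. This is an illustrative sketch under stated assumptions, not the authors' procedure: `coarsen` rounds to the nearest multiple of the limit (the paper's exact rule is not given here), `ipf` handles only a two-way table with one row margin and one column margin, and the margins must share the same total for both to be matched.

```python
def coarsen(counts, limit):
    """Round each non-negative count to the nearest multiple of `limit`
    (halves round up); the rounding rule is an assumption."""
    return [limit * int(c / limit + 0.5) for c in counts]


def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Iterative Proportional Fitting of a two-way table to given margins."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Scale each row to match its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * target / s for v in table[i]]
        # Scale each column to match its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Columns now match; stop once the rows match as well.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical margins, already multiples of a disclosure limit of 5.
rows = coarsen([31, 19], 5)        # row margin after coarsening
cols = coarsen([11, 14, 25], 5)    # column margin after coarsening
fitted = ipf([[1.0] * 3 for _ in range(2)], rows, cols)
```

Starting from a seed of ones, the fitted table reproduces the independence structure implied by the two margins; fractional cells would still need to be converted into synthetic records.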
Error Bounds for a Diffusion Model-Based Drift Estimator
oai:arXiv.org:2606.02115v1
arXiv:2606.02115v1 Announce Type: new
Abstract: Parameter estimation in stochastic differential equations is a classical statistical problem of much importance in many scientific fields. Recent work of Tapia Costa et al. (2026) introduced a novel technique for estimating the drift when the diffusion parameter is known, using discrete samples from multiple trajectories. Their method treats drift estimation as a denoising problem, and leverages tools from (conditional) score-matching diffusion models. Although their experiments showed promising results across different drift classes, the question of theoretical guarantees for their estimator was left unanswered. In this note, we address this gap by exploiting techniques from diffusion model theory. More concretely, we derive an explicit risk bound for the time-averaged mean-squared error of said drift estimator. Our bound decomposes the risk into the (i) Euler-Maruyama discretization, (ii) score/denoiser approximation, (iii) noise initialization, and (iv) sampling variance, revealing the trade-offs between the different hyperparameters and sources of error in the estimator.
ProbRes: Volatility Learning for Probabilistic Time-Series Forecasting
oai:arXiv.org:2606.02117v1
arXiv:2606.02117v1 Announce Type: new
Abstract: Probabilistic time series forecasting has attracted increasing attention in financial applications due to the need to quantify risk and uncertainty in future observations. We propose ProbRes, a post-hoc probabilistic calibration method that explicitly learns and incorporates volatility dynamics into probabilistic forecasting, enabling effective handling of heteroskedastic data. During training, ProbRes employs two architecture-agnostic modules to separately model the conditional mean and conditional volatility. At the inference stage, it generates predictive distributions by resampling normalized residuals. ProbRes is applicable to both univariate and multivariate time series and remains robust under a wide range of error distributions, including non-Gaussian innovations with conditional heteroskedasticity. Theoretical results demonstrate ProbRes's validity and experiments on both synthetic and real-world datasets show that ProbRes accurately captures predictive distributions and produces well-calibrated prediction intervals.
Observed Fisher Information in hidden Markov models - Application to a noisy Gaussian random walk
oai:arXiv.org:2606.02118v1
arXiv:2606.02118v1 Announce Type: new
Abstract: In this work we provide analytical and closed-form expressions for the exact computation of the score and the observed Fisher information matrix in a Gaussian random walk observed through Gaussian noise. Our method is based on Oakes' identity and, as for the computation of the log-likelihood, its time complexity is linear in the length of the sequence with the forward-backward (or Baum-Welch) algorithm. We illustrate the method through various simulation studies and provide parameter estimates computed with the Newton-Raphson algorithm along with confidence intervals.
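As a point of reference for the linear-time claim, the log-likelihood of this model can be computed in a single forward pass with a scalar Kalman filter. The sketch below assumes the parameterization x_t = x_{t-1} + w_t with w_t ~ N(0, q), y_t = x_t + v_t with v_t ~ N(0, r), and x_0 ~ N(m0, p0); these symbols and the function name are assumptions. It covers only the log-likelihood, not the score or observed Fisher information derived in the paper.

```python
import math


def random_walk_loglik(ys, q, r, m0=0.0, p0=0.0):
    """Log-likelihood of a Gaussian random walk observed through Gaussian
    noise, via the prediction-error decomposition (one pass over the data)."""
    mean, var = m0, p0
    loglik = 0.0
    for y in ys:
        # Predict: the random walk keeps the mean and adds variance q.
        var_pred = var + q
        # Innovation and its variance.
        innov = y - mean
        s = var_pred + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innov * innov / s)
        # Update the filtered mean and variance with the Kalman gain.
        gain = var_pred / s
        mean = mean + gain * innov
        var = (1.0 - gain) * var_pred
    return loglik


# Hypothetical observations, for illustration only.
ll = random_walk_loglik([1.0, 2.0], q=1.0, r=1.0)
```

Each observation costs a constant number of operations, so the total cost grows linearly with the sequence length.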
Methods for adjusting for covariate measurement error in flexible modelling of functional form: designing a blinded, controlled neutral comparison simulation study
oai:arXiv.org:2606.02130v1
arXiv:2606.02130v1 Announce Type: new
Abstract: This article describes the design of a neutral comparison study in the context of empirical studies where the interest is in learning the functional relationship between a continuous error-prone exposure variable and a binary outcome. The performance of combinations of measurement error correction methods and flexible regression modeling techniques was compared using a simulation study. The project involved four independent teams, one devoted to data generation and evaluation, the other three to specific methods of measurement error correction (Simulation-Extrapolation, Regression-Calibration and Multiple imputation, Bayesian method). The study was conducted in three successive stages. In Stage 1, the first team simulated five datasets differing only by the true exposure-outcome functional form and distribution of true exposure. Furthermore, the implementation of flexible modeling methods (B-splines, P-splines, and fractional polynomials) was standardized. The three methods teams, blinded to the underlying data generation process, created the codes to implement their methods, and provided their results to the first team who evaluated them. These codes were then used by this team in the subsequent stages of the project. In Stage 2, the team simulated 150 additional datasets where other design parameters varied while using the same five exposure-outcome functions. Stage 3 consisted of simulating independent replications of each of the 150 scenarios considered in Stage 2 to quantify the sampling variance of the estimates. This work emphasizes the relevance of neutral comparison studies to fairly evaluate statistical methods aimed at addressing a complex analytical challenge, and demonstrates their feasibility through a large collaborative project.
Sharp Support Thresholds for Smeariness of Absolutely Continuous Measures on Spheres
oai:arXiv.org:2606.02144v1
arXiv:2606.02144v1 Announce Type: new
Abstract: We investigate support thresholds for fully smeary and directionally smeary absolutely continuous probability measures on the sphere \(\mathbb{S}^m\). The motivation is inferential: smeariness is caused by degeneracy of the Hessian of the Fr\'echet function, and such degeneracy can invalidate the classical central limit theorem (CLT) for Fr\'echet means and the corresponding Wald-type \(\chi^2\) inference.
For rotationally symmetric densities, we show that full and directional smeariness are equivalent. The Hessian and fourth-order terms are governed by two explicit geometry-dependent radii \(R_m0\) with \(S_m+\varepsilon<\pi\), we construct examples of rotationally symmetric \(2\)-smeary densities supported in the ball of radius \(S_m+\varepsilon\).
For general densities, closed hemispherical support rules out both full and directional smeariness. Support contained in the closed ball of radius \(S_m\) rules out full smeariness, while we construct explicit, directionally \(2\)-smeary examples supported in balls of radius \(\pi/2+\varepsilon\).
As a byproduct, the explicit Hessian formulas in this paper also provide a practical diagnostic for detecting proximity to the Hessian-degenerate, non-classical regime.
A Contaminated Model for Overdispersed Multinomial Microbiome Count Data
oai:arXiv.org:2606.02199v1
arXiv:2606.02199v1 Announce Type: new
Abstract: Multinomial count data, such as microbial composition profiles derived from sequencing studies, frequently contain anomalous observations that distort parameter estimates. The Dirichlet-multinomial (DM) distribution is widely used in this setting but remains sensitive to such contamination. We propose the contaminated Dirichlet-multinomial (CDM) distribution, a two-component mixture in which the regular data come from a DM component with a lower dispersion and the irregular data come from a DM component with an inflated dispersion parameter. This construction accommodates anomalies without requiring their removal, and yields a natural rule for anomaly detection via posterior probabilities. Through sensitivity analyses involving both single-point anomalies and background noise, we demonstrate that the CDM distribution effectively downweights the influence of anomalous observations on the parameter estimates. The model is applied to gut microbiome data from a colorectal carcinogenesis study, where it consistently outperforms the DM distribution across all information criteria and identifies biologically plausible anomaly proportions in both the healthy and carcinoma subsets.
Bayesian meta-learning for modeling Alzheimer's disease progression
oai:arXiv.org:2606.02228v1
arXiv:2606.02228v1 Announce Type: new
Abstract: Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.
Identifiable Markov Switching Models with Instantaneous Effects and Exponential Families
oai:arXiv.org:2606.02231v1
arXiv:2606.02231v1 Announce Type: new
Abstract: Temporal systems often exhibit non-stationary behaviour, such as seasonal climate variation or glucose fluctuations in patients with type-1 diabetes. One way to model non-stationarity is through discrete latent regimes, i.e., stationary segments of time. Such systems induce a Markov Switching Model (MSM), a class of Hidden Markov Models with autoregressive dependencies among latent regimes and observed variables. Identifying latent regimes is challenging in the presence of frequent regime switches and nonlinear and non-Gaussian dynamics, particularly when there are instantaneous effects between the variables, e.g., due to slow rates of measurements. In this work, we establish the identifiability of both latent regimes and regime-dependent causal structures under temporal regime dependencies, nonlinear lagged and instantaneous effects, and independent noise from the exponential family. Our identifiability theory subsumes non-temporal mixtures of causal models. Furthermore, we introduce FlowMSM, a regime detection framework that can be paired with any stationary causal discovery method to recover regime-dependent causal structures. Experiments on synthetic benchmarks and a financial economics dataset demonstrate the effectiveness of our approach to detect latent regimes and discover causal structures from non-stationary time series.
ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation
oai:arXiv.org:2606.02247v1
arXiv:2606.02247v1 Announce Type: new
Abstract: Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on value function evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous evaluations. This is particularly relevant in settings where the value function is costly and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose ShaplEIG, a Bayesian experimental design approach that approximates the expensive value function using a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. By the linearity of the Shapley values in the value function, we show that the expected information gain is available in closed form. Furthermore, we propose an efficient computation scheme that reduces the complexity from exponential to polynomial in the number of players via elementary symmetric polynomials. In extensive experiments across diverse costly applications, our method consistently improves sample efficiency in the low-budget regime over state-of-the-art baselines.
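The exponential cost that motivates approximation is easy to see in a direct implementation. The following is a minimal sketch of exact Shapley values by full coalition enumeration; it is not ShaplEIG, and the function name `exact_shapley` and the toy value function are illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n_players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number.
    Makes 2 * n * 2**(n - 1) value-function calls, hence exponential in n."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [j for j in players if j != i]
        for size in range(n_players):
            # Probability weight of coalitions of this size preceding player i.
            weight = (
                factorial(size) * factorial(n_players - size - 1)
                / factorial(n_players)
            )
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Weighted marginal contribution of player i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy game: additive weights plus a bonus of 4 when players 0 and 1 cooperate.
weights = [1.0, 2.0, 3.0]


def toy_value(coalition):
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


phi = exact_shapley(3, toy_value)
```

In the toy game the bonus is split equally between players 0 and 1, and the values sum to the worth of the full coalition. Because each Shapley value is a fixed linear combination of value-function evaluations, a Gaussian process prior on the value function induces a Gaussian posterior on the Shapley values, which is the linearity the abstract refers to.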
Bandwidth selection with a frequency-domain version of the AIC
oai:arXiv.org:2606.02295v1
arXiv:2606.02295v1 Announce Type: new
Abstract: When it comes to estimating an unknown spectral density as simply and reliably as possible, parametric spectral density estimation using AR models and order selection via AIC is the method of choice. In contrast, no standard method has yet emerged for automatic nonparametric spectral density estimation, and there seems to be little willingness to weigh the advantages and disadvantages of different risk functions and the various methods for estimating them on a case-by-case basis, particularly because it is unclear whether the effort is even worthwhile without concrete prior information about the unknown spectral density. As a result, subjective visual methods are still widely used in practice to determine the appropriate smoothing parameter for a nonparametric estimation. This article aims to encourage the increased use of objective automatic methods by presenting evidence that using what is arguably the simplest and most straightforward frequency-domain version of the AIC for the automatic determination of an appropriate bandwidth enables results that are comparable to those obtained using the standard parametric approach. This evidence is based on both real-world time series and synthetic time series with spectral densities of varying complexity.
Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization
oai:arXiv.org:2606.02345v1
arXiv:2606.02345v1 Announce Type: new
Abstract: Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
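A minimal sketch of the idea that the sampling plan should target pairs rather than observations: the full pairwise loss averages over all n(n-1)/2 pairs, while the estimate averages over a small sample of pairs. The assumptions here are ours, not the paper's: the loss is a squared Euclidean distance, the data are synthetic, and pairs are drawn uniformly without replacement (equal inclusion probabilities), whereas the abstract describes assigning higher inclusion probabilities to informative pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic data, illustration only
n = X.shape[0]

# Enumerate all n(n-1)/2 unordered pairs (i < j).
i_idx, j_idx = np.triu_indices(n, k=1)


def pair_loss(i, j):
    """Squared Euclidean distance for each pair (i[k], j[k])."""
    return np.sum((X[i] - X[j]) ** 2, axis=1)


# Full pairwise average: cost quadratic in n.
full = pair_loss(i_idx, j_idx).mean()

# Sample m pairs directly, uniformly without replacement. With equal inclusion
# probabilities the Horvitz-Thompson estimator of the mean is the sample mean.
m = 2000
chosen = rng.choice(i_idx.size, size=m, replace=False)
estimate = pair_loss(i_idx[chosen], j_idx[chosen]).mean()

print(full, estimate)
```

Here about a tenth of the pairs are evaluated. Unequal inclusion probabilities would require reweighting each sampled pair by the inverse of its inclusion probability.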
Optimal sequential two-stage Bayes Factor Design for two-arm clinical Phase II Trials with binary Endpoints
oai:arXiv.org:2606.02410v1
arXiv:2606.02410v1 Announce Type: new
Abstract: Two-arm phase II clinical trials often benefit from an interim analysis that allows early stopping for futility, but Bayesian calibration of such designs is usually based on computationally intensive Monte Carlo simulation. In this work, a simulation-free methodology is developed to obtain Bayesian optimal two-stage designs in two-arm phase II trials with binary endpoints using Bayes factors as the primary measure of evidence. Building on recent matrix-search methods for fixed-sample two-arm Bayes factor designs and earlier correction formulas for one-arm two-stage designs, the proposed approach derives exact expressions for the operating characteristics of a two-stage two-arm design with a single futility interim. Bayesian power and type-I error are obtained by correcting the corresponding fixed-sample quantities for trajectories that would have been removed by early stopping, yielding a fully numerical calibration procedure that avoids Monte Carlo error entirely. The resulting method searches over admissible interim and final sample sizes to identify the optimal design that satisfies target constraints on Bayesian power, type-I error, and the probability of compelling evidence in favour of the null hypothesis, while minimizing the expected sample size under the null hypothesis. The methodology is illustrated in realistic phase II settings, including a detailed re-analysis of the riociguat trial in systemic sclerosis. Overall, the approach extends simulation-free Bayes factor design methodology to the practically important setting of two-arm two-stage phase II trials and provides a transparent basis for Bayesian design calibration and sensitivity analysis.
AI and physics-based weather forecasting: A comparative study
oai:arXiv.org:2606.02508v1
arXiv:2606.02508v1 Announce Type: new
Abstract: In the last few years, AI-based models have become the centre of attention in weather forecasting due to their increasing accuracy and efficiency. Pioneering among weather services, ECMWF has developed its Artificial Intelligence Forecasting System (AIFS) model, which was first to provide data-driven ensemble forecasts in June 2024. Since July 2025, the AIFS ensemble model has been operational and runs in parallel with ECMWF's physics-based Integrated Forecasting System (IFS), which is considered the gold standard in weather prediction. The new AIFS model can generate forecasts ten times faster than the classical numerical weather prediction model, while consuming approximately a thousand times less energy. We present the results of our systematic assessment of the performance of the IFS and AIFS models by comparing the accuracy of raw and post-processed medium-range 10-m wind-speed ensemble forecasts generated operationally by the two models for the period between July and November 2025 for more than 9000 synoptic observation stations across the globe. The post-processed case involves the parametric ensemble model output statistics (EMOS) as well as the non-parametric quantile regression (QR) approach to correct any systematic inaccuracies in the raw forecasts. The predictive performance of raw IFS ensemble forecasts proves to be substantially superior to the skill of the raw AIFS predictions for all investigated forecast horizons. As expected, post-processing significantly improves the skill of both IFS and AIFS predictions, and, across most verification metrics, EMOS is superior to QR, especially for short lead times. Compared to the raw ensemble, the differences in skill between the matching IFS and AIFS predictions are substantially decreased by post-processing and are mostly significant at short lead times, when the IFS forecasts outperform their AIFS counterparts.
Space-Filling One-Factor-At-A-Time Designs
oai:arXiv.org:2606.02533v1
arXiv:2606.02533v1 Announce Type: new
Abstract: Space-filling designs are commonly used in deterministic computer experiments. However, they are ineffective for factor screening, which makes them inefficient when only a small subset of input factors is influential to the output. Recently developed screening designs, such as MOFAT designs, are effective at identifying important factors but lack space-filling properties, limiting their usefulness for surrogate modeling. In this article, we propose a new class of screening designs that improves the space-fillingness while retaining their screening capability. Through several numerical examples, we demonstrate that the proposed designs offer clear advantages over existing designs.
Probabilistic storyline attribution using machine learning
oai:arXiv.org:2606.02550v1
arXiv:2606.02550v1 Announce Type: new
Abstract: A fundamental goal in climate attribution is to estimate how forced climate change contributes to observed extreme weather events. The storyline attribution method compares an observed weather event, conditional on its atmospheric dynamic state (i.e., atmospheric circulation), in the current, 'factual' climate to an event with very similar circulation conditions in a hypothetical, 'counterfactual' climate. However, physical climate models cannot directly transfer these storyline counterfactuals across different climate forcing states. Statistical and machine learning techniques may overcome this limitation; yet, emulating circulation-conditional extreme events under different climate states is challenging. Here, we demonstrate distributional autoencoders (DAEs) as a versatile method for generating climate counterfactuals. They model the full distribution of spatially resolved European temperature fields conditional on the atmospheric circulation state and the mean global warming level. These distributions allow for deriving meaningful conditional probability ratios, which is a particular advantage of the DAE-based storyline approach. We train DAEs on fully coupled climate model simulations and we evaluate the modelled distributions across different factual and storyline-based counterfactual climate model simulations. In an illustrative case study, we revisit the 2003 European heatwave and we generate counterfactuals for a hypothetical '2003-like European heatwave' using ERA5 circulation, which we hypothesize to occur a quarter century (2028) and a half century (2053) after 2003. The conditional intensity would increase from 29.3 °C in 2003 to 30.3 °C and 32.1 °C in 2028 and 2053, respectively, and the conditional probability ratios would be 2.1 and 3.2 relative to 2003.
Optimal Rates for Differentially Private Hypothesis Testing with E-values
oai:arXiv.org:2605.28952v2
arXiv:2605.28952v2 Announce Type: cross
Abstract: E-values have attracted considerable interest in recent years as flexible tools for enabling anytime-valid and adaptive data analysis. Hypothesis testing is at the core of many of these applications, which can often involve private or sensitive data. In this work, we answer a simple but important question: given two distributions $\mathbb{P}$ and $\mathbb{Q}$, what is the maximum achievable e-power when testing $X\sim \mathbb{P}^n$ against $X\sim\mathbb{Q}^n$ with e-values that satisfy $\varepsilon$-differential privacy? We characterize the optimal rate for this problem and provide an algorithm which matches it exactly. In the sequential setting, when observations arrive one-by-one and the analyst chooses when to halt, we give matching upper and lower bounds on the stopping times of any private e-process. Numerical experiments confirm the practicality of our algorithms, which require less data than the recently proposed DP-SPRT across a range of sequential testing problems and privacy levels.
Hoeffding Concept Bottleneck Models with Applications to Overhead Images
oai:arXiv.org:2606.00082v1
arXiv:2606.00082v1 Announce Type: cross
Abstract: Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proved to be robust to interconcept leakage, and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.
Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions
oai:arXiv.org:2606.00115v1
arXiv:2606.00115v1 Announce Type: cross
Abstract: Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
Agentic Transformers Provably Learn to Search via Reinforcement Learning
oai:arXiv.org:2606.00183v1
arXiv:2606.00183v1 Announce Type: cross
Abstract: Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic $k$-ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth-$1$ and depth-$2$ trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate
oai:arXiv.org:2606.00241v1
arXiv:2606.00241v1 Announce Type: cross
Abstract: Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly iterative optimization for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving $100\times$ speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation as an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.
Dynamics and Representation Structure of Local Approximations to Gradient-Based Learning in Linear Recurrent Neural Networks
oai:arXiv.org:2606.00243v1
arXiv:2606.00243v1 Announce Type: cross
Abstract: Biological and neuromorphic recurrent neural networks (RNNs) are subject to spatial and temporal locality constraints on the information that can plausibly be used during learning. A common strategy to satisfy these constraints is to modify gradient descent by neglecting non-local terms to varying degrees, as in random feedback local online (RFLO) learning and truncated backpropagation through time (tBPTT). However, the learning dynamics of these algorithms, and how they compare with BPTT, remain poorly understood. We apply dynamical systems theory to data-aligned linear RNNs -- whose dynamics can be separated into orthogonal modes -- to compare stationary solutions, stability properties, and convergence rates, finding qualitatively distinct behaviour for RFLO versus BPTT and one-step tBPTT. We further observe that the solutions learned by RFLO are restricted to low-rank perturbations of initial parameters, a result which holds beyond the data-aligned setting. Our work provides analytical insight into how locality constraints shape learning dynamics, with implications for neuroscientific models of learning and alternative optimization approaches for RNNs.
When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE
oai:arXiv.org:2606.00262v1
arXiv:2606.00262v1 Announce Type: cross
Abstract: InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose \textsc{WEINCE}, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, \textsc{WEINCE} yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.
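For readers unfamiliar with the softmax form referred to in this abstract, the following is a minimal sketch of the standard InfoNCE loss on normalized embeddings. It does not implement the WEINCE correction, whose details are not given in the abstract. The function name `info_nce`, the temperature of 0.1, and the random test data are illustrative assumptions.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE: row i's positive is positives[i], the rest are negatives."""
    # Normalize embeddings to unit length, as is common in contrastive learning.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Numerically stable log-sum-exp over each row (the softmax normalizer).
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # Negative log-softmax probability of the positive, averaged over anchors.
    return float(np.mean(log_norm - np.diag(logits)))


rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned_loss = info_nce(anchors, anchors)                    # positives match anchors
random_loss = info_nce(anchors, rng.normal(size=(8, 16)))    # unrelated positives
print(aligned_loss, random_loss)
```

The loss is small when each anchor's positive is its top-scoring candidate and grows toward and beyond log(batch size) when positives are no better than negatives.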
Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo
oai:arXiv.org:2606.00293v1
arXiv:2606.00293v1 Announce Type: cross
Abstract: Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in the practically relevant settings when the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enables accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the $\beta$-divergence rather than log-loss to obtain statistically robust inferences.
Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo
oai:arXiv.org:2606.00309v1
arXiv:2606.00309v1 Announce Type: cross
Abstract: Stochastic gradient Langevin dynamics combined with Gibbs updates (SGLD--Gibbs) provides a highly scalable approach to approximate Bayesian inference in latent variable models. However, it remains unclear how to tune the algorithm's hyperparameters in a principled manner to ensure the uncertainty estimates are statistically meaningful. In this work, we address this gap in tuning guidance by developing a statistical scaling limit theory for SGLD--Gibbs. We derive a joint asymptotic limit for the global parameters and latent variables under appropriate space-time rescaling. We show that global parameters converge to a diffusion-type limit, while each latent variable converges to a jump process, reflecting the use of intermittent Gibbs updates. This joint jump-diffusion structure reveals how latent-variable randomness contributes to the stationary distribution of the global parameters. We leverage our results to propose explicit guidance on hyperparameter tuning for SGLD--Gibbs that ensures meaningful uncertainty quantification. Numerical experiments show that SGLD--Gibbs with our tuning guidance leads to better parameter estimates, uncertainty quantification, and predictive performance than stochastic variational inference.
Perturbative methods for non-parametric instrumental variable
oai:arXiv.org:2606.00322v1
arXiv:2606.00322v1 Announce Type: cross
Abstract: We introduce a perturbative approach for nonparametric instrumental variable (NPIV) estimation. By drawing inspiration from perturbation theory in physics, we extend standard kernel ridge methods with systematic higher-order perturbative corrections that significantly improve estimation accuracy. Spectrally, the perturbation introduces mixing between different eigenmodes of the expectation integral operator, which becomes especially useful when the integral equation is ill-defined. One source of such ill-definedness is the curse of dimensionality. Our method performs well across various dimensionality regimes, particularly when the dimensionality parameter $\beta$, defined through the number of samples $n$ and the dimension $d$ by $n^\beta = d$, becomes large. Experimental results show that our first-order perturbative corrections can reduce prediction error by up to 99\% in high-dimensional ill-defined cases ($\beta > 0.7$) compared to standard ridge regression approaches. The performance improvement is maintained across a wide range of dimensions, with the advantage becoming more pronounced as dimensionality increases.
Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control
oai:arXiv.org:2606.00329v1
arXiv:2606.00329v1 Announce Type: cross
Abstract: Recursive systems can enter collapse-like regimes -- self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation -- before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity ($\delta$). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance.
We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked equal-false-positive contract (FP $\in$ [0.03, 0.07], pre-registered) so all configurations face the same alert budget. Neither tested standard comparators nor Loopzero's pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred.
The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract -- non-acceptance reported as a first-class scientific outcome.
VESTA: Visual Exploration with Statistical Tool Agents
oai:arXiv.org:2606.00384v1
arXiv:2606.00384v1 Announce Type: cross
Abstract: Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.
Probing and graph coloring techniques for trace estimation in Lattice QCD
oai:arXiv.org:2606.00394v1
arXiv:2606.00394v1 Announce Type: cross
Abstract: The computation of $\mathrm{Tr}[D^{-1}]$, where $D$ is the Wilson-Dirac matrix of Lattice QCD, is a fundamental and computationally demanding task with applications to disconnected hadronic correlation functions. Since $D^{-1}$ is a dense matrix of prohibitive size, its trace cannot be computed exactly, and one must resort to stochastic estimation via the Hutchinson estimator. The variance of the resulting estimation, however, can be large, as it is dominated by the off-diagonal entries of $D^{-1}$. We review the stochastic probing technique, which reduces the variance by constructing structured sampling vectors from distance-$d$ colorings of the graph associated with $D$, exploiting the exponential off-diagonal decay of $D^{-1}$ to eliminate dominant short-range contributions to the variance. We then present a novel multiplier-based coloring scheme, which achieves valid distance-$d$ colorings at arbitrary distances with significantly fewer colors than the established hierarchical probing construction. We prove that at any intermediate coloring falling between two consecutive hierarchical levels, the multiplier-based estimator achieves strictly lower variance than the partial hierarchical estimator, for large enough $d$. This is confirmed by numerical experiments showing that the multiplier-based variance decreases smoothly and monotonically with the number of colors, avoiding the irregular behavior affecting hierarchical probing at intermediate colorings, and achieving a substantial improvement in relative accuracy.
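The plain Hutchinson estimator that this abstract builds on can be sketched in a few lines. The matrix below is a small, hypothetical, well-conditioned test matrix rather than a Wilson-Dirac operator, and the sketch uses unstructured Rademacher vectors, not the probing or coloring constructions the paper studies. The function name `hutchinson_trace_inv` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Hypothetical test matrix: diagonally dominant, so it is safely invertible.
A = 4.0 * np.eye(n) + 0.1 * rng.normal(size=(n, n))

# Reference value, affordable only because n is tiny here.
exact = np.trace(np.linalg.inv(A))


def hutchinson_trace_inv(A, num_samples, rng):
    """Estimate Tr[A^{-1}] as the average of z^T A^{-1} z over Rademacher z."""
    size = A.shape[0]
    Z = rng.choice([-1.0, 1.0], size=(size, num_samples))  # one probe per column
    Y = np.linalg.solve(A, Z)                               # A^{-1} z via linear solves
    return float(np.mean(np.sum(Z * Y, axis=0)))


estimate = hutchinson_trace_inv(A, 500, rng)
print(exact, estimate)
```

Only linear solves with `A` are needed, never the dense inverse. The estimator's variance comes entirely from the off-diagonal entries of the inverse, which is the term probing schemes aim to suppress.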
Exploiting weight-space symmetries for approximating curvature
oai:arXiv.org:2606.00442v1
arXiv:2606.00442v1 Announce Type: cross
Abstract: Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weight-space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures, and deploy it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
oai:arXiv.org:2606.00467v1
arXiv:2606.00467v1 Announce Type: cross
Abstract: Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
Constructive interpolation and generalization rates for neural ODEs: a control perspective
oai:arXiv.org:2606.00469v1
arXiv:2606.00469v1 Announce Type: cross
Abstract: We study supervised regression with neural ODEs (NODEs) from a control-theoretic perspective to derive explicit population-risk bounds. We focus on a widely used class of non-autonomous models with constant parameters and explicit time dependence, which we call semi-autonomous NODEs (SA-NODEs). We constructively prove that SA-NODEs are capable of \emph{exact} interpolation of admissible finite datasets, and even satisfy a stronger property that we call \emph{simultaneous cell controllability} (SCC): their flows can map prescribed disjoint cells into arbitrarily small target balls. This property is the mechanism that upgrades interpolation into quantitative generalization, by allowing SA-NODEs to emulate piecewise-constant nonparametric estimators. Consequently, our risk bounds recover the rates of histogram and nearest-neighbor estimators, provided the network width satisfies a conservative scaling with the sample size. Numerical experiments show that trained SA-NODEs achieve competitive -- often lower -- test errors than these baselines. Finally, we show that the explicit time dependence is essential. Although two-layer autonomous NODEs can interpolate geometrically nondegenerate datasets, structural obstructions prevent them from achieving SCC. These limitations, further confirmed numerically, support the view that SA-NODEs provide a minimal effective architecture for learning.
Continuous Data Assimilation with Learned Surrogate Dynamics
oai:arXiv.org:2606.00480v1
arXiv:2606.00480v1 Announce Type: cross
Abstract: Continuous data assimilation seeks to estimate the state of a dynamical system from partial observations. In many applications, however, the state dynamics are unknown or prohibitively expensive to simulate at the required resolution, leading to model error. Motivated by this challenge and the increasing adoption of machine learning surrogates in data assimilation, this paper develops a unified finite-dimensional analysis of nudging algorithms that employ learned surrogate models of the dynamics. We first establish general conditions on the dynamics and observations that guarantee accurate tracking for nudging with the true dynamics model, both in the noise-free and noisy settings. We then show that nudging algorithms that employ surrogate models retain exponential convergence up to an explicit error floor that quantifies the effects of surrogate approximation error and observation noise. Finally, we analyze surrogate models obtained by learning either the vector field or the short-time solution map of the system, and quantify the amount of training data needed to ensure accurate nudging in the noise-free setting. Numerical experiments support the theory.
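A minimal sketch of nudging in the noise-free setting with the true dynamics model: the estimate follows the same dynamics as the true state, plus a feedback term that pulls its observed component toward the observation. Everything concrete here is an illustrative assumption rather than the paper's setup: a two-dimensional harmonic oscillator, observation of the first component only, nudging gain `mu = 2.0`, and forward Euler time stepping.

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])  # linear dynamics dx/dt = A x (illustrative)
H = np.array([[1.0, 0.0]])               # only the first component is observed
mu, dt, steps = 2.0, 0.01, 2000          # nudging gain, step size, number of steps

x = np.array([1.0, 0.0])    # true state, unknown to the estimator
z = np.array([-2.0, 3.0])   # estimate, started far from the truth
initial_error = np.linalg.norm(x - z)

for _ in range(steps):
    # Mismatch between the observation of the truth and of the estimate.
    innovation = H @ x - H @ z
    x_next = x + dt * (A @ x)
    # Nudged dynamics: model step plus feedback on the observed component.
    z = z + dt * (A @ z + mu * (H.T @ innovation))
    x = x_next

final_error = np.linalg.norm(x - z)
print(initial_error, final_error)
```

Although only one component is observed, the error in the unobserved component also decays because the dynamics couple the two. Replacing `A` in the update of `z` with an imperfect surrogate would leave a residual error rather than convergence to zero, which is the error floor the abstract refers to.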
Stochastic Analysis of Cybersecurity Defense Strategies Under Single Attack Scenario
oai:arXiv.org:2606.00481v1
arXiv:2606.00481v1 Announce Type: cross
Abstract: This research presents a novel stochastic framework for proactive cybersecurity defense timing under a single attack scenario. The approach models the defense process as a continuous observation mechanism in which the defense instant and the subsequent observation slot follow independent exponential distributions. Laplace-Carson transforms combined with first-excess theory yield the joint detection function that brackets the attack moment. Marginalization under Markovian Poisson arrivals then produces the probability density of the defense moment and conditional expectations of pre-attack and post-attack observation times. These closed-form results enable quantitative assessment of defense timing sensitivity to threat intensity and support precise calibration of observation parameters for low-latency proactive measures. Major contributions include the explicit derivation of marginal distributions and expected values, visualization of defense moment density, and the bridging of stochastic duel methodology with practical cybersecurity applications.
Easy, robust approximate message passing for planted spike models
oai:arXiv.org:2606.00500v1
arXiv:2606.00500v1 Announce Type: cross
Abstract: We present a simple and efficient algorithm for robust approximate message passing (AMP) in the spiked matrix setting. In particular, let $\varepsilon$ be a sufficiently small constant, and suppose that $X \in \mathbb R^{n \times n}$ is a Gaussian matrix with a planted rank-$1$ spike, and $E \in \mathbb R^{n \times n}$ is an adversarially chosen matrix supported on an $\varepsilon n \times \varepsilon n$ principal minor. Let $v_{\mathrm{AMP}}(X)$ be the output of an AMP iteration on the uncorrupted matrix $X$. We give a procedure that, given access only to the corrupted matrix $Y = X + E$, computes a vector $v_{\mathrm{ALG}}(Y)$ which is $\tilde{O}(\sqrt{\varepsilon})$-close to $v_{\mathrm{AMP}}(X)$, for any of a class of AMP iterations which includes sparse Principal Component Analysis (PCA), non-negative PCA, and $\mathbb Z_2$ synchronization. Our algorithm consists of a spectral pre-processing step combined with a robust spectral initialization procedure; given these inputs, we prove that (perhaps surprisingly) AMP is robust out-of-the-box.
Semi-Supervised Learning with Noisy Proxy Covariates: Generalization Bounds and Distribution Regression
oai:arXiv.org:2606.00512v1
arXiv:2606.00512v1 Announce Type: cross
Abstract: In many modern machine learning pipelines, abundant pretrained representations serve as noisy proxy covariates, while task-specific labels remain scarce. We study semi-supervised regression in this setting, and propose a simple two-stage estimator that learns kernel eigenfeatures from all proxy covariates and fits a ridge predictor on labeled data. We derive finite-sample bounds showing that fast labeled-sample rates are recovered when proxy perturbation is controlled and unlabeled proxy covariates are sufficiently abundant. We also show that distribution regression is a direct special case, with analogous guarantees when the finite bag size is large enough. Experiments show consistent gains over supervised and semi-supervised baselines, especially in low-label regimes.
In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise
oai:arXiv.org:2606.00520v1
arXiv:2606.00520v1 Announce Type: cross
Abstract: Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient Descent ($\textsf{SGD}$), without any modification to its update rule, can surprisingly converge in expectation for convex problems with bounded domains, highlighting the potential of classical stochastic gradient methods. Inspired by this recent progress, we provide a comprehensive study of stochastic optimization under heavy-tailed noise and establish new in-expectation convergence results for Stochastic Mirror Descent ($\textsf{SMD}$) and Accelerated Stochastic Mirror Descent ($\textsf{ASMD}$) in convex optimization, and for $\textsf{SGD}$ and Stochastic Gradient Descent with Momentum ($\textsf{SGDM}$) in nonconvex optimization. Notably, our results not only hold without algorithmic changes but also avoid restrictive assumptions, such as bounded domains, imposed in prior work. More importantly, our analysis provides a new, elegant, and powerful framework for studying heavy-tailed stochastic optimization, opening a new route to understanding first-order stochastic gradient methods.
GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
oai:arXiv.org:2606.00539v1
arXiv:2606.00539v1 Announce Type: cross
Abstract: Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $\Delta$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.
A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models
oai:arXiv.org:2606.00563v1
arXiv:2606.00563v1 Announce Type: cross
Abstract: Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.
Looped Transformers with Layer Normalization Provably Learn the Power Method
oai:arXiv.org:2606.00605v1
arXiv:2606.00605v1 Announce Type: cross
Abstract: Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
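For readers unfamiliar with the power method named in the abstract above, here is a minimal sketch of the textbook iteration on a symmetric matrix. It is not the looped transformer itself; the function name, iteration count, and random starting vector are illustrative assumptions.

```python
import numpy as np

def power_method(matrix, num_iters=50, seed=0):
    """Approximate the leading eigenvector of a symmetric matrix."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(matrix.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = matrix @ v            # one power iteration
        v /= np.linalg.norm(v)    # renormalize so the iterate stays bounded
    return v

# Toy covariance with eigenvalues 3 and 1; leading eigenvector is (1, 1)/sqrt(2).
cov = np.array([[2.0, 1.0], [1.0, 2.0]])
v_top = power_method(cov)
```

The renormalization after each multiplication is the step that keeps the iterate on the unit sphere; the result is determined only up to sign.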
A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery
oai:arXiv.org:2606.00630v1
arXiv:2606.00630v1 Announce Type: cross
Abstract: Intraoperative ultrasound (ioUS) is a versatile, cost-effective modality in brain tumour surgery, but its interpretation is difficult: acquisition planes are non-standard, artefacts are modality-specific, and its appearance differs markedly from the preoperative MRI on which surgical-planning tools, segmentation models and the surgeon's experience rely. Synthesising MRI-like images from ioUS could let this MRI-based infrastructure be reused intraoperatively without an extra scan. Most prior work evaluates a single architecture in isolation; to our knowledge, no benchmark has spanned architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. We address this gap on the public ReMIND data set (76 patients; 153 paired ioUS/T2w and 104 paired ioUS/FLAIR studies; 60/16 patient-level train/held-out split). Six generators (four GAN baselines: Pix2Pix, SwinPix2Pix, CycleGAN, CUT; the transformer-augmented ResViT; and the few-step diffusion model SynDiff) were each trained under four inference regimes (2D, 2.5D, 2D + 3D-refinement, full-3D) and two targets (T2w only; T2w + FLAIR multi-task), yielding 48 experiments. Image-fidelity metrics (SSIM, PSNR, MAE, LPIPS) were complemented by an nnU-Net v2 downstream segmentation evaluation (tumour and resection cavity) and by subgroup analyses by histological grade and reoperation. No architecture dominated every axis, and, critically, perceptual quality tracked downstream utility most closely (LPIPS, r=-0.66, p<0.001), whereas higher SSIM was associated with worse utility (r=-0.64, p<0.001); SynDiff-2.5D best preserved downstream segmentation (U_Dice=0.55). Perceptual and downstream-task metrics should therefore be reported alongside or in preference to global SSIM, and architecture choice conditioned on surgical phase, patient history and clinical objective.
Multi-Agent Conformal Prediction with Personalized Statistical Validity
oai:arXiv.org:2606.00717v1
arXiv:2606.00717v1 Announce Type: cross
Abstract: Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not address these challenges simultaneously and satisfactorily: their guarantees are either limited to averages across agents or lose validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.
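The weighted quantile at the heart of weighted conformal prediction can be sketched as below. This is the generic single-agent construction with a point mass at infinity for the test point, assuming the density-ratio weights are already given; it is not the PFWCP aggregation protocol, and the function and argument names are hypothetical.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Smallest calibration score whose normalized cumulative weight
    reaches 1 - alpha; returns inf if the calibration mass is too small."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    sorted_scores, sorted_weights = scores[order], weights[order]
    # The test point carries its own weight, placed at +infinity.
    total = sorted_weights.sum() + test_weight
    cumulative = np.cumsum(sorted_weights) / total
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    return sorted_scores[idx] if idx < len(sorted_scores) else np.inf

# With equal weights this reduces to the usual split-conformal quantile.
q = weighted_conformal_quantile(np.arange(1, 10), np.ones(9), 1.0, alpha=0.25)
```

A prediction set would then contain every candidate label whose nonconformity score is at most `q`.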
Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment
oai:arXiv.org:2606.00741v1
arXiv:2606.00741v1 Announce Type: cross
Abstract: Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on $\|W_\ell\|_\infty$ and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at $p_\mathrm{flip}$=0.10 and a transformer encoder at $p_\mathrm{flip}$=0.05, TAC reaches $95\%$ of clean accuracy with 3.4$\times$ to 33.6$\times$ less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio $\rho^*$ predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware--software co-design beyond conventional scaling limits.
Bayesian estimation of spectral parameters of the 6.7-GHz methanol maser G339.884-1.259 from GRAO observations
oai:arXiv.org:2606.00768v1
arXiv:2606.00768v1 Announce Type: cross
Abstract: Accurate decomposition of methanol maser spectra is essential for understanding high-mass star-forming regions, especially in complex blended spectra where small differences alter physical interpretation. Conventional Gaussian fitting often fails to capture non-Gaussian structure and lacks uncertainty quantification. We develop a Bayesian spectral decomposition framework using Gaussian, Lorentzian, and Voigt profiles with Markov Chain Monte Carlo sampling, enabling model comparison and uncertainty estimation. Applied to the 6.7\,GHz methanol maser G339.884$-$1.259 observed with the Ghana Radio Astronomy Observatory, our method reveals seven velocity-coherent components. The Voigt model is statistically preferred, yielding the lowest AIC and BIC ($\approx 1.98 \times 10^{4}$ and $1.99 \times 10^{4}$), the smallest RMSE ($\approx 11.1$ Jy), and the highest $R^{2}$ (0.985). Purely Gaussian or Lorentzian models leave systematic residuals. Elevated reduced $\chi^{2}_{\nu}$ values indicate unresolved substructure and non-ideal noise. Bayesian inference provides a robust framework for maser spectral analysis, extendable to other molecular lines and combinable with high-resolution interferometry.
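The AIC and BIC criteria used for model comparison in the abstract above follow the standard definitions, sketched here. The function names are illustrative, and the log-likelihoods in the usage lines are made-up toy numbers, not values from the paper.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * num_params - 2.0 * log_likelihood

def bic(log_likelihood, num_params, num_points):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return num_params * math.log(num_points) - 2.0 * log_likelihood

# Toy comparison: a richer model is preferred only if its likelihood gain
# outweighs the penalty for its extra parameters (lower is better).
simple_model = aic(log_likelihood=-120.0, num_params=3)
richer_model = aic(log_likelihood=-100.0, num_params=4)
preferred = "richer" if richer_model < simple_model else "simple"
```

BIC penalizes each extra parameter by `ln n` rather than 2, so it favors smaller models more strongly once the sample has more than about 7 points.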
A Finite-Calibration Regime Map for LLM Judge Panels
oai:arXiv.org:2606.01034v1
arXiv:2606.01034v1 Announce Type: cross
Abstract: We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset--budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not ``how many judges?'' but whether the next judge's information is estimable under the available human labels.
Non-Vacuous Certification of Transport MCMC via Oscillation-Controlled Normalizing Flows
oai:arXiv.org:2606.01078v1
arXiv:2606.01078v1 Announce Type: cross
Abstract: Transport MCMC trains a normalizing flow to precondition Metropolis--Hastings proposals, achieving high empirical efficiency on challenging posteriors; yet no prior work produces a numerically non-vacuous, rigorous spectral-gap bound for such samplers. We establish the first such bounds. For independence MH on the banana family we certify $\gamma^\ast = 0.828$ at $D = 2$ (covering in the original space) and $\gamma^\ast \ge 7.6\times 10^{-4}$ at $D = 5$ (covering in an analytically unwarped Gaussian space with a grid-certified gradient bound under the stated numerical Lipschitz certification), both rigorous at 95% confidence. The framework rests on three pillars: (i) spectral normalization with reduced scale clips constrains the flow Lipschitz constant from $10^{47}$ to $10^4$; (ii) a coverage-based empirical oscillation bound replaces the vacuous analytical bound with a data-dependent certificate; and (iii) oscillation-regularised training cuts the empirical oscillation by 60--90% at no cost to density fit, extending practical certificates through $D = 20$ ($\gamma^\ast \ge 1.7\times 10^{-4}$). Tests on four further targets (Gaussian mixture, shear-building, Neal's funnel, Bayesian logistic regression) identify three precise barriers: boundary curvature, target stiffness, and tail-coverage mismatch. An affine-vs-spline comparison shows that simpler architectures yield tighter certificates at identical NLL, inverting the usual expressiveness hierarchy.
Revisiting Neural Processes via Fourier Transform and Volterra Series
oai:arXiv.org:2606.01172v1
arXiv:2606.01172v1 Announce Type: cross
Abstract: Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.
Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers
oai:arXiv.org:2606.01217v1
arXiv:2606.01217v1 Announce Type: cross
Abstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice, across three ethnic groups: White Europeans, Asians, and Middle Easterners. We perform a logistic regression and show that ethnicity has a significant effect on the incidence of ASD. White Europeans are at 81% increased risk of ASD and Middle Easterners are at 79% reduced risk of ASD compared to Asians. We also confirm earlier studies which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD compared to female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic differences in the presentation and assessment of ASD traits.
Sample Complexity and Decision-Theoretic Guarantees for Bayesian Model Averaging over Decision Trees with Catalan-Exponential Priors
oai:arXiv.org:2606.01340v1
arXiv:2606.01340v1 Announce Type: cross
Abstract: We ask: when do Bayesian model averaging (BMA) weights over decision trees carry sufficient epistemic information to justify committed exploitation of the averaging distribution? We answer this question in closed form for Bayesian decision trees (BDTs) with Dirichlet-Multinomial leaf models and a Catalan-exponential tree-size prior (Schetinin & Jakaite, 2025), establishing a complete non-asymptotic theory of rational commitment thresholds.
Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks
oai:arXiv.org:2606.01432v1
arXiv:2606.01432v1 Announce Type: cross
Abstract: Accurate modeling of leaf spectral reflectance from physiological and biochemical traits is essential for advancing remote sensing applications in plant science and precision agriculture. Widely used radiative transfer models, such as PROSPECT-PRO, rely on generalized trait-reflectance relationships developed from a wide range of species, which may not fully capture the spectral behavior of specific crops like grapevines. In this study, we developed a trait-to-spectra prediction model using a multi-head attention neural network trained on a grapevine-specific dataset that includes 16 leaf traits measured across multiple varieties, growth stages, and years. The model was evaluated using stratified 5-fold cross-validation and achieved an average coefficient of determination (R^2) of 0.84 and normalized root mean squared error (NRMSE) of 1.52 percent, demonstrating high accuracy and generalizability. When compared to PROSPECT-PRO in forward mode, the neural network exhibited lower mean absolute error (MAE), especially in the near-infrared (NIR) and shortwave-infrared (SWIR) regions. These results emphasize the importance of species-specific modeling approaches and show that integrating biochemical and structural traits into data-driven architectures can significantly improve spectral prediction. The proposed model provides a robust framework for generating accurate leaf-level reflectance data, with potential applications in canopy trait retrieval, vineyard monitoring, and remote sensing-driven crop management.
Transferring Information Across Interventions in Causal Bayesian Optimization
oai:arXiv.org:2606.01457v1
arXiv:2606.01457v1 Announce Type: cross
Abstract: Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target's parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.
The Information Content of Quasar Variability Light Curves: How Well Can we Infer Stochastic Model Parameters?
oai:arXiv.org:2606.01496v1
arXiv:2606.01496v1 Announce Type: cross
Abstract: Quasar variability, driven by multi-scale physical processing within a relativistic accretion disk, is commonly modelled with stochastic time series models. The simplest of these is the Damped Random Walk (DRW), also known as the Ornstein-Uhlenbeck (OU) process. Here, we demonstrate that, when fitting such a model to quasar light curve data, the mean of the light curve, $\mu$, should not be fixed (which is the typical approach), as this leads to overconfident inferences about the variability timescale $\tau$, with substantially underestimated uncertainties. However, the short term volatility parameter $\eta$ is typically very well constrained from short light curves. Through simulations, we compute information theoretic quantities such as the conditional entropy and the mutual information, confirming that light curves provide much more information about $\eta$ than about $\tau$. As a result, we recommend that future quasar variability studies focus on $\eta$ rather than $\tau$. To demonstrate this approach, we fit a hierarchical Bayesian regression model for $\eta$ as a function of bolometric luminosity and rest wavelength to a dataset of 570 light curves measured over decades. We perform the fit using a likelihood function that uses the light curves directly, rather than using intermediate $\eta$ values from individual light curve fits. We find that volatility decreases as a function of both bolometric luminosity and rest wavelength. The volatility also decreases more steeply with redshift than time dilation alone would suggest, pointing to an increase in intrinsic volatility as quasars evolve over cosmic time.
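A damped random walk of the kind discussed in the abstract above can be simulated on irregular time stamps with the exact Ornstein-Uhlenbeck transition, as in this minimal sketch. The parameterization $dX = -(X-\mu)/\tau\,dt + \sigma\,dW$ and the name `sigma` are assumptions for illustration; the paper's own volatility parameter $\eta$ may be defined differently.

```python
import numpy as np

def simulate_drw(times, mu, tau, sigma, x0, seed=0):
    """Sample an OU path at arbitrary (sorted) observation times.

    Uses the exact conditional distribution between consecutive times,
    so irregular sampling needs no small-step approximation.
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    x = np.empty_like(times)
    x[0] = x0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        decay = np.exp(-dt / tau)
        # Conditional variance; tends to sigma^2 * tau / 2 for long gaps.
        step_var = 0.5 * sigma**2 * tau * (1.0 - decay**2)
        noise = np.sqrt(step_var) * rng.standard_normal()
        x[i] = mu + (x[i - 1] - mu) * decay + noise
    return x

light_curve = simulate_drw([0.0, 0.5, 3.0, 3.2, 10.0], mu=18.0, tau=2.0,
                           sigma=0.3, x0=18.0)
```

Under this parameterization, gaps much shorter than `tau` constrain `sigma` through the step-to-step scatter, while `tau` is only constrained once the light curve is long compared with the timescale itself.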
Fast Generalization after Interpolation via Critically Damped Momentum Optimization
oai:arXiv.org:2606.01521v1
arXiv:2606.01521v1 Announce Type: cross
Abstract: A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.
Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors
oai:arXiv.org:2606.01525v1
arXiv:2606.01525v1 Announce Type: cross
Abstract: Semi-supervised hierarchical clustering aims to learn a tree structure consistent with data patterns and user-provided supervision. Supervision is usually given as leaf-level relations, such as pairwise must-link/cannot-link constraints or triplet-wise must-link-before constraints. Although useful for regulating local sample relations, such supervision does not directly indicate which samples should form coherent subtrees. Consequently, the non-leaf structure of the learned tree may deviate from the hierarchical organization preferred by ground-truth labels. To address this limitation, we propose a semi-supervised hyperbolic hierarchical clustering method with set-level structural priors. The main contribution is to introduce sets as basic modeling units for hierarchy learning. Each set denotes samples expected to cohere within a subtree and is induced from leaf-level supervision together with a learned constraint-consistent similarity structure. These sets act as soft structural priors for subtree-level supervision, allowing supervision to guide non-leaf hierarchy formation beyond local leaf-level relations. Specifically, we first learn constraint-consistent embeddings to obtain a reliable set partition, then construct constraint-induced sets and estimate inter-set similarities to form set-level structural priors. Finally, these priors are incorporated into a hyperbolic hierarchy objective for continuous tree optimization. Experiments on eleven benchmark datasets and ablation studies show that the proposed method consistently improves label consistency over representative hierarchical clustering baselines while also enhancing similarity-based tree quality.
ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL
oai:arXiv.org:2606.01619v1
arXiv:2606.01619v1 Announce Type: cross
Abstract: Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
Post Selection Estimation of Sharpe Ratios
oai:arXiv.org:2606.01650v1
arXiv:2606.01650v1 Announce Type: cross
Abstract: We consider the problem of estimating the true Sharpe ratio of an asset selected for having the highest observed in-sample Sharpe ratio among many assets. We discuss estimators based on the polyhedral lemma, James-Stein shrinkage, debiasing the expected maximum Sharpe ratio, thresholding, and empirical Bayes. We test these estimators in simulations, computing bias and root mean square error across different values of sample size, number of assets, and spread and shape of population Sharpe ratios. We also compute the rank correlation of the estimators against the underlying quantity, simulating how these estimators might be used to compare or rank the output of different teams that perform this selection process. We find that the James-Stein estimator provides the best performance across many different realistic values of the relevant parameters, followed by the GMLEB estimator of Jiang and Zhang. These results are fairly robust to correlation of asset returns, with some caveats.
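The selection effect described in this abstract can be illustrated with a small simulation. The sketch below is not one of the paper's estimators; it only shows, under assumed toy settings (i.i.d. Gaussian returns with unit volatility, every asset having a true per-period Sharpe ratio of zero, and the hypothetical function names `sample_sharpe` and `mean_selection_bias`), that the highest in-sample Sharpe ratio overstates the true Sharpe ratio of the selected asset.

```python
import random
import statistics


def sample_sharpe(true_sharpe, n_obs, rng):
    """In-sample (per-period) Sharpe ratio of n_obs i.i.d. Gaussian returns with unit volatility."""
    returns = [rng.gauss(true_sharpe, 1.0) for _ in range(n_obs)]
    return statistics.mean(returns) / statistics.stdev(returns)


def mean_selection_bias(n_assets=50, n_obs=250, n_trials=100, seed=0):
    """Average of (observed Sharpe of the selected asset - its true Sharpe)."""
    rng = random.Random(seed)
    true_sharpe = 0.0  # assumption: all assets share the same true Sharpe ratio
    gaps = []
    for _ in range(n_trials):
        observed = [sample_sharpe(true_sharpe, n_obs, rng) for _ in range(n_assets)]
        # Select the asset with the highest in-sample Sharpe ratio.
        gaps.append(max(observed) - true_sharpe)
    return statistics.mean(gaps)


if __name__ == "__main__":
    print(f"mean upward bias of the selected asset: {mean_selection_bias():.3f}")
```

In this toy setting the bias is positive by construction, which is the quantity a post-selection estimator would need to correct.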
MINTS: Minimalist Thompson Sampling
oai:arXiv.org:2606.01655v1
arXiv:2606.01655v1 Announce Type: cross
Abstract: The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai--Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.
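For orientation, classical Thompson sampling for an unstructured Bernoulli bandit can be sketched as follows. This is the standard Beta-Bernoulli baseline, not MINTS: it places a prior on every arm mean rather than only on the location of the optimum, and it does not use a profile likelihood. The function name `thompson_bernoulli` and the arm means in the usage line are illustrative assumptions.

```python
import random


def thompson_bernoulli(true_means, n_rounds, seed=0):
    """Classical Beta-Bernoulli Thompson sampling; returns the pull count per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one sample per arm from its Beta(1 + successes, 1 + failures) posterior.
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_bernoulli([0.2, 0.5, 0.8], n_rounds=2000))
```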
Data-Automated Policy Learning for Nonlinear Welfare
oai:arXiv.org:2606.01659v1
arXiv:2606.01659v1 Announce Type: cross
Abstract: This paper explores policy learning from observational data, focusing on a nonlinear welfare criterion in a binary treatment setting. The nonlinear criterion is inspired by scenarios where policymakers prioritize specific population segments. We model this criterion using a utility function that encompasses potential outcomes and intermediate parameters, with the latter capturing higher moments of the outcome distributions. When formulated in the context of observational data, both the intermediate parameters and the welfare criterion depend on the propensity score, which we estimate using machine-learning techniques. To address bias in machine learning estimates, we introduce a novel reweighting-based debiasing approach that offers a promising alternative to traditional orthogonality-based methods. To tackle the complexities of infinite-dimensional policy spaces, we employ sieve approximations and $K$-fold cross-validation for model selection, thereby fully automating the policy-learning process. Despite these complexities, we demonstrate that both the welfare regret and the average welfare regret of our proposed policy learning method satisfy an oracle inequality, thereby providing theoretical guarantees on the performance of the estimated policy relative to the best possible policy. This finding extends the existing results from linear to nonlinear welfare criteria, from finite-dimensional to infinite-dimensional policy spaces, and from a known propensity score to a machine-learned one.
Feature leakage and the identifiability of direct-dependency entropy models of neural activity
oai:arXiv.org:2606.01661v1
arXiv:2606.01661v1 Announce Type: cross
Abstract: Biological neurons receive thousands of synaptic inputs on branching, electrically excitable dendrites, yet population activity is often modeled with direct input-output rules in which each input contributes independently to a scalar drive. We study what successful prediction by such models does, and does not, reveal about neural computation. For conditional maximum-entropy models that match output rates and pairwise output-input coactivities, the entropy explained by a direct model is a prediction measure under the sampled input distribution, not a mechanism-identification test. A restricted MaxEnt fit is an information projection: omitted interaction, temporal, or hidden-state terms can be absorbed into fitted first-order parameters whenever they are correlated with the included sufficient statistics. For sparse correlated binary inputs, this absorption has an explicit coskewness form. We introduce diagnostics that separate in-distribution prediction from recovery of the response rule: state reweighting that holds P(y|x) fixed while changing P(x), conditional log-odds contrasts for local additivity, and temporal leakage controls. In ground-truth simulations, purely higher-order responses can pass first-order entropy and raw coactivity tests under leakage-prone sampling, but are correctly classified after reweighting. Applied to selected, leakage-enriched local tables from CA1 hippocampal recordings, approximately half of tables that appear first-order under empirical weights become distribution-sensitive under balanced reweighting, far above a matched additive-surrogate null. Thus direct entropy-explained fractions and raw coactivity predictions should be interpreted as predictions under the observed state distribution, not as evidence that mechanisms outside the direct model are absent or small.
HS3: A Descriptive, Interoperable Serialization Standard for Statistical Models in High-Energy Physics
oai:arXiv.org:2606.01760v1
arXiv:2606.01760v1 Announce Type: cross
Abstract: Statistical models in high-energy physics formally encode the relationship between observed data, physics parameters of interest, and experimental and theoretical uncertainties. Likelihood-based inference is the central tool for precision measurements, effective field theory fits, and cross-analysis combinations. Consequently, there is an increasing need for machine-readable, descriptive, and portable model representations. Existing formats such as ROOT workspaces, pyhf JSON, and CMS DataCards provide valuable capabilities but remain tied to specific software stacks and offer no universal standard for exchange, validation, or long-term preservation. We introduce HS3, the High-Energy Physics Statistics Serialization Standard, an implementation-agnostic, human-readable, and extensible serialization format for statistical models. HS3 is designed such that new statistical constructs can be incorporated through backward-compatible extensions, while inference procedures and implementation-specific execution details remain the responsibility of downstream frameworks. HS3 represents likelihoods as computational graphs composed of named distributions, functions, datasets, domains, and analysis prescriptions. It supports binned and unbinned likelihoods as well as hierarchical composite models. HS3 is convertible from and to ROOT/RooFit and is a superset of pyhf. We describe the design principles, structure, and semantics of HS3 and summarize existing implementations in C++, Python, and Julia. We also present early applications to public likelihoods on HEPData, cross-framework validation, and reproducibility efforts. HS3 provides a foundation for FAIR (Findable, Accessible, Interoperable, Reusable), long-lived statistical models at the LHC and beyond. The standard is intended to serve the broader scientific community and to evolve over time for application across a wide range of domains.
Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits
oai:arXiv.org:2606.01799v1
arXiv:2606.01799v1 Announce Type: cross
Abstract: We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), to our knowledge the first unified framework to tackle all three objectives. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies that use this warm-start stage to optimize the specific objective at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler
oai:arXiv.org:2606.01827v1
arXiv:2606.01827v1 Announce Type: cross
Abstract: Sharpness-Aware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically selected through extensive hyperparameter tuning or predefined schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In the smooth setting, we prove linear convergence for strongly convex objectives and an $\mathcal{O}(1/T)$ convergence rate for convex objectives in the deterministic case. In the stochastic setting, we establish analogous convergence guarantees up to a neighborhood of the optimum. Numerical experiments demonstrate that the proposed Polyak schedulers achieve performance comparable to or better than carefully tuned SAM baselines, while substantially reducing the need for learning-rate tuning.
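As background for the step-size rule this abstract builds on, the classical deterministic Polyak step size for plain gradient descent sets the step to $(f(x_t) - f^*) / \|\nabla f(x_t)\|^2$. The sketch below implements only that classical rule on a toy quadratic; it is not the SAM-tailored scheduler derived in the paper, and the names `polyak_gradient_descent`, `quadratic`, and `quadratic_grad` are hypothetical. It assumes the optimal value $f^*$ is known.

```python
def polyak_gradient_descent(f, grad, x0, f_star, n_steps):
    """Gradient descent with the classical Polyak step (f(x) - f_star) / ||grad f(x)||^2."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # stationary point reached; the step size is undefined here
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Toy strongly convex objective with minimum value 0 at the origin."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def quadratic_grad(x):
    return [x[0], 10.0 * x[1]]


if __name__ == "__main__":
    x_final = polyak_gradient_descent(quadratic, quadratic_grad, [1.0, 1.0], 0.0, 2000)
    print(quadratic(x_final))
```

No learning rate is supplied by the user here; the step is computed from the current function value and gradient, which is the property that motivates Polyak-type schedulers.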
Flow-Transformed Implicit Processes for Function-Space Variational Inference
oai:arXiv.org:2606.01954v1
arXiv:2606.01954v1 Announce Type: cross
Abstract: Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box $\alpha$ objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.
Query-Limited Community Recovery in Stochastic Block Models
oai:arXiv.org:2606.02055v1
arXiv:2606.02055v1 Announce Type: cross
Abstract: We study exact community recovery in the two-community stochastic block model on $n$ vertices under limited and noisy access to network data. The learner may query a noisy neighborhood oracle that reveals each true neighbor of a queried vertex independently with fixed probability and never returns non-neighbors, subject to a finite query budget. We consider both oracle-only access and a combined model where the learner also observes a single subsampled copy of the underlying graph. For oracle-only access, balanced uniform querying gives a sharp non-adaptive benchmark: when each vertex is queried the same integer number of times, the observations reduce to an SBM with attenuated edge probabilities and the Abbe-Bandeira-Hall exact-recovery threshold applies. We show that this benchmark is not adaptively optimal: a two-stage adaptive strategy succeeds with $n+o(n)$ queries in a regime where balanced uniform querying requires $m n$ queries for some $m>1$. With an additional subsampled graph, we prove a sublinear-query adaptivity gap: balanced data-independent uniform querying with a sublinear budget does not improve over the subsampled graph alone, whereas adaptive querying can target a small set of uncertain vertices and achieve exact recovery. Thus adaptive data acquisition can strictly improve the information-theoretic limits of exact recovery.
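The noisy neighborhood oracle described above can be sketched in a few lines. This is a minimal illustration of the access model only, not of the paper's querying strategies; the toy two-community SBM sampler, the function names `sample_sbm` and `noisy_neighborhood_query`, and all parameter values are assumptions made for the example.

```python
import random


def sample_sbm(n, p_in, p_out, rng):
    """Toy two-community SBM: vertices 0..n//2-1 form one community, the rest the other."""
    labels = [0 if v < n // 2 else 1 for v in range(n)]
    adjacency = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return labels, adjacency


def noisy_neighborhood_query(adjacency, vertex, reveal_prob, rng):
    """Reveal each true neighbor independently with probability reveal_prob.

    Non-neighbors are never returned, matching the oracle in the abstract.
    """
    return {u for u in sorted(adjacency[vertex]) if rng.random() < reveal_prob}


if __name__ == "__main__":
    rng = random.Random(0)
    labels, adjacency = sample_sbm(20, p_in=0.6, p_out=0.1, rng=rng)
    print(noisy_neighborhood_query(adjacency, 0, reveal_prob=0.5, rng=rng))
```

Each call to `noisy_neighborhood_query` would count as one unit of the query budget; repeated queries of the same vertex use fresh randomness.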
Decision-calibrated prediction sets for robust power system operations
oai:arXiv.org:2606.02081v1
arXiv:2606.02081v1 Announce Type: cross
Abstract: Robust optimization offers a tractable approach to balance operating costs and reliability in power systems dominated by weather-dependent renewable uncertainty, but its performance depends critically on the uncertainty set. Standard data-driven approaches often calibrate uncertainty sets to attain predictive coverage, which can produce unnecessarily large sets and costly operating decisions. In contrast, we introduce decision-calibrated prediction sets and embed them as uncertainty sets in robust optimization problems; these are conditional multivariate prediction sets where calibration is defined in terms of the reliability of downstream decisions, rather than in terms of the coverage. First, we learn these conditional prediction sets as sub-level sets of norm-based score functions represented by partially input-convex neural networks, capturing contextual information and multivariate dependence while preserving convexity and tractability in downstream robust formulations. Second, inspired by conformal risk control, we calibrate a score-threshold parameter that sets the volume of the uncertainty set, thereby controlling the expected violations of downstream operational constraints. We apply our approach to 15-minute-ahead reserve scheduling with network-constrained deliverability, which we formulate as a robust DC optimal power flow problem with affine recourse. Numerical experiments show that decision-calibrated sets attain prescribed constraint-satisfaction targets within about three percentage points, whereas standard coverage-based calibration systematically exceeds these targets by more than eleven percentage points, leading to larger sets and higher operating costs.
When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes
oai:arXiv.org:2606.02106v1
arXiv:2606.02106v1 Announce Type: cross
Abstract: We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage with a tabular foundation model for in-context inference, applied identically across modalities once data is mapped to fixed vector representations. We evaluate it on 95 datasets spanning seven signal modalities -- vision, audio, speech, text, molecular, time-series, and tabular. The main methodological contribution is to fix the comparison object: throughout the paper, performance is judged against the strongest lightweight tuned baseline on the same frozen features, while oracle selection, deployed selection, and specialized fine-tuning are reported separately.
The pipeline is broadly competitive with strong lightweight tuned baselines on the same frozen features. It does not match the very best specialized models or heavily tuned pipelines on every task, but it stays close, and it runs much faster -- typically 4 to 200 times faster than full backbone fine-tuning, often at comparable quality.
We describe how to deploy the pipeline in practice: when to apply ETF preprocessing, how to stop its training without a validation split, how to set up the in-context classifier, and how to calibrate the resulting probabilities. The calibration step is non-cosmetic: TabICL produces well-calibrated probabilities by construction, ETF preprocessing initially disrupts that calibration, and the post-hoc rescaling restores it -- yielding a per-prediction confidence signal that practitioners can use as a trust threshold for confidence-gated deployment. We also report where the pipeline should not be expected to help, and how to identify those cases in advance.
Network Learning with Semi-relaxed Gromov-Wasserstein
oai:arXiv.org:2606.02223v1
arXiv:2606.02223v1 Announce Type: cross
Abstract: Estimating the generative mechanism of large-scale networks is a fundamental challenge in statistical machine learning. It requires the identification of the latent connectivity structure, which is in general an NP-hard combinatorial problem due to the absence of canonical node labels. We address this challenge by allowing for probabilistic couplings, thereby relaxing the assignment problem. Our estimation framework can be formulated as a semi-relaxed Gromov-Wasserstein objective and provides a low-dimensional representation of the generative structure. We solve this via a block-coordinate conditional gradient algorithm. Despite the relaxation, the resulting solution is typically deterministic: in fact, we show that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate $O(1/n)$, where $n$ is the number of nodes. This allows for tractable recovery of the underlying model and enables rigorous statistical analysis: we establish consistency and minimax-optimal convergence rates for both stochastic block models and Hölder-smooth graphons. Our implementation scales efficiently with $n$, as demonstrated on both synthetic and real-world datasets.
When Do Treatment Changes Identify Causal Effects?
oai:arXiv.org:2606.02234v1
arXiv:2606.02234v1 Announce Type: cross
Abstract: This paper clarifies the identifying assumptions underlying causal inference based on treatment changes rather than treatment levels, and their relationship to conventional identification strategies. We characterize two distinct structural models, with non-nested identifying assumptions, under which treatment-change identification is valid conditional on observed covariates. We demonstrate that the identifying assumptions relying on treatment changes are generally not nested with those of methods relying on treatment levels, such as selection-on-observables strategies that control for past outcomes, treatments, and covariates, or difference-in-differences approaches that difference outcomes rather than treatments over time. We show, however, that under a random-walk restriction on the treatment process, conditioning on treatment changes is equivalent to conditioning on treatment levels given lagged treatment. This and other equivalence results motivate overidentification tests by jointly considering methods based on treatment levels and changes. Beyond these tests, the non-nesting results carry a structural double robustness implication: an estimator that differences both the outcome and the treatment over time, such as two-way fixed effects regression, remains consistent if either the treatment-change assumption or the parallel-trends assumption holds, without requiring both simultaneously. We characterize the causal models consistent with each method, investigate finite-sample behavior in a simulation study, and present an empirical application to cigarette demand.
Exponential thermalisation of viscous fluids on negatively curved manifolds
oai:arXiv.org:2606.02286v1
arXiv:2606.02286v1 Announce Type: cross
Abstract: The deterministic incompressible Navier-Stokes equations are physically incomplete: any viscous fluid at finite temperature must exhibit thermal fluctuations whose form is dictated by the fluctuation-dissipation relation. We formulate the stochastic Navier-Stokes equations with the kinematically selected deformation Laplacian on compact Riemannian manifolds with strictly negative Ricci curvature. The fluctuation-dissipation relation, derived from a topological (Poincar\'e lemma) argument, uniquely determines the noise from the viscous operator. For the spectrally truncated system, we prove that the unique stationary distribution is the Gibbs measure (Gaussian in the mode amplitudes, because the nonlinear convective terms preserve energy), and that convergence to equilibrium is exponentially fast with rate at least $2\nu\lambda_\Def$, where $\nu$ is the kinematic viscosity and $\lambda_\Def$ is the spectral gap of the deformation Laplacian. The spectral gap satisfies $\lambda_\Def \geq \kappa^2$ when $\Ric \leq -\kappa^2 g$, and is independent of the volume of the domain. On flat space, the analogous thermalisation rate vanishes in the infinite-volume limit. The equilibrium velocity-velocity correlation function decays exponentially in geodesic distance, in contrast to the algebraic decay on flat space. These results provide a rigorous statistical-mechanical foundation for viscous fluids on negatively curved manifolds and illustrate how the geometry of the domain controls not only the deterministic dynamics but also the approach to thermal equilibrium.
Transitivity in Inhomogeneous Random Tournaments
oai:arXiv.org:2606.02340v1
arXiv:2606.02340v1 Announce Type: cross
Abstract: Paired-comparison data are naturally represented by tournaments, where transitivity corresponds to the existence of a global ranking consistent with all pairwise outcomes. Accordingly, the classical Kendall-Smith coefficient of consistency measures deviations from transitivity in a tournament by counting the number of circular triads (directed $3$-cycles). In this paper, we characterize the fluctuations of the number of circular triads in inhomogeneous random tournaments and develop an inferential framework for the consistency coefficient. Specifically, we consider the $W$-random tournament model, where the comparison probabilities are determined by a tournamenton $W$, the analogue of a graphon in the tournament setting. We show that, for a $W$-random tournament on $n$ vertices, the number of circular triads exhibits three different fluctuation regimes, determined by suitable notions of regularity and uniformity of $W$. We further develop a novel tournamenton multiplier bootstrap that consistently approximates the limiting distribution of the circular-triad count in the relevant asymptotic regime. Combining this with procedures for testing regularity and uniformity, we design an algorithm for constructing confidence intervals for the consistency coefficient that is asymptotically valid for all tournamentons. We also obtain structural characterizations of tournamentons for which the limiting distribution of the number of circular triads exhibits specific degeneracies. These results can also be viewed through the lens of tournament quasirandomness and may be of independent interest.
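The statistic at the center of this abstract, the number of circular triads, can be computed directly or from the score sequence using the classical identity $d = \binom{n}{3} - \sum_i \binom{s_i}{2}$, where $s_i$ is the number of opponents beaten by vertex $i$. The sketch below shows both counts together with the Kendall-Smith coefficient of consistency in its standard form. The boolean matrix encoding `beats[i][j]` and the function names are assumptions made for illustration; the inferential procedures of the paper are not reproduced.

```python
from itertools import combinations
from math import comb


def circular_triads_bruteforce(beats):
    """Count directed 3-cycles by checking every triple; beats[i][j] is True if i beats j."""
    count = 0
    for a, b, c in combinations(range(len(beats)), 3):
        forward = beats[a][b] and beats[b][c] and beats[c][a]
        backward = beats[a][c] and beats[c][b] and beats[b][a]
        if forward or backward:
            count += 1
    return count


def circular_triads_from_scores(beats):
    """Same count via the score identity: C(n, 3) minus the number of transitive triads."""
    n = len(beats)
    scores = [sum(row) for row in beats]  # number of wins per vertex
    return comb(n, 3) - sum(comb(s, 2) for s in scores)


def kendall_smith_coefficient(beats):
    """Coefficient of consistency: 1 for a transitive tournament, 0 at the maximal triad count."""
    n = len(beats)
    d = circular_triads_from_scores(beats)
    max_times_24 = n ** 3 - n if n % 2 == 1 else n ** 3 - 4 * n
    return 1.0 - 24.0 * d / max_times_24


if __name__ == "__main__":
    cycle = [[False, True, False], [False, False, True], [True, False, False]]
    print(circular_triads_from_scores(cycle), kendall_smith_coefficient(cycle))
```

The score-based count runs in time quadratic in $n$, while the brute-force count is cubic and serves only as a cross-check.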
Local Preferential Bayesian Optimization
oai:arXiv.org:2606.02351v1
arXiv:2606.02351v1 Announce Type: cross
Abstract: Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning from pairwise human feedback, yet existing methods struggle to efficiently optimize beyond low- and medium-dimensional problems due to their global search approaches. We address this limitation by developing a family of local PBO methods that transfer key ideas from high-dimensional BO to the preferential setting. In particular, we introduce local PBO methods which adapt trust-region and derivative-informed local search to pairwise preference feedback, where the latter exploits first- and second-order derivatives of the Laplace-approximated GP posterior. Our benchmark on GP sample paths, standard optimization benchmark functions, and policy-search tasks shows that local PBO methods are especially effective in high-dimensional and complex landscapes with steep optima. Compared with global preference-based baselines, they can substantially reduce cumulative regret, making them particularly useful for real-world preference-based optimization tasks such as policy search.
Minimax-Optimal Policy Regret in Partially Observable Markov Games
oai:arXiv.org:2606.02363v1
arXiv:2606.02363v1 Announce Type: cross
Abstract: We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
Attention Dynamics and Adaptive Decision Support in C5ISR: A Recurrence Quantification Analysis of Visual and Multimodal Attention Guidance Effects on Mission Performance
oai:arXiv.org:2606.02382v1
arXiv:2606.02382v1 Announce Type: cross
Abstract: Modern command, control, communications, computers, cyber, intelligence, surveillance, and reconnaissance (C5ISR) environments place substantial attentional demands on mission commanders. Failures in attention allocation in these high-risk settings can have severe operational consequences. This study investigates the efficacy of gaze-driven, attention-guided adaptive decision support tools, including visual-only and multimodal designs, in a high-fidelity simulated military command center. To characterize gaze and attentional dynamics during interaction with these tools, recurrence quantification analysis was applied to eye-tracking data. Stepwise regression using the Bayesian information criterion was then used to identify recurrence-based gaze metrics associated with performance. Results showed that the multimodal adaptive decision support tool was associated with significantly higher performance than the visual-only attention-guided tool. Average diagonal line length showed a negative linear association with performance, whereas entropy showed a positive linear association. Recurrence rate, determinism, and entropy also showed nonlinear quadratic relationships with performance. In particular, recurrence rate and determinism followed an inverted-U pattern consistent with the Yerkes-Dodson law. These findings suggest that effective performance in dynamic C5ISR contexts depends on a balance between structured and flexible visual scanning, and that recurrence-based gaze metrics can help characterize attentional dynamics during interaction with adaptive decision support systems.
Speculative Sampling For Faster Molecular Dynamics
oai:arXiv.org:2606.02455v1
arXiv:2606.02455v1 Announce Type: cross
Abstract: Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase single-system throughput with concurrent compute. To address this, we introduce Langevin Speculative Dynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MD without adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from its target model distribution.
Correlated uniform attachment trees
oai:arXiv.org:2606.02472v1
arXiv:2606.02472v1 Announce Type: cross
Abstract: We introduce and study a new model of correlated uniform attachment (UA) trees, where correlation is sprinkled throughout the time evolution of the process. In this model, two UA trees are grown in parallel, and at each time step a new node is added to each tree, with an edge between it and a uniformly chosen existing vertex in the respective tree. The two choices of attachment are correlated: with probability $\alpha$, the edges attach to nodes with the same time label in both trees, and with probability $1-\alpha$, the choices are made independently. We study fundamental detection and estimation questions for this model, given two \emph{unlabeled} trees. In our main result, we construct a consistent estimator of the correlation parameter $\alpha$, as the size of the trees goes to infinity.
The construction of our statistic relies on two key ideas. First, we use Jordan centrality to identify subsets of vertices of each tree whose intersection has a sufficient number of common early vertices. The second idea is that, across multiple time scales, it is possible to approximately determine the labels of vertices that have attached to these early vertices, using the sizes of fringe subtrees. Our analysis includes novel quantitative bounds on the fraction of early vertices that remain central, which are of independent interest in the network archaeology literature.
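The generative model in this abstract can be simulated as follows. This is a sketch of one reading of the description: at each step the first tree picks a uniform parent, and with probability $\alpha$ the second tree reuses the same time label, otherwise it draws its own uniform parent independently. The function name `correlated_ua_trees` and the parent-array representation are assumptions; the estimator of $\alpha$ itself is not implemented, and the paper's inference problem observes the trees without labels, which this sketch retains for clarity.

```python
import random


def correlated_ua_trees(n, alpha, seed=0):
    """Grow two uniform attachment trees in parallel; returns two parent arrays.

    parents[t] is the time label of the vertex that vertex t attached to;
    vertex 0 is the root of both trees and has parent None.
    """
    rng = random.Random(seed)
    parents_a = [None]
    parents_b = [None]
    for t in range(1, n):
        parent_a = rng.randrange(t)  # uniform over existing vertices 0..t-1
        if rng.random() < alpha:
            parent_b = parent_a  # correlated step: same time label in both trees
        else:
            parent_b = rng.randrange(t)  # independent uniform choice
        parents_a.append(parent_a)
        parents_b.append(parent_b)
    return parents_a, parents_b


if __name__ == "__main__":
    a, b = correlated_ua_trees(10, alpha=0.5, seed=1)
    print(a)
    print(b)
```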
Robust and Efficient Estimation for a Discrete Distribution Using L2 Optimization
oai:arXiv.org:1606.04182v4
arXiv:1606.04182v4 Announce Type: replace
Abstract: This paper proposes a novel method to estimate the rate parameter of the Poisson distribution. The proposed method employs Cramér-von Mises type optimization, which has been commonly used in estimating parameters of continuous distributions. After obtaining the estimator through the proposed method, we rigorously investigate its desirable properties, such as its asymptotic distribution and robustness. Simulation studies demonstrate that the proposed method compares favorably with other well-established methods, including the maximum likelihood method.
New statistical methodology for second level global sensitivity analysis
oai:arXiv.org:1902.07030v2
arXiv:1902.07030v2 Announce Type: replace
Abstract: Global sensitivity analysis (GSA) of numerical simulators aims at studying the global impact of the input uncertainties on the output. To perform the GSA, statistical tools based on inputs/output dependence measures are commonly used. We focus here on dependence measures based on reproducing kernel Hilbert spaces: the Hilbert-Schmidt Independence Criterion denoted HSIC. Sometimes, the probability distributions modeling the uncertainty of inputs may be themselves uncertain and it is important to quantify the global impact of this uncertainty on GSA results. We call it here the second-level global sensitivity analysis (GSA2). However, GSA2, when performed with a double Monte Carlo loop, requires a large number of model evaluations which is intractable with CPU time expensive simulators. To cope with this limitation, we propose a new statistical methodology based on a single Monte Carlo loop with a limited calculation budget. Firstly, we build a unique sample of inputs from a well chosen probability distribution and the associated code outputs are computed. From this inputs/output sample, we perform GSA for various assumed probability distributions of inputs by using weighted HSIC measures estimators. Statistical properties of these weighted estimators are demonstrated. Finally, we define second-level HSIC-based measures between the probability distributions of inputs and GSA results, which constitute GSA2 indices. The efficiency of our GSA2 methodology is illustrated on an analytical example, thereby comparing several technical options. Finally, an application to a test case simulating a severe accidental scenario on nuclear reactor is provided.
A Unified Framework for Regularized Estimating Equations via Fixed-Point and Variational Inequality Problems
oai:arXiv.org:2110.11074v3
arXiv:2110.11074v3 Announce Type: replace
Abstract: Many statistics problems are formulated within an estimating equation framework instead of a minimization framework. However, the regularized estimating equations (REE) have been much less extensively studied than regularized minimization problems. In this paper, we study an improved regularized estimating equation formulation and explore its subsequent equivalences in terms of (1) fixed-point problem specified via the proximal operator of the corresponding regularizer, and (2) generalized variational inequality problems. Such equivalences hold under general conditions and accommodate nonconvex regularizers. Moreover, these equivalences open up new possibilities in theoretical analysis and computational algorithms when studying the REE.
Bayesian Mixed Multidimensional Scaling for Auditory Processing
oai:arXiv.org:2209.00102v4
arXiv:2209.00102v4 Announce Type: replace
Abstract: The human brain distinguishes speech sounds by mapping acoustic signals into a latent perceptual space. This space can be estimated via multidimensional scaling (MDS), preserving the similarity structure in lower dimensions. However, individual and group-level heterogeneity, especially between native and non-native listeners, remains poorly understood. Prior approaches often ignore such variability or cannot capture shared structure, limiting principled comparisons. Moreover, the literature often focuses on latent distances rather than the underlying features themselves. To address these issues, we develop a Bayesian mixed MDS method that accounts for both subject- and group-level heterogeneity, allows for the recovery of unique, identifiable latent features, facilitating their biological interpretability, while also determining the effective dimensionality of the latent space in an automated, data-adaptive manner. Simulations and an auditory neuroscience application demonstrate how these features reconstruct observed distances and vary with individual and language background, revealing novel insights.
An average-case sensitivity analysis for unmeasured confounding
oai:arXiv.org:2211.04697v5
arXiv:2211.04697v5 Announce Type: replace
Abstract: Sensitivity analysis for the unconfoundedness assumption is crucial in observational studies. For this purpose, the marginal sensitivity model gained popularity recently due to good interpretability and mathematical properties. However, most existing models only consider a worst-case parameter that bounds the logit difference between the observed and full data propensity scores, which may not fully capture the extent of unmeasured confounding. We propose a new sensitivity model that is parameterized by the second moment of the propensity score ratio, requiring only the average strength of unmeasured confounding to be bounded. By characterizing the associated sensitivity analysis as an optimization problem, we derive sharp closed-form bounds of the average potential outcomes under our model. We propose efficient one-step estimators for these bounds based on the corresponding efficient influence functions. Additionally, we apply multiplier bootstrap to construct simultaneous confidence bands to cover the sensitivity curve that consists of bounds at different values of the sensitivity parameters. Through a real-data study, we illustrate how this average-case sensitivity analysis can provide tighter bounds and facilitate calibration of the results using observed covariates.
Interventional Processes for Causal Uncertainty Quantification
oai:arXiv.org:2410.14483v3
arXiv:2410.14483v3 Announce Type: replace
Abstract: Reliable uncertainty quantification for causal effects is crucial in high-stakes applications, but remains challenging when the target is an entire function rather than a scalar estimand. In this work, we introduce a GP-based approach for uncertainty quantification of interventional functions. The central idea is to build on recent work representing interventional functions as an inner-product of observational functions in a reproducing kernel Hilbert space (RKHS), by constructing appropriate GP priors for such functions and inferring posteriors from observational data. Our approach yields closed-form posterior moments and tractable training and inference, while avoiding pathologies of previous GP prior constructions for RKHS functions. We further derive a practical procedure for posterior coverage calibration. Across synthetic benchmarks, causal Bayesian optimization tasks, and a large-scale real dataset, our method improves uncertainty quantification while remaining competitive in causal effect estimation.
Near-Optimal and Tractable Estimation under Shift-Invariance
oai:arXiv.org:2411.03383v3
arXiv:2411.03383v3 Announce Type: replace
Abstract: How hard is it to estimate a discrete-time signal $(x_{1}, ..., x_{n}) \in \mathbb{C}^n$ satisfying an unknown linear recurrence relation of order $s$ and observed in i.i.d. complex Gaussian noise? The class of all such signals is parametric but extremely rich: it contains all exponential polynomials over $\mathbb{C}$ with total degree $s$, including harmonic oscillations with $s$ arbitrary frequencies. Geometrically, this class corresponds to the projection onto $\mathbb{C}^{n}$ of the union of all shift-invariant subspaces of $\mathbb{C}^\mathbb{Z}$ of dimension $s$. We show that the statistical complexity of this class, as measured by the squared minimax radius of the $(1-\delta)$-confidence $\ell_2$-ball, is nearly the same as for the class of $s$-sparse signals, namely $O\left(s\log(en) + \log(\delta^{-1})\right) \cdot \log^2(es) \cdot \log(en/s).$ Moreover, the corresponding near-minimax estimator is tractable, and it can be used to build a test statistic with a near-minimax detection threshold in the associated detection problem. These statistical results rely upon a simple analytic observation: the interpretation of the Fourier coefficients of the Christoffel function of any shift-invariant subspace of $\mathbb{C}^\mathbb{Z}$ as a reproducing filter with the smallest possible spectrum, in all $\ell_p$-norms, $p \in [1,\infty]$, at once.
B-MASTER: Scalable Bayesian Multivariate Regression for Master Predictor Discovery in Colorectal Cancer Microbiome-Metabolite Profiles
oai:arXiv.org:2412.05998v4
arXiv:2412.05998v4 Announce Type: replace
Abstract: Motivation: The gut microbiome shapes cancer therapy response through its influence on host metabolism. While prior studies examine pairwise associations between individual genera and metabolites, there is limited methodology for identifying microbial genera that systematically regulate the overall metabolome. Scalable statistical tools are needed to uncover such system-level 'master predictors' in high-dimensional microbiome-metabolome data.
Results: We introduce B-MASTER, a scalable Bayesian multivariate regression framework combining L1 sparsity and L2 group shrinkage to identify essential cross-metabolite regulators. A Gibbs sampler enables near-linear computational scaling, supporting models with millions of parameters. The method is supported by theoretical guarantees, including posterior contraction and selection consistency. Analysis of colorectal cancer microbiome-metabolome data reveals key microbial genera that govern global and cancer-associated metabolite patterns, highlighting system-level regulatory structure.
Availability: The B-MASTER code, including demonstration scripts, is available at https://github.com/priyamdas2/B-MASTER. An archived snapshot of the code corresponding to this manuscript is available on Zenodo with DOI: 10.5281/zenodo.20484958.
Highest Posterior Density Intervals of Unimodal Distributions As Analogues to Profile Likelihood Ratio Confidence Intervals
oai:arXiv.org:2412.06528v5
arXiv:2412.06528v5 Announce Type: replace
Abstract: In Bayesian statistics, the highest posterior density (HPD) interval is often used to describe properties of a posterior distribution. As a method for estimating confidence intervals (CIs), the HPD has two main desirable properties. Firstly, it is the shortest interval to have a specified coverage probability. Secondly, every point inside the HPD interval has a density greater than every point outside the interval. However, the HPD interval is sometimes criticized for not being transformation invariant.
We make the case that under certain conditions the HPD interval is a natural analog to the frequentist profile likelihood ratio confidence interval (LRCI). Our main result is to derive a proof showing that under specified conditions, the HPD interval with respect to the density mode is transformation invariant for monotonic functions in a manner which is similar to a profile LRCI.
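The "shortest interval" property can be made concrete with a sample-based approximation: among all windows that contain the required fraction of sorted draws, keep the narrowest. This is a generic sketch for unimodal samples, not the construction used in the paper, and the function name `hpd_interval` is hypothetical.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples.

    A sample-based approximation to the HPD interval that is only
    meaningful for unimodal distributions. Hypothetical helper.
    """
    if not 0.0 < mass < 1.0:
        raise ValueError("mass must lie strictly between 0 and 1")
    xs = sorted(samples)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    k = min(n, max(1, math.ceil(mass * n)))  # points per candidate window
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Toy illustration of why invariance is not automatic: the shortest window
# for x and the shortest window for 1/x cover different draws.
draws = [1, 2, 3, 4, 10]
print(hpd_interval(draws, mass=0.5))                   # window on the x scale
print(hpd_interval([1 / x for x in draws], mass=0.5))  # window on the 1/x scale
```

On the toy draws, the window on the x scale covers the draws 1 to 3, while the window on the 1/x scale maps back to the draws 3 to 10. A naive shortest-interval rule therefore changes under a monotone transformation.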
Targeted Data Fusion for Region-Specific Survival Effects in the AMP HIV Prevention Trials
oai:arXiv.org:2501.18798v3
arXiv:2501.18798v3 Announce Type: replace
Abstract: The Antibody Mediated Prevention (AMP) trials opened a new scientific frontier by showing that passively administered monoclonal broadly neutralizing antibodies (bnAbs) could prevent HIV-1 acquisition. Conducted across multiple geographic regions, including the United States, Brazil, Peru, Switzerland, and sub-Saharan Africa, the AMP trials revealed substantial regional heterogeneity in treatment efficacy. These differences, together with privacy and regulatory limits on central data pooling, call for methods that borrow strength across regions without sharing individual-level data. To estimate region- and treatment-specific survival curves under distributional heterogeneity, we develop a federated learning approach that combines site-specific estimators via an L1-regularized criterion that downweights data sources not aligned with the target. We further extend the framework to a general class of causal contrasts, including the risk difference (RD), survival ratio (SR), and restricted mean survival time (RMST) difference. Through extensive simulations and an analysis of the AMP trials under different target populations, we show that the proposed approach provides privacy-preserving, region-adaptive inference with improved precision.
A Unified Framework for Multiple-Try Metropolis: Construction and Empirical Benchmarks
oai:arXiv.org:2503.11583v2
arXiv:2503.11583v2 Announce Type: replace
Abstract: The multiple-try Metropolis (MTM) algorithm uses a compound proposal with multiple candidate draws to improve local sampling efficiency. While several methodological works have continued to develop MTM and the multi-candidate mechanism that characterizes it, the literature lacks a unified comparison of these components. This paper presents a structured formulation of MTM within the involutive MCMC framework, providing a principled approach for deriving valid acceptance probabilities based on the proposal mechanism. Through a comprehensive simulation experiment, we evaluate the impact of MTM configurations on non-Gaussian and multimodal target distributions. Our results reveal that while weight functions are a focus of several methodological developments, their impact on stationary sampling efficiency is secondary to the configuration of the proposal distribution. Furthermore, we find that while increasing the number of candidates enhances per-iteration efficiency, the realized performance gains are offset by computational overhead introduced by multiple candidacy unless parallel computing is used. Our findings offer practical guidance for configuring an MTM algorithm for complex and non-Gaussian targets.
Causal inference in connected populations with contagion
oai:arXiv.org:2504.06108v3
arXiv:2504.06108v3 Announce Type: replace
Abstract: Causal inference in connected populations is complicated by contagion and other real-world processes inducing dependence among outcomes. We address a gap in the literature on causal inference under contagion: while there is a growing body of work on estimating causal effects under contagion, little is known about how contagion impacts causal effects and inference. We provide insight into how contagion impacts causal effects and inference based on closed-form expressions for causal effects under contagion. These closed-form expressions reveal that the effects of interventions, spillover, and contagion are intertwined even in the simplest possible settings, and that contagion can decrease or increase causal effects. We discuss statistical implications, including asymptotic bias of model-based estimators ignoring dependence among outcomes due to contagion, violations of neighborhood exposure assumptions underlying design-based estimators by unrestricted contagion, and possible remedies.
Assessing Racial Disparities in Healthcare Expenditures via Mediator Distribution Shifts
oai:arXiv.org:2504.21688v4
arXiv:2504.21688v4 Announce Type: replace
Abstract: Racial disparities in healthcare expenditures are well-documented, yet the underlying drivers remain complex. This study develops a framework to decompose such disparities through shifts in the distributions of mediating variables, rather than treating race itself as a manipulable exposure. We define disparities as differences in covariate-adjusted outcome distributions across racial groups, and decompose the total disparity into a component attributable to differences in mediator distributions, and a residual component that remains after equalizing those distributions. Using data from the Medical Expenditure Panel Survey (MEPS), we examine the extent to which expenditure disparities would persist or be reduced if mediators such as socioeconomic status (SES), insurance access, health behaviors, or health status were equalized across racial groups. To ensure valid inference, we derive asymptotically linear estimators based on influence-function techniques and flexible machine learning, including super learners and a two-part model designed for the zero-inflated, right-skewed nature of expenditure data.
Applying this framework to MEPS data from 2009 and 2016, we observed substantial disparities across all pairwise racial comparisons, with the largest gaps between non-Hispanic Whites and Hispanics in both years. Differences in SES and health status were the largest contributors to these disparities, with insurance access also playing a meaningful role, particularly for Hispanic populations, whereas health behaviors contributed minimally. Residual disparities persisted, especially in comparisons involving non-Hispanic Whites, suggesting the influence of unmeasured or structural factors.
Cellwise and Casewise Robust Covariance in High Dimensions
oai:arXiv.org:2505.19925v2
arXiv:2505.19925v2 Announce Type: replace
Abstract: The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation is only feasible up to at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
A longitudinal Bayesian framework for estimating causal dose-response relationships
oai:arXiv.org:2505.20893v4
arXiv:2505.20893v4 Announce Type: replace
Abstract: Existing causal methods for time-varying exposure and time-varying confounding focus on estimating the average causal effect of a time-varying binary treatment on an end-of-study outcome, offering limited tools for characterizing marginal causal dose-response relationships under continuous exposures. We propose a scalable, nonparametric Bayesian framework for estimating marginal longitudinal causal dose-response functions with repeated outcome measurements. Our approach targets the average potential outcome at any fixed dose level and accommodates time-varying confounding through the generalized propensity score. The proposed approach embeds a Dirichlet process specification within a generalized estimating equations structure, capturing temporal correlation while making minimal assumptions about the functional form of the continuous exposure. We apply the proposed methods to monthly metro ridership and COVID-19 case data from major international cities, identifying causal relationships and the dose-response patterns between higher ridership and increased case counts.
Position: Stop Chasing the C-index when Evaluating Survival Analysis Models
oai:arXiv.org:2506.02075v3
arXiv:2506.02075v3 Announce Type: replace
Abstract: The current state of evaluation in survival analysis is plagued by the persistent use of evaluation metrics in ways that are misaligned with the stated modeling objective. In addition, many such evaluations are based on censoring assumptions that are left implicit or unjustified. This means that the reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we critically examine evaluation practices in survival analysis and highlight how censoring makes evaluation fundamentally different from standard regression or classification. We place particular focus on concordance-based measures, such as the C-index, which we show are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions. Through controlled experiments, we show that violations of this alignment can lead to misleading model comparisons. We conclude by providing practical guidance on how to evaluate a survival model.
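For readers unfamiliar with the metric under discussion, the following minimal sketch computes a Harrell-style concordance index. It makes clear that the C-index only scores how well risk scores rank comparable pairs, and that censoring determines which pairs count. Pairs with tied times are ignored as a simplification, and the function name is hypothetical; this is not code from the paper.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored data.

    times:       observed time per subject (event or censoring time)
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: higher score means higher predicted risk (earlier event)

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. Pairs with tied times are skipped (simplification).
    """
    if not (len(times) == len(events) == len(risk_scores)):
        raise ValueError("inputs must have the same length")
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Any strictly increasing transformation of `risk_scores` leaves the result unchanged, so the index says nothing about calibration or about the accuracy of predicted survival times.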
Identifiability in epidemic models with prior immunity and under-reporting
oai:arXiv.org:2506.07825v2
arXiv:2506.07825v2 Announce Type: replace
Abstract: Identifiability is the property in mathematical modelling that determines if model parameters can be uniquely estimated from data. For infectious disease models, failure to ensure identifiability can lead to misleading parameter estimates and unreliable policy recommendations. We examine the identifiability of a modified SIR model that accounts for under-reporting and pre-existing immunity in the population. We provide a mathematical proof of the unidentifiability of jointly estimating three parameters: the fraction under-reporting, the proportion of the population with prior immunity, and the community transmission rate, when only reported case data are available. We then show, analytically and with a simulation study, that the identifiability of all three parameters is achieved if the reported incidence is complemented with sample survey data of prior immunity or prevalence during the outbreak. Our results show the limitations of parameter inference in partially observed epidemics and the importance of identifiability analysis when developing and applying models for public health decision making.
Consistent Infill Estimability of the Regression Slope Between Gaussian Random Fields Under Spatial Confounding
oai:arXiv.org:2506.09267v3
arXiv:2506.09267v3 Announce Type: replace
Abstract: The problem of estimating the slope parameter in regression between two spatial processes under confounding by an unmeasured spatial process has received widespread attention in the recent statistical literature. Yet a fundamental question remains unresolved: when is this slope consistently estimable under spatial confounding? Existing insights are largely empirical or estimator-specific. We characterize conditions for consistent estimability of the regression slope between Gaussian random fields (GRFs), the common stochastic model for spatial processes, under spatial confounding. Under fixed-domain (infill) asymptotics, we give sufficient conditions for consistent estimability in terms of the smoothness or local behavior of the exposure and confounder processes. When estimability holds, we provide consistent estimators of the slope using local differencing (taking discrete differences or Laplacians of the processes of suitable order). Using functional analysis results on Paley-Wiener spaces, we then provide an easy-to-verify necessary condition for consistent estimability of the slope in terms of the relative spectral tail decays of the confounder and exposure. As a by-product, we establish a novel and general spectral condition on the equivalence of measures on the paths of multivariate GRFs with component fields of varying smoothnesses. We show that for many covariance classes like the Matérn, power-exponential, generalized Cauchy, and coregionalization families, the necessary and sufficient conditions become identical, thereby providing a sharp characterization of consistent estimability of the slope for these processes. The results are extended to multivariate slopes, to accommodate measurement error, to popular classes of non-stationary Gaussian random fields and some non-Gaussian random fields, and for irregular designs.
Exploiting Similarities in A/B Testing with Off-Policy Estimation
oai:arXiv.org:2506.10677v3
arXiv:2506.10677v3 Announce Type: replace
Abstract: We study A/B testing, the standard protocol for measuring the performance gain of a new decision system relative to a baseline. Traditional A/B testing treats both systems as black boxes, ignoring potential similarities between them. In practice, however, new and baseline systems are rarely radically different and often share significant structure, which can be captured by their propensities to make similar decisions. We show that in such cases, the commonly used difference-in-means estimator, though unbiased, is statistically suboptimal. Leveraging off-policy estimation, we introduce a family of A/B testing estimators that exploit the propensities of the tested systems to achieve improved concentration properties. This family is flexible enough to be tailored to practical decision-making. The resulting estimators are simple, robust to propensity misspecification, substantially more accurate when the tested systems exhibit similarities, and gracefully fall back to the difference-in-means estimator when such similarities are absent. Our theoretical analysis and empirical studies confirm their efficiency and practicality.
The fundamental problem of risk prediction for individuals: health AI, uncertainty, and personalized medicine
oai:arXiv.org:2506.17141v2
arXiv:2506.17141v2 Announce Type: replace
Abstract: Background and Objective: Clinical prediction models are commonly evaluated regarding performance for a population, although decisions are made for individuals. The classic view relates uncertainty in risk estimates for individuals to sample size (estimation uncertainty) while other sources are model uncertainty (variability in modeling choices) and applicability uncertainty (variability in measurement procedures and between populations). We aim to illustrate the uncertainty of prediction models in estimating individual risks with an ovarian cancer example. Methods: We used real and synthetic data for ovarian cancer diagnosis to train 59400 models with variations in estimation, model, and applicability uncertainty. We then used these models to estimate the probability of ovarian cancer in a fixed test set of 100 patients and evaluate the variability in individual estimates. Results: We show empirically that estimation uncertainty can be strongly dominated by model uncertainty and applicability uncertainty, even for models that perform well at the population level. Estimation uncertainty decreased considerably with increasing training sample size, whereas model and applicability uncertainty remained large. Conclusion: Individual risk estimates are far more uncertain than often assumed. Model uncertainty and applicability uncertainty usually remain invisible when prediction models or algorithms are based on a single study. Predictive algorithms should inform, not dictate, care and support personalization through clinician-patient interaction.
Simultaneous estimation of the effective reproduction number and the time series of daily infections: Application to Covid-19
oai:arXiv.org:2506.21027v3
arXiv:2506.21027v3 Announce Type: replace
Abstract: The time-varying effective reproduction number is an important parameter for communication and policy decisions during an epidemic. In this paper, we present new statistical methods for estimating the reproduction number based on the popular model of Cori et al. (2013), which defines the effective reproduction number based on self-exciting dynamics of new infections. Such a model is conceptually simple and less susceptible to misspecifications than more complicated multi-compartment models. However, statistical inference is challenging, and the previous literature has relied on proxy data and/or a two-step approach in which the number of infections is first estimated. In contrast, we present a coherent Bayesian method that approximates the joint posterior of daily new infections and reproduction numbers using a novel Markov chain Monte Carlo (MCMC) algorithm. Comparing our method to the state-of-the-art three-step estimation procedure of Huisman et al. (2022), both using daily confirmed cases from Switzerland in the Covid-19 epidemic and simulated data, we find that our method is more accurate in terms of point estimates and uncertainty quantification, especially near the beginning and end of an observation period.
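For orientation, the self-exciting dynamics referred to in this abstract are usually written as a renewal equation. The notation below is an assumption chosen for illustration: $I_t$ is the number of new infections on day $t$, $R_t$ the effective reproduction number, and $w_s$ the generation-interval weights. The paper's own notation and observation model may differ.

```latex
I_t \mid I_0, \dots, I_{t-1} \sim \mathrm{Poisson}\!\left( R_t \sum_{s=1}^{t} w_s \, I_{t-s} \right),
\qquad \sum_{s \ge 1} w_s = 1 .
```

Under this form, $R_t$ is the ratio of expected new infections on day $t$ to the weighted sum of past infections. Estimating it therefore requires the infection series $I_t$, which is only observed indirectly through reported cases.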
Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution
oai:arXiv.org:2506.21278v3
arXiv:2506.21278v3 Announce Type: replace
Abstract: We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family has heavy-tailed global behavior and admits an exact differentiable reparameterization by applying a Möbius transformation to uniform samples on the sphere. We show that, in the high-concentration limit, spCauchy recovers the local tangent-space geometry of the von Mises-Fisher (vMF) distribution under an explicit concentration parameter mapping, while avoiding the high-order Bessel-function evaluations required by vMF implementations. For training, the Kullback-Leibler divergence to a uniform spherical prior admits rapidly convergent series, stable quadrature, and high-concentration asymptotic forms. We further establish monotonicity of the concentration-dependent KL core and derive analytic brackets with closed-form surrogates and error control, supporting stable approximation in extreme regimes. Stress-test benchmarks show that the resulting latent-layer objective remains stable and faster to evaluate than vMF baselines on CPU and GPU. Experiments on image and molecular sequence data demonstrate that spCauchy-VAEs provide a robust and scalable alternative for generative modeling with hyperspherical latent representations.
Covariance scanning for adaptively optimal change point detection in high-dimensional linear models
oai:arXiv.org:2507.02552v4
arXiv:2507.02552v4 Announce Type: replace
Abstract: This paper investigates the detection and estimation of a single change in high-dimensional linear models. We derive minimax lower bounds for the detection boundary and the estimation rate, which uncover a phase transition governed by the sparsity of the covariance-weighted differential parameter. This form of "inherent sparsity" captures a delicate interplay between the covariance structure of the regressors and the change in regression coefficients on the detectability of a change point. Complementing the lower bounds, we introduce two covariance scanning-based methods, McScan and QcScan, which achieve minimax optimal performance (up to possible logarithmic factors) in the sparse and the dense regimes, respectively. In particular, QcScan is the first method shown to achieve consistency in the dense regime and further, we devise a combined procedure which is adaptively minimax optimal across sparse and dense regimes without the knowledge of the sparsity. Computationally, covariance scanning-based methods avoid costly computation of Lasso-type estimators and attain worst-case computation complexity that is linear in the dimension and sample size. Additionally, we consider the post-detection estimation of the differential parameter and the refinement of the change point estimator. Simulation studies support the theoretical findings and demonstrate the computational and statistical efficiency of the proposed covariance scanning methods.
Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset
oai:arXiv.org:2507.07339v2
arXiv:2507.07339v2 Announce Type: replace
Abstract: Decisions about managing patients on the heart transplant waitlist are currently made by committees of doctors who consider multiple factors, but the process remains largely ad-hoc. With the growing volume of longitudinal patient, donor, and organ data collected by the United Network for Organ Sharing (UNOS) since 2018, there is increasing interest in analytical approaches to support clinical decision-making at the time of organ availability. In this study, we benchmark machine learning models that leverage longitudinal waitlist history data for time-dependent, time-to-event modeling of waitlist mortality. We train on 23,807 patient records with 77 variables and evaluate both survival prediction and discrimination at a 1-year horizon. Our best model achieves a C-Index of 0.94 and AUROC of 0.89, significantly outperforming previous models. Key predictors align with known risk factors while also revealing novel associations. Our findings can support urgency assessment and policy refinement in heart transplant decision making.
Exact conditional goodness-of-fit tests for the mixed membership stochastic block model
oai:arXiv.org:2507.14464v2
arXiv:2507.14464v2 Announce Type: replace
Abstract: We propose exact conditional goodness-of-fit tests for directed mixed membership stochastic block models. Given dyad-level sender and receiver roles, the block-pair edge totals are sufficient for the block probability matrix; conditioning on these totals gives a nuisance-free uniform law on a finite fiber. This yields finite-sample randomization tests for residual sender and receiver heterogeneity, reciprocity, and directed transitive closure. The procedure uses an independent fiber sampler, Monte Carlo rank \(p\)-values, and can be applied after drawing latent block-pair assignments from the posterior distribution. Simulations and the Sampson monastery network show that the tests are calibrated under the null and diagnostically useful for directed model misspecification.
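The Monte Carlo rank \(p\)-value mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual convention p = (1 + #{null statistics >= observed}) / (B + 1), which the abstract does not spell out. The function name `monte_carlo_rank_p_value` is hypothetical, and the Gaussian draws are only a placeholder for statistics computed on independent draws from the fiber.

```python
import random


def monte_carlo_rank_p_value(observed_stat, null_stats):
    """Rank p-value (1 + #{null >= observed}) / (B + 1) for a one-sided test."""
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (1 + exceed) / (len(null_stats) + 1)


# Placeholder null draws (assumption): in the paper's setting these would be
# test statistics evaluated on B independent samples from the fiber.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) for _ in range(999)]

p_extreme = monte_carlo_rank_p_value(10.0, null_stats)  # far in the upper tail
p_typical = monte_carlo_rank_p_value(0.0, null_stats)   # near the null centre
```

Adding one to both numerator and denominator counts the observed statistic as one of the B + 1 exchangeable draws, so the p-value can never be exactly zero.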
Signal Detection under Composite Hypotheses with Identical Distributions for Signals and for Noises
oai:arXiv.org:2507.21692v2
arXiv:2507.21692v2 Announce Type: replace
Abstract: In this paper, we consider the problem of detecting signals in multiple, sequentially observed data streams, where the distribution of each stream lies in one of two common composite spaces, depending on whether it is a signal or a noise. For this problem, we study a practical yet underexplored setting where it is a priori known that all signals have an identical distribution and so do all noises. Compared to the general setting where local distributions are free to take any values, this structure facilitates faster decision-making thanks to a smaller joint distribution space. However, it introduces additional challenges to the analysis of the problem and the design of tests, since the local distributions are now coupled. In this paper, we first establish a universal lower bound on the minimum expected sample size, which characterizes the essential difficulty of the problem and involves constants that are neither the minimum Kullback-Leibler divergences from the signal/noise distribution to the noise/signal distribution space, which appear in the lower bound for the general setting, nor the Kullback-Leibler divergences between the signal distribution and the noise distribution. Besides, we propose a test that controls the two types of familywise error rates below arbitrary levels, and achieves the minimum expected sample size asymptotically as the levels go to zero. Numerical studies are presented to compare with the state-of-the-art test for the general setting and demonstrate robustness against model misspecification.
A New Class of Asymptotically Distribution-Free Smooth Tests
oai:arXiv.org:2508.01973v4
arXiv:2508.01973v4 Announce Type: replace
Abstract: This article demonstrates how recent developments in the theory of empirical processes allow us to construct a new family of asymptotically distribution-free smooth tests. Their distribution-free property is preserved even when the parameters are estimated, model selection is performed, and the sample size is only moderately large. A computationally efficient alternative to the classical parametric bootstrap is also discussed.
Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation
oai:arXiv.org:2509.03456v2
arXiv:2509.03456v2 Announce Type: replace
Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as the action space grows. We show that estimator-aware policy parametrization can mitigate, but not fully resolve, optimization challenges. Building on this, we explore simpler weighted log-likelihood objectives and demonstrate that they enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.
Geometry-preserving and interpretable dimension reduction for compositional data
oai:arXiv.org:2509.05563v2
arXiv:2509.05563v2 Announce Type: replace
Abstract: High-dimensional compositional data pose unique statistical challenges due to the simplex constraint and excess zeros. While dimension reduction is indispensable for analyzing such data, conventional approaches often rely on log-ratio transformations that compromise interpretability and distort the data through ad hoc zero replacements. To address these issues, we introduce a geometry-preserving framework for dimension reduction of compositional data, mapping high-dimensional compositions directly to a lower-dimensional simplex. This framework is interpretable as a softened amalgamation of compositions and enables dual visualization -- showing both projected data and how variables contribute to reduced components -- for at-a-glance interpretation. Within this geometry, we define a new sufficient dimension reduction (SDR) approach for compositional predictors, whose identifiable object, termed the central compositional subspace, differs from the classical central subspace in Euclidean SDR. For estimation, we propose a kernel-based method that yields sparse solutions and comes with an intrinsic predictive model for direct downstream analyses. We prove consistency through a new subspace-comparison argument that allows the estimated and target subspaces to have different dimensions. Applications to real microbiome datasets demonstrate that our approach provides a powerful graphical exploration tool for uncovering meaningful biological patterns in high-dimensional compositional data.
Adaptive clinical trial design with delayed treatment effects using elicited prior distributions
oai:arXiv.org:2509.07602v2
arXiv:2509.07602v2 Announce Type: replace
Abstract: Clinical trials with time-to-event endpoints, such as overall survival (OS) or progression-free survival (PFS), are fundamental for evaluating new treatments, particularly in immuno-oncology. However, modern therapies, such as immunotherapies and targeted treatments, often exhibit delayed effects that challenge traditional trial designs. These delayed effects violate the proportional hazards assumption, which underpins standard statistical methods like the Cox proportional hazards model and the log-rank test. Careful planning is essential to ensure trials are appropriately designed to account for the timing and magnitude of these effects. Without this planning, interim analyses may lead to premature trial termination if the treatment effect is underestimated early in the study. We present an adaptive trial design framework that incorporates prior distributions, elicited from experts, for delayed treatment effects. By addressing the uncertainty surrounding delayed treatment effects, our approach enhances trial efficiency and robustness, minimizing the risk of premature termination and improving the detection of treatment benefits over time. We present an example illustrating how interim analyses, informed by prior distributions, can guide early stopping decisions. To facilitate the implementation of our framework, we have developed free, open-source software that enables researchers to integrate prior distributions into trial planning and decision-making. This software provides a flexible, accessible tool for designing trials that more accurately evaluate modern therapies through adaptive trial designs.
A Statistical Test for Comparing the Linkage and Admixture Model Based on Central Limit Theorems
oai:arXiv.org:2509.12734v4
arXiv:2509.12734v4 Announce Type: replace
Abstract: In the Admixture Model, the probability that an individual carries a certain allele at a specific marker depends on the allele frequencies in $K$ ancestral populations and the proportion of the individual's genome originating from these populations. The markers are assumed to be independent. The Linkage Model is a Hidden Markov Model (HMM) that extends the Admixture Model by incorporating linkage between neighboring loci.
We prove consistency and asymptotic normality of maximum likelihood estimators (MLEs) for the ancestry of individuals in the Linkage Model, complementing earlier results by \citep{pfaff2004information, pfaffelhuber2022central, HEINZEL2025} for the Admixture Model. These results are used to prove that a statistical test that allows for model selection between the Admixture Model and the Linkage Model is an asymptotic level-$\alpha$-test. Finally, we demonstrate the practical relevance of our results by applying the test to real-world data from the 1000 Genomes Project.
End-to-End Deep Learning for Predicting Metric Space-Valued Outputs
oai:arXiv.org:2509.23544v2
arXiv:2509.23544v2 Announce Type: replace
Abstract: Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric positive-definite matrices. These outputs are naturally modeled as elements of general metric spaces, where classical regression techniques that rely on vector space structure no longer apply. We introduce E2M (End-to-End Metric regression), a deep learning framework for predicting metric space-valued outputs. E2M performs prediction via weighted Fr\'echet means over training outputs, where the weights are learned by a neural network conditioned on the input. This construction provides a principled mechanism for geometry-aware prediction that avoids surrogate embeddings and restrictive parametric assumptions, while fully preserving the intrinsic geometry of the output space. We establish theoretical guarantees, including a universal approximation theorem that characterizes the expressive capacity of the model and a convergence analysis of the entropy-regularized training objective. Through extensive simulations involving probability distributions, networks, and symmetric positive-definite matrices, we show that E2M consistently achieves state-of-the-art performance, with its advantages becoming more pronounced at larger sample sizes. Applications to human mortality distributions and New York City taxi networks further demonstrate the flexibility and practical utility of this framework.
Domain-Shift-Aware Conformal Prediction for Large Language Models
oai:arXiv.org:2510.05566v2
arXiv:2510.05566v2 Announce Type: replace
Abstract: Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
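The idea of reweighting calibration samples can be illustrated with a generic weighted split-conformal threshold. This is a sketch under stated assumptions, not the DS-CP procedure: the abstract does not specify how proximity to the test prompt is turned into weights, so the weights here are supplied by the caller. The function name `weighted_conformal_threshold` and the convention of placing the test point's weight at +infinity are assumptions borrowed from standard weighted conformal prediction.

```python
import math


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Smallest calibration score whose cumulative weight share reaches 1 - alpha.

    The test point's own weight is treated as mass at +infinity, so the
    threshold is infinite when the calibration weights alone cannot reach
    the target level.
    """
    total = sum(weights) + test_weight
    target = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    return math.inf


# Hypothetical nonconformity scores for seven calibration prompts.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Equal weights recover ordinary split conformal prediction.
q_uniform = weighted_conformal_threshold(scores, [1.0] * 7, 1.0, alpha=0.25)

# Zero calibration weight (no calibration prompt resembles the test prompt)
# forces the uninformative threshold.
q_no_support = weighted_conformal_threshold(scores, [0.0] * 7, 1.0, alpha=0.25)
```

A prediction set would then keep every candidate answer whose nonconformity score is at most the returned threshold.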
A unifying Bayesian framework for adversarial robustness
oai:arXiv.org:2510.09288v2
arXiv:2510.09288v2 Announce Type: replace
Abstract: The vulnerability of machine learning models to adversarial attacks remains a critical societal security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. These deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several state-of-the-art defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.
Incorporating estimands into meta-analyses of clinical trials
oai:arXiv.org:2510.15762v3
arXiv:2510.15762v3 Announce Type: replace
Abstract: The estimand framework is increasingly established to pose research questions in confirmatory clinical trials. In evidence synthesis, the uptake of estimands has been modest, and the PICO (Population, Intervention, Comparator, Outcome) framework is more often applied. While PICOs and estimands have overlapping elements, the estimand framework explicitly considers different strategies for intercurrent events. We propose a pragmatic framework for the use of estimands in meta-analyses of clinical trials, highlighting the value of estimands to systematically identify and mitigate key sources of quantitative heterogeneity, and to enhance the applicability or external validity of pooled estimates. Focus is placed on the role of strategies for intercurrent events, within the specific context of meta-analyses for health technology assessment. We apply the estimand framework to a network meta-analysis of clinical trials, comparing the efficacy of semaglutide versus dulaglutide in type 2 diabetes. We explore the impact of a treatment policy strategy for treatment discontinuation or initiation of rescue medication versus a hypothetical strategy for the corresponding intercurrent events. The specification of different target estimands at the meta-analytical level allows us to be explicit about the source of heterogeneity, the intercurrent event strategy, driving any potential differences in results. We advocate for the integration of estimands into the planning of meta-analyses, while acknowledging that potential challenges exist in the absence of subject-level data. Estimands can complement PICOs to strengthen communication between stakeholders about what evidence syntheses seek to demonstrate, and to ensure that the generated evidence is maximally relevant to healthcare decision-makers.
Generalized Guarantees for Variational Inference in the Presence of Even and Elliptical Symmetry
oai:arXiv.org:2511.01064v3
arXiv:2511.01064v3 Announce Type: replace
Abstract: Variational inference (VI) approximates a target density $p$ by the best match $q$ in a family of tractable distributions. The best variational approximation is found by minimizing a divergence between distributions, $D(p||q)$, and several divergences have been proposed as objective functions for VI, with different choices leading to different approximations. We show that even when these divergences have different minimizers, the resulting approximations all abide by certain symmetry-matching principles. Specifically, our results hold for all $f$-divergences, a broad class which includes the reverse and forward Kullback-Leibler divergences and the $\alpha$-divergences. We show that in the presence of even symmetry, any stationary point of an $f$-divergence is guaranteed to recover the mean of $p$ and likewise, in the presence of elliptical symmetry, any stationary point is guaranteed to recover its correlation matrix. To obtain these guarantees we assume that $p$ and $q$ are unimodal, but notably we do not require them to be log-concave, light-tailed, or even everywhere-smooth. These guarantees generalize a previous result obtained for the reverse Kullback-Leibler divergence when $p$ is log-concave. They also extend to cases where the target density $p$ only exhibits symmetry along some but not all of its coordinates. These partial symmetries arise naturally in Bayesian hierarchical models, where the prior induces a challenging geometry but still possesses axes of symmetry.
Prototype Selection Using Topological Data Analysis
oai:arXiv.org:2511.04873v2
arXiv:2511.04873v2 Announce Type: replace
Abstract: Prototype selection methods compress a training set, but the existing taxonomy of condensation, edition, hybrid, competence-based, optimization-based, and clustering-based families does not include methods that operate on the multi-scale topological structure of the data. This paper introduces two different persistence-based prototype selector variants, Topological Prototype Selector (TPS) and Boundary-Conscious Topological Prototype Selector (BoundaryTPS). TPS uses two sequential Rips filtrations to retain boundary-relevant and interior-typical points. BoundaryTPS is a single-stage variant whose vertex-weighted filtration concentrates retention near the decision boundary. We evaluate both methods against seven classical baselines on fifteen real datasets and find that the topological methods occupy a different operating point in the prototype-selection design space than existing methods. BoundaryTPS achieves the lowest mean Friedman rank on $H_1$ persistence-diagram preservation and is significantly better than five of the seven baselines (Nemenyi, $\alpha = 0.05$). TPS ranks third on the same endpoint. Both methods are more stable under fold perturbation than any chained-decision selector tested, and both inherit the source set's class proportions without label-aware machinery. On aggregate G-Mean both methods are competitive but not leading, with rank-1 frequencies of $11.3\%$ (TPS) and $9.9\%$ (BoundaryTPS) across fold combinations. Empirically, both methods scale sub-quadratically in sample size.
Sequential Bootstrap for Out-of-Bag Error Estimation: A 100-Seed Replication Study and Variance-Structure Analysis
oai:arXiv.org:2511.18065v2
arXiv:2511.18065v2 Announce Type: replace
Abstract: Out-of-Bag (OOB) estimation is the standard internal diagnostic for bootstrap-aggregated tree ensembles. Under the classical multinomial bootstrap, the number of distinct training observations in each replicate, $U_b$, is itself random, but its contribution to OOB-based variability has rarely been isolated empirically. We use Sequential Bootstrap (SB) -- a resampling scheme that holds $U_b$ at a fixed target $k_n = \lfloor 0.632 n\rfloor$ -- as a controlled perturbation of the bootstrap mechanism, and ask whether stabilizing $U_b$ produces any measurable change in OOB-based diagnostics. We reproduce Breiman's five OOB experimental families on twelve synthetic and real datasets, but unlike the three-seed presentation common in this literature, we run 100 independent random seeds with 50 internal replications per seed, enabling formal paired statistical comparison (Wilcoxon signed-rank, paired-$t$, Pitman--Morgan variance test). We report three findings. First, OOB means are essentially insensitive to stabilization of $U_b$: of 57 (experiment, dataset, metric) cells under 100 seeds, only 6 reach $p<0.05$ on the paired mean comparison, and 4 of those 6 point in the opposite direction from what a 3-seed reading would suggest. Second, a narrow but reproducible effect survives at the variance level: SB reduces the cross-seed standard deviation of node-level classification diagnostics on real datasets while slightly increasing it on synthetic ones (permutation $p=0.026$); the Vehicle dataset exhibits a 21% cross-seed sd reduction (Pitman--Morgan $p=0.017$). Third, several directional claims that appear stable across three seeds flip sign under 100-seed replication, illustrating the cost of underpowered replication protocols. We therefore treat SB as a diagnostic tool for probing the distinct-sample-count term in the variance of OOB estimators, not as an alternative to the classical bootstrap.
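The resampling scheme described in this abstract can be sketched as follows: draw indices with replacement until the number of distinct observations reaches $k_n = \lfloor 0.632 n\rfloor$, then treat the untouched observations as out-of-bag. This is one simple formulation of a sequential bootstrap. The exact stopping convention used in the paper is not stated in the abstract, and the function name `sequential_bootstrap_indices` is hypothetical.

```python
import math
import random


def sequential_bootstrap_indices(n, rng, frac=0.632):
    """Draw with replacement until exactly floor(frac * n) distinct indices appear.

    Returns the full list of draws (with repeats) and the set of distinct indices.
    """
    k_n = math.floor(frac * n)
    drawn = []
    distinct = set()
    while len(distinct) < k_n:
        i = rng.randrange(n)
        drawn.append(i)
        distinct.add(i)
    return drawn, distinct


rng = random.Random(42)
n = 100
drawn, distinct = sequential_bootstrap_indices(n, rng)

# Out-of-bag observations: those never drawn in this replicate.
oob = sorted(set(range(n)) - distinct)
```

Unlike the multinomial bootstrap, where the number of distinct observations varies from replicate to replicate, every replicate here has the same number of distinct and out-of-bag observations, while the total number of draws is random.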
ThSQCA: Threshold-Sweep Qualitative Comparative Analysis in R
oai:arXiv.org:2601.11229v4
arXiv:2601.11229v4 Announce Type: replace
Abstract: Qualitative Comparative Analysis (QCA) requires researchers to choose calibration and dichotomization thresholds, and these choices can substantially affect truth tables, minimization, and resulting solution formulas. Despite this dependency, threshold sensitivity is often examined only in an ad hoc manner because repeated analyses are time-intensive and error-prone. We present ThSQCA, an R package that automates threshold-sweep analyses by treating thresholds as explicit analytical variables. It provides four sweep functions (otSweep, ctSweepS, ctSweepM, dtSweep) to explore outcome thresholds, single-condition thresholds, multi-condition threshold grids, and joint outcome-condition threshold spaces, respectively. ThSQCA integrates with the established CRAN package QCA for truth table construction and Boolean minimization, while returning structured S3 objects with consistent print/summary methods and optional detailed results. The package also supports automated Markdown report generation and configuration-chart output to facilitate reproducible documentation of cross-threshold results.
Estimating conditional Mann-Whitney effects using pseudo-observation-based regression
oai:arXiv.org:2601.15880v3
arXiv:2601.15880v3 Announce Type: replace
Abstract: The Mann-Whitney effect is an effect measure for the order of two sample-specific outcome variables. It has the interpretation of a probability and also a connection to the area under the ROC curve. In the literature it has been considered for both ordinal and right-censored time-to-event outcomes. For both cases, the present paper introduces a distribution-free regression model that relates the Mann-Whitney effect to a linear combination of covariates. To fit the model, we develop a pseudo-observation-based procedure yielding consistent and asymptotically normal coefficient estimates. In addition, we propose bootstrap-based hypothesis tests to infer the effects of the covariates on the Mann-Whitney effect. A simulation study on the small-sample behavior of the proposed method demonstrates that the novel hypothesis tests keep up with the z-test of a Cox regression model. The new methods are used to analyze progression-free survival in breast cancer patients enrolled for the randomized phase III SUCCESS-A trial.
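As a point of reference, the unconditional Mann-Whitney effect for fully observed outcomes can be estimated by counting ordered pairs. The sketch below assumes the common definition P(X < Y) + 0.5 P(X = Y), with ties counted as one half. It handles neither censoring nor covariates, so it is not the pseudo-observation regression proposed in the paper. The function name `mann_whitney_effect` is hypothetical.

```python
def mann_whitney_effect(xs, ys):
    """Estimate P(X < Y) + 0.5 * P(X = Y) from two fully observed samples."""
    total = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(xs) * len(ys))


# Toy ordinal outcomes for two groups (illustrative values only).
group_a = [1, 2, 3]
group_b = [2, 3, 4]

theta_ab = mann_whitney_effect(group_a, group_b)  # group_b tends to be larger
theta_aa = mann_whitney_effect(group_a, group_a)  # identical samples
```

A value of 0.5 indicates no tendency for either group to have larger outcomes. For a binary classifier's scores in the negative and positive classes, the same quantity is the empirical area under the ROC curve.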
Independent Component Discovery in Temporal Count Data
oai:arXiv.org:2601.21696v2
arXiv:2601.21696v2 Announce Type: replace
Abstract: Advances in data collection are producing growing volumes of temporal count observations, making adapted modeling increasingly necessary. In this work, we introduce a generative framework for independent component analysis of temporal count data, combining regime-adaptive dynamics with Poisson log-normal emissions. The model identifies disentangled components with regime-dependent contributions, enabling representation learning and perturbation analysis. Notably, we establish the identifiability of the model, supporting principled interpretation. To learn the parameters, we propose an efficient amortized variational inference procedure. Experiments on simulated data evaluate recovery of the mixing function and latent sources across diverse settings, while real-world applications to gut microbiome and climate datasets reveal co-variation patterns and regime shifts consistent with domain-specific knowledge.
Near-Optimal Private Tests for Simple and MLR Hypotheses
oai:arXiv.org:2601.21959v2
arXiv:2601.21959v2 Announce Type: replace
Abstract: We develop a near-optimal testing procedure under the framework of Gaussian differential privacy for simple as well as one- and two-sided tests under monotone likelihood ratio conditions. Our mechanism is based on a private mean estimator with data-driven clamping bounds, whose population risk matches the private minimax rate up to logarithmic factors. Using this estimator, we construct private test statistics that achieve the same asymptotic relative efficiency as the non-private, most powerful tests while maintaining conservative type I error control. In addition to our theoretical results, our numerical experiments show that our private tests outperform competing DP methods and offer comparable power to the non-private most powerful tests, even at moderately small sample sizes and privacy loss budgets.
Approximating $f$-Divergences with Rank Statistics
oai:arXiv.org:2601.22784v2
arXiv:2601.22784v2 Announce Type: replace
Abstract: We introduce a rank-statistic approximation of $f$-divergences that avoids explicit density-ratio estimation by working directly with the distribution of ranks. For a resolution parameter $K$, we map the mismatch between two univariate distributions $\mu$ and $\nu$ to a rank histogram on $\{ 0, \ldots, K\}$ and measure its deviation from uniformity via a discrete $f$-divergence, yielding a rank-statistic divergence estimator. We prove that the resulting estimator of the divergence is monotone in $K$, is always a lower bound of the true $f$-divergence, and we establish quantitative convergence rates for $K\to\infty$ under mild regularity of the quantile-domain density ratio. To handle high-dimensional data, we define the sliced rank-statistic $f$-divergence by averaging the univariate construction over random projections, and we provide convergence results for the sliced limit as well. We also derive finite-sample deviation bounds along with asymptotic normality results for the estimator. Finally, we empirically validate the approach by benchmarking against neural baselines and illustrating its use as a learning objective in generative modeling experiments.
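The rank-histogram construction can be sketched as follows. This is one plausible reading of the abstract, not the paper's estimator: the rank of a draw from $\nu$ among $K$ draws from $\mu$ takes values in $\{0, \ldots, K\}$ and is uniform when $\mu = \nu$, so the Kullback-Leibler divergence of the empirical rank histogram from the uniform law serves as the discrete $f$-divergence. The function name `rank_histogram_kl`, the choice of KL, and its direction are all assumptions.

```python
import math
import random


def rank_histogram_kl(sample_mu, sample_nu, K, n_trials, rng):
    """KL divergence of the empirical rank histogram from uniform on {0, ..., K}."""
    counts = [0] * (K + 1)
    for _ in range(n_trials):
        y = rng.choice(sample_nu)          # one draw from nu
        refs = rng.sample(sample_mu, K)    # K reference draws from mu
        rank = sum(1 for x in refs if x < y)
        counts[rank] += 1
    uniform = 1.0 / (K + 1)
    kl = 0.0
    for c in counts:
        if c > 0:
            p = c / n_trials
            kl += p * math.log(p / uniform)
    return kl


rng = random.Random(0)
mu = [rng.gauss(0.0, 1.0) for _ in range(2000)]
nu_same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
nu_shift = [rng.gauss(2.0, 1.0) for _ in range(2000)]

d_same = rank_histogram_kl(mu, nu_same, K=8, n_trials=5000, rng=rng)
d_shift = rank_histogram_kl(mu, nu_shift, K=8, n_trials=5000, rng=rng)
```

With matching distributions the histogram is close to flat and the value is near zero, up to a small positive finite-sample bias. With a mean shift the ranks pile up at the top bin and the value grows.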
Persuasive Privacy
oai:arXiv.org:2601.22945v2
arXiv:2601.22945v2 Announce Type: replace
Abstract: We propose a novel framework for measuring privacy from a Bayesian game-theoretic perspective. This framework enables the creation of new, purpose-driven privacy definitions that are rigorously justified, while also allowing for the assessment of existing privacy guarantees through game theory. We show that pure and probabilistic differential privacy are special cases of our framework, and provide new interpretations of the post-processing inequality in this setting. Further, we demonstrate that privacy guarantees can be established for deterministic algorithms, which are overlooked by current privacy standards.
Complexity bounds for Dirichlet process slice samplers
oai:arXiv.org:2602.00878v2
arXiv:2602.00878v2 Announce Type: replace
Abstract: Slice sampling is a standard Monte Carlo technique for Dirichlet process (DP)-based models, widely used in posterior simulation. However, formal assessments of the scalability of posterior slice samplers have remained largely unexplored, primarily because the computational cost of a slice-sampling iteration is random and potentially unbounded. In this work, we obtain high-probability bounds on the computational complexity of DP slice samplers. Our main results show that, uniformly across posterior cluster-growth regimes, the overhead induced by slice variables, relative to the number of clusters supported by the posterior, is $O_{\mathbb P}(\log n)$. As a consequence, even in worst-case configurations, superlinear blow-ups in per-iteration computational cost occur with vanishing probability. Our analysis applies broadly to DP-based models without any likelihood-specific assumptions, still providing complexity guarantees for posterior sampling on arbitrary datasets. These results establish a theoretical foundation for assessing the practical scalability of slice sampling in DP-based models.
Statistical Guarantees for Reasoning Probes on Looped Boolean Circuits
oai:arXiv.org:2602.03970v3
arXiv:2602.03970v3 Announce Type: replace
Abstract: We study the statistical behavior of reasoning probes in a stylized model of iterative computation inspired by neural algorithmic reasoning. The underlying computation is given by a looped Boolean circuit whose graph is a perfect $\nu$-ary tree ($\nu\ge 2$), with outputs recursively fed back as inputs across computation rounds. A probe observes a sampled subset of internal nodes and seeks to infer the latent operation at each node, represented as a probability distribution over a finite set of admissible Boolean gates. This partial observability induces a transductive generalization problem on a structured computation graph. We show that when the probe is parameterized by a graph convolutional network and queries $N$ nodes, the worst-case generalization error decays at the optimal rate $\mathcal{O}(\sqrt{\log(2/\delta)}/\sqrt{N})$ with probability at least $1-\delta$. Our analysis combines metric embedding techniques with tools from optimal transport. A key insight is that this rate is achievable independently of the size of the computation graph, enabled by a low-distortion one-dimensional snowflake embedding of the induced graph metric. These results highlight a geometric mechanism underlying statistical efficiency in probing structured, iterative computations.
Fixed Budget is No Harder Than Fixed Confidence in Best-Arm Identification up to Logarithmic Factors
oai:arXiv.org:2602.03972v3
arXiv:2602.03972v3 Announce Type: replace
Abstract: The best-arm identification (BAI) problem is one of the most fundamental problems in interactive machine learning, which has two flavors: the fixed-budget setting (FB) and the fixed-confidence setting (FC). For $K$-armed bandits with a unique best arm, the optimal sample complexities for both settings have been settled, and they match up to logarithmic factors. This prompts an interesting research question about the generic, potentially structured BAI problems: is FB harder than FC or the other way around? In this paper, we show that FB is no harder than FC up to logarithmic factors. We do this constructively: we propose a novel algorithm called FC2FB (fixed confidence to fixed budget), which is a meta algorithm that takes in an FC algorithm $\mathcal{A}$ and turns it into an FB algorithm. We prove that FC2FB enjoys a sample complexity that matches, up to logarithmic factors, the sample complexity of $\mathcal{A}$. This means that the optimal FC sample complexity is an upper bound on the optimal FB sample complexity up to logarithmic factors. Our result not only reveals a fundamental relationship between FB and FC, but also has a significant implication: FC2FB combined with existing state-of-the-art FC algorithms leads to improved sample complexity for a number of FB problems.
Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers
oai:arXiv.org:2602.05395v2
arXiv:2602.05395v2 Announce Type: replace
Abstract: A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.
Deep networks learn to parse uniform-depth context-free languages from local statistics
oai:arXiv.org:2602.06065v3
arXiv:2602.06065v3 Announce Type: replace
Abstract: Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
The Entropic Signature of Class Speciation in Diffusion Models
oai:arXiv.org:2602.09651v2
arXiv:2602.09651v2 Announce Type: replace
Abstract: Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along class-separating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. By restricting the entropy to semantic partitions, the entropy can furthermore resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.
How Accurately Can a Gaussian Approximate Stochastic Approximation Iterates?
oai:arXiv.org:2602.13906v2
arXiv:2602.13906v2 Announce Type: replace
Abstract: Stochastic approximation (SA) is a method for finding the root of an operator perturbed by noise. The focus of this paper is studying the distribution of SA iterates in finite time. In general, it is not possible to characterize the exact distribution, and therefore our goal is to find an approximation which can yield useful tail bounds. Inspired by the rich literature on the asymptotic normality of rescaled SA iterates, we approximate the pre-limit distributions by a sequence of Gaussians whose covariance is recursively defined. In particular, we establish explicit bounds on the Wasserstein-1 distance between the rescaled iterate at time $k$ and the aforementioned Gaussian for various choices of step-sizes. Since these covariances converge to the classical asymptotic limit, our analysis also provides a convergence rate for asymptotic normality as a by-product. As an immediate consequence of our bounds, we obtain tail bounds on the error of SA iterates at any time. Finally, we establish the sharpness of our rates by providing matching lower bounds and validate our findings through simulations.
We obtain the sharp rates by first studying the convergence rate of the discrete Ornstein-Uhlenbeck (O-U) process driven by general noise, whose stationary distribution is identical to the limiting Gaussian distribution of the rescaled SA iterates. We believe that this is of independent interest, given its connection to sampling literature. The analysis involves adapting Stein's method for Gaussian approximation to handle the matrix weighted sum of i.i.d. random variables. The desired finite-time bounds for SA are obtained by characterizing the error dynamics between the rescaled SA iterate and the discrete time O-U process and combining it with the convergence rate of the latter process.
Beyond Procedure: Substantive Fairness in Conformal Prediction
oai:arXiv.org:2602.16794v2
arXiv:2602.16794v2 Announce Type: replace
Abstract: Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness, that is, the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments show that label-clustered CP often provides a favorable balance between utility and substantive fairness, while reducing set-size disparities in line with our theory. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design fairer CP systems. Our code is available at https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness.
Asymptotic Theory and Sequential Testing for Adaptive Bandits
oai:arXiv.org:2602.22768v2
arXiv:2602.22768v2 Announce Type: replace
Abstract: Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and represent a central topic in statistical decision theory. Yet, conducting valid sequential testing under adaptive allocation remains challenging due to the lack of asymptotic theory under non-i.i.d. reward sequences and sublinear sample sizes for some arms. To address this open challenge, we propose an Urn Bandit (UNB) process to integrate the reinforcement mechanism of urn probabilistic models with MAB principles, ensuring almost sure concentration of allocation proportions on optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d. reward sequences with non-sub-Gaussian tails and pairwise cross-arm dependence. To overcome the limitations of existing methods that focus mainly on cumulative regret and therefore provide only algorithmic performance guarantees without supporting valid sequential testing, we develop an asymptotic theory for sequential test statistics under the proposed UNB process. The resulting framework enables a broad class of sequential inference procedures, such as A/B testing and policy evaluation. Simulation studies and real data analysis demonstrate that UNB maintains testing performance comparable to that of the equal randomization (ER) design while achieving improved reward accumulation relative to ER.
Asymptotic theory for multiple samples with flexible random membership
oai:arXiv.org:2602.24219v2
arXiv:2602.24219v2 Announce Type: replace
Abstract: A statistic can be a function of multiple samples. There is little existing work on asymptotic theory for such statistics when group membership is random. We propose a flexible framework that can handle both deterministic and random membership. We prove some asymptotic properties and apply the framework to the stratified sampling context.
Robust Wasserstein barycenter
oai:arXiv.org:2603.07563v2
arXiv:2603.07563v2 Announce Type: replace
Abstract: In this paper, we address a fundamental limitation of the classical Wasserstein barycenter -- its sensitivity to outliers. To overcome this issue, we propose the robust Wasserstein barycenter (RWB), based on the recent concept of robust optimal transport. Theoretical guarantees, including existence and consistency, are established for the proposed RWB. Through extensive numerical experiments on both simulated and real-world data -- including image processing and financial data analysis -- we demonstrate that the RWB exhibits superior robustness compared to the classical Wasserstein barycenter.
A Bayesian adaptive enrichment design using aggregate historical data to inform individualized treatment recommendations
oai:arXiv.org:2603.09919v2
arXiv:2603.09919v2 Announce Type: replace
Abstract: Adaptive enrichment trials aim to identify and recruit participants most likely to benefit from treatment based on evolving biomarker evidence, with the goal of informing individualized treatment recommendations. Bayesian methods are well suited to these designs because they allow external information to be incorporated in a principled manner. In practice, prior studies often provide only summary-level information, with subgroup-specific estimates unavailable due to design or privacy constraints. Existing dynamic borrowing approaches therefore rely on aggregate measures, such as the average treatment effect, and implicitly assume that historical information maps directly onto model parameters. In adaptive enrichment settings aimed at identifying individualized treatment effects, however, subgroup-specific treatment parameters are not identifiable when only marginal historical effects are available. To address this gap, we propose a Bayesian adaptive enrichment design that borrows information from external studies using a normalized power prior anchored on one or more summary measures, such as the average treatment effect. To our knowledge, no existing method addresses this gap. Interim analyses use posterior probabilities to guide early stopping for efficacy or futility, or to continue recruitment within promising biomarker-defined subgroups. Simulation studies evaluate operating characteristics across historical bias, sample size, and prior informativeness. Together with a motivating future trial in obstructive sleep apnea, the results show efficiency gains versus non-borrowing designs, including improved power, earlier stopping, and reduced expected sample size.
Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces
oai:arXiv.org:2603.14798v2
arXiv:2603.14798v2 Announce Type: replace
Abstract: We propose a machine-learning algorithm for Bayesian inverse problems in the function-space regime. Based on one-step generative transport, the method learns an amortized neural operator whose pushforward of a Gaussian source approximates the posterior distribution conditioned on each new observation. We show that white-noise sources are incompatible with the function-space limit, and therefore adopt a prior-aligned GRF as the source. We justify this choice through the Lipschitz regularity of the resulting one-step conditional posterior transport and numerical experiments on linear inverse and PDE-based inverse problems. The method is not distilled from MCMC: it is trained only with prior samples and simulated partial noisy observations. Once trained, it generates a $64\times64$ posterior sample in $\sim 10^{-3}$s, avoiding repeated forward-model evaluations in MCMC and repeated network evaluations in multistep generative samplers while matching key posterior summaries.
Multiview Graph Fusion with Covariates
oai:arXiv.org:2603.22215v2
arXiv:2603.22215v2 Announce Type: replace
Abstract: Joint modeling of multiview graphs with a common set of nodes between views and auxiliary predictors is an essential, yet less explored, area in statistical methodology. Traditional approaches often treat graphs in different views as independent or fail to adequately incorporate predictors, potentially missing complex dependencies within and across graph views and leading to reduced inferential accuracy. Motivated by such methodological shortcomings, we introduce an integrative Bayesian approach for joint learning of a multiview graph with vector-valued predictors. Our modeling framework assumes a common set of nodes for each graph view while allowing for diverse interconnections or edge weights between nodes across graph views, accommodating both binary and continuous valued edge weights. By adopting a hierarchical Bayesian modeling approach, our framework seamlessly integrates information from diverse graphs through carefully designed prior distributions on model parameters. This approach enables the estimation of crucial model parameters defining the relationship between these graph views and predictors, as well as offers predictive inference of the graph views. Crucially, the approach provides uncertainty quantification in all such inferences. Theoretical analysis establishes that the posterior predictive density for our model asymptotically converges to the true data-generating density, under mild assumptions on the true data-generating density and the growth of the number of graph nodes relative to the sample size. Simulation studies validate the inferential advantages of our approach over predictor-dependent tensor learning and independent learning of different graph views with predictors. We further illustrate model utility by analyzing functional connectivity graphs in neuroscience under cognitive control tasks, relating task-related brain connectivity with phenotypic measures.
Tackling the 6/49 Lottery and Debunking Common Myths with Probabilistic Methods and Combinatorial Designs
oai:arXiv.org:2603.24170v3
arXiv:2603.24170v3 Announce Type: replace
Abstract: In the end, the house always wins! This simple truth holds for all public games of chance. Nevertheless, for as long as lotteries have existed, people have tried everything to give luck a helping hand. This article compares objective scientific approaches to tackle the 6/49 lottery: probabilistic methods and combinatorial designs. The mathematical models developed herein can be modified and applied to other lotteries. The newly constructed (49, 6, 5) covering design is introduced, which meets the Schönheim bound. For lottery designs and for covering designs, a benchmark based on probabilistic methods is presented. It is demonstrated that common attempts to outwit the odds correspond to limitations of numbers to subsets, which disproportionately reduce the chances of winning.
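For orientation, the two quantities this abstract refers to can be computed directly. The sketch below uses the standard Schönheim formula, $\lceil \frac{v}{k} \lceil \frac{v-1}{k-1} \cdots \lceil \frac{v-t+1}{k-t+1} \rceil \cdots \rceil \rceil$, as a lower bound on the number of blocks in a $(v, k, t)$ covering design; the function name `schonheim_bound` is an illustrative choice, and the code does not construct the design itself.

```python
from math import comb


def schonheim_bound(v: int, k: int, t: int) -> int:
    """Schönheim lower bound on the size of a (v, k, t) covering design."""
    bound = 1
    # Work outward from the innermost ratio (v-t+1)/(k-t+1),
    # rounding up after every multiplication.
    for i in range(t - 1, -1, -1):
        bound = -((-(v - i) * bound) // (k - i))  # integer ceiling division
    return bound


tickets_total = comb(49, 6)              # distinct 6/49 tickets
lower_bound = schonheim_bound(49, 6, 5)  # bound for the (49, 6, 5) case
```

Integer ceiling division is used instead of `math.ceil` on floats so that the nested rounding stays exact for large intermediate values.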
Adaptive Querying with AI Persona Priors
oai:arXiv.org:2605.00696v2
arXiv:2605.00696v2 Announce Type: replace
Abstract: We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight query budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.
Empirical Bernstein Confidence Intervals for Kernel Smoothers: A Safe and Sharp Way to Exhaust Assumed Smoothness
oai:arXiv.org:2605.03781v4
arXiv:2605.03781v4 Announce Type: replace
Abstract: Using standard-normal critical-value calibration (SNC) to construct a kernel-smoother-based confidence interval faces a fundamental challenge: the normalization makes a small estimation bias become a non-negligible inferential bias. This paper takes a different route by replacing the SNC control with empirical Bernstein tail control. The resulting confidence intervals control stochastic variability on the original estimation scale, so that deterministic smoothing bias enters the radius as an estimation-scale approximation error rather than as a normalized inferential bias. We develop this idea for pointwise inference on univariate density and regression functions. The proposed empirical Bernstein confidence intervals (EBCIs) combine empirical Bernstein calibration with bias-aware fixed-length radius construction under a local Taylor-remainder class. Uniformly over functions with $S$-th order local smoothness, both one-sided and two-sided intervals attain the nominal coverage level up to a remainder of order $n^{-\frac{2S}{2S+1}}$ or an exponential remainder in bounded or sub-Gaussian settings. Their widths shrink at the minimax rate $n^{-\frac{S}{2S+1}}$. Moreover, in the small-$\alpha$ regime, the EBCI radius is first-order aligned with the radii of bias-aware-type fixed-length confidence intervals. For one-sided inference, the leading terms coincide, while, for two-sided inference, the only difference is the usual replacement of $\log(\frac{1}{\alpha})$ by $\log(\frac{2}{\alpha})$. Thus, EBCI safely converts correctly specified smoothness into both coverage accuracy and interval-length efficiency. The contribution is not a new bias-control approach, but a new calibration method that can inherit existing ideas such as bias-aware inference (BA) and robust bias correction (RBC) while avoiding the normalized-bias inflation induced by SNC.
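To give a feel for the rates quoted in this abstract, the following sketch evaluates the two orders $n^{-\frac{S}{2S+1}}$ and $n^{-\frac{2S}{2S+1}}$ for an example sample size and smoothness level, and compares the one-sided and two-sided logarithmic factors. Constants are omitted, so the numbers are orders of magnitude only; the function name `ebci_rates` and the example values of $n$, $S$, and $\alpha$ are assumptions made for illustration.

```python
import math
from typing import Tuple


def ebci_rates(n: int, S: int) -> Tuple[float, float]:
    """Return (width order n^{-S/(2S+1)}, coverage-remainder order n^{-2S/(2S+1)})."""
    width = n ** (-S / (2 * S + 1))
    remainder = n ** (-2 * S / (2 * S + 1))
    return width, remainder


# Example values (assumptions): n = 10,000 observations, smoothness S = 2.
width, remainder = ebci_rates(n=10_000, S=2)

# Two-sided versus one-sided log factor at an example level alpha = 0.05.
alpha = 0.05
log_ratio = math.log(2 / alpha) / math.log(1 / alpha)
```

The coverage remainder is the square of the width order, so it vanishes faster than the interval shrinks, and `log_ratio` approaches 1 as $\alpha$ decreases.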
Expectation-Maximization as a Spectrally Governed Relaxation Flow
oai:arXiv.org:2605.07818v2
arXiv:2605.07818v2 Announce Type: replace
Abstract: The expectation--maximization (EM) algorithm combines global monotonicity, local linear convergence, and strong practical robustness, but these features are usually analyzed separately. Global descent is nonlinear, whereas local convergence is governed by the spectrum of the linearized EM map. How these two levels fit into a single dynamical picture has remained less transparent.
We make explicit the latent-variable operator that connects them. Along the EM trajectory, the likelihood increment admits a global energy decomposition in terms of posterior-relative entropy. Linearization at a nondegenerate maximizer $\theta^\ast$ reveals the local operator \[ \mathcal G_{\theta^\ast}=I-DT(\theta^\ast), \] which coincides with both the missing-information ratio and the information-geometric Hessian of the observed likelihood.
From this operator we derive two acceleration strategies. The G-Accelerator uses the spectral gap to obtain an optimal Nesterov-type momentum $\beta^* = (1-\sqrt{\lambda_*})/(1+\sqrt{\lambda_*})$. The Geo-Adaptive accelerator extends the geometric EM framework of Zhou, Alexander & Lange by replacing their fixed correction strength $\gamma=8$ with the adaptive rule $\gamma_k = 1/\hat\lambda_k$, where $\hat\lambda_k$ is estimated online from the parameter trajectory. Both methods are parameter-free; Geo-Adaptive achieves dramatic acceleration precisely when the spectral gap is smallest.
Numerical experiments on Gaussian mixtures demonstrate that both accelerators consistently outperform standard EM and fixed-$\gamma$ DCC-EM, with Geo-Adaptive attaining speedups exceeding $8\times$ in the most challenging regimes.
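The two closed-form rules stated in this abstract, $\beta^* = (1-\sqrt{\lambda_*})/(1+\sqrt{\lambda_*})$ and $\gamma_k = 1/\hat\lambda_k$, can be evaluated as follows. This is only a sketch of the formulas: the function names are hypothetical, the restriction $0 < \lambda \le 1$ is an assumption about the spectral gap, and the online estimator of $\hat\lambda_k$ is not specified in the abstract and is therefore not implemented here.

```python
import math


def g_accelerator_momentum(lam: float) -> float:
    """beta* = (1 - sqrt(lambda_*)) / (1 + sqrt(lambda_*))."""
    # Assumption: the spectral gap lies in (0, 1].
    if not 0.0 < lam <= 1.0:
        raise ValueError("expected 0 < lambda <= 1")
    root = math.sqrt(lam)
    return (1.0 - root) / (1.0 + root)


def geo_adaptive_strength(lam_hat: float) -> float:
    """gamma_k = 1 / lambda_hat_k, given an externally supplied estimate."""
    if lam_hat <= 0.0:
        raise ValueError("expected a positive spectral-gap estimate")
    return 1.0 / lam_hat


beta_small_gap = g_accelerator_momentum(0.01)  # small gap, momentum near 1
beta_large_gap = g_accelerator_momentum(1.0)   # gap of 1, no momentum
gamma_example = geo_adaptive_strength(0.125)   # arithmetic: 1 / 0.125
```

As the gap shrinks, the momentum moves toward 1 and the adaptive strength grows, which matches the abstract's remark that the gain is largest when the spectral gap is smallest.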
ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks
oai:arXiv.org:2605.12768v2
arXiv:2605.12768v2 Announce Type: replace
Abstract: Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with interpretable, user-configurable parameters and modular topology, demand, and control rules. The simulator advances a directed routing graph in discrete time: demand is served from inventory or recorded as backlog and triggers replenishment throughout the network. The state tracks inventory, outstanding orders, in-transit shipments, and a smoothed demand estimate, yielding Markovian dynamics on a tractable state space. The released data reproduces the bullwhip effect at empirically consistent magnitudes, while three conservation laws provide verification tools for simulator extensions. We release datasets at two catalogue scales ($C=50$ and $C=200$), six scenario sweeps, and 20 Latin-hypercube perturbations. These datasets exhibit dynamics largely absent from fixed TSF benchmarks, including variance amplification, cascading bottlenecks, regime shifts, and cross-channel coupling through shared macro shocks. Zero-shot evaluation of four foundation models (Chronos, Moirai, TimesFM, and Lag-Llama) yields MASE values exceeding public GIFT-Eval references at low-to-moderate horizons, supporting incorporation into existing benchmark suites. The same models provide forecast confidence bands through Latin-hypercube perturbations of demand-side parameters, enabling forward uncertainty quantification (UQ) unavailable on standard TSF datasets and demonstrating that foundation models can serve as fast surrogates for digital-twin-based UQ. Code (MIT): https://github.com/tuhinsahai/ISOMORPH. Interactive demo: https://huggingface.co/spaces/HyeminGu/ISOMORPH-demo.
Double Descent and Ensemble Emergence in Model Averaging Prediction
oai:arXiv.org:2605.13203v2
arXiv:2605.13203v2 Announce Type: replace
Abstract: This paper investigates the predictive performance of model averaging in high-dimensional linear regression where the number of regressors is comparable to the sample size. Leveraging tools from random matrix theory, we derive the exact limiting out-of-sample risk under a nested model setting and comprehensively characterize the risk landscape. This limiting risk helps to reveal two phenomena: simple weighting inherits the double descent trajectory and its associated variance explosion near the interpolation boundary; strategic weighting triggers an ensemble emergence that suppresses the localized risk surge and yields a globally flat risk surface. Building on this limiting risk, we also propose the Large Model Averaging (LaMA) method, in which we consider the discrepancy between in-sample and out-of-sample risks in the high-dimensional regime. Numerical studies and real data applications confirm that LaMA achieves superior predictive accuracy in high-dimensional environments.
Stabilised weighted data subsampling for accelerated inference in models with recursive likelihoods
oai:arXiv.org:2605.13397v2
arXiv:2605.13397v2 Announce Type: replace
Abstract: Inference for models with recursively defined likelihoods is computationally demanding, limiting scalability to large datasets. We propose a stabilised weighted subsampling methodology for accelerated inference based on an unbiased estimator of the log-likelihood. By assigning higher sampling probabilities to early observations, the method reduces the effective depth of recursive likelihood evaluations and hence computational cost. However, sampling probabilities that decay too slowly yield limited savings, while overly aggressive decay can substantially inflate estimator variance. We develop a stabilisation framework, supported by theory, that restricts the decay to avoid both computational and variance pathologies through principled hyperparameter tuning. We also derive an unbiased subsampling estimator of the log-likelihood gradient, enabling gradient-based inference. The methodology can be embedded within a range of inferential frameworks. We illustrate its use in variational Bayes and subsampling Markov chain Monte Carlo for conditional volatility models, including leverage effects. Empirical results show substantial computational speed-ups relative to full-data methods while maintaining inferential accuracy. We also compare with recent stochastic gradient MCMC and divide-and-conquer MCMC methods for temporally dependent data, observing favourable empirical performance.
Towards a holistic understanding of Selection Bias for Causal Effect Identification
oai:arXiv.org:2605.13430v3
arXiv:2605.13430v3 Announce Type: replace
Abstract: Selection bias is pervasive in observational studies. For example, large-scale biobank data can exhibit ``healthy volunteer bias'' when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such sub-populations is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE for the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize the propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification with strictly weaker conditions in the presence of selection bias.
Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification
oai:arXiv.org:2605.20615v2
arXiv:2605.20615v2 Announce Type: replace
Abstract: Causal mediation analysis is essential for disentangling the mechanisms by which investigational therapeutic and preventive agents impact clinical outcomes. However, the measurement of biological mediators is often subject to left-censoring by technical measurement limitations, most commonly an assay's limit of quantification. This form of censoring can pose severe challenges for both identification and estimation of causal mediation estimands, particularly when the censoring mechanism is deterministic and the resulting missingness is missing not at random (MNAR) or nonignorable. Motivated by the question of assessing the role of viral RNA in the action mechanism of monoclonal antibody therapies for COVID-19 in the Accelerating COVID-19 Therapeutics and Vaccine (ACTIV)-2 platform trial, we develop a semi-parametric framework for estimation of the natural direct and indirect effects when the mediator of interest is partially subject to this form of left-censoring. Our proposed strategy combines fractional imputation with a semi-parametric EM algorithm to flexibly estimate key components of the factorized data likelihood. Applying the proposed strategy to circumvent the left-censoring, we discuss both traditional plug-in and asymptotically efficient estimators of the direct and indirect effect estimands, introducing a data-adaptive $m$-out-of-$n$ bootstrap for robust inference under the imputation procedure. We demonstrate in numerical experiments that our approach significantly reduces bias and allows for reliable inference. An application to data from the ACTIV-2 platform trial confirms that monoclonal antibody therapies reduce the risk of hospitalization and death due to COVID-19, while suggesting that changes in viral RNA mediate only a modest proportion of the overall treatment effect.
Trustworthy AI/ML Regression and Unbiased Causal Inference for Real-World Data
oai:arXiv.org:2605.24377v2
arXiv:2605.24377v2 Announce Type: replace
Abstract: Real-World Data (RWD), with its large sample sizes and rich clinical detail, offers a compelling alternative to randomized controlled trials (RCTs) for studying treatment effects in diverse and complex patient populations. However, its observational nature introduces confounding that prevents straightforward comparative effectiveness research. Target trial emulation leverages RWD to estimate average treatment effects (ATE) at the population scale and diversity that RCTs cannot achieve, yet its validity depends critically on unbiased ATE estimation under high-dimensional confounding. Many causal inference pipelines address high-dimensional confounding through machine learning and artificial intelligence (ML/AI) outcome regression. However, commonly used ML/AI regression models exhibit systematic prediction bias, with predicted outcomes shrinking toward the marginal outcome mean. This structural bias propagates into ATE estimation and cannot be corrected by cross-fitting, ensemble methods, or any standard ML practice. In this work, we first quantitatively characterize how systematic prediction bias in ML/AI outcome regression leads to biased ATE estimates in causal inference models. We further propose an unbiased ML/AI regression-based causal inference framework to ensure unbiased ATE estimation for observational studies. We demonstrate our approach by studying the effects of opioids on cardiovascular health in patients with chronic pain using UK Biobank data.
Logistic regression is not enough: The need for Bayesian nonparametric modelling for causal inference using observational data, exemplified by the 'gateway' effect
oai:arXiv.org:2605.24847v2
arXiv:2605.24847v2 Announce Type: replace
Abstract: Introduction: Logistic regression (LR)-type model limitations for causal inference are explained theoretically and empirically through the lens of the purported gateway effect from e-cigarette use to smoking. Previous studies have reported that baseline e-cigarette use quadruples odds of follow-up smoking (binarized) in LR-type models of adolescent longitudinal cohorts (LCs), such that increased e-cigarette use would counteract smoking declines. However, US population-level trends show accelerated smoking declines to record lows when e-cigarette use increased, presenting an apparent paradox. Methods: Population Assessment of Tobacco and Health (USA) Youth Waves 3 to 4 were analyzed with Bayesian Additive Regression Trees (BART) to model baseline e-cigarette use (treatment) and change in number of days smoking from baseline to follow-up (numerical response) among never- and ever-smoking respondents (group effects), adjusting for confounding risk factors (socio-demographic, intra-individual, behavioural, peer influence, and family background). Unlike LR-type models, BART provides nonlinear, nonparametric modelling with counterfactuals and provides causal effect estimates with principled uncertainty estimation. Results: The average effect of e-cigarette use on smoking was both clinically and statistically significant among ever-smoking adolescents (-2 days smoking [diversionary effect; opposite to gateway]) and was not clinically significant among never-smoking adolescents (<1-day absolute change in days smoking [null effect]). Conclusions: When LC data are analyzed with causal inference techniques, the gateway effect disappears, consistent with population-level trends. This likely explains why gateway effects predicted in previous LR-type studies have not materialized in a population-level reversal/unexpected slowing of the US adolescent smoking decline, resolving the paradox.
Approximating full conformal prediction: distribution free guarantees via the tournament correction
oai:arXiv.org:2605.29200v2
arXiv:2605.29200v2 Announce Type: replace
Abstract: Conformal prediction is a framework for providing prediction intervals with distribution-free validity, guaranteeing predictive coverage for data drawn from any distribution. Its two main variants are full conformal prediction and split conformal prediction (also called transductive and inductive). Full conformal prediction is widely considered to be statistically more efficient (since split conformal prediction requires data splitting, and therefore can lead to wider prediction intervals due to the resulting loss in sample size), but its implementation is computationally prohibitive, as it requires the underlying model to be refit for every candidate value in the response space. Existing computational shortcuts, such as using a discrete grid of values to approximate the full conformal prediction construction, frequently lack theoretical guarantees on marginal coverage and can fail in practice.
To address this limitation, we introduce a novel class of approximations to the full conformal prediction method, based on the idea of \emph{tournaments}, which enables the construction of prediction sets with a rigorous marginal coverage guarantee of $1-2\alpha$. Under stability conditions, the theoretical coverage guarantee tightens to approximately $1-\alpha$. This new framework generalizes the existing method of leave-one-out cross-conformal prediction, while allowing for flexible use of various existing approximation strategies.
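For orientation, the split (inductive) variant that the abstract contrasts with full conformal prediction can be sketched in a few lines. This is a minimal illustration of the standard split conformal construction with absolute-residual scores, not the tournament method of the paper. The function name `split_conformal_interval`, the toy model, and the toy data are assumptions made for the example.

```python
import math
import random


def split_conformal_interval(cal_residuals, prediction, alpha):
    """Split conformal interval around `prediction`.

    cal_residuals: absolute residuals |y - model(x)| on a held-out
    calibration split (hypothetical input for this sketch).
    """
    n = len(cal_residuals)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is trivial.
        return (-math.inf, math.inf)
    q = sorted(cal_residuals)[k - 1]
    return (prediction - q, prediction + q)


# Toy usage: pretend `model` was fitted on a separate training split.
random.seed(0)


def model(x):
    return 2.0 * x


calibration = []
for _ in range(200):
    x = random.uniform(0.0, 1.0)
    calibration.append((x, 2.0 * x + random.gauss(0.0, 1.0)))

residuals = [abs(y - model(x)) for x, y in calibration]
lo, hi = split_conformal_interval(residuals, model(0.5), alpha=0.1)
```

The model is fitted once and only the calibration residuals are sorted, which is why this variant is cheap. Full conformal prediction instead refits the model for each candidate response value, which is the cost the abstract describes as prohibitive.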
Gaussian Differentially Private $e$-values: Construction, Threshold Calibration, and Multiple Testing
oai:arXiv.org:2605.29388v2
arXiv:2605.29388v2 Announce Type: replace
Abstract: This paper develops a framework for differentially private $e$-values under Gaussian differential privacy ($\mu$-GDP). We characterize the canonical noise mechanism, establishing that optimal multiplicative perturbation follows a Gaussian distribution. Using this distribution, we derive a globally sharp rejection threshold that strictly improves upon the standard Markov bound. Asymptotic analysis shows that in low-sensitivity regimes, the calibrated private test achieves a net power gain over the non-private baseline. For multiple testing, we introduce a recursive peeling algorithm that adaptively concentrates the privacy budget on the most promising hypotheses. This construction guarantees rigorous $\mu$-GDP and yields valid private $e$-values compatible with standard multiple testing procedures. Simulations and a genome-wide association study confirm that the method controls the false discovery rate while improving upon naive all-noisy privatization and recovering power close to non-private benchmarks.
Multi-source land-use emissions reveal rising airborne fraction
oai:arXiv.org:2605.30242v2
arXiv:2605.30242v2 Announce Type: replace
Abstract: The airborne fraction is the share of anthropogenic carbon dioxide emissions that remains in the atmosphere and is a key indicator of carbon-cycle response and remaining carbon budgets under continued emissions. Whether this share is rising remains debated because inference is sensitive to uncertainty in land-use and land-cover change (LULC) emissions. Here we use all available LULC measurement series from Global Carbon Budget 2025 and estimate airborne-fraction trends with a mixed-effects model with random intercepts and slopes by LULC series. We find that the airborne fraction increased over 1959-2024, from about 0.40 to about 0.47, and that this conclusion is robust to excluding the final year and to alternative specifications that explicitly propagate denominator uncertainty. These results clarify why earlier studies reported weak or inconclusive trend evidence and strengthen support for the view that an increasing share of emitted carbon dioxide is accumulating in the atmosphere rather than being taken up by land and ocean sinks, with implications for carbon-budget assessment and near-term mitigation requirements.
Optimizing accuracy and diversity: a multi-task approach to forecast combinations
oai:arXiv.org:2310.20545v3
arXiv:2310.20545v3 Announce Type: replace-cross
Abstract: We present a multi-task optimization approach based on a deep learning architecture for time series forecasting. We leverage large collections of time series to identify the weights of forecasting models that can be combined to produce forecasts for each series. This method jointly addresses two tasks: the selection of different forecasting models, and their effective combination. In doing so, it takes into account, in an original way, both the accuracy and diversity of the forecasting methods. For a given time series, the model combination module extracts features and uses them to optimize the weights of the forecasting methods. Simultaneously, the model selection module extracts other features to identify the subset of methods to be used for the prediction. This selection process is framed as a classification problem, with the labels representing the set of models to be used for a series. These labels are determined by solving an auxiliary optimization problem that identifies accurate and diverse methods for each time series. The outputs of the two modules are then combined and the entire neural network is jointly trained by minimizing a custom loss function via gradient descent optimization. Experimental results on a large set of series from the M4 competition dataset and from real road traffic data show that our proposal enhances point forecast accuracy compared to state-of-the-art methods.
AutoEval Done Right: Using Synthetic Data for Model Evaluation
oai:arXiv.org:2403.07008v3
arXiv:2403.07008v3 Announce Type: replace-cross
Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
A theory of generalised coordinates for stochastic differential equations
oai:arXiv.org:2409.15532v3
arXiv:2409.15532v3 Announce Type: replace-cross
Abstract: Stochastic differential equations are ubiquitous modelling tools in physics and the sciences. In most modelling scenarios, random fluctuations driving dynamics or motion have some non-trivial temporal correlation structure, which renders the SDE non-Markovian; a phenomenon commonly known as ``colored'' noise. Thus, an important objective is to develop effective tools for mathematically and numerically studying (possibly non-Markovian) SDEs. In this report, we formalise a mathematical theory for analysing and numerically studying SDEs based on so-called `generalised coordinates of motion'. Like the theory of rough paths, we analyse SDEs pathwise for any given realisation of the noise, not solely probabilistically. Like the established theory of Markovian realisation, we realise non-Markovian SDEs as a Markov process in an extended space. Unlike the established theory of Markovian realisation, however, the Markovian realisations here are accurate on short timescales and may be exact globally in time, when flows and fluctuations are analytic. This theory is exact for SDEs with analytic flows and fluctuations, and is approximate when flows and fluctuations are differentiable. It provides useful analysis tools, which we employ to solve linear SDEs with analytic fluctuations. It may also be useful for studying rougher SDEs, as these may be identified as the limit of smoother ones. This theory supplies effective, computationally straightforward methods for simulation, filtering and control of SDEs; amongst others, we re-derive generalised Bayesian filtering, a state-of-the-art method for time-series analysis. Looking forward, this report suggests that generalised coordinates have far-reaching applications throughout stochastic differential equations.
General Seemingly Unrelated Local Projections
oai:arXiv.org:2410.17105v4
arXiv:2410.17105v4 Announce Type: replace-cross
Abstract: We develop a flexible framework for Bayesian estimation of impulse responses using Local Projections (LPs) with instrumental variables. It accommodates multiple shocks and instruments, accounts for autocorrelation in multi-step forecasts by jointly modeling all LPs as a seemingly unrelated system of equations, defines a flexible yet parsimonious joint prior for impulse responses based on a Gaussian Process, and allows for joint inference about the entire vector of impulse responses. We show via Monte Carlo simulations that our approach delivers more accurate point and uncertainty estimates than standard methods. To address misspecification, we propose an optional robustification step based on power posteriors.
A Likelihood Approach for Inference of Population Heterogeneity in Particle Ensembles with Second-Order Langevin Dynamics
oai:arXiv.org:2411.08692v2
arXiv:2411.08692v2 Announce Type: replace-cross
Abstract: The inherent complexity of biological agents often leads to motility behavior that appears to have random components. Robust stochastic inference methods are therefore required to understand and predict the motion patterns from time-discrete trajectory data provided by experiments. In many cases, second-order Langevin models are needed to adequately capture the motility. Additionally, population heterogeneity needs to be taken into account when analyzing data from several individual organisms. In this work, we describe a maximum likelihood approach to infer dynamical, stochastic models and, simultaneously, estimate the heterogeneity in a population of motile active particles from discretely sampled, stochastic trajectories. To this end, we propose a method to approximate the likelihood for non-linear second-order Langevin models. We show that this maximum likelihood ansatz outperforms alternative approaches, especially for short trajectories. Additionally, we demonstrate how a measure of uncertainty for the heterogeneity estimate can be derived. We thereby pave the way for the systematic, data-driven inference of dynamical models for actively driven entities based on trajectory data, deciphering temporal fluctuations and inter-particle variability.
Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures
oai:arXiv.org:2411.12438v2
arXiv:2411.12438v2 Announce Type: replace-cross
Abstract: We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a subroutine, based on the sum-of-squares method, that finds a low-dimensional separation-preserving projection of the input data. Our method gives a non-spherical analog of the classical dimension reduction, based on singular value decomposition, that, among several other applications, forms a key component of the celebrated spherical clustering algorithm of Vempala and Wang [VW04].
As applications, we obtain an algorithm to (1) cluster an arbitrary total-variation separated mixture of $k$ centered (i.e., zero-mean) Gaussians with $n\geq \operatorname{poly}(d) f(w_{\min}^{-1})$ samples and $\operatorname{poly}(n)$ time, and (2) cluster an arbitrary total-variation separated mixture of $k$ Gaussians with identical but arbitrary unknown covariance with $n \geq d^{O(\log w_{\min}^{-1})} f(w_{\min}^{-1})$ samples and $n^{O(\log w_{\min}^{-1})}$ time. Here, $w_{\min}$ is the minimum mixing weight of the input mixture, and $f$ does not depend on the dimension $d$. Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers. Before this work, the techniques in the state-of-the-art non-spherical clustering algorithms needed $d^{O(k)} f(w_{\min}^{-1})$ samples and time for clustering such mixtures.
Our results may come as a surprise in the context of the $d^{\Omega(k)}$ statistical query and sum-of-squares lower bounds [DKS17, DKPP24] for clustering non-spherical Gaussian mixtures. While these results are usually thought to rule out $d^{o(k)}$ cost algorithms for the problem, our results show that the lower bounds can in fact be circumvented for a remarkably general class of Gaussian mixtures.
Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning
oai:arXiv.org:2412.04177v2
arXiv:2412.04177v2 Announce Type: replace-cross
Abstract: Recently, there has been an increasing interest in performing post-hoc uncertainty estimation about the predictions of pre-trained deep neural networks (DNNs). Given a pre-trained DNN via back-propagation, these methods enhance the original network by adding output confidence measures, such as error bars, without compromising its initial accuracy. In this context, we introduce a novel family of sparse variational Gaussian processes (GPs), where the posterior mean is fixed to any continuous function when using a universal kernel. Specifically, we fix the mean of this GP to the output of the pre-trained DNN, allowing our approach to effectively fit the GP's predictive variances to estimate the DNN prediction uncertainty. Our approach leverages variational inference (VI) for efficient stochastic optimization, with training costs that remain independent of the number of training points, scaling efficiently to large datasets such as ImageNet. The proposed method, called fixed-mean GP (FMGP), is architecture-agnostic, relying solely on the pre-trained model's outputs to adjust the predictive variances. Experimental results demonstrate that FMGP improves both uncertainty estimation and computational efficiency when compared to state-of-the-art methods for DNN post-hoc Bayesian inference.
Challenges in the calibration of tree-based models for imbalanced classification
oai:arXiv.org:2412.16209v5
arXiv:2412.16209v5 Announce Type: replace-cross
Abstract: When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data that is not fully representative of the underlying population of interest. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration and the latter by demonstrating a bias in decision trees. In contradiction with much of the existing literature, we show that decision trees can be biased towards the minority class. These issues indicate that tree-based models trained on undersampled data should not be calibrated analytically. Calibration approaches that can learn a miscalibration pattern in the original model (e.g., beta calibration) are more suitable.
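The abstract does not state the analytical mapping it refers to. A commonly used form, assumed here for illustration, follows from Bayes' rule when the majority (negative) class is kept with probability `beta` and the minority class is kept in full: the odds in the subsampled data are the population odds divided by `beta`. The function name `undo_undersampling` is hypothetical.

```python
def undo_undersampling(p_s, beta):
    """Map a probability learned on undersampled data back to the population scale.

    p_s:  predicted minority-class probability from the model trained on
          the undersampled data.
    beta: fraction of majority-class examples that were kept (0 < beta <= 1).

    Assumed mapping: p = beta * p_s / (beta * p_s - p_s + 1).
    """
    return beta * p_s / (beta * p_s - p_s + 1.0)


# Keeping 10% of the majority class: a score of 0.5 maps back to 1/11.
p = undo_undersampling(0.5, 0.1)
```

With `beta = 1` (no subsampling) the mapping is the identity. The abstract's point is that applying such a fixed mapping to random forest outputs can still give biased prevalence estimates, so this sketch shows the approach being criticised, not a recommended fix.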
Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
oai:arXiv.org:2412.19444v2
arXiv:2412.19444v2 Announce Type: replace-cross
Abstract: Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad-hoc tuning of learning rates poses a challenge and leads to inefficiencies in practice. To address this issue, recent research has focused on developing ``parameter-free'' algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of Adam++.
Quantum Reservoir Computing and Risk Bounds
oai:arXiv.org:2501.08640v2
arXiv:2501.08640v2 Announce Type: replace-cross
Abstract: We propose a way to bound the generalisation errors of several classes of quantum reservoirs using the Rademacher complexity. We give specific, parameter-dependent bounds for two particular quantum reservoir classes. We analyse how the generalisation bounds scale with growing numbers of qubits. Applying our results to classes with polynomial readout functions, we find that the risk bounds converge in the number of training samples. The explicit dependence on the quantum reservoir and readout parameters in our bounds can be used to control the generalisation error to a certain extent. It should be noted that the bounds scale exponentially with the number of qubits n. The upper bounds on the Rademacher complexity can be applied to other reservoir classes that fulfill a few hypotheses on the quantum dynamics and the readout function.
Non-vacuous Generalization Bounds for Deep Neural Networks without any modification to the trained models
oai:arXiv.org:2503.07325v2
arXiv:2503.07325v2 Announce Type: replace-cross
Abstract: Understanding and certifying the behavior of modern deep neural networks remains a fundamental challenge in reliable machine learning. We introduce a new class of data-dependent generalization bounds that apply directly to trained models, without any modification. In particular, we present an exactly computable bound that is non-vacuous across all evaluated networks, including ImageNet-scale models with 600M parameters. This is the first work showing that meaningful generalization guarantees are achievable even for large, unaltered deep networks.
Our approach reveals that generalization is governed by the interaction between the trained model and the geometry of the data distribution. We decompose the generalization error into two interpretable components: a distributional complexity term, capturing how the data mass is distributed across the input space, and local model-behavior terms, capturing the network's behavior within individual regions. This joint dependence identifies where and why generalization gaps arise. Empirically, some components of our bound are highly predictive of the true test error, and the bound tightens when the partition aligns with the intrinsic data geometry, highlighting data-dependent local regularity as a key driver of generalization.
Advancing Local Clustering on Graphs via Compressive Sensing: Semi-supervised and Unsupervised Methods
oai:arXiv.org:2504.19419v3
arXiv:2504.19419v3 Announce Type: replace-cross
Abstract: Local clustering aims to identify specific substructures within a large graph without any additional structural information about the graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes, and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state-of-the-art results in the low-label-rate regime.
Global Convergence of Adaptive Sensing for Principal Eigenvector Estimation
oai:arXiv.org:2505.10882v2
arXiv:2505.10882v2 Announce Type: replace-cross
Abstract: Principal component analysis classically requires full $d$-dimensional samples, yet in various applications hardware limits acquisition to a few scalar measurements per sample. We analyze a compressed variant of Oja's algorithm for estimating the principal eigenvector of the data covariance matrix using only two adaptive measurements per sample. At each iteration, we observe one measurement along the current estimate and one in a random orthogonal direction. We prove that after $t$ iterations, the expected sine-squared error to the true eigenvector is $\mathcal{O}(\lambda_1\lambda_2 d^2 / (\Delta^2 t))$, where $d$ is the ambient dimension, $\lambda_1, \lambda_2$ are the leading eigenvalues, and $\Delta = \lambda_1 - \lambda_2$ is the eigengap. We complement this with a matching information-theoretic lower bound of $\Omega(\lambda_1\lambda_2 d^2 / (\Delta^2 t))$ -- the first for compressed eigenvector estimation -- proving that the $d^2$ factor, an additional factor of $d$ compared to the fully-observed minimax rate $\Theta(\lambda_1\lambda_2 d / (\Delta^2 t))$, is the fundamental cost of compression and cannot be improved. In contrast, any non-adaptive scheme with two measurements per iteration suffers $\Omega(\lambda_2^2 d^3 / (\Delta^2 t))$, an additional power of $d$. This separates fully-observed, adaptive-compressed, and non-adaptive-compressed PCA across three powers of $d$. Our analysis handles the noisy setting where the covariance has nonzero trailing eigenvalues, providing the first convergence guarantee for adaptive compressed subspace tracking beyond the noiseless case.
HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
oai:arXiv.org:2505.14725v2
arXiv:2505.14725v2 Announce Type: replace-cross
Abstract: Respiratory viral infections pose a global health burden, yet the cellular immune mechanisms underlying protection and pathology remain unclear. Natural infection cohorts often lack pre-exposure baselines and time-controlled sampling, whereas inoculation and vaccination trials generate well-structured longitudinal transcriptomic data. However, these datasets are scattered across repositories and processed inconsistently, hindering integrative and AI-driven analyses. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready resource integrating bulk and single-cell transcriptomic profiles from 3,178 subjects across 66 studies. The dataset spans vaccination, inoculation, and mixed exposures, with samples from blood and nasal swabs collected from public repositories including GEO, ImmPort, and ArrayExpress. We curated and harmonized subject-level metadata, standardized outcome measures, and applied unified preprocessing with rigorous quality control. We further provide benchmark analyses illustrating its utility. This resource supports discovery of biomarkers, immune mechanisms, and methodological development. As one of the largest longitudinal transcriptomic resources for human respiratory viral immunization, HR-VILAGE-3K3M enables reproducible and scalable analyses to accelerate vaccine and antiviral research.
Human in the Loop Adaptive Optimization for Improved Time Series Forecasting
oai:arXiv.org:2505.15354v2
arXiv:2505.15354v2 Announce Type: replace-cross
Abstract: Time series forecasting models often produce systematic, predictable errors even in critical domains such as energy, finance, and healthcare. We introduce a novel post-training adaptive optimization framework that improves forecast accuracy without retraining or architectural changes. Our method automatically applies expressive transformations optimized via reinforcement learning, contextual bandits, or genetic algorithms to correct model outputs in a lightweight and model-agnostic way. Theoretically, we prove that affine corrections always reduce the mean squared error; practically, we extend this idea with dynamic action-based optimization. The framework also supports an optional human-in-the-loop component: domain experts can guide corrections using natural language, which is parsed into actions by a language model. Across multiple benchmarks (e.g., electricity, weather, traffic), we observe consistent accuracy gains with minimal computational overhead. Our interactive demo shows the framework's real-time usability. By combining automated post hoc refinement with interpretable and extensible mechanisms, our approach offers a powerful new direction for practical forecasting systems.
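The affine-correction idea can be illustrated with ordinary least squares: fit `a` and `b` so that `a * forecast + b` is as close as possible to the observed values. The sketch below is an assumption about how such a correction might look, not the paper's framework. The helper names and the toy numbers are hypothetical. On the data used for fitting, the corrected error cannot exceed the raw error, because the identity correction (`a = 1`, `b = 0`) is one of the candidates.

```python
def fit_affine(forecasts, actuals):
    """Least-squares fit of actual ~ a * forecast + b."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_y = sum(actuals) / n
    var_f = sum((f - mean_f) ** 2 for f in forecasts)
    if var_f == 0.0:
        # Constant forecasts: only a shift can be estimated.
        return 1.0, mean_y - mean_f
    cov = sum((f - mean_f) * (y - mean_y) for f, y in zip(forecasts, actuals))
    a = cov / var_f
    b = mean_y - a * mean_f
    return a, b


def mse(predictions, actuals):
    """Mean squared error between two equal-length sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, actuals)) / len(actuals)


# Toy data: the forecaster systematically under-predicts.
actuals = [10.0, 12.0, 14.0, 16.0]
forecasts = [8.0, 9.5, 11.0, 12.5]

a, b = fit_affine(forecasts, actuals)
corrected = [a * f + b for f in forecasts]

mse_raw = mse(forecasts, actuals)
mse_corrected = mse(corrected, actuals)
```

The guarantee in this sketch holds only on the data used to fit `a` and `b`. Whether the gain carries over to new forecasts depends on how stable the systematic error is over time.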
How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension
oai:arXiv.org:2506.16704v3
arXiv:2506.16704v3 Announce Type: replace-cross
Abstract: We study a fundamental question of domain generalization: given a family of domains (i.e., data distributions), how many randomly sampled domains do we need to collect data from in order to learn a model that performs reasonably well on every seen and unseen domain in the family? We model this problem in the PAC framework and introduce a new combinatorial measure, which we call the domain shattering dimension. We show that this dimension characterizes the domain sample complexity. Furthermore, we establish a tight quantitative relationship between the domain shattering dimension and the classic VC dimension, demonstrating that every hypothesis class that is learnable in the standard PAC setting is also learnable in our setting.
VERA: Variational Inference Framework for Jailbreaking Large Language Models
oai:arXiv.org:2506.22666v3
arXiv:2506.22666v3 Announce Type: replace-cross
Abstract: The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction
oai:arXiv.org:2509.04631v2
arXiv:2509.04631v2 Announce Type: replace-cross
Abstract: Transductive conformal prediction addresses the simultaneous prediction for multiple data points. Given a desired confidence level, the objective is to construct a prediction set that includes the true outcomes with the prescribed confidence. We demonstrate a fundamental trade-off between confidence and efficiency in transductive methods, where efficiency is measured by the size of the prediction sets. Specifically, we derive a strict finite-sample bound showing that any non-trivial confidence level leads to exponential growth in prediction set size for data with inherent uncertainty. The exponent scales linearly with the number of samples and is proportional to the conditional entropy of the data. Additionally, the bound includes a second-order term, dispersion, defined as the variance of the log conditional probability distribution. We show that the transductive methods based on the approximate conditional distribution can approach this bound. Inspired by this setup, we introduce a practical transductive prediction algorithm that surpasses Bonferroni methods.
Towards a Physics Foundation Model
oai:arXiv.org:2509.13805v4
arXiv:2509.13805v4 Announce Type: replace-cross
Abstract: Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative - democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.
Deep Learning as the Disciplined Construction of Tame Objects
oai:arXiv.org:2509.18025v2
arXiv:2509.18025v2 Announce Type: replace-cross
Abstract: One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth nonconvex, but tame, setting. This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within Deep Learning.
Interpretable Self-Supervised Learning via Representer Landmarks and Nystr\"om Approximation
oai:arXiv.org:2509.24467v3
arXiv:2509.24467v3 Announce Type: replace-cross
Abstract: Self-supervised learning (SSL) learns representations from massive unlabeled data, yet the resulting models typically operate as black boxes, necessitating domain-specific explanations. We introduce KREPES, a unified framework to analytically interpret the learned representations of SSL objectives, including SimCLR, BYOL, and VICReg. By bridging empirical neural tangent kernel approximations of neural networks with the Representer Theorem for kernels, we express the learned latent space directly via "Representer Landmarks", which are the representations of influential unlabeled training examples. We introduce novel metrics, "Sample-Specific Influence Score", "Concept-Conditioned Influence Score" and "Feature Alignment Gap", to quantify the transparency of the learned representations. KREPES enables direct audit of the latent space without supervision, for example, revealing an algorithmic bias in the Adult-1M dataset where SSL uses demographic proxies for income. Finally, to ensure scalability to benchmarks with 1M+ samples (ImageNet-1K, Adult-1M), KREPES introduces a novel Nystr\"om approximation-based analytical inference framework for SSL objectives.
Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability
oai:arXiv.org:2510.03494v2
arXiv:2510.03494v2 Announce Type: replace-cross
Abstract: We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^\pi$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.
From Moments to Models: Graphon-Mixture Learning for Mixup and Contrastive Learning
oai:arXiv.org:2510.03690v4
arXiv:2510.03690v4 Announce Type: replace-cross
Abstract: Real-world graph datasets often arise from mixtures of populations, where graphs are generated by multiple distinct underlying distributions. In this work, we propose a unified framework that explicitly models graph data as a mixture of probabilistic graph generative models represented by graphons. To characterize and estimate these graphons, we leverage graph moments (motif densities) to cluster graphs generated from the same underlying model. We establish a novel theoretical guarantee, deriving a tighter bound showing that graphs sampled from structurally similar graphons exhibit similar motif densities with high probability. This result enables principled estimation of graphon mixture components. We show how incorporating estimated graphon mixture components enhances two widely used downstream paradigms: graph data augmentation via mixup and graph contrastive learning. By conditioning these methods on the underlying generative models, we develop graphon-mixture-aware mixup (GMAM) and model-aware graph contrastive learning (MGCL). Extensive experiments on both simulated and real-world datasets demonstrate strong empirical performance. In supervised learning, GMAM outperforms existing augmentation strategies, achieving new state-of-the-art accuracy on 6 out of 7 datasets. In unsupervised learning, MGCL performs competitively across seven benchmark datasets and achieves the lowest average rank overall.
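As a rough illustration of what "graph moments (motif densities)" can mean in practice, the sketch below computes two low-order homomorphism densities (edges and triangles) from an adjacency matrix. The function name `motif_densities` and the choice of these two motifs are assumptions made for illustration; the paper's exact moment set and estimator may differ.

```python
def motif_densities(adj):
    """Return (edge density, triangle density) of a simple undirected graph.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    Densities are homomorphism densities: counts of ordered vertex tuples
    that map onto the motif, divided by n**2 (edge) or n**3 (triangle).
    """
    n = len(adj)
    # Ordered pairs (i, j) joined by an edge.
    edge = sum(sum(row) for row in adj) / n**2
    # Ordered triples (i, j, k) forming a closed walk i -> j -> k -> i.
    tri = sum(
        adj[i][j] * adj[j][k] * adj[k][i]
        for i in range(n)
        for j in range(n)
        for k in range(n)
    ) / n**3
    return edge, tri


# Hypothetical example: the complete graph on three vertices.
k3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
edge_density, triangle_density = motif_densities(k3)
```

Graphs drawn from similar generative models would be expected to give nearby density vectors, which is what makes such vectors usable as clustering features.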
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
oai:arXiv.org:2510.06028v3
arXiv:2510.06028v3 Announce Type: replace-cross
Abstract: This paper provides data-dependent bounds on the expected error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The results show that generalization in the low-temperature regime is already signaled by small training errors in the noisier high-temperature regime. The bounds are stable under approximation with Langevin Monte Carlo algorithms. The analysis motivates the design of an algorithm to compute bounds, which on the MNIST, CIFAR-10, and SVHN datasets yield nontrivial, close predictions on the test error for true labeled data, while maintaining a correct upper bound on the test error for random labels.
Symmetries in PAC-Bayesian Learning
oai:arXiv.org:2510.17303v2
arXiv:2510.17303v2 Announce Type: replace-cross
Abstract: Symmetries are known to improve the empirical performance of machine learning models, yet theoretical guarantees explaining these gains remain limited. Prior work has focused mainly on compact group symmetries and often assumes that the data distribution itself is invariant, an assumption rarely satisfied in real-world applications. In this work, we extend generalization guarantees to the broader setting of non-compact symmetries, such as translations, and to non-invariant data distributions. Building on the PAC-Bayes framework, we adapt and tighten existing bounds, demonstrating the approach on McAllester's PAC-Bayes bound while showing that it applies to a wide range of PAC-Bayes bounds. We validate our theory with experiments on several datasets with non-uniform and non-compact transformations, where the derived guarantees not only hold but also improve upon prior results. These findings provide theoretical evidence that, for symmetric data, symmetric models are preferable beyond the narrow setting of compact groups and invariant distributions, opening the way to a more general understanding of symmetries in machine learning.
Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM
oai:arXiv.org:2511.07438v3
arXiv:2511.07438v3 Announce Type: replace-cross
Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.
How to Correctly Report LLM-as-a-Judge Evaluations
oai:arXiv.org:2511.21140v4
arXiv:2511.21140v4 Announce Type: replace-cross
Abstract: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
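To make the source of the bias concrete, here is a minimal sketch of the classical misclassification correction (often called the Rogan-Gladen estimator), which adjusts an observed rate for imperfect sensitivity and specificity. It is an illustration under assumptions and not necessarily the estimator or interval construction proposed in the paper; the function name `corrected_score` and the example numbers are hypothetical.

```python
def corrected_score(p_judge, sensitivity, specificity):
    """Adjust a judge's raw pass rate for imperfect sensitivity/specificity.

    p_judge:     fraction of responses the LLM judge marked correct.
    sensitivity: P(judge says correct | truly correct).
    specificity: P(judge says incorrect | truly incorrect).

    Uses E[p_judge] = theta * sensitivity + (1 - theta) * (1 - specificity)
    and solves for the true score theta.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must beat chance: sensitivity + specificity > 1")
    theta = (p_judge + specificity - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the estimate outside.
    return min(1.0, max(0.0, theta))


# Hypothetical numbers: true score 0.6, sensitivity 0.9, specificity 0.8.
raw = 0.6 * 0.9 + 0.4 * (1 - 0.8)       # what the judge would report on average
estimate = corrected_score(raw, 0.9, 0.8)
```

In this example the raw judge score overstates the true score slightly, and the correction recovers it. In practice sensitivity and specificity are themselves estimated from a calibration set, which is why the abstract stresses accounting for both sources of uncertainty.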
Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients
oai:arXiv.org:2512.02342v3
arXiv:2512.02342v3 Announce Type: replace-cross
Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size achieves performance competitive with existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Finally, in the context of deep neural network training, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.
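For readers unfamiliar with Polyak-type steps, the sketch below runs a stochastic subgradient method with a basic capped Polyak step size on a one-dimensional non-smooth problem. This is the plain capped rule, not the safeguarded variant SPS$_{safe}$ that the paper proposes; the function name `sps_subgradient`, the cap `gamma_max`, and the toy objective are assumptions chosen for illustration.

```python
import random


def sps_subgradient(points, x0=0.0, gamma_max=1.0, steps=200, seed=0):
    """Minimize (1/n) * sum_i |x - points[i]| by stochastic subgradient steps.

    Each component f_i(x) = |x - p_i| has minimum value f_i* = 0, so the
    Polyak step is (f_i(x) - f_i*) / g**2 = |x - p_i| / g**2, capped at
    gamma_max.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        p = rng.choice(points)
        loss = abs(x - p)
        if loss == 0.0:
            continue  # zero is a valid subgradient here, so no update
        g = 1.0 if x > p else -1.0              # subgradient of |x - p|
        step = min(loss / (g * g), gamma_max)   # capped Polyak step
        x -= step * g
    return x


# With identical data points (interpolation holds) the iterate reaches 2.0.
x_star = sps_subgradient([2.0, 2.0, 2.0])
```

When the data points differ, so that no single x minimizes every component, this plain rule need not settle at a minimizer, which is the kind of limitation the abstract alludes to when it mentions interpolation assumptions.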
State and Parameter Estimation for a Neural Model of Local Field Potentials
oai:arXiv.org:2512.07842v2
arXiv:2512.07842v2 Announce Type: replace-cross
Abstract: The study of cortical dynamics during different states, such as decision making, sleep, and movement, is an important topic in neuroscience. Modelling efforts aim to relate the neural rhythms present in cortical recordings to the underlying dynamics responsible for their emergence. We present an effort to characterize the neural activity from the cortex of a mouse during natural sleep, captured through local field potential measurements. Our approach relies on using a discretized Wilson--Cowan Amari neural field model for neural activity, along with a data assimilation method that allows the Bayesian joint estimation of the state and parameters. We demonstrate the feasibility of our approach on synthetic measurements before applying it to a dataset available in the literature. Our findings suggest the potential of our approach to characterize the stimulus received by the cortex from other brain regions, while simultaneously inferring a state that aligns with the observed signal.
Multigrade Neural Network Approximation
oai:arXiv.org:2601.16884v3
arXiv:2601.16884v3 Announce Type: replace-cross
Abstract: We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly nonconvex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably certain one-hidden-layer ReLU models, training admits convex reformulations with global guarantees under appropriate settings, motivating learning paradigms that improve stability while scaling to depth. MGDL builds on this insight by training deep networks grade by grade: previously learned grades are frozen, and each newly added grade-wise subnetwork is composed on top of the previously learned grades and trained to fit the residual left by the current approximation, yielding a structured and interpretable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function defined on a hypercube, there exists a fixed-width multigrade ReLU scheme whose residuals are pointwise nonincreasing in magnitude and converge uniformly to zero, with strict $L^p$-norm decay at every nontrivial grade for $p\in [1,\infty)$. To the best of our knowledge, this work provides the first rigorous constructive approximation guarantee showing that a grade-wise residual refinement scheme can achieve vanishing error in a fixed-width multigrade ReLU architecture.
Causal Evaluation of Membership Inference Attacks
oai:arXiv.org:2602.02819v4
arXiv:2602.02819v4 Announce Type: replace-cross
Abstract: Membership Inference Attacks (MIAs) aim to distinguish training points (members) from unseen data (non-members), and are widely used to quantify memorization and assess privacy risks. Standard MIA evaluation requires repeated retraining, which is computationally costly for large models. One-run (single training with randomized data inclusion) and zero-run (post hoc evaluation) methods are often used instead, but their statistical validity remains unclear. We address this gap by framing MIA evaluation as a causal inference problem, defining \emph{memorization as the causal effect of including a data point in the training set}. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations are additionally confounded by distribution shift between member and non-member evaluation data. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. We validate our approach in several settings, including pretrained and fine-tuned LLMs, showing that it enables reliable measurement of MIA performance without retraining and under distribution shift. Overall, our framework provides a principled foundation for privacy evaluation in modern AI systems.
Universal One-third Time Scaling in Learning Peaked Distributions
oai:arXiv.org:2602.03685v2
arXiv:2602.03685v2 Announce Type: replace-cross
Abstract: Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
Inverse Depth Scaling From Most Layers Being Similar
oai:arXiv.org:2602.05970v2
arXiv:2602.05970v2 Announce Type: replace-cross
Abstract: Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss is inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
Sharpness-Aware Hybrid Model Learning for Architecture-Agnostic Parameter Estimation
oai:arXiv.org:2602.06837v2
arXiv:2602.06837v2 Announce Type: replace-cross
Abstract: Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, the unknown parameters of the scientific model cannot necessarily be estimated properly, since the flexibility of the machine learning model might cause the scientific model part to be effectively ignored in prediction. This can be avoided by applying some regularization, but the formulation of such regularizers typically depends on model architectures and domain knowledge. In this paper, we propose an architecture-agnostic method to learn hybrid models while properly estimating the scientific parameters. The idea is to use the flatness of loss minima to achieve model simplicity, based upon Occam's razor. We employ the idea of sharpness-aware minimization (SAM) and adapt it to the hybrid modeling setting. Numerical experiments demonstrate the effectiveness of the SAM-based hybrid model learning for scientific parameter estimation.
Collaborative and Efficient Fine-tuning: Leveraging Task Similarity
oai:arXiv.org:2602.07218v2
arXiv:2602.07218v2 Announce Type: replace-cross
Abstract: Adaptability has been regarded as a central feature of foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as the celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks should be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.
A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula
oai:arXiv.org:2602.10014v3
arXiv:2602.10014v3 Announce Type: replace-cross
Abstract: Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated through Monte-Carlo simulations and experiments spanning a synthetic graph-based reasoning task and multiple standard mathematical reasoning benchmarks.
WildCat: Near-Linear Attention in Theory and Practice
oai:arXiv.org:2602.10056v2
arXiv:2602.10056v2 Announce Type: replace-cross
Abstract: We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
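The subsampling routine named in the abstract, randomly pivoted Cholesky, can be sketched generically as a low-rank approximation of a positive semidefinite matrix. The code below is a plain-Python illustration of that generic routine only; it does not reproduce WildCat's coreset weighting or its attention-specific details, and the function name `rp_cholesky` and the small test matrix are assumptions.

```python
import math
import random


def rp_cholesky(A, k, seed=0):
    """Rank-k approximation A ~ F F^T of a PSD matrix A (list of lists).

    Pivots are sampled with probability proportional to the diagonal of the
    current residual A - F F^T. Returns (F, pivots) with F of shape n x k.
    """
    rng = random.Random(seed)
    n = len(A)
    F = [[0.0] * k for _ in range(n)]
    d = [A[i][i] for i in range(n)]           # residual diagonal
    pivots = []
    for j in range(k):
        if sum(d) <= 1e-12:
            break                             # residual is numerically zero
        s = rng.choices(range(n), weights=d)[0]
        # Column s of the residual matrix.
        g = [A[i][s] - sum(F[i][t] * F[s][t] for t in range(j)) for i in range(n)]
        scale = math.sqrt(g[s])
        for i in range(n):
            F[i][j] = g[i] / scale
            d[i] = max(d[i] - F[i][j] ** 2, 0.0)
        d[s] = 0.0                            # a chosen pivot is never re-sampled
        pivots.append(s)
    return F, pivots


# Hypothetical 3 x 3 positive definite matrix; k = 3 recovers it exactly
# up to rounding.
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
F, pivots = rp_cholesky(A, k=3)
```

Each step touches only one column of the matrix, so for k much smaller than n the routine reads far fewer than all n² entries, which is the property that makes it attractive for avoiding quadratic cost.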
Scaling Reproducibility: An AI-Assisted Workflow for Large-Scale Replication and Reanalysis
oai:arXiv.org:2602.16733v3
arXiv:2602.16733v3 Announce Type: replace-cross
Abstract: Computational reproducibility is central to scientific credibility, yet verifying published results at scale remains costly. We develop an AI-assisted workflow for automated full-paper replication -- retrieving materials, reconstructing environments, executing code, and matching outputs to point estimates reported in regression tables. We define a universe of all empirical and quantitative papers from the three top political science journals (2010--2025) and measure stated data availability using automated extraction. For a stratified sample of 384 studies, we apply the workflow to conduct full-paper replication, totaling 3,523 empirical models. We find that journal verification requirements, combined with data archiving mandates, drive reproducibility: the share of fully or largely reproducible papers rises from 20.8% before DA-RT adoption to 82.5% after, and conditional on accessible replication packages, 92.1% of papers are fully or largely reproducible (234/254). As a secondary application, we apply standardized IV diagnostics to 84 studies (597 IV specifications among 1,910 replicated models), illustrating how automated execution enables systematic reanalysis across heterogeneous empirical settings.
Robust Predictive Uncertainty and Double Descent in Contaminated Bayesian Random Features
oai:arXiv.org:2602.19126v2
arXiv:2602.19126v2 Announce Type: replace-cross
Abstract: We propose a robust Bayesian formulation of random feature (RF) regression that accounts explicitly for prior and likelihood misspecification via Huber-style contamination sets. Starting from the classical equivalence between ridge-regularized RF training and Bayesian inference with Gaussian priors and likelihoods, we replace the single prior and likelihood with $\epsilon$- and $\eta$-contaminated credal sets, respectively, and perform inference using pessimistic generalized Bayesian updating. We derive explicit and tractable bounds for the resulting lower and upper posterior predictive densities. These bounds show that, when contamination is moderate, prior and likelihood ambiguity effectively acts as a direct contamination of the posterior predictive distribution, yielding uncertainty envelopes around the classical Gaussian predictive. We introduce an Imprecise Highest Density Region (IHDR) for robust predictive uncertainty quantification and show that it admits an efficient approximation via an adjusted Gaussian credible interval. We further obtain predictive variance bounds (under a mild truncation approximation for the upper bound) and prove that they preserve the leading-order proportional-growth asymptotics known for RF models. Together, these results establish a robustness theory for Bayesian random features: predictive uncertainty remains computationally tractable, inherits the classical double-descent phase structure, and is improved by explicit worst-case guarantees under bounded prior and likelihood misspecification.
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
oai:arXiv.org:2602.23197v2
arXiv:2602.23197v2 Announce Type: replace-cross
Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.
Adaptive Window Selection for Financial Risk Forecasting
oai:arXiv.org:2603.01157v2
arXiv:2603.01157v2 Announce Type: replace-cross
Abstract: Risk forecasts in financial regulation and internal management are calculated through historical data. The unknown structural changes of financial data pose a substantial challenge in selecting an appropriate look-back window for risk modeling and forecasting. We develop a data-driven online learning method, called the bootstrap-based adaptive window selection (BAWS), that adaptively determines the window size in a sequential manner. A central component of BAWS is to compare the realized scores against a data-dependent threshold based on the bootstrap method. We provide an asymptotic justification for the bootstrap threshold, covering non-smooth scores such as the VaR check loss and the joint VaR--ES score, with an extension to stationary weakly dependent data via the moving block bootstrap. A single-break analysis further shows that BAWS rejects overlong windows crossing sufficiently large breaks. The proposed method is applicable to the forecasting of risk measures that are elicitable individually or jointly, such as the Value-at-Risk (VaR) and the pair of VaR and the corresponding Expected Shortfall. Through simulation studies and an empirical analysis, we demonstrate that BAWS often improves upon the standard rolling window approach and the recently developed method of stability-based adaptive window selection, especially when there are structural changes in the data-generating process.
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
oai:arXiv.org:2603.19005v2
arXiv:2603.19005v2 Announce Type: replace-cross
Abstract: Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform below the top quartile of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS .
Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation
oai:arXiv.org:2603.23398v3
arXiv:2603.23398v3 Announce Type: replace-cross
Abstract: Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery and materials design. In these domains, probabilistic inference is particularly valuable, as it enables composable generation and principled incorporation of desired constraints, such as structural or functional properties. Energy-based models naturally support this goal by capturing relative likelihoods and enabling composable inference by directly enforcing constraints during inference. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities, resulting in a fidelity gap compared to discrete diffusion models. To address this gap, we introduce Graph Energy Matching (GEM), a discrete generative framework inspired by the Jordan-Kinderlehrer-Otto (JKO) transport-map optimization perspective. GEM learns a permutation-invariant potential energy that simultaneously guides discrete transport from noise toward high-likelihood graph regions and refines samples within these regions. We further introduce a sampling protocol leveraging an energy-based switching strategy, seamlessly bridging rapid, gradient-guided transport and a local mixing regime for effective exploration. On molecular graph benchmarks, GEM matches or surpasses strong discrete diffusion baselines on most reported metrics. Beyond improving generation quality, GEM's relative likelihood modeling enables targeted exploration, facilitating compositional generation, property-constrained sampling, and interpolation between graphs. Project page: https://michalbalcerak.ai/graph-energy-matching/.
Revisiting Marked Galaxy Clustering from a Joint Point Process Perspective
oai:arXiv.org:2604.00578v2
arXiv:2604.00578v2 Announce Type: replace-cross
Abstract: Marked correlation functions, in which galaxy properties such as luminosity or stellar mass are treated as marks, are widely used to test models of galaxy formation. In astronomy, however, these statistics are typically implemented as summary measures that do not preserve the joint structure of mark pairs conditioned on separation. In this work, we formulate galaxies as points $(x,m)$ on the product space $\mathbb{R}^3\times\mathcal{M}$, where $x$ denotes position and $m$ a mark, and introduce the joint pair correlation function $g(r;m_1,m_2)$ as the fundamental quantity describing mark-dependent clustering. We further define a diagnostic quantity $\Delta_{\mathrm{ind}}(r;m_1,m_2)$ that locally quantifies deviations from the independence hypothesis relative to spatial clustering alone, thereby providing a projection-free description of which mark pairs are over- or underrepresented at a given separation scale. Within this framework, commonly used diagnostics such as the inhomogeneous cross-$J$ function are naturally interpreted as summary statistics obtained through averaging over mark sets and geometric-event-based reductions of the joint structure. This perspective clarifies that previously discussed marked effects, including assembly bias, correspond to projections of an underlying joint dependence, and that observationally accessible information is the existence of non-factorizable joint structure itself. The present formulation provides both a fundamental quantity and practical diagnostics for its characterization.
A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
oai:arXiv.org:2604.08149v2
arXiv:2604.08149v2 Announce Type: replace-cross
Abstract: We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al. (2022), in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts (called beliefs), rather than functions of the hidden states themselves. This simplified model may be handled through a direct reduction to standard linear contextual bandits. We extend the theoretical analysis of this reduction to take into account the estimation of the parameters of the hidden Markov model [HMM] in the regret bound and to provide high-probability bounds not depending anymore on the reward functions and only depending on the model through the estimation of the HMM parameters. Second, and most importantly, we instead study the more natural and more complex model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits). Under a classic HMM forgetting condition, the main algorithmic tool introduced to cope with the various statistical dependencies that the reward structure introduces is to only periodically update reward-model parameters.
U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
oai:arXiv.org:2604.09041v2
arXiv:2604.09041v2 Announce Type: replace-cross
Abstract: AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at $1.5^\circ$ resolution while reducing training compute by over $10\times$ compared to leading CRPS-based models and inference latency by over $10\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 15-day ensemble forecast in 3 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community.
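For readers unfamiliar with the CRPS named in the abstract above, here is a minimal sketch of the standard ensemble estimator, mean|x_i - y| - 0.5 * mean|x_i - x_j|, for a scalar forecast. It is not the authors' training loss; the function name `ensemble_crps` and the toy values are assumptions made for illustration.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of a scalar ensemble forecast against one observation.

    Uses the standard estimator E|X - y| - 0.5 * E|X - X'|, with both
    expectations taken over the ensemble members (all pairs, including i == j).
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    # Average absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Average absolute difference between member pairs (ensemble spread).
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread


# Hypothetical toy values: two members bracketing the observation.
score = ensemble_crps([0.0, 2.0], 1.0)
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is one way to see why Mean Absolute Error is a natural deterministic counterpart to the CRPS.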
Efficient Diffusion Models under Nonconvex Equality and Inequality constraints via Landing
oai:arXiv.org:2604.17838v2
arXiv:2604.17838v2 Announce Type: replace-cross
Abstract: Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets $\Sigma$ that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often ill-defined projections onto $\Sigma$, ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.
RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare
oai:arXiv.org:2605.12895v2
arXiv:2605.12895v2 Announce Type: replace-cross
Abstract: Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
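As a rough illustration of the Holm-Bonferroni correction mentioned in the abstract above, the sketch below implements the generic step-down procedure. How RISED maps corrected tests to PASS / FAIL / INCONCLUSIVE verdicts is not stated in the abstract, so this shows only the multiple-testing step; the function name `holm_bonferroni` and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-down Holm procedure: sort p-values ascending, compare the k-th
    smallest (k = 0, 1, ...) against alpha / (m - k), and stop at the first
    p-value that exceeds its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            rejected[idx] = True
        else:
            break  # every larger p-value is retained as well
    return rejected


# Hypothetical p-values for four tests.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

In this toy case the two smallest p-values are rejected; the third (0.03) exceeds its threshold of 0.05 / 2, so it and the remaining test are retained.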
Singular Asymptotics of SPADE in Quantum Source Discrimination
oai:arXiv.org:2605.14432v2
arXiv:2605.14432v2 Announce Type: replace-cross
Abstract: We study far-field discrimination between one and two incoherent point sources in the singular regime of weak and closely spaced emitters. Under ideal alignment, spatial-mode demultiplexing (SPADE) attains the quantum-optimal large-sample Stein exponent, but the finite-photon behavior near the one-source boundary and the effect of realistic imperfections remain less understood. Using singular learning theory, we analyze both the aligned and misaligned problems. In the aligned Gaussian case, we derive the zeta-function poles for direct imaging and SPADE, show that both share the same real log canonical threshold $\lambda=1/2$ but differ in multiplicity, and obtain the corresponding Bayes free-energy asymptotics. This yields a universal subleading advantage of aligned SPADE in the local prior-weighted regime. In the misaligned setting, we study a physically motivated binary-SPADE reduction that retains the full leading $O(s^2)$ leakage contrast near alignment, with corrections from the detailed higher-mode redistribution entering only at $O(s^4)$. We show that misaligned binary-SPADE and direct imaging acquire nontrivial local power on different intrinsic scales, $s=O(n^{-1/4})$ and $s=O(n^{-1/2})$, respectively. However, finite-$n$ Neyman--Pearson comparisons under common physical conditions reveal that direct imaging is stronger on the plotted grids and that misaligned binary-SPADE exhibits an exact blind separation $s^\ast=2\theta$, where its power collapses to $\alpha$. These results identify model singularity as a structural organizing principle for finite-photon quantum discrimination and clarify how ideal aligned SPADE benchmarks can fail to translate into finite-$n$ advantages under misalignment.
Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
oai:arXiv.org:2605.18694v2
arXiv:2605.18694v2 Announce Type: replace-cross
Abstract: Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3
Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification
oai:arXiv.org:2605.20716v5
arXiv:2605.20716v5 Announce Type: replace-cross
Abstract: Random forests construct each tree with a different, randomised representation of the feature space. Their uniform voting cannot correct errors in regions where trees with incorrect representations probabilistically outnumber correct ones, even when the ensemble collectively holds enough correct information - a reducible error that this paper addresses. We propose using the structural pattern of each tree's decision path as an instance-adaptive reliability signal to identify and differentially weight the more reliable trees. At inference, a random forest reaches its prediction through the root-to-leaf path the sample traverses in each tree, so path-level reliability offers a finer granularity than tree-level weighting can access. We show that this signal reflects the actual reliability of each tree's decision, and that using it yields a statistically significant accuracy improvement over RF on 36 binary classification benchmarks (Wilcoxon p < 0.0001). Class-recall regression - the typical failure mode of RF correction methods - is measured: zero minority-recall regressions and a single majority-recall regression at the 0.2 pp threshold, indicating bias reduction rather than a class trade-off. We further quantify the reducible error accessible to the method from the fitted RF alone; this estimate correlates strongly with per-dataset gain (Pearson r = +0.840, p < 0.0001). On the qualifying group it identifies, the method delivers a mean +0.99 pp accuracy improvement with strict wins on every dataset (7/0/0); an optional amplification mechanism further raises this to +1.48 pp.
Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos
oai:arXiv.org:2605.21648v2
arXiv:2605.21648v2 Announce Type: replace-cross
Abstract: We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos, and show that it predicts a simple, no-cost change to standard practice: \emph{front-loaded} dropout schedules cut test loss by \(18\)--\(35\%\) over constant dropout in MLPs and Vision Transformers at fixed budget. The theoretical mechanism is that dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a regularization-reach argument then selects front-loaded schedules, with accuracy gains as a consistent secondary effect. We also discuss how the same Gaussian-kernel structure extends the theory beyond MLPs toward CNNs and residual architectures.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol
oai:arXiv.org:2605.24583v3
arXiv:2605.24583v3 Announce Type: replace-cross
Abstract: Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.
Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent
oai:arXiv.org:2605.26431v2
arXiv:2605.26431v2 Announce Type: replace-cross
Abstract: Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design -- any non-zero effect therefore reflects structure beyond UD. The three conditions -- bare small clause, infinitival, and finite -- are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses.
Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions -- the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound.
Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates
oai:arXiv.org:2605.26919v2
arXiv:2605.26919v2 Announce Type: replace-cross
Abstract: Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., $O(1)$). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to $\Theta(T)$, where $T$ is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains $O(\log T)$, allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on three synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
oai:arXiv.org:2605.30188v2
arXiv:2605.30188v2 Announce Type: replace-cross
Abstract: Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
Beyond Additive Decompositions: Interpretability Through Separability
oai:arXiv.org:2605.31200v2
arXiv:2605.31200v2 Announce Type: replace-cross
Abstract: Interpretable machine learning requires models that are accurate and structurally faithful to the data. Existing explainability methods rely heavily on additive representations (e.g., Generalized Additive Models (GAMs), SHapley Additive exPlanations (SHAP), functional ANOVA), which can suffer from signal cancellation and off-support extrapolation in the presence of strong interactions. We propose Tensor Separation Learning (TSL), a regression model that learns a sum of rank-1 products of univariate per-feature functions via a stagewise greedy procedure with orthogonal refitting. By enforcing separability, TSL avoids the information loss inherent in additive projections caused by marginalizing higher-order interactions. The learned TSL model can be fully reconstructed from first-order partial dependence functions, up to constant factors. This stage-wise correspondence ensures that the resulting visualizations are faithful to the fitted components. We establish approximation-rate guarantees for functions with bounded mixed $p$-th order partial derivatives and demonstrate that TSL competes with black-box models on regression benchmarks.
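The functional form described in the abstract above, a sum of rank-1 products of univariate per-feature functions, can be written as f(x) = sum_k prod_j g_kj(x_j). The sketch below only evaluates such a model; it does not implement the stagewise greedy fitting or orthogonal refitting, and the names `separable_sum` and `toy_model` are hypothetical.

```python
from math import prod


def separable_sum(components, x):
    """Evaluate f(x) = sum_k prod_j g_kj(x_j).

    components: list of terms; each term is a list with one univariate
    callable per feature. x: sequence of feature values.
    """
    return sum(
        prod(g(x_j) for g, x_j in zip(term, x))
        for term in components
    )


# Hypothetical two-term model on two features:
# f(x1, x2) = x1 * x2 + 1 * (2 * x2)
toy_model = [
    [lambda a: a, lambda b: b],
    [lambda a: 1.0, lambda b: 2.0 * b],
]
value = separable_sum(toy_model, (2.0, 3.0))
```

The first term is a pure interaction (x1 * x2) that an additive model of the form h1(x1) + h2(x2) cannot represent exactly, which is the kind of structure the separable form is meant to capture.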