stat updates on arXiv.org

Improving conversion rate prediction via self-supervised pre-training in online advertising. (arXiv:2401.16432v1 [cs.IR])

back

The task of predicting conversion rates (CVR) lies at the heart of online advertising systems aiming to optimize bids to meet advertiser performance requirements. Even with the recent rise of deep neural networks, these predictions are often made by factorization machines (FM), especially in commercial settings where inference latency is key. These models are trained using the logistic regression framework on labeled tabular data formed from past user activity that is relevant to the task at hand. Many advertisers only care about click-attributed conversions. A major challenge in training models that predict conversions-given-clicks comes from data sparsity - clicks are rare, conversions attributed to clicks are even rarer. However, mitigating sparsity by adding conversions that are not click-attributed to the training set impairs model calibration. Since calibration is critical to achieving advertiser goals, this is infeasible. In this work we use the well-known idea of self-supervised pre-training, and use an auxiliary auto-encoder model trained on all conversion events, both click-attributed and not, as a feature extractor to enrich the main CVR prediction model. Since the main model does not train on non click-attributed conversions, this does not impair calibration. We adapt the basic self-supervised pre-training idea to our online advertising setup by using a loss function designed for tabular data, facilitating continual learning by ensuring auto-encoder stability, and incorporating a neural network into a large-scale real-time ad auction that ranks tens of thousands of ads, under strict latency constraints, and without incurring a major engineering cost. We show improvements both offline, during training, and in an online A/B test. Following its success in A/B tests, our solution is now fully deployed to the Yahoo native advertising system.

The Hubbert diffusion process: Estimation via simulated annealing and variable neighborhood search procedures. Application to forecasting peak oil production. (arXiv:2401.16447v1 [stat.ME])

back

Accurately charting the progress of oil production is a problem of great current interest. Oil production is widely known to be cyclical: in any given system, after it reaches its peak, a decline will begin. With this in mind, Marion King Hubbert developed his peak theory in 1956 based on the bell-shaped curve that bears his name. In the present work, we consider a stochasticmodel based on the theory of diffusion processes and associated with the Hubbert curve. The problem of the maximum likelihood estimation of the parameters for this process is also considered. Since a complex system of equations appears, with a solution that cannot be guaranteed by classical numerical procedures, we suggest the use of metaheuristic optimization algorithms such as simulated annealing and variable neighborhood search. Some strategies are suggested for bounding the space of solutions, and a description is provided for the application of the algorithms selected. In the case of the variable neighborhood search algorithm, a hybrid method is proposed in which it is combined with simulated annealing. In order to validate the theory developed here, we also carry out some studies based on simulated data and consider 2 real crude oil production scenarios from Norway and Kazakhstan.

Active learning of Boltzmann samplers and potential energies with quantum mechanical accuracy. (arXiv:2401.16487v1 [physics.chem-ph])

back

Extracting consistent statistics between relevant free-energy minima of a molecular system is essential for physics, chemistry and biology. Molecular dynamics (MD) simulations can aid in this task but are computationally expensive, especially for systems that require quantum accuracy. To overcome this challenge, we develop an approach combining enhanced sampling with deep generative models and active learning of a machine learning potential (MLP). We introduce an adaptive Markov chain Monte Carlo framework that enables the training of one Normalizing Flow (NF) and one MLP per state. We simulate several Markov chains in parallel until they reach convergence, sampling the Boltzmann distribution with an efficient use of energy evaluations. At each iteration, we compute the energy of a subset of the NF-generated configurations using Density Functional Theory (DFT), we predict the remaining configuration's energy with the MLP and actively train the MLP using the DFT-computed energies. Leveraging the trained NF and MLP models, we can compute thermodynamic observables such as free-energy differences or optical spectra. We apply this method to study the isomerization of an ultrasmall silver nanocluster, belonging to a set of systems with diverse applications in the fields of medicine and catalysis.

Topological Detection of Phenomenological Bifurcations with Unreliable Kernel Densities. (arXiv:2401.16563v1 [math.AT])

back

Phenomenological (P-type) bifurcations are qualitative changes in stochastic dynamical systems whereby the stationary probability density function (PDF) changes its topology. The current state of the art for detecting these bifurcations requires reliable kernel density estimates computed from an ensemble of system realizations. However, in several real world signals such as Big Data, only a single system realization is available -- making it impossible to estimate a reliable kernel density. This study presents an approach for detecting P-type bifurcations using unreliable density estimates. The approach creates an ensemble of objects from Topological Data Analysis (TDA) called persistence diagrams from the system's sole realization and statistically analyzes the resulting set. We compare several methods for replicating the original persistence diagram including Gibbs point process modelling, Pairwise Interaction Point Modelling, and subsampling. We show that for the purpose of predicting a bifurcation, the simple method of subsampling exceeds the other two methods of point process modelling in performance.

Parallel Affine Transformation Tuning of Markov Chain Monte Carlo. (arXiv:2401.16567v1 [stat.ME])

back

The performance of Markov chain Monte Carlo samplers strongly depends on the properties of the target distribution such as its covariance structure, the location of its probability mass and its tail behavior. We explore the use of bijective affine transformations of the sample space to improve the properties of the target distribution and thereby the performance of samplers running in the transformed space. In particular, we propose a flexible and user-friendly scheme for adaptively learning the affine transformation during sampling. Moreover, the combination of our scheme with Gibbsian polar slice sampling is shown to produce samples of high quality at comparatively low computational cost in several settings based on real-world data.

Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons. (arXiv:2401.16571v1 [stat.ME])

back

Heterogeneous treatment effect estimation is an important problem in precision medicine. Specific interests lie in identifying the differential effect of different treatments based on some external covariates. We propose a novel non-parametric treatment effect estimation method in a multi-treatment setting. Our non-parametric modeling of the response curves relies on radial basis function (RBF)-nets with shared hidden neurons. Our model thus facilitates modeling commonality among the treatment outcomes. The estimation and inference schemes are developed under a Bayesian framework and implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of the analysis. The numerical performance of the method is demonstrated through simulation experiments. Applying our proposed method to MIMIC data, we obtain several interesting findings related to the impact of different treatment strategies on the length of ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged.

PrIsing: Privacy-Preserving Peer Effect Estimation via Ising Model. (arXiv:2401.16596v1 [stat.ME])

back

The Ising model, originally developed as a spin-glass model for ferromagnetic elements, has gained popularity as a network-based model for capturing dependencies in agents' outputs. Its increasing adoption in healthcare and the social sciences has raised privacy concerns regarding the confidentiality of agents' responses. In this paper, we present a novel $(\varepsilon,\delta)$-differentially private algorithm specifically designed to protect the privacy of individual agents' outcomes. Our algorithm allows for precise estimation of the natural parameter using a single network through an objective perturbation technique. Furthermore, we establish regret bounds for this algorithm and assess its performance on synthetic datasets and two real-world networks: one involving HIV status in a social network and the other concerning the political leaning of online blogs.

Probabilistic Context Neighborhood Model for Lattices. (arXiv:2401.16598v1 [stat.ME])

back

We present the Probabilistic Context Neighborhood model designed for two-dimensional lattices as a variation of a Markov Random Field assuming discrete values. In this model, the neighborhood structure has a fixed geometry but a variable order, depending on the neighbors' values. Our model extends the Probabilistic Context Tree model, originally applicable to one-dimensional space. It retains advantageous properties, such as representing the dependence neighborhood structure as a graph in a tree format, facilitating an understanding of model complexity. Furthermore, we adapt the algorithm used to estimate the Probabilistic Context Tree to estimate the parameters of the proposed model. We illustrate the accuracy of our estimation methodology through simulation studies. Additionally, we apply the Probabilistic Context Neighborhood model to spatial real-world data, showcasing its practical utility.

Learning a Gaussian Mixture for Sparsity Regularization in Inverse Problems. (arXiv:2401.16612v1 [stat.ML])

back

In inverse problems, it is widely recognized that the incorporation of a sparsity prior yields a regularization effect on the solution. This approach is grounded on the a priori assumption that the unknown can be appropriately represented in a basis with a limited number of significant components, while most coefficients are close to zero. This occurrence is frequently observed in real-world scenarios, such as with piecewise smooth signals. In this study, we propose a probabilistic sparsity prior formulated as a mixture of degenerate Gaussians, capable of modeling sparsity with respect to a generic basis. Under this premise, we design a neural network that can be interpreted as the Bayes estimator for linear inverse problems. Additionally, we put forth both a supervised and an unsupervised training strategy to estimate the parameters of this network. To evaluate the effectiveness of our approach, we conduct a numerical comparison with commonly employed sparsity-promoting regularization techniques, namely LASSO, group LASSO, iterative hard thresholding, and sparse coding/dictionary learning. Notably, our reconstructions consistently exhibit lower mean square error values across all $1$D datasets utilized for the comparisons, even in cases where the datasets significantly deviate from a Gaussian mixture model.

A constructive approach to selective risk control. (arXiv:2401.16651v1 [stat.ME])

back

Many modern applications require the use of data to both select the statistical tasks and make valid inference after selection. In this article, we provide a unifying approach to control for a class of selective risks. Our method is motivated by a reformulation of the celebrated Benjamini-Hochberg (BH) procedure for multiple hypothesis testing as the iterative limit of the Benjamini-Yekutieli (BY) procedure for constructing post-selection confidence intervals. Although several earlier authors have made noteworthy observations related to this, our discussion highlights that (1) the BH procedure is precisely the fixed-point iteration of the BY procedure; (2) the fact that the BH procedure controls the false discovery rate is almost an immediate corollary of the fact that the BY procedure controls the false coverage-statement rate. Building on this observation, we propose a constructive approach to control extra-selection risk (selection made after decision) by iterating decision strategies that control the post-selection risk (decision made after selection), and show that many previous methods and results are special cases of this general framework. We further extend this approach to problems with multiple selective risks and demonstrate how new methods can be developed. Our development leads to two surprising results about the BH procedure: (1) in the context of one-sided location testing, the BH procedure not only controls the false discovery rate at the null but also at other locations for free; (2) in the context of permutation tests, the BH procedure with exact permutation p-values can be well approximated by a procedure which only requires a total number of permutations that is almost linear in the total number of hypotheses.

Rademacher Complexity of Neural ODEs via Chen-Fliess Series. (arXiv:2401.16655v1 [stat.ML])

back

We show how continuous-depth neural ODE models can be framed as single-layer, infinite-width nets using the Chen--Fliess series expansion for nonlinear ODEs. In this net, the output ''weights'' are taken from the signature of the control input -- a tool used to represent infinite-dimensional paths as a sequence of tensors -- which comprises iterated integrals of the control input over a simplex. The ''features'' are taken to be iterated Lie derivatives of the output function with respect to the vector fields in the controlled ODE model. The main result of this work applies this framework to derive compact expressions for the Rademacher complexity of ODE models that map an initial condition to a scalar output at some terminal time. The result leverages the straightforward analysis afforded by single-layer architectures. We conclude with some examples instantiating the bound for some specific systems and discuss potential follow-up work.

A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information. (arXiv:2401.16660v1 [stat.ME])

back

The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probability analyses of decision models and perform efficient EVSI calculations.

Generalization of LiNGAM that allows confounding. (arXiv:2401.16661v1 [cs.LG])

back

LiNGAM determines the variable order from cause to effect using additive noise models, but it faces challenges with confounding. Previous methods maintained LiNGAM's fundamental structure while trying to identify and address variables affected by confounding. As a result, these methods required significant computational resources regardless of the presence of confounding, and they did not ensure the detection of all confounding types. In contrast, this paper enhances LiNGAM by introducing LiNGAM-MMI, a method that quantifies the magnitude of confounding using KL divergence and arranges the variables to minimize its impact. This method efficiently achieves a globally optimal variable order through the shortest path problem formulation. LiNGAM-MMI processes data as efficiently as traditional LiNGAM in scenarios without confounding while effectively addressing confounding situations. Our experimental results suggest that LiNGAM-MMI more accurately determines the correct variable order, both in the presence and absence of confounding.

Sharp variance estimator and causal bootstrap in stratified randomized experiments. (arXiv:2401.16667v1 [math.ST])

back

The finite-population asymptotic theory provides a normal approximation for the sampling distribution of the average treatment effect estimator in stratified randomized experiments. The asymptotic variance is often estimated by a Neyman-type conservative variance estimator. However, the variance estimator can be overly conservative, and the asymptotic theory may fail in small samples. To solve these issues, we propose a sharp variance estimator for the difference-in-means estimator weighted by the proportion of stratum sizes in stratified randomized experiments. Furthermore, we propose two causal bootstrap procedures to more accurately approximate the sampling distribution of the weighted difference-in-means estimator. The first causal bootstrap procedure is based on rank-preserving imputation and we show that it has second-order refinement over normal approximation. The second causal bootstrap procedure is based on sharp null imputation and is applicable in paired experiments. Our analysis is randomization-based or design-based by conditioning on the potential outcomes, with treatment assignment being the sole source of randomness. Numerical studies and real data analyses demonstrate advantages of our proposed methods in finite samples.

Polynomial Chaos Expansions on Principal Geodesic Grassmannian Submanifolds for Surrogate Modeling and Uncertainty Quantification. (arXiv:2401.16683v1 [stat.ML])

back

In this work we introduce a manifold learning-based surrogate modeling framework for uncertainty quantification in high-dimensional stochastic systems. Our first goal is to perform data mining on the available simulation data to identify a set of low-dimensional (latent) descriptors that efficiently parameterize the response of the high-dimensional computational model. To this end, we employ Principal Geodesic Analysis on the Grassmann manifold of the response to identify a set of disjoint principal geodesic submanifolds, of possibly different dimension, that captures the variation in the data. Since operations on the Grassmann require the data to be concentrated, we propose an adaptive algorithm based on Riemanniann K-means and the minimization of the sample Frechet variance on the Grassmann manifold to identify "local" principal geodesic submanifolds that represent different system behavior across the parameter space. Polynomial chaos expansion is then used to construct a mapping between the random input parameters and the projection of the response on these local principal geodesic submanifolds. The method is demonstrated on four test cases, a toy-example that involves points on a hypersphere, a Lotka-Volterra dynamical system, a continuous-flow stirred-tank chemical reactor system, and a two-dimensional Rayleigh-Benard convection problem

A Detailed Historical and Statistical Analysis of the Influence of Hardware Artifacts on SPEC Integer Benchmark Performance. (arXiv:2401.16690v1 [cs.CY])

back

The Standard Performance Evaluation Corporation (SPEC) CPU benchmark has been widely used as a measure of computing performance for decades. The SPEC is an industry-standardized, CPU-intensive benchmark suite and the collective data provide a proxy for the history of worldwide CPU and system performance. Past efforts have not provided or enabled answers to questions such as, how has the SPEC benchmark suite evolved empirically over time and what micro-architecture artifacts have had the most influence on performance? -- have any micro-benchmarks within the suite had undue influence on the results and comparisons among the codes? -- can the answers to these questions provide insights to the future of computer system performance? To answer these questions, we detail our historical and statistical analysis of specific hardware artifacts (clock frequencies, core counts, etc.) on the performance of the SPEC benchmarks since 1995. We discuss in detail several methods to normalize across benchmark evolutions. We perform both isolated and collective sensitivity analyses for various hardware artifacts and we identify one benchmark (libquantum) that had somewhat undue influence on performance outcomes. We also present the use of SPEC data to predict future performance.

Multivariate Priors and the Linearity of Optimal Bayesian Estimators under Gaussian Noise. (arXiv:2401.16701v1 [math.ST])

back

Consider the task of estimating a random vector $X$ from noisy observations $Y = X + Z$, where $Z$ is a standard normal vector, under the $L^p$ fidelity criterion. This work establishes that, for $1 \leq p \leq 2$, the optimal Bayesian estimator is linear and positive definite if and only if the prior distribution on $X$ is a (non-degenerate) multivariate Gaussian. Furthermore, for $p > 2$, it is demonstrated that there are infinitely many priors that can induce such an estimator.

Bayesian scalar-on-network regression with applications to brain functional connectivity. (arXiv:2401.16749v1 [stat.ME])

back

This study presents a Bayesian regression framework to model the relationship between scalar outcomes and brain functional connectivity represented as symmetric positive definite (SPD) matrices. Unlike many proposals that simply vectorize the connectivity predictors thereby ignoring their matrix structures, our method respects the Riemannian geometry of SPD matrices by modelling them in a tangent space. We perform dimension reduction in the tangent space, relating the resulting low-dimensional representations with the responses. The dimension reduction matrix is learnt in a supervised manner with a sparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting. Our method yields a parsimonious regression model that allows uncertainty quantification of the estimates and identification of key brain regions that predict the outcomes. We demonstrate the performance of our approach in simulation settings and through a case study to predict Picture Vocabulary scores using data from the Human Connectome Project.

Leveraging Nested MLMC for Sequential Neural Posterior Estimation with Intractable Likelihoods. (arXiv:2401.16776v1 [stat.CO])

back

Sequential neural posterior estimation (SNPE) techniques have been recently proposed for dealing with simulation-based models with intractable likelihoods. They are devoted to learning the posterior from adaptively proposed simulations using neural network-based conditional density estimators. As a SNPE technique, the automatic posterior transformation (APT) method proposed by Greenberg et al. (2019) performs notably and scales to high dimensional data. However, the APT method bears the computation of an expectation of the logarithm of an intractable normalizing constant, i.e., a nested expectation. Although atomic APT was proposed to solve this by discretizing the normalizing constant, it remains challenging to analyze the convergence of learning. In this paper, we propose a nested APT method to estimate the involved nested expectation instead. This facilitates establishing the convergence analysis. Since the nested estimators for the loss function and its gradient are biased, we make use of unbiased multi-level Monte Carlo (MLMC) estimators for debiasing. To further reduce the excessive variance of the unbiased estimators, this paper also develops some truncated MLMC estimators by taking account of the trade-off between the bias and the average cost. Numerical experiments for approximating complex posteriors with multimodal in moderate dimensions are provided.

Simulating signed mixtures. (arXiv:2401.16828v1 [stat.CO])

back

Simulating mixtures of distributions with signed weights proves a challenge as standard simulation algorithms are inefficient in handling the negative weights. In particular, the natural representation of mixture variates as associated with latent component indicators is no longer available. We propose here an exact accept-reject algorithm in the general case of finite signed mixtures that relies on optimaly pairing positive and negative components and designing a stratified sampling scheme on pairs. We analyze the performances of our approach, relative to the inverse cdf approach, since the cdf of the distribution remains available for standard signed mixtures.

Analysis of Knowledge Tracing performance on synthesised student data. (arXiv:2401.16832v1 [cs.CY])

back

Knowledge Tracing (KT) aims to predict the future performance of students by tracking the development of their knowledge states. Despite all the recent progress made in this field, the application of KT models in education systems is still restricted from the data perspectives: 1) limited access to real life data due to data protection concerns, 2) lack of diversity in public datasets, 3) noises in benchmark datasets such as duplicate records. To resolve these problems, we simulated student data with three statistical strategies based on public datasets and tested their performance on two KT baselines. While we observe only minor performance improvement with additional synthetic data, our work shows that using only synthetic data for training can lead to similar performance as real data.

CAFCT: Contextual and Attentional Feature Fusions of Convolutional Neural Networks and Transformer for Liver Tumor Segmentation. (arXiv:2401.16886v1 [cs.CV])

back

Medical image semantic segmentation techniques can help identify tumors automatically from computed tomography (CT) scans. In this paper, we propose a Contextual and Attentional feature Fusions enhanced Convolutional Neural Network (CNN) and Transformer hybrid network (CAFCT) model for liver tumor segmentation. In the proposed model, three other modules are introduced in the network architecture: Attentional Feature Fusion (AFF), Atrous Spatial Pyramid Pooling (ASPP) of DeepLabv3, and Attention Gates (AGs) to improve contextual information related to tumor boundaries for accurate segmentation. Experimental results show that the proposed CAFCT achieves a mean Intersection over Union (IoU) of 90.38% and Dice score of 86.78%, respectively, on the Liver Tumor Segmentation Benchmark (LiTS) dataset, outperforming pure CNN or Transformer methods, e.g., Attention U-Net, and PVTFormer.

Learning Properties of Quantum States Without the I.I.D. Assumption. (arXiv:2401.16922v1 [quant-ph])

back

We develop a framework for learning properties of quantum states beyond the assumption of independent and identically distributed (i.i.d.) input states. We prove that, given any learning problem (under reasonable assumptions), an algorithm designed for i.i.d. input states can be adapted to handle input states of any nature, albeit at the expense of a polynomial increase in copy complexity. Furthermore, we establish that algorithms which perform non-adaptive incoherent measurements can be extended to encompass non-i.i.d. input states while maintaining comparable error probabilities. This allows us, among others applications, to generalize the classical shadows of Huang, Kueng, and Preskill to the non-i.i.d. setting at the cost of a small loss in efficiency. Additionally, we can efficiently verify any pure state using Clifford measurements, in a way that is independent of the ideal state. Our main techniques are based on de Finetti-style theorems supported by tools from information theory. In particular, we prove a new randomized local de Finetti theorem that can be of independent interest.

Dynamical System Identification, Model Selection and Model Uncertainty Quantification by Bayesian Inference. (arXiv:2401.16943v1 [stat.ME])

back

This study presents a Bayesian maximum \textit{a~posteriori} (MAP) framework for dynamical system identification from time-series data. This is shown to be equivalent to a generalized zeroth-order Tikhonov regularization, providing a rational justification for the choice of the residual and regularization terms, respectively, from the negative logarithms of the likelihood and prior distributions. In addition to the estimation of model coefficients, the Bayesian interpretation gives access to the full apparatus for Bayesian inference, including the ranking of models, the quantification of model uncertainties and the estimation of unknown (nuisance) hyperparameters. Two Bayesian algorithms, joint maximum \textit{a~posteriori} (JMAP) and variational Bayesian approximation (VBA), are compared to the popular SINDy algorithm for thresholded least-squares regression, by application to several dynamical systems with added noise. For multivariate Gaussian likelihood and prior distributions, the Bayesian formulation gives Gaussian posterior and evidence distributions, in which the numerator terms can be expressed in terms of the Mahalanobis distance or ``Gaussian norm'' $||\vy-\hat{\vy}||^2_{M^{-1}} = (\vy-\hat{\vy})^\top {M^{-1}} (\vy-\hat{\vy})$, where $\vy$ is a vector variable, $\hat{\vy}$ is its estimator and $M$ is the covariance matrix. The posterior Gaussian norm is shown to provide a robust metric for quantitative model selection.

Nonparametric latency estimation for mixture cure models. (arXiv:2401.16954v1 [stat.ME])

back

A nonparametric latency estimator for mixture cure models is studied in this paper. An i.i.d. representation is obtained, the asymptotic mean squared error of the latency estimator is found, and its asymptotic normality is proven. A bootstrap bandwidth selection method is introduced and its efficiency is evaluated in a simulation study. The proposed methods are applied to a dataset of colorectal cancer patients in the University Hospital of A Coru\~na (CHUAC).

Partial correlation graphs for continuous-parameter time series. (arXiv:2401.16970v1 [math.ST])

back

In this paper, we establish the partial correlation graph for multivariate continuous-time stochastic processes, assuming only that the underlying process is stationary and mean-square continuous with expectation zero and spectral density function. In the partial correlation graph, the vertices are the components of the process and the undirected edges represent partial correlations between the vertices. To define this graph, we therefore first introduce the partial correlation relation for continuous-time processes and provide several equivalent characterisations. In particular, we establish that the partial correlation relation defines a graphoid. The partial correlation graph additionally satisfies the usual Markov properties and the edges can be determined very easily via the inverse of the spectral density function. Throughout the paper, we compare and relate the partial correlation graph to the mixed (local) causality graph of Fasen-Hartmann and Schenk (2023a). Finally, as an example, we explicitly characterise and interpret the edges in the partial correlation graph for the popular multivariate continuous-time AR (MCAR) processes.

Multiple Yield Curve Modeling and Forecasting using Deep Learning. (arXiv:2401.16985v1 [stat.ML])

back

This manuscript introduces deep learning models that simultaneously describe the dynamics of several yield curves. We aim to learn the dependence structure among the different yield curves induced by the globalization of financial markets and exploit it to produce more accurate forecasts. By combining the self-attention mechanism and nonparametric quantile regression, our model generates both point and interval forecasts of future yields. The architecture is designed to avoid quantile crossing issues affecting multiple quantile regression models. Numerical experiments conducted on two different datasets confirm the effectiveness of our approach. Finally, we explore potential extensions and enhancements by incorporating deep ensemble methods and transfer learning mechanisms.

Causal Machine Learning for Cost-Effective Allocation of Development Aid. (arXiv:2401.16986v1 [stat.ML])

back

The Sustainable Development Goals (SDGs) of the United Nations provide a blueprint of a better future by 'leaving no one behind', and, to achieve the SDGs by 2030, poor countries require immense volumes of development aid. In this paper, we develop a causal machine learning framework for predicting heterogeneous treatment effects of aid disbursements to inform effective aid allocation. Specifically, our framework comprises three components: (i) a balancing autoencoder that uses representation learning to embed high-dimensional country characteristics while addressing treatment selection bias; (ii) a counterfactual generator to compute counterfactual outcomes for varying aid volumes to address small sample-size settings; and (iii) an inference model that is used to predict heterogeneous treatment-response curves. We demonstrate the effectiveness of our framework using data with official development aid earmarked to end HIV/AIDS in 105 countries, amounting to more than USD 5.2 billion. For this, we first show that our framework successfully computes heterogeneous treatment-response curves using semi-synthetic data. Then, we demonstrate our framework using real-world HIV data. Our framework points to large opportunities for a more effective aid allocation, suggesting that the total number of new HIV infections could be reduced by up to 3.3% (~50,000 cases) compared to the current allocation practice.

Recovering target causal effects from post-exposure selection induced by missing outcome data. (arXiv:2401.16990v1 [stat.ME])

back

Confounding bias and selection bias are two significant challenges to the validity of conclusions drawn from applied causal inference. The latter can arise through informative missingness, wherein relevant information about units in the target population is missing, censored, or coarsened due to factors related to the exposure, the outcome, or their consequences. We extend existing graphical criteria to address selection bias induced by missing outcome data by leveraging post-exposure variables. We introduce the Sequential Adjustment Criteria (SAC), which support recovering causal effects through sequential regressions. A refined estimator is further developed by applying Targeted Minimum-Loss Estimation (TMLE). Under certain regularity conditions, this estimator is multiply-robust, ensuring consistency even in scenarios where the Inverse Probability Weighting (IPW) and the sequential regressions approaches fall short. A simulation exercise featuring various toy scenarios compares the relative bias and robustness of the two proposed solutions against other estimators. As a motivating application case, we study the effects of pharmacological treatment for Attention-Deficit/Hyperactivity Disorder (ADHD) upon the scores obtained by diagnosed Norwegian schoolchildren in national tests using observational data ($n=9\,352$). Our findings support the accumulated clinical evidence affirming a positive but small effect of stimulant medication on school performance. A small positive selection bias was identified, indicating that the treatment effect may be even more modest for those exempted or abstained from the tests.

A Unified Three-State Model Framework for Analysis of Treatment Crossover in Survival Trials. (arXiv:2401.17008v1 [stat.ME])

back

We present a unified three-state model (TSM) framework for evaluating treatment effects in clinical trials in the presence of treatment crossover. Researchers have proposed diverse methodologies to estimate the treatment effect that would have hypothetically been observed if treatment crossover had not occurred. However, there is little work on understanding the connections between these different approaches from a statistical point of view. Our proposed TSM framework unifies existing methods, effectively identifying potential biases, model assumptions, and inherent limitations for each method. This can guide researchers in understanding when these methods are appropriate and choosing a suitable approach for their data. The TSM framework also facilitates the creation of new methods to adjust for confounding effects from treatment crossover. To illustrate this capability, we introduce a new imputation method that falls under its scope. Using a piecewise constant prior for the hazard, our proposed method directly estimates the hazard function with increased flexibility. Through simulation experiments, we demonstrate the performance of different approaches for estimating the treatment effects.

Bayesian Optimization with Noise-Free Observations: Improved Regret Bounds via Random Exploration. (arXiv:2401.17037v1 [cs.LG])

back

This paper studies Bayesian optimization with noise-free observations. We introduce new algorithms rooted in scattered data approximation that rely on a random exploration step to ensure that the fill-distance of query points decays at a near-optimal rate. Our algorithms retain the ease of implementation of the classical GP-UCB algorithm and satisfy cumulative regret bounds that nearly match those conjectured in <a href="/abs/2002.05096">arXiv:2002.05096</a>, hence solving a COLT open problem. Furthermore, the new algorithms outperform GP-UCB and other popular Bayesian optimization strategies in several examples.

back

Nearest-neighbor methods have become popular in statistics and play a key role in statistical learning. Important decisions in nearest-neighbor methods concern the variables to use (when many potential candidates exist) and how to measure the dissimilarity between units. The first decision depends on the scope of the application while second depends mainly on the type of variables. Unfortunately, relatively few options permit to handle mixed-type variables, a situation frequently encountered in practical applications. The most popular dissimilarity for mixed-type variables is derived as the complement to one of the Gower's similarity coefficient. It is appealing because ranges between 0 and 1, being an average of the scaled dissimilarities calculated variable by variable, handles missing values and allows for a user-defined weighting scheme when averaging dissimilarities. The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the single variables to the overall dissimilarity. We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity. In particular, this note proposes different approaches for measuring the correlation depending on the type of variables. The performances of the proposed approaches are evaluated in simulation studies related to classification and imputation of missing values.

Dynamical Survival Analysis with Controlled Latent States. (arXiv:2401.17077v1 [stat.ML])

back

We consider the task of learning individual-specific intensities of counting processes from a set of static variables and irregularly sampled time series. We introduce a novel modelization approach in which the intensity is the solution to a controlled differential equation. We first design a neural estimator by building on neural controlled differential equations. In a second time, we show that our model can be linearized in the signature space under sufficient regularity conditions, yielding a signature-based estimator which we call CoxSig. We provide theoretical learning guarantees for both estimators, before showcasing the performance of our models on a vast array of simulated and real-world datasets from finance, predictive maintenance and food supply chain management.

Nonparametric covariate hypothesis tests for the cure rate in mixture cure models. (arXiv:2401.17110v1 [stat.ME])

back

In lifetime data, like cancer studies, theremay be long term survivors, which lead to heavy censoring at the end of the follow-up period. Since a standard survival model is not appropriate to handle these data, a cure model is needed. In the literature, covariate hypothesis tests for cure models are limited to parametric and semiparametric methods.We fill this important gap by proposing a nonparametric covariate hypothesis test for the probability of cure in mixture cure models. A bootstrap method is proposed to approximate the null distribution of the test statistic. The procedure can be applied to any type of covariate, and could be extended to the multivariate setting. Its efficiency is evaluated in a Monte Carlo simulation study. Finally, the method is applied to a colorectal cancer dataset.

Test for high-dimensional mean vectors via the weighted $L_2$-norm. (arXiv:2401.17143v1 [math.ST])

back

In this paper, we propose a novel approach to test the equality of high-dimensional mean vectors of several populations via the weighted $L_2$-norm. We establish the asymptotic normality of the test statistics under the null hypothesis. We also explain theoretically why our test statistics can be highly useful in weakly dense cases when the nonzero signal in mean vectors is present. Furthermore, we compare the proposed test with existing tests using simulation results, demonstrating that the weighted $L_2$-norm-based test statistic exhibits favorable properties in terms of both size and power.

Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models. (arXiv:2401.17152v1 [stat.ME])

back

A completely nonparametric method for the estimation of mixture cure models is proposed. A nonparametric estimator of the incidence is extensively studied and a nonparametric estimator of the latency is presented. These estimators, which are based on the Beran estimator of the conditional survival function, are proved to be the local maximum likelihood estimators. An i.i.d. representation is obtained for the nonparametric incidence estimator. As a consequence, an asymptotically optimal bandwidth is found. Moreover, a bootstrap bandwidth selection method for the nonparametric incidence estimator is proposed. The introduced nonparametric estimators are compared with existing semiparametric approaches in a simulation study, in which the performance of the bootstrap bandwidth selector is also assessed. Finally, the method is applied to a database of colorectal cancer from the University Hospital of A Coru\~na (CHUAC).

An analytic approach for understanding mechanisms driving breakthrough infections. (arXiv:2401.17164v1 [stat.AP])

back

Real world data is an increasingly utilized resource for post-market monitoring of vaccines and provides insight into real world effectiveness. However, outside of the setting of a clinical trial, heterogeneous mechanisms may drive observed breakthrough infection rates among vaccinated individuals; for instance, waning vaccine-induced immunity as time passes and the emergence of a new strain against which the vaccine has reduced protection. Analyses of infection incidence rates are typically predicated on a presumed mechanism in their choice of an "analytic time zero" after which infection rates are modeled. In this work, we propose an explicit test for driving mechanism situated in a standard Cox proportional hazards framework. We explore the test's performance in simulation studies and in an illustrative application to real world data. We additionally introduce subgroup differences in infection incidence and evaluate the impact of time zero misspecification on bias and coverage of model estimates. In this study we observe strong power and controlled type I error of the test to detect the correct infection-driving mechanism under various settings. Similar to previous studies, we find mitigated bias and greater coverage of estimates when the analytic time zero is correctly specified or accounted for.

Adaptive Experiment Design with Synthetic Controls. (arXiv:2401.17205v1 [stat.ML])

back

Clinical trials are typically run in order to understand the effects of a new treatment on a given population of patients. However, patients in large populations rarely respond the same way to the same treatment. This heterogeneity in patient responses necessitates trials that investigate effects on multiple subpopulations - especially when a treatment has marginal or no benefit for the overall population but might have significant benefit for a particular subpopulation. Motivated by this need, we propose Syntax, an exploratory trial design that identifies subpopulations with positive treatment effect among many subpopulations. Syntax is sample efficient as it (i) recruits and allocates patients adaptively and (ii) estimates treatment effects by forming synthetic controls for each subpopulation that combines control samples from other subpopulations. We validate the performance of Syntax and provide insights into when it might have an advantage over conventional trial designs through experiments.

Joint model with latent disease age: overcoming the need for reference time. (arXiv:2401.17249v1 [stat.ME])

back

Introduction: Heterogeneity of the progression of neurodegenerative diseases is one of the main challenges faced in developing effective therapies. With the increasing number of large clinical databases, disease progression models have led to a better understanding of this heterogeneity. Nevertheless, these diseases may have no clear onset and biological underlying processes may start before the first symptoms. Such an ill-defined disease reference time is an issue for current joint models, which have proven their effectiveness by combining longitudinal and survival data. Objective In this work, we propose a joint non-linear mixed effect model with a latent disease age, to overcome this need for a precise reference time. Method: To do so, we utilized an existing longitudinal model with a latent disease age as a longitudinal sub-model and associated it with a survival sub-model that estimates a Weibull distribution from the latent disease age. We then validated our model on different simulated scenarios. Finally, we benchmarked our model with a state-of-the-art joint model and reference survival and longitudinal models on simulated and real data in the context of Amyotrophic Lateral Sclerosis (ALS). Results: On real data, our model got significantly better results than the state-of-the-art joint model for absolute bias (4.21(4.41) versus 4.24(4.14)(p-value=1.4e-17)), and mean cumulative AUC for right censored events (0.67(0.07) versus 0.61(0.09)(p-value=1.7e-03)). Conclusion: We showed that our approach is better suited than the state-of-the-art in the context where the reference time is not reliable. This work opens up the perspective to design predictive and personalized therapeutic strategies.

Effect of Weight Quantization on Learning Models by Typical Case Analysis. (arXiv:2401.17269v1 [stat.ML])

back

This paper examines the quantization methods used in large-scale data analysis models and their hyperparameter choices. The recent surge in data analysis scale has significantly increased computational resource requirements. To address this, quantizing model weights has become a prevalent practice in data analysis applications such as deep learning. Quantization is particularly vital for deploying large models on devices with limited computational resources. However, the selection of quantization hyperparameters, like the number of bits and value range for weight quantization, remains an underexplored area. In this study, we employ the typical case analysis from statistical physics, specifically the replica method, to explore the impact of hyperparameters on the quantization of simple learning models. Our analysis yields three key findings: (i) an unstable hyperparameter phase, known as replica symmetry breaking, occurs with a small number of bits and a large quantization width; (ii) there is an optimal quantization width that minimizes error; and (iii) quantization delays the onset of overparameterization, helping to mitigate overfitting as indicated by the double descent phenomenon. We also discover that non-uniform quantization can enhance stability. Additionally, we develop an approximate message-passing algorithm to validate our theoretical results.

Low-complexity 8-point DCT Approximation Based on Angle Similarity for Image and Video Coding. (arXiv:1808.02950v2 [eess.IV] UPDATED)

back

The principal component analysis (PCA) is widely used for data decorrelation and dimensionality reduction. However, the use of PCA may be impractical in real-time applications, or in situations were energy and computing constraints are severe. In this context, the discrete cosine transform (DCT) becomes a low-cost alternative to data decorrelation. This paper presents a method to derive computationally efficient approximations to the DCT. The proposed method aims at the minimization of the angle between the rows of the exact DCT matrix and the rows of the approximated transformation matrix. The resulting transformations matrices are orthogonal and have extremely low arithmetic complexity. Considering popular performance measures, one of the proposed transformation matrices outperforms the best competitors in both matrix error and coding capabilities. Practical applications in image and video coding demonstrate the relevance of the proposed transformation. In fact, we show that the proposed approximate DCT can outperform the exact DCT for image encoding under certain compression ratios. The proposed transform and its direct competitors are also physically realized as digital prototype circuits using FPGA technology.

Nearest neighbor empirical processes. (arXiv:2110.15083v3 [math.ST] UPDATED)

back

In the regression framework, the empirical measure based on the responses resulting from the nearest neighbors, among the covariates, to a given point $x$ is introduced and studied as a central statistical quantity. First, the associated empirical process is shown to satisfy a uniform central limit theorem under a local bracketing entropy condition on the underlying class of functions reflecting the localizing nature of the nearest neighbor algorithm. Second a uniform non-asymptotic bound is established under a well-known condition, often referred to as Vapnik-Chervonenkis, on the uniform entropy numbers. The covariance of the Gaussian limit obtained in the uniform central limit theorem is simply equal to the conditional covariance operator given the covariate value. This suggests the possibility of using standard formulas to estimate the variance by using only the nearest neighbors instead of the full data. This is illustrated on two problems: the estimation of the conditional cumulative distribution function and local linear regression.

Solving, Tracking and Stopping Streaming Linear Inverse Problems. (arXiv:2201.05741v3 [math.NA] UPDATED)

back

In large-scale applications including medical imaging, collocation differential equation solvers, and estimation with differential privacy, the underlying linear inverse problem can be reformulated as a streaming problem. In theory, the streaming problem can be effectively solved using memory-efficient, exponentially-converging streaming solvers. In practice, a streaming solver's effectiveness is undermined if it is stopped before, or well-after, the desired accuracy is achieved. In special cases when the underlying linear inverse problem is finite-dimensional, streaming solvers can periodically evaluate the residual norm at a substantial computational cost. When the underlying system is infinite dimensional, streaming solver can only access noisy estimates of the residual. While such noisy estimates are computationally efficient, they are useful only when their accuracy is known. In this work, we rigorously develop a general family of computationally-practical residual estimators and their uncertainty sets for streaming solvers, and we demonstrate the accuracy of our methods on a number of large-scale linear problems. Thus, we further enable the practical use of streaming solvers for important classes of linear inverse problems.

Nonresponse Bias Analysis in Longitudinal Studies: A Comparative Review with an Application to the Early Childhood Longitudinal Study. (arXiv:2204.07105v5 [stat.ME] UPDATED)

back

Longitudinal studies are subject to nonresponse when individuals fail to provide data for entire waves or particular questions of the survey. We compare approaches to nonresponse bias analysis (NRBA) in longitudinal studies and illustrate them on the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011). Wave nonresponse with attrition often yields a monotone missingness pattern, and the missingness mechanism can be missing at random (MAR) or missing not at random (MNAR). We discuss weighting, multiple imputation (MI), incomplete data modeling, and Bayesian approaches to NRBA for monotone patterns. Weighting adjustments are effective when the constructed weights are correlated to the survey outcome of interest. MI allows for variables with missing values to be included in the imputation model, yielding potentially less biased and more efficient estimates. Multilevel models with maximum likelihood estimation and marginal models estimated using generalized estimating equations can also handle incomplete longitudinal data. Bayesian methods introduce prior information and potentially stabilize model estimation. We add offsets in the MAR results to provide sensitivity analyses to assess MNAR deviations. We conduct NRBA for descriptive summaries and analytic model estimates and find that in the ECLS-K:2011 application, NRBA yields minor changes to the substantive conclusions. The strength of evidence about our NRBA depends on the strength of the relationship between the characteristics in the nonresponse adjustment and the key survey outcomes, so the key to a successful NRBA is to include strong predictors.

Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios. (arXiv:2206.01900v3 [cs.AI] UPDATED)

back

Evaluation of intervention in a multi-agent system, e.g., when humans should intervene in autonomous driving systems and when a player should pass to teammates for a good shot, is challenging in various engineering and scientific fields. Estimating the individual treatment effect (ITE) using counterfactual long-term prediction is practical to evaluate such interventions. However, most of the conventional frameworks did not consider the time-varying complex structure of multi-agent relationships and covariate counterfactual prediction. This may lead to erroneous assessments of ITE and difficulty in interpretation. Here we propose an interpretable, counterfactual recurrent network in multi-agent systems to estimate the effect of the intervention. Our model leverages graph variational recurrent neural networks and theory-based computation with domain knowledge for the ITE estimation framework based on long-term prediction of multi-agent covariates and outcomes, which can confirm the circumstances under which the intervention is effective. On simulated models of an automated vehicle and biological agents with time-varying confounders, we show that our methods achieved lower estimation errors in counterfactual covariates and the most effective treatment timing than the baselines. Furthermore, using real basketball data, our methods performed realistic counterfactual predictions and evaluated the counterfactual passes in shot scenarios.

Reliable emulation of complex functionals by active learning with error control. (arXiv:2208.06727v3 [physics.chem-ph] UPDATED)

back

A statistical emulator can be used as a surrogate of complex physics-based calculations to drastically reduce the computational cost. Its successful implementation hinges on an accurate representation of the nonlinear response surface with a high-dimensional input space. Conventional "space-filling" designs, including random sampling and Latin hypercube sampling, become inefficient as the dimensionality of the input variables increases, and the predictive accuracy of the emulator can degrade substantially for a test input distant from the training input set. To address this fundamental challenge, we develop a reliable emulator for predicting complex functionals by active learning with error control (ALEC). The algorithm is applicable to infinite-dimensional mapping with high-fidelity predictions and a controlled predictive error. The computational efficiency has been demonstrated by emulating the classical density functional theory (cDFT) calculations, a statistical-mechanical method widely used in modeling the equilibrium properties of complex molecular systems. We show that ALEC is much more accurate than conventional emulators based on the Gaussian processes with "space-filling" designs and alternative active learning methods. Besides, it is computationally more efficient than direct cDFT calculations. ALEC can be a reliable building block for emulating expensive functionals owing to its minimal computational cost, controllable predictive error, and fully automatic features.

Switchback Experiments under Geometric Mixing. (arXiv:2209.00197v2 [stat.ME] UPDATED)

back

The switchback is an experimental design that measures treatment effects by repeatedly turning an intervention on and off for a whole system. Switchback experiments are a robust way to overcome cross-unit spillover effects; however, they are vulnerable to bias from temporal carryovers. In this paper, we consider properties of switchback experiments in Markovian systems that mix at a geometric rate. We find that, in this setting, standard switchback designs suffer considerably from carryover bias: Their estimation error decays as $T^{-1/3}$ in terms of the experiment horizon $T$, whereas in the absence of carryovers a faster rate of $T^{-1/2}$ would have been possible. We also show, however, that judicious use of burn-in periods can considerably improve the situation, and enables errors that decay as $\log(T)^{1/2}T^{-1/2}$. Our formal results are mirrored in an empirical evaluation.

On the potential benefits of entropic regularization for smoothing Wasserstein estimators. (arXiv:2210.06934v2 [stat.ML] UPDATED)

back

This paper is focused on the study of entropic regularization in optimal transport as a smoothing method for Wasserstein estimators, through the prism of the classical tradeoff between approximation and estimation errors in statistics. Wasserstein estimators are defined as solutions of variational problems whose objective function involves the use of an optimal transport cost between probability measures. Such estimators can be regularized by replacing the optimal transport cost by its regularized version using an entropy penalty on the transport plan. The use of such a regularization has a potentially significant smoothing effect on the resulting estimators. In this work, we investigate its potential benefits on the approximation and estimation properties of regularized Wasserstein estimators. Our main contribution is to discuss how entropic regularization may reach, at a lower computational cost, statistical performances that are comparable to those of un-regularized Wasserstein estimators in statistical learning problems involving distributional data analysis. To this end, we present new theoretical results on the convergence of regularized Wasserstein estimators. We also study their numerical performances using simulated and real data in the supervised learning problem of proportions estimation in mixture models using optimal transport.

Doubly robust nearest neighbors in factor models. (arXiv:2211.14297v3 [stat.ML] UPDATED)

back

We introduce and analyze an improved variant of nearest neighbors (NN) for estimation with missing data in latent factor models. We consider a matrix completion problem with missing data, where the $(i, t)$-th entry, when observed, is given by its mean $f(u_i, v_t)$ plus mean-zero noise for an unknown function $f$ and latent factors $u_i$ and $v_t$. Prior NN strategies, like unit-unit NN, for estimating the mean $f(u_i, v_t)$ relies on existence of other rows $j$ with $u_j \approx u_i$. Similarly, time-time NN strategy relies on existence of columns $t'$ with $v_{t'} \approx v_t$. These strategies provide poor performance respectively when similar rows or similar columns are not available. Our estimate is doubly robust to this deficit in two ways: (1) As long as there exist either good row or good column neighbors, our estimate provides a consistent estimate. (2) Furthermore, if both good row and good column neighbors exist, it provides a (near-)quadratic improvement in the non-asymptotic error and admits a significantly narrower asymptotic confidence interval when compared to both unit-unit or time-time NN.

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (arXiv:2212.10789v3 [cs.LG] UPDATED)

back

There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.

Electricity price forecasting with Smoothing Quantile Regression Averaging: Quantifying economic benefits of probabilistic forecasts. (arXiv:2302.00411v2 [stat.AP] UPDATED)

back

In the world of the complex power market, accurate electricity price forecasting is essential for strategic bidding and affects both daily operations and long-term investments. This article introduce a new method dubbed Smoothing Quantile Regression (SQR) Averaging, that improves on well-performing schemes for probabilistic forecasting. To showcase its utility, a comprehensive study is conducted across four power markets, including recent data encompassing the COVID-19 pandemic and the Russian invasion on Ukraine. The performance of SQR Averaging is evaluated and compared to state-of-the-art benchmark methods in terms of the reliability and sharpness measures. Additionally, an evaluation scheme is introduced to quantify the economic benefits derived from SQR Averaging predictions. This scheme can be applied in any day-ahead electricity market and is based on a trading strategy that leverages battery storage and sets limit orders using selected quantiles of the predictive distribution. The results reveal that, compared to the benchmark strategy, utilizing SQR Averaging leads to average profit increases of up to 14\%. These findings provide strong evidence for the effectiveness of SQR Averaging in improving forecast accuracy and the practical value of utilizing probabilistic forecasts in day-ahead electricity trading, even in the face of challenging events such as the COVID-19 pandemic and geopolitical disruptions.

Data-dependent Generalization Bounds via Variable-Size Compressibility. (arXiv:2303.05369v2 [stat.ML] UPDATED)

back

In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce newly here. In this framework, the generalization error of an algorithm is linked to a variable-size 'compression rate' of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data at hand, rather than its unknown distribution. Our new generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectations bounds. Moreover, it is shown that our framework also allows to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume and possibly improve over several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds that are recovered as special cases, thus unveiling a unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the R\'enyi information dimension of a process, and the metric mean dimension.

Parametric Estimation of Tempered Stable Laws. (arXiv:2303.07060v3 [math.ST] UPDATED)

back

Tempered stable distributions are frequently used in financial applications (e.g., for option pricing) in which the tails of stable distributions would be too heavy. Given the non-explicit form of the probability density function, estimation relies on numerical algorithms which typically are time-consuming. We compare several parametric estimation methods such as the maximum likelihood method and different generalized method of moment approaches. We study large sample properties and derive consistency, asymptotic normality, and asymptotic efficiency results for our estimators. Additionally, we conduct simulation studies to analyze finite sample properties measured by the empirical bias, precision, and asymptotic confidence interval coverage rates and compare computational costs. We cover relevant subclasses of tempered stable distributions such as the classical tempered stable distribution and the tempered stable subordinator. Moreover, we discuss the normal tempered stable distribution which arises by subordinating a Brownian motion with a tempered stable subordinator. Our financial applications to log returns of asset indices and to energy spot prices illustrate the benefits of tempered stable models.

Federated Learning for Heterogeneous Bandits with Unobserved Contexts. (arXiv:2303.17043v2 [cs.LG] UPDATED)

back

We study the problem of federated stochastic multi-arm contextual bandits with unknown contexts, in which M agents are faced with different bandits and collaborate to learn. The communication model consists of a central server and the agents share their estimates with the central server periodically to learn to choose optimal actions in order to minimize the total regret. We assume that the exact contexts are not observable and the agents observe only a distribution of the contexts. Such a situation arises, for instance, when the context itself is a noisy measurement or based on a prediction mechanism. Our goal is to develop a distributed and federated algorithm that facilitates collaborative learning among the agents to select a sequence of optimal actions so as to maximize the cumulative reward. By performing a feature vector transformation, we propose an elimination-based algorithm and prove the regret bound for linearly parametrized reward functions. Finally, we validated the performance of our algorithm and compared it with another baseline approach using numerical simulations on synthetic data and on the real-world movielens dataset.

Neural networks for geospatial data. (arXiv:2304.09157v2 [stat.ML] UPDATED)

back

Analysis of geospatial data has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a covariance model, encoding the spatial dependence. We relax the strong assumption of linearity and propose embedding neural networks directly within the traditional geostatistical models to accommodate non-linear mean functions while retaining all other advantages including use of Gaussian Processes to explicitly model the spatial covariance, enabling inference on the covariate effect through the mean and on the spatial dependence through the covariance, and offering predictions at new locations via kriging. We propose NN-GLS, a new neural network estimation algorithm for the non-linear mean in GP models that explicitly accounts for the spatial covariance through generalized least squares (GLS), the same loss used in the linear case. We show that NN-GLS admits a representation as a special type of graph neural network (GNN). This connection facilitates use of standard neural network computational techniques for irregular geospatial data, enabling novel and scalable mini-batching, backpropagation, and kriging schemes. Theoretically, we show that NN-GLS will be consistent for irregularly observed spatially correlated data processes. To our knowledge this is the first asymptotic consistency result for any neural network algorithm for spatial data. We demonstrate the methodology through simulated and real datasets.

Deep Neural-network Prior for Orbit Recovery from Method of Moments. (arXiv:2304.14604v2 [stat.ME] UPDATED)

back

Orbit recovery problems are a class of problems that often arise in practice and various forms. In these problems, we aim to estimate an unknown function after being distorted by a group action and observed via a known operator. Typically, the observations are contaminated with a non-trivial level of noise. Two particular orbit recovery problems of interest in this paper are multireference alignment and single-particle cryo-EM modelling. In order to suppress the noise, we suggest using the method of moments approach for both problems while introducing deep neural network priors. In particular, our neural networks should output the signals and the distribution of group elements, with moments being the input. In the multireference alignment case, we demonstrate the advantage of using the NN to accelerate the convergence for the reconstruction of signals from the moments. Finally, we use our method to reconstruct simulated and biological volumes in the cryo-EM setting.

DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method. (arXiv:2305.16284v4 [cs.LG] UPDATED)

back

This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient -- matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal -- automatically adapting to both smooth and nonsmooth problems. While popular algorithms following the AdaGrad framework compute a running average of the squared gradients to use for normalization, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks.

Improving Forecasts for Heterogeneous Time Series by "Averaging", with Application to Food Demand Forecast. (arXiv:2306.07119v3 [stat.ME] UPDATED)

back

A common forecasting setting in real world applications considers a set of possibly heterogeneous time series of the same domain. Due to different properties of each time series such as length, obtaining forecasts for each individual time series in a straight-forward way is challenging. This paper proposes a general framework utilizing a similarity measure in Dynamic Time Warping to find similar time series to build neighborhoods in a k-Nearest Neighbor fashion, and improve forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostics tools are proposed allowing a deep understanding of the procedure.

A brief introduction on latent variable based ordinal regression models with an application to survey data. (arXiv:2306.13538v2 [stat.ME] UPDATED)

back

The analysis of survey data is a frequently arising issue in clinical trials, particularly when capturing quantities which are difficult to measure. Typical examples are questionnaires about patient's well-being, pain, or consent to an intervention. In these, data is captured on a discrete scale containing only a limited number of possible answers, from which the respondent has to pick the answer which fits best his/her personal opinion. This data is generally located on an ordinal scale as answers can usually be arranged in an ascending order, e.g., "bad", "neutral", "good" for well-being. Since responses are usually stored numerically for data processing purposes, analysis of survey data using ordinary linear regression models are commonly applied. However, assumptions of these models are often not met as linear regression requires a constant variability of the response variable and can yield predictions out of the range of response categories. By using linear models, one only gains insights about the mean response which may affect representativeness. In contrast, ordinal regression models can provide probability estimates for all response categories and yield information about the full response scale beyond the mean. In this work, we provide a concise overview of the fundamentals of latent variable based ordinal models, applications to a real data set, and outline the use of state-of-the-art-software for this purpose. Moreover, we discuss strengths, limitations and typical pitfalls. This is a companion work to a current vignette-based structured interview study in paediatric anaesthesia.

Unified Transfer Learning Models in High-Dimensional Linear Regression. (arXiv:2307.00238v4 [stat.ML] UPDATED)

back

Transfer learning plays a key role in modern data analysis when: (1) the target data are scarce but the source data are sufficient; (2) the distributions of the source and target data are heterogeneous. This paper develops an interpretable unified transfer learning model, termed as UTrans, which can detect both transferable variables and source data. More specifically, we establish the estimation error bounds and prove that our bounds are lower than those with target data only. Besides, we propose a source detection algorithm based on hypothesis testing to exclude the nontransferable data. We evaluate and compare UTrans to the existing algorithms in multiple experiments. It is shown that UTrans attains much lower estimation and prediction errors than the existing methods, while preserving interpretability. We finally apply it to the US intergenerational mobility data and compare our proposed algorithms to the classical machine learning algorithms.

Do machine learning methods lead to similar individualized treatment rules? A comparison study on real data. (arXiv:2308.03398v2 [stat.AP] UPDATED)

back

Identifying patients who benefit from a treatment is a key aspect of personalized medicine, which allows the development of individualized treatment rules (ITRs). Many machine learning methods have been proposed to create such rules. However, to what extent the methods lead to similar ITRs, i.e., recommending the same treatment for the same individuals is unclear. In this work, we compared 22 of the most common approaches in two randomized control trials. Two classes of methods can be distinguished. The first class of methods relies on predicting individualized treatment effects from which an ITR is derived by recommending the treatment evaluated to the individuals with a predicted benefit. In the second class, methods directly estimate the ITR without estimating individualized treatment effects. For each trial, the performance of ITRs was assessed by various metrics, and the pairwise agreement between all ITRs was also calculated. Results showed that the ITRs obtained via the different methods generally had considerable disagreements regarding the patients to be treated. A better concordance was found among akin methods. Overall, when evaluating the performance of ITRs in a validation sample, all methods produced ITRs with limited performance, suggesting a high potential for optimism. For non-parametric methods, this optimism was likely due to overfitting. The different methods do not lead to similar ITRs and are therefore not interchangeable. The choice of the method strongly influences for which patients a certain treatment is recommended, drawing some concerns about their practical use.

Exact Inference for Continuous-Time Gaussian Process Dynamics. (arXiv:2309.02351v2 [cs.LG] UPDATED)

back

Physical systems can often be described via a continuous-time dynamical system. In practice, the true system is often unknown and has to be learned from measurement data. Since data is typically collected in discrete time, e.g. by sensors, most methods in Gaussian process (GP) dynamics model learning are trained on one-step ahead predictions. This can become problematic in several scenarios, e.g. if measurements are provided at irregularly-sampled time steps or physical system properties have to be conserved. Thus, we aim for a GP model of the true continuous-time dynamics. Higher-order numerical integrators provide the necessary tools to address this problem by discretizing the dynamics function with arbitrary accuracy. Many higher-order integrators require dynamics evaluations at intermediate time steps making exact GP inference intractable. In previous work, this problem is often tackled by approximating the GP posterior with variational inference. However, exact GP inference is preferable in many scenarios, e.g. due to its mathematical guarantees. In order to make direct inference tractable, we propose to leverage multistep and Taylor integrators. We demonstrate how to derive flexible inference schemes for these types of integrators. Further, we derive tailored sampling schemes that allow to draw consistent dynamics functions from the learned posterior. This is crucial to sample consistent predictions from the dynamics model. We demonstrate empirically and theoretically that our approach yields an accurate representation of the continuous-time system.

Equivariant Matrix Function Neural Networks. (arXiv:2310.10434v2 [stat.ML] UPDATED)

back

Graph Neural Networks (GNNs), especially message-passing neural networks (MPNNs), have emerged as powerful architectures for learning on graphs in diverse applications. However, MPNNs face challenges when modeling non-local interactions in graphs such as large conjugated molecules, and social networks due to oversmoothing and oversquashing. Although Spectral GNNs and traditional neural networks such as recurrent neural networks and transformers mitigate these challenges, they often lack generalizability, or fail to capture detailed structural relationships or symmetries in the data. To address these concerns, we introduce Matrix Function Neural Networks (MFNs), a novel architecture that parameterizes non-local interactions through analytic matrix equivariant functions. Employing resolvent expansions offers a straightforward implementation and the potential for linear scaling with system size. The MFN architecture achieves stateof-the-art performance in standard graph benchmarks, such as the ZINC and TU datasets, and is able to capture intricate non-local interactions in quantum systems, paving the way to new state-of-the-art force fields.

Synthesis estimators for positivity violations with a continuous covariate. (arXiv:2311.09388v2 [stat.ME] UPDATED)

back

Studies intended to estimate the effect of a treatment, like randomized trials, often consist of a biased sample of the desired target population. To correct for this bias, estimates can be transported to the desired target population. Methods for transporting between populations are often premised on a positivity assumption, such that all relevant covariate patterns in one population are also present in the other. However, eligibility criteria, particularly in the case of trials, can result in violations of positivity. To address nonpositivity, a synthesis of statistical and mathematical models can be considered. This approach integrates multiple data sources (e.g. trials, observational, pharmacokinetic studies) to estimate treatment effects, leveraging mathematical models to handle positivity violations. This approach was previously demonstrated for positivity violations by a single binary covariate. Here, we extend the synthesis approach for positivity violations with a continuous covariate. For estimation, two novel augmented inverse probability weighting estimators are proposed. Both estimators are contrasted with other common approaches for addressing nonpositivity. Empirical performance is compared via Monte Carlo simulation. Finally, the competing approaches are illustrated with an example in the context of two-drug versus one-drug antiretroviral therapy on CD4 T cell counts among women with HIV.

Policy Learning with Distributional Welfare. (arXiv:2311.15878v2 [stat.ME] UPDATED)

back

In this paper, we explore optimal treatment allocation policies that target distributional welfare. Most literature on treatment choice has considered utilitarian welfare based on the conditional average treatment effect (ATE). While average welfare is intuitive, it may yield undesirable allocations especially when individuals are heterogeneous (e.g., with outliers) - the very reason individualized treatments were introduced in the first place. This observation motivates us to propose an optimal policy that allocates the treatment based on the conditional quantile of individual treatment effects (QoTE). Depending on the choice of the quantile probability, this criterion can accommodate a policymaker who is either prudent or negligent. The challenge of identifying the QoTE lies in its requirement for knowledge of the joint distribution of the counterfactual outcomes, which is generally hard to recover even with experimental data. Therefore, we introduce minimax policies that are robust to model uncertainty. A range of identifying assumptions can be used to yield more informative policies. For both stochastic and deterministic policies, we establish the asymptotic bound on the regret of implementing the proposed policies. In simulations and two empirical applications, we compare optimal decisions based on the QoTE with decisions based on other criteria. The framework can be generalized to any setting where welfare is defined as a functional of the joint distribution of the potential outcomes.

Detecting influential observations in single-index models with metric-valued response objects. (arXiv:2311.17246v2 [stat.CO] UPDATED)

back

Regression with random data objects is becoming increasingly common in modern data analysis. Unfortunately, like the traditional regression setting with Euclidean data, random response regression is not immune to the trouble caused by unusual observations. A metric Cook's distance extending the classical Cook's distances of Cook (1977) to general metric-valued response objects is proposed. The performance of the metric Cook's distance in both Euclidean and non-Euclidean response regression with Euclidean predictors is demonstrated in an extensive experimental study. A real data analysis of county-level COVID-19 transmission in the United States also illustrates the usefulness of this method in practice.

Weighted least-squares approximation with determinantal point processes and generalized volume sampling. (arXiv:2312.14057v2 [math.NA] UPDATED)

back

We consider the problem of approximating a function from $L^2$ by an element of a given $m$-dimensional space $V_m$, associated with some feature map $\varphi$, using evaluations of the function at random points $x_1,\dots,x_n$. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features $\varphi(x_i)$. We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples $n = O(m\log(m))$, that means that the expected $L^2$ error is bounded by a constant times the best approximation error in $L^2$. Also, further assuming that the function is in some normed vector space $H$ continuously embedded in $L^2$, we further prove that the approximation is almost surely bounded by the best approximation error measured in the $H$-norm. This includes the cases of functions from $L^\infty$ or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy consisting in using independent repetitions of projection DPP (or volume sampling), yielding similar error bounds as with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.

Causal Forecasting for Pricing. (arXiv:2312.15282v3 [stat.ML] UPDATED)

back

This paper proposes a novel method for demand forecasting in a pricing context. Here, modeling the causal relationship between price as an input variable to demand is crucial because retailers aim to set prices in a (profit) optimal manner in a downstream decision making problem. Our methods bring together the Double Machine Learning methodology for causal inference and state-of-the-art transformer-based forecasting models. In extensive empirical experiments, we show on the one hand that our method estimates the causal effect better in a fully controlled setting via synthetic, yet realistic data. On the other hand, we demonstrate on real-world data that our method outperforms forecasting methods in off-policy settings (i.e., when there's a change in the pricing policy) while only slightly trailing in the on-policy setting.

FDR-Controlled Portfolio Optimization for Sparse Financial Index Tracking. (arXiv:2401.15139v2 [q-fin.PM] UPDATED)

back

In high-dimensional data analysis, such as financial index tracking or biomedical applications, it is crucial to select the few relevant variables while maintaining control over the false discovery rate (FDR). In these applications, strong dependencies often exist among the variables (e.g., stock returns), which can undermine the FDR control property of existing methods like the model-X knockoff method or the T-Rex selector. To address this issue, we have expanded the T-Rex framework to accommodate overlapping groups of highly correlated variables. This is achieved by integrating a nearest neighbors penalization mechanism into the framework, which provably controls the FDR at the user-defined target level. A real-world example of sparse index tracking demonstrates the proposed method's ability to accurately track the S&P 500 index over the past 20 years based on a small number of stocks. An open-source implementation is provided within the R package TRexSelector on CRAN.

Bayesian Nonparametrics Meets Data-Driven Robust Optimization. (arXiv:2401.15771v2 [stat.ML] UPDATED)

back

Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet Process) theory and recent decision-theoretic models of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, among which Ridge and LASSO regressions. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet Process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to high-dimensional sparse linear regression and robust location parameter estimation tasks.

High-Dimensional False Discovery Rate Control for Dependent Variables. (arXiv:2401.15796v2 [stat.ME] UPDATED)

back

Algorithms that ensure reproducible findings from large-scale, high-dimensional data are pivotal in numerous signal processing applications. In recent years, multivariate false discovery rate (FDR) controlling methods have emerged, providing guarantees even in high-dimensional settings where the number of variables surpasses the number of samples. However, these methods often fail to reliably control the FDR in the presence of highly dependent variable groups, a common characteristic in fields such as genomics and finance. To tackle this critical issue, we introduce a novel framework that accounts for general dependency structures. Our proposed dependency-aware T-Rex selector integrates hierarchical graphical models within the T-Rex framework to effectively harness the dependency structure among variables. Leveraging martingale theory, we prove that our variable penalization mechanism ensures FDR control. We further generalize the FDR-controlling framework by stating and proving a clear condition necessary for designing both graphical and non-graphical models that capture dependencies. Additionally, we formulate a fully integrated optimal calibration algorithm that concurrently determines the parameters of the graphical model and the T-Rex framework, such that the FDR is controlled while maximizing the number of selected variables. Numerical experiments and a breast cancer survival analysis use-case demonstrate that the proposed method is the only one among the state-of-the-art benchmark methods that controls the FDR and reliably detects genes that have been previously identified to be related to breast cancer. An open-source implementation is available within the R package TRexSelector on CRAN.

Distribution-consistency Structural Causal Models. (arXiv:2401.15911v2 [cs.AI] UPDATED)

back

In the field of causal modeling, potential outcomes (PO) and structural causal models (SCMs) stand as the predominant frameworks. However, these frameworks face notable challenges in practically modeling counterfactuals, formalized as parameters of the joint distribution of potential outcomes. Counterfactual reasoning holds paramount importance in contemporary decision-making processes, especially in scenarios that demand personalized incentives based on the joint values of $(Y(0), Y(1))$. This paper begins with an investigation of the PO and SCM frameworks for modeling counterfactuals. Through the analysis, we identify an inherent model capacity limitation, termed as the ``degenerative counterfactual problem'', emerging from the consistency rule that is the cornerstone of both frameworks. To address this limitation, we introduce a novel \textit{distribution-consistency} assumption, and in alignment with it, we propose the Distribution-consistency Structural Causal Models (DiscoSCMs) offering enhanced capabilities to model counterfactuals. To concretely reveal the enhanced model capacity, we introduce a new identifiable causal parameter, \textit{the probability of consistency}, which holds practical significance within DiscoSCM alone, showcased with a personalized incentive example. Furthermore, we provide a comprehensive set of theoretical results about the ``Ladder of Causation'' within the DiscoSCM framework. We hope it opens new avenues for future research of counterfactual modeling, ultimately enhancing our understanding of causality and its real-world applications.

stat updates on arXiv.org

Statistics (stat) updates on the arXiv.org e-print archive