## Statistics (stat) updates on the arXiv.org e-print archive



In this work we study a variant of the well-known multi-armed bandit (MAB) problem, which has the properties of a delay in feedback and a loss that declines over time. We introduce an algorithm, EXP4-DFDC, to solve this MAB variant, and demonstrate that the regret vanishes as time increases. We also show that LeCaR, a previously published machine-learning-based cache replacement algorithm, is an instance of EXP4-DFDC. Our results can be used to provide insight into the choice of hyperparameters and to optimize future LeCaR instances.

Machine learning is rapidly becoming one of the most important technologies for malware traffic detection, since the continuous evolution of malware requires constant adaptation and the ability to generalize. However, network traffic datasets are usually oversized and contain redundant and irrelevant information, which may dramatically increase the computational cost and decrease the accuracy of most classifiers, with the risk of introducing further noise.

We propose two novel dataset optimization strategies which exploit and combine several state-of-the-art approaches in order to achieve an effective optimization of the network traffic datasets used to train malware detectors. The first is a feature selection technique based on mutual information measures and sensitivity enhancement. The second is a dimensionality reduction technique based on autoencoders. Both approaches were experimentally applied to the MTA-KDD'19 dataset, and the optimized results were evaluated and compared using a Multi-Layer Perceptron as the machine learning model for malware detection.
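As an illustration of the mutual-information side of the first strategy (the exact scoring and the sensitivity-enhancement step are specific to the paper; this is a minimal filter-style sketch with names of our own choosing):

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information between two discrete arrays (in nats)."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def select_features(X, y, k):
    """Keep the k columns of X with the highest MI against the labels y."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Toy check: feature 0 copies the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([y, rng.integers(0, 2, 500)])
print(select_features(X, y, 1))  # -> [0]
```

A real pipeline would estimate MI on discretized traffic features and combine it with the paper's additional scoring.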

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy, i.e., with resulting objective value within $\epsilon$ of the optimal value and satisfying the constraints within $\epsilon$-tolerance, with probability at least $1-\delta$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\epsilon^{2}}\log\frac{1}{\delta}\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and a quadratic dependence on the time horizon $H$.

The present paper is devoted to clustering geometric graphs. While standard spectral clustering is often not effective for geometric graphs, we present an effective generalization, which we call higher-order spectral clustering. It is similar in concept to the classical spectral clustering method but partitions using the eigenvector associated with a higher-order eigenvalue. We establish the weak consistency of this algorithm for a wide class of geometric graphs which we call the Soft Geometric Block Model. A small adjustment of the algorithm provides strong consistency. We also show that our method is effective in numerical experiments even for graphs of modest size.
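The core mechanic — partition by the sign pattern of a chosen Laplacian eigenvector — can be sketched as follows; the function and the toy graph are illustrative, not the paper's algorithm for geometric graphs, which involves additional steps to pick the right order:

```python
import numpy as np

def spectral_partition(A, order=2):
    """Bipartition a graph by the signs of the eigenvector of the normalized
    Laplacian associated with the `order`-th smallest eigenvalue."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - Dinv @ A @ Dinv
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return (vecs[:, order - 1] >= 0).astype(int)

# Two 6-cliques joined by a single edge; order=2 (classical choice) recovers them.
n = 6
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1
A[n:, n:] = 1
np.fill_diagonal(A, 0)
A[0, n] = A[n, 0] = 1
labels = spectral_partition(A, order=2)
print(labels)
```

For the Soft Geometric Block Model the paper's point is precisely that an `order` larger than 2 may be the informative one.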

Learning low-dimensional representations for entities and relations in knowledge graphs using contrastive estimation represents a scalable and effective method for inferring connectivity patterns. A crucial aspect of contrastive learning approaches is the choice of corruption distribution that generates hard negative samples, which force the embedding model to learn discriminative representations and find critical characteristics of observed data. While earlier methods either employ overly simple corruption distributions (e.g. uniform), yielding easy and uninformative negatives, or sophisticated adversarial distributions with challenging optimization schemes, they do not explicitly incorporate known graph structure, resulting in suboptimal negatives. In this paper, we propose Structure Aware Negative Sampling (SANS), an inexpensive negative sampling strategy that utilizes the rich graph structure by selecting negative samples from a node's k-hop neighborhood. Empirically, we demonstrate that SANS finds high-quality negatives that are highly competitive with SOTA methods, and requires no additional parameters nor difficult adversarial optimization.
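The sampling step can be sketched as below; the adjacency-list format and helper names are illustrative, not the authors' code:

```python
import numpy as np

def k_hop_neighborhood(adj, node, k):
    """Nodes reachable from `node` within k hops (BFS), excluding the node itself."""
    frontier, seen = {node}, {node}
    for _ in range(k):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen - {node}

def sans_negative(adj, head, tail, k, rng):
    """Corrupt the tail of a triple with a node drawn from the head's k-hop
    neighborhood (the true tail is excluded), per structure-aware sampling."""
    candidates = list(k_hop_neighborhood(adj, head, k) - {tail})
    return rng.choice(candidates) if candidates else None

# Toy path graph 0-1-2-3: from head 0 within 2 hops, only node 2 remains
# once the true tail 1 is excluded.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = np.random.default_rng(0)
neg = sans_negative(adj, head=0, tail=1, k=2, rng=rng)
print(neg)  # -> 2
```

Negatives drawn this way share context with the head entity, which is what makes them "hard" compared with uniform corruption.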

The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt the IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to get tight upper bounds of convergence rates. Using this framework, we recover the existing bound for the gradient method~(GD), derive sharper bounds for the proximal point method~(PPM) and optimistic gradient method~(OG), and provide \emph{for the first time} a global convergence rate for the negative momentum method~(NM) with an iteration complexity $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.
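The linear convergence of the gradient method on a smooth, strongly-monotone game can be illustrated on a toy quadratic min-max problem; the step size below comes from a standard spectral argument on this specific game, not from the paper's IQC analysis:

```python
import numpy as np

# Bilinearly coupled strongly-monotone game:
#   min_x max_y  (mu/2) x^2 + gamma * x * y - (mu/2) y^2
# Its vector field F(z) = A z (gradient in x, minus gradient in y)
# is mu-strongly monotone.
mu, gamma = 1.0, 2.0
A = np.array([[mu, gamma],
              [-gamma, mu]])

def gradient_method(z0, eta, steps):
    """Simultaneous gradient descent-ascent: z <- z - eta * F(z)."""
    z = z0.copy()
    for _ in range(steps):
        z = z - eta * (A @ z)
    return z

z0 = np.array([1.0, -1.0])
eta = mu / (mu**2 + gamma**2)   # step size minimizing the spectral radius of I - eta*A
z = gradient_method(z0, eta, 200)
print(np.linalg.norm(z))        # close to 0: linear convergence to the equilibrium
```

The eigenvalues of `A` are $\mu \pm i\gamma$, so the contraction factor per step is $\sqrt{(1-\eta\mu)^2 + \eta^2\gamma^2}$, which is strictly below 1 for this step size.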

The problem of monotone missing data has been broadly studied during the last two decades and has many applications in fields such as bioinformatics and statistics. Commonly used imputation techniques require multiple iterations through the data before convergence. Moreover, those approaches may introduce extra noise and bias into the subsequent modeling. In this work, we derive exact formulas and propose a novel algorithm, EPEM, to compute the maximum likelihood estimators (MLEs) for a multiple-class, monotone missing dataset when the covariance matrices of all classes are assumed to be equal. We then illustrate an application of our proposed method in Linear Discriminant Analysis (LDA). As the computation is exact, our EPEM algorithm does not require multiple iterations through the data as imputation approaches do, and thus promises to be much less time-consuming than other methods. This effectiveness was validated by empirical results: EPEM reduced error rates significantly and required a short computation time compared to several imputation-based approaches. We also release all code and data from our experiments in a GitHub repository to contribute to the research community working on this problem.

Deep neural networks (DNNs) have proven to be powerful tools for processing unstructured data. However, for high-dimensional data like images, they are inherently vulnerable to adversarial attacks: small, almost invisible perturbations added to the input can be used to fool DNNs. Various attacks, hardening methods and detection methods have been introduced in recent years. Notoriously, Carlini-Wagner (CW) type attacks computed by iterative minimization belong to those that are most difficult to detect. In this work, we demonstrate that such iterative minimization attacks can be used as detectors themselves. Thus, in some sense we show that one can fight fire with fire. This work also outlines a mathematical proof that under certain assumptions this detector provides asymptotically optimal separation of original and attacked images. In numerical experiments, we obtain AUROC values of up to 99.73% for our detection method. This distinctly surpasses state-of-the-art detection rates for CW attacks from the literature. We also give numerical evidence that our method is robust against the attacker's choice of the method of attack.

This article proposes a novel Bayesian classification framework for networks with labeled nodes. While the literature on statistical modeling of network data typically involves analysis of a single network, the recent emergence of complex data in several biological applications, including brain imaging studies, presents a need to devise a network classifier for subjects. This article considers an application from a brain connectome study, where the overarching goal is to classify subjects into two separate groups based on their brain network data, along with identifying influential regions of interest (ROIs) (referred to as nodes). Existing approaches either treat all edge weights as a long vector or summarize the network information with a few summary measures. Both these approaches ignore the full network structure, may lead to less desirable inference in small samples, and are not designed to identify significant network nodes. We propose a novel binary logistic regression framework with the network as the predictor and a binary response, the network predictor coefficients being modeled using a novel class of global-local shrinkage priors. The framework is able to accurately detect nodes and edges in the network influencing the classification. Our framework is implemented using an efficient Markov Chain Monte Carlo algorithm. Theoretically, we show asymptotically optimal classification for the proposed framework when the number of network edges grows faster than the sample size. The framework is empirically validated by extensive simulation studies and analysis of brain connectome data.

In this work, we develop a novel fairness learning approach for multi-task regression models based on a biased training dataset, using a popular rank-based non-parametric independence test, the Mann-Whitney U statistic, to measure the dependency between the target variable and protected variables. To solve this learning problem efficiently, we first reformulate it as a new non-convex optimization problem, in which a non-convex constraint is defined based on group-wise ranking functions of individual objects. We then develop an efficient model-training algorithm based on the framework of the non-convex alternating direction method of multipliers (NC-ADMM), in which one of the main challenges is to implement an efficient projection oracle onto the preceding non-convex set defined by the ranking functions. Through extensive experiments on both synthetic and real-world datasets, we validate the superior performance of our new approach against several state-of-the-art competing methods on several popular metrics relevant to fairness learning.
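The dependence measure at the heart of the constraint is the classical Mann-Whitney U statistic, which can be computed directly; this is a sketch of the statistic itself, not the authors' group-wise ranking-function formulation:

```python
import numpy as np

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic: the number of pairs (x_i, y_j) with x_i > y_j,
    counting ties as 1/2. Values far from len(x)*len(y)/2 indicate dependence
    between group membership and the variable's distribution."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return greater + 0.5 * ties

x = [3, 5, 7]
y = [1, 2, 6]
print(mann_whitney_u(x, y))  # 3>1, 3>2, 5>1, 5>2, 7>1, 7>2, 7>6 -> 7.0
```

In the fairness setting, `x` and `y` would be model outputs for the protected and unprotected groups respectively.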

We study fairness in supervised few-shot meta-learning models that are sensitive to discrimination (or bias) in historical data. A machine learning model trained on biased data tends to make unfair predictions for users from minority groups. Although this problem has been studied before, existing methods mainly aim to detect and control the dependency effect of the protected variables (e.g. race, gender) on target prediction based on a large amount of training data. These approaches carry two major drawbacks: (1) they lack a global cause-effect visualization for all variables; and (2) they do not generalize both accuracy and fairness to unseen tasks. In this work, we first discover discrimination from data using a causal Bayesian knowledge graph, which not only demonstrates the dependency between the protected variables and the target but also indicates causal effects between all variables. Next, we develop a novel algorithm based on risk difference in order to quantify the discriminatory influence of each protected variable in the graph. Furthermore, to protect predictions from unfairness, a fast-adapting bias-control approach for meta-learning is proposed, which efficiently mitigates statistical disparity for each task and thus ensures the independence of predictions from protected attributes based on biased and few-shot data samples. Distinct from existing meta-learning models, the group unfairness of tasks is efficiently reduced by leveraging the mean difference between (un)protected groups for regression problems. Through extensive experiments on both synthetic and real-world data sets, we demonstrate that our proposed unfairness discovery and prevention approaches efficiently detect discrimination and mitigate biases in model output, as well as generalizing both accuracy and fairness to unseen tasks with a limited amount of training samples.
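In its simplest binary form, the risk-difference quantity used to measure discriminatory influence is a statistical-parity gap; a minimal sketch (the function name is ours):

```python
import numpy as np

def risk_difference(y_pred, protected):
    """Statistical-parity gap: P(yhat = 1 | protected) - P(yhat = 1 | unprotected).
    A value of 0 means the positive-prediction rate is identical across groups."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected, dtype=bool)
    return y_pred[protected].mean() - y_pred[~protected].mean()

# Toy predictions: the protected group receives far fewer positive decisions.
y_pred    = np.array([1, 0, 0, 0, 1, 1, 1, 1])
protected = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(risk_difference(y_pred, protected))  # 0.25 - 1.0 = -0.75
```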

Forecasting influenza in a timely manner aids health organizations and policymakers in adequate preparation and decision making. However, effective influenza forecasting still remains a challenge despite increasing research interest. It is even more challenging amidst the COVID pandemic, when influenza-like illness (ILI) counts are affected by various factors such as symptomatic similarities with COVID-19 and shifts in the healthcare-seeking patterns of the general population. We term the ILI values observed during this potentially affected period COVID-ILI. Under the current pandemic, historical influenza models carry valuable expertise about the disease dynamics but face difficulties adapting. Therefore, we propose CALI-NET, a neural transfer learning architecture which allows us to 'steer' a historical disease forecasting model to new scenarios where flu and COVID co-exist. Our framework enables this adaptation by automatically learning when it should emphasize learning from COVID-related signals and when from the historical model. In this way, we exploit representations learned from historical ILI data as well as the limited COVID-related signals. Our experiments demonstrate that our approach is successful in adapting a historical forecasting model to the current pandemic. In addition, we show that success in our primary goal, adaptation, does not sacrifice overall performance as compared with state-of-the-art influenza forecasting approaches.

We consider Bayesian high-dimensional mediation analysis to identify, among a large set of correlated potential mediators, the active ones that mediate the effect from an exposure variable to an outcome of interest. Correlations among mediators are commonly observed in modern data analysis; examples include the activated voxels within connected regions in brain image data, regulatory signals driven by gene networks in genome data, and correlated exposure data from the same source. When correlations are present among active mediators, mediation analysis that fails to account for such correlation can be sub-optimal and may lead to a loss of power in identifying active mediators. Building upon a recent high-dimensional mediation analysis framework, we propose two Bayesian hierarchical models, one with a Gaussian mixture prior that enables correlated mediator selection and the other with a Potts mixture prior that accounts for the correlation among active mediators in mediation analysis. We develop efficient sampling algorithms for both methods. Various simulations demonstrate that our methods enable effective identification of correlated active mediators, which could be missed by existing methods that assume prior independence among active mediators. The proposed methods are applied to the LIFECODES birth cohort and the Multi-Ethnic Study of Atherosclerosis (MESA), where they identify new active mediators with important biological implications.

We present an elementary mathematical method to find the minimax estimator of the Bernoulli proportion $\theta$ under the squared error loss when $\theta$ belongs to the restricted parameter space of the form $\Omega = [0, \eta]$ for some pre-specified constant $0 \leq \eta \leq 1$. This problem is inspired by the problem of estimating the rate of positive COVID-19 tests. The presented results and applications would be useful material for both instructors and students when teaching point estimation in statistical or machine learning courses.
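For the unrestricted case $\eta = 1$, the classical minimax estimator under squared error is the Bayes rule under a Beta$(\sqrt{n}/2, \sqrt{n}/2)$ prior, whose risk is constant in $\theta$ — a fact that can be checked numerically (the restricted case $\eta < 1$ treated in the paper is more delicate):

```python
import numpy as np
from math import comb, sqrt

n = 25
c = sqrt(n) / 2
estimator = lambda x: (x + c) / (n + 2 * c)   # Bayes rule under Beta(c, c)

def risk(theta):
    """Exact squared-error risk E[(theta_hat - theta)^2] under Binomial(n, theta)."""
    return sum(comb(n, x) * theta**x * (1 - theta)**(n - x)
               * (estimator(x) - theta)**2 for x in range(n + 1))

risks = [risk(t) for t in np.linspace(0.05, 0.95, 10)]
print(max(risks) - min(risks))        # ~0: the risk is constant in theta
print(n / (4 * (n + sqrt(n))**2))     # closed-form value of that constant risk
```

Constant risk plus being Bayes for a proper prior is what certifies minimaxity in this classical case.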

The success of deep learning relies on the availability of large-scale annotated data sets, the acquisition of which can be costly, requiring expert domain knowledge. Semi-supervised learning (SSL) mitigates this challenge by exploiting the behavior of the neural function on large unlabeled data. The smoothness of the neural function is a commonly used assumption exploited in SSL. A successful example is the adoption of the mixup strategy in SSL, which enforces the global smoothness of the neural function by encouraging it to behave linearly when interpolating between training examples. Despite its empirical success, however, the theoretical underpinning of how mixup regularizes the neural function has not been fully understood. In this paper, we offer a theoretically substantiated proposition that mixup improves the smoothness of the neural function by bounding the Lipschitz constant of the gradient function of the neural networks. We then propose that this can be strengthened by simultaneously constraining the Lipschitz constant of the neural function itself through adversarial Lipschitz regularization, encouraging the neural function to behave linearly while also constraining the slope of this linear function. On three benchmark data sets and one real-world biomedical data set, we demonstrate that this combined regularization results in improved generalization performance of SSL when learning from a small amount of labeled data. We further demonstrate the robustness of the presented method against single-step adversarial attacks. Our code is available at https://github.com/Prasanna1991/Mixup-LR.
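The mixup operation itself is a one-line convex combination of inputs and (one-hot) labels; a minimal sketch following the standard mixup recipe, with the Beta mixing distribution and `alpha` hyperparameter as usual:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot labels, with the
    mixing weight drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, y1 = np.ones(4), np.array([1.0, 0.0])
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
x, y = mixup(x1, y1, x2, y2, rng=rng)
print(x, y)  # x lies on the segment between x1 and x2; y sums to 1
```

Training on such interpolated pairs is what encourages the linear-in-between behavior whose smoothness effect the paper analyzes.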

Estimation of density functions supported on general domains arises when the data is naturally restricted to a proper subset of the real space. This problem is complicated by typically intractable normalizing constants. Score matching provides a powerful tool for estimating densities with such intractable normalizing constants, but as originally proposed is limited to densities on $\mathbb{R}^m$ and $\mathbb{R}_+^m$. In this paper, we offer a natural generalization of score matching that accommodates densities supported on a very general class of domains. We apply the framework to truncated graphical and pairwise interaction models, and provide theoretical guarantees for the resulting estimators. We also generalize a recently proposed method from bounded to unbounded domains, and empirically demonstrate the advantages of our method.

Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to the recovered functions. To ensure the compatibility between the two steps such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible, we propose to use wavelet soft-thresholding for pre-smoothing and Besov-norm-induced kernels for HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those identified by existing methods.
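A biased empirical HSIC estimator is a short computation on centered Gram matrices. The paper uses Besov-norm-induced kernels on pre-smoothed functions; this sketch uses plain Gaussian kernels on scalar samples purely to illustrate the statistic:

```python
import numpy as np

def gram_rbf(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for a 1-D sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n-1)^2, H the centering matrix."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gram_rbf(x, sigma), gram_rbf(y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x
print(hsic(x, y_dep), hsic(x, y_ind))    # the first is clearly larger
```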

Graph convolutional networks (GCNs) have achieved promising performance on various graph-based tasks. However, they suffer from over-smoothing when stacking more layers. In this paper, we present a quantitative study of this observation and develop novel insights towards deeper GCNs. First, we interpret the current graph convolutional operations from an optimization perspective and argue that over-smoothing is mainly caused by the naive first-order approximation of the solution to the optimization problem. Subsequently, we introduce two metrics to measure over-smoothing on node-level tasks: the ratios of the pairwise distance between connected nodes and between disconnected nodes, respectively, to the overall pairwise distance. Based on our theoretical and empirical analysis, we establish a universal theoretical framework for GCNs from an optimization perspective and derive a novel convolutional kernel named GCN+, which has fewer parameters while inherently relieving over-smoothing. Extensive experiments on real-world datasets demonstrate the superior performance of GCN+ over state-of-the-art baseline methods on node classification tasks.
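Those distance-ratio metrics can be sketched as follows; this is our own minimal reading of them, not the authors' exact definition:

```python
import numpy as np

def smoothness_ratios(H, A):
    """Mean pairwise feature distance over connected and over disconnected node
    pairs, each divided by the overall mean pairwise distance. Under
    over-smoothing, both ratios collapse toward each other."""
    n = len(H)
    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    iu = np.triu_indices(n, k=1)           # each unordered pair once
    d_all = D[iu].mean()
    conn = A[iu] > 0
    return D[iu][conn].mean() / d_all, D[iu][~conn].mean() / d_all

# Features where neighbors are similar: connected ratio << disconnected ratio.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
H = np.array([[0.0], [0.1], [5.0], [5.1]])
r_conn, r_disc = smoothness_ratios(H, A)
print(r_conn, r_disc)
```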

We propose two new criteria to understand the advantage of deepening neural networks. To understand this advantage, it is important to know the expressivity of the functions computable by deep neural networks: unless deep neural networks have enough expressivity, they cannot perform well even when learning is successful. In this situation, the proposed criteria contribute to understanding the advantage of deepening neural networks, since they evaluate expressivity independently from the efficiency of learning. The first criterion captures the approximation accuracy of deep neural networks to the target function, reflecting that the goal of deep learning is to approximate the target function by deep neural networks. The second criterion captures a property of the linear regions of functions computable by deep neural networks, reflecting that deep neural networks whose activation functions are piecewise linear are themselves piecewise linear. Furthermore, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units per layer in improving the expressivity of deep neural networks.

We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) is a framework for grouped and generalised multiple-degree-of-freedom structures, which provides a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete data setting to efficiently handle the presence of missing values in the observation vector. We discuss various special cases of the algorithm that result from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a data set that consists of the cryptocurrencies with the highest market capitalisation.

Deep neural networks (DNNs) have demonstrated excellent performance on various tasks; however, they are at risk from adversarial examples that can be easily generated when the target model is accessible to an attacker (the white-box setting). As plenty of machine learning models are deployed via online services that only provide query outputs from inaccessible models (e.g. the Google Cloud Vision API), black-box adversarial attacks (where the target model is inaccessible) are a more critical security concern in practice than white-box ones. However, existing query-based black-box adversarial attacks often require excessive model queries to maintain a high attack success rate. Therefore, in order to improve query efficiency, we explore the distribution of adversarial examples around benign inputs with the help of image structure information characterized by a Neural Process, and propose a Neural Process based black-box adversarial attack (NP-Attack) in this paper. Extensive experiments show that NP-Attack can greatly decrease the query counts under the black-box setting.

Information networks are ubiquitous and are ideal for modeling relational data. Since networks are sparse and irregular, network embedding algorithms have caught the attention of many researchers, who have come up with numerous embedding algorithms for static networks. Yet in real life, networks constantly evolve over time. Hence, evolutionary patterns, namely how nodes develop over time, serve as a powerful complement to static structures in embedding networks, yet relatively few works focus on them. In this paper, we propose EPNE, a temporal network embedding model preserving evolutionary patterns of the local structure of nodes. In particular, we analyze evolutionary patterns with and without periodicity and correspondingly design strategies to model such patterns in the time-frequency domain based on causal convolutions. In addition, we propose a temporal objective function which is optimized simultaneously with proximity-based ones such that both temporal and structural information are preserved. With this adequate modeling of temporal information, our model is able to outperform other competitive methods in various prediction tasks.

The detection of groups of molecules that co-localize with histopathological patterns or sub-structures is an important step in combining the rich high-dimensional content of mass spectrometry imaging (MSI) with classic histopathological staining. Here we present the evolution of GRINE into COBI-GRINE, an interactive web tool that maps MSI data onto a graph structure to detect communities of laterally similarly distributed molecules and co-visualizes the communities with Hematoxylin and Eosin (HE) stained images. Thereby, the tool enables biologists and pathologists to examine the MSI image graph in a target-oriented manner and links molecular co-localization to pathology. Another feature is the manual optimization of cluster results with the assistance of graph statistics, in order to improve the community results. As the graphs can become very complex, those statistics provide good heuristics to support and accelerate the detection of sub-clusters and misclusterings. This kind of edited cluster optimization allows the integration of expert background knowledge into the clustering result and a more precise analysis of the links between molecular co-localization and pathology.

Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operations and advanced information systems. A variety of solutions exist for short-term travel time predictions, such as those that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. In this paper we show the applicability and usefulness of travel time, i.e. delivery time, prediction for postal services. We investigate several methods, such as linear regression models and tree-based ensembles (random forest, bagging, and boosting) that allow delivery times to be predicted, conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and CatBoost, achieve higher accuracy and runtime efficiency than baselines such as linear regression models, the bagging regressor and random forest.
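The idea behind the best-performing models is gradient boosting: repeatedly fitting weak learners to the residuals of the current prediction. A self-contained toy version with decision stumps on synthetic "delivery time vs distance" data (real experiments would use libraries such as LightGBM or CatBoost):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump (threshold + two leaf means) for residuals r."""
    best = (np.inf, None)
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    return best[1]

def boost(x, y, rounds=50, lr=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        t, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, lv, rv)
    return pred

# Synthetic data with a nonlinear jump that a linear model would miss.
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 1.0, 3.0) + 0.2 * x
pred = boost(x, y)
print(np.mean((y - pred) ** 2))  # small training error
```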

Data clustering with uneven distribution under high levels of noise is challenging. Currently, HDBSCAN is considered the SOTA algorithm for this problem. In this paper, we propose a novel clustering algorithm based on what we call the graph of density topology (GDT). GDT jointly considers the local and global structures of data samples: it first forms local clusters based on a density-growing process, with a strategy for properly handling noise as well as detecting cluster boundaries, and then estimates a GDT from the relationships between local clusters in terms of a connectivity measure, giving a global topological graph. The connectivity, measuring similarity between neighboring local clusters, is based on local clusters rather than individual points, ensuring its robustness even to very large noise. Evaluation results on both toy and real-world datasets show that GDT achieves the SOTA performance by far on almost all the popular datasets, and has a low time complexity of $O(n \log n)$. The code is available at https://github.com/gaozhangyang/DGC.git.

This PhD thesis lays out algebraic and topological structures relevant for the study of probabilistic graphical models.

Marginal estimation algorithms are introduced as diffusion equations of the form $\dot u = \delta \varphi$. They generalise the traditional belief propagation (BP) algorithm, and provide an alternative to contrastive divergence (CD) or Markov chain Monte Carlo (MCMC) algorithms, typically involved in estimating a free energy functional and its gradient w.r.t. model parameters.

We propose a new homological picture where the parameters are collections of local interaction potentials $(u_\alpha) \in A_0$, for $\alpha$ running over the factor nodes of a given region graph. The boundary operator $\delta$ mapping heat fluxes $(\varphi_{\alpha\beta}) \in A_1$ to a subspace $\delta A_1 \subseteq A_0$ is the discrete analog of a divergence. The total energy $H = \sum_\alpha u_\alpha$ defining the global probability $p = e^{-H} / Z$ is in one-to-one correspondence with a homology class $[u] = u + \delta A_1$ of interaction potentials, so that total energy remains constant when $u$ evolves up to a boundary term $\delta \varphi$.

Stationary states of the diffusion are shown to lie at the intersection of a homology class of potentials with a non-linear constraint surface enforcing consistency of the local marginal estimates. This picture allows us to make precise and complete a proof of the correspondence between stationary states of BP and critical points of a local free energy functional (obtained by Bethe-Kikuchi approximations), and to extend the uniqueness result for acyclic graphs (i.e. trees) to a wider class of hypergraphs. In general, bifurcations of equilibria are related to the spectral singularities of a local diffusion operator, yielding new explicit examples of the degeneracy phenomenon.

Work supervised by Prof. Daniel Bennequin

We consider the problem of estimating a meta-model of an unknown regression model with non-Gaussian and unbounded error. The meta-model belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to an additive decomposition including the variables and interactions between them. The estimator of this meta-model is calculated by minimizing an empirical least-squares criterion penalized by the sum of the Hilbert norm and the empirical $L^2$-norm. In this context, upper bounds on the empirical $L^2$ risk and the $L^2$ risk of the estimator are established.

Recent work has identified a number of formally incompatible operational measures for the unfairness of a machine learning (ML) system. As these measures all capture intuitively desirable aspects of a fair system, choosing "the one true" measure is not possible, and instead a reasonable approach is to minimize a weighted combination of measures. However, this simply raises the question of how to choose the weights. Here, we formulate Legally Grounded Fairness Objectives (LGFO), which uses signals from the legal system to non-arbitrarily measure the social cost of a specific degree of unfairness. The LGFO is the expected damages under a putative lawsuit that might be awarded to those who were wrongly classified, in the sense that the ML system made a decision different to that which would have been made under the court's preferred measure. Notably, the two quantities necessary to compute the LGFO, the court's preferences about fairness measures, and the expected damages, are unknown but well-defined, and can be estimated by legal advice. Further, as the damages awarded by the legal system are designed to measure and compensate for the harm caused to an individual by an unfair classification, the LGFO aligns closely with society's estimate of the social cost.

The success of machine learning algorithms often relies on a large amount of high-quality data to train well-performing models. However, data are a valuable resource and are often held by different parties in reality. An effective solution to such a data isolation problem is to employ federated learning, which allows multiple parties to collaboratively train a model. In this paper, we propose a Secure version of the widely used Maximum Mean Discrepancy (SMMD), based on homomorphic encryption, to enable effective knowledge transfer in the data federation setting without compromising data privacy. The proposed SMMD avoids potential information leakage in transfer learning when aligning the source and target data distributions. As a result, both the source domain and the target domain can fully utilize their data to build more scalable models. Experimental results demonstrate that our proposed SMMD is secure and effective.
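For readers unfamiliar with the plaintext quantity being secured, a minimal unencrypted sketch of the squared MMD with an RBF kernel might look as follows; the function names and bandwidth are illustrative assumptions, and the homomorphic-encryption layer of SMMD is deliberately not shown:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel values via broadcasting; gamma is an
    # illustrative bandwidth choice, not the paper's setting.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy
    between samples x and y (small when distributions match)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Aligning source and target distributions in transfer learning then amounts to minimizing this quantity.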

Carbon dioxide Capture and Storage (CCS) is an important strategy in mitigating anthropogenic CO$_2$ emissions. In order for CCS to be successful, large quantities of CO$_2$ must be stored and the storage site conformance must be monitored. Here we present a deep learning method to reconstruct pressure fields and classify the flux out of the storage formation based on the pressure data from Above Zone Monitoring Interval (AZMI) wells. The deep learning method is a version of a semi-conditional variational autoencoder tailored to solve two tasks: reconstruction of an incremental pressure field and leakage rate classification. The method, predictions, and associated uncertainty estimates are illustrated on synthetic data from a high-fidelity heterogeneous 2D numerical reservoir model, which was used to simulate subsurface CO$_2$ movement and pressure changes in the AZMI due to a CO$_2$ leakage.

In analogy to compressed sensing, which allows sample-efficient signal reconstruction given prior knowledge of its sparsity in the frequency domain, we propose to utilize policy simplicity (Occam's Razor) as a prior to enable sample-efficient imitation learning. We first demonstrate the feasibility of this scheme in the linear case, where the state-value function can be sampled directly. We then extend the scheme to scenarios where only actions are visible and to scenarios where the policy is obtained from a nonlinear network. The method is benchmarked against behavior cloning and results in significantly higher scores with limited expert demonstrations.

Artificial intelligence (AI) provides many opportunities to improve private and public life. Discovering patterns and structures in large troves of data in an automated manner is a core component of data science, and currently drives applications in diverse areas such as computational biology, law, and finance. However, such a highly positive impact is coupled with significant challenges: how do we understand the decisions suggested by these systems so that we can trust them? In this report, we focus specifically on data-driven methods -- machine learning (ML) and pattern recognition models in particular -- so as to survey and distill the results and observations from the literature. The purpose of this report can be especially appreciated by noting that ML models are increasingly deployed in a wide range of businesses. However, with the increasing prevalence and complexity of methods, business stakeholders, at the very least, have a growing number of concerns about the drawbacks of models, data-specific biases, and so on. Analogously, data science practitioners are often not aware of approaches emerging from the academic literature, or may struggle to appreciate the differences between methods, and so end up using industry standards such as SHAP. Here, we have undertaken a survey to help industry practitioners (but also data scientists more broadly) understand the field of explainable machine learning better and apply the right tools. Our latter sections build a narrative around a putative data scientist, and discuss how she might go about explaining her models by asking the right questions.

This work presents a mathematical treatment of the relation between Self-Organizing Maps (SOMs) and Gaussian Mixture Models (GMMs). We show that energy-based SOM models can be interpreted as performing gradient descent, minimizing an approximation to the GMM log-likelihood that is particularly valid for high data dimensionalities. The SOM-like decrease of the neighborhood radius can be understood as an annealing procedure ensuring that gradient descent does not get stuck in undesirable local minima. This link allows us to treat SOMs as generative probabilistic models, giving a formal justification for using SOMs, e.g., to detect outliers, or for sampling.
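A minimal sketch may help make the energy-based view concrete: each step moves all prototypes toward the input, weighted by a neighborhood kernel around the best-matching unit whose radius sigma is annealed over training. The names, grid layout, and constants here are illustrative, not the paper's notation:

```python
import numpy as np

def som_step(weights, grid, x, lr=0.1, sigma=1.0):
    """One gradient-descent-style SOM update.
    weights: (K, d) prototypes; grid: (K, 2) map coordinates."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    # Gaussian neighborhood on the map grid; annealing shrinks
    # sigma over training so updates become increasingly local.
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    return weights + lr * h[:, None] * (x - weights)
```

As sigma shrinks, only the best-matching unit moves appreciably, which is the annealing behavior discussed above.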

High-dimensional streaming data are becoming increasingly ubiquitous in many fields. They often lie in multiple low-dimensional subspaces, and the manifold structures may change abruptly over time due to pattern shifts or the occurrence of anomalies. However, the problem of detecting such structural changes in real time has not been well studied. To fill this gap, we propose a dynamic sparse subspace learning (DSSL) approach for online structural change-point detection of high-dimensional streaming data. A novel multiple structural change-point model is proposed, and it is shown to be equivalent to maximizing a posterior probability under certain conditions. The asymptotic properties of the estimators are investigated. The penalty coefficients in our model can be selected by the AMDL criterion based on historical data. An efficient Pruned Exact Linear Time (PELT) based method is proposed for online optimization and change-point detection. The effectiveness of the proposed method is demonstrated through a simulation study and a real case study using gesture data for motion tracking.
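For readers unfamiliar with PELT, the dynamic program it accelerates can be sketched in a few lines. This is the unpruned $O(n^2)$ optimal-partitioning recursion for mean shifts in one dimension, not the authors' subspace formulation:

```python
import numpy as np

def change_points(y, penalty):
    """Penalized optimal partitioning for mean shifts. PELT adds a
    pruning rule that discards candidate split points that can
    never be optimal, reducing the quadratic cost."""
    n = len(y)
    s = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(np.square(y))))

    def seg_cost(i, j):
        # Squared error of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / (j - i)

    best = np.full(n + 1, np.inf)
    best[0] = -penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        cands = [best[i] + seg_cost(i, t) + penalty for i in range(t)]
        last[t] = int(np.argmin(cands))
        best[t] = cands[last[t]]
    cps, t = [], n
    while last[t] > 0:          # backtrack through optimal splits
        t = last[t]
        cps.append(t)
    return sorted(cps)
```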

This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretic interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). This theoretical result is also verified by various experiments. Furthermore, we find that such interactions are strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions, thereby alleviating over-fitting. Based on this understanding, we propose an interaction loss to further improve the utility of dropout. Experimental results show that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs.
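To make the notion of interaction concrete, here is a toy, hedged stand-in for the game-theoretic quantity: the interaction between inputs i and j is the part of the output that cannot be attributed to either input alone (the paper's actual measure is the Shapley-style interaction, which additionally averages over contexts):

```python
def pairwise_interaction(f, x, i, j, baseline=0.0):
    """f(with both) - f(without i) - f(without j) + f(without both):
    zero whenever f is additive in inputs i and j."""
    def drop(*idxs):
        z = list(x)
        for k in idxs:
            z[k] = baseline   # "remove" an input by ablating it
        return f(z)
    return drop() - drop(i) - drop(j) + drop(i, j)
```

Dropout, by randomly ablating inputs and activations, discourages the network from relying on exactly the joint terms that make this quantity large.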

Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic 'shallow' and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability techniques, and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in anomaly detection.
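As a concrete instance of the 'shallow' reconstruction family surveyed above, a PCA-based score flags points that the top-$k$ principal subspace reconstructs poorly; this is a minimal sketch, and deep reconstruction methods replace the linear projection with an autoencoder:

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, k=2):
    """Anomaly score = norm of the residual after projecting a
    centered point onto the top-k principal subspace of X_train."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]                 # projector onto subspace
    resid = (X_test - mu) @ (np.eye(X_train.shape[1]) - P)
    return np.linalg.norm(resid, axis=1)
```

Points far from the training manifold receive high scores, which is the shared principle behind both the shallow and the deep reconstruction-based detectors discussed here.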

With the rapid development of information technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute $l$ pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first run on the master to select pseudo centers. The indexes of the pseudo centers are then broadcast to the workers to complete the distributed community detection task using an SVD-type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated. Second, no further iterative algorithm is needed on the workers, and hence the method does not suffer from problems such as initialization sensitivity and non-robustness. Third, both the computational complexity and the storage requirements are much lower than when using the whole adjacency matrix. A Python package DCD (www.github.com/Ikerlz/dcd) is developed to implement the distributed algorithm for a Spark system. Theoretical properties are provided with respect to the estimation accuracy and mis-clustering rates. Lastly, the advantages of the proposed methodology are illustrated by experiments on a variety of synthetic and empirical datasets.

Bank transaction fraud results in over $13B in annual losses for banks, merchants, and card holders worldwide. Much of this fraud starts with a Point-of-Compromise (a data breach or a skimming operation) where credit and debit card digital information is stolen, resold, and later used to perform fraud. We introduce this problem and present an automatic Points-of-Compromise (POC) detection procedure. BreachRadar is a distributed alternating algorithm that assigns a probability of being compromised to the different possible locations. We implement this method using Apache Spark and show its linear scalability in the number of machines and transactions. BreachRadar is applied to two datasets with billions of real transaction records and fraud labels, where we provide multiple examples of real Points-of-Compromise we are able to detect. We further show the effectiveness of our method when injecting Points-of-Compromise into one of these datasets, simultaneously achieving over 90% precision and recall when only 10% of the cards have been victims of fraud.

In the classical multi-party computation setting, multiple parties jointly compute a function without revealing their own input data. We consider a variant of this problem, where the input data can be shared for machine learning training purposes, but the data are also encrypted so that they cannot be recovered by other parties. We present a rotation-based method using a flow model, and theoretically justify its security. We demonstrate the effectiveness of our method in different scenarios, including supervised secure model training and unsupervised generative model training. Our code is available at https://github.com/duchenzhuang/flowencrypt.

This paper explores a new research problem of unsupervised transfer learning across multiple spatiotemporal prediction tasks.
Unlike most existing transfer learning methods, which focus on fixing the discrepancy between supervised tasks, we study how to transfer knowledge from a zoo of models learned without supervision towards another predictive network. Our motivation is that models from different sources are expected to understand the complex spatiotemporal dynamics from different perspectives, thereby effectively supplementing the new task, even if the task has sufficient training samples. Technically, we propose a differentiable framework named transferable memory. It adaptively distills knowledge from a bank of memory states of multiple pretrained RNNs, and applies it to the target network via a novel recurrent structure called the Transferable Memory Unit (TMU). Compared with finetuning, our approach yields significant improvements on three benchmarks for spatiotemporal prediction, and benefits the target task even from less relevant pretext ones.

Balancing the distributions of the confounders across the exposure levels in an observational study through matching or weighting is an accepted method to control for confounding due to these variables when estimating the association between an exposure and an outcome, and to reduce the degree of dependence on certain modeling assumptions. Despite their increasing popularity in practice, these procedures cannot be immediately applied to datasets with missing values. Multiple imputation of the missing data is a popular approach that accounts for missing values while preserving the number of units in the dataset and reflecting the uncertainty in the missing values. However, to the best of our knowledge, there is no comprehensive matching and weighting software that can be easily used with multiply imputed datasets. In this paper, we review this problem and suggest a framework that maps matching and weighting of multiply imputed datasets to five actions, as well as best practices for assessing balance in these datasets after matching and weighting.
We also illustrate these approaches using a companion R package, MatchThem.

Ecologists are interested in modeling the population growth of species in various ecosystems. Studying population dynamics can assist environmental managers in making better decisions for the environment. Traditionally, the sampling of species and tracking of populations have been recorded at a regular time frequency. However, sampling can be an expensive process given the available resources, money, and time. Limiting sampling makes it challenging to properly track the growth of a population. Thus, we propose a novel approach to designing sampling regimes based on the dynamics associated with population growth models. This design study minimizes the amount of time ecologists spend in the field, while maximizing the information provided by the data.

Studies often estimate associations between an outcome and multiple variates. For example, studies of diagnostic test accuracy estimate sensitivity and specificity, and studies of prognostic factors typically estimate associations for multiple factors. Meta-analysis is a family of methods for synthesizing estimates across multiple studies. Multivariate models exist that account for within-study correlations and between-study heterogeneity. The number of parameters that must be estimated in existing models is quadratic in the number of variates, which means they may not be usable if the data are sparse, with many variates and few studies. We propose a new model that addresses this problem by approximating a variance-covariance matrix that models within-study correlation and between-study heterogeneity in a low-dimensional space using random projection. The number of parameters that must be estimated in this model is quadratic in the dimensionality of the low-dimensional space, making estimation more tractable.
We demonstrate the method using data from an ongoing systematic review on predictors of pain and function after total knee arthroplasty.

Obtaining data with meaningful labels is often costly and error-prone. In this situation, semi-supervised learning (SSL) approaches are attractive, as they leverage assumptions about the unlabeled data to make up for the limited amount of labels. However, in real-world situations we cannot assume that the labeling process is infallible, and the accuracy of many SSL classifiers decreases significantly in the presence of label noise. In this work, we introduce LGC_LVOF, a leave-one-out filtering approach based on the Local and Global Consistency (LGC) algorithm. Our method aims to detect and remove wrong labels, and thus can be used as a preprocessing step for any SSL classifier. Given the propagation matrix, detecting noisy labels takes $O(cl)$ per step, with $c$ the number of classes and $l$ the number of labels. Moreover, one does not need to compute the whole propagation matrix, but only an $l \times l$ submatrix corresponding to interactions between labeled instances. As a result, our approach is best suited to datasets with a large amount of unlabeled data but not many labels. Results are provided for a number of datasets, including MNIST and ISOLET. LGC_LVOF appears to be equally or more precise than the adapted gradient-based filter. We show that the best-case accuracy of embedding LGC_LVOF into LGC yields performance comparable to the best case of $\ell_1$-based classifiers designed to be robust to label noise. We also provide a heuristic to choose the number of removed instances.

In randomized clinical trials, adjustments for baseline covariates at both the design and analysis stages are highly encouraged by regulatory agencies. A recent trend is to use a model-assisted approach for covariate adjustment to gain credibility and efficiency while producing asymptotically valid inference even when the model is incorrect.
In this article we present three principles for model-assisted inference in simple or covariate-adaptive randomized trials: (1) the guaranteed efficiency gain principle, a model-assisted method should often gain but never hurt efficiency; (2) the validity and universality principle, a valid procedure should be universally applicable to all commonly used randomization schemes; (3) the robust standard error principle, variance estimation should be heteroscedasticity-robust. To fulfill these principles, we recommend a working model that includes all covariates utilized in randomization and all treatment-by-covariate interaction terms. Our conclusions are based on asymptotic theory with a generality that has not appeared in the literature, as most existing results are about linear contrasts of outcomes rather than the joint distribution, and most existing inference results under covariate-adaptive randomization are special cases of our theory. Our theory also reveals distinct results between the cases of two arms and multiple arms.

Sequential sensor data are generated in a wide variety of practical applications. A fundamental challenge involves learning effective classifiers for such sequential data. While deep learning has led to impressive performance gains in recent years in domains such as speech, this has relied on the availability of large datasets of sequences with high-quality labels. In many applications, however, the associated class labels are often extremely limited, with precise labelling/segmentation being too expensive to perform at high volume. However, large amounts of unlabeled data may still be available. In this paper we propose a novel framework for semi-supervised learning in such contexts. In an unsupervised manner, change point detection methods can be used to identify points within a sequence corresponding to likely class changes.
We show that change points provide examples of similar/dissimilar pairs of sequences which, when coupled with labeled data, can be used in a semi-supervised classification setting. Leveraging the change points and labeled data, we form examples of similar/dissimilar sequences to train a neural network to learn improved representations for classification. We provide extensive synthetic simulations and show that the learned representations are superior to those learned through an autoencoder, and obtain improved results on both simulated and real-world human activity recognition datasets.

Recent network pruning methods focus on pruning models early on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed for pruning trained models. Despite lacking justification for their use early on in training, models pruned using such measures result in surprisingly minimal accuracy loss. To better explain this behavior, we develop a general, gradient-flow-based framework that relates state-of-the-art importance measures through the order of the time-derivative of the norm of model parameters. We use this framework to determine the relationship between pruning measures and the evolution of model parameters, establishing several findings related to pruning models early on in training: (i) magnitude-based pruning removes parameters that contribute least to the reduction in loss, resulting in models that converge faster than those from magnitude-agnostic methods; (ii) loss-preservation-based pruning preserves first-order model evolution dynamics and is well motivated for pruning minimally trained models; and (iii) gradient-norm-based pruning affects second-order model evolution dynamics, and increasing the gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. Code is available at https://github.com/EkdeepSLubana/flowandprune.
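Of the importance measures related by the pruning framework above, magnitude-based pruning is the simplest to illustrate; a hedged sketch of a global-threshold variant (the function name and threshold rule are illustrative, not the paper's implementation) is:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of parameters with the smallest
    absolute value, using one threshold shared across all layers."""
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    k = int(sparsity * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    thresh = np.partition(flat, k)[k]    # k-th smallest magnitude
    return [np.where(np.abs(w) < thresh, 0.0, w) for w in weights]
```

In the gradient-flow hierarchy described above, magnitude corresponds to the lowest-order term, while loss-preservation and gradient-norm measures involve first- and second-order dynamics respectively.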
We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while multilayer perceptrons (MLPs) do not extrapolate well in simple tasks, Graph Neural Networks (GNNs), structured networks with MLP modules, have had some success in more complex tasks. We provide a theoretical explanation and identify conditions under which MLPs and GNNs extrapolate well. We start by showing that ReLU MLPs trained by gradient descent converge quickly to linear functions along any direction from the origin, which suggests that ReLU MLPs cannot extrapolate well in most non-linear tasks. On the other hand, ReLU MLPs can provably converge to a linear target function when the training distribution is "diverse" enough. These observations lead to a hypothesis: GNNs can extrapolate well in dynamic programming (DP) tasks if we encode appropriate non-linearity in the architecture and input representation. We provide theoretical and empirical support for the hypothesis. Our theory explains previous extrapolation successes and suggests their limitations: successful extrapolation relies on incorporating task-specific non-linearity, which often requires domain knowledge or extensive model search.

We give an explicit formula for the reciprocal maximum likelihood degree of Brownian motion tree models. To achieve this, we connect them to certain toric (or log-linear) models, and express the Brownian motion tree model of an arbitrary tree as a toric fiber product of star tree models.

Stochastic gradient descent (SGD) is often applied to train Deep Neural Networks (DNNs), and research efforts have been devoted to investigating the convergent dynamics of SGD and the minima found by SGD.
The influencing factors identified in the literature include the learning rate, batch size, Hessian, and gradient covariance, and stochastic differential equations are used to model SGD and establish the relationships among these factors for characterizing the minima found by SGD. It has been found that the ratio of batch size to learning rate is a main factor in the underlying SGD dynamics; however, the influence of other important factors, such as the Hessian and gradient covariance, is not entirely agreed upon. This paper describes the factors and relationships in the recent literature and presents numerical findings on the relationships. In particular, it confirms the four-factor and general relationship results obtained in Wang (2019), while the three-factor and associated relationship results found in Jastrz\c{e}bski et al. (2018) may not hold beyond the considered special case.

In order to determine whether or not an effect is absent on the basis of a statistical test, the recommended frequentist tool is the equivalence test. Typically, it is expected that an appropriate equivalence margin has been specified before any data are observed. Unfortunately, this can be a difficult task. If the margin is too small, then the test's power will be substantially reduced. If the margin is too large, any claims of equivalence will be meaningless. Moreover, it remains unclear how defining the margin afterwards biases one's results. In this short article, we consider a series of hypothetical scenarios in which the margin is defined post hoc or is otherwise considered controversial. We also review a number of relevant, potentially problematic actual studies from clinical trials research, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.
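For context, an equivalence test is typically run as two one-sided tests (TOST); the minimal normal-approximation sketch below (the function name and z-test form are illustrative assumptions) shows directly how the margin drives power:

```python
import math

def tost_equivalent(est, se, margin, alpha=0.05):
    """Declare equivalence to 0 within +/- margin iff both
    one-sided z-tests reject at level alpha."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    p_lower = 1.0 - phi((est + margin) / se)  # H0: effect <= -margin
    p_upper = phi((est - margin) / se)        # H0: effect >= +margin
    return max(p_lower, p_upper) < alpha
```

With an estimate of 0 and standard error 0.1, a margin of 0.5 declares equivalence while a margin of 0.05 cannot, which is the "margin too small, power substantially reduced" effect described above.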
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his or her allocation to anyone else's. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy-free on a sample would be almost envy-free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.

We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample-splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance, and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class.
We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates---rates of the same order as if we knew the nuisance parameter---are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation, and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

Deep reinforcement learning has been successful in a variety of tasks, such as game playing and robotic manipulation. However, attempting to learn \textit{tabula rasa} disregards the logical structure of many domains as well as the wealth of readily available knowledge from domain experts that could help "warm start" the learning process. We present a novel reinforcement learning technique that allows for intelligent initialization of a neural network's weights and architecture. Our approach permits encoding domain knowledge directly into a neural decision tree, and improves upon that knowledge with policy gradient updates. We empirically validate our approach on two OpenAI Gym tasks and two modified StarCraft 2 tasks, showing that our novel architecture outperforms multilayer-perceptron and recurrent architectures. Our knowledge-based framework finds superior policies compared to imitation-learning-based and prior-knowledge-based approaches. Importantly, we demonstrate that our approach can be used by untrained humans to initially provide a >80% increase in expected reward relative to baselines prior to training (p < 0.001), which results in a >60% increase in expected reward after policy optimization (p = 0.011).

Network embedding is a method to learn low-dimensional representation vectors for nodes in complex networks.
In real networks, nodes may have multiple tags, but existing methods ignore the abundant semantic and hierarchical information of tags. This information is useful to many network applications and is usually very stable. In this paper, we propose a tag representation learning model, Tag2Vec, which mixes nodes and tags into a hybrid network. First, for tag networks, we define semantic distance as the proximity between tags and design a novel strategy, parameterized random walk, to adaptively generate context with the semantic and hierarchical information of tags. Then, we propose a hyperbolic Skip-gram model to better express the complex hierarchical structure with lower output dimensions. We evaluate our model on the NBER U.S. patent dataset and the WordNet dataset. The results show that our model can learn tag representations with rich semantic information, and that it outperforms other baselines.

In many cases, neural networks can be mapped into tensor networks with an exponentially large bond dimension. Here, we compare different sub-classes of neural network states with their mapped tensor network counterparts for studying the ground state of short-range Hamiltonians. We show that when mapping a neural network, the resulting tensor network is highly constrained, and thus neural network states do not in general deliver the naively expected drastic improvement over state-of-the-art tensor network methods. We explicitly show this result in two paradigmatic examples, the 1D ferromagnetic Ising model and the 2D antiferromagnetic Heisenberg model, addressing the lack of a detailed comparison of the expressiveness of these increasingly popular variational ans\"atze.

The cephalometric tracing method is commonly used in orthodontic diagnosis and treatment planning. In this paper, we propose a deep learning based framework to automatically detect anatomical landmarks in cephalometric X-ray images.
We train a deep encoder-decoder for landmark detection, and combine the global landmark configuration with local high-resolution feature responses. The proposed framework is based on a 2-stage U-Net, regressing multi-channel heatmaps for landmark detection. In this framework, we embed an attention mechanism with the global-stage heatmaps, guiding the local-stage inference, to regress the local heatmap patches at high resolution. Besides, an Expansive Exploration strategy improves robustness during inference, expanding the search scope without increasing model complexity. We have evaluated our framework on the most widely used public dataset for landmark detection in cephalometric X-ray images. With less computation and manual tuning, our framework achieves state-of-the-art results.

The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation) divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD) divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and variances, we compare interval estimators using simulation studies to assess coverage.

Surprise-based learning allows agents to rapidly adapt to non-stationary stochastic environments characterized by sudden changes.
We show that exact Bayesian inference in a hierarchical model gives rise to a surprise-modulated trade-off between forgetting old observations and integrating them with the new ones. The modulation depends on a probability ratio, which we call the "Bayes Factor Surprise", that tests the prior belief against the current belief. We demonstrate that in several existing approximate algorithms the Bayes Factor Surprise modulates the rate of adaptation to new observations. We derive three novel surprise-based algorithms, one in the family of particle filters, one in the family of variational learning, and the other in the family of message passing, that have constant scaling in observation sequence length and particularly simple update dynamics for any distribution in the exponential family. Empirical results show that these surprise-based algorithms estimate parameters better than alternative approximate approaches and reach levels of performance comparable to computationally more expensive algorithms. The Bayes Factor Surprise is related to but different from Shannon Surprise. In two hypothetical experiments, we make testable predictions for physiological indicators that dissociate the Bayes Factor Surprise from Shannon Surprise. The theoretical insight of casting various approaches as surprise-based learning, as well as the proposed online algorithms, may be applied to the analysis of animal and human behavior, and to reinforcement learning in non-stationary environments.

The $k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, the $k$-means algorithm suffers from relatively poor performance on datasets containing heavy-tailed data or outliers. Besides, the standard $k$-means algorithm also has relatively weak stability, i.e., its results have a large variance, which reduces the model credibility.
In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version, to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. A large number of experiments are also conducted, which verify the effectiveness and efficiency of the proposed method.

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
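The sensitivity of standard $k$-means to outliers, which motivates the $t$-$k$-means above, is easy to see in a toy example: a single gross outlier drags a mean-based cluster center far from the bulk of the data, while a median-based center update stays put. (The median is used here only as a simple robust stand-in for illustration; the paper's actual method replaces the Gaussian-derived loss with a heavy-tailed $t$-distribution-based one.)

```python
import numpy as np

# One-dimensional "cluster" with a single gross outlier at 100.
cluster = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 100.0])

mean_center = cluster.mean()        # standard k-means center update
median_center = np.median(cluster)  # robust stand-in for a heavy-tail-aware update

# The outlier pulls the mean far away from the bulk of the points,
# while the median remains close to 1.
print(mean_center)    # 17.5
print(median_center)  # 1.05
```

Iterating such a robust center update inside Lloyd's algorithm is the general shape of the fix; the paper's analysis additionally covers stability of the resulting clustering.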
Operator-theoretic analysis of nonlinear dynamical systems has attracted much attention in a variety of engineering and scientific fields, endowed with practical estimation methods using data such as dynamic mode decomposition. In this paper, we address a lifted representation of nonlinear dynamical systems with random noise based on transfer operators, and develop a novel Krylov subspace method for estimating the operators using finite data, with consideration of the unboundedness of operators. For this purpose, we first consider Perron-Frobenius operators with kernel-mean embeddings for such systems. We then extend the Arnoldi method, which is the most classical type of Krylov subspace method, so that it can be applied to the current case. Meanwhile, the Arnoldi method requires the assumption that the operator is bounded, which is not necessarily satisfied for transfer operators on nonlinear systems. We accordingly develop the shift-invert Arnoldi method for Perron-Frobenius operators to avoid this problem. Also, we describe an approach for evaluating the predictive accuracy of estimated operators on the basis of the maximum mean discrepancy, which is applicable, for example, to anomaly detection in complex systems. The empirical performance of our methods is investigated using synthetic and real-world healthcare data.

[New and updated results were published in Nature Chemistry, doi:10.1038/s41557-020-0544-y.] The electronic Schr\"odinger equation describes fundamental properties of molecules and materials, but can only be solved analytically for the hydrogen atom. The numerically exact full configuration-interaction method is exponentially expensive in the number of electrons. Quantum Monte Carlo is a possible way out: it scales well to large molecules, can be parallelized, and its accuracy has, as yet, only been limited by the flexibility of the used wave function ansatz.
Here we propose PauliNet, a deep-learning wave function ansatz that achieves nearly exact solutions of the electronic Schr\"odinger equation. PauliNet has a multireference Hartree-Fock solution built in as a baseline, incorporates the physics of valid wave functions, and is trained using variational quantum Monte Carlo (VMC). PauliNet outperforms comparable state-of-the-art VMC ansatzes for atoms, diatomic molecules and a strongly-correlated hydrogen chain by a margin and is yet computationally efficient. We anticipate that, thanks to the favourable scaling with system size, this method may become a new leading method for highly accurate electronic-structure calculations on medium-sized molecular systems.

Rapid intensification (RI) of tropical cyclones often causes major destruction to human civilization due to short response time. It is an important yet challenging task to accurately predict this kind of extreme weather event in advance. Traditionally, meteorologists tackle the task with human-driven feature extraction and predictor correction procedures. Nevertheless, these procedures do not leverage the power of modern machine learning models and abundant sensor data, such as satellite images. In addition, the human-driven nature of such an approach makes it difficult to reproduce and benchmark prediction models. In this study, we build a benchmark for RI prediction using only satellite images, which are underutilized in traditional techniques. The benchmark follows conventional data science practices, making it easier for data scientists to contribute to RI prediction. We demonstrate the usefulness of the benchmark by designing a domain-inspired spatiotemporal deep learning model. The results showcase the promising performance of deep learning in solving complex meteorological problems such as RI prediction.

This paper investigates statistical models for road traffic modeling.
The proposed methodology considers road traffic as (i) a high-dimensional time-series for which (ii) regeneration occurs at the end of each day. Because of (ii), prediction is based on a daily modeling of the road traffic using a vector autoregressive model that combines linearly the past observations of the day. Because of (i), the learning algorithm follows from an $\ell_1$-penalization of the regression coefficients. Excess risk bounds are established under the high-dimensional framework in which the number of road sections goes to infinity with the number of observed days. Considering floating car data observed in an urban area, the approach is compared to state-of-the-art methods including neural networks. In addition to being very competitive in terms of prediction, it makes it possible to identify the most determinant sections of the road network.

Most existing knowledge graphs suffer from incompleteness, which can be alleviated by inferring missing links based on known facts. One popular way to accomplish this is to generate low-dimensional embeddings of entities and relations, and use these to make inferences. ConvE, a recently proposed approach, applies convolutional filters on 2D reshapings of entity and relation embeddings in order to capture rich interactions between their components. However, the number of interactions that ConvE can capture is limited. In this paper, we analyze how increasing the number of these interactions affects link prediction performance, and utilize our observations to propose InteractE. InteractE is based on three key ideas -- feature permutation, a novel feature reshaping, and circular convolution. Through extensive experiments, we find that InteractE outperforms state-of-the-art convolutional link prediction baselines on FB15k-237. Further, InteractE achieves an MRR score that is 9%, 7.5%, and 23% better than ConvE on the FB15k-237, WN18RR and YAGO3-10 datasets respectively.
The results validate our central hypothesis -- that increasing feature interaction is beneficial to link prediction performance. We make the source code of InteractE available to encourage reproducible research.

Although ADAM is a very popular algorithm for optimizing the weights of neural networks, it has been recently shown that it can diverge even in simple convex optimization examples. Several variants of ADAM have been proposed to circumvent this convergence issue. In this work, we study the ADAM algorithm for smooth nonconvex optimization under a boundedness assumption on the adaptive learning rate. The bound on the adaptive step size depends on the Lipschitz constant of the gradient of the objective function and provides safe theoretical adaptive step sizes. Under this boundedness assumption, we show a novel first order convergence rate result in both deterministic and stochastic contexts. Furthermore, we establish convergence rates of the function value sequence using the Kurdyka-Lojasiewicz property.

Inferring concerted changes among biological traits along an evolutionary history remains an important yet challenging problem. Besides adjusting for spurious correlation induced from the shared history, the task also requires sufficient flexibility and computational efficiency to incorporate multiple continuous and discrete traits as data size increases. To accomplish this, we jointly model mixed-type traits by assuming latent parameters for binary outcome dimensions at the tips of an unknown tree informed by molecular sequences. This gives rise to a phylogenetic multivariate probit model. With large sample sizes, posterior computation under this model is problematic, as it requires repeated sampling from a high-dimensional truncated normal distribution. Current best practices employ multiple-try rejection sampling that suffers from slow-mixing and a computational cost that scales quadratically in sample size.
We develop a new inference approach that exploits 1) the bouncy particle sampler (BPS) based on piecewise deterministic Markov processes to simultaneously sample all truncated normal dimensions, and 2) novel dynamic programming that reduces the cost of likelihood and gradient evaluations for BPS to linear in sample size. In an application with 535 HIV viruses and 24 traits that necessitates sampling from a 12,840-dimensional truncated normal, our method makes it possible to estimate the across-trait correlation and detect factors that affect the pathogen's capacity to cause disease. This inference framework is also applicable to a broader class of covariance structures beyond comparative biology.

Graph link prediction is an important task in cyber-security: relationships between entities within a computer network, such as users interacting with computers, or system libraries and the corresponding processes that use them, can provide key insights into adversary behaviour. Poisson matrix factorisation (PMF) is a popular model for link prediction in large networks, particularly useful for its scalability. In this article, PMF is extended to include scenarios that are commonly encountered in cyber-security applications. Specifically, an extension is proposed to explicitly handle binary adjacency matrices and include known covariates associated with the graph nodes. A seasonal PMF model is also presented to handle seasonal networks. To allow the methods to scale to large graphs, variational methods are discussed for performing fast inference. The results show an improved performance over the standard PMF model and other statistical network models.

This paper addresses the case where data come as point sets, or more generally as discrete measures.
Our motivation is twofold: first we intend to approximate with a compactly supported measure the mean of the measure generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second we build from the estimator of the mean measure a vectorization map, which sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that in a mixture of measure generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.

The goal of many scientific experiments including A/B testing is to estimate the average treatment effect (ATE), which is defined as the difference between the expected outcomes of two or more treatments. In this paper, we consider a situation where an experimenter can assign a treatment to research subjects sequentially. In adaptive experimental design, the experimenter is allowed to change the probability of assigning a treatment using past observations for estimating the ATE efficiently. However, with this approach, it is difficult to apply a standard statistical method to construct an estimator because the observations are not independent and identically distributed. We thus propose an algorithm for efficient experiments with estimators constructed from dependent samples. We also introduce a sequential testing framework using the proposed estimator. To justify our proposed approach, we provide finite and infinite sample analyses.
Finally, we experimentally show that the proposed algorithm exhibits preferable performance.

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.

Efficient numerical solvers for sparse linear systems are crucial in science and engineering. One of the fastest methods for solving large-scale sparse linear systems is algebraic multigrid (AMG). The main challenge in the construction of AMG algorithms is the selection of the prolongation operator -- a problem-dependent sparse matrix which governs the multiscale hierarchy of the solver and is critical to its efficiency. Over many years, numerous methods have been developed for this task, and yet there is no known single right answer except in very special cases. Here we propose a framework for learning AMG prolongation operators for linear systems with sparse symmetric positive (semi-) definite matrices. We train a single graph neural network to learn a mapping from an entire class of such matrices to prolongation operators, using an efficient unsupervised loss function.
Experiments on a broad class of problems demonstrate improved convergence rates compared to classical AMG, demonstrating the potential utility of neural networks for developing sparse system solvers.

We consider the problem of simultaneous variable selection and estimation of the corresponding regression coefficients in an ultra-high dimensional linear regression model, an extremely important problem in the recent era. Adaptive penalty functions are used in this regard to achieve the oracle variable selection property along with a lighter computational burden. However, the usual adaptive procedures (e.g., adaptive LASSO) based on the squared error loss function are extremely non-robust in the presence of data contamination, which is quite common with large-scale data (e.g., noisy gene expression data, spectra and spectral data). In this paper, we present a regularization procedure for ultra-high dimensional data using a robust loss function based on the popular density power divergence (DPD) measure along with the adaptive LASSO penalty. We theoretically study the robustness and the large-sample properties of the proposed adaptive robust estimators for a general class of error distributions; in particular, we show that the proposed adaptive DPD-LASSO estimator is highly robust, satisfies the oracle variable selection property, and the corresponding estimators of the regression coefficients are consistent and asymptotically normal under an easily verifiable set of assumptions. Numerical illustrations are provided for the most commonly used normal error density. Finally, the proposal is applied to analyze an interesting spectral dataset, in the field of chemometrics, regarding the electron-probe X-ray microanalysis (EPXMA) of archaeological glass vessels from the 16th and 17th centuries.

The coronavirus disease 2019 (COVID-19) has quickly grown from a regional outbreak in Wuhan, China to a global pandemic.
Early estimates of the epidemic growth and incubation period of COVID-19 may have been biased due to sample selection. Using detailed case reports from 14 locations in and outside mainland China, we obtained 378 Wuhan-exported cases who left Wuhan before an abrupt travel quarantine. We developed a generative model we call BETS for four key epidemiological events---Beginning of exposure, End of exposure, time of Transmission, and time of Symptom onset (BETS)---and derived explicit formulas to correct for the sample selection. We gave a detailed illustration of why some early and highly influential analyses of the COVID-19 pandemic were severely biased. All our analyses, regardless of which subsample and model were being used, point to an epidemic doubling time of 2 to 2.5 days during the early outbreak in Wuhan. A Bayesian nonparametric analysis further suggests that about 5% of the symptomatic cases may not develop symptoms within 14 days of infection and that men may be much more likely than women to develop symptoms within 2 days of infection.

Aircraft performance models play a key role in airline operations, especially in planning a fuel-efficient flight. In practice, manufacturers provide guidelines which are slightly modified throughout the aircraft life cycle via the tuning of a single factor, enabling better fuel predictions. However, this approach has limitations; in particular, it does not reflect the evolution of each feature impacting the aircraft performance. Our goal here is to overcome this limitation. The key contribution of the present article is to foster the use of machine learning to leverage the massive amounts of data continuously recorded during flights performed by an aircraft and provide models reflecting its actual and individual performance. We illustrate our approach by focusing on the estimation of the drag and lift coefficients from recorded flight data. As these coefficients are not directly recorded, we resort to aerodynamics approximations.
As a safety check, we provide bounds to assess the accuracy of both the aerodynamics approximation and the statistical performance of our approach. We provide numerical results on a collection of machine learning algorithms. We report excellent accuracy on real-life data and exhibit empirical evidence to support our modelling, in coherence with aerodynamics principles.

This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the data chunk-by-chunk within the confines of a limited memory buffer. At first, a temporary clustering model is built; it is then incrementally updated by analyzing consecutive memory loads of points. Ultimately, the proposed algorithm gives an outlying score to each object, named SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has low linear time complexity and is more effective and efficient compared to the best-known conventional density-based methods, which need to load all data into memory, and also to some fast distance-based methods, which can operate on data resident on disk.

Researchers regularly use synthetic control methods for estimating causal effects when a subset of units receive a single persistent treatment, and the rest are unaffected by the change. In many applications, however, units not assigned to treatment are nevertheless impacted by the intervention because of cross-unit interactions. This paper extends the synthetic control methods to accommodate partial interference, allowing interactions within predefined groups, but not between them.
Focusing on a class of causal estimands that capture the effect both on the treated and control units, we develop a multivariate Bayesian structural time series model for generating synthetic controls that would have occurred in the absence of an intervention, enabling us to estimate our novel effects. In a simulation study, we explore our Bayesian procedure's empirical properties and show that it achieves good frequentist coverage even when the model is misspecified. Our work is motivated by an analysis of a marketing campaign's effectiveness by an Italian supermarket chain that permanently reduced the price of hundreds of store-brand products. We use our new methodology to make causal statements about the impact on sales of the affected store-brands and their direct competitors. Our proposed approach is implemented in the CausalMBSTS R package.

Statisticians have warned us since the early days of their discipline that experimental correlation between two observations by no means implies the existence of a causal relation. The question of what clues exist in observational data that could inform us about the existence of such causal relations is nevertheless more than legitimate. It lies actually at the root of any scientific endeavor. For decades, however, the only method accepted among statisticians for elucidating causal relationships was the so-called Randomized Controlled Trial. Besides this notorious exception, causality questions remained largely taboo for many. One reason for this state of affairs was the lack of an appropriate mathematical framework in which to formulate such questions unambiguously. Fortunately, things have changed in recent years with the advent of the so-called Causality Revolution initiated by Judea Pearl and coworkers. The aim of this pedagogical paper is to present their ideas and methods in a compact and self-contained fashion, with concrete business examples as illustrations.
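The warning that correlation need not imply causation can be reproduced in a few lines (a purely illustrative simulation, not taken from the paper): a hidden confounder Z drives both X and Y, so X and Y are strongly correlated even though neither causes the other, and a randomized "trial" on X makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder Z drives both X and Y; X has no causal effect on Y.
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)
print(round(np.corrcoef(x, y)[0, 1], 2))  # ~0.99: strong correlation, no causation

# Randomized "trial": X is assigned at random (an intervention), cutting the
# link to Z; Y is generated exactly as before.
x_do = rng.normal(size=n)
y_do = z + 0.1 * rng.normal(size=n)
print(round(abs(np.corrcoef(x_do, y_do)[0, 1]), 2))  # ~0: association vanishes
```

This is the observational-versus-interventional gap that Pearl's do-calculus formalizes; the Randomized Controlled Trial is the physical realization of the intervention in the second half.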
Advances in Reinforcement Learning (RL) have successfully tackled sample efficiency and overestimation bias. However, these methods often fall short of scalable performance. On the other hand, genetic methods provide scalability but exhibit hyperparameter sensitivity to evolutionary operations. We present the Evolution-based Soft Actor-Critic (ESAC), a scalable RL algorithm. Our contributions are threefold: ESAC (1) abstracts exploration from exploitation by combining Evolution Strategies (ES) with Soft Actor-Critic (SAC), (2) provides dominant skill transfer between offspring by making use of soft winner selections and genetic crossovers in hindsight, and (3) improves hyperparameter sensitivity in evolutions using Automatic Mutation Tuning (AMT). AMT gradually replaces the entropy framework of SAC, allowing the population to succeed at the task while acting as randomly as possible, without making use of backpropagation updates. On a range of challenging control tasks consisting of high-dimensional action spaces and sparse rewards, ESAC demonstrates state-of-the-art performance and sample efficiency equivalent to SAC. ESAC demonstrates scalability comparable to ES on the basis of hardware resources and algorithm overhead. A complete implementation of ESAC with notes on reproducibility and videos can be found at the project website https://karush17.github.io/esac-web/.

The Out-of-Distribution (OOD) generalization problem is the problem of seeking a predictor function whose performance in the worst environments is optimal. This paper makes two contributions to the OOD problem. We first use basic results of probability to prove the maximal Invariant Predictor (MIP) condition, a theoretical result that can be used to identify the OOD optimal solution. We then use our MIP to derive the inner-environmental Gradient Alignment (IGA) algorithm that can be used to help seek the OOD optimal predictor.
Previous studies that have investigated the theoretical aspects of the OOD problem use strong structural assumptions such as causal DAGs. However, in cases involving image datasets, for example, the identification of hidden structural relations is itself a difficult problem. Our theoretical results differ from those of many previous studies in that they can be applied to cases in which the underlying structure of a dataset is difficult to analyze. We present an extensive comparison of previous theoretical approaches to the OOD problem based on the assumptions they make. We also present an extension of Colored-MNIST that can more accurately represent the pathological OOD situation than the original version, and demonstrate the superiority of IGA over previous methods on both the original and the extended version of Colored-MNIST.

How can we identify the training examples that contribute most to the prediction of a tree ensemble? In this paper, we introduce TREX, an explanation system that provides instance-attribution explanations for tree ensembles, such as random forests and gradient boosted trees. TREX builds on the representer point framework previously developed for explaining deep neural networks. Since tree ensembles are non-differentiable, we define a kernel that captures the structure of the specific tree ensemble. By using this kernel in kernel logistic regression or a support vector machine, TREX builds a surrogate model that approximates the original tree ensemble. The weights in the kernel expansion of the surrogate model are used to define the global or local importance of each training example.
Our experiments show that TREX's surrogate model accurately approximates the tree ensemble; its global importance weights are more effective in dataset debugging than the previous state-of-the-art; its explanations identify the most influential samples better than alternative methods under the remove-and-retrain evaluation framework; it runs orders of magnitude faster than alternative methods; and its local explanations can identify and explain errors due to domain mismatch.

This paper surveys the field of transfer learning in the problem setting of Reinforcement Learning (RL). RL has been a key solution to sequential decision-making problems. Along with the fast advances of RL in various domains, such as robotics and game-playing, transfer learning arises as an important technique to assist RL by leveraging and transferring external expertise to boost the learning process of RL. In this survey, we review the central issues of transfer learning in the RL domain, providing a systematic categorization of its state-of-the-art techniques. We analyze their goals, methodologies, applications, and the RL frameworks under which the transfer learning techniques are approachable. We discuss the relationship between transfer learning and other relevant topics from the RL perspective and also explore the potential challenges as well as future development directions for transfer learning in RL.

This paper develops an inference problem for the Vasicek model driven by a general Gaussian process. We construct a least squares estimator and a moment estimator for the drift parameters of the Vasicek model, and we prove their consistency and asymptotic normality. Our approach extends the result of Xiao and Yu (2018) to the case when the noise is a fractional Brownian motion with Hurst parameter $H \in [1/2, 1)$.

Harsh winter climate can cause various problems for both public and private sectors in Sweden, especially in the northern part for the railway industry.
To have a better understanding of winter climate impacts, this study investigates the effects of winter climate, including atmospheric icing, on the performance of high speed passenger trains in the Botnia-Atlantica region. The investigation is done with train operational data together with simulated weather data from the Weather Research and Forecasting model over January - February 2017. Two different measurements of the train performance are analysed. One is cumulative delay, which measures the increment in delay in terms of running time between two consecutive measuring spots; the other is current delay, which is the delay in terms of arrival time at each measuring spot compared to the schedule. Cumulative delay is investigated through a Cox model and the current delay is studied using a Markov chain model. The results show that the weather factors have impacts on the train performance. In particular, temperature and humidity have significant impacts on both the occurrence of cumulative delay and the transition probabilities between (current) delayed and non-delayed states.

We consider general high-dimensional spiked sample covariance models and show that their leading sample spiked eigenvalues and their linear spectral statistics are asymptotically independent when the sample size and dimension are proportional to each other. As a byproduct, we also establish the central limit theorem of the leading sample spiked eigenvalues by removing the block diagonal assumption on the population covariance matrix, which is commonly needed in the literature. Moreover, we propose consistent estimators of the $L_4$ norm of the spiked population eigenvectors. Based on these results, we develop a new statistic to test the equality of two spiked population covariance matrices. Numerical studies show that the new test procedure is more powerful than some existing methods.
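The spiked covariance setting of the last abstract can be visualized with a toy simulation (an illustration of the standard phenomenon, not the paper's own experiment): in the proportional regime $p/n$ fixed, one planted spike pushes the leading sample eigenvalue out of the Marchenko-Pastur bulk while the remaining eigenvalues stay inside it.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 1000  # dimension and sample size, proportional regime p/n = 0.2

# Population covariance: identity plus one spike of strength 5 along e_1,
# i.e. variance 1 + 5 = 6 in the spiked direction.
spike = 5.0
X = rng.normal(size=(n, p))
X[:, 0] *= np.sqrt(1.0 + spike)

S = X.T @ X / n                  # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)  # ascending order

bulk_edge = (1 + np.sqrt(p / n)) ** 2  # Marchenko-Pastur upper edge, ~2.09 here
print(eigvals[-1] > bulk_edge)          # leading eigenvalue separates from the bulk
print(eigvals[-2] < bulk_edge + 0.2)    # the rest stay near the bulk edge
```

The separated leading eigenvalue concentrates near $(1+\text{spike})(1 + (p/n)/\text{spike}) \approx 6.24$, which is the classical BBP-type prediction; the paper's contribution concerns the joint fluctuations of such spiked eigenvalues and linear spectral statistics.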

Deep learning applied to weather forecasting has started gaining popularity because of the progress achieved by data-driven models. The present paper compares four different deep learning architectures to perform weather prediction on daily data gathered from 18 cities across Europe and spanning a period of 15 years. The four proposed models investigate different types of input representations (i.e. tensorial unistream vs. multi-stream matrices) as well as the combination of convolutional neural networks and LSTMs (i.e. cascaded vs. ConvLSTM). In particular, we show that a model that uses a multi-stream input representation and that processes each lag individually, combined with a cascaded convolution and LSTM, is capable of better forecasting than the other compared models. In addition, we show that visualization techniques such as occlusion analysis and score maximization can give additional insight into the most important features and cities for predicting a particular target feature and city.
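The occlusion analysis mentioned above can be sketched in a few lines: mask one input region at a time, re-score the model, and rank regions by how much the score drops. This is a generic illustration with a stand-in linear "model" whose weights are an assumption for the example, not the paper's trained networks.

```python
import numpy as np

# Stand-in "model": a fixed linear scorer over a 4x4 input grid.
# Importance is concentrated on cell (1, 2) -- an assumption for illustration.
weights = np.zeros((4, 4))
weights[1, 2] = 5.0
weights[0, 0] = 1.0

def score(x):
    """Model output for input grid x."""
    return float((weights * x).sum())

x = np.ones((4, 4))  # an input in which every cell is active
base = score(x)

# Occlusion analysis: zero out each cell in turn and record the score drop.
drops = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        occluded = x.copy()
        occluded[i, j] = 0.0
        drops[i, j] = base - score(occluded)

# The largest drop identifies the most important input cell.
print(np.unravel_index(drops.argmax(), drops.shape))  # (1, 2)
```

For the paper's setting, the grid cells would correspond to cities or input features, and the scorer to the trained forecasting network; the resulting drop map is what the occlusion visualization displays.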