## Statistics (stat) updates on the arXiv.org e-print archive



A stream of algorithmic advances has steadily increased the popularity of the Bayesian approach as an inference paradigm, both from the theoretical and applied perspective. Even with apparent successes in numerous application fields, a rising concern is the robustness of Bayesian inference in the presence of model misspecification, which may lead to undesirable extreme behavior of the posterior distributions for large sample sizes. Generalized belief updating with a loss function represents a central principle to making Bayesian inference more robust and less vulnerable to deviations from the assumed model. Here we consider such updates with $f$-divergences to quantify a discrepancy between the assumed statistical model and the probability distribution which generated the observed data. Since the latter is generally unknown, estimation of the divergence may be viewed as an intractable problem. We show that the divergence becomes accessible through the use of probabilistic classifiers that can leverage an estimate of the ratio of two probability distributions even when one or both of them is unknown. We demonstrate the behavior of generalized belief updates for various specific choices under the $f$-divergence family. We show that for specific divergence functions such an approach can even improve on methods evaluating the correct model likelihood function analytically.
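For orientation, the two standard ingredients named above can be written out as follows. This is a sketch in notation of our own choosing ($p$ for the data-generating density, $q_\theta$ for the model density, $y$ for the classifier label, $n_1$ and $n_0$ for the two sample sizes); the abstract does not fix the direction of the divergence or the classifier setup, so both are assumptions here.

$$
D_f(P \,\|\, Q_\theta) = \int q_\theta(x)\, f\!\left(\frac{p(x)}{q_\theta(x)}\right) dx, \qquad f \text{ convex},\quad f(1) = 0.
$$

If a probabilistic classifier is trained to separate $n_1$ observed data points (label $y=1$) from $n_0$ draws simulated from the model (label $y=0$), Bayes' rule gives the density ratio in terms of the class probabilities:

$$
\frac{p(x)}{q_\theta(x)} = \frac{\Pr(y=1 \mid x)}{\Pr(y=0 \mid x)} \cdot \frac{n_0}{n_1}.
$$

Plugging the classifier's estimated odds into the first expression yields an estimate of the divergence without evaluating either density directly.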

Boundaries on spatial fields divide regions with particular features from surrounding background areas. These boundaries are often described with contour lines. To measure and record these boundaries, contours are often represented as ordered sequences of spatial points that connect to form a line. Methods to identify boundary lines from interpolated spatial fields are well-established. Less attention has been paid to how to model sequences of connected spatial points. For data of the latter form, we introduce the Gaussian Star-shaped Contour Model (GSCM). GSCMs generate sequences of spatial points by generating sets of distances in various directions from a fixed starting point. The GSCM is designed for modeling contours that enclose regions that are star-shaped polygons or approximately star-shaped polygons. Metrics are introduced to assess the extent to which a polygon deviates from star-shaped. Simulation studies illustrate the performance of the GSCM in various scenarios, and an analysis of Arctic sea ice edge contour data highlights how GSCMs can be applied to observational data.
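The generative idea (distances drawn along fixed directions from a starting point) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name `sample_star_contour`, the equally spaced directions, and the independent Gaussian distances are hypothetical choices, not the GSCM specification from the paper.

```python
import math
import random


def sample_star_contour(center, mean_dists, sd, rng):
    """Draw one closed contour as an ordered list of (x, y) points.

    One distance is drawn per direction; directions are equally spaced
    angles around `center`. Distances are independent Gaussians here,
    which is a simplification made for this sketch.
    """
    n = len(mean_dists)
    points = []
    for i, mu in enumerate(mean_dists):
        theta = 2.0 * math.pi * i / n
        # Clip at a tiny positive value so the point stays on its ray.
        dist = max(rng.gauss(mu, sd), 1e-9)
        points.append((center[0] + dist * math.cos(theta),
                       center[1] + dist * math.sin(theta)))
    return points


rng = random.Random(0)
contour = sample_star_contour((0.0, 0.0), [1.0] * 12, 0.1, rng)
```

Because every point lies on its own ray from the center and the rays are visited in angular order, connecting consecutive points gives a polygon that is star-shaped with respect to the center.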

Assigning weights to a large pool of objects is a fundamental task in a wide variety of applications. In this article, we introduce a concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. Such structure is well motivated by 1) high-dimensional weights that are common in modern applications, and 2) ubiquitous examples in which equal weights, despite their simplicity, often achieve favorable or even state-of-the-art predictive performance. This particular structure, however, presents unique challenges both computationally and statistically. To address these challenges, we propose a new class of double spike Dirichlet priors to shrink a probability simplex to one with the desired structure. When applied to ensemble learning, such priors lead to a Bayesian method for structured high-dimensional ensembles that is useful for forecast combination and improving random forests, while enabling uncertainty quantification. We design efficient Markov chain Monte Carlo algorithms for easy implementation. Posterior contraction rates are established to provide theoretical support. We demonstrate the wide applicability and competitive performance of the proposed methods through simulations and two real data applications using the European Central Bank Survey of Professional Forecasters dataset and a UCI dataset.

We consider regret minimization for online control with time-varying linear dynamical systems. The metric of performance we study is adaptive policy regret, or regret compared to the best policy on *any interval in time*. We give an efficient algorithm that attains first-order adaptive regret guarantees for the setting of online convex optimization with memory. We also show that these first-order bounds are nearly tight. This algorithm is then used to derive a controller with adaptive regret guarantees that provably competes with the best linear controller on any interval in time. We validate these theoretical findings experimentally on simulations of time-varying dynamics and disturbances.

While the celebrated graph neural networks yield effective representations for individual nodes of a graph, there has been relatively less success in extending to deep graph similarity learning. Recent work has considered either global-level graph-graph interactions or low-level node-node interactions, ignoring the rich cross-level interactions (e.g., between nodes and a whole graph). In this paper, we propose a Hierarchical Graph Matching Network (HGMN) for computing the graph similarity between any pair of graph-structured objects. Our model jointly learns graph representations and a graph matching metric function for computing graph similarities in an end-to-end fashion. The proposed HGMN model consists of a node-graph matching network for effectively learning cross-level interactions between nodes of a graph and a whole graph, and a siamese graph neural network for learning global-level interactions between two graphs. Our comprehensive experiments demonstrate that HGMN consistently outperforms state-of-the-art graph matching network baselines for both classification and regression tasks.

The threat status and criminal collaborations of potential terrorists are hidden but give rise to observable behaviours and communications. Terrorists, when acting in concert, need to communicate to organise their plots. The authorities utilise such observable behaviour and communication data to inform their investigations and policing. We present a dynamic latent network model that integrates real-time communications data with prior knowledge on individuals. This model estimates and predicts the latent strength of criminal collaboration between individuals to assist in the identification of potential cells and the measurement of their threat levels. We demonstrate how, by assuming certain plausible conditional independences across the measurements associated with this population, the network model can be combined with models of individual suspects to provide fast transparent algorithms to predict group attacks. The methods are illustrated using a simulated example involving the threat posed by a cell suspected of plotting an attack.

To relieve the computational cost of design evaluations using expensive finite element simulations, surrogate models have been widely applied in computer-aided engineering design. Machine learning algorithms (MLAs) have been implemented as surrogate models due to their capability of learning the complex interrelations between the design variables and the response from big datasets. Typically, an MLA regression model contains model parameters and hyperparameters. The model parameters are obtained by fitting the training data. Hyperparameters, which govern the model structures and the training processes, are assigned by users before training. There is a lack of systematic studies on the effect of hyperparameters on the accuracy and robustness of the surrogate model. In this work, we propose a hyperparameter optimization (HOpt) framework to deepen our understanding of this effect. Four frequently used MLAs, namely Gaussian Process Regression (GPR), Support Vector Machine (SVM), Random Forest Regression (RFR), and Artificial Neural Network (ANN), are tested on four benchmark examples. For each MLA model, the model accuracy and robustness before and after HOpt are compared. The results show that HOpt generally improves the performance of the MLA models. HOpt leads to few improvements in the MLAs' accuracy and robustness for complex problems, which are characterized by high-dimensional mixed-variable design spaces. HOpt is recommended for design problems of intermediate complexity. We also investigated the additional computational costs incurred by HOpt. The training cost is closely related to the MLA architecture. After HOpt, the training cost of ANN and RFR increases more than that of GPR and SVM. To sum up, this study informs the selection of HOpt methods for different types of design problems based on their complexity.

We propose and study Collapsing Bandits, a new restless multi-armed bandit (RMAB) setting in which each arm follows a binary-state Markovian process with a special structure: when an arm is played, the state is fully observed, thus "collapsing" any uncertainty, but when an arm is passive, no observation is made, thus allowing uncertainty to evolve. The goal is to keep as many arms in the "good" state as possible by planning a limited budget of actions per round. Such Collapsing Bandits are natural models for many healthcare domains in which workers must simultaneously monitor patients and deliver interventions in a way that maximizes the health of their patient cohort. Our main contributions are as follows: (i) Building on the Whittle index technique for RMABs, we derive conditions under which the Collapsing Bandits problem is indexable. Our derivation hinges on novel conditions that characterize when the optimal policies may take the form of either "forward" or "reverse" threshold policies. (ii) We exploit the optimality of threshold policies to build fast algorithms for computing the Whittle index, including a closed form. (iii) We evaluate our algorithm on several data distributions including data from a real-world healthcare task in which a worker must monitor and deliver interventions to maximize their patients' adherence to tuberculosis medication. Our algorithm achieves a 3-order-of-magnitude speedup compared to state-of-the-art RMAB techniques while achieving similar performance.
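The "collapse, then evolve" behavior of an arm's uncertainty can be illustrated with a short belief-tracking sketch. The function names and the two transition probabilities (`p_good_stay`, `p_bad_recover`) are hypothetical; the sketch also assumes a single transition matrix, whereas in an RMAB the transitions may differ between the active and passive action.

```python
def passive_belief(b, p_good_stay, p_bad_recover):
    """One-step update of the belief that an unobserved arm is in the good state."""
    return b * p_good_stay + (1.0 - b) * p_bad_recover


def belief_after(observed_good, steps_passive, p_good_stay, p_bad_recover):
    """Belief after one observation followed by `steps_passive` unobserved rounds."""
    # Playing the arm reveals the state, so the belief collapses to 0 or 1.
    b = 1.0 if observed_good else 0.0
    # While passive, the belief drifts under the Markov chain.
    for _ in range(steps_passive):
        b = passive_belief(b, p_good_stay, p_bad_recover)
    return b


# Example: an arm last seen in the good state, then left alone for 5 rounds.
b5 = belief_after(True, 5, p_good_stay=0.9, p_bad_recover=0.2)
```

With these example numbers the belief decays from 1 toward the chain's stationary probability of the good state, `p_bad_recover / (1 - p_good_stay + p_bad_recover)`, which is why long-unobserved arms become increasingly uncertain.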

Solving large complex partial differential equations (PDEs), such as those that arise in computational fluid dynamics (CFD), is a computationally expensive process. This has motivated the use of deep learning approaches to approximate the PDE solutions, yet the simulation results predicted from these approaches typically do not generalize well to truly novel scenarios. In this work, we develop a hybrid (graph) neural network that combines a traditional graph convolutional network with an embedded differentiable fluid dynamics simulator inside the network itself. By combining an actual CFD simulator (run on a much coarser resolution representation of the problem) with the graph network, we show that we can both generalize well to new situations and benefit from the substantial speedup of neural network CFD predictions, while also substantially outperforming the coarse CFD simulation alone.

While the relative trade-offs between sparse and distributed representations in deep neural networks (DNNs) are well-studied, less is known about how these trade-offs apply to representations of semantically-meaningful information. Class selectivity, the variability of a unit's responses across data classes or dimensions, is one way of quantifying the sparsity of semantic representations. Given recent evidence showing that class selectivity can impair generalization, we sought to investigate whether it also confers robustness (or vulnerability) to perturbations of input data. We found that mean class selectivity predicts vulnerability to naturalistic corruptions; networks regularized to have lower levels of class selectivity are more robust to corruption, while networks with higher class selectivity are more vulnerable to corruption, as measured using Tiny ImageNetC and CIFAR10C. In contrast, we found that class selectivity increases robustness to multiple types of gradient-based adversarial attacks. To examine this difference, we studied the dimensionality of the change in the representation due to perturbation, finding that decreasing class selectivity increases the dimensionality of this change for both corruption types, but with a notably larger increase for adversarial attacks. These results demonstrate the causal relationship between selectivity and robustness and provide new insights into the mechanisms of this relationship.

We study the problem of selecting features associated with extreme values in high dimensional linear regression. Normally, in linear modeling problems, the presence of abnormal extreme values or outliers is considered an anomaly which should either be removed from the data or remedied using robust regression methods. In many situations, however, the extreme values in regression modeling are not outliers but rather the signals of interest; consider traces from spiking neurons, volatility in finance, or extreme events in climate science, for example. In this paper, we propose a new method for sparse high-dimensional linear regression for extreme values which is motivated by the Subbotin, or generalized normal distribution. This leads us to utilize an $\ell_p$ norm loss where $p$ is an even integer greater than two; we demonstrate that this loss increases the weight on extreme values. We prove consistency and variable selection consistency for the $\ell_p$ norm regression with a Lasso penalty, which we term the Extreme Lasso. Through simulation studies and real-world data examples, we show that this method outperforms other methods currently used in the literature for selecting features of interest associated with extreme values in high-dimensional regression.
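A small numerical illustration of why a higher even power puts more weight on extreme values: the sketch below compares the share of the total loss contributed by the largest residual under $p=2$ and $p=4$. The helper name `share_of_largest` and the residual values are made up for illustration and are not taken from the paper.

```python
def lp_loss(residuals, p):
    """Sum of |r|^p over all residuals."""
    return sum(abs(r) ** p for r in residuals)


def share_of_largest(residuals, p):
    """Fraction of the l_p loss contributed by the largest-magnitude residual."""
    terms = [abs(r) ** p for r in residuals]
    return max(terms) / sum(terms)


# Four ordinary residuals and one extreme one (illustrative values only).
residuals = [1.0, -1.0, 1.0, -1.0, 2.0]
share_l2 = share_of_largest(residuals, 2)
share_l4 = share_of_largest(residuals, 4)
```

Raising $p$ from 2 to 4 increases the extreme residual's share of the loss, so a fit that minimizes the loss is pulled more strongly toward explaining the extreme observation.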

We consider stochastic gradient estimation using noisy black-box function evaluations. A standard approach is to use the finite-difference method or its variants. While natural, it is open to our knowledge whether its statistical accuracy is the best possible. This paper argues so by showing that central finite-difference is a nearly minimax optimal zeroth-order gradient estimator, among both the class of linear estimators and the much larger class of all (nonlinear) estimators.
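For readers unfamiliar with the estimator under study, here is a minimal sketch of a coordinate-wise central finite-difference gradient estimate from noisy function evaluations. The test function `noisy_quadratic`, the step size `h`, and the averaging over repeats are illustrative assumptions; the paper's analysis of how to choose these is not reproduced here.

```python
import random


def central_difference_gradient(f, x, h):
    """Estimate the gradient of f at x, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        # Symmetric difference quotient along coordinate i.
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


def averaged_gradient(f, x, h, n_repeats):
    """Average repeated central-difference estimates to reduce noise."""
    total = [0.0] * len(x)
    for _ in range(n_repeats):
        g = central_difference_gradient(f, x, h)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_repeats for t in total]


rng = random.Random(0)


def noisy_quadratic(x):
    # Hypothetical black box: x0^2 + 3*x1 plus Gaussian evaluation noise.
    return x[0] ** 2 + 3.0 * x[1] + rng.gauss(0.0, 0.01)


estimate = averaged_gradient(noisy_quadratic, [1.0, 2.0], h=0.1, n_repeats=400)
```

The step size trades off two error sources: a smaller `h` reduces the bias from curvature but amplifies the evaluation noise, which is divided by `2 * h`.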

With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules (ITRs), which recommend treatments according to patients' characteristics, from large observational data. However, there is a lack of valid inference procedures for ITRs developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal ITRs from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal utilizes data splitting to overcome the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in the high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.

Gradient boosting methods based on Structured Categorical Decision Trees (SCDT) have been demonstrated to outperform numerical and one-hot-encodings on problems where the categorical variable has a known underlying structure. However, the enumeration procedure in the SCDT is infeasible except for categorical variables with low or moderate cardinality. We propose and implement two methods to overcome the computational obstacles and efficiently perform Gradient Boosting on complex structured categorical variables. The resulting package, called StructureBoost, is shown to outperform established packages such as CatBoost and LightGBM on problems with categorical predictors that contain sophisticated structure. Moreover, we demonstrate that StructureBoost can make accurate predictions on unseen categorical values due to its knowledge of the underlying structure.

We introduce online probabilistic label trees (OPLTs), an algorithm that trains a label tree classifier in a fully online manner, without any prior knowledge about the number of training instances, their features and labels. OPLTs are characterized by low time and space complexity as well as strong theoretical guarantees. They can be used for online multi-label and multi-class classification, including the very challenging scenarios of one- or few-shot learning. We demonstrate the attractiveness of OPLTs in a wide empirical study on several instances of the tasks mentioned above.

We build a Bayesian contextual classification model using an optimistic score ratio for robust binary classification when there is limited information on the class-conditional, or contextual, distribution. The optimistic score searches for the distribution that is most plausible to explain the observed outcomes in the testing sample among all distributions belonging to the contextual ambiguity set which is prescribed using a limited structural constraint on the mean vector and the covariance matrix of the underlying contextual distribution. We show that the Bayesian classifier using the optimistic score ratio is conceptually attractive, delivers solid statistical guarantees and is computationally tractable. We showcase the power of the proposed optimistic score ratio classifier on both synthetic and empirical data.

We explore in this paper the use of neural networks designed for point-clouds and sets on a new meta-learning task. We present experiments on the astronomical challenge of characterizing the stellar population of stellar streams. Stellar streams are elongated structures of stars in the outskirts of the Milky Way that form when a (small) galaxy breaks up under the Milky Way's gravitational force. We assume that we obtain, for each stream, a small "support set" of stars that belong to this stream. We aim to predict whether the other stars in that region of the sky are from that stream or not, similar to one-class classification. Each "stream task" could also be transformed into a binary classification problem in a highly imbalanced regime (or supervised anomaly detection) by using the much bigger set of "other" stars and considering them as noisy negative examples. We propose to study the problem in the meta-learning regime: we expect that we can learn general information on characterizing a stream's stellar population by meta-learning across several streams in a fully supervised regime, and transfer it to new streams using only positive supervision. We present a novel use of Deep Sets, a model developed for point-clouds and sets, trained in a meta-learning fully supervised regime, and evaluated in a one-class classification setting. We compare it against Random Forests (with and without self-labeling) in the classic setting of binary classification, retrained for each task. We show that our method outperforms the Random Forests even though the Deep Sets model is not retrained on the new tasks, and accesses only a small part of the data compared to the Random Forests. We also show that the model performs well on a real-life stream when including additional fine-tuning.

The Wasserstein barycenter is a principled approach to represent the weighted mean of a given set of probability distributions, utilizing the geometry induced by optimal transport. In this work, we present a novel scalable algorithm to approximate Wasserstein barycenters, aimed at high-dimensional applications in machine learning. Our proposed algorithm is based on the Kantorovich dual formulation of the 2-Wasserstein distance as well as a recent neural network architecture, the input convex neural network, that is known to parametrize convex functions. The distinguishing features of our method are: i) it only requires samples from the marginal distributions; ii) unlike the existing semi-discrete approaches, it represents the barycenter with a generative model; iii) it allows the barycenter to be computed with arbitrary weights after one training session. We demonstrate the efficacy of our algorithm by comparing it with state-of-the-art methods in multiple experiments.

While deep learning methods continue to improve in predictive accuracy on a wide range of application domains, significant issues remain with other aspects of their performance including their ability to quantify uncertainty and their robustness. Recent advances in approximate Bayesian inference hold significant promise for addressing these concerns, but the computational scalability of these methods can be problematic when applied to large-scale models. In this paper, we describe initial work on the development of URSABench (the Uncertainty, Robustness, Scalability, and Accuracy Benchmark), an open-source suite of benchmarking tools for comprehensive assessment of approximate Bayesian inference methods with a focus on deep learning-based classification tasks.

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. Practitioners commonly use a Dirichlet process mixture model (DPMM) for this purpose; in particular, they count the number of clusters---i.e. components containing at least one data point---in the DPMM posterior. But Miller and Harrison (2013) warn that the DPMM cluster-count posterior is severely inconsistent for the number of latent components when the data are truly generated from a finite mixture; that is, the cluster-count posterior probability on the true generating number of components goes to zero in the limit of infinite data. A potential alternative is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence (Miller and Dunson, 2019) suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM posterior on the number of components is ultraseverely inconsistent: for any finite $k \in \mathbb{N}$, the posterior probability that the number of components is $k$ converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.

Network security applications, including intrusion detection systems (IDS) based on deep neural networks (DNNs), are increasing rapidly to make the detection of anomalous activities more accurate and robust. With the rapid increase in the use of DNNs and in the volume of data traveling through systems, a growing variety of adversarial attacks designed to defeat them poses a severe challenge. In this paper, we investigate the effectiveness of different evasion attacks and how to train a resilient deep learning-based IDS using different neural networks, e.g., convolutional neural networks (CNN) and recurrent neural networks (RNN). We use the min-max approach to formulate the problem of training a robust IDS against adversarial examples using two benchmark datasets. Our experiments on different deep learning algorithms and different benchmark datasets demonstrate that defense using an adversarial training-based min-max approach improves the robustness against the five well-known adversarial attack methods.

We present the first system that provides real-time probe movement guidance for acquiring standard planes in routine freehand obstetric ultrasound scanning. Such a system can contribute to the worldwide deployment of obstetric ultrasound scanning by lowering the required level of operator expertise. The system employs an artificial neural network that receives the ultrasound video signal and the motion signal of an inertial measurement unit (IMU) that is attached to the probe, and predicts a guidance signal. The network termed US-GuideNet predicts either the movement towards the standard plane position (goal prediction), or the next movement that an expert sonographer would perform (action prediction). While existing models for other ultrasound applications are trained with simulations or phantoms, we train our model with real-world ultrasound video and probe motion data from 464 routine clinical scans by 17 accredited sonographers. Evaluations for 3 standard plane types show that the model provides a useful guidance signal with an accuracy of 88.8% for goal prediction and 90.9% for action prediction.

We propose new tools for policy-makers to use when assessing and correcting fairness and bias in AI algorithms. The three tools are:

- A new definition of fairness called "controlled fairness" with respect to choices of protected features and filters. The definition provides a simple test of fairness of an algorithm with respect to a dataset. This notion of fairness is suitable in cases where fairness is prioritized over accuracy, such as in cases where there is no "ground truth" data, only data labeled with past decisions (which may have been biased).

- Algorithms for retraining a given classifier to achieve "controlled fairness" with respect to a choice of features and filters. Two algorithms are presented, implemented and tested. These algorithms require training two different models in two stages. We experiment with combinations of various types of models for the first and second stage and report on which combinations perform best in terms of fairness and accuracy.

- Algorithms for adjusting model parameters to achieve a notion of fairness called "classification parity". This notion of fairness is suitable in cases where accuracy is prioritized. Two algorithms are presented, one which assumes that protected features are accessible to the model during testing, and one which assumes protected features are not accessible during testing.

We evaluate our tools on three different publicly available datasets. We find that the tools are useful for understanding various dimensions of bias, and that in practice the algorithms are effective in starkly reducing a given observed bias when tested on new data.

In this work, we study some novel applications of conformal inference techniques to the problem of providing machine learning procedures with more transparent, accurate, and practical performance guarantees. We provide a natural extension of the traditional conformal prediction framework, done in such a way that we can make valid and well-calibrated predictive statements about the future performance of arbitrary learning algorithms, when passed an as-yet unseen training set. In addition, we include some nascent empirical examples to illustrate potential applications.

Differential equations parameterized by neural networks become expensive to solve numerically as training progresses. We propose a remedy that encourages learned dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate for the time cost of standard numerical solvers, using higher-order derivatives of solution trajectories. These derivatives are efficient to compute with Taylor-mode automatic differentiation. Optimizing this additional objective trades model performance against the time cost of solving the learned dynamics. We demonstrate our approach by training substantially faster, while nearly as accurate, models in supervised classification, density estimation, and time-series modelling tasks.

In many studies, dimension reduction methods are used to profile participant characteristics. For example, nutrition epidemiologists often use latent class models to characterize dietary patterns. One challenge with such approaches is understanding subtle variations in patterns across subpopulations. Robust Profile Clustering (RPC) provides a dual flexible clustering model, where participants may cluster at two levels: (1) globally, where participants are clustered according to behaviors shared across an overall population, and (2) locally, where individual behaviors can deviate and cluster in subpopulations. We link clusters to a health outcome using a joint model. This model is used to derive dietary patterns in the United States and evaluate case proportion of orofacial clefts. Using dietary consumption data from the 1997-2009 National Birth Defects Prevention Study, a population-based case-control study, we determine how maternal dietary profiles are associated with an orofacial cleft among offspring. Results indicated that mothers who consumed a high proportion of fruits and vegetables compared to meats, such as chicken and beef, had lower odds of delivering a child with an orofacial cleft defect.

The use of twins designs to address causal questions is becoming increasingly popular. A standard assumption is that there is no interference between twins---that is, no twin's exposure has a causal impact on their co-twin's outcome. However, there may be settings in which this assumption would not hold, and this would (1) impact the causal interpretation of parameters obtained by commonly used existing methods; (2) change which effects are of greatest interest; and (3) impact the conditions under which we may estimate these effects. We explore these issues, and we derive semi-parametric efficient estimators for causal effects in the presence of interference between twins. Using data from the Minnesota Twin Family Study, we apply our estimators to assess whether twins' consumption of alcohol in early adolescence may have a causal impact on their co-twins' substance use later in life.

This paper studies robust regression for data on Riemannian manifolds. Geodesic regression is the generalization of linear regression to a setting with a manifold-valued dependent variable and one or more real-valued independent variables. The existing work on geodesic regression uses the sum-of-squared errors to find the solution, but as in the classical Euclidean case, the least-squares method is highly sensitive to outliers. In this paper, we use M-type estimators, including the $L_1$, Huber and Tukey biweight estimators, to perform robust geodesic regression, and describe how to calculate the tuning parameters for the latter two. We also show that, on compact symmetric spaces, all M-type estimators are maximum likelihood estimators, and argue for the overall superiority of the $L_1$ estimator over the $L_2$ and Huber estimators on high-dimensional manifolds and over the Tukey biweight estimator on compact high-dimensional manifolds. Results from numerical examples, including analysis of real neuroimaging data, demonstrate the promising empirical properties of the proposed approach.
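The M-type losses named above differ in how they treat large residuals, which the following sketch makes concrete. The residual `r` stands for a distance between an observation and its fitted value (a geodesic distance in the manifold setting); the cutoff values used in the example are placeholders, since the paper's procedure for computing tuning parameters is not reproduced here.

```python
def l2_loss(r):
    """Least-squares loss: grows quadratically without bound."""
    return 0.5 * r * r


def l1_loss(r):
    """Absolute loss: grows linearly."""
    return abs(r)


def huber_loss(r, c):
    """Quadratic for |r| <= c, linear beyond the cutoff c."""
    a = abs(r)
    return 0.5 * r * r if a <= c else c * (a - 0.5 * c)


def tukey_biweight_loss(r, c):
    """Bounded loss: constant c^2 / 6 once |r| exceeds the cutoff c."""
    if abs(r) > c:
        return c * c / 6.0
    u = 1.0 - (r / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)


# Placeholder cutoffs chosen only for illustration.
for r in (0.5, 2.0, 10.0):
    row = (l2_loss(r), l1_loss(r), huber_loss(r, 1.0), tukey_biweight_loss(r, 4.0))
```

The Tukey biweight loss stops growing entirely past its cutoff, so a gross outlier contributes the same penalty however far away it is, whereas the least-squares loss lets such a point dominate the fit.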

We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the $p^{th}$-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of $O(1/T^{\frac{p+1}{2}})$ when given access to an oracle for finding a fixed point of a $p^{th}$-order equation. We give analogous rates for the weak monotone variational inequality problem. For $p>2$, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained $p=2$ case.

Ideal point estimation and dimensionality reduction have long been utilized to simplify and cluster complex, high-dimensional political data (e.g., roll-call votes and surveys) for use in analysis and visualization. These methods often work by finding the directions or principal components (PCs) on which either the data varies the most or respondents make the fewest decision errors. However, these PCs, which usually reflect the left-right political spectrum, are sometimes uninformative in explaining significant differences in the distribution of the data (e.g., how to categorize a set of highly-moderate voters). To tackle this issue, we adopt an emerging analysis approach called contrastive learning. Contrastive learning, e.g., contrastive principal component analysis (cPCA), works by first splitting the data by predefined groups, and then deriving PCs on which the target group varies the most but the background group varies the least. As a result, cPCA can often find "hidden" patterns, such as subgroups within the target group, which PCA cannot reveal when some variables are the dominant source of variations across the groups. We contribute to the field of contrastive learning by extending it to multiple correspondence analysis (MCA) to enable an analysis of data often encountered by social scientists, namely binary, ordinal, and nominal variables. We demonstrate the utility of contrastive MCA (cMCA) by analyzing three different surveys: the 2015 Cooperative Congressional Election Study, the 2012 UTokyo-Asahi Elite Survey, and the 2018 European Social Survey. Our results suggest that, first, for the cases when ordinary MCA depicts differences between groups, cMCA can further identify characteristics that divide the target group; second, for the cases when MCA does not show clear differences, cMCA can successfully identify meaningful directions and subgroups, which traditional methods overlook.

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases, and instead models are evaluated online while learning novel classes. As in the real world, where the presence of spatiotemporal context helps us retrieve skills learned in the past, our online few-shot learning setting also features an underlying context that changes over time. Object classes are correlated within a context, and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large-scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions, and we also propose a new model named contextual prototypical memory that can make use of spatiotemporal contextual information from the recent past.

We prove an exponential decay concentration inequality to bound the tail probability of the difference between the log-likelihood of discrete random variables and the negative entropy. The concentration bound we derive holds uniformly over all parameter values. The new result improves the convergence rate in an earlier work \cite{zhao2020note}, from $(K^2\log K)/n=o(1)$ to $(\log K)^2/n=o(1)$, where $n$ is the sample size and $K$ is the number of possible values of the discrete variable. We further prove that the rate $(\log K)^2/n=o(1)$ is optimal. The results are extended to misspecified log-likelihoods for grouped random variables.

In this paper, we model the trajectory of the cumulative confirmed cases and deaths of COVID-19 (in log scale) via a piecewise linear trend model. The model naturally captures the phase transitions of the epidemic growth rate via change-points and further enjoys great interpretability due to its semiparametric nature. On the methodological front, we advance the nascent self-normalization (SN) technique (Shao, 2010) to testing and estimation of a single change-point in the linear trend of a nonstationary time series. We further combine the SN-based change-point test with the NOT algorithm (Baranowski et al., 2019) to achieve multiple change-point estimation. Using the proposed method, we analyze the trajectory of the cumulative COVID-19 cases and deaths for 30 major countries and discover interesting patterns with potentially relevant implications for effectiveness of the pandemic responses by different countries. Furthermore, based on the change-point detection algorithm and a flexible extrapolation function, we design a simple two-stage forecasting scheme for COVID-19 and demonstrate its promising performance in predicting cumulative deaths in the U.S.

Alzheimer's disease is a progressive form of dementia that results in problems with memory, thinking and behavior. It often starts with abnormal aggregation and deposition of beta-amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, and finally leads to behavioral deficits. Despite significant progress in finding biomarkers associated with behavioral deficits, the underlying causal mechanism remains largely unknown. Here we investigate whether and how hippocampal atrophy contributes to behavioral deficits based on a large-scale observational study conducted by the Alzheimer's Disease Neuroimaging Initiative (ADNI). As a key novelty, we use 2D representations of the hippocampi, which allows us to better understand atrophy associated with different subregions. It, however, introduces methodological challenges as existing causal inference methods are not well suited for exploiting structural information embedded in the 2D exposures. Moreover, our data contain more than 6 million clinical and genetic covariates, necessitating appropriate confounder selection methods. We hence develop a novel two-step causal inference approach tailored for our ADNI data application. Analysis results suggest that atrophy of CA1 and subiculum subregions may cause more severe behavioral deficits compared to CA2 and CA3 subregions. We further evaluate our method using simulations and provide theoretical guarantees.

First-price auctions have very recently swept the online advertising industry, replacing second-price auctions as the predominant auction mechanism on many platforms. This shift has brought forth important challenges for a bidder: how should one bid in a first-price auction, where unlike in second-price auctions, it is no longer optimal to bid one's private value truthfully and hard to know the others' bidding behaviors? In this paper, we take an online learning angle and address the fundamental problem of learning to bid in repeated first-price auctions, where both the bidder's private valuations and other bidders' bids can be arbitrary. We develop the first minimax optimal online bidding algorithm that achieves an $\widetilde{O}(\sqrt{T})$ regret when competing with the set of all Lipschitz bidding policies, a strong oracle that contains a rich set of bidding strategies. This novel algorithm is built on the insight that the presence of a good expert can be leveraged to improve performance, as well as an original hierarchical expert-chaining structure, both of which could be of independent interest in online learning. Further, by exploiting the product structure that exists in the problem, we modify this algorithm--in its vanilla form statistically optimal but computationally infeasible--to a computationally efficient and space efficient algorithm that also retains the same $\widetilde{O}(\sqrt{T})$ minimax optimal regret guarantee. Additionally, through an impossibility result, we highlight that one is unlikely to compete this favorably with a stronger oracle (than the considered Lipschitz bidding policies). Finally, we test our algorithm on three real-world first-price auction datasets obtained from Verizon Media and demonstrate our algorithm's superior performance compared to several existing bidding algorithms.
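To make the bidder's problem concrete, the following minimal sketch computes the payoff of a bidding policy over a handful of rounds. It is not the algorithm of the abstract. The tie-breaking rule, the function names, the toy rounds, and the shading factor 0.7 are all assumptions for illustration.

```python
def first_price_payoff(value, bid, highest_other_bid):
    """Payoff in one first-price auction: the winner pays their own bid.

    Ties are resolved in the bidder's favour (an assumption of this sketch).
    """
    return value - bid if bid >= highest_other_bid else 0.0


def total_payoff(policy, rounds):
    """Cumulative payoff of a policy mapping private value -> bid."""
    return sum(first_price_payoff(v, policy(v), m) for v, m in rounds)


# Each round is (private value, highest competing bid); values are made up.
rounds = [(0.9, 0.5), (0.6, 0.7), (0.8, 0.3), (0.4, 0.2)]

truthful = total_payoff(lambda v: v, rounds)        # bid the full value
shaded = total_payoff(lambda v: 0.7 * v, rounds)    # a 0.7-Lipschitz policy

print(truthful, shaded)
```

Truthful bidding wins three rounds but earns nothing, because the winner pays exactly their value. The shaded policy still wins those rounds and keeps a positive margin, which is why the choice of bid has to be learned.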

Graph Convolutional Network (GCN) has experienced great success in graph analysis tasks. It works by smoothing the node features across the graph. The current GCN models overwhelmingly assume that node feature information is complete. However, real-world graph data are often incomplete and contain missing features. Traditionally, people have to estimate and fill in the unknown features based on imputation techniques and then apply GCN. However, the processes of feature filling and graph learning are separated, resulting in degraded and unstable performance. This problem becomes more serious when a large number of features are missing. We propose an approach that adapts GCN to graphs containing missing features. In contrast to the traditional strategy, our approach integrates the processing of missing features and graph learning within the same neural network architecture. Our idea is to represent the missing data by a Gaussian Mixture Model (GMM) and calculate the expected activation of neurons in the first hidden layer of GCN, while keeping the other layers of the network unchanged. This enables us to learn the GMM parameters and network weight parameters in an end-to-end manner. Notably, our approach does not increase the computational complexity of GCN and it is consistent with GCN when the features are complete. We conduct experiments on the node label classification task and demonstrate that our approach significantly outperforms the best imputation-based methods by up to 99.43%, 102.96%, 6.97%, 35.36% in four benchmark graphs when a large portion of features are missing. The performance of our approach for the case with a low level of missing features is even superior to GCN for the case with complete features.

We consider the problem of clustering with $K$-means and Gaussian mixture models with a constraint on the separation between the centers in the context of real-valued data. We first propose a dynamic programming approach to solving the $K$-means problem with a separation constraint on the centers, building on (Wang and Song, 2011). In the context of fitting a Gaussian mixture model, we then propose an EM algorithm that incorporates such a constraint. A separation constraint can help regularize the output of a clustering algorithm, and we provide both simulated and real data examples to illustrate this point.

While Generative Adversarial Networks (GANs) are fundamental to many generative modelling applications, they suffer from numerous issues. In this work, we propose a principled framework to simultaneously address two fundamental issues in GANs: catastrophic forgetting of the discriminator and mode collapse of the generator. We achieve this by employing for GANs a contrastive learning and mutual information maximization approach, and perform extensive analyses to understand sources of improvements. Our approach significantly stabilises GAN training and improves GAN performance for image synthesis across five datasets under the same training and evaluation conditions against state-of-the-art works. Our approach is simple to implement and practical: it involves only one objective, is computationally inexpensive, and is robust across a wide range of hyperparameters without any tuning. For reproducibility, our code is available at https://github.com/kwotsin/mimicry.

We consider the dynamic of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomial samples. On the other hand, we prove that any kernel method, including Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.

Autism spectrum disorder is a neurodevelopmental condition that includes issues with communication and social interactions. People with ASD also often have restricted interests and repetitive behaviors. In this paper we build preliminary bricks of an automated gesture imitation game that will aim at improving social interactions with teenagers with ASD. The structure of the game is presented, as well as support tools and methods for skeleton detection and imitation learning. The game shall later be implemented using an interactive robot.

We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these \emph{concept bottleneck models} by editing their predicted concept values and propagating these changes to the final prediction. On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models, while enabling interpretation in terms of high-level clinical concepts ("bone spurs") or bird attributes ("wing color"). These models also allow for richer human-model interaction: accuracy improves significantly if we can correct model mistakes on concepts at test time.
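The two-stage structure and the test-time intervention described above can be sketched with toy stand-ins for the learned components. Everything here is hypothetical: the input fields, thresholds, concept names, and scoring rule are placeholders, not the trained models from the abstract.

```python
def predict_concepts(x):
    """Stage 1: input -> concepts (hypothetical stand-in for a learned model)."""
    return {
        "bone_spur": x["edge_sharpness"] > 0.5,
        "narrow_joint_space": x["gap_mm"] < 3.0,
    }


def predict_label(concepts):
    """Stage 2: concepts -> label (hypothetical stand-in for a learned model)."""
    score = 2 * concepts["bone_spur"] + 1 * concepts["narrow_joint_space"]
    return "severe" if score >= 2 else "mild"


def predict(x, interventions=None):
    """Full pipeline; `interventions` overrides selected predicted concepts."""
    concepts = predict_concepts(x)
    concepts.update(interventions or {})
    return predict_label(concepts)


xray = {"edge_sharpness": 0.8, "gap_mm": 4.0}  # made-up input features
original = predict(xray)
edited = predict(xray, interventions={"bone_spur": False})
print(original, edited)
```

Because the label depends on the input only through the concepts, overriding a concept value answers the counterfactual question posed in the abstract directly.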

Machine learning-based User Authentication (UA) models have been widely deployed in smart devices. UA models are trained to map input data of different users to highly separable embedding vectors, which are then used to accept or reject new inputs at test time. Training UA models requires having direct access to the raw inputs and embedding vectors of users, both of which are privacy-sensitive information. In this paper, we propose Federated User Authentication (FedUA), a framework for privacy-preserving training of UA models. FedUA adopts the federated learning framework to enable a group of users to jointly train a model without sharing the raw inputs. It also allows users to generate their embeddings as random binary vectors, so that, unlike the existing approach in which the server constructs the spread-out embeddings, the embedding vectors are kept private as well. We show that our method is privacy-preserving, scalable with the number of users, and allows new users to be added to training without changing the output layer. Our experimental results on the VoxCeleb dataset for speaker verification show that our method reliably rejects data of unseen users at very high true positive rates.

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) that aims to eliminate bad local minima and to improve generalization ability. Remarkably, we prove that MCN has a very nice property; that is, \emph{every local minimum of an $(l+1)$-layer MCN is at least as good as, and can be better than, the global minima of the network consisting of its first $l$ layers}. In other words, by increasing the network depth, MCN can autonomously improve the quality of its local minima. Moreover, \emph{it is easy to plug MCN into an existing deep model to make it also have this property}. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well with \emph{high efficiency}; that is, the covering number of MCN is much smaller than that of most existing DNNs such as deep ReLU. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.

Active learning (AL) prioritizes the labeling of the most informative data samples. As the performance of well-known AL heuristics highly depends on the underlying model and data, recent heuristic-independent approaches that are based on reinforcement learning directly learn a policy that makes use of the labeling history to select the next sample. However, those methods typically need a huge number of samples to sufficiently explore the relevant state space. Imitation learning approaches aim to help out but again rely on a given heuristic.

This paper proposes an improved imitation learning scheme that learns a policy for batch-mode pool-based AL. This is similar to previously presented multi-armed bandit approaches but in contrast to them we train a policy that imitates the selection of the best expert heuristic at each stage of the AL cycle directly. We use DAGGER to train the policy on a dataset and later apply it to similar datasets. With multiple AL heuristics as experts, the policy is able to reflect the choices of the best AL heuristics given the current state of the active learning process. We evaluate our method on well-known image datasets and show that we outperform state of the art imitation learners and heuristics.

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target. Especially, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free as it requires neither to estimate the state distribution of any policy nor to model transition dynamics. Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning a variety of meaningful reward-based tasks downstream.
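The abstract does not spell out the estimator, so the following is only a sketch of a classical $k$-nearest-neighbors entropy estimate (the Kozachenko-Leonenko form), applied to a batch of visited states. It is not the MEPOL objective itself, which additionally has to handle policy updates. The function names and the uniform test data are assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(m):
    """Digamma function at a positive integer m: -gamma + sum_{i<m} 1/i."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, m))


def knn_entropy(points, k=4):
    """Kozachenko-Leonenko entropy estimate (in nats) from k-NN distances."""
    n, d = len(points), len(points[0])
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    sum_log_radius = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        sum_log_radius += math.log(dists[k - 1])  # distance to the k-th neighbour
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * sum_log_radius / n


rng = random.Random(0)
unit_states = [(rng.random(), rng.random()) for _ in range(500)]
wide_states = [(2 * rng.random(), 2 * rng.random()) for _ in range(500)]

h_unit = knn_entropy(unit_states)  # true entropy of U([0,1]^2) is 0
h_wide = knn_entropy(wide_states)  # true entropy of U([0,2]^2) is log(4)
print(round(h_unit, 3), round(h_wide, 3))
```

States spread over the larger square receive the higher estimate, which is the behaviour an entropy-maximizing exploration objective rewards. The brute-force neighbour search is quadratic in the number of states and is used here only for brevity.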

An alternative to current mainstream preprocessing methods is proposed: Value Selection (VS). Unlike the existing methods such as feature selection that removes features and instance selection that eliminates instances, value selection eliminates the values (with respect to each feature) in the dataset with two purposes: reducing the model size and preserving its accuracy. Two probabilistic methods based on information theory's metric are proposed: PVS and P + VS. Extensive experiments on the benchmark datasets with various sizes are elaborated. Those results are compared with the existing preprocessing methods such as feature selection, feature transformation, and instance selection methods. Experiment results show that value selection can achieve the balance between accuracy and model size reduction.

Machine teaching uses a meta/teacher model to guide the training of a student model (which will be used in real tasks) through training data selection, loss function design, etc. Previously, the teacher model only takes shallow/surface information as inputs (e.g., training iteration number, loss and accuracy from training/validation sets) while ignoring the internal states of the student model, which limits the potential of learning to teach. In this work, we propose an improved data teaching algorithm, where the teacher model deeply interacts with the student model by accessing its internal states. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. We conduct experiments on image classification with clean/noisy labels and empirically demonstrate that our algorithm makes significant improvement over previous data teaching methods.

Data augmentation is a popular pre-processing trick to improve generalization accuracy. It is believed that by processing augmented inputs in tandem with the original ones, the model learns a more robust set of features which are shared between the original and augmented counterparts. However, we show that this is not the case even for the best augmentation technique. In this work, we take a Domain Generalization viewpoint of augmentation-based methods. This new perspective allows for probing overfitting and delineating avenues for improvement. Our exploration with the state-of-the-art augmentation method provides evidence that the learned representations are not as robust even towards distortions used during training. This points to the untapped potential of augmented examples.

Traditional methods for black box optimization require a considerable number of evaluations, which can be time consuming, impractical, and often infeasible for many engineering applications that rely on accurate representations and expensive models to evaluate. Bayesian Optimization (BO) methods search for the global optimum by progressively (actively) learning a surrogate model of the objective function along the search path. Bayesian optimization can be accelerated through multifidelity approaches which leverage multiple black-box approximations of the objective functions that can be computationally cheaper to evaluate, but still provide relevant information to the search task. Further computational benefits are offered by the availability of parallel and distributed computing architectures whose optimal usage is an open opportunity within the context of active learning. This paper introduces the Resource Aware Active Learning (RAAL) strategy, a multifidelity Bayesian scheme to accelerate the optimization of black box functions. At each optimization step, the RAAL procedure computes the set of best sample locations and the associated fidelity sources that maximize the information gain to acquire during the parallel/distributed evaluation of the objective function, while accounting for the limited computational budget. The scheme is demonstrated for a variety of benchmark problems and results are discussed for both single fidelity and multifidelity settings. In particular, we observe that the RAAL strategy optimally seeds multiple points at each iteration, allowing for a major speed-up of the optimization task.

Restricted Boltzmann machines (RBMs) with low-precision synapses are appealing because of their high energy efficiency. However, training RBMs with binary synapses is challenging due to the discrete nature of synapses. Recently, Huang proposed an efficient method to train RBMs with binary synapses by using a combination of gradient ascent and the message passing algorithm under the variational inference framework. However, an additional heuristic clipping operation is needed. In this technical note, inspired by Huang's work, we propose an alternative optimization method using the Bayesian learning rule, which is a natural gradient variational inference method. As opposed to Huang's method, we update the natural parameters of the variational symmetric Bernoulli distribution rather than the expectation parameters. Since the natural parameters take values in the entire real domain, no additional clipping is needed. Interestingly, the algorithm in \cite{huang2019data} could be viewed as a first-order approximation of the proposed algorithm, which justifies its efficacy with heuristic clipping.

Open data is entering the mainstream: it is freely available to every stakeholder and is often used in business decision-making. It is important to be sure that the data is trustworthy and error-free, as quality problems can lead to huge losses. This research discusses how (open) data quality can be assessed. It also covers the main points that should be considered when developing a data quality management solution. One specific approach is applied to several Latvian open data sets. The research provides a step-by-step guide to analysing open data sets and summarizes its results. It also shows that data quality can differ depending on the data supplier (centralized versus decentralized data releases) and that, unfortunately, a trustworthy data supplier cannot guarantee the absence of data quality problems. Common data quality problems are also highlighted, detected not only in Latvian open data but also in the open data of 3 European countries.

A structural version of the Gaussian mixture vector autoregressive model is introduced. The shocks are identified by combining simultaneous diagonalization of the error term covariance matrices with zero and sign constraints. It turns out that this often leads to less restrictive identification conditions than in conventional SVAR models, while some of the constraints are also testable. The accompanying R-package gmvarkit provides easy-to-use tools for estimating the models and applying the introduced methods.

In this work, we propose a novel approach for reinforcement learning driven by evolutionary computation. Our algorithm, dubbed as Evolutionary-Driven Reinforcement Learning (evo-RL), embeds the reinforcement learning algorithm in an evolutionary cycle, where we distinctly differentiate between purely evolvable (instinctive) behaviour versus purely learnable behaviour. Furthermore, we propose that this distinction is decided by the evolutionary process, thus allowing evo-RL to be adaptive to different environments. In addition, evo-RL facilitates learning on environments with rewardless states, which makes it more suited for real-world problems with incomplete information. To show that evo-RL leads to state-of-the-art performance, we present the performance of different state-of-the-art reinforcement learning algorithms when operating within evo-RL and compare it with the case when these same algorithms are executed independently. Results show that reinforcement learning algorithms embedded within our evo-RL approach significantly outperform the stand-alone versions of the same RL algorithms on OpenAI Gym control problems with rewardless states constrained by the same computational budget.

Testing whether a given data set comes from some specified distribution is among the oldest types of problems in Statistics. Many such tests have been developed and their performance studied. The general result has been that while a certain test might perform well, that is, have good power, in one situation, it will fail badly in others. This is not a surprise given the great many ways in which a distribution can differ from the one specified in the null hypothesis. It is therefore very difficult to decide a priori which test to use. The obvious solution is not to rely on any one test but to run several of them. This, however, leads to the problem of simultaneous inference: if several tests are done, then even if the null hypothesis is true, one of them is likely to reject it anyway just by random chance. In this paper we present a method that yields a p value that is uniform under the null hypothesis no matter how many tests are run. This is achieved by adjusting the p value via simulation. We present a number of simulation studies that show the uniformity of the p value and others that show that this test is superior to any one test if the power is averaged over a large number of cases.
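One generic way to obtain such a simulation-adjusted p value is a Monte Carlo "minimum p" adjustment, sketched below for the null hypothesis that the data are uniform on $[0,1]$. This is an assumption about how the idea could be realized, not necessarily the procedure of the paper. The two test statistics, the function names, and the sample sizes are illustrative choices.

```python
import random
from bisect import bisect_left, bisect_right


def ks_stat(xs):
    """Kolmogorov-Smirnov distance between the sample and U(0, 1)."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def mean_stat(xs):
    """Absolute deviation of the sample mean from the null mean 0.5."""
    return abs(sum(xs) / len(xs) - 0.5)


TESTS = [ks_stat, mean_stat]


def build_adjusted_test(n, n_sim=1000, seed=0):
    """Return a function mapping a sample of size n to an adjusted p value."""
    rng = random.Random(seed)

    def null_sample():
        return [rng.random() for _ in range(n)]

    # Step 1: reference null distribution of each statistic (sorted for lookup).
    first = [null_sample() for _ in range(n_sim)]
    reference = [sorted(stat(s) for s in first) for stat in TESTS]

    def min_p(xs):
        # Per-test Monte Carlo p value: share of null statistics >= observed.
        p_values = []
        for stat, null_stats in zip(TESTS, reference):
            n_ge = len(null_stats) - bisect_left(null_stats, stat(xs))
            p_values.append((n_ge + 1) / (len(null_stats) + 1))
        return min(p_values)

    # Step 2: null distribution of the minimum p value, from fresh null samples.
    null_min_p = sorted(min_p(null_sample()) for _ in range(n_sim))

    def adjusted_p(xs):
        # Share of null minimum-p values at least as small as the observed one.
        return (bisect_right(null_min_p, min_p(xs)) + 1) / (n_sim + 1)

    return adjusted_p


adjusted_p = build_adjusted_test(n=50)
data_rng = random.Random(1)
null_data = [data_rng.random() for _ in range(50)]
skewed_data = [data_rng.random() ** 3 for _ in range(50)]  # clearly not uniform

p_null = adjusted_p(null_data)
p_alt = adjusted_p(skewed_data)
print(p_null, p_alt)
```

The smallest raw p value is too optimistic when several tests are run, so it is compared against its own simulated null distribution. The resulting adjusted p value is approximately uniform under the null regardless of how many statistics are listed in `TESTS`.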

Scientific observations often consist of a large number of variables (features). Identifying a subset of meaningful features is often ignored in unsupervised learning, despite its potential for unraveling clear patterns hidden in the ambient space. In this paper, we present a method for unsupervised feature selection, tailored for the task of clustering. We propose a differentiable loss function which combines the graph Laplacian with a gating mechanism based on continuous approximation of Bernoulli random variables. The Laplacian is used to define a scoring term that favors low-frequency features, while the parameters of the Bernoulli variables are trained to enable selection of the most informative features. We mathematically motivate the proposed approach and demonstrate that in the high noise regime, it is crucial to compute the Laplacian on the gated inputs, rather than on the full feature set. Experimental demonstration of the efficacy of the proposed approach and its advantage over current baselines is provided using several real-world examples.

Gaussian process (GP) regression with 1D inputs can often be performed in linear time via a stochastic differential equation formulation. However, for non-Gaussian likelihoods, this requires application of approximate inference methods which can make the implementation difficult, e.g., expectation propagation can be numerically unstable and variational inference can be computationally inefficient. In this paper, we propose a new method that removes such difficulties. Building upon an existing method called conjugate-computation variational inference, our approach enables linear-time inference via Kalman recursions while avoiding numerical instabilities and convergence issues. We provide an efficient JAX implementation which exploits just-in-time compilation and allows for fast automatic differentiation through large for-loops. Overall, our approach leads to fast and stable variational inference in state-space GP models that can be scaled to time series with millions of data points.

Redlining is the discriminatory practice whereby institutions avoided investment in certain neighborhoods due to their demographics. Here we explore the lasting impacts of redlining on the spread of COVID-19 in New York City (NYC). Using data available through the Home Mortgage Disclosure Act, we construct a redlining index for each NYC census tract via a multi-level logistic model. We compare this redlining index with the COVID-19 statistics for each NYC Zip Code Tabulation Area. Accurate mappings of the pandemic would aid the identification of the most vulnerable areas and permit the most effective allocation of medical resources, while reducing ethnic health disparities.

An agent in a non-stationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a non-stationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and non-contextual non-stationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional non-stationary bandit algorithms.

As corporates and governments become more digital, they become vulnerable to various forms of cyber attack. Cyber insurance products have been used as risk management tools, yet their pricing does not reflect actual risk, including that of multiple, catastrophic and contagious losses. For the modelling of aggregate losses from cyber events, in this paper we introduce a bivariate compound dynamic contagion process, where the bivariate dynamic contagion process is a point process that includes both externally excited joint jumps, which are distributed according to a shot noise Cox process and two separate self-excited jumps, which are distributed according to the branching structure of a Hawkes process with an exponential fertility rate, respectively. We analyse the theoretical distributional properties for these processes systematically, based on the piecewise deterministic Markov process developed by Davis (1984) and the univariate dynamic contagion process theory developed by Dassios and Zhao (2011). The analytic expression of the Laplace transform of the compound process and its moments are presented, which have the potential to be applicable to a variety of problems in credit, insurance, market and other operational risks. As an application of this process, we provide insurance premium calculations based on its moments. Numerical examples show that this compound process can be used for the modelling of aggregate losses from cyber events. We also provide the simulation algorithm for statistical analysis, further business applications and research.

In this review paper, we give a comprehensive overview of the large variety of approximation results for neural networks. Approximation rates for classical function spaces as well as benefits of deep neural networks over shallow ones for specifically structured function classes are discussed. While the main body of existing results is for general feedforward architectures, we also depict approximation results for convolutional, residual and recurrent neural networks.

A fundamental concept in two-arm non-parametric survival analysis is the comparison of observed versus expected numbers of events on one of the treatment arms (the choice of which arm is arbitrary), where the expectation is taken assuming that the true survival curves in the two arms are identical. This concept is at the heart of the counting-process theory that provides a rigorous basis for methods such as the log-rank test. It is natural, therefore, to maintain this perspective when extending the log-rank test to deal with non-proportional hazards, for example by considering a weighted sum of the "observed - expected" terms, where larger weights are given to time periods where the hazard ratio is expected to favour the experimental treatment. In doing so, however, one may stumble across some rather subtle issues, related to the difficulty in ascribing a causal interpretation to hazard ratios, that may lead to strange conclusions. An alternative approach is to view non-parametric survival comparisons as permutation tests. With this perspective, one can easily improve on the efficiency of the log-rank test, whilst thoroughly controlling the false positive rate. In particular, for the field of immuno-oncology, where researchers often anticipate a delayed treatment effect, sample sizes could be substantially reduced without loss of power.
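The "observed minus expected" bookkeeping behind the (weighted) log-rank statistic can be sketched as follows. The record layout, the function name, the toy data, and the idea of passing the weight as a plain function of time are assumptions for illustration. With the default weight of 1, the computation reduces to the standard log-rank numerator and hypergeometric variance.

```python
from math import sqrt


def weighted_logrank(records, weight=lambda t: 1.0):
    """Weighted sum of (observed - expected) events on arm 1, and its variance.

    records: list of (time, event, arm), with event = 1 for an observed event,
    0 for censoring, and arm in {0, 1}. Expectations assume identical survival
    curves in both arms.
    """
    event_times = sorted({t for t, event, _ in records if event})
    numerator = variance = 0.0
    for t in event_times:
        at_risk = [(ti, e, a) for ti, e, a in records if ti >= t]
        n = len(at_risk)                                   # at risk, both arms
        n1 = sum(1 for _, _, a in at_risk if a == 1)       # at risk, arm 1
        d = sum(1 for ti, e, _ in at_risk if e and ti == t)            # events
        d1 = sum(1 for ti, e, a in at_risk if e and ti == t and a == 1)
        w = weight(t)
        numerator += w * (d1 - d * n1 / n)                 # observed - expected
        if n > 1:
            variance += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return numerator, variance


# Toy data: arm 0 has events at times 1 and 3; arm 1 has an event at time 2
# and a censored observation at time 4.
records = [(1, 1, 0), (3, 1, 0), (2, 1, 1), (4, 0, 1)]
o_minus_e, var = weighted_logrank(records)
z = o_minus_e / sqrt(var)
print(round(o_minus_e, 4), round(var, 4), round(z, 4))
```

Passing a `weight` that grows with time, for example `lambda t: t`, gives more influence to late differences, in the spirit of the delayed-effect setting mentioned above.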

Graph Neural Networks (GNNs) have been extensively used to extract meaningful representations from graph-structured data and to perform predictive tasks such as node classification and link prediction. In recent years, there has been a lot of work incorporating edge features along with node features for prediction tasks. One of the main difficulties in using edge features is that they are often handcrafted, hard to get, specific to a particular domain, and may contain redundant information. In this work, we present a framework for creating new edge features, applicable to any domain, via a combination of self-supervised and unsupervised learning. In addition to this, we use Forman-Ricci curvature as an additional edge feature to encapsulate the local geometry of the graph. We then encode our edge features via a Set Transformer and combine them with node features extracted from popular GNN architectures for node classification in an end-to-end training scheme. We validate our work on three biological datasets comprising single-cell RNA sequencing data of neurological disease, \textit{in vitro} SARS-CoV-2 infection, and human COVID-19 patients. We demonstrate that our method achieves better performance on node classification tasks over baseline Graph Attention Network (GAT) and Graph Convolutional Network (GCN) models. Furthermore, given the attention mechanism on edge and node features, we are able to interpret the cell types and genes that determine the course and severity of COVID-19, contributing to a growing list of potential disease biomarkers and therapeutic targets.

Neural architecture search (NAS) with an accuracy predictor that predicts the accuracy of candidate architectures has drawn increasing interest due to its simplicity and effectiveness. Previous works employ neural network based predictors which unfortunately cannot exploit the tabular data representations of network architectures well. As decision tree-based models can better handle tabular data, in this paper, we propose to leverage gradient boosting decision tree (GBDT) as the predictor for NAS and demonstrate that it can improve the prediction accuracy and help to find better architectures than neural network based predictors. Moreover, considering that a better and compact search space can ease the search process, we propose to prune the search space gradually according to important features derived from GBDT using an interpreting tool named SHAP. In this way, NAS can be performed by first pruning the search space (using GBDT as a pruner) and then searching a neural architecture (using GBDT as a predictor), which is more efficient and effective. Experiments on NASBench-101 and ImageNet demonstrate the effectiveness of GBDT for NAS: (1) NAS with GBDT predictor finds a top-10 architecture (among all the architectures in the search space) with $0.18\%$ test regret on NASBench-101, and achieves $24.2\%$ top-1 error rate on ImageNet; and (2) GBDT based search space pruning and neural architecture search further achieve $23.5\%$ top-1 error rate on ImageNet.

Deep generative models have proven useful for automatic design synthesis and design space exploration. However, they face three challenges when applied to engineering design: 1) generated designs lack diversity, 2) it is difficult to explicitly improve all the performance measures of generated designs, and 3) existing models generally do not generate high-performance novel designs, outside the domain of the training data. To address these challenges, we propose MO-PaDGAN, which contains a new Determinantal Point Processes based loss function for probabilistic modeling of diversity and performances. Through a real-world airfoil design example, we demonstrate that MO-PaDGAN expands the existing boundary of the design space towards high-performance regions and generates new designs with high diversity and performances exceeding training data.

The issue of variance components testing arises naturally when building mixed-effects models, to decide which effects should be modeled as fixed or random. While tests for fixed effects are available in R for models fitted with lme4, tools are missing when it comes to random effects. The varTestnlme package for R aims at filling this gap. It allows testing whether any subset of the variances and covariances is equal to zero using likelihood ratio tests. It also offers the possibility to test simultaneously for fixed effects and variance components. It can be used for linear, generalized linear or nonlinear mixed-effects models fitted via lme4, nlme or saemix. Theoretical properties of the likelihood ratio test used are recalled and examples based on different real datasets using different mixed models are provided.
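
To see why such tests need dedicated tools, consider the simplest case: a single variance is tested against zero, and every other parameter lies in the interior of the parameter space. The null value then sits on the boundary, and a standard asymptotic result gives the likelihood ratio statistic a 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom, rather than a plain chi-square. The Python sketch below computes the corresponding p-value. It is an illustration of that special case only, not the varTestnlme interface, and the function name is hypothetical.

```python
import math


def boundary_lrt_pvalue(lrt_statistic):
    """P-value under the mixture 0.5 * chi2(0) + 0.5 * chi2(1).

    Assumes a single variance component is tested against zero.
    """
    if lrt_statistic <= 0.0:
        return 1.0
    # Upper tail of chi2(1): P(Z^2 > x) = erfc(sqrt(x / 2)).
    chi2_1_tail = math.erfc(math.sqrt(lrt_statistic / 2.0))
    return 0.5 * chi2_1_tail


# The mixture p-value is half the naive chi2(1) p-value for any positive statistic.
print(boundary_lrt_pvalue(2.705543))
```

Using the naive chi-square reference distribution in this case would therefore be conservative by a factor of two.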

Structures of brain arterial networks (BANs), which are complex arrangements of individual arteries, their branching patterns, and inter-connectivities, play an important role in characterizing and understanding brain physiology. One would like tools for statistically analyzing the shapes of BANs, i.e., to quantify shape differences, compare populations of subjects, and study the effects of covariates on these shapes. This paper mathematically represents and statistically analyzes BAN shapes as elastic shape graphs. Each elastic shape graph is made up of nodes that are connected by a number of 3D curves, or edges, with arbitrary shapes. We develop a mathematical representation, a Riemannian metric and other geometrical tools, such as computations of geodesics, means and covariances, and PCA for analyzing elastic graphs and BANs. This analysis is applied to BANs after separating them into four components: top, bottom, left, and right. This framework is then used to generate shape summaries of BANs from 92 subjects, and to study the effects of age and gender on shapes of BAN components. We conclude that while gender effects require further investigation, age has a clear, quantifiable effect on BAN shapes. Specifically, we find an increased variance in BAN shapes as age increases.

We present a theoretical framework for a (copula-based) notion of dissimilarity between subsets of continuous random variables and study its main properties. Special attention is paid to those properties that are relevant to hierarchical agglomerative methods, such as reducibility. We hence provide insights for the use of such a measure in clustering algorithms, which allows us to cluster random variables according to the association/dependence among them, and present a simulation study. Real case studies illustrate the whole methodology.

How can humans and machines learn to make joint decisions? This has become an important question in domains such as medicine, law and finance. We approach the question from a theoretical perspective and formalize our intuitions about human-machine decision making in a non-symmetric bandit model. In doing so, we follow the example of a doctor who is assisted by a computer program. We show that in our model, exploration is generally hard. In particular, unless one is willing to make assumptions about how human and machine interact, the machine cannot explore efficiently. We highlight one such assumption, policy space independence, which resolves the coordination problem and allows both players to explore independently. Our results shed light on the fundamental difficulties faced by the interaction of humans and machines. We also discuss practical implications for the design of algorithmic decision systems.

We introduce in this work a new approach for online approximate Bayesian learning. The main idea of the proposed method is to approximate the sequence $(\pi_t)_{t\geq 1}$ of posterior distributions by a sequence $(\tilde{\pi}_t)_{t\geq 1}$ which (i) can be estimated in an online fashion using sequential Monte Carlo methods and (ii) is shown to converge to the same distribution as the sequence $(\pi_t)_{t\geq 1}$, under weak assumptions on the statistical model at hand. In its simplest version, $(\tilde{\pi}_t)_{t\geq 1}$ is the sequence of filtering distributions associated to a particular state-space model, which can therefore be approximated using a standard particle filter algorithm. We illustrate on several challenging examples the benefits of this approach for approximate Bayesian parameter inference, and with one real data example we show that its online predictive performance can significantly outperform that of stochastic gradient descent and streaming variational Bayes.

We present a federated learning approach for learning a client-adaptable, robust model when data is non-identically and non-independently distributed (non-IID) across clients. By simulating heterogeneous clients, we show that adding learned client-specific conditioning improves model performance, and the approach is shown to work on balanced and imbalanced data sets from both audio and image domains. The client adaptation is implemented by a conditional gated activation unit and is particularly beneficial when there are large differences between the data distributions of the clients, a common scenario in federated learning.

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
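
The first approximation described above, attention computed once per cluster centroid and shared by all queries in that cluster, can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: the cluster assignments are taken as given rather than computed, the refinement step that recomputes exact dot products for the top keys is omitted, and all function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def clustered_attention(queries, keys, values, assignments, n_clusters):
    """Attend once per cluster centroid and reuse the result for its members.

    assignments[i] is the cluster index of queries[i] (assumed given here).
    """
    dim = len(queries[0])
    outputs_per_cluster = []
    for c in range(n_clusters):
        members = [q for q, a in zip(queries, assignments) if a == c]
        if members:
            centroid = [sum(col) / len(members) for col in zip(*members)]
        else:
            centroid = [0.0] * dim  # unused placeholder for an empty cluster
        outputs_per_cluster.append(attend(centroid, keys, values))
    return [outputs_per_cluster[a] for a in assignments]


queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(clustered_attention(queries, keys, values, assignments=[0, 0, 1], n_clusters=2))
```

The number of query-key dot products drops from (number of queries) x (number of keys) to (number of clusters) x (number of keys), which is the source of the linear complexity for a fixed number of clusters. When all queries in a cluster coincide, as in the toy input, the result equals full attention.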

This article explores the use of machine learning models to build a market generator. The underlying idea is to simulate artificial multi-dimensional financial time series, whose statistical properties are the same as those observed in the financial markets. In particular, these synthetic data must preserve the probability distribution of asset returns, the stochastic dependence between the different assets and the autocorrelation across time. The article then proposes a new approach for estimating the probability distribution of backtest statistics. The final objective is to develop a framework for improving the risk management of quantitative investment strategies, in particular in the space of smart beta, factor investing and alternative risk premia.

Using the language of differential geometry, I derive a form of the Bayesian Cram\'er-Rao bound that remains invariant under reparametrization. By assuming that the prior probability density is the square of a wavefunction, I also express the bound in terms of functionals that are quadratic with respect to the wavefunction and its gradient. The problem of finding an unfavorable prior to tighten the bound for minimax estimation is shown, in a special case, to be equivalent to finding the ground-state energy with the Schr\"odinger equation, with the Fisher information playing the role of the potential.

Datasets for biosignals, such as electroencephalogram (EEG) and electrocardiogram (ECG), often have noisy labels and a limited number of subjects (<100). To handle these challenges, we propose a self-supervised approach based on contrastive learning to model biosignals with a reduced reliance on labeled data and with fewer subjects. In this regime of limited labels and subjects, intersubject variability negatively impacts model performance. Thus, we introduce subject-aware learning through (1) a subject-specific contrastive loss, and (2) adversarial training to promote subject-invariance during the self-supervised learning. We also develop a number of time-series data augmentation techniques to be used with the contrastive loss for biosignals. Our method is evaluated on publicly available datasets of two different biosignals with different tasks: EEG decoding and ECG anomaly detection. The embeddings learned using self-supervision yield competitive classification results compared to entirely supervised methods. We show that subject-invariance improves representation quality for these tasks, and observe that subject-specific loss increases performance when fine-tuning with supervised labels.

Deep generative models have been successfully applied to Zero-Shot Learning (ZSL) recently. However, the underlying drawbacks of GANs and VAEs (e.g., the hardness of training with ZSL-oriented regularizers and the limited generation quality) hinder the existing generative ZSL models from fully bypassing the seen-unseen bias. To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. The proposed Invertible Zero-shot Flow (IZF) learns factorized data embeddings (i.e., the semantic factors and the non-semantic ones) with the forward pass of an invertible flow network, while the reverse pass generates data samples. This procedure theoretically extends conventional generative flows to a factorized conditional scheme. To explicitly solve the bias problem, our model enlarges the seen-unseen distributional discrepancy based on negative sample-based distance measurement. Notably, IZF works flexibly with either a naive Bayesian classifier or a held-out trainable one for zero-shot recognition. Experiments on widely-adopted ZSL benchmarks demonstrate the significant performance gain of IZF over existing methods, in both classic and generalized settings.

We study multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible when achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(\frac{N \log T}{ \log \log T})$. In the fixed-horizon setting, our algorithm FH-DUCB incurs $O(N \log \log T)$ assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost $O(N \log^2 T)$.

De novo molecular design attempts to search over the chemical space for molecules with the desired property. Recently, deep learning has gained considerable attention as a promising approach to solve the problem. In this paper, we propose genetic expert-guided learning (GEGL), a simple yet novel framework for training a deep neural network (DNN) to generate highly-rewarding molecules. Our main idea is to design a "genetic expert improvement" procedure, which generates high-quality targets for imitation learning of the DNN. Extensive experiments show that GEGL significantly improves over state-of-the-art methods. For example, GEGL manages to solve the penalized octanol-water partition coefficient optimization with a score of 31.82, while the best-known score in the literature is 26.1. Besides, for the GuacaMol benchmark with 20 tasks, our method achieves the highest score for 19 tasks, in comparison with state-of-the-art methods, and newly obtains the perfect score for three tasks.

The General Automated Machine learning Assistant (GAMA) is a modular AutoML system developed to empower users to track and control how AutoML algorithms search for optimal machine learning pipelines, and facilitate AutoML research itself. In contrast to current, often black-box systems, GAMA allows users to plug in different AutoML and post-processing techniques, logs and visualizes the search process, and supports easy benchmarking. It currently features three AutoML search algorithms, two model post-processing steps, and is designed to allow for more components to be added.

We propose a novel framework for structured bandits, which we call an influence diagram bandit. Our framework captures complex statistical dependencies between actions, latent variables, and observations; and thus unifies and extends many existing models, such as combinatorial semi-bandits, cascading bandits, and low-rank bandits. We develop novel online learning algorithms that learn to act efficiently in our models. The key idea is to track a structured posterior distribution of model parameters, either exactly or approximately. To act, we sample model parameters from their posterior and then use the structure of the influence diagram to find the most optimistic action under the sampled parameters. We empirically evaluate our algorithms in three structured bandit problems, and show that they perform as well as or better than problem-specific state-of-the-art baselines.
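
The act-by-posterior-sampling idea can be illustrated in the simplest unstructured case: independent Bernoulli arms with Beta posteriors, where sampling from the posterior and acting greedily on the sample is ordinary Thompson sampling. The Python sketch below is only that special case. It does not model latent variables or the influence diagram structure, and the function name and the simulated `true_means` are assumptions made for the example.

```python
import random


def thompson_sampling(true_means, n_rounds, seed=0):
    """Run Thompson sampling on simulated Bernoulli arms.

    true_means is used only to simulate rewards; the learner never sees it.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) prior on each arm
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Sample one parameter per arm from the current posterior.
        sampled = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled parameters.
        arm = max(range(n_arms), key=sampled.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the pulled arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


print(thompson_sampling([0.2, 0.8], n_rounds=2000))
```

In a structured model, the per-arm Beta posterior would be replaced by a posterior over the shared model parameters, tracked exactly or approximately, and the greedy step by a search over actions that exploits the diagram structure.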

The selection of coarse-grained (CG) mapping operators is a critical step for CG molecular dynamics (MD) simulation. What is optimal for this choice remains an open question, and there is a need for theory. The current state-of-the-art method is manual selection of mapping operators by experts. In this work, we demonstrate an automated approach by viewing this problem as supervised learning where we seek to reproduce the mapping operators produced by experts. We present a graph neural network based CG mapping predictor called Deep Supervised Graph Partitioning Model (DSGPM) that treats the prediction of mapping operators as a graph segmentation problem. DSGPM is trained on a novel dataset, Human-annotated Mappings (HAM), consisting of 1,206 molecules with expert-annotated mapping operators. HAM can be used to facilitate further research in this area. Our model uses a novel metric learning objective to produce high-quality atomic features that are used in spectral clustering. The results show that the DSGPM outperforms state-of-the-art methods in the field of graph segmentation.

Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types.

Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, there are some remaining issues, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and efficient exploration. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighing sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions using highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.
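
Ingredient (c) can be sketched for a discrete action space as follows. The sketch assumes that the upper-confidence bound of an action is the ensemble mean of its Q-estimates plus a multiple of their standard deviation. The function name, the `ucb_lambda` parameter and the toy numbers are hypothetical and are not taken from the SUNRISE code.

```python
import statistics


def ucb_action(q_values_per_member, ucb_lambda=1.0):
    """Pick the action with the highest mean + ucb_lambda * std across the ensemble.

    q_values_per_member[i][a] is the Q-estimate of ensemble member i for action a.
    """
    n_actions = len(q_values_per_member[0])
    scores = []
    for a in range(n_actions):
        estimates = [member[a] for member in q_values_per_member]
        mean = statistics.fmean(estimates)
        std = statistics.pstdev(estimates)  # ensemble disagreement as uncertainty
        scores.append(mean + ucb_lambda * std)
    return max(range(n_actions), key=scores.__getitem__)


# Three ensemble members, two actions: the members agree on action 0
# and disagree on action 1.
ensemble = [[1.0, 0.9], [1.0, 1.3], [1.0, 0.5]]
print(ucb_action(ensemble, ucb_lambda=1.0), ucb_action(ensemble, ucb_lambda=0.0))
```

With `ucb_lambda=0.0` the rule reduces to acting greedily on the ensemble mean; a positive value favours actions on which the ensemble members disagree, which is what drives exploration.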

Multi-Arm Multi-Stage (MAMS) platform trials are an efficient tool for the comparison of several treatments with a control. Suppose a new treatment becomes available at some stage of a trial already in progress. There are clear benefits to adding the treatment to the current trial for comparison, but how?

As flexible as the MAMS framework is, it requires pre-planned options for how the trial proceeds at each stage in order to control the familywise error rate. Thus, as with many adaptive designs, it is difficult to make unplanned design modifications. The conditional error approach is a tool that allows unplanned design modifications while maintaining the overall error rate. In this work we use the conditional error approach to allow adding new arms to a MAMS trial in progress.

Using a single-stage two-arm trial, we demonstrate the principles of incorporating additional hypotheses into the testing structure. With this framework for adding treatments and hypotheses in place, we show how to update the testing procedure for a MAMS trial in progress to incorporate additional treatment arms. Through simulation, we illustrate the operating characteristics of such procedures.

Bayesian computation for filtering and forecasting analysis is developed for a broad class of dynamic models. The ability to scale-up such analyses in non-Gaussian, nonlinear multivariate time series models is advanced through the introduction of a novel copula construction in sequential filtering of coupled sets of dynamic generalized linear models. The new copula approach is integrated into recently introduced multiscale models in which univariate time series are coupled via nonlinear forms involving dynamic latent factors representing cross-series relationships. The resulting methodology offers dramatic speed-up in online Bayesian computations for sequential filtering and forecasting in this broad, flexible class of multivariate models. Two examples in nonlinear models for very heterogeneous time series of non-negative counts demonstrate massive computational efficiencies relative to existing simulation-based methods, while defining similar filtering and forecasting outcomes.

Neural architecture search (NAS) has been extensively studied in the past few years. A popular approach is to represent each neural architecture in the search space as a directed acyclic graph (DAG), and then search over all DAGs by encoding the adjacency matrix and list of operations as a set of hyperparameters. Recent work has demonstrated that even small changes to the way each architecture is encoded can have a significant effect on the performance of NAS algorithms.

In this work, we present the first formal study on the effect of architecture encodings for NAS, including a theoretical grounding and an empirical study. First we formally define architecture encodings and give a theoretical characterization on the scalability of the encodings we study. Then we identify the main encoding-dependent subroutines which NAS algorithms employ, running experiments to show which encodings work best with each subroutine for many popular algorithms. The experiments act as an ablation study for prior work, disentangling the algorithmic and encoding-based contributions, as well as a guideline for future work. Our results demonstrate that NAS encodings are an important design decision which can have a significant impact on overall performance. Our code is available at https://github.com/naszilla/nas-encodings.

In this paper, we propose to train deep neural networks with biomechanical simulations, to predict the prostate motion encountered during ultrasound-guided interventions. In this application, unstructured points are sampled from segmented pre-operative MR images to represent the anatomical regions of interest. The point sets are then assigned with point-specific material properties and displacement loads, forming the un-ordered input feature vectors. An adapted PointNet can be trained to predict the nodal displacements, using finite element (FE) simulations as ground-truth data. Furthermore, a versatile bootstrap aggregating mechanism is validated to accommodate the variable number of feature vectors due to different patient geometries, comprised of a training-time bootstrap sampling and a model averaging inference. This results in a fast and accurate approximation to the FE solutions without requiring subject-specific solid meshing. Based on 160,000 nonlinear FE simulations on clinical imaging data from 320 patients, we demonstrate that the trained networks generalise to unstructured point sets sampled directly from holdout patient segmentation, yielding a near real-time inference and an expected error of 0.017 mm in predicted nodal displacement.

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.

Reinforcement learning is typically concerned with learning control policies tailored to a particular agent. We investigate whether there exists a single global policy that can generalize to control a wide variety of agent morphologies -- ones in which even the dimensionality of state and action spaces changes. We propose to express this global policy as a collection of identical modular neural networks, dubbed Shared Modular Policies (SMP), that correspond to each of the agent's actuators. Every module is only responsible for controlling its corresponding actuator and receives information from only its local sensors. In addition, messages are passed between modules, propagating information between distant modules. We show that a single modular policy can successfully generate locomotion behaviors for several planar agents with different skeletal structures such as monopod hoppers, quadrupeds, bipeds, and generalize to variants not seen during training -- a process that would normally require training and manual hyperparameter tuning for each morphology. We observe that a wide variety of drastically diverse locomotion styles across morphologies, as well as centralized coordination, emerge via message passing between decentralized modules purely from the reinforcement learning objective. Videos and code at https://huangwl18.github.io/modular-rl/

Most of the literature on direct and indirect effects assumes that there are no post-treatment common causes of the mediator and the outcome. In contrast to natural direct and indirect effects, organic direct and indirect effects, which were introduced in Lok (2016, 2020), can be extended to provide an identification result for settings with post-treatment common causes of the mediator and the outcome. This article provides a definition and an identification result for organic direct and indirect effects in the presence of post-treatment common causes of mediator and outcome. These new organic indirect and direct effects have interpretations in terms of intervention effects. Organic indirect effects in the presence of post-treatment common causes are an addition to indirect effects through multivariate mediators. Organic indirect effects in the presence of post-treatment common causes can be used, e.g., (1) to predict the effect of the initial treatment if its side effects are suppressed through additional interventions or (2) to predict the effect of a treatment that does not affect the post-treatment common cause and affects the mediator the same way as the initial treatment.

The spectral distribution $f(\omega)$ of a stationary time series $\{Y_t\}_{t\in\mathbb{Z}}$ can be used to investigate whether or not periodic structures are present in $\{Y_t\}_{t\in\mathbb{Z}}$, but $f(\omega)$ has some limitations due to its dependence on the autocovariances $\gamma(h)$. For example, $f(\omega)$ cannot distinguish i.i.d. white noise from GARCH-type models (whose terms are dependent, but uncorrelated), which implies that $f(\omega)$ can be an inadequate tool when $\{Y_t\}_{t\in\mathbb{Z}}$ contains asymmetries and nonlinear dependencies.

Asymmetries between the upper and lower tails of a time series can be investigated by means of the local Gaussian autocorrelations introduced in Tj{\o}stheim and Hufthammer (2013), and these local measures of dependence can be used to construct the local Gaussian spectral density presented in this paper. A key feature of the new local spectral density is that it coincides with $f(\omega)$ for Gaussian time series, which implies that it can be used to detect non-Gaussian traits in the time series under investigation. In particular, if $f(\omega)$ is flat, then peaks and troughs of the new local spectral density can indicate nonlinear traits, which potentially might discover local periodic phenomena that remain undetected in an ordinary spectral analysis.
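
The limitation of $f(\omega)$ noted above can be checked numerically. The Python sketch below simulates an ARCH(1) series, used here as a simple stand-in for GARCH-type models, and compares the lag-one sample autocorrelation of the series with that of its squares. It does not implement the local Gaussian spectral density; the function names and the parameter values are assumptions chosen for illustration.

```python
import math
import random


def simulate_arch1(n, omega=1.0, alpha=0.3, seed=0):
    """Simulate Y_t = sigma_t * Z_t with sigma_t^2 = omega + alpha * Y_{t-1}^2."""
    rng = random.Random(seed)
    series = []
    previous = 0.0
    for _ in range(n):
        sigma2 = omega + alpha * previous * previous
        previous = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        series.append(previous)
    return series


def sample_autocorrelation(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    n = len(x)
    mean = sum(x) / n
    denominator = sum((v - mean) ** 2 for v in x)
    numerator = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numerator / denominator


y = simulate_arch1(20000)
y_squared = [v * v for v in y]
# The series itself is uncorrelated, so its ordinary spectrum is flat,
# while the squared series is clearly autocorrelated.
print(sample_autocorrelation(y, 1), sample_autocorrelation(y_squared, 1))
```

The first value should be close to zero and the second clearly positive (the theoretical lag-one autocorrelation of the squares is `alpha` for ARCH(1) with finite fourth moment), which is the kind of dependence that a tool based only on $\gamma(h)$ cannot see.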

We provide a complete picture of asymptotically minimax estimation of $L_r$-norms (for any $r\ge 1$) of the mean in Gaussian white noise model over Nikolskii-Besov spaces. In this regard, we complement the work of Lepski, Nemirovski and Spokoiny (1999), who considered the cases of $r=1$ (with poly-logarithmic gap between upper and lower bounds) and $r$ even (with asymptotically sharp upper and lower bounds) over H\"{o}lder spaces. We additionally consider the case of asymptotically adaptive minimax estimation and demonstrate a difference between even and non-even $r$ in terms of an investigator's ability to produce asymptotically adaptive minimax estimators without paying a penalty.

We provide a minimax optimal estimation procedure for F and W in matrix valued linear models Y = F W + Z where the parameter matrix W and the design matrix F are unknown but the latter takes values in a known finite set. The proposed finite alphabet linear model is justified in a variety of applications, ranging from signal processing to cancer genetics. We show that this allows one to separate F and W uniquely under weak identifiability conditions, a task which is not possible in general. To this end we quantify in the noiseless case, that is, Z = 0, the perturbation range of Y in order to obtain stable recovery of F and W. Based on this, we derive an iterative Lloyd's type estimation procedure that attains minimax estimation rates for W and F for Gaussian error matrix Z. In contrast to the least squares solution, the estimation procedure can be computed efficiently and scales linearly with the total number of observations. We confirm our theoretical results in a simulation study and illustrate them with a genetic sequencing data example.

It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity.

We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
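
To make the idea of aggregating rare features concrete, here is a minimal sketch of the aggregation step alone. It assumes the tree has already been cut so that each leaf feature carries a group label; the function name `aggregate_by_groups` and the toy word groups are hypothetical, and the estimator in the rare package chooses the aggregation level from the data rather than fixing it in advance.

```python
import numpy as np

def aggregate_by_groups(X, groups):
    """Sum the columns of a count matrix X that share a group label.

    X      : (n_samples, n_features) array of counts
    groups : one label per column, e.g. the ancestor of each leaf in a similarity tree
    """
    labels = sorted(set(groups))
    position = {g: k for k, g in enumerate(labels)}
    out = np.zeros((X.shape[0], len(labels)))
    for j, g in enumerate(groups):
        out[:, position[g]] += X[:, j]
    return out, labels

# Columns: "superb", "stellar", "exquisite", "dreadful" (each word appears rarely)
X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
groups = ["praise", "praise", "praise", "complaint"]
X_dense, labels = aggregate_by_groups(X, groups)
print(labels)
print(X_dense)
```

Each "praise" word occurs in only one review, but the aggregated column is nonzero in every review, which is the kind of denser predictor the abstract refers to.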

Finite mixtures are a flexible modeling tool for irregularly shaped densities and samples from heterogeneous populations. When modeling with mixtures using an exchangeable prior on the component features, the component labels are arbitrary and are indistinguishable in posterior analysis. This makes it impossible to attribute any meaningful interpretation to the marginal posterior distributions of the component features. We propose a model in which a small number of observations are assumed to arise from some of the labeled component densities. The resulting model is not exchangeable, allowing inference on the component features without post-processing. Our method assigns meaning to the component labels at the modeling stage and can be justified as a data-dependent informative prior on the labelings. We show that our method produces interpretable results, often (but not always) similar to those resulting from relabeling algorithms, with the added benefit that the marginal inferences originate directly from a well specified probability model rather than a post hoc manipulation. We provide asymptotic results leading to practical guidelines for model selection that are motivated by maximizing prior information about the class labels and demonstrate our method on real and simulated data.

We study nonparametric maximum likelihood estimation for two classes of multivariate distributions that imply strong forms of positive dependence; namely log-supermodular (MTP$_2$) distributions and log-$L^\#$-concave (LLC) distributions. In both cases we also assume log-concavity in order to ensure boundedness of the likelihood function. Given $n$ independent and identically distributed random vectors in $\mathbb R^d$ from one of our distributions, the maximum likelihood estimator (MLE) exists a.s. and is unique a.e. with probability one when $n\geq 3$. This holds independently of the ambient dimension $d$. We conjecture that the MLE is always the exponential of a tent function. We prove this result for samples in $\{0,1\}^d$ or in $\mathbb{R}^2$ under MTP$_2$, and for samples in $\mathbb{Q}^d$ under LLC. Finally, we provide a conditional gradient algorithm for computing the maximum likelihood estimate.

We propose a framework for estimation and inference when the model may be misspecified. We rely on a local asymptotic approach where the degree of misspecification is indexed by the sample size. We construct estimators whose mean squared error is minimax in a neighborhood of the reference model, based on simple one-step adjustments. In addition, we provide confidence intervals that contain the true parameter under local misspecification. To interpret the degree of misspecification, we map it to the local power of a specification test of the reference model. Our approach allows for systematic sensitivity analysis when the parameter of interest may be partially or irregularly identified. As illustrations, we study two binary choice models: a cross-sectional model where the error distribution is misspecified, and a dynamic panel data model where the number of time periods is small and the distribution of individual effects is misspecified.

Analysis of structural and functional connectivity (FC) of human brains is of pivotal importance for diagnosis of cognitive ability. The Human Connectome Project (HCP) provides an excellent source of neural data across different regions of interest (ROIs) of the living human brain. Individual specific data were available from an existing analysis (Dai et al., 2017) in the form of time varying covariance matrices representing the brain activity as the subjects perform a specific task. As a preliminary objective of studying the heterogeneity of brain connectomics across the population, we develop a probabilistic model for a sample of covariance matrices using a scaled Wishart distribution. We stress here that our data units are available in the form of covariance matrices, and we use the Wishart distribution to create our likelihood function rather than its more common usage as a prior on covariance matrices. Based on empirical explorations suggesting the data matrices to have low effective rank, we further model the center of the Wishart distribution using an orthogonal factor model type decomposition. We encourage shrinkage towards a low rank structure through a novel shrinkage prior and discuss strategies to sample from the posterior distribution using a combination of Gibbs and slice sampling. We extend our modeling framework to a dynamic setting to detect change points. The efficacy of the approach is explored in various simulation settings and exemplified on several case studies including our motivating HCP data.

Speech Acts (SAs) are an important area of pragmatics: they give us a better understanding of people's state of mind and convey an intended language function. Knowledge of the SA of a text can be helpful in analyzing that text in natural language processing applications. This study presents a dictionary-based statistical technique for Persian SA recognition. The proposed technique classifies a text into seven classes of SA based on four criteria: lexical, syntactic, semantic, and surface features. WordNet is utilized as the tool for extracting synonyms and enriching the feature dictionary. To evaluate the proposed technique, we utilized four classification methods: Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbors (KNN). The experimental results demonstrate that the proposed method, using RF and SVM as the best classifiers, achieved state-of-the-art performance with an accuracy of 0.95 for classification of Persian SAs. Our original motivation for this work is to introduce an application of SA recognition to social media content, especially the common SAs in rumors. Therefore, the proposed system was utilized to determine the common SAs in rumors. The results showed that Persian rumors are often expressed in three SA classes, namely narrative, question, and threat, and in some cases with the request SA.

We propose a new Conditional BEKK matrix-F (CBF) model for the time-varying realized covariance (RCOV) matrices. This CBF model is capable of capturing heavy-tailed RCOV, which is an important stylized fact but could not be handled adequately by the Wishart-based models. To further mimic the long memory feature of the RCOV, a special CBF model with the conditional heterogeneous autoregressive (HAR) structure is introduced. Moreover, we give a systematic study of the probabilistic properties and statistical inference of the CBF model, including exploring its stationarity, establishing the asymptotics of its maximum likelihood estimator, and giving some new inner-product-based tests for its model checking. In order to handle a large dimensional RCOV matrix, we construct two reduced CBF models -- the variance-target CBF model (for moderate but fixed dimensional RCOV matrix) and the factor CBF model (for high dimensional RCOV matrix). For both reduced models, the asymptotic theory of the estimated parameters is derived. The importance of our entire methodology is illustrated by simulation results and two real examples.

We study a mean-field spike and slab variational Bayes (VB) approximation to Bayesian model selection priors in sparse high-dimensional linear regression. Under compatibility conditions on the design matrix, oracle inequalities are derived for the mean-field VB approximation, implying that it converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector. The empirical performance of our algorithm is studied, showing that it performs comparably to other state-of-the-art Bayesian variable selection methods. We also numerically demonstrate that the widely used coordinate-ascent variational inference (CAVI) algorithm can be highly sensitive to the parameter updating order, leading to potentially poor performance. To mitigate this, we propose a novel prioritized updating scheme that uses a data-driven updating order and performs better in simulations.

The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been used to handle missing values in the regression context. In this article, we develop fractional imputation methods for estimating the average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation's accuracy and precision with that of multiple imputation.

Count data is becoming more and more ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction to clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data, also known as the probabilistic clustering-projection model in the literature. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm, where inference and clustering are jointly done by mixing a classification variational expectation maximization algorithm, with a branch & bound like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and robustness of the method. Finally, we illustrate the qualitative interest of the latter in a real-world application, for the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.

We study a nonparametric contextual bandit problem where the expected reward functions belong to a H\"older class with smoothness parameter $\beta$. We show how this interpolates between two extremes that were previously studied in isolation: non-differentiable bandits ($\beta\leq1$), where rate-optimal regret is achieved by running separate non-contextual bandits in different context regions, and parametric-response bandits ($\beta=\infty$), where rate-optimal regret can be achieved with minimal or no exploration due to infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and non-differentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.

The minimum error entropy (MEE) criterion has been verified as a powerful approach for non-Gaussian signal processing and robust machine learning. However, the application of MEE to robust classification is largely absent from the literature. The original MEE only focuses on minimizing R\'enyi's quadratic entropy of the error probability distribution function (PDF), which could cause failure in noisy classification tasks. To this end, we analyze the optimal error distribution in the presence of outliers for those classifiers with continuous errors, and introduce a simple codebook to restrict MEE so that it drives the error PDF towards the desired case. Half-quadratic based optimization and convergence analysis of the new learning criterion, called restricted MEE (RMEE), are provided. Experimental results with logistic regression and extreme learning machine are presented to verify the desirable robustness of RMEE.

Continuous control tasks in reinforcement learning are important because they provide a framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily become trapped in suboptimal solutions. One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem. In this work, we present a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies. Empirical results on the MuJoCo suite demonstrate a high performance gain for our algorithm compared to prior work, including Soft Actor-Critic (SAC).

We propose a new family of efficient and expressive deep generative models of graphs, called Graph Recurrent Attention Networks (GRANs). Our model generates graphs one block of nodes and associated edges at a time. The block size and sampling stride allow us to trade off sample quality for efficiency. Compared to previous RNN-based graph generative models, our framework better captures the auto-regressive conditioning between the already-generated and to-be-generated parts of the graph using Graph Neural Networks (GNNs) with attention. This not only reduces the dependency on node ordering but also bypasses the long-term bottleneck caused by the sequential nature of RNNs. Moreover, we parameterize the output distribution per block using a mixture of Bernoulli, which captures the correlations among generated edges within the block. Finally, we propose to handle node orderings in generation by marginalizing over a family of canonical orderings. On standard benchmarks, we achieve state-of-the-art time efficiency and sample quality compared to previous models. Additionally, we show our model is capable of generating large graphs of up to 5K nodes with good quality. To the best of our knowledge, GRAN is the first deep graph generative model that can scale to this size. Our code is released at: https://github.com/lrjconan/GRAN.

In this paper, we address two fundamental questions in neural architecture design research: (i) How does an architecture topology impact the gradient flow during training? (ii) Can certain topological characteristics of deep networks indicate a priori (i.e., without training) which models, with a different number of parameters/FLOPS/layers, achieve a similar accuracy? To this end, we formulate the problem of deep learning architecture design from a network science perspective and introduce a new metric called NN-Mass to quantify how effectively information flows through a given architecture. We demonstrate that our proposed NN-Mass is more effective than the number of parameters to characterize the gradient flow properties, and to identify models with similar accuracy, despite having significantly different size/compute requirements. Detailed experiments on both synthetic and real datasets (e.g., MNIST and CIFAR-10/100) provide extensive empirical evidence for our insights. Finally, we exploit our new metric to design efficient architectures directly, and achieve up to 3x fewer parameters and FLOPS, while losing minimal accuracy (96.82% vs. 97%) over large CNNs on CIFAR-10.

Training an agent to solve control tasks directly from high-dimensional images with model-free reinforcement learning (RL) has proven difficult. A promising approach is to learn a latent representation together with the control policy. However, fitting a high-capacity encoder using a scarce reward signal is sample inefficient and leads to poor performance. Prior work has shown that auxiliary losses, such as image reconstruction, can aid efficient representation learning. However, incorporating reconstruction loss into an off-policy learning algorithm often leads to training instability. We explore the underlying reasons and identify variational autoencoders, used by previous investigations, as the cause of the divergence. Following these findings, we propose effective techniques to improve training stability. This results in a simple approach capable of matching state-of-the-art model-free and model-based algorithms on MuJoCo control tasks. Furthermore, our approach demonstrates robustness to observational noise, surpassing existing approaches in this setting. Code, results, and videos are anonymously available at https://sites.google.com/view/sac-ae/home.

Designing energy-efficient networks is of critical importance for enabling state-of-the-art deep learning in mobile and edge settings where the computation and energy budgets are highly limited. Recently, Liu et al. (2019) framed the search for efficient neural architectures as a continuous splitting process: it iteratively splits existing neurons into multiple offspring to achieve progressive loss minimization, thus finding novel architectures by gradually growing the neural network. However, this method was not specifically tailored for designing energy-efficient networks, and is computationally expensive on large-scale benchmarks. In this work, we substantially improve Liu et al. (2019) in two significant ways: 1) we incorporate the energy cost of splitting different neurons to better guide the splitting process, thereby discovering more energy-efficient network architectures; 2) we substantially speed up the splitting process of Liu et al. (2019), which requires expensive eigen-decomposition, by proposing a highly scalable Rayleigh-quotient stochastic gradient algorithm. Our fast algorithm allows us to reduce the computational cost of splitting to the same level of typical back-propagation updates and enables efficient implementation on GPU. Extensive empirical results show that our method can train highly accurate and energy-efficient networks on challenging datasets such as ImageNet, improving a variety of baselines, including the pruning-based methods and expert-designed architectures.

Estimation of individual treatment effects is often used as the basis for contextual decision making in fields such as healthcare, education, and economics. However, in many real-world applications it is sufficient for the decision maker to have upper and lower bounds on the potential outcomes of decision alternatives, allowing them to evaluate the trade-off between benefit and risk. With this in mind, we develop an algorithm for directly learning upper and lower bounds on the potential outcomes under treatment and non-treatment. Our theoretical analysis highlights a trade-off between the complexity of the learning task and the confidence with which the resulting bounds cover the true potential outcomes; the more confident we wish to be, the more complex the learning task is. We suggest a novel algorithm that maximizes a utility function while maintaining valid potential outcome bounds. We illustrate different properties of our algorithm, and highlight how it can be used to guide decision making using two semi-simulated datasets.

Large scale machine learning is increasingly relying on distributed optimization, whereby several machines contribute to the training process of a statistical model. In this work we study the performance of asynchronous, distributed settings, when applying sparsification, a technique used to reduce communication overheads. In particular, for the first time in an asynchronous, non-convex setting, we theoretically prove that, in presence of staleness, sparsification does not harm SGD performance: the ergodic convergence rate matches the known result of standard SGD, that is $\mathcal{O} \left( 1/\sqrt{T} \right)$. We also carry out an empirical study to complement our theory, and confirm that the effects of sparsification on the convergence rate are negligible, when compared to 'vanilla' SGD, even in the challenging scenario of an asynchronous, distributed system.
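
As an illustration of what sparsification means in this setting, the sketch below keeps only the `k` largest-magnitude entries of a gradient before it would be communicated. Top-k selection is one common sparsifier and is an assumption here, since the abstract does not say which operator is analyzed; the function name is hypothetical.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Return a copy of grad with all but the k largest-magnitude entries set to zero."""
    flat = np.asarray(grad, dtype=float).ravel()
    if k <= 0:
        return np.zeros_like(flat).reshape(np.shape(grad))
    if k >= flat.size:
        return flat.copy().reshape(np.shape(grad))
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |g_i|
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(np.shape(grad))

g = np.array([0.1, -2.0, 0.3, 1.5, -0.05])
print(top_k_sparsify(g, 2))  # only the entries -2.0 and 1.5 survive
```

Only the surviving entries and their indices need to be sent to the other workers, which is where the reduction in communication comes from.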

Sequential and temporal data arise in many fields of research, such as quantitative finance, medicine, or computer vision. A novel approach for sequential learning, called the signature method and rooted in rough path theory, is considered. Its basic principle is to represent multidimensional paths by a graded feature set of their iterated integrals, called the signature. This approach relies critically on an embedding principle, which consists in representing discretely sampled data as paths, i.e., functions from $[0,1]$ to ${\mathbb R}^d$. After a survey of machine learning methodologies for signatures, the influence of embeddings on prediction accuracy is investigated with an in-depth study of three recent and challenging datasets. It is shown that a specific embedding, called lead-lag, is systematically better, whatever the dataset or algorithm used. Moreover, it is emphasized through an empirical study that computing signatures over the whole path domain does not lead to a loss of local information. It is concluded that, with a good embedding, the signature combined with a simple algorithm achieves results competitive with state-of-the-art, domain-specific approaches.
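
The embedding-then-signature pipeline described above can be sketched for a one-dimensional series. This is an illustrative implementation under stated assumptions: the lead-lag convention (the lead coordinate moves first), the truncation at level 2, and the function names are choices made here, not details taken from the paper.

```python
import numpy as np

def lead_lag(x):
    """Embed a 1-D series x_0..x_n as a 2-D piecewise-linear path (lead, lag)."""
    x = np.asarray(x, dtype=float)
    doubled = np.repeat(x, 2)
    lead = doubled[1:]   # x0, x1, x1, x2, x2, ...
    lag = doubled[:-1]   # x0, x0, x1, x1, x2, ...
    return np.column_stack([lead, lag])

def signature_level_1_2(path):
    """First two signature levels of a piecewise-linear path, built segment by segment."""
    increments = np.diff(path, axis=0)
    d = path.shape[1]
    s1 = np.zeros(d)
    s2 = np.zeros((d, d))
    for delta in increments:
        # Chen's identity for appending one linear segment to the path so far
        s2 += np.outer(s1, delta) + 0.5 * np.outer(delta, delta)
        s1 += delta
    return s1, s2

x = [0.0, 1.0, 3.0]
s1, s2 = signature_level_1_2(lead_lag(x))
print(s1)                    # total increment of each coordinate
print(s2[0, 1] - s2[1, 0])   # equals the sum of squared increments of x
```

With this convention the antisymmetric part of the second level recovers the quadratic variation of the series, one commonly cited reason why the lead-lag embedding carries extra information.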

Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and show that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, where the regret is measured by the agent's performance against a \emph{fully adversarial} opponent who can exploit the agent's strategy at \emph{any} step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.

Representing shapes as level sets of neural networks has been recently proved to be useful for different shape analysis and reconstruction tasks. So far, such representations were computed using either: (i) pre-computed implicit shape representations; or (ii) loss functions explicitly defined over the neural level sets. In this paper we offer a new paradigm for computing high fidelity implicit neural representations directly from raw data (i.e., point clouds, with or without normal information). We observe that a rather simple loss function, encouraging the neural network to vanish on the input point cloud and to have a unit norm gradient, possesses an implicit geometric regularization property that favors smooth and natural zero level set surfaces, avoiding bad zero-loss solutions. We provide a theoretical analysis of this property for the linear case, and show that, in practice, our method leads to state of the art implicit neural representations with higher level-of-details and fidelity compared to previous methods.
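
The loss described in words above can be written out as a sketch. Let the network be $f(x;\theta)$ and the input point cloud be $\{x_i\}_{i\in I}$. The weight $\lambda>0$ and the sampling distribution $\mathcal{D}$ for the gradient term are assumptions introduced here, not taken from the abstract. One plausible form is

$$
\ell(\theta) \;=\; \frac{1}{|I|}\sum_{i\in I} \big|f(x_i;\theta)\big| \;+\; \lambda\,\mathbb{E}_{x\sim\mathcal{D}}\Big(\big\|\nabla_x f(x;\theta)\big\| - 1\Big)^2 .
$$

The first term pushes $f$ to vanish on the data and the second pushes its gradient toward unit norm, as a signed distance function would have; the exact norms, powers, and weighting used by the authors may differ.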

Stochastic gradient descent without replacement sampling is widely used in practice for model training. However, the vast majority of SGD analyses assumes data is sampled with replacement, and when the function minimized is strongly convex, an $\mathcal{O}\left(\frac{1}{T}\right)$ rate can be established when SGD is run for $T$ iterations. A recent line of breakthrough works on SGD without replacement (SGDo) established an $\mathcal{O}\left(\frac{n}{T^2}\right)$ convergence rate when the function minimized is strongly convex and is a sum of $n$ smooth functions, and an $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^3}{T^3}\right)$ rate for sums of quadratics. On the other hand, the tightest known lower bound postulates an $\Omega\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ rate, leaving open the possibility of better SGDo convergence rates in the general case. In this paper, we close this gap and show that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and offer a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ for strongly convex functions that are sums of smooth functions.
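
The difference between the two sampling schemes can be seen on a toy problem. The sketch below is an illustration only, not the paper's analysis: it minimizes the average of $f_i(w)=\tfrac12 (w-a_i)^2$ with step size $1/(t+1)$, in which case the iterate is the running mean of the sampled $a_i$. The function names and problem sizes are hypothetical.

```python
import numpy as np

def sgd_on_quadratics(a, index_stream):
    """SGD on f_i(w) = 0.5 * (w - a[i])**2 with step size 1/(t+1)."""
    w = 0.0
    for t, i in enumerate(index_stream):
        step = 1.0 / (t + 1)
        w -= step * (w - a[i])  # the gradient of f_i at w is (w - a[i])
    return w

def indices_with_replacement(n, epochs, rng):
    """Draw n * epochs indices independently and uniformly."""
    return rng.integers(0, n, size=n * epochs)

def indices_without_replacement(n, epochs, rng):
    """Use a fresh random permutation of all n indices in every epoch."""
    return np.concatenate([rng.permutation(n) for _ in range(epochs)])

rng = np.random.default_rng(0)
a = rng.normal(size=10)
w_with = sgd_on_quadratics(a, indices_with_replacement(10, 5, rng))
w_without = sgd_on_quadratics(a, indices_without_replacement(10, 5, rng))
print(abs(w_with - a.mean()), abs(w_without - a.mean()))
```

In this special case, sampling without replacement visits every $a_i$ equally often, so after whole epochs the iterate equals the minimizer up to rounding. With-replacement sampling leaves a random error. General objectives do not behave this cleanly, but the example shows why the two schemes can have different rates.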

What makes untrained deep neural networks (DNNs) different from the trained performant ones? By zooming into the weights in well-trained DNNs, we found it is the location of weights that holds most of the information encoded by the training. Motivated by this observation, we hypothesize that weights in DNNs trained with stochastic gradient-based methods can be separated into two dimensions: the locations of weights and their exact values. To assess our hypothesis, we propose a novel method named Lookahead Permutation (LaPerm) to train DNNs by reconnecting the weights. We empirically demonstrate the versatility of LaPerm while producing extensive evidence to support our hypothesis: when the initial weights are random and dense, our method demonstrates speed and performance similar to or better than that of regular optimizers, e.g., Adam; when the initial weights are random and sparse (many zeros), our method changes the way neurons connect and reaches accuracy comparable to that of a well-trained fully initialized network; when the initial weights share a single value, our method finds a weight-agnostic neural network with far better-than-chance accuracy.

We present Geo2DR (Geometric to Distributed Representations), a GPU ready Python library for unsupervised learning on graph-structured data using discrete substructure patterns and neural language models. It contains efficient implementations of popular graph decomposition algorithms and neural language models in PyTorch which can be combined to learn representations of graphs using the distributive hypothesis. Furthermore, Geo2DR comes with general data processing and loading methods to bring substantial speed-up in the training of the neural language models. Through this we provide a modular set of tools and methods to quickly construct systems capable of learning distributed representations of graphs. This is useful for replication of existing methods, modification, or development of completely new methods. This paper serves to present the Geo2DR library and perform a comprehensive comparative analysis of existing methods re-implemented using Geo2DR across widely used graph classification benchmarks. Geo2DR displays a high reproducibility of results in published methods and interoperability with other libraries useful for distributive language modelling.

Investigation of the degree of personalization in federated learning algorithms has shown that only maximizing the performance of the global model will confine the capacity of the local models to personalize. In this paper, we advocate an adaptive personalized federated learning (APFL) algorithm, where each client will train their local models while contributing to the global model. Information theoretically, we prove that the mixture of local and global models can reduce the generalization error. We also propose a communication-reduced bilevel optimization method, which reduces the communication rounds to $O(\sqrt{T})$ and show that under strong convexity and smoothness assumptions, the proposed algorithm can achieve a convergence rate of $O(1/T)$ with some residual error. The residual error is related to the gradient diversity among local models, and the gap between optimal local and global models. The extensive experiments demonstrate the effectiveness of our personalization, as well as the correctness of our theory.

The deductive closure of an ideal knowledge base (KB) contains exactly the logical queries that the KB can answer. However, in practice KBs are both incomplete and over-specified, failing to answer some queries that have real-world answers. \emph{Query embedding} (QE) techniques have been recently proposed where KB entities and KB queries are represented jointly in an embedding space, supporting relaxation and generalization in KB inference. However, experiments in this paper show that QE systems may disagree with deductive reasoning on answers that do not require generalization or relaxation. We address this problem with a novel QE method that is more faithful to deductive reasoning, and show that this leads to better performance on complex queries to incomplete KBs. Finally we show that inserting this new QE module into a neural question-answering system leads to substantial improvements over the state-of-the-art.

Researchers often use model-based multiple imputation to handle missing at random data to minimize bias while making the best use of all available data. However, there are contexts where it is very difficult to fit a model due to constraints amongst variables, and using a generic regression imputation model may result in implausible values. We explore the advantages of employing a logic-based resampling with matching (RWM) approach for multiple imputation. This approach is similar to random hot deck imputation, and allows for more plausible imputations than model-based approaches. We illustrate a RWM approach for multiply imputing missing pain, activity frequency, and sport data using The Childhood Health, Activity, and Motor Performance School Study Denmark (CHAMPS-DK). We match records with missing data to several observed records, generate probabilities for matched records using observed data, and sample from these records based on the probability of each occurring. Because imputed values are generated randomly, multiple complete datasets can be created. They are then analyzed and averaged in the same way as model-based multiple imputation. This approach can be extended to other datasets as an alternative to model-based approaches, particularly where there are time-dependent ordered categorical variables or other constraints between variables.
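
The match, weight, and sample steps described above can be sketched as follows. This is a simplified illustration, not the CHAMPS-DK procedure: the matching rule (exact equality on a set of keys), the variable names, and the function `rwm_impute` are assumptions made for the example.

```python
import random
from collections import Counter

def rwm_impute(records, target, match_keys, n_imputations=5, seed=0):
    """Create several completed datasets by sampling donor values from matched observed records."""
    rng = random.Random(seed)
    donors = [r for r in records if r[target] is not None]
    completed = []
    for _ in range(n_imputations):
        dataset = []
        for record in records:
            row = dict(record)  # copy so the input records are left untouched
            if row[target] is None:
                matched = [d[target] for d in donors
                           if all(d[k] == row[k] for k in match_keys)]
                if matched:  # leave the value missing when no donor matches
                    counts = Counter(matched)
                    values = list(counts)
                    weights = [counts[v] for v in values]  # proportional to observed frequency
                    row[target] = rng.choices(values, weights=weights, k=1)[0]
            dataset.append(row)
        completed.append(dataset)
    return completed

records = [
    {"age_group": "child", "sport": "soccer", "pain": 1},
    {"age_group": "child", "sport": "soccer", "pain": 1},
    {"age_group": "child", "sport": "soccer", "pain": 2},
    {"age_group": "child", "sport": "soccer", "pain": None},
]
for dataset in rwm_impute(records, "pain", ["age_group", "sport"], n_imputations=3):
    print(dataset[3]["pain"])
```

Every imputed value is one that was actually observed in a matching record, which is how this kind of approach avoids implausible values; each completed dataset would then be analyzed and the results combined as in standard multiple imputation.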

Bias in causal comparisons has a direct correspondence with distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score weighting attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. This paper introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting the root source of bias, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses, including the estimation of average treatment effects and individualized treatment rules. Our energy balancing weights (EBW) approach has several advantages over existing weighting techniques. First, it offers a model-free and robust approach for obtaining covariate balance, obviating the need for modeling decisions of secondary nature to the scientific question at hand. Second, since this approach is based on a genuine measure of distributional balance, it provides a means for assessing the balance induced by a given set of weights. Finally, the proposed method is computationally efficient and has desirable theoretical guarantees under mild conditions. We demonstrate the effectiveness of this EBW approach in a suite of simulation experiments and in a study on the safety of right heart catheterization.
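
To make "balancing weighted covariate distributions" concrete, the sketch below computes a weighted form of the standard energy distance between a reweighted group and a target sample. It assumes the Euclidean norm and uniform weights on the target sample, and it only evaluates the distance for given weights. The optimization that would produce balancing weights is not shown, and this is not claimed to be the paper's exact objective.

```python
import math

def _dist(a, b):
    """Euclidean distance between two covariate vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def weighted_energy_distance(x, w, z):
    """Energy distance between the w-weighted sample x and the sample z.

    x: list of covariate vectors (e.g. one treatment group)
    w: non-negative weights for x (normalized to sum to 1 here)
    z: list of covariate vectors for the target sample (uniform weights)
    """
    total = sum(w)
    w = [wi / total for wi in w]
    m = len(z)
    cross = sum(wi * _dist(xi, zj) for xi, wi in zip(x, w) for zj in z) / m
    within_x = sum(wi * wk * _dist(xi, xk)
                   for xi, wi in zip(x, w) for xk, wk in zip(x, w))
    within_z = sum(_dist(zj, zl) for zj in z for zl in z) / (m * m)
    return 2.0 * cross - within_x - within_z

# Toy example: the group over-represents the value 1.0 relative to the target.
group = [[0.0], [1.0]]
target = [[0.0], [0.0], [1.0]]
unweighted = weighted_energy_distance(group, [1.0, 1.0], target)
reweighted = weighted_energy_distance(group, [2.0, 1.0], target)
```

In this toy example the weights `[2.0, 1.0]` reproduce the target distribution exactly, so `reweighted` is zero up to floating-point error while `unweighted` is positive. The same quantity can serve as a balance diagnostic for any candidate set of weights.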

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is within $\varepsilon$ of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.

We give a formal verification procedure that decides whether a classifier ensemble is robust against arbitrary randomized attacks. Such attacks consist of a set of deterministic attacks and a distribution over this set. The robustness-checking problem consists of assessing, given a set of classifiers and a labelled data set, whether there exists a randomized attack that induces a certain expected loss against all classifiers. We show the NP-hardness of the problem and provide an upper bound on the number of attacks that is sufficient to form an optimal randomized attack. These results provide an effective way to reason about the robustness of a classifier ensemble. We provide SMT and MILP encodings to compute optimal randomized attacks or prove that there is no attack inducing a certain expected loss. In the latter case, the classifier ensemble is provably robust. Our prototype implementation verifies multiple neural-network ensembles trained for image-classification tasks. The experimental results using the MILP encoding are promising both in terms of scalability and the general applicability of our verification procedure.

We study locally differentially private (LDP) bandits learning in this paper. First, we propose simple black-box reduction frameworks that can solve a large family of context-free bandits learning problems with an LDP guarantee. Based on our frameworks, we can improve the previous best results for private bandits learning with one-point feedback, such as private Bandits Convex Optimization, and obtain the first results for Bandits Convex Optimization (BCO) with multi-point feedback under LDP. The LDP guarantee and black-box nature make our frameworks more attractive in real applications compared with previous specifically designed and relatively weaker differentially private (DP) context-free bandits algorithms. Further, we also extend our algorithm to Generalized Linear Bandits with regret bound $\tilde{\mathcal{O}}(T^{3/4}/\varepsilon)$ under $(\varepsilon, \delta)$-LDP, which is conjectured to be optimal. Note that, given the existing $\Omega(T)$ lower bound for DP contextual linear bandits (Shariff & Sheffe, NeurIPS 2018), our result shows a fundamental difference between LDP and DP contextual bandits learning.

This paper investigates reinforcement learning with constraints, which is indispensable in safety-critical environments. To drive the constraint violation to decrease monotonically, the constraints are taken as Lyapunov functions, and new linear constraints are imposed on the updating dynamics of the policy parameters such that the original safety set is forward-invariant in expectation. As the new guaranteed-feasible constraints are imposed on the updating dynamics instead of the original policy parameters, classic optimization algorithms are no longer applicable. To address this, we propose to learn a neural network-based meta-optimizer to optimize the objective while satisfying such linear constraints. The constraint satisfaction is achieved via projection onto a polytope formulated by multiple linear inequality constraints, which can be solved analytically with our newly designed metric. Ultimately, the meta-optimizer trains the policy network to monotonically decrease the constraint violation and maximize the cumulative reward. Numerical results validate the theoretical findings.

Neural Linear Models (NLM) are deep models that produce predictive uncertainty by learning features from the data and then performing Bayesian linear regression over these features. Despite their popularity, few works have focused on formally evaluating the predictive uncertainties of these models. In this work, we show that traditional training procedures for NLMs can drastically underestimate uncertainty in data-scarce regions. We identify the underlying reasons for this behavior and propose a novel training procedure for capturing useful predictive uncertainties.

In this paper, we focus on the task of multi-view multi-source geo-localization, which serves as an important auxiliary method of GPS positioning by matching drone-view images and satellite-view images with pre-annotated GPS tags. To solve this problem, most existing methods adopt a metric loss with a weighted classification block to force the generation of a common feature space shared by different viewpoints and view sources. However, these methods fail to pay sufficient attention to spatial information (especially viewpoint variances). To address this drawback, we propose an elegant orientation-based method to align the patterns and introduce a new branch to extract aligned partial features. Moreover, we provide a style alignment strategy to reduce the variance in image style and enhance the feature unification. To demonstrate the performance of the proposed approach, we conduct extensive experiments on a large-scale benchmark dataset. The experimental results confirm the superiority of the proposed approach compared to state-of-the-art alternatives.

Imitation learning in a high-dimensional environment is challenging. Most inverse reinforcement learning (IRL) methods fail to outperform the demonstrator in such a high-dimensional environment, e.g., Atari domain. To address this challenge, we propose a novel reward learning module to generate intrinsic reward signals via a generative model. Our generative method can perform better forward state transition and backward action encoding, which improves the module's dynamics modeling ability in the environment. Thus, our module provides the imitation agent both the intrinsic intention of the demonstrator and a better exploration ability, which is critical for the agent to outperform the demonstrator. Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games, even with one-life demonstration. Remarkably, our method achieves performance that is up to 5 times the performance of the demonstration.

This study analyzes the usage of Japanese gendered language on Twitter. Starting from a collection of 408 million Japanese tweets from 2015 to 2019 and an additional sample of 2355 Twitter account timelines manually classified by gender and category (politicians, musicians, etc.), a large-scale textual analysis is performed on this corpus to identify and examine sentence-final particles (SFPs) and first-person pronouns appearing in the texts. It turns out that gendered language is in fact also used on Twitter, in about 6% of the tweets, and that the prescriptive classification into "male" and "female" language does not always meet expectations, with remarkable exceptions. Further, SFPs and pronouns show increasing or decreasing trends, indicating an evolution of the language used on Twitter.

The James-Stein estimator is an estimator of the multivariate normal mean and dominates the maximum likelihood estimator (MLE) under squared error loss. The original work inspired great interest in developing shrinkage estimators for a variety of problems. Nonetheless, research on shrinkage estimation for manifold-valued data is scarce. In this paper, we propose shrinkage estimators for the parameters of the Log-Normal distribution defined on the manifold of $N \times N$ symmetric positive-definite matrices. For this manifold, we choose the Log-Euclidean metric as its Riemannian metric since it is easy to compute and is widely used in applications. By using the Log-Euclidean distance in the loss function, we derive a shrinkage estimator in an analytic form and show that it is asymptotically optimal within a large class of estimators including the MLE, which is the sample Fr\'echet mean of the data. We demonstrate the performance of the proposed shrinkage estimator via several simulated data experiments. Furthermore, we apply the shrinkage estimator to perform statistical inference in diffusion magnetic resonance imaging problems.
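
As background for the first sentence of this abstract, here is a minimal sketch of the classical Euclidean James-Stein estimator, not the manifold-valued estimator the paper proposes. It assumes a single observation $x \sim N(\theta, \sigma^2 I_p)$ with known $\sigma^2$ and shrinkage toward the origin. The simulation helper and its parameters are illustrative choices.

```python
import random

def james_stein(x, sigma2=1.0):
    """Classical James-Stein estimate of the mean from one observation x.

    Shrinks x toward the origin by the factor 1 - (p - 2) * sigma2 / ||x||^2.
    This is the plain (not positive-part) version and requires p >= 3.
    """
    p = len(x)
    if p < 3:
        raise ValueError("James-Stein shrinkage needs dimension p >= 3")
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in x]

def simulated_risks(p=10, trials=2000, seed=0):
    """Monte Carlo squared-error risk of the MLE (x itself) and of James-Stein.

    The true mean is fixed at the origin, where shrinkage helps most.
    """
    rng = random.Random(seed)
    theta = [0.0] * p
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        js = james_stein(x)
        mle_loss += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_loss += sum((a - t) ** 2 for a, t in zip(js, theta))
    return mle_loss / trials, js_loss / trials

mle_risk, js_risk = simulated_risks()
```

With the true mean at the origin, the simulated James-Stein risk comes out well below the MLE risk, whose theoretical value is $p$. The gap narrows as the true mean moves away from the shrinkage target.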

In many applications, data is easy to acquire but expensive and time-consuming to label; prominent examples include medical imaging and NLP. This disparity has only grown in recent years as our ability to collect data improves. Under these constraints, it makes sense to select only the most informative instances from the unlabeled pool and request an oracle (e.g., a human expert) to provide labels for those samples. The goal of active learning is to infer the informativeness of unlabeled samples so as to minimize the number of requests to the oracle. Here, we formulate active learning as an open-set recognition problem. In this latter paradigm, only some of the inputs belong to known classes; the classifier must identify the rest as unknown. More specifically, we leverage variational neural networks (VNNs), which produce high-confidence (i.e., low-entropy) predictions only for inputs that closely resemble the training data. We use the inverse of this confidence measure to select the samples that the oracle should label. Intuitively, unlabeled samples that the VNN is uncertain about are more informative for future training. We carried out an extensive evaluation of our novel, probabilistic formulation of active learning, achieving state-of-the-art results on CIFAR-10 and CIFAR-100. In addition, unlike current active learning methods, our algorithm can learn tasks with non-i.i.d. distributions, without the need for task labels. As our experiments show, when the unlabeled pool consists of a mixture of samples from multiple tasks, our approach can automatically distinguish between samples from seen vs. unseen tasks.
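
The selection rule described here (query the samples the model is least confident about) can be sketched with predictive entropy as the uncertainty score. This is a generic illustration, not the paper's VNN-based method: it assumes class probabilities are already available for each unlabeled sample, and `select_for_labeling` is a hypothetical helper name.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled samples.

    pool_probs: one list of class probabilities per unlabeled sample
    """
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Toy pool: sample 1 is nearly uniform (most uncertain), sample 0 is confident.
pool = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10],
]
queries = select_for_labeling(pool, budget=2)  # indices to send to the oracle
```

For this toy pool the rule queries samples 1 and 2 and skips the confidently predicted sample 0.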

Meta-learning has proven to be successful at few-shot learning across the regression, classification and reinforcement learning paradigms. Recent approaches have adopted Bayesian interpretations to improve gradient based meta-learners by quantifying the uncertainty of the post-adaptation estimates. Most of these works almost completely ignore the latent relationship between the covariate distribution (p(x)) of a task and the corresponding conditional distribution p(y|x). In this paper, we identify the need to explicitly model the meta-distribution over the task covariates in a hierarchical Bayesian framework. We begin by introducing a graphical model that explicitly leverages very few samples drawn from p(x) to better infer the posterior over the optimal parameters of the conditional distribution (p(y|x)) for each task. Based on this model we provide an inference strategy and a corresponding meta-algorithm that explicitly accounts for the meta-distribution over task covariates. Finally, we demonstrate the significant gains of our proposed algorithm on a synthetic regression dataset.

Bayesian methods have proved powerful in many applications for the inference of model parameters from data. These methods are based on Bayes' theorem, which itself is deceptively simple. However, in practice the computations required are intractable even for simple cases. Hence methods for Bayesian inference have historically either been significantly approximate, e.g., the Laplace approximation, or achieved samples from the exact solution at significant computational expense, e.g., Markov Chain Monte Carlo methods. Since around the year 2000, so-called Variational approaches to Bayesian inference have been increasingly deployed. In its most general form, Variational Bayes (VB) involves approximating the true posterior probability distribution via another more 'manageable' distribution, the aim being to achieve as good an approximation as possible. In the original FMRIB Variational Bayes tutorial we documented an approach to VB that took a 'mean field' approach to forming the approximate posterior, required the conjugacy of prior and likelihood, and exploited the Calculus of Variations to derive an iterative series of update equations, akin to Expectation Maximisation. In this tutorial we revisit VB, but now take a stochastic approach to the problem that potentially circumvents some of the limitations imposed by the earlier methodology. This new approach bears a lot of similarity to, and has benefited from, computational methods applied to machine learning algorithms. What we document here, however, is still recognisably Bayesian inference in the classic sense, and not an attempt to use machine learning as a black box to solve the inference problem.

Providing a small set of promising candidates in place of a single prediction is well-suited for many open-ended classification tasks. Conformal Prediction (CP) is a technique for creating classifiers that produce a valid set of predictions that contains the true answer with arbitrarily high probability. In practice, however, standard CP can suffer from both low predictive and computational efficiency during inference---i.e., the predicted set is both unusably large, and costly to obtain. This is particularly pervasive in the considered setting, where the correct answer is not unique and the number of total possible answers is high. In this work, we develop two simple and complementary techniques for improving both types of efficiency. First, we relax CP validity to arbitrary criteria of success---allowing our framework to make more efficient predictions while remaining "equivalently correct." Second, we amortize cost by conformalizing prediction cascades, in which we aggressively prune implausible labels early on by using progressively stronger classifiers---while still guaranteeing marginal coverage. We demonstrate the empirical effectiveness of our approach for multiple applications in natural language processing and computational chemistry for drug discovery.
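
For readers unfamiliar with the baseline, the following is a minimal sketch of standard split conformal prediction for classification. It does not implement the relaxed validity criteria or the cascades proposed in this paper. The nonconformity score (one minus the predicted probability of a label) and the function names are assumptions made for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Calibrated score threshold for target miscoverage level alpha.

    cal_scores: nonconformity scores of the true labels on a held-out
                calibration set (here: 1 - predicted probability)
    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; if that rank
    exceeds n, every label must be kept, so the threshold is infinite.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score does not exceed the threshold.

    probs: dict mapping label -> predicted probability for one test input
    """
    return [label for label, p in probs.items() if 1.0 - p <= threshold]

# Toy calibration scores and one test-time prediction.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_threshold(scores, alpha=0.5)
labels = prediction_set({"a": 0.6, "b": 0.3, "c": 0.1}, q)
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least $1 - \alpha$ marginally. The cost is that small $\alpha$ or a weak classifier yields large sets, which is the inefficiency the abstract addresses.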

Intermediate-Severity (IS) faults present milder symptoms compared to severe faults, and are more difficult to detect and diagnose due to their close resemblance to normal operating conditions. The lack of IS fault examples in the training data can pose severe risks to Fault Detection and Diagnosis (FDD) methods that are built upon Machine Learning (ML) techniques, because these faults can be easily mistaken for normal operating conditions. Ensemble models are widely applied in ML and are considered promising methods for detecting out-of-distribution (OOD) data. We identify common pitfalls in these models through extensive experiments with several popular ensemble models on two real-world datasets. Then, we discuss how to design more effective ensemble models for detecting and diagnosing IS faults.

Deep convolutional networks (DCNs) learn meaningful representations where data that share the same abstract characteristics are positioned closer and closer. Understanding these representations and how they are generated is of unquestioned practical and theoretical interest. In this work we study the evolution of the probability density of the ImageNet dataset across the hidden layers in some state-of-the-art DCNs. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant for classification. In subsequent layers density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. Density peaks corresponding to single categories appear only close to the output and via a very sharp transition which resembles the nucleation process of a heterogeneous liquid. This process leaves a footprint in the probability density of the output layer where the topography of the peaks allows reconstructing the semantic relationships of the categories.

In this manuscript, we propose a federated F-score based ensemble tree model for automatic rule extraction, namely Fed-FEARE. Under the premise of data privacy protection, Fed-FEARE enables multiple agencies to jointly extract sets of rules both vertically and horizontally. Compared with the same model without federated learning, measures of model performance are substantially improved. At present, Fed-FEARE has already been applied to multiple businesses, including anti-fraud and precision marketing, in a nationwide financial holdings group in China.

This paper concerns the problem of 1-bit compressed sensing, where the goal is to estimate a sparse signal from a few of its binary measurements. We study a non-convex sparsity-constrained program and present a novel and concise analysis that moves away from the widely used notion of Gaussian width. We show that with high probability a simple algorithm is guaranteed to produce an accurate approximation to the normalized signal of interest under the $\ell_2$-metric. On top of that, we establish an ensemble of new results that address norm estimation, support recovery, and model misspecification. On the computational side, it is shown that the non-convex program can be solved via one-step hard thresholding which is dramatically efficient in terms of time complexity and memory footprint. On the statistical side, it is shown that our estimator enjoys a near-optimal error rate under standard conditions. The theoretical results are substantiated by numerical experiments.
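
The one-step hard thresholding idea can be sketched as follows: average the sign-weighted measurement vectors, keep the $k$ largest-magnitude entries, and normalize. The sketch assumes i.i.d. standard Gaussian measurement vectors, noiseless signs, and a known sparsity level `k`. The function names and the simulation parameters are illustrative and not taken from the paper.

```python
import math
import random

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out

def one_bit_estimate(A, y, k):
    """One-step estimate of the signal direction from sign measurements.

    A: m measurement vectors of length n (rows), y: m signs in {-1, +1}
    """
    m, n = len(A), len(A[0])
    # Proxy (1/m) * A^T y; for Gaussian rows it concentrates around a
    # positive multiple of the normalized signal.
    proxy = [sum(y[i] * A[i][j] for i in range(m)) / m for j in range(n)]
    sparse = hard_threshold(proxy, k)
    norm = math.sqrt(sum(v * v for v in sparse))
    return [v / norm for v in sparse] if norm > 0 else sparse

def simulate(m=2000, n=50, k=3, seed=0):
    """Recover a unit-norm k-sparse signal from m one-bit measurements."""
    rng = random.Random(seed)
    x = [1.0 / math.sqrt(k)] * k + [0.0] * (n - k)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    y = [1.0 if sum(a * b for a, b in zip(row, x)) >= 0 else -1.0 for row in A]
    x_hat = one_bit_estimate(A, y, k)
    cosine = sum(a * b for a, b in zip(x, x_hat))
    return x, x_hat, cosine
```

Because sign measurements discard the signal's scale, only the direction is recovered here, which is why the estimate is compared with the normalized signal through their cosine similarity.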

Electricity price forecasting is an essential task for all the deregulated markets of the world. The accurate prediction of day-ahead electricity prices is an active research field, and available data from various markets can be used as an input for forecasting. A collection of models has been proposed for this task, but the fundamental question of how to use the available big data is often neglected. In this paper, we propose to use transfer learning as a tool for utilizing information from other electricity price markets for forecasting. We pre-train a bidirectional Gated Recurrent Units (BGRU) network on source markets and then fine-tune it for the target market. Moreover, we test different ways to use the input data from various markets in the models. Our experiments on five different day-ahead markets indicate that transfer learning improves the performance of electricity price forecasting in a statistically significant manner.

Conventional bidding strategies for online display ad auctions rely heavily on observed performance indicators such as clicks or conversions. A bidding strategy naively pursuing these easily observable metrics, however, fails to optimize the profitability of the advertisers. Rather, the bidding strategy that leads to the maximum revenue is a strategy pursuing the performance lift of showing ads to a specific user. Therefore, it is essential to predict the lift-effect of showing ads to each user on their target variables from observed log data. However, predicting the lift-effect is difficult, as the training data gathered by a past bidding strategy may have a strong bias towards the winning impressions. In this study, we develop the Unbiased Lift-based Bidding System, which maximizes the advertisers' profit by accurately predicting the lift-effect from biased log data. Our system is the first to enable a high-performing lift-based bidding strategy by theoretically alleviating the inherent bias in the log. Real-world, large-scale A/B testing successfully demonstrates the superiority and practicability of the proposed system.

We present a systematic investigation using graph neural networks (GNNs) to model organic chemical reactions. To do so, we prepared a dataset collection of four ubiquitous reactions from the organic chemistry literature. We evaluate seven different GNN architectures for classification tasks pertaining to the identification of experimental reagents and conditions. We find that models are able to identify specific graph features that affect reaction conditions and lead to accurate predictions. The results herein show great promise in advancing molecular machine learning.