## Statistics (stat) updates on the arXiv.org e-print archive



In this work we study a variant of the well-known multi-armed bandit (MAB) problem, which has the properties of a delay in feedback and a loss that declines over time. We introduce an algorithm, EXP4-DFDC, to solve this MAB variant, and demonstrate that the regret vanishes as time increases. We also show that LeCaR, a previously published machine-learning-based cache replacement algorithm, is an instance of EXP4-DFDC. Our results can be used to provide insight into the choice of hyperparameters and to optimize future LeCaR instances.

Machine learning is rapidly becoming one of the most important technologies for malware traffic detection, since the continuous evolution of malware requires constant adaptation and the ability to generalize. However, network traffic datasets are usually oversized and contain redundant and irrelevant information, which may dramatically increase the computational cost and decrease the accuracy of most classifiers, with the risk of introducing further noise.

We propose two novel dataset optimization strategies which exploit and combine several state-of-the-art approaches in order to achieve an effective optimization of the network traffic datasets used to train malware detectors. The first is a feature selection technique based on mutual information measures and sensitivity enhancement. The second is a dimensionality reduction technique based on autoencoders. Both approaches were experimentally applied to the MTA-KDD'19 dataset, and the optimized results were evaluated and compared using a Multi-Layer Perceptron as the machine learning model for malware detection.
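As an illustration of the mutual-information side of the first strategy (the exact scoring and the sensitivity-enhancement step are specific to the paper; this is a minimal filter-style sketch with names of our own choosing):

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information between two discrete arrays (in nats)."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def select_features(X, y, k):
    """Keep the k columns of X with the highest MI against the labels y."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Toy check: feature 0 copies the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([y, rng.integers(0, 2, 500)])
print(select_features(X, y, 1))  # -> [0]
```

A real pipeline would estimate MI on discretized traffic features and combine it with the paper's additional scoring.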

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\epsilon$-optimal policy, i.e., with resulting objective value within $\epsilon$ of the optimal value and satisfying the constraints within $\epsilon$-tolerance, with probability at least $1-\delta$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\epsilon^{2}}\log\frac{1}{\delta}\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and a quadratic dependence on the time horizon $H$.

The present paper is devoted to clustering geometric graphs. While standard spectral clustering is often not effective for geometric graphs, we present an effective generalization, which we call higher-order spectral clustering. It is similar in concept to the classical spectral clustering method but partitions using the eigenvector associated with a higher-order eigenvalue. We establish the weak consistency of this algorithm for a wide class of geometric graphs which we call the Soft Geometric Block Model. A small adjustment of the algorithm provides strong consistency. We also show that our method is effective in numerical experiments even for graphs of modest size.
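The core mechanic — partition by the sign pattern of a chosen Laplacian eigenvector — can be sketched as follows; the function and the toy graph are illustrative, not the paper's algorithm for geometric graphs, which involves additional steps to pick the right order:

```python
import numpy as np

def spectral_partition(A, order=2):
    """Bipartition a graph by the signs of the eigenvector of the normalized
    Laplacian associated with the `order`-th smallest eigenvalue."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - Dinv @ A @ Dinv
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return (vecs[:, order - 1] >= 0).astype(int)

# Two 6-cliques joined by a single edge; order=2 (classical choice) recovers them.
n = 6
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1
A[n:, n:] = 1
np.fill_diagonal(A, 0)
A[0, n] = A[n, 0] = 1
labels = spectral_partition(A, order=2)
print(labels)
```

For the Soft Geometric Block Model the paper's point is precisely that an `order` larger than 2 may be the informative one.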

Learning low-dimensional representations for entities and relations in knowledge graphs using contrastive estimation represents a scalable and effective method for inferring connectivity patterns. A crucial aspect of contrastive learning approaches is the choice of corruption distribution that generates hard negative samples, which force the embedding model to learn discriminative representations and find critical characteristics of observed data. While earlier methods either employ overly simple corruption distributions (e.g. uniform), yielding easy and uninformative negatives, or sophisticated adversarial distributions with challenging optimization schemes, they do not explicitly incorporate known graph structure, resulting in suboptimal negatives. In this paper, we propose Structure Aware Negative Sampling (SANS), an inexpensive negative sampling strategy that utilizes the rich graph structure by selecting negative samples from a node's k-hop neighborhood. Empirically, we demonstrate that SANS finds high-quality negatives that are highly competitive with SOTA methods, and requires no additional parameters nor difficult adversarial optimization.
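The sampling step can be sketched as below; the adjacency-list format and helper names are illustrative, not the authors' code:

```python
import numpy as np

def k_hop_neighborhood(adj, node, k):
    """Nodes reachable from `node` within k hops (BFS), excluding the node itself."""
    frontier, seen = {node}, {node}
    for _ in range(k):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen - {node}

def sans_negative(adj, head, tail, k, rng):
    """Corrupt the tail of a triple with a node drawn from the head's k-hop
    neighborhood (the true tail is excluded), per structure-aware sampling."""
    candidates = list(k_hop_neighborhood(adj, head, k) - {tail})
    return rng.choice(candidates) if candidates else None

# Toy path graph 0-1-2-3: from head 0 within 2 hops, only node 2 remains
# once the true tail 1 is excluded.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = np.random.default_rng(0)
neg = sans_negative(adj, head=0, tail=1, k=2, rng=rng)
print(neg)  # -> 2
```

Negatives drawn this way share context with the head entity, which is what makes them "hard" compared with uniform corruption.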

The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt the IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to get tight upper bounds of convergence rates. Using this framework, we recover the existing bound for the gradient method~(GD), derive sharper bounds for the proximal point method~(PPM) and optimistic gradient method~(OG), and provide \emph{for the first time} a global convergence rate for the negative momentum method~(NM) with an iteration complexity $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.
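The linear convergence of the gradient method on a smooth, strongly-monotone game can be illustrated on a toy quadratic min-max problem; the step size below comes from a standard spectral argument on this specific game, not from the paper's IQC analysis:

```python
import numpy as np

# Bilinearly coupled strongly-monotone game:
#   min_x max_y  (mu/2) x^2 + gamma * x * y - (mu/2) y^2
# Its vector field F(z) = A z (gradient in x, minus gradient in y)
# is mu-strongly monotone.
mu, gamma = 1.0, 2.0
A = np.array([[mu, gamma],
              [-gamma, mu]])

def gradient_method(z0, eta, steps):
    """Simultaneous gradient descent-ascent: z <- z - eta * F(z)."""
    z = z0.copy()
    for _ in range(steps):
        z = z - eta * (A @ z)
    return z

z0 = np.array([1.0, -1.0])
eta = mu / (mu**2 + gamma**2)   # step size minimizing the spectral radius of I - eta*A
z = gradient_method(z0, eta, 200)
print(np.linalg.norm(z))        # close to 0: linear convergence to the equilibrium
```

The eigenvalues of `A` are $\mu \pm i\gamma$, so the contraction factor per step is $\sqrt{(1-\eta\mu)^2 + \eta^2\gamma^2}$, which is strictly below 1 for this step size.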

The problem of monotone missing data has been broadly studied during the last two decades and has many applications in fields such as bioinformatics and statistics. Commonly used imputation techniques require multiple iterations through the data before convergence. Moreover, those approaches may introduce extra noise and bias into the subsequent modeling. In this work, we derive exact formulas and propose a novel algorithm, EPEM, to compute the maximum likelihood estimators (MLEs) for a multiple-class, monotone missing dataset when the covariance matrices of all classes are assumed to be equal. We then illustrate an application of our proposed method in Linear Discriminant Analysis (LDA). As the computation is exact, our EPEM algorithm does not require multiple iterations through the data as imputation approaches do, and thus promises to be much less time-consuming than other methods. This effectiveness was validated by empirical results: EPEM reduced error rates significantly and required a short computation time compared to several imputation-based approaches. We also release all code and data from our experiments in a GitHub repository to contribute to the research community working on this problem.

Deep neural networks (DNNs) have proven to be powerful tools for processing unstructured data. However, for high-dimensional data like images, they are inherently vulnerable to adversarial attacks: small, almost invisible perturbations added to the input can be used to fool DNNs. Various attacks, hardening methods and detection methods have been introduced in recent years. Notoriously, Carlini-Wagner (CW) type attacks computed by iterative minimization belong to those that are most difficult to detect. In this work, we demonstrate that such iterative minimization attacks can be used as detectors themselves. Thus, in some sense we show that one can fight fire with fire. This work also outlines a mathematical proof that under certain assumptions this detector provides asymptotically optimal separation of original and attacked images. In numerical experiments, we obtain AUROC values of up to 99.73% for our detection method. This distinctly surpasses state-of-the-art detection rates for CW attacks from the literature. We also give numerical evidence that our method is robust against the attacker's choice of the method of attack.

This article proposes a novel Bayesian classification framework for networks with labeled nodes. While the literature on statistical modeling of network data typically involves analysis of a single network, the recent emergence of complex data in several biological applications, including brain imaging studies, presents a need to devise a network classifier for subjects. This article considers an application from a brain connectome study, where the overarching goal is to classify subjects into two separate groups based on their brain network data, along with identifying influential regions of interest (ROIs) (referred to as nodes). Existing approaches either treat all edge weights as a long vector or summarize the network information with a few summary measures. Both these approaches ignore the full network structure, may lead to less desirable inference in small samples, and are not designed to identify significant network nodes. We propose a novel binary logistic regression framework with the network as the predictor and a binary response, the network predictor coefficients being modeled using a novel class of global-local shrinkage priors. The framework is able to accurately detect nodes and edges in the network influencing the classification. Our framework is implemented using an efficient Markov Chain Monte Carlo algorithm. Theoretically, we show asymptotically optimal classification for the proposed framework when the number of network edges grows faster than the sample size. The framework is empirically validated by extensive simulation studies and analysis of brain connectome data.

In this work, we develop a novel fairness learning approach for multi-task regression models based on a biased training dataset, using a popular rank-based non-parametric independence test, the Mann-Whitney U statistic, to measure the dependency between the target variable and protected variables. To solve this learning problem efficiently, we first reformulate it as a new non-convex optimization problem, in which a non-convex constraint is defined based on group-wise ranking functions of individual objects. We then develop an efficient model-training algorithm based on the framework of the non-convex alternating direction method of multipliers (NC-ADMM), in which one of the main challenges is to implement an efficient projection oracle onto the preceding non-convex set defined by the ranking functions. Through extensive experiments on both synthetic and real-world datasets, we validate the superior performance of our new approach against several state-of-the-art competing methods on several popular metrics relevant to fairness learning.
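The dependence measure at the heart of the constraint is the classical Mann-Whitney U statistic, which can be computed directly; this is a sketch of the statistic itself, not the authors' group-wise ranking-function formulation:

```python
import numpy as np

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic: the number of pairs (x_i, y_j) with x_i > y_j,
    counting ties as 1/2. Values far from len(x)*len(y)/2 indicate dependence
    between group membership and the variable's distribution."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return greater + 0.5 * ties

x = [3, 5, 7]
y = [1, 2, 6]
print(mann_whitney_u(x, y))  # 3>1, 3>2, 5>1, 5>2, 7>1, 7>2, 7>6 -> 7.0
```

In the fairness setting, `x` and `y` would be model outputs for the protected and unprotected groups respectively.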

We study fairness in supervised few-shot meta-learning models that are sensitive to discrimination (or bias) in historical data. A machine learning model trained on biased data tends to make unfair predictions for users from minority groups. Although this problem has been studied before, existing methods mainly aim to detect and control the dependency effect of the protected variables (e.g. race, gender) on target prediction based on a large amount of training data. These approaches carry two major drawbacks: (1) they lack a global cause-effect visualization for all variables; and (2) they do not generalize both accuracy and fairness to unseen tasks. In this work, we first discover discrimination from data using a causal Bayesian knowledge graph, which not only demonstrates the dependency between the protected variables and the target but also indicates causal effects between all variables. Next, we develop a novel algorithm based on risk difference in order to quantify the discriminatory influence of each protected variable in the graph. Furthermore, to protect predictions from unfairness, a fast-adapting bias-control approach for meta-learning is proposed, which efficiently mitigates statistical disparity for each task and thus ensures the independence of predictions from protected attributes based on biased and few-shot data samples. Distinct from existing meta-learning models, the group unfairness of tasks is efficiently reduced by leveraging the mean difference between (un)protected groups for regression problems. Through extensive experiments on both synthetic and real-world data sets, we demonstrate that our proposed unfairness discovery and prevention approaches efficiently detect discrimination and mitigate biases in model output, as well as generalizing both accuracy and fairness to unseen tasks with a limited amount of training samples.
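In its simplest binary form, the risk-difference quantity used to measure discriminatory influence is a statistical-parity gap; a minimal sketch (the function name is ours):

```python
import numpy as np

def risk_difference(y_pred, protected):
    """Statistical-parity gap: P(yhat = 1 | protected) - P(yhat = 1 | unprotected).
    A value of 0 means the positive-prediction rate is identical across groups."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected, dtype=bool)
    return y_pred[protected].mean() - y_pred[~protected].mean()

# Toy predictions: the protected group receives far fewer positive decisions.
y_pred    = np.array([1, 0, 0, 0, 1, 1, 1, 1])
protected = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(risk_difference(y_pred, protected))  # 0.25 - 1.0 = -0.75
```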

Forecasting influenza in a timely manner aids health organizations and policymakers in adequate preparation and decision making. However, effective influenza forecasting still remains a challenge despite increasing research interest. It is even more challenging amidst the COVID pandemic, when influenza-like illness (ILI) counts are affected by various factors such as symptomatic similarities with COVID-19 and shifts in the healthcare-seeking patterns of the general population. We term the ILI values observed during this potentially affected period COVID-ILI. Under the current pandemic, historical influenza models carry valuable expertise about the disease dynamics but face difficulties adapting. Therefore, we propose CALI-NET, a neural transfer learning architecture which allows us to 'steer' a historical disease forecasting model to new scenarios where flu and COVID co-exist. Our framework enables this adaptation by automatically learning when it should emphasize learning from COVID-related signals and when from the historical model. In this way, we exploit representations learned from historical ILI data as well as the limited COVID-related signals. Our experiments demonstrate that our approach is successful in adapting a historical forecasting model to the current pandemic. In addition, we show that success in our primary goal, adaptation, does not sacrifice overall performance as compared with state-of-the-art influenza forecasting approaches.

We consider Bayesian high-dimensional mediation analysis to identify, among a large set of correlated potential mediators, the active ones that mediate the effect from an exposure variable to an outcome of interest. Correlations among mediators are commonly observed in modern data analysis; examples include the activated voxels within connected regions in brain image data, regulatory signals driven by gene networks in genome data, and correlated exposure data from the same source. When correlations are present among active mediators, mediation analysis that fails to account for such correlation can be sub-optimal and may lead to a loss of power in identifying active mediators. Building upon a recent high-dimensional mediation analysis framework, we propose two Bayesian hierarchical models, one with a Gaussian mixture prior that enables correlated mediator selection and the other with a Potts mixture prior that accounts for the correlation among active mediators in mediation analysis. We develop efficient sampling algorithms for both methods. Various simulations demonstrate that our methods enable effective identification of correlated active mediators, which could be missed by existing methods that assume prior independence among active mediators. The proposed methods are applied to the LIFECODES birth cohort and the Multi-Ethnic Study of Atherosclerosis (MESA), where they identify new active mediators with important biological implications.

We present an elementary mathematical method to find the minimax estimator of the Bernoulli proportion $\theta$ under the squared error loss when $\theta$ belongs to the restricted parameter space of the form $\Omega = [0, \eta]$ for some pre-specified constant $0 \leq \eta \leq 1$. This problem is inspired by the problem of estimating the rate of positive COVID-19 tests. The presented results and applications would be useful material for both instructors and students when teaching point estimation in statistical or machine learning courses.
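For the unrestricted case $\eta = 1$, the classical minimax estimator under squared error is the Bayes rule under a Beta$(\sqrt{n}/2, \sqrt{n}/2)$ prior, whose risk is constant in $\theta$ — a fact that can be checked numerically (the restricted case $\eta < 1$ treated in the paper is more delicate):

```python
import numpy as np
from math import comb, sqrt

n = 25
c = sqrt(n) / 2
estimator = lambda x: (x + c) / (n + 2 * c)   # Bayes rule under Beta(c, c)

def risk(theta):
    """Exact squared-error risk E[(theta_hat - theta)^2] under Binomial(n, theta)."""
    return sum(comb(n, x) * theta**x * (1 - theta)**(n - x)
               * (estimator(x) - theta)**2 for x in range(n + 1))

risks = [risk(t) for t in np.linspace(0.05, 0.95, 10)]
print(max(risks) - min(risks))        # ~0: the risk is constant in theta
print(n / (4 * (n + sqrt(n))**2))     # closed-form value of that constant risk
```

Constant risk plus being Bayes for a proper prior is what certifies minimaxity in this classical case.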

The success of deep learning relies on the availability of large-scale annotated data sets, the acquisition of which can be costly, requiring expert domain knowledge. Semi-supervised learning (SSL) mitigates this challenge by exploiting the behavior of the neural function on large unlabeled data. The smoothness of the neural function is a commonly used assumption exploited in SSL. A successful example is the adoption of the mixup strategy in SSL, which enforces the global smoothness of the neural function by encouraging it to behave linearly when interpolating between training examples. Despite its empirical success, however, the theoretical underpinning of how mixup regularizes the neural function has not been fully understood. In this paper, we offer a theoretically substantiated proposition that mixup improves the smoothness of the neural function by bounding the Lipschitz constant of the gradient function of the neural networks. We then propose that this can be strengthened by simultaneously constraining the Lipschitz constant of the neural function itself through adversarial Lipschitz regularization, encouraging the neural function to behave linearly while also constraining the slope of this linear function. On three benchmark data sets and one real-world biomedical data set, we demonstrate that this combined regularization results in improved generalization performance of SSL when learning from a small amount of labeled data. We further demonstrate the robustness of the presented method against single-step adversarial attacks. Our code is available at https://github.com/Prasanna1991/Mixup-LR.
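The mixup operation itself is a one-line convex combination of inputs and (one-hot) labels; a minimal sketch following the standard mixup recipe, with the Beta mixing distribution and `alpha` hyperparameter as usual:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot labels, with the
    mixing weight drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, y1 = np.ones(4), np.array([1.0, 0.0])
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
x, y = mixup(x1, y1, x2, y2, rng=rng)
print(x, y)  # x lies on the segment between x1 and x2; y sums to 1
```

Training on such interpolated pairs is what encourages the linear-in-between behavior whose smoothness effect the paper analyzes.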

Estimation of density functions supported on general domains arises when the data is naturally restricted to a proper subset of the real space. This problem is complicated by typically intractable normalizing constants. Score matching provides a powerful tool for estimating densities with such intractable normalizing constants, but as originally proposed is limited to densities on $\mathbb{R}^m$ and $\mathbb{R}_+^m$. In this paper, we offer a natural generalization of score matching that accommodates densities supported on a very general class of domains. We apply the framework to truncated graphical and pairwise interaction models, and provide theoretical guarantees for the resulting estimators. We also generalize a recently proposed method from bounded to unbounded domains, and empirically demonstrate the advantages of our method.

Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to the recovered functions. To ensure the compatibility between the two steps such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible, we propose to use wavelet soft-thresholding for pre-smoothing and Besov-norm-induced kernels for HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those identified by existing methods.
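A biased empirical HSIC estimator is a short computation on centered Gram matrices. The paper uses Besov-norm-induced kernels on pre-smoothed functions; this sketch uses plain Gaussian kernels on scalar samples purely to illustrate the statistic:

```python
import numpy as np

def gram_rbf(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for a 1-D sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n-1)^2, H the centering matrix."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gram_rbf(x, sigma), gram_rbf(y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x
print(hsic(x, y_dep), hsic(x, y_ind))    # the first is clearly larger
```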

Graph convolutional networks (GCNs) have achieved promising performance on various graph-based tasks. However, they suffer from over-smoothing when stacking more layers. In this paper, we present a quantitative study of this observation and develop novel insights towards deeper GCNs. First, we interpret the current graph convolutional operations from an optimization perspective and argue that over-smoothing is mainly caused by the naive first-order approximation of the solution to the optimization problem. Subsequently, we introduce two metrics to measure over-smoothing on node-level tasks: the ratios of the pairwise distance between connected nodes and between disconnected nodes, respectively, to the overall pairwise distance. Based on our theoretical and empirical analysis, we establish a universal theoretical framework for GCNs from an optimization perspective and derive a novel convolutional kernel named GCN+, which has fewer parameters while inherently relieving over-smoothing. Extensive experiments on real-world datasets demonstrate the superior performance of GCN+ over state-of-the-art baseline methods on node classification tasks.
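Those distance-ratio metrics can be sketched as follows; this is our own minimal reading of them, not the authors' exact definition:

```python
import numpy as np

def smoothness_ratios(H, A):
    """Mean pairwise feature distance over connected and over disconnected node
    pairs, each divided by the overall mean pairwise distance. Under
    over-smoothing, both ratios collapse toward each other."""
    n = len(H)
    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    iu = np.triu_indices(n, k=1)           # each unordered pair once
    d_all = D[iu].mean()
    conn = A[iu] > 0
    return D[iu][conn].mean() / d_all, D[iu][~conn].mean() / d_all

# Features where neighbors are similar: connected ratio << disconnected ratio.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
H = np.array([[0.0], [0.1], [5.0], [5.1]])
r_conn, r_disc = smoothness_ratios(H, A)
print(r_conn, r_disc)
```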

We propose two new criteria to understand the advantage of deepening neural networks. To understand this advantage, it is important to know the expressivity of the functions computable by deep neural networks: unless deep neural networks have enough expressivity, they cannot perform well even when learning is successful. In this situation, the proposed criteria contribute to understanding the advantage of deepening neural networks, since they evaluate expressivity independently from the efficiency of learning. The first criterion captures the approximation accuracy of deep neural networks to the target function, reflecting that the goal of deep learning is to approximate the target function by deep neural networks. The second criterion captures a property of the linear regions of functions computable by deep neural networks, reflecting that deep neural networks whose activation functions are piecewise linear are themselves piecewise linear. Furthermore, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units per layer in improving the expressivity of deep neural networks.

We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) is a framework for grouped and generalised multiple-degree-of-freedom structures, which provides a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete data setting to efficiently handle the presence of missing values in the observation vector. We discuss various special cases of the algorithm that result from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a data set that consists of the cryptocurrencies with the highest market capitalisation.

Deep neural networks (DNNs) have demonstrated excellent performance on various tasks; however, they are at risk from adversarial examples that can be easily generated when the target model is accessible to an attacker (the white-box setting). As plenty of machine learning models are deployed via online services that only provide query outputs from inaccessible models (e.g. the Google Cloud Vision API), black-box adversarial attacks (where the target model is inaccessible) are a more critical security concern in practice than white-box ones. However, existing query-based black-box adversarial attacks often require excessive model queries to maintain a high attack success rate. Therefore, in order to improve query efficiency, we explore the distribution of adversarial examples around benign inputs with the help of image structure information characterized by a Neural Process, and propose a Neural Process based black-box adversarial attack (NP-Attack) in this paper. Extensive experiments show that NP-Attack can greatly decrease the query counts under the black-box setting.

Information networks are ubiquitous and are ideal for modeling relational data. Since networks are sparse and irregular, network embedding algorithms have caught the attention of many researchers, who have come up with numerous embedding algorithms for static networks. Yet in real life, networks constantly evolve over time. Hence, evolutionary patterns, namely how nodes develop over time, serve as a powerful complement to static structures in embedding networks, yet relatively few works focus on them. In this paper, we propose EPNE, a temporal network embedding model preserving evolutionary patterns of the local structure of nodes. In particular, we analyze evolutionary patterns with and without periodicity and correspondingly design strategies to model such patterns in the time-frequency domain based on causal convolutions. In addition, we propose a temporal objective function which is optimized simultaneously with proximity-based ones such that both temporal and structural information are preserved. With this adequate modeling of temporal information, our model is able to outperform other competitive methods in various prediction tasks.

The detection of groups of molecules that co-localize with histopathological patterns or sub-structures is an important step in combining the rich high-dimensional content of mass spectrometry imaging (MSI) with classic histopathological staining. Here we present the evolution of GRINE into COBI-GRINE, an interactive web tool that maps MSI data onto a graph structure to detect communities of laterally similarly distributed molecules and co-visualizes the communities with Hematoxylin and Eosin (HE) stained images. Thereby, the tool enables biologists and pathologists to examine the MSI image graph in a target-oriented manner and links molecular co-localization to pathology. Another feature is the manual optimization of cluster results with the assistance of graph statistics, in order to improve the community results. As the graphs can become very complex, those statistics provide good heuristics to support and accelerate the detection of sub-clusters and misclusterings. This kind of edited cluster optimization allows the integration of expert background knowledge into the clustering result and a more precise analysis of the links between molecular co-localization and pathology.

Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operations and advanced information systems. A variety of solutions exist for short-term travel time predictions, such as those that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. In this paper we show the applicability and usefulness of travel time, i.e. delivery time, prediction for postal services. We investigate several methods, such as linear regression models and tree-based ensembles (random forest, bagging, and boosting) that allow delivery times to be predicted, conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and CatBoost, achieve higher accuracy and runtime efficiency than baselines such as linear regression models, the bagging regressor and random forest.
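The idea behind the best-performing models is gradient boosting: repeatedly fitting weak learners to the residuals of the current prediction. A self-contained toy version with decision stumps on synthetic "delivery time vs distance" data (real experiments would use libraries such as LightGBM or CatBoost):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump (threshold + two leaf means) for residuals r."""
    best = (np.inf, None)
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    return best[1]

def boost(x, y, rounds=50, lr=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        t, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, lv, rv)
    return pred

# Synthetic data with a nonlinear jump that a linear model would miss.
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 1.0, 3.0) + 0.2 * x
pred = boost(x, y)
print(np.mean((y - pred) ** 2))  # small training error
```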

Data clustering with uneven distribution under high levels of noise is challenging. Currently, HDBSCAN is considered the SOTA algorithm for this problem. In this paper, we propose a novel clustering algorithm based on what we call the graph of density topology (GDT). GDT jointly considers the local and global structures of data samples: it first forms local clusters based on a density-growing process, with a strategy for properly handling noise as well as detecting cluster boundaries, and then estimates a GDT from the relationships between local clusters in terms of a connectivity measure, giving a global topological graph. The connectivity, measuring similarity between neighboring local clusters, is based on local clusters rather than individual points, ensuring its robustness even to very large noise. Evaluation results on both toy and real-world datasets show that GDT achieves the SOTA performance by far on almost all the popular datasets, and has a low time complexity of $O(n \log n)$. The code is available at https://github.com/gaozhangyang/DGC.git.

This PhD thesis lays out algebraic and topological structures relevant for the study of probabilistic graphical models.

Marginal estimation algorithms are introduced as diffusion equations of the form $\dot u = \delta \varphi$. They generalise the traditional belief propagation (BP) algorithm, and provide an alternative to contrastive divergence (CD) or Markov chain Monte Carlo (MCMC) algorithms, typically involved in estimating a free energy functional and its gradient w.r.t. model parameters.

We propose a new homological picture where the parameters are collections of local interaction potentials $(u_\alpha) \in A_0$, for $\alpha$ running over the factor nodes of a given region graph. The boundary operator $\delta$ mapping heat fluxes $(\varphi_{\alpha\beta}) \in A_1$ to a subspace $\delta A_1 \subseteq A_0$ is the discrete analog of a divergence. The total energy $H = \sum_\alpha u_\alpha$ defining the global probability $p = e^{-H} / Z$ is in one-to-one correspondence with a homology class $[u] = u + \delta A_1$ of interaction potentials, so that total energy remains constant when $u$ evolves up to a boundary term $\delta \varphi$.

Stationary states of the diffusion are shown to lie at the intersection of a homology class of potentials with a non-linear constraint surface enforcing consistency of the local marginal estimates. This picture allows us to make precise and complete a proof of the correspondence between stationary states of BP and critical points of a local free energy functional (obtained by Bethe-Kikuchi approximations), and to extend the uniqueness result for acyclic graphs (i.e. trees) to a wider class of hypergraphs. In general, bifurcations of equilibria are related to the spectral singularities of a local diffusion operator, yielding new explicit examples of the degeneracy phenomenon.

Work supervised by Prof. Daniel Bennequin

We consider the problem of estimating a meta-model of an unknown regression model with non-Gaussian and unbounded error. The meta-model belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to an additive decomposition including the variables and interactions between them. The estimator of this meta-model is calculated by minimizing an empirical least-squares criterion penalized by the sum of the Hilbert norm and the empirical $L^2$-norm. In this context, upper bounds on the empirical $L^2$ risk and the $L^2$ risk of the estimator are established.

Recent work has identified a number of formally incompatible operational measures for the unfairness of a machine learning (ML) system. As these measures all capture intuitively desirable aspects of a fair system, choosing "the one true" measure is not possible, and instead a reasonable approach is to minimize a weighted combination of measures. However, this simply raises the question of how to choose the weights. Here, we formulate Legally Grounded Fairness Objectives (LGFO), which uses signals from the legal system to non-arbitrarily measure the social cost of a specific degree of unfairness. The LGFO is the expected damages under a putative lawsuit that might be awarded to those who were wrongly classified, in the sense that the ML system made a decision different to that which would have been made under the court's preferred measure. Notably, the two quantities necessary to compute the LGFO, the court's preferences about fairness measures, and the expected damages, are unknown but well-defined, and can be estimated by legal advice. Further, as the damages awarded by the legal system are designed to measure and compensate for the harm caused to an individual by an unfair classification, the LGFO aligns closely with society's estimate of the social cost.

The success of machine learning algorithms often relies on a large amount of high-quality data to train well-performing models. However, data are a valuable resource and are often held by different parties in reality. An effective solution to such a data isolation problem is to employ federated learning, which allows multiple parties to collaboratively train a model. In this paper, we propose a Secure version of the widely used Maximum Mean Discrepancy (SMMD), based on homomorphic encryption, to enable effective knowledge transfer in the data federation setting without compromising data privacy. The proposed SMMD avoids potential information leakage in transfer learning when aligning the source and target data distributions. As a result, both the source domain and the target domain can fully utilize their data to build more scalable models. Experimental results demonstrate that our proposed SMMD is secure and effective.
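For readers unfamiliar with the plaintext quantity being secured, a minimal unencrypted sketch of the squared MMD with an RBF kernel might look as follows; the function names and bandwidth are illustrative assumptions, and the homomorphic-encryption layer of SMMD is deliberately not shown:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel values via broadcasting; gamma is an
    # illustrative bandwidth choice, not the paper's setting.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy
    between samples x and y (small when distributions match)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Aligning source and target distributions in transfer learning then amounts to minimizing this quantity.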

Carbon dioxide Capture and Storage (CCS) is an important strategy in mitigating anthropogenic CO$_2$ emissions. In order for CCS to be successful, large quantities of CO$_2$ must be stored and the storage site conformance must be monitored. Here we present a deep learning method to reconstruct pressure fields and classify the flux out of the storage formation based on the pressure data from Above Zone Monitoring Interval (AZMI) wells. The deep learning method is a version of a semi-conditional variational autoencoder tailored to solve two tasks: reconstruction of an incremental pressure field and leakage rate classification. The method, predictions, and associated uncertainty estimates are illustrated on synthetic data from a high-fidelity heterogeneous 2D numerical reservoir model, which was used to simulate subsurface CO$_2$ movement and pressure changes in the AZMI due to a CO$_2$ leakage.

In analogy to compressed sensing, which allows sample-efficient signal reconstruction given prior knowledge of its sparsity in the frequency domain, we propose to utilize policy simplicity (Occam's Razor) as a prior to enable sample-efficient imitation learning. We first demonstrate the feasibility of this scheme in the linear case, where the state-value function can be sampled directly. We then extend the scheme to scenarios where only actions are visible and to scenarios where the policy is obtained from a nonlinear network. The method is benchmarked against behavior cloning and results in significantly higher scores with limited expert demonstrations.

Artificial intelligence (AI) provides many opportunities to improve private and public life. Discovering patterns and structures in large troves of data in an automated manner is a core component of data science, and currently drives applications in diverse areas such as computational biology, law, and finance. However, such a highly positive impact is coupled with significant challenges: how do we understand the decisions suggested by these systems so that we can trust them? In this report, we focus specifically on data-driven methods -- machine learning (ML) and pattern recognition models in particular -- so as to survey and distill the results and observations from the literature. The purpose of this report can be especially appreciated by noting that ML models are increasingly deployed in a wide range of businesses. However, with the increasing prevalence and complexity of methods, business stakeholders, at the very least, have a growing number of concerns about the drawbacks of models, data-specific biases, and so on. Analogously, data science practitioners are often not aware of approaches emerging from the academic literature, or may struggle to appreciate the differences between methods, and so end up using industry standards such as SHAP. Here, we have undertaken a survey to help industry practitioners (but also data scientists more broadly) understand the field of explainable machine learning better and apply the right tools. Our latter sections build a narrative around a putative data scientist, and discuss how she might go about explaining her models by asking the right questions.

This work presents a mathematical treatment of the relation between Self-Organizing Maps (SOMs) and Gaussian Mixture Models (GMMs). We show that energy-based SOM models can be interpreted as performing gradient descent, minimizing an approximation to the GMM log-likelihood that is particularly valid for high data dimensionalities. The SOM-like decrease of the neighborhood radius can be understood as an annealing procedure ensuring that gradient descent does not get stuck in undesirable local minima. This link allows us to treat SOMs as generative probabilistic models, giving a formal justification for using SOMs, e.g., to detect outliers, or for sampling.
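A minimal sketch may help make the energy-based view concrete: each step moves all prototypes toward the input, weighted by a neighborhood kernel around the best-matching unit whose radius sigma is annealed over training. The names, grid layout, and constants here are illustrative, not the paper's notation:

```python
import numpy as np

def som_step(weights, grid, x, lr=0.1, sigma=1.0):
    """One gradient-descent-style SOM update.
    weights: (K, d) prototypes; grid: (K, 2) map coordinates."""
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
    # Gaussian neighborhood on the map grid; annealing shrinks
    # sigma over training so updates become increasingly local.
    h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    return weights + lr * h[:, None] * (x - weights)
```

As sigma shrinks, only the best-matching unit moves appreciably, which is the annealing behavior discussed above.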

High-dimensional streaming data are becoming increasingly ubiquitous in many fields. They often lie in multiple low-dimensional subspaces, and the manifold structures may change abruptly over time due to pattern shifts or the occurrence of anomalies. However, the problem of detecting such structural changes in real time has not been well studied. To fill this gap, we propose a dynamic sparse subspace learning (DSSL) approach for online structural change-point detection of high-dimensional streaming data. A novel multiple structural change-point model is proposed, and it is shown to be equivalent to maximizing a posterior probability under certain conditions. The asymptotic properties of the estimators are investigated. The penalty coefficients in our model can be selected by the AMDL criterion based on historical data. An efficient Pruned Exact Linear Time (PELT) based method is proposed for online optimization and change-point detection. The effectiveness of the proposed method is demonstrated through a simulation study and a real case study using gesture data for motion tracking.
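For readers unfamiliar with PELT, the dynamic program it accelerates can be sketched in a few lines. This is the unpruned $O(n^2)$ optimal-partitioning recursion for mean shifts in one dimension, not the authors' subspace formulation:

```python
import numpy as np

def change_points(y, penalty):
    """Penalized optimal partitioning for mean shifts. PELT adds a
    pruning rule that discards candidate split points that can
    never be optimal, reducing the quadratic cost."""
    n = len(y)
    s = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(np.square(y))))

    def seg_cost(i, j):
        # Squared error of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / (j - i)

    best = np.full(n + 1, np.inf)
    best[0] = -penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        cands = [best[i] + seg_cost(i, t) + penalty for i in range(t)]
        last[t] = int(np.argmin(cands))
        best[t] = cands[last[t]]
    cps, t = [], n
    while last[t] > 0:          # backtrack through optimal splits
        t = last[t]
        cps.append(t)
    return sorted(cps)
```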

This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretic interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). This theoretical result is also verified by various experiments. Furthermore, we find that such interactions are strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions, thereby alleviating over-fitting. Based on this understanding, we propose an interaction loss to further improve the utility of dropout. Experimental results show that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs.
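To make the notion of interaction concrete, here is a toy, hedged stand-in for the game-theoretic quantity: the interaction between inputs i and j is the part of the output that cannot be attributed to either input alone (the paper's actual measure is the Shapley-style interaction, which additionally averages over contexts):

```python
def pairwise_interaction(f, x, i, j, baseline=0.0):
    """f(with both) - f(without i) - f(without j) + f(without both):
    zero whenever f is additive in inputs i and j."""
    def drop(*idxs):
        z = list(x)
        for k in idxs:
            z[k] = baseline   # "remove" an input by ablating it
        return f(z)
    return drop() - drop(i) - drop(j) + drop(i, j)
```

Dropout, by randomly ablating inputs and activations, discourages the network from relying on exactly the joint terms that make this quantity large.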

Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic 'shallow' and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability techniques, and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in anomaly detection.
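As a concrete instance of the 'shallow' reconstruction family surveyed above, a PCA-based score flags points that the top-$k$ principal subspace reconstructs poorly; this is a minimal sketch, and deep reconstruction methods replace the linear projection with an autoencoder:

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, k=2):
    """Anomaly score = norm of the residual after projecting a
    centered point onto the top-k principal subspace of X_train."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]                 # projector onto subspace
    resid = (X_test - mu) @ (np.eye(X_train.shape[1]) - P)
    return np.linalg.norm(resid, axis=1)
```

Points far from the training manifold receive high scores, which is the shared principle behind both the shallow and the deep reconstruction-based detectors discussed here.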

With the rapid development of information technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute $l$ pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first run on the master to select pseudo centers. The indexes of the pseudo centers are then broadcast to the workers to complete the distributed community detection task using an SVD-type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated. Second, no further iterative algorithm is needed on the workers, and hence the method does not suffer from problems such as initialization sensitivity and non-robustness. Third, both the computational complexity and the storage requirements are much lower than when using the whole adjacency matrix. A Python package DCD (www.github.com/Ikerlz/dcd) is developed to implement the distributed algorithm for a Spark system. Theoretical properties are provided with respect to the estimation accuracy and mis-clustering rates. Lastly, the advantages of the proposed methodology are illustrated by experiments on a variety of synthetic and empirical datasets.

Bank transaction fraud results in over $13B in annual losses for banks, merchants, and card holders worldwide. Much of this fraud starts with a Point-of-Compromise (a data breach or a skimming operation) where credit and debit card digital information is stolen, resold, and later used to perform fraud. We introduce this problem and present an automatic Points-of-Compromise (POC) detection procedure. BreachRadar is a distributed alternating algorithm that assigns a probability of being compromised to the different possible locations. We implement this method using Apache Spark and show its linear scalability in the number of machines and transactions. BreachRadar is applied to two datasets with billions of real transaction records and fraud labels, where we provide multiple examples of real Points-of-Compromise we are able to detect. We further show the effectiveness of our method when injecting Points-of-Compromise into one of these datasets, simultaneously achieving over 90% precision and recall when only 10% of the cards have been victims of fraud.

In the classical multi-party computation setting, multiple parties jointly compute a function without revealing their own input data. We consider a variant of this problem, where the input data can be shared for machine learning training purposes, but the data are also encrypted so that they cannot be recovered by other parties. We present a rotation-based method using a flow model, and theoretically justify its security. We demonstrate the effectiveness of our method in different scenarios, including supervised secure model training and unsupervised generative model training. Our code is available at https://github.com/duchenzhuang/flowencrypt.

This paper explores a new research problem of unsupervised transfer learning across multiple spatiotemporal prediction tasks.
Unlike most existing transfer learning methods, which focus on fixing the discrepancy between supervised tasks, we study how to transfer knowledge from a zoo of models learned without supervision towards another predictive network. Our motivation is that models from different sources are expected to understand the complex spatiotemporal dynamics from different perspectives, thereby effectively supplementing the new task, even if the task has sufficient training samples. Technically, we propose a differentiable framework named transferable memory. It adaptively distills knowledge from a bank of memory states of multiple pretrained RNNs, and applies it to the target network via a novel recurrent structure called the Transferable Memory Unit (TMU). Compared with finetuning, our approach yields significant improvements on three benchmarks for spatiotemporal prediction, and benefits the target task even from less relevant pretext ones.

Balancing the distributions of the confounders across the exposure levels in an observational study through matching or weighting is an accepted method to control for confounding due to these variables when estimating the association between an exposure and an outcome, and to reduce the degree of dependence on certain modeling assumptions. Despite their increasing popularity in practice, these procedures cannot be immediately applied to datasets with missing values. Multiple imputation of the missing data is a popular approach that accounts for missing values while preserving the number of units in the dataset and reflecting the uncertainty in the missing values. However, to the best of our knowledge, there is no comprehensive matching and weighting software that can be easily used with multiply imputed datasets. In this paper, we review this problem and suggest a framework that maps matching and weighting of multiply imputed datasets to five actions, as well as best practices for assessing balance in these datasets after matching and weighting.
We also illustrate these approaches using a companion R package, MatchThem.

Ecologists are interested in modeling the population growth of species in various ecosystems. Studying population dynamics can assist environmental managers in making better decisions for the environment. Traditionally, the sampling of species and tracking of populations have been recorded at a regular time frequency. However, sampling can be an expensive process given the available resources, money, and time. Limiting sampling makes it challenging to properly track the growth of a population. Thus, we propose a novel approach to designing sampling regimes based on the dynamics associated with population growth models. This design study minimizes the amount of time ecologists spend in the field, while maximizing the information provided by the data.

Studies often estimate associations between an outcome and multiple variates. For example, studies of diagnostic test accuracy estimate sensitivity and specificity, and studies of prognostic factors typically estimate associations for multiple factors. Meta-analysis is a family of methods for synthesizing estimates across multiple studies. Multivariate models exist that account for within-study correlations and between-study heterogeneity. The number of parameters that must be estimated in existing models is quadratic in the number of variates, which means they may not be usable if the data are sparse, with many variates and few studies. We propose a new model that addresses this problem by approximating a variance-covariance matrix that models within-study correlation and between-study heterogeneity in a low-dimensional space using random projection. The number of parameters that must be estimated in this model is quadratic in the dimensionality of the low-dimensional space, making estimation more tractable.
We demonstrate the method using data from an ongoing systematic review on predictors of pain and function after total knee arthroplasty.

Obtaining data with meaningful labels is often costly and error-prone. In this situation, semi-supervised learning (SSL) approaches are attractive, as they leverage assumptions about the unlabeled data to make up for the limited amount of labels. However, in real-world situations we cannot assume that the labeling process is infallible, and the accuracy of many SSL classifiers decreases significantly in the presence of label noise. In this work, we introduce LGC_LVOF, a leave-one-out filtering approach based on the Local and Global Consistency (LGC) algorithm. Our method aims to detect and remove wrong labels, and thus can be used as a preprocessing step for any SSL classifier. Given the propagation matrix, detecting noisy labels takes $O(cl)$ per step, with $c$ the number of classes and $l$ the number of labels. Moreover, one does not need to compute the whole propagation matrix, but only an $l \times l$ submatrix corresponding to interactions between labeled instances. As a result, our approach is best suited to datasets with a large amount of unlabeled data but not many labels. Results are provided for a number of datasets, including MNIST and ISOLET. LGC_LVOF appears to be equally or more precise than the adapted gradient-based filter. We show that the best-case accuracy of embedding LGC_LVOF into LGC yields performance comparable to the best case of $\ell_1$-based classifiers designed to be robust to label noise. We also provide a heuristic to choose the number of removed instances.

In randomized clinical trials, adjustments for baseline covariates at both the design and analysis stages are highly encouraged by regulatory agencies. A recent trend is to use a model-assisted approach for covariate adjustment to gain credibility and efficiency while producing asymptotically valid inference even when the model is incorrect.
In this article we present three principles for model-assisted inference in simple or covariate-adaptive randomized trials: (1) the guaranteed efficiency gain principle, a model-assisted method should often gain but never hurt efficiency; (2) the validity and universality principle, a valid procedure should be universally applicable to all commonly used randomization schemes; (3) the robust standard error principle, variance estimation should be heteroscedasticity-robust. To fulfill these principles, we recommend a working model that includes all covariates utilized in randomization and all treatment-by-covariate interaction terms. Our conclusions are based on asymptotic theory with a generality that has not appeared in the literature, as most existing results are about linear contrasts of outcomes rather than the joint distribution, and most existing inference results under covariate-adaptive randomization are special cases of our theory. Our theory also reveals distinct results between the cases of two arms and multiple arms.

Sequential sensor data are generated in a wide variety of practical applications. A fundamental challenge involves learning effective classifiers for such sequential data. While deep learning has led to impressive performance gains in recent years in domains such as speech, this has relied on the availability of large datasets of sequences with high-quality labels. In many applications, however, the associated class labels are often extremely limited, with precise labelling/segmentation being too expensive to perform at high volume. However, large amounts of unlabeled data may still be available. In this paper we propose a novel framework for semi-supervised learning in such contexts. In an unsupervised manner, change point detection methods can be used to identify points within a sequence corresponding to likely class changes.
We show that change points provide examples of similar/dissimilar pairs of sequences which, when coupled with labeled data, can be used in a semi-supervised classification setting. Leveraging the change points and labeled data, we form examples of similar/dissimilar sequences to train a neural network to learn improved representations for classification. We provide extensive synthetic simulations and show that the learned representations are superior to those learned through an autoencoder, and obtain improved results on both simulated and real-world human activity recognition datasets.

Recent network pruning methods focus on pruning models early on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed for pruning trained models. Despite lacking justification for their use early on in training, models pruned using such measures result in surprisingly minimal accuracy loss. To better explain this behavior, we develop a general, gradient-flow-based framework that relates state-of-the-art importance measures through the order of the time-derivative of the norm of model parameters. We use this framework to determine the relationship between pruning measures and the evolution of model parameters, establishing several findings related to pruning models early on in training: (i) magnitude-based pruning removes parameters that contribute least to the reduction in loss, resulting in models that converge faster than those from magnitude-agnostic methods; (ii) loss-preservation-based pruning preserves first-order model evolution dynamics and is well motivated for pruning minimally trained models; and (iii) gradient-norm-based pruning affects second-order model evolution dynamics, and increasing the gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. Code is available at https://github.com/EkdeepSLubana/flowandprune.
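Of the importance measures related by the pruning framework above, magnitude-based pruning is the simplest to illustrate; a hedged sketch of a global-threshold variant (the function name and threshold rule are illustrative, not the paper's implementation) is:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of parameters with the smallest
    absolute value, using one threshold shared across all layers."""
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    k = int(sparsity * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    thresh = np.partition(flat, k)[k]    # k-th smallest magnitude
    return [np.where(np.abs(w) < thresh, 0.0, w) for w in weights]
```

In the gradient-flow hierarchy described above, magnitude corresponds to the lowest-order term, while loss-preservation and gradient-norm measures involve first- and second-order dynamics respectively.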
We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while multilayer perceptrons (MLPs) do not extrapolate well in simple tasks, Graph Neural Networks (GNNs), structured networks with MLP modules, have had some success in more complex tasks. We provide a theoretical explanation and identify conditions under which MLPs and GNNs extrapolate well. We start by showing that ReLU MLPs trained by gradient descent converge quickly to linear functions along any direction from the origin, which suggests that ReLU MLPs cannot extrapolate well in most non-linear tasks. On the other hand, ReLU MLPs can provably converge to a linear target function when the training distribution is "diverse" enough. These observations lead to a hypothesis: GNNs can extrapolate well in dynamic programming (DP) tasks if we encode appropriate non-linearity in the architecture and input representation. We provide theoretical and empirical support for the hypothesis. Our theory explains previous extrapolation successes and suggests their limitations: successful extrapolation relies on incorporating task-specific non-linearity, which often requires domain knowledge or extensive model search.

We give an explicit formula for the reciprocal maximum likelihood degree of Brownian motion tree models. To achieve this, we connect them to certain toric (or log-linear) models, and express the Brownian motion tree model of an arbitrary tree as a toric fiber product of star tree models.

Stochastic gradient descent (SGD) is often applied to train Deep Neural Networks (DNNs), and research efforts have been devoted to investigating the convergent dynamics of SGD and the minima found by SGD.
The influencing factors identified in the literature include the learning rate, batch size, Hessian, and gradient covariance, and stochastic differential equations are used to model SGD and establish the relationships among these factors for characterizing the minima found by SGD. It has been found that the ratio of batch size to learning rate is a main factor in the underlying SGD dynamics; however, the influence of other important factors, such as the Hessian and gradient covariance, is not entirely agreed upon. This paper describes the factors and relationships in the recent literature and presents numerical findings on the relationships. In particular, it confirms the four-factor and general relationship results obtained in Wang (2019), while the three-factor and associated relationship results found in Jastrz\c{e}bski et al. (2018) may not hold beyond the considered special case.

In order to determine whether or not an effect is absent on the basis of a statistical test, the recommended frequentist tool is the equivalence test. Typically, it is expected that an appropriate equivalence margin has been specified before any data are observed. Unfortunately, this can be a difficult task. If the margin is too small, then the test's power will be substantially reduced. If the margin is too large, any claims of equivalence will be meaningless. Moreover, it remains unclear how defining the margin afterwards biases one's results. In this short article, we consider a series of hypothetical scenarios in which the margin is defined post hoc or is otherwise considered controversial. We also review a number of relevant, potentially problematic actual studies from clinical trials research, with the aim of motivating a critical discussion as to what is acceptable and desirable in the reporting and interpretation of equivalence tests.
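For context, an equivalence test is typically run as two one-sided tests (TOST); the minimal normal-approximation sketch below (the function name and z-test form are illustrative assumptions) shows directly how the margin drives power:

```python
import math

def tost_equivalent(est, se, margin, alpha=0.05):
    """Declare equivalence to 0 within +/- margin iff both
    one-sided z-tests reject at level alpha."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    p_lower = 1.0 - phi((est + margin) / se)  # H0: effect <= -margin
    p_upper = phi((est - margin) / se)        # H0: effect >= +margin
    return max(p_lower, p_upper) < alpha
```

With an estimate of 0 and standard error 0.1, a margin of 0.5 declares equivalence while a margin of 0.05 cannot, which is the "margin too small, power substantially reduced" effect described above.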
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his or her allocation to anyone else's. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy-free on a sample would be almost envy-free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.

We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample-splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance, and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class.
We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates---rates of the same order as if we knew the nuisance parameter---are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation, and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

Deep reinforcement learning has been successful in a variety of tasks, such as game playing and robotic manipulation. However, attempting to learn \textit{tabula rasa} disregards the logical structure of many domains as well as the wealth of readily available knowledge from domain experts that could help "warm start" the learning process. We present a novel reinforcement learning technique that allows for intelligent initialization of a neural network's weights and architecture. Our approach permits encoding domain knowledge directly into a neural decision tree, and improves upon that knowledge with policy gradient updates. We empirically validate our approach on two OpenAI Gym tasks and two modified StarCraft 2 tasks, showing that our novel architecture outperforms multilayer-perceptron and recurrent architectures. Our knowledge-based framework finds superior policies compared to imitation-learning-based and prior-knowledge-based approaches. Importantly, we demonstrate that our approach can be used by untrained humans to initially provide a >80% increase in expected reward relative to baselines prior to training (p < 0.001), which results in a >60% increase in expected reward after policy optimization (p = 0.011).

Network embedding is a method to learn low-dimensional representation vectors for nodes in complex networks.
In real networks, nodes may have multiple tags, but existing methods ignore the abundant semantic and hierarchical information of tags. This information is useful to many network applications and is usually very stable. In this paper, we propose a tag representation learning model, Tag2Vec, which mixes nodes and tags into a hybrid network. First, for tag networks, we define semantic distance as the proximity between tags and design a novel strategy, parameterized random walk, to adaptively generate context with the semantic and hierarchical information of tags. Then, we propose a hyperbolic Skip-gram model to better express the complex hierarchical structure with lower output dimensions. We evaluate our model on the NBER U.S. patent dataset and the WordNet dataset. The results show that our model can learn tag representations with rich semantic information, and that it outperforms other baselines.

In many cases, neural networks can be mapped into tensor networks with an exponentially large bond dimension. Here, we compare different sub-classes of neural network states with their mapped tensor network counterparts for studying the ground state of short-range Hamiltonians. We show that when mapping a neural network, the resulting tensor network is highly constrained, and thus neural network states do not in general deliver the naively expected drastic improvement over state-of-the-art tensor network methods. We explicitly show this result in two paradigmatic examples, the 1D ferromagnetic Ising model and the 2D antiferromagnetic Heisenberg model, addressing the lack of a detailed comparison of the expressiveness of these increasingly popular variational ans\"atze.

The cephalometric tracing method is commonly used in orthodontic diagnosis and treatment planning. In this paper, we propose a deep learning based framework to automatically detect anatomical landmarks in cephalometric X-ray images.
We train a deep encoder-decoder for landmark detection, and combine the global landmark configuration with local high-resolution feature responses. The proposed framework is based on a 2-stage U-Net, regressing multi-channel heatmaps for landmark detection. In this framework, we embed an attention mechanism with the global-stage heatmaps, guiding the local-stage inference, to regress the local heatmap patches at high resolution. Besides, an Expansive Exploration strategy improves robustness during inference, expanding the search scope without increasing model complexity. We have evaluated our framework on the most widely used public dataset for landmark detection in cephalometric X-ray images. With less computation and manual tuning, our framework achieves state-of-the-art results.

The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation) divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD) divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and variances, we compare interval estimators using simulation studies to assess coverage.

Surprise-based learning allows agents to rapidly adapt to non-stationary stochastic environments characterized by sudden changes.
We show that exact Bayesian inference in a hierarchical model gives rise to a surprise-modulated trade-off between forgetting old observations and integrating them with the new ones. The modulation depends on a probability ratio, which we call the "Bayes Factor Surprise", that tests the prior belief against the current belief. We demonstrate that in several existing approximate algorithms the Bayes Factor Surprise modulates the rate of adaptation to new observations. We derive three novel surprise-based algorithms, one in the family of particle filters, one in the family of variational learning, and the other in the family of message passing, that have constant scaling in observation sequence length and particularly simple update dynamics for any distribution in the exponential family. Empirical results show that these surprise-based algorithms estimate parameters better than alternative approximate approaches and reach levels of performance comparable to computationally more expensive algorithms. The Bayes Factor Surprise is related to but different from Shannon Surprise. In two hypothetical experiments, we make testable predictions for physiological indicators that dissociate the Bayes Factor Surprise from Shannon Surprise. The theoretical insight of casting various approaches as surprise-based learning, as well as the proposed online algorithms, may be applied to the analysis of animal and human behavior, and to reinforcement learning in non-stationary environments.

The $k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, the $k$-means algorithm suffers from relatively poor performance on datasets containing heavy-tailed data or outliers. Besides, the standard $k$-means algorithm also has relatively weak stability, i.e., its results have a large variance, which reduces the model credibility.
In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version, to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. A large number of experiments are also conducted, which verify the effectiveness and efficiency of the proposed method.

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case -- which avoid explicit worst-case dependencies on the size of state space -- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
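The sensitivity of standard $k$-means to outliers, which motivates the $t$-$k$-means above, is easy to see in a toy example: a single gross outlier drags a mean-based cluster center far from the bulk of the data, while a median-based center update stays put. (The median is used here only as a simple robust stand-in for illustration; the paper's actual method replaces the Gaussian-derived loss with a heavy-tailed $t$-distribution-based one.)

```python
import numpy as np

# One-dimensional "cluster" with a single gross outlier at 100.
cluster = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 100.0])

mean_center = cluster.mean()        # standard k-means center update
median_center = np.median(cluster)  # robust stand-in for a heavy-tail-aware update

# The outlier pulls the mean far away from the bulk of the points,
# while the median remains close to 1.
print(mean_center)    # 17.5
print(median_center)  # 1.05
```

Iterating such a robust center update inside Lloyd's algorithm is the general shape of the fix; the paper's analysis additionally covers stability of the resulting clustering.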
Operator-theoretic analysis of nonlinear dynamical systems has attracted much attention in a variety of engineering and scientific fields, endowed with practical estimation methods using data such as dynamic mode decomposition. In this paper, we address a lifted representation of nonlinear dynamical systems with random noise based on transfer operators, and develop a novel Krylov subspace method for estimating the operators using finite data, with consideration of the unboundedness of operators. For this purpose, we first consider Perron-Frobenius operators with kernel-mean embeddings for such systems. We then extend the Arnoldi method, which is the most classical type of Krylov subspace method, so that it can be applied to the current case. Meanwhile, the Arnoldi method requires the assumption that the operator is bounded, which is not necessarily satisfied for transfer operators on nonlinear systems. We accordingly develop the shift-invert Arnoldi method for Perron-Frobenius operators to avoid this problem. Also, we describe an approach for evaluating the predictive accuracy of estimated operators on the basis of the maximum mean discrepancy, which is applicable, for example, to anomaly detection in complex systems. The empirical performance of our methods is investigated using synthetic and real-world healthcare data.

[New and updated results were published in Nature Chemistry, doi:10.1038/s41557-020-0544-y.] The electronic Schr\"odinger equation describes fundamental properties of molecules and materials, but can only be solved analytically for the hydrogen atom. The numerically exact full configuration-interaction method is exponentially expensive in the number of electrons. Quantum Monte Carlo is a possible way out: it scales well to large molecules, can be parallelized, and its accuracy has, as yet, only been limited by the flexibility of the used wave function ansatz.
Here we propose PauliNet, a deep-learning wave function ansatz that achieves nearly exact solutions of the electronic Schr\"odinger equation. PauliNet has a multireference Hartree-Fock solution built in as a baseline, incorporates the physics of valid wave functions, and is trained using variational quantum Monte Carlo (VMC). PauliNet outperforms comparable state-of-the-art VMC ansatzes for atoms, diatomic molecules and a strongly-correlated hydrogen chain by a margin and is yet computationally efficient. We anticipate that, thanks to the favourable scaling with system size, this method may become a new leading method for highly accurate electronic-structure calculations on medium-sized molecular systems.

Rapid intensification (RI) of tropical cyclones often causes major destruction to human civilization due to short response time. It is an important yet challenging task to accurately predict this kind of extreme weather event in advance. Traditionally, meteorologists tackle the task with human-driven feature extraction and predictor correction procedures. Nevertheless, these procedures do not leverage the power of modern machine learning models and abundant sensor data, such as satellite images. In addition, the human-driven nature of such an approach makes it difficult to reproduce and benchmark prediction models. In this study, we build a benchmark for RI prediction using only satellite images, which are underutilized in traditional techniques. The benchmark follows conventional data science practices, making it easier for data scientists to contribute to RI prediction. We demonstrate the usefulness of the benchmark by designing a domain-inspired spatiotemporal deep learning model. The results showcase the promising performance of deep learning in solving complex meteorological problems such as RI prediction.

This paper investigates statistical models for road traffic modeling.
The proposed methodology considers road traffic as (i) a high-dimensional time-series for which (ii) regeneration occurs at the end of each day. Because of (ii), prediction is based on a daily modeling of the road traffic using a vector autoregressive model that combines linearly the past observations of the day. Because of (i), the learning algorithm follows from an $\ell_1$-penalization of the regression coefficients. Excess risk bounds are established under the high-dimensional framework in which the number of road sections goes to infinity with the number of observed days. Considering floating car data observed in an urban area, the approach is compared to state-of-the-art methods including neural networks. In addition to being very competitive in terms of prediction, it makes it possible to identify the most determinant sections of the road network.

Most existing knowledge graphs suffer from incompleteness, which can be alleviated by inferring missing links based on known facts. One popular way to accomplish this is to generate low-dimensional embeddings of entities and relations, and use these to make inferences. ConvE, a recently proposed approach, applies convolutional filters on 2D reshapings of entity and relation embeddings in order to capture rich interactions between their components. However, the number of interactions that ConvE can capture is limited. In this paper, we analyze how increasing the number of these interactions affects link prediction performance, and utilize our observations to propose InteractE. InteractE is based on three key ideas -- feature permutation, a novel feature reshaping, and circular convolution. Through extensive experiments, we find that InteractE outperforms state-of-the-art convolutional link prediction baselines on FB15k-237. Further, InteractE achieves an MRR score that is 9%, 7.5%, and 23% better than ConvE on the FB15k-237, WN18RR and YAGO3-10 datasets respectively.
The results validate our central hypothesis -- that increasing feature interaction is beneficial to link prediction performance. We make the source code of InteractE available to encourage reproducible research.

Although ADAM is a very popular algorithm for optimizing the weights of neural networks, it has been recently shown that it can diverge even in simple convex optimization examples. Several variants of ADAM have been proposed to circumvent this convergence issue. In this work, we study the ADAM algorithm for smooth nonconvex optimization under a boundedness assumption on the adaptive learning rate. The bound on the adaptive step size depends on the Lipschitz constant of the gradient of the objective function and provides safe theoretical adaptive step sizes. Under this boundedness assumption, we show a novel first order convergence rate result in both deterministic and stochastic contexts. Furthermore, we establish convergence rates of the function value sequence using the Kurdyka-Lojasiewicz property.

Inferring concerted changes among biological traits along an evolutionary history remains an important yet challenging problem. Besides adjusting for spurious correlation induced from the shared history, the task also requires sufficient flexibility and computational efficiency to incorporate multiple continuous and discrete traits as data size increases. To accomplish this, we jointly model mixed-type traits by assuming latent parameters for binary outcome dimensions at the tips of an unknown tree informed by molecular sequences. This gives rise to a phylogenetic multivariate probit model. With large sample sizes, posterior computation under this model is problematic, as it requires repeated sampling from a high-dimensional truncated normal distribution. Current best practices employ multiple-try rejection sampling that suffers from slow-mixing and a computational cost that scales quadratically in sample size.
We develop a new inference approach that exploits 1) the bouncy particle sampler (BPS) based on piecewise deterministic Markov processes to simultaneously sample all truncated normal dimensions, and 2) novel dynamic programming that reduces the cost of likelihood and gradient evaluations for BPS to linear in sample size. In an application with 535 HIV viruses and 24 traits that necessitates sampling from a 12,840-dimensional truncated normal, our method makes it possible to estimate the across-trait correlation and detect factors that affect the pathogen's capacity to cause disease. This inference framework is also applicable to a broader class of covariance structures beyond comparative biology.

Graph link prediction is an important task in cyber-security: relationships between entities within a computer network, such as users interacting with computers, or system libraries and the corresponding processes that use them, can provide key insights into adversary behaviour. Poisson matrix factorisation (PMF) is a popular model for link prediction in large networks, particularly useful for its scalability. In this article, PMF is extended to include scenarios that are commonly encountered in cyber-security applications. Specifically, an extension is proposed to explicitly handle binary adjacency matrices and include known covariates associated with the graph nodes. A seasonal PMF model is also presented to handle seasonal networks. To allow the methods to scale to large graphs, variational methods are discussed for performing fast inference. The results show an improved performance over the standard PMF model and other statistical network models.

This paper addresses the case where data come as point sets, or more generally as discrete measures.
Our motivation is twofold: first we intend to approximate with a compactly supported measure the mean of the measure generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second we build from the estimator of the mean measure a vectorization map, which sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that in a mixture of measure generating processes, our technique yields a representation in $\mathbb{R}^k$, for $k \in \mathbb{N}^*$, that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.

The goal of many scientific experiments including A/B testing is to estimate the average treatment effect (ATE), which is defined as the difference between the expected outcomes of two or more treatments. In this paper, we consider a situation where an experimenter can assign a treatment to research subjects sequentially. In adaptive experimental design, the experimenter is allowed to change the probability of assigning a treatment using past observations for estimating the ATE efficiently. However, with this approach, it is difficult to apply a standard statistical method to construct an estimator because the observations are not independent and identically distributed. We thus propose an algorithm for efficient experiments with estimators constructed from dependent samples. We also introduce a sequential testing framework using the proposed estimator. To justify our proposed approach, we provide finite and infinite sample analyses.
Finally, we experimentally show that the proposed algorithm exhibits preferable performance.

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.

Efficient numerical solvers for sparse linear systems are crucial in science and engineering. One of the fastest methods for solving large-scale sparse linear systems is algebraic multigrid (AMG). The main challenge in the construction of AMG algorithms is the selection of the prolongation operator -- a problem-dependent sparse matrix which governs the multiscale hierarchy of the solver and is critical to its efficiency. Over many years, numerous methods have been developed for this task, and yet there is no known single right answer except in very special cases. Here we propose a framework for learning AMG prolongation operators for linear systems with sparse symmetric positive (semi-) definite matrices. We train a single graph neural network to learn a mapping from an entire class of such matrices to prolongation operators, using an efficient unsupervised loss function.
Experiments on a broad class of problems demonstrate improved convergence rates compared to classical AMG, demonstrating the potential utility of neural networks for developing sparse system solvers.

We consider the problem of simultaneous variable selection and estimation of the corresponding regression coefficients in an ultra-high dimensional linear regression model, an extremely important problem in the recent era. Adaptive penalty functions are used in this regard to achieve the oracle variable selection property along with a lighter computational burden. However, the usual adaptive procedures (e.g., adaptive LASSO) based on the squared error loss function are extremely non-robust in the presence of data contamination, which is quite common with large-scale data (e.g., noisy gene expression data, spectra and spectral data). In this paper, we present a regularization procedure for ultra-high dimensional data using a robust loss function based on the popular density power divergence (DPD) measure along with the adaptive LASSO penalty. We theoretically study the robustness and the large-sample properties of the proposed adaptive robust estimators for a general class of error distributions; in particular, we show that the proposed adaptive DPD-LASSO estimator is highly robust, satisfies the oracle variable selection property, and the corresponding estimators of the regression coefficients are consistent and asymptotically normal under an easily verifiable set of assumptions. Numerical illustrations are provided for the most commonly used normal error density. Finally, the proposal is applied to analyze an interesting spectral dataset, in the field of chemometrics, regarding the electron-probe X-ray microanalysis (EPXMA) of archaeological glass vessels from the 16th and 17th centuries.

The coronavirus disease 2019 (COVID-19) has quickly grown from a regional outbreak in Wuhan, China to a global pandemic.
Early estimates of the epidemic growth and incubation period of COVID-19 may have been biased due to sample selection. Using detailed case reports from 14 locations in and outside mainland China, we obtained 378 Wuhan-exported cases who left Wuhan before an abrupt travel quarantine. We developed a generative model we call BETS for four key epidemiological events---Beginning of exposure, End of exposure, time of Transmission, and time of Symptom onset (BETS)---and derived explicit formulas to correct for the sample selection. We gave a detailed illustration of why some early and highly influential analyses of the COVID-19 pandemic were severely biased. All our analyses, regardless of which subsample and model were being used, point to an epidemic doubling time of 2 to 2.5 days during the early outbreak in Wuhan. A Bayesian nonparametric analysis further suggests that about 5% of the symptomatic cases may not develop symptoms within 14 days of infection and that men may be much more likely than women to develop symptoms within 2 days of infection.

Aircraft performance models play a key role in airline operations, especially in planning a fuel-efficient flight. In practice, manufacturers provide guidelines which are slightly modified throughout the aircraft life cycle via the tuning of a single factor, enabling better fuel predictions. However, this approach has limitations; in particular, it does not reflect the evolution of each feature impacting the aircraft performance. Our goal here is to overcome this limitation. The key contribution of the present article is to foster the use of machine learning to leverage the massive amounts of data continuously recorded during flights performed by an aircraft and provide models reflecting its actual and individual performance. We illustrate our approach by focusing on the estimation of the drag and lift coefficients from recorded flight data. As these coefficients are not directly recorded, we resort to aerodynamics approximations.
As a safety check, we provide bounds to assess the accuracy of both the aerodynamics approximation and the statistical performance of our approach. We provide numerical results on a collection of machine learning algorithms. We report excellent accuracy on real-life data and exhibit empirical evidence to support our modelling, in coherence with aerodynamics principles.

This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the data chunk-by-chunk within the confines of a limited memory buffer. At first, a temporary clustering model is built; it is then incrementally updated by analyzing consecutive memory loads of points. Ultimately, the proposed algorithm gives an outlying score to each object, named SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has low linear time complexity and is more effective and efficient compared to the best-known conventional density-based methods, which need to load all data into memory, and also to some fast distance-based methods, which can operate on data resident on disk.

Researchers regularly use synthetic control methods for estimating causal effects when a subset of units receive a single persistent treatment, and the rest are unaffected by the change. In many applications, however, units not assigned to treatment are nevertheless impacted by the intervention because of cross-unit interactions. This paper extends the synthetic control methods to accommodate partial interference, allowing interactions within predefined groups, but not between them.
Focusing on a class of causal estimands that capture the effect both on the treated and control units, we develop a multivariate Bayesian structural time series model for generating synthetic controls that would have occurred in the absence of an intervention, enabling us to estimate our novel effects. In a simulation study, we explore our Bayesian procedure's empirical properties and show that it achieves good frequentist coverage even when the model is misspecified. Our work is motivated by an analysis of a marketing campaign's effectiveness by an Italian supermarket chain that permanently reduced the price of hundreds of store-brand products. We use our new methodology to make causal statements about the impact on sales of the affected store-brands and their direct competitors. Our proposed approach is implemented in the CausalMBSTS R package.

Statisticians have warned us since the early days of their discipline that experimental correlation between two observations by no means implies the existence of a causal relation. The question of what clues exist in observational data that could inform us about the existence of such causal relations is nevertheless more than legitimate. It lies actually at the root of any scientific endeavor. For decades, however, the only method accepted among statisticians for elucidating causal relationships was the so-called Randomized Controlled Trial. Besides this notorious exception, causality questions remained largely taboo for many. One reason for this state of affairs was the lack of an appropriate mathematical framework in which to formulate such questions unambiguously. Fortunately, things have changed in recent years with the advent of the so-called Causality Revolution initiated by Judea Pearl and coworkers. The aim of this pedagogical paper is to present their ideas and methods in a compact and self-contained fashion, with concrete business examples as illustrations.
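The warning that correlation need not imply causation can be reproduced in a few lines (a purely illustrative simulation, not taken from the paper): a hidden confounder Z drives both X and Y, so X and Y are strongly correlated even though neither causes the other, and a randomized "trial" on X makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder Z drives both X and Y; X has no causal effect on Y.
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)
print(round(np.corrcoef(x, y)[0, 1], 2))  # ~0.99: strong correlation, no causation

# Randomized "trial": X is assigned at random (an intervention), cutting the
# link to Z; Y is generated exactly as before.
x_do = rng.normal(size=n)
y_do = z + 0.1 * rng.normal(size=n)
print(round(abs(np.corrcoef(x_do, y_do)[0, 1]), 2))  # ~0: association vanishes
```

This is the observational-versus-interventional gap that Pearl's do-calculus formalizes; the Randomized Controlled Trial is the physical realization of the intervention in the second half.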
Advances in Reinforcement Learning (RL) have successfully tackled sample efficiency and overestimation bias. However, these methods often fall short of scalable performance. On the other hand, genetic methods provide scalability but exhibit hyperparameter sensitivity to evolutionary operations. We present the Evolution-based Soft Actor-Critic (ESAC), a scalable RL algorithm. Our contributions are threefold: ESAC (1) abstracts exploration from exploitation by combining Evolution Strategies (ES) with Soft Actor-Critic (SAC), (2) provides dominant skill transfer between offspring by making use of soft winner selections and genetic crossovers in hindsight, and (3) improves hyperparameter sensitivity in evolutions using Automatic Mutation Tuning (AMT). AMT gradually replaces the entropy framework of SAC, allowing the population to succeed at the task while acting as randomly as possible, without making use of backpropagation updates. On a range of challenging control tasks consisting of high-dimensional action spaces and sparse rewards, ESAC demonstrates state-of-the-art performance and sample efficiency equivalent to SAC. ESAC demonstrates scalability comparable to ES on the basis of hardware resources and algorithm overhead. A complete implementation of ESAC with notes on reproducibility and videos can be found at the project website https://karush17.github.io/esac-web/.

The Out-of-Distribution (OOD) generalization problem is the problem of seeking a predictor function whose performance in the worst environments is optimal. This paper makes two contributions to the OOD problem. We first use basic results of probability to prove the maximal Invariant Predictor (MIP) condition, a theoretical result that can be used to identify the OOD optimal solution. We then use our MIP to derive the inner-environmental Gradient Alignment (IGA) algorithm that can be used to help seek the OOD optimal predictor.
Previous studies that have investigated the theoretical aspects of the OOD problem use strong structural assumptions such as causal DAGs. However, in cases involving image datasets, for example, the identification of hidden structural relations is itself a difficult problem. Our theoretical results differ from those of many previous studies in that they can be applied to cases in which the underlying structure of a dataset is difficult to analyze. We present an extensive comparison of previous theoretical approaches to the OOD problem based on the assumptions they make. We also present an extension of Colored-MNIST that can more accurately represent the pathological OOD situation than the original version, and demonstrate the superiority of IGA over previous methods on both the original and the extended version of Colored-MNIST.

How can we identify the training examples that contribute most to the prediction of a tree ensemble? In this paper, we introduce TREX, an explanation system that provides instance-attribution explanations for tree ensembles, such as random forests and gradient boosted trees. TREX builds on the representer point framework previously developed for explaining deep neural networks. Since tree ensembles are non-differentiable, we define a kernel that captures the structure of the specific tree ensemble. By using this kernel in kernel logistic regression or a support vector machine, TREX builds a surrogate model that approximates the original tree ensemble. The weights in the kernel expansion of the surrogate model are used to define the global or local importance of each training example.
Our experiments show that TREX's surrogate model accurately approximates the tree ensemble; its global importance weights are more effective in dataset debugging than the previous state-of-the-art; its explanations identify the most influential samples better than alternative methods under the remove-and-retrain evaluation framework; it runs orders of magnitude faster than alternative methods; and its local explanations can identify and explain errors due to domain mismatch.

This paper surveys the field of transfer learning in the problem setting of Reinforcement Learning (RL). RL has been a key solution to sequential decision-making problems. Along with the fast advances of RL in various domains, such as robotics and game-playing, transfer learning arises as an important technique to assist RL by leveraging and transferring external expertise to boost the learning process of RL. In this survey, we review the central issues of transfer learning in the RL domain, providing a systematic categorization of its state-of-the-art techniques. We analyze their goals, methodologies, applications, and the RL frameworks under which the transfer learning techniques are approachable. We discuss the relationship between transfer learning and other relevant topics from the RL perspective and also explore the potential challenges as well as future development directions for transfer learning in RL.

This paper develops an inference problem for the Vasicek model driven by a general Gaussian process. We construct a least squares estimator and a moment estimator for the drift parameters of the Vasicek model, and we prove their consistency and asymptotic normality. Our approach extends the result of Xiao and Yu (2018) to the case when the noise is a fractional Brownian motion with Hurst parameter $H \in [1/2, 1)$.

Harsh winter climate can cause various problems for both public and private sectors in Sweden, especially in the northern part for the railway industry.
To have a better understanding of winter climate impacts, this study investigates the effects of winter climate, including atmospheric icing, on the performance of high speed passenger trains in the Botnia-Atlantica region. The investigation is done with train operational data together with simulated weather data from the Weather Research and Forecasting model over January - February 2017. Two different measurements of the train performance are analysed. One is cumulative delay, which measures the increment in delay in terms of running time between two consecutive measuring spots; the other is current delay, which is the delay in terms of arrival time at each measuring spot compared to the schedule. Cumulative delay is investigated through a Cox model and the current delay is studied using a Markov chain model. The results show that the weather factors have impacts on the train performance. In particular, temperature and humidity have significant impacts on both the occurrence of cumulative delay and the transition probabilities between (current) delayed and non-delayed states.

We consider general high-dimensional spiked sample covariance models and show that their leading sample spiked eigenvalues and their linear spectral statistics are asymptotically independent when the sample size and dimension are proportional to each other. As a byproduct, we also establish the central limit theorem of the leading sample spiked eigenvalues by removing the block diagonal assumption on the population covariance matrix, which is commonly needed in the literature. Moreover, we propose consistent estimators of the $L_4$ norm of the spiked population eigenvectors. Based on these results, we develop a new statistic to test the equality of two spiked population covariance matrices. Numerical studies show that the new test procedure is more powerful than some existing methods.
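The spiked covariance setting of the last abstract can be visualized with a toy simulation (an illustration of the standard phenomenon, not the paper's own experiment): in the proportional regime $p/n$ fixed, one planted spike pushes the leading sample eigenvalue out of the Marchenko-Pastur bulk while the remaining eigenvalues stay inside it.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 1000  # dimension and sample size, proportional regime p/n = 0.2

# Population covariance: identity plus one spike of strength 5 along e_1,
# i.e. variance 1 + 5 = 6 in the spiked direction.
spike = 5.0
X = rng.normal(size=(n, p))
X[:, 0] *= np.sqrt(1.0 + spike)

S = X.T @ X / n                  # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)  # ascending order

bulk_edge = (1 + np.sqrt(p / n)) ** 2  # Marchenko-Pastur upper edge, ~2.09 here
print(eigvals[-1] > bulk_edge)          # leading eigenvalue separates from the bulk
print(eigvals[-2] < bulk_edge + 0.2)    # the rest stay near the bulk edge
```

The separated leading eigenvalue concentrates near $(1+\text{spike})(1 + (p/n)/\text{spike}) \approx 6.24$, which is the classical BBP-type prediction; the paper's contribution concerns the joint fluctuations of such spiked eigenvalues and linear spectral statistics.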

Deep learning applied to weather forecasting has started gaining popularity because of the progress achieved by data-driven models. The present paper compares four different deep learning architectures to perform weather prediction on daily data gathered from 18 cities across Europe and spanning a period of 15 years. The four proposed models investigate different types of input representations (i.e. tensorial unistream vs. multi-stream matrices) as well as the combination of convolutional neural networks and LSTMs (i.e. cascaded vs. ConvLSTM). In particular, we show that a model that uses a multi-stream input representation and that processes each lag individually, combined with a cascaded convolution and LSTM, is capable of better forecasting than the other compared models. In addition, we show that visualization techniques such as occlusion analysis and score maximization can give additional insight into the most important features and cities for predicting a particular target feature and city.
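The occlusion analysis mentioned above can be sketched in a few lines: mask one input region at a time, re-score the model, and rank regions by how much the score drops. This is a generic illustration with a stand-in linear "model" whose weights are an assumption for the example, not the paper's trained networks.

```python
import numpy as np

# Stand-in "model": a fixed linear scorer over a 4x4 input grid.
# Importance is concentrated on cell (1, 2) -- an assumption for illustration.
weights = np.zeros((4, 4))
weights[1, 2] = 5.0
weights[0, 0] = 1.0

def score(x):
    """Model output for input grid x."""
    return float((weights * x).sum())

x = np.ones((4, 4))  # an input in which every cell is active
base = score(x)

# Occlusion analysis: zero out each cell in turn and record the score drop.
drops = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        occluded = x.copy()
        occluded[i, j] = 0.0
        drops[i, j] = base - score(occluded)

# The largest drop identifies the most important input cell.
print(np.unravel_index(drops.argmax(), drops.shape))  # (1, 2)
```

For the paper's setting, the grid cells would correspond to cities or input features, and the scorer to the trained forecasting network; the resulting drop map is what the occlusion visualization displays.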