## Statistics (stat) updates on the arXiv.org e-print archive



The current tsunami of deep learning (the hyper-vitamined return of artificial neural networks) applies not only to traditional statistical machine learning tasks: prediction and classification (e.g., for weather prediction and pattern recognition), but has already conquered other areas, such as translation. A growing area of application is the generation of creative content: in particular the case of music, the topic of this paper. The motivation is in using the capacity of modern deep learning techniques to automatically learn musical styles from arbitrary musical corpora and then to generate musical samples from the estimated distribution, with some degree of control over the generation. This article provides a survey of music generation based on deep learning techniques. After a short introduction to the topic illustrated by a recent exemple, the article analyses some early works from the late 1980s using artificial neural networks for music generation and how their pioneering contributions foreshadowed current techniques. Then, we introduce some conceptual framework to analyze the various concepts and dimensions involved. Various examples of recent systems are introduced and analyzed to illustrate the variety of concerns and of techniques.

High-capacity models require vast amounts of data, and data augmentation is a common remedy when this resource is limited. Standard augmentation techniques apply small hand-tuned transformations to existing data, which is a brittle process that realistically only allows for simple transformations. We propose a Bayesian interpretation of data augmentation where the transformations are modelled as latent variables to be marginalized, and show how these can be inferred variationally in an end-to-end fashion. This allows for significantly more complex transformations than manual tuning, and the marginalization implies a form of test-time data augmentation. The resulting model can be interpreted as a probabilistic extension of spatial transformer networks. Experimentally, we demonstrate improvements in accuracy and uncertainty quantification in image and time series classification tasks.

Sparsity-inducing regularization problems are ubiquitous in machine learning applications, ranging from feature selection to model compression. In this paper, we present a novel stochastic method -- Orthant Based Proximal Stochastic Gradient Method (OBProx-SG) -- to solve perhaps the most popular instance, i.e., the l1-regularized problem. The OBProx-SG method contains two steps: (i) a proximal stochastic gradient step to predict a support cover of the solution; and (ii) an orthant step to aggressively enhance the sparsity level via orthant face projection. Compared to the state-of-the-art methods, e.g., Prox-SG, RDA and Prox-SVRG, the OBProx-SG not only converges to the global optimal solutions (in convex scenario) or the stationary points (in non-convex scenario), but also promotes the sparsity of the solutions substantially. Particularly, on a large number of convex problems, OBProx-SG outperforms the existing methods comprehensively in the aspect of sparsity exploration and objective values. Moreover, the experiments on non-convex deep neural networks, e.g., MobileNetV1 and ResNet18, further demonstrate its superiority by achieving the solutions of much higher sparsity without sacrificing generalization accuracy.

In this paper, we identify a new phenomenon called activation-divergence which occurs in Federated Learning (FL) due to data heterogeneity (i.e., data being non-IID) across multiple users. Specifically, we argue that the activation vectors in FL can diverge, even if subsets of users share a few common classes with data residing on different devices. To address the activation-divergence issue, we introduce a prior based on the principle of maximum entropy; this prior assumes minimal information about the per-device activation vectors and aims at making the activation vectors of same classes as similar as possible across multiple devices. Our results show that, for both IID and non-IID settings, our proposed approach results in better accuracy (due to the significantly more similar activation vectors across multiple devices), and is more communication-efficient than state-of-the-art approaches in FL. Finally, we illustrate the effectiveness of our approach on a few common benchmarks and two large medical datasets.

Neural approaches to natural language processing (NLP) often fail at the logical reasoning needed for deeper language understanding. In particular, neural approaches to reasoning that rely on embedded \emph{generalizations} of a knowledge base (KB) implicitly model which facts that are \emph{plausible}, but may not model which facts are \emph{true}, according to the KB. While generalizing the facts in a KB is useful for KB completion, the inability to distinguish between plausible inferences and logically entailed conclusions can be problematic in settings like as KB question answering (KBQA). We propose here a novel KB embedding scheme that supports generalization, but also allows accurate logical reasoning with a KB. Our approach introduces two new mechanisms for KB reasoning: neural retrieval over a set of embedded triples, and "memorization" of highly specific information with a compact sketch structure. Experimentally, this leads to substantial improvements over the state-of-the-art on two KBQA benchmarks.

The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of "big data" streaming support they often require for data analysis, is nowadays pushing for an increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge, are more and more investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis like anomaly detection and security, but introducing new challenges for handling the vast amount of data it produces. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs, that involves AI-powered edge computing on high-resolution power consumption. The method -- called pAElla -- targets real-time Malware Detection (MD), it runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with State-of-the-Art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release open dataset and code.

In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response -- in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.

Deep learning based models have surpassed classical machine learning based approaches in various text classification tasks, including sentiment analysis, news categorization, question answering, and natural language inference. In this work, we provide a detailed review of more than 150 deep learning based models for text classification developed in recent years, and discuss their technical contributions, similarities, and strengths. We also provide a summary of more than 40 popular datasets widely used for text classification. Finally, we provide a quantitative analysis of the performance of different deep learning models on popular benchmarks, and discuss future research directions.

This paper proposes a novel framework for the segmentation of phonocardiogram (PCG) signals into heart states, exploiting the temporal evolution of the PCG as well as considering the salient information that it provides for the detection of the heart state. We propose the use of recurrent neural networks and exploit recent advancements in attention based learning to segment the PCG signal. This allows the network to identify the most salient aspects of the signal and disregard uninformative information. The proposed method attains state-of-the-art performance on multiple benchmarks including both human and animal heart recordings. Furthermore, we empirically analyse different feature combinations including envelop features, wavelet and Mel Frequency Cepstral Coefficients (MFCC), and provide quantitative measurements that explore the importance of different features in the proposed approach. We demonstrate that a recurrent neural network coupled with attention mechanisms can effectively learn from irregular and noisy PCG recordings. Our analysis of different feature combinations shows that MFCC features and their derivatives offer the best performance compared to classical wavelet and envelop features. Heart sound segmentation is a crucial pre-processing step for many diagnostic applications. The proposed method provides a cost effective alternative to labour extensive manual segmentation, and provides a more accurate segmentation than existing methods. As such, it can improve the performance of further analysis including the detection of murmurs and ejection clicks. The proposed method is also applicable for detection and segmentation of other one dimensional biomedical signals.

The global expansion of maritime activities and the development of the Automatic Identification System (AIS) have driven the advances in maritime monitoring systems in the last decade. Monitoring vessel behavior is fundamental to safeguard maritime operations, protecting other vessels sailing the ocean and the marine fauna and flora. Given the enormous volume of vessel data continually being generated, real-time analysis of vessel behaviors is only possible because of decision support systems provided with event and anomaly detection methods. However, current works on vessel event detection are ad-hoc methods able to handle only a single or a few predefined types of vessel behavior. Most of the existing approaches do not learn from the data and require the definition of queries and rules for describing each behavior. In this paper, we discuss challenges and opportunities in classical machine learning and deep learning for vessel event and anomaly detection. We hope to motivate the research of novel methods and tools, since addressing these challenges is an essential step towards actual intelligent maritime monitoring systems.

Recently, the Wasserstein loss function has been proven to be effective when applied to deterministic full-waveform inversion (FWI) problems. We consider the application of this loss function in Bayesian FWI so that the uncertainty can be captured in the solution. Other loss functions that are commonly used in practice are also considered for comparison. Existence and stability of the resulting Gibbs posteriors are shown on function space under weak assumptions on the prior and model. In particular, the distribution arising from the Wasserstein loss is shown to be quite stable with respect to high-frequency noise in the data. We then illustrate the difference between the resulting distributions numerically, using Laplace approximations and dimension-robust MCMC to estimate the unknown velocity field and uncertainty associated with the estimates.

We present a locality preserving loss (LPL)that improves the alignment between vector space representations (i.e., word or sentence embeddings) while separating (increasing distance between) uncorrelated representations as compared to the standard method that minimizes the mean squared error (MSE) only. The locality preserving loss optimizes the projection by maintaining the local neighborhood of embeddings that are found in the source, in the target domain as well. This reduces the overall size of the dataset required to the train model. We argue that vector space alignment (with MSE and LPL losses) acts as a regularizer in certain language-based classification tasks, leading to better accuracy than the base-line, especially when the size of the training set is small. We validate the effectiveness ofLPL on a cross-lingual word alignment task, a natural language inference task, and a multi-lingual inference task.

Mixture modeling that takes account of potential heterogeneity in data is widely adopted for classification and clustering problems. However, it can be sensitive to outliers especially when the mixture components are Gaussian. In this paper, we introduce the robust estimating methods using the weighted complete estimating equations for robust fitting of multivariate mixture models. The proposed approach is based on a simple modification of the complete estimating equation given the latent variables for grouping indicators with the weights that depend on the components of mixture distributions for downweighting outliers. We develop a simple expectation-estimating-equation (EEE) algorithm to solve the weighted complete estimating equations. As examples, the multivariate Gaussian mixture, mixture of experts and multivariate skew normal mixture are considered. In particular, we derive a novel EEE algorithm for the skew normal mixture which results in the closed form expressions for both the E- and EE-steps by slightly extending the proposed method. The numerical performance of the proposed method is examined through the simulated and real datasets.

Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounders or measurement errors. We focus on high-dimensional linear regression settings, where the measured covariates are affected by hidden confounding. We propose the Doubly Debiased Lasso estimator for single components of the regression coefficient vector. Our advocated method is novel as it simultaneously corrects both the bias due to estimating the high-dimensional parameters as well as the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

Generative Adversarial Networks (GANs) have gained significant attention in recent years, with particularly impressive applications highlighted in computer vision. In this work, we present a Mixture Density Conditional Generative Adversarial Model (MD-CGAN), where the generator is a Gaussian mixture model, with a focus on time series forecasting. Compared to examples in vision, there have been more limited applications of GAN models to time series. We show that our model is capable of estimating a probabilistic posterior distribution over forecasts and that, in comparison to a set of benchmark methods, the MD-CGAN model performs well, particularly in situations where noise is a significant in the time series. Further, by using a Gaussian mixture model that allows for a flexible number of mixture coefficients, the MD-CGAN offers posterior distributions that are non-Gaussian.

The graph matching problem aims to find the latent vertex correspondence between two edge-correlated graphs and has many practical applications. In this work, we study a version of the seeded graph matching problem, which assumes that a set of seeds, i.e., pre-mapped vertex-pairs, is given in advance. Specifically, consider two correlated graphs whose edges are sampled independently with probability $s$ from a parent \ER graph $\mathcal{G}(n,p)$. Furthermore, a mapping between the vertices of the two graphs is provided as seeds, of which an unknown $\beta$ fraction is correct. This problem was first studied in \cite{lubars2018correcting} where an algorithm is proposed and shown to perfectly recover the correct vertex mapping with high probability if $\beta\geq\max\left\{\frac{8}{3}p,\frac{16\log{n}}{nps^2}\right\}$. We improve their condition to $\beta\geq\max\left\{30\sqrt{\frac{\log n}{n(1-p)^2s^2}},\frac{45\log{n}}{np(1-p)^2s^2}\right)$. However, when $p=O\left( \sqrt{{\log n}/{ns^2}}\right)$, our improved condition still requires that $\beta$ must increase inversely proportional to $np$. In order to improve the matching performance for sparse graphs, we propose a new algorithm that uses "witnesses" in the 2-hop neighborhood, instead of only 1-hop neighborhood as in \cite{lubars2018correcting}. We show that when $np^2\leq\frac{1}{135\log n}$, our new algorithm can achieve perfect recovery with high probability if $\beta\geq\max\left\{900\sqrt{\frac{np^3(1-s)\log n}{s}},600\sqrt{\frac{\log n}{ns^4}}, \frac{1200\log n}{n^2p^2s^4}\right\}$ and $nps^2\geq 128\log n$. Numerical experiments on both synthetic and real graphs corroborate our theoretical findings and show that our 2-hop algorithm significantly outperforms the 1-hop algorithm when the graphs are relatively sparse.

Cluster structure detection is a fundamental task for the analysis of graphs, in order to understand and to visualize their functional characteristics. Among the different cluster structure detection methods, spectral clustering is currently one of the most widely used due to its speed and simplicity. Yet, there are few theoretical guarantee to recover the underlying partitions of the graph for general models. This paper therefore presents a variant of spectral clustering, called 1-spectral clustering, performed on a new random model closely related to stochastic block model. Its goal is to promote a sparse eigenbasis solution of a 1 minimization problem revealing the natural structure of the graph. The effectiveness and the robustness to small noise perturbations of our technique is confirmed through a collection of simulated and real data examples.

We consider the problem of fitting a polynomial to a set of data points, each data point consisting of a feature vector and a response variable. In contrast to standard least-squares polynomial regression, we require that the polynomial regressor satisfy shape constraints, such as monotonicity with respect to a variable, Lipschitz-continuity, or convexity over a region. Constraints of this type appear quite frequently in a number of areas including economics, operations research, and pricing. We show how to use semidefinite programming to obtain polynomial regressors that have these properties. We further show that, under some assumptions on the generation of the data points, the regressors obtained are consistent estimators of the underlying shape-constrained function that maps the feature vectors to the responses. We apply our methodology to the US KLEMS dataset to estimate production of a sector as a function of capital, energy, labor, materials, and services. We observe that it outperforms the more traditional approach (which consists in modelling the production curves as Cobb-Douglas functions) on 50 out of the 65 industries listed in the KLEMS database.

This paper presents methodologies for solving a common nuclear engineering problem using a suitable mathematical framework. Besides its potential for more general applications, this abstract formalization of the problem provides an improved robustness to the solution compared to the empirical treatment used in industrial practice of today. The essence of the paper proposes a sequential design for a stochastic simulator experiment to maximize a computer output y(x). The complications present in applications of interest are (1) the input x is an element of an unknown subset of a positive hyperplane and (2) y(x) is measured with error. The training data for this problem are a collection of historical inputs x corresponding to runs of a physical system that is linked to the simulator and the associated y(x). Two methods are provided for estimating the input domain. An extension of the well-known efficient global optimization (EGO) algorithm is presented to solve the optimization problem. An example of application of the method is given in which patterns of the "combustion rate" of fissile spent fuel rods are determined to maximize the computed k-effective taken to be the "criticality coefficient".

Results. The PCA confirms the existence of the Gnevyshev gap (GG) for solar cycles at about 40% from the start of the cycle. The temporal evolution of sunspot area data for even cycles shows that the GG exists at least at the 95% confidence level for all sizes of sunspots. On the other hand, the GG is shorter and statistically insignificant for the odd cycles of aerial sunspot data. Furthermore, the analysis of sunspot area sizes for even and odd cycles of SC12-SC23 shows that the greatest difference is at 4.2-4.6 years, where even cycles have a far smaller total area than odd cycles. The average area of the individual sunspots of even cycles is also smaller in this interval. The statistical analysis of the temporal evolution shows that northern sunspot groups maximise earlier than southern groups for even cycles, but are concurrent for odd cycles. Furthermore, the temporal distributions of odd cycles are slightly more leptokurtic than distributions of even cycles. The skewnesses are 0.37 and 0.49 and the kurtoses 2.79 and 2.94 for even and odd cycles, respectively. The correlation coefficient between skewness and kurtosis for even cycles is 0.69, and for odd cycles, it is 0.90. Conclusions. The separate PCAs for even and odd sunspot cycles show that odd cycles are more inhomogeneous than even cycles, especially in GSN data. Even cycles, however, have two anomalous cycles: SC4 and SC6. According to the analysis of the sunspot area size data, the GG is more distinct in even than odd cycles. We also present another Waldmeier-type rule, that is, we find a correlation between skewness and kurtosis of the sunspot group cycles.

In this paper we re-consider the results of Athanasopoulos et al. (2019), where the forecasts of the Australian quarterly series which form the Gross Domestic Product (GDP) at current prices are separately reconciled from Income (16 time series) and Expenditure (80 time series) sides. We instead propose a complete reconciliation strategy, resulting in a one number forecast' of the GDP figure, coherent with both sides' forecasted series, and evaluate the performance of the reconciled forecasts according to the new proposal.

Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.

This paper makes a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. We propose and compare several different approaches to accomplish compatibility. Our experiments on CIFAR-10 show that: (i) we can train networks to produce compatible features, without degrading task accuracy compared to training networks independently; (ii) the degree of compatibility is highly dependent on where we split the network into a feature extractor and a classification head; (iii) random initialization has a large effect on compatibility; (iv) we can train incrementally: given previously trained components, we can train new ones which are also compatible with them. This work is part of a larger goal to increase network reusability: we envision that compatibility will enable solving new tasks by mixing and matching suitable components.

Here, we propose an unsupervised fuzzy rule-based dimensionality reduction method primarily for data visualization. It considers the following important issues relevant to dimensionality reduction-based data visualization: (i) preservation of neighborhood relationships, (ii) handling data on a non-linear manifold, (iii) the capability of predicting projections for new test data points, (iv) interpretability of the system, and (v) the ability to reject test points if required. For this, we use a first-order Takagi-Sugeno type model. We generate rule antecedents using clusters in the input data. In this context, we also propose a new variant of the Geodesic c-means clustering algorithm. We estimate the rule parameters by minimizing an error function that preserves the inter-point geodesic distances (distances over the manifold) as Euclidean distances on the projected space. We apply the proposed method on three synthetic and three real-world data sets and visually compare the results with four other standard data visualization methods. The obtained results show that the proposed method behaves desirably and performs better than or comparable to the methods compared with. The proposed method is found to be robust to the initial conditions. The predictability of the proposed method for test points is validated by experiments. We also assess the ability of our method to reject output points when it should. Then, we extend this concept to provide a general framework for learning an unsupervised fuzzy model for data projection with different objective functions. To the best of our knowledge, this is the first attempt to manifold learning using unsupervised fuzzy modeling.

Multivariate subordinated L\'evy processes are widely employed in finance for modeling multivariate asset returns. We propose to exploit non-linear dependence among financial assets through multivariate cumulants of these processes, for which we provide a closed form formula by using the multi-index generalized Bell polynomials. Using multivariate cumulants, we perform a sensitivity analysis, to investigate non-linear dependence as a function of the model parameters driving the dependence structure

In multi-label learning, the issue of missing labels brings a major challenge. Many methods attempt to recovery missing labels by exploiting low-rank structure of label matrix. However, these methods just utilize global low-rank label structure, ignore both local low-rank label structures and label discriminant information to some extent, leaving room for further performance improvement. In this paper, we develop a simple yet effective discriminant multi-label learning (DM2L) method for multi-label learning with missing labels. Specifically, we impose the low-rank structures on all the predictions of instances from the same labels (local shrinking of rank), and a maximally separated structure (high-rank structure) on the predictions of instances from different labels (global expanding of rank). In this way, these imposed low-rank structures can help modeling both local and global low-rank label structures, while the imposed high-rank structure can help providing more underlying discriminability. Our subsequent theoretical analysis also supports these intuitions. In addition, we provide a nonlinear extension via using kernel trick to enhance DM2L and establish a concave-convex objective to learn these models. Compared to the other methods, our method involves the fewest assumptions and only one hyper-parameter. Even so, extensive experiments show that our method still outperforms the state-of-the-art methods.

A bootstrap procedure for constructing pointwise or simultaneous prediction intervals for a stationary functional time series is proposed. The procedure exploits a general vector autoregressive representation of the time-reversed series of Fourier coefficients appearing in the Karhunen-Lo\{e}ve representation of the functional process. It generates backwards-in-time, functional replicates that adequately mimic the dependence structure of the underlying process and have the same conditionally fixed curves at the end of each functional pseudo-time series. The bootstrap prediction error distribution is then calculated as the difference between the model-free, bootstrap-generated future functional observations and the functional forecasts obtained from the model used for prediction. This allows the estimated prediction error distribution to account for not only the innovation and estimation errors associated with prediction but also the possible errors from model misspecification. We show the asymptotic validity of the bootstrap in estimating the prediction error distribution of interest. Furthermore, the bootstrap procedure allows for the construction of prediction bands that achieve (asymptotically) the desired coverage. These prediction bands are based on a consistent estimation of the distribution of the studentized prediction error process. Through a simulation study and the analysis of two data sets, we demonstrate the capabilities and the good finite-sample performance of the proposed method.

Previous work on symmetric group equivariant neural networks generally only considered the case where the group acts by permuting the elements of a single vector. In this paper we derive formulae for general permutation equivariant layers, including the case where the layer acts on matrices by permuting their rows and columns simultaneously. This case arises naturally in graph learning and relation learning applications. As a specific case of higher order permutation equivariant networks, we present a second order graph variational encoder, and show that the latent distribution of equivariant generative models must be exchangeable. We demonstrate the efficacy of this architecture on the tasks of link prediction in citation graphs and molecular graph generation.

We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. Calculating mutual information is intractable in this setting. Our key technical contribution is an adversarial objective that can be used to tractably estimate mutual information assuming only the feasibility of cross entropy calculation. We develop a concrete realization of this general formulation with Markov distributions over binary encodings. We report critical and unexpected findings on practical aspects of the objective such as the choice of variational priors. We apply our model on document hashing and show that it outperforms current best baselines based on discrete and vector quantized variational autoencoders. It also yields highly compressed interpretable representations.

Graph convolutional networks (GCNs) have gained popularity due to high performance achievable on several downstream tasks including node classification. Several architectural variants of these networks have been proposed and investigated with experimental studies in the literature. Motivated by a recent work on simplifying GCNs, we study the problem of designing other variants and propose a framework to compose networks using building blocks of GCN. The framework offers flexibility to compose and evaluate different networks using feature and/or label propagation networks, linear or non-linear networks, with each composition having different computational complexity. We conduct a detailed experimental study on several benchmark datasets with many variants and present observations from our evaluation. Our empirical experimental results suggest that several newly composed variants are useful alternatives to consider because they are as competitive as, or better than the original GCN.

The signature in rough path theory provides a graduated summary of a path through an examination of the effects of its increments. Inspired by recent developments of signature features in the context of machine learning, we explore a transformation that is able to embed the effect of the absolute position of the data stream into signature features. This unified feature is particularly effective for its simplifying role in allowing the signature feature set to accommodate nonlinear functions of absolute and relative values.

We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from Chinese Center Disease for Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.

Public transportation system commuters are often interested in getting accurate travel time information to plan their daily activities. However, this information is often difficult to predict accurately due to the irregularities of road traffic, caused by factors such as weather conditions, road accidents, and traffic jams. In this study, two neural network models namely multi-layer(MLP) perceptron and long short-term model(LSTM) are developed for predicting link travel time of a busy route with input generated using Origin-Destination travel time matrix derived from a historical GPS dataset. The experiment result showed that both models can make near-accurate predictions however, LSTM is more susceptible to noise as time step increases.

In recent years, it has become crucial to improve the resilience of electricity distribution networks (DNs) against storm-induced failures. Microgrids enabled by Distributed Energy Resources (DERs) can significantly help speed up re-energization of loads, particularly in the complete absence of bulk power supply. We describe an integrated approach which considers a pre-storm DER allocation problem under the uncertainty of failure scenarios as well as a post-storm dispatch problem in microgrids during the multi-period repair of the failed components. This problem is computationally challenging because the number of scenarios (resp. binary variables) increases exponentially (resp. quadratically) in the network size. Our overall solution approach for solving the resulting two-stage mixed-integer linear program (MILP) involves implementing the sample average approximation (SAA) method and Benders Decomposition. Additionally, we implement a greedy approach to reduce the computational time requirements of the post-storm repair scheduling and dispatch problem. The optimality of the resulting solution is evaluated on a modified IEEE 36-node network.

We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.

In this paper, robust deep learning frameworks are introduced, aims to detect respiratory diseases from respiratory sound inputs. The entire processes firstly begins with a front-end feature extraction that transforms recordings into spectrograms. Next, a back-end deep learning model classifies the spectrogram features into categories of respiratory disease or anomaly. Experiments are conducted over the ICBHI benchmark dataset of respiratory sounds. According to obtained experimental results, we make three main contributions toward lung-sound analysis: Firstly, we provide an extensive analysis on common factors (type of spectrogram, time resolution, cycle length, or data augmentation, etc.) that affect final prediction accuracy in a deep learning based system. Secondly, we propose novel deep learning based frameworks by using the most influencing factors indicated. As a result, the proposed deep learning frameworks outperforms state of the art methods. Finally, we successfully to apply the Teacher-Student scheme to solve the trade-off between model performance and model size that helps to increase ability of building real-time applications.

The ability to learn in dynamic, nonstationary environments without forgetting previous knowledge, also known as Continual Learning (CL), is a key enabler for scalable and trustworthy deployments of adaptive solutions. While the importance of continual learning is largely acknowledged in machine vision and reinforcement learning problems, this is mostly under-documented for sequence processing tasks. This work proposes a Recurrent Neural Network (RNN) model for CL that is able to deal with concept drift in input distribution without forgetting previously acquired knowledge. We also implement and test a popular CL approach, Elastic Weight Consolidation (EWC), on top of two different types of RNNs. Finally, we compare the performances of our enhanced architecture against EWC and RNNs on a set of standard CL benchmarks, adapted to the sequential data processing scenario. Results show the superior performance of our architecture and highlight the need for special solutions designed to address CL in RNNs.

Expectiles define a least squares analogue of quantiles. They have lately received substantial attention in actuarial and financial risk management contexts. Unlike quantiles, expectiles define coherent risk measures and are determined by tail expectations rather than tail probabilities; unlike the popular Expected Shortfall, they define elicitable risk measures. This has motivated the study of the behaviour and estimation of extreme expectiles in some of the recent statistical literature. The case of stationary but weakly dependent observations has, however, been left largely untouched, even though correctly accounting for the uncertainty present in typical financial applications requires the consideration of dependent data. We investigate here the theoretical and practical behaviour of two classes of extreme expectile estimators in a strictly stationary $\beta-$mixing context, containing the classes of ARMA, ARCH and GARCH models with heavy-tailed innovations that are of interest in financial applications. We put a particular emphasis on the construction of asymptotic confidence intervals adapted to the dependence framework, whose performance we contrast with that of the naive intervals obtained from the theory of independent and identically distributed data. The methods are showcased in a numerical simulation study and on real financial data.

When trained effectively, the Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language. In this paper, we propose the first large-scale language VAE model, Optimus. A universal latent embedding space for sentences is first pre-trained on large text corpus, and then fine-tuned for various language generation and understanding tasks. Compared with GPT-2, Optimus enables guided language generation from an abstract level using the latent vectors. Compared with BERT, Optimus can generalize better on low-resource language understanding tasks due to the smooth latent space structure. Extensive experimental results on a wide range of language tasks demonstrate the effectiveness of Optimus. It achieves new state-of-the-art on VAE language modeling benchmarks. We hope that our first pre-trained big VAE language model itself and results can help the NLP community renew the interests of deep generative models in the era of large-scale pre-training, and make these principled methods more practical.

The recent advances in deep learning indicate significant progress in the field of single image super-resolution. With the advent of these techniques, high-resolution image with high peak signal to noise ratio (PSNR) and excellent perceptual quality can be reconstructed. The major challenges associated with existing deep convolutional neural networks are their computational complexity and time; the increasing depth of the networks, often result in high space complexity. To alleviate these issues, we developed an innovative shallow residual feature representative network (SRFRN) that uses a bicubic interpolated low-resolution image as input and residual representative units (RFR) which include serially stacked residual non-linear convolutions. Furthermore, the reconstruction of the high-resolution image is done by combining the output of the RFR units and the residual output from the bicubic interpolated LR image. Finally, multiple experiments have been performed on the benchmark datasets and the proposed model illustrates superior performance for higher scales. Besides, this model also exhibits faster execution time compared to all the existing approaches.

Deep speaker embedding has demonstrated state-of-the-art performance in audio speaker recognition (SRE). However, one potential issue with this approach is that the speaker vectors derived from deep embedding models tend to be non-Gaussian for each individual speaker, and non-homogeneous for distributions of different speakers. These irregular distributions can seriously impact SRE performance, especially with the popular PLDA scoring method, which assumes homogeneous Gaussian distribution. In this paper, we argue that deep speaker vectors require deep normalization, and propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model. We demonstrate the effectiveness of the proposed approach with experiments using the widely used SITW and CNCeleb corpora. In these experiments, the DNF-based normalization delivered substantial performance gains and also showed strong generalization capability in out-of-domain tests.

Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix, thus providing a path for the propagation of information about the quality of the speech segment into a PLDA scoring backend. These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments. The proposed probabilistic embeddings (x-vectors with precisions) are interfaced with the PLDA model by treating the x-vectors as hidden variables and marginalizing them out. We apply the proposed probabilistic embeddings as input to an agglomerative hierarchical clustering (AHC) algorithm to do diarization in the DIHARD'19 evaluation set. We compute the full PLDA likelihood 'by the book' for each clustering hypothesis that is considered by AHC. We do joint discriminative training of the PLDA parameters and of the probabilistic x-vector extractor. We demonstrate accuracy gains relative to a baseline AHC algorithm, applied to traditional xvectors (without uncertainty), and which uses averaging of binary log-likelihood-ratios, rather than by-the-book scoring.

Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered, or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the embedded features in the hidden layers instead of on the physical spectral features commonly used in speech separation tasks. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.

Blockchain-enabled Federated Learning (BFL) enables model updates of Federated Learning (FL) to be stored in the blockchain in a secure and reliable manner. However, the issue of BFL is that the training latency may increase due to the blockchain mining process. The other issue is that mobile devices in BFL have energy and CPU constraints that may reduce the system lifetime and training efficiency. To address these issues, the Machine Learning Model Owner (MLMO) needs to (i) decide how much data and energy that the mobile devices use for the training and (ii) determine the mining difficulty to minimize the training latency and energy consumption while achieving the target model accuracy. Under the uncertainty of the BFL environment, it is challenging for the MLMO to determine the optimal decisions. We propose to use the Deep Reinforcement Learning (DRL) to derive the optimal decisions for the MLMO.

Deep neural networks have experimentally demonstrated superior performance over other machine learning approaches in decision-making predictions. However, one major concern is the closed set nature of the classification decision on the trained classes, which can have serious consequences in safety critical systems. When the deep neural network is in a streaming environment, fast interpretation of this classification is required to determine if the classification result is trusted. Un-trusted classifications can occur when the input data to the deep neural network changes over time. One type of change that can occur is concept evolution, where a new class is introduced that the deep neural network was not trained on. In the majority of deep neural network architectures, the only option is to assign this instance to one of the classes it was trained on, which would be incorrect. The aim of this research is to detect the arrival of a new class in the stream. Existing work on interpreting deep neural networks often focuses on neuron activations to provide visual interpretation and feature extraction. Our novel approach, coined DeepStreamCE, uses streaming approaches for real-time concept evolution detection in deep neural networks. DeepStreamCE applies neuron activation reduction using an autoencoder and MCOD stream-based clustering in the offline phase. Both outputs are used in the online phase to analyse the neuron activations in the evolving stream in order to detect concept evolution occurrence in real time. We evaluate DeepStreamCE by training VGG16 convolutional neural networks on combinations of data from the CIFAR-10 dataset, holding out some classes to be used as concept evolution. For comparison, we apply the data and VGG16 networks to an open-set deep network solution - OpenMax. DeepStreamCE outperforms OpenMax when identifying concept evolution for our datasets.

In this paper we investigate some of the issues that arise from the scalarization of the multi-objective optimization problem in the Advantage Actor Critic (A2C) reinforcement learning algorithm. We show how a naive scalarization leads to gradients overlapping and we also argue that the entropy regularization term just inject uncontrolled noise into the system. We propose two methods: one to avoid gradient overlapping (NOG) but keeping the same loss formulation; and one to avoid the noise injection (TE) but generating action distributions with a desired entropy. A comprehensive pilot experiment has been carried out showing how using our proposed methods speeds up the training of 210%. We argue how the proposed solutions can be applied to all the Advantage based reinforcement learning algorithms.

We present CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features. CURL outperforms prior pixel-based methods, both model-based and model-free, on complex tasks in the DeepMind Control Suite and Atari Games showing 2.8x and 1.6x performance gains respectively at the 100K interaction steps benchmark. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency and performance of methods that use state-based features.

One of the greatest obstacles in the adoption of deep neural networks for new applications is that training the network typically requires a large number of manually labeled training samples. We empirically investigate the scenario where one has access to large amounts of unlabeled data but require labeling only a single prototypical sample per class in order to train a deep network (i.e., one-shot semi-supervised learning). Specifically, we investigate the recent results reported in FixMatch for one-shot semi-supervised learning to understand the factors that affect and impede high accuracies and reliability for one-shot semi-supervised learning of Cifar-10. For example, we discover that one barrier to one-shot semi-supervised learning for high-performance image classification is the unevenness of class accuracy during the training. These results point to solutions that might enable more widespread adoption of one-shot semi-supervised training methods for new applications.

Modern RNA sequencing technologies provide gene expression measurements from single cells that promise refined insights on regulatory relationships among genes. Directed graphical models are well-suited to explore such (cause-effect) relationships. However, statistical analyses of single cell data are complicated by the fact that the data often show zero-inflated expression patterns. To address this challenge, we propose directed graphical models that are based on Hurdle conditional distributions parametrized in terms of polynomials in parent variables and their 0/1 indicators of being zero or nonzero. While directed graphs for Gaussian models are only identifiable up to an equivalence class in general, we show that, under a natural and weak assumption, the exact directed acyclic graph of our zero-inflated models can be identified. We propose methods for graph recovery, apply our model to real single-cell RNA-seq data on T helper cells, and show simulated experiments that validate the identifiability and graph estimation methods in practice.

In this paper, we focus on weakly supervised learning with noisy training data for both classification and regression problems.We assume that the training outputs are collected from a mixture of a target and correlated noise distributions.Our proposed method simultaneously estimates the target distribution and the quality of each data which is defined as the correlation between the target and data generating distributions.The cornerstone of the proposed method is a Cholesky Block that enables modeling dependencies among mixture distributions in a differentiable manner where we maintain the distribution over the network weights.We first provide illustrative examples in both regression and classification tasks to show the effectiveness of the proposed method.Then, the proposed method is extensively evaluated in a number of experiments where we show that it constantly shows comparable or superior performances compared to existing baseline methods in the handling of noisy data.

To conduct Bayesian inference with large data sets, it is often convenient or necessary to distribute the data across multiple machines. We consider a likelihood function expressed as a product of terms, each associated with a subset of the data. Inspired by global variable consensus optimisation, we introduce an instrumental hierarchical model associating auxiliary statistical parameters with each term, which are conditionally independent given the top-level parameters. One of these top-level parameters controls the unconditional strength of association between the auxiliary parameters. This model leads to a distributed MCMC algorithm on an extended state space yielding approximations of posterior expectations. A trade-off between computational tractability and fidelity to the original model can be controlled by changing the association strength in the instrumental model. We further propose the use of a SMC sampler with a sequence of association strengths, allowing both the automatic determination of appropriate strengths and for a bias correction technique to be applied. In contrast to similar distributed Monte Carlo algorithms, this approach requires few distributional assumptions. The performance of the algorithms is illustrated with a number of simulated examples.

Symbolic data analysis (SDA) is an emerging area of statistics concerned with understanding and modelling data that takes distributional form (i.e. symbols), such as random lists, intervals and histograms. It was developed under the premise that the statistical unit of interest is the symbol, and that inference is required at this level. Here we consider a different perspective, which opens a new research direction in the field of SDA. We assume that, as with a standard statistical analysis, inference is required at the level of individual-level data. However, the individual-level data are aggregated into symbols - group-based distributional-valued summaries - prior to the analysis. In this way, large and complex datasets can be reduced to a smaller number of distributional summaries, that may be analysed more efficiently than the original dataset. As such, we develop SDA techniques as a new approach for the analysis of big data. In particular we introduce a new general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries. This approach opens the door for new classes of symbol design and construction, in addition to developing SDA as a viable tool to enable and improve upon classical data analyses, particularly for very large and complex datasets. We illustrate this new direction for SDA research through several real and simulated data analyses.

We consider a communication scenario, in which an intruder tries to determine the modulation scheme of the intercepted signal. Our aim is to minimize the accuracy of the intruder, while guaranteeing that the intended receiver can still recover the underlying message with the highest reliability. This is achieved by perturbing channel input symbols at the encoder, similarly to adversarial attacks against classifiers in machine learning. In image classification, the perturbation is limited to be imperceptible to a human observer, while in our case the perturbation is constrained so that the message can still be reliably decoded by the legitimate receiver, which is oblivious to the perturbation. Simulation results demonstrate the viability of our approach to make wireless communication secure against state-of-the-art intruders (using deep learning or decision trees) with minimal sacrifice in the communication performance. On the other hand, we also demonstrate that using diverse training data and curriculum learning can significantly boost the accuracy of the intruder.

Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle it, we find the procedure and datasets that are used to assess their progress lacking. To address this limitation, we propose Meta-Dataset: a new benchmark for training and evaluating models that is large-scale, consists of diverse datasets, and presents more realistic tasks. We experiment with popular baselines and meta-learners on Meta-Dataset, along with a competitive method that we propose. We analyze performance as a function of various characteristics of test tasks and examine the models' ability to leverage diverse training sources for improving their generalization. We also propose a new set of baselines for quantifying the benefit of meta-learning in Meta-Dataset. Our extensive experimentation has uncovered important research challenges and we hope to inspire work in these directions.

We study the problem of finding the optimal dosage in early stage clinical trials through the multi-armed bandit lens. We advocate the use of the Thompson Sampling principle, a flexible algorithm that can accommodate different types of monotonicity assumptions on the toxicity and efficacy of the doses. For the simplest version of Thompson Sampling, based on a uniform prior distribution for each dose, we provide finite-time upper bounds on the number of sub-optimal dose selections, which is unprecedented for dose-finding algorithms. Through a large simulation study, we then show that variants of Thompson Sampling based on more sophisticated prior distributions outperform state-of-the-art dose identification algorithms in different types of dose-finding studies that occur in phase I or phase I/II trials.

Recently, it has been shown that many functions on sets can be represented by sum decompositions. These decompositons easily lend themselves to neural approximations, extending the applicability of neural nets to set-valued inputs---Deep Set learning. This work investigates a core component of Deep Set architecture: aggregation functions. We suggest and examine alternatives to commonly used aggregation functions, including learnable recurrent aggregation functions. Empirically, we show that the Deep Set networks are highly sensitive to the choice of aggregation functions: beyond improved performance, we find that learnable aggregations lower hyper-parameter sensitivity and generalize better to out-of-distribution input size.

Super-resolution microscopy is rapidly gaining importance as an analytical tool in the life sciences. A compelling feature is the ability to label biological units of interest with fluorescent markers in living cells and to observe them with considerably higher resolution than conventional microscopy permits. The images obtained this way, however, lack an absolute intensity scale in terms of numbers of fluorophores observed. We provide an elaborate model to estimate this information from the raw data. To this end we model the entire process of photon generation in the fluorophore, their passage trough the microscope, detection and photo electron amplification in the camera, and extraction of time series from the microscopic images. At the heart of these modeling steps is a careful description of the fluorophore dynamics by a novel hidden Markov model that operates on two time scales (HTMM). Besides the fluorophore number, information about the kinetic transition rates of the fluorophore's internal states is also inferred during estimation. We comment on computational issues that arise when applying our model to simulated or measured fluorescence traces and illustrate our methodology on simulated data.

Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationary assumption on the environment is very restrictive. In many real world problems like traffic signal control, robotic applications, one often encounters situations with non-stationary environments and in these scenarios, RL methods yield sub-optimal decisions. In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal of this problem is to maximize the long-term discounted reward achieved when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect change in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method detects change in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem and a traffic signal control problem.

A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the predictions resulting from each learner. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and illustrate how transition point theory can be used for deciding when studies should be combined with an application from metabolomics.

We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps.

It is common to be interested in rankings or order relationships among entities. In complex settings where one does not directly measure a univariate statistic upon which to base ranks, such inferences typically rely on statistical models having entity-specific parameters. These can be treated as random effects in hierarchical models characterizing variation among the entities. We are particularly motivated by the problem of ranking basketball players in terms of their contribution to team performance. Using data from the United States National Basketball Association (NBA), we find that many players have similar latent ability levels, making any single estimated ranking highly misleading. The current literature fails to provide summaries of order relationships that adequately account for such uncertainty. Motivated by this, we propose a strategy for characterizing uncertainty in inferences on order relationships among players and lineups. Our approach adapts to scenarios in which uncertainty in ordering is high by producing more conservative results that improve interpretability. This is achieved through a reward function within a decision-theoretic framework. We apply our approach to data from the 2009-10 NBA season.

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.

In this paper we introduce a novel Bayesian data augmentation approach for estimating the parameters of the generalised logistic regression model. We propose a P\'olya-Gamma sampler algorithm that allows us to sample from the exact posterior distribution, rather than relying on approximations. A simulation study illustrates the flexibility and accuracy of the proposed approach to capture heavy and light tails in binary response data of different dimensions. The methodology is applied to two different real datasets, where we demonstrate that the P\'olya-Gamma sampler provides more precise estimates than the empirical likelihood method, outperforming approximate approaches.

In this study, we introduce a novel multi-task learning algorithm based on capsule network to encode visual attributes towards image-based diagnosis. By learning visual attributes, our proposed capsule architecture, called X-Caps, is considered explainable and models high-level visual attributes within the vectors of its capsules, then forms predictions based solely on these interpretable features. To accomplish this, we modify the dynamic routing algorithm to independently route information from child capsules to parents for the visual attribute vectors. To increase the explainability of our method further, we propose to train our network on a distribution of expert labels directly rather than the average of these labels as done in previous studies. At test time, this provides a meaningful metric of model confidence, punishing over/under confidence, directly supervised by human-experts' agreement, while visual attribute prediction scores are verified via a reconstruction branch of the network. To test and validate the proposed algorithm, we conduct experiments on a large dataset of over 1000 CT scans, where our proposed X-Caps, even being a relatively small 2D capsule network, outperforms the previous state-of-the-art deep dual-path dense 3D CNN in predicting visual attribute scores while also improving diagnostic accuracy. To the best of our knowledge, this is the first study to investigate capsule networks for making predictions based on radiologist-level interpretable attributes and its applications to medical image diagnosis.

This paper presents a new neural architecture that combines a modulated Hebbian network (MOHN) with DQN, which we call modulated Hebbian plus Q network architecture (MOHQA). The hypothesis is that such a combination allows MOHQA to solve difficult partially observable Markov decision process (POMDP) problems which impair temporal difference (TD)-based RL algorithms such as DQN, as the TD error cannot be easily derived from observations. The key idea is to use a Hebbian network with bio-inspired neural traces in order to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, DQN learns low level features and control, while the MOHN contributes to the high-level decisions by associating rewards with past states and actions. Thus the proposed architecture combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, exploiting the advantages of both. Simulations on a set of POMDPs and on the MALMO environment show that the proposed algorithm improved DQN's results and even outperformed control tests with A2C, QRDQN+LSTM and REINFORCE algorithms on some POMDPs with confounding stimuli and sparse rewards.

Existing graph neural network-based models have biasedly used a supervised training setting for graph classification, and they often share the conventional limitations in exploiting potential dependencies among nodes. To this end, we present U2GNN -- a novel embedding model leveraging the strength of the transformer self-attention network -- to learn low-dimensional embeddings of graphs. In particular, given an input graph, U2GNN applies a self-attention mechanism followed by a recurrent transition to update vector representation of each node from its neighbors. Thus, U2GNN can address the limitations in the existing models to produce plausible node embeddings whose sum is the final embedding of the whole graph. Experimental results in both supervised and unsupervised training settings show that our U2GNN achieves new state-of-the-art performances on a range of well-known benchmark datasets for the graph classification task. To the best of our knowledge, this is the first work showing that an unsupervised model performs better than supervised models by a large margin.

While deep learning-based classification is generally tackled using standardized approaches, a wide variety of techniques are employed for regression. In computer vision, one particularly popular such technique is that of confidence-based regression, which entails predicting a confidence value for each input-target pair (x,y). While this approach has demonstrated impressive results, it requires important task-dependent design choices, and the predicted confidences lack a natural probabilistic meaning. We address these issues by proposing a general and conceptually simple regression method with a clear probabilistic interpretation. In our proposed approach, we create an energy-based model of the conditional target density p(y|x), using a deep neural network to predict the un-normalized density from (x,y). This model of p(y|x) is trained by directly minimizing the associated negative log-likelihood, approximated using Monte Carlo sampling. We perform comprehensive experiments on four computer vision regression tasks. Our approach outperforms direct regression, as well as other probabilistic and confidence-based methods. Notably, our model achieves a 2.2% AP improvement over Faster-RCNN for object detection on the COCO dataset, and sets a new state-of-the-art on visual tracking when applied for bounding box estimation. In contrast to confidence-based methods, our approach is also shown to be directly applicable to more general tasks such as age and head-pose estimation.

We consider the problem of learning low-dimensional representations for large-scale Markov chains. We formulate the task of representation learning as that of mapping the state space of the model to a low-dimensional state space, called the kernel space. The kernel space contains a set of meta states which are desired to be representative of only a small subset of original states. To promote this structural property, we constrain the number of nonzero entries of the mappings between the state space and the kernel space. By imposing the desired characteristics of the representation, we cast the problem as a constrained nonnegative matrix factorization. To compute the solution, we propose an efficient block coordinate gradient descent and theoretically analyze its convergence properties.

Indirect comparisons of treatment-specific outcomes across separate studies often inform decision-making in the absence of head-to-head randomized comparisons. Differences in baseline characteristics between study populations may introduce confounding bias in such comparisons. Matching-adjusted indirect comparison (MAIC) (Signorovitch et al., 2010) has been used to adjust for differences in observed baseline covariates when the individual patient-level data (IPD) are available for only one study and aggregate data (AGD) are available for the other study. The approach weights outcomes from the IPD using estimates of trial selection odds that balance baseline covariates between the IPD and AGD. With the increasing use of MAIC, there is a need for formal assessments of its statistical properties. In this paper we formulate identification assumptions for causal estimands that justify MAIC estimators. We then examine large sample properties and evaluate strategies for estimating standard errors without the full IPD from both studies. The finite-sample bias of MAIC and the performance of confidence intervals based on different standard error estimators are evaluated through simulations. The method is illustrated through an example comparing placebo arm and natural history outcomes in Duchenne muscular dystrophy.

Social interactions determine many economic behaviors, but information on social ties does not exist in most publicly available and widely used datasets. We present results on the identification of social networks from observational panel data that contains no information on social ties between agents. In the context of a canonical social interactions model, we provide sufficient conditions under which the social interactions matrix, endogenous and exogenous social effect parameters are all globally identified. While this result is relevant across different estimation strategies, we then describe how high-dimensional estimation techniques can be used to estimate the interactions model based on the Adaptive Elastic Net GMM method. We employ the method to study tax competition across US states. We find the identified social interactions matrix implies tax competition differs markedly from the common assumption of competition between geographically neighboring states, providing further insights for the long-standing debate on the relative roles of factor mobility and yardstick competition in driving tax setting behavior across states. Most broadly, our identification and application show the analysis of social interactions can be extended to economic realms where no network data exists.

Batch Normalization (BN) is one of the most widely used techniques in Deep Learning field. But its performance can awfully degrade with insufficient batch size. This weakness limits the usage of BN on many computer vision tasks like detection or segmentation, where batch size is usually small due to the constraint of memory consumption. Therefore many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely, or have to introduce additional nonlinear operations in inference procedure and increase huge consumption. In this paper, we reveal that there are two extra batch statistics involved in backward propagation of BN, on which has never been well discussed before. The extra batch statistics associated with gradients also can severely affect the training of deep neural network. Based on our analysis, we propose a novel normalization method, named Moving Average Batch Normalization (MABN). MABN can completely restore the performance of vanilla BN in small batch cases, without introducing any additional nonlinear operations in inference procedure. We prove the benefits of MABN by both theoretical analysis and experiments. Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO. The code has been released in https://github.com/megvii-model/MABN.

Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at https://github.com/idea-iitd/graphgen.

Automated algorithm selection and hyperparameter tuning facilitates the application of machine learning. Traditional multi-armed bandit strategies look to the history of observed rewards to identify the most promising arms for optimizing expected total reward in the long run. When considering limited time budgets and computational resources, this backward view of rewards is inappropriate as the bandit should look into the future for anticipating the highest final reward at the end of a specified time budget. This work addresses that insight by introducing HAMLET, which extends the bandit approach with learning curve extrapolation and computation time-awareness for selecting among a set of machine learning algorithms. Results show that the HAMLET Variants 1-3 exhibit equal or better performance than other bandit-based algorithm selection strategies in experiments with recorded hyperparameter tuning traces for the majority of considered time budgets. The best performing HAMLET Variant 3 combines learning curve extrapolation with the well-known upper confidence bound exploration bonus. That variant performs better than all non-HAMLET policies with statistical significance at the 95% level for 1,485 runs.

The spoofing countermeasure (CM) systems in automatic speaker verification (ASV) are not typically used in isolation of each other. These systems can be combined, for example, into a cascaded system where CM produces first a decision whether the input is synthetic or bona fide speech. In case the CM decides it is a bona fide sample, then the ASV system will consider it for speaker verification. End users of the system are not interested in the performance of the individual sub-modules, but instead are interested in the performance of the combined system. Such combination can be evaluated with tandem detection cost function (t-DCF) measure, yet the individual components are trained separately from each other using their own performance metrics. In this work we study training the ASV and CM components together for a better t-DCF measure by using reinforcement learning. We demonstrate that such training procedure indeed is able to improve the performance of the combined system, and does so with more reliable results than with the standard supervised learning techniques we compare against.

In this paper, we conduct mathematical and numerical analyses for COVID-19. To predict the trend of COVID-19, we propose a time-dependent SIR model that tracks the transmission rate and the recovering rate at time $t$. Using the data provided by China authority, we show our one-day prediction errors are almost less than $3\%$. The turning point and the total number of confirmed cases in China are predicted under our model. To analyze the impact of the asymptomatic infections on the spread of disease, we extend our SIR model by considering two types of infected persons: detectable infected persons and undetectable infected persons. Whether there is an outbreak is characterized by the spectral radius of a $2 \times 2$ matrix that is closely related to the basic reproduction number $R_0$. We plot the phase transition diagram of an outbreak and show that there are several countries on the verge of COVID-19 outbreaks on Mar. 2, 2020. To illustrate the effectiveness of social distancing, we analyze the independent cascade model for disease propagation in a random network specified by a degree distribution. We show two approaches of social distancing that can lead to a reduction of $R_0$.

An optimal filter for Poisson observations is developed as a variant of the traditional Kalman filter. Poisson distributions are characteristic of infectious diseases, which model the number of patients recorded as presenting each day to a health care system. We develop both a linear and nonlinear (extended) filter. The methods are applied to a case study of neonatal sepsis and postinfectious hydrocephalus in Africa, using parameters estimated from publicly available data. Our approach is applicable to a broad range of disease dynamics, including both noncommunicable and the inherent nonlinearities of communicable infectious diseases and epidemics such as from COVID-19.

We present Catalyst.RL, an open-source PyTorch framework for reproducible and sample efficient reinforcement learning (RL) research. Main features of Catalyst.RL include large-scale asynchronous distributed training, efficient implementations of various RL algorithms and auxiliary tricks, such as n-step returns, value distributions, hyperbolic reinforcement learning, etc. To demonstrate the effectiveness of Catalyst.RL, we applied it to a physics-based reinforcement learning challenge "NeurIPS 2019: Learn to Move -- Walk Around" with the objective to build a locomotion controller for a human musculoskeletal model. The environment is computationally expensive, has a high-dimensional continuous action space and is stochastic. Our team took the 2nd place, capitalizing on the ability of Catalyst.RL to train high-quality and sample-efficient RL agents in only a few hours of training time. The implementation along with experiments is open-sourced so results can be reproduced and novel ideas tried out.

The fact that image datasets are often imbalanced poses an intense challenge for deep learning techniques. In this paper, we propose a method to restore the balance in imbalanced images, by coalescing two concurrent methods, generative adversarial networks (GANs) and capsule network. In our model, generative and discriminative networks play a novel competitive game, in which the generator generates samples towards specific classes from multivariate probabilities distribution. The discriminator of our model is designed in a way that while recognizing the real and fake samples, it is also requires to assign classes to the inputs. Since GAN approaches require fully observed data during training, when the training samples are imbalanced, the approaches might generate similar samples which leading to data overfitting. This problem is addressed by providing all the available information from both the class components jointly in the adversarial training. It improves learning from imbalanced data by incorporating the majority distribution structure in the generation of new minority samples. Furthermore, the generator is trained with feature matching loss function to improve the training convergence. In addition, prevents generation of outliers and does not affect majority class space. The evaluations show the effectiveness of our proposed methodology; in particular, the coalescing of capsule-GAN is effective at recognizing highly overlapping classes with much fewer parameters compared with the convolutional-GAN.

This paper investigates asymptotic properties of a class of algorithms that can be viewed as robust analogues of the classical empirical risk minimization. These strategies are based on replacing the usual empirical average by a robust proxy of the mean, such as the median-of-means estimator. It is well known by now that the excess risk of resulting estimators often converges to 0 at the optimal rates under much weaker assumptions than those required by their "classical" counterparts. However, much less is known about asymptotic properties of the estimators themselves, for instance, whether robust analogues of the maximum likelihood estimators are asymptotically efficient. We make a step towards answering these questions and show that for a wide class of parametric problems, minimizers of the appropriately defined robust proxy of the risk converge to the minimizers of the true risk at the same rate, and often have the same asymptotic variance, as the estimators obtained by minimizing the usual empirical risk. Moreover, our results show that robust algorithms based on the so-called "min-max" type procedures in many cases provably outperform, is the asymptotic sense, algorithms based on direct risk minimization.

In this paper, we propose a deep learning framework, TSception, for emotion detection from electroencephalogram (EEG). TSception consists of temporal and spatial convolutional layers, which learn discriminative representations in the time and channel domains simultaneously. The temporal learner consists of multi-scale 1D convolutional kernels whose lengths are related to the sampling rate of the EEG signal, which learns multiple temporal and frequency representations. The spatial learner takes advantage of the asymmetry property of emotion responses at the frontal brain area to learn the discriminative representations from the left and right hemispheres of the brain. In our study, a system is designed to study the emotional arousal in an immersive virtual reality (VR) environment. EEG data were collected from 18 healthy subjects using this system to evaluate the performance of the proposed deep learning network for the classification of low and high emotional arousal states. The proposed method is compared with SVM, EEGNet, and LSTM. TSception achieves a high classification accuracy of 86.03%, which outperforms the prior methods significantly (p<0.05). The code is available at https://github.com/deepBrains/TSception

There exist several inherent trade-offs in designing a fair model, such as those between the model's predictive performance and fairness, or even among different notions of fairness. In practice, exploring these trade-offs requires significant human and computational resources. We propose a diagnostic that enables practitioners to explore these trade-offs without training a single model. Our work hinges on the observation that many widely-used fairness definitions can be expressed via the fairness-confusion tensor, an object obtained by splitting the traditional confusion matrix according to protected data attributes. Optimizing accuracy and fairness objectives directly over the elements in this tensor yields a data-dependent yet model-agnostic way of understanding several types of trade-offs. We further leverage this tensor-based perspective to generalize existing theoretical impossibility results to a wider range of fairness definitions. Finally, we demonstrate the usefulness of the proposed diagnostic on synthetic and real datasets.

Discovering patterns and detecting anomalies in individual travel behavior is a crucial problem in both research and practice. In this paper, we address this problem by building a probabilistic framework to model individual spatiotemporal travel behavior data (e.g., trip records and trajectory data). We develop a two-dimensional latent Dirichlet allocation (LDA) model to characterize the generative mechanism of spatiotemporal trip records of each traveler. This model introduces two separate factor matrices for the spatial dimension and the temporal dimension, respectively, and use a two-dimensional core structure at the individual level to effectively model the joint interactions and complex dependencies. This model can efficiently summarize travel behavior patterns on both spatial and temporal dimensions from very sparse trip sequences in an unsupervised way. In this way, complex travel behavior can be modeled as a mixture of representative and interpretable spatiotemporal patterns. By applying the trained model on future/unseen spatiotemporal records of a traveler, we can detect her behavior anomalies by scoring those observations using perplexity. We demonstrate the effectiveness of the proposed modeling framework on a real-world license plate recognition (LPR) data set. The results confirm the advantage of statistical learning methods in modeling sparse individual travel behavior data. This type of pattern discovery and anomaly detection applications can provide useful insights for traffic monitoring, law enforcement, and individual travel behavior profiling.