Data privacy and synthetic data
It is widely accepted that anonymization alone is insufficient for protecting privacy. Concerns about data privacy have grown significantly with advances in computing and the surge of data generated by individuals and technology companies. Against this backdrop, differential privacy (DP) has emerged as a fundamental framework that aims to guarantee privacy by ensuring that individuals remain indistinguishable in data analysis. Our recent research projects include a minimax analysis under Gaussian DP, a DP goodness-of-fit test for continuous variables, and evaluation measures for synthetic data.
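As a minimal sketch of the DP idea (not the specific mechanisms studied in our projects), the Gaussian mechanism releases a statistic after adding noise scaled to the statistic's sensitivity; under Gaussian DP, noise with standard deviation sensitivity/mu satisfies mu-GDP. The function name and the bounded-mean example below are illustrative assumptions.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, mu, rng):
    # Release value + N(0, sigma^2) with sigma = sensitivity / mu;
    # under Gaussian DP this satisfies mu-GDP.
    sigma = sensitivity / mu
    return value + rng.normal(0.0, sigma)

# Illustrative example: privatize the mean of n records bounded in [0, 1].
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)
n = data.size
sensitivity = 1.0 / n  # changing one record moves the mean by at most 1/n
private_mean = gaussian_mechanism(data.mean(), sensitivity, mu=1.0, rng=rng)
```

Because the sensitivity of a bounded mean shrinks with the sample size, the released value stays close to the true mean while still protecting any single record.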
Compositional data analysis
Compositional data, such as human gut microbiome profiles, consist of non-negative variables for which only the values relative to the other variables are available. Analyzing such data requires careful treatment of its geometry, which is commonly understood via a regular simplex. The majority of existing approaches rely on log-ratio or power transformations to overcome the innate simplicial geometry. In this work, based on the key observation that a compositional data vector lives in a projective space, a new mapping from the simplex to the positive orthant of a sphere is considered. The intrinsic domain of compositional data analysis is then expanded to a unit sphere, modulo a discrete group action. This expansion makes traditional directional statistics immediately applicable to compositional data through the group-invariance principle. A main contribution of this work, which requires a more intricate development, is the construction of a reproducing kernel Hilbert space (RKHS) for compositional data analysis. Using the theory of spherical harmonics, we show that a reproducing kernel exists in a function space on the compositional domain, which opens up numerous possibilities.
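A minimal numerical illustration of a simplex-to-sphere map is the componentwise square root, a standard transformation that sends compositions (non-negative entries summing to one) to the positive orthant of the unit sphere; this is a simple sketch of the idea, not necessarily the exact mapping developed in the work.

```python
import numpy as np

def sqrt_map(x):
    # Componentwise square root: if sum_i x_i = 1 and x_i >= 0, then
    # sum_i (sqrt(x_i))^2 = 1, so the image lies on the unit sphere's
    # positive orthant.
    x = np.asarray(x, dtype=float)
    return np.sqrt(x / x.sum())

comp = np.array([0.2, 0.3, 0.5])  # a toy 3-part composition
z = sqrt_map(comp)
print(np.linalg.norm(z))          # 1.0 up to floating-point rounding
```

Extending the domain from the positive orthant to the whole sphere, with sign flips of coordinates acting as the discrete group, is what lets directional statistics apply through group invariance.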
Functional data analysis
Functional linear discriminant analysis offers a simple yet efficient method for classification, with the possibility of achieving perfect classification. Several methods proposed in the literature mostly address the dimensionality of the problem. On the other hand, there is growing interest in the interpretability of the analysis, which favors a simple and sparse solution. In this work, we propose a new approach that incorporates a type of sparsity that identifies nonzero sub-domains in the functional setting, offering a solution that is easier to interpret without compromising performance. To embed additional constraints in the solution, we reformulate functional linear discriminant analysis as a regularization problem with an appropriate penalty. Inspired by the success of L1-type regularization at inducing zero coefficients for scalar variables, we develop a new regularization method for functional linear discriminant analysis that incorporates an L1-type penalty to induce zero regions. We demonstrate that our formulation has a well-defined solution containing zero regions, achieving functional sparsity in the sense of domain selection. In addition, the misclassification probability of the regularized solution is shown to converge to the Bayes error if the data are Gaussian. Our method does not presume that the underlying function has zero regions in its domain, but produces a sparse estimator that consistently estimates the true function whether or not the latter is sparse. Numerical comparisons with existing methods demonstrate this property in finite samples with both simulated and real data examples.
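The scalar prototype behind L1-induced zeros is soft thresholding: the elementwise minimizer of a squared-error term plus an L1 penalty sets small coefficients exactly to zero. The toy discretized "coefficient function" below is an illustrative assumption, not the functional penalty developed in the work, but it shows how an L1-type penalty yields exact zero regions.

```python
import numpy as np

def soft_threshold(b, lam):
    # Elementwise minimizer of 0.5 * (x - b)^2 + lam * |x|:
    # shrinks b toward zero and sets entries with |b| <= lam exactly to zero.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Toy discretized coefficient function: weak on [0, 0.5), strong afterwards.
t = np.linspace(0.0, 1.0, 101)
b = np.where(t < 0.5, 0.05, 1.0)
est = soft_threshold(b, lam=0.1)
# est is exactly zero on [0, 0.5) and 0.9 elsewhere: a "zero region".
```

In the functional setting, the analogous penalty must act on sub-domains of the curve rather than on isolated coefficients, which is what makes the construction more delicate.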
High-dimensional asymptotics
Data piling refers to the phenomenon in which training data vectors from each class project to a single point along a discriminant direction. While this intriguing phenomenon has been key to understanding many distinctive properties of high-dimensional discrimination, its theoretical underpinning is far from properly established. In this work, the high-dimensional asymptotics of data piling are investigated under a spiked covariance model, revealing a close connection to the well-known ridged linear classifier. In particular, by projecting the ridge discriminant vector onto the subspace spanned by the leading sample principal component directions and the maximal data piling vector, we show that a negatively ridged discriminant vector can asymptotically achieve data piling of independent test data, essentially yielding perfect classification. This second data piling direction is obtained purely from training data and is shown to have a maximal property. Furthermore, asymptotic perfect classification occurs only along the second data piling direction.
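Training-data piling is easy to reproduce numerically in the high-dimension, low-sample-size regime: any direction orthogonal to the within-class-centered data collapses each class's training projections onto a single point. The sketch below is a minimal illustration of this mechanism under assumed Gaussian classes, not the ridged estimator analyzed in the work.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 20                       # HDLSS: dimension >> sample size
X1 = rng.normal(0.0, 1.0, (n, d))   # class 1
X2 = rng.normal(0.5, 1.0, (n, d))   # class 2, mean-shifted

# Stack within-class centered data; directions orthogonal to its row
# space pile each class's training projections onto one point.
W = np.vstack([X1 - X1.mean(0), X2 - X2.mean(0)])
_, s, Vt = np.linalg.svd(W, full_matrices=False)
Vr = Vt[s > 1e-8]                    # orthonormal basis of row space of W

diff = X1.mean(0) - X2.mean(0)
v = diff - Vr.T @ (Vr @ diff)        # mean difference, within-scatter removed
v /= np.linalg.norm(v)

p1, p2 = X1 @ v, X2 @ v
# Each class piles: np.ptp(p1) and np.ptp(p2) are numerically zero,
# while the two piles are well separated.
```

Piling of *training* data along such directions is automatic when d exceeds the sample size; the subtle question studied in the work is when a direction also piles independent *test* data.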
Data with ordinality (KSS Slides)
Ordinal classification problems arise in a variety of real-world applications in which samples must be classified into categories with a natural ordering. One example of classifying high-dimensional ordinal data is using gene expressions to predict ordinal drug response, which has been increasingly studied in pharmacogenetics. Classical ordinal classification methods are typically unable to handle high-dimensional data, while standard high-dimensional classification methods discard the ordering information among the classes. Existing high-dimensional ordinal classification approaches usually assume a linear ordinality among the classes. We argue that manually labeled ordinal classes may not be linearly arranged in the data space, especially in high-dimensional, complex problems.
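To see how ordering information is typically retained, a standard reduction (in the style of Frank and Hall, shown here as an illustration of the linear-ordinality assumption we critique, not as our approach) turns a K-class ordinal problem into K-1 binary problems "is y > k?":

```python
import numpy as np

def to_cumulative_targets(y, K):
    # Encode ordinal labels as K-1 cumulative binary targets "y > k".
    return np.stack([(y > k).astype(int) for k in range(K - 1)], axis=1)

def from_cumulative_predictions(B):
    # Decode: predicted class = number of binary tasks answered "yes".
    return B.sum(axis=1)

y = np.array([0, 2, 1, 3])
B = to_cumulative_targets(y, K=4)
# B = [[0,0,0], [1,1,0], [1,0,0], [1,1,1]]
assert np.array_equal(from_cumulative_predictions(B), y)
```

Note that this encoding hard-wires a linear arrangement of the classes; when the labeled ordering is not linearly realized in the data space, such reductions can mislead, which motivates our work.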