My research is generally in the area of *statistical* machine learning. I interpret *statistical* in two senses: i) using tools from statistical theory to understand and improve methods, even those not originally developed from a statistical point of view; and ii) focusing on the generation of *insights* and *knowledge* rather than just “black-box” predictive or generative power.^{1} Under that broad umbrella, I am particularly interested in structured dimension reduction, network analysis, and ML fairness, as well as various computational questions arising from methods work. My current application areas of interest include detection and attribution in climate analytics and the development of graph-inspired information retrieval systems.

Modern scientific experiments produce vast quantities of data that are nearly impossible to analyze with traditional exploratory data analysis (EDA) techniques. Moreover, these vast data sets typically have complex dependence structures that must be reflected in unsupervised analysis. I have developed techniques for flexibly incorporating structure into dimension reduction algorithms, most notably sparsity, smoothness, and low-rank network structure. This line of work is particularly inspired by the analysis of large-scale climate simulations from global climate models. These models produce massive output, with significant temporal and spatial correlation. With collaborators from Sandia National Laboratories (SNL), I have applied regularized PCA to climate simulations in order to identify the principal effects of massive volcanic eruptions on the global climate.
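To give a flavor of the kind of regularization involved (this is a minimal illustrative sketch, not the actual method from the Sandia work; the function name and penalty `lam` are hypothetical), one can recover a *sparse* leading principal component by soft-thresholding inside a power iteration:

```python
import numpy as np

def sparse_pc(X, lam=0.0, n_iter=200, seed=0):
    """First sparse principal component via soft-thresholded
    power iteration; lam=0 recovers ordinary PCA."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)               # center each variable
    S = Xc.T @ Xc                         # (unnormalized) covariance
    v = rng.standard_normal(Xc.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = S @ v
        u = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft-threshold
        nrm = np.linalg.norm(u)
        if nrm == 0.0:                    # penalty wiped out all loadings
            break
        v = u / nrm
    return v

# Toy data: a shared signal loads on the first 3 of 10 variables.
rng = np.random.default_rng(1)
z = rng.standard_normal(200)
X = 0.1 * rng.standard_normal((200, 10))
X[:, :3] += np.outer(z, [2.0, 2.0, 2.0])
v = sparse_pc(X, lam=5.0)
```

With a suitable penalty, the estimated loadings concentrate on the three signal variables and the noise variables are driven to (near) zero, which is exactly the interpretability gain sparsity buys over ordinary PCA.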

Climate scientists often refer to low-dimensional summarizations of climate data as “fingerprints”: these fingerprints may be motivated by domain expertise, *e.g.* global mean temperature, or by some statistical criterion, such as maximizing variance explained. Among their many uses in climate science, fingerprints are often used as inputs to Detection & Attribution studies; that is, they are used as evidence to detect changes in a climate quantity of interest and to attribute that change to a particular (posited) cause. In collaboration with a cross-disciplinary team from SNL, I am working to develop advanced fingerprinting techniques that maximize the statistical power of downstream Detection & Attribution studies. This work incorporates ideas from supervised dimension reduction, such as Partial Least Squares or Canonical Correlation Analysis (PLS/CCA), to find “most-attributable” fingerprints, *i.e.*, those aspects of the climate that appear to exhibit the most robust evidence of change. This analysis poses serious inferential challenges: if we have found the ‘most different’ aspect of the climate, classical attribution studies are no longer statistically valid. We believe these challenges can be addressed, however, because climate models also give us the ability to generate “null samples” from which we can estimate a suitable sampling distribution.
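A minimal sketch of the underlying idea, with hypothetical function names and none of the spatial weighting or pre-processing a real Detection & Attribution analysis would require: fit a PLS-style fingerprint on forced simulations, then, because the fingerprint was *selected* to maximize covariance, calibrate the test statistic by re-fitting it on unforced control runs to build an honest null distribution.

```python
import numpy as np

def pls_fingerprint(X, y):
    """First PLS direction: the spatial pattern whose projection
    covaries most strongly with the forcing covariate y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)

def detection_pvalue(X_forced, y, control_runs):
    """Refit the fingerprint on each unforced control run so the
    null distribution reflects the same selection step."""
    w = pls_fingerprint(X_forced, y)
    stat = np.cov(X_forced @ w, y)[0, 1]
    null = np.array([
        np.cov(Xc @ pls_fingerprint(Xc, y), y)[0, 1]
        for Xc in control_runs
    ])
    return (1 + np.sum(null >= stat)) / (1 + len(null))

# Toy setup: a trend y loading on 20 of 50 grid cells, plus 9 "control"
# runs of pure noise standing in for unforced model output.
rng = np.random.default_rng(0)
n, p = 100, 50
y = np.linspace(0.0, 1.0, n)
pattern = np.zeros(p)
pattern[:20] = 3.0
X_forced = rng.standard_normal((n, p)) + np.outer(y, pattern)
controls = [rng.standard_normal((n, p)) for _ in range(9)]
pval = detection_pvalue(X_forced, y, controls)
```

The key design point is that the null statistics undergo the same “find the most-covarying direction” step as the observed one; comparing the selected statistic to a fixed reference distribution would be anti-conservative.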

I also maintain an ongoing collaboration with SNL in the area of large graph analytics. Our recent work focuses on the intersection of computational algorithms and statistical theory for exceptionally large graph problems, with a particular focus on the “seed set expansion” (or local community detection) problem. My SNL collaborators bring expertise in a range of applied math and engineering fields, including high-performance computing, numerical linear algebra, tensor analysis, and graph algorithms. This collaboration is particularly rewarding as we are able to work with SNL’s mission-focused analytics teams to identify insights and challenges that a purely academic investigation may overlook.
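To sketch the seed set expansion problem itself (this is the classic local “push” approximation to personalized PageRank, not our specific algorithms): diffuse probability mass outward from a few seed vertices, touching only a small neighborhood of the graph, then rank nearby vertices by degree-normalized PageRank.

```python
from collections import deque

def ppr_push(graph, seeds, alpha=0.15, eps=1e-5):
    """Approximate personalized PageRank via local 'push' updates.
    graph: dict mapping each vertex to a list of neighbors."""
    p = {}                                    # settled PageRank mass
    r = {s: 1.0 / len(seeds) for s in seeds}  # residual mass
    work = deque(seeds)
    while work:
        u = work.popleft()
        ru, du = r.get(u, 0.0), len(graph[u])
        if du == 0 or ru < eps * du:
            continue                          # residual too small to push
        p[u] = p.get(u, 0.0) + alpha * ru     # keep an alpha fraction here
        r[u] = 0.0
        share = (1.0 - alpha) * ru / du       # spread the rest to neighbors
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(graph[v]):
                work.append(v)
    return p

# Two triangles joined by a single bridge edge (2, 3); seed in the left one.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = ppr_push(graph, seeds=[0])
community = sorted(p, key=lambda u: p[u] / len(graph[u]), reverse=True)[:3]
```

Here the expansion recovers the seed’s triangle: the diffusion never needs to touch most of a large graph, which is what makes this family of methods attractive at the scales our collaboration targets.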

If you would like to discuss any of these topics further, please don’t hesitate to get in touch.

My work has been supported by grants and fellowships from the US National Science Foundation (NSF), the US Intelligence Community (US IC), and the Sandia National Laboratories Laboratory Directed Research and Development program (SNL LDRD). For more, see here.

This latter goal overlaps with “explainable” or “interpretable” AI (XAI); to the degree there is a difference, XAI focuses more on explaining the results of existing prediction algorithms (which may be of standalone value as black-box predictors), while my work centers on the knowledge-generation aims. In general, the methods I focus on treat predictive power as *evidence* of a meaningful scientific relationship, not as the end goal.↩︎