Computational and Statistical Methodology for Highly Structured Data

Ph.D. Thesis, Rice University

Michael Weylandt (Department of Statistics, Rice University)

Abstract: Modern data-intensive research is typically characterized by large scale data and the impressive computational and modeling tools necessary to analyze it. Equally important, though less remarked upon, is the important structure present in large data sets. Statistical approaches that incorporate knowledge of this structure, whether spatio-temporal dependence or sparsity in a suitable basis, are essential to accurately capture the richness of modern large scale data sets. This thesis presents four novel methodologies for dealing with various types of highly structured data in a statistically rich and computationally efficient manner. The first project considers sparse regression and sparse covariance selection for complex valued data. While complex valued data is ubiquitous in spectral analysis and neuroimaging, typical machine learning techniques discard the rich structure of complex numbers, losing valuable phase information in the process. A major contribution of this project is the development of convex analysis for a class of non-smooth “Wirtinger” functions, which allows high-dimensional statistical theory to be applied in the complex domain. The second project considers clustering of large scale multi-way array (“tensor”) data. Efficient clustering algorithms for convex bi-clustering and co-clustering are derived and shown to achieve an order-of-magnitude speed improvement over previous approaches. The third project considers principal component analysis for data with smooth and/or sparse structure. An efficient manifold optimization technique is proposed which can flexibly adapt to a wide variety of regularization schemes, while efficiently estimating multiple principal components. Despite the non-convexity of the manifold constraints used, it is possible to establish convergence to a stationary point. Additionally, a new family of “deflation” schemes are proposed to allow iterative estimation of nested principal components while maintaining weaker forms of orthogonality. The fourth and final project develops a multivariate volatility model for US natural gas markets. This model flexibly incorporates differing market dynamics across time scales and different spatial locations. A rigorous evaluation shows significantly improved forecasting performance both in- and out-of-sample. All four methodologies are able to flexibly incorporate prior knowledge in a statistically rigorous fashion while maintaining a high degree of computational performance.

Copy of Record: Rice University Library

Summary: This thesis collects most of the work I published during my Ph.D. as well as some work on statistical learning in complex variables which has not yet been published elsewhere. The treatment of complex variables is slightly different than that typically appearing in the literature: by tweaking the usual construction of \(\mathbb{C}^n\) as a real inner product space, I show that essentially all results from real-valued statistical learning theory translate without effort: results for complex-valued sparse regression or Gaussian graphical models are comparable to their real-valued, with a better sample complexity if we assume “proper” (isotropic) sub-Gaussian noise. These results can be combined with the concentration bounds of Fiecas et al (2019, EJS) to get best-in-class sparsistency results for probabilistic graphical models of stochastic processes. While the optimization theory developed in this thesis could be pushed further, e.g. to non-Gaussian / generalized linear models, the correct construction of complex-GLMs is a bit tricky1 and omitted here.

Presentations: Thesis defense was held virtually and the Zoom recording can be viewed below:

Direct YouTube Link. Slides can be found here.


  AUTHOR="Michael Weylandt",
  TITLE="Computational and Statistical Methodology for Highly Structured Data",
  SCHOOL="Rice Univeristy",

  1. Some definitional problems arise when mapping complex covariates to the real-valued linear predictor: it’s not clear how to construct monotonicity here.↩︎