## January 30, 2019 12:00pm – 2:15pm

*Special Guest Lecture: Data Science in the Ad Tech Industry*

**Location:** SAMSI Classroom

**Speaker:** Gene Ferruzza, Valassis Digital

**Bio:**

Gene Ferruzza manages the Data Science team at Valassis Digital. Having worked in data science his entire career, he started as a software engineer developing neural network applications in the early days of neural computing. His work over the years has focused on the design and deployment of analytical technologies that drive digital consumer communications. At Valassis Digital, he works within a combined data science and engineering team to enhance AI-driven processes that create intelligence and optimize the relevance between consumers and advertisers. Gene has a BS degree in Computer Science and Mathematics from the University of Pittsburgh.

## Abstract

This session will outline the functionality and technology challenges of the online display advertising industry and how data science is leveraged in every aspect of its operation. Over the past 10 years, automated or “programmatic” online advertising has grown from non-existent to a major component that drives 80% of the advertising content online users see when browsing. Nearly half of that content is delivered through Real Time Bidding (RTB), the buying of ad space via an auction that occurs within the milliseconds it takes for a webpage to load. The session will cover the underlying RTB process that delivers online ads every time a browser brings up a website, and will look at how Valassis Digital uses data, expert systems and machine learning to drive intelligence into the operation.

## February 6, 2019 1:15pm – 2:15pm

*Lecture: Sea Ice and Data Assimilation: Challenges and Proposed Approaches*

**Location:** SAMSI Classroom

**Speaker:** Christian Sampson, Second-Year SAMSI Postdoctoral Fellow

## Abstract

Sea ice dynamics are driven by a complex set of processes from the small to the large scale. The ice has become much more dynamic in recent years, with earlier melt onset and lower sea ice extent in late summer. This has also led to a new Arctic made up of less multi-year ice. As the Arctic opens up, accurate numerical sea ice state prediction will become increasingly important for both scientific and operational applications. Today, most large-scale sea ice models solve the sea ice momentum balance equation on an Eulerian grid. These first-generation models work fairly well for long-run climate studies. However, they typically fail to capture important ice characteristics such as lead formation, which matters for ship navigation and the calculation of heat fluxes. The increasingly dynamic Arctic should instead be modeled as what it is: a Lagrangian set of floes interacting with each other and their environments. The Lagrangian view, however, can make data assimilation difficult for most sea ice data products. In this talk I will describe two relatively new sea ice models that take a Lagrangian approach to simulating ice dynamics, MPM-ice and neXtSIM. I will discuss some of the issues associated with data assimilation in these models, as well as in sea ice modeling in general, and outline some approaches we think will advance our ability to accurately predict sea ice states.

## February 20, 2019 1:15pm – 2:15pm

*Lecture: A Method for High-dimensional Non-Gaussian Spatial Data*

**Location:** SAMSI Classroom

**Speaker:** Yawen Guan, Second-Year SAMSI Postdoctoral Fellow

## Abstract

Non-Gaussian spatial data are common in many environmental disciplines. Spatial generalized linear mixed models (SGLMMs) are flexible models for such data, but inference for SGLMMs is computationally expensive, especially when the data are high-dimensional. I will present a new method that replaces high-dimensional spatial random effects with a reduced-dimensional representation based on random projections. I will discuss estimation in a Bayesian framework via Markov chain Monte Carlo (MCMC), as well as maximum likelihood estimation via a Markov chain Monte Carlo Expectation Maximization (MCMC-EM) algorithm.
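As a rough sketch of the setting described above, an SGLMM and a reduced-dimensional replacement of its random effects can be written as follows. The notation (link function $g$, covariates $x(s_i)$, covariance $\Sigma_\theta$, reduced rank $m$, oversampling $k$) is assumed here for illustration and may differ from the speaker's.

$$
g\big(\mathbb{E}[Z(s_i) \mid \beta, W]\big) = x(s_i)^{\top}\beta + W(s_i), \qquad W = \big(W(s_1), \dots, W(s_n)\big)^{\top} \sim \mathcal{N}(0, \Sigma_\theta)
$$

$$
\Sigma_\theta \approx U_m D_m U_m^{\top}, \qquad W \approx U_m D_m^{1/2}\,\delta, \qquad \delta \sim \mathcal{N}(0, I_m), \quad m \ll n
$$

In a random-projection scheme, the leading eigenpairs $(U_m, D_m)$ are not computed exactly. One typical choice is to form $\Sigma_\theta \Omega$ for an $n \times (m+k)$ Gaussian random matrix $\Omega$, whose columns approximately span the leading eigenspace, so that inference only involves the $m$-dimensional vector $\delta$.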

## February 27, 2019 1:15pm – 2:15pm

*Special Guest Lecture: Two Extensions of the Stochastic Block Model for Network Clustering Motivated by Three Datasets in Sociology, Ecology and Ethnobiology*

**Location:** SAMSI Classroom

**Speaker:** Pierre Barbillon, AgroParisTech

## Abstract

Clustering nodes, for instance by detecting communities, is a standard task when analyzing network data. The Stochastic Block Model (SBM) is a flexible latent-variable model widely used for unraveling structures in networks. The latent variables are categorical and correspond to a clustering of nodes. Any connection between two nodes is then modeled as a draw of a Bernoulli random variable, the probability of which depends on the latent variables associated with these nodes. In this talk, we will propose two extensions of the SBM in order to handle, on the one hand, multiplex networks (several possible connections between two nodes) and, on the other hand, multipartite networks (several networks involving the same nodes). The multiplex extension is motivated by a dataset on French cancer researchers, for whom we have their direct connections as an advice network and connections through their labs as a resource-sharing network. The multipartite extension is concerned with data on plant-pollinator, plant-ant and plant-bird interactions, and with data on seed sharing among farmers and inventories of plants for each farmer.

We will present an inference method for these extensions based on a variational version of the Expectation-Maximization algorithm and a model selection procedure for determining the number of clusters based on a penalized likelihood criterion.
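The generative side of the basic SBM described in this abstract (categorical latent blocks, then Bernoulli edges whose probability depends on the two blocks) can be sketched in a few lines. This is only an illustration of the plain SBM for an undirected simple graph, not of the multiplex or multipartite extensions or of the variational inference; the function name `sample_sbm` and its parameters are hypothetical.

```python
import random


def sample_sbm(n, block_probs, connect_probs, seed=0):
    """Draw one undirected graph from a basic stochastic block model.

    n             -- number of nodes
    block_probs   -- block_probs[q] is the prior probability of block q
    connect_probs -- connect_probs[q][r] is the edge probability between
                     a node in block q and a node in block r (symmetric)
    Returns (z, edges): z[i] is the latent block of node i, and edges is
    a set of pairs (i, j) with i < j.
    """
    rng = random.Random(seed)
    blocks = range(len(block_probs))
    # Latent categorical variables: one block label per node.
    z = rng.choices(blocks, weights=block_probs, k=n)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Bernoulli draw whose probability depends only on the two labels.
            if rng.random() < connect_probs[z[i]][z[j]]:
                edges.add((i, j))
    return z, edges


if __name__ == "__main__":
    # Two communities: dense within a block, sparse between blocks.
    labels, edge_set = sample_sbm(
        20, [0.5, 0.5], [[0.8, 0.05], [0.05, 0.8]], seed=42
    )
    print(labels)
    print(len(edge_set), "edges")
```

Inference runs in the opposite direction: only the edges are observed, and the labels `z` together with the two probability tables have to be recovered.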

## March 6, 2019 1:15pm – 2:15pm

*Lecture: Design and Distributed Algorithms*

**Location:** SAMSI Classroom

**Speaker:** Cheng Cheng, Second-Year SAMSI Postdoctoral Fellow

## Abstract

Graph signal processing provides an innovative framework for processing data on graphs. In this talk, I will discuss graph filters for processing data on a sparse graph, from their design to distributed algorithms. High-order Chebyshev polynomial approximation has been widely used to approximate graph multiplier operators. We propose an iterative Chebyshev polynomial approximation (ICPA) algorithm to implement the inverse filtering procedure, which can eliminate the restoration error even with a lower-order Chebyshev polynomial approximation. I will discuss the distributed implementation of the ICPA algorithm on a spatially distributed network and show how the ICPA algorithm can be used in signal denoising.
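The idea of correcting a low-order Chebyshev approximation by iteration can be sketched as follows. The abstract does not give the algorithm's details, so everything concrete here is an assumption made for illustration: the path graph, the filter $H = I + L$ (with $L$ the graph Laplacian), the spectral interval $[0, 4]$, and the residual-correction update `x <- x + G(y - Hx)`, where `G` is an order-2 Chebyshev approximation of $H^{-1}$. It may differ from the speaker's ICPA.

```python
import math


def path_laplacian(n):
    """Unnormalized Laplacian of a path graph; its spectrum lies in [0, 4)."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def cheb_coeffs(g, a, b, order):
    """Chebyshev interpolation coefficients of g on the interval [a, b]."""
    n = order + 1
    thetas = [math.pi * (j + 0.5) / n for j in range(n)]
    vals = [g(0.5 * (b - a) * math.cos(t) + 0.5 * (b + a)) for t in thetas]
    return [2.0 / n * sum(v * math.cos(k * t) for v, t in zip(vals, thetas))
            for k in range(n)]


def cheb_apply(L, coeffs, a, b, v):
    """Evaluate p(L) v by the three-term recurrence, using only products L v."""
    def shifted(w):  # maps the spectrum [a, b] of L onto [-1, 1]
        Lw = matvec(L, w)
        return [(2.0 * lw - (a + b) * wi) / (b - a) for lw, wi in zip(Lw, w)]

    t_prev, t_cur = v, shifted(v)
    out = [0.5 * coeffs[0] * x for x in t_prev]
    if len(coeffs) > 1:
        out = [o + coeffs[1] * x for o, x in zip(out, t_cur)]
    for c in coeffs[2:]:
        s = shifted(t_cur)
        t_prev, t_cur = t_cur, [2.0 * si - p for si, p in zip(s, t_prev)]
        out = [o + c * x for o, x in zip(out, t_cur)]
    return out


def iterative_inverse_filter(L, y, order=2, iters=12, a=0.0, b=4.0):
    """Approximately solve (I + L) x = y; returns x and the residual norms."""
    def apply_h(v):  # the assumed filter H = I + L
        return [vi + lv for vi, lv in zip(v, matvec(L, v))]

    coeffs = cheb_coeffs(lambda lam: 1.0 / (1.0 + lam), a, b, order)
    x = cheb_apply(L, coeffs, a, b, y)  # one-shot low-order approximation
    history = []
    for _ in range(iters + 1):
        r = [yi - hi for yi, hi in zip(y, apply_h(x))]
        history.append(math.sqrt(sum(ri * ri for ri in r)))
        # Residual correction: reuse the same low-order approximate inverse.
        x = [xi + di for xi, di in zip(x, cheb_apply(L, coeffs, a, b, r))]
    return x, history


if __name__ == "__main__":
    L = path_laplacian(8)
    y = [1.0, -2.0, 0.5, 3.0, 0.0, -1.0, 2.0, 1.5]
    x, history = iterative_inverse_filter(L, y)
    print("residual after one-shot approximation:", history[0])
    print("residual after iteration:", history[-1])
```

Every step uses only products with `L`, so on a sparse graph each node needs values from its neighbors alone, which is what makes this kind of scheme a candidate for distributed implementation.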

## March 20, 2019 1:15pm – 2:15pm

*Special Guest Lecture: Opportunities for Collaboration between the Duke Master in Interdisciplinary Data Science Program and SAMSI*

**Location:** SAMSI Classroom

**Speaker:** Thomas Nechyba and Jana Schaich Borg, Duke University

## Abstract

In this informal presentation, I will introduce Duke’s new Master in Interdisciplinary Data Science program. I will share what we believe makes the program unique and discuss several ways SAMSI researchers could get involved.

## March 27, 2019 1:15pm – 2:15pm

*Lecture: Learning Personalized PDEs for Biological Transport Models from Noisy Data*

**Location:** SAMSI Classroom

**Speaker:** John Nardini, First-Year SAMSI Postdoctoral Fellow

## Abstract

The Fisher-KPP partial differential equation (PDE) model has been widely used to predict and diagnose tumor progression in glioblastoma patients. While this equation has proven to be a useful model in describing tumor progression, we do not know if it is the optimal reaction-diffusion equation to do so. Performing a typical model selection study to investigate this would be computationally prohibitive, so we instead consider the problem of learning the dynamics of a given noisy dataset using sparse regression methods. Recent studies in this area have only been successful in the presence of very small amounts of noise. We accordingly develop a method to denoise noisy data for use in an equation learning framework and demonstrate that this method can correctly identify the PDE model that generated noisy spatiotemporal data. This work is a first step towards developing a methodology to generate data-driven models from patient data.
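For reference, the Fisher-KPP model named in this abstract is usually written, in one spatial dimension, as

$$
\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} + r\,u\left(1 - \frac{u}{K}\right),
$$

where $u(x,t)$ is the cell density, $D$ the diffusion coefficient, $r$ the growth rate and $K$ the carrying capacity. Equation learning by sparse regression typically poses the problem as selecting a few terms from a library of candidates. The library below is an illustrative assumption, not necessarily the one used in the talk:

$$
u_t \approx \Theta(u)\,\xi, \qquad \Theta(u) = \big[\,1,\; u,\; u^2,\; u_x,\; u_{xx},\; u\,u_x,\; \dots\,\big], \qquad \xi \text{ sparse}.
$$

Under this setup, recovering Fisher-KPP corresponds to nonzero entries of $\xi$ only for $u_{xx}$, $u$ and $u^2$, with values $D$, $r$ and $-r/K$. The derivative columns of $\Theta(u)$ are estimated from data, which is why noise makes the problem hard.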

## April 10, 2019 1:15pm – 2:15pm

*Lecture: On the Inference of Applying Gaussian Process Modeling to a Deterministic Function*

**Location:** SAMSI Classroom

**Speaker:** Wenjia Wang, First-Year SAMSI Postdoctoral Fellow

## Abstract

We investigate applying Gaussian process modeling to a deterministic function from the perspectives of prediction and uncertainty quantification. The upper bound and optimal convergence rate of prediction in Gaussian process modeling have been extensively studied in the literature, while a thorough exploration of the convergence rate and a theoretical study of uncertainty quantification are lacking. We prove that, if we use maximum likelihood estimation, then under different choices of nugget parameters the constructed predictor is not optimal and/or the estimated confidence interval is not reliable. The results suggest that, if one applies Gaussian process modeling to a deterministic function, the reliability of the confidence interval and the optimality of the predictor cannot be achieved at the same time unless further information about the underlying function is known.
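To fix ideas, the predictor and pointwise variance referred to above can be written in a standard form. The notation is assumed here rather than taken from the talk: a zero-mean Gaussian process with correlation function $\Phi$, variance $\sigma^2$ and nugget $\mu \ge 0$, observed at $x_1, \dots, x_n$ with $Y = (f(x_1), \dots, f(x_n))^{\top}$.

$$
\hat f(x) = r(x)^{\top}\,(K + \mu I_n)^{-1}\,Y, \qquad
\hat\sigma^2(x) = \sigma^2\left(\Phi(0) - r(x)^{\top}(K + \mu I_n)^{-1} r(x)\right),
$$

$$
r(x) = \big(\Phi(x - x_1), \dots, \Phi(x - x_n)\big)^{\top}, \qquad K = \big(\Phi(x_i - x_j)\big)_{i,j=1}^{n}.
$$

A confidence interval of the form $\hat f(x) \pm q\,\hat\sigma(x)$ then depends on how $\sigma^2$ is estimated (here by maximum likelihood) and on the choice of $\mu$. These are the two ingredients whose interplay the abstract addresses.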

## April 17, 2019 1:15pm – 2:15pm

*Lecture: Additive Partially Linear Models for Ultra‐High‐Dimensional Regression*

**Location:** SAMSI Classroom

**Speaker:** Xinyi Li, First-Year SAMSI Postdoctoral Fellow

## Abstract

We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a data set on the shoot apical meristem of maize genotypes.
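In symbols, an additive partially linear model of the kind described above can be written as follows. The notation is assumed for illustration: $Z_{ik}$ are the covariates entering linearly, $X_{il}$ those entering through unknown smooth functions $\phi_l$, and $B_{l}$ is a vector of spline basis functions.

$$
Y_i = \mu + \sum_{k=1}^{p_1} Z_{ik}\,\alpha_k + \sum_{l=1}^{p_2} \phi_l(X_{il}) + \varepsilon_i, \qquad \phi_l(x) \approx B_l(x)^{\top}\gamma_l .
$$

A generic doubly penalized least-squares criterion for the first step would then look roughly like

$$
\sum_{i=1}^{n}\Big(Y_i - \mu - \sum_{k} Z_{ik}\alpha_k - \sum_{l} B_l(X_{il})^{\top}\gamma_l\Big)^2
+ n\lambda_1 \sum_{k} w_k\,|\alpha_k| + n\lambda_2 \sum_{l} \tilde w_l\,\|\gamma_l\|,
$$

with one penalty acting on the linear coefficients and one on each group of spline coefficients. The specific weights $w_k$, $\tilde w_l$ and norms are assumptions of this sketch, and the exact criterion in the talk may differ.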

## References

– Supplementary A for “Additive Partially Linear Models for Ultra-high-dimensional Regression” – Xinyi Li, Li Wang and Dan Nettleton

## April 24, 2019 1:15pm – 2:15pm

*Lecture: Statistical Surrogate Modeling in Quality Assessment of Remote Sensing Retrieval and Geophysical Hazard Quantification*

**Location:** SAMSI Classroom

**Speaker:** Pulong Ma, First-Year SAMSI Postdoctoral Fellow

## Abstract

Surrogate models have been widely adopted in various uncertainty quantification (UQ) studies. This talk is based on surrogate modeling work in two different projects:

The first project is motivated by remote sensing retrievals in NASA’s Orbiting Carbon Observatory-2 (OCO-2) that aim at estimating the atmospheric state from satellite observations of reflected sunlight. The observing system uncertainty experiment (OSUE) has been used as a cost-effective way to assess retrieval quality in an observing system such as OCO-2. However, the physical forward model that describes the mathematical relationship between atmospheric state and measurements of radiances in the OSUE is computationally expensive. To overcome this issue, a multivariate statistical emulator has been developed to facilitate computations in large-scale OSUEs. In constructing the emulator, we use functional principal component analysis to reduce the dimension of the functional radiance output and an active subspace approach to reduce the dimension of the atmospheric state. The nearest-neighbor Gaussian process is then built on the resulting low-dimensional spaces. The proposed methodology is applied to the OCO-2 application with about 10,000 model runs.

The second project in progress is motivated by geophysical hazard quantification. Many complex computer models of real-world processes have been developed at different levels of accuracy with varying computational cost. Quantifying the risk of geophysical hazards due to natural disasters such as volcanoes and storm surges requires large-scale simulation of the most accurate computational models, which is often computationally intractable. To overcome the computational challenge, we develop a cokriging-based emulator for multi-level computer models. I will introduce the proposed cokriging model and its Bayesian estimation with conjugate priors and non-informative priors. The proposed methodology will be illustrated with two toy examples.

## May 1, 2019 1:15pm – 2:15pm

*Lecture: Efficient sampling for imbalanced large categorical data using piece-wise deterministic Markov chain Monte Carlo*

**Location:** SAMSI Classroom

**Speaker:** Matthias Sachs, Second-Year SAMSI Postdoctoral Fellow

## Abstract

In this talk I will first give a basic introduction to piecewise deterministic Markov processes (PDMPs) and explain an example, the so-called zig-zag sampler, along with how these processes can be used as Monte Carlo methods. I will then discuss challenges in sampling Bayesian posterior distributions when the data are large and high-dimensional (“large n and large p”), and highlight some features of PDMPs that make these processes particularly suitable for this task. In particular, their properties allow for sub-sampling of the data without introducing any systematic bias. I will present multiple extensions of current versions of the zig-zag process, including improved data sub-sampling strategies and efficient adaptive preconditioning of the process.
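To make the zig-zag sampler concrete, here is a minimal one-dimensional sketch targeting a standard normal distribution. The target, the function name `zigzag_standard_normal` and the moment estimators are illustrative assumptions. The sub-sampling and preconditioning extensions mentioned in the abstract are not shown. For this target the switching rate along a trajectory is `max(0, theta * x + t)`, so the next switching time can be drawn exactly by inverting the integrated rate.

```python
import math
import random


def zigzag_standard_normal(n_events, seed=0):
    """Run a 1-D zig-zag process targeting N(0, 1).

    The state is a position x and a velocity theta in {-1, +1}. Between
    events x moves linearly, and at each event the velocity flips.
    Returns time-averaged estimates of E[x] and E[x^2].
    """
    rng = random.Random(seed)
    x, theta = 0.0, 1.0
    total_time = 0.0
    int_x = 0.0   # running integral of x(t) dt along the path
    int_x2 = 0.0  # running integral of x(t)^2 dt along the path
    for _ in range(n_events):
        a = theta * x
        e = rng.expovariate(1.0)
        # Solve: integral_0^tau max(0, a + s) ds = e  for tau.
        tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
        # Exact integrals over the linear segment x + theta * s, 0 <= s <= tau.
        int_x += x * tau + theta * tau ** 2 / 2.0
        int_x2 += x * x * tau + x * theta * tau ** 2 + tau ** 3 / 3.0
        total_time += tau
        x += theta * tau
        theta = -theta  # flip the velocity at the event
    return int_x / total_time, int_x2 / total_time


if __name__ == "__main__":
    mean, second_moment = zigzag_standard_normal(50000, seed=1)
    print("estimated mean:", mean)
    print("estimated second moment:", second_moment)
```

Expectations are estimated as time averages along the continuous path rather than as averages over discrete samples, which is one of the ways these processes differ from conventional MCMC.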
