# Data Analytics Research Seminar

### Upcoming seminars

23 May 2024,10:30-11:45

Céline Duval (Université de Lille)

Title: Geometry of excursion sets: computing the surface area from discretized points

Abstract: The excursion sets of a smooth random field carry relevant information in their various geometric measures. After an introduction to these geometrical quantities showing how they are related to the parameters of the field, we focus on the problem of discretization. From a computational viewpoint, one never has access to the continuous observation of the excursion set, but rather to observations at discrete points in space. It has been reported that for specific regular lattices of points in dimensions 2 and 3, the usual estimate of the surface area of the excursions remains biased even when the lattice becomes dense in the domain of observation. We show that this limiting bias is invariant to the locations of the observation points and that it only depends on the ambient dimension. (Based on joint works with H. Biermé, R. Cotsakis, E. Di Bernardino and A. Estrade)

6 June 2024,10:30-11:45

Peter Radchenko (University of Sydney)

Title: Modeling with Categorical Features via Exact Fusion and Sparsity Regularization

Abstract: We study the high-dimensional linear regression problem with categorical predictors that have many levels. We propose a new estimation approach, which performs model compression via two mechanisms by simultaneously encouraging (a) clustering of the regression coefficients to collapse some of the categorical levels together; and (b) sparsity of the regression coefficients. We formulate our estimator as a solution to a mixed integer program, and provide a row generation procedure to speed up the computation. We also present a fast approximate algorithm for our method that obtains high-quality feasible solutions via block coordinate descent; the main building block of our algorithm is an exact solver for the univariate case. We establish new theoretical guarantees for both the prediction and the cluster recovery performance of our estimator. Our numerical experiments on synthetic and real datasets demonstrate that our proposed estimator tends to outperform the state-of-the-art.

13 June 2024,10:30-11:45

Gilles Stoltz (Laboratoire de mathématiques d'Orsay, CNRS - Université Paris-Saclay & HEC Paris)

Title: Contextual stochastic bandits with budget constraints and fairness application

Abstract: We review the setting and fundamental results of contextual stochastic bandits, where at each round some vector-valued context x_t is observed and K actions are available, each action a providing a stochastic reward with expectation given by some (partially unknown) function of x_t and a. The aim is to maximize the cumulative rewards obtained, or equivalently, to minimize the regret. This requires maintaining a good balance between the estimation (a.k.a. exploration) of the function and the exploitation of the estimates built. The literature also considers additional budget constraints (leading to so-called contextual bandits with knapsacks): actions now provide rewards but also costs. The literature has also illustrated that costs may model fairness constraints. We will review these two lines of work and describe our own contribution in this respect, related to a more direct strategy, able to handle \sqrt{T} cost constraints over T rounds, which is exactly what is needed for fairness applications. The recent results discussed at the end of the talk will be based on the joint work by Evgenii Chzhen, Christophe Giraud, Zhen Li, and Gilles Stoltz, Small total-cost constraints in contextual bandits with knapsacks, with application to fairness, NeurIPS, 2023.

### Past seminars 2023-2024

5 October 2023,10:30-11:45

Sirio Legramanti (University of Bergamo)

Title: Weighting covariates in Bayesian nonparametric clustering: an application to transportation networks

Abstract: In clustering, observed individual data are often accompanied by covariates that can assist the clustering process itself. This is the case, for example, of transportation networks, where each node has spatial coordinates, and it is often desirable that clusters of nodes are spatially cohesive. In fact, the obtained clusters may be used to inform public policy decisions, and it may be preferable that such policies are uniform over neighboring areas. Naturally, depending on the application, different notions of closeness can be used to define such neighborhoods, thus potentially requiring to transform the spatial covariates.

Motivated by real-world data about subscriptions to the public transportation system of Bergamo (Italy) and its surroundings, we propose a method to incorporate properly transformed spatial covariates into a state-of-the-art stochastic block model, while inferring the weight of covariates. (Joint work with Valentina Ghidini and Raffaele Argiento)

19 October 2023,10:30-11:45

Badr-Eddine Chérief-Abdellatif (LPSM, CNRS)

Title: Label Shift Quantification via Distribution Feature Matching

Abstract: Quantification learning deals with the task of estimating the target label distribution under label shift. In this talk, we present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures and extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution.

16 November 2023,10:30-11:45

Nikolaus Schweizer (Tilburg University)

Title: Solving Maxmin Optimization Problems via Population Games

Abstract: The Iteratively Reweighted Least Squares (IRLS) method is well known in numerical analysis as a useful technique for solving minmax function approximation problems on a finite grid. We extend the method so that it can be applied to find maxmin solutions of more general multicriteria problems. As in the original IRLS method, the key idea is to find the maxmin decision as an optimizer of a suitably weighted sum of monotonic transformations of the criterion functions. The method is effective when transformations can be found that make the resulting weighted-sum problems easy to solve. The relevant weights are determined by an iterative scheme. For this, we use a discrete-time version of the celebrated replicator equation of evolutionary game theory, also known in machine learning as the exponential multiplicative weights algorithm. The iterative process can be viewed as the co-evolution of a population of "testers" jointly with the decision maker, which produces the maxmin solution from a symmetric Nash equilibrium in a population game. This establishes a connection to game theory that is quite different from the usual one via two-person zero-sum games. Examples are provided to show the use of the generalized IRLS method in collective investment and in decision making under uncertainty.

(Joint work with Anne Balter and Johannes M. Schumacher)

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4264811

14 December 2023,10:30-11:45

Artem Prokhorov (University of Sydney)

Title: A machine learning attack on illegal trading

Abstract: We design an adaptive framework for the detection of illegal trading behavior. Its key component is an extension of a pattern recognition tool, originating from the field of signal processing and adapted to modern electronic systems of securities trading. The new method combines the flexibility of dynamic time warping with contemporary approaches from extreme value theory to explore large-scale transaction data and accurately identify illegal trading patterns. Importantly, our method does not need access to any confirmed illegal transactions for training. We use a high-frequency order book dataset provided by an international investment firm to show that the method achieves remarkable improvements over alternative approaches in the identification of suspected illegal insider trading cases.

https://doi.org/10.1016/j.jbankfin.2022.106735

1 February 2024,10:30-11:45

Gábor Lugosi (Universitat Pompeu Fabra)

Title: Network archaeology: a review of recent results

Abstract: Large networks that change dynamically over time are ubiquitous in various areas such as social networks and epidemiology. These networks are often modeled by random dynamics which, despite being relatively simple, give a quite accurate macroscopic description of real networks. "Network archaeology" is an area of combinatorial statistics in which one studies statistical problems of inferring the past properties of such growing networks. In this talk we discuss some simple network models and review recent results on revealing the past of the networks.

29 February 2024,10:30-11:45

Claire Boyer (LPSM, Sorbonne Université)

Title: Some statistical insights on physics-informed machine learning

Abstract: Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. We prove that for linear differential priors, the problem can be formulated as a kernel regression task, giving a rigorous framework to analyze physics-informed ML. In particular, the physical prior can help in boosting the estimator convergence.

The direct implementation of physics-informed kernel estimators can be tedious, and practitioners often resort to physics-informed neural networks (PINNs) instead. We offer some food for thought and statistical insight into the proper use of PINNs.

14 March 2024,10:30-11:45

Matteo Barigozzi (Università di Bologna)

Title: High-dimensional dynamic matrix factor models

Abstract: High-dimensional matrix-variate time series data are becoming increasingly popular in economics and finance. This has stimulated the development of matrix factor models to achieve significant dimension reduction. This paper proposes an approximate dynamic matrix factor model that accounts for the time series nature of the data, and develops an EM algorithm to perform quasi-maximum likelihood estimation of the model parameters. The algorithm is further extended to estimate the dynamic matrix factor model on a dataset with an arbitrary pattern of missing data. We prove consistency of the estimated row and column loadings matrices and of the matrix factors. The finite sample properties of the proposed estimation strategies are assessed through a large simulation study and an application to a financial dataset.

(Joint work with Luca Trapin)

28 March 2024,10:30-11:45

Davide La Vecchia (University of Geneva)

Title: Saddlepoint techniques for the statistical analysis of time series

Abstract: Saddlepoint techniques provide numerically accurate, small sample approximations to the distribution of estimators and test statistics. While a complete theory on saddlepoint techniques is available in the case of independent observations, much less attention has been devoted to the time series setting. This talk contributes to filling this gap. Under short and/or long range serial dependence, for Gaussian and non-Gaussian processes, the talk shows how to derive and implement saddlepoint approximations for Whittle's estimator, a frequency domain M-estimator. The derivation is based on the treatment of the standardized periodogram ordinates as (i.) i.d. random variables. Comparisons of the saddlepoint techniques to other methods are presented: the numerical exercises show that the saddlepoint approximations yield accuracy improvements over extant methods, while preserving analytical tractability and avoiding resampling. The talk starts with a gentle introduction to saddlepoint techniques in the i.i.d. setting and with a review of the basic frequency domain tools for time series analysis. The results are based on joint works with E. Ronchetti and A. Moor.

25 April 2024,10:30-11:45

Vincent Fortuin (Helmholtz AI/TUM)

Title: Use Cases for Bayesian Deep Learning in the Age of ChatGPT

Abstract: Many researchers have pondered the same existential questions since the release of ChatGPT: Is scale really all you need? Will the future of machine learning rely exclusively on foundation models? Should we all drop our current research agenda and work on the next large language model instead? In this talk, I will try to make the case that the answer to all these questions should be a convinced “no” and that now, maybe more than ever, should be the time to focus on fundamental questions in machine learning again. I will provide evidence for this by presenting three modern use cases of Bayesian deep learning in the areas of self-supervised learning, interpretable additive modeling, and neural network sparsification. Together, these will show that the research field of Bayesian deep learning is very much alive and thriving and that its potential for valuable real-world impact is only just unfolding.

2 May 2024,10:30-11:45

Gérard Ben Arous (NYU)

Title: Dynamical spectral transition for optimization in very high dimensions

Abstract: In recent work with Reza Gheissari (Northwestern) and Aukosh Jagannath (Waterloo), we gave a general context for the existence of projected low dimensional “effective dynamics” of Stochastic Gradient Descent in very high dimensional Data Science problems. These effective dynamics (and, in particular, their so-called “critical regime”) define a dynamical system in finite dimensions which may be quite complex, and which rules the performance of the learning algorithm.

The next step is to understand how the system finds these “summary statistics”. This is done in the last work with the same authors and with Jiaoyang Huang (Wharton, U-Penn). This is based on a dynamical spectral transition of Random Matrix Theory: along the trajectory of the optimization path, the Gram matrix or the Hessian matrix develop outliers which carry these effective dynamics.

I will naturally first come back to the Random Matrix Tools needed here (the behavior of the edge of the spectrum and the BBP transition).

I will then illustrate the use of this point of view on a few central examples of ML: classification for Gaussian mixtures, and the XOR task.

References: NeurIPS 2022, Best paper award, CPAM March 2024, ICLR May 2024, and arXiv 2310.03010.

### Archive 2013-2023

### 2022-2023

20 October 2022 from 10.30am to 11.45am (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Lu Yu (CREST-ENSAE)

Title: Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance

Abstract: We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded (1 + κ)-th moment, for some κ ∈ (0, 1], we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.

3 November 2022 from 10.30am to 11.45am (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Nicolas Schreuder (Genova University)

Title: Fair statistical learning: a study of the Demographic Parity constraint

Abstract: In various domains, statistical algorithms trained on personal data take pivotal decisions which influence our lives on a daily basis. Recent studies show that a naive use of these algorithms in sensitive domains may lead to unfair and discriminating decisions, often inheriting or even amplifying biases present in data. In the first part of the talk, I will introduce and discuss the question of fairness in machine learning through concrete examples of biases coming from the data and/or from the algorithms. In a second part, I will demonstrate how statistical learning theory can help us better understand and overcome some of those biases. In particular, I will present a selection of recent results from two of my papers on the Demographic Parity constraint:

- A minimax framework for quantifying risk-fairness trade-off in regression (with E. Chzhen), Ann. Statist. 50(4): 2416-2442 (Aug. 2022).

- Fair learning with Wasserstein barycenters for non-decomposable performance measures (with S. Gaucher and E. Chzhen), arXiv preprint arXiv:2209.00427.

17 November 2022 from 10.30am to 11.45am (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Alfred Galichon (NYU)

Title: Estimating Matching Models: from theory to empirics

Abstract: I will review a methodology for the estimation of models of matching, with a focus on family economics. The theoretical foundations, the econometrics toolbox, and some empirical results will be discussed. This talk is partly a review of the existing literature, and partly based on two new papers:

- https://arxiv.org/abs/2204.00362.

8 December 2022 from 10.30am to 11.45am (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Guillaume A. Pouliot (The University of Chicago)

Title: An Exact t-Test

Abstract: I give a short review and selective survey of randomization inference. Surprisingly, the methodological question of how to produce marginal exact and asymptotically robust inference for a regression coefficient in the multivariate linear model with general design matrix appears to be unresolved in the literature. We produce a test statistic which delivers such inference.

6 April 2023 from 12.00pm to 1.15pm (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Dion Bongaerts (RSM Erasmus University)

Title: Reverse Engineering Mutual Fund Trades

Abstract: In this paper we present a novel method for imputing daily mutual fund trades from data on fund returns, total net assets, and fund holdings at the daily, monthly, and quarterly frequencies, respectively. Therefore, our method works with standard CRSP mutual fund data. We set up an (under-identified) system of linear equations and solve the under-identification issue by an iterative method that applies random and adaptive constraints on trade incidence. The method produces daily, position-level trade estimates with associated confidence levels. Validation and simulation studies using proprietary fund trading data show high accuracy, especially for larger and more relevant trades.

30 May 2023 from 12.00pm to 1.15pm (1h15 per talk including 30 minutes of broad introduction and 15 min questions)

Cesare Robotti (Warwick Business School)

Title: Priced Risk in Corporate Bonds

Abstract: Recent studies document strong empirical support for multifactor models that aim at explaining the cross-sectional variation in corporate bond expected excess returns. We revisit these findings and provide evidence that common factor pricing in corporate bonds is exceedingly difficult to establish. Based on portfolio- and bond-level analyses, we demonstrate that previously proposed bond risk factors, with traded liquidity as the only marginal exception, do not provide any incremental pricing information to the corporate bond market factor. This implies that the bond CAPM is never outperformed by other traded and nontraded factor models in pairwise and multiple model comparison tests.

### 2021-2022

9 June 2022 from 2.00 to 4.00pm (45 minutes per talk plus a 30 minutes coffee break) in Room N517

Alexandra Carpentier (Universität Potsdam).

Karim Lounici (Ecole Polytechnique).

12 May 2022 from 2.00 to 4.00pm (45 minutes per talk plus a 30 minutes coffee break)

Victor-Emmanuel Brunel (ENSAE).

George Deligiannidis (University of Oxford).

12 April 2022 from 2.00 to 4.00pm (45 minutes per talk plus a 30 minutes coffee break) in Room N517

Gilles Stupfler (ENSAI). Asymmetric least squares techniques for extreme risk assessment

Robert Adamek (Maastricht University). Local Projection Inference in High Dimensions

3 March 2022 from 2.00 to 4.00pm (45 minutes per talk plus a 30 minutes coffee break) on Zoom

Giacomo Zanella (Bocconi University). Robust leave-one-out cross-validation for high-dimensional Bayesian models

Matthew Graham (University College London). Manifold MCMC methods for Bayesian inference in diffusion models

13 December 2021 from 2.30 to 4.30pm (45 minutes per talk plus a 30 minutes coffee break) in Room N517

Christian Brownlees (Universitat Pompeu Fabra). Empirical Risk Minimization for Time Series: Nonparametric Performance Bounds for Prediction

Anders Kock (University of Oxford). Consistency of p-norm based tests in high dimensions: characterization, monotonicity, domination

24 November 2021 from 2.30 to 4.30pm (45 minutes per talk plus a 30 minutes coffee break) in Room N517

Umut Simsekli (INRIA). Towards Building a Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

Valentin De Bortoli (University of Oxford). Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling

### 2018-2019

November 22, 2018 - 1:00 pm to 3:00 pm - ESSEC Cergy (N305)

Prof. Taoufik Bouezmarni (Laval University)

Extended Lorenz curves for general random variables

Prof. Matei Demetrescu (Kiel University)

Nonlinear Predictability of Stock Returns? Parametric vs. nonparametric inference in predictive regressions

October 16, 2018 - 5:15 pm to 6:15 pm - ESSEC La Défense (CNIT), Room 344

Prof. Arijit Chakrabarty (Indian Statistical Institute, Kolkata)

Spectra of Adjacency and Laplacian Matrices of inhomogeneous Erdős-Rényi Graphs

### 2017-2018 Program:

TIME SERIES WORKSHOP 2018, Wednesday April 11 - 2018, 2:30 pm to 5:40 pm, Room N516

Organizers: Prof. Luc Bauwens, CORE - UCL, Fellow of the Institute of Advanced Studies UCP Université Paris-Seine, Guillaume Chevillon, ESSEC Business School, Prof. Jeroen Rombouts, ESSEC Business School

March 29, 2018 : 5th Empirical Finance Workshop - Cergy (KLAB)

December 14, 2017 - 1:00 pm to 3:00 pm - Cergy (Room N305):

Prof. Xavier D’haultfoeuille (ENSAE - CREST)

Testing Rational Expectations Using Data Combination

Prof. Artem Prokhorov (University of Sydney)

On Semiparametric Estimation using Bernstein Copulas

### 2016-2017 Program:

July 4, 2017 - from 10:30 am to 12:00 pm - Cergy Room N105:

Prof. Aurore Delaigle (University of Melbourne)

Analyzing Partially Observed Functional Data

April 21, 2017 - from 1:00 pm to 4:00 pm - Cergy Room N305:

Prof. Valentina Corradi (University of Surrey)

Improved Tests for robust forecast comparison

Prof. Jean-David Fermanian (CREST)

The behavior of dealers and clients on the European corporate bond market: the case of Multi-dealer-to-client platforms

Prof. Bas Werker (Tilburg University)

Arbitrage Pricing Theory for Idiosyncratic Variance Factors

March 30-31, 2017: 25th Annual Symposium of the Society for Nonlinear Dynamics and Econometrics (SNDE)

March 15, 2017 : 4th Empirical Finance Workshop - Cergy (KLAB)

March 2, 2017 - from 1:45 pm to 4:00 pm - Cergy Room N305:

Prof. Karim ABADIR (Imperial College London)

Macro and financial markets: The memory of an elephant

Prof. Joerg Breitung (University of Cologne)

Multivariate tests for asset price bubbles

February 24, 2017 - from 2:00 pm to 5:00 pm - IBM Bois Colombes :

Internet of Things & Predictive Analytics

Reda Gomery (Deloitte), Marc Van Der Laan (AT&T), Thomas Watteyne (INRIA), Georges Uzbelger (IBM)

November 25, 2016 - from 11:45 am to 1:15 pm - Cergy, Room N405:

Prof. Juhyun Park (Lancaster University)

Estimation of functional sparsity in nonparametric varying coefficient models

November 17-18, 2016: The 2016 8th French Econometrics Conference (FEC2016)

November 15, 2016, from 1:15 to 4:00 pm (Room E125):

Yu-Wei Hsieh (University of Southern California)

Seminar on the Econometrics of Matching models

### 2015-2016 Program

March 16, 2016: 3rd Empirical Finance Workshop

May 31, 2016:

Prof. Christophe CROUX (Katholieke Universiteit Leuven)

Sparse Cointegration

Prof. Nikolay GOSPODINOV (Federal Reserve Bank of Atlanta)

Spurious Inference in Reduced-Rank Asset-Pricing Models

Prof. Otilia BOLDEA (Tilburg University)

Break-point Estimation in Panel data with fixed effects

November 5-6, 2015: Advances in Time Series and Forecasting Conference

September 25, 2015: Workshop on Time Series Econometrics

September 24, 2015:

Prof. Cristina DAVINO (Università di Macerata, Italy) - Quantile Regression: an overview of properties and applications

### 2014-15 Program

June: Siem Jan Koopman (VU, Amsterdam)

May: Second Workshop on ICT and Innovation Forecasting; From Theory to Practice & Applications

April:

Esther Ruiz (UC3 Madrid)

Genaro Sucarrat (BI Norwegian Business School)

March: WORKSHOP ON MODELLING & FORECASTING MOMENT RISK PREMIA

Jan-March 2015: the seminars are part of the Working Group on Risk - CREAR

10th December (Banque de France): ESSEC/Banque de France workshop on Expectations and Forecasting

6th and 7th November (La Défense): European Seminar on Bayesian Econometrics

October:

Ingrid VAN KEILEGOM (Université Catholique de Louvain)

Paul DOUKHAN (Université de Cergy-Pontoise)
