Suppose we observe a sequence of samples from a very large alphabet, where the number of samples is comparable to or smaller than the alphabet size. Many letters of the alphabet will then be unseen, or missing, from the observed samples. What can be inferred about the distribution's probability mass on the missing letters? The sum of the probability masses of all missing letters is called the missing mass, and the classical Good-Turing (GT) estimator of the missing mass is minimax optimal over all distributions and alphabet sizes when the samples are i.i.d. However, when the samples form a Markovian sequence, the GT estimator fails. In this talk, we will introduce a windowed version of the GT estimator
and show that, when the window size is sufficiently larger than the mixing time, the windowed GT estimator is nearly minimax optimal. Going beyond missing mass, we will present generalizations to higher-order missing mass and missing g-mass, which can potentially quantify how far the missing part of the distribution is from uniformity. We will conclude with extensions of these results to the distribution's probability mass on sparsely observed letters and the potential impact on distribution estimation.