Why do we use the Gaussian distribution everywhere?

Written on June 17, 2020.
Tags: gaussian process, regression


While looking through my hidden GitHub repos to figure out which ones could actually be useful to others, I stumbled on pyGP and remembered a small ‘eureka’ moment I had while reading about Bayesian inference. I am going to share it; it might save you some time.

I have been using this repo to explore Gaussian Process Regression (GPR) and Bayesian Optimization (BO). Maybe more about this later, but there are already tons of blogs, tutorials and whatnot on the topic, so I am not sure I can add a lot. I have cleaned up the build and added a README.md; have a look at the repo if you want to see different variants with different frameworks: TensorFlow Probability, scikit-learn and GPflow.

Now that this is squared away and made available, let’s move to the point I wanted to cover: the ‘eureka’ I had while reading Statistical Rethinking: A Bayesian Course with Examples in R and Stan by R. McElreath (first edition; it took me more than a year to read, so I am probably going to pass on the second edition). Chapter 9 of the first edition is dedicated to introducing maximum entropy distributions, and this is the revelation I’m talking about.

Ever wondered why we see the same small number of distributions used again and again when we don’t know what to use, and whether that makes any sense?

Principle of maximum entropy

The reason it makes sense is what is called the principle of maximum entropy. It basically says that if we don’t know much, we should assume that what we have observed comes from the distribution that makes the fewest assumptions beyond what we know for sure, that is, the one with the largest entropy among all distributions consistent with what we know. Seems reasonable when we can easily find such a family.
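To make the idea concrete, here is a minimal sketch for the simplest case I can think of: a six-sided die about which we know nothing except that the probabilities sum to one. The candidate distributions and their names are made up for illustration; the only point is that the uniform one has the largest Shannon entropy.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical candidate distributions over the six faces of a die.
candidates = {
    "uniform": [1 / 6] * 6,
    "slightly loaded": [0.10, 0.15, 0.15, 0.20, 0.20, 0.20],
    "heavily loaded": [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
    "certain": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

entropies = {name: shannon_entropy(p) for name, p in candidates.items()}

# Print the candidates from the least committed to the most committed.
for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {h:.4f} nats")
```

The more a distribution commits to particular outcomes, the lower its entropy, which is why picking the highest-entropy candidate amounts to assuming as little as possible.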

Gaussian distributions and more

When what we know is the mean and the variance, the corresponding maximum entropy distribution happens to be the Gaussian distribution. And voilà.
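A quick sanity check of that claim, as a sketch rather than a proof: the snippet below compares the closed-form differential entropies of a few common distributions once they are all scaled to the same standard deviation. The choice of competitors (Laplace, uniform, logistic) and the function names are mine; the mean does not appear because shifting a distribution does not change its differential entropy.

```python
import math


def gaussian_entropy(sigma):
    # h = 0.5 * ln(2 * pi * e * sigma^2)
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)


def laplace_entropy(sigma):
    # Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b).
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)


def uniform_entropy(sigma):
    # Uniform of width w has variance w^2 / 12 and entropy ln(w).
    width = sigma * math.sqrt(12)
    return math.log(width)


def logistic_entropy(sigma):
    # Logistic with scale s has variance (s * pi)^2 / 3 and entropy ln(s) + 2.
    s = sigma * math.sqrt(3) / math.pi
    return math.log(s) + 2


sigma = 1.0
for name, fn in [
    ("gaussian", gaussian_entropy),
    ("logistic", logistic_entropy),
    ("laplace", laplace_entropy),
    ("uniform", uniform_entropy),
]:
    print(f"{name:>9}: {fn(sigma):.4f} nats")
```

For any value of sigma the Gaussian comes out on top, since all four entropies shift by the same ln(sigma) term.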

There are other maximum entropy distributions under specific constraints. For example, the Gamma distribution has the maximum entropy among distributions with the same mean and the same average logarithm, and so on. Read the book, there are more examples. Wikipedia has a table of the distributions and the corresponding constraints.

Does that extend to GPR?

What happens if you have samples of a discrete-time stochastic process, aka a time series?

At this point, I’m not sure if or when the Gaussian process used in GPR fulfills the principle of maximum entropy. I will not go further yet; I need to read more on the topic.

Burg’s maximum entropy theorem says that the process maximizing the entropy rate under some autocorrelation constraints (and the existence of some limits, which always exist for stationary processes) is a Gaussian process.
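For reference, this is roughly how the theorem is stated in textbooks, as far as I understand it. The notation ($\alpha_k$ for the prescribed autocorrelations, $p$ for the number of constrained lags, $a_k$ and $\sigma^2$ for the model parameters) is introduced here for the sketch and is not taken from the repo or the book above.

```latex
\begin{align*}
  \text{maximize}   \quad & h(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n}\, h(X_1, \dots, X_n)
      && \text{(entropy rate)} \\
  \text{subject to} \quad & \mathbb{E}[X_i X_{i+k}] = \alpha_k, \quad k = 0, 1, \dots, p, \ \text{for all } i
      && \text{(autocorrelation constraints)} \\
  \text{maximizer}  \quad & X_i = -\sum_{k=1}^{p} a_k X_{i-k} + Z_i, \quad
      Z_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)
      && \text{(Gauss-Markov process of order } p\text{)}
\end{align*}
```

Here $a_1, \dots, a_p$ and $\sigma^2$ are chosen so that the constraints hold. In other words, the maximizer is not just any Gaussian process but a Gaussian autoregressive one, so this does not by itself say anything about the kernels typically used in GPR.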


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.