An exploration of the link between continuous and discrete mathematics with the Gaussian KDE

For the longest time I struggled with statistics, mainly because teaching practical statistics and distributions to students involves a lot of handwaving to get to the methods we use to make inferences about data.

In my case, every time I was exposed to stats, I still hadn't had enough exposure to calculus to understand why distributions were the way they were. It all seemed a bit magical. First you have a set of measurements and then, bang! You have this really nice-looking curve that appears to emerge from the noise.

That leap always made me uncomfortable, especially the way the interesting parts of the data were glossed over so we could get on to computing things like the mean or standard deviation.

The interesting question to me is: how do we go from a list of points, measurements, or whatever the underlying data is, to a function that tells us the likelihood of finding a measurement that falls within a particular region?

Introducing Kernel Density Estimation

The truth is that there are multiple answers to this question, and each one tells us something a little bit different about the underlying data. For brevity's sake, we will focus on an answer to the question using normal distributions, but know that we can apply this technique using other distribution shapes.

The shape of the distribution we use is called the kernel, and the short intuitive answer to the question posed above is that we take a little distribution for each point and lay them on top of one another. As you add more data to the dataset for a normally distributed random variable, you will eventually accumulate a curve that looks like a bigger normal distribution.
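To make the "stacking" idea concrete, here is a minimal sketch in Python (plain NumPy; the function names and the fixed bandwidth of 0.5 are my own choices for illustration, not code from the interactive demo below):

```python
import numpy as np

def gaussian_kernel(x, center, bandwidth):
    """A normal PDF centered on a single data point."""
    z = (x - center) / bandwidth
    return np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

def kde_estimate(x, data, bandwidth=0.5):
    """Stack one kernel per data point, then average so the result still integrates to 1."""
    kernels = [gaussian_kernel(x, point, bandwidth) for point in data]
    return np.mean(kernels, axis=0)

# A handful of measurements and a grid of x values to evaluate the estimate on.
data = np.array([1.2, 1.9, 2.1, 2.8, 3.5])
grid = np.linspace(-1, 6, 500)
density = kde_estimate(grid, data)
```

Each data point contributes one little bump, and averaging the bumps keeps the total area under the curve equal to 1.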

Before we go deeper, I have built an interactive example. You can click and drag data points to move them around and see how that affects the overall distribution.

You can also press the + and - buttons to add/remove data points from the estimation. Feel free to play with the other buttons, but we will get to their significance in a moment.

Impulses and Convolutions

The Impulse

Hopefully you now have a bit of an intuition for how the estimate changes depending on the underlying data. Next we will try to build an intuition for why it behaves the way it does.

The first concept is the impulse. Impulses are a fascinating subject in mathematics that have all sorts of uses and implications. For our purposes today, we will focus on the notion of an impulse as a probability distribution that is zero everywhere except for an infinitesimally small region in the domain centered around some mean value.

Remember that probability distributions aren't evaluated directly but are integrated over. The impulse, then, is still a distribution, but the width of the region that contributes any probability shrinks toward zero, while the total area under its curve stays equal to 1.
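In symbols (my notation, included here just for concreteness), an impulse centered at a measured value $\mu$ is the Dirac delta, which satisfies

$$
\delta(x - \mu) = 0 \quad \text{for } x \neq \mu,
\qquad
\int_{-\infty}^{\infty} \delta(x - \mu)\,dx = 1,
$$

so all of its unit probability sits at the single value $\mu$.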

If you're unfamiliar, you're probably asking yourself, "Why would we put ourselves in such a nightmare?"

The reason is that in the language of distributions, an impulse represents some knowledge, measurement, or in other words, a piece of data. It's not something that might happen—it's something that has happened and we noted it, hence its existence in the data.

Combining Distributions

The next step is to figure out some way of "combining" distributions. There are different methods we could consider, but for brevity here we will choose a straightforward approach.

First, to define what we mean by "combine," we need to define what exactly the probability distribution is. For our purposes, it's easiest to think of a distribution as a probability density function (PDF).

Functions are, of course, yet another wormhole we could get sucked into, but for now I'm going to assume that if you made it this far, you understand enough.

That being said, our PDF can't be just any old function; it must be a function that, when integrated over some patch of its domain, returns the probability of taking a measurement within that patch. This means the PDF must satisfy the following axioms (a quick numerical check follows the list):

  1. The PDF must always be non-negative: $f(x) \geq 0$ for all $x$.
  2. The integral of the PDF over the entire domain must equal 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
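As that quick check, here is a short sketch (using SciPy's normal PDF and numerical integration; the symmetric grid and standard-normal parameters are arbitrary choices of mine) that spot-checks both axioms for a Gaussian kernel:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Axiom 1: the PDF is non-negative everywhere we sample it.
grid = np.linspace(-10, 10, 1001)
assert np.all(norm.pdf(grid, loc=0, scale=1) >= 0)

# Axiom 2: the PDF integrates to 1 over the whole real line.
total, _ = quad(norm.pdf, -np.inf, np.inf, args=(0, 1))
print(total)  # ~1.0, up to numerical error
```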

Now, of course, you've seen a few of these functions already in the visualization. We know that, more or less, they look like a hump with tails that approach zero off toward either extreme.

We can also see from the visualization that combining two distributions is pretty intuitive. Try this one out: with just two data points, you effectively have two normal distributions being "combined." Try playing with the bandwidth parameter, and also try moving the points closer together and further apart.

What we can see is that as the bandwidth goes down or the points move further apart, we get a distinct peak around each point, and as the points get closer together they merge into one peak that resembles the bell curve.
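To see the same effect outside the interactive demo, here is a rough sketch (the point positions, bandwidths, and helper names are arbitrary choices of mine) that counts the peaks of a two-point estimate at a narrow and a wide bandwidth:

```python
import numpy as np
from scipy.stats import norm

def two_point_estimate(x, p1, p2, bandwidth):
    """Average of two normal kernels, one centered on each data point."""
    return 0.5 * (norm.pdf(x, loc=p1, scale=bandwidth) +
                  norm.pdf(x, loc=p2, scale=bandwidth))

def count_peaks(y):
    """Count local maxima by finding where the discrete slope flips from + to -."""
    slope_sign = np.sign(np.diff(y))
    return int(np.sum((slope_sign[:-1] > 0) & (slope_sign[1:] < 0)))

grid = np.linspace(-5, 5, 2001)
print(count_peaks(two_point_estimate(grid, -1.0, 1.0, bandwidth=0.4)))  # 2 peaks: they stay separate
print(count_peaks(two_point_estimate(grid, -1.0, 1.0, bandwidth=1.5)))  # 1 peak: they merge into one hump
```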

Looking at the graph region by region, we can see how one curve lies on top of the other, and overall all we have to do to combine the functions is add their outputs and then scale the result so it still satisfies the second axiom.
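Written out for the two-point case (my notation, with $K_h$ denoting a normal kernel of bandwidth $h$), the combination is just a scaled sum:

$$
\hat{f}(x) = \tfrac{1}{2}\big( K_h(x - x_1) + K_h(x - x_2) \big),
\qquad
K_h(u) = \frac{1}{h\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2h^2}\right).
$$

The factor of $\tfrac{1}{2}$ is the scaling step: each kernel already integrates to 1 on its own, so averaging keeps the combined area at 1.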

Convolutions

Now that we have the ability to combine our distributions, let's consider combining a bunch of impulses. What we end up with is a bunch of peaks that look a bit like pencils, all with a uniform height. The only time one peak ends up taller than another is when two measurements coincide exactly, which in most circumstances doesn't happen often.

Since we're only concerned with the normal distribution here, the kernel we widen each impulse into is the bell curve. All we then need to do to "convolve" is give each point some bandwidth, so that each impulse becomes a little bell curve, and then sum all of those curves together and scale the result.
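In formulas (again my notation), representing the data as an average of impulses and convolving that impulse train with the kernel gives exactly the recipe above: one kernel per measurement, summed and scaled.

$$
\hat{f}_h(x)
= \left( \frac{1}{n}\sum_{i=1}^{n} \delta(\cdot - x_i) * K_h \right)\!(x)
= \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i)
= \frac{1}{nh}\sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right),
$$

where $K$ is the standard normal PDF, $h$ is the bandwidth, and dividing by $n$ is the scaling that keeps the total area at 1.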

In practice, we use heuristics to determine what an appropriate bandwidth is. Oftentimes the correct bandwidth depends on domain knowledge about the underlying data.
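For example, SciPy's gaussian_kde ships with two common rules of thumb, Scott's rule and Silverman's rule, and also accepts a bandwidth factor chosen by hand (the sample data below is made up purely for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # made-up sample data
grid = np.linspace(-4, 4, 400)

# Rule-of-thumb bandwidths derived automatically from the data.
kde_scott = gaussian_kde(data, bw_method="scott")
kde_silverman = gaussian_kde(data, bw_method="silverman")

# A manually chosen factor, e.g. when domain knowledge calls for a smoother estimate.
kde_manual = gaussian_kde(data, bw_method=0.5)

for name, kde in [("scott", kde_scott), ("silverman", kde_silverman), ("manual", kde_manual)]:
    print(name, kde.factor, kde(grid).max())
```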

Next Steps

This is a fascinating subject, and I think there's a lot we can learn about distributions of data by thinking about density estimates like this. I'll leave you with a few questions that have been rattling around in the back of my mind:

What kind of information can we glean from a graph whose edges are composed of and weighted by the width of a vertical slice through the stacked kernels?

What can we learn by watching the graph transform as the vertical slice moves through the domain of the distribution?

Do symmetries emerge between different graph transformations of different permutations of data that build the same KDE?