The Interactive Kalman Filter

Author: Arthur Lovekin. Originally Published: 2024-10-27

Control the robot's acceleration by dragging the number: \( \controlInput = \)

Introduction

The Kalman Filter is a powerful tool for combining information in the face of uncertainty, and is used widely throughout science and engineering for tracking time-varying systems. In this article, I hope to teach you how deep principles in probability and linear algebra (which you probably already know!) naturally lead to the equations of the Kalman Filter. I've made a series of interactive diagrams that will give you a better feel for the different parameters, and help you understand both the power and limitations of this method.

This is meant as an introduction that can kickstart your progress without having to digest an entire math textbook. That being said, the theory behind the Kalman Filter goes a lot deeper than what I show here, including theoretical proofs of optimality and extensions to nonlinear systems. At the end of the article, I've listed the resources I used while writing, which I highly recommend if you want to dive deeper.

There are over 17 variables that are commonly used to derive the Kalman Filter, and to make matters more confusing, many resources use different naming conventions! To alleviate this pain, I've made it so that you can choose the naming convention using the dropdown menu below. Also, you can click on any variable throughout the article to get more information about it.

Choose your naming convention:

What is the Kalman Filter?

The primary goal of the Kalman Filter is to estimate the state of a dynamical system. The need for state estimation shows up throughout engineering. To give a few examples, if you're driving a car, you need to know where you are and how fast you're going before you can intelligently turn the steering wheel. If you're controlling the central heating in an office building, it is helpful to know the temperature in each room. If you're designing a brain-computer interface for paralyzed people [4], your entire aim might be to estimate the intended position of a cursor on a screen based on their noisy neural signals.

In all of these cases, there are multiple sources of information that we can use to estimate the state of the system. On one hand, we have some background knowledge of how the system should generally behave (e.g. the dynamical model of the car, building, or brain). On the other hand, we have sensors (e.g. cameras, thermometers, or brain electrodes) that can measure some aspect of the state. The challenge is that neither of these information sources is perfect: our dynamical model doesn't account for common disturbances, and although we often have many sensors, they are usually noisy, contradictory, and cannot individually perceive the full state. To get the most reliable state estimate, we need to continuously combine the information from the sensors and the dynamical model as time progresses.

The Kalman Filter is an algorithm that incorporates new pieces of information as they come in over time into one unified state estimate. It does this in two steps:

  1. Predict the next state using the dynamical model
  2. Update the state estimate using sensor measurements
The Kalman Filter assumes that both of the above steps can be described with linear models, and that there is some gaussian random noise associated with each of them. Under these assumptions, the Kalman Filter has been proven to be the theoretically optimal way to estimate the state of the system. Even when these assumptions don't hold, the Kalman Filter often remains accurate in practice, and it is very memory-efficient and fast, so it is a popular choice for many applications.

Building a model

To understand the details of the Kalman Filter, we'll use the example of a simple robot. The goal will be to estimate the position and velocity of the robot so that it can navigate without crashing. On its back, the robot carries an ultrasonic sensor that can measure the distance to the nearest wall, and you as the operator will be able to control the acceleration.

A simple robot

The first step in building an engineering model is to define a state vector. For our robot, this will be a two-dimensional vector containing the position \(p\) and velocity \(v\). In general, the state vector simply contains the minimum set of variables needed to make a complete description of the system (so in other examples it might contain things like temperatures, voltages, etc.). One caveat is that, practically speaking, the Kalman Filter algorithm will only ever be able to estimate the state; it never has access to the true state values. For this reason, we'll add a little hat to indicate that this vector is just an estimate. $$ \text{Estimated State vector }\stateEstimate = \begin{bmatrix} p \\ v \end{bmatrix} = \begin{bmatrix} \text{position} \\ \text{velocity} \end{bmatrix} $$

Once we've defined our state vector, we need to model how it evolves over time. For our robot, we can use a simple kinematics formula: $$ \begin{alignat}{3} p_{\discreteTime} &= p_{\discreteTime-1} &+& \Delta t &v_{\discreteTime-1} \\ v_{\discreteTime} &= 0 &+& &v_{\discreteTime-1} \\ \end{alignat} $$

We'll package this up into matrix form so it is easy to use later. The matrix that carries the old state to the new state under the dynamics model is called the state transition matrix \(\stateTransition\). $$ \text{State Transition Matrix } \stateTransition = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} $$ $$ \stateEstimate_{\discreteTime} = \stateTransition \stateEstimate_{\discreteTime-1} \\ $$

Now we have a simple dynamical model of our robot, but it fails to include one common thing: a way to control the robot. In the interactive demo above, you (as the operator) are able to control the linear acceleration \(a\) via the control input vector \(\controlInput\). This [-1,1] throttle input doesn't directly control the linear acceleration, but rather dictates the angular acceleration of the wheels. If we assume that the wheel has radius \(r\), the maximum angular acceleration is \( \alpha_{\text{max}} \), and that the wheel perfectly tracks the control command, then we can extend our kinematics equation to include the control input: $$ \begin{alignat}{3} a &= r \alpha_{\text{max}} u \\ p_{\discreteTime} &= p_{\discreteTime-1} &+& \Delta t &v_{\discreteTime-1} &+& \frac{1}{2} (r \alpha_{\text{max}} u) \Delta t^2 \\ v_{\discreteTime} &= 0 &+& &v_{\discreteTime-1} &+& (r \alpha_{\text{max}} u ) \Delta t \\ \end{alignat} $$ As with the state equation, we'll package this up in matrix form by defining the control matrix \(\controlMatrix\), which determines how the input vector \(\controlInput \) impacts the state. $$ \text{Control Input Vector } \controlInput = \begin{bmatrix} u \end{bmatrix} $$ $$ \text{Control Matrix } \controlMatrix = \begin{bmatrix} \frac{1}{2} r \alpha_{\text{max}} \Delta t^2 \\ r \alpha_{\text{max}} \Delta t \end{bmatrix} $$ Combining this with our dynamics model, our complete dynamical model looks like this: $$ \stateEstimate_{\discreteTime} = \stateTransition \stateEstimate_{\discreteTime-1} + \controlMatrix \controlInput_{\discreteTime} \\ $$ In summary, we predict the new state by applying the dynamics model to the previous state estimate, and adding a correction term for known influences like the control input.
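If you prefer code to notation, here is a minimal NumPy sketch of this dynamics model. The timestep, wheel radius, and maximum angular acceleration below are made-up values for illustration, not the ones used in the demo:

```python
import numpy as np

dt = 0.1          # timestep in seconds (assumed)
r = 0.05          # wheel radius in meters (assumed)
alpha_max = 10.0  # max angular acceleration in rad/s^2 (assumed)

# State transition matrix F: constant-velocity kinematics
F = np.array([[1.0, dt],
              [0.0, 1.0]])

# Control matrix B: maps the throttle u in [-1, 1] onto [p, v]
B = np.array([[0.5 * r * alpha_max * dt**2],
              [r * alpha_max * dt]])

x = np.array([[0.0],    # position estimate (m)
              [1.0]])   # velocity estimate (m/s)
u = np.array([[0.5]])   # throttle command

x_next = F @ x + B @ u  # one step of the dynamics model
print(x_next)           # -> [[0.10125], [1.025]]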

What if I'm controlling something other than the acceleration (like the velocity)?

The model above is there for your convenience, but if your system is different you should feel free to modify it! For example, a velocity controller might assume that the control input directly dictates the velocity, which would give you the following matrices: $$ \text{Control Input Vector } \controlInput = \begin{bmatrix} v \end{bmatrix} $$ $$ \text{Control Matrix } \controlMatrix = \begin{bmatrix} 0 \\ 1 \end{bmatrix} $$ $$ \text{State Transition Matrix } \stateTransition = \begin{bmatrix} 1 & \Delta t \\ 0 & 0 \end{bmatrix} $$ (Note that \(\stateTransition\) now has zeros in the bottom row because it is more reliable to assume that the controller dictates the new velocity than that the velocity remains constant.) There are other possibilities: you could get rid of \( \controlMatrix \) entirely and make \(\stateTransition_{11} = \frac{u}{v}\), or you could even treat the control input as a sensor and incorporate it in the Update step (see below). I don't know what is best, but you should experiment with what makes sense!
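Here is what that velocity-control variant might look like as a NumPy sketch (the timestep is an assumed value):

```python
import numpy as np

dt = 0.1  # timestep in seconds (assumed)

# Velocity-control variant: u directly dictates the new velocity
F = np.array([[1.0, dt],
              [0.0, 0.0]])  # old velocity no longer carries over
B = np.array([[0.0],
              [1.0]])

x = np.array([[2.0], [1.0]])  # [position, velocity]
u = np.array([[0.8]])         # commanded velocity (m/s)
print(F @ x + B @ u)          # -> [[2.1], [0.8]]
```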

Bringing Uncertainty into the Model

The linear model we derived above is great, but it's not very realistic. In the real world, the robot might slip and bounce over the terrain, the motors might not exactly execute their commands, and so on. In other words, there is random noise introduced by the messiness of the environment, which in turn makes our state estimate less certain.

One method that can represent both the uncertainty in our state estimate and the random noise introduced by the environment is to assign each of them a probability distribution. Essentially, we acknowledge that at a given instant we do not know the precise value of the state, but we can still assign a probability to the state lying within a given range (where that probability is the area under the probability distribution). For simplicity, we assume that both the state and noise distributions are gaussian.

So now, instead of representing our state estimate with a single vector, we'll say that our state estimate is a normal distribution parametrized by a mean \(\state\) and a covariance \(\stateCovariance\). $$ \stateEstimate \sim \gaussian \left( \state, \stateCovariance \right) $$

Interactive diagram: drag the entries of the mean \( \state \) and covariance \( \stateCovariance \) to reshape the plotted gaussian \( \gaussian \left( \state, \stateCovariance \right) \).

Why a gaussian?

There are a few really compelling reasons to choose a gaussian:
  1. We can parametrize the distribution with just the mean and variance, so it is numerically easy to update
  2. When we combine two independent gaussians into one estimate, the result is also a gaussian
  3. Mathematicians have proven (the Central Limit Theorem) that the combined effect of many random processes (even if individually they are non-gaussian) will result in a gaussian distribution. See 3b1b's explanation for why.
Take heed: we choose a gaussian mostly because it is numerically convenient, but in practice many things (even something as simple as a \( \sin \theta \) term entering the dynamics model when the robot turns) will make the distribution non-gaussian.

Hopefully you're already familiar with a 1D normal distribution, which is usually parametrized by its mean \(\mu\) and variance \(\sigma^2\). The multivariate version is similar, except now the mean is an n-dimensional vector (in our case, 2-dimensional), and the covariance is an n x n matrix whose (\(i,j\))th element represents the covariance between the \(i\)th and \(j\)th state variables. $$ \text{Mean State Vector } \state = \begin{bmatrix} \bar{p} \\ \bar{v} \end{bmatrix} $$ $$ \text{State Covariance Matrix } \stateCovariance = \begin{bmatrix} \text{Var}(p) & \text{Cov}(p,v) \\ \text{Cov}(v,p) & \text{Var}(v) \end{bmatrix} $$

Now comes an important point: in our dynamics equation above, we knew how to apply the matrix \(\stateTransition\) to the state estimate \( \stateEstimate \) when it was a single vector. But now our estimate is a gaussian, so we need to apply \(\stateTransition\) to the whole distribution. Fortunately, the mean is a single point, so we can apply the state transition matrix to it directly as we did before. The variance, on the other hand, is the average "squared-distance" from the mean, so how should it be treated?

We'll introduce a very useful rule for multiplying a gaussian distribution \(\gaussian \left( \mathbf{\bar{\mu}}, \mathbf{\Sigma} \right) \) by a matrix \( \mathbf{A} \):

$$ \begin{align} \text{If you linearly transform the mean: } & \mathbf{\bar{\mu}}_{\discreteTime} = \mathbf{A} \mathbf{\bar{\mu}}_{\discreteTime-1} \\ \text{ then the covariance is: } & \mathbf{\Sigma}_{\discreteTime} = \mathbf{A} \mathbf{\Sigma}_{\discreteTime-1} \mathbf{A}^T \end{align} $$

Why multiply the covariance by \( \mathbf{A} \) on the left and right?

One way you can remember this is to think back to what happens when you multiply a 1D normal distribution by a scalar. The standard deviation \( \sigma \) will scale linearly, and the variance is the standard deviation squared (which we can write in a way that is suggestive of what the operation looks like in higher dimensions).
Diagram showing the effect of multiplying a 1D Gaussian by a scalar
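If you want to convince yourself of this rule numerically, here is a small NumPy experiment with arbitrary example values for \( \mathbf{A} \), \( \mathbf{\bar{\mu}} \), and \( \mathbf{\Sigma} \): transform samples from a gaussian by \( \mathbf{A} \) and compare the empirical moments against \( \mathbf{A} \mathbf{\bar{\mu}} \) and \( \mathbf{A} \mathbf{\Sigma} \mathbf{A}^T \).

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
mu = np.array([1.0, 1.0])
Sigma = np.eye(2)

# Sample from N(mu, Sigma) and push every sample through A ...
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
transformed = samples @ A.T

# ... then compare the empirical moments against the rule.
print(transformed.mean(axis=0))           # ~ A @ mu = [1.1, 1.0]
print(np.cov(transformed, rowvar=False))  # ~ A @ Sigma @ A.T
print(A @ Sigma @ A.T)                    # [[1.01, 0.1], [0.1, 1.0]]
```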

The other thing we need to know is that if we add or subtract two independent, normally distributed random variables, then their variances add (see AP Central's explanation for a refresher).

With this knowledge of probability, we have actually arrived at the point where we can write out the first component of the Kalman Filter: the Predict Step! We'll define the covariance of the state estimate to be \(\stateCovariance\), and the mean to be \(\state\). To handle environmental noise, we'll also add a noise term with mean 0 and covariance \(\processNoiseCovariance\). Writing this out, we have:

The Predict Step: $$ \state_{\discreteTime} = \stateTransition \state_{\discreteTime-1} + \controlMatrix \controlInput_{\discreteTime} $$ $$ \stateCovariance_{\discreteTime} = \stateTransition \stateCovariance_{\discreteTime-1} \stateTransition^T + \processNoiseCovariance $$ where $$ \text{State Estimate Distribution }\stateEstimate \sim \gaussian \left( \state, \stateCovariance \right) $$ and $$ \text{Process Noise} \sim \gaussian \left( 0, \processNoiseCovariance \right) $$

Fundamentally, the predict step describes how to predict a new gaussian state estimate from the previous estimate, based on our dynamical model of the system.
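In code, the Predict step is only two lines. Here is a hedged NumPy sketch (the control matrix and noise covariance below are assumed values); notice how the variance grows on every call when no measurements arrive:

```python
import numpy as np

def predict(x, P, F, B, u, Q):
    """One Predict step: push the gaussian estimate through the
    dynamics, then inflate the covariance with process noise."""
    x_new = F @ x + B @ u
    P_new = F @ P @ F.T + Q
    return x_new, P_new

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])  # assumed control matrix
Q = 0.1 * np.eye(2)          # assumed process noise covariance

x, P = np.zeros((2, 1)), np.eye(2)
for _ in range(5):           # predict repeatedly with no updates
    x, P = predict(x, P, F, B, np.zeros((1, 1)), Q)
print(np.diag(P))            # variances grow on every step
```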

Drag the underlined values to see how they would affect one iteration of the dynamics model!

Interactive diagram: the Predict equations \( \state_{\discreteTime} = \stateTransition \state_{\discreteTime-1} + \controlMatrix \controlInput_{\discreteTime} + \mathbf{0} \) and \( \stateCovariance_{\discreteTime} = \stateTransition \stateCovariance_{\discreteTime-1} \stateTransition^T + \processNoiseCovariance \), rendered with draggable numeric entries for each matrix.

Let's try running this Predict-Only Filter in state space!

Control the robot's acceleration by dragging the number: \( \controlInput = \)

Sensing the environment

Picture of an ultrasonic distance sensor

Ok, so the state estimate ... isn't terrible, but it's not great either, and the variance is exploding (which is reasonable considering the dynamics equation has no feedback). Fortunately, we have endowed this robot with an ultrasonic distance sensor that should help. This sensor will output a measurement vector containing one value corresponding to the distance to the nearest object. $$ \text{Measurement Vector } \measurement = \begin{bmatrix} \text{distance} \end{bmatrix} $$ The only question is, how do we incorporate these measurements into the filter?

To start with, we need a linear model that relates the measurement vector \( \measurement \) to the state. Our first instinct might be to write a matrix that converts from the measurement vector to the state vector (we're trying to update the state after all). The problem is that since the sensor doesn't measure the velocity, there is no way to fully recover the state from the measurement using a linear transformation. We can however think about this from a more causal perspective:

In reality we know that the robot is in some true state \( \trueState \), and when the sensor takes a measurement of that state it will generally follow a linear process. If we define the measurement matrix \(\measurementMatrix\) to describe this linear measurement process, then in the ideal case we can write: $$ \measurement_{\text{ideal}} = \measurementMatrix \trueState $$

Let's get into some implementation details to see what \(\measurementMatrix\) looks like.

This video does a great job of explaining how ultrasonic sensors work

Most ultrasonic sensors do not directly provide a distance. Instead, they measure the time (in microseconds, \( \mu s \)) it takes for a pulse of sound to travel to the wall and back. In order to convert from meters to \( \mu s \), we need to multiply by a factor of \( \frac{2 \times 10^6 }{c} \), where \(c\) is the speed of sound in air (343 \( m/s \)), the \( 10^6 \) accounts for there being \( 10^6 \mu s \) per second, and the factor of 2 accounts for the sound traveling double the distance (to the wall, and then back). Thus to convert from the state vector to the measurement vector, the measurement matrix would be: $$ \text{Measurement Matrix } \measurementMatrix = \begin{bmatrix} \frac{2 \times 10^6 }{c} & 0 \end{bmatrix} $$
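As a tiny NumPy sketch of this (assuming the sensor's raw output really is an echo time in \( \mu s \)):

```python
import numpy as np

c = 343.0  # speed of sound in air (m/s)

# H maps the state [position (m), velocity (m/s)] to the
# expected echo time in microseconds.
H = np.array([[2e6 / c, 0.0]])

x_true = np.array([[1.5],   # robot is 1.5 m from the wall
                   [0.2]])  # velocity doesn't affect the reading
z_ideal = H @ x_true
print(z_ideal)  # -> [[8746.36...]] microseconds
```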

Of course, no sensor is ever perfect, so we will also say that when the sensor takes a measurement it will also introduce some sensor noise. We assume that the noise is normally distributed with zero mean and covariance \(\measurementNoiseCovariance\) (which hopefully is listed in the sensor datasheet). $$ \text{Measurement Noise} \sim \gaussian \left( 0, \measurementNoiseCovariance \right) $$ Now we can say that when the robot is in state \( \trueState \), the measurements you can expect to get will follow a normal distribution with covariance \(\measurementNoiseCovariance\) and mean \(\measurementMatrix \trueState\). Of course, we never actually see the entirety of this distribution; the robot will only ever take individual measurements. Since \( \measurement \) is the only sample we have, we will say that every time we read some measurement, we infer that the distribution of possible measurements is a gaussian centered at \( \measurement \) with covariance \(\measurementNoiseCovariance\): $$ \text{Possible Measurements}\sim \gaussian \left( \measurement, \measurementNoiseCovariance \right) $$

To summarize, we've defined the measurement matrix \( \measurementMatrix \), whose role is to convert from the space of states to the space of measurements. We've used it here to convert from the ideal state to an ideal measurement, but in the next section we will see that \( \measurementMatrix \) can also be applied to the entire distribution of our current state estimate. This operation yields the measurement we expect given the current state estimate.

Update Step: Combining two gaussians

At this point, we have one normal distribution that represents our belief of the true state of the robot, and another normal distribution that represents the measurements you'd expect to get from the true state. How can we combine these two distributions into one new state estimate? $$ \text{State Estimate distribution from the Prediction Step: }\gaussian \left( \state, \stateCovariance \right) $$ $$ \text{Measurement distribution inferred from the actual measurement: } \gaussian \left( \measurement, \measurementNoiseCovariance \right) $$

The first challenge we face is that we're trying to combine different "types" of things. The state estimate distribution has a mean and variance with a particular dimension and units, while the measurement distribution has entirely different dimensions and units. Before we can do any combining, we have to first get both distributions into the same space.

There are two ways to approach this. Choose your own adventure!

The first way we could approach combining the measurement distribution with the state estimate distribution can be outlined as follows:

  1. See what measurement we should expect given our current state estimate
  2. Combine this with the actual measurement we got
  3. Convert the "fused measurement" back to state space
Let's break this down.

In order to put our two distributions on the same footing, we'll multiply the state estimate distribution by \( \measurementMatrix \) to bring it into the space of measurements. Remembering our rule for multiplying gaussians, we'll get two normal distributions in the same space:

Now that we have two distributions in the same space, we can stop to think: what would we like to be true when we combine these estimates? Here are some things that would be nice:

  1. The mean of the combined estimate should always lie between the means of the two original estimates
  2. The uncertainty (i.e. variance) should be smaller than the uncertainty of either of the original estimates
  3. The resulting estimate should more closely resemble the distribution in which we have higher confidence (i.e. less variance).

For example, if we know that the ultrasonic sensor is much more accurate than our dynamics model (which is propagated blindly), then we'd want our combined estimate to consist mostly of the sensor measurement, with a hint of the dynamics model to account for small measurement noise. Since we're making use of all of our information, we should be more certain of our combined estimate than either of the original estimates independently.

But how much, exactly, of each estimate do we want to use? One reasonable approach is to weight each distribution by the inverse of its variance, so the distribution with less uncertainty (i.e. smaller variance) will play more of a role. To write this mathematically, we first define the variance ratio (in the space of measurements): $$ \varianceRatio = \frac{\measurementMatrix \stateCovariance \measurementMatrix^T} {\measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance} = \frac{\text{Uncertainty from state estimate}}{\text{Total uncertainty from the state estimate and sensor noise}} $$

Wait, you can write a matrix inverse like a fraction?!

Technically, writing a matrix fraction is not legitimate because, unlike scalars, the order in which you multiply the numerator and denominator greatly matters, but the fraction form does not contain this information. That being said, the insight that this is a simple ratio is heightened by the fractional notation, so I ask that you simply remember that the inverse goes on the right.

Is there a deeper mathematical reason why we use the variance ratio?

Yes! You may have learned about Bayes' Rule, which describes how to update your beliefs in light of new information. In particular, it says that $$ P( A | B ) = \frac{P( B | A ) P( A )}{P( B )}$$ $$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalizing constant}} $$ In our case, the prior is the state estimate from the prediction step, the likelihood comes from our measurement, and the posterior is our combined estimate. $$ P(\state | \measurement) = \frac{P(\measurement | \state) P(\state)}{P(\measurement)} $$ The primary peculiarity of the Kalman Filter is that instead of having individual numbers for each probability in the equation, now we have gaussian distributions. Nonetheless, the principle of multiplying the likelihood by the prior and normalizing is the same. If you multiply two gaussian probability density functions, you can actually show that the variance ratio emerges! For example, in 1D you could algebraically show: $$ C_1 \exp \left( -\frac{(x-\mu_1)^2}{2\sigma_1^2} \right) \times C_2 \exp \left( -\frac{(x-\mu_2)^2}{2\sigma_2^2} \right) = C_3 \exp \left( -\frac{(x-\mu_3)^2}{2\sigma_3^2} \right) $$ Where the \( C \) terms are normalization constants, and the mean and variance of the resulting distribution are: $$ \mu_3 = \left( 1 - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \right) \mu_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \mu_2 $$ $$ \sigma_3^2 = \sigma_1^2 - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \sigma_1^2 $$
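If you'd like to check these product formulas numerically, here is a short NumPy sketch with assumed example values that multiplies the two densities on a grid and renormalizes:

```python
import numpy as np

mu1, var1 = 0.0, 2.0   # prior (assumed example values)
mu2, var2 = 1.0, 0.5   # likelihood (assumed example values)

# Closed-form mean and variance of the normalized product:
w = var1 / (var1 + var2)
mu3 = (1 - w) * mu1 + w * mu2
var3 = var1 - w * var1
print(mu3, var3)  # -> 0.8 0.4

# Numerical check: multiply the two densities pointwise on a grid.
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
pdf = lambda m, v: np.exp(-(x - m) ** 2 / (2 * v))
prod = pdf(mu1, var1) * pdf(mu2, var2)
prod /= prod.sum() * dx                    # renormalize
print((x * prod).sum() * dx)               # ~ 0.8 (matches mu3)
print(((x - mu3) ** 2 * prod).sum() * dx)  # ~ 0.4 (matches var3)
```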

Then we can use this to write a weighted sum of each distribution: $$ \state_{\discreteTime} = \measurementMatrix^{-1} \left[ \left( \identityMatrix - \varianceRatio \right) \measurementMatrix \state_{\discreteTime-1} + \varianceRatio \measurement \right] $$

Hold on, where did that \(\measurementMatrix^{-1}\) come from? Remember, at the end of the equation we need to get something that is in the space of states, but we're currently in the space of measurements. Since \(\measurementMatrix\) converts from state space to measurement space, we'd imagine that \(\measurementMatrix^{-1}\) goes the other way from measurements to states.

There's only one problem: \(\measurementMatrix\) is not necessarily square, so it cannot be inverted.

Fortunately, there's a workaround! All we have to do is suspend our disbelief and pretend that \(\identityMatrix = \measurementMatrix^{-1} \measurementMatrix\). Then we can distribute the \(\measurementMatrix^{-1}\) into the formula and cancel it with existing \(\measurementMatrix\) terms, including the leading \(\measurementMatrix\) of the variance ratio formula. The variance ratio without its leading \(\measurementMatrix\) is actually so important that we'll give it a name: $$ \text{The Kalman Gain } \kalmanGain = \stateCovariance \measurementMatrix^T \left( \measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance \right)^{-1} $$ Then we can write the variance ratio as: $$ \varianceRatio = \measurementMatrix \kalmanGain $$ Now we can proceed in applying the distributive property to get our state update equation: $$ \begin{align*} \state_{\discreteTime} &= \measurementMatrix^{-1} \left[ (\identityMatrix - \varianceRatio) \measurementMatrix \state_{\discreteTime-1} + \varianceRatio \measurement \right] \\ &= \left[ \left( \measurementMatrix^{-1} - \measurementMatrix^{-1} \varianceRatio \right) \measurementMatrix \state_{\discreteTime-1} + \measurementMatrix^{-1} \varianceRatio \measurement \right] \\ &= \left( \measurementMatrix^{-1} - \kalmanGain \right) \measurementMatrix \state_{\discreteTime-1} + \kalmanGain \measurement \\ \state_{\discreteTime} &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \state_{\discreteTime-1} + \kalmanGain \measurement \\ \end{align*} $$

Great! Now we have a matrix formula to combine the means of two Gaussians. The only step left is to propagate the variances too, which we can easily do using our rule for multiplying Gaussians.

$$ \stateCovariance_{\discreteTime} = \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \stateCovariance_{\discreteTime-1} \left( \identityMatrix - \kalmanGain \measurementMatrix \right)^T + \kalmanGain \measurementNoiseCovariance \kalmanGain^T $$

The equation above is known as the Joseph Form of the Covariance Update Equation. There is also a simpler (though less numerically stable) form of this equation.

$$ \begin{align} \stateCovariance_{\discreteTime} &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \stateCovariance_{\discreteTime-1} \class{highlight-term}{ \left(\identityMatrix - \kalmanGain \measurementMatrix \right)^T} + \kalmanGain \measurementNoiseCovariance \kalmanGain^T \\ &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \stateCovariance_{\discreteTime-1} \class{highlight-term}{ \left( \identityMatrix - \left(\kalmanGain \measurementMatrix \right)^T \right) } + \kalmanGain \measurementNoiseCovariance \kalmanGain^T \\ &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \class{highlight-term}{ \stateCovariance_{\discreteTime-1} \left(\identityMatrix - \measurementMatrix^T \kalmanGain^T \right) } + \kalmanGain \measurementNoiseCovariance \kalmanGain^T \\ &= \class{highlight-term}{ \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \left(\stateCovariance_{\discreteTime-1} - \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T \right) } + \kalmanGain \measurementNoiseCovariance \kalmanGain^T \\ &= \stateCovariance_{\discreteTime-1} - \kalmanGain \measurementMatrix \stateCovariance_{\discreteTime-1} - \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T + \class{highlight-term}{ \kalmanGain \measurementMatrix \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T + \kalmanGain \measurementNoiseCovariance \kalmanGain^T } \\ &= \stateCovariance_{\discreteTime-1} - \kalmanGain \measurementMatrix \stateCovariance_{\discreteTime-1} - \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T + \class{highlight-term}{ \kalmanGain \left(\measurementMatrix \stateCovariance_{\discreteTime-1} \measurementMatrix^T + \measurementNoiseCovariance \right) } \kalmanGain^T \\ &= \stateCovariance_{\discreteTime-1} - \kalmanGain \measurementMatrix \stateCovariance_{\discreteTime-1} \class{highlight-term}{ - \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T + \stateCovariance_{\discreteTime-1} \measurementMatrix^T \kalmanGain^T }\\ \stateCovariance_{\discreteTime} &= \stateCovariance_{\discreteTime-1} - \kalmanGain \measurementMatrix \stateCovariance_{\discreteTime-1} \end{align} $$ This is a much simpler version of the covariance update equation! It clearly gives the insight that with new information, your uncertainty will always decrease. The only limitation is that, without quadratic terms keeping things symmetric, small numerical errors can build up over time and lead to an invalid covariance matrix.
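Here is a small NumPy sketch (with arbitrary example matrices) comparing the two forms. With the optimal gain they agree exactly, but only the Joseph Form is quadratic in \( \kalmanGain \), which is what keeps it symmetric in the face of rounding errors:

```python
import numpy as np

P = np.array([[1.0, 0.3],
              [0.3, 2.0]])  # arbitrary example covariance
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
I = np.eye(2)

K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # optimal Kalman Gain

# Joseph Form: quadratic in K, symmetric by construction
P_joseph = (I - K @ H) @ P @ (I - K @ H).T + K @ R @ K.T
# Simple form: cheaper, but symmetry relies on exact arithmetic
P_simple = P - K @ H @ P

print(np.allclose(P_joseph, P_simple))  # True
```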

Here's an interactive diagram to visualize the whole process. The work done by the Kalman Gain is surrounded by a dashed box. Try sliding the underlined variables around to see how they affect the outcome!

Update Step (Measurement Space)
Interactive diagram: the state estimate \( \gaussian \left( \state_{\discreteTime-1}, \stateCovariance_{\discreteTime-1} \right) \) is converted to measurement space with \( \measurementMatrix \), fused with the measurement distribution \( \gaussian \left( \measurement, \measurementNoiseCovariance \right) \) via the variance ratio \( \varianceRatio = \frac{\measurementMatrix \stateCovariance \measurementMatrix^T}{\measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance} \), and converted back to state space with \( \measurementMatrix^{-1} \) to produce the new state estimate \( \gaussian \left( \state_{\discreteTime}, \stateCovariance_{\discreteTime} \right) \). All underlined values are draggable.

The other way we could approach combining the measurement distribution with the state estimate distribution can be outlined as follows:

  1. See what states could have led to the measurement we got
  2. Combine this with our current state estimate

Let's break this down.

In order to put our two distributions on the same footing, we'll multiply the measurement distribution by \( \measurementMatrix^{-1} \) to bring it into the space of states. Remembering our rule for multiplying gaussians, we'll get two normal distributions in the same space:

(If you see something wrong here, suspend your disbelief and read on ... )

But how much, exactly, of each estimate do we want to use? One reasonable approach is to weight each distribution by the inverse of its variance, so the distribution with less uncertainty (i.e. smaller variance) will play more of a role. To write this mathematically, we first define the variance ratio (in the space of states): $$ \varianceRatioState = \frac{\stateCovariance}{\stateCovariance + \measurementMatrix^{-1} \measurementNoiseCovariance (\measurementMatrix^{-1})^T} = \frac{\text{Uncertainty from state estimate}}{\text{Total uncertainty from the state estimate and sensor noise}} $$

Wait, you can write a matrix inverse like a fraction?!

Is there a deeper mathematical reason why we use the variance ratio?

Then we can use this to write a weighted sum of each distribution:

$$ \state = (\identityMatrix - \varianceRatioState) \state + \varianceRatioState \measurementMatrix^{-1} \measurement $$

Great! It seems like this weighted sum does exactly what we need.

There's only one problem: \(\measurementMatrix\) is not necessarily square, so it cannot be inverted.

Gru's Plan meme: Measurement Matrix is not invertible

Well, that kind of throws a wrench in everything. Fortunately, there is a workaround! If we suspend our disbelief and do a bit of matrix algebra as if \( \measurementMatrix^{-1} \) exists, we can avoid ever having to invert the measurement matrix. We'll use the following identities: for any invertible matrices \(\mathbf{A}\) and \(\mathbf{B}\), \( (\mathbf{A} \mathbf{B})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1} \) and \( \left(\mathbf{A}^{-1}\right)^{T} = \left(\mathbf{A}^T\right)^{-1} \), and we can always multiply anywhere by the identity \( \identityMatrix = \measurementMatrix^{-1} \measurementMatrix = \measurementMatrix^T (\measurementMatrix^T)^{-1} \).

So let's knock out those inverses! $$ \begin{align*} \varianceRatioState &= \stateCovariance \left[ \stateCovariance + \measurementMatrix^{-1} \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right]^{-1} \\ &= \stateCovariance \left[ \stateCovariance + \measurementMatrix^{-1} \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right]^{-1} \left( \measurementMatrix^{-1} \measurementMatrix \right)\\ &= \stateCovariance \left[ \measurementMatrix \left( \stateCovariance + \measurementMatrix^{-1} \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right) \right]^{-1} \measurementMatrix \\ &= \stateCovariance \left[ \measurementMatrix \stateCovariance + \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right]^{-1} \measurementMatrix \\ &= \stateCovariance \left( \measurementMatrix^T \left( \measurementMatrix^T \right)^{-1} \right) \left[ \measurementMatrix \stateCovariance + \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right]^{-1} \measurementMatrix \\ &= \stateCovariance \measurementMatrix^T \left[ \left( \measurementMatrix \stateCovariance + \measurementNoiseCovariance \left( \measurementMatrix^{-1} \right)^T \right) \measurementMatrix^T \right]^{-1} \measurementMatrix \\ \varianceRatioState &= \stateCovariance \measurementMatrix^T \left[ \measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance \right]^{-1} \measurementMatrix \\ \end{align*} $$

Now our variance ratio is free of \( \measurementMatrix^{-1} \), but there is still one more remaining in the weighted sum formula. In preparation to eliminate this remaining \( \measurementMatrix^{-1} \), we'll separate out the trailing \( \measurementMatrix \) from the rest of the formula. The variance ratio without the trailing \( \measurementMatrix \) is actually so important that we'll give it a name: $$ \text{The Kalman Gain } \kalmanGain = \stateCovariance \measurementMatrix^T \left[ \measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance \right]^{-1} $$ Then we can write the variance ratio as: $$ \varianceRatioState = \kalmanGain \measurementMatrix $$

Now we can proceed in writing our state update equation without any remaining \( \measurementMatrix^{-1} \): $$ \begin{align*} \state &= (\identityMatrix - \varianceRatioState) \state + \varianceRatioState \measurementMatrix^{-1} \measurement \\ \state &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \state + \kalmanGain \measurement \\ \end{align*} $$

Great! Now we have a matrix formula to combine the means of two Gaussians. The only step left is to propagate the variances too, which we can easily do using our rule for multiplying Gaussians.

$$ \stateCovariance_{\discreteTime} = \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \stateCovariance_{\discreteTime-1} \left( \identityMatrix - \kalmanGain \measurementMatrix \right)^T + \kalmanGain \measurementNoiseCovariance \kalmanGain^T $$

The equation above is known as the Joseph Form of the Covariance Update Equation. There is also a simpler (though less numerically stable) form of this equation.

Here's an interactive diagram to visualize the whole process. The work done by the Kalman Gain is surrounded by a dashed box. Try sliding the underlined variables around to see how they affect the outcome!

Update Step (State Space)
Interactive diagram: the measurement distribution \( \gaussian \left( \measurement, \measurementNoiseCovariance \right) \) is converted to state space with \( \measurementMatrix^{-1} \), then fused with the state estimate \( \gaussian \left( \state_{\discreteTime-1}, \stateCovariance_{\discreteTime-1} \right) \) via the variance ratio \( \varianceRatioState = \frac{\stateCovariance}{\stateCovariance + \measurementMatrix^{-1} \measurementNoiseCovariance (\measurementMatrix^{-1})^T} = \stateCovariance \measurementMatrix^T \left[ \measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance \right]^{-1} \measurementMatrix \) to produce the new state estimate \( \gaussian \left( \state_{\discreteTime}, \stateCovariance_{\discreteTime} \right) \). All underlined values are draggable.

Bringing it all together

In both of our derivations, we've arrived at the same definition for the Kalman Gain: $$ \text{The Kalman Gain } \kalmanGain = \stateCovariance \measurementMatrix^T \left( \measurementMatrix \stateCovariance \measurementMatrix^T + \measurementNoiseCovariance \right)^{-1} $$ However, we have also shown that there are two different ways to interpret this formula (so long as you imagine that \( \measurementMatrix^{-1} \) exists). Rearranging \( \varianceRatioState = \kalmanGain \measurementMatrix \) and \( \varianceRatio = \measurementMatrix \kalmanGain \) we have: $$ \kalmanGain = \measurementMatrix^{-1} \varianceRatio = \varianceRatioState \measurementMatrix^{-1} $$ Therefore, we can choose to interpret the Kalman Gain as a two-step process that always takes measurements as inputs, then either:

  1. Weighs the measurement by the variance ratio \( \varianceRatio \) in measurement space, then converts the result to state space with \( \measurementMatrix^{-1} \), or
  2. Converts the measurement to state space with \( \measurementMatrix^{-1} \), then weighs it by the variance ratio \( \varianceRatioState \) in state space.

The Full Algorithm

i.e. what you actually need to implement

Now that we have all the mathematical pieces of the Predict and Update steps, we can write out the full algorithm! In practice, we write each step as a function that takes the previous state estimate along with new information, and returns a new state estimate.

Predict(\( \state_{\discreteTime-1}, \space \stateCovariance_{\discreteTime-1}, \space \stateTransition, \space \controlMatrix, \space \controlInput_{\discreteTime}, \space \processNoiseCovariance \)):
\( \begin{align} \quad \quad \state_{\discreteTime} &= \stateTransition \state_{\discreteTime-1} + \controlMatrix \controlInput_{\discreteTime} \\ \quad \quad \stateCovariance_{\discreteTime} &= \stateTransition \stateCovariance_{\discreteTime-1} \stateTransition^T + \processNoiseCovariance \\ \end{align} \)
\(\quad \quad \text{return } \left( \state_{\discreteTime}, \space \stateCovariance_{\discreteTime} \right) \)

Update(\( \state_{\discreteTime-1}, \space \stateCovariance_{\discreteTime-1}, \space \measurementMatrix, \space \measurement_{\discreteTime}, \space \measurementNoiseCovariance \)):
\( \begin{align} \quad \quad \kalmanGain &= \stateCovariance_{\discreteTime-1} \measurementMatrix^T \left( \measurementMatrix \stateCovariance_{\discreteTime-1} \measurementMatrix^T + \measurementNoiseCovariance \right)^{-1} \\ \quad \quad \state_{\discreteTime} &= \state_{\discreteTime-1} + \kalmanGain \left( \measurement_{\discreteTime} - \measurementMatrix \state_{\discreteTime-1} \right) \\ \quad \quad \stateCovariance_{\discreteTime} &= \left( \identityMatrix - \kalmanGain \measurementMatrix \right) \stateCovariance_{\discreteTime-1} \left( \identityMatrix - \kalmanGain \measurementMatrix \right)^T + \kalmanGain \measurementNoiseCovariance \kalmanGain^T \end{align} \)
\(\quad \quad \text{return } \left( \state_{\discreteTime}, \space \stateCovariance_{\discreteTime} \right) \)
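For reference, here is how the Update function might look in NumPy (a hedged sketch using the Joseph Form, not the JavaScript implementation from my repo); it pairs with the predict function sketched earlier:

```python
import numpy as np

def update(x, P, H, z, R):
    """One Update step: fuse the measurement z into the estimate."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Kalman Gain
    x_new = x + K @ (z - H @ x)                   # correct the mean
    I = np.eye(P.shape[0])
    # Joseph Form of the covariance update
    P_new = (I - K @ H) @ P @ (I - K @ H).T + K @ R @ K.T
    return x_new, P_new
```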

A few notes:

  1. You should call the Predict function as often as you can (the dynamics never stop, so neither should your estimate), but the Update function can only be called when there is a new measurement. If your sensor is fast, you might always call Predict immediately followed by Update, but if your sensor is slow, 5 Predict steps might go by with no Update (see the sketch after this list).
  2. In more advanced implementations, you can define all of the matrices (\(\stateTransition\), \(\controlMatrix\), \(\measurementMatrix\), etc.) to vary with time (they even may change nonlinearly as time progresses).
  3. We use the longer form of the covariance update formula (also known as the Joseph Form) because it is numerically more stable (though on a modern computer running for short time periods, the difference is negligible).
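Putting note 1 into code: a typical loop predicts on every tick and updates only when the (slower) sensor delivers a reading. This sketch reuses the predict and update functions from above, with a simulated sensor and assumed constants:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])  # assumed control matrix
Q = 0.01 * np.eye(2)            # assumed process noise
H = np.array([[1.0, 0.0]])      # assumed: sensor reads position in meters
R = np.array([[0.25]])          # assumed measurement noise

x, P = np.zeros((2, 1)), np.eye(2)
u = np.array([[0.5]])           # constant throttle for this demo
for tick in range(50):
    x, P = predict(x, P, F, B, u, Q)   # runs on every tick
    if tick % 5 == 0:                  # sensor is 5x slower than Predict
        z = H @ x + rng.normal(0.0, 0.5, (1, 1))  # simulated reading
        x, P = update(x, P, H, z, R)
```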

Let's see the full algorithm in action!

Control the robot's acceleration by dragging the number: \( \controlInput = \)

Why is this not perfect?

Despite proofs of theoretical optimality, the Kalman Filter has limitations. In this example, it struggles to track maximum velocity accurately because it only uses throttle to predict velocity, unaware of the robot's speed limit. Running the filter at a slow rate also causes the state estimate to lag the true state most of the time. These issues arise even in a linear system with minimal noise! When applying Kalman Filters to real-world systems, you'll likely need to manually address such errors to achieve good performance.

Summary

The Kalman Filter is an algorithmic blueprint for incorporating different sources of information into a unified probabilistic state estimate. It outlines two ways to propagate the current state estimate to the next:

  1. Predict: Incorporate information about the default dynamics of the system
  2. Update: Fuse sensor information with the most recent state estimate using the Kalman Gain
The Kalman Gain is an operator that takes vectors in the measurement space as input, and outputs the weighted contribution of that measurement vector to the total state estimate. This operation can be interpreted as having two steps (in either order): weighting by a variance ratio (\( \varianceRatio \) or \( \varianceRatioState \)), and converting from measurement space to state space with \( \measurementMatrix^{-1} \). In practice the measurement matrix is not invertible, but by bundling these two steps together we can define the Kalman Gain without ever using \( \measurementMatrix^{-1} \).

My aim with this article was to give you an intuitive understanding of how the Kalman Filter works, but from here there is a lot of beautiful math that proves the filter's optimality under certain conditions and provides insight into how to handle different physical systems. Hopefully while exploring this article you noticed some obvious shortcomings. For example: what if your model isn't linear? What if you don't know the covariance matrix for your sensors? How do small changes in parameters like the timestep (or the accumulation of numerical error) affect the filter's behavior?

If you would like to see how the Kalman Filter is implemented under the hood, my GitHub repo has a JavaScript implementation.

As far as dealing with non-linearities, there are several different algorithms you should look into: the Particle Filter, the Unscented Kalman Filter (UKF), and the Extended Kalman Filter (EKF). All three of these are similar in that they combine probabilistic state and measurement information, but they differ in how they incorporate non-linear process models and how they represent the state estimate distribution.

In total I've thrown you 17 variables -- that can be a lot to keep track of! I've collected them all here so you have a single reference, though clicking on the variables throughout the article will give the same information. Also remember that you can choose the naming convention using the dropdown menu at the top of the article.

(Interactive table of all variables, with columns: Symbol, Name, Description, Dimension.)

References

Documents I found helpful while writing this article

  1. Babb, T. (2015, August 11). How a Kalman filter works, in pictures | Bzarg. https://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/
  2. Becker, A. (2023). Kalman Filter from the Ground Up. https://www.kalmanfilter.net/default.aspx
  3. Fox, D. et al. Markov Localization for Mobile Robots in Dynamic Environments. Journal of Artificial Intelligence Research. 1999 Nov; pp. 391-427. https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume11/fox99a-html/jair-localize.html
  4. Gilja, V. et al. A high-performance neural prosthesis enabled by control algorithm design. Nat Neurosci. 2012 Dec;15(12):1752-7. doi: 10.1038/nn.3265. Epub 2012 Nov 18. PMID: 23160043; PMCID: PMC3638087. https://pmc.ncbi.nlm.nih.gov/articles/PMC3638087/
  5. Labbe, R. (2024). Kalman and Bayesian Filters in Python. https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python
  6. Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics. MIT Press.
  7. Why variances Add—And Why it Matters – AP Central | College Board. (n.d.). https://apcentral.collegeboard.org/courses/ap-statistics/classroom-resources/why-variances-add-and-why-it-matters
  8. Wikipedia contributors. (2024, October 23). Kalman filter. Wikipedia. https://en.wikipedia.org/wiki/Kalman_filter

Excellent videos related to Filtering and Probability

Articles that inspired this level of interactivity

Updates and Corrections

If you see mistakes or want to suggest a change, please create an issue on GitHub. I would like my writing and visual style to be as good as they can be, but it also takes a lot of work to create and maintain this type of content, so I may or may not make the change.

Citation

If you use this article in an academic context, please cite it as:

Lovekin, A. "The Interactive Kalman Filter." 2024. https://arthurlovekin.com/interactive-kalman-filter/index.html

Bibtex Citation:

@misc{lovekin2024interactivekalman,
  author = {Lovekin, Arthur},
  title = {The Interactive Kalman Filter},
  year = {2024},
  note = {https://arthurlovekin.com/interactive-kalman-filter/},
}