Background

When observing the natural world, many of us notice a somewhat beautiful dichotomy. No two things are ever exactly alike, but they all seem to follow some underlying form. Plato believed that the true forms of the universe were hidden from us; through observation of the natural world we could merely acquire approximate knowledge of them. They were hidden blueprints, and the pure forms were accessible only through the abstract reasoning of philosophy and mathematics. The circle, for example, he describes as that which has the distance from its circumference to its center everywhere equal. Yet we will never find a material manifestation of a perfect circle or a perfectly straight line. Interestingly, though, Plato speculated that after an uncountable number of years the universe would reach an ideal state, returning to its perfect form. This Platonic focus on abstract, pure forms remained popular for centuries.

It wasn't until the 17th century that people began to embrace the messy variation in the real world and apply mathematics to tease out underlying patterns. Bernoulli refined the idea of expectation. He was focused on a method of accurately estimating the unknown probability of some event based on the number of times the event occurs in independent trials. He uses a simple example. Suppose that without your knowledge, 3,000 light pebbles and 2,000 dark pebbles are hidden in an urn, and that to determine the ratio of light versus dark by experiment, you draw one pebble after another, with replacement, and note how many times a light pebble is drawn versus a dark one. He went on to prove that the observed ratio of light to dark draws converges on the actual ratio as the number of trials increases, a result known as the weak law of large numbers. He concluded by saying, "If observations of all events be continued for the entire infinity, it will be noticed that everything in the world is governed by precise ratios and a constant law of change."

This idea was quickly extended once it was noticed that not only did things converge on an expected average, but the probability of variation away from that average also follows a familiar underlying shape, or distribution. A great example of this is Francis Galton's bean machine. Imagine each collision as a single independent event, such as a coin flip. After 10 collisions or events, the bean falls into a bucket representing the ratio of left versus right deflections, or heads versus tails. This overall curve, known as the binomial distribution, appears to be an ideal form, since it kept showing up any time you looked at the variation of a large number of random trials. It seems the average fate of these events is somehow predetermined, a result known today as the central limit theorem.
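To make the bean machine concrete, here is a minimal Python sketch. The pin count, bead count, and 50-50 deflection probability are illustrative choices, not taken from the text above; the point is simply that tallying which bucket each bead lands in traces out the binomial shape.

```python
import random
from collections import Counter

# A sketch of Galton's bean machine (parameters are illustrative).
# Each bead hits 10 pins; every collision deflects it right with probability 0.5.
# The bucket a bead lands in is the number of rightward deflections,
# so the bucket counts trace out a binomial distribution.
NUM_PINS = 10
NUM_BEADS = 100_000

buckets = Counter(
    sum(random.random() < 0.5 for _ in range(NUM_PINS))  # rightward deflections for one bead
    for _ in range(NUM_BEADS)
)

for bucket in range(NUM_PINS + 1):
    share = buckets[bucket] / NUM_BEADS
    print(f"bucket {bucket:2d}: {share:.3f} {'#' * round(share * 100)}")
```

Increasing NUM_BEADS smooths the printed histogram toward the ideal binomial curve, which is the kind of convergence the paragraph above describes.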

This was a dangerous philosophical idea to some. Pavel Nekrasov, originally a theologian by training, later took up mathematics and was a strong proponent of the religious doctrine of free will. He didn't like the idea of us having this predetermined statistical fate, and he made a famous claim that independence is a necessary condition for the law of large numbers. Independence describes those toy examples with beans or dice, where the outcome of previous events doesn't change the probability of the current or future events. However, as we can all relate, most things in the physical world are clearly dependent on prior outcomes, such as the chance of fire or sun, or even our life expectancy. When the probability of some event depends, or is conditional, on previous events, we say they are dependent events, or dependent variables. This claim angered another Russian mathematician, Andrey Markov, who maintained a very public animosity towards Nekrasov. He goes on to say in a letter that "this circumstance prompts me to explain in a series of articles that the law of large numbers can apply to dependent variables," using a construction which he brags Nekrasov cannot even dream about. Markov extends Bernoulli's results to dependent variables using this ingenious construction.

Imagine a coin flip which isn't independent but dependent on the previous outcome, so it has a short-term memory of one event. This can be visualized using a hypothetical machine which contains two cups, which we call states. In one state we have a 50-50 mix of light versus dark beads, while in the other state we have more dark beads than light. One cup we can call state zero; it represents a dark bead having previously occurred. The other we can call state one; it represents a light bead having previously occurred. To run our machine, we simply start in a random state and make a selection. Based on the outcome of that selection, we output a zero if it's dark or a one if it's light, and then move to state zero or state one accordingly. With this two-state machine, we can identify four possible transitions. If we are in state zero and a dark bead is selected, we loop back to the same state and select again. If a light bead is selected, we jump over to state one, which can likewise loop back on itself or jump back to state zero if a dark bead is chosen. The probability of a light versus dark selection is clearly not independent here, since it depends on the previous outcome. But Markov proved that as long as every state in the machine is reachable, when you run these machines in a sequence, they reach equilibrium. That is, no matter where you start, once you begin the sequence, the number of times you visit each state converges to some specific ratio, or probability. This simple example disproved Nekrasov's claim that only independent events could converge on predictable distributions. The concept of modeling sequences of random events using states and transitions between states became known as a Markov chain. One of the first and most famous applications of Markov chains was published by Claude Shannon.
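A small simulation makes Markov's point concrete. This is only a sketch: the bead mixes below (50-50 in state zero, 70-30 dark in state one) are assumed for illustration, since the text above doesn't give exact proportions. Whichever state you start in, the fraction of visits to each state settles on the same ratio.

```python
import random

# Sketch of the two-cup machine (bead mixes are assumed, not from the original text).
# State 0: a dark bead was just drawn; its cup holds a 50-50 mix of light and dark.
# State 1: a light bead was just drawn; its cup holds more dark beads than light.
P_LIGHT = {0: 0.5, 1: 0.3}  # probability of drawing a light bead in each state

def visit_ratios(start_state, steps=200_000):
    """Run the chain and return the fraction of steps spent in each state."""
    visits = [0, 0]
    state = start_state
    for _ in range(steps):
        visits[state] += 1
        drew_light = random.random() < P_LIGHT[state]
        state = 1 if drew_light else 0  # the next state records what was just drawn
    return [v / steps for v in visits]

# Both runs converge on roughly the same visit ratio, regardless of the start state.
print(visit_ratios(start_state=0))
print(visit_ratios(start_state=1))
```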

Bernoulli / Poisson Process

Markov Chains

Some thoughts

Q: Is a Markov chain the same as a finite state machine?

Markov chains can be represented by finite state machines. The idea is that a Markov chain describes a process in which the transition to a state at time t+1 depends only on the state at time t. The main thing to keep in mind is that the transitions in a Markov chain are probabilistic rather than deterministic, which means that you can’t always say with perfect certainty what will happen at time t+1.

The Wikipedia article on finite-state machines has a subsection on finite Markov-chain processes; I'd recommend reading that for more information. Also, the Wikipedia article on Markov chains has a brief sentence describing the use of finite state machines to represent a Markov chain. It states:

A finite state machine can be used as a representation of a Markov chain. Assuming a sequence of independent and identically distributed input signals (for example, symbols from a binary alphabet chosen by coin tosses), if the machine is in state y at time n, then the probability that it moves to state x at time n + 1 depends only on the current state.
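As a rough illustration of that idea, here is a sketch of a Markov chain written as a little probabilistic state machine. The states and numbers (a made-up sunny/rainy weather model) are mine, not from the quoted article; the point is only that the next state is chosen at random from a distribution that depends solely on the current state.

```python
import random

# A Markov chain as a probabilistic finite state machine (transition values are made up).
# Each state maps to a probability distribution over next states; the choice at
# time t+1 depends only on the state at time t.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Pick the next state at random according to the current state's transition row."""
    next_states, probs = zip(*transitions[state].items())
    return random.choices(next_states, weights=probs)[0]

state = "sunny"
path = [state]
for _ in range(10):
    state = step(state)
    path.append(state)
print(" -> ".join(path))
```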

Q: What does the steady state represent to a Markov Chain?

I think you are struggling because you are misunderstanding what we intend to describe with the term steady state. You are correct that a Markov chain is completely determined once it is defined. That means, no matter what step (or time) you are interested in, I can find the distribution for that step. This property is known as uniqueness.

Though, that is not what we mean by steady state, and we need to be clear when we say steady state because it can have two different meanings.

Meaning 1: There is a very deep relationship between stochastic processes and linear algebra. If you have not taken a linear algebra course that discussed both eigenvalues and eigenvectors, then this might be hard to understand.

A steady state is an eigenvector for a stochastic matrix. That is, if I take a probability vector, multiply it by my probability transition matrix, and get out the exact same probability vector, it was a steady state. In other words, nothing changed after the step: you got out the same probabilities that you put in. For stochastic processes, we fix the eigenvalue at $\lambda = 1$, which gives us unique eigenvectors.

And in fact, the eigenvector is just the final state vector:

$$\pi P = \pi$$

where $P$ is the transition matrix and $\pi$ is the eigenvector with eigenvalue $\lambda = 1$.
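As a sketch of Meaning 1, the steady state can be pulled out numerically as the eigenvector of the transition matrix for eigenvalue 1. The two-state matrix below is an arbitrary illustrative example (the same numbers as the weather sketch above), not something taken from the question.

```python
import numpy as np

# Steady state as an eigenvector of a stochastic matrix (matrix is illustrative).
# Rows of P sum to 1. We want a row vector pi with pi @ P = pi, i.e. a left
# eigenvector of P for eigenvalue 1, normalised so its entries sum to 1.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

eigenvalues, eigenvectors = np.linalg.eig(P.T)  # left eigenvectors of P = right eigenvectors of P.T
idx = np.argmin(np.abs(eigenvalues - 1.0))      # pick the eigenvalue closest to 1
pi = np.real(eigenvectors[:, idx])
pi = pi / pi.sum()                              # rescale into a probability vector

print(pi)       # roughly [0.6667, 0.3333] for this matrix
print(pi @ P)   # unchanged after one more step: the same vector comes back
```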

Meaning 2: However, I think you might be talking about limiting distributions, which are sometimes called steady-state distributions for Markov chains. The idea of a steady-state distribution is that we have reached (or are converging to) a point in the process where the distributions will no longer change. The distribution at this step is equal to the distribution at every step from here on. So if we have reached the limiting distribution and you are at a state i, the probability you will "step" to an accessible state j will be the same now and at any time in the future. Steady-state distributions are unique if they exist.
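And as a sketch of Meaning 2, you can watch the limiting distribution emerge by repeatedly applying the same illustrative matrix to an arbitrary starting distribution; after enough steps the vector stops changing and matches the eigenvector above.

```python
import numpy as np

# Limiting (steady-state) distribution: keep stepping until the distribution stops changing.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

dist = np.array([1.0, 0.0])  # arbitrary start: all probability mass on state 0
for step in range(1, 31):
    dist = dist @ P
    if step in (1, 2, 5, 10, 30):
        print(step, dist)    # settles toward [2/3, 1/3] and then stays there
```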

Things get interesting when you plug the steady state into the limiting distribution: if the limiting distribution exists, it is exactly the eigenvector from Meaning 1, since taking one more step leaves it unchanged.