## Using Markov Chains to Generate Test Input

September 15, 2009 — Eric Melski

One challenge that we’ve faced at Electric Cloud is how to verify that our makefile parser correctly emulates GNU Make. We started by generating test cases based on a close reading of the gmake manual. Then we turned to real-world examples: makefiles from dozens of open source projects and from our customers. After several years of this we’ve accumulated nearly two thousand individual tests of our gmake emulation, and yet we still sometimes find incompatibilities. We’re always looking for new ways to test our parser.

One idea is to generate random text and use that as a “makefile”. Unfortunately, truly random text is almost useless in this regard, because it doesn’t look anything like a real makefile. Instead, we can use *Markov chains* to generate random text that is very much like a real makefile. When we first introduced this technique, we uncovered 13 previously unknown incompatibilities — at the time that represented 10% of the total defects reported against the parser! Read on to learn more about Markov chains and how we applied them in practice.

### Markov chains

A Markov chain is simply a sequence of random values in which the next value is in some way dependent on the current value, rather than being completely random. Consider the case of generating random text one letter at a time, from the set of uppercase English letters (A-Z). If the sequence is completely random, then for each character generated, any letter is equally probable. Regardless of what characters you’ve generated up to this point, you are just as likely to get a *D* as an *X* next. Think of it as if you have a bag of tiles, one for each letter. With truly random text, you pick one tile and write down the letter on that tile. Then you return the tile to the bag and pick again. Lather, rinse, repeat until you’ve generated as much text as you like.

But suppose we said that the probability of the next character is dependent on the last character we generated. For example, in English text if you run across the letter *Q* you can be pretty sure that the next character is going to be *U*. It’s almost certainly not going to be *X*, or *Z*, etc. We can build a table that tells us the probability that any given letter will be followed by any other letter. For the letter *Q*, the table might look like this:

| Letter | Probability of appearing after Q |
| --- | --- |
| A | 1% |
| E | 1% |
| I | 1% |
| O | 1% |
| U | 96% |
| All others | 0% |

We can do a much better job of generating text that looks like English if we use these tables to guide us. Imagine that instead of one bag of tiles with one tile for each letter, we have one bag for each letter, and we fill each bag with tiles according to the probabilities in our table. For example, the bag for the letter *Q* would contain 96 *U* tiles, and one tile each for *A*, *E*, *I*, and *O*. Each time we want to generate a new letter, we look at the last letter we generated, find the bag of tiles corresponding to that letter, and pick out one tile. After writing down the letter, we return the tile to the bag and repeat the process. The sequence of letters that we generate in this manner is a simple Markov chain.
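The bag-of-tiles procedure maps directly onto a weighted random draw. A minimal sketch in Python, using the tile counts from the *Q* table above (the `bags` dictionary is an illustrative fragment, not a full table; `random.choices` performs the weighted pick):

```python
import random

# One "bag" per letter; values are tile counts. Only the bag for 'q'
# is shown here, filled according to the probabilities in the table.
bags = {"q": {"u": 96, "a": 1, "e": 1, "i": 1, "o": 1}}

def next_letter(prev, bags, rng=random):
    """Draw one tile from the bag for the previous letter, with
    probability proportional to the tile counts."""
    bag = bags[prev]
    letters = list(bag.keys())
    weights = list(bag.values())
    return rng.choices(letters, weights=weights, k=1)[0]
```

Each call is one draw-and-replace from the bag; chaining the calls (feeding each result back in as `prev`) produces the Markov chain.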

How do you build the probability tables? One way is to generate them from some sample input. If we have a sufficiently large example of text in the target language, we can scan it and count the number of times each letter occurs, and the number of times it is followed by each letter. Note that it is critical to have a large and varied input. If the sample is too small, the resulting probability tables won’t accurately represent the target language. The generator will only be able to generate the input text itself.
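That counting step can be sketched directly: scan the sample once and tally, for each character, how often each successor follows it. This assumes letter-level, order-1 tables (the raw counts stand in for the probabilities, since a weighted draw only needs relative frequencies):

```python
from collections import defaultdict

def build_table(sample):
    """For each character in the sample, count how often each
    character appears immediately after it."""
    table = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(sample, sample[1:]):
        table[cur][nxt] += 1
    return table

table = build_table("the theory there")
# 't' is followed by 'h' three times in this tiny sample
```

With a sample this small the generator can only reproduce fragments of the input itself, which is exactly the pitfall the text describes; a realistic table needs megabytes of input.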

### Markov chains of order *m*

Of course, we don’t have to limit ourselves to considering just the single previous letter. For example, if the previous two characters are *ST*, then the next character is probably a vowel, or maybe an *R*. It’s probably not going to be *D*, or *Q*, etc. Just as before, we can make a table that tells us the probability that any given pair of letters will be followed by any other letter. And we can keep going, adding more and more of the preceding letters to our formula. The number of previous characters we use is called the *order* of the chain, so if we use the previous 4 characters, we would say we have a *Markov chain of order 4*.
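Generalizing to order *m* just means keying the table on the previous *m* characters instead of one. A minimal generator along those lines (seeding the chain with the first *m* characters of the sample is one simple choice here, not necessarily what the original generator does):

```python
import random
from collections import defaultdict

def build_table(sample, order):
    """Key the successor counts on m-character substrings."""
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(sample) - order):
        state = sample[i:i + order]
        table[state][sample[i + order]] += 1
    return table

def generate(sample, order, length, rng=random):
    table = build_table(sample, order)
    state = sample[:order]            # seed with the start of the sample
    out = list(state)
    for _ in range(length - order):
        successors = table.get(state)
        if not successors:            # dead end: state occurs only at the end
            break
        nxt = rng.choices(list(successors),
                          weights=list(successors.values()), k=1)[0]
        out.append(nxt)
        state = state[1:] + nxt       # slide the m-character window
    return "".join(out)
```

Setting `order` to 0 would reduce this to the purely random case, and higher orders reproduce longer and longer stretches of sample-like text.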

The greater the order of the chain, the “smarter” our generator becomes, because it is considering more context when choosing letters. As you can see, the generated text looks more and more like the target language as you increase the order (although beyond a certain point, you get little additional benefit from further increases):

| Order | Result |
| --- | --- |
| 0 (purely random) | ehnee.Alr noer ealcra edctn eIi |
| 1 | Pige foule.ce d futht wrion e mara |
| 2 | Prookiname arg-tm aread on achivedging |
| 3 | Yes, and no usinession be |
| 4 | Project that last it make you first, moderneath. |

### Using Markov chains in testing

In order to use this technique effectively in testing, you need a couple of things besides the generator itself:

**A large sample input to seed the generator.** As noted, the bigger your sample input text, the higher the quality of the probability tables, and therefore the more varied your generated text will be. Since we are trying to generate makefiles, we used several megabytes of makefiles from a variety of open-source projects as the sample text.

**An automated evaluation mechanism.** You have to be able to determine quickly and automatically whether a given generated file is processed correctly. Of course, *correct* can mean many different things here. It might be as simple as “does not cause a program crash”. In our case, it means “emake parses this makefile the same way that gmake does”, so we use gmake as a *reference implementation*. Note that it doesn’t matter if the generated text truly is a completely valid makefile. In fact, most of the time it will not be. What matters in those cases is that emake and gmake both report the *same* error.

We wrote a simple shell script to drive the testing process. First, it uses the generator to produce several random makefiles. Then it runs each makefile through both gmake and emake, and compares the results. Any differences are reported for further investigation.
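The driver described above is a shell script; the comparison step might be sketched in Python like this. The `-n` (dry-run) and `-f` flags are standard gmake options, but treating emake as accepting the same flags, and comparing exit status plus raw output, are assumptions made for illustration:

```python
import subprocess

def run_make(tool, makefile):
    """Run a make tool on a generated makefile in dry-run mode and
    capture everything we want to compare."""
    proc = subprocess.run([tool, "-n", "-f", makefile],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr

def same_behavior(ref, test):
    """Two runs agree when exit status, stdout, and stderr all match."""
    return ref == test

# Driver loop (requires gmake and emake on PATH):
# for mf in generated_makefiles:
#     if not same_behavior(run_make("gmake", mf), run_make("emake", mf)):
#         print("difference found:", mf)
```

In practice the comparison would likely normalize incidental differences (tool names in error messages, for instance) before declaring a mismatch.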

### Conclusion

Verifying the implementation of an emulator for a complex system is hard, especially when the original system has no formal specification. Using randomly generated input is a useful way to extend the breadth of your testing, and Markov chains make it possible to generate even more useful random input. Our original implementation of this technique uncovered several previously unknown defects, and it continues to pay dividends both by uncovering new defects and by providing a confidence measure for our emulation.

If you want to play with Markov chains yourself, you can download the source for the generator used in this article. NB: the program has only been compiled and used on Linux; on other platforms your mileage may vary. For more information about Markov chains, I recommend Section 15.3 of the excellent book *Programming Pearls* by Jon Bentley.

### Comments



Bentley built his Markov chain using the Markov algorithm from *The Practice of Programming*, by his Bell Labs colleagues Pike and Kernighan:

http://books.google.co.uk/books?id=to6M9_dbjosC&pg=PA62&lpg=PA62&ots=3YL0Fgy14b&dq=markov+practice+of+programming

Pike had used this technique earlier when coding up http://en.wikipedia.org/wiki/Mark_V_Shaney with Bruce Ellis

I’m sure they were not the first

Your explanation of Markov chains is crystal clear. Thanks. I’m curious to know how you built the probability tables using makefiles as input. How did you tokenize the makefile? Did you use the same parser as in make?

That’s a great question. For this implementation, I did not bother trying to tokenize the makefile, but rather built the probability tables based on the character-by-character analysis of the makefiles. This is sometimes called “letter-level” Markov chains, as opposed to “word-level”. It has the advantage of being very simple to implement, and the disadvantage of producing somewhat less makefile-like output. In theory, word-level chains would give me random makefiles that were even more like real makefiles. I explored changing the generator to do word-level chains, but that proved to be more work than I could reasonably justify investing in this side project.
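For illustration, the word-level variant mentioned in this reply might tokenize on whitespace and key the table on word tuples. This is a hypothetical sketch, not the generator from the article:

```python
from collections import defaultdict

def build_word_table(text, order=1):
    """Word-level tables: split the sample into whitespace-delimited
    tokens and record, for each m-word state, the words that follow it.
    (Real makefile tokenization is much subtler than a whitespace split.)"""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        table[state].append(words[i + order])
    return table
```

The extra work alluded to above comes from tokenization: makefile syntax is sensitive to tabs, line continuations, and variable references, so a naive whitespace split would discard exactly the structure that makes the output makefile-like.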

Thanks for the reply, Eric.

