Tuesday, November 27, 2007

Monkey tales: Continuous Integration at Terracotta

As a platform for clustering and scaling Java applications, Terracotta gets used in many varied and resourceful ways. Furthermore, Terracotta supports Linux, Windows, and Solaris (plus OS X for development), and it compiles and runs with the Sun JDKs (1.4, 1.5, 1.6) as well as IBM's. As for application servers, we support Tomcat, JBoss, Jetty, WebSphere, WebLogic, etc. The list goes on and on. That puts a heavy burden on our testing and build infrastructure, and we can't think of a better way to handle it than continuous integration.

CruiseControl (CC) comes in handy and works really well for us. It basically checks out new changesets from the repo periodically, then compiles and runs all the tests. Our custom build system, tcbuild, is a JRuby + Ant combo. It can run defined groups of tests such as system tests, unit tests, container tests, crash tests, etc.

We create a CC project for each group of tests, namely check_unit, check_system, check_container, etc., and these vary by JDK, app server, and test mode. With all the permutations, that comes to 57 projects.

Each CC project does the following:
- check out latest revisions
- compile
- run tests in that project
- report
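
The steps above can be sketched roughly as follows. This is only an illustration in Python, not how CruiseControl or tcbuild is actually implemented; `run_project`, the step names, and the `report` callback are all made up for the example.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
# In a real setup the actions would shell out to svn and the build tool.
Step = Tuple[str, Callable[[], bool]]


def run_project(steps: List[Step],
                report: Callable[[bool, List[str]], None]) -> bool:
    """Run the steps in order, stop at the first failure, then always report."""
    log: List[str] = []
    success = True
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            success = False
            break  # no point running tests if checkout or compile failed
    report(success, log)
    return success
```

Called with three steps (check out, compile, run tests) and a `report` function that sends the email, one call covers one pass of a CC project.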

And we do this on multiple boxes that run Red Hat, SUSE 10, Windows 2003, and Solaris 9 and 10. We call these monkey machines.

So what happens when there's a bad build (a compile error) or tests fail? CC emails a specific team, depending on which group of tests failed. There's also a mailing list that keeps track of every failure.
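
To make the routing idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the addresses are placeholders, the team names are invented, and it assumes project names start with the test group name (for example "check_container_..."). In practice this kind of routing lives in the CC configuration rather than in code like this.

```python
from typing import Dict, List

# Hypothetical addresses, for illustration only.
ALL_FAILURES_LIST = "build-failures@example.com"
TEAM_BY_TEST_GROUP: Dict[str, str] = {
    "check_unit": "core-team@example.com",
    "check_system": "system-team@example.com",
    "check_container": "container-team@example.com",
}


def failure_recipients(project_name: str) -> List[str]:
    """Return who gets mailed when the given CC project fails."""
    # Every failure goes to the list that tracks all failures.
    recipients = [ALL_FAILURES_LIST]
    # The owning team is picked by the test group prefix of the project name.
    for group, team in TEAM_BY_TEST_GROUP.items():
        if project_name.startswith(group):
            recipients.append(team)
            break
    return recipients
```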

This works great, but there's a drawback. Whenever there's a compile error, we get spammed, and we get it bad: the build fails on all 57 CC runs on each monkey box.

To solve this problem, we devised a two-tier monkey scheme. We have one monkey box, called the "monkey police", that checks out the latest revision(s) every 5 minutes. It compiles and runs a group of crucial system tests. If the build fails to compile or any of the tests fail, that build is badly broken. In that case, the police emails the last person(s) to check in and the engineering mailing list about the error. If everything works out fine, the police marks the latest revision as good and saves it to a shared file. The monkey troops (the other test machines) read from that shared file and only "svn update" up to the latest known good revision before they run their tests.
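
Here is a minimal Python sketch of the police side of this scheme. The function names, the file format (one revision number in a text file), and the `notify` callback are assumptions made up for the example; the real implementation is not shown in this post. The write goes through a temporary file and a rename so that a troop reading the file should not see a half-written revision number, which assumes the shared filesystem handles renames sanely.

```python
import os
from typing import Callable


def record_good_revision(shared_file: str, revision: int) -> None:
    """Overwrite shared_file with the latest known good revision."""
    # Only one police writes per branch, so a fixed temp name is enough here.
    tmp_path = shared_file + ".tmp"
    with open(tmp_path, "w") as tmp:
        tmp.write(f"{revision}\n")
    # Swap the new file into place in a single step.
    os.replace(tmp_path, shared_file)


def police_check(revision: int, compiled: bool, tests_passed: bool,
                 shared_file: str, notify: Callable[[str], None]) -> bool:
    """Mark the revision as good, or notify about the breakage."""
    if compiled and tests_passed:
        record_good_revision(shared_file, revision)
        return True
    reason = "compile error" if not compiled else "crucial system tests failed"
    # The shared file is left untouched, so the troops stay on the old revision.
    notify(f"Revision {revision} is broken: {reason}")
    return False
```

The important property is in the failure branch: a bad revision never reaches the shared file, so the last good revision stays in place until a newer one passes.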

So before, the troops would do "svn update" and pull down every change. With this scheme, the troops do "svn update -r 2333", where "2333" is a good revision that the monkey police has tested.
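
On the troop side, the change amounts to reading the revision from the shared file and pinning the update to it. A rough Python sketch, assuming the shared file holds just the revision number as text (the file format and function names are assumptions for this example):

```python
from typing import List


def read_good_revision(shared_file: str) -> int:
    """Read the latest known good revision written by the monkey police."""
    with open(shared_file) as f:
        return int(f.read().strip())


def svn_update_command(shared_file: str) -> List[str]:
    """Build the svn command that updates only up to the good revision."""
    revision = read_good_revision(shared_file)
    return ["svn", "update", "-r", str(revision)]
```

A troop would run the resulting command in its working copy, for instance with `subprocess.run(cmd, check=True)`, before kicking off its tests.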

And we have one monkey police per branch. So when there's a build error or a fatal test failure, we know about it right away, and it doesn't disturb our test machines in the meantime.

Why bother with a bad checkin? :)

This has worked great for us and I thought I'd share the experience.