We at codecentric run hundreds of automated builds every day, and sometimes they … fail. This post is not about lame excuses. “Nah, the build shouldn’t fail, that was a trivial change…” does not count. But there are situations where a build fails because … well, nobody really knows.
Some people say: cosmic rays! But we know that is not true. To use a CI system efficiently without spending a long time on troubleshooting, here are some common issues we encountered and ideas for how to mitigate them.
- A test might do some time calculation, and either in the test or in the code under test the time is read twice. Most of the time there is no difference, but sometimes there is a microsecond. A good indicator for this is a message like: Time was 23:30:00 but expected 23:30:00 (note that the microseconds are not shown).
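To make this concrete, here is a minimal Java sketch of the problem. The `stampRecord` method and the one-second tolerance are made up for illustration; the point is to read the clock once and compare with a tolerance instead of asserting exact equality against a second clock read:

```java
import java.time.Duration;
import java.time.Instant;

public class TimeTwiceDemo {
    // Hypothetical code under test: stamps a record with the current time.
    static Instant stampRecord() {
        return Instant.now();
    }

    public static void main(String[] args) {
        // Flaky variant: the clock is read a second time in the assertion,
        // so the two Instants can differ by some microseconds and the test
        // fails only once in a while:
        //   assertEquals(Instant.now(), stampRecord());

        // Robust variant: read the clock once, then compare with a tolerance.
        Instant stamped = stampRecord();
        Duration drift = Duration.between(stamped, Instant.now());
        if (drift.isNegative() || drift.toMillis() > 1000) {
            throw new AssertionError("unexpected clock drift: " + drift);
        }
        System.out.println("drift within tolerance: " + drift.toMillis() + " ms");
    }
}
```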
- Code, test code, or test runner/CI code might leave files behind. Sometimes these are log files, sometimes files produced as test output. Take the time to search the whole server for all files written during a test run, and make sure a cleanup is in place. Don’t forget to add disk space monitoring, because build machines have hard disks that tend to fill up. (Hudson can do both.)
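Hudson’s monitoring covers this out of the box, but as a rough illustration of what such a check does (the temp directory and the 1 GB threshold are arbitrary assumptions), querying free disk space from Java is only a few lines:

```java
import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Check the partition the JVM writes its temp files to.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        long freeMb = tmp.getUsableSpace() / (1024L * 1024L);
        System.out.println("free space in temp dir: " + freeMb + " MB");
        // Arbitrary threshold for illustration; warn early instead of
        // letting tests die later with cryptic IO errors.
        if (freeMb < 1024) {
            System.err.println("WARNING: less than 1 GB free on build machine");
        }
    }
}
```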
- Users logging into the CI system might lock resources, like files or ports, or do other bad things to the machine. You should not allow user logins; all “analysis” should be done read-only. Note that even read access or parallel tests can lock files.
- It is not necessarily a bug when your code or tests do not run when the system date is the 1st of January 1970. It could be, but you should make sure the system always uses the current time: set up an NTP daemon. If you need specific points in time for testing, you should be able to set a time source for all of your code, like a Spring bean called TimeProvider which normally resorts to the system time.
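A minimal sketch of such a time source. Only the name TimeProvider comes from the text; the interface shape and the fixed-clock test double are assumptions:

```java
import java.time.Instant;

public class TimeProviderDemo {
    // Production code asks this interface for the time instead of
    // calling Instant.now() directly.
    interface TimeProvider {
        Instant now();
    }

    // Default implementation: resorts to the system time.
    static class SystemTimeProvider implements TimeProvider {
        public Instant now() { return Instant.now(); }
    }

    // Test double: returns a fixed point in time, e.g. the epoch, so a
    // date like the 1st of January 1970 can be tested on purpose.
    static class FixedTimeProvider implements TimeProvider {
        private final Instant fixed;
        FixedTimeProvider(Instant fixed) { this.fixed = fixed; }
        public Instant now() { return fixed; }
    }

    public static void main(String[] args) {
        TimeProvider testClock = new FixedTimeProvider(Instant.EPOCH);
        System.out.println("test time: " + testClock.now());
        // → test time: 1970-01-01T00:00:00Z
    }
}
```

In a Spring setup the production configuration would wire the system variant and the test configuration the fixed one; the code under test never notices the difference.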
- If your tests need to apply evil hacks to test your software (which might be required; if not, get rid of the hacks), it is often safer to let test execution fork, so tests cannot introduce side effects via the JVM (like setting system properties). Code coverage tools using bytecode manipulation count as hacks.
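With Maven, forking is a Surefire setting. As a sketch (a recent Surefire version is assumed here; older versions used the `forkMode` parameter instead), starting a fresh JVM per test class looks like this:

```xml
<!-- Fork a fresh JVM for every test class so System property hacks
     and similar JVM-wide side effects cannot leak between tests. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
```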
- If you have multiple build machines, ensure that they are as similar as possible. If you can afford it, set up a farm of build machines with defined differences, so you can spread testing across varying hardware in case this cannot be simulated. You do not want surprise differences that cost you hours to track down.
- Consider setting up the system under test for integration tests from scratch every night. Those systems tend to get messed up by exploratory testing and ad hoc demos. Automating this is easier than one would think, though it takes some time.
- If you do automatic deployment, you need to at least stop the server, copy the changed artifacts, and then restart the server. Any kind of “hot deployment” is unfortunately just too fragile for reliable results.
- After making any change to the configuration or infrastructure of your CI system, trigger a build manually. Otherwise the next regular developer check-in will cause a failing build and leave the developer wondering how their change could have broken this.
- If you find your tests hanging in your code, especially when multiple tests run at the same time, take a heap and thread dump of the JVM before restarting the tests. You might be lucky and have stumbled upon a real concurrency issue in your code. Be grateful for that, because you can hardly test for this deliberately.
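From the command line, `jstack` takes the thread dump and `jmap` the heap dump of a running JVM. A thread dump can also be taken from inside the JVM itself; as a sketch, not tied to any particular test runner:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpDemo {
    public static void main(String[] args) {
        // Dump all live threads, including lock and monitor information,
        // which is exactly what you need to spot a deadlock.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
        // findDeadlockedThreads() returns null when no deadlock exists.
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println("deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```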
Yes, it is possible that the build breaks without any issue in your software, but it wastes a lot of investigation time. Know the weaknesses of your system and try to fix them, or at least document them.
We have an issue with one of our integration test suites, which connects to an external service. Sometimes this service just hangs and causes a connection timeout. Until a while ago this always “broke” the build. The result was that every engineer had a look at the build and the logs and eventually found out it was a timeout on the external system. We discussed this and decided to stop wasting time on investigating it. So we added a mechanism so that in this special timeout case the build does not turn red: it stays green, but tags the test as “timeouted”. This of course has a problem. A green build with a “timeouted” test could still have a problem in the computation of the external call results; it might really be red, but we cannot know. Real green builds are not allowed to have “timeouted” tests. The important part is that we want “real red” builds, which turn red only when there is an issue we can fix. In RobotFramework you can define a third state for “noncritical failures”. Decide for yourself if this is something for you.
But the most important takeaway is: if a broken build is not caused by test or production code, you must find the reason and address it. You cannot say “cosmic rays”, because then everybody will say “broken build – cosmic rays”, and eventually you will have far fewer successful builds because no one will care anymore. Red should always mean: team, take action!