Using Exceptions to Write Robust Software for Stable Production

A study shows that the cause of almost all critical faults is bad error handling. I can back this up with my own experience in various projects: the feature is implemented and there are tests in place which verify the correctness of the implementation. Negative test cases (invalid user input, expected file not found, …) are present to a varying degree, but what about errors (exception while accessing a file, existing row with the same primary key, XML schema validation failed, …)? I rarely see tests for these cases. Only if problems occur during test or production AND there is enough information to understand and reproduce the issue is there a chance that test cases for these problems get added.

In this article I want to outline the why and especially the dos and don’ts of error handling. The article uses Java for the demonstration but the ideas are language-independent.

tl;dr: Use exceptions because they provide advantages (fail fast, and no need to think about a return value for the error case). Avoid duplicated logging. In log messages, describe what will happen next. Sometimes it is better to replace null as an indicator for problems with exceptions.

Motivation

We, the developers, write software. The requested features and changes get implemented and at some point, at deployment time, the software comes into contact with the real world. The real world is messy. Firstly, the technical environment is different from the developer machine or the CI server. This difference can be reduced with tools like Puppet, but there may still be additional differences between a 4 node cluster and a 400 node cluster. And let us not forget software which runs on the computer of the user (like a desktop application) and is not hosted by the company producing it (like a web application). Secondly, real users are much more creative in finding input combinations which the development team (PO, QA, developer) just could not imagine, and therefore the software may or may not handle them correctly. The complete space of all input values is just huge.

The idea is to find these issues as fast as possible, usually through technical tests (e.g. performance tests on a setup which is similar to the production system) or through exploratory tests by a skilled QA person. It is also possible to reduce and control the number of users who can access the software. Two common ways are selecting pilot users who agree to use the new unreleased version, and diverting a small share of the traffic to the new version (with or without informing the users) combined with tight monitoring of the new software version.

What is the connection to error handling? Errors are one way to react to unsupported input data or an environment which violates some assumption. Commonly the creation and propagation of such errors are built into the programming language as exceptions. Exceptions allow a programmer to cheaply state that some data is outside the supported area and therefore the software is unable to continue. One can see exceptions as a cheap safety net which prevents the software from continuing and outputting or storing wrong information. The normal behaviour of exceptions (bubbling up the call stack until an exception handler catches the exception) supports this. Asserts in C are similar in this regard.

If

  • it is confirmed that certain situations occur in the normal operation and
  • the reasons for these situations are understood and
  • such situations should be supported and
  • the expected output can be specified

then it is possible to change the behaviour by handling the situation. This means that the software becomes more robust because it can cope with more input values but also that the software becomes more complex. So this is always a matter of consideration.

This also means that there has to be a process which continuously looks at exceptions and log messages and time is invested to understand these. This is especially important shortly after changes (new release, hardware upgrade, cluster sizes changed, new OS for mobile devices released, …).

So in summary three conditions must hold to improve the quality of the software:

  1. There has to be a motivation for continuous improvement. From this the user gets a better experience, the project sponsor gets more business value, operations get more robust software and, for the developer, the maintainability improves. Both the management and the developers must believe in this continuous improvement.
  2. There is at least one feedback channel about the running software back to the developers. Examples are: log messages, monitoring on multiple layers, user feedback via phone or email, … This is not a problem for common web applications but is more difficult if privacy is very important or if the system is not connected to the internet (e.g. elevator control).
  3. The development team can react to the feedback in an easy and timely manner. Driving around town and updating the software of all elevators does not qualify as easy. The same applies if you find a bug 2 days after deployment but can only deploy twice a year. An agile approach ensures this last condition.

So if these conditions are in place what can we the developers do to produce robust software which reacts in a good way to unexpected conditions? First I will cover log messages and then exception handling. The last part is about exceptions and API design. As already mentioned I’m using Java in the code examples.

Log messages

The primary purpose of the log message is to help the analysis of the problem after it occurred (post mortem). The log message should contain all relevant information to identify the problem and its cause fast and with high probability. What are the questions a log message for a problem should be able to answer?

  • What has been tried?
  • Which were the parameter values?
  • What was the result? This usually means the caught exception or some error code.
  • How does the method react to this?
  • Optional: What are possible reasons for the problem?
  • Optional: What are possible consequences?

For some time now I have preferred to start such log messages with “Failed to” and to write them as one or more complete sentences. So the pattern is “Failed to VERB with/for/of/from OBJECT.”

Some fictitious examples:

  • WARN: “Failed to create scaled thumbnail file for /tmp/foo.gif. Will return the original file as thumbnail. This may increase the used bandwidth. Saved the original file under /tmp/bc2384d4-555d-11e5-9913-600308a94de6 for later analysis. Is ImageMagick installed and in the PATH?”
  • ERROR: “Failed to get prices for Contract[…] from the backend. Will return null to indicate no-price. Does the monitoring at http://…. show a problem with the backend?”
  • INFO: “Failed to send email about Contract[…] to john.doe@example.com. Will retry 3 more times after a timeout of 2.4s.”
  • INFO: “Succeeded in sending email about Contract[…] to john.doe@example.com after 2 tries.”
  • WARN: “Failed to send email about Contract[…] to john.doe@example.com. No more retries left. The number of emails sent in the monthly report may be off.”
  • INFO: “Failed to get logged in user from the HTTP session. Will send a 401 back. User will have to log in once again. Maybe a timed out session?”
  • WARN: “Failed to send event UserLoggedIn[…] using kafka (server …). Will return false to indicate a problem.”
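The three email messages above could come from a retry loop like the following sketch. It uses java.util.logging so that it runs without dependencies; `RetryingMailer`, `MailSender` and `tryToSendEmail` are hypothetical names chosen for illustration, and the waiting time between tries is left out.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch of a retry loop whose log messages say what failed and what happens next. */
final class RetryingMailer {
    private static final Logger LOGGER = Logger.getLogger(RetryingMailer.class.getName());

    private RetryingMailer() {}

    /** Hypothetical transport; throws on any delivery problem. */
    interface MailSender {
        void send(String recipient, String body) throws Exception;
    }

    /**
     * Tries to send an email at most maxTries times.
     * A real implementation would wait between the tries; that is left out here.
     *
     * @return true = email has been sent, false = all tries failed
     */
    static boolean tryToSendEmail(MailSender sender, String recipient, String body, int maxTries) {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                sender.send(recipient, body);
                if (attempt > 1) {
                    LOGGER.info("Succeeded in sending email to " + recipient
                            + " after " + attempt + " tries.");
                }
                return true;
            } catch (Exception e) {
                int triesLeft = maxTries - attempt;
                if (triesLeft > 0) {
                    // Not a problem yet, so INFO is enough.
                    LOGGER.log(Level.INFO, "Failed to send email to " + recipient
                            + ". Will retry " + triesLeft + " more times.", e);
                } else {
                    LOGGER.log(Level.WARNING, "Failed to send email to " + recipient
                            + ". No more retries left. Will return false to indicate a problem.", e);
                }
            }
        }
        return false;
    }
}
```

Each message states what was tried, for which recipient, and how the method reacts, so the sequence of events can be reconstructed from the log alone.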

What about adding the exception message to the log message? I.e. should one write the following line?

  LOGGER.error("Failed to FOO with BAR: " + e.getMessage(), e);

The advantage of adding the message is that it makes searching easier (especially if grep is used) since all information is then on one line. The disadvantage is that searching also gets more difficult because duplicate matches are found. If the log messages are structured (e.g. if ELK is used) I would recommend excluding the exception message.

I would like to cover two other aspects. First, for complex objects the toString() method should provide the required information. Since one doesn’t know which information may be relevant, it is usually a good starting point to just return all fields. Of course, if security or privacy is relevant one has to adapt this strategy. From my experience I can recommend the ToStringBuilder from the Apache Commons Lang project for this. Note that one has to pay attention to circular references, which result in unbounded recursion.
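As a minimal sketch of this idea without any library, a hand-written toString() can list all fields and redact the sensitive ones. The `Contract` class and its fields are hypothetical; which fields count as sensitive is an assumption made for this example.

```java
/** Hypothetical domain object used only to illustrate toString() for log messages. */
final class Contract {
    private final long id;
    private final String product;
    private final String iban; // assumed to be sensitive: must not appear in logs

    Contract(long id, String product, String iban) {
        this.id = id;
        this.product = product;
        this.iban = iban;
    }

    @Override
    public String toString() {
        // All fields are listed so that the reader of a log sees nothing is missing,
        // but the sensitive value itself is replaced by a marker.
        return "Contract[id=" + id + ", product=" + product + ", iban=<redacted>]";
    }
}
```

With this, a message such as "Failed to get prices for " + contract carries the identifying fields without leaking the account number.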

The second aspect is the formatting of strings in the log message. There are multiple aspects to this:

  • Handling of null
  • Handling of non-printable characters
  • Be able to copy-paste this to easily create a test

In its simplest form a log message is written like this

  LOG.info("Failed to send email to " + email + ".");

Here information is lost for null. The message “Failed to send email to null.” could be caused by email == null or email == "null". A different option is

  LOG.info("Failed to send email to '" + email + "'.");

but again this has problems with email == null.

Especially for the escaping of non-printable characters one has to use a method (commonly named escape(), quote(), format(), …), ending up with code like:

  LOG.info("Failed to send email to " + escape(email) + ".");

The method escape will return something like <null> for null and "foo" (including the quotes) for the string foo. It will also escape non-printable characters like tabs. In the best case the escaping uses the rules for string literals so that a new test case can quickly be created from the log message.
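One possible implementation of such an escape method is sketched below. The class name `LogEscaper` is hypothetical, and the set of escaped characters is a simplification that only covers the most common cases of Java string literal rules.

```java
/** Sketch of an escape helper for log messages; not a complete literal encoder. */
final class LogEscaper {
    private LogEscaper() {}

    /**
     * Formats a value for a log message: the marker <null> for null,
     * otherwise the value quoted like a Java string literal.
     */
    static String escape(String value) {
        if (value == null) {
            return "<null>";
        }
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20 || c == 0x7f) {
                        // Remaining control characters are written as a Unicode escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }
}
```

Now null and the string "null" are distinguishable in the log, and the quoted value can be pasted into a test as a string literal.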

What to do with exceptions?

Let us assume that a method throws a checked exception. How can the caller react to this? I will outline the possible variants, classify them and explain in which cases these variants should be used. The software developer has to react to checked exceptions but, on the other hand, is free to ignore unchecked exceptions. Reacting to an unchecked exception is no different from reacting to a checked exception and, most importantly, the same mistakes can be made.

Variant 1: catch and ignore

try {
  methodCall();
} catch(IOException e){}

In general this is a bad solution because most likely important information is lost. There are, however, some valid cases for such a pattern. One such case is inside a finally block, to ensure that the exception of the try block is not replaced with an exception from the finally code, since the first exception is usually more important. In such and similar cases I usually use two safeguards to ensure that ignoring the exception was really intended and not just laziness: the caught exception variable is named ignored and the catch block has a comment.

file.flush();
try {
  file.close();
} catch(IOException ignored){
  // there is nothing we can do anymore about it
}
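The finally case mentioned above can be sketched as follows. `QuietClose` and `Resource` are hypothetical names; the point is only that a failing close() does not replace the exception thrown by the actual work.

```java
import java.io.Closeable;
import java.io.IOException;

/** Sketch: an exception from the try block survives a failing close() in finally. */
final class QuietClose {
    private QuietClose() {}

    /** Hypothetical resource whose use() and close() may both fail. */
    interface Resource extends Closeable {
        void use() throws IOException;
    }

    static void useAndClose(Resource resource) throws IOException {
        try {
            resource.use();
        } finally {
            try {
                resource.close();
            } catch (IOException ignored) {
                // Deliberately ignored: if use() failed, its exception is the
                // more important one and must not be replaced by this one.
                // Note that a close() failure after a successful use() is
                // swallowed here as well.
            }
        }
    }
}
```

Since Java 7, try-with-resources handles this case differently: an exception thrown by close() is attached to the primary exception as a suppressed exception instead of being lost.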

Variant 2: catch and log

try {
  methodCall();
} catch(IOException e){
  LOGGER.warn("Failed to do FOO with BAR.", e);
}

The problem is not ignored but logged. Should you use this pattern? In this form, only in very few places. The main problem with “catch and ignore” and “catch and log” is that the control flow continues unchanged afterwards. And since a local variable in Java must be assigned before it is read, one can often see code like the following:

String foo = null;
...
try {
  foo = bar.readFoo();
} catch(IOException e){
  LOGGER.warn("Failed to read FOO with BAR.", e);
}
...
if (foo == null) {
  ...
}

In such code an extra burden is placed on the reader who has to understand what values the variable contains in what situations. A better alternative is the following pattern.

Variant 3: catch, log and handle

try {
  fetchedContent = fetch(url);
} catch(IOException e){
  LOGGER.warn("Failed to fetch " + url + ". Will use the empty string.", e);
  fetchedContent = "";
}

Here the handling of the exception is made explicit and is inside the catch block. Ideally a neutral value can be chosen which does not require changes in the remaining method. An alternative is to return early:

try {
  fetchedContent = fetch(url);
} catch(IOException e){
  LOGGER.warn("Failed to fetch " + url + ". Will return null.", e);
  return null;
}

Variant 4: catch and throw enhanced aka catch and wrap

The exception is caught and a new exception is created and thrown instead. The original exception is attached as a nested exception to the new one.

try {
  fetchedContent = fetch(url);
} catch(IOException e){
  throw new RuntimeException("Failed to fetch " + url + ".", e);
}

Using this pattern it is easily possible to build a chain of exceptions which go from the top to the bottom of the stack. This is IMHO a very valuable feature since it makes the debugging much easier. Example:

Controller: Failed to serve HTTP request […].
caused by Controller: Failed to calculate price for Contract[…]
caused by Service: Failed to validate Contract[…]
caused by Soap: Failed to execute soap call for …
caused by Network: Failed to connect to host …
caused by SslSocket: Failed to verify SSL certificate
caused by Crypto: Wrong passphrase for keystore
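A chain like this results from wrapping at each layer. The following sketch builds a three-level chain and renders it in the format shown above; `CauseChain`, the layer methods and the host name are hypothetical.

```java
import java.io.IOException;

/** Sketch: wrapping at each layer and rendering the resulting cause chain. */
final class CauseChain {
    private CauseChain() {}

    /** Renders the message of an exception and of all its causes, one per line. */
    static String describe(Throwable throwable) {
        StringBuilder sb = new StringBuilder();
        for (Throwable current = throwable; current != null; current = current.getCause()) {
            if (current != throwable) {
                sb.append("\ncaused by ");
            }
            sb.append(current.getMessage());
        }
        return sb.toString();
    }

    // Lowest layer: always fails in this sketch.
    static void connect() throws IOException {
        throw new IOException("Failed to connect to host backend.example");
    }

    // Middle layer: adds which call was attempted.
    static void callSoap() {
        try {
            connect();
        } catch (IOException e) {
            throw new RuntimeException("Failed to execute soap call getPrice.", e);
        }
    }

    // Top layer: adds the business object that was being processed.
    static void calculatePrice(String contractId) {
        try {
            callSoap();
        } catch (RuntimeException e) {
            throw new RuntimeException("Failed to calculate price for Contract[" + contractId + "].", e);
        }
    }
}
```

Each layer contributes only what it knows (the contract, the call, the host), and the reader of the final stack trace gets the whole story.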

What should the message of the new exception look like? Very similar to a log message but without the handling and consequences parts:

  • What has been tried?
  • Which were the parameter values?
  • What was the result?
  • Optional: What are possible reasons for the problem?

Whether the new exception should be a checked or an unchecked exception is still open for debate. I prefer unchecked but there are other opinions.

Which exception class should be used? This topic is hotly debated as well. My opinion is that a specific exception class should be used only if the code reacts to these errors in some way (catches the exceptions). This class can come from the JDK or from 3rd party sources, or be created specifically for this purpose. The last option is the most defensive since no 3rd party module can throw such an exception. If there is currently no specific reaction to this type of error, a generic exception is fully valid in my opinion. Please note that if the software component provides a public API (especially to components not under your control), specific exceptions should be used and documented so that the caller can react to them.

A special case of this variant is the transformation of a checked into an unchecked exception. This is sometimes required for the standard functional interfaces of Java 8, whose methods (e.g. Function.apply) do not declare checked exceptions.
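A minimal sketch of this special case: a lambda passed to Stream.map cannot throw IOException, so the exception is wrapped in the JDK's UncheckedIOException. The `fetch` method here is a hypothetical stand-in that fails for an empty URL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: converting a checked exception so that a method fits into Stream.map. */
final class UncheckedWrapping {
    private UncheckedWrapping() {}

    /** Hypothetical fetch operation which declares a checked exception. */
    static String fetch(String url) throws IOException {
        if (url.isEmpty()) {
            throw new IOException("empty URL");
        }
        return "content of " + url;
    }

    static List<String> fetchAll(List<String> urls) {
        return urls.stream()
                .map(url -> {
                    try {
                        return fetch(url);
                    } catch (IOException e) {
                        // Function.apply declares no checked exceptions, so wrap it.
                        throw new UncheckedIOException("Failed to fetch '" + url + "'.", e);
                    }
                })
                .collect(Collectors.toList());
    }
}
```

The original IOException stays available via getCause(), so no information is lost by the conversion.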

Variant 5: catch, log and rethrow AND catch, log and throw enhanced

The exception is caught, logged and the original exception is rethrown or a new exception is thrown.

try {
  fetchedContent = fetch(url);
} catch(IOException e){
  LOGGER.warn("Failed to fetch " + url + ".", e);
  throw e;
}

or

try {
  fetchedContent = fetch(url);
} catch(IOException e){
  LOGGER.warn("Failed to fetch " + url + ".", e);
  throw new RuntimeException("Failed to fetch " + url + ".", e);
}

In short: don’t do this. This is the main reason for seeing an exception multiple times in the log messages (double logging). In such a case it is hard to establish the sequence of events and the number of actual errors. If for some reason you really have to use this variant at least state in the log message that an exception will be thrown.

Variant 6: do not catch

The exception is not caught and therefore walks up the call stack. This is similar to ‘catch and throw enhanced’ with the difference that no further information about the operation is attached. IMHO this is a disadvantage. This variant is the default behaviour for unchecked exceptions.

Variant 7: catch and handle

Like ‘Variant 3: catch, log and handle’ but without the logging. There are also valid use cases for this variant. The requirement is that the developer is sure about the reason for the exception. Example:

boolean isInteger(String str) {
  try {
    Integer.parseInt(str);
    return true;
  } catch(NumberFormatException ignored) {
    return false;
  }
}

Which variant for which use case?

If the special cases are left out the following variants are left:

  • catch, log and handle
  • catch and throw enhanced
  • do not catch

If the exception can be handled, ‘catch, log and handle’ should be used. If useful information from the current method can be added, if a higher rate of problems is expected, or if an unchecked exception is desired, then ‘catch and throw enhanced’ should be used. In all other cases ‘do not catch’ is the right choice.

In many cases the handling of problems happens on the top of the call stack. If we look at a common web application with a REST interface on the server the first choice would be the REST API method. I would argue, however, that the JavaScript client is also part of the call stack. This means that the top of the call stack is the JavaScript event handler and it may be the better place to handle the problem (displaying an error message). So sending a status code of 500 from the server to the client is just another way of propagating the problem. There should still be a log statement on top of the server call stack because:

  • logging inside the server is more reliable
  • no internal details should be leaked over the network
  • it is the best place to log the complete HTTP request (headers + body) for later analysis

Usually such functionality does not have to be implemented in all REST API methods but in a common exception handler.
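Frameworks usually offer their own hook for such a common handler. Independent of any framework, the idea can be sketched in plain Java as below; `TopLevelHandler`, `Endpoint` and `Response` are hypothetical stand-ins, and the request is reduced to a string for brevity.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch of a single exception handler at the top of the server call stack. */
final class TopLevelHandler {
    private static final Logger LOGGER = Logger.getLogger(TopLevelHandler.class.getName());

    private TopLevelHandler() {}

    /** Hypothetical, framework-free stand-in for an HTTP response. */
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Hypothetical REST API method. */
    interface Endpoint {
        Response handle(String request);
    }

    static Response dispatch(Endpoint endpoint, String request) {
        try {
            return endpoint.handle(request);
        } catch (RuntimeException e) {
            // Logged exactly once, on the server, together with the request.
            LOGGER.log(Level.SEVERE, "Failed to serve HTTP request " + request
                    + ". Will send a 500 without internal details.", e);
            // The client only learns that something went wrong, not what.
            return new Response(500, "Internal Server Error");
        }
    }
}
```

The individual endpoints can then use ‘do not catch’ or ‘catch and throw enhanced’ and rely on this one place for logging and for the translation into a status code.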

Interface Design and Exceptions

So far we discussed how to react to exceptions. So when should exceptions be thrown? Exceptions should be thrown if the method can not perform its described functionality.

Example:

void sendMessage1(Message message);

Without further information the software developer calling this method can assume that the function either succeeds in sending the message or throws an exception.

/**
 * @return true = message has been sent, false = sending failed
 */
boolean sendMessage2(Message message);

In this case it is not guaranteed that the sending is always successful. Do you assume that this method throws an exception? Not really. If this method also threw an exception, this would be a burden for the caller, who would then have to check for two things (return value and exception); therefore it would be bad interface design. Side note: since a boolean does not carry much information, the called method (sendMessage2) has to log any exception before converting it to false.

In methods which may fail I prefer to encode this missing guarantee in the name. For example with tryTo:

/**
 * @return true = message has been sent, false = sending failed
 */
boolean tryToSendMessage3(Message message);
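The two command styles can live side by side, with the tryTo variant built on top of the throwing one. This is only a sketch: `MessageSender` and `Transport` are hypothetical, and a plain String replaces the Message type to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: a throwing command and a tryTo variant which reports failure as false. */
final class MessageSender {
    private static final Logger LOGGER = Logger.getLogger(MessageSender.class.getName());

    /** Hypothetical transport which may fail with a checked exception. */
    interface Transport {
        void deliver(String message) throws IOException;
    }

    private final Transport transport;

    MessageSender(Transport transport) {
        this.transport = transport;
    }

    /** Either sends the message or throws; there is no return value to check. */
    void sendMessage(String message) {
        try {
            transport.deliver(message);
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to send message '" + message + "'.", e);
        }
    }

    /** @return true = message has been sent, false = sending failed */
    boolean tryToSendMessage(String message) {
        try {
            sendMessage(message);
            return true;
        } catch (UncheckedIOException e) {
            // A boolean cannot carry the cause, so it has to be logged here.
            LOGGER.log(Level.WARNING, "Failed to send message '" + message
                    + "'. Will return false to indicate a problem.", e);
            return false;
        }
    }
}
```

The caller picks the contract by name: sendMessage for "succeeds or throws", tryToSendMessage for "check the return value, no exception expected".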

This was an example for a command. What about a query?

/** Fetches the price from backend */
double getPrice1(Contract contract);

Clearly, and similar to sendMessage1, the caller expects an exception if the price cannot be calculated. There is also the variant with null (which IMHO should always be mentioned in the Javadoc):

/**
 * @return null if the price cannot be calculated
 */
Double getPrice2(Contract contract);

Or with Optional (without Javadoc):

Optional<Double> getPrice3(Contract contract);

Also similar to above, I expect no exceptions when errors occur; instead, null or Optional.empty() is returned.

During the design of public methods and an API one has to decide whether error conditions are explicitly part of the API (boolean for sendMessage or null/Optional.empty() for getPrice) or whether exceptions will be used. I would suggest starting with (unchecked) exceptions for the following reasons:

  • to keep the API small
  • allow the caller to perform ‘do not catch’ reducing the initial coding effort
  • no thinking about which special value should be used (Should we return null, "" or Optional.empty()?)
  • no special values which require documentation means less documentation

So using exceptions allows a fast initial implementation and the collection of feedback. If during the continuous improvement the decision is made that all callers should handle certain situations the signature can and should be changed (wrapping the result value in an Optional, add a checked exception,…). The compiler can be used here to help catch all call sites.
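The following sketch contrasts the initial, exception-based query with a later Optional-based one. `PriceApi`, `findPrice`, the in-memory map standing in for the backend and the use of a String contract id are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: the same query first with an unchecked exception, later with Optional. */
final class PriceApi {
    // Hypothetical stand-in for the backend.
    private final Map<String, Double> backend = new HashMap<>();

    PriceApi() {
        backend.put("contract-1", 9.99);
    }

    /** Initial version: returns the price or throws; callers may use 'do not catch'. */
    double getPrice(String contractId) {
        Double price = backend.get(contractId);
        if (price == null) {
            throw new IllegalStateException(
                    "Failed to get price for Contract[" + contractId + "]. The backend has no price for it.");
        }
        return price;
    }

    /** Later version: a missing price is part of the signature and must be handled. */
    Optional<Double> findPrice(String contractId) {
        return Optional.ofNullable(backend.get(contractId));
    }
}
```

If getPrice itself were changed to return Optional<Double>, every existing call site such as `double p = api.getPrice(id);` would stop compiling, which is exactly how the compiler helps to find all places that need a decision.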

Again, the default is different if you design an API which has to be stable for a longer time or is used by multiple parties.

The End

Thank you for reading until the end of this longer post. I did not expect there to be so much to write about error handling.

If you want to continue reading about this topic, I can recommend “Need Robust Software? Make It Fragile”. The other posts of the author are also worth a read as they challenge common positions.

Comments

  • Sascha Fröhlich

    20 January 2016 by Sascha Fröhlich

    I don’t think one should use exceptions for flow control, so I don’t agree with variant 7. I like the rest, great article!
