
Find Java Memory Leaks at Runtime (Act 5)


Act 4 of our series on OutOfMemoryError closed with the promise of better approaches to finding memory leaks. We explained that, while we can find big objects in heap dumps, they only indicate a leak when an OutOfMemoryError has actually occurred. To have a chance of finding anything in a post-mortem analysis, one should always set the JVM parameter -XX:+HeapDumpOnOutOfMemoryError.
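
For reference, enabling the dump and choosing a target directory looks like this on the command line (the path and the jar name are just examples):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dumps -jar myapp.jar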

But not all leaks cause an OutOfMemoryError and produce a dump; some would simply take a very long time to do so. The server and the JVM might even be restarted at regular intervals, for deployments or precisely to fight memory issues, before the error ever occurs.

To find slowly growing memory leaks, we have to perform a more complicated and time-consuming analysis. We could use multiple dumps spread out over time. In theory they would allow us to recognize growing structures, but in practice this is tedious: the difference between dumps consists mostly of normal fluctuation, which makes it hard to spot the relevant delta. On top of that, you already have to know which use cases produce the leak, so that you can invoke them between dumps. But perhaps the biggest problem of all is that creating a dump in production is not advisable, because it can hang the system for seconds to minutes, depending on the heap size.
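
If you nevertheless want to try the multi-dump approach, comparing class histograms is usually more practical than diffing full dumps. A rough sketch using the standard jmap tool (the pid 12345 is just an example; be aware that the live option itself triggers a full garbage collection):

jmap -histo:live 12345 > histo-before.txt
# invoke the suspected use case, wait a while
jmap -histo:live 12345 > histo-after.txt
diff histo-before.txt histo-after.txt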

A better solution is to monitor the heap, and the relevant objects within it, while our application is running. By doing so we can track every structure and get notified when something keeps growing over time. And because the application is still running fine, we can also easily find out which code is interacting with the leak. This is not possible at all with heap dumps, as they contain no information about code.

The concept

The pattern for runtime analysis of the heap is pretty simple:

  • Find all objects which are created by the application.
  • Track those objects and record their size.
  • Alert on any “abnormal” behavior.
  • Provide content and invoking code for diagnosis.

Unfortunately, each and every one of those points brings a lot of issues in practice, which is why there are only a few implementations of this concept. Even finding all the relevant objects is not an easy task. In Act 4 I recommended focusing on our own packages, like de.codecentric.memoryleak. But what do we do when standard classes leak? While in a demo application the number of objects might be manageable, real applications contain millions of objects. How can we ever efficiently store data on such complex structures? And what is “abnormal” behavior? Are there sizes and lifetimes of objects that we can consider “normal”?

An Implementation

As an example of this concept, I am going to showcase the Leak Detection feature of the APM solution AppDynamics. The only other implementation of leak detection not based on heap dumps that I am currently aware of is the Introscope Leak Hunter. Should you know a tool which works in a similar way, I would be happy to learn about it in the comments!

Assumptions

As you can guess, a solution as outlined above cannot realistically be implemented. We need to simplify the problem using a few assumptions. Luckily there are quite a few assumptions you can safely make for any Java program. For example, the typical age distribution of objects is what allows garbage collectors to work with different generations, as I described in Act 3.

AppDynamics makes the following assumptions:

  • There is no need to monitor all objects. Experience shows that most memory leaks are caused by putting data into collection-type structures, like lists and maps, and never removing it again. Custom cache implementations are a very typical example. Because of that, AppDynamics monitors just those classes.
  • We do not need to look at collections that are never used, like internal structures created by the application server on startup. We only need the structures our code interacts with.
  • Of those active collections, we only need to monitor the ones containing a relevant number of objects. And because we are looking for leaks, that number has to increase over time.
  • Such long and active collections could be leaks, but to become a relevant problem for the stability of our application, a collection also has to dominate a significant amount of memory.
  • All of these factors have to apply over a longer period of time.

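A minimal sketch of such a growth heuristic in plain Java could look like the following. This is only my illustration of the listed assumptions, not AppDynamics' actual algorithm, and the thresholds are made up:

class GrowthTracker {
	// "a relevant number of objects"
	private static final int MIN_SIZE = 10000;
	// "over a longer period of time": consecutive growing samples we require
	private static final int REQUIRED_GROWTH_SAMPLES = 5;

	private int lastSize = -1;
	private int growthStreak = 0;

	// Called periodically with the current collection size; returns true
	// once the collection is both large and continuously growing.
	boolean sample(int currentSize) {
		growthStreak = (currentSize > lastSize) ? growthStreak + 1 : 0;
		lastSize = currentSize;
		return currentSize >= MIN_SIZE && growthStreak >= REQUIRED_GROWTH_SAMPLES;
	}
}
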
AppDynamics finds leaks with a process built along these assumptions, which minimizes the impact on the monitored JVM. Additionally, AppDynamics uses elaborate algorithms to calculate object tree sizes efficiently with very low overhead, even under high load. Nevertheless, memory analysis will always come with a certain extra overhead.

My Open Source Collection Analyzer

Because AppDynamics is a commercial solution, I wanted to take my own shot at implementing such a memory leak finder.
You can find my version of a basic Java memory analyzer on GitHub.

The fundamental idea is surprisingly easy to implement: my analyzer consists of just two classes.

CollectionAnalyzerAspect

I make assumptions similar to AppDynamics' and just watch collections. If I wanted to, I could add any other possibly leaking classes here:

@Before("call(* java.util.Map.put(..)) && "
		+ "!this(de.codecentric.performance.memory.CollectionAnalyzerAspect)")
public void trackMapPuts(final JoinPoint thisJoinPoint) {
	// the target of the intercepted call is the Map instance being written to
	Map target = (Map) thisJoinPoint.getTarget();
	// fetch (or create) the statistics object for exactly this instance,
	// record the writing code location and re-evaluate the collection
	CollectionStatistics stats = getStatistics(target);
	stats.recordWrite(getLocation(thisJoinPoint));
	stats.evaluate(target.size());
}

This pointcut adds my code before all calls to Map.put(). Because I am using a map myself to store the statistical data, I need to exclude my own aspect to avoid a nasty recursion. I then fetch a statistics object for the monitored collection instance, record the access, and evaluate its usage.
This is a simplistic approach. It would be much better to evaluate all statistics periodically in a separate thread, as sketched below, than to do it synchronously on every call.
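
A minimal sketch of that idea, assuming a hypothetical evaluateAllStatistics() helper that walks over all recorded CollectionStatistics instances:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService evaluator = Executors.newSingleThreadScheduledExecutor();
evaluator.scheduleWithFixedDelay(new Runnable() {
	public void run() {
		evaluateAllStatistics(); // hypothetical: evaluates every recorded collection
	}
}, 30, 30, TimeUnit.SECONDS);
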
There is one additional interesting problem: how can I identify the collection I am currently inspecting? For that I am using the identity hash code, but I already know that this might not be a wise idea, as it is not guaranteed to be unique:

int identityHashCode = System.identityHashCode(targetCollection);
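
An alternative I might try instead (my own sketch, not what the analyzer currently does) is keying the statistics map by the collection instance itself in a java.util.IdentityHashMap, which compares by reference identity and therefore sidesteps the uniqueness problem. The CollectionStatistics constructor shown here is assumed. The obvious downside: the strong keys would keep dead collections alive, so a real implementation would still need some form of weak referencing.

import java.util.IdentityHashMap;
import java.util.Map;

private final Map<Object, CollectionStatistics> statsByInstance =
		new IdentityHashMap<Object, CollectionStatistics>();

private CollectionStatistics getStatistics(Object targetCollection) {
	CollectionStatistics stats = statsByInstance.get(targetCollection);
	if (stats == null) {
		// assumed constructor taking the class name for later reporting
		stats = new CollectionStatistics(targetCollection.getClass().getName());
		statsByInstance.put(targetCollection, stats);
	}
	return stats;
}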

CollectionStatistics

Ok, I have now recorded the invocation counts for those methods. So what do I do with this data?

public void evaluate(int size) {
	if (size >= DANGEROUS_SIZE) {
		System.out.printf("\nInformation for Collection %s (id: %d)\n", className, id);
		System.out.printf(" * Collection is very long (%d)!\n", size);
		if (reads == 0) {
			System.out.printf(" * Collection was never read!\n");
		}
		if (deletes == 0) {
			System.out.printf(" * Collection was never reduced!\n");
		}
		System.out.printf("Recorded usage for this Collection:\n");
		for (String code : interactingCode) {
			System.out.printf(" * %s\n", code);
		}
	}
}

I did not really settle on an elaborate criterion for when a collection behaves “abnormally”; I just took a hardcoded collection length. It would be a great idea to hold a WeakReference to the collection and calculate its dominator tree, but calculating the deep size of an object is a pretty complex problem of its own.
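
At least the shallow size is obtainable with standard means. A small sketch, assuming the analyzer were packaged as a java agent (registered via the Premain-Class manifest entry and -javaagent): the JDK's Instrumentation API reports the size of a single object, but a deep size would still require walking the reachable object graph yourself.

import java.lang.instrument.Instrumentation;

public class SizeAgent {
	private static volatile Instrumentation instrumentation;

	public static void premain(String args, Instrumentation inst) {
		instrumentation = inst;
	}

	// Shallow size only: the object header and its fields,
	// not any of the objects it references.
	public static long shallowSizeOf(Object o) {
		return instrumentation.getObjectSize(o);
	}
}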

Besides the length, I consider two more factors interesting:

  • Is this collection ever read?
  • Was something ever deleted from it?

Both are typical antipatterns for caches. Nobody reading from or deleting from a long list is a clear indicator of a problem, which is why I warn about both. Finally, I print all the recorded invoking code locations, which is pretty useful information!

A Test Run

Information for Collection java.util.ArrayList (id: 1813612981)
 * Collection is very long (5000)!
 * Collection was never reduced!
Recorded usage for this Collection:
 * de.codecentric.performance.LeakDemo:19
 * de.codecentric.performance.LeakDemo:17
 * de.codecentric.performance.LeakDemo:18
 
Information for Collection java.util.ArrayList (id: 1444378545)
 * Collection is very long (5000)!
 * Collection was never read!
 * Collection was never reduced!
Recorded usage for this Collection:
 * de.codecentric.performance.LeakDemo:18
 
Information for Collection java.util.HashMap (id: 515060127)
 * Collection is very long (5000)!
 * Collection was never read!
 * Collection was never reduced!
Recorded usage for this Collection:
 * de.codecentric.performance.LeakDemo:19
 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at de.codecentric.performance.DummyData.<init>(DummyData.java:5)
	at de.codecentric.performance.LeakDemo.runAndLeak(LeakDemo.java:17)
	at de.codecentric.performance.DemoRunner.main(DemoRunner.java:12)

Eureka, it works!

Overhead

Recording each and every invocation on a collection is quite memory intensive. Perhaps I should only do this once there is an indication that a collection could be leaking. But by using those AspectJ pointcuts, my code will always run; in a real environment with hundreds of thousands of such collections, that is certainly not a great idea. Dynamic bytecode instrumentation could be used to activate the tracking on demand. And of course an evaluation over a longer period of time makes more sense than my quick check.
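
A skeleton of that instrumentation idea, again assuming a java agent (illustration only; the actual bytecode rewriting, e.g. with ASM or Javassist, is omitted, and instrumenting bootstrap classes like java.util.HashMap additionally requires retransformation support):

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class LeakDetectionAgent {
	public static void premain(String args, Instrumentation inst) {
		inst.addTransformer(new ClassFileTransformer() {
			public byte[] transform(ClassLoader loader, String className,
					Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
					byte[] classfileBuffer) {
				if ("java/util/HashMap".equals(className)) {
					// here a bytecode library would rewrite put()/remove()
					// to call into the statistics code
				}
				return null; // null means: leave the class unchanged
			}
		});
	}
}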

As we can see, the idea is easy to implement, but a production-ready solution requires a good amount of thinking and clever algorithms. If you would like to improve my analyzer, feel free to send patches via GitHub.

Memory Analysis of a Demo Application using AppDynamics

So let's have a look at how AppDynamics handles this in a professional solution.

10:43 – Application Server Restart

After starting the AppDynamics leak detection you will not get immediate results: it begins analyzing collections in the background, and possible leaks only show up after a while.

11:00 – Collection detected

This java.util.LinkedList is monitored by AppDynamics. It has 56,881 entries, which indeed makes it interesting. But AppDynamics has no long-term information yet, so it is not marked as “potentially leaking”.

11:10 – Collection potentially leaking

Time has passed and the collection has continued to grow: 98,850 entries are almost twice as many as ten minutes ago. The internal heuristics now mark the collection as “potentially leaking”.

11:17 – The leak is growing

The overview shows the growth of the leak. Garbage collection activity would also be drawn here, to visualize the effects of using SoftReferences.
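
For illustration, a hypothetical cache built on SoftReferences (loadReport() is made up): after a garbage collection under memory pressure the map still contains its entries, but their referents may be gone, so the memory dominated by the collection drops while the entry count stays flat. That is exactly the pattern the GC markers help to identify.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

Map<String, SoftReference<byte[]>> cache = new HashMap<String, SoftReference<byte[]>>();
cache.put("report-2012-06", new SoftReference<byte[]>(loadReport()));

SoftReference<byte[]> ref = cache.get("report-2012-06");
byte[] data = (ref != null) ? ref.get() : null; // may be null after a GC cleared the referent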

11:30 – Showing the memory leak

The Content Inspection shows us what is inside the collection. In this case there are now 118,990 java.lang.String objects with a total size of 20 MB.
AppDynamics can also dump the collection and its contents to disk to allow a more detailed analysis.

11:38 – Identifying the root cause

Using an Access Tracking Session, AppDynamics finds out who is creating this memory leak. While you might get this far using heap dumps as well, the listing of the call hierarchy is something special: LinkedLists containing Strings could be in use anywhere, but this leaking LinkedList is used by the “newbookmark” business transaction.
The BookmarkDaoImpl is appending Strings to that list in line 50. However, AppDynamics did not see any code reading from or deleting from this list.

So we now have all the information we need to fix this memory leak:

  • We can see the potentially leaking structures.
  • We get notified about leaks automatically.
  • We can see the contents of those structures.
  • Business transactions (Use Cases) responsible for creating the leak are identified.
  • Accessing code is recorded and shown.

The final decision on whether this is a memory leak or just strange code is of course still up to the developer.

Wrap Up

It is possible to find memory leaks at runtime without creating heap dumps, and the information on the invoking code is very useful for actually fixing them. Unfortunately, there is no free or open source product that finds leaks this way, and as you have seen, it is not advisable to implement this yourself.
AppDynamics has a free 30-day trial which includes the memory leak finder, so you can check for yourself whether it is something you can use.

Comments

  • Nice article! Would it be possible to cover JVM parameters/flags and GC in a similar series down the road?

  • Hi TJ,
    yeah, Patrick is writing the series on JVM parameters, but it is only available in German.
    Next up, however, is a series on garbage collection. But that is still a secret 🙂

    Fabian

  • Thank you. Could you also write something on benchmarking? I hear this term very often at my organization, where people talk about writing benchmarking code to measure performance, and sometimes about using JMX to read application metrics.
