I was at a customer recently to further improve their application monitoring, for which they were using AppDynamics. When I arrived, they told me:
“Good that you came, Fabian, we have something interesting to show you.” Usually operations guys are very concerned about something when they say this to me, but in this case they were happy. “We had a big outage when we took our new software live on Saturday!” Oops? Nobody likes a production outage on a Saturday! Why were they so happy about it? “Well, it was a long Saturday, but thanks to our monitoring we knew what was going on.”
So let me walk you through what they did, and show you the problem that killed their server.
An Overview of the Situation
Here is what the situation looked like on Saturday at 8 in the morning.
If you don’t know AppDynamics, here is a short introduction:
- The biggest area is the application map. AppDynamics autodiscovers and monitors all systems and their interaction.
- On the right side there are some statistics. We can see some Stalls and Abnormal Slow requests. AppDynamics found out that something is not right.
- At the bottom is the historical view. Because this is historical data now, we only see hourly data. During the incident, the data was more fine-grained.
The historical data is of most interest for us, looking back.
We can see the server start on the left hand side (blue icons on top of green bar). The load on the system climbed rapidly (green bar). But then the system broke down. The response time increased massively (blue bar). And requests decreased. Soon the server crashed and had to be restarted.
But the restart did not help; the system did not recover. Luckily, it did later, once the problem was fixed 🙂
What went wrong
The system crashed with an OutOfMemoryError. In our OutOfMemoryError series we have already discussed some variants, and here is another: “unable to create new native thread”.
This error is interesting in itself. Outages with this kind of error are usually caused not by inefficient code but by infrastructure problems or buggy threading code. So let's look at the number of threads on each server as recorded by AppDynamics:
Ouch. When it crashed, the system had already created 1800 threads.
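If you have no monitoring tool at hand, the JVM itself can report the same number. The following is a minimal sketch using the standard `java.lang.management` API; the class name `ThreadCountProbe` and its helper methods are made up for illustration and are not part of AppDynamics or WebLogic.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reads thread counts from the JVM it runs in.
public class ThreadCountProbe {

    /** Number of live threads (daemon and non-daemon) right now. */
    static int liveThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.getThreadCount();
    }

    /** Highest live thread count since JVM start (or since the peak was last reset). */
    static int peakThreads() {
        return ManagementFactory.getThreadMXBean().getPeakThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreads()
                + ", peak threads: " + peakThreads());
    }
}
```

Logging these two values periodically, or exposing them over JMX, would show a climb like the one in the chart well before the server runs out of native threads.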
What was producing these threads? I had seen this before at other customers, so I recognized it immediately. My client was also able to work it out by looking at a few snapshots of stalled calls. JSP compilation was the culprit.
All application servers have an option to precompile JSPs on startup or to compile them when they are first requested. But both settings are problematic when the server is started under high load.
The problem was resolved on Sunday morning at 4 AM. But the ops team did not have to stay awake for those 20 hours. They had reported the qualified issue to Oracle around noon on Saturday. The “fix” Oracle provided was simple, but not documented anywhere. Or at least not anywhere you can find it if you do not already know about it.
Here is the screenshot from AppDynamics discovering the property change and restart at 4AM.
Limiting JSP Compiler Threads on WebLogic
The “secret” switch was an environment variable you have to set:
BEA_COMPILER_NUM_THREADS = 1
While 1 is certainly very pessimistic, it seems to be a far better setting than unlimited, which is the default. No one should ever create an unlimited number of threads. In the classical view, you should have only as many threads as you have cores. Modern JVMs can easily handle about ten times as many threads without much starvation. But in this case we had about 100 compiler threads per CPU.
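To illustrate the general principle of capping threads, here is a minimal sketch with a fixed-size pool from `java.util.concurrent`. It is not how WebLogic's JSP compiler is implemented; the class `BoundedCompilePool`, the method `runBounded`, and the dummy tasks are assumptions made up for this example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: many tasks, but never more than maxThreads workers.
public class BoundedCompilePool {

    /**
     * Runs taskCount dummy tasks on a pool capped at maxThreads and
     * returns how many distinct worker threads actually did the work.
     */
    static int runBounded(int maxThreads, int taskCount) {
        Set<String> workers = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (int i = 0; i < taskCount; i++) {
            // Stand-in for real work such as compiling one JSP.
            pool.execute(() -> workers.add(Thread.currentThread().getName()));
        }
        pool.shutdown(); // accept no new tasks, finish the queued ones
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return workers.size();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int used = runBounded(cores, 1000);
        System.out.println("1000 tasks ran on " + used
                + " threads (cap: " + cores + ")");
    }
}
```

Tasks beyond the cap simply wait in the pool's queue instead of each getting a thread of their own, so a burst of requests slows compilation down rather than exhausting native threads.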