I was at a customer recently to further improve their application monitoring, for which they were using AppDynamics. When I arrived, they told me:
“Good that you came, Fabian, we have something interesting to show you.” Usually operations guys are very concerned about something when they say this to me, but in this case they were happy. “We had a big outage when we took our new software live on Saturday!” Oops? Nobody likes a production outage on a Saturday! Why were they so happy about it? “Well, it was a long Saturday, but thanks to our monitoring we knew what was going on.”
So let me walk you through what they did, and show you the problem that killed their server.
An Overview of the Situation
Here is what the situation looked like on Saturday at 8 in the morning.
If you don’t know AppDynamics, here is a short introduction:
- The biggest area is the application map. AppDynamics autodiscovers and monitors all systems and their interaction.
- On the right side there are some statistics. We can see some stalls and abnormally slow requests: AppDynamics had detected that something was not right.
- At the bottom is the historical view. Because this is historical data now, we only see hourly data. During the incident, the data was more fine-grained.
Looking back, the historical data is of most interest to us.
We can see the server start on the left-hand side (blue icons on top of the green bar). The load on the system climbed rapidly (green bar). But then the system broke down: the response time increased massively (blue bar) and the number of requests decreased. Soon the server crashed and had to be restarted.
But the restart did not help; the system did not recover. Luckily it did later, after the problem was fixed 🙂
What went wrong
The system crashed with an OutOfMemoryError. In our OutOfMemoryError Series, we already discussed some variants, and here is another: “unable to create new native thread”.
This error is interesting in itself. Outages with this kind of error are usually not caused by inefficient code, but either by infrastructure problems or by buggy threading code. So let's look at the number of threads on each server as recorded by AppDynamics:
Ouch. When it crashed, the system had already created 1800 threads.
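If no monitoring tool is at hand, the JVM itself can report how many threads it is running. The following is a minimal sketch using the standard `java.lang.management` API; the class name `ThreadCountProbe` is a hypothetical helper for illustration and is not how AppDynamics collects its data.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of AppDynamics or WebLogic.
final class ThreadCountProbe {

    /** Returns {live, peak}: the current live thread count and the peak since JVM start (or last reset). */
    static int[] snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int live = threads.getThreadCount();     // daemon and non-daemon threads alive right now
        int peak = threads.getPeakThreadCount(); // highest live count observed so far
        return new int[] { live, peak };
    }

    public static void main(String[] args) {
        int[] s = snapshot();
        System.out.println("live=" + s[0] + " peak=" + s[1]);
    }
}
```

Logging such a snapshot periodically would make a runaway thread count visible long before the JVM fails to create another native thread.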
What was producing these threads? I had seen this before at other customers, so I knew immediately. My client was also able to find out by looking at a few snapshots of stalled calls: JSP compilation was the culprit.
Application servers typically offer the choice to precompile JSPs on startup or to compile them when they are requested for the first time. But both settings are problematic when the server is started under high load.
The problem was resolved on Sunday morning at 4 AM. But the ops team did not have to stay awake for those 20 hours. They had reported the qualified issue to Oracle around noon on Saturday. The “fix” Oracle provided was simple, but not documented anywhere, or at least not anywhere you would find it without already knowing what to look for.
Here is the screenshot from AppDynamics discovering the property change and the restart at 4 AM.
Limiting JSP Compiler Threads on WebLogic
The “secret” switch was an environment variable you have to set:
BEA_COMPILER_NUM_THREADS = 1
While 1 is certainly very pessimistic, it seems to be a far better setting than unlimited, which is the default. No one should ever create an unlimited number of threads. In the classical view, you should only have as many threads as you have cores. Modern JVMs can easily handle about ten times that many threads without much starvation. But in this case we had about 100 compiler threads per CPU.
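The general idea of capping worker threads can be sketched in plain Java. This is an illustration only, under the assumption that compile jobs are independent tasks; it does not show how WebLogic schedules JSP compilation, and `BoundedCompileDemo`, `runBounded` and the sleeping fake job are made up for the example. A fixed-size pool queues excess work instead of spawning a new thread per job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: hypothetical names, not WebLogic's JSP compiler.
final class BoundedCompileDemo {

    /** Runs `tasks` fake compile jobs on at most `maxThreads` workers and returns the peak concurrency observed. */
    static int runBounded(int tasks, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // stand-in for compiling one JSP
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every job and surface failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed", e.getCause());
        } finally {
            pool.shutdown(); // workers exit once the queue is drained
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int peak = runBounded(200, cores);
        System.out.println("cores=" + cores + " peak concurrent jobs=" + peak);
    }
}
```

However many jobs are submitted, the observed peak never exceeds the pool size, which is the property the unlimited default lacked.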