Today I read a blog entry by Roman Spitzbart titled APM Myth Busters: Sampling is as Good as Having all Transactions. Roman is the sales engineering manager for dynaTrace APM at Compuware.
Even though some of the points Roman makes are true, he presents a lot of new myths about APM that I want to comment on.
The truth is that APM is not easy, especially in big, distributed systems. I’ve recently read a lot about monitoring of highly distributed systems, like this blog entry by Netflix or this presentation by Twitter at the Strata conference last week. There is also Google’s Dapper paper on this topic, which gives deep insights into how to develop an APM system for the modern IT world. These stories are great because they are written by great people at companies that run really big systems, and they are totally vendor neutral.
One statement of the Dapper paper by Google is:
The Dapper overhead attributed to any given process is proportional to the number of traces that process samples per unit time.
The overhead of any APM system depends on a few simple factors. It can be approximated with simple math:

overhead = transactions_per_second x measurement_points_per_transaction x implementation_overhead
What does that mean? The first part is easy: the more data you measure and produce, the more overhead you generate. That is just “physics”. I’ve put in an implementation factor because there can still be big differences in how measurement agents are implemented – this depends on the solution. E.g. look at this benchmark by William Louth on the performance of his agent when measuring method calls. A method call would be a measurement_point in my formula. If the numbers in the benchmark are correct (which I cannot claim, as I didn’t run the benchmark myself), then William’s Satoris could measure 50 times more measurement points per second than NewRelic. That would be an implementation_overhead factor of 50 for NewRelic.
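The back-of-the-envelope formula can be sketched as a tiny Python function. All numbers below are invented purely for illustration – they are not real benchmark figures for any vendor:

```python
def apm_overhead(transactions_per_second, points_per_transaction, implementation_factor):
    """Back-of-the-envelope APM overhead: measurements produced per second,
    scaled by how efficient the agent implementation is (factor 1 = baseline)."""
    return transactions_per_second * points_per_transaction * implementation_factor

# Hypothetical example: 1000 tx/s, 50 instrumented points per transaction.
# Agent A costs 1 unit of work per measurement, agent B costs 50x as much.
agent_a = apm_overhead(1000, 50, 1)   # 50,000 units/s
agent_b = apm_overhead(1000, 50, 50)  # 2,500,000 units/s
```

The point of the sketch is only that overhead scales multiplicatively: halving any one of the three factors halves the total.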
*All* APM solutions have to work on these three factors to get the best result for their customers:
Number of transactions
The right approach depends on the load of the system. If you have thousands of transactions per second on a highly distributed system and you want to monitor them in near realtime, you simply cannot measure *every* transaction in *depth*. Google says in the Dapper paper that capturing only every 1024th transaction was good enough for them: they had so much traffic that they still got all the data they needed for their problems. That is just statistics with large numbers – and the claim was made after two years in production with Dapper. But the paper also says: “However, lower traffic workloads may miss important events at such low sampling rates, while tolerating higher sampling rates with acceptable performance overheads.” This means that if the number of transactions is lower, you can measure more transactions (maybe even all of them). The sampling can be done simply (randomly), but also in a more adaptive and intelligent way, so that you capture the relevant transactions. APM vendors choose different strategies here – or they always measure all transactions, as Roman says dynaTrace does, which has other drawbacks, as you can calculate yourself.
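To make the two strategies concrete, here is a minimal Python sketch of Dapper-style 1-in-1024 sampling next to an adaptive variant that raises the rate when traffic is low. The target of 2 traces per second in the adaptive version is an invented example value, not something from the paper:

```python
import random

def should_trace(transaction_id):
    """Fixed-rate sampling: keep every 1024th transaction,
    the production default mentioned in the Dapper paper."""
    return transaction_id % 1024 == 0

def adaptive_rate(transactions_per_second, target_traces_per_second=2):
    """Adaptive variant: aim for a fixed number of traces per second.
    On low-traffic services the rate caps out at 1.0 (trace everything).
    The target of 2 traces/s is an invented illustration value."""
    if transactions_per_second <= 0:
        return 1.0
    return min(1.0, target_traces_per_second / transactions_per_second)

def should_trace_adaptive(transactions_per_second):
    """Randomized sampling decision at the current adaptive rate."""
    return random.random() < adaptive_rate(transactions_per_second)
```

With the adaptive variant, a service doing 1 tx/s traces everything, while a service doing 2000 tx/s samples at a rate of 0.001 – which is exactly the trade-off the Dapper quote describes.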
Number of measurement points
The number of measurement points is also a critical factor. You simply cannot measure every method call with every parameter in every transaction. Measurement points are normally generated by instrumenting the application code. Some tools require the user to manually instrument the code or even program the measurement points; others build intelligence into the agent to adapt the instrumentation based on statistics, so that only relevant code is measured.
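One way such agent-side intelligence could work is sketched below, with invented thresholds: keep per-method call statistics and drop instrumentation from methods that are both very hot and very cheap, since those cost the most to measure while revealing the least. This is a toy illustration of the idea, not any vendor’s actual algorithm:

```python
class InstrumentationBudget:
    """Toy sketch of adaptive instrumentation: methods whose average
    measured time is tiny but which are called extremely often get
    uninstrumented, because they dominate overhead while carrying
    little diagnostic information. Thresholds are invented examples."""

    def __init__(self, min_avg_ms=0.1, max_calls_per_sec=10_000):
        self.min_avg_ms = min_avg_ms
        self.max_calls_per_sec = max_calls_per_sec
        self.stats = {}  # method name -> (call_count, total_ms)

    def record(self, method, elapsed_ms):
        """Update statistics for one measured call."""
        count, total = self.stats.get(method, (0, 0.0))
        self.stats[method] = (count + 1, total + elapsed_ms)

    def keep_instrumented(self, method, window_seconds):
        """Decide whether the method stays instrumented for the next window."""
        count, total = self.stats.get(method, (0, 0.0))
        if count == 0:
            return True  # no data yet, keep measuring
        avg_ms = total / count
        calls_per_sec = count / window_seconds
        # Drop instrumentation only for hot-but-trivial methods.
        return not (avg_ms < self.min_avg_ms and calls_per_sec > self.max_calls_per_sec)
```

A slow method stays measured no matter how often it runs; only the “getter called a million times” category is pruned, which directly shrinks the measurement_points factor in the overhead formula.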
As described above, how the agent selects transactions and how the instrumentation code is implemented add further overhead that varies by vendor.
So my suggestion for customers: do not listen to *any* marketing messages from these vendors, but really understand your APM needs and the vendors’ different technologies. Choose 2-3 APM solutions that seem to fit your needs best and then simply do a proof of concept to evaluate the parameters I’ve described above and to test each solution in terms of value and usability.
The requirements for a high-traffic, distributed application are different from those for a small web application on a single machine! The “magic” is to get the information you need to solve your problems or make decisions AND to do so with the lowest overhead.
And the best tip is free: if a self-proclaimed myth buster enters your house, you can be sure he has some new myths in his backpack! 🙂