Metrics monitoring and real-time analysis

Every developer needs to know what is happening inside their application. We are no different, and to achieve this we track a lot of data, from low-level system stats to many high-level, application-specific metrics.

To collect application data and plot it, and since we use autoscaling, we need to aggregate it over time inside the application, send it to a collector, and aggregate it again across instances. For that second round of aggregation we use Etsy’s statsd with a few modifications of our own. Those modifications let us create compound metrics based on the raw data we collect: for example, we have the raw clicks and pageviews and, from them, we calculate the CTR (click-through rate). Since statsd only collects and aggregates data, we had to devise our own composition engine with custom rules to give us that information. You can get our changes on GitHub; feel free to modify them and contribute back :)
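To make the idea of a compound metric concrete, here is a minimal Python sketch. It is not our statsd patch: the names merge_instances, compose and RATIO_RULES are made up for illustration, and the rule format (compound name mapped to a numerator and a denominator) is an assumption.

```python
from collections import Counter

def merge_instances(per_instance):
    """Sum the raw counters reported by each instance for one flush interval."""
    totals = Counter()
    for counters in per_instance:
        totals.update(counters)  # Counter.update adds counts, it does not overwrite
    return totals

# Hypothetical rule table: compound metric -> (numerator, denominator).
RATIO_RULES = {"ctr": ("clicks", "pageviews")}

def compose(totals, rules=RATIO_RULES):
    """Derive compound metrics from the already aggregated raw counters."""
    derived = {}
    for name, (num, den) in rules.items():
        if totals.get(den, 0) > 0:  # skip the rule when the denominator is missing or zero
            derived[name] = totals.get(num, 0) / totals[den]
    return derived

# Two autoscaled instances reporting for the same interval (made-up numbers).
instances = [
    {"clicks": 3, "pageviews": 100},
    {"clicks": 7, "pageviews": 300},
]
totals = merge_instances(instances)
derived = compose(totals)
print(derived)
```

The ratio is computed after summing across instances on purpose: averaging each instance's own CTR would weight a quiet instance as much as a busy one and give a different number.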

Measuring data is an exercise in parsimony. When you first get to it, you feel compelled to track each and every piece of data you can. But, as matheusrossato says: “not everything you can measure is important, and not everything that is important can be measured”. For what it’s worth, when we first started measuring data on one of our products, we also sent the data to Librato for plotting, and our bill was about 5 times what it is today, with less than half of the useful data we have today. Our mistake was that we were basically measuring useless data, just because we could.

With that in mind, there is also a good deal of combining and processing raw data to plot something else, as in the CTR example above. This is where statsd came in for us. While you can get very good insights from raw data alone, there are many times when it is just not enough to show you something. What kind of data needs to be combined or post-processed? That is up to you and your experience with what you are doing :) When you have no information on that, try experimenting with combining some of the data you already have, a few metrics at a time, keeping what makes sense and discarding what does not. Also, if you don’t track some kind of data and find yourself needing it twice, track it! Try to keep a library that makes it easy to add more data to your monitoring system.
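As a rough idea of what such a library could look like, here is a small Python sketch. MetricsBuffer is a hypothetical helper, not our actual code: it accumulates counters inside the application and, on flush, renders them as lines in the statsd counter format (name:value|c). The prefix and metric names are made up.

```python
import threading
from collections import defaultdict

class MetricsBuffer:
    """Hypothetical helper: accumulate counters in-process, flush as statsd lines."""

    def __init__(self, prefix="app"):
        self._prefix = prefix
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, name, value=1):
        """Track a metric; adding a new one is a single call with a new name."""
        with self._lock:
            self._counts[name] += value

    def flush(self):
        """Return one statsd counter line per metric and reset the buffer."""
        with self._lock:
            counts, self._counts = self._counts, defaultdict(int)
        return [
            f"{self._prefix}.{name}:{value}|c"
            for name, value in sorted(counts.items())
        ]

buf = MetricsBuffer(prefix="web")
buf.incr("clicks")
buf.incr("pageviews", 40)
buf.incr("clicks", 2)
lines = buf.flush()
print(lines)
```

Each line returned by flush could then be sent to the collector as a UDP datagram (statsd listens on port 8125 by default), typically from a timer so the application only sends once per interval.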

If you are using Librato, use their annotations API to plot data about deploys and changes in infrastructure or application settings. It helps a lot to quickly identify why a metric has freaked out.

Last but not least, learn from what your monitoring tells you and improve on it. While it may sound obvious, all that data you gather can easily be a waste of money if you don’t pay attention to it and listen to what it has to say.