Open-Sourcing Rearview: Real-Time Monitoring with Graphite

If you’re using StatsD and Graphite, you’ve probably seen tools that let you create simple monitors that alert when your data crosses an upper or lower threshold. To date, however, none offer the ability to write custom monitors that allow the creation of control charts with seasonal data or deployment-triggered monitors. That’s exactly why the Analytics team at LivingSocial created the open-source tool rearview.


Rearview is a Scala monitoring framework for Graphite time series data. Monitors are simple Ruby scripts that run in a sandbox which prevents I/O. Each monitor is configured with a crontab-compatible time specification used by the scheduler.

Monitors define the following attributes:

  1. One or more Graphite metrics.
  2. Crontab time specification.
  3. Optional Ruby expression. If no custom graph calls are made, a default graph is generated.
  4. Optional PagerDuty API keys and/or email addresses.

The monitor workflow is as follows:

[Figure: Rearview workflow]

Monitor Details

A monitor is simply a Ruby script which runs with some timeseries data in scope by default. The variables in scope are generated from the job’s definition of metrics and how far back to retrieve data. A monitor author can use the data in scope to determine, in any way they see fit, whether an alert should be generated.

The add or edit monitor UI has several fields, but the most important are the metrics, minutes back, and monitor Ruby expression fields (see Figure 1 below).

[Figure 1: Sample Rearview monitor]

Let’s suppose we calculate the conversion rate for our ad server over the last 30 minutes, and we want to generate an alert if it drops below 10%. In this example, we would specify the following metrics:

alias(stats_counts.adserver.web_traffic.impression, "impressions")
alias(stats_counts.adserver.web_traffic.conversion, "conversions")

By entering 30 into the minutes back field, the monitor will grab 30 minutes’ worth of data. Depending on your Graphite configuration, this could be anywhere from 1800 datapoints per metric (for 1s retention) to 30 datapoints per metric (for 1-minute retention). In this example we will be using a 10s retention, which returns 180 datapoints per metric.
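The datapoint counts above are just the window length divided by the retention interval; the helper below is only that arithmetic (a name of ours, not a rearview API):

```ruby
# Number of datapoints Graphite returns for a time window, given the
# retention interval. Illustrative helper only -- not part of rearview.
def datapoints(minutes_back, retention_seconds)
  (minutes_back * 60) / retention_seconds
end

datapoints(30, 1)   # 1s retention   => 1800 datapoints
datapoints(30, 60)  # 1min retention => 30 datapoints
datapoints(30, 10)  # 10s retention  => 180 datapoints, as in this example
```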

The monitor code would be defined as follows:

puts @timeseries

impressions = @a.values.sum # the sum method uses to_f to convert Nils to 0.0
conversions = @b.values.sum

rate = (conversions / impressions) * 100
puts rate

raise "The conversion rate has dropped below 10%" if rate < 10

By default, rearview creates a namespace for the monitor with some implicit instance variables defined. The variables begin with @a, which corresponds to the first metric in the list, @b to the second metric, and so on. In this example the timeseries for impressions is @a and conversions is @b. Each timeseries variable @a, @b, … etc. is a TimeSeries instance whose entries have the fields:

  • label – the name of the metric for the timeseries (String). This value has an accessor and can be set to some other value for readability in graphs, etc.
  • timestamp – the epoch timestamp in seconds (Fixnum)
  • value – the value of the datapoint (Float; may be nil)
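A minimal stand-in for an entry (a Struct of ours, not rearview’s actual class) shows why nil values are safe to sum — nil.to_f coerces to 0.0:

```ruby
# Minimal stand-in for a timeseries entry -- not rearview's actual class,
# just the three fields described above.
Entry = Struct.new(:label, :timestamp, :value)

entries = [
  Entry.new("impressions", 1361381120, 82.0),
  Entry.new("impressions", 1361381130, nil)  # gaps come back as nil
]

# nil.to_f == 0.0, which is why summing with to_f is safe
total = entries.map { |e| e.value.to_f }.sum
```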

Additionally, there is a variable @timeseries in scope, which is an Array of the TimeSeries objects described above. So @a, @b, … etc. are just convenience variables which correspond to each entry of @timeseries in the order specified in the metrics UI text field. The string representation of the @timeseries variable for the above example, over one minute’s worth of data, would be:

[
    {
        label: impressions,
        entries: [
            { label: impressions, timestamp: 1361381120, value: 82.0 },
            { label: impressions, timestamp: 1361381130, value: 74.0 },
            { label: impressions, timestamp: 1361381140, value: 72.0 },
            { label: impressions, timestamp: 1361381150, value: 72.0 },
            { label: impressions, timestamp: 1361381160, value: 81.0 },
            { label: impressions, timestamp: 1361381170, value: 70.0 },
            { label: impressions, timestamp: 1361381180, value: nil }
        ]
    },
    {
        label: conversions,
        entries: [
            { label: conversions, timestamp: 1361381120, value: 17.0 },
            { label: conversions, timestamp: 1361381130, value: 17.0 },
            { label: conversions, timestamp: 1361381140, value: 17.0 },
            { label: conversions, timestamp: 1361381150, value: 11.0 },
            { label: conversions, timestamp: 1361381160, value: 18.0 },
            { label: conversions, timestamp: 1361381170, value: 6.0 },
            { label: conversions, timestamp: 1361381180, value: nil }
        ]
    }
]

Notice there are two array entries in @timeseries, which correspond to the variables @a and @b. The default label for each metric is set to the alias for a given timeseries. If an alias is not specified, the default value will match the exact string used in the metric field. Optionally, you can set the label manually within the monitor like this:

@a.label = "impressions"
@b.label = "conversions"

Returning to the example: the first line prints the @timeseries variable; all output from the monitor appears in the output field. The next two lines sum the values of the impressions and conversions entries using the utility array method sum located in /src/main/resources/jruby/utilities.rb. That file also contains array methods for calculating mean, median, and percentile, and any method added to it becomes available to all monitors.
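In spirit, those helpers look something like the sketch below (an approximation of the behavior described above, not the exact contents of utilities.rb — the real file patches Array itself):

```ruby
# Approximations of the array helpers described above -- a sketch, not
# the exact contents of utilities.rb.
module ArrayStats
  module_function

  # Sum that treats nil as 0.0 (nil.to_f == 0.0), matching the comment
  # in the monitor example.
  def sum(values)
    values.inject(0.0) { |acc, v| acc + v.to_f }
  end

  def mean(values)
    values.empty? ? 0.0 : sum(values) / values.size
  end

  def median(values)
    percentile(values, 50)
  end

  # Nearest-rank percentile over the sorted values.
  def percentile(values, pct)
    return 0.0 if values.empty?
    sorted = values.map(&:to_f).sort
    sorted[((pct / 100.0) * (sorted.size - 1)).round]
  end
end
```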

The next line calculates the conversion rate and prints it with puts, which also appears in the UI output field. Using puts is a handy way to debug a monitor initially and determine the shape of the data. Lastly, a monitor generates an alert by simply raising an exception with whatever text you want to appear in the email or PagerDuty alert.

The following are the variables provided implicitly to a monitor:

  • @name – Name of the monitor specified in the name field in the UI
  • @minutes – Number of minutes specified in the minutes field in the UI
  • @jobId – An ID generated by the server for the job. This defaults to -1 for new, unsaved monitors.
  • @timeseries – An Array of the TimeSeries objects described above, one per metric
  • @a, @b, …, @z – One per metric. If there are more than 26 metrics, the variables wrap and begin again at @a1, @b1, etc. (However, if you have more than 26 metrics you’re likely doing something wrong.)
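The index-to-variable naming scheme can be sketched with a small helper (hypothetical, for illustration only — not rearview code):

```ruby
# Maps a metric's position in the list to its instance-variable name:
# 0 => "@a", 1 => "@b", ..., 25 => "@z", 26 => "@a1", 27 => "@b1", ...
# A hypothetical helper illustrating the scheme, not rearview code.
def metric_var_name(index)
  letter = ('a'.ord + index % 26).chr
  cycle  = index / 26
  cycle.zero? ? "@#{letter}" : "@#{letter}#{cycle}"
end
```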

There are a few utility functions also available to the monitor:

  • with_metrics
  • fold_metrics
  • graph_value

The utility functions are better explained through an example:

impressions = 0
conversions = 0
rate = 0

with_metrics do |a, b|
  impressions += a.value.to_f
  conversions += b.value.to_f
  rate = (conversions / impressions) * 100
  graph_value["# of #{a.label}", a.timestamp, a.value]
  graph_value["# of #{b.label}", b.timestamp, b.value]
  graph_value["Conversion Rate", a.timestamp, rate]
end

raise "The conversion rate has dropped below 10%." if rate < 10

In this example we’re using the utility functions with_metrics and graph_value. The with_metrics function is a convenience function which introduces variables to the passed block, all aligned to the same timeslice of the time series. So in the example, a corresponds to impressions and b to conversions, and each iteration through the block advances to the next timestamp until the end of the series. The graph_value function plots the specified value at the given timestamp. In the example, the resulting graph will render three lines, with the labels “# of Impressions”, “# of Conversions”, and “Conversion Rate”.
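The behavior can be mocked in a few lines — our reading of the semantics described above, not rearview’s implementation (the bracket syntax works if graph_value returns an object whose [] method records a point):

```ruby
# A self-contained mock of with_metrics/graph_value -- our reading of the
# behavior described above, not rearview's implementation.
Entry  = Struct.new(:label, :timestamp, :value)
Series = Struct.new(:label, :entries)

# Zips the series entry-by-entry, yielding one aligned timeslice per
# iteration.
def with_metrics(*series)
  series.map(&:entries).transpose.each { |slice| yield(*slice) }
end

# graph_value["label", ts, v] works if graph_value returns an object
# whose [] method records a point to plot.
class Plotter
  def points
    @points ||= []
  end

  def [](label, timestamp, value)
    points << [label, timestamp, value]
  end
end

def graph_value
  @plotter ||= Plotter.new
end

a = Series.new("impressions", [Entry.new("impressions", 1361381120, 82.0),
                               Entry.new("impressions", 1361381130, 74.0)])
b = Series.new("conversions", [Entry.new("conversions", 1361381120, 17.0),
                               Entry.new("conversions", 1361381130, 17.0)])

with_metrics(a, b) do |x, y|
  graph_value["# of #{x.label}", x.timestamp, x.value]
  graph_value["# of #{y.label}", y.timestamp, y.value]
end
```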

Future Development

If you’ve made it this far you are likely interested in more information on installation and deployment. If so, check out the bottom of the README on the rearview project page on GitHub, and please feel free to contribute to the project in any way you can. Finally, if you’re more comfortable with Rails than Scala, you’ll be happy to know we are working on a Rails port, which should be available soon.

This post is cross-posted from Steve Akers’ blog.