The Unsatisfying State of Prometheus-Based Monitoring of AWS Infrastructure
Originally posted on LinkedIn
Introduction
Prometheus has become the de facto industry standard for monitoring systems. It is a well-established open-source toolkit for collecting, storing and querying metrics, and for triggering alerts based on them. Prometheus has a large community, and most open-source services either expose their metrics in a Prometheus-compatible format or have tooling around them that makes those metrics available in a scrapeable format.
On the cloud provider front, Amazon Web Services (AWS) has been - and still is - the leader in terms of market share[1]. A large number of businesses rely heavily not only on the infrastructure it provides, but also on the hosted and proprietary services it sells (DynamoDB, SQS, etc). AWS provides its own product for monitoring these services: AWS CloudWatch. CloudWatch can store custom metrics, but more importantly it is the only way of accessing the metrics generated by AWS services.
It is common to see companies end up with a partitioned monitoring solution. The applications, services and self-hosted pieces of infrastructure are monitored using a Prometheus-centered stack (Prometheus, Grafana, Alertmanager, etc) while the proprietary services they use are monitored through CloudWatch.
This is not ideal for several reasons. Primarily, the two solutions differ along many dimensions, which makes it difficult to have a workforce that is comfortable with both systems and can use either one efficiently depending on the use case. Training is basically doubled, as the concepts do not transfer well from one solution to the other. A partitioned monitoring solution also makes it harder to correlate metrics, simply because they do not live in the same system. I think it is fair to assume that companies will want to centralize their monitoring into a one-stop shop as much as possible.
I would argue that centralizing into CloudWatch is not the common - or even the right - approach. The number of things without native Prometheus support is smaller than the number of things without any CloudWatch support. Also, if I have to unify, I’d rather unify towards a widely used industry standard than towards a proprietary tool.
Current Tooling Available
If the decision has been made to unify the metrics on the Prometheus side, then the CloudWatch metrics need to be imported into Prometheus. The two tools I have found for this are the Prometheus CloudWatch Exporter and YACE (Yet Another CloudWatch Exporter). Both can transform metrics exposed by AWS CloudWatch into metrics that Prometheus can scrape. Even though they differ in some ways, they share the same underlying flaws, and those flaws limit the value of their output.
The issues stem from the trade-offs taken by the developers of these tools and from what they set out to achieve. I would summarize these tools as general frameworks that transform any AWS CloudWatch metric into a Prometheus scrapeable format. They don’t aim to understand the semantics of each metric AWS CloudWatch provides or what it represents; this greatly simplifies maintenance, but comes with limitations. Also, the decision to retrieve the metrics through the AWS API rather than through other available mechanisms was likely made because it keeps the tool and its deployment model simpler.
The first and most important of these limitations is a conceptual flaw. The exporters obtain an aggregated metric from AWS CloudWatch through its API and expose whatever they obtained as a gauge. All the metrics exposed through these exporters are gauges, regardless of what they represent. For some metrics - such as the number of ECS tasks running - this is OK, but for most of them (Lambda invocation count, ALB request count, etc) other Prometheus metric types, such as counters or summaries, would be more appropriate. Because of this design, rates cannot be properly calculated and interpolation is not possible.
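To make the gauge versus counter distinction concrete, here is a minimal Python sketch. It is not how either exporter works; it only illustrates the idea. It assumes a hypothetical Datapoint type that holds the Sum statistic CloudWatch reports for one period, and it accumulates those per-period sums into the kind of ever-increasing series a Prometheus counter would carry. Every name in it is made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Datapoint:
    """Hypothetical stand-in for one CloudWatch datapoint (not an AWS SDK type)."""
    timestamp: int  # start of the aggregation period, in Unix seconds
    total: float    # value of the "Sum" statistic for that period


def to_counter_samples(datapoints: Iterable[Datapoint]) -> List[Tuple[int, float]]:
    """Turn per-period sums into a cumulative, counter-like series.

    Each output sample is (timestamp, running total of all sums so far),
    so the series never decreases as long as the sums are non-negative.
    """
    running = 0.0
    samples: List[Tuple[int, float]] = []
    # Datapoints may arrive out of order, so sort by time before accumulating.
    for dp in sorted(datapoints, key=lambda d: d.timestamp):
        running += dp.total
        samples.append((dp.timestamp, running))
    return samples


if __name__ == "__main__":
    per_period = [Datapoint(120, 5), Datapoint(60, 3), Datapoint(180, 0)]
    print(to_counter_samples(per_period))
```

With the cumulative series, the increase between any two samples is simply the difference of their values, which is what makes rate calculations well defined. The raw per-period values exposed as gauges do not have that property.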
Secondly, the trade-offs between the configured scraping frequency, the lookback period and the delay have a major impact on the retrieved metrics as well as on performance. Given that these exporters work by periodically polling the CloudWatch API, and that AWS doesn’t make the metrics immediately available through the API, a mismatch in these settings can lead to several non-optimal situations. Setting the frequency and lookback periods too low can lead to metrics with many gaps, as the datapoints will not be available at the frequency at which they are requested. Using the Sum statistic with a lookback period larger than the scraping interval will lead to “overlapping” datapoints; normalizing to rates can be done in PromQL, but the normalization depends on the lookback period used. Setting the update frequency too high for a large infrastructure footprint can lead to the API calls being throttled and/or the exporter not being able to finish in time, again leaving gaps in the metrics. But lowering the frequency, increasing the period and adding a delay can lead to an intolerable lag before the metric becomes available. For some metrics, I have observed that an average lag of 15 minutes was needed to make them reliable.
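The “overlapping” effect is easier to see with numbers. The following Python sketch is a toy simulation under simplifying assumptions: requests are counted per minute, data is available with no delay, and the exporter exposes the Sum over the whole lookback window on every scrape. The function names and parameters are hypothetical and do not correspond to settings of either exporter.

```python
from typing import List


def simulate_scrapes(counts_per_minute: List[int],
                     lookback_minutes: int,
                     scrape_every_minutes: int) -> List[int]:
    """Return the gauge value seen at each scrape.

    Each scrape reports the sum of the last `lookback_minutes` minutes.
    Scraping starts once a full lookback window of data exists.
    """
    scraped = []
    for end in range(lookback_minutes, len(counts_per_minute) + 1, scrape_every_minutes):
        window = counts_per_minute[end - lookback_minutes:end]
        scraped.append(sum(window))
    return scraped


def per_second_rate(scraped_sum: float, lookback_minutes: int) -> float:
    """Normalize a scraped sum to a per-second rate.

    The divisor is the lookback window, not the scrape interval, which is
    why the normalization depends on how the exporter was configured.
    """
    return scraped_sum / (lookback_minutes * 60)


if __name__ == "__main__":
    traffic = [60] * 10  # a steady 60 requests per minute for ten minutes
    values = simulate_scrapes(traffic, lookback_minutes=5, scrape_every_minutes=1)
    print(values)                         # one value per scrape
    print(sum(values), sum(traffic))      # naive total vs. real total
    print(per_second_rate(values[0], 5))  # normalized rate
```

With a steady 60 requests per minute, a 5-minute lookback scraped every minute yields 300 on every scrape. Adding up the six scraped values gives 1800, although only 600 requests happened in those ten minutes, because each minute is counted in up to five windows. Dividing a scraped value by the lookback window (300 seconds) recovers the real rate of 1 request per second, but only if whoever writes the query knows the lookback period the exporter was configured with.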
Conclusion
Overall, the tooling I found for retrieving CloudWatch metrics and exposing them in a way that Prometheus can scrape is quite unfulfilling. The two exporters provide a basic level of translation, which is definitely better than nothing. But for enterprises where monitoring is business-critical, the shortcomings of the features they provide are noticeable. The constraints imposed by the way the metrics are exposed, together with the lag in metric availability, make it harder to centralize monitoring solely in Prometheus, and falling back to using CloudWatch directly seems to be the result.