The key point is that, with metrics, there are really only two (major) ways to do it: push or pull. Does your application or server push the data out to your monitoring tool, or does your monitoring tool pull the data in?
PUSH
In one of our first run-ins with metrics, we took an incredibly naïve approach: at a fixed interval, each service pushed all the data for all of its custom (and built-in) metrics to a monitoring tool. Why was this naïve?
- CPU Intensive
- Many connections made to emit metrics
- If the monitoring tool is down, it can affect PROD
- Timing of pushes configured on every service
Okay, but let’s say your monitoring tool can receive metrics in bulk, so you don’t need to send them individually. That may solve the first two points, but it still doesn’t solve the last two. If the monitoring tool goes offline, you’re still pushing metrics to a dead service, and the push timing still has to be configured on every service.
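To make the bulk-push idea concrete, here is a minimal sketch in Python (not .NET, just to keep it short). Everything in it is hypothetical: `BufferedPusher`, the `send` callable, and the `max_buffer` limit are illustrative names, not part of any real monitoring client. It shows that batching cuts the number of connections, but the service still has to decide what to do when the push fails.

```python
import time


class BufferedPusher:
    """Collects metric samples and pushes them to the monitoring tool in one batch."""

    def __init__(self, send, max_buffer=1000):
        # `send` is a hypothetical callable that delivers a batch (e.g. one HTTP POST).
        self._send = send
        self._buffer = []
        self._max_buffer = max_buffer
        self.failed_pushes = 0

    def record(self, name, value):
        # Drop the oldest sample rather than let the buffer grow without bound.
        if len(self._buffer) >= self._max_buffer:
            self._buffer.pop(0)
        self._buffer.append((name, value, time.time()))

    def flush(self):
        """Push everything buffered so far. Returns True on success, False on failure."""
        if not self._buffer:
            return True
        batch, self._buffer = self._buffer, []
        try:
            self._send(batch)
            return True
        except Exception:
            # The monitoring tool is unreachable: this sketch drops the batch so that
            # a dead monitoring service cannot take the application down with it.
            self.failed_pushes += 1
            return False
```

Even in this small version, the service owns the flush schedule, the buffer limit, and the failure policy. Those are exactly the last two bullets above, and batching does nothing for them.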
PULL
I strongly believe pulling metrics is the way to go, and this is how tools like Prometheus work. Each service renders a page with its metrics, which the central Prometheus service polls. If Prometheus goes offline, it does not affect PROD at all; the polling for updates simply stops.
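As a rough illustration of the pull side, here is a sketch using only the Python standard library (again Python rather than .NET, and not a real Prometheus client library). The counter names, the `/metrics` path, and the helper functions are assumptions made for the example; the output is simple `name value` lines, loosely modeled on a plain-text metrics page.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would use a metrics library.
_lock = threading.Lock()
_counters = {"requests_total": 0, "errors_total": 0}


def increment(name, amount=1):
    """Bump a counter. Cheap and local: no network call is made here."""
    with _lock:
        _counters[name] = _counters.get(name, 0) + amount


def render_metrics():
    """Render all counters as 'name value' lines, one metric per line."""
    with _lock:
        return "".join(f"{name} {value}\n" for name, value in sorted(_counters.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (assumed) /metrics path is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_metrics_server(port):
    """Serve the metrics page on a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The service would call something like `start_metrics_server(8000)` once at startup and `increment("requests_total")` wherever it counts things. Nothing in the service knows where the monitoring tool lives or how often it polls, so if the poller disappears, the page just goes unread.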
If you haven’t checked out Prometheus, I highly recommend it. For .NET (my development environment) there are great libraries with default metrics out of the box for both services and servers. In a future post, I’ll cover implementing Prometheus and visualizing the metrics.