Lorenzo Iannone, Head of Development
Microservices are hugely beneficial in many cases. They allow us to use the right tool for the right job. They keep the codebases small and easy for agile teams to manage. They make everything easier to scale, and the system more resilient.
But not everything in microservice land is golden.
In a microservice architecture, an application can fail or degrade in performance when any one microservice fails. Knowing the state of each system component is essential to knowing the overall status of the application. Thus, having an efficient monitoring system becomes crucial to evaluating performance.
In an architecture made of microservices, we sometimes deal with a non-negligible level of noise, unactionable metrics, and clusters of events generated from different services. Microservices have therefore brought us new challenges for observing and monitoring how systems operate.
To better monitor microservices, we need tools capable of storing a huge amount of data. We also need to collect information from multiple sources and allow observers to extract meaningful information from them.
It’s important to monitor the right events and metrics.
System events. In a monolithic world, almost any single system event is of interest. This is not necessarily true in a microservice architecture. Restarting a monolith can be catastrophic. A microservice restart, on the other hand, can be quite different. When a microservice restarts, correlating the event with other collected metrics can be beneficial. How many restarts did we count? In what time range? Unfortunately, there is no golden rule to pre-determine a good clustering policy; this can only be achieved through incremental refinements.
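As a minimal sketch of the "how many restarts, in what time range" question, the snippet below counts restarts of one service inside a sliding window. The class name, the window length and the limit are illustrative assumptions, not values from our setup, and timestamps are assumed to arrive in ascending order.

```python
from collections import deque


class RestartCounter:
    """Counts restarts of one service inside a sliding time window."""

    def __init__(self, window_seconds: float, max_restarts: int):
        self.window_seconds = window_seconds  # assumed window length
        self.max_restarts = max_restarts      # assumed tolerated restarts
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record a restart; return True when the window holds too many."""
        self._timestamps.append(timestamp)
        # Drop restarts that have fallen out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_restarts
```

A single restart stays quiet, while a burst of restarts within the window returns True. Both numbers are the kind of thing that only settles after a few rounds of refinement.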
Platform metrics. Such metrics help highlight anything unusual in infrastructure behaviour. Some examples of platform metrics are the number of database connections, the number of running service instances (high spikes may be a predictor of a DoS attack), CPU usage, and the host’s memory usage.
Alerts generated from platform metrics do not always indicate an issue. They can sometimes be false positives. For example, a spike in the number of service instances could be the result of increased traffic due to recent advertisements or press conferences.
Microservice and application-specific metrics. Monitoring microservice-specific metrics can be useful. These metrics convey the state of a service whenever a problem occurs. But collecting and analyzing them can be tricky. These metrics and their values are sometimes language-dependent.
Python software is likely to be served by WSGI workers, while a Go microservice is typically a single executable binary. The number of WSGI workers can be relevant when defining dashboards and alerts, but it is not meaningful for microservices that don’t use WSGI. Using multiple languages to write microservices makes it particularly difficult to define baselines.
Some additional metrics that can help provide a clearer snapshot of the state of the services are the failed/successful request ratio, response time, number of client connections, and errors/exceptions logged at the code level.
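To make two of these metrics concrete, here is a small sketch that derives a failure ratio and a mean response time from a batch of request records. The record shape and the choice to treat only 5xx status codes as failures are assumptions made for the example, and the ratio is computed as failed requests over all requests.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    status_code: int    # HTTP status returned to the client
    duration_ms: float  # time spent serving the request


def summarize(records):
    """Return the failure ratio and mean response time for a batch."""
    total = len(records)
    if total == 0:
        return {"error_ratio": 0.0, "mean_response_ms": 0.0}
    # Assumption: only server-side (5xx) responses count as failures.
    failed = sum(1 for r in records if r.status_code >= 500)
    return {
        "error_ratio": failed / total,
        "mean_response_ms": mean(r.duration_ms for r in records),
    }
```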
Once we identify which key metrics to monitor, it’s important to set up an alert system. All alerts should be meaningful and minimize false positives. Having many false positives creates a tendency to ignore alerts, and sooner or later a relevant alert will be ignored. To minimize false positives, thresholds should be adapted over time. (When analyzing historical data, keep in mind that the process of tuning thresholds will go through multiple iterations before becoming satisfactory.)
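One possible starting point for a threshold is a high percentile of the historical values plus some headroom. The sketch below illustrates that idea only; the function name, the 99th percentile and the 20% headroom are assumptions, and each would need revisiting on every tuning iteration.

```python
from statistics import quantiles


def suggest_threshold(history, percentile=99, headroom=1.2):
    """Suggest an alert threshold from historical metric values.

    Needs at least two data points; `percentile` must be in 1..99.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cut_points = quantiles(history, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom
```

Re-running the function as new data comes in gives a candidate value to compare against the threshold currently in place.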
It’s good practice to mark alerts with severity tags such as warning, error, or critical. This helps evaluate and address issues in a timely manner. Issues that take place after office hours, for example, can be evaluated on their urgency and possibly deferred until the next working day.
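A sketch of how severity tags could drive that decision is shown below. The office hours, the rule that only critical alerts interrupt people outside them, and the returned action names are all assumptions for illustration.

```python
from datetime import datetime, time
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


# Assumed office hours, Monday to Friday.
OFFICE_START = time(9, 0)
OFFICE_END = time(18, 0)


def route_alert(severity: Severity, raised_at: datetime) -> str:
    """Decide whether an alert is sent now or deferred."""
    in_office_hours = (
        raised_at.weekday() < 5
        and OFFICE_START <= raised_at.time() < OFFICE_END
    )
    # Assumption: critical alerts always go out immediately.
    if severity is Severity.CRITICAL or in_office_hours:
        return "notify_now"
    return "defer_to_next_working_day"
```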
Never forget to set alerts for the absence of metrics. If you are not collecting data, you are unlikely to understand what’s happening. Without metrics, there are no alerts. And without alerts, you will not be notified of malfunctions.
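A minimal way to express an absence check is to remember when each metric was last seen and flag the ones that have gone quiet. The function name, the metric names in the usage note and the five-minute limit are assumptions for the example.

```python
def find_silent_metrics(last_seen, now, max_age_seconds=300.0):
    """Return the names of metrics not reported within the allowed age.

    `last_seen` maps a metric name to the Unix timestamp of its
    latest sample; `now` is the current Unix timestamp.
    """
    return sorted(
        name
        for name, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )
```

For instance, with `{"backend_errors": 1000.0, "queue_depth": 400.0}` and `now=1100.0`, only `queue_depth` is reported as silent.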
What works for us
We reevaluated our logging policy.
We marked as error/critical only the application errors we felt needed immediate attention. Any standard application errors (routes not found, bad input, etc.) were marked with an INFO tag. (As developers, we don’t always pay proper attention to the right issues. This can be reflected in unnecessary noise in our monitoring system.) Next, we created a centralized logging system, which made it easier for us to collect and inspect logs.
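The policy can be sketched with Python's standard logging module. This is not our production code: the logger name is hypothetical, and mapping 5xx responses to ERROR while everything else (including 404 and 400) goes to INFO is an assumption about what "needs immediate attention".

```python
import logging

logger = logging.getLogger("example-service")  # hypothetical service name


def level_for_status(status_code: int) -> int:
    """Pick a log level for an HTTP response status."""
    # Assumption: only server-side failures need immediate attention.
    if status_code >= 500:
        return logging.ERROR
    # Routes not found, bad input, etc. are expected noise.
    return logging.INFO


def log_response(path: str, status_code: int) -> None:
    """Log a handled request at the level the policy assigns."""
    logger.log(level_for_status(status_code), "%s -> %d", path, status_code)
```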
We began a series of iterations to better tune our thresholds.
In one instance, we created an alert to notify us if our cluster nodes scale up or down too much. After a few attempts, we were better able to identify our normal scaling patterns. Understanding such patterns allows us to set meaningful thresholds that do not generate false positives. Over time, as the business and number of nodes grow, it will become increasingly important to review these thresholds.
We keep track of all metrics using Prometheus.
Some examples of what Prometheus monitors are…
– Backend Errors Ratio
– Storage status
– Memory status
– Metrics collector status (if we are not collecting metrics, we’re in trouble!)
– Certificate expiration
– Kubernetes pod restarts
– Queues status and message processing time
A final suggestion.
It’s important to receive alerts on different media. We use Slack, email and mobile phone app integrations. This minimizes the likelihood of an alert going unseen.
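The fan-out itself can be sketched in a few lines. Here each medium is represented by a plain callable; the real Slack, email and mobile integrations are not shown, and the function and channel names are hypothetical.

```python
def notify_all(message, channels):
    """Send `message` through every channel; return those that succeeded.

    `channels` maps a channel name to a callable taking the message.
    """
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
        except Exception:
            # One broken channel must not stop the others.
            continue
        delivered.append(name)
    return delivered
```

If the returned list is shorter than the list of configured channels, that is itself worth alerting on.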