As I have always been interested in microservices, I was happy to join the microservices team within Archers. A proof of concept, exploring some of microservices’ key concepts, was already well underway. However, there was not yet a solution to trace calls through the architecture and so make troubleshooting and debugging easier. In this blog article we introduce Elastic Stack as a solution to address these challenges.

We chose Elastic Stack because it is one of the most common toolsets and most of our colleagues at Archers already have experience with it. The goal of this article is not to examine Elastic in depth, but to add observability to our proof of concept. In addition, we offer a first look at the possibilities and challenges of implementing a logging framework.

Setting up an Elastic Stack

As we wanted to implement our Elastic Stack according to the microservices principles, a separate container was needed for each of Kibana, Elasticsearch and Logstash. We didn’t want to use Beats unless it was really necessary; more on that later.

Since Elastic provides images via its own Docker registry, we could pull them directly in the docker-compose file. Kibana didn’t need any additional configuration. Elasticsearch normally needs persistent storage, but as we are only interested in the latest data (tracing for debugging), we could skip this. Logstash has to be attuned to the way the logs are provided, which we did by mounting the location of its pipeline config file.

An extract from the docker-compose file, showing the Logstash part:

logstash:
  image: docker.elastic.co/logstash/logstash:7.1.1
  ports:
    - "5959:5959"
  environment:
    - ELASTICSEARCH_URL=172.30.1.5:9200
  volumes:
    - ../microservices/logstash/logstash.conf/:/usr/share/logstash/pipeline/logstash.conf
  networks:
    - skynet
  depends_on:
    - elasticsearch

Log provision and intake

We also had to choose among the myriad ways to deliver our log messages to Elasticsearch. We explored the following possibilities:

  • Attaching Beats as a sidecar to our microservices. Beats are small modules that collect data and feed it to the Elastic Stack; they can send log files to Logstash or even straight to Elasticsearch.
  • Keeping our logs on the microservice containers and letting Logstash read and then remove them.
  • Letting the microservices put logs as messages on a queue and configuring Logstash to read from this queue.

We chose the last option, using a queue, as this corresponded to the way our microservices were already set up: to communicate, we only needed to create a new message topic on RabbitMQ. The main advantage of this setup is that on the microservice side there is no risk of an “overflow” of log messages during high load, as they can always be offloaded to the queue. On the other side, Logstash can’t be overloaded either, as it takes and processes messages off the queue at its own pace. Note that these capacities still have to be somewhat in line with each other, as the queue depth isn’t infinite, but the buffer does help to avoid problems during traffic peaks.
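
To make the queue option concrete, here is a minimal sketch of how a microservice could publish a log message with the plain RabbitMQ Java client. The class and method names are ours, and this is not our actual implementation (in the proof of concept the messages are produced via Logback, see the note further down); the exchange name, routing key and header mirror the Logstash configuration shown below.

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class LogPublisher {

    // Hypothetical helper: publishes one log line to the logging exchange.
    // The service name travels as the routing key and the correlation id as a header,
    // which is exactly the metadata the Logstash filter below copies into its own fields.
    public static void publish(String serviceName, String correlationId, String message) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq"); // same host Logstash connects to

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                    .contentType("text/plain")
                    .headers(Map.<String, Object>of("correlationId", correlationId))
                    .build();

            // Assumes the "healthrisk_exchange_log" exchange has already been declared elsewhere in the setup.
            channel.basicPublish("healthrisk_exchange_log", serviceName, props,
                    message.getBytes(StandardCharsets.UTF_8));
        }
    }
}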

As we aimed to reduce the complexity of the architecture, we used the same queue for all microservices. Having a separate queue for each service would be more in line with the microservices principle of ‘independence’; however, as all of these services belong to the same functional domain, we considered this trade-off (complexity vs. independence) justified.

This meant we needed to standardize the content of our log messages across the different services. First, we settled on JSON for the payload, as Logstash supports it out of the box. In addition, we defined a default set of fields. This way, we only had to configure one parser for all the messages.
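
Our exact field set isn’t reproduced here, but as an illustration, this is how a payload with a few hypothetical default fields could be serialized to JSON with Jackson before it is put on the queue (the field names are made up):

import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

public class LogPayloads {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Illustrative default field set; the real one used in the proof of concept is not listed in this article.
    public static String buildPayload(String component, String level, String detail) throws Exception {
        return MAPPER.writeValueAsString(Map.of(
                "component", component,
                "level", level,
                "detail", detail));
    }
}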

Fortunately, Logstash has a RabbitMQ input plugin that can consume from this queue directly.

Here you can find an extract from the aforementioned logstash.conf file:

input {
  rabbitmq {
    host => "rabbitmq"
    exchange => "healthrisk_exchange_log"
    queue => "logging_queue"
    key => "*"
    durable => true
    subscription_retry_interval_seconds => 5
    codec => plain
    metadata_enabled => true
  }
  stdin { codec => plain }
}
filter {
  mutate {
    copy => { "[@metadata][rabbitmq_headers][correlationId]" => "corr_id" }
  }
  mutate {
    copy => { "[@metadata][rabbitmq_properties][routing-key]" => "serviceName" }
  }
  dissect {
    mapping => { "message" => "%{logtime} - %{action} - %{payload}" }
  }
  date {
    match => [ "logtime", "ISO8601" ]
  }
  json {
    source => "payload"
    target => "payload"
  }
}

Note: we used Logback with a custom configuration to produce these log messages, and we called the logger manually in the code to send them. For this example we had to build the date layout manually, but you can also look into a framework with an established standard for Elastic.

<layout>
  <pattern>%date{yyyy-MM-dd'T'HH:mm:ss.SSS'Z'} - %m%n</pattern>
</layout>
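
A minimal, illustrative example of such a manual call (the action name and payload are made up): with the pattern above, Logback prepends the timestamp, so the service itself only supplies the “action - payload” part of the line that the dissect filter expects.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RiskService {

    private static final Logger LOG = LoggerFactory.getLogger(RiskService.class);

    public void createRisk(String riskJson) {
        // "CREATE_RISK" and the JSON payload are illustrative; Logback's pattern above
        // prepends the timestamp, so the full line becomes "logtime - action - payload".
        LOG.info("CREATE_RISK - {}", riskJson);
    }
}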

Correlating logs

As mentioned before, we ensured that all services provided their logs in JSON format and logged what happened inside the service. However, this is not sufficient. In a distributed system where each component (service) executes a single task to achieve a common goal, it is of utmost importance to link messages that are part of the same workflow instead of only tracing individual messages. This can be done by assigning the same “correlation” ID to every message in a workflow. When every message also contains the right information about which component (service) executed a particular action, a traceability overview can be generated.
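
How the correlation ID is generated and propagated is an implementation choice. One possible sketch, assuming the ID travels as the correlationId header that the Logstash filter above copies into corr_id: reuse the ID of the incoming message when it is present, and only generate a new one at the start of a workflow.

import com.rabbitmq.client.AMQP;

import java.util.Optional;
import java.util.UUID;

public final class CorrelationIds {

    private CorrelationIds() {}

    // Reuse the correlation id of the incoming message when present,
    // otherwise start a new workflow with a fresh id.
    public static String resolve(AMQP.BasicProperties incoming) {
        return Optional.ofNullable(incoming)
                .map(AMQP.BasicProperties::getHeaders)
                .map(headers -> headers.get("correlationId"))
                .map(Object::toString)
                .orElseGet(() -> UUID.randomUUID().toString());
    }
}

Every service then attaches the resolved ID to the log messages it publishes, so a single filter on corr_id in Kibana shows the whole workflow.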

Kibana dashboards

Filtering on the aforementioned correlation ID allowed us to show the workflow of a call in a single visualization, which makes it easy to see where something is going wrong. On top of that, Kibana lets you create dashboards. These dashboards show everything upfront, so you don’t lose time searching for a specific ID.

Correlation dashboard

We also wanted to do some uptime monitoring. Normally, logs aren’t ideal for this, as they register a single moment in time while monitoring is done in real time. However, we cheated by having every application log a heartbeat message and basing the uptime results on whether we had recently received such a log. For our proof of concept this works decently enough, but it’s not something we would recommend in a production environment.
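
As an illustration of what that could look like on the application side, a small sketch of a heartbeat logger; the 30-second interval, the HEARTBEAT action and the payload content are our own choices, not part of the original setup.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatLogger {

    private static final Logger LOG = LoggerFactory.getLogger(HeartbeatLogger.class);

    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Hypothetical "HEARTBEAT" action with a tiny JSON payload, reusing the
        // "action - payload" message layout so the same Logstash parser handles it.
        scheduler.scheduleAtFixedRate(
                () -> LOG.info("HEARTBEAT - {}", "{\"status\":\"UP\"}"),
                0, 30, TimeUnit.SECONDS);
    }
}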

Conclusion

A basic logging solution is pretty straightforward to set up and invaluable in a distributed system. It was an educational experience to see how logging frameworks operate behind the scenes. I definitely learned a lot about the challenges that come with making a call traceable through multiple applications, without having to dive too deep into the technical side.

Do you want a more technical introduction to Elastic Stack?

Check out my colleague's blog!
