Technology companies often struggle under the weight of production logs. Usually, their infrastructure is built up in the datacenter to transport, store, and process logs. At TrueCar, we faced similar issues in building, scaling, supporting, and maintaining this infrastructure. In this series of articles, I will describe how we pieced together open source software to build a world class, highly performant, and scalable infrastructure for log management.
When I started at TrueCar, I was tasked with “fixing” our logging infrastructure, which was exhibiting some of the same scalability issues noted above. I decided this was the perfect opportunity to test a new technology that had a lot of promise. Because there are a lot of moving and interconnected pieces in our infrastructure, I separated elements and started with an arbitrary “first” piece. I had heard some good anecdotal information about a queuing technology called Kafka before I came to TrueCar, and chose to investigate it.
Problems with Logging as an Infrastructure Service
Most problems facing developers struggling with logging have to do with stability and scale. It is easy to setup a typical syslog central logging service with a tens or dozens of systems. Tools like rsyslog and tcp syslog support are commonly available. However, once developers begin to scale up to hundreds (or thousands) of machines, the number of connections and amount of data that needs to be stored quickly overwhelm your central logging server(s). Using load balancers gets expensive and/or hard to support. I have seen huge logging infrastructure become overwhelmed by one simple application being deployed with debug logging enabled.
Kafka enters the picture as a store-and-forward queue for logging messages. Kafka can scale as a cluster with multiple producers (systems that generate messages) and multiple consumers (systems that read or subscribe to messages). In the common scenario above, we could have multiple hundreds (or thousands!) of logging generators writing to a Kafka queue. Then, we can build out an independent log ingestion infrastructure to consume the messages. With Kafka in the middle, we can safely absorb a large amount of data while we scale out our consumption infrastructure. We will get into all of the producing and consuming segments of our infrastructure in later blog posts. For now, we’re focusing on Kafka itself.
The first thing that I did was follow the excellent quick start guide from Apache’s website. I recommend you go through these steps if you’ve never set up Kafka before. You can play around with a mini cluster to see how it works and what is involved for a basic setup. I also strongly encourage you to get the latest version of the software available. Kafka is a new project and bugs are fixed all the time. Of course, new bugs and problems get introduced as well, but with Kafka, I’ve found you need to use the latest versions at all times.
Now that you’ve played with the quickstart (or if you already know how to use Kafka), we’ll go into building out a production-sized cluster. You’ll notice that I’m skipping over the development cluster. My experience in devops and systems administration has shown me that so-called “development environments” are actually larger and more critical than the production environments! I have found that building out a production-scale cluster gives valuable information on sizing and building the development clusters, and avoids the issue of “starting small” that can sometimes hamper a new project.
For a truly production-scale project, you will want to use at least three physical hosts. Disk IO is critical for Kafka, so you don’t want virtualisation technology in the way. If you do use virtualisation in your infrastructure, and you know what you are doing, then by all means go for it. Having learned the hard way, I don’t recommend it. You will need at least three because you want to keep a replica copy of any log data in case a single host goes down for maintenance or has problems. Ideally, you would use at least five hosts for your Kafka cluster, but three will do just fine to start.
TrueCar uses puppet for systems configuration management, but you can use anything with which you are familiar, including manual installation, if that suits you. However, we store some critical information on the settings and dials that we tune to keep the cluster running. There are several open source puppet modules out there and you can select one that is best for you. We have a fork of a puppet module that works for us. There are some better ones that have surfaced recently. (Check out a very nice looking module from Wikimedia.)
Zookeeper is also a requirement that is out of scope at this point. You should follow other installation guides (or future blog posts) for Zookeeper.
Here are the key settings that we’ve consolidated based on the recommendations from LinkedIn and our own experience:
Kafka Broker IDs
Kafka identifies brokers in the cluster by using ID numbers. You can use any method you like to produce unique IDs. I adapted a math trick based on the hostname (“fqdn” — fully qualified domain name) to produce a unique ID that will be unique across all clusters and will stay the same as long as the host isn’t renamed. The Ruby code is a bit difficult to understand, but basically we’re converting the hostname to a large integer and then computing the modulus with a large prime number to reduce it to a 31-bit number. We don’t use 32-bit numbers because we’ve run into duplicates that way. Ascii characters have 8 bit boundaries and a 32-bit modulus would align with those boundaries. A prime number reduces the odds of such a “collision”.
$broker_id = inline_template(“<%= @fqdn.downcase.gsub(/[^a-z0-9]/,”).to_i(36)%(2**31 -1) %>”)
There is a not-too-subtle circular loop in the logging infrastructure that requires monitoring logs to operate smoothly and detect problems. Seasoned operations folks know that even the monitoring system needs to be monitored by something. We’ve noticed there are several logs that kafka produces and need to be watched. We ingest the logs and place them inside the kafka queues the same way that we monitor all our logs; if a Kafka server isn’t connected to the cluster, at least the cluster mates should be able to receive the logs. For your manual inspection, please pay attention to the following logs. Some of them are quite verbose with a lot of useless noise. There are several bugs that will, hopefully, be addressed in later versions.
TrueCar uses Nagios for monitoring and alerting of our Infrastructure in the datacenter. Included in the puppet module above, there are two critical checks that monitor the state of the kafka cluster and the topics. The first command is “check_kafka_replication_state.sh”, which will execute the built-in Kafka topics command to look for any replication errors or problems. We haven’t seen many alerts from these checks because Kafka is fairly stable and there should be nothing wrong with the replication status unless a node is rebooted or goes offline for some reason.
The second check is a consumer lag check that I extensively rewrote to run asynchronously. This nagios check will go through each topic and consumer group and alert if the topic lag is very large. This alert can be quite sensitive and you will need to tweak the settings on your monitoring system to make it work for your particular setup.
Operational Tasks and Debugging
There is a wonderful UI written by Quantified that is simply amazing for daily use and finding out what is going on with the Kafka topics. We’ve installed this Java program in each of the primary instances in our clusters to quickly identify any problems with queues, topics, consumers, and messages. This is a highly recommended tool. If you have a lot of consumers and partitions, make sure you are aware of the large DB size problem documented in issue #37.
I also highly recommend the Yahoo Kafka Manager. It is a tool we use extensively to configure topic partitions, balance loaded servers, monitor, and repair the health of the cluster. We add all of our clusters into one manager instance to manage all of our clusters from one UI.
There are some minor issues with the Yahoo tool: First, it requires the Play Framework which is a bit confusing if you’ve never used it and it pulls down a lot of dependencies from the internet at run time. Second, it also requires an environment variable or the program won’t even start. It’s mentioned in the documentation, but just to be clear, you must set the following variable each time you start the manager: ZK_HOSTS=”my.zookeeper.host.com:2181″
This is the first in a series of articles on using Kafka at TrueCar for logging. We’ve yet to cover, other uses for Kafka, how all of our logging infrastructure works together, and much more!