Engineering Scalable Log Aggregation with Kafka and the ELK Stack via Docker Compose

The architectural challenge of managing application logs at scale necessitates a robust pipeline capable of handling high-throughput data streams without inducing system instability. In traditional logging setups, a direct pipeline from a log shipper to an indexing engine can lead to catastrophic failures during traffic spikes, as the indexing engine may become overwhelmed by the volume of incoming data. To mitigate this, the integration of Apache Kafka as a distributed streaming platform provides a critical buffering layer, ensuring that logs are queued and processed asynchronously. By leveraging Docker and Docker Compose, engineers can orchestrate a complex ecosystem consisting of Filebeat for log shipping, Kafka for buffering, and the ELK (Elasticsearch, Logstash, Kibana) stack for indexing, processing, and visualization. This approach transforms a volatile stream of log data into a structured, searchable, and scalable observability platform.

The Architectural Role of Kafka as a Message Buffer

In a high-scale environment, the volume of logs generated by multiple application servers can fluctuate wildly. When these logs are sent directly to Elasticsearch via Logstash, a "backpressure" problem occurs: if Elasticsearch cannot index the data as fast as it is being received, the entire pipeline may stall or crash.

Kafka addresses this by acting as a durable, distributed queue. It decouples the log producers (Filebeat) from the log consumers (Logstash). Filebeat pushes logs into a Kafka topic, where they are stored on disk. Logstash then pulls these logs from Kafka at its own sustainable pace. This ensures that even if the ELK stack is under heavy load or undergoing maintenance, no logs are lost, as they remain securely queued within Kafka.
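The buffering effect can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name backlog_after and all of the rates are hypothetical, and it assumes constant producer and consumer rates over each interval.

```shell
# backlog_after SECONDS PRODUCE_RATE CONSUME_RATE [START_BACKLOG]
# Estimates how many events are waiting in the Kafka topic after an interval,
# given events/second written by Filebeat and read by Logstash.
backlog_after() {
  secs="$1"; in_rate="$2"; out_rate="$3"; start="${4:-0}"
  backlog=$(( start + secs * (in_rate - out_rate) ))
  # A queue cannot hold a negative number of events.
  if [ "$backlog" -lt 0 ]; then
    backlog=0
  fi
  echo "$backlog"
}

# A 60-second spike at 5000 events/s while Logstash drains 2000 events/s:
backlog_after 60 5000 2000          # prints 180000
# One minute later, with traffic back at 500 events/s:
backlog_after 60 500 2000 180000    # prints 90000
```

During the spike the backlog accumulates on Kafka's disk instead of overwhelming Elasticsearch, and it drains once traffic falls back below the consumer's rate.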

For those implementing this via Docker, the Bitnami image for Kafka is highly recommended. This image is frequently updated and thoroughly documented, providing a stable foundation for the messaging layer.

Prerequisites and System-Level Configuration

Before deploying the containerized stack, the host environment must be properly prepared to handle the memory-mapped files required by Elasticsearch.

The installation of the Docker engine and Docker Compose is mandatory. For users on Ubuntu 18.04 LTS, specialized bash scripts are available to streamline this installation process. Once the environment is ready, the deployment begins by cloning the authoritative repository:

git clone git@github.com:sermilrod/kafka-elk-docker-compose.git

A critical requirement for Elasticsearch is raising the vm.max_map_count kernel setting. Elasticsearch stores its indices in memory-mapped files; if the limit on memory map areas is too low, the node will fail its startup checks or run out of map areas under load. Run the following command on the host machine as root:

sysctl -w vm.max_map_count=262144

Because sysctl -w changes only the running kernel's value, the setting is lost on reboot and must be made permanent. This is achieved by adding the line vm.max_map_count=262144 to the /etc/sysctl.conf file. Otherwise the Elasticsearch container will fail to initialize after a reboot.
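Appending the line by hand works, but repeated runs can leave duplicate or conflicting entries. The following is a minimal sketch of an idempotent alternative. The function name persist_sysctl and its optional third argument (the target file) are assumptions introduced here for illustration and testing; on a real host the function would be run as root against /etc/sysctl.conf.

```shell
# persist_sysctl KEY VALUE [FILE]
# Ensures FILE contains exactly one "KEY=VALUE" assignment for KEY.
persist_sysctl() {
  key="$1"; value="$2"; file="${3:-/etc/sysctl.conf}"
  # Escape dots so the key is matched literally by grep.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  tmp=$(mktemp)
  if [ -f "$file" ]; then
    # Keep every line except existing assignments of this key.
    grep -Ev "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" || true
  fi
  # Append the desired assignment, then write the result back.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (as root):
#   persist_sysctl vm.max_map_count 262144
#   sysctl -p    # reload /etc/sysctl.conf into the running kernel
```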

Directory Structure and Volume Management

To ensure data persistence and proper log ingestion, specific directories must be created on the host machine before running the compose file.

The docker-compose.yml file maps container data paths to host directories to prevent data loss when containers are destroyed. By default, the configuration expects a host directory named esdata inside the repository. If a custom path is required, edit the YAML file accordingly. To create the default directory:

cd kafka-elk-docker-compose
mkdir esdata

Furthermore, the stack requires a source of logs to demonstrate functionality. The provided architecture uses a default Apache container to generate logs, and Filebeat needs a directory from which to read them. From the same repository directory, create it:

mkdir apache-logs

This setup allows Filebeat to treat the host directory as a source of truth, reading the access.log and error.log files produced by the Apache container.

Log Ingestion and Forwarding with Filebeat

Filebeat serves as the lightweight agent responsible for forwarding logs from the application servers to the Kafka cluster. It is preferred over Logstash for the initial shipping phase because it is significantly more resource-efficient, requiring far less CPU and memory.

The configuration of Filebeat is managed via the filebeat.yml file. The file must be owned by either root or the user running the Beat process, and it must not be writable by group or others; otherwise Filebeat refuses to load it. Since Filebeat 6.3 the prospectors setting has been renamed to inputs (filebeat.inputs); the examples below use the older syntax.

The logic for log collection is defined by paths and tags. For instance, the configuration for Apache logs is structured as follows:

- paths:
    - /apache-logs/access.log
  tags: ["testenv", "apache_access"]
  input_type: log
  document_type: apacheaccess
  fields_under_root: true

Similarly, error logs are tracked via:

- paths:
    - /apache-logs/error.log
  tags: ["testenv", "apache_error"]
  input_type: log
  document_type: apacheerror
  fields_under_root: true
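For newer Filebeat releases, the same two inputs can be expressed with the filebeat.inputs syntax. The sketch below is an assumption-laden illustration rather than the repository's own configuration: the function name write_filebeat_inputs is hypothetical, the document type is carried as a custom field because the legacy option is not available in newer versions, and the output section still has to be added to the generated file. It also removes group and other write permissions, which Filebeat checks before loading a configuration file.

```shell
# write_filebeat_inputs TARGET
# Writes the inputs section of a filebeat.yml (newer syntax) to TARGET.
write_filebeat_inputs() {
  target="$1"
  cat > "$target" <<'EOF'
filebeat.inputs:
  - type: log
    paths:
      - /apache-logs/access.log
    tags: ["testenv", "apache_access"]
    fields:
      document_type: apacheaccess
    fields_under_root: true
  - type: log
    paths:
      - /apache-logs/error.log
    tags: ["testenv", "apache_error"]
    fields:
      document_type: apacheerror
    fields_under_root: true
EOF
  # Filebeat rejects config files that group or others can write to.
  chmod go-w "$target"
}

# Example:
#   write_filebeat_inputs ./filebeat.yml
```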

Configuring the Kafka Output Pipeline

The connection between Filebeat and Kafka depends on the network topology. The configuration varies based on whether the application server is located within the same Virtual Private Cloud (VPC) as the Kafka broker or is connecting via a public endpoint.

When the server is within the same VPC, the private IP and port 9092 are used:

output.kafka:
  version: "0.10.2.1"
  hosts: ["KAFKA_PRIVATE_IP:9092"]
  topic: 'applogs'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000

For public connectivity, the public IP and a different port (e.g., 19092) are typically employed to handle external traffic. The required_acks setting is crucial; setting this to 1 ensures the producer receives an acknowledgment from the leader, providing a balance between performance and reliability. It is strongly recommended not to set this value to 0, as that would disable acknowledgments and risk data loss.

Kafka Topic Management and Verification

Once the stack is deployed using docker-compose up -d, the Kafka cluster is operational, but a topic must exist to receive the logs. Unless the broker is configured to create topics automatically (auto.create.topics.enable), Filebeat cannot publish to a missing topic and will report errors in its logs.

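To create the applogs topic, execute the following command within the Kafka container. The --zookeeper flag applies to older Kafka releases; it was removed in Kafka 3.0 in favor of --bootstrap-server: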

docker exec -it kafka /opt/bitnami/kafka/bin/kafka-topics.sh --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 1 --topic applogs

To verify that the topic has been successfully created and listed, use:

docker exec -it kafka /opt/bitnami/kafka/bin/kafka-topics.sh --list --zookeeper zookeeper:2181
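On Kafka 2.2 or newer, kafka-topics.sh can talk to the broker directly through --bootstrap-server, and --if-not-exists makes the creation step safe to repeat. The following is a sketch under several assumptions: the function name create_topic and the DOCKER override are hypothetical conveniences, the container is named kafka as in the commands above, and the broker is assumed to listen on localhost:9092 inside the container. Older brokers need the --zookeeper form shown above.

```shell
# create_topic TOPIC [PARTITIONS] [REPLICATION_FACTOR]
# Creates a topic only if it does not already exist.
# Set DOCKER=echo to print the command instead of running it.
create_topic() {
  topic="$1"; partitions="${2:-1}"; replication="${3:-1}"
  # No -it here: scripts usually run without a TTY attached.
  ${DOCKER:-docker} exec kafka /opt/bitnami/kafka/bin/kafka-topics.sh \
    --create --if-not-exists \
    --bootstrap-server localhost:9092 \
    --partitions "$partitions" \
    --replication-factor "$replication" \
    --topic "$topic"
}

# Example:
#   create_topic applogs
```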

To validate the communication pipeline, engineers can use a console consumer to listen for incoming messages in real-time:

docker exec -it kafka /opt/bitnami/kafka/bin/kafka-console-consumer.sh --bootstrap-server KAFKA_PRIVATE_IP:9092 --topic applogs --from-beginning

Conversely, to test the pipeline by manually pushing a message to the topic, the console producer is used:

docker exec -it kafka /opt/bitnami/kafka/bin/kafka-console-producer.sh --broker-list KAFKA_PRIVATE_IP:9092 --topic applogs

For those operating on an external application server, the tool kafkacat (renamed kcat in more recent releases) is recommended to verify the connection to the Kafka broker.
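A connectivity check from the application server might look like the sketch below. The function name broker_reachable is hypothetical. It asks kcat for broker metadata when the tool is installed and otherwise falls back to a plain TCP connection attempt, which assumes that bash and the coreutils timeout command are available.

```shell
# broker_reachable HOST PORT
# Returns 0 if the Kafka broker endpoint can be reached, non-zero otherwise.
broker_reachable() {
  host="$1"; port="$2"
  if command -v kcat >/dev/null 2>&1; then
    # -L lists cluster metadata; this fails if the broker cannot be contacted.
    kcat -b "$host:$port" -L >/dev/null 2>&1
  else
    # Fallback: only proves the TCP port is open, not that Kafka answers.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  fi
}

# Example:
#   broker_reachable KAFKA_PUBLIC_IP 19092 && echo "broker reachable"
```

The TCP fallback is weaker than the metadata query, since an open port does not guarantee that the advertised listener address is usable by external clients.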

Log Processing with Logstash and Elasticsearch

After logs reach Kafka, they must be consumed and forwarded to Elasticsearch. Logstash acts as the consumer that reads from the Kafka topic and applies the necessary filters before indexing the data.

The output block in Logstash defines the destination of the processed logs. The configuration ensures that logs are stored in a time-stamped index, allowing for efficient data rotation and searching.

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[indexPrefix]}-logs-%{+YYYY.MM.dd}"
  }
}

The hostname elasticsearch refers to the service name defined in the docker-compose.yml file, and port 9200 is the default REST API port for Elasticsearch.
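To see which index a given event lands in, the index template can be expanded by hand. The sketch below assumes the indexPrefix field resolves to logstash, which is an illustrative value rather than something the configuration above guarantees, and the function name index_name is hypothetical. Logstash formats the date from the event's @timestamp in UTC, so the helper defaults to the current UTC date.

```shell
# index_name PREFIX [DATE]
# Mirrors the Logstash pattern "%{[indexPrefix]}-logs-%{+YYYY.MM.dd}".
# DATE defaults to today's UTC date in YYYY.MM.dd form.
index_name() {
  prefix="$1"
  day="${2:-$(date -u +%Y.%m.%d)}"
  printf '%s-logs-%s\n' "$prefix" "$day"
}

index_name logstash 2024.01.15    # prints logstash-logs-2024.01.15

# Example (requires the running stack): list today's index, if it exists.
#   curl -s "http://localhost:9200/_cat/indices/$(index_name logstash)?v"
```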

Visualization and Data Analysis in Kibana

Kibana provides the graphical interface to visualize the logs indexed in Elasticsearch. Once the full stack is deployed, it typically takes about one minute for all inter-dependent services to become fully functional.

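The Kibana dashboard is accessible via http://localhost:5601. However, data will not be visible immediately: an index pattern must first be created to tell Kibana which indices to query. The pattern below matches the Logstash output index only when the indexPrefix field resolves to logstash; adjust it if a different prefix is used. The following settings are required: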

Index name or pattern: logstash-*
Time-field name: @timestamp

To test the entire pipeline and generate data for visualization, a simple request can be sent to the Apache container via the exposed port 8888:

curl http://localhost:8888/

This request triggers the generation of an access log, which is then picked up by Filebeat, queued in Kafka, processed by Logstash, and finally indexed in Elasticsearch for viewing in Kibana.
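Whether the request actually reached Elasticsearch can also be confirmed from the command line with the _count API. The helper below is a small sketch; the name doc_count is hypothetical, and it extracts the number with sed only to avoid assuming that a JSON tool such as jq is installed.

```shell
# doc_count
# Reads an Elasticsearch _count response on stdin and prints the count.
doc_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example (requires the running stack; the index pattern is an assumption):
#   curl -s http://localhost:8888/ > /dev/null
#   sleep 10
#   curl -s "http://localhost:9200/logstash-*/_count" | doc_count
```

A count that increases after each request indicates that every stage of the pipeline is passing events along.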

Technical Specifications and Configuration Summary

The following table summarizes the key configurations and ports utilized in this architecture.

Component	Default Port	Primary Responsibility	Key Configuration
Apache	8888	Log Generation	`/apache-logs/`
Filebeat	N/A	Log Shipping	`output.kafka`
Kafka	9092 / 19092	Buffering/Queueing	`topic: applogs`
Elasticsearch	9200	Indexing/Storage	`vm.max_map_count`
Logstash	N/A	Processing/Filtering	`elasticsearch` output
Kibana	5601	Visualization	`logstash-*` index pattern

Detailed Deployment Workflow

The order of operations matters. The sysctl setting must be in place before the containers start, and the Kafka topic must exist before Filebeat begins shipping; otherwise the pipeline will not work.

Host Preparation:

Install Docker and Docker Compose.
Execute sysctl -w vm.max_map_count=262144.
Permanently add the setting to /etc/sysctl.conf.

Repository Setup:

Clone the repository: git clone git@github.com:sermilrod/kafka-elk-docker-compose.git.
Create required directories: mkdir esdata and mkdir apache-logs.

Orchestration:

Launch the services: docker-compose up -d.
Verify that all containers are running.

Kafka Initialization:

Create the topic: docker exec -it kafka /opt/bitnami/kafka/bin/kafka-topics.sh --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 1 --topic applogs.
List topics to confirm creation.

Data Validation:

Generate logs: curl http://localhost:8888/.
Consume logs via kafka-console-consumer.sh.

Visualization:

Access Kibana at http://localhost:5601.
Create index pattern logstash-* with @timestamp.

Conclusion

The deployment of a Kafka-buffered ELK stack via Docker Compose provides a professional-grade solution for log aggregation, offering resilience against data spikes and a clear separation of concerns between ingestion, buffering, and analysis. By utilizing Filebeat for lightweight shipping and Kafka for durable queueing, the architecture avoids the common pitfalls of direct-to-Elasticsearch pipelines. While this specific Docker Compose setup is highly effective for testing and development, it is important to note that it is not intended as a production-ready solution without further tuning of resource limits, security configurations (such as SSL/TLS), and the implementation of a multi-node Kafka and Elasticsearch cluster for high availability. The strength of this system lies in its modularity, allowing engineers to replace or scale individual components based on the specific demands of their infrastructure.