Kafka Replication, Failover

Kafka Replication, Failover

This document analyzes Kafka Replication and Failover.

1. Kafka Replication

[Figure 1] Kafka before Replication

[Figure 1] Kafka before Replication

[Figure 2] Kafka after Replication

[Figure 2] Kafka after Replication

Kafka configures multiple Brokers and distributes Partitions to each Broker as much as possible to increase Message throughput. However, even when using multiple Brokers and Partitions, if Replication is not applied, Message loss cannot be prevented when some Brokers die. [Figure 1] shows the Kafka Cluster before applying Replication, and [Figure 2] shows the Kafka Cluster after applying Kafka Replication. Topic A and Topic B are set to Replica 2, and Topic C is set to Replica 3. When Replication is not applied, if Broker C dies, all Messages of Topic B and Messages of Partition 2 of Topic C are lost, but when Replication is applied, Message loss can be prevented using replicated Partitions.

In Kafka, the Partition used by Producers and Consumers is called Leader Partition, and the remaining replicas are called Follower Partition. Even when Replication is applied, Producers and Consumers only use Leader Partitions and do not directly use Follower Partitions. Follower Partitions are only used for Failover when Leader Partitions cannot be used due to failures.

The Replication synchronization method can use both Sync and Async methods depending on the Producer’s ACK settings. When the Producer’s ACK setting is 0, the Producer sends Messages and does not wait for ACK from the Broker, and when it is 1, the Producer receives ACK from the Broker only when Message transmission to the Leader Partition is completed. Therefore, from a Replication perspective, 0 and 1 correspond to Async methods. On the other hand, when the Producer’s ACK setting is all, the Producer receives ACK from the Broker only when Message transmission to the Leader Partition is completed. Therefore, from a Replication perspective, all corresponds to Sync methods.

2. Kafka Failover

[Figure 3] Kafka after Failover

[Figure 3] Kafka after Failover

Failover refers to the process where a Follower Partition becomes a Leader Partition when the Leader Partition cannot be used due to failure. [Figure 3] shows the Failover operation when Broker C dies in the state of [Figure 2]. It can be seen that the Leader Partition of Topic B moves to Broker B, and the Leader Partition of Topic C moves to Broker A.

Kafka promotes an arbitrary Follower Partition to Leader Partition among Follower Partitions that are completely synchronized with the Leader Partition (ISR, In-Sync Replicas). Producers and Consumers receive information about the changed Leader Partition from the running Kafka Broker and operate accordingly. When Producers use ACK settings of 0 or 1, Replication is not guaranteed, so Messages sent by Producers may not be replicated to the new Leader Partition. On the other hand, when using ACK settings of all, Replication is guaranteed, so Messages sent by Producers exist in the new Leader Partition and Message loss does not occur.

Messages may have completed Replication but Producers may not receive ACK due to Broker failure. This means that Producers may send Messages stored in Partitions redundantly, which means Message duplication may occur. To prevent such Message duplication, Kafka’s idempotency setting (enable.idempotence) can be used to prevent Message duplication. Consumers may also not receive ACK due to Broker failure even though they sent the Offset of processed Messages to the Broker. Therefore, Consumers may also receive the same Message redundantly, and Consumers must implement idempotency Logic to ensure that there are no problems even when receiving the same Message.

3. References