Kafka Transaction
Analyzes Kafka’s Transaction technique.
1. Kafka Transaction
Kafka Transaction, as the name suggests, refers to a technique that bundles multiple Records sent by a Producer into a single Transaction. The Records bundled into one Transaction may be delivered to multiple Topics and Partitions. On the other hand, Kafka Transaction does not provide a technique for a Consumer to bundle multiple Records into a single Transaction. That is, Kafka Transaction is a Producer-centric Transaction technique.
Kafka Transaction is internally implemented with a two-phase commit technique. To use Kafka Transaction, the Idempotence feature (enable.idempotence) must be enabled and the number of In-flight Requests (max.in.flight.requests.per.connection) must be limited to 5 or less, so that identical Events/Data are not stored twice. Kafka Transaction is largely divided into two patterns: Produce-only Transaction and Consume-Produce Transaction.
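A minimal sketch of these prerequisite Producer settings, assuming the confluent-kafka Python client (the configuration keys follow the standard Kafka Producer settings; the Broker address is illustrative):

```python
# Producer prerequisites for Kafka Transaction: Idempotence must be enabled
# and the number of In-flight Requests must be limited to 5 or less.
producer_config = {
    "bootstrap.servers": "localhost:9092",         # illustrative Broker address
    "enable.idempotence": True,                    # prevent duplicate Records
    "max.in.flight.requests.per.connection": 5,    # must be 5 or less
}
```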
1.1. Produce-only Transaction
![[Figure 1] Produce-only Transaction](/blog-software/docs/theory-analysis/kafka-transaction/images/produce-only-transaction.png)
[Figure 1] Produce-only Transaction
Produce-only Transaction, as the name suggests, refers to a technique that bundles multiple Records delivered to multiple Topics and Partitions into a single Transaction. [Figure 1] shows the operation of a Produce-only Transaction. The Producer delivers Records bundled into one Transaction to Partitions 0 and 1 of Topic A and Partitions 0, 1, and 2 of Topic B. Records bundled into one Transaction have Atomicity: either all of them are stored in Kafka or none of them are.
Produce-only Transaction is mainly used when Kafka serves as an Event/Data Store rather than an Event Bus. That is, it prevents the case where only some of the Records sent by the Producer are stored, ensuring the consistency of the Events/Data stored in Kafka.
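Below is a minimal sketch of [Code 1], assuming the confluent-kafka Python client; the Broker address, Topic names, and Record values are illustrative.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Each Record is sent independently. If the App dies after the first
# produce(), only some of the Records end up stored in Kafka.
producer.produce("topic-a", value=b"event-1")
producer.produce("topic-b", value=b"event-2")
producer.flush()
```

[Code 1] Producer App Code before Transaction is applied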
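Below is a minimal sketch of [Code 2] under the same assumptions, using the snake_case Transaction functions referenced in this section; the transactional.id value is a placeholder.

```python
from confluent_kafka import KafkaException, Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # illustrative Broker address
    "transactional.id": "producer-1",        # must be unique per Producer
})

producer.init_transactions()       # initialize the Transaction
producer.begin_transaction()       # start the Transaction
try:
    producer.produce("topic-a", value=b"event-1")
    producer.produce("topic-b", value=b"event-2")
    producer.commit_transaction()  # end (commit) the Transaction
except KafkaException:
    producer.abort_transaction()   # abort the Transaction on error
```

[Code 2] Producer App Code after Transaction is applied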
[Code 1] shows the Producer App Code before Transaction is applied, and [Code 2] shows the Producer App Code after Transaction is applied. As with a DB Transaction, using Kafka Transaction requires initializing the Transaction, starting the Transaction, and ending the Transaction.
In [Code 2], the Transaction is initialized through producer.init_transactions(), started through producer.begin_transaction(), and ended through producer.commit_transaction(). If an error occurs during the Transaction, it is aborted through producer.abort_transaction().
To use a Transaction, the transactional.id setting must be configured, and it must be a unique ID for each Producer. Generally the unique ID is derived from the Hostname, as in the sketch below.
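One possible scheme (the "order-producer" prefix is an illustrative, App-specific name):

```python
import socket

# Derive a per-host unique transactional.id from the Hostname;
# "order-producer" is an illustrative App-specific prefix
transactional_id = f"order-producer-{socket.gethostname()}"
```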
1.2. Consume-Produce Transaction
![[Figure 2] Consume-Produce Transaction](/blog-software/docs/theory-analysis/kafka-transaction/images/consume-produce-transaction.png)
[Figure 2] Consume-Produce Transaction
Consume-Produce Transaction, as the name suggests, refers to a technique that bundles into a single Transaction the process in which an Application fetches multiple Records from Kafka (Consume), processes the Events/Data, and delivers the processed Events/Data back to Kafka (Produce). [Figure 2] shows the operation of a Consume-Produce Transaction. The Application fetches Records from Partitions 0 and 1 of Topic A, processes the Events/Data, and delivers the processed Events/Data to Partitions 0, 1, and 2 of Topic B, all as one Transaction.
Consume-Produce Transaction is mainly used to implement Exactly-Once semantics, in which each Event/Data stored in Kafka is processed exactly once. Kafka Streams Applications that transform Events/Data between Kafka and Kafka also use Kafka Transaction to implement Exactly-Once.
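Below is a minimal sketch of [Code 3], assuming the confluent-kafka Python client; process() is an illustrative Data-processing function, and the Broker address, Topic names, and group.id are placeholders.

```python
from confluent_kafka import Consumer, Producer

def process(value: bytes) -> bytes:
    # Illustrative Data processing; the real transformation is App-specific
    return value.upper()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-app",                # illustrative Consumer Group
    "enable.auto.commit": False,         # Offsets are committed manually in step 3
})
consumer.subscribe(["topic-a"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)             # 1. Consume
    if msg is None or msg.error():
        continue
    result = process(msg.value())        # 2. Process & produce
    producer.produce("topic-b", value=result)
    producer.flush()
    # If the App dies here, after step 2 but before step 3, the restarted
    # App fetches and processes the same Data again, storing duplicates
    consumer.commit(message=msg, asynchronous=False)  # 3. Commit offset
```

[Code 3] Consumer, Data processing, Producer App Code before Transaction is applied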
[Code 3] shows App Code that plays the Consumer, Data processing, and Producer roles before Transaction is applied. Without a Transaction, if the App dies after completing the 2. Process & produce step but before executing the 3. Commit offset step and then restarts, the restarted App fetches the same Data from Kafka, performs the same processing, and stores duplicates in Kafka. This is because the Offset Commit had not completed when the Data was first fetched, and Kafka keeps delivering the same Data to the Consumer until the Commit completes.
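Below is a minimal sketch of [Code 4] under the same assumptions. The Consumer's auto commit is disabled and Offsets are instead committed through the Producer's Transaction; isolation.level is set to read_committed so that only committed Records are fetched.

```python
from confluent_kafka import Consumer, KafkaException, Producer, TopicPartition

def process(value: bytes) -> bytes:
    # Illustrative Data processing; the real transformation is App-specific
    return value.upper()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-app",
    "enable.auto.commit": False,           # Offsets are committed via the Transaction
    "isolation.level": "read_committed",   # fetch only committed Records
})
consumer.subscribe(["topic-a"])
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "my-app-producer-1",  # must be unique per Producer
})

producer.init_transactions()
while True:
    msg = consumer.poll(1.0)               # 1. Consume
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    try:
        result = process(msg.value())      # 2. Process & produce
        producer.produce("topic-b", value=result)
        # 3. Commit the Offset of the next Record to process (fetched
        #    Record's Offset + 1) inside the same Transaction
        producer.send_offsets_to_transaction(
            [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()
```

[Code 4] Consumer, Data processing, Producer App Code after Transaction is applied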
[Code 4] shows App Code that plays the Consumer, Data processing, and Producer roles after Transaction is applied. The key point is that, through the producer.send_offsets_to_transaction() function, the processed Data is bundled into one Transaction together with an Offset one greater than the Offset of the fetched Record. That is, Exactly-Once is implemented by ensuring Atomicity between the processed Data and the Offset of the Data to be processed next.
2. References
- Kafka Transaction : https://developer.confluent.io/courses/architecture/transactions/
- Kafka Transaction : https://www.confluent.io/blog/transactions-apache-kafka/
- Kafka Idempotence, Transaction : https://stackoverflow.com/questions/58894281/difference-between-idempotence-and-exactly-once-in-kafka-stream