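Looking at how Kafka stores data internally is a good way into its basic concepts, because once you know how the data is stored you also understand the architecture and its limits.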

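Kafka is essentially a distributed, replicated message queue, and it comes up often in microservices and distributed computing. To get the most performance out of it, you need some understanding of how the data is laid out on disk.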

The basic concepts of Kafka are shown in the figure below:

IMAGE

The figure below shows how those concepts map onto the data stored internally:

IMAGE

Note: the index/timeindex files in the figure above are schematic only. They do not hold one index entry per message; see the hands-on check below.
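To make the "not one entry per message" point concrete, here is a minimal sketch of how a lookup against a sparse index can work: binary-search for the largest indexed offset that is not greater than the target, then scan the log forward from that byte position. The function names and the sample entries are hypothetical and only illustrate the idea; the broker's own implementation differs in detail. As far as I know, the spacing between entries is governed by the broker setting `log.index.interval.bytes` (4096 bytes by default).

```python
from bisect import bisect_right


def find_start_position(index_entries, target_offset):
    """Return the byte position from which to start scanning the log.

    index_entries is a sorted list of (offset, position) pairs. It is
    sparse: only some offsets have an entry.
    """
    offsets = [offset for offset, _ in index_entries]
    # Index of the largest indexed offset that is <= target_offset.
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0  # nothing indexed at or before the target: scan from the start
    return index_entries[i][1]


def read_offset(log_records, index_entries, target_offset):
    """Find a record by offset; log_records is (offset, position, payload) in file order."""
    start = find_start_position(index_entries, target_offset)
    for offset, position, payload in log_records:
        if position < start:
            continue  # skipped thanks to the index
        if offset == target_offset:
            return payload
    return None


# Hypothetical sparse index: three entries covering many more messages.
sample_index = [(0, 0), (40, 4100), (85, 8300)]
print(find_start_position(sample_index, 50))
```

The trade-off is a smaller index file in exchange for a short linear scan after each lookup.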

Peeking into a partition's directory

$ ll /data/kafka/kafka-logs/test.eugene.test-7
total 8.0K
-rw-r--r-- 1 root root 10M Aug 22 18:31 00000000000000000000.index
-rw-r--r-- 1 root root  88 Aug 22 18:35 00000000000000000000.log
-rw-r--r-- 1 root root 10M Aug 22 18:31 00000000000000000000.timeindex
-rw-r--r-- 1 root root   8 Aug 22 18:31 leader-epoch-checkpoint
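The names in this listing follow a pattern: the directory is `<topic>-<partition>`, and each segment file is named after the segment's base offset, zero-padded to 20 digits. The sketch below spells that out in Python. The helper names are hypothetical and exist only for illustration.

```python
def split_partition_dir(dir_name):
    """Split a partition directory name into (topic, partition number)."""
    # The partition number follows the last dash; the topic may contain dots or dashes.
    topic, partition = dir_name.rsplit("-", 1)
    return topic, int(partition)


def segment_file_names(base_offset):
    """File names of the segment that starts at base_offset."""
    stem = f"{base_offset:020d}"  # zero-padded to 20 digits
    return [f"{stem}.index", f"{stem}.log", f"{stem}.timeindex"]


def base_offset_of(file_name):
    """Recover the base offset from a segment file name."""
    return int(file_name.split(".", 1)[0])


print(split_partition_dir("test.eugene.test-7"))
print(segment_file_names(0))
```

Because the base offset is in the file name, the segment that holds a given offset can be found from a directory listing alone, without opening any file.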

The DumpLogSegments tool lets you look inside these files

OffsetIndex - Index Of Offsets Of Log Segment

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.index
Dumping /data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.index
offset: 0 position: 0
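For readers who want to see the bytes behind this output, the following is a minimal sketch of decoding an offset index file by hand. It assumes the layout I understand Kafka to use: each entry is 8 bytes, a 4-byte offset relative to the segment's base offset followed by a 4-byte position in the .log file, both big-endian. It also assumes the file is preallocated and zero-filled, which would be consistent with the 10M size in the listing, so it stops at the first all-zero relative offset after the first entry. `parse_offset_index` is a hypothetical helper, not part of Kafka.

```python
import struct

# One entry: relative offset and byte position, both big-endian 32-bit ints.
ENTRY = struct.Struct(">ii")


def parse_offset_index(raw, base_offset):
    """Decode raw .index bytes into a list of (absolute offset, position)."""
    entries = []
    for i in range(len(raw) // ENTRY.size):
        relative_offset, position = ENTRY.unpack_from(raw, i * ENTRY.size)
        if i > 0 and relative_offset == 0:
            break  # assumed to be the zero-filled, preallocated tail
        entries.append((base_offset + relative_offset, position))
    return entries


# Synthetic bytes for illustration: two entries followed by zero padding.
sample = ENTRY.pack(0, 0) + ENTRY.pack(53, 4158) + bytes(16)
print(parse_offset_index(sample, base_offset=100))
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass the base offset taken from the file name.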

TimeIndex - Index Of Timestamp And Offsets Of Log Segment

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.timeindex
Found timestamp mismatch in :/data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.timeindex
  Index timestamp: 0, log timestamp: 1629628512555

Log File

$ bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.log
Dumping /data/kafka/kafka-logs/test.eugene.test-7/00000000000000000000.log
Starting offset: 0
baseOffset: 0 lastOffset: 0 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 0 CreateTime: 1629628512555 size: 88 magic: 2 compresscodec: NONE crc: 1254090055 isvalid: true
| offset: 0 CreateTime: 1629628512555 keysize: 0 valuesize: 18 sequence: -1 headerKeys: [] key: 12 payload: {
    "data": 12
}
baseOffset: 1 lastOffset: 1 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 88 CreateTime: 1629628976272 size: 90 magic: 2 compresscodec: NONE crc: 2535940961 isvalid: true
| offset: 1 CreateTime: 1629628976272 keysize: 2 valuesize: 18 sequence: -1 headerKeys: [] key: 13 payload: {
    "data": 13
}
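The batch lines in this dump can be read as simple `name: value` pairs, which makes it easy to check how batches sit in the file. The sketch below parses two lines, abbreviated from the dump above to the fields it needs, and computes where each batch ends (`position + size`) along with its CreateTime as a UTC timestamp. The helper names are hypothetical, and the regular expression is only meant for the batch header lines, not for the record lines that start with `|`.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_batch_line(line):
    """Turn 'name: value name: value ...' into a dict of strings."""
    return dict(re.findall(r"(\w+): (\S+)", line))


def create_time_utc(millis):
    """Convert a CreateTime in epoch milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(millis))


# Abbreviated from the dump above; the other fields are omitted for brevity.
batch_lines = [
    "baseOffset: 0 lastOffset: 0 count: 1 position: 0 CreateTime: 1629628512555 size: 88",
    "baseOffset: 1 lastOffset: 1 count: 1 position: 88 CreateTime: 1629628976272 size: 90",
]

batches = [parse_batch_line(line) for line in batch_lines]
for batch in batches:
    end = int(batch["position"]) + int(batch["size"])
    created = create_time_utc(batch["CreateTime"]).isoformat()
    print(batch["baseOffset"], batch["position"], end, created)
```

The first batch starts at position 0 with size 88, and the second batch starts at position 88, so batches are appended back to back in the .log file.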

Related links

A Practical Introduction to the Internals of Kafka Storage | Medium