Kafka Data Reliability and Consistency

2019-07-20 13:00:49

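In the big data world, Kafka is well known as a publish-subscribe messaging system. It scales very well and can deliver both high throughput and high availability at large data volumes. So how does Kafka guarantee data reliability and consistency?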

Reliability

For reliability, Kafka relies mainly on four things: leader election, distributed deployment of brokers, partition replication, and the producer's acks setting.

Leader election mechanism

A Kafka partition leader is chosen mainly from the ISR (in-sync replicas) list that Kafka maintains. So what is the ISR list? Let's look at the figure below.

Here we can see that every Kafka partition maintains an ISR list. The official description of the ISR is: "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.

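In other words, the ISR is the set of replicas that satisfy the configured sync conditions, and it is a subset of all replicas. Why "satisfy the configured conditions"? Because a follower generally has to meet two conditions to stay in the ISR list: the number of messages it lags behind the leader must not exceed replica.lag.max.messages, and it must send fetch requests to the leader within the configured time (replica.lag.time.max.ms). A follower that meets both is not removed from the ISR list. (replica.lag.max.messages was removed in Kafka 0.9.0; later versions use only the time-based check.) With unclean.leader.election.enable=false, if the leader goes down, Kafka picks the first follower in the ISR list as the new leader, which guarantees the reliability of data that has already been committed.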
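The two ISR conditions and the election rule above can be sketched as a small simulation. This is only an illustrative model, not Kafka's controller code: the Follower class, in_sync, elect_leader, and the threshold values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Follower:
    broker_id: int
    messages_behind: int       # how many messages this follower lags the leader
    ms_since_last_fetch: int   # time since its last fetch request to the leader


def in_sync(f, max_messages, max_ms):
    """A follower stays in the ISR only if both conditions from the text hold."""
    return f.messages_behind <= max_messages and f.ms_since_last_fetch <= max_ms


def elect_leader(followers, max_messages, max_ms, unclean=False):
    """Pick the first in-sync follower; use an out-of-sync one only if unclean is allowed."""
    isr = [f for f in followers if in_sync(f, max_messages, max_ms)]
    if isr:
        return isr[0].broker_id
    if unclean and followers:
        return followers[0].broker_id  # committed data may be lost in this case
    return None                        # no eligible leader, partition stays offline


followers = [
    Follower(broker_id=2, messages_behind=9000, ms_since_last_fetch=200),
    Follower(broker_id=3, messages_behind=10, ms_since_last_fetch=150),
    Follower(broker_id=4, messages_behind=0, ms_since_last_fetch=30000),
]
new_leader = elect_leader(followers, max_messages=4000, max_ms=10000)
print(new_leader)
```

In this sketch broker 2 lags by too many messages and broker 4 has not fetched recently enough, so only broker 3 is eligible. With unclean=False and an empty ISR the function returns None, which is the price paid for never promoting a replica that may be missing committed data.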

Distributed deployment of brokers

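Deploying brokers as a distributed cluster (typically three or more instances) ensures that data is not lost in transit just because a single broker goes down.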

Partition replication

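Kafka introduced partition replicas (KAFKA-50) in version 0.8.0. When creating a Kafka topic, we can set the number of replicas per partition with replication-factor. We can also set the default number of partition replicas in the configuration file with the default.replication.factor parameter.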
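To see why spreading replicas across brokers protects against a single broker failure, here is a minimal sketch. It assumes a plain round-robin placement, which is a simplification and not Kafka's actual assignment algorithm; place_replicas and surviving_replicas are hypothetical helper names.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Round-robin placement: replicas of one partition land on distinct brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def surviving_replicas(assignment, failed_broker):
    """Replicas still available for each partition after one broker goes down."""
    return {
        p: [b for b in replicas if b != failed_broker]
        for p, replicas in assignment.items()
    }


assignment = place_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
after_failure = surviving_replicas(assignment, failed_broker=2)
print(assignment)
print(after_failure)
```

With three brokers and a replication factor of 2, every partition still has at least one replica after broker 2 fails. With a replication factor of 1 the same failure would leave some partitions with no replica at all.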

Producer acks

The producer uses its configured acks value to decide whether a message has been sent successfully.

Let's look at the official description:

The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:

acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.

With acks=0, the producer does not wait for any ack. It treats the send as successful immediately and moves on to the next one. Kafka's throughput is very high in this mode, but the probability of losing data rises with it, and loss is quite likely.

acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.

With acks=1, the producer waits for the leader to confirm receipt, then treats the send as successful and moves on to the next one. If the leader crashes after receiving the message but before it has been replicated to the followers, data can be lost.

acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.

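acks=all (or acks=-1) is the safest setting, and naturally also the slowest. The producer waits for the leader's acknowledgment, and the leader waits until all replicas in the ISR have finished syncing. Readers who want to dig deeper should look into min.insync.replicas.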
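The three acks modes, and how min.insync.replicas interacts with acks=all, can be summarized in a toy decision function. This is a simplified model rather than the broker's real logic: is_acknowledged and its parameters are hypothetical, and isr_acked counts the leader itself among the in-sync replicas that hold the record.

```python
def is_acknowledged(acks, leader_wrote, isr_acked, isr_size, min_insync_replicas=1):
    """Return True if the producer would treat the send as successful."""
    if acks == 0:
        return True          # fire and forget: no broker response is awaited
    if acks == 1:
        return leader_wrote  # the leader's local write is enough
    if acks in ("all", -1):
        # The ISR must be large enough, and every in-sync replica must hold the record.
        return (
            leader_wrote
            and isr_size >= min_insync_replicas
            and isr_acked == isr_size
        )
    raise ValueError(f"unsupported acks value: {acks!r}")


# acks=0 reports success even though nothing reached the broker.
lost_silently = is_acknowledged(0, leader_wrote=False, isr_acked=0, isr_size=3)
# acks=1 succeeds as soon as the leader has the record, before followers copy it.
leader_only = is_acknowledged(1, leader_wrote=True, isr_acked=1, isr_size=3)
# acks=all waits until all three in-sync replicas have the record.
partial = is_acknowledged("all", leader_wrote=True, isr_acked=1, isr_size=3)
full = is_acknowledged("all", leader_wrote=True, isr_acked=3, isr_size=3)
# ISR has shrunk to the leader alone; min.insync.replicas=2 refuses the write.
shrunk = is_acknowledged("all", leader_wrote=True, isr_acked=1, isr_size=1,
                         min_insync_replicas=2)
print(lost_silently, leader_only, partial, full, shrunk)
```

The last case shows why min.insync.replicas matters: without it, acks=all with an ISR that has shrunk to just the leader behaves much like acks=1.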

Consistency

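To keep different consumers consistent with one another as they read data, Kafka introduced the High Water Mark mechanism: consumers are only ever served messages below the High Water Mark. (Setting the isolation level isolation.level to read_committed goes one step further and also hides messages from transactions that have not yet committed.) The High Water Mark works like the wooden-barrel principle, where the shortest stave sets the water level: it is the offset up to which messages have been replicated to every follower in the ISR, in other words the smallest offset among the replicas in the ISR list.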
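As a rough illustration of the barrel analogy, the High Water Mark can be computed as the minimum log end offset over the ISR. The function names and the offsets below are made up for the example, and the sketch ignores transactions entirely.

```python
def high_water_mark(log_end_offsets, isr):
    """Smallest log end offset among the in-sync replicas."""
    return min(log_end_offsets[broker] for broker in isr)


def readable_offsets(log_end_offsets, isr):
    """Offsets a consumer may read: everything strictly below the High Water Mark."""
    return range(0, high_water_mark(log_end_offsets, isr))


# Log end offset (next offset to be written) per broker; broker 1 is the leader.
log_end_offsets = {1: 10, 2: 7, 3: 4}

hw = high_water_mark(log_end_offsets, isr=[1, 2])
readable = readable_offsets(log_end_offsets, isr=[1, 2])
print(hw, list(readable))
```

Here broker 3 has fallen out of the ISR, so it does not hold the High Water Mark back. If it were still in the ISR, the mark would drop to its offset and consumers would see fewer messages until it caught up.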
