Raft Extended Paper (Part 1)
In Search of an Understandable Consensus Algorithm
(Extended Version)
Diego Ongaro and John Ousterhout
Stanford University
Abstract
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.
1 Introduction
Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.
Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.
After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.
The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.
Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:
• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.
The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).
2 Replicated state machines
Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].
Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
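To make this architecture concrete, here is a minimal Go sketch (not from the paper; the Entry and KVStore names are illustrative) of a deterministic state machine that applies committed commands strictly in log order. The consensus module's only job is to agree on the contents and order of the log; the state machine never looks at anything beyond the next committed entry.

```go
package main

import "fmt"

// Entry is one replicated log entry: a command for the state machine.
// (Illustrative; a real consensus module also records a term per entry.)
type Entry struct {
	Key   string
	Value string
}

// KVStore is a deterministic state machine: given the same sequence of
// entries, every replica computes exactly the same state.
type KVStore struct {
	data        map[string]string
	lastApplied int // index of the last log entry applied to the store
}

func NewKVStore() *KVStore {
	return &KVStore{data: make(map[string]string)}
}

// applyCommitted applies, in log order, every entry up to commitIndex that
// has not been applied yet. The consensus algorithm decides commitIndex;
// the state machine just follows it.
func (kv *KVStore) applyCommitted(log []Entry, commitIndex int) {
	for kv.lastApplied < commitIndex {
		kv.lastApplied++
		e := log[kv.lastApplied-1] // log indices are 1-based in Raft's description
		kv.data[e.Key] = e.Value
	}
}

func main() {
	log := []Entry{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	kv := NewKVStore()
	kv.applyCommitted(log, 2) // only the first two entries are committed so far
	fmt.Println(kv.data)      // map[x:1 y:2]
	kv.applyCommitted(log, 3) // the third entry commits later
	fmt.Println(kv.data)      // map[x:3 y:2]
}
```

Because the state machines are deterministic and apply the same entries in the same order, every replica computes the same map.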
Consensus algorithms for practical systems typically have the following properties:
• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
"Non-Byzantine conditions" means that the servers are fail-stop: they either follow the Raft protocol correctly, or they halt. For example, most power failures are non-Byzantine because they cause computers to simply stop executing instructions; if a power failure occurs, Raft may stop operating, but it won't send incorrect results to clients.
Byzantine failure refers to situations in which some computers execute incorrectly, because of bugs or because someone malicious is controlling the computers. If a failure like this occurs, Raft may send incorrect results to clients.
Most of 6.824 is about tolerating non-Byzantine faults. Correct operation despite Byzantine faults is more difficult; we'll touch on this topic at the end of the term.
• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
3 What’s wrong with Paxos?
Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.
Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.
We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.
The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.
Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.
As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:
There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system.... the final system will be based on an unproven protocol [4].
Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.
4 Designing for understandability
We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.
There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?
We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes. Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.
5 The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.
Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:
• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).
• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.
After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.


Figure 2: Condensed summary of the Raft consensus algorithm (referenced throughout Section 5).

State

Persistent state on all servers (updated on stable storage before responding to RPCs):
• currentTerm: the latest term the server has seen (initialized to 0 on first boot, increases monotonically)
• votedFor: the candidate id that received this server's vote in the current term (or null if none)
• log[]: log entries; each entry contains a command for the state machine and the term when the entry was received from the leader

Volatile state on all servers:
• commitIndex: index of the highest log entry known to be committed (initialized to 0, increases monotonically)
• lastApplied: index of the highest log entry applied to the state machine (initialized to 0, increases monotonically)

Volatile state on leaders (reinitialized after each election):
• nextIndex[]: for each server, the index of the next log entry to send to that server (initialized to the leader's last log index + 1)
• matchIndex[]: for each server, the index of the highest log entry known to be replicated on that server (initialized to 0, increases monotonically)

AppendEntries RPC

Arguments:
• term: the leader's term
• leaderId: so that followers can redirect clients to the leader
• prevLogIndex: index of the log entry immediately preceding the new ones
• prevLogTerm: term of the prevLogIndex entry
• entries[]: log entries to store (empty for heartbeat RPCs; may be one or more)
• leaderCommit: the leader's commit index

Results:
• term: the current term, for the leader to update itself
• success: true if the follower contained an entry matching prevLogIndex and prevLogTerm

Receiver implementation:
1. Reply false if term < currentTerm (Section 5.1)
2. Reply false if the log does not contain an entry at prevLogIndex whose term matches prevLogTerm (Section 5.3)
3. If an existing entry conflicts with a new one (same index but different term), delete the existing entry and all entries that follow it (Section 5.3)
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex to the smaller of leaderCommit and the index of the last new entry

RequestVote RPC

Arguments:
• term: the candidate's term
• candidateId: the candidate requesting the vote
• lastLogIndex: index of the candidate's last log entry
• lastLogTerm: term of the candidate's last log entry

Results:
• term: the current term, for the candidate to update itself
• voteGranted: true means the candidate received the vote

Receiver implementation:
1. Reply false if term < currentTerm (Section 5.1)
2. If votedFor is null or candidateId, and the candidate's log is at least as up-to-date as the receiver's log, grant the vote (Sections 5.2 and 5.4)

Rules for servers

All servers:
• If commitIndex > lastApplied: increment lastApplied and apply log[lastApplied] to the state machine (Section 5.3)
• If an RPC request or response contains a term T > currentTerm: set currentTerm = T and convert to follower (Section 5.1)

Followers (Section 5.2):
• Respond to RPCs from candidates and leaders
• If the election timeout elapses without receiving an AppendEntries RPC from the current leader or granting a vote to a candidate: convert to candidate

Candidates (Section 5.2):
• On conversion to candidate, start an election: increment currentTerm, vote for self, reset the election timer, and send RequestVote RPCs to all other servers
• If votes are received from a majority of servers: become leader
• If an AppendEntries RPC (heartbeat) is received from a new leader: convert to follower
• If the election timeout elapses: start a new election

Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeats) to each server; repeat during idle periods to prevent election timeouts (Section 5.2)
• If a command is received from a client: append the entry to the local log, and respond to the client after the entry has been applied to the state machine (Section 5.3)
• If the last log index ≥ nextIndex for a follower: send an AppendEntries RPC with the log entries starting at nextIndex; if successful, update nextIndex and matchIndex for that follower; if AppendEntries fails because of log inconsistency, decrement nextIndex and retry (Section 5.3)
• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N
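The summary above maps almost directly onto data types. The following Go sketch is illustrative only (the type and field names are ours, not prescribed by the paper); it mirrors the per-server state and the two RPC message types from Figure 2.

```go
package raft

// Per-server state from Figure 2. Persistent fields must be written to
// stable storage before responding to RPCs; the rest is volatile.
type ServerState struct {
	// Persistent on all servers.
	CurrentTerm int     // latest term this server has seen (starts at 0)
	VotedFor    *int    // candidateId voted for in the current term, or nil
	Log         []Entry // log entries; index 1 is the first entry

	// Volatile on all servers.
	CommitIndex int // highest log index known to be committed
	LastApplied int // highest log index applied to the state machine

	// Volatile on leaders, reinitialized after each election.
	NextIndex  []int // per follower: next log index to send
	MatchIndex []int // per follower: highest log index known replicated
}

// Entry is one log entry: the term in which it was created plus a command.
type Entry struct {
	Term    int
	Command []byte
}

// AppendEntriesArgs / AppendEntriesReply correspond to the AppendEntries RPC.
type AppendEntriesArgs struct {
	Term         int     // leader's term
	LeaderID     int     // so followers can redirect clients
	PrevLogIndex int     // index of the entry immediately preceding Entries
	PrevLogTerm  int     // term of the entry at PrevLogIndex
	Entries      []Entry // empty for heartbeats
	LeaderCommit int     // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // follower's currentTerm, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm
}

// RequestVoteArgs / RequestVoteReply correspond to the RequestVote RPC.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int // voter's currentTerm, for the candidate to update itself
	VoteGranted bool
}
```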
Figure 3: Key properties of the algorithm.

• Election Safety: at most one leader can be elected in a given term.
• Leader Append-Only: a leader never overwrites or deletes entries in its own log; it only appends new entries.
• Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
• Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms.
• State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
5.1 Raft basics
A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.
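A minimal Go sketch of the three roles, with the transitions from Figure 4 noted in comments (the type and constant names are assumptions):

```go
package raft

// Role is a server's current role in the Raft protocol.
type Role int

const (
	Follower  Role = iota // passive: responds to RPCs, starts an election on timeout
	Candidate             // requests votes; becomes leader on a majority
	Leader                // handles client requests, sends heartbeats
)

// Transitions from Figure 4 (informal):
//   Follower  -> Candidate : election timeout elapses with no valid RPCs
//   Candidate -> Leader    : receives votes from a majority of servers
//   Candidate -> Follower  : discovers the current leader or a newer term
//   Leader    -> Follower  : discovers a server with a newer term
```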
Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.
Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
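These term rules reduce to two comparisons performed on every RPC request and response. A hedged Go sketch, with illustrative names:

```go
package raft

// A minimal slice of per-server state for this sketch.
type server struct {
	currentTerm int
	role        string // "follower", "candidate", or "leader"
	votedFor    *int
}

// observeTerm applies Raft's term rules to a term carried by any RPC request
// or response. It returns false if the message carries a stale term and
// should be rejected by the caller.
func (s *server) observeTerm(rpcTerm int) bool {
	if rpcTerm > s.currentTerm {
		// Our information is obsolete: adopt the larger term, revert to
		// follower state, and clear any vote cast in the old term.
		s.currentTerm = rpcTerm
		s.role = "follower"
		s.votedFor = nil
	}
	// A request carrying a term smaller than ours is stale; reject it.
	return rpcTerm >= s.currentTerm
}
```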
Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
5.2 Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.
Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
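A small sketch of drawing a fresh randomized timeout each time the election timer is reset; the 150–300 ms range is the example interval from the text, and the helper name is an assumption:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh timeout drawn uniformly from a fixed
// interval. Each server (and each new election) draws independently, so in
// most cases a single server times out first, wins, and sends heartbeats
// before anyone else's timer fires.
func randomElectionTimeout() time.Duration {
	const min, max = 150 * time.Millisecond, 300 * time.Millisecond
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

func main() {
	// A follower would arm a timer like this and reset it whenever it hears
	// from the current leader or grants a vote.
	timer := time.NewTimer(randomElectionTimeout())
	defer timer.Stop()
	fmt.Println("armed election timer")
}
```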

Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.
5.3 Log replication
Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
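The follower side of this paragraph is a short loop: adopt the leader's commit index (bounded by the local log) and then apply newly committed entries in order. A hedged Go sketch with illustrative names:

```go
package raft

// A trimmed-down view of a follower's state for this sketch.
type follower struct {
	log         [][]byte // log[i] holds the command at index i+1 (1-based indices)
	commitIndex int      // highest index known committed
	lastApplied int      // highest index applied to the state machine
	apply       func(command []byte)
}

// advanceCommit is called after a successful AppendEntries: the follower
// adopts the leader's commit index (bounded by its own log length) and then
// applies newly committed entries to its state machine, strictly in log order.
func (f *follower) advanceCommit(leaderCommit int) {
	if leaderCommit > f.commitIndex {
		f.commitIndex = min(leaderCommit, len(f.log))
	}
	for f.lastApplied < f.commitIndex {
		f.lastApplied++
		f.apply(f.log[f.lastApplied-1])
	}
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```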
We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
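The consistency check, together with the conflict-handling rules from Figure 2, fits in a few lines. Below is an illustrative Go sketch of the follower-side handling (the names and the exact slice representation are assumptions, and indices are 1-based as in the paper):

```go
package raft

type Entry struct {
	Term    int
	Command []byte
}

// handleAppendEntries implements the follower side of the consistency check:
// the new entries are accepted only if the follower's log contains an entry
// at prevLogIndex with term prevLogTerm. It returns the (possibly updated)
// log and whether the request succeeded.
func handleAppendEntries(log []Entry, prevLogIndex, prevLogTerm int, entries []Entry) ([]Entry, bool) {
	// Reject if we have no entry at prevLogIndex, or its term does not match.
	if prevLogIndex > len(log) {
		return log, false
	}
	if prevLogIndex > 0 && log[prevLogIndex-1].Term != prevLogTerm {
		return log, false
	}
	// Walk the new entries; on the first conflict (same index, different term),
	// truncate our log from that point and append the leader's entries.
	for i, e := range entries {
		idx := prevLogIndex + 1 + i // absolute log index of this entry
		if idx > len(log) {
			log = append(log, entries[i:]...)
			break
		}
		if log[idx-1].Term != e.Term {
			log = append(log[:idx-1], entries[i:]...)
			break
		}
		// Otherwise the entry is already present; leave it alone (idempotent).
	}
	return log, true
}
```

Rejection here is what drives the leader's nextIndex backoff described in the following paragraphs.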
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

Figure 7: When the leader at the top comes to power, any of scenarios (a–f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a–b), may have extra uncommitted entries (c–d), or both (e–f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
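A sketch of the leader-side retry loop driven by nextIndex; sendAppendEntries stands in for the real RPC and, like the other names, is an assumption:

```go
package raft

type Entry struct {
	Term    int
	Command []byte
}

// replicateTo keeps sending AppendEntries to one follower, backing nextIndex
// up by one after each rejection, until the consistency check passes. Because
// the empty prefix (prevLogIndex == 0) always matches, nextIndex never drops
// below 1 against a correct follower.
func replicateTo(log []Entry, nextIndex int,
	sendAppendEntries func(prevLogIndex, prevLogTerm int, entries []Entry) bool) int {

	for {
		prevLogIndex := nextIndex - 1
		prevLogTerm := 0 // term 0 stands for the empty prefix
		if prevLogIndex > 0 {
			prevLogTerm = log[prevLogIndex-1].Term
		}
		if sendAppendEntries(prevLogIndex, prevLogTerm, log[prevLogIndex:]) {
			// Success: everything through the end of our log is now replicated.
			return len(log) + 1 // the follower's new nextIndex
		}
		// Rejected by the consistency check: step back one entry and retry.
		nextIndex--
	}
}
```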
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
With this mechanism,a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.
5.4 Safety
The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.
This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.
5.4.1 Election restriction
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
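This "more up-to-date" rule is a two-step comparison. A minimal Go sketch of the check a voter would perform before granting its vote (names are illustrative):

```go
package raft

// candidateUpToDate reports whether a candidate's log (described by the index
// and term of its last entry) is at least as up-to-date as the voter's log.
// A vote is granted only when this returns true (and votedFor allows it).
func candidateUpToDate(candLastIndex, candLastTerm, myLastIndex, myLastTerm int) bool {
	if candLastTerm != myLastTerm {
		// A later last term wins regardless of log length.
		return candLastTerm > myLastTerm
	}
	// Same last term: the longer log is more up-to-date.
	return candLastIndex >= myLastIndex
}
```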
5.4.2 Committing entries from previous terms
As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).
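The conservative commitment rule is easy to state in code: count replicas only for entries from the leader's current term. A hedged Go sketch with illustrative names:

```go
package raft

// maybeAdvanceCommit finds the largest N > commitIndex such that log[N] is
// from the leader's current term and a majority of servers store an entry at
// index N. Entries from earlier terms are never committed by counting
// replicas; they become committed indirectly once such an N exists, by the
// Log Matching Property.
func maybeAdvanceCommit(logTerms []int, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(logTerms); n > commitIndex; n-- {
		if logTerms[n-1] != currentTerm {
			continue // only current-term entries are committed by counting
		}
		count := 1 // the leader itself stores every entry in its log
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(matchIndex)+1)/2 {
			return n // new commitIndex
		}
	}
	return commitIndex
}
```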
5.4.3 Safety argument
Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.
1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).
2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.
3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).
4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.
5. The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.
6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

Figure 9: If S1 (leader for term T) commits a new log entry during its term, and S5 is later elected leader for term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.
Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.
Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.
5.5 Follower and candidate crashes
Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.
5.6 Timing and availability
One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.
Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:
broadcastTime ≪ electionTimeout ≪ MTBF
In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.
The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.