Distributed protocol
Consensus Algorithm
Must satisfy three conditions:
non-triviality: if the value v is decided, then v must have been proposed by a proposer
safety & safe learning: if any two proposers learn that the decided values are a and b, then a = b
classic paxos
two phase:
- reading phase
- writing phase
properties for proposer
- epoch
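- A proposer proposes only after receiving promises from a majority of acceptors
- A proposer returns the decided value only after receiving accepts from a majority of acceptors
- If the promises report no previously accepted value, any value may be proposed; otherwise the value accepted with the highest epoch is proposed
- The epoch used by a proposer is greater than all previously used epochs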
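The value-selection rule for the proposer can be sketched as below. This is a minimal illustration, not a full proposer: the names `Promise`, `majority`, and `choose_value` are hypothetical, and a promise is assumed to carry the acceptor's last accepted `(epoch, value)` pair or `None`.

```python
from typing import Optional, Sequence, Tuple

# Assumption: each promise carries the acceptor's last accepted proposal
# as (epoch, value), or None if the acceptor has accepted nothing yet.
Promise = Optional[Tuple[int, str]]


def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a strict majority."""
    return n_acceptors // 2 + 1


def choose_value(promises: Sequence[Promise], n_acceptors: int,
                 own_value: str) -> Optional[str]:
    """Return the value to propose, or None if a majority has not promised."""
    if len(promises) < majority(n_acceptors):
        return None  # not allowed to propose yet
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return own_value  # nothing accepted before: any value is allowed
    # Otherwise adopt the value that was accepted with the highest epoch.
    return max(accepted, key=lambda p: p[0])[1]
```

For example, with three acceptors and promises `[None, (2, "a"), (5, "b")]`, the proposer must propose `"b"` rather than its own value.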
properties for acceptor:
- An acceptor processes a prepare or propose message only if its epoch >= the last promised epoch
- For each prepare or propose message received, the acceptor’s last promised epoch is set to the epoch received. This is after Property 6 has been satisfied.
- For each prepare message received, the acceptor replies with promise. This is after Properties 6 & 7 have been satisfied.
- For each propose message received, the acceptor replies with accept after updating its last accepted proposal. This is after Properties 6 & 7 have been satisfied.
- Last promised epoch and last accepted proposal are persistent and only updated by Properties 7 & 9
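The acceptor properties above can be sketched as a small in-memory class. This is only an illustration under assumptions: the class and method names are hypothetical, messages are plain tuples, and the two fields would be written to stable storage in a real system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    # Both fields must be persistent in a real deployment; here they live in memory.
    last_promised: int = -1
    last_accepted: Optional[Tuple[int, str]] = None  # (epoch, value)

    def on_prepare(self, epoch: int):
        if epoch < self.last_promised:
            return None  # stale epoch: the message is not processed
        self.last_promised = epoch
        # The promise reports the last accepted proposal, if any.
        return ("promise", epoch, self.last_accepted)

    def on_propose(self, epoch: int, value: str):
        if epoch < self.last_promised:
            return None  # stale epoch: the message is not processed
        self.last_promised = epoch
        self.last_accepted = (epoch, value)  # update before replying
        return ("accept", epoch)
```

Returning `None` models the silent drop of classic Paxos; the proposer then has to time out and retry.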
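Improvements to classic Paxos
- For a prepare or proposal with a smaller epoch, the acceptor does not respond, so the proposer can only wait for a timeout and retry. This can be improved by having the acceptor send a no-prepare or no-accept message.
- In the prepare phase, if the value in most of the acks the proposer receives is the same as the proposal value for phase two, can phase two be skipped and the value returned directly? How does the algorithm know that the decided value is the one that phase two would propose?
- As soon as an acceptor receives the confirmation, it returns decided to the proposer, so the proposer can return the decided value without waiting for a majority of acceptors to confirm. (Is this really safe?) Mencius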
proposal copying
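(Unique proposer uses each epoch) Only one proposal can be accepted within an epoch. If, in phase one, the proposer receives a NACK showing that the max epoch equals the epoch sent in its prepare and that a value has already been promised, it performs proposal copying and jumps to phase 2 (for faster convergence?)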
Raft
- Three states
- leader
- follower
- candidate
- Each server stores a current term number
- Servers communicate using RPCs:
- RequestVote RPC (initiated by candidates during elections)
- AppendEntries RPC (initiated by leaders to replicate log entries and to provide a form of heartbeat)
- leader election
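- Trigger: as long as a follower keeps receiving RPCs from a leader or candidate, it remains a follower. If it receives no RPC within the election timeout, it assumes the leader has died and starts a leader election
- Election flow: the follower increments its current term (term += 1), transitions to candidate, votes for itself, and sends RequestVote RPCs in parallel to the other servers in the cluster
- A candidate stays in that state until:
- it wins the election
- another server wins the election
- a period of time passes with no winner
- Condition for a candidate to win: it receives votes from a majority of the servers in the cluster for the same term. A server votes for at most one candidate per term, first-come-first-served. Once it wins, it sends heartbeats to the other nodes to prevent new elections
- How are split votes resolved (many servers start an election at the same time and vote for themselves)?
- random election timeout (150-300 ms)
- followers vote only for a candidate whose log is at least as up-to-date as their own
- when a candidate receives a request from a leader, it compares the leader's term with its own: if the leader's term is at least as large as its own, it becomes a follower; otherwise it rejects the request and remains a candidate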
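The voting rules above (stale terms rejected, one vote per term on a first-come-first-served basis, and a vote only for a candidate whose log is at least as up-to-date) can be sketched as follows. The `Server` class and its field names are hypothetical, and the sketch leaves out persistence, timers, and networking.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term resets the vote for that term.
            self.current_term = term
            self.voted_for = None
        # Up-to-date check: compare last terms first, then log length.
        log_ok = (last_log_term, last_log_index) >= (
            self.last_log_term, self.last_log_index)
        # First-come-first-served: at most one candidate per term.
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False
```

Note that a server still adopts the newer term even when it refuses the vote because the candidate's log is behind.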
- Term (logical clock)
- Properties: consecutive integers; every server keeps a current term
- current terms are exchanged whenever servers communicate
- if server a's term is smaller than b's, a updates its term to b's
- if a candidate or leader finds a larger term, it turns itself into a follower
- if a server receives a request with a stale term, it rejects it
- Log replication
- Flow
- a command arrives from a client
- the leader appends the command to its log
- the leader sends the entry to the other servers to replicate it
- once a majority of servers have replied, the leader applies the command to its state machine
- the leader returns the result to the client
- (if followers crash or run slowly, or the network drops messages, the leader retries the RPC indefinitely until all followers store all log entries)
- log structure
- state machine command
- term number
- to detect inconsistencies between logs
- integer index
- to identify its position in the log
- committed log entries
- an entry is committed once the leader has replicated it on a majority of servers
- the leader includes the highest index it knows to be committed in AppendEntries RPCs, so followers learn it and apply that entry to their state machines
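One way to compute "the highest index replicated on a majority" is to sort the per-server progress and take the majority-th largest value. The sketch below is a simplification under assumptions: `match_index` is a hypothetical list holding the highest replicated index per follower, and the extra Raft rule that a leader only commits entries from its own term by counting replicas is left out.

```python
from typing import List


def commit_index(match_index: List[int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of servers (leader included).

    Simplification: real Raft additionally requires the entry at this index
    to belong to the leader's current term before committing it.
    """
    indexes = sorted(match_index + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest value is held by at least `majority` servers.
    return indexes[majority - 1]
```

With a leader at index 7 and followers at 7, 5, 3, and 2, index 5 is the highest one present on three of the five servers.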
- how to handle inconsistencies?
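- by forcing the followers' logs to duplicate the leader's own (conflicting follower entries are overwritten)
- The leader keeps a nextIndex for each follower, initialized to the leader's last log index + 1. It then sends AppendEntries RPCs to probe the follower; if the probe fails, it backs off one step at a time until it finds a point of agreement. (Leader Append-Only guarantees that the leader never deletes or overwrites its own log)
- If the leader crashes after an entry has been replicated to a majority of nodes but before the entry is committed, the entry may still be overwritten
- Raft uses the up-to-date check during voting to ensure the new leader has all committed entries
- if the logs have last entries with different terms, then the log with the later term is more up-to-date
- if the logs end with the same term, then whichever log is longer is more up-to-date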
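The nextIndex probing described above can be sketched with two plain lists standing in for the leader's and follower's logs. The function names are hypothetical, indexes are 1-based, and the follower simply truncates after the agreement point, which is a simplification of the real conflict handling.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(follower: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. prev_index == 0 means 'before the first entry'."""
    if prev_index > 0:
        if len(follower) < prev_index or follower[prev_index - 1][0] != prev_term:
            return False  # consistency check failed: no agreement here
    # Simplification: drop everything after the agreement point, then copy.
    del follower[prev_index:]
    follower.extend(entries)
    return True


def sync_follower(leader: List[Entry], follower: List[Entry]) -> int:
    """Leader side: back off nextIndex by one until the follower agrees."""
    next_index = len(leader) + 1  # initial value: last leader index + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower, prev_index, prev_term, leader[prev_index:]):
            return next_index  # the agreement point that was found
        next_index -= 1
```

The loop always terminates because a probe with `prev_index == 0` is accepted unconditionally. The leader's list is only read, never modified, which mirrors the Leader Append-Only property.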
- Timing and availability
- broadcastTime (heartbeat) << electionTimeout << MTBF
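A small sketch of the timing requirement, together with the randomized election timeout from the notes. Reading "<<" as "at least a factor of 10 smaller" is an assumption made only for this illustration, and both function names are hypothetical.

```python
import random


def election_timeout_ms(low: float = 150, high: float = 300) -> float:
    """Pick a random election timeout so servers rarely time out together."""
    return random.uniform(low, high)


def timing_ok(broadcast_ms: float, election_ms: float, mtbf_ms: float,
              factor: float = 10) -> bool:
    """Check broadcastTime << electionTimeout << MTBF.

    Assumption: '<<' is interpreted as 'smaller by at least `factor`'.
    """
    return (broadcast_ms * factor <= election_ms
            and election_ms * factor <= mtbf_ms)
```

For instance, a 10 ms broadcast time with a 150 ms election timeout and an MTBF of a month satisfies the check, while a 100 ms broadcast time does not.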
Raft illustrated
Chain replication
- request processing
- reply generation:
- reply is generated and sent by tail
- Query processing
- Query processed by tail
- Update processing
- update processed by head and delivered along the chain
- Advantages & disadvantages
- Read: cheap, because only the tail replies to the query
- Write: heavier, because all nodes need to participate, but the computation is done only once: the head computes the new value and the other nodes only need to perform a single write
- Coping with server failures
- Simply remove the failed server?
- ... (not read; lost interest)
- Reference: https://www.dazhuanlan.com/andyblack/topics/1098082
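The request path of chain replication (updates enter at the head, queries and replies come from the tail) can be sketched with a list of in-memory stores. The `Chain` class is hypothetical, updates are modeled as integer increments purely for illustration, and failure handling is not covered.

```python
from typing import Dict, List, Optional


class Chain:
    def __init__(self, n_nodes: int):
        # nodes[0] is the head, nodes[-1] is the tail.
        self.nodes: List[Dict[str, int]] = [{} for _ in range(n_nodes)]

    def update(self, key: str, delta: int) -> int:
        head = self.nodes[0]
        # The new value is computed once, at the head.
        new_value = head.get(key, 0) + delta
        # Every node, from head to tail, only performs a plain write.
        for node in self.nodes:
            node[key] = new_value
        # The reply is generated by the tail.
        return self.nodes[-1][key]

    def query(self, key: str) -> Optional[int]:
        # Queries are served by the tail alone.
        return self.nodes[-1].get(key)
```

A query touches one node while an update touches all of them, which is the read-cheap, write-heavy trade-off noted above.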
Facebook f4 system
Design details
Volumes:
- state:
- Unlocked: under 100GB, allow read/write/delete
- Locked: only read/delete is allowed
- File types:
- data file: BLOB + metadata (key, size, or checksum)
- index file: allows the in-memory index to be rebuilt when rebooting
- journal file: tracks BLOBs that have been deleted
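The volume states and the role of the journal file can be sketched as below. This is an illustration only: the `Volume` class, its method names, and the exact byte threshold are assumptions, with the data file modeled as a dict and the journal file as a set of deleted keys.

```python
from typing import Dict, Optional, Set

# Assumption: "100GB" from the notes is taken as 100 * 10**9 bytes.
LOCK_THRESHOLD_BYTES = 100 * 10**9


class Volume:
    def __init__(self, lock_threshold: int = LOCK_THRESHOLD_BYTES):
        self.lock_threshold = lock_threshold
        self.data: Dict[str, bytes] = {}  # stands in for the data file
        self.journal: Set[str] = set()    # stands in for the journal file
        self.size = 0
        self.locked = False

    def write(self, key: str, blob: bytes) -> bool:
        if self.locked:
            return False  # locked volumes accept only reads and deletes
        self.data[key] = blob
        self.size += len(blob)
        if self.size >= self.lock_threshold:
            self.locked = True  # volume is full: lock it
        return True

    def read(self, key: str) -> Optional[bytes]:
        if key in self.journal:
            return None  # deleted BLOBs are hidden via the journal
        return self.data.get(key)

    def delete(self, key: str) -> None:
        # Deletes are recorded in the journal; the data itself is untouched.
        self.journal.add(key)
```

Recording deletes in a journal is what lets a locked volume keep its data file immutable while still supporting delete.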
System overall
- Routing
- creates go to hot storage
- deletes go to hot or warm storage
- reads are served from the cache, hot storage, or warm storage
- controller
- provision new machines
- Maintain pool of unlocked volumes
- ensure all logical volumes have enough physical volumes backing them
- create new physical volumes if necessary
- perform compaction and garbage collection
- Router tier
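- stores the mapping from logical volumes to physical volumes
- Transformer Tier
- separates compute from storage: compute nodes do the computation, storage nodes only store and serve data
... (not finished reading)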
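Reliability strategies
- Replication: across different nodes / different racks / different data centers
- Consistent hashing
- CRUSH (Controlled Replication Under Scalable Hashing): CRUSH is designed to distribute data across devices in proportion to their storage capacity and bandwidth, while keeping the distribution statistically balanced. Replicas are placed across a hierarchy of storage devices, which also has an important effect on data safety. By reflecting the physical organization of the installation, CRUSH can model the system's failure domains and so account for potential device failures.
- EC (Erasure Code)
- Characteristics:
- low redundancy, high disk utilization
- high cost of data recovery
- high cost of data updates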
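The trade-offs of erasure coding can be seen in the simplest possible code, a single XOR parity block. This is a toy sketch, not a production scheme such as Reed-Solomon: the function names are hypothetical, all blocks are assumed to have equal length, and only one lost block can be repaired.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(blocks: List[bytes]) -> bytes:
    """Parity block = XOR of all data blocks."""
    return reduce(xor_bytes, blocks)


def recover(blocks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing block (marked as None)."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost block")
    if missing:
        survivors = [b for b in blocks if b is not None]
        blocks = list(blocks)
        # Recovery has to read every surviving block plus the parity.
        blocks[missing[0]] = reduce(xor_bytes, survivors, parity)
    return blocks
```

Three data blocks plus one parity block cost about 1.33x storage instead of 3x for triple replication, but repairing one block reads all the others, and changing any data block means recomputing the parity.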
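Distributed transaction protocols
- 2PC
- Phases:
- Prepare (voting): the master sends prepare to the participants; each participant executes the operations in the transaction and returns yes or no to the master
- Commit (commit or abort): if the master receives yes from all nodes, it enters the commit flow and sends commit to all participants. If it receives no from any node, it runs the abort flow. Participants return the execution result to the master
- Problems:
- Synchronous blocking: if participants share synchronized resources, participants block while accessing the critical resources
- A coordinator failure can leave participants blocked for a long time
- Data inconsistency: if the coordinator fails while sending commit, only some participants receive the commit
- If the coordinator crashes while sending commit and the only participant that received the message also crashes, the outcome of the transaction is unknown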
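The two phases of 2PC can be sketched with in-process objects. The `Participant` class and `two_phase_commit` function are hypothetical, calls are synchronous, and none of the failure cases listed above (coordinator crash, lost messages) are modeled.

```python
from typing import List


class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Execute the transaction's operations and vote yes or no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> str:
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote is yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A participant that has voted yes sits in the `prepared` state until phase 2 reaches it, which is exactly where the blocking problem comes from if the coordinator disappears.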
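- 3PC
- Phases:
- can-commit
- the coordinator asks the participants whether they can commit
- participants reply yes or no
- pre-commit
- if all participants replied yes, the coordinator:
- sends a pre-commit request
- participants pre-commit, write undo and redo logs, lock the records, and return a response
- if any participant replied no or the wait timed out, the coordinator:
- sends an abort request
- participants abort when they receive the abort or when their wait times out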
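- do-commit
- if all replies are yes:
- the coordinator sends commit
- participants commit the transaction and release resources
- participants send a response
- the coordinator receives all responses and the transaction is complete
- if any reply is no:
- the coordinator sends abort
- participants roll back and release resources
- participants send a response
- the coordinator receives all replies and the abort is complete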
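- How is a coordinator timeout handled?
- timeout mechanism
- How is synchronous blocking solved?
- it is not solved
- How is inconsistency solved?
- it is not solved
- Advantages over 2PC?
- the timeout mechanism mitigates coordinator crashes to some extent
- the first phase is split in two: the canCommit phase determines early whether the transaction can run while holding few resources, which improves throughput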
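The coordinator side of the three phases can be sketched as follows. The `Cohort` class and `three_phase_commit` function are hypothetical; a reply of `None` stands for a timeout, and participant-side timeouts and recovery are not modeled.

```python
from typing import List, Optional


class Cohort:
    def __init__(self, vote: Optional[bool] = True,
                 ack_precommit: Optional[bool] = True):
        self.vote = vote                    # reply to can-commit (None = timeout)
        self.ack_precommit = ack_precommit  # reply to pre-commit (None = timeout)
        self.state = "init"

    def can_commit(self) -> Optional[bool]:
        return self.vote  # no locks are taken in this phase

    def pre_commit(self) -> Optional[bool]:
        if self.ack_precommit:
            self.state = "pre-committed"  # logs written, records locked
        return self.ack_precommit

    def do_commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def three_phase_commit(cohorts: List[Cohort]) -> str:
    def abort_all() -> str:
        for c in cohorts:
            c.abort()
        return "aborted"

    # Phase 1 (can-commit): any "no" or timeout aborts before locks are held.
    if not all(c.can_commit() is True for c in cohorts):
        return abort_all()
    # Phase 2 (pre-commit): every cohort must acknowledge the pre-commit.
    acks = [c.pre_commit() for c in cohorts]
    if not all(a is True for a in acks):
        return abort_all()
    # Phase 3 (do-commit): commit everywhere and release resources.
    for c in cohorts:
        c.do_commit()
    return "committed"
```

An abort in phase 1 happens before any cohort has locked records, which is the resource saving that the canCommit split is meant to provide.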