Saturn

The devil is in the details.


Two-phase Commit

Classical 2PC

Flow

client           coordinator           slave              slave
   |      txn        |                    |                  |
   |---------------->|                    |                  |
   |                 |-----prepare------->|                  |
   |                 |-----prepare-------------------------->|
   |                 |                    |                  |
   |                 |<-------OK----------|                  |
   |                 |<-------OK-----------------------------|
   |                 |                    |                  |
   |                 |-----commit ------->|                  |
   |                 |-----commit -------------------------->|
   |                 |                    |                  |
   |                 |<-------OK----------|                  |
   |                 |<-------OK-----------------------------|
   |<-----reply------|                    |                  |

Coordinator-side state transitions in classical 2PC:

  1. Accept the request, execute the operation, and send a prepare message to all participants: initiated
  2. Receive positive replies from all participants: prepared
  3. Send the commit request to all participants: committing
  4. Receive positive replies from all participants: committed
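The four transitions above form a small state machine; a minimal sketch in Python (the state and event names are made up for illustration, not taken from any 2PC implementation):

```python
from enum import Enum, auto

class CoordState(Enum):
    INITIATED = auto()   # prepare sent to all participants
    PREPARED = auto()    # all participants replied OK
    COMMITTING = auto()  # commit request sent
    COMMITTED = auto()   # all participants acked the commit

def advance(state, event):
    """Advance the coordinator through the 2PC states described above."""
    transitions = {
        (CoordState.INITIATED, "all_prepare_ok"): CoordState.PREPARED,
        (CoordState.PREPARED, "send_commit"): CoordState.COMMITTING,
        (CoordState.COMMITTING, "all_commit_ok"): CoordState.COMMITTED,
    }
    return transitions[(state, event)]

state = advance(CoordState.INITIATED, "all_prepare_ok")
```

Any event not listed in the table (e.g. a negative vote) would fall through to the abort path, which is driven by the undo log described below.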

The role of redo & undo logs in 2PC:

  1. The coordinator writes a redo log before sending commit to all participants in the commit phase; if the coordinator crashes before the changes reach disk, it can recover from the redo log.
  2. The undo log guarantees that, if any slave replies "cannot commit" in the prepare phase, the transaction can be aborted.
  3. In short, the redo log is for crash recovery; the undo log restores the state from before the transaction executed.

When redo & undo logs are written in 2PC:

  1. Coordinator:
    1. Prepare phase: write the redo/undo log before executing the operation
    2. Record the transaction in the redo log before committing it
  2. Non-coordinator:
    1. Write the redo/undo log before executing the operation
    2. Record the transaction in the redo log after sending a positive response

Drawbacks of classical 2PC:

  1. Blocking protocol: every participant must wait for the coordinator's decision, which hurts latency
  2. Coordinator single point of failure: if the coordinator dies, the protocol cannot make progress
  3. Cascading rollback: rolling back an earlier transaction can cascade into rollbacks of later ones
  4. Participant failures: a participant crash or timeout also hurts latency
  5. Heavy network traffic: the two rounds of messages grow linearly with the number of nodes

DAOS asynchronous 2PC

Flow:

  1. The client submits a txn
  2. The coordinator accepts it, executes the operation locally, and forwards it to the other participants
  3. After receiving positive responses from the other participants, the coordinator marks the transaction committable and batches its transaction ID
  4. Once the batched transaction IDs reach a threshold, the coordinator sends the second-phase commit RPC
  5. The coordinator and slaves commit the transactions
client           coordinator           slave              slave
   |      txn        |                    |                  |
   |---------------->|                    |                  |
   |                 |-----prepare------->|                  |
   |                 |-----prepare-------------------------->|
   |                 |                    |                  |
   |                 |<-------OK----------|                  |
   |                 |<-------OK-----------------------------|
   |<------reply-----|                    |                  |
  ...           time elapses             ...                ...
   |                 |                    |                  |
   |            meet batch               |                  |
   |                 |---commit batch---->|                  |
   |                 |---commit batch----------------------->|
   |                 |                    |                  |
   |                 |<-------OK----------|                  |
   |                 |<-------OK-----------------------------|
   |                 |                    |                  |

Changes relative to classical 2PC

  1. In classical 2PC the coordinator writes the redo/undo log before sending prepare; in A2PC it writes the log after sending
  2. In classical 2PC the coordinator can reply to the client only after the second phase completes; in A2PC it replies as soon as the first phase completes
  3. In classical 2PC second-phase txns are processed one by one; A2PC batches txns that belong to the same redundancy group

How the consistency risks introduced by these changes are handled

  1. The coordinator writes its log only after sending the first-phase RPCs. If it crashes before the log is written, two scenarios are possible:
    1. The prepare message reached every participant:
      1. The new coordinator necessarily has this log entry and all participants can commit; the Resync protocol ensures the entry is committed
      2. The new coordinator necessarily has this log entry but at least one participant aborted; the Resync protocol ensures the entry is rolled back
    2. The prepare message reached only some participants:
      1. The new coordinator has this log entry; since some participants never received the prepare, the Resync protocol ensures the entry is rolled back
      2. The new coordinator does not have this log entry and will neither abort nor commit it; the client times out and retries the entry against the new coordinator

How the inconsistency introduced by batching is resolved

Because of batching, a transaction that is committable on the coordinator may still be in the prepared state on the participants. How do participants handle reads and writes?

  1. Writes: always handled by the leader
  2. Reads: redirected to the leader, or served only after the refresh flow has synchronized the state

How a new coordinator is chosen

  1. Via a pseudo-hash algorithm over the surviving participants, because participants carry the information about pending transactions.
  2. A fallback node cannot be chosen as the new coordinator: it has no transaction information and therefore cannot run the Resync flow.

What happens to the old coordinator

  1. Temporarily offline: after it comes back, the rebuild protocol re-synchronizes its transaction information
  2. Permanently offline: the transaction information is rebuilt on the fallback node

Resync

Principles

  1. If any surviving node has aborted the txn, or has no voted log for it, the txn is rolled back
  2. If an unreachable node aborted the txn and confirmed the abort with at least one surviving node, the txn is rolled back
  3. If the surviving nodes marked txn1 as prepared while the unreachable leader marked it as aborted, txn1 can still be committed

Flow

  1. The client sends a txn
  2. The leader accepts it, replicates it, and persists the log
  3. The slaves accept it, persist the log, and process the txn
  4. The slaves reply and the leader collects the replies
  5. The leader aggregates the results and aborts or commits accordingly
  6. The leader replies to the client
  7. The leader caches the txn and sends it to the slaves once the batch is full
  8. The slaves process it accordingly

Extreme example

  1. T0 reaches the client before T1, and T0 conflicts with T1
  2. T0 executes successfully on the coordinator; T1 fails on the coordinator
  3. T1's prepare RPC nevertheless reached all participants before T0's, and all participants replied positive
  4. The coordinator crashes before receiving T1's replies, so it cannot send abort for T1
  5. During Resync, T1 is committed and T0 times out; since the client was never answered before the crash, consistency is unaffected

Refresh

Flow:

  1. Fetch the list of dtxs that need refresh
  2. Iterate over each dtx
  3. Look up the dtx leader
  4. If this node is itself the leader, skip checking the current dtx
  5. Initialize
  6. Send an RPC to the leader to synchronize the transaction state
  7. Handle the synchronized transaction result accordingly

Classical 2PC vs. asynchronous 2PC:

Problems asynchronous 2PC solves:

  1. Replying to the client right after the prepare phase mitigates blocking and saves at least one RTT
  2. Sending the prepare RPC without first writing the log saves at least one disk flush
  3. Participant failures are detected within bounded time by the SWIM protocol, which activates a fallback; coordinator failures are resolved by Resync deciding the transaction state, so both failure modes are mitigated to a degree
  4. Group commit via batching in the commit phase reduces network overhead
  5. Cascading rollback still exists

Problems asynchronous 2PC may introduce:

  1. Asynchronous 2PC is not a standalone protocol; it requires SWIM alongside it
  2. The Resync flow adds complexity
  3. Batched commit adds further complexity: in classical 2PC every node holds the same state, so any node can serve reads and writes; in A2PC the leader batches the second phase of already-committed transactions, so slaves may still see them as prepared and must either synchronize the state or defer to the leader

Code

Client side:

Pseudocode:

daos_handle_t th = DAOS_TX_NONE;
int rc;

/* allocate transaction */
rc = daos_tx_open(dfs->coh, &th, 0, NULL);
if (rc)
	...

restart:
	/* execute operations under the same transaction */
	rc = daos_obj_fetch(..., th);
	...
	rc = daos_obj_update(..., th);
	...
	if (rc) {
		rc = daos_tx_abort(th, NULL);
		/* either goto restart or exit */
	}

	rc = daos_tx_commit(th, NULL);
	if (rc) {
		if (rc == -DER_TX_RESTART) {
			/* conflict with another transaction, try again */
			rc = daos_tx_restart(th, NULL);
			goto restart;
		}
		...
	}

	/* free up all the resources allocated for the transaction */
	rc = daos_tx_close(th, NULL);

Key call sites

Build the task

int
daos_tx_commit(daos_handle_t th, daos_event_t *ev)

Check whether the txn state allows committing

int
dc_tx_commit(tse_task_t *task)

Send the RPC

// check whether the pool_map_version matches
// dc_tx_commit_prepare is fairly involved; it takes the first target as the leader
// register the dc_tx_commit_cb callback
// change the transaction state to TX_COMMITTING
// build and send the RPC
static int
dc_tx_commit_trigger(tse_task_t *task, struct dc_tx *tx, daos_tx_commit_t *args)

Server side

Key call sites

Check conflicts, allocate memory, and assign an execution function to each sub-txn

void
ds_obj_cpd_handler(crt_rpc_t *rpc)

Execute the server-side leader operations

// execute the server-side leader operations and set a memory barrier
// run obj_obj_dtx_leader for every tgt of every txn
static void
ds_obj_dtx_leader_ult(void *arg)

Execute the operation on every tgt of the txn

static void
ds_obj_dtx_leader(struct daos_cpd_args *dca)

Execute in batches

int
dtx_leader_exec_ops(struct dtx_leader_handle *dlh, dtx_sub_func_t func,
                    dtx_agg_cb_t agg_cb, int allow_failure, void *func_arg)

The function each tgt on the leader reaches

static int
obj_obj_dtx_leader(struct dtx_leader_handle *dlh, void *arg, int idx,
                   dtx_sub_comp_cb_t comp_cb)

static int
ds_cpd_handle_one_wrap(crt_rpc_t *rpc, struct daos_cpd_sub_head *dcsh,
                       struct daos_cpd_disp_ent *dcde, struct daos_cpd_sub_req *dcsrs,
                       struct obj_io_context *ioc, struct dtx_handle *dth)

The read/write entry point executed on both the leader and the slaves

// executed on both the leader and the slaves
// performs the actual write and updates the timestamps; fairly involved
static int
ds_cpd_handle_one(crt_rpc_t *rpc, struct daos_cpd_sub_head *dcsh,
                  struct daos_cpd_disp_ent *dcde,
                  struct daos_cpd_sub_req *dcsrs,
                  struct obj_io_context *ioc, struct dtx_handle *dth)
/* Locally process the operations belong to one DTX.
 * Common logic, shared by both leader and non-leader.
 */

2nd phase

int
dtx_leader_end(struct dtx_leader_handle *dlh, struct ds_cont_hdl *coh, int result)

Resync key call sites

Entry point

// build the list of txns to resync
int
dtx_resync(daos_handle_t po_hdl, uuid_t po_uuid, uuid_t co_uuid, uint32_t ver, bool block)

One-by-one checking

// this function checks the entries one by one
// if the result is DSHR_NEED_COMMIT, batch the txn and wait for commit
// otherwise the txn is not committed
static int
dtx_status_handle(struct dtx_resync_args *dra)

Check a single entry

// check one entry
// returning commit tells the caller to commit it
int
dtx_status_handle_one(struct ds_cont_child *cont, struct dtx_entry *dte,
                      daos_epoch_t epoch, int *tgt_array, int *err)

Send the RPC to synchronize commit information

int
dtx_check(struct ds_cont_child *cont, struct dtx_entry *dte, daos_epoch_t epoch)

The DTX check logic

static void
dtx_handler(crt_rpc_t *rpc)

RPC handling

static int
dtx_rpc_internal(struct dtx_common_args *dca)

RPC sending

static int
dtx_req_list_send(struct dtx_common_args *dca, daos_epoch_t epoch, int len)

Aggregating the callbacks

// callback that aggregates the results after all peers have been checked
// implements the Resync principles
static void
dtx_req_list_cb(void **args)

Refresh key call sites

Server A

Trigger the refresh

int
dtx_refresh_internal(struct ds_cont_child *cont, int *check_count,
                     d_list_t *check_list, d_list_t *cmt_list,
                     d_list_t *abt_list, d_list_t *act_list, bool failout)
// filter out the list of dtxs whose leader is not the current node

Prepare the RPC and send it to the leader

// send an RPC for every dtx whose leader is not the current node
rc = dtx_rpc_prep(cont, &head, NULL, len, DTX_REFRESH, 0,
                  cmt_list, abt_list, act_list, &dca);
rc = dtx_rpc_post(&dca, rc);

Handle the check results for all dtxs whose leader is not the current node

d_list_for_each_entry_safe(dsp, tmp, cmt_list, dsp_link) // handle committable txns
d_list_for_each_entry_safe(dsp, tmp, abt_list, dsp_link) // handle abortable txns
d_list_for_each_entry_safe(dsp, tmp, act_list, dsp_link) // handle uncertain and in-progress txns

Iterate over every dtx whose leader is the current node

d_list_for_each_entry_safe(dsp, tmp, &self, dsp_link)

Check each dtx

int
dtx_status_handle_one(struct ds_cont_child *cont, struct dtx_entry *dte,
                      daos_epoch_t epoch, int *tgt_array, int *err)
// act on the result of the check

Leader side

Handle the refresh RPC

static void
dtx_handler(crt_rpc_t *rpc)
// the case DTX_REFRESH branch

Call into vos to do the check

int
vos_dtx_check(daos_handle_t coh, struct dtx_id *dti, daos_epoch_t *epoch,
              uint32_t *pm_ver, struct dtx_memberships **mbs, struct dtx_cos_key *dck,
              bool for_refresh)

Handle each dtx according to the returned result

Q&A

  1. Must rebuild finish before Resync can start?

    The paper does not say explicitly.

  2. A transaction contains multiple operations. If there is already a replication protocol, why also introduce two-phase commit?

    Raft in DAOS is only used to synchronize state between the primaries and replicas of the pool and container services; it is not used for data replication.

  3. How does the meaning of "prepared" differ between classical 2PC and A2PC?

    In classical 2PC, the coordinator marks a transaction prepared after receiving positive replies from all other participant nodes.

    In A2PC, the coordinator or any other node considers a transaction prepared as soon as it has executed the transaction's operations.

  4. How is DTE_ORPHAN handled?

Time in distributed systems

Computers keep time by counting the oscillations of a quartz crystal. A typical crystal oscillates 32768 times per second, and the circuitry derives one second from that count. The oscillation rate, however, drifts with the environment.

Given two events A and B in a computer system, how do we determine which happened first?

  1. On a single machine, time increases monotonically: if A happened before B, A's timestamp is necessarily smaller than B's.
  2. In a distributed system, clocks on different machines disagree. If A happens on machine a and B on machine b, and A physically precedes B, we still cannot conclude T(A) < T(B), because the two timestamps come from different reference frames.

Clocks fall into two families. Physical clocks include NTP and TrueTime; logical clocks include Lamport clocks and vector clocks. Each has its strengths: physical clocks are intuitive, correspond to real-world time, and are easy to use, but they cannot be made perfectly accurate and are comparatively expensive. Logical clocks capture causal relationships precisely, but require communication between nodes and are less intuitive to use.

Network Time Protocol (NTP)

NTP aims to synchronize all computers to within a few milliseconds. In practice the error is a few tens of milliseconds over a WAN and under 1 ms on a LAN.

  1. Client A sends an NTP message to server B containing the send timestamp T1
  2. Server B records its receive time T2 in the message
  3. Server B sends the message back to A, recording its send time T3 in the message
  4. Client A receives the message at time T4
       T1         T2'                      T3'        T4
A ------+----------------------------------------------+-------
         \                                            /
          \                                          /
           \                                        /
            \                                      /
B -----------+------------------------------------+------------
             T2                                   T3

Round-trip time (RTT) between A and B: δ = (T4 - T1) - (T3 - T2)

Clock offset between A and B: θ = ((T2 - T1) + (T3 - T4)) / 2

The interval from A sending the request to receiving the response is T4 - T1, of which T3 - T2 is B's processing time, so the network round-trip time is

δ = (T4 - T1) - (T3 - T2).

Assume the offset between A and B is θ; then T3' = T3 - θ.

The gap between T4 and T3' is half an RTT: T4 - T3' = δ / 2.

Substituting T3' and δ into this equality gives θ = ((T2 - T1) + (T3 - T4)) / 2.

Over a WAN, NTP achieves errors of a few tens of milliseconds; on a LAN, under 1 millisecond. The largest source of error is that the request and response legs may take different amounts of time. The offset derivation above assumes the two legs are symmetric, but in a real network they may take different routes with different delays, so the computed offset is inaccurate; that is why WAN error can reach tens of milliseconds or more. A LAN is stable and has few routers in the path, so its error stays under 1 millisecond.
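The offset and delay formulas above can be checked numerically; a small sketch (the timestamp values are made up, with B's clock running 5 units ahead of A's and a symmetric one-way delay of 10 units):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Compute NTP round-trip delay and clock offset from the four timestamps."""
    delta = (t4 - t1) - (t3 - t2)          # network round-trip time
    theta = ((t2 - t1) + (t3 - t4)) / 2    # estimated offset of B relative to A
    return delta, theta

# A sends at 100 (A's clock); B receives at B-time 115, replies at B-time 117;
# A receives at 122.
delta, theta = ntp_offset_delay(t1=100, t2=115, t3=117, t4=122)
```

With these numbers the computed delay is 20 (two 10-unit legs) and the offset is 5, matching the constructed scenario.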

Lamport logical clocks

Logical clocks capture the causal relationships between events in a distributed system.

Limitations:

  1. Cannot observe events outside the system.
  2. C(a) < C(b) does not imply a → b: knowing two clock values does not establish the causal relationship between the two events.
  3. Cannot tell whether two events are concurrent.
  4. Not intuitive: detached from physical time.
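The Lamport clock rules (tick on local events and sends; take the max then tick on receives) fit in a few lines; a minimal sketch:

```python
class LamportClock:
    """Minimal Lamport clock: local events tick, receives take max then tick."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event or a send: advance the clock by one."""
        self.time += 1
        return self.time

    def recv(self, msg_time):
        """Receiving a message stamped msg_time: jump past it, then tick."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()        # a sends a message stamped 1
t_recv = b.recv(t_send)  # b receives it and moves to 2
```

Note limitation 2 above: b could independently reach time 2 without ever hearing from a, so equal or ordered clock values alone do not prove causality.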

Vector clocks

Vector clocks fix Lamport clocks' inability to fully describe causal relationships between events:

  1. Still cannot observe events outside the system
  2. Can describe causal relationships between events
  3. Can distinguish concurrent events
  4. Still not intuitive: detached from physical time

TrueTime

Proposed by Google; uses high-precision hardware to provide relatively accurate physical time.

HLC (Hybrid Logical Clocks)

HLC combines physical and logical clocks:

  1. Requires only bounded space, independent of cluster size.
  2. Preserves causality.
  3. The growth of the logical component is bounded.
  4. The drift between the logical time and real physical time is bounded.

Clocks in DAOS

/** Mask for the 18 logical bits */
#define D_HLC_MASK 0x3FFFFULL

The low 18 bits store the logical clock; the upper 46 bits store the physical clock.
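A sketch of how such a split timestamp could be packed and unpacked (the 18/46 split follows the mask above; the helper names are made up, not DAOS APIs):

```python
D_HLC_MASK = 0x3FFFF  # low 18 bits hold the logical counter

def hlc_pack(physical, logical):
    """Pack a 46-bit physical part and an 18-bit logical part into one 64-bit value."""
    assert logical <= D_HLC_MASK
    return (physical << 18) | logical

def hlc_unpack(hlc):
    """Split a packed HLC back into (physical, logical)."""
    return hlc >> 18, hlc & D_HLC_MASK

hlc = hlc_pack(physical=123456789, logical=42)
```

Because the physical part occupies the high bits, comparing two packed HLC values as plain integers orders them first by physical time and then by the logical counter, which is exactly the ordering an HLC needs.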

Databases

Anomalies in databases

  1. Dirty read
  2. Phantom read
  3. Non-repeatable read
  4. Dirty write
  5. Lost update

Concurrency control mechanisms in databases

  1. Timestamp ordering
  2. Commitment ordering
  3. Serialization graph testing
  4. Locking

Time-stamp ordering

Timestamp ordering (TO) is a concurrency control technique that orders transaction commits by timestamps.

It involves two kinds of entities:

  1. Transactions
  2. Data items

Timestamps are stamped (assigned) onto both.

The timestamp attached to a transaction:

Each transaction is assigned a time value identifying when it occurred (usually at transaction start, though some systems assign it only at commit). This value is its "timestamp", like a stamp placed on the transaction. It can come from the system clock or from a logical counter.

The timestamps attached to a data item:

Each data item carries two timestamps:

  1. A read timestamp: the largest timestamp of any transaction that has read the item
  2. A write timestamp: the timestamp of the transaction that wrote the item's current value, i.e. the most recent writer of the item

Transactions are ordered by their timestamps

Because of concurrency, the scheduler compares transaction Ti's timestamp with the timestamps on the data items touched by Tj to determine the order between concurrent Ti and Tj (if Ti < Tj, the scheduler must ensure the resulting concurrent schedule is equivalent to some serial schedule in which Ti precedes Tj).

Conflicting reads and writes execute in timestamp order

Any conflicting READ or WRITE operations execute in timestamp order.
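The basic rules can be sketched as a simplified single-version TO check (an abort is signaled by returning False; this is an illustration of the textbook rules, not DAOS's variant):

```python
class Item:
    def __init__(self):
        self.rts = 0  # largest timestamp that has read this item
        self.wts = 0  # timestamp of the writer of the current value

def to_read(ts, item):
    """Reject a read that arrives with a timestamp older than the current writer."""
    if ts < item.wts:          # a younger txn already overwrote the value
        return False           # abort the reader
    item.rts = max(item.rts, ts)
    return True

def to_write(ts, item):
    """Reject a write that would invalidate a newer read or a newer write."""
    if ts < item.rts or ts < item.wts:
        return False           # abort the writer
    item.wts = ts
    return True

x = Item()
ok1 = to_write(ts=5, item=x)   # T5 writes x: allowed
ok2 = to_read(ts=3, item=x)    # T3's read arrives after T5's write: abort
```

The three conflict diagrams below are case analyses of exactly these two comparisons.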

Algorithm

Explanation

Write-read conflicts

begin T2                                  R(X)
   |---------------------------------------|---------------------- case 1
              begin T2                    R(X)
                 |-------------------------|---------------------- case 2
          begin T1            W(X)
             |-----------------|-----------------------------------
                                   begin T2         R(X)
                                      |--------------|------------ case 3

------------------------------- Time line ------------------------>

case 1: T2 begins before T1, and T2's R(X) comes after T1's W(X); T2's read is aborted to avoid a dirty read

case 2: T2 begins after T1, and T2's R(X) comes after T1's W(X); T2's read is aborted to avoid a dirty read

case 3: T2 begins after T1's W(X); T2 blocks and waits:

  1. If T1 commits, T2 is unaffected
  2. If T1 aborts, T2 must abort too, to avoid a dirty read

Summary:

  1. If the current transaction began before another transaction's write, its read must abort
  2. If the current transaction began after another transaction's write and that transaction commits, it may read.

Read-write conflicts

begin T2                                  W(X)
   |---------------------------------------|---------------------- case 1
              begin T2                    W(X)
                 |-------------------------|---------------------- case 2
          begin T1            R(X)
             |-----------------|-----------------------------------

------------------------------- Time line ------------------------>

case 1: T2 begins before T1, and T2's W(X) comes after T1's R(X); T2's write is aborted to avoid a non-repeatable read in T1

case 2: T2 begins after T1, and T2's W(X) comes after T1's R(X); T2's write is aborted to avoid a non-repeatable read in T1

Summary: when the read comes first and the write second, roll back the write.

Write-write conflicts

begin T2                                  W(X)
   |---------------------------------------|---------------------- case 1
              begin T2                    W(X)
                 |-------------------------|---------------------- case 2
          begin T1            W(X)
             |-----------------|-----------------------------------

------------------------------- Time line ------------------------>

case 1: T2 begins before T1, and T2's W(X) comes after T1's W(X); T2's write is aborted, because the value T2 wants to write is the value from before T1 intervened

case 2: T2 begins after T1, and T2's W(X) comes after T1's W(X); T2's write is aborted for the same reason

Q: Can T2's W(X) happen before T1's W(X)?

A: Suppose Wt2(X) < Wt1(X), i.e. T2's write completed before T1's. T2's write would then be the later-stamped write and would need to be aborted, yet X's write timestamp was updated to Wt1(X), meaning T1's write succeeded; these contradict each other, so T2's W(X) cannot precede T1's W(X).

Summary: for write-write conflicts, roll back the later write.

Pseudocode walkthrough

Check every value that Ti will use:

  1. If it intends to read the value (the scenario in the first diagram):
    1. If the current transaction began before the write, the read must roll back and restart (otherwise it may be a dirty read)
    2. If the current transaction began after the write, add the writing transaction to the current transaction's dependency list, DEP(Ti).add(WTS(Oi)), and update the value's read timestamp
  2. If it intends to write the value:
    1. If the current transaction began before a read, roll back the write (otherwise the other transaction's reads may become non-repeatable) (the scenario in the second diagram)
    2. If the current transaction began before the write, abort the current transaction's write, or skip it per the Thomas Write Rule (the scenario in the third diagram)
    3. Otherwise the check passes: save the old value, OLD(Ti).add(Oi, WTS(Oi)), update the value's write timestamp WTS(Oi) = TS(Ti), and update the value itself

If DEP(Ti) contains unfinished transactions, wait.

If DEP(Ti) contains an aborted transaction, the read in 1.2 must also abort.

Abort flow: restore the old values and the old timestamps.

Time-stamp ordering in DAOS

Pseudocode

A read at epoch e follows these rules (corresponding to the three write-read conflict cases):

// Epoch uncertainty check
if e is uncertain
    // write-read conflict: the read began after another txn's write
    // (is the read's actual time necessarily after e_orig + epsilon?)
    if there is any overlapping, unaborted write in (e, e_orig + epsilon]
        reject // roll back the read (cases 1 & 2)
find the highest overlapping, unaborted write in [0, e] // write-read conflict: the read began after the other txn's write
if the write is not committed // wait
    wait for the write to commit or abort
    if aborted
        retry the find skipping this write // if the writer aborted, the read retries (case 3)
// Read timestamp update
for level i from container to the read's level lv
    update i.high // update the timestamp cache
update lv.low

e: the transaction's HLC start time; the actual physical start time may lie anywhere in (e, e_orig + epsilon).

epsilon: the error bound between HLC time and actual physical time.

A write at epoch e follows these rules (corresponding to the read-write and write-write conflict cases):

// Epoch uncertainty check
if e is uncertain
    // write-write conflict: the later write aborts (the current write is treated as after e_orig + epsilon)
    if there is any overlapping, unaborted write in (e, e_orig + epsilon]
        reject
// Read timestamp check: detect read-write conflicts (read first, write later)
for level i from container to one level above the write
    // read-write conflict: if a parent's low exceeds the current txn's start time, every child under
    // that parent was read at time low, so the write's target value was read at low; the current txn
    // began before that read, so it must abort
    // (i.low == e) && (other reader @ i.low) --> check whether the read belongs to the same txn;
    // a txn's own read does not conflict with its own write
    // these parent-level checks are presumably a fast path to detect conflicts without recursing
    // down to the target entry
    if (i.low > e) || ((i.low == e) && (other reader @ i.low))
        reject
// check the target entry: read-write conflict, abort the write
if (i.high > e) || ((i.high == e) && (other reader @ i.high))
    reject
// the purpose of this overlapping check is unclear; it appears to also detect write-write conflicts
// the write-write check above does not cover the point e itself (the interval is open at e);
// it is unclear why the checks are split
find if there is any overlapping write at e
if found and from a different transaction
    reject

i.low: the target node and all of its children have been read at least as recently as low.

i.high: at least one child of the target node was read at time high.

Code snippets in DAOS that check conflicts

Checking read-write conflicts

// called on a write to check for a read-write conflict
bool
vos_ts_check_read_conflict(struct vos_ts_set *ts_set, int idx,
			   daos_epoch_t write_time)
{
	struct vos_ts_set_entry	*se;
	struct vos_ts_entry	*entry;
	int			 write_level;
	bool			 conflict;

	D_ASSERT(ts_set != NULL);

	se = &ts_set->ts_entries[idx];
	entry = se->se_entry;

	if (ts_set->ts_wr_level > ts_set->ts_max_type)
		write_level = ts_set->ts_max_type;
	else
		write_level = ts_set->ts_wr_level;

	if (se->se_etype > write_level)
		return false; /** Check is redundant */

	/** NB: If there is a negative entry, we should also check it. Otherwise, we can miss
	 * timestamp updates associated with conditional operations where the tree exists but
	 * we don't load it. (The meaning of "negative" here is not clear to me.)
	 */
	/* Check the target value's parent nodes:
	 * for level i from container to one level above the write
	 */
	if (se->se_etype < write_level) {
		/* check the low time */
		conflict = vos_ts_check_conflict(entry->te_ts.tp_ts_rl, &entry->te_ts.tp_tx_rl,
						 write_time, &ts_set->ts_tx_id);

		/* conflict == true                     --> conflict
		 * conflict == false, no negative entry --> no conflict
		 */
		if (conflict || entry->te_negative == NULL)
			return conflict;

		/* conflict == false and a negative entry exists: check the negative entry;
		 * a conflict with the negative entry also counts as a conflict
		 */
		return vos_ts_check_conflict(entry->te_negative->te_ts.tp_ts_rl,
					     &entry->te_negative->te_ts.tp_tx_rl,
					     write_time, &ts_set->ts_tx_id);
	}

	/* check the high time of the target value:
	 * if (i.high > e) || ((i.high == e) && (other reader @ i.high))
	 */
	conflict = vos_ts_check_conflict(entry->te_ts.tp_ts_rh, &entry->te_ts.tp_tx_rh, write_time,
					 &ts_set->ts_tx_id);

	if (conflict || entry->te_negative == NULL)
		return conflict;

	/* conflict == false and a negative entry exists: check the negative entry as above */
	return vos_ts_check_conflict(entry->te_negative->te_ts.tp_ts_rh,
				     &entry->te_negative->te_ts.tp_tx_rh, write_time,
				     &ts_set->ts_tx_id);
}


/* helper for the read-write conflict check:
 * returns true on conflict, false otherwise
 */
static inline bool
vos_ts_check_conflict(daos_epoch_t read_time, const struct dtx_id *read_id,
		      daos_epoch_t write_time, const struct dtx_id *write_id)
{
	/* the current txn began after the read: no conflict */
	if (write_time > read_time)
		return false;

	/* the current txn began before the read: conflict, abort the current txn's write */
	if (write_time != read_time)
		return true;

	/* the write and the other txn's read were issued at the same time, but the two
	 * txns did not start together: conflict
	 */
	if (read_id->dti_hlc != write_id->dti_hlc)
		return true;

	/* the write and the other txn's read were issued at the same time, and the txns
	 * started together:
	 * 1. same transaction ID: no conflict (read and write within one txn)
	 * 2. different transaction IDs: conflict
	 */
	return uuid_compare(read_id->dti_uuid, write_id->dti_uuid) != 0;
}

Checking write-write and write-read conflicts

// TODO: the scenarios in case 2 and case 3 are not fully understood

case1: <-----------x----------------|----------------o-----------------o------------------>
                current           bound             U1                 U2
case2: <-----------o----------------v----------------|-----------------o------------------>
                  U1             current           bound               U2
case3: <-----------o----------------x----------------o-----------------|------------------>
                  U1             current             U2              bound
case4: <-----------o----------------o----------------v-----------------|------------------>
                  U1                U2            current            bound
/** Do an uncertainty check on the entry.  Return true if there
 * is a write within the epoch uncertainty bound or if it
 * can't be determined that the epoch is safe (e.g. a cache miss).
 *
 * There are the following cases for an uncertainty check
 * 1. The access timestamp is earlier than both. In such
 *    case, we have a cache miss and can't determine whether
 *    there is uncertainty so we must reject the access.
 * 2. The access is later than the first and the bound is
 *    less than or equal to the high time. No conflict in
 *    this case because the write is outside the uncertainty
 *    bound.
 * 3. The access is later than the first but the bound is
 *    greater than the high timestamp. We must reject the
 *    access because there is an uncertain write.
 * 4. The access is later than both timestamps. No conflict
 *    in this case.
 *
 * \param[in] ts_set The timestamp set
 * \param[in] epoch The epoch of the update
 * \param[in] bound The uncertainty bound
 */
static inline bool
vos_ts_wcheck(struct vos_ts_set *ts_set, daos_epoch_t epoch,
	      daos_epoch_t bound)
{
	struct vos_wts_cache	*wcache;
	struct vos_ts_set_entry	*se;
	uint32_t		 high_idx;
	daos_epoch_t		 high;
	daos_epoch_t		 second;

	if (!vos_ts_in_tx(ts_set) || ts_set->ts_init_count == 0 ||
	    bound <= epoch)
		return false;

	se = &ts_set->ts_entries[ts_set->ts_init_count - 1];

	if (se->se_entry == NULL)
		return false;

	wcache = &se->se_entry->te_w_cache;
	high_idx = wcache->wc_w_high;
	high = wcache->wc_ts_w[high_idx];
	/* The txn began after every other txn's write: no conflict at all.
	 * In write-read case 3, a read that begins after another txn's write must return and wait.
	 * In write-write conflicts, a write that begins after the other write does not conflict.
	 */
	if (epoch >= high) /* Case #4, the access is newer than any write */
		return false;

	second = wcache->wc_ts_w[1 - high_idx];
	/* The txn began before the Tlow at which other txns wrote this value: the current write
	 * must abort.
	 * In write-read conflicts, a read beginning before the write must abort (cases 1 and 2).
	 * In write-write conflicts, the later-stamped write must abort (cases 1 and 2).
	 */
	if (epoch < second) /* Case #1, Cache miss, not enough history */
		return true;

	/* We know at this point that second <= epoch so we need to determine
	 * only if the high time is inside the uncertainty bound.
	 */
	if (bound >= high) /* Case #3, Uncertain write conflict */
		return true;

	return false;
}

Timestamp-based multi-version concurrency control in DAOS

How timestamp-ordered MVCC works

MVCC is not a standalone concurrency control technique; it layers on top of another one. Based on timestamps it is called the multiversion timestamp-ordering scheme; based on two-phase locking it is called the multiversion two-phase locking protocol.

The basics of multiversion timestamp ordering:

  1. The database assigns each transaction a timestamp TS(Ti) before it starts; this timestamp determines the scheduling order of the concurrent transactions.
  2. For each data item X, multiversioning means X has a version sequence <X1, X2, …, Xn>, where each version Xi has three fields:
    1. Xi = value: the value of X's i-th version; each version is produced by one write.
    2. W-timestamp(Xi): the timestamp of the transaction that created version Xi (not the current time), i.e. who created this item and when.
    3. R-timestamp(Xi): the largest timestamp of any transaction that successfully read version Xi.
  3. Multiversion timestamp ordering then guarantees serializability with the following rules:
    1. For a Read or Write by Ti, let Xm be the version of X whose write timestamp is the largest write timestamp less than or equal to TS(Ti) (this finds the "latest visible version" among all versions).
    2. If Ti executes Read(X), it is returned the value of Xm. Reads never block.
    3. If Ti executes Write(X):
      1. If TS(Ti) < R-timestamp(Xm), abort Ti: a read has already happened at a time later than this write, and allowing the write could produce a non-repeatable read. This is a write-read conflict, and Ti is aborted.
      2. If TS(Ti) = W-timestamp(Xm), overwrite Xm with the new value: the transaction wrote the same item more than once, and the new value replaces the old.
      3. If TS(Ti) > W-timestamp(Xm), create a new version of X for Ti: the later transaction creates a new version. This is a write-write conflict, producing a new version.
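The version-selection and write rules above can be sketched as a toy multi-version item (this illustrates the textbook scheme, not DAOS's implementation; version timestamps are assumed unique per writer):

```python
import bisect

class MVItem:
    """Versions kept sorted by write timestamp; each version is [wts, value, rts]."""
    def __init__(self, initial):
        self.versions = [[0, initial, 0]]

    def _visible(self, ts):
        """Index of Xm: the version with the largest wts <= ts."""
        return bisect.bisect_right([v[0] for v in self.versions], ts) - 1

    def read(self, ts):
        """Reads never block; they return Xm and bump its read timestamp."""
        v = self.versions[self._visible(ts)]
        v[2] = max(v[2], ts)
        return v[1]

    def write(self, ts, value):
        v = self.versions[self._visible(ts)]
        if ts < v[2]:
            return False         # write-read conflict: abort the writer
        if ts == v[0]:
            v[1] = value         # same txn rewrites its own version
        else:
            bisect.insort(self.versions, [ts, value, 0])  # new version
        return True

x = MVItem("v0")
x.write(10, "v1")
old = x.read(5)    # T5 sees the initial version
new = x.read(20)   # T20 sees T10's version
```

After T5's read bumps the initial version's read timestamp to 5, a write stamped 3 targets that same version and is rejected, which is exactly rule 3.3.1.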

Code snippets for DAOS's timestamp-based MVCC

TODO

Questions:

  1. Why does each entry in DAOS record two write timestamps?

    The classic timestamp-ordering algorithm records only one read timestamp and one write timestamp.

    The DAOS documentation says:

    In order to detect epoch uncertainty violations, VOS also maintains a pair of write timestamps for each container, object, dkey, and akey. Logically, the timestamps represent the latest two updates to either the entity itself or to an entity in a subtree. At least two timestamps are required to avoid assuming uncertainty if there are any later updates. The figure below shows the need for at least two timestamps. With a single timestamp only, the first, second, and third cases would be indistinguishable and would be rejected as uncertain.

    Per this description, recording two write timestamps reduces the number of spurious conflicts in uncertain scenarios. Under the classic timestamp-ordering algorithm, cases 2, 3, and 4 in the previous subsection would all be aborted; with two timestamps, case 2 can avoid aborting. The reasoning for not aborting there is still unclear to me.

  2. What does the negative field in a timestamp entry represent, and why must conflict checks also examine it?

    NB: If there is a negative entry, we should also check it. Otherwise, we can miss timestamp updates associated with conditional operations where the tree exists but we don't load it

Other references:

CockroachDB's description of the timestamp cache:

Read-Write Conflicts – Read Timestamp Cache

On any read operation, the timestamp of that read operation is recorded in a node-local timestamp cache. This cache will return the most recent timestamp at which the key was read.

All write operations consult the timestamp cache for the key they are writing; if the returned timestamp is greater than the operation timestamp, this indicates a RW conflict with a later timestamp. To disallow this, the operation (and its transaction) must be aborted and restarted with a later timestamp.

The timestamp cache is an interval cache, meaning that its keys are actually key ranges. If a read operation is actually a predicate operating over a range of keys (such as a scan), then the entire scanned key range is written to the timestamp cache. This prevents RW conflicts where the key being written was not present during the scan operation.

The timestamp cache is a size-limited, in-memory LRU (least recently used) data structure, with the oldest timestamps being evicted when the size limit is reached. To deal with keys not in the cache, we also maintain a “low water mark”, which is equivalent to the earliest read timestamp of any key that is present in the cache. If a write operation writes to a key not present in the cache, the “low water mark” is returned instead.

Write-Write Conflicts – Can only write the most recent version of a key

If a write operation attempts to write to a key, but that key already has a version with a later timestamp than the operation itself, allowing the operation would create a WW conflict with the later transaction. To ensure serializability, the operation (and its transaction) must be aborted and restarted with a later timestamp.

By choosing a timestamp-based ordering, and rejecting all conflicts which disagree with that ordering, CockroachDB’s Serializable Snapshot guarantees a serializable schedule.

Timestamp cache

The timestamp cache tracks the highest timestamp (i.e., most recent) for any read operation that a given range has served.

Each write operation in a BatchRequest checks its own timestamp versus the timestamp cache to ensure that the write operation has a higher timestamp; this guarantees that history is never rewritten and you can trust that reads always served the most recent data. It’s one of the crucial mechanisms CockroachDB uses to ensure serializability. If a write operation fails this check, it must be restarted at a timestamp higher than the timestamp cache’s value.

The Raft protocol

Flow:

  1. Leader election: vote and confirm
  2. Clients send log entries to the leader
  3. The leader distributes the log; F+1 nodes acknowledge
  4. The leader commits the log, replies to the client, and propagates the commit information

Raft guarantees two properties:

  1. Safety. They never return incorrect results under all non- Byzantine conditions.
  2. Liveness. They are fully functional as long as any majority of the servers are alive and can communicate with each other and with clients. We call this group of servers healthy.

The safety property is guaranteed by the Leader Completeness Property: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.

Liveness is guaranteed by Raft's rules.

Erasure Coding

Erasure codes are a general fault-tolerance technique used in both storage and network transmission.

Reed-Solomon codes (RS codes)

Principle:

Given n data blocks d1, d2, …, dn and a positive integer m, RS generates m parity blocks c1, c2, …, cm from the n data blocks. For any n and m, any n blocks taken from the n data blocks and m parity blocks suffice to decode the original data, i.e. RS tolerates the simultaneous loss of up to m data or parity blocks (erasure codes tolerate only data loss, not data tampering; that is where the name comes from).
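A full RS code needs Galois-field arithmetic; as a minimal illustration of the erasure idea, the m = 1 case degenerates to XOR parity, where any single lost block is recoverable from the survivors (a toy sketch, not an RS implementation; equal-length blocks assumed):

```python
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(blocks):
    """m = 1 'erasure code': the parity block is the XOR of all data blocks."""
    return reduce(xor_blocks, blocks)

def recover(blocks, parity, lost_idx):
    """Rebuild the single lost block by XOR-ing the parity with the survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_idx]
    return reduce(xor_blocks, survivors, parity)

data = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(data)
rebuilt = recover(data, parity, lost_idx=1)
```

Recovery works because XOR-ing the parity with every surviving block cancels them out, leaving exactly the lost block; real RS codes generalize this so that any m losses are tolerated.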

Identity matrix:

Definition: in linear algebra, the n×n identity matrix I_n is the square matrix whose main-diagonal entries are 1 and whose remaining entries are 0.

Property: A·I_n = A and I_n·B = B (multiplying any matrix by the identity matrix yields the matrix itself).

Vandermonde matrix:

Definition:

V = [ 1  α1  α1^2  …  α1^(n-1)
      1  α2  α2^2  …  α2^(n-1)
      1  α3  α3^2  …  α3^(n-1)
      …   …    …    …    …
      1  αm  αm^2  …  αm^(n-1) ]

A Vandermonde matrix is used because the RS recovery algorithm requires every n×n submatrix of the encoding matrix to be invertible.

Inverse matrix:

Property: a matrix multiplied by its inverse yields the identity matrix.

Example:

Suppose D1, D4, and C2 are lost. How does the RS code recover them?

  1. From the encoding equation, derive the reduced equation B' (drop the rows of the encoding matrix corresponding to the lost fragments).
  2. Since the (reduced) encoding matrix is invertible, multiply both sides of the equation by its inverse.
  3. By the inverse-matrix property above, the matrix times its inverse is the identity, which yields an equation for the original data.
  4. So re-encoding the surviving fragments with the inverse of the encoding matrix recovers the original fragments.

Definitions used by this paper for RS codes:

RS is parameterized by two integers, k and m. The data is split into k equal fragments, which are used to encode m parity fragments, giving k + m coded fragments in total; these are called coded-fragments. The cluster has N servers, each storing one coded-fragment, so k + m = N.

RS-Paxos

RS-Paxos also combines Paxos with RS erasure coding, which greatly reduces the storage cost; the price is that its liveness is weaker than Paxos's.

With 2F + 1 nodes in the cluster, Paxos tolerates F node failures, so define its liveness as F. Both Paxos and Raft rely on the inclusion-exclusion principle:

|A ∪ B| = |A| + |B| − |A ∩ B|  (1)

(1) is the inclusion-exclusion principle for two sets:

  1. If finite sets A and B are disjoint, then |A ∪ B| = |A| + |B|
  2. If they intersect, |A| + |B| counts the intersection twice, so subtract |A ∩ B| once

The inclusion-exclusion principle guarantees that in a Raft group, any two majorities share at least one node.

Proof: if |A| > N/2 and |B| > N/2 (N the group size), then |A ∩ B| = |A| + |B| − |A ∪ B| ≥ |A| + |B| − N > 0.

Claim: RS-Paxos's liveness is less than F.

Proof:

  1. RS-Paxos requires Qr + Qw − N ≥ k (2): the intersection of a read quorum Qr and a write quorum Qw contains at least k servers
  2. RS-Paxos needs at least max(Qr, Qw) healthy nodes to make progress
  3. If a command is committed, at least Qw nodes accepted it
  4. To propose a command, a node must contact at least Qr nodes in the prepare phase
  5. Let RS-Paxos's liveness be L

1. L <= N - max(Qr, Qw)
2. By (2), max(Qr, Qw) >= (Qr + Qw)/2 >= (N + k)/2
3. Combining the two: L <= N - (N + k)/2 = (N - k)/2 = (2F + 1 - k)/2 = F - (k - 1)/2 < F for k > 1
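The bound can be checked numerically for sample parameters (N = 2F + 1; the helper name is made up):

```python
import math

def rs_paxos_liveness_bound(n_servers, k):
    """Upper bound on RS-Paxos liveness: L <= N - ceil((N + k) / 2)."""
    return n_servers - math.ceil((n_servers + k) / 2)

f = 3
n = 2 * f + 1                          # 7 servers tolerate F = 3 failures under plain Paxos
bound = rs_paxos_liveness_bound(n, k=3)
```

With N = 7 and k = 3, the quorum size max(Qr, Qw) is at least 5, so at most 2 failures are tolerable, strictly fewer than Paxos's F = 3.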

CRaft

Design goals:

  1. Provide the same liveness as classical Raft (achieved via restriction 1)
  2. Reduce the storage cost via RS codes

Restriction:

  1. In a Raft group of 2F+1 nodes, when only F+1 nodes are alive, a log entry may be committed only after it has been fully replicated to all F+1 healthy nodes.

    Proof sketch: in a 2F+1-node group, suppose at time T1 only F+1 servers are alive (server S plus the rest), entry e has been committed, and S stores only one fragment of e. At time T2, S stays alive, the other previously healthy nodes all become unhealthy, and the previously unhealthy nodes all become active. Now only S holds a fragment of e, so the complete log entry cannot be reconstructed.

Coded-fragment replication and complete-entry replication

Coded-fragment replication

Flow:

  1. Encode the original log entry into k+m fragments
  2. The leader distributes the fragments
  3. After at least F+k slaves have successfully received their fragment and responded, the leader commits the entry
  4. The leader tells the slaves to commit the entry

Why F+k?

Compared with classical Raft, where the leader commits after F+1 positive responses from slaves, CRaft's commit condition is stricter: with coded-fragment replication, F+k responses are required. F+k preserves the liveness property: it guarantees that after a leader election, the new leader can correctly reconstruct the complete log entry from the fragments.

Proof:

Suppose the leader stops responding at time T1 and an election produces a new leader. Raft's election rule guarantees the elected leader has the most complete, up-to-date log; if entry e was committed, the new leader must hold a fragment of e. Since e was committed, at least F+k servers store e's coded-fragments:

Let A be the set of F + k acceptors and B the set of F + 1 voters.
|A| + |B| = (F + k) + (F + 1) = 2F + 1 + k
Since 2F + 1 = N,
|A| + |B| − N = k                                               (1)
From |A ∪ B| = |A| + |B| − |A ∩ B|: |A ∩ B| = |A| + |B| − |A ∪ B|   (2)
Since A and B are both subsets of the Raft group, |A ∪ B| <= N,
so |A ∩ B| = |A| + |B| − |A ∪ B| >= |A| + |B| − N
Substituting (1): |A ∩ B| >= k

It follows that any F + 1 servers of the Raft group include at least k servers holding coded-fragments of e.

Since an RS code needs at least k coded-fragments to recover the data, the condition is satisfied.

Counterexample

With 2 × 3 + 1 = 7 nodes, the group can tolerate three node failures, and (k, m) = (3, 4): a log entry is coded into 3 fragments plus 4 parities, 7 coded fragments in total, tolerating the loss of 4 fragments; at least 3 fragments are needed to reconstruct the complete entry.
Suppose we committed after only 3 + 1 = 4 acknowledgments, e.g. for log3. At time T2 only 4 nodes remain active, and because they hold only two fragments, the complete log cannot be reconstructed.

Complete-entry replication

When is it triggered?

Coded-fragment replication needs at least F + k live nodes to preserve liveness, so when the number of healthy nodes in the Raft group falls into [F + 1, F + k), the protocol must switch to complete-entry replication.

Commit condition

Same as classical Raft: an entry can be committed once F+1 slaves have acknowledged it.

Optimization

After an entry commits, nodes that have not yet received it can be served via coded-fragment replication to save storage cost.

Define a parameter p, 0 ≤ p ≤ F: the leader first sends complete entries to F+p followers, then sends fragments to the remaining F−p followers. Storage and network cost grow with p; at p = F the scheme degenerates to classical Raft. The paper's implementation uses p = 0.

Strategy prediction

Algorithm

Use heartbeats to determine the number of active servers:

  1. If there are at least F+k active servers, use coded-fragment replication
  2. If there are fewer than F+k active servers, use complete-entry replication

Newly-elected leader

A newly elected leader may hold two kinds of log entries:

  1. Committed entries: coded-fragment replication guarantees at least k fragments survive to reconstruct the entry, and complete-entry replication guarantees some other server stores the complete entry, so committed entries can always be recovered correctly on the new leader.
  2. Unapplied entries: no mechanism guarantees these can be recovered correctly, so CRaft introduces the LeaderPre operation.

LeaderPre operation

Flow:

  1. The newly elected leader finds its local unapplied coded fragments
  2. For each such entry, it sends a request to all followers
  3. If it receives a complete copy of the entry, or k fragments, it reconstructs the entry but does not commit or apply it
  4. If the entry cannot be reconstructed, it is deleted
  5. Throughout this process, the leader keeps sending heartbeats to the slaves to avoid triggering a new election

Unrecoverable implies uncommitted, so the deletion is harmless.

Detailed proof

Leader Completeness Property (LCP): if entry e is committed in term T, then e is present in the log of every leader of any higher term, and is never deleted by a leader of a higher term.

Proof by contradiction:

Suppose the LCP does not hold. There are two scenarios:

  1. e is absent from the new leader's log when the leader of term U is elected.
  2. e is deleted during the LeaderPre phase of the term-U leader's election.

Scenario 1: by assumption, every leader between term T and term U holds e, and e was never deleted in any LeaderPre phase, so e has not been removed from any node since term T. Leader_T replicated e to at least F+1 nodes, and Leader_U received votes from at least F+1 nodes. By inclusion-exclusion, (F+1) + (F+1) = N + 1 > N, so at least one node both accepted e and voted for Leader_U. That node must have accepted e before voting for Leader_U; otherwise it would have rejected Leader_T's AppendEntries RPC. Since e was never deleted after term T and this node voted for Leader_U, Leader_U's log must be at least as up-to-date as this node's, so Leader_U's log must contain e, contradicting the assumption of scenario 1. (This argument follows the proof in the original Raft paper.)

Scenario 2: suppose e is deleted in term U, and let e1 be an entry that cannot be recovered. Suppose e1 is not e; since e was deleted, e1's index must be smaller than e's. Since e was committed by Leader_T in term T and e1 precedes it, e1 was also committed. Then e1 itself violates the Leader Completeness Property with an index smaller than e's, a contradiction, so e1 must be e. (Isn't this step redundant? The premise already assumes e is the entry deleted in term U.) So e is deleted in LeaderPre because it cannot be recovered. But since e was committed in term T, it was either replicated in full to at least F+1 nodes, or partially replicated as EC fragments to at least F+k nodes, so e must be recoverable. This contradicts scenario 2.

Performance

Overview

  1. After a new leader is elected it must recover logs, which hurts performance
  2. LeaderPre lengthens elections, which hurts availability
  3. There is a trade-off between the cost of the new leader recovering every complete entry and the resulting lower read latency
  4. Lagging nodes currently cannot catch up via snapshots; a snapshot mechanism would be much more complex in CRaft
  5. Encoding and decoding consume CPU

Latency

With small values there is almost no difference; with large values CRaft's advantage is clear, because it only distributes coded fragments. (They reduce 20%–45% of latency compared to Raft when N = 5, and 20%–60.8% when N = 7.)

Throughput

With large values the throughput improvement is significant; past a critical value size the total throughput drops because of network congestion. (CRaft and RS-Raft can have a 180% improvement on write throughput when N = 5 and 250% when N = 7. Also, the throughput peaks of CRaft and RS-Raft both appear much later than Raft's.)

Network Cost

Measured at the leader: with N = 7 and various values of k, the Raft leader sends 250%–300% as much data as the CRaft leader.

Liveness

With 5 healthy nodes, CRaft's throughput is lower than RS-Paxos's, because CRaft must use complete-entry replication.

With 4 healthy nodes, RS-Paxos cannot make progress at all.

Nodes   Failures CRaft tolerates   Failures RS-Paxos tolerates
5       2                          1
7       3                          2

Recovery Read

After a leader switch, the first read in CRaft may require the leader to run recovery, because the leader may hold only some fragments of a log entry. If the leader does not switch frequently, CRaft's read latency is similar to Raft's. A recovery read costs roughly 1.4× Raft's latency; repeated reads of the same key are then essentially on par.

Conclusion

                  Raft     RS-Paxos          CRaft
Storage cost      2F + 1   2F / k + 1        2F / k + 1
Network cost      2F       2F / k            2F / k
Liveness level    F        F - (k - 1) / 2   F
Acks to commit    F + 1    F + 1             F + k

Q&A:

  1. What is the constraint between m and k, and how valuable is this algorithm in real industrial settings?

Given: m + k = N (1)

2F + 1 = N (2)

Required: F + k < N (3), i.e. committing must not require every node (no commit_all)

From (2)(3): (N - 1) / 2 + k < N, so k < (N + 1) / 2 (4)

From (4): m + k < m + (N + 1) / 2 (5)

From (1)(5): N < m + (N + 1) / 2, so m > (N - 1) / 2 (6)

From (2)(6): m > (2F + 1 - 1) / 2, so m > F

Since m is an integer, m is at least F + 1.

When m = F + 1, (1) gives k = F, and then F + k = 2F: in the worst case, 2F nodes must acknowledge before a log can be committed.
k      m          commit threshold
F      F + 1      2F
F - 1  F + 2      2F - 1
F - 2  F + 3      2F - 2
F - x  F + x + 1  2F - x

The threshold must satisfy 2F - x >= F + 1, i.e. x <= F - 1.

k      m          commit threshold
F + 1  F          2F + 1 (commit_all)
F + 2  F - 1      2F + 2 (CRaft unusable)
F + 3  F - 2      2F + 3 (CRaft unusable)
F + x  F - x + 1  2F + x (CRaft unusable for x >= 2)

Conclusions:

  1. By the derivation above, the smaller k is relative to m, the looser the commit condition
  2. When k exceeds m, the smallest such k is F + 1 with m = F, and then F + k = 2F + 1 = N, i.e. commit_all; any larger k violates the constraint and CRaft cannot be used
  3. In real deployments k is usually larger than m, so the conditions under which CRaft applies are rather limited.

GCC hello_world

Preprocess the .c file into a .i file

gcc -E hello.c -o hello.i

This is the preprocessing stage.

Compile the .i file into a .s file

gcc -S hello.i -o hello.s

This is the compilation stage, which turns the preprocessed file into an assembly file.

Assemble the .s file into a .o object file

gcc -c hello.s -o hello.o

This stage turns the assembly into binary machine code.

Inspect the binary with readelf

readelf -a hello.o

Link the .o object file into the final executable

## dynamic linking
gcc hello.c -o hello
## static linking
gcc hello.c -o hello --static

Options

Option list

Option  Meaning
-ansi  Support only ANSI-standard C; disables GNU C extensions such as the asm and typeof keywords.
-c  Compile only; produce an object file.
-DMACRO  Define MACRO as the string "1".
-DMACRO=DEFN  Define MACRO as the string "DEFN".
-E  Run only the C preprocessor.
-g  Emit debug information for use by the GNU debugger.
-IDIRECTORY  Add DIRECTORY to the header search path.
-LDIRECTORY  Add DIRECTORY to the library search path.
-lLIBRARY  Search the library LIBRARY when linking.
-m486  Optimize the code for the 486.
-o FILE  Name the output file; used when producing an executable.
-O0  No optimization.
-O or -O1  Optimize the generated code.
-O2  Optimize further.
-O3  Optimize beyond -O2, including inlining functions.
-shared  Produce a shared object, usually for building a shared library.
-static  Disable shared linking.
-UMACRO  Undefine the macro MACRO.
-w  Suppress all warnings.
-Wall  Enable all warnings.

GDB single-stepping

gdb <binary>           start gdb on an executable
r <args>               run with the given argument list
b <line>               set a breakpoint at a line
bt                     show the stack backtrace
print                  print the value of an expression
continue               continue after a breakpoint
list                   list part of the source
next                   execute the next line after a stop
set                    set a variable's value
step                   step into a function call
watch                  watch an expression's value
rwatch                 stop the program when a variable is read
awatch                 stop the program when a variable is read or written
disable                disable breakpoints
display                show an expression's value wherever the program stops
enable                 enable breakpoints
finish                 continue until the current function returns
ignore                 ignore a breakpoint a given number of times
info                   show information
load                   dynamically load a program into the debugger
whatis                 show a variable's value and type
ptype                  show a variable's type
return                 force a return from the current function
make                   regenerate the executable without leaving gdb
shell                  run a linux shell command without leaving gdb
help
quit

Common flags

Flag  Meaning
-e    specify the executable
-c    specify the coredump file
-d    add a directory to the source search path
--cd  use the given directory as gdb's working directory
-s    read the symbol table from the given file
-p    attach to the given process

Debugging a process

gdb -p <pid>
gdb attach <pid>

Debugging threads

info thread      // list the threads of the running process
thread <n>       // switch to thread n

Inspecting state

(gdb) info  thread     // list threads
(gdb) info  registers  // list registers
(gdb) info  frame      // show the current stack frame
(gdb) info  files      // show the current files
(gdb) info  share      // list the shared libraries currently loaded

Setting program arguments

set args 1 2 3 4
show args

Other commands

path                  set the program's search path
show path
set environment K=v   set an environment variable
show environment      show the environment variables
show language         show the language setting
info frame            show the language of the current frame
info source           show the language of the current source file
info breakpoints      list all breakpoints
info terminal         show the terminal modes the program uses

Adding breakpoints

break function            stop on entry to the given function
break n                   stop at line n
break +offset / -offset   stop offset lines after or before the current line
break filename:linenum    stop at line linenum of filename
break filename:function   likewise, on entry to the function
break *address            stop at the given memory address
break                     with no argument, stop at the next instruction

Deleting breakpoints

delete n   delete breakpoint n
delete     delete all breakpoints
clear n    delete all breakpoints on line n

Controlling execution

run/r           run until a breakpoint is hit
continue/c      continue until the next breakpoint
next/n          execute the next statement, stepping over calls
step/s          single-step, stepping into calls
finish          run until the current function returns
jump location   resume execution at the given location
print/p         print a variable's value

Printing

(gdb) print num
(gdb) p num
(gdb) print file::variable
(gdb) print function::variable
(gdb) p *array@len
$1 = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40}
# a static array's contents can be printed with just print + the array name

Listing source

Command  Meaning
list n          show the source around line n
list function   show the source of the named function
list +n         show n lines after the current line
list -n         show n lines before the current line
set listsize    set how many lines are shown at a time
show listsize   show the current listsize setting
# look up the memory address of a source line
(gdb) info line tst.c:func
Line 5 of "tst.c" starts at address 0x8048456 and ends at 0x804845d .

Examining memory

Use the examine command (abbreviated x) to inspect the values stored at a memory address. Its syntax:

(gdb) x/nfu addr

n is a positive integer: how many units to display, counting forward from the address.
f is the display format (see above); s can be used when the address holds a string, and i when it holds instructions.
u is the unit size; the default is 4 bytes. It can be given as a letter: b = 1 byte, h = 2 bytes, w = 4 bytes, g = 8 bytes. Given a unit size, GDB reads that many bytes starting at the address and treats them as one value.
addr is the memory address.
n, f and u can be combined, for example:

(gdb) x/3uh 0x54320 // read from address 0x54320: h = 2-byte units, 3 units, u = unsigned decimal display

Viewing registers

info registers
info all-registers

Showing the stack

(gdb) backtrace [-full] [n]
/* produces a list of every active frame, starting from the most recent, together with the arguments of each call.
n: an integer; a positive n prints the innermost n frames, a negative n prints the outermost n frames.
-full: also print each frame's local variables.
Note: when debugging a multithreaded program, this command only prints the frames of the current thread.
To print the frames of every thread, run thread apply all backtrace. */

Showing a frame

frame (or f) prints: the frame number, the current function, its argument values, the file and line, and the statement being executed.
To see what is stored in the current frame:

(gdb) info frame
Stack level 0, frame at 0x7ffc1da10e80:
rip = 0x7f800008b4e3 in __epoll_wait_nocancel; saved rip = 0x5560eac8b902
called by frame at 0x7ffc1da11280
Arglist at 0x7ffc1da10e70, args:
Locals at 0x7ffc1da10e70, Previous frame's sp is 0x7ffc1da10e80
Saved registers:
rip at 0x7ffc1da10e78

The command prints, in order:
1. the number and address of the current frame;
2. the address of the frame's function, and the address of the code it will return to (the saved rip);
3. the address of the caller's frame;
4. the language the frame's code was written in;
5. the addresses and values of the function's arguments;
6. the addresses of the function's local variables;
7. the registers saved in the frame, e.g. the instruction pointer (rip on 64-bit, eip on 32-bit) and the frame base pointer (rbp on 64-bit, ebp on 32-bit).

Searching source

forward-search
reverse-search

Setting watchpoints

watch <expr>   set a watchpoint on expression (variable) expr; the program stops as soon as the value changes
(gdb) watch i != 10
// here the program stops as soon as the value of the expression i != 10 changes; a watchpoint is a kind of breakpoint.

Setting catchpoints

Catchpoints stop the program on certain runtime events, such as loading a shared (dynamic) library or a C++ exception:
// catch <event>: stop when the event occurs
(gdb) info catch
// print the exception-handling info of the current function
(gdb) info locals
// print all local variables of the current function and their values

Forcing a function call

(gdb) call <expr>
/* <expr> can be a function call, which forces the function to run and prints its return value; if the return type is void, nothing is printed. (In practice, output printed inside the called function did not show up.) */

Ending a session

kill     terminate the program being debugged
detach   detach gdb from the program
q        quit gdb

Pretty printing

set print pretty on

Daos simple Obj

Connect Pool

Create Container

Example_daos_key_array

  1. daos_obj_generate_oid

    1. poh = dc_cont_hdl2pool_hdl(coh); // get the cookie
    2. pool = dc_hdl2pool(poh); // get the pool
      1. fetch the plmap, a doubly linked list
      2. fetch the pool
  2. daos_obj_open(coh, oid, DAOS_OO_RW, &oh, NULL);

    1. rc = dc_obj_open_task_create(coh, oid, mode, oh, ev, NULL, &task); // create the task
    2. dc_task_schedule(tse_task_t *task, bool instant) // run the task
  3. dts_buf_render // fill the buffer

  4. Loop to create ten dkeys:

    1. d_iov_set(&dkey, dkey_str, strlen(dkey_str)); // initialize the dkey

    2. rc = daos_obj_update(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl, NULL); // update the dkey; each dkey holds a single akey, hence one iod and one sgl.

      iod // iovec for memory buffer
      {...}
      iod_name
      iov_buf:0x5555555596e4
      iov_buf_len:4
      iov_len:4
      iod_type:DAOS_IOD_ARRAY
      iod_size:1
      iod_flags:0
      iod_nr:1
      iod_recxs:0x7fffffffd710
      rx_idx:0
      rx_nr:1024
      ------------
      sgl // scatter and gather list for memory buffers
      {...}
      sg_nr:1
      sg_nr_out:0
      sg_iovs:0x7fffffffd740
      iov_buf:0x7fffffffd7c0
      iov_buf_len:1024
      iov_len:1024
      ------------
      daos_recx_t // records the address of the array's first element and the number of elements in it
  5. Loop to fetch the 10 dkeys

  6. Enumerate all the dkeys

  7. Delete one dkey

  8. Enumerate all the dkeys again; only 9 remain

  9. daos_obj_close(oh, NULL); // close the object

example_daos_key_sv()

  1. daos_obj_generate_oid(coh, &oid, 0, OC_SX, 0, 0);

  2. rc = daos_obj_open(coh, oid, DAOS_OO_RW, &oh, NULL);

  3. dts_buf_render(buf, BUFLEN);

  4. rc = daos_obj_update(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl,NULL);

    oh
    {...}
    cookie:71
    dkey
    {...}
    iov_buf:0x7fffffffd780
    iov_buf_len:6
    iov_len:6
    iod
    {...}
    iod_name
    iod_type:DAOS_IOD_SINGLE
    iod_size:1024
    iod_flags:4
    iod_nr:1
    iod_recxs:0x0
    rx_idx
    rx_nr
    {...}
    sg_nr:1
    sg_nr_out:0
    sg_iovs:0x7fffffffd720
    iov_buf:0x7fffffffd7c0
    iov_buf_len:1024
    iov_len:1024
  5. Loop 10 times, creating 10 dkeys, each with one akey and a 1 KB single value:

    1. initialize the parameters

    iod.iod_nr = 1; /** has to be 1 for single value */
    iod.iod_size = BUFLEN; /** size of the single value */
    iod.iod_recxs = NULL; /** recx is ignored for single value */
    iod.iod_type = DAOS_IOD_SINGLE; /** value type of the akey */

    2. rc = daos_obj_update(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl, NULL);
  6. Loop 10 updates:

    1. rc = daos_obj_update(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl, NULL);
  7. Loop 10 fetches:

    1. rc = daos_obj_fetch(oh, DAOS_TX_NONE, 0, &dkey, 1, &iod, &sgl, NULL, NULL);

example_daos_array()

  1. daos_array_generate_oid(coh, &oid, true, 0, 0, 0);

  2. Create the array object with cell size 1 (byte array) and 1m chunk size (similar to stripe size in Lustre). Both are configurable by the user of course :

    rc = daos_array_create(coh, oid, DAOS_TX_NONE, 1, 1048576, &oh,
    NULL);

  3. rc = daos_array_write(oh, DAOS_TX_NONE, &iod, &sgl, NULL);

  4. rc = daos_array_get_size(oh, DAOS_TX_NONE, &array_size, NULL); // check size

  5. rc = daos_array_read(oh, DAOS_TX_NONE, &iod, &sgl, NULL); // read & verify

  6. daos_array_close(oh, NULL); //close

example_daos_kv()

Abstracts away the two-level keys and exposes a single key and atomic value, to present a more traditional KV API.

  1. daos_obj_generate_oid(coh, &oid, DAOS_OT_KV_HASHED, OC_SX, 0, 0);
  2. rc = daos_kv_open(coh, oid, DAOS_OO_RW, &oh, NULL);
  3. dts_buf_render(buf, BUFLEN);
  4. Loop 10 times, putting 10 keys for rank 0:
    1. rc = daos_kv_put(oh, DAOS_TX_NONE, 0, key, BUFLEN, buf, NULL);
  5. Loop 10 times:
    1. rc = daos_kv_get(oh, DAOS_TX_NONE, 0, key, &size, NULL, NULL); // query size & verify
    2. rc = daos_kv_get(oh, DAOS_TX_NONE, 0, key, &size, rbuf, NULL); // get data written into rbuf
    3. memcmp(buf, rbuf, BUFLEN) // verify
  6. list_keys(oh, &num_keys); // 10 keys
  7. rc = daos_kv_remove(oh, DAOS_TX_NONE, 0, key, NULL); // remove a key
  8. list_keys(oh, &num_keys); // 9 keys
  9. daos_kv_close(oh, NULL);

daos_cont_close(coh, NULL);

daos_pool_disconnect(poh, NULL)

daos_fini();

functions

daos_obj_generate_oid

  1. Inputs:
    • daos_handle_t oh : stores an unsigned 64-bit cookie
    • daos_obj_id_t oid : the DAOS object id, stored as two 64-bit ints, lo (low) and hi (high)
    • enum DAOS_OT_KV_HASHED : flat KV (no akey) with integer dkey
    • OC_SX : the object class policy
    • 0 : variadic argument
    • 0 : variadic argument
  2. Implementation: internally calls daos_obj_generate_oid2
    • Inputs:
      • daos_handle_t
      • daos_obj_id_t
      • daos_otype_t
      • daos_oclass_id_t : object class id, a 32-bit int
      • daos_oclass_hints_t : object class hints, a 16-bit int
      • args : other int arguments
    • dc_cont_hdl2pool_hdl
      • looks up the pool handle from the cookie
    • rc = pl_map_query(pool->dp_pool, &attr)
    • dc_pool_put(struct dc_pool *pool) // puts pool->dp_hlink onto a doubly linked list?
    • performs whatever the chosen object class policy requires
    • daos_obj_set_oid(oid, type, ord, nr_grp, args); // and the oid is ready?

daos_kv_open

  1. Inputs:
    • daos_handle_t
    • daos_obj_id_t
    • mode : an int
    • daos_handle_t
    • daos_event_t : event and event queue, maybe used for debug
  2. Calls rc = dc_task_create(dc_kv_open, NULL, ev, &task)
    • wraps the arguments into a dc_kv_open args struct
    • creates a kv-open task and schedules it

dts_buf_render(buf, BUFLEN)

  1. Inputs: buf and len
  2. Fills buf with randomly generated characters

daos_kv_put

  1. Inputs:
    • daos_handle_t oh
    • daos_handle_t th
    • uint64_t flag
    • char * key
    • daos_size_t buf_size
    • void * buf
    • daos_event_t * ev
  2. Calls rc = dc_task_create(dc_kv_put, NULL, ev, &task);
  3. Wraps the arguments into the task's args
  4. Calls dc_task_schedule(task, true);
    1. task_is_valid(task)
    2. ev = task_ptr2args(task)->ta_ev;
    3. rc = daos_event_launch(ev);
      • input: daos_event *ev
    4. rc = tse_task_schedule(task, instant);

kv_update

  1. The daos_io_0 thread receives the IO request; crt_handle_rpc(void *arg) handles the RPC

  2. ds_obj_rw_handler(crt_rpc_t *rpc) handles the RPC

    • obj_ioc_begin(orw->orw_oid.id_pub, orw->orw_map_ver, orw->orw_pool_uuid, orw->orw_co_hdl, orw->orw_co_uuid, opc_get(rpc->cr_opc), orw->orw_flags, &ioc); // various checks before accessing VOS

  3. rc = process_epoch(&orw->orw_epoch, &orw->orw_epoch_first, &orw->orw_flags); // epoch handling

Object_Update

  1. callstack

    libvos_srv.so!dkey_update(daos_key_t * dkey, uint32_t pm_ver, struct vos_io_context * ioc) (/home/kuhan/daos/src/vos/vos_io.c:1803)
    libvos_srv.so!vos_update_end(daos_handle_t ioh, uint32_t pm_ver, daos_key_t * dkey, int err, daos_size_t * size, struct dtx_handle * dth) (/home/kuhan/daos/src/vos/vos_io.c:2278)
    libobj.so!obj_rw_complete(struct dtx_handle * dth, int status, daos_handle_t ioh, struct obj_io_context * ioc, crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:128)
    libobj.so!obj_local_rw_internal(struct dtx_handle * dth, uint64_t * split_offs, struct dcs_iod_csums * split_csums, daos_iod_t * split_iods, struct obj_io_context * ioc, crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:1677)
    libobj.so!obj_local_rw(crt_rpc_t * rpc, struct obj_io_context * ioc, daos_iod_t * split_iods, struct dcs_iod_csums * split_csums, uint64_t * split_offs, struct dtx_handle * dth, _Bool pin) (/home/kuhan/daos/src/object/srv_obj.c:1697)
    libobj.so!obj_tgt_update(dtx_sub_comp_cb_t comp_cb, void * arg, struct dtx_leader_handle * dlh) (/home/kuhan/daos/src/object/srv_obj.c:2425)
    libobj.so!obj_tgt_update(struct dtx_leader_handle * dlh, void * arg, int idx, dtx_sub_comp_cb_t comp_cb) (/home/kuhan/daos/src/object/srv_obj.c:2356)
    libobj.so!ds_obj_rw_handler(crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:2664)
    libcart.so.4!crt_handle_rpc(void * arg) (/home/kuhan/daos/src/cart/crt_rpc.c:1638)
    libabt.so.1!ABTD_ythread_func_wrapper (unknown source:0)
    libabt.so.1!make_fcontext (unknown source:0)
    [Unknown/Just-In-Time compiled code] (unknown source:0)

obj_local_rw_internal

  1. rc = vos_update_begin(ioc->ioc_vos_coh, orw->orw_oid, orw->orw_epoch, cond_flags, dkey, orw->orw_nr, iods, iod_csums, ioc->ioc_coc->sc_props.dcp_dedup_size, &ioh, dth);

    • rc = vos_check_akeys(iod_nr, iods);

    • vos_ioc_create

    • rc = vos_space_hold(vos_cont2pool(ioc->ic_cont), flags, dkey, iod_nr, iods, iods_csums, &ioc->ic_space_held[0]);

    • rc = dkey_update_begin(ioc);

      • loop: rc = akey_update_begin(ioc);

        • get the dcs_csum_info

        • get the daos_iod_t, 1 KB

          iod
          0x7f9a07a4dee0
          iod_name
          iod_type:DAOS_IOD_ARRAY
          iod_size:1
          iod_flags:0
          iod_nr:1
          iod_recxs:0x7f9a07e9fbf0
          rx_idx:0
          rx_nr:1024
        • for (i = 0; i < iod->iod_nr; i++) // loop

          • size = (iod->iod_type == DAOS_IOD_SINGLE) ? iod->iod_size : iod->iod_recxs[i].rx_nr * iod->iod_size; // compute the size, here 1 KB (1024)

          • media = vos_media_select(vos_cont2pool(ioc->ic_cont), iod->iod_type, size); // decide which media to write to; 0 -> SCM

            /*
             * A simple media selection policy embedded in VOS, which select media by
             * akey type and record size.
             */
            static inline uint16_t
            vos_media_select(struct vos_pool *pool, daos_iod_type_t type, daos_size_t size)
            {
                if (pool->vp_vea_info == NULL)
                    return DAOS_MEDIA_SCM;

                return (size >= VOS_BLK_SZ) ? DAOS_MEDIA_NVME : DAOS_MEDIA_SCM;
            }
            // special case first: a pool without VEA info always goes to SCM
            // otherwise, writes of 4 KB (VOS_BLK_SZ) or more go to NVMe, smaller ones to SCM
          • when iod->iod_type is DAOS_IOD_ARRAY:

            • rc = vos_reserve_recx(ioc, media, size, recx_csum, csum_len);

              • allocate memory for struct bio_iov biov

              • rc = reserve_space(ioc, media, size, &off);

                oc
                0x7f9a0724ef00
                ic_ent_array_alloc
                ea_data
                ea_ents:0x0
                ea_ent_nr:0
                ea_size:0
                ea_max:0
                ea_inob:0
                ea_first_delete:0
                ea_delete_nr:0
                ea_embedded_ents
                ea_embedded
                ic_ent_array:0x0
                ic_bound:509171074455830528
                ic_epr
                ic_oid
                ic_cont:0x7f9a07e9eb60
                ic_iods:0x7f9a07a4dee0
                iod_csums:0x0
                ic_obj:0x0
                ic_biod:0x7f99c89c63b0
                ic_ts_set:0x7f99c89c6f90
                ic_biov_csums:0x7f99c87c76e0
                ic_biov_csums_at:0
                ic_biov_csums_nr:1
                ic_dkey_info
                ic_akey_info
                ic_sgl_at:0
                ic_iov_at:0
                ic_rsrvd_scm:0x7f9a07e86980
                ic_umoffs:0x7f9a07e9e6a0
                ic_umoffs_cnt:0
                ic_umoffs_at:0
                ic_blk_exts
                next:0x7f9a0724fb38
                prev:0x7f9a0724fb38
                ic_space_held
                ic_iod_nr:1
                ic_dedup_th:4096
                ic_dedup_entries
                ic_dedup_bsgls:0x0
                ic_dedup_bufs:0x0
                ic_io_size:0
                ic_update:1
                ic_size_fetch:0
                ic_save_recx:0
                ic_dedup:0
                ic_dedup_verify:0
                ic_read_ts_only:0
                ic_check_existence:0
                ic_remove:0
                ic_skip_fetch:0
                ic_ec:0
                ic_shadows:0x0
                re_nr
                re_total
                re_snapshot
                re_ep_valid
                re_items
                ic_recx_lists:0x0
                re_nr
                re_total
                re_snapshot
                re_ep_valid
                re_items
                • allocate the memory on SCM
              • bio_addr_set(&biov.bi_addr, media, off); // set the offset

              • bio_iov_set_len(&biov, size); // set the length

              • rc = iod_reserve(ioc, &biov);

    • *ioh = vos_ioc2ioh(ioc); // get the cookie?

  2. biod = vos_ioh2desc(ioh); // get the io descriptor?

    0x7f99c89c63b0
    bd_ctxt:0x7f99c89b0650
    bic_link
    bic_umem:0x7f99c860ab58
    bic_pmempool_uuid:12137328209446634291
    bic_blob:0x7f99c89b1530
    bic_xs_ctxt:0x7f9a04427370
    bic_inflight_dmas:0
    bic_io_unit:4096
    bic_pool_id
    bic_opening:0
    bic_closing:0
    bd_rsrvd
    brd_regions:0x0
    brd_rg_max:0
    brd_rg_cnt:0
    brd_dma_chks:0x0
    brd_chk_max:0
    brd_chk_cnt:0
    bd_dma_done:0x10
    bd_inflights:0
    bd_result:0
    bd_chk_type:0
    bd_type:0
    bd_buffer_prep:0
    bd_dma_issued:0
    bd_retry:0
    bd_rdma:0
    bd_bulk_hdls:0x0
    bd_bulk_max:0
    bd_bulk_cnt:0
    bd_sgl_cnt:1
    bd_sgls
  3. rc = bio_iod_prep(biod, BIO_CHK_TYPE_IO, rma ? rpc->cr_ctx : NULL, CRT_BULK_RW);

    • rc = iterate_biov(biod, arg ? bulk_map_one : dma_map_one, arg);

      • for loop: rc = cb_fn(biod, biov, data); // call through the function pointer; here it writes SCM

        biod
        0x7f99c89c63b0
        bd_ctxt:0x7f99c89b0650
        bic_link
        bic_umem:0x7f99c860ab58
        bic_pmempool_uuid:12137328209446634291
        bic_blob:0x7f99c89b1530
        bic_xs_ctxt:0x7f9a04427370
        bic_inflight_dmas:0
        bic_io_unit:4096
        bic_pool_id
        bic_opening:0
        bic_closing:0
        bd_rsrvd
        bd_dma_done:0x10
        bd_inflights:0
        bd_result:0
        bd_chk_type:0
        bd_type:0
        bd_buffer_prep:0
        bd_dma_issued:0
        bd_retry:0
        bd_rdma:0
        bd_bulk_hdls:0x0
        bd_bulk_max:0
        bd_bulk_cnt:0
        bd_sgl_cnt:1
        bd_sgls
        --------------
        biov
        0x7f9a07e9eeb0
        bi_buf:0x0
        bi_data_len:1024
        bi_addr
        ba_off:4576720
        ba_type:0 '\000'
        ba_pad1:0 '\000'
        ba_flags:0
        ba_pad2:0
        bi_prefix_len:0
        bi_suffix_len:0
        --------------
        data
        0x0

        calls dma_map_one(struct bio_desc *biod, struct bio_iov *biov, void *arg) through the function pointer

        // /* Convert offset of @biov into memory pointer */

        • direct_scm_access(biod, biov):
          • bio_iov_set_raw_buf(biov, umem_off2ptr(umem, bio_iov2raw_off(biov)));
  4. rc = bio_iod_copy(biod, orw->orw_sgls.ca_arrays, orw->orw_nr);

    • wrap the arguments into a bio_copy_args struct
    • iterate_biov(biod, copy_one, &arg);
  5. rc = vos_dedup_verify(ioh);

    /*
    * Check if the dedup data is identical to the RDMA data in a temporal
    * allocated DRAM extent, if memcmp fails, allocate a new SCM extent and
    * update it's address in VOS tree, otherwise, keep using the original
    * dedup data address in VOS tree.
    */
  6. rc = obj_verify_bio_csum(orw->orw_oid.id_pub, iods, iod_csums, biod, ioc->ioc_coc->sc_csummer, orw->orw_iod_array.oia_iod_nr); // verify checksums

  7. rc = obj_rw_complete(rpc, ioc, ioh, rc, dth); // the callstack is deep inside this function…

    • rc = vos_update_end(ioh, ioc->ioc_map_ver, &orwi->orw_dkey, status, &ioc->ioc_io_size, dth);

      • some dtx commit logic?

      • err = dkey_update(ioc, pm_ver, dkey, dtx_is_valid_handle(dth) ? dth->dth_op_seq : VOS_SUB_OP_MAX); // update tree index

        • rc = obj_tree_init(obj); // initialize tree for an object

        • rc = key_tree_prepare(obj, obj->obj_toh, VOS_BTR_DKEY, dkey, SUBTR_CREATE, DAOS_INTENT_UPDATE, &krec, &ak_toh, ioc->ic_ts_set);

          /**
          * Load the subtree roots embedded in the parent tree record.
          *
          * akey tree : all akeys under the same dkey
          * recx tree : all record extents under the same akey, this function will
          * load both btree and evtree root.
          */
          /* NB: In order to avoid complexities of passing parameters to the
          * multi-nested tree, tree operations are not nested, instead:
          *
          * - In the case of fetch, we load the subtree root stored in the
          * parent tree leaf.
          * - In the case of update/insert, we call dbtree_update() which may
          * create the root for the subtree, or just return it if it's already
          * there.
          */
          • rc = dbtree_fetch(toh, BTR_PROBE_EQ, intent, key,NULL, &riov);

          • /* use BTR_PROBE_BYPASS to avoid probe again */

            rc = dbtree_upsert(toh, BTR_PROBE_BYPASS, intent, key, &riov);

            • Update the value of the provided key, or insert it as a new key if
               * there is no match.
              
            • tcx = btr_hdl2tcx(toh); ///** find the tree context of the handle */

            • rc = btr_tx_begin(tcx); // begin transaction?

            • rc = btr_upsert(tcx, opc, intent, key, val);

              • rc = btr_insert(tcx, key, val); // with the bypass policy the previous probe result is reused directly; create a new record and insert it into a tree leaf node
                • btr_hkey_gen(tcx, key, &rec->rec_hkey[0]); // generate the hkey
                • rc = btr_node_insert_rec(tcx, trace, rec);
                  • btr_node_insert_rec_only(tcx, trace, rec);
            • btr_tx_end(tcx, rc);

              • rc = umem_tx_commit(btr_umm(tcx));
          • vos_ilog_ts_ignore(vos_obj2umm(obj), &krec->kr_ilog);

          • vos_ilog_ts_mark(ts_set, &krec->kr_ilog);

        • rc = vos_ilog_update(ioc->ic_cont, &krec->kr_ilog, &ioc->ic_epr, ioc->ic_bound, &obj->obj_ilog_info, &ioc->ic_dkey_info, update_cond, ioc->ic_ts_set); // update the dkey ilog?

        • loop: rc = akey_update(ioc, pm_ver, ak_toh, minor_epc);

          • rc = key_tree_prepare(obj, ak_toh, VOS_BTR_AKEY, &iod->iod_name, flags, DAOS_INTENT_UPDATE, &krec, &toh, ioc->ic_ts_set);

          • ….

      • vos_ts_set_check_conflict(ioc->ic_ts_set, ioc->ic_epr.epr_hi) // now that we are past the existence checks, ensure there isn't a read conflict

      • err = vos_tx_end(ioc->ic_cont, dth, &ioc->ic_rsrvd_scm, &ioc->ic_blk_exts, tx_started, err); // dtx operations, on SCM

      • vos_ts_set_upgrade(ioc->ic_ts_set);

      • vos_space_unhold(vos_cont2pool(ioc->ic_cont), &ioc->ic_space_held[0]);

      • vos_ioc_destroy(ioc, err != 0); // free memory

        • bio_iod_free(ioc->ic_biod);
        • vos_obj_release(vos_obj_cache_current(), ioc->ic_obj, evict);
        • vos_ioc_reserve_fini(ioc);
        • vos_ilog_fetch_finish(&ioc->ic_dkey_info);
        • vos_ilog_fetch_finish(&ioc->ic_akey_info);
        • vos_cont_decref(ioc->ic_cont);
        • vos_ts_set_free(ioc->ic_ts_set);
        • D_FREE(ioc);
      • vos_dth_set(NULL);

Update flow

  • Client side:
    • call daos_obj_update
    • create the task
    • schedule the task
    • send the RPC
  • Server side:
    • receive the RPC
    • call the rw handler
    • the leader starts executing
    • preliminary checks
    • estimate the space needed
    • reserve the space
    • update the dkey
    • update the akey
    • update the index
    • release the held space
    • reply to the RPC

流程图

Canvas 1

daos_obj_generate_oid

  1. 输入:
    • daos_handle_t oh : 存储了一个unsigned int 64的cookie
    • daos_obj_id_t oid : daos object id, 用两个64位int 存储 lo(low)和 high (high)
    • enum DAOS_OT_KV_HASHED : flat KV (no akey) with integer dkey
    • OC_SX :OC策略
    • 0 :边长参数
    • 0 :变长参数
  2. 实现,内部调用了daos_obj_generate_oid2
    • 输入:
      • daos_handle_t
      • daos_obj_id_t
      • daos_otype_t
      • daos_oclass_id_t : object class id, 32位int
      • daos_oclass_hints_t :object class hint, 16位int
      • args : 其他int参数
    • dc_cont_hdl2pool_hdl
      • 根据cookie获取pool handler
    • rc = pl_map_query(pool->dp_pool, &attr)
    • dc_pool_put(struct dc_pool *pool) // 将pool->dp_hlink放入了一个双向队列中?
    • 按照选定的OC策略做相应的操作
    • daos_obj_set_oid(oid, type, ord, nr_grp, args); // 就生成好了?

daos_kv_open

  1. 输入:
    • daos_handle_t
    • daos_obj_id_t
    • mode : int 类型
    • daos_handle_t
    • daos_event_t : event and event queue, maybe used for debug
  2. 调用rc = dc_task_create(dc_kv_open, NULL, ev, &task)
    • 将参数包装成dc_kv_open结构体
    • 生成一个kvOpen的task并调度

dts_buf_render(buf, BUFLEN)

  1. 输入buf和len
  2. 利用随机生成的字符填满buf

daos_kev_put

  1. 输入:
    • daos_handle_t oh
    • daos_handle_t th
    • uint64_t flag
    • char * key
    • daos_size_t buf_size
    • void * buf
    • daos_event_t * ev
  2. 调用rc = dc_task_create(dc_kv_put, NULL, ev, &task);
  3. 将参数包装到task的args中
  4. 调用dc_task_schedule(task, true);
    1. task_is_valid(task)
    2. ev = task_ptr2args(task)->ta_ev;
    3. rc = daos_event_launch(ev);
      • 输入:daos_event *ev
    4. rc = tse_task_schedule(task, instant);

kv_update

  1. The daos_io_0 thread receives the I/O request; crt_handle_rpc(void *arg) handles the RPC

  2. ds_obj_rw_handler(crt_rpc_t *rpc) handles the RPC

    • obj_ioc_begin(orw->orw_oid.id_pub, orw->orw_map_ver, orw->orw_pool_uuid, orw->orw_co_hdl, orw->orw_co_uuid, opc_get(rpc->cr_opc), orw->orw_flags, &ioc); // various checks before accessing VOS

  3. rc = process_epoch(&orw->orw_epoch, &orw->orw_epoch_first, &orw->orw_flags); // epoch handling

Object_Update

  1. callstack

    libvos_srv.so!dkey_update(daos_key_t * dkey, uint32_t pm_ver, struct vos_io_context * ioc) (/home/kuhan/daos/src/vos/vos_io.c:1803)
    libvos_srv.so!vos_update_end(daos_handle_t ioh, uint32_t pm_ver, daos_key_t * dkey, int err, daos_size_t * size, struct dtx_handle * dth) (/home/kuhan/daos/src/vos/vos_io.c:2278)
    libobj.so!obj_rw_complete(struct dtx_handle * dth, int status, daos_handle_t ioh, struct obj_io_context * ioc, crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:128)
    libobj.so!obj_local_rw_internal(struct dtx_handle * dth, uint64_t * split_offs, struct dcs_iod_csums * split_csums, daos_iod_t * split_iods, struct obj_io_context * ioc, crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:1677)
    libobj.so!obj_local_rw(crt_rpc_t * rpc, struct obj_io_context * ioc, daos_iod_t * split_iods, struct dcs_iod_csums * split_csums, uint64_t * split_offs, struct dtx_handle * dth, _Bool pin) (/home/kuhan/daos/src/object/srv_obj.c:1697)
    libobj.so!obj_tgt_update(dtx_sub_comp_cb_t comp_cb, void * arg, struct dtx_leader_handle * dlh) (/home/kuhan/daos/src/object/srv_obj.c:2425)
    libobj.so!obj_tgt_update(struct dtx_leader_handle * dlh, void * arg, int idx, dtx_sub_comp_cb_t comp_cb) (/home/kuhan/daos/src/object/srv_obj.c:2356)
    libobj.so!ds_obj_rw_handler(crt_rpc_t * rpc) (/home/kuhan/daos/src/object/srv_obj.c:2664)
    libcart.so.4!crt_handle_rpc(void * arg) (/home/kuhan/daos/src/cart/crt_rpc.c:1638)
    libabt.so.1!ABTD_ythread_func_wrapper (unknown source:0)
    libabt.so.1!make_fcontext (unknown source:0)
    [Unknown/Just-In-Time compiled code] (unknown source:0)

obj_local_rw_internal

  1. rc = vos_update_begin(ioc->ioc_vos_coh, orw->orw_oid, orw->orw_epoch, cond_flags, dkey, orw->orw_nr, iods, iod_csums, ioc->ioc_coc->sc_props.dcp_dedup_size, &ioh, dth);

    • rc = vos_check_akeys(iod_nr, iods);

    • vos_ioc_create

    • rc = vos_space_hold(vos_cont2pool(ioc->ic_cont), flags, dkey, iod_nr, iods, iods_csums, &ioc->ic_space_held[0]);

    • rc = dkey_update_begin(ioc);

      • loops over rc = akey_update_begin(ioc);

        • fetch the dcs_csum_info

        • fetch the daos_iod_t, 1k here

          iod
          0x7f9a07a4dee0
          iod_name
          iod_type:DAOS_IOD_ARRAY
          iod_size:1
          iod_flags:0
          iod_nr:1
          iod_recxs:0x7f9a07e9fbf0
          rx_idx:0
          rx_nr:1024
        • for (i = 0; i < iod->iod_nr; i++) // loop

          • size = (iod->iod_type == DAOS_IOD_SINGLE) ? iod->iod_size : iod->iod_recxs[i].rx_nr * iod->iod_size; // compute the size, 1k here (1024)

          • media = vos_media_select(vos_cont2pool(ioc->ic_cont), iod->iod_type, size); // decide which media to write to; 0 -> SCM

            /*
             * A simple media selection policy embedded in VOS, which select media by
             * akey type and record size.
             */
            static inline uint16_t
            vos_media_select(struct vos_pool *pool, daos_iod_type_t type, daos_size_t size)
            {
                    if (pool->vp_vea_info == NULL)
                            return DAOS_MEDIA_SCM;

                    return (size >= VOS_BLK_SZ) ? DAOS_MEDIA_NVME : DAOS_MEDIA_SCM;
            }
            // Special case first: without VEA info everything goes to SCM.
            // Otherwise, writes of 4k or larger go to NVMe, smaller ones to SCM.
          • if iod->iod_type is DAOS_IOD_ARRAY:

            • rc = vos_reserve_recx(ioc, media, size, recx_csum, csum_len);

              • allocates memory for struct bio_iov biov

              • rc = reserve_space(ioc, media, size, &off);

                oc
                0x7f9a0724ef00
                ic_ent_array_alloc
                ea_data
                ea_ents:0x0
                ea_ent_nr:0
                ea_size:0
                ea_max:0
                ea_inob:0
                ea_first_delete:0
                ea_delete_nr:0
                ea_embedded_ents
                ea_embedded
                ic_ent_array:0x0
                ic_bound:509171074455830528
                ic_epr
                ic_oid
                ic_cont:0x7f9a07e9eb60
                ic_iods:0x7f9a07a4dee0
                iod_csums:0x0
                ic_obj:0x0
                ic_biod:0x7f99c89c63b0
                ic_ts_set:0x7f99c89c6f90
                ic_biov_csums:0x7f99c87c76e0
                ic_biov_csums_at:0
                ic_biov_csums_nr:1
                ic_dkey_info
                ic_akey_info
                ic_sgl_at:0
                ic_iov_at:0
                ic_rsrvd_scm:0x7f9a07e86980
                ic_umoffs:0x7f9a07e9e6a0
                ic_umoffs_cnt:0
                ic_umoffs_at:0
                ic_blk_exts
                next:0x7f9a0724fb38
                prev:0x7f9a0724fb38
                ic_space_held
                ic_iod_nr:1
                ic_dedup_th:4096
                ic_dedup_entries
                ic_dedup_bsgls:0x0
                ic_dedup_bufs:0x0
                ic_io_size:0
                ic_update:1
                ic_size_fetch:0
                ic_save_recx:0
                ic_dedup:0
                ic_dedup_verify:0
                ic_read_ts_only:0
                ic_check_existence:0
                ic_remove:0
                ic_skip_fetch:0
                ic_ec:0
                ic_shadows:0x0
                re_nr
                re_total
                re_snapshot
                re_ep_valid
                re_items
                ic_recx_lists:0x0
                re_nr
                re_total
                re_snapshot
                re_ep_valid
                re_items
                • allocates space on SCM
              • bio_addr_set(&biov.bi_addr, media, off); // set the offset

              • bio_iov_set_len(&biov, size); // set the length

              • rc = iod_reserve(ioc, &biov);

    • *ioh = vos_ioc2ioh(ioc); // returns the cookie?

  2. biod = vos_ioh2desc(ioh); // get the I/O descriptor?

    0x7f99c89c63b0
    bd_ctxt:0x7f99c89b0650
    bic_link
    bic_umem:0x7f99c860ab58
    bic_pmempool_uuid:12137328209446634291
    bic_blob:0x7f99c89b1530
    bic_xs_ctxt:0x7f9a04427370
    bic_inflight_dmas:0
    bic_io_unit:4096
    bic_pool_id
    bic_opening:0
    bic_closing:0
    bd_rsrvd
    brd_regions:0x0
    brd_rg_max:0
    brd_rg_cnt:0
    brd_dma_chks:0x0
    brd_chk_max:0
    brd_chk_cnt:0
    bd_dma_done:0x10
    bd_inflights:0
    bd_result:0
    bd_chk_type:0
    bd_type:0
    bd_buffer_prep:0
    bd_dma_issued:0
    bd_retry:0
    bd_rdma:0
    bd_bulk_hdls:0x0
    bd_bulk_max:0
    bd_bulk_cnt:0
    bd_sgl_cnt:1
    bd_sgls
  3. rc = bio_iod_prep(biod, BIO_CHK_TYPE_IO, rma ? rpc->cr_ctx : NULL, CRT_BULK_RW);

    • rc = iterate_biov(biod, arg ? bulk_map_one : dma_map_one, arg);

      • for loop: rc = cb_fn(biod, biov, data); // call through a function pointer; here it writes to SCM

        biod
        0x7f99c89c63b0
        bd_ctxt:0x7f99c89b0650
        bic_link
        bic_umem:0x7f99c860ab58
        bic_pmempool_uuid:12137328209446634291
        bic_blob:0x7f99c89b1530
        bic_xs_ctxt:0x7f9a04427370
        bic_inflight_dmas:0
        bic_io_unit:4096
        bic_pool_id
        bic_opening:0
        bic_closing:0
        bd_rsrvd
        bd_dma_done:0x10
        bd_inflights:0
        bd_result:0
        bd_chk_type:0
        bd_type:0
        bd_buffer_prep:0
        bd_dma_issued:0
        bd_retry:0
        bd_rdma:0
        bd_bulk_hdls:0x0
        bd_bulk_max:0
        bd_bulk_cnt:0
        bd_sgl_cnt:1
        bd_sgls
        --------------
        biov
        0x7f9a07e9eeb0
        bi_buf:0x0
        bi_data_len:1024
        bi_addr
        ba_off:4576720
        ba_type:0 '\000'
        ba_pad1:0 '\000'
        ba_flags:0
        ba_pad2:0
        bi_prefix_len:0
        bi_suffix_len:0
        --------------
        data
        0x0

        The function pointer resolves to dma_map_one(struct bio_desc *biod, struct bio_iov *biov, void *arg)

        // /* Convert offset of @biov into memory pointer */

        • direct_scm_access(biod, biov):
          • bio_iov_set_raw_buf(biov, umem_off2ptr(umem, bio_iov2raw_off(biov)));
  4. rc = bio_iod_copy(biod, orw->orw_sgls.ca_arrays, orw->orw_nr);

    • wraps the arguments into a bio_copy_args struct
    • iterate_biov(biod, copy_one, &arg);
  5. rc = vos_dedup_verify(ioh);

    /*
    * Check if the dedup data is identical to the RDMA data in a temporal
    * allocated DRAM extent, if memcmp fails, allocate a new SCM extent and
    * update it's address in VOS tree, otherwise, keep using the original
    * dedup data address in VOS tree.
    */
  6. rc = obj_verify_bio_csum(orw->orw_oid.id_pub, iods, iod_csums, biod, ioc->ioc_coc->sc_csummer, orw->orw_iod_array.oia_iod_nr); // verify checksums

  7. rc = obj_rw_complete(rpc, ioc, ioh, rc, dth); // the call stack goes deep inside this function…

    • rc = vos_update_end(ioh, ioc->ioc_map_ver, &orwi->orw_dkey, status, &ioc->ioc_io_size, dth);

      • some dtx commit logic?

      • err = dkey_update(ioc, pm_ver, dkey, dtx_is_valid_handle(dth) ? dth->dth_op_seq : VOS_SUB_OP_MAX); // update the tree index

        • rc = obj_tree_init(obj); // initialize the tree for an object

        • rc = key_tree_prepare(obj, obj->obj_toh, VOS_BTR_DKEY, dkey, SUBTR_CREATE, DAOS_INTENT_UPDATE, &krec, &ak_toh, ioc->ic_ts_set);

          /**
          * Load the subtree roots embedded in the parent tree record.
          *
          * akey tree : all akeys under the same dkey
          * recx tree : all record extents under the same akey, this function will
          * load both btree and evtree root.
          */
          /* NB: In order to avoid complexities of passing parameters to the
          * multi-nested tree, tree operations are not nested, instead:
          *
          * - In the case of fetch, we load the subtree root stored in the
          * parent tree leaf.
          * - In the case of update/insert, we call dbtree_update() which may
          * create the root for the subtree, or just return it if it's already
          * there.
          */
          • rc = dbtree_fetch(toh, BTR_PROBE_EQ, intent, key,NULL, &riov);

          • /* use BTR_PROBE_BYPASS to avoid probe again */

            rc = dbtree_upsert(toh, BTR_PROBE_BYPASS, intent, key, &riov);

            • Update the value of the provided key, or insert it as a new key if
               * there is no match.
              
            • tcx = btr_hdl2tcx(toh); ///** find the tree context of the handle */

            • rc = btr_tx_begin(tcx); // begin transaction?

            • rc = btr_upsert(tcx, opc, intent, key, val);

              • rc = btr_insert(tcx, key, val); // the bypass policy reuses the result of the previous probe; create a new record, insert it into tree leaf node.
                • btr_hkey_gen(tcx, key, &rec->rec_hkey[0]); // generate the hkey
                • rc = btr_node_insert_rec(tcx, trace, rec);
                  • btr_node_insert_rec_only(tcx, trace, rec);
            • btr_tx_end(tcx, rc);

              • rc = umem_tx_commit(btr_umm(tcx));
          • vos_ilog_ts_ignore(vos_obj2umm(obj), &krec->kr_ilog);

          • vos_ilog_ts_mark(ts_set, &krec->kr_ilog);

        • rc = vos_ilog_update(ioc->ic_cont, &krec->kr_ilog, &ioc->ic_epr, ioc->ic_bound, &obj->obj_ilog_info, &ioc->ic_dkey_info, update_cond, ioc->ic_ts_set); // update the dkey incarnation log?

        • for loop: rc = akey_update(ioc, pm_ver, ak_toh, minor_epc);

          • rc = key_tree_prepare(obj, ak_toh, VOS_BTR_AKEY, &iod->iod_name, flags, DAOS_INTENT_UPDATE, &krec, &toh, ioc->ic_ts_set);

          • ….

      • vos_ts_set_check_conflict(ioc->ic_ts_set, ioc->ic_epr.epr_hi) // Now that we are past the existence checks, ensure there isn’t a read conflict

      • err = vos_tx_end(ioc->ic_cont, dth, &ioc->ic_rsrvd_scm,&ioc->ic_blk_exts, tx_started, err); // dtx operations,on scm

      • vos_ts_set_upgrade(ioc->ic_ts_set);

      • vos_space_unhold(vos_cont2pool(ioc->ic_cont), &ioc->ic_space_held[0]);

      • vos_ioc_destroy(ioc, err != 0); // memory free

        • bio_iod_free(ioc->ic_biod);
        • vos_obj_release(vos_obj_cache_current(), ioc->ic_obj, evict);
        • vos_ioc_reserve_fini(ioc);
        • vos_ilog_fetch_finish(&ioc->ic_dkey_info);
        • vos_ilog_fetch_finish(&ioc->ic_akey_info);
        • vos_cont_decref(ioc->ic_cont);
        • vos_ts_set_free(ioc->ic_ts_set);
        • D_FREE(ioc);
      • vos_dth_set(NULL);

Update flow

API call –> task generation –> client RPC call –> server accept –> handle RPC –> decide type –> allocate memory –> write data –> update index –> free memory

Go

  1. Constants

    1. Definition form: const Name [Type] = Value
  2. Variables

    1. Definition form: var Name [Type]
    2. Variables automatically get their zero value
    3. A global variable must start with an uppercase letter to be usable from other packages
    4. Variables defined inside a function body are local; others are global
    5. Type can be omitted; the compiler infers the type from the assigned value at compile time
    6. Inside a function body, use the short declaration syntax :=
    7. Value types and reference types
      1. Assigning a value type with = copies the value in memory
        1. int, float, bool, string, arrays, structs
        2. stored on the stack
      2. A reference-type variable stores the address of the value
        1. pointers, slices, maps, channels
        2. stored on the heap
        3. short-form creation of a local variable: a := 50
        4. a declared-but-unused local variable is a compile error; unused globals are allowed
      3. The init function
        1. Variables can be initialized in init, which runs automatically once a package's initialization completes, before main
  3. Basic types and operators

    1. bool: in formatted output, %t prints a bool
    2. Numeric types:
      1. int and uint take their width from the platform (4 or 8 bytes)
      2. uintptr is just wide enough to hold a pointer
      3. Signed integers:
        1. int8 (-128 -> 127)
        2. int16 (-32768 -> 32767)
        3. int32 (-2,147,483,648 -> 2,147,483,647)
        4. int64 (-9,223,372,036,854,775,808 -> 9,223,372,036,854,775,807)
      4. Unsigned integers:
        1. uint8 (0 -> 255)
        2. uint16 (0 -> 65,535)
        3. uint32 (0 -> 4,294,967,295)
        4. uint64 (0 -> 18,446,744,073,709,551,615)
      5. Floating point:
        1. float32 (± 1.4e-45 -> ± 3.4e38): about 7 decimal digits of precision
        2. float64 (± 5e-324 -> ± 1.8e308): about 15 decimal digits of precision
      6. Complex numbers:
        1. complex64 (32-bit real and imaginary parts)
        2. complex128 (64-bit real and imaginary parts)
      7. Random numbers:
        1. the rand package
      8. Type aliases:
        1. type Alias Type
      9. Character types
        1. byte (alias for uint8) and rune (alias for int32); Go has no char type
        2. both actually store integers
    3. Strings
      1. A string is an immutable sequence of bytes
      2. Interpreted strings are written with double quotes
      3. Raw (uninterpreted) strings are written with backquotes
      4. The string comparison operators compare byte by byte
      5. Taking the address of a byte in a string, &str[i], is illegal
      6. Strings are concatenated with +
    4. The strings and strconv packages (string library functions)
    5. Pointers:
      1. A pointer variable can point to the memory address of any value
  4. Control structures (the parentheses around the condition are dropped, which keeps the code cleaner; the braces around the body can never be omitted)

    1. if-else
    2. if-else if-else
    3. Testing multi-return-value functions for errors
      1. A function can return multiple values, and the second can carry error details; if the second return value is not nil, an error occurred.
    4. switch case:
      1. No break needed
      2. To keep executing the next branch after a match, use the fallthrough keyword
      3. The second form of switch takes no value to test (it effectively tests for true); each case branch tests a different condition, and the first branch that evaluates to true runs. This resembles a chain of if-else statements but reads better when there are many conditions.
      4. The third form allows an assignment statement in the switch header; the case branches then act on the resulting value
    5. Loops: the for structure
      1. Basic form: for init; condition; post {}
      2. Second form, like a while loop: no init or post statement
      3. Third form, an infinite loop: for {}
      4. for-range form: for ix, val := range coll { }
    6. break and continue
    7. label and goto (discouraged; skipped)
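The three switch forms and fallthrough can be sketched as follows (the classify function is made up for illustration):

```go
package main

import "fmt"

// form 2: no expression; cases are boolean conditions
func classify(n int) string {
	switch {
	case n < 0:
		return "negative"
	case n == 0:
		return "zero"
	default:
		return "positive"
	}
}

func main() {
	// form 1: cases break automatically; fallthrough continues
	switch x := 1; x {
	case 1:
		fmt.Println("one")
		fallthrough
	case 2:
		fmt.Println("two") // also printed, because of fallthrough
	case 3:
		fmt.Println("three") // not reached
	}

	// form 3: an init statement in the switch header
	switch y := classify(-5); y {
	case "negative":
		fmt.Println("got", y)
	}
}
```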
  5. Functions

    1. Kinds: ordinary named functions, anonymous functions, methods

    2. Go disallows function overloading and has no generics, for the sake of efficiency

    3. General definition: func f(name1 type1, name2 type2) returnType; parameters may be unnamed.

    4. All function arguments are passed by value

    5. With named return values, a bare return at the end of the function suffices

    6. Without named return values, the returned values are written explicitly after return, and multiple return types are wrapped in () in the signature

    7. The blank identifier _ matches values you do not need, discarding them

    8. Pass a pointer to change the value of a variable outside the function

    9. Variadic functions

      1. Form: func myFunc(a, b, arg ...int) {}

      2. If the variadic parameter's type is left open, use the empty interface interface{} so it accepts arguments of any type:

        func typecheck(..,..,values ... interface{}) {
            for _, value := range values {
                switch v := value.(type) {
                    case int: …
                    case float: …
                    case string: …
                    case bool: …
                    default: …
                }
            }
        }

    10. defer and tracing

      1. defer acts like finally and is used to release resources
      2. defer can also be used to log a function's arguments and return values
    11. Passing functions as arguments

      1. func IndexFunc(s string, f func(c int) bool) int
    12. Closures

      1. Assigning an anonymous function to a variable: fplus := func(x, y int) int { return x + y }
      2. Calling an anonymous function directly: func(x, y int) int { return x + y } (3, 4)
      3. An anonymous function is invoked by appending a pair of () to it
    13. Applied closures: returning a function

      1. A closure saves and accumulates the values of the variables it captures; whether or not the outer function has returned, it can keep operating on the outer function's local variables.

      2. Variables used in a closure may be declared inside the closure body or in the enclosing function:

        var g int
        go func(i int) {
            s := 0
            for j := 0; j < i; j++ { s += j }
            g = s
        }(1000) // Passes argument 1000 to the function literal.
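The defer-based tracing and the state-accumulating closure described above can be combined into one runnable sketch (trace and makeCounter are illustrative names):

```go
package main

import "fmt"

// trace uses defer to log entry and exit of a function
func trace(name string) func() {
	fmt.Println("entering", name)
	return func() { fmt.Println("leaving", name) }
}

// makeCounter returns a closure that keeps accumulating state
// even after makeCounter itself has returned
func makeCounter() func() int {
	count := 0
	return func() int {
		count++
		return count
	}
}

func main() {
	defer trace("main")() // trace runs now; the returned func runs at exit

	next := makeCounter()
	fmt.Println(next()) // 1
	fmt.Println(next()) // 2
	fmt.Println(next()) // 3
}
```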

  6. Arrays and slices

    1. Arrays

      1. Declaration: var identifier [len]type
      2. Iterating with a for loop:

        for i := 0; i < len(arr1); i++ {
            arr1[i] = ...
        }

      3. Iterating with for-range:

        for i, v := range arr1 {
            ...
        }

      4. Arrays in Go are value types (unlike C/C++, where an array is a pointer to its first element), so one can be created with new(): var arr1 = new([5]int). arr1 has type *[5]int; assigning the array to another variable copies the array's memory.
      5. Passing an array to a function copies it; to modify the caller's array, pass it by reference
      6. Arrays can be initialized with {} at declaration
    2. Slices

      1. Definition and properties

        1. A slice is a reference to a contiguous segment of an array (the underlying array, usually anonymous), so a slice is a reference type
        2. Unlike an array, a slice's length can change at runtime, from 0 up to the length of the underlying array: a slice is a variable-length array
        3. Multiple slices over the same array share its data; a slice therefore shares storage with the underlying array's other slices, whereas distinct arrays always represent distinct storage. Arrays are in fact the building blocks of slices.
      2. Advantage:

        1. Because slices are references, they need no extra memory and are more efficient than arrays, so slices are used far more often than arrays in Go code.
      3. Declaration: var identifier []type (no length given).

      4. Initialization:

        1. var slice1 []type = arr1[start:end], a half-open interval (start inclusive, end exclusive)
        2. Array-style initialization: var x = []int{2, 3, 5, 7, 11}, which creates an array of length 5 plus a slice over it.
      5. Length: the number of stored values

      6. Capacity: cap() measures how far a slice can grow: the number of elements from the slice's first element to the end of the underlying array

      7. For every slice the following always holds:

        s == s[:i] + s[i:] // for any integer i with 0 <= i <= len(s), and len(s) <= cap(s)

      8. A slice is stored like a struct with:

        1. a pointer to the underlying array
        2. a length
        3. a capacity
      9. Passing a slice to a function:

        func sum(a []int) int {
            s := 0
            for i := 0; i < len(a); i++ {
                s += a[i]
            }
            return s
        }

        func main() {
            var arr = [5]int{0, 1, 2, 3, 4}
            sum(arr[:])
        }

      10. Creating a slice with make:

        1. var slice1 []type = make([]type, len)
        2. shorthand: slice1 := make([]type, len)
        3. make's usage is func make([]T, len, cap), where cap is optional.
        4. These two ways of creating a slice are equivalent:
          1. make([]int, 50, 100)
          2. new([100]int)[0:50]
      11. The difference between make and new

        1. They look alike, both allocating on the heap, but they behave differently and apply to different types.
        2. new(T) allocates memory for a new value of type T, initializes it to zero, and returns an address of type *T: a pointer to a zero value of T. It suits value types such as arrays and structs (see chapter 10) and is equivalent to &T{}.
        3. make(T) returns an initialized value of type T; it applies only to the three built-in reference types: slices, maps, and channels
        4. In other words: new allocates, make initializes
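The make/new distinction above can be checked directly (a minimal sketch; note that len and indexing work through a *[5]int automatically):

```go
package main

import "fmt"

func main() {
	// new(T) allocates a zeroed T and returns *T
	p := new([5]int)          // p has type *[5]int, all elements 0
	fmt.Println(p[0], len(p)) // 0 5

	// make initializes slices, maps, and channels
	s := make([]int, 3, 10)     // len 3, cap 10, elements zeroed
	fmt.Println(len(s), cap(s)) // 3 10

	// the two forms below describe the same slice shape
	s1 := make([]int, 50, 100)
	s2 := new([100]int)[0:50]
	fmt.Println(len(s1) == len(s2), cap(s1) == cap(s2)) // true true
}
```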
      12. The Buffer type in the bytes package (similar to Java's StringBuilder)

        1. Declaration: var buffer bytes.Buffer
        2. Via a pointer: var r *bytes.Buffer = new(bytes.Buffer)
        3. Or via the constructor: func NewBuffer(buf []byte) *Buffer
      13. for-range over slices

        1. One-dimensional:

          for ix, value := range slice1 {
              ...
          }

        2. Multi-dimensional:

          for row := range screen {
              for column := range screen[row] {
                  screen[row][column] = 1
              }
          }

      14. Reslicing (growing a slice)

        1. Extending by one element: sl = sl[0:len(sl)+1]
      15. Copying and appending slices

        1. To grow a slice beyond its capacity, create a new, larger slice and copy the original's contents into it.

        2. Append to a slice with func append(s []T, x ...T) []T

          1. Example: sl3 := []int{1, 2, 3}; sl3 = append(sl3, 4, 5, 6)
        3. Copy one slice into another with func copy(dst, src []T) int, which returns the number of elements copied

          1. Example:

            sl_from := []int{1, 2, 3}
            sl_to := make([]int, 10)
            n := copy(sl_to, sl_from)
  7. Maps

    1. Declaration: var map1 map[keytype]valuetype

    2. Assignment: map1[key1] = val1

    3. Lookup: v := map1[key1]

    4. Length: len(map1)

    5. Literal initialization: {key1: val1, key2: val2}

    6. Initialization with make: map1 := make(map[keytype]valuetype), equivalent to mapCreated := map[string]float32{}

    7. Slices as map values:

      mp1 := make(map[int][]int)
      mp2 := make(map[int]*[]int)

    8. Testing whether a key exists:

      if _, ok := map1[key1]; ok {
          // ...
      }

    9. Deleting a key-value pair: delete(map1, key1)

    10. Iterating with for-range:

      1. key and value:

        for key, value := range map1 {
            ...
        }

      2. value only:

        for _, value := range map1 {
            ...
        }

      3. key only:

        for key := range map1 {
            fmt.Printf("key is: %d\n", key)
        }

    11. Slices of maps:

      1. make twice: first allocate the slice, then allocate each map element inside it

        items := make([]map[int]int, 5)
        for i := range items {
            items[i] = make(map[int]int, 1)
            items[i][1] = 2
        }

    12. Sorting a map

      1. Copy out the keys, sort them, then walk the sorted keys to fetch the values (is this too inefficient?)
    13. Inverting key-value pairs:

      1. Create a new map of the same size, iterate the original map, and copy the data over with keys and values swapped
    1. Standard library
    2. The regexp package
    3. The sync package
    4. The big package for intensive computation
    5. Custom packages and visibility:
      1. Import with . : import . "./pack1". With . as the package alias, its exported items can be used without the package-name prefix, e.g. test := ReturnStr(). This imports pack1 into the current namespace, usually for nicer-looking tests.
      2. Import with _ : import _ "./pack1". pack1 is imported only for its side effects: its init function runs and its global variables are initialized.
      3. Importing an externally installed package: install it first with go install
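The map operations above, including the sorted-keys iteration from point 12, in one runnable sketch (the scores map is made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	scores := map[string]int{"alice": 3, "bob": 1}
	scores["carol"] = 2 // insert

	// the comma-ok form distinguishes "absent" from "zero value"
	if v, ok := scores["bob"]; ok {
		fmt.Println("bob:", v)
	}
	delete(scores, "bob")
	_, ok := scores["bob"]
	fmt.Println(ok) // false

	// maps iterate in unspecified order; sort the keys for stable output
	keys := make([]string, 0, len(scores))
	for k := range scores {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, scores[k]) // alice 3, then carol 2
	}
}
```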
  8. Structs and methods

    1. Structs:

      1. Definition:

        type identifier struct {
            field1 type1
            field2 type2
            ...
        }

      2. Using new:

        t := new(T)

      3. Using a declaration:

        var t T: allocates and zeroes the memory

      4. Use the . selector to access a struct's fields, whether the variable is a struct or a pointer to one

        type myStruct struct { i int }
        var v myStruct  // v is a struct variable
        var p *myStruct // p is a pointer to a struct variable
        v.i
        p.i

      5. Initializing a struct with {}

        ms := &struct1{10, 15.5, "Chris"}
        // ms has type *struct1

        The expressions new(Type) and &Type{} are equivalent.

        Or:

        var ms struct1
        ms = struct1{10, 15.5, "Chris"}

        Or initialize by specifying field keys:

        intr := Interval{0, 3}           (A)
        intr := Interval{end:5, start:1} (B)
        intr := Interval{end:5}          (C)

      6. Memory layout: a struct and the data it contains occupy a contiguous block of memory.

      7. Recursive structs: useful for defining linked-list or binary-tree nodes

      8. make cannot be used with structs

      9. Structs can carry tags, retrievable via reflection

        package main

        import (
            "fmt"
            "reflect"
        )

        type TagType struct { // tags
            field1 bool   "An important answer"
            field2 string "The name of the thing"
            field3 int    "How much there are"
        }

        func main() {
            tt := TagType{true, "Barak Obama", 1}
            for i := 0; i < 3; i++ {
                refTag(tt, i)
            }
        }

        func refTag(tt TagType, ix int) {
            ttType := reflect.TypeOf(tt)
            ixField := ttType.Field(ix)
            fmt.Printf("%v\n", ixField.Tag)
        }

      10. Anonymous fields and embedded structs

        1. Anonymous fields: accessed as structName.fieldType; a struct can have at most one anonymous field per data type
        2. Embedded structs: the fields of an embedded anonymous struct are accessed through the outer struct's name, similar to the composition pattern in software engineering
        3. Name collisions: a field in the outer struct shadows an identically named field in an inner one; A.b accesses the outer field, A.B.b the inner one
    2. Methods:

      1. In Go, a struct is like a simplified form of a class, so an object-oriented programmer may ask: where are the class methods? Go has a concept with the same name and roughly the same meaning: a Go method is a function that acts on a receiver, a variable of some type, so a method is a special kind of function. A type plus its methods is the equivalent of a class in object-oriented programming. One important difference: in Go, the code for a type and the code for the methods bound to it need not live together; they can be in different source files, with the single requirement that they be in the same package.

      2. The set of all methods on a type T (or *T) is called the method set of T (or *T).

        Because methods are functions, method overloading is not allowed: a type can have only one method with a given name. But there is overloading based on the receiver type: methods with the same name may exist on two or more different receiver types, which is allowed within the same package:

        func (a *denseMatrix) Add(b Matrix) Matrix
        func (a *sparseMatrix) Add(b Matrix) Matrix

      3. Method definition format:

        func (recv receiver_type) methodName(parameter_list) (return_value_list) { ... }

      4. Calling a method:

        recv.methodName(); recv is like this or self in object-oriented languages

      5. An example:

        package main

        import "fmt"

        type TwoInts struct {
            a int
            b int
        }

        func main() {
            two1 := new(TwoInts)
            two1.a = 12
            two1.b = 10

            fmt.Printf("The sum is: %d\n", two1.AddThem())
            fmt.Printf("Add them to the param: %d\n", two1.AddToParam(20))

            two2 := TwoInts{3, 4}
            fmt.Printf("The sum is: %d\n", two2.AddThem())
        }

        func (tn *TwoInts) AddThem() int {
            return tn.a + tn.b
        }

        func (tn *TwoInts) AddToParam(param int) int {
            return tn.a + tn.b + param
        }

      6. Differences between functions and methods:

        1. A function takes the variable as an argument: Function1(recv)

          A method is called on the variable: recv.Method1()

        2. Methods are not mixed into the data definition (the struct): they are orthogonal to the type; representation (data) and behavior (methods) are independent.

      7. Pointers as receivers:

        1. Calling with either a pointer or a value is legal; Go dereferences automatically
        2. Both pointer methods and value methods can be called on pointers or non-pointers
      8. Use getters and setters to read or set an object's fields

      9. Multiple inheritance can be achieved by embedding several anonymous types in one type; the anonymous types' methods are promoted to methods of the outer type

      10. Summary:

        1. In Go, the type is the class
        2. Go has a concept similar to class inheritance in object-oriented languages, for code reuse and polymorphism
        3. In Go, code reuse is achieved through composition and delegation; polymorphism through interfaces.
        4. A type can override the methods of an embedded anonymous type
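The embedding, method promotion, and overriding points above can be sketched as follows (the Engine/Car types are made up for illustration):

```go
package main

import "fmt"

type Engine struct{ Power int }

func (e Engine) Start() string { return "engine started" }

// Car embeds Engine: Engine's fields and methods are promoted to Car
type Car struct {
	Engine
	Brand string
}

// Car can define its own methods on top of the promoted ones
func (c Car) Describe() string {
	return fmt.Sprintf("%s with %d hp", c.Brand, c.Power)
}

func main() {
	c := Car{Engine: Engine{Power: 150}, Brand: "Acme"}
	fmt.Println(c.Start())      // promoted method from Engine
	fmt.Println(c.Power)        // promoted field access
	fmt.Println(c.Engine.Power) // the explicit path also works
	fmt.Println(c.Describe())   // Acme with 150 hp
}
```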
  9. Interfaces and reflection

    1. Interfaces

      1. Definition: an interface offers a way to describe an object's behavior: whatever can do this job can be used here.

      2. An interface defines a set of methods without their implementations; an interface cannot contain variables

      3. Interface definition syntax:

        type Namer interface {
            Method1(param_list) return_type
            Method2(param_list) return_type
            ...
        }

      4. A type does not declare explicitly that it implements an interface: interfaces are implemented implicitly. Multiple types can implement the same interface.

        A type that implements an interface may have other methods beyond those of the interface.

        A type can implement multiple interfaces.

        An interface value can hold a reference to an instance of any type that implements the interface (interfaces are dynamic types).

      5. Example (polymorphism in Go?):

        package main

        import "fmt"

        type Shaper interface {
            Area() float32
        }

        type Square struct {
            side float32
        }

        func (sq *Square) Area() float32 {
            return sq.side * sq.side
        }

        func main() {
            sq1 := new(Square)
            sq1.side = 5

            var areaIntf Shaper
            areaIntf = sq1
            // shorter, without separate declaration:
            // areaIntf := Shaper(sq1)
            // or even:
            // areaIntf := sq1
            fmt.Printf("The square has area: %f\n", areaIntf.Area())
        }

      6. Interfaces embedding interfaces

        1. An interface may contain one or more other interfaces, which is equivalent to listing the embedded interfaces' methods directly in the outer interface.

          type ReadWrite interface {
              Read(b Buffer) bool
              Write(b Buffer) bool
          }

          type Lock interface {
              Lock()
              Unlock()
          }

          type File interface {
              ReadWrite
              Lock
              Close()
          }
      7. Type assertions: how to detect and convert an interface variable's type

        1. Definition: a variable varI of interface type can contain a value of any implementing type, so we need a way to detect its dynamic type, i.e. the actual type of the value stored in the variable at run time. The dynamic type may vary during execution, but it is always assignable to the interface variable's own type. A type assertion tests whether varI contains a value of type T at a given moment:

          v := varI.(T)

          varI must be an interface-type variable. A type assertion may be invalid; the compiler checks the legality of the conversion as far as it can, but it cannot foresee every case, and a conversion that fails at run time causes an error. The safer way is the following form:

          if v, ok := varI.(T); ok {  // checked type assertion
              Process(v)
              return
          }
          // varI is not of type T

          If the conversion is legal, v is the value of varI converted to type T and ok is true; otherwise v is the zero value of type T, ok is false, and no runtime error occurs.

          Example: (I can't yet see what this is useful for, 2021/12/2)

          package main

          import (
              "fmt"
              "math"
          )

          type Square struct {
              side float32
          }

          type Circle struct {
              radius float32
          }

          type Shaper interface {
              Area() float32
          }

          func main() {
              var areaIntf Shaper
              sq1 := new(Square)
              sq1.side = 5

              areaIntf = sq1
              // Is Square the type of areaIntf?
              if t, ok := areaIntf.(*Square); ok {
                  fmt.Printf("The type of areaIntf is: %T\n", t)
              }
              if u, ok := areaIntf.(*Circle); ok {
                  fmt.Printf("The type of areaIntf is: %T\n", u)
              } else {
                  fmt.Println("areaIntf does not contain a variable of type Circle")
              }
          }

          func (sq *Square) Area() float32 {
              return sq.side * sq.side
          }

          func (ci *Circle) Area() float32 {
              return ci.radius * ci.radius * math.Pi
          }
      8. 类型判断:type-switch

        1. 接口变量的类型也可以使用type-switch结构来判断

        2. 接口类型变量可以代表任何类型,所以需要有类型判断

        3. 例子

          package main

          import "fmt"

          func classifier(items ...interface{}) {
              for i, x := range items {
                  switch x.(type) {
                  case bool:
                      fmt.Printf("Param #%d is a bool\n", i)
                  case float64:
                      fmt.Printf("Param #%d is a float64\n", i)
                  case int, int64:
                      fmt.Printf("Param #%d is an int\n", i)
                  case nil:
                      fmt.Printf("Param #%d is a nil\n", i)
                  case string:
                      fmt.Printf("Param #%d is a string\n", i)
                  default:
                      fmt.Printf("Param #%d is unknown\n", i)
                  }
              }
          }

          The function can be called like this: classifier(13, -14.3, "BELGIUM", complex(1, 2), nil, false).

          Type testing and conversion are very useful when handling data of unknown type coming from outside the program, for example when parsing JSON- or XML-encoded data.

      9. Testing whether a value implements an interface:

        type Stringer interface {
            String() string
        }

        if sv, ok := v.(Stringer); ok {
            fmt.Printf("v implements String(): %s\n", sv.String()) // note: sv, not v
        }

        An interface is a contract that the implementing type must satisfy. It describes a type's behavior: what the type can do. Interfaces completely separate what a type can do from how it does it, allowing variables of the same interface type to behave differently at different times; this is the essence of polymorphism.

      10. Method sets and interfaces (this section is a bit obscure)

        1. Summary

          When calling a method on an interface, the receiver must have the same type as in the method definition, or be directly discernible from the concrete type P:

          1. Pointer methods can be called through pointers
          2. Value methods can be called through values
          3. Methods with value receivers can be called through pointers, because the pointer is dereferenced first
          4. Methods with pointer receivers cannot be called through values, because a value stored in an interface has no address
        2. The Go language specification defines the calling rules for interface method sets:

          1. The callable method set of type *T contains all methods with receiver *T or T
          2. The callable method set of type T contains all methods with receiver T
          3. The callable method set of type T does not contain methods with receiver *T
      11. The empty interface

        1. Concept: it contains no methods and places no requirements on implementations (similar to Java's Object)

        2. A variable of empty interface type, var val interface{}, can be assigned a value of any type

        3. Copying a slice of data into a slice of empty interfaces is not allowed, because the memory layouts differ; the elements must be copied one by one with for-range

          var dataSlice []myType = FuncReturnSlice()
          var interfaceSlice []interface{} = make([]interface{}, len(dataSlice))
          for i, d := range dataSlice {
              interfaceSlice[i] = d
          }
        4. The value of one interface can be assigned to a variable of another interface type, as long as the underlying type implements the required methods

    2. Reflection

      1. Concept: reflection is the ability of a program to examine its own structure, particularly its types; it is a form of metaprogramming. Reflection can examine types and variables at runtime, e.g. their size and methods, and can call those methods dynamically.

      2. In Go's reflect package, Type represents a Go type and Value provides the reflection interface to a Go value

      3. Two functions:

        func TypeOf(i interface{}) Type
        func ValueOf(i interface{}) Value
      4. Modifying or setting values through reflection

        package main

        import (
            "fmt"
            "reflect"
        )

        func main() {
            var x float64 = 3.4
            v := reflect.ValueOf(x)
            // setting a value:
            // v.SetFloat(3.1415) // Error: will panic: reflect.Value.SetFloat using unaddressable value
            fmt.Println("settability of v:", v.CanSet())
            v = reflect.ValueOf(&x) // Note: take the address of x.
            fmt.Println("type of v:", v.Type())
            fmt.Println("settability of v:", v.CanSet())
            v = v.Elem()
            fmt.Println("The Elem of v is: ", v)
            fmt.Println("settability of v:", v.CanSet())
            v.SetFloat(3.1415) // this works!
            fmt.Println(v.Interface())
            fmt.Println(v)
        }
        // Output:
        // settability of v: false
        // type of v: *float64
        // settability of v: false
        // The Elem of v is: <float64 Value>
        // settability of v: true
        // 3.1415
        // <float64 Value>
      5. Reflecting on structs:

        1. The NumField() method returns the number of fields in the struct
        2. A for loop with an index retrieves each field's value via Field(i)
        3. Methods defined on the struct can be called by index, e.g.: Method(n).Call(nil)
      6. Interfaces and dynamic typing:

        1. Interfaces in Go are similar to those in Java/C#: both require an implementation of a specified method set. But Go's are more flexible and general: any type that provides implementations of an interface's methods implicitly implements that interface, without declaring it explicitly.

          Compared to other languages, Go is unique in combining interface values, static type checking (does the type implement the interface?), and dynamic conversion at runtime, all without requiring an explicit declaration that a type satisfies an interface. This feature lets us define and use new interfaces without changing existing code.

          A function that takes one or more interface types as parameters can be called with any type that implements the interface. A type that implements an interface can be passed to any function taking that interface as a parameter.

        2. Interface inheritance:

          1. When a type contains (embeds) a pointer to another type (which implements one or more interfaces), the outer type can use all of that type's interface methods. A type can provide something like multiple inheritance by embedding multiple interfaces:

            type ReaderWriter struct {
                *io.Reader
                *io.Writer
            }
    3. Summary of object orientation in Go:

      1. Go has no classes, but rather loosely coupled types and methods implementing interfaces.
      2. Encapsulation:
        1. Package-scoped: identifiers starting with a lowercase letter are visible only within their own package
        2. Exported: identifiers starting with an uppercase letter are visible outside the package as well
      3. Inheritance:
        1. Achieved through composition: embed one (or more) types carrying the desired behavior (fields and methods); multiple inheritance can be achieved by embedding multiple types
      4. Polymorphism:
        1. Achieved through interfaces: an instance of a type can be assigned to a variable of any interface type it implements. Types and interfaces are loosely coupled, and multiple inheritance can be achieved by implementing multiple interfaces. Go interfaces are not a variant of Java or C# interfaces: interfaces are independent of each other, and they are key to large-scale programming and adaptable, evolving design.
  10. Error handling and testing:

    1. Go's predefined error interface type

      type error interface {
          Error() string
      }
    2. Defining an error:

      err := errors.New("math - square root of negative number")

    3. Runtime exceptions and panic

      1. When a runtime error such as an out-of-bounds array index or a failed type assertion occurs, the Go runtime triggers a runtime panic and the program crashes, throwing a value of interface type runtime.Error. This error value has a RuntimeError() method that distinguishes it from ordinary errors.

        A panic can also be initiated directly from code: when an error condition (in the code we are testing) is so severe and unrecoverable that the program cannot continue, call the panic function to produce a run-time error that aborts the program. panic takes an argument of any type, usually a string, which is printed as the program dies. The Go runtime takes care of aborting the program and reporting debug information.

      2. How panic propagates:

        Calling panic inside deeply nested function calls immediately stops execution of the current function. All defer statements are guaranteed to run, and control is handed back to the caller of the function that received the panic. This bubbles up to the top level, running each level's defers, until the program crashes at the top of the stack and the value passed to panic is reported on the command line: this termination process is called panicking.

    4. Recovering from a panic

      1. A panic unwinds the stack until a deferred recover() is called or the program aborts.

The Cargo Book

  1. Cargo Guide

    1. What is cargo?

      Cargo is the Rust package manager. It is a tool that allows Rust packages to declare their various dependencies and ensure that you’ll always get a repeatable build.

    2. Creating a new package

      1. cargo new hello_world --bin
      2. cargo build
      3. cargo run
      4. cargo build --release
    3. Working on an existing package

      1. git clone
      2. cargo build: fetch dependencies and build them
    4. Package layout

      1. Cargo.toml and Cargo.lock are stored in the root of your package
      2. src for source code
      3. src/lib.rs for default library files
      4. Other executables can be placed in src/bin/.
      5. Benchmarks go in the benches directory.
      6. Examples go in the examples directory.
      7. Integration tests go in the tests directory.
    5. toml and lock

      1. Cargo.toml is about describing your dependencies in a broad sense, and is written by you.
      2. Cargo.lock contains exact information about your dependencies. It is maintained by Cargo and should not be manually edited.
      3. cargo update will update dependencies to the newest allowed versions
    6. Tests

      1. Command cargo test
      2. runs unit tests defined alongside the code in src
      3. runs integration-style tests in the tests directory
  2. Cargo Reference

    1. Specifying Dependencies
      1. specifying dependencies from crates.io: the default choice
      2. Caret requirements: an update is allowed if the new version number does not modify the left-most non-zero digit in the major, minor, patch grouping
      3. Tilde requirements: specify a minimal version with some ability to update (the unspecified parts may change)
      4. Wildcard requirements: allow any version where the wildcard is positioned
      5. Comparison requirements: allow manually specifying a version range or an exact version to depend on
      6. Multiple requirements: separated with commas
      7. specifying dependencies from other registries
      8. specifying dependencies from git repositories
      9. specifying path dependencies
      10. Multiple locations
      11. Platform-specific dependencies
  3. Cargo commands

    1. General Commands
      1. cargo
      2. cargo help
      3. cargo version
    2. Build Commands
      1. cargo bench: execute benchmarks of a package
      2. cargo build: compile the current package
      3. cargo check: check a local package and all of its dependencies for errors
      4. cargo clean: remove artifacts from the target directory that Cargo has generated in the past
      5. cargo doc: build the documentation for the local package and all dependencies; the output is placed in target/doc
      6. cargo fetch: fetch dependencies of a package from the network
      7. cargo fix: automatically fix lint warnings reported by rustc
      8. cargo run: run a binary or example of the local package
      9. cargo rustc: compile the current package, passing extra options to the compiler
      10. cargo rustdoc: build a package's documentation
      11. cargo test: execute unit and integration tests of a package
    3. Manifest Commands
      1. cargo generate-lockfile: create the Cargo.lock file for the current package or workspace; if it already exists, rebuild it with the latest available version of every package
      2. cargo locate-project: print a JSON object to stdout with the full path to the Cargo.toml manifest
      3. cargo metadata: output JSON to stdout containing info about the members and resolved dependencies of the current package; --format-version is recommended
      4. cargo pkgid: print out the fully qualified package ID specifier for a package or dependency in the current workspace
      5. cargo tree: display a tree of dependencies to the terminal
      6. cargo update: update dependencies as recorded in the local lock file
      7. cargo vendor: vendor all crates.io and git dependencies for a project into the specified directory at <path>. After this command completes, the vendor directory specified by <path> will contain all remote sources from the dependencies specified.
      8. cargo verify-project: parse the local manifest and check its validity
    4. Package commands
      1. cargo init: create a new Cargo manifest in the current directory.
      2. cargo install: This command manages Cargo's local set of installed binary crates. Only packages which have executable [[bin]] or [[example]] targets can be installed, and all executables are installed into the installation root's bin folder.
      3. cargo new: create a new Cargo package in the given directory.
      4. cargo search: this performs a textual search for crates on the crates.io registry. The matching crates will be displayed along with their descriptions in TOML format suitable for copying into a Cargo.toml manifest.
      5. cargo uninstall: by default all binaries of a crate are removed, but the --bin and --example flags can be used to only remove particular binaries.
    5. Publishing Commands
      1. cargo login:This command will save the API token to disk so that commands that require authentication, such as cargo-publish(1), will be automatically authenticated. The token is saved in $CARGO_HOME/credentials.toml. CARGO_HOME defaults to .cargo in your home directory.
      2. cargo owner:This command will modify the owners for a crate on the registry. Owners of a crate can upload new versions and yank old versions. Non-team owners can also modify the set of owners, so take care!
      3. cargo package:This command will create a distributable, compressed .crate file with the source code of the package in the current directory. The resulting file will be stored in the target/package directory.
      4. cargo publish:This command will create a distributable, compressed .crate file with the source code of the package in the current directory and upload it to a registry.
      5. cargo yank:The yank command removes a previously published crate’s version from the server’s index. This command does not delete any data, and the crate will still be available for download via the registry’s download link.

The Rust Programming Language

  1. Programming a Guessing Game

    use std::io;
    use rand::Rng;
    use std::cmp::Ordering;

    fn main() {
        println!("Guess the number!");
        let secret_number = rand::thread_rng().gen_range(1..101);
        // println!("The secret number is: {}", secret_number);
        loop {
            println!("Please input your guess.");
            let mut guess = String::new();
            io::stdin().read_line(&mut guess).expect("Failed to read line");
            let guess: u32 = match guess.trim().parse() {
                Ok(num) => num,
                Err(_) => continue,
            };
            match guess.cmp(&secret_number) {
                Ordering::Less => println!("Too small!"),
                Ordering::Greater => println!("Too big!"),
                Ordering::Equal => {
                    println!("You win!");
                    break;
                }
            }
            println!("You guessed :{}", guess);
        }
    }
  2. Common Programming Concepts

    1. Variables and Mutablity

      1. by default variables are immutable
      2. to make a variable mutable, use the keyword mut before its name
      3. Difference between variables and constants
        1. mut can not be used with constants
        2. declare constants using const instead of let
        3. constants can be declared in any scope
        4. constants may be set only to a constant expression
        5. constants naming convention: use all uppercase with underscores between words
      4. Shadowing
        1. we can shadow a variable by using the same variable's name and repeating the use of the let keyword
        2. shadowing allows us to reuse the same name for values of different types
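A minimal sketch of both shadowing rules, assuming a hypothetical helper `spaces_len` (not from the book): the second `let` rebinds the name, and the rebinding may change the type.

```rust
fn spaces_len(input: &str) -> usize {
    // Shadow `input: &str` with a new binding of a different type (usize).
    let input = input.len();
    input
}

fn main() {
    // Plain shadowing: same name, new value, still immutable at each step.
    let x = 5;
    let x = x + 1; // x is now 6
    let x = x * 2; // x is now 12
    println!("x = {}", x);

    // Shadowing across types: "   " (a &str) becomes 3 (a usize).
    println!("spaces = {}", spaces_len("   "));
}
```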
    2. Data types

      1. Scalar Types

        1. Integer Types

          Length Signed Unsigned
          8-bit i8 u8
          16-bit i16 u16
          32-bit i32 u32
          64-bit i64 u64
          128-bit i128 u128
          arch isize usize

          Range calculation: a signed type of n bits stores numbers from -(2^(n-1)) to 2^(n-1) - 1; an unsigned type stores 0 to 2^n - 1

          e.g. 1_000 = 1000, 57u8 = u8 type of value 57

          more examples:

          Number literals Example
          Decimal 98_222
          Hex 0xff
          Octal 0o77
          Binary 0b1111_0000
          Byte (u8 only) b'A'
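The literal forms in the table above can be checked directly; this is a sketch (not from the book) confirming what each notation denotes.

```rust
fn main() {
    assert_eq!(98_222, 98222);    // decimal with a visual separator
    assert_eq!(0xff, 255);        // hexadecimal
    assert_eq!(0o77, 63);         // octal
    assert_eq!(0b1111_0000, 240); // binary
    assert_eq!(b'A', 65u8);       // byte literal (u8 only)
    assert_eq!(57u8, 57);         // type-suffixed literal
    println!("all literal forms check out");
}
```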
        2. Floating-Point Types

          1. single-precision: f32
          2. double-precision: f64
        3. Boolean type

          1. bool
        4. Character Type

          1. size:4 bytes
          2. Unicode
          3. char
      2. Compound Types

        1. The Tuple Type

          1. group together a number of values with a variety of types into one compound type

          2. fix length

          3. e.g.

            fn main() {
                let tup: (i32, f64, u8) = (500, 6.4, 1);
            }
          4. elements are accessed by index, e.g. tup.0

        2. The Array Type

          1. every element of an array must have the same type

          2. fixed length

          3. e.g

            fn main() {
                let a = [1, 2, 3, 4, 5];
                let a: [i32; 5] = [1, 2, 3, 4, 5];
                let a = [3; 5]; // size 5 with initial value 3
            }
          4. elements are accessed by index, e.g. a[0]

    3. Functions

      1. Coding style: snake case; all letters are lowercase and underscores separate words
      2. start with fn
      3. Function Parameters
        1. type of each parameter must be declared
        2. multiple parameters separated with commas
      4. Function bodies contain statements and expressions
        1. Expressions do not include ending semicolons; adding a semicolon turns an expression into a statement, which does not return a value
      5. Functions with return values
        1. Declare the return value type after a ->
        2. the return value of the function is synonymous with the value of the final expression in the block of the body of a function.
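The two points above can be sketched with hypothetical functions (names are mine, not the book's): the last expression without a semicolon is the return value, and `return` exits early.

```rust
// The final expression in the body (no semicolon) is the return value.
fn plus_one(x: i32) -> i32 {
    x + 1 // adding a `;` here would turn this into a statement and fail to compile
}

// An explicit `return` exits early; otherwise the last expression is returned.
fn abs_diff(a: i32, b: i32) -> i32 {
    if a > b {
        return a - b;
    }
    b - a
}

fn main() {
    println!("{}", plus_one(5));
    println!("{}", abs_diff(3, 10));
}
```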
    4. Comments

      1. //
    5. Control Flow

      1. If Expressions(if-else if-else)

      2. Loop(loop)

        fn main() {
            loop {
                println!("again!");
            }
        }
      3. break & continue can use a label to apply to the labeled loop

      4. you can add the value you want returned after the break expression

      5. conditional loop with while

      6. For loop:for element in collection
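Points 3 and 4 above can be sketched together (hypothetical function names, assumptions of mine): `break` carrying a value out of a `loop`, and a label making `break` apply to an outer loop.

```rust
// `break` can carry a value out of a `loop`.
fn first_power_of_two_above(n: u32) -> u32 {
    let mut p = 1;
    loop {
        if p > n {
            break p; // the value after `break` becomes the loop's value
        }
        p *= 2;
    }
}

// A label lets `break` apply to the labeled (outer) loop.
fn find_pair_summing_to(target: i32) -> (i32, i32) {
    let mut found = (0, 0);
    'outer: for a in 0..10 {
        for b in 0..10 {
            if a + b == target {
                found = (a, b);
                break 'outer; // exits both loops at once
            }
        }
    }
    found
}

fn main() {
    println!("{}", first_power_of_two_above(100));
    println!("{:?}", find_pair_summing_to(7));
}
```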

  3. Understanding Ownership

    1. What is ownership?

      1. Memory is managed through a system of ownership with a set of rules that the compiler checks at compile time.
      2. Stack and Heap
        1. stack for known, fixed size data at compile time
        2. the heap is less organized; free space must be searched for
        3. hence the stack is more efficient
        4. when code calls a function, the values passed into the function and its local variables get pushed onto the stack; when the function is over, those values get popped off the stack
        5. heap is controlled by ownership
      3. Ownership rules
        1. Each value in Rust has a variable that's called its owner
        2. There can only be one owner at a time
        3. When the owner goes out of scope,the value will be dropped
      4. Variable Scope
        1. When a variable comes into scope, it is valid
        2. it remains valid until it goes out of scope
      5. The String type
        1. a string literal is immutable, but String is not
      6. Memory and Allocation
      7. Ways Variables and Data interact: Move
        1. primitive types are allocated on the stack
        2. references are allocated on the stack
        3. data is allocated on the heap
        4. For primitive types, s1 = s2 means a copy on the stack
        5. For non-primitive types, s1 = s2 means the reference is copied on the stack while the data on the heap is untouched; the source variable is then invalidated (moved)
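The stack-copy vs heap-move distinction above can be sketched as follows (a minimal example of mine, not the book's listing):

```rust
fn main() {
    // Stack-only data (i32 implements Copy): assignment copies, both stay valid.
    let n1 = 5;
    let n2 = n1;
    println!("n1 = {}, n2 = {}", n1, n2); // fine: n1 was copied, not moved

    // Heap-backed data (String): assignment moves, the source is invalidated.
    let s1 = String::from("hello");
    let s2 = s1;
    // println!("{}", s1); // compile error: value borrowed here after move
    println!("s2 = {}", s2);

    // clone() deep-copies the heap data, so both remain usable.
    let s3 = String::from("world");
    let s4 = s3.clone();
    println!("s3 = {}, s4 = {}", s3, s4);
}
```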
      8. Ways Variables and Data interact: Clone
        1. if we do want to deeply copy the heap data of the String, not just the stack data, we can use a common method called clone
      9. Stack-Only data:copy
        1. Types such as integers that have a known size at compile time are stored entirely on the stack, so copies of the actual values are quick to make.
        2. if a type implements the Copy trait, the older variable is still usable after assignment.
      10. Ownership and functions
        1. The semantics for passing a value to a function are similar to those for assigning a value to a variable. Passing a variable to a function will move or copy it. For primitive types, the variable is still valid after the function call; but for other types (e.g. String), calling the function moves ownership, which means the variable is invalid after the call.
      11. Return values and scope
        1. returning values can also transfer ownership
        2. The ownership of a variable follows the same pattern every time: assigning a value to another variable moves it. When a variable that includes data on the heap goes out of scope, the value is cleaned up by drop unless the data has been moved to be owned by another variable.
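Ownership transfer through return values can be sketched like this (hypothetical function names of mine, following the book's pattern):

```rust
// Returning a value transfers ownership to the caller.
fn gives_ownership() -> String {
    String::from("yours")
}

// Takes ownership of a String and returns it, handing ownership back.
fn takes_and_gives_back(s: String) -> String {
    s
}

fn main() {
    let s1 = gives_ownership(); // ownership moves into s1
    let s2 = String::from("hello");
    let s3 = takes_and_gives_back(s2); // s2 is moved in, then back out into s3
    // println!("{}", s2); // compile error: s2 was moved
    println!("{} {}", s1, s3);
}
```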
    2. References and Borrowing

      1. references allow you to refer to some value without taking ownership of it.

      2. The &s1 syntax lets us create a reference that refers to the value of s1 but does not own it. Because it does not own it, the value it points to will not be dropped when the reference stops being used.

      3. When functions have references as parameters instead of the actual values, we won't need to return the values in order to give back ownership, because we never had ownership. We call the action of creating a reference borrowing.

      4. Modifying something borrowed is not allowed, just as variables are immutable by default, so are references.

      5. Mutable references, e.g.

        fn main() {
            let mut s = String::from("hello");

            change(&mut s);
        }

        fn change(some_string: &mut String) {
            some_string.push_str(", world");
        }

        Mutable references have one big restriction: you can have only one mutable reference to a particular piece of data at a time. Preventing multiple mutable references to the same data at the same time allows mutation, but in a very controlled fashion; Rust thereby prevents data races at compile time. Similarly, combining mutable and immutable references is not permitted (think of it as a read-write-lock mechanism). Note that a reference's scope starts from where it is introduced and continues through the last time that reference is used.
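The "scope ends at last use" point above can be sketched like this (a minimal example of mine): a mutable borrow is allowed once the immutable borrows are no longer used.

```rust
fn main() {
    let mut s = String::from("hello");

    let r1 = &s; // immutable borrow
    let r2 = &s; // a second immutable borrow is fine
    println!("{} and {}", r1, r2);
    // r1 and r2 are not used after this point, so their borrows end here

    let r3 = &mut s; // allowed: no immutable borrows overlap this one
    r3.push_str(", world");
    println!("{}", r3);
}
```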

      6. Dangling References

        In languages with pointers, it’s easy to erroneously create a dangling pointer, a pointer that references a location in memory that may have been given to someone else, by freeing some memory while preserving a pointer to that memory. In Rust, by contrast, the compiler guarantees that references will never be dangling references: if you have a reference to some data, the compiler will ensure that the data will not go out of scope before the reference to the data does.

      7. The rules of References

        1. At any given time, you can have either one mutable reference or any number of immutable references.
        2. References must always be valid.
    3. The Slice Type

      1. A String Slice is a reference to part of a String.
      2. We can create slices using a range within brackets by specifying [starting_index..ending_index], where starting_index is the first position in the slice and ending_index is one more than the last position in the slice.
      3. if starting_index is omitted, the slice starts at zero; if ending_index is omitted, the slice ends at the last byte; if both are omitted, the slice references the whole string
      4. The type that signifies “string slice” is written as &str
      5. The compiler will ensure the references into the String remain valid.
      6. String literals are slices: in let s = "Hello, World", the type of s is &str, a slice pointing to that specific point of the binary. This is also why string literals are immutable; &str is an immutable reference.
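The slice rules above can be sketched with a `first_word` function (the book uses this example; the exact body here is my reconstruction):

```rust
// Returns the first word of a string as a slice borrowed from the input.
fn first_word(s: &str) -> &str {
    for (i, b) in s.bytes().enumerate() {
        if b == b' ' {
            return &s[..i];
        }
    }
    s // no space found: the whole string is one word
}

fn main() {
    let s = String::from("hello world");
    assert_eq!(first_word(&s), "hello");
    assert_eq!(&s[..5], "hello"); // starting index omitted: starts at 0
    assert_eq!(&s[6..], "world"); // ending index omitted: runs to the end
    assert_eq!(&s[..], "hello world"); // both omitted: the whole string
    println!("{}", first_word(&s));
}
```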
  4. Using structs to structure related data

    1. Defining an instantiating structs

      1. Structs are similar to tuples, but unlike with tuples, you name each piece of data so it is clear what the values mean. Structs are more flexible than tuples: you don't have to rely on the order of the data to specify or access the values of an instance.
      2. fields are read and written with dot notation.
    2. Creating instances from other instances with struct update syntax

      1. Using struct update syntax, we can achieve the same effect with less code, as shown in Listing 5-7. The syntax .. specifies that the remaining fields not explicitly set should have the same value as the fields in the given instance.

        struct User {
            active: bool,
            username: String,
            email: String,
            sign_in_count: u64,
        }

        fn main() {
            let user1 = User {
                email: String::from("someone@example.com"),
                username: String::from("someusername123"),
                active: true,
                sign_in_count: 1,
            };

            let user2 = User {
                email: String::from("another@example.com"),
                ..user1
            };
        }
    3. Using Tuple Structs Without Named Fields to Create Different Types

      struct Color(i32, i32, i32);
    4. Unit-Like Structs Without Any Fields

      fn main() {
          struct AlwaysEqual;

          let subject = AlwaysEqual;
      }
    5. Method Syntax

      1. Methods are similar to functions, but they're defined within the context of a struct and their first parameter is always self, which represents the instance of the struct the method is being called on.

      2. Where is the -> Operator?

        1. Rust has a feature called automatic referencing and dereferencing.
      3. Associated Functions

        1. All functions defined within an impl block are called associated functions because they're associated with the type named after the impl

        2. Associated functions that aren’t methods are often used for constructors that will return a new instance of the struct.

          #[derive(Debug)]
          struct Rectangle {
              width: u32,
              height: u32,
          }

          impl Rectangle {
              fn square(size: u32) -> Rectangle {
                  Rectangle {
                      width: size,
                      height: size,
                  }
              }
          }

          fn main() {
              let sq = Rectangle::square(3);
          }
      4. Multiple impl Blocks

        1. It is valid to separate methods into multiple impl blocks, but not recommended.
  5. Enums and Pattern Matching

    1. Defining an Enum

      enum IpAddrKind {
          V4,
          V6,
      }

      fn main() {
          let four = IpAddrKind::V4;
          let six = IpAddrKind::V6;

          route(IpAddrKind::V4);
          route(IpAddrKind::V6);
      }

      fn route(ip_kind: IpAddrKind) {}

    2. Enum Values

      enum IpAddr {
          V4(String),
          V6(String),
      }

      let home = IpAddr::V4(String::from("127.0.0.1"));

      let loopback = IpAddr::V6(String::from("::1"));

      Enum variants can hold values of any type.

    3. Enum Methods

      fn main() {
          enum Message {
              Quit,
              Move { x: i32, y: i32 },
              Write(String),
              ChangeColor(i32, i32, i32),
          }

          impl Message {
              fn call(&self) {
                  // method body would be defined here
              }
          }

          let m = Message::Write(String::from("hello"));
          m.call();
      }
    4. The Option enum and its advantages over null values

      1. When we have a Some value, we know that a value is present and the value is held within the Some
      2. When we have a None value, in some sense it means the same thing as null
      3. In other words, you have to convert an Option<T> to a T before you can perform T operations with it. Generally, this helps catch one of the most common issues with null: assuming that something isn't null when it actually is.
      4. match expression is a control flow construct that does just this when used with enums.
    5. The match control flow operator

      fn main() {
          fn plus_one(x: Option<i32>) -> Option<i32> {
              match x {
                  None => None,
                  Some(i) => Some(i + 1),
              }
          }
          let five = Some(5);
          let six = plus_one(five);
          let none = plus_one(None);
      }
    6. Matches Are Exhaustive

      1. Matches in Rust are exhaustive: we must exhaust every last possibility in order for the code to be valid, especially in the case of Option<T>, where Rust prevents us from forgetting to explicitly handle the None case.
    7. Catch-all Patterns and the _ Placeholder

      1. for a match's default arm, we can use other or _ to handle the remaining cases.

        fn main() {
            let dice_roll = 9;
            match dice_roll {
                3 => add_fancy_hat(),
                7 => remove_fancy_hat(),
                other => move_player(other),
            }

            fn add_fancy_hat() {}
            fn remove_fancy_hat() {}
            fn move_player(num_spaces: u8) {}
        }

        fn main() {
            let dice_roll = 9;
            match dice_roll {
                3 => add_fancy_hat(),
                7 => remove_fancy_hat(),
                _ => (),
            }

            fn add_fancy_hat() {}
            fn remove_fancy_hat() {}
        }
    8. Concise Control Flow with if let

      1. The if let syntax lets you combine if and let into a less verbose way to handle values that match one pattern while ignoring the rest.

      2. In other words, you can think of if let as syntax sugar for a match that runs code when the value matches one pattern and then ignores all other values.

        fn main() {
            let config_max = Some(3u8);
            match config_max {
                Some(max) => println!("The maximum is configured to be {}", max),
                _ => (),
            }
        }

        fn main() {
            let config_max = Some(3u8);
            if let Some(max) = config_max {
                println!("The maximum is configured to be {}", max);
            }
        }

        #[derive(Debug)]
        enum UsState {
            Alabama,
            Alaska,
            // --snip--
        }

        enum Coin {
            Penny,
            Nickel,
            Dime,
            Quarter(UsState),
        }

        fn main() {
            let coin = Coin::Penny;
            let mut count = 0;
            match coin {
                Coin::Quarter(state) => println!("State quarter from {:?}!", state),
                _ => count += 1,
            }
        }

        #[derive(Debug)]
        enum UsState {
            Alabama,
            Alaska,
            // --snip--
        }

        enum Coin {
            Penny,
            Nickel,
            Dime,
            Quarter(UsState),
        }

        fn main() {
            let coin = Coin::Penny;
            let mut count = 0;
            if let Coin::Quarter(state) = coin {
                println!("State quarter from {:?}!", state);
            } else {
                count += 1;
            }
        }

  6. Managing Growing projects with packages,Crates,and Modules

    (haven’t read)

  7. Common Collections

    1. Types:

      1. A vector allows you to store a variable number of values next to each other
      2. A string is a collection of characters
      3. A hash map allows you to associate a value with a particular key
    2. Storing Lists of Values with Vectors

      1. Creating a new vector

        let v: Vec<i32> = Vec::new()

        Simply

        let v = vec![1,2,3]

      2. Update a vector

        v.push(1)

      3. Drop a vector Drops its elements

        fn main() {
            {
                let v = vec![1, 2, 3, 4];

                // do stuff with v
            } // <- v goes out of scope and is freed here
        }
      4. Read elements of vectors

        with index syntax or the get method

        fn main() {
            let v = vec![1, 2, 3, 4, 5];

            let third: &i32 = &v[2];
            println!("The third element is {}", third);

            match v.get(2) {
                Some(third) => println!("The third element is {}", third),
                None => println!("There is no third element."),
            }
        }
      5. Iterating over the values in a vector

        fn main() {
            let v = vec![100, 32, 57];
            for i in &v {
                println!("{}", i);
            }
        }

        fn main() {
            let mut v = vec![100, 32, 57];
            for i in &mut v {
                *i += 50;
            }
        }
      6. Using an Enum to Store Multiple Types

        fn main() {
            enum SpreadsheetCell {
                Int(i32),
                Float(f64),
                Text(String),
            }

            let row = vec![
                SpreadsheetCell::Int(3),
                SpreadsheetCell::Text(String::from("blue")),
                SpreadsheetCell::Float(10.12),
            ];
        }
    3. Storing UTF-8 Encoded Text with Strings

      1. Create a new String

        let data = "initial contents".to_string();

        let s = String::from("initial contents");

      2. Update a String

        s.push_str("bar");

        s.push('l');

        let s1 = String::from("Hello, ");
        let s2 = String::from("world!");
        let s3 = s1 + &s2;
        // + uses fn add(self, s: &str) -> String, so s1 is moved here

        Using the format! macro is recommended:

        let s1 = String::from("tic");
        let s2 = String::from("tac");
        let s3 = String::from("toe");

        let s = format!("{}-{}-{}", s1, s2, s3);

        The format! macro works in the same way as println!, but instead of printing the output to the screen, it returns a String with the contents. The code generated by the format! macro uses references, so this call doesn’t take ownership of any of its parameters.

      3. Index into Strings

        1. Indexing into a String by integer is not valid, because String is UTF-8 encoded and a byte index may not fall on a character boundary.
      4. Slice Strings

        &s[start..end]
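        String slicing takes byte ranges, and a range that does not end on a character boundary panics at runtime. A minimal sketch:

        ```rust
        fn main() {
            let s = String::from("hello world");
            let hello = &s[0..5]; // byte range; must end on a char boundary
            let world = &s[6..11];
            println!("{} {}", hello, world);
        }
        ```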

      5. Methods for Iterating Over Strings

        for c in "abcde".chars()

        for b in "abcde".bytes()

    4. Storing keys with associated values in hash maps

      1. Creating a hash map

        use std::collections::HashMap;

        let mut scores = HashMap::new();
        scores.insert(String::from("Blue"), 10);
        scores.insert(String::from("Yellow"), 50);

        Another way, with two vectors:

        use std::collections::HashMap;

        let teams = vec![String::from("Blue"), String::from("Yellow")];
        let initial_scores = vec![10, 50];
        let mut scores: HashMap<_, _> =
            teams.into_iter().zip(initial_scores.into_iter()).collect();
      2. HashMaps and ownership

        1. For types that implement the Copy trait, the values are copied into the hash map. For owned values like String, the values will be moved and the hash map will be the owner of those values.
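        The book's example of this move behavior, shown here as a runnable sketch:

        ```rust
        use std::collections::HashMap;

        fn main() {
            let field_name = String::from("Favorite color");
            let field_value = String::from("Blue");

            let mut map = HashMap::new();
            map.insert(field_name, field_value); // both Strings are moved into the map
            // field_name and field_value are invalid here; using them is a compile error

            println!("{}", map["Favorite color"]);
        }
        ```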
      3. Accessing values in a hashmap

        let team_name = String::from("Blue");

        let score = scores.get(&team_name);

        use std::collections::HashMap;

        let mut scores = HashMap::new();
        scores.insert(String::from("Blue"), 10);
        for (key, value) in &scores {
            println!("{}: {}", key, value);
        }
      4. Updating a hashmap

        1. Overwriting a value

          scores.insert(String::from("Blue"), 10);
          scores.insert(String::from("Blue"), 20);
        2. Only inserting a value if the key has no value

          scores.entry(String::from("Yellow")).or_insert(50);

          The or_insert method on Entry is defined to return a mutable reference to the value for the corresponding Entry key if that key exists, and if not, inserts the parameter as the new value for this key and returns a mutable reference to the new value.

        3. Updating a value based on the old value

          let text = "hello world wonderful world";
          let mut map = HashMap::new();
          for word in text.split_whitespace() {
              let count = map.entry(word).or_insert(0);
              *count += 1;
          }

          The or_insert method actually returns a mutable reference (&mut v) to the value for this key.

  8. Error Handling

    1. Unrecoverable Errors with panic!

    2. Recoverable Errors with Result

      1. The Result Enum

        use std::fs::File;

        fn main() {
            let f = File::open("hello.txt");
            let f = match f {
                Ok(file) => file,
                Err(error) => panic!("Problem opening the file: {:?}", error),
            };
        }
      2. Matching on different errors

        use std::fs::File;
        use std::io::ErrorKind;

        fn main() {
            let f = File::open("hello.txt").unwrap_or_else(|error| {
                if error.kind() == ErrorKind::NotFound {
                    File::create("hello.txt").unwrap_or_else(|error| {
                        panic!("Problem creating the file: {:?}", error);
                    })
                } else {
                    panic!("Problem opening the file: {:?}", error);
                }
            });
        }
      3. Shortcuts for panic on error: unwrap and expect

        1. unwrap is a shortcut method that is implemented just like the match expression. If the Result value is the Ok variant, unwrap will return the value inside the Ok. If the Result is the Err variant, unwrap will call the panic! macro for us.

          let f = File::open("hello.txt").unwrap();

        2. We use expect in the same way as unwrap: to return the file handle or call the panic! macro. The error message used by expect in its call to panic! will be the parameter that we pass to expect, rather than the default panic! message that unwrap uses.

          let f = File::open("hello.txt").expect("Failed to open");

      4. Propagating Errors

        A shortcut for propagating errors: the ? operator

        use std::fs::File;
        use std::io::{self, Read};

        fn read_username_from_file() -> Result<String, io::Error> {
            let mut f = File::open("hello.txt")?;
            let mut s = String::new();
            f.read_to_string(&mut s)?;
            Ok(s)
        }

        If the value of the Result is an Ok, the value inside the Ok will get returned from this expression, and the program will continue. If the value is an Err, the Err will be returned from the whole function as if we had used the return keyword, so the error value gets propagated to the calling code.

        The ? operator can be used in functions whose return type is Result.
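        A small sketch of ? propagating an error without file I/O (the double helper is illustrative, not from the book):

        ```rust
        use std::num::ParseIntError;

        // `double` is a hypothetical helper: `?` early-returns the Err to the caller.
        fn double(input: &str) -> Result<i32, ParseIntError> {
            let n: i32 = input.parse()?; // on parse failure, Err is propagated here
            Ok(n * 2)
        }

        fn main() {
            assert_eq!(double("21"), Ok(42));
            assert!(double("abc").is_err());
            println!("ok");
        }
        ```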

    3. To panic! Or Not To panic!

      So how do you decide when you should call panic! and when you should return Result? When code panics, there’s no way to recover. You could call panic! for any error situation, whether there’s a possible way to recover or not, but then you’re making the decision on behalf of the code calling your code that a situation is unrecoverable. When you choose to return a Result value, you give the calling code options rather than making the decision for it. The calling code could choose to attempt to recover in a way that’s appropriate for its situation, or it could decide that an Err value in this case is unrecoverable, so it can call panic! and turn your recoverable error into an unrecoverable one. Therefore, returning Result is a good default choice when you’re defining a function that might fail.

  9. Generic Types, Traits, and Lifetimes

    1. Generic Data Types

      1. In Function Definitions

        Before we use a parameter in the body of the function, we have to declare the parameter name in the signature so the compiler knows what that name means.

        fn largest<T>(list: &[T]) -> T {...}

      2. In Struct Definitions

        struct Point<T> {
            x: T,
            y: T,
        }

        struct Point<T, U> {
            x: T,
            y: U,
        }
      3. In Enum Definitions

        enum Option<T> {
            Some(T),
            None,
        }

        enum Result<T, E> {
            Ok(T),
            Err(E),
        }
      4. In Method Definitions

        struct Point<T> {
            x: T,
            y: T,
        }

        impl<T> Point<T> {
            fn x(&self) -> &T {
                &self.x
            }
        }
    2. Traits: Defining Shared Behavior

      1. A trait tells the Rust compiler about functionality a particular type has and can share with other types. We can use traits to define shared behavior in an abstract way. We can use trait bounds to specify that a generic type can be any type that has certain behavior.

      2. Defining a trait

        pub trait Summary {
            fn summarize(&self) -> String;
        }
      3. Implementing a trait on a type

        pub struct NewsArticle {
            pub headline: String,
            pub location: String,
            pub author: String,
            pub content: String,
        }

        impl Summary for NewsArticle {
            fn summarize(&self) -> String {
                // ......
            }
        }
      4. Default Implementations

        traits can have default implementations
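        A minimal sketch of a default implementation, reusing the Summary trait from above (the Tweet unit struct is just for illustration):

        ```rust
        // Sketch: Summary redefined locally with a default method body.
        trait Summary {
            fn summarize(&self) -> String {
                String::from("(Read more...)") // default implementation
            }
        }

        struct Tweet; // illustrative unit struct

        impl Summary for Tweet {} // keeps the default

        fn main() {
            println!("{}", Tweet.summarize());
        }
        ```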

      5. Traits as parameters

      6. We have implemented the Summary trait on the NewsArticle and Tweet types. We can define a notify function that calls the summarize method on its item parameter, which is of some type that implements the Summary trait.

        pub fn notify(item: &impl Summary) {
            println!("Breaking news! {}", item.summarize());
        }
      7. Trait Bound Syntax

        The code in item 6 is straightforward, but it is actually syntactic sugar for a longer form called a trait bound.

        pub fn notify<T: Summary>(item: &T) {
            println!("Breaking news! {}", item.summarize());
        }
      8. Specifying Multiple Trait Bounds with + Syntax

        pub fn notify(item: &(impl Summary + Display)) {
            // .....
        }

        pub fn notify<T: Summary + Display>(item: &T) {
            // ......
        }
      9. Clearer Trait bounds with where clauses

        fn some_function<T, U>(t: &T, u: &U) -> i32
        where
            T: Display + Clone,
            U: Clone + Debug,
        {
      10. Returning types that implement traits

        fn returns_summarizable() -> impl Summary { ... }

        Attention: you can only use impl Trait if you’re returning a single type.

      11. Using Trait Bounds to Conditionally Implement Methods

        We can also conditionally implement a trait for any type that implements another trait. Implementations of a trait on any type that satisfies the trait bounds are called blanket implementations.
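        A minimal sketch of a blanket implementation (the Describe trait is hypothetical; it mirrors how the standard library implements ToString for every T: Display):

        ```rust
        use std::fmt::Display;

        // `Describe` is a made-up trait for illustration.
        trait Describe {
            fn describe(&self) -> String;
        }

        // Blanket implementation: Describe for any T that satisfies the bound.
        impl<T: Display> Describe for T {
            fn describe(&self) -> String {
                format!("value: {}", self)
            }
        }

        fn main() {
            let n = 3;
            println!("{}", n.describe());
            println!("{}", "hi".describe());
        }
        ```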

    3. Validating references with lifetimes

      1. Preventing dangling references with lifetimes

      2. The Borrow checker

      3. Generic Lifetimes in functions

      4. Lifetime Annotation Syntax

      5. Lifetime Annotations in function signatures

        fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
            if x.len() > y.len() {
                x
            } else {
                y
            }
        }

        The function signature now tells Rust that for some lifetime 'a, the function takes two parameters, both of which are string slices that live at least as long as lifetime 'a. The function signature also tells Rust that the string slice returned from the function will live at least as long as lifetime 'a. In practice, it means that the lifetime of the reference returned by the longest function is the same as the smaller of the lifetimes of the references passed in. These constraints are what we want Rust to enforce. Remember, when we specify the lifetime parameters in this function signature, we’re not changing the lifetimes of any values passed in or returned. Rather, we’re specifying that the borrow checker should reject any values that don’t adhere to these constraints. Note that the longest function doesn’t need to know exactly how long x and y will live, only that some scope can be substituted for 'a that will satisfy this signature.

      6. Thinking in terms of lifetimes

      7. Lifetime annotations in struct definitions

        We have only defined structs to hold owned types. It is possible for structs to hold references, but in that case we would need to add a lifetime annotation on every reference in the struct’s definition.

        struct ImportantExcerpt<'a> {
            part: &'a str,
        }

        fn main() {
            let novel = String::from("Call me Ishmael. Some years ago...");
            let first_sentence = novel.split('.').next().expect("Could not find a '.'");
            let i = ImportantExcerpt {
                part: first_sentence,
            };
        }

        This annotation means an instance of ImportantExcerpt cannot outlive the reference it holds in its part field.

      8. Lifetime Elision

        Lifetimes on function or method parameters are called input lifetimes, and lifetimes on return values are called output lifetimes.

        Three lifetime elision rules:

        1. Each parameter that is a reference gets its own lifetime parameter.
        2. If there is exactly one input lifetime parameter, that lifetime is assigned to all output lifetime parameters.
        3. If there are multiple input lifetime parameters, but one of them is &self or &mut self because this is a method, the lifetime of self is assigned to all output lifetime parameters.
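        Rule 2 in action, as a sketch: with exactly one input reference, the output reference needs no explicit annotation (first_word is an illustrative helper):

        ```rust
        // One input lifetime, so the compiler assigns it to the
        // output reference without explicit annotations.
        fn first_word(s: &str) -> &str {
            s.split_whitespace().next().unwrap_or("")
        }

        fn main() {
            println!("{}", first_word("hello world"));
        }
        ```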
      9. Lifetime Annotations in Method Definitions

        Lifetime names for struct fields always need to be declared after the impl keyword and then used after the struct’s name, because those lifetimes are part of the struct’s type

        impl<'a> ImportantExcerpt<'a> {
            fn level(&self) -> i32 {
                3
            }
        }
      10. The static lifetime

        A 'static reference can live for the entire duration of the program:

        let s: &'static str = "I have a static lifetime";

      11. Generic Type Parameters, Trait Bounds, and Lifetimes Together

        use std::fmt::Display;

        fn longest_with_an_announcement<'a, T>(
            x: &'a str,
            y: &'a str,
            ann: T,
        ) -> &'a str
        where
            T: Display,
        {
            println!("Announcement! {}", ann);
            if x.len() > y.len() {
                x
            } else {
                y
            }
        }

        fn main() {
            let string1 = String::from("abcd");
            let string2 = "xyz";

            let result = longest_with_an_announcement(
                string1.as_str(),
                string2,
                "Today is someone's birthday!",
            );
            println!("The longest string is {}", result);
        }
  10. Writing Automated Tests

    1. How to Write Tests?

      1. Checking results with the assert! macro

        We give the assert! macro an argument that evaluates to a Boolean. If the value is true, assert! does nothing and the test passes; if the value is false, the assert! macro calls the panic! macro.

      2. Testing equality with the assert_eq! and assert_ne! macros

      3. Adding custom failure messages

      4. Checking for panics with should_panic

        Adding this attribute to our test function makes the test pass if the code inside the function panics; the test will fail if the code inside the function doesn’t panic.

        pub struct Guess {
            value: i32,
        }

        impl Guess {
            pub fn new(value: i32) -> Guess {
                if value < 1 || value > 100 {
                    panic!("Guess value must be between 1 and 100, got {}.", value);
                }

                Guess { value }
            }
        }

        #[cfg(test)]
        mod tests {
            use super::*;

            #[test]
            #[should_panic]
            fn greater_than_100() {
                Guess::new(200);
            }
        }
    2. Test organization

      The #[cfg(test)] annotation on the tests module tells Rust to compile and run the test code only when you run cargo test, not when you run cargo build.

  11. Functional Language Features: Iterators and Closures

    1. Closures: anonymous functions that can capture their environment

      1. To define a closure, we start with a pair of vertical pipes |. After the parameters, we place curly brackets that hold the body of the closure, which are optional if the closure body is a single expression.

      2. Closures don’t require you to annotate the types of the parameters or the return value like functions do. The compiler is reliably able to infer the types of the parameters and the return value, similar to how it’s able to infer the types of most variables.

        fn  add_one_v1   (x: u32) -> u32 { x + 1 }
        let add_one_v2 = |x: u32| -> u32 { x + 1 };
        let add_one_v3 = |x|             { x + 1 };
        let add_one_v4 = |x|               x + 1  ;
      3. The first time a closure is called, the compiler infers the types of the closure’s parameters; those types are then locked into the closure, and we get a type error if we try to use a different type with the same closure.

      4. Closures have an additional capability that functions don’t have: they can capture their environment and access variables from the scope in which they’re defined.

        fn main() {
            let x = 4;

            let equal_to_x = |z| z == x; // a closure can capture x; a nested fn cannot

            let y = 4;

            assert!(equal_to_x(y));
        }
      5. Closures can capture value from their environment in three ways,which directly map to the three ways a function can take a parameter:taking ownership, borrowing mutably, and borrowing immutably.These are encoded in three Fn traits as follows:

        • FnOnce:consumes the variables it captures from its enclosing scope, known as the closure’s environment.To consume the captured variables, the closure must take ownership of these variables and move them into the closure when it is defined.The once part of the name represents the fact that the closure can’t take ownership of the same variables more than once, so it can be called only once.

        • FnMut:can change the environment because it mutably borrows values.

        • Fn:borrows values from the environment immutably

        When you create a closure, Rust infers which trait to use based on how the closure uses the values from the environment. All closures implement FnOnce because they can all be called at least once. Closures that don’t move the captured variables also implement FnMut, and closures that don’t need mutable access to the captured variables also implement Fn. In Listing 13-12, the equal_to_x closure borrows x immutably (so equal_to_x has the Fn trait) because the body of the closure only needs to read the value in x.
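        A sketch of the three traits side by side (all variable names are illustrative):

        ```rust
        fn main() {
            let s = String::from("hi");
            let consume = move || s; // FnOnce: takes ownership of s, callable once

            let mut count = 0;
            let mut increment = || count += 1; // FnMut: mutably borrows count
            increment();
            increment();

            let x = 10;
            let read = || x + 1; // Fn: immutably borrows x

            println!("{} {} {}", consume(), read(), count);
        }
        ```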

        If you want to force the closure to take ownership of the values it uses in the environment, you can use the move keyword before the parameter list. This technique is mostly useful when passing a closure to a new thread to move the data so it’s owned by the new thread.

        fn main() {
            let x = vec![1, 2, 3];

            let equal_to_x = move |z| z == x;

            // println!("can't use x here: {:?}", x);
            // x was moved into the closure; uncommenting the line above
            // would be a compile error

            let y = vec![1, 2, 3];

            assert!(equal_to_x(y));
        }
    2. Processing a series of items with iterators

      1. Iterators are lazy
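        A minimal sketch of this laziness: the closure passed to map runs only when the iterator is consumed:

        ```rust
        fn main() {
            let v = vec![1, 2, 3];
            let iter = v.iter().map(|x| x + 1); // nothing runs yet: iterators are lazy
            let incremented: Vec<i32> = iter.collect(); // map's closure runs here
            println!("{:?}", incremented);
        }
        ```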

      2. The Iterator trait and the next method

        pub trait Iterator {
            type Item;

            fn next(&mut self) -> Option<Self::Item>;

            // methods with default implementations elided
        }

        The values we get from the calls to next are immutable references to the values in the vector. The iter method produces an iterator over immutable references. If we want to create an iterator that takes ownership of the vector and returns owned values, we can call into_iter instead of iter. Similarly, if we want to iterate over mutable references, we can call iter_mut instead of iter.
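        The three methods side by side, as a sketch:

        ```rust
        fn main() {
            let mut v = vec![1, 2, 3];

            let sum: i32 = v.iter().sum(); // iter: immutable references
            for x in v.iter_mut() {        // iter_mut: mutable references
                *x *= 10;
            }
            let owned: Vec<i32> = v.into_iter().collect(); // into_iter: takes ownership

            println!("{} {:?}", sum, owned);
        }
        ```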

      3. Methods that consume the iterator

        fn iterator_sum() {
            let v1 = vec![1, 2, 3];
            let v1_iter = v1.iter();
            let total: i32 = v1_iter.sum();
        }
      4. Methods that produce other iterators

        let v1: Vec<i32> = vec![1, 2, 3];
        let v2: Vec<_> = v1.iter().map(|x| x + 1).collect();

        Because iterators are lazy, we use the collect method to consume the iterator: we collect the results of iterating over the iterator that’s returned from the call to map into a vector.

      5. Using closures that capture their environment

        The filter iterator adaptor: the filter method on an iterator takes a closure that takes each item from the iterator and returns a Boolean. If the closure returns true, the value will be included in the iterator produced by filter; if the closure returns false, the value won’t be included in the resulting iterator.

        shoes.into_iter().filter(|s| s.size == shoe_size).collect()

      6. Comparing performance: loops and iterators

        Closures and iterators are Rust features inspired by functional programming language ideas. They contribute to Rust’s capability to clearly express high-level ideas at low-level performance. The implementations of closures and iterators are such that runtime performance is not affected. This is part of Rust’s goal to strive to provide zero-cost abstractions.

  12. More about Cargo and Crates.io (Not covered)

  13. Smart Pointers

    1. In Rust, which uses the concept of ownership and borrowing, an additional difference between references and smart pointers is that references are pointers that only borrow data; in contrast, in many cases, smart pointers own the data they point to.

    2. Using Box to point to data on the heap

      Boxes allow you to store data on the heap rather than the stack; what remains on the stack is the pointer to the heap data.

      1. Enabling recursive types with boxes
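        The book's cons-list example, sketched runnably: Box gives the recursive variant a known size (the sum helper is added here for illustration):

        ```rust
        enum List {
            Cons(i32, Box<List>), // Box gives the recursive type a known size
            Nil,
        }

        use crate::List::{Cons, Nil};

        // `sum` is an illustrative helper walking the list.
        fn sum(list: &List) -> i32 {
            match list {
                Cons(v, rest) => v + sum(rest),
                Nil => 0,
            }
        }

        fn main() {
            let list = Cons(1, Box::new(Cons(2, Box::new(Nil))));
            println!("{}", sum(&list));
        }
        ```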
    3. Treating smart pointers like regular references with the Deref trait

      Implementing the Deref trait allows you to customize the behavior of the dereference operator, * (as opposed to the multiplication or glob operator). By implementing Deref in such a way that a smart pointer can be treated like a regular reference, you can write code that operates on references and use that code with smart pointers too.

      Without the Deref trait, the compiler can only dereference & references. The deref method gives the compiler the ability to take a value of any type that implements Deref and call the deref method to get a & reference that it knows how to dereference.
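      The book's MyBox sketch, showing how implementing Deref makes * work on a custom type:

      ```rust
      use std::ops::Deref;

      struct MyBox<T>(T);

      impl<T> Deref for MyBox<T> {
          type Target = T;
          fn deref(&self) -> &T {
              &self.0 // hand the compiler a plain & reference it can dereference
          }
      }

      fn main() {
          let b = MyBox(5);
          println!("{}", *b); // *b is expanded to *(b.deref())
      }
      ```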

    4. Implicit deref coercions with functions and methods

      Deref coercion is a convenience that Rust performs on arguments to functions and methods. Deref coercion works only on types that implement the Deref trait. Deref coercion converts such a type into a reference to another type. For example, deref coercion can convert &String to &str because String implements the Deref trait such that it returns &str. Deref coercion happens automatically when we pass a reference to a particular type’s value as an argument to a function or method that doesn’t match the parameter type in the function or method definition. A sequence of calls to the deref method converts the type we provided into the type the parameter needs.

      Deref coercion was added to Rust so that programmers writing function and method calls don’t need to add as many explicit references and dereferences with & and *. The deref coercion feature also lets us write more code that can work for either references or smart pointers.

      fn hello(name: &str) {
          println!("Hello, {}!", name);
      }

      fn main() {
          let m = MyBox::new(String::from("Rust"));
          hello(&m);
      }

      Without Rust's deref coercion, we would have to write the equivalent code:

      fn main() {
          let m = MyBox::new(String::from("Rust"));
          hello(&(*m)[..]);
      }
    5. How deref coercion interacts with mutability

      Rust does deref coercion when it finds types and trait implementations in three cases:

      • From &T to &U when T: Deref<Target=U>
      • From &mut T to &mut U when T: DerefMut<Target=U>
      • From &mut T to &U when T: Deref<Target=U>
    6. Running code on cleanup with the Drop trait

      The second trait important to the smart pointer pattern is Drop, which lets you customize what happens when a value is about to go out of scope. You can provide an implementation for the Drop trait on any type, and the code you specify can be used to release resources like files or network connections. We’re introducing Drop in the context of smart pointers because the functionality of the Drop trait is almost always used when implementing a smart pointer. For example, when a Box<T> is dropped it will deallocate the space on the heap that the box points to.

      Programmers do not have to free memory in code when they finish using an instance of a data type. In Rust, when a value goes out of scope, the compiler inserts the ‘free memory’ code automatically.

      Most of the time, programmers do not have to drop an instance’s memory before the end of its scope; when they need to, Rust provides the std::mem::drop function to do this.
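      A minimal sketch (drop is in the prelude, so no import is needed):

      ```rust
      fn main() {
          let s = String::from("early");
          drop(s); // std::mem::drop frees s before the end of scope
          // using s here would be a compile error
          println!("done");
      }
      ```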

    7. Rc, the reference counted smart pointer

      To enable multiple ownership, Rust has a type called Rc<T>, which is an abbreviation for reference counting. The Rc<T> type keeps track of the number of references to a value to determine whether or not the value is still in use. If there are zero references to a value, the value can be cleaned up without any references becoming invalid.

      We use the Rc<T> type when we want to allocate some data on the heap for multiple parts of our program to read and we can’t determine at compile time which part will finish using the data last. If we knew which part would finish last, we could just make that part the data’s owner, and the normal ownership rules enforced at compile time would take effect.

      enum List {
          Cons(i32, Rc<List>),
          Nil,
      }

      use crate::List::{Cons, Nil};
      use std::rc::Rc;

      fn main() {
          let a = Rc::new(Cons(5, Rc::new(Cons(10, Rc::new(Nil)))));
          let b = Cons(3, Rc::clone(&a));
          let c = Cons(4, Rc::clone(&a));
      }
    8. RefCell and the interior mutability pattern

      ……..(TODO)

      reference: https://doc.rust-lang.org/book/ch15-05-interior-mutability.html

Memory Management

  1. Pages: the basic unit managed by the MMU

  2. Zones:

    1. ZONE_DMA

    2. ZONE_DMA32

    3. ZONE_NORMAL

    4. ZONE_HIGHMEM


Block I/O

  1. The smallest addressable unit of a block device is the sector, generally 512 bytes. A block device cannot address or operate on anything smaller than a sector. Because different software has different purposes, each uses its own smallest logical addressable unit: the block. The block is a file-system abstraction; a file system can only be accessed in units of blocks. Although physical disk addressing is done per sector, all disk operations performed by the kernel are done per block. The constraints on block size are that it must be a power-of-two multiple of the sector size and smaller than the page size, so the block size is usually 512 bytes, 1 KB, or 4 KB.
root@kuhan-server:/home/kuhan/daos# dmg -i -l 172.20.18.168 storage scan
Hosts SCM Total NVMe Total
----- --------- ----------
172.20.18.168 0 B (0 modules) 0 B (0 controllers)
root@kuhan-server:/home/kuhan/daos# dmg -i -l 172.20.18.168 storage format
Formatting scm storage for DAOS I/O Engine instance 0 (reformat: false)
Instance 0: starting format of SCM (ram:/mnt/daos)
Instance 0: finished format of SCM (ram:/mnt/daos)
Formatting nvme storage for DAOS I/O Engine instance 0
Instance 0: starting format of file block devices [/tmp/daos-bdev]
Instance 0: finished format of file block devices [/tmp/daos-bdev]
Writing nvme config file for DAOS I/O Engine instance 0
DAOS I/O Engine instance 0 storage ready
Format Summary:
Hosts SCM Devices NVMe Devices
----- ----------- ------------
172.20.18.168 1 1
SCM @ /mnt/daos: 4.3 GB Total/4.3 GB Avail
Starting I/O Engine instance 0: /home/kuhan/daos/install/bin/daos_engine
root@kuhan-server:/home/kuhan/daos# daos_engine:0 Using legacy core allocation algorithm
MS leader running on kuhan-server
daos_engine:0 DAOS I/O Engine (v2.0.0) process 6930 started on rank 0 with 1 target, 0 helper XS, firstcore 0, host kuhan-server.

commands

Storage Operations

dmg storage query usage

NVMe SSD Health Monitoring

3.3  DAOS system management commands
1) Storage usage query
dmg -i storage query usage
2) Health check
dmg -i storage query list-devices
dmg -i storage query list-pools
dmg -i storage query device-health -u 038a6076-a2e9-48ff-b4b2-1859d319c928
dmg -i storage scan --nvme-meta
dmg -i storage scan --nvme-health
3) NVMe SSD Eviction and Hotplug
dmg -i storage set nvme-faulty --uuid=
dmg -i storage replace nvme --old-uuid= --new-uuid=

4) NVMe SSD LED identification
dmg -i storage identify vmd -h --uuid=

3.4 Benchmark
1) Create a pool
dmg -i pool create --size=50GB
dmg -i pool list
dmg -i pool list -v
dmg -i pool query --pool=115d5112-e4db-4364-9821-48b661cef7bc
daos pool autotest --pool=115d5112-e4db-4364-9821-48b661cef7bc
2) Create a container
daos cont create --pool=115d5112-e4db-4364-9821-48b661cef7bc
Successfully created container f5d39d5d-c352-4538-8a28-54d75c76c0ac
daos cont create --pool=115d5112-e4db-4364-9821-48b661cef7bc --path=/tmp/mycontainer --type=POSIX
Successfully created container 529c9361-65b3-4d7a-ba1f-b306390dcdea type POSIX
daos cont query --path /tmp/mycontainer
3) Test
export POOL=115d5112-e4db-4364-9821-48b661cef7bc
export CONT=529c9361-65b3-4d7a-ba1f-b306390dcdea
fio ./examples/dfs.fio

create pool

root@kuhan-server:/home/kuhan/daos# dmg -i pool create --size=10GB --label=test
Creating DAOS pool with automatic storage allocation: 10 GB total, 6,94 tier ratio
daos_engine:0 3f085ef1: pool/cont hdl uuid 2366bb3f/a41283be
3f085ef1: rank 0 became pool service leader 0
Pool created with 6.00%,94.00% storage tier ratio
-------------------------------------------------
UUID : 3f085ef1-eddb-47ed-b68f-c05cb1f7c7b7
Service Ranks : 0
Storage Ranks : 0
Total Size : 10 GB
Storage tier 0 (SCM) : 600 MB (600 MB / rank)
Storage tier 1 (NVMe): 9.4 GB (9.4 GB / rank)

root@kuhan-server:/home/kuhan/daos# dmg -i storage scan
Hosts SCM Total NVMe Total
----- --------- ----------
localhost 0 B (0 modules) 0 B (0 controllers)
root@kuhan-server:/home/kuhan/daos# daos cont create --pool=3f085ef1-eddb-47ed-b68f-c05cb1f7c7b7
Container UUID : a1c04e93-26e9-4453-9a59-72e1a33b7667
Container Type : unknown

Successfully created container a1c04e93-26e9-4453-9a59-72e1a33b7667

Distributed protocol

Consensus Algorithm

满足三个条件:

non-triviality:if the value v is decided,then v must have been proposed by a proposer

safety & safe learning: for any two proposers that learn the decided values a and b, it must hold that a = b

classic paxos

two phase:

  1. reading phase
  2. writing phase

properties for proposer

  1. epoch
  2. A proposer starts proposing only after receiving promises from a majority of acceptors
  3. A proposer returns the final decided value only after receiving accepts from a majority of acceptors
  4. If no value has been decided before, any value may be proposed; otherwise the value with the highest epoch is proposed
  5. The epoch a proposer uses is greater than all previously used epochs

properties for acceptor:

  1. An acceptor only processes prepare or propose messages whose epoch >= its last promised epoch
  2. For each prepare or propose message received, the acceptor’s last promised epoch is set to the epoch received. This is after Property 6 has been satisfied.
  3. For each prepare message received, the acceptor replies with promise. This is after Properties 6 & 7 have been satisfied.
  4. For each propose message received, the acceptor replies with accept after updating its last accepted proposal. This is after Properties 6 & 7 have been satisfied.
  5. Last promised epoch and last accepted proposal are persistent and only updated by Properties 7 & 9
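The acceptor-side properties above can be sketched as follows (class and message names are illustrative, not from any Paxos library):

```python
# Minimal sketch of a classic Paxos acceptor enforcing the properties above.
class Acceptor:
    def __init__(self):
        self.last_promised = -1   # highest promised epoch (persistent in a real system)
        self.accepted = None      # (epoch, value) of the last accepted proposal

    def prepare(self, epoch):
        # Only process messages with epoch >= last promised epoch.
        if epoch < self.last_promised:
            return None           # a real system may reply NACK (see improvements below)
        self.last_promised = epoch
        # The promise carries the last accepted proposal, if any.
        return ("promise", epoch, self.accepted)

    def propose(self, epoch, value):
        if epoch < self.last_promised:
            return None
        self.last_promised = epoch
        self.accepted = (epoch, value)   # update last accepted proposal, then reply
        return ("accept", epoch)

a = Acceptor()
assert a.prepare(1) == ("promise", 1, None)
assert a.propose(1, "x") == ("accept", 1)
assert a.prepare(0) is None                     # stale epoch is ignored
assert a.prepare(2) == ("promise", 2, (1, "x"))  # a later prepare learns the accepted value
```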

Improvements on classic paxos

  1. An acceptor silently ignores a prepare or proposal whose epoch is too small, so the proposer can only time out and retry; an improvement is to have the acceptor send an explicit no-prepare or no-accept (NACK) message.
  2. In the prepare phase, if the value carried by the majority of acks the proposer receives equals the value it would propose in phase two, can phase two be skipped and the value returned directly? How would the algorithm know the decided value is exactly the value to be proposed in phase two?
  3. As soon as one acceptor returns a confirmation, it replies decided to the proposer, so the proposer can return the decided value without waiting for a majority of acceptors. (Is this really safe?) Mencius

proposal copying

(Unique proposer uses each epoch) Only one proposal can be accepted within an epoch. If, in phase one, a proposer receives a NACK whose max epoch equals the epoch sent in its prepare and a value has already been promised, it performs proposal copying and jumps straight to phase 2 (for faster convergence?)

Raft

  1. 三种状态
    1. leader
    2. follower
    3. candidate
  2. Every server stores a current term number
  3. Communication via RPCs:
    1. RequestVote RPC (initiated by candidates during elections)
    2. Append_entries RPCS(initiated by leaders to replicate log entries and to provide a form of heartbeat
  4. leader election
    1. Trigger: as long as a follower receives RPCs from the leader or a candidate, it remains a follower; if no RPC arrives within the election timeout, it assumes the leader is dead and starts a leader election
    2. Election flow: the follower increments its current term (term += 1), switches to the candidate state, votes for itself, and sends RequestVote RPCs to the other servers in the cluster in parallel
    3. A candidate keeps its state until:
      1. it wins the election
      2. another server wins the election
      3. a period elapses with no winner
    4. How a candidate wins: it collects votes from a majority of the servers in the cluster for the same term. Each server votes for at most one candidate per term, first-come-first-served. Once it wins, it sends heartbeats to the other nodes to prevent new elections
    5. How are split votes resolved (every server starting an election at once and voting for itself)?
      1. random election timeouts (150-300 ms)
    6. Followers only vote for candidates whose log is at least as up-to-date as their own
    7. When a candidate receives a request from a leader, it compares the leader's term with its own: if the leader's term is at least as large, it becomes a follower; otherwise it rejects the request and remains a candidate
  5. Term (logical clock)
    1. Properties: consecutive numbers; every server keeps one
    2. Terms change whenever servers communicate
    3. If server a's term is smaller than b's, a updates its term to b's
    4. When a candidate or leader discovers a larger term, it turns itself into a follower
    5. If a server receives a request with a stale term, it rejects it
  6. Log replication
    1. Flow
      1. a command arrives
      2. the leader appends the command to its log
      3. the leader sends the log to the other servers to replicate the entry
      4. after replies from a majority of servers, the leader applies the command to its state machine
      5. the result is returned to the client
      6. (if followers crash, run slowly, or there are network issues, the leader retries the RPC indefinitely until all followers store the whole log)
    2. log structure
      1. state machine command
      2. term number
        1. to detect inconsistencies between logs
      3. integer index
        1. to identify its position in the log
    3. committed log entries
      1. an entry is committed once the leader has replicated it on a majority of servers
      2. the leader includes the highest index it knows to be committed in its AppendEntries RPCs, so followers learn it and apply that entry to their state machines
    4. how to handle inconsistencies?
      1. by forcing the followers' logs to duplicate the leader's own (follower entries are overwritten)
      2. The leader keeps a nextIndex array, one entry per follower, initialized to the leader's last log index + 1, and probes with AppendEntries RPCs; each failure backs the pointer off by one until an agreement point is found. (The Leader Append-Only property guarantees the leader never deletes or overwrites its own log)
    5. Even after a log entry has been replicated to a majority of nodes, if the leader crashes before committing it, the entry may still be overwritten
    6. Raft uses the up-to-date check during voting to ensure the new leader holds all committed logs
      1. if the logs have last entries with different terms, the log with the later term is more up-to-date
      2. if the logs end with the same term, then whichever log is longer is more up-to-date
  7. Timing and availability
    1. broadcastTime (heartbeat) << electionTimeout << MTBF
  8. Raft visualizations
    1. http://thesecretlivesofdata.com/raft/
    2. https://raft.github.io
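The "up-to-date" comparison Raft uses when granting votes (point 6 above) can be sketched as follows (helper name and tuple layout are illustrative):

```python
# Sketch of Raft's up-to-date check: later last term wins;
# with equal terms, the longer log wins.
def at_least_as_up_to_date(candidate_last, voter_last):
    """Each argument is (last_term, last_index) of a log; a voter grants its
    vote only if the candidate's log is at least as up-to-date as its own."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

assert at_least_as_up_to_date((3, 5), (2, 9))       # later term beats longer log
assert not at_least_as_up_to_date((2, 9), (3, 5))
assert at_least_as_up_to_date((3, 7), (3, 7))       # identical logs: vote granted
assert not at_least_as_up_to_date((3, 6), (3, 7))   # same term, shorter log
```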

Chain replication

  1. request processing
    1. reply generation:
      1. reply is generated and sent by tail
    2. Query processing
      1. Query processed by tail
    3. Update processing
      1. updates are processed by the head and delivered down the chain
  2. Advantages & disadvantages
    1. Read: cheap, because only the tail answers queries
    2. Write: heavier, because every node participates; the computation, though, happens only once: the head computes the value and the other nodes each perform a single write.
  3. Coping with server failures
    1. Just drop the failed node from the chain?
  4. ……(did not read further; lost interest)
  5. Reference: https://www.dazhuanlan.com/andyblack/topics/1098082
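The query/update split described above can be sketched as a toy chain (class and method names are illustrative):

```python
# Toy chain replication: updates enter at the head and propagate down the
# chain; queries and replies are handled by the tail only.
class Node:
    def __init__(self):
        self.store = {}
        self.next = None   # successor in the chain (None for the tail)

    def update(self, key, value):
        self.store[key] = value      # head computes once; others just write
        if self.next:                # forward the write down the chain
            return self.next.update(key, value)
        return "ok"                  # the tail generates the reply

    def query(self, key):
        return self.store.get(key)   # only ever invoked on the tail

head, mid, tail = Node(), Node(), Node()
head.next, mid.next = mid, tail
assert head.update("k", 1) == "ok"   # reply comes back from the tail
assert tail.query("k") == 1          # read is served by the tail alone
```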

FaceBook F4 system

  1. Design details

    1. Volumes:

      1. state:
        1. Unlocked: under 100GB, allow read/write/delete
        2. Locked: only read/delete is allowed
      2. Types:
        1. dataFile: BLOB + metadata (key, size, or checksum)
        2. index file: exists so lookup state can be rebuilt on reboot
        3. journal file: tracks BLOBs that have been deleted
    2. System overview

      1. Routing
        1. creates go to hot storage
        2. deletes go to hot or warm storage
        3. reads hit the cache, hot, or warm storage
      2. controller
        1. provision new machines
        2. Maintain pool of unlocked volumes
        3. ensure all logical volumes have enough physical volumes backing them
        4. create new physical volumes if necessary
        5. perform compaction and garbage collection
      3. Router tier
        1. stores the mapping from logical volumes to physical volumes
      4. Transformer Tier
        1. separates compute from storage: compute nodes do the computation, storage nodes only store and retrieve data

      ……(did not finish reading)

Reliability strategies

  1. Replica placement: different nodes / different racks / different DCs
  2. Consistent hashing
  3. CRUSH (Controlled Replication Under Scalable Hashing): CRUSH is designed to distribute data across devices in weighted proportion to their storage capacity and bandwidth while keeping the placement probabilistically balanced. Replicas are placed within a hierarchical storage topology, which also matters for data safety. By mirroring the system's physical installation layout, CRUSH modularizes the system and can localize the impact of potential device failures.
  4. EC (Erasure Code)
    1. Characteristics:
      1. low redundancy, high disk utilization
      2. high data-recovery cost
      3. high data-update cost
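As an illustration of the consistent-hashing strategy in the list above, a toy ring with virtual nodes (all names are illustrative):

```python
import bisect
import hashlib

# Toy consistent-hash ring with virtual nodes.
def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node owns `vnodes` points on the ring.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def locate(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

full = Ring(["node-a", "node-b", "node-c"])
less = Ring(["node-a", "node-b"])   # node-c removed
for k in ["blob-1", "blob-2", "blob-3"]:
    owner = full.locate(k)
    # Removing node-c only remaps keys that lived on node-c;
    # everything else keeps its owner.
    if owner != "node-c":
        assert less.locate(k) == owner
```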

Distributed transaction protocols

  1. 2PC
    1. Phases:
      1. Prepare (voting): the master sends prepare to the participating nodes; each participant executes the transaction's operations and returns yes or no to the master
      2. Commit (commit or abort): if the master receives yes from every node, it enters the commit flow and sends commit to all participants; if any node replies no, it runs the abort flow. Participants return the execution result to the master
    2. Problems:
      1. Synchronous blocking: participants that share resources block one another while accessing critical resources
      2. A coordinator failure can leave participants blocked indefinitely
      3. Data inconsistency: if the coordinator fails while sending commit, only some participants receive it
      4. If the coordinator crashes while sending commit and the only participant that received the message also crashes, the transaction's state is unknowable
  2. 3PC
    1. Phases:
      1. can-commit
        1. the coordinator asks the participants whether they can commit
        2. participants reply yes or no
      2. pre-commit
        1. if every participant replied yes, the coordinator:
          1. sends the pre-commit request
          2. participants pre-commit: write undo and redo logs, lock the affected records, and return a response
        2. if any participant replied no or the wait timed out, the coordinator:
          1. sends an abort request
          2. a participant aborts when it receives abort or its own wait times out
      3. do-commit
        1. if all replies are yes:
          1. send commit
          2. participants commit the transaction and release resources
          3. participants reply with the result
          4. the coordinator receives all replies; the transaction is complete
        2. if any reply is no:
          1. send abort
          2. participants roll back and release resources
          3. participants reply with the result
          4. the coordinator receives all replies; the abort is complete
    2. How is a coordinator timeout handled?
      1. A timeout mechanism
    3. How is synchronous blocking solved?
      1. It is not
    4. How is inconsistency solved?
      1. It is not
    5. Advantages over 2PC?
      1. The timeout mechanism mitigates coordinator failure to some degree
      2. Splitting the first phase in two means canCommit can determine early whether the transaction can run, holding fewer resources and improving throughput.
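The voting rule shared by both protocols above can be sketched in a few lines (function name is illustrative):

```python
# Minimal sketch of the coordinator's decision rule shared by 2PC and 3PC:
# commit only when every participant voted yes; any "no" or timeout aborts.
def decide(votes, expected):
    # votes: replies actually received; expected: number of participants.
    # A missing reply counts as a timeout and forces an abort.
    if len(votes) < expected:
        return "abort"
    return "commit" if all(v == "yes" for v in votes) else "abort"

assert decide(["yes", "yes", "yes"], 3) == "commit"
assert decide(["yes", "no", "yes"], 3) == "abort"
assert decide(["yes", "yes"], 3) == "abort"   # one participant timed out
```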