分布式系统中的时间

计算机系统中的时间，通过石英振荡器的震动次数来确定。一个石英振荡器的震动频率是32768Hz每秒，计算机电路根据石英晶体的震动次数来确定1s钟。但是石英晶体的震荡次数会因为环境产生误差。

计算机系统中，假设发生了事件A和事件B，如何确认时间A和事件B的先后顺序？

单机系统中，时间单调递增，假设时间A先于时间B发生，则一定意味着时间A发生时的时间戳小于事件B
分布式系统中，由于不同机器上的时间存在误差，假设A在机器a上发生，B在机器b上发生，物理上，A在B之前发生，显然不能直接推导出T(A) <T(B),因为A和B在两台不同的机器上获取了时间戳，参考系不一样。

物理时钟，包括NTP协议和TrueTime都属于物理时钟，另一种是逻辑时钟，包括Lamport逻辑时钟和向量时钟。这两种时钟有各自的优缺点。物理时钟的优点在于直观，就是真实世界的时间，使用方便，缺点在于无法做到绝对精确，成本相对高一些。逻辑时钟的优点在于可以做到精确的因果关系，缺点在于节点之间需要通信，而且使用上不如物理时钟直观。

Network Time Protocol（NTP）

NTP协议的目标是将所有计算机的时间同步到几毫秒误差内。实际上广域网可以达到几十毫秒的误差，局域网误差可以在1毫秒内。

客户端A发送NTP消息给服务器B，消息中包含发送时间戳 T1
服务器B收到NTP消息后，将接收时间 T2 写入消息中
服务器B发送该NTP消息给客户端A，发送时间 T3 写入消息中
客户端A收到该NTP消息的时间为 T4

         T1    T2`           T3`   T4
A    -----------------------------^-----
          \    |             |    /
           \   |             |   /
            \  |             |  /
             \ |             | /
              \|             |/
B    ----------v-------------------------
               T2            T3

AB之间的网络往返时间RTT(Round Trip Time)：δ = (T4 - T1) - (T3 - T2)

AB之间的时间偏移：θ = ( (T2 - T1) + (T3 - T4) ) / 2

A从发送请求消息到收到响应的时间间隔是 T4 - T1，其中 T3 - T2 是B的处理时间，所以网络往返时间

δ = (T4 - T1) - (T3 - T2)。

假设A和B的时间偏差为θ，那么 T3 - θ = T3`。

T4和T3’的间隔是半个RTT：T4 - T3` = δ / 2

把T3`和δ代入上面这个等式，得到：θ = ( (T2 - T1) + (T3 - T4) ) / 2。

NTP协议在广域网可以达到几十毫秒的误差，局域网误差可以在1毫米内。误差最大的一个原因是发送请求和接收响应这两个阶段的网络时间可能是不一样的。前面我们推导时间偏移公式的时候，假设网络往返发送和接收阶段的时间是一样的，但是实际网络中，这两个阶段走的路由可能是不一样的，所花的时间也可能不一样，计算的时间偏移也不准确，这样就造成了广域网的误差可能达到几十毫秒甚至更高。局域网中因为网络比较稳定，经过的路由器也比较少，所以误差可以到1毫米内。

Lamport逻辑时钟

通过逻辑时钟，可以刻画分布式系统中事件的因果一致性关系。

局限：

无法感知系统外的事件。
由C(a) < C(b)不能推导出a → b，即使知道了两个逻辑时钟值，但却不能确定这两个事件的因果关系。
不能区分事件之间是否并发
不够直观，脱离了物理时间的直观范畴

向量时钟

解决了Lamport逻辑时钟中不能描述时间因果关系的问题

不能感知系统外的事件
可以描述时间之间的因果关系
可以区分事件之间的并发关系
不够直观，脱离了物理时间的直观范畴

TrueTime

由谷歌提出，通过高精度的硬件，刻画了相对准确的物理时间。

HLC（Hybrid Logical Clocks）

(HLC)将物理时钟和逻辑时钟结合起来。

需要的空间复杂度有限，不会随着集群规模的增长而增长。
需要满足因果关系。
逻辑时间部分的增长有界。
逻辑时间和真实的物理时间的误差是有边界的。

DAOS中的时钟

1 2	/** Mask for the 18 logical bits */ #define D_HLC_MASK 0x3FFFFULL

18位用于存储逻辑时钟，46位用于存储物理时钟。

数据库

数据库中的异常行为

脏读
幻读
不可重复读
脏写
丢失更新

数据库中的并发控制机制

基于时间的（Time-stamp ordering）
基于提交顺序的（Commitment ordering）
基于串行化图测试验证的（Serialization graph testing）
基于锁的（Locking）

Time-stamp ordering

时间戳
（Timestamp ordering，TO）：基于时间戳对事务提交顺序排序的并发控制技术。

时间戳排序技术中有两类主体：

事务
数据项。

时间戳就要“盖（赋值）”在这两类主体上。

依附在事务上的时间戳：

每个事务分配一个时间值（通常是在事务开始的时候分配，但有的系统是在事务提交的时候才分配时间值）作为此事务发生的标识，这个时间值称为一个“时间戳”，“时间戳”就如同为事务盖了一个章。时间值取值有两种方式，一是系统时钟，二是逻辑计数器。

依附在数据上的时间戳：

数据项上有两个时间戳：

读时间戳，记录读取该数据项的最大事务的时间戳
写时间戳，记录写入该数据项当前值的事务对应的时间戳，即最新的修改该数据项的事务的时间戳标识。

事务根据时间戳确认先后顺序关系

因存在并发，所以通过检查Ti事务的时间戳和Tj事务的数据项上的时间戳以确定并发事务Ti和Tj之间的先后关系（如果Ti<Tj，则事务调度器必须保证所产生的并发调度等价于事务Ti先于事务Tj的某个串行调度）。

读写冲突按照时间戳顺序执行

任何有冲突的READ或WRITE操作按时间戳顺序执行。

算法原理

算法解释

写读冲突

begin T2                       R(X)
|------------------------------|------------------------------------------ case 1
              begin T2                      R(X)
              |-----------------------------|----------------------------- case 2
		begin T1       W(X)
		|--------------|------------------------------------------------------
                                begin T2       R(X)
                                |--------------|-------------------------- case 3

-----------------------------Time line--------------------------------------->

case1:事务T2的开始时间早于事务T1,事务T2的R(X)时间晚于事务T1的W(X)时间，T2读被abort，避免脏读

case2:事务T2的开始时间晚于事务T1,事务T2的R(X)时间晚于事务T1的W(X)时间，T2读被abort，避免脏读

case3:事务T2的开始时间在T1写X之后，此时事务T2阻塞等待：

事务T1提交，则事务T2不受影响
事务T1abort，事务T2也必须abort，避免脏读

总结：

当前事务开始早于其他事务的写，当前事务的读需要abort
当前事务开始晚于其他事务的写，其他事务commit，当前事务可以读。

读写冲突

begin T2                       W(X)
|------------------------------|------------------------------------------ case 1
              begin T2                      W(X)
              |-----------------------------|----------------------------- case 2
		begin T1       R(X)
		|--------------|------------------------------------------------------

-----------------------------Time line------------------------------------->

case1:事务T2的开始时间早于事务T1,事务T2的W(X)时间晚于事务T1的R(X)时间，T2读被abort，避免事务T1不可重复读

case1:事务T2的开始时间晚于事务T1,事务T2的W(X)时间晚于事务T1的R(X)时间，T2读被abort，避免事务T1不可重复读

总结：读前写后回滚写

写写冲突

begin T2                       W(X)
|------------------------------|------------------------------------------ case 1
              begin T2                      W(X)
              |-----------------------------|----------------------------- case 2
		begin T1       W(X)
		|--------------|------------------------------------------------------

-----------------------------Time line----------------------------------------->

case1:事务T2的开始时间早于事务T1,事务T2的W(X)时间晚于事务T1的W(X)时间，T2读被abort，事务T2需要写的值是事务T1介入之前的值

case1:事务T2的开始时间晚于事务T1,事务T2的W(X)时间晚于事务T1的W(X)时间，T2读被abort，事务T2需要写的值是事务T1介入之前的值

问：T2事务中的W(X)可能小于T1事务中的W(X)吗?

答：假设Wt2(X) < Wt1(X)，则代表着T2的写操作在T1的写操作之前完成，T2的写操作为后写，后写需要被abort，但是X的写时间被更新成了Wt1(X)，意味着T1的写成功，两者矛盾，因此不可能T2事务中的W(X)可能小于T1事务中的W(X)。

总结：写写冲突，回滚后写。

伪代码解释

检查Ti需要使用到的所有value

如果打算读value：【第一张图中的场景】
1. 如果当前事务开始早于写操作，读要回滚再重来（否则可能会脏读）
2. 如果当前事务开始晚于写操作，将写value的事务加入到当前事务的依赖事务列表中DEP(Ti). add(WT’S(O;))，更新value的读时间戳DEP(Ti). add(WT’S(O;))
如果打算写value：
1. 如果当前事务的开始早于读操作，回滚写（否则另一个事务的读可能出现不可重复读）【第二张图中的场景】
2. 如果当前事务的开始晚于写操作，abort当前事务的写或者一句Thomas Write Rule skip【第三张图中的场景】
3. 否则通过检查，存储更新前的旧值OLD(Ti). add(O;, WTS(O;)),更新value的写时间WTS(0;) = TS(I; ),更新value的值

如果DEP（Ti）中有未结束的事务，wait

如果DEP（Ti）中有abort的事务，则1.2中的读也需要abort

abort流程：恢复旧值和旧的时间戳。

DAOS中的Time-stamp ordering

伪代码

A read at epoch e follows these rules: （对应于写读冲突中的三个case）

// Epoch uncertainty check
	if e is uncertain
			// 写读冲突,读事务的读操作的开始晚于其他事务的写操作（读事务的读操作一定是在e_orig + epsilon之后吗？）
      if there is any overlapping, unaborted write in (e, e_orig + epsilon]
          reject // 回滚读 (case1&case2)
  find the highest overlapping, unaborted write in [0, e] //写读冲突，读事务的开始时间晚于其他写事务的写操作事件
  if the write is not committed // 等待
      wait for the write to commit or abort
      if aborted
          retry the find skipping this write // 如果其他写事务abort，此读事务retry (case3)
  // Read timestamp update
  for level i from container to the read's level lv
      update i.high // 更新TimeStamp Cache
  update lv.low

e:代指的是事务的开始HLC时间，e的实际开始物理时间可能在(e, e_orig + epsilon）范围内。

epsilon:代指的是是HLC时间和实际物理时间的误差范围。

A write at epoch e follows these rules: （对应于读写冲突&写写冲突的几个case）

// Epoch uncertainty check
 if e is uncertain
 		// 写写冲突，后写abort（认为当前事务的写在e_orig + epsilon时间之后）
     if there is any overlapping, unaborted write in (e, e_orig + epsilon]
         reject
 // Read timestamp check， 检查读写冲突，读在前写在后
 for level i from container to one level above the write
 		// 读写冲突，如果父节点的low小于当前事务的开始时间，意味着父节点下的所有子节点都在low这个时间点被读过，则可以推导出write的目			 // 标value在low时间被读取过,当前事务的开始时间早于value的读操作时间，则当前事务需要abort
 		// (i.low == e) && (other reader @ i.low) --> 检查是否是同一个txn中的操作，如果是同一个txn的读写，则不需要abort
 		// 这些父节点的检查可能是想快速判断出是够有冲突，不必递归到目标项上做检查
     if (i.low > e) || ((i.low == e) && (other reader @ i.low))
         reject
 // 检查目标项，读写冲突，abort写
 if (i.high > e) || ((i.high == e) && (other reader @ i.high))
     reject
 // 此处的overlapping检查逻辑暂不明确作用,好像也是在检查写写冲突
 // 上面的写写冲突检查区间并没有覆盖到e这个点，是个左开右闭区间，不清楚为什么要分开检测
 find if there is any overlapping write at e
 if found and from a different transaction
		reject

i.low:目标节点和其所有子节点都至少在low时间点被读取过。

i.high:目标节点至少有一个子节点在high时间被读取过。

Daos中检查冲突的代码片段

检查读写冲突

// 写操作检查，查看是够有读写冲突
bool
vos_ts_check_read_conflict(struct vos_ts_set *ts_set, int idx,
			   daos_epoch_t write_time)
{
	struct vos_ts_set_entry	*se;
	struct vos_ts_entry	*entry;
	int			 write_level;
	bool			 conflict;

	D_ASSERT(ts_set != NULL);

	se = &ts_set->ts_entries[idx];
	entry = se->se_entry;

	if (ts_set->ts_wr_level > ts_set->ts_max_type)
		write_level = ts_set->ts_max_type;
	else
		write_level = ts_set->ts_wr_level;

	if (se->se_etype > write_level)
		return false; /** Check is redundant */

	/** NB: If there is a negative entry, we should also check it.  Otherwise, we can miss
	 *  timestamp updates associated with conditional operations where the tree exists but
	 *  we don't load it ,不太明白这边的negative是什么意思？
	 */
	// 检查目标value父节点
	// for level i from container to one level above the write
	if (se->se_etype < write_level) {
		/* check the low time */
		conflict = vos_ts_check_conflict(entry->te_ts.tp_ts_rl, &entry->te_ts.tp_tx_rl,
						 write_time, &ts_set->ts_tx_id);

		if (conflict || entry->te_negative == NULL) 
		// conflict为true ---》冲突
		// conflict为false，negative为空，---》不冲突
			return conflict;
		
		// conflict为false且negative不为空，检查negative：什么是negative？
		// 如果和negative冲突也代表冲突
		return vos_ts_check_conflict(entry->te_negative->te_ts.tp_ts_rl,
					     &entry->te_negative->te_ts.tp_tx_rl,
					     write_time, &ts_set->ts_tx_id);
	}

	/* check the high time */
	// if (i.high > e) || ((i.high == e) && (other reader @ i.high)),检查目标value的high
	conflict = vos_ts_check_conflict(entry->te_ts.tp_ts_rh, &entry->te_ts.tp_tx_rh, write_time,
					 &ts_set->ts_tx_id);

	if (conflict || entry->te_negative == NULL)
	// conflict为true ---》冲突
	// conflict为false，negative为空，---》不冲突
		return conflict;

	// conflict为false且negative不为空，检查negative：什么是negative？
	// 如果和negative冲突也代表冲突
	return vos_ts_check_conflict(entry->te_negative->te_ts.tp_ts_rh,
				     &entry->te_negative->te_ts.tp_tx_rh, write_time,
				     &ts_set->ts_tx_id);
}


// 检查读写冲突的子函数
// 如果冲突，返回true
// 如果不冲突，返回false
static inline bool
vos_ts_check_conflict(daos_epoch_t read_time, const struct dtx_id *read_id,
		      daos_epoch_t write_time, const struct dtx_id *write_id)

{
	// 当前事务的开始时间晚于读操作，则不冲突
	if (write_time > read_time)
		return false;

	// 当前事务的开始时间早于读操作，冲突，abort当前事务的写
	if (write_time != read_time)
		return true;

	// 当前事务写操作和其他事务的读操作在同一时间发起，且这两事务不在同时发生，则冲突
	if (read_id->dti_hlc != write_id->dti_hlc)
		return true;

	// 当前事务写操作和其他事务的读操作在同一时间发起，且这两个事务同时发生：
	//	1. 事务ID号相同，则不冲突 （同一个事务内同时的读写操作）
	//  2. 事务ID号不同，则冲突
	return uuid_compare(read_id->dti_uuid, write_id->dti_uuid) != 0;
}

检查写写冲突，写读冲突

// TODO:不能理解case2，case3中的场景

case1: <-----------X----------------|----------------o-----------------o------------------>
                current            bound               U1                U2
case2: <-----------o----------------v----------------|-----------------o------------------>
                  U1               current          bound             U2	
case3: <-----------o----------------x----------------o-----------------|------------------>
                  U1               current          U2                bound	
case4: <-----------o----------------o----------------v-----------------|------------------>
                  U1               U2               current           bound

/** Do an uncertainty check on the entry.  Return true if there
 *  is a write within the epoch uncertainty bound or if it
 *  can't be determined that the epoch is safe (e.g. a cache miss).
 *
 *  There are the following cases for an uncertainty check
 *  1. The access timestamp is earlier than both.  In such
 *     case, we have a cache miss and can't determine whether
 *     there is uncertainty so we must reject the access.
 *  2. The access is later than the first and the bound is
 *     less than or equal to the high time.  No conflict in
 *     this case because the write is outside the undertainty
 *     bound.
 *  3. The access is later than the first but the bound is
 *     greater than the high timestamp.  We must reject the
 *     access because there is an uncertain write.
 *  4. The access is later than both timestamps.  No conflict
 *     in this case.
 *
 *  \param[in]	ts_set	The timestamp set
 *  \param[in]	epoch	The epoch of the update
 *  \param[in]	bound	The uncertainty bound
 */
static inline bool
vos_ts_wcheck(struct vos_ts_set *ts_set, daos_epoch_t epoch,
	      daos_epoch_t bound)
{
	struct vos_wts_cache	*wcache;
	struct vos_ts_set_entry	*se;
	uint32_t		 high_idx;
	daos_epoch_t		 high;
	daos_epoch_t		 second;

	if (!vos_ts_in_tx(ts_set) || ts_set->ts_init_count == 0 ||
	    bound <= epoch)
		return false;

	se = &ts_set->ts_entries[ts_set->ts_init_count - 1];

	if (se->se_entry == NULL)
		return false;

	wcache = &se->se_entry->te_w_cache;
	high_idx = wcache->wc_w_high;
	high = wcache->wc_ts_w[high_idx];
	// 事务的开始时间晚于任何其他事务的写时间，不产生任何冲突
	// 写读冲突中的case3,读事务的开始时间晚于其他事务的写操作时间，需要返回等待
	// 写写冲突中，写事务的开始时间晚于其他事务的写操作时间，不冲突
	if (epoch >= high) /* Case #4, the access is newer than any write */
		return false;

	second = wcache->wc_ts_w[1 - high_idx];
	// 事务的开始时间小于其他事务写此value的Tlow,当前事务的写应该被abort
	// 写读冲突中，读事务的开始时间早于写事务的写操作时间，读操作需要被abort，写读冲突中的case1和case2属于此类
	// 写写冲突中，写事务的开始时间早于其他写事务的写操作时间，后写的需要被abort，写写冲突中的case1和case2属于此种
	if (epoch < second) /* Case #1, Cache miss, not enough history */
		return true;

	/* We know at this point that second <= epoch so we need to determine
	 * only if the high time is inside the uncertainty bound.
	 */
	if (bound >= high) /* Case #3, Uncertain write conflict */
		return true;

	return false;
}

DAOS中基于时间戳的多版本并发控制协议

时间戳排序的多版本并发控制协议算法原理

MVCC不是一个可独立使用的事务并发控制技术，而是需要基于其他并发控制技术，如基于时间戳的称为多版本时间戳排序机制( multiversion timestamp-ordering scheme) ，基于两阶段封锁协议的称为多版本两阶段封锁协议 ( multiversion two-phase locking protocol)’’。

基于时间戳的称为多版本时间戳排序机制的基本原理：

首先，数据库系统在事务开始前赋予一个时间戳，记为TS（Ti），这个时间戳则决定了并发的事务的调度顺序。
对于每个数据项X，多版本体现在：X有一个版本序列<X1，X2，…，Xn>，其中，每个版本Xi包括三个字段，分别是：
1. Xi=value，value是数据项X的第i个版本的值，每个版本是由一个写操作生成的。
2. W-timestamp（Xi）是创建Xi这个版本的事务的时间戳（不是当前时间戳值），即表明此数据项是被谁在什么时候创建的。
3. R-timestamp（Xi）是所有成功读取Xi这个版本的事务的时间戳。
再次，多版本时间戳排序机制通过如下规则，保证可串行性：
1. 如果事务Ti执行Read操作或Write操作，假设Xm表示X满足如下条件的版本，其写时间戳是小于或等于TS（Ti）的最大写时间戳（确保了在所有版本中找到一个“最近版本”）。
2. 如果事务Ti执行读操作Read（X），返回给事务Ti的值为Xm。读永远不会被阻塞。
3. 如果事务Ti执行写操作Write（X）：
  1. 如果 TS(Ti)<R-timestamp(Xm)，则中止事务 Ti，这表明即将执行的这个写操作之后的时间上已经发生过了一个读操作，如果允许写操作成功，则可能发生不可重复读异常现象。这是写 - i卖冲突，事务 Ti 被中止。
  2. 如果 TS(Ti)=W-tim巳stamp(X时，则系统更新事务 Ti 的值 Xm 为新值，这表明本事务多次写过同一个数据项，新值覆盖旧值
  3. 如果 TS(Ti)>W-timestamp(Xm)，则系统为事务 Ti 的数据项 X 创建一个新值，这说明后发生的事务才创建新的版本。这是写一写冲突，导致产生新版本。

DAOS中的时间戳的多版本并发控制协议代码片段

TODO

Question：

为什么Daos中每个entry需要记录两个write TimeStamp？

按照传统的Time Ordering算法，只记录一个readTimeStamp和一个writeTimeStamp。

daos的文档中这样描述：

In order to detect epoch uncertainty violations, VOS also maintains a pair of write timestamps for each container, object, dkey, and akey. Logically, the timestamps represent the latest two updates to either the entity itself or to an entity in a subtree. At least two timestamps are required to avoid assuming uncertainty if there are any later updates. The figure below shows the need for at least two timestamps. With a single timestamp only, the first, second, and third cases would be indistinguishable and would be rejected as uncertain.
```
 根据文档的描述，记录两个Write TimeStamp，可以减少一些uncertain场景的冲突个数。按照传统的Time Ordering算法，上一小        节中的case2，case3，和case4都应该被abort。通过记录两个TimeStamp，case2可以不abort。没有理解这边不abort的理由。
```
TimeStamp entry中的negative代表了什么？为什么检查冲突的时候也需要检查negative

NB: If there is a negative entry, we should also check it. Otherwise, we can miss timestamp updates associated with conditional operations where the tree exists but we don’t load it

其他参考：

CockroachDB中关于TimeStamp Cache的描述：

Read-Write Conflicts – Read Timestamp Cache
On any read operation, the timestamp of that read operation is recorded in a node-local timestamp cache. This cache will return the most recent timestamp at which the key was read.

All write operations consult the timestamp cache for the key they are writing; if the returned timestamp is greater than the operation timestamp, this indicates a RW conflict with a later timestamp. To disallow this, the operation (and its transaction) must be aborted and restarted with a later timestamp.

The timestamp cache is an interval cache, meaning that its keys are actually key ranges. If a read operation is actually a predicate operating over a range of keys (such as a scan), then the entire scanned key range is written to the timestamp cache. This prevents RW conflicts where the key being written was not present during the scan operation.

The timestamp cache is a size-limited, in-memory LRU (least recently used) data structure, with the oldest timestamps being evicted when the size limit is reached. To deal with keys not in the cache, we also maintain a “low water mark”, which is equivalent to the earliest read timestamp of any key that is present in the cache. If a write operation writes to a key not present in the cache, the “low water mark” is returned instead.

Write-Write Conflicts – Can only write the most recent version of a key
If a write operation attempts to write to a key, but that key already has a version with a later timestamp than the operation itself, allowing the operation would create a WW conflict with the later transaction. To ensure serializability, the operation (and its transaction) must be aborted and restarted with a later timestamp.

By choosing a timestamp-based ordering, and rejecting all conflicts which disagree with that ordering, CockroachDB’s Serializable Snapshot guarantees a serializable schedule.

Timestamp cache
The timestamp cache tracks the highest timestamp (i.e., most recent) for any read operation that a given range has served.

Each write operation in a BatchRequest checks its own timestamp versus the timestamp cache to ensure that the write operation has a higher timestamp; this guarantees that history is never rewritten and you can trust that reads always served the most recent data. It’s one of the crucial mechanisms CockroachDB uses to ensure serializability. If a write operation fails this check, it must be restarted at a timestamp higher than the timestamp cache’s value.

Saturn

Daos-mvcc-and-timestamp-ordering