Saturn

The devil is in the details.


General storage knowledge

RAID classification:

  1. Reference: https://www.zhihu.com/question/20131784

HDD

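  1. Full name: Hard Disk Drive, abbreviated HDD
  2. Concept: A hard disk drive (sometimes called a "mechanical hard disk" or "traditional hard disk" to distinguish it from a solid-state drive) is non-volatile computer storage built on rigid rotating platters. It stores and retrieves digital data on flat magnetic surfaces. Data is written by a head positioned very close to the magnetic surface, which uses an electromagnetic current to change the polarity of the medium. Data is read back because the platter's own magnetic field changes the electrical signal in the read coil as the head passes over it. Hard disks use semi-random access: data can be read in any order[2], but reading from different positions takes different amounts of time. A drive contains one or more platters spinning at high speed, plus heads mounted on an actuator arm. In early drives the storage media were removable, but today they generally cannot be replaced, and the platters and heads are sealed together inside the drive. The drive has a filtered breather hole that equalizes the pressure difference between inside and outside caused by heat generated during operation. IBM introduced the hard disk in 1956[3]. It became the main secondary storage device of general-purpose computers in the early 1960s, and as the technology advanced it also became a major component of servers and personal computers.
  3. Reference: https://zh.wikipedia.org/wiki/硬盘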

SATA SSD

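  1. Concept: A solid-state drive or solid-state disk (SSD) is a computer storage device built from integrated circuits. It can use non-volatile memory (mainly NAND flash) as permanent storage, or volatile memory (such as DRAM) as temporary storage. SSDs commonly use interfaces such as SATA, PCI Express, mSATA, M.2, ZIF, IDE, U.2, CF, and CFast. Because they still lag behind mechanical hard disks in price per unit of capacity and in maximum capacity, SSDs cannot yet fully replace them.
  2. Reference: https://zh.wikipedia.org/wiki/固态硬盘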

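NVMe (Non-Volatile Memory Express)

  1. Full name: NVMHCIS: Non-Volatile Memory Host Controller Interface Specification
  2. Concept: NVM Express (NVMe), or the Non-Volatile Memory Host Controller Interface Specification (NVMHCIS), is a logical device interface specification. It is a bus transfer protocol specification based on the device's logical interface (comparable to the application layer of a communication protocol stack) for accessing non-volatile storage media attached through the PCI Express (PCIe) bus, such as flash-based solid-state drives, although in theory it does not strictly require PCIe. NVMe is a protocol: a set of software and hardware standards that lets SSDs use the PCIe bus. PCIe is the actual physical connection.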

Figure: NVMe SSDs in PCI Express card form (top) and M.2 form (bottom)

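  1. NVM stands for non-volatile memory, of which flash is the form commonly used in solid-state drives (SSDs). The specification mainly gives flash-based storage devices a native interface with low latency and internal parallelism, and gives modern CPUs, platforms, and applications native support for parallel storage access[1], so that host hardware and software can fully exploit the parallelism of solid-state storage. Compared with AHCI (the protocol used with SATA), which dates from the hard disk drive (HDD) era, NVMe/NVMHCI reduces I/O latency, increases the number of operations in flight at once, and supports much deeper command queues. Because it builds on the PCIe bus, an NVMe device can sit in any physical slot that carries PCIe, including standard PCIe expansion cards (typically four PCIe lanes)[2], 2.5-inch/3.5-inch drives using the U.2 connector (SFF-8639)[3][4], SATA Express devices (which are PCIe-compatible), and M.2 cards.[5] The specification is managed by the NVMHCIS working group.

  2. Reference: https://zh.wikipedia.org/wiki/NVM_Express

  3. Reference: https://www.chinastor.com/baike/ssd/04103A942017.html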

PMem

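  1. Full name: Persistent Memory (non-volatile memory)

  2. Concept: Intel® Optane™ Persistent Memory (PMem) is a new kind of hardware built on 3D XPoint media and packaged in a memory-module form factor. It is usually just called Persistent Memory.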

![image-20211203132338916](/Users/chenjiawei/Library/Application Support/typora-user-images/image-20211203132338916.png)

  1. Reference: https://learnku.com/articles/59125

ACID:

  1. Full name: Atomicity | Consistency | Isolation | Durability

Atomicity

Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single “unit”, which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes.[4] A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright. As a consequence, the transaction cannot be observed to be in progress by another database client. At one moment in time, it has not yet happened, and at the next it has already occurred in whole (or nothing happened if the transaction was cancelled in progress).

An example of an atomic transaction is a monetary transfer from bank account A to account B. It consists of two operations: withdrawing the money from account A and depositing it into account B. Performing these operations in an atomic transaction ensures that the database remains in a consistent state, that is, money is neither debited nor credited if either of those two operations fails.
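The transfer example can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not how any particular bank or database implements it: the `accounts` table, the `transfer` and `balances` helpers, and the "no negative balance" rule are all assumptions made for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Assumed business rule: no account may end up negative.
            (lowest,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False  # both UPDATEs were rolled back together


def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

ok = transfer(conn, "A", "B", 30)       # both updates are applied
failed = transfer(conn, "A", "B", 500)  # neither update survives
print(ok, failed, balances(conn))
```

The second call debits A and credits B inside the transaction, then fails the check, so the rollback undoes both statements rather than leaving a half-applied transfer.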

Consistency

Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This prevents database corruption by an illegal transaction, but does not guarantee that a transaction is correct. Referential integrity guarantees the primary key-foreign key relationship.

Isolation

Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. Isolation is the main goal of concurrency control; depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions.

Durability

Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.

CAP theorem:

  1. Full name: Consistency | Availability | Partition Tolerance

  2. Concept: In theoretical computer science, the CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that any distributed data store can only provide two of the following three guarantees:[1][2][3]

    • Consistency

      Every read receives the most recent write or an error.

    • Availability

      Every request receives a (non-error) response, without the guarantee that it contains the most recent write.

    • Partition tolerance

      The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

    When a network partition failure happens, it must be decided whether to:

    1. cancel the operation and thus decrease the availability but ensure consistency or to
    2. proceed with the operation and thus provide availability but risk inconsistency.

    Thus, if there is a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.[4]

    Eric Brewer argues that the often-used “two out of three” concept can be somewhat misleading because system designers only need to sacrifice consistency or availability in the presence of partitions, but that in many systems partitions are rare.

  3. Choices:

    1. AP: distributed NoSQL stores
    2. CP: Oracle RAC
    3. CA: single-node relational databases
  4. Reference: https://en.wikipedia.org/wiki/CAP_theorem
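The consistency-versus-availability choice during a partition can be shown with a toy model. The sketch below is deliberately simplified and is not how any real database behaves: `Replica`, `write`, and `read_after_partition` are hypothetical names, there are only two nodes, and replication is a plain assignment.

```python
class Replica:
    """One node holding a copy of a single value."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP" (assumed labels for this toy)
        self.value = None
        self.partitioned = False  # True when cut off from the other replica

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the most recent write: refuse to answer.
            raise RuntimeError("unavailable during partition")
        return self.value  # AP: answer anyway, possibly with stale data


def write(primary, secondary, value):
    """Write to the primary and replicate unless the link is partitioned."""
    primary.value = value
    if not secondary.partitioned:
        secondary.value = value


def read_after_partition(mode):
    primary, secondary = Replica(mode), Replica(mode)
    write(primary, secondary, "v1")   # replicated normally
    secondary.partitioned = True      # network partition begins
    write(primary, secondary, "v2")   # the secondary never sees this write
    try:
        return secondary.read()
    except RuntimeError as exc:
        return f"error: {exc}"


print("AP:", read_after_partition("AP"))
print("CP:", read_after_partition("CP"))
```

In AP mode the cut-off replica answers with the stale value, and in CP mode it returns an error instead, which is the trade-off described in the two numbered options above.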

BASE:

  1. Definition:

    1. Basically Available
    2. Soft state
    3. Eventually consistent
  2. Sacrifices strong consistency in exchange for high availability and partition tolerance

  3. Reference: https://www.chinastor.com/baike/ssd/04103A942017.html

SDS:

  1. Full name: Software-Defined Storage

  2. From Wikipedia:

    1. Software-defined storage (SDS) is a marketing term for computer data storage software for policy-based provisioning and management of data storage independent of the underlying hardware. Software-defined storage typically includes a form of storage virtualization to separate the storage hardware from the software that manages it.[1] The software enabling a software-defined storage environment may also provide policy management for features such as data deduplication, replication, thin provisioning, snapshots and backup.
    2. Software-defined storage (SDS) hardware may or may not also have abstraction, pooling, or automation software of its own. When implemented as software only in conjunction with commodity servers with internal disks, it may suggest software such as a virtual or global file system. If it is software layered over sophisticated large storage arrays, it suggests software such as storage virtualization or storage resource management, categories of products that address separate and different problems. If the policy and management functions also include a form of artificial intelligence to automate protection and recovery, it can be considered as intelligent abstraction.[2] Software-defined storage may be implemented via appliances over a traditional storage area network (SAN), or implemented as network-attached storage (NAS), or using object-based storage. In March 2014 the Storage Networking Industry Association (SNIA) began a report on software-defined storage.
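  3. From Baidu Baike:

    1. The storage-control functions typically found in storage hardware, including volume management, RAID, data protection, snapshots, and replication, are pulled out of the hardware and implemented in software.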

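DAS (direct-attached storage):

  1. Definition: Direct-attached storage (DAS) is computer storage that is connected directly to one computer and cannot be accessed by other computers. For personal computer users, the hard disk drive is the common form of direct-attached storage.
  2. Reference: https://baike.baidu.com/item/直连式存储/10276052?fr=aladdin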

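NAS (network-attached storage)

  1. Full name: Network Attached Storage

  2. Definition: a form of storage that is attached to, and accessed over, Ethernet.

    NAS (Network Attached Storage), taken literally, is a device attached to the network that provides data storage, so it is also called a network storage appliance. It is a dedicated data storage server. It is data-centric: it fully separates storage devices from servers and manages data centrally, which frees bandwidth, improves performance, lowers total cost of ownership, and protects investment. Its cost is far lower than server-attached storage while its efficiency is far higher. Well-known international NAS vendors currently include NetApp, EMC, and OUO.

  3. Reference: https://baike.baidu.com/item/NAS/3465615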

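SAN (storage area network)

  1. Full name: Storage Area Network

  2. Definition: A storage area network (SAN) uses Fibre Channel (FC) technology (spelled "Fibre" to distinguish it from "Fiber Channel" in the sense of optical fiber) and connects storage arrays to server hosts through FC switches, building a network dedicated to data storage. After more than a decade of development, SAN is quite mature and has become the de facto industry standard (although vendors' FC switching technologies are not fully identical, so servers and SAN storage have compatibility requirements).

    SAN focuses on problems specific to enterprise storage. The problems of current enterprise storage solutions have two roots: the structural limits created by tightly coupling data to application systems, and the limits of the Small Computer System Interface (SCSI) standard. Most analyses regard SAN as the future enterprise storage solution because it is easy to integrate, improves data availability and network performance, and reduces management work.

  3. Reference: https://baike.baidu.com/item/存储区域网络/6091260?fromtitle=SAN&fromid=10789152

  4. Unlike NAS, this kind of storage runs on a dedicated network that does not interoperate with Ethernet. Both SAN and NAS connect application servers to storage devices over a network, but they differ: a SAN presents a disk (a block device) to the application server, which must be formatted with a file system before use, whereas NAS presents a file system.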

SCSI (Small Computer System Interface)

  1. Full name: Small Computer System Interface

  2. Definition: Small Computer System Interface (SCSI, /ˈskʌzi/ SKUZ-ee)[1] is a set of standards for physically connecting and transferring data between computers and peripheral devices. The SCSI standards define commands, protocols, electrical, optical and logical interfaces. The SCSI standard defines command sets for specific peripheral device types; the presence of “unknown” as one of these types means that in theory it can be used as an interface to almost any device, but the standard is highly pragmatic and addressed toward commercial requirements. The initial Parallel SCSI was most commonly used for hard disk drives and tape drives, but it can connect a wide range of other devices, including scanners and CD drives, although not all controllers can handle all devices.

    The ancestral SCSI standard, X3.131-1986, generally referred to as SCSI-1, was published by the X3T9 technical committee of the American National Standards Institute (ANSI) in 1986. SCSI-2 was published in August 1990 as X3.T9.2/86-109, with further revisions in 1994 and subsequent adoption of a multitude of interfaces. Further refinements have resulted in improvements in performance and support for ever-increasing storage data capacity.

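  3. Reference: https://en.wikipedia.org/wiki/SCSI

  4. SCSI defines commands, communication protocols, and the physical electrical characteristics. It supports parallel operation, and reads and writes do not consume CPU time slices.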

SAS

  1. Full name: Serial Attached SCSI

  2. Definition: In computing, Serial Attached SCSI (SAS) is a point-to-point serial protocol that moves data to and from computer-storage devices such as hard disk drives and tape drives. SAS replaces the older Parallel SCSI (Parallel Small Computer System Interface, usually pronounced “scuzzy” or “sexy”[3][4]) bus technology that first appeared in the mid-1980s. SAS, like its predecessor, uses the standard SCSI command set. SAS offers optional compatibility with Serial ATA (SATA), versions 2 and later. This allows the connection of SATA drives to most SAS backplanes or controllers. The reverse, connecting SAS drives to SATA backplanes, is not possible.[5]

    The T10 technical committee of the International Committee for Information Technology Standards (INCITS) develops and maintains the SAS protocol; the SCSI Trade Association (SCSITA) promotes the technology.

  3. Reference: https://en.wikipedia.org/wiki/Serial_Attached_SCSI

AHCI

  1. Full name: Serial ATA Advanced Host Controller Interface

  2. Definition: The Advanced Host Controller Interface (AHCI) is a technical standard defined by Intel that specifies the operation of Serial ATA (SATA) host controllers in a non-implementation-specific manner in its motherboard chipsets.

    The specification describes a system memory structure for computer hardware vendors to exchange data between host system memory and attached storage devices. AHCI gives software developers and hardware designers a standard method for detecting, configuring, and programming SATA/AHCI adapters. AHCI is separate from the SATA 3 Gbit/s standard, although it exposes SATA’s advanced capabilities (such as hot swapping and native command queuing) such that host systems can utilize them. For modern solid state drives, the interface has been superseded by NVMe.[1]

    As of December 2020, the current version of the specification is 1.3.1.

    AHCI is a general-purpose interface standard.

  3. Reference: https://en.wikipedia.org/wiki/Advanced_Host_Controller_Interface

iSCSI

  1. iSCSI is a technology for mapping remote storage devices. It encapsulates SCSI over TCP, that is, it carries SCSI protocol traffic over Ethernet.

  2. In computing, **iSCSI** is an acronym for Internet Small Computer Systems Interface, an Internet Protocol (IP)-based storage networking standard for linking data storage facilities. It provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI is used to facilitate data transfers over intranets and to manage storage over long distances. It can be used to transmit data over local area networks (LANs), wide area networks (WANs), or the Internet and can enable location-independent data storage and retrieval.

    The protocol allows clients (called initiators) to send SCSI commands (CDBs) to storage devices (targets) on remote servers. It is a storage area network (SAN) protocol, allowing organizations to consolidate storage into storage arrays while providing clients (such as database and web servers) with the illusion of locally attached SCSI disks.[1] It mainly competes with Fibre Channel, but unlike traditional Fibre Channel which usually requires dedicated cabling,[a] iSCSI can be run over long distances using existing network infrastructure.[2] iSCSI was pioneered by IBM and Cisco in 1998 and submitted as a draft standard in March 2000.

  3. Reference: https://en.wikipedia.org/wiki/ISCSI

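NVMe-oF

  1. Full name: NVMe over Fabrics

  2. Explanation: as SSD hardware got faster, software became the bottleneck.

    1. To reduce software overhead, SPDK appeared.
    2. SSDs were moved into separate devices, so that storage stands on its own and can be shared by many hosts.

    NVMe-oF extends the NVMe Transport part of the NVMe protocol to support InfiniBand, Ethernet, Fibre Channel, and so on.

    NVMe-oF comes in two kinds: one uses RDMA and the other uses FC-NVMe.

    The former: “InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP (internet Wide Area RDMA Protocol). RDMA supports transferring data between the memory of two computers without involving the processor, and provides low latency and fast data transfer.”

    Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.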

  3. NVM Express over Fabrics (NVMe-oF) is the concept of using a transport protocol over a network to connect remote NVMe devices, contrary to regular NVMe where physical NVMe devices are connected to a PCIe bus either directly or over a PCIe switch to a PCIe bus. In August 2017, a standard for using NVMe over Fibre Channel (FC) was submitted by the standards organization International Committee for Information Technology Standards (INCITS), and this combination is often referred to as FC-NVMe or sometimes NVMe/FC.[35]

    As of May 2021, supported NVMe transport protocols are:

    The standard for NVMe over Fabrics was published by NVM Express, Inc. in 2016.[40][41]

    The following software implements the NVMe-oF protocol:

    • Linux NVMe-oF initiator and target.[42] RoCE transport was supported initially, and with Linux kernel 5.x, native support for TCP was added.
    • Storage Performance Development Kit (SPDK) NVMe-oF initiator and target drivers.[44] Both RoCE and TCP transports are supported.
    • Starwind NVMe-oF initiator and target for Microsoft Windows, supporting both RoCE and TCP transports.[47]
  4. Reference: https://en.wikipedia.org/wiki/NVM_Express

RDMA

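  1. Full name: Remote Direct Memory Access

  2. Purpose: removes the server-side data-processing latency of network transfers by moving data directly from the memory of one computer into that of another, without involving either operating system.

  3. Characteristics: low latency, low CPU overhead, high bandwidth

  4. Reference: https://cloud.tencent.com/developer/article/1420687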

  5. In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one’s operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.

    RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.

    However, this strategy presents several problems related to the fact that the target node is not notified of the completion of the request (single-sided communications).

  6. Reference: https://en.wikipedia.org/wiki/Remote_direct_memory_access

TOE

  1. Full name: TCP offload engine
  2. TCP offload engine (TOE) is a technology used in network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. In other words, the multi-layer protocol packet processing that costs the CPU substantial resources is moved onto the NIC. It requires NIC support and is common on high-speed Ethernet interfaces.
  3. Reference: https://en.wikipedia.org/wiki/TCP_offload_engine

U-Net

  1. Full name: User-level Network interface (U-Net)
  2. Avoids copying data from user space to kernel space

Linux kernel I/O stack

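  1. User space and kernel space: a 32-bit system has a 4 GB address space. The kernel is the core of the operating system. It is independent of ordinary applications, can access protected memory, and has full access to the underlying hardware. The operating system splits the virtual address space into two parts: kernel space and user space. On 32-bit Linux, the top 1 GB is used by the kernel and the lower 3 GB by user processes.

  2. Page cache (buffered I/O): with Linux buffered I/O, the operating system caches I/O data in the file system's page cache. Data is first copied into the kernel buffer and only then copied from the kernel buffer into the application's address space.

  3. Categories of network I/O models:

    1. Synchronous I/O
    2. Blocking I/O
    3. Non-blocking I/O
    4. I/O multiplexing
    5. Signal-driven I/O
    6. Asynchronous I/O
  4. Linux I/O models (reference: https://www.jianshu.com/p/486b0965c296):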

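    1. Synchronous blocking: the thread waits until the kernel has copied the data from the kernel buffer into application space, then processes it.
    2. Synchronous non-blocking: the thread does other work while the data is not ready, but must keep polling (the operating system returns immediately, though not necessarily with data).
    3. I/O multiplexing: the blocking waits of many I/O channels are folded into a single blocking select call, so one thread can serve many client requests at once. Compared with the traditional multi-thread/multi-process model, its biggest advantage is low overhead: the system does not need to create extra processes or threads or keep them running, which reduces maintenance work and saves system resources.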
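I/O multiplexing can be sketched with Python's standard `selectors` module, which wraps select/epoll-style system calls. This is only an illustration under simplifying assumptions: the `echo_upper` function is hypothetical, the "clients" are local `socketpair` ends rather than real network connections, and each message is small enough to arrive in one `recv`.

```python
import selectors
import socket


def echo_upper(messages):
    """Serve several connections from one thread using a single selector."""
    sel = selectors.DefaultSelector()
    clients = []
    for _ in messages:
        client, server = socket.socketpair()
        server.setblocking(False)
        sel.register(server, selectors.EVENT_READ)
        clients.append(client)

    # Every client sends its request before the server looks at any of them.
    for client, msg in zip(clients, messages):
        client.sendall(msg)

    pending = len(messages)
    while pending:
        # One blocking call waits on all registered sockets at once.
        for key, _events in sel.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(1024)
            conn.sendall(data.upper())
            sel.unregister(conn)
            conn.close()
            pending -= 1

    replies = [client.recv(1024) for client in clients]
    for client in clients:
        client.close()
    sel.close()
    return replies


print(echo_upper([b"read", b"write", b"open"]))
```

The single `sel.select()` call is the only place the thread blocks, which is the point of the model: no thread or process is created per connection.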

Linux开源存储全栈详解从Ceph 到容器存储 (reading notes)

SPDK: Storage Performance Development Kit:

Uses user-space NVMe SSD drivers to accelerate applications such as an iSCSI target or an NVMe-oF target

  1. Linux remote storage services

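“(1) Block device services
The block device services commonly used on Linux are based mainly on iSCSI (Internet Small Computer System Interface) and NVMe over Fabrics.
iSCSI extends the SCSI (Small Computer System Interface) protocol over Ethernet. Using iSCSI, one machine can offer shared storage devices to other clients over TCP/IP (Transmission Control Protocol / Internet Protocol). A device accessed through iSCSI is called a target, and the client that accesses the target is called an initiator. The mainstream iSCSI target software on Linux today is the kernel-based Linux-IO, which can be managed from user space with the targetcli tool. There are also other open-source iSCSI targets, such as STGT and SCST. Common iSCSI initiator tools include the iscsiadm command and development packages such as libiscsi and open-iscsi.
NVMe over Fabrics extends the NVMe protocol over fabrics. Its main design goal is to let clients access NVMe drives on remote servers more efficiently. Compared with iSCSI, NVMe over Fabrics is designed entirely for efficient access to fast storage devices that speak NVMe, and it usually works together with RDMA (Remote Direct Memory Access) capable Ethernet NICs, Fibre Channel, or InfiniBand.”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.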

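“(2) File storage services
Linux can provide many file-granularity services based on different protocols. For example, for a service based on the Network File System (NFS) protocol, the server can directly load a daemon that supports NFS. NFS was first developed by Sun in 1984 and has since evolved to NFSv4.
There is also the Samba service, based on CIFS (Common Internet File System), which can share files with Windows clients, so that a Windows client can mount a network address as a local disk. For example, if a Linux server at 192.168.1.8 exports a directory named XYZ that actually points to /home/XYZ, the client can use \\192.168.1.8\XYZ, subject to user authentication by the Samba server.
Linux also has other file services, such as those based on the File Transfer Protocol (FTP), which are not covered here. In addition, users familiar with SSH (Secure Shell) commands can use scp to copy files between Linux clients, or wget to download files. These are features ordinary users rely on. They need server-side support, but the server-side programs are generally fairly simple to implement.”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.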

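  1. Categories of storage services (p. 61):
    1. Block storage
    2. File storage
    3. Object storage
  2. Compression (bit-level redundancy removal):
    1. Huffman coding (I don't remember this well)
    2. Arithmetic coding (I didn't follow this one)
  3. Data deduplication (block-level redundancy removal, 8 KB blocks)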
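Block-level deduplication can be sketched as follows. This is a simplified illustration, not the scheme of any particular product: fixed-size 8 KiB chunking, SHA-256 as the fingerprint, and the names `dedup_store` and `restore` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KiB, matching the block size in the note above


def dedup_store(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps fingerprint -> block bytes, and recipe
    is the ordered list of fingerprints needed to rebuild the original data.
    """
    store, recipe = {}, []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # store the block only once
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original byte string from the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store, recipe = dedup_store(data)
print(len(recipe), "logical blocks,", len(store), "stored blocks")
```

Four logical blocks reduce to two stored blocks here because three of them are byte-for-byte identical.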

SSD

  1. Evolution of physical interfaces: SATA, mSATA, SATA Express, M.2, U.2
  2. ![image-20211130102958080](/Users/chenjiawei/Library/Application Support/typora-user-images/image-20211130102958080.png)

Network storage technologies

DAS, NAS, SAN, iSCSI

![image-20211130104641972](/Users/chenjiawei/Library/Application Support/typora-user-images/image-20211130104641972.png)

![image-20211130104951269](/Users/chenjiawei/Library/Application Support/typora-user-images/image-20211130104951269.png)

Reference: p. 174

Storage-related system calls provided by the kernel

read

write

open

Regions of a process's address space: code segment, data segment, uninitialized global variable (BSS) segment, heap and stack

mmap

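“Compared with read/write, one notable benefit of accessing a file through mmap is that it saves one copy between user space and kernel space.”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.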
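The two access paths can be compared with Python's `mmap` module. This sketch only shows the API difference: the helper names are hypothetical, and because Python slicing itself creates a new bytes object, it does not measure the saved copy that the quote describes.

```python
import mmap
import os
import tempfile


def read_with_read(path, offset, length):
    """Classic path: the kernel copies page-cache data into a user buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_with_mmap(path, offset, length):
    """mmap path: the file's pages are mapped into the process address space."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Create a small scratch file to read back through both paths.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello storage stack")
    path = tmp.name

via_read = read_with_read(path, 6, 7)
via_mmap = read_with_mmap(path, 6, 7)
os.remove(path)
print(via_read, via_mmap)
```

Both calls return the same bytes. The difference is in how the data reaches the process: `read` fills a buffer supplied by the caller, while `mmap` lets the process address the page cache pages directly.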

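The virtual file system (VFS) has four main object types:

  1. Superblock: represents a mounted file system
  2. Inode: represents an actual physical file on a storage device, i.e. its metadata
  3. Dentry (directory entry): describes the hierarchy of the file system
  4. File: represents a file opened by a process, mainly used to tie processes to files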

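Advantages of the Btrfs file system:

  1. B-tree indexes, which speed up lookups and reduce disk I/O
  2. Extent-based file storage
  3. Dynamic inode allocation
  4. Optimizations for solid-state drives
  5. Checksums for both metadata and data
  6. Copy-on-write
  7. Subvolumes
  8. Software RAID
  9. Compression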

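Direct I/O

Skips the kernel buffer used by buffered I/O, so data moves directly between user space and the disk.

“The main advantage of Direct I/O is that, by reducing the number of data copies between the kernel buffer and user space, it lowers the CPU load and memory-bandwidth usage caused by file reads and writes.”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.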

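I/O scheduling algorithms

  1. noop: does essentially nothing, dispatching requests in arrival order
  2. deadline: elevator ordering plus starvation avoidance (a request is dispatched immediately once its deadline expires)
  3. CFQ (Completely Fair Queuing): schedules within fixed time slices, and a queue that uses up its slice goes back to wait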

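I/O merging

Before I/O scheduling, each thread accumulates requests in a private plug queue. While plugged, the kernel tries to merge bios into existing requests.

“When unplugging, the requests in the process-local plug queue are added to the elevator scheduler's queue. That is, when the requests in each process's local plug queue are flushed, they do not go to the final device driver but into an elevator scheduler, where they are queued once more. The main purposes of this elevator scheduler are to merge requests further, to make the requests' access to the disk sequential, and to enforce a degree of QoS (Quality of Service).”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.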
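The idea of sorting and merging requests can be sketched in a few lines. This is a toy model, not the kernel's block layer: requests are hypothetical `(start_sector, sector_count)` tuples, `merge_requests` is an invented helper, and only exactly contiguous requests are merged (overlaps are not handled).

```python
def merge_requests(requests):
    """Sort (start_sector, sector_count) requests and merge contiguous ones."""
    merged = []
    for start, count in sorted(requests):  # elevator-style ascending order
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)  # extend previous request
        else:
            merged.append((start, count))
    return merged


queue = [(100, 8), (0, 8), (108, 8), (8, 8), (300, 8)]
print(merge_requests(queue))
```

Five queued requests become three larger, ascending ones, so the device sees fewer and more sequential accesses.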

DRBD

Distributed Replicated Block Device: primary/secondary replication over TCP/IP

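Storage acceleration

CPU-based

  1. Hyper-threading
  2. SIMD (Single Instruction, Multiple Data): one instruction processes multiple data elements at once

Based on coprocessors or other hardware

  1. FPGA
  2. Storage protocol conversion acceleration
  3. Acceleration through special storage interfaces

SmartNIC acceleration

Moves part of the CPU's work onto the NIC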

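Data protection

  1. RAID (Redundant Array of Independent Disks)

  2. Erasure coding:

    1. “Erasure codes can be seen as a superset of RAID 5 and RAID 6. A k + m erasure code is shown in Figure 4-8. The basic idea is to compute m redundant elements (parity blocks) from k original data elements. Among these k + m elements, when any m of them fail (whether original data or redundant data), the original k data blocks can be recovered with the corresponding reconstruction algorithm. Generating the parity is called encoding, and recovering lost data blocks is called decoding.”

      Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.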
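The simplest instance of a k + m erasure code is m = 1 with XOR parity, which is the idea behind RAID 5. The sketch below covers only that special case and is not a general Reed-Solomon implementation. The function names are hypothetical, all blocks are assumed to have equal length, and a lost block is marked with `None`.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Encoding: return the k data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(blocks):
    """Decoding: rebuild at most one missing block and return the k data blocks."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a k + 1 XOR code tolerates only one lost block")
    blocks = list(blocks)
    if missing:
        survivors = [block for block in blocks if block is not None]
        blocks[missing[0]] = xor_blocks(survivors)  # XOR of the rest restores it
    return blocks[:-1]


data = [b"abcd", b"efgh", b"ijkl"]  # k = 3
stripe = encode(data)               # k + m = 4 blocks
stripe[1] = None                    # lose one data block
print(decode(stripe))
```

Tolerating m > 1 simultaneous losses needs a stronger code (for example Reed-Solomon), but the encode/decode roles are the same as described in the quote.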

Data security

Hashes are used for deduplication, encryption, and data-consistency verification

  1. MD5
  2. SHA1
  3. SHA2
  4. SHA3
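All four hash families in the list are available in Python's standard `hashlib` module. The sketch below picks SHA-256 and SHA3-256 as representatives of SHA2 and SHA3; that choice and the `digests` helper are assumptions for illustration. MD5 and SHA-1 are no longer considered collision-resistant, so they are better suited to checks against accidental corruption than against tampering.

```python
import hashlib


def digests(data):
    """Hex digests of `data` under the four hash families listed above."""
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA1": hashlib.sha1(data).hexdigest(),
        "SHA2 (SHA-256)": hashlib.sha256(data).hexdigest(),
        "SHA3 (SHA3-256)": hashlib.sha3_256(data).hexdigest(),
    }


original = digests(b"block of stored data")
tampered = digests(b"block of stored dat4")

for name, value in original.items():
    # A one-character change produces a completely different digest.
    print(f"{name:16} {value[:16]}... changed={value != tampered[name]}")
```

Comparing a stored digest with a freshly computed one is the basic mechanism behind both the consistency check and the deduplication fingerprint.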

Data integrity:

Cyclic redundancy check (CRC)
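A CRC check can be sketched with `zlib.crc32` from the Python standard library. The framing used here (payload followed by a 4-byte big-endian CRC-32) and the `frame` and `check` helpers are assumptions for the example, not a format used by any specific storage system.

```python
import zlib


def frame(payload):
    """Append a 4-byte big-endian CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed):
    """Return True if the trailing CRC-32 matches the payload."""
    payload, stored = framed[:-4], int.from_bytes(framed[-4:], "big")
    return zlib.crc32(payload) == stored


good = frame(b"block data")
corrupted = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit of the payload
print(check(good), check(corrupted))
```

A CRC detects accidental corruption such as a flipped bit, but it is not a cryptographic hash and offers no protection against deliberate modification.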

Software libraries for accelerating storage performance

SPDK: user space, asynchronous, polling

p. 291

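Consistent hashing

“Consistent hashing solved the data-migration problem caused by scaling out, and even comes close to the theoretical optimum: in a storage system of n nodes holding k data blocks, adding m nodes causes only k × m / (n + m) blocks on average to migrate from the n nodes to the m new nodes, rather than redistributing all k blocks. This seems close to ideal, but the consistent hashing model is still too simple to handle every situation that can arise in a storage system. The most prominent is data failure: because all user data is spread evenly across the system, the failure of one device affects the integrity of all user data. And because consistent hashing has no awareness of the actual physical layout of storage nodes, there is no way to reasonably control the failure domain of data.”

Excerpt from: Intel Asia-Pacific Research and Development Ltd. “Linux开源存储全栈详解从Ceph 到容器存储.” Apple Books.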
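The migration property in the quote can be observed with a small hash ring. This is a generic textbook-style sketch, not the algorithm of any specific system: the `HashRing` class, the use of 100 virtual nodes per node, and SHA-256 as the ring hash are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """A minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _point(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._point(f"{node}#{i}"), node))

    def locate(self, key):
        """Return the node owning the first ring point at or after the key."""
        index = bisect.bisect(self._ring, (self._point(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = [f"block-{i}" for i in range(2000)]
ring = HashRing([f"node-{i}" for i in range(4)])   # n = 4
before = {key: ring.locate(key) for key in keys}
ring.add("node-4")                                  # m = 1
after = {key: ring.locate(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} blocks moved")
```

With n = 4 and m = 1 the formula predicts about k × m / (n + m) = 400 of the 2000 blocks moving. The exact count depends on the hash values, but every block that moves lands on the new node and no block moves between the old nodes.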

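CRUSH algorithm

  1. Metadata:
    1. CRUSH Map: stores the location and weight settings of every device or OSD storage node in the cluster
    2. OSDMap: stores the runtime state of each OSD, letting CRUSH know when storage nodes fail, are removed, or join
    3. CRUSH Rule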