Skip to Content

Persistance

2019-01-21·Operating System

Operating System: Three Easy Pieces 讲义 NJU: 操作系统 - 蒋炎岩>

储存介质 (Non-Volatile Memory)

NameYear原理Remarks
Magnetic Tape1928磁比特数组无法进行高效的随机读写
Magnetic Drum1932并行版磁带
Hard Disk1956二维并行版磁带柱面(platter)+磁道(track)+扇区(sector)
Floppy Disk1971读写头与盘片分离floppy drive 每个电脑一份
Compact Disk19801bit = 挖坑/填坑CD-ReWritable: 激光改变反射率
Solid State Drive1991NAND FlashNAND(x,y)=!(x&y)!(x\&y)
FTL(Flash translation layer)运行操作系统和闪存管理程序
USB Flash Disk1999放电做不到100%干净

Hard Disk Time

  • RPM: rotations per minute
  • 外部半径大
    • track skew
    • multi-zoned
  • cache (track buffer)
NameDescription经典数值
rotation delayrotate on track2ms
seek timeacceleration+coasting+deceleration+settling(most, 0.5-2ms)4ms
transfer timeread/write data30 ms
tatol delayrotation delay + seek time(0.5-2ms) + transfer time
I/O rate RandomTransfer size/total delay0.66MB/s(SCSI) 0.31MB/s(SATA)
I/O rate Sequential125MB/s(SCSI) 105MB/s(SATA)

disk scheduler

NameRemarks
SJFlength of each job is usually unknown
SSTF(Shortest Seek Time First)Disk infomation may not be known to host
NBF(nearest block first)
SCAN(elevator algorithm)moves back and forth across the disk servicing re- quests in order across the tracks
SPTF(shortest positioning time first)seek time + rotation delay

IO设备与驱动

访问指令

NameDescription
Port IO (PIO)为I/O设备提供了一个单独的地址空间,通过读/写端口的方式实现设备控制(状态、控制、数据寄存器)
Memory-Mapped I/O (MMIO)给特定的内存地址赋予特殊的含义,从而读/写内存地址就能实现设备的访问

设备管理

MethodDescription
Pollingpolling and wait device read->send->polling and wait device finished
interruptsend data->do sth else->receive ITR->run ISR
DMAsend info to DMA->do sth else->receive ITR->run ISR
  • coalescing: an interrupt-based optimzation
    • a device which needs to raise an interrupt first waits for a bit before deliv- ering the interrupt to the CPU
    • multiple interrupts can be coalesced into a single interrupt delivery
  • ISR: interrupt service routine(interrupt handler)

设备驱动程序

为每一类设备统一接口访问

CategoryExamples
字符设备键盘、鼠标、终端、显卡
块设备存储设备(磁盘、SSD)
网络设备网卡

设备举例

NameDescription
键盘控制器缓冲区
磁盘控制器通过状态、控制、数据访问,可配置为中断模式
显卡(GPU)协处理器;有专门的核心、指令等用于图像显示
DMA实现 memcpy() 的I/O设备;通过总线占用内存宽带
网卡网络设备:内存,DMA,中断
  • DMA(Direct Memery Access)
    • 没有DMA时的运行顺序
      • CPU 给设备发送少量命令(快)
      • CPU 传送数据(慢)
      • CPU 等待设备完成(并行)
    • 有DMA时的运行
      • 给 DMA 发送命令(快)
      • DMA 传送数据(DMA 与 CPU 并行)
      • DMA 等待设备完成 (DMA 与设备并行)
      • CPU 等待 DMA 完成(DMA 与 CPU 并行)
  • 网络设备驱动
    • send(): 发送包
      • TX:发送数据
      • RX:接受数据
    • poll(): 询问是否收到包
    • bufctl(): 设置 ring buffer
  • 每当需要做一件耗费CPU太多时间的专有工作,就可以增加一个I/O设备
    • 图形渲染->GPU
    • 大规模数据传送->DMA

设备总线 (I/O bus)

CategoryNameDescription
General I/O BusPCI
Peripheral I/O BusSCSI
SATA
USB

文件系统

文件系统是

  • 储存设备虚拟化(Virtualization)
  • 保存在持久储存上的数据结构
  • 管理操作系统内部对象的中间层
    • 文件描述符 = 指向对象的指针
    • read/write/ioctl系统调用 = 对象访问

文件系统实现

  • 文件系统的设计
    • balloc/bfree 的实现
    • 虚拟磁盘的数据结构
    • inode 的表示和储存
    • 目录文件的数据结构
  • 文件系统的实现 = 实现数据结构的查询/修改操作
  • 虚拟文件系统 VFS
  • 文件系统缓存
    • static partitioning: approx 10%
    • dynamic partition

API

CategoryNameDescription
管理文件inode numberlow-level name of a file
open()/openat()`int fd=open("foo", O_CREAT
fsync()fsync:所有数据;fdatasync:数据;mync
fstat()
管理目录isdir()
opendir/closedir()DIR *dp = opendir(".")
readdir()struct dirent *d = readdir(dp)
fchmod()/fchown()
rmdir()
Linkhard linksame inode
soft linksymbolic link,dangling reference,ln -s
Filesystemmkfs
mount

Filesystem List

CategoryNameIntro
WindowsFATFile Allocation Table, FAT12,FAT16,FAT32
exFAT
NTFSNew Technology File System
ReFSB+-tree, CoW(Copy-on-Write)
macOSHFS+
APFSB-Tree
LinuxExt4Ext2, Ext3
ReiserFSstoring a huge number of small files
XFS
JFSJSF1, JSF2, also for clustered file systems
BtrfsB-tree
UnixJFS
SolarisZFSalso for clusterd file systems, COW
Clustered file systemsGFSGobal File System of Red Hat Linux
VMFSVirtual Machine File System of VMware
Apple Xsan
NetAppWAFLsnapshots
OthersLFSLog-structured File SYstems

Filesystem Hierarchy Standard

NameFunction
/boot系统启动数据
/dev设备文件
/sbin系统程序
/etc配置文件
/home用户目录
/lib库文件
/media可移动设备
/var可变文件 (logs, snapshots)
/tmp临时文件
/usr用户程序 (/usr/bin/, /usr/lib/, /usr/local/

可靠性

  • 硬件:a disk fails;a block fails
  • 软件:一致性

RAID (Redundant Arrays of inexpensive disks)

Fail-stop model: a disk can be working or failed

RAIN LevelRAID-0RAIN-1RAID-4RAID-5
Realizationstripingmirroringparity-basedrotate parity
Remarkchunk sizeadditive/subtractive
CapacibilityNBN*B(NB)/2(N*B)/2(N1)B(N-1)*B(N1)B(N-1)*B
Reliability0111
Best caseN/2N/2
Throught
Sequential ReadNSN*S(N/2)S(N/2)*S(N1)S(N-1)*S(N1)S(N-1)*S
Sequential WriteNSN*S(N/2)S(N/2)*S(N1)S(N-1)*S(N1)S(N-1)*S
Random ReadNRN*RNRN*R(N1)S(N-1)*SNRN*R
Random WriteNRN*R(N/2)R(N/2)*R12R\frac{1}{2}RN4R\frac{N}{4}R
Latency
ReadTTTTTTTT
WriteTTTT2T2T2T2T

完整性

single-block failures

ErrorsDescriptionFrequency(Cheap/Costly)Detection
LSEs(latent-sector errors)sector(s) damaged9.40%/1.40%error correcting code(ECC)
Corruptionsilent faults like misdirected write0.50%/0.05%checksum (say a 4KB block)
Lost Writesread-after-write/checksum elsewhere

一致性

  • 持久化:原子性丧失
    • FAT: FAT[b]=EOF,data[b]=DATA,FAT[f]=b
    • ext2: DMAP,inode,DATA 都需要更新
  • 恢复文件系统一致性:FSCK(Filesystem Checking)
  • 崩溃一致性:Journaling
    • 利用 sync() 保证顺序性:磁盘会优化写入顺序
    • ext3 journaling (data journaling)
      • journal write: physical logging
        • TxB,I[v2],B[v2],Db->Journal
        • sync()
      • journal commit
        • TxD->Journal
        • sync()
      • Checkpoint
        • I[v2],B[v2],Db->Disk
    • Metadata Journaling (ordered journaling)
      • Data write
        • Db->Disk
        • sync()
      • Journal metadata write: logical logging
        • TxB,I[v2],B[v2]->Journal
        • sync()
      • Journal commit
        • TxE->Journal
        • sync()
      • Checkpoint metadata
        • I[v2],B[v2]->Disk
      • Free
        • mark the transaction free in journal superblock
    • 崩溃恢复
      • 用一个额外的指针维护journal完成的时刻
      • 从指针开始,向后重做journal中记录的操作
  • 其它方法
    • Soft Updates: orders all writes to the file sys- tem to ensure that the on-disk structures are never left in an inconsis- tent state
    • copy on write(COW): never overwrites files or directories in place
  • LFS: writing to disk sequentially
    • write buffering
    • D=F1FRpeakTpositionD=\frac{F}{1-F}*R_{peak}*T_{position}
    • Checkpoint Region->Inode Map->Inode->[SS,D]
    • recursive update problem
    • garbage collection problem