Operating System: Three Easy Pieces 讲义 NJU: 操作系统 - 蒋炎岩>
储存介质 (Non-Volatile Memory)
| Name | Year | 原理 | Remarks |
|---|---|---|---|
| Magnetic Tape | 1928 | 磁比特数组 | 无法进行高效的随机读写 |
| Magnetic Drum | 1932 | 并行版磁带 | |
| Hard Disk | 1956 | 二维并行版磁带 | 柱面(platter)+磁道(track)+扇区(sector) |
| Floppy Disk | 1971 | 读写头与盘片分离 | floppy drive 每个电脑一份 |
| Compact Disk | 1980 | 1bit = 挖坑/填坑 | CD-ReWritable: 激光改变反射率 |
| Solid State Drive | 1991 | NAND Flash | NAND(x,y)= |
| FTL(Flash translation layer) | 运行操作系统和闪存管理程序 | ||
| USB Flash Disk | 1999 | 放电做不到100%干净 |
Hard Disk Time
- RPM: rotations per minute
- 外部半径大
- track skew
- multi-zoned
- cache (track buffer)
| Name | Description | 经典数值 |
|---|---|---|
| rotation delay | rotate on track | 2ms |
| seek time | acceleration+coasting+deceleration+settling(most, 0.5-2ms) | 4ms |
| transfer time | read/write data | 30 ms |
| tatol delay | rotation delay + seek time(0.5-2ms) + transfer time | |
| I/O rate Random | Transfer size/total delay | 0.66MB/s(SCSI) 0.31MB/s(SATA) |
| I/O rate Sequential | 125MB/s(SCSI) 105MB/s(SATA) |
disk scheduler
| Name | Remarks |
|---|---|
| SJF | length of each job is usually unknown |
| SSTF(Shortest Seek Time First) | Disk infomation may not be known to host |
| NBF(nearest block first) | |
| SCAN(elevator algorithm) | moves back and forth across the disk servicing re- quests in order across the tracks |
| SPTF(shortest positioning time first) | seek time + rotation delay |
IO设备与驱动
访问指令
| Name | Description |
|---|---|
| Port IO (PIO) | 为I/O设备提供了一个单独的地址空间,通过读/写端口的方式实现设备控制(状态、控制、数据寄存器) |
| Memory-Mapped I/O (MMIO) | 给特定的内存地址赋予特殊的含义,从而读/写内存地址就能实现设备的访问 |
设备管理
| Method | Description |
|---|---|
| Polling | polling and wait device read->send->polling and wait device finished |
| interrupt | send data->do sth else->receive ITR->run ISR |
| DMA | send info to DMA->do sth else->receive ITR->run ISR |
- coalescing: an interrupt-based optimzation
- a device which needs to raise an interrupt first waits for a bit before deliv- ering the interrupt to the CPU
- multiple interrupts can be coalesced into a single interrupt delivery
- ISR: interrupt service routine(interrupt handler)
设备驱动程序
为每一类设备统一接口访问
| Category | Examples |
|---|---|
| 字符设备 | 键盘、鼠标、终端、显卡 |
| 块设备 | 存储设备(磁盘、SSD) |
| 网络设备 | 网卡 |
设备举例
| Name | Description |
|---|---|
| 键盘控制器 | 缓冲区 |
| 磁盘控制器 | 通过状态、控制、数据访问,可配置为中断模式 |
| 显卡(GPU) | 协处理器;有专门的核心、指令等用于图像显示 |
| DMA | 实现 memcpy() 的I/O设备;通过总线占用内存宽带 |
| 网卡 | 网络设备:内存,DMA,中断 |
- DMA(Direct Memery Access)
- 没有DMA时的运行顺序
- CPU 给设备发送少量命令(快)
- CPU 传送数据(慢)
- CPU 等待设备完成(并行)
- 有DMA时的运行
- 给 DMA 发送命令(快)
- DMA 传送数据(DMA 与 CPU 并行)
- DMA 等待设备完成 (DMA 与设备并行)
- CPU 等待 DMA 完成(DMA 与 CPU 并行)
- 没有DMA时的运行顺序
- 网络设备驱动
- send(): 发送包
- TX:发送数据
- RX:接受数据
- poll(): 询问是否收到包
- bufctl(): 设置 ring buffer
- send(): 发送包
- 每当需要做一件耗费CPU太多时间的专有工作,就可以增加一个I/O设备
- 图形渲染->GPU
- 大规模数据传送->DMA
设备总线 (I/O bus)
| Category | Name | Description |
|---|---|---|
| General I/O Bus | PCI | |
| Peripheral I/O Bus | SCSI | |
| SATA | ||
| USB |
文件系统
文件系统是
- 储存设备虚拟化(Virtualization)
- 保存在持久储存上的数据结构
- 管理操作系统内部对象的中间层
- 文件描述符 = 指向对象的指针
- read/write/ioctl系统调用 = 对象访问
文件系统实现
- 文件系统的设计
- balloc/bfree 的实现
- 虚拟磁盘的数据结构
- inode 的表示和储存
- 目录文件的数据结构
- 文件系统的实现 = 实现数据结构的查询/修改操作
- 虚拟文件系统 VFS
- 文件系统缓存
- static partitioning: approx 10%
- dynamic partition
API
| Category | Name | Description |
|---|---|---|
| 管理文件 | inode number | low-level name of a file |
open()/openat() |
`int fd=open("foo", O_CREAT | |
fsync() |
fsync:所有数据;fdatasync:数据;mync | |
fstat() |
||
| 管理目录 | isdir() |
|
opendir/closedir() |
DIR *dp = opendir(".") |
|
readdir() |
struct dirent *d = readdir(dp) |
|
fchmod()/fchown() |
||
rmdir() |
||
| Link | hard link | same inode |
| soft link | symbolic link,dangling reference,ln -s |
|
| Filesystem | mkfs |
|
mount |
Filesystem List
| Category | Name | Intro |
|---|---|---|
| Windows | FAT | File Allocation Table, FAT12,FAT16,FAT32 |
| exFAT | ||
| NTFS | New Technology File System | |
| ReFS | B+-tree, CoW(Copy-on-Write) | |
| macOS | HFS+ | |
| APFS | B-Tree | |
| Linux | Ext4 | Ext2, Ext3 |
| ReiserFS | storing a huge number of small files | |
| XFS | ||
| JFS | JSF1, JSF2, also for clustered file systems | |
| Btrfs | B-tree | |
| Unix | JFS | |
| Solaris | ZFS | also for clusterd file systems, COW |
| Clustered file systems | GFS | Gobal File System of Red Hat Linux |
| VMFS | Virtual Machine File System of VMware | |
| Apple Xsan | ||
| NetApp | WAFL | snapshots |
| Others | LFS | Log-structured File SYstems |
Filesystem Hierarchy Standard
| Name | Function |
|---|---|
| /boot | 系统启动数据 |
| /dev | 设备文件 |
| /sbin | 系统程序 |
| /etc | 配置文件 |
| /home | 用户目录 |
| /lib | 库文件 |
| /media | 可移动设备 |
| /var | 可变文件 (logs, snapshots) |
| /tmp | 临时文件 |
| /usr | 用户程序 (/usr/bin/, /usr/lib/, /usr/local/ |
可靠性
- 硬件:a disk fails;a block fails
- 软件:一致性
RAID (Redundant Arrays of inexpensive disks)
Fail-stop model: a disk can be working or failed
| RAIN Level | RAID-0 | RAIN-1 | RAID-4 | RAID-5 |
|---|---|---|---|---|
| Realization | striping | mirroring | parity-based | rotate parity |
| Remark | chunk size | additive/subtractive | ||
| Capacibility | ||||
| Reliability | 0 | 1 | 1 | 1 |
| Best case | ||||
| Throught | ||||
| Sequential Read | ||||
| Sequential Write | ||||
| Random Read | ||||
| Random Write | ||||
| Latency | ||||
| Read | ||||
| Write |
完整性
single-block failures
| Errors | Description | Frequency(Cheap/Costly) | Detection |
|---|---|---|---|
| LSEs(latent-sector errors) | sector(s) damaged | 9.40%/1.40% | error correcting code(ECC) |
| Corruption | silent faults like misdirected write | 0.50%/0.05% | checksum (say a 4KB block) |
| Lost Writes | read-after-write/checksum elsewhere |
一致性
- 持久化:原子性丧失
- FAT: FAT[b]=EOF,data[b]=DATA,FAT[f]=b
- ext2: DMAP,inode,DATA 都需要更新
- 恢复文件系统一致性:FSCK(Filesystem Checking)
- 崩溃一致性:Journaling
- 利用 sync() 保证顺序性:磁盘会优化写入顺序
- ext3 journaling (data journaling)
- journal write: physical logging
- TxB,I[v2],B[v2],Db->Journal
- sync()
- journal commit
- TxD->Journal
- sync()
- Checkpoint
- I[v2],B[v2],Db->Disk
- journal write: physical logging
- Metadata Journaling (ordered journaling)
- Data write
- Db->Disk
- sync()
- Journal metadata write: logical logging
- TxB,I[v2],B[v2]->Journal
- sync()
- Journal commit
- TxE->Journal
- sync()
- Checkpoint metadata
- I[v2],B[v2]->Disk
- Free
- mark the transaction free in journal superblock
- Data write
- 崩溃恢复
- 用一个额外的指针维护journal完成的时刻
- 从指针开始,向后重做journal中记录的操作
- 其它方法
- Soft Updates: orders all writes to the file sys- tem to ensure that the on-disk structures are never left in an inconsis- tent state
- copy on write(COW): never overwrites files or directories in place
- LFS: writing to disk sequentially
- write buffering
- Checkpoint Region->Inode Map->Inode->[SS,D]
- recursive update problem
- garbage collection problem