Introduction
Background
Linux I/O stack
OS block layer: additional features
-
Hardware agnostic: The block layer is the point of entry for I/O requests to block storage devices (HDDs or SSDs), except for direct I/Os (direct memory access), which are handled separately through interrupts. Any solution or optimization in the block layer can therefore be applied to a diverse range of storage devices, making the solution independent of the underlying physical media.
-
Information capturing: Accounting information, such as the block, file, and process a request belongs to, is isolated from the device driver, whose only function is to transmit requests from the block layer to the physical storage. The block layer, in contrast, has access to such attributes [8, 10], providing opportunities to exploit this information for intelligent optimizations (a small sketch of these per-request attributes follows this list).
-
Storage architecture agnostic: Irrespective of the storage architecture (SAN, NAS, or DAS), I/Os are ultimately managed by the block layer. As shown in Fig. 1, the block layer resides on the client side in centralized storage solutions such as SAN, while in the case of NAS it lies inside the NAS device. Therefore, any solution built into the block layer can be applied to any data-center storage infrastructure.
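As a concrete illustration of the attributes available at this level, the following is a minimal sketch of a block-layer request record of the kind the rest of the paper relies on. The field names and example values are illustrative assumptions (the example mirrors request B1 from the scheduler table further below), not an actual kernel structure or trace format.

```python
from dataclasses import dataclass

@dataclass
class BlockRequest:
    """Attributes visible at the block layer but not at the device driver.
    Field names are illustrative; they mirror the accounting information
    (block, file/offset, owning process) discussed above."""
    lba: int            # logical block address of the request
    sectors: int        # transfer size in 512-byte sectors
    op: str             # "R" or "W"
    pid: int            # process that issued the I/O
    comm: str           # process name (e.g. a Hadoop map/reduce task)
    timestamp: float    # submission time in seconds

# Hypothetical record corresponding to request B1 in the table below.
req = BlockRequest(lba=7125, sectors=40, op="W", pid=4321,
                   comm="java", timestamp=0.001)
```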
Secondary storage (block device) characteristics
-
Cost and lifespan: The main disadvantages of SCMs compared to HDDs are their high cost and limited lifespan. SCMs can endure only a limited number of write-erase cycles; after a threshold of writes, pages become dysfunctional. SCMs therefore increase the total cost of ownership (TCO). Unless the majority of the data is uniformly “hot”, it is highly inefficient to store data on high-value SCMs, as they remain underutilized and do not justify the high investment [3, 4, 23].
-
Write amplification: To increase the lifetime of SCM pages, the firmware spreads writes throughout the device (wear-leveling). Additionally, due to physical constraints, SCMs cannot overwrite data in place at the page level. Since erasure happens at the granularity of blocks, a single page update requires an out-of-place write and, eventually, a complete block erase. This results in unwanted phenomena such as write amplification and garbage collection (faulty block management) [9, 19, 24]. These activities consume considerable CPU time, and both the SSD controller and the file system take on additional book-keeping work beyond simple data access (a worked example follows this list).
-
Skewed write performance: The superior performance of SCMs over HDDs is highly dependent on the workload. For write-intensive scientific and industrial workloads, the performance of HDDs and SSDs has been shown to be nearly the same [9]. This skew in performance makes HDDs the more economically feasible choice for such workloads.
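The arithmetic below illustrates the write-amplification effect described above. The block geometry matches the SSD parameters used later in the evaluation (256 pages per block, 4 kB pages); the number of valid pages that must be relocated is a hypothetical example, not a measured value.

```python
# Illustrative write-amplification arithmetic for a single NAND flash block.
PAGES_PER_BLOCK = 256
PAGE_SIZE_KB = 4

def write_amplification(updated_pages: int, valid_pages: int) -> float:
    """The host updates `updated_pages`, but the block also holds `valid_pages`
    of unrelated live data that must be relocated before the block can be
    erased (out-of-place write followed by garbage collection)."""
    assert updated_pages + valid_pages <= PAGES_PER_BLOCK
    physical_writes = updated_pages + valid_pages    # pages actually written
    return physical_writes / updated_pages           # write amplification factor

# Updating one 4 kB page in a block where 200 other pages still hold valid data:
waf = write_amplification(updated_pages=1, valid_pages=200)
print(f"logical write = {1 * PAGE_SIZE_KB} kB, "
      f"physical writes = {(1 + 200) * PAGE_SIZE_KB} kB, WAF = {waf:.0f}")
```

A single 4 kB logical update can thus trigger hundreds of kilobytes of physical writes, which is what erodes both lifespan and write performance under write-intensive workloads.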
Hadoop MapReduce: working and workload characteristics
Requirements from a block I/O scheduler in Big Data deployments
Issues with current I/O schedulers
Order | Request | LBA | Transfer size | Track no. (cylinder) | Read/write | Time to expire (ms) |
---|---|---|---|---|---|---|
1 | B1 | 7125 | 40 | 71 | W | Exp#1 |
2 | A1 | 305 | 24 | 3 | R | Exp#2 |
3 | A2 | 340 | 24 | 3 | R | Exp#3 |
4 | A3 | 370 | 24 | 3 | R | Exp#4 |
5 | C1 | 1600 | 4 | 16 | R | Exp#5 |
6 | B2 | 7165 | 40 | 71, 72 | W | 50 |
7 | B3 | 7205 | 40 | 72 | W | 53 |
8 | A4 | 410 | 24 | 4 | R | 60 |
9 | A5 | 440 | 24 | 4 | R | 65 |
10 | A6 | 470 | 24 | 4 | R | 100 |
11 | C2 | 1670 | 4 | 16 | R | 105 |
12 | B4 | 7245 | 40 | 72 | W | 110 |
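Using the requests in the table above, the sketch below contrasts the number of disk-arm (track) movements caused by dispatching requests in the order shown against dispatching each process's requests in bulk. The dispatch orders and the counting logic are simplified illustrations of the contention problem, not the exact behaviour of the kernel schedulers or of BID-HDD.

```python
# (request id, starting track) taken from the "Track no." column above; where
# a request spans two tracks (e.g. B2: 71, 72) the first track is used.
requests = {
    "B1": 71, "A1": 3, "A2": 3, "A3": 3, "C1": 16, "B2": 71,
    "B3": 72, "A4": 4, "A5": 4, "A6": 4, "C2": 16, "B4": 72,
}

def arm_movements(order):
    """Count how many dispatches force the arm onto a different track."""
    moves, current = 0, None
    for rid in order:
        track = requests[rid]
        if current is not None and track != current:
            moves += 1
        current = track
    return moves

arrival_order = ["B1", "A1", "A2", "A3", "C1", "B2",
                 "B3", "A4", "A5", "A6", "C2", "B4"]       # order in the table
bulk_order = ["B1", "B2", "B3", "B4",                      # dispatch each
              "A1", "A2", "A3", "A4", "A5", "A6",          # process's requests
              "C1", "C2"]                                   # as one bulk

print("table order   :", arm_movements(arrival_order), "track switches")
print("bulk dispatch :", arm_movements(bulk_order), "track switches")
```

Interleaving the three processes forces the arm back and forth between distant tracks (7 switches here), while bulk dispatch lets it service each process's spatially close requests before moving on (4 switches).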
Our approach: BID “Bulk I/O Dispatch”
BID-HDD: contention avoiding I/O scheduling for HDDs
-
Time stamp of the oldest request present in the queue, denoted by \(TS_{old}(SQ_p)\).
-
Time stamp of the newest request present in the queue, denoted by \(TS_{new}(SQ_p)\).
-
Wait timer for the next I/O request, denoted by \(wait(SQ_p)\).
-
Flush deadline timer, denoted by \(deadline(SQ_p)\) (a minimal data-structure sketch of these fields follows this list).
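The sketch below gathers these per-queue fields into a minimal data structure. The flush condition shown (release the staged requests when the process has gone quiet for longer than the wait timer, or when the oldest request reaches its flush deadline) is one plausible reading of the mechanism, and the timer values are assumptions, not the tuned parameters of BID-HDD.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StagingQueue:
    """Per-process staging queue SQ_p with the bookkeeping fields listed above.
    Timer values are illustrative assumptions (milliseconds)."""
    pid: int
    requests: List[dict] = field(default_factory=list)
    ts_old: Optional[float] = None   # TS_old(SQ_p): timestamp of the oldest request
    ts_new: Optional[float] = None   # TS_new(SQ_p): timestamp of the newest request
    wait: float = 5.0                # wait(SQ_p): wait timer for the next I/O request
    deadline: float = 100.0          # deadline(SQ_p): flush deadline timer

    def enqueue(self, req: dict, now: float) -> None:
        """Stage a new request and update the queue's timestamps."""
        if not self.requests:
            self.ts_old = now
        self.ts_new = now
        self.requests.append(req)

    def should_flush(self, now: float) -> bool:
        """Release the staged bulk if the process has gone quiet for longer
        than the wait timer, or if the oldest request reached its deadline."""
        if not self.requests:
            return False
        process_idle = (now - self.ts_new) >= self.wait
        past_deadline = (now - self.ts_old) >= self.deadline
        return process_idle or past_deadline
```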
BID-hybrid: contention avoidance utilizing multi-tier architecture
Experiments and performance evaluation
Testbed: emulating cloud Hadoop workloads and capturing block layer activities
Workload | I/O characteristics |
---|---|
Grep | Mostly sequential reads with small writes |
Random text writer | Mostly sequential writes, mixed with random writes and negligible reads |
Sort | More reads than writes. Large sequential reads with random writes and later sequential writes |
TeraSort | Good mix of sequential and random reads/writes. More reads than writes |
Wordcount | Mostly sequential reads, with a large number of random writes followed by random reads and small sequential writes |
Word standard deviation | Mostly sequential reads with small inter-phase writes, followed by small writes at the end |
System simulator
-
OS Simulator: This module takes the collected workload I/O traces (trace file) as input and recreates the kernel block layer functions downstream of the stage at which the traces were collected (refer to the “Testbed: emulating cloud Hadoop workloads and capturing block layer activities” section). It performs kernel block I/O operations such as: (1) building block I/O (BIO) structures from the traces; (2) enqueuing BIO request structures into the request queue (RQ), subject to RQ limitations; (3) pluggable I/O scheduling: merging, sorting, re-ordering, staging, etc., as per the scheduling scheme; (4) managing the inflow and outflow of I/O requests in the dispatch queue as per the device driver specifications; and (5) dispatching requests from the dispatch queue to the block device (a minimal sketch of this flow follows this list). The I/O scheduling sub-module is pluggable so that different scheduling schemes can be tested. The simulator provides the flexibility to configure parameters such as the data-holding size of each BIO structure, the request queue size, the dispatch queue size, and the block device driver parameters. In the case of the hybrid approach, the OS Simulator (1) makes tier-placement decisions; (2) interacts with the VFS to map the LBA entries offloaded to the SSD for future reference; and (3) spawns the process on the block layer of the appropriate SSD as per the multi-queue architecture [10]. To preserve the I/O characteristics of the workloads, requests are submitted to the kernel block layer according to the timestamps in the trace file.
-
VFS Simulator: The main function of the VFS is to locate the blocks required by applications. The VFS and the mapping layer, along with abstractions such as the logical volume manager, provide storage virtualization. This enables storage pooling, capacity utilization, and the unification of heterogeneous devices, which aids data migration and placement policies. Since the traces already contain the LBA and the targeted block device, the simulator only needs to intervene for hybrid “tier placement” decisions, where the change of target device is made on the fly. The VFS-SSD sub-module records in a table which data is stored on which block device; this aids in locating future references to already written data.
-
Storage Simulator: This module takes I/O requests from the dispatch queue of the OS Simulator and, based on the device type (HDD or SSD), returns performance metrics such as completion time, which depend on the current state of the block device. The module takes block device configuration parameters (the device driver) as input, such as drive capacity and block device type (HDD or SSD). For HDDs, the drive parameters include the geometry, number of disk heads, number of tracks (cylinders), sectors per track, rotations per minute (RPM), command processing time, settle time, average seek time, rotational latency, cylinder switch time, track-to-adjacent-track switch time, and head switch time. For SCMs (SSDs), the drive parameters include the number of pages per block, the size of each page, and the access times (read, write, erase). The Storage Simulator is CHS compliant for 48-bit LBA. For HDDs, it calculates the access time of each I/O request from the current location of the disk arm, the time needed to reach the desired new location, and the size of the data accessed. The access time also accounts for fine-grained details such as command processing time, settle time, rotational latency, cylinder (track) switch time, head switch time, and average seek time [17, 40, 41]. For SSDs, the access time depends on the SSD properties provided by the manufacturer. These configurable features give us the ability to test the schemes with different devices as well as drive architectures (a simplified sketch of this timing model follows the device-parameter table below).
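To make the OS Simulator's flow concrete, the sketch below strings the five stages listed above together for a stream of trace records: build a BIO structure, enqueue it into the request queue, let a pluggable scheduler decide what to release, stage the released requests in the dispatch queue, and hand them to a storage model. The class and function names, the trace record layout, and the queue limits are illustrative assumptions, not the simulator's actual code.

```python
from collections import deque

REQUEST_QUEUE_SIZE = 256      # BIO structures (default from the table below)
DISPATCH_QUEUE_SIZE = 32      # assumed device-driver limit

def make_bio(record: str) -> dict:
    """Stage (1): build a block I/O (BIO) structure from one trace record.
    Assumed record layout: "timestamp pid op lba size_in_sectors"."""
    ts, pid, op, lba, sectors = record.split()
    return {"ts": float(ts), "pid": int(pid), "op": op,
            "lba": int(lba), "sectors": int(sectors)}

def fifo_scheduler(request_queue: deque) -> list:
    """Trivial pluggable scheduler: release all staged BIOs in arrival order.
    BID-HDD would instead stage them per process and release them in bulk."""
    released = list(request_queue)
    request_queue.clear()
    return released

def simulate(trace_records, scheduler, storage_model):
    """Replay trace records in timestamp order through the block-layer stages."""
    request_queue, dispatch_queue = deque(), deque()
    completion_times = []
    for record in sorted(trace_records, key=lambda r: float(r.split()[0])):
        bio = make_bio(record)                                # stage (1)
        if len(request_queue) < REQUEST_QUEUE_SIZE:           # stage (2)
            request_queue.append(bio)
        for req in scheduler(request_queue):                  # stage (3)
            dispatch_queue.append(req)                        # stage (4)
            if len(dispatch_queue) == DISPATCH_QUEUE_SIZE:    # respect driver limit
                while dispatch_queue:                         # stage (5)
                    completion_times.append(storage_model(dispatch_queue.popleft()))
        while dispatch_queue:                                 # drain the remainder
            completion_times.append(storage_model(dispatch_queue.popleft()))
    return completion_times
```

The storage_model callable returns the per-request completion time (a simplified version is sketched after the device-parameter table below); evaluating an alternative scheme such as BID-HDD amounts to swapping the scheduler function.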
Performance evaluation: results and discussions
Block device | Default parameters |
---|---|
HDD | Maximum “request” structure size = 512 kB |
 | Request queue size = 256 BIO structures (128 reads, 128 writes) |
 | Max. size of each block I/O (BIO) structure = 128 × 4 kB pages |
 | 1 page (bio_vec) = 8 × 512-byte disk sectors (blocks) |
 | Access granularity (disk sector size) = 512 bytes |
 | Specifications match the 250 GB HDDs of our Hadoop cluster |
SSD | Block size = 256 pages; page size = 4 kB = access granularity |
 | Access times: read = 0.025 ms; write = 0.5 ms |
 | Specifications based on an SLC flash SSD [9] |
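To illustrate how these parameters feed the Storage Simulator's timing, the sketch below estimates per-request service times. The SSD latencies (0.025 ms per page read, 0.5 ms per page write) come from the table above; the HDD constants and the simple distance-scaled seek model are illustrative assumptions rather than the simulator's actual formula.

```python
SSD_READ_MS, SSD_WRITE_MS = 0.025, 0.5      # per-page latencies from the table

def ssd_access_time_ms(op: str, size_kb: float, page_kb: int = 4) -> float:
    """SSD service time: per-page latency times the number of 4 kB pages touched."""
    pages = max(1, -(-int(size_kb) // page_kb))           # ceiling division
    return pages * (SSD_WRITE_MS if op == "W" else SSD_READ_MS)

def hdd_access_time_ms(current_track: int, target_track: int, size_kb: float,
                       rpm: int = 7200, avg_seek_ms: float = 8.5,
                       track_switch_ms: float = 1.0, settle_ms: float = 0.1,
                       cmd_ms: float = 0.05, total_tracks: int = 10_000,
                       transfer_mb_s: float = 150.0) -> float:
    """HDD service time from the current arm position (simplified, assumed model)."""
    distance = abs(target_track - current_track)
    if distance == 0:
        seek = 0.0
    elif distance == 1:
        seek = track_switch_ms                            # adjacent-track switch
    else:                                                 # distance-scaled seek
        seek = avg_seek_ms * (distance / total_tracks) ** 0.5 + settle_ms
    rotational = 0.5 * (60_000.0 / rpm)                   # average half a revolution
    transfer = size_kb / 1024.0 / transfer_mb_s * 1000.0  # media transfer time
    return cmd_ms + seek + rotational + transfer

# Request B2 from the scheduler table: 40 sectors (about 20 kB, assuming the
# transfer-size column is in 512-byte sectors), with the arm currently on track 3.
print(f"HDD: {hdd_access_time_ms(3, 71, 20):.2f} ms, "
      f"SSD: {ssd_access_time_ms('W', 20):.2f} ms")
```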
Cumulative I/O completion time
Number of disk arm movements
Disk head movement (seek) distance
Impact on individual read/write I/Os
Related works
Block layer (I/O scheduling) | Multi-tier |
---|---|