ABSTRACT
It is challenging to simultaneously achieve multicore scalability and high disk throughput in a file system. For example, even for commutative operations like creating different files in the same directory, current file systems introduce cache-line conflicts when updating an in-memory copy of the on-disk directory block, which limits scalability.
ScaleFS is a novel file system design that decouples the in-memory file system from the on-disk file system using per-core operation logs. This design facilitates the use of highly concurrent data structures for the in-memory representation, which allows commutative operations to proceed without cache conflicts and hence scale perfectly. ScaleFS logs operations in a per-core log so that it can delay propagating updates to the disk representation (and the cache-line conflicts involved in doing so) until an fsync. The fsync call merges the per-core logs and applies the operations to disk. ScaleFS uses several techniques to perform the merge correctly while achieving good performance: timestamped linearization points to order updates without introducing cache-line conflicts, absorption of logged operations, and dependency tracking across operations.
Experiments with a prototype of ScaleFS show that its implementation has no cache conflicts for 99% of test cases of commutative operations generated by Commuter, scales well on an 80-core machine, and provides on-disk performance that is comparable to that of Linux ext4.
Supplemental Material
- H. Attiya, E. Hillel, and A. Milani. Inherent limitations on disjoint-access parallel implementations of transactional memory. In Proceedings of the 21st Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 69--78, Calgary, Canada, Aug. 2009. Google ScholarDigital Library
- D. J. Bernstein. qmail internals, 1998. http//:www.qmail.org/man/misc/INTERNALS.txt.Google Scholar
- S. S. Bhat. Designing multicore scalable filesystems with durability and crash consistency. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, June 2017.Google Scholar
- S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), pages 1--16, Vancouver, Canada, Oct. 2010. Google ScholarDigital Library
- S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, Sept. 2014.Google Scholar
- M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI), pages 173--186, New Orleans, LA, Feb. 1999. Google ScholarDigital Library
- V. Chidambaram, T. Sharma, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Consistency without ordering. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST), pages 101--116, San Jose, CA, Feb. 2012. Google ScholarDigital Library
- S. Chutani, O. T. Anderson, M. L. Kazar, B. W. Leverett, W. A. Mason, and R. N. Sidebotham. The Episode file system. In Proceedings of the Winter 1992 USENIX Technical Conference, pages 43--59, Jan. 1992.Google Scholar
- A. T. Clements, M. F. Kaashoek, and N. Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In Proceedings of the 8th ACM EuroSys Conference, pages 211--224, Prague, Czech Republic, Apr. 2013. Google ScholarDigital Library
- A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), pages 1--17, Farmington, PA, Nov. 2013. Google ScholarDigital Library
- J. Corbet. Dcache scalability and RCU-walk, Apr. 2012. http://lwn.net/Articles/419811/.Google Scholar
- R. Cox, M. F. Kaashoek, and R. T. Morris. Xv6, a simple Unix-like teaching operating system, 2016. http://pdos.csail.mit.edu/6.828/xv6.Google Scholar
- M. Curtis-Maury, V. Devadas, V. Fang, and A. Kulkarni. To Waffinity and beyond: A scalable architecture for incremental parallelization of file system code. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI), pages 419--434, Savannah, GA, Nov. 2016. Google ScholarDigital Library
- S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System software for persistent memory. In Proceedings of the 9th ACM EuroSys Conference, Amsterdam, The Netherlands, Apr. 2014. Google ScholarDigital Library
- C. Frost, M. Mammarella, E. Kohler, A. de los Reyes, S. Hovsepian, A. Matsuoka, and L. Zhang. Generalized file system dependencies. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pages 307--320, Stevenson, WA, Oct. 2007. Google ScholarDigital Library
- G. R. Ganger and Y. N. Patt. Metadata update performance in file systems. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation (OSDI), pages 49--60, Monterey, CA, Nov. 1994. Google ScholarDigital Library
- C. Gruenwald, III, F. Sironi, M. F. Kaashoek, and N. Zeldovich. Hare: a file system for non-cache-coherent multicores. In Proceedings of the 10th ACM EuroSys Conference, Bordeaux, France, Apr. 2015. Google ScholarDigital Library
- R. Hagmann. Reimplementing the Cedar file system using logging and group commit. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP), pages 155--162, Austin, TX, Nov. 1987. Google ScholarDigital Library
- M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages Systems, 12(3):463--492, 1990. Google ScholarDigital Library
- A. Israeli and L. Rappoport. Disjoint-access-parallel implementations of strong shared memory primitives. In Proceedings of the 13th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Los Angeles, CA, Aug. 1994. Google ScholarDigital Library
- M. Jambor, T. Hruby, J. Taus, K. Krchak, and V. Holub. Implementation of a Linux log-structured file system with a garbage collector. ACM SIGOPS Operating Systems Review, 41(1):24--32, Jan. 2007. Google ScholarDigital Library
- J. Kang, B. Zhang, T. Wo, W. Yu, L. Du, S. Ma, and J. Huai. SpanFS: A scalable file system on fast storage devices. In Proceedings of the 2015 USENIX Annual Technical Conference, Santa Clara, CA, July 2015. Google ScholarDigital Library
- Y. Klonatos, M. Marazakis, and A. Bilas. A scaling analysis of Linux I/O performance. Poster presented at EuroSys, 2011. http://eurosys2011.cs.uni-salzburg.at/pdf/postersubmission/eurosys11-poster-klonatos.pdf.Google Scholar
- C. Lameter. Effective synchronization on Linux/NUMA systems. In Gelato Conference, May 2005. http//:www.lameter.com/gelato2005.pdf.Google Scholar
- E. Lee, H. Bahn, and S. H. Noh. Unioning of the buffer cache and journaling layers with non-volatile memory. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST), pages 73--80, San Jose, CA, Feb. 2013. Google ScholarDigital Library
- L. Lu, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and S. Lu. A study of Linux file system evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST), pages 31--44, San Jose, CA, Feb. 2013. Google ScholarDigital Library
- L. Lu, Y. Zhang, T. Do, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Physical disentanglement in a container-based file system. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 81--96, Broomfield, CO, Oct. 2014. Google ScholarDigital Library
- Y. Lu, J. Shu, and W. Wang. ReconFS: A reconstructable file system on flash storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST), pages 75--88, Santa Clara, CA, Feb. 2014. Google ScholarDigital Library
- A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, pages 21--34, Ottawa, Canada, June 2007.Google Scholar
- P. E. McKenney, D. Sarma, and M. Soni. Scaling dcache with RCU. Linux Journal, 2004(117), Jan. 2004. Google ScholarDigital Library
- C. Min, S. Kashyap, S. Maass, and T. Kim. Understanding manycore scalability of file systems. In Proceedings of the 2016 USENIX Annual Technical Conference, Denver, CO, June 2016. Google ScholarDigital Library
- D. Park and D. Shin. iJournaling: Fine-grained journaling for improving the latency of fsync system call. In Proceedings of the 2017 USENIX Annual Technical Conference, pages 787--798, Santa Clara, CA, July 2017. Google ScholarDigital Library
- S. Park, T. Kelly, and K. Shen. Failure-atomic msync(): A simple and efficient mechanism for preserving the integrity of durable data. In Proceedings of the 8th ACM EuroSys Conference, pages 225--238, Prague, Czech Republic, Apr. 2013. Google ScholarDigital Library
- T. S. Pillai, V. Chidambaram, R. Alagappan, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. All file systems are not created equal: On the complexity of crafting crash-consistent applications. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 433--448, Broomfield, CO, Oct. 2014. Google ScholarDigital Library
- T. S. Pillai, R. Alagappana, L. Lu, V. Chidambaram, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Application crash consistency and performance with CCFS. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST), pages 181--196, Santa Clara, CA, Feb.-Mar. 2017. Google ScholarDigital Library
- K. Ren and G. Gibson. TABLEFS: Enhancing metadata efficiency in the local file system. In Proceedings of the 2013 USENIX Annual Technical Conference, pages 145--156, San Jose, CA, June 2013. Google ScholarDigital Library
- D. M. Ritchie and K. Thompson. The UNIX time-sharing system. Communications of the ACM, 17(7):365--375, July 1974. Google ScholarDigital Library
- O. Rodeh, J. Bacik, and C. Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage, 9(3): 9:1--32, Aug. 2013. Google ScholarDigital Library
- M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP), pages 1--15, Pacific Grove, CA, Oct. 1991. Google ScholarDigital Library
- A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Annual Technical Conference, San Diego, CA, Jan. 1996. Google ScholarDigital Library
- A. Tridgell and R. Sahlberg. DBENCH, 2013. https://dbench.samba.org/.Google Scholar
- J. Xu and S. Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST), pages 323--338, Santa Clara, CA, Feb. 2016. Google ScholarDigital Library
- J. Yang, P. Twohey, D. Engler, and M. Musuvathi. eXplode: A lightweight, general system for finding serious storage system errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), pages 131--146, Seattle, WA, Nov. 2006. Google ScholarDigital Library
Recommendations
The design and implementation of a log-structured file system
This paper presents a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash ...
The design and implementation of a log-structured file system
SOSP '91: Proceedings of the thirteenth ACM symposium on Operating systems principlesThis paper presents a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash ...
Comments