Abstract
For five years, we collected annual snapshots of file-system metadata from over 60,000 Windows PC file systems in a large corporation. In this article, we use these snapshots to study temporal changes in file size, file age, file-type frequency, directory size, namespace structure, file-system population, storage capacity and consumption, and degree of file modification. We present a generative model that explains the namespace structure and the distribution of directory sizes. We find significant temporal trends relating to the popularity of certain file types, the origin of file content, the way the namespace is used, and the degree of variation among file systems, as well as more pedestrian changes in size and capacities. We give examples of consequent lessons for designers of file systems and related software.