skip to main content
research-article

Write Skew and Zipf Distribution: Evidence and Implications

Published:08 June 2016Publication History
Skip Abstract Section

Abstract

Understanding workload characteristics is essential to storage systems design and performance optimization. With the emergence of flash memory as a new viable storage medium, the new design concern of flash endurance arises, necessitating a revisit of workload characteristics, in particular, of the write behavior. Inspired by Web caching studies where a Zipf-like access pattern is commonly found, we hypothesize that write count distribution at the block level may also follow Zipf’s Law. To validate this hypothesis, we study 48 block I/O traces collected from a wide variety of real and benchmark applications. Through extensive analysis, we demonstrate that the Zipf-like pattern indeed widely exists in write traffic provided its disguises are removed by statistical processing. This finding implies that write skew in a large class of applications could be analytically expressed and, thus, facilitates design tradeoff explorations adaptive to workload characteristics.

References

  1. Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. 2008. Design tradeoffs for SSD performance. In Proceedings of the USENIX 2008 Annual Technical Conference on Annual Technical Conference (ATC’08). USENIX Association, 57--70. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Martin F. Arlitt and Carey L. Williamson. 1997. Internet web servers: Workload characterization and performance implications. In IEEE/ACM Transactions on Networking. IEEE Press, 631--645. DOI:http://dx.doi.org/10.1109/90.649565 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Axboe. 2014. FIO (Flexible IO Tester). Retrieved from http://git.kernel.dk/?p=fio.git;a=summary.Google ScholarGoogle Scholar
  4. Bernd Blasius and Ralf Tönjes. 2009. Zipf’s law in the popularity distribution of chess openings. Physics Review Letters 103, 21 (Nov. 2009), 218701. DOI:http://dx.doi.org/10.1103/PhysRevLett.103.218701Google ScholarGoogle ScholarCross RefCross Ref
  5. L. Breslau, Pei Cao, Li Fan, G. Phillips, and S. Shenker. 1999. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the 18th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings (INFOCOM’99), Vol. 1. IEEE, 126--134. DOI:http://dx.doi.org/10.1109/INFCOM.1999.749260Google ScholarGoogle Scholar
  6. Werner Bux and Ilias Iliadis. 2010. Performance of greedy garbage collection in flash-based solid-state drives. In Performance Evaluation, Vol. 67. Elsevier Science Publishers B. V., 1172--1186. DOI:http://dx.doi.org/10.1016/j.peva.2010.07.003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. 2007. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC’07). ACM, 1--14. DOI:http://dx.doi.org/10.1145/1298306.1298309 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Li-Pin Chang and Chun-Da Du. 2009. Design and implementation of an efficient wear-leveling algorithm for solid-state-disk microcontrollers. ACM Transactions on Design Automation of Electronic Systems 15, 1, Article 6 (Dec. 2009), 36 pages. DOI:http://dx.doi.org/10.1145/1640457.1640463 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Feng Chen, Tian Luo, and Xiaodong Zhang. 2011. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proceedings of the 9th USENIX Conference on File and Stroage Technologies (FAST’11). USENIX Association, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Maureen Chesire, Alec Wolman, Geoffrey M. Voelker, and Henry M. Levy. 2001. Measurement and analysis of a streaming-media workload. In Proceedings of the 3rd Conference on USENIX Symposium on Internet Technologies and Systems - Volume 3 (USITS’01). USENIX Association, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Edward Chlebus. 2009. An approximate formula for a partial sum of the divergent p-series. In Applied Mathematics Letters. Elsevier Ltd, 732--737. DOI:http://dx.doi.org/10.1016/j.aml.2008.07.007Google ScholarGoogle Scholar
  12. Pasquale Cirillo. 2013. Are your data really Pareto distributed? Physica A: Statistical Mechanics and Its Applications (2013). DOI:http://dx.doi.org/10.1016/j.physa.2013.07.061Google ScholarGoogle Scholar
  13. Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law distributions in empirical data. In SIAM Review. Society for Industrial and Applied Mathematics, 661--703. DOI:http://dx.doi.org/10.1137/070710111 Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Carlos R. Cunha, Azer Bestavros, and Mark E. Crovella. 1995. Characteristics of WWW Client-based Traces. Technical Report BU-CS-95-010. Computer Science Department, Boston University, oston, MA 02215. Google ScholarGoogle Scholar
  15. Peter Desnoyers. 2012. Analytic modeling of SSD write performance. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR’12). ACM, Article 12, 10 pages. DOI:http://dx.doi.org/10.1145/2367589.2367603 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Peter Desnoyers. 2014. Analytic models of SSD write performance. Transactions on Storage 10, Article 8 (March 2014), 25 pages. DOI:http://dx.doi.org/10.1145/2577384 Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. ETW. 2012. ETW: Event Tracing for Windows. Retrieved from http://msdn.microsoft.com/en-us/library/bb968803%28VS.85%29.aspx.Google ScholarGoogle Scholar
  18. Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. 2007. Youtube traffic characterization: A view from the edge. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC’07). ACM, 15--28. DOI:http://dx.doi.org/10.1145/1298306.1298310 Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang. 2008. The stretched exponential distribution of internet media access patterns. In Proceedings of the 27th ACM Symposium on Principles of Distributed Computing (PODC’08). ACM, 283--294. DOI:http://dx.doi.org/10.1145/1400751.1400789 Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Aayush Gupta, Raghav Pisolkar, Bhuvan Urgaonkar, and Anand Sivasubramaniam. 2011. Leveraging value locality in optimizing NAND flash-based SSDs. In Proceedings of the 9th USENIX Conference on File and Stroage Technologies (FAST’11). USENIX Association, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Jen-Wei Hsieh, Tei-Wei Kuo, and Li-Pin Chang. 2006. Efficient identification of hot data for flash memory storage systems. In Transactions on Storage. ACM, 22--40. DOI:http://dx.doi.org/10.1145/1138041.1138043 Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Kavalanekar, V. Sharda, B. L. Worthington, and Q. Zhang. 2008. Characterization of storage workload traces from production windows servers. In Proceedings of the 4th International Symposium on Workload Characterization (IISWC’08). IEEE, New York, NY.Google ScholarGoogle Scholar
  23. Isao Kotera, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi. 2008. Modeling of cache access behavior based on Zipf’s law. In Proceedings of the 9th Workshop on Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA’08). ACM, 9--15. DOI:http://dx.doi.org/10.1145/1509084.1509086 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Jongsung Lee and Jin-Soo Kim. 2013. An empirical study of hot/cold data separation policies in solid state drives (SSDs). In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). ACM, Article 12, 6 pages. DOI:http://dx.doi.org/10.1145/2485732.2485745 Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Microsoft News Centre. 2013. The Big Bang: How the Big Data Explosion Is Changing the World. Retrieved from http://www.microsoft.com/en-us/news/features/2013/feb13/02-11bigdata.aspx.Google ScholarGoogle Scholar
  26. Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. 2008. Write off-loading: Practical power management for enterprise storage. In Transactions on Storage. ACM, Article 10, 23 pages. DOI:http://dx.doi.org/10.1145/1416944.1416949 Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Storage Networking Industry Association. 2011. IOTTA Repository. Retrieved from http://iotta.snia.org/.Google ScholarGoogle Scholar
  28. Mark E. J. Newman. 2005. Power laws, pareto distributions and Zipf’s law. Contemporary Physics 46 (2005), 323--351. http://arxiv.org/abs/cond-mat/0412004.Google ScholarGoogle ScholarCross RefCross Ref
  29. Chanik Park, Wonmoon Cheon, Jeonguk Kang, Kangho Roh, Wonhee Cho, and Jin-Soo Kim. 2008. A reconfigurable FTL (flash translation layer) architecture for NAND flash-based applications. ACM Transactions in Embedded Computing Systems 7, 4, Article 38 (Aug. 2008), 23 pages. DOI:http://dx.doi.org/10.1145/1376804.1376806 Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Dongchul Park and David H. C. Du. 2011. Hot data identification for flash-based storage systems using multiple bloom filters. In Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST’11). IEEE Computer Society, 1--11. DOI:http://dx.doi.org/10.1109/MSST.2011.5937216 Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Alma Riska and Erik Riedel. 2006. Disk drive level workload characterization. In Proceedings of the Annual Conference on USENIX’06 Annual Technical Conference (ATEC’06). USENIX Association, 9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Storage Performance Council. 2002. OLTP Application I/O. Retrieved from http://traces.cs.umass.edu/index.php/Storage/Storage.Google ScholarGoogle Scholar
  33. Storage Performance Council. 2009. SPC benchmark 2C™, (SPC-2C) official specification. Retrieved from http://www.storageperformance.org/specs/spc2c_v1.2.pdf.Google ScholarGoogle Scholar
  34. Wenting Tang, Yun Fu, Ludmila Cherkasova, and Amin Vahdat. 2003. MediSyn: A synthetic streaming media service workload generator. In Proceedings of the 13th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV’03). ACM, 12--21. DOI:http://dx.doi.org/10.1145/776322.776327 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Vernon Turner, Stephen Minton, Vernon Turner, and David Reinsel. 2014. The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. Retrieved from http://idcdocserv.com/1678.Google ScholarGoogle Scholar
  36. Benny Van Houdt. 2013. A mean field model for a class of garbage collection algorithms in flash-based solid state drives. In Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’13). ACM, 191--202. DOI:http://dx.doi.org/10.1145/2465529.2465543 Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Adepele Williams, Arlitt Martin, Carey Williamson, and Barker Ken. 2005. Web server workload characterization: Ten years later. In Web Information Systems Engineering and Internet Technologies. Springer, 3--21.Google ScholarGoogle Scholar
  38. Yue Yang and Jianwen Zhu. 2014. Analytical modeling of garbage collection algorithms in hotness-aware flash-based solid state drives. In Proceedings of the 30th International Conference on Massive Storage Systems and Technology (MSST’14). IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  39. Hongliang Yu, Dongdong Zheng, Ben Y. Zhao, and Weimin Zheng. 2006. Understanding user behavior in large-scale video-on-demand systems. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006 (EuroSys’06). ACM, 333--344. DOI:http://dx.doi.org/10.1145/1217935.1217968 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. George Kingsley Zipf. 1950. Human behavior and the principle of least effort. cambridge, (mass.): Addison-Wesley, 1949, pp. 573. Journal of Clinical Psychology 6, 3 (1950), 394--401. DOI:http://dx.doi.org/10.1002/1097-4679(195007)6:3<306::AID-JCLP2270060331>3.0.CO;2-7Google ScholarGoogle Scholar

Index Terms

  1. Write Skew and Zipf Distribution: Evidence and Implications

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Storage
        ACM Transactions on Storage  Volume 12, Issue 4
        August 2016
        213 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/2940403
        Issue’s Table of Contents

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 June 2016
        • Accepted: 1 January 2016
        • Revised: 1 October 2015
        • Received: 1 November 2014
        Published in tos Volume 12, Issue 4

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader