ABSTRACT
The data structures under-pinning collection API (e.g. lists, sets, maps) in the standard libraries of programming languages are used intensively in many applications. The standard libraries of recent Java Virtual Machine languages, such as Clojure or Scala, contain scalable and well-performing immutable collection data structures that are implemented as Hash-Array Mapped Tries (HAMTs). HAMTs already feature efficient lookup, insert, and delete operations, however due to their tree-based nature their memory footprints and the runtime performance of iteration and equality checking lag behind array-based counterparts. This particularly prohibits their application in programs which process larger data sets. In this paper, we propose changes to the HAMT design that increase the overall performance of immutable sets and maps. The resulting general purpose design increases cache locality and features a canonical representation. It outperforms Scala’s and Clojure’s data structure implementations in terms of memory footprint and runtime efficiency of iteration (1.3–6.7x) and equality checking (3–25.4x).
Supplemental Material
Available for Download
Artifact that was submitted to the OOPSLA'15 Artifact Evaluation. Contains the source code of the benchmarks and the results from our experiments.
- A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986. Google ScholarDigital Library
- P. Bagwell. Fast And Space Efficient Trie Searches. Technical Report LAMP-REPORT-2000-001, Ecole polytechnique fédérale de Lausanne, 2000.Google Scholar
- P. Bagwell. Ideal Hash Trees. Technical Report LAMPREPORT-2001-001, Ecole polytechnique fédérale de Lausanne, 2001.Google Scholar
- P. Bagwell. Fast Functional Lists, Hash-Lists, Deques, and Variable Length Arrays. Technical Report LAMP-REPORT- 2002-003, Ecole polytechnique fédérale de Lausanne, 2002.Google Scholar
- P. Bagwell and T. Rompf. RRB-Trees: Efficient Immutable Vectors. Technical Report EPFL-REPORT-169879, Ecole polytechnique fédérale de Lausanne, 2011.Google Scholar
- A. Biboudis, N. Palladinos, G. Fourtounis, and Y. Smaragdakis. Streams a la carte: Extensible Pipelines with Object Algebras. In ECOOP ’15: Proceedings of the 29th European conference on Object-Oriented Programming. Schloss Dagstuhl, 2015.Google Scholar
- T. J. Biggerstaff. A perspective of generative reuse. Annals of Software Engineering, 5(1):169–226, 1998. ISSN 1022-7091. Google ScholarDigital Library
- K. D. Cooper, T. J. Harvey, and K. Kennedy. A Simple, Fast Dominance Algorithm. Technical Report TR-06-33870, Rice University, 2006.Google Scholar
- K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and Applications. ACM Press, 2000. Google ScholarDigital Library
- R. De La Briandais. File Searching Using Variable Length Keys. In Papers presented at the the March 3-5, 1959, western joint computer conference, pages 295–298. ACM Press, 1959. Google ScholarDigital Library
- J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. In Proceedings of the 18th annual ACM symposium on Theory of computing. ACM, 1986. Google ScholarDigital Library
- J. Ebert, D. Bildhauer, H. Schwarz, and V. Riediger. Using difference information to reuse software cases. Softwaretechnik-Trends, 2007.Google Scholar
- E. Fredkin. Trie Memory. Communications of the ACM, 3(9): 490–499, 1960. Google ScholarDigital Library
- J. Gil and Y. Shimron. Smaller Footprint for Java Collections. In ECOOP ’12: Proceedings of the 26th European conference on Object-Oriented Programming. Springer-Verlag, 2012. Google ScholarDigital Library
- E. Hajiyev, M. Verbaere, and O. de Moor. Codequest: Scalable source code queries with datalog. In ECOOP ’06: Proceedings of the 20th European Conference on Object-Oriented Programming. Springer-Verlag, 2006. Google ScholarDigital Library
- M. Hills and P. Klint. PHP AiR: Analyzing PHP systems with Rascal. In Proceedings of IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering. IEEE, 2014.Google ScholarCross Ref
- A. Igarashi and M. Viroli. On Variance-Based Subtyping for Parametric Types. In ECOOP ’02: Proceedings of the 16th European conference on Object-Oriented Programming. Springer-Verlag, 2002. Google ScholarDigital Library
- P. Klint, T. van der Storm, and J. Vinju. Rascal: A Domain Specific Language for Source Code Analysis and Manipulation. In Proceedings of Ninth IEEE International Working Conference on Source Code Analysis and Manipulation. IEEE, 2009. Google ScholarDigital Library
- R. E. Ladner, R. Fortna, and B.-H. Nguyen. A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation. In Experimental Algorithmics. Springer-Verlag, 2002. Google ScholarDigital Library
- D. McIlroy. Mass-Produced Software Components. In P. Naur and B. Randell, editors, Proceedings of NATO Software Engineering Conference, pages 138–155. NATO Scientific Affairs Division, 1968.Google Scholar
- N. Mitchell and G. Sevitsky. The causes of bloat, the limits of health. In OOPSLA ’07: Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications. ACM, 2007. Google ScholarDigital Library
- C. Okasaki. Purely Functional Data Structures. Cambridge University Press, 1999. Google ScholarDigital Library
- R. Olsson and S. Nilsson. TRASH A dynamic LC-trie and hash data structure. In High Performance Switching and Routing, 2007. HPSR ’07. Workshop on. IEEE, 2007.Google ScholarCross Ref
- A. Prokopec, N. G. Bronson, P. Bagwell, and M. Odersky. Concurrent tries with efficient non-blocking snapshots. In PPoPP ’12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming. ACM, 2012. Google ScholarDigital Library
- P. Rademaker. Binary relational querying for structural source code analysis. Universiteit Utrecht. Master’s thesis, 2008.Google Scholar
- I. Ramakrishnan, P. Rao, K. Sagonas, T. Swift, and D. S. Warren. Efficient Tabling Mechanisms for Logic Programs. In Proceedings of the 12th International Conference on Logic Programming. Elsevier, 1995.Google Scholar
- N. Sarnak and R. E. Tarjan. Planar point location using persistent search trees. Communications of the ACM, 29(7): 669–679, 1986. Google ScholarDigital Library
- N. Shavit and D. Touitou. Software transactional memory. ACM Press, 1995.Google ScholarDigital Library
- M. J. Steindorfer and J. J. Vinju. Code Specialization for Memory Efficient Hash Tries (Short Paper). In GPCE ’14: Proceedings of Generative Programming Concepts & Experiences. ACM, 2014. Google ScholarDigital Library
- N. Stucki, T. Rompf, V. Ureche, and P. Bagwell. RRB Vector: A Practical General Purpose Immutable Sequence. In ICFP ’15: Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. ACM, 2015. Google ScholarDigital Library
- V. Ureche, C. Talau, and M. Odersky. Miniboxing: improving the speed to code size tradeoff in parametric polymorphism translations. In OOPSLA ’13: Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications. ACM, 2013. Google ScholarDigital Library
- A. Wöß, C. Wirth, D. Bonetta, C. Seaton, C. Humer, and H. Mössenböck. An object storage model for the truffle language implementation framework. In PPPJ ’14: Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java platform. ACM, 2014. Google ScholarCross Ref
Index Terms
- Optimizing hash-array mapped tries for fast and lean immutable JVM collections
Recommendations
Optimizing hash-array mapped tries for fast and lean immutable JVM collections
OOPSLA '15The data structures under-pinning collection API (e.g. lists, sets, maps) in the standard libraries of programming languages are used intensively in many applications. The standard libraries of recent Java Virtual Machine languages, such as Clojure or ...
Towards a software product line of trie-based collections
GPCE 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Generative Programming: Concepts and ExperiencesCollection data structures in standard libraries of programming languages are designed to excel for the average case by carefully balancing memory footprint and runtime performance. These implicit design decisions and hard-coded trade-offs do constrain ...
Code specialization for memory efficient hash tries (short paper)
GPCE 2014: Proceedings of the 2014 International Conference on Generative Programming: Concepts and ExperiencesThe hash trie data structure is a common part in standard collection libraries of JVM programming languages such as Clojure and Scala. It enables fast immutable implementations of maps, sets, and vectors, but it requires considerably more memory than ...
Comments