ABSTRACT
In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory.
Our general algorithm has a worst-case IO-complexity of O(Sort(|N| + |E|)), where |N| and |E| are the numbers of nodes and edges, resp., in the data graph and Sort(n) is the number of accesses to external memory needed to sort an input of size n. We also study specializations of this algorithm to common variations of bisimulation for tree-structured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of external-memory bisimulation algorithms. To this end, the full open-source C++ implementation has been made freely available.
- S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann, San Francisco, 2000. Google ScholarDigital Library
- A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Communications of the ACM, 31(9):1116--1127, 1988. Google ScholarDigital Library
- D. Ajwani, A. Cosgaya-Lozano, and N. Zeh. Engineering a topological sorting algorithm for massive graphs. Proc. Algorithm Engineering and Experimentation (ALENEX), 2011.Google ScholarCross Ref
- L. Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1--24, 2003.Google ScholarDigital Library
- L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter. On sorting strings in external memory (extended abstract). In Proc. ACM Symp. on Theory of Computation (STOC), pages 540--548, 1997. Google ScholarDigital Library
- M. Ben-Ari, T. Milo, and E. Verbin. Querying DAG-shaped execution traces through views. In WebDB, Providence, RI, USA, 2009.Google Scholar
- P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In VLDB, pages 141--152, Berlin, 2003. Google ScholarDigital Library
- Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proc. Symp. on Discrete Algorithms (SODA), pages 139--149, 1995. Google ScholarDigital Library
- V. Christophides, G. Karvounarakis, D. Plexousakis, M. Scholl, and S. Tourtounis. Optimizing taxonomic semantic web queries using labeling schemes. J. Web Sem., 1(2):207--228, 2004.Google ScholarCross Ref
- R. Dementiev, L. Kettner, and P. Sanders. STXXL: standard template library for XXL data sets. Software: Practice and Experience, 38(6):589--637, 2008. Google ScholarDigital Library
- J. Deng, B. Choi, J. Xu, and S. S. Bhowmick. Optimizing incremental maintenance of minimal bisimulation of cyclic graphs. In DASFAA, pages 543--557, Hong Kong, 2011. Google ScholarDigital Library
- D. Deutch and T. Milo. A quest for beauty and wealth (or, business processes for database researchers). In Proc. ACM Symp. on Principles of Database Systems (PODS), pages 1--12, Athens, Greece, 2011. Google ScholarDigital Library
- A. Dovier, C. Piazza, and A. Policriti. An efficient algorithm for computing bisimulation equivalence. Theor. Comput. Sci., 311(1--3):221--256, 2004. Google ScholarDigital Library
- G. H. L. Fletcher, D. Van Gucht, Y. Wu, M. Gyssens, S. Brenes, and J. Paredaens. A methodology for coupling fragments of XPath with structural indexes for XML documents. Inf. Syst., 34(7):657--670, 2009. Google ScholarDigital Library
- N. Grimsmo, T. A. Bjørklund, and M. L. Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees. In XSym, pages 18--32, Singapore, 2010. Google ScholarDigital Library
- J. Hellings. Bisimulation partitioning and partition maintenance on very large directed acyclic graphs. Master's thesis, Eindhoven Univ. of Technology, 2011.Google Scholar
- R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD International Conference on Management of Data, pages 133--144, Madison, Wisconsin, 2002. Google ScholarDigital Library
- R. Kaushik, P. Bohannon, J. F. Naughton, and P. Shenoy. Updates for structure indexes. In VLDB, pages 239--250, Hong Kong, 2002. Google ScholarDigital Library
- R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. IEEE Int. Conf. on Data Engineering (ICDE), pages 129--140, 2002. Google ScholarDigital Library
- U. Meyer, P. Sanders, and J. Sibeyn, editors. Algorithms for memory hierarchies: advanced lectures. Springer-Verlag, Berlin, Heidelberg, 2003. Google ScholarDigital Library
- T. Milo and D. Suciu. Index structures for path expressions. In Proc. Int. Conf. on Database Theory (ICDT), pages 277--295, Jerusalem, 1999. Google ScholarDigital Library
- L. Moreau. The foundations for provenance on the web. Foundations and Trends in Web Science, 2(2--3):99--41, 2010. Google ScholarDigital Library
- K.-K. Muniswamy-Reddy and M. Seltzer. Provenance as first class cloud data. SIGOPS Oper. Syst. Rev., 43:11--16, 2010. Google ScholarDigital Library
- R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 16(6):973--989, 1987. Google ScholarDigital Library
- G. Palla, I. J. Farkas, P. Pollner, I. Derényi, and T. Vicsek. Fundamental statistical features and self-similar properties of tagged networks. New Journal of Physics, 10(12):123026, 2008.Google ScholarCross Ref
- D. Saha. An incremental bisimulation algorithm. In Proc. Found. of Software Technology and Theoretical Computer Science (FSTTCS), pages 204--215, New Delhi, India, 2007. Google ScholarDigital Library
- D. Sangiorgi. On the origins of bisimulation and coinduction. ACM Trans. Program. Lang. Syst., 31:15:1--15:41, 2009. Google ScholarDigital Library
- B. Smith et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25:1251--1255, 2007.Google ScholarCross Ref
- W. Wang, H. Jiang, H. Wang, X. Lin, H. Lu, and J. Li. Efficient processing of XML path queries using the disk-based F&B index. In VLDB, pages 145--156, Trondheim, Norway, 2005. Google ScholarDigital Library
- J. Yu and R. Buyya. A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3:171--200, 2005.Google ScholarCross Ref
Index Terms
- Efficient external-memory bisimulation on DAGs
Recommendations
Bisimulation on speed: a unified approach
Two process-algebraic approaches have been developed for comparing two bisimulation-equivalent processes with respect to speed: the one of Moller/Tofts equips actions with lower time bounds, while the other by Lüttgen/Vogler considers upper time bounds ...
Bisimulation relations for weighted automata
Bisimulation is a well known equivalence relation for discrete event systems and has been extended to probabilistic and stochastic systems. This paper introduces a general definition of bisimulation which can be applied to finite automata where weights ...
RAM-Efficient External Memory Sorting
In recent years a large number of problems have been considered in external memory models of computation, where the complexity measure is the number of blocks of data that are moved between slow external memory and fast internal memory (also called I/Os)...
Comments