Abstract
Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open - and close-tags are unmatched. Since some semi-structured data (e.g., Latex) have a flexible grammar and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance.
To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program could be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.
- A. V. Aho, T. G. Peterson: A Minimum Distance Error-Correcting Parser for Context-Free Languages. SIAM J. Comput. 1(4): 305-312 (1972).Google Scholar
- G. Beskales, I. F. Ilyas and L. Golab: Sampling the Repairs of Functional Dependency Violations under Hard Constraints. PVLDB 3(1): 197-207 (2010). Google Scholar
- G. J. Bex, F. Neven, T. Schwentick, K. Tuyls: Inference of Concise DTDs from XML Data, VLDB 2006: 115-126. Google Scholar
- G. J. Bex, F. Neven, S. Vansummeren: Inferring XML Schema Definitions from XML Data, VLDB 2007: 998-1009. Google Scholar
- U. Boobna and M. de Rougemont: Correctors for XML Data. XSym 2004: 97-111.Google Scholar
- P. Bohannon, M. Flaster, W. Fan and R. Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD 2005: 143-154. Google Scholar
- B. Choi: What are real DTDs like? WebDB 2002: 43-48.Google Scholar
- G. Cobena, S. Abiteboul, A. Marian: Detecting Changes in XML Documents. ICDE 2002: 41-52. Google Scholar
- T. Dalamagas, T. Cheng, K. Winkel, T. K. Sellis: A Methodology for Clustering XML Documents by Structure. Inf. Syst. 31(3): 187-228 (2006). Google Scholar
- W. Fan, J. Li, S. Ma, N. Tang and W. Yu: Interaction between record matching and data repairing. SIGMOD 2011: 469-480. Google Scholar
- M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim: DTD Inference from XML Documents: The XTRACT Approach. IEEE Data Eng. Bull. 26(3): 19-25 (2003).Google Scholar
- E. M. Gold: Language Identification in the Limit, Information and Control 10(5): 447-474 (1967).Google Scholar
- S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, T. Yu: Approximate XML Joins. SIGMOD 2002: 287-298. Google Scholar
- S. A. Greibach: The Hardest Context-Free Language, SIAM J. Comput. 2(4): 304-310 (1973).Google Scholar
- S. Grijzenhout and M. Marx: The quality of the XML web. CIKM 2011: 1719-1724. Google Scholar
- L. Lee: Fast context-free grammar parsing requires fast boolean matrix multiplication, J. ACM 49(1): 1-15 (2002). Google Scholar
- V. I. Levenshtein: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707-710 (1966).Google Scholar
- F. Magniez, C. Mathieu and A. Nayak, Recognizing well-parenthesized expressions in the streaming model, STOC 2010: 261-270. Google Scholar
- G. Myers: Approximately Matching Context-Free Languages. Inf. Process. Lett. 54(2): 85-92 (1995). Google Scholar
- G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88 (2001). Google Scholar
- M. Pawlik, N. Augsten: RTED: A Robust Algorithm for the Tree Edit Distance. PVLDB 5(4): 334-345 (2011). Google Scholar
- H. Samimi, M. Schaefer, S. Artzi, T. Millstein, F. Tip and L. Hendren: Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving. Int'l Conf. Software Engineering 2012. Google Scholar
- L. Segoufin and C. Sirangelo: Constant-Memory Validation of Streaming XML Documents Against DTDs. ICDT 2007: 299-313. Google Scholar
- L. Segoufin and V. Vianu: Validating Streaming XML Documents. PODS 2002: 53-64. Google Scholar
- S. Staworko and J. Chomicki: Validity-Sensitive Querying of XML Databases. EDBT Workshops 2006: 164-177. Google Scholar
- N. Suzuki: Finding an optimum edit script between an XML document and a DTD. SAC 2005: 647-653. Google Scholar
- D. Sankoff, J. B. Kruskal: Time Warps, String Edits and Macromolecules Theory and Practice of Sequence Comparison, 1983. Addison Wesley.Google Scholar
- E. S. Tabanao, M. M. T. Rodrigo, M. C. Jadud: Predicting at-risk novice Java programmers through the analysis of online protocols. ICER 2011: 85-92. Google Scholar
- Y. Wang, D. J. DeWitt, J. Cai: X-Diff: An Effective Change Detection Algorithm for XML Documents. ICDE 2003: 519-530.Google Scholar
- R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168-173 (1974). Google Scholar
- K. Zhang and D. Shasha: Simple Fast Algorithms for the Editing distance between Trees and Related Problems. SIAM J. Computing, 18(6): 1245-1262 (1989). Google Scholar
Recommendations
Query translation for distributed heterogeneous structured and semi-structured databases
BNCOD'06: Proceedings of the 23rd British National Conference on Databases, conference on Flexible and Efficient Information HandlingThe main purpose of building data integration systems is to facilitate access to a multitude of data sources. A data integration system must contain a module that uses source descriptions in order to reformulate user queries which are posed in terms of ...
Comments