skip to main content
article

On repairing structural problems in semi-structured data

Published:01 July 2013Publication History
Skip Abstract Section

Abstract

Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open - and close-tags are unmatched. Since some semi-structured data (e.g., Latex) have a flexible grammar and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance.

To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program could be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.

References

  1. A. V. Aho, T. G. Peterson: A Minimum Distance Error-Correcting Parser for Context-Free Languages. SIAM J. Comput. 1(4): 305-312 (1972).Google ScholarGoogle Scholar
  2. G. Beskales, I. F. Ilyas and L. Golab: Sampling the Repairs of Functional Dependency Violations under Hard Constraints. PVLDB 3(1): 197-207 (2010). Google ScholarGoogle Scholar
  3. G. J. Bex, F. Neven, T. Schwentick, K. Tuyls: Inference of Concise DTDs from XML Data, VLDB 2006: 115-126. Google ScholarGoogle Scholar
  4. G. J. Bex, F. Neven, S. Vansummeren: Inferring XML Schema Definitions from XML Data, VLDB 2007: 998-1009. Google ScholarGoogle Scholar
  5. U. Boobna and M. de Rougemont: Correctors for XML Data. XSym 2004: 97-111.Google ScholarGoogle Scholar
  6. P. Bohannon, M. Flaster, W. Fan and R. Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD 2005: 143-154. Google ScholarGoogle Scholar
  7. B. Choi: What are real DTDs like? WebDB 2002: 43-48.Google ScholarGoogle Scholar
  8. G. Cobena, S. Abiteboul, A. Marian: Detecting Changes in XML Documents. ICDE 2002: 41-52. Google ScholarGoogle Scholar
  9. T. Dalamagas, T. Cheng, K. Winkel, T. K. Sellis: A Methodology for Clustering XML Documents by Structure. Inf. Syst. 31(3): 187-228 (2006). Google ScholarGoogle Scholar
  10. W. Fan, J. Li, S. Ma, N. Tang and W. Yu: Interaction between record matching and data repairing. SIGMOD 2011: 469-480. Google ScholarGoogle Scholar
  11. M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim: DTD Inference from XML Documents: The XTRACT Approach. IEEE Data Eng. Bull. 26(3): 19-25 (2003).Google ScholarGoogle Scholar
  12. E. M. Gold: Language Identification in the Limit, Information and Control 10(5): 447-474 (1967).Google ScholarGoogle Scholar
  13. S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, T. Yu: Approximate XML Joins. SIGMOD 2002: 287-298. Google ScholarGoogle Scholar
  14. S. A. Greibach: The Hardest Context-Free Language, SIAM J. Comput. 2(4): 304-310 (1973).Google ScholarGoogle Scholar
  15. S. Grijzenhout and M. Marx: The quality of the XML web. CIKM 2011: 1719-1724. Google ScholarGoogle Scholar
  16. L. Lee: Fast context-free grammar parsing requires fast boolean matrix multiplication, J. ACM 49(1): 1-15 (2002). Google ScholarGoogle Scholar
  17. V. I. Levenshtein: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707-710 (1966).Google ScholarGoogle Scholar
  18. F. Magniez, C. Mathieu and A. Nayak, Recognizing well-parenthesized expressions in the streaming model, STOC 2010: 261-270. Google ScholarGoogle Scholar
  19. G. Myers: Approximately Matching Context-Free Languages. Inf. Process. Lett. 54(2): 85-92 (1995). Google ScholarGoogle Scholar
  20. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88 (2001). Google ScholarGoogle Scholar
  21. M. Pawlik, N. Augsten: RTED: A Robust Algorithm for the Tree Edit Distance. PVLDB 5(4): 334-345 (2011). Google ScholarGoogle Scholar
  22. H. Samimi, M. Schaefer, S. Artzi, T. Millstein, F. Tip and L. Hendren: Automated Repair of HTML Generation Errors in PHP Applications Using String Constraint Solving. Int'l Conf. Software Engineering 2012. Google ScholarGoogle Scholar
  23. L. Segoufin and C. Sirangelo: Constant-Memory Validation of Streaming XML Documents Against DTDs. ICDT 2007: 299-313. Google ScholarGoogle Scholar
  24. L. Segoufin and V. Vianu: Validating Streaming XML Documents. PODS 2002: 53-64. Google ScholarGoogle Scholar
  25. S. Staworko and J. Chomicki: Validity-Sensitive Querying of XML Databases. EDBT Workshops 2006: 164-177. Google ScholarGoogle Scholar
  26. N. Suzuki: Finding an optimum edit script between an XML document and a DTD. SAC 2005: 647-653. Google ScholarGoogle Scholar
  27. D. Sankoff, J. B. Kruskal: Time Warps, String Edits and Macromolecules Theory and Practice of Sequence Comparison, 1983. Addison Wesley.Google ScholarGoogle Scholar
  28. E. S. Tabanao, M. M. T. Rodrigo, M. C. Jadud: Predicting at-risk novice Java programmers through the analysis of online protocols. ICER 2011: 85-92. Google ScholarGoogle Scholar
  29. Y. Wang, D. J. DeWitt, J. Cai: X-Diff: An Effective Change Detection Algorithm for XML Documents. ICDE 2003: 519-530.Google ScholarGoogle Scholar
  30. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168-173 (1974). Google ScholarGoogle Scholar
  31. K. Zhang and D. Shasha: Simple Fast Algorithms for the Editing distance between Trees and Related Problems. SIAM J. Computing, 18(6): 1245-1262 (1989). Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 6, Issue 9
    July 2013
    180 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2013
    Published in pvldb Volume 6, Issue 9

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader