skip to main content
research-article

Repairing data through regular expressions

Published:01 January 2016Publication History
Skip Abstract Section

Abstract

Since regular expressions are often used to detect errors in sequences such as strings or date, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expression to make the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair.

For sequence repair, we propose the Regular-expression-based Structural Repair (RSR in short) algorithm. RSR algorithm is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern regular expression with time complexity of O(nm2) and space complexity of O(mn) where m is the edge number of NFA and n is the input string length. We also develop an optimization strategy to achieve higher performance for long strings. For token value repair, we combine the edit-distance-based method and associate rules by a unified argument for the selection of the proper method. Experimental results on both real and synthetic data show that the proposed method could repair the data effectively and efficiently.

References

  1. Summary of movielens datasets. http://files.grouplens.org/datasets/movielens/ml-1m-README.txt.Google ScholarGoogle Scholar
  2. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. P. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405--434, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In STOC, pages 51--58, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  7. F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1):1166--1177, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Deng, G. Li, J. Feng, and W. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550--561, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562--573, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. W. Eckerson. Data quality and the bottom line. Journal of Radioanalytical & Nuclear Chemistry, 160(4):355--362, 1992.Google ScholarGoogle Scholar
  12. W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, pages 470--481, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, pages 131--140, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. K. Lakshminarayan, S. A. Harp, R. P. Goldman, and T. Samad. Imputation of missing data using machine learning techniques. In KDD, pages 140--145, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, pages 75--86, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. I. O. Medicine, J. M. Corrigan, and M. S. Donaldson. To err is human: Building a safer health system. Institute of Medicine the National Academies, 7(4):245--246, 2000.Google ScholarGoogle Scholar
  18. V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. N. A. Setiawan, P. A. Venkatachalam, and A. F. M. Hani. Missing attribute value prediction based on artificial neural network and rough set theory. In BMEI, pages 306--310, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168--173, 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. Warren. Using regular expressions for data cleansing and standardization. http://www.kimballgroup.com/2009/01/using-regular-expressions-for-data-cleansing-and-standardization.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 9, Issue 5
    January 2016
    72 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 January 2016
    Published in pvldb Volume 9, Issue 5

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader