Abstract
Since regular expressions are often used to detect errors in sequences such as strings or date, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expression to make the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair.
For sequence repair, we propose the Regular-expression-based Structural Repair (RSR in short) algorithm. RSR algorithm is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern regular expression with time complexity of O(nm2) and space complexity of O(mn) where m is the edge number of NFA and n is the input string length. We also develop an optimization strategy to achieve higher performance for long strings. For token value repair, we combine the edit-distance-based method and associate rules by a unified argument for the selection of the proper method. Experimental results on both real and synthetic data show that the proposed method could repair the data effectively and efficiently.
- Summary of movielens datasets. http://files.grouplens.org/datasets/movielens/ml-1m-README.txt.Google Scholar
- R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarDigital Library
- M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. P. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405--434, 2003. Google ScholarDigital Library
- A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In STOC, pages 51--58, 2015. Google ScholarDigital Library
- P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarCross Ref
- F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1):1166--1177, 2008. Google ScholarDigital Library
- D. Deng, G. Li, J. Feng, and W. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013. Google ScholarDigital Library
- X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550--561, 2009. Google ScholarDigital Library
- X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562--573, 2009. Google ScholarDigital Library
- W. Eckerson. Data quality and the bottom line. Journal of Radioanalytical & Nuclear Chemistry, 160(4):355--362, 1992.Google Scholar
- W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, pages 470--481, 2013. Google ScholarDigital Library
- A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, pages 131--140, 2010. Google ScholarDigital Library
- F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarDigital Library
- K. Lakshminarayan, S. A. Harp, R. P. Goldman, and T. Samad. Imputation of missing data using machine learning techniques. In KDD, pages 140--145, 1996.Google ScholarDigital Library
- C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, pages 75--86, 2010. Google ScholarDigital Library
- I. O. Medicine, J. M. Corrigan, and M. S. Donaldson. To err is human: Building a safer health system. Institute of Medicine the National Academies, 7(4):245--246, 2000.Google Scholar
- V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarDigital Library
- N. A. Setiawan, P. A. Venkatachalam, and A. F. M. Hani. Missing attribute value prediction based on artificial neural network and rough set theory. In BMEI, pages 306--310, 2008. Google ScholarDigital Library
- R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168--173, 1974. Google ScholarDigital Library
- T. Warren. Using regular expressions for data cleansing and standardization. http://www.kimballgroup.com/2009/01/using-regular-expressions-for-data-cleansing-and-standardization.Google Scholar
Recommendations
Regular Transducer Expressions for Regular Transformations
LICS '18: Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer ScienceFunctional MSO transductions, deterministic two-way transducers, as well as streaming string transducers are all equivalent models for regular functions. In this paper, we show that every regular function, either on finite words or on infinite words, ...
Repairing Regular Expressions for Extraction
While synthesizing and repairing regular expressions (regexes) based on Programming-by-Examples (PBE) methods have seen rapid progress in recent years, all existing works only support synthesizing or repairing regexes for membership testing, and the ...
Closure properties and descriptional complexity of deterministic regular expressions
We study the descriptional complexity of regular languages that are definable by deterministic regular expressions, i.e., we examine worst-case blow-ups in size when translating between different representations for such languages. As representations of ...
Comments