research-article

Repairing data through regular expressions

Authors:
Zeyu Li

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

,
Hongzhi Wang

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

,
Wei Shao

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

,
Jianzhong Li

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

,
Hong Gao

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

Proceedings of the VLDB Endowment Volume 9 Issue 5pp 432–443https://doi.org/10.14778/2876473.2876478

Published:01 January 2016Publication History

Proceedings of the VLDB Endowment

Abstract

Since regular expressions are often used to detect errors in sequences such as strings or date, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expression to make the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair.

For sequence repair, we propose the Regular-expression-based Structural Repair (RSR in short) algorithm. RSR algorithm is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern regular expression with time complexity of O(nm²) and space complexity of O(mn) where m is the edge number of NFA and n is the input string length. We also develop an optimization strategy to achieve higher performance for long strings. For token value repair, we combine the edit-distance-based method and associate rules by a unified argument for the selection of the proper method. Experimental results on both real and synthetic data show that the proposed method could repair the data effectively and efficiently.

References

Summary of movielens datasets. http://files.grouplens.org/datasets/movielens/ml-1m-README.txt.Google Scholar
R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994. Google ScholarDigital Library
M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. P. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405--434, 2003. Google ScholarDigital Library
A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In STOC, pages 51--58, 2015. Google ScholarDigital Library
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarCross Ref
F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1):1166--1177, 2008. Google ScholarDigital Library
D. Deng, G. Li, J. Feng, and W. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013. Google ScholarDigital Library
X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550--561, 2009. Google ScholarDigital Library
X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562--573, 2009. Google ScholarDigital Library
W. Eckerson. Data quality and the bottom line. Journal of Radioanalytical & Nuclear Chemistry, 160(4):355--362, 1992.Google Scholar
W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, pages 470--481, 2013. Google ScholarDigital Library
A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, pages 131--140, 2010. Google ScholarDigital Library
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarDigital Library
K. Lakshminarayan, S. A. Harp, R. P. Goldman, and T. Samad. Imputation of missing data using machine learning techniques. In KDD, pages 140--145, 1996.Google ScholarDigital Library
C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, pages 75--86, 2010. Google ScholarDigital Library
I. O. Medicine, J. M. Corrigan, and M. S. Donaldson. To err is human: Building a safer health system. Institute of Medicine the National Academies, 7(4):245--246, 2000.Google Scholar
V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarDigital Library
N. A. Setiawan, P. A. Venkatachalam, and A. F. M. Hani. Missing attribute value prediction based on artificial neural network and rough set theory. In BMEI, pages 306--310, 2008. Google ScholarDigital Library
R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168--173, 1974. Google ScholarDigital Library
T. Warren. Using regular expressions for data cleansing and standardization. http://www.kimballgroup.com/2009/01/using-regular-expressions-for-data-cleansing-and-standardization.Google Scholar

Recommendations

Regular Transducer Expressions for Regular Transformations
LICS '18: Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science

Functional MSO transductions, deterministic two-way transducers, as well as streaming string transducers are all equivalent models for regular functions. In this paper, we show that every regular function, either on finite words or on infinite words, ...
Read More
Repairing Regular Expressions for Extraction

While synthesizing and repairing regular expressions (regexes) based on Programming-by-Examples (PBE) methods have seen rapid progress in recent years, all existing works only support synthesizing or repairing regexes for membership testing, and the ...
Read More
Closure properties and descriptional complexity of deterministic regular expressions

We study the descriptional complexity of regular languages that are definable by deterministic regular expressions, i.e., we examine worst-case blow-ups in size when translating between different representations for such languages. As representations of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 9, Issue 5
January 2016
72 pages
ISSN:2150-8097
Editors:
Surajit Chaudhuri
Microsoft Research
,
Jayant Haritsa
I.I.Sc. Bangalore
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 January 2016
Published in pvldb Volume 9, Issue 5
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 155
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Repairing data through regular expressions

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Regular Transducer Expressions for Regular Transformations

Repairing Regular Expressions for Extraction

Closure properties and descriptional complexity of deterministic regular expressions

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Repairing data through regular expressions

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Regular Transducer Expressions for Regular Transformations

Repairing Regular Expressions for Extraction

Closure properties and descriptional complexity of deterministic regular expressions

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media