research-article

Enriching data imputation with extensive similarity neighbors

Authors:
Shaoxu Song

Tsinghua University, China

Tsinghua University, China
View Profile

,
Aoqian Zhang

Tsinghua University, China

Tsinghua University, China
View Profile

,
Lei Chen

The Hong Kong University of Science & Technology, Hong Kong

The Hong Kong University of Science & Technology, Hong Kong
View Profile

,
Jianmin Wang

Tsinghua University, China

Tsinghua University, China
View Profile

Proceedings of the VLDB Endowment Volume 8 Issue 11pp 1286–1297https://doi.org/10.14778/2809974.2809989

Published:01 July 2015Publication History

Proceedings of the VLDB Endowment

Abstract

Incomplete information often occur along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the np-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.

References

Full version. http://ise.thss.tsinghua.edu.cn/sxsong/doc/incomplete.pdf.Google Scholar
P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarDigital Library
F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446--457, 2011. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarDigital Library
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007. Google ScholarDigital Library
W. Fan, J. Li, X. Jia, and S. Ma. Reasoning about record matching rules. PVLDB, 2009. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173--184, 2010. Google ScholarDigital Library
R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85--103. Plenum Press, 1972.Google ScholarCross Ref
S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarDigital Library
L. Libkin and L. Wong. Semantic representations and query languages for or-sets. In PODS, pages 37--48, 1993. Google ScholarDigital Library
R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA, 2002. Google ScholarCross Ref
C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD Conference, pages 75--86, 2010. Google ScholarDigital Library
G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001. Google ScholarDigital Library
S. Song and L. Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16, 2011. Google ScholarDigital Library
S. Song, L. Chen, and H. Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In ICDE, pages 846--857, 2012. Google ScholarDigital Library
C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979. Google ScholarDigital Library
J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD Conference, pages 457--468, 2014. Google ScholarDigital Library
G. Wolf, H. Khatri, B. Chokshi, J. Fan, Y. Chen, and S. Kambhampati. Query processing over incomplete autonomous databases. In VLDB, pages 651--662, 2007. Google ScholarDigital Library
S. Wu, X. Feng, Y. Han, and Q. Wang. Missing categorical data imputation approach based on similarity. In SMC, pages 2827--2832, 2012.Google ScholarCross Ref
S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang. Missing value imputation based on data clustering. Trans. on Computational Science, 1:128--138, 2008. Google ScholarDigital Library

Index Terms

Enriching data imputation with extensive similarity neighbors
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Data management systems
    1. Database design and models
  2. Information retrieval
    1. Document representation

Recommendations

Enriching Data Imputation under Similarity Rule Constraints
Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/...
Read More
Nearest neighbours in least-squares data imputation algorithms with different missing patterns

Methods for imputation of missing data in the so-called least-squares approximation approach, a non-parametric computationally efficient multidimensional technique, are experimentally compared. Contributions are made to each of the three components of ...
Read More
Missing Value Imputation Method Using Separate Features Nearest Neighbors Algorithm
Computational Science – ICCS 2021
Abstract
Missing value imputation is a problem often meet when working with medical and biometric data sets. Prior to working on these datasets, missing values have to be eliminated. It could be done by imputing estimated values. However, imputation should ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 8, Issue 11
July 2015
264 pages
ISSN:2150-8097
Editors:
Chen Li
University of California, Irvine
,
Volker Markl
TU Berlin
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2015
Published in pvldb Volume 8, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 13
  Total Citations
  View Citations
- 210
  Total Downloads
- Downloads (Last 12 months)20
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Enriching data imputation with extensive similarity neighbors

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Enriching Data Imputation under Similarity Rule Constraints

Nearest neighbours in least-squares data imputation algorithms with different missing patterns

Missing Value Imputation Method Using Separate Features Nearest Neighbors Algorithm

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Enriching data imputation with extensive similarity neighbors

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Enriching Data Imputation under Similarity Rule Constraints

Nearest neighbours in least-squares data imputation algorithms with different missing patterns

Missing Value Imputation Method Using Separate Features Nearest Neighbors Algorithm

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media