Abstract
Incomplete information often occur along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the np-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.
- Full version. http://ise.thss.tsinghua.edu.cn/sxsong/doc/incomplete.pdf.Google Scholar
- P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarDigital Library
- F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446--457, 2011. Google ScholarDigital Library
- X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarDigital Library
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007. Google ScholarDigital Library
- W. Fan, J. Li, X. Jia, and S. Ma. Reasoning about record matching rules. PVLDB, 2009. Google ScholarDigital Library
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173--184, 2010. Google ScholarDigital Library
- R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85--103. Plenum Press, 1972.Google ScholarCross Ref
- S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarDigital Library
- L. Libkin and L. Wong. Semantic representations and query languages for or-sets. In PODS, pages 37--48, 1993. Google ScholarDigital Library
- R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA, 2002. Google ScholarCross Ref
- C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD Conference, pages 75--86, 2010. Google ScholarDigital Library
- G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001. Google ScholarDigital Library
- S. Song and L. Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16, 2011. Google ScholarDigital Library
- S. Song, L. Chen, and H. Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In ICDE, pages 846--857, 2012. Google ScholarDigital Library
- C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979. Google ScholarDigital Library
- J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD Conference, pages 457--468, 2014. Google ScholarDigital Library
- G. Wolf, H. Khatri, B. Chokshi, J. Fan, Y. Chen, and S. Kambhampati. Query processing over incomplete autonomous databases. In VLDB, pages 651--662, 2007. Google ScholarDigital Library
- S. Wu, X. Feng, Y. Han, and Q. Wang. Missing categorical data imputation approach based on similarity. In SMC, pages 2827--2832, 2012.Google ScholarCross Ref
- S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang. Missing value imputation based on data clustering. Trans. on Computational Science, 1:128--138, 2008. Google ScholarDigital Library
Index Terms
- Enriching data imputation with extensive similarity neighbors
Recommendations
Enriching Data Imputation under Similarity Rule Constraints
Incomplete information often occurs along with many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same/...
Nearest neighbours in least-squares data imputation algorithms with different missing patterns
Methods for imputation of missing data in the so-called least-squares approximation approach, a non-parametric computationally efficient multidimensional technique, are experimentally compared. Contributions are made to each of the three components of ...
Missing Value Imputation Method Using Separate Features Nearest Neighbors Algorithm
Computational Science – ICCS 2021AbstractMissing value imputation is a problem often meet when working with medical and biometric data sets. Prior to working on these datasets, missing values have to be eliminated. It could be done by imputing estimated values. However, imputation should ...
Comments