skip to main content
research-article

Enriching data imputation with extensive similarity neighbors

Published:01 July 2015Publication History
Skip Abstract Section

Abstract

Incomplete information often occur along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the np-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.

References

  1. Full version. http://ise.thss.tsinghua.edu.cn/sxsong/doc/incomplete.pdf.Google ScholarGoogle Scholar
  2. P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143--154, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446--457, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. W. Fan, J. Li, X. Jia, and S. Ma. Reasoning about record matching rules. PVLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1):173--184, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85--103. Plenum Press, 1972.Google ScholarGoogle ScholarCross RefCross Ref
  9. S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Libkin and L. Wong. Semantic representations and query languages for or-sets. In PODS, pages 37--48, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA, 2002. Google ScholarGoogle ScholarCross RefCross Ref
  12. C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD Conference, pages 75--86, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Song and L. Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Song, L. Chen, and H. Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In ICDE, pages 846--857, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD Conference, pages 457--468, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. Wolf, H. Khatri, B. Chokshi, J. Fan, Y. Chen, and S. Kambhampati. Query processing over incomplete autonomous databases. In VLDB, pages 651--662, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Wu, X. Feng, Y. Han, and Q. Wang. Missing categorical data imputation approach based on similarity. In SMC, pages 2827--2832, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  20. S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang. Missing value imputation based on data clustering. Trans. on Computational Science, 1:128--138, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Enriching data imputation with extensive similarity neighbors

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 8, Issue 11
          July 2015
          264 pages
          ISSN:2150-8097
          Issue’s Table of Contents

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 July 2015
          Published in pvldb Volume 8, Issue 11

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader