skip to main content
research-article

KATARA: reliable data cleaning with knowledge bases and crowdsourcing

Published:01 August 2015Publication History
Skip Abstract Section

Abstract

Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even in the presence of external sources, mainly knowledge bases and humans, effectively leveraging them still faces many challenges, such as aligning heterogeneous data sources and decomposing a complex task into simpler units that can be consumed by humans. We present Katara, a novel end-to-end data cleaning system powered by knowledge bases and crowdsourcing. Given a table, a kb, and a crowd, Katara (i) interprets the table semantics w.r.t. the given kb; (ii) identifies correct and wrong data; and (iii) generates top-k possible repairs for the wrong data. Users will have the opportunity to experience the following features of Katara: (1) Easy specification: Users can define a Katara job with a browser-based specification; (2) Pattern validation: Users can help the system to resolve the ambiguity of different table patterns (i.e., table semantics) discovered by Katara; (3) Data annotation: Users can play the role of internal crowd workers, helping Katara annotate data. Moreover, Katara will visualize the annotated data as correct data validated by the kb, correct data jointly validated by the kb and the crowd, or erroneous tuples along with their possible repairs.

References

  1. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, 2013.Google ScholarGoogle Scholar
  2. X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, 2015. Google ScholarGoogle Scholar
  3. O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: a report from the trenches. In SIGMOD Conference, 2013. Google ScholarGoogle Scholar
  4. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011. Google ScholarGoogle Scholar
  5. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012. Google ScholarGoogle Scholar
  6. Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. In SIGMOD, 2015. Google ScholarGoogle Scholar
  7. G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. PVLDB, 2010. Google ScholarGoogle Scholar
  8. C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, 2010. Google ScholarGoogle Scholar
  9. M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, 2013. Google ScholarGoogle Scholar
  10. M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011. Google ScholarGoogle Scholar

Index Terms

  1. KATARA: reliable data cleaning with knowledge bases and crowdsourcing
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 8, Issue 12
          Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
          August 2015
          728 pages

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 August 2015
          Published in pvldb Volume 8, Issue 12

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader