DOI: 10.1145/2452376.2452457
research-article

Entity discovery and annotation in tables

Published: 18 March 2013

ABSTRACT

The Web is rich in tables (e.g., HTML tables, spreadsheets, Google Fusion Tables) that host a considerable wealth of high-quality relational data. Unlike unstructured text, tables usually favour the automatic extraction of data because of their regular structure and properties. Data extraction is usually complemented by the annotation of the table, which determines its semantics by identifying a type for each column, the relations between columns, if any, and the entities that occur in each cell.

In this paper, we focus on the problem of discovering and annotating entities in tables. More specifically, we describe an algorithm that identifies the rows of a table that contain information on entities of specific types (e.g., restaurant, museum, theatre) derived from an ontology and determines the cells in which the names of those entities occur. We implemented this algorithm while developing a faceted browser over a repository of RDF data on points of interest of cities that we extracted from Google Fusion Tables.

We claim that our algorithm complements the existing approaches, which annotate entities in a table based on a pre-compiled reference catalogue that lists the types of a finite set of entities; as a result, they are unable to discover and annotate entities that do not belong to the reference catalogue. Instead, we train our algorithm to look for information on previously unseen entities on the Web so as to annotate them with the correct type.
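The limitation described above can be illustrated with a minimal sketch. It is not the algorithm from the paper: the catalogue, the table rows, the snippet text, and the keyword-based fallback are all hypothetical toy stand-ins, chosen only to show why a catalogue lookup alone leaves unseen entities unannotated and where a trained classifier over Web evidence could plug in.

```python
from typing import Callable, Optional

# Hypothetical reference catalogue: entity name -> type (toy data, not from the paper).
CATALOGUE = {"louvre": "museum", "la scala": "theatre"}

Table = list[list[str]]
Annotation = tuple[int, int, str]  # (row index, column index, entity type)


def annotate(table: Table,
             fallback: Optional[Callable[[str], Optional[str]]] = None) -> list[Annotation]:
    """Return one annotation per row: the first cell that names a typed entity."""
    annotations = []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            entity_type = CATALOGUE.get(cell.strip().lower())
            # The catalogue knows nothing about this cell: optionally ask the fallback.
            if entity_type is None and fallback is not None:
                entity_type = fallback(cell)
            if entity_type is not None:
                annotations.append((r, c, entity_type))
                break  # assumption of this sketch: one entity name per row
    return annotations


# Stand-in for text that a real system would retrieve from the Web (hard-coded here).
SNIPPETS = {"Chez Panisse": "a restaurant in Berkeley known for its seasonal menu"}
# Toy keyword lists standing in for a trained classifier.
TYPE_KEYWORDS = {
    "restaurant": {"restaurant", "menu"},
    "museum": {"museum", "exhibition"},
}


def keyword_fallback(cell: str) -> Optional[str]:
    """Pick the type whose keywords overlap most with the snippet, if any overlap."""
    words = set(SNIPPETS.get(cell, "").lower().split())
    scores = {t: len(words & kws) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


table = [
    ["Louvre", "Rue de Rivoli", "Paris"],
    ["Chez Panisse", "1517 Shattuck Ave", "Berkeley"],
]
catalogue_only = annotate(table)
with_fallback = annotate(table, keyword_fallback)
```

With the catalogue alone, only the first row is annotated; the second row's entity is absent from the catalogue and is skipped. Passing the fallback lets the sketch also type the unseen name, which is the gap the paper's approach aims to fill with a trained model rather than the keyword overlap used here.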


Published in

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
March 2013
793 pages
ISBN: 9781450315975
DOI: 10.1145/2452376

Copyright © 2013 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall acceptance rate: 7 of 10 submissions, 70%
