ABSTRACT
The Web is rich in tables (e.g., HTML tables, spreadsheets, Google Fusion Tables) that host a considerable wealth of high-quality relational data. Unlike unstructured text, tables usually favour the automatic extraction of data because of their regular structure and properties. Data extraction is usually complemented by the annotation of the table, which determines its semantics by identifying a type for each column, the relations between columns, if any, and the entities that occur in each cell.
In this paper, we focus on the problem of discovering and annotating entities in tables. More specifically, we describe an algorithm that identifies the rows of a table that contain information on entities of specific types (e.g., restaurant, museum, theatre) derived from an ontology and determines the cells in which the names of those entities occur. We implemented this algorithm while developing a faceted browser over a repository of RDF data on points of interest in cities, which we extracted from Google Fusion Tables.
We claim that our algorithm complements existing approaches, which annotate entities in a table based on a pre-compiled reference catalogue that lists the types of a finite set of entities and are therefore unable to discover and annotate entities that do not belong to the reference catalogue. Instead, we train our algorithm to look for information on previously unseen entities on the Web so as to annotate them with the correct type.
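The abstract does not spell out the algorithm's internals, so the following is only a minimal sketch of the task it describes: for each row, find the cell most likely to name an entity of a target type. The names `annotate_rows`, `toy_scorer`, and `threshold` are hypothetical, and the keyword-based scorer is merely a stand-in for whatever trained, Web-backed classifier the paper actually uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical scorer: maps a cell string to {entity_type: confidence in [0, 1]}.
TypeScorer = Callable[[str], Dict[str, float]]


def annotate_rows(
    rows: List[List[str]],
    score_types: TypeScorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int, str]]:
    """Return (row_index, column_index, entity_type) for each row that
    appears to describe an entity of one of the target types."""
    annotations: List[Tuple[int, int, str]] = []
    for row_index, row in enumerate(rows):
        best: Optional[Tuple[int, str, float]] = None  # (column, type, score)
        for col_index, cell in enumerate(row):
            text = cell.strip()
            if not text:
                continue  # empty cells cannot hold an entity name
            for entity_type, score in score_types(text).items():
                if score >= threshold and (best is None or score > best[2]):
                    best = (col_index, entity_type, score)
        if best is not None:
            annotations.append((row_index, best[0], best[1]))
    return annotations


def toy_scorer(cell: str) -> Dict[str, float]:
    """Illustrative stand-in for a trained classifier (an assumption):
    scores a cell by keyword match instead of evidence gathered on the Web."""
    keywords = {
        "restaurant": ("trattoria", "bistro"),
        "museum": ("museum", "gallery"),
    }
    lowered = cell.lower()
    return {
        entity_type: 1.0 if any(k in lowered for k in words) else 0.0
        for entity_type, words in keywords.items()
    }


rows = [
    ["Trattoria Roma", "12 Main St", "Italian"],
    ["City Museum of Art", "5 Park Ave", ""],
    ["n/a", "unknown", ""],
]
result = annotate_rows(rows, toy_scorer)
print(result)
```

Passing the scorer as a parameter keeps the row and cell selection logic independent of how type evidence is obtained, so a classifier trained on Web search results could be substituted for `toy_scorer` without touching `annotate_rows`.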
Index Terms
- Entity discovery and annotation in tables