The parallel path framework for entity discovery on the web

Authors:
Tim Weninger (University of Illinois at Urbana-Champaign),
Thomas J. Johnston (University of Illinois at Urbana-Champaign),
Jiawei Han (University of Illinois at Urbana-Champaign)

ACM Transactions on the Web, Volume 7, Issue 3, Article No. 16, pp. 1–29. https://doi.org/10.1145/2516633.2516638

Published: 30 September 2013

Abstract

It has been a dream of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content in some sites, the schemas of these databases are rarely consistent across a domain. This makes the comparison and aggregation of information from different domains difficult. We aim to make an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., university department and faculty member home page) we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, office numbers). To do this, we propose a Web structure mining method which grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
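The idea of growing parallel paths can be illustrated with a minimal sketch. It is not the authors' implementation: the toy link table `LINKS`, the DOM-path strings, and the helper names `index_links` and `grow_parallel_paths` are all assumptions made for illustration. The sketch assumes each hyperlink is annotated with the DOM path of its anchor tag, and treats any page reached by the same sequence of DOM paths as the example entity-page as a candidate entity-page of the same type.

```python
from collections import defaultdict

# Hypothetical toy site. Each link is (source page, DOM path of the anchor, target page).
LINKS = [
    ("/", "html/body/ul/li/a", "/people"),
    ("/", "html/body/ul/li/a", "/research"),
    ("/people", "html/body/table/tr/td/a", "/people/alice"),
    ("/people", "html/body/table/tr/td/a", "/people/bob"),
    ("/people", "html/body/table/tr/td/a", "/people/carol"),
    ("/people", "html/body/div/a", "/contact"),
    ("/research", "html/body/div/p/a", "/research/databases"),
]


def index_links(links):
    """Group link targets by (source page, DOM path of the anchor)."""
    index = defaultdict(list)
    for source, dom_path, target in links:
        index[(source, dom_path)].append(target)
    return index


def grow_parallel_paths(links, start, seed_dom_paths):
    """Follow every link whose DOM path matches the seed path, hop by hop."""
    index = index_links(links)
    frontier = {start}
    for dom_path in seed_dom_paths:
        # Keep only pages reachable through anchors at the same DOM position.
        frontier = {
            target
            for page in frontier
            for target in index.get((page, dom_path), [])
        }
    return sorted(frontier)


# DOM paths along the seed path from the site root to one known entity-page,
# here assumed to be /people/alice.
seed = ["html/body/ul/li/a", "html/body/table/tr/td/a"]
candidates = grow_parallel_paths(LINKS, "/", seed)
print(candidates)
```

In this toy example the candidates are the three pages linked from the same table as the seed page, while `/contact` and `/research/databases` are excluded because their anchors sit at different DOM paths. A real site would need approximate path matching and the attribute propagation the abstract describes, both of which this sketch omits.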


Index Terms

1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published in

ACM Transactions on the Web Volume 7, Issue 3
September 2013
149 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2516633

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 September 2013
- Accepted: 1 March 2013
- Revised: 1 July 2012
- Received: 1 February 2012

Author Tags
Parallel paths
entity pages
semi-structured data
web structure mining
Qualifiers
- research-article
- Research
- Refereed