Article

Mining data records in Web pages

Authors:
Bing Liu, Robert Grossman, Yanhong Zhai

University of Illinois at Chicago, Chicago, IL

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2003, Pages 601–606. https://doi.org/10.1145/956750.956826

Published: 24 August 2003

ABSTRACT

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. It is useful to mine such data records in order to extract information from them to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracies. In this paper, we propose a more effective technique to perform the task. The technique is based on two observations about data records on the Web and a string matching algorithm. The proposed technique is able to mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique outperforms existing techniques substantially.
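The abstract does not say which string matching algorithm the technique uses. As an illustration only, the sketch below assumes Levenshtein edit distance over sequences of HTML tag names, which is one common way to judge whether two page regions share a similar structure. The function names and the sample tag sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tag names)."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from the first i items of a to an empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical tag sequences for two candidate data records.
record_a = ["tr", "td", "img", "td", "a", "td", "b"]
record_b = ["tr", "td", "img", "td", "a", "td", "i", "b"]

similarity_gap = normalized_distance(record_a, record_b)
```

A small normalized distance suggests the two regions follow the same layout template. Any cutoff for treating them as similar would be a tuning choice that the abstract does not specify.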


Index Terms

Mining data records in Web pages
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications


Published in
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750
Conference Chair: Lise Getoor, University of Maryland, College Park
General Chair: Ted Senator, DARPA
Program Chairs: Pedro Domingos, University of Washington; Christos Faloutsos, Carnegie Mellon University
Copyright © 2003 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Web data records
Web information integration
Web mining
Qualifiers
- Article
Conference

Acceptance Rates
KDD '03 Paper Acceptance Rate: 46 of 298 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%