ABSTRACT
A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. Mining such data records makes it possible to extract their information and provide value-added services. Existing automatic techniques are unsatisfactory because of their low accuracy. In this paper, we propose a more effective technique for the task. The technique is based on two observations about data records on the Web and on a string matching algorithm. It is able to mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique substantially outperforms existing techniques.
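The abstract does not say which string matching algorithm is used or how it is applied. As a rough illustration only, the sketch below assumes that each candidate region of a page has been reduced to a sequence of HTML tag names and that similarity is measured with a normalized edit distance. The function names, the 0.3 threshold, and the sample tag sequences are all hypothetical, and the sketch only groups adjacent (contiguous) regions; it is not the paper's method.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (here, tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar_runs(tag_strings: List[Sequence[str]],
                 threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for runs of two or more
    adjacent tag strings whose pairwise distance stays within the threshold.
    The threshold value is an assumption chosen for illustration."""
    runs = []
    start = 0
    for i in range(1, len(tag_strings) + 1):
        at_end = i == len(tag_strings)
        if at_end or normalized_distance(tag_strings[i - 1],
                                         tag_strings[i]) > threshold:
            if i - start >= 2:
                runs.append((start, i - 1))
            start = i
    return runs


# Hypothetical sibling regions of a product listing page.
siblings = [
    ["h1"],
    ["tr", "td", "img", "td", "a"],
    ["tr", "td", "img", "td", "a"],
    ["tr", "td", "img", "td", "a", "b"],
    ["p"],
]
candidate_records = similar_runs(siblings)
```

Here the three `tr` regions differ by at most one tag, so they are grouped as one candidate run of records, while the heading and the trailing paragraph are left out.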