Article

Mining reference tables for automatic text segmentation

Authors:
Eugene Agichtein, Columbia University
Venkatesh Ganti, Microsoft Research

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2004, Pages 20–29. https://doi.org/10.1145/1014052.1014058

Published: 22 August 2004

ABSTRACT

Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining, and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. Thus, we overcome limitations of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding that of state-of-the-art supervised approaches.
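To make the idea of segmenting with a reference table concrete, here is a toy sketch in Python. It is not the system described in the paper: the scoring rule (counting tokens already seen in a column), the exhaustive search over cut points and column orders, and the names `build_vocab` and `segment` are all assumptions chosen for illustration. It only shows how existing table columns can stand in for manually labeled training data.

```python
from itertools import combinations, permutations


def build_vocab(rows, columns):
    """Collect the set of lowercased tokens seen in each reference-table column."""
    vocab = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            vocab[c].update(row[c].lower().split())
    return vocab


def segment(text, vocab):
    """Split text into one contiguous segment per column.

    Tries every choice of cut points and every column order, and keeps the
    assignment whose tokens hit the column vocabularies most often.
    Returns None when the text has fewer tokens than there are columns.
    """
    tokens = text.split()
    columns = list(vocab)
    k = len(columns)
    best_score, best = -1, None
    # k - 1 cut points give k contiguous, non-empty segments.
    for cuts in combinations(range(1, len(tokens)), k - 1):
        bounds = (0, *cuts, len(tokens))
        segments = [tokens[bounds[i]:bounds[i + 1]] for i in range(k)]
        for order in permutations(columns):
            score = sum(
                tok.lower() in vocab[col]
                for col, seg in zip(order, segments)
                for tok in seg
            )
            if score > best_score:
                best_score = score
                best = {col: " ".join(seg) for col, seg in zip(order, segments)}
    return best


# Hypothetical reference table, invented for this example.
REFERENCE_ROWS = [
    {"name": "Jane Smith", "city": "Seattle", "state": "WA"},
    {"name": "Raj Patel", "city": "New York", "state": "NY"},
]
VOCAB = build_vocab(REFERENCE_ROWS, ["name", "city", "state"])

print(segment("Raj Smith New York NY", VOCAB))
print(segment("Seattle WA Jane Patel", VOCAB))
```

The exhaustive search is exponential in the number of columns and is only workable for short strings. A practical system would need a more efficient search and a scoring model that tolerates unseen or misspelled tokens.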

Index Terms

Mining reference tables for automatic text segmentation
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN:1581138881
DOI:10.1145/1014052
General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)
Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)
Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data cleaning
information extraction
machine learning
text management
text segmentation
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%