ABSTRACT
Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining, and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. We thereby overcome a limitation of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding that of state-of-the-art supervised approaches.