DOI: 10.1145/1014052.1014058

Mining reference tables for automatic text segmentation

Published: 22 August 2004

ABSTRACT

Automatically segmenting unstructured text strings into structured records is necessary for importing the information contained in legacy sources and text collections into a data warehouse for subsequent querying, analysis, mining, and integration. In this paper, we mine tables present in data warehouses and relational databases to develop an automatic segmentation system. We thereby overcome a limitation of existing supervised text segmentation approaches, which require comprehensive manually labeled training data. Our segmentation system is robust, accurate, and efficient, and requires no additional manual effort. Thorough evaluation on real datasets demonstrates the robustness and accuracy of our system, with segmentation accuracy exceeding that of state-of-the-art supervised approaches.

      Published in

        KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
        August 2004
        874 pages
        ISBN: 1581138881
        DOI: 10.1145/1014052

        Copyright © 2004 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States
