Skip to main content
Erschienen in: Knowledge and Information Systems 3/2013

01.03.2013 | Regular Paper

A segment-based approach to clustering multi-topic documents

verfasst von: Andrea Tagarelli, George Karypis

Erschienen in: Knowledge and Information Systems | Ausgabe 3/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
2.
Zurück zum Zitat Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 532–537 Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 532–537
3.
Zurück zum Zitat Banerjee A, Shan H (2007) Latent Dirichlet conditional naive-Bayes models. In: Proceedings of the 7th IEEE international conference on data mining (ICDM), pp 421–426 Banerjee A, Shan H (2007) Latent Dirichlet conditional naive-Bayes models. In: Proceedings of the 7th IEEE international conference on data mining (ICDM), pp 421–426
4.
Zurück zum Zitat Baraldi A, Blonda P (1999) A survey of fuzzy clustering algorithms for pattern recognition. i–ii. IEEE Trans Syst Man Cybern Part B 29(6):778–801CrossRef Baraldi A, Blonda P (1999) A survey of fuzzy clustering algorithms for pattern recognition. i–ii. IEEE Trans Syst Man Cybern Part B 29(6):778–801CrossRef
5.
Zurück zum Zitat Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New YorkMATHCrossRef Bezdek J (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New YorkMATHCrossRef
6.
Zurück zum Zitat Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022MATH Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022MATH
7.
Zurück zum Zitat Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM), pp 211–218 Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM), pp 211–218
8.
Zurück zum Zitat Campos R, Dias G, Nunes C (2006) WISE: hierarchical soft clustering of web page search results based on web content mining techniques. In: Proceedings of the IEEE/WIC/ACM international conference on web intelligence, pp 301–304 Campos R, Dias G, Nunes C (2006) WISE: hierarchical soft clustering of web page search results based on web content mining techniques. In: Proceedings of the IEEE/WIC/ACM international conference on web intelligence, pp 301–304
9.
Zurück zum Zitat Chen CL, Tseng FSC, Liang T (2011) An integration of fuzzy association rules and WordNet for document clustering. Knowl Inf Syst 28(3):687–708CrossRef Chen CL, Tseng FSC, Liang T (2011) An integration of fuzzy association rules and WordNet for document clustering. Knowl Inf Syst 28(3):687–708CrossRef
10.
Zurück zum Zitat Chim H, Deng X (2008) Efficient phrase-based document similarity for clustering. IEEE Trans Knowl Data Eng 20(9):1217–1229CrossRef Chim H, Deng X (2008) Efficient phrase-based document similarity for clustering. IEEE Trans Knowl Data Eng 20(9):1217–1229CrossRef
11.
Zurück zum Zitat Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41:391–407CrossRef Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41:391–407CrossRef
12.
Zurück zum Zitat Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1/2):143–175MATHCrossRef Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1/2):143–175MATHCrossRef
13.
Zurück zum Zitat Du L, Buntine WL, Jin H (2010) A segmented topic model based on the two-parameter Poisson–Dirichlet process. Mach Learn 81(1):5–19CrossRef Du L, Buntine WL, Jin H (2010) A segmented topic model based on the two-parameter Poisson–Dirichlet process. Mach Learn 81(1):5–19CrossRef
14.
Zurück zum Zitat Farahat AK, Kamel MS (2011) Statistical semantics for enhancing document clustering. Knowl Inf Syst 28(2):365–393CrossRef Farahat AK, Kamel MS (2011) Statistical semantics for enhancing document clustering. Knowl Inf Syst 28(2):365–393CrossRef
15.
Zurück zum Zitat Fu Q, Banerjee A (2008) Multiplicative mixture models for overlapping clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM), pp 791–796 Fu Q, Banerjee A (2008) Multiplicative mixture models for overlapping clustering. In: Proceedings of the 8th IEEE international conference on data mining (ICDM), pp 791–796
16.
Zurück zum Zitat Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM), pp 776–781 Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM), pp 776–781
17.
Zurück zum Zitat Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), pp 59–70 Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceedings of the 3rd SIAM international conference on data mining (SDM), pp 59–70
18.
Zurück zum Zitat He Q, Chang K, Lim EP, Banerjee A (2010) Keep it simple with time: a re-examination of probabilistic topic detection models. IEEE Trans Pattern Anal Mach Intell 32(10):1795–1808CrossRef He Q, Chang K, Lim EP, Banerjee A (2010) Keep it simple with time: a re-examination of probabilistic topic detection models. IEEE Trans Pattern Anal Mach Intell 32(10):1795–1808CrossRef
19.
Zurück zum Zitat Hearst MA (1997) TextTiling: segmenting text into multi-paragraph subtopic passages. Comput Linguist 23(1):33–64 Hearst MA (1997) TextTiling: segmenting text into multi-paragraph subtopic passages. Comput Linguist 23(1):33–64
20.
Zurück zum Zitat Hearst MA, Plaunt C (1993) Subtopic structuring for full-length document access. In: Proceedings of the 16th ACM international conference on research and development in information retrieval (SIGIR), pp 59–68 Hearst MA, Plaunt C (1993) Subtopic structuring for full-length document access. In: Proceedings of the 16th ACM international conference on research and development in information retrieval (SIGIR), pp 59–68
21.
Zurück zum Zitat Heller KA, Ghahramani Z (2007) A nonparametric Bayesian approach to modeling overlapping clusters. In: Proceedings of the 11th international conference on artificial intelligence and statistics (AISTATS) Heller KA, Ghahramani Z (2007) A nonparametric Bayesian approach to modeling overlapping clusters. In: Proceedings of the 11th international conference on artificial intelligence and statistics (AISTATS)
22.
Zurück zum Zitat Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd ACM international conference on research and development in information retrieval (SIGIR), pp 50–57 Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd ACM international conference on research and development in information retrieval (SIGIR), pp 50–57
23.
Zurück zum Zitat Hofmann T (2001) Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42(1–2):177–196MATHCrossRef Hofmann T (2001) Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42(1–2):177–196MATHCrossRef
24.
Zurück zum Zitat Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle RiverMATH
25.
Zurück zum Zitat Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst 25(1):35–55CrossRef Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst 25(1):35–55CrossRef
26.
Zurück zum Zitat Jing L, Ng MK, Xu J, Huang JZ (2005) Subspace clustering of text documents with feature weighting-means algorithm. In: Proceedings of the 9th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 802–812 Jing L, Ng MK, Xu J, Huang JZ (2005) Subspace clustering of text documents with feature weighting-means algorithm. In: Proceedings of the 9th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 802–812
28.
Zurück zum Zitat Khy S, Ishikawa Y, Kitagawa H (2007) A novelty-based clustering method for online documents. World Wide Web 11:1–37CrossRef Khy S, Ishikawa Y, Kitagawa H (2007) A novelty-based clustering method for online documents. World Wide Web 11:1–37CrossRef
29.
Zurück zum Zitat Kim YM, Pessiot JF, Amini MR, Gallinari P (2008) An extension of PLSA for document clustering. In: Proceedings of the 17th ACM international conference on information and knowledge management (CIKM), pp 1345–1346 Kim YM, Pessiot JF, Amini MR, Gallinari P (2008) An extension of PLSA for document clustering. In: Proceedings of the 17th ACM international conference on information and knowledge management (CIKM), pp 1345–1346
30.
Zurück zum Zitat Kogan J (2007) Introduction to clustering large and high-dimensional data. Cambridge University Press, CambridgeMATH Kogan J (2007) Introduction to clustering large and high-dimensional data. Cambridge University Press, CambridgeMATH
31.
Zurück zum Zitat Krishnapuram R, Joshi A, Yi L (1999) A fuzzy relative of the \(k\)-medoids algorithm with application to web document and snippet clustering. In: Proceedings of the IEEE international conference on fuzzy systems, pp 1281–1286 Krishnapuram R, Joshi A, Yi L (1999) A fuzzy relative of the \(k\)-medoids algorithm with application to web document and snippet clustering. In: Proceedings of the IEEE international conference on fuzzy systems, pp 1281–1286
32.
Zurück zum Zitat Kummamuru K, Dhawale A, Krishnapuram R (2003) Fuzzy co-clustering of documents and keywords. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 772–777 Kummamuru K, Dhawale A, Krishnapuram R (2003) Fuzzy co-clustering of documents and keywords. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 772–777
33.
Zurück zum Zitat Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the 5th ACM international conference on knowledge discovery and data mining (KDD), pp 16–22 Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the 5th ACM international conference on knowledge discovery and data mining (KDD), pp 16–22
34.
Zurück zum Zitat Lewis DD, Yang Y, Rose T, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397 Lewis DD, Yang Y, Rose T, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
35.
Zurück zum Zitat Li T, Ma S, Ogihara M (2004) Document clustering via adaptive subspace iteration. In: Proceedings of the 27th ACM international conference on research and development in information retrieval (SIGIR), pp 218–225 Li T, Ma S, Ogihara M (2004) Document clustering via adaptive subspace iteration. In: Proceedings of the 27th ACM international conference on research and development in information retrieval (SIGIR), pp 218–225
36.
Zurück zum Zitat Mendes MES, Sacks L (2003) Evaluating fuzzy clustering for relevance-based information access. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 648–653 Mendes MES, Sacks L (2003) Evaluating fuzzy clustering for relevance-based information access. In: Proceedings of the 12th IEEE international conference on fuzzy systems, pp 648–653
37.
Zurück zum Zitat Misra H, Yvon F, Jose JM, Cappé O (2009) Text segmentation via topic modeling: an analytical study. In: Proceedings of the 18th ACM international conference on information and knowledge management (CIKM), pp 1553–1556 Misra H, Yvon F, Jose JM, Cappé O (2009) Text segmentation via topic modeling: an analytical study. In: Proceedings of the 18th ACM international conference on information and knowledge management (CIKM), pp 1553–1556
38.
Zurück zum Zitat Mittal V, Kantrowitz M, Goldstein J, Carbonell J (1999) Selecting text spans for document summaries. In: Proceedings of 16th national conference on artificial intelligence and 11th conference on innovative applications of artificial intelligence, pp 467–473 Mittal V, Kantrowitz M, Goldstein J, Carbonell J (1999) Selecting text spans for document summaries. In: Proceedings of 16th national conference on artificial intelligence and 11th conference on innovative applications of artificial intelligence, pp 467–473
39.
Zurück zum Zitat Ni X, Quan X, Lu Z, Wenyin L, Hua B (2011) Short text clustering by finding core terms. Knowl Inf Syst 27(3):345–365CrossRef Ni X, Quan X, Lu Z, Wenyin L, Hua B (2011) Short text clustering by finding core terms. Knowl Inf Syst 27(3):345–365CrossRef
40.
Zurück zum Zitat Osinski S, Stefanowski J, Weiss D (2004) Lingo: search results clustering algorithm based on singular value decomposition. In: Proceedings of the international conference on intelligent information systems, pp 359–368 Osinski S, Stefanowski J, Weiss D (2004) Lingo: search results clustering algorithm based on singular value decomposition. In: Proceedings of the international conference on intelligent information systems, pp 359–368
41.
Zurück zum Zitat Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105CrossRef Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105CrossRef
42.
Zurück zum Zitat Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Boston Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Boston
43.
Zurück zum Zitat Shafiei MM, Milios EE (2006) Latent Dirichlet co-clustering. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 542–551 Shafiei MM, Milios EE (2006) Latent Dirichlet co-clustering. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 542–551
44.
Zurück zum Zitat Shafiei MM, Milios EE (2006) Model-based overlapping co-clustering. In: Proceedings of the 4th workshop on text mining, in conjunction with the 6th SIAM international conference on data mining (SDM) Shafiei MM, Milios EE (2006) Model-based overlapping co-clustering. In: Proceedings of the 4th workshop on text mining, in conjunction with the 6th SIAM international conference on data mining (SDM)
45.
Zurück zum Zitat Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: Proceedings of the KDD’00 workshop on text mining Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: Proceedings of the KDD’00 workshop on text mining
46.
Zurück zum Zitat Tagarelli A, Karypis G (2008) A segment-based approach to clustering multi-topic documents. In: Proceedings of the 6th workshop on text mining, in conjunction with the 8th SIAM international conference on data mining (SDM) Tagarelli A, Karypis G (2008) A segment-based approach to clustering multi-topic documents. In: Proceedings of the 6th workshop on text mining, in conjunction with the 8th SIAM international conference on data mining (SDM)
47.
Zurück zum Zitat Tsai FS, Zhang Y (2011) D2S: document-to-sentence framework for novelty detection. Knowl Inf Syst 29(2):419–433CrossRef Tsai FS, Zhang Y (2011) D2S: document-to-sentence framework for novelty detection. Knowl Inf Syst 29(2):419–433CrossRef
48.
Zurück zum Zitat Ueda N, Saito K (2002) Single-shot detection of multiple categories of text using parametric mixture models. In: Proceedings of the 8th ACM international conference on knowledge discovery and data mining (KDD), pp 626–631 Ueda N, Saito K (2002) Single-shot detection of multiple categories of text using parametric mixture models. In: Proceedings of the 8th ACM international conference on knowledge discovery and data mining (KDD), pp 626–631
49.
Zurück zum Zitat Wan X, Yang J, Xiao J (2008) Towards a unified approach to document similarity search using manifold-ranking of blocks. Inf Process Manag 44:1032–1048CrossRef Wan X, Yang J, Xiao J (2008) Towards a unified approach to document similarity search using manifold-ranking of blocks. Inf Process Manag 44:1032–1048CrossRef
50.
Zurück zum Zitat Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR), pp 46–54 Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR), pp 46–54
51.
Zurück zum Zitat Zeng HJ, He QC, Chen Z, Ma WY, Ma J (2004) Learning to cluster web search results. In: Proceedings of the 27th ACM international conference on research and development in information retrieval (SIGIR), pp 210–217 Zeng HJ, He QC, Chen Z, Ma WY, Ma J (2004) Learning to cluster web search results. In: Proceedings of the 27th ACM international conference on research and development in information retrieval (SIGIR), pp 210–217
52.
Zurück zum Zitat Zhao Y, Karypis G (2004) Empirical and theoretical comparison of selected criterion functions for document clustering. Mach Learn 55(3):311–331MATHCrossRef Zhao Y, Karypis G (2004) Empirical and theoretical comparison of selected criterion functions for document clustering. Mach Learn 55(3):311–331MATHCrossRef
53.
Zurück zum Zitat Zhao Y, Karypis G (2004) Soft clustering criterion functions for partitional document clustering: a summary of results. In: Proceedings of the 13th ACM international conference on information and knowledge management (CIKM), pp 246–247 Zhao Y, Karypis G (2004) Soft clustering criterion functions for partitional document clustering: a summary of results. In: Proceedings of the 13th ACM international conference on information and knowledge management (CIKM), pp 246–247
54.
Zurück zum Zitat Zhao Y, Karypis G, Fayyad UM (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168MathSciNetCrossRef Zhao Y, Karypis G, Fayyad UM (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168MathSciNetCrossRef
55.
Zurück zum Zitat Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384CrossRef Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384CrossRef
Metadaten
Titel
A segment-based approach to clustering multi-topic documents
verfasst von
Andrea Tagarelli
George Karypis
Publikationsdatum
01.03.2013
Verlag
Springer-Verlag
Erschienen in
Knowledge and Information Systems / Ausgabe 3/2013
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-012-0556-z

Weitere Artikel der Ausgabe 3/2013

Knowledge and Information Systems 3/2013 Zur Ausgabe

Premium Partner