
2017 | OriginalPaper | Book chapter

The Problem of First Story Detection in Multiaspect Text Categorization

Authors: Sławomir Zadrożny, Janusz Kacprzyk, Marek Gajewski

Published in: Information Technology and Computational Physics

Publisher: Springer International Publishing

Abstract

The new concept of multiaspect text categorization (MTC), recently introduced in a series of our papers, may be viewed as a combination of classic text categorization (TC) and a form of sequential data classification. The first aspect of the problem, i.e., the assignment of a document to a category, may be addressed using one of the well-known techniques such as the k-nearest neighbors method. The second aspect is less standard and boils down to the assignment of a document to one of the sequences of documents, called cases, maintained within a category. Cases cannot be treated in the same way as categories because, first, each case is a set of documents ordered by time of arrival, and second, cases are usually represented in a training dataset by a relatively small number of documents. Moreover, it is assumed that new cases can emerge during the lifetime of the document collection. Hence, the assignment of a document to a case is a challenging task in itself, and deciding whether a document starts a new case is even more difficult. In this paper, we deal with the latter problem, discussing it in the broader perspective of sequential data mining and comparing a number of approaches to solving it.
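To make the two aspects described in the abstract concrete, the following is a minimal Python sketch, not the authors' method. It assumes bag-of-words vectors, cosine similarity, a majority vote among the k nearest documents for the category, and a fixed similarity threshold for declaring a new case. The function names (`bow`, `cosine`, `classify`), the toy `collection`, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, collection, k=3, new_case_threshold=0.3):
    """Return (category, case_id); case_id is None when the document
    is judged to start a new case (a "first story").

    collection: list of (text, category, case_id) tuples.
    new_case_threshold: hypothetical cut-off, not taken from the paper.
    """
    vec = bow(doc)
    scored = sorted(
        ((cosine(vec, bow(text)), cat, case) for text, cat, case in collection),
        key=lambda item: item[0],
        reverse=True,
    )
    # Aspect 1: category by majority vote among the k nearest documents.
    votes = Counter(cat for _, cat, _ in scored[:k])
    category = votes.most_common(1)[0][0]
    # Aspect 2: most similar existing case within the chosen category.
    best_sim, best_case = 0.0, None
    for sim, cat, case in scored:
        if cat == category and sim > best_sim:
            best_sim, best_case = sim, case
    # First story detection: too dissimilar to every known case.
    if best_sim < new_case_threshold:
        return category, None
    return category, best_case


# Toy collection, invented purely for illustration.
collection = [
    ("engine oil pressure warning light", "automotive", "case-1"),
    ("engine oil leak repair garage", "automotive", "case-1"),
    ("brake pads worn replace brake discs", "automotive", "case-2"),
    ("bank loan interest rate mortgage", "finance", "case-3"),
    ("mortgage interest rate rises bank", "finance", "case-3"),
]

print(classify("oil pressure warning on engine", collection))
print(classify("tyre puncture spare wheel engine", collection))
```

On this toy data the first document is assigned to an existing case, while the second falls below the threshold and is flagged as starting a new case. The sketch ignores the ordering of documents within a case, which the abstract identifies as one of the features that distinguishes cases from categories.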

Metadata
Title
The Problem of First Story Detection in Multiaspect Text Categorization
Authors
Sławomir Zadrożny
Janusz Kacprzyk
Marek Gajewski
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-44260-0_1