2017 | OriginalPaper | Book chapter

Robust Multi-view Topic Modeling by Incorporating Detecting Anomalies

Written by: Guoxi Zhang, Tomoharu Iwata, Hisashi Kashima

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Abstract

Multi-view text data consist of texts from different sources. For instance, multilingual Wikipedia corpora contain articles in different languages, which are created by different groups of users. Because multi-view text data are often created in a distributed fashion, information from different sources may not be consistent. Such inconsistency introduces noise into the analysis of this kind of data. In this paper, we propose a probabilistic topic model for multi-view data that is robust against noise. The proposed model can also be used for detecting anomalies. In our experiments on Wikipedia data sets, the proposed model is more robust than existing multi-view topic models in terms of held-out perplexity.
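
The abstract compares models by held-out perplexity. As a minimal sketch of how that metric is commonly defined (this is not the authors' evaluation code), perplexity is the exponential of the negative average log-likelihood per held-out token. The function name `held_out_perplexity` and the input format, a flat list of per-token predictive probabilities, are assumptions made for illustration.

```python
import math


def held_out_perplexity(token_probs):
    """Return exp(-(1/N) * sum(log p(w_n))) over N held-out tokens.

    token_probs is a hypothetical input: the probability a trained model
    assigns to each held-out token. Lower perplexity means a better fit.
    """
    if not token_probs:
        raise ValueError("need at least one held-out token")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    # Average log-likelihood per token, then negate and exponentiate.
    avg_log_lik = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_lik)


# A model that spreads probability uniformly over 4 words has perplexity 4.
uniform = held_out_perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is more confident about the observed tokens scores lower.
sharper = held_out_perplexity([0.5, 0.5, 0.25, 0.5])
```

Under this definition, a model that predicts held-out tokens uniformly over a vocabulary of V words has perplexity V, which gives a rough scale for reading reported values.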

Metadata
Title
Robust Multi-view Topic Modeling by Incorporating Detecting Anomalies
Written by
Guoxi Zhang
Tomoharu Iwata
Hisashi Kashima
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-71246-8_15