2015 | Original Paper | Book Chapter

Clustering-Based Topic Identification of Transcribed Arabic Broadcast News

Authors: Ahmed Abdelaziz Jafar, Mohamed Waleed Fakhr, Mohamed Hesham Farouk

Published in: New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering

Publisher: Springer International Publishing

Abstract

In this research, different clustering techniques are applied to group transcribed text documents obtained from audio streams. Since audio transcripts are normally highly erroneous, it is essential to reduce the negative impact of errors introduced at the speech recognition stage. In an attempt to overcome some of these errors, different stemming techniques are applied to the transcribed text. The goal of this research is to achieve automatic topic clustering of transcribed speech documents and to investigate how applying stemming techniques in combination with a Chi-square similarity measure affects the accuracy of the selected clustering algorithms. The evaluation, using F-measure, showed that root-based stemming in combination with spectral clustering achieved the highest accuracy.
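The abstract does not say which variant of the F-measure was used. As a rough illustration only, the sketch below implements one common clustering F-measure: for each reference topic, take the best F1 over all clusters, then average these best scores weighted by topic size. The function name `clustering_f_measure` and the toy labels are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Overall F-measure of a clustering against reference topic labels.

    Assumption: the class-weighted "best match" definition is used, which
    may differ from the exact variant the authors evaluated with.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("need at least one document")

    class_sizes = Counter(true_labels)                # documents per reference topic
    cluster_sizes = Counter(cluster_ids)              # documents per cluster
    overlap = Counter(zip(true_labels, cluster_ids))  # documents per (topic, cluster)

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each topic's best F1 by its share of the documents.
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    # Toy example: two topics, four documents, one document misplaced.
    topics = ["politics", "politics", "sports", "sports"]
    clusters = [0, 0, 0, 1]
    print(round(clustering_f_measure(topics, clusters), 4))
```

A score of 1.0 means every topic is recovered exactly by some cluster; lower values indicate topics that are split across clusters or merged with other topics.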

Metadata
Title
Clustering-Based Topic Identification of Transcribed Arabic Broadcast News
Authors
Ahmed Abdelaziz Jafar
Mohamed Waleed Fakhr
Mohamed Hesham Farouk
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-06764-3_32
