Published in: International Journal on Digital Libraries 3-4/2014

01-08-2014

A system for high quality crowdsourced indigenous language transcription

Authors: Ngoni Munyaradzi, Hussein Suleman


Abstract

This article proposes a crowdsourcing method for transcribing manuscripts from the Bleek and Lloyd Collection, in which non-expert volunteers transcribe pages of handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts to convert the approximately 20,000 pages of text to a machine-readable form used machine learning algorithms but, due to the complexity of the text, the recognition accuracy was low. This article presents details of the system used to enable transcription by volunteers, as well as results from experiments conducted to determine the quality and consistency of the transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. The inter-transcriber agreement is 80% for |Xam text and 95% for English text. When the |Xam text transcriptions produced by the volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75%, which exceeds the accuracy reported in previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy. This suggests that the quality of unseen data can be assessed based on the degree of agreement among transcribers.
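
The two quantities reported above, agreement among transcribers and accuracy against a gold standard, can be illustrated with a minimal sketch. The abstract does not state how either was computed, so everything here is an assumption made for illustration: similarity is taken to be one minus the Levenshtein edit distance divided by the length of the longer string, agreement is the mean pairwise similarity among transcriptions of the same line, and accuracy is the mean similarity to a gold-standard string. The function names (`levenshtein`, `similarity`, `agreement`, `accuracy`) are hypothetical and not taken from the authors' system.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def agreement(transcriptions: list[str]) -> float:
    """Mean pairwise similarity among transcriptions of the same line."""
    pairs = list(combinations(transcriptions, 2))
    if not pairs:
        raise ValueError("need at least two transcriptions")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def accuracy(transcriptions: list[str], gold: str) -> float:
    """Mean similarity of each transcription to a gold-standard string."""
    if not transcriptions:
        raise ValueError("need at least one transcription")
    return sum(similarity(t, gold) for t in transcriptions) / len(transcriptions)
```

Under these assumptions, agreement can be computed for any line with two or more volunteer transcriptions, whereas accuracy needs an expert transcription. That difference is what makes agreement usable as a quality estimate for pages that have no gold standard.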


Metadata
Title
A system for high quality crowdsourced indigenous language transcription
Authors
Ngoni Munyaradzi
Hussein Suleman
Publication date
01-08-2014
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 3-4/2014
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-014-0112-4
