Published in: International Journal of Computer Vision 1/2013

01.01.2013

Efficiently Scaling up Crowdsourced Video Annotation

A Set of Best Practices for High Quality, Economical Video Labeling

Authors: Carl Vondrick, Donald Patterson, Deva Ramanan


Abstract

We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented in their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
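The key-frame workflow described above asks workers to label an object only at sparse frames, with the system filling in the frames between. The paper's full system uses pixel-based features and more sophisticated interpolation; as an illustrative baseline only, the sketch below (with a hypothetical `interpolate_boxes` function and `(x, y, w, h)` box convention) shows the simplest variant, linear interpolation of bounding boxes between annotated key frames:

```python
def interpolate_boxes(keyframes):
    """Fill in bounding boxes between sparse, human-labeled key frames.

    keyframes: dict mapping frame index -> (x, y, w, h) box.
    Returns a dict with a box for every frame between the first
    and last key frame, linearly blending adjacent annotations.
    """
    frames = sorted(keyframes)
    boxes = dict(keyframes)  # key frames are kept exactly as labeled
    for a, b in zip(frames, frames[1:]):
        for t in range(a + 1, b):
            alpha = (t - a) / (b - a)  # fraction of the way from a to b
            boxes[t] = tuple(
                (1 - alpha) * va + alpha * vb
                for va, vb in zip(keyframes[a], keyframes[b])
            )
    return boxes
```

A feature-based interpolator would instead score candidate boxes at intermediate frames against appearance features and find the best path (e.g. by dynamic programming), which is what lets the framework trade extra cloud computation for fewer human-labeled key frames.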


Footnotes
1
The software and data sets can be downloaded from our website at http://mit.edu/vondrick/vatic.
 
Metadata
Title
Efficiently Scaling up Crowdsourced Video Annotation
A Set of Best Practices for High Quality, Economical Video Labeling
Authors
Carl Vondrick
Donald Patterson
Deva Ramanan
Publication date
01.01.2013
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2013
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-012-0564-1
