Published in: International Journal of Computer Vision 1/2013

01.01.2013

Efficiently Scaling up Crowdsourced Video Annotation

A Set of Best Practices for High Quality, Economical Video Labeling

Authors: Carl Vondrick, Donald Patterson, Deva Ramanan


Abstract

We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented in their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
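The key-frame workflow described above asks workers to label an object only at sparse frames, with the system filling in the frames between. The paper's full system uses pixel-based features and more sophisticated interpolation; as an illustrative baseline only, the sketch below (with a hypothetical `interpolate_boxes` function and `(x, y, w, h)` box convention) shows the simplest variant, linear interpolation of bounding boxes between annotated key frames:

```python
def interpolate_boxes(keyframes):
    """Fill in bounding boxes between sparse, human-labeled key frames.

    keyframes: dict mapping frame index -> (x, y, w, h) box.
    Returns a dict with a box for every frame between the first
    and last key frame, linearly blending adjacent annotations.
    """
    frames = sorted(keyframes)
    boxes = dict(keyframes)  # key frames are kept exactly as labeled
    for a, b in zip(frames, frames[1:]):
        for t in range(a + 1, b):
            alpha = (t - a) / (b - a)  # fraction of the way from a to b
            boxes[t] = tuple(
                (1 - alpha) * va + alpha * vb
                for va, vb in zip(keyframes[a], keyframes[b])
            )
    return boxes
```

A feature-based interpolator would instead score candidate boxes at intermediate frames against appearance features and find the best path (e.g. by dynamic programming), which is what lets the framework trade extra cloud computation for fewer human-labeled key frames.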


Footnotes
1
The software and data sets can be downloaded from our website at http://mit.edu/vondrick/vatic.
 
Metadata
Title
Efficiently Scaling up Crowdsourced Video Annotation
A Set of Best Practices for High Quality, Economical Video Labeling
Authors
Carl Vondrick
Donald Patterson
Deva Ramanan
Publication date
01.01.2013
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2013
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-012-0564-1
