Published in: International Journal of Computer Vision, Issue 3/2020

05 August 2019

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

Authors: David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

Abstract

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically coupled object and word detectors.
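To make the training signal concrete, the following is a minimal sketch (in PyTorch, not the authors' released code) of a matchmap-style objective consistent with the description above: a spatial grid of image embeddings is compared against a temporal sequence of audio-frame embeddings via dot products, the resulting similarity volume is pooled to a scalar score, and a margin ranking loss pushes true image-caption pairs above sampled impostor pairs. The function names, the max-over-image/average-over-audio pooling, the margin value, and the impostor-sampling scheme are illustrative simplifications.

```python
# Sketch of a matchmap-style audio-visual similarity with a margin ranking
# objective. Encoder networks are omitted; shapes, pooling, and impostor
# sampling are simplified stand-ins, not the authors' implementation.
import torch
import torch.nn.functional as F

def matchmap(img_feats: torch.Tensor, aud_feats: torch.Tensor) -> torch.Tensor:
    """img_feats: (H, W, D) spatial grid of image embeddings.
    aud_feats: (T, D) temporal sequence of audio-frame embeddings.
    Returns an (H, W, T) volume of dot-product similarities."""
    return torch.einsum('hwd,td->hwt', img_feats, aud_feats)

def pooled_similarity(mm: torch.Tensor) -> torch.Tensor:
    """Take the best-matching image location for each audio frame, then
    average over frames, collapsing the matchmap to a scalar score."""
    return mm.amax(dim=(0, 1)).mean()

def ranking_loss(img_batch, aud_batch, margin=1.0):
    """Each true image/caption pair should outscore pairs formed with an
    impostor image and an impostor caption by at least `margin`."""
    B = img_batch.shape[0]
    loss = img_batch.new_zeros(())
    for i in range(B):
        j = (i + 1) % B  # naive impostor choice; real training samples within the batch
        s_true = pooled_similarity(matchmap(img_batch[i], aud_batch[i]))
        s_imp_img = pooled_similarity(matchmap(img_batch[j], aud_batch[i]))
        s_imp_aud = pooled_similarity(matchmap(img_batch[i], aud_batch[j]))
        loss = loss + F.relu(margin - s_true + s_imp_img) \
                    + F.relu(margin - s_true + s_imp_aud)
    return loss / B
```

Because the scalar score is pooled from a dense (H, W, T) similarity volume, the intermediate matchmap itself can be inspected or thresholded after training, which is one way the associative localizations described in the abstract can be read out without any localization supervision.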

References
Alishahi, A., Barking, M., & Chrupala, G. (2017). Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of the ACL conference on natural language learning (CoNLL).
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., et al. (2015). VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision (ICCV).
Arandjelovic, R., & Zisserman, A. (2017). Look, listen, and learn. In Proceedings of the IEEE international conference on computer vision (ICCV).
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In Proceedings of the neural information processing systems (NeurIPS).
Bergamo, A., Bazzani, L., Anguelov, D., & Torresani, L. (2014). Self-taught object localization with deep networks. CoRR. arXiv:1409.3964.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a "siamese" time delay neural network. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems (Vol. 6, pp. 737–744). Burlington: Morgan-Kaufmann.
Cho, M., Kwak, S., Schmid, C., & Ponce, J. (2015). Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Chrupala, G., Gelderloos, L., & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Cinbis, R., Verbeek, J., & Schmid, C. (2016). Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 189–203.
de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. C. (2017). GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. CoRR. arXiv:1505.05192.
Drexler, J., & Glass, J. (2017). Analysis of audio-visual features for unsupervised speech recognition. In Proceedings of the grounded language understanding workshop.
Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173, 43–59.
Faghri, F., Fleet, D. J., Kiros, J. R., & Fidler, S. (2018). VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British machine vision conference (BMVC).
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollar, P., et al. (2015). From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Fellbaum, C. (1998). WordNet: An electronic lexical database. Bradford: Bradford Books.
Gao, H., Mao, J., Zhou, J., Huang, Z., & Yuille, A. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In Proceedings of the neural information processing systems (NeurIPS).
Gelderloos, L., & Chrupala, G. (2016). From phonemes to images: Levels of representation in a recurrent neural model of visually-grounded language learning. arXiv:1610.03342.
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., et al. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the international conference on acoustics, speech and signal processing (ICASSP).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Guérin, J., Gibaru, O., Thiery, S., & Nyiri, E. (2017). CNN features are also great at unsupervised classification. CoRR. arXiv:1707.01700.
Harwath, D., & Glass, J. (2017). Learning word-like units from joint audio-visual analysis. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., & Glass, J. (2018). Jointly discovering visual objects and spoken words from raw sensory input. In Proceedings of the European conference on computer vision (ECCV).
Harwath, D., Torralba, A., & Glass, J. R. (2016). Unsupervised learning of spoken language with visual context. In Proceedings of the neural information processing systems (NeurIPS).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the international conference on machine learning (ICML).
Jansen, A., Church, K., & Hermansky, H. (2010). Toward spoken term discovery at scale with zero resources. In Proceedings of the annual conference of the international speech communication association (INTERSPEECH).
Jansen, A., Plakal, M., Pandya, R., Ellis, D. P., Hershey, S., Liu, J., et al. (2018). Unsupervised learning of semantic audio representations. In Proceedings of the international conference on acoustics, speech and signal processing (ICASSP).
Jansen, A., & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. In Proceedings of the IEEE workshop on automatic speech recognition and understanding (ASRU).
Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Kamper, H., Elsner, M., Jansen, A., & Goldwater, S. (2015). Unsupervised neural network based feature extraction using weak top-down constraints. In Proceedings of the international conference on acoustics, speech and signal processing (ICASSP).
Kamper, H., Jansen, A., & Goldwater, S. (2016). Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE Transactions on Audio, Speech and Language Processing, 24(4), 669–679.
Kamper, H., Settle, S., Shakhnarovich, G., & Livescu, K. (2017). Visually grounded learning of keyword prediction from untranscribed speech. In Proceedings of the annual conference of the international speech communication association (INTERSPEECH).
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the neural information processing systems (NeurIPS).
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Perona, P., et al. (2015). Microsoft COCO: Common objects in context. arXiv:1405.0312.
Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In Proceedings of the neural information processing systems (NeurIPS).
Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE international conference on computer vision (ICCV).
Ondel, L., Burget, L., & Cernocky, J. (2016). Variational inference for acoustic unit discovery. In 5th workshop on spoken language technology for under-resourced languages.
Owens, A., Isola, P., McDermott, J. H., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016a). Visually indicated sounds. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016b). Ambient sound provides supervision for visual learning. In Proceedings of the European conference on computer vision (ECCV).
Park, A., & Glass, J. (2008). Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech and Language Processing, 16(1), 186–197.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Reed, S. E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Generative adversarial text to image synthesis. CoRR. arXiv:1605.05396.
Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In Proceedings of the neural information processing systems (NeurIPS).
Renshaw, D., Kamper, H., Jansen, A., & Goldwater, S. (2015). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. In Proceedings of the annual conference of the international speech communication association (INTERSPEECH).
Roy, D. (2003). Grounded spoken language acquisition: Experiments in word learning. IEEE Transactions on Multimedia, 5(2), 197–209.
Roy, D., & Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26, 113–146.
Russell, B., Efros, A., Sivic, J., Freeman, W., & Zisserman, A. (2006). Using multiple segmentations to discover objects and their extent in image collections. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Shih, K. J., Singh, S., & Hoiem, D. (2015). Where to look: Focus regions for visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR. arXiv:1409.1556.
Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proceedings of the annual conference of the international speech communication association (INTERSPEECH).
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2015). The new data and new challenges in multimedia research. CoRR. arXiv:1503.01817.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Weber, M., Welling, M., & Perona, P. (2010). Towards automatic discovery of object categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the international conference on machine learning (ICML).
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD international conference on management of data (pp. 103–114).
Zhang, Y., Salakhutdinov, R., Chang, H. A., & Glass, J. (2012). Resource configurable spoken query detection using deep Boltzmann machines. In Proceedings of the international conference on acoustics, speech and signal processing (ICASSP).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene CNNs. In Proceedings of the international conference on learning representations (ICLR).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Proceedings of the neural information processing systems (NeurIPS).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Metadata
Title
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
Authors
David Harwath
Adrià Recasens
Dídac Surís
Galen Chuang
Antonio Torralba
James Glass
Publication date
05 August 2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01205-0
