research-article

Fully perceptual-based 3D spatial sound individualization with an adaptive variational autoencoder

Authors:
Kazuhiko Yamamoto (The University of Tokyo), Takeo Igarashi (The University of Tokyo)

ACM Transactions on Graphics, Volume 36, Issue 6, Article No. 212, pp. 1–13
https://doi.org/10.1145/3130800.3130838

Published: 20 November 2017

Abstract

To render 3D spatial sound over two-channel headphones, one needs head-related transfer functions (HRTFs) tailored to the specific user. However, measuring HRTFs is a tedious and expensive procedure. To address this, we propose a fully perceptual-based HRTF fitting method for individual users using machine learning techniques. During calibration, the user only needs to answer pairwise comparisons of test signals presented by the system. This reduces the effort required for the user to obtain individualized HRTFs. Technically, we present a novel adaptive variational autoencoder with a convolutional neural network. In training, this autoencoder analyzes a publicly available HRTF dataset and identifies factors that depend on the individuality of users in a nonlinear space. In calibration, the autoencoder generates high-quality HRTFs fitted to a specific user by blending these factors. We validate the feasibility of our method through several quantitative experiments and a user study.
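The calibration idea in the abstract, fitting a user-specific latent vector from nothing but pairwise "which sounds better?" answers, can be illustrated with a toy sketch. This is not the authors' method: the abstract does not specify the network or the optimizer, so the decoder here is a hypothetical fixed random map, the listener is simulated by distance to a hidden target, and the search is a plain hill climb. All names (`make_decoder`, `simulated_listener`, `calibrate`, `LATENT_DIM`, `HRTF_BINS`) are assumptions made for illustration.

```python
import math
import random

LATENT_DIM = 4   # assumed size of the per-user latent vector
HRTF_BINS = 16   # assumed number of frequency bins in the toy response


def make_decoder(rng):
    """Toy stand-in for a trained decoder: latent vector -> response curve."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
               for _ in range(HRTF_BINS)]

    def decode(z):
        return [math.tanh(sum(w * x for w, x in zip(row, z)))
                for row in weights]

    return decode


def distance(a, b):
    """Euclidean distance between two response curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simulated_listener(target):
    """Answer 'is the first candidate better than the second?'.

    A real listener would judge by ear; here the preference is simulated
    as closeness to a hidden target response (an assumption).
    """
    def prefers_first(hrtf_a, hrtf_b):
        return distance(hrtf_a, target) < distance(hrtf_b, target)

    return prefers_first


def calibrate(decode, prefers_first, rng, rounds=200, step=0.5):
    """Hill-climb in latent space using only pairwise answers."""
    best = [0.0] * LATENT_DIM
    for _ in range(rounds):
        candidate = [b + step * rng.gauss(0.0, 1.0) for b in best]
        # Keep the candidate only if the listener prefers it to the current best.
        if prefers_first(decode(candidate), decode(best)):
            best = candidate
    return best


rng = random.Random(0)
decode = make_decoder(rng)
hidden_z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
target = decode(hidden_z)
listener = simulated_listener(target)

start_error = distance(decode([0.0] * LATENT_DIM), target)
fitted_z = calibrate(decode, listener, rng)
final_error = distance(decode(fitted_z), target)
```

Because a candidate is accepted only when the simulated listener prefers it, `final_error` can never exceed `start_error`. The search never sees the target or any numeric score, only binary preferences, which is the property the abstract emphasizes.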

Supplemental Material

a212-yamamoto.mp4 (MP4, 61.9 MB)


Index Terms

Fully perceptual-based 3D spatial sound individualization with an adaptive variational autoencoder
1. Applied computing
  1. Arts and humanities
    1. Sound and music computing


Published in
ACM Transactions on Graphics, Volume 36, Issue 6
December 2017
973 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3130800
Editor: Kavita Bala
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3d spatial sound rendering
deep neural network
optimization
sound design in a virtual environment