Abstract
Rendering 3D spatial sound over two-channel headphones requires head-related transfer functions (HRTFs) tailored to the specific user. However, measuring HRTFs is a tedious and expensive procedure. To address this, we propose a fully perceptual-based HRTF fitting method for individual users using machine learning techniques. During calibration, the user only needs to answer pairwise comparisons of test signals presented by the system, which reduces the effort required to obtain individualized HRTFs. Technically, we present a novel adaptive variational autoencoder with a convolutional neural network. In training, this autoencoder analyzes a publicly available HRTF dataset and identifies, in a nonlinear space, the factors that depend on the individuality of users. In calibration, the autoencoder generates high-quality HRTFs fitted to a specific user by blending these factors. We validate the feasibility of our method through several quantitative experiments and a user study.
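The calibration idea in the abstract, in which the user answers pairwise comparisons and the system adjusts how the learned factors are blended, can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify the optimizer, so a simple keep-the-preferred-candidate random search is swapped in here, the autoencoder is not modeled at all, and every name (`calibrate`, `prefers_b`, `simulated_listener`, `target`) is hypothetical.

```python
import random

def calibrate(prefers_b, dim=4, rounds=200, step=0.3, seed=0):
    """Pairwise-comparison search over blend weights (hypothetical sketch).

    prefers_b(a, b) stands in for the listener: it returns True when the
    test signal rendered with weights b is preferred over the one with a.
    """
    rng = random.Random(seed)
    best = [0.0] * dim  # start from a neutral blend
    for _ in range(rounds):
        # Propose a nearby candidate by perturbing the current weights.
        candidate = [w + rng.gauss(0.0, step) for w in best]
        # Keep whichever of the two the listener prefers.
        if prefers_b(best, candidate):
            best = candidate
    return best

# Simulated listener with a hidden "ideal" blend, assumed only for this demo.
target = [0.5, -0.2, 0.1, 0.8]

def distance(w):
    """Squared distance from the hidden ideal blend."""
    return sum((x - t) ** 2 for x, t in zip(w, target))

def simulated_listener(a, b):
    return distance(b) < distance(a)

weights = calibrate(simulated_listener)
```

In a real system the weights would be passed to a trained decoder to synthesize the HRTFs behind each test signal, and `prefers_b` would be the user's actual answer rather than a distance check.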
Index Terms
- Fully perceptual-based 3D spatial sound individualization with an adaptive variational autoencoder