
2019 | Book

Proceedings of the 6th Conference on Sound and Music Technology (CSMT)

Revised Selected Papers

Editors: Prof. Dr. Wei Li, Prof. Shengchen Li, Prof. Xi Shao, Prof. Zijin Li

Publisher: Springer Singapore

Book Series: Lecture Notes in Electrical Engineering


About this book

This book discusses the use of advanced techniques to produce and understand music in a digital way. It gathers the first-ever English-language proceedings of the Conference on Sound and Music Technology (CSMT), which was held in Xiamen, China in 2018. As a leading event, the CSMT reflects the latest advances in acoustic and music technologies in China. Sound and technology are more closely linked than most people assume. For example, signal-processing methods form the basis of music feature extraction, while mathematics provides an objective means of representing current musicological theories and discovering new ones. Moreover, machine-learning methods include popular deep learning algorithms and are used in a broad range of contexts, from discovering patterns in music features to producing music. As these proceedings demonstrate, modern technologies not only offer new ways to create music, but can also help people perceive sound in innovative ways.

Table of Contents

Frontmatter

Music Processing and Music Information Retrieval

Frontmatter
A Novel Singer Identification Method Using GMM-UBM
Abstract
This paper presents a novel method for singer identification from polyphonic music audio signals. It is based on the universal background model (UBM), a singer-independent Gaussian mixture model (GMM) trained on many songs to model general singer characteristics. In our system, singing voice separation is first applied to the polyphonic signal to cope with the negative influence of the background accompaniment. A model for each singer is then obtained by adapting the UBM with that singer's Mel-frequency cepstral coefficient (MFCC) features, using maximum a posteriori (MAP) estimation. Singer identification is realized by matching test samples against the adapted models of individual singers. Another major contribution of our work is the presentation of two new large singer identification databases with over 100 singers. The proposed system is evaluated on two public datasets and the two new ones. Results indicate that the UBM-based approach builds more accurate statistical models of a singer's voice than conventional methods. The evaluation carried out on the public dataset shows that our method achieves a 16% improvement in accuracy over the state-of-the-art singer identification system.
Xulong Zhang, Yiliang Jiang, Jin Deng, Juanjuan Li, Mi Tian, Wei Li
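As a rough, hedged illustration of the GMM-UBM pipeline described above, the sketch below trains a UBM on pooled MFCC frames and MAP-adapts its means to one singer. It assumes MFCC matrices have already been extracted (e.g. with librosa); the component count and relevance factor are illustrative placeholders, not the paper's settings.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(mfcc_pool, n_components=64):
    """Train a singer-independent UBM on MFCC frames pooled from many songs."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(mfcc_pool)                               # mfcc_pool: (n_frames, n_mfcc)
    return ubm

def map_adapt_means(ubm, singer_mfcc, relevance=16.0):
    """Classical MAP mean adaptation: shift UBM means toward the singer's data,
    weighted by how much data each Gaussian component is responsible for."""
    resp = ubm.predict_proba(singer_mfcc)            # (n_frames, K) responsibilities
    n_k = resp.sum(axis=0)                           # soft frame count per component
    x_bar = (resp.T @ singer_mfcc) / np.maximum(n_k[:, None], 1e-8)
    alpha = n_k / (n_k + relevance)                  # adaptation coefficients
    return alpha[:, None] * x_bar + (1 - alpha[:, None]) * ubm.means_

def score_singer(ubm, adapted_means, test_mfcc):
    """Score a test excerpt against one singer: reuse the UBM's weights and
    covariances but swap in the MAP-adapted means."""
    gmm = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    gmm.weights_, gmm.covariances_ = ubm.weights_, ubm.covariances_
    gmm.means_ = adapted_means
    gmm.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)
    return gmm.score(test_mfcc)                      # mean log-likelihood per frame

At test time, the identified singer is simply the one whose adapted model yields the highest score for the test excerpt.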
A Practical Singing Voice Detection System Based on GRU-RNN
Abstract
In this paper, we present a practical three-step approach for singing voice detection based on a gated recurrent unit (GRU) recurrent neural network (RNN); the proposed method achieves results comparable to state-of-the-art methods. The mixed signal is first preprocessed by singing voice separation (SVS) with a deep U-Net convolutional network. We then combine four classic features, namely Mel-frequency Cepstral Coefficients (MFCC), Mel-filter bank, Linear Predictive Cepstral Coefficients (LPCC), and chroma. Both long short-term memory (LSTM) and GRU were proposed to address the vanishing-gradient problem in RNNs; we adopt the GRU here. In our experiments, we set the block duration to 120 ms and 720 ms, respectively, and obtain results comparable to or better than those of state-of-the-art methods, although the results on Jamendo are not as good as those on RWC-Pop.
Zhigao Chen, Xulong Zhang, Jin Deng, Juanjuan Li, Yiliang Jiang, Wei Li
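A minimal sketch of a frame-level GRU singing voice detector along the lines described above, assuming the four features have already been extracted and concatenated per frame; the feature dimension and layer sizes are placeholders, not the authors' configuration.

import torch
import torch.nn as nn

class SVDGru(nn.Module):
    """Frame-level singing voice detector: stacked features -> GRU -> per-frame vocal probability."""
    def __init__(self, feat_dim=80, hidden=64, layers=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                            # x: (batch, frames, feat_dim)
        h, _ = self.gru(x)                           # (batch, frames, hidden)
        return torch.sigmoid(self.out(h)).squeeze(-1)

# MFCC, Mel-filter bank, LPCC and chroma would be concatenated along the
# feature axis before being fed to the network.
model = SVDGru()
dummy = torch.randn(4, 100, 80)                      # 4 clips, 100 frames each
probs = model(dummy)                                 # (4, 100) frame-wise vocal probabilities
loss = nn.functional.binary_cross_entropy(probs, torch.randint(0, 2, (4, 100)).float())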
Multimodal Music Emotion Recognition Using Unsupervised Deep Neural Networks
Abstract
In most studies on multimodal music emotion recognition, different modalities are simply combined and used for supervised training. The resulting improvement in experimental results illustrates the correlations between modalities; however, few studies focus on modeling the relationships between the different modal data. In this paper, we propose to model the relationships between modalities (i.e., lyric and audio data) with deep learning methods for multimodal music emotion recognition. Several deep networks are first applied to perform unsupervised feature learning over the multiple modalities. We then design a series of music emotion recognition experiments to evaluate the learned features. The results show that the deep networks perform well at unsupervised feature learning for multimodal data and can model the cross-modal relationships effectively. In addition, we demonstrate a unimodal enhancement experiment, in which better features for one modality (e.g., lyrics) can be learned by the proposed deep network when the other modality (e.g., audio) is also available during unsupervised feature learning.
Jianchao Zhou, Xiaoou Chen, Deshun Yang
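The following sketch shows one way (a shared-representation bimodal autoencoder) to perform unsupervised feature learning over lyric and audio features in the spirit of the abstract above; the architecture and dimensions are assumptions for illustration, not the paper's networks.

import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Unsupervised bimodal feature learning: audio and lyric feature vectors are
    encoded into a shared representation from which both modalities are reconstructed."""
    def __init__(self, audio_dim=128, lyric_dim=300, shared_dim=64):
        super().__init__()
        self.enc_audio = nn.Linear(audio_dim, shared_dim)
        self.enc_lyric = nn.Linear(lyric_dim, shared_dim)
        self.dec_audio = nn.Linear(shared_dim, audio_dim)
        self.dec_lyric = nn.Linear(shared_dim, lyric_dim)

    def forward(self, audio, lyric):
        shared = torch.relu(self.enc_audio(audio) + self.enc_lyric(lyric))
        return self.dec_audio(shared), self.dec_lyric(shared), shared

model = BimodalAutoencoder()
audio, lyric = torch.randn(8, 128), torch.randn(8, 300)
rec_a, rec_l, feats = model(audio, lyric)
# Reconstructing both modalities from the shared code encourages the learned
# features to capture cross-modal correlations; `feats` would then feed an
# emotion classifier for evaluation.
loss = nn.functional.mse_loss(rec_a, audio) + nn.functional.mse_loss(rec_l, lyric)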
Music Summary Detection with State Space Embedding and Recurrence Plot
Abstract
Automatic music summary detection is the task of identifying the most representative part of a song, helping users retrieve the songs they want. In this paper, we propose a novel method based on state space embedding and a recurrence plot. First, an extended audio feature built by state space embedding is extracted to construct a similarity matrix; compared with the raw audio features, this extended feature is more robust against noise. Then, a recurrence plot based on a global strategy is adopted to detect similar segment pairs within a song. Finally, we extract the most repeated part as the summary by selecting and merging the stripes with the lowest distance in the similarity matrix, under constraints on slope and duration. Experimental results show that the proposed algorithm outperforms two competitive baseline methods.
Yongwei Gao, Yichun Shen, Xulong Zhang, Shuai Yu, Wei Li
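A small sketch of the two building blocks the abstract names, state space (delay) embedding of a feature sequence and a thresholded recurrence plot; the chroma features, embedding parameters, and threshold below are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def state_space_embed(features, dim=4, delay=1):
    """Delay-embed a frame-wise feature sequence: each embedded frame stacks
    `dim` consecutive feature vectors, which smooths out frame-level noise."""
    n_frames = features.shape[0] - (dim - 1) * delay
    return np.hstack([features[i * delay : i * delay + n_frames] for i in range(dim)])

def recurrence_plot(embedded, threshold=0.2):
    """Binary recurrence plot: 1 where two embedded frames are closer than
    `threshold` times the maximum pairwise distance."""
    dist = cdist(embedded, embedded)
    return (dist < threshold * dist.max()).astype(np.uint8)

features = np.random.rand(500, 12)        # stand-in for a (n_frames, 12) chroma matrix
rp = recurrence_plot(state_space_embed(features))
# Diagonal stripes off the main diagonal of `rp` mark repeated segments; the most
# repeated, sufficiently long stripe is a natural summary candidate.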
Constructing a Multimedia Chinese Musical Instrument Database
Abstract
Throughout history, more than 2000 Chinese musical instruments have existed or been recorded, and they are of non-negligible importance in Chinese musicology. However, the public knows little about them. In this work, we present a multimedia database of Chinese musical instruments. For each instrument, the database includes text descriptions, images, audio clips of playing techniques, music clips, videos of the craft and recording processes, and acoustic analysis materials. The motivation for the database and its selection criteria are introduced in detail. Potential applications based on the database are discussed, taking research on the subjective auditory attributes of Chinese musical instruments as an example.
Xiaojing Liang, Zijin Li, Jingyu Liu, Wei Li, Jiaxing Zhu, Baoqiang Han
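Purely as an illustration of the media types listed above, a hypothetical per-instrument record might look like the following; the field names are invented for this sketch and are not the database's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class InstrumentRecord:
    """Hypothetical record layout mirroring the media the database collects per instrument."""
    name: str
    description: str                                           # text description
    images: List[str] = field(default_factory=list)            # image file paths
    technique_clips: List[str] = field(default_factory=list)   # audio clips of playing techniques
    music_clips: List[str] = field(default_factory=list)
    process_videos: List[str] = field(default_factory=list)    # craft and recording process videos
    acoustic_analysis: List[str] = field(default_factory=list) # acoustic analysis materials

guqin = InstrumentRecord(name="Guqin", description="Seven-string plucked zither.")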

Acoustic Sound Processing and Analysis

Frontmatter
Bird Sound Detection Based on Binarized Convolutional Neural Networks
Abstract
Bird Sound Detection (BSD) is helpful for monitoring biodiversity, and deep learning networks have shown good performance in BSD in recent years. However, such complex network structures require large memory resources and computing power to perform the extensive calculations involved, which makes hardware implementation of BSD difficult. Therefore, we designed an audio classification method for BSD using a Binarized Convolutional Neural Network (BCNN), in which the convolutional layers and fully connected layers of the original Convolutional Neural Network (CNN) are binarized to two values. This paper proposes two networks (a CNN and a BCNN) for the BSD task of the IEEE AASP Challenge on the Detection and Classification of Acoustic Scenes and Events (DCASE2018). The Area Under ROC Curve (AUC) score of the BCNN is comparable to that of the CNN on the unseen evaluation data. More importantly, the BCNN reduces the memory requirement and hardware resource consumption, which is of great significance for the hardware implementation of a bird sound detection system.
Jianan Song, Shengchen Li
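The core idea of binarizing convolutional weights can be sketched as below with a straight-through estimator; this is a generic weight-binarization layer for illustration, not the authors' exact BCNN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizedConv2d(nn.Conv2d):
    """Conv layer whose weights are binarized to {-1, +1} in the forward pass.
    The straight-through trick keeps full-precision weights for the gradient update."""
    def forward(self, x):
        w = self.weight
        w_bin = w + (torch.sign(w) - w).detach()     # forward: sign(w); backward: gradient flows to w
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = BinarizedConv2d(1, 16, kernel_size=3, padding=1)
spec = torch.randn(4, 1, 64, 64)                     # e.g. log-mel spectrogram patches
out = layer(spec)
# Binary weights can be packed into single bits, and multiply-accumulates replaced by
# XNOR/popcount logic on hardware, which is where the memory and resource savings come from.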
Adaptive Consistent Dictionary Learning for Audio Declipping
Abstract
Clipping is a common problem in audio processing. Clipping distortion can be reduced by the recently proposed consistent dictionary learning (cDL) method, but its restoration performance degrades when the clipping is severe. To improve the performance of cDL, a method based on an adaptive threshold is proposed. In this method, the clipping degree is estimated automatically, and the corresponding factor is adjusted according to the estimated degree. Experiments show the superior performance of the proposed algorithm over cDL on audio signal restoration.
Penglong Wu, Xia Zou, Meng Sun, Li Li, Xingyu Zhang
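As a hedged illustration of estimating the clipping degree directly from a clipped signal, the snippet below takes the peak magnitude as the clipping level and the fraction of samples stuck at that level as the degree; the paper's adaptive-threshold scheme is more elaborate than this.

import numpy as np

def estimate_clipping_degree(x, tol=1e-4):
    """Rough clipping-degree estimate: the clipping level is taken as the signal's
    peak magnitude, and the degree as the fraction of samples stuck at that level."""
    level = np.max(np.abs(x))
    clipped = np.abs(x) >= level - tol
    return level, clipped.mean()

# A declipping routine could then scale its threshold or regularization factor with
# the estimated degree, relaxing the consistency constraint more when clipping is severe.
x = np.clip(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000), -0.6, 0.6)
level, degree = estimate_clipping_degree(x)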
A Comparison of Attention Mechanisms of Convolutional Neural Network in Weakly Labeled Audio Tagging
Abstract
Audio tagging aims to predict the types of sound events occurring in audio clips. Recently, the convolutional recurrent neural network (CRNN) has achieved state-of-the-art performance in audio tagging. In a CRNN, convolutional layers are applied to the input audio features to extract high-level representations, followed by recurrent layers. To better learn these representations, attention mechanisms have been introduced into the convolutional layers of the CRNN. Attention is a technique that steers the model toward the information most relevant to the task, thereby improving performance. Two attention mechanisms used in CRNNs, the Squeeze-and-Excitation (SE) block and the gated linear unit (GLU), are both based on gating but differ in what they attend to. To compare them, we evaluate a CRNN with an SE block (SE-CRNN) and a CRNN with a GLU (GLU-CRNN) on weakly labeled audio tagging and compare the results with a CRNN baseline. The experiments show that the GLU-CRNN achieves an area under curve (AUC) score of 0.877 in polyphonic audio tagging, outperforming the SE-CRNN (0.865) and the CRNN baseline (0.838). These results indicate that GLU-based attention outperforms SE-based attention in CRNNs for weakly labeled polyphonic audio tagging.
Yuanbo Hou, Qiuqiang Kong, Shengchen Li
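For reference, minimal PyTorch sketches of the two gating mechanisms compared above: an SE block (channel-wise gates from global pooling) and a GLU convolution (element-wise gates from a parallel conv output). Channel counts and input shapes are illustrative.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each channel, then a small
    bottleneck MLP produces a sigmoid gate that rescales the channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                            # x: (batch, C, time, freq)
        gate = self.fc(x.mean(dim=(2, 3)))           # squeeze -> (batch, C)
        return x * gate[:, :, None, None]            # excite: per-channel rescaling

class GLUConv(nn.Module):
    """Gated linear unit on a conv layer: one half of the conv output is a sigmoid
    gate applied elementwise to the other half."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)                  # element-wise, time-frequency gating

x = torch.randn(2, 16, 100, 64)                      # (batch, channels, time, mel bins)
se_out = SEBlock(16)(x)
glu_out = GLUConv(16, 16)(x)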

Music Steganography

Frontmatter
A Standard MIDI File Steganography Based on Music Perception in Note Duration
Abstract
Steganography aims to deliver messages in the recording space of a cover medium without being noticed. Supported by the theory of music perception, note duration can serve as the recording space for a steganography method. This paper proposes a steganography method that embeds information in the durations of notes in a MIDI file and generates a new MIDI file that differs from the original only in note durations. Under several restrictions, the allowable variation range of each note is calculated, and the secret information is then mapped onto changes in note duration within that range. A listening test and a capacity evaluation are conducted to measure the transparency and capacity of the proposed method. The results show that the proposed method has perfect transparency with no perceptible difference in listening, zero expansion in file size, and an effective capacity of 3.5%.
Lei Guan, Yinji Jing, Shengchen Li, Ru Zhang
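A toy sketch of duration-based embedding in the spirit of the abstract, using the pretty_midi library: each carrier note's duration is quantized to a small grid and its parity set to the next secret bit. The grid size, shift bound, parity mapping, and file names are illustrative assumptions, not the paper's perceptually constrained scheme.

import pretty_midi

def embed_bits(in_path, out_path, bits, grid=0.01, max_shift=0.02):
    """Toy duration embedding: nudge each note's length (within max_shift seconds)
    so that its duration in grid-sized units has the parity of the next secret bit."""
    pm = pretty_midi.PrettyMIDI(in_path)
    bit_iter = iter(bits)
    for inst in pm.instruments:
        for note in inst.notes:
            try:
                bit = next(bit_iter)
            except StopIteration:
                pm.write(out_path)
                return
            dur = note.end - note.start
            units = max(1, round(dur / grid))
            if units % 2 != bit:
                # lengthen by one unit if the shift stays small, otherwise shorten
                units += 1 if (units + 1) * grid - dur <= max_shift else -1
            note.end = note.start + units * grid
    pm.write(out_path)

# Hypothetical usage: embed_bits("cover.mid", "stego.mid", [1, 0, 1, 1, 0])
# The receiver recovers each bit as round(duration / grid) % 2 per note.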
Metadata
Title
Proceedings of the 6th Conference on Sound and Music Technology (CSMT)
Editors
Prof. Dr. Wei Li
Prof. Shengchen Li
Prof. Xi Shao
Prof. Zijin Li
Copyright Year
2019
Publisher
Springer Singapore
Electronic ISBN
978-981-13-8707-4
Print ISBN
978-981-13-8706-7
DOI
https://doi.org/10.1007/978-981-13-8707-4