As a subfield of Multimedia Information Retrieval (MIR), Singer IDentification (SID) remains an open research problem. On one hand, SID cannot easily achieve high accuracy because the singing voice is difficult to model and is often masked by the background instrumental music. On the other hand, the performance of conventional machine learning methods is limited by the scale of the training dataset. This study proposes a new deep learning approach that combines Long Short-Term Memory (LSTM) networks with Mel-Frequency Cepstral Coefficient (MFCC) features to identify the singer of a song in large datasets. The results of this study indicate that LSTM can learn a representation of the temporal relationships between different MFCC frames. The experimental results show that the proposed method achieves better accuracy for Chinese SID on the MIR-1K dataset than traditional approaches.
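The paper does not publish its implementation here, but the core idea of the abstract, an LSTM consuming a sequence of MFCC frames and summarizing it into a fixed-length vector that can feed a singer classifier, can be sketched as follows. This is a minimal NumPy illustration of a single LSTM cell, not the authors' model; the dimensions (13 MFCC coefficients, hidden size 8) and the random stand-in frames are assumptions for the toy run.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step over a single MFCC frame.

    x: (d,) input frame; h_prev, c_prev: (n,) previous hidden/cell state.
    W: (4n, d), U: (4n, n), b: (4n,) stacked gate parameters
    in the order [input, forget, cell, output].
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])        # input gate: how much new input to admit
    f = sigmoid(z[n:2 * n])    # forget gate: how much old memory to keep
    g = np.tanh(z[2 * n:3 * n])  # candidate cell state
    o = sigmoid(z[3 * n:4 * n])  # output gate
    c = f * c_prev + i * g     # new cell state mixes memory and input
    h = o * np.tanh(c)         # new hidden state
    return h, c

# Toy run: 13-dimensional MFCC frames, hidden size 8, 20 frames.
rng = np.random.default_rng(0)
d, n, T = 13, 8, 20
W = rng.standard_normal((4 * n, d)) * 0.1
U = rng.standard_normal((4 * n, n)) * 0.1
b = np.zeros(4 * n)

h = np.zeros(n)
c = np.zeros(n)
frames = rng.standard_normal((T, d))  # stand-in for real MFCC frames
for x in frames:
    h, c = lstm_step(x, h, c, W, U, b)

# The final hidden state summarizes the whole frame sequence; in a full
# SID system it would feed a softmax layer over singer identities.
print(h.shape)  # (8,)
```

The key point this illustrates is why LSTM suits the task the abstract describes: the cell state `c` carries information across frames, so the final hidden state depends on the relationships between MFCC frames rather than on any single frame in isolation.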
This work was supported by the National Natural Science Foundation of China (Nos. 61402210 and 60973137), the Program for New Century Excellent Talents in University (No. NCET-12-0250), the Major Project of High Resolution Earth Observation System (No. 30-Y20A34-9010-15/17), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA03030100), the Gansu Sci. & Tech. Program (Nos. 1104GKCA049, 1204GKCA061, and 1304GKCA018), the Fundamental Research Funds for the Central Universities (No. lzujbky-2016-140), and Google Research Awards and Google Faculty Research Awards, China. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Jetson TX1 used for this research.