Emotion Recognition in Javanese Music: A Comparative Study of Classifier Models with a Human-Annotated Dataset
DOI:
https://doi.org/10.35806/ijoced.v7i2.544

Keywords:
Classification, Javanese music, Machine learning, MFCC feature extraction, Music emotion recognition

Abstract
With advancements in machine learning and the increasing availability of music datasets, Music Emotion Recognition (MER) has gained significant attention. However, research on Indonesian traditional music, particularly Javanese music, remains limited. Understanding emotions in Javanese music is crucial both for preserving cultural heritage and for enabling emotion-aware applications tailored to Indonesian traditional music. This study investigates the effectiveness of three well-established machine learning models, 1D Convolutional Neural Networks (1D-CNNs), Support Vector Machines (SVMs), and XGBoost, in classifying emotions in Javanese music using a manually annotated dataset. The dataset consists of 100 Javanese songs from various genres, including Dangdut, Koplo, and Campur Sari, annotated according to the Thayer emotion model. The models' performance was assessed under different train-test split ratios, with all models achieving accuracy rates above 70%. Among the tested classifiers, SVM exhibited the highest and most stable accuracy.
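The pipeline described in the abstract (MFCC features fed to an SVM classifier over Thayer-model emotion classes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the quadrant labels, feature dimensionality, and all hyperparameters are assumptions, and the MFCC vectors are stubbed with synthetic data (in practice they would come from an audio library such as librosa's `librosa.feature.mfcc`).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed labels: the four quadrants of Thayer's arousal-valence model.
QUADRANTS = ["contentment", "depression", "exuberance", "anxiety"]

# Stand-in for per-song mean MFCC vectors (13 coefficients each).
# Real features would be computed from audio, e.g. with
# librosa.feature.mfcc(y=waveform, sr=sample_rate).mean(axis=1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(25, 13)) for i in range(4)])
y = np.repeat(np.arange(4), 25)  # 100 "songs", matching the dataset size

# One of the data-split ratios evaluated in the study (80/20).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features, then fit an RBF-kernel SVM (assumed kernel/C).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```

Swapping the `SVC` step for an XGBoost or 1D-CNN classifier while keeping the same feature matrix would reproduce the comparative setup at a sketch level.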
References
Allamy, S., & Koerich, A. L. (2021, December). 1D CNN architectures for music genre classification. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-7). IEEE.
Ansari, M. R., Tumpa, S. A., Raya, J. A. F., & Murshed, M. N. (2021, September). Comparison between support vector machine and random forest for audio classification. In 2021 International Conference on Electronics, Communications and Information Technology (ICECIT) (pp. 1-4). IEEE.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
Cheetham, M., Wu, L., Pauli, P., & Jancke, L. (2015). Arousal, valence, and the uncanny valley: Psychophysiological and self-report findings. Frontiers in psychology, 6, 981.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., ... & Zhou, T. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4), 1-4.
Dutta, D., Porel, S., Tah, D., & Dutta, P. (2024). One-Dimensional Convolutional Neural Network for Data Classification. In Advanced Technologies for Realizing Sustainable Development Goals: 5G, AI, Big Data, Blockchain, and Industry 4.0 Application (pp. 37-62). Bentham Science Publishers.
Gupta, S., Jaafar, J., Ahmad, W. W., & Bansal, A. (2013). Feature extraction using MFCC. Signal & Image Processing: An International Journal, 4(4), 101-108.
Kaur, J., & Kumar, A. (2021). Speech emotion recognition using CNN, k-NN, MLP and random forest. In Computer Networks and Inventive Communication Technologies: Proceedings of Third ICCNCT 2020 (pp. 499-509). Springer Singapore.
Matsumoto, T., Yokohama, T., Suzuki, H., Furukawa, R., Oshimoto, A., Shimmi, T., ... & Chua, L. O. (1990, December). Several image processing examples by CNN. In IEEE International Workshop on Cellular Neural Networks and their Applications (pp. 100-111). IEEE.
Naveenkumar, M., & Kaliappan, V. K. (2019, November). Audio based emotion detection and recognizing tool using mel frequency based cepstral coefficient. In Journal of Physics: Conference Series (Vol. 1362, No. 1, p. 012063). IOP Publishing.
Nissar, I., Rizvi, D. R., Masood, S., & Mir, A. N. (2019). Voice-based detection of Parkinson's disease through ensemble machine learning approach: A performance study. EAI Endorsed Transactions on Pervasive Health and Technology, 5(19), e2.
Passricha, V., & Aggarwal, R. K. (2019). A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition. Journal of Intelligent Systems, 29(1), 1261-1274.
Priyambudi, Z. S., & Nugroho, Y. S. (2024, January). Which algorithm is better? An implementation of normalization to predict student performance. In AIP Conference Proceedings (Vol. 2926, No. 1, p. 020110). AIP Publishing LLC.
Rezaul, K. M., Jewel, M., Islam, M. S., Siddiquee, K. N. E. A., Barua, N., Rahman, M. A., ... & Asha, U. F. T. (2024). Enhancing Audio Classification Through MFCC Feature Extraction and Data Augmentation with CNN and RNN Models. International Journal of Advanced Computer Science and Applications, 15(7), 37-53.
Rodríguez, P., Bautista, M. A., Gonzalez, J., & Escalera, S. (2018). Beyond one-hot encoding: Lower dimensional target embedding. Image and Vision Computing, 75, 21-31.
Russell, J. A. (1980). A circumplex model of affect. Journal of personality and social psychology, 39(6), 1161.
Salcedo‐Sanz, S., Rojo‐Álvarez, J. L., Martínez‐Ramón, M., & Camps‐Valls, G. (2014). Support vector machines in engineering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(3), 234-267.
Scherer, K. R., Clark-Polner, E., & Mortillaro, M. (2011). In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. International Journal of Psychology, 46(6), 401-435.
Sloboda, J. A., & Juslin, P. N. (2001). Psychological perspectives on music and emotion. Music and emotion: Theory and research, 71-104.
Thayer, R. E. (1990). The biopsychology of mood and arousal. Oxford University Press.
Yang, Y.-H., & Chen, H. H. (2011). Music emotion recognition (1st ed.). CRC Press. https://doi.org/10.1201/b10731