Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network

Goel, DP; Mahajan, K; Nguyen, ND; Srinivasan, N; Lim, Chee Peng

File(s) under permanent embargo

journal contribution

posted on 2022-11-22, 02:49 authored by DP Goel, K Mahajan, ND Nguyen, N Srinivasan, Chee Peng Lim

Speech emotion recognition (SER), which plays a critical role in human-machine interaction, has attracted a great deal of research interest. Unlike visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when using log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation from several causes, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles. However, those methods are often unstable, hard to configure, and not adaptable to different tasks. In this research, we propose a reusable backbone built from CNN blocks for SER tasks, inspired by the FishNet model. Denoted as deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that the proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in a speaker-independent evaluation on the RAVDESS dataset with 4 classes and 8 classes, respectively.
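For readers unfamiliar with the input representation named in the abstract, the following is a minimal sketch of how a log-mel spectrogram can be computed from a waveform using NumPy alone. It is not taken from the article: the function names, the 16 kHz sample rate, the 512-point FFT, the 160-sample hop, and the 40 mel bands are all illustrative assumptions, and the authors' actual preprocessing may differ.

```python
import numpy as np


def hz_to_mel(hz):
    # Common HTK-style mel-scale formula (an assumed choice for this sketch).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    # Hypothetical parameter defaults; not the settings used in the article.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # Small offset avoids log(0); result is laid out as (n_mels, n_frames).
    return np.log(mel_energy + 1e-10).T


# Usage: one second of a 440 Hz tone as a stand-in for a speech utterance.
t = np.arange(16000) / 16000.0
tone = np.sin(2.0 * np.pi * 440.0 * t)
features = log_mel_spectrogram(tone)
print(features.shape)
```

The resulting two-dimensional array (mel bands by time frames) is the kind of image-like input that a CNN backbone would consume in an SER pipeline.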

History

Journal

Neural Computing and Applications

Location

Berlin, Germany

Publisher DOI

https://doi.org/10.1007/s00521-022-07723-2

ISSN

0941-0643

eISSN

1433-3058

Language

English

Author URL

https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000846813700002&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=a045e4b2bb1f2b747c68c720ec8913b7

Publication classification

C1 Refereed article in a scholarly journal

Publisher

Springer
