Abstract
Emotion understanding represents a core aspect of human communication. Our social behaviours are closely linked to expressing our emotions and to understanding others' emotional and mental states through social signals. The majority of existing work proceeds by extracting meaningful features from each modality and applying fusion techniques at either the feature level or the decision level. However, these techniques are incapable of translating the constant talk and feedback between different modalities. Such constant talk is particularly important in continuous emotion recognition, where one modality can predict, enhance and complement the other. This paper proposes three multisensory integration models, based on different pathways of multisensory integration in the brain: integration by convergence, early cross-modal enhancement, and integration through neural synchrony. The proposed models are designed and implemented using third-generation neural networks, Spiking Neural Networks (SNNs). The models are evaluated on widely adopted, third-party datasets and compared to state-of-the-art multimodal fusion techniques, such as early, late and deep learning fusion. Evaluation results show that the three proposed models achieve results comparable to state-of-the-art supervised learning techniques. More importantly, this paper demonstrates plausible ways of translating the constant talk between modalities during the training phase, which also brings advantages in generalisation and robustness to noise.
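The paper itself does not publish code here; purely as an illustration of the first integration pathway named above (integration by convergence), the minimal NumPy sketch below shows spike trains from two modalities projecting onto one shared leaky integrate-and-fire population. All sizes, parameters and the LIF formulation are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' model): integration by convergence,
# where spikes from two modalities drive one shared leaky integrate-and-fire
# (LIF) population. All parameters below are assumed for the example.
import numpy as np

rng = np.random.default_rng(0)

T = 200          # simulation time steps
n_audio = 20     # spiking inputs from the audio modality (assumed size)
n_video = 20     # spiking inputs from the video modality (assumed size)
n_out = 10       # shared integration population (assumed size)

# Hypothetical input spike trains, one per modality (Bernoulli-coded here).
audio_spikes = rng.random((T, n_audio)) < 0.05
video_spikes = rng.random((T, n_video)) < 0.05

# Convergent projections: both modalities connect to the same output neurons.
w_audio = rng.normal(0.0, 0.5, (n_audio, n_out))
w_video = rng.normal(0.0, 0.5, (n_video, n_out))

v = np.zeros(n_out)        # membrane potentials
tau, v_thresh = 0.9, 1.0   # leak factor and firing threshold (assumed)
out_spikes = np.zeros((T, n_out), dtype=bool)

for t in range(T):
    # Leaky integration of the summed input current from both modalities.
    i_in = audio_spikes[t] @ w_audio + video_spikes[t] @ w_video
    v = tau * v + i_in
    # Fire where the threshold is crossed, then reset those neurons.
    out_spikes[t] = v >= v_thresh
    v[out_spikes[t]] = 0.0

print("output firing rate per neuron:", out_spikes.mean(axis=0))
```

In this toy form, the two modalities interact continuously through the shared membrane potentials rather than being combined once at a feature or decision stage, which is the contrast with early and late fusion that the abstract draws.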
| Original language | English |
| --- | --- |
| Number of pages | 13 |
| Journal | IEEE Transactions on Affective Computing |
| Volume | Early Access |
| Early online date | 19 Aug 2021 |
| DOIs | |
| Publication status | E-pub ahead of print - 19 Aug 2021 |
Keywords
- Spiking neural network
- Multisensory integration
- Emotion recognition
- Neural synchrony
- Graph neural network