Abstract
Emotion understanding is a core aspect of human communication. Our social behaviours are closely linked to expressing our emotions and understanding others’ emotional and mental
states through social signals. Emotions are expressed in a multisensory manner, where humans
use social signals from different sensory modalities such as facial expression, vocal changes, or
body language. The human brain integrates all relevant information to create a new multisensory
percept and derives emotional meaning.
There is great interest in emotion recognition across fields such as HCI, gaming,
marketing, and assistive technologies. This demand is driving an increase in research on multisensory
emotion recognition. The majority of existing work proceeds by extracting meaningful
features from each modality and applying fusion techniques either at a feature level or decision
level. However, these techniques do not effectively capture the continuous cross-talk and feedback
between different modalities. Such cross-talk is particularly crucial in continuous emotion
recognition, where one modality can predict, enhance and complete the other.
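To make the distinction between feature-level and decision-level fusion concrete, the following is a minimal sketch in plain Python. It is not taken from the thesis: the linear classifiers, the function names, and all feature and weight values are hypothetical, chosen only to illustrate the two fusion points.

```python
from math import exp

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_scores(features, weights):
    """One score per emotion class; `weights` holds one row per class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def early_fusion(audio, visual, joint_weights):
    """Feature-level fusion: concatenate the features, then classify once."""
    return softmax(linear_scores(audio + visual, joint_weights))

def late_fusion(audio, visual, audio_weights, visual_weights):
    """Decision-level fusion: classify each modality, then average the outputs."""
    p_audio = softmax(linear_scores(audio, audio_weights))
    p_visual = softmax(linear_scores(visual, visual_weights))
    return [(a + v) / 2 for a, v in zip(p_audio, p_visual)]

# Hypothetical toy inputs: 2 audio features, 3 visual features, 2 emotion classes.
audio = [0.2, 0.9]
visual = [0.7, 0.1, 0.4]
audio_w = [[1.0, -1.0], [-1.0, 1.0]]
visual_w = [[0.5, 0.0, -0.5], [-0.5, 0.0, 0.5]]
joint_w = [a + v for a, v in zip(audio_w, visual_w)]  # rows over concatenated features

p_early = early_fusion(audio, visual, joint_w)
p_late = late_fusion(audio, visual, audio_w, visual_w)
```

In both variants the modalities meet exactly once, either before or after classification, so neither can feed information back to the other while it is being processed. That missing feedback is the limitation described above.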
This thesis proposes novel architectures for multisensory emotion recognition inspired by
multisensory integration in the brain. First, we explore the use of bio-inspired unsupervised
learning for unisensory emotion recognition for audio and visual modalities. Then we propose
three multisensory integration models, based on different pathways for multisensory integration
in the brain; that is, integration by convergence, early cross-modal enhancement, and integration
through neural synchrony. The proposed models are designed and implemented using third-generation
neural networks, namely Spiking Neural Networks (SNNs), with unsupervised learning. The
models are evaluated using widely adopted, third-party datasets and compared to state-of-the-art
multimodal fusion techniques, such as early, late and deep learning fusion. Evaluation results
show that the three proposed models achieve comparable results to state-of-the-art supervised
learning techniques. More importantly, this thesis presents models that capture continuous
cross-talk between modalities during the training phase. Each modality can predict, complement and
enhance the other using constant feedback. This cross-talk between modalities provides additional
insight into emotions compared to traditional fusion techniques.
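For readers unfamiliar with spiking neural networks, the sketch below simulates a single spiking neuron, which communicates through discrete spike times rather than continuous activations. The abstract does not state which neuron model the thesis uses; the leaky integrate-and-fire model, the function name `simulate_lif`, and every parameter value here are assumptions made purely for illustration.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron with Euler steps.

    Returns the list of time-step indices at which the neuron spiked.
    All parameters are illustrative, not values from the thesis.
    """
    v = v_rest
    spike_times = []
    for t, current in enumerate(input_current):
        # The membrane potential leaks towards rest and is driven by the input.
        v += dt / tau * (-(v - v_rest) + current)
        if v >= v_thresh:
            spike_times.append(t)  # emit a spike
            v = v_reset            # reset the potential after spiking
    return spike_times

# A weak constant input settles below threshold; a strong one produces spikes.
weak_spikes = simulate_lif([0.5] * 100)
strong_spikes = simulate_lif([2.0] * 100)
```

Because the output is a train of spike times, information can be carried in the relative timing of spikes across neurons, which is what makes timing-based notions such as neural synchrony expressible in this kind of network.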
Date of Award | 29 Jul 2020 |
---|---|
Original language | English |
Awarding Institution | |
Supervisor | Juan Ye (Supervisor) |
Keywords
- Multisensory integration
- Spiking neural networks
- Emotions recognition
Access Status
- Full text open