dc.description.abstract | Emotion recognition has garnered significant attention in fields such as mental health, human-computer interaction, and personalized services. This research explores a multimodal approach to emotion recognition that integrates facial expression analysis and speech prosody to achieve a more accurate and context-sensitive understanding of human emotions. A distinctive aspect of this study is the creation of a custom video dataset designed specifically for facial expression recognition, capturing a wide range of emotional states under varied real-world conditions. In parallel, speech emotion detection is performed on publicly available audio datasets, analyzing features such as pitch, tone, and rhythm to discern vocally expressed emotions.
Facial expression recognition is based on Convolutional Neural Networks (CNNs), which extract visual features from the video data, while emotional cues in speech are analyzed using Long Short-Term Memory (LSTM) networks. By combining these modalities, the research addresses limitations commonly faced by unimodal systems, such as degraded performance in noisy environments or when faces are occluded.
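For illustration only, the following is a minimal late-fusion sketch of a CNN + LSTM pipeline of the kind described above, written in Python with Keras. The input shapes, audio feature choice, and number of emotion classes are assumptions made for the example and are not taken from the thesis.

# Hypothetical sketch of a late-fusion CNN + LSTM emotion classifier.
# Shapes and class count below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 7  # assumed set of emotion categories

# Visual branch: CNN over a single face frame (48x48 grayscale assumed).
face_in = layers.Input(shape=(48, 48, 1), name="face_frame")
x = layers.Conv2D(32, 3, activation="relu")(face_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
face_feat = layers.Dense(128, activation="relu")(x)

# Audio branch: LSTM over a sequence of prosodic/spectral feature vectors
# (100 time steps x 40 features assumed, e.g. MFCC-style frames).
audio_in = layers.Input(shape=(100, 40), name="speech_features")
audio_feat = layers.LSTM(128)(audio_in)

# Late fusion: concatenate the modality embeddings and classify.
fused = layers.Concatenate()([face_feat, audio_feat])
fused = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=[face_in, audio_in], outputs=out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])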
The findings demonstrate that integrating facial and auditory data significantly improves emotion classification accuracy, particularly in real-time applications. This research advances the field of affective computing by highlighting the complementary strengths of visual and auditory emotion cues, with practical implications for customer service, virtual assistants, and mental health diagnostics. | en_US |