View on GitHub

NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Multimodal

Multimodal Emotion Recognition

IEMOCAP

The IEMOCAP (Busso et al., 2008) contains the acts of 10 speakers in a two-way conversation segmented into utterances. The medium of the conversations in all the videos is English. The database contains the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other.

Monologue:

Model Accuracy Paper / Source
CHFusion (Poria et al., 2017) 76.5% Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling
bc-LSTM (Poria et al., 2017) 74.10% Context-Dependent Sentiment Analysis in User-Generated Videos

Conversational: Conversational setting enables the models to capture emotions expressed by the speakers in a conversation. Inter speaker dependencies are considered in this setting.

Model Weighted Accuracy (WAA) Paper / Source
CMN (Hazarika et al., 2018) 77.62% Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos
Memn2n 75.08 Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos

Multimodal Metaphor Recognition

Mohammad et. al, 2016 created a dataset of verb-noun pairs from WordNet that had multiple senses. They annoted these pairs for metaphoricity (metaphor or not a metaphor). Dataset is in English.

Model F1 Score Paper / Source Code
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec 0.75 Shutova et. al, 2016 Unavailable

Tsvetkov et. al, 2014 created a dataset of adjective-noun pairs that they then annotated for metaphoricity. Dataset is in English.

Model F1 Score Paper / Source Code
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec 0.79 Shutova et. al, 2016 Unavailable

Multimodal Sentiment Analysis

MOSI

The MOSI dataset (Zadeh et al., 2016) is a dataset rich in sentimental expressions where 93 people review topics in English. The videos are segmented with each segments sentiment label scored between +3 (strong positive) to -3 (strong negative) by 5 annotators.

Model Accuracy Paper / Source
bc-LSTM (Poria et al., 2017) 80.3% Context-Dependent Sentiment Analysis in User-Generated Videos
MARN (Zadeh et al., 2018) 77.1% Multi-attention Recurrent Network for Human Communication Comprehension

Go back to the README