A must-read for data scientists: 5 sentiment analysis research papers

25 Dec 2020

CSDN, 25 Dec 2020, A must-read for data scientists! 5 sentiment analysis research papers (translation)

Sentiment analysis has a wide range of uses, and AI models that can recognize emotions and opinions are used in many industries. As a result, building machines that can intelligently recognize emotion has become an increasingly popular goal, and research on the topic in natural language processing (NLP) has grown with it. This article introduces 5 important papers on sentiment analysis and sentiment classification.

1. Using deep learning to detect hate speech on Twitter (Deep Learning for Hate Speech Detection in Tweets)

One of the most important uses of sentiment classification models is detecting hate speech. Recently, there have been many reports about the difficult work of content moderators. As automatic hate speech detection and other content moderation models improve, human reviewers may be relieved of some of the burden of reviewing graphic content.

In this paper, the research team defined hate speech detection as the task of classifying a given tweet as racist, sexist, or neither.

To this end, the researchers ran experiments on a dataset of about 16,000 tweets. Of these, 1,972 tweets were labeled as racist and 3,383 were labeled as sexist. The remaining tweets were labeled as neither racist nor sexist.

The study found that certain deep learning techniques detect hate speech more effectively than existing n-gram methods.
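For readers unfamiliar with the n-gram baselines mentioned above, the following is a minimal sketch of how a tweet can be turned into character and word n-gram counts. It is an illustration only, not the authors' code: the function names, the default n-gram sizes, and the example text are all assumptions. In practice, count vectors like these would be passed to a conventional classifier.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, each joined with a single space."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, char_n=3, word_n=2):
    """Build a sparse bag-of-n-grams count vector (a Counter).

    The "c:" and "w:" prefixes keep character and word n-grams apart.
    The sizes 3 and 2 are arbitrary defaults chosen for this sketch.
    """
    features = Counter("c:" + g for g in char_ngrams(text, char_n))
    features.update("w:" + g for g in word_ngrams(text, word_n))
    return features


# Hypothetical example text, not taken from the paper's dataset.
features = ngram_features("so so tired")
```

Character n-grams are robust to the creative spellings common in tweets, which is one reason they are a popular baseline for this kind of task.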

Release/Last Update Date: June 1, 2017

Authors and contributors: Pinkesh Badjatiya (International Institute of Information Technology-Hyderabad, hereinafter referred to as IIIT-H), Shashank Gupta (IIIT-H), Manish Gupta (Microsoft), Vasudeva Varma (IIIT-H)
Article address: https://arxiv.org/pdf/1706.00188v1.pdf

2. DepecheMood++: Bilingual Emotion Lexicon (DepecheMood++: a Bilingual Emotion Lexicon)

There are two main ways to create a lexicon: build it directly (usually with crowdsourced annotators), or derive it from an existing annotated corpus.

The researchers set out to test whether simple techniques such as document filtering, frequency cutoffs, and text preprocessing can improve DepecheMood, a state-of-the-art emotion lexicon. The lexicon is derived from annotated news articles and was originally created by Staiano and Guerini in 2014 for emotion analysis.
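To make the idea of deriving a lexicon from an annotated corpus concrete, here is a simplified sketch that applies light preprocessing and a frequency cutoff. It is not the DepecheMood++ pipeline: the averaging scheme, the `min_doc_freq` parameter, the emotion names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_lexicon(docs, min_doc_freq=2):
    """Derive a word -> {emotion: score} lexicon from annotated documents.

    docs: iterable of (text, {emotion: score}) pairs.
    min_doc_freq: words found in fewer documents are dropped
                  (the frequency cutoff).
    """
    doc_freq = Counter()
    totals = defaultdict(lambda: defaultdict(float))
    for text, emotions in docs:
        # Minimal preprocessing: lowercase and split on whitespace.
        # Using a set counts each word once per document.
        for word in set(text.lower().split()):
            doc_freq[word] += 1
            for emotion, score in emotions.items():
                totals[word][emotion] += score

    lexicon = {}
    for word, freq in doc_freq.items():
        if freq < min_doc_freq:
            continue  # too rare to give a reliable score
        # Average the document-level scores over documents containing the word.
        lexicon[word] = {e: s / freq for e, s in totals[word].items()}
    return lexicon


# Hypothetical toy corpus with invented emotion scores.
docs = [
    ("storm destroys village", {"sad": 0.8, "happy": 0.0}),
    ("village celebrates festival", {"sad": 0.0, "happy": 0.9}),
    ("festival draws crowds", {"sad": 0.1, "happy": 0.7}),
]
lexicon = build_lexicon(docs, min_doc_freq=2)
```

With this cutoff, only words that occur in at least two documents ("village" and "festival") survive. Raising the cutoff trades coverage for more reliable scores.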

In this paper, the researchers explain how they built the lexicon. The new version released with this research, DepecheMood++, is available in English and Italian.

Release/Last Update Date: October 8, 2018

Authors and contributors: Oscar Araque (Madrid Polytechnic University), Lorenzo Gatti (University of Twente), Marco Guerini (Bruno Kessler Institute), Jacopo Staiano (Recital AI)
Article address: https://arxiv.org/pdf/1810.03660v1.pdf

3. Expressively Vulgar: The Socio-dynamics of Vulgarity

The way most ideas are expressed evolves over time, but this is not the case for vulgar language. Vulgar language often carries a strong intent and is used to convey a precise message.

In this study, researchers from the University of Texas and the University of Pennsylvania conducted a large-scale, data-driven analysis of vulgar words in Twitter posts. More specifically, they analyzed the sociocultural and pragmatic aspects of vulgar language on Twitter.

The research team tried to answer the following questions: Do the expression and function of vulgar language vary with the demographic characteristics of the speaker? Does vulgar language affect how sentiment is perceived? Does modeling vulgar language help sentiment prediction?

The researchers collected a dataset of 6,800 tweets. They then asked nine annotators to rate the sentiment of each tweet on a five-point scale. Notably, the data also includes demographic information (gender, age, education, income, religious background, and political ideology) about the person who posted each tweet.

This is the only open dataset that includes both tweets and demographic details about their authors. It is also one of the first studies of how modeling vulgar words can improve the performance of sentiment analysis.

Release/Last Update Date: August 2018

Authors and contributors: Isabela Cachola, Eric Holgate, Junyi Jessy Li (all from the University of Texas at Austin) and Daniel Preoţiuc-Pietro (University of Pennsylvania)
Article address: https://www.aclweb.org/anthology/C18-1248.pdf

4. Multilingual Twitter Sentiment Classification: The Role of Human Annotators

Among the studies on sentiment analysis listed in this article, this is the only one that emphasizes the importance of human annotators. In this experiment on automated tweet sentiment classification, researchers from the Jožef Stefan Institute analyzed a large dataset of sentiment-annotated tweets in multiple languages.

Specifically, the research team annotated 1.6 million tweets in 13 different languages. Using these annotated tweets as training data, the team built multiple automatic sentiment classification models.

Their experiments led to some interesting conclusions. First, the researchers found no statistically significant difference in performance between the top classification models. Second, on the ordered three-class sentiment classification problem, a model's overall accuracy is not a reliable indicator of its performance. Finally, the researchers argue that it is more worthwhile to focus on the quality of the training set than on the choice of model.
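The point about ordered classes can be illustrated with a small sketch. When the labels are ordered (negative < neutral < positive), plain accuracy treats every mistake alike, while an order-aware measure penalizes confusing negative with positive more than confusing negative with neutral. Mean absolute error is used here only as a simple stand-in for an order-aware measure; it is an assumption of this sketch and not necessarily what the paper uses. The labels and predictions are invented.

```python
# Map the ordered sentiment classes to integer ranks.
ORDER = {"negative": 0, "neutral": 1, "positive": 2}


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mean_absolute_error(gold, pred):
    """Average distance, in class ranks, between prediction and gold label."""
    return sum(abs(ORDER[g] - ORDER[p]) for g, p in zip(gold, pred)) / len(gold)


# Hypothetical gold labels and two hypothetical models.
gold = ["negative", "negative", "positive", "positive"]
model_a = ["neutral", "negative", "neutral", "positive"]    # two near misses
model_b = ["positive", "negative", "negative", "positive"]  # two polarity flips

print(accuracy(gold, model_a), accuracy(gold, model_b))
print(mean_absolute_error(gold, model_a), mean_absolute_error(gold, model_b))
```

Both hypothetical models have the same accuracy, but the second one makes much worse mistakes, which only the order-aware measure reveals.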

Release/Last Update Date: May 5, 2016

Authors and contributors: Igor Mozetič, Miha Grčar and Jasmina Smailović (all from the Department of Knowledge Technologies of the Jožef Stefan Institute)
Article address: https://arxiv.org/pdf/1602.07563v2.pdf

5. MELD: A multi-modal and multi-party data set for emotion recognition

In this paper, the authors note the growing body of research on emotion recognition in conversation. At the same time, they point out that the field lacks a large-scale conversational emotion database. To fill this gap, the researchers propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of the original EmotionLines dataset.

MELD includes about 13,000 utterances from 1,433 dialogues from the TV series "Friends". The dataset focuses on conversations between two or more speakers, and every utterance is labeled with an emotion and a sentiment. The original EmotionLines dataset contains only the text of the dialogues, so it can only be used for text analysis. The main improvement in MELD is the addition of the audio and video modalities: it captures the words spoken, the tone of voice, and the facial expressions of the speaker.
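As a rough picture of what one record in such a multimodal, multi-party dataset might look like, here is a minimal sketch. The field names, file paths, label values, and sample text are all hypothetical and do not reflect MELD's actual file format; the sketch only mirrors the structure described above (dialogues made of utterances, each with text, audio, video, an emotion label, and a sentiment label).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    dialogue_id: int    # which dialogue the utterance belongs to
    utterance_id: int   # position of the utterance within the dialogue
    speaker: str
    text: str           # textual modality
    audio_path: str     # audio modality (hypothetical file path)
    video_path: str     # visual modality (hypothetical file path)
    emotion: str        # emotion label
    sentiment: str      # sentiment label


def group_by_dialogue(utterances):
    """Group utterances into dialogues, ordered by position in the dialogue."""
    dialogues = defaultdict(list)
    for utt in utterances:
        dialogues[utt.dialogue_id].append(utt)
    for turns in dialogues.values():
        turns.sort(key=lambda utt: utt.utterance_id)
    return dict(dialogues)


# Invented sample records, not taken from the dataset.
sample = [
    Utterance(0, 1, "Speaker B", "I did not expect that.",
              "audio/d0_u1.wav", "video/d0_u1.mp4", "surprise", "neutral"),
    Utterance(0, 0, "Speaker A", "Guess who got the job!",
              "audio/d0_u0.wav", "video/d0_u0.mp4", "joy", "positive"),
    Utterance(1, 0, "Speaker A", "I can't believe it's over.",
              "audio/d1_u0.wav", "video/d1_u0.mp4", "sadness", "negative"),
]
dialogues = group_by_dialogue(sample)
```

Keeping utterances grouped and ordered by dialogue matters for conversational emotion recognition, because the emotion of an utterance often depends on the preceding turns.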

Release/Last Update Date: July 4, 2019

Authors and contributors: Soujanya Poria (Singapore University of Technology and Design), Devamanyu Hazarika (National University of Singapore), Navonil Majumder (National Polytechnic Institute, Mexico), Gautam Naik (Nanyang Technological University), Erik Cambria (Nanyang Technological University), Rada Mihalcea (University of Michigan)
Article address: https://arxiv.org/pdf/1810.02508v6.pdf

Creating an emotionally intelligent machine is an ambitious goal, and sentiment analysis and emotion recognition are necessary steps toward it. We hope these papers help strengthen your understanding of the work currently being done in this field.