Scale Events

DEBAGREEMENT: A Comment-Reply Dataset for (Dis)agreement Detection

Posted Mar 30, 2022 | Views 4.3K
# Tech Talk
speaker
Aerin Kim
Engineering Manager @ Scale AI

Aerin Kim is an Engineering Manager at Scale AI, turning raw data into high-quality training data using machine learning. She is fascinated by the science of training data, the one and only input of deep learning that determines the performance of the model. Prior to Scale, Aerin was a Senior Research Software Engineer at Microsoft, where she worked on question answering, semantic parsing, and training data generation for AI applications. Before joining Microsoft, Aerin received her Master's degree in Operations Research from the Fu Foundation School of Engineering and Applied Science at Columbia University.

SUMMARY

DEBAGREEMENT is a dataset of 42,894 comment-reply pairs from the popular discussion website Reddit, each labeled as agree, neutral, or disagree. We gathered interactions from five subreddits: r/BlackLivesMatter, r/Brexit, r/Climate, r/Democrats, and r/Republican. The comment pairs for each subreddit were chosen so that, taken together, they form a user interaction graph.
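To make the idea of a user interaction graph concrete, here is a minimal sketch of how labeled comment-reply pairs could be grouped into one signed, directed graph per forum. The record fields (`subreddit`, `parent_author`, `reply_author`, `label`), the user names, and the mapping of labels to +1/0/-1 are assumptions made for illustration; they are not the dataset's documented schema.

```python
from collections import defaultdict

# Assumed mapping from the three labels to edge signs (illustrative only).
LABEL_TO_SIGN = {"agree": 1, "neutral": 0, "disagree": -1}

# Hypothetical records; field names and users are made up for this sketch.
PAIRS = [
    {"subreddit": "Brexit", "parent_author": "alice", "reply_author": "bob", "label": "disagree"},
    {"subreddit": "Brexit", "parent_author": "alice", "reply_author": "bob", "label": "agree"},
    {"subreddit": "Brexit", "parent_author": "bob", "reply_author": "carol", "label": "agree"},
    {"subreddit": "Climate", "parent_author": "dave", "reply_author": "alice", "label": "neutral"},
]


def build_interaction_graphs(pairs):
    """Group pairs by forum and collect signed, directed reply edges.

    Returns {subreddit: {(reply_author, parent_author): [sign, ...]}}.
    Repeated interactions between the same two users are kept as a list.
    """
    graphs = defaultdict(lambda: defaultdict(list))
    for pair in pairs:
        edge = (pair["reply_author"], pair["parent_author"])
        graphs[pair["subreddit"]][edge].append(LABEL_TO_SIGN[pair["label"]])
    # Convert nested defaultdicts to plain dicts for a predictable result.
    return {name: dict(edges) for name, edges in graphs.items()}


graphs = build_interaction_graphs(PAIRS)
```

Keeping every interaction between two users as a list, rather than collapsing it to a single sign, preserves the fact that the same pair of users can agree on one comment and disagree on another.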

DEBAGREEMENT challenges Natural Language Processing (NLP) systems because it includes slang, irony, and topic-specific humor, all of which are common in online discussions. We compared the performance of state-of-the-art language models on a (dis)agreement detection task and examined how they use the contextual information available to them during training (graph, authorship, and temporal information).

Recent research shows that context, such as social context or knowledge graph information, enables language models to perform better on downstream NLP tasks. In that light, DEBAGREEMENT provides novel opportunities for combining graph-based and text-based machine learning techniques to detect agreement and disagreement online.

Key Takeaways in this Tech Talk:

Language models trained on existing datasets underperform on real-world data, and evaluating them on the new DEBAGREEMENT dataset exposes this gap.

DEBAGREEMENT presents new opportunities for modeling diverse online interactions using both text and context (authorship, graph, and temporal information).

Its graph structure allows text-based machine learning (ML) to be combined with graph representation learning (GRL) approaches. By modeling online discussion forums as graphs of user interactions, researchers can:
1) recast the agreement/disagreement detection problem as a signed link prediction task, and
2) apply existing signed graph embedding techniques that have been evaluated on publicly accessible signed graphs such as Epinions and Slashdot.

The sentiment and polarization of topics are not static over time. This dataset exposes that for the first time and shows how realistic topic sentiment evolves.
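As a rough illustration of what predicting the sign of a link between two users can look like, the sketch below uses a simple structural-balance heuristic: each user who has interacted with both endpoints votes with the product of the two known signs. This is a textbook-style baseline chosen for brevity, not the method from the talk and not a signed graph embedding technique; the edge list and user names are hypothetical.

```python
from collections import defaultdict

# Hypothetical signed, undirected edges: +1 = agreement, -1 = disagreement.
EDGES = {
    ("alice", "bob"): 1,
    ("bob", "carol"): -1,
    ("alice", "dave"): -1,
    ("dave", "carol"): 1,
}


def predict_sign(signed_edges, u, v):
    """Predict the sign of the unseen edge u-v from shared neighbors.

    For each user w connected to both u and v, multiply sign(u, w) by
    sign(v, w). A positive total suggests agreement, a negative total
    suggests disagreement, and 0 means there is no evidence either way.
    """
    neighbors = defaultdict(dict)
    for (a, b), sign in signed_edges.items():
        neighbors[a][b] = sign
        neighbors[b][a] = sign

    shared = set(neighbors[u]) & set(neighbors[v])
    score = sum(neighbors[u][w] * neighbors[v][w] for w in shared)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


prediction = predict_sign(EDGES, "alice", "carol")
```

In this toy graph, alice agrees with bob, who disagrees with carol, and alice disagrees with dave, who agrees with carol, so both shared neighbors vote for disagreement between alice and carol. A text-based classifier could, in principle, be combined with such a graph signal, which is the kind of text-plus-graph modeling the takeaways describe.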
