Mapping the Multilingual Margins: Intersectional Biases of Sentiment Analysis Systems in English, Spanish, and Arabic

António Câmara, Nina Taneja, Tamjeed Azad, Emily Allaway, Richard Zemel.
Second Workshop on Language Technology for Equality, Diversity and Inclusion (LT-EDI), 2022.

As natural language processing systems become more widespread, it is necessary to address fairness issues in their implementation and deployment to ensure that their negative impacts on society are understood and minimized. However, there is limited work that studies fairness using a multilingual and intersectional framework or on downstream tasks. In this paper, we introduce four multilingual Equity Evaluation Corpora, supplementary test sets designed to measure social biases, and a novel statistical framework for studying unisectional and intersectional social biases in natural language processing. We use these tools to measure gender, racial, ethnic, and intersectional social biases across five models trained on emotion regression tasks in English, Spanish, and Arabic. We find that many systems demonstrate statistically significant unisectional and intersectional social biases.
[link]

Detecting Polarized Topics Using Partisanship-aware Contextualized Topic Embeddings

Zihao He, Negar Mokhberian, António Câmara, Andrés Abeliuk, Kristina Lerman.
Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.

Growing polarization of the news media has been blamed for fanning disagreement, controversy, and even violence. Early identification of polarized topics is thus an urgent matter that can help mitigate conflict. However, accurate measurement of topic-wise polarization is still an open research challenge. To address this gap, we propose Partisanship-aware Contextualized Topic Embeddings (PaCTE), a method to automatically detect polarized topics from partisan news sources. Specifically, using a language model that has been fine-tuned to recognize the partisanship of news articles, we represent the ideology of a news corpus on a topic by a corpus-contextualized topic embedding and measure polarization between corpora using cosine distance. We apply our method to a dataset of news articles about the COVID-19 pandemic. Extensive experiments on different news sources and topics demonstrate that our method captures topical polarization, as indicated by its effectiveness in retrieving the most polarized topics.
[link]
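
The polarization measure described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each document already has an embedding vector and a non-negative weight for the topic of interest, and it assumes a simple weighted average as the corpus-level topic embedding. The function names `topic_embedding`, `cosine_distance`, and `topic_polarization` are hypothetical and chosen only for illustration.

```python
import math


def topic_embedding(doc_embeddings, topic_weights):
    """Weighted average of document embeddings for one corpus and one topic.

    Assumption (not from the paper): each weight says how strongly the
    corresponding document expresses the topic.
    """
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    dim = len(doc_embeddings[0])
    return [
        sum(w * doc[i] for doc, w in zip(doc_embeddings, topic_weights)) / total
        for i in range(dim)
    ]


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def topic_polarization(docs_a, weights_a, docs_b, weights_b):
    """Polarization of a topic between two corpora (e.g. two partisan sources)."""
    return cosine_distance(
        topic_embedding(docs_a, weights_a),
        topic_embedding(docs_b, weights_b),
    )
```

Under these assumptions, two corpora whose topic embeddings point in the same direction get a polarization near 0, and the score grows as the embeddings diverge.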

Hierarchical Embedding Topic Modeling for Stance Detection on Unknown Targets

António Câmara.
Undergraduate Thesis, Columbia University, Department of Computer Science, 2022.

Stance detection, an emerging problem in natural language processing with broad application to the social sciences, seeks to understand how authors express attitudes. However, existing models and datasets are only developed for settings where the stance object, or topic of debate, is known. Moreover, existing settings for this problem are high-resource and do not consider the relationship between topics. In this thesis, we introduce a novel task, stance detection on unknown targets, that seeks to measure a model's ability to detect stance on topics discovered using only the text itself. To that end, we introduce a model that first discovers a hierarchical set of topics for stance detection using a semantic embedding space and then uses large-scale transformer-based language models for stance detection on these discovered topics. In comparison to popular models, we find that our model performs well on topic modeling, stance detection, and our novel task, especially in low-resource and hierarchical settings. We also discuss the application of our work in low-resource settings and begin collecting datasets that study African American English and Black communities online.
[link]