GoLD: A Spoken Language Grounding Dataset
The Grounded Language Dataset, or GoLD, is a grounded language learning dataset spanning four modalities: RGB, depth, text, and speech. The data contains 207 object instances across 47 object classes, drawn from five high-level categories: food, home, medical, office, and tool. Each instance is captured from multiple angles, for a total of 825 images. Text and speech descriptions were collected using Amazon Mechanical Turk (AMT), yielding 16,500 text descriptions and 16,500 speech descriptions.
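A minimal sketch of how one might iterate over the four modalities is shown below. It assumes a hypothetical on-disk layout in which each object instance has its own directory containing rgb/, depth/, text/, and speech/ subfolders; the dataset's actual structure and file formats may differ, so adjust the paths and extensions accordingly.

```python
from pathlib import Path


def load_instance(instance_dir):
    """Collect file paths for every modality of one object instance.

    Assumes a hypothetical layout of rgb/, depth/, text/, and speech/
    subdirectories per instance; adapt to the dataset's actual structure.
    """
    instance_dir = Path(instance_dir)
    return {
        "rgb": sorted(instance_dir.glob("rgb/*.png")),        # RGB images
        "depth": sorted(instance_dir.glob("depth/*.png")),     # depth images
        "text": sorted(instance_dir.glob("text/*.txt")),       # AMT text descriptions
        "speech": sorted(instance_dir.glob("speech/*.wav")),   # AMT spoken descriptions
    }


if __name__ == "__main__":
    # Example: walk every instance directory under a (hypothetical) data root
    # and report how many files each modality contains.
    data_root = Path("gold_data")
    for instance_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
        modalities = load_instance(instance_dir)
        print(instance_dir.name, {name: len(files) for name, files in modalities.items()})
```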
Abstract: Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues that require sample efficiency and other deployment constraints. In this work, we present a multimodal dataset of RGB+depth objects with both spoken and textual descriptions. We analyze the differences between the two types of descriptive language, and our experiments demonstrate that the choice of language modality affects learning. This dataset will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the modalities of image, depth, text, speech, and transcription interact, as well as how differences in the vernacular of these modalities impact results.