A framework for leveraging Large Language Models in automatic depression assessment through the Montgomery-Åsberg Depression Rating Scale (MADRS).
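A minimal sketch of how such a framework might prompt an LLM to rate a single MADRS item from an interview transcript; the prompt wording and the `query_llm` helper are hypothetical stand-ins, not the framework's actual pipeline. (Each MADRS item is rated on a 0-6 scale.)

```python
def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to any chat-completion LLM API."""
    raise NotImplementedError

def rate_madrs_item(transcript: str, item: str) -> int:
    # Ask the model to act as a clinical rater for one MADRS item.
    prompt = (
        "You are a clinical rater. Based on the interview transcript below, "
        f"rate the MADRS item '{item}' on its 0-6 scale. "
        "Answer with a single integer.\n\n"
        f"Transcript:\n{transcript}"
    )
    return int(query_llm(prompt).strip())
```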
A multimodal dataset for studying gesture synthesis in two-party interactions with contextualized speech, aimed at understanding co-speech gesture production.
A comprehensive benchmark for evaluating multimodal models' ability to understand and interpret social interactions, contributing to the development of socially-aware AI systems.
A speech-based Grounded Language Learning dataset that pairs scenarios of a robot performing tasks with spoken natural language utterances.
A grounded language acquisition approach that learns directly from end-user speech, without relying on intermediate textual representations. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems.
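A minimal sketch of the core idea, encoding raw speech with no transcription step, here using a pretrained wav2vec 2.0 encoder from Hugging Face Transformers; the checkpoint and mean-pooling choice are illustrative assumptions, not necessarily the paper's setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# One second of 16 kHz audio standing in for a real spoken description.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# Mean-pool to a fixed-size utterance embedding that can be grounded against
# perceptual features directly, with no intermediate text representation.
speech_embedding = hidden.mean(dim=1)
```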
The Grounded Language Dataset, or GoLD, is a grounded language learning dataset spanning four modalities: RGB, depth, text, and speech. It contains 207 instances across 47 object classes, drawn from five high-level categories: food, home, medical, office, and tool. Each instance is captured from different angles, for a total of 825 images. Text and speech descriptions were collected via Amazon Mechanical Turk (AMT), totaling 16,500 text descriptions and 16,500 speech descriptions.
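A hedged sketch of how a single GoLD instance could be represented in code; the field names and file paths below are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class GoldInstance:
    object_class: str                # one of the 47 classes, e.g. "apple"
    category: str                    # food, home, medical, office, or tool
    rgb_paths: list[str] = field(default_factory=list)    # views of the object
    depth_paths: list[str] = field(default_factory=list)  # aligned depth maps
    text_descriptions: list[str] = field(default_factory=list)  # AMT text
    speech_paths: list[str] = field(default_factory=list)       # AMT audio

example = GoldInstance(
    object_class="apple",
    category="food",
    rgb_paths=["apple_1/rgb_0.png"],
    depth_paths=["apple_1/depth_0.png"],
    text_descriptions=["a red apple on a table"],
    speech_paths=["apple_1/speech_0.wav"],
)
```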
A cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items.
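A minimal PyTorch sketch of the idea, assuming precomputed per-modality features; the encoder architectures, embedding dimension, and margin are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Projects one modality's features into a shared embedding space."""
    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x):
        # L2-normalize so distances are comparable across modalities.
        return nn.functional.normalize(self.net(x), dim=-1)

vision_enc = ModalityEncoder(in_dim=2048)   # e.g. CNN image features
language_enc = ModalityEncoder(in_dim=768)  # e.g. text/speech features

triplet = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(
    list(vision_enc.parameters()) + list(language_enc.parameters()), lr=1e-4
)

# Dummy batch: anchor language descriptions, positive images of the same
# object, and negative images of a different object.
anchor_lang = torch.randn(32, 768)
pos_img, neg_img = torch.randn(32, 2048), torch.randn(32, 2048)

# Pull matching language/vision pairs together, push mismatches apart.
loss = triplet(language_enc(anchor_lang), vision_enc(pos_img), vision_enc(neg_img))
opt.zero_grad()
loss.backward()
opt.step()
```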
A BERT-based event modeling approach in which the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives are trained on a large corpus of structured event documents.
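A minimal sketch of joint MLM + NSP training with Hugging Face Transformers' BertForPreTraining; the example sentence pair and single hand-picked masked position stand in for the structured event corpus and a real masking strategy.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Two consecutive "events" form a sentence pair for the NSP objective.
enc = tokenizer("the suspect fled the scene", "police arrived shortly after",
                return_tensors="pt")

# MLM labels: predict the original token at masked positions only
# (positions set to -100 are ignored by the loss).
mask_pos = 3  # illustrative masked position
labels = torch.full_like(enc["input_ids"], -100)
labels[0, mask_pos] = enc["input_ids"][0, mask_pos]
enc["input_ids"][0, mask_pos] = tokenizer.mask_token_id

outputs = model(**enc,
                labels=labels,
                next_sentence_label=torch.tensor([0]))  # 0 = "B follows A"
outputs.loss.backward()  # combined MLM + NSP loss
```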
A comparative analysis of pre-trained self-supervised models BERT and XLNet on a multilabel emotion analysis task.
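A minimal sketch of such a comparison using Hugging Face Transformers' multilabel setup (BCE-with-logits loss via `problem_type`); the emotion label set and checkpoints are illustrative, not necessarily those used in the study.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EMOTIONS = ["anger", "fear", "joy", "sadness"]  # hypothetical label set

for ckpt in ["bert-base-uncased", "xlnet-base-cased"]:
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt,
        num_labels=len(EMOTIONS),
        problem_type="multi_label_classification",  # BCE-with-logits loss
    )
    enc = tok("I can't believe we finally won!", return_tensors="pt")
    # Multi-hot target: an utterance can carry several emotions at once.
    target = torch.tensor([[0.0, 0.0, 1.0, 0.0]])
    out = model(**enc, labels=target)
    probs = torch.sigmoid(out.logits)  # independent per-label probabilities
    print(ckpt, float(out.loss), probs.detach())
```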