TY - GEN
T1 - Show and Guide: Instructional-Plan Grounded Vision and Language Model
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Glória-Silva, Diogo
AU - Semedo, David
AU - Magalhães, João
N1 - info:eu-repo/grantAgreement/FCT/OE/PRT%2FBD%2F152810%2F2021/PT#
info:eu-repo/grantAgreement/FCT/Concurso de avaliação no âmbito do Programa Plurianual de Financiamento de Unidades de I&D (2017%2F2018) - Financiamento Base/UIDB%2F04516%2F2020/PT#
Funding Information:
This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021, awarded by the CMU Portugal Affiliated Ph.D. Program, and by the FCT project NOVA LINCS (Ref. UIDB/04516/2020). Data collection was made possible under the Alexa Prize TaskBot Challenge organized by Amazon Science.
Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Guiding users through complex procedural plans is an inherently multimodal task, in which visually illustrated plan steps are crucial for delivering effective plan guidance. However, existing work on plan-following language models (LMs) often does not support multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to the semantic layers of multimodal instructional plans, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.
UR - http://www.scopus.com/inward/record.url?scp=85217777836&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.emnlp-main.1191
DO - 10.18653/v1/2024.emnlp-main.1191
M3 - Conference contribution
AN - SCOPUS:85217777836
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 21371
EP - 21389
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -