A curated list of papers related to multi-modal machine learning, especially multi-modal large language models (MLLMs).
Recent Advances in Vision Foundation Models, CVPR 2023 Workshop [pdf]
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning, arxiv 2023 [data]
LLaVA Instruction 150K, arxiv 2023 [data]
Youku-mPLUG 10M, arxiv 2023 [data]
MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning, ACL 2023 [data]
A Survey on Multimodal Large Language Models, arxiv 2023 [project page]
Vision-Language Models for Vision Tasks: A Survey, arxiv 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic, arxiv 2023 [code]
PandaGPT: One Model To Instruction-Follow Them All, arxiv 2023 [code]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning, arxiv 2023 [code]
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, arxiv 2023 [code]
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models, arxiv 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, arxiv 2023 [code]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, arxiv 2023 [code]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML 2023 [code]
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arxiv 2023 [code]
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, arxiv 2023 [code]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arxiv 2023 [code]
Language Is Not All You Need: Aligning Perception with Language Models, arxiv 2023 [code]
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, arxiv 2023 [code]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages, arxiv 2023 [code]
Visual Instruction Tuning, arxiv 2023 [code]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, arxiv 2023 [code]
PaLI: A Jointly-Scaled Multilingual Language-Image Model, ICLR 2023 [blog]
Grounding Language Models to Images for Multimodal Inputs and Outputs, ICML 2023 [code]
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022 [code]
Flamingo: a Visual Language Model for Few-Shot Learning, NeurIPS 2022
LISA: Reasoning Segmentation via Large Language Model, arxiv 2023 [code]
Contextual Object Detection with Multimodal Large Language Models, arxiv 2023 [code]
KOSMOS-2: Grounding Multimodal Large Language Models to the World, arxiv 2023 [code]
Fast Segment Anything, arxiv 2023 [code]
Multi-Modal Classifiers for Open-Vocabulary Object Detection, ICML 2023 [code]
Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT, arxiv 2023
Images Speak in Images: A Generalist Painter for In-Context Visual Learning, arxiv 2023 [code]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, arxiv 2023 [code]
SegGPT: Segmenting Everything In Context, arxiv 2023 [code]
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, arxiv 2023 [code]
Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching, arxiv 2023
Personalize Segment Anything Model with One Shot, arxiv 2023 [code]
Segment Anything, arxiv 2023 [code]
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, CVPR 2023 [code]
A Generalist Framework for Panoptic Segmentation of Images and Videos, arxiv 2022
A Unified Sequence Interface for Vision Tasks, NeurIPS 2022 [code]
Pix2seq: A Language Modeling Framework for Object Detection, ICLR 2022 [code]
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, arxiv 2023 [code]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arxiv 2023 [project page]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, arxiv 2023 [project page]
MotionGPT: Human Motion as a Foreign Language, arxiv 2023 [code]
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, arxiv 2023 [code]
PaLM-E: An Embodied Multimodal Language Model, arxiv 2023 [blog]
Generative Agents: Interactive Simulacra of Human Behavior, arxiv 2023
Vision-Language Models as Success Detectors, arxiv 2023
TidyBot: Personalized Robot Assistance with Large Language Models, arxiv 2023 [code]
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, CoRL 2022 [blog] [code]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, arxiv 2023 [code]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, arxiv 2023 [code]
LOVM: Language-Only Vision Model Selection, arxiv 2023 [code]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation, arxiv 2023 [project page]