GitHub repos

haotian-liu/LLaVA
Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities.
Language: Python
#chatbot #chatgpt #gpt_4 #llama #llava #multimodal
Stars: 716 Issues: 14 Forks: 34
https://github.com/haotian-liu/LLaVA

GitHub

GitHub - haotian-liu/LLaVA: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. - haotian-liu/LLaVA

open-mmlab/Multimodal-GPT
Multimodal-GPT
Language: Python
#flamingo #gpt #gpt_4 #llama #multimodal #transformer #vision_and_language
Stars: 244 Issues: 1 Forks: 12
https://github.com/open-mmlab/Multimodal-GPT

X-PLUG/mPLUG-Owl
mPLUG-Owl🦉: Modularization Empowers Large Language Models with Multimodality
Language: Python
#alpaca #chatbot #chatgpt #computer_vision #damo #gpt #gpt4 #gpt4_api #huggingface #instruction_tuning #large_language_models #llama #mplug #mplug_owl #multimodal #pretraining #pytorch #transformer #visual_reasoning #visual_recognition
Stars: 209 Issues: 1 Forks: 9
https://github.com/X-PLUG/mPLUG-Owl

GitHub

GitHub - X-PLUG/mPLUG-Owl: mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family - X-PLUG/mPLUG-Owl

OpenGVLab/InternChat
InternChat allows you to interact with ChatGPT by clicking, dragging and drawing using a pointing device.
Language: Python
#chatgpt #click #foundation_model #gpt #gpt_4 #gradio #husky #image_captioning #internimage #langchain #llama #llm #multimodal #ocr #sam #segment_anything #vicuna #video #video_generation #vqa
Stars: 231 Issues: 1 Forks: 10
https://github.com/OpenGVLab/InternChat

GitHub

GitHub - OpenGVLab/InternGPT: InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now…

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editin...

kyegomez/tree-of-thoughts
Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language: Python
#artificial_intelligence #chatgpt #deep_learning #gpt4 #multimodal #prompt #prompt_engineering #prompt_learning #prompt_tuning
Stars: 366 Issues: 7 Forks: 31
https://github.com/kyegomez/tree-of-thoughts

GitHub

GitHub - kyegomez/tree-of-thoughts: Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large…

Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70% - kyegomez/tree-of-thoughts

OFA-Sys/ONE-PEACE
A general representation model across vision, audio, language modalities.
Language: Python
#audio_language #foundation_models #multimodal #representation_learning #vision_language
Stars: 185 Issues: 2 Forks: 5
https://github.com/OFA-Sys/ONE-PEACE

GitHub

GitHub - OFA-Sys/ONE-PEACE: A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring…

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities - OFA-Sys/ONE-PEACE

google/break-a-scene
Official implementation for "Break-A-Scene: Extracting Multiple Concepts from a Single Image" [SIGGRAPH Asia 2023]
Language: Python
#deep_learning #diffusion_models #generative_ai #multimodal #text_to_image
Stars: 164 Issues: 1 Forks: 4
https://github.com/google/break-a-scene

lxe/llavavision
A simple "Be My Eyes" web app with a llama.cpp/llava backend
Language: JavaScript
#ai #artificial_intelligence #computer_vision #llama #llamacpp #llm #local_llm #machine_learning #multimodal #webapp
Stars: 284 Issues: 0 Forks: 7
https://github.com/lxe/llavavision

LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python
#agent #large_language_models #large_multimodal_models #multimodal_large_language_models #tool_use
Stars: 213 Issues: 7 Forks: 13
https://github.com/LLaVA-VL/LLaVA-Plus-Codebase

YangLing0818/RPG-DiffusionMaster
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Language: Python
#image_editing #large_language_models #multimodal_large_language_models #text_to_image_diffusion
Stars: 272 Issues: 5 Forks: 14
https://github.com/YangLing0818/RPG-DiffusionMaster

GitHub

GitHub - YangLing0818/RPG-DiffusionMaster: [ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating…

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG) - YangLing0818/RPG-DiffusionMaster

X-PLUG/MobileAgent
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Language: Python
#agent #gpt4v #mllm #mobile_agents #multimodal #multimodal_large_language_models
Stars: 246 Issues: 3 Forks: 21
https://github.com/X-PLUG/MobileAgent

GitHub

GitHub - X-PLUG/MobileAgent: Mobile-Agent: The Powerful GUI Agent Family

Mobile-Agent: The Powerful GUI Agent Family. Contribute to X-PLUG/MobileAgent development by creating an account on GitHub.

BradyFU/Video-MME
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Language: Python
#large_language_models #large_vision_language_models #mme #multimodal_large_language_models #video #video_mme
Stars: 182 Issues: 1 Forks: 6
https://github.com/BradyFU/Video-MME

GitHub

GitHub - MME-Benchmarks/Video-MME: ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs…

✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis - MME-Benchmarks/Video-MME

ictnlp/LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Language: Python
#large_language_models #multimodal_large_language_models #speech_interaction #speech_language_model #speech_to_speech #speech_to_text
Stars: 274 Issues: 1 Forks: 16
https://github.com/ictnlp/LLaMA-Omni

nv-tlabs/LLaMA-Mesh
Unifying 3D Mesh Generation with Language Models
Language: Python
#3d_generation #llm #mesh_generation #multimodal
Stars: 427 Issues: 7 Forks: 14
https://github.com/nv-tlabs/LLaMA-Mesh

ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python
#efficient #gpt4o #gpt4v #large_language_models #large_multimodal_models #llama #llava #multimodal #multimodal_large_language_models #video #vision #vision_language_model #visual_instruction_tuning
Stars: 173 Issues: 7 Forks: 11
https://github.com/ictnlp/LLaVA-Mini

ByteDance-Seed/Seed1.5-VL
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Language: Jupyter Notebook
#cookbook #large_language_model #multimodal_large_language_models #vision_language_model
Stars: 404 Issues: 0 Forks: 3
https://github.com/ByteDance-Seed/Seed1.5-VL

Tencent-Hunyuan/Hunyuan3D-Omni
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
Language: Python
#3d #3d_aigc #3d_generation #hunyuan3d #image_to_3d #multimodal #shape
Stars: 181 Issues: 0 Forks: 10
https://github.com/Tencent-Hunyuan/Hunyuan3D-Omni

FunAudioLLM/Fun-ASR
Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab.
Language: Python
#audio #audio_language_model #audio_understanding #fun_asr #multimodal_large_language_models #pytorch #speaker_diarization #speech_recognition
Stars: 264 Issues: 4 Forks: 8
https://github.com/FunAudioLLM/Fun-ASR

OpenMOSS/MOVA
MOVA: Towards Scalable and Synchronized Video–Audio Generation
Language: Python
#diffusion_models #multimodal #sglang #video_audio_generation
Stars: 397 Issues: 7 Forks: 24
https://github.com/OpenMOSS/MOVA

fikrikarim/parlor
On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 4 E2B and Kokoro.
Language: HTML
#apple_silicon #gemma #kokoro #litert_lm #local_llm #mlx #multimodal #on_device_ai #python #real_time #speech_recognition #text_to_speech #voice_assistant
Stars: 1183 Issues: 3 Forks: 114
https://github.com/fikrikarim/parlor
