clovaai/donut
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Language: Python
#computer_vision #document_ai #eccv_2022 #multimodal_pre_trained_model #nlp #ocr
Stars: 98 Issues: 2 Forks: 5
https://github.com/clovaai/donut
  
  ilaria-manco/multimodal-ml-music
List of academic resources on Multimodal ML for Music
Language: TeX
#academic_publications #awesome_list #multimodal_data #multimodal_deep_learning #multimodal_learning #music_ai #music_information_retrieval #music_research #resources
Stars: 123 Issues: 1 Forks: 7
https://github.com/ilaria-manco/multimodal-ml-music
  
  SkalskiP/courses
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
Language: Python
#computer_vision #deep_learning #deep_neural_networks #machine_learning #mlops #multimodal #natural_language_processing #nlp #transformers #tutorial
Stars: 323 Issues: 0 Forks: 29
https://github.com/SkalskiP/courses
  
  haotian-liu/LLaVA
[NeurIPS'23 Oral] Large Language-and-Vision Assistant (LLaVA: Visual Instruction Tuning) built towards multimodal GPT-4 level capabilities.
Language: Python
#chatbot #chatgpt #gpt_4 #llama #llava #multimodal
Stars: 716 Issues: 14 Forks: 34
https://github.com/haotian-liu/LLaVA

  open-mmlab/Multimodal-GPT
Multimodal-GPT
Language: Python
#flamingo #gpt #gpt_4 #llama #multimodal #transformer #vision_and_language
Stars: 244 Issues: 1 Forks: 12
https://github.com/open-mmlab/Multimodal-GPT
  
  X-PLUG/mPLUG-Owl
mPLUG-Owl🦉: Modularization Empowers Large Language Models with Multimodality
Language: Python
#alpaca #chatbot #chatgpt #computer_vision #damo #gpt #gpt4 #gpt4_api #huggingface #instruction_tuning #large_language_models #llama #mplug #mplug_owl #multimodal #pretraining #pytorch #transformer #visual_reasoning #visual_recognition
Stars: 209 Issues: 1 Forks: 9
https://github.com/X-PLUG/mPLUG-Owl
  
  OpenGVLab/InternChat
InternChat (since renamed InternGPT) allows you to interact with ChatGPT by clicking, dragging, and drawing using a pointing device.
Language: Python
#chatgpt #click #foundation_model #gpt #gpt_4 #gradio #husky #image_captioning #internimage #langchain #llama #llm #multimodal #ocr #sam #segment_anything #vicuna #video #video_generation #vqa
Stars: 231 Issues: 1 Forks: 10
https://github.com/OpenGVLab/InternChat

  kyegomez/tree-of-thoughts
Plug-and-play implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language: Python
#artificial_intelligence #chatgpt #deep_learning #gpt4 #multimodal #prompt #prompt_engineering #prompt_learning #prompt_tuning
Stars: 366 Issues: 7 Forks: 31
https://github.com/kyegomez/tree-of-thoughts
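The approach named in the description — deliberate problem solving by expanding and pruning candidate "thoughts" — can be sketched generically. This is not the repository's API; `generate` and `score` are hypothetical stand-ins for the LLM calls a real implementation would make:

```python
def generate(thought):
    # Hypothetical stand-in: an LLM would propose continuations of `thought`.
    return [thought + [c] for c in ("a", "b")]

def score(thought):
    # Hypothetical stand-in: an LLM would rate how promising `thought` is.
    return sum(1 for c in thought if c == "a")

def tree_of_thoughts(root, depth=3, beam=2):
    # Breadth-first expansion of candidate thoughts, keeping only the
    # top-`beam` at each depth (the "deliberate" pruning step).
    frontier = [root]
    for _ in range(depth):
        candidates = [t for thought in frontier for t in generate(thought)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=score)

best = tree_of_thoughts([])  # → ["a", "a", "a"] with these toy stand-ins
```

Swapping the toy `generate`/`score` for prompted model calls gives the basic search loop the paper describes; variants differ mainly in the search strategy (BFS vs. DFS) and how thoughts are evaluated.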
  
  OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities.
Language: Python
#audio_language #foundation_models #multimodal #representation_learning #vision_language
Stars: 185 Issues: 2 Forks: 5
https://github.com/OFA-Sys/ONE-PEACE
  
  google/break-a-scene
Official implementation for "Break-A-Scene: Extracting Multiple Concepts from a Single Image" [SIGGRAPH Asia 2023]
Language: Python
#deep_learning #diffusion_models #generative_ai #multimodal #text_to_image
Stars: 164 Issues: 1 Forks: 4
https://github.com/google/break-a-scene
  
  lxe/llavavision
A simple "Be My Eyes" web app with a llama.cpp/llava backend
Language: JavaScript
#ai #artificial_intelligence #computer_vision #llama #llamacpp #llm #local_llm #machine_learning #multimodal #webapp
Stars: 284 Issues: 0 Forks: 7
https://github.com/lxe/llavavision
  
  LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python
#agent #large_language_models #large_multimodal_models #multimodal_large_language_models #tool_use
Stars: 213 Issues: 7 Forks: 13
https://github.com/LLaVA-VL/LLaVA-Plus-Codebase
  
  YangLing0818/RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Language: Python
#image_editing #large_language_models #multimodal_large_language_models #text_to_image_diffusion
Stars: 272 Issues: 5 Forks: 14
https://github.com/YangLing0818/RPG-DiffusionMaster

  X-PLUG/MobileAgent
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Language: Python
#agent #gpt4v #mllm #mobile_agents #multimodal #multimodal_large_language_models
Stars: 246 Issues: 3 Forks: 21
https://github.com/X-PLUG/MobileAgent
  
  BradyFU/Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Language: Python
#large_language_models #large_vision_language_models #mme #multimodal_large_language_models #video #video_mme
Stars: 182 Issues: 1 Forks: 6
https://github.com/BradyFU/Video-MME

  ictnlp/LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Language: Python
#large_language_models #multimodal_large_language_models #speech_interaction #speech_language_model #speech_to_speech #speech_to_text
Stars: 274 Issues: 1 Forks: 16
https://github.com/ictnlp/LLaMA-Omni
  
  nv-tlabs/LLaMA-Mesh
Unifying 3D Mesh Generation with Language Models
Language: Python
#3d_generation #llm #mesh_generation #multimodal
Stars: 427 Issues: 7 Forks: 14
https://github.com/nv-tlabs/LLaMA-Mesh
  
  ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Language: Python
#efficient #gpt4o #gpt4v #large_language_models #large_multimodal_models #llama #llava #multimodal #multimodal_large_language_models #video #vision #vision_language_model #visual_instruction_tuning
Stars: 173 Issues: 7 Forks: 11
https://github.com/ictnlp/LLaVA-Mini
  
  ByteDance-Seed/Seed1.5-VL
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Language: Jupyter Notebook
#cookbook #large_language_model #multimodal_large_language_models #vision_language_model
Stars: 404 Issues: 0 Forks: 3
https://github.com/ByteDance-Seed/Seed1.5-VL
  
  Tencent-Hunyuan/Hunyuan3D-Omni
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
Language: Python
#3d #3d_aigc #3d_generation #hunyuan3d #image_to_3d #multimodal #shape
Stars: 181 Issues: 0 Forks: 10
https://github.com/Tencent-Hunyuan/Hunyuan3D-Omni
  