Vision Agents 0.1
First steps here: we've just released 0.1 of Vision Agents. https://github.com/GetStream/Vision-Agents
What My Project Does
The idea is to make it simple to build vision agents by combining fast vision models like YOLO with Gemini/OpenAI realtime models. We're going for low latency and a completely open SDK, so you can use any vision model or any video edge network.
Here's an example of running live video through YOLO pose detection and then passing it to an OpenAI (or Gemini) realtime model:
# Imports assume the 0.1 package layout (vision_agents.core / vision_agents.plugins); adjust to match the repo.
from vision_agents.core import Agent
from vision_agents.plugins import getstream, gemini, openai, ultralytics

agent = Agent(
    edge=getstream.Edge(),  # video edge network, swappable for other providers
    agent_user=agent_user,  # the agent's user object, created ahead of time
    instructions="Read @golfcoach.md",
    llm=openai.Realtime(fps=10),
    # llm=gemini.Realtime(fps=1),  # careful with FPS, it can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)
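Once the agent is constructed, the rough flow is: create the agent's backing user, create (or join) a call on the edge network, and let the agent run until the session ends. A minimal sketch assuming an asyncio entrypoint; the method names (create_user, join, finish) and the video.call(...) helper are assumptions based on the repo's examples, so verify them before copying:

import asyncio
from uuid import uuid4

async def main() -> None:
    # Assumed API: create the agent's user, spin up a call on the edge
    # network, join it, and block until the call finishes.
    await agent.create_user()
    call = agent.edge.client.video.call("default", str(uuid4()))
    await agent.join(call)
    await agent.finish()

asyncio.run(main())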
Target Audience
Vision AI today feels like ChatGPT in 2022: it's really fun to see how it works and what's possible. Use cases range from live coaching and sports to physical therapy, robotics, and drones. It's not production quality yet, though; Gemini and OpenAI both hallucinate a lot on vision tasks. It does seem close to being viable, and it's especially fun to have it describe your surroundings.
Comparison
Similar to LiveKit, but focused on vision agents and open to any vision model or video edge network.