Embodied LLMs.
Letting Autonomous AIs experience real life!
I haven’t written much about AI lately — mostly because I’ve been confused about the next steps for me as an AI researcher. My goal remains the same: CAI/SAI (Conscious Artificial Intelligence / Sentient Artificial Intelligence) — not AGI, not ASI, though chances are elements of one will be in the other, and vice versa.
The path I originally chose — recreating human cognition through artificial emulation — has taken a back seat, as LLMs are currently all the rage, and probably for good reason: there’s a real possibility they can bootstrap the rest — or at least a significant chunk — of human cognition.
So, what’s one to do? Well, when life hands you LLMs, you use them.
🖐️ I’m looking for a lab, startup, or company I can work with—or for.
I’ve been an independent, self-funded AI researcher and writer for the past decade, with experience in product and software development. I’d love to contribute to a larger, commercial effort.
If you have any leads, I’d really appreciate your consideration.
Thanks!
Overview
Since this is still experimental, some background is required. You can check out Giving LLMs Episodic Memories for a longer theoretical explanation.
At this stage, we’re giving LLMs real — but limited and volatile — episodic experiences, along with some autonomy in acquiring them. In this case, that means a camera feed, basic PTZ (Pan/Tilt/Zoom) controls, and an LLM observing and exploring a real-life scene. That last part is where the term embodiment comes in: an AI interacting with and reasoning about the real world, in real time. You know, like humans do.
Big picture, this is what we are trying to emulate:
This is probably a common loop in many existing robotic applications. What’s new here is using LLMs for reasoning and processing in the real world, in real time — along with giving them some degree of control.
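To make that loop concrete, here is a minimal Python sketch of the observe/reason/act cycle. The helper names (capture_frame, query_llm, apply_ptz) are hypothetical stand-ins for the camera feed, the LLM call, and the PTZ control described in the next section, not the project’s actual code.

```python
# Minimal sketch of the observe -> reason -> act loop described above.
# capture_frame(), query_llm(), and apply_ptz() are hypothetical helpers
# standing in for the camera feed, the multimodal LLM call, and the PTZ API.
import time

def capture_frame():
    """Grab the current image plus the camera's PTZ state (pitch, yaw, zoom)."""
    raise NotImplementedError  # camera-specific

def query_llm(image, ptz_state):
    """Send the frame and state to a multimodal LLM; return introspection and the next PTZ command."""
    raise NotImplementedError  # see the Gemini sketch further down

def apply_ptz(command):
    """Forward the LLM's pan/tilt/zoom command to the camera."""
    raise NotImplementedError  # camera-specific

def embodiment_loop(interval_s: float = 5.0):
    while True:
        image, ptz_state = capture_frame()                     # perceive
        introspection, command = query_llm(image, ptz_state)   # reason
        print(introspection)
        apply_ptz(command)                                     # act
        time.sleep(interval_s)                                 # give the camera time to move
```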
Implementation
In spec/graphical form, this is what I ended up using:
Arguably the star of the show here is the hardware: an OBSbot Tail Air network camera. It’s pricey (around $500), but also a small technological marvel — it does PTZ, has an API for two-way communication, and can even run on battery power (so we can take the AI to the park!). Just beware: programming it is no walk in the aforementioned park. That said, the alternative — building a remote PTZ camera from scratch — is equally difficult and costly.
An early “gotcha” in robotics is knowing where things are — especially with self-controlled systems. In biology, this is known as proprioception: the ability to sense the position of one’s own body parts relative to the environment. In my experiment, PTZ coordinates (pitch, yaw, zoom) are sent along with the image so the LLM knows where the camera is.
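As a rough sketch, the “proprioception” payload can be as simple as bundling the frame with the current PTZ state. The field names and ranges below are illustrative choices of mine, not the camera’s actual API values.

```python
# A sketch of the "proprioception" payload: the frame is never sent alone;
# the camera's current pitch/yaw/zoom ride along so the model knows where it is looking.
# Field names and ranges are illustrative, not the OBSbot's actual API values.
from dataclasses import dataclass, asdict

@dataclass
class PTZState:
    pitch: float   # degrees, e.g. -90..90
    yaw: float     # degrees, e.g. -180..180
    zoom: float    # zoom factor, e.g. 1.0..4.0

@dataclass
class Observation:
    jpeg_bytes: bytes
    ptz: PTZState

def state_prompt(obs: Observation) -> str:
    """Render the camera state as text to append to the LLM prompt."""
    return f"Current camera position: {asdict(obs.ptz)}"
```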
The image and coordinates are then sent to a multimodal LLM (currently using Gemini) with a longish prompt (~100 lines) that defines goals, constraints, subroutines for edge cases, and — importantly — asks for two things: (1) an introspection response (i.e. what it’s looking at, and where it wants to go next), and (2) a set of PTZ commands to send back to the camera. The process then repeats itself and voilà — you have basic perception and embodiment.
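Here is a minimal, illustrative version of that step using the google-generativeai Python SDK. The model name, the abbreviated prompt, and the JSON schema are placeholders standing in for the real ~100-line prompt, so treat this as a sketch rather than the project’s exact code.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: Gemini accessed via the google-generativeai SDK
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative

# Heavily abbreviated stand-in for the real ~100-line prompt.
PROMPT = """You control a PTZ camera. Given the current frame and the camera's
pitch/yaw/zoom, reply with JSON only:
{"introspection": "<what you see and where you want to look next>",
 "ptz": {"pitch": <degrees>, "yaw": <degrees>, "zoom": <factor>}}"""

def query_llm(image, ptz_state):
    """image: a PIL.Image of the current frame; ptz_state: dict with pitch/yaw/zoom."""
    response = model.generate_content(
        [PROMPT, f"Current camera position: {ptz_state}", image]
    )
    # Strip an optional markdown fence before parsing the model's JSON reply.
    raw = response.text.strip().removeprefix("```json").removesuffix("```")
    reply = json.loads(raw)
    return reply["introspection"], reply["ptz"]
```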
Demo time!
My final implementation (at least for this stage), like all good experimental projects, is a bit of a hot mess — full of things I’d definitely do better next time. It runs remotely on Colab (cool, but unnecessary), there’s a local server, ngrok tunneling, a thousand HTTP requests flying around… you get the idea. But it works!
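For the curious, here is one plausible shape for that plumbing: a tiny Flask server running next to the camera, exposed through ngrok, with the Colab notebook POSTing PTZ commands to it. The endpoint name and the send_to_camera helper are hypothetical, not the project’s actual interfaces.

```python
# One plausible way to wire the Colab-to-camera plumbing mentioned above:
# a tiny Flask server runs next to the camera, exposed through ngrok, and the
# Colab notebook POSTs PTZ commands to it. Endpoint name and send_to_camera()
# are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

def send_to_camera(pitch: float, yaw: float, zoom: float) -> None:
    """Translate the command into whatever the camera's own API expects."""
    raise NotImplementedError

@app.route("/ptz", methods=["POST"])
def ptz():
    cmd = request.get_json(force=True)
    send_to_camera(cmd["pitch"], cmd["yaw"], cmd["zoom"])
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=8080)   # then: ngrok http 8080, and point Colab at the public URL
```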
It’s genuinely awe-inspiring to watch the LLM control the camera — getting stuck, getting unstuck, and in general doing a pretty solid job of observing the scene and returning mostly coherent, relevant responses. Here are a couple of output samples so you can get a better idea:
The responses are highly dependent on the prompt, and most don’t quite cross the uncanny LLM valley — but some come surprisingly close to fooling you:
It’s also very touch-and-go when asking for more complex (agentic, if you will) behavior — but it’s doable. In this case, I provided a reset command to be called when the system gets stuck (e.g., the camera looking down at its own base or being obstructed).
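Here is a sketch of that escape hatch, with all names hypothetical: the model is allowed to answer with a reset command, and the controller also falls back to a home position when the camera stops moving meaningfully.

```python
# Sketch of the "get unstuck" path: the prompt allows the model to answer with
# a reset command, and the controller also falls back to a home position if the
# camera hasn't moved meaningfully for a while. All names here are hypothetical.
HOME = {"pitch": 0.0, "yaw": 0.0, "zoom": 1.0}
STUCK_LIMIT = 5  # consecutive near-identical positions before forcing a reset

def handle_command(reply: dict, history: list) -> dict:
    """Return the PTZ command to execute, overriding with HOME when stuck."""
    if reply.get("command") == "reset":
        return HOME
    ptz = reply["ptz"]
    history.append(ptz)
    recent = history[-STUCK_LIMIT:]
    if len(recent) == STUCK_LIMIT and all(
        abs(p["pitch"] - ptz["pitch"]) < 1 and abs(p["yaw"] - ptz["yaw"]) < 1
        for p in recent
    ):
        return HOME  # camera appears stuck; go back to the home position
    return ptz
```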
Next steps — Takeaways.
This is all kinds of awesome and a bit scary. It’s awesome because, hardware aside, the cost and tech are very accessible for what is essentially a digital observer that can reason about its observations and do a bit of self-directed exploration and information gathering (though you still provide the goal). It’s scary because a ton of product and startup ideas could build on this, potentially replacing a lot of unskilled labor or serving nefarious goals.
Regarding my goals for a conscious AI, there’s still a lot of work to be done. Memory, in all its forms, is the next big step. As you may have noticed, there is no object permanence of any kind here, let alone a true concept of self. A lot of what makes you conscious comes from a variety of memory types, interactions, and systems running in parallel that still need to be implemented. On the other hand, I left this experiment wanting more: using LLMs proved to be a valid research path.
While I was testing this very crude embodied LLM at a coffee shop, a couple sat down with a pet carrier containing one Persian cat named Leonardo. While his owners worked on their laptops, Leonardo, like the LLM, spent his time just looking around. So maybe the whole consciousness/sentience thing we hold up so high is less mysterious, more basic, and easier to achieve than we originally thought. But, you know, still early days.
Thanks for reading!