Companies like OpenAI and Midjourney create chatbots, image generators, and other artificial intelligence tools that operate in the digital world.
Now, a startup founded by three former OpenAI researchers is using the development methods behind chatbots to create artificial intelligence technology that can navigate the physical world.
Covariant, a robotics company based in Emeryville, California, is creating ways for robots to pick, move and sort items as they are transported through warehouses and distribution centers. Its goal is to help robots understand what is happening around them and decide what they should do next.
The technology also gives the robots a broad understanding of the English language, allowing people to chat with them as if they were chatting with ChatGPT.
The technology, still under development, is not perfect. But it's a clear sign that the artificial intelligence systems that power online chatbots and image generators will also power machines in warehouses, on roads and in homes.
Like chatbots and image generators, this robotic technology learns its skills by analyzing huge amounts of digital data. That means engineers can improve the technology by feeding it more and more data.
Covariant, backed by $222 million in funding, doesn't build robots. It builds the software that powers robots. The company aims to deploy its new technology first with warehouse robots, providing a road map for others to do the same in manufacturing plants and perhaps even on roads with self-driving cars.
The artificial intelligence systems that power chatbots and image generators are called neural networks, named after the network of neurons in the brain.
By identifying patterns in large amounts of data, these systems can learn to recognize words, sounds and images, or even generate them on their own. This is how OpenAI built ChatGPT, giving it the power to answer questions instantly, write term papers and generate computer programs. It learned those skills from text culled from across the internet. (Several media outlets, including The New York Times, have sued OpenAI for copyright infringement.)
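The pattern-finding at the core of a neural network can be sketched in a few lines. The toy model below, written in PyTorch, learns to separate two kinds of points by repeatedly adjusting its weights against examples; the architecture, data and training settings are all illustrative, nothing like the scale of the systems the article describes.

```python
# A toy neural network that learns a pattern from examples.
# Illustrative only; systems like ChatGPT are vastly larger.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Toy data: points above the line y = x belong to class 1, the rest to class 0.
x = torch.randn(256, 2)
y = (x[:, 1] > x[:, 0]).long()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # how wrong is the network right now?
    loss.backward()               # trace the error back through the network
    optimizer.step()              # nudge the weights toward the pattern
```

Feeding this loop more, and more varied, examples is what "improving the technology with more data" means in practice.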
Today, companies are building systems that can learn from different types of data at the same time. By analyzing both a collection of photographs and the captions that describe those photographs, for example, a system can capture the relationships between the two. It might learn that the word "banana" describes a yellow, curved fruit.
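One common way to capture that relationship is to train two encoders so that each photograph and its caption land close together in a shared embedding space, in the spirit of OpenAI's CLIP. The sketch below is a minimal, hypothetical version; the class names, feature dimensions and temperature are assumptions for illustration, not Covariant's or OpenAI's actual code.

```python
# Minimal two-tower image-caption model trained contrastively (CLIP-style).
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Projects image features and caption features into one shared space."""
    def __init__(self, img_dim=2048, text_dim=768, shared_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(img_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, img_feats, text_feats):
        # Normalize so that dot products behave like cosine similarities.
        img = F.normalize(self.image_proj(img_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # Each image should be most similar to its own caption (the diagonal).
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

After training on many image-caption pairs, the embedding for the word "banana" ends up near the embeddings of photographs of yellow, curved fruit, which is what "capturing the relationship" amounts to.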
OpenAI used that kind of system to build Sora, its new video generator. By analyzing thousands of captioned videos, the system learned to generate videos when given a short description of a scene, such as "a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures."
Covariant, founded by Pieter Abbeel, a professor at the University of California, Berkeley, and three of his former students, Peter Chen, Rocky Duan and Tianhao Zhang, used similar techniques to build a system that powers warehouse robots.
The company helps operate sorting robots in warehouses around the world. It has spent years collecting data (from cameras and other sensors) that shows how these robots operate.
“It ingests all kinds of data important to robots, which can help them understand and interact with the physical world,” Dr. Chen said.
By combining that data with the massive amounts of text used to train chatbots like ChatGPT, the company has created artificial intelligence technology that gives its robots a much broader understanding of the world around them.
After identifying patterns in this mix of images, sensory data and text, the technology gives a robot the power to handle unexpected situations in the physical world. The robot knows how to pick up a banana, even though it has never seen a banana before.
It can also respond in plain English, much like a chatbot. If you tell it to "grab a banana," it knows what that means. If you tell it to "pick a yellow fruit," it understands that, too.
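How a plain-English command might be resolved against what the robot sees can be illustrated with a deliberately simple sketch. The word-overlap scoring below is a stand-in; Covariant's real system uses learned joint embeddings of language and sensor data, and every name here is hypothetical.

```python
# Toy illustration of language-conditioned object selection.
# A real system would score objects with learned embeddings, not word overlap.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    name: str
    attributes: set        # e.g., {"yellow", "curved", "fruit"} from perception
    grasp_point: tuple     # (x, y, z) target for the gripper

def select_target(instruction: str, objects: list) -> DetectedObject:
    """Pick the detected object whose description best matches the command."""
    words = set(instruction.lower().split())
    return max(objects, key=lambda o: len(words & (o.attributes | {o.name})))

scene = [
    DetectedObject("banana", {"yellow", "curved", "fruit"}, (0.4, 0.1, 0.0)),
    DetectedObject("apple", {"red", "round", "fruit"}, (0.2, 0.3, 0.0)),
]
print(select_target("grab a banana", scene).name)        # banana
print(select_target("pick a yellow fruit", scene).name)  # banana
```

Both phrasings resolve to the banana, as in the article's example: the second command never mentions the word "banana" at all, only attributes the system has learned to associate with it.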
It can even generate videos that predict what is likely to happen when it tries to pick up a banana. These videos have no practical use in a warehouse, but they show the robot's understanding of its surroundings.
“If you can predict the next frames of a video, you can pinpoint the right strategy to follow,” Dr. Abbeel said.
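That idea, using predicted video frames to choose among candidate actions, is essentially model-predictive control. The sketch below substitutes stub functions for the learned video predictor and outcome scorer, since Covariant's actual models are not public; the function and action names are assumptions for illustration.

```python
# Sketch of action selection driven by video prediction (model-predictive
# control). The predictor and scorer are stubs standing in for learned models.
import random

def predict_next_frames(frame, action, horizon=5):
    # Stub: a real system would run a learned video-prediction model here.
    return [f"{frame} after {action}, step {t}" for t in range(horizon)]

def outcome_score(frames):
    # Stub: a real system would estimate, say, grasp success from the rollout.
    return random.random()

def choose_action(current_frame, candidate_actions):
    """Return the candidate action whose predicted rollout scores best."""
    return max(candidate_actions,
               key=lambda a: outcome_score(predict_next_frames(current_frame, a)))

print(choose_action("camera_frame_0", ["pinch_grasp", "suction_grasp", "regrasp"]))
```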
The technology, called RFM, for robotics foundation model, makes mistakes, much as chatbots do. Though it often understands what people ask of it, there is always a chance that it will not. It drops objects from time to time.
Gary Marcus, an artificial intelligence entrepreneur and professor emeritus of psychology and neural science at New York University, said the technology could be useful in warehouses and other situations where mistakes are acceptable. But he said it would be more difficult and riskier to deploy in manufacturing plants and other potentially dangerous situations.
“It all comes down to the cost of error,” he said. “If you have a 150-pound robot that can do something harmful, that cost can be high.”
As companies train this kind of system with increasingly large and varied collections of data, researchers believe it will rapidly improve.
This is very different from the way robots operated in the past. Typically, engineers programmed robots to perform the same precise movement over and over again, such as lifting a box of a certain size or placing a rivet in a particular location on a car's rear bumper. But robots could not deal with unexpected or random situations.
By learning from digital data (hundreds of thousands of examples of what happens in the physical world) robots can begin to handle the unexpected. And when those examples are paired with language, robots can also respond to text and voice suggestions, just as a chatbot would.
This means that, like chatbots and image generators, robots will become more capable as they are fed more data.
“What's in the digital data can be transferred to the real world,” Dr. Chen said.