Humans can extrapolate and learn to solve variations of a manipulation task when the objects involved have different visual or physical attributes, given just a few examples of completing the task with standard objects. For learned policies to be universal across object scales, orientations, and visual appearances, existing work on robot learning still requires considerably more data, and even with such increases, generalization to unseen variations is not guaranteed.
A new paper from Stanford University investigates the challenge of zero-shot learning of a visuomotor policy that takes as input a small number of sample trajectories from a single source manipulation scenario and generalizes to scenarios with unseen objects of different visual appearance, size, and pose. In particular, the researchers aim to learn policies not only for rigid-object tasks such as pick-and-place, but also for deformable and articulated objects such as clothing or boxes. To ensure that the learned policy is robust to different object positions, orientations, and scales, they propose to build equivariance into both the visual representation of objects and the policy architecture.
They present EquivAct, a novel visuomotor policy learning approach that learns closed-loop policies for 3D robot manipulation tasks from demonstrations in a single source manipulation scenario and generalizes zero-shot to unseen scenarios. The learned policy takes as input the pose of the robot's end-effector and a partial point cloud of the environment, and outputs robot actions such as end-effector velocities and gripper commands. Unlike most previous work, the researchers use SIM(3)-equivariant network architectures: when the input point cloud and end-effector positions are translated, rotated, or scaled, the output end-effector velocities transform accordingly. Because the policy architecture is equivariant, it can learn from demonstrations of smaller-scale tabletop activities and then generalize to mobile manipulation tasks that involve much larger variations of the demonstrated objects with different visual and physical appearances.
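To make the equivariance property concrete, here is a minimal sketch (not the authors' code) of what SIM(3) equivariance of a visuomotor policy means: if the input point cloud and end-effector position are rotated, translated, and uniformly scaled, the predicted end-effector velocity rotates and scales in the same way. The toy "policy" below (move toward the point-cloud centroid) happens to satisfy this property exactly.

```python
import numpy as np

def toy_policy(point_cloud: np.ndarray, ee_pos: np.ndarray) -> np.ndarray:
    """Toy equivariant policy: velocity toward the scene centroid."""
    return point_cloud.mean(axis=0) - ee_pos

def random_sim3():
    """Sample a random similarity transform (scale s, rotation R, translation t)."""
    s = np.random.uniform(0.5, 2.0)
    Q, _ = np.linalg.qr(np.random.randn(3, 3))
    R = Q * np.sign(np.linalg.det(Q))  # ensure a proper rotation (det = +1)
    t = np.random.randn(3)
    return s, R, t

# Original observation: a partial point cloud and the end-effector position.
cloud = np.random.randn(512, 3)
ee = np.random.randn(3)
v = toy_policy(cloud, ee)

# Apply the same SIM(3) transform to every input.
s, R, t = random_sim3()
cloud_T = s * cloud @ R.T + t
ee_T = s * R @ ee + t
v_T = toy_policy(cloud_T, ee_T)

# Equivariance check: the output velocity transforms as s * R @ v
# (the translation drops out, since a velocity has no fixed origin).
assert np.allclose(v_T, s * R @ v, atol=1e-6)
print("SIM(3) equivariance holds for the toy policy.")
```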
The approach is divided into two parts: representation learning and policy learning. To train the representation, the team first provides the agent with a set of simulated point clouds captured with the same camera settings as in the target task, but featuring objects with different, non-uniform random scales. They augment the training data in this way to accommodate non-uniform scaling, even though the proposed architecture is equivariant only to uniform scaling. The simulated data does not have to show the robot's actions or even demonstrate the actual task. Using this simulated data, they train a SIM(3)-equivariant encoder-decoder architecture that extracts global and local features from the scene point cloud. During training, a contrastive learning loss on paired point cloud inputs pulls together the local features of corresponding object parts at similar positions. In the policy learning phase, the method assumes access to only a small number of demonstration trajectories of the task.
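Below is a minimal sketch of the representation pre-training step described above, assuming a hypothetical per-point encoder and a simple point-wise InfoNCE loss (neither is from the paper's codebase): simulated point clouds are augmented with random non-uniform scaling, and a contrastive loss pulls together the features of corresponding points across the two augmented views.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Hypothetical per-point encoder standing in for the SIM(3)-equivariant
    encoder-decoder; it maps (B, N, 3) points to (B, N, D) local features."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        return self.mlp(pts)

def augment_nonuniform_scale(pts: torch.Tensor) -> torch.Tensor:
    """Scale each axis independently, mimicking the non-uniform scale augmentation."""
    scale = torch.empty(1, 1, 3).uniform_(0.5, 2.0)
    return pts * scale

def pointwise_info_nce(f1: torch.Tensor, f2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss: the feature of point i in view 1 should match the
    feature of the same point i in view 2 and mismatch all other points."""
    f1 = F.normalize(f1, dim=-1)  # (N, D)
    f2 = F.normalize(f2, dim=-1)
    logits = f1 @ f2.t() / tau    # (N, N) similarity matrix
    labels = torch.arange(f1.shape[0])
    return F.cross_entropy(logits, labels)

encoder = PointEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One pre-training step on a synthetic point cloud (no robot actions needed).
cloud = torch.randn(1, 1024, 3)
view_a, view_b = augment_nonuniform_scale(cloud), augment_nonuniform_scale(cloud)
loss = pointwise_info_nce(encoder(view_a)[0], encoder(view_b)[0])
loss.backward()
optimizer.step()
print(f"contrastive loss: {loss.item():.3f}")
```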
The researchers use these demonstrations to train a closed-loop policy that, given a partial point cloud of the scene as input, uses the pre-trained encoder to extract global and local features and then feeds those features into a SIM(3)-equivariant action prediction network to predict end-effector movements. Beyond the standard rigid-object manipulation tasks of previous work, the proposed method is evaluated on more complex quilt folding, container covering, and box sealing tasks.
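The following is a minimal sketch of that closed-loop dataflow, with purely hypothetical stand-in classes (PretrainedEncoder, ActionHead) and placeholder observations rather than any real robot or camera API: at each control step the current partial point cloud is encoded into global and local features, and the action head maps those features plus the end-effector pose to a velocity and a gripper command.

```python
import numpy as np

class PretrainedEncoder:
    """Stand-in for the frozen encoder from the representation-learning phase."""
    def __call__(self, cloud: np.ndarray):
        local = np.tanh(cloud @ np.random.RandomState(0).randn(3, 32))  # (N, 32)
        return local.mean(axis=0), local                                # global, local

class ActionHead:
    """Stand-in for the SIM(3)-equivariant action prediction network."""
    def __call__(self, global_feat, local_feat, ee_pose):
        velocity = np.zeros(3)                   # placeholder end-effector velocity
        gripper_close = bool(global_feat[0] > 0) # placeholder gripper command
        return velocity, gripper_close

def control_loop(encoder, head, horizon: int = 5):
    for step in range(horizon):
        cloud = np.random.randn(512, 3)          # placeholder partial point cloud
        ee_pose = np.zeros(3)                    # placeholder end-effector position
        global_feat, local_feat = encoder(cloud)
        velocity, gripper_close = head(global_feat, local_feat, ee_pose)
        print(step, velocity, gripper_close)     # would be sent to the robot

control_loop(PretrainedEncoder(), ActionHead())
```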
For each task, the team collects a handful of human demonstrations in which a person manipulates a tabletop-scale object. They then evaluate the method on a mobile manipulation platform, where the robots must solve the same task at a much larger scale. The findings show that the method learns a closed-loop robot manipulation policy from the source demonstrations and executes the target task in a single run without further adjustment. The approach is also shown to be more efficient than methods that rely on extensive data augmentation for generalization to out-of-distribution object poses and scales, and it outperforms baselines that do not exploit equivariance.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.