Meet KITE: An AI Framework for Semantic Manipulation Using Key Points as Representations for Visual Grounding and Precise Action Inference

With rapid advances in artificial intelligence, AI is increasingly being combined with robotics. From computer vision and natural language processing to edge computing, AI techniques are being integrated into robots to build practical, effective solutions. Because AI robots are machines that act in the real world, language is a natural candidate for communication between people and robots. However, two main problems prevent modern robots from efficiently handling free-form language inputs. The first is enabling a robot to reason about what it needs to manipulate based on the instructions provided. The second arises in pick-and-place tasks that demand careful discernment, such as picking up a stuffed animal by the ears rather than the legs, or a soap bottle by the dispenser rather than the sides.

To handle such instructions, a robot must extract scene and object semantics from the input and plan precise low-level actions that carry out the semantic manipulation. To address these challenges, researchers at Stanford University have introduced KITE (Keypoints + Instructions to Execution), a two-step framework for semantic manipulation. KITE accounts for both scene semantics and object semantics: scene semantics involves discriminating between different objects in a visual scene, while object semantics involves precisely localizing different parts within an object instance.

The first phase of KITE grounds an input instruction in the visual scene using 2D image keypoints. This provides a highly precise, object-centric bias for downstream action inference. By mapping the instruction to keypoints in the scene, the robot develops a precise understanding of the objects and their relevant parts. In the second phase, KITE executes a learned, keypoint-conditioned skill based on an RGB-D observation of the scene. The robot uses these parameterized skills to carry out the given instruction. Together, keypoints and parameterized skills enable fine-grained manipulation and generalization across variations in scenes and objects.
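The two phases described above can be pictured as a small pipeline: ground the instruction to a keypoint, then hand that keypoint to a parameterized skill. The following Python sketch is only an illustration of that structure and is not KITE's code. The names `ground_instruction`, `pick_skill`, `SKILLS`, and `execute` are hypothetical, the word-overlap grounding is a placeholder for KITE's learned grounding model, and the waypoints are simplified to pixel coordinates plus depth rather than full 6-DoF poses.

```python
from typing import Callable, Dict, List, Tuple

Keypoint = Tuple[int, int]             # (row, col) pixel coordinates
Waypoint = Tuple[float, float, float]  # (col, row, depth); a 6-DoF waypoint would add orientation
DepthImage = List[List[float]]         # depth in meters, indexed as depth[row][col]


def ground_instruction(instruction: str, candidates: Dict[str, Keypoint]) -> Keypoint:
    """Phase 1 stand-in: pick the candidate whose label overlaps most with the instruction.

    Assumption: KITE learns this mapping from images and language; word
    overlap over hand-labelled candidates is only a placeholder.
    """
    words = set(instruction.lower().split())
    best_label = max(candidates, key=lambda label: len(words & set(label.split())))
    return candidates[best_label]


def pick_skill(keypoint: Keypoint, depth: DepthImage) -> List[Waypoint]:
    """Phase 2 stand-in: hover 10 cm in front of the keypoint, then move to it."""
    row, col = keypoint
    z = depth[row][col]
    return [(float(col), float(row), z - 0.10), (float(col), float(row), z)]


# Hypothetical skill library: each skill is parameterized by a keypoint.
SKILLS: Dict[str, Callable[[Keypoint, DepthImage], List[Waypoint]]] = {
    "pick": pick_skill,
}


def execute(
    instruction: str,
    skill_name: str,
    candidates: Dict[str, Keypoint],
    depth: DepthImage,
) -> Tuple[Keypoint, List[Waypoint]]:
    """Run both phases: ground the instruction, then query the chosen skill."""
    keypoint = ground_instruction(instruction, candidates)
    return keypoint, SKILLS[skill_name](keypoint, depth)
```

Calling `execute("pick up the soap bottle by the dispenser", "pick", {"bottle dispenser": (1, 2), "bottle side": (2, 2)}, depth)` would select the dispenser keypoint, since that label shares more words with the instruction. The point of the split is that the skill never sees the language, only the keypoint it was grounded to.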

For evaluation, the team tested KITE in three real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and high-precision coffee making. KITE achieved overall instruction-following success rates of 75% in tabletop manipulation, 70% in semantic grasping, and 71% in coffee making. It outperformed frameworks that rely on pretrained visual language models instead of keypoint-based grounding, as well as frameworks that forgo skills in favor of end-to-end visuomotor control.

KITE achieved these results while being trained on the same number of demonstrations or fewer, which points to its efficiency. To map an image and a language phrase to a heatmap from which a keypoint is extracted, KITE employs a CLIPort-style architecture. To generate skill waypoints, the skill architecture adapts PointNet++ to accept a multi-view point cloud annotated with a keypoint. The 2D keypoints let KITE attend precisely to visual features, while the 3D point clouds provide the 6-DoF context needed for planning.
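To make the link between the 2D keypoint and the 3D point cloud concrete, here is a minimal Python sketch, under stated assumptions, of three steps: taking the peak of a heatmap as the keypoint, lifting that pixel into 3D with a depth value and pinhole camera intrinsics, and attaching a per-point keypoint channel to a point cloud. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical. The article does not say how KITE annotates the point cloud, so the Gaussian distance channel (with an assumed `sigma`) is just one plausible choice.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def heatmap_to_keypoint(heatmap: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return the (row, col) of the highest-scoring pixel in the heatmap."""
    pixels = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    return max(pixels, key=lambda rc: heatmap[rc[0]][rc[1]])


def deproject(
    keypoint: Tuple[int, int],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Point3D:
    """Lift a pixel to a camera-frame 3D point using the pinhole model."""
    row, col = keypoint
    z = depth[row][col]
    return ((col - cx) * z / fx, (row - cy) * z / fy, z)


def annotate_points(
    points: Sequence[Point3D], keypoint_3d: Point3D, sigma: float = 0.05
) -> List[Tuple[float, float, float, float]]:
    """Append a fourth channel that decays with distance from the keypoint.

    Assumption: a Gaussian falloff is used here purely for illustration;
    the source does not specify KITE's annotation scheme.
    """
    annotated = []
    for p in points:
        dist_sq = sum((a - b) ** 2 for a, b in zip(p, keypoint_3d))
        weight = math.exp(-dist_sq / (2.0 * sigma ** 2))
        annotated.append((p[0], p[1], p[2], weight))
    return annotated
```

A network such as PointNet++ could then consume the four-channel points, so that the extra channel tells it which region of the cloud the instruction refers to.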

In conclusion, KITE presents a promising approach to the longstanding challenge of enabling robots to interpret and follow natural language commands for manipulation. By combining keypoints with instruction grounding, it achieves fine-grained semantic manipulation with high precision and generalization.

Check out the Paper and Project.

Tanya Malhotra is a final year student at the University of Petroleum and Power Studies, Dehradun, studying BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.