[paper-review] MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting

ArXiv, 2024. [Paper] [Project]

Kuan Fang*, Fangchen Liu*, Pieter Abbeel, Sergey Levine

* denotes equal contribution (alphabetical order). Berkeley AI Research, UC Berkeley.


Fig. 1: Overview of MOKA.

Title

MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting (ArXiv, 2024)

Summary

MOKA converts the motion generation problem into a series of visual question-answering problems that the VLM can solve.

Method

  • They introduce a point-based affordance representation that bridges the VLM’s predictions on RGB images and the robot’s motions in the physical world (a minimal sketch follows this list).
  • They refer to the physical interaction at each stage as a subtask. Subtasks include interactions with the object in hand (e.g., lifting an object, opening a drawer), interactions with environmental objects unattached to the robot (e.g., pushing an obstacle, pressing a button), and tool use, which involves grasping a tool object to make contact with another object.
  • The VLM first decomposes the task into a sequence of feasible subtasks based on the free-form language description l.
  • Then GroundedSAM is used to segment objects in the 2D image, and marks are overlaid on candidate keypoints (similar to Set-of-Mark (SoM) prompting). Additionally, for each subtask, the VLM is asked to provide a summary of the subtask instruction (see the second sketch after this list).
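
To make the affordance representation concrete, here is a minimal sketch of how the per-subtask point predictions could be organized in code. The field names (grasp_point, function_point, target_point, waypoints) are my assumptions based on the paper’s description, not the authors’ implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A 2D keypoint in image (pixel) coordinates.
Point2D = Tuple[int, int]

@dataclass
class SubtaskAffordance:
    """Point-based affordance for one subtask, predicted by the VLM on the
    marked RGB image. Field names are illustrative assumptions."""
    instruction: str                    # VLM's free-form summary of the subtask
    grasp_point: Optional[Point2D]      # where to grasp; None for non-prehensile subtasks
    function_point: Optional[Point2D]   # contact point on a grasped tool, if any
    target_point: Point2D               # where the interaction should take place
    waypoints: List[Point2D] = field(default_factory=list)  # pre-/post-contact waypoints
```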
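And a rough sketch of the mark-based prompting loop itself, assuming hypothetical helpers: segment_objects stands in for a GroundedSAM wrapper, query_vlm for the GPT-4V call, and parse_mark_indices for answer parsing; none of these names come from the paper.

```python
import cv2
import numpy as np

def propose_keypoints(mask: np.ndarray, k: int = 5) -> list:
    """Sample up to k candidate keypoints (pixel coords) from a binary mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return []
    idx = np.random.choice(len(xs), size=min(k, len(xs)), replace=False)
    return [(int(x), int(y)) for x, y in zip(xs[idx], ys[idx])]

def draw_marks(image: np.ndarray, points: list) -> np.ndarray:
    """Overlay numbered marks on candidate keypoints (SoM-style)."""
    out = image.copy()
    for i, (x, y) in enumerate(points):
        cv2.circle(out, (x, y), 6, (255, 255, 255), -1)
        cv2.putText(out, str(i), (x + 8, y - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    return out

def solve_subtask(image: np.ndarray, instruction: str) -> list:
    """One VQA round: segment, overlay marks, and ask the VLM to pick marks."""
    masks = segment_objects(image, instruction)   # hypothetical GroundedSAM wrapper
    candidates = [p for m in masks for p in propose_keypoints(m)]
    marked = draw_marks(image, candidates)
    answer = query_vlm(                           # hypothetical GPT-4V call
        image=marked,
        question=f"For the subtask '{instruction}', which marks are the grasp, "
                 f"function, and target points? Answer with mark indices.")
    return [candidates[i] for i in parse_mark_indices(answer)]  # hypothetical parser
```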

Thoughts

  • This paper’s approach is similar in spirit to the earlier VPI paper. I read this paper with great interest.
  • They have to lift all predicted points from the 2D image into 3D Cartesian space. To do so, they only consider cases where the waypoints lie at the same height as the target point, which covers most common tabletop manipulation scenarios (see the first sketch after this list).
    • This assumption limits the scope of the paper.
  • Since robust grasping relies on contact physics, they do not directly execute the grasp pose predicted by the VLM; instead, they run a grasp sampler and select the sampled grasp closest to the VLM’s prediction (see the second sketch after this list).
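
To make the same-height assumption concrete, here is a minimal sketch of lifting predicted pixels into 3D with a depth image and pinhole intrinsics, then forcing the waypoint to the target’s height. The names (camera_to_world, the intrinsics fx, fy, cx, cy) are my assumptions, not the authors’ code.

```python
import numpy as np

def deproject(pixel, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) into a 3D camera-frame point using its depth value."""
    u, v = pixel
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Deproject both points, then force the waypoint to the target's height,
# since a mid-air waypoint has no reliable depth of its own.
# camera_to_world is a hypothetical camera-extrinsics transform.
target_w = camera_to_world(deproject(target_px, depth, fx, fy, cx, cy))
waypoint_w = camera_to_world(deproject(waypoint_px, depth, fx, fy, cx, cy))
waypoint_w[2] = target_w[2]  # assumes the world z-axis points up
```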
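And a sketch of the grasp-selection step as I read it: run an off-the-shelf grasp sampler, then execute the sampled grasp nearest the VLM’s predicted grasp point. The dict layout of a sampled grasp is an assumption for illustration.

```python
import numpy as np

def select_grasp(vlm_point: np.ndarray, sampled_grasps: list) -> dict:
    """Pick the physically grounded grasp (from a grasp sampler) whose position
    is closest to the VLM-predicted grasp point, rather than executing the
    VLM's raw output directly."""
    dists = [np.linalg.norm(g["position"] - vlm_point) for g in sampled_grasps]
    return sampled_grasps[int(np.argmin(dists))]
```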


