[paper-review] CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundational Model

ArXiv, 2024. [Paper] [Project]

Haoxu Huang^2,3,4*, Fanqi Lin^1,2,4*, Yingdong Hu^1,2,4, Shengjie Wang^1,2,4, Yang Gao^1,2,4 > ¹Institute of Interdisciplinary Information Sciences, Tsinghua University. ²Shanghai Qi Zhi Institute. ³Shanghai Jiao Tong University. ⁴Shanghai Artificial Intelligence Laboratory. ^* The first two authors contributed equally.

Mar. 13.

Fig. 1: Overview of CoPa.

Title

CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundational Model (2024, ArXiv)

Summary:

They introduce a framework CoPa, which generates a sequence of 6-DoF end-effector poses for open-world robotic manipulation.They introduce a framework CoPa, which generates a sequence of 6-DoF end-effector poses for open-world robotic manipulation.

Contributions.

Task-Oriented Grasping Module
- Firstly, they annotate the grasping object leveraging SoM method. (Coarse-Grained Object Grounding)
- Sequentially crop the image into the region of interest (ROI) of the grasped object. Annotate the grasp contact point in the pixel coordinates of the image. Take a sample grasp pose from GraspNet and match it to the annotated contact point. (Fine-grained part grounding)
Task-Aware Motion Planning Module
- This module is used to obtain a series of post-grasp poses. Given the instruction and the current observation, they first employ a grounding module to identify task-relevant parts within the scene.
- Subsequently, these parts are modeled in 3D, and are then projected and annotated onto the scene image. Following this, VLMs are utilized to generate spatial constraints for these parts. Finally, a solver is applied to calculate the post-grasp poses based on these constraints.

Thoughts.

They presented their methodology in a very clear way: Combine (I) high-level task planning, which determines what to do next, and (ii) low-level robotic control, focusing on the precise actuation of joints.
- Now the GPT-X model can be used in robotic tasks to think like a human.
They demonstrate the seamless integration with ViLa to accomplish long-horizon tasks.
- The high-level planner generates a sequence of sub-goals, which are then executed by CoPa.
- The results show that CoPa can be easily integrated with existing high-level planning algorithms to accomplish complex, long-horizon tasks.

Title

Summary:

Contributions.

Thoughts.

Enjoy Reading This Article?