SPOTS: Stable Placement of Objects with Reasoning in Semi-Autonomous Teleoperation Systems

Joonhyung Lee¹, Sangbeom Park¹, Jeongeun Park¹,

Kyungjae Lee², Sungjoon Choi¹

Korea University¹

Chung-Ang University²


News

We are happy to announce that SPOTS has been accepted to ICRA 2024! 😆🎉🎉
Code will be uploaded soon.

SPOTS is a semi-autonomous teleoperation framework that verifies placement positions in two steps: 1) Stability Verification (i.e., physics-based simulation) and 2) Receptacle Reasoning (i.e., common knowledge), which uses LLMs to understand the scene context and reason about the corresponding task without learning.

Abstract

Pick-and-place is one of the fundamental tasks in robotics research. However, attention has mostly focused on the "pick" task, leaving the "place" task relatively unexplored. In this paper, we address the problem of placing objects in the context of a teleoperation framework. In particular, we focus on two aspects of the place task: stability robustness and contextual reasonableness of object placements. Our proposed method combines simulation-driven physical stability verification via real-to-sim and the semantic reasoning capability of large language models. In other words, given place context information (e.g., user preferences, object to place, and current scene information), our proposed method outputs a probability distribution over the possible placement candidates, considering the robustness and reasonableness of the place task. Our proposed method is extensively evaluated in two simulation environments and one real-world environment, and we show that it can greatly increase the physical plausibility of the placement as well as its contextual soundness while considering user preferences.

Framework

Overall pipeline of the proposed teleoperation framework. Given the scene as input, the system first checks the physical stability of candidate placements. It then identifies contextually reasonable positions in the receptacle reasoning step, considering the scene's context, and recommends the coordinates obtained from both processes to the user.

SPOTS

Stability Verification

We aim to identify regions where objects can be stably placed over a given interaction time \(T\) in simulation. More specifically, to determine the robustness of the placement stability, small perturbations are injected after the object has been placed. We define the set of points \( \mathcal{P}_{\text{s}} \) to represent coordinates where objects can be placed stably.
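The perturbation test described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the authors' implementation: `toy_rollout` is a hypothetical one-dimensional stand-in for the physics simulation (a flat support spanning [0, 1]), the perturbations are a deterministic sweep rather than random noise, and the names `stable_points`, `eps`, and `tol` are ours.

```python
import numpy as np


def toy_rollout(x, T, dt=0.01, lo=0.0, hi=1.0):
    """Hypothetical stand-in for a physics rollout of duration T.

    The object rests on a flat support spanning [lo, hi]. While its centre
    is over the support it stays put; once outside, it accelerates away.
    """
    v = 0.0
    centre = 0.5 * (lo + hi)
    for _ in range(int(round(T / dt))):
        if lo <= x <= hi:
            v = 0.0  # supported: the object does not move
        else:
            v += 9.81 * dt * np.sign(x - centre)  # unsupported: it falls away
        x += v * dt
    return x


def stable_points(candidates, T=1.0, eps=0.05, n_perturb=5, tol=1e-3):
    """Return the candidates that stay in place under every perturbation."""
    offsets = np.linspace(-eps, eps, n_perturb)  # assumed perturbation sweep
    stable = []
    for x0 in candidates:
        starts = x0 + offsets  # perturbed placements around the candidate
        if all(abs(toy_rollout(s, T) - s) < tol for s in starts):
            stable.append(x0)
    return stable


# Candidates near the support edge (0.99) or off the support (1.2) are rejected.
P_s = stable_points([0.1, 0.5, 0.99, 1.2])
print(P_s)
```

In this toy setting only the candidates 0.1 and 0.5 survive, since every perturbed copy of them stays over the support for the whole interaction time. In a real pipeline the rollout would be replaced by a physics engine step loop.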

Receptacle Reasoning

Although the points in \( \mathcal{P}_{\text{s}} \) verified in the Stability Verification step are physically feasible, the set may contain options that do not consider the context of the scene. Therefore, we analyze reasonableness only within \( \mathcal{P}_{\text{s}} \), keeping the points that fit the current scene's situation and context.
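One way to picture how reasoning restricts \( \mathcal{P}_{\text{s}} \) is to weight each stable point by a score for the receptacle it lies on and normalize the result. The sketch below is an assumption for illustration only: the page does not specify how the distribution is computed, the scores are hard-coded rather than parsed from an LLM response, and `placement_distribution`, `receptacle_of`, and `llm_scores` are hypothetical names.

```python
import numpy as np


def placement_distribution(receptacle_of, llm_scores):
    """Turn per-receptacle scores into a distribution over stable points.

    receptacle_of: receptacle label for each point in P_s (hypothetical input).
    llm_scores: non-negative score per receptacle; hard-coded here instead of
        being parsed from an actual LLM response.
    """
    weights = np.array(
        [llm_scores.get(r, 0.0) for r in receptacle_of], dtype=float
    )
    total = weights.sum()
    if total <= 0.0:
        raise ValueError("no stable point lies on a reasonable receptacle")
    return weights / total


# Four stable points and the receptacle each one lies on.
receptacle_of = ["Tray", "Tray", "BookShelf", "DishRack"]
llm_scores = {"DishRack": 1.0, "Tray": 0.5}  # assumed reasoning output
probs = placement_distribution(receptacle_of, llm_scores)
print(probs)
```

Points on receptacles that the reasoning step does not mention receive zero probability, so sampling from `probs` only ever returns placements that are both stable and contextually reasonable.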

User Interaction

User interaction with the proposed system SPOTS. SPOTS recommends placement candidates based on the task prompt, taking both stability and reasonableness into account, and the user selects among these candidates in an interactive viewer.

Environments

Our real-to-sim transfer module, illustrated in Framework, utilizes OWL-ViT for open-vocabulary object detection and AprilTags for pose estimation, based on input from an RGB-D vision sensor. The detected objects form a label super-set that includes nine categories [1], for a total of 21 object assets. For each detected object, we assume the corresponding 3D asset is available. These assets are transferred into a simulation environment that mimics the real world as closely as possible. This reconstructed environment is the basis for all subsequent evaluations. The framework is built on the MuJoCo simulator, using assets from the YCB and Google Scanned Objects datasets. We use a tabletop manipulation setup with a 6-DoF robot arm, and gpt-3.5-turbo as the LLM.

[1] 'DishRack', 'Bowl', 'BookShelf', 'Fruit', 'Beverage', 'Snack', 'Tray', 'Glass', 'Book'
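The real-to-sim step can be pictured as turning labelled, posed detections into a simulator scene description. The following is a rough sketch under our own assumptions, not the released code: the detections are hard-coded instead of coming from OWL-ViT and AprilTags, box primitives stand in for the 3D assets, and `build_mjcf` and the dictionary keys are hypothetical. Only standard MJCF elements (`mujoco`, `worldbody`, `body`, `freejoint`, `geom`) are used.

```python
import xml.etree.ElementTree as ET


def fmt(values):
    """Format a sequence of numbers as a space-separated MJCF attribute."""
    return " ".join(f"{v:g}" for v in values)


def build_mjcf(detections):
    """Build a minimal MJCF scene string from labelled, posed detections."""
    root = ET.Element("mujoco", model="reconstructed_scene")
    world = ET.SubElement(root, "worldbody")
    for i, det in enumerate(detections):
        body = ET.SubElement(
            world,
            "body",
            name=f"{det['label']}_{i}",  # body names must be unique
            pos=fmt(det["pos"]),         # position in metres
            quat=fmt(det["quat"]),       # orientation as (w, x, y, z)
        )
        ET.SubElement(body, "freejoint")  # lets the object move freely
        # Box placeholder for the real asset; MJCF box sizes are half-extents.
        ET.SubElement(body, "geom", type="box", size=fmt(det["half_size"]))
    return ET.tostring(root, encoding="unicode")


# Hypothetical detections standing in for detector and tag outputs.
detections = [
    {"label": "Tray", "pos": (0.4, 0.0, 0.02), "quat": (1, 0, 0, 0),
     "half_size": (0.15, 0.10, 0.01)},
    {"label": "Bowl", "pos": (0.4, 0.3, 0.04), "quat": (1, 0, 0, 0),
     "half_size": (0.06, 0.06, 0.03)},
]
scene_xml = build_mjcf(detections)
print(scene_xml)
```

The resulting string is intended to be loadable by a MuJoCo model loader; in practice each box would be replaced by the mesh asset matching the detected label.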

(a) Small Gap: White Dish Rack

(b) Medium Gap: Black Dish Rack

(c) Large Gap: Wood Dish Rack

(d) Two-Tiered Bookshelf

(e) Three-Tiered Bookshelf

(f) Three-Tiered Shelf

Results

Results of the stability verification module, run on all environments in our experiments without the reasoning module. The ratio of stable coordinates to the total number of coordinates is very low, which indicates that stable placement is physically difficult in the tasks we consider.

We compare SPOTS to three prior methods: LLM-GROP [1], Code-as-Policies (CaP) [2], and Language-to-Reward (L2R) [3]. LLM-GROP uses two different template-based prompts; one extracts semantic relationships with examples, and the other one predicts geometric spatial relationships for varying scene geometry. CaP generates policy code for the robot motion using a pre-defined low-level primitive function. L2R defines reward parameters that can be optimized, and the reward function is designed for moving a manipulator to a parameterized placement position.

Our evaluation metrics are the stability and reasonableness of the suggested object placements. The stability success rate (Sta. S/R) is based purely on whether the object remains stable after placement in simulation. The reasonableness success rate (Rea. S/R), on the other hand, is based on whether the placement aligns with the ground truth that we define. The reasonableness criteria are manually designed, and these specific criteria serve as the ground truth for confirming appropriate locations in our experimental validation. Together, the two metrics assess the overall effectiveness of placements in ensuring both stability and reasonableness. Furthermore, we measure the inference time and the number of input and output tokens to assess the efficiency of using LLMs.
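As a concrete reading of these metrics, the sketch below computes both success rates from a list of trial records. The record layout is an assumption made for illustration: we represent the manually designed ground truth as a single receptacle label per trial, and `success_rates` and the keys `stable`, `receptacle`, and `ground_truth` are hypothetical names. The trial data is made up and does not reproduce any reported result.

```python
def success_rates(trials):
    """Compute stability, reasonableness, and joint success rates."""
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    stable = [t["stable"] for t in trials]
    reasonable = [t["receptacle"] == t["ground_truth"] for t in trials]
    return {
        "Sta. S/R": sum(stable) / n,
        "Rea. S/R": sum(reasonable) / n,
        # A trial counts as an overall success only if both criteria hold.
        "overall": sum(s and r for s, r in zip(stable, reasonable)) / n,
    }


# Illustrative records only, not measured data.
trials = [
    {"stable": True, "receptacle": "Tray", "ground_truth": "Tray"},
    {"stable": True, "receptacle": "Bowl", "ground_truth": "Tray"},
    {"stable": False, "receptacle": "Tray", "ground_truth": "Tray"},
    {"stable": True, "receptacle": "DishRack", "ground_truth": "DishRack"},
]
rates = success_rates(trials)
print(rates)
```

Keeping the two rates separate makes it visible when a method proposes reasonable receptacles but unstable coordinates, or the reverse.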

By separating the tasks of predicting receptacles and ensuring physical robustness into two distinct modules, we find that SPOTS achieves a higher success rate while using fewer tokens than methods that require the LLM to predict robot plans while also understanding the context. From this experiment, we posit that SPOTS is well suited to promptable placement tasks that must consider both physically stable and reasonable regions, and that it yields a distribution from which reasonable positions can be sampled.

[1] Task and Motion Planning with Large Language Models for Object Rearrangement

[2] Code as Policies: Language Model Programs for Embodied Control

[3] Language to Rewards for Robotic Skill Synthesis

Real World Demonstration

In this experiment, we consider a scene with different objects placed on a desk. We designed a task that categorizes objects based on similarity. The reasoning criterion, termed similarity, varies for each experiment and serves as the ground truth for evaluating reasoning abilities. Each type of similarity was evaluated five times, and performance was measured using the overall success rate (i.e., both Sta. S/R and Rea. S/R). This experiment shows that the reasonable place varies depending on the task description given as input. Furthermore, we are able to accurately determine the stable positions to place the objects by reconstructing the robot's ego-centric view with the real-to-sim method.

BibTeX


        @article{lee2023spots,
          title={SPOTS: Stable Placement of Objects with Reasoning in Semi-Autonomous Teleoperation Systems},
          author={Lee, Joonhyung and Park, Sangbeom and Park, Jeongeun and Lee, Kyungjae and Choi, Sungjoon},
          journal={arXiv preprint arXiv:2309.13937},
          year={2023}
        }