Indoor robots are becoming increasingly prevalent across a range of sectors, but navigating multi-level structures through elevators remains largely unexplored. To operate successfully, a robot must perceive elevator states accurately. This paper presents a robust robotic system designed to interact with elevators by discerning their status, actuating buttons, and boarding. Given the inherent issues of class imbalance and limited data, we use the YOLOv7 model and adopt specific strategies to counteract the potential decline in object detection performance. Our method effectively addresses the class imbalance and label dependency observed in real-world datasets, offering a promising approach to improving indoor robotic navigation systems.
Our real-to-sim transfer module, illustrated in Framework, uses OWL-ViT for open-vocabulary object detection and AprilTags for pose estimation, both operating on input from an RGB-D vision sensor. The detected objects form a label super-set of nine categories [1], covering 21 object assets in total. For each detected object, we assume that the corresponding 3D asset is available. These assets are transferred into a simulation environment that mimics the real world as closely as possible, and this reconstructed environment is the basis for all subsequent evaluations. The framework is built on the MuJoCo simulator, using assets from the YCB and Google Scanned Objects datasets. Our experiments use a tabletop manipulation setup with a 6-DoF robot arm, with gpt-3.5-turbo as the LLM.
[1] 'DishRack', 'Bowl', 'BookShelf', 'Fruit', 'Beverage', 'Snack', 'Tray', 'Glass', 'Book'
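The real-to-sim step described above can be sketched as a small filtering routine. This is only an illustration under stated assumptions, not the authors' implementation: the `Detection` record, the `to_scene_entries` helper, the `min_score` threshold, and the asset paths are all hypothetical, and the detector and tag-based pose estimator are assumed to have already produced labels and poses.

```python
from dataclasses import dataclass

# The nine super-set categories listed in the footnote above.
SUPERSET = {
    "DishRack", "Bowl", "BookShelf", "Fruit", "Beverage",
    "Snack", "Tray", "Glass", "Book",
}


@dataclass
class Detection:
    """Hypothetical record combining a detector label with an estimated pose."""
    label: str        # open-vocabulary label, assumed already mapped to a category name
    score: float      # detector confidence in [0, 1]
    position: tuple   # (x, y, z) in metres
    quat: tuple       # (w, x, y, z) unit quaternion


def to_scene_entries(detections, asset_paths, min_score=0.3):
    """Keep detections that fall in the super-set and have a known 3D asset.

    `asset_paths` maps a category name to a mesh file; `min_score` is an
    assumed threshold, not a value taken from the paper.
    """
    entries = []
    for det in detections:
        if det.label not in SUPERSET or det.score < min_score:
            continue  # outside the label super-set, or too uncertain
        if det.label not in asset_paths:
            continue  # no 3D asset available for this category
        entries.append({
            "name": f"{det.label}_{len(entries)}",
            "mesh": asset_paths[det.label],
            "pos": det.position,
            "quat": det.quat,
        })
    return entries


if __name__ == "__main__":
    dets = [
        Detection("Bowl", 0.9, (0.4, 0.0, 0.8), (1.0, 0.0, 0.0, 0.0)),
        Detection("Laptop", 0.8, (0.1, 0.2, 0.8), (1.0, 0.0, 0.0, 0.0)),
    ]
    print(to_scene_entries(dets, {"Bowl": "assets/bowl.obj"}))
```

The resulting entries hold the information a simulator scene description needs per object (a mesh, a position, and an orientation); how they are loaded into MuJoCo is left out here.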
We compare SPOTS to three prior methods: LLM-GROP [1], Code-as-Policies (CaP) [2], and Language-to-Reward (L2R) [3]. LLM-GROP uses two template-based prompts: one extracts semantic relationships from examples, and the other predicts geometric spatial relationships for varying scene geometry. CaP generates policy code for the robot motion using a pre-defined low-level primitive function. L2R defines reward parameters that can be optimized; its reward function is designed to move a manipulator to a parameterized placement position.
We evaluate the suggested object placements on two metrics: stability and reasonableness. The stability success rate (Sta. S/R) is based purely on physics, that is, on whether the object remains stable once placed in simulation. The reasonableness success rate (Rea. S/R) is based on whether the placement agrees with a ground truth that we define; the reasonableness criteria are designed manually and serve as the ground truth for confirming appropriate locations in our experimental validation. Together, these metrics assess whether placements are both stable and reasonable. We also measure the inference time and the number of input and output tokens to gauge the efficiency of LLM use.
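The metrics above can be aggregated along the following lines. This is a minimal sketch under stated assumptions: the `Trial` record and the `summarize` helper are hypothetical names, and the displacement-based `is_stable` check with its 1 cm tolerance is an assumed stand-in, since the text does not specify how stability is tested in simulation.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record of one placement attempt."""
    stable: bool        # did the object stay put in simulation?
    reasonable: bool    # does the placement match the manual ground truth?
    inference_s: float  # LLM inference time in seconds
    input_tokens: int
    output_tokens: int


def is_stable(pos_at_release, pos_after_settle, tol=0.01):
    """Assumed stability test: the object moved less than `tol` metres.

    The paper does not state its criterion; this displacement check and the
    tolerance are illustrative only.
    """
    return math.dist(pos_at_release, pos_after_settle) < tol


def summarize(trials):
    """Return Sta. S/R, Rea. S/R, and mean efficiency figures over trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("no trials to summarize")
    return {
        "sta_sr": sum(t.stable for t in trials) / n,
        "rea_sr": sum(t.reasonable for t in trials) / n,
        "mean_inference_s": sum(t.inference_s for t in trials) / n,
        "mean_input_tokens": sum(t.input_tokens for t in trials) / n,
        "mean_output_tokens": sum(t.output_tokens for t in trials) / n,
    }


if __name__ == "__main__":
    trials = [
        Trial(is_stable((0, 0, 0.8), (0, 0, 0.802)), True, 1.0, 100, 20),
        Trial(is_stable((0, 0, 0.8), (0.1, 0, 0.75)), False, 3.0, 300, 40),
    ]
    print(summarize(trials))
```

Keeping the two success rates separate, rather than reporting a single combined score, mirrors the distinction the text draws between physical stability and agreement with the manually defined ground truth.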
By separating the tasks of predicting receptacles and ensuring physical robustness into two distinct modules, SPOTS achieves a higher success rate while using fewer tokens than methods that force the LLM to predict robotic plans while also understanding the context. From this experiment, we posit that SPOTS is well suited to promptable placement tasks that must account for both physically stable and reasonable regions, and that it yields a good distribution from which reasonable positions can be sampled.
@article{lee2023spots,
title={SPOTS: Stable Placement of Objects with Reasoning in Semi-Autonomous Teleoperation Systems},
author={Lee, Joonhyung and Park, Sangbeom and Park, Jeongeun and Lee, Kyungjae and Choi, Sungjoon},
journal={arXiv preprint arXiv:2309.13937},
year={2023}
}