In robotic object manipulation, human preferences are often influenced by the visual attributes of objects, such as color and shape. These properties play a crucial role in operating a robot to interact with objects in line with human intention. In this paper, we focus on the problem of inferring underlying human preferences from a sequence of raw visual observations in tabletop manipulation environments with a variety of object types, which we term Visual Preference Inference (VPI). To facilitate visual reasoning in the context of manipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR employs a prompting mechanism that describes the differences between consecutive images (i.e., visual residuals) and incorporates these descriptions with the image sequence to infer the user's preference. This approach significantly enhances the robot's ability to understand and adapt to dynamic changes in its visual environment during manipulation tasks. Our method outperforms baseline methods in extracting human preferences from visual sequences in both simulation and real-world environments.
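As a rough illustration of this two-stage prompting scheme, the sketch below chains per-pair residual descriptions into a final preference query. The helper names (`query_vlm`, `describe_residual_prompt`, `infer_preference_prompt`) are hypothetical stand-ins and not the paper's actual prompts or API.

```python
# Minimal sketch of a CoVR-style prompting loop (illustrative only; the
# concrete prompts and the VLM call are assumptions, not the paper's code).
from typing import List


def describe_residual_prompt(idx: int) -> str:
    # Ask the VLM what changed between image idx and idx + 1 (the "visual residual").
    return (
        f"Compare image {idx} and image {idx + 1}. Describe the semantic, "
        "geometric, and overall change between them in one sentence each."
    )


def infer_preference_prompt(residuals: List[str]) -> str:
    # Combine all residual descriptions into a single preference query.
    joined = "\n".join(f"- {r}" for r in residuals)
    return (
        "Given the following scene changes observed during manipulation:\n"
        f"{joined}\n"
        "What underlying user preference (e.g., about color, shape, or "
        "spatial pattern) best explains these changes?"
    )


def covr_pipeline(images: List[bytes], query_vlm) -> str:
    # 1) Describe the visual residual for every consecutive image pair.
    residuals = [
        query_vlm(images[k : k + 2], describe_residual_prompt(k))
        for k in range(len(images) - 1)
    ]
    # 2) Feed the residual texts, together with the image sequence, back to
    #    the VLM to infer the user's preference.
    return query_vlm(images, infer_preference_prompt(residuals))
```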
Task demonstrations: Block Task, Polygon Task, and Household Task.
To validate the effectiveness of the Visual Reasoning Descriptor (VRD), we conducted a set of experiments focusing on visual reasoning performance. In addition, we designed further experiments to evaluate the ability to infer human preferences, divided into two categories: those based on semantic properties and those based on spatial patterns. Specifically, the performance of our proposed method is evaluated across three environments: Block Task, Polygon Task, and Household Task.
We compare our method against baselines that include large language models and a linear preference extractor. For a fair comparison on visual reasoning, all methods use the same visual reasoning module (i.e., GPT-4V).
Our evaluation metrics are the success rate of the Visual Reasoning Descriptor (\(\text{SR}_\text{VRD}\)) and the success rate of the Preference Reasoning Descriptor (\(\text{SR}_\text{PRD}\)) for the given image sequences. In particular, \(\text{SR}_\text{VRD}\) is computed from the visual residuals between consecutive images and is defined as follows:
$$\text{SR}_\text{VRD} = \frac{1}{N-1} \sum_{k=1}^{N-1} \left( \frac{\sum_{l \in V_k} \mathbb{I}(l = \hat{l})}{|V_k|} \right)$$
where \(| \cdot |\) denotes the number of elements in the set and \(\mathbb{I}\) is an indicator function that checks whether each element \(l\) of the predicted response \(V_k\) matches its corresponding element \(\hat{l}\) in the ground-truth visual residual \(\hat{V}_k\) for each consecutive image pair. The elements of \(V_k\) and \(\hat{V}_k\) are (\(l^{\text{semantic}}_k\), \(l^{\text{geometric}}_k\), \(l^{\text{description}}_k\)) and their respective ground-truth counterparts (\(\hat{l}^{\text{semantic}}_k\), \(\hat{l}^{\text{geometric}}_k\), \(\hat{l}^{\text{description}}_k\)).
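A minimal sketch of how this metric could be computed, assuming each residual \(V_k\) and its ground truth \(\hat{V}_k\) are stored as dictionaries with semantic, geometric, and description fields (the data layout is an assumption, not the paper's implementation):

```python
# Sketch of SR_VRD: average, over the N - 1 consecutive image pairs, of the
# fraction of residual elements that match the ground truth.
from typing import Dict, List


def sr_vrd(predicted: List[Dict[str, str]], ground_truth: List[Dict[str, str]]) -> float:
    # predicted[k] / ground_truth[k] describe the residual between images k and k + 1.
    assert len(predicted) == len(ground_truth)
    per_pair_scores = []
    for v_k, v_k_hat in zip(predicted, ground_truth):
        matches = sum(1 for key in v_k_hat if v_k.get(key) == v_k_hat[key])
        per_pair_scores.append(matches / len(v_k_hat))  # fraction matched within V_k
    return sum(per_pair_scores) / len(per_pair_scores)  # average over N - 1 pairs


# Example: first pair fully correct, second pair matches 2 of 3 elements.
pred = [
    {"semantic": "red block", "geometric": "moved left", "description": "stacked"},
    {"semantic": "blue bowl", "geometric": "rotated", "description": "aligned"},
]
gt = [
    {"semantic": "red block", "geometric": "moved left", "description": "stacked"},
    {"semantic": "blue bowl", "geometric": "moved right", "description": "aligned"},
]
print(sr_vrd(pred, gt))  # (1.0 + 2/3) / 2 ≈ 0.833
```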
\(\text{SR}_\text{PRD}\), in contrast, measures whether the predicted preference matches the ground truth. The preference criteria for each scene are manually designed. We formulate \(\text{SR}_\text{PRD}\) as follows:
$$\text{SR}_\text{PRD} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}(l^{\text{preference}}_{i} = \hat{l}^{\text{preference}}_{i})$$
where the indicator function \(\mathbb{I}\) checks for a match between the predicted and ground-truth preferences. \(\text{SR}_\text{PRD}\) evaluates whether the predicted preferences \(l^{\text{preference}}_{i}\) align with the ground-truth preferences \(\hat{l}^{\text{preference}}_{i}\) across a defined set of \(M\) scenes.
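A corresponding sketch of \(\text{SR}_\text{PRD}\), assuming the predicted and ground-truth preferences are compared by exact string match (the matching criterion is an assumption; the paper may use a looser check):

```python
# Sketch of SR_PRD: fraction of the M scenes whose predicted preference
# matches the manually designed ground-truth preference.
from typing import List


def sr_prd(predicted: List[str], ground_truth: List[str]) -> float:
    assert len(predicted) == len(ground_truth)
    matches = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return matches / len(ground_truth)  # average over M scenes


# Example: one of two scenes predicted correctly.
print(sr_prd(["sort by color", "stack by size"], ["sort by color", "sort by shape"]))  # 0.5
```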
@article{lee2024visual,
title={Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation},
author={Joonhyung Lee and Sangbeom Park and Yongin Kwon and Jemin Lee and Minwook Ahn and Sungjoon Choi},
year={2024},
eprint={2403.11513},
archivePrefix={arXiv},
primaryClass={cs.RO}
}