Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation

Joonhyung Lee1, Sangbeom Park1, Yongin Kwon2,
Jemin Lee2, Minwook Ahn3, Sungjoon Choi1
Korea University1, ETRI2, Neubla3

We propose Chain-of-Visual-Residuals (CoVR) prompting, a method that chains visual understanding of consecutive images to reason about preferences from a long-horizon image sequence.

Abstract

In robotic object manipulation, human preferences are often influenced by the visual attributes of objects, such as color and shape. These properties play a crucial role in operating a robot so that it interacts with objects in line with human intention. In this paper, we focus on the problem of inferring underlying human preferences from a sequence of raw visual observations in tabletop manipulation environments with a variety of object types, which we name Visual Preference Inference (VPI). To facilitate visual reasoning in the context of manipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR employs a prompting mechanism that describes the differences between consecutive images (i.e., visual residuals) and incorporates these textual descriptions, together with the image sequence, to infer the user's preference. This approach significantly enhances the model's ability to understand and adapt to dynamic changes in its visual environment during manipulation tasks. Our method outperforms baseline methods in extracting human preferences from visual sequences in both simulation and real-world environments.

Running Example

Framework


Overview of Chain-of-Visual-Residuals: (a) We introduce the Visual Preference Inference (VPI) task, which extracts users' preferences solely from visual representations in tabletop manipulation environments. Our approach, CoVR prompting, involves generating (b) visual reasoning descriptions of consecutive images and (c) chaining these descriptions to interpret human preferences from the scene sequence.

VPI: Visual Preference Inference


Visual Preference Inference (VPI) Tasks: We define the VPI task as reasoning about user preferences from an image sequence. Specifically, the task involves a robot that moves objects to target locations, following user instructions given via mouse clicks that specify which object to move and where to place it. To infer the user's preferences, we extract visual residuals from each image in the sequence and link them together to enhance reasoning capability.


Visual Reasoning Descriptor

Our goal is to identify which object has moved between two consecutive images and how the geometric relationships among objects have changed, while simultaneously inferring the semantic properties of each object. To this end, we present the Visual Reasoning Descriptor (VRD), which translates input images into natural language scene descriptions (referred to as visual residuals). A visual residual \(V\) contains both the semantic properties of the objects and the difference in the objects' configurations between a consecutive image pair, and consists of three components: \( \{l^{\text{semantic}}, l^{\text{geometric}}, l^{\text{description}}\} \).
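
To make the VRD output concrete, the Python sketch below illustrates one way the three residual components could be represented and queried for a consecutive image pair. The data fields, function names, and prompt wording are illustrative assumptions for exposition; they are not the exact CoVR template.

from dataclasses import dataclass

# Hypothetical container for one visual residual V_k between images I_k and I_{k+1};
# its fields mirror the components {l_semantic, l_geometric, l_description}.
@dataclass
class VisualResidual:
    semantic: str     # e.g., "the moved object is a small red cube"
    geometric: str    # e.g., "it now lies to the left of the blue cylinder"
    description: str  # e.g., "the red cube was placed next to the blue cylinder"

def build_vrd_prompt(pair_index: int) -> str:
    # Assumed prompt structure: ask the MLLM for the three residual components of
    # the k-th consecutive image pair; the two images themselves are attached to
    # the multimodal request separately.
    return (
        f"You are given consecutive tabletop images {pair_index} and {pair_index + 1}.\n"
        "1) Identify the object that moved and describe its semantic properties "
        "(color, shape, category).\n"
        "2) Describe how its geometric relation to the other objects changed.\n"
        "3) Summarize the change in one sentence.\n"
        "Answer with three fields: semantic, geometric, description."
    )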

Preference Reasoning Descriptor

To interpret the overall preference from the obtained sequence of visual residuals \( \mathcal{V}=\{V_{1},\cdots,V_{n-1}\} \) computed over an image sequence \( \mathcal{I} \) of length \(n\), we propose the Preference Reasoning Descriptor (PRD), which interprets user preferences expressed in natural language descriptions. The visual residual information (obtained from the VRD), along with the original image sequence, is fed into the PRD to reason about the underlying human preferences.
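
A minimal sketch of the chaining step follows, reusing the VisualResidual container from the VRD sketch above. The prompt wording is again an assumption; in the actual pipeline the original image sequence is also attached to the multimodal request.

def build_prd_prompt(residuals) -> str:
    # residuals: list of VisualResidual objects V_1 ... V_{n-1}, in temporal order.
    lines = ["Observed changes across the image sequence:"]
    for k, v in enumerate(residuals, start=1):
        lines.append(
            f"Step {k}: {v.description} (semantic: {v.semantic}; geometric: {v.geometric})"
        )
    lines.append(
        "Based on these changes, infer the underlying user preference "
        "(e.g., grouping by color, grouping by shape, or arranging a spatial pattern)."
    )
    return "\n".join(lines)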

Environments

Block Task

Polygon Task

Household Task

To validate the effectiveness of the VRD, we conducted a set of experiments focusing on visual reasoning performance. In addition, we designed further experiments to evaluate the ability to infer human preferences, divided into two categories: those based on semantic properties and those based on spatial patterns. Specifically, the performance of our proposed method is evaluated across three different environments: the Block Task, the Polygon Task, and the Household Task.

Baselines & Metrics

We compare our method with several baselines, including large language models and a linear preference extractor. For fair comparison on visual reasoning, all MLLM-based methods use the same visual reasoning module (i.e., GPT-4V).

  • MLLM-Naive: An ablation of our approach that uses neither the Visual Reasoning Descriptor nor the Preference Reasoning Descriptor. MLLM-Naive infers scene descriptions for consecutive image pairs in a similar way to our method but without the VRD template. This baseline then interprets the preference directly, using only the entire image sequence in a single interaction.
  • MLLM-L2R: Inspired by Language-to-Reward (L2R), this baseline extracts normalized 2D object positions (ranging from 0.0 to 1.0) for feature computation. We then integrate a code-snippet generation module that produces code for computing preference weights from the obtained object positions.
  • Mutual-Distance-based Preference Extractor (MDPE): This baseline assumes that human preferences are deterministic, following a linear user model as discussed in prior works. Within this linear-model framework, MDPE computes a preference weight for each feature using pre-defined functions of the mutual distances between objects and then derives the preference from these weights (see the sketch after this list).
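
To make this baseline concrete, the sketch below implements a toy mutual-distance-based extractor in the spirit of MDPE. The feature functions (clustering, color grouping, horizontal line) and the winner-take-all selection are hypothetical examples, not the functions used in the paper.

import numpy as np

def mdpe_features(positions: np.ndarray, colors: list) -> dict:
    # Toy handcrafted feature functions over mutual distances; positions is an
    # (N, 2) array of normalized object coordinates, colors a list of labels.
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    i, j = np.triu_indices(len(positions), k=1)
    same = np.array([colors[a] == colors[b] for a, b in zip(i, j)])
    pair = dists[i, j]
    return {
        # small average pairwise distance -> objects pushed into one cluster
        "clustered": 1.0 - pair.mean(),
        # same-colored pairs closer than different-colored pairs -> color grouping
        "group_by_color": float(pair[same].mean() < pair[~same].mean())
        if same.any() and (~same).any() else 0.0,
        # low spread of y-coordinates -> horizontal-line pattern
        "horizontal_line": 1.0 - positions[:, 1].std(),
    }

def mdpe_preference(positions, colors) -> str:
    # Deterministic linear-model-style choice: return the feature with the largest weight.
    feats = mdpe_features(np.asarray(positions, dtype=float), colors)
    return max(feats, key=feats.get)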

Our evaluation metrics are the success rate of the Visual Reasoning Descriptor (\(\text{SR}_{\text{VRD}}\)) and the success rate of the Preference Reasoning Descriptor (\(\text{SR}_{\text{PRD}}\)) for the given image sequences. In particular, \(\text{SR}_{\text{VRD}}\) is calculated from the visual residuals between consecutive images in the sequence and is defined as follows:

$$\text{SR}_\text{VRD} = \frac{1}{N-1} \sum_{k=1}^{N-1} \left( \frac{\sum_{l \in V_k} \mathbb{I}(l = \hat{l})}{|V_k|} \right)$$

where \(| \cdot |\) denotes the number of elements in a set and \(\mathbb{I}\) is an indicator function that checks whether each element \(l\) within the predicted response \(V_k\) matches its corresponding element \(\hat{l}\) in the ground-truth visual residual \(\hat{V}_k\) for each consecutive image pair. The elements of \(V_k\) and \(\hat{V}_k\) are (\(l^{\text{semantic}}_k\), \(l^{\text{geometric}}_k\), \(l^{\text{description}}_k\)) and their respective ground-truth counterparts (\(\hat{l}^{\text{semantic}}_k\), \(\hat{l}^{\text{geometric}}_k\), \(\hat{l}^{\text{description}}_k\)).
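
A short sketch of how \(\text{SR}_{\text{VRD}}\) could be computed is given below; it assumes exact matching between predicted and ground-truth elements, whereas in practice comparing free-form descriptions may require normalization or a separate judgment step.

def sr_vrd(predicted: list, ground_truth: list) -> float:
    # predicted / ground_truth: one dict per consecutive image pair (N - 1 total),
    # each with the keys 'semantic', 'geometric', 'description'.
    assert len(predicted) == len(ground_truth) and len(ground_truth) > 0
    per_pair = [
        sum(v[key] == v_hat[key] for key in v_hat) / len(v_hat)
        for v, v_hat in zip(predicted, ground_truth)
    ]
    return sum(per_pair) / len(per_pair)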

On the other hand, \(\text{SR}_{\text{PRD}}\) measures whether the predicted preference matches the ground truth. The preference criteria for each scene are manually designed. We formulate \(\text{SR}_{\text{PRD}}\) as follows:

$$\text{SR}_\text{PRD} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}(l^{\text{preference}}_{i} = \hat{l}^{\text{preference}}_{i})$$

where the indicator function \(\mathbb{I}\) checks for a match between the predicted and ground-truth preferences. \(\text{SR}_{\text{PRD}}\) evaluates whether the predicted preferences \(l^{\text{preference}}_{i}\) align with the ground-truth preferences \(\hat{l}^{\text{preference}}_{i}\) across the defined set of \(M\) scenes.
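
The corresponding computation for \(\text{SR}_{\text{PRD}}\) is a simple matching rate over the \(M\) evaluated scenes (again assuming exact label matching for illustration):

def sr_prd(predicted_prefs: list, ground_truth_prefs: list) -> float:
    # One predicted and one ground-truth preference label per scene (M total).
    assert len(predicted_prefs) == len(ground_truth_prefs) > 0
    return sum(p == g for p, g in zip(predicted_prefs, ground_truth_prefs)) / len(predicted_prefs)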

Results


Block Task

Table 1 compares the visual reasoning performance of our method against its ablation. The results indicate that our method outperforms the naive approach in understanding both the semantic and geometric properties within an image sequence. In particular, an \(\text{SR}_{\text{VRD}}\) of \(0.72 \pm 0.11\) demonstrates the superior visual reasoning ability of our method, whereas the MLLM-Naive model shows a limited ability to extract visual signals between images, with an \(\text{SR}_{\text{VRD}}\) of \(0.40 \pm 0.15\) on the same task. This result highlights the effectiveness of our VRD template-based approach in recognizing the visual residuals within image sequences.

Table 2 compares the spatial pattern preference reasoning performance of our method to three other baseline approaches. Our method shows outstanding preference reasoning performance in the Block Task, which highlights the benefits of our prompting method over the other baselines. Although MDPE was expected to achieve a perfect score of \(1.0\) (i.e., always correct), it did not. This shortfall is mainly due to the parameter sensitivity of its feature functions: such sensitivity often leads to the recognition of multiple preferences, resulting in erroneous preference inferences. MLLM-L2R also exhibits limited effectiveness, primarily because relying solely on the MLLM's responses for position information is unreliable; the responses do not accurately capture the objects' geometric locations. From these results, we emphasize that our method has high potential for visual reasoning tasks involving spatial pattern preferences.

Polygon Task

On the Polygon Task, as detailed in Table 1, our method significantly outperforms the baseline model, indicating superior visual reasoning accuracy. Specifically, our approach achieves a visual reasoning accuracy of \(0.79 \pm 0.13\), while the baseline records a lower accuracy of \(0.56 \pm 0.23\). This performance gap shows the enhanced ability of our method to accurately recognize visual contexts within image sequences, especially in tasks involving complex geometric shapes such as polygons. The results in Table 3 show that our method consistently outperforms other MLLM-based approaches for semantic preference reasoning in the Polygon Task. In comparison, the MLLM-Naive method performs poorly, with \(0.20\) for color and \(0.70\) for shape, indicating the limitations of naive models in capturing semantic properties. The MLLM-L2R model shows slight improvements, achieving \(0.30\) for color and \(0.80\) for shape, but these results still do not reach our preference reasoning performance. Notably, MDPE achieves perfect scores (\(1.0\)) on both the color and shape criteria; however, it is important to note that the performance of MDPE depends on manual tuning of the preference feature functions. These results imply that our method can successfully capture semantic preferences without any manual feature engineering, such as predefining a list of object attributes.

Household Task

Table 4 reports \(\text{SR}_{\text{VRD}}\), comparing the visual reasoning performance of our approach against its ablation, each evaluated six times. In particular, an \(\text{SR}_{\text{VRD}}\) of \(0.63 \pm 0.08\) demonstrates the higher visual reasoning performance of our method, whereas the MLLM-Naive model shows a limited ability to extract visual signals between images, with an \(\text{SR}_{\text{VRD}}\) of \(0.28 \pm 0.19\) on the same task. These results support the effectiveness of our VRD template-based approach in recognizing visual residuals within image sequences. Each type of preference was evaluated six times, and performance was measured in terms of \(\text{SR}_{\text{PRD}}\). As illustrated in Fig. 1, at each step the robot moves an object and captures an image. The results of our method in Table 4 are consistent with our simulation results, indicating the balanced performance of our method in spatial pattern and semantic preference reasoning. The other MLLM-based approaches show subpar performance in recognizing spatial patterns and semantic properties; we observe that the MLLM tends to misunderstand the spatial arrangements or semantic properties of objects without the explicit annotations provided by the VRD. While MDPE performs as effectively as our approach on both types of preference, it remains highly dependent on handcrafted features. These results support the practical effectiveness of our method and its successful application in real-world scenarios.

Real World Demonstration

Response: Rearrange the objects with the same category.
Response: Group objects by the same shape.
Response: Make objects into a horizontal line.
Response: Sort objects vertically.

We have demonstrated the effectiveness of our method in interpreting spatial relationships from image sequences and inferring preferences for both semantic and spatial patterns in real-world tabletop environments.

BibTeX


        @article{lee2024visual,
          title={Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation}, 
          author={Joonhyung Lee and Sangbeom Park and Yongin Kwon and Jemin Lee and Minwook Ahn and Sungjoon Choi},
          year={2024},
          eprint={2403.11513},
          archivePrefix={arXiv},
          primaryClass={cs.RO}
        }