Relevant for Research Area

C - Applications





Prof. Dr. Abhinav Valada

Prof. Dr. Thomas Brox

Dr. Tim Welschehold


In this project, we investigate learned scene graph representations for long-horizon mobile manipulation tasks. Typical approaches to learning robot skills build their policies directly on sensory information without any intermediate representation. Prominent examples in this research domain translate, e.g., an image input directly into real-world control. While skills learned in this fashion perform impressively in short-horizon tasks and decently in narrow long-horizon tasks, they often fail to transfer outside of the training distribution. This poses a great challenge in robotics, since robots need to perform reasonably under uncertainty.

Humans, on the other hand, seem to base their interactions with the world on an object-centric understanding of it. To equip robots with such capabilities, an intermediate and actionable layer becomes necessary before decision-making. Graphs are a natural mathematical structure for modeling an object-based view with inter-object relationships. By using a graphical model of the scene, we hope to create agents that learn to interact with their environments in a more reliable and more interpretable way. This is in contrast to policies learned in an end-to-end fashion, which rely entirely on abstract internal representations. Modeling scenes as graphs enables multiple downstream tasks to benefit from the same interpretable representation.
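To make the object-centric view concrete, the following is a minimal sketch of a scene graph as a data structure, with objects as nodes and labeled relations as directed edges. The class name `SceneGraph`, its methods, and the example objects and relation labels are illustrative assumptions and are not taken from the project itself.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph: objects are nodes, relations are edges."""

    # Object name -> dictionary of attributes (e.g. semantic category).
    nodes: dict = field(default_factory=dict)
    # Directed, labeled edges stored as (subject, relation, object) triples.
    edges: list = field(default_factory=list)

    def add_object(self, name, **attributes):
        self.nodes[name] = attributes

    def relate(self, subject, relation, obj):
        # Only allow relations between objects that are already in the graph.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both objects must be added before relating them")
        self.edges.append((subject, relation, obj))

    def relations_of(self, name):
        # All outgoing relations of one object, in insertion order.
        return [(rel, obj) for subj, rel, obj in self.edges if subj == name]


# Assumed example scene: a small table setting.
graph = SceneGraph()
graph.add_object("table", category="furniture")
graph.add_object("plate", category="tableware")
graph.add_object("fork", category="cutlery")
graph.relate("plate", "on", "table")
graph.relate("fork", "left_of", "plate")
graph.relate("fork", "on", "table")

print(graph.relations_of("fork"))
```

Because the relations are explicit triples, a downstream planner or a human observer can inspect them directly, which is the kind of interpretability the graph representation is meant to provide.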

We aim to demonstrate this interpretable view of environments by learning a rearrangement task using both raw sensory information and natural language descriptions (see Fig. 1). Thus, we include both semantic and geometric information. As a first demonstration, we will look at rearranging tables for different occasions. This task is suitable due to the large variety of goal states, as well as its scalability into a larger problem, i.e., first rearranging the table and later reasoning about retrieving objects to set a table for dinner. Our proposed approach involves not only high-level planning and reasoning but also low-level execution using methods from the domains of imitation learning and visual servoing. We aim to demonstrate that scene graphs can be used to represent large spaces efficiently. This entails showing that the representation scales with the number of objects detected rather than the size of the observed environment. Based on proximity as well as semantic, geometric, and language priors, we can draw relations among objects sparsely instead of using fully-connected graph structures (see Fig. 2). This becomes relevant when performing the rearrangement task, since the sparsity of the graph often represents a significant inductive bias for the learning problem posed [14]. Generating hierarchical representations of the observed scenes also makes mobile manipulation feasible. Given the structured representation, it should be practical for the agent to communicate its plan and confirm it with a human supervisor or collaborator during operation. This will help bridge the gap between human-level language commands and large-scale mobile manipulation tasks.
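The difference between sparse and fully-connected graphs can be illustrated with a small sketch that uses proximity as the only prior. The object names, positions, and the 0.3 m radius below are assumptions chosen for illustration; the semantic, geometric, and language priors mentioned above are not modeled here.

```python
from itertools import combinations
import math

# Hypothetical detected objects: name -> (x, y, z) position in metres.
objects = {
    "plate": (0.00, 0.00, 0.75),
    "fork": (-0.20, 0.00, 0.75),
    "knife": (0.20, 0.00, 0.75),
    "glass": (0.25, 0.20, 0.75),
    "lamp": (2.00, 1.50, 0.00),
}


def proximity_edges(positions, radius):
    """Return undirected edges between objects at most `radius` apart."""
    edges = []
    for (a, pos_a), (b, pos_b) in combinations(positions.items(), 2):
        if math.dist(pos_a, pos_b) <= radius:
            edges.append((a, b))
    return edges


# Sparse graph: only nearby objects are related (assumed radius of 0.3 m).
sparse = proximity_edges(objects, radius=0.3)
# Fully-connected baseline: every pair of objects is related.
dense = list(combinations(objects, 2))

print(f"sparse edges: {len(sparse)}, fully-connected edges: {len(dense)}")
```

With n objects, the fully-connected graph always has n(n-1)/2 edges, whereas the proximity-based graph keeps only the pairs that are spatially close, so a distant object such as the lamp does not add relations to the table setting.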