Learning Compositional Representations for Understanding and Generating Controllable 3D Environments
Despoina Paschalidou
Despoina Paschalidou is a Postdoctoral Researcher at Stanford University, working with Prof. Leonidas Guibas in the Geometric Computation Group. She received her Ph.D. from the Max Planck ETH Center for Learning Systems, where she worked with Prof. Andreas Geiger and Prof. Luc van Gool. Prior to this, she was an undergraduate in the School of Electrical and Computer Engineering at the Aristotle University of Thessaloniki in Greece, where she worked with Prof. Anastasios Delopoulos and Christos Diou. Her research interests lie in computer vision, particularly in interpretable shape representations, scene understanding, generative models, and unsupervised deep learning.
Within the first year of our lives, we develop a common-sense understanding of the physical behavior of the world, which relies heavily on our ability to reason about the arrangement of objects in a scene. While this seems to be a fairly easy task for the human brain, computer vision algorithms struggle to perform such high-level reasoning. The research community has therefore shifted its attention to primitive-based methods that seek to represent objects as semantically consistent arrangements of parts. However, due to the simplicity of existing primitive representations, these methods fail to accurately reconstruct 3D shapes using a small number of primitives/parts.

In the first part of my talk, I will address this trade-off between reconstruction quality and number of parts and present Neural Parts, a novel 3D primitive representation that defines primitives using an Invertible Neural Network (INN) implementing homeomorphic mappings between a sphere and the target object. Since a homeomorphism does not impose any constraints on the primitive shape, our model effectively decouples geometric accuracy from parsimony and, as a result, captures complex geometries with an order of magnitude fewer primitives; a minimal illustrative sketch of this construction follows the abstract.

In the second part of my talk, we will look into the problem of inferring, and subsequently generating, semantically meaningful object arrangements to populate 3D scenes conditioned on the room shape. In particular, I will present ATISS, a novel autoregressive transformer architecture for creating diverse and plausible synthetic indoor environments as unordered sets of objects. The unordered-set formulation allows the same trained model to be used for a variety of interactive applications, such as general scene completion, partial room rearrangement with any objects specified by the user, and object suggestions for any partial room; a second sketch below illustrates the sampling loop. This is an important step towards fully automatic content creation.

Finally, we will look into 3D shape editing and manipulation. In particular, I will present two methods capable of generating plausible 3D shape variations with local control, which can be combined with ATISS to allow control at both the scene level and the object level.
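To make the homeomorphism idea concrete, here is a minimal, illustrative PyTorch sketch of an invertible network built from additive coupling layers that deforms points on a unit sphere into a primitive surface and back. This is not the Neural Parts implementation (which, among other things, conditions the mapping on learned per-primitive shape features predicted from the input); all class names, layer counts, and dimensions below are assumptions chosen for brevity.

```python
# Illustrative sketch only, NOT the Neural Parts code: an invertible network
# of additive coupling layers acting as a homeomorphism sphere <-> primitive.
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    """Additive coupling: shift one coordinate by an MLP of the other two.
    The map is invertible exactly, by subtracting the same shift."""
    def __init__(self, dim=3, split=0, hidden=64):
        super().__init__()
        self.split = split                                # coordinate to transform
        self.rest = [i for i in range(dim) if i != split] # conditioning coordinates
        self.shift = nn.Sequential(
            nn.Linear(dim - 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):   # sphere -> primitive direction
        y = x.clone()
        y[:, self.split] = x[:, self.split] + self.shift(x[:, self.rest]).squeeze(-1)
        return y

    def inverse(self, y):   # primitive -> sphere direction
        x = y.clone()
        x[:, self.split] = y[:, self.split] - self.shift(y[:, self.rest]).squeeze(-1)
        return x

class Homeomorphism(nn.Module):
    """Stack of coupling layers; invertible by construction, so surface points
    can be mapped in both directions through the same network."""
    def __init__(self, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            CouplingLayer(split=i % 3) for i in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def inverse(self, y):
        for layer in reversed(self.layers):
            y = layer.inverse(y)
        return y

# Sample points on the unit sphere and deform them into a primitive surface.
pts = torch.randn(1024, 3)
pts = pts / pts.norm(dim=-1, keepdim=True)   # points on the sphere
homeo = Homeomorphism()
surface = homeo(pts)                         # points on the (toy) primitive
recovered = homeo.inverse(surface)           # mapped back onto the sphere
print(torch.allclose(pts, recovered, atol=1e-5))
```

Because each coupling layer only adds an MLP-predicted shift to a single coordinate, its inverse is exact, which is what lets one network act as a homeomorphism rather than a one-way decoder.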
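Similarly, the ATISS-style sampling loop can be sketched as follows. This is a simplified, hypothetical stand-in rather than the released model: a transformer encoder without positional encodings (so the placed objects form an unordered set) attends over a room-layout feature and the objects placed so far, and a small head predicts the attributes of the next object. ATISS itself factorizes the attribute prediction further (category, then location, orientation, and size); all names and dimensions here are assumptions.

```python
# Illustrative sketch only, NOT the released ATISS model: autoregressive
# next-object prediction over an unordered set of already-placed objects.
import torch
import torch.nn as nn

N_CLASSES = 23   # object categories incl. an end-of-scene token (assumed)
D = 128          # feature dimension (assumed)

class NextObjectModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Embed each placed object from (class one-hot, position, size, angle).
        self.obj_embed = nn.Linear(N_CLASSES + 3 + 3 + 1, D)
        # No positional encoding: the set of placed objects is unordered.
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.query = nn.Parameter(torch.randn(1, 1, D))   # "next object" query
        self.class_head = nn.Linear(D, N_CLASSES)
        self.pose_head = nn.Linear(D, 3 + 3 + 1)          # position, size, angle

    def forward(self, objects, room_feat):
        # objects: (B, n, N_CLASSES+7) placed so far; room_feat: (B, 1, D).
        tokens = torch.cat([room_feat,
                            self.obj_embed(objects),
                            self.query.expand(objects.shape[0], -1, -1)], dim=1)
        ctx = self.encoder(tokens)[:, -1]   # feature at the query token
        return self.class_head(ctx), self.pose_head(ctx)

# Sampling loop: append objects until the end-of-scene class is emitted.
# (The model is untrained here; in practice it would be fit to scene data.)
model = NextObjectModel()
room_feat = torch.randn(1, 1, D)            # e.g. from a floor-plan encoder
objects = torch.zeros(1, 0, N_CLASSES + 7)
for _ in range(12):
    class_logits, pose = model(objects, room_feat)
    cls = torch.distributions.Categorical(logits=class_logits).sample()
    if cls.item() == N_CLASSES - 1:         # end-of-scene token
        break
    new_obj = torch.cat([nn.functional.one_hot(cls, N_CLASSES).float(),
                         pose], dim=-1).unsqueeze(1)
    objects = torch.cat([objects, new_obj], dim=1)
```

The interactive applications mentioned above fall out of this formulation: scene completion and object suggestion simply seed the loop with a user-provided partial set of objects instead of an empty one.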