Understand and Reconstruct Multimodal Egocentric Scenes
Chenliang Xu
Chenliang Xu is an Associate Professor in the Department of Computer Science at the University of Rochester. He received his Ph.D. in Computer Science from the University of Michigan in 2016, an M.S. in Computer Science from the University at Buffalo in 2012, and a B.S. in Information and Computing Science from Nanjing University of Aeronautics and Astronautics, China, in 2010. His research originates in computer vision and tackles interdisciplinary topics, including video understanding, audio-visual learning, vision and language, and methods for trustworthy AI. Xu is a recipient of the James P. Wilmot Distinguished Professorship (2021), the University of Rochester Research Award (2021), the Best Paper Award Honorable Mention at the 17th Asian Conference on Computer Vision (2024), the Best Paper Award at the 17th ACM SIGGRAPH VRCAI Conference (2019), the Best Paper Award at the 14th Sound and Music Computing Conference (2017), and the University of Rochester AR/VR Pilot Award (2017). He has authored over 100 peer-reviewed papers in computer vision, machine learning, multimedia, and AI venues. He has served as an editor and area chair for various international journals and conferences.
Every day, the world generates vast numbers of egocentric videos from mixed/augmented reality, lifelogging, and robotics. These videos capture the world from the camera wearer's first-person perspective, hence the name egocentric videos. Understanding these videos and reconstructing egocentric scenes are essential to future AI applications. In this talk, I will first introduce two recent methods developed by my group that leverage large language models (LLMs) to understand multimodal third-person and egocentric videos. These methods show remarkable generalizability compared with traditional task-specific computer vision models. Following that, I will introduce methods leading to real-world audio-visual scene synthesis.