
[00:00] Introduction to EgoAllo system
[00:38] Challenges in egocentric motion estimation
[01:20] Importance of spatial/temporal invariance
[02:11] Comparison of conditioning parameterizations
[02:57] Integration of hand observations
[03:50] Global alignment phase
[04:28] Guidance losses in sampling
[05:03] Handling longer sequences
[05:35] Evaluation results
[06:30] System limitations and future work
[07:13] Implications for other egocentric tasks
[08:05] Advantages of diffusion models
[09:07] Use of synthetic datasets
[09:53] Promising research directions
[10:43] Impact on future motion capture systems
[11:41] Comparison to traditional methods
[12:31] Improved hand estimation accuracy
[13:25] Impact of SLAM inaccuracies
[14:09] Use of the Levenberg-Marquardt optimizer
[15:14] Adapting to complex environments
Authors: Brent Yi, Vickie Ye, Maya Zheng, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa
Affiliations: UC Berkeley, UT Austin
Abstract: We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates.
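
The head-motion conditioning described in the abstract (and discussed at [01:20] and [02:11] above) can be illustrated with a small sketch. The code below is not the paper's parameterization or API; it only shows, assuming a gravity-aligned (z-up) world frame and 4x4 SE(3) head poses, how per-step conditioning features can be made invariant to the scene's origin and yaw while keeping absolute height observable. All function names are illustrative.

```python
# Minimal sketch (not the paper's exact parameterization): head-motion
# conditioning features that are invariant to world origin and yaw,
# assuming a gravity-aligned (z-up) world frame and (T, 4, 4) head poses.
import numpy as np

def yaw_rotation(R: np.ndarray) -> np.ndarray:
    """Rotation about +z matching the heading of R (illustrative helper)."""
    heading = R[:, 0]  # direction the head's forward/x axis points in the world
    yaw = np.arctan2(heading[1], heading[0])
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def invariant_head_conditioning(T_world_head: np.ndarray) -> np.ndarray:
    """Map a (T, 4, 4) head-pose trajectory to per-step features that are
    unchanged by translating the trajectory in x/y or rotating it about gravity."""
    feats = []
    for t in range(1, len(T_world_head)):
        T_prev, T_curr = T_world_head[t - 1], T_world_head[t]
        # Canonical frame: previous head position projected to floor height 0,
        # keeping only its yaw, so roll/pitch and absolute height remain visible.
        T_canon = np.eye(4)
        T_canon[:3, :3] = yaw_rotation(T_prev[:3, :3])
        T_canon[:3, 3] = np.array([T_prev[0, 3], T_prev[1, 3], 0.0])
        # Current head pose expressed in the canonical frame.
        T_rel = np.linalg.inv(T_canon) @ T_curr
        feats.append(np.concatenate([T_rel[:3, :3].reshape(-1), T_rel[:3, 3]]))
    return np.stack(feats)  # (T-1, 12) conditioning features for the diffusion model
```

Because each feature depends only on consecutive pose pairs expressed relative to a gravity-aligned canonical frame, the same motion produces the same conditioning regardless of where or when it happens in the scene, which is the spatial/temporal invariance idea the abstract refers to.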
Project page: https://egoallo.github.io/
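
For the guidance losses discussed at [04:28], the sketch below shows generically how a differentiable loss can steer sampling from a diffusion model. This is standard gradient guidance with a deterministic DDIM-style update, not EgoAllo's implementation; `denoiser`, `guidance_loss`, `cond`, and the noise schedule are placeholders, and the system's Levenberg-Marquardt-based alignment mentioned at [14:09] is not shown here.

```python
# Generic sketch of guidance-steered diffusion sampling (PyTorch).
# `denoiser`, `guidance_loss`, `cond`, and all shapes are placeholders,
# not EgoAllo's actual interfaces.
import torch

def guided_sample(denoiser, guidance_loss, cond, alphas_cumprod, shape,
                  guidance_weight=1.0):
    """Deterministic DDIM-style sampling where every step also nudges the
    sample down the gradient of a guidance loss evaluated on the predicted
    clean motion (e.g. agreement with hand observations)."""
    x = torch.randn(shape)  # start from Gaussian noise
    for t in reversed(range(len(alphas_cumprod))):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            # The network predicts the clean sample x0 from noisy x_t + conditioning.
            x0_pred = denoiser(x_in, t, cond)
            grad = torch.autograd.grad(guidance_loss(x0_pred), x_in)[0]

        x0_pred = x0_pred.detach()
        # Implied noise, then the DDIM update toward the next (less noisy) step.
        eps = (x - a_bar.sqrt() * x0_pred) / (1.0 - a_bar).sqrt()
        x = a_bar_prev.sqrt() * x0_pred + (1.0 - a_bar_prev).sqrt() * eps
        # Guidance nudge: move the sample toward lower guidance loss.
        x = x - guidance_weight * grad
    return x
```

The key design point is that the guidance loss is applied to the denoiser's prediction of the clean sample at every step, so observation constraints (such as hand detections) can shape the final motion without retraining the diffusion model.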