Instant 3D Human Avatar Generation using Image Diffusion Models

ECCV 2024
Google Research*
* Now at Google DeepMind.

AvatarPopUp generates rigged 3D models from text or images, with control over pose and body shape. In this example, we show renderings of 352 meshes generated from various text prompts in 3.5 hours on a single A100 GPU.

Abstract

We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling, which allows us to leverage powerful image synthesis priors trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning to solve tasks such as image generation and back-view prediction, and to support multiple, qualitatively different 3D hypotheses. Our partial fine-tuning approach allows us to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four orders of magnitude speedup w.r.t. the vast majority of existing methods, most of which solve only a subset of our tasks, and with fewer controls, thus enabling applications that require the controlled 3D generation of human avatars at scale.

Method

Method teaser.
AvatarPopUp builds on the capacity of text-to-image models to generate highly detailed and diverse input images. First, a Latent Diffusion network takes a text prompt and a target body pose and shape G, and generates a highly detailed front image of a person. Next, a second network generates a consistent back view in the same pose and clothing. We then perform pixel-aligned 3D reconstruction from the generated front and back images, conditioned on the 3D body pose and shape G. This decoupling enables the generation of 3D avatars from either text or a single image.
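The three-stage data flow described above can be sketched as follows. This is an illustrative outline only: all function names, resolutions, and mesh sizes are placeholders of our own (the page does not describe a released API), and each diffusion stage is stubbed out so that only the wiring between stages is shown.

```python
# Hypothetical sketch of the AvatarPopUp pipeline structure. Function names
# and shapes are illustrative assumptions, not the authors' code; the
# diffusion stages are stubbed with dummy outputs.
import numpy as np

H = W = 512  # assumed working resolution


def front_view_diffusion(prompt, body_pose_shape):
    """Stage 1: latent diffusion conditioned on the text prompt and the
    target body pose/shape G, producing a detailed front image."""
    return np.zeros((H, W, 3))  # placeholder front image


def back_view_diffusion(front_image, body_pose_shape):
    """Stage 2: a second network generates a back view consistent with the
    front image's pose and clothing."""
    return np.zeros((H, W, 3))  # placeholder back image


def lift_to_3d(front_image, back_image, body_pose_shape):
    """Stage 3: pixel-aligned 3D reconstruction from the two views."""
    return {
        "vertices": np.zeros((1000, 3)),          # placeholder mesh vertices
        "faces": np.zeros((2000, 3), dtype=int),  # placeholder triangles
    }


def text_to_avatar(prompt, G):
    """Full text-to-3D path; the image-to-3D path skips stage 1 and starts
    from a given front image instead."""
    front = front_view_diffusion(prompt, G)
    back = back_view_diffusion(front, G)
    return lift_to_3d(front, back, G)


mesh = text_to_avatar("a person wearing a red raincoat", G=None)
```

Because the generation and 3D lifting are decoupled, the same `lift_to_3d` stage serves both the text-to-3D and image-to-3D use cases.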

Image to 3D

AvatarPopUp can be used for image-to-3D synthesis. We do so by first predicting a plausible back view using our back image generator, and then lifting the image pair to 3D using our 3D lifter. The entire image-to-mesh process takes less than 10 seconds.

Text to 3D

AvatarPopUp can be used for text-to-3D synthesis. Given text, pose, and body shape controls we use cascaded diffusion networks to sample front and back views. Our 3D lifter then outputs a 3D mesh given the image evidence. The whole process takes less than 10 seconds per example on an A100 GPU.
text-to-3d teaser.
All 77 meshes in the image above were generated in under 12 minutes.
We also show a large number of 360° renderings of our text-based generations, where we sampled random poses, body shapes, and text prompts.
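As a quick sanity check (simple arithmetic, not a figure from the paper), the batch timing above is consistent with the quoted per-example runtime:

```python
# 77 meshes in under 12 minutes implies an upper bound on the per-mesh time.
grid_meshes = 77
grid_minutes = 12  # stated upper bound for the whole batch

grid_sec_per_mesh = grid_minutes * 60 / grid_meshes  # ~9.35 s per mesh
```

This upper bound of roughly 9.4 seconds per mesh matches the stated "less than 10 seconds per example on an A100 GPU".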
A person wearing ...

Text + Image to 3D

With AvatarPopUp we can do multimodal 3D generation. One use case is 3D virtual try-on. Given an image of a person and a text prompt describing the clothing, we can generate a 3D avatar wearing the target clothing, while at the same time preserving the identity (face + body shape) of the person in the source image.
Our method also allows more fine-grained modifications. For example, we can change only specific garments, or we can sample new identities wearing the same clothes.

Animating avatars

With AvatarPopUp we can generate animation-ready avatars. We first generate an avatar in an animation-friendly pose and then leverage the conditioning body model to rig the estimated 3D shape. As a result of our conditioning strategy, 3D avatars and the conditional body model instances are well-aligned in 3D. This allows us to anchor the reconstructed 3D shape on the body model surface and rig it accordingly.

BibTeX

@inproceedings{kolotouros2024avatarpopup,
  author    = {Kolotouros, Nikos and Alldieck, Thiemo and Corona, Enric and Bazavan, Eduard Gabriel and Sminchisescu, Cristian},
  title     = {Instant 3D Human Avatar Generation using Image Diffusion Models},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
}