UniF²ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

¹School of Computer Science, Peking University  ²Computer Center, Peking University  ³Institute of Automation, Chinese Academy of Sciences  ⁴Central South University  ⁵Yau Mathematical Sciences Center and Department of Mathematical Sciences, Tsinghua University  ⁶Fudan University  ⁷Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)

Overview

UniF²ace is the first unified multimodal model designed specifically for face understanding and generation, covering tasks such as visual question answering, face image captioning, and text-to-face image generation. Its generated responses and images demonstrate UniF²ace's strong ability to capture fine-grained facial attributes.

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on coarse facial attribute understanding, with limited capacity to handle fine-grained facial attributes and without addressing generation capabilities. To overcome these limitations, we propose UniF²ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF²ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF²ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF²ace-130K demonstrate that UniF²ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
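As a rough illustration of the dual-ELBO idea, one schematic form of such a combined objective is sketched below; the balance coefficient \( \lambda \), the weighting \( w(t) \), and the exact notation are assumptions for exposition, not the paper's D3Diff definition. Here \( x_0 \) denotes the clean image-token sequence, \( x_t \) its partially masked version drawn from \( p(x_t|x_0) \), and \( y \) the text tokens.

\[
\mathcal{L}_{\text{total}}
= \underbrace{-\sum_{i}\log p_\theta\!\left(y_i \mid y_{<i}, x_0\right)}_{\text{text autoregressive loss}}
+ \lambda\,
\underbrace{\mathbb{E}_{t,\;x_t \sim p(x_t \mid x_0)}\!\left[\sum_{j:\,x_t^{j}=\texttt{[MASK]}} w(t)\,\bigl(-\log p_\theta\!\left(x_0^{j} \mid x_t, y\right)\bigr)\right]}_{\text{masked-token diffusion term (score-matching weighted)}}
\]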

Demos

Dataset

Pipeline and examples of UniF²ace-130K dataset construction. Left: a three-stage pipeline for building UniF²ace-130K. Step 1: high-quality face images are collected. Step 2: detailed captions are generated by GPT-4o together with a face attribute model trained to classify fine-grained appearance, actions, and emotions. Step 3: question-answering pairs are created from the refined captions. Together, these stages refine the GPT-4o-generated captions and produce fine-grained descriptions for VQA generation. Right: a representative example showing how UniF²ace-130K corrects (e.g., gender), enhances (e.g., bags under eyes), and reasons beyond (e.g., talking, slight tiredness) the GPT-4o-generated captions.
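To make the pipeline concrete, below is a minimal Python sketch of how such a three-stage construction could be wired together; the helper names (FaceSample, attribute_model.predict, llm.caption, llm.generate_qa) and the prompt wording are illustrative assumptions, not the released tooling.

# Hypothetical sketch of the UniF²ace-130K construction pipeline.
# All helper interfaces and prompts below are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class FaceSample:
    image_path: str
    caption: str = ""
    qa_pairs: list = field(default_factory=list)

def build_dataset(image_paths, attribute_model, llm):
    """Step 1: collect images; Step 2: attribute-grounded captioning; Step 3: QA generation."""
    dataset = []
    for path in image_paths:                              # Step 1: collected high-quality face images
        attrs = attribute_model.predict(path)             # fine-grained appearance / action / emotion labels
        prompt = ("Describe this face in detail, grounded in these predicted attributes: "
                  f"{attrs}. Correct any attribute that the image contradicts.")
        caption = llm.caption(image=path, prompt=prompt)  # Step 2: refined, attribute-aware caption
        qa_pairs = llm.generate_qa(caption=caption, n=8)  # Step 3: fine-grained VQA pairs from the caption
        dataset.append(FaceSample(path, caption, qa_pairs))
    return dataset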

Method

Our UniF²ace architecture integrates Text-to-Image (T2I) generation and Multimodal Understanding (MMU) tasks. Text inputs are encoded by a tokenizer, input images are processed by a VQGAN encoder, and the two are merged into a unified token sequence. A noise scheduler masks a subset of the image tokens, and the resulting sequence is processed by a Transformer with Mixture-of-Experts (MoE) layers. These MoE layers are grouped into generation and understanding experts: the first group operates at the token level with shared and routed experts, while the second incorporates domain-specific features at the sequence level. This hierarchical design enables fine-grained facial feature processing. The noise scheduler also provides \( p(x_t|x_0) \) for computing the D3Diff loss, which is combined with a text autoregressive loss to form the training objective.
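Below is a minimal PyTorch-style sketch of the token-level mixture-of-experts block described above; the hidden sizes, the number of experts, and the top-1 routing rule are assumptions for illustration rather than the released architecture, and the sequence-level (task-specific) expert group is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelMoE(nn.Module):
    """Token-level MoE: every token passes through the shared expert(s),
    plus one routed expert selected per token by a learned gate (top-1)."""
    def __init__(self, dim=1024, num_shared=1, num_routed=4, hidden=4096):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = nn.ModuleList(make_ffn() for _ in range(num_shared))
        self.routed = nn.ModuleList(make_ffn() for _ in range(num_routed))
        self.gate = nn.Linear(dim, num_routed)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        out = sum(expert(x) for expert in self.shared)      # shared experts see every token
        weights = F.softmax(self.gate(x), dim=-1)           # routing probabilities per token
        top_w, top_idx = weights.max(dim=-1)                # top-1 expert per token
        for i, expert in enumerate(self.routed):
            mask = (top_idx == i).unsqueeze(-1)             # tokens assigned to expert i
            out = out + mask * top_w.unsqueeze(-1) * expert(x)  # dense evaluation kept for clarity
        return out

In the full model, a second, sequence-level expert group would additionally be selected per task (understanding vs. generation) rather than per token; only the token-level routing is sketched here.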

Experiments

BibTeX


  @misc{2503.08120,
      title   = {Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models},
      author  = {Junzhe Li and Xuerui Qiu and Linrui Xu and Liya Guo and Delin Qu and Tingting Long and Chun Fan and Ming Li},
      year    = {2025},
      eprint  = {2503.08120},
      archivePrefix = {arXiv}
  }