PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

CVPR 2026
* equal contribution
Work partially done during internship at Lambda.
H. Wu contributed to the work during internship at UC San Diego.
1UC San Diego 2Lambda, Inc.

Abstract

We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.

Method

PixARMesh Pipeline

Overview of PixARMesh. Given an RGB image, we use pretrained models to extract the depth point cloud and image features for both the target object and the global scene. These local and global cues are fed into the Pixel-Aligned PC-Encoder to produce the fused latent code, which is then aggregated into a single latent vector via cross-attention. This latent vector conditions the Transformer Decoder, which predicts the object's pose followed by its mesh token sequence.
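The aggregation step above, where the fused latent code is collapsed into a single conditioning vector via cross-attention, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a single attention query with no learned key/value projections, and all shapes, names, and weights are placeholder assumptions.

```python
import numpy as np

def cross_attention_pool(tokens, query):
    """Pool N latent tokens (N, d) into one vector (d,) with a single
    attention query, a simplified stand-in for the paper's cross-attention
    aggregation. `query` plays the role of a learned query (assumed)."""
    d = tokens.shape[1]
    scores = tokens @ query / np.sqrt(d)      # scaled dot-product scores, (N,)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ tokens                   # attention-weighted sum, (d,)

rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))  # fused pixel-aligned latent code (toy shape)
query = rng.standard_normal(64)           # hypothetical learned aggregation query
scene_vec = cross_attention_pool(latents, query)
print(scene_vec.shape)  # (64,)
```

The resulting vector would then condition the Transformer Decoder, which emits the pose tokens followed by the mesh token sequence.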

Comparisons on 3D-FRONT

Qualitative comparisons on the synthetic 3D-FRONT dataset

Comparisons on Real-world Images

Qualitative comparisons on the Pix3D dataset and our own images
Comparison on Compactness

PixARMesh provides a compact representation for scene reconstruction, using far fewer faces and vertices than prior methods while preserving high-quality geometry.

Method                        Faces    Vertices
InstPIFu                      1.94M    971K
Uni-3D                        141K     70.8K
BUOL                          55.5K    27.8K
Gen3DSR                       364K     217K
DeepPriorAssembly             251K     125K
MIDI                          1.94M    968K
DepR                          320K     160K
PixARMesh-EdgeRunner (Ours)   7.1K     4.3K
PixARMesh-BPT (Ours)          7.5K     4.1K

BibTeX


  @article{zhang2026pixarmesh,
    title={PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction},
    author={Zhang, Xiang and Yoo, Sohyun and Wu, Hongrui and Li, Chuan and Xie, Jianwen and Tu, Zhuowen},
    journal={arXiv preprint arXiv:2603.05888},
    year={2026}
  }