DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation

International Conference on Learning Representations (ICLR) 2025

1Peking University   2ByteDance
† denotes project leader, ‡ denotes corresponding author

DiffSplat is a generative framework that synthesizes 3D Gaussian Splats from text prompts and single-view images in ⚡️ 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

🧩   Abstract   🧩

Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussians by taming large-scale text-to-image diffusion models.

It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views.

The compatibility with image diffusion models enables seamless adaptations of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism. Code and models are publicly available here.

🎯   Motivation   🎯

(1) Native 3D methods (such as CLAY, GVGEN and GaussianCube) require extra time-intensive preprocessing for training data curation and cannot easily leverage pretrained 2D models, posing great challenges to the quality and scale of 3D datasets, as well as to the efficiency of training 3D networks from scratch.

(2) Rendering-based methods (such as HoloDiffusion, Viewset Diffusion and DMV3D) only need multi-view images for supervision, but suffer from unstable training and share the same drawback as native 3D methods: they cannot leverage pretrained 2D generative priors.

(3) Reconstruction-based methods (such as LGM, Meta 3D AssetGen and MeshFormer) leverage 2D priors by using frozen image diffusion models to generate multi-view images, which are then lifted to 3D by generalizable reconstruction models. However, they treat the multi-view diffusion model as an independent plug-and-play module, so 3D generation becomes a two-stage process that requires extra network parameters and may collapse due to 3D inconsistencies in the generated images.

(4) Inheriting the advantages of these three kinds of methods, DiffSplat fine-tunes pretrained image diffusion models for direct 3DGS generation, effectively utilizing 2D diffusion priors and maintaining 3D consistency.

🔮   Method   🔮

(1) Data Curation (Structured Gaussian Reconstruction): a lightweight reconstruction model provides high-quality structured Gaussian representations for "pseudo" dataset curation.

(2) Gaussian Latents: an image VAE is fine-tuned to encode Gaussian properties into a shared latent space.

(3) 3D Generation: DiffSplat natively generates 3D content from image and text conditions, utilizing 2D priors from text-to-image diffusion models.
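The following minimal PyTorch-style sketch shows how these pieces could fit together at training time, combining the regular diffusion loss on encoded Gaussian latents with the 3D rendering loss on views rasterized from the decoded splats. The interfaces of `unet`, `vae`, and `renderer`, the tensor shapes, and the noise schedule are illustrative assumptions rather than the released implementation.

# Minimal training-step sketch: diffusion loss on multi-view Gaussian latents
# plus a 3D rendering loss on views rasterized from the decoded splats.
# `unet`, `vae`, `renderer`, tensor shapes and the noise schedule are
# illustrative assumptions, not DiffSplat's released code.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # DDPM cumulative schedule

def training_step(unet, vae, renderer, gaussian_grid, text_emb, cams, gt_images,
                  lambda_render=1.0):
    # gaussian_grid: (B, C, H, W) multi-view "splatter image" whose per-pixel
    # channels pack splat properties (e.g. color, opacity, scale, rotation, position).
    x0 = vae.encode(gaussian_grid)                        # Gaussian latents (B, 4, h, w)

    # Standard epsilon-prediction diffusion loss on the latents.
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    ab = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps         # forward diffusion q(x_t | x_0)
    eps_hat = unet(xt, t, text_emb)
    loss_diff = F.mse_loss(eps_hat, eps)

    # Rendering loss: estimate x_0, decode back to Gaussian properties, and
    # rasterize at arbitrary camera views against ground-truth renderings.
    x0_hat = (xt - (1.0 - ab).sqrt() * eps_hat) / ab.sqrt()
    splats = vae.decode(x0_hat)
    rendered = renderer(splats, cams)                     # differentiable 3DGS rasterizer
    loss_render = F.mse_loss(rendered, gt_images)

    return loss_diff + lambda_render * loss_render

Because supervision comes from renderings at arbitrary camera views rather than from the latent grid alone, the extra term encourages 3D coherence in the decoded Gaussians.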

📝   Text-to-3D Generation   📝

Text prompts are taken from T3Bench, Instant3D and LATTE3D.

A beautiful rainbow fish · A bright red fire hydrant · A brown horse in a green pasture · A colorful camping tent in a patch of grass
A fluffy, orange cat · A green enameled watering can · A green frog on a lily pad · A human skull
A jar of homemade jam · A lighthouse on a rocky shore · A plush velvet armchair · A red cardinal on a snowy branch
A red rose in a crystal vase · A silver mirror with ornate detailing · A small porcelain white rabbit figurine · A toy robot
A tree stump with an axe buried in it · A velvet cushion stitched with golden threads · A vibrant orange pumpkin sitting on a hay bale · A vintage porcelain doll with a frilly dress
A well-loved stuffed teddy bear · A worn-out red flannel shirt · An expensive office chair · An intricate ceramic vase with peonies painted on it

🖼️   Image-to-3D Generation   🖼️

Single-view images are taken from InstantMesh, Era3D, GPTEval3D and the Internet.

🕹️   Controllable Generation   🕹️

ControlNet for DiffSplat

Normal map (from the original object): "A steampunk robot with brass gears and steam pipes" · "A cute cartoon robot with oversized eyes"
Depth map (from the original object): "A Santa festive plush bear toy" · "An adorable baby panda"
Canny edge (from the original object): "An ancient, leather-bound magic book with shimmering gold leaf pages" · "A slice of freshly baked bread with a golden crusty exterior and a soft interior"
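A hypothetical sampling sketch of this setup is given below (placeholder interfaces, not the released API): a ControlNet-style branch consumes the control image (normal, depth, or Canny edge map), its residual features are injected into the denoiser at every denoising step with classifier-free guidance, and the final latents are decoded into multi-view Gaussian splats.

# Hypothetical sketch of ControlNet-style conditioned sampling for Gaussian latents.
# `unet`, `controlnet`, `vae`, the `control_residuals` keyword, and the default
# hyperparameters are illustrative assumptions, not DiffSplat's released API.
import torch

@torch.no_grad()
def sample_with_control(unet, controlnet, vae, text_emb, null_emb, control_img,
                        steps=30, guidance=4.5, shape=(1, 4, 64, 64)):
    # DDPM cumulative schedule (same convention as the training sketch above).
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    ts = torch.linspace(T - 1, 0, steps).long()           # DDIM timestep subset

    x = torch.randn(shape)                                # noisy Gaussian latents
    for i, t in enumerate(ts):
        ab = alphas_bar[t]
        # Control branch: residual features conditioned on the control image.
        res = controlnet(x, t, text_emb, control_img)
        # Classifier-free guidance over the text condition (control kept for both).
        eps_c = unet(x, t, text_emb, control_residuals=res)
        eps_u = unet(x, t, null_emb, control_residuals=res)
        eps = eps_u + guidance * (eps_c - eps_u)
        # Deterministic DDIM update (eta = 0) to the previous timestep.
        x0 = (x - (1.0 - ab).sqrt() * eps) / ab.sqrt()
        ab_prev = alphas_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        x = ab_prev.sqrt() * x0 + (1.0 - ab_prev).sqrt() * eps

    return vae.decode(x)                                  # multi-view Gaussian splats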

Text-guided Reconstruction with DiffSplat

A sculpture · A mask · A headwear · A flat rock
A pencil box · A small purse · A ball · A flat badge

🛠️   Ablation Studies   🛠️

DiffSplat Design Choices

Effectiveness of Rendering Loss

Side-by-side comparisons of generations with and without the rendering loss, including "A plush dragon toy" and "An old car overgrown by vines and weeds".

🎨   More Results   🎨

A. Compared with CLAY (Rodin Gen-1)

Since CLAY itself is not open-sourced, we evaluated Rodin, a commercial product based on the CLAY technique, via a membership subscription on its website.

It is worth noting that Rodin:

  1. is trained on proprietary internal datasets
  2. originates from CLAY, which was initially trained on 256 A800 GPUs for 15 days
  3. has undergone months of updates incorporating sophisticated techniques such as RLHF
  4. employs a complex generation pipeline composed of multiple distinct models (text-to-image, image-to-raw 3D, 3D object captioning and attribute prediction, 3D geometry refinement, PBR material generation and refinement, etc.)
  5. takes about 1 minute to generate a 3D object through its pipeline

Despite these extensive resources and refinement steps, Rodin (Gen-1 RLHF V0.9) still:

  • struggles to generate intricate and precise structures such as leaves and fur
  • often fails to faithfully adhere to the provided image and text conditions

In contrast, our method is trained on open-source datasets using only 8 A100 GPUs over 2~5 days, demonstrating greater efficiency and accessibility. Moreover, we believe incorporating refinement stages in the proposed method is a promising direction for future exploration.

Each comparison shows Rodin Stage-1, Rodin Final, and Ours (1~2s) for:

  • Prompt: "A fragrant pine Christmas wreath"
  • Prompt: "A faux-fur leopard print hat"
  • Input Image + Prompt: "A sculpture"
  • Input Image + Prompt: "A mask"

B. Compared with Amortized SDS-based method, LATTE3D

Results of LATTE3D are taken from its project page.

Our results demonstrate competitive visual quality with less over-saturation compared to the SDS-based MVDream, while achieving significantly faster inference speed. Although our method is slightly slower than the feed-forward (non-generative) approach LATTE3D, it offers better alignment with text prompts.


C. Compared with SDS-based method, GaussianDreamer

Results of GaussianDreamer are taken from its project page.

Our results demonstrate competitive visual quality with less over-saturation compared to the SDS-based GaussianDreamer, while achieving significantly faster inference speed. Similar to GaussianDreamer, which uses Shap-E to generate initial point clouds, our generated 3D Gaussians can also serve as input to SDS-based methods for further refinement, which could be a promising direction for future research.

Each comparison shows GaussianDreamer (15min) vs. Ours (1~2s) for:

  • "Viking axe, fantasy, weapon, blender, 8k, HD"
  • "Flamethrower, with fire, scifi, cyberpunk, photorealistic, 8K, HD"
  • "A freshly baked loaf of sourdough bread on a cutting board"
  • "A zoomed out DSLR photo of an amigurumi motorcycle"

D. Generation of Very Thin Objects

To evaluate the 3D consistency of our generated multi-view Gaussian latents, we use prompts describing very thin objects, taken from T3Bench or generated by ChatGPT. The results demonstrate that our method can generate 3D-coherent multi-view splatter images for thin objects without obvious artifacts or distortions.

A delicate battle axe with an ultrathin, sharp blade and an intricately carved wooden handle · A slender battle axe with a crescent-shaped blade and a long, narrow shaft · Coffee cup with many holes · A rustic, weathered bookshelf with built-in storage compartments and ties of evenly shaped shelves
A sleek combat axe with a razor-edged blade and a slim, ergonomic handle designed for swift strikes · A shimmering emerald pendant necklace · A delicate, handmade lace doily · An old bronze ship's wheel

🌐   Related Links   🌐

Native 3D Generative Models

Rendering-based 3D (Generative or Reconstruction) Models

Reconstruction-based 3D Generative Models

📚   BibTeX   📚

If you find our work helpful, please consider citing:

@inproceedings{lin2025diffsplat,
  title={DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation},
  author={Lin, Chenguo and Pan, Panwang and Yang, Bangbang and Li, Zeming and Mu, Yadong},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}