Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussians by taming large-scale text-to-image diffusion models.
It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views.
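As a sketch, the combined training objective can be written as a standard denoising loss on the Gaussian grid latents plus a splatting-based rendering loss over sampled views (the weight \lambda and the notation below are illustrative, not the paper's exact formulation):

\mathcal{L} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\,\lVert \epsilon_\theta(\mathbf{z}_t, t, c) - \epsilon \rVert_2^2\,\right] \;+\; \lambda \sum_{v} \lVert \mathcal{R}(\hat{\mathbf{G}}, \pi_v) - \mathbf{I}_v \rVert

where \hat{\mathbf{G}} denotes the Gaussians decoded from the model's prediction, \mathcal{R} is the differentiable Gaussian splatting renderer, and \pi_v, \mathbf{I}_v are the camera pose and ground-truth rendering for view v.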
The compatibility with image diffusion models enables seamless adaptation of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanisms. Code and models are publicly available here.
(1) Native 3D methods (such as CLAY, GVGEN and GaussianCube) require extra time-intensive preprocessing for training data curation and face limitations in leveraging pretrained 2D models, posing great challenges to the quality and scale of 3D datasets, as well as the efficiency of 3D network training from scratch.
(2) Rendering-based methods (such as HoloDiffusion, Viewset Diffusion and DMV3D) only need multi-view images for supervision, but suffer from unstable training and share the drawback of native 3D methods in that they cannot leverage pretrained 2D generative priors.
(3) Reconstruction-based methods (such as LGM, Meta 3D AssetGen and MeshFormer) leverage 2D priors by utilizing frozen image diffusion models to generate multi-view images, followed by generalizable 3D reconstruction models. However, they regard the multi-view diffusion model as an independent plug-and-play module, so 3D generation is conducted as a two-stage procedure, which requires extra network parameters and may collapse due to 3D inconsistency in the generated images.
(4) Inheriting the advantages of these three kinds of methods, DiffSplat fine-tunes pretrained image diffusion models for direct 3DGS generation, effectively utilizing 2D diffusion priors and maintaining 3D consistency.
(1) Data Curation via Structured Gaussian Reconstruction: a lightweight reconstruction model provides high-quality structured Gaussian representations for "pseudo" dataset curation. (2) Gaussian Latents: an image VAE is fine-tuned to encode Gaussian properties into a shared latent space. (3) 3D Generation: DiffSplat natively generates 3D content from text or image conditions, utilizing 2D priors from text-to-image diffusion models.
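To make the representation concrete, below is a minimal, self-contained sketch (not the official implementation; the 14-channel layout and all names are assumptions for illustration) of how per-pixel Gaussian properties can be packed into an image-like grid that an image VAE and diffusion UNet can process directly:

import torch

# Illustrative channel layout for a per-pixel 3D Gaussian; the actual
# parameterization in DiffSplat may differ (this is an assumption).
# RGB (3) + opacity (1) + position xyz (3) + scale (3) + rotation quat (4)
CH = {"rgb": 3, "opacity": 1, "xyz": 3, "scale": 3, "rot": 4}
GAUSSIAN_CHANNELS = sum(CH.values())  # 14

def pack_gaussian_grid(props: dict[str, torch.Tensor]) -> torch.Tensor:
    """Pack per-pixel Gaussian properties into an image-like tensor of
    shape (B, 14, H, W), so an image VAE / diffusion UNet can consume
    it exactly like a multi-channel image."""
    return torch.cat([props[k] for k in CH], dim=1)

def unpack_gaussian_grid(grid: torch.Tensor) -> dict[str, torch.Tensor]:
    """Inverse of pack_gaussian_grid: split the channels back into
    named Gaussian properties for splatting-based rendering."""
    out, i = {}, 0
    for k, c in CH.items():
        out[k] = grid[:, i : i + c]
        i += c
    return out

# Multi-view grids: V views are flattened into the batch so one
# denoising pass predicts all views jointly, which, together with the
# rendering loss, encourages cross-view 3D consistency.
B, V, H, W = 2, 4, 64, 64
props = {k: torch.randn(B * V, c, H, W) for k, c in CH.items()}
grid = pack_gaussian_grid(props)  # (B*V, 14, 64, 64)
assert unpack_gaussian_grid(grid)["xyz"].shape == (B * V, 3, H, W)

This image-like layout is what lets the fine-tuned VAE encode Gaussian properties into the same latent space used by the pretrained text-to-image diffusion model.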
A beautiful rainbow fish | A bright red fire hydrant | A brown horse in a green pasture | A colorful camping tent in a patch of grass |
A fluffy, orange cat | A green enameled watering can | A green frog on a lily pad | A human skull |
A jar of homemade jam | A lighthouse on a rocky shore | A plush velvet armchair | A red cardinal on a snowy branch |
A red rose in a crystal vase | A silver mirror with ornate detailing | A small porcelain white rabbit figurine | A toy robot |
A tree stump with an axe buried in it | A velvet cushion stitched with golden threads | A vibrant orange pumpkin sitting on a hay bale | A vintage porcelain doll with a frilly dress |
A well-loved stuffed teddy bear | A worn-out red flannel shirt | An expensive office chair | An intricate ceramic vase with peonies painted on it |
Single-view images are taken from InstantMesh, Era3D, GPTEval3D and the Internet.
Original Object | Normal Map | A steampunk robot with brass gears and steam pipes | A cute cartoon robot with oversized eyes |
Original Object | Depth Map | A Santa festive plush bear toy | An adorable baby panda |
Original Object | Canny Edge | An ancient, leather-bound magic book with shimmering gold leaf pages | A slice of freshly baked bread with a golden crusty exterior and a soft interior |
A sculpture | A mask | A headwear | A flat rock |
A pencil box | A small purse | A ball | A flat badge |
Ablation on the rendering loss. Each comparison shows results With Rendering Loss | Without Rendering Loss for the prompts:
A plush dragon toy | An old car overgrown by vines and weeds
Since CLAY itself is not open-sourced, we evaluated Rodin, a commercial product based on the CLAY technique, via a membership subscription on its website.
It is worth noting that Rodin is a multi-stage system developed with extensive resources and refinement steps. Despite this, Rodin (Gen-1 RLHF V0.9) still struggles with the examples shown below.
Each comparison shows, from left to right: Rodin Stage-1 | Rodin Final | Ours (1~2s)
Prompt: "A fragrant pine Christmas wreath"
Prompt: "A faux-fur leopard print hat"
Input Image + Prompt: "A sculpture"
Input Image + Prompt: "A mask"
Results of LATTE3D are taken from its project page.
Our results demonstrate competitive visual quality with less over-saturation compared to the SDS-based MVDream, while achieving significantly faster inference speed. Although our method is slightly slower than the feed-forward (non-generative) approach LATTE3D, it offers better alignment with text prompts.
Results of GaussianDreamer are taken from its project page.
Our results demonstrate competitive visual quality with less over-saturation compared to the SDS-based GaussianDreamer, while achieving significantly faster inference. Similar to GaussianDreamer, which uses Shap-E to generate initial point clouds, our generated 3D Gaussians can also serve as input to SDS-based methods for further refinement, which could be a promising direction for future research.
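For reference, such SDS-based refinement would optimize the generated Gaussian parameters \theta with the score distillation gradient of DreamFusion (the standard formulation, not specific to DiffSplat):

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(\mathbf{x}_t;\, y, t) - \epsilon\big)\, \frac{\partial \mathbf{x}}{\partial \theta}\,\right]

where \mathbf{x} is a differentiable rendering of the Gaussians, \epsilon_\phi is a frozen image diffusion model conditioned on the prompt y, and w(t) is a timestep-dependent weight; the generated Gaussians would simply replace the random or Shap-E initialization.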
GaussianDreamer (15min) | Ours (1~2s) | GaussianDreamer (15min) | Ours (1~2s) |
Viking axe, fantasy, weapon, blender, 8k, HD |
Flamethrower, with fire, scifi, cyberpunk, photorealistic, 8K, HD | ||
A freshly baked loaf of sourdough bread on a cutting board |
A zoomed out DSLR photo of an amigurumi motorcycle |
To evaluate the 3D consistency of our generated multi-view Gaussian latents, we use prompts, taken from T3Bench or generated by ChatGPT, that describe very thin objects. The results demonstrate that our method generates 3D-coherent multi-view splatter images for thin objects without obvious artifacts or distortions.
A delicate battle axe with an ultrathin, sharp blade and an intricately carved wooden handle | A slender battle axe with a crescent-shaped blade and a long, narrow shaft | Coffee cup with many holes |
A rustic, weathered bookshelf with built-in storage compartments and tiers of evenly spaced shelves |
A sleek combat axe with a razor-edged blade and a slim, ergonomic handle designed for swift strikes | A shimmering emerald pendant necklace | A delicate, handmade lace doily |
An old bronze ship's wheel |
If you find our work helpful, please consider citing:
@inproceedings{lin2025diffsplat,
title={DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation},
author={Lin, Chenguo and Pan, Panwang and Yang, Bangbang and Li, Zeming and Mu, Yadong},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}