We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in ⚡️one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion.
This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision.
As a result, it also naturally supports a wide range of zero-shot applications, such as 🌊scene flow estimation and ✂️moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while being several orders of magnitude faster. Code and models are publicly available here.
MoVieS consists of a shared image encoder, an attention-based feature backbone, and three heads that simultaneously predict 🎨appearance, 🧱geometry and 💨motion. The image shortcut for the splatter head and the time-varying Gaussian attributes are omitted for brevity. The image encoder, feature backbone and depth head are initialized from VGGT, a geometrically pretrained transformer, and the motion head is initialized from its point head. The remaining components, such as the splatter head and the camera/time embeddings, are trained from scratch.
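For intuition, a minimal PyTorch sketch of how such an encoder-backbone-heads layout could be wired is shown below; the module choices, dimensions and attribute counts are our own assumptions for exposition, not the released MoVieS implementation.

```python
import torch
import torch.nn as nn

class MoVieSSketch(nn.Module):
    """Illustrative skeleton only: a shared image encoder, an attention-based
    backbone, and three heads for appearance, geometry and motion.
    All names, shapes and head dimensions are assumptions."""

    def __init__(self, dim=768):
        super().__init__()
        # Stand-in for a ViT-style patch encoder (14x14 patches)
        self.encoder = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.splatter_head = nn.Linear(dim, 14)  # appearance: Gaussian attributes (assumed offset+scale+rotation+opacity+color)
        self.depth_head = nn.Linear(dim, 1)      # geometry: per-token depth
        self.motion_head = nn.Linear(dim, 3)     # motion: 3D displacement (time-conditioned in the real model)

    def forward(self, frames):                   # frames: (B*T, 3, H, W)
        tokens = self.encoder(frames).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        feats = self.backbone(tokens)
        return self.splatter_head(feats), self.depth_head(feats), self.motion_head(feats)
```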
Given the target timesteps \(t_q\), the motion head is conditioned on them via adaptive layer normalization (AdaLN) and predicts a 3D movement for each input pixel in a canonical space. After rasterization with the \(M\) corresponding query-time cameras, output images of shape \(M\times 3\times H\times W\) are rendered for supervision. The Gaussian attribute deformation \(\Delta\mathbf{a}\) is omitted for brevity.
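The AdaLN conditioning mentioned above can be illustrated with a short sketch: an embedding of the query timestep regresses a per-channel scale and shift that modulate the normalized motion-head features. The layer sizes and the assumption that a time embedding `t_emb` is precomputed elsewhere are ours, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaLNTimeConditioning(nn.Module):
    """Minimal AdaLN sketch: modulate normalized features with a scale/shift
    regressed from a query-time embedding (dimensions are assumptions)."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, t_emb):
        # x: (B, N, dim) motion-head tokens; t_emb: (B, dim) embedding of the query timestep
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```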
Given 3D point tracking datasets (e.g., PointOdyssey, DynamicReplica and Stereo4D), the ground-truth motion \(\Delta\mathbf{x}\) is defined as the 3D displacement of each tracked point between any two frames in the world coordinate system. Two complementary losses are applied for motion supervision: (1) a point-wise L1 loss and (2) a distribution-level loss: \[ \mathcal{L}_{\text{motion}} = \lambda_{\text{pt}}\mathcal{L}_{\text{pt}} + \lambda_{\text{dist}}\mathcal{L}_{\text{dist}} = \frac{\lambda_{\text{pt}}}{P}\sum_{i\in\Omega}\big\|\Delta\hat{\mathbf{x}}_i-\Delta\mathbf{x}_i\big\|_1 + \frac{\lambda_{\text{dist}}}{P^2}\sum_{(i,j)\in\Omega\times\Omega}\big\|\Delta\hat{\mathbf{x}}_i\cdot\Delta\hat{\mathbf{x}}_j^\top - \Delta\mathbf{x}_i\cdot\Delta\mathbf{x}_j^\top\big\|_1, \] where \(\Omega\) denotes the set of all \(P\) valid tracked points and \(\Omega\times\Omega\) is its Cartesian product.
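A direct transcription of this loss into PyTorch could look like the following; the tensor layout (one `(P, 3)` displacement per valid tracked point) and the default weights are assumptions.

```python
import torch

def motion_loss(pred, gt, lambda_pt=1.0, lambda_dist=1.0):
    """Motion supervision sketch following the formula above.
    pred, gt: (P, 3) predicted / ground-truth 3D displacements over Omega."""
    # (1) point-wise L1 loss: mean over points of the L1 norm of the residual
    l_pt = (pred - gt).abs().sum(dim=-1).mean()
    # (2) distribution-level loss: L1 difference of pairwise inner products
    gram_pred = pred @ pred.transpose(0, 1)  # (P, P)
    gram_gt = gt @ gt.transpose(0, 1)
    l_dist = (gram_pred - gram_gt).abs().mean()
    return lambda_pt * l_pt + lambda_dist * l_dist
```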
An ideal dataset for dynamic scene reconstruction would include synchronized multi-view videos with dense depth and point tracking annotations. However, such data is infeasible to capture and annotate at scale in practice. Instead, we leverage a diverse set of large-scale open-source datasets, each providing complementary supervision. Thanks to its flexible design, MoVieS can be trained jointly on these heterogeneous sources by aligning the training objectives with each dataset's available annotations, as sketched below.
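One way to picture this "align objectives to annotations" strategy is a per-sample loss that simply skips terms whose supervision a dataset does not provide. The field names and the plain MSE/L1 photometric and depth terms below are assumptions for illustration (the `motion_loss` helper is the sketch above), not the actual training objectives.

```python
import torch
import torch.nn.functional as F

def training_loss(batch, outputs):
    """Sketch of heterogeneous supervision: each term is applied only when the
    corresponding annotation exists in the batch (all names are assumptions)."""
    loss = outputs["renders"].new_zeros(())
    if batch.get("target_views") is not None:   # datasets with held-out views: photometric supervision
        loss = loss + F.mse_loss(outputs["renders"], batch["target_views"])
    if batch.get("depth") is not None:          # datasets with depth annotations
        loss = loss + F.l1_loss(outputs["depth"], batch["depth"])
    if batch.get("tracks_3d") is not None:      # 3D point tracking datasets (e.g., PointOdyssey, Stereo4D)
        loss = loss + motion_loss(outputs["motion"], batch["tracks_3d"])
    return loss
```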
Input Video | Input Reconstruction | Fixed Novel Viewpoint | Predicted Depth | Predicted Motion
Point Tracking | Predicted Depth | Predicted Motion
Input Video | Predicted Depth | Predicted Motion | Predicted Flow
Input Video | Predicted Depth | Predicted Motion | Moving Object Segmentation
If you find our work helpful, please consider citing:
@article{lin2025movies,
title={MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second},
author={Lin, Chenguo and Lin, Yuchen and Pan, Panwang and Yu, Yifan and Yan, Honglei and Fragkiadaki, Katerina and Mu, Yadong},
journal={arXiv preprint arXiv:2507.10065},
year={2025}
}