We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in ⚡️one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion.
This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision.
As a result, it also naturally supports a wide range of zero-shot applications, such as 🌊scene flow estimation and ✂️moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while being several orders of magnitude faster. Code and models are publicly available here.
MoVieS consists of a shared image encoder, an attention-based feature backbone, and three heads that simultaneously predict 🎨appearance, 🧱geometry and 💨motion. The image shortcut for the splatter head and the time-varying Gaussian attributes are omitted for brevity. The image encoder, feature backbone and depth head are initialized from VGGT, a geometrically pretrained transformer, and the motion head is initialized from its point head. The remaining components, such as the splatter head and the camera/time embeddings, are trained from scratch.
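For intuition, a minimal PyTorch sketch of how such an encoder-backbone-heads layout could be wired is shown below; the module choices, dimensions and attribute counts are our own assumptions for exposition, not the released MoVieS implementation.

```python
import torch
import torch.nn as nn

class MoVieSSketch(nn.Module):
    """Illustrative skeleton only: a shared image encoder, an attention-based
    backbone, and three heads for appearance, geometry and motion.
    All names, shapes and head dimensions are assumptions."""

    def __init__(self, dim=768):
        super().__init__()
        # Stand-in for a ViT-style patch encoder (14x14 patches)
        self.encoder = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.splatter_head = nn.Linear(dim, 14)  # appearance: Gaussian attributes (assumed offset+scale+rotation+opacity+color)
        self.depth_head = nn.Linear(dim, 1)      # geometry: per-token depth
        self.motion_head = nn.Linear(dim, 3)     # motion: 3D displacement (time-conditioned in the real model)

    def forward(self, frames):                   # frames: (B*T, 3, H, W)
        tokens = self.encoder(frames).flatten(2).transpose(1, 2)  # (B*T, N, dim)
        feats = self.backbone(tokens)
        return self.splatter_head(feats), self.depth_head(feats), self.motion_head(feats)
```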
Given the target timesteps \(t_q\), the motion head is conditioned on them via adaptive layer normalization (AdaLN) and predicts a 3D movement for each input pixel in a canonical space. After rasterization with the \(M\) corresponding query-time cameras, output images of shape \(M\times 3\times H\times W\) are rendered for supervision. The Gaussian attribute deformation \(\Delta\mathbf{a}\) is omitted for brevity.
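The AdaLN conditioning mentioned above can be illustrated with a short sketch: an embedding of the query timestep regresses a per-channel scale and shift that modulate the normalized motion-head features. The layer sizes and the assumption that a time embedding `t_emb` is precomputed elsewhere are ours, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaLNTimeConditioning(nn.Module):
    """Minimal AdaLN sketch: modulate normalized features with a scale/shift
    regressed from a query-time embedding (dimensions are assumptions)."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, t_emb):
        # x: (B, N, dim) motion-head tokens; t_emb: (B, dim) embedding of the query timestep
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```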
Given 3D point tracking datasets (e.g., PointOdyssey, DynamicReplica and Stereo4D), the ground-truth motion \(\Delta\mathbf{x}\) is defined as the 3D displacement of each tracked point between any two frames in the world coordinate system. Two complementary losses are applied for motion supervision: (1) a point-wise L1 loss and (2) a distribution-level loss: \[ \mathcal{L}_{\text{motion}} = \lambda_{\text{pt}}\mathcal{L}_{\text{pt}} + \lambda_{\text{dist}}\mathcal{L}_{\text{dist}} = \frac{\lambda_{\text{pt}}}{P}\sum_{i\in\Omega}\big\|\Delta\hat{\mathbf{x}}_i-\Delta\mathbf{x}_i\big\|_1 + \frac{\lambda_{\text{dist}}}{P^2}\sum_{(i,j)\in\Omega\times\Omega}\big\|\Delta\hat{\mathbf{x}}_i\cdot\Delta\hat{\mathbf{x}}_j^\top - \Delta\mathbf{x}_i\cdot\Delta\mathbf{x}_j^\top\big\|_1, \] where \(\Omega\) denotes the set of all \(P\) valid tracked points and \(\Omega\times\Omega\) is its Cartesian product.
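A direct transcription of this loss into PyTorch could look like the following; the tensor layout (one `(P, 3)` displacement per valid tracked point) and the default weights are assumptions.

```python
import torch

def motion_loss(pred, gt, lambda_pt=1.0, lambda_dist=1.0):
    """Motion supervision sketch following the formula above.
    pred, gt: (P, 3) predicted / ground-truth 3D displacements over Omega."""
    # (1) point-wise L1 loss: mean over points of the L1 norm of the residual
    l_pt = (pred - gt).abs().sum(dim=-1).mean()
    # (2) distribution-level loss: L1 difference of pairwise inner products
    gram_pred = pred @ pred.transpose(0, 1)  # (P, P)
    gram_gt = gt @ gt.transpose(0, 1)
    l_dist = (gram_pred - gram_gt).abs().mean()
    return lambda_pt * l_pt + lambda_dist * l_dist
```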
An ideal dataset for dynamic scene reconstruction would include synchronized multi-view videos with dense depth and point tracking annotations. However, such data is infeasible to capture and annotate at scale in practice. Instead, we leverage a diverse set of large-scale open-source datasets, each providing complementary supervision. Thanks to its flexible design, MoVieS can be trained jointly on these heterogeneous sources by aligning the training objectives with each dataset's available annotations, as sketched below.
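One way to picture this "align objectives to annotations" strategy is a per-sample loss that simply skips terms whose supervision a dataset does not provide. The field names and the plain MSE/L1 photometric and depth terms below are assumptions for illustration (the `motion_loss` helper is the sketch above), not the actual training objectives.

```python
import torch
import torch.nn.functional as F

def training_loss(batch, outputs):
    """Sketch of heterogeneous supervision: each term is applied only when the
    corresponding annotation exists in the batch (all names are assumptions)."""
    loss = outputs["renders"].new_zeros(())
    if batch.get("target_views") is not None:   # datasets with held-out views: photometric supervision
        loss = loss + F.mse_loss(outputs["renders"], batch["target_views"])
    if batch.get("depth") is not None:          # datasets with depth annotations
        loss = loss + F.l1_loss(outputs["depth"], batch["depth"])
    if batch.get("tracks_3d") is not None:      # 3D point tracking datasets (e.g., PointOdyssey, Stereo4D)
        loss = loss + motion_loss(outputs["motion"], batch["tracks_3d"])
    return loss
```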
Input Video | Input Reconstruction | Fixed Novel Viewpoint | Predicted Depth | Predicted Motion
Point Tracking | Predicted Depth | Predicted Motion
Input Video | Predicted Depth | Predicted Motion | Predicted Flow
Input Video | Predicted Depth | Predicted Motion | Moving Object Segmentation
If you find our work helpful, please consider citing:
@article{lin2025movies,
title={MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second},
author={Lin, Chenguo and Lin, Yuchen and Pan, Panwang and Yu, Yifan and Yan, Honglei and Fragkiadaki, Katerina and Mu, Yadong},
journal={arXiv preprint arXiv:2507.10065},
year={2025}
}