VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models

University of Michigan, Ann Arbor
*Denotes Equal Contribution

TL;DR: We present a unified framework for predicting forward/inverse/partial PDE solutions using a video inpainting diffusion model.

Input Video (1% pixels)
PINO
Ours
GT

Flexible PDE solution predictions. From sparse spatiotemporal observations (left), our method predicts past and future frames and reconstructs the full-field solution more flexibly and accurately than existing state-of-the-art methods, e.g., PINO (Li et al.).

VideoPDE pipeline. We cast PDE solving as a video inpainting task. Our Hierarchical Video Diffusion Transformer (HV-DiT) denoises initial noise into a full video, conditioned on pixel-level sparse measurements. Its ability to handle arbitrary input patterns enables flexible application to diverse PDE scenarios, including forward, inverse, and continuous measurement tasks.
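To make the pixel-level conditioning concrete, here is a minimal sketch (in NumPy, with invented shapes and a hypothetical `build_condition` helper) of one common way to feed sparse measurements to an inpainting denoiser: concatenating the masked observations and the binary observation mask to the noisy video along the channel axis. The paper's HV-DiT may condition differently; this only illustrates the idea.

```python
import numpy as np

def build_condition(video_obs, mask, noisy_video):
    """Illustrative conditioning: concatenate the masked observations and the
    binary mask to the noisy video along the channel axis, so the denoiser
    can see which pixels are measured and what their values are.
    Shapes: video_obs/noisy_video are (T, H, W, C); mask is (T, H, W, 1)."""
    masked = video_obs * mask  # zero out unobserved pixels
    return np.concatenate([noisy_video, masked, mask], axis=-1)

# Toy example: an 8-frame, 16x16, 1-channel video with ~1% of pixels observed.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16, 16, 1))
mask = (rng.random((8, 16, 16, 1)) < 0.01).astype(np.float32)
noise = rng.standard_normal((8, 16, 16, 1))
cond = build_condition(video, mask, noise)
assert cond.shape == (8, 16, 16, 3)
```

Because the mask is an explicit input channel, the same network handles arbitrary observation patterns (a dense first frame, scattered sensors, or a last frame), which is what makes the forward, inverse, and continuous-measurement settings interchangeable at inference time.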

Conceptual comparison of PDE-solving methods. Neural operator methods struggle with partial inputs. Only PINN and VideoPDE handle forward, inverse, and continuous measurements flexibly. Generative baselines focus on reconstructing one or two frames (instead of dense temporal frames) and are often not designed for forward prediction, where VideoPDE excels. The forward error is measured on the Navier-Stokes dataset.

Kolmogorov Flow Forward Prediction

Using our video inpainting framework, we can predict future frames from the first-frame initial condition. On the complex Kolmogorov flow, VideoPDE performs noticeably better than prior ML-based methods.
First image
Input (First Frame)
DeepONet
FNO
PINO
Ours
GT

Kolmogorov Flow Forward Prediction from 3% Observation

Thanks to our flexible video inpainting framework, VideoPDE can predict the full-field future frames from partial pixel observations of the initial-condition frame (3% shown here). PINO, the state of the art in forward modeling, is given an interpolated first frame and performs significantly worse than our generative approach.
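Since dense-input baselines like PINO need a full first frame, the sparse observations must first be densified. A minimal sketch of such preprocessing, using nearest-neighbor interpolation via `scipy.interpolate.griddata` (the paper's exact interpolation scheme is not specified here, so this is only an assumed stand-in):

```python
import numpy as np
from scipy.interpolate import griddata

def interpolate_sparse_frame(values, mask):
    """Densify a sparsely observed frame by nearest-neighbor interpolation.
    `values` is an (H, W) field; `mask` is a boolean (H, W) array marking
    observed pixels. Illustrative preprocessing for baselines that require
    a dense input frame."""
    h, w = values.shape
    ys, xs = np.nonzero(mask)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    return griddata((ys, xs), values[ys, xs], (grid_y, grid_x), method="nearest")

# Toy example: a 32x32 frame with ~3% of pixels observed.
rng = np.random.default_rng(1)
frame = rng.standard_normal((32, 32))
mask = rng.random((32, 32)) < 0.03
dense = interpolate_sparse_frame(frame, mask)
assert dense.shape == (32, 32)
assert np.allclose(dense[mask], frame[mask])  # observed pixels are preserved
```

The gap in the comparison comes from what happens after this step: the interpolated frame discards fine structure that a deterministic operator cannot recover, whereas the generative model conditions on the raw sparse pixels directly.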
First image
Input (First Frame)
PINO
Ours
GT

Wave-Equation Inverse Modeling

Similarly, our unified framework allows for inverse prediction, where we predict the past from future observations. Here, given only the last frame, we accurately recover the preceding frames.
First image
Input (Last Frame)
DeepONet
FNO
PINO
Ours
GT

Navier–Stokes Continuous 1% Observations

In this Navier–Stokes experiment, similar to the teaser videos, 1% of the pixels provide continuous sensor readings, from which VideoPDE reconstructs the full-field solution almost perfectly, noticeably better than state-of-the-art methods for this task.
Input Video (1% pixels)
DiffusionPDE (Ext.)
Shu et al.
Zhuang et al.
Ours
GT

BibTeX

@article{li2025videopde,
    author    = {Edward Li and Zichen Wang and Jiahe Huang and Jeong Joon Park},
    title     = {VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models},
    journal   = {arXiv},
    year      = {2025},
}