Method overview
For streaming image-sequence inputs, PAS3R maintains a persistent scene memory that is iteratively updated via cross-attention with each incoming frame. At the core of our approach is a pose-adaptive state update modulation mechanism that dynamically adjusts the memory's update intensity (an effective learning rate) based on camera motion and image structure extracted via Fourier analysis, striking a balance between adapting to novel viewpoints and preserving previously accumulated geometric detail. At inference time, the model jointly predicts camera poses and dense point clouds at every timestep; during training, we enforce trajectory-consistent optimization through a custom pose loss that combines absolute trajectory error (ATE), relative pose error (RPE), and acceleration constraints. A lightweight online spatiotemporal stabilization module further mitigates trajectory jitter and suppresses geometric artifacts in the reconstruction. Together, these tightly integrated components enable PAS3R to deliver stable, scalable long-horizon streaming 3D reconstruction.
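To make the pose-adaptive update concrete, here is a minimal NumPy sketch of one plausible form of the gating idea: the relative pose magnitude between consecutive frames and the high-frequency spectral energy of the incoming image are combined into a scalar gate that blends the new frame into the scene memory. The function names, the tanh squashing, and the weights `alpha`, `beta`, `gamma` are illustrative hypotheticals, not the paper's actual parameterization.

```python
import numpy as np

def high_freq_energy(image, cutoff_ratio=0.25):
    """Fraction of spectral energy outside a low-frequency disc (2-D FFT)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)          # distance from the DC bin
    low = spec[r <= cutoff_ratio * min(h, w) / 2].sum()
    return 1.0 - low / (spec.sum() + 1e-12)

def pose_delta(T_prev, T_curr):
    """Translation and rotation magnitude of the relative motion
    between two 4x4 camera-to-world pose matrices."""
    rel = np.linalg.inv(T_prev) @ T_curr
    trans = np.linalg.norm(rel[:3, 3])
    cos = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return trans, np.arccos(cos)

def update_gate(T_prev, T_curr, image, alpha=1.0, beta=1.0, gamma=0.5):
    """Map motion and frequency cues to an update weight in [0, 1)."""
    trans, rot = pose_delta(T_prev, T_curr)
    score = alpha * trans + beta * rot + gamma * high_freq_energy(image)
    return float(np.tanh(score))  # larger motion / detail -> stronger update

def update_memory(memory, frame_feat, gate):
    """Convex blend: gate=1 adopts the new frame, gate=0 keeps history."""
    return (1.0 - gate) * memory + gate * frame_feat
```

Under this sketch, a static camera viewing a textureless surface yields a gate near zero (the memory is preserved), while a large viewpoint jump drives the gate toward one, which matches the intended stability-adaptation trade-off.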
Abstract
Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences. Code will be made publicly available.
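The trajectory-consistent objectives mentioned above can be sketched as follows, for predicted and ground-truth camera positions: ATE penalizes absolute position error, RPE penalizes mismatch in consecutive displacements, and the acceleration term regularizes the second finite difference of the trajectory. This is a minimal NumPy illustration on translation components only; the loss weights `w_ate`, `w_rpe`, `w_acc` are hypothetical and the paper's full loss also involves rotations.

```python
import numpy as np

def ate_loss(pred, gt):
    """Absolute trajectory error: mean distance between predicted
    and ground-truth camera positions, shape (T, 3)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def rpe_loss(pred, gt):
    """Relative pose error (translation part): mismatch between
    consecutive predicted and ground-truth displacements."""
    return np.mean(np.linalg.norm(np.diff(pred, axis=0) - np.diff(gt, axis=0), axis=-1))

def accel_loss(pred):
    """Acceleration regularizer: second finite difference of positions,
    penalizing high-frequency trajectory jitter."""
    acc = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    return np.mean(np.linalg.norm(acc, axis=-1))

def trajectory_loss(pred, gt, w_ate=1.0, w_rpe=1.0, w_acc=0.1):
    """Weighted sum of the three trajectory-consistency terms."""
    return (w_ate * ate_loss(pred, gt)
            + w_rpe * rpe_loss(pred, gt)
            + w_acc * accel_loss(pred))
```

Note that a perfectly predicted constant-velocity trajectory incurs zero loss from all three terms, while jitter around the correct path is still penalized by the acceleration term, which is the behavior the stabilization objective targets.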
Evaluation results
Comparison of online methods. From left to right, the panels compare point cloud reconstruction error, camera pose error, and depth error. Compared with other state-of-the-art online 3D reconstruction methods, PAS3R achieves the best performance on all three evaluation tasks.
Long sequences
Scenes: Outdoor, Outdoor corridor, Computer Room, Bathroom Scene.
PAS3R's depth prediction and reconstruction on long sequences (647–1000 frames).
3D reconstruction on short sequences
Side-by-side comparisons across four scenes: Input Image, Ground Truth, and PAS3R reconstruction.
BibTeX
@misc{xu2026pas3r,
title={PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences},
author={Lanbo Xu and Liang Guo and Caigui Jiang and Cheng Wang},
year={2026},
eprint={2603.21436},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.21436},
}