Brown CS Blog

PackUV: Video-Native Representations For Streaming 4D Scenes

    Figure: PackUV uses Gaussian UV fitting to produce a single representation that packs all attributes and is fully compatible with standard video codecs.

    by Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li, Tao Lu, Aayush Prakash, Srinath Sridhar  ·  CVPR 2026  ·  Brown University, UMass Amherst, Meta

    Imagine watching a concert not from a fixed camera angle, but from any angle — walking around performers, zooming in, changing your viewpoint in real time. This is the promise of volumetric video: a way of capturing the world not as a flat 2D movie, but as a full 3D scene that evolves over time. It's the technology behind truly immersive AR/VR experiences, sports broadcasts you can replay from any angle, and digital doubles in film production.

    The catch? Volumetric video is incredibly hard to store and stream. A 30-minute clip can balloon to terabytes of data, and the formats it comes in are completely alien to the infrastructure the internet already runs on — your computer, your streaming service, your video codec. You can have the most photorealistic 4D scene in the world, but if you can't get it to a viewer efficiently, it's stuck in a lab.

    Our work, PackUV, tackles exactly this problem.

    A Quick Primer: What is Gaussian Splatting?

    Recent breakthroughs in 3D scene reconstruction have been dominated by a technique called Gaussian Splatting. Instead of representing a scene as a mesh or a grid of voxels, it describes the scene as millions of tiny, fuzzy ellipsoids — "Gaussians" — each with properties like color, opacity, and shape. Together, they can reconstruct photorealistic scenes from multi-view camera footage with remarkable quality and speed.
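    To make this concrete, here is a minimal Python sketch of the attributes a single Gaussian carries. The field names and sizes are illustrative assumptions, not the exact layout of any particular system:

```python
# A scene as a bag of fuzzy ellipsoids: each Gaussian stores where it is,
# how it is shaped and oriented, what color it is, and how opaque it is.
# This layout is a simplified illustration, not PackUV's internal format.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    position: np.ndarray  # (3,) center of the ellipsoid in world space
    scale: np.ndarray     # (3,) per-axis extent, i.e. the ellipsoid's shape
    rotation: np.ndarray  # (4,) orientation as a unit quaternion
    color: np.ndarray     # (3,) RGB; real systems often store spherical-harmonic coefficients
    opacity: float        # how strongly the splat occludes what is behind it

# A scene is simply a large collection of these primitives.
scene = [
    Gaussian(
        position=np.random.randn(3),
        scale=np.abs(np.random.randn(3)) * 0.01,
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        color=np.random.rand(3),
        opacity=0.9,
    )
    for _ in range(10_000)
]
```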

    Extend this idea to video — scenes that move and change over time — and you get 4D Gaussian Splatting. It works beautifully on short, simple clips. But push it to longer sequences, faster motion, or scenes where objects appear and previously hidden regions come into view (disocclusions), and it starts to fall apart. Temporal inconsistencies creep in, quality drops, and the storage requirements become unmanageable. Worse still, the output format has no relationship to standard video: you can't hand it to FFmpeg, store it on a CDN, or stream it over the internet.

    The PackUV Idea: Think in 2D

    The core insight of PackUV is elegant: what if we could lay out all the information describing a 3D scene onto a flat 2D image?

    In computer graphics, this is a well-known trick called UV mapping — think of unfolding the surface of a globe into a flat map. PackUV takes this idea further, applying it not just to surfaces, but to all the attributes of every Gaussian in the scene — their positions, colors, shapes, and how they change over time.
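    As a rough illustration of that flattening step, the sketch below assigns each Gaussian one pixel in a square atlas and writes its attributes into the image channels. The one-pixel-per-Gaussian layout and the pack_atlas helper are hypothetical simplifications; PackUV's actual atlas is structured and multi-scale, as described next:

```python
# Flatten per-Gaussian attributes (position, color, opacity) into a 2D image.
# Illustrative only: one pixel per Gaussian, attributes stacked as channels.
import numpy as np

def pack_atlas(positions, colors, opacities):
    """positions (N, 3), colors (N, 3), opacities (N,) -> (side, side, 7) image."""
    n = positions.shape[0]
    side = int(np.ceil(np.sqrt(n)))      # smallest square that holds N pixels
    attrs = np.concatenate([positions, colors, opacities[:, None]], axis=1)
    atlas = np.zeros((side * side, 7), dtype=np.float32)
    atlas[:n] = attrs                    # Gaussian i lands at pixel i, row-major
    return atlas.reshape(side, side, 7)

positions = np.random.randn(1000, 3).astype(np.float32)
colors = np.random.rand(1000, 3).astype(np.float32)
opacities = np.random.rand(1000).astype(np.float32)
atlas = pack_atlas(positions, colors, opacities)  # shape (32, 32, 7)
```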

    The result is a UV atlas: a structured, multi-scale image that encodes the entire dynamic 3D scene. And crucially, a sequence of these images over time looks a lot like... a regular video. One that standard video codecs — the same algorithms used by YouTube, Netflix, or Zoom — can compress and stream efficiently, without any specialized infrastructure.
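    To see why that compatibility matters, here is a hedged sketch of handing a sequence of atlas frames to an off-the-shelf H.264 encoder through the imageio package (with its ffmpeg plugin installed). The three-channel grouping and the naive 8-bit quantization are assumptions for illustration; a real pipeline would choose value ranges and channel layouts far more carefully:

```python
# Encode a sequence of atlas images with a standard video codec.
# Assumes each atlas frame has been reduced to (H, W, 3) values in [0, 1].
import numpy as np
import imageio.v2 as imageio

# Stand-in atlas sequence: 30 random frames, one second at 30 fps.
frames = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(30)]

with imageio.get_writer("atlas.mp4", fps=30, codec="libx264") as writer:
    for atlas in frames:
        frame8 = (np.clip(atlas, 0.0, 1.0) * 255).astype(np.uint8)  # float -> 8-bit
        writer.append_data(frame8)
```

    The resulting .mp4 is an ordinary video file: it can sit on a CDN, pass through FFmpeg, and stream over existing infrastructure like any other clip.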

    Making It Work: Handling Motion and Change

    Flattening a dynamic 3D scene into 2D is harder than it sounds. As objects move, appear, or become hidden, the Gaussians describing them shift and change in complex ways. PackUV introduces PackUV-GS, a fitting method designed to keep this representation consistent over time.

    A key component is a flow-guided labeling module that identifies which Gaussians are dynamic (moving objects) versus static (the background), and tracks them accordingly. It also uses video keyframing — borrowing an idea from traditional animation — to anchor the representation at regular intervals, preventing quality from drifting over long sequences. Together, these allow PackUV to handle challenging scenarios like fast-moving people, objects entering and leaving the frame, and scenes that last up to 30 minutes — far beyond what previous methods could manage.
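    As a toy version of the dynamic-versus-static split, the sketch below simply thresholds how far each Gaussian's center moves between two timesteps. Using raw center displacement with a fixed threshold is a simplifying assumption; the actual labeling module is guided by flow:

```python
# Label Gaussians as dynamic or static from how much they move between frames.
# Toy criterion: center displacement above a threshold, not PackUV's flow-guided module.
import numpy as np

def label_dynamic(centers_t0, centers_t1, threshold=0.01):
    """centers_* are (N, 3) Gaussian centers at two timesteps -> (N,) bool mask."""
    displacement = np.linalg.norm(centers_t1 - centers_t0, axis=1)
    return displacement > threshold   # True = dynamic, False = static background

c0 = np.random.randn(1000, 3).astype(np.float32)
moving = np.random.rand(1000, 1) < 0.1                  # pretend 10% of the scene moves
c1 = c0 + np.where(moving, 0.05, 0.0).astype(np.float32)
dynamic_mask = label_dynamic(c0, c1)                    # track these Gaussians over time
```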

    PackUV-2B: The Largest Dataset of Its Kind

    To properly test and compare these methods, we needed data. So we built PackUV-2B: the largest multi-view video dataset ever assembled, captured with 50–90 synchronized cameras providing full 360° coverage. It spans over 100 sequences and a staggering 2 billion frames, including both controlled studio environments and real-world, in-the-wild settings with complex human motion.

    This dataset doesn't just benchmark PackUV — it's a resource the entire research community can use to push the field forward.

    Why It Matters Beyond the Lab

    PackUV bridges a gap that has long frustrated researchers and engineers alike: the gap between state-of-the-art quality and practical deployability. The moment your 4D scene representation speaks the same language as existing video infrastructure, a whole new range of applications becomes feasible:

    • AR/VR and immersive media: stream live volumetric experiences to headsets without dedicated servers

    • Sports and entertainment: capture events with dozens of cameras and replay them from any angle, on demand

    • Robotics and autonomous systems: efficiently store and replay complex dynamic scene data for training and simulation

    • Telepresence: stream photorealistic 3D avatars in real time over standard network connections

    The work is set to appear at CVPR 2026.

    Paper and supplementary materials: ivl.cs.brown.edu/packuv