Layered Dynamic Textures


Traditional motion representations, based on optical flow, are inherently local and have significant difficulties when faced with aperture problems and noise. The classical solution is to regularize the optical flow field, but this introduces undesirable smoothing across motion edges or regions where the motion is, by definition, not smooth (e.g. vegetation in outdoor scenes). It also provides no information about the objects that compose the scene, although the optical flow field can subsequently be used for motion segmentation. More recently, there have been various attempts to model videos as a superposition of layers subject to homogeneous motion. Layered representations showed significant promise: they combine the advantages of regularization (use of global cues to determine local motion) with the flexibility of local representations (little undue smoothing), and they offer a truly object-based representation. This potential has, however, not yet fully materialized. One of the main limitations is the dependence on parametric motion models, such as affine transforms, which assume a piece-wise planar world that rarely holds in practice. In fact, layers are usually formulated as “cardboard” models of the world that are warped by such transformations and then stitched together to form the frames of a video stream. This severely limits the types of videos that can be synthesized: while the concept of layering showed most promise for scenes composed of ensembles of objects subject to homogeneous motion (e.g. leaves blowing in the wind, a flock of birds, a picket fence, or highway traffic), very little progress has so far been demonstrated in actually modeling such scenes.

Recently, there has been more success in modeling complex scenes as dynamic textures or, more precisely, as samples from stochastic processes defined over space and time. This work has demonstrated that global stochastic modeling of both video dynamics and appearance is far more powerful than the classic global modeling of “cardboard” figures under parametric motion. In fact, the dynamic texture (DT) has shown a surprising ability to abstract a wide variety of complex patterns of motion and appearance into a simple spatio-temporal model. One major limitation, however, is its inability to decompose a visual process consisting of multiple co-occurring dynamic textures (for example, a flock of birds flying in front of a water fountain, or highway traffic moving at different speeds) into separate regions of distinct but homogeneous dynamics. In such cases, the global nature of the existing DT model makes it inherently ill-equipped to segment the video into its constituent regions.
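
For reference, below is a minimal sketch of the standard dynamic texture in its usual linear dynamical system form; the symbols (x_t, y_t, A, C, Q, R) are the conventional choices for this model, not notation taken from this page:

```latex
% Standard dynamic texture as a linear dynamical system (conventional notation):
% a low-dimensional hidden state x_t drives the dynamics, and the observed
% (vectorized) video frame y_t is a noisy linear function of that state.
\begin{align*}
  x_{t+1} &= A\,x_t + v_t, \qquad v_t \sim \mathcal{N}(0, Q)
    && \text{(hidden state: dynamics)} \\
  y_t &= C\,x_t + w_t, \qquad w_t \sim \mathcal{N}(0, R)
    && \text{(observed frame: appearance)}
\end{align*}
```

A single pair (A, C) governs the entire frame, which is why the basic DT cannot assign different dynamics to different image regions.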

In this work, we address this limitation by introducing a new generative model for videos, which we denote by the layered dynamic texture (LDT). This consists of augmenting the dynamic texture with a discrete hidden variable, that enables the assignment of different dynamics to different regions of the video. The hidden variable is modeled as a Markov random field (MRF) to ensure spatial smoothness of the regions, and conditioned on the state of this hidden variable, each region of the video is a standard DT. By introducing a shared dynamic representation for all pixels in a region, the new model is a layered representation. When compared with traditional layered models, it replaces layer formation by “warping cardboard figures” with sampling from the generative model (for both dynamics and appearance) provided by the DT. This enables a much richer video representation. Since each layer is a DT, the model can also be seen as a multi-state dynamic texture, which is capable of assigning different dynamics and appearance to different image regions. We apply the LDT to motion segmentation of challenging video sequences.
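
As a rough sketch of the generative process just described (the notation below is illustrative, not taken verbatim from the paper): a layer-assignment field Z = {z_i} is drawn from an MRF over the pixel lattice, each layer j evolves its own hidden state, and each pixel is generated by the DT of the layer it is assigned to:

```latex
% Illustrative sketch of the layered dynamic texture generative process.
% Z = {z_i} : layer assignment of pixel i (MRF over the pixel lattice);
% x_t^{(j)} : hidden state of layer j at time t;  y_t^{(i)} : pixel i at time t.
\begin{align*}
  p(Z) &\propto \exp\Big(-\sum_{(i,i')\in\mathcal{E}} V(z_i, z_{i'})\Big)
    && \text{(MRF prior enforcing spatially smooth layers)} \\
  x_{t+1}^{(j)} &= A_j\,x_t^{(j)} + v_t^{(j)}, \qquad v_t^{(j)} \sim \mathcal{N}(0, Q_j)
    && \text{(per-layer hidden state dynamics)} \\
  y_t^{(i)} &= C_i^{(z_i)}\,x_t^{(z_i)} + w_t^{(i)}, \qquad w_t^{(i)} \sim \mathcal{N}(0, r_{z_i})
    && \text{(pixel $i$ generated by the DT of its layer)}
\end{align*}
```

Conditioned on Z, each region is a standard DT; segmentation then amounts to inferring the most probable layer assignments given the observed video.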

Selected Publications

Demos/Results

Datasets/Code