Our main research interests include:

  • computer vision, surveillance:
    dynamic textures, motion segmentation, motion analysis, semantic image annotation, image retrieval, crowd counting.
  • machine learning, pattern recognition:
    probabilistic graphical models, support vector machines, Bayesian regression, Gaussian processes.
  • computer audition, music information retrieval:
    semantic music annotation and retrieval, music segmentation.
  • eye gaze analysis:
    modeling eye movements with hidden Markov models (HMMs), clustering HMMs.

In particular, we aim to develop generative probabilistic models of images, video, and sound that can be applied to computer vision and computer audition problems, such as traffic surveillance, crowd monitoring, semantic image annotation, and music segmentation. Our current research projects are listed below.

Dynamic Texture Models

A family of generative stochastic dynamic texture models for analyzing motion in video, and time-series in general (project overview).
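As a minimal sketch of the basic model, a dynamic texture is commonly formulated as a linear dynamical system: a hidden state evolves as x_{t+1} = A x_t + v_t, and each frame is observed as y_t = C x_t + w_t. The code below samples a short sequence from such a system. The parameter values, the isotropic Gaussian noise, and the function names are illustrative assumptions, not the settings used in our projects.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sample_dynamic_texture(A, C, x0, n_frames, state_std, obs_std, rng):
    """Sample frames y_t = C x_t + w_t with hidden state x_{t+1} = A x_t + v_t.

    Noise is isotropic Gaussian here, which is a simplifying assumption.
    """
    x = list(x0)
    frames = []
    for _ in range(n_frames):
        # Observe the current state through C, plus observation noise.
        frames.append([y + rng.gauss(0.0, obs_std) for y in matvec(C, x)])
        # Advance the hidden state through A, plus state noise.
        x = [s + rng.gauss(0.0, state_std) for s in matvec(A, x)]
    return frames


# Hypothetical parameters: a slowly decaying rotation drives three "pixels".
theta = 0.3
A = [[0.99 * math.cos(theta), -0.99 * math.sin(theta)],
     [0.99 * math.sin(theta), 0.99 * math.cos(theta)]]
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

frames = sample_dynamic_texture(A, C, [1.0, 0.0], n_frames=50,
                                state_std=0.05, obs_std=0.01,
                                rng=random.Random(0))
```

In a real video, each frame would have thousands of pixels and the parameters A and C would be learned from data rather than chosen by hand.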

Bag of Systems Trees

We propose the BoSTree, which uses a tree structure to efficiently map videos to the bag-of-systems (BoS) codebook, making larger, richer codebooks practical to use.

Clustering Dynamic Textures

We propose a hierarchical EM algorithm capable of clustering dynamic texture models and learning novel cluster centers that are representative of the cluster members. DT clustering can be applied to semantic motion annotation and bag-of-systems codebook generation.

Background Subtraction in Dynamic Scenes

The background model is based on a generalization of the Stauffer-Grimson background model, where each mixture component is a dynamic texture. We derive an on-line algorithm for updating the parameters using a set of sufficient statistics of the model.

Layered Dynamic Textures

One disadvantage of the dynamic texture is its inability to account for multiple co-occurring textures in a single video. We extend the dynamic texture to a multi-state (layered) dynamic texture that can learn regions containing different dynamic textures.

  • Antoni B. Chan and Nuno Vasconcelos, "Layered dynamic textures." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 31(10):1862-1879, Oct 2009.

Kernel Dynamic Textures

We introduce a kernelized dynamic texture, which has a non-linear observation function learned with kernel PCA. The new texture model can account for more complex patterns of motion, such as chaotic motion (e.g. boiling water and fire) and camera motion (e.g. panning and zooming), better than the original dynamic texture.

Mixtures of Dynamic Textures

We introduce the mixture of dynamic textures, which models a collection of video as samples from a set of dynamic textures. We use the model for video clustering and motion segmentation.

Generic Single Object Tracking

Localizing generic single objects in videos given the bounding box marked in the first frame.

Recurrent Filter Learning for Visual Tracking

We propose a recurrent filter generation method for visual tracking, which directly feeds the target’s image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking.

Learning Dynamic Memory Networks for Object Tracking

We propose a dynamic memory network that adapts the template to the target’s appearance variations during tracking, where an LSTM controls the reading and writing of the memory block.

Understanding Video of Crowded Environments

Motion segmentation and motion classification in video of crowded environments, such as pedestrian scenes and highway traffic.

Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks – Counting, Detection, and Tracking

For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects are severely occluded. Taking counting as an example, almost all recent state-of-the-art methods bypass explicit detection and use regression to count the objects of interest directly. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). With the power of deep convolutional neural networks (CNNs), counting performance has improved steadily. The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking.

Most existing CNN methods produce density maps with a lower resolution than the original images, due to the downsampling strides in the convolution/pooling operations. To produce an original-resolution density map, we also evaluate a classical CNN that uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional (FCNN) adaptation, with skip connections from lower convolutional layers to compensate for the loss of spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared to bilinearly upsampling the lower-resolution density maps. Finally, we also propose several metrics for measuring the quality of a density map, and relate them to the experimental results on counting and localization.
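To illustrate the property that the count in a subregion is the integral of the density map over that subregion, here is a small sketch on a discrete map, where the integral becomes a sum. The toy map and the function name `region_count` are assumptions made for illustration only.

```python
def region_count(density, top, left, bottom, right):
    """Sum a density map over rows [top, bottom) and columns [left, right)."""
    return sum(sum(row[left:right]) for row in density[top:bottom])


# Toy 4x4 density map: two people, each spread over a 2x2 block that sums to 1.
density = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]

total = region_count(density, 0, 0, 4, 4)      # count over the whole image
top_left = region_count(density, 0, 0, 2, 2)   # count in the top-left quadrant
top_right = region_count(density, 0, 2, 2, 4)  # count in the top-right quadrant
```

Because the map keeps the location of the mass, the same array answers both "how many people in total" and "how many in this region", which is what makes it usable for localization as well as counting.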

Incorporating Side Information by Adaptive Convolution


Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated using a learned “filter manifold” sub-network, whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information, and extract discriminative features related to the current context (e.g. camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of ACNN incorporating side information on 3 tasks: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that ACNN improves the performance compared to a plain CNN with a similar number of parameters. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information.
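The idea of generating filter weights from side information can be sketched as follows, under strong simplifications: a tiny two-layer network stands in for the filter manifold sub-network, the "image" is a 1-D signal, and all parameter values are hand-picked rather than learned. The names `filter_manifold` and `correlate_valid` are hypothetical.

```python
import math


def filter_manifold(side_info, w1, b1, w2, b2):
    """Map side information to filter weights with a tiny two-layer network."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, side_info)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def correlate_valid(signal, kernel):
    """1-D 'valid' cross-correlation, as applied by a convolution layer."""
    k = len(kernel)
    return [sum(kernel[i] * signal[t + i] for i in range(k))
            for t in range(len(signal) - k + 1)]


# Hypothetical, hand-picked sub-network parameters (these would be learned).
W1 = [[1.0, -0.5], [0.3, 0.8]]               # 2 side-info inputs -> 2 hidden units
B1 = [0.0, 0.1]
W2 = [[0.5, 0.2], [-0.4, 0.9], [0.1, -0.3]]  # 2 hidden units -> 3 filter taps
B2 = [0.0, 0.0, 0.0]

signal = [0.0, 1.0, 2.0, 3.0, 4.0]

# Two different scene contexts (e.g. normalized camera angle and height).
filter_a = filter_manifold([0.2, 0.1], W1, B1, W2, B2)
filter_b = filter_manifold([0.9, 0.7], W1, B1, W2, B2)

# The same input is filtered differently depending on the side information.
features_a = correlate_valid(signal, filter_a)
features_b = correlate_valid(signal, filter_b)
```

The point of the sketch is only that the filter is a function of the side information: a two-dimensional input selects a point on a low-dimensional surface inside the space of filter weights.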

Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid

Because of the powerful learning capability of deep neural networks, counting performance via density map estimation has improved significantly during the past several years. However, counting is still very challenging due to severe occlusion, large scale variations, and perspective distortion. Scale variations (from image to image) coupled with perspective distortion (within one image) result in huge changes in object size. Earlier methods based on convolutional neural networks (CNNs) typically did not handle this scale variation explicitly, until Hydra-CNN and MCNN. MCNN uses three columns, each with different filter sizes, to extract features at different scales. In this paper, in contrast to using filters of different sizes, we utilize an image pyramid to deal with scale variations. First, resizing the input fed into the network is more effective and efficient than using larger filter sizes. Second, we adaptively fuse the predictions from different scales (using adaptively changing per-pixel weights), which lets our method adapt to scale changes within an image. The adaptive fusing is achieved by generating an across-scale attention map, which softly selects a suitable scale for each pixel, followed by a 1×1 convolution. Extensive experiments on three popular datasets show very compelling results.
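The per-pixel soft selection across scales can be sketched as a softmax over scales at each pixel, followed by a weighted sum of the per-scale predictions. This sketch assumes the predictions have already been resized to a common resolution, uses hand-written attention logits instead of a learned attention network, and omits the final 1×1 convolution.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(predictions, attention_logits):
    """Fuse per-scale density maps with per-pixel attention weights.

    Both arguments are lists over scales of H x W nested lists.
    """
    n_scales = len(predictions)
    height, width = len(predictions[0]), len(predictions[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            weights = softmax([attention_logits[s][i][j]
                               for s in range(n_scales)])
            fused[i][j] = sum(weights[s] * predictions[s][i][j]
                              for s in range(n_scales))
    return fused


# Two scales, one row of two pixels (toy values).
predictions = [[[2.0, 2.0]], [[4.0, 4.0]]]
# Pixel 0: no preference between scales. Pixel 1: strongly prefers scale 0.
attention_logits = [[[0.0, 10.0]], [[0.0, -10.0]]]

fused = fuse(predictions, attention_logits)
```

With equal logits the fused value is the average of the two scales, and with a strong preference it is close to the prediction of the preferred scale, so the selection is soft rather than a hard switch.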

Pyramid counting demos/results (BMVC paper)

Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes

We propose a framework for tracking people in crowds that fuses a generic visual object tracker with an estimated crowd density map using a convolutional neural network (CNN). We also design a Sparse Kernelized Correlation Filter (S-KCF) to suppress spurious responses and target response variations caused by occlusions and illumination changes.

Small Instance Detection using Object Density Maps

We propose a novel object detection framework using object density maps for partially occluded small instances, such as pedestrians in low-resolution surveillance video.

Counting Pedestrians Crossing a Line

We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest in a video sequence.

Pedestrian Crowd Counting

We estimate the size of moving crowds in a privacy-preserving manner, i.e., without people models or tracking. The system first segments the crowd by its motion, extracts low-level features from each segment, and estimates the crowd count in each segment using a Gaussian process.

Classification and Retrieval of Traffic Video

We classify traffic congestion in video by representing the video as a dynamic texture, and classifying it using an SVM with a probabilistic kernel (the KL kernel). The resulting classifier is robust to noise and lighting changes.

Human Pose Recognition and Tracking

Recognizing and tracking 2D and 3D human pose in images and videos.

Martial Arts, Dancing and Sports Dataset

We collect a multi-view and stereo-depth dataset for 3D human pose estimation, which consists of challenging martial arts actions (Tai-chi and Karate), dancing actions (hip-hop and jazz), and sports actions (basketball, volleyball, football, rugby, tennis and badminton).

Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

We propose a maximum-margin structured learning framework with a deep neural network that learns the image-pose score function for human pose estimation.

Pose Estimation with Deep Convolutional Neural Network

We propose a heterogeneous multi-task learning framework for 2D human pose estimation from monocular images using a deep convolutional neural network that combines pose regression and part detection. We also extend the model to 3D human pose estimation.

A Robust Likelihood Function for 3D Human Pose Tracking

We propose a robust likelihood function for 3D human pose tracking, which is robust to small pose changes and better able to localize partially occluded and overlapping parts.

Data-Driven Computer Graphics

Directing User Attention via Visual Flow on Web Designs

We present an approach that allows web designers to easily direct user attention via visual flow on web designs.

DynamicManga: Animating Still Manga via Camera Movement

We propose a method for animating still manga imagery through camera movements, driven by motion and emotion semantics automatically extracted from the manga.

Attention-Directing Composition of Manga Elements

We propose an approach for novices to synthesize a composition of panel elements that can effectively guide the reader’s attention to convey the story.

Automatic Stylistic Manga Layout

We propose an approach to automatically produce a manga layout from a set of input artworks, which is based on a generative layout model and parametric style models.

Eye-Gaze Analysis

Eye Movement Analysis with HMMs (EMHMM)

We use hidden Markov models (HMMs) to analyze eye movement data. A person’s eye fixation sequence is summarized with an HMM, and common strategies among people are discovered by clustering HMMs.
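As a simplified sketch of how an HMM scores a fixation sequence, the code below runs the forward algorithm on fixations that have been discretized into three regions of interest. The discrete emissions, the region labels, and all probability values are assumptions made for illustration.

```python
def sequence_likelihood(seq, prior, trans, emit):
    """Forward algorithm: probability of an observation sequence under an HMM.

    prior[s]    : probability of starting in hidden state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of observing symbol o in state s
    """
    n_states = len(prior)
    alpha = [prior[s] * emit[s][seq[0]] for s in range(n_states)]
    for obs in seq[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n_states))
                 * emit[s][obs]
                 for s in range(n_states)]
    return sum(alpha)


# Hypothetical two-state HMM over three regions of interest (0, 1, 2).
prior = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.8, 0.1, 0.1],   # state 0 mostly fixates region 0
        [0.1, 0.2, 0.7]]   # state 1 mostly fixates region 2

typical = sequence_likelihood([0, 0, 2], prior, trans, emit)
atypical = sequence_likelihood([1, 1, 1], prior, trans, emit)
```

A sequence that matches the model's preferred regions receives a higher likelihood than one that does not, which is the quantity that makes it possible to compare a person's eye movements against different HMMs.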

Music Analysis

Music Annotation with Time-Series Models

We propose an approach to automatic music annotation and retrieval that is based on the dynamic texture mixture, a generative time series model of musical content. The new annotation model better captures temporal (e.g., rhythmical) aspects as well as timbral content.

Segmenting Musical Structure

We model a time-series of audio feature vectors, extracted from a short audio fragment, as a dynamic texture. The musical structure of a song (e.g. chorus, verse, and bridge) is discovered by segmenting the song using the mixture of dynamic textures. The song segmentations are used for song retrieval, song annotation, and database visualization.

  • Luke Barrington, Antoni B. Chan, and Gert R.G. Lanckriet, "Modeling music as a dynamic texture." IEEE Trans. on Audio, Speech and Language Processing (TASLP), 18(3):602-612, Mar 2010.

Image Analysis

Semantic Image Annotation

We annotate images using supervised multi-class labeling (SML), which treats semantic annotation as a multi-class classification problem. The system is scalable, and was applied to image databases with 60,000 images.