Research (by year)

Our main research interests include:

  • computer vision, surveillance:
    dynamic textures, motion segmentation, motion analysis, semantic image annotation, image retrieval, crowd counting.
  • machine learning, pattern recognition:
    probabilistic graphical models, support vector machines, Bayesian regression, Gaussian processes.
  • computer audition, music information retrieval:
    semantic music annotation and retrieval, music segmentation.
  • eye gaze analysis:
    modeling eye movements with hidden Markov models (HMMs), clustering HMMs.

In particular, we aim to develop generative probabilistic models of images, video, and sound that can be applied to computer vision and computer audition problems, such as traffic surveillance, crowd monitoring, semantic image annotation, and music segmentation. Our current research projects are listed below.


Simplification of Gaussian Mixture Models

We propose an algorithm that simplifies a Gaussian mixture model into a reduced mixture model with fewer components by maximizing a variational lower bound on the expected log-likelihood of a set of virtual samples.

Convolutional Decoders for Image Captioning

RNN-based models dominate the field of image captioning, but they have three drawbacks. (1) RNNs must be computed step by step, which makes them hard to parallelize. (2) RNNs create a long path between the start and the end of a sentence; tree structures can shorten the path, but trees require special processing. (3) RNNs learn only a single-level representation at each time step. In contrast, convolutional decoders can learn multi-level representations of concepts, each of which should correspond to an image area, which should benefit word prediction.

Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks – Counting, Detection, and Tracking

We propose CNN-pixel and FCNN-skip to produce density maps at the original image resolution. In our experiments, we found that lower-resolution density maps sometimes have better counting performance. In contrast, original-resolution density maps improved localization tasks, such as detection and tracking, compared to bilinearly upsampled lower-resolution density maps.
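
As a rough illustration of why resolution matters little for counting but a lot for localization, the sketch below builds a density map by hand and then sum-pools it to a lower resolution. It is not CNN-pixel or FCNN-skip; the function names, the Gaussian kernel, and the pooling factor are all assumptions made for this example.

```python
import math

def density_map(points, height, width, sigma=1.5, radius=4):
    """Build a density map with one unit of mass per annotated (row, col) point."""
    dmap = [[0.0] * width for _ in range(height)]
    for (py, px) in points:
        # Gaussian weights over the in-bounds window around the point.
        weights = {}
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                d2 = (y - py) ** 2 + (x - px) ** 2
                weights[(y, x)] = math.exp(-d2 / (2.0 * sigma ** 2))
        total = sum(weights.values())
        # Normalize so each person contributes exactly 1 to the sum,
        # even when the window is clipped at the image border.
        for (y, x), w in weights.items():
            dmap[y][x] += w / total
    return dmap

def count(dmap):
    """The estimated count is the sum of the density map."""
    return sum(sum(row) for row in dmap)

def sum_pool(dmap, factor):
    """Downsample by summing factor x factor blocks, which preserves the count."""
    h, w = len(dmap), len(dmap[0])
    return [[sum(dmap[y][x]
                 for y in range(by, min(by + factor, h))
                 for x in range(bx, min(bx + factor, w)))
             for bx in range(0, w, factor)]
            for by in range(0, h, factor)]

# Three hypothetical annotations in a 16 x 16 image.
full_res = density_map([(2, 3), (10, 10), (0, 0)], 16, 16)
low_res = sum_pool(full_res, 4)
```

Both maps sum to the same count, but the 4 x 4 map can only place each person to within a 4-pixel block, which is one way to see why localization tasks favor the original resolution.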

Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid

We use an image pyramid to handle scale variations. In addition, we adaptively fuse the predictions from different scales using adaptively changing per-pixel weights, which lets our method adapt to scale changes within an image.
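
The per-pixel fusion step can be sketched as a weighted sum over scales. The version below assumes the per-scale predictions have already been resized to a common resolution and that the weights come from a softmax over per-pixel scores; both are assumptions for illustration, and `fuse` is a hypothetical name rather than the project's code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(predictions, scores):
    """Fuse per-scale density maps with per-pixel weights.

    predictions, scores: lists (one entry per scale) of H x W maps,
    all at the same resolution.
    """
    h, w = len(predictions[0]), len(predictions[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weights = softmax([s[y][x] for s in scores])
            fused[y][x] = sum(wt * p[y][x] for wt, p in zip(weights, predictions))
    return fused

# Two scales, one row of two pixels each (hypothetical values).
preds = [[[1.0, 1.0]], [[3.0, 3.0]]]
scores = [[[0.0, 100.0]], [[0.0, -100.0]]]
fused = fuse(preds, scores)
```

At the first pixel the two scales are weighted equally, while at the second pixel the first scale dominates, so different image regions can rely on different pyramid levels.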

Learning Dynamic Memory Networks for Object Tracking

We propose a dynamic memory network that adapts the template to the target’s appearance variations during tracking, where an LSTM controls the reading and writing of the memory block.

Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes

We propose a framework for tracking people in crowds that fuses a generic visual object tracker with an estimated crowd density map using a convolutional neural network (CNN). We also design a Sparse Kernelized Correlation Filter (S-KCF) to suppress spurious responses and target response variations caused by occlusions and illumination changes.


Incorporating Side Information by Adaptive Convolution

To incorporate available side information, we propose an adaptive convolutional neural network (ACNN), in which the convolution filter weights adapt to the current scene context via the side information.
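
The idea of filter weights that depend on side information can be sketched in one dimension. The example below assumes a scalar side-information value and a linear map from that value to the filter taps; the actual ACNN may generate its filters differently, and `adaptive_filter`, `basis`, and `bias` are hypothetical names.

```python
def adaptive_filter(side, basis, bias):
    """Filter taps as a linear function of a scalar side-information value."""
    return [b + side * w for w, b in zip(basis, bias)]

def conv1d(signal, kernel):
    """Valid-mode sliding dot product (cross-correlation, as used in CNNs)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0]
bias = [1.0, 0.0]    # filter used when the side information is 0
basis = [0.0, 1.0]   # direction in which the filter changes

out_a = conv1d(signal, adaptive_filter(0.0, basis, bias))  # kernel [1, 0]
out_b = conv1d(signal, adaptive_filter(1.0, basis, bias))  # kernel [1, 1]
```

The same input produces different responses under different side information, because the side information changes the filter itself rather than being appended to the input.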

Recurrent Filter Learning for Visual Tracking

We propose a recurrent filter generation method for visual tracking, which directly feeds the target’s image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking.

DynamicManga: Animating Still Manga via Camera Movement

We propose a method for animating still manga imagery through camera movements, driven by motion and emotion semantics automatically extracted from the manga.

Martial Arts, Dancing and Sports Dataset

We collect a multi-view and stereo-depth dataset for 3D human pose estimation, which consists of challenging martial arts actions (Tai-chi and Karate), dancing actions (hip-hop and jazz), and sports actions (basketball, volleyball, football, rugby, tennis and badminton).


Directing User Attention via Visual Flow on Web Designs

We present an approach that allows web designers to easily direct user attention via visual flow on web designs.


Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

We propose a maximum-margin structured learning framework with a deep neural network that learns the image-pose score function for human pose estimation.

Small Instance Detection using Object Density Maps

We propose a novel object detection framework using object density maps for partially occluded small instances, such as pedestrians in low-resolution surveillance video.

Bag of Systems Trees

We propose the BoSTree, which uses a tree structure to efficiently map videos to the bag-of-systems (BoS) codebook, making larger, richer codebooks practical.


A Robust Likelihood Function for 3D Human Pose Tracking

We propose a likelihood function for 3D human pose tracking that is robust to small pose changes and better able to localize partially occluded and overlapping parts.

Attention-Directing Composition of Manga Elements

We propose an approach for novices to synthesize a composition of panel elements that can effectively guide the reader’s attention to convey the story.

Pose Estimation with Deep Convolutional Neural Network

We propose a heterogeneous multi-task learning framework for 2D human pose estimation from monocular images using a deep convolutional neural network that combines pose regression and part detection. We also extend the model to 3D human pose estimation.

Eye Movement analysis with HMMs (EMHMM)

We use hidden Markov models (HMMs) to analyze eye movement data. A person’s eye fixation sequence is summarized with an HMM, and common strategies among people are discovered by clustering HMMs.
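
To make the HMM summary concrete, the sketch below scores a fixation sequence under a small discrete HMM using the forward algorithm. It assumes that fixations have been quantized into a few discrete regions of interest, which is a simplification made for this example; the parameter values and region labels are hypothetical.

```python
def sequence_likelihood(obs, prior, trans, emit):
    """Forward algorithm: probability of an observation sequence under a discrete HMM.

    prior[i]    : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i produces observation o
    """
    n = len(prior)
    alpha = [prior[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical 2-state model over two face regions (0 = eyes, 1 = mouth).
prior = [0.8, 0.2]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

p_eyes_first = sequence_likelihood([0, 0, 1], prior, trans, emit)
p_mouth_first = sequence_likelihood([1, 1, 0], prior, trans, emit)
```

A model fit to one person assigns higher likelihood to sequences that follow that person's viewing strategy, which is the quantity that comparison and clustering of HMMs build on. For long sequences a scaled or log-domain forward pass would be needed to avoid underflow.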

Clustering hidden Markov Models (HMMs)

We propose a variational hierarchical EM algorithm for clustering hidden Markov models (HMMs), producing groups of similar HMMs and their representative HMM cluster centers.


Clustering Dynamic Textures

We propose a hierarchical EM algorithm capable of clustering dynamic texture (DT) models and learning novel cluster centers that are representative of the cluster members. DT clustering can be applied to semantic motion annotation and bag-of-systems codebook generation.

Counting Pedestrians Crossing a Line

We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest in a video sequence.


Automatic Stylistic Manga Layout

We propose an approach to automatically produce a manga layout from a set of input artworks, which is based on a generative layout model and parametric style models.

Pedestrian Crowd Counting

We estimate the size of moving crowds in a privacy-preserving manner, i.e., without people models or tracking. The system first segments the crowd by its motion, extracts low-level features from each segment, and estimates the crowd count in each segment using a Gaussian process.
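
The final regression step can be sketched with a small Gaussian process. The example below is a minimal sketch under strong assumptions: a single scalar feature per segment (its area), a squared-exponential kernel, a constant mean, and made-up training data. The real system uses richer low-level features, and all names here are hypothetical.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar features."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_fit(x_train, y_train, noise=1e-2, length=1.0):
    """Return the constant mean and the weights alpha = (K + noise*I)^-1 (y - mean)."""
    mean = sum(y_train) / len(y_train)
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j], length) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, [y - mean for y in y_train])
    return mean, alpha

def gp_predict(x_new, x_train, mean, alpha, length=1.0):
    """Posterior mean of the GP at a new feature value."""
    return mean + sum(a * rbf(x_new, x, length) for a, x in zip(alpha, x_train))

# Hypothetical training data: segment area (thousands of pixels) -> people count.
areas = [1.0, 2.0, 3.0, 4.0]
counts = [5.0, 10.0, 15.0, 20.0]
mean, alpha = gp_fit(areas, counts)
estimate = gp_predict(2.5, areas, mean, alpha)
```

Because the count is regressed directly from segment-level features, no individual person is ever detected or tracked, which is what makes this style of counting privacy-preserving.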


Music Annotation with Time-Series Models

We propose an approach to automatic music annotation and retrieval that is based on the dynamic texture mixture, a generative time series model of musical content. The new annotation model better captures temporal (e.g., rhythmical) aspects as well as timbral content.

Background Subtraction in Dynamic Scenes

The background model is based on a generalization of the Stauffer-Grimson background model, where each mixture component is a dynamic texture. We derive an on-line algorithm for updating the parameters using a set of sufficient statistics of the model.


Segmenting Musical Structure

We model a time series of audio feature vectors, extracted from a short audio fragment, as a dynamic texture. The musical structure of a song (e.g., chorus, verse, and bridge) is discovered by segmenting the song using the mixture of dynamic textures. The song segmentations are used for song retrieval, song annotation, and database visualization.

  • Luke Barrington, Antoni B. Chan, and Gert R.G. Lanckriet, "Modeling music as a dynamic texture." IEEE Trans. on Audio, Speech and Language Processing (TASLP), 18(3):602-612, Mar 2010.


Layered Dynamic Textures

One disadvantage of the dynamic texture is its inability to account for multiple co-occurring textures in a single video. We extend the dynamic texture to a multi-state (layered) dynamic texture that can learn regions containing different dynamic textures.

  • Antoni B. Chan and Nuno Vasconcelos, "Layered dynamic textures." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 31(10):1862-1879, Oct 2009.


Mixtures of Dynamic Textures

We introduce the mixture of dynamic textures, which models a collection of videos as samples from a set of dynamic textures. We use the model for video clustering and motion segmentation.
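
For reference, the model can be written compactly as follows. The notation is an assumption chosen for this summary: each dynamic texture is taken to be a linear dynamical system with hidden state x_t and observed frame y_t, and the mixture places a prior \pi_j on each of K components.

```latex
\begin{align}
  x_{t+1} &= A_j x_t + v_t, & v_t &\sim \mathcal{N}(0, Q_j) \\
  y_t     &= C_j x_t + w_t, & w_t &\sim \mathcal{N}(0, R_j) \\
  p(y_{1:\tau}) &= \sum_{j=1}^{K} \pi_j \, p(y_{1:\tau} \mid z = j),
  & \sum_{j=1}^{K} \pi_j &= 1
\end{align}
```

Here z is the hidden component assignment of a video sequence y_{1:\tau}. Inferring z for whole videos gives a clustering, and inferring it for local patches gives a motion segmentation.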


Semantic Image Annotation

We annotate images using supervised multi-class labeling (SML), which treats semantic annotation as a multi-class classification problem. The system is scalable and has been applied to image databases with 60,000 images.

Kernel Dynamic Textures

We introduce a kernelized dynamic texture, which has a non-linear observation function learned with kernel PCA. The new texture model can account for more complex patterns of motion, such as chaotic motion (e.g. boiling water and fire) and camera motion (e.g. panning and zooming), better than the original dynamic texture.


Classification and Retrieval of Traffic Video

We classify traffic congestion in video by representing the video as a dynamic texture, and classifying it using an SVM with a probabilistic kernel (the KL kernel). The resulting classifier is robust to noise and lighting changes.
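
A probabilistic kernel of this kind can be sketched as the exponential of a negative symmetrized KL divergence. The example below uses one-dimensional Gaussians as a stand-in for dynamic textures, because the KL divergence between Gaussians has a simple closed form; the scale parameter `a` and the function names are assumptions for illustration.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians with variances v1 and v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_kernel(p, q, a=1.0):
    """exp(-a * symmetrized KL); p and q are (mean, variance) pairs."""
    sym = kl_gauss(*p, *q) + kl_gauss(*q, *p)
    return math.exp(-a * sym)

same = kl_kernel((0.0, 1.0), (0.0, 1.0))
near = kl_kernel((0.0, 1.0), (1.0, 1.0))
far = kl_kernel((0.0, 1.0), (3.0, 1.0))
```

The kernel equals 1 for identical models and decays as the models diverge, so a kernel matrix computed between the models fit to each video can be passed to a standard SVM in place of a kernel on raw feature vectors.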