Research (by year)

Contents

2025
2024
2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2005

Our main research interests include:

computer vision, surveillance:
crowd analysis, crowd counting, crowd tracking, visual object tracking, multi-view vision, dynamic textures, motion segmentation, motion analysis, image captioning and annotation, image retrieval.
machine learning, pattern recognition:
probabilistic graphical models, deep learning, Bayesian models, Gaussian processes, active learning.
explainable AI (XAI):
gradient-based attribution methods, user trust.
eye gaze analysis:
modeling eye movements with hidden Markov models (HMMs), clustering HMMs, co-clustering, DNN+HMM.
computer audition, music information retrieval:
semantic music annotation and retrieval, music segmentation.
data-driven computer graphics:
data-driven graphic design, machine learning for graphics.

In particular, we aim to develop machine learning models, such as generative probabilistic models and deep learning models, of images, video, and sound that can be applied to computer vision and computer audition problems, such as crowd monitoring, image understanding, and music understanding. Our current research projects are listed below.

2025

Continual Learning MIL

We pinpoint catastrophic forgetting to the attention layers of attention-MIL models for whole-slide images and introduce two remedies: Attention Knowledge Distillation (AKD) to retain attention weights across tasks and a Pseudo-Bag Memory Pool (PMP) that keeps only the most informative patches. Combined, AKD and PMP achieve state-of-the-art continual-learning accuracy while sharply cutting memory usage on diverse WSI datasets.

Xianrui Li, Yufei Cui, Jun Li, and Antoni B. Chan, "Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).
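
As a rough illustration of distilling attention weights across tasks, the sketch below penalizes the divergence between the attention distribution of a previous-task model and that of the current model over the patches of one bag. The choice of KL divergence, the function names, and the softmax normalization are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    """Turn raw attention scores for one bag into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_distillation_loss(old_scores, new_scores, eps=1e-12):
    """KL(old || new) between attention distributions over a bag's patches.

    old_scores: attention scores from the (frozen) previous-task model.
    new_scores: attention scores from the model being trained.
    Both are hypothetical inputs; eps guards against log(0).
    """
    old_w = softmax(old_scores)
    new_w = softmax(new_scores)
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(old_w, new_w))
```

A loss of this kind is zero when the new model attends to the same patches as the old one and grows as the attention drifts, which is one way a regularizer can discourage forgetting in the attention layers.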

Image Editing with Diffusion Model from Frequency Perspective

We introduce a novel fine-tuning-free approach (FreeDiff) that employs progressive frequency truncation to refine the guidance of diffusion models for universal editing tasks.

Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, and Antoni B. Chan, "FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models." In: European Conference on Computer Vision (ECCV), Milano, Oct 2024. [supplemental | github]
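
To make the idea of frequency truncation concrete, here is a minimal sketch that low-pass filters a 1D signal with a discrete Fourier transform and raises the cutoff over successive steps. The 1D signal standing in for the guidance, the linear schedule, and the low-to-high direction of the schedule are all assumptions for illustration; the paper's actual truncation rule may differ.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def truncate_frequencies(signal, cutoff):
    """Zero every component whose frequency index exceeds `cutoff`."""
    n = len(signal)
    spectrum = dft(signal)
    # min(k, n - k) maps each bin to its symmetric frequency index.
    kept = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    return inverse_dft(kept)

def progressive_cutoffs(num_steps, max_cutoff):
    """Hypothetical linear schedule: the cutoff grows from 0 to max_cutoff."""
    if num_steps == 1:
        return [max_cutoff]
    return [round(step * max_cutoff / (num_steps - 1)) for step in range(num_steps)]
```

With a cutoff of 0 only the mean of the signal survives, and with a cutoff of half the signal length the signal is returned unchanged, so the schedule moves from coarse structure toward full detail.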

DistinctAD: Distinctive Audio Description Generation in Contexts

We propose DistinctAD, a two-stage framework for automatically generating audio descriptions (ADs) for movies and TV series. DistinctAD aims to generate distinctive and interesting ADs for contextually similar video clips.

Bo Fang, Wenhao Wu, Qiangqiang Wu, YuXin Song, and Antoni B. Chan, "DistinctAD: Distinctive Audio Description Generation in Contexts." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).

P2R Loss for Semi-Supervised Counting

We introduce a Point-to-Region (P2R) loss to address the over-activation and pseudo-label propagation issues inherent in semi-supervised crowd counting. By replacing pixel-level matching with region-level supervision, P2R suppresses background noise and achieves state-of-the-art results with significantly higher training stability.

Wei Lin, Chenyang Zhao, and Antoni B. Chan, "Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight). [github]

Proximal Mapping Loss for Crowd Counting

We propose the Proximal Mapping Loss (PML), a theoretically grounded framework that discards the unrealistic “non-overlap” assumption common in crowd counting. By leveraging proximal operators from convex optimization, PML accurately recovers density in highly congested scenes where severe occlusions and overlapping objects are prevalent.

Wei Lin, Jia Wan, and Antoni B. Chan, "Proximal Mapping Loss: Understanding Loss Functions in Crowd Counting & Localization." In: Intl. Conf. on Learning Representations (ICLR), Singapore, Apr 2025.

2024

Adversarial-Noise Watermark Framework

We propose a novel watermarking framework that leverages adversarial attacks to embed watermarks into images via two secret keys (network and signature) and deploys hypothesis tests to detect these watermarks with statistical guarantees.

Feiyu Chen, Wei Lin, Ziquan Liu, and Antoni B. Chan, "A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks." In: European Conference on Computer Vision (ECCV), Milano, Oct 2024. [supplemental | github]
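
The detection step can be pictured as a one-sided hypothesis test. The sketch below assumes, purely for illustration, that the secret-key network yields a string of bits for an image and that, for an unwatermarked image, each bit matches the secret signature independently with probability 0.5. Neither this null model nor the function names come from the paper.

```python
from math import comb

def binomial_p_value(num_matches, num_bits):
    """P(at least num_matches of num_bits fair coin flips agree) under the null."""
    tail = sum(comb(num_bits, k) for k in range(num_matches, num_bits + 1))
    return tail / 2 ** num_bits

def is_watermarked(decoded_bits, signature_bits, alpha=1e-3):
    """Reject the 'no watermark' hypothesis when the p-value falls below alpha.

    alpha bounds the false-positive rate under the assumed null model.
    """
    matches = sum(int(a == b) for a, b in zip(decoded_bits, signature_bits))
    return binomial_p_value(matches, len(signature_bits)) < alpha
```

Under these assumptions, a 32-bit signature that matches on every bit has a p-value of 2^-32, while a half-matching string is indistinguishable from chance.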

Scalable Video Object Segmentation with Simplified Framework

We propose a Simplified VOS framework (SimVOS), which removes the hand-crafted feature extraction and matching modules in previous approaches, to perform joint feature extraction and interaction via a single scalable transformer backbone. We also demonstrate that large-scale self-supervised pre-trained models can provide significant benefits to the VOS task. In addition, a new token refinement module is proposed to achieve a better speed-accuracy trade-off for scalable video object segmentation.

Qiangqiang Wu, Tianyu Yang, Wei Wu, and Antoni B. Chan, "Scalable Video Object Segmentation with Simplified Framework." In: International Conf. Computer Vision (ICCV), Paris, Oct 2023. [github]

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

We study masked autoencoder (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).

Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B. Chan, "DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Jun 2023. [github]

Grad-ECLIP: Gradient-based Visual Explanation for CLIP

We propose a Gradient-based visual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair.

Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, and Antoni B. Chan, "Gradient-based Visual Explanation for Transformer-based CLIP." In: International Conference on Machine Learning (ICML), Vienna, Jul 2024. [github]
Chenyang Zhao, Kun Wang, Janet H. Hsiao, and Antoni B. Chan, "Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP." arXiv:2502.18816, Feb 2025. [github]

Prompt-Based Counting

We introduce a unified framework for prompt-based counting that supports bounding boxes, points, and natural language within a single architecture. By employing a novel fixed-point inference mechanism, the model iteratively refines density maps to ensure consistency between visual content and diverse prompt modalities.

Wei Lin and Antoni B. Chan, "A Fixed-Point Approach to Unified Prompt-Based Counting." In: AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Feb 2024. [supplemental | github]

2023

Optimal Transport Minimization

We introduce Optimal Transport Minimization (OT-M), a parameter-free algorithm that recovers precise object locations from density maps without additional training. By generating “hard” pseudo-labels through Sinkhorn distance minimization, OT-M enables a more robust semi-supervised crowd counting framework compared to traditional soft-label methods.

Wei Lin and Antoni B. Chan, "Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Jun 2023 (highlight). [github]
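
A simplified picture of localizing points on a density map with optimal transport is sketched below: an entropic (Sinkhorn) transport plan is computed between pixel masses and candidate points, and each point is then moved to the plan-weighted mean of the pixels it receives mass from. The regularization strength `epsilon`, the fixed number of rounds, the squared-distance cost, and the caller-supplied initial points are assumptions of this sketch; the published OT-M algorithm is described as parameter-free and may differ in these details.

```python
import math

def sinkhorn_plan(source_mass, target_mass, cost, epsilon, num_iters=200):
    """Entropic OT plan via Sinkhorn scaling; total masses must be equal."""
    n, m = len(source_mass), len(target_mass)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    v = [1.0] * m
    for _ in range(num_iters):
        u = [source_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [target_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_m(pixels, density, points, epsilon=1.0, num_rounds=5):
    """Alternate a transport step and a point-update step.

    pixels: list of (x, y) pixel coordinates; density: mass at each pixel.
    points: initial (x, y) guesses, one per object (hypothetical initialization).
    """
    scale = len(points) / sum(density)          # each point carries unit mass
    source_mass = [d * scale for d in density]
    target_mass = [1.0] * len(points)
    for _ in range(num_rounds):
        cost = [[(px - qx) ** 2 + (py - qy) ** 2 for (qx, qy) in points]
                for (px, py) in pixels]
        plan = sinkhorn_plan(source_mass, target_mass, cost, epsilon)
        updated = []
        for j in range(len(points)):
            mass = sum(plan[i][j] for i in range(len(pixels)))
            x = sum(plan[i][j] * pixels[i][0] for i in range(len(pixels))) / mass
            y = sum(plan[i][j] * pixels[i][1] for i in range(len(pixels))) / mass
            updated.append((x, y))
        points = updated
    return points
```

The returned coordinates are "hard" point estimates, in contrast to using the density map itself as a soft label.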

Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios

We propose a batch-mode Pareto Optimization Active Learning (POAL) framework for Active Learning under Out-of-Distribution data scenarios.

Xueying Zhan, Zeyu Dai, Qingzhong Wang, Qing Li, Haoyi Xiong, Dejing Dou, and Antoni B. Chan, "Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios." Transactions on Machine Learning Research (TMLR), June 2023. [github]

ODAM: Gradient-based Instance-specific Visual Explanation for Object Detection

We propose the gradient-weighted Object Detector Activation Maps (ODAM), a visualized explanation technique for interpreting the predictions of object detectors, including class score and bounding box coordinates.

Chenyang Zhao and Antoni B. Chan, "ODAM: Gradient-based Instance-Specific Visual Explanations for Object Detection." In: Intl. Conf. on Learning Representations (ICLR), Rwanda, May 2023. [github]
Chenyang Zhao, Janet H. Hsiao, and Antoni B. Chan, "Gradient-based Instance-Specific Visual Explanations for Object Specification and Object Discrimination." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 46(9):5967-5985, Sep 2024 (online Mar 2024). [github]

A Comparative Survey of Deep Active Learning

We present a comprehensive comparative survey of 19 Deep Active Learning approaches for classification tasks.

Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan, "A Comparative Survey of Deep Active Learning." arXiv:2203.13450, Mar 2023.

A Comparative Survey: Benchmarking for Pool-based Active Learning

We introduce an active learning benchmark comprising 35 public datasets and experiment protocols, and evaluate 17 pool-based AL methods.

Xueying Zhan, Huan Liu, Qing Li, and Antoni B. Chan, "A Comparative Survey: Benchmarking for Pool-based Active Learning." In: International Joint Conf. on Artificial Intelligence (IJCAI), Survey Track, Aug 2021. [github]

2022

Calibration-free Multi-view Crowd Counting

We propose a calibration-free multi-view crowd counting (CF-MVCC) method, which obtains the scene-level count as a weighted summation over the predicted density maps from the camera-views, without needing camera calibration parameters.

Qi Zhang and Antoni B. Chan, "Calibration-free Multi-view Crowd Counting." In: European Conference on Computer Vision (ECCV), Tel Aviv, Oct 2022. [supplemental]
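
The weighted summation behind the scene-level count can be written in a few lines. In this sketch the per-pixel weight maps are simply given as inputs (in the method they would be predicted by the model), and the function name and nested-list layout are illustrative assumptions.

```python
def scene_count(density_maps, weight_maps):
    """Scene-level count as a weighted sum over per-view density maps.

    density_maps: one 2D list per camera view (predicted density per pixel).
    weight_maps: matching 2D lists of per-pixel weights; a weight below 1
    down-weights regions that are also covered by another view.
    """
    total = 0.0
    for density, weight in zip(density_maps, weight_maps):
        for d_row, w_row in zip(density, weight):
            total += sum(d * w for d, w in zip(d_row, w_row))
    return total
```

For example, if two views each see two people and one person is visible in both, giving that person's pixels a weight of 0.5 in each view yields a scene count of 3 rather than 4.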

Single-Frame-Based Deep View Synchronization for Unsynchronized Multicamera Surveillance

We propose a synchronization model that operates in conjunction with existing DNN-based multi-view models to allow them to work on unsynchronized data.

Qi Zhang and Antoni B. Chan, "Single-Frame-Based Deep View Synchronization for Unsynchronized Multicamera Surveillance." IEEE Trans. on Neural Networks and Learning Systems (TNNLS), 34(12):10653-10667, Dec 2023. [github]

Modeling Eye Movements by Integrating Deep Neural Networks and Hidden Markov Models

We model eye movements on faces through integrating deep neural networks and hidden Markov Models (DNN+HMM).

Janet H. Hsiao, Jeehye An, Veronica Kit Sum Hui, Yueyuan Zheng, and Antoni B. Chan, "Understanding the role of eye movement consistency in face recognition and autism through integrating deep neural networks and hidden Markov models." npj Science of Learning, 7:28, Oct 2022.

Crowd Counting in the Frequency Domain

We derive loss functions in the frequency domain for training density map regression for crowd counting.

Weibo Shu, Jia Wan, Kay Chen Tan, Sam Kwong, and Antoni B. Chan, "Crowd Counting in the Frequency Domain." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022. [github]

2021

Dynamic Momentum Adaptation for Zero-Shot Cross-Domain Crowd Counting

We propose a novel Crowd Counting framework built upon an external Momentum Template, termed C2MoT, which enables the encoding of domain-specific information via an external template representation.

Qiangqiang Wu, Jia Wan, and Antoni B. Chan, "Dynamic Momentum Adaptation for Zero-Shot Cross-Domain Crowd Counting." In: ACM Multimedia (MM), Oct 2021. [github]

Group-based Distinctive Image Captioning with Memory Attention

We improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with other images in one similar group and highlights the uniqueness of each image.

Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan, "Group-based Distinctive Image Captioning with Memory Attention." In: ACM Multimedia (MM), Oct 2021 (oral). [supplemental]

Hierarchical Learning of Hidden Markov Models with Clustering Regularization

We propose a novel tree-structured variational Bayesian method that learns the individual models and the group models simultaneously by treating the group models as the parents of the individual models. Each individual model is learned from its observations and regularized by its parent, and conversely, each parent model is optimized to best represent its children.

Hui Lan and Antoni B. Chan, "Hierarchical Learning of Hidden Markov Models with Clustering Regularization." In: 37th Conference on Uncertainty in Artificial Intelligence (UAI), Jul 2021.

Chinese White Dolphin Detection in the Wild

To reduce human experts’ workload and improve observation accuracy, we develop a practical system that automatically detects Chinese White Dolphins in the wild.

Hao Zhang, Qi Zhang, Phuong Anh Nguyen, Victor Lee, and Antoni B. Chan, "Chinese White Dolphin Detection in the Wild." In: ACM Multimedia Asia (MMAsia), Gold Coast, Australia, Dec 2021. [dataset]

Eye Movement analysis with Hidden Markov Models (EMHMM) with co-clustering

We analyze eye movement data on stimuli with different feature layouts. Through co-clustering HMMs, we discover common strategies on each stimulus and cluster subjects with similar strategies.

Janet H. Hsiao, Hui Lan, Yueyuan Zheng, and Antoni B. Chan, "Eye Movement analysis with Hidden Markov Models (EMHMM) with co-clustering." Behavior Research Methods, 53:2473-2486, April 2021.

Meta-Graph Adaptation for Visual Object Tracking

In this paper, we propose a novel meta-graph adaptation network (MGA-Net) to effectively adapt backbone feature extractors in existing deep trackers to a specific online tracking task.

Qiangqiang Wu and Antoni B. Chan, "Meta-Graph Adaptation for Visual Object Tracking." In: IEEE International Conference on Multimedia and Expo (ICME), Jul 2021 (oral).

Progressive Unsupervised Learning for Visual Object Tracking

In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking.

Qiangqiang Wu, Jia Wan, and Antoni B. Chan, "Progressive Unsupervised Learning for Visual Object Tracking." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021 (oral). [supplemental]

Fully Nested Neural Network for Adaptive Compression and Quantization

We propose a fully nested neural network (FN3) that runs only once to build a nested set of compressed/quantized models, which are optimal for different resource constraints. We then propose a Bayesian version that estimates the ordered dropout hyperparameter and has well-calibrated uncertainty.

Yufei Cui, Ziquan Liu, Wuguannan Yao, Qiao Li, Antoni B. Chan, Tei-wei Kuo, and Chun Jason Xue, "Fully Nested Neural Network for Adaptive Compression and Quantization." In: International Joint Conf. on Artificial Intelligence (IJCAI), Yokohama, July 2020. [supplemental]
Yufei Cui, Ziquan Liu, Qiao Li, Antoni B. Chan, and Chun Jason Xue, "Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021. [github]

A Generalized Loss Function for Crowd Counting and Localization

We propose a generalized loss function for density map regression based on unbalanced optimal transport. We prove that pixel-wise L2 loss and Bayesian loss are special cases and sub-optimal solutions to our proposed loss. Since the predicted density will be pushed toward annotation positions, the density map prediction will be sparse and can naturally be used for localization.

Jia Wan, Ziquan Liu, and Antoni B. Chan, "A Generalized Loss Function for Crowd Counting and Localization." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021. [supplemental | github]

Cross-View Cross-Scene Multi-View Crowd Counting

In this paper, we propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.

Qi Zhang, Wei Lin, and Antoni B. Chan, "Cross-View Cross-Scene Multi-View Crowd Counting." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR):557-567, Jun 2021. [supplemental | github]

Fine-Grained Crowd Counting

In this paper, we propose fine-grained crowd counting, which differentiates a crowd into categories based on the low-level behavior attributes of the individuals (e.g. standing/sitting or violent behavior) and then counts the number of people in each category. To enable research in this area, we construct a new dataset of four real-world fine-grained counting tasks: traveling direction on a sidewalk, standing or sitting, waiting in line or not, and exhibiting violent behavior or not.

Jia Wan, Nikil S. Kumar, and Antoni B. Chan, "Fine-Grained Crowd Counting." IEEE Trans. on Image Processing (TIP), 30:2114-2126, Jan 2021. [code | data]

Tracking-by-Counting: Using Network Flows on Crowd Density Maps for Tracking Multiple Targets

We propose a new multiple-object tracking (MOT) paradigm, tracking-by-counting, tailored for crowded scenes. Using crowd density maps, we jointly model detection, counting, and tracking of multiple targets as a network flow program, which simultaneously finds the globally optimal detections and trajectories of multiple targets over the whole video.

Weihong Ren, Xinchao Wang, Jiandong Tian, Yandong Tang, and Antoni B. Chan, "Tracking-by-Counting: Using Network Flows on Crowd Density Maps for Tracking Multiple Targets." IEEE Trans. on Image Processing (TIP), 30:1439-1452, 2021.

2020

Modeling Noisy Annotations for Crowd Counting

We model the annotation noise using a random variable with Gaussian distribution and derive the pdf of the crowd density value for each spatial location in the image. We then approximate the joint distribution of the density values (i.e., the distribution of density maps) with a full covariance multivariate Gaussian density, and derive a low-rank approximation for tractable implementation.

Jia Wan and Antoni B. Chan, "Modeling Noisy Annotations for Crowd Counting." In: Neural Information Processing Systems (NeurIPS), Dec 2020. [supplemental | github]
Jia Wan, Qiangqiang Wu, and Antoni B. Chan, "Modeling Noisy Annotations for Point-Wise Supervision." IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), 45(12):15065-15080, Dec 2023 (online Jul 2023). [github]

Accelerating Monte Carlo Bayesian Inference via Approximating Predictive Uncertainty over Simplex

We propose a generic framework to approximate the output probability distribution induced by a Bayesian NN model posterior with a parameterized model and in an amortized fashion. The aim is to approximate the predictive uncertainty of a specific Bayesian model, meanwhile alleviating the heavy workload of MC integration at testing time.

Yufei Cui, Wuguannan Yao, Qiao Li, Antoni B. Chan, and Chun Xue, "Accelerating Monte Carlo Bayesian Prediction via Approximating Predictive Uncertainty over the Simplex." IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 33(4):1492-1506, Apr 2022 (online 2020).

Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets

To improve the distinctiveness of image captions, we first propose a metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images, and then propose several new training strategies for image captioning based on the new distinctiveness measure.

Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan, "On Distinctive Image Captioning via Comparing and Reweighting." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 45(2):2088-2103, Feb 2023 (online 2022). [github]
Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan, "Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets." In: European Conference on Computer Vision (ECCV), Aug 2020 (oral). [github]

ROAM: Recurrently Optimizing Tracking Model

We propose to train a recurrent neural optimizer offline, in a meta-learning setting, to update a tracking model so that the model converges in a few gradient steps during online training.

Tianyu Yang, Pengfei Xu, Runbo Hu, Hua Chai, and Antoni B. Chan, "ROAM: Recurrently Optimizing Tracking Model." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, Jun 2020. [github]

2019

3D Crowd Counting via Multi-View Fusion with 3D Gaussian Kernels

Recently, an end-to-end multi-view crowd counting method called multi-view multi-scale (MVMS) has been proposed, which fuses multiple camera views using a CNN to predict a 2D scene-level density map on the ground-plane. Unlike MVMS, we propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of the 2D ground-plane ones.

Qi Zhang and Antoni B. Chan, "3D Crowd Counting via Geometric Attention-guided Multi-View Fusion." International Journal of Computer Vision (IJCV), 130:3123-3139, Dec 2022.

Adaptive Density Map Generation for Crowd Counting

From the perspective of end-to-end training, the hand-crafted methods used for generating the density maps may not be optimal for the particular network or dataset used. To address this issue, we propose an adaptive density map generator, which takes the annotation dot map as input and learns a density map representation for training a counter. The counter and generator are trained jointly within an end-to-end framework.

Jia Wan and Antoni B. Chan, "Adaptive Density Map Generation for Crowd Counting." In: Intl. Conf. on Computer Vision (ICCV), Seoul, Oct 2019. [github]
Jia Wan, Qingzhong Wang, and Antoni B. Chan, "Kernel-based Density Map Generation for Dense Object Counting." IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), 44(3):1357-1370, Mar 2022. [github]

Eye Movement analysis with Switching HMMs (EMSHMM)

We use a switching hidden Markov model (EMSHMM) approach to analyze eye movement data in cognitive tasks involving cognitive state changes. A high-level state captures a participant’s cognitive state transitions during the task, and eye movement patterns during each high-level state are summarized with a regular HMM.

Tim Chuk, Antoni B. Chan, Shinsuke Shimojo, and Janet H. Hsiao, "Eye movement analysis with switching hidden Markov models." Behavior Research Methods, 52:1026-1043, June 2020. [appendix | code]

Parametric Manifold Learning of Gaussian Mixture Models

We propose a ParametRIc MAnifold Learning (PRIMAL) algorithm for Gaussian Mixture Models (GMMs), assuming that GMMs lie on or near a manifold that is generated from a low-dimensional hierarchical latent space through parametric mappings. Inspired by Principal Component Analysis (PCA), the generative processes for priors, means and covariance matrices are modeled by their respective latent spaces and parametric mappings.

Ziquan Liu, Lei Yu, Janet H. Hsiao, and Antoni B. Chan, "PRIMAL-GMM: PaRametrIc MAnifold Learning of Gaussian Mixture Models." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 44(6):3197-3211, June 2022 (online 2021). [github]
Ziquan Liu, Lei Yu, Janet H. Hsiao, and Antoni B. Chan, "Parametric Manifold Learning of Gaussian Mixture Models." In: International Joint Conference on Artificial Intelligence (IJCAI), Macau, Aug 2019. [github]

On Diversity in Image Captioning: Metrics and Methods

In this project, we focus on the diversity of image captions. First, we propose diversity metrics that are more correlated with human judgment. Second, we re-evaluate existing models and find that (1) there is a large gap between humans and existing models in the diversity-accuracy space, and (2) using reinforcement learning (CIDEr reward) to train captioning models improves accuracy but reduces diversity. Third, we propose a simple but efficient approach to balance diversity and accuracy via reinforcement learning, using a linear combination of cross-entropy and CIDEr reward.

Qingzhong Wang and Antoni B. Chan, "Describing like Humans: on Diversity in Image Captioning." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, June 2019. [github]
Qingzhong Wang, Jia Wan, and Antoni B. Chan, "On Diversity in Image Captioning: Metrics and Methods." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 44(2):1035-1049, Feb 2022. [github]

Residual Regression with Semantic Prior for Crowd Counting

We propose a residual regression framework for crowd counting that harnesses the correlation information among samples. By incorporating this information into our network, we find that the network learns more intrinsic characteristics and thus generalizes better to unseen scenarios. In addition, we show how to effectively leverage a semantic prior to improve crowd counting performance.

Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B. Chan, and Wei Liu, "Residual Regression with Semantic Prior for Crowd Counting." In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, June 2019. [github]

Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs

In this paper, we propose a deep neural network framework for multi-view crowd counting, which fuses information from multiple camera views to predict a scene-level density map on the ground-plane of the 3D world.

Qi Zhang and Antoni B. Chan, "Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes." International Journal of Computer Vision (IJCV), 130(8):1938-1960, Aug 2022. [github]
Qi Zhang and Antoni B. Chan, "Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, June 2019. [dataset&code | github]

2018

Simplification of Gaussian Mixture Models

An algorithm is proposed to simplify the Gaussian Mixture Models into a reduced mixture model with fewer mixture components, by maximizing a variational lower bound of the expected log-likelihood of a set of virtual samples.

Lei Yu, Tianyu Yang, and Antoni B. Chan, "Approximate Inference for Generic Likelihoods via Density-Preserving GMM Simplification." In: NIPS 2016 Workshop on Advances in Approximate Bayesian Inference, Barcelona, Dec 2016.
Lei Yu, Tianyu Yang, and Antoni B. Chan, "Density-Preserving Hierarchical EM Algorithm: Simplifying Gaussian Mixture Models for Approximate Inference." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 41(6):1323-1337, June 2019. [code]

Convolutional Decoders for Image Captioning

RNN-based models dominate the field of image captioning; however, (1) RNNs must be computed step by step, which is not easily parallelized; (2) there is a long path between the start and end of the sentence when using RNNs (tree structures can shorten the path, but trees require special processing); and (3) RNNs only learn single-level representations at each time step. In contrast, convolutional decoders are able to learn multi-level representations of concepts, each of which should correspond to an image area, which should benefit word prediction.

Qingzhong Wang and Antoni B. Chan, "CNN+CNN: Convolutional Decoders for Image Captioning." In: IEEE Computer Vision and Pattern Recognition: Language and Vision Workshop, Salt Lake City, Jun 2018. [github]
Qingzhong Wang and Antoni B. Chan, "Gated Hierarchical Attention for Image Captioning." In: Asian Conference on Computer Vision (ACCV), Perth, Dec 2018. [github]

Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks – Counting, Detection, and Tracking

We propose CNN-pixel and FCNN-skip to produce an original-resolution density map. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared to bilinearly upsampling the lower-resolution density maps.

Di Kang, Zheng Ma, and Antoni B. Chan, "Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks - Counting, Detection, and Tracking." IEEE Trans. on Circuits and Systems for Video Technology (TCSVT), 29(5):1408-1422, May 2019.

Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid

We utilize an image pyramid to deal with scale variations. Moreover, we adaptively fuse the predictions from different scales (using adaptively changing per-pixel weights), which allows our method to adapt to scale changes within an image.

Di Kang and Antoni B. Chan, "Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid." In: British Machine Vision Conference (BMVC), Newcastle, Sept 2018. [supplemental]
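
Per-pixel adaptive fusion of predictions from several pyramid scales can be sketched as follows. The sketch assumes the per-scale density maps have already been resized to a common resolution and that the per-pixel weights come from a softmax over hypothetical weight logits; the names and the softmax choice are illustrative rather than taken from the paper.

```python
import math

def softmax(values):
    """Normalize a list of logits into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_predictions(density_maps, weight_logits):
    """Fuse same-size density maps with per-pixel softmax weights.

    density_maps[s][y][x]: prediction from pyramid scale s at pixel (y, x).
    weight_logits[s][y][x]: unnormalized fusion weight for that scale and pixel.
    """
    num_scales = len(density_maps)
    height = len(density_maps[0])
    width = len(density_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = softmax([weight_logits[s][y][x] for s in range(num_scales)])
            fused[y][x] = sum(weights[s] * density_maps[s][y][x]
                              for s in range(num_scales))
    return fused
```

Because the weights vary per pixel, a region with large people can rely on one scale while a region with small people relies on another within the same image.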

Learning Dynamic Memory Networks for Object Tracking

We propose a dynamic memory network to adapt the template to the target’s appearance variations during tracking, where an LSTM is used to control the reading and writing of the memory block.

Tianyu Yang and Antoni B. Chan, "Visual Tracking via Dynamic Memory Networks." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 43(1):360-374, Jan 2021. [github]
Tianyu Yang and Antoni B. Chan, "Learning Dynamic Memory Networks for Object Tracking." In: European Conference on Computer Vision (ECCV), Munich, Sept 2018. [github]

Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes

We propose a framework for tracking people in crowds that fuses a generic visual object tracker with an estimated crowd density map using a convolutional neural network (CNN). We also design a Sparse Kernelized Correlation Filter (S-KCF) to suppress spurious responses and target response variations caused by occlusions and illumination changes.

Weihong Ren, Di Kang, Yandong Tang, and Antoni B. Chan, "Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes." In: IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Jun 2018.

2017

Incorporating Side Information by Adaptive Convolution

In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information.

Di Kang, Debarun Dhar, and Antoni B. Chan, "Incorporating Side Information by Adaptive Convolution." International Journal of Computer Vision (IJCV), 128:2897-2918, July 2020.
Di Kang, Debarun Dhar, and Antoni B. Chan, "Incorporating Side Information by Adaptive Convolution." In: Neural Information Processing Systems, Long Beach, Dec 2017. [supplemental]

Recurrent Filter Learning for Visual Tracking

We propose a recurrent filter generation method for visual tracking, which directly feeds the target’s image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking.

Tianyu Yang and Antoni B. Chan, "Recurrent filter learning for visual tracking." In: ICCV 5th Visual Object Tracking Challenge Workshop VOT2017, Venice, Oct 2017. [video | github]

DynamicManga: Animating Still Manga via Camera Movement

We propose a method for animating still manga imagery through camera movements, driven by motion and emotion semantics automatically extracted from the manga.

Ying Cao, Xufang Pang, Antoni B. Chan, and Rynson W.H. Lau, "DynamicManga: Animating Still Manga via Camera Movement." IEEE Trans. on Multimedia (TMM), 19(1):160-172, Jan 2017. [supplemental | video]

Martial Arts, Dancing and Sports Dataset

We collect a multi-view and stereo-depth dataset for 3D human pose estimation, which consists of challenging martial arts actions (Tai-chi and Karate), dancing actions (hip-hop and jazz), and sports actions (basketball, volleyball, football, rugby, tennis and badminton).

Weichen Zhang, Zhiguang Liu, Liuyang Zhou, Howard Leung, and Antoni B. Chan, "Martial Arts, Dancing and Sports Dataset: a Challenging Stereo and Multi-View Dataset for 3D Human Pose Estimation." Image and Vision Computing, 61:22-39, May 2017. [supplemental | dataset]

2016

Directing User Attention via Visual Flow on Web Designs

We present an approach that allows web designers to easily direct user attention via visual flow on web designs.

Xufang Pang, Ying Cao, Rynson W.H. Lau, and Antoni B. Chan, "Directing User Attention via Visual Flow on Web Designs." ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2016), Dec 2016. [supplemental | video]

2015

Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

We propose a maximum-margin structured learning framework with a deep neural network that learns the image-pose score function for human pose estimation.

Sijin Li, Weichen Zhang, and Antoni B. Chan, "Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation." In: Intl. Conf. on Computer Vision (ICCV):2848-2856, Santiago, Dec 2015. [spotlight video]
Sijin Li, Weichen Zhang, and Antoni B. Chan, "Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation." International Journal of Computer Vision (IJCV), 122(1):149-168, March 2017.

Small Instance Detection using Object Density Maps

We propose a novel object detection framework using object density maps for partially occluded small instances, such as pedestrians in low-resolution surveillance video.

Zheng Ma, Lei Yu, and Antoni B. Chan, "Small Instance Detection by Integer Programming on Object Density Maps." In: IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Boston, Jun 2015. [extended abstract | dataset]

Bag of Systems Trees

We propose the BoSTree, which efficiently maps videos to the bag-of-systems (BoS) codebook using a tree structure, making larger, richer codebooks practical to use.

Adeel Mumtaz, Emanuele Coviello, Gert R.G. Lanckriet, and Antoni B. Chan, "A Scalable and Accurate Descriptor for Dynamic Textures using Bag of System Trees." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 37(4):697-712, Apr 2015. [appendix]

2014

A Robust Likelihood Function for 3D Human Pose Tracking

We propose a likelihood function for 3D human pose tracking that is robust to small pose changes and better able to localize partially occluded and overlapping parts.

Weichen Zhang, Lifeng Shang, and Antoni B. Chan, "A Robust Likelihood Function for 3D Human Pose Tracking." IEEE Trans. on Image Processing (TIP), 23(12):5374-5389, Dec 2014.

Attention-Directing Composition of Manga Elements

We propose an approach for novices to synthesize a composition of panel elements that can effectively guide the reader’s attention to convey the story.

Ying Cao, Rynson W.H. Lau, and Antoni B. Chan, "Look Over Here: Attention-Directing Composition of Manga Elements." ACM Transactions on Graphics (Proc. SIGGRAPH 2014), Aug 2014. [supplemental | video | slides]

Pose Estimation with Deep Convolutional Neural Network

We propose a heterogeneous multi-task learning framework for 2D human pose estimation from monocular images using a deep convolutional neural network that combines pose regression and part detection. We also extend the model to 3D human pose estimation.

Sijin Li, Zhi-Qiang Liu, and Antoni B. Chan, "Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network." International Journal of Computer Vision (IJCV), 113(1):19-36, May 2015.
Sijin Li and Antoni B. Chan, "3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network." In: Asian Conference on Computer Vision (ACCV), Singapore, Nov 2014.

Eye Movement Analysis with HMMs (EMHMM)

We use hidden Markov models (HMMs) to analyze eye movement data. A person’s eye fixation sequence is summarized with an HMM, and common strategies among people are discovered by clustering HMMs.

Tim Chuk, Antoni B. Chan, and Janet H. Hsiao, "Understanding eye movements in face recognition using hidden Markov models." Journal of Vision, 14(11):8, Sep 2014.
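
To make the HMM summary above concrete, here is a minimal sketch (not the EMHMM toolbox code) that scores a fixation sequence under an HMM with 2D Gaussian emissions using the scaled forward algorithm. The two-state models, region-of-interest centers, covariances, and pixel coordinates are hypothetical values chosen for illustration.

```python
import numpy as np

def gaussian_pdf(points, mean, cov):
    """Density of 2D points under a Gaussian with the given mean and covariance."""
    diff = points - mean
    inv = np.linalg.inv(cov)
    mahal = np.einsum("ni,ij,nj->n", diff, inv, diff)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * mahal) / norm

def hmm_log_likelihood(fixations, prior, trans, means, covs):
    """Scaled forward algorithm: returns log p(fixations | HMM).

    trans[i, j] is the probability of moving from state i to state j.
    """
    emis = np.stack(
        [gaussian_pdf(fixations, m, c) for m, c in zip(means, covs)], axis=1
    )
    alpha = prior * emis[0]
    log_lik = 0.0
    for t in range(len(fixations)):
        if t > 0:
            alpha = (alpha @ trans) * emis[t]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale  # renormalize to avoid numerical underflow
    return log_lik

# Hypothetical fixation sequence (x, y in pixels) alternating between two regions.
fixations = np.array([[105.0, 98.0], [195.0, 103.0], [102.0, 95.0], [198.0, 101.0]])
prior = np.array([0.9, 0.1])
trans = np.array([[0.2, 0.8], [0.8, 0.2]])
covs = [400.0 * np.eye(2), 400.0 * np.eye(2)]

# Model A has regions of interest near the fixations; model B does not.
means_a = [np.array([100.0, 100.0]), np.array([200.0, 100.0])]
means_b = [np.array([100.0, 300.0]), np.array([200.0, 300.0])]

ll_a = hmm_log_likelihood(fixations, prior, trans, means_a, covs)
ll_b = hmm_log_likelihood(fixations, prior, trans, means_b, covs)
```

Here `ll_a` is larger than `ll_b`, since model A places its regions of interest where the fixations actually fall. Comparing such log-likelihoods is one simple way to tell which summarized viewing strategy a new sequence resembles.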

Clustering Hidden Markov Models (HMMs)

We propose a variational hierarchical EM algorithm for clustering hidden Markov models (HMMs), producing groups of similar HMMs and their representative HMM cluster centers. We also propose a variational Bayesian version that performs model selection.

Emanuele Coviello, Antoni B. Chan, and Gert R.G. Lanckriet, "Clustering hidden Markov models with variational HEM." Journal of Machine Learning Research (JMLR), 15(2):697-747, Feb 2014. [code]
Hui Lan, Ziquan Liu, Janet H. Hsiao, Dan Yu, and Antoni B. Chan, "Clustering Hidden Markov Models With Variational Bayesian Hierarchical EM." IEEE Trans. on Neural Networks and Learning Systems (TNNLS), 34(3):1537-1551, March 2023 (online 2021).

2013

Clustering Dynamic Textures

We propose a hierarchical EM algorithm capable of clustering dynamic texture models and learning novel cluster centers that are representative of the cluster members. DT clustering can be applied to semantic motion annotation and bag-of-systems codebook generation.

Adeel Mumtaz, Emanuele Coviello, Gert R.G. Lanckriet, and Antoni B. Chan, "Clustering Dynamic Textures with the Hierarchical EM Algorithm for Modeling Video." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 35(7):1606-1621, Jul 2013. [appendix]

Counting Pedestrians Crossing a Line

We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest in a video sequence.

Zheng Ma and Antoni B. Chan, "Crossing the Line: Crowd Counting by Integer Programming with Local Features." In: IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Portland, Jun 2013.
Zheng Ma and Antoni B. Chan, "Counting People Crossing a Line using Integer Programming and Local Features." IEEE Trans. on Circuits and Systems for Video Technology (TCSVT), 26(10):1955-1969, Oct 2016. [appendix]

2012

Automatic Stylistic Manga Layout

We propose an approach to automatically produce a manga layout from a set of input artworks, which is based on a generative layout model and parametric style models.

Ying Cao, Antoni B. Chan, and Rynson W.H. Lau, "Automatic Stylistic Manga Layout." ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2012), Singapore, Nov 2012. [supplemental | video | slides]

Pedestrian Crowd Counting

We estimate the size of moving crowds in a privacy-preserving manner, i.e., without people models or tracking. The system first segments the crowd by its motion, extracts low-level features from each segment, and estimates the crowd count in each segment using a Gaussian process.

Antoni B. Chan and Nuno Vasconcelos, "Counting People with Low-Level Features and Bayesian Regression." IEEE Trans. on Image Processing (TIP), 21(4):2170-2177, May 2012.
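
The last step of the pipeline above, regressing a count from segment features with a Gaussian process, can be sketched as follows. This is a simplified illustration under stated assumptions, not the system from the paper: it uses a single hypothetical feature (normalized segment area), synthetic training counts, and an RBF kernel with hand-picked hyperparameters.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1, **kernel_args):
    """Posterior mean and latent variance of a zero-mean GP regressor."""
    k_tt = rbf_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train, **kernel_args)
    k_ss = rbf_kernel(x_test, x_test, **kernel_args)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, var

# Synthetic training data (assumption): count grows linearly with segment area.
area = np.linspace(0.0, 10.0, 21)[:, None]   # hypothetical normalized segment area
count = 2.0 * area[:, 0]                     # hypothetical ground-truth counts 0..20

# Center the targets, since the GP above has a zero mean function.
count_mean = count.mean()
test_area = np.array([[5.25]])
mean_c, var = gp_predict(
    area, count - count_mean, test_area,
    noise_var=0.1, length_scale=3.0, variance=25.0,
)
predicted_count = mean_c[0] + count_mean
```

For this toy data the prediction for an unseen segment of area 5.25 lies close to the underlying trend value of 10.5, and `var` gives an uncertainty estimate alongside the count, which is one practical benefit of a Bayesian regressor.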

2011

Music Annotation with Time-Series Models

We propose an approach to automatic music annotation and retrieval that is based on the dynamic texture mixture, a generative time series model of musical content. The new annotation model better captures temporal (e.g., rhythmical) aspects as well as timbral content.

Emanuele Coviello, Antoni B. Chan, and Gert R.G. Lanckriet, "Time Series Models for Semantic Music Annotation." IEEE Trans. on Audio, Speech and Language Processing (TASLP), 19(5):1343-1359, Jul 2011.

Background Subtraction in Dynamic Scenes

The background model is based on a generalization of the Stauffer-Grimson background model, where each mixture component is a dynamic texture. We derive an on-line algorithm for updating the parameters using a set of sufficient statistics of the model.

Antoni B. Chan, Vijay Mahadevan, and Nuno Vasconcelos, "Generalized Stauffer-Grimson background subtraction for dynamic scenes." Machine Vision and Applications, 22(5):751-766, Sep 2011.

2010

Segmenting Musical Structure

We model a time series of audio feature vectors, extracted from a short audio fragment, as a dynamic texture. The musical structure of a song (e.g. chorus, verse, and bridge) is discovered by segmenting the song using the mixture of dynamic textures. The song segmentations are used for song retrieval, song annotation, and database visualization.

Luke Barrington, Antoni B. Chan, and Gert R.G. Lanckriet, "Modeling music as a dynamic texture." IEEE Trans. on Audio, Speech and Language Processing (TASLP), 18(3):602-612, Mar 2010.

2009

Layered Dynamic Textures

One disadvantage of the dynamic texture is its inability to account for multiple co-occurring textures in a single video. We extend the dynamic texture to a multi-state (layered) dynamic texture that can learn regions containing different dynamic textures.

Antoni B. Chan and Nuno Vasconcelos, "Layered dynamic textures." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 31(10):1862-1879, Oct 2009.

2008

Mixtures of Dynamic Textures

We introduce the mixture of dynamic textures, which models a collection of videos as samples from a set of dynamic textures. We use the model for video clustering and motion segmentation.

Antoni B. Chan and Nuno Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 30(5):909-926, May 2008.

2007

Semantic Image Annotation

We annotate images using supervised multi-class labeling (SML), which treats semantic annotation as a multi-class classification problem. The system is scalable, and was applied to image databases with 60,000 images.

Gustavo Carneiro, Antoni B. Chan, Pedro J. Moreno, and Nuno Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 29(3):394-410, Mar 2007.

Kernel Dynamic Textures

We introduce a kernelized dynamic texture, which has a non-linear observation function learned with kernel PCA. The new texture model can account for more complex patterns of motion, such as chaotic motion (e.g. boiling water and fire) and camera motion (e.g. panning and zooming), better than the original dynamic texture.

Antoni B. Chan and Nuno Vasconcelos, "Classifying Video with Kernel Dynamic Textures." In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Jun 2007.

2005

Classification and Retrieval of Traffic Video

We classify traffic congestion in video by representing the video as a dynamic texture, and classifying it using an SVM with a probabilistic kernel (the KL kernel). The resulting classifier is robust to noise and lighting changes.

Antoni B. Chan and Nuno Vasconcelos, "Probabilistic Kernels for the Classification of Auto-regressive Visual Processes." In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, Jun 2005. [8-page version]
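
As a simplified illustration of a probabilistic kernel of the kind mentioned above, the sketch below computes a kernel of the form exp(-scale * symmetric KL divergence) between generative models and builds a Gram matrix from it. Multivariate Gaussians stand in for dynamic textures here (an assumption made to keep the divergence in closed form), and the `scale` parameter and the example models are hypothetical.

```python
import numpy as np

def gaussian_kl(mean0, cov0, mean1, cov1):
    """KL divergence KL(N(mean0, cov0) || N(mean1, cov1)) in closed form."""
    d = mean0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mean1 - mean0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def kl_kernel(model_p, model_q, scale=1.0):
    """Kernel value exp(-scale * symmetric KL) between two (mean, cov) models."""
    sym_kl = gaussian_kl(*model_p, *model_q) + gaussian_kl(*model_q, *model_p)
    return np.exp(-scale * sym_kl)

def gram_matrix(models, scale=1.0):
    """Pairwise kernel values for a list of (mean, cov) models."""
    n = len(models)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            gram[i, j] = kl_kernel(models[i], models[j], scale)
    return gram

# Hypothetical models: the first two are similar, the third is very different.
models = [
    (np.array([0.0, 0.0]), np.eye(2)),
    (np.array([0.2, 0.0]), np.eye(2)),
    (np.array([3.0, 3.0]), 2.0 * np.eye(2)),
]
gram = gram_matrix(models, scale=0.5)
```

Similar models receive kernel values near 1 and dissimilar ones values near 0. A Gram matrix like this can be handed to an SVM implementation that accepts precomputed kernels, with one model fitted per video clip.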