About

Welcome to the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong! The lab is directed by Prof. Antoni Chan in the Department of Computer Science.

Our main research activities include:

  • Computer Vision, Surveillance
  • Machine Learning, Pattern Recognition
  • Computer Audition, Music Information Retrieval
  • Eye Gaze Analysis

For more information about our current research, please visit the projects and publications pages.

Opportunities for graduate students and research assistants: if you are interested in joining the lab, please check this information.

Latest News [more]

  • [May 27, 2025] Congratulations to Chenyang for defending her thesis!
  • [Feb 11, 2025] Congratulations to Jiuniu for defending his thesis!
  • [Apr 9, 2024] Congratulations to Qiangqiang for defending his thesis!
  • [Jun 16, 2023] Congratulations to Hui for defending her thesis!

Recent Publications [more]

  • Video Individual Counting for Moving Drones.
    Yaowu Fan, Jia Wan, Tao Han, Antoni B. Chan, and Jinhua Ma,
    In: International Conf. Computer Vision (ICCV), Honolulu, to appear 2025.
  • Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking.
    Qiangqiang Wu, Yi Yu, Chenqi Kong, Ziquan Liu, Jia Wan, Haoliang Li, Alex C. Kot, and Antoni B. Chan,
    In: International Conf. Computer Vision (ICCV), Honolulu, to appear 2025.
  • Explaining Object Detection Through Difference Map.
    Shujun Xia, Chenyang Zhao, and Antoni B. Chan,
    In: International Conf. Computer Vision (ICCV) 2025 Workshop on Explainable Computer Vision (eXCV), Honolulu, to appear 2025.
  • Large language model tokens are psychologically salient.
    David A. Haslett, Antoni B. Chan, and Janet H. Hsiao,
    In: Annual Conference of the Cognitive Science Society (CogSci), San Francisco, to appear Jul 2025.
  • Whose Values Prevail? Bias in Large Language Model Value Alignment.
    Ruoxi Qi, Gleb Papyshev, Kellee Tsai, Antoni B. Chan, and Janet H. Hsiao,
    In: Annual Conference of the Cognitive Science Society (CogSci), San Francisco, to appear Jul 2025.
  • Eye movement behavior during mind wandering in older adults.
    Xiaoru Teng, Gloria Wong, Antoni B. Chan, and Janet H. Hsiao,
    In: Annual Conference of the Cognitive Science Society (CogSci), San Francisco, to appear Jul 2025.
  • DistinctAD: Distinctive Audio Description Generation in Contexts.
    Bo Fang, Wenhao Wu, Qiangqiang Wu, YuXin Song, and Antoni B. Chan,
    In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).
  • Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting.
    Wei Lin, Chenyang Zhao, and Antoni B. Chan,
    In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).
  • Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging.
    Xianrui Li, Yufei Cui, Jun Li, and Antoni B. Chan,
    In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).
  • Speaker’s Use of Mental Verbs to Convey Belief States: A Comparison between Humans and Large Language Model (LLM).
    Ruoxi Qi, Zixuan Wang, Antoni B. Chan, and Janet H. Hsiao,
    In: 19th International Pragmatics Conference, Brisbane, Jun 2025.

Recent Project Pages [more]

Continual Learning MIL

We pinpoint catastrophic forgetting to the attention layers of attention-MIL models for whole-slide images and introduce two remedies: Attention Knowledge Distillation (AKD) to retain attention weights across tasks and a Pseudo-Bag Memory Pool (PMP) that keeps only the most informative patches. Combined, AKD and PMP achieve state-of-the-art continual-learning accuracy while sharply cutting memory usage on diverse WSI datasets.

  • Xianrui Li, Yufei Cui, Jun Li, and Antoni B. Chan, "Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging." In: IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (highlight).
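As a rough sketch of the AKD idea only (not the released code; the gated-attention module, layer sizes, and loss weight below are illustrative assumptions), distillation keeps the new model's attention over patches close to that of a frozen copy trained on previous tasks:

    # Minimal PyTorch sketch of attention knowledge distillation (AKD) for an
    # attention-MIL model; module design, sizes, and loss weight are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionMIL(nn.Module):
        """Attention-based MIL pooling over a bag of patch features."""
        def __init__(self, dim=512, n_classes=2):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
            self.head = nn.Linear(dim, n_classes)

        def forward(self, bag):                       # bag: (n_patches, dim)
            a = torch.softmax(self.attn(bag), dim=0)  # attention over patches
            z = (a * bag).sum(dim=0)                  # pooled bag embedding
            return self.head(z), a

    def akd_loss(a_new, a_old):
        """Keep the new model's attention close to the frozen old model's."""
        return F.kl_div(torch.log(a_new + 1e-8), a_old, reduction="sum")

    old_model, new_model = AttentionMIL(), AttentionMIL()
    old_model.load_state_dict(new_model.state_dict())  # stand-in for the previous-task model
    for p in old_model.parameters():
        p.requires_grad_(False)

    bag, label = torch.randn(64, 512), torch.tensor([1])  # one toy WSI bag
    logits, a_new = new_model(bag)
    with torch.no_grad():
        _, a_old = old_model(bag)
    loss = F.cross_entropy(logits.unsqueeze(0), label) + 0.5 * akd_loss(a_new, a_old)
    loss.backward()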

DistinctAD: Distinctive Audio Description Generation in Contexts

We propose DistinctAD, a two-stage framework for automatically generating audio descriptions (ADs) for movies and TV series. DistinctAD aims to generate distinctive and interesting ADs even for similar contextual video clips.

Adversarial-Noise Watermark Framework

We propose a novel watermarking framework that leverages adversarial attacks to embed watermarks into images via two secret keys (network and signature) and deploys hypothesis tests to detect these watermarks with statistical guarantees.
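For intuition, detection of this kind can be framed as a one-sided binomial test: under the null hypothesis that an image carries no watermark, each decoded bit matches the secret signature only by chance. The sketch below is an illustrative stand-in, not the paper's actual scheme:

    # Sketch of watermark detection as a hypothesis test (illustrative only):
    # under H0 (no watermark), each decoded bit matches the secret signature
    # with probability 1/2, so the match count is Binomial(n, 1/2).
    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(0)
    signature = rng.integers(0, 2, size=256)     # secret signature key

    def detect(decoded_bits, signature, alpha=1e-6):
        matches = int(np.sum(decoded_bits == signature))
        p = binomtest(matches, n=len(signature), p=0.5, alternative="greater").pvalue
        return p < alpha, p                      # reject H0 => watermark present

    # A watermarked image decodes to (nearly) the signature; a clean one does not.
    watermarked = signature ^ (rng.random(256) < 0.05)  # 5% bit errors
    clean = rng.integers(0, 2, size=256)
    print(detect(watermarked, signature))        # (True, tiny p-value)
    print(detect(clean, signature))              # (False, large p-value)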

Scalable Video Object Segmentation with Simplified Framework

We propose a Simplified VOS framework (SimVOS), which removes the hand-crafted feature extraction and matching modules in previous approaches, to perform joint feature extraction and interaction via a single scalable transformer backbone. We also demonstrate that large-scale self-supervised pre-trained models can provide significant benefits to the VOS task. In addition, a new token refinement module is proposed to achieve a better speed-accuracy trade-off for scalable video object segmentation.
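The sketch below illustrates only the joint-processing idea (it is not the released SimVOS code; the encoder size and token counts are placeholders): reference and search tokens are concatenated and passed through one transformer, so matching emerges inside the shared attention layers rather than in a separate hand-crafted module:

    # Sketch of joint feature extraction + interaction with a single backbone
    # (placeholder sizes, not the released SimVOS code).
    import torch
    import torch.nn as nn

    dim, n_tokens = 384, 196                      # e.g. 14x14 patch tokens per frame
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True),
        num_layers=4,
    )

    ref_tokens = torch.randn(1, n_tokens, dim)    # embedded reference frame (+ mask)
    search_tokens = torch.randn(1, n_tokens, dim) # embedded current search frame
    joint = encoder(torch.cat([ref_tokens, search_tokens], dim=1))

    # Reference-search interaction happens inside the shared attention layers;
    # the search half of the output would feed a segmentation head.
    search_feat = joint[:, n_tokens:]             # (1, 196, dim)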

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

We study masked autoencoder (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).

Recent Datasets and Code [more]

Modeling Eye Movements with Deep Neural Networks and Hidden Markov Models (DNN+HMM)

This is the toolbox for modeling eye movements and feature learning with deep neural networks and hidden Markov models (DNN+HMM).
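For a flavor of the HMM half of the approach (this is not the toolbox's API; it uses the third-party hmmlearn package on synthetic fixations), each hidden state corresponds to a region of interest and the transition matrix captures the temporal order of eye movements:

    # Illustration with hmmlearn (not the toolbox's API): hidden states act as
    # regions of interest (ROIs) over fixation locations, and state transitions
    # model the temporal order of eye movements.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Two synthetic scanpaths of (x, y) fixation coordinates.
    scanpaths = [rng.normal([200, 150], 20, size=(12, 2)),
                 rng.normal([400, 300], 20, size=(9, 2))]

    X = np.concatenate(scanpaths)                 # all fixations, stacked
    model = hmm.GaussianHMM(n_components=2, covariance_type="full", random_state=0)
    model.fit(X, lengths=[len(s) for s in scanpaths])

    print(model.means_)                           # estimated ROI centers
    print(model.predict(scanpaths[0]))            # ROI sequence for one scanpath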

Dolphin-14k: Chinese White Dolphin detection dataset

A dataset consisting of Chinese White Dolphin (CWD) instances and distractors for detection tasks.

Crowd counting: Zero-shot cross-domain counting

Code for zero-shot cross-domain crowd counting.

CVCS: Cross-View Cross-Scene Multi-View Crowd Counting Dataset

Synthetic dataset for cross-view cross-scene multi-view counting. The dataset contains 31 scenes, each with about 100 camera views. For each scene, we capture 100 multi-view images of crowds.

Crowd counting: Generalized loss function

Generalized loss function for crowd counting.