About

Welcome to the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong! The lab is directed by Prof. Antoni Chan in the Department of Computer Science.

Our main research activities include:

  • Computer Vision, Surveillance
  • Machine Learning, Pattern Recognition
  • Computer Audition, Music Information Retrieval
  • Eye Gaze Analysis

For more information about our current research, please visit the projects and publications pages.

Opportunities for graduate students and research assistants – if you are interested in joining the lab, please check this information.

Latest News [more]

  • [May 27, 2025]

    Congratulations to Chenyang for defending her thesis!

  • [Feb 11, 2025]

    Congratulations to Jiuniu for defending his thesis!

  • [Apr 9, 2024]

    Congratulations to Qiangqiang for defending his thesis!

  • [Jun 16, 2023]

    Congratulations to Hui for defending her thesis!

Recent Publications [more]

Recent Project Pages [more]

Continual Learning MIL

We pinpoint catastrophic forgetting to the attention layers of attention-MIL models for whole-slide images and introduce two remedies: Attention Knowledge Distillation (AKD) to retain attention weights across tasks and a Pseudo-Bag Memory Pool (PMP) that keeps only the most informative patches. Combined, AKD and PMP achieve state-of-the-art continual-learning accuracy while sharply cutting memory usage on diverse WSI datasets.

Image Editing with Diffusion Model from Frequency Perspective

We introduce FreeDiff, a novel fine-tuning-free approach that employs progressive frequency truncation to refine the guidance of diffusion models for universal editing tasks.

DistinctAD: Distinctive Audio Description Generation in Contexts

We propose DistinctAD, a two-stage framework for automatically generating audio descriptions (ADs) for movies and TV series. DistinctAD aims to generate distinctive and interesting ADs for contextually similar video clips.

P2R Loss for Semi-Supervised Counting

We introduce a Point-to-Region (P2R) loss to address the over-activation and pseudo-label propagation issues inherent in semi-supervised crowd counting. By replacing pixel-level matching with region-level supervision, P2R suppresses background noise and achieves state-of-the-art results with significantly higher training stability.

Proximal Mapping Loss for Crowd Counting

We propose the Proximal Mapping Loss (PML), a theoretically grounded framework that discards the unrealistic “non-overlap” assumption common in crowd counting. By leveraging proximal operators from convex optimization, PML accurately recovers density in highly congested scenes where severe occlusions and overlapping objects are prevalent.

Recent Datasets and Code [more]

Modeling Eye Movements with Deep Neural Networks and Hidden Markov Models (DNN+HMM)

This is the toolbox for modeling eye movements and feature learning with deep neural networks and hidden Markov models (DNN+HMM).

Dolphin-14k: Chinese White Dolphin detection dataset

A dataset consisting of Chinese White Dolphins (CWD) and distractors for detection tasks.

Crowd counting: Zero-shot cross-domain counting

Generalized loss function for crowd counting.

CVCS: Cross-View Cross-Scene Multi-View Crowd Counting Dataset

Synthetic dataset for cross-view cross-scene multi-view counting. The dataset contains 31 scenes, each with about 100 camera views. For each scene, we capture 100 multi-view images of crowds.

Crowd counting: Generalized loss function

Generalized loss function for crowd counting.