About

Welcome to the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong! The lab is directed by Prof. Antoni Chan in the Department of Computer Science.

Our main research activities include:

  • Computer Vision, Surveillance
  • Machine Learning, Pattern Recognition
  • Computer Audition, Music Information Retrieval
  • Eye Gaze Analysis

For more information about our current research, please visit the projects and publications pages.

Opportunities for graduate students and research assistants: if you are interested in joining the lab, please see this information.

Latest News [more]

  • [Apr 9, 2024]

    Congratulations to Qiangqiang for defending his thesis!

  • [Jun 16, 2023]

    Congratulations to Hui for defending her thesis!

  • [Jan 19, 2023]

    Congratulations to Xueying for defending her thesis!

  • [Dec 9, 2022]

    Congratulations to Ziquan for defending his thesis!

Recent Publications [more]

  • Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM.
    Jia Wan, Qiangqiang Wu, Wei Lin, and Antoni B. Chan,
    In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
  • A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks.
    Feiyu Chen, Wei Lin, Ziquan Liu, and Antoni B. Chan,
    In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
  • Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training.
    Qiangqiang Wu, Yan Xia, Jia Wan, and Antoni B. Chan,
    In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
  • Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization.
    Qi Zhang, Kaiyi Zhang, Antoni B. Chan, and Hui Huang,
    In: European Conference on Computer Vision (ECCV), Milano, Oct 2024. [Project&Code]
  • FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models.
    Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, and Antoni B. Chan,
    In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
  • Human attention guided explainable artificial intelligence for computer vision models.
    Guoyang Liu, Jindi Zhang, Antoni B. Chan, and Janet H. Hsiao,
    Neural Networks, 177:106392, Sep 2024.
  • Edit Temporal-Consistent Videos with Image Diffusion Model.
    Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, and Zhen Cui,
    ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), to appear 2024.
  • Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention.
    Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan,
    International Journal of Computer Vision (IJCV), to appear 2024.
  • Gradient-based Visual Explanation for Transformer-based CLIP.
    Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, and Antoni B. Chan,
    In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.
  • The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks.
    Ziquan Liu, Yufei Cui, Yan Yan, Yi Xu, Xiangyang Ji, Xue Liu, and Antoni B. Chan,
    In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.

Recent Project Pages [more]

Adversarial-Noise Watermark Framework

We propose a novel watermarking framework that leverages adversarial attacks to embed watermarks into images via two secret keys (a network and a signature), and deploys hypothesis tests to detect these watermarks with statistical guarantees. A minimal sketch of the embed-and-verify loop follows the citation below.

  • Feiyu Chen, Wei Lin, Ziquan Liu, and Antoni B. Chan, "A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks." In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
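
As a rough illustration of the embed-and-verify pipeline, the sketch below uses a random CNN as a stand-in for the secret key network and a random ±1 vector as the secret signature: a PGD-style perturbation pushes the key network's output signs to match the signature, and detection counts matching signs against a Binomial(K, 1/2) null hypothesis. All names and hyperparameters here are hypothetical assumptions, not the paper's released code.

    # Hedged sketch, not the authors' implementation: random key network,
    # random +/-1 signature, PGD-style embedding, binomial-test detection.
    import torch
    import torch.nn as nn
    from scipy.stats import binomtest

    torch.manual_seed(0)
    K = 64  # signature length (hypothetical)

    key_net = nn.Sequential(              # secret key network (kept private)
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(16 * 16, K),
    )
    signature = torch.randint(0, 2, (K,)) * 2.0 - 1.0  # secret +/-1 signature

    def embed(image, steps=300, eps=0.05, lr=5e-3):
        """PGD-style attack: nudge the image so key_net's output signs match the signature."""
        delta = torch.zeros_like(image, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        for _ in range(steps):
            logits = key_net(image + delta)[0]
            loss = torch.relu(0.1 - signature * logits).mean()  # per-bit margin loss
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():                # keep the perturbation small
                delta.clamp_(-eps, eps)
        return (image + delta).detach().clamp(0, 1)

    def detect(image, alpha=1e-3):
        """Under H0 (no watermark) each sign matches w.p. 1/2, so the match
        count is Binomial(K, 1/2); reject H0 when the p-value is below alpha."""
        matches = int((key_net(image)[0].sign() == signature).sum())
        p = binomtest(matches, K, 0.5, alternative="greater").pvalue
        return p < alpha, p

    x = torch.rand(1, 3, 32, 32)   # toy image
    wm = embed(x)
    print(detect(x))    # likely (False, p near 0.5): clean image
    print(detect(wm))   # likely (True, tiny p): watermarked image
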
Scalable Video Object Segmentation with Simplified Framework

We propose a Simplified VOS framework (SimVOS) that removes the hand-crafted feature extraction and matching modules of previous approaches and instead performs joint feature extraction and interaction via a single scalable transformer backbone. We also demonstrate that large-scale self-supervised pre-trained models provide significant benefits to the VOS task. In addition, we propose a new token refinement module that achieves a better speed-accuracy trade-off for scalable video object segmentation.
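
The sketch below illustrates the joint-extraction-and-interaction idea under simplified assumptions: reference and search frames are patchified and concatenated into one token sequence for a single shared transformer, so matching emerges inside self-attention rather than in a separate matching module. The shapes, layer sizes, and the omission of mask conditioning are all assumptions for illustration, not the released SimVOS code.

    # Minimal sketch: one transformer backbone does extraction + interaction.
    import torch
    import torch.nn as nn

    embed_dim, n_heads, n_layers, patch = 256, 8, 4, 16

    patchify = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True),
        num_layers=n_layers,
    )

    def joint_encode(reference, search):
        ref_tok = patchify(reference).flatten(2).transpose(1, 2)   # (B, Nr, C)
        sea_tok = patchify(search).flatten(2).transpose(1, 2)      # (B, Ns, C)
        tokens = torch.cat([ref_tok, sea_tok], dim=1)              # one sequence
        out = encoder(tokens)                  # extraction + interaction jointly
        return out[:, ref_tok.shape[1]:]       # search-frame tokens for a mask head

    ref = torch.rand(1, 3, 224, 224)   # reference frame (mask conditioning omitted)
    sea = torch.rand(1, 3, 224, 224)   # current frame
    print(joint_encode(ref, sea).shape)   # (1, 196, 256)
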

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

We study masked autoencoder (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
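
A hedged sketch of the spatial-attention-dropout idea: for a two-frame clip, a fraction of same-frame keys is masked out in self-attention, so reconstruction has to borrow cues from the other frame, encouraging temporal matching. The masking scheme and shapes below are simplified assumptions, not the released DropMAE code.

    # Simplified illustration (assumes PyTorch >= 2.0); not the paper's code.
    import torch
    import torch.nn.functional as F

    def spatial_attention_dropout_mask(n_per_frame, p_drop):
        """Boolean mask (True = may attend) for a two-frame token sequence that
        drops each within-frame key with probability p_drop; cross-frame
        attention is always kept."""
        n = 2 * n_per_frame
        same = torch.zeros(n, n, dtype=torch.bool)
        same[:n_per_frame, :n_per_frame] = True      # frame 1 -> frame 1
        same[n_per_frame:, n_per_frame:] = True      # frame 2 -> frame 2
        dropped = same & (torch.rand(n, n) < p_drop)
        return ~dropped

    tokens = torch.rand(1, 2 * 49, 128)              # 2 frames x 49 patch tokens
    keep = spatial_attention_dropout_mask(49, p_drop=0.5)
    out = F.scaled_dot_product_attention(tokens, tokens, tokens, attn_mask=keep)
    print(out.shape)                                 # (1, 98, 128)
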

Grad-ECLIP: Gradient-based Visual Explanation for CLIP

We propose a Gradient-based visual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. A simplified sketch follows the citation below.

  • Chenyang Zhao, Kun Wang, Xingyu Zeng, Rui Zhao, and Antoni B. Chan, "Gradient-based Visual Explanation for Transformer-based CLIP." In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.
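
The sketch below shows a generic Grad-CAM-style version of the idea, not the exact Grad-ECLIP formulation: the image-text similarity is backpropagated to a spatial visual feature map, the gradients weight the channels, and the rectified sum becomes a relevance heatmap. ToyCLIP is a hypothetical stand-in model, not CLIP itself.

    # Grad-CAM-style sketch on a toy CLIP-like model; assumptions throughout.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyCLIP(nn.Module):
        """Stand-in for CLIP: a conv visual encoder and a fixed text embedding."""
        def __init__(self, dim=64):
            super().__init__()
            self.visual = nn.Conv2d(3, dim, 3, padding=1)
            self.text_emb = nn.Parameter(torch.randn(dim))

        def forward(self, image):
            fmap = self.visual(image)                  # (B, C, H, W) spatial features
            img_emb = fmap.mean(dim=(2, 3))            # pooled image embedding
            sim = F.cosine_similarity(img_emb, self.text_emb[None], dim=1)
            return sim, fmap

    model = ToyCLIP()
    sim, fmap = model(torch.rand(1, 3, 32, 32))
    fmap.retain_grad()
    sim.sum().backward()                               # d(similarity)/d(feature map)

    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)      # channel importance
    heatmap = F.relu((weights * fmap).sum(dim=1)).detach()  # (B, H, W) relevance
    heatmap = heatmap / (heatmap.max() + 1e-8)              # normalize for display
    print(heatmap.shape)
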
Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios

We propose a batch-mode Pareto Optimization Active Learning (POAL) framework for active learning under out-of-distribution (OOD) data scenarios.
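
As a rough illustration of Pareto-based batch selection, the sketch below scores each unlabeled sample on two axes, an informativeness score and an in-distribution score, and keeps the non-dominated set. The scores are random stand-ins, and POAL's actual scoring functions and solver differ.

    # Toy non-dominated selection over two criteria; assumes distinct scores.
    import numpy as np

    rng = np.random.default_rng(0)
    informativeness = rng.random(200)   # e.g., predictive entropy (stand-in)
    in_dist_score = rng.random(200)     # e.g., negative OOD score (stand-in)

    def pareto_front(a, b):
        """Indices of non-dominated points when maximizing both a and b."""
        idx = np.argsort(-a)            # scan in decreasing a
        front, best_b = [], -np.inf
        for i in idx:
            if b[i] > best_b:           # not dominated by anything seen so far
                front.append(i); best_b = b[i]
        return np.array(front)

    batch = pareto_front(informativeness, in_dist_score)
    print(len(batch), "non-dominated samples selected:", batch[:10])
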

Recent Datasets and Code [more]

Modeling Eye Movements with Deep Neural Networks and Hidden Markov Models (DNN+HMM)

This toolbox supports modeling eye movements and learning features with deep neural networks and hidden Markov models (DNN+HMM).
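
The snippet below sketches only the HMM side of the approach using hmmlearn on synthetic fixation data (the toolbox itself pairs the HMMs with DNN-based feature learning): each hidden state plays the role of a region of interest, and the transition matrix captures the scanpath between regions.

    # Minimal hmmlearn sketch with synthetic scanpaths; not the toolbox code.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Synthetic fixation sequences: (x, y) positions alternating between two ROIs.
    seqs = [
        np.concatenate([rng.normal([100, 120], 10, (8, 2)),
                        rng.normal([300, 200], 10, (7, 2))])
        for _ in range(20)
    ]
    X = np.concatenate(seqs)             # stacked observations
    lengths = [len(s) for s in seqs]     # per-sequence lengths

    model = hmm.GaussianHMM(n_components=2, covariance_type="full", random_state=0)
    model.fit(X, lengths)                # EM over all scanpaths

    print("ROI centers:\n", model.means_)      # recovered fixation clusters
    print("Transitions:\n", model.transmat_.round(2))
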

Dolphin-14k: Chinese White Dolphin detection dataset

A dataset consisting of Chinese White Dolphins (CWD) and distractors for detection tasks.

Crowd counting: Zero-shot cross-domain counting

Code for zero-shot cross-domain crowd counting.

CVCS: Cross-View Cross-Scene Multi-View Crowd Counting Dataset

Synthetic dataset for cross-view cross-scene multi-view counting. The dataset contains 31 scenes, each with around 100 camera views. For each scene, we capture 100 multi-view images of crowds.

Crowd counting: Generalized loss function

Generalized loss function for crowd counting.