About
Welcome to the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong! The lab is directed by Prof. Antoni Chan in the Department of Computer Science.
Our main research activities include:
- Computer Vision, Surveillance
- Machine Learning, Pattern Recognition
- Computer Audition, Music Information Retrieval
- Eye Gaze Analysis
For more information about our current research, please visit the projects and publication pages.
Opportunities for graduate students and research assistants – if you are interested in joining the lab, please check this information.
Latest News [more]
- [Apr 9, 2024]
Congratulations to Qiangqiang for defending his thesis!
- [Jun 16, 2023]
Congratulations to Hui for defending her thesis!
- [Jan 19, 2023]
Congratulations to Xueying for defending her thesis!
- [Dec 9, 2022]
Congratulations to Ziquan for defending his thesis!
Recent Publications [more]
- Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM. In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
- A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks. In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
- Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training. In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
- Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization. In: European Conference on Computer Vision (ECCV), Milano, Oct 2024. [Project&Code]
- FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models. In: European Conference on Computer Vision (ECCV), Milano, Oct 2024.
- Human attention guided explainable artificial intelligence for computer vision models. Neural Networks, 177:106392, Sep 2024.
- Edit Temporal-Consistent Videos with Image Diffusion Model. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), to appear 2024.
- Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention. International Journal of Computer Vision (IJCV), to appear 2024.
- Gradient-based Visual Explanation for Transformer-based CLIP. In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.
- The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks. In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.
Recent Project Pages [more]
Visualization Results
- Visual explanations are provided for the matching score between the image and specific text prompts, which can be nouns (e.g., car, dog) or verbs (e.g., holding, standing). The visualization comparison shows that Grad-ECLIP provides superior explanations across different types of text prompts (a minimal sketch of the underlying matching score follows this list).
- The explanation map from Grad-ECLIP can also be generated from the text-encoder viewpoint. From the explanation of a sentence, we can identify which words matter most to CLIP when matching with a specific image, while the image explanation conversely shows the text-specific important regions in the image. This word-importance visualization of the input text can be helpful when designing text prompts for image-text dual-encoders in practical applications.
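For context, the heat maps above explain CLIP's image-text matching score. Below is a minimal sketch, using the open-source `clip` package, of how that score is computed for a set of noun and verb prompts; the image path and prompt list are placeholders, and the explanation maps themselves are produced by Grad-ECLIP on top of this matching pipeline.

```python
# Minimal sketch: CLIP image-text matching scores for noun/verb prompts.
# Assumes the open-source `clip` package (https://github.com/openai/CLIP);
# "photo.jpg" and the prompt list are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
prompts = ["a photo of a dog", "a photo of a car", "a person holding a cup"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image embedding and each text embedding.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)

for prompt, score in zip(prompts, scores.tolist()):
    print(f"{prompt}: {score:.3f}")
```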
Quantitative Evaluation
- Faithfulness Evaluation
A faithful explanation method should produce heat maps highlighting the image content that has the greatest impact on the model prediction. Deletion (negative perturbation) replaces input image pixels with random values step by step, removing the most important pixels first according to the heat-map ordering, while recording the drop in prediction performance. Insertion adds image pixels to an empty image step by step, again following heat-map importance, and records the increase in performance.
A steeper performance drop over the deletion steps corresponds to a lower deletion AUC, while a quicker performance increase over the insertion steps yields a higher insertion AUC. Compared with most related works, our method obtains the fastest performance drop for Deletion and the largest performance increase for Insertion, showing that the regions highlighted in our heat maps better represent explanations of CLIP.
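The following is a minimal sketch of that deletion/insertion protocol, assuming a hypothetical `score_fn(image)` that returns the model's prediction score for a fixed target and a `heatmap` array with the same spatial size as the image; the step count and random-value baseline here are illustrative, not necessarily those used in our experiments.

```python
import numpy as np

def deletion_insertion_auc(image, heatmap, score_fn, steps=50, seed=0):
    """Perturbation-based faithfulness curves (sketch).

    Deletion: replace the most important pixels (by heat-map rank) with random
    values and record the score drop. Insertion: start from a random image and
    restore the most important pixels first, recording the score increase.
    Returns (deletion_auc, insertion_auc); lower deletion / higher insertion
    indicates a more faithful heat map.
    """
    rng = np.random.default_rng(seed)
    h, w = heatmap.shape
    order = np.argsort(heatmap.ravel())[::-1]      # most important pixels first
    noise = rng.random(image.shape)                 # random-value baseline

    del_scores, ins_scores = [], []
    for k in np.linspace(0, h * w, steps, dtype=int):
        mask = np.zeros(h * w, dtype=bool)
        mask[order[:k]] = True                      # top-k important pixels
        mask = mask.reshape(h, w)[..., None]

        deleted = np.where(mask, noise, image)      # remove important pixels
        inserted = np.where(mask, image, noise)     # add important pixels
        del_scores.append(score_fn(deleted))
        ins_scores.append(score_fn(inserted))

    # Area under each curve over the fraction of pixels perturbed.
    x = np.linspace(0.0, 1.0, steps)
    return np.trapz(del_scores, x), np.trapz(ins_scores, x)
```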
- Point Game and Segmentation Test
We next evaluate the localization ability of the visual explanations via Point Game (PG), PG-energy, and segmentation metrics, including pixel accuracy (Pixel Acc.), average precision (AP), and averaged mask intersection over union (maskIoU), treating the heat maps as soft segmentation results.
Grad-ECLIP significantly outperforms the other explanation methods, especially on PG, demonstrating that Grad-ECLIP captures CLIP's attention on the object whose category matches the text prompt. CLIPSurgery obtains higher pixel accuracy and maskIoU because it tends to assign high heat-map values to all pixels of the object region, which scores well when the heat maps are aggregated inside the object mask for these two metrics. However, its lower PG, PG-energy, and AP indicate that more high values fall outside the object boundary. Better segmentation does not necessarily yield faithful explanations in terms of the insertion and deletion metrics, as indicated in Table 1.
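For reference, here is a minimal sketch of the Point Game hit test and its energy variant, assuming a `heatmap` and a binary ground-truth `mask` of the same size; the exact implementation used in our evaluation may differ in details such as tie-breaking.

```python
import numpy as np

def point_game(heatmap, mask):
    """Hit if the heat map's maximum falls inside the ground-truth mask."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(mask[y, x])

def point_game_energy(heatmap, mask):
    """Fraction of total heat-map energy that lies inside the mask."""
    total = heatmap.sum()
    return float((heatmap * mask).sum() / total) if total > 0 else 0.0
```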
CLIP Analysis via Grad-ECLIP
- Concept decomposition and addibility in image-text matching
An interesting question is how CLIP processes combinations of words, e.g., adjective and noun, or verb and noun. We conducted experiments comparing the Grad-ECLIP explanation heat maps for single words and for combined phrases to examine how phrase matching works. From the visualization results, we infer that when matching images with phrases, the model exhibits decomposition and addibility of different concepts. This can help the model generalize to different scenarios and could be the source of CLIP's strong zero-shot ability.
- Diagnostics on attribution identification
From the visualization results, we can infer that CLIP handles common perceptual attributes like color well, but struggles with physical attributes like shape and material, and is weak at grounding objects by comparative attributes, such as size and position relationships. Related to the addibility of concepts in the previous section, it is reasonable to expect that attributes with a concrete visual appearance, such as color, contribute more to the matching score than abstract comparative attributes.
Significant progress has been achieved on improving and applying the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention has been paid to interpreting CLIP. We propose a Gradient-based visual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from previous Transformer interpretation methods that rely on self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the superiority of Grad-ECLIP compared with state-of-the-art methods. A series of analyses is conducted based on our visual explanation results, from which we explore the working mechanism of image-text matching and the strengths and limitations of CLIP in attribution identification.
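For intuition about the "channel and spatial weights on token features" mentioned above, here is a generic gradient-weighted heat-map sketch in PyTorch. The names `visual_tokens` and `similarity` are placeholders for spatial token features captured from an intermediate layer of the image encoder and the image-text matching score; this is a Grad-CAM-style approximation for illustration, not the exact Grad-ECLIP formulation (see the paper and released code for that).

```python
import torch
import torch.nn.functional as F

def gradient_heatmap(visual_tokens, similarity, grid_hw, image_hw):
    """Generic gradient-weighted heat map over spatial tokens (sketch).

    visual_tokens: (N, C) spatial token features captured from an intermediate
                   layer of the image encoder (CLS token removed), requires_grad.
    similarity:    scalar image-text matching score computed from them.
    Illustrative only; not the exact Grad-ECLIP weighting.
    """
    # Gradient of the matching score w.r.t. each token feature channel.
    grads = torch.autograd.grad(similarity, visual_tokens, retain_graph=True)[0]
    # Use the gradients as channel weights on the token features and sum over
    # channels, keeping only positive contributions to the matching score.
    heat = F.relu((grads * visual_tokens).sum(dim=-1))             # (N,)

    h, w = grid_hw                                                 # token grid size
    heat = heat.reshape(1, 1, h, w)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear",
                         align_corners=False).squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]
```

In practice, `visual_tokens` would be captured with a forward hook on the image encoder, and `similarity` would be the cosine similarity between the image and text embeddings used by CLIP.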
Selected Publications
- Gradient-based Visual Explanation for Transformer-based CLIP. In: International Conference on Machine Learning (ICML), Vienna, Jul 2024.
Results
- Evaluation Results of Grad-ECLIP
- Code is available here.
We propose a batch-mode Pareto Optimization Active Learning (POAL) framework for active learning under out-of-distribution (OOD) data scenarios (see the selection sketch after the citation below).
- "Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios." Transactions on Machine Learning Research (TMLR), June 2023.
We propose the gradient-weighted Object Detector Activation Maps (ODAM), a visual explanation technique for interpreting the predictions of object detectors, including the class score and bounding-box coordinates; a heat-map sketch follows the citations below.
- "ODAM: Gradient-based Instance-Specific Visual Explanations for Object Detection." In: Intl. Conf. on Learning Representations (ICLR), Rwanda, May 2023. [code]
- "Gradient-based Instance-Specific Visual Explanations for Object Specification and Object Discrimination." IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), to appear 2024.
We present a comprehensive comparative survey of 19 Deep Active Learning approaches for classification tasks.
Recent Datasets and Code [more]
Modeling Eye Movements with Deep Neural Networks and Hidden Markov Models (DNN+HMM)
This is the toolbox for modeling eye movements and feature learning with deep neural networks and hidden Markov models (DNN+HMM); a minimal HMM-fitting sketch appears after the citation below.
- Files: download here
- Project page
- If you use this toolbox please cite:
Understanding the role of eye movement consistency in face recognition and autism through integrating deep neural networks and hidden Markov models. npj Science of Learning, 7:28, Oct 2022.
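For readers who want a quick feel for the HMM component, here is a minimal sketch that fits a Gaussian hidden Markov model to one participant's fixation coordinates with the third-party `hmmlearn` package; the fixation data below are synthetic placeholders, and the toolbox itself implements the full DNN+HMM pipeline (including the deep-feature learning) described in the paper.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic placeholder: (x, y) coordinates of consecutive fixations from one
# participant viewing a face, clustered roughly around the eyes and the mouth.
rng = np.random.default_rng(0)
fixations = np.vstack([
    rng.normal(loc=[180, 140], scale=20, size=(60, 2)),  # upper-face fixations
    rng.normal(loc=[180, 220], scale=20, size=(60, 2)),  # lower-face fixations
])

# Fit a 2-state Gaussian HMM: hidden states act as person-specific regions of
# interest (ROIs), and the transition matrix captures fixation dynamics.
model = hmm.GaussianHMM(n_components=2, covariance_type="full",
                        n_iter=100, random_state=0)
model.fit(fixations)

states = model.predict(fixations)          # ROI assignment for each fixation
print("ROI centers:\n", model.means_)
print("Transition matrix:\n", model.transmat_)
```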
Dolphin-14k: Chinese White Dolphin detection dataset
A dataset consisting of Chinese White Dolphin (CWD) and distractors for detection tasks.
- Files: Google Drive, Readme
- Project page
- If you use this dataset please cite:
Chinese White Dolphin Detection in the Wild. In: ACM Multimedia Asia (MMAsia), Gold Coast, Australia, Dec 2021.
Crowd counting: Zero-shot cross-domain counting
Code for zero-shot cross-domain crowd counting with dynamic momentum adaptation.
- Files: github
- Project page
- If you use this toolbox please cite:
Dynamic Momentum Adaptation for Zero-Shot Cross-Domain Crowd Counting. In: ACM Multimedia (MM), Oct 2021.
CVCS: Cross-View Cross-Scene Multi-View Crowd Counting Dataset
Synthetic dataset for cross-view cross-scene multi-view counting. The dataset contains 31 scenes, each with about 100 camera views. For each scene, we capture 100 multi-view images of crowds.
- Files: Google Drive
- Project page
- If you use this dataset please cite:
Cross-View Cross-Scene Multi-View Crowd Counting. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 557-567, Jun 2021.
Crowd counting: Generalized loss function
Generalized loss function for crowd counting.
- Files: github
- Project page
- If you use this toolbox please cite:
A Generalized Loss Function for Crowd Counting and Localization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.