Welcome to the Video, Image, and Sound Analysis Lab (VISAL) at the City University of Hong Kong! The lab is directed by Dr. Antoni Chan in the Department of Computer Science.

Our main research activities include:

  • Computer Vision, Surveillance
  • Machine Learning, Pattern Recognition
  • Computer Audition, Music Information Retrieval
  • Eye Gaze Analysis

For more information about our current research, please visit the projects and publication pages.

Opportunities for graduate students and research assistants: if you are interested in joining the lab, please see this information.

Latest News [more]

  • [May 6, 2019]

    Congratulations to Di for defending his thesis!

  • [Nov 1, 2018]

    Congratulations to Lei for defending her thesis!

  • [Jun 28, 2016]

    Congratulations to Sijin for defending his thesis!

  • [Jun 25, 2016]

    Congratulations to Adeel for winning a “Best Research Paper Award 2013/14” from the Higher Education Commission (HEC) of Pakistan for his TPAMI 2013 paper!

Recent Publications [more]

Recent Project Pages [more]

Residual Regression with Semantic Prior for Crowd Counting

In this paper, a residual regression framework is proposed for crowd counting that harnesses the correlation among samples. By incorporating this correlation information into our network, we find that the network learns more intrinsic characteristics and thus generalizes better to unseen scenarios. In addition, we show how to effectively leverage a semantic prior to improve crowd counting performance.

Simplification of Gaussian Mixture Models

An algorithm is proposed to simplify a Gaussian mixture model into a reduced mixture model with fewer components, by maximizing a variational lower bound on the expected log-likelihood of a set of virtual samples.
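The variational algorithm itself is more involved; as a simplified illustration of mixture reduction, the sketch below merges pre-assigned groups of 1-D Gaussian components by moment matching. The `groups` assignment is an assumption for illustration and is not the paper's variational method:

```python
import numpy as np

def merge_components(weights, means, variances, groups):
    """Reduce a 1-D GMM: moment-match each group of components into one Gaussian.

    The grouping is assumed given (illustrative only); each merged Gaussian
    preserves the group's total weight, mean, and second moment.
    """
    w_new, m_new, v_new = [], [], []
    for g in groups:
        w = weights[g]
        W = w.sum()
        m = (w * means[g]).sum() / W
        # Preserve the second moment: E[x^2] = var + mean^2
        v = (w * (variances[g] + means[g] ** 2)).sum() / W - m ** 2
        w_new.append(W)
        m_new.append(m)
        v_new.append(v)
    return np.array(w_new), np.array(m_new), np.array(v_new)
```

For example, merging two identical components (weight 0.25 each, mean 0, variance 1) yields a single component with weight 0.5, mean 0, and variance 1.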

Convolutional Decoders for Image Captioning

RNN-based models dominate the field of image captioning, but they have several drawbacks: (1) RNNs must be computed step by step, which is not easily parallelized; (2) there is a long path between the start and the end of the sentence in an RNN (tree structures shorten this path, but require special processing); (3) RNNs learn only a single-level representation at each time step, whereas convolutional decoders can learn multi-level representations of concepts, each of which should correspond to an image region, which should benefit word prediction.
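As a minimal sketch of point (1), a causal 1-D convolution computes each time step from a fixed window of past inputs only, so all time steps can be evaluated independently of one another (unlike an RNN's sequential recurrence). The shapes and names here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution.

    x: (T, C) input sequence; w: (k, C) kernel.
    Output y has shape (T,) and y[t] depends only on x[:t+1],
    so every t can be computed in parallel.
    """
    k, C = w.shape
    xp = np.vstack([np.zeros((k - 1, C)), x])  # left-pad so no future leaks in
    return np.array([(xp[t:t + k] * w).sum() for t in range(x.shape[0])])
```

Changing a future input leaves all earlier outputs unchanged, which is the causality property a convolutional caption decoder relies on during training.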