We tested the BoS Tree on video classification. The table below shows the results on the YUPENN dataset. In the confusion matrix, SOE tends to confuse the snowing, waterfall, fountain, and elevator classes, which have similar directional components. The BoS Tree, on the other hand, confuses water textures with one another, e.g., fountain vs. waterfall, and river vs. beach vs. ocean. Because the BoS Tree uses only grayscale video, with no spatial binning, the results suggest that modeling temporal dynamics can improve recognition of these scenes.
Comparison of BoS Tree classification rates with TSVQ [8] and various spatial and temporal methods reported in [24] on the YUPENN dynamic scenes dataset:
Confusion matrices for the YUPENN dataset using (a) a three-level BoS Tree and (b) SOE (4 × 4 × 1). Bold shows the number of correct classifications for each scene category.
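Reading a confusion matrix for class pairs such as fountain vs. waterfall amounts to ranking its off-diagonal entries. The following is a minimal sketch of that reading, not the evaluation code behind the figure; the function names are hypothetical and the labels are invented toy data, not YUPENN results.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; pairs with equal labels are correct."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused_pairs(counts, top=3):
    """Return the off-diagonal (true, predicted) pairs with the largest counts."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:top]


# Toy labels, invented for illustration only (not taken from the paper).
true_labels = ["fountain", "fountain", "fountain", "waterfall",
               "waterfall", "ocean", "ocean", "beach"]
predicted = ["fountain", "waterfall", "waterfall", "waterfall",
             "fountain", "ocean", "beach", "beach"]

counts = confusion_counts(true_labels, predicted)
pairs = most_confused_pairs(counts)
for (true_class, predicted_class), n in pairs:
    print(f"{true_class} -> {predicted_class}: {n}")
```

The diagonal entries, `counts[(c, c)]`, correspond to the bold numbers in the figure; everything else is a misclassification.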
We also tested the BoS Tree on the UCLA-8, UCLA-39, and DynTex-35 datasets. Compared to a regular BoS with a large codebook of the same size, the BoS Tree obtains similar classification rates with a speedup of 6 to 19 times.
Video classification results on UCLA-8, UCLA-39, and DynTex-35 using a large codebook, reduced codebooks, and BoS Trees:
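The speedup over a flat codebook of the same size comes from the tree structure: a flat lookup compares a query against every codeword, while a tree lookup compares it only against the children of one node per level. The sketch below illustrates this under loud assumptions: 2-D Euclidean points stand in for the codewords and similarity measure the method actually uses, and `Node`, `build_tree`, and `tree_lookup` are hypothetical names, not the authors' implementation.

```python
import math


def nearest(query, centers):
    """Index of the center closest to query, scanning all centers."""
    return min(range(len(centers)), key=lambda i: math.dist(query, centers[i]))


class Node:
    def __init__(self, center, children=None, codeword=None):
        self.center = center
        self.children = children or []
        self.codeword = codeword  # index into the flat codebook; None if internal


def build_tree(leaves):
    """Group a 4x4 grid of codewords into four 2x2 quadrants (toy construction)."""
    groups = {}
    for index, (x, y) in enumerate(leaves):
        groups.setdefault((x // 2, y // 2), []).append(Node((x, y), codeword=index))
    children = []
    for members in groups.values():
        cx = sum(m.center[0] for m in members) / len(members)
        cy = sum(m.center[1] for m in members) / len(members)
        children.append(Node((cx, cy), children=members))
    return Node(None, children=children)


def tree_lookup(root, query):
    """Greedy descent; returns (codeword index, number of distance evaluations)."""
    node, evaluations = root, 0
    while node.children:
        evaluations += len(node.children)
        best = nearest(query, [child.center for child in node.children])
        node = node.children[best]
    return node.codeword, evaluations


# 16 stand-in codewords on a grid; the flat search evaluates all of them.
leaves = [(float(x), float(y)) for x in range(4) for y in range(4)]
root = build_tree(leaves)
query = (2.9, 0.2)

flat_index = nearest(query, leaves)
tree_index, tree_evaluations = tree_lookup(root, query)
print(flat_index, tree_index, len(leaves), tree_evaluations)
```

For this query both lookups return the same codeword, with 8 distance evaluations in the tree against 16 in the flat codebook. Greedy descent is not guaranteed to match the flat search for every query, which is consistent with the classification rates being similar rather than identical.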