The Audio descriptors comprise two classes: block-level features (BLF) and i-vector features. The BLF set contains six descriptors computed from the audio spectrum, including spectral patterns (modeling the timbral content), fluctuation patterns (modeling the strength of recurring beats over various frequency bands), and correlation patterns (modeling the correlation between frequency bands to uncover harmonic characteristics). The i-vector features describe timbral characteristics of audio by modeling distributions over Mel-frequency cepstral coefficients (MFCCs).
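To make the idea behind correlation patterns concrete, the following minimal sketch computes pairwise Pearson correlations between frequency bands within a single block of spectrogram frames. It is a simplified illustration under our own assumptions, not the BLF extraction pipeline itself: the function names `pearson` and `band_correlation`, the toy block values, and the choice of Pearson correlation as the similarity measure are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (an assumption made
    here to avoid division by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def band_correlation(block):
    """Correlate every pair of frequency bands in one block.

    block[b][t] is the energy of band b at frame t.
    The result is a (bands x bands) symmetric matrix.
    """
    return [[pearson(band_i, band_j) for band_j in block] for band_i in block]


# Toy block with 3 bands and 4 frames (made-up numbers, not real audio).
block = [
    [1.0, 2.0, 3.0, 4.0],  # band 0 rises over time
    [2.0, 4.0, 6.0, 8.0],  # band 1 rises in step with band 0
    [4.0, 3.0, 2.0, 1.0],  # band 2 falls as band 0 rises
]
corr = band_correlation(block)
```

In this toy block, bands 0 and 1 move together while band 2 moves in the opposite direction, so the matrix separates co-varying bands from anti-correlated ones.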
The Visual descriptors likewise consist of two categories: Aesthetic features and AlexNet features, each provided under several aggregation schemes. The Aesthetic features were first proposed to measure the aesthetic value of coral reef pictures and were later applied to artwork and photographic aesthetics. They comprise three types of descriptors: color-based, texture-based, and object-based. The AlexNet deep neural network was developed for scene and object recognition tasks; in our context, we use the output values of its fc7 layer.
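As an illustration of what an aggregation scheme does, the sketch below collapses a sequence of per-frame feature vectors into a single clip-level vector. The specific schemes shown (mean, median, max), the function name `aggregate`, and the toy 3-dimensional vectors are assumptions chosen for illustration; the text above does not state which schemes are used, and real fc7 vectors have 4096 dimensions.

```python
from statistics import mean, median


def aggregate(frames, scheme="mean"):
    """Collapse per-frame feature vectors into one clip-level vector.

    frames: list of equal-length feature vectors, one per frame.
    scheme: "mean", "median", or "max" (illustrative choices only).
    """
    reducers = {"mean": mean, "median": median, "max": max}
    if scheme not in reducers:
        raise ValueError(f"unknown aggregation scheme: {scheme!r}")
    reduce_fn = reducers[scheme]
    # zip(*frames) regroups the values by feature dimension.
    return [reduce_fn(dimension) for dimension in zip(*frames)]


# Three toy frames with 3-dimensional features (made-up numbers).
frames = [
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 1.0],
    [5.0, 4.0, 1.0],
]
clip_mean = aggregate(frames, "mean")
clip_median = aggregate(frames, "median")
clip_max = aggregate(frames, "max")
```

Each scheme yields a fixed-length descriptor regardless of the number of frames, which is what allows clips of different durations to be compared directly.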