Multimodal video classification

While many existing video classification datasets are specialized, focusing on areas such as human actions or facial expressions, my research targets the classification of arbitrary videos. I leverage the multimodal nature of these videos, using information from video frames, audio, and text.

There is a growing trend of designing overly complex architectures for marginal improvements on specific benchmarks. In contrast, I advocate for reusing pretrained networks, which reduces training time and energy costs while building on the strengths of these models.

Videos are inherently complex, containing objects, scenes, sounds, speech, music, and text. Designing a single network capable of capturing all this information is impractical. Instead, I use pretrained models for tasks such as image captioning, facial expression recognition, automatic speech recognition, audio event classification, optical character recognition, and text encoding. I then train compact yet sophisticated neural networks to fuse these features, while capturing both inter- and intra-modal dependencies.
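To make the fusion idea concrete, here is a minimal NumPy sketch of single-head cross-attention between two pretrained feature sequences of different lengths and dimensionalities. It is an illustration under stated assumptions, not the published architecture: the function names, the feature shapes (120 visual frames of 512 dimensions, 300 audio frames of 128 dimensions), the model width of 64, and the random projection matrices standing in for learned weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, d_model, rng):
    """Single-head attention in which `queries` attends over `context`.

    The projection matrices are random placeholders for weights that
    would normally be learned during training (an assumption of this sketch).
    """
    d_q, d_c = queries.shape[1], context.shape[1]
    w_q = rng.standard_normal((d_q, d_model)) / np.sqrt(d_q)
    w_k = rng.standard_normal((d_c, d_model)) / np.sqrt(d_c)
    w_v = rng.standard_normal((d_c, d_model)) / np.sqrt(d_c)

    q = queries @ w_q   # (n_queries, d_model)
    k = context @ w_k   # (n_context, d_model)
    v = context @ w_v   # (n_context, d_model)

    # Each query frame gets a distribution over all context frames.
    weights = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
visual = rng.standard_normal((120, 512))  # hypothetical visual features
audio = rng.standard_normal((300, 128))   # hypothetical audio features

# Inter-modal: every visual frame attends over every audio frame.
fused, weights = cross_attention(visual, audio, d_model=64, rng=rng)
print(fused.shape, weights.shape)
```

Passing the same sequence as both `queries` and `context` turns this into self-attention, which is one way to model intra-modal dependencies. Neither sequence is temporally pooled before fusion, and the two modalities do not need matching lengths or feature sizes.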

As a result, I achieve state-of-the-art performance on prominent video benchmarks such as MovieNet and Ekman-6. My main findings are published in the Q1 journal Expert Systems with Applications (Sulun et al., 2024). Additionally, I developed an online demo on Google Colab for video emotion classification, which I presented at the 2024 IEEE International Symposium on Multimedia (Sulun et al., 2024). To the best of my knowledge, this is the only existing online application for video emotion classification. Furthermore, despite extensive searches, I have been unable to find any open-source pretrained models for this task.

I also make the extracted pretrained features for the Ekman-6 and VideoEmotion8 datasets available on Zenodo. I additionally include an ekman6_blacklist.txt file, listing the videos that I found to be wrongly labeled.

https://zenodo.org/records/13624583
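A blacklist like this is typically applied by dropping the listed videos before training or evaluation. The sketch below shows one way to do that. It assumes the file holds one video identifier per line, which is a guess about the format rather than something documented here, and the function names and example identifiers are hypothetical.

```python
from pathlib import Path


def parse_blacklist(text):
    """Return the set of non-empty, whitespace-stripped lines in `text`.

    Assumes one video identifier per line (an assumption of this sketch).
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def load_blacklist(path):
    """Read a blacklist file from disk and parse it."""
    return parse_blacklist(Path(path).read_text(encoding="utf-8"))


def filter_videos(video_ids, blacklist):
    """Keep only the videos that are not blacklisted, preserving order."""
    return [video_id for video_id in video_ids if video_id not in blacklist]


# Hypothetical identifiers, for illustration only.
example_blacklist = parse_blacklist("video_a\n\n  video_b  \n")
kept = filter_videos(["video_a", "video_c", "video_b"], example_blacklist)
print(kept)
```

With the real file, `load_blacklist("ekman6_blacklist.txt")` would replace the inline string, provided the one-identifier-per-line assumption holds.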

References

Movie Trailer Genre Classification Using Multimodal Pretrained Features

Serkan Sulun, Paula Viana, and Matthew E. P. Davies

Expert Systems with Applications, Feb 2024


We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available.
@article{trailer,
  title = {Movie Trailer Genre Classification Using Multimodal Pretrained Features},
  author = {Sulun, Serkan and Viana, Paula and Davies, Matthew E.P.},
  year = {2024},
  journal = {Expert Systems with Applications},
  volume = {258},
  pages = {125209},
  issn = {0957-4174},
  doi = {10.1016/j.eswa.2024.125209},
}

VEMOCLAP: A Video Emotion Classification Web Application

Serkan Sulun, Paula Viana, and Matthew E. P. Davies

In 2024 International Symposium on Multimedia (ISM), Dec 2024


We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve our previous work, which exploits open-source pretrained models that work on video frames and audio, and then efficiently fuse the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or YouTube videos. We invite the readers to try our application at https://serkansulun.com/app
@inproceedings{vemoclap,
  title = {VEMOCLAP: A Video Emotion Classification Web Application},
  shorttitle = {VEMOCLAP},
  booktitle = {2024 International Symposium on Multimedia (ISM)},
  author = {Sulun, Serkan and Viana, Paula and Davies, Matthew E. P.},
  year = {2024},
  month = dec,
  pages = {137--140},
  publisher = {IEEE Computer Society},
  doi = {10.1109/ISM63611.2024.00029},
  urldate = {2025-05-17},
  isbn = {9798331511111},
  langid = {english},
}