Abstract:
The goal of general-purpose audio tagging is to create systems capable of recognizing a
variety of sounds, including musical instruments, vehicles, animals, and sounds generated by
various kinds of human activity. The motivation for research in the field of artificial sound
understanding lies in potential applications such as security, healthcare (hearing
impairment), improvements to smart devices, and various music-related tasks.
The main contribution of this work is an extensive study and comparison of audio
tagging systems on a dataset of 11,073 audio recordings. In this thesis, we have
carried out two sets of experiments. First, we have examined deep Convolutional Neural
Networks (CNNs) and three of their variants, namely the Convolutional Recurrent Neural
Network (CRNN), the Gated Convolutional Recurrent Neural Network (GCRNN), and the Gated
Convolutional Neural Network (GCNN), using log-Mel spectrogram features.
We have supported our analysis and discussion with numerous statistical tests that analyze
and compare the effect of the above-mentioned features and models on tagging performance.
Our experimental findings indicate that our systems capture a diverse set of sound events,
with varying confidence. Moreover, the CRNN significantly outperforms the other models.
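To make the feature pipeline concrete, the following minimal sketch shows one common way
to compute log-Mel spectrogram features with librosa; the sampling rate, FFT size, hop
length, and number of Mel bands are illustrative assumptions, not the exact settings used
in this thesis.

    import librosa
    import numpy as np

    def log_mel_features(path, sr=32000, n_mels=64, n_fft=1024, hop_length=500):
        # NOTE: all parameter values here are assumed for illustration only.
        # Load the recording at a fixed sampling rate, mixed down to mono.
        y, _ = librosa.load(path, sr=sr, mono=True)
        # Mel-scaled power spectrogram.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        # Convert power to decibels: the "log" in log-Mel.
        return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

The resulting (n_mels, n_frames) matrix is the two-dimensional input consumed by the
CNN-based models.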
Second, motivated by the fact that the individual models produce diverse predictions, we have
investigated the effect of ensemble learning using a technique known as stacking. Our analysis
shows that stacking effectively combines the individual learners, resulting in better
handling of the diverse nature of the events.
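As a sketch of the stacking idea, a meta-learner is trained on the (out-of-fold)
predictions of the individual base models; the choice of logistic regression as the
meta-learner below is an assumption for illustration, since the abstract does not
specify it.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stacker(base_preds, labels):
        # base_preds: list of out-of-fold class-probability matrices, one per
        # base model (e.g. CNN, CRNN, GCRNN, GCNN), each (n_samples, n_classes).
        # labels: integer class labels of shape (n_samples,).
        meta_features = np.hstack(base_preds)  # (n_samples, n_models * n_classes)
        # Hypothetical meta-learner; the thesis's actual choice may differ.
        meta = LogisticRegression(max_iter=1000)
        meta.fit(meta_features, labels)
        return meta

Because each base model errs on different sound categories, the meta-learner can weight
their opinions per class, which is what lets the ensemble handle the diverse nature of
the events better than any single model.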
Keywords: Audio Tagging, Deep Learning, Machine Learning, Ensemble Learning,
Stacking, Feature Extraction, Statistical Tests.