Abstract:
Audio tagging is concerned with the development of systems that can recognize
sound events. Interest in audio tagging is growing for applications such as
acoustic surveillance, video content tagging, and environmental scene
recognition. Our goal is to design an audio tagging system capable of recognizing
a wide range of sound events. The development process usually requires a large set of
labeled sound data. However, most existing datasets are unlabeled, since hand-labeling
is costly, time-consuming, and labor-intensive. To address this, we have built our
audio tagging system following the Semi-Supervised
Learning (SSL) paradigm. Specifically, we have chosen the pseudo-labeling strategy to
learn from weakly labeled data. In addition, our system trains a ResNet deep learning
model on log-mel spectrograms, along with augmentation techniques to increase the
dataset size. Training uses a cyclic cosine annealing schedule for the learning rate.
We carried out our experiments on a large dataset of sound recordings, investigated
the impact of the sharpening temperature (a hyperparameter of our system) on the
distribution of the pseudo-labels, and tested ensembles of several variants of our
approach. The results demonstrate the efficacy of the pseudo-labeling SSL strategy.
Furthermore, ensembling several systems significantly boosts overall performance.
Keywords: Audio Tagging, Semi-Supervised Learning, Feature Extraction, Deep
Learning, Ensemble Learning, Statistical Tests.
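
As an illustration of two ingredients named in the abstract, the sketch below shows one common way to sharpen a predicted class distribution with a temperature and one common form of a cyclic cosine learning-rate schedule. Both formulas are assumptions made for illustration (temperature sharpening by raising probabilities to the power 1/T and renormalizing, and a cosine decay that restarts at the start of each cycle); the abstract does not state the exact variants used by the system. The function names `sharpen` and `cyclic_cosine_lr`, their parameters, and the sample values are hypothetical.

```python
import math


def sharpen(probs, temperature):
    """Sharpen a probability distribution (assumed formulation).

    Each probability is raised to the power 1/temperature and the result
    is renormalized. Temperatures below 1 concentrate mass on the most
    likely class; a temperature of 1 leaves the distribution unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def cyclic_cosine_lr(step, cycle_length, lr_max, lr_min=0.0):
    """Cosine decay from lr_max towards lr_min, restarting every cycle.

    This is an assumed schedule shape, not necessarily the one used in
    the original system.
    """
    position = (step % cycle_length) / cycle_length  # in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * position))


if __name__ == "__main__":
    # Hypothetical model prediction over three sound classes.
    prediction = [0.6, 0.3, 0.1]
    for t in (1.0, 0.5, 0.1):
        print(t, [round(p, 3) for p in sharpen(prediction, t)])

    # Learning rate sampled every 5 steps over two cycles of 10 steps.
    print([cyclic_cosine_lr(s, 10, 1e-3, 1e-5) for s in range(0, 20, 5)])
```

Lowering the temperature pushes the pseudo-label distribution towards a one-hot vector, which is why the temperature shapes the distribution of the pseudo-labels; the restart in the schedule returns the learning rate to its maximum at the start of each cycle.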