Abstract:
Audio tagging, also known as Sound Event Recognition, is concerned with the development
of systems that are able to recognize sound events. A sound event is perceived as a
separate individual entity that we can name and recognize, such as a helicopter, breaking glass,
a crying baby, or speech. Audio tagging has received considerable attention for
various applications, such as information retrieval, music tagging, and acoustic monitoring.
The general framework for audio tagging usually involves two major steps: feature extraction
and classification. However, obtaining well-annotated, strongly labeled data is an expensive and
time-consuming process. Therefore, a large portion of recent work has been devoted
to making effective use of weakly labeled data extracted from websites like YouTube, Freesound, or
Flickr. Various semi-supervised learning approaches have been proposed in the literature, including
Mean Teacher, Pseudo-Labeling, MixMatch, and, most recently, Deep Co-training. The
purpose of this project is to devise an audio tagging system within the semi-supervised
learning paradigm, specifically the Deep Co-training framework. Such systems use
both labeled and unlabeled audio data. Our system is built on a deep residual neural
network (ResNet) and a wide residual neural network (WideResNet), and is trained on two
datasets: UrbanSound8K and Environmental Sound Classification. We support our analysis
and discussion with numerous statistical tests comparing our results. We have
investigated the impact of varying the supervised ratio on the system’s performance
and have tested several variants of Deep Co-training (DCT) systems based on different adversarial
attacks. The results demonstrate the efficacy of the Deep Co-training semi-supervised learning strategy, which significantly
boosts the overall performance.
Keywords: Audio Tagging, Semi-supervised Learning, Deep Co-training, Feature Extraction, Statistical Tests.