Please use this address to cite this document:
https://di.univ-blida.dz/jspui/handle/123456789/40776
| Dublin Core element | Value | Language |
|---|---|---|
| dc.contributor.author | Gharouba, Hadil | - |
| dc.contributor.author | Ben Doumia, Kaouther | - |
| dc.contributor.author | Ykhlef, Hadjer (Supervisor) | - |
| dc.date.accessioned | 2025-10-26T13:57:19Z | - |
| dc.date.available | 2025-10-26T13:57:19Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.uri | https://di.univ-blida.dz/jspui/handle/123456789/40776 | - |
| dc.description | Ill., bibliogr. Call number: MA-004-1052 | fr_FR |
| dc.description.abstract | The main objective of our project is to develop an effective system for Automated Audio Captioning (AAC), a task that involves describing ambient sounds within an audio clip using a natural language sentence, effectively bridging the gap between auditory perception and linguistic expression. In recent years, AAC has gained significant attention and has seen considerable progress. Despite these advancements, the field still faces many challenges. To achieve this task, our approach follows an encoder-decoder model based on deep learning techniques. Specifically, we employ a novel, fully transformer-based architecture built around BART, which overcomes the limitations of traditional RNN and CNN approaches in AAC. The self-attention mechanism in BART facilitates better modeling of both local and global dependencies in audio signals. Our model integrates VGGish to extract audio embeddings from log-Mel spectrograms, and a BART transformer combining a bidirectional encoder and an autoregressive decoder for generating captions. Word embeddings are produced using a BPE tokenizer, which is adapted to the unique vocabulary of the training dataset, thereby aligning it with the general requirements of the captioning task. In order to improve the quality of the generated audio captions, we performed multiple experiments using the Clotho dataset. The results indicate that our model produces more accurate and diverse descriptions than existing state-of-the-art approaches. Keywords: Automated Audio Captioning, Encoder-Decoder, Deep Learning, Transformer, BART, VGGish, BPE tokenizer, Clotho | fr_FR |
| dc.language.iso | en | fr_FR |
| dc.publisher | Université Blida 1 | fr_FR |
| dc.subject | Automated Audio Captioning | fr_FR |
| dc.subject | Encoder-Decoder | fr_FR |
| dc.subject | Deep Learning | fr_FR |
| dc.subject | Transformer | fr_FR |
| dc.subject | BART | fr_FR |
| dc.subject | VGGish | fr_FR |
| dc.subject | BPE tokenizer | fr_FR |
| dc.subject | Clotho | fr_FR |
| dc.title | Encoder-decoder-based neural network architectures for automatic audio captioning | fr_FR |
| dc.type | Thesis | fr_FR |
| Collection(s): | Master's Theses | |
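
The abstract above outlines the technical pipeline: VGGish turns log-Mel spectrogram patches into 128-dimensional frame embeddings, a BART encoder-decoder generates the caption, and a BPE tokenizer is adapted to the training-caption vocabulary. The record contains no code, so the snippet below is only a minimal sketch of such a pipeline, not the authors' implementation. The `facebook/bart-base` checkpoint, the `AudioCaptioner` class, the linear projection from 128 dimensions to BART's hidden size, the `train_new_from_iterator` adaptation step, and the placeholder caption list are all illustrative assumptions; a reasonably recent Hugging Face `transformers` release is assumed for `generate(inputs_embeds=...)`.

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizerFast


class AudioCaptioner(nn.Module):
    """Hypothetical sketch: VGGish frame embeddings -> BART encoder-decoder -> caption."""

    def __init__(self, bart_name="facebook/bart-base", vggish_dim=128):
        super().__init__()
        self.bart = BartForConditionalGeneration.from_pretrained(bart_name)
        # VGGish emits one 128-d embedding per ~0.96 s log-Mel patch; project it
        # into BART's hidden size so the bidirectional encoder can consume it.
        self.proj = nn.Linear(vggish_dim, self.bart.config.d_model)

    def forward(self, audio_embeds, labels=None):
        # audio_embeds: (batch, n_frames, 128); labels: BPE ids of the reference caption.
        return self.bart(inputs_embeds=self.proj(audio_embeds), labels=labels)

    @torch.no_grad()
    def caption(self, audio_embeds, tokenizer, num_beams=4, max_length=30):
        # Autoregressive beam-search decoding from the projected audio embeddings.
        ids = self.bart.generate(inputs_embeds=self.proj(audio_embeds),
                                 num_beams=num_beams, max_length=max_length)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)


# Adapt BART's BPE tokenizer to the training-caption vocabulary. The single
# sentence below is only a placeholder for the Clotho caption corpus.
captions_corpus = ["a dog barks while rain falls steadily on a tin roof"]
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
tokenizer = tokenizer.train_new_from_iterator(captions_corpus, vocab_size=8000)

model = AudioCaptioner()
# Keep the model's embedding table consistent with the adapted tokenizer.
model.bart.resize_token_embeddings(len(tokenizer))

# Dummy stand-in for VGGish output (e.g. from a torchvggish port, not shown):
# one clip, 10 frames, one 128-d embedding per frame.
dummy_audio = torch.randn(1, 10, 128)
print(model.caption(dummy_audio, tokenizer))
```

In practice the projection layer and BART would be fine-tuned on Clotho audio/caption pairs via the `forward` loss before `caption` produces meaningful text; the untrained demo above only checks that the pieces fit together.
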
File(s) in this document:
| File | Description | Size | Format |
|---|---|---|---|
| Gharouba Hadil et Ben Doumia Kaouther.pdf | | 3.68 MB | Adobe PDF |
All documents in DSpace are protected by copyright, with all rights reserved.