Université Blida 1

Encoder-decoder-based neural network architectures for automatic audio captioning.


dc.contributor.author Gharouba, Hadil
dc.contributor.author Ben Doumia, Kaouther
dc.contributor.author Ykhlef, Hadjer (Supervisor)
dc.date.accessioned 2025-10-26T13:57:19Z
dc.date.available 2025-10-26T13:57:19Z
dc.date.issued 2025
dc.identifier.uri https://di.univ-blida.dz/jspui/handle/123456789/40776
dc.description ill., Bibliogr. Cote: MA-004-1052 fr_FR
dc.description.abstract The main objective of our project is to develop an effective system for Automated Audio Captioning (AAC), a task that involves describing ambient sounds within an audio clip using a natural language sentence, effectively bridging the gap between auditory perception and linguistic expression. In recent years, AAC has gained significant attention and has seen considerable progress. Despite these advancements, the field still faces many challenges. To achieve this task, our approach follows an encoder-decoder model based on deep learning techniques. Specifically, we employ a novel, fully transformer-based architecture built around BART, which overcomes the limitations of traditional RNN and CNN approaches in AAC. The self-attention mechanism in BART facilitates better modeling of both local and global dependencies in audio signals. Our model integrates VGGish to extract audio embeddings from log-Mel spectrograms, and a BART transformer combining a bidirectional encoder and an autoregressive decoder for generating captions. Word embeddings are produced using a BPE tokenizer, which is adapted to the unique vocabulary of the training dataset, thereby aligning it with the general requirements of the captioning task. In order to improve the quality of the generated audio captions, we performed multiple experiments using the Clotho dataset. The results indicate that our model produces more accurate and diverse descriptions than existing state-of-the-art approaches. Keywords: Automated Audio Captioning, Encoder-Decoder, Deep Learning, Transformer, BART, VGGish, BPE tokenizer, Clotho fr_FR
dc.language.iso en fr_FR
dc.publisher Université Blida 1 fr_FR
dc.subject Automated Audio Captioning fr_FR
dc.subject Encoder-Decoder fr_FR
dc.subject Deep Learning fr_FR
dc.subject Transformer fr_FR
dc.subject BART fr_FR
dc.subject VGGish fr_FR
dc.subject BPE tokenizer fr_FR
dc.subject Clotho fr_FR
dc.title Encoder-decoder-based neural network architectures for automatic audio captioning. fr_FR
dc.type Thesis fr_FR
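The abstract describes a VGGish frontend that extracts audio embeddings from log-Mel spectrograms before they are fed to the BART encoder-decoder. As an illustration only (not the thesis code), the NumPy sketch below builds VGGish-style input patches: 16 kHz audio, 25 ms STFT window, 10 ms hop, 64 mel bands, sliced into 96-frame (0.96 s) patches, following the published VGGish input description; the 125-7500 Hz band limits are an assumption based on the public VGGish reference implementation.

```python
import numpy as np

SAMPLE_RATE = 16000      # VGGish expects 16 kHz mono audio
WIN = 400                # 25 ms STFT window
HOP = 160                # 10 ms hop
N_MELS = 64              # VGGish uses 64 mel bands
PATCH_FRAMES = 96        # 96 frames (0.96 s) per VGGish input patch

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_fft, sr, n_mels, fmin=125.0, fmax=7500.0):
    # Triangular mel filters evaluated at the FFT bin center frequencies
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.clip(np.minimum((fft_freqs - lo) / (mid - lo),
                                   (hi - fft_freqs) / (hi - mid)), 0.0, None)
    return fb

def log_mel_patches(waveform):
    """Slice a 16 kHz waveform into (n_patches, 96, 64) log-mel patches."""
    n_fft = 512
    window = np.hanning(WIN)
    n_frames = 1 + (len(waveform) - WIN) // HOP
    frames = np.stack([waveform[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2       # power spectrogram
    mel = spec @ mel_filterbank(n_fft, SAMPLE_RATE, N_MELS).T
    log_mel = np.log(mel + 1e-6)                           # stabilized log
    n_patches = len(log_mel) // PATCH_FRAMES
    return log_mel[:n_patches * PATCH_FRAMES].reshape(n_patches,
                                                      PATCH_FRAMES, N_MELS)

# Example: two seconds of noise yields two 96x64 patches
patches = log_mel_patches(np.random.default_rng(0).standard_normal(2 * SAMPLE_RATE))
print(patches.shape)  # (2, 96, 64)
```

In the full pipeline each patch would pass through the VGGish CNN to yield a 128-dimensional embedding sequence, which the BART encoder attends over while the autoregressive decoder generates the caption tokens.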

