Université Blida 1

Encoder-decoder-based neural network architectures for automatic audio captioning.


dc.contributor.author Gharouba, Hadil
dc.contributor.author Ben Doumia, Kaouther
dc.contributor.author Ykhlef, Hadjer (Supervisor)
dc.date.accessioned 2025-10-26T13:57:19Z
dc.date.available 2025-10-26T13:57:19Z
dc.date.issued 2025
dc.identifier.uri https://di.univ-blida.dz/jspui/handle/123456789/40776
dc.description ill., Bibliogr. Cote: MA-004-1052 fr_FR
dc.description.abstract The main objective of our project is to develop an effective system for Automated Audio Captioning (AAC), a task that involves describing ambient sounds within an audio clip using a natural language sentence, effectively bridging the gap between auditory perception and linguistic expression. In recent years, AAC has gained significant attention and has seen considerable progress. Despite these advancements, the field still faces many challenges. To achieve this task, our approach follows an encoder-decoder model based on deep learning techniques. Specifically, we employ a novel, fully transformer-based architecture built around BART, which overcomes the limitations of traditional RNN and CNN approaches in AAC. The self-attention mechanism in BART facilitates better modeling of both local and global dependencies in audio signals. Our model integrates VGGish to extract audio embeddings from log-Mel spectrograms, and a BART transformer combining a bidirectional encoder and an autoregressive decoder for generating captions. Word embeddings are produced using a BPE tokenizer, which is adapted to the unique vocabulary of the training dataset, thereby aligning it with the general requirements of the captioning task. In order to improve the quality of the generated audio captions, we performed multiple experiments using the Clotho dataset. The results indicate that our model produces more accurate and diverse descriptions than existing state-of-the-art approaches. Keywords: Automated Audio Captioning, Encoder-Decoder, Deep Learning, Transformer, BART, VGGish, BPE tokenizer, Clotho fr_FR
dc.language.iso en fr_FR
dc.publisher Université Blida 1 fr_FR
dc.subject Automated Audio Captioning fr_FR
dc.subject Encoder-Decoder fr_FR
dc.subject Deep Learning fr_FR
dc.subject Transformer fr_FR
dc.subject BART fr_FR
dc.subject VGGish fr_FR
dc.subject BPE tokenizer fr_FR
dc.subject Clotho fr_FR
dc.title Encoder-decoder-based neural network architectures for automatic audio captioning. fr_FR
dc.type Thesis fr_FR
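The abstract describes a VGGish frontend that extracts audio embeddings from log-Mel spectrograms before they are fed to the BART encoder-decoder. As an illustration only (not the thesis code), the NumPy sketch below builds VGGish-style input patches: 16 kHz audio, 25 ms STFT window, 10 ms hop, 64 mel bands, sliced into 96-frame (0.96 s) patches, following the published VGGish input description; the 125-7500 Hz band limits are an assumption based on the public VGGish reference implementation.

```python
import numpy as np

SAMPLE_RATE = 16000      # VGGish expects 16 kHz mono audio
WIN = 400                # 25 ms STFT window
HOP = 160                # 10 ms hop
N_MELS = 64              # VGGish uses 64 mel bands
PATCH_FRAMES = 96        # 96 frames (0.96 s) per VGGish input patch

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_fft, sr, n_mels, fmin=125.0, fmax=7500.0):
    # Triangular mel filters evaluated at the FFT bin center frequencies
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        fb[i] = np.clip(np.minimum((fft_freqs - lo) / (mid - lo),
                                   (hi - fft_freqs) / (hi - mid)), 0.0, None)
    return fb

def log_mel_patches(waveform):
    """Slice a 16 kHz waveform into (n_patches, 96, 64) log-mel patches."""
    n_fft = 512
    window = np.hanning(WIN)
    n_frames = 1 + (len(waveform) - WIN) // HOP
    frames = np.stack([waveform[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2       # power spectrogram
    mel = spec @ mel_filterbank(n_fft, SAMPLE_RATE, N_MELS).T
    log_mel = np.log(mel + 1e-6)                           # stabilized log
    n_patches = len(log_mel) // PATCH_FRAMES
    return log_mel[:n_patches * PATCH_FRAMES].reshape(n_patches,
                                                      PATCH_FRAMES, N_MELS)

# Example: two seconds of noise yields two 96x64 patches
patches = log_mel_patches(np.random.default_rng(0).standard_normal(2 * SAMPLE_RATE))
print(patches.shape)  # (2, 96, 64)
```

In the full pipeline each patch would pass through the VGGish CNN to yield a 128-dimensional embedding sequence, which the BART encoder attends over while the autoregressive decoder generates the caption tokens.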

