Abstract:
In recent years, researchers have focused on developing and training visual question generation
models based on deep neural networks. These models have a wide range of
applications across various domains. However, no specialized work has been conducted
on visual question generation in the Arabic language.
Our work aims to automate the generation of Arabic educational questions
from visual content. We propose a multi-modal Arabic visual question generation model that
integrates two distinct components. The first is a fine-tuned Arabic image captioning
model, obtained by fine-tuning the Google Vision Transformer and the AraBERT transformer
on a newly collected dataset. The second is a fine-tuned Arabic natural question
generation model.
Our proposed multi-modal model has been evaluated using the Transparent Human benchmark
protocol, and the results demonstrate its ability to generate relevant captions: 51%
of the captions received a rating between 2 and 4 on a 5-point scale, indicating their
relevance. Additionally, the model produced relevant questions based on these captions,
achieving an average rating of 3.33 out of 5 in terms of relevance.
Keywords: Visual question generation, Arabic image captioning, Transformers, Vision Transformer, deep learning.