Abstract:
Automatic Visual Speech Recognition (VSR) techniques are increasingly prevalent in domains such as manufacturing, public services, and multimedia devices, and VSR is a promising technology for improving communication accessibility for people with hearing impairments. However, most existing VSR systems are designed for languages such as English, leaving a gap for Arabic, which is spoken by more than 400 million people worldwide and has distinctive linguistic and phonetic characteristics. This thesis presents a novel framework for Arabic Visual Speech Recognition that addresses this gap and caters to the needs of the Arabic hearing-impaired community. The framework integrates state-of-the-art deep learning techniques, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViT), to transcribe Arabic speech from visual cues accurately and efficiently. It also relies on a specialized Arabic dataset, carefully curated to capture the diversity and complexity of the Arabic language; this dataset serves as a benchmark for training and evaluating the VSR models, ensuring their robustness and reliability in real-world applications. The framework employs YOLO-based mouth detection together with CNN and ViT recognition models, enabling the extraction of the visual features that are crucial for accurate speech transcription. Experimental results show that the proposed framework achieves promising performance in enhancing communication accessibility for Arabic speakers with hearing impairments, and that it handles a variety of linguistic and phonetic variations of the Arabic language, opening up new possibilities for wider real-world applications. This research contributes significantly to advancing Arabic Visual Speech Recognition technology, enriching the VSR landscape and fostering greater inclusivity in communication for Arabic speakers.
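
To make the described pipeline concrete, the following is a minimal illustrative sketch (not the thesis implementation) of how a YOLO-cropped mouth sequence could be encoded with a ViT backbone and mapped to Arabic token scores. It assumes a PyTorch/torchvision setup; all class names, dimensions, and the GRU temporal model are hypothetical placeholders, and the YOLO detection step is indicated only as a comment.

    # Illustrative sketch only: mouth crops (assumed to come from a YOLO detector
    # run on each video frame) -> ViT per-frame features -> temporal model -> logits.
    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16

    class VisualSpeechEncoder(nn.Module):
        """Encodes a sequence of mouth-region crops into frame-level features."""
        def __init__(self, feature_dim: int = 768):
            super().__init__()
            # ViT backbone as the per-frame visual feature extractor
            # (pretrained weights could be loaded here instead of None).
            self.backbone = vit_b_16(weights=None)
            self.backbone.heads = nn.Identity()  # keep the 768-d CLS embedding
            # Simple bidirectional GRU as a stand-in temporal model.
            self.temporal = nn.GRU(feature_dim, 256, batch_first=True, bidirectional=True)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, time, 3, 224, 224) mouth crops
            b, t = clips.shape[:2]
            frames = clips.flatten(0, 1)                  # (b*t, 3, 224, 224)
            feats = self.backbone(frames).view(b, t, -1)  # (b, t, 768)
            out, _ = self.temporal(feats)                 # (b, t, 512)
            return out

    class ArabicVSRModel(nn.Module):
        """Maps mouth-crop sequences to per-frame Arabic token logits."""
        def __init__(self, vocab_size: int):
            super().__init__()
            self.encoder = VisualSpeechEncoder()
            self.classifier = nn.Linear(512, vocab_size)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.encoder(clips))   # (b, t, vocab_size)

    if __name__ == "__main__":
        # Toy forward pass with random "mouth crops"; a real pipeline would first
        # run the YOLO face/mouth detector on each frame to produce these crops.
        model = ArabicVSRModel(vocab_size=40)  # e.g. Arabic characters + blank
        dummy_clips = torch.randn(1, 8, 3, 224, 224)
        print(model(dummy_clips).shape)        # torch.Size([1, 8, 40])

Such frame-level logits would typically be trained with a sequence loss (for example CTC) against the curated Arabic dataset mentioned above; the exact training objective is left unspecified here.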