Abstract:
In the age of big data, textual data is more important than ever, with an ever-increasing volume and an abundant production of digital documents, particularly in the biomedical field as a consequence of the convergence between medical computer science and bioinformatics. These textual data are usually expressed in an unstructured form (i.e., natural language), which makes their automated processing more difficult. Moreover, the rapid growth of the biomedical literature makes manual indexing approaches increasingly complex, time-consuming and error-prone, so automated classification is essential. Despite many efforts, classifying complete biomedical texts from segments specific to these texts, such as their title and abstract, remains a real challenge.
In this thesis we investigate state-of-the-art approaches to classifying biomedical texts and compare them with the pre-trained models that we tested. After evaluating several deep learning models, namely BioBERT, RoBERTa and XLNet, we found that the most suitable model for classifying biomedical texts is BioBERT, with an average F1 score of 85.1%. This result is very close to that of the RoBERTa model, which reached a score of 85% even though, unlike BioBERT, it was not pre-trained on biomedical texts, while XLNet performed slightly worse with a score of 83%.
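To illustrate the kind of pipeline compared above, the sketch below shows how BioBERT can be loaded as a multilabel classifier with the Hugging Face transformers library and scored with a micro-averaged F1. The dmis-lab/biobert-base-cased-v1.1 checkpoint is the publicly available BioBERT release; the label set, threshold, example text and gold annotations are hypothetical, and the classification head would have to be fine-tuned on labelled biomedical documents before the scores are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score

# Hypothetical label set; the real categories depend on the thesis corpus.
LABELS = ["cancer", "cardiology", "genetics", "neurology"]

# Public BioBERT checkpoint from the Hugging Face Hub. The classification
# head is randomly initialised here and must be fine-tuned before use.
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid output per label
)

def predict(texts, threshold=0.5):
    """Return a binary label matrix: one row per text, one column per label."""
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits)
    return (probs >= threshold).int().numpy()

# Toy evaluation against hypothetical gold labels, micro-averaged F1.
texts = ["BRCA1 mutations and hereditary breast cancer risk."]
gold = [[1, 0, 1, 0]]  # cancer + genetics
pred = predict(texts)
print("micro F1:", f1_score(gold, pred, average="micro", zero_division=0))
```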
Finally, we deployed the three above-mentioned models and developed an online user interface on the Hugging Face platform in order to test them and present the classification results clearly and easily.
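For context, a minimal sketch of such an online demo is given below, assuming a Gradio interface, which is one common way to host model demos on Hugging Face Spaces; the checkpoint name, label set and layout are illustrative rather than a description of the actual deployed application.

```python
import gradio as gr
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; a deployed demo would point to the fine-tuned model.
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
LABELS = ["cancer", "cardiology", "genetics", "neurology"]  # hypothetical labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS), problem_type="multi_label_classification"
)

def classify(text):
    enc = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits)[0]
    # Map each label to its predicted probability for display in the UI.
    return {label: float(p) for label, p in zip(LABELS, probs)}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=6, label="Biomedical abstract"),
    outputs=gr.Label(num_top_classes=len(LABELS)),
    title="Biomedical text classification demo",
)

if __name__ == "__main__":
    demo.launch()
```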
Keywords: Automatic Text Classification, Multilabel Classification, Automatic Medical Language Processing, Deep Learning.