Abstract

We consider the problem of solving Natural Language Understanding (NLU) tasks characterized by domain-specific data. An effective approach consists of pre-training Transformer-based language models from scratch on domain-specific data before fine-tuning them on the task at hand. A low volume of domain-specific data is problematic in this context, given that the performance of language models relies heavily on the abundance of pre-training data. To study this problem, we create a benchmark replicating realistic field use of language models to classify aviation occurrences extracted from the Aviation Safety Reporting System (ASRS) corpus. We compare two language models on this new benchmark: ASRS-CMFS, a compact model inspired by RoBERTa and pre-trained from scratch using only a small amount of domain-specific data, and the regular RoBERTa model, with no domain-specific pre-training. The RoBERTa model benefits from its size advantage, while ASRS-CMFS benefits from being pre-trained from scratch on domain data. We find no compelling statistical evidence that RoBERTa outperforms ASRS-CMFS, but we show that ASRS-CMFS is more compute-efficient than RoBERTa. We suggest that pre-training a compact model from scratch is a good strategy for solving domain-specific NLU tasks with Transformer-based language models when domain-specific data are scarce.
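To make the two-stage strategy described above concrete, the sketch below (not the authors' code) shows one plausible way to pre-train a compact RoBERTa-style model from scratch on a domain corpus and then fine-tune it for occurrence classification, using the Hugging Face transformers library. The model dimensions, file paths, and label count are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch: compact RoBERTa-style model pre-trained from scratch,
# then fine-tuned for classification. Sizes/paths/labels are assumptions.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
)

# Assume a tokenizer was already trained on the domain corpus and saved
# to "asrs-tokenizer" (hypothetical path).
tokenizer = RobertaTokenizerFast.from_pretrained("asrs-tokenizer")

# Compact configuration: far fewer layers and hidden units than RoBERTa-base.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)

# Stage 1: masked-language-model pre-training from scratch on the
# domain-specific corpus (training loop omitted; typically run with the
# Trainer API and DataCollatorForLanguageModeling).
mlm_model = RobertaForMaskedLM(config)
# ... pre-train, then persist the encoder weights:
mlm_model.save_pretrained("asrs-cmfs-pretrained")

# Stage 2: reuse the pre-trained encoder and fine-tune a classification
# head on the downstream occurrence-classification task.
clf_model = RobertaForSequenceClassification.from_pretrained(
    "asrs-cmfs-pretrained",
    num_labels=10,  # hypothetical number of occurrence classes
)
```

The compact configuration is what makes the from-scratch strategy affordable: with little domain data, a smaller model can be pre-trained and fine-tuned at a fraction of the compute needed to fine-tune RoBERTa-base.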
