Automatic Text Classification of PDF Documents using NLP Techniques

Nabil Abdoun, Mohammad Chami

doi:10.1002/iis2.12997

Abstract

One of the regular activities that engineers perform during the design and development of technical systems is determining which sentences in a PDF specification document represent a requirement, functional architecture, design solution, variability, or another type of systems engineering (SE) data. Extracting such data from these PDF documents and transferring it into system model elements is still performed manually, requires high effort, and is error-prone. Automatic extraction and classification of such SE data therefore has great potential, but it is still relatively scarce and remains a challenging task for engineers working with large PDF specification documents. One solution is to follow a suitable writing formulation, which provides an immediate and easy way to classify and analyze the PDF documents. However, such formulations are not always followed strictly. As part of our work towards adopting Artificial Intelligence (AI) for Model‐Based Systems Engineering (MBSE), we have been researching data extraction and data classification from PDF files in order to transfer the data to system model elements. In this paper, we present the early status of an AI-based solution that uses Natural Language Processing (NLP) techniques to label the SE data existing in PDF files, extract it, and classify it into predefined classes.

Full Text