UML (Unified Modeling Language) diagrams are graphical representations that play a vital role in the design and development of software systems and in various engineering processes. Large, high-quality datasets of UML diagrams are essential for industry, research, and teaching; however, few exist in the literature, and those that do commonly contain duplicate elements, which can distort the evaluation of models trained and tested on them. This paper addresses the challenge of creating a ground-truth dataset of UML diagrams, using semi-automated inspection to remove duplicates and to ensure that every diagram is correctly labeled. In particular, we assembled a dataset covering six UML diagram classes and comprising 2626 images in total (426 activity diagrams, 636 class diagrams, 352 component diagrams, 357 deployment diagrams, 435 sequence diagrams, and 420 use case diagrams). Unlike other existing datasets, ours contains no duplicate elements, and all diagrams are correctly labeled. The curated dataset is a valuable and unique resource for the research community, serving as a foundation for training and evaluating diverse artificial intelligence models. We demonstrate this by training and testing several deep learning models on the dataset, achieving results that compare favorably with those reported in other works in the literature. Additionally, our experimental results highlight the potential of visual transformers for UML diagram classification, setting our approach apart from others that have predominantly used convolutional neural networks for similar tasks.
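The abstract does not describe how the semi-automated duplicate inspection is carried out, so the following is only a minimal sketch of one plausible automated first pass, assuming that byte-identical image files are treated as duplicates. The function name `group_exact_duplicates` and the sample file names are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(images: dict[str, bytes]) -> list[list[str]]:
    """Group image names whose contents are byte-for-byte identical.

    `images` maps a (hypothetical) file name to the raw file bytes.
    This is an assumed first pass, not the procedure used in the paper.
    """
    by_digest = defaultdict(list)
    for name, content in images.items():
        # Identical bytes always produce the same SHA-256 digest.
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


if __name__ == "__main__":
    # Placeholder bytes stand in for real image files (illustration only).
    sample = {
        "class_001.png": b"diagram-A",
        "class_002.png": b"diagram-B",
        "sequence_017.png": b"diagram-A",
    }
    for group in group_exact_duplicates(sample):
        print("possible duplicates:", group)
```

Exact hashing only catches identical files. Resized, recompressed, or cropped copies of the same diagram would need a perceptual comparison followed by manual review, which is presumably where the manual part of a semi-automated inspection comes in.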