Abstract

Data-driven modeling is central to the Materials Genome Initiative (MGI), but quickly obtaining large amounts of material data remains a critical unsolved problem. Material databases are rarely shared, so useful material data are hard to obtain from public resources. We therefore use text mining to extract valid data from the literature on hypereutectic Al-Si alloys. Natural language processing (NLP) is a commonly used text mining approach, and named entity recognition (NER), one of its main tasks, can effectively extract information from the literature. However, no public dataset is suitable for material entity recognition (MER) research in the materials field. To apply NER effectively to this field, we select five types of entities from the material literature and construct the hypereutectic Al-Si alloy material entity dataset (HASE) by manual annotation; it contains 8,845 material entities in total. Because only a small amount of annotated data is available in the materials field, we also propose an MER method combined with active learning. Tailored to the characteristics of material entities, the active learning procedure uses automatic annotation based on a dictionary and rules, a CRF model, and a BiGRU-CRF model. In the end, a total of 16,677 material entities were annotated. Combining MER with active learning both improves the performance of the MER model and reduces the cost of annotation, allowing effective material data to be extracted from the literature more accurately. The result gives MGI researchers an effective way to quickly obtain large amounts of material data and has both theoretical significance and practical value.
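
As an illustration of the dictionary-based automatic annotation step mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes a greedy longest-match lookup over whitespace tokens and BIO tagging. The function name `dictionary_annotate`, the example dictionary, and the labels `MAT` and `PROP` are hypothetical; the abstract does not name the five entity types or describe the rules used.

```python
def dictionary_annotate(tokens, dictionary, max_len=4):
    """Tag tokens with BIO labels by greedy longest-match dictionary lookup.

    `dictionary` maps lowercase phrases (tokens joined by single spaces)
    to entity labels. Labels and phrases here are illustrative assumptions.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            label = dictionary.get(phrase)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical dictionary entries, for illustration only.
EXAMPLE_DICTIONARY = {"al-si alloy": "MAT", "tensile strength": "PROP"}

tokens = "The hypereutectic Al-Si alloy shows high tensile strength".split()
tags = dictionary_annotate(tokens, EXAMPLE_DICTIONARY)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In an active learning setting, tags produced this way could serve as candidate annotations for a statistical model such as a CRF to refine, with human review reserved for uncertain cases; how the paper combines these steps is not specified in the abstract.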
