Abstract

We propose a bottom‐up, data‐driven pipeline to uncover the structure of biodiversity subject metadata using a combination of text mining approaches. In this study, we analyze 721,035 subject terms in the Biodiversity Heritage Library (BHL). We use named entity recognition and word‐embedding methods to systematically label the terms and group them by their vector‐space distances. The results show that the BHL subject terms cluster into several prominent themes relating to environmental regulations, geographic locations, organisms, and subject access points. We hope that our approach can serve as a first step toward grouping similar subject terms in large‐scale, constantly growing digital collections whose metadata are aggregated from multiple sources. Ultimately, we hope the next phases of this project can become a basis for biodiversity digital libraries to standardize their vocabularies.
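To illustrate what grouping terms by vector‐space distance can look like, here is a minimal sketch. It is not the authors' pipeline: the abstract does not name the embedding model or the clustering method, so the toy three-dimensional vectors, the cosine distance, the greedy grouping rule, and the names `TOY_EMBEDDINGS`, `cosine_distance`, `group_terms`, and `threshold` are all assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional vectors standing in for learned word embeddings.
# Real embeddings would come from a trained model and have many more dimensions.
TOY_EMBEDDINGS = {
    "birds": [0.9, 0.1, 0.0],
    "mammals": [0.8, 0.2, 0.1],
    "brazil": [0.1, 0.9, 0.0],
    "peru": [0.0, 0.95, 0.1],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_terms(embeddings, threshold=0.2):
    """Greedy grouping (an assumed, simplified rule): a term joins the first
    group whose seed term lies within `threshold`; otherwise it starts a new group."""
    groups = []
    for term, vector in embeddings.items():
        for group in groups:
            seed_vector = embeddings[group[0]]
            if cosine_distance(vector, seed_vector) <= threshold:
                group.append(term)
                break
        else:
            # No existing group was close enough, so this term seeds a new one.
            groups.append([term])
    return groups


groups = group_terms(TOY_EMBEDDINGS)
print(groups)
```

With these toy vectors, the organism-like terms and the place-like terms fall into separate groups. A production pipeline would likely replace the greedy rule with an established clustering algorithm and tune the distance threshold on real data.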
