Neuroscientists have long endeavored to map brain connectivity, yet the intricate nature of brain networks often leads them to concentrate on specific regions, which hinders efforts to build a comprehensive connectivity map. Recent advances in imaging and text mining have produced a vast body of literature containing valuable insights into brain connectivity, making it feasible to extract whole-brain connectivity relations from this corpus. However, brain region names and connectivity relations are expressed in many different ways, which makes it difficult for conventional machine learning methods and dictionary-based approaches to identify all instances accurately. We propose BioSEPBERT, a biomedical pre-trained model based on start-end position pointers and BERT. The model also integrates specialized identifiers with enhanced self-attention capabilities for preceding and succeeding brain regions, thereby improving the performance of named entity recognition and relation extraction in neuroscience. Our approach achieves F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we use BioSEPBERT to extract 22.6 million standardized brain regions and 165,072 directional relations from a corpus comprising 1.3 million abstracts and 193,100 full-text articles. The results demonstrate that our model helps researchers rapidly acquire knowledge about neural circuits across various brain regions, thereby enhancing comprehension of brain connectivity in specific regions. Data and source code are available at: http://atlas.brainsmatics.org/res/BioSEPBERT and https://github.com/Brainsmatics/BioSEPBERT. Supplementary data are available at Bioinformatics online.