Objective: Faced with the challenges of differential diagnosis caused by the complex clinical manifestations and high pathological heterogeneity of pituitary adenomas, this study aims to construct a high-quality annotated corpus to characterize pituitary adenomas in clinical notes containing rich diagnosis and treatment information. Methods: A dataset from a pituitary adenomas neurosurgery treatment center of a tertiary first-class hospital in China was retrospectively collected. A semi-automatic corpus construction framework was designed. A total of 2000 documents containing 9430 sentences and 524,232 words were annotated, and the text corpus of pituitary adenomas (TCPA) was constructed and analyzed. Its potential application in large language models (LLMs) was explored through fine-tuning and prompting experiments. Results: TCPA had 4782 medical entities and 28,998 tokens, achieving good quality with the inter-annotator agreement value of 0.862-0.986. The LLMs experiments showed that TCPA can be used to automatically identify clinical information from free texts, and introducing instances with clinical characteristics can effectively reduce the need for training data, thereby reducing labor costs. Conclusion: This study characterized pituitary adenomas in clinical notes, and the proposed method were able to serve as references for relevant research in medical natural language scenarios with highly specialized language structure and terminology.
Read full abstract