Abstract
This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, in genome-wide association studies, where they have contributed to the discovery of a head and neck cancer mutation association; second, in medical records analysis, where they have significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort; and third, in drug-related searching, where they support richer search constructs. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.
Highlights
We talk, we write, we listen or read, and we are so skilled in our use of language that we are seldom aware of the complexities involved in its production and consumption
In analysing the problem we considered a range of existing solutions from the XML, RDBMS and augmented full text indexing fields and solicited input from each of these communities at a workshop in May 2008 on Persisting, Indexing and Querying Multi-Paradigm Text Models, at the Information Retrieval Facility in Vienna
Our discussions failed to identify a pre-existing solution that could be applied directly (XML indexing and retrieval is biased towards trees; relational databases are biased towards relations) but we did discover that the implementation of sequence operators in MG4J [50] was sufficiently efficient to represent a possible solution, and this is how we implemented the annotation graph support in Mímir
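To make the idea of sequence operators over annotations concrete, the following is a minimal sketch of a positional index in which both words and annotation labels are stored as token spans, so that a sequence query can mix the two. It is an illustration only and does not reproduce the MG4J or Mímir APIs or query syntax: the function names `build_index` and `sequence`, the `{Label}` key convention, and the sample sentence and annotations are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(tokens, annotations):
    """Map each word or annotation label to a list of (start, end) token spans.

    `annotations` is a list of (label, start, end) triples, with `end` exclusive.
    Annotation labels are stored under the key "{label}" (a hypothetical convention).
    """
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append((pos, pos + 1))
    for label, start, end in annotations:
        index["{" + label + "}"].append((start, end))
    return index


def sequence(index, *keys):
    """Return spans where the given keys match one directly after another."""
    spans = list(index.get(keys[0], []))
    for key in keys[1:]:
        following = index.get(key, [])
        # Keep a partial match only if the next key starts where it ends.
        spans = [(s, e2) for (s, e) in spans for (s2, e2) in following if s2 == e]
    return sorted(spans)


# Made-up sample data: a sentence plus two annotations over token offsets.
tokens = "Patients received 20 mg of fluoxetine daily".split()
annotations = [("Dose", 2, 4), ("Drug", 5, 6)]

index = build_index(tokens, annotations)
matches = sequence(index, "{Dose}", "of", "{Drug}")
print(matches)
```

Because annotations are flattened into the same position space as words, overlapping or nested annotations need no tree structure; the query simply checks that each span begins where the previous one ends.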
Summary
We talk, we write, we listen or read, and we are so skilled in our use of language that we are seldom aware of the complexities involved in its production and consumption. It is natural that a large proportion of what we know of the world is externalised exclusively in textual form. GATE has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. We begin by describing the technology that has been used in these applications, before describing each of the projects in more detail.