Sentence Splitting Research Articles

Processing natural language and extracting relevant information in deeply technical engineering domains remains an open challenge. At the same time, manufacturers of high-value assets, which often deliver product services throughout the equipment life (supporting maintenance, spare parts management, and remote monitoring and diagnostics for issue resolution), hold a good amount of textual data containing technical cases of considerable engineering depth. This paper presents a case study in which various Artificial Intelligence algorithms were applied to historical technical cases to extract know-how that helps technicians approach new cases. The work process and available data are presented first; the focus is on the outbound communication delivered from the technical team to the site operators, which is structured in three main paragraphs: event description, technical assessment, and recommended actions. The work proceeded in two parallel streams: the first analyzed event descriptions and technical assessments, aiming to detect recurring topics; the second analyzed the recommended actions that technical support delivered through the years to site operators, in order to create a library that enables statistical data analysis and quality check review and serves as the starting point for further AI/NLP developments. Text preprocessing was applied to both streams. It consisted of defining standard and domain entities and stopwords and identifying and removing them, creating acronym and synonym maps for context disambiguation, sentence splitting for the recommended actions, and finally text lemmatization. For every text, the output of the preprocessing was a series of keywords. Then, unsupervised learning algorithms were applied. For this purpose, we first applied feature extraction, using bag of words (TF-IDF) and word embeddings (W2V, D2V, BERT), to transform our data from the language domain into points in an n-feature space. Afterwards, different combinations of unsupervised algorithms were applied to split the data into homogeneous groups: LDA, K-means, Spectral, Affinity Propagation and HDBSCAN. The combinations of language modeling and clustering were evaluated using the Silhouette score and visual analysis. To validate their effectiveness, the developed NLP algorithms were implemented in the current software application used by technical support to perform the service. Moreover, a dedicated app to show trending topics and retrieve insightful information was developed. Finally, an outlook on the open technical challenges and on the future perspective of NLP applications in the work process is given.
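The preprocessing and feature-extraction steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the regex-based splitter, and the function names `split_sentences`, `keywords`, and `tfidf` are assumptions made for illustration, and lemmatization, acronym and synonym maps, and the clustering stage are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; the domain-specific list used in the paper is not given.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "if"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def keywords(sentence):
    """Lowercase the sentence, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    # Made-up recommended-action text, used only to exercise the functions.
    text = "Inspect the seal. Replace the filter if clogged. Inspect the valve."
    docs = [keywords(s) for s in split_sentences(text)]
    for doc, vec in zip(docs, tfidf(docs)):
        print(doc, vec)
```

In this sketch a term that occurs in fewer sentences receives a higher weight than one shared across sentences; the resulting sparse vectors are the kind of input a clustering algorithm such as K-means would then group.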

Background: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools.

Results: This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification.

Conclusions: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.
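To make the idea of dictionary matching with overlapped concept names concrete, here is a small sketch. It does not use or reproduce Neji's API or its concept tree (Neji is a separate framework). The dictionary, the identifiers in it, and the function name `match_concepts` are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical mini-dictionary: lowercase concept names mapped to made-up identifiers.
DICTIONARY = {
    "breast cancer": "DISO:0001",
    "cancer": "DISO:0002",
    "brca1": "GENE:0001",
}


def match_concepts(sentence, dictionary=DICTIONARY, max_len=3):
    """Return (start, end, surface text, identifier) for every dictionary hit.

    Every span of up to `max_len` consecutive word tokens is looked up, so
    overlapping names such as "breast cancer" and "cancer" are both reported
    and can be disambiguated in a later step.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", sentence)]
    matches = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            start, end = tokens[i][0], tokens[j - 1][1]
            surface = sentence[start:end]
            identifier = dictionary.get(surface.lower())
            if identifier is not None:
                matches.append((start, end, surface, identifier))
    return matches


if __name__ == "__main__":
    for hit in match_concepts("BRCA1 mutations raise breast cancer risk."):
        print(hit)
```

Keeping every overlapping hit, rather than only the longest one, leaves the choice between nested names to a separate disambiguation step, which is the general situation the abstract refers to.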

Articles published on Sentence Splitting

Intelligent Technology Assessment of High-Speed Railway Based on Knowledge Graphs

Nepali ESL/EFL Student Translators' Manipulation of Sentences at the Textual Level

An improved Bulgarian natural language processing pipeline

Does splitting make sentence easier?

Sentence splitting in Arabic to Spanish translation

Medication event extraction in clinical notes: Contribution of the WisPerMed team to the n2c2 2022 challenge

Radiology Text Analysis System (RadText): Architecture and Evaluation.

Contexts and Consequences of Sentence Splitting in Translation (English-French-Czech)

Evaluation of split-and-rephrase output of the knowledge extraction tool in the intelligent tutoring system

Algorithmically Exploiting the Knowledge Accumulated in Textual Domains for Technical Support

Effect of in-app components, medium, and screen size of electronic textbooks on reading performance, behavior, and perception

Fact-Aware Sentence Split and Rephrase with Permutation Invariant Training

Automated transformation of NL to OCL constraints via SBVR

A corpus study of splitting and joining sentences in translation

Effectiveness Level of Online Plagiarism Detection Tools in Arabic

Web Content Mining of Hepatitis-C Community Forums for Identification of Most Frequently Transpiring Lexis

Minimum redundancy and maximum relevance for single and multi-document Arabic text summarization

‘Lösen Sie Schachtelsätze möglichst auf’: The Impact of Editorial Guidelines on Sentence Splitting in German Business Article Translations

A modular framework for biomedical concept recognition

Textrous!: Extracting Semantic Textual Meaning from Gene Sets
