DrNote: An open medical annotation service

Authors: Johann Frei, Iñaki Soto-Rey, Frank Kramer
Affiliations: IT-Infrastructure for Translational Medical Research, Faculty of Applied Computer Science, University of Augsburg, Augsburg; Medical Data Integration Center, Institute for Digital Medicine, University Hospital Augsburg
Date: 2022-08

Abstract

In the context of clinical trials and medical research, medical text mining can provide broader insights for various research scenarios by tapping additional text data sources and extracting relevant information that is often exclusively present in unstructured form. Although various works on data like electronic health reports are available for English texts, only limited work has been published on tools for non-English text resources that offer immediate practicality in terms of flexibility and initial setup. We introduce DrNote, an open source text annotation service for medical text processing. Our work provides a complete annotation pipeline with a focus on a fast yet effective and easy-to-use software implementation. Further, the software allows its users to define a custom annotation scope by filtering only for relevant entities that should be included in its knowledge base. The approach is based on OpenTapioca and combines the publicly available datasets from WikiData and Wikipedia to perform entity linking. In contrast to other related work, our service can easily be built upon any language-specific Wikipedia dataset in order to be trained on a specific target language. We provide a public demo instance of our DrNote annotation service at https://drnote.misit-augsburg.de/.

Author summary

Since much highly relevant information in healthcare and clinical research is exclusively stored as unstructured text, retrieving and processing such data poses a major challenge. Novel data-driven text processing methods require large amounts of annotated data in order to exceed the performance of non-data-driven methods. In the medical domain, such data is not publicly available, and access is restricted due to federal privacy regulations. We circumvent this issue by developing an annotation pipeline that works on sparse data and retrieves its training data from publicly available sources. The fully automated pipeline can easily be adapted by third parties for custom scenarios or directly applied to medical use cases within minutes. It significantly lowers the barrier for fast analysis of unstructured clinical text data in certain scenarios.

Funding: This work is part of the DIFUTURE project funded by the German Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF) under grant FKZ01ZZ1804E. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The Wikipedia and WikiData datasets are publicly available at https://dumps.wikimedia.org/enwiki/, https://dumps.wikimedia.org/dewiki/, and https://dumps.wikimedia.org/wikidatawiki/entities/. Our project repository is publicly available on GitHub: https://github.com/frankkramer-lab/DrNote.

Copyright: © 2022 Frei et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
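Since DrNote builds on OpenTapioca, the demo instance can presumably be queried over HTTP in the same way as an OpenTapioca service. The following minimal sketch rests on that assumption: the endpoint path (/api/annotate), the query form field, and the annotations response key mirror OpenTapioca's interface and should be verified against the project repository before use.

```python
# Minimal sketch of querying the DrNote demo instance over HTTP.
# ASSUMPTION: since DrNote builds on OpenTapioca, the "/api/annotate"
# path, the "query" form field, and the "annotations" key of the JSON
# response mirror OpenTapioca's interface; verify against the DrNote
# repository before relying on them.
import requests

DRNOTE_URL = "https://drnote.misit-augsburg.de/api/annotate"  # assumed path

text = "Der Patient erhielt Ibuprofen gegen die Kopfschmerzen."
resp = requests.post(DRNOTE_URL, data={"query": text}, timeout=30)
resp.raise_for_status()

for annotation in resp.json().get("annotations", []):
    # Each annotation is expected to carry the matched character span and
    # candidate WikiData entities (QIDs) with ranking scores.
    print(annotation)
```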
Introduction

Effective processing of natural language clinical data has increasingly become a key element of clinical and medical data analysis. Recent trends in the field of natural language processing (NLP) have established novel data-driven neural approaches that substantially improve a broad variety of language and text analysis tasks such as neural machine translation, text summarization, question answering, text classification, and information extraction in general. Most notably, building on semantic word embeddings like Word2Vec [1] and GloVe [2], contextualized word embedding techniques like ELMo [3] and BERT [4], the latter based on the Transformer network architecture [5], are applied to solve most context-specific downstream tasks. Attention-based language models have therefore gained popularity in the NLP research community, since they are able to outperform simpler rule-based models, statistical methods like conditional random fields, and other neural methods like LSTM-based models on core NLP tasks such as named entity recognition (NER).

Regarding domain-specific neural approaches for NLP, numerous derivatives [6–11] are applied to various NLP downstream tasks. These neural approaches appear to be trending towards end-to-end models [12], which are often optimized for specific purposes [13]. While most works focus on English data, creating cross-lingual approaches [4, 14, 15] for medical applications is difficult due to the lack of sufficient data.

Traditional non-deep-learning NLP systems often adopt pipeline-based approaches [16, 17] for text processing, in which each pipeline stage performs a modular text processing task, enabling individual components to be reused across different applications and contexts. The core components often rely on feature-based machine learning or linguistic rule-based methods, although some frameworks [16, 18] also integrate neural approaches for certain NLP tasks in more recent versions. For the framework of [18], a domain-specific model [19] has been published for biomedical applications on English text data. For German texts, mEx [20] implements a similar pipeline for clinical texts based on SpaCy [16], although its trained models have not been published.

Historically, NLP software for medical applications has been an ongoing research subject. The software system medSynDiKATe [21] is an early approach to extracting relevant information from pathology finding reports written in German. Apache cTAKES [22] is another modular software system for medical text processing, following the UIMA architecture, that uses OpenNLP [23] for text analysis. While [22] is mainly designed for English texts, [24] shows only moderate results for German data when the input text is machine-translated into English. HITEx [25], based on the GATE [26] framework, and MetaMap [27] are comparably notable implementations for medical text processing on English text data. Provided as a public web API, PubTator [28] is a similar text mining tool for English biomedical text annotation with support for a fixed set of entity types. From the perspective of commercial software for medical text analysis in German, Averbis Health Discovery [29] provides an industry solution to NLP tasks for clinical applications. For deeper insight into the remaining challenges of non-English medical text processing, we refer to the review paper [30].
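Before turning to the remaining related work, a brief sketch illustrates the contextualized-embedding NER paradigm referenced above. It uses the Hugging Face transformers library with a publicly available general-domain English NER model; the model choice is ours for illustration only and is not part of the DrNote pipeline.

```python
# Illustrative sketch: NER with a pretrained Transformer model via the
# Hugging Face "transformers" library. The model below is a publicly
# available general-domain English NER model chosen purely for
# illustration; it is not part of the DrNote pipeline.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "The patient was transferred to Massachusetts General Hospital in Boston."
for entity in ner(text):
    # Each result holds the entity class (e.g. ORG, LOC), the matched
    # span, and the model's confidence score.
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

Fine-tuning such a model for a clinical domain and language still requires annotated in-domain data, which is exactly the bottleneck discussed below.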
More information on clinical text analysis methods, such as medical concept extraction and normalization, and on clinical challenges in general is presented in the review papers [31–33]. In a similar context, Trove [33] is proposed as a framework for weakly supervised clinical NER tasks. While the latter work gives a broad overview of key aspects of different methodological concepts and covers weakly supervised settings with ontology-based knowledge bases in English, it acknowledges the need for further work on non-English contexts.

For text annotation and entity linking in general, earlier works focus on Wikipedia and WikiData as knowledge bases. Entity linking from unstructured text to Wikipedia was demonstrated in [34–37], even before the WikiData [38] knowledge graph was introduced. Different entity linking approaches were evaluated and compared in [39]. In addition to WikiData, other knowledge bases [40–43] have been released as well. Following tagging engines like TagMe [37], refined entity linking systems [44, 45] were released. More recently, neural entity linking methods have been proposed [46–48]; a minimal WikiData lookup illustrating the entity linking task is sketched at the end of this section.

Motivation

Treating common natural language processing tasks as learning problems inherently implies the need for training data. Although novel Transformer-based architectures have proven effective given large amounts of domain-specific training data [6–11], training such domain-specific models for certain languages from scratch without any pretraining [7, 9, 10] remains a major challenge due to the general lack of appropriate datasets. Hence, transfer learning approaches are commonly used for use case-specific downstream tasks and integrated into practical applications [13] in order to reduce the required amount of training data and boost model performance.

Open datasets of biomedical texts and clinical letters have been published for English [49, 50]. In the particular case of German data resources for clinical letters, the situation is more dire [30, 51, 52], as no large dataset is publicly available. In addition, natural language processing methods may depend on one specific language: although works on cross- and multilingual language models like XLM, XLM-R [14, 15] and mBERT [4] present notable results, they indicate higher downstream task performance for monolingual models on non-low-resource languages.

Since text processing pipelines need to be manually fine-tuned on aggregated training data for their corresponding downstream task in order to reach a significant level of performance, applying existing methods based on contextualized word embeddings in dedicated domain contexts requires a high level of technical skill.
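To make the entity linking task from the related-work discussion concrete, the sketch below resolves a surface mention to candidate WikiData items through the public MediaWiki API (action=wbsearchentities). This is a deliberately naive, context-free lookup for illustration; systems such as OpenTapioca, on which DrNote builds, additionally rank and disambiguate candidates using textual context and knowledge-graph features.

```python
# Naive illustration of the entity linking task: resolve a surface
# mention to candidate WikiData items via the public MediaWiki API
# (action=wbsearchentities). Production entity linkers such as
# OpenTapioca additionally disambiguate candidates using textual context
# and knowledge-graph features; this lookup ignores context entirely.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def candidate_entities(mention, language="en", limit=5):
    """Return (QID, label, description) candidates for a mention."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=30)
    resp.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in resp.json().get("search", [])]

for qid, label, description in candidate_entities("ibuprofen"):
    print(qid, label, "-", description)
```

A mention like "ibuprofen" typically yields the drug's QID as the top candidate, but ambiguous mentions return multiple candidates, which is why context-aware ranking is essential in practice.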