  • Review Article
  • Published: 14 November 2012

Text-mining solutions for biomedical research: enabling integrative biology

  • Dietrich Rebholz-Schuhmann 1,2
  • Anika Oellrich 1
  • Robert Hoehndorf 3,4

Nature Reviews Genetics volume 13, pages 829–839 (2012)

  • Bioinformatics
  • Literature mining
  • Systems biology

Text mining is a means of processing the scientific literature at a large scale and of making documents and their content more accessible.

Literature repositories, such as PubMed Central and UK PubMed Central, are data collections just like the scientific biomedical databases. They require special techniques to parse the text and to deliver the facts for further analysis.

Data integration, such as the normalization of named entities in the text to database entries, is an essential step towards integrative biology using semantic Web technology.

Knowledge discovery is the ultimate goal of any researcher when exploiting integrated biomedical resources. The scientific literature contributes novel hypotheses and facts.

The use of formal knowledge representations, such as ontologies and fact data repositories, is paramount for efficient hypothesis generation and validation.

Solutions are emerging that provide intelligent but automated systems to assist biomedical researchers, particularly those dealing with high-throughput data.

In response to the unbridled growth of information in literature and biomedical databases, researchers require efficient means of handling and extracting information. As well as providing background information for research, scientific publications can be processed to transform textual information into database content or complex networks and can be integrated with existing knowledge resources to suggest novel hypotheses. Information extraction and text data analysis can be particularly relevant and helpful in genetics and biomedical research, in which up-to-date information about complex processes involving genes, proteins and phenotypes is crucial. Here we explore the latest advancements in automated literature analysis and its contribution to innovative research approaches.

Author information

Authors and affiliations.

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Dietrich Rebholz-Schuhmann & Anika Oellrich

Institut für Computerlinguistik, Universität Zürich, Binzmühlestrasse 14, Zürich, 8050, Switzerland

Dietrich Rebholz-Schuhmann

Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK

Robert Hoehndorf

Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, CB2 3EG, UK

Corresponding author

Correspondence to Dietrich Rebholz-Schuhmann .

Ethics declarations

Competing interests.

The authors declare no competing financial interests.

Related links

Further information.

European Bioinformatics Institute

European Patent Office

Google Scholar

National Agricultural Laboratory Catalog

National Center for Biotechnology Information

Nature Reviews Genetics Series on Computational tools

OBO Flatfile Format Syntax and Semantics

Open Biomedical Annotator

PLoS Neglected Tropical Diseases : Impact of environment and social gradient on Leptospira infection in urban slums

ScienceDirect.com

SIDER — Side Effect Resource

Transcript Based Isoform Interaction Database (TBIID)

UK PubMed Central

Testable statements that, if true, may explain an observed phenomenon.

Databases of statements covering a knowledge domain. Often, statements are represented in a form that permits the automated or manual inference of statements that are not explicitly stated using inference rules.

Objective and (experimentally) verifiable ways in which the world is structured.

The process of selecting information or documents from a collection in response to a submitted query.

The process of automatically assessing documents, data or knowledge bases to extract statements that are likely to be true given the available information. Information extraction can be based on defined patterns, machine-learning techniques, statistical analyses or automated reasoning.

The process of analysing a set of statements to identify new statements that are true. To discover new knowledge, evidence must have already been gathered in support of the identified statements.

Declarative sentences that can be said to be either true or false. True statements express facts.

The information that has been gathered to demonstrate that the statement is true (that is, it corresponds to a fact); in science, evidence usually contains experimental results.

A reference to literature from which a statement or its supporting evidence were derived.

Single words or compositions of words with well-defined meanings.

The conceptualization of categories of entities or conceptual instances, represented by a unique identifier, a label and a definition.

The extraction of text constituents representing a specific type, preferably entities with a name such as a protein.

The mapping of a named entity or type in the text to a unique identifier, possibly requiring disambiguation and contextual analysis.

A representation of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. Commonly, ontologies are represented as a graph structure that represents a taxonomy.

Any constituents of the text — such as tokens, words, complex terms or representations of a concept — that serve as an input to a text-mining solution.

Statements that are represented in a formal language to denote the properties or relations of an entity (or concept).

The extension of the World Wide Web to provide, simultaneously, human- and computer-readable semantics through references to well-defined resources.

Processing of the sentence structure using statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a 'noun phrase'), their roles (for example, the 'subject') and dependencies.

Biomedical ontologies and databases serve as semantic resources, as they define and describe concepts and entities.

The use of software to derive statements automatically from a knowledge base using inference rules.

Sets of assertions that share the same topic or that result from the same source. The assertions must be conflict-free within a micro-theory but can contradict other micro-theories.

The selection or creation of hypotheses that can explain a given phenomenon. Commonly, selection criteria regarding relevance, parsimony or consistency with existing knowledge are applied to select the most viable hypotheses for a given phenomenon.
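The definitions above of named entity recognition (extracting text constituents of a specific type) and of normalization (mapping a mention to a unique identifier) can be illustrated with a minimal dictionary-lookup sketch. It is not a method described in the article: the lexicon, the function name and the choice of identifiers are assumptions made for illustration, and real systems add disambiguation and contextual analysis that this sketch omits.

```python
import re

# Hypothetical mini-lexicon (an assumption for this sketch):
# lowercased surface form -> database identifier.
LEXICON = {
    "insulin": "UniProtKB:P01308",
    "tp53": "UniProtKB:P04637",
    "p53": "UniProtKB:P04637",  # synonym normalized to the same entry
}

def recognize_and_normalize(text, lexicon=LEXICON):
    """Return (mention, start, end, identifier) for every token found in the lexicon."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        identifier = lexicon.get(match.group().lower())
        if identifier is not None:
            hits.append((match.group(), match.start(), match.end(), identifier))
    return hits

hits = recognize_and_normalize("Mutant p53 alters insulin signalling.")
for mention, start, end, identifier in hits:
    print(mention, start, end, identifier)
```

Two different surface forms ("TP53" and "p53") resolve to one identifier here, which is the step that lets mentions in text be joined with database entries.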

Rebholz-Schuhmann, D., Oellrich, A. & Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 13 , 829–839 (2012). https://doi.org/10.1038/nrg3337

Issue Date : December 2012

MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction

Wenhao Gu and Xiao Yang wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Wenhao Gu, Xiao Yang, Minhao Yang, Kun Han, Wenying Pan, Zexuan Zhu, MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction, Bioinformatics Advances , Volume 2, Issue 1, 2022, vbac035, https://doi.org/10.1093/bioadv/vbac035

Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) into structured information. In recent years, we have witnessed fundamental advances in NLP techniques, which have been widely used in applications such as financial text mining, news recommendation and machine translation. However, their application in the biomedical space remains challenging owing to a lack of labeled data and to the ambiguities and inconsistencies of biological terminology. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations between biomedical entities are valuable because they can survey the available literature more thoroughly, and hence provide a less biased result, than manual curation. In addition, the speed of machine reading helps to orient research and development quickly.

To address the aforementioned needs, we integrated automatic training data labeling, rule-based biological terminology cleaning and a more accurate NLP model for binary associative and multi-relation prediction into the MarkerGenie program. We demonstrated the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies.

MarkerGenie is available at https://www.genegeniedx.com/markergenie/. Data for model training and evaluation, term lists of biomedical entities, details of the case studies and all trained models are provided at https://drive.google.com/drive/folders/14RypiIfIr3W_K-mNIAx9BNtObHSZoAyn?usp=sharing.

Supplementary data are available at Bioinformatics Advances online.

Relations between biomedical entities (bioentities) are critical to biomedical studies and are hidden in a large number of biomedical articles. In this work, the main goal is to rapidly and accurately identify associative relations between a pair of biomedical entities present in the literature. We consider two entities to be associative in a context when they are described as directly correlated, whether causally or non-causally. Most biomedical entity relations, such as that between a biomarker and a disease, are associative. Determining such a relation is typically an important first step to guide additional wet-lab or clinical studies that verify a diagnostic, predictive, prognostic, predisposing or treatment relation. Without being exhaustive, a bioentity may refer to a disease, a gene, a metabolite or a microbial taxon.

Many biomedical text-mining methods have been used to identify associations of diseases and biomarkers. These methods can digest research articles more efficiently and comprehensively compared to human researchers and can help prioritize the targets in diagnoses and drug target discovery. In the following, we provide a brief overview of the current status of biomarker relation database curation and text-mining methods.

Bioentity relation databases are typically manually curated and serve as the ground truth for the research community. For example, Ma et al. (2017) manually extracted 292 microbe–microbe, 39 disease–disease and 483 microbe–disease associations from microbiome-related articles. Janssens et al. (2018) established a disease–microbiome database by querying the PubMed database using the criteria ([('microbiota' OR 'microbiome') and ('health' OR 'disease')] and [microbiome alterations]); the disease and microbiome terms and their relations were then extracted manually. Noronha et al. (2019) created a database that relates human metabolism with genetics, microbial metabolism, nutrition and diseases.

To automate and expand the scope of entity relation extraction, several methods have been introduced, including PolySearch2 (Liu et al., 2015), BEST (Lee et al., 2016), GenCLiP 3 (Wang et al., 2020), STRING (Szklarczyk et al., 2021), IBDDB (Khan et al., 2021) and DrugShot (Kropiwnicki et al., 2022). These methods share the assumption that the strength of an entity association is positively correlated with how often the entities co-occur in the same context. They first identify frequently co-occurring entities of interest and then refine the entity relations with different scoring and filtering criteria. This approach has two limitations. First, a pair of entities with a low co-occurrence frequency can be reliable but would be missed; recently discovered relations, for example, have few mentions in the literature. Second, high co-occurrence counts can include many false positives that require ad hoc and complex rules to eliminate.

These limitations have been addressed by supervised machine learning (ML)-based methods (Hsieh et al., 2017; Hua and Quan, 2016; Xu et al., 2015). To curate the CIViC database (Lever et al., 2019), published literature was parsed and sentences containing a pair of target entities were identified via exact string matching. A support vector machine-based classifier was then trained using 800 labeled sentences. Ahmed et al. (2019) proposed a neural network architecture for identifying protein–protein interactions (PPIs) from biomedical text. It uses a tree long short-term memory (LSTM) network with structured attention, traversing the dependency tree of a sentence through a child-sum tree LSTM, while structural information is learned through a parent selection mechanism that models non-projective dependency trees. The main challenge in applying ML methods is the lack of labeled training data. Although distant supervision (Mintz et al., 2009) can be used to acquire additional training data with positive labels, it cannot generate negative training data, and it requires a high-quality knowledge database that is typically hard to curate.

In this article, we treat finding relevant biomedical entities as a sentence-level binary or multi-class relation classification task. During entity extraction, we introduced rule-based strategies to reduce false positive extractions, as the existing bioentity terminologies still contain a large number of ambiguities and, sometimes, errors. To address the lack of training data and the labor-intensive manual labeling process, we proposed an automated training data generation method based on a co-occurrence frequency matrix and demonstrated its practical use. We then developed a new model, SBGT (SciBERT+Gumbel Tree-GRU), for relation classification. It uses SciBERT (Beltagy et al., 2019) to encode the contextual features of words and Gumbel Tree-GRU (Hong et al., 2020) to encode the syntactic structures of sentences.

We provide MarkerGenie as an online text-mining tool. The current release includes the following entities: diseases, microbiomes, genes and metabolites. The corpus currently includes the free text and tables of articles in PubMed and PubMed Central. An overview of MarkerGenie is given in Figure 1. The system comprises four main components: user query processing, article retrieval and sentence filtering, model-based classification and result reporting. The implementation details of MarkerGenie are provided in Supplementary Section S1.

MarkerGenie online workflow. MarkerGenie is a text-mining system for identifying biomarker relations with diseases. Given a query disease term, MarkerGenie first identifies relevant disease terms through fuzzy matching. Then, it retrieves articles according to the synonyms of the disease and selects the sentences that contain both the disease and biomarkers through entity extraction. Afterwards, the filtered sentences are classified by NLP models. Finally, the system returns the biomarkers related to the disease as extracted from the literature, in detail, including the source sentences, tables and articles. To improve speed, result caching is used.

2.1 SBGT model

In the proposed SBGT model, we used SciBERT (Beltagy et al., 2019) to extract the contextual features of words given the input sentence. SciBERT improves the handling of unseen and rare words by using a subword tokenizer, which operates at a granularity between words and characters. It has been shown experimentally to outperform BERT-Base and BioBERT in relation extraction from biomedical text (Beltagy et al., 2019). We then used the Gumbel Tree-GRU (Hong et al., 2020) to encode the syntactic structure. The encoded vectors were concatenated and fed into a fully connected layer for prediction. As shown in Figure 1, given a sentence, SciBERT extracts the contextual features of each word, and each word is encoded as a 1 × 768 vector. Gumbel Tree-GRU then composes those word vectors into a single vector that represents the sentence. Afterward, the sentence vector and the contextual features of Entity1 and Entity2 are concatenated to represent their relation. Finally, a fully connected layer is applied to predict the probability of the relation falling within each category.
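The final prediction step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: random vectors stand in for the SciBERT and Gumbel Tree-GRU outputs, and the sentence-vector size (768), the weight initialization and the function names are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(sentence_vec, ent1_vec, ent2_vec, weights, bias):
    """Concatenate the three vectors and apply one fully connected layer."""
    features = sentence_vec + ent1_vec + ent2_vec  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Placeholder inputs: random stand-ins for the encoder outputs (assumption).
rng = random.Random(0)
DIM, N_CLASSES = 768, 2
sentence_vec = [rng.gauss(0, 1) for _ in range(DIM)]
ent1_vec = [rng.gauss(0, 1) for _ in range(DIM)]
ent2_vec = [rng.gauss(0, 1) for _ in range(DIM)]
weights = [[rng.gauss(0, 0.01) for _ in range(3 * DIM)] for _ in range(N_CLASSES)]
bias = [0.0] * N_CLASSES

probs = classify_relation(sentence_vec, ent1_vec, ent2_vec, weights, bias)
```

In a trained model the weights would be learned jointly with the encoders; here they are random, so `probs` only demonstrates the shape of the output (one probability per relation category).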

2.2 Unsupervised training data generation for binary relation classification

To generate the training data, a co-occurrence frequency matrix of the bioentities was first constructed from sentences in the free text of PubMed and PubMed Central. We chose the entity pairs with the highest co-occurrence counts and used two thresholds, ‘minimum co-occurrences t1’ and ‘truncating quantity t2’, to generate the positive data. Specifically, only sentences containing a pair of entities that co-occurred ≥ t1 times were considered, and at most t2 of these sentences were retained per pair to prevent bias toward high-frequency entity pairs. The default values of t1 = 10 and t2 = 50 were set empirically and used in all current experiments. To generate negative data, we included sentences containing entity pairs with a frequency of one in the matrix, except for sentences that contain only a single disease term and a single biomarker term, because we found that such sentences were more likely to be positive cases. Note that a negative sample means that there is no direct association between the two biomedical entities in the sentence. As with co-occurrence-based methods, some rarer and possibly more relevant biomedical associations may be missed by ignoring low-occurrence data. However, the samples generated here were used as labeled data to train a model rather than as the final result. In the actual prediction stage, the associative relation between a pair of bioentities is extracted by the trained model regardless of their co-occurrence frequency. An example of the positive and negative data generation process is given in Figure 2. The complete training data were generated with a 6:4 ratio of positive to negative instances, which is consistent with the fraction of positive and negative instances observed in the literature. The generated training data were further divided by an 8:2 ratio into training and validation sets for model parameter optimization on F1-score. The model performance was measured on independent datasets.
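The thresholding procedure can be illustrated with a short sketch. It is a simplified reading of the description above, not the released code: the input format (each sentence paired with the list of distinct disease–biomarker pairs found in it) and the function name are assumptions, "a single disease term and biomarker term" is approximated as a sentence with exactly one pair, and the 6:4 balancing step is omitted.

```python
from collections import Counter


def generate_training_data(sentences, t1=10, t2=50):
    """Label sentences from co-occurrence counts.

    sentences: list of (text, pairs), where pairs is a list of distinct
    (disease, biomarker) tuples found in that sentence (assumed format).
    Returns (positives, negatives) as lists of (text, pair).
    """
    # Co-occurrence frequency of every entity pair across all sentences.
    pair_counts = Counter(pair for _, pairs in sentences for pair in pairs)
    kept_per_pair = Counter()
    positives, negatives = [], []
    for text, pairs in sentences:
        for pair in pairs:
            count = pair_counts[pair]
            if count >= t1:
                # Keep at most t2 sentences per frequent pair.
                if kept_per_pair[pair] < t2:
                    kept_per_pair[pair] += 1
                    positives.append((text, pair))
            elif count == 1 and len(pairs) > 1:
                # Pairs seen once are negatives, unless the sentence holds
                # only one disease term and one biomarker term.
                negatives.append((text, pair))
    return positives, negatives
```

Pairs whose frequency falls between 2 and t1 − 1 are left unlabeled in this sketch, which matches the text in that only the two extremes of the matrix are used for training.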

Example of unsupervised training data generation. The heatmap on the left shows the co-occurrence of 20 diseases and 20 microbes in the literature. Gastroenteritis and Vibrio parahaemolyticus co-occur most frequently (more than 400 times, which exceeds the predefined threshold), so they are considered related and the corresponding sentences in the articles were selected as positive samples. In contrast, low co-occurrence pairs, e.g. Liver Cirrhosis and Alkaline xylosoxidans, tend to be unrelated and the corresponding sentences formed the negative samples.

2.3 Entity extraction

Entity extraction is a prerequisite step of relation extraction. The entities of interest were curated in term lists in advance. Currently, term lists have been curated for the following entity types: disease, microbiome, metabolite and gene, as detailed in Supplementary Section S2. First, spaCy (Neumann et al., 2019) was used for sentence splitting and tokenization. Then, similar to CIViCmine, exact string matching was applied to the tokenized sentences to extract entities. To achieve this, we first constructed a trie over all synonyms and then located the longest matching term in each sentence by traversing the trie. This strategy has a run time complexity of O(n) for a length-n sentence.
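A token-level trie with longest-match lookup can be sketched as follows. The function names and the lower-casing of tokens are assumptions for illustration, and whitespace splitting stands in for the spaCy tokenizer; the run time stays close to linear in sentence length because each lookup walks at most as many tokens as the longest term in the list.

```python
END = object()  # sentinel key marking the end of a term; cannot clash with a token


def build_trie(terms):
    """Build a token-level trie from a list of entity synonyms."""
    root = {}
    for term in terms:
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[END] = term
    return root


def find_entities(tokens, trie):
    """Return (start, end, term) for the longest term starting at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node, longest, j = trie, None, i
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                longest = (i, j, node[END])  # remember the longest match so far
        if longest:
            matches.append(longest)
            i = longest[1]  # continue after the matched term
        else:
            i += 1
    return matches
```

For example, with the terms "colorectal cancer" and "cancer" both in the list, the phrase "colorectal cancer" is reported once as the longer term rather than as the shorter "cancer".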

To improve the accuracy of entity recognition, rule-based filtering was further applied. If a disease term was prefixed by a single letter followed by a dot, as in ‘s. pneumonia’, the term was disregarded. We also removed terms shorter than four characters unless they were determined to be abbreviations, i.e. they conformed to the pattern ‘synonym + (entity)’.
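These two filters might be implemented along the following lines. The regular expressions and the function name are assumptions chosen to match the description, not the rules actually shipped with MarkerGenie; the mention is identified by its character span in the sentence.

```python
import re

# A single letter plus a dot right before the mention, e.g. "S. " in "S. pneumonia".
INITIAL_BEFORE = re.compile(r"\b[A-Za-z]\.\s*$")
# A word followed by an opening parenthesis right before the mention.
LONG_FORM_BEFORE = re.compile(r"\w\s*\($")


def keep_mention(sentence, start, end, min_len=4):
    """Return True if the mention sentence[start:end] passes both filters."""
    before = sentence[:start]
    if INITIAL_BEFORE.search(before):
        return False  # looks like an abbreviated species name, not a disease
    if end - start < min_len:
        # Short terms are kept only in the form "long form (ABBR)".
        in_parens = sentence[end:end + 1] == ")"
        return bool(LONG_FORM_BEFORE.search(before)) and in_parens
    return True
```

The first pattern is deliberately crude: it would also reject a mention that directly follows "e.g.", so a production rule set would likely need further exceptions.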

2.4 Relation extraction from tables

Unlike the classification-based relation extraction used for text, we used rule-based methods on tables. A table was first extracted and stored as a tuple (caption, table-head, table-body: list of data rows). Bioentity relations generally appear in a table in one of two patterns, illustrated here with disease–microbiome relation extraction. In the first pattern, a disease term and a collective term for the microbiome (e.g. ‘microbiome’, ‘bacteria’) co-occur in the caption of a table and microbiome terms are present in the body of the table; all microbiome terms in the table body are then considered to be related to the disease (Fig. 3A). In the second pattern, a disease term and a specific microbiome term co-occur in a row or in the caption of the table; they are then considered to be related (Fig. 3B).
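The two table patterns can be expressed as a small rule set. The sketch below assumes case-insensitive substring matching against caller-supplied term lists, and ignores the table head; the function and parameter names are hypothetical and the real system may match terms differently.

```python
def relations_from_table(caption, rows, diseases, microbes, collective_terms):
    """Extract (disease, microbe) pairs from a table using the two patterns.

    caption: str; rows: list of rows, each a list of cell strings.
    """
    def found(terms, text):
        return {t for t in terms if t.lower() in text}

    caption_l = caption.lower()
    caption_diseases = found(diseases, caption_l)
    caption_microbes = found(microbes, caption_l)
    has_collective = any(c.lower() in caption_l for c in collective_terms)

    relations = set()
    for row in rows:
        row_text = " ".join(row).lower()
        row_diseases = found(diseases, row_text)
        row_microbes = found(microbes, row_text)
        if has_collective:
            # Pattern A: caption names the disease and a collective term,
            # so every microbe in the body is linked to that disease.
            relations |= {(d, m) for d in caption_diseases for m in row_microbes}
        # Pattern B (row): disease and specific microbe share a row.
        relations |= {(d, m) for d in row_diseases for m in row_microbes}
    # Pattern B (caption): disease and specific microbe share the caption.
    relations |= {(d, m) for d in caption_diseases for m in caption_microbes}
    return relations
```

Substring matching is used only to keep the example short; reusing the trie-based entity extraction on each cell would avoid partial-word matches.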

Illustration of relation extraction from tables. (A) CRC and a collective term for the microbiome (‘bacterium’) co-occur in the caption of the table, so all microbes in the table body are considered to be related to CRC. (B) CRC and a specific microbiome term (‘Fusobacterium nucleatum’) co-occur in a row of the table, so they are considered to be related.

2.5 Granular relation classification between a disease and bioentities

When a disease and a bioentity are determined by binary classification to be associative, MarkerGenie can further assign them one of five granular relation types (Predictive, Prognostic, Diagnostic, Predisposing or Treatment), provided that CIViCmine’s search terms indicate a potential specific relation between them (Fig. 4). The training sentences for this classification task were first generated via distant supervision using the knowledge databases CBD (Zhang et al., 2018), MarkerDB (Wishart et al., 2021) and OncoMX (Dingerdissen et al., 2020). We then used a term list (e.g. ‘risk’ and ‘survival’) provided by CIViCmine to screen for sentences that potentially contain one of the five specific relations. In addition, to increase the size of the training data, we expanded the term list with synonyms identified using pre-trained word vectors.

The workflow of MarkerGenie for classifying the granular relation types between a disease and bio-entities. When a disease and a bio-entity are determined by binary classification to be associative, MarkerGenie judges if there is a potential specific relation between them by using CIViCmine’s search terms. Then, the trained model is applied to predict the granular relation types (Predictive, Prognostic, Diagnostic, Predisposing or Treatment) of the filtered sentences

In this section, we first demonstrate the improved accuracy of the SBGT model by applying it to curated benchmark datasets that were used by previous methods: the binary relation classification of PPI (Pyysalo et al., 2007) and the multi-relation classification of drug–drug interaction (DDI’13) (Herrero-Zazo et al., 2013). Next, we demonstrate the validity of automatic training data generation by applying MarkerGenie to disease–biomarker binary associative relation classification. This task does not require any prior knowledge or curated databases. However, when curated databases are available, MarkerGenie can generate training data via a distant supervision strategy and produce multi-relation classification. This was demonstrated on the disease–gene multi-relation extraction task carried out in CIViCmine (Lever et al., 2019). Finally, we demonstrate how MarkerGenie can aid biomarker discovery with a few case studies.

3.1 Binary relation classification

The SBGT model was first validated on the PPI corpus (Pyysalo et al., 2007), which was used as a benchmark dataset by prior methods. The dataset information and hyper-parameters of SBGT are summarized in Table 1. To ensure that the learned model generalizes, we replaced the pair of proteins in each sentence with ‘PROTEIN1’ and ‘PROTEIN2’. In addition, all sentences were truncated or padded to a maximum length of 100. The performance of SBGT was compared with seven other state-of-the-art models: sdpCNN (Hua and Quan, 2016), sdpLSTM (Xu et al., 2015), BERT (Devlin et al., 2019), BioBERT (Lee et al., 2020), DRCNN (Zhang et al., 2019), Bi-LSTM (Hsieh et al., 2017) and BioKGLM (Fei et al., 2021). The evaluation scheme and parameters of the compared algorithms were all set as in the original papers. The F1-scores of these methods are given in Table 2, where SBGT achieved a 3.2% improvement over the runner-up. Since some of the methods were evaluated with macro F1-score in the corresponding references, we also included this metric in Table 2, where SBGT consistently outperformed the compared models, including DCNN (Choi, 2018), Att-sdpLSTM (Yadav et al., 2019), tLSTM (Ahmed et al., 2019) and DRCNN.
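The entity masking and length normalization described above can be sketched as below. This is an illustration under stated assumptions: the length of 100 is taken to be counted in tokens, the padding symbol `[PAD]` and the function name are hypothetical, and the first entity span is assumed to precede the second without overlap.

```python
def mask_and_pad(tokens, span1, span2, max_len=100,
                 tag1="PROTEIN1", tag2="PROTEIN2", pad="[PAD]"):
    """Replace two entity spans with placeholder tags, then truncate or pad.

    span1 and span2 are (start, end) token offsets, with span1 before span2.
    """
    (s1, e1), (s2, e2) = span1, span2
    masked = tokens[:s1] + [tag1] + tokens[e1:s2] + [tag2] + tokens[e2:]
    masked = masked[:max_len]                         # truncate long sentences
    return masked + [pad] * (max_len - len(masked))   # pad short sentences


tokens = "RAD51 interacts with BRCA2 in vivo".split()
example = mask_and_pad(tokens, (0, 1), (3, 4), max_len=8)
```

Masking the entity names forces the classifier to rely on the surrounding context rather than on memorized protein names, which is the generalization argument made in the text.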

PPI and DDI’13 dataset information and hyper-parameters of SBGT

Note : The evaluation schemes were selected to be consistent with the methods under comparison.

Comparison of SBGT and other methods on PPI dataset in terms of F1 score and macro-F1 score

3.2 Multi-relation classification

We applied SBGT to the DDI’13 dataset (Herrero-Zazo et al., 2013), where the goal was to determine the specific relation (defined as one of {NA, ADVICE, EFFECT, MECHANISM, INT}) between two given drugs. As in binary classification, we replaced the pair of drugs in each sentence with ‘<ent1>’ and ‘<ent2>’, and all sentences were truncated or padded to a maximum length of 100. On this dataset, SBGT was trained with the hyper-parameters shown in Table 1. SBGT was compared with seven other state-of-the-art models, namely SCNN (Zhao et al., 2016), CNN-bioWE (Liu et al., 2016), MCCNN (Quan et al., 2016), Joint AB-LSTM (Sahu and Anand, 2018), RvNN (Lim et al., 2018), Position-aware LSTM (Zhou et al., 2018) and BERE (Hong et al., 2020), in terms of precision, recall and F1-score. As shown in Table 3, SBGT attained the best trade-off between precision and recall. SBGT obtained an F1-score of 77.1%, which is ∼3% higher than that of the second-best model.

Comparison of SBGT and other methods on DDI’13 dataset in terms of Precision, Recall and F1 score

3.3 Disease–biomarker associative binary classification with automatic training data generation

We selected three major biomarker types (microbiome, metabolite and gene) to study their associative relations with diseases in publicly available articles from PubMed and PubMed Central. Labeled training data for these tasks are scarce or even missing, although some have been manually curated (Lever et al., 2019; Liu et al., 2015). We therefore used the unsupervised method introduced in Section 2.2 to generate the labeled training data automatically.

Admittedly, automatic label generation can include many false positive instances: upon manual inspection, around 15–20% of the positive samples were incorrectly labeled. Yet, it yields a large amount of data within a few hours of run time. We obtained around 6000 disease–microbiome training samples and around 10 000 disease–metabolite or disease–gene training samples. Data of this size are more suitable for deep learning strategies than typical curated datasets, which contain hundreds of samples (Lever et al., 2019). Although training on noisy data increased the model bias, the overall model performance on the test data improved with the size of the training set. As shown for the disease–microbiome example in Figure 5, the F1 value generally increased as more data became available.

An illustration of the impact of automatically acquired training data on model performance. The SBGT model was trained on different sizes of disease–microbiome training dataset and F1 scores were obtained on the independent 477 test samples. Each experiment was repeated three times and the F1 score was the average

To evaluate the performance of MarkerGenie on the above tasks, we manually curated 477 disease–microbiome samples and 610 disease–metabolite samples. For disease–gene prediction, 382 labeled disease–gene samples were obtained directly from Liu et al. (2015). MarkerGenie predicted disease–microbiome, disease–metabolite and disease–gene relations with precisions of 83.28%, 85.26% and 82.01%, respectively (Fig. 6A; the corresponding F1 scores and precision–recall curves are shown in Fig. 6B and C). Empirically, around 60–70% of the disease–biomarker pairs co-occurring in the same sentence have a true positive relation; MarkerGenie therefore removed over 10–20% of the false positive instances. For these three tasks, MarkerGenie recalled 84.92%, 89.73% and 88.43% of the relations, respectively. We note that the performance reported by Liu et al. (2015) on this disease–gene relation dataset, which was obtained from the same study, had both higher precision (∼5%) and higher recall (∼2%), yet its generalizability cannot be independently evaluated. Also note that Liu et al. (2015) used a rule-based method that factors in prior knowledge of validated disease–gene relations, which is generally unknown to the model.
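For reference, the metrics quoted above follow the standard definitions for binary classification. The helper below is a generic sketch (the function name and the 0/1 label encoding are assumptions) and is not tied to the evaluation scripts used in the study.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels (1 = related)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: two true positives, one false positive, one false negative.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick sanity check when reading Figure 6A and B side by side.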

Performance of MarkerGenie on disease–biomarker relation identification. (A) Precision and recall of disease–biomarker associative binary classification. (B) F1 scores of disease–biomarker associative binary classification. (C) Precision–recall curves. Because the threshold is high at the beginning, few samples are marked as positive, so the upper left part of each curve fluctuates greatly. (D–G) Precision–recall curves of MarkerGenie and CIViCmine on the extraction of four specific relations, i.e. predictive, prognostic, diagnostic and predisposing.

3.4 Granular relation extraction via distant supervision

Following binary associative relation prediction, MarkerGenie can rely on disease–biomarker relation knowledge bases to automatically generate training data via distant supervision and then yield more deterministic relation predictions. In this part, the performance of MarkerGenie was verified with the 250 test samples from Lever et al. (2019), which cover four granular relation types between cancers and genes: Diagnostic, Predictive, Predisposing and Prognostic. MarkerGenie was compared with CIViCmine (Lever et al., 2019) in terms of precision and recall; the precision–recall curves of the two methods are shown in Figure 6D–G. MarkerGenie obtained better precision and recall than CIViCmine.

In the following, we demonstrate how MarkerGenie can be applied to biomarker discoveries with different case studies.

3.4.1 Identification of colorectal cancer-related microbes

Colorectal cancer (CRC) is the third most common cancer worldwide and one of the primary causes of cancer-related deaths (Rawla et al., 2019). The association between CRC and the human gut microbiome is a focus of current CRC research (Abdulla et al., 2021; Chattopadhyay et al., 2021; Sánchez-Alcoholado et al., 2020). In this study, we used MarkerGenie to find the microbes related to CRC in the literature and manually verified the results.

In searching for microbes related to CRC, MarkerGenie returned a total of 2257 sentences that included 264 microbes. Among these 2257 sentences, 2118 were correctly predicted, 98 were wrongly predicted and 41 were difficult to judge via manual inspection. The overall sentence-level binary classification precision was therefore 93.8% (2118 of 2257). At the microbe level, 247 of the 264 microbes are associated with CRC. Figure 7A shows an example list of microbes and the corresponding sentences, and Figure 7B shows the 10 microbes with the highest occurrences. Eight of these 10 have previously been shown to be significantly associated with CRC in a meta-analysis study (Thomas et al., 2019). The remaining two, ‘Helicobacter pylori’ and ‘Human papillomavirus’, have also been shown to be strongly related to CRC in more recent work (Chao et al., 2020; Wang et al., 2021). These results should provide a good reference for researchers studying CRC and the microbiome.

Statistics of microbes related to CRC returned by MarkerGenie. (A) Example of microbes and the corresponding sentences. (B) The top 10 microbes returned by MarkerGenie

As discussed earlier, upon a positive prediction of binary associative relation between CRC and a microbe, we can further generate more deterministic relation classification via distant supervision (see Section 2.5). Here, MarkerGenie produced 185 predisposing, 181 predictive, 154 prognostic, 67 treatment and 33 diagnostic relations.

3.4.2 Identification of breast cancer-related genes

Identifying relevant genes is valuable for the early diagnosis, prevention and treatment of breast cancer (Kazmi et al., 2022; Schettini et al., 2021; Zhang et al., 2022). We used MarkerGenie to search for genes associated with breast cancer and to rank their importance. Similar to BEST (Lee et al., 2016), we present the top 10 genes found by MarkerGenie along with the ones identified by BEST, PolySearch2 and CIViCmine in Figure 8. Eight of them were identified by at least one of the other methods and are reported to be associated with breast cancer in the CIViC knowledge base (https://civicdb.org) or NCBI’s GENE database (https://www.ncbi.nlm.nih.gov/gene/). The remaining two genes, ‘ITK’ and ‘NAC’, were false positives upon inspection. Specifically, the term ‘NAC’ refers to a type of therapy for breast cancer. For ‘ITK’, the term identified in association with breast cancer is ‘EMT’, which is an alias of the ‘ITK’ gene. However, in the text ‘EMT’ refers to ‘epithelial-mesenchymal transition’, a process linked to breast cancer. Both false positives are valid entries in the gene list but had different meanings in the text. To further improve accuracy, the ambiguities of terms in the list need to be resolved.

Top 10 genes retrieved with the query ‘Breast cancer’ by different systems

3.4.3 Disease–miRNA association inference

The output of MarkerGenie can also be used directly for other applications such as association prediction. We selected disease–miRNA association inference as a suitable application because it involves three types of relations: disease–disease, disease–miRNA and miRNA–miRNA. The details of the inference method and experimental results are provided in Supplementary Section S3. Based on miRNA–miRNA functional similarity, disease–disease semantic similarity and the disease–miRNA associations identified by MarkerGenie, we can infer unknown disease–miRNA associations as accurately as methods based on curated databases such as HMDD (Huang et al., 2019). MarkerGenie can therefore serve as a surrogate for laboriously curated databases.

In this work, we proposed a text-mining system, MarkerGenie, to identify bioentity relations from the text and tables of publications in PubMed and PubMed Central. The identification problem was formulated as a relation classification task. A new unsupervised training data generation method and a new classification model, SBGT, were introduced and tested with benchmark datasets and real-world case studies. The experimental results demonstrated the effectiveness of the system. There is further room for improvement, including cross-sentence relation extraction, better selection of negative samples and better handling of ambiguous short entity terms such as gene symbols. It would also be desirable to recognize the context (e.g. experimental conditions and biological relevance) in which the biomarkers are identified and to improve entity extraction with text-mining methods (e.g. PubTator and NER models).

We would like to express our gratitude toward the editor and the anonymous reviewers whose valuable comments greatly contributed to this manuscript.

Author contributions

X.Y. and W.G. conceived the experiments. W.G. and M.Y. conducted the experiments. K.H. and W.P. analyzed the results. Z.Z., X.Y., K.H. and W.G. wrote and reviewed the manuscript.

This work was partially supported by National Key Research and Development Project [2019YFE0109600] and the National Natural Science Foundation of China [61871272 and 61911530218], the Guangdong Provincial Key Laboratory [2020B121201001], the Shenzhen Fundamental Research Program [JCYJ20190808173617147], and the open project of BGIShenzhen [BGIRSZ20200002].

Conflict of Interest : none declared.

Abdulla M.-H. et al. (2021) Association of the microbiome with colorectal cancer development. Int. J. Oncol., 58, 1–12.

Ahmed M. et al. (2019) Identifying protein-protein interaction using tree LSTM and structured attention. In: 2019 IEEE 13th International Conference on Semantic Computing, Newport Beach, CA, USA. pp. 224–231.

Beltagy I. et al. (2019) SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China. pp. 3615–3620.

Chao G. et al. (2020) The prevalence of human papillomavirus in colorectal cancer and adenoma: a meta-analysis. J. Cancer Res. Ther., 16, 1656.

Chattopadhyay I. et al. (2021) Exploring the role of gut microbiome in colon cancer. Appl. Biochem. Biotechnol., 193, 1780–1799.

Choi S.-P. (2018) Extraction of protein–protein interactions (PPIs) from the literature by deep convolutional neural networks with various feature embeddings. J. Inf. Sci., 44, 60–73.

Devlin J. et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA. Vol. 1, pp. 4171–4186.

Dingerdissen H.M. et al. (2020) OncoMX: a knowledgebase for exploring cancer biomarkers in the context of related cancer and healthy data. JCO Clin. Cancer Inform., 4, 210–220.

Fei H. et al. (2021) Enriching contextualized language model from knowledge graph for biomedical information extraction. Brief. Bioinform., 22, bbaa110.

Herrero-Zazo M. et al. (2013) The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform., 46, 914–920.

Hong L. et al. (2020) A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories. Nat. Mach. Intell., 2, 347–355.

Hsieh Y.-L. et al. (2017) Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan. Vol. 2, pp. 240–245.

Hua L., Quan C. (2016) A shortest dependency path based convolutional neural network for protein-protein relation extraction. Biomed. Res. Int., 2016, 8479587.

Huang Z. et al. (2019) HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res., 47, D1013–D1017.

Janssens Y. et al. (2018) Disbiome database: linking the microbiome to disease. BMC Microbiol., 18, 1–6.

Kazmi N. et al. (2022) Rho GTPase gene expression and breast cancer risk: a Mendelian randomization analysis. Sci. Rep., 12, 1463.

Khan F. et al. (2021) IBDDB: a manually curated and text-mining-enhanced database of genes involved in inflammatory bowel disease. Database, 2021, 13.

Kropiwnicki E. et al. (2022) DrugShot: querying biomedical search terms to retrieve prioritized lists of small molecules. BMC Bioinformatics, 23, 1–16.

Lee J. et al. (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36, 1234–1240.

Lee S. et al. (2016) BEST: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS One, 11, e0164680.

Lever J. et al. (2019) Text-mining clinically relevant cancer biomarkers for curation into the CIViC database. Genome Med., 11, 1–16.

Lim S. et al. (2018) Drug drug interaction extraction from the literature using a recursive neural network. PLoS One, 13, e0190926.

Liu S. et al. (2016) Drug-drug interaction extraction via convolutional neural networks. Comput. Math. Methods Med., 2016, 6918381.

Liu Y. et al. (2015) Polysearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more. Nucleic Acids Res., 43, W535–W542.

Ma W. et al. (2017) An analysis of human microbe-disease associations. Brief. Bioinform., 18, 85–97.

Mintz M. et al. (2009) Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore. pp. 1003–1011.

Neumann M. et al. (2019) ScispaCy: fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy. pp. 319–327.

Noronha A. et al. (2019) The virtual metabolic human database: integrating human and gut microbiome metabolism with nutrition and disease. Nucleic Acids Res., 47, D614–D624.

Pyysalo S. et al. (2007) BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8, 50–24.

Quan C. et al. (2016) Multichannel convolutional neural network for biological relation extraction. Biomed. Res. Int., 2016, 1850404.

Rawla P. et al. (2019) Epidemiology of colorectal cancer: incidence, mortality, survival, and risk factors. Prz. Gastroenterol., 14, 89–103.

Sahu S.K., Anand A. (2018) Drug-drug interaction extraction from biomedical texts using long short-term memory network. J. Biomed. Inform., 86, 15–24.

Sánchez-Alcoholado L. et al.  ( 2020 ) The role of the gut microbiome in colorectal cancer development and therapy response . Cancers , 12 , 1406 .

Schettini F. et al.  ( 2021 ) Clinical, pathological, and PAM50 gene expression features of HER2-low breast cancer . NPJ Breast Cancer , 7 , 1 – 13 .

Szklarczyk D. et al.  ( 2021 ) The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets . Nucleic Acids Res ., 49 , D605 – D612 .

Thomas A.M. et al.  ( 2019 ) Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation . Nat. Med ., 25 , 667 – 678 .

Wang C. et al.  ( 2021 ) Hp-positive Chinese patients should undergo colonoscopy earlier and more frequently: the result of a cross-sectional study based on 13,037 cases of gastrointestinal endoscopy . Front. Oncol ., 11 , 698898 .

Wang J.-H. et al.  ( 2020 ) Genclip 3: mining human genes’ functions and regulatory networks from pubmed based on co-occurrences and natural language processing . Bioinformatics , 36 , 1973 – 1975 .

Wishart D.S. et al.  ( 2021 ) MarkerDB: an online database of molecular biomarkers . Nucleic Acids Res ., 49 , D1259 – D1267 .

Xu Y. et al.  ( 2015 ) Classifying relations via long short term memory networks along shortest dependency paths. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal. pp. 1785 – 1794 .

Yadav S. et al.  ( 2019 ) Feature assisted stacked attentive shortest dependency path based Bi-LSTM model for protein–protein interaction . Knowl. Based Syst ., 166 , 18 – 29 .

Zhang H. et al.  ( 2019 ) Deep residual convolutional neural network for protein-protein interaction extraction . IEEE Access , 7 , 89354 – 89365 .

Zhang W. et al.  ( 2022 ) Epigenetic study of early breast cancer (EBC) based on DNA methylation and gene integration analysis . Sci. Rep ., 12 , 1989 .

Zhang X. et al.  ( 2018 ) CBD: a biomarker database for colorectal cancer . Database , 2018 , 12 .

Zhao Z. et al.  ( 2016 ) Drug drug interaction extraction from biomedical literature using syntax convolutional neural network . Bioinformatics , 32 , 3444 – 3453 .

Zhou D. et al.  ( 2018 ) Position-aware deep multi-task learning for drug–drug interaction extraction . Artif. Intell. Med ., 87 , 1 – 8 .


  • Online ISSN 2635-0041
  • Copyright © 2024 Oxford University Press
Healthc Inform Res, v.23(3); 2017 Jul

Text Mining in Biomedical Domain with Emphasis on Document Clustering

Vinaitheerthan Renganathan

Head of Institutional Research, Skyline University College, Sharjah, UAE.

With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents.

This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain.

Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail.

Conclusions

Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.

I. Introduction

Text mining [ 1 , 2 ] is the process of extracting new information on a particular topic from a set of documents. Text mining is useful where the data is in the form of text (document) which is unstructured and cannot be processed using traditional methods, such as data mining methods [ 3 ]. Text mining is different from normal search queries as it is also useful in discovering unknown information from a set of documents. Text mining is based on the natural language processing technique, which helps computers to understand and process human language [ 4 ].

II. Roles of Text Mining and Its Applications in Biomedical Field

Biomedical researchers have started using text mining techniques [ 1 , 5 ] due to the vast amount of unstructured information available in the biomedical domain in the form of research articles, case reports, Electronic Health Records (EHRs), and so forth. The following section highlights the applications of text mining in the biomedical domain.

1. Extraction of Knowledge from Biomedical Literature

The number of articles and papers published in the biomedical domain is increasing at a fast rate due to the expansion of online publishing. The total number of articles indexed in MEDLINE exceeded 23 million [ 6 ], and the number of citations reached more than 806 thousand. Thus, knowledge extraction from this vast collection of articles on a particular topic could be very time-consuming. Text mining techniques can facilitate the extraction of unknown knowledge from the vast number of articles available. Some of the research carried out in this area has focused on the following applications. (1) In the field of cancer research, it offers a means to improve diagnosis, treatment, and prevention of cancer through mining of cancer literature [ 7 ]. (2) In pharmacology, it can be helpful to extract drug-drug interactions, protein interactions, and microbial interactions through mining of biomedical literature.

2. Text Mining in Systematic Reviews

Systematic reviews [ 8 ] normally involve searching, screening, and synthesizing information from articles meeting the inclusion criteria and combing through the results to address the research problem. The searching, screening, and classification of articles requires enormous time and effort for the researchers involved in systematic reviews. Text mining offers tools for carrying out automatic searching, clustering, and classification of documents and information extraction during various stages of systematic reviews, such as searching, screening, and synthesis of information.

3. Text Mining in Information Extraction from Electronic Health Records

EHR systems store huge amounts of structured and unstructured information. Data mining methods are helpful in analyzing the structured part. Clinical texts [ 9 ], such as patient pathology reports, personal medical histories, and notes related to findings during examinations or procedures form the unstructured part of EHRs, and they can be analyzed using text mining techniques to explore hidden information [ 9 ]. The following are some of the applications of text mining in EHR systems. (1) EchoInfer [ 10 ], a text mining software tool, can be used to extract data pertaining to cardiovascular structures and functions from heterogeneously formatted echocardiographic data sources. (2) A text mining system using Bayesian networks can be used to mine narrative text from mammography reports to aid cancer diagnosis [ 11 ].

4. Biomarkers

Text mining techniques are useful for identifying disease-related biomarkers and associated genes from the literature [ 12 ]. Text mining is applied through named entity recognition (NER), a sub-task of information extraction that identifies entities and classifies them into classes such as gene, name of a person, organization, etc. The following are some of the applications of text mining in the field of biomarkers. (1) MeinfoText [ 13 ] is a tool that provides knowledge about associations between gene methylation and cancer through the mining of large amounts of literature. (2) Whatizit [ 14 ] is a tool that identifies terms from a web source, such as PubMed abstracts, and then links them to the corresponding entries in bioinformatics databases.

5. Disease Surveillance

Web mining, a part of text mining, helps to detect disease outbreaks in disease surveillance systems. (1) BioCaster [ 15 ] is a web-based open-source and ontology-based text mining system for detecting and tracking the outbreak of diseases from web-based sources. (2) MedISys and PULS [ 16 ] are information retrieval and extraction systems used to analyse disease epidemics.

6. Other Areas

Some recent advances in biomedical text mining are in the areas of pharmacogenomics, toxicology, precision medicine, and drug repositioning [ 5 ]. Text mining can help identify named entities, such as genes, proteins, drugs, and diseases, and identify relations among them. Text mining can support the extraction of genotype and phenotype data for providing care in the field of precision medicine, as well as the identification of relations among existing drugs and new diseases by mining the biomedical literature to reposition drugs.

III. Text Mining Software in Biomedical Domain

Currently there are several free and commercial software tools available to carry out text mining on various research databases. Table 1 lists the free and commercial software tools available for text mining in the field, and Table 2 compares the biomedical text mining software tools presented in Table 1 .


IV. Text Mining Process

The following processes are involved in text mining: search and retrieval of documents [ 24 ], creation of a corpus of documents, pre-processing of documents [ 25 ], preparation of the document matrix, clustering of documents [ 26 ], finding associations, preparation of a word cloud, and processing the language part using natural language processing techniques [ 4 , 10 ]. Once these steps are completed, the next level classifies the documents using a naïve Bayes classifier [ 1 , 12 ], the support vector machine [ 1 , 12 ] method, or the decision tree method. The vector space model [ 1 , 12 ] takes centre stage in the text mining process: documents are represented as n-dimensional vectors of terms.

1. Search and Retrieval of Documents

The first step involved in text mining is to search and retrieve documents using the information retrieval process [ 24 ], which automatically retrieves documents based on the information need of the user from a large collection of documents, which is usually web-based.

2. Pre-processing of Documents

The pre-processing of documents [ 25 ] involves the steps of stop word removal and stemming.

1) Stop word removal

Stop words, such as ‘the’ and ‘a’, are removed. There are a number of methods available to remove stop words. The classic method removes pre-defined stop words, and methods based on Zipf's law [ 25 ] remove words with very high term frequency, words with low inverse document frequency (IDF), and words appearing only once.

2) Stemming

After the removal of stop words, the next step involves ‘stemming’, which helps us to use only the roots of terms. For example, the terms ‘analyze’, ‘analytical’, and ‘analyzing’ are represented by the root term ‘analysis’.
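The two pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the stop word list and the suffix rules are hypothetical and far smaller than those used in practice, and the toy stemmer only strips one common suffix, whereas production systems typically rely on an established algorithm such as the Porter stemmer.

```python
import re

# Small illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for", "with"}

# Suffixes stripped by this toy stemmer, tried in this order (hypothetical rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the predefined stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The proteins interacted with the binding domains"))
# ['protein', 'interact', 'bind', 'domain']
```

The output of `preprocess` is the token list from which the term document matrix is built.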

3. Term Document Matrix

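Once the pre-processing has been completed, the next step is to prepare the term document matrix (TDM), in which terms are represented by rows, and documents are represented by columns. Each cell holds the weight of a term in a document, commonly its term frequency (TF) or its term frequency-inverse document frequency (TF-IDF) [ 27 ], which are important measures in the text mining process.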
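As a rough sketch, a term document matrix and one common TF-IDF weighting can be computed as follows. The function names are illustrative, and the weighting used here (raw count multiplied by the logarithm of N divided by document frequency) is only one of several variants in use; it is an assumption rather than the formula of reference [ 27 ].

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] is the count of terms[i] in docs[j]."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


def tf_idf(matrix):
    """Weight raw counts by log(N / document frequency)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)  # documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([count * idf for count in row])
    return weighted


# Toy pre-processed documents (made up for illustration).
docs = [["gene", "protein", "gene"], ["protein", "disease"], ["disease", "drug"]]
terms, matrix = term_document_matrix(docs)
weights = tf_idf(matrix)
for term, row in zip(terms, weights):
    print(term, [round(w, 3) for w in row])
```

With this weighting, a term that occurs in every document receives a weight of zero, which is why TF-IDF down-weights uninformative, ubiquitous terms.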

4. Natural Language Processing

Natural language processing [ 4 , 10 ] is used to analyze the language part of text documents through automated systems. Basically, language processing is divided into the processing of words (morphology), their different forms (lexicon), sentence structures (syntax), sentence meanings (semantics), coreference analysis, and the relationships between sentences (discourse). Natural language processing systems widely use statistical techniques to resolve the ambiguity present in texts. They process text automatically using a probabilistic approach and carry out tasks such as segmentation of sentences into words, named entity recognition, part-of-speech tagging, and coreference resolution [ 4 , 28 ].

5. Methods for Text Clustering

Documents are grouped according to their document vector, and each cluster is denoted by the document vector name. Clustering [ 26 ] of documents can be carried out using techniques such as hierarchical clustering and partitioning clustering (K-means clustering).

1) Hierarchical clustering technique

Hierarchical clustering [ 29 ] creates a hierarchy of clusters of documents using a top-down (divisive) or bottom-up approach (agglomerative). In the agglomerative method, clustering starts with each document as a single cluster, and in the next step, each cluster is combined with another cluster to form a new cluster based on the closest distance or similarity between the two clusters. This process is repeated until a single cluster is formed. In the divisive method, initially all the documents are combined to form a single cluster, and the cluster is divided into two sub-clusters which have maximum distance or dissimilarity between them. This process will continue until each document forms its own cluster. In hierarchical clustering, previous knowledge about the number of clusters is not required. The outcome of hierarchical clustering is a graphical representation called a dendrogram, in which the documents are represented in a hierarchical tree structure representing the documents as its branches.
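The agglomerative (bottom-up) procedure described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); other linkage criteria are equally valid, and the function names are hypothetical. The returned merge history is the information a dendrogram displays.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_linkage(c1, c2, vectors):
    """Distance between two clusters = distance of their closest members."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerative(vectors):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(vectors))]  # every document starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]  # toy document vectors
for left, right, distance in agglomerative(vectors):
    print(left, right, round(distance, 3))
```

In this toy example documents 0 and 1 merge first, then 2 and 3, and the two resulting clusters merge last.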

2) Partitioning (K-means) clustering

K-means clustering [ 30 ] starts with a predefined number of clusters of documents, for instance, k clusters. Each document is assigned to the cluster whose centroid (mean) is nearest. After each round of assignments, the cluster centroids are recalculated, and documents are relocated based on nearness to the updated centroids. This process is repeated until the relocation of documents no longer changes the cluster centroids. Generally the K-means clustering algorithm is faster than the hierarchical clustering algorithm.
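A compact sketch of this loop is given below. It assumes squared Euclidean distance and, to stay deterministic, takes the first k documents as initial centroids; practical implementations usually choose the initial centroids randomly or with a seeding heuristic. The loop stops when assignments no longer change, which implies that the centroids no longer change either.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iter=100):
    """Return (assignment, centroids) after iterating to convergence."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k docs as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no document moved, so the centroids are stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


assignment, centroids = k_means([[0, 0], [5, 5], [0, 1], [5, 6]], k=2)
print(assignment)  # [0, 1, 0, 1]
print(centroids)   # [[0.0, 0.5], [5.0, 5.5]]
```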

3) Similarity measures

Clustering process efficiency depends on the choice of similarity or distance measure, such as cosine, Euclidean, Manhattan, and Mahalanobis [ 28 ]. The cosine measure is among the simplest for clustering documents. It calculates the normalized dot product of two document vectors. For document vectors of term weights, which have no negative entries, the cosine values range from 0 to 1, and when the two documents do not share any words, the cosine value is 0.
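The cosine measure is short enough to write out directly. The sketch below assumes that documents are given as equal-length lists of term weights; returning 0 for an all-zero vector is a convention chosen here, not part of the definition.

```python
import math


def cosine_similarity(u, v):
    """Normalized dot product of two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0, 1], [0, 2, 0]))  # no shared terms: 0.0
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction: approximately 1
```

Because the vectors are normalized, two documents with the same term proportions score close to 1 regardless of their length.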

6. Methods for Text Classification

Documents can be automatically classified into specific categories using classifier algorithms such as naïve Bayes, support vector machine, and decision tree.

1) Naïve Bayes

The naïve Bayes classifier classifies documents probabilistically: each term is assigned a probability of occurring in each category, estimated from its frequency in the training documents of that category. During supervised training, these term probabilities are learned from a set of documents with predefined categories, under the (naïve) assumption that terms occur independently of one another given the category. A new document is then assigned to the category with the highest posterior probability given the terms it contains.
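A minimal multinomial naïve Bayes classifier can be sketched as below. It assumes add-one (Laplace) smoothing and ignores terms never seen in training; both choices are common but are assumptions of this example, as are the function names and the toy training data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class counts and per-class term counts from labeled token lists."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary


def classify(tokens, model):
    """Return the category with the highest log posterior for the given tokens."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocabulary:
                continue  # skip terms unseen during training
            # Add-one smoothing avoids zero probabilities.
            p = (term_counts[label][t] + 1) / (total_terms + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (made up for illustration).
train_docs = [
    ["tumor", "gene", "mutation"],
    ["gene", "expression", "tumor"],
    ["drug", "dose", "interaction"],
    ["drug", "interaction", "adverse"],
]
train_labels = ["cancer", "cancer", "pharma", "pharma"]
model = train_naive_bayes(train_docs, train_labels)
print(classify(["tumor", "gene"], model))        # cancer
print(classify(["drug", "interaction"], model))  # pharma
```

Summing log probabilities rather than multiplying probabilities avoids numerical underflow on long documents.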

2) Support vector machine

A support vector machine finds a separator between two document categories. Documents are represented as points in a vector space, and the separator is a hyperplane (a flat subspace with one dimension fewer than the space, for example a line in a two-dimensional space) that separates the two categories of documents.
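Once a linear separator has been learned, classifying a document amounts to checking on which side of the hyperplane its vector lies. The sketch below shows only this decision rule; the weights and bias are hypothetical values standing in for the result of training, which is not shown.

```python
def svm_decision(weights, bias, vector):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score >= 0 else -1


# Hypothetical trained parameters for a two-term vocabulary.
weights, bias = [1.0, -1.0], 0.0
print(svm_decision(weights, bias, [3, 1]))  # 1
print(svm_decision(weights, bias, [1, 3]))  # -1
```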

3) Decision tree

The decision tree method represents category conditions as nodes and document categories as leaves. The decision tree method works recursively and classifies documents into categories based on conditions.

4) Evaluation of classification

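The evaluation of the classification process is carried out using recall, precision, and F measures, which are defined from the following counts:

TP (true positive): number of instances correctly assigned to a class;

FP (false positive): number of instances incorrectly assigned to a class;

FN (false negative): number of instances belonging to a class that were not assigned to it.

Precision is the proportion of instances assigned to a class that truly belong to it, and recall is the proportion of instances belonging to a class that were assigned to it:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

The F-measure is the harmonic mean of precision and recall. It is calculated by

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$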
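These measures follow directly from the TP, FP and FN counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). In the sketch below, returning 0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.5 0.615
```

The harmonic mean stays close to the smaller of the two values, so a classifier cannot reach a high F-measure by maximizing precision or recall alone.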

V. Conclusion

This paper provided a comprehensive overview of text mining methods. The paper discussed the roles of text mining in biomedical applications and presented software available to carry out text mining. The paper also presented an overview of techniques to find similarities between studies on a given topic from available research articles.

Conflict of Interest: No potential conflict of interest relevant to this article was reported.


Biomedical Text Mining and Its Applications


Affiliation Pfizer Research Technology Center, Cambridge, Massachusetts, United States of America

  • Raul Rodriguez-Esteban


Published: December 24, 2009

  • https://doi.org/10.1371/journal.pcbi.1000597

Citation: Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5(12): e1000597. https://doi.org/10.1371/journal.pcbi.1000597

Editor: Fran Lewitter, Whitehead Institute, United States of America

Copyright: © 2009 Raul Rodriguez-Esteban. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The author received no specific funding for this work.

Competing interests: The author has declared that no competing interests exist.


Introduction

This tutorial is intended for biologists and computational biologists interested in adding text mining tools to their bioinformatics toolbox. As an illustrative example, the tutorial examines the relationship between progressive multifocal leukoencephalopathy (PML) and antibodies. Recent cases of PML have been associated with the administration of some monoclonal antibodies such as efalizumab [1] . Those interested in a further introduction to text mining may also want to read other reviews [2] – [4] .

Understanding large amounts of text with the aid of a computer is harder than simply equipping a computer with a grammar and a dictionary. A computer, like a human, needs certain specialized knowledge in order to understand text. The scientific field that is dedicated to training computers with the right knowledge for this task (among other tasks) is called natural language processing (NLP). Biomedical text mining (henceforth, text mining) is the subfield that deals with text that comes from biology, medicine, and chemistry (henceforth, biomedical text). Another popular name is BioNLP, which some practitioners use as synonymous with text mining.

Biomedical text is not a homogeneous realm [5] . Medical records are written differently from scientific articles, sequence annotations, or public health guidelines. Moreover, local dialects are not uncommon [6] . For example, medical centers develop their own jargons and laboratories create their idiosyncratic protein nomenclatures. This variability means, in practice, that text mining applications are tailored to specific types of text. In particular, for reasons of availability and cost, many are designed for scientific abstracts in English from Medline.

Main Concepts

A term is a name used in a specific domain, and a terminology is a collection of terms. Terms abound in biomedical text, where they constitute important building blocks. Some examples of terms are the names of cell types, proteins, medical devices, diseases, gene mutations, chemical names, and protein domains [7] . Due to their importance, text miners have worked to design algorithms that recognize terms (see examples in Figure 1 ). The task of recognizing terms is also called named entity recognition in the text mining literature, although this NLP task is broader and goes beyond recognition of terms. Although the concept of term is intuitive (or, perhaps, because it is intuitive), terms are hard to define precisely [8] . For example, the text “early progressive multifocal leukoencephalopathy” could possibly refer to any, or all, of these disease terms: “early progressive multifocal leukoencephalopathy,” “progressive multifocal leukoencephalopathy,” “multifocal leukoencephalopathy,” and “leukoencephalopathy.” To overcome such dilemmas, text miners ask experts to identify terms within collections of text such as sets of selected Medline abstracts. These annotations are then used to train a computer by example, so that the computer can emulate the knowledge experts deploy when they read biomedical text. This pedagogical method, “teaching by example,” is a common approach used in many text mining tasks and it is more generally called supervised training. (Alternatively, text miners create rules using expert knowledge.) Thus, text miners rely heavily on collections of text (corpora) that have been annotated by experts (see compilations of corpora: http://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html ; http://compbio.uchsc.edu/ccp/corpora/obtaining.shtml ). Before beginning a text mining task, it is advisable to limit the scope of the task to a corpus made of a set of documents around the topic of interest. In our case, a PML corpus could comprise all the Medline abstracts that mention the term “progressive multifocal leukoencephalopathy,” because this is an unambiguous term. Another relevant corpus to consider could be the ImmunoTome [9] , which is focused on immunology.
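The nested-term dilemma described above can be made concrete with a rule-based sketch. The example below is not the supervised approach discussed in the text; it is a simple dictionary tagger that resolves overlaps by always preferring the longest matching term. The terminology set and function name are hypothetical, and punctuation handling is deliberately omitted.

```python
def find_terms(text, terminology):
    """Greedy left-to-right tagging that prefers the longest matching term."""
    words = text.lower().split()
    max_len = max(len(term.split()) for term in terminology)
    found, i = [], 0
    while i < len(words):
        # Try the longest window first, then shrink it word by word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i : i + n])
            if candidate in terminology:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no term starts at this word
    return found


# Hypothetical mini-terminology of nested disease terms.
terminology = {
    "progressive multifocal leukoencephalopathy",
    "multifocal leukoencephalopathy",
    "leukoencephalopathy",
}
print(find_terms("Early progressive multifocal leukoencephalopathy was diagnosed", terminology))
# ['progressive multifocal leukoencephalopathy']
```

Longest-match is only one possible policy; whether "early" belongs to the term is exactly the kind of decision that expert annotation is meant to settle.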

Figure 1. (A) Text marked with protein (blue), disease (crimson), Gene Ontology (bright red), chemical (dark red), and species (red) terms by Whatizit [15] with the whatizitEBIMedDiseaseChemicals pipeline. (B) Text marked with protein and cell line terms by ABNER [16] . (C) Protein terms identified by the prototype BIOCreAtIvE metaserver [68] . In the example shown, the metaserver combines the output of systems hosted in three servers.

https://doi.org/10.1371/journal.pcbi.1000597.g001

Text miners are interested in terminologies that have been built manually. These controlled terminologies have notable roles in biomedicine, for example, the HUGO gene nomenclature, the ICD disease classification, or the Gene Ontology. Many of these terminologies are more than just a flat list of terms. Some include term synonyms (thesauri) or relations between terms (taxonomies, ontologies). For text miners, their usefulness comes from their ability to link to information. Once a text is mapped to one of these terminologies, a bridge is opened between the text and other resources. This usefulness justifies efforts such as the National Library of Medicine's manual mapping of Medline abstracts to the Medical Subject Headings (MeSH) terminology. In our example, MeSH can be used to make the PML corpus more focused by restricting it only to abstracts with the MeSH term “leukoencephalopathy, progressive multifocal.” Controlled terminologies can be used to annotate results from experiments and databases [10] . Text miners attempt to make such mappings automatically. For example, a task called gene normalization consists in recognizing names of genes in text and mapping them to their corresponding gene identifiers (e.g., Entrez Gene ID). Thus, using gene normalization it is possible to identify all the abstracts in Medline that mention a given gene from Entrez Gene [11] .
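Gene normalization can be illustrated with a toy synonym table. The sketch below assumes that a lookup of single tokens is sufficient, which real systems cannot assume because gene names are ambiguous and often span several words. The identifiers are placeholders rather than real Entrez Gene IDs, and the abstracts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical synonym table; the identifiers are placeholders, not real Entrez Gene IDs.
SYNONYM_TO_ID = {
    "il-2": "GENE:0001",
    "il2": "GENE:0001",
    "interleukin-2": "GENE:0001",
    "tnf": "GENE:0002",
    "tnf-alpha": "GENE:0002",
}


def normalize_genes(abstracts):
    """Map each gene identifier to the set of abstract ids that mention a synonym."""
    index = defaultdict(set)
    for abstract_id, text in abstracts.items():
        # Crude tokenization: lowercase, drop commas and periods, split on spaces.
        tokens = text.lower().replace(",", " ").replace(".", " ").split()
        for token in tokens:
            gene_id = SYNONYM_TO_ID.get(token)
            if gene_id:
                index[gene_id].add(abstract_id)
    return dict(index)


abstracts = {
    "a1": "Tacrolimus reduces IL-2 production.",
    "a2": "Interleukin-2 and TNF-alpha were elevated.",
    "a3": "No genes here.",
}
print(normalize_genes(abstracts))
```

Because all synonyms resolve to one identifier, the resulting index answers the question "which abstracts mention this gene" independently of the name each author used.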

Because there are many controlled terminologies, some terminologies have been created to map between them. For example, the BioThesaurus [12] is a compilation of protein synonyms from several terminologies. The Unified Medical Language System (UMLS) [13] , [14] is a grand compilation of more than 120 terminologies and close to 4 million terms. Despite UMLS's size, all controlled terminologies are incomplete, because new terms are created too quickly to keep them up to date. Furthermore, all have gaps and areas of emphasis that conflict with the needs of users.

Tools for Terms

Whatizit [15] is a tool that recognizes several types of terms. It can be accessed through a Web interface, Web services, or a streamed servlet. ABNER [16] is a standalone application that recognizes five types of terms: protein, DNA, RNA, cell line, and cell type. More specialized term recognition has been used, for example, for databases such as LSAT [17] for alternative transcripts and PepBank [18] for peptides. Text miners have also used terminologies to enrich PubMed's search capabilities. Some recent search engines are semedico [19] , novo|seek [20] , and GoPubMed/GoGene [21] , [22] .

Relationships

After recognizing terms, the natural next step is to look for relationships between terms. The simplest method to identify relationships is using the co-occurrence assumption: terms that appear in the same texts tend to be related. For example, if a protein is mentioned often in the same abstracts as a disease, it is reasonable to hypothesize that the protein is involved in some aspect of the disease. The degree of co-occurrence can be quantified statistically to rank and eliminate statistically weak co-occurrences (see Box 1 ). An example using GoGene [22] can illustrate the use of simple co-occurrence, MeSH terms, and gene normalization. The query “leukoencephalopathy, progressive multifocal”[mh] in GoGene returns all the genes mentioned in Medline abstracts annotated with the MeSH term for PML. The genes that appear most often are likely to be related to PML. Those that appear disproportionately more often for PML than for other diseases are likely to be more specific to PML.
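One way to quantify the degree of co-occurrence is pointwise mutual information (PMI), which compares how often two terms appear together with how often they would be expected to if they were independent. PMI is used here as an assumed example of a co-occurrence statistic; it is not necessarily the measure that any particular tool applies. The input is a list of term sets, one per abstract, and the data are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(abstract_terms):
    """Return {(term_a, term_b): PMI} from a list of per-abstract term sets."""
    n = len(abstract_terms)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in abstract_terms:
        term_counts.update(terms)
        # Sorting makes each unordered pair map to a single key.
        pair_counts.update(combinations(sorted(terms), 2))
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities as abstract fractions.
        scores[(a, b)] = math.log2(n_ab * n / (term_counts[a] * term_counts[b]))
    return scores


# Toy corpus: each set holds the terms recognized in one abstract.
corpus = [{"PML", "JCV"}, {"PML", "JCV"}, {"PML", "IL2"}, {"IL2", "asthma"}]
for pair, score in sorted(cooccurrence_pmi(corpus).items()):
    print(pair, round(score, 3))
```

Pairs with a positive score co-occur more often than chance would predict, and pairs with a negative score less often; ranking by such a score is one way to discard statistically weak co-occurrences.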


Better evidence than co-occurrence comes from relationships that are described explicitly [23] . For example, the sentence “We describe a PML in a 67-year-old woman with a destructive polyarthritis associated with anti-JO1 antibodies treated with corticosteroids” [24] describes an explicit link between PML and anti-JO1 antibodies. We can simplify this relationship into a triplet of two terms and a verb: PML is associated with anti-JO1 antibodies. To create the triplet, the verb can be identified with the aid of a part-of-speech (POS) tagger. An example of a POS tagger for biomedical text is MedPost [25] . This triplet representation is powerful due to its simplicity, but it omits crucial details from the original article, such as the fact that the evidence comes from a clinical case study.

A heavily studied area in text mining concerns the relationships known as protein-protein interactions (PPI). Using the triplet representation, PPI can be depicted as network graphs with the proteins as nodes and the verbs as edges (see Figure 2 ). When analyzing text-mined interaction networks, it is important to understand the information that underpins them. For example, interactions can be direct (physical) or indirect, depending on the verb (examples of direct verbs are to bind , to stabilize , to phosphorylate ; examples of indirect verbs are to induce , to trigger , to block ) [26] . The different nature of the protein interactions described in the literature reflects in part the experimental methodology employed and the nature of the interaction itself. A common way to capture the textual variations is by exhaustively identifying all the patterns that appear and writing a set of rules that capture them [27] , [28] . For example, a simple pattern to capture phosphorylations might involve, sequentially, a kinase name, a form of the verb to phosphorylate , and a substrate name [29] , [30] .
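The phosphorylation pattern mentioned above (a kinase name, a form of the verb to phosphorylate, and a substrate name) can be sketched as a regular expression. This is a deliberately naive illustration: protein names are approximated as single tokens of letters, digits and hyphens, only active-voice sentences are covered, and the pattern and function name are hypothetical rather than taken from any published rule set.

```python
import re

# Hypothetical pattern: <kinase> <form of "phosphorylate"> <substrate>.
# The negative lookahead rejects auxiliary verbs so that passive sentences
# such as "X is phosphorylated by Y" are not extracted with the wrong roles.
PHOSPHORYLATION = re.compile(
    r"\b(?!(?:is|was|are|were|be|been|being)\b)"
    r"(?P<kinase>[A-Za-z][A-Za-z0-9-]*)\s+"
    r"(?P<verb>phosphorylates|phosphorylated|phosphorylate)\s+"
    r"(?P<substrate>[A-Za-z][A-Za-z0-9-]*)"
)


def extract_phosphorylations(sentence):
    """Return (kinase, verb, substrate) triplets found in an active-voice sentence."""
    return [
        (m.group("kinase"), "phosphorylates", m.group("substrate"))
        for m in PHOSPHORYLATION.finditer(sentence)
    ]


print(extract_phosphorylations("We show that AKT1 phosphorylates GSK3B in vitro."))
# [('AKT1', 'phosphorylates', 'GSK3B')]
print(extract_phosphorylations("GSK3B is phosphorylated by AKT1."))
# []
```

The second call returns nothing because the passive construction needs its own rule, which illustrates why rule sets of this kind grow quickly when they try to cover all textual variations.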

Figure 2. The nodes are proteins identified using the query: “leukoencephalopathy, progressive multifocal”[mh] antibody[pubmed] in GoGene [22] . The query retrieves gene symbols mapped to PubMed abstracts that include the keyword antibody and the MeSH term leukoencephalopathy, progressive multifocal (PML). The gene list was exported to SIF format and the gene symbols extracted and used to query PPI using iHOP Web services [69] . Only those iHOP interactions with at least two co-occurrences and confidence above zero were considered. The network was plotted using Cytoscape [70] . The node color is based on the number of interactions (node degree).

https://doi.org/10.1371/journal.pcbi.1000597.g002

Tools for Relationships

To see co-occurrence in action, try FACTA [31]. MedGene and BioGene [32], [33] use co-occurrence for gene prioritization. Gene prioritization tools such as Endeavour [34] and G2D [35] use text as well as other data sources. PolySearch [36] uses heuristic weighting of different co-occurrence measures and includes a detailed guide to implementation and vocabularies. Anni [37] uses textual profiles instead of co-occurrence to measure the relationship between terms. For PPI, iHOP [38] is the most popular tool. RLIMS-P [30] uses linguistic patterns to detect the kinase, substrate, and phosphosite in a phosphorylation. E3Miner [39] detects ubiquitinations, including contextual information.

Besides finding relationships that are stated in the text, text miners are also interested in discovering relationships that are not. Due to the size of the literature, scientists miss links between their work and other, related work. Swanson called these links “undiscovered public knowledge.” In a classic example he found, by careful reading, 11 neglected links between magnesium and migraine [40]. One method to discover relationships is based on transitive inference [41]. Simply stated, if A is linked to B, and B is linked to C, then there is a chance that A is linked to C. PPI networks are, at the core, an example of transitive inference. Arrowsmith [42] is a basic discovery tool that compares two literature sets to find links between them. Applying Arrowsmith to the literature for PML and antibodies yields the immunomodulator tacrolimus, a calcineurin inhibitor, among the top hits. Tacrolimus affects the production of several proteins depicted in Figure 2, such as IL-2.
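Transitive inference can be sketched over a list of pairwise links. The code below is a toy illustration under stated assumptions, not the method used by Arrowsmith: the function names are hypothetical, the links use placeholder labels rather than extracted data, and candidates are ranked simply by how many intermediate terms connect them to the starting term.

```python
from collections import defaultdict


def build_graph(links):
    """Turn undirected (term, term) links into an adjacency map."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_hidden_links(links, start):
    """Suggest terms C not directly linked to `start` but reachable via some B.

    Returns (candidate, intermediates) pairs, with the candidates supported
    by the most intermediate terms first.
    """
    graph = build_graph(links)
    direct = graph.get(start, set())
    candidates = defaultdict(set)
    for b in direct:
        for c in graph[b]:
            if c != start and c not in direct:
                candidates[c].add(b)
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    # Placeholder labels only; real links would come from co-occurrence
    # or extracted relationships.
    toy_links = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C"), ("B2", "D")]
    for candidate, via in infer_hidden_links(toy_links, "A"):
        print(candidate, "via", sorted(via))
```

A candidate reached through several independent intermediates is only more plausible, not confirmed; the output is a list of hypotheses to check against the literature.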

The most common measure of output quality in text mining is the F-measure, which is the harmonic mean of two other measures, precision and recall. These three measures can be described with the analogy of searching for needles in a haystack. After a manual search of a haystack, our hands end up full of valuable needles but also some useless straws. Recall is the fraction of all the needles in the haystack that we have found. High recall means that we have found most of the needles for which we were looking. Precision, however, is the fraction of what we hold in our hands that is needles rather than straws. High precision means that we have retrieved far more needles than straws. Both high precision and high recall are desirable, and a high F-measure reflects both because it is the harmonic mean. Optimizing the F-measure of a text mining application is often different from optimizing the accuracy, because there are usually few needles and large amounts of hay in the haystack. An application that identifies the whole haystack as being only hay is quite accurate but misses all the needles.
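The difference between accuracy and the F-measure can be made concrete with a small calculation. The sketch below uses the standard definitions of precision, recall, the balanced F-measure and accuracy; the function name `evaluate` and the haystack counts are made up for illustration.

```python
def evaluate(true_pos, false_pos, false_neg, true_neg):
    """Compute precision, recall, balanced F-measure and accuracy from counts.

    true_pos:  needles we picked up
    false_pos: straws we picked up by mistake
    false_neg: needles left in the haystack
    true_neg:  straws correctly left in the haystack
    """
    retrieved = true_pos + false_pos
    relevant = true_pos + false_neg
    total = true_pos + false_pos + false_neg + true_neg
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (true_pos + true_neg) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Illustrative haystack: 10 needles hidden among 990 straws.
    # Application 1 calls everything hay: accuracy 0.99, F-measure 0.0.
    print(evaluate(true_pos=0, false_pos=0, false_neg=10, true_neg=990))
    # Application 2 picks up 8 needles and 4 straws:
    # precision about 0.667, recall 0.8, F-measure about 0.727, accuracy 0.994.
    print(evaluate(true_pos=8, false_pos=4, false_neg=2, true_neg=986))
```

The two accuracies differ by less than half a percentage point, while the F-measure separates a useless application from a useful one.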

Before assessing an application's F-measure, it is important to consider how the application was evaluated [43], and especially how realistic the evaluation was. The F-measure is not an absolute value. The larger a haystack is, the more difficult it is to find needles. In other words, a low F-measure might reflect a harder task, not a worse application. Moreover, text mining applications may perform differently on different types of text, which may result in lower F-measures than advertised. When the attainable F-measure is not high enough, one solution is to use text mining as a filter. A filter needs high recall, but only moderate precision, to reduce the amount of hay without losing needles. Filtering with text mining is used as a preliminary step in databases such as MINT [44], DIP [45], and BIND [46]. Filtering is followed by human curation, which involves the review and assessment of results to reduce hay and, hopefully, provide feedback to improve the filtering. The feedback loop between text mining and curation can have an incremental positive impact on the output [47].

Comprehensiveness

Doing comprehensive text mining means considering all sources of information: Medline and beyond. The abstract conveys an article's main findings, but many other pieces of information are elsewhere in the full text, figures, tables, supplementary information, references, databases, Web sites, and multimedia files. In particular, the full text is critical for information that rarely appears in abstracts, such as experimental measurements. A more comprehensive PML corpus would include full text articles. However, despite the surge in open access articles (see the Directory of Open Access Journals, www.doaj.org; [48]), the majority of published articles have access and processing restrictions. PubMed Central [49] is the main source of open access articles, and the specialized search engines BioText [50], Yale Image Finder [51], and Figurome [52] search PubMed Central figures and tables. A search for “progressive multifocal leukoencephalopathy” in the Yale Image Finder yields only one figure, while a search for “PML” yields a large number of hits, most of them not relevant because PML is an ambiguous acronym.

Text and DNA

Considering text as a sequence of symbols as informative as a protein's DNA sequence is the underlying premise of many text mining tools for bioinformatics. For example, the linguistic similarity between protein corpora (sets of texts built around proteins) correlates with the BLAST score between those same proteins [53] . Text that is used in articles or database annotations to describe a protein can be used for protein clustering and to predict structure [54] , subcellular localization, and function [55] . For example, a protein corpus of a protein located in the nucleus uses a vocabulary that is somewhat different from a corpus built around a secreted protein. These vocabulary differences can be used to predict the subcellular localization of a protein of unknown location. One way to measure vocabulary differences is to represent the texts as vectors of word counts. The word counts can be normalized by the size of the text they come from and the vectors compared using, for example, Euclidean distance (for more, see [56] ). To reduce vector dimensionality, some words can be grouped using a method called stemming. A simple example of stemming is converting plural nouns into singular form and verbs into infinitive form (a widely used stemming algorithm is the Porter stemmer [57] ). Additional simplification can be achieved via tokenization, because some words can be separated into constitutive elements called tokens. In English, however, most words are a single token. An example of a word of two tokens is don't .
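The vector comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the three short texts are invented, tokenization is a crude split into lowercase alphanumeric runs, and no stemming is applied (a real pipeline might first apply a stemmer such as the Porter stemmer).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalized_counts(text):
    """Word counts divided by the text length, as a sparse vector (dict)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def euclidean(u, v):
    """Euclidean distance between two sparse word-frequency vectors."""
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    # Invented example texts standing in for protein corpora.
    nuclear = "the protein binds chromatin in the nucleus and regulates transcription"
    secreted = "the protein is secreted into the extracellular medium after signal peptide cleavage"
    query = "this factor regulates transcription in the nucleus"

    q = normalized_counts(query)
    print("distance to nuclear corpus: ", euclidean(q, normalized_counts(nuclear)))
    print("distance to secreted corpus:", euclidean(q, normalized_counts(secreted)))
```

With these toy texts the query lies closer to the nuclear corpus than to the secreted one, which is the kind of signal a localization predictor would exploit over much larger corpora.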

Text mining applications for bioinformatics [58] include subcellular localization prediction such as Sherloc and Epiloc [59] , [60] and protein clustering such as TXTGate [61] . Thus, text mining tools can be used for annotating biological databases in the same fashion other bioinformatics tools are used.

An extensive list of text mining applications is maintained at http://zope.bioinfo.cnio.es/bionlp_tools/ [62]. A growing number of tools are being developed under a standard framework called UIMA, which comprises NLP as well as BioNLP tools [63].

Text mining tools are increasingly accessible to biologists and computational biologists, and they can often be applied in combination with other bioinformatics tools to answer scientific questions. Getting acquainted with them is a first step towards grasping the possibilities of text mining and towards venturing into the algorithms described in the literature. One way to get started on this path is by looking at examples such as [64]–[67].

Acknowledgments

I would like to thank Rohitha P. SriRamaratnam for comments on the manuscript.

  • 20. Alonso-Allende R (2009) Accelerating searches of research grants and scientific literature with novo|seek. Nat Methods 6. Advertising feature. Available: http://www.novoseek.com/ .
  • 55. Pandey G, Kumar V, Steinbach M (2006) Computational approaches for protein function prediction: a survey. Technical Report 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities.
  • 56. Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. MIT Press.
  • 57. Van Rijsbergen CJ, Robertson SE, Porter MF (1980) New models in probabilistic information retrieval. Tech. Rep. 5587. British Library. Available: http://tartarus.org/~martin/PorterStemmer/ .

Title: BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

Abstract: Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Of the two main branches of pre-trained language models in the general language domain, i.e., BERT (and its variants) and GPT (and its variants), the first has been extensively studied in the biomedical domain, for example in BioBERT and PubMedBERT. While these models have achieved great success on a variety of discriminative downstream biomedical tasks, their lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature. We evaluate BioGPT on six biomedical NLP tasks and demonstrate that our model outperforms previous models on most tasks. In particular, we obtain F1 scores of 44.98%, 38.42% and 40.76% on the BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms. Code is available at this https URL.

Biomedical Text Mining: Experience and Practical Approach

Research status and trend analysis of global biomedical text mining studies in recent 10 years

  • Published: 28 August 2015
  • Volume 105 , pages 509–523, ( 2015 )

  • Xing Zhai 1 ,
  • Zhihong Li 2 ,
  • Kuo Gao 1 ,
  • Youliang Huang 1 ,
  • Lin Lin 3 &
  • Le Wang 4  

929 Accesses

10 Citations

2 Altmetric

In recent years, the rapid growth of the biomedical literature has left many implicit patterns and much new knowledge buried in it. Text mining technology, applied to the biomedical field, can integrate and analyze massive amounts of biomedical literature data and yield valuable information that improves our understanding of biomedical phenomena. This paper discusses the research status of text mining technology applied in the biomedical field over the past 10 years, in order to provide a reference for further studies by other researchers.

Biomedical text mining literature indexed in SCI from 2004 to 2013 was retrieved and filtered, and then analyzed from the perspectives of annual changes, regional distribution, research institutions, journal sources, research fields, keywords and so on.

The total amount of global biomedical text mining literature is on the rise. Literature on named entity recognition, entity relation extraction, text categorization, text clustering, abbreviation extraction and co-occurrence analysis makes up a large percentage; studies in the USA and the UK are in the leading position.

Compared with more mature research topics, the application of text mining technology in biomedicine is still a relatively new research field worldwide. With growing awareness of the field and deepening research, a number of core research areas, core research institutes and core research fields have formed. Further research in this field will therefore inject new vitality into the development of biomedicine.

Bayer, A. E., & Folger, J. (1966). Some correlates of a citation measure of productivity in science. Sociology of education, 39 , 381–390.

Braun, T., Schubert, A. P., & Kostoff, R. N. (2000). Growth and trends of fullerene research as reflected in its journal literature. Chemical Reviews, 100 (1), 23–38.

de Solla Price, D. J., & Beaver, D. (1966). Collaboration in an invisible college. American Psychologist, 21 (11), 1011.

Donaldson, I., Martin, J., De Bruijn, B., Wolting, C., Lay, V., Tuekam, B., & Hogue, C. W. (2003). PreBIND and Textomy–mining the biomedical literature for protein–protein interactions using a support vector machine. BMC bioinformatics, 4 (1), 11.

Fleuren, W. W., Verhoeven, S., Frijters, R., Heupers, B., Polman, J., van Schaik, R., & Alkema, W. (2011). CoPub update: CoPub 5.0 a text mining system to answer biological questions. Nucleic Acids Research, 39 , 450–454.

Frijters, R., Heupers, B., van Beek, P., Bouwhuis, M., van Schaik, R., de Vlieg, J., & Alkema, W. (2008). CoPub: A literature-based keyword enrichment tool for microarray data analysis. Nucleic Acids Research, 36 , 406–410.

Han, J. S., & Ho, Y. S. (2011). Global trends and performances of acupuncture research. Neuroscience and Biobehavioral Reviews, 35 (3), 680–687.

He, M., Wang, Y., & Li, W. (2009). PPI finder: A mining tool for human protein–protein interactions. PLoS One, 4 (2), e4554.

Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102 (46), 16569–16572.

Hirsch, J. E. (2007). Does the h index have predictive power? Proceedings of the National Academy of Sciences, 104 (49), 19193–19198.

Hu, X. (2004). Integration of cluster ensemble and text summarization for gene expression analysis. In Proceedings of fourth IEEE symposium on bioinformatics and bioengineering, 2004. BIBE 2004 (pp 251–258). IEEE.

Hur, J., Schuyler, A. D., & Feldman, E. L. (2009). SciMiner: Web-based literature mining tool for target identification and functional enrichment analysis. Bioinformatics, 25 (6), 838–840.

Kinney, A. L. (2007). National scientific facilities and their science impact on nonbiomedical research. Proceedings of the National Academy of Sciences, 104 (46), 17943–17947.

Krallinger, M., Leitner, F., Rodriguez-Penagos, C., & Valencia, A. (2008). Overview of the protein–protein interaction annotation extraction task of BioCreative II. Genome Biology, 9 (Suppl 2), S4.

Leung, S., Chan, K., & Song, L. (2006). Publishing trends in Chinese medicine and related subjects documented in WorldCat. Health Information and Libraries Journal, 23 (1), 13–22.

Li, L. L., Ding, G., Feng, N., Wang, M. H., & Ho, Y. S. (2009). Global stem cell research trend: Bibliometric analysis as a tool for mapping of trends from 1991 to 2006. Scientometrics, 80 (1), 39–58.

Li, T., Ho, Y. S., & Li, C. Y. (2008). Bibliometric analysis on global Parkinson’s disease research trends during 1991–2006. Neuroscience Letters, 441 (3), 248–252.

Li, C., Zhang, Y., & Gao, Z. (1999). A new clustering algorithm. Journal of Pattern Recognition and Artificial Intelligence, 12 (2), 205–209.

Liu, H., Hu, Z. Z., Torii, M., Wu, C., & Friedman, C. (2006). Quantitative assessment of dictionary-based protein named entity tagging. Journal of the American Medical Informatics Association, 13 (5), 497–507.

Liu, X., & Wang, Z. (2010). Statistics and analysis of the high-cited papers of information science research from 2004 to 2008. Journal of Intelligence, 29 (1), 64–67.

Lv, T., & Jiang, Y. (2010). Application of text mining in biomedical field. The Chinese Medicine Books Intelligence Magazine, 19 (4), 56–64.

Macias-Chapula, C. A. (2000). AIDS in Haiti: A bibliometric analysis. Bulletin of the Medical Library Association, 88 (1), 56.

Miwa, M., Sætre, R., Miyao, Y., & Tsujii, J. I. (2009). Protein–protein interaction extraction by leveraging multiple kernels and parsers. International Journal of Medical Informatics, 78 (12), e39–e46.

Muller, H., & Mancuso, F. (2008). Identification and analysis of co-occurrence networks with NetCutter. PLoS One, 3 (9), e3178.

Perez-Iratxeta, C., Bork, P., & Andrade, M. A. (2002). Association of genes to genetically inherited diseases using data mining. Nature Genetics, 31 (3), 316–319.

Ramos, J. M., Padilla, S., Masia, M., & Gutierrez, F. (2008). A bibliometric analysis of tuberculosis research indexed in PubMed, 1997–2006. The International Journal of Tuberculosis and Lung Disease, 12 (12), 1461–1468.

Rodriguez-Esteban, R. (2009). Biomedical text mining and its applications. PLoS Computational Biology, 5 (12), e1000597.

Saha, S. K., Sarkar, S., & Mitra, P. (2009). Feature selection techniques for maximum entropy based biomedical named entity recognition. Journal of Biomedical Informatics, 42 (5), 905–911.

Schwartz, A. S., & Hearst, M. A. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (Vol. 8, pp. 451–462).

Si, L., & Kanungo, T. (2005). Thresholding strategies for text classifiers: TREC 2005 Biomedical Triage Task Experiments. In TREC .

Smalheiser, N. R., & Swanson, D. R. (1994). Assessing a gap in the biomedical literature-magnesium-deficiency and neurologic disease. Neuroscience Research Communications, 15 (1), 1–9.

Smith, L., Rindflesch, T., & Wilbur, W. J. (2004). MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics, 20 (14), 2320–2321.

Sorensen, A. A. (2009). Alzheimer’s disease research: Scientific productivity and impact of the top 100 investigators in the field. Journal of Alzheimer’s Disease, 16 (3), 451.

Tari, L., Anwar, S., Liang, S., Cai, J., & Baral, C. (2010). Discovering drug–drug interactions: A text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics, 26 (18), 1547–1553.

Theodosiou, T., Darzentas, N., Angelis, L., & Ouzounis, C. A. (2008). PuReD-MCL: A graph-based PubMed document clustering methodology. Bioinformatics, 24 (17), 1935–1941.

Tsuruoka, Y., Miwa, M., Hamamoto, K., Tsujii, J. I., & Ananiadou, S. (2011). Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics, 27 (13), i111–i119.

Tsuruoka, Y., Tateishi, Y., Kim, J. D., Ohta, T., McNaught, J., Ananiadou, S., & Tsujii, J. I. (2005). Developing a robust part-of-speech tagger for biomedical text. Advances in Informatics, 3746 , 382–392.

Tsuruoka, Y., Tsujii, J. I., & Ananiadou, S. (2008). FACTA: A text search engine for finding associated biomedical concepts. Bioinformatics, 24 (21), 2559–2560.

Tulipano, P. K., Tao, Y., Millar, W. S., Zanzonico, P., Kolbert, K., Xu, H., & Friedman, C. (2007). Natural language processing and visualization in the molecular imaging domain. Journal of Biomedical Informatics, 40 (3), 270–281.

Ugolini, D., Puntoni, R., Perera, F. P., Schulte, P. A., & Bonassi, S. (2007). A bibliometric analysis of scientific production in cancer molecular epidemiology. Carcinogenesis, 28 (8), 1774–1779.

Wang, H., & Zhao, T. (2008). Research and development of biomedical text mining. Journal of Chinese Information Processing, 22 (3), 89–98.

Xie, S., Zhang, J., & Ho, Y. S. (2008). Assessment of world aerosol research trends by bibliometric analysis. Scientometrics, 77 (1), 113–130.

Zhang, H. Q., He, D. G., He, L., & Li, J. (1997). The literature of Qigong: Publication patterns and subject headings. International Forum on Information and Documentation, 22 (3), 38–44.

Acknowledgments

This research is supported by Young Talent Project of Beijing (No. YETP0821) and Research Project for Practice Development of National TCM Clinical Research Bases.

Author information

Authors and affiliations.

Beijing University of Chinese Medicine, Beijing, 100029, China

Xing Zhai, Kuo Gao & Youliang Huang

Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, 100700, China

Knowledge and Action College, HuBei University, Wuhan, 430011, China

Dongfang Hospital, Beijing University of Chinese Medicine, Beijing, 100078, China

Corresponding author

Correspondence to Le Wang .

Additional information

Xing Zhai, Zhihong Li and Kuo Gao have contributed equally to this work.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 27 kb)

About this article

Zhai, X., Li, Z., Gao, K. et al. Research status and trend analysis of global biomedical text mining studies in recent 10 years. Scientometrics 105 , 509–523 (2015). https://doi.org/10.1007/s11192-015-1700-9

Received : 21 July 2015

Published : 28 August 2015

Issue Date : October 2015

DOI : https://doi.org/10.1007/s11192-015-1700-9

  • Text mining
  • Development trends
  • Bibliometrics

Frontiers of biomedical text mining: current progress

Affiliation.

  • 1 LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France.
  • PMID: 17977867
  • PMCID: PMC2516302
  • DOI: 10.1093/bib/bbm045

It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, N.I.H., Intramural
  • Research Support, Non-U.S. Gov't
  • Abstracting and Indexing / trends*
  • Artificial Intelligence*
  • Biology / trends*
  • Databases, Bibliographic / trends*
  • Forecasting
  • Natural Language Processing*
  • Periodicals as Topic*
  • Vocabulary, Controlled

Grants and funding

  • R01-LM009836-01A1/LM/NLM NIH HHS/United States
  • 1G08LM009639-01/LM/NLM NIH HHS/United States
  • R01 LM008111-03/LM/NLM NIH HHS/United States
  • R01 LM009836/LM/NLM NIH HHS/United States
  • 5R01 LM008111-03/LM/NLM NIH HHS/United States
  • G08 LM009639/LM/NLM NIH HHS/United States
  • R01 LM008111/LM/NLM NIH HHS/United States
  • ImNIH/Intramural NIH HHS/United States

COMMENTS

  1. Biomedical text mining and its applications in cancer research

    1. Introduction The vast numbers of biomedical text provide a rich source of knowledge for biomedical research. Text mining can help us to mine information and knowledge from a mountain of text and it is now widely applied in biomedical research.

  2. Frontiers of biomedical text mining: current progress

    In this article we review the current state of the art in biomedical text mining or 'BioNLP' in general, focusing primarily on papers published within the past year. Keywords: text mining, natural language processing, information extraction, text summarization, image mining, question answering, literature-based discovery, evaluation, user ...

  3. Text-mining solutions for biomedical research: enabling ...

    Text mining is a means to process the scientific literature at a large scale. It is the means to make documents and their content more accessible. Literature repositories, such as PubMed...

  4. [1901.08746] BioBERT: a pre-trained biomedical language representation

    While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question ans...

  5. Biomedical text mining for research rigor and integrity: tasks

    With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.

  6. Trends and Techniques of Biomedical Text Mining: A Review

    This paper presents a review on the challenges and contributions of research works on biomedical text mining held from 2003 to 2020. Furthermore, we discussed their methodology, the datasets they utilized to evaluate work, and also their findings.

  7. Application of text mining in the biomedical domain

    In this paper we introduce the most important techniques that are used for a text mining and give an overview of the text mining tools that are currently being used and the type of problems they are typically applied for. Next Keywords The scientific literature provides a wealth of information to researchers.

  8. MarkerGenie: an NLP-enabled text-mining system for biomedical entity

    In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations of biomedical entities are valuable as they can provide a more thorough survey of all available literature, hence providing a less biased result compared to manual curation.

  9. Text Mining in Biomedical Domain with Emphasis on Document Clustering

    This paper provided a comprehensive overview of text mining methods. The paper discussed the roles of text mining in biomedical applications and presented software available to carry out text mining. The paper also presented an overview of techniques to find similarities between studies on a given topic from available research articles.

  10. Biomedical Text Mining and Its Applications

    Biomedical text mining (henceforth, text mining) is the subfield that deals with text that comes from biology, medicine, and chemistry (henceforth, biomedical text). Another popular name is BioNLP, which some practitioners use as synonymous with text mining.

  11. BioGPT: Generative Pre-trained Transformer for Biomedical Text

    In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature. We evaluate BioGPT on six biomedical NLP tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end ...

  12. Biomedical Text Mining: Experience and Practical Approach

    Biomedical Text Mining: Experience and Practical Approach Abstract: Research in biomedical fields, including biology and medicine, has resulted in a sheer volume of published reports and papers. Above all, biomedical text mining has emerged as a vital research domain that has an impact on the project development of these research areas.

  13. Application of text mining in the biomedical domain

    As a consequence, text mining tools have evolved considerably in number and quality and nowadays can be used to address a variety of research questions ranging from de novo drug target discovery to enhanced biological interpretation of the results from high throughput experiments. In this paper we introduce the most important techniques that ...

  14. Biomedical Text Mining

    Pages 41-70 Finding Gene Associations by Text Mining and Annotating it with Gene Ontology Oviya Ramalakshmi Iyyappan, Sharanya Manoharan Pages 71-90 Biomedical Literature Mining for Repurposing Laboratory Tests Finn Kuusisto, Ross Kleiman, Jeremy Weiss

  15. (PDF) Application of Biomedical Text Mining

    Text mining is being used in many fields such as cybercrime (Kontostathis, Edwards, & Leatherman, 2010), biomedicine (Gong, 2018), and education (Agrawal & Batra, 2013). Text mining is also widely ...

  16. PDF A Comprehensive Benchmark Study on Biomedical Text Generation ...

    substantial human effort and is time-consuming. Thus, automated text generation and mining techniques can greatly assist researchers via extracting or deriving valuable insights from the available big data in biomedical literature. Recently, one of the most promising advances in NLP field is the development of so called large-scale language models (LLMs) to ...

  17. Research status and trend analysis of global biomedical text mining

    In the biomedical field, due to the abrupt growth of the number of biomedical data and literature, acquiring laws and new knowledge by data mining technology has become an important branch and a hot spot in biomedical field (Tari et al. 2010).Text Mining is a specific research field belonging to the interdisciplinary subject of data mining, whose main tasks are to integrate and analyze vast ...

  18. Frontiers of biomedical text mining: current progress

    doi:10.1093/bib/bbm045. Abstract: It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain.

  19. Utilizing BERT for biomedical and clinical text mining

    In Section 4, we conclude our studies and findings and present the possible directions to deepen the biomedical and clinical text mining research with BERT in future. 2. BERT. ... The corpus is made up with 18% computer science domain paper and 82% broad biomedical domain papers. Moreover, instead of using only the abstracts of these sources ...

  20. Text-mining solutions for biomedical research: enabling integrative

    A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years, such as Chilibot, Textpresso, and PreBIND (see Text S1 for these and most other citations), the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text ...

  21. A survey of current work in biomedical text mining

    The major challenge of biomedical text mining over the next 5-10 years will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed. The volume of published biomedical ...

  22. Biomedical Text Mining for Research Rigor and Integrity: Tasks

    Four key areas in which text mining techniques can make a significant contribution are identified: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload, and accurate citation/enhanced bibliometrics. An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant ...

  23. Animals

    Background: Research model selection decisions in basic and preclinical biomedical research have not yet been the subject of an ethical investigation. Therefore, this paper aims, (1) to identify a spectrum of reasons for choosing between animal and alternative research models (e.g., based on in vitro or in silico models) and (2) provides an ethical analysis of the selected reasons. Methods: In ...

  24. Biomedical text mining and its applications in cancer research

    The major challenge of biomedical text mining over the next 5-10 years will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are ...