Structured XML versus the blob of text
Roy Kaufman and Michael Iarrobino explain why text miners should heed the data quality imperative
Too often, researchers doing text mining place bets on results derived from problematic, low-fidelity sources. Data quality is essential to the bench-to-bedside research journey. Small discrepancies or ambiguities in lab data can obscure true findings, stalling scientific progress or, worse, actually causing harm.
Good input data is as essential when mining scientific articles as it is for any other data inquiry. The structure and format of published research fundamentally affect how humans and machines can discover and interpret valuable facts, findings, and assertions. Too often, this lesson has been ignored.
What’s wrong with abstracts and scraping?
Machine-readable content has become increasingly important to life sciences research and development organisations as the volume of published articles has reached millions annually. Pharmaceutical companies, for example, use machines’ natural-language processing capabilities and semantic technologies, under the umbrella of text mining or text analytics, to shorten their time to insight and increase their ability to elicit novel findings from existing research.
Bioinformaticians first employed text mining technology across journal article abstracts (for example, those on Medline), but this had limitations. Abstracts contain fewer, less novel, and less diverse facts than full-text articles; for instance, they seldom include details of methods or other experimental data.
Text miners have since turned to a variety of techniques to acquire machine-readable, full-text content, including scraping content from public websites or converting PDFs intended for human reading into the XML format that text mining applications typically prefer. These approaches have their downsides too.
Although scraping can yield structured XML output based on the HTML in the page, scrapers break easily and are difficult to scale. When a website is changed or a scraper encounters a page with a different structure, the output may lose its fidelity to the original, or the scraping process may stop entirely.
PDF conversion to XML is also problematic. If the PDF file lacks a text layer or other metadata, conversion processes often rely on optical character recognition (OCR) to create structured text from the image. OCR can omit or introduce characters and lose the high-level structure of the content. Even PDFs with an embedded text layer still lack the comprehensive tagging provided by a structured XML article.
What’s the effect of the resulting XML on text mining efforts? Here are a few examples.
The XML will probably be missing explicit tags for the sections we all recognise as part of a scientific article – introduction, materials and methods, conclusion, and so on. The user therefore loses the ability to constrain queries to these sections once the XML is indexed. As a result, false positives increase, and the user may be unable to distinguish a speculative review of the existing literature in the introduction from a substantive experimental finding in the conclusion.
Worse, article metadata, such as the citations and bibliography, may find their way into the content 'blob' of the converted or scraped XML. Tabular data in the article may also become inaccessible in any meaningful way. Tables, prized by researchers for their rich experimental data, link concepts together with rows and columns – for example, relating the concentrations found in patients to the dosages of a drug they were administered. When converted from PDF, the resulting XML loses this richness, representing the data as merely strings of nearby letters and numbers.
Structures are better than blobs
By contrast, articles in 'native' XML format are created by publishers directly in XML, rather than converted from PDF or scraped from HTML.
When indexed, the content from these XML files allows for greater flexibility in querying, offering numerous ways for the user to define query clauses to achieve the right precision and recall. Let’s examine how this resolves the problems identified above for scraped or converted files.
With clear identification of article sections, researchers can gain the benefits of a full-text query while maintaining a high signal-to-noise ratio. Depending on the use case, queries can be restricted to particular sections. For example, a user aiming to identify companies whose lab equipment or reagents are being used in experiments may constrain the query to the 'materials and methods' section. The researcher may also apply a query broadly across the document while excluding matches from particular sections.
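As a minimal sketch of how such a constraint might look in practice – assuming JATS-style XML in which sections carry a sec-type attribute, and with an invented reagent name ('Reagent-X') and supplier ('AcmeBio') – a query can be confined to methods sections with a single XPath expression:

```python
# Minimal sketch: restricting a full-text query to the materials and
# methods section of a JATS-style article. The sample XML, the reagent
# name 'Reagent-X' and the supplier 'AcmeBio' are invented.
from lxml import etree

article = etree.fromstring(b"""
<article>
  <body>
    <sec sec-type="intro">
      <title>Introduction</title>
      <p>Earlier reviews speculated that Reagent-X might bind the target.</p>
    </sec>
    <sec sec-type="materials|methods">
      <title>Materials and methods</title>
      <p>Cells were treated with Reagent-X (AcmeBio) at 10 uM.</p>
    </sec>
  </body>
</article>
""")

# Only matches inside methods sections count, so the speculative mention
# in the introduction does not surface as a false positive.
hits = article.xpath("//sec[contains(@sec-type, 'methods')]"
                     "//p[contains(., 'Reagent-X')]")
for p in hits:
    print(p.xpath("string(.)").strip())
```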
Citations, by far the largest contributor to false positives when working with a blob of converted text, are well defined in the metadata fields of an article in native XML. Consequently, the researcher can invoke them when appropriate, for instance, when doing analytics on networks of researchers in a field, and can exclude them when not.
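A similar sketch, again assuming JATS-style tagging in which the bibliography sits in a ref-list element and in-text citation pointers are xref elements of type 'bibr', shows how citations can be stripped out before indexing – or kept when the analysis calls for them:

```python
# Minimal sketch: removing reference lists and bibliographic citation
# pointers from a JATS-style article before indexing, so cited titles do
# not surface as false positives. Element names follow common JATS
# tagging and may need adapting to a publisher's actual schema.
import copy

from lxml import etree


def body_text_without_citations(article: etree._Element) -> str:
    """Return the body text with <ref-list> elements and <xref
    ref-type="bibr"> pointers removed; keep them instead when, say,
    mapping networks of researchers in a field."""
    work = copy.deepcopy(article)
    for node in work.xpath(".//ref-list | .//xref[@ref-type='bibr']"):
        parent = node.getparent()
        if parent is not None:
            parent.remove(node)
    # string(.//body) collapses the remaining body content to plain text.
    return " ".join(work.xpath("string(.//body)").split())
```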
And finally, a researcher can interrogate rich tabular data with sophisticated text mining queries to identify clear relationships between drugs and symptoms, dosages and concentrations, and other experimental data.
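As a final sketch, assuming the JATS convention of a table-wrap element around an XHTML-style table, and with invented dose and concentration figures, the row-and-column relationships can be recovered directly rather than guessed from adjacent strings:

```python
# Minimal sketch: recovering dose/concentration relationships from a
# JATS-style <table-wrap>. The figures shown are invented for illustration.
from lxml import etree

table_wrap = etree.fromstring(b"""
<table-wrap id="T1">
  <label>Table 1</label>
  <table>
    <thead><tr><th>Dose (mg)</th><th>Plasma concentration (ng/mL)</th></tr></thead>
    <tbody>
      <tr><td>10</td><td>42</td></tr>
      <tr><td>20</td><td>85</td></tr>
    </tbody>
  </table>
</table-wrap>
""")

headers = [th.text for th in table_wrap.xpath(".//thead//th")]

# Each value stays linked to its column header; in a PDF-derived blob
# these would be nothing more than nearby numbers.
for row in table_wrap.xpath(".//tbody/tr"):
    record = dict(zip(headers, (td.text for td in row.xpath("td"))))
    print(record)
```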
Investment is key
In the last three years, text mining has moved from a novel technique to a standard way for researchers to interact with content. And just as a PDF must be structured and formatted to be useful to humans, so too must XML be structured and formatted to be useful to machines.