Thousands of papers and reports about flora and fauna are published each year. While peer-reviewed published information is vitally important to conservation organisations, the ever-increasing mountain of information presents a huge challenge. Researchers attempting to distill it and ensure the right information reaches the people who can take positive action in a timely manner face a difficult task.
One of the major efforts to systematically review literature on threatened and endangered species is the IUCN Red List, a mammoth manual effort that is extremely expensive, time-consuming, and tedious for the people involved. As a result, the status of many species is typically assessed and re-assessed by experts only once every 5 years.
During this time, anywhere between 1,000 and 10,000 species are expected to have gone extinct, while still others have been rediscovered. ContentMine's Ross Mounce has found that of 100 plant species listed as extinct, 16 are alive and well somewhere in the world. One species, Wendlandia angustifolia, was reported as rediscovered back in the year 2000, more than 15 years ago! Yet, despite the existence of easily findable peer-reviewed articles reporting the rediscovery and hence extant status of this taxon, the listing remains as 'extinct' on the Red List to this day.
This lag used to be an inevitable byproduct of dealing with a deluge of information and a necessarily sprawling assessment process, but we believe that intelligent machines could offer a solution.
We are building an open source pipeline to extract facts from scientific documents that we think could make the literature review process cheaper, more rigorous, continuous and transparent. We're publishing a daily stream of facts related to all Red List species and believe that this type of literature tracking could help shorten the time research takes to make an impact and reduce the burden on researchers.
Moreover, we'll also make this information accessible to the 99% of the population who have no access to the scholarly literature but might still be interested and invested in updates on, for example, Ursus maritimus (the polar bear). They may even be sharing that information with others, such as the 2,914 editors of the Wikipedia entry for polar bears. These many eyes mean that errors or updates can be fed back to the IUCN more quickly. We're currently publishing facts from Open Access journals but plan to cover a wider range of publications very soon.
We can pull out a wide range of facts from articles, including species, gene names, chemicals, places and dictionaries of key words provided by researchers. These can be extracted directly into useful datasets or displayed for convenient reading.
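To give a feel for what dictionary-based extraction involves, here is a minimal sketch in Python. It is not the ContentMine implementation: the `DICTIONARY` entries, the categories and the `match_terms` function are all hypothetical names chosen for illustration, and the sample text is adapted from the frog article discussed in this post.

```python
import re
from collections import Counter

# Hypothetical researcher-supplied dictionary (term -> category).
# These entries are illustrative assumptions, not a real ContentMine dictionary.
DICTIONARY = {
    "rainforest": "habitat",
    "queensland": "place",
    "treefrog": "organism",
    "radiotransmitter": "method",
}


def match_terms(text, dictionary):
    """Count whole-word, case-insensitive occurrences of each dictionary term."""
    counts = Counter()
    for term in dictionary:
        # \b anchors keep "treefrog" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


sample = (
    "The common mistfrog (Litoria rheocola) occurs near rocky, fast-flowing "
    "rainforest streams in northeastern Queensland, Australia. "
    "Litoria rheocola is a small treefrog."
)

counts = match_terms(sample, DICTIONARY)
for term, n in counts.items():
    print(f"{DICTIONARY[term]}: {term} ({n})")
```

Exact whole-word matching is the simplest possible choice here; a real pipeline would presumably also need to handle plurals, synonyms and spelling variants.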
To demonstrate how this tool might be useful, Ross Mounce walks through what information might be extracted from an article about an endangered species of frog that was published in PLOS ONE:
Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851.
Our ami-species plugin, which allows refinement by section, extracted the following text:
From the Introduction section:
- …One such species is the common mistfrog (Litoria rheocola), an IUCN Endangered species that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia …
- …Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm) …
- …Habitat modification and fragmentation also threaten L. rheocola [23,27]…
- …The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation. …
- Because L. rheocola are too small to carry radiotransmitters, we tracked frogs using harmonic direction finding [32,33].
- Litoria rheocola is a treefrog, and individuals move along and at right angles to the stream and also climb up and down vegetation;
- Overall, we found that L. rheocola are relatively sedentary frogs that are restricted to the stream environment, and prefer sections of the stream with riffles, numerous rocks, and overhanging vegetation (Table 2).
- Our data confirm that L. rheocola are active year-round, but their behavior varies substantially between seasons.
This extraction reduces the full text from over 6,000 words to a bite-size summary of roughly 700. Yet, as you can see, it provides far more information than the paper's abstract does. Multiply this effect across thousands of papers and searches for thousands of different species and you can begin to understand the usefulness of a tool like this.
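The section-by-section extraction in the walk-through above can be approximated in a few lines of Python. This is only a rough sketch under stated assumptions, not how the ami-species plugin actually works: the function names, the naive sentence splitter and the `sections` dictionary are all hypothetical, and the sample text mixes sentences quoted above with made-up filler sentences.

```python
import re


def species_pattern(binomial):
    """Regex matching a full binomial or its abbreviated form, e.g. 'L. rheocola'."""
    genus, epithet = binomial.split()
    full = re.escape(genus) + r"\s+" + re.escape(epithet)
    # Allow a missing space after the abbreviated genus ("L.rheocola").
    abbrev = re.escape(genus[0]) + r"\.\s*" + re.escape(epithet)
    return re.compile(r"\b(?:" + full + "|" + abbrev + r")\b")


def split_sentences(text):
    """Naive splitter: break after . ! or ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def extract_mentions(sections, binomial):
    """Return {section name: [sentences mentioning the species]}."""
    pattern = species_pattern(binomial)
    result = {}
    for name, text in sections.items():
        hits = [s for s in split_sentences(text) if pattern.search(s)]
        if hits:
            result[name] = hits
    return result


# Toy input: an assumption for illustration, not the parsed article.
sections = {
    "Introduction": (
        "Many amphibians are in decline. "  # filler sentence
        "Litoria rheocola is a small treefrog. "
        "Habitat modification and fragmentation also threaten L.rheocola."
    ),
    "Methods": (
        "We visited several streams. "  # filler sentence
        "Because L. rheocola are too small to carry radiotransmitters, "
        "we tracked frogs using harmonic direction finding."
    ),
    "Acknowledgments": "We thank our volunteers.",  # filler sentence
}

mentions = extract_mentions(sections, "Litoria rheocola")
for section, sentences in mentions.items():
    print(f"From the {section} section:")
    for sentence in sentences:
        print(f"- {sentence}")
```

Requiring a capital letter after the sentence break stops the splitter from cutting "L. rheocola" in half, but it would still stumble on other abbreviations, so a production pipeline would likely need something more robust.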
To find out more about the tool, you can view the IUCN Red List demo here and find out how to use the ContentMine pipeline here. We are constantly developing the software and are always happy to hear from researchers who would like to try it out! Please get in contact with us over at our forum or via @theContentMine on Twitter.
* Note that because PLOS ONE is an openly-licensed journal we can re-post as much context around each entity as we wish.
Interested in data mining? Check out our Data Science Group.