discussion / AI for Conservation  / 6 September 2018

Google unveils search engine for open data

Dataset Search enables users to find datasets stored across thousands of repositories on the Web, making these datasets universally accessible.

Google has unveiled a search engine to help researchers locate online data that is freely available for use. The company launched the service on 5 September, saying that it is aimed at “scientists, data journalists, data geeks, or anyone else”.

Dataset Search, now available alongside Google’s other specialized search engines, such as those for news and images — as well as Google Scholar and Google Books — locates files and databases on the basis of how their owners have classified them. It does not read the content of the files themselves in the way search engines do for web pages.

Experts say that it fills a gap and could contribute significantly to the success of the open-data movement, which aims to make data openly available for use and re-use.

Government agencies, scientific publishers, research institutions and even individual researchers maintain thousands of open-data repositories around the world, containing millions of data sets.

But researchers who want to know what types of data are available, or who hope to locate data they know already exist, often have to rely on word of mouth, says Natasha Noy, a computer scientist at Google AI in Mountain View, California.

This problem is especially serious for early-career researchers who are not already “plugged” into a network of professional connections, Noy says. It’s also a downside for those who do cross-disciplinary research — for example, an epidemiologist who needs access to climate data that could be relevant to the spread of a virus.

I saw this pop up on Twitter this morning and it seems interesting. @ac0159, @benkt or anyone else working neck-deep in data: have you had a chance to look at it? Is it going to be useful? Curious to hear thoughts.

I did a quick trial out of curiosity and searched for 'biodiversity'. The results it brings up are ...diverse... but it seems to present useful high-level info on each dataset (i.e. licensing, description, the file types the data comes in, etc.). My search term may have been too broad, but could this be an answer to the need for a repository of repositories that we've been discussing?

I can only applaud the idea. It is definitely something we've been needing, but the current results do seem very hit-and-miss: e.g. searching for "protected areas" doesn't bring up WDPA from the UNEP-WCMC source, and similarly "redlist" doesn't link back to IUCN. I'm certainly going to investigate how we get our data listed and correctly pointed to in terms of licensing etc.

Link for how to make sure your data gets indexed: https://developers.google.com/search/docs/data-types/dataset
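
For anyone wondering what that looks like in practice: as far as I understand it, the page above asks data owners to embed a schema.org `Dataset` description in the dataset's landing page, typically as JSON-LD. Below is a minimal sketch of generating such a block in Python. The dataset name, description and example.org URL are made up for illustration, and the helper names (`dataset_jsonld`, `to_script_tag`) are my own, not part of any Google tooling. Check the linked page for which properties are currently required or recommended.

```python
import json


def dataset_jsonld(name, description, url, license_url, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    The property names (name, description, url, license, keywords) are
    schema.org Dataset properties; the values passed in are up to the owner.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": list(keywords),
    }


def to_script_tag(record):
    """Wrap the JSON-LD in the script tag that goes into the page's HTML."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset, for illustration only.
example = dataset_jsonld(
    name="Example protected areas dataset",
    description="Hypothetical boundaries and attributes of protected areas.",
    url="https://example.org/datasets/protected-areas",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["protected areas", "biodiversity"],
)

print(to_script_tag(example))
```

The printed block would be pasted into the HTML of the dataset's landing page. Stating the licence explicitly is presumably how the licensing info ends up displayed correctly in the search results.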

I agree – this is a very welcome development, and it’s early days; I’m sure it will improve rapidly though. Whilst recognising that there are lots(!) of excellent data repositories out there already, with necessarily specialist functionality, there’s long been a need for something that can overarch these effectively, a ‘discovery portal of discovery portals’. Hopefully this can help do that.

After a cursory look, a couple of things struck me from a user perspective: 1 – there are definitely some odd/limited search results at the moment, but as noted it’s early days; it’ll snowball as data owners get on board and standards adjust accordingly. 2 – more search tools would be beneficial, e.g. date range tools and a map/bounding box search tool (cf. Microsoft’s FetchClimate tool).

I also wanted to understand a bit more about how it works. I assumed markup, but wondered what ‘semantic web’ stuff this is drawing on. This article gives a bit more info, but I wonder how different it is from other efforts in this regard, e.g. how the ARIES team have been developing semantic-based tools to find the best available datasets for ecosystem service modelling.

Final thought – it raises interesting questions and challenges about how to ensure things like quality and suitability are going to be measured objectively. It seems like this is an issue to be tackled as the tool develops and data owners engage more as it grows…

Hi all,

I'm fairly new to conservation technology and just getting acquainted with the extent of the field and its problems. Data aggregation, standardisation and storage keep popping up as chronic problems across a lot of areas. Data seems to exist in silos of sorts, with different filing and access arrangements between them.

I would be interested to hear: Has Google Dataset Search improved drastically since its inception? Are there alternative solutions out there, or are there efforts to create them?

For example, following the bioacoustics meetup the other day, I noticed that the vast datasets from the Australian Acoustic Observatory and Cornell Bioacoustics Centre don't seem to show up on Google Dataset Search.

Andy, the ARIES team you mentioned released a preview video of the interface for their ecosystem services modelling software. It seems really cool. Is this something that would be useful for researchers outside of strict ecosystem services, e.g. for looking at distributions of particular species temporally and spatially? What do you think of their software?

I'm not certain I'm asking the right questions here, but I'd be curious to hear your thoughts on any of this if you have any time.