Datasets at your fingertips in Google Search – Artificial Intelligence Medical and Engineering Researchers Society

Posted by Natasha Noy, Research Scientist, and Omar Benjelloun, Software Engineer, Google Research

Access to datasets is critical to many of today’s endeavors across verticals and industries, whether scientific research, business analysis, or public policy. In the scientific community and throughout various levels of the public sector, reproducibility and transparency are essential for progress, so sharing data is vital. For one example, in the United States a recent new policy requires free and equitable access to outcomes of all federally funded research, including data and statistical information along with publications.

To facilitate discovery of content with this level of statistical detail and better distill this information from across the web, Google now makes it easier to search for datasets. You can click on any of the top three results (see below) to get to the dataset page or you can explore further by clicking “More datasets.” Here is an example:

When users search for datasets in Google search, they find a dedicated section highlighting pages with dataset descriptions. They can explore many more datasets by clicking on “More datasets” and going to Dataset Search.

Powered by Dataset Search

Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites. Datasets cover many disciplines and topics, including government, scientific, and commercial datasets. Dataset Search shows users essential metadata about datasets and previews of the data where available. Users can then follow the links to the data repositories that host the datasets.

Dataset Search primarily indexes dataset pages on the Web that contain schema.org structured data. The schema.org metadata allows Web page authors to describe the semantics of the page: the entities on the pages and their properties. For dataset pages, schema.org metadata describes key elements of the datasets, such as their description, license, temporal and spatial coverage, and available download formats. In addition to aggregating this metadata and providing easy access to it, Dataset Search normalizes and reconciles the metadata that comes directly from the Web pages.

If you are a dataset author or provider and want others to find your datasets in Search, make sure that you publish your dataset in a way that makes it discoverable and specifies how others can reuse the data. Specifically, ensure that the Web page that describes the dataset has machine-readable metadata. The easiest way to ensure this is to publish your dataset in an established dataset repository. Some repositories cater to specific research communities, while others are “generalists” (figshare.com, zenodo.org, datadryad.org, kaggle.com, etc.). These repositories automatically include metadata in dataset pages for every dataset, which makes it easy for search engines to discover and include them in specialized result sections, as in the figure above.

As data sharing continues to grow and evolve, we will continue to make datasets as easy to find, access, and use as any other type of information on the web.

Acknowledgments

We are extremely grateful to the numerous Googlers who contributed to developing and launching this feature, including: Rachel Zax, Damian Biollo, Shiyu Chen, Jonathan Drake, Sunil Vemuri, Stephen Tseou, Amit Bapat, Will Leszczuk, Marc Najork, Sergei Vassilvitskii, Bruno Possas, and Corinna Cortes.

Posted by Natasha Noy, Research Scientist, and Omar Benjelloun, Software Engineer, Google Research Access to datasets is critical to many of today’s endeavors across verticals and industries, whether scientific research, business analysis, or public policy. In the scientific community and throughout various levels of the public sector, reproducibility and transparency are essential for progress, so sharing data is vital. For one example, in the United States a recent new policy requires free and equitable access to outcomes of all federally funded research, including data and statistical information along with publications. To facilitate discovery of content with this level of statistical detail and better distill this information from across the web, Google now makes it easier to search for datasets. You can click on any of the top three results (see below) to get to the dataset page or you can explore further by clicking “More datasets.” Here is an example: When users search for datasets in Google search, they find a dedicated section highlighting pages with dataset descriptions. They can explore many more datasets by clicking on “More datasets” and going to Dataset Search. Powered by Dataset Search Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites. Datasets cover many disciplines and topics, including government, scientific, and commercial datasets. Dataset Search shows users essential metadata about datasets and previews of the data where available. Users can then follow the links to the data repositories that host the datasets. Dataset Search primarily indexes dataset pages on the Web that contain schema.org structured data. The schema.org metadata allows Web page authors to describe the semantics of the page: the entities on the pages and their properties. For dataset pages, schema.org metadata describes key elements of the datasets, such as their description, license, temporal and spatial coverage, and available download formats. In addition to aggregating this metadata and providing easy access to it, Dataset Search normalizes and reconciles the metadata that comes directly from the Web pages. If you are a dataset author or provider and want others to find your datasets in Search, make sure that you publish your dataset in a way that makes it discoverable and specifies how others can reuse the data. Specifically, ensure that the Web page that describes the dataset has machine-readable metadata. The easiest way to ensure this is to publish your dataset in an established dataset repository. Some repositories cater to specific research communities, while others are “generalists” (figshare.com, zenodo.org, datadryad.org, kaggle.com, etc.). These repositories automatically include metadata in dataset pages for every dataset, which makes it easy for search engines to discover and include them in specialized result sections, as in the figure above. As data sharing continues to grow and evolve, we will continue to make datasets as easy to find, access, and use as any other type of information on the web. Acknowledgments We are extremely grateful to the numerous Googlers who contributed to developing and launching this feature, including: Rachel Zax, Damian Biollo, Shiyu Chen, Jonathan Drake, Sunil Vemuri, Stephen Tseou, Amit Bapat, Will Leszczuk, Marc Najork, Sergei Vassilvitskii, Bruno Possas, and Corinna Cortes.

Powered by Dataset Search

Acknowledgments

Leave a ReplyCancel Reply