PubMed knowledge graph (PKG)

A comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field.

The first version of PKG, published in Scientific Data, extracted bio-entities from 29 million PubMed abstracts, disambiguated author names, integrated funding data from the National Institutes of Health (NIH) ExPORTER, collected the affiliation history and educational background of authors from ORCID®, and identified fine-grained affiliation data from MapAffil. See the paper "Building a PubMed knowledge graph" for more details of the first version of PKG.

Papers, patents, and clinical trials are indispensable types of scientific literature in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic, fine-grained connections among them. To address this issue, we introduce PKG 2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. PKG 2.0 integrates these previously dispersed resources through various links, including biomedical entities, author networks, citation relationships, and research projects. You can refer to our paper "PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science" for more details of PKG 2.0.

We have organized the knowledge graph dataset into a series of relational database tables, named with prefixes A, B, and C.

All PKG 2.0 data are stored in SQL files, which can be downloaded from OneDrive. In addition, the tables whose names start with 'C' are also available in TSV format and can be downloaded from figshare. Please refer to the Database Features and the Database Description for more details about how to use the PKG dataset.
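If you only want a quick look at one of the TSV tables before loading it anywhere, a minimal Python sketch along the following lines may help. It is an illustration rather than part of the PKG tooling: the function name `peek_tsv` is hypothetical, and it assumes the file is already decompressed, UTF-8 encoded, and uses plain tab separation without quoting. It makes no assumption about whether the first row is a header.

```python
import csv
from itertools import islice


def peek_tsv(path, n_rows=5, encoding="utf-8"):
    """Return the first n_rows rows of a tab-separated file as lists of strings.

    Assumptions (not confirmed by the PKG documentation): the file is
    UTF-8 encoded and fields are not quoted.
    """
    with open(path, newline="", encoding=encoding) as f:
        # QUOTE_NONE keeps any quote characters as literal field content.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        # islice stops after n_rows, so large tables are never read in full.
        return list(islice(reader, n_rows))
```

Calling `peek_tsv("some_table.tsv")` (the file name is a placeholder) returns a short list of rows that you can print and compare against the Database Description.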

Download

FAQ

1. How do I decompress the .gz files?

      For Windows, 7-Zip is recommended.

      For Linux, the following command can be used for decompression:

              "gunzip tablename.sql.gz"

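      If neither tool is at hand, the same decompression can be sketched with the Python standard library, which behaves the same way on Windows, Linux, and macOS. The function name `gunzip_file` is hypothetical and the file name in the usage note is a placeholder. Unlike gunzip, this sketch keeps the original .gz file.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest=None, chunk_size=1024 * 1024):
    """Decompress a .gz file and return the path of the decompressed file.

    If dest is not given, the output name is src without its .gz suffix,
    e.g. tablename.sql.gz -> tablename.sql. The source file is kept.
    """
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError(f"expected a .gz file, got {src.name}")
        dest = src.with_suffix("")
    dest = Path(dest)
    # Copy in chunks so multi-gigabyte tables do not have to fit in memory.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest
```

      For example, `gunzip_file("tablename.sql.gz")` writes `tablename.sql` next to the archive.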
2. How can I verify the data files using MD5 checksums?

      The following command can be used for verification:

              "md5sum -c tablename.sql.gz.md5sum"

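      On systems without md5sum, the check can be approximated in Python. The sketch below is an illustration under two assumptions that the PKG documentation does not confirm: each checksum file uses the usual md5sum layout (a hexadecimal digest, whitespace, then a file name), and the data file sits in the same directory as its checksum file. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sum_file(checksum_path):
    """Check every entry of an md5sum-style file.

    Returns a dict mapping each listed file name to True (digest matches)
    or False (digest differs). Listed files are looked up next to the
    checksum file, which is an assumption of this sketch.
    """
    checksum_path = Path(checksum_path)
    results = {}
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' before the name.
        name = name.strip().lstrip("*")
        actual = md5_of_file(checksum_path.parent / name)
        results[name] = actual == expected.lower()
    return results
```

      For example, `verify_md5sum_file("tablename.sql.gz.md5sum")` returns a dictionary in which a value of False means the download should be repeated.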
Updates

[Jun 17, 2025] Personal information such as name, gender, and race was removed.

[Apr 17, 2025] C04_ReferenceList_Papers: Removed the citation data obtained from OpenCitations.

Citation

Xu, J., Yu, C., Xu, J. et al. PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science. Sci Data 12, 1018 (2025). https://doi.org/10.1038/s41597-025-05343-8

Xu, J., Kim, S., Song, M., Jeong, M., Kim, D., Kang, J., Rousseau, J. F., Li, X., Xu, W., Torvik, V. I., Bu, Y., Chen, C., Ebeid, I. A., Li, D., & Ding, Y. (2020). Building a PubMed knowledge graph. Scientific Data, 7, 205. https://doi.org/10.1038/s41597-020-0543-2

Copyright Notice

This project is licensed under the MIT License.

Contact Information

If you need help or have issues using PKG, please contact Jian Xu at issxj@mail.sysu.edu.cn.