PubMed knowledge graph (PKG)

A comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field.

The first version of PKG was published in Scientific Data. It extracted bio-entities from 29 million PubMed abstracts, disambiguated author names, integrated funding data through the National Institutes of Health (NIH) ExPORTER, collected the affiliation history and educational background of authors from ORCID®, and identified fine-grained affiliation data from MapAffil. See the paper "Building a PubMed knowledge graph" for more details of the first version of PKG.

Papers, patents, and clinical trials are indispensable types of scientific literature in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic, fine-grained connections among them. To address this issue, we introduce PKG 2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. PKG 2.0 integrates these previously dispersed resources through various links, including biomedical entities, author networks, citation relationships, and research projects. You can refer to our paper "PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science" for more details of PKG 2.0.

We have organized the knowledge graph dataset into a series of relational database tables, named with prefixes A, B, and C.

All PKG 2.0 data are stored in SQL and TSV files, which can be downloaded from Science Data Bank. Please refer to the Database Features and the Database Description for details on how to use the PKG dataset.
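As a quick way to inspect a downloaded TSV table before importing it, the following is a minimal sketch in Python. It assumes the file is UTF-8 encoded and tab-delimited without quoting. The function name `preview_tsv` and the example file name are hypothetical and not part of the PKG distribution, and the sketch makes no assumption about whether the first row is a header.

```python
import csv
import gzip
from itertools import islice
from pathlib import Path


def preview_tsv(path, n_rows=5):
    """Return the first n_rows rows of a TSV file as lists of strings.

    Works on plain .tsv files and on gzip-compressed .tsv.gz files.
    Assumption: the file is UTF-8 encoded and uses tabs without quoting.
    """
    path = Path(path)
    # Pick the opener based on the file extension.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # islice stops after n_rows, so large tables are never fully loaded.
        return list(islice(reader, n_rows))


# Example usage (hypothetical file name):
# for row in preview_tsv("tablename.tsv.gz"):
#     print(row)
```

Because the rows are read lazily, this works on very large tables without loading them into memory.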

Updates

[Jul 20, 2025] The PKG dataset has been updated to PKG24S4 based on the PubMed 2025 Baseline. Data from the other sources integrated in the dataset were updated at the same time to keep the overall data current and consistent.

[Jun 17, 2025] Personal information such as name, gender, and race was deleted to comply with personal data protection requirements.

[Apr 17, 2025] C04_ReferenceList_Papers: Deleted the citation data from OpenCitations to avoid incorrect data caused by some ID mappings.

[Oct 10, 2024] The initial version of PKG 2.0 was released alongside the arXiv article. To align with evolving data privacy standards, this version is no longer available for download.

Download

FAQ

1. How do I decompress the .gz files?

      For Windows, 7-Zip is recommended.

      For Linux, the following command can be used for decompression:

              "gunzip tablename.sql.gz"
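      As a platform-independent alternative, the decompression can also be sketched in Python using only the standard library. The function name `decompress_gz` is hypothetical, and the file name simply mirrors the example above. Unlike plain `gunzip`, this sketch keeps the original .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src, dest=None):
    """Decompress a .gz file and return the path of the output file.

    If dest is not given, the output is written next to the source
    with the trailing ".gz" removed (tablename.sql.gz -> tablename.sql).
    The compressed source file is left untouched.
    """
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {src}")
    dest = Path(dest) if dest is not None else src.with_suffix("")
    # Stream in chunks so multi-gigabyte tables do not need to fit in memory.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Example usage (file name taken from the command above):
# decompress_gz("tablename.sql.gz")
```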

Citation

Xu, J., Yu, C., Xu, J. et al. PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science. Sci Data 12, 1018 (2025). https://doi.org/10.1038/s41597-025-05343-8

Xu, J., Kim, S., Song, M., Jeong, M., Kim, D., Kang, J., Rousseau, J. F., Li, X., Xu, W., Torvik, V. I., Bu, Y., Chen, C., Ebeid, I. A., Li, D., & Ding, Y. (2020). Building a PubMed knowledge graph. Scientific Data, 7, 205. https://doi.org/10.1038/s41597-020-0543-2

Copyright Notice

This project is licensed under the MIT License.

Contact Information

If you need help or have issues using PKG, please contact Jian Xu at issxj@mail.sysu.edu.cn.