Wikidata: The Internet's Hidden Engine for Knowledge

A quiet revolution in how humanity organizes knowledge is underway, powered by a collaborative platform you've probably never seen.

In an age of information overload and isolated data silos, the scientific community faces a formidable challenge: how to efficiently integrate and utilize the vast amount of knowledge being discovered. A significant portion of this knowledge remains locked away, either in specialized databases with incompatible formats or buried in free-text publications, hindering our ability to query and mine this information effectively [1]. Imagine if every fact about a gene, a chemical compound, or a disease could be seamlessly connected, not just within biology but across all domains of human knowledge. This is the ambitious mission of Wikidata, a community-maintained knowledge base that is becoming an indispensable platform for data integration and dissemination in the life sciences and beyond.

What Exactly is Wikidata?

If you've ever used Wikipedia, you've indirectly used Wikidata. While Wikipedia focuses on human-readable articles, its sister project Wikidata is a repository for structured data that can be read and edited by both humans and machines [4]. It serves as the central storage for the vast majority of the structured data used across Wikipedia, Wiktionary, and other Wikimedia projects.

To understand its structure, think of Wikidata as a massive set of subject-predicate-object triples, a standard known as Resource Description Framework (RDF) [2]. For example, the item for "Insulin" (the subject) could be connected to the property "has part" (the predicate) with the value "amino acid chain" (the object) [6]. This simple yet powerful structure allows for an immense web of interconnected knowledge.
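
As a minimal sketch of that triple structure (labels only here; the Q- and P-identifiers Wikidata actually uses are introduced below):

```python
# A Wikidata statement sketched as a bare (subject, predicate, object) triple.
# Labels are used for readability; Wikidata itself keys every item and
# property by Q- and P-identifiers.
statement = ("Insulin", "has part", "amino acid chain")

subject, predicate, obj = statement
print(f"{subject} --[{predicate}]--> {obj}")
```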

Items

These represent real-world objects, concepts, or events, each identified by a unique number prefixed with a 'Q'. For instance, the item for "type 2 diabetes" is Q30216 [6].

Properties

These define the relationships between items and are identified by a number prefixed with a 'P'. The property "medical condition treated" is P2175 [6].

Statements

These are the actual data points, formed by combining an item, a property, and a value. A statement about the drug "Metformin" (Q180912) would use the property "medical condition treated" (P2175) to link to the item "type 2 diabetes" (Q30216), creating a meaningful fact [2][6].
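
As a concrete illustration, that statement can be fetched from the live knowledge graph. The following is a minimal sketch using only Python's standard library against the public query service at https://query.wikidata.org/sparql, with the identifiers cited above; the `wd:`/`wdt:` prefixes and the label service are built into the endpoint.

```python
import json
import urllib.parse
import urllib.request

# Ask the Wikidata Query Service which conditions Metformin (Q180912) is
# stated to treat via the "medical condition treated" property (P2175).
query = """
SELECT ?condition ?conditionLabel WHERE {
  wd:Q180912 wdt:P2175 ?condition .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "json"}
)
# The query service asks clients to identify themselves with a User-Agent.
req = urllib.request.Request(url, headers={"User-Agent": "wikidata-demo/0.1"})
with urllib.request.urlopen(req) as response:
    results = json.load(response)

for row in results["results"]["bindings"]:
    print(row["conditionLabel"]["value"])  # e.g. "type 2 diabetes"
```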

Building the Biomedical Knowledge Graph

Wikidata's potential is perhaps most vividly illustrated in the life sciences, where it has been seeded with content from authoritative public resources. Through a combination of community curation and automated programs called "bots," vast quantities of biological data have been integrated into a single, queryable knowledge graph [1].

Biomedical Data Integration in Wikidata

Source: eLife 2020 Review [1]

| Data Type | Count in Wikidata | Key Data Sources |
|---|---|---|
| Genes & Proteins | Over 1.1 million genes & 940k proteins | NCBI Gene, Ensembl, UniProt |
| Chemical Compounds | Over 150,000 | PubChem, RxNorm, IUPHAR |
| Diseases | Over 16,000 | Human Disease Ontology, Monarch Disease Ontology |
| Genetic Variants | ~1,500 (clinically relevant) | CIViC (Clinical Interpretations of Variants in Cancer) |
| Pathways | ~3,000 | Reactome, WikiPathways |
| Scientific Publications | Over 20 million | WikiCite project |

FAIR Principles in Action

This integration adheres to the FAIR principles—making data Findable, Accessible, Interoperable, and Reusable. By breaking down the walls of data silos, Wikidata combines the strengths of both centralized and distributed approaches to data management [1].

A Deep Dive: Measuring and Improving Wikidata's Quality

With such rapid, bot-assisted, and community-driven growth, a critical question arises: Can we trust the data? This concern was the focus of a comprehensive scientific study that analyzed the quality of Wikidata's massive knowledge graph.

The Experimental Methodology

The researchers developed a framework to detect low-quality statements using three key indicators:

1. Community Updates: Statements that were permanently removed by editors, suggesting they were inaccurate or poorly sourced.

2. Deprecated Statements: Statements that were not deleted but flagged with a "deprecated" rank, indicating they are no longer considered valid (see the query sketch after this list).

3. Property Constraint Violations: Statements that break rules defined by the community for how specific properties should be used.
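
Because deprecated ranks are part of the machine-readable graph, the second indicator can be probed directly from the query service. A hedged sketch follows; P2175 is simply the example property from earlier, and the query runs with the same urllib pattern shown above.

```python
# Count deprecated-rank statements for one example property (P2175).
# The p: prefix addresses the statement node itself (rather than its value),
# which is where Wikidata attaches rank, qualifier, and reference metadata.
query = """
SELECT (COUNT(?statement) AS ?deprecatedCount) WHERE {
  ?item p:P2175 ?statement .
  ?statement wikibase:rank wikibase:DeprecatedRank .
}
"""
```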

Results and Analysis

The study provided a nuanced view of Wikidata's reliability:

Key Findings
  • The analysis identified 76.5 million removed statements and 10 million deprecated statements.
  • Despite these numbers, the research concluded that Wikidata is a knowledge graph of increasing quality, actively self-correcting by removing duplicates, fixing modeling errors, and addressing constraint violations.
  • The data's reliability was found to be comparable to, and in some cases superior to, other general-domain knowledge graphs.
Quality Indicators

| Quality Indicator | What It Measures | Implication |
|---|---|---|
| Community Updates | Statements permanently removed by editors. | Reflects community consensus and improving accuracy over time. |
| Deprecated Statements | Statements marked as no longer valid but kept for the record. | Addresses veracity by showing evolving consensus on facts. |
| Constraint Violations | Statements that break predefined property rules. | Provides insight into the consistency and well-formedness of data. |

The Scientist's Toolkit: Key "Reagents" for Wikidata Research

Engaging with Wikidata does not require wet-lab equipment. Instead, the essential tools are software and interfaces that allow researchers to access, manipulate, and contribute to this vast knowledge base.

Wikidata Integrator (WDI)

Primary Function: A Python library to simplify the creation of bots for importing data [1].

Use Case: Automating the upload of new gene data from NCBI into Wikidata.
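
A hedged sketch of reading one item with WDI is below; exact parameter and method names can differ between library versions, so treat this as illustrative rather than canonical.

```python
# pip install wikidataintegrator
from wikidataintegrator import wdi_core

# Load an existing item by its Q-identifier (Metformin, per the text).
# Writing data additionally requires a login object from wdi_login.
item = wdi_core.WDItemEngine(wd_item_id="Q180912")
print(item.get_label(lang="en"))
```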

SPARQL Endpoint

Primary Function: A query service that allows users to ask complex questions of the entire database [1].

Use Case: Finding all proteins associated with a specific disease and the drugs that target them.
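
A sketch of that use case is below. The property identifiers (P2293 "genetic association", P688 "encodes", P129 "physically interacts with") are assumptions added for illustration, not taken from the text; the disease identifier follows the earlier example.

```python
# Proteins linked to a disease, plus drugs that target those proteins.
DISEASE = "wd:Q30216"  # type 2 diabetes, per the identifier cited earlier

query = f"""
SELECT DISTINCT ?proteinLabel ?drugLabel WHERE {{
  ?gene wdt:P2293 {DISEASE} .   # gene genetically associated with the disease
  ?gene wdt:P688 ?protein .     # gene encodes protein
  ?drug wdt:P129 ?protein .     # drug physically interacts with protein
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""
# Run it with the urllib pattern shown earlier, or paste the SELECT query
# into the interactive editor at https://query.wikidata.org.
```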

Wikidata Embedding Project

Primary Function: A new vector database that allows semantic search using natural language [8].

Use Case: Enabling AI models to understand and retrieve Wikidata information more intuitively.

Wikibase REST API

Primary Function: An application programming interface for retrieving and editing item data [3].

Use Case: Building a custom mobile app that displays information from Wikidata.
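
A hedged sketch of fetching one label over the REST API; the route version (v0, v1, ...) depends on the deployment, so check the live documentation at https://www.wikidata.org/wiki/Wikidata:REST_API before building on it.

```python
import json
import urllib.request

# Fetch the English label of one item via the Wikibase REST API.
url = ("https://www.wikidata.org/w/rest.php/wikibase/v1"
       "/entities/items/Q180912/labels/en")
req = urllib.request.Request(url, headers={"User-Agent": "wikidata-demo/0.1"})
with urllib.request.urlopen(req) as response:
    print(json.load(response))  # the label as a JSON string, e.g. "metformin"
```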

Scholia

Primary Function: A webservice that creates visual scholarly profiles for topics, authors, and organizations [2].

Use Case: Generating a detailed, interactive summary of all publications related to a particular pathogen.

Beyond the Life Sciences: The Expanding Universe of Wikidata

While its applications in biomedicine are profound, Wikidata's scope is virtually limitless. It has become a central hub for open structured data across all fields [4]. This is powered by a vibrant, global community of human editors and bot developers who continuously expand and maintain the knowledge graph [1].

Wikidata and Artificial Intelligence

Recent innovations are focusing on the intersection of Wikidata and Artificial Intelligence. The new Wikidata Embedding Project, announced in late 2024, applies vector-based semantic search to Wikidata's nearly 120 million entries. This, combined with support for the Model Context Protocol (MCP), makes this verified knowledge more accessible to large language models (LLMs) through natural language queries. This project positions Wikidata as a high-quality, reliable source for grounding AI models, countering the trend of AI being controlled by a handful of large corporations [8].

Community and Events

The community continues to evolve the platform through events like the annual Data Reuse Days and the Wikidata Workshop at the International Semantic Web Conference (ISWC), where researchers and developers discuss trends, tools, and the future of this collaborative knowledge graph [3][4].


Conclusion: A Collective Brain for the Digital Age

Wikidata represents a bold, ongoing experiment in collective intelligence. It is more than just a database; it is a global platform for sharing knowledge in a way that is open, reusable, and interconnected. By integrating disparate data sources in the life sciences, it empowers researchers to ask questions that were previously impossible to answer efficiently. As it continues to grow and evolve, embracing new technologies like AI, its role as the hidden engine of the internet's knowledge will only become more vital. It provides a powerful model for how we can work together to build a more structured, accessible, and intelligent understanding of our world.

Frequently Asked Questions

How is Wikidata different from Wikipedia?
While Wikipedia contains human-readable articles, Wikidata stores structured data that can be read by both humans and machines. Wikipedia articles often pull data from Wikidata to populate infoboxes and other structured elements.
Can anyone contribute to Wikidata?
Yes, Wikidata is a collaborative project and anyone can create an account to contribute. There are also automated programs called "bots" that help import large datasets from authoritative sources.
How reliable is the data in Wikidata?
Research analyzing the knowledge graph concluded that Wikidata is of increasing quality, maintained through community curation and automated quality checks. The platform actively self-corrects by removing inaccurate statements and flagging deprecated information.

References