Hello, I’m Ylli Bajraktari, CEO of the Special Competitive Studies Project. In this edition of our newsletter, SCSP's Venkat Somala, Karina Barao, and Ananmay Agarwal discuss data’s role in unlocking scientific potential and strategies for securing American scientific leadership.
From Lab to Leadership: How Data Can Keep America Ahead
The United States was once unchallenged in its scientific and technological prowess. That is no longer true. Over at least the past decade, China has been racing to catch up, in part by leveraging data as an asset. Beijing recognizes that the future of scientific discovery lies not just in traditional research capabilities, but in the systematic collection, curation, storage, and AI analysis of massive scientific datasets. It is, accordingly, doing all it can to accumulate these datasets and invest in necessary infrastructure to utilize its data. As SCSP wrote in our National Data Action Plan, data is the foundational building block for the digital world that touches every dimension of our lives. In order to ensure our nation remains at the forefront of innovation and can harness its gains to preserve our economic and national security, we need a concerted effort to unleash data for our scientific advantage.
Today's greatest scientific challenges – discovering life-saving therapeutics, achieving energy independence, creating new materials, and ensuring agricultural resilience – are tremendously complex, encompassing billions of variables, intricate system interactions, and patterns that emerge only across massive and diverse datasets. AI capabilities hold the promise of solving these challenges by identifying patterns invisible to traditional analysis, and simulating complex system behaviors that would be impossible or infeasible at scale through experimentation alone. These capabilities can then identify hypotheses and potential solutions to pressing scientific challenges.
China hopes to dominate in scientific data in response to the promise AI capabilities hold. As President Xi Jinping stated, "Big data is the new oil. Whoever controls big data technologies will control the resources for development and have the upper hand." This is not rhetoric – China has developed a comprehensive data-driven scientific research and discovery strategy, which could reshape the global scientific landscape.
The effect could be the erosion of U.S. scientific research advantages. As global competition intensifies, the United States must change its approach. Scientific leadership has practical benefits such as economic incentives and solutions to societal problems like ridding the world of diseases. Importantly, leadership in science also determines the direction of research and discoveries, promoting democratic values and prosperity over authoritarian principles. To sustain U.S. scientific leadership, the incoming Trump Administration should prioritize a robust, secure, and innovation-driven data strategy that unleashes the full potential of scientific discovery.
China’s Data-Driven Research Advantages - A Threat to U.S. Scientific Leadership
Beijing is collecting and hoarding data, investing in AI infrastructure to scale the value of data, and changing the regulatory landscape to support its data-driven scientific research strategies.
Genomic Initiatives: As one example of its data vision, China recognizes the value of large genomic datasets (entire sequences of DNA) representing diverse populations to drive medical discoveries, including in precision medicine, and increase commercial value. In 2016, China announced a $9 billion, 15-year initiative to accumulate, analyze, and sequence genomic data from home and abroad in an effort to advance precision medicine.
China possesses significant access to domestic genomic data due to the size of its population, and Beijing has tightened regulations to control access to this resource. To increase diversity in its collection of genetic datasets, Chinese leaders also are acquiring data through both legal and illicit means. For example, in 2013, the Beijing Genomics Institute (BGI), now considered the world's largest genetic research organization, acquired Complete Genomics, a U.S. company with DNA sequencing data on millions of Americans. More recently, Chinese-linked hacking groups have been implicated in major breaches targeting healthcare data and biomedical research institutions globally.
These massive life science datasets are being used to implement projects including the Chinese Millionome Database (CMDB) and the China National GeneBank Database (CNGBdb), which place China at the forefront of precision medicine. CMDB contains 9.04 million single nucleotide variants from whole-genome sequencing data of 141,431 unrelated healthy Chinese individuals, making it the most comprehensive Chinese population genome database. CNGBdb, established in 2011, houses over 17,000TB of scientific data from over 600 labs.
Investments in Infrastructure: China’s investments in AI, supercomputing, and data infrastructure demonstrate the scale of its commitment to its scientific ambitions. China has pledged to spend almost $52 billion on science R&D in 2024, focusing heavily on AI and supercomputing, which isn't just about raw computing power – it's about creating the capability to extract insights from their massive datasets faster than any competitor. Indeed, China stopped publicly reporting its supercomputer capabilities, leading experts to speculate that it has developed some of the world’s fastest high-performance computing systems. With respect to data infrastructure, China’s $6.1 billion "Eastern Data, Western Computing" project creates a national network of data centers that can share and extract insights from data at scale on a national level. China has also launched initiatives including the "Top 10 Frontiers in AI for Science" to enable seamless collaboration amongst scientists in the fields of AI and basic research to explore the new future of scientific intelligence, and create an "AI Einstein," designed to independently discover novel scientific principles.
At the same time, Beijing has been careful to safeguard its data. In 2018, it introduced the "Measures for the Management of Scientific Data" regulation. The rule aims to standardize the collection, sharing, and use of scientific data to support innovation, and introduce strict state control over this data and its dissemination.
The Stakes for American Science
The United States must prioritize data as a strategic resource to maintain its scientific leadership. U.S. innovation power depends on a robust data ecosystem, which can be used to unlock the next generation of scientific breakthroughs in fields like genomics, drug discovery, and material sciences.
The pace of scientific progress is increasing, especially in the life sciences. Atomwise's AtomNet platform demonstrated success in discovering novel drug candidates across 318 targets, identifying structurally novel hits for 235 targets and averaging seven distinct bioactive compounds per target – far exceeding traditional high-throughput screening methods. AstraZeneca is utilizing AI tools to analyze over 2 million genomes by 2026, accelerating drug discovery research. A recent study on AI in materials research found that AI-assisted researchers in a U.S. R&D lab discovered 44% more materials and boosted their patent filings by 39%, demonstrating AI's ability to not just speed up existing processes but actively generate novel ideas and solutions. Exscientia has developed an AI-designed drug for obsessive-compulsive disorder that reached clinical trials in just 12 months, highlighting the transformative potential of data in drug development. Google DeepMind’s AlphaFold has revolutionized protein structure prediction, accurately modeling over 200 million proteins – including the entire human proteome – enabling a deeper understanding of disease mechanisms and accelerating drug discovery. The open-sourced AlphaFold 3 goes beyond proteins to predict the structure and interactions of all life’s molecules, including DNA, RNA, and small molecules, with unprecedented accuracy, further accelerating drug discovery research. United States federal programs, such as the National COVID Cohort Collaborative (N3C), have showcased how centralized data resources can yield actionable insights in healthcare and beyond. A research team supported by the National Institutes of Health leveraged machine learning techniques and electronic health records from N3C to identify characteristics of people with long COVID, demonstrating the potential of data in understanding and addressing complex health challenges.
The United States has pioneered many of these advancements. Nevertheless, it should innovate much more and much quicker given its significant advantages: potential access to enormous amounts of robust data from its rich ecosystem of global technology companies, world-class universities and federal labs, as well as the largest data analytics market. These assets create a fertile environment for innovation. However, the United States still lacks a cohesive strategy for data-driven scientific research and innovation to fully harness these competitive advantages.
What Winning Should Look Like
The United States must prioritize data for scientific advancement by protecting its data, incentivizing and enabling data sharing, diffusing data expertise across the nation, and collaborating with partners and allies on data infrastructure and capabilities.
Strengthening U.S. Data Sovereignty
Action: Develop a "Made in America" Data Standards Framework prioritizing national security, innovation, and individual rights.
Implementation: Mandate federal agencies to use secure and transparent data practices aligned with national security priorities. Include explicit restrictions on data-sharing agreements with countries of concern, particularly China, and incentivize compliance among private sector partners.
Outcome: Enhance trust in domestic data systems, counter foreign data exploitation, and showcase U.S. data leadership globally.
Streamlining Data Sharing While Prioritizing National Security
Action: Create a "Secure Data Exchange Framework" to promote cross-sectoral data sharing with stringent controls to protect sensitive information.
Implementation: Partner with industry leaders to develop secure, federated data-sharing platforms. Ensure compliance with national security objectives by blocking critical scientific data flows to adversaries.
Outcome: Accelerate scientific progress in AI, energy, and biotech while safeguarding intellectual property and critical data assets.
Empowering Rural Communities with Data Hubs
Action: Create regional data hubs in rural and economically disadvantaged areas to harness the power of data for local development.
Implementation: Provide federal grants and tax incentives to encourage tech companies and data-based enterprises to establish operations in rural communities. Build public-private partnerships to implement data training programs aligned with the specific needs of local industries, including agriculture, manufacturing, and energy.
Outcome: Drive economic growth and innovation in underserved areas, enabling rural communities to actively participate in and benefit from the advancements of the data economy.
Establishing Secure Research Enclaves with Allies
Action: Create secure environments for collaborative research between the United States and our core allies.
Implementation: Develop facilities for joint scientific research under strict access controls. Integrate allied personnel into U.S. national laboratories while implementing reciprocal arrangements abroad.
Outcome: Promote innovation with allies while ensuring sensitive research remains protected.
Strengthening Screening for International Research Collaborations
Action: Establish clearer guidelines for evaluating foreign partnerships in sensitive research fields.
Implementation: Establish an inter-agency task force to oversee joint research initiatives involving critical technologies. Require universities and labs to implement security reviews for research projects with foreign entities.
Outcome: Prevent adversaries from exploiting open science practices while maintaining collaborative opportunities with trusted allies.
Establishing an Allied Open Science Framework
Action: Promote the co-development of open science standards among trusted allies to facilitate secure data and knowledge sharing.
Implementation: Create a multilateral agreement with allies and partners to share non-sensitive research data and establish joint research priorities in fields like AI, quantum computing, and space exploration. Host an annual U.S.-led "Innovation Summit" to coordinate these efforts and showcase U.S. leadership.
Outcome: Foster deeper scientific ties with allies while countering China’s assertiveness in international scientific collaboration fora.
Expanding U.S. Leadership in International Standards-Setting Bodies
Action: Increase U.S. participation and leadership in global technology standardization efforts.
Implementation: Increase U.S. Government support for U.S. firms to participate in these standards-setting processes to ensure U.S. perspectives are fully represented. Deploy senior diplomats and technical experts to international standards-setting organizations such as the International Organization for Standardization (ISO), International Telecommunication Union (ITU), and the Institute of Electrical and Electronics Engineers (IEEE) to champion American values and interests. Form a stronger coalition of allies to counter China's influence in shaping international technology standards. Host standards-setting activities on U.S. soil with allied and partner nations. Understand that if we abandon these forums, China will fill the gap and set the global standards.
Outcome: Cement U.S. leadership in setting the rules for emerging technologies.
By embracing these recommendations, the United States can harness the power of data for scientific advancement, maintain its global leadership, and ensure a future where technological innovation benefits all Americans and U.S. allies.