The Connected Ideas Project
Tech, Policy, and Our Lives
Ep 24 - From Chaos to Clarity: Organizing Life’s Code in the Genomic Era

As sequencing technologies unlock a wealth of biological data, a lack of standardization threatens to limit its potential—here’s why we must act now

Over the next century, biotechnology is poised to revolutionize how we live, work, and address some of humanity's most pressing challenges. In fact, breakthroughs in biotechnology and related emerging technologies are already allowing scientists to produce targeted cancer therapies, engineer more resilient crops, create sustainable materials, and develop solutions to mitigate environmental pollution.

Photo by NASA on Unsplash

But why now? After decades of scientific progress, why are we just beginning to see biology reshape industries and daily life? The answer lies in the exponential advancements across three core capabilities: our ability to read, write, and edit DNA, the "source code" for life.


The podcast audio was AI-generated using Google’s NotebookLM


This is a guest post by Evan Peikon, who writes a great Substack at Decoding Biology.


Advancements in sequencing technology have significantly enhanced our ability to read DNA, enabling scientists to decode genomes with unprecedented speed and accuracy. Additionally, improvements in DNA synthesis technologies have allowed scientists to write new genetic code, creating novel biological sequences for the first time. Finally, breakthrough discoveries like CRISPR have revolutionized the field of DNA editing, allowing scientists to reprogram genes with high fidelity and at a scale previously unimaginable.

Naturally, each of these technologies carries immense potential while also introducing unforeseen challenges that we need to contend with. In this article, I’m going to focus on the first core capability, sequencing, and the challenges associated with converting biomolecules to bytes. This seems like a natural place to start: just as a child learns to read before they can write, we first need to understand DNA reading (i.e., sequencing) before we can fully appreciate our newfound abilities to write and edit biology.

The Digitization of Biology

Sequencing refers to the reading of DNA or RNA; in this section I’ll focus on the former, though RNA-sequencing technologies are no less influential. DNA sequencing is the process of determining the precise order of nucleotide bases (adenine, thymine, guanine, and cytosine) in a DNA sample. By reading this genetic code, scientists can identify microorganisms, discover disease-causing genes, and infer the biological properties and traits of the bacteria that form the basis of countless research tools used in molecular biology.

At its core, DNA sequencing is a simple process. A researcher begins by obtaining a DNA sample with an unknown genetic sequence. The sample is then prepared, loaded into a DNA sequencer, amplified, and analyzed. Finally, the sequencer generates a digital representation of the DNA sequence (i.e., a string of A's, T's, C's, and G's), offering insights ranging from detecting genetic mutations to identifying pathogens. This ability to read DNA is foundational to modern biology and fuels innovations in personalized medicine, synthetic biology, and alternative food production.
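To make the biomolecules-to-bytes idea concrete, here is a minimal sketch of what that digital output looks like once it leaves the sequencer: a read treated as a plain string of bases, with a couple of quick summary checks. The toy read and helper functions are illustrative only, not part of any particular sequencing pipeline.

```python
from collections import Counter

def base_composition(read: str) -> dict:
    """Fraction of each base (including ambiguous 'N' calls) in a read."""
    counts = Counter(read.upper())
    total = sum(counts.values())
    return {base: counts[base] / total for base in sorted(counts)}

def gc_content(read: str) -> float:
    """Fraction of G and C bases, a common quick sanity check on sequencing output."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read)

# A toy read; a real sequencer emits millions of such strings (typically in FASTQ files,
# which also carry a per-base quality score omitted here).
read = "ATGCGTACGTTAGCNATCG"
print(base_composition(read))
print(f"GC content: {gc_content(read):.2f}")
```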

The modern revolution in DNA sequencing is driven by next-generation sequencing (NGS) technologies, also known as high-throughput sequencing. NGS is responsible for the rapidly plummeting cost of DNA sequencing over the past two decades: because it sequences enormous numbers of fragments in parallel, it has ushered in a new era of biological data collection. Simultaneously, the rapid development of diverse omics data platforms (genomics, epigenomics, transcriptomics, proteomics, metabolomics, and phenomics) has revolutionized our ability to collect biological data on an unprecedented scale. To put this in perspective, Illumina estimates that genomic data alone will require half a zettabyte of storage per year by 2025, roughly 150 times the amount of data generated by Meta and YouTube combined each year.
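For a rough sense of what half a zettabyte means, here is a back-of-envelope calculation. The bytes-per-genome figure is an assumed round number for illustration only.

```python
# Back-of-envelope scale check for the half-zettabyte-per-year estimate cited above.
# The per-genome figure is an assumption; raw output for a human genome at ~30x
# coverage is commonly quoted on the order of ~100 GB.
ZETTABYTE = 1e21              # bytes
BYTES_PER_GENOME = 100e9      # assumed ~100 GB of raw data per sequenced genome

yearly_genomic_data = 0.5 * ZETTABYTE
genome_equivalents = yearly_genomic_data / BYTES_PER_GENOME
print(f"~{genome_equivalents:.0e} genome-equivalents of raw data per year")  # ~5e+09
```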

Photo by dole777 on Unsplash

The breadth of data that can now be collected enables researchers to explore complex questions about life’s processes, from understanding disease mechanisms to improving agricultural productivity. However, as the challenge of "reading" biology has been largely addressed, the scientific community now faces a new obstacle: organizing and standardizing the wealth of data in ways that ensure its safe, secure, and responsible reuse.

Why Standardizing Biological Data Matters

The importance of proper data standardization cannot be overstated. Well-standardized data not only enables researchers to repurpose existing biological data to answer new experimental questions, but also allows for more optimal training of artificial intelligence models for bioinformatics applications (less garbage in, less garbage out). Moreover, standardization directly impacts the reproducibility of research, a cornerstone of scientific progress.

Beyond scientific benefits, there is an economic incentive to standardizing data: well-indexed and standardized data allows researchers to reuse and reproduce findings without the need for costly new data collection efforts. For example, datasets stored with comprehensive metadata, supplementary information, and standardized protocols allow other scientists to validate findings and extend them to new experimental contexts. This not only enhances the reliability of scientific work but also enables the reuse of previously collected data to answer new questions, in turn fostering innovation while saving time and resources.
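As a concrete (and entirely hypothetical) illustration, a well-annotated dataset might ship with a metadata record along these lines; the field names are invented for this sketch rather than taken from any established standard.

```python
import json

# A hypothetical example of the kind of comprehensive metadata that makes a dataset
# reusable. Field names and values are illustrative placeholders.
dataset_metadata = {
    "sample_source": "Mus musculus, liver tissue",
    "assay_type": "RNA-seq",
    "instrument": "short-read sequencer, paired-end 2 x 100 bp",
    "experimental_protocol": "link or DOI to the full wet-lab protocol (placeholder)",
    "processing_pipeline": "adapter trimming -> alignment -> gene-level counts",
    "software_versions": {"aligner": "x.y.z", "quantifier": "a.b.c"},
    "collection_date": "2024-03-15",
    "contact": "lab-contact@example.org",
}

print(json.dumps(dataset_metadata, indent=2))
```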

Challenges in Data Standardization

Despite its importance, data standardization remains an uphill battle. Biological research is inherently variable, with differences in experimental design, data collection methods, and processing techniques contributing to inconsistencies across datasets. For example, even small procedural variations, such as differences in how a sample is pipetted or the rate at which collagenase is stirred, can result in significant variability. This is further compounded by the lack of universal metadata standards, which often leads to datasets being described in ways that are incomplete, ambiguous, or incompatible with one another.

The fragmented nature of biological data storage further complicates matters. Researchers often use different databases with varying quality control measures, formats, and terminologies. This lack of interoperability makes it challenging to combine data across repositories, reducing its usefulness. Moreover, the absence of comprehensive metadata can make it difficult to interpret or contextualize the data. Without information about the source of the sample, the experimental protocol, or the technical specifications of the instruments used, datasets lose much of their scientific value.
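To see why incompatible terminologies bite in practice, consider a hypothetical synonym table of the kind curators build when merging repositories: the same organism can be recorded under several free-text labels, and anything the table does not recognize has to be flagged for manual review.

```python
# A hypothetical, hand-built synonym table mapping free-text labels onto a
# controlled vocabulary, the kind of curation work harmonization requires.
CONTROLLED_ORGANISM_TERMS = {
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
    "house mouse": "Mus musculus",
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
}

def harmonize_organism(raw_label: str) -> str:
    """Map a free-text organism label onto a controlled term, or flag it for review."""
    key = raw_label.strip().lower()
    return CONTROLLED_ORGANISM_TERMS.get(key, f"UNRESOLVED({raw_label})")

print(harmonize_organism("Mouse"))        # -> Mus musculus
print(harmonize_organism("M. musculus"))  # -> UNRESOLVED(M. musculus), needs curator review
```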

The problem extends beyond individual research practices to structural issues in the broader biotechnology ecosystem. Clear and consistent federal guidelines for data standardization are lacking, leaving researchers to navigate a patchwork of standards that often fail to meet the needs of modern science. Addressing these issues requires coordinated efforts to harmonize standards, improve metadata quality, and ensure that biological data is findable, accessible, interoperable, and reusable—a set of principles collectively known as FAIR.
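A minimal sketch of what a FAIR-minded completeness check could look like is below; the required fields are assumptions chosen for illustration, not an official FAIR checklist.

```python
# A minimal metadata completeness check in the spirit of the FAIR principles.
# The required fields below are illustrative assumptions.
REQUIRED_FIELDS = {
    "sample_source", "experimental_protocol", "assay_type",
    "instrument", "processing_pipeline", "license", "persistent_identifier",
}

def missing_fields(metadata: dict) -> set:
    """Return which required fields are absent or empty in a metadata record."""
    return {field for field in REQUIRED_FIELDS if not metadata.get(field)}

record = {"sample_source": "Mus musculus, liver", "assay_type": "RNA-seq"}
gaps = missing_fields(record)
print("Reusable as-is" if not gaps else f"Missing: {sorted(gaps)}")
```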

The Role of Policy and Collaboration

Addressing the challenges of data standardization in biology demands a cohesive effort, where policy and collaboration play central roles. Government agencies, industry leaders, academic institutions, and international organizations must come together to create and enforce guidelines that prioritize data interoperability and privacy. Effective policies would ensure that biological data is consistently recorded, reported, and stored across platforms, enabling seamless integration of datasets. This, in turn, would facilitate the development of advanced AI models tailored to biological research and enhance the overall utility of biological data as a strategic resource.

A dual approach is essential: top-down policy interventions and bottom-up collaboration. At the federal level, policies can establish a robust framework for consistent data practices. For example, mandating standardized metadata formats and reporting protocols for federally funded research would enhance data quality and encourage compatibility across databases. Such initiatives would not only address technical inconsistencies but also set a precedent for private and international stakeholders to adopt similar standards.

Collaboration across diverse sectors is equally crucial. Standardization is a complex challenge that cannot be solved in isolation. Input from various stakeholders ensures that guidelines are comprehensive and practical. The biotechnology industry, for instance, has a vested interest in standardization because it underpins the success of AI-driven discoveries and the commercialization of new products. Academic researchers, on the other hand, benefit from improved reproducibility and the ability to repurpose existing data. Global collaborations further amplify these efforts, enabling data sharing across borders and fostering innovation on an international scale.

Building the infrastructure to support data standardization is another critical area where policy and collaboration intersect. This infrastructure encompasses hardware and software for storing, transferring, and managing data, as well as the expertise needed to establish and maintain these systems. Automated laboratories, for example, can produce high-quality, standardized data with minimal variability, but they require significant investments in equipment, training, and operational protocols. Policies and collaborative initiatives should ensure that all research environments—from small academic labs to large industrial facilities—have access to the tools and resources needed to participate in a standardized ecosystem.

By addressing these structural and systemic challenges, the scientific community can unlock the full potential of biological data. Transforming fragmented datasets into a unified, strategic asset will accelerate scientific discovery, enhance research reproducibility, and position biology as a cornerstone of innovation in the 21st century. Such progress will not only benefit researchers but also ensure that biological data serves as a critical resource for addressing global challenges.

NASA GeneLab: A Model for Standardization and Collaboration

Imagine a world where biological data standardization is no longer a barrier but a foundational pillar of our research ecosystem. Researchers collecting data would meticulously label their datasets with detailed metadata, encompassing the source of the biological material, experimental protocols, and technical specifications. Databases would use standardized formats and interoperable metadata, allowing future researchers to search for and access datasets with ease. Instead of manually reformatting data or navigating disparate repositories, scientists could focus on deriving insights and solving complex biological questions. The research process would become more efficient, transparent, and collaborative. We are still a long way from making this vision a reality, though it already exists in small microcosms.

One exemplary model of data standardization in practice is NASA GeneLab, a pioneering open-access repository for omics data from spaceflight and related experiments. GeneLab demonstrates how proper standardization and collaboration can transform data usage. Established as part of NASA’s efforts to explore the effects of space on biology, GeneLab hosts a comprehensive collection of datasets spanning genomics, transcriptomics, proteomics, and metabolomics, carefully curated with extensive metadata.

Photo by NASA on Unsplash

Unlike databases such as the Gene Expression Omnibus (GEO), which organizes data by individual studies and often provides limited metadata, GeneLab’s Open Science Data Repository adopts a study-agnostic approach. Researchers can filter datasets by data source, data type, project type, assay type (e.g., RNA-seq, ChIP-seq), organism, tissue type, and additional factors, making it significantly easier to find relevant data. Each dataset in GeneLab is enriched with metadata detailing the experimental protocols, data processing pipelines, and links to other studies that have used the same data. This meticulous organization ensures that even datasets collected decades ago remain valuable for new research endeavors.
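The sketch below mimics that study-agnostic filtering with a small metadata table; the column names and values are hypothetical and do not reflect GeneLab's actual schema or API, but they show why consistent, repository-wide fields make cross-study queries trivial.

```python
import pandas as pd

# A stand-in for a repository-wide metadata index. Columns and values are hypothetical.
metadata_index = pd.DataFrame([
    {"dataset": "DS-001", "organism": "Mus musculus", "assay_type": "RNA-seq",
     "tissue": "liver", "project_type": "spaceflight"},
    {"dataset": "DS-002", "organism": "Homo sapiens", "assay_type": "ChIP-seq",
     "tissue": "blood", "project_type": "ground"},
    {"dataset": "DS-003", "organism": "Mus musculus", "assay_type": "RNA-seq",
     "tissue": "muscle", "project_type": "spaceflight"},
])

# Because every dataset is described with the same fields, one query spans all studies.
hits = metadata_index[(metadata_index.organism == "Mus musculus")
                      & (metadata_index.assay_type == "RNA-seq")
                      & (metadata_index.project_type == "spaceflight")]
print(hits[["dataset", "tissue"]])
```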

Lessons from GeneLab: A Researcher’s Perspective

NASA GeneLab’s approach highlights the power of collaboration between government, industry, academia, and citizen scientists. By adhering to community-established standards and protocols, GeneLab not only enables reproducible research but also accelerates scientific discovery. For instance, data generated from ground, high-altitude, and spaceflight experiments can be reanalyzed in the context of newer findings, or recombined in new ways, unlocking insights into how space-associated stressors like microgravity, ionizing radiation, and altered light-dark cycles influence biological systems. GeneLab’s data repository also enables the training of more accurate AI models by providing high-quality, standardized data, setting a benchmark for other repositories to follow (this project is currently being undertaken by GeneLab’s AI/ML analysis working group).

Drawing from personal experience, mining data from GeneLab feels fundamentally different compared to traditional databases like GEO. In GEO, finding relevant data requires sifting through individual studies and deciphering limited metadata, a time-consuming and often frustrating process. In contrast, GeneLab’s user-friendly interface and extensive metadata make it possible to locate datasets that meet specific criteria without delving into the original studies. This model not only simplifies data reuse but also enhances the reproducibility and impact of research.

A Call to Action: Moving Toward Standardization

While NASA GeneLab offers a glimpse of what is possible, achieving widespread standardization requires a concerted effort from the broader scientific community. Individual researchers must commit to labeling their data with comprehensive metadata, documenting data processing methods, and adhering to established protocols. Simultaneously, advocacy for top-down policies is essential to ensure that data standardization practices align with the needs of the research community.

As scientists, we often acknowledge the importance of data standardization but fall short of implementing it in our work. To address this, we need to embrace both top-down and bottom-up approaches. By holding ourselves to higher standards and advocating for systemic change, we can build a future where biological data is not just abundant but also accessible, interoperable, and reusable. This vision, once realized, will not only advance scientific discovery but also cement biological data as a cornerstone of innovation in the 21st century.

Cheers,

-Evan


