Genome Assembly Standards for EBP-Affiliated Projects
EBP Assembly Standards
Version 7.0 | January 2026
The Earth BioGenome Project assembly standards define recommended minimum quality targets for reference genomes contributing to EBP goals. These standards are updated regularly to reflect advances in sequencing, assembly, curation, and data submission practices.
Summary
The Earth BioGenome Project (EBP) Assembly Standards document defines recommended quality standards, submission guidelines, and best practices for generating high-quality reference genomes across the tree of life. The standards are regularly updated to reflect advances in sequencing and assembly technologies and currently define different assembly targets for standard eukaryotic genomes, low-input species, ultra-low-input organisms, and unculturable single-cell eukaryotes.
For most eukaryotic species with sufficient DNA, the recommended minimum standard is 6.C.Q40, representing megabase-scale contig continuity, chromosome-scale scaffolding, and high base accuracy. Additional quality targets include high BUSCO completeness, low false duplication rates, and strong chromosomal assignment. The document also outlines standards for low-input and ultra-low-input assemblies, including emerging methods such as PiMmS and whole genome amplification approaches for sequencing tiny organisms.
The report further details recommended quality control practices, including contaminant removal, organellar genome separation, haplotype identification, chromosome naming, and telomere completeness. It also provides guidance for diploid and phased genome assemblies, including naming conventions such as .hap1, .hap2, .mat, and .pat.
In addition, the document outlines standards for submitting assemblies and raw sequencing data to INSDC databases (GenBank, ENA, and DDBJ), explains the use of BioProject hierarchies, and introduces TOLIDs (Tree of Life IDs) for standardized sample and assembly tracking across biodiversity genomics initiatives.
Overall, the standards aim to support the generation of accurate, open, and reusable reference genomes that enable downstream research in biodiversity science, conservation, evolution, agriculture, biotechnology, and comparative genomics while continuing to evolve alongside rapidly improving sequencing technologies.
EBP Assembly Standards At-a-Glance
BUSCO >90%
<5% false duplications
90% chromosomal assignment
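As a rough illustration, the three at-a-glance targets can be expressed as a simple threshold check. This is a minimal sketch, not an official EBP tool: the class and function names are hypothetical, and treating "90% chromosomal assignment" as "at least 90%" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AssemblyMetrics:
    """Hypothetical container for the three at-a-glance metrics (all in percent)."""
    busco_complete_pct: float
    false_duplication_pct: float
    chromosomal_assignment_pct: float


def check_at_a_glance(m: AssemblyMetrics) -> dict:
    """Return a pass/fail flag for each at-a-glance target."""
    return {
        "busco": m.busco_complete_pct > 90.0,                # BUSCO >90%
        "false_duplications": m.false_duplication_pct < 5.0,  # <5% false duplications
        # Assumption: the 90% chromosomal assignment target is inclusive.
        "chromosomal_assignment": m.chromosomal_assignment_pct >= 90.0,
    }


if __name__ == "__main__":
    example = AssemblyMetrics(96.2, 1.3, 98.5)  # made-up values for illustration
    print(check_at_a_glance(example))
```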
Table of Contents
What Standard Applies to My Species?
What are the Core Assembly Metrics?
Additional Quality Requirements
INSDC Submission Requirements
Diploid and Polyploid Assemblies
Tools for Measuring Assembly Quality
References and Appendices
Frequently Asked Questions
1. What Standard Applies to My Species?
2. What are the Core Assembly Metrics?
Figure 1: Overview of the INSDC BioProject, assembly, and raw data submission structure. For polyploid eukaryotes, add haplotype assembly BioProjects and data as required; likewise for multiple prokaryotic assemblies, e.g. Wolbachia. Eukaryotic cobionts require their own TOLID and BioProject. Umbrella projects can be linked and stacked to reflect their relationships. If several individuals of the same species are assembled and released by the same project, their data BioProjects can be added to the species umbrella BioProject.
Frequently Asked Questions
1) What is an EBP-quality assembly?
An EBP-quality assembly is a genome assembly that meets quality targets agreed by the Earth BioGenome Project community, including high completeness, low error rates, strong contiguity, and chromosome-scale scaffolding where possible.
2) What does 6.C.Q40 mean?
6.C.Q40 describes an assembly standard with megabase-scale contig continuity, chromosome-scale scaffolding, and an estimated base accuracy of Q40, or fewer than 1 error in 10,000 bases.
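The Q value is a Phred-scaled quality score, Q = -10 * log10(p), where p is the per-base error probability. The sketch below shows the conversion in both directions; the function names are illustrative and do not come from any EBP tool.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    p = qv_to_error_rate(40)
    # Q40 corresponds to about one error per 10,000 bases.
    print(f"Q40 error rate: {p:.6f}")
    print(f"Expected errors per megabase at Q40: {p * 1_000_000:.0f}")
```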
3) What standards apply to low-input genomes?
For species with limited DNA, EBP recommends 5.C.Q40 where possible. For ultra-low-input samples, such as some meiofauna, 5.6.Q40 may be an appropriate minimum target.
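The three-part codes above (6.C.Q40, 5.C.Q40, 5.6.Q40) can be unpacked programmatically. The sketch below assumes that the first field is the log10 of the minimum contig N50 in base pairs (so 6 means megabase-scale), that the second field is either C for chromosome-scale or the log10 of the minimum scaffold N50, and that the third is the minimum Q value. The parser and its names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssemblyStandard:
    contig_n50_min_bp: int              # e.g. 10**6 for "6"
    scaffold_n50_min_bp: Optional[int]  # None when the target is chromosome-scale
    chromosome_scale: bool
    min_qv: int


# Assumed layout: <contig exponent>.<C or scaffold exponent>.Q<quality value>
_STANDARD_RE = re.compile(r"^(\d+)\.(C|\d+)\.Q(\d+)$")


def parse_standard(code: str) -> AssemblyStandard:
    """Parse a code such as '6.C.Q40' into its numeric targets."""
    match = _STANDARD_RE.match(code)
    if match is None:
        raise ValueError(f"Unrecognised assembly standard: {code!r}")
    contig_exp, scaffold, qv = match.groups()
    chromosome_scale = scaffold == "C"
    return AssemblyStandard(
        contig_n50_min_bp=10 ** int(contig_exp),
        scaffold_n50_min_bp=None if chromosome_scale else 10 ** int(scaffold),
        chromosome_scale=chromosome_scale,
        min_qv=int(qv),
    )


if __name__ == "__main__":
    for code in ("6.C.Q40", "5.C.Q40", "5.6.Q40"):
        print(code, parse_standard(code))
```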
4) What are the requirements for diploid assemblies?
Diploid assemblies should clearly identify the primary assembly and any alternate haplotypes. Where possible, phased assemblies should use consistent naming such as .hap1 and .hap2, or .mat and .pat for trio-phased assemblies.
5) How should assemblies be submitted to INSDC?
Assemblies should be submitted openly to INSDC through GenBank, ENA/EMBL-EBI, or DDBJ. Raw sequencing data should also be submitted and linked to the relevant BioProject. EBP-affiliated projects should link their top-level BioProject to the EBP umbrella BioProject, PRJNA533106.
Linking your BioProject to the EBP umbrella allows assemblies to be automatically recognized and tracked by GoaT as EBP-associated genomes, helping ensure your project receives proper attribution for submitted assemblies.
6) What is a TOLID?
A TOLID, or Tree of Life ID, is a short standardized identifier used for samples and assemblies in Tree of Life genome projects. It helps track species, individuals, and assemblies consistently across large-scale biodiversity genomics efforts.
7) How do you generate a TOLID?
TOLIDs are assigned through the Tree of Life ID system hosted by the Wellcome Sanger Institute. They are used for samples and assemblies across large-scale biodiversity genomics projects such as the Tree of Life Programme, the Earth BioGenome Project, and related initiatives.
TOLID structure
A TOLID typically follows this format:
<clade><gen><spec><ind>.<assembly>
Example:
ilAlcRepa1.1
Which can represent:
i = insect
l = Lepidoptera
AlcRepa = Alcis repandata
1 = individual 1
.1 = assembly version 1
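The structure above can be split into its parts with a regular expression. This is a minimal sketch based only on the example shown: the pattern assumes a one- or two-letter lowercase clade prefix, capitalised genus and species abbreviations, an individual number, and an optional assembly version. It is not the official validation rule of the TOLID system, and the names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tolid:
    clade_prefix: str                # e.g. "il" (insect, Lepidoptera)
    genus_abbrev: str                # e.g. "Alc"
    species_abbrev: str              # e.g. "Repa"
    individual: int                  # e.g. 1
    assembly_version: Optional[int]  # e.g. 1, or None if no ".<assembly>" suffix


# Assumed pattern, inferred from the example ilAlcRepa1.1
_TOLID_RE = re.compile(r"^([a-z]{1,2})([A-Z][a-z]+)([A-Z][a-z]+)(\d+)(?:\.(\d+))?$")


def parse_tolid(tolid: str) -> Tolid:
    """Split a TOLID string into its components."""
    match = _TOLID_RE.match(tolid)
    if match is None:
        raise ValueError(f"Not a recognised TOLID: {tolid!r}")
    prefix, genus, species, individual, version = match.groups()
    return Tolid(
        clade_prefix=prefix,
        genus_abbrev=genus,
        species_abbrev=species,
        individual=int(individual),
        assembly_version=int(version) if version is not None else None,
    )


if __name__ == "__main__":
    print(parse_tolid("ilAlcRepa1.1"))
```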