Report on Assembly Standards

Version 6.0 - September 2024

To accompany the standards, the EBP provides Sequencing and Assembly Recommendations.

The standards in this version have changed from those in the original EBP paper (1) and previous versions (July 2019, July 2020, December 2020, March 2021, May 2023), reflecting progress in sequencing and assembly technology. This is a living document, which we aim to revise at least annually.

1. Quantitative assembly standards:

The EBP sequencing and assembly standards committee proposes different standards for three groups of organisms. The three groups are:

1. Eukaryotic species for which sufficient DNA and tissue are available. For these species we propose a minimum reference standard of 6.C.Q40, i.e. megabase N50 contig continuity and chromosomal scale N50 scaffolding, with less than 1/10,000 error rate (see table from the VGP flagship assembly paper (2) in Appendix A for notation and further information). For species with chromosome N50 smaller than a megabase this will be C.C.Q40.
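To make the notation concrete, the following Python sketch computes N50 from a list of sequence lengths and converts between a Phred-scaled quality value (QV) and a per-base error rate, so that Q40 corresponds to an error rate of 1/10,000. The function names are illustrative only and are not part of any EBP tool. Reading the leading digit as the order of magnitude of the contig N50 is inferred from the examples in this document (6 for megabase, 5 for >100 kb).

```python
import math


def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def qv_to_error_rate(qv):
    """Phred-scaled QV to per-base error rate (Q40 -> 1e-4)."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(rate):
    """Per-base error rate to Phred-scaled QV (1e-4 -> 40)."""
    return -10 * math.log10(rate)


def continuity_exponent(n50_bp):
    """Order of magnitude of an N50 given in base pairs
    (assumption: this is the digit used in the x.y.Q notation)."""
    if n50_bp < 1:
        raise ValueError("N50 must be a positive number of base pairs")
    return len(str(int(n50_bp))) - 1
```

For example, a contig N50 of 1,500,000 bp gives an exponent of 6, which is the contig part of 6.C.Q40.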

Alongside the contiguity and error rate goals we propose the following additional criteria from the table:
- < 5% false duplications
- > 90% kmer completeness
- > 90% sequence assigned to candidate chromosomal sequences
- > 90% single copy conserved genes (e.g. BUSCO) complete and single copy (*)
- > 90% transcripts from the same organism mappable
Links to standard tools for measuring these metrics are provided in Appendix B below.
While we believe that these are achievable goals for most, perhaps ultimately all, species from which large enough, high quality samples can be obtained, we recognise that for many reasons (e.g. sample quality, very large genomes, polyploidy, cost expediency) they may not be met in the first instance. Interim references that do not meet the standard can be very useful and should be submitted to INSDC and valued. However, there should be a continuing EBP goal to revisit them and bring them up to the target standard, as that becomes practical.
(*) The BUSCO requirement is a useful target, but can be relaxed in two circumstances for biological rather than technical reasons: first the completeness may be below 90% for taxa that are strongly divergent from those for which the closest BUSCO set was established; second for polyploids or incompletely rediploidized species for which many BUSCO genes may be present in more than one copy, so marked as duplicate by BUSCO.
Recent developments have made yet another approach possible for this organism group: the generation of complete telomere-to-telomere (T2T) assemblies. These are usually based on high quality long reads combined with ultralong reads and require significant resources, so they are not yet suggested as a replacement for the quality standard above. Reaching T2T quality is defined as the presence of all telomere sequences (where applicable), the absence of sequence gaps and a QV greater than 60.
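The thresholds for this group can be collected into a simple check. The sketch below is a hypothetical helper, not an EBP tool: the class and field names are invented for illustration, the metric values are assumed to have been measured already (for example with the tools in Appendix B), and treating a QV of exactly 40 as passing is an assumption. Chromosomal scale scaffolding is not checked because it has no single numeric threshold here.

```python
from dataclasses import dataclass


@dataclass
class AssemblyMetrics:
    contig_n50: int                   # base pairs
    qv: float                         # Phred-scaled consensus quality
    false_duplication_pct: float
    kmer_completeness_pct: float
    chromosome_assigned_pct: float
    busco_single_complete_pct: float
    transcripts_mapped_pct: float


def unmet_criteria(m, min_contig_n50=1_000_000, relax_busco=False):
    """Return the names of the criteria that the metrics do not meet.

    min_contig_n50 can be lowered (e.g. to 100_000 for the low input
    target); relax_busco skips the BUSCO check for the biological
    exceptions described in the (*) note.
    """
    checks = {
        "contig N50": m.contig_n50 >= min_contig_n50,
        "QV 40": m.qv >= 40,  # boundary handling is an assumption
        "false duplications < 5%": m.false_duplication_pct < 5,
        "kmer completeness > 90%": m.kmer_completeness_pct > 90,
        "sequence assigned to chromosomes > 90%": m.chromosome_assigned_pct > 90,
        "BUSCO complete and single copy > 90%": m.busco_single_complete_pct > 90,
        "transcripts mappable > 90%": m.transcripts_mapped_pct > 90,
    }
    if relax_busco:
        del checks["BUSCO complete and single copy > 90%"]
    return [name for name, ok in checks.items() if not ok]


def is_t2t(telomeres_found, telomeres_expected, gap_count, qv):
    """T2T as defined above: all expected telomeres, no gaps, QV > 60."""
    return telomeres_found == telomeres_expected and gap_count == 0 and qv > 60
```

An empty list from `unmet_criteria` means every checked threshold is met; anything else names the metrics to revisit.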

2. Species with limited DNA or material per individual (potentially <~100 ng DNA from a single individual as of 2024, though in practice the thresholds may be higher). For inputs as low as 10 ng, with new long range whole genome amplification technologies (also known as Ultra Low Input or ULI), it was proposed in May 2023 that a standard of 5.C.Q40 is regularly achievable, i.e. similar to the larger sample target but with contig N50 relaxed to >100kb to reflect amplification dropout. So this should be the current target. For even smaller samples we don’t currently recommend a standard.

3. Unculturable single cell eukaryotes. For these we propose a metagenomics-like standard to be determined based on experience in the prokaryotic community. This is still outstanding.

2. Additional requirements:

Our experience has shown that currently all automated processes, alone or in combination, generate assemblies with a variety of remaining errors, some of which are relatively easy to address and should be corrected before submission (3). We therefore propose that a set of quality control criteria be met, including:

a) separation of sequence of the target species from contaminants and other organisms such as symbionts/cobionts

b) explicit identification of a primary (haploid or pseudo-haploid) assembly, with additional sequence in a secondary bin that may contain either full alternate haplotypes or a set of haplotypic/other sequences from the individual. See below for further discussion of submission of diploid assemblies

c) separation and explicit identification of organellar genomes

d) sequences should contain only A, C, G, T and N bases, and should not begin or end with Ns
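Criterion d) is easy to check automatically. The following Python sketch reads FASTA records and reports sequences that contain other characters or that begin or end with N. The function names are illustrative, and treating lower case (soft-masked) bases as equivalent to upper case is an assumption rather than something the standard specifies.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_problems(seq):
    """Return a list of violations of criterion d) for one sequence."""
    upper = seq.upper()  # assumption: soft-masked bases are acceptable
    problems = []
    bad = sorted(set(upper) - set("ACGTN"))
    if bad:
        problems.append("disallowed characters: " + "".join(bad))
    if upper.startswith("N"):
        problems.append("begins with N")
    if upper.endswith("N"):
        problems.append("ends with N")
    return problems


def check_fasta(path):
    """Map each sequence name in a FASTA file to its problems, if any."""
    with open(path) as handle:
        report = {name: sequence_problems(seq) for name, seq in read_fasta(handle)}
    return {name: probs for name, probs in report.items() if probs}
```

Internal runs of N (gaps between contigs) are accepted; only leading and trailing Ns are flagged.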

We would like the majority of chromosome ends to contain telomeric repeat sequence, however this is currently not observed consistently.

We also encourage:

e) identification of discordances between raw data and resulting assembly to locate and remove structural errors (misjoins, missed joins and false duplications)

f) identification and naming of chromosomes, esp. sex chromosomes, where possible (see below)

g) reconciliation with the known karyotype where it exists and this is possible.

h) identification of [within individual] PAR (pseudoautosomal region) annotations of sex chromosomes, while recognising that this may be difficult in practice for many species.

Identification and naming of chromosomal-scale scaffolds can be achieved by consulting Hi-C 2D maps and comparison to existing karyotyping and linkage maps. If chromosome naming already exists for a given species, it should be reflected in the new assembly. If no previous naming exists, we recommend naming chromosomes by size, counting towards each chromosome's size any scaffolds that can be assigned to it but could not be unambiguously placed (unlocalised scaffolds). An alternative that is applicable in some cases is to name chromosomes after those in a closely related species with established nomenclature; we only recommend this if there are no major interchromosomal rearrangements identified between the species, i.e. all chromosomes are in one-to-one correspondence, but may have within chromosome rearrangements.
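Naming by size can be sketched as follows. This is a minimal illustration, assuming each chromosome is represented by a provisional identifier mapped to the lengths of its placed scaffold and any unlocalised scaffolds assigned to it; the identifiers, the numeric names and the tie-breaking rule are all hypothetical. Sex chromosomes, which are normally named separately, are not handled.

```python
def name_by_size(groups):
    """Assign names "1", "2", ... in order of decreasing total length.

    groups maps a provisional chromosome id to a list of lengths:
    the placed scaffold plus any unlocalised scaffolds assigned to it.
    Ties are broken by provisional id so the result is deterministic.
    """
    totals = {gid: sum(lengths) for gid, lengths in groups.items()}
    ranked = sorted(totals, key=lambda gid: (-totals[gid], gid))
    return {gid: str(rank) for rank, gid in enumerate(ranked, start=1)}
```

For example, `name_by_size({"a": [90], "b": [80, 15], "c": [50]})` names `b` as chromosome 1, because its unlocalised 15 units bring its total to 95.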

3. INSDC project structure and nomenclature:

For a reference genome to count towards the EBP goals it must be submitted to the INSDC (GenBank/EMBL/DDBJ) Genomes Division for open access use by the scientific community.

When this submission is made the assembly is associated with a BioProject object, which can be part of a hierarchy by being assigned to an “umbrella” BioProject. We suggest the structure in the figure below, with a data project for the raw data and one per assembly, an umbrella project for the target species (note that under this there may be assemblies of separate symbiont or cobiont species), and then above an umbrella BioProject corresponding to the overall project. Please also link your top level BioProject to the EBP BioProject object, whose identifier is PRJNA533106 (https://www.ncbi.nlm.nih.gov/bioproject/533106). If you are an EBP affiliate project and need PRJNA533106 to link down to your project please contact Erich Jarvis. If the assembly is also contributing to other larger scale efforts such as GIGA or VGP then you can also link your assembly to their umbrella BioProjects.

As well as connecting to a BioProject, an assembly needs to be assigned to a “txid” entry in the NCBI Taxonomy database. Although txid identifiers can be created for informal taxa such as Maylandia sp. “pearly”, we would like EBP genomes to be associated with taxonomically valid species names, and urge EBP-affiliated projects to work with appropriate taxonomists to identify samples to a species and where necessary to establish the species name in the standard manner in the literature. Txid identifiers at levels below species (e.g. subspecies or strains) will be tracked by EBP at the species level.

Furthermore, in addition to the numerical identifiers generated for assemblies by the public databases we request projects to adopt the tolid (for Tree of Life ID) standard short nomenclature for samples and assemblies, as used by the VGP and Darwin Tree of Life project. This takes the form <clade><gen><spec><ind>.<assembly> e.g. ilAlcRepa1.1 for the first assembly of insect lepidoptera Alcis repandata individual 1. Unique species designations have been generated to cover all ~485,000 species with data in INSDC or found in Britain and Ireland with relatively few clashes that were resolved with a simple process; other species can be added on request. From summer 2024, tolids can also be assigned to subspecies, variants or other taxonomic levels below “species”. The assigned tolids will reflect the parent species to preserve the ability of EBP species level accounting. A server to view and assign unique individual identifiers for samples is available at https://id.tol.sanger.ac.uk/, where you can also find details on the two letter prefix assignments, which assign all 26 letters to a top level partition of the tree of life. For queries please contact tolid-help@sanger.ac.uk.
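The structure of a tolid can be illustrated with a small parser. The regular expression below is an assumption reverse-engineered from the single example ilAlcRepa1.1 (lower case clade prefix of one or two letters, capitalised genus and species abbreviations, individual number, assembly number); it is not the official grammar, and it does not cover the tolids for taxa below species level. The authoritative source is the server at https://id.tol.sanger.ac.uk/.

```python
import re

# Assumed pattern, inferred from the example "ilAlcRepa1.1".
TOLID_PATTERN = re.compile(
    r"(?P<clade>[a-z]{1,2})"
    r"(?P<genus>[A-Z][a-z]+)"
    r"(?P<species>[A-Z][a-z]+)"
    r"(?P<individual>\d+)"
    r"\.(?P<assembly>\d+)"
)


def parse_tolid(name):
    """Split an assembly tolid into its parts, or return None if the
    name does not match the assumed pattern."""
    match = TOLID_PATTERN.fullmatch(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["individual"] = int(parts["individual"])
    parts["assembly"] = int(parts["assembly"])
    return parts
```

Parsing `ilAlcRepa1.1` yields clade `il`, genus abbreviation `Alc`, species abbreviation `Repa`, individual 1 and assembly 1.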

Finally, it is mandatory for EBP assemblies to submit to INSDC the primary raw data used to build the assembly along with the primary assemblies. For ENA submissions, the raw data needs to be referred to in the RUN_REF fields of the manifest of the assembly submission, as indicated in the instructions at https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files. For GenBank submissions, instructions can be found here: https://www.ncbi.nlm.nih.gov/sra/docs/submit/. Submission of assemblies and raw reads with the above recommended BioProject umbrella structure will ensure linking and discoverability.
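Before an ENA submission it can be useful to confirm that the manifest actually names the raw data. The sketch below is a hypothetical pre-submission check, not part of any ENA tool. It assumes the manifest is a text file with one field name and value per line separated by whitespace, that RUN_REF holds a comma-separated list of run accessions, and that run accessions look like ERR, SRR or DRR followed by digits; please verify these details against the ENA instructions linked above.

```python
import re

# Assumed shape of an INSDC run accession (ERR/SRR/DRR + digits).
RUN_ACCESSION = re.compile(r"[ESD]RR\d+")


def run_refs(manifest_text):
    """Collect the run accessions listed in RUN_REF lines."""
    refs = []
    for line in manifest_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "RUN_REF":
            refs.extend(r.strip() for r in parts[1].split(",") if r.strip())
    return refs


def run_ref_problems(manifest_text):
    """Return a list of problems with the raw data references."""
    refs = run_refs(manifest_text)
    if not refs:
        return ["no RUN_REF accessions found"]
    return [
        f"unexpected run accession: {ref}"
        for ref in refs
        if not RUN_ACCESSION.fullmatch(ref)
    ]


# Placeholder values for illustration only; these are not real accessions.
EXAMPLE_MANIFEST = "ASSEMBLYNAME\tilAlcRepa1.1\nRUN_REF\tERR0000001,ERR0000002\n"
```

An empty list from `run_ref_problems(EXAMPLE_MANIFEST)` indicates that at least one well-formed run accession is referenced.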

4. Diploid assemblies:

This section was added to version 5 following a discussion in the EBP/VGP assembly call on 5 May 2023.

It is now possible to separate out the homologous chromosomes during assembly to provide two essentially complete sets of chromosomes. There are two standard approaches to this. The first uses sequence data from the parents to phase the chromosomes, resulting in a maternal haploid genome and a paternal haploid genome. We recommend that these are called <tolid>.mat and <tolid>.pat and to provide an additional synthetic genome including both X and Y for annotation purposes. The second uses long range phasing data from Hi-C or (ultra)long reads to phase chromosomes, resulting potentially in fully phased individual chromosomes, or if not then ones with few phase switches, but no phase coherence across chromosomes. In the latter case we recommend that the resulting haploid genome sets are called <tolid>.hap1 and <tolid>.hap2. For polyploids, e.g. tetraploids, the corresponding nomenclature would be <tolid>.mat1, .mat2, .pat1, .pat2 or .hap1, .hap2, .hap3, .hap4.
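The naming scheme above can be written down as a small helper. This is only a sketch: the function name and the phasing labels "trio" and "long-range" are invented for illustration, and extending the pattern beyond the diploid and tetraploid cases given in the text (for example to hexaploids) is an extrapolation.

```python
def haplotype_names(tolid, phasing, ploidy=2):
    """Return the recommended names for the phased haploid genome sets.

    phasing is "trio" (parental data) or "long-range" (Hi-C or
    (ultra)long reads); both labels are hypothetical.
    """
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2")
    if phasing == "trio":
        if ploidy % 2:
            raise ValueError("trio naming here assumes an even ploidy")
        if ploidy == 2:
            return [f"{tolid}.mat", f"{tolid}.pat"]
        per_parent = ploidy // 2
        return [
            f"{tolid}.{parent}{i}"
            for parent in ("mat", "pat")
            for i in range(1, per_parent + 1)
        ]
    if phasing == "long-range":
        return [f"{tolid}.hap{i}" for i in range(1, ploidy + 1)]
    raise ValueError(f"unknown phasing method: {phasing}")
```

For a tetraploid phased with parental data this gives .mat1, .mat2, .pat1 and .pat2, in the same order as listed in the text.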

Despite submitting two full haploid genomes that are effectively equivalent in quality, the wider community still wants a single “reference” haploid genome to be designated. By default this will be .hap1 if phased with long range data. For trio-phased assemblies the primary assembly will need to be designated.

Species with heterogametic sex chromosomes. Because it is desired that the primary reference contains as complete a representation as possible of the species’ genome in haploid form, we recommend that .hap1 should contain the non-recombining Y as well as the X (or the non-recombining W as well as Z for ZW system species) and .hap2 should just contain the pseudo-autosomal region(s) of Y or W. An important requirement is that all the material should come from the same individual, and that the sum of the material in .hap1 and .hap2 should give the complete diploid genome. For X0 or Z0 species .hap1 should contain the X or Z chromosome.

In some cases, where an individual of the homogametic sex has been sequenced, people have added a heterogametic sex chromosome from another individual to make a reference for mapping, e.g. adding a Y from a male when the sequenced individual was a female. This is deprecated in EBP primary submissions. We request that instead a synthetic reference for alignment is made using material from two different primary assembly submissions.

The organellar (mitochondrion, plastid etc.) genomes should be included in the primary assembly.

5. Conclusion:

We believe the standards specified above are an essential foundation for high quality genome annotation and analysis supporting the scientific and societal goals of the EBP. However, we note the need for, and encourage, research to address the following problems and increase the proportion of species reaching this assembly standard: 1. generating uniform coverage of long read sequences from low input DNA; 2. high quality phased assembly of high heterozygosity and polyploid genomes; 3. better genome sequencing and assembly of mixtures of unculturable single cell eukaryotes.

No specific sequencing recommendations are made, to allow for future technology improvement and change. We urge EBP affiliated projects to share their sequencing recipes and assembly pipelines via publication, protocols.io and GitHub for reuse by others. Multiple long read plus Hi-C and/or other scaffolding strategies, for example the Vertebrate Genome Project assembly pipeline (available on Galaxy) or the Tree of Life assembly pipeline, now regularly reach these standards, given sufficient material, for less than the projected EBP phase 1 cost of $30,000 (1). Indeed we believe that it is now possible to reach these standards at substantially lower cost, of the order of $5,000 in direct costs for a 1 Gb genome, though we note that real costs vary widely across different countries. Because of this rapid change, the committee is optimistic about reducing sequencing complexity and computational costs in the next few years to achieve the estimated direct cost for phases 2 and 3 of the EBP of $800/species (in 2018 dollars) (1). To monitor this we will continue to revisit the standards on at least an annual basis, reconsidering what is feasible in terms of accuracy, completeness and cost.

1. H. A. Lewin et al., Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences of the United States of America 115, 4325-4333 (2018).

2. Rhie et al., Towards complete and error-free genome assemblies of all vertebrate species. bioRxiv preprint, https://doi.org/10.1101/2020.05.22.110833

3. Howe K et al., Significantly improving the quality of genome assemblies through curation. GigaScience 10(1), https://doi.org/10.1093/gigascience/giaa153 (2021)

Appendix A: Historical table of proposed metrics from 2018

from Rhie et al, bioRxiv https://doi.org/10.1101/2020.05.22.110833

Appendix B: Tools to obtain EBP assembly metrics

Genometools: http://genometools.org/tools.html

Merqury: https://github.com/marbl/merqury
Asset: https://github.com/dfguan/asset
Purge_dups: https://github.com/dfguan/purge_dups
BUSCO: https://github.com/openpaul/busco
STAR: https://github.com/alexdobin/STAR
Merqury.FK: https://github.com/thegenemyers/MERQURY.FK
Yak: https://github.com/lh3/yak

NB1 Merqury requires a high accuracy shotgun data set (Illumina or PacBio CCS) to extract kmers from. Ideally this will be independent from the one used to build the primary assembly.

NB2 Validation of haplotype phasing accuracy requires genome-wide sequence from a close relative so is not possible in many cases. It should be carried out where possible, but is not required to meet the EBP assembly standard.

ABOUT THE SUBCOMMITTEE

This Report on Assembly Standards was developed by EBP’s Scientific Subcommittee for Sequencing and Assembly.
