Report on Assembly Standards

Version 5.0 - May 2023


A. Purpose and audience:

The standards in this version have changed from those in the original EBP paper (1) and previous versions (July 2019, July 2020, December 2020, March 2021), reflecting progress in sequencing and assembly technology. This is a living document, which we aim to revise at least annually.


B. Quantitative assembly standards:

The EBP sequencing and assembly standards committee proposes different standards for three groups of organisms. The three groups are:

  1. Eukaryotic species for which sufficient DNA and tissue is available. For these species we propose a minimum reference standard of 6.C.Q40, i.e. megabase N50 contig continuity and chromosomal scale N50 scaffolding, with an error rate of less than 1/10,000 (see the table from the VGP flagship assembly paper (2) in Appendix A for notation and further information). For species with chromosome N50 smaller than a megabase this will be C.C.Q40.

    Alongside the contiguity and error rate goals we propose the following additional criteria from the table:

    • < 5% false duplications 

    • > 90% kmer completeness 

    • > 90% sequence assigned to candidate chromosomal sequences

    • > 90% single copy conserved genes (e.g. BUSCO) complete and single copy

    • > 90% transcripts from the same organism mappable

    Links to standard tools for measuring these metrics are provided in Appendix B below.

    While we believe that these are achievable goals for most, perhaps ultimately all, species from which large enough high quality samples can be obtained, we recognise that for many reasons (e.g. sample quality, very large genomes, polyploidy, cost expediency) they may not be met in the first instance.  Interim references that do not meet the standard can be very useful and should be valued.  However, there should be a continuing EBP goal to revisit them and bring them up to the target standard, as that becomes practical. A proposed nomenclature is that a public assembly that meets this standard is called a reference assembly whereas one that does not meet the standard but perhaps is the best available for its species is called a representative assembly.

  2. Species with limited DNA or material per individual (potentially <~100ng DNA from a single individual as of 2020, though in practice the thresholds may be higher). Prior to May 2023 we proposed an interim standard of 4.5.Q40, i.e. 10kb contig N50, 100kb scaffold N50 and still an error rate of less than 1/10,000. However, with new long range whole genome amplification technologies (also known as Ultra Low Input or ULI), it was proposed in May 2023 that a standard of 5.C.Q40 is regularly achievable, i.e. similar to the larger sample target but with contig N50 relaxed to >100kb to reflect amplification dropout. This is therefore the current target.

  3. Unculturable single cell eukaryotes, for which we propose a metagenomics-like standard, to be determined based on experience in the prokaryotic community. This is still outstanding.
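As a reading aid for the x.y.Q notation used above, the following is a minimal Python sketch, not an official EBP tool. It assumes, from the examples in the text, that a digit d in the first or second field means an N50 of at least 10^d bases, that "C" means chromosome scale (a judgement passed in as a flag rather than computed), and that Q is the Phred-scaled quality value, so that Q40 corresponds to an error rate of 1/10,000. All function names are hypothetical.

```python
import math


def n50(lengths):
    """Return the N50 of a list of sequence lengths: the largest length L such
    that sequences of length >= L together hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


def qv_from_error_rate(error_rate):
    """Phred-scaled quality value: QV = -10 * log10(error rate)."""
    return -10 * math.log10(error_rate)


def field_met(field, n50_bp, is_chromosomal):
    """Check one continuity field of the notation.

    'C' requires chromosome scale (decided elsewhere, e.g. by curation);
    a digit d requires an N50 of at least 10**d bases.
    """
    if field == "C":
        return is_chromosomal
    return n50_bp >= 10 ** int(field)


def meets_standard(standard, contig_n50, scaffold_n50, qv,
                   contigs_chromosomal=False, scaffolds_chromosomal=False):
    """Return True if the metrics satisfy a standard such as '6.C.Q40'.

    Assumption: the Q threshold is treated as inclusive (QV >= 40).
    """
    contig_field, scaffold_field, qv_field = standard.split(".")
    return (field_met(contig_field, contig_n50, contigs_chromosomal)
            and field_met(scaffold_field, scaffold_n50, scaffolds_chromosomal)
            and qv >= int(qv_field[1:]))


# Hypothetical metrics, for illustration only.
print(n50([2, 3, 4, 5, 6]))
print(meets_standard("6.C.Q40", 1_500_000, 80_000_000, 41.0,
                     scaffolds_chromosomal=True))
```

The same function can be called with "5.C.Q40" for the limited-material target or "4.5.Q40" for the earlier interim standard.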

C. Additional requirements:

Our experience has shown that currently all automated processes, alone or in combination, generate assemblies with a variety of remaining errors, some of which are relatively easy to address and should be corrected before submission (3). We therefore propose that a set of quality control criteria must be met, including:

  • Separation of sequence of the target species from contaminants and other organisms such as symbionts/cobionts

  • Explicit identification of a primary (haploid or pseudo-haploid) assembly, with additional sequence in a secondary bin that may contain either full alternate haplotypes or a set of haplotypic/other sequence from the individual. See below for further discussion of submission of diploid assemblies.

  • Separation and explicit identification of organellar genomes

  • Sequences containing only A, C, G, T and N bases, and not beginning or ending with Ns
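The last criterion in this list is simple to check automatically. The sketch below is one possible check, not a prescribed EBP procedure. It assumes that lower-case (soft-masked) bases are acceptable and upper-cases each sequence before testing it; the function names are hypothetical.

```python
import re

# Any character other than the five permitted base symbols.
INVALID_BASE = re.compile(r"[^ACGTN]")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_problems(sequence):
    """Return a list of rule violations for one sequence (empty if none)."""
    # Assumption: soft-masked lower-case bases are accepted.
    seq = sequence.upper()
    if not seq:
        return ["sequence is empty"]
    problems = []
    bad = sorted(set(INVALID_BASE.findall(seq)))
    if bad:
        problems.append("invalid characters: " + "".join(bad))
    if seq.startswith("N"):
        problems.append("begins with N")
    if seq.endswith("N"):
        problems.append("ends with N")
    return problems


# Hypothetical two-record FASTA, for illustration only.
for name, seq in read_fasta([">scaffold_1", "ACGTNNAC", ">scaffold_2", "NACGT"]):
    print(name, sequence_problems(seq))
```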

We also encourage:

  • Identification of discordances between raw data and resulting assembly to locate and remove structural errors (misjoins, missed joins and false duplications)

  • Identification and naming of chromosomes, especially sex chromosomes, where possible (see below)

  • Reconciliation with the known karyotype where it exists and this is possible.

Identification and naming of chromosomal-scale scaffolds can be achieved by consulting Hi-C 2D maps and comparing them to existing karyotyping and linkage maps. If chromosome naming already exists for a given species, it should be reflected in the new assembly. If no previous naming exists, we recommend naming chromosomes by size, taking into account scaffolds that can be assigned to a particular chromosome but could not be unambiguously placed within it (unlocalised scaffolds). An alternative that is applicable in some cases is to name chromosomes after those in a closely related species with established nomenclature; we only recommend this if no major interchromosomal rearrangements have been identified between the species, i.e. all chromosomes are in one-to-one correspondence, though they may have within-chromosome rearrangements.
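Naming by size can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed procedure: it assumes that the length of each unlocalised scaffold is added to its parent chromosome before ranking, that the largest chromosome is named "1", and that sex chromosomes are handled separately. The function name and input layout are hypothetical.

```python
def name_chromosomes_by_size(chromosome_lengths, unlocalised=None):
    """Map each chromosomal scaffold to a name "1", "2", ... by descending size.

    chromosome_lengths: dict of scaffold id -> length in bases.
    unlocalised: optional dict of unlocalised scaffold id ->
        (parent chromosome scaffold id, length in bases).
    """
    totals = dict(chromosome_lengths)
    for parent, length in (unlocalised or {}).values():
        totals[parent] += length
    # Largest first; ties broken by scaffold id so the result is deterministic.
    ranked = sorted(totals, key=lambda scaffold: (-totals[scaffold], scaffold))
    return {scaffold: str(rank) for rank, scaffold in enumerate(ranked, start=1)}


# Hypothetical lengths, for illustration only.
names = name_chromosomes_by_size(
    {"scaffold_a": 100, "scaffold_b": 120, "scaffold_c": 110},
    unlocalised={"scaffold_u1": ("scaffold_c", 15)},
)
print(names)
```

In this example the unlocalised scaffold lifts scaffold_c above scaffold_b, which shows why the assignment of unlocalised scaffolds should be settled before names are fixed.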

D. INSDC project structure and nomenclature:

For a reference genome to count towards the EBP goals it must be submitted to the INSDC (GenBank/EMBL/DDBJ) Genomes Division for open access use by the scientific community.  

When this submission is made, the assembly is associated with a BioProject object, which can be part of a hierarchy by being assigned to an “umbrella” BioProject. We suggest the structure in the figure below, with a data project for the raw data and one per assembly, an umbrella project for the target species (note that under this there may be assemblies of separate symbiont or cobiont species), and, above that, an umbrella BioProject corresponding to the overall project. Please also link your top level BioProject to the EBP BioProject object, whose identifier is PRJNA533106 (https://www.ncbi.nlm.nih.gov/bioproject/533106). If the assembly is also contributing to other larger scale efforts such as GIGA or VGP then you can also link your assembly to their umbrella BioProjects.

[Figure: suggested BioProject structure]
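The suggested hierarchy can also be written out as a small tree. The sketch below is one reading of the paragraph above, not a reproduction of the figure. Only the EBP accession PRJNA533106 comes from the text; the project titles, the class name and the alternate-haplotype entry are hypothetical, and real accessions are assigned by INSDC on registration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EBP_BIOPROJECT = "PRJNA533106"  # the only real accession in this sketch


@dataclass
class BioProject:
    title: str
    accession: Optional[str] = None  # assigned by INSDC on registration
    children: List["BioProject"] = field(default_factory=list)


def render(project, depth=0):
    """Return the hierarchy as a list of indented text lines."""
    label = project.title
    if project.accession:
        label += f" ({project.accession})"
    lines = ["  " * depth + label]
    for child in project.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical project titles, for illustration only.
species = BioProject("Umbrella: target species", children=[
    BioProject("Data: raw sequencing data"),
    BioProject("Assembly: primary"),
    BioProject("Assembly: alternate haplotype"),
    BioProject("Assembly: cobiont species"),
])
overall = BioProject("Umbrella: overall project", children=[species])
ebp = BioProject("Earth BioGenome Project", EBP_BIOPROJECT, [overall])

print("\n".join(render(ebp)))
```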

As well as connecting to a BioProject, an assembly needs to be assigned to a “txid” entry in the NCBI Taxonomy database.  Although txid identifiers can be created for informal taxa such as Maylandia sp. “pearly”, we would like EBP genomes to be associated with taxonomically valid species names, and urge EBP-affiliated projects to work with appropriate taxonomists to identify samples to a species and where necessary to establish the species name in the standard manner in the literature.  

Furthermore, in addition to the numerical identifiers generated for assemblies by the public databases, we request that projects adopt the tolid (for Tree of Life ID) standard short nomenclature for samples and assemblies, as used by the VGP and Darwin Tree of Life project. This takes the form <clade><gen><spec><ind>.<assembly>, e.g. ilAlcRepa1.1 for the first assembly of insect lepidoptera Alcis repandata individual 1. Unique species designations have been generated to cover all ~485,000 species with data in INSDC or found in Britain and Ireland, with relatively few clashes that were resolved with a simple process; other species can be added on request. A server to view and assign unique individual identifiers for samples is available at https://id.tol.sanger.ac.uk/, where you can also find details on the two letter prefix assignments, which assign all 26 letters to a top level partition of the tree of life. For queries please contact tolid-help@sanger.ac.uk.
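To make the structure of a tolid concrete, the sketch below splits one into its parts. The field widths are assumptions inferred from the example ilAlcRepa1.1 (a one or two letter lower-case clade prefix, a three letter genus code, a three or four letter species code, then the individual number and an optional assembly number); the authoritative rules are those published at https://id.tol.sanger.ac.uk/. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ToLID(NamedTuple):
    clade: str
    genus: str
    species: str
    individual: int
    assembly: Optional[int]


# Assumed field widths, inferred from the example in the text.
TOLID_PATTERN = re.compile(
    r"(?P<clade>[a-z]{1,2})"
    r"(?P<genus>[A-Z][a-z]{2})"
    r"(?P<species>[A-Z][a-z]{2,3})"
    r"(?P<individual>[1-9][0-9]*)"
    r"(?:\.(?P<assembly>[1-9][0-9]*))?"
)


def parse_tolid(text):
    """Split a tolid such as 'ilAlcRepa1.1' into its components."""
    match = TOLID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised tolid: {text!r}")
    assembly = match.group("assembly")
    return ToLID(
        clade=match.group("clade"),
        genus=match.group("genus"),
        species=match.group("species"),
        individual=int(match.group("individual")),
        assembly=int(assembly) if assembly is not None else None,
    )


print(parse_tolid("ilAlcRepa1.1"))
```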

Finally, it is mandatory for EBP projects to submit to INSDC the primary raw data used to build each assembly along with the primary assemblies, and to refer to these data in the RUN_REF fields of the manifest of the assembly submission, as indicated in the instructions at https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files.
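As an illustration of the RUN_REF requirement, the sketch below assembles manifest text and refuses to produce it without run accessions. It assumes a manifest of one field name and value per line separated by a tab, with RUN_REF holding a comma-separated list of run accessions. The field names other than RUN_REF, and every value shown, are placeholders to be checked against the linked ENA instructions; the function name is hypothetical.

```python
def build_manifest(fields, run_accessions):
    """Return manifest text: one 'FIELD<TAB>value' row per line.

    fields: dict of manifest field name -> value (insertion order is kept).
    run_accessions: accessions of the raw data runs used for the assembly.
    """
    if not run_accessions:
        raise ValueError("the raw data runs must be referenced in RUN_REF")
    rows = list(fields.items())
    rows.append(("RUN_REF", ",".join(run_accessions)))
    return "".join(f"{name}\t{value}\n" for name, value in rows)


# All values below are placeholders, not real accessions or file names.
example = build_manifest(
    {
        "STUDY": "STUDY_ACCESSION_PLACEHOLDER",
        "SAMPLE": "SAMPLE_ACCESSION_PLACEHOLDER",
        "ASSEMBLYNAME": "ilAlcRepa1.1",
        "FASTA": "primary_assembly.fasta.gz",
    },
    ["RUN_ACCESSION_1", "RUN_ACCESSION_2"],
)
print(example)
```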

E. Diploid assemblies:

This section was added to version 5 following a discussion in the EBP/VGP assembly call on 5 May 2023.

It is now possible to separate out the homologous chromosomes during assembly to provide two essentially complete sets of chromosomes.  There are two standard approaches to this.  The first uses sequence data from the parents to phase the chromosomes, resulting in a maternal haploid genome and a paternal haploid genome.  We recommend that these are called <tolid>.mat and <tolid>.pat. The second uses long range phasing data from Hi-C or (ultra)long reads to phase chromosomes, resulting potentially in fully phased individual chromosomes, or if not then ones with few phase switches, but no phase coherence across chromosomes.  In the latter case we recommend that the resulting haploid genome sets are called <tolid>.hap1 and <tolid>.hap2.  For polyploids, e.g. tetraploids, the corresponding nomenclature would be <tolid>.mat1, .mat2, .pat1, .pat2 or .hap1, .hap2, .hap3, .hap4.
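The naming scheme in this paragraph can be generated mechanically. The following is a minimal sketch, assuming that trio-phased polyploids have an even ploidy split equally between the parents; the function name and the phasing labels "trio" and "long_range" are hypothetical.

```python
def haplotype_names(tolid, phasing, ploidy=2):
    """Return the recommended names for the haploid genome sets.

    phasing: "trio" for parental data, "long_range" for Hi-C or
    (ultra)long read phasing.
    """
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2")
    if phasing == "trio":
        # Assumption: an equal number of haplotypes from each parent.
        if ploidy % 2:
            raise ValueError("trio naming here assumes an even ploidy")
        per_parent = ploidy // 2
        if per_parent == 1:
            return [f"{tolid}.mat", f"{tolid}.pat"]
        maternal = [f"{tolid}.mat{i}" for i in range(1, per_parent + 1)]
        paternal = [f"{tolid}.pat{i}" for i in range(1, per_parent + 1)]
        return maternal + paternal
    if phasing == "long_range":
        return [f"{tolid}.hap{i}" for i in range(1, ploidy + 1)]
    raise ValueError(f"unknown phasing method: {phasing!r}")


print(haplotype_names("ilAlcRepa1", "trio"))
print(haplotype_names("ilAlcRepa1", "long_range", ploidy=4))
```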

Although the two full haploid genomes submitted are effectively equivalent in quality, the wider community still wants a single “reference” haploid genome to be designated. By default this will be .hap1 if phased with long range data. For trio-phased assemblies the primary assembly will need to be designated explicitly. Because the primary reference should contain as complete a representation as possible of the species’ genome in haploid form, we recommend that .hap1 should contain the non-recombining Y as well as the X (or the non-recombining W as well as Z for ZW system species) and .hap2 should just contain the pseudo-autosomal region(s) of Y or W. An important requirement is that all the material should come from the same individual, and that the sum of the material in .hap1 and .hap2 should give the complete diploid genome. For XO or ZO species .hap1 should contain the X or Z chromosome.

In some cases, where an individual of the homogametic sex has been sequenced, people have added a heterogametic sex chromosome from another individual to make a reference for mapping, e.g. adding a Y from a male when the sequenced individual was a female. This is deprecated in EBP primary submissions.  We request that instead a synthetic reference for alignment is made using material from two different primary assembly submissions. 

F. Conclusion:

We believe the standards specified above are an essential foundation for high quality genome annotation and analysis supporting the scientific and societal goals of the EBP. However, we note the need for, and encourage, research to address the following problems and increase the proportion of species reaching this assembly standard: 1. generating uniform coverage long read sequences from low input DNA; 2. high quality assembly of high heterozygosity and polyploid genomes; 3. better genome sequencing and assembly of mixtures of unculturable single cell eukaryotes.

No specific sequencing recommendations are made, to allow for future technology improvement and change. We urge EBP affiliated projects to share their sequencing recipes and assembly pipelines via publication, protocols.io and github for reuse by others. Multiple long read plus Hi-C and/or other scaffolding strategies, for example the Vertebrate Genomes Project assembly pipeline (https://github.com/VGP/vgp-assembly and available on DNAnexus), now regularly reach these standards, given sufficient material, for less than the projected EBP phase 1 cost of $30,000 (1). Indeed we believe that it is now possible to reach these standards at substantially lower cost, of the order of $10,000 in direct costs for a 1Gb genome. Because of this rapid change, the committee is optimistic about reducing sequencing complexity and computational costs in the next few years to achieve the estimated direct cost for phases 2 and 3 of the EBP of $800/species (in 2018 dollars (1)). To monitor this we will continue to revisit the standards on at least an annual basis, reconsidering what is feasible in terms of accuracy, completeness and cost.

G. Citations:

  1. H. A. Lewin et al., Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences of the United States of America 115, 4325-4333 (2018).

  2. Rhie et al., Towards complete and error-free genome assemblies of all vertebrate species. bioRxiv preprint doi: https://doi.org/10.1101/2020.05.22.110833

  3. K. Howe et al., Significantly improving the quality of genome assemblies through curation. GigaScience 10(1), https://doi.org/10.1093/gigascience/giaa153 (2021).

Appendix A: Table of proposed EBP metrics 

Rhie et al, bioRxiv https://doi.org/10.1101/2020.05.22.110833


Appendix B: Tools to obtain EBP assembly metrics

[Table: tools for measuring the EBP assembly metrics]

NB1 Merqury requires a high accuracy shotgun data set (Illumina or PacBio CCS) to extract kmers from.  Ideally this will be independent from the one used to build the primary assembly.

NB2 Validation of haplotype phasing accuracy requires genome-wide sequence from a close relative so is not possible in many cases. It should be carried out where possible, but is not required to meet the EBP assembly standard.



ABOUT THE SUBCOMMITTEE

This Report on Assembly Standards was developed by EBP’s Scientific Subcommittee for Sequencing and Assembly.