Report on Assembly Recommendations

Document History

Version 1 (May 2023) - confirmed after Lausanne EBP Phase 2 meeting

Version 2 (October 2023) - added Workflows section

Version 3 (June 2024) - added genome tools

Version 4 (January 2026) - opened for comments from June 2025, approved by the Sequencing and Assembly Committee 31st Jan 2026


To accompany the Recommendations, the EBP provides Sequencing and Assembly Standards.

The EBP Assembly Standards document currently proposes standards to be achieved for reference genomes. The key elements are >1 Mb contiguity (contig NG50), >90% of sequence assigned to chromosomal scaffolds (and thus scaffold N50 ~= chromosomal N50), and an error rate of <1 in 10,000 bases (Q40), summarised as 6.C.Q40. In addition, there are further requirements including identifying and separating out organellar genomes, contaminants, sex chromosomes, etc.
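As a worked illustration of these headline thresholds, the sketch below computes contig NG50, the fraction of sequence assigned to chromosomes, and the Phred-scaled quality value for a given error rate, and checks each against the numbers quoted above. The function names and input layout are hypothetical, chosen only for illustration; they are not taken from the standards document or from any assembly tool.

```python
import math

def ng50(contig_lengths, genome_size):
    """Length L such that contigs of length >= L cover at least half
    of the estimated genome size (0 if they never reach half)."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return 0

def phred_qv(error_rate):
    """Phred-scaled quality: an error rate of 1 in 10,000 gives Q40."""
    return -10 * math.log10(error_rate)

def meets_headline_standards(contig_lengths, genome_size,
                             chromosomal_bp, total_bp, error_rate):
    """Check the three headline thresholds (hypothetical helper)."""
    return {
        "contig_ng50_gt_1mb": ng50(contig_lengths, genome_size) > 1_000_000,
        "gt_90pct_in_chromosomes": chromosomal_bp / total_bp > 0.90,
        "qv_gt_40": phred_qv(error_rate) > 40,
    }
```

Note that NG50 is computed against the estimated genome size rather than the assembly span, so an incomplete assembly cannot inflate its own contiguity score.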

The standards document intentionally does not provide requirements for primary data or software to be used. There are competitive alternative technologies available at all steps. Furthermore, technologies change, both for sequencing and computational methods, and different combinations of data can be used to achieve these goals. However, it is useful to provide concrete recommendations based on the experience of EBP major projects, at least as an exemplar to help others plan and get going.

Summary of proposed requirements for sequencing

It is our experience that to achieve these standards it is necessary to use both:

  1. Long read data for the primary assembly.  Current options (end-2025) are:

    a. Pacific Biosciences HiFi (high-fidelity, also known as circular consensus sequencing or CCS): very high accuracy, typically 10-20 kb length

    b. Oxford Nanopore (ONT): moderate to high accuracy using the R10.4 chemistry, length from 10-1,000 kb with yield decreasing with higher mean length. Whilst the accuracy of the individual reads is lower than that of CCS reads, ONT reads can be assembled into outcomes of comparable quality without the need for additional separate error correction, e.g. using hifiasm --ont.

  2. Long range scaffolding data.  Primary options are:

    a. Hi-C data (short read pairs using proximity ligation data). Widely used commercial kits for assembly scaffolding by EBP projects include Dovetail Omni-C, Arima Hi-C and Phase Hi-C, but other Hi-C library making options are possible.  A possible alternative is long read Pore-C data.

    b. Ultra-long (100 kb+) reads (currently from ONT), although these may still fall short of producing whole chromosomes.

If the read-based assembly delivers chromosomal contigs, as can happen in some cases when ultra-long reads are included, then additional scaffolding is not required.

Additional information that can be very valuable includes:

1. Whole Genome Sequencing (WGS) data from both parents, if available, for essentially complete haplotype separation and validation.  Short read data are frequently used for this; WGS from one parent (e.g. maternal) can be of utility in haplotype phasing where contig N50 is large with respect to chromosome N50, there are few short contigs, and the species has high heterozygosity (such that the two alleles in the maternal genome are unlikely to be extensively shared with the unmeasured paternal genome).

2. Deep paired-end high accuracy short read WGS data (from, e.g. Illumina, Element, Ultima or other high-accuracy platform) for base-pair accuracy “polishing”. We note that such data are of limited utility if using high quality or high depth long read data, as “polishing” generally serves to “correct” the error biases of one sequencing technology to those of the other. Also, assembly algorithms might no longer require separate read correction (e.g. hifiasm --ont).

Less easily available and less used these days are:

3. BioNano optical restriction maps.

4. Strand-Seq data for contig orientation in scaffolding, where available. We note that this requires a cell line to make the library.

5. Genetic maps built from crosses with reasonable marker density, for example using RADseq or similar reduced representation sequencing on a genetic map with high crossover density. Genetic maps are excellent for chromosomal (formally linkage group) assignment and long range ordering, but are usually of comparatively low resolution.

6. Fluorescence in situ hybridisation (FISH) data. Low throughput but high information. Good for resolving complex medium- to long-range structural assignment, and confirming breakpoints.

7. Identification of telomeric repeat arrays at the end of presumed "complete chromosomal" scaffolds or contigs. We note that not all taxa (e.g. drosophilid Diptera) have short tandem repeats as telomeric arrays, and some taxa (e.g. Saccharomyces yeasts) have arrays that contain a number of related short sequences (i.e. are not strictly tandem repeats of identical units).
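The telomere check in item 7 can be illustrated with a minimal sketch that asks whether the first and last window of a scaffold is dominated by a telomeric repeat unit on either strand. The motif (TTAGGG, the vertebrate-type repeat), the window size and the threshold are all assumptions made for this example, and, as noted above, a single fixed motif is not appropriate for every taxon. This is not the algorithm used by any particular tool.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def motif_fraction(window_seq, motif):
    """Fraction of a window covered by non-overlapping copies of the
    motif or its reverse complement."""
    window_seq = window_seq.upper()
    rc = reverse_complement(motif)
    hits = window_seq.count(motif)
    if rc != motif:  # avoid double counting a palindromic motif
        hits += window_seq.count(rc)
    return hits * len(motif) / max(len(window_seq), 1)

def telomeric_ends(seq, motif="TTAGGG", window=600, min_fraction=0.5):
    """Return (start_is_telomeric, end_is_telomeric) for one scaffold.
    motif, window and min_fraction are illustrative defaults only."""
    return (
        motif_fraction(seq[:window], motif) >= min_fraction,
        motif_fraction(seq[-window:], motif) >= min_fraction,
    )
```

A scaffold returning (True, True) under an appropriate motif is a candidate for a telomere-to-telomere chromosome; a single True suggests the scaffold is complete at one end only.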

Recommended sequencing data for primary assembly

We provide three recommended specific sequencing recipes based on experience as of late 2025.  These are typically adequate to achieve the EBP Assembly Standards, but will not work 100% of the time for 100% of organisms.  First for assembly:

A. PacBio HiFi based

a. >12.5 (ideally up to 20) fold coverage per haplotype PacBio HiFi data.  For a diploid organism this means >25 fold.  For polyploids, the genome size estimate on which coverage should be based is that of the sum of the span of homeologues. Thus for a recent autopolyploid the base genome span should be twice that of the simple diploid. For rediploidized polyploids or allopolyploids the span is likely to be approximately the sum of the spans of the originating diploids.

B. ONT-based

a. >20 fold coverage of ONT data per haplotype with read N50 ~50 kb.  These data can be error-corrected before assembly using herro, or during assembly using hifiasm --ont (see above).

b. >25 fold coverage per haplotype paired end short read data for polishing.

Whichever of the long-read platforms is used, you should be starting from a single organism of your species for all the long read data, and the short read polishing data. If available, and if it is known or suspected based on phylogenetic evidence that the organism has sex chromosomes, the heterogametic sex (e.g., ZW or XY individual, or the UV stage) should be chosen. The polishing data, if necessary (see above), can come from a linked read library, or even a non-restriction-enzyme based Hi-C library such as Dovetail’s Omni-C but in either of these cases must be greater depth than the long read coverage to compensate for greater unevenness in coverage.

For some species and species groups, generation of data from both major long-read platforms may be required, as the regions subject to “dropout” (regions that are difficult to sequence, as revealed by missing data or gaps in the assembly when scaffolded with Hi-C) may differ between platforms. For some taxa one platform is recommended over the other. For example, ONT data are generally superior to PacBio HiFi data when assembling teleost fish genomes and avian microchromosomes, while high-accuracy PacBio HiFi data excel at assembly of repeat-rich genomes where minor variation between repeats permits accurate reconstruction of tandem arrays.
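To turn the per-haplotype coverage targets in recipes A and B into a sequencing yield, the sketch below multiplies the haploid genome span by the minimum fold coverage and the number of haplotypes. The dictionary keys and function name are hypothetical. Treating a polyploid simply as a larger number of haplotypes is our reading of the guidance on homeologue span above, and should be taken as an assumption.

```python
# Minimum fold coverage per haplotype, as quoted in recipes A and B.
MIN_FOLD_PER_HAPLOTYPE = {
    "pacbio_hifi": 12.5,
    "ont": 20,
    "short_read_polishing": 25,
}

def required_yield_bp(haploid_span_bp, data_type, haplotypes=2):
    """Minimum bases to sequence for a genome of the given haploid span.

    haplotypes=2 models a diploid; larger values are an assumed way of
    modelling polyploids whose homeologue spans must all be covered.
    """
    return haploid_span_bp * MIN_FOLD_PER_HAPLOTYPE[data_type] * haplotypes
```

For a diploid with a 1 Gb haploid genome this gives at least 25 Gb of HiFi data or 40 Gb of ONT data, matching the ">25 fold" figure quoted for diploids in recipe A.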

Recommended sequencing data for scaffolding

In addition to the long read data we recommend the generation of >50 fold coverage of the haploid genome size in Hi-C data for scaffolding (i.e. >25x per haplotype).  Ideally this will also be from the same individual, but if that is not possible, e.g. because the specimen was too small, it is possible to scaffold with Hi-C data from another organism or even a pool of additional individuals of the same species. Ideally, these should be of the same sex as the specimen giving rise to the long reads, especially if the species has known genotypic differences between the sexes (e.g., differentiated sex chromosomes). Where Hi-C short-read data are also used for polishing, greater coverage (>100 fold per haploid genome) will be required.

Chromatin conformation capture protocols are also available for both long read platforms (Pore-C for ONT and CiFi for PacBio). These methodologies use similar methods for generation of proximity ligation products, but differ from the short read protocols in concatenating these products into longer arrays for sequencing. The long read platforms can generate longer subfragment reads than the short read platforms, and these longer subreads are in turn more likely to be uniquely mapped in the primary assembly, thus improving mapping rates per fragment, and especially mapping to repetitive sequences. These improvements are tempered by the higher cost per fragment pair compared to standard short-read Hi-C data.
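The short-read Hi-C coverage targets above can be converted into a number of read pairs to request from a sequencing provider. The sketch below assumes 2 x 150 bp paired-end reads; that read length, and the function name, are assumptions for illustration and are not specified in these recommendations.

```python
import math

def hic_read_pairs(haploid_genome_bp, fold=50, read_length=150):
    """Read pairs needed to reach `fold` coverage of the haploid genome.

    fold=50 is the scaffolding target quoted above; use fold=100 when
    the Hi-C data will also be used for polishing. read_length=150 is
    an assumed paired-end read length.
    """
    bases_needed = fold * haploid_genome_bp
    return math.ceil(bases_needed / (2 * read_length))
```

For a 1 Gb haploid genome this works out at roughly 167 million read pairs for scaffolding alone, and about twice that if the same library is to be used for polishing.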

Recommended assembly processes

Below we give an outline of a recommended assembly pipeline, based on experience as of December 2025, with options for the different data types.

[Note: References and links to repositories for the software tools suggested (indicated by bold lettering) are given at the end of this document under each tool name]:

  1. Register your sample for a tolid (Tree of Life ID) at id.tol.sanger.ac.uk and name your assembly after this tolid.

  2. Build k-mer tables for long reads and short read data, and use them to evaluate genome size, heterozygosity, mean coverage and ploidy.  The k-mer tables should be kept as they are useful later. There are multiple tool-chains for this, using incompatible file formats (sadly):

    1. FastK, followed by GENESCOPE.FK

      MERQURY.FK Smudge is used to estimate ploidy.

    2. Meryl with GenomeScope2 and Merqury.

      SmudgePlots within GenomeScope can be used to estimate ploidy.

    3. KMC

The Genomes on a Tree (GoaT) datasystem (https://goat.genomehubs.org) has collated genome size estimates generated using non-sequencing methods (fluorescence and densitometry) as well as the assembly sizes of genomes submitted to the International Nucleotide Sequence Database Collaboration (INSDC: GenBank, European Nucleotide Archive and The DNA Database of Japan). Where values are not available for a species, GoaT estimates genome size from a simple ancestral reconstruction. These values can be used as a ground truth (directly measured or assembled values) or a guide (estimated values) against which to compare the values derived from k-mer distributions.
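To make the k-mer based genome size estimate concrete, the sketch below builds a k-mer multiplicity histogram from reads and applies the simplest possible estimator: total k-mer instances divided by the depth of the main peak. It is a deliberately naive illustration. It does not canonicalise k-mers, fit a model, or distinguish heterozygous from homozygous peaks as the tools listed above do, and the `min_depth` cutoff used to discard likely error k-mers is an assumption.

```python
from collections import Counter

def kmer_histogram(reads, k):
    """Map multiplicity -> number of distinct k-mers with that count."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())

def estimate_genome_size(histogram, min_depth=2):
    """Naive estimate: k-mer instances at or above min_depth, divided
    by the depth of the most populated histogram bin (the main peak)."""
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_instances = sum(d * n for d, n in usable.items())
    return total_instances / peak_depth
```

An estimate obtained this way can then be compared with the directly measured or assembled values collated in GoaT; a large disagreement is a prompt to check for contamination, unexpected ploidy, or a misidentified main peak.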

3. Assemble the long reads into contigs.  We now strongly recommend an assembler that separates the haplotypes, e.g. hifiasm or Verkko2. If you have parental Illumina data then the assemblers can be used in trio-binning mode. Hi-C/Pore-C data can also be used for haplotype separation with both assemblers.

4. These assemblers create either hap1/hap2 pairs of pseudo-chromosomes if scaffolding data are included in the assembly, or primary and alternate sets of contigs.  Even when they do this it may be necessary to remove haplotypic duplicates from the primary contigs using purge_dups, before scaffolding.

5. Identify and remove contaminants and cobionts. Biodiversity genomics sequencing often uses specimens isolated from the wild that can come with genomes derived from the microbiome, parasites, pathogens, mutualist symbionts and chance contamination (collectively “cobionts”). Tools such as BlobToolKit, FCS-GX, ASCC and Kraken2 can help identify and mark these for removal. It may be advisable to repeat the primary assembly after removal of identified cobiont genomes, to assure best performance.

6. Identify and separate out contigs corresponding to organelles (mitochondrion, plastid for plants, potentially others for some organisms).  For animal mitochondria, reassemble with MitoHiFi (for HiFi data) or MitoVGP (for ONT or PacBio CLR data).  Non-animal mitochondria and plastids are more complex, with long repeats that support recombination and variable topologies; Oatk is specifically designed to address them, and can also be used for the simpler mitochondria of Metazoa.

7. Scaffold with Hi-C/Omni-C data using YaHS. YaHS is recommended over older tools such as SALSA2, as these have known limitations. For Omni-C, HiRise or 3D-DNA can also be used.

8. If the contigs were made with uncorrected ONT reads you may wish to polish the assembly, depending on the QV calculated. We recommend using short read data from a high accuracy platform (such as Illumina or Element) and bwa (to map), FreeBayes or DeepVariant (to call variants), Merfin (to filter calls) and bcftools (to apply corrections). If the contigs were made with HiFi data or corrected ONT data (corrected prior to assembly, or during assembly with hifiasm --ont) then their accuracy will be much higher to start with, but could potentially be improved by remapping the long reads with WinnowMap, calling variants with DeepVariant, and then correcting as above. Correction with short reads may also improve accuracy, particularly for indels, but care must be taken to avoid introducing errors due to uncertain short read mapping, and to avoid simply replacing the “category errors” inherent in the long read data with the error biases of the short read platform. It is important to map the polishing reads to all of the assembly material (primary, alternate, contamination, organelles) to avoid the mismapping and false calls that occur if segments of the DNA are missing.

Curation of scaffolded assemblies

The process above will generate assemblies, but it is then necessary to carry out a curation and QC step.

[Note: References and links to repositories for the software tools suggested (indicated by bold lettering) are given at the end of this document under each tool name]

  1. Recheck for contamination using e.g. FCS-GX or ASCC and remove contaminated contigs/regions.

  2. Re-map the Hi-C data, and review/edit scaffolds in PretextView or JuiceBox to generate a set of large scaffolds that you trust to represent chromosomes. Workflows for this are available (GRIT Rapid Curation, TreeVal). Some small contigs/scaffolds will remain in the primary assembly, and can be submitted as unlocalised, but the goal is for these to represent less than 10% of the sequence.

  3. Assign chromosome numbers to chromosomal scaffolds, using genetic data from the species if available to connect to established chromosome/linkage group nomenclature (taking care to conform to the established orientation), or numerically in size order, taking into account any karyotype information that is available. If there is complete one-to-one orthology to the chromosome set of a closely related species with an established chromosome nomenclature, then it is acceptable to adopt that nomenclature. Assess for previously unannotated/undescribed sex chromosomes prior to settling on numbering.

  4. Confirm unique single copy material representation using GENESCOPE.FK.

  5. Estimate base pair accuracy using MERQURY.FK or Merqury or yak.

  6. Estimate single copy gene representation completeness with BUSCO.
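For the base pair accuracy estimate in step 5, the k-mer based consensus quality value described in the Merqury paper can be written down in a few lines. The sketch below assumes you already have, from a k-mer counting tool, the number of assembly k-mers absent from the read set and the total number of assembly k-mers; the function and argument names are hypothetical.

```python
import math

def kmer_qv(assembly_only_kmers, total_assembly_kmers, k):
    """Consensus QV from k-mer agreement between assembly and reads.

    P = (shared / total) ** (1 / k) is the estimated per-base accuracy,
    E = 1 - P the per-base error rate, and QV = -10 * log10(E).
    """
    shared = total_assembly_kmers - assembly_only_kmers
    per_base_accuracy = (shared / total_assembly_kmers) ** (1 / k)
    error_rate = 1 - per_base_accuracy
    if error_rate <= 0:
        return math.inf  # no unsupported k-mers were observed
    return -10 * math.log10(error_rate)
```

With k = 21, an assembly in which 0.21% of k-mers are unsupported by the reads sits almost exactly at the Q40 threshold, which shows how few unsupported k-mers the standard tolerates.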

It is important that when the finished assembly is submitted to the INSDC databases you also submit the primary data used for the assembly process, and link it via a shared BioProject identifier.  This permits a variety of scientific downstream analyses unbiased by the assembly, as well as independent generation of summary metrics and potential third party evaluation of assembly queries.

Cobiont assemblies

In order to assemble cobionts present in the sequenced sample, we recommend assembly as described above, e.g. using hifiasm, for eukaryotic cobionts, or a metagenome assembler such as metaMDBG to assemble microbial communities. Quality checks can be carried out using BUSCO or CheckM2, and for prokaryotic MAGs the assemblies can be validated against the MiMAG standard (https://www.nature.com/articles/nbt.3893). The additional data should be submitted alongside the target species' assembly as detailed in Fig 1 of the EBP assembly standards document.

Telomere to Telomere (T2T) assembly

The recommendations above are designed to achieve the EBP assembly quality target.  While this standard ensures good reference genomes, and in some species some chromosomes may be contiguous, such assemblies typically include a number of gaps within chromosomes (frequently hundreds, but for some species thousands), and some assembled contigs that cannot be placed in the chromosomal scaffolds.  Gaps typically occur at highly repetitive sequences such as centromeres, ribosomal DNA clusters or long segmental repeats, or where there is read drop-out because of sequence composition.  For example, PacBio HiFi can lose coverage in very GA-rich regions.
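A simple way to see how far a scaffolded assembly is from gapless is to count the runs of N bases in each scaffold. The sketch below does this for FASTA-formatted text, joining wrapped sequence lines first so that a gap spanning a line break is counted once. The function names are hypothetical, and the assumption that every run of N represents a scaffolding gap is a simplification.

```python
import re

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the scaffold name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def gaps_per_scaffold(lines):
    """Count runs of N (upper or lower case) in each scaffold."""
    return {name: len(re.findall(r"[Nn]+", seq))
            for name, seq in read_fasta(lines)}
```

Chromosomal scaffolds with a gap count of zero, and with telomeric arrays at both ends where the taxon has them, are the candidates for telomere-to-telomere status.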

In 2022, a complete assembly for human (CHM13 double haploid cell line), which includes all chromosomal sequence without gaps, was published. There are efforts to develop a standard pipeline to attain this goal for other species, including for heterozygous diploid samples. Current recommendations are to generate >60 fold coverage in PacBio HiFi sequence and >40 fold coverage in ultralong (>100 kb) ONT sequence, for a diploid, in addition to deep (>50 fold coverage) Hi-C for confirmation. A dedicated assembler for this configuration is Verkko. Recent versions of hifiasm also support integration of HiFi, ultralong ONT and Hi-C data. There are also investigations into an ONT-only recipe, using APK (Assembly Polishing Kit), ULK (Ultra Long Sequencing Kit) and Pore-C/Hi-C.

Workflows

We support EBP projects using Nextflow workflows registered at https://workflowhub.eu/.

Workflows developed using the approaches and tools listed above (and most likely others) are available under these WorkflowHub collections:

●      VGP: https://workflowhub.eu/collections/8

●      ERGA: https://workflowhub.eu/programmes/33

●      Sanger Tree of Life: https://workflowhub.eu/programmes/37

General references

  1. Rhie, Arang, Shane A. McCarthy, Olivier Fedrigo, Joana Damas, Giulio Formenti, Sergey Koren, Marcela Uliano-Silva, et al. 2021. “Towards Complete and Error-Free Genome Assemblies of All Vertebrate Species.” Nature 592 (7856): 737–46.

    ○ older CLR pipeline available at https://github.com/VGP/vgp-assembly

  2. Larivière, Delphine, Linelle Abueg, Nadolina Brajuka, Cristóbal Gallardo-Alba, Bjorn Grüning, Byung June Ko, Alex Ostrovsky, et al. 2023. “Scalable, Accessible, and Reproducible Reference Genome Assembly and Evaluation in Galaxy.” bioRxiv. https://doi.org/10.1101/2023.06.28.546576.

    ○  Galaxy pipeline tutorial available at https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html

  3. Kerstin Howe, William Chow, Joanna Collins, Sarah Pelan, Damon-Lee Pointon, Ying Sims, James Torrance, Alan Tracey, Jonathan Wood, “Significantly improving the quality of genome assemblies through curation”, GigaScience, Volume 10, Issue 1, January 2021, giaa153, https://doi.org/10.1093/gigascience/giaa153

  4. Heng Li and Richard Durbin “Genome assembly in the telomere-to-telomere era”, Nat Rev Genet. 2024 doi:10.1038/s41576-024-00718-w

Tool/workflow references

  5. hifiasm Cheng, Haoyu, Gregory T. Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. 2021. “Haplotype-Resolved de Novo Assembly Using Phased Assembly Graphs with Hifiasm.” Nature Methods 18 (2): 170–75.

    ○      https://github.com/chhylp123/hifiasm

  6. HiCanu Nurk, Sergey, Brian P. Walenz, Arang Rhie, Mitchell R. Vollger, Glennis A. Logsdon, Robert Grothe, Karen H. Miga, Evan E. Eichler, Adam M. Phillippy, and Sergey Koren. 2020. “HiCanu: Accurate Assembly of Segmental Duplications, Satellites, and Allelic Variants from High-Fidelity Long Reads.” Genome Research 30 (9): 1291–1305.

    ○      https://github.com/marbl/canu

  7. Verkko Rautiainen, Mikko, Sergey Nurk, Brian P. Walenz, Glennis A. Logsdon, David Porubsky, Arang Rhie, Evan E. Eichler, Adam M. Phillippy, and Sergey Koren. 2023. “Telomere-to-Telomere Assembly of Diploid Chromosomes with Verkko.” Nature Biotechnology, February. https://doi.org/10.1038/s41587-023-01662-6.

    ○      https://github.com/marbl/verkko

  8. Oatk

    ○      https://github.com/c-zhou/oatk

  9. FASTK

    ○      https://github.com/thegenemyers/FASTK

    ○      https://github.com/thegenemyers/GENESCOPE.FK

    ○      https://github.com/thegenemyers/MERQURY.FK

  10. KMC Marek Kokot, Maciej Długosz, Sebastian Deorowicz. 2017. “KMC 3: counting and manipulating k-mer statistics” Bioinformatics 33:2759–2761. https://doi.org/10.1093/bioinformatics/btx304

    ○      https://github.com/refresh-bio/KMC

  11. GenomeScope Vurture, Gregory W., Fritz J. Sedlazeck, Maria Nattestad, Charles J. Underwood, Han Fang, James Gurtowski, and Michael C. Schatz. 2017. “GenomeScope: Fast Reference-Free Genome Profiling from Short Reads.” Bioinformatics 33 (14): 2202–4.

    ○      https://github.com/schatzlab/genomescope

  12. Meryl/Merqury Rhie, A., Walenz, B.P., Koren, S. et al. 2020. “Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies.” Genome Biol 21:245. https://doi.org/10.1186/s13059-020-02134-9

    ○      https://github.com/marbl/meryl

  13. Shasta Shafin, K., Pesout, T., Lorig-Roach, R. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38, 1044–1053 (2020). https://doi.org/10.1038/s41587-020-0503-6

    ○      https://github.com/paoloshasta/shasta 

  14. Flye Mikhail Kolmogorov, Derek M. Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko, Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy P. L. Smith and Pavel A. Pevzner "metaFlye: scalable long-read metagenome assembly using repeat graphs", Nature Methods, 2020 https://doi.org/10.1038/s41592-020-00971-x

    ○      https://github.com/fenderglass/Flye

  15. Marvel

    ○      https://github.com/schloi/MARVEL

  16. Falcon-unzip Chen-Shan Chin, Paul Peluso, Fritz J Sedlazeck, Maria Nattestad, Gregory T Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R Ecker, Dario Cantu, David R Rank, Michael C Schatz “Phased diploid genome assembly with single-molecule real-time sequencing”, Nat Methods 2016

    ○      https://github.com/PacificBiosciences/FALCON/wiki/FALCON-FALCON-Unzip-%22For-Phased-Diploid-Genome-Assembly-with-Single-Molecule-Real-Time-Sequencing%22

  17. wtdbg2 Ruan, J. and Li, H. “Fast and accurate long-read assembly with wtdbg2”. Nat Methods 2019

    ○      https://github.com/ruanjue/wtdbg2

  18. YaHS Chenxi Zhou, Shane A McCarthy, Richard Durbin “YaHS: yet another Hi-C scaffolding tool” Bioinformatics 2023

    ○      https://github.com/c-zhou/yahs

  19. SALSA2 Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M. Phillippy, Sergey Koren “Integrating Hi-C links with assembly graphs for chromosome-scale assembly” PLOS Computational Biology 2019

    ○      https://github.com/marbl/SALSA

  20. HiRise

    ○      https://github.com/DovetailGenomics/HiRise_July2015_GR

  21. 3D-DNA Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. Apr 7; 356(6333):92-95. doi: https://doi.org/10.1126/science.aal3327. Epub 2017 Mar 23.

    ○      https://github.com/aidenlab/3d-dna

  22. purge_dups Dengfeng Guan, Shane A McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, Richard Durbin (2020). Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, Volume 36, Issue 9, May 2020, Pages 2896–2898

    ○      https://github.com/dfguan/purge_dups

  23. BlobToolKit Laetsch DR, Blaxter ML (2017) BlobTools: Interrogation of genome assemblies. F1000Research https://f1000research.com/articles/6-1287/v1

    ○      https://github.com/blobtoolkit/blobtoolkit?tab=readme-ov-file

  24. Kraken2 Derrick E. Wood, Jennifer Lu & Ben Langmead (2019) Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257

    ○      https://github.com/DerrickWood/kraken2?tab=readme-ov-file

  25. MitoHiFi Marcela Uliano-Silva, João Gabriel R. N. Ferreira, Ksenia Krasheninnikova, Darwin Tree of Life Consortium, Giulio Formenti, Linelle Abueg, James Torrance, Eugene W. Myers, Richard Durbin, Mark Blaxter & Shane A. McCarthy (2023) MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics volume 24, Article number: 288

    ○      https://github.com/marcelauliano/MitoHiFi

  26. MitoVGP Giulio Formenti, Arang Rhie, Jennifer Balacco, Bettina Haase, Jacquelyn Mountcastle, Olivier Fedrigo, Samara Brown, Marco Rosario Capodiferro, Farooq O. Al-Ajli, Roberto Ambrosini, Peter Houde, Sergey Koren, Karen Oliver, Michelle Smith, Jason Skelton, Emma Betteridge, Jale Dolucan, Craig Corton, Iliana Bista, James Torrance, Alan Tracey, Jonathan Wood, Marcela Uliano-Silva, Kerstin Howe, Shane McCarthy, Sylke Winkler, Woori Kwak, Jonas Korlach, Arkarachai Fungtammasan, Daniel Fordham, Vania Costa, Simon Mayes, Matteo Chiara, David S. Horner, Eugene Myers, Richard Durbin, Alessandro Achilli, Edward L. Braun, Adam M. Phillippy, Erich D. Jarvis & The Vertebrate Genomes Project Consortium (2021) Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome Biology volume 22, Article number: 120

    ○      https://github.com/gf777/mitoVGP

  27. Medaka

    ○      https://github.com/nanoporetech/medaka

  28. Arrow

    ○      https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v10.1.pdf

  29. bwa Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2

    ○      https://github.com/lh3/bwa

  30. DeepVariant

    ○      https://github.com/google/deepvariant

  31. Merfin Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01445-y

    ○      https://github.com/arangrhie/merfin

  32. bcftools Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li (2021) Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

    ○      https://github.com/samtools/bcftools

  33. WinnowMap Chirag Jain, Arang Rhie, Nancy Hansen, Sergey Koren and Adam Phillippy (2022) Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, pages 705–710

    ○      https://github.com/marbl/Winnowmap

  34. FCS-GX Astashyn A, Tvedte ES, Sweeney D, Sapojnikov V, Bouk N, Joukov V, Mozes E, Strope PK, Sylla PM, Wagner L, Bidwell SL, Brown LC, Clark K, Davis EW, Smith-White B, Hlavina W, Pruitt KD, Schneider VA, Murphy TD (2024) Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024 Feb 26;25(1):60.

    ○      https://github.com/ncbi/fcs-gx

  35. PretextView

    ○      https://github.com/sanger-tol/PretextView

  36. JuiceBox Neva C. Durand, James T. Robinson, Muhammad S. Shamim, Ido Machol, Jill P. Mesirov, Eric S. Lander, and Erez Lieberman Aiden (2016) Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016 Jul; 3(1): 99–101.

    ○      https://github.com/aidenlab/Juicebox

  37. yak

    ○      https://github.com/lh3/yak

  38. BUSCO Mathieu Seppey, Mosè Manni, Evgeny M Zdobnov (2019) BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods Mol Biol: 1962:227-245. doi: 10.1007/978-1-4939-9173-0_14.

    ○      https://busco.ezlab.org/

  39. herro Stanojevic, D., Lin, D., Florez De Sessions, P., & Sikic, M. (2024). Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. bioRxiv, 2024-05. doi:10.1101/2024.05.18.594796

    ○      https://github.com/lbcb-sci/herro

  40. LJA Anton Bankevich, Andrey V. Bzikadze, Mikhail Kolmogorov, Dmitry Antipov & Pavel A. Pevzner (2022) Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology volume 40, pages 1075–1081

    ○      https://github.com/AntonBankevich/LJA

  41. tidk Brown, M., González De la Rosa, P. M. and Mark, B. (2023) ‘A Telomere Identification Toolkit’. Zenodo. doi: 10.5281/zenodo.10091385.

    ○      https://github.com/tolkit/telomeric-identifier

  42. GRIT Rapid Curation

    ○      https://gitlab.com/wtsi-grit/rapid-curation

  43. TreeVal

    ○      https://pipelines.tol.sanger.ac.uk/treeval

  44. ASCC Assembly Screen for Cobionts and Contaminants

    ○      https://github.com/sanger-tol/ascc

  45. APK (Assembly Polishing Kit)

    ○      https://nanoporetech.com/document/telomere-to-telomere-sequencing-t2t-on-promethion-sqk-apk114-sqk

  46. ULK (Ultra Long Sequencing Kit)

    ○      https://nanoporetech.com/document/telomere-to-telomere-sequencing-t2t-on-promethion-sqk-apk114-sqk

  47. metaMDBG Gaëtan Benoit, Sébastien Raguideau, Robert James, Adam M. Phillippy, Rayan Chikhi & Christopher Quince  (2024) High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology 42, 1378–1383

    ○      https://github.com/GaetanBenoitDev/metaMDBG

  48. CheckM2 Alex Chklovski, Donovan H. Parks, Ben J. Woodcroft & Gene W. Tyson (2023) CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods 20, 1203–1212

    ○      https://github.com/chklovski/CheckM2


ABOUT THE SUBCOMMITTEE

This Report on Assembly Recommendations was developed by EBP’s Scientific Subcommittee for Sequencing and Assembly.