Genome Limits
Not all genomes are created equal. In this section, EBP contributors explore the biological and technical limits that continue to challenge biodiversity genomics — from enormous genome sizes and extreme repetitiveness to degraded samples and organisms that resist standard sequencing approaches. Their reflections highlight how advances in sequencing technologies, assembly methods, and computational tools are steadily pushing those limits outward, opening access to increasingly difficult branches of the Tree of Life.
What have you learned from working on hard genomes?
Kerstin Howe in her office (with her favourite painting of lichens by Samantha Clark and a Hi-C map of a tetraploid assembly on her screen).
What is the most intimidating genome you’ve worked on, and why?
Lots of factors can make a genome difficult to sequence and assemble; the most straightforward is probably genome size. Not only does it cost a lot of money to sequence a large genome, large sizes also strain compute performance and might even render certain processes impossible.
For me, our most intimidating genome so far was therefore the mistletoe (Viscum album), at over 90 Gb in size (30 times larger than the human genome). At least it wasn’t polyploid… We generated a staggering 6 Tbp of HiFi reads and nearly 2 Tbp of Hi-C data to get enough coverage. And then, after some bioinformatics magic to allow this amount of data to be digested, it surprisingly worked. The initial draft assembly already looked very good. As the resolution for visualising curation data is restricted, we needed to modify this process, too. Mistletoe has 10 chromosomes, each three times the size of the complete human genome, and they didn’t all fit together into one curatable Hi-C map. We therefore separated the chromosomal scaffolds and combined each with all the sub-chromosomal scaffolds/contigs to allow for error correction within the former and placement of the latter, then combined everything again to get the final picture. And it worked beautifully! Submitting the final genome assembly posed another hurdle, as INSDC is not equipped to take single sequences bigger than 2 Gbp, so in order to submit the genome, with each chromosomal scaffold around 10 Gbp, we had to cut the sequences apart again. The information on how to stitch everything together is included in the submission though, so all was fine in the end. The next challenge is the annotation, but even there we already got some promising first results. Hopefully more, soon.
We were somewhat lucky with mistletoe, as lots of biomaterial was available to generate enough data and the repeat content was varied enough not to lead to extended assembly collapses. What really intimidates me now are large (and even worse if polyploid) genomes in really small species… thanks, but no thanks, to some dinoflagellates for now.
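To put the mistletoe numbers in perspective, here is a minimal Python sketch of the arithmetic behind the sequencing depth and the submission workaround. It only uses figures quoted above (about 90 Gb genome, 6 Tbp of HiFi reads, scaffolds of around 10 Gbp, a 2 Gbp per-sequence limit). The names `mean_coverage`, `split_scaffold`, `stitched_length` and `Piece` are hypothetical, and the fixed-size cutting scheme is an assumption for illustration, not the procedure the team actually used.

```python
from dataclasses import dataclass

GBP = 1_000_000_000
MAX_RECORD_BP = 2 * GBP  # per-sequence limit quoted in the text


@dataclass(frozen=True)
class Piece:
    """One submitted piece of a scaffold (hypothetical record layout)."""
    parent: str  # name of the chromosomal scaffold
    part: int    # 1-based order of the piece within the scaffold
    start: int   # 1-based inclusive start on the scaffold
    end: int     # 1-based inclusive end on the scaffold

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def mean_coverage(total_read_bp: int, genome_bp: int) -> float:
    """Nominal depth: total sequenced bases divided by genome size."""
    return total_read_bp / genome_bp


def split_scaffold(name: str, length_bp: int,
                   max_bp: int = MAX_RECORD_BP) -> list[Piece]:
    """Cut a scaffold into consecutive pieces no longer than max_bp."""
    if length_bp <= 0 or max_bp <= 0:
        raise ValueError("lengths must be positive")
    pieces = []
    start, part = 1, 1
    while start <= length_bp:
        end = min(start + max_bp - 1, length_bp)
        pieces.append(Piece(name, part, start, end))
        start, part = end + 1, part + 1
    return pieces


def stitched_length(pieces: list[Piece]) -> int:
    """Check the pieces tile the scaffold without gaps or overlaps,
    then return the length of the reassembled sequence."""
    expected_start = 1
    for piece in sorted(pieces, key=lambda p: p.part):
        if piece.start != expected_start:
            raise ValueError(f"gap or overlap before part {piece.part}")
        expected_start = piece.end + 1
    return expected_start - 1


if __name__ == "__main__":
    depth = mean_coverage(6 * 10**12, 90 * GBP)
    print(f"nominal HiFi depth: about {depth:.0f}x")
    pieces = split_scaffold("chr1", 10 * GBP)
    print(f"{len(pieces)} pieces, stitched length {stitched_length(pieces)} bp")
```

Under these assumptions, 6 Tbp of reads over a 90 Gb genome gives a nominal depth of roughly 67-fold, and a 10 Gbp scaffold needs at least five pieces to stay within the limit. The piece coordinates play the role of the stitching information mentioned above: they are what lets a user rebuild each chromosome from its parts.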
How has the push toward telomere-to-telomere genomes changed what we consider “complete”?
Giulio Formenti: The push toward telomere-to-telomere (T2T) assemblies has fundamentally redefined what we mean by a “complete” genome. Until recently, many chromosome-level references were considered finished despite containing unresolved gaps, collapsed repeats, and missing centromeric or subtelomeric regions. T2T efforts have shown that these omitted regions often contain important biology, including genes, regulatory elements, structural variants, and key chromosomal features. In our recent T2T zebra finch assembly, for example, completing the genome added nearly 90 million base pairs of previously missing sequence and enabled the first sequence-level characterization of avian centromeres in this species and of a large amplicon gene array on chrZ. “Complete” no longer means simply scaffolded into chromosomes—it increasingly means every chromosome is resolved end-to-end, with all major repetitive and structurally complex regions represented.
Giulio Formenti is a Research Assistant Professor at The Rockefeller University, Co-Director and Bioinformatics Lead of the Vertebrate Genome Laboratory, and Chair of the Assembly Group for the Vertebrate Genomes Project (VGP).
Mark Blaxter
Which species has most surprised you by how difficult it was to work with, and why?
Mark Blaxter: In the first days of the Tree of Life programme at Sanger, we collected and froze specimens of some very common land snails in the UK: the banded grove and field snails Cepaea hortensis and Cepaea nemoralis. These banded snails have been the subject of genetic and ecological research for a century, and I hoped that one of the first fruits of our genomics efforts would be reference genomes that would allow snail colour pattern researchers to finally solve the genetic riddle of how the banding patterns are controlled. Fast forward five years and finally we released reference genomes…
Why was it so hard? It turned out to be very difficult to extract long DNA that would sequence well with either PacBio or ONT technologies: umpteen cells were run with tiny, tiny data yields and ever more frustrated lab teams and Cepaea collaborators. Extensive development of extraction methods to solve the Cepaea problem now means we are confident in being able to generate sequenceable DNA from any snail…