Report on Analysis Standards

Version 1.0—March 2021

Authors: Kerstin Lindblad-Toh, Juan Carlos Castilla Rubio, Federica Di Palma, Rebecca Johnson, Jose Victor Lopez, Tomas Marques, Stephen Richards, Sunil Kumar Sahu, Pamela Soltis

A. Purpose and Audience:

The EBP is a global community of genomics researchers who have interest or expertise in using genomics to understand and improve our world. This document forms part of a suite of documents aimed at any researcher with an interest in genomics and who may wish to gain knowledge or work with other experts on topics such as:

Ethics
IT and Informatics
Sample Collection and Processing
Whole Genome Assembly
Whole Genome Annotation
Analysis of genomic data, from either a comparative genomics point of view or through lineage-specific analyses

The goal of this Analysis Standards for Genomic Data document is to acquaint researchers with current analyses and uses for genomic data, including directing them towards useful resources that may be standard across some analysis types. One of the challenges of capturing ‘analysis standards’ for the diverse needs of a genomics community is that there are many uses/purposes for these data. With that in mind, this document aims to:

Capture the most common analyses used to interrogate de novo genomes
Catalogue other types of analyses that may not rely only on whole genomes (this is particularly important for those working on understudied phyla or ‘dark taxa’)
Serve as a platform to connect biologists/genomicists with other members of the EBP community and with stakeholders outside the scientific community, to foster collaboration and data sharing in line with the access and benefit sharing provisions of the Convention on Biological Diversity, its Nagoya Protocol, and related international agreements and regulations
Highlight compute/hardware that may be accessible for potential users
Highlight training opportunities for those interested in learning and running their own analyses

We acknowledge that, as much as possible, the genomics community will benefit from using similar methodologies and comparable data generated across species. However, we note that different groups of species can have very different genome sizes and complexity. We also note that, while we aim for the proposed standards and metrics, different projects may produce different levels of sequence coverage and assembly contiguity depending on their end-use goal or sample quality. The quality of genome annotation may also vary, depending in part on whether transcriptome data are available for the particular species.

As a practical goal, we wish to list and link to recommended software packages that can be used at scale, as well as to GitHub repositories and examples of recommended pipelines and run commands. To aid in our educational goals, each of these would ideally be paired with a tutorial describing the goals of the analyses and how to perform them. These materials could also be used in training workshops (both in-person and online).

B. Underlying Data:

Prior to most analyses described below, a genome assembly and associated annotations are typically generated:

B1. Basic Assembly and Genome Statistics

As part of the assembly process and standard quality control, several parameters are collected (see Appendix 1, currently available with the EBP Assembly Committee’s draft standards).

B2. Annotation

The goal of the annotation process is to understand the biological content and function of the sequenced genome. Specifically, the annotations will include:

Simple repeats and transposable elements
Functional sequence features such as CpG islands
Protein-coding genes
Non-coding transcripts including small RNAs (sRNA) and lncRNAs

The annotation goal is described here and in Appendix 2.

C. Genome Analysis:

Genome analysis targets questions related to genome evolution, population genomics to aid conservation, and biodiversity in ecological samples. We outline some of the more common questions and current approaches below:

Alignments and synteny analysis of related species
Repeat content and evolution
Partial or whole-genome duplication
Updating the evolutionary tree
Evolutionary constraint and accelerated evolution
Analysis of gene content, gene family expansion/contraction and selection on protein-coding genes
Analysis of non-coding transcripts
Intraspecific variation, conservation, biodiversity and adaptation
Environmental DNA and/or ecological samples

C1. Alignments of Genomes and Synteny Analysis

An alignment forms the basis for comparative analysis across species. Alignments can be generated from BLAST reciprocal best matches, at the nucleotide level for evolutionarily close species and at the protein level (e.g. with TBLASTN) for comparisons across wider divergence. CACTUS is a newer alignment tool that generates a reference-free genome sequence alignment that can be projected onto any one of the genomes in the alignment. It works by reconstructing the ancestral sequence for each pair of most closely related species. In theory, the alignment can be updated either by swapping out less contiguous assemblies for upgraded genomes or by adding in new genomes. The CACTUS team is working on making it more efficient and easier for others to run (Benedict Paten, personal communication). Currently, it can handle genomes up to 10 Gb, but even 3 Gb genomes require significant computational capacity at present. It works less well for distantly related species (divergence times >30-50 million years, depending on the taxa). CACTUS has been run for mammals, birds, insects and plants with smaller genomes (Arabidopsis).
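The reciprocal-best-match idea described above can be illustrated with a short sketch. It assumes that the two searches (species A against species B, and B against A) have already been run and reduced to (query, subject, score) tuples, for example parsed from tabular BLAST output. The function names, gene identifiers and scores below are hypothetical and serve only as an illustration, not as a recommended pipeline.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy input with hypothetical gene identifiers and bit scores.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0),
          ("A2", "B2", 180.0), ("A3", "B2", 200.0)]
b_vs_a = [("B1", "A1", 245.0), ("B2", "A3", 198.0), ("B2", "A2", 150.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy input, A2 also hits B2, but B2's best match is A3, so only the A1/B1 and A3/B2 pairs are retained. Real analyses would typically also apply e-value and alignment-coverage filters before taking best hits.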

Methods:

C2. Repeat Content and Evolution

Catalogs of simple sequence repeats and transposable elements will be generated as part of the genome annotation. Repeat expansion is a common mechanism by which genome size changes. Inserted repeat elements may be exapted into novel regulatory elements or transcripts that may regulate gene function. Based on the sequence similarity and distribution of repeats, it is possible to form hypotheses about how repeat content has changed over time in a species of interest.

Methods:

RepeatMasker and RepeatModeler
REPET
MITE-Hunter and LTRharvest (de novo discovery)

C3. Partial or Whole-genome Duplication

In addition to repeat content, genome size is a function of the gain and loss of sequence through gene and genome duplication. Some lineages, such as plants, have often undergone one or more episodes of whole-genome duplication. Partial or whole-genome duplications allow the divergent evolution of duplicated genes. Synteny between species, together with internal similarity between sections of a genome, allows analysis of gain and loss.

Methods:

Read depth coverage
Alignments and curation
Examine lengths of elements, homology and copy number
Ks Plots
Synteny Analysis (e.g. via CoGe)

C4. Species Trees

Species trees are desirable for a number of analyses, such as calling evolutionary constraint, detecting positive selection, delineating species boundaries/hybridization events and inferring evolutionary relationships. The tree may be built using different types of data, including ancestral repeats, genes, and 4-fold degenerate sites. At the family level, the species tree will be fairly stable, at least for well-understood groups, regardless of data source; however, for more closely related species, the tree might vary between genomic regions.
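As an illustration of one of the data types mentioned above, the following sketch identifies 4-fold degenerate sites in a coding sequence, that is, third codon positions at which any of the four bases encodes the same amino acid. It assumes the standard genetic code and an in-frame coding sequence; the function name and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_sites(cds):
    """Return 0-based positions of third codon bases that are 4-fold degenerate."""
    cds = cds.upper()
    sites = []
    # Walk complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon not in CODON_TABLE:
            continue  # skip codons containing ambiguous bases such as N
        encoded = {CODON_TABLE[codon[:2] + base] for base in BASES}
        if len(encoded) == 1:
            sites.append(start + 2)
    return sites


# Hypothetical in-frame sequence: ATG GCT CTG AAA TAA
sites = fourfold_sites("ATGGCTCTGAAATAA")
print(sites)
```

Here only the third positions of GCT (alanine) and CTG (leucine) qualify. In practice these sites would be extracted from aligned coding sequences across species before tree building.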

Methods and Data Resources:

C5. Evolutionary Constraint and Accelerated Evolution

To identify functional elements in the genome of each species, there is a need to identify evolutionarily constrained regions; this will identify not only coding genes, but also regulatory elements. Depending on the size of the species groups analyzed together, the genome size and the evolutionary distance, evolutionary constraint can be detected either at single-base resolution, when the power is high, or for elements of a certain minimum size, when the power is lower. (For 240 mammals, with a total branch length of 16 substitutions per site, the power is sufficient to detect single-base constraint.) Constraint can be called with a number of tools including GERP, SiPhy, PhastCons, and PhyloP. Constraint scores are often useful for identifying ultraconserved elements, which may encompass enhancers, and also for detecting promoters, insulators and other regulatory elements.

One way to identify positive selection across the whole genome is the detection of accelerated regions (ARs), either by looking for adaptation in a specific species or by looking for adaptation and signs of convergent evolution on specific branches of a tree. This analysis relies heavily on PhyloP, as well as on custom scripts and careful manual curation, as there are many potential sources of artifacts in the detection of ARs. Selection on localized regions of the genome can be detected via searches for selective sweeps.

Methods:

GERP
PhastCons
PhyloP (PMID: 19858363)
SweeD

C6. Analysis of Gene Content, Gene Family Expansion and Selection on Protein-coding Genes

Both gene family expansions and contractions, as well as positive selection on specific regions of a protein, are important for species evolution. Typically, a majority of genes are 1:1 orthologs across larger species groups (e.g. ~14,000, or ~70%, in mammals). For further analysis of gene content and orthologs using the annotated protein-coding genes, we recommend https://www.orthodb.org. To detect gene family expansions, whole-genome alignments and synteny can be used. These studies can also detect horizontal gene transfer. In addition, members of a specific gene family can be identified by using TBLASTN with a protein sequence as the query to find similar copies in the genome. To detect selection within protein-coding genes, we recommend using PhyloP and/or CodeML on the protein-coding sequence.
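The distinction between 1:1 orthologs and expanded or contracted families can be sketched as follows. This is a minimal illustration that assumes orthology groups have already been computed and loaded into a dictionary mapping each group to its genes per species; the data structure, function name and identifiers are hypothetical and do not reflect the actual OrthoDB data format.

```python
def classify_orthogroups(orthogroups, species):
    """Split orthogroups into 1:1 groups (one gene in every species) and others."""
    one_to_one, other = [], []
    for name, members in orthogroups.items():
        counts = [len(members.get(sp, [])) for sp in species]
        if all(count == 1 for count in counts):
            one_to_one.append(name)
        else:
            other.append(name)
    return one_to_one, other


# Toy input with hypothetical group and gene identifiers.
species = ["human", "mouse", "dog"]
orthogroups = {
    "OG1": {"human": ["h1"], "mouse": ["m1"], "dog": ["d1"]},        # 1:1:1
    "OG2": {"human": ["h2", "h3"], "mouse": ["m2"], "dog": ["d2"]},  # expansion in human
    "OG3": {"human": ["h4"], "mouse": ["m3"]},                       # absent in dog
}

one_to_one, other = classify_orthogroups(orthogroups, species)
fraction_one_to_one = len(one_to_one) / len(orthogroups)
print(one_to_one, other, fraction_one_to_one)
```

Groups that fall into the second list are candidates for follow-up as possible expansions, contractions or losses, subject to checks for annotation and assembly artifacts.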

Methods:

Read depth
OrthoDB
CodeML
Conserved motif identification with MEME (Multiple EM for Motif Elicitation, where EM stands for expectation maximization)

C7. Analysis of Non-coding Transcripts

Non-coding transcripts have a key role in genome regulation and function. These include both lincRNAs and miRNAs and will be identified as part of the genome annotation process. Non-coding transcripts typically evolve more rapidly than protein-coding genes, so analyses of the evolution of non-coding transcripts in different species are important. However, caution is needed, as the catalog of transcripts depends heavily on which RNA-seq data were used for the annotation. Analysis can be performed either by direct comparison of transcripts between related species or by comparison using the whole-genome alignment/synteny.

For annotation of non-coding RNAs, we recommend using methodology such as FEELnc, based on RNA evidence, as well as transferring annotations by liftover between closely related species. In Drosophila, this works for species that diverged up to ~40 million years ago; this would correspond to, for example, within-family comparisons or genomes <10-20% divergent.

Methods:

C8. Intraspecific Variation, Conservation, Biodiversity and Adaptation

As more and more species become endangered or critically endangered, the need for genomic information to guide conservation efforts increases. To generate data for this analysis multiple tiers can be applied, depending on sample availability.

Using only the individual sequenced for the high-quality genome. These data can be analyzed using: (1) the fraction of sites at which the sequenced individual is heterozygous (“overall heterozygosity”); and (2) the proportion of the genome residing in extended regions without any variation (“segments of homozygosity”, or SoH). This measurement is crude but can be used both for marker generation and for a first look at homozygous regions. Additionally, population histories can be estimated from only a single genome using the pairwise sequentially Markovian coalescent (PSMC) and the later MSMC methods, enabling better understanding of the impact of humanity on species populations.

To generate a better marker panel and expand the search for homozygous regions, 3 or 4 more individuals can be sequenced. Such marker panels can frequently be used to monitor populations, which is key for conservation efforts.

To allow a deeper analysis of population structure and regions under selection, analysis of runs of homozygosity (RoH) can be performed within and across populations. For this, sequencing of 10-20 individuals each from multiple populations (this can be lower-coverage, short-read data) is a good target. Such data sets can reveal both more detailed geographic population structure and positive selection. Positive selection based on population structure is most frequently detected using Fst.

Basic population metrics could be captured by RADseq and other reduced representation or low-coverage methods currently used in many population studies, but these methods generally do not produce data that can be annotated for possible function.
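The two single-genome measurements described in the first tier above, overall heterozygosity and the proportion of the genome in segments of homozygosity, can be sketched as follows. The sketch assumes a single contig of known length, a list of 0-based heterozygous site positions and a user-chosen minimum segment length; the function names, the threshold and the toy values are hypothetical, and real analyses would also need to account for uncallable regions.

```python
def overall_heterozygosity(het_positions, seq_length):
    """Fraction of sites at which the individual is heterozygous."""
    return len(het_positions) / seq_length


def homozygous_fraction(het_positions, seq_length, min_segment):
    """Fraction of the sequence in variant-free stretches of at least min_segment bases."""
    # Sentinels at -1 and seq_length let the contig ends count as segment boundaries.
    boundaries = [-1] + sorted(het_positions) + [seq_length]
    covered = 0
    for left, right in zip(boundaries, boundaries[1:]):
        segment = right - left - 1  # bases strictly between two heterozygous sites
        if segment >= min_segment:
            covered += segment
    return covered / seq_length


# Toy contig of 100 bases with five heterozygous sites (hypothetical values).
het_positions = [10, 12, 15, 60, 95]
het = overall_heterozygosity(het_positions, seq_length=100)
soh = homozygous_fraction(het_positions, seq_length=100, min_segment=30)
print(het, soh)
```

With a minimum segment length of 30 bases, the two stretches of 44 and 34 variant-free bases are counted, giving 78 of 100 bases in segments of homozygosity.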

Methods:

Runs of homozygosity
Heterozygosity (GENHET) and tests for Hardy-Weinberg equilibrium (ARLEQUIN)
Fst
Estimation of population histories
Reconstruction of multi-generational pedigrees (SEQUOIA)
Genetic structure and admixture (STRUCTURE)
Marker generation
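To make the Fst measure listed above concrete, the following sketch computes Wright's Fst for a single biallelic site from the allele frequencies in two equally weighted populations, as (H_T - H_S) / H_T. This is a deliberately simple textbook form chosen for illustration; it is an assumption of this sketch and not the sample-size-corrected estimators that population genetics packages typically implement. The function name and the frequencies are hypothetical.

```python
def fst_biallelic(p1, p2):
    """Wright's Fst at one biallelic site from two population allele frequencies."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled population
    if h_total == 0:
        return 0.0  # both populations are fixed for the same allele
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies at three sites in two populations.
fst_same = fst_biallelic(0.5, 0.5)      # identical frequencies: no differentiation
fst_diverged = fst_biallelic(0.8, 0.2)  # partially differentiated
fst_fixed = fst_biallelic(1.0, 0.0)     # fixed for alternative alleles
print(fst_same, fst_diverged, fst_fixed)
```

Genomic scans for selection usually summarize such per-site values in windows along the genome and look for outlier windows relative to the genome-wide distribution.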

C9. Supporting Environmental DNA and/or Ecological Samples

eDNA analysis allows the characterization and analysis of threatened and non-threatened species within ecosystems. Analysis is typically performed using a metabarcoding approach but could be expanded to whole-genome sequencing. By generating a high-quality digital library based on whole-genome sequences to underpin the identification of eDNA samples, the EBP can accelerate this work. Thus, it is important to deposit not just genome sequences into INSDC databases, but also derived barcodes into appropriate electronic barcode sequence repositories, to accelerate their growth and enable identification of species from non-barcode sequences.

Methods and Resources:

eDNA metabarcoding
Shotgun sequencing
UPARSE
DADA2
BLAST

D. Specific Taxonomic Challenges and Scientific Questions:

D1. Plants/Algae

Patterns of ancient whole-genome duplication
Patterns of fractionation following whole-genome duplication
Gene family expansion/contraction/gain/loss
Repeat characterization and evolution
Patterns of reticulation
Syntenic relationships among species
Conservation and population genetics
Genome evolution – particularly for very large and very small genomes
Adaptation and genotype-phenotype relationships
Signatures of domestication

Methods and Resources:

D2. Vertebrates

Conservation and population genetics
Genome evolution
Adaptation/Convergent evolution
Annotation of the human genome

Methods and Resources:

Vertebrate Genomes Project (VGP)

D3. Arthropods

Conservation and population genetics
Invasive species
Characterizing the mechanisms underlying the vast number of arthropod species
Analysis of venomous, predatory and beneficial arthropods for agricultural biocontrol and enhancement of soil health and robustness
Connection of genotypic and phenotypic evolutionary change; due to the large number of species with small phenotypic variation, arthropods provide the largest opportunity to study animal phenotypic evolution.

Methods and Resources:

D4. Aquatic and Marine Organisms

Composition of organisms in habitat
Habitat loss, predictive monitoring for habitat changes using sentinel taxa
Conservation and population genetics
Genome evolution
Adaptation/Convergent evolution
Special collection and preservation practices

Methods and Resources:

Global Invertebrate Genomics Alliance

D5. Microbial Eukaryotes

Initial characterization of the many unknown taxa of microbial eukaryotes in different habitats
Relationship of gene content evolution and cellular structure
Evolution of the eukaryotic cell
Role of gene transfer in microbial eukaryote evolution
Interaction with the Global Virome Project, and related people/livestock movements that enable zoonotic virus spillover

Methods and Resources:

Handbook of the Protists

E. Supporting outside organizations for analysis outside the EBP mission:

We propose that there are many important analyses that exist outside the EBP mission that can be supported by creating active collaboration to rapidly disseminate EBP data. For example, eDNA sequencing, protein structure prediction, and mass-spec protein analysis often utilize annotated genomic sequence resources as analysis inputs, and collaboration with appropriate groups could decrease the update times and more rapidly increase the utility for these EBP stakeholders.

F. Nomenclature, standardization, and automation:

Using the nomenclature developed as part of the Tree of Life, with its six-digit species identification, we would like to attach this label to all analyses and to have a large searchable database showing which species have samples available, genomes available, transcriptome sequencing, annotation available, and standardized analyses that have been performed. We also see the need for vouchering to as great an extent as possible.

Several software packages, such as CACTUS, would benefit from being made both faster and more automatable. We argue that technical experts should work with biologists to continue scaling up methods to handle very large datasets, which is key for these analyses.

We also argue that it would be worth encouraging funding agencies to fund methods development and automation. This includes automating the tally and listing of completed genomes – EBP and non-EBP.

G. Education:

We propose to:

Generate shorter video tutorial presentations for each of the analysis procedures
Hold EBP-sanctioned, conservation-based training workshops, many of which could follow previously run successful models – e.g. ConGen2018 and Carpentries, Genomics, and Data Science training at the Smithsonian
Attach more in-depth analysis courses to existing conferences
Develop and implement programs such as summer internships to broaden participation in science and lead to a more diverse and inclusive genomics workforce
Generate a type of matchmaker service where younger PIs can connect with experts in different genomics fields.

Proposals should be encouraged that create innovative education programs that will help form bridges between the purely computational and organismal sciences. (The taxonomic disciplines in particular continue to experience decline and attrition [Bik 2017, PLoS Biology, 15(8), e2002231].) Genomics disciplines often need to train or recruit early-career scientists who are fluent in multiple languages, or to foster effective communication among those already established in each.

This can be connected to the EU education program IGNITE, which is described in another manuscript in preparation for this Cellular Genomics volume.

About the Subcommittee

This Report on Analysis Standards was developed by EBP’s Scientific Subcommittee for Data Analysis.
