Report on Sample Collection and Processing Standards

Version 1.0—March 2021


Authors: Mara Lawniczak, Mark Blaxter, Warren E. Johnson, Olga Vinnere Pettersson, Katie Barker and the Sample Collection and Processing Subcommittee.

A. PURPOSE AND AUDIENCE:

Here we address the desired sample collection and processing standards for the Earth BioGenome Project as of February 2021. This is a living document and we expect standards will be revised regularly for field work, metadata collection, and processing and preserving specimens to support generation of the highest quality sequencing data. The standards set out here explicitly address Phase 1 of the project, which aims to generate at least one high quality reference genome for every family, but we also discuss later phases of the project to indicate where further development is required. 

B. PHASES:

Phase 1: Considerations for Selecting Family Representatives

A variety of factors should be considered when suggesting a species representative of a family, and these are listed here in order of importance: 

  • Permissions and Availability: sampling is achievable taking into consideration permissions and legal obtainment. 

  • Community Value: of broad community use and value (this could be assessed through surveys oriented towards target communities, e.g., the Priority species survey for the Darwin Tree of Life project). 

  • Publicly Registered: the species is registered with its current name and taxonomy in a publicly available database (we recommend the NCBI Taxonomy Database) and assigned a numeric identifier to assist with tracking name and taxonomy changes over time.

  • Physical Size: given the limits of today’s technology, we propose a requirement for ten samples, each weighing more than 10 milligrams per 1 Gb of genome size for animals, fungi, and protists, and more than 100 milligrams per 1 Gb of genome size for plants (with a minimum of three samples to support the three platforms of long read, Hi-C, and RNAseq).

  • Species Representative: generally considered to be a “good” biological species (not from a known species complex) and if possible, sampled from or near the type locality.

  • Genome Size: where genome sizes and/or ploidy are known, prioritize species with smaller, diploid genomes (because costs of data generation will fall and our ability to assemble high repeat content genomes will improve in future EBP phases).

  • Taxonomic Stability: not subject to current disagreement and revision. 

We recommend that the global community come together to create the Phase 1 target species list. To this end, the next section describes a mechanism for both proposing target species and making publicly accessible the species for which high quality reference genome sequencing is actively underway in both small and large scale projects.

EBP Phase 1 Family-Representative Target Species: Openly Collating Community Proposals and Genomes Underway

There are over 9,000 Eukaryota families in the Catalogue of Life checklist, including 6,470 Animalia, 1,052 Chromista, 757 Fungi, 965 Plantae, and 221 Protozoa. An important collective step towards EBP ambitions is to gather the list of target species proposed to be family representatives to fulfill the Phase 1 goal of a high quality reference genome for every Eukaryotic family. Multiple species for each family should be proposed and collected, to provide greater flexibility in achieving Phase 1 goals.

As much as possible, this process should be globally transparent and open to input from the wider community that may not be actively generating reference genomes but who will benefit from their availability. To assist with this transparency and the need for community input, we have created the open EBP Family_Reference_Proposals spreadsheet, which can document suggestions on target taxa. Please note there are two tabs available for community entries: “Family Reference Suggestions” is there to collate recommendations from the community on ideal family level target species and “Family Reference Projects” is there to record reference genome sequencing projects already underway (more below) that are targeting a relatively small and clearly defined set of taxa. For larger scale projects that aim to sequence hundreds or thousands of species, target lists are likely to undergo revision as projects proceed. Therefore, we have created two distinct approaches to facilitate global transparency of target species lists depending on the size of the project. In both cases, these species and their associated projects will be searchable and displayed on the “Genomes on a Tree (GoaT) Service”, a website that is also useful for estimating genome size and chromosome numbers.

  • Small Defined Projects (e.g., 5 predetermined species) can list target species on the EBP Family_Reference_Proposals spreadsheet, which will be processed and displayed by GoaT. This sheet includes a tab that lists all families in the 2019 Catalogue of Life checklist. If you would like to propose a family representative, please give the species name (full Linnaean binomial), the taxonomic Family, and, if possible, your name and contact details. Please enter only species whose genomes you are sequencing or intend to sequence to the expected EBP standards. The list is and will remain public, so bear this in mind when entering personal details, which may be used to contact you.

  • Larger funded projects will have lists of their target species, and these will contribute both to local goals and to EBP Phase 1 goals. These larger lists are also likely to be revised as projects proceed based on, for example, challenges with acquiring the species or extracting sufficient material from it. We thus ask that larger projects publish on their website a file listing their target species, and supply the URL of the file to the Genomes on a Tree (GoaT) service. The format for the file is given in ANNEXE 1 below. GoaT will archive the file and process it for display.

Ethical Collecting

All collection activities should carefully follow institutional and national protocols, including but not limited to prior informed consent, compliance with the expectations of the Nagoya Protocol, and considerations of endangered species. Sample collectors should ensure that all local and national permissions for collection are in place, and that there is a record of these permissions that can be referred to if any questions arise as to whether a specimen was legally obtained. These permissions vary widely between countries, so it is beyond the scope of this document to summarize them. Best practice is to ensure that every specimen is collected legally within the applicable frameworks (including national and local rules, rules on endangered species, and rules on collecting in protected sites). 

It is of utmost importance that specimens and projects contributing to the EBP are legally obtained. Another complicating factor is that many specimens are likely to be collected and moved out of their country of origin for sequencing. In these cases, the Nagoya Protocol must be followed. Again, the precise guidance on how to follow the Nagoya Protocol is beyond the scope of this document, especially given that many countries interpret the protocol differently. At a high level, sample collectors who will be shipping specimens outside their country of origin must contact their local Access and Benefit-sharing Clearing-house to understand what the rules are, and to obtain a PIC (prior informed consent) and a MAT (mutually agreed terms on what the benefit is, which may be financial or academic, such as an acknowledgement on a paper or sharing of results). These documents should be written as broadly as possible to support the vision of the project. Countries receiving samples should ensure they have further permissions within the MAT to pass the samples on if there is any anticipation that this might be required.

Beyond the rules and regulations, collecting methods must be ethical and overcollection of any species should be avoided. Projects should consider what the best sampling strategies might be to avoid overcollection, for example, lineage focused bioblitzes with a group of taxonomic experts.  

Associated Metadata

The family level representatives for EBP must be accompanied by robust and complete metadata of all types. This means that each contributed specimen should be identified to species level by a taxonomic expert and that wherever possible, material from the same specimen should be independently DNA barcoded using appropriate markers and the data deposited on BOLD and in the INSDC database. These DNA barcodes will serve both to ensure that species with reference genomes have independently generated DNA barcode data and that the DNA barcodes match the resulting reference genome and no sample swaps have occurred along the way.  

Where these standards are not achievable, it is advisable to substitute a different representative for the family given the importance of ensuring the specimen is truly representative of the species to which it is assigned. Metadata fields and terms should be standardized across the project, and this is covered in a separate EBP Committee on IT/Informatics Best Practices, which draws on the efforts to standardize metadata collected by the Darwin Tree of Life Project in the UK. 

High quality images should accompany each contributed specimen, and these should be made publicly available. Ideally, these images will be found both on the BOLD database accompanying the DNA barcode and in a yet-to-be-designed portal for EBP that supports access to all metadata for each sequenced species. 

Voucher specimens from every family-representative should be obtained. These vouchers can take many forms, including tissue vouchers (discussed further below), viably frozen cell lines, image vouchers where tissue is not obtainable (e.g., the whole specimen is required for sequencing), and molecular vouchers of extracted RNA and DNA. Vouchers should be deposited in publicly accessible collection facilities located in the country of origin for each sequenced species, or where there is excess material, possibly spread across multiple repositories. It is recommended that subsamples and/or sample derivatives be stored in a GGBN member institution and linked to the voucher using a unique id. 

Processing and Preserving

As an organism is processed, it should be photographed alongside a tracking identifier (e.g., a SPECIMEN_ID) and alongside the (e.g., FluidX) barcodes of the tubes into which it is processed (Figure 1).

Figure 1. An example of the documentation that should occur as a sample is being processed where the SPECIMEN_ID (the NHM barcode under the fly) is photographed alongside the specimen and the barcoded tubes to which different samples of that specimen are destined. The metadata tracking sheet would thus have three entries for this fly, where collection-related information would be identical, but tissue type and tissue size would vary (e.g., head, thorax, and abdomen each in a separate tube). Photograph by Mara Lawniczak.

These photographs are useful for resolving sample tracking problems that can arise. We strongly encourage the use of barcoded tubes and scanning of these tubes rather than manual entry, which is prone to typos. For samples in the dozens or hundreds, this can be done with a simple single tube scanner or even with a phone and an app like EpiCollect (https://five.epicollect.net). For larger projects processing many hundreds or thousands of specimens, rack scanners can be used to scan whole racks of barcoded tubes in advance of sample processing. 

The Phase 1 ambition is to sequence species that are good family level representatives, and the guidelines above set out the factors to weigh when selecting appropriate species. Ultimately though, what is actually selected and sequenced is an individual or a set of individuals and should be recognized as such. Some further considerations when selecting the specimen that will be sequenced to represent the species and the family are discussed here. 

The target individual should be collected from the wild rather than from a laboratory colony or culture collection. One exception to this in Phase 1 might be single celled eukaryotic organisms that are in culture, as it remains challenging to generate high quality genomes from single cells. The size of the samples taken from a specimen will depend on the taxon, and the precise guidance around required input material is likely to be a rapidly moving target as required quantities decrease and our ability to achieve high quality extracts across a wide range of taxa increases. Our current recommendations for animals and fungi are at least 10 and preferably closer to 100 milligrams of tissue per 1 Gb of genome size for each sample. For plants and Chromista we suggest at least 100 and preferably 1000 milligrams of tissue per 1 Gb of genome size for each sample. Multiple samples from the same specimen should be prioritized over single samples from different specimens, and given the current need for samples to be directed down three different processes (Hi-C, HMW DNA extraction, RNA extraction), we recommend at least 5 and ideally up to 10 samples meeting these standards per species. If this is not possible from a single specimen, then additional specimens should be collected to reach similar quantities of tissue. This level of replication gives slack in the system for repeat extractions where sufficient quantities or qualities of data have not been achieved and also provides a biobank of tissue from specimens for the future when new approaches might be desired (e.g., protein or metabolite analysis, or new or improved genome/transcriptome sequencing technologies).
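The per-sample quantities above scale linearly with genome size, so the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names and group labels are hypothetical, and the default of 5 samples is the lower bound recommended above.

```python
# Recommended tissue per sample in milligrams per 1 Gb of genome size,
# as (minimum, preferred), taken from the recommendations above.
MG_PER_GB = {
    "animal": (10, 100),
    "fungus": (10, 100),
    "plant": (100, 1000),
    "chromista": (100, 1000),
}


def tissue_per_sample_mg(group, genome_size_gb):
    """Return (minimum, preferred) tissue mass in mg for a single sample."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    minimum, preferred = MG_PER_GB[group.lower()]
    return minimum * genome_size_gb, preferred * genome_size_gb


def minimum_total_mg(group, genome_size_gb, n_samples=5):
    """Minimum total tissue in mg across n_samples samples (5 to 10 advised)."""
    minimum, _ = tissue_per_sample_mg(group, genome_size_gb)
    return minimum * n_samples


# Hypothetical example: a plant with an estimated 2.5 Gb genome.
print(tissue_per_sample_mg("plant", 2.5))
print(minimum_total_mg("plant", 2.5))
```

For this hypothetical plant, each sample would need at least 250 mg of tissue, and five such samples would need at least 1250 mg in total.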

The sex of the specimen and the particular life stages and tissues that are best for different data types should be considered carefully. Where relevant and possible, it is preferable to sample from the heterogametic sex to provide data for both sex chromosomes. Ploidy levels can vary within some species, and if it is possible to assess this in advance of sequencing, we advise selecting individuals with lower ploidy. Recommendations on the best life stages and tissues to target to achieve the highest qualities and quantities of DNA, RNA, and nuclei will vary depending on the taxon. As the project progresses and we gain a better understanding of the factors that provide the best quantities and qualities, later versions of this document will collate at a high level specific tissues to be prioritized or avoided based on additional species (cobionts) that might be present in those tissue types (e.g., gut tissue/food organisms/microbiome will likely give many additional off-target sequences that may or may not be desirable). For many taxa where the majority of the specimen will be consumed in the process of data generation (e.g., most insects), it may be advisable to grind the whole organism to a fine powder in liquid nitrogen to avoid different data types being generated from different tissues (e.g., Hi-C data coming from the head, and long-read data coming from the abdomen, each tissue with distinct symbiomes).

Living specimens should be processed into tubes on dry ice and from that point forward, held at -80°C or below (e.g., in liquid nitrogen). Specimens that have died before processing tend to have damaged and degraded DNA and RNA and should not be submitted for long read or RNA sequencing. Living specimens should be taken to a site where dry ice and liquid nitrogen are available, humanely euthanized, and rapidly processed into small lentil-sized pieces while freezing, for example using a petri dish on dry ice and a scalpel. Small pieces of specimens sitting on dry ice as in Figure 1 have generally generated high quality DNA as long as the freezing process was rapid. Small tissue pieces can then be placed into barcoded tubes. Currently, for animals and fungi, we recommend several tubes each containing one piece of > 10mg tissue to support different workstreams without compromising the temperature of the remainder of the material through freeze-thaw cycles. Plants and Chromista should also be processed into small pieces to support rapid freezing, but larger volumes of tissue might be placed in tubes given up to ten times more tissue may be required to achieve adequate quantities of DNA for these groups.

Situations in which processing from living specimens and access to dry ice or liquid nitrogen are not possible are likely to increase in frequency as the project progresses. We are still learning which preservatives offer the best chance of successful long-read and long-range sequencing, and later versions of this document will collate these results at a high level. As of now, if there is no possibility of rapid processing and preservation of a specimen from living to -80°C or below, we suggest that samples are processed into small lentil-sized pieces in 100% ethanol for HMW DNA and Hi-C, and in RNALater for RNA. Lower percentages of ethanol are not advised as they seem to result in more degraded DNA. As soon as access to a -80°C freezer is available, the samples should be frozen and records kept on how long samples were held at room temperature [n.b. we have generated high quality genomes from mosquito specimens held at room temperature for over a week in an excess volume of 100% ethanol, lightly squished to compromise cuticle and support more rapid penetration of preservative, but further testing across the tree of life is required]. 

The exact tissue types recommended for HMW DNA and RNA for the wide range of target taxa are beyond the scope of this document, and will undoubtedly change as we experience successes and failures in extraction and sequencing. In the meantime, for annotation based on RNAseq, we suggest collecting a diversity of tissue types whenever possible, factoring in previous understanding of tissues for the focal taxa that have representative or higher than average transcript diversity. 

Specimens will often need to be shipped to a different location for further work. Customs clearance and import of biological material are often slow, and there is a risk of losing precious material if the cold chain is not maintained. To reduce this risk, proper legal documentation for export and import and associated metadata should accompany the specimens when shipped. Some couriers offer a dry ice top-up service for a considerable fee (e.g., World Courier). If tissues do not remain frozen for their entire journey, unless they are in a suitable preservative, they will not yield HMW DNA or high quality RNA. 

 

Looking to the Future

Current best practice assembly guidelines are to generate a combination of data types including long-read (PacBio HiFi and/or ultra long ONT), long-range (Hi-C and linked read), and RNAseq (Illumina short read, PacBio Iso-seq, or ONT cDNA-PCR) data from the same specimen wherever possible, aiming for the heterogametic sex when this is relevant. Separate samples currently head down these different routes, but we should be developing protocols that support minimal extraction of material sufficient for any of these types of data generation (e.g., nuclei extraction). Furthermore, in all SOPs we currently discard material that we might one day look back on and regret, such as proteins and metabolites. While data generation from these materials is currently out of scope, this is unlikely to be true in years to come and the retention of relevant material to add layers of additional data to the high quality reference genomes would be prudent. Thus, for specimens where samples are available in excess, consideration should be given to appropriate storage to future-proof these samples as much as possible. And for specimens where all material is used in the data generation, supernatants that are typically discarded could be retained for future investigations. 

We also encourage activities that simplify and streamline SOPs such that it becomes easier to “containerize” extraction, sequencing, and assembling activities. Containerization means a world in which a shipping container or crate could house everything needed to go from sample to sequence and would build capacity in Low and Middle Income Countries, which often harbor the greatest biodiversity. This relieves pressure on Nagoya Permit requirements and on budgets spent on expensive shipping to maintain the cold chain, but more importantly, it is better for global science.

C. ANNEXE 1: REPORTING PRIORITY TARGETS FOR EBP

See EBP Family Representative and Priority Declarations v2.

GoaT stores and serves data pertaining to species, and can infer values (for example, genome size) for previously unanalysed species by reference to their phylogenetic neighbours. It also displays species “status” using a set of flags provided by the community. One kind of data we wish to deliver through GoaT is the declaration that a species is of interest to, or is currently being sequenced by, a partner in the global EBP community.

Data destined for GoaT must follow the schema described below.

Process

Each major project should collate one or more files that contain the following:

  • A HEADER section declaring who made the list, when it was made, and what it contains

  • A set of lines declaring SPECIES STATUS.

The lists could be made in Excel or Google Sheets, and should be stored as .txt or .tsv (text files, either plain text or tab separated values). Please name the files using the format “[project]_[family|priority|all]_species_list.[txt|tsv]” and save them to a URL that can be accessed openly. Please then communicate the URL to Sujai Kumar. The file can be updated as frequently as is needed. GoaT will access the file weekly.
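A project could check its file names against this format before publishing them. The sketch below is one possible check in Python, not part of GoaT itself; the function name is hypothetical, and it assumes the project token contains only letters, digits, and hyphens, which the format above does not specify.

```python
import re

# Pattern for "[project]_[family|priority|all]_species_list.[txt|tsv]".
# Assumption: the project token is letters, digits, or hyphens only.
FILENAME_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9-]+)_(?P<scope>family|priority|all)"
    r"_species_list\.(?P<ext>txt|tsv)$"
)


def parse_list_filename(filename):
    """Return the filename parts as a dict, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


# Hypothetical file names for illustration.
print(parse_list_filename("dtol_family_species_list.tsv"))
print(parse_list_filename("dtol_species.xlsx"))
```

The check is deliberately strict, so a name that omits the family, priority, or all token is rejected rather than guessed at.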


Schema

The following schema defines ebp_species_goat_2.0. The header should contain the following, each line beginning with #. The header is likely to stay the same for each iteration of a particular list except for the date. We use the date to track changes. The last line is the declaration that what follows is the content: genus and species Linnaean binomen, its family (to disambiguate Linnaean namespace clashes) and the status. The whitespace in the lines should be spaces or tabs. If you export from MS Excel or Google Sheets, data in cells will be separated by tab characters.

# project_name [enter your project name]

# subproject_name [if it exists]

# primary_contact [given_name surname of person who “owns” the file]

# primary_contact_institution [address]

# primary_contact_email [email address]

# date_of_update [yyyy-mm-dd]

# schema_version ebp_species_goat_2.0

# Genus species family status
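As an illustration of how a list maintainer might check a header before publishing, here is a minimal Python sketch. It is not the GoaT implementation: the function name is hypothetical, and it assumes every field except subproject_name (marked “if it exists”) is required and that date_of_update is a valid ISO date.

```python
import datetime

# Assumption: all header fields except subproject_name are required.
REQUIRED_KEYS = (
    "project_name",
    "primary_contact",
    "primary_contact_institution",
    "primary_contact_email",
    "date_of_update",
    "schema_version",
)


def parse_header(lines):
    """Collect '# key value' header lines into a dict and validate them."""
    header = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between header lines
        if not line.startswith("#"):
            break  # first species line: the header is over
        parts = line.lstrip("#").split(None, 1)
        if not parts:
            continue
        if parts[0] == "Genus":
            break  # "# Genus species family status" declares the content
        header[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    missing = [key for key in REQUIRED_KEYS if not header.get(key)]
    if missing:
        raise ValueError("missing header fields: " + ", ".join(missing))
    # Raises ValueError if the value is not a valid ISO date.
    datetime.date.fromisoformat(header["date_of_update"])
    return header
```

Splitting each line on the first run of whitespace keeps values such as a contact name or postal address intact.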

SPECIES STATUS

This is a list of values, with one species per line. The whitespace in the lines can be spaces or tabs. If you export from MS Excel or Google Sheets, data in cells will be separated by tab characters.

The first is the species name. Please use the full binomial Linnaean name. We cannot accept unresolved names: please do NOT use subspecies epithets, hybrid or informal names (e.g. use Canis lupus, not Canis lupus orion, and no “Weirdthingy sp.”, “Weirdthingy x anotherthingy”, “Weirdthingy cf somethingelse”, etc). 
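A simple format check can catch most unresolved names before a list is submitted. The Python sketch below is an assumption-laden illustration rather than an official validator: the function name and the set of informal markers are hypothetical, names are assumed to be plain ASCII, and the check says nothing about whether the name is taxonomically valid.

```python
import re

# Exactly two words: a capitalised genus and a lower-case specific epithet.
# Hyphenated epithets are allowed.
BINOMIAL_RE = re.compile(r"^[A-Z][a-z]+ [a-z]+(?:-[a-z]+)*$")

# Assumed markers of informal or unresolved names when used as the epithet.
INFORMAL_EPITHETS = {"sp", "spp", "cf", "aff", "x"}


def is_plain_binomial(name):
    """Return True if name looks like a full Linnaean binomial and nothing more."""
    name = name.strip()
    if not BINOMIAL_RE.match(name):
        return False
    epithet = name.split()[1]
    return epithet not in INFORMAL_EPITHETS


for candidate in ("Canis lupus", "Canis lupus orion", "Weirdthingy sp."):
    print(candidate, is_plain_binomial(candidate))
```

Subspecies and hybrid formulas fail because they have more than two words, and “sp.” fails because of the trailing full stop.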

The second is the taxonomic family. The family is needed to resolve namespace clashes between systems that will arise if only the Linnaean binomial is given.

The third is the status. We define three statuses currently. The first is that the species is on the widest list for the project (e.g. all Chiroptera, or all species in a geographic area). The second is that the species has been identified as a Family representative. The third is that the species is a priority for another reason (economic importance, conservation importance, etc; the “important, iconic or interesting” classification).

  • For DToL “family representative” taxa we use “dtol_family_representative”

  • For the DToL “other priority” taxa we use “dtol_other_priority”

  • For all other species of interest to DToL, we use “dtol”

Please use this schema for your project (e.g. vgp_family_representative, etc).

It is fine to separate these different sets into separate files (i.e. “family”, “other” and “all”; but do supply the URLs for all files), and it is fine to have species appear in more than one list, as “family” takes precedence over “other”, which in turn takes precedence over “all”.
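The precedence rule can be sketched as a small merge over species lines. This Python example is only one way such a merge might work and is not taken from GoaT: the function names are hypothetical, and it assumes each species line has exactly four whitespace-separated fields and that statuses follow the “[project]_family_representative” and “[project]_other_priority” naming shown above.

```python
def status_rank(status):
    """Rank a status so that family > other priority > all."""
    if status.endswith("_family_representative"):
        return 2
    if status.endswith("_other_priority"):
        return 1
    return 0


def parse_species_line(line):
    """Split 'Genus species family status' into ((binomial, family), status)."""
    fields = line.split()  # handles both spaces and tabs
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    genus, species, family, status = fields
    return (genus + " " + species, family), status


def merge_status(lines):
    """Keep the highest-ranked status seen for each (binomial, family) pair."""
    merged = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        key, status = parse_species_line(line)
        if key not in merged or status_rank(status) > status_rank(merged[key]):
            merged[key] = status
    return merged
```

Because the merge keeps the highest rank regardless of order, the lines from several files can be concatenated in any sequence before merging.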

Example

Below is an example for Mythical_species_DToL_species_list.txt

# project_name DToL

# subproject_name Mythical_species

# primary_contact Mark Blaxter

# primary_contact_institution Tree of Life, WSI, CB10 1SA, UK

# primary_contact_email mb35@sanger.ac.uk

# date_of_update 2021-01-12

# schema_version ebp_species_goat_2.0

# Genus species family status

Neopixie underhilli Faerinae dtol_family_representative

Neopixie deep-underhilli Faerinae dtol

Paraneopixie underhilli Faerinae dtol_other_priority

Monoceros unicornus Unicorninae dtol_family_representative

Anthropopisces sirenia Mermaidinae dtol_family_representative

Minutusbitey noseeumi Deeplyannoyinae dtol_family_representative


ABOUT THE SUBCOMMITTEE

This Report on Sample Collection and Processing Standards was developed by EBP’s Scientific Subcommittee for Sample Collection and Processing.