Whole sequencing of human DNA
Whole-genome sequencing (WGS) is the sequencing of the entire genome, i.e. of all the human DNA contained in the cell nucleus (about 3 billion nucleotides). With this technique, both coding and non-coding regions of the DNA are sequenced. Whole-genome sequencing therefore includes everything covered by whole-exome sequencing (i.e. all the coding regions of the genes, called exons) plus the introns and the intergenic regions, that is, all the sequences between one gene and another. Quantitatively, these non-coding sequences represent the vast majority of DNA (as much as 98%).
Whole-genome variants study
Whole-genome sequencing identifies a huge number of variants (a patient’s file can be as big as 100-300 GB). The vast majority of these variants are polymorphisms (harmless variants, responsible for interindividual phenotypic variability), while a small minority are pathogenic variants. Pathogenic variants can also be present in healthy people, who are then healthy carriers (it is estimated that every one of us is a healthy carrier of at least 30 autosomal recessive genetic disorders). Whole-genome sequencing in an affected individual is clinically very useful but particularly arduous, because interpreting the significance of the variants detected is remarkably difficult, especially when they are located in intronic or intergenic regions.
Of note, in some sources the acronym WGS stands not for Whole-Genome Sequencing but for Whole-Genome Shotgun. The meaning is the same; the emphasis is simply placed on the serial and fast approach of the method.
Whole-genome and whole-exome sequencing differences
What is the difference between whole-genome and whole-exome sequencing?
A gene is made of alternating coding regions (called exons) and non-coding regions (called introns). Between two genes there are very long non-coding sequences, called intergenic regions. Together, introns and intergenic regions account for the vast majority of DNA (about 98%).
Whole-exome sequencing analyzes all exons, the intronic regions at all exon-intron boundaries and sometimes (but not always) the regulatory regions upstream and downstream of the gene (called 5′-UTR and 3′-UTR, respectively, which are some hundreds to some thousands of nucleotides long). The purpose is to identify exonic mutations (missense, nonsense, frameshift, in-frame or regulatory) or intronic mutations that affect mRNA maturation (splicing mutations). It is estimated that these mutations cover about 85% of the whole human mutational spectrum (that is, it is thought that 85% of all disease-causing mutations fall within the exome, which constitutes only 2% of human DNA).
Whole-genome sequencing, instead, analyzes the whole DNA and comprises all exons, all introns and all intergenic regions. Although it is estimated that 85% of all disease-causing mutations fall in the exome, some pathogenic mutations can fall in deep intronic regions (such as the ones described for some retinal dystrophies) or in regions upstream or downstream of the genes, which are not identifiable through exome analysis. For example, one of the most frequent pathogenic mutations for Leber congenital amaurosis falls in a deep intronic region not covered by the vast majority of exome-analysis kits (mutation c.2991+1655A>G in the CEP290 gene). Mutations that occur in deep intergenic regions still seem very rare.
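The notation c.2991+1655A>G already tells us that the variant is intronic: the "+1655" means 1655 nucleotides into the intron that follows coding position 2991. As a minimal sketch, the Python function below reads that offset from a simple substitution and labels the variant. The 50-nucleotide cut-off for "deep intronic" and the function name are assumptions made for illustration only; real exome kits differ in how much flanking intron they capture, and the pattern deliberately handles only plain substitutions at coding positions (no UTR positions, deletions or insertions).

```python
import re

# Assumed threshold (not from any kit specification): variants at least this
# many nucleotides away from the nearest exon are labelled "deep intronic".
DEEP_INTRONIC_MIN_OFFSET = 50

# Simple coding-DNA substitutions only, e.g. "c.2991+1655A>G" or "c.100G>T".
HGVS_SUBSTITUTION = re.compile(
    r"^c\.(?P<pos>\d+)(?P<offset>[+-]\d+)?(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def classify_substitution(hgvs: str) -> str:
    """Label a simple c. substitution as exonic, boundary or deep intronic."""
    match = HGVS_SUBSTITUTION.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported or malformed notation: {hgvs!r}")

    offset = match.group("offset")
    if offset is None:
        # No +N / -N suffix: the position lies inside an exon.
        return "exonic"

    # The sign says which side of the exon the intron is on; only the
    # distance from the exon matters for this classification.
    distance = abs(int(offset))
    if distance >= DEEP_INTRONIC_MIN_OFFSET:
        return "deep intronic"
    return "near exon-intron boundary"
```

With these assumptions, the CEP290 mutation quoted above is labelled "deep intronic", whereas a hypothetical c.100+2T>C would be labelled "near exon-intron boundary", the kind of position an exome kit normally does cover.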
From the diagnostic point of view, a substantial difference between whole-exome and whole-genome sequencing is that genome sequencing data also allow the study of large deletions/duplications (also called the study of Copy Number Variations, or CNVs), a type of mutation usually not detectable by standard sequencing (not even by Sanger sequencing). Sometimes the CNV study can also be carried out on exome data, but this option depends on the laboratory: CNV analysis on the exome requires a database with numerous samples, while CNV analysis on the genome is possible with the patient's sample alone.
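To give an idea of why a single genome sample can be enough for CNV analysis, here is a minimal read-depth sketch in Python. It assumes the genome has already been divided into equal windows and that the mean depth of each window is known; the function name and the ratio thresholds (0.6 and 1.4) are illustrative assumptions, not a validated method. Each window is simply compared with the sample's own median depth: a heterozygous deletion is expected to give a ratio near 0.5, a duplication a ratio near 1.5.

```python
from statistics import median


def call_cnv_windows(window_depths, loss_ratio=0.6, gain_ratio=1.4):
    """Flag windows whose depth deviates from the sample's own median.

    window_depths: mean read depth per fixed-size window (assumed input).
    loss_ratio / gain_ratio: illustrative thresholds, chosen for this sketch.
    Returns a list of (window_index, label) tuples.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")

    calls = []
    for index, depth in enumerate(window_depths):
        ratio = depth / baseline
        if ratio <= loss_ratio:
            calls.append((index, "deletion"))
        elif ratio >= gain_ratio:
            calls.append((index, "duplication"))
    return calls
```

For the depths [30, 31, 29, 15, 14, 30, 46, 30] the median is 30, so windows 3 and 4 are flagged as a deletion and window 6 as a duplication. On exome data the depth of each target varies with capture efficiency, so the sample's own median is not a reliable baseline; this is why a database of other samples is needed there.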
Whole-genome sequencing was made possible by the advent of Next Generation Sequencing (NGS) which, compared to the traditional Sanger method, makes it possible to sequence considerably greater quantities of DNA at significantly lower costs and with levels of sensitivity and specificity that are now comparable. Curiously, the Human Genome Project, which led to the completion of the entire human genome sequence in the early 2000s, was based solely on Sanger technology; the project was in fact funded with large public and private capital.
In NGS, the DNA of an individual is broken into numerous small fragments by a process of fragmentation (DNA shearing) that can be mechanical (via sonication, e.g. Covaris) or enzymatic. Small synthetic sequences (called adapters) are then added to the fragments obtained, in order to build the so-called sequencing library. The fragments of the library are then sequenced until numerous complementary copies (called reads) are obtained for each fragment. The reads are then aligned with the reference sequences present in the databases until the whole genome sequence of the individual analyzed is reconstituted, like in a puzzle. Different manufacturers offer machines capable of sequencing the complete human genome, with varying times and costs. Among the most popular and most robust are the Illumina NextSeq and the Illumina NovaSeq systems. Then there are the Thermo Fisher (Ion S5), PacBio and Complete Genomics (Revolocity Supersequencer) platforms.
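The puzzle analogy can be illustrated with a toy Python sketch. It is a deliberate simplification: fragments have a fixed length, reads are error-free, the sample is identical to the reference, and "alignment" is a plain exact-match search, whereas real aligners use indexed searches and tolerate mismatches. All function names are hypothetical.

```python
import random


def shear(genome: str, fragment_length: int, n_fragments: int, rng: random.Random):
    """Draw random fixed-length fragments, mimicking DNA shearing."""
    if fragment_length > len(genome):
        raise ValueError("fragment longer than genome")
    last_start = len(genome) - fragment_length
    starts = (rng.randint(0, last_start) for _ in range(n_fragments))
    return [genome[s:s + fragment_length] for s in starts]


def align_exact(read: str, reference: str) -> int:
    """Return the 0-based position of the first exact match, or -1."""
    return reference.find(read)


def rebuild(reads, reference: str) -> str:
    """Place each aligned read on the reference coordinates.

    Positions that no read covers stay as 'N' (unknown base).
    """
    consensus = ["N"] * len(reference)
    for read in reads:
        position = align_exact(read, reference)
        if position < 0:
            continue  # unaligned read: ignored in this toy example
        for i, base in enumerate(read):
            consensus[position + i] = base
    return "".join(consensus)
```

Calling `rebuild(shear(reference, 8, 40, random.Random(0)), reference)` on a short reference string returns a sequence of the same length in which every covered position carries the reference base and every uncovered position is "N". The more fragments are drawn, the fewer "N" positions remain, which leads directly to the notion of coverage.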
A fundamental parameter of all NGS analyses is coverage (or, more correctly, coverage depth), that is, the read depth. For each fragment of the sequencing library, it is in fact possible to obtain a variable number of reads. The greater the number of reads obtained for each fragment, the greater the sensitivity and specificity of the analysis. In exome sequencing for diagnostic purposes, the standard coverage required is 100x (i.e. the aim is to obtain 100 reads in paired-end sequencing, 50 in one direction and 50 in the opposite direction, for each fragment). Basically, it is as if you “read” the fragment at least 100 times. In whole-genome sequencing for diagnostic purposes, coverage of 30x is usually sufficient. Lower coverage is sufficient in genome sequencing because it is not necessary to compensate for the differences in yield that normally occur as a consequence of the enrichment step required in exome analysis.
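The relationship between coverage and sequencing effort can be made concrete with the usual approximation: mean coverage = number of reads × read length / size of the target. The Python sketch below applies it under stated assumptions: a read length of 150 nucleotides (chosen for illustration), a genome of 3 billion nucleotides and an exome of 2% of that, as given in the text. It ignores duplicate, unaligned and off-target reads, so real runs need more reads than this estimate.

```python
HUMAN_GENOME_SIZE = 3_000_000_000        # nucleotides, as stated in the text
EXOME_SIZE = HUMAN_GENOME_SIZE * 2 // 100  # about 2% of the genome

ASSUMED_READ_LENGTH = 150                # illustrative assumption


def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Average number of times each nucleotide of the target is read."""
    return n_reads * read_length / target_size


def reads_needed(coverage: int, read_length: int, target_size: int) -> int:
    """Smallest number of reads giving at least the requested mean coverage."""
    total_bases = coverage * target_size
    return (total_bases + read_length - 1) // read_length  # integer ceiling
```

Under these assumptions, `reads_needed(30, ASSUMED_READ_LENGTH, HUMAN_GENOME_SIZE)` gives 600,000,000 reads for a 30x genome, while `reads_needed(100, ASSUMED_READ_LENGTH, EXOME_SIZE)` gives 40,000,000 reads for a 100x exome. Even at a lower depth, the genome therefore requires far more sequencing than the exome.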
Breda Genetics routinely offers whole-genome sequencing, which is normally done on the Illumina NovaSeq platform in collaboration with some of the largest sequencing centers in the world. The analysis is available in two variants: GENOME FULL (full analysis of all data, with prioritization of the variants based on clinical information and possible CNV analysis, particularly indicated for syndromic and non-syndromic patients with an unknown clinical diagnosis) and GENOME PANEL (analysis of a panel of genes for which it is particularly important to analyze the deep intronic regions as well, such as in retinal dystrophies).