CNV analysis based on NGS data: has the dream come true?
The term Copy Number Variations (CNVs) traditionally refers to intermediate-scale deletions/duplications of 1 kb to 5 kb in size. However, in practice, when the analysis is based on Next Generation Sequencing (NGS) data, the term CNV is increasingly used for deletions/duplications of any size larger than 50 bp, from a single exon to a whole gene.
Geneticists and bioinformaticians started to imagine CNV analysis based on NGS data a long time ago. Since then, they have been designing different algorithms and platforms, trying and retrying, revising, and refining calculations. Until recently, the result tended to be always the same: a glass half full (or half empty). The major issues concerned resolution (hardly reaching the exon level), sensitivity and specificity (staying roughly around 85%, far from the 99% of direct molecular methods), the need to manually fine-tune the algorithm depending on the genomic region and the desired resolution, and a mostly unfriendly user interface.
In general, released CNV algorithms were able to detect large deletions of several kilobases (e.g. a whole-gene deletion), but relying on them to detect deletions/duplications of one or two exons was certainly a risk. Exonic or multi-exonic deletions are smaller than whole-gene deletions, but they can be equally pathogenic, so they must be covered by any thorough genetic screening (e.g. for BRCA1 and BRCA2, but also for plenty of other genes associated with rare syndromes and diseases). So there was basically no choice: because standard aCGH has too low a resolution, the only options were MLPA, qPCR, or high-resolution aCGH. However, MLPA, qPCR, and high-resolution aCGH share a big disadvantage: the physician needs to know exactly which gene(s) to test. In other words, you need a very well-defined clinical suspicion of a specific genetic disorder or syndrome, and you have to give up the possibility of indiscriminate genome-wide testing.
At Breda Genetics, we have seriously considered CNV detection based on whole-exome and whole-genome sequencing since our very beginning in 2016. We performed it in a good number of cases, including external quality control samples, mostly running molecular methods in parallel to confirm the results. Our findings were basically in line with the average: a very high chance of detecting whole-gene deletions (already a big advantage over aCGH), but consistent difficulties in detecting single-exon deletions.
Nowadays, protocols seem to be improving significantly, offering substantial gains in detection power and even new sequencing techniques (read about low-pass genome sequencing below).
CNV analysis in tg-NGS panels, whole exome sequencing, and whole genome sequencing: the difference.
Algorithmic CNV analysis can be performed on targeted NGS panels (tg-NGS), whole exome sequencing, and whole genome sequencing. Most tools have been designed to run on whole exome or whole genome sequencing data; however, they have been applied experimentally to targeted panels as well.
Algorithmic CNV analysis is essentially based on comparison. To perform CNV analysis on NGS data, you need a pool of samples that have been enriched with the same capture kit and sequenced at the same depth of coverage. Your sample is then compared to that pool to see whether significant differences in coverage depth arise for specific gene regions. Only in the case of whole genome sequencing can the sample be run alone, as coverage stability makes it possible to use the sample as its own control (although, even for whole genome sequencing, many recommend running against a pool of control samples). Another feature of CNV analysis based on whole genome sequencing is its potential to reach extremely high levels of accuracy, enabling the identification of the position and orientation of the detected CNV.
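The comparison against a pool can be illustrated with a minimal read-depth sketch. It is not the method of any specific tool: the function names, the exon labels, the depth values, and the log2 thresholds (-0.6 for a loss, 0.4 for a gain) are all assumptions chosen for illustration. The idea is simply to normalize each sample by its own median depth, take the pool median per region as the reference, and flag regions whose log2 ratio departs from zero (a heterozygous deletion is expected near -1, a single-copy duplication near +0.58).

```python
import math
from statistics import median

def normalize(depths):
    """Divide each region's depth by the sample's median depth.

    Assumes the median depth is non-zero (true for any usable sample).
    """
    m = median(depths.values())
    return {region: d / m for region, d in depths.items()}

def log2_ratios(sample, pool):
    """log2 of normalized sample depth over the pool's median normalized depth."""
    norm_sample = normalize(sample)
    norm_pool = [normalize(p) for p in pool]
    ratios = {}
    for region, value in norm_sample.items():
        ref = median(p[region] for p in norm_pool)
        # Tiny pseudocount so a fully deleted region does not produce log2(0).
        ratios[region] = math.log2((value + 1e-9) / (ref + 1e-9))
    return ratios

def call_cnvs(ratios, loss=-0.6, gain=0.4):
    """Label regions beyond the (illustrative, hypothetical) thresholds."""
    calls = {}
    for region, r in ratios.items():
        if r <= loss:
            calls[region] = "deletion"
        elif r >= gain:
            calls[region] = "duplication"
    return calls

# Hypothetical mean depths per exon (made-up numbers, same capture kit assumed).
sample = {"exon1": 100, "exon2": 50, "exon3": 100, "exon4": 100, "exon5": 150}
pool = [
    {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100, "exon5": 100},
    {"exon1": 120, "exon2": 120, "exon3": 120, "exon4": 120, "exon5": 120},
    {"exon1": 98, "exon2": 102, "exon3": 100, "exon4": 101, "exon5": 99},
]

ratios = log2_ratios(sample, pool)
calls = call_cnvs(ratios)
print(calls)  # {'exon2': 'deletion', 'exon5': 'duplication'}
```

Real pipelines add corrections that this sketch leaves out (GC content, capture efficiency, segmentation across neighboring regions), which is one reason the pool has to share the same capture kit and depth of coverage.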
Listing all variations or excluding the common ones?
Some algorithms applied in CNV testing are designed to highlight any copy number variant in the sample. By contrast, other algorithms are designed to exclude from the results all variations present in two or more samples (on the assumption that, if they are frequent, they are benign). The second approach greatly reduces the amount of data that the scientist eventually has to evaluate. The first approach is certainly more complete, but it is much more labor-intensive, as every CNV must be evaluated for its possible pathogenicity. To help scientists, some platforms now annotate CNVs as pathogenic, uncertain, or benign based on public databases. However, CNV annotation databases are still incomplete.
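The second strategy, dropping anything seen in two or more samples, can be sketched as follows. This is a simplified illustration under stated assumptions: `filter_recurrent`, the sample names, and the gene labels are hypothetical, and CNVs are matched by exact (region, type) identity, whereas a real tool would more likely compare genomic coordinates and require some degree of overlap.

```python
from collections import Counter

def filter_recurrent(calls_by_sample, max_carriers=1):
    """Keep only CNVs carried by at most `max_carriers` samples in the batch.

    Each CNV is a (region, type) tuple; exact matching is a simplification.
    """
    carriers = Counter(
        cnv for calls in calls_by_sample.values() for cnv in set(calls)
    )
    return {
        sample: [cnv for cnv in calls if carriers[cnv] <= max_carriers]
        for sample, calls in calls_by_sample.items()
    }

# Hypothetical batch: GENE_B duplication is seen in two samples.
batch = {
    "S1": [("GENE_A_exon2", "deletion"), ("GENE_B", "duplication")],
    "S2": [("GENE_B", "duplication")],
    "S3": [],
}

filtered = filter_recurrent(batch)
print(filtered)
# {'S1': [('GENE_A_exon2', 'deletion')], 'S2': [], 'S3': []}
```

The trade-off described above is visible here: the recurrent duplication disappears from every report, which saves review time but would also hide a pathogenic CNV shared by two related samples in the same batch.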
Low-pass genome sequencing: a substitute for aCGH?
A new sequencing technique, low-pass genome sequencing (i.e. whole-genome sequencing at a low level of coverage: 5x), has been proposed as a new tool for CNV scanning and detection, with a higher sensitivity than aCGH and, of course, a higher level of resolution. Some authors highlight that low-pass genome sequencing is also more powerful in detecting mosaic CNVs. Given these encouraging results, low-pass genome sequencing has already been suggested as an alternative to aCGH, even prenatally.
Why low-pass (i.e. low coverage)? Standard whole genome sequencing for the detection of single nucleotide variants (SNVs) is usually performed at 30x (more or less the equivalent of whole exome sequencing at 100x) and, as said above, is already used to run CNV analysis. However, whole genome sequencing at 30x has disadvantages in terms of cost, file size, and computational burden. So scientists came up with the idea of performing whole genome sequencing at lower coverage to reduce costs and speed up analysis, apparently with success.
Very interestingly, low-pass genome sequencing for CNV detection has already been tested in non-invasive prenatal testing (NIPT) on cell-free circulating DNA. Sensitivity reportedly varied with CNV size and fetal DNA fraction but, even more interestingly, the assay proved capable of indicating the origin of an aberration (i.e. whether it is exclusively fetal or fetomaternal/maternal), an aspect that may certainly help in pathogenicity assessment and genetic counselling.
The future: where are we going?
There's no doubt that many researchers are working towards one single assay to test for any kind of genetic disorder. That would definitely be a great achievement, especially in a field such as Medical Genetics, where, still today, the approach remains fragmented into a variety of tests that need to be put together to confirm a diagnosis (from family carrier testing after whole exome sequencing, to pseudogene testing, to MLPA and more).
After the glories of Sanger sequencing (the entire human genome was sequenced by capillary electrophoresis!), NGS provided the scientific community with the first tool to pool together hundreds of genes for the testing of a group of disorders: the so-called targeted NGS multigene panels (e.g. panels for intellectual disability, ophthalmological disorders, neurogenic and myopathic conditions, hearing loss, metabolic disorders, and so on). However, the biggest jump remains the one to whole exome and whole genome sequencing, two techniques that, for the first time, enabled the screening of the entire genomic material of an individual for the detection of SNVs. The next big step now seems to be reliable and sensitive algorithmic CNV testing based on NGS data, whether it comes as an add-on to whole exome and whole genome sequencing at high coverage depth or as stand-alone testing by low-pass genome sequencing in place of aCGH.
While public institutions and private companies will likely put substantial effort into this next step, more uniform literature evidence, shared knowledge (including richer CNV annotations), and easier-to-use platforms are certainly needed to validate pipelines and expand CNV analysis to more centers at lower cost. But maybe the direction has already been set once and for all.
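The saving can be made concrete with the standard relation between mean coverage, read count, read length, and genome size (coverage = reads × read length / genome size). The sketch below is only back-of-the-envelope arithmetic: the genome size of 3.1 Gb and the 150 bp read length are assumed round figures, and duplicates, unmapped reads, and uneven coverage are ignored.

```python
def reads_needed(coverage, genome_size_bp=3.1e9, read_length_bp=150):
    """Number of reads for a target mean coverage: N = C * G / L.

    genome_size_bp and read_length_bp are illustrative assumptions.
    """
    return coverage * genome_size_bp / read_length_bp

standard = reads_needed(30)  # 30x, as used for SNV detection
low_pass = reads_needed(5)   # 5x, the low-pass figure quoted above

print(f"30x: about {standard / 1e6:.0f} million reads")
print(f" 5x: about {low_pass / 1e6:.0f} million reads")
print(f"ratio: {standard / low_pass:.1f}")
```

Under these assumptions, 30x requires roughly 620 million reads against roughly 103 million at 5x, a sixfold reduction in raw data that carries over to storage and compute.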
Exome sequencing and whole genome sequencing for the detection of copy number variation. PMID: 26088785
Free-access copy-number variant detection tools for targeted next-generation sequencing data. PMID: 31097148
Clinical Validation of Copy Number Variant Detection from Targeted Next-Generation Sequencing Panels. PMID: 28818680
Low-pass genome sequencing versus chromosomal microarray analysis: implementation in prenatal diagnosis. PMID: 26088785
Copy-Number Variants Detection by Low-Pass Whole-Genome Sequencing. PMID: 28696555
Validation of Copy Number Variants Detection from Pregnant Plasma Using Low-Pass Whole-Genome Sequencing in Noninvasive Prenatal Testing-Like Settings. PMID: 32784382