Which one is the best database?
Genetic testing remains the most complex and difficult of all laboratory tests, from both the technical and the medical point of view. Because a sound bioinformatic pipeline is fundamental to producing good data for the geneticist, and because many bioinformatic tools are available today, the tools must be chosen carefully when establishing standard operating protocols for variant filtering and annotation. During the bioinformatic and clinical processing that leads to the final medical assessment, each variant must be checked for at least two aspects: (1) whether and how it is annotated in the major databases and (2) whether and how it can be predicted to be pathogenic (in silico analysis).
Interrogating the databases
We’ll focus here on the first step. Several databases store and list variants that have previously been reported in the literature. Some are open-access (e.g. ClinVar, dbSNP, dbVar, DECIPHER, ExAC and several locus-specific databases such as LOVD, the Cystic Fibrosis Mutation Database, the CFTR2 database, RettBASE, etc.), whereas others are licensed (to name just one: HGMD Professional).
HGMD Professional is traditionally rated as an excellent tool by the scientific community. Part of the HGMD content may be available for free, although it’s not possible to view the variant details without at least an academic license. The content of HGMD is manually curated, and variants are classified as disease-causing or possibly disease-causing (for Mendelian conditions) or as disease-associated (for multifactorial diseases). HGMD entries are classified on the basis of peer-reviewed publications indexed in PubMed. Writing a medical report on such a literature-based resource may feel particularly reassuring, but it’s known that, despite an overall high level of quality, even HGMD is not free from mistakes or inaccuracies. HGMD contains both small mutations (SNPs and small insertions/deletions) and large mutations such as large deletions, duplications or fusion genes (although the latter may be difficult to review when analysing a patient with a contiguous gene syndrome, since HGMD entries are typically listed “per gene” and not per chromosomal position).
Another flagship publication about ClinVar has just been released to introduce readers to its features and use. ClinVar is seeded with records based on allelic variants described in OMIM, GeneReviews, dbSNP, UniProt and locus-specific databases, and with variants submitted by a small group of clinical laboratories and some research groups. It is worth highlighting that the ClinVar staff does not arbitrate on the significance of a variant when different submitters have provided conflicting interpretations. Each submitted variant/disease association is simply assigned a unique accession number prefixed with SCV, so a single variant may carry several different SCVs. Variants in ClinVar may be of any length or type, ranging from single nucleotide substitutions and small insertions/deletions to copy number changes and cytogenetic rearrangements. SNPs and other short variants (<50 bp) that are submitted to ClinVar are also submitted to dbSNP on behalf of the submitter, while variants larger than 50 bp are also submitted to dbVar on behalf of the submitter. ClinVar is restricted to variants that have been interpreted for clinical or functional significance; it is not restricted by size or type of variant. For example, if you search ClinVar for ZEB2 and look at the Variant length filter on the left, there are variants <51 bp, greater than 5 Mb, and all ranges in between. Thanks to the Mutation Viewer it is possible to see large mutations represented on the chromosome, so that contiguous gene syndromes can also be examined.
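To make the two ClinVar rules above more concrete, here is a minimal Python sketch. It is not ClinVar's own code or API. The function names, the placeholder accessions and the variant labels are all invented for illustration. Two simplifications are assumptions: a variant of exactly 50 bp is treated as "short", since the text only says "<50 bp" and "larger than 50 bp", and any two differing significance labels on the same variant are counted as a conflict.

```python
from collections import defaultdict

# Assumption: a variant of exactly 50 bp is routed like a short variant.
SHORT_VARIANT_MAX_BP = 50


def partner_archive(length_bp):
    """Return the archive that also receives a ClinVar submission of this length."""
    if length_bp < 1:
        raise ValueError("variant length must be at least 1 bp")
    return "dbSNP" if length_bp <= SHORT_VARIANT_MAX_BP else "dbVar"


def group_scvs(records):
    """Group (scv, variant, significance) records by variant."""
    grouped = defaultdict(list)
    for scv, variant, significance in records:
        grouped[variant].append((scv, significance))
    return dict(grouped)


def conflicting_variants(records):
    """Return variants whose SCVs disagree, with the sorted distinct labels."""
    conflicts = {}
    for variant, entries in group_scvs(records).items():
        labels = {significance for _, significance in entries}
        if len(labels) > 1:  # simplification: any disagreement is a conflict
            conflicts[variant] = sorted(labels)
    return conflicts


# Hypothetical submissions; these are NOT real ClinVar accessions or variants.
submissions = [
    ("SCV_PLACEHOLDER_1", "variant_A", "Pathogenic"),
    ("SCV_PLACEHOLDER_2", "variant_A", "Uncertain significance"),
    ("SCV_PLACEHOLDER_3", "variant_B", "Benign"),
]

print(partner_archive(12))
print(conflicting_variants(submissions))
```

Because ClinVar does not arbitrate, a check of this kind is left to the user: the sketch only flags the variants whose SCVs disagree, and deciding which interpretation to trust remains a manual step.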
ClinVar or HGMD?
Multiple options are better than just one. Although giving priority to the signals coming from a single source may be convenient in the initial phase, having more than one tool available for comparison is definitely recommended. This is also the basis of any bioinformatic pipeline: just think of in silico predictions for missense mutations, which are generally made with two or more programs such as MutationTaster, PolyPhen-2 and PROVEAN, or of tools like Alamut (licensed), which integrates the scores of five different programs for splice mutations alone. Comparing how a given variant is classified in different databases may be decisive in assigning its final significance with confidence. Although HGMD has traditionally been very highly regarded, it is evident that public databases are also becoming more and more reliable. Datasets like ClinVar are now greatly strengthened by the fact that peer-reviewed journals require authors to submit any newly identified mutation to public repositories such as GenBank, Genomes (WGS), Complete Genomes, Transcriptome Shotgun Assembly (TSA), Short Read Archive (SRA), Gene Expression Omnibus (GEO), BioProject, BioSample, and dbSNP. In this context, ClinVar may be even more appealing than other databases, because all variants submitted to ClinVar are also entered automatically by the ClinVar staff in dbSNP and dbVar. This may encourage submitters to send even more variants to ClinVar, where the growing amount of data will hopefully reduce the number of conflicting interpretations.
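The cross-database comparison recommended above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `COARSE` mapping that harmonises ClinVar-style and HGMD-style labels is not an official equivalence, and `compare_sources` is a hypothetical helper. The annotations are typed in by hand rather than retrieved from either database.

```python
# Illustrative harmonisation of database-specific labels into coarse groups.
# Assumption: this mapping is NOT an official ClinVar/HGMD equivalence.
COARSE = {
    "pathogenic": "damaging",
    "likely pathogenic": "damaging",
    "disease-causing": "damaging",
    "possibly disease-causing": "uncertain",
    "uncertain significance": "uncertain",
    "disease-associated": "association",
    "likely benign": "benign",
    "benign": "benign",
}


def compare_sources(annotations):
    """Summarise agreement between databases for one variant.

    `annotations` maps a source name to its label, or to None when the
    variant is absent from that source.
    """
    coarse = {
        source: COARSE.get(label.lower(), "unmapped")
        for source, label in annotations.items()
        if label is not None
    }
    if not coarse:
        return "not found"
    if len(coarse) == 1:
        return "single source"
    if len(set(coarse.values())) == 1:
        return "concordant"
    return "discordant"


# Hand-written example annotations for hypothetical variants.
print(compare_sources({"ClinVar": "Pathogenic", "HGMD": "disease-causing"}))
print(compare_sources({"ClinVar": "Benign", "HGMD": "disease-causing"}))
```

A "discordant" or "single source" result does not settle the classification. It only marks the variants that deserve a closer look at the underlying literature before the report is written.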