The content on this website is based on ClinVar database version November 27, 2019
Simple ClinVar was developed to provide gene- and disease-wise summary statistics based on all available genetic variants from ClinVar. How many missense variants are associated with heart disease? What are the top 10 genes mutated in Alzheimer's disease? Does CDKL5 have pathogenic mutations? If so, where? Simple ClinVar can answer these questions and more in a matter of seconds.
For detailed information about Simple ClinVar please refer to the original publication:
Clinical genetic testing has expanded exponentially in recent years, leading to an overwhelming number of patient variants with high variability in pathogenicity and heterogeneous phenotypes. A large part of the variant-level data is comprehensively aggregated in public databases such as ClinVar and is publicly accessible. However, the ability to explore this rich resource and answer general questions such as “How many missense variants are associated with a specific disease or gene?” or “In which part of the protein are patient variants located?” is limited and requires advanced bioinformatics processing.
Here, we present Simple ClinVar, a web-server application that provides variant-, gene-, and disease-level summary statistics based on the entire ClinVar database in a dynamic and user-friendly web interface. Overall, our web application can interactively answer basic questions regarding genetic variation and its known relationships to disease. Our website will follow the monthly ClinVar releases and provide easy access to the rich ClinVar resource for a broader audience, including basic and clinical scientists.
The ClinVar database is downloaded on a monthly basis directly from the ClinVar FTP site. The tabular data is processed internally to produce a pre-filtered ClinVar file for the user to explore on the Simple ClinVar web server. The pre-filtering step is designed to reduce the complexity of ClinVar entries and to provide fast access to high-quality entries.
A detailed description of the pre-filtering step, as well as access to the complete pre-filtering pipeline, is provided at our GitHub repository. Briefly, the pipeline performs four main tasks:
1) First, we keep only entries that are mapped to the human reference genome version GRCh37.p13/hg19 and that refer to canonical transcripts.
2) Second, the molecular consequence is inferred by analyzing the Human Genome Variation Society (HGVS) sequence variant nomenclature field. The HGVS format contains string patterns that allow the molecular consequence to be inferred (e.g. ENST00001234 c.3281 [p.Gly1046Arg]). Specifically, when the variant is reported to cause an amino acid change that differs from the reference, it is annotated as a “missense” variant (e.g. p.Gly1046Arg). If the genetic variant leads to the same amino acid (e.g. p.Gly1046Gly) or to a stop codon (e.g. p.Gly1046*), the entry is annotated as a “synonymous” or “stop-gain” variant, respectively. Small insertions and deletions (collectively “indels”; e.g. p.Gly1046Glyfs., p.Gly1046Glyins., p.Gly1046del.*) are separated into “frameshifts” and “in-frame indels”, depending on the observed outcome.
3) Third, we reduce the complexity of the clinical significance field by regrouping and merging its values into five unique and non-redundant categories: “Pathogenic”, “Likely pathogenic”, “Risk factor and Association”, “Protective/Likely benign” and “Benign”. Conflicting interpretations of pathogenicity, variants of unknown significance (VUS) and contradictory evidence (e.g. “Likely benign” alongside “Risk” evidence) are combined into an “Uncertain/Conflicting” category. Similarly, variants annotated with multiple evidence categories of the same evidence direction, such as “Pathogenic” alongside “Likely pathogenic”, are combined and assigned the lower of the evidence categories. In addition, ClinVar entries with phenotypes annotated as “not provided” and “not specified” are combined into a single category called “Not provided / Not specified”.
4) Fourth, ClinVar entries with missing annotations, such as the absence of an HGVS variant name or incomplete genomic coordinates, are filtered out. Currently, 493,240 out of 503,065 (98.04%) ClinVar entries (April 22 release) are included in Simple ClinVar.
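Tasks 2 and 3 above can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which is available from the GitHub repository. The regular expression, the function names infer_consequence and merge_significance, and the assumed strength ordering within each evidence direction are illustrative assumptions only.

```python
import re

# Assumption: the protein change uses three-letter amino acid codes,
# e.g. "p.Gly1046Arg". Real HGVS names have more forms than handled here.
PROTEIN_RE = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Za-z0-9_*=]*)")


def infer_consequence(hgvs_name):
    """Return a coarse molecular consequence label, or None if undetermined."""
    match = PROTEIN_RE.search(hgvs_name)
    if match is None:
        return None
    ref, _position, rest = match.groups()
    if "fs" in rest:                      # e.g. p.Gly1046Glyfs
        return "frameshift"
    if "ins" in rest or "del" in rest:    # e.g. p.Gly1046del
        return "in-frame indel"
    if rest in ("*", "Ter"):              # e.g. p.Gly1046*
        return "stop-gain"
    if rest in (ref, "="):                # e.g. p.Gly1046Gly
        return "synonymous"
    if re.fullmatch(r"[A-Z][a-z]{2}", rest):
        return "missense"                 # e.g. p.Gly1046Arg
    return None


# Assumption: categories of one evidence direction, listed strongest first.
DIRECTIONS = {
    "pathogenic": ["Pathogenic", "Likely pathogenic", "Risk factor and Association"],
    "benign": ["Benign", "Protective/Likely benign"],
}


def merge_significance(labels):
    """Collapse several regrouped labels for one variant into one category."""
    sides = {side for side, ordered in DIRECTIONS.items()
             for label in labels if label in ordered}
    known = {label for ordered in DIRECTIONS.values() for label in ordered}
    has_unknown = any(label not in known for label in labels)
    if has_unknown or len(sides) != 1:
        # VUS, conflicting, or contradictory evidence directions
        return "Uncertain/Conflicting"
    ordered = DIRECTIONS[sides.pop()]
    # Same direction: assign the lower (weaker) evidence category.
    return max(labels, key=ordered.index)


print(infer_consequence("ENST00001234 c.3281 [p.Gly1046Arg]"))
print(merge_significance(["Pathogenic", "Likely pathogenic"]))
```

Frameshifts are checked before the other indel patterns because a frameshift name may also contain "del" or "ins". Whether the real pipeline orders its checks this way is not stated in the text.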
Interactive summary statistics, variant mapping and visualization were developed with the Shiny framework from RStudio. App deployment, hosting and updates are performed with Google Cloud services (Figure 1).
From the front page of Simple ClinVar the user can submit three types of queries:
1) Database-wise query: Submitting an empty query or the keyword “clinvar” yields summary statistics for the entire ClinVar database. At the time of submission (ClinVar February 2019 release), Simple ClinVar contained 493,240 genetic variants, identified in 18,502 genes and found in patients with 11,098 phenotypes. The database query mode, coupled with dynamic filtering, allows the user to explore the most common disorders and variant types in the whole database. Similarly, the user can immediately see which genes have the most pathogenic variants or variants of unknown significance (Figure 2).
2) Gene-wise query: Submitting a RefSeq gene name on the main page forwards the user to the corresponding summary statistics page for all genetic variants annotated in that gene. For example, querying for “CDKL5” will show 675 genetic variants currently associated with 27 genetic disorders. Here, the user will see these variants mapped onto the corresponding protein sequence alongside domain information from UniProt. Furthermore, the user can explore gene-specific variant statistics, such as how many clinical phenotypes are associated with a given gene or where the pathogenic versus benign variants are located in the protein (Figure 3).
3) Disease-term-wise query: Querying a broad disease term of interest in Simple ClinVar will provide the user with all genes, variants and phenotypes associated with the given disease term. As an example, the disease term query “heart” will yield 814 genetic variants in 61 genes associated with 233 phenotypes, with missense SNVs (n=321) as the most common variant type. Here, for each selected disease term, the user can answer general questions such as: How many genes are associated with heart disease? How many annotated terms and disorder subtypes related to heart disease can be found? (Figure 4).
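The three query modes can be pictured with a short Python sketch. How Simple ClinVar itself decides between a gene and a disease term is not described here, so the dispatch rule below (exact gene-symbol match first, otherwise substring match on the phenotype) is an assumption, as are the function name run_query and the record fields. The records are invented toy entries, not ClinVar data.

```python
# Invented toy records for illustration only; not ClinVar data.
TOY_RECORDS = [
    {"gene": "CDKL5", "phenotype": "Toy phenotype A"},
    {"gene": "CDKL5", "phenotype": "Toy phenotype B"},
    {"gene": "SCN1A", "phenotype": "Toy heart phenotype"},
    {"gene": "SCN2A", "phenotype": "Toy heart phenotype"},
]


def run_query(records, query=""):
    """Select records by query mode and return summary counts."""
    term = query.strip().lower()
    if term in ("", "clinvar"):
        # Database-wise query: everything
        mode, hits = "database", list(records)
    elif any(r["gene"].lower() == term for r in records):
        # Gene-wise query: exact gene symbol match (assumed rule)
        mode, hits = "gene", [r for r in records if r["gene"].lower() == term]
    else:
        # Disease-term-wise query: substring match on the phenotype (assumed rule)
        mode, hits = "disease", [r for r in records
                                 if term in r["phenotype"].lower()]
    return {
        "mode": mode,
        "variants": len(hits),
        "genes": len({r["gene"] for r in hits}),
        "phenotypes": len({r["phenotype"] for r in hits}),
    }


for q in ("clinvar", "CDKL5", "heart"):
    print(q, run_query(TOY_RECORDS, q))
```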
Independently of the input mode, the output displayed in the results tab is organized into four sections, marked by the square buttons at the top in green, red, orange and grey. The user can switch between these sections at any time. The green button shows all genetic variants available for the query, together with the counts by variant type, molecular consequence, clinical significance, and review status. The red and orange buttons show the top ten genes and phenotypes associated with the query, respectively, as well as the complete list in table mode. In the case of a gene-wise query, the red button shows the variants mapped onto the canonical protein sequence of the queried gene. Finally, the grey button shows the table mode, where the subset of the pre-filtered ClinVar file currently on display is shown and is available for download for downstream analysis.
At all query levels, the output can be dynamically filtered by variant type, molecular consequence, clinical significance and review status in any combination. We show two examples of how this feature can be used as a fast and powerful tool for clinical researchers.
Example 1: We use “epilepsy” as a disease term query and display the red button gene view area. Unfiltered results show the top ten associated “epilepsy” genes in descending order of the number of qualifying variants: SCN1A, SCN9A, CACNA1H, GRIN2A, DEPDC5, RELN, KCNT1, KCNQ3, ALDH7A1 and CHRNA4. After filtering for pathogenic variants, the top ten gene list is updated to SCN1A, DEPDC5, GRIN2A, SCARB2, ALDH7A1, LGI1, MEF2C, NPRL3, SCN9A, and SPATA5. The user can conclude that the ranking of the most frequently mutated genes differs from the ranking of the genes with the most variants classified as pathogenic. In this example, only SCN1A, DEPDC5, and GRIN2A are in the top five “epilepsy” genes both by total number of variants and by number of pathogenic variants (Figure 5).
Example 2: We evaluate the gene-wise query for “SCN2A” on the red button gene view. Currently, there are variants mapped on the protein sequence of SCN2A. If we filter for “Missense” and “Pathogenic”, only 40 variants remain, and they are concentrated inside the transmembrane domains. The user can conclude that these regions, which contain the majority of pathogenic variants, are of key importance for protein function (Figure 6).
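The filter-then-rank logic behind both examples can be sketched in a few lines of Python. The web application itself is built with Shiny, so this is only an illustration of the idea. The function names apply_filters and top_genes, the field names, and the toy records (with placeholder gene names) are assumptions rather than part of Simple ClinVar.

```python
from collections import Counter

# Invented toy records with placeholder gene names; not ClinVar data.
TOY = [
    {"gene": "GENE_A", "consequence": "missense", "significance": "Benign"},
    {"gene": "GENE_A", "consequence": "missense", "significance": "Benign"},
    {"gene": "GENE_A", "consequence": "missense", "significance": "Pathogenic"},
    {"gene": "GENE_B", "consequence": "missense", "significance": "Pathogenic"},
    {"gene": "GENE_B", "consequence": "stop-gain", "significance": "Pathogenic"},
]


def apply_filters(records, **criteria):
    """Keep records matching every criterion; each criterion is a set of allowed values."""
    return [r for r in records
            if all(r[field] in allowed for field, allowed in criteria.items())]


def top_genes(records, n=10):
    """Return up to n gene names, ordered by descending variant count."""
    counts = Counter(r["gene"] for r in records)
    return [gene for gene, _count in counts.most_common(n)]


print(top_genes(TOY))
print(top_genes(apply_filters(TOY, significance={"Pathogenic"})))
```

In this toy data the gene with the most variants overall is not the gene with the most pathogenic variants, which mirrors the observation in Example 1. Filters can be combined freely, for instance apply_filters(TOY, consequence={"missense"}, significance={"Pathogenic"}), as in Example 2.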