Identifying disease-related mutations with Condel 2.0

Three years ago, Abel developed and published in the American Journal of Human Genetics an approach to combine the results of several tools aimed at identifying disease-related single nucleotide variants (SNVs). He called the strategy a Consensus Deleteriousness score of SNVs, or Condel. It consisted of computing a weighted average of the scores of five of these tools (SIFT, PolyPhen2, MutationAssessor, LogRE and MAPP). The weights were extracted from the complementary cumulative distributions of the scores of sets of known disease-related and neutral SNVs. He showed that the consensus score of the five tools outperformed each of the five individual methods, as well as other approaches to combining them. He presented the Condel of these five tools in one of the first posts of this blog, The making of Condel (CONsensus DELeteriousness Score), published on April 1, 2011.
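To make the idea of a weighted average of tool scores more concrete, here is a minimal sketch. It is not the published Condel implementation: the tool names, the toy numbers and the specific weighting rule (trusting a score more when few known neutral variants reach it) are assumptions chosen for illustration, and the scores are assumed to be already normalized to the range 0 to 1.

```python
from bisect import bisect_left


def complementary_cumulative(reference_scores, x):
    """Fraction of reference scores that are >= x (empirical 1 - CDF)."""
    ordered = sorted(reference_scores)
    return (len(ordered) - bisect_left(ordered, x)) / len(ordered)


def weighted_average(scores, weights):
    """Combine per-tool scores (tool -> score in [0, 1]) using per-tool weights."""
    total_weight = sum(weights[tool] for tool in scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(scores[tool] * weights[tool] for tool in scores) / total_weight


# Toy numbers, not real tool output.
neutral_reference = [0.1, 0.2, 0.3, 0.4, 0.8]
scores = {"tool_a": 0.9, "tool_b": 0.5}

# Hypothetical weighting rule: a score gets more weight when
# only a small share of known neutral variants score that high.
weights = {
    tool: 1.0 - complementary_cumulative(neutral_reference, score)
    for tool, score in scores.items()
}
consensus = weighted_average(scores, weights)
print(round(consensus, 3))
```

In this toy case the score of tool_a, which no neutral reference variant reaches, pulls the consensus towards its own value more strongly than the score of tool_b does.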

Later that month, he announced that a Condel of SIFT and PolyPhen2 had been implemented within the Ensembl Variant Effect Predictor (VEP), and almost three months later, Xavier Rafael-Palou implemented our own web service with a Condel of SIFT, PolyPhen2 and MutationAssessor. It was an important step: it gave research groups quick access to Condel scores for all SNVs with SIFT, PolyPhen2 and MutationAssessor scores, so that they could assess the functional impact of novel variants detected in projects employing an array of Next Generation Sequencing technologies. The web server relied on Ensembl, in particular on its VEP, to map the genomic coordinates of an SNV to coordinates in the protein products of the genes overlapping it. Furthermore, it retrieved SIFT and PolyPhen2 scores through VEP annotations, and MutationAssessor scores from its web service. This posed technical hurdles to the development of the Condel web server. First, it was anchored to a fixed Ensembl version (we were unable to keep up with Ensembl updates) and therefore to particular SIFT and PolyPhen2 versions. Second, it was very sensitive to changes in the underlying services. Changes in the MutationAssessor web service, for instance, caused us to inadvertently compute faulty Condel scores for a period, until our attention was drawn to the web server output.


Because of these technical problems, and because some new tools to identify likely deleterious SNVs have appeared in the past couple of years, a few months ago I started to develop a new version of Condel. We wanted the new web server to rely on a database of pre-computed Condel scores for all possible SNVs in the human proteome, to reduce the response time of the server to a minimum. Therefore, I started looking for databases of pre-computed scores from SIFT, PolyPhen2, MutationAssessor and other methods. We found the great dbNSFP resource, which has compiled these and other scores from their original sources. I then extracted the pre-computed SIFT, PolyPhen2, MutationAssessor and FatHMM scores from dbNSFP and put them into a new indexed database from which we could calculate the new Condel scores. Next, I created a pipeline using the amazing IPython notebook (available here for review) to compute different performance metrics on the well-known HumVar dataset in order to find the best score cutoff for each of these tools. The metrics describe how the scores are distributed (the cumulative probability distribution), the Accuracy and the Matthews Correlation Coefficient for all possible cutoffs, the ROC curve and the Area Under the Curve. The weights used in the weighted average when calculating the new Condel scores can be derived from those cutoffs, using the inverse cumulative probability distribution of the neutral mutations when the source score is below the cutoff, or the cumulative probability distribution of the deleterious mutations when it is above. See figure below:

metrics

Performance metrics: There are four plots for each method. The first plot shows the density of scores for the known neutral (NEG) and deleterious (POS) mutations in HumVar. The scores of SIFT and FatHMM are reversed so that higher values indicate a higher likelihood of deleteriousness. The second plot shows the cumulative probability distribution. The third plot shows the Matthews correlation coefficient (MCC) and the accuracy (ACC); the cutoff (marked with a vertical line) is the threshold that maximizes the MCC. The last plot shows the ROC curve, with the numbers in the legend giving the area under the curve (AUC).
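Choosing the cutoff that maximizes the MCC can be sketched as a simple scan over candidate thresholds. The snippet below is an illustration only, not the code from the notebook: the function names and the toy scores are made up, and it assumes that higher scores mean more deleterious and that a variant is called deleterious when its score is at or above the cutoff.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, mcc) for the observed score that maximizes the MCC."""
    best = (None, -1.0)
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)  # deleterious, called deleterious
        fn = len(pos_scores) - tp                  # deleterious, called neutral
        fp = sum(s >= cutoff for s in neg_scores)  # neutral, called deleterious
        tn = len(neg_scores) - fp                  # neutral, called neutral
        value = mcc(tp, tn, fp, fn)
        if value > best[1]:
            best = (cutoff, value)
    return best


# Toy scores standing in for the deleterious (POS) and neutral (NEG) sets.
pos = [0.35, 0.6, 0.7, 0.8, 0.9]
neg = [0.1, 0.2, 0.3, 0.4]
cutoff, value = best_cutoff(pos, neg)
print(cutoff, round(value, 3))
```

Only scores that actually occur in the data are tried as cutoffs, since the confusion matrix cannot change between two consecutive observed values.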

Finally, the pipeline computes the Condel score for each possible SNV in the human proteome. I then exhaustively tested all possible combinations of the scores of the most recent versions of these four tools to find the Condel that most accurately classified the HumVar dataset. I found that the Condel of FatHMM and MutationAssessor showed the best performance on this dataset (see figure below).

metrics-condel

Performance metrics for Condel: as explained in the previous figure.
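The exhaustive search over tool combinations can be sketched with itertools.combinations. This is a simplified illustration rather than the actual pipeline: the tool names and scores are invented, the consensus is an unweighted mean instead of the weighted Condel average, and combinations are ranked by plain accuracy at a fixed cutoff of 0.5, which is an assumption about the selection criterion.

```python
from itertools import combinations


def rank_combinations(scores_by_tool, labels, cutoff=0.5):
    """Rank every combination of two or more tools by classification accuracy.

    scores_by_tool maps a tool name to a list of scores in [0, 1], one per variant;
    labels holds True for deleterious variants and False for neutral ones.
    """
    tools = sorted(scores_by_tool)
    results = []
    for size in range(2, len(tools) + 1):
        for combo in combinations(tools, size):
            # Unweighted mean as a stand-in for the weighted Condel score.
            consensus = [
                sum(scores_by_tool[tool][i] for tool in combo) / len(combo)
                for i in range(len(labels))
            ]
            correct = sum((c >= cutoff) == label for c, label in zip(consensus, labels))
            results.append((correct / len(labels), combo))
    return sorted(results, key=lambda item: item[0], reverse=True)


# Toy data: four variants, three hypothetical tools.
labels = [True, True, False, False]
scores_by_tool = {
    "tool_a": [0.9, 0.8, 0.2, 0.1],
    "tool_b": [0.7, 0.4, 0.6, 0.3],
    "tool_c": [0.6, 0.7, 0.9, 0.4],
}
ranking = rank_combinations(scores_by_tool, labels)
print(ranking[0])
```

With four tools there are only eleven combinations of two or more, so a brute-force search like this stays cheap even when each combination is evaluated on a full benchmark dataset.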


The figure below presents the ROC curves of the two tools (FatHMM and MutationAssessor) and of Condel.

ROC

ROC curve of the tools and Condel: The legend shows the Area Under the Curve (AUC) for each tool.
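For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen deleterious variant receives a higher score than a randomly chosen neutral one. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative, and this is not necessarily how the figure was produced.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores: three of the four (deleterious, neutral) pairs are ordered correctly.
print(roc_auc([0.6, 0.3], [0.5, 0.2]))
```

The pairwise version is quadratic in the number of variants, so for a dataset the size of HumVar a rank-based or library implementation would be the more practical choice.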


All the Condel scores are stored in a MongoDB database, which forms the core of the new Condel web server. In fact, I found that the same database could also be used by another of our projects, TransFIC, so I merged the Condel and TransFIC databases, as well as the web server, into an independent project called FannsDB. The database contains almost 241 million records (transcript-affecting SNVs).

From now on, both Condel and TransFIC will share the underlying data technology and will have a common web site at

FannsDB

http://bg.upf.edu/fannsdb

In summary, I want to present to you the new version of Condel (2.0). It improves on the former web server in two main aspects. First, it combines more up-to-date, as well as new, tools to identify disease-related SNVs. Second, it relies on a database of pre-computed Condel scores for all possible SNVs in the human proteome, which makes its performance independent of other web services and its use more reliable.