How to generate mutation distribution and frequency plots?

Let’s plot some mutations!

Mutations Needle Plot

Easy, right? For our next iteration of the IntOGen database, we wanted to add figures that represent the mutation distribution across the protein sequence. Like others before us, we found ourselves knowing that solutions exist, but that none was available for us to incorporate into the web portal.

We would like to produce plots that describe the mutation frequency data, are aesthetically pleasing, and are easily understood. The Mutation Mapper at the cBioPortal already does a great job and provides a web service. Additionally, we’d like to reflect different consequence types at the same position. Thus we decided to give the oh-so-famous D3 (Data-Driven Documents) a go. It’s a blast to use, and in a couple of days we had our first plots as a generic library.
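Before anything can be drawn, the mutations have to be summarized per protein position. The sketch below shows one possible way to do that while keeping consequence types apart; it is not the code of our library, and the input format, the consequence labels and the `needle_data` helper are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (protein_position, consequence_type) pair
# per observed mutation. The values are invented for this example.
mutations = [
    (12, "missense"),
    (12, "missense"),
    (12, "nonsense"),
    (61, "missense"),
    (146, "synonymous"),
]


def needle_data(mutations):
    """Count mutations per position, split by consequence type."""
    per_position = defaultdict(Counter)
    for position, consequence in mutations:
        per_position[position][consequence] += 1
    # One needle per position: the total gives the needle height, and
    # the per-consequence counts can be drawn as separate segments.
    return [
        {
            "position": position,
            "total": sum(counts.values()),
            "by_consequence": dict(counts),
        }
        for position, counts in sorted(per_position.items())
    ]


needles = needle_data(mutations)
for needle in needles:
    print(needle)
```

A list of records like this is a convenient shape to hand over to a plotting layer such as D3, which binds one graphical element to each record.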

Read the rest of this entry »

How to perform a hierarchical clustering using interactive heatmaps in Gitools

In the latest version of Gitools, version 2.1, we have improved the clustering of heatmaps. Here we explain in detail how to perform a hierarchical clustering and interpret its result, and why it is a bit different from the rest.

Hierarchical clustering in Gitools: The lines in the heatmap header represent the hierarchical tree (dendrogram) splitting at different levels. The root of the tree is located at the bottom, the leaves at the top. See video at YouTube

Read the rest of this entry »

How to identify functional genetic variants in cancer genomes?


The possibility to rapidly and inexpensively sequence tumor genomes is opening important new avenues for cancer research. One of the main objectives when sequencing tumor genomes is to identify the somatic alterations that have a relevant role in developing and maintaining the cancer phenotype. However, the analysis of this data is hindered by the large number of mutations detected in tumors (often on the order of thousands) and the large molecular heterogeneity observed between tumor samples.

As part of the International Cancer Genome Consortium (ICGC), I have spent the last 2-3 years co-leading (together with Lincoln Stein) a working group focused on discussing how to analyze this data. The group is formed by 48 members from 10 different countries, and we have held a teleconference nearly every month. We have now written up the results of these discussions in a perspective manuscript that has been published in the current issue of Nature Methods.

Read the rest of this entry »

Interactive heat-maps to explore biological data

Heat-maps are graphical representations of data where values in a matrix are represented following a color scale. This way of representing data has proven to be a very intuitive and useful way to visualize biological data. With large and complex data being generated in biology, and especially in cancer genomics, static heat-maps are a limited option for data exploration. Instead, we need to be able to analyze data interactively in order to extract knowledge from it.
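To make the idea of a color scale concrete, here is a minimal sketch of how matrix values could be mapped to colors by linear interpolation between two end colors. The blue-to-red scale, the example matrix and the `value_to_rgb` helper are assumptions chosen for illustration, not the way any particular tool does it.

```python
def value_to_rgb(value, vmin, vmax, low=(0, 0, 255), high=(255, 0, 0)):
    """Map a value in [vmin, vmax] to an RGB color between low and high."""
    if vmax == vmin:
        fraction = 0.5  # flat data: use the middle of the scale
    else:
        fraction = (value - vmin) / (vmax - vmin)
    fraction = min(1.0, max(0.0, fraction))  # clamp out-of-range values
    return tuple(round(lo + fraction * (hi - lo)) for lo, hi in zip(low, high))


# Hypothetical 2x2 data matrix (rows could be genes, columns samples).
matrix = [[-2.0, 0.0], [1.0, 2.0]]

flat = [value for row in matrix for value in row]
vmin, vmax = min(flat), max(flat)

# Each cell of the heat-map gets the color that corresponds to its value.
colors = [[value_to_rgb(value, vmin, vmax) for value in row] for row in matrix]
for row in colors:
    print(row)
```

In an interactive heat-map the same mapping is simply recomputed whenever the user changes the scale limits or filters the data.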

Read the rest of this entry »

List of tools to visualize multidimensional cancer genomics data

A while ago we published a review about multidimensional cancer genomics data visualization in Genome Medicine. There, we focused on effective and common visualization techniques for exploring oncogenomics data, and we discussed a selection of tools that allow researchers to effectively visualize multidimensional oncogenomics datasets. Since our research field is constantly evolving, we thought we could share an update of the tools and links.

Read the rest of this entry »

Condel for prioritization of variants involved in hereditary diseases and transFIC for cancer

Over the last few years we have worked on assessing the functional impact of non-synonymous variants (nsSNVs). As a result, we have published two new approaches: Condel and transFIC. In this post I would like to clarify the differences between the two and give our recommendations on when each of them should be used.

Read the rest of this entry »

How to evaluate the performance of computational methods to identify driver mutations?

We have recently published transFIC, a computational method to assess the functional impact of somatic cancer mutations (see this post). To evaluate the performance of transFIC we needed a dataset of driver and passenger mutations. However, we faced a common problem in this field: there is no such dataset that can be trusted and is unbiased. Thus, it was a challenge to properly evaluate the performance of transFIC and compare it to other methods with a similar aim.

Read the rest of this entry »

How to identify cancer drivers from tumor somatic mutations?

Cartoon representing genomic alterations in a tumor cell. Image from NCI.

I have recently seen several presentations by groups that systematically explore alterations in cancer genomes, and they all deliver the same message. One of the main challenges faced by their projects is to identify genes and pathways involved in tumor development (drivers). Very good methods based on the recurrence of somatic mutations have been developed to identify cancer drivers (see, for example, MutSig and the Significantly Mutated Gene (SMG) test from MuSiC). They rely on the assumption that genes that exhibit more mutations than expected by chance are putative drivers. Even though these methods are successful in identifying clear cancer drivers, they also face some known challenges. For instance, the background mutation rate is hard to estimate accurately, and important genes that are mutated only in a small number of tumors may be overlooked. Besides, these methods treat all mutations that may affect protein sequence equally, when their impact on protein function clearly differs.
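The recurrence assumption can be illustrated with a toy calculation. The sketch below is not the algorithm of MutSig or MuSiC, which use far more careful background models; it assumes a single uniform background mutation rate and a plain binomial model, and all numbers are invented.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    # Sum the few terms below k and subtract from 1, which is much
    # cheaper than summing the upper tail when n is large.
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# Hypothetical numbers, for illustration only.
background_rate = 1e-6  # assumed mutations per base per sample
gene_length = 1500      # coding bases in the gene
n_samples = 200         # tumors in the cohort
observed = 5            # mutations seen in this gene across the cohort

# Every base in every sample is treated as one independent trial.
n_trials = gene_length * n_samples
p_value = binomial_tail(observed, n_trials, background_rate)
print(f"expected mutations: {n_trials * background_rate:.2f}")
print(f"P(at least {observed} mutations by chance) = {p_value:.2e}")
```

The result depends directly on `background_rate`, which is exactly why a poorly estimated background rate is such a problem for this family of methods.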


Some time ago we thought that a good way to address these challenges would be to use the Functional Impact Bias (FM bias) observed in genes across a cohort of tumor samples. In other words, we wanted to estimate how the accumulation of mutations with high functional impact on each gene deviates from the average observed in all tumor samples.
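One simple way to picture this idea is a resampling test: compare the average functional impact score of the mutations in a gene with the averages of equally sized random draws from all mutations in the cohort. This is only a sketch of the general idea under assumed inputs, not our published implementation; the scores, the `fm_bias_pvalue` helper and its parameters are hypothetical.

```python
import random
from statistics import mean


def fm_bias_pvalue(gene_scores, all_scores, n_permutations=2000, seed=0):
    """Empirical p-value that a gene's mean impact score is unusually high."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    observed = mean(gene_scores)
    k = len(gene_scores)
    # Count random draws whose mean is at least as high as the observed one.
    hits = sum(
        mean(rng.sample(all_scores, k)) >= observed
        for _ in range(n_permutations)
    )
    # Add one to both counts so the p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical functional impact scores between 0 (benign) and 1 (damaging).
all_scores = [0.1] * 90 + [0.9] * 10  # every mutation in the cohort
biased_gene = [0.9, 0.9, 0.9, 0.9]    # accumulates high-impact mutations
neutral_gene = [0.1, 0.1, 0.1, 0.1]   # only low-impact mutations

p_biased = fm_bias_pvalue(biased_gene, all_scores)
p_neutral = fm_bias_pvalue(neutral_gene, all_scores)
print(p_biased, p_neutral)
```

Note that this test never uses a background mutation rate: a gene stands out because of what its mutations do, not because of how many there are.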

Read the rest of this entry »

Sample Level Enrichment Analysis (SLEA) Tutorial and Gitools 1.6.2

As you may have read in the last post, Günes and Nuria presented the Sample Level Enrichment Analysis (SLEA) as a methodology to analyse the transcription level of each sample for groups of genes (for example pathways or gene signatures).

An example representation of the SLEA methodology

A gene-sample matrix is converted to a module-sample matrix, where a module can be any set of genes, e.g. a pathway. The transcription level status of each module can be used for stratifying the samples and/or relating them to clinical annotation.

It is an easy way to stratify the samples into subgroups and/or relate the transcription level status of modules to clinical data. So this last week we have prepared a further video tutorial to show you how to perform SLEA easily with Gitools and gain more insight into your data.
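For readers who like to see the conversion spelled out, the sketch below turns a gene-sample matrix into a module-sample matrix of z-scores. It uses an analytical approximation (the module mean compared with the mean and standard deviation of all genes in that sample) and is not necessarily how Gitools computes the enrichment; the gene names, values and the `slea_zscores` helper are invented for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical gene-sample matrix: gene -> expression value per sample.
samples = ["S1", "S2"]
expression = {
    "G1": [2.0, 0.1],
    "G2": [1.8, -0.2],
    "G3": [0.0, 0.0],
    "G4": [-0.1, 0.3],
    "G5": [0.2, -0.1],
    "G6": [0.1, -0.1],
}
# Hypothetical modules: module name -> member genes.
modules = {"pathway_A": ["G1", "G2"]}


def slea_zscores(expression, samples, modules):
    """Return module -> one z-score per sample."""
    result = {}
    for name, genes in modules.items():
        row = []
        for j in range(len(samples)):
            column = [values[j] for values in expression.values()]
            members = [expression[g][j] for g in genes if g in expression]
            if not members:
                row.append(float("nan"))  # module has no measured genes
                continue
            # Standard error of the mean of len(members) genes drawn
            # from this sample's overall expression distribution.
            standard_error = pstdev(column) / sqrt(len(members))
            z = (mean(members) - mean(column)) / standard_error if standard_error else 0.0
            row.append(z)
        result[name] = row
    return result


zscores = slea_zscores(expression, samples, modules)
print(zscores)
```

In this toy data the module is clearly up in S1 and unremarkable in S2, which is the kind of per-sample status that can then be used to stratify samples.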

Watch the video below or read the instructions in the fourth step of the Case Study: “Study multi-dimensional cancer data with Gitools”.

With this video tutorial we also release a new version of Gitools, version 1.6.2, which makes it possible to use multi-value data matrices as input data for the enrichment analysis. We also got rid of some bugs.

Download the latest version at

Exploring the effect of cancer genomic alteration on expression with Gitools

Cancer cells often exhibit a change in the number of copies of certain genomic regions when compared to normal cells (Copy Number Alterations: CNAs). Some of these CNAs may have a direct influence on the expression of genes in the affected region. The change in the number of copies of a gene may be either positive, when additional copies are gained (and the gene is thus amplified), or negative, when one or more alleles of the gene are lost. The influence of CNAs on the expression of these amplified or lost genes depends on whether the alteration occurs hetero- or homozygously, and also on other regulatory factors which may override its effect. Therefore, an essential step in assessing the importance of the amplification or deletion of a given gene in the tumorigenic process is to verify whether its expression tends to respond to its genomic alterations.

Effect of genomic alterations on expression

The effect of genomic alterations can be observed in the expression values. Note, for example, that samples with loss of CDKN2A show lower expression values than samples without this alteration. This effect is also evident for the alterations of the other genes.
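The comparison described in the caption can be sketched in a few lines: split the samples by alteration status and compare the expression of the gene between the two groups. The values below are invented, and the Welch t statistic is just one simple choice of comparison, not necessarily the test used in Gitools.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical expression values of one gene, grouped by CNA status.
expression_with_loss = [-1.2, -0.9, -1.5, -1.1]
expression_without_loss = [0.1, -0.2, 0.3, 0.0, 0.2]

t_statistic = welch_t(expression_with_loss, expression_without_loss)
print(f"mean with loss:    {mean(expression_with_loss):.2f}")
print(f"mean without loss: {mean(expression_without_loss):.2f}")
print(f"Welch t statistic: {t_statistic:.2f}")
```

A strongly negative statistic means that samples carrying the loss tend to express the gene at a lower level, which is the pattern you would expect if expression responds to the alteration.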

Read the rest of this entry »