The course will provide students with theoretical and practical expertise in the field of integrative cancer genomics. We will introduce the large oncogenomic projects that are currently active, such as The Cancer Genome Atlas and the International Cancer Genome Consortium. We will also learn how to use tools (such as IntOGen, Gitools and others) to access, manipulate and take advantage of the large amount of available cancer genomics data.
Exercise 1: Gene Search
1) Search for the Myc gene in IntOGen: in which tumour types is it altered? Which types of alterations does it show?
2) Open the 'Bones and joints' tumour type for the Myc gene and go to the experiments tab: in how many CNA experiments is Myc significantly amplified?
3) Open the expression experiment by Li Z et al.: how many samples does this experiment analyze? In how many of them is the Myc gene overexpressed?
Exercise 2: Experiment Search
4) Search for the KEGG pathways that are most significantly expressed in the TCGA glioblastoma experiment.
Exercise 3: Cancer gene prioritization
5) A mutation screening of 20661 human genes in 22 tumour samples of glioblastoma multiforme (Parsons et al., Science 2008) identified 37 candidate cancer genes with recurrent mutations. Use IntOGen to prioritize this list of genes (list of genes).
Exercise 4: Download selected IntOGen datasets using Biomart
6) Download the p-values for up- and down-regulation of all transcriptomic experiments of lung cancer in IntOGen.
HINT: Filter by ICD-O: lung; ANY morphology.
Exercise 5: Combine experiments
7) Combine the three transcriptomic experiments of breast cancer that use the Affymetrix platform. Is BRCA2 misregulated according to these experiments?
1) Search for the CDKN2A gene in the ICGC data portal: in which tumour types has this gene been found mutated?
2) Search for the PTEN gene in the ICGC data portal: in which tumour types has this gene been found mutated?
Finding the pathways enriched among genes significantly up-regulated in different types of cancer
1) Import Data Matrix From IntOGen [tutorial]
Import a data matrix from IntOGen that contains p-values for significantly upregulated genes in each tumor morphology type
2) Import Modules From KEGG [tutorial]
Import a module file that maps all human genes to KEGG pathways
3) Run Enrichment Analysis [tutorial]
Run an enrichment analysis to test which pathways are enriched among genes significantly up-regulated in different types of cancer
4) Explore results in a heatmap [tutorial]
Explore the heatmap with the enrichment analysis results by sorting, filtering, hiding and moving rows and columns
Download the short list of pathways for filtering here
5) Edit heatmaps [tutorial]
Edit properties of the heatmaps: change color scale, add annotations, change font size and edit grid.
6) Export heatmap image and table results [tutorial]
Export the results of the enrichment analysis as an image and as a table containing the details of the statistical results
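To make the idea behind step 3 concrete, here is a minimal Python sketch that tests each pathway for enrichment among significantly up-regulated genes with a one-sided hypergeometric test. It is an illustration under stated assumptions, not the Gitools implementation: the gene names, p-values, pathway contents and the 0.05 cutoff are all invented for the example, and Gitools may use a different test.

```python
from math import comb

def hypergeom_right_tail(k, n_module, n_selected, n_total):
    """P(X >= k), where X is the number of module genes among n_selected
    genes drawn without replacement from n_total genes."""
    upper = min(n_module, n_selected)
    hits = sum(comb(n_module, i) * comb(n_total - n_module, n_selected - i)
               for i in range(k, upper + 1))
    return hits / comb(n_total, n_selected)

# Hypothetical up-regulation p-values for one tumour type (toy data).
pvalues = {"g1": 0.001, "g2": 0.01, "g3": 0.02, "g4": 0.03, "g5": 0.2,
           "g6": 0.5, "g7": 0.6, "g8": 0.7, "g9": 0.8, "g10": 0.9}
# Hypothetical gene-to-pathway mapping (the "modules").
modules = {"pathway_A": {"g1", "g2", "g3", "g5"},
           "pathway_B": {"g6", "g7", "g8"}}

# Genes called significantly up-regulated (assumed cutoff: 0.05).
significant = {gene for gene, p in pvalues.items() if p < 0.05}

enrichment = {}
for name, genes in modules.items():
    genes = genes & set(pvalues)          # keep only genes present in the matrix
    observed = len(genes & significant)   # module genes that are significant
    enrichment[name] = hypergeom_right_tail(
        observed, len(genes), len(significant), len(pvalues))

# Print pathways from most to least enriched.
for name, p in sorted(enrichment.items(), key=lambda item: item[1]):
    print(name, round(p, 4))
```

In this toy example, pathway_A has 3 of its 4 genes among the 4 significant genes, so it gets a small right-tail p-value, while pathway_B has none and gets a p-value of 1.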
Balciunaite et al. (2005) performed a ChIP-chip experiment with antibodies against E2F4, p107, and p130 in cells in G1 and G0 to identify genes directly regulated by these proteins.
We will explore whether the target genes of the E2F4, p107, and p130 proteins are enriched among genes up-regulated in different cancer types.
1) Download the module file containing target genes of E2F4, p107, and p130 in G1 and G0 cells [here]
2) Perform an Enrichment Analysis of these modules in genes up-regulated in different cancer types
NOTE: use the data matrix from IntOGen that contains p-values for significantly upregulated genes in each tumor morphology type (from practical 3.1)
NOTE2: set cells with a value less than 0.05 to 1
3) Change the annotation of columns to see the tumour type (Topography) as in the previous exercise
4) Explore the results and answer the following questions:
a) Are the target genes of E2F4, p107, and p130 proteins significantly upregulated in cancer samples? In which tumour types?
b) How many p130eG1 target genes are significantly upregulated in adrenal cortex? How many would be expected by chance? Is the enrichment significant? What is the p-value?
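The binary cutoff in NOTE2 and the "expected by chance" count in question b) can be sketched in a few lines of Python. This is only an illustration: the gene names, p-values and target set below are invented, and the expected count is computed under the simple assumption that significant genes are spread evenly over all genes in the matrix.

```python
def binarize(pvalues, cutoff=0.05):
    """Set cells with a p-value below the cutoff to 1, all others to 0."""
    return {gene: 1 if p < cutoff else 0 for gene, p in pvalues.items()}

# Hypothetical up-regulation p-values for one tumour type (toy data).
pvalues = {"a": 0.001, "b": 0.02, "c": 0.04, "d": 0.3,
           "e": 0.01, "f": 0.5, "g": 0.7, "h": 0.9}
# Hypothetical set of target genes (one "module").
targets = {"a", "b", "c", "d"}

binary = binarize(pvalues)
targets_in_matrix = targets & set(binary)

n_significant = sum(binary.values())                  # significant genes overall
observed = sum(binary[g] for g in targets_in_matrix)  # significant target genes
# Expected number of significant targets if significance were unrelated
# to being a target gene.
expected = n_significant * len(targets_in_matrix) / len(binary)

print("observed:", observed, "expected:", expected)
```

An observed count clearly above the expected count suggests enrichment; the enrichment test then tells you whether the difference is significant.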
We will compare cancer types with respect to the alteration pattern of genes and the alteration pattern of pathways.
1) Open the matrix of p values imported in PRACTICAL 3.1
HINT: File/Open/HeatMap.
NOTE: change color scale to P-Value scale in order to visualize properly the matrix values.
HINT: Properties/Cells/Scale
2) Perform correlation analysis using the matrix of p values for genes.
HINT: Analysis/Correlations
3) Open the results of EA with pathways from PRACTICAL 3.3
HINT: File/Open/Analysis.
5) Perform correlation analysis using the corrected p values.
HINT: Analysis/Correlations. Take values from Right p-values
6) Compare the results of the two correlation analyses and answer the following questions:
a) Which two cancer types have the highest correlation at the level of genes?
b) What is the correlation coefficient at the level of pathways between the same pair of cancer types? Which one is higher?
c) Which two cancer types have the highest correlation at the level of pathways? Compare this to the correlation at the level of genes for the same pair of cancer types. Which one is higher?
d) Comparing by eye, which of the correlation matrices indicates a higher correlation for all cancer types in general? What might be the reason?
e) Are there any negative correlations at the level of pathways? If so, check the EA results to see which pathways the negatively correlated cancer types upregulate.
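As a rough illustration of what the correlation analysis computes, the Python sketch below calculates a Pearson correlation coefficient for every pair of columns (cancer types) and reports the most correlated pair. It assumes Pearson correlation, which may differ from the option selected in Gitools, and the column names and values are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical matrix: one list of values (rows = genes or pathways)
# per cancer type (columns).
columns = {
    "breast": [0.9, 0.1, 0.8, 0.2],
    "lung":   [0.8, 0.2, 0.9, 0.1],
    "colon":  [0.1, 0.9, 0.3, 0.7],
}

# Correlation for every pair of cancer types.
correlations = {(a, b): pearson(columns[a], columns[b])
                for a, b in combinations(columns, 2)}
best_pair = max(correlations, key=correlations.get)

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("highest correlation:", best_pair)
```

Running the same function once on the gene-level matrix and once on the pathway-level matrix gives the two sets of coefficients that the questions ask you to compare.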
In this practical, we will analyze the prostate cancer dataset by Taylor BS et al. 2010.
The raw and processed data are available from GEO (GSE21032). We normalized the Affymetrix expression data using the RMA method.
The file we will use contains the log2 ratios of cancer to normal expression. For each probe, we take the average expression across the normal samples and divide the expression in each cancer sample by this value; the file stores the log2 of this ratio.
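The calculation described above can be sketched as follows for a single probe. This is a minimal illustration with invented numbers, and it assumes that the expression values are on a linear (non-log) scale; if the values are already log2-transformed, as RMA output usually is, the ratio becomes a subtraction of the normal mean instead.

```python
from math import log2

def log2_ratios(cancer_values, normal_values):
    """log2 of (cancer expression / mean normal expression) for one probe."""
    normal_mean = sum(normal_values) / len(normal_values)
    return [log2(value / normal_mean) for value in cancer_values]

# Hypothetical linear-scale expression values for one probe.
normal = [80.0, 120.0]        # normal samples, mean = 100
cancer = [200.0, 50.0, 100.0] # cancer samples

ratios = log2_ratios(cancer, normal)
print(ratios)
```

A value of 1 means the probe is expressed at twice the normal average, -1 means half, and 0 means no change.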
1) Download and open the file that contains the log2 ratios of cancer to normal expression for the samples in the experiment. You can get the file from here.
To open the file do File>Open>Heatmap and select the file.
2) From the Welcome menu of Gitools, click on the IntOGen icon to import the p values for up-regulation for all the prostate experiments from IntOGen.
Select 'Onco Experiments' and 'upreg'. Topography should be 'prostate gland' and morphology 'ALL'. You should have selected 7 columns.
3) Do File>Open>Heatmap to open the matrix of p values imported in the previous step. (The file should end with 'cdm.gz'.) From Properties>Cells, change the color scale to P-Value scale.
4) We are going to use the genes in this list to filter the heatmap of both the new experiment and the IntOGen experiments.
In both heatmaps, do Data>Filter>Filter by labels and select the file containing the list.
5) For both heatmaps, the names of the rows are Ensembl gene IDs. We will load the annotation file for genes to change the annotation of rows in both matrices to gene symbols.
Do Properties>Rows>Open and select the file that ends with 'rows.tsv.gz'. Change the label to 'symbol' by clicking on the button with 3 dots on it. Select symbol and hit OK.
6) Download the annotation file for the experiment GSE21032.
Do Properties>Columns>Open and select the file. Type '${tumor_type}=${disease_status}' into the pattern box.
7) Compare the two matrices and answer the following questions:
a) Check the columns: what kinds of samples are there in the new experiment?
b) Change the scale of the cells in the heatmap of the new experiment to [-1,1] from the Cells tab in the Properties menu.
You can sort the genes with Data>Sort by value: change the aggregation type to Absolute sum and hit OK.
Taking into account all the samples in the experiment, which genes from the list are upregulated in the majority of the samples in the GSE21032 experiment?
c) Are there any genes exclusively upregulated in the cell lines but down-regulated in primary cancer samples? Name 2 such genes.
d) Are there any genes exclusively down-regulated in the cell lines but upregulated in primary cancer samples? Name 2 such genes.
8) Perform an oncodrive analysis on the new experiment in order to find the genes that are the most over-expressed across the whole set.
From the Welcome menu of Gitools, click on the Driver Alterations icon. Set the data source (change format to continuous) to the file you downloaded in step 1.
Convert the continuous expression matrix to a binary matrix by setting values greater than or equal to 1.5 to 1 and all other values to 0.
Select Bernoulli as the statistical method. Name the analysis and hit Finish.
9) Open the oncodrive analysis and sort the results. Do File>Open>Analysis and select the name you chose in step 8.
10) Create a gene list with the significant genes in this analysis.
You can either do Data/Filter/Filter by value - Corrected right P-Value < 0.05, or select and hide the grey rows. In either case, do File/Export/Labels - Visible row names to save the rows you selected to a file.
11) Do Data>Filter by name and select the name of the file you saved in order to filter the rows in the experiments imported from IntOGen.
12) Compare visually and answer the following questions:
a) Which genes from this list are the most up-regulated across the IntOGen experiments? HINT: use the sort function
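The binarization and test in step 8 can be illustrated with a short Python sketch. It converts log2 ratios to 0/1 with the 1.5 cutoff and then asks, for each gene, how surprising its number of over-expressed samples is under a binomial model. The background probability used here (the overall fraction of 1s in the matrix) is an assumption made for the example, not a description of how Gitools implements its Bernoulli method, and all gene names and values are invented.

```python
from math import comb

def binom_right_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical log2 ratios: genes (rows) x samples (columns).
expression = {
    "G1": [2.0, 1.6, 1.5, 1.7],
    "G2": [0.1, -0.3, 0.2, 0.0],
    "G3": [1.5, 0.2, -1.0, 0.3],
    "G4": [0.0, 0.4, -0.2, 0.1],
}

# Binary cutoff: values >= 1.5 become 1, everything else 0.
binary = {gene: [1 if value >= 1.5 else 0 for value in values]
          for gene, values in expression.items()}

# Assumed background: overall fraction of over-expressed cells in the matrix.
n_cells = sum(len(row) for row in binary.values())
background = sum(sum(row) for row in binary.values()) / n_cells

# Right-tail p-value per gene: is it over-expressed in more samples
# than the background rate would predict?
right_pvalues = {gene: binom_right_tail(sum(row), len(row), background)
                 for gene, row in binary.items()}

for gene, p in sorted(right_pvalues.items(), key=lambda item: item[1]):
    print(gene, round(p, 4))
```

Genes with the smallest right-tail p-values are the ones over-expressed in unexpectedly many samples; in a real analysis these p-values would also be corrected for multiple testing before applying the 0.05 filter from step 10.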
We will perform an enrichment analysis over the expression matrix of the new experiment using the pathway module.
1) From the Welcome menu of Gitools, click on the Enrichment Analysis icon. Set the data source (change format to continuous) to the file you downloaded in step 6.1.
Select the file for the KEGG modules you imported in practical 3. Use the z-score test. This run will take 2-3 minutes.
2) Open the enrichment results. HINT: Open>Analysis and select the file that contains enrichment in its name.
3) Use this pathway list to filter the rows of the heatmap. HINT: Filter>Filter by label and select the file.
4) Load the annotation for the pathways. You should have downloaded this file in PRACTICAL 3 when you exported KEGG pathways using Gitools. HINT: Properties>Rows and select the file.
Set the pattern to ${descr}. This should display the names of the pathways in the heatmap.
5) Change the color scale to the z-score scale and filter the cells by corrected two-tail p-value. Explore the results.
QUESTIONS:
a) Which pathways are upregulated in cell lines but not significant in either the primary or the metastatic samples?
b) Which pathway is down-regulated across all samples?
c) Which pathway is especially down-regulated in cell lines while it is upregulated in most primary and metastatic samples?
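To give an intuition for the z-score test used in this practical, the Python sketch below computes a simplified, analytic z-score for one pathway in one sample: the mean expression of the pathway's genes is compared with the mean of all genes, scaled by the standard error expected for a gene set of that size. This is an approximation for illustration only and is not claimed to match the Gitools implementation; the gene names and values are invented.

```python
from math import sqrt
from statistics import mean, pstdev

def module_z_score(values, module_genes):
    """Simplified z-score of the mean of a gene set against all genes."""
    background = list(values.values())
    in_module = [values[g] for g in module_genes if g in values]
    # Standard error of the mean of a random gene set of the same size.
    standard_error = pstdev(background) / sqrt(len(in_module))
    return (mean(in_module) - mean(background)) / standard_error

# Hypothetical log2 ratios of all genes in one sample.
sample = {"g1": 2.0, "g2": 2.0, "g3": 0.0, "g4": 0.0, "g5": -2.0, "g6": -2.0}

# Hypothetical pathway containing two of the genes.
z = module_z_score(sample, {"g1", "g2"})
print(round(z, 3))
```

A positive z-score means the pathway's genes are, on average, more expressed than the background in that sample (up-regulated); a negative one means they are less expressed (down-regulated).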
2) Export the z scores of the previous pathway enrichment analysis into a file. Do Export>Export Matrix and select z score as the value. Name and save the file.
3) Perform a correlation analysis using this z score file as you did in PRACTICAL 5.2. Don't use any binary cutoff.
4) Run a correlation analysis using the original expression matrix to compare samples at the level of genes.
5) Open the two correlation analysis results. HINT: Open>Analysis and select the correlation file. To view the annotations of the samples, you need to load the annotation file (downloaded in PRACTICAL 5.6) for both rows and columns. HINT: Properties>Rows and Properties>Columns. You might need to decrease the cell size and remove the grid lines to be able to view the heatmap easily. HINT: Properties>Cells. Compare the two correlation results.
QUESTIONS:
a) At gene level, how do cell lines correlate with the rest of the samples?
b) At which level are samples more correlated with each other: genes or pathways?
1) Go to the CGWB web page: link
2) Click on the heatmaps link to see the list of all analysis results available for the cancer genomic data sets included in CGWB.
3) Use the pull-down menus for filtering to select 'gene-based' and 'combined' results for 'copy number' analysis.
4) Click on 'Launch viewer' for glioblastoma analysis from TCGA called 'Gene Copy Number: GBM Combined'. This will prompt you to save a file named 'heatmap.jnlp'.
5) Open the downloaded heatmap with Java Web Start. It will take some seconds to load the heatmap.
QUESTIONS:
a) Are there chromosome-wide regions gained or lost in the whole TCGA data set?
6) In this application, you can search for markers (genes), pathways or samples. Find the CNA results for the sample TCGA-02-0001.
QUESTIONS:
a) What regions are lost and gained in the sample TCGA-02-0001?
7) Search for the 'EGF signaling' pathway and hit Go. This will start a separate application. Explore the results.
QUESTIONS:
a) Which genes for EGF signaling are lost and gained most frequently?
1) Go to the UCSC Cancer Genomics Browser, click here.
2) Click on the link 'Cancer Genomics Browser' at the top of the menu on the left-hand side. This will take you to a page which contains a list of data sets available in UCSC.
By default it loads the heatmap for the latest data set, which was 'TCGA OV Hudson Alpha 1M Duo Copy Number' study while this tutorial was being prepared.
3) Scroll down to find the breast cancer cell line study by Neve et al. This data set contains both CNA and expression analysis.
You can view the results of both analyses by selecting a view option from the corresponding pull-down menus. Select 'box plot' for both alteration types.
Don't forget to deactivate the view for the TCGA ovary study. We will work on the breast cancer cell line study only.
4) Box-plot view is useful to view the results at the level of chromosomes. Switch the view options for both CGH and expression analysis to 'heatmap'.
Then switch the display options from 'chromosomes' to 'genesets' from the drop-down menu at the bottom of the page. This will expand the display options.
5) Use the keyword 'estrogen' to search for the geneset named 'BREAST_CANCER_ESTROGEN_SIGNALING'. Hit 'update' at the top of the display options menu.
6) The view for CGH and expression will be updated to include a handful of genes included in the geneset.
7) The annotations for samples are displayed to the right of heatmap. There is a separate column for each annotation.
Sort the views for both the CGH and the expression analysis according to estrogen receptor (ER) status by clicking on the corresponding annotation column.
Then, holding the control key down, click on the progesterone receptor (PR) column to sort the view according to PR status.
Finally, sort with respect to the Her2 status in the same way. The view will then be sorted by ER, PR and Her2 status, in that order.
8) In the heatmap every column is the expression pattern of a single gene over the samples (rows) for this experiment. If you mouse over a column, you will see to which gene it belongs.
QUESTIONS:
a) Find ERBB2 by mousing over the columns. Check the annotations of the samples that over-express this gene. In which samples is it over-expressed?
b) Find ESR1. Check the annotations of the samples that over-express this gene. In which samples is it over-expressed?
9) Click on the blue bar next to the annotation columns. This will activate the feature settings. Here you can group the samples with respect to an annotation and do further statistical analysis.
10) Select 'ER' from the list of available annotations. Select '-' for the first group and '+' for the second. Hit 'Add Subgrouping'. This will divide the samples into two groups: ER+ and ER-.
11) Select 't-test' and 'Benjamini-Hochberg' from the statistics section. This will perform a t-test to compare ER+ and ER- samples with respect to the expression of the genes in the ER signaling geneset.
Benjamini-Hochberg will be used for multiple testing correction.
QUESTIONS:
a) Name two genes that are over-expressed in ER+ samples.
b) Name two genes that are down-regulated in ER+ samples.
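The two statistical steps used above, a t-test per gene followed by Benjamini-Hochberg correction, can be sketched in Python as follows. The sketch computes a Welch-style t statistic (the browser may use a different t-test variant; this choice is an assumption) and then adjusts a list of p-values with the Benjamini-Hochberg procedure. All expression values and p-values are invented, and the p-values are supplied by hand rather than derived from the t statistic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """t statistic comparing the means of two groups (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the
    # adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical expression of one gene in ER+ and ER- samples.
er_positive = [5.0, 6.0, 7.0]
er_negative = [1.0, 2.0, 3.0]
t = welch_t(er_positive, er_negative)  # positive: higher in ER+ samples

# Hypothetical raw p-values for four genes in the geneset.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)

print(round(t, 3), [round(p, 3) for p in adjusted])
```

The sign of the t statistic tells you the direction of the difference (positive here means higher expression in ER+ samples), and the adjusted p-values tell you which differences remain significant after correcting for the number of genes tested.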
1) Go to the web page of COSMIC by clicking here.
2) Type NF1 into the search box and hit 'Search'.
3) Click on 'histogram' to see the mutations on the sequence of NF1 and the tissues in which it was found to be mutated.
QUESTIONS:
a) Check the 'Mutations' and 'Distribution' links. What kind of alterations have been observed in this gene?
b) In which tissue have complex frameshift alterations been detected for this gene? Which alteration type is the most common?
c) What are the top three tissues in which NF1 is most frequently mutated?
4) Click on the link for the primary tissue 'central nervous system'. Select all tissues for this primary tissue, 'glioma' histology and all subhistologies for glioma.
QUESTION:
a) In which subhistology is NF1 the most mutated?
5) Go back to the main page and this time search for 'central nervous system'.
QUESTION:
a) What genes are the most mutated in this primary tissue?
6) Click on IDH1.
QUESTION:
a) What kind of mutation types have been observed for this gene in this primary tissue?
7) Click on 'Switch view' to see the mutations for this gene in other tissues.
QUESTION:
a) In what other primary tissue types has this gene been mutated?
We will use MeV to cluster a gene expression data set. You can get MeV here.
The expression set we will use comes from Chin K. et al. 2006. We normalized the Affymetrix arrays with RMA.
The file contains the expression levels of 130 breast cancer samples.
1) Start MeV by double-clicking on the MeV icon in the bin folder of MeV. Click on File/Load File to open the breast cancer data set file. You can get the file from here. You have to extract the file, since MeV cannot read compressed files. Click on the upper-leftmost cell that contains an expression value so that MeV knows in which row and column the data start.
2) Select human as the organism and Affy_HG_U95Av2 as the platform. This will load the probe annotation directly. Hit Load.
3) A heatmap of the expression matrix will be generated. Load the sample annotation file for this experiment by clicking on Utilities/Append Sample Annotation and selecting the file 'e-tabm-158_sample-info.mev.tsv'.
You can get the annotation file from here
4) Change the displayed names for samples by clicking on Display/Sample-Column Labels and select the annotation title 'descr'. This will display the ER/PR/Her2 status of the samples instead of the sample ids.
5) We will filter the rows (probes) in the expression matrix. For that click on 'Adjust Data'/'Data Filters'/'Variance Filter'. Enter '1' for the 'Percentage of highest SD genes'.
Only the 1% of probes (229 probes) with the highest standard deviation in the initial expression matrix will pass this filter.
6) A 'Data Filter' node will be created on the left menu. Expand it and click on 'Expression Image' to view the heatmap of the selected probes. Right-click on 'Expression Image' and click on 'set as data source' so that all the following steps use this matrix.
We will do hierarchical clustering with the resulting matrix.
7) Click on 'clustering' options on the upper tool bar and select hierarchical clustering. In the hierarchical clustering menu that appears, click on 'Optimize sample leaf order'.
This will result in a sample ordering in the tree that is optimized according to the distance.
8) Select 'Pearson correlation' as the distance metric. Leave the linkage method as 'Average linkage' and hit run.
9) A new node named HCL will be created on the left menu. Expand it and click on 'HCL Tree' to view the resulting tree.
10) If the whole tree doesn't fit the screen, you can change the cell size from 'Display'/'Set element size'.
You can also adjust the width of the heatmap view without changing the cell size by moving the line that separates the left menu and heatmap view. Explore the tree.
11) Click on 'Visualisation' on the upper tool bar and select 'Gene Distance Matrix'. Select 'samples' and hit OK.
This will create a sample vs. sample correlation matrix.
12) Click on the GDM node created on the left menu to visualize the distance matrix. Right-click on the matrix, click on 'Impose cluster result' and select the HCL that you have just performed. This will arrange the order of the columns and the rows according to the tree structure. The GDM view might help you observe the clusters more easily.
QUESTIONS:
a) How many clusters are there in the tree?
b) Check the annotations of the samples in each cluster. Are the clusters populated with samples that have similar ER and PR status?
13) Click on 'clustering' options on the upper tool bar and select 'K-means clustering'. In the k-means menu, select Pearson correlation as the distance metric.
14) Set the number of clusters to the number of clusters you found in the hierarchical clustering analysis. Select the 'Cluster samples' option and hit OK.
15) Expand the KMC node created on the left menu. Expand 'Expression Images' and explore the gene expression matrix for all the clusters.
16) Expand 'Table views' to see the list of samples in each cluster. Do samples with similar ER/PR status fall into the same clusters?
17) Continue with the next practical.
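The variance filter from step 5 of this practical can be pictured with the short Python sketch below, which keeps the given percentage of probes with the highest standard deviation across samples. It is only an illustration: the probe names and values are invented, and the way the number of kept probes is rounded is an assumption that may differ from what MeV does.

```python
from statistics import stdev

def top_sd_rows(matrix, percent):
    """Names of the probes with the highest standard deviation,
    keeping the given percentage of all probes (at least one)."""
    n_keep = max(1, round(len(matrix) * percent / 100))
    ranked = sorted(matrix, key=lambda probe: stdev(matrix[probe]), reverse=True)
    return ranked[:n_keep]

# Hypothetical expression matrix: probes (rows) x samples (columns).
matrix = {
    "p1": [1.0, 1.0, 1.0, 1.0],   # constant, SD = 0
    "p2": [0.0, 10.0, 0.0, 10.0], # highly variable
    "p3": [5.0, 6.0, 5.0, 6.0],   # slightly variable
    "p4": [1.0, 9.0, 2.0, 8.0],   # highly variable
}

# Keep the top 50% most variable probes (the practical uses 1%).
kept = top_sd_rows(matrix, 50)
print(kept)
```

Filtering on variability before clustering removes probes that hardly change between samples and therefore carry little information for separating the samples into groups.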
Use the filtered expression matrix from PRACTICAL 12 to build a classifier that differentiates ER+/PR+ and ER-/PR- breast cancer samples.
1) Click 'Classification'/'Support Vector Machines' on the upper tool bar.
2) In the SVM menu, select 'Classify samples' and 'Train SVM and then classify'. Hit 'continue'.
3) Keep the default settings in the next menu and hit OK.
4) The SVM Classification Editor window will appear. Sort the samples according to their annotations with 'Tools'/'Sort by'/'Sample-Experiment Name'.
SVM can perform two-class classification only. Assign the ER-/PR- samples to 'In Class', the ER+/PR+ samples to 'Out of Class' and the rest to 'Neutral'. Hit the Play button.
5) Expand the SVM node on the left menu. Click on 'SVM Training Result'.
QUESTION:
a) Observe the weight column. What weight do 'Neutral' samples have? Do the 'in class' samples have a positive or a negative weight?
6) Click on 'SVM Classification Result'. Explore the table.
QUESTION:
a) Which samples are correctly assigned to the class, and which are assigned out of class? (HINT: samples with 1 in the class column are assigned to the ER-/PR- class by the SVM; compare this to the initial annotation you provided. Samples assigned out of class have class=none.)
7) Click on 'classification information'.
QUESTION:
a) What are the false positive rate and the true positive rate for the 'in class' and 'out of class' predictions?
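If you want to check these rates by hand, the following Python sketch shows how the true positive rate and false positive rate follow from the initial annotation and the predicted class. The labels are invented for illustration and do not come from the MeV output.

```python
def rates(truth, predicted, positive):
    """True positive rate and false positive rate for one class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: "in" = ER-/PR- (in class), "out" = ER+/PR+ (out of class).
truth     = ["in", "in", "in", "out", "out", "out", "out"]
predicted = ["in", "in", "out", "in", "out", "out", "out"]

tpr_in, fpr_in = rates(truth, predicted, positive="in")
tpr_out, fpr_out = rates(truth, predicted, positive="out")

print("in class:     TPR", round(tpr_in, 3), "FPR", round(fpr_in, 3))
print("out of class: TPR", round(tpr_out, 3), "FPR", round(fpr_out, 3))
```

Note that the rates depend on which class is treated as positive, which is why the question asks for them separately for the 'in class' and 'out of class' predictions.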