It is often hard to find the right omics data for a precision oncology research project. After learning about the impact of next-generation sequencing and the explosive growth of publicly available data, you might wonder where the RNA-seq datasets on cancer actually are and how easy it is to find what you are looking for.
While working with many students in our OmicsLogic educational programs, we realized the need for high-quality data sources that anyone can learn about and use. Good data comes from collections that follow a consistent standard of metadata annotation, impose minimal restrictions, and give easy access to all the files. Useful annotation includes, for example, detailed phenotypic information for each sample, as well as file sizes and the sequencing instruments used. Another criterion is the number of replicates, whether technical or biological – the best repositories contain many samples per study.
We compiled a list of resources where you can find RNA-seq data to start your oncology bioinformatics project:
1. Elixir’s Expression Atlas
Expression Atlas is an open science resource that gives users a powerful way to find information about gene and protein expression. Its stated mission is to provide the scientific community with freely available information on the abundance and localization of RNA (and proteins) across species and biological conditions such as different tissues, cell types, developmental stages, and diseases. Expression Atlas aims to help answer questions such as ‘where is my favorite gene expressed?’ or ‘how does its expression change in a disease?’ To achieve this objective, the project involves data curation, data analysis, and the development of a web application to access and visualize publicly available data.
To find cancer-related data, simply navigate to the Gene Expression Atlas homepage:
Then, go to “browse experiments”, where you can search for the type of data, organism, and disease. Right away, I spotted a few very nice collections of data, like this collection of 675 commonly used human cancer cell lines provided by Genentech:
2. NCBI – National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/bioproject)
Although it is not the easiest place to find a dataset you are interested in, once you learn to navigate the NCBI site you can find many good datasets. A BioProject is a collection of biological data related to a single initiative, originating from a single organization or a consortium. A BioProject record gives users a single place to find links to the diverse data types generated for that project. As you search, you can narrow the results by data type (such as RNA-seq), type of cancer, and organism.
For example, I was interested in total RNA-seq of colorectal cancer patients and found a project with 110 samples.
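The same kind of narrowed search can also be scripted. The sketch below only builds a search URL for NCBI's public E-utilities ESearch endpoint against the BioProject database; it does not send the request. The function name and the exact search term are illustrative assumptions, so it is worth refining the term on the website first and then copying it into a script.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_bioproject_query(disease, assay="RNA-seq", organism="Homo sapiens", retmax=20):
    """Return an ESearch URL for BioProject records matching the given terms.

    Hypothetical helper: the way the terms are combined is an assumption,
    not an official recipe.
    """
    term = f'{disease} AND {assay} AND "{organism}"[Organism]'
    params = {
        "db": "bioproject",   # search the BioProject database
        "term": term,         # free-text query, same syntax as the website
        "retmode": "json",    # ask for a JSON response
        "retmax": retmax,     # maximum number of record IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_bioproject_query("colorectal cancer")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return a list of matching BioProject IDs, which you can then look up on the NCBI site.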
3. TCGA – The Cancer Genome Atlas
Finally, we cannot ignore The Cancer Genome Atlas (TCGA) – a huge repository of data that is useful for many kinds of cancer research. TCGA data is distributed through the Genomic Data Commons (GDC).
Data on the Genomic Data Commons (GDC) is for research purposes only. The GDC provides researchers with access to standardized clinical and genomic data from cancer studies to enable exploratory analysis that cannot be considered definitive for outcomes. It assists researchers in this exploratory analysis by identifying changes in cancer cells that may play an important role in cancer development. Through the GDC knowledge network, researchers can use data maintained in the GDC to help identify both high- and low-frequency cancer drivers. The available data types include:
Mutations – Variant Call Format (VCF) and Mutation Annotation Format (MAF) files that identify somatic mutations such as point mutations, missense mutations, nonsense mutations, and insertions and deletions (indels) of nucleotides in the DNA.
Copy Number Variants (CNV) – data to identify amplified and attenuated gene expression due to chromosomal duplications, losses, insertions, and deletions.
Expression Quantification – mRNA and miRNA expression values, provided in tab-separated value format.
Post-transcriptional Modifications – mRNA sequence data to assist in identifying post-transcriptional splice modifications that are manifested as splice junction and isoform variants.
Data and metadata are submitted to the GDC in standard data types and file formats through the GDC Data Submission Pipeline. Molecular data stored in the GDC are harmonized against a common reference genome.
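For readers who prefer scripting to clicking, the GDC also exposes a public API. The following is a minimal sketch, not an official recipe: it only assembles the query parameters for the GDC files endpoint, asking for expression quantification files from one project. The helper name, the chosen fields, and the example project ID "TCGA-COAD" are assumptions made for illustration, and no request is actually sent.

```python
import json

# Public GDC API endpoint for file-level metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_expression_query(project_id, size=10):
    """Return query parameters selecting expression files for one project.

    Hypothetical helper: field names follow the GDC API's filter syntax,
    but check them against the current API documentation before relying on them.
    """
    filters = {
        "op": "and",
        "content": [
            # Restrict to a single project, e.g. a TCGA cancer type.
            {"op": "in",
             "content": {"field": "cases.project.project_id", "value": [project_id]}},
            # Keep only gene expression quantification files.
            {"op": "in",
             "content": {"field": "files.data_type",
                         "value": ["Gene Expression Quantification"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),            # filters travel as a JSON string
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),                         # number of hits to return
    }


params = build_gdc_expression_query("TCGA-COAD")
print(GDC_FILES_ENDPOINT)
print(params["filters"])
```

These parameters can be passed to any HTTP client as the query string of a GET request to the endpoint above; the response lists matching files with their IDs, which is usually the first step before downloading anything.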
The GDC itself might not be the easiest place to find raw gene expression tables for many patients, but other sites like Firebrowse (http://firebrowse.org) make the data for each cancer type much easier to navigate and download in a standardized table format that you can use for your analysis. They are a good place to look for RNA-seq datasets on cancer.
Looking for colorectal cancer samples, I found 623 samples very quickly in RSEM normalized read counts.
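Once a table like this is downloaded, a few lines of code are enough to load it and separate tumor from normal samples. The sketch below assumes a particular layout: a tab-separated file with two header rows (sample barcodes, then the value type) and one row per gene labeled as "symbol|id". That layout, the function names, and the tiny example table (whose barcodes and values are invented) are all assumptions, so check them against the file you actually download.

```python
import csv
import io

# Made-up miniature table in the assumed layout; not real TCGA data.
EXAMPLE = (
    "Hybridization REF\tTCGA-AA-0001-01A-01R-0001-07\tTCGA-AA-0001-11A-01R-0001-07\n"
    "gene_id\tnormalized_count\tnormalized_count\n"
    "A1BG|1\t12.5\t30.0\n"
    "TP53|7157\t850.0\t910.25\n"
)


def read_rsem_table(handle):
    """Return (sample_barcodes, {gene_symbol: [value per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    barcodes = next(reader)[1:]  # first header row holds the sample barcodes
    next(reader)                 # second header row (value type) is skipped
    genes = {}
    for row in reader:
        if not row:
            continue
        # Keep the symbol before "|"; rows sharing a symbol would overwrite
        # each other, so key on the full label if that matters for your data.
        symbol = row[0].split("|")[0]
        genes[symbol] = [float(value) for value in row[1:]]
    return barcodes, genes


def sample_type(barcode):
    """Classify a TCGA barcode by its two-digit sample-type code."""
    code = int(barcode.split("-")[3][:2])  # fourth field, e.g. "01A" -> 1
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


barcodes, genes = read_rsem_table(io.StringIO(EXAMPLE))
for barcode, value in zip(barcodes, genes["TP53"]):
    print(barcode, sample_type(barcode), value)
```

To use this on a real download, replace `io.StringIO(EXAMPLE)` with an opened file handle. Splitting by sample type early is helpful because most downstream comparisons (for example, tumor versus matched normal) depend on it.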
TCGA data is useful because each cancer type (liver, breast, lung, etc.) has been organized and published with extensive phenotypic information in major publications, like this one in Cell: “Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma” by The Cancer Genome Atlas Research Network, David A. Wheeler, and Lewis R. Roberts (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680778/pdf/nihms882929.pdf).
The publication has supplementary files that are extremely useful if you are trying to understand what other information is available about patients.
Did we forget to mention a major RNA-seq resource for cancer that you like? Let us know by posting a comment below!