Modeling Precision Medicine
Precision Medicine promises to revolutionize the way patients are treated by using precise molecular information that provides reliable indicators of treatment effectiveness. This is useful both in drug development and in patient diagnosis, where it helps identify precise subgroups of patients. In this project, we will leverage cell lines from human breast tumors and a study in which multi-omics data from those cell lines was integrated with their response to various cancer treatments. As a result, you will learn to analyze various omics data types, integrate them, and associate them with a phenotype (response to treatment) using sophisticated machine learning algorithms.
Cell lines have been used as research models for decades, and in drug discovery, cell-line panels are regularly used to screen compounds. Many candidate compounds have to be tested for efficacy and toxicity so that a treatment delivers its benefit while limiting side effects.
In breast cancer, cell lines mirror many of the molecular characteristics of the tumors from which they were derived, and are therefore a useful preclinical model in which to explore strategies for predictive marker development. However, breast cancer is also a clinically and genomically heterogeneous disease. Six subtypes were defined approximately a decade ago based on transcriptional characteristics and were designated luminal A, luminal B, ERBB2-enriched, basal-like, claudin-low and normal-like. New cancers can be assigned to reported subtypes using a 50-gene transcriptional signature designated the PAM50. However, the number of distinct subtypes is increasing steadily as multiple data types are integrated.
By studying informative multi-omics features in an integrated way, this project gives a perspective on both drug screening and precise diagnosis of patients.
This project was inspired by Daemen et al., 2013, “Modeling precision treatment of breast cancer”, an analysis of over 70 breast cancer cell lines and over 90 therapeutic agents. The study included SNP array (a type of microarray), RNA-seq (which profiles the whole transcriptome), exome-seq (exome capture, which sequences the protein-coding regions of the genome), and genome-wide methylation (epigenetics) data.
- Molecular Profiling: Gene panels are used to determine cancer subtype; for example, PAM50 determines breast cancer subtypes from gene expression data. This profiling sometimes relies on genomics data (mutations) and sometimes on transcriptomics data (gene/isoform expression). This project will look at those types of features as well as combinations of features that have more predictive power for therapeutic efficacy.
- Precision Medicine: Precision Medicine is an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person.
- Predicting Therapeutic Response: A major limitation for precision medicine is that more traditional therapeutics are designed to work for anyone who fits a broad disease definition. Precise identification of the patients who will respond to a therapeutic has therefore been a major challenge in clinical trials, which are a major step in getting a therapeutic into clinical use. Methods of predicting response are critical for getting more precision therapeutics to patients.
- Levels of molecular regulation: Multi-omics is a biological analysis approach in which the data sets come from multiple “omes”, such as the genome, proteome, transcriptome, epigenome, and microbiome. Combining these “-omes” into a single analysis makes it possible to search large, complex data sets for biomarkers more effectively.
- Machine Learning for Biomedical Data: Machine learning today was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. The iterative aspect of machine learning is important because as models are exposed to new data, they are able to independently adapt. They learn from previous computations to produce reliable, repeatable decisions and results.
- Missing Data Approximation: Our drug-efficacy table contains missing values (shown as “NA”, which stands for “not available”), but the majority of analysis methods can’t cope with missing values. We will learn how to use specialized methods to fill in missing values for further analysis. To “restore” a value for a sample (cell line) C and feature (drug) D based on the available values, we can find samples similar to C and “transfer” the values those samples have for D to the table cell being filled. Alternatively, we can find features similar to D and “transfer” the values those features have for the sample C.
- Classification: This project includes multiple cancer subtypes (luminal, basal, claudin-low and normal-like). Can we identify the subtype of a given sample based on gene expression, and can we find the genes that are most informative for this identification? Answers to these questions are provided by supervised machine learning methods, specifically classification algorithms.
- Feature Selection: Using stepwise LDA to select the most informative features for classification. Stepwise feature selection starts by testing all individual features and selects the one that provides the best classification quality (on the training set). Then it tests all pairs in which the first feature is the one selected at the previous step, and again selects the pair that provides the best classification quality. The process continues to triples, quadruples, and so on. Such a greedy strategy is not generally optimal, but it provides a result in an acceptable time.
- Clustering: Classification methods can provide reliable identification of a class (e.g., pathology subtype) for a sample. This identification can be clinically important: for example, it can be used to select the optimal therapy scheme. But classification methods are supervised: they need a training set in which the separation into classes is already known. If we don’t have an annotated dataset, we need to use clustering methods instead. These aim to divide the set of samples into subgroups (clusters) in such a way that similar samples are grouped together in one cluster, while dissimilar samples are assigned to different clusters.
- Multi-Omics Integration: One application of multi-omics data is to identify molecular signatures that can be used to assign specific treatment to individual patients. However, strategies to develop optimal predictive marker sets are still being explored. Indeed, it is not yet clear which molecular data types (genome, transcriptome, proteome, and so on) will be most useful as response predictors. Here we evaluate various data types and an integration model to assign treatments and select similar molecules that can be potentially used for similar efficacy.
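Two of the procedures described in the list above, neighbor-based filling of missing values and greedy stepwise feature selection, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the course's implementation: the function names (`knn_impute`, `centroid_accuracy`, `forward_select`) are hypothetical, the numbers are made-up toy values rather than data from the study, and a simple nearest-centroid classifier stands in for LDA so that the example needs no external libraries.

```python
import math


def knn_impute(table, k=2):
    """Fill None cells with the mean value from the k most similar rows.

    table: list of rows (samples, e.g. cell lines); each row lists values
    per feature (e.g. drug), with None marking a missing ("NA") entry.
    """
    def distance(a, b):
        # Root-mean-square difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in table]  # leave the input untouched
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows that do have a value in column j.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(table)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def centroid_accuracy(samples, labels, features):
    """Training-set accuracy of a nearest-centroid classifier (LDA stand-in)."""
    groups = {}
    for row, label in zip(samples, labels):
        groups.setdefault(label, []).append([row[f] for f in features])
    centroids = {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    correct = 0
    for row, label in zip(samples, labels):
        point = [row[f] for f in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == label
    return correct / len(samples)


def forward_select(samples, labels, n_features):
    """Greedy stepwise selection: each round adds the feature that helps most."""
    selected = []
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < n_features:
        best = max(
            remaining,
            key=lambda f: centroid_accuracy(samples, labels, selected + [f]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy drug-response table: rows are cell lines, columns are drugs.
responses = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 9.0],
    [1.0, 2.0, 3.2],
]
print(knn_impute(responses, k=2)[1])

# Toy expression table: rows are samples, columns are genes.
expression = [[0.1, 1.0, 7.0], [0.9, 1.2, 3.0], [0.2, 5.0, 6.0], [0.8, 5.3, 2.0]]
subtypes = ["luminal", "luminal", "basal", "basal"]
print(forward_select(expression, subtypes, n_features=2))
```

The sketch fills values from similar samples; transposing the table first (for example with `[list(col) for col in zip(*table)]`) gives the alternative that borrows from similar features. As in the description above, the selection step scores each candidate on the training set only, so in practice the chosen features should be checked on held-out data.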
This project was created in collaboration with the Tauber Bioinformatics Research Center at the University of Haifa.
- Lectures: 10
- Quizzes: 0
- Duration: 50 hours
- Skill level: All levels
- Language: English
- Students: 165
- Certificate: Yes
- Assessments: Yes
An insightful research project.
This project is an excellent ice-breaker for those who would like an introduction to the research field. Through this project, one can learn more about the various tools and methods used to perform multi-omics data analysis. It helped me understand the biology behind the data better, and thus how to approach a biological question while doing research. The application of machine learning methods to biomedical data to gain insight into a biological problem was well explained, with ample hands-on exercises. The assignments at the end of the course give us scope to think and work beyond reproducible research.
A Very Good Research Project
I really liked this project, as it gives a complete experience while going through a variety of techniques and tools (algorithms) for understanding the information in big data sets. One of the biggest lessons is the integration of multi-omics data, comparing it with unsupervised methods and going into multi-omics data clustering. This project gives good hands-on experience, and it’s a good starting point for learning more and thinking about biological problems. On the technical side, it was very interesting to understand the Missing Data Approximation method. For biological data it does make sense.