Using microarray profiling and regularized gradient-boosted trees to
discover subtype- and grade-identifying biomarkers in human glioma
Kevin Song1, Dionna Jacobson2, Ravina Jain3
1 Department of Biomedical Informatics, Stanford University School of Medicine
2 Department of Biology, Stanford University
3 Department of Human Biology, Stanford University
June 6, 2017
1 Abstract
Whole-tissue biopsies are an invasive means of tumor subtyping: they require patients to undergo risky
surgical procedures and increase the cost of patient care. Non-invasive methods, such as needle and liquid
biopsies, are promising alternatives for diagnosing human glioma subtype and grade. In this project, an
additive, regularized, gradient-boosted ensemble tree (XGBoost) model trained on microarray-derived mRNA
expression levels obtained from whole-tissue tumor biopsies was used to identify important biomarker genes
for classifying brain tumor grade and cancer subtype. Though our XGBoost classification models were
not viable (due to their high misclassification error when distinguishing between all grades and all subtypes),
we report the identification of a pair of diagnostic biomarkers (TNFRSF1B and YBX3) that distinguish
between glioblastoma and oligodendroglioma tumors, as well as a pair of diagnostic biomarkers
(TNFRSF1B and VMP1) that distinguish between grade II and grade IV gliomas. Our analysis identifies
candidate biomarkers that may aid in non-invasive diagnosis of various glioma subtypes and grades, and
provides insights into the genetic identification and development of gliomas.
2 Introduction/Background
Brain neoplasia represents the second most common cancer in children [1]. Early detection and treatment of
brain neoplasia can lead to better outcomes and survival rates for affected patients, and an accurate diagnosis
plays an important role at every stage of care. Currently, tumor subtyping relies on whole-tissue biopsies,
which are invasive and risky procedures. Diagnostic tools can enable doctors to determine
the best treatment approach, monitor the progress of treatment, and modify current therapies when needed.
Specifically, non-invasive methods such as needle biopsies constitute less risky and less invasive alternatives
to traditional whole-tissue biopsies when used to diagnose glioma subtype and grade.
Here, we trained regularized, gradient-boosted, multinomial, ensemble tree classification (XGBoost) models
on a glioma mRNA expression dataset, and used these models to identify genetic biomarkers for diagnosing
glioma subtype and grade. Our findings have implications for how brain tumors develop,
how certain genes are regulated concurrently with the presence of disease, and for the development of
non-invasive screening methods and novel diagnostic criteria.
3 Results and Methods

3.1 Data Processing
The dataset used to construct our predictive models consisted of gene expression profiles for 153 patients
(Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray) [2]. Data pre-processing steps followed
standard protocols recommended by Affymetrix, and are fully described in Sun et al. [2]. Differential expression
between glioma and human nontumor brains was measured previously using log2-transformed expression
values and a standard unpaired two-tailed Student's t-test. Brain tumors were graded from 1 to 4, and tumors
were pathologically diagnosed to WHO standards by cancer subtype (astrocytoma, glioblastoma, and
oligodendroglioma), along with one non-cancerous classification (epilepsy) [2]. Data were further processed by
removing expression data for all epileptic patients.
3.2 Model Overview

Using an annotated glioma dataset, regularized gradient-boosted tree models were constructed to predict
the grade and cancer subtype of brain tumors. Boosted ensemble tree methods tend to exhibit greater predictive
accuracy than many other classification methods, and constitute some of the best out-of-the-box classifier[s] in
use today [3]. In addition, compared to traditional gradient-boosted machines, XGBoost incorporates
regularization in order to reduce overfitting. A final justification for selecting XGBoost stems from its
status as a method of choice in many of today's Kaggle data science competitions.
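For reference, the regularization mentioned above enters the XGBoost training objective as a penalty on tree complexity. The form below follows the notation of the XGBoost authors rather than anything defined in this report: l is a differentiable loss (for a multinomial problem, typically the softmax cross-entropy), f_k is the k-th tree, T is its number of leaves, w is its vector of leaf weights, and gamma and lambda are tuning parameters whose values are not reported here.

```latex
\mathcal{L} \;=\; \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) \;+\; \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) \;=\; \gamma T \;+\; \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}
```

Larger values of gamma and lambda favor smaller trees with shrunken leaf weights, which is the mechanism by which overfitting is discouraged.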
Altogether, we posed two supervised learning problems for this project:
1. Generating an XGBoost model to classify tumor grade as a function of mRNA expression levels at
various genes.
2. Generating an XGBoost model to classify tumor subtype as a function of mRNA expression levels at
various genes.
3.3 Model Construction

The initial processed dataset was split into a training dataset (n = 100) and a testing dataset (n = 53).
Five-fold cross-validation was performed on the training dataset for both XGBoost models, and the boosting
iteration that minimized cross-validated misclassification error was used to construct the final models on the
full training dataset.
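The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the report: the helper names `k_fold_indices` and `best_boosting_round` are hypothetical, the folds are unstratified, and the per-fold error curves are made-up numbers standing in for the validation errors that a boosting library would record after each round.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def best_boosting_round(fold_errors):
    """fold_errors[f][r] is the validation error of fold f after round r + 1.

    Returns the 1-based round that minimizes the mean error across folds,
    together with that mean error.
    """
    n_rounds = len(fold_errors[0])
    mean_curve = [mean(fold[r] for fold in fold_errors) for r in range(n_rounds)]
    best = min(range(n_rounds), key=mean_curve.__getitem__)
    return best + 1, mean_curve[best]

# n = 100 training samples, as in the text; fold membership is illustrative.
folds = k_fold_indices(100, k=5)

# Hypothetical error curves (5 folds x 4 boosting rounds), for illustration only.
toy_errors = [
    [0.50, 0.40, 0.35, 0.40],
    [0.55, 0.45, 0.30, 0.35],
    [0.45, 0.40, 0.40, 0.45],
    [0.50, 0.35, 0.35, 0.40],
    [0.50, 0.40, 0.35, 0.35],
]
best_round, best_err = best_boosting_round(toy_errors)
```

The final model would then be refit on all 100 training samples with the number of boosting rounds fixed at `best_round`.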
3.4 Model Performance Evaluations

Our subtype-predictive model had an overall misclassification rate of 43.396% across all classes on the testing
dataset. Per class, its misclassification rate was 100% for astrocytomas, 22% for glioblastomas,
and 23% for oligodendrogliomas.
Our grade-predictive model had an overall misclassification rate of 33.962% across all classes on the testing
dataset. Per class, its misclassification rate was 18.75% for grade II tumors, 70% for grade
III tumors, and 29.629% for grade IV tumors.
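The relationship between the per-class and overall rates can be made concrete with a short calculation. The per-grade counts below are an inference, not values stated in this report: they are one set of integers that reproduces the reported grade percentages while summing to the 53 test samples. The function name `misclassification_rates` is likewise hypothetical.

```python
def misclassification_rates(counts):
    """counts maps class label -> (n_misclassified, n_total).

    Returns (per-class error in percent, overall error in percent).
    """
    per_class = {c: 100.0 * wrong / total for c, (wrong, total) in counts.items()}
    wrong_all = sum(wrong for wrong, _ in counts.values())
    total_all = sum(total for _, total in counts.values())
    return per_class, 100.0 * wrong_all / total_all

# ASSUMPTION: counts inferred from the reported percentages and n = 53;
# the report itself gives only the percentages.
grade_counts = {"II": (3, 16), "III": (7, 10), "IV": (8, 27)}
per_class, overall = misclassification_rates(grade_counts)
```

Under these assumed counts, 18 of 53 test tumors are misclassified, which matches the reported overall grade error of 33.962%. The overall rate is a sample-weighted average, so the poor grade III performance is partly masked by the larger grade II and grade IV groups.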
3.5 Variable Importance

The most important genes for classification by our subtype-predictive model were TNFRSF1B and YBX3,
which corresponded to reporter genes 203608_at and 201161_s_at (Fig. 1). The most important genes for
classification by our grade-predictive model were TNFRSF1B and VMP1, which corresponded to reporter
genes 203608_at and 1569003_at (Fig. 2).
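Reading off the top predictors amounts to sorting per-feature importance scores and translating probe set IDs to gene symbols. The sketch below assumes the scores are available as a plain dictionary. The probe-to-gene mapping is taken from the text (with IDs written using underscores, per Affymetrix convention), while the scores, the extra probe name, and the helper `top_genes` are hypothetical.

```python
# Probe set -> gene symbol, as reported in the text.
PROBE_TO_GENE = {
    "203608_at": "TNFRSF1B",
    "201161_s_at": "YBX3",
    "1569003_at": "VMP1",
}

def top_genes(importance, k=2):
    """Return the k highest-scoring features as (gene or probe, score) pairs."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [(PROBE_TO_GENE.get(probe, probe), score) for probe, score in ranked[:k]]

# Hypothetical importance scores, for illustration only.
toy_importance = {
    "203608_at": 0.31,
    "201161_s_at": 0.12,
    "hypothetical_probe_at": 0.04,
}
top = top_genes(toy_importance)
```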
3.6 Distributions of Most Predictive Genes

Glioblastomas were characterized by low expression of TNFRSF1B, whereas oligodendrogliomas exhibited
elevated expression of TNFRSF1B (Fig. 3). The opposite trend was observed for YBX3 (the
second-most-predictive gene for subtype classification), with glioblastomas exhibiting elevated expression of
YBX3 and oligodendrogliomas exhibiting reduced expression of YBX3. The distributions of TNFRSF1B and
YBX3 gene expression appeared to be more variable and diffuse for astrocytomas.
Regarding tumor grade, tumors of grade II and III exhibited elevated expression of TNFRSF1B, and
grade IV tumors were characterized by reduced levels of TNFRSF1B (Fig. 4). For VMP1 (the
second-most-predictive gene for grade classification), grade II tumors exhibited low expression of VMP1, and grade
IV tumors exhibited elevated expression of VMP1. Expression of VMP1 for grade III tumors appeared to
be more variable, and intermediate between that of grade II and grade IV tumors.
3.7 Hypothesis Testing for Differences in Mean Expression of Top Predictor Genes

The following two-sample, one-sided Student's t-tests were conducted to determine whether mean expression
of the top predictor genes differed significantly between subtype and tumor grade populations:
1. A null hypothesis (of no difference in mean expression of TNFRSF1B for glioblastoma and
oligodendroglioma) was tested against the alternative that glioblastoma have lower mean expression of
TNFRSF1B than oligodendroglioma. With p-value < 2.2e-16 << 0.05, there is evidence to suggest
that glioblastoma have significantly lower mean expression of TNFRSF1B than oligodendroglioma.
2. A null hypothesis (of no difference in mean expression of YBX3 for glioblastoma and oligodendroglioma)
was tested against the alternative that glioblastoma have greater mean expression of YBX3 than
oligodendroglioma. With p-value = 2.271e-12 << 0.05, there is evidence to suggest that glioblastoma
have significantly greater mean expression of YBX3 than oligodendroglioma.
3. A null hypothesis (of no difference in mean expression of TNFRSF1B for grade II and grade IV tumors)
was tested against the alternative that grade II tumors have greater mean expression of TNFRSF1B
than grade IV tumors. With p-value < 2.2e-16 << 0.05, there is evidence to suggest that grade II
tumors have significantly greater mean expression of TNFRSF1B than grade IV tumors.
4. A null hypothesis (of no difference in mean expression of VMP1 for grade II and grade IV tumors) was
tested against the alternative that grade II tumors have lower mean expression of VMP1 than grade
IV tumors. With p-value = 6.399e-13 << 0.05, there is evidence to suggest that grade II tumors have
significantly lower mean expression of VMP1 than grade IV tumors.
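The test statistic behind comparisons like these can be sketched with the standard library alone. Two assumptions are worth flagging: the report does not say whether equal variances were assumed, so the unequal-variance (Welch) form is used here as one common choice, and the expression values are hypothetical log2 intensities, not data from the study. The function name `welch_t` is also illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic and approximate degrees of freedom.

    A strongly negative t supports the one-sided alternative mean(a) < mean(b).
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log2 expression values, for illustration only.
gbm = [6.1, 6.4, 5.9, 6.3, 6.0]
oligo = [7.8, 8.1, 7.6, 8.0, 7.9]
t_stat, df = welch_t(gbm, oligo)
```

The one-sided p-value is the lower-tail (or, for a "greater" alternative, upper-tail) probability of `t_stat` under a t distribution with `df` degrees of freedom. Computing it requires the t distribution's CDF, which is not in the Python standard library and is therefore omitted from this sketch.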
4 Conclusions and Future Directions

Though our XGBoost predictive models were not viable (due to high rates of misclassification), our
two-sample, one-sided t-tests revealed significant differences in the mean expression of the two most predictive
genes in glioblastoma versus oligodendroglioma tumors, and in grade II versus grade IV tumors. The biomarker
gene Tumor Necrosis Factor Receptor Superfamily Member 1B (TNFRSF1B) was found to distinguish both
tumor subtypes and tumor grades in our study. The protein encoded by this gene is a member of the
TNF-receptor superfamily, and is responsible for regulating the recruitment of the anti-apoptotic proteins c-IAP1
and c-IAP2 [4]. TNFRSF1B can indirectly repress apoptosis and was identified as an upregulated biomarker
of the oligodendroglioma subtype in our analysis. Because this gene exhibits oncogenic properties, it may
contribute to the development of some types of brain cancer, and is a candidate diagnostic marker
for oligodendroglioma.
TNFRSF1B's role in tumor grade is less clear. The suppression of apoptosis would be expected to be favored in
higher-grade cancers, yet our findings indicate that TNFRSF1B expression is enriched in grade II tumors,
and not in higher-grade tumors. It is possible that TNFRSF1B's role in the early development of brain cancer
is much greater than its role in later stages, in which case it should be a focus in early cancer diagnosis. One reason
for TNFRSF1B's identification in both our predictive models may have been our failure to isolate subtypes by
separate grades or grades by separate subtypes. Oligodendroglioma and grade II tumors demonstrated high
overlap in our dataset, which may have introduced bias in our models. Therefore, future models should be
trained on larger datasets that account for and isolate potential confounding variables.
Y-box-binding protein 3 (YBX3), which is involved in translational repression, was identified in our analysis
as upregulated in glioblastoma [5]. YBX3 regulates inflammatory immune responses and cellular recruitment,
and may be crucial for mediating cellular defense mechanisms in glioblastoma [6]. Vacuole membrane
protein 1 (VMP1) was found to be highly expressed in grade IV tumors. It encodes a transmembrane protein
that was recently found to be involved in tumor metastasis by playing a vital role in balancing autophagy
and apoptosis [6]. Therefore, its expression in higher-grade cancers may be correlated with the development of
metastasis.
The results of our t-tests suggest that future subtyping of cancers could be performed via needle biopsy, rather than
by invasive open-skull procedures. Given the early detection of a nonspecific glioma using MRI or other radiological
imaging, one could then subtype (glioblastoma versus oligodendroglioma) and grade (grade
II versus grade IV) the observed tumor using a non-invasive needle biopsy and a gene expression
microarray/sequencing procedure.
Our proposed method could also be used to monitor tumor status as a patient's tumor
changes grade over the course of his/her chemotherapeutic/radiological treatment. In the future,
we will continue to identify diagnostic features of cancer subtypes and grades by applying our predictive
model to larger cancer datasets.
References
[1] Ilie, Marius, and Paul Hofman. Pros: Can Tissue Biopsy Be Replaced by Liquid Biopsy? Translational
Lung Cancer Research. AME Publishing Company, Aug. 2016. Web. 06 June 2017.
[2] Sun, Lixin, Ai-Min Hui, Qin Su, Alexander Vortmeyer, Yuri Kotliarov, Sandra Pastorino, Antonino
Passaniti, Jayant Menon, Jennifer Walling, Rolando Bailey, Marc Rosenblum, Tom Mikkelsen, and Howard
A. Fine. Neuronal and Glioma-derived Stem Cell Factor Induces Angiogenesis within the Brain. Cancer
Cell 9.4 (2006): 287-300.
[3] Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York, NY: Springer, 2016. Print.
[4] Tnfrsf1b TNF Receptor Superfamily Member 1B [Rattus Norvegicus (Norway Rat)] - Gene - NCBI.
National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 06 June
2017.
[5] YBX3 Y-box Binding Protein 3 [Homo Sapiens (human)] - Gene - NCBI. National Center for
Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 06 June 2017.
[6] Guo, X. Z., X. L. Ye, W. Z. Xiao, X. N. Wei, Q. H. You, X. H. Che, Y. J. Cai, F. Chen, H. Yuan, X.
J. Liu, and M. H. Yu. Downregulation of VMP1 Confers Aggressive Properties to Colorectal Cancer.
Oncology Reports. U.S. National Library of Medicine, Nov. 2015. Web. 06 June 2017.
5 Data Appendix
Figure 1: Variable importance plot for top predictors used in subtype classification model. 203608_at
corresponds to TNFRSF1B, and 201161_s_at corresponds to YBX3.
Figure 2: Variable importance plot for top predictors used in grade classification model. 203608_at
corresponds to TNFRSF1B, and 1569003_at corresponds to VMP1.
Figure 3: Distribution of TNFRSF1B and YBX3 mRNA expression levels in astrocytoma, glioblastoma, and
oligodendroglioma tumor types.
Figure 4: Distribution of TNFRSF1B and VMP1 mRNA expression levels in grade II, III, and IV glioma.
