Under submission
Supporting Information
Bayesian Learning in Sparse Graphical Factor Models via Annealed Entropy
Ryo Yoshida
Department of Statistical Modeling
The Instistute of Statistical Mathematics
Mike West
Department of Statistical Science
Duke University
| Supporting documents |
| R functions: Variational mean-field annealing algorithm (VMA2) for graphical factor models (GFMs) |
| Detailed discussion on Transcriptome analysis of breast cancers (Section 7.2) |
|
|
Transcriptome profiling of cancer cells is one of the most fundamental tools in genomic biomarker discovery and studies on transcription regulatory pathways. As the use of DNA microarray assays was spread over recent molecular biology and medical science, there has been a growing need for a statistical toolbox having the ability to process massive data set containing many thousands features (genes). Especially, the latent factor models have been indispensable tools in the primary processing of data for module discovery of genes involved in underlying pathways of transcription regulations.
Microarray Experiments and Preprocessing
We have applied the VMA2 estimation of GFM to a transcription
analysis of breast cancer.
Our data set is a collection of gene expression signatures of
human breast tumors which were isolated from 138 different patients
during 2001-2004 at the Sun-Yat Sen Cancer Center, Taipei.
In addition to the expression signatures, the tumor samples
were examined in immunohistochemistry (IHC) testing which measures
concentrations of the two receptor proteins (signaling molecules);
erbB-2 (Her2) and estrogen receptor (ER).
Expression status of these two proteins is known as the main
discriminative factors for breast cancer phenotype, and
perdition of clinical outcomes in practical therapic strategy.
The immunohistochemical assessment revealed the status of each receptor
with the grade: ER negative (ER=0),
ER positive with low- and high-level expression (ER=1 and ER=2),
Her2 negative (Her2=0), and Her2 positive with low- and high-level
(Her2=1 and Her=2).
The transcript levels of the probes were measured using
the GeneChip Human Genome U95 arrays (Affymetrix).
We then screened out the probes showing trivial variations across
the samples.
This preprocessing identified 996 probes as potential
biomarkers relevant to pathologic heterogeneity in breast cancer.
Experimental Setting
A practical interest in the application of latent factor models to
transcriptome analysis lies in the identification and decomposition
of blind sources yielding covariations in the massive data set, and
association of many features to tumor heterogeneity and transcription
modules involved in biological pathways.
We have carried on the sparse estimation of the GFM with k=25
under the adaptive control of degree of sparseness with the hyper-hyper prior;
mean=7 and variance=10.
The cooling schedule was prescribed by linearly-decreasing sequence of 2000
temperatures under which the decay rate and initial temperature were set to
6*10^-3 and 3, respectively.
Identified Factors and Enrichment Analysis of Functional Annotations
During the annealing, redundancy in the prescribed factor dimension has been
pruned as k=25 -> k=19.
The expression signatures consisting of the survived nineteen factors are
respectively displayed in Fig.7 with the identified sparsity of
the loading matrix.
Each set of probes, C_j, sharing the jth factors
is further divided into the two subsets C_j+ and
C_j- according to the sign of the estimated loading values.
The probes aggregated into the same factor set were highly co-expressed with
each other.
To obtain a better understanding of underlying biology, we evaluated an
enrichment of the functional annotations of selected genes in each factor
based on the Gene Ontology (GO).
To be specific, we assessed significance (quasi-p value) of each GO term
registered at the categories of the GO biological process in the following way:
Under which r probes out of indentified l probes in each factor possess a GO term
annotation S, and the p=996 probes contain in total m probes
belonging to S, we calculated the upper-tail probability of
the hypergeometric distribution that r or more probes annotated by
S are drawn by chance under the l times sampling without replacement.
This secondary analysis has revealed associations between some identified factors
and biological processes of GO, where the complete table of the GO enrichment
analysis are available from the supporting information website.
For instance, the first factor captures a large number of probes in which some of
the most significant annotations are comprised of positive regulation of
lymphocyte proliferation, BMP signaling pathway involved in remodeling of bone.
The 4th factor clearly captured the genes involved in apoptosis and relevant pathways.
As the genes coding extracellular signaling molecules of apoptosis,
the two tumor necrosis factors, TNFSF10 (tumor necrosis factor (ligand) superfamily)
and LITAF (lipopolysaccharide-induced TNF factor), and interleukin (IL8) were
listed in the identified probe set. Besides, some negative regulators of caspase
activity, i.e. repressors of apoptosis enhancing cancer metastasis, were captured,
particularly involved in suppression of tumor necrosis factor-induced apoptosis
by I-kappaB kinase/NF-kappaB cascade.
The genes relevant to apoptosis were also aggregated into the 14th factor which
showed significant over-representations of the terms with positive regulation of
caspase (activator protein of apoptosis)
and tyrosine phosphorylation of STAT proteins on JAK-STAT cascade.
The 8th factor included many genes relevant to hormone metabolic process,
especially humoral immune response mediated by circulating immunoglobulin.
The hormonal response-related probes were also captured by the 12th factor
whose members are mainly involved in C21-steroid hormone metabolic process.
For cell-cycle related genes, genes relevant to G1/S transition of mitotic cell
cycle were included in the 15th and 16th factor, whereas 18th factor showed
enrichment of M phase of meiotic cell cycle.
Association Study of Identified Factors to ER and Her2 pathways
Fig.8 shows the averages of the estimated factor
scores separately computed for the three grades of ER and
Her2 status, respectively.
Many factors captured the discriminators for ER status, particularly, we
could observe strong association to the 8th (GO: hormone metabolic process),
9th (GO: glucose metabolic process, negative regulation of MAPK activity),
12th (GO: C21-steroid hormone metabolic process ),
14th (GO: apoptotic program, positive regulation of caspase activity),
18th (GO: M phase of meiotic cell cycle), 19th factor
(GO: regulation of Rab protein signal transduction).
These clear discrepancies of ER status indicate strong effect of estrogen
recepter-induced signaling on transcriptions of the downstream target genes,
and diversity of existing pathways in breast cancer progression.
Estrogen Responsible Elements
To achieve a deeper insight of the relevant factors in relation to ER status,
we carried out the enrichment analysis of the identified probe set based on
the estrogen response elements (EREs).
ERs are members of ligand-dependent nuclear receptor family, which recognize and bind to
EREs on DNA strands, and modulate transcription of the downstream target genes coupled
with several interactor proteins.
It is therefore natural to consider that some of the identified ER-relevant factors
captured a number of the ERE genes.
To perform association study, we accessed the ERE database (Bourdeau et al. (2004))
to which 17353 ERE genes screened based on in vitro DNA binding experiment are registered.
Our selected 996 probes shared 276 genes with the registered EREs.
We then computed the p-values for C_j+ and C_j-, separately,
using the hypergeometric distribution.
Of the foregoing factors judged to be ER discriminators, the 9th (C_9+),
14th (C_14+) and 23th (C_19-) showed relatively small p-values,
(0.1949, 0.107 and 0.107).
Her2 status and Oncogenomic Recombination Hotspot on 17q12
Regarding to Her2 status, Fig.8 indicates a weak correlation to the 16th factor
which captured the 7 probes as the relevance features, including
STARD3, GRB7 and two probes on the locus of ERBB2 (Her2).
Very interestingly, we have noted that these three genes are all located on the same locus
at the human chromosome 17q12, which is known as
PPP1R1B-STARD3-TCAP-PNMT-PERLD1-ERBB2-MGC14832-GRB7 locus.
This locus has been reported in many studies, e.g. Kato et al. (2004),
as an oncogenomic recombination hotspot which is amplified frequently in breast tumor cells.
Moreover, 17q12 is closely linked to evolutionary recombination hotspot around the GSDML-GSDM locus.
Kato et al. concluded that this evolutionary recombination hotspot was closely
linked to the oncogenomic hotspot around the PPP1R1B-ERBB2-GRB7 amplicon.
However, according to our knowledge, the regulatory mechanisms between Her2-mediated
signaling and this hotspot have been still unveiled.
Comparison to Non-Sparse Analysis
We finally demonstrate a comparison of the proposed annealing estimation to the non-sparse (classical) PCA.
Supplementary Fig.1 and 2 depicts the averages of the estimated factor
scores (principal components) corresponding to the most dominant 25 eigenvalues,
for each grade of ER and Her2 status.
Obviously, the PCA failed to reveal Her2-relevant phenotype of the tumor samples.
The foregoing sparse inference identified the Her2-relevant factor only through the 7 non-zero loadings.
Indeed, our post-analysis has found that our data set contains very few probes exhibiting
significant fold change between the Her2 phenotypes, and hence, the non-sparse model would capture many
irrelevant features through redundant non-zero loadings.
The failure of PCA highlights clearly that specification of the loading values is an essential part of
statistical learning for the detection of such a small number of relevant features.
Reference