Under submission
Supporting Information
Bayesian Learning in Sparse Graphical Factor Models via Annealed Entropy

Ryo Yoshida
Department of Statistical Modeling
The Instistute of Statistical Mathematics

Mike West
Department of Statistical Science
Duke University

Supporting documents


R functions: Variational mean-field annealing algorithm (VMA2) for graphical factor models (GFMs)


Detailed discussion on Transcriptome analysis of breast cancers (Section 7.2)

    Supplementary Materials
    GO Enrichment Analysis: Biological Process
  • Factor 1 (ER) (positive regulation of mononuclear cell proliferation, BMP signaling pathway, etc)
  • Factor 2 (peptidyl-tyrosine phosphorylation, cell cycle checkpoint, DNA damage response, signal transduction by p53 class mediator, etc)
  • Factor 3 (positive regulation of epithelial cell proliferation, UTP and CTP metabolic process, etc)
  • Factor 4 (negative regulation of caspase activity, release of cytochrome c from mitochondria, regulation of I-kappaB kinase/NF-kappaB cascade, etc)
  • Factor 5
  • Factor 6
  • Factor 7 (androgen receptor signaling pathway, embryonic development, etc)
  • Factor 8 (ER) (hormone metabolic process, humoral immune response mediated by circulating immunoglobulin, positive regulation of transcription, etc)
  • Factor 9 (ER) (glucose metabolic process, negative regulation of MAPK activity, MAPKKK cascade, etc)
  • Factor 10
  • Factor 11 (ER) (immune response-activating cell surface receptor signaling pathway, positive regulation of MAPK activity)
  • Factor 12 (ER) (C21-steroid hormone metabolic process, cell cycle arrest, positive regulation of positive chemotaxis, etc)
  • Factor 13 (JAK-STAT cascade, positive regulation of caspase activity, I-kappaB kinase/NF-kappaB cascade, tyrosine phosphorylation of STAT protein, etc)
  • Factor 14 (ER) (M phase of meiotic cell cycle, regulation of Ras GTPase activity, regulation of Rab GTPase activity, tube morphogenesis, etc)
  • Factor 15 (ER) (ER-nuclear signaling pathway, G1/S transition of mitotic cell cycle, etc)
  • Factor 16 (Her2) (positive regulation of MAPK activity, G1/S transition of mitotic cell cycle, etc)
  • Factor 17
  • Factor 18 (ER) (M phase of meiotic cell cycle, cell cycle arrest, etc)
  • Factor 19 (ER) (regulation of GTPase activity, egulation of Rab protein signal transduction)
    Fig. 7 (left) Expression Profiles of Identified Factor Probes
    Fig. 7 (right): Identified Sparsity Configuration
    Fig. 8 (left): Averaged Factors Scores of on ER status computed by Sparse Estimation
    Fig. 8 (right): Averaged Factors Scores on Her2 status computed by Sparse Estimation
    Suuplementary Fig. 1: Averaged Factors Scores on ER status computed by PCA
    Suuplementary Fig. 2: Averaged Factors Scores on Her2 status computed by PCA

Transcriptome profiling of cancer cells is one of the most fundamental tools in genomic biomarker discovery and studies on transcription regulatory pathways. As the use of DNA microarray assays was spread over recent molecular biology and medical science, there has been a growing need for a statistical toolbox having the ability to process massive data set containing many thousands features (genes). Especially, the latent factor models have been indispensable tools in the primary processing of data for module discovery of genes involved in underlying pathways of transcription regulations.

Microarray Experiments and Preprocessing
We have applied the VMA2 estimation of GFM to a transcription analysis of breast cancer. Our data set is a collection of gene expression signatures of human breast tumors which were isolated from 138 different patients during 2001-2004 at the Sun-Yat Sen Cancer Center, Taipei. In addition to the expression signatures, the tumor samples were examined in immunohistochemistry (IHC) testing which measures concentrations of the two receptor proteins (signaling molecules); erbB-2 (Her2) and estrogen receptor (ER). Expression status of these two proteins is known as the main discriminative factors for breast cancer phenotype, and perdition of clinical outcomes in practical therapic strategy. The immunohistochemical assessment revealed the status of each receptor with the grade: ER negative (ER=0), ER positive with low- and high-level expression (ER=1 and ER=2), Her2 negative (Her2=0), and Her2 positive with low- and high-level (Her2=1 and Her=2). The transcript levels of the probes were measured using the GeneChip Human Genome U95 arrays (Affymetrix). We then screened out the probes showing trivial variations across the samples. This preprocessing identified 996 probes as potential biomarkers relevant to pathologic heterogeneity in breast cancer.

Experimental Setting
A practical interest in the application of latent factor models to transcriptome analysis lies in the identification and decomposition of blind sources yielding covariations in the massive data set, and association of many features to tumor heterogeneity and transcription modules involved in biological pathways. We have carried on the sparse estimation of the GFM with k=25 under the adaptive control of degree of sparseness with the hyper-hyper prior; mean=7 and variance=10. The cooling schedule was prescribed by linearly-decreasing sequence of 2000 temperatures under which the decay rate and initial temperature were set to 6*10^-3 and 3, respectively.

Identified Factors and Enrichment Analysis of Functional Annotations
During the annealing, redundancy in the prescribed factor dimension has been pruned as k=25 -> k=19. The expression signatures consisting of the survived nineteen factors are respectively displayed in Fig.7 with the identified sparsity of the loading matrix. Each set of probes, C_j, sharing the jth factors is further divided into the two subsets C_j+ and C_j- according to the sign of the estimated loading values. The probes aggregated into the same factor set were highly co-expressed with each other. To obtain a better understanding of underlying biology, we evaluated an enrichment of the functional annotations of selected genes in each factor based on the Gene Ontology (GO). To be specific, we assessed significance (quasi-p value) of each GO term registered at the categories of the GO biological process in the following way: Under which r probes out of indentified l probes in each factor possess a GO term annotation S, and the p=996 probes contain in total m probes belonging to S, we calculated the upper-tail probability of the hypergeometric distribution that r or more probes annotated by S are drawn by chance under the l times sampling without replacement. This secondary analysis has revealed associations between some identified factors and biological processes of GO, where the complete table of the GO enrichment analysis are available from the supporting information website. For instance, the first factor captures a large number of probes in which some of the most significant annotations are comprised of positive regulation of lymphocyte proliferation, BMP signaling pathway involved in remodeling of bone. The 4th factor clearly captured the genes involved in apoptosis and relevant pathways. As the genes coding extracellular signaling molecules of apoptosis, the two tumor necrosis factors, TNFSF10 (tumor necrosis factor (ligand) superfamily) and LITAF (lipopolysaccharide-induced TNF factor), and interleukin (IL8) were listed in the identified probe set. Besides, some negative regulators of caspase activity, i.e. repressors of apoptosis enhancing cancer metastasis, were captured, particularly involved in suppression of tumor necrosis factor-induced apoptosis by I-kappaB kinase/NF-kappaB cascade. The genes relevant to apoptosis were also aggregated into the 14th factor which showed significant over-representations of the terms with positive regulation of caspase (activator protein of apoptosis) and tyrosine phosphorylation of STAT proteins on JAK-STAT cascade. The 8th factor included many genes relevant to hormone metabolic process, especially humoral immune response mediated by circulating immunoglobulin. The hormonal response-related probes were also captured by the 12th factor whose members are mainly involved in C21-steroid hormone metabolic process. For cell-cycle related genes, genes relevant to G1/S transition of mitotic cell cycle were included in the 15th and 16th factor, whereas 18th factor showed enrichment of M phase of meiotic cell cycle.

Association Study of Identified Factors to ER and Her2 pathways
Fig.8 shows the averages of the estimated factor scores separately computed for the three grades of ER and Her2 status, respectively. Many factors captured the discriminators for ER status, particularly, we could observe strong association to the 8th (GO: hormone metabolic process), 9th (GO: glucose metabolic process, negative regulation of MAPK activity), 12th (GO: C21-steroid hormone metabolic process ), 14th (GO: apoptotic program, positive regulation of caspase activity), 18th (GO: M phase of meiotic cell cycle), 19th factor (GO: regulation of Rab protein signal transduction). These clear discrepancies of ER status indicate strong effect of estrogen recepter-induced signaling on transcriptions of the downstream target genes, and diversity of existing pathways in breast cancer progression.

Estrogen Responsible Elements
To achieve a deeper insight of the relevant factors in relation to ER status, we carried out the enrichment analysis of the identified probe set based on the estrogen response elements (EREs). ERs are members of ligand-dependent nuclear receptor family, which recognize and bind to EREs on DNA strands, and modulate transcription of the downstream target genes coupled with several interactor proteins. It is therefore natural to consider that some of the identified ER-relevant factors captured a number of the ERE genes. To perform association study, we accessed the ERE database (Bourdeau et al. (2004)) to which 17353 ERE genes screened based on in vitro DNA binding experiment are registered. Our selected 996 probes shared 276 genes with the registered EREs. We then computed the p-values for C_j+ and C_j-, separately, using the hypergeometric distribution. Of the foregoing factors judged to be ER discriminators, the 9th (C_9+), 14th (C_14+) and 23th (C_19-) showed relatively small p-values, (0.1949, 0.107 and 0.107).

Her2 status and Oncogenomic Recombination Hotspot on 17q12
Regarding to Her2 status, Fig.8 indicates a weak correlation to the 16th factor which captured the 7 probes as the relevance features, including STARD3, GRB7 and two probes on the locus of ERBB2 (Her2). Very interestingly, we have noted that these three genes are all located on the same locus at the human chromosome 17q12, which is known as PPP1R1B-STARD3-TCAP-PNMT-PERLD1-ERBB2-MGC14832-GRB7 locus. This locus has been reported in many studies, e.g. Kato et al. (2004), as an oncogenomic recombination hotspot which is amplified frequently in breast tumor cells. Moreover, 17q12 is closely linked to evolutionary recombination hotspot around the GSDML-GSDM locus. Kato et al. concluded that this evolutionary recombination hotspot was closely linked to the oncogenomic hotspot around the PPP1R1B-ERBB2-GRB7 amplicon. However, according to our knowledge, the regulatory mechanisms between Her2-mediated signaling and this hotspot have been still unveiled.

Comparison to Non-Sparse Analysis
We finally demonstrate a comparison of the proposed annealing estimation to the non-sparse (classical) PCA. Supplementary Fig.1 and 2 depicts the averages of the estimated factor scores (principal components) corresponding to the most dominant 25 eigenvalues, for each grade of ER and Her2 status. Obviously, the PCA failed to reveal Her2-relevant phenotype of the tumor samples. The foregoing sparse inference identified the Her2-relevant factor only through the 7 non-zero loadings. Indeed, our post-analysis has found that our data set contains very few probes exhibiting significant fold change between the Her2 phenotypes, and hence, the non-sparse model would capture many irrelevant features through redundant non-zero loadings. The failure of PCA highlights clearly that specification of the loading values is an essential part of statistical learning for the detection of such a small number of relevant features.

Reference

  • Katoh M, Katoh M., Evolutionary recombination hotspot around GSDML-GSDM locus is closely linked to the oncogenomic recombination hotspot around the PPP1R1B-ERBB2-GRB7 amplicon, Int J Oncol., 24(4):757-63, 2004.
  • Bourdeau V, Deschenes J, Metivier R, Nagai Y, Nguyen D, Hudson T, White J, Gannon F, and Mader S., Genome-wide identification of high-affinity estrogen response elements in human and mouse, Mol Endocrinol., 18(6):1411-1427, 2004.