Oncotarget (ISSN 1949-2553), Impact Journals LLC

PMID: 27322207; PMCID: PMC5216771; DOI: 10.18632/oncotarget.10002

Research Paper

Assessing the clinical utility of genomic expression data across human cancers

Xinsen Xu¹, Lei Huang², Chun Hei Chan³, Tao Yu⁴, Runchen Miao¹, Chang Liu

¹ Department of Hepatobiliary Surgery, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China

² 118 Lancaster Terrace, Brookline, MA, USA

³ Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong

⁴ Department of Neurosurgery, Peking University First Hospital, Beijing, China

Correspondence to: Chang Liu, liuchangdoctor@163.com

Oncotarget 2016; 7(29): 45926-45936. Received: 24 February 2016; accepted: 29 May 2016; published online: 14 June 2016; issue date: 19 July 2016.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Cancer molecular profiling provides better understanding of tumor mechanisms and helps to improve the existing cancer management. Here we present the gene expression signatures from ∼9000 human tumors with clinical information across 32 malignancies from The Cancer Genome Atlas project (TCGA). Major predictors from the RNA sequencing data that were significantly correlated with cancer survival were identified. The expression level of these prognostic genes revealed significant genomic pathways that were clinically relevant to survival outcomes across human cancers. Furthermore, it is shown that in most cancer types, combinations of these genomic signatures with clinical information might yield improved predictions. Thus, with respect to clinical utility, our study reveals the promising values of genomic data from the pan-cancer perspective.

Keywords: genome, cancer, expression, utility, prognosis

INTRODUCTION

Cancer is a global health burden and the second leading cause of death [1]. Despite various detection methods and treatment options, survival rates for most cancers are still very low.

Genomic features are emerging as promising biomarkers for cancer [2]. Among the various types of molecular data, gene expression values from RNA sequencing reveal detailed molecular features with prognostic associations [3]. However, because of the high cost, most studies have focused on only a few pre-selected genes or have been based on small sample sizes [4, 5].

The Cancer Genome Atlas (TCGA) project “motivated large-scale genomic efforts to obtain the complete catalogs of the genomic alterations in cancer” [6]. Besides the rich molecular features (genomic, transcriptomic, epigenomic and proteomic) of each tumor, it also provides valuable clinical information. However, the clinical utility of these data has not been fully elucidated.

In the present study, we depicted the global pan-cancer prognostic landscape by analyzing the expression signatures from ~9000 human tumors across 32 malignancies from TCGA data sets. Furthermore, the clinical utility of survival predictions was evaluated by combining the genomic data with clinical information.

RESULTS

Patient characteristics and outcome

Patient information with complete RNA sequencing data and clinical data of all TCGA cancer types were collected (32 tumor types) (adrenocortical carcinoma, ACC; bladder urothelial carcinoma, BLCA; breast invasive carcinoma, BRCA; cervical and endocervical cancers, CESC; cholangiocarcinoma, CHOL; colon adenocarcinoma, COAD; lymphoid neoplasm diffuse large B-cell lymphoma, DLBC; esophageal carcinoma, ESCA; glioma, GBMLGG; head and neck squamous cell carcinoma, HNSC; kidney chromophobe, KICH; kidney renal clear cell carcinoma, KIRC; kidney renal papillary cell carcinoma, KIRP; acute myeloid leukemia, LAML; liver hepatocellular carcinoma, LIHC; lung adenocarcinoma, LUAD; lung squamous cell carcinoma, LUSC; mesothelioma, MESO; ovarian serous cystadenocarcinoma, OV; pancreatic adenocarcinoma, PAAD; pheochromocytoma and paraganglioma, PCPG; prostate adenocarcinoma, PRAD; rectum adenocarcinoma, READ; sarcoma, SARC; skin cutaneous melanoma, SKCM; stomach adenocarcinoma, STAD; testicular germ cell tumors, TGCT; thyroid carcinoma, THCA; thymoma, THYM; uterine corpus endometrial carcinoma, UCEC; uterine carcinosarcoma, UCS; uveal melanoma, UVM).

The data analysis steps and the clinical characteristics of the cancer patients are shown in Figure 1A-1E. In total, there were 9175 patients across the 32 tumor types, of whom 49.4% were male and 50.6% were female. The median age was 60 years. With respect to tumor stage, 30.7% of the patients were in stage 1, 30.9% were in stage 2, 26.7% were in stage 3, and 11.7% were in stage 4. At the time of analysis, 77.4% of the patients remained alive, and 22.6% were deceased.

Figure 1: Overview of the computational approach and patient characteristics

A. Flow diagram summarizing the data processing and analysis steps. B. Number of patient samples with survival data, organized by cancer types. C. Median age of the patients in different cancer types. D. Median survival time of the patients in different cancer types (some of the cancer types don't have enough death events to calculate the median survival times, either because of the high survival rates or due to the small sample size of the cancer type). E. Frequency distributions of gender, tumor stage and survival outcome in the whole cancer population.

Pan-cancer prognostic genes and risk scores

To explore pan-cancer prognostic signatures, a pan-cancer dataset was built by combining all the cancer patients. Samples were randomly assigned to two groups: 80% to the training group and 20% to the testing group. Cox regression analysis of the training group identified the top ten adverse prognostic genes (B3GNT5, SLC11A1, ELF4, GALNT2, PA2G4P4, SKP2, S100A9, FOXM1, PSMB2, ARL6IP6) and the top ten favorable prognostic genes (TADA2B, CBX7, CIRBP, MAGED2, CRY2, CREBL2, TMED8, XPC, SECISBP2, GPD1L) (Figure 2A). Based on these top prognostic genes, risk scores were calculated (Table 1). The risk score was defined as the weighted sum of the independent prognostic gene values (1 for high expression and 0 for low expression), weighted by their regression coefficients from the Cox models (Figure 2B).

Figure 2: Prognostic landscape of gene expression in the whole cancer population
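The risk-score definition above can be illustrated with a minimal Python sketch. It is not the authors' code: the expression values are invented toy data, and only two genes are used, with coefficients borrowed from the binary whole-population score in Table 1 purely for illustration. Each gene is dichotomized at its median, and the Cox coefficients of the highly expressed genes are summed.

```python
from statistics import median


def binary_risk_scores(expression, coefficients):
    """Weighted sum of binary gene indicators (1 = above the gene's median).

    expression:   {gene: [expression value for each patient]}
    coefficients: {gene: Cox regression coefficient}
    """
    genes = list(coefficients)
    n_patients = len(expression[genes[0]])
    # Median cutoff per gene, as in the binary model described in the text.
    cutoffs = {g: median(expression[g]) for g in genes}
    scores = []
    for i in range(n_patients):
        score = sum(coefficients[g] for g in genes
                    if expression[g][i] > cutoffs[g])
        scores.append(round(score, 2))
    return scores


# Hypothetical toy data for four patients (not taken from TCGA).
expression = {"FOXM1": [1.0, 5.0, 9.0, 2.0], "CBX7": [8.0, 1.0, 2.0, 7.0]}
coefficients = {"FOXM1": 0.63, "CBX7": -0.69}
scores = binary_risk_scores(expression, coefficients)
print(scores)
```

In this toy example, patients with high FOXM1 and low CBX7 receive a positive score, and patients with the opposite pattern receive a negative one.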

A. Top ten adverse and favorable pan-cancer prognostic genes were identified in the training group, ranked by the z scores. B. Risk score calculated by the top prognostic genes in the training group patients. Upper panel: risk-score distribution of the training group patients and survival status (blue indicates alive, and red indicates dead). Lower panel: heatmap showing the expression level of the top prognostic genes. C. Box plots of risk scores in different age groups, different gender groups, and different stage groups in the training group patients. D. Forest plot of risk score association with cancer mortality in the training group patients of different stages. E. Kaplan-Meier estimates of overall survival according to the risk score in the training set. F. Box plots of risk scores in different age groups, different gender groups, and different stage groups in the testing group patients. G. Forest plot of risk score association with cancer mortality in the testing group patients of different stages. H. Kaplan-Meier estimates of overall survival according to the risk score in the testing set.

Table 1: Specific risk scores for different types of cancer

Cancer Type	Risk Score
Whole population (binary)	Score = 0.66B3GNT5+0.65SLC11A1+0.65ELF4+0.65GALNT2+0.63PA2G4P4+0.63SKP2+0.60S100A9+0.63FOXM1+0.61PSMB2+0.64ARL6IP6-0.63GPD1L-0.62SECISBP2-0.58XPC-0.61TMED8-0.61CREBL2-0.64CRY2-0.64MAGED2-0.68CIRBP-0.69CBX7-0.71TADA2B
Whole population (continuous)	Score = 0.09B3GNT5+0.17SLC11A1+0.29ELF4+0.12GALNT2+0.18PA2G4P4+0.16SKP2+0.08S100A9+0.14FOXM1+0.04PSMB2+0.23ARL6IP6-0.34GPD1L-0.35SECISBP2-0.38XPC-0.26TMED8-0.36CREBL2-0.37CRY2-0.46MAGED2-0.54CIRBP-0.48CBX7-0.42TADA2B
ACC	Score = 2.48MASTL+2.38RECQL4+2.33PRC1+2.36KIF11+2.65AMMECR1L+2.55TRIP13+2.54MKI67+2.59NCAPD3+2.07E2F1+2.06FANCI-2.07APH1B-2.3CTSA-2.04UPRT-2.02HNRNPH2-2.28NDRG4-2.47PPFIBP2-1.94LACTB-2.43PTGR2-2.11CHIC1-2.36BDH2
BLCA	Score = 0.81SPNS1+0.77GARS+0.78NBAS+0.77IFT122+0.75NOMO1+0.74TMX2+0.73DHRS4+0.74CCDC28B+0.74TMEM109+0.74DAD1-0.86GATA2-0.77TRIM26-0.76MRPS6-0.77YDJC-0.76ZNF841-0.75ZBTB49-0.72ORMDL1-0.72DEDD2-0.72OGT-0.74CTSH
BRCA	Score = 0.88ZHX1+0.82PRRC1+0.82SCRN1+0.81IARS+0.81PTPN11+0.83VPS35+0.78MRS2+0.77GRPEL2+0.79TMEM65+0.76PGK1-0.95TNFRSF14-0.95KDM4B-0.92INO80B-0.88LOC150776-0.86MRPL23-0.87PYCARD-0.84ABHD14A-0.82FGD3-0.84SEC14L2-0.79NFKBIA
CESC	Score = 1.14PHRF1+1.09TNRC18+1.06ITGA5+0.98DBN1+1.01LATS2+1.01TOR1AIP2+1.02FASN+1URGCP+0.95SRI+0.95ADAM9-1.28TREX1-1.21RBM38-1.07LGALS9-1.04HNRNPA3-0.97NQO2-0.96ZER1-0.97ISCU-0.94MTCP1NB-0.96AKR1A1-0.98SLC25A28
CHOL	Score = 1.91EIF5A+2CEBPB+1.9SCO1+1.89ROM1+2.32SRI+1.69FAM54B+1.62MNAT1+1.58PSEN1+1.51PDHB+1.66SLC38A6-1.63SCRN1-1.95PGPEP1-1.83EIF4ENIF1-1.62SGSH-1.63VSIG10-1.49ACBD5-1.47PURB-1.61TNFAIP8-1.57FUT4-1.38FGD6
COAD	Score = 1.24TIAL1+1.14SMNDC1+1.1KIAA0907+1.09POLR2J4+1.03HSPA1L+0.94ZBTB25+0.95UBN2+0.95SCRN3+0.98ZBTB9+0.93DNAJB6-0.99CPT2-1.01MRPL37-0.99ATP8B1-0.96CCDC149-0.92EIF2C1-0.9DYNLL2-0.96ZCCHC11-0.91MFN2-1.01GSR-0.9SAMM50
DLBC	Score = 1.67ELP4+1.48API5+1.48ARHGEF7+1.48ATXN7L2+1.48EXOC5+1.48GMEB1+1.48MEMO1+1.48MPHOSPH10+1.48MTOR+1.48NEO1-1.48TBKBP1-1.48STXBP2-1.48PUS1-1.48PTRH1-1.48POLR3D-1.48KCNK6-1.48IFI35-1.48GPAA1-1.48FHL3-1.48FBXW5
ESCA	Score = 1.13B3GALTL+1.11PGK1+1.18GRPEL2+1.17MAPRE1+1.03SRXN1+1.02LRRC58+0.99NFATC3+0.96ST13+0.94TRMT6+0.92MLLT11-1.02UNC13D-0.98PCSK7-0.94PLCD3-1.01DIP2A-0.98PLEKHM1P-0.89UNC93B1-0.87ERAP2-0.84LRCH4-0.86CCBL2-0.84C10orf54
GBMLGG	Score = 1.98GLA+1.83KDELC2+2.01WEE1+1.88EMP3+1.8DUSP10+1.84CLIC1+1.88TIMP1+1.84CD58+1.79DDB2+1.81SHISA5-2.01ZRANB1-1.9GLUD1-1.88FAM190B-1.78RAP2A-1.79ADD1-1.77HDAC4-1.83ARL3-1.74PATZ1-1.79SCAPER-1.73RPL7
HNSC	Score = 0.9PGK1+0.8USP10+0.78TOMM34+0.8SNX6+0.72TMED2+0.7PDIA3P+0.69ADK+0.71USP14+0.69TRIM32+0.68HPRT1-0.75ZNF266-0.69ZNF700-0.64AHCYL2-0.65SH3BP2-0.65ZNF577-0.64ZNF557-0.64ATXN7L2-0.64ZNF20-0.63DUSP16-0.63CDK3
KICH	Score = 2.36PNPT1+2.33PTP4A2+2.31GPN1+2.31GPATCH2+2.3PLEKHA2+2.29NRAS+2.27PDS5A+2.27KDM1B+2.27TTF2+2.26NT5DC3-2.36FIZ1-2.34TST-2.34C14orf1-2.34ELAVL1-2.33KLHL26-2.31CES2-2.31CTDP1-2.31SUSD1-2.3USF2-2.3COPS7A
KIRC	Score = 1.28DONSON+1.24STRADA+1.2ATP13A1+1.19NOP56+1.18CARS+1.18ANAPC7+1.16ANAPC5+1.14SBNO2+1.15NCLN+1.18FKBP11-1.16SGCB-1.15PINK1-1.12FBXO3-1.11SSFA2-1.1ITGA6-1.01HBP1-1FBXL3-1.02RNF20-1PURA-0.98FBXL5
KIRP	Score = 1.7GLT25D1+1.48LMNB2+1.54SPAG5+1.96ADA+1.41PUS7+1.61CCNF+1.5RHBDF2+1.7P4HB+1.58TSEN15+1.41AEBP1-1.65TMCO4-1.49PGPEP1-1.63FBXL5-1.51HTATSF1-1.56CCDC71-1.56ACTR8-1.36CC2D2A-1.42PARP3-1.39ZBTB3-1.39SLC25A11
LAML	Score = 1.1TOMM40L+0.95NUP210+0.91PARP3+0.83DDIT4+0.83CLCN5+0.79FIBP+0.78RPS6KA1+0.77PSMA7+0.76RINL+0.76PARVB-0.99PWWP2A-0.97MBTPS1-0.87NHLRC3-0.87LOC646762-0.86ADSS-0.84TGIF1-0.81SIAH1-0.83DET1-0.8KCTD15-0.79FCHSD2
LIHC	Score = 0.97HNRNPH1+0.81N4BP3+0.82LDHA+0.81ZCRB1+0.84YBX1+0.78STK39+0.78ATP6V1E1+0.8ANXA5+0.78HN1+0.76ATP1B3-0.81STAT5B-0.79C9orf3-0.79CHST14-0.76SIK2-0.72POLDIP2-0.73ATF7IP2-0.72SLC23A2-0.67STIM1-0.65MIA3-0.65PSD4
LUAD	Score = 0.72ITGA6+0.73C1QTNF6+0.72MTHFD1+0.7DNAJB4+0.7BACH1+0.69CCNA2+0.65EXT1+0.65FSCN1+0.66DNAJB6+0.65NOC3L-0.91SLC25A42-0.82PRKCD-0.79DBP-0.75DENND1C-0.71NRL-0.72C19orf42-0.73ALAD-0.71SLC11A2-0.68ABAT-0.67FAM117A
LUSC	Score = 0.74CD14+0.66ARHGAP1+0.63CD151+0.62FSTL3+0.6RALGAPA2+0.59CST3+0.57C11orf2+0.56SNX29+0.56FAM109B+0.54EHD1-0.69ERH-0.65NDUFB1-0.59CBX1-0.56EMD-0.55RLIM-0.53FAM103A1-0.53MNAT1-0.53VRK1-0.51SS18L2-0.5FKBP3
MESO	Score = 1.71CDCA8+1.63KPNA2+1.62SPAG5+1.54CCNA2+1.64IQGAP3+1.66FOXM1+1.5HMGB2+1.51MAD2L1+1.52CDCA5+1.58PRC1-1.53KLHL9-1.44ETAA1-1.41THTPA-1.36HIST1H2BD-1.32FOXO4-1.39FBXO44-1.28HIST1H2AC-1.39HIST1H2BK-1.3SH3BGRL-1.36TMBIM4
OV	Score = 0.6CBLL1+0.59CACNA1C+0.56SOCS5+0.54ZNF384+0.54CACNB1+0.53SEMA4F+0.52AGPAT6+0.52CHKA+0.54GLIS2+0.52GLCE-0.77NPEPL1-0.6TLCD1-0.57LMO4-0.55CASP6-0.54ISG20-0.55AP4B1-0.53SAT1-0.52ZNF326-0.51ENSA-0.5AP1S2
PAAD	Score = 1.31ATG12+1.3ASCC1+1.33NFE2L3+1.31KIAA1609+1.3CCDC6+1.2EIF2A+1.26TMOD3+1.21AP3S1+1.24METAP1+1.22NCK1-1.33USP20-1.27MUM1-1.27REC8-1.24RBM6-1.21ARMC5-1.23DEF8-1.27KLHL22-1.13C7orf43-1.14MGC23284-1.1ELMOD3
PCPG	Score = 2GLE1+1.99EFTUD1+1.99NARG2+1.98CIZ1+1.97ZNF490+1.97TTC9C+1.96FAM178A+1.96ABCA1+1.95AKAP13+1.95LOC642852-2.03HMOX2-1.96DGCR14-1.96SLC10A3-1.95ITFG3-1.94FAM118A-1.93MBD3-1.93USE1-1.92ICOSLG-1.91FSCN1-1.91TMEM167B
PRAD	Score = 20.3EXTL2+20.3B3GNT5+20.3SEMA4C+20.3NUDCD2+20.3GNAI1+20.3THUMPD1+20.3CNNM3+20.3RNF138+20.3PRPF4+20.3FASTKD3-20.3MRM1-20.3DAP-20.3PAOX-20.3PLA2G15-20.3SBNO2-20.3STK19-20.3CCDC85C-20.3TBXAS1-20.3NFATC1-20.3HSD17B7
READ	Score = 2.26PSMA3+2.84PHLPP1+2.91CNDP2+2.96CORO1A+2.2AKR7A2+2.82SSBP2+2.82TMEM173+2.7ATP6V0C+2.7NFYC+2.79B4GALT3-2.28OSGEPL1-2.96PHF20-3.01ANKRD27-2.95ZNF853-2.95RAPGEF2-2.88SETD2-2.87MSH6-2.85ATM-3.01SIRT5-2.81SGK3
SARC	Score = 1.12RLIM+1.03BAIAP3+0.97FUBP1+1ZNF146+0.99ATXN10+0.95LRRC41+0.93LRRC47+0.9DOCK7+0.9ZNF697+0.89LAPTM4B-1.13TRIM21-1.08B3GALT4-1.04CCDC69-1.01CCNDBP1-0.97C14orf159-0.95GALK2-0.91PARP14-0.91ATP2A3-0.86C15orf24-0.84PPAP2A
SKCM	Score = 0.75HN1L+0.7GATAD2A+0.68NT5DC2+0.66VDAC1+0.65KPNA2+0.62FOXM1+0.62DCTN2+0.61CDC25A+0.6SLC25A3+0.61SLC25A15-0.81GBP2-0.77APOL6-0.77IFITM1-0.75FCGR2A-0.74FAM96A-0.72PARP9-0.72APOBEC3F-0.71NXT2-0.7UBA7-0.7APOL1
STAD	Score = 2.33SLC9A3R2+1.91ITPRIP+1.74SOCS2+2.04C1orf144+1.7LOC282997+1.69BMP2K+2.64VPS52+1.93UBE4B+1.51CXCR7+1.9NDUFA11-2.55SLC33A1-2.27TMEM66-2.18UBA5-2.14CD47-1.8C21orf59-2.03NSF-2.01FUNDC1-1.97RAB1A-1.69C14orf142-1.95PFDN4
TGCT	Score = 1.03FAM177A1+1.03NBR2+1ATAD2B+0.98C8orf73+0.96FMNL2+0.95CEBPA+0.95VCPIP1+0.94C12orf23+0.94LMBR1L+0.94ABCC5-1.06MYO1E-1.03CABLES1-1FAM84B-0.99TOP1-0.98NCSTN-0.97NAIF1-0.97IRS2-0.97HIBADH-0.97FUBP3-0.97PGM1
THCA	Score = 2.06IQSEC1+1.99FLYWCH1+1.88ZHX3+2.12SEMA6A+1.78FTO+1.76LARS+1.74TGFBR3+2.03PTEN+1.72ZNF324+2.72CEP250-1.96ANXA1-1.89SEC14L2-2.18CIR1-2.17MED17-2.15ITGB1BP1-1.86SRP68-2.14VAMP8-2.08PSME2-2.77RPS27-1.73CLU
THYM	Score = 2.5RARG+2.45RBM47+2.39PELI3+2.39ATP1B1+2.39TST+2.35NUDT16+2.38DENND1A+2.35PPAPDC1B+2.34GNS+2.3TBC1D16-2.44ADRBK1-2.43PDSS1-2.41SEMA4D-2.4INTS8-2.4VRK1-2.4PTP4A2-2.39CUTC-2.39SEMA7A-2.38SCLT1-2.37ANKRD27
UCEC	Score = 2.12TUBB2A+1.92TAOK3+2.11ENDOD1+2.05KLF11+2.06SYNPO+2.02BRAF+2.02SYTL2+2.05SPAG5+1.72MCL1+1.73ARMC1-1.81SETD6-1.8LYRM1-1.58PYCRL-1.58YDJC-1.71CRBN-1.94C15orf29-1.52PHF5A-2.57PPA1-1.56WWOX-1.64IFT140
UCS	Score = 1.27S100A10+1.23PDE4A+1.21STMN3+1.22ARL4D+1.18HIBCH+1.16FN3K+1.2SEC23B+1.12NINJ1+1.16LOC728554+1.16CTU1-1.86CBX5-1.32DNMT3A-1.31PSMD7-1.35PCBP2-1.25C2orf68-1.18BUD13-1.17ZNRF1-1.21SSRP1-1.19ST3GAL2-1.22TUT1
UVM	Score = 2.32GTF3A+3.14PSTPIP2+2.27SPAG1+3.03SFT2D2+2.23LIPA+2.2IMPA1+2.21JTB+2.16COQ2+2.93ALG5+2.97ISG20-3.14RABL2B-2.26C16orf86-2.19CNP-3.01C3orf39-2.19C3orf37-2.17TBKBP1-2.14TOM1L2-2.17RPL32P3-2.89PPP2R3B-2.16QRICH1

To assess the clinical utility of the risk score, its correlation with the clinical variables in the training group was explored. Higher scores were associated with male sex, older age, and advanced tumor stage (Figure 2C). Further Cox analysis and log-rank tests confirmed the poorer survival outcome of patients with higher risk scores across tumor stages (Figure 2D–2E). In the testing group, a similar relationship between the risk score and the clinical variables was observed (Figure 2F). Notably, higher risk scores in the testing group also indicated poorer survival, suggesting that the risk score has valuable clinical utility (Figure 2G–2H).

Evaluation of the prognostic genes and risk scores

The RNA-seq data demonstrated great value for cancer prognosis. Risk scores specific to each cancer type, calculated with the method used for the whole cancer population, are shown in Table 1. However, the clinical value of these cancer-specific models is currently limited by their sample sizes.

To evaluate the effect of different normalization methods for the RNA-seq data, the quantile-normalized data were either transformed into z-scores or subjected to voom normalization. As shown in Supplementary Figure 1A-1B, neither the z-score transformation nor the voom normalization changed the prognostic genes much (based on the z values of the Cox regression), with Pearson r values of 0.98 and 0.97, respectively. The top prognostic genes remained mostly the same across normalization methods (Supplementary Figure 1C). Thus, applying the z-score transformation or voom normalization added little to the survival analysis.

For the prognostic model above, high or low expression was determined by the median expression level of each gene. Each gene was coded as a binary variable: 1 for high expression (> median value) and 0 for low expression (< median value). We also applied the z-score (a continuous variable) directly to build a prognostic model that reflects the actual gene expression values (Table 1). Cox regression showed that the continuous prognostic model had a hazard ratio of 1.22, which means that the hazard of death increases by 22% for each one-unit increase in the risk score.

The prognostic genes (in each cancer type and in the whole cancer population) were filtered by a specific cutoff (|z| > 3.09, or nominal one-sided p < 0.001). To investigate the relationship between prognostic genes across cancer types, the prognostic genes in each cancer type were compared with those of the whole cancer population. As shown in Supplementary Figure 1D, most of the prognostic genes identified in a single cancer type were also found in the pooled analysis. For example, 66% of the prognostic genes in ACC also had prognostic value in the whole cancer population.
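As a short worked example of how a hazard ratio of 1.22 per unit scales, the sketch below assumes the usual Cox proportional-hazards form, in which the hazard ratio equals exp(coefficient) and effects multiply across units. The function name is hypothetical and the calculation is illustrative only.

```python
import math


def hazard_ratio_for_change(hr_per_unit, delta):
    """Hazard ratio for a change of `delta` units in the risk score,
    assuming a log-linear (Cox proportional-hazards) relationship."""
    beta = math.log(hr_per_unit)   # Cox coefficient implied by the reported HR
    return math.exp(beta * delta)


hr_one_unit = hazard_ratio_for_change(1.22, 1)   # 22% higher hazard
hr_two_units = hazard_ratio_for_change(1.22, 2)  # about 49% higher hazard
print(round(hr_one_unit, 4), round(hr_two_units, 4))
```

Under this assumption, a two-unit increase corresponds to 1.22 squared, about 1.49, not to a 44% increase.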
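The correspondence between the cutoff |z| > 3.09 and a nominal one-sided p < 0.001 can be checked with the standard normal distribution. The sketch below uses only the Python standard library; the helper `is_prognostic` is a hypothetical name introduced for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# z value whose upper-tail (one-sided) probability is 0.001
z_cutoff = standard_normal.inv_cdf(1 - 0.001)

# one-sided p value at the cutoff quoted in the text
one_sided_p = 1 - standard_normal.cdf(3.09)


def is_prognostic(z, cutoff=3.09):
    """Flag a gene as prognostic (adverse or favorable) by |z| > cutoff."""
    return abs(z) > cutoff


print(round(z_cutoff, 2), round(one_sided_p, 4))
```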

Pathway analysis in patients with different prognosis

Based on the prognostic risk score, patients were stratified into two survival groups: those with a positive risk score and those with a negative risk score. Unsupervised cluster analysis showed obvious distinctions between the stratified survival groups, both in the training group and the testing group (Figure 3A, 3E). To link the observed gene expression changes with molecular pathways that may impact the differential survival between high- and low-risk groups, gene set enrichment analysis (GSEA) was performed. As shown in Figure 3B and 3F, pathways such as E2F targets, MYC targets, G2M checkpoint, mTORC1 signaling and interferon gamma response were significantly enriched in patients with higher risk scores, with good consistency between the training group and the testing group.

Figure 3: Prognostic landscape of pathway scores in the whole cancer population

A, E. Heatmap depicting gene expression levels after unsupervised hierarchical clustering in the training set and testing set, respectively. Expression levels are indicated on a low-to-high scale (green-black-red). Two clusters are defined, namely the high risk group and low risk group. B, F. GSEA analysis was performed in the training set and testing set, respectively, to identify biological pathways associated with survival outcome. FWER-p values are indicated on a low-to-high scale (lightblue-darkblue). The number of significant genes in each gene set is indicated by the circle size. C, G. Forest plots of pathway score association with cancer mortality in the training set and testing set, respectively. D, H. Scatter plots of correlations between risk scores and the E2F pathway scores in the training set and testing set, respectively.

To assess the possible effects of different pathways, enrichment was evaluated for every sample using single sample gene set enrichment analysis (ssGSEA). Based on the calculated scores for each pathway, Cox analysis was performed to evaluate their prognostic effects. Most of the significant pathways from the GSEA output showed positive correlations with the survival outcome (Figure 3C, 3G). In addition to the Cox analysis, positive correlations were detected between the pathway ssGSEA scores and the prognostic risk scores. Figure 3D and 3H show the correlation analysis for the most significant pathway (E2F targets) in both the training group and the testing group.

Assessment of prognostic power of gene expression data

Since the gene expression analysis and pathway analysis showed great prognostic value, the prognostic power of the gene expression data was explored further. The C-index was used to assess the predictive power of the gene expression data alone or combined with clinical information. To improve accuracy, cancer types without enough death events (< 20 deaths or < 10% mortality) were excluded. Cancer patients were randomly split into 80% training and 20% testing sets 100 times to calculate the final C-index. As shown in Figure 4A and 4B, the predictive power of gene expression data alone varied across cancer types. In KIRC and GBMLGG, the prognostic power was much higher than in other cancer types.

Figure 4: C-indexes by models trained from individual gene expression data alone or in combination with clinical variables
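For readers unfamiliar with the C-index, the following Python sketch computes a simplified Harrell-type concordance index on invented toy data. It is an illustration of the concept only, not the estimator implemented in the R package used in the study; the handling of ties and censoring here is an assumption. A value of 0.5 corresponds to random prediction and 1.0 to perfect ranking.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: the fraction of comparable patient pairs in
    which the patient who died earlier has the higher predicted risk.

    times:       follow-up time for each patient
    events:      1 if death was observed, 0 if censored
    risk_scores: predicted risk (higher = worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died before patient j's
            # last follow-up, so their ordering is known.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs (no usable death events)")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.3, 0.1])
c_bad = concordance_index(times, events, [0.1, 0.3, 0.7, 0.9])
print(c_good, c_bad)
```

The error raised when no pairs are comparable also illustrates why cancer types with very few death events cannot be evaluated reliably.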

A. C-indexes calculated from the ACC, BLCA, BRCA, CESC, COAD, ESCA, GBMLGG, HNSC, KIRC and KIRP. B. C-indexes calculated from the LAML, LIHC, LUAD, LUSC, MESO, OV, PAAD, SARC, SKCM and UCS. The lightblue box indicates the model built from individual gene expression data alone, and the darkblue box indicates the model built from the combination of gene expression data and clinical variables.

To explore any additional prognostic power, the gene expression data were combined with clinical information. Significant clinical features (correlated with survival) were used as the baseline to build the Cox model. A feature-selection step against the residuals was then used to include the gene features that better fit the model. The gene expression data alone had significant predictive power (C-index > 0.5) in most cases (18 out of 20). Incorporating clinical information with the gene expression data significantly improved model performance in 12 cancer types (BLCA, BRCA, CESC, GBMLGG, KIRC, KIRP, LAML, LIHC, LUAD, LUSC, OV, SKCM) (p < 0.05) (Figure 4A, 4B).

DISCUSSION

In this study, we assessed the clinical utility of genomic expression data from ~9000 cancer patients of 32 tumor types. The prognostic power across different cancer types was also evaluated [7, 8].

Currently, only a few gene expression-based markers are routinely used in clinical practice [9–12]. The clinical utility of genomic expression has not been fully explored. Yuan et al. reported that for cancer patients, incorporating molecular features with clinical information yields significantly improved predictions. However, they only focused on 4 cancer types (KIRC, GBM, OV, LUSC), and no conclusions could be drawn for the whole cancer population [13]. Recently, Gentles et al. described the genomic prognostic landscape across human cancers, highlighting the promise of genomic expression data as biomarkers for clinical outcomes [14]. In our study, besides illuminating the prognostic landscape of genomic expression, pathway analysis based on these prognostic genes was also evaluated. In addition, the C-index was calculated from the prognostic models across tumor types, to assess the prognostic power of gene expression data.

Based on the genomic expression data in the whole cancer population, the top prognostic genes were identified, such as FOXM1, CBX7, CREBL2 and SKP2, which is consistent with previous studies [14–16]. Notably, the risk scores built from these top prognostic genes significantly stratified survival outcomes in both the training and validation cohorts, indicating the robustness of the predictive effect of the prognostic genes.

Because of heterogeneity, many statistical methods have been developed to analyze cancer genomics, based on gene sets, pathways and network modules [17–19]. For the first time, our study described the prognostic landscape of biological pathways in the whole cancer population. Gene set enrichment of the differentially expressed genes revealed significant prognostic pathways, such as the E2F targets, MYC targets, G2M checkpoint, interferon gamma response, and so on. Mostly, these pathways are correlated with cell cycle, proliferation and inflammation, which is consistent with the biological mechanisms of tumor progression [20, 21].

With respect to prognostic power, our results showed that combining clinical and molecular information could improve the predictive power of the gene expression data in most cancer types. Although the absolute gains were limited, the gene-expression signatures provide new biological insights into the process of cancer progression and metastasis that can help to improve the prediction power [22]. Indeed, some gene-based prognostic signatures have already been shown to be clinically useful for predicting the risk of tumor recurrence, such as the 70-gene and 76-gene signatures in breast cancer [23–26].

It is also important to realize that gene expression information is just one of the abundant types of molecular data (genomic, transcriptomic, epigenomic and proteomic) that reveal the biological complexity of cancer. Other molecular information will also improve our understanding of the genotype–phenotype relationships involved in cancer. On the other hand, to ensure the reliability and reproducibility of molecular data in clinical use, improved technologies and statistical and analytical methods are greatly needed to keep up with clinical needs [22].

In conclusion, our gene analysis and pathway analysis showed significant values for the prediction of survival outcomes for cancer patients. Additionally, it was found that by combining clinical information with molecular data, the model performance could be boosted statistically in most cancer types. However, further efforts would be needed to generate prognostic models ready for clinical use in the future.

MATERIALS AND METHODS

Data set compilation

Clinical and survival data were acquired from the TCGA Data Portal (https://tcga-data.nci.nih.gov/tcga/). RNA sequencing data were obtained from the GDAC Firehose System (http://gdac.broadinstitute.org/). To maintain data consistency, only RNA sequencing data from the Illumina HiSeq 2000 RNA Sequencing V2 platform were included. Patients with complete clinical and RNA sequencing data were selected for further analysis. For each cancer data set, patients were randomly split into two groups: 80% as the training set and 20% as the testing set. For the pan-cancer study, all RNA sequencing data were combined by intersecting the common genes across the different cancer types.
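The random 80/20 split and the intersection of common genes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed seed, and the toy gene lists are hypothetical and are not taken from the study.

```python
import random


def split_train_test(patient_ids, train_fraction=0.8, seed=0):
    """Randomly split patients into training and testing sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)      # fixed seed only for reproducibility
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]


def common_genes(gene_lists):
    """Genes present in every cancer-type data set (set intersection)."""
    gene_sets = [set(genes) for genes in gene_lists]
    return sorted(set.intersection(*gene_sets))


train_ids, test_ids = split_train_test(range(10))
shared = common_genes([["A", "B", "C"], ["B", "C", "D"], ["C", "B"]])
print(len(train_ids), len(test_ids), shared)
```

Repeating the split with different seeds gives the kind of repeated randomization used later for the C-index estimates.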

Prognostic genes and construction of the prognostic model

For RNA sequencing data, all “raw count” values were divided by the 75th percentile of the same patient (after removing zeros) and multiplied by 1000 to obtain the quantile-normalized values used for survival analysis. Furthermore, the quantile-normalized data were also transformed to z-scores or normalized by “voom” to evaluate the effects of different normalization methods. The z-score was calculated as “(tumor expression - mean expression in reference) / standard deviation of expression in reference”. The voom normalization was applied using the R package “limma”. It estimates the mean-variance relationship of the log-counts and generates a precision weight for each observation.

The association of each gene's expression with survival outcomes was assessed via Cox proportional hazards regression using the ‘coxph’ function of the R ‘survival’ package. Cox coefficients, hazard ratios with 95% confidence intervals, p values, and z-scores were obtained for each gene. Top prognostic genes were identified by their z-scores. Based on these top prognostic genes, risk scores were built, defined as the weighted sum of the independent prognostic gene values (1 for high expression and 0 for low expression), weighted by their regression coefficients from the Cox models. Based on the prognostic risk score, further Cox regression analysis and correlation analysis with clinical variables were performed.
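The two transformations described above can be sketched in Python as follows. This is a hedged illustration, not the authors' pipeline: the percentile interpolation rule (`method="inclusive"`) and the use of the sample standard deviation are assumptions, since the text does not specify them, and the counts are invented.

```python
from statistics import mean, quantiles, stdev


def upper_quartile_normalize(counts, scale=1000):
    """Divide one patient's counts by the 75th percentile of the
    non-zero counts, then multiply by `scale`."""
    nonzero = [c for c in counts if c != 0]
    # quantiles(..., n=4) returns the three quartile cut points.
    q75 = quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / q75 * scale for c in counts]


def z_scores(values, reference):
    """(value - mean of reference) / standard deviation of reference."""
    mu = mean(reference)
    sd = stdev(reference)          # sample standard deviation (assumption)
    return [(v - mu) / sd for v in values]


# Hypothetical raw counts for one patient, including zero-count genes.
normalized = upper_quartile_normalize([0, 10, 20, 30, 40, 0])
z = z_scores([2.0, 4.0, 6.0], reference=[2.0, 4.0, 6.0])
print([round(v, 1) for v in normalized], z)
```

After this scaling, the 75th percentile of each patient's non-zero values equals 1000, which puts patients with different sequencing depths on a comparable scale.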

Differential expression analysis and clustering analysis

Differential expression analysis was performed using the R “limma” package. Based on the limma output for the most differentially expressed genes, unsupervised hierarchical clustering analysis was used to discover the gene expression patterns of groups sharing common characteristics. Heatmaps were constructed using the R “gplots” package.

Gene set enrichment analysis

Prognostic gene sets are groups of genes that share a common biological function. The evaluation of prognostic gene sets was performed using gene set enrichment analysis (GSEA) [27], with gene sets obtained from the Molecular Signatures Database (MSigDB) [28]. In addition, a variant of GSEA, termed single sample gene set enrichment analysis (ssGSEA), was applied to calculate separate enrichment scores for each pairing of a sample and a gene set [29]. Further Cox regression analysis and correlation analysis were performed based on the enrichment scores of each gene set.

Performance evaluation of gene expression data

Performance evaluation of gene expression data was conducted based on the method suggested by Yuan et al. [13]. First, univariate Cox regression was applied to the training set to select the top features correlated with survival, which were then further reduced by the LASSO using the R package “glmnet”. The model was then applied to the testing set for prediction. The concordance index (C-index) was estimated from 100 randomizations using the R package “survcomp”. To explore the predictive power of integrating gene expression data with clinical information, we used the significant clinical features (correlated with survival) as the baseline to build the Cox model. A feature-selection step against the residuals was then applied to add the gene features that better fit the model.

SUPPLEMENTARY FIGURE

We acknowledge the contributions of the TCGA Research Network and the TCGA Pan-Cancer Analysis Working Group. There is no potential conflict of interest. The protocol for the research has been approved by the Ethics Committee of the institution and conforms to the provisions of the Declaration of Helsinki.

FUNDINGS

This research was supported by the Natural Science Foundation of China (No. 81402022 and No. 81472247), and the Fund of China Scholarship Council (No. 201406280106).

Authors' contributions

Xinsen Xu: wrote the paper and analyzed data; Lei Huang: analyzed data; Chun Hei Chan: revised the paper; Tao Yu, Runchen Miao: collected genome data; Chang Liu: designed the research.

REFERENCES1

Siegel

Miller

Jemal

Cancer statistics, 2016

CA Cancer J Clin201666730

26742998

Van Allen

Wagle

Levy

Clinical analysis and interpretation of cancer genome data

Journal of clinical oncology20133118251833

23589549

Mansouri

Gunnarsson

Sutton

Ameur

Hooper

Mayrhofer

Juliusson

Isaksson

Gyllensten

Rosenquist

Next generation RNA-sequencing in prognostic subsets of chronic lymphocytic leukemia

American journal of hematology201287737740

22674506

Tang

Xiao

Behrens

Schiller

Allen

Chow

Suraokar

Corvalan

Mao

White

Wistuba

Minna

Xie

A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients

Clinical cancer research20131915771586

23357979

5. Chia, Bramwell, Shepherd, Jiang, Vickery, Mardis, Leung, Ung, Pritchard, Parker, Bernard, Perou, Ellis, Nielsen. A 50-gene intrinsic subtype classifier for prognosis and prediction of benefit from adjuvant tamoxifen. Clinical cancer research. 2012;18:4465-4472. PMID: 22711706.

6. Cancer Genome Atlas Research Network, Weinstein, Collisson, Mills, Shaw, Ozenberger, Ellrott, Shmulevich, Sander, Stuart. The Cancer Genome Atlas Pan-Cancer analysis project. Nature genetics. 2013;45:1113-1120. PMID: 24071849.

7. Cristescu, Lee, Nebozhyn, Kim, Ting, Wong, Liu, Yue, Wang, Liu. Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes. Nature medicine. 2015;21:449-456.

8. De Sousa E Melo, Wang, Jansen, Fessler, Trinh, de Rooij, de Jong, de Boer, van Leersum, Bijlsma, Rodermond, van der Heijden, van Noesel. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nature medicine. 2013;19:614-618.

9. Arteaga, Sliwkowski, Osborne, Perez, Puglisi, Gianni. Treatment of HER2-positive breast cancer: current status and future perspectives. Nature reviews Clinical oncology. 2012;9:16-32.

10. Dowsett, Nielsen, A'Hern, Bartlett, Coombes, Cuzick, Ellis, Henry, Hugh, Lively, McShane, Paik, Penault-Llorca. Assessment of Ki67 in breast cancer: recommendations from the International Ki67 in Breast Cancer working group. Journal of the National Cancer Institute. 2011;103:1656-1664. PMID: 21960707.

11. Ludwig, Weinstein. Biomarkers in cancer staging, prognosis and treatment selection. Nature reviews Cancer. 2005;5:845-856. PMID: 16239904.

12. Sidransky. Emerging molecular markers of cancer. Nature reviews Cancer. 2002;2:210-219. PMID: 11990857.

13. Yuan, Van Allen, Omberg, Wagle, Amin-Mansour, Sokolov, Byers, Hess, Diao, Han, Huang, Lawrence. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nature biotechnology. 2014;32:644-652.

14. Gentles, Newman, Liu, Bratman, Feng, Kim, Nair, Khuong, Hoang, Diehn, West, Plevritis, Alizadeh. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nature medicine. 2015;21:938-945.

15. van 't Veer, Dai, van de Vijver, Hart, Mao, Peterse, van der Kooy, Marton, Witteveen, Schreiber, Kerkhoven, Roberts. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530-536. PMID: 11823860.

16. Chan, Yang, Gao, Lee, Feng, Huang, Tsai, Flores, Shao, Hazle, Wei. The Skp2-SCF E3 ligase regulates Akt ubiquitination, glycolysis, herceptin sensitivity, and tumorigenesis. Cell. 2012;149:1098-1111. PMID: 22632973.

17. Kosorok, Huang, Dai. Incorporating higher-order representative features improves prediction in network-based cancer prognosis analysis. BMC medical genomics. 2011;4:5. PMID: 21226928.

18. Lee, Chuang, Kim, Ideker, Lee. Inferring pathway activity toward precise disease classification. PLoS computational biology. 2008;4:e1000217. PMID: 18989396.

19. Goeman, Buhlmann. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980-987. PMID: 17303618.

20. Tachibana, Gonzalez, Coleman. Cell-cycle-dependent regulation of DNA replication and its relevance to cancer pathology. The Journal of pathology. 2005;205:123-129. PMID: 15643673.

21. Coussens, Werb. Inflammation and cancer. Nature. 2002;420:860-867. PMID: 12490959.

22. Sotiriou, Piccart. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nature reviews Cancer. 2007;7:545-553. PMID: 17585334.

23. Bueno-de-Mesquita, van Harten, Retel, van't Veer, van Dam, Karsenberg, Douma, van Tinteren, Peterse, Wesseling, Atsma, Rutgers. Use of 70-gene signature to predict prognosis of patients with node-negative breast cancer: a prospective community-based feasibility study (RASTER). The Lancet Oncology. 2007;8:1079-1087. PMID: 18042430.

24. Buyse, Loi, van't Veer, Viale, Delorenzi, Glas, d'Assignies, Bergh, Lidereau, Ellis, Harris, Bogaerts, Therasse. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute. 2006;98:1183-1192. PMID: 16954471.

25. Wang, Klijn, Zhang, Sieuwerts, Look, Yang, Talantov, Timmermans, Meijer-van Gelder, Jatkoe, Berns, Atkins. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671-679. PMID: 15721472.

26. Desmedt, Piette, Loi, Wang, Lallemand, Haibe-Kains, Viale, Delorenzi, Zhang, d'Assignies, Bergh, Lidereau, Ellis. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical cancer research. 2007;13:3207-3214. PMID: 17545524.

27. Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:15545-15550. PMID: 16199517.

28. Liberzon, Subramanian, Pinchback, Thorvaldsdottir, Tamayo, Mesirov. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739-1740. PMID: 21546393.

29. Barbie, Tamayo, Boehm, Kim, Moody, Dunn, Schinzel, Sandy, Meylan, Scholl, Frohling, Chan, Sos. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462:108-112. PMID: 19847166.