Abstract

Article Figures and data Abstract Editor's evaluation Introduction Results Discussion Materials and methods Data availability References Decision letter Author response Article and author information Metrics Abstract Lung squamous cell carcinoma (LUSC) is a type of lung cancer with a dismal prognosis that lacks adequate therapies and actionable targets. This disease is characterized by a sequence of low- and high-grade preinvasive stages with increasing probability of malignant progression. Increasing our knowledge about the biology of these premalignant lesions (PMLs) is necessary to design new methods of early detection and prevention, and to identify the molecular processes that are key for malignant progression. To facilitate this research, we have designed XTABLE (Exploring Transcriptomes of Bronchial Lesions), an open-source application that integrates the most extensive transcriptomic databases of PMLs published so far. With this tool, users can stratify samples using multiple parameters and interrogate PML biology in multiple manners, such as two- and multiple-group comparisons, interrogation of genes of interests, and transcriptional signatures. Using XTABLE, we have carried out a comparative study of the potential role of chromosomal instability scores as biomarkers of PML progression and mapped the onset of the most relevant LUSC pathways to the sequence of LUSC developmental stages. XTABLE will critically facilitate new research for the identification of early detection biomarkers and acquire a better understanding of the LUSC precancerous stages. Editor's evaluation The authors have created a resource tool that is valuable in assessing precancerous lesions in the lung, which may serve as a tool for investigators working in this area, and as an example for additional similar resources. The accessibility of the tool is a concern but does not diminish the quality of the product. https://doi.org/10.7554/eLife.77507.sa0 Decision letter Reviews on Sciety eLife's review process Introduction Lung squamous cell carcinoma (LUSC) is a type of non-small cell lung cancer that accounts for 20–30% of all lung cancer cases (Chen et al., 2014; Sung et al., 2021; Bray et al., 2021). Despite being the second most frequent type of lung cancer (Torre et al., 2016), our knowledge regarding the biology of this disease as well as the therapeutic modalities to treat it remain far behind the most frequent type of lung cancer, lung adenocarcinoma (LUAD) (Hirsch et al., 2016; Khuder, 2001). LUAD genetics is dominated by mutations (that are often druggable) that activate the RTK/RAS pathway, including EGFR and KRAS mutations (The Cancer Genome Atlas Research Network, 2012; Jamal-Hanjani et al., 2017). However, the genetic landscape of LUSC is more complex, with multiple pathways altered in subsets of patients and a lack of actionable mutations (Campbell et al., 2016; The Cancer Genome Atlas Research Network, 2012; Kim et al., 2014), precluding the development of new therapies. Hence, the only pharmacological therapies available to treat LUSC patients are immune checkpoint inhibitor monotherapy or in combination with chemotherapy (Mok et al., 2019; Paz-Ares et al., 2018; Weinberg and Gadgeel, 2019). Furthermore, the National Lung Cancer Matrix Trial and The Lung Master Protocol, the largest personalized medicine trials in lung cancer, have not shown clear therapeutic benefits with targeted agents in LUSC (; Middleton et al., 2020; Redman et al., 2020). LUSC also has a worse prognosis than LUAD independent of stage at detection, with a 5-year overall survival (OS) of 6.2% for patients diagnosed with distant metastasis (9.5% for LUAD). However, patients diagnosed with localized disease are eligible for curative surgery and the 5-year OS is 50% (National Cancer Institute and DCCPS, 2018). Therefore, early detection is currently the most valuable tool to prevent deaths by LUSC as evidenced by several ongoing programmes of lung cancer early detection (Crosbie et al., 2019a; Crosbie et al., 2019b; Aberle et al., 2011). These initiatives make use of low-dose CT scans in high-risk populations, but in spite of the frequent detection of localized lung cancer eligible for resection with curative intent, 40% of early diagnosed patients die within 5 years. LUSC progresses through a series of premalignant stages characterized by alterations of the normal bronchial epithelium (Figure 1; Ishizumi et al., 2010; Kadara et al., 2016; Thakrar et al., 2017; Pennycuick et al., 2020). These endobronchial premalignant lesions (PMLs) are classified as low-grade (squamous metaplasia, mild and moderate dysplasia) and high-grade (severe dysplasia, and carcinomas in situ [CIS]) (Figure 1). However, not all PMLs progress to LUSC. Although obvious differences exist between the multiple studies published on the topic, high-grade PMLs have a higher risk of progression than low grade, and high levels of chromosomal instability (CIN) are also predictive of progressive PMLs (Thakrar et al., 2017; van Boerdonk et al., 2014; Teixeira et al., 2019; Merrick et al., 2016). This potential role of CIN as biomarker of PMLs progression has been observed in low- and high-grade PMLs (van Boerdonk et al., 2014; Teixeira et al., 2019). These reports showed that high levels of copy number variations was the best predictor of progressive PMLs (van Boerdonk et al., 2014; Teixeira et al., 2019) and that immune response is the most likely cause of regression as these lesions contained higher levels of immune infiltration (Pennycuick et al., 2020; Beane et al., 2019). These results are supported by transcriptomic analysis of PMLs that indicate immune evasion in the transition to invasiveness (Mascaux et al., 2019). Figure 1 Download asset Open asset Developmental stages of lung squamous cell carcinoma (LUSC) premalignant lesions (PMLs) with representative histological images for each stage (haematoxilyn-eosin) and a summary of the four studies included in XTABLE (Exploring Transcriptomes of Bronchial Lesions). PMLs are typically classified as normal epithelium (including hyperplasia), low-grade and high-grade. Two studies (Mascaux et al., 2019, and Beane et al., 2019) carried out gene expression analysis of multiple developmental stages, whereas Merrick et al., 2018, and Teixeira et al., 2019, focused on dysplasias (blue boxes) and carcinomas in situ (CIS) (pink boxes), respectively. The most relevant findings of each article are summarized in the figure. Error bars=50 µm. The detection of PMLs cannot be carried out by routine patient imaging techniques such as CT or PET scans as the morphological change that they cause in the airway does not result in radiological contrast. Alternatively, ablation of high-grade PMLs detected by autofluorescence bronchoscopy using minimally invasive endoscopic procedures in high-risk populations is an innovative and interesting strategy to prevent LUSC (Guibert et al., 2016). Nonetheless autofluorescence bronchoscopy is an expensive and complex technique of limited use in large screening programmes. Therefore, simpler, more cost effective, and scalable methods of high-grade PML detection are needed to prevent deaths by LUSC. Improving the detection of PMLs requires a better understanding of their biology and the validation of adequate biomarkers of progressive lesions that can be translated into new technologies for large screening initiatives. Cell surface proteins, metabolites, nasal-based biomarkers, blood and sputum/bronchoalveolar lavage biomarkers are examples of biomarkers that can be used to improve and/or complement current diagnostic techniques (such as CT and PET scans) or develop new ones. A user-friendly application to interrogate gene expression from studies focusing on precancerous LUSC stages will enhance the advances in PML biology. However, such application does not exist. Recently, scientific interest in the biology of preinvasive LUSC stages has motivated the publication of several articles characterizing PML transcriptomes from various perspectives (Figure 1). Two reports published by Beane et al., 2019, and Mascaux et al., 2019, showed stage-by-stage gene expression analyses of PMLs. These studies provided the most detailed transcriptomic characterization of all PML stages so far and identified changes in the immune microenvironment associated with invasive transformation. Additionally, longitudinal studies by Beane et al., 2019, Merrick et al., 2018, and Teixeira et al., 2019, contained samples with known progression potential. Beane et al., 2019, identified molecular subtypes with specific biological traits whereas articles published by Merrick et al., 2018, and Teixeira et al., 2019 focused on gene expression of specific preinvasive stages (dysplasias and CIS, respectively) with the objective of identifying predictors of PML progression and the biomolecular processes involved (Figure 1). These studies provide a valuable source of gene expression data to identify candidate biomarkers for the detection of high-risk PMLs and/or early stage LUSC as well as to investigate the biology of premalignant LUSC progression. Transparent and straightforward accessibility to transcriptomic databases is a key requirement for the open science philosophy. Applications that integrate multiple databases focusing on the same biological and clinical problem, such as camcAPP (Dunning et al., 2017), allow cross-comparisons between independent studies, strengthen the robustness of results obtained, and allow the selection of high-confidence data. In this report, we provide an overall description of XTABLE (Exploring Transcriptomes of Bronchial Lesions) and provide examples of its functions. XTABLE is a new open-source application that will enable scientists to interrogate currently four LUSC PML transcriptomic datasets in a versatile manner that can be adapted to the needs of each researcher. Specifically, LUSC prevention and diagnosis are the main areas that can benefit the most from XTABLE, but its versatility and multiple functions lend themselves to the exploration of a variety of research questions. Without XTABLE, researchers would have to put together all the packages, data processing steps themselves for each analysis they wished to run. In this report, we provide an overall explanation of all the functions of XTABLE as well as a detailed description of the most important analysis modules, including two-group comparisons, gene-of-interest analyses, and interrogation of transcriptional signatures. Additionally, we explored the use of CIN-related signatures as a biomarker of progressive PMLs and mapped the onset of the most important LUSC pathways to its developmental stages. Results Description of the studies included in XTABLE Four datasets originating from four independent PML transcriptomic studies have been included in the XTABLE application (Figure 1, Table 1). Two datasets, GSE109743 (Beane et al., 2019) and GSE33479 (Mascaux et al., 2019), provide gene expression data of the developmental LUSC stages (with and without progression status information, respectively), whereas the remaining studies, GSE114489 (Merrick et al., 2018) and GSE108124 (Teixeira et al., 2019), focus on analysing specific PML stages (dysplasias and CIS, respectively) that have been followed up to establish their progressive, persistent, or regressive potential. Table 1 Description of the four cohorts included in XTABLE (Exploring Transcriptomes of Bronchial Lesions). GEO accessionPMIDStagesProgression status knownNumber of samplesSample typeTranscriptomeGSE3347931243362MultipleNo122Whole biopsiesMicroarrayGSE10974331015447Multiple*Yes448Discovery cohort (197 Bx, 91 Br)Validation cohort (111Bx, 49 Br)Whole biopsies and brushingsRNAseqGSE11448929997230Dysplasias normalYes63Whole biopsiesMicroarrayGSE10812430664780CISYes33MicrodissectedMicroarray Bx: biopsies; Br: brushings. * This cohort includes neither carcinomas in situ (CIS) nor invasive carcinomas. Dataset GSE33479 (Mascaux et al., 2019) comprises expression microarray analyses of 122 endobronchial biopsies of unknown progression status. Samples were obtained following autofluorescence bronchoscopy and as no enrichment or microdissection of the biopsy epithelium was carried out, variable levels of stromal component are present in samples. Although this results in dilution of epithelial signals, it has the advantage of providing information about the microenvironment as well as insights into the infiltrated immune cells. Dataset GSE109743 (Beane et al., 2019) consists of a transcriptomic analysis of whole PML biopsies with no purification of the epithelial compartment as well as bronchial brushings obtained from adjacent normal regions of the bronchial mucosa. Unlike GSE33479, GSE109743 used RNAseq and more importantly, includes the progression status for some lesions established by serial biopsies. Samples were classified as ‘normal-stable’ when they changed between normal, hyperplasia, and metaplasia, ‘regressive’ when they regress from dysplasia to a less severe dysplastic grade, or from dysplasia to normal/hyperplasia/metaplasia. Remaining samples were considered persistent/progressive. One advantage of this dataset is the large number of samples divided into a discovery and a validation cohort (Figure 1, Table 1). However, the representation of each PML stage is not homogeneous and CIS samples are not included. Datasets GSE114489 (Merrick et al., 2018) and GSE108124 (Teixeira et al., 2019) constitute a different type of study. Both make use of sequential biopsies to classify lesions according to their progression potential, but they differ in the stages included, the classification of progression status and sample processing. In GSE114489, the authors collected 63 baseline bronchial biopsies (with corresponding follow-up biopsies) and classified samples in four groups: 23 persistent dysplasias (dysplastic lesions with the same or higher severity scores in follow-up biopsies), 15 regressive dysplasias (dysplasias progressing to lower severity scores), 9 progressive non-dysplasias (biopsies with normal or hyperplastic morphologies that progress to more severe morphologies), and 16 stable non-dysplasias (normal or hyperplastic pathology that remains stable in follow-up biopsies). Microarray analysis of gene expression was performed on whole biopsies. The study published by Teixeira et al., 2019 (GSE108124) also makes use of follow-up biopsies to classify the progression potential of PMLs, but unlike GSE114489, GSE108124 focuses on CIS and progression is defined as the transition to invasive carcinomas in follow-up biopsies. RNA for gene expression microarray analysis was extracted from microdissected FFPE samples to enrich the epithelial component. XTABLE access, interface, and functions XTABLE download can be carried out from https://gitlab.com/cruk-mi/xtable (copy archived at Roberts, 2022) following the instructions in the Materials and methods section and Video 1. XTABLE has been designed using the shiny app interface (see Materials and methods section) and its functions have been divided into 11 interrelated tabs that contain a specific function to interrogate each dataset separately (Figure 2A). Video 1 Download asset This video cannot be played in place because your browser does support HTML5 video. You may still download the video for offline viewing. Download as MPEG-4 Download as WebM Download as Ogg Step-by-step instructions to install XTABLE (Exploring Transcriptomes of Bronchial Lesions) using RStudio. Figure 2 with 3 supplements see all Download asset Open asset Overall organization of XTABLE (Exploring Transcriptomes of Bronchial Lesions) functions and use of the DEG function. (A) Organization of all the functions in the XTABLE interface. The functions are interrelated and completing certain analyses requires the use of several functions. For instance, the GSEA and PA functions operate with gene lists obtained with the DEG function. (B) Workflow to obtain differentially expressed genes between two groups using the DEG function. The example shows groups of samples arranged by developmental stage to compare low-grade and high-grade premalignant lesions (PMLs) in the GSE33479 cohort. (C and D) Workflow to obtain differentially expressed genes between two groups using the DEG function. The two groups have been arranged by progression status using in the GSE114489 and GSE108124 cohorts, respectively. Introduction Detailed description of the four studies, analyses performed in each tab, output formats, links to the four articles used in the application, bioinformatic packages used in the different functions and additional references. Dataset tab Explanation of the differences in methodology and progression status definitions between the four studies. Load tab In this tab, the user selects the database (GSE114489, GSE108124, GSE109743, or GSE33479) for subsequent interrogation. Due to specific differences between the four studies, only one dataset can be interrogated at a time. Since dataset GSE109743 contains a discovery and a validation cohort, as well as biopsies and brushings, preselection of the cohort and type of sample has to be carried out (Figure 2—figure supplement 1). CIN-score tab CIN has been identified as a good predictor of PML progression. Except for Teixeira et al., 2019, the other three studies do not provide genomic analyses that provide an estimate of the level of chromosomal alterations in the samples. Several gene expression signatures that correlate with CIN (CIN-scores) have been described in the literature (Teixeira et al., 2019; Carter et al., 2006). This CIN-score function returns a list of several CIN-scores (CIN70, CIN25, and CIN5, depending on the number of genes included in the signature) for all samples included in the study and a graph depicting CIN-scores in different sample types classified according to the sample classification defined in the study. A line marking a selected CIN-score threshold can be added for visualization purpose and to help determine appropriate thresholds for additional CIN-related analyses (Figure 2—figure supplement 2). DEG tab Function to identify differentially expressed genes in comparisons of two groups of samples determined by the user. This function allows the selection of p-value and fold-change cutoffs for the analysis. Additionally, AUC is calculated for each gene to provide further confidence to the differentially expressed gene results and additional gene IDs are added at this stage that have been sourced from two separate datasets to maximize identification and discrepancies manually reviewed to improve downstream enrichment and pathway analyses. GSEA and PA In these two tabs, the user can carry out gene set and pathway enrichment analyses using multiple tools (goseq/ideal, fgsea/MSigDB, enrichR, gage/gageData, kegga/pathview, ReactomePA, Progeny, and Dorothea). This function operates with the list of differentially expressed genes obtained with the DEG function. Signature This function returns the gene expression values of a user-defined gene list. Lists can be manually entered or uploaded from a .csv file. Deconvolution Estimation of immune and stromal component in samples from gene expression data using the ESTIMATE tool. ROC This function returns receiver operating characteristic (ROC) curves in a user-defined comparison of two groups of samples stratified by the expression of a gene or by CIN-scores. Heatmap Returns a gene expression heatmap for genes selected by variance, user-defined gene signatures, and differentially expressed genes in the DEG tab. PCA Principal component analysis (PCA) of all the samples included in each study. Several sample characteristics can be highlighted in the PCA plot including CIN signatures, progression status, and PML stage. Gene Statistical analysis of expression of individual genes selected by the user. Several options for sample groupings are available. Two-group differential expression analysis The discovery of candidate biomarkers for the detection of PMLs at high risk of malignant progression and the interrogation of PML biology depends greatly on the comparison of gene expression profiles between lesions with known progressive or regressive potential. The four databases included in XTABLE contain different types of information which influence how users can interrogate these databases. To facilitate the interrogation of the four transcriptomic databases in a manner that allows the versatile stratification of samples, we have designed a module in XTABLE named DEG (Figure 2B–D), that returns the differentially expressed genes in two user-defined groups of samples stratified by PML stage (GSE33479 and GSE109743), by known progression status (GSE109743, GSE114489, and GSE108124) or by CIN-score thresholds. In XTABLE, we have included three CIN-scores, named CIN70, CIN25 (Carter et al., 2006), and CIN5 (Teixeira et al., 2019). CIN70 and CIN25 have been reported in the literature, with the former containing the greatest number of genes for interrogation. CIN5 is derived from the signature used by Teixeira et al., 2019, and is reported to show a good correlation with progression in CIS lesions. Setting contrast groups by stage using DEG allows the comparison of two individual PML stages or the grouping of multiples stages into two groups. For instance, to compare the differential expression between low-grade and high-grade PMLs, we can define two contrast groups, one including metaplasia, low and moderate dysplasia (low-grade) and one including severe dysplasia and CIS (high-grade) using cohort GSE33479 (Figure 2B). After selection of the cohort and setting up the groups (Figure 2B), the application returns a downloadable list of differentially expressed genes with associated statistical information (Figure 2B), including AUC inferred by ROC analysis. ROC curves associated with each gene in a two-group comparison, useful for biomarker discovery, can be downloaded using the ROC tab (Figure 2—figure supplement 3). Straightforward grouping of progressive and regressive samples can also be carried out by selecting the ‘Progression’ contrast in studies that provide this information (Figure 2C and D). To carry out two-group comparison by CIN signatures, the CIN-score tab provides a graphic visualization of the three CIN-scores for all samples to guide the selection of a CIN-score threshold for sample stratification (Figure 3A). A .csv file containing the CIN-scores of all samples can be downloaded. The selected CIN threshold can be used in the DEG tab to stratify CIN-high and low samples and retrieve a list of differentially expressed genes (Figure 3A). Figure 3 Download asset Open asset Differential expression analysis between two groups of samples classified according to a chromosomal instability (CIN)-score threshold. The CIN-score function allows the graphic visualization of CIN-scores for all samples in a study. A CIN-score threshold selected by the user can be depicted on the graph (red dotted line). The CIN-score threshold selected by the user can be used in the DEG tab to define the two-group comparison. Stages 1–9 represent the nine developmental stages of LUSC as described in Mascaux et al., 2019 (GSE33479). CIN70, CIN25, and CIN5 can be used in the DEG tab. Sample sizes: n=12 (stage 7), n=13 (stages 1, 5, 6 and 8), n=14 (stages 2 and 9), n=15 (stages 3 and 4). Boxplots show median and upper/lower quartile. Whiskers show the smallest and largest observations within 1.5× IQR. The list of differentially expressed genes obtained in the DEG tab can be automatically used in the GSEA and PA tabs to carry out gene set enrichment and pathway analyses. The GSEA tab allows the user to select the gene set enrichment tool (goseq, fgsea/MSigDB, enrichR, and gage), p-value cutoff, and the gene sets to analyse (Figure 4A). For instance, using the goseq function enables us to select one of the three Gene Ontology domains (biological process, cellular compartment, and molecular function) to consider for analysis (Figure 4A). Similarly, with the fgsea/MSigDB tool, the user can select the gene set of interest (Figure 4B). The PA tab operates in a similar manner using four pathway analysis tools (kegga/pathview, ReactomePA, Progeny, and Dorothea) (Figure 4—figure supplement 1). Figure 4 with 1 supplement see all Download asset Open asset Gene set enrichment analyses in a list differentially expressed genes using the GSEA tab. (A) Gene set enrichment analysis using the goseq tool of a list of differentially expressed genes obtained in the DEG tab. One of the three main Gene Ontologies (GO) can be selected for analysis at a time. After selection of a p-value, XTABLE (Exploring Transcriptomes of Bronchial Lesions) returns a downloadable list of GO with associated statistics. (B) Gene set enrichment analysis using the fgsea/MSigDB tool. This tool allows the selection of any collection included in MSigDB and returns a list of signatures with associated statistics. The example shows the selection of the C3_TFT_GTRD collection (Transcription Factor Targets annotated in the Gene Transcription Regulation Database). Gene-centred analysis and user-defined transcriptional signatures Users can investigate their own gene or group of genes of interest in the application. To facilitate this type of gene-centred analyses, we have included the Genes and Signature functions in XTABLE. In the Genes tab, the user can analyse the expression of one gene of interest. This tab is divided in three tools. The ‘Expr’ tool returns a downloadable list of the maximum (if multiple probes map to the same gene symbol) normalized expression values for the gene of interest in all samples (Figure 5A). The ‘Indv_Contrast’ function returns the differentially expressed gene analysis results including fold-change and statistical significance values for a given gene in all groups of samples compared with a predetermined reference group (Figure 5B). Sample grouping and reference group depend on the study. The ‘Mult_Contrast’ function works similarly to the ‘Indv_Contrast’ function but allows merging of up to four groups of samples and the reference group for statistical analysis can be determined by the user. The example in Figure 5C shows the evolution of MYC expression in four groups of samples that represent normal, low-grade PMLs, high-grade PMLs, and invasive carcinomas (Figure 5C). The fold-changes and p-values are the result with the comparison with the ‘normal’ group (normal normofluorescent, normal hypofluorescent, and hyperplasia) as the reference group (Figure 5C). Figure 5 Download asset Open asset XTABLE (Exploring Transcriptomes of Bronchial Lesions) functions to implement analyses on individual genes (Gene tab) and user defined gene signatures (Signature tab). (A) The ‘Expr’ function (under the Gene tab) retrieves the normalized expression values for a gene of interest in all samples. (B) The ‘Indiv_Contrast’ tool compares the expression of a gene of interest in groups of samples with a predetermined group. In the example, the function compares the expression of MYC in all stages with the normal normofluorescent group in GSE33479. (C) The ‘Mult_Contrast’ tool enables the grouping of samples in up to four groups (contrasts) and statistical comparison with a reference group determined by the user. The example shows the analysis of MYC expression in four groups of samples from the GSE33479 cohort (normal, low-grade, high-grade, and invasive carcinomas). The ‘normal’ group is set as reference group for statistical analysis. (D) Example of the use of the Heatmap tab to interrogate to visualize the expression of gene sets in premalignant lesions (PMLs). Gene sets can be defined by the user (as in the example) and are shown using the stage classification and entered manually or from a .csv file. Alternatively, the heatmap can be generated from a list of differentially expressed genes from the DEG tab or a selected number of genes filtered by variance. The three options can be selected in the scroll-down menu. In the example shown, the heatmap shows all microarray probes associated to each gene symbol. p-Values calculated using Welch’s t-test. To facilitate the interrogation of biological processes driving PML progression, XTABLE also allows the interrogation of transcriptomic signatures using multiple functions. The Signature tab returns a list of normalized expression values and a graph with the signature scores (sum of log-normalized expression values) for a gene set determined by the user in all samples of the selected study. The Heatmap tab returns a heatmap that displays the expression values in all samples in a selected study. The gene set shown in the heatmap can be selected using three options from the scroll-down menu. With the ‘Signature’ option, the user can manually enter or upload a list of genes. The example in Figure 5D shows a heatmap generated from the GSE33479 dataset with five transcriptional targets of SOX2, an important LUSC driver. The ‘DEG’ option automatically selects the list of genes differentially expressed in the DEG tab. Finally, the ‘Variance’ option selects genes using a user-defined number of genes with the highest variance. Assessment of the correlation between CIN signatures and progression potential Two studies have highlighted the potential role of CIN as predictor of low-grade (van Boerdonk et al., 2014) and high-grade (Teixeira et al., 2019) PMLs progression in LUSC. XTABLE can be used to

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call