
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
1 Departments of Immunology, 2 Biostatistics, and 3 Internal Medicine, Mayo Graduate and Medical Schools, Mayo Clinic, Rochester, MN; and
4 IBM, T.J. Watson Research Center, Yorktown Heights, New York, NY
Requests for reprints: Diane F. Jelinek, Department of Immunology, Mayo Clinic, 200 First Street SW, Rochester, MN 55905. Phone: (507) 284-5617; Fax: (507) 266-0981. E-mail: jelinek.diane{at}mayo.edu
| Abstract |
|---|
|
|
|---|
Key Words: gene expression profiling B-CLL immunoglobulin mutation status
| Introduction |
|---|
|
|
|---|
The underlying cause of B-CLL remains unknown. In this regard, gene expression profiling has been used successfully by a variety of investigators to discover genetic differences between tumor cells and normal counterpart cells and to aid in subclassification of human cancers. This information is proving to be very useful in understanding tumor cell biology. This methodology has also been applied toward achieving a better understanding of B-CLL by other investigators. Plate et al. (6) used a cDNA expression array system to identify genes up-regulated by Fludarabine; however, the analysis was performed using an array limited to apoptosis-related genes, only one patient sample was studied, and there was no independent verification of the array data. Similarly, Aalto et al. (7) used a focused array approach (406 genes) to identify patterns associated with an 11q23 deletion in 34 B-CLL samples, but there was no independent verification of differentially expressed genes. Stratowa et al. (8) used a microarray that allowed assessment of 1024 genes on 54 B-CLL samples. They identified three genes that permitted a distinction between high versus low Rai stages. Only one of these genes was assessed by alternative methodologies and only on two patient samples. Finally, two studies used microarray platforms, which assess much larger numbers of genes to better understand the global molecular genetics of B-CLL (9, 10). Both groups used gene expression profiling as a tool to subclassify disease and to identify genetic differences between B-CLL cells and normal B-lymphocyte subsets. Klein et al. (9) concluded that B-CLL was more related to memory B cells than other B-cell subsets and Rosenwald et al. (10) concluded that the subset of B-CLL patients, the Ig variable region of which was germline (GL) in sequence, was more related to activated B cells than to resting B cells. Both groups identified a number of interesting genes that appeared to be under- or overexpressed in B-CLL; however, with one exception, there was no independent validation of the overall level of gene expression. In addition, there were significant differences in the nature of the differentially expressed genes between the two groups perhaps reflecting two different array platforms and technologic noise. It should be noted, however, that it is difficult to thoroughly compare these two data sets without a complete listing of significant genes in addition to the top-ranked differentially expressed genes. Moreover, some of the apparent differences may instead reflect the use of slightly different non-diseased B-cell controls in each study.
Clearly, gene expression profiling has the potential to provide a better understanding of human disease. However, the value of the data emerging from this type of analysis is critically dependent on a number of factors, including reproducibility (reviewed in Ref. 11). For example, whereas similarities in differentially expressed genes in a given tumor have been reported by independent groups (1214), there have also been reports that fail to observe even a single common gene (15, 16).
In this study, we performed global gene expression profiling on 38 B-CLL patients and peripheral blood B cells from 10 healthy, age-matched patients (
60 years) to determine the extent of genetic differences between B-CLL cells and normal B cells and whether differences exist between low and high risk B-CLL patients. Because we wanted to use gene expression profiling but were aware of the technical challenges of this methodology, we employed a rigorous experimental design and innovative methods of data analysis. This was accompanied by experiments designed to verify the expression patterns of a subset of informative genes. Finally, because we used the same microarray platform as that employed by Klein et al. (9), we also discuss the reproducibility of differentially expressed genes in B-CLL. Our independent validation of genes with data generated in an independent laboratory is an important contribution in that it underscores the validity of the involvement of these key genes in the biology of this disease. Thus, we describe the discovery of 70 novel genes that clearly discriminate between B-CLL and normal peripheral blood B cells. Of interest, this collection is highly enriched for genes known to be under- or overexpressed in other cancers, suggesting many new areas that can be exploited in the study of the transformation of B-CLL.
| Results |
|---|
|
|
|---|
Fold Change Analysis of Differential Gene Expression
To identify genes differentially expressed between B-CLL cells and normal counterpart B cells, we first performed a gene-by-gene analysis of the difference in mean log expression between the two groups. Briefly, the mean log expression for each gene on the Affymetrix GeneChip® was determined for the B-CLL patients and compared with the mean log expression for the same gene in the cohort of normal controls, and the difference was calculated. This difference is nothing but the logarithm of the familiar fold change of gene expression in B-CLL patients and control subjects, or more specifically, the log of the ratio of the geometric means of the expression level in case versus control. A potential hazard of gene expression profiling analyses is that the measurement of a sufficiently large number of genes on a relatively small number of patient samples can lead to the identification of genes that appear to discriminate well between patient sample types, but in fact are simply chance results. To protect ourselves against false discoveries, we determined appropriately stringent significance thresholds of experiment-wide false-positive rates. The significance thresholds were determined by randomizing the data set 1000 times, that is, by randomly reassigning the patient labels such that some B-CLL samples are now labeled as control B cells and vice versa, determining the maximum (over all the genes) difference in mean log expression (i.e., the maximum log fold change) for the randomized data sets (in both directions), and determining the distribution for the maximum values. The significance threshold was chosen as a high percentile (in the range of 9099%) of the distribution for the maximum values. Table 1 indicates that for a significance threshold corresponding to an experiment-wide 5% false-positive rate, there were only a small number of informative genes. Thus, to classify a gene as being overexpressed in B-CLL relative to control B cells, the log fold change had to be greater than 1.99, that is, the fold change had to be greater than 97.7. To classify a gene as being overexpressed in control B cells relative to CLL B cells, its log fold change had to be greater than 1.94, which corresponds to a fold change greater than 87. It is noteworthy that these fold changes are substantially larger than the more typical 2-fold changes that are commonly used as thresholds in microarray experiments. Using this method, 11 gene probes were identified as being significantly overexpressed in B-CLL relative to control and 10 gene probes were significantly overexpressed in control B cells relative to B-CLL. The identity of these genes is listed in Table 2. Of note, however, three of the genes that appear to be overexpressed in control B cells versus B-CLL cells are Ig-related. This may reflect patient-to-patient heterogeneity in light chain usage, that is,
versus
light chain, or the known fact that B-CLL cells tend to underexpress Ig when compared to their normal counterpart B cells.
|
|
|
There are only 10 gene probes in common between the 37 identified using the concatenation of filters analysis and the 54 using multivariate analyses. Thus, the number of total gene probes after application of these rigorous gene selection criteria is 81. Because of duplicate probes for some genes, the number of unique genes after this analysis was 72 (summarized in Tables 3 and 4). Among these 72 genes, Klein et al. have identified 10 of them previously as being differentially expressed between normal and CLL B cells. These 10 genes (corresponding to 11 gene probes) are grouped at the bottom of Fig. 2 and the genes that are novel are grouped above (n = 62 unique genes). Out of these 72 genes, 13 were also identified using the fold change analysis (indicated by boldface type in Tables 3 and 4). It is noteworthy, however, that a number of genes are not present on both lists. The Eisen plot of these 81 genes (72 unique genes) is shown at the left of the yellow line in Fig. 2, where both the genes and the patients are ordered according to the hierarchical clustering scheme described in the "Materials and Methods." The dendrogram of the patient samples shows that these genes clearly separate the control from the CLL patients in our data. The question that we address next is whether these same genes also allow for a separation in a completely independent data set.
|
|
|
To make the comparisons between the two data sets more quantitative, we applied a nearest neighbor classifier (see "Materials and Methods"), which is trained using our data. When presented with a previously unseen sample, our classifier can give three kinds of answers based on the expression of the 81 probes (72 genes): the sample can be deemed to be as coming from a CLL patient; or coming from a control (CTRL) patient; or undecided. Fig. 3 shows the results of this classification when the new samples are taken to be the data from the Klein et al. study. According to this classifier, CLL samples should fall below the lower dotted line, whereas CTRL samples should lie above the upper dotted lines. Samples that fall in between the dotted lines cannot be classified and are deemed as undecided. Fig. 3 demonstrates that we do not err at all when we classify all the samples from Klein et al. (9). It is interesting to observe that although germinal center control B cells are correctly classified, these B cells tend to be closer to the upper dotted lines than the naïve and memory B cells. This may be explained by the fact that the control B cells used to train our classifier were peripheral blood B cells, which consist primarily of naïve B cells with memory B cells comprising only a minor subpopulation.
|
Literature Exploration of Genes Differentially Expressed in CLL and Control
The analysis described above suggests that the set of 72 unique genes extracted from our study is predictive of whether B cells are obtained from healthy individuals or from patients with B-CLL. The predictive ability of these data suggests that this set of genes may be involved in the biology of normal versus transformed human B cells. In an effort to approach this concept, we systematically explored the existing literature on this set of genes and Tables 3 and 4 summarize the results. Taking into account that some of the 81 Affymetrix probes correspond to the same genes, the actual number of unique genes identified was 72. Moreover, seven of the genes underexpressed in B-CLL were Ig-related. Because B-CLL cells are characterized as being dim surface IgM positive, some of the Ig-related genes validate the methodology. For example, CLL cells underexpressed IgM, IgG3, J chain, and IgA. In addition, CLL cells have also been shown to express low levels of CD22 and CD79a, both of which were found to be underexpressed in CLL cells relative to control B cells (Table 4).
Out of this set of differentially expressed genes, 31 genes were overexpressed in B-CLL cells. Of these 31 genes, 16 were previously identified as related in one way or another with an oncogenic transformation. For example, LEF-1, a transcription factor involved in development, has been shown to be involved in many cancers, most notably in colon cancer. Thus, ß-catenin-sensitive isoforms of lymphoid enhancer factor-1 are selectively expressed in colon cancer (17). We also identified eight known genes that were overexpressed in B-CLL that have not been previously shown to be associated with any cancer including the gene, FMOD (fibromodulin). FMOD was one of the genes with the highest fold change in the comparison between CLL and CTRL in Klein et al. (9), both when the control cells were memory B cells, and when the control was an assortment of memory and naïve B cells, centroblasts, and centrocytes. Of the 24 genes with known function that were overexpressed in our CLL samples, 16 genes (67%) have been reported in the literature to also be overexpressed in some other cancer. This observation suggests that a high percentage of the known genes that have not previously been linked with cancer, as well as the seven unknown genes overexpressed in B-CLL, may be good targets for further study (see below).
With respect to the genes repressed in B-CLL, we found that of the 36 known genes underexpressed in CLL (Table 4), 10 of them (27%) reflected products of genes already known to be expressed at lower levels in B-CLL cells (Ig-related sequences, CD79a, CD20, CD22). Of the remaining 26 genes, 17 (65%) have been shown to be underexpressed in other cancers. Those genes that have not yet been linked to cancer may include a number of candidate tumor suppressor genes.
Further Gene Validation
Further investigation of any of these genes differentially expressed in our array assays first requires that the profiling data be verified using alternative methodologies. Although it was not possible in this study to verify all of these genes, we have focused on verifying two subsets of potentially informative genes. We focused first on verifying a subset of genes that were also identified in the Klein et al. study (9) as being differentially expressed in normal B cells relative to CLL B cells (IGFBP4, RASGRF1, LEF-1, ABCA6, FMOD, and TGFBR3). In addition, the majority of these genes have known links to human cancer (Tables 3 and 4). Graphical depictions of how the expression levels of these genes varied across the initial set of patient and control samples are shown in Fig. 4. Our own gene expression profiling data indicated that each of these six genes was overexpressed in B-CLL cells relative to normal B cells.
|
|
|
|
Identification of a Putative Disease Progression Signature
The results discussed above demonstrate the utility of this technology at identifying genes that are likely involved in disease pathogenesis. Having already demonstrated the power of our data analysis methods and validation via multiple methods, we next wished to further exploit the richness of the data set studied in this report. Thus, we next wished to determine whether we could discern genetic differences between low risk (Rai 0) and high risk (Rai 4) CLL patients. The rationale for this particular comparison is that the survival curves for these two risk categories are clearly distinct (18). Using the combined concatenation of filters and Genes@Work, we were indeed able to find 31 unique genes (34 probes) that appear to discriminate between low risk and high risk B-CLL (Fig. 8). Of some interest, there was one low risk patient that appeared to segregate with the high risk group. However, this particular patient was also diagnosed with lupus at the time the blood sample was used in the analysis. It is also interesting to note that there is little overlap between this signature and the global gene signature described above. We believe that these genes may represent a molecular signature of disease progression and therefore merit further study. Validation of these genes at either the RNA or protein level is currently under investigation.
|
| Discussion |
|---|
|
|
|---|
Other groups have previously employed gene expression profiling as a means to better understand this disease (610). Our study can be distinguished from these latter studies by our experimental design, which featured a balanced collection of primary data, use of age-matched healthy controls, and our methods of data analysis. Thus, because early studies have revealed the danger of unbalanced data collection, for example, grouping disease samples separate from controls or collection of primary data over long periods of time, Affymetrix GeneChips® from the same lot were used in all of our analyses. The chip analyses were performed over a restricted period of time and our experimental groups intentionally consisted of heterogeneous B-CLL samples, for example, CD38 positive, CD38 negative, GL and somatic mutation (SM) type B-CLL; and control B cells in every experimental run. Indeed, hierarchical cluster analysis of samples grouped by assay date failed to reveal relationships defined by assay date (results not shown). As a final measure to control variability introduced by time, using the same RNA, we obtained primary data from two separate samples. Rather than running these duplicate samples within the same run, we chose to separate these analyses by time, and thereby use a more stringent filter that estimated maximal variation, rather than underestimating this important source of variability using this methodology (19).
Our decision to use peripheral blood B cells from healthy, age-matched controls rather than B cells obtained from separate anatomic compartments largely reflected the likelihood that the circulating tumor cell in B-CLL may be fundamentally distinct from tumor cells that may reside within the node or bone marrow. Indeed, there is evidence for small foci of leukemic B-cell proliferation centers in lymph nodes from B-CLL patients; however, CLL blood B cells are notoriously quiescent with respect to the cell cycle (20, 21). Therefore, because peripheral blood B cells from normal individuals are typically not cycling, we reasoned that our goal to identify genes that would discriminate B-CLL from normal B cells would be best accomplished by comparing blood B cells from age-matched controls with CLL B cells. Because there are noted differences between cord blood CD5+ B cells and CD5+ B cells from adults (22), we chose not to include cord blood B cells in our current analysis.
In this study, we employed three general methods of data analysis. It is generally known that large data sets with many genes and few patients can lead to the identification of genes that appear to discriminate well between patient sample types, but in fact are simply chance results. To prevent this, we used stringent significance thresholds in all of our analyses. Even though a high significance threshold for each method of data analysis will necessarily imply missing a number of biologically relevant but statistically non-significant genes, the combination of three data analysis methods allows us to ameliorate this effect, as a false negative in one method may be a true positive in another method, particularly because different methods interrogate the data in different ways.
Our first analytical method interrogated the data in terms of the fold change between case and control patients. A gene passed this fold change test if its ratio of expression in case versus control (or vice versa) was larger than the 95th percentile of the distribution of the maximum fold change (over all the genes) in the null hypothesis (where cases and controls were randomly permutated). The fold change analysis discussed above is only one measure of the distribution of gene transcription. For example, a fold change between case and control can be nonsignificant based on our fold change criterion, but the distribution of expression between case and control may be significantly different. We interrogated the data using this latter perspective coupled with other strategies that required a minimum number of present calls and a fold difference in expression that exceeded the replicate noise level. For this, we designed a concatenation of three filters (the EN, P/A, and S/N filters) that subsequently excludes genes that do not pass the corresponding criterion. It is interesting to mention that the set of genes that resulted from the concatenation of filters was the same if we replaced the S/N filter (with threshold 1, see "Materials and Methods") by a t test at Bonferroni-corrected P of 0.05, showing a measure of robustness of this concatenation of filters approach.
Yet a third way of interrogating the gene expression data corresponds to the scenario where a group of genes becomes deregulated in CLL but is tightly regulated in normal counterpart B cells. This would result in a pattern of expression the signature of which was not necessarily captured by the previous two analyses. To discover these kinds of patterns, we applied Genes@Work, a gene expression pattern discovery software downloadable from www.research.ibm.com/FunGen/index.html. The Genes@Work software, but not the other analytical tools described and employed in this study, has been successfully used by Klein et al. (9) to identify genes that could discriminate CLL from normal B-cell subsets.
The combination of three different analytical methods that interrogate the data from different perspectives allowed us to recover some of the genes that were missed as false negatives in one of the methods. In this study, we uniquely combined the fold change test, the concatenation of filters, and the Genes@Work software and we were able to identify a number of interesting genes that were either over- or underexpressed in CLL B cells relative to normal B cells. Indeed, a number of genes, particularly those that were in the underexpressed category primarily served to validate the methodology as they included genes the products of which are known to be underexpressed in B-CLL, for example, Ig, CD79a, and CD22.
Although there have been two previous wide-scale gene profiling studies of B-CLL, our ability to identify a significant number of additional novel genes that discriminate B-CLL from normal B cells emphasizes the power of our approach, that is, patient population and data analysis methodologies. In addition, our studies directly demonstrate the advantages provided by multiple, independent microarray studies. We were able to obtain additional data on nine of the genes uniquely identified in this study. As noted above, a number of these genes were Ig-related, and are consistent with the known levels of Ig expression on B-CLL cells. Other gene examples that are noteworthy include TNFRSF7 (also known as CD27), KIAA1010, and KIAA0275. Thus, CD27 is highly expressed on all B-CLL cells, regardless of Ig mutation status (Ref. 23, unpublished observations) and this gene was part of the signature of overexpressed genes in B-CLL. We were also able to validate by RT-PCR the overexpression of the two currently unknown genes in five of five new B-CLL patient samples: KIAA1010 and KIAA0275. Although KIAA1010 has been described as an SH3 domain-containing protein and KIAA0275 has some homology to serine protease inhibitors, the function of these two genes remains unknown, as does their relevance to B-CLL.
It is also important to note that there were genes that were difficult to clearly validate given the trends suggested by our primary expression profile data. Noteworthy in this regard were NUMA, CD73, GS3955, and JunB because some patient samples clearly expressed these genes and others did not. Importantly, JunB gene expression was recently shown to be inactivated by methylation in chronic myeloid leukemia (24), suggesting the possibility that the absent expression of JunB in a subset of B-CLL patients may occur through a similar mechanism. Indeed, abnormal methylation patterns have been described in B-CLL (25, 26). The pattern of JunB expression did not, however, correlate with either CD38 expression levels or Ig mutation status (results not shown). With respect to NUMA, CD73, and GS3955, our analytical methods suggested that NUMA (which passed the fold change test and the concatenation of filters, but not the Genes@Work criterion) was overexpressed in CLL B cells and GS3955 (which passed all our tests) was underexpressed, but CD73 was only identified by the fold change analysis. Inspection of the original data for these three genes does indicate some heterogeneity of expression among patient samples, thereby perhaps explaining our RT-PCR results. Because the absolute differences in expression of these three genes may be less, a more quantitative RT-PCR may be required to validate the differences, if indeed they exist. CD73 is an ecto-5'-nucleotidase that we have also studied at the protein level.2 In preliminary studies, although normal B cells are uniformly positive for this marker, some B-CLL patients are completely negative for CD73 in a manner consistent with the gene profiling data; however, some B-CLL patients clearly express this molecule. Moreover, there is evidence in the literature that B-CLL patients have decreased ecto-5'-nucleotidase activity and that this is secondary to decreased numbers of CD73+ cells (27). CD73 expression may therefore be an important marker of disease heterogeneity. Collectively, our experience with these three genes underscores the necessity of gene validation before extensive further investigation or designation as a component of a global gene signature versus inclusion in a disease subset signature.
A significant number of genes differentially expressed by CLL B cells were genes that have been previously linked to other cancers. This fact supports the possibility that these gene products may be directly or indirectly linked to the transformation process in B-CLL. For example, the LEF-1 transcription factor was identified by Klein et al. (9) as being overexpressed in B-CLL cells. This gene was also part of our signature, and we have shown for the first time that this gene is indeed overexpressed in B-CLL cells at both the RNA and protein levels. This gene is an attractive candidate for further study for a variety of reasons. First, LEF-1 has been implicated in Wnt and ß-catenin signaling (reviewed in Refs. 28, 29), there are reports linking LEF-1 with colon cancer (17), and there is also literature regarding the role of LEF-1 in lymphocyte differentiation (reviewed in Refs. 30, 31). Of interest, LEF-1 is only expressed in developing pro-B cells, but not in mature B cells (32). In this regard, Wnt signaling regulates B-lymphocyte proliferation through a LEF-1-dependent mechanism (33). There is also evidence that Wnt proteins may function as hematopoietic growth factors (34). It is therefore plausible that reactivation of LEF-1 expression in B-CLL may facilitate the tremendous B-cell clonal expansion that is observed in this disease. This possibility is currently under investigation in our laboratory.
Our study was additionally able to validate elevated expression of IGFBP4, FMOD, RASGRF1, ABCA6, and TGFBR3. Each of these genes, with the exception of fibromodulin, has been linked with other cancers. Moreover, Klein et al. (9) also identified each of these genes as being overexpressed in CLL B cells relative to normal memory B cells. There are a number of interesting features of each of these genes including fibromodulin, which has not been previously linked with cancer. Thus, fibromodulin binds TGF-ß and typically negatively modulates its activity (35). This observation, coupled with reports that B-CLL cells respond abnormally to TGF-ß (36), raises the interesting possibility that CLL overexpression of fibromodulin may underlie the abnormal response to TGF-ß. Similarly, the TGF-ß receptor III (TGFBR3 or ß-glycan) is expressed in leukemic B-cell precursors (37), overexpressed in small cell lung cancer cells (38), and modified in colon cancer cells overexpressing oncogenic Ki-ras (39). Regarding the link between TGFBR3 and ras, it is interesting that we also observed overexpression of the guanine nucleotide exchange factor, RASGRF1. This protein has been shown to be oncogenic in NIH 3T3 cells (40) and to increase active ras and MAPK in NIH 3T3 cells engineered to overexpress RASGRF1 (41). There are also potentially interesting links between IGFBP4 expression and the suppression of normal humoral immunity seen in B-CLL. Thus, IGFBPs have been suggested to modulate the action of IGF-I or IGF-II on lymphopoiesis (42), raising the possibility that this may be one mechanism by which the normal humoral immune compartment becomes variably depressed in B-CLL patients. Finally, ABCA6 is a member of the ATP-binding cassette (ABC) transporter family. Recent studies have begun to demonstrate a link between ABC transporter function in normal and malignant hematopoiesis (reviewed in Ref. 43). In addition, members of this gene family have also been shown to be overexpressed in breast cancer (44). Given the ability of some members of this family to confer drug resistance on tumor cells (45), it is interesting to speculate that the recently discovered ABCA6 transporter protein may be playing a critical role in B-CLL. Nevertheless, it is important to acknowledge that definitive evidence linking these genes to the biology of B-CLL is currently lacking. Given that these genes have now been identified in two independent studies (this report; Ref. 9), there is a strong rationale to further investigate these genes in B-CLL.
In this study, we also tested the ability of our gene signature to correctly classify independent data, that is, discriminate between normal and CLL B cells. In this regard, it is striking that we did not err at all. As noted above, this is remarkable because of differences in patient populations studied and because of the variety and number of factors that can limit the reproducibility of this technology (reviewed in Refs. 19, 46). Because this kind of validation and reproducibility can be achieved, this suggests that gene expression technology is coming to maturity, and may be usable in a diagnostic context, perhaps even for disease subset definition.
Finally, we also present evidence that there may be a gene expression signature that differentiates low risk from high risk disease (Fig. 8). In this regard, we have previously suggested that there may be subsets of disease defined on the basis of CD38 and Ig mutation levels (47). Although Stratowa et al. (8) studied expression differences between low (0, 1) and high (2, 3, 4) Rai stages, this analysis was limited to a restricted number (1024) of selected genes. Our study suggests some interesting genetic differences between the two stages of disease. For example, IgM expression was higher in the high risk versus the low risk patients. This result is consistent with other observations that B-CLL patients, the tumor cells of which express GL Ig variable regions, have more aggressive disease and display a more activated phenotype consistent with that displayed by B cells activated via the BCR (48). In addition, MAP2K3, also activated through the BCR, was also more highly expressed in the high risk patients. Because a global signature could be identified using a heterogeneous mix of early, intermediate, and late stage patients, the global signature is likely to be more closely linked with disease pathophysiology, whereas the stage specific signature is likely to be more closely linked with progression events. Clearly, it will be critical to verify the informative nature of these putative stage-specific genes before embarking on future studies linking these genes with disease biology. In this regard, it is important to acknowledge that this analysis was carried out on a limited number of patient samples (n = 22), raising the possibility that some of these genes may reflect inaccuracies resulting from overfitting of the data. However, our success in identifying a global B-CLL gene signature using similar methods predicts that the majority, or at least a subset of these genes will be valid candidates for further study.
In conclusion, we describe new non-duplicative methods of data analysis that were useful in identifying a genetic signature of CLL B cells. Moreover, we present novel evidence that this technology may also permit the discrimination by gene profiling of low risk from high risk B-CLL. Genes linked to this signature may uniquely play a significant role in disease progression. Our methods of analysis were designed to maximally exploit the richness of the data that accompanies this high-throughput technology and simultaneously apply data filters and statistical tools that would minimize the false-positive results while recovering false negatives. Because of the ability of our signature to accurately discriminate leukemic cells from normal B cells in an independent data set, we predict that the majority of the genes identified in this study are genuine markers of disease. We believe that this relatively small group of genes should constitute a new focused CLL B-cell probe array. In addition, we provide new data that validate the differential expression of a number of interesting genes included in our gene signature of B-CLL. Our initial validation data on some of the differentially expressed genes in B-CLL continue to support future work that links altered gene expression with functional downstream molecules that are critical to the biology of this disease.
| Materials and Methods |
|---|
|
|
|---|
or
expression immunophenotype. Peripheral blood was collected from previously untreated B-CLL patients encompassing all Rai stages (N = 38) after informed consent. Consented healthy blood donors, 60 years of age or older, served as age-matched controls (n = 10) for this study.
Cell Isolation
Peripheral blood mononuclear cells (PBMCs) from B-CLL patients were separated from heparinized venous blood by Ficoll (Gallard-Schlesinger, Garden City, NY) density gradient centrifugation. B-CLL patient PBMCs used in the gene expression verification experiments that did not exceed >85% CD19+ cells were further purified using CD19 magnetic beads (Miltenyi Biotec, Auburn, CA). PBMCs were suspended with CD19 beads in PBS, 0.5% FCS, and 2 mM EDTA for 15 min at 4°C, before washing, resuspension in the same buffer at 510 x 106 cells/ml, and passage through the AutoMacs Magnetic Cell sorter (Miltenyi Biotec) to collect the CD19+ cells. The normal control samples were received as enriched buffy coats and CD19+ cells were isolated from the PBMCs as described above. Following bead selection, the purity of the CD19+ cells typically exceeded 95%.
RNA Isolation and Microarray Hybridization
Total RNA was isolated from cells using the Trizol reagent (Life Technologies, Inc., Grand Island, NY). RNA was quantitated by reading the absorbance at 260/280 and the RNA quality was further tested using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA). Double-stranded cDNA was generated from 10 µg total RNA using the Superscript Choice system (Life Technologies) with T7-(dT)24 oligomer. cDNA was purified by phenol/chloroform extraction and precipitation. Biotin-labeled cRNA was prepared using the Enzo BioArray HighYield RNA Transcript labeling kit (Affymetrix, Inc., Santa Clara, CA). The biotinylated probe was cleaned of unincorporated NTPs using an Rneasy kit (Qiagen, Valencia, CA). Quality of cRNA and labeling efficiency was verified on the GeneChip® Test3 array (Affymetrix). Fifteen micrograms of quality fragmented cRNA and GeneChip® Eukaryotic Hybridization Control reagents (Affymetrix) were hybridized to each HG-U95Av2 GeneChip® (Affymetrix) in hybridization buffer with final concentrations of 100 mM MES, 1 M [Na+], 20 mM EDTA, and 0.01% Tween 20. After hybridization and washing according to the manufacturer's protocol, the GeneChips® were scanned on an Agilent GeneArray Scanner. All experiments were scaled to an arbitrary value of 1500 as recommended by the manufacturer. We flagged as dubious those chips for which the 3'/5' ratios of either GAPDH or ß-actin was >3 or the percentage Present calls were smaller than the mean percentage Present calls in our data set minus 2 SDs. These criteria flagged out three chips. We therefore performed all the analyses with and without these three chips. As the differences were minimal, we opted for keeping these three chips, which are nonetheless marked with arrows in Fig. 2. To minimize experimental variability, all of the HG-U95Av2 Affymetrix chips used in this study (n = 50) had a matched synthesis date. In addition, there was uniform technical effort and replicate analysis was performed on two of the samples, that is, uniform RNA but sample preparation times were separated by 2 months. Finally, the 50 patient and normal control samples were analyzed in 10 groups of 5 samples. Thus, to minimize spurious clustering that may result from the day of analysis, each group of 5 samples contained one normal control RNA, and four B-CLL RNA samples from B-CLL patients that were heterogeneous with respect to CD38 and Ig mutation status.
RT-PCR
Sense and antisense primers to genes of interest were designed based on the probesets used by Affymetrix for synthesis of the GeneChip® Primer sequences (Table 5). The sequences for ß-actin primers have been described previously (49). Total RNA was converted to cDNA using the Pharmacia 1st Strand cDNA Synthesis kit (Piscataway, NJ). Two microliters of cDNA was amplified using the Qiagen HotStarTaq kit in a total volume of 50 µl, which included 20 pmol of each primer, 200 µmol/l of each dNTP, and 2.5 units of HotStarTaq. Amplification was carried out in a Perkin-Elmer PE9600 thermocycler (Perkin-Elmer, Boston, MA) as follows: denaturation at 95°C for 15 min; 35 cycles of 30 s at 94°C; 1 min at 5560°C; 1 min at 72°C; and a final cycle of 10 min at 72°C. The annealing temperature was dependent on the primer pair used. Amplified products were electrophoresed on a 1.5% agarose gel and visualized by ethidium bromide staining.
|
Data Processing
Both the Affymetrix average-difference expression data and the P/A calls were used in the analyses. We used Affymetrix Microarray Analysis Suite MAS 4.0 to process the microarray information. (More detail about the Affymetrix image analysis algorithms is available at http://www.affymetrix.com/products/software/specific/mas.affx.) Unless otherwise stated, the logarithm to the base 10 of the average difference was chosen as the variable for subsequent analyses. In these cases, and before taking the logarithm, the small and negative expression levels were clipped-off to be equal to a cutoff value arbitrarily chosen as 1.
Dendrograms. The hierarchical clustering algorithm used to generate the dendrograms in this paper is based on the average-linkage method (50, 51). For the dendrograms, the log-transformed expression value of each selected gene is normalized to have zero mean and unit SD across all the samples. The similarity between two individual samples is calculated by the Pearson correlation.
Graphic Representations of Gene Expression (Color Matrices). Columns represent individual experiments (or in this case, samples) and rows represent individual genes present on the expression microarray. To generate a pseudo-color map, first a gene- and experiment-specific change of variables, from the log-transformed measurement
into
ge, is computed using the formula:
![]() |
P are respectively the mean and SD computed from the gene expression log-transformed values for that gene in the phenotype group, and µC and
C are their corresponding values computed from the control group. The value of this function is then plotted using a pseudo-color map that represents
ge = 0 as black,
ge > 0 as progressively brighter hues of red, and
ge < 0 as progressively brighter levels of blue.
ge = 2 and
ge = -2 correspond to complete saturation of the red and the blue, respectively. The resulting pseudo-color map associates the same colors to measurements that are off by the same number of SDs from their expected value.
Gene Expression Data Analyses
Differentially expressed genes between two sets of subjects C1 and C2 (or alternatively called the phenotype set P and the control set C) are identified using a mixture of statistical schemes that can be broadly characterized as univariate methods and multivariate methods.
Univariate Analysis of Gene Expression Data. In the following schemes, each gene is treated independently to assess whether it is differentially expressed in the two sets under consideration.
(1) Fold change analysis. One of the customary and simple analyses for gene expression data is to accept a gene as differentially expressing in two groups if its fold change is above a significance threshold. We choose this threshold as a relatively high percentile of the distribution of the maximum fold change (over all the genes) under the null hypothesis that case and controls are chosen at random. We do not use the variance of the fold change distribution to determine the significance threshold, but rather the distribution of the fold change extreme values, which we sample by permutating the labels of cases and controls. For the comparison of primary interest, genes were first sorted by the difference in mean log expression values between comparison groups (CLL B cell versus normal B cell), where gene expression was defined as the average difference generated by the Affymetrix Microarray Analysis Suite software. This difference in mean log expression is a measure of the log fold change between the two groups. The genes whose difference in mean log expression is beyond a significance threshold are deemed to be differentially expressing in the two groups. The threshold is chosen in a very stringent way using a method that allows us to estimate the differences in mean log signal that could arise simply by chance in an analysis of more than 12,600 genes. We therefore mixed the classes (creating a null hypothesis ensemble), so that the fold change that is computed for each gene is actually independent of the real classes. Thus, out of all the genes in one permutation, we chose the gene(s) with the maximum fold change, and added that fold change to our statistics. By performing 1000 permutations, we generated 1000 maximum (log) fold changes, and took as our threshold the 95th percentile of the distribution of these 1000 values. In this way, we can interpret that the probability of having the maximum fold change greater than that threshold in the null hypothesis is 5%.
(2) Filter strategies. In this strategy, to be deemed differentially expressed in the two classes, a gene has to successively pass three filters: the P/A filter; the EN filter; and the S/N filter.
(2a) P/A filter. Each gene in an Affymetrix GeneChip® is assigned a status of either Present (P) or Absent (A) or Marginal (M). When the call for a gene is A, the corresponding numerical value of the average difference is very unreliable. In our analyses, we only accept a gene for further analysis if there is a minimum number of subjects for which that gene's call was either P or M. If we denote by #C1 and #C2 the number of subjects in classes C1 and C2, respectively, then only genes whose percentage of A to P and M calls is less than min (#C1, #C2)/(#C1 + #C2) will be further processed.
(2b) EN filter. We performed two hybridizations of a sample coming from one patient to assess the inherent assay variability. The result from one of these duplicated experiments is plotted in Fig. 1. In this figure, the dots correspond to genes measured from the same sample in two different hybridizations. The expression level for a given gene in hybridization 1 is shown in the abscissa (
1), and the expression for the same gene in the second hybridization is shown in the ordinate (
2). If the measurements had been unaffected by noise, the points in Fig. 1 would be organized on a straight diagonal line, that is,
1 =
2. Thus, the scatter of the points is an indication of experimental (non-biological) variability. As a measure of this variability pertaining to the cloud of dots in Fig. 1, let us observe that the percentage of genes, the reproducibility fold change of which was above 2, 3, 10, and 50, was respectively 30%, 16%, 3%, and 0.5%. The solid, empirically determined, convergent lines indicate the envelope of the cloud of points and were designed to only leave 1% of the points outside of the two lines. For points in Fig. 1, 99% of them are bounded between the two lines, or what is equivalent, verify the empirical equation |log10(
1) - log10(
2)|
2.8 - 0.25[log10(
1) + log10(
2)]; the equality in this formula represents the equation for the solid lines. This approach is similar to the Use-Fold algorithm described in Ref. (52). These convergent lines also emphasize that the variability of a given gene is influenced by its overall level of expression with lowly expressed genes displaying more variability than highly expressed genes. This observation can be used to decide whether the difference of expression of a given gene in two subjects is of biological origin. Suppose a gene is measured in patients 1 and 2 yielding expression values
1 and
2, respectively. If these two values verify the equation above, we cannot assert that the difference in expression for this gene in these two patients is not due to the assay noise. However, if the equation above is not met, which occurs only 1% of the time if there is no biological effect at play, then that difference can be ascribed to a biological effect. The probability that we err in that assessment is thus 1%. When we have two sets of subjects C1 and C2, the same criterion is applied, where
1 and
2 are interpreted to be the geometric average of the expression of a gene in sets C1 and C2, respectively. Thus, when applying the EN filter, we take each gene and compute the geometric average of the expression value in the CLL class and plot it as the abscissa for that gene, whereas the ordinate corresponds to the geometric mean in the CTRL group. Only genes that fall outside the solid, convergent lines in the figure are plotted (plus symbols, Fig. 1). Therefore, there are 54 plus symbols, which are the 54 genes that passed the EN filter. Some of these genes will be further filtered out by the S/N. Finally, we plotted the two lines representing the thresholds of the fold change test (the first of our methods) on the same figure. The upper dashed line is the one representing a fold change of 87. The lower dashed line represents a fold change of 98.
(2c) S/N filter. When a gene passes the EN filter, the average of this gene in C1 can be considered as different beyond EN from the average of the same gene in C2. However, the question remains as to whether this difference in means is relevant when compared with the natural variability of expression around the mean. In other words, the distribution of a gene in C1 may overlap considerably with the distribution of the same gene in C2. A simple measure of this overlap in distributions is the ratio of the difference in means divided by the average SD:
Our not too stringent S/N filter, weeds out the genes with zg < 1.
Multivariate Analysis of Gene Expression Data. To be deemed differentially expressed in our multivariate analysis, a gene has to belong to at least one of the patterns discovered by Genes@Work, and then has to pass the P/A filter.
(1) Genes@Work. We used the gene-expression pattern discovery algorithm Genes@Work (47) as our basic tool to perform multivariate gene selection. A Genes@Work pattern consists of a subset of genes that express differentially in a subset of the phenotype set with respect to the control set. Differential expression is determined as follows: for each gene g, a gene expression probability density pg(e) is computed empirically from the control set. The algorithm discovers all the subgroups of patients within the phenotype group for which there is a subset of genes with the following property: for every gene g in this subset of genes, the integral of pg(e) over the expression range in the subgroup of patients is less than a user-defined threshold
(0 <
< 1). In other words, each gene in a pattern expresses at a similar level in the patient subset, the level of similarity given by
. The number of patients (genes) that compose a pattern is called its patient (gene) support. The statistical significance of a Genes@Work pattern is given by its P, defined as the probability of observing one or more similar patterns (i.e., a pattern with the same number of support patients and support genes) in a set of random patients whose genes g's are distributed according to the same probability density pg(e) (the null hypothesis, Ref. 53). The input parameters for Genes@Work are the minimum gene and patient support of a pattern, the degree of similarity of gene expression for each gene
, and the maximum P.
(2) Gene selection with Genes@Work. Given two sets of subjects, C1 and C2, Genes@Work is run twice, with either set chosen in turn as phenotype and the other as the control. The minimum patient support is made to be 70% of the number of subjects in the phenotype set; the minimum gene support is chosen to be 4 in all the cases;
is chosen to be 0.01, and the maximum P was chosen to be 10-10. A gene is selected as differentially expressed if it belongs to at least one of the patterns discovered by Genes@Work. For a more technical discussion, see Ref. 53.
Classifier and Classification Method. Both pattern discovery and a concatenation of filter methods are used to identify a set of ng genes (gene features) that express differentially in two sets of subjects C1 and C2. C1 and C2 constitute the training set. Given a new subject (the query), we can ask the question of whether this subject belongs in class C1 or C2. A classifier is a decision function that, when presented with the query, yields a result that allows us to assign the query to either C1 or C2. For a given sample i in the training set, we define a gene vector vi the j-th component vi(j) of which is the logarithm (to the base 10, and after clipping-off at 1) of the average difference of the j-th gene in our gene features. We define a measure of the distance between one gene vector v and another q, as
![]() |
The distance D(C1, q) between class C1 and the query, is defined as the minimum of the distances between q and all the subjects in C1, that is
![]() |
For a given query q, we further define a score as:
![]() |
The more positive the score is, the more likely it is that the query is related to class C1; conversely, the more negative the score is, the more likely it is that the sample belongs to class C2. A classifier can thus be derived from the score defined above simply as
![]() |
This classifier is known as a nearest neighbor classifier, because it assigns q to the class to which the sample in the training set that is closest to the query belongs, as long as the score is beyond certain thresholds. The thresholds
5% and
95% correspond to the significance levels of the classification. To compute these thresholds, permutations of the components of each sample in the training set is performed. Each permutated sample is considered as a query and its score is computed. All the samples in the training set are used in this way under many different permutations, so as to create a probability density function of scores. In this way, all the scores that are between the 5% percentile and the 95% percentile (
5% and
95%, respectively) are considered as undecided. However, if a query yields a score above
95% (resp. below
5%), the probability that it comes from a random query is less than 5%, and it will be assigned to C1 (resp. C2).
Online Supplemental Material
Data pertaining to this work and the algorithm Genes@Work can be downloaded (http://www.research.ibm.com/FunGen/index.html) from our web site.
| Acknowledgements |
|---|
|
|
|---|