Data in Made4 package

The datasets (Khan, NCI60) given in Made4 are reduced, to keep the size of the Bioconductor package small. However, the full datasets are available here.

Data format

Data are provided as .rda files. These are R data files, and can be loaded into an R session using the command

 load("name.rda")

In addition, the data are given as comma separated files. These can be read by Microsoft Excel. To read these into an R session, please use the command

read.csv("name.csv")
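
The two commands above behave differently: load() restores objects under the names they were saved with, whereas read.csv() returns a data frame that has to be assigned. The following is a minimal sketch of that difference using a toy data frame written to a temporary directory. The file names toy.rda and toy.csv are hypothetical stand-ins, and the use of row.names = 1 assumes that the first column of the .csv file holds the row (gene) identifiers, which should be checked against the downloaded files.

```r
# Toy stand-in for a downloaded file; "toy.rda" and "toy.csv" are hypothetical names.
tmp <- tempdir()
toy <- data.frame(sample1 = c(1.2, -0.4), sample2 = c(0.3, 2.1),
                  row.names = c("geneA", "geneB"))
save(toy, file = file.path(tmp, "toy.rda"))
write.csv(toy, file = file.path(tmp, "toy.csv"))

# load() restores objects under their saved names and returns those names.
# Loading into a fresh environment avoids overwriting objects in the workspace.
env <- new.env()
loaded <- load(file.path(tmp, "toy.rda"), envir = env)
print(loaded)

# read.csv() returns a data frame, which must be assigned to a variable.
# row.names = 1 (an assumption about the file layout) uses column 1 as row names.
from_csv <- read.csv(file.path(tmp, "toy.csv"), row.names = 1)
str(from_csv)
```

Loading into a separate environment is optional; it is used here only so that the names of the restored objects can be inspected before they are used.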

Khan Data

Within the MADE4 package, the reduced Khan dataset contains only 306 genes. The full Khan dataset, however, contains 2308 rows x 64 columns. Khan et al., 2001 used cDNA microarrays containing 6567 clones, of which 3789 were known genes and 2778 were ESTs, to study the expression of genes in four types of small round blue cell tumours of childhood (SRBCT). These were neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt lymphoma (BL), which is a subset of non-Hodgkin lymphoma, and the Ewing family of tumours (EWS). Gene expression profiles from both tumour biopsy and cell line samples were obtained and are contained in this dataset. The dataset downloaded from the website was the filtered dataset of 2308 gene expression profiles described by Khan et al., 2001.

R Data (.rda file) of Khan Data:
khan.rda contains the training and test datasets, the gene annotation, and vectors which list the tumour class (EWS, RMS, NB, BL) for each of the samples. khan.rda
Comma Separated (.csv) file of Khan training data: khan_train.csv
Comma Separated (.csv) file of Khan test Data: khan_test.csv

NCI60 Data

Within Made4, the example datasets NCI60$Ross and NCI60$Affy were reduced to the 144 genes which were common between the two datasets.

Here we provide the larger NCI60 datasets, which were preprocessed as described by Culhane et al., 2003.

The Ross et al. data contain gene expression profiles of each cell line in the NCI-60 panel, which were determined using spotted cDNA arrays containing 9,703 human cDNAs (Ross et al., 2000). The data were downloaded from the NCI Genomics and Bioinformatics Group Datasets resource website. The updated version of this dataset (updated 12/19/01) was retrieved. Data were provided as log ratio values. Data were filtered to remove rows (genes) with greater than 15% missing values, reducing the dataset to 5643 spot values per cell line. Remaining missing values were imputed using a K nearest neighbour method, with 16 neighbours and a Euclidean distance metric (Troyanskaya et al., 2001).
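
The filtering and imputation steps described above can be sketched in base R as follows. This is an illustration on random toy data, not the code used to prepare NCI60$Ross. The helper knn_impute is hypothetical and simplified: it draws neighbours only from rows with no missing values, which is an assumption made here for brevity and may differ from the method of Troyanskaya et al., 2001.

```r
set.seed(1)
# Toy log-ratio matrix: 20 genes (rows) x 10 cell lines (columns).
x <- matrix(rnorm(200), nrow = 20, ncol = 10,
            dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10)))
x[1, 1:3] <- NA   # 30% missing: this row is dropped by the filter
x[2, 4]   <- NA   # 10% missing: this row is kept and imputed

# Step 1: remove rows (genes) with more than 15% missing values.
frac_missing <- rowMeans(is.na(x))
filtered <- x[frac_missing <= 0.15, , drop = FALSE]

# Step 2: simplified K nearest neighbour imputation (hypothetical helper).
# Neighbours are taken from complete rows only, using Euclidean distance
# computed over the columns observed in the row being imputed.
knn_impute <- function(m, k = 16) {
  out <- m
  complete <- m[stats::complete.cases(m), , drop = FALSE]
  for (i in which(!stats::complete.cases(m))) {
    miss <- is.na(m[i, ])
    d <- sqrt(colSums((t(complete[, !miss, drop = FALSE]) - m[i, !miss])^2))
    nn <- order(d)[seq_len(min(k, length(d)))]
    # Replace each missing value with the mean of the neighbours' values.
    out[i, miss] <- colMeans(complete[nn, miss, drop = FALSE])
  }
  out
}

imputed <- knn_impute(filtered, k = 16)
dim(imputed)
```

For real data, an established implementation of K nearest neighbour imputation would normally be preferred over a hand-written helper such as this one.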

The Staunton et al. Affy data were derived using high density Hu6800 Affymetrix microarrays containing 7129 probe sets (Staunton et al., 2001). The dataset was downloaded from the Whitehead Institute Cancer Genomics website, where it is given as supplemental data to the paper by Staunton et al. and provided as average difference (perfect match-mismatch) values. As described by Staunton et al., an expression value of 100 units was assigned to all average difference values less than 100. Genes whose expression was invariant across all 60 cell lines were not considered, reducing the dataset to 4515 probe sets. The NCI60$Affy dataset provided here contains 1517 probe sets, those in which the minimum change in gene expression across all 60 cell lines was greater than 500 average difference units. Data were logged (base 2) and median centred.
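
The preprocessing steps described above can be sketched in base R on toy data. This is an illustration only, not the code used to prepare NCI60$Affy. Two points are assumptions made here: the "change in gene expression" is taken to be the maximum minus the minimum value of a probe set across cell lines, and median centring is applied per probe set (row); the text does not state either explicitly.

```r
set.seed(2)
# Toy average difference matrix: 50 probe sets (rows) x 6 cell lines (columns).
avg_diff <- matrix(runif(300, min = -200, max = 3000), nrow = 50, ncol = 6,
                   dimnames = list(paste0("probe", 1:50), paste0("cell", 1:6)))
avg_diff[1, ] <- 50   # an invariant, low-expressed probe set

# Step 1: assign 100 units to all average difference values less than 100.
floored <- avg_diff
floored[floored < 100] <- 100

# Step 2: keep probe sets whose change across cell lines exceeds 500 units
# (assumed here to mean max - min > 500).
change <- apply(floored, 1, function(v) max(v) - min(v))
kept <- floored[change > 500, , drop = FALSE]

# Step 3: log (base 2) and median centre (assumed here to be per probe set).
logged <- log2(kept)
centred <- sweep(logged, 1, apply(logged, 1, median), "-")
dim(centred)
```

If median centring was instead applied per array, the same sweep() call with margin 2 and column medians would be the corresponding sketch.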

R data (.rda file) for NCI60: NCI60.rda
Comma Separated (.csv) file of Ross Data: NCI60_Ross.csv
Comma Separated (.csv) file of Staunton Affymetrix Data: NCI60_Affy.csv