Error, reproducibility and sensitivity: a pipeline for data processing of Agilent oligonucleotide expression arrays

Benjamin Chain,John Hammond,Helen Bowen,Wilfried Posch,Jane Rasaiyaah,Jhen Tsang,Mahdad Noursadeghi

doi:10.1186/1471-2105-11-344

Open Access

Abstract

Background: Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and to analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples.

Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (6% of the mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure that reflects some of the key biological variables which define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and by a variety of operators.

Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non-biological) intra- and inter-array variability, and a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further, more extensive studies of the systems biology of eukaryotic cells.

Highlights

Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples.
The Agilent platform is designed to be used as a two colour system, probing and detecting hybridisation of two different cDNA samples labelled with different fluorescent dyes on the same array.
We demonstrate that the low level expression detected by the Agilent arrays for the majority of genes in any one cell type likely corresponds to a genuine high degree of transcriptomic complexity, and is unlikely to arise from weak non-specific cross-hybridisation.

Summary

Introduction

Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Each platform has advantages and disadvantages. We have collected a large number of array data sets from Agilent human genome arrays [2]. The latest releases of these arrays have approximately 44,300 features, which include various control oligonucleotides and a set of 41,001 different oligonucleotide 60-mers complementary to unique human mRNA sequences. Of these 41,001, the latest Agilent annotation lists 29,806 as corresponding to known genes and/or ORFs, of which 19,392 are unique. The Agilent platform is designed to be used as a two colour system, probing and detecting hybridisation of two different cDNA samples labelled with different fluorescent dyes on the same array (typically Cy3 in the green channel and Cy5 in the red channel).
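Because each feature on a two colour array yields a pair of intensities, the data can be handled either as two separate log2 signals or as a single log2(Cy5/Cy3) ratio, and replicate arrays give a per-feature estimate of inter-array spread. The following is a minimal Python sketch of those two calculations only. It is not the R function described in this study: the function names `log2_signals` and `interarray_sd` and all intensity values are hypothetical, chosen purely for illustration, and no background correction or normalisation is applied.

```python
import math
import statistics


def log2_signals(cy3, cy5):
    """Return per-feature log2 signals for each channel and the log2(Cy5/Cy3) ratio."""
    green = [math.log2(g) for g in cy3]
    red = [math.log2(r) for r in cy5]
    # A ratio of raw intensities becomes a difference on the log2 scale.
    ratio = [r - g for r, g in zip(red, green)]
    return green, red, ratio


def interarray_sd(replicates):
    """Per-feature sample SD across replicate arrays.

    `replicates` is a list of arrays, each a list of log2 signals
    given in the same feature order.
    """
    return [statistics.stdev(values) for values in zip(*replicates)]


# Hypothetical raw intensities for three features on one array (not real data).
cy3 = [256.0, 1024.0, 64.0]
cy5 = [512.0, 1024.0, 16.0]
green, red, ratio = log2_signals(cy3, cy5)

# Hypothetical log2 signals for the same three features on three replicate arrays.
replicate_log2 = [
    [8.0, 10.0, 6.0],
    [8.5, 10.0, 6.5],
    [9.0, 10.0, 7.0],
]
sds = interarray_sd(replicate_log2)

print(ratio)  # [1.0, 0.0, -2.0]
print(sds)    # [0.5, 0.0, 0.5]
```

Working on the log2 scale means a unit change corresponds to a two-fold change in intensity, so an SD quoted in log2 units can be read directly as a fold-change spread.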

Methods

Results

Discussion

Conclusion