Abstract

Genome-wide analysis of gene expression or protein binding patterns using different array- or sequencing-based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several types of experiment are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increase. We propose the following workflow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate whether the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the workflow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.

Highlights

  • Genome-wide analysis of gene expression or protein binding patterns using different array- or sequencing-based technologies is routinely performed in many molecular biology laboratories

  • Standard normalization methods involve two separate steps: first, the sample data x is used to estimate a normalization function f; second, the normalized data x_norm is obtained as x_norm = f(x)

  • Several invariant normalization methods, aiming to identify a set of non-altered variables in order to obtain an unbiased estimate of the normalization function f, have been suggested; see e.g. [9]
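
The two-step scheme and the idea behind invariant normalization can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: f is taken to be a constant shift estimated by the median, the data are simulated log-ratios, and the set of non-altered variables is taken from the simulated ground truth, which a real invariant method would have to estimate. It is not the procedure used in the paper.

```python
import numpy as np

def estimate_f(x, invariant_mask=None):
    """Step 1: estimate the normalization function f from the data.

    Assumption for this sketch: f subtracts a constant shift, estimated
    as the median of all variables or, if a mask is given, of the
    variables flagged as non-altered only.
    """
    basis = x if invariant_mask is None else x[invariant_mask]
    shift = np.median(basis)
    return lambda values: values - shift

# Simulated skewed experiment: 40% of the variables are altered,
# and all of them are positively affected.
rng = np.random.default_rng(0)
n, n_altered = 1000, 400
true_lr = np.zeros(n)
true_lr[:n_altered] = 2.0
technical_shift = 0.5  # hypothetical technical bias common to all variables
x = true_lr + technical_shift + rng.normal(0.0, 0.3, size=n)

non_altered = true_lr == 0.0  # oracle mask, known only because data are simulated

# Step 2: apply f to obtain the normalized data, x_norm = f(x).
x_norm_standard = estimate_f(x)(x)
x_norm_invariant = estimate_f(x, non_altered)(x)

# Non-altered variables should average zero after a perfect normalization.
bias_standard = np.mean(x_norm_standard[non_altered])
bias_invariant = np.mean(x_norm_invariant[non_altered])
print(f"bias, f estimated from all variables:  {bias_standard:.3f}")
print(f"bias, f estimated from invariant set: {bias_invariant:.3f}")
```

In this simulation the shift estimated from all variables is pulled towards the altered variables, so the non-altered ones end up below zero, whereas the estimate restricted to the non-altered set leaves them centred close to zero.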


Summary

Introduction

Genome-wide analysis of gene expression or protein binding patterns using different array- or sequencing-based technologies is routinely performed in many molecular biology laboratories. Before reference and treatment can be compared, some type of normalization needs to be applied, because much of the observed variation often reflects differences in the amount of material loaded or other technical variation. The log-ratio is the logarithm of the ratio between the treatment and the control values. The true log-ratio of a variable is the expected value of its log-ratio in the absence of any technical variation (Figure 1A shows the distribution of the true log-ratios in a symmetric and a skewed experiment). We say that an experiment is skewed if the distribution of the true log-ratios is not symmetrical around zero. For non-skewed experiments we expect equal numbers of positively and negatively affected variables. A positively affected variable is one for which the true log-ratio is positive. Using the terminology employed to describe ChIP-chip data and expression data, one would describe such a variable as being "enriched" or "up-regulated".
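
The definitions above can be made concrete with a small Python sketch. It assumes base-2 logarithms (the text only says "the logarithm") and uses made-up intensity values; the function names `log_ratios` and `affected_counts` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def log_ratios(treatment, control):
    """Log-ratio per variable: log2(treatment / control). Base 2 is an assumption."""
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(treatment / control)

def affected_counts(true_log_ratios):
    """Return (number positively affected, number negatively affected)."""
    t = np.asarray(true_log_ratios, dtype=float)
    return int(np.sum(t > 0)), int(np.sum(t < 0))

# Made-up, noise-free intensities, so these log-ratios play the role of true log-ratios.
control = [100, 100, 100, 100]
symmetric_treatment = [200, 50, 100, 100]  # one variable up, one down
skewed_treatment = [200, 400, 100, 100]    # two variables up, none down

symmetric_lr = log_ratios(symmetric_treatment, control)
skewed_lr = log_ratios(skewed_treatment, control)

print(symmetric_lr, affected_counts(symmetric_lr))
print(skewed_lr, affected_counts(skewed_lr))
```

Comparing the two counts is only a crude indicator of skewness in the sense defined above; with real data the true log-ratios are unknown, so this check is not a substitute for a formal test.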

