Scanning -> dat file contains one intensity per pixel

49 pixels per cell are summarized by 75th percentile after removal of outer perimeter pixels. This is the cell intensity, each cell corresponding to a probe.
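The cell summary described above can be sketched in a few lines. This is only an illustration: it assumes the 49 pixels form a 7 x 7 grid whose border ring is dropped before the percentile is taken (the slide does not say whether 49 counts pixels before or after trimming), and the function name and example values are hypothetical.

```python
import statistics

def cell_intensity(pixels):
    """Summarize one probe cell from a square grid of pixel intensities.

    Assumption: the outer ring of the grid is discarded, then the
    75th percentile of the remaining pixels is reported.
    """
    inner = [row[1:-1] for row in pixels[1:-1]]      # drop first/last row and column
    values = [v for row in inner for v in row]
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile
    return statistics.quantiles(values, n=4, method="inclusive")[2]

# Hypothetical 7 x 7 cell: a bright border artefact around a well-behaved interior
interior_row = [500] + [100, 110, 120, 130, 140] + [500]
cell = [[500] * 7] + [list(interior_row) for _ in range(5)] + [[500] * 7]
print(cell_intensity(cell))
```

Dropping the border first keeps bleed-over from neighbouring cells out of the summary, which is why the bright ring in the example has no influence on the result.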

On the HG-U133 chip, each target is represented by a set of 11 pairs of PM:MM probes.

The MM probe is obtained by complementing the middle base in the PM oligo and is meant to be an internal control assumed to hybridize to nonspecific sequences about as effectively as its PM counterpart.

Each PM probe is a 25 base long oligo selected with the objective of achieving linearity between log intensity and log concentration.
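As a small illustration of the PM:MM construction, the sketch below derives an MM sequence from a 25-base PM sequence by complementing the middle (13th) base. The function name and the example sequence are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM oligo for a 25-base PM oligo: same sequence, middle base complemented."""
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe sequence")
    mid = len(pm) // 2                      # index 12, i.e. the 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "ACGTACGTACGTACGTACGTACGTA"            # hypothetical PM sequence
mm = mismatch_probe(pm)
print(pm)
print(mm)
```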

How to combine the PM:MM intensities into a measure of expression for the target?

Gel electrophoresis patterns obtained at different stages of sample preparation are used to make qualitative assessments of the RNA samples.

Sample quality assessment by gel electrophoresis

For total RNA, look for 18S and 28S bands (not shown here).

For cDNA, a good sample will produce a smear extending from top to bottom of the gel.

Unfragmented cRNA will also produce a smear running down the gel.

Fragmented cRNA should appear as a blob at the bottom of the gel, indicating that the cRNA has been successfully fragmented to pieces about 50 bp in length.

Next slide from Vanderbilt MicroArray Shared Resource web site

Affymetrix standards for post hybridization and scanning quality assessment – examination of quality report.

Array quality metrics:

Raw Q (Noise): The degree of pixel-to-pixel variation among the probe cells used to calculate the background = average over background cells (lower 2 percentile) of cell pixel intensity standard error. Between 1.5 and 3.0 is ok. Use scaled noise to get consistency between arrays.

Scaling factor ~ 100 / (2% trimmed mean of intensities, not logged). Should be kept below 10. The key is consistency across arrays.

Background ~ average of cell intensities in the lowest 2 percentile, by region, with smoothing. No recommended range. The key is consistency across arrays.

Percent present calls (i.e. are PM > MM?). Typical range is 20-50%.

Note – All these quantities, including noise, can be extracted from the cel file.
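The scaling factor arithmetic can be sketched as follows. This is a minimal illustration, assuming that "2% trimmed" means discarding the lowest and highest 2% of values before averaging and that the target value is 100; the function names and example data are hypothetical.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after discarding the lowest and highest `trim` fraction of the values."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)               # number of values dropped at each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def scaling_factor(values, target=100.0, trim=0.02):
    """Multiplicative factor that brings the trimmed mean of a chip to `target`."""
    return target / trimmed_mean(values, trim)

# Hypothetical chip: 98 ordinary values plus one very low and one very high value
values = [50.0] * 98 + [0.0, 10000.0]
sf = scaling_factor(values)
print(sf)
```

Trimming keeps a handful of saturated or empty cells from dominating the factor, so two chips with similar bulk intensity get similar scaling factors.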

Hybridization controls: bioB, bioC, bioD and cre. “At 1.5 pM bioB should be called Present 70% of the time. … the others should be called present 100% of the time with increasing Signal value (bioC, bioD, and cre, resp.)” “Check that bio C, representing the minimum specification of detection, is present.”

Poly A controls: dap, lys, phe, thr, tryp. Used to monitor wet lab work. Sense strand cRNAs synthesized from the control genes can be added to samples prior to the reverse transcription step to monitor target synthesis and labeling efficiencies. Antisense cRNA transcripts can be added to the target cRNA sample to monitor the amplification and labeling steps.

Housekeeping/Control Genes: GAPDH, beta-Actin, ISGF-3 (STAT1): 3’ and 5’ signal intensity ratios of control probe sets (GAPDH, Beta Actin): “A 1:1 molar ratio of the 3’ and 5’ transcript regions will not necessarily give a signal ratio of 1”

All controls appear on the chip in both sense strand (_st) and antisense strand (_at) versions, and all have probe sets chosen from the 5’, M and 3’ ends of the target transcript.

Affymetrix standards - Examination of other spike ins or control probe sets:

Normalization Control Set: 100 probe sets replicated on both A and B arrays (new to HG-U133) – these are a set of genes found to be called present with low MAS4 signal variability in a large set of tissues.

Linearity and sensitivity of amplification as quantified using spike-in bacterial cRNA.

Chip cel file – checkerboard – close up w/ grid

Chip cel file – PM - MM

Limitations of standard QC metrics and procedure

Link between these metrics and the numbers we care about is missing.

Quality of data gauged from spike-ins requiring special processing may not represent the quality of the rest of the data on the chip – risk of QCing the chip QC process itself, but not the gene expression data.

Good end-point data quality assessment is needed to assess the validity of these indirect data quality assessments.

Review of models for gene expression value estimation

MAS 5 (Microarray Suite 5 by Affymetrix)

Expression measures are derived as follows in Affymetrix’ Microarray Suite 5.0:

A background correction is applied to the probe intensities.

For each probe set, the log expression is estimated by means of a one-step Tukey biweight average of $\log(PM_j - MM_j^*)$, where $MM_j^*$ is an MM value modified to ensure that it does not exceed the PM value. This is equivalent to robustly estimating the parameter $\mu$ in the model $\log(PM_j - MM_j^*) = \mu + e_j$.

To compare expression measures across chips, expression values are normalized by a multiplicative scaling factor. This is equivalent to shifting the expression values on the log scale.

See Affy technical description [1].
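The one-step Tukey biweight used in the probe set summary can be sketched as below. This is an illustration rather than the Affymetrix implementation: the tuning constant c = 5 and the small epsilon are assumptions, the adjustment that produces MM* is not shown, and the input values are hypothetical log2(PM - MM*) values for 11 probe pairs.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean whose weights fall to zero
    for points far from the median (distance measured in MAD units)."""
    m = statistics.median(values)
    s = statistics.median([abs(v - m) for v in values])   # unscaled MAD
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)                       # eps guards against s == 0
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0   # bisquare weight
        num += w * v
        den += w
    return num / den

# Hypothetical log2(PM - MM*) values; the last probe pair is an outlier
log_diffs = [8.0, 8.1, 7.9, 8.2, 7.8, 8.0, 8.1, 7.9, 8.0, 8.05, 12.0]
estimate = tukey_biweight(log_diffs)
print(estimate, sum(log_diffs) / len(log_diffs))
```

The outlying probe pair receives zero weight, so the estimate stays close to the bulk of the probes while the ordinary mean is pulled upward.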

RMA

The Robust Multichip Average is an expression measure obtained from analysing a set of chips in the following way:

A background correction is applied to probe intensities [3].

A probe intensity normalization vector is computed from the set of chips and the intensities of each chip normalized to this vector [4].

For one probe set, the logs of the background-corrected and normalized probe intensities, $Y_{jk}$ say, are modelled as the sum of a probe effect, a chip effect and an error term:

$Y_{jk} = \beta_j + \alpha_k + \epsilon_{jk}$

where $k$ indexes chips and $j$ indexes probes.

To produce the RMA expression values, the model is fitted robustly and the estimated chip effects $\hat{\alpha}_k$ are used as the estimates of log expression for each chip.
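The normalization step above (each chip normalized to a common vector) is quantile normalization in RMA. A minimal sketch follows, assuming the normalization vector is the mean of the sorted intensities across chips and ignoring tie handling; the function name and data are hypothetical.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one per chip.
    Returns new lists in which every chip has the same set of values."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Normalization vector: mean of the i-th smallest intensity across chips
    reference = [sum(sc[i] for sc in sorted_chips) / len(chips) for i in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])   # probe indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]                    # replace value by reference quantile
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]                # hypothetical: 2 chips, 3 probes
print(quantile_normalize(chips))
```

After this step every chip has an identical intensity distribution; only the ranking of probes within each chip is retained.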

RMA vs MAS5

Background correction is different – Affymetrix removes a fixed amount with some local adjustment; RMA uses a model which results in an intensity dependent bg correction.

Normalization is at probe level and intensity dependent.

Multichip analysis enables the estimation of probe effects.

RMA expression values have been shown to be highly reproducible and to detect changes in target mRNA concentration with great sensitivity [5, 6].

Our main interest here is in the use of model fit results for quality assessment. The size of the residuals from a fit indicates the quality of the fit and the variability of the parameter estimates. These can be summarized and visualized in various ways to provide chip expression data quality indicators.

Affymetrix public dataset - Spike-in design below is repeated 3 times with chips from different lots. (One large sample prepared from pancreas poly(A)+ mRNA.)

This model is commonly fitted by LS with parameter estimates:

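$b_j = \bar{Y}_{j\cdot}$, the mean of the observations for probe $j$

$a_k = \bar{Y}_{\cdot k}$, the mean of the observations for chip $k$

$s^2 = \sum_{j,k} r_{jk}^2 / (n - J - K + 1)$, where the $r_{jk}$ are the residuals, $J$ is the number of probes, $K$ the number of chips and $n = JK$

Under this model, parameters have estimated standard errors:

$SE(a_k) = s/\sqrt{J}$, $SE(b_j) = s/\sqrt{K}$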

i.e. every chip expression value has the same estimated variability.
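The least squares quantities above can be computed directly. The sketch below is only an illustration with hypothetical data; it assumes residuals are formed as Y_jk minus probe mean minus chip mean plus grand mean (the grand mean is added back because both sets of means contain it).

```python
import math

def ls_fit(y):
    """y[j][k]: log intensity of probe j on chip k.
    Returns chip means a, probe means b, residual scale s, SE(a_k), SE(b_j)."""
    J, K = len(y), len(y[0])
    b = [sum(row) / K for row in y]                                # Y_j. (probe means)
    a = [sum(y[j][k] for j in range(J)) / J for k in range(K)]     # Y_.k (chip means)
    grand = sum(b) / J
    rss = sum((y[j][k] - b[j] - a[k] + grand) ** 2
              for j in range(J) for k in range(K))
    s = math.sqrt(rss / (J * K - J - K + 1))                       # n = J*K observations
    return a, b, s, s / math.sqrt(J), s / math.sqrt(K)

y = [[1.0, 2.0],
     [3.0, 5.0]]                                                   # hypothetical: 2 probes x 2 chips
a, b, s, se_a, se_b = ls_fit(y)
print(a, b, s, se_a, se_b)
```

Note that se_a is a single number shared by all chips, which is the point made above: under least squares, no chip can be flagged as noisier than another.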

Robust fit

The least squares fit provides optimal (unbiased, asymptotically minimum variance) estimates when the model is true, but the LS estimates quickly lose their good properties under even slight departures from the assumed model. Robust fitting procedures have been devised to produce estimates which are good under the assumed model and remain so under slight departures from its assumptions.

A commonly used robust fitting procedure is iteratively reweighted least squares, in which an initial guess at the fit is followed by a sequence of weighted LS fits, with the weights derived from the previous fit as follows:

Estimate the scale: S = mad(res)

Weights are given by: wjk = psi.huber(abs(rjk/S))

Weighted LS fit estimates and estimated standard errors are given by:

Image artefacts (scratches, bubbles, uneven hybridization, glare in scan) being a common occurrence, the gross error model is more realistic than the iid Normal model.

Because of cross-hybridisation, and other reasons, probes within probe sets do not all respond the same way – the robust fitting procedure will go with the majority of the probes.

The proof of the benefits of robustly fitting the model will be in the pudding (but that is not to be tasted today)

For QC purposes, it is essential to use a robust fitting procedure in order to let the outliers speak out.
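One reweighting step of the iteratively reweighted least squares procedure described above (scale estimate followed by Huber weights) can be sketched as follows. The constants are assumptions: the MAD is scaled by 1.4826 and the Huber tuning constant is 1.345, which are common defaults; the residuals are hypothetical.

```python
import statistics

def mad(values):
    """Median absolute deviation, scaled by 1.4826 (consistent with the Normal SD)."""
    m = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - m) for v in values])

def huber_weight(u, k=1.345):
    """Huber weight: 1 for |u| <= k, k/|u| beyond that."""
    return 1.0 if abs(u) <= k else k / abs(u)

def irls_weights(residuals):
    """Weights for the next weighted LS fit, derived from the current residuals."""
    scale = mad(residuals)
    if scale == 0:                      # degenerate case: no spread, keep all weights at 1
        return [1.0] * len(residuals)
    return [huber_weight(r / scale) for r in residuals]

residuals = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0]     # hypothetical; the last one is a gross error
weights = irls_weights(residuals)
print(weights)
```

The small weight attached to the gross error is itself a diagnostic: for QC, the weights and residuals from the robust fit are as informative as the estimates.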

Assessing chip expression data quality

Chip expression data quality assessment

Having fitted models at the probe set level across a set of chips, we want to derive some chip-specific quantities to be used as indicators of overall chip expression data quality.

1. Look at the set of residuals for a chip over all probe sets, one residual per probe. Compare these batches of residuals across chips. Chips with a large number of bad probes will have larger residuals – look at the IQR.

2. First summarize the residuals into a probe set SE for the expression value on each chip, then compare batches of SEs between chips.

3. The SEs in 2 are a heterogeneous mix – can use batches of unscaled SEs to compare chips.

4. Can normalize further by rescaling by the median chip unscaled SE.

All of the above produce a batch of numbers for each chip. We need one, or a few, numbers per chip. Start with the median of the set in 3.
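Two of the chip-level summaries above can be sketched as follows: the IQR of a chip's residuals, and the per-chip median of rescaled SEs. The rescaling rule is an assumption (each probe set's SEs are divided by that probe set's median SE across chips); the function names and numbers are hypothetical.

```python
import statistics

def iqr(values):
    """Interquartile range of a batch of numbers (e.g. one chip's residuals)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def chip_median_rescaled_se(se):
    """se[p][k]: unscaled SE of the expression estimate for probe set p on chip k.
    Returns one number per chip: the median over probe sets of the rescaled SEs."""
    # Assumption: rescale within each probe set by its median SE across chips
    rescaled = [[v / statistics.median(row) for v in row] for row in se]
    n_chips = len(se[0])
    return [statistics.median([row[k] for row in rescaled]) for k in range(n_chips)]

# Hypothetical SEs for 3 probe sets on 3 chips; the third chip is consistently noisier
se = [[0.10, 0.10, 0.20],
      [0.30, 0.30, 0.60],
      [0.05, 0.05, 0.10]]
print(chip_median_rescaled_se(se))
print(iqr([1, 2, 3, 4, 5]))
```

Rescaling within probe sets removes the probe-set-to-probe-set differences in SE, so a chip whose median is well above 1 stands out as lower quality relative to the others in the batch.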

Data Picture 1.

Data Picture 2.

Analyzing chips one at a time

Some will want to analyze chips one at a time – either because they have too few, or in some cases, too many, to analyze in batches.

We can then get a probe set summary by summarizing the probe effect corrected intensities Zj robustly – a=T(Z), where T is the median, a trimmed mean or another robust summary (note that we only have 11 points here).

Subtracting the probe summary from the probe effect corrected intensities produces a set of residuals - rj = Zj-a.

The residuals can be turned into weights using the estimate of scale from the fitted model – wj=psi.huber(rj/S).
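The single-chip steps above can be sketched as follows. This assumes that probe effects b_j and a scale S are available from a previously fitted multichip model, that Z_j is the log intensity minus the probe effect, and that T is the median; all names and numbers are hypothetical.

```python
import statistics

def single_chip_summary(y, probe_effects, scale, k=1.345):
    """y[j]: log intensity of probe j on the new chip.
    probe_effects[j], scale: taken from an earlier multichip fit (assumed given).
    Returns the probe set summary a, the residuals r_j and the Huber weights w_j."""
    z = [yj - bj for yj, bj in zip(y, probe_effects)]    # probe effect corrected intensities
    a = statistics.median(z)                             # robust summary T(Z)
    residuals = [zj - a for zj in z]
    weights = [1.0 if abs(r / scale) <= k else k / abs(r / scale) for r in residuals]
    return a, residuals, weights

# Hypothetical probe set of 11 probes; the last probe is hit by an artefact
y = [9.0, 7.0, 8.5, 7.5, 8.0, 8.2, 7.8, 8.1, 7.9, 8.0, 12.0]
b = [1.0, -1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
a, residuals, weights = single_chip_summary(y, b, scale=0.2)
print(a, weights)
```

With only 11 points per probe set the summary is coarse, but the weights still identify which probes disagree with the majority on this chip.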

Log RMA expression for Poly A Controls – Affy pancreas samples

Log RMA expression for housekeeping controls – Affy pancreas samples

Look at 3’ to 5’ trend in probe intensities over entire chip

logPM.gc.norm – Affy data

References

[1] New Statistical Algorithms for Monitoring Gene Expression on GeneChip® Probe Arrays, Affymetrix technical report.

[2] Array Design for the GeneChip® Human Genome U133 Set, Affymetrix technical note.

[3] Discussion on Background, Ben Bolstad.

[4] Bolstad BM, et al. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185-193.