Investigating the effect of paralogs on microarray gene-set analysis

Master Thesis



University of Cape Town

To interpret the results of a microarray experiment, researchers often shift focus from analysing individual differentially expressed genes to analysing sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge from databases such as the Gene Ontology (GO) or KEGG to group genes into sets based on their annotations. They aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. The expectation is that this approach reveals sets of genes whose subtle but coordinated behaviour implicates specific biological processes or pathways in the response under study. Several GSA methods have been proposed, and debates have ensued on the statistical foundations of the different approaches and the various hypothesis tests used. In particular, criticism has been directed at methods that rely on a strict cut-off to determine significant genes and at those that assume genes are expressed independently. We show that paralogs, which typically have high sequence identity and similar molecular functions, also exhibit high correlation in their expression patterns. This, together with the fact that the calculation of gene-set significance by all GSA methods is influenced by the number of genes in the gene set, means that sets with high numbers of paralogs are ranked in a biased manner that reflects the redundant and dependent nature of paralogs more than any biological phenomenon.
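The interaction between correlated expression and gene-set size can be illustrated with a toy calculation. The sketch below is not the method used in the thesis. It assumes a hypothetical set statistic Z, the sum of per-gene z-scores divided by the square root of the set size, and it assumes that every pair of genes in the set shares the same correlation `rho`. The function names are illustrative only. Under these assumptions the null variance of Z is 1 + (n - 1) * rho, so a test that treats the genes as independent rejects too often as the set grows or as the correlation rises.

```python
import math


def set_stat_variance(n_genes: int, rho: float) -> float:
    """Null variance of Z = sum(z_i) / sqrt(n) for unit-variance z_i.

    Assumes every pair of genes has the same correlation rho
    (an equicorrelation toy model, not a real GSA method).
    """
    return 1.0 + (n_genes - 1) * rho


def false_positive_rate(n_genes: int, rho: float, z_crit: float = 1.959964) -> float:
    """Two-sided probability that |Z| exceeds z_crit under the null.

    z_crit defaults to the nominal 5% cut-off that would be correct
    if the genes were independent (rho = 0).
    """
    sd = math.sqrt(set_stat_variance(n_genes, rho))
    # For Z ~ N(0, sd^2): P(|Z| > c) = erfc(c / (sd * sqrt(2)))
    return math.erfc(z_crit / (sd * math.sqrt(2.0)))


if __name__ == "__main__":
    for n_genes in (5, 20, 50):
        for rho in (0.0, 0.2, 0.5):
            rate = false_positive_rate(n_genes, rho)
            print(f"n={n_genes:3d}  rho={rho:.1f}  false positive rate={rate:.3f}")
```

With `rho = 0` the rate stays at the nominal level for every set size. With any positive `rho` the rate rises with the number of genes, which is the kind of size-dependent bias described above for sets rich in paralogs.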

Includes abstract.

Includes bibliographical references.