Random GO Database

Barry Zeeberg [aut, cre]

2025-05-05

Random GO Database

Barry Zeeberg

barryz2013@gmail.com

Motivation

The Gene Ontology (GO) Consortium https://geneontology.org/ organizes genes into hierarchical categories based on biological process (BP), molecular function (MF) and cellular component (CC, i.e., subcellular localization). Tools such as GoMiner (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003) doi:10.1186/gb-2003-4-4-r28) can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. The significance is determined by computing the false discovery rate (FDR) of the enrichment p-value for each category.

As I mentioned in the vignette for my previous package ‘HTGM2D’:

The idea behind using FDR as a criterion is based on the hypothesis that a randomly chosen subset of genes is in fact random. But the genes in the genome are all part of the same overall integrated system, and as a result of evolution this is a highly specialized set. Not only does each gene have its own chemistry, biochemistry, and function(s), but it also is part of a network. It must play nicely with around 20,000 other genes, and not arbitrarily interfere with what they are doing. So the hypothesis of random chosen genes being functionally random is probably only approximately true. Quite possibly, a hypothetical truly random set of genes (which does not really exist) would show less correlation than the set we (mistakenly) take as being random. It is possible that it might make sense to use a higher FDR threshold than I have selected, to compensate for the degree of non-randomness in the supposedly random set.

The GOGOA3 database generated from my package ‘minimalistGODB’ can be characterized by the function characterizeDB() provided in the current package. Each gene maps to a certain number of GO categories (Figures 1, 2). For instance, for human, TGBF1 maps to 186 GO categories, whereas ZXDB maps to but 1 GO category. These differences might be due to the underlying reality of e.g. human biology, or to a bias in the degree of curation apportioned to different genes.

Figure 1. Histogram of Number of GO Mappings per Gene Figure 1. Histogram of Number of GO Mappings per Gene

Figure 2. Summary Statistics for Number of GO Mappings per Gene

A curation bias might arise from either a preferential focus on genes that are biologically or medically more relevant, or some genes might be expressed at very low levels or the protein product is hard to isolate and study. Curation is an ongoing endeavor, encompassing laboratories all around the world. I do not use the term “curation bias” in a derogatory sense, it is just an unavoidable fact of a work in progress.

The conjecture that a curation bias might arise from a preferential focus on genes that are biologically or medically more relevant can be tested by forming a gene list of a reasonable number, say 50, of the genes with high number of mappings (which I sometimes refer to as ‘high hitters’), and submitting them to a GoMiner analysis.

Results

GoMiner was invoked by

GoMiner(title,dir,sampleList,GOGOA3,ontology,enrichThresh,
 countThresh,pvalThresh,fdrThresh,nrand,mn,mx,opt,verbose)

Most parameters were constant across all of the GoMiner runs reported here:

ontology<-"biological_process"
enrichThresh<-2
countThresh<-5
pvalThresh<-0.10
fdrThresh<-0.10
nrand<-5
mn<-2
mx<-200

The heat map resulting from GoMiner analysis of the top 50 high hitters was comprised of 164 categories versus 50 genes (not shown). At first glance, this robust result validates the conjecture that a curation bias might arise from a preferential focus on genes that are biologically or medically more relevant. However, there is reason to question the validity of this result: after all, the result with the validated cluster52 (see Figure 2 in the vignette in my ‘GoMiner’ package) resulted in 18 categories, although the sizes of the gene lists were almost identical. In spite the computational procedure intended to eliminate false positives, a lot of the 164 categories might, in fact, be false positives.

Is it possible that a set of genes, all of which are high hitters, will give a positive GoMiner result that is divorced from the reality of the underlying biology? And if this happens, might it not happen to some extent for normal gene sets, like cluster52?

Since the ensuing analysis is somewhat technical and lengthy, I will give the direct answers right away. Yes, there is a severe problem with the high hitter gene set. No, this does not affect normal gene sets.

The false discovery rate (FDR) is computed by comparing the results using the real gene list with the results using an equal size random gene list. But this may fail if the random genes statistically have very different number of category mappings than the real genes. That is, the real gene list of high hitters is not modeled properly by the randomized gene lists (that reflect a diversity of ‘hitters’).

One potential approach to alleviate this problem is to compute the FDR by randomizing the database rather than the gene list. Then the same real gene list can be used, eliminating the problem of gene sets with different number of category mappings during the randomized study. Randomizing the database will result in a database that reflects a nonsense biology divorced from the real world, so any gene mapping to this nonsense database will not represent a true positive.

The justification for the procedure I use to carry out the database randomization is technical and long, and rather than interrupt the narrative, I include that section at the end of the vignette.

I will study three gene sets: cluster52 (see vignette for my ‘GoMiner’ package), high hitters, and average hitters (14-23 mappings per gene). My procedure will involve three variations of how randomizations are implemented:

First, we will perform the usual randomization of the gene list.
real gene list and real database, followed by randomized gene list and real database

Second, randomization step will consist of randomizing the database rather than the gene list.
real gene list and real database, followed by real gene list and randomized database

Third, the first version will be performed, using the randomized database in place of the real database.
real gene list and randomized database, followed by randomized gene list and (the same) randomized database

The setup and results are summarized in Figure 3. The ranges for the log10(p) values refer to the top (i.e., lowest) p value through the tenth p value.

Figure 3. Summary Statistics for Three Randomization Mehtods

First, follow the fate of cluster52 through the three procedures. When using the standard procedure (column ‘1’), we find a reasonable number of significant categories (7) [the value of 7 differs from the 18 quoted above, because the updated version of GoMiner allows us to pre-filter out categories that are too small or too large to be informative]. Using the modified procedure (column ‘2’; randomize the database rather than randomize the gene list), the results are consistent with the column ‘1’. When using the randomized database from the get-go (column ‘3’), we get essentially a null result. This makes perfect sense, as the random database is devoid of any meaningful biological information.

It is a different story with the high hitters data set. Looking at the results in column ‘1’, we do not know if we are seeing a very strong biological result, or a massive false positive, or a mixture of the two. Looking at the results in column ‘2’, we see the same number of categories as in column ‘1’, but much lower p values for the randoms. This is because in column ‘2’, the gene list retains the structure of high hitter genes. The randomized database does not contain biological meaning, but it still retains a substantial amount of structure that can resonate with the structure of the high hitters gene list. This is a tip-off that the results are largely false positives. This hint becomes a certainty when looking at column ‘3’, which has a reduced but still huge number of categories. Since there is no biological information in the randomized database, we are seeing that the gene list has a pathological structure that can self-destruct with the residual structure of the (biologically meaningless) randomized database.

Without belaboring the point, we can see that the average hitters behave exactly as expected for a set of random genes, namely no categories for any of the three columns.

The high hitters represent a tiny percentage of the total number of human genes in the genome, and we would never get a gene set like this through a normal experiment. Gene sets will always be like cluster52 or the average hitters. Thus, there is no real need to use all three procedures in practice.

Construction of a Randomized GO Dabatase (GOGOA3R)

Gene Ontology categories form a hierarchy (technically called a “directed acyclic graph”). Categories are generally children and/or parents of other categories. The more generic categories are at the top of the hierarchy, and the more specific categories are at the bottom of the hierarchy. The most generic categories might contain over a thousand genes (Figure 4), and they are not very informative in an analysis.

Figure 4. Histogram of GO Category Sizes Figure 4. Histogram of GO Category Sizes

The most specific categories may contain only 1 or 2 genes (Figure 5),

Figure 5. Tiny and Moderate GO categories

Figure 5. Tiny and Moderate GO categories

and these are not reliable to analyze statistically, and they do not capture the sense of genes networking. A category containing 1 gene is logically equivalent to the gene itself, rather than reflecting a category into which several genes might aggregate. The most informative categories are those in the mid-range.

Because of the hierarchical nature, one might hypothesize that a gene mapping to a leaf GO node is propagated upward through all of its parents and ancestors. To test this hypothesis, I can pick an arbitrary gene and see which leaf nodes it maps to. Then count how frequently that gene is mapped to ancestors of that leaf node. It turns out that the hypothesis is not true, since only around 10% or less of the ancestors contain a mapping to the gene in the leaf node (“goodAncestors” in Figure 6). The results in Figure 6 are shown for genes designated as “mid hitters” but are representative of all genes.

Figure 6. Relation of Ancestor and Leaf Node Gene Mapping

The fact that this hypothesis is false leads to a considerable simplification in the approach to randomize the genes in the database, because there is no requirement to maintain the consistency of leaf/ancestor mappings. Rather, the genes in the database can simply be scrambled within their own column, which retains the same number of each gene and also retains the pattern of the histograms of mappings.

The function ‘randomGODB()’ in the present package will very quickly generate a randomized version of GOGOA3, which I will refer to as ‘GOGOA3R’. In order to test the performance of GOGOA3R, I also needed to make a small enhancement to ‘GoMiner()’ in the GoMiner package. This enhancement involved giving the user the choice to (a) use the original method for computing the False Discovery Rate (FDR) by running GoMiner on random sets of genes (of the same number as the real list), or (b) use the real gene list with GOGOA3R, or (c) use GOGOA3R in place of GOGOA3. The enhanced revision of the GoMiner package will be released shortly after the present package is published, since it requires ‘randomGODB()’.

As an example of a connection that is expected to be broken by randomization of the database: Imagine that there are 2 moderate size categories that (a) happen to share a handful of genes, and (b) are not curated as being closely related within the structure of GO. But it is found in a geneList associated with a particular experiment that a few genes happen to map to both categories. Is that an accident, or is that a manifestation of a categorical relationship that has so far been missed by the curators of GO? GO is constantly undergoing refinement, so it is not too surprising that this situation could occur prior to some point in time.

We can estimate how often this might happen by chance rather than by underlying biology: If this occurs using the randomized database, then we know it is not the result of a missed relationship, since there are no relationships other than direct ancestors. All that is left is to assign it to chance, with a frequency that we can measure.

mirror server hosted at Truenetwork, Russian Federation.