Navigating gene networks

by text mining and statistical analysis

Combined results

For the third part, the lists of gene links produced by text mining were read into Matlab for further statistical analysis. Several lists of gene links were made, each with a different 'threshold'. As explained on the text mining page, the threshold is the value that the strength of a link has to exceed for the link to be included in the network. The goal was again to check the differences between disease and non-disease genes.
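
As an illustration of the thresholding step, here is a minimal Python sketch (the analysis itself was done in Matlab). The gene names, link strengths and the function name build_network are hypothetical, and the links are assumed to be undirected.

```python
from collections import defaultdict

def build_network(links, threshold):
    """Keep only links whose strength exceeds the threshold.

    Returns a dict mapping each gene to the set of its neighbours.
    Links are assumed to be undirected (an assumption of this sketch).
    """
    neighbours = defaultdict(set)
    for gene_a, gene_b, strength in links:
        if strength > threshold:  # "exceed" is read as strictly greater
            neighbours[gene_a].add(gene_b)
            neighbours[gene_b].add(gene_a)
    return dict(neighbours)

# Hypothetical links: (gene, gene, link strength)
example_links = [
    ("G1", "G2", 7),
    ("G1", "G3", 3),
    ("G2", "G4", 12),
    ("G3", "G4", 5),
]
network = build_network(example_links, threshold=5)
```

With a threshold of 5, only the G1-G2 and G2-G4 links survive in this toy example, so G3 drops out of the network entirely. This shows how a higher threshold yields a smaller network.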

Scatter plots: neighbours

First, scatter plots were made showing the number of neighbours in the gene network. It was expected that disease genes would be linked to more genes in general and to more disease genes in particular. This was the case, and the figures also suggest that regular genes have a more or less fixed number of disease neighbours, a pattern that is largely lost for disease genes.

Scatter plot, percentage of disease neighbours versus total number of neighbours, using GAD and a threshold of 2

This image was made using the GAD list of disease genes. You can see two 'peaks' of genes, one for the disease genes (+) and one for the non-disease genes (o). The disease genes are a bit more to the left of the plot, i.e. have a slightly lower percentage of disease neighbours. This may indicate that disease genes have a higher number of neighbours overall, but the same number of disease neighbours. To check these hypotheses, boxplots and an additional scatter plot were created.
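
The quantities plotted here can be sketched as follows. This is a Python illustration rather than the original Matlab code. The toy network, the disease gene list and the function name neighbour_stats are hypothetical.

```python
def neighbour_stats(network, disease_genes):
    """For each gene, return (total neighbours, disease neighbours,
    percentage of disease neighbours)."""
    stats = {}
    for gene, neighbours in network.items():
        total = len(neighbours)
        disease = len(neighbours & disease_genes)
        percentage = 100.0 * disease / total if total else 0.0
        stats[gene] = (total, disease, percentage)
    return stats

# Hypothetical toy network (gene -> set of neighbours) and disease gene list
network = {
    "G1": {"G2", "G3", "G4"},
    "G2": {"G1"},
    "G3": {"G1", "G4"},
    "G4": {"G1", "G3"},
}
disease_genes = {"G1", "G4"}

stats = neighbour_stats(network, disease_genes)
for gene, (total, disease, percentage) in sorted(stats.items()):
    marker = "+" if gene in disease_genes else "o"  # plot symbols used above
    print(gene, marker, total, disease, round(percentage, 1))
```

Each gene then becomes one point in the scatter plot, with the total number of neighbours on one axis and the percentage of disease neighbours on the other.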

Boxplots visualizing the difference in number of general and disease neighbours, both using GAD and a threshold of 2

From these boxplots, the following conclusion seems probable: disease genes have more neighbours in general and also more disease neighbours. However, 'more disease neighbours' means an increase of only one or two gene links, while the increase in general neighbours is larger. The percentage of disease neighbours therefore drops, and the disease genes move to the left of the scatter plot.
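
A small worked example makes this shift concrete. The counts below are hypothetical and chosen only to illustrate the effect. They are not taken from the boxplots.

```python
def disease_percentage(disease_neighbours, total_neighbours):
    """Percentage of a gene's neighbours that are disease genes."""
    return 100.0 * disease_neighbours / total_neighbours

# Hypothetical counts, for illustration only
regular_gene = disease_percentage(2, 10)   # 2 of 10 neighbours  -> 20.0
disease_gene = disease_percentage(3, 20)   # 3 of 20 neighbours  -> 15.0

print(regular_gene, disease_gene)
```

The hypothetical disease gene has one more disease neighbour but twice as many neighbours overall, so its percentage of disease neighbours is lower.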

Scatter plot, number of disease neighbours versus total number of neighbours, using GAD and a threshold of 2

In these scatter plots, the number of disease neighbours is shown against the number of neighbours in general. This suggests a roughly linear relation between the two for regular, non-disease genes. For disease genes, larger deviations are visible, perhaps reflecting the partial loss of this linear relation.
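
One possible way to quantify such a linear relation, which was not part of the original analysis, is a least-squares fit together with the Pearson correlation coefficient. The sketch below uses hypothetical counts, and the function name linear_fit is likewise an assumption.

```python
import math

def linear_fit(x, y):
    """Least-squares slope and intercept plus Pearson r for paired values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical counts: total neighbours and disease neighbours per gene
regular_total = [2, 4, 6, 8, 10]
regular_disease = [1, 2, 3, 4, 5]      # idealised, perfectly linear
disease_total = [5, 10, 15, 20, 25]
disease_disease = [3, 1, 6, 2, 5]      # scattered around the trend

slope_regular, intercept_regular, r_regular = linear_fit(regular_total, regular_disease)
slope_disease, intercept_disease, r_disease = linear_fit(disease_total, disease_disease)
print(r_regular, r_disease)
```

A correlation coefficient close to 1 would correspond to the tight linear pattern seen for regular genes, and a lower value to the wider scatter seen for disease genes.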

General feature comparison

After that, we compared the first neighbours of disease genes with the disease genes themselves and with other 'regular' non-disease genes. To do this, the homology comparisons were repeated, together with the comparisons of gene and exon length.

For all the values, the median of the set was taken. P-values were calculated with the Matlab command ranksum (from the Statistics Toolbox), which performs a Mann-Whitney U-test. This tests the null hypothesis that two data sets come from the same distribution. An additional advantage of this test is that it can compare data sets of different sizes.

The three P-values are, in order, for the disease vs. neighbour, the disease vs. non-disease and the neighbour vs. non-disease comparisons.
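
To show what the rank sum test computes, here is a minimal pure-Python sketch of a two-sided Mann-Whitney U-test. It uses the normal approximation with tie and continuity corrections. Matlab's ranksum may use an exact method for small samples, so its P-values can differ from this sketch. The three data sets are hypothetical and of different sizes.

```python
import math
from collections import Counter
from statistics import median

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test (normal approximation).

    Returns (U statistic of x, approximate P-value).
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction: t is the size of each group of equal values
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    # Continuity correction of 0.5, clipped at zero
    z = max(abs(u_x - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(variance)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value

# Hypothetical exon lengths for the three groups (different sizes on purpose)
disease = [125, 130, 127, 122, 131, 128]
neighbour = [141, 150, 136, 139, 145]
non_disease = [133, 134, 139, 129, 137, 135, 132]

print("medians:", median(disease), median(neighbour), median(non_disease))
# Same order as in the text: disease vs. neighbour, disease vs. non-disease,
# neighbour vs. non-disease
p_values = [
    mann_whitney_u(disease, neighbour)[1],
    mann_whitney_u(disease, non_disease)[1],
    mann_whitney_u(neighbour, non_disease)[1],
]
print("P-values:", p_values)
```

The test works only on the ranks of the pooled values, which is why it needs no assumption of normality and handles groups of unequal size.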

ENSEMBL - OMIM - no threshold

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| Gene length | 35345 | 25413 | 25705 | 3.6e-4, 1.4e-4, 0.9661 | boxplot |
| Sum of exon lengths per gene | 5432 | 4299 | 4169 | 1.5e-8, 1.1e-10, 0.2628 | boxplot |
| Exon length | 130 | 136 | 133 | 1.4e-9, 3e-5, 4.6e-4 | boxplot |
| CDS length | 108 | 110 | 107 | 0.0590, 0.7174, 0.0079 | boxplot |

ENSEMBL - OMIM - threshold 5

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| Gene length | 34179 | 25764 | 24021 | 0.2250, 0.0169, 0.3421 | boxplot |
| Sum of exon lengths per gene | 6166 | 4417 | 4166 | 0.0028, 6.9e-7, 0.1985 | boxplot |
| Exon length | 125 | 141 | 134 | 4.4e-15, 2.9e-9, 1.3e-4 | boxplot |
| CDS length | 101 | 113 | 107 | 5.2e-10, 1e-7, 0.0088 | boxplot |

ENSEMBL - OMIM - threshold 10

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| Gene length | 34528 | 23863 | 23297 | 0.2400, 0.0676, 0.9262 | boxplot |
| Sum of exon lengths per gene | 5540 | 4029 | 3929 | 0.0258, 0.0014, 0.8941 | boxplot |
| Exon length | 127 | 150 | 139 | 1e-9, 2.8e-10, 1.7e-4 | boxplot |
| CDS length | 106 | 122 | 103 | 5e-7, 0.3564, 1e-6 | boxplot |

ENSEMBL - GAD - threshold 5

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| Gene length | 24287 | 27980 | 24021 | 0.9254, 0.7075, 0.6907 | boxplot |
| Sum of exon lengths per gene | 4626 | 4166 | 4356 | 0.0818, 0.1056, 0.7273 | boxplot |
| Exon length | 135 | 135 | 133 | 0.5174, 0.9030, 0.5877 | boxplot |
| CDS length | 110 | 107 | 103 | 0.2849, 0.0003, 0.0290 | boxplot |

ENSEMBL - GAD - threshold 10

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| Gene length | 23863 | 23607 | 23297 | 0.4837, 0.7835, 0.5794 | boxplot |
| Sum of exon lengths per gene | 4403 | 3864 | 4293 | 0.0981, 0.8523, 0.1803 | boxplot |
| Exon length | 139 | 134 | 137 | 0.3221, 0.9714, 0.3490 | boxplot |
| CDS length | 110 | 108 | 102 | 0.7220, 0.0025, 0.0041 | boxplot |

Conclusion: there is a somewhat strange result in the ENSEMBL - OMIM threshold 5 and 10 tables. The exon and CDS lengths are significantly higher for neighbours of disease genes than for disease or non-disease genes. This effect is not observed when using the GAD database for disease genes.

Homology

In these pictures, a distinct difference is visible between disease and non-disease genes. It seems that disease genes have higher homology than other genes for all the organisms studied. This is statistically confirmed by a Wilcoxon rank sum test in Matlab (the same test as above). All of the homology differences pass the test, i.e. reject the hypothesis that the two samples have the same median. The difference is less pronounced when the test is done using the GAD database of disease genes. The reason for this is probably the larger number of 'false positive' disease genes.

Homology - OMIM - no threshold

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| M.musculus | 86.76 | 83.76 | 82.76 | 8.6e-4, 6.3e-6, 0.0588 | boxplot |
| R.norvegicus | 84.76 | 81.76 | 79.76 | 0.0054, 1.4e-4, 0.0932 | boxplot |
| D.melanogaster | 0 | 0 | 0 | 0.5293, 0.1545, 0.0002 | boxplot |
| G.gallus | 61.76 | 57.76 | 57.76 | 0.0900, 0.0506, 0.7092 | boxplot |
| P.troglodytes | 98.76 | 98.76 | 98.76 | 0.7843, 0.8369, 0.8780 | boxplot |
| C.elegans | 0 | 0 | 0 | 0.9133, 0.0623, 0.0016 | boxplot |
| X.tropicalis | 52.76 | 47.76 | 48.88 | 0.1228, 0.1097, 0.9814 | boxplot |

Homology - OMIM - threshold 2

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| M.musculus | 86.76 | 82.76 | 81.76 | 0.0171, 0.0024, 0.4545 | boxplot |
| R.norvegicus | 83.76 | 79.76 | 78.76 | 0.0151, 0.0043, 0.6758 | boxplot |
| D.melanogaster | 0 | 0 | 0 | 0.3252, 0.8215, 0.0670 | boxplot |
| G.gallus | 61.76 | 57.76 | 53.76 | 0.1131, 0.0070, 0.1796 | boxplot |
| P.troglodytes | 98.76 | 98.76 | 97.76 | 0.7808, 0.4669, 0.5222 | boxplot |
| C.elegans | 0 | 0 | 0 | 0.8987, 0.2238, 0.0588 | boxplot |
| X.tropicalis | 52.76 | 47.76 | 46.76 | 0.1881, 0.0987, 0.7659 | boxplot |

Homology - OMIM - threshold 5

| | Disease genes | Neighbours of disease genes | Non-disease genes | P-values | |
|---|---|---|---|---|---|
| M.musculus | 84.76 | 81.76 | 79.76 | 0.1494, 0.0122, 0.4133 | boxplot |
| R.norvegicus | 82.76 | 77.76 | 76.76 | 0.0152, 0.0007, 0.6023 | boxplot |
| D.melanogaster | 0 | 0 | 0 | 0.3651, 0.8883, 0.2908 | boxplot |
| G.gallus | 56.76 | 57.76 | 50.76 | 0.7066, 0.0833, 0.1608 | boxplot |
| P.troglodytes | 98.76 | 98.76 | 97.76 | 0.7930, 0.3825, 0.5360 | boxplot |
| C.elegans | 0 | 0 | 0 | 0.7338, 0.2156, 0.3547 | boxplot |
| X.tropicalis | 53.76 | 48.76 | 44.76 | 0.2291, 0.0156, 0.2736 | boxplot |

Conclusion: the homology is always higher for disease genes than for non-disease genes, with the neighbour genes in between. The P-values are higher for higher thresholds, because the gene network is smaller in these cases.

Another observation is that the non-disease homology values are higher than in the 'statistical analysis' part. Part of the explanation may be that, for the statistical analysis, the non-disease genes were simply what was left over in the genome database after selecting the disease genes, whereas in this part the non-disease genes are all taken from the gene network.


last modified on 14 September 2007