Navigating gene networks

by text mining and statistical analysis

Statistical analysis

The second part of the summer job project was a statistical analysis of the differences between genes known to be involved in genetic disorders ('disease genes') and all other genes ('non-disease genes', although this set probably still contains disease genes that are not yet known). Several databases were used: mainly ENSEMBL and ASTD for the gene lengths, exon numbers, etc., and OMIM and GAD for the known disease genes.

For every property, the median of each set was taken. P-values were calculated with the Matlab command ranksum (from the Statistics Toolbox), which performs a Wilcoxon rank-sum test, equivalent to the Mann-Whitney U-test. The null hypothesis of this test is that the two data sets come from identical distributions. An additional advantage of the test is that it can compare data sets of different sizes.
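The comparison described above can be sketched in a few lines. The example below is written in Python rather than Matlab and is only an illustration: the function name `rank_sum_test` and the sample values are hypothetical, and it always uses the normal approximation with tie and continuity corrections, whereas Matlab's ranksum may use an exact method for small samples, so the p-values need not match exactly.

```python
import math
from statistics import median


def rank_sum_test(x, y):
    """Two-sided Mann-Whitney U-test (normal approximation).

    Returns (U statistic of x, p-value). The two samples may differ in size.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the ranks i+1 .. j.
        midrank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    # Continuity-corrected z-score, clipped at zero.
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value


if __name__ == "__main__":
    # Made-up gene lengths, for illustration only (not data from this study).
    disease = [31000, 42000, 28000, 55000, 37000, 61000]
    non_disease = [4000, 9000, 6500, 12000, 3000, 7000, 5000, 8000]
    u, p = rank_sum_test(disease, non_disease)
    print("medians:", median(disease), median(non_disease))
    print("U =", u, "p =", p)
```

Reporting the medians next to the p-value mirrors the layout of the tables that follow: two medians per property and one test result.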

General comparison

ENSEMBL and OMIM

| | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| Gene length (real) | 35384 | 5856 | < 0.001 |
| Exon length | 133 | 130 | < 0.001 |
| CDS length | 109 | 95 | < 0.001 |
| Gene length (sum of all exon lengths per gene) | 5554 | 8511 | < 0.001 |
| Gene length (CDS) | 3876 | 5008 | < 0.001 |
| Number of exons | 22 | 42 | < 0.001 |
| Number of coding exons | 26 | 39 | < 0.001 |

ENSEMBL and GAD

| | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| Gene length (real) | 17972 | 5474 | < 0.001 |
| Exon length | 136 | 129 | < 0.001 |
| CDS length | 110 | 99 | < 0.001 |
| Gene length (sum of all exon lengths per gene) | 4542 | 6515 | < 0.001 |
| Gene length (CDS) | 2180 | 3239 | < 0.001 |
| Number of exons | 18 | 31 | < 0.001 |
| Number of coding exons | 15 | 24 | < 0.001 |

ASTD and OMIM

| | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| Exon length | 125 | 127 | < 0.001 |
| CDS length | 116 | 116 | 0.4878 |
| Intron length | 1391 | 1479 | 0.0017 |
| Number of exons | 27 | 21 | < 0.001 |
| Number of coding exons | 19 | 13 | 0.0012 |
| Number of introns | 22 | 17 | < 0.001 |
| Exon length (seq) | 125 | 128 | < 0.001 |
| CDS length (seq) | 129 | 133 | < 0.001 |
| Exon CG% | 0.5208 | 0.5208 | 0.5312 |
| CDS CG% | 0.5139 | 0.5124 | 0.2538 |
| Transcription length | 747 | 740 | 0.1867 |
| Transcription CG% | 0.5294 | 0.5313 | 0.2686 |

ASTD and GAD

| | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| Exon length | 134 | 127 | < 0.001 |
| CDS length | 128 | 115 | < 0.001 |
| Intron length | 1276 | 1507 | < 0.001 |
| Number of exons | 20 | 21 | < 0.001 |
| Number of coding exons | 13 | 14 | < 0.001 |
| Number of introns | 16 | 17 | < 0.001 |
| Exon length (seq) | 134 | 128 | < 0.001 |
| CDS length (seq) | 139 | 132 | < 0.001 |
| Exon CG% | 0.5379 | 0.5195 | < 0.001 |
| CDS CG% | 0.5250 | 0.5109 | < 0.001 |
| Transcription length | 742 | 739 | < 0.001 |
| Transcription CG% | 0.5437 | 0.5306 | < 0.001 |

Homology

In this part, the homology of human genes with their counterparts in different species is compared between disease and non-disease genes. A distinct difference is visible: disease genes show higher homology than other genes for all the organisms studied. This is statistically confirmed by a ranksum test in Matlab, as explained above. All of the homology differences pass the test, i.e. the hypothesis that the two samples have the same median is rejected. The difference is less pronounced when the test is done using the GAD database of disease genes, probably because GAD contains a larger number of 'false positive' disease genes.

Homology - OMIM

| Organism | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| M.musculus - mouse | 86.76 | 53.76 | < 0.001 |
| R.norvegicus - rat | 82.76 | 39.76 | < 0.001 |
| D.melanogaster - fruit fly | 0 | 0 | < 0.001 |
| G.gallus - chicken | 62.76 | 0 | < 0.001 |
| P.troglodytes - chimpanzee | 98.76 | 93.76 | < 0.001 |
| C.familiaris - dog | 85.76 | 20.76 | < 0.001 |
| A.gambiae - mosquito | 0 | 0 | < 0.001 |
| C.elegans - worm | 0 | 0 | < 0.001 |
| T.rubripes - fish | 45.76 | 0 | < 0.001 |
| O.latipes - fish | 47.76 | 0 | < 0.001 |
| X.tropicalis - frog | 52.76 | 0 | < 0.001 |
| D.rerio - zebrafish | 42.76 | 0 | < 0.001 |

Homology - GAD

| Organism | Disease genes | Non-disease genes | P-value |
|---|---|---|---|
| M.musculus - mouse | 79.76 | 43.76 | < 0.001 |
| R.norvegicus - rat | 77.26 | 22.76 | < 0.001 |
| D.melanogaster - fruit fly | 0 | 0 | < 0.001 |
| G.gallus - chicken | 51.26 | 0 | < 0.001 |
| P.troglodytes - chimpanzee | 97.76 | 94.26 | < 0.001 |
| C.familiaris - dog | 80.26 | 47.26 | < 0.001 |
| A.gambiae - mosquito | 0 | 0 | < 0.001 |
| C.elegans - worm | 0 | 0 | < 0.001 |
| T.rubripes - fish | 37.76 | 0 | < 0.001 |
| O.latipes - fish | 38.76 | 0 | < 0.001 |
| X.tropicalis - frog | 41.76 | 0 | < 0.001 |
| D.rerio - zebrafish | 35.76 | 0 | < 0.001 |

Floating base

The floating base characteristics of disease and non-disease genes are compared; the floating base is the last base of a codon, written as * in the tables (e.g. GC*). No real difference is noticeable. For the three cases in which there are only two possibilities for the floating base, the pie chart legend is not correct: the leftmost piece of the pie represents the alphabetically first base. This can also be seen in the table directly underneath.

For all of the floating-base comparisons with four possible bases, the p-value is 0.028571. For those with only two possible bases, the p-value is 0.65714.
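As an illustration of what a floating-base table row contains, the sketch below counts how often each base occurs in the last position of codons that share the same first two bases. The source does not describe how the fractions were actually computed, so the function name `floating_base_fractions`, its parameters and the codon list are assumptions made for this example.

```python
from collections import Counter


def floating_base_fractions(codons, prefix, bases="ACGT"):
    """Fraction of each allowed third base among codons starting with `prefix`.

    `bases` restricts the count to the third bases that encode the amino acid
    of interest, e.g. "AG" for Arg/R - AG* (AGC and AGT encode Ser instead).
    """
    counts = Counter(
        codon[2]
        for codon in codons
        if len(codon) == 3 and codon[:2] == prefix and codon[2] in bases
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no matching codons in the input
    # Alphabetical order, matching the A, C, G, T column order of the tables.
    return {base: counts[base] / total for base in sorted(bases)}


if __name__ == "__main__":
    # Made-up codons, for illustration only (not data from this study).
    codons = ["GCA", "GCC", "GCC", "GCT", "AGA", "AGG", "AGC"]
    print(floating_base_fractions(codons, "GC"))        # Ala/A - GC*
    print(floating_base_fractions(codons, "AG", "AG"))  # Arg/R - AG*
```

Each table row would then correspond to one such dictionary, computed once over the disease genes and once over the non-disease genes.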


Legend for the pie charts

Floating base - OMIM

| Amino acid - codon | Gene set | A | C | G | T |
|---|---|---|---|---|---|
| Ala/A - GC* | Disease genes | 0.26896 | 0.31669 | 0.12889 | 0.28546 |
| | Non-disease genes | 0.26944 | 0.31588 | 0.12861 | 0.28608 |
| Gly/G - GG* | Disease genes | 0.31287 | 0.27498 | 0.25165 | 0.1605 |
| | Non-disease genes | 0.31158 | 0.27353 | 0.25468 | 0.16021 |
| Pro/P - CC* | Disease genes | 0.30003 | 0.27487 | 0.13649 | 0.28861 |
| | Non-disease genes | 0.29825 | 0.27826 | 0.1321 | 0.29139 |
| Thr/T - AC* | Disease genes | 0.31832 | 0.30561 | 0.12054 | 0.25554 |
| | Non-disease genes | 0.32206 | 0.29997 | 0.11699 | 0.26097 |
| Val/V - GT* | Disease genes | 0.14962 | 0.23922 | 0.38326 | 0.2279 |
| | Non-disease genes | 0.15288 | 0.23485 | 0.38482 | 0.22745 |
| Arg/R - CG* | Disease genes | 0.21763 | 0.28511 | 0.3181 | 0.17915 |
| | Non-disease genes | 0.20851 | 0.27878 | 0.33227 | 0.18044 |
| Arg/R - AG* | Disease genes | 0.52533 | | 0.47467 | |
| | Non-disease genes | 0.53462 | | 0.46538 | |
| Leu/L - CT* | Disease genes | 0.12396 | 0.25945 | 0.39314 | 0.22345 |
| | Non-disease genes | 0.12566 | 0.25288 | 0.39201 | 0.22944 |
| Leu/L - TT* | Disease genes | 0.39864 | | 0.60136 | |
| | Non-disease genes | 0.39373 | | 0.60627 | |
| Ser/S - TC* | Disease genes | 0.29392 | 0.3129 | 0.10342 | 0.28977 |
| | Non-disease genes | 0.29372 | 0.31249 | 0.09813 | 0.29567 |
| Ser/S - AG* | Disease genes | | 0.60957 | | 0.39043 |
| | Non-disease genes | | 0.59773 | | 0.40227 |

Floating base - GAD

| Amino acid - codon | Gene set | A | C | G | T |
|---|---|---|---|---|---|
| Ala/A - GC* | Disease genes | 0.25955 | 0.32354 | 0.12595 | 0.29096 |
| | Non-disease genes | 0.27018 | 0.31544 | 0.1284 | 0.28599 |
| Gly/G - GG* | Disease genes | 0.3071 | 0.27557 | 0.24901 | 0.16831 |
| | Non-disease genes | 0.31212 | 0.27339 | 0.25422 | 0.16027 |
| Pro/P - CC* | Disease genes | 0.29034 | 0.28983 | 0.12814 | 0.29169 |
| | Non-disease genes | 0.29855 | 0.27802 | 0.13201 | 0.29142 |
| Thr/T - AC* | Disease genes | 0.30966 | 0.31596 | 0.11925 | 0.25512 |
| | Non-disease genes | 0.3227 | 0.29946 | 0.11638 | 0.26145 |
| Val/V - GT* | Disease genes | 0.13994 | 0.23979 | 0.40417 | 0.2161 |
| | Non-disease genes | 0.15378 | 0.23432 | 0.38316 | 0.22874 |
| Arg/R - CG* | Disease genes | 0.2047 | 0.29708 | 0.30952 | 0.1887 |
| | Non-disease genes | 0.20956 | 0.27749 | 0.33277 | 0.18017 |
| Arg/R - AG* | Disease genes | 0.52541 | | 0.47459 | |
| | Non-disease genes | 0.53528 | | 0.46472 | |
| Leu/L - CT* | Disease genes | 0.12355 | 0.25177 | 0.40019 | 0.22449 |
| | Non-disease genes | 0.12596 | 0.25293 | 0.3909 | 0.23022 |
| Leu/L - TT* | Disease genes | 0.36506 | | 0.63494 | |
| | Non-disease genes | 0.3951 | | 0.6049 | |
| Ser/S - TC* | Disease genes | 0.29069 | 0.32631 | 0.095449 | 0.28755 |
| | Non-disease genes | 0.29425 | 0.31144 | 0.097969 | 0.29633 |
| Ser/S - AG* | Disease genes | | 0.60237 | | 0.39763 |
| | Non-disease genes | | 0.5976 | | 0.4024 |

Last modified on 14 September 2007