Statistical analysis
The second part of the summer job project was a statistical analysis of the differences between genes known to be involved in genetic disorders ('disease genes') and all other genes ('non-disease genes', although this set probably still contains disease genes that have not yet been identified). Several databases were used: mainly ENSEMBL and ASTD for the gene lengths, exon numbers, etc., and OMIM and GAD for the known disease genes.
For every quantity, the median of each set is reported. P-values were calculated with the Matlab command ranksum (from the Statistics Toolbox), which performs a Mann-Whitney U-test. This tests the null hypothesis that the two data sets come from an identical distribution. An additional advantage of this test is that it can compare data sets of different sizes.
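To illustrate the kind of comparison reported in the tables that follow, here is a minimal Python sketch of a two-sided Mann-Whitney U-test. It is not the Matlab code used for the analysis: the function name `mann_whitney_p` and the sample values are made up for illustration, and the sketch uses only the large-sample normal approximation (with tie and continuity corrections), so its p-values need not match those of ranksum exactly.

```python
import math
from statistics import median


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U-test p-value (normal approximation).

    Hypothetical helper for illustration; ties receive average ranks.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this group of tied values
        avg_rank = (i + 1 + j) / 2   # average of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0                   # all values identical
    # Continuity-corrected z-score, clamped at zero.
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return math.erfc(z / math.sqrt(2))


# Made-up gene lengths; the two sets deliberately differ in size.
disease = [35000, 41000, 28000, 52000, 39000, 47000, 30500, 44000]
non_disease = [5800, 6100, 4900, 7200, 5300, 6600, 35500, 5000, 4700, 6900, 5100, 6300]

print("medians:", median(disease), median(non_disease))
print("p-value:", mann_whitney_p(disease, non_disease))
```

The test only uses the ranks of the pooled values, which is why the two sets do not need to have the same size.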
General comparison
| ENSEMBL and OMIM |
| | Disease genes | Non-disease genes | P-value | |
| Gene length (real) | 35384 | 5856 | < 0.001 | boxplot |
| Exon length | 133 | 130 | < 0.001 | boxplot |
| CDS length | 109 | 95 | < 0.001 | boxplot |
| Gene length (sum of all exon lengths per gene) | 5554 | 8511 | < 0.001 | boxplot |
| Gene length (CDS) | 3876 | 5008 | < 0.001 | boxplot |
| Number of exons | 22 | 42 | < 0.001 | boxplot |
| Number of coding exons | 26 | 39 | < 0.001 | boxplot |
| ENSEMBL and GAD |
| | Disease genes | Non-disease genes | P-value | |
| Gene length (real) | 17972 | 5474 | < 0.001 | boxplot |
| Exon length | 136 | 129 | < 0.001 | boxplot |
| CDS length | 110 | 99 | < 0.001 | boxplot |
| Gene length (sum of all exon lengths per gene) | 4542 | 6515 | < 0.001 | boxplot |
| Gene length (CDS) | 2180 | 3239 | < 0.001 | boxplot |
| Number of exons | 18 | 31 | < 0.001 | boxplot |
| Number of coding exons | 15 | 24 | < 0.001 | boxplot |
| ASTD and OMIM |
| | Disease genes | Non-disease genes | P-value | |
| Exon length | 125 | 127 | < 0.001 | boxplot |
| CDS length | 116 | 116 | 0.4878 | boxplot |
| Intron length | 1391 | 1479 | 0.0017 | boxplot |
| Number of exons | 27 | 21 | < 0.001 | boxplot |
| Number of coding exons | 19 | 13 | 0.0012 | boxplot |
| Number of introns | 22 | 17 | < 0.001 | boxplot |
| Exon length (seq) | 125 | 128 | < 0.001 | boxplot |
| CDS length (seq) | 129 | 133 | < 0.001 | boxplot |
| Exon CG% | 0.5208 | 0.5208 | 0.5312 | boxplot |
| CDS CG% | 0.5139 | 0.5124 | 0.2538 | boxplot |
| Transcription length | 747 | 740 | 0.1867 | boxplot |
| Transcription CG% | 0.5294 | 0.5313 | 0.2686 | boxplot |
| ASTD and GAD |
| | Disease genes | Non-disease genes | P-value | |
| Exon length | 134 | 127 | < 0.001 | boxplot |
| CDS length | 128 | 115 | < 0.001 | boxplot |
| Intron length | 1276 | 1507 | < 0.001 | boxplot |
| Number of exons | 20 | 21 | < 0.001 | boxplot |
| Number of coding exons | 13 | 14 | < 0.001 | boxplot |
| Number of introns | 16 | 17 | < 0.001 | boxplot |
| Exon length (seq) | 134 | 128 | < 0.001 | boxplot |
| CDS length (seq) | 139 | 132 | < 0.001 | boxplot |
| Exon CG% | 0.5379 | 0.5195 | < 0.001 | boxplot |
| CDS CG% | 0.5250 | 0.5109 | < 0.001 | boxplot |
| Transcription length | 742 | 739 | < 0.001 | boxplot |
| Transcription CG% | 0.5437 | 0.5306 | < 0.001 | boxplot |
Homology
In this part, the homology of human genes with genes in other species is compared between disease and non-disease genes. A distinct difference is visible: disease genes seem to be shared with the other organisms to a larger degree than non-disease genes, for all the organisms studied. This is statistically confirmed by a ranksum test in Matlab, as explained above. All of the homology differences pass the test, i.e. reject the hypothesis that the two samples come from the same distribution. The difference is less pronounced when the test is done using the GAD database of disease genes. This is probably due to the larger number of 'false positive' disease genes in GAD.
| Homology - OMIM |
| | Disease genes | Non-disease genes | P-value | | |
| M.musculus - mouse | 86.76 | 53.76 | < 0.001 | boxplot | histograms |
| R.norvegicus - rat | 82.76 | 39.76 | < 0.001 | boxplot | |
| D.melanogaster - fruit fly | 0 | 0 | < 0.001 | boxplot | |
| G.gallus - chicken | 62.76 | 0 | < 0.001 | boxplot | |
| P.troglodytes - chimpanzee | 98.76 | 93.76 | < 0.001 | boxplot | histograms |
| C.familiaris - dog | 85.76 | 20.76 | < 0.001 | boxplot | |
| A.gambiae - mosquito | 0 | 0 | < 0.001 | boxplot | |
| C.elegans - worm | 0 | 0 | < 0.001 | boxplot | histograms |
| T.rubripes - fish | 45.76 | 0 | < 0.001 | boxplot | |
| O.latipes - fish | 47.76 | 0 | < 0.001 | boxplot | |
| X.tropicalis - frog | 52.76 | 0 | < 0.001 | boxplot | histograms |
| D.rerio - zebrafish | 42.76 | 0 | < 0.001 | boxplot | |
| Homology - GAD |
| | Disease genes | Non-disease genes | P-value | | |
| M.musculus - mouse | 79.76 | 43.76 | < 0.001 | boxplot | histograms |
| R.norvegicus - rat | 77.26 | 22.76 | < 0.001 | boxplot | |
| D.melanogaster - fruit fly | 0 | 0 | < 0.001 | boxplot | |
| G.gallus - chicken | 51.26 | 0 | < 0.001 | boxplot | |
| P.troglodytes - chimpanzee | 97.76 | 94.26 | < 0.001 | boxplot | histograms |
| C.familiaris - dog | 80.26 | 47.26 | < 0.001 | boxplot | |
| A.gambiae - mosquito | 0 | 0 | < 0.001 | boxplot | |
| C.elegans - worm | 0 | 0 | < 0.001 | boxplot | histograms |
| T.rubripes - fish | 37.76 | 0 | < 0.001 | boxplot | |
| O.latipes - fish | 38.76 | 0 | < 0.001 | boxplot | |
| X.tropicalis - frog | 41.76 | 0 | < 0.001 | boxplot | histograms |
| D.rerio - zebrafish | 35.76 | 0 | < 0.001 | boxplot | |
Floating base
The floating base characteristics of disease and non-disease genes are compared. No notable difference is found. For the three cases in which there are only two possibilities for the floating base, the pie chart legend is not correct: the leftmost piece of the pie represents the alphabetically first base. This can also be seen in the table directly underneath.
For all of the floating-base comparisons with four possible bases, the p-value is 0.028571. For those with only two possible bases, the p-value is 0.65714.
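Both p-values quoted above can be reproduced by an exact rank-sum test on two samples of four values each. The sketch below is an observation under assumptions, not a description of the original Matlab code: the function names and the counts are hypothetical. It enumerates all 70 ways of splitting eight pooled values into two groups of four. When every value in one group is smaller than every value in the other, only 2 of the 70 splits are at least as extreme, giving 2/70 ≈ 0.028571, the smallest two-sided p-value such a test can produce. When two of the four entries are zero in both groups (absent bases), 46 of the 70 splits are at least as extreme, giving 46/70 ≈ 0.65714.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1        # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def exact_ranksum_p(x, y):
    """Two-sided exact p-value, enumerating every split of the pooled ranks."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    expected = n * (len(ranks) + 1) / 2      # mean rank sum under the null
    observed = abs(sum(ranks[:n]) - expected)
    splits = list(combinations(range(len(ranks)), n))
    extreme = sum(
        1 for idx in splits
        if abs(sum(ranks[i] for i in idx) - expected) >= observed - 1e-9
    )
    return extreme / len(splits)


# Hypothetical counts: one group is smaller than the other for every base.
p_four = exact_ranksum_p([10, 20, 30, 40], [50, 60, 70, 80])
# Hypothetical counts with two absent bases stored as zeros in both groups.
p_two = exact_ranksum_p([10, 0, 20, 0], [50, 0, 60, 0])

print(f"{p_four:.6f}")  # 0.028571
print(f"{p_two:.5f}")   # 0.65714
```

If the tests were indeed run on four values per group, these p-values are determined by the sample size and the ordering alone, which would fit the remark that no notable difference is found.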

Legend for the pie charts
| Floating base - OMIM |
| | | A | C | G | T | |
| Ala/A - GC* | Disease genes | 0.26896 | 0.31669 | 0.12889 | 0.28546 | pie chart |
| | Non-disease genes | 0.26944 | 0.31588 | 0.12861 | 0.28608 | |
| Gly/G - GG* | Disease genes | 0.31287 | 0.27498 | 0.25165 | 0.1605 | pie chart |
| | Non-disease genes | 0.31158 | 0.27353 | 0.25468 | 0.16021 | |
| Pro/P - CC* | Disease genes | 0.30003 | 0.27487 | 0.13649 | 0.28861 | pie chart |
| | Non-disease genes | 0.29825 | 0.27826 | 0.1321 | 0.29139 | |
| Thr/T - AC* | Disease genes | 0.31832 | 0.30561 | 0.12054 | 0.25554 | pie chart |
| | Non-disease genes | 0.32206 | 0.29997 | 0.11699 | 0.26097 | |
| Val/V - GT* | Disease genes | 0.14962 | 0.23922 | 0.38326 | 0.2279 | pie chart |
| | Non-disease genes | 0.15288 | 0.23485 | 0.38482 | 0.22745 | |
| Arg/R - CG* | Disease genes | 0.21763 | 0.28511 | 0.3181 | 0.17915 | pie chart |
| | Non-disease genes | 0.20851 | 0.27878 | 0.33227 | 0.18044 | |
| Arg/R - AG* | Disease genes | 0.52533 | | 0.47467 | | pie chart |
| | Non-disease genes | 0.53462 | | 0.46538 | | |
| Leu/L - CT* | Disease genes | 0.12396 | 0.25945 | 0.39314 | 0.22345 | pie chart |
| | Non-disease genes | 0.12566 | 0.25288 | 0.39201 | 0.22944 | |
| Leu/L - TT* | Disease genes | 0.39864 | | 0.60136 | | pie chart |
| | Non-disease genes | 0.39373 | | 0.60627 | | |
| Ser/S - TC* | Disease genes | 0.29392 | 0.3129 | 0.10342 | 0.28977 | pie chart |
| | Non-disease genes | 0.29372 | 0.31249 | 0.09813 | 0.29567 | |
| Ser/S - AG* | Disease genes | | 0.60957 | | 0.39043 | pie chart |
| | Non-disease genes | | 0.59773 | | 0.40227 | |
| Floating base - GAD |
| | | A | C | G | T | |
| Ala/A - GC* | Disease genes | 0.25955 | 0.32354 | 0.12595 | 0.29096 | pie chart |
| | Non-disease genes | 0.27018 | 0.31544 | 0.1284 | 0.28599 | |
| Gly/G - GG* | Disease genes | 0.3071 | 0.27557 | 0.24901 | 0.16831 | pie chart |
| | Non-disease genes | 0.31212 | 0.27339 | 0.25422 | 0.16027 | |
| Pro/P - CC* | Disease genes | 0.29034 | 0.28983 | 0.12814 | 0.29169 | pie chart |
| | Non-disease genes | 0.29855 | 0.27802 | 0.13201 | 0.29142 | |
| Thr/T - AC* | Disease genes | 0.30966 | 0.31596 | 0.11925 | 0.25512 | pie chart |
| | Non-disease genes | 0.3227 | 0.29946 | 0.11638 | 0.26145 | |
| Val/V - GT* | Disease genes | 0.13994 | 0.23979 | 0.40417 | 0.2161 | pie chart |
| | Non-disease genes | 0.15378 | 0.23432 | 0.38316 | 0.22874 | |
| Arg/R - CG* | Disease genes | 0.2047 | 0.29708 | 0.30952 | 0.1887 | pie chart |
| | Non-disease genes | 0.20956 | 0.27749 | 0.33277 | 0.18017 | |
| Arg/R - AG* | Disease genes | 0.52541 | | 0.47459 | | pie chart |
| | Non-disease genes | 0.53528 | | 0.46472 | | |
| Leu/L - CT* | Disease genes | 0.12355 | 0.25177 | 0.40019 | 0.22449 | pie chart |
| | Non-disease genes | 0.12596 | 0.25293 | 0.3909 | 0.23022 | |
| Leu/L - TT* | Disease genes | 0.36506 | | 0.63494 | | pie chart |
| | Non-disease genes | 0.3951 | | 0.6049 | | |
| Ser/S - TC* | Disease genes | 0.29069 | 0.32631 | 0.095449 | 0.28755 | pie chart |
| | Non-disease genes | 0.29425 | 0.31144 | 0.097969 | 0.29633 | |
| Ser/S - AG* | Disease genes | | 0.60237 | | 0.39763 | pie chart |
| | Non-disease genes | | 0.5976 | | 0.4024 | |