Navigating gene networks

by text mining and statistical analysis

Text mining

The first part of the project was about text mining a PubMed database consisting of titles and abstracts of medical papers. A list was available to restrict the database to the articles related to genomics. The goal was to build a gene network out of this database, showing how the different human genes influence each other.

The first step in realising this was to index the .xml files containing the titles and the abstracts. These files were already stored locally. To index them, the Java Lucene library was used, together with some other classes that were made at the research group. The gene names were extracted from the HUGO database of known human genes.

After this, Luke was used to check the results before they were stored in a MySQL database by another Java class. From this database, yet another Java class created a document-gene table (e.g. Article 1 Gene 20, Article 2 Gene 3584, ...). This table was then imported into Matlab.

In Matlab, a matrix was created from the previous table. This is called a document-gene matrix. Our goal was to create a table with an entry for every link between two genes. To achieve this, we only needed to calculate the inner product of every gene vector with every other one. If this product is not equal to zero, at least one article mentions both genes, so there is a link between the two. Another Java class was then used to create the gene link table we wanted.
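The inner-product step can be sketched in a few lines. What follows is a minimal Python illustration, not the Matlab or Java code used in the project. The function names `document_gene_matrix` and `gene_links` and the sample pairs are made up for the example. It assumes a binary matrix with one row per article and one column per gene, so the inner product of two gene columns counts the articles that mention both genes.

```python
def document_gene_matrix(pairs):
    """Build a binary document-gene matrix from (article, gene) pairs."""
    docs = sorted({d for d, _ in pairs})
    genes = sorted({g for _, g in pairs})
    row = {d: i for i, d in enumerate(docs)}
    col = {g: j for j, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in docs]
    for d, g in pairs:
        matrix[row[d]][col[g]] = 1  # article d mentions gene g
    return matrix, genes


def gene_links(matrix, genes):
    """Inner product of every gene column with every other one."""
    links = {}
    for a in range(len(genes)):
        for b in range(a + 1, len(genes)):
            strength = sum(r[a] * r[b] for r in matrix)
            if strength != 0:  # nonzero product: the two genes are linked
                links[(genes[a], genes[b])] = strength
    return links


# Hypothetical table: (article number, gene number)
pairs = [(1, 20), (1, 3584), (2, 20), (2, 3584), (3, 7)]
matrix, genes = document_gene_matrix(pairs)
links = gene_links(matrix, genes)
print(links)
```

In this toy table, genes 20 and 3584 appear together in two articles, so they get a link of strength 2, while gene 7 never co-occurs with another gene and gets no link.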

All that was left to do then was to import this table into BioLayout to visualize the gene network.

Upon looking at the result, however, an expected but unwanted outcome was noticed: the strongest links in the network were always between genes like 'AR' or 'T', clearly showing that our scanning of the abstracts was not specific enough. To improve the results, a case sensitive version was implemented. This way, words like 'are' (automatically stemmed to 'ar') were no longer picked up while indexing the articles.
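The effect of case sensitivity can be illustrated with a small Python sketch. This is not the Lucene-based indexing code of the project: `toy_stem` is a deliberately crude stand-in for a real stemmer (it only strips one trailing 'e'), and the function names and the example sentence are hypothetical.

```python
GENE_SYMBOLS = {"AR", "T"}  # the two symbols mentioned in the text


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one trailing 'e'."""
    return word[:-1] if len(word) > 2 and word.endswith("e") else word


def match_case_insensitive(text):
    """Lowercase and stem every token before comparing with the symbols."""
    lowered = {s.lower() for s in GENE_SYMBOLS}
    hits = []
    for token in text.split():
        stem = toy_stem(token.lower())
        if stem in lowered:
            hits.append(stem.upper())
    return hits


def match_case_sensitive(text):
    """Only accept tokens that match a gene symbol exactly."""
    return [token for token in text.split() if token in GENE_SYMBOLS]


sentence = "Mutations in AR are common"
insensitive_hits = match_case_insensitive(sentence)
sensitive_hits = match_case_sensitive(sentence)
print(insensitive_hits, sensitive_hits)
```

In the case insensitive variant the ordinary word 'are' is counted as a second hit for 'AR'. The case sensitive variant reports only the real mention.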

The (static) images below and above were all created using this case sensitive version. Larger versions of these images are stored in an archive you can find on the media page.

The animations below visualize the evolution of the gene network as an increasingly strict link strength filter is applied. In the combined results part, this filter is referred to as the 'threshold' of the gene network. Notice the very distinct difference between the first frame and the second one, indicating that there are a lot of 'weak' links in the network. This is not surprising, because only a small portion of all the abstracts was used to produce these animations.
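As a rough illustration of such a filter, the Python sketch below keeps only the links whose strength reaches a given threshold and counts what remains for a few increasing thresholds. The gene names, the strengths and the choice of "greater than or equal" are assumptions made for the example, not values from the project.

```python
def filter_links(links, threshold):
    """Keep only the links whose strength is at least the threshold."""
    return {pair: s for pair, s in links.items() if s >= threshold}


# Hypothetical link table: (gene, gene) -> link strength
links = {("G1", "G2"): 1, ("G1", "G3"): 1, ("G2", "G3"): 4, ("G3", "G4"): 9}

# One "frame" per threshold, as in an animation with ascending link strength
frame_sizes = [len(filter_links(links, t)) for t in (1, 2, 5)]
print(frame_sizes)
```

Going from threshold 1 to threshold 2 already removes every link of strength 1, which mirrors the sharp drop between the first two frames when a network contains many weak links.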

Case insensitive version, ascending link strength.

Case sensitive version, ascending link strength.


Last modified on 14 September 2007