Text mining of articles in metabolomics research


M. Kovářová1, L. Kovářová2, D. Štys1, 3


1Institute of Physical Biology, University of South Bohemia and Institute of Systems Biology and Ecology ASCR, Academic and University Center,

Zámek 136, 373 33 Nové Hrady, Czech Republic

2Faculty of Mathematics and Physics, Charles University in Prague,

Malostranské nám. 25, 118 00 Praha 1, Czech Republic

3Institute of Microbiology ASCR, 37981 Třeboň, Czech Republic



More than 80% of information is stored in textual form. A large number of papers appear every day, and it is beyond anybody's power to read even a fraction of the published work. Search engines return an excessive number of results and do not make the problem any simpler. Biochemical research poses additional difficulties, such as an abundance of terms with ambiguous nomenclature, acronyms and abbreviations. This is the reason for the boom of text mining methods in this area. Text mining methods transform unstructured textual data into a structured data matrix, which can then be evaluated with data mining methods.

We have developed a program that takes a list of terms of interest and searches for their occurrences in abstracts or articles from internet databases or, more generally, from internet sites. The program is based on the Aho-Corasick algorithm and works as an efficient search engine: because each text is scanned in a single pass regardless of the number of terms, processing the articles takes almost the same time for 10 or 5 000 given terms. The program goes through the articles and creates a data matrix from the terms found. Finally, it sorts the results and deletes useless information.
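
The approach described above can be illustrated with a minimal sketch. It is not the program itself, whose source is not given here, but a small Python version of the Aho-Corasick algorithm that counts term occurrences and arranges them into a document-by-term matrix. The function names (build_automaton, count_terms, term_matrix), the case-insensitive matching and the plain substring matching without word boundaries are all assumptions made for the illustration.

```python
from collections import deque


def build_automaton(terms):
    """Build an Aho-Corasick automaton (hypothetical helper for this sketch).

    goto[s] maps a character to the next state, fail[s] is the failure link,
    out[s] lists the indices of all terms that end at state s.
    """
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every term into a trie.
    for idx, term in enumerate(terms):
        state = 0
        for ch in term:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(idx)
    # Phase 2: breadth-first pass that sets the failure links.
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit terms that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_terms(text, automaton, n_terms):
    """Scan the text once and count occurrences of every term."""
    goto, fail, out = automaton
    counts = [0] * n_terms
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            counts[idx] += 1
    return counts


def term_matrix(documents, terms):
    """Rows are documents, columns are terms (assumed: case-insensitive)."""
    automaton = build_automaton([t.lower() for t in terms])
    return [count_terms(doc.lower(), automaton, len(terms)) for doc in documents]


terms = ["glucose", "ATP", "pyruvate"]
documents = [
    "Glucose is converted to pyruvate; glucose uptake rises.",
    "ATP and atp synthase",
]
matrix = term_matrix(documents, terms)
```

Each text is read character by character exactly once, so the scanning cost depends on the length of the text and the number of matches rather than on the size of the term list; only the construction of the automaton grows with the total length of the terms. A real application would probably also need word-boundary checks and handling of synonyms and abbreviations, which this sketch leaves out.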




This work was financed by project No. MSM6007665808 of the Ministry of Education, Youth and Sports of the Czech Republic.