03.11.2017

Predicting, monitoring and visualising trends in life sciences

by Elke Van Assche,
Frauke Demol,
Nicolas Teirlinckx

One of the most exciting advances in life sciences over the last decade came from combining the huge progress in computer science with the amazing insights into molecular science and genetics. Together with UCB, XAOP took this evolution to heart by developing the Cartographer application, a tool used to predict and monitor trends in life sciences.

About Elke

Elke strengthens our scientific team. With a doctorate in Bioscience Engineering, she handles the difficult task of analysing and understanding scientific data. She will also blow you away with her love for the orchestra in which she masters the cornet. We'll definitely have to go to one of her concerts and check her out.

About Frauke

Frauke is known for her biomedical expertise. As a scientific analyst, she turns an overdose of data into clear information. But careful, there's more to her than just science and biology: she's also in charge of our HR and internal operations. She might not be very tall, but you can't miss her - she'll usually be the one talking or laughing the loudest.

About Nicolas

Nicolas is a developer and project manager by day, football hero by night. While that is probably slightly exaggerated, he does at least have the 'confident football walk' nailed down.

Cartographer visualises trends within the UCB interest sphere by applying text mining, text processing and machine learning. This includes topic modelling and cluster analysis on different data sources.

One of these sources is PubMed, the leading library containing abstracts and links to medical, nursing, healthcare and preclinical sciences journal articles. In addition, we gathered NIH data on grants awarded to promising research topics. Furthermore, we used data from several patent databases to ensure an overview of both patent applications and published patents within the interest domains. Combining these data enabled us to create a quick yet powerful visualisation of upcoming trends and trend evolutions within predefined life science interest fields.

To achieve this, we took the following steps. We started with the analysis of the abstracts and metadata (keywords, authors, …) of 5.5 million PubMed articles as our main data source. The PubMed data were downloaded, and the XML was transformed into a format suitable for Apache Spark and uploaded to S3. Apache Spark is an engine for large-scale data processing that was used later in the analysis. Subsequently, the relevant metadata were retrieved from the transformed data and dumped as JSON in S3.
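
As an illustration of the transformation step, here is a minimal sketch that flattens PubMed-style XML into one JSON document per line, a layout Spark can read directly. The choice of JSON lines, the field names and the tiny sample record are our assumptions for this example, not the actual Cartographer code, and the upload to S3 is left out.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the shape of a PubMed export.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
      <KeywordList><Keyword>example</Keyword></KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def node_text(node):
    """Return all text inside an element, including nested markup."""
    return "".join(node.itertext()).strip() if node is not None else ""


def article_to_record(article):
    """Flatten one <PubmedArticle> element into a plain dict."""
    abstract = " ".join(
        node_text(part) for part in article.iterfind(".//Abstract/AbstractText")
    )
    authors = [
        " ".join(filter(None, (a.findtext("ForeName"), a.findtext("LastName"))))
        for a in article.iterfind(".//AuthorList/Author")
    ]
    keywords = [node_text(k) for k in article.iterfind(".//KeywordList/Keyword")]
    return {
        "pmid": node_text(article.find(".//PMID")),
        "title": node_text(article.find(".//ArticleTitle")),
        "abstract": abstract,
        "authors": authors,
        "keywords": keywords,
    }


def to_json_lines(xml_text):
    """Serialise every article as one JSON document per line."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        json.dumps(article_to_record(a)) for a in root.iter("PubmedArticle")
    )


json_lines = to_json_lines(SAMPLE_XML)
print(json_lines)
```

For a real export of millions of articles, a streaming parser such as `ET.iterparse` would be a more sensible choice than loading each file into memory at once.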

The next step comprised the generation of the corpus and was the actual start of the text analysis. We tokenised and tagged (part-of-speech tagging) the abstracts and titles using the NLTK Python library. During this phase, we also removed stop words, numbers and punctuation. Subsequently, we joined terms that were frequently found next to each other into collocations (e.g. ‘New’ and ‘York’ become ‘New_York’). Our collocation algorithm was an in-house implementation based on the Phrases model from Gensim. To retain only relevant nouns, we also removed verbs, adverbs, determiners, cardinal numbers, etc. from the corpus. Finally, we singularised all the relevant terms.
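
To give an idea of how such a collocation step can work, the sketch below scores adjacent word pairs with the formula used by the original scorer in Gensim's Phrases model and merges the pairs that score above a threshold. It is a simplified pure-Python illustration, not our in-house implementation: the function names, the toy corpus and the parameter values are assumptions made for this example, and the input is assumed to be already tokenised and cleaned.

```python
from collections import Counter


def learn_phrases(sentences, min_count=1, threshold=1.0):
    """Return the adjacent word pairs whose score exceeds `threshold`.

    score(a, b) = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_phrases(tokens, phrases, delimiter="_"):
    """Join every detected pair into a single token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip the second half of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus, invented for this example.
corpus = [
    ["new", "york", "hospital", "trial"],
    ["trial", "new", "york", "cohort"],
    ["new", "york", "patient", "cohort"],
    ["patient", "trial", "hospital"],
]

phrases = learn_phrases(corpus, min_count=1, threshold=1.0)
merged_corpus = [merge_phrases(tokens, phrases) for tokens in corpus]
print(merged_corpus[0])
```

On a corpus of realistic size, `min_count` and `threshold` would need to be much higher than in this toy setting to avoid merging accidental neighbours.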

After corpus generation, topics were assigned to the available PubMed articles. Two different algorithms were used for this. Firstly, topic modelling was done via TF-IDF (the Apache Spark implementation), and the resulting topics were combined with the provided article keywords. Secondly, Word2Vec was run on the full-text abstracts. Based on the assigned topics, we selected the articles that fell within the interest sphere of UCB, as defined by a set of predefined domains. The topics themselves were clustered within each domain using the Word2Vec results.
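
The TF-IDF part can be illustrated with a few lines of plain Python. The sketch below is a stand-in for the Apache Spark implementation we actually used: it applies the smoothed inverse document frequency that Spark MLlib documents, log((N + 1) / (df + 1)), to raw term counts. Keeping the `top_n` highest-scoring terms per article as its topics, as well as the toy documents, are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_top_terms(documents, top_n=2):
    """Rank each document's terms by TF-IDF and keep the `top_n` best."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    # Smoothed IDF, following the formula documented for Spark MLlib.
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

    results = []
    for tokens in documents:
        tf = Counter(tokens)  # raw term frequency
        scores = {t: count * idf[t] for t, count in tf.items()}
        # Highest score first; ties are broken alphabetically for stability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Toy corpus of already cleaned, noun-only documents (invented).
documents = [
    ["epilepsy", "seizure", "seizure", "patient"],
    ["antibody", "patient", "immunology"],
    ["epilepsy", "patient", "drug"],
]

top_terms = tfidf_top_terms(documents)
print(top_terms)
```

A term such as "patient", which occurs in every document, gets an IDF of zero and therefore never surfaces as a topic, which is exactly the behaviour that makes TF-IDF useful for this task.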

The final step was to map the clustered terms to other data sources (Google Trends, NIH, PatBase, …) to visualise the trends over time.
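
Conceptually, this mapping boils down to counting, per period, how many records of a source mention the terms of a cluster. The sketch below shows that idea on a handful of invented records. The record layout (`year`, `text`), the cluster definitions and the simple substring matching are all assumptions for this example and do not reflect the actual schema of any of the sources.

```python
from collections import Counter, defaultdict


def yearly_counts(records, cluster_terms):
    """Count, per year, how many records mention any term of each cluster."""
    trends = defaultdict(Counter)
    for record in records:
        text = record["text"].lower()
        for cluster, terms in cluster_terms.items():
            # Crude substring match; a real system would match on tokens.
            if any(term in text for term in terms):
                trends[cluster][record["year"]] += 1
    # Sort the years so the series is ready for plotting.
    return {cluster: dict(sorted(c.items())) for cluster, c in trends.items()}


# Invented example records, e.g. grant or patent descriptions.
records = [
    {"year": 2015, "text": "Grant on epilepsy biomarkers"},
    {"year": 2016, "text": "Seizure prediction devices"},
    {"year": 2016, "text": "Epilepsy drug screening"},
    {"year": 2016, "text": "Antibody engineering platform"},
]

# Hypothetical clusters of related terms.
cluster_terms = {
    "epilepsy": ["epilepsy", "seizure"],
    "immunology": ["antibody"],
}

trends = yearly_counts(records, cluster_terms)
print(trends)
```

The resulting per-year series for each cluster is the kind of data a trend chart can be drawn from.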

The resulting application is an appealing visualisation tool that gives a quick yet powerful overview of upcoming trends and trend evolutions within predefined life science interest fields. In the Cartographer application, you can study the relevance and evolution of a specific topic over time. Additionally, for every topic, related interest fields and terms are shown, and cross-domain trends can be followed. Furthermore, because NIH grant data are included, recent investments in a scientific field can be monitored. A folding menu also lists top articles, top organisations and top projects within the selected field or subfield of interest. This section can point you to high-impact publications or help you identify the most influential organisations in your interest sphere to partner with.