The C Word

Team Name: 
The Four Musketeers


 "The C Word" is a tool for a range of end users, including the general public, doctors, and the government. It provides an intuitive, interactive interface for exploring information about cancers. The tool can find links between cancers and predict cancer rates for the coming years. In testing, the predictions showed decent accuracy, with less than 10% error for the first 3 years. We consider this a notable result given the challenges (limited data and no advanced optimization of the model).
From a business perspective, "The C Word" provides an intuitive visualization of cancers and how they are connected. This lets doctors understand cancers better and look for possible causes based on those connections. The general public will also become more aware of cancers and of what matters to them.


Prediction is one of the key features of "The C Word". With decent accuracy already and the potential to improve it significantly, it can help the government anticipate health expenditure better. The general public can also see for themselves which cancers are expected to increase and which ones they should keep an eye on.


 We used machine learning techniques to explore correlated cancers and to predict the expected number of cancer cases.
To create the graph, we used the Graphical Lasso (glasso) to generate a graph of correlated cancers. These correlations capture how cancers covaried over the past two decades. However, simply inspecting the covariance matrix does not suffice, because it contains spurious correlations: two cancers can covary only because both covary with a third. Glasso instead estimates a sparse inverse covariance matrix. The logic is that if the inverse covariance between two cancers is zero, there is no direct link between them once the other cancers are accounted for.
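The reasoning above can be illustrated with a small sketch. It is not the glasso itself: it inverts a plain sample covariance with NumPy on synthetic data, and the series `a`, `b`, `c` and the helper `partial_correlations` are made up for illustration. Glasso adds an L1 penalty to this estimate so that weak entries become exactly zero, which matters when there are only a few yearly observations.

```python
import numpy as np


def partial_correlations(samples):
    """Return the partial-correlation matrix of `samples`.

    `samples` has shape (n_observations, n_variables). Entry (i, j) is the
    correlation between variables i and j after accounting for all others,
    derived from the inverse covariance (precision) matrix.
    """
    cov = np.cov(samples, rowvar=False)
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


# Synthetic chain a -> b -> c (illustrative only, not cancer data):
# a and c are linked only through b.
rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
samples = np.column_stack([a, b, c])

corr = np.corrcoef(samples, rowvar=False)
pcorr = partial_correlations(samples)

print("correlation(a, c):        ", round(corr[0, 2], 3))
print("partial correlation(a, c):", round(pcorr[0, 2], 3))
```

Here `a` and `c` are strongly correlated, yet their partial correlation is close to zero, so a graph built from the inverse covariance would draw no edge between them. A library implementation such as scikit-learn's `GraphicalLasso` could replace the plain inversion; the source does not say which implementation was used.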


The final word cloud was created by mining Twitter data for the keyword "#cancerresearch". The program tracks the main topics discussed over the past 3 days. The keyword can easily be changed to track other cancer-related hashtags.
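The counting step behind such a word cloud could look roughly like the sketch below. It assumes the tweets have already been fetched as (timestamp, text) pairs; the function name `top_terms`, the stopword list, and the sample tweets are all hypothetical and are not taken from the project.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "rt"}
URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"#?[a-z][a-z0-9]+")


def top_terms(tweets, now, hashtag="#cancerresearch", days=3, limit=5):
    """Count terms in tweets from the last `days` days, most frequent first."""
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for created_at, text in tweets:
        if created_at < cutoff:
            continue  # outside the tracking window
        cleaned = URL.sub(" ", text.lower())
        for token in TOKEN.findall(cleaned):
            # Skip the tracked hashtag itself and common filler words.
            if token == hashtag or token in STOPWORDS:
                continue
            counts[token] += 1
    return counts.most_common(limit)


# Made-up sample tweets for illustration.
now = datetime(2020, 1, 10)
tweets = [
    (datetime(2020, 1, 9), "New immunotherapy trial results #cancerresearch http://example.org/x"),
    (datetime(2020, 1, 8), "Funding for immunotherapy is growing #cancerresearch"),
    (datetime(2020, 1, 1), "Old tweet about screening #cancerresearch"),
]
top = top_terms(tweets, now)
print(top)
```

The resulting (term, count) pairs can be handed to any word cloud renderer, with the count controlling the font size. Tracking a different hashtag only requires changing the `hashtag` argument and the search keyword.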



Predictions are made using a state-of-the-art parallelized deep learning network. Deep networks are widely used in the medical domain for diagnosis, risk assessment, etc. The network is implemented in Python and optimized for GPUs using Theano; GPUs provide a large speed-up over CPU processing. The model takes data for the last 5 years and predicts the cancer rates for the next 10 years. Inputs include cancer type, gender, status (incidence/mortality), and the reported data for the past 5 years.
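One way to turn a yearly series into training examples of this shape (5 input years, 10 target years) is a sliding window, sketched below. The source does not say how the training pairs were built, so the windowing scheme and the helper name `make_windows` are assumptions; the categorical inputs (cancer type, gender, status) would be encoded and appended to each input separately.

```python
def make_windows(series, n_in=5, n_out=10):
    """Split a yearly series into (input, target) pairs.

    Each pair holds `n_in` consecutive values as input and the following
    `n_out` values as the prediction target.
    """
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        pairs.append((x, y))
    return pairs


# Placeholder series: the year itself stands in for each yearly value,
# so the window boundaries are easy to read.
years = list(range(1982, 2011))
pairs = make_windows(years)

print(len(pairs))                  # number of training pairs per series
print(pairs[0][0], pairs[0][1])    # first input window and its target
```

Under this scheme a single series covering 1982 to 2010 yields only 15 training pairs, which gives a sense of how little data is available per cancer type.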


The model has 4 layers of neurons with 225 nodes in each layer. It achieved less than 10% error for the first 3 predicted years and less than 20% error over the full 10 years. However, it is important to note that this is a basic model, and many optimization techniques could be introduced to improve the network. Furthermore, data is only available from 1982 to 2010, which is too little for a learning model to learn well. With more data over time, the model can be improved to reach higher accuracy.
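The error figures above can be read as a percentage error averaged over a prediction horizon. The source does not name the exact metric, so the sketch below assumes mean absolute percentage error; the function name and the numbers are made up for illustration and are not the project's results.

```python
def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = [100.0 * abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Illustrative values only: a forecast that drifts further from the
# actual figure with every year of the 10-year horizon.
actual = [100.0] * 10
predicted = [100.0 + 2.0 * k for k in range(1, 11)]

error_first_3 = mean_abs_pct_error(actual[:3], predicted[:3])
error_all_10 = mean_abs_pct_error(actual, predicted)

print("first 3 years:", error_first_3)
print("all 10 years: ", error_all_10)
```

Reporting the error separately for the first 3 years and for the full horizon, as done above, shows how quickly a forecast degrades the further ahead it looks.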




 Open source tools, libraries and code samples used include:
jQuery
Twitter Bootstrap
Compass
D3.js
jQuery CSV
Graph -

Datasets Used: 
Australian Cancer Incidence and Mortality

Local Event Location: