Metadata Tag Extractor

Team Name: 
OPEN_xcommay

Metadata Extractor

 

Metadata is as important as data. While data is expected to be generated carefully, metadata is often generated as an afterthought.

My first plan for GovHack was to analyse the trends and patterns in the metadata of data.gov.au. But after downloading all the metadata and wrangling the data, I found that the dataset tags are not in great shape.

There is a lot of redundancy between the tags (e.g. BIOMASS, Biomass and biomass are counted as three different tags!). In addition, the distribution of tag counts (the number of packages a tag is linked to) is highly skewed. One of the most used tags in data.gov.au is "dataset".
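That kind of case redundancy is easy to detect programmatically. A minimal sketch (the tag values below are illustrative, not the real data.gov.au tag list):

```python
from collections import defaultdict

def find_case_redundant(tags):
    """Group tags that differ only in letter case (e.g. BIOMASS / Biomass / biomass)."""
    groups = defaultdict(set)
    for tag in tags:
        groups[tag.casefold()].add(tag)
    # keep only the groups that have more than one spelling
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

print(find_case_redundant(["BIOMASS", "Biomass", "biomass", "dataset"]))
# {'biomass': ['BIOMASS', 'Biomass', 'biomass']}
```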

So I decided to write an app to help extract tags from metadata. The aim was to create an app that combines automation and user input to robustly identify tags.

Steps of Metadata Tag Extractor

1. Paste the text representing the data - usually the description.

2. The app automatically identifies keywords using the RAKE algorithm (https://github.com/zelandiya/RAKE-tutorial).

3. The app then identifies acronyms, using a simple function that I wrote. It is programmed not to alter words in CAPS, except for title-case words (e.g. This). It also removes commonly used words (the, is) and punctuation marks.

4. The identified keywords are then shown to the user, who decides whether to keep, change or delete each of them.

At this point, while most redundant words have been removed (like Biology vs biology), some remain, e.g. liability vs liable.

5. This step uses NLTK, Python's natural language toolkit, to link words that mean the same thing. It involves lemmatizing and stemming (http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatiz...). A lookup against the WordNet repository (https://wordnet.princeton.edu/) makes sure the word is real and human-readable (see attached images for the algorithm).

6. The results are then shown to the user.
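Step 3 can be sketched roughly as below. This is a toy stand-in, not the app's actual function: it keeps all-caps tokens (likely acronyms), lowercases title-case words, and drops stopwords and punctuation. The stopword list is a tiny illustrative sample:

```python
import re

# Illustrative sample only; the app uses a fuller common-word list.
STOPWORDS = {"the", "is", "a", "an", "of", "and"}

def extract_candidates(text):
    """Keep acronyms (CAPS) as-is, lowercase title-case words, drop stopwords."""
    keep = []
    for tok in re.findall(r"[A-Za-z]+", text):   # strips punctuation too
        if tok.isupper() and len(tok) > 1:        # acronym, e.g. CSIRO: leave alone
            keep.append(tok)
        else:                                     # title-case or lowercase: normalise
            low = tok.lower()
            if low not in STOPWORDS:
                keep.append(low)
    return keep

print(extract_candidates("The CSIRO dataset is about Biomass."))
# ['CSIRO', 'dataset', 'about', 'biomass']
```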

Access the app here:

http://wishva.pythonanywhere.com

Cool things about the app

The algorithm to group words together (households + household, wolf + wolves) is independent of the keyword extraction, which is a well-researched field.
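The grouping idea can be illustrated with a toy variant-grouper. The app itself uses NLTK's lemmatizing and stemming plus a WordNet check; the two suffix rules here are a deliberately naive stand-in that happens to cover the two examples above:

```python
def naive_singular(word):
    """Crude stand-in for real lemmatization (NLTK handles irregular forms properly)."""
    if word.endswith("ves"):
        return word[:-3] + "f"                    # wolves -> wolf (not always correct!)
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                          # households -> household
    return word

def group_variants(words):
    """Bucket surface forms under a shared base form."""
    groups = {}
    for w in words:
        groups.setdefault(naive_singular(w.lower()), []).append(w)
    return groups

print(group_variants(["households", "household", "wolf", "wolves"]))
# {'household': ['households', 'household'], 'wolf': ['wolf', 'wolves']}
```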

The app / algorithm can be used to "correct" some of the existing tags in data.gov.au.

The app does not dictate keywords. Instead it gently suggests them to the user, who decides which keyword(s) to pick.

The keywords are derived with a combination of scripting/semantics and user involvement. The resulting tags and the text can be used as a training set for machine learning.

Slides on the objective and algo: https://docs.google.com/presentation/d/1_C22-yqZT2PlEnVUfdbmoPv3Jy6m2fTX...

Weak points.

The app is NOT fault tolerant. Type "->" instead of "-->" and it will crash.

It sits on the free tier at http://wishva.pythonanywhere.com, so it could be slow.

It uses Bootstrap (which is good) but has a quirky text-based instruction method (-->) that I came up with to avoid writing JavaScript.

The code is pretty hacky.

 

TOOLS:

data.gov.au / CKAN API, Python, pandas, Flask and Emacs. :-)

---------------------------------------------------------------------------------------------------------------------------------------------

FOR JOURNALISM HACK - I made a representation of all the packages and tags of data.gov.au.

 

Large image: https://drive.google.com/file/d/0B73pTeG-PxleM0U1VFlNMVdIYm8/view?usp=sh...

I visualized the data relationships of data.gov.au using shared tags. This is a low-res image of that (see the link for a bigger image). It helps, sometimes, to take a step back to appreciate the true beauty of complex things.

Steps:

1. Extract all metadata through the API (JSON).

2. Extract the list of tags per package.

3. Nodes = packages and tags.

4. Edges = relationships between packages and tags.
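The four steps above can be sketched in plain Python (the real pipeline built the graph with networkx and rendered it in Cytoscape). The package names and tags below are invented examples, not actual data.gov.au records:

```python
# Step 2 output: tags per package (invented examples).
tags_per_package = {
    "water-quality-2015": ["water", "environment"],
    "river-levels": ["water", "rivers"],
}

# Steps 3-4: packages and tags both become nodes; each package-tag
# link becomes an edge. Shared tags ("water") connect packages.
nodes = set(tags_per_package) | {t for ts in tags_per_package.values() for t in ts}
edges = [(pkg, tag) for pkg, tags in tags_per_package.items() for tag in tags]

print(len(nodes), len(edges))
# 5 4
```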

Tools used - Python + pandas + networkx + Cytoscape

Cool things about the network / image.

* This shows how interconnected modern-day datasets are and highlights the importance of having good quality metadata.

* This is a snapshot of data.gov.au, which freezes the data contained in it at a particular point in time. As data.gov.au contains information related to us, the residents, it is a snapshot of us ourselves as well!

 

Weak points.

Bad colour scheme.

I should have used different colours for the tags and packages, or coloured the edges based on connectivity. (In the image the packages AND the tags are nodes!)

 

Disclaimer:

I am a programmer / analyst / bioinformatician trained mostly as a biologist! I am NOT an expert on natural language processing, the main theme of this project.

All the data / code / scripts / ipython notebook / flask scripts can be found at https://bitbucket.org/xcommay/opendatatools1

The volume of the video is quite low, except for the clicks of the touchpad. I apologize, but I think the manufacturer of my laptop is also to blame.

Thank you

Datasets Used: 
data.gov.au

Local Event Location: