Textify

Team Name: 
WYSIWYG

Project Description:

The Australian War Memorial collections contain historical data about many aspects of life during World War I. They include a large number of document images comprising letters, diaries, forms, and other handwritten and printed communication. The text in these images is not machine-readable, so it cannot be edited or searched by a text processing system. Managing such a large volume of image/video data effectively requires indexing based on keywords or other information, which in turn requires each image to be manually annotated and tagged with keywords. Most search tools to date use keywords to retrieve information from databases. This becomes cumbersome when searching videos/images and can produce inconsistent results, because the video/image the user is interested in may have been tagged with a different keyword. Manual annotation of such a large number of videos/images is not feasible because of the manpower involved and the time required to process each item. Automatic annotation is therefore the more feasible approach. In automatic annotation, the text present in the images (e.g. text on letters, diaries, and forms) is used to generate keywords, which are then used to retrieve the appropriate videos/images. The text present in a video/image thus plays an important role in automatic indexing and effective retrieval from the database.

Textify is an application that performs automatic annotation and keyword tagging for effective indexing and retrieval. It is built using artificial intelligence (AI) and machine learning techniques. The application provides the following functionality:

1. Optical character recognition (OCR) of document images to automatically create a machine-readable, editable version of each document. This is particularly useful for reprinting or reproducing diaries, letters, etc.

2. Automatic keyword generation for indexing the images, using the machine-readable version of the text.

3. Automatic clustering of similar documents/images based on keywords.

4. Searching document images based on custom text/keywords.
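Functions 2 to 4 above can be illustrated with a minimal sketch. It is not Textify's actual implementation: it assumes the OCR step has already produced plain text for each document, and it swaps in simple word-frequency keywords and Jaccard overlap for whatever machine learning techniques the application uses. All names here (`extract_keywords`, `build_index`, `search`, `cluster`, the stop-word list, and the sample documents) are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "at", "for", "is",
              "are", "was", "were", "we", "my", "our", "with", "from", "it", "when"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-words in OCR'd text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Sort by descending frequency, then alphabetically so results are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def build_index(documents, top_n=5):
    """Map each keyword to the set of document ids tagged with it."""
    index = {}
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text, top_n):
            index.setdefault(keyword, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents tagged with every non-stop-word in the query."""
    terms = [t for t in re.findall(r"[a-z]+", query.lower()) if t not in STOP_WORDS]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def cluster(documents, threshold=0.2, top_n=5):
    """Greedy clustering: join the first cluster whose seed keywords overlap enough."""
    clusters = []  # list of (seed keyword set, [doc ids])
    for doc_id, text in documents.items():
        keywords = set(extract_keywords(text, top_n))
        for seed, members in clusters:
            if jaccard(keywords, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((keywords, [doc_id]))
    return [members for _, members in clusters]


# Hypothetical OCR output for three document images.
documents = {
    "letter_01": "Dear mother, the trenches at Gallipoli are cold. "
                 "The trenches flood when it rains.",
    "diary_07": "Gallipoli trenches again today. Rain in the trenches all night.",
    "form_12": "Enlistment form. Name, age, occupation, next of kin.",
}

index = build_index(documents)
print(search(index, "Gallipoli trenches"))
print(cluster(documents))
```

In this sketch a document is retrievable only through its top few keywords, which keeps the index small but means rarer words in a document are not searchable; raising `top_n` trades index size for recall.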

The application will be useful for digital document management, covering both historical and recently digitized documents. State and national libraries that maintain large volumes of image/video data will benefit greatly from it. Members of the public will also benefit: they can create editable versions of text images when required and access reproduced versions of historical books, diaries, letters, etc.

The archived document images, books, letters, diaries, etc. from Australian and New Zealand state archives will be reused. Much of the information about culture, education, lifestyle, etc. during the 1900s is hidden in these document images, so an editable, machine-readable form of the text makes a large amount of meaningful information accessible. The document images will also be automatically tagged with keywords found in their text for indexing. OCR makes it easy to reproduce the books, diaries, etc., which in turn creates business opportunities for the publishing industry.

Datasets Used: 
In this project, we used public datasets from the National Archives of Australia, the Australian War Memorial, and Archives New Zealand.

Local Event Location: