Automatic Document Processing

Overview:

Build a system that takes a scanned image of a document as input and generates a structured output, for example in XML format, that captures the information contained in the document. This can greatly reduce the manual effort of entering the data into a system.
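As a rough illustration of what such a structured output could look like, the sketch below serializes a list of extracted fields to XML using Python's standard library. The element names (`document`, `field`), the `bbox` attribute and the sample values are assumptions made for this example only; they are not the project's actual schema.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(doc_type, fields):
    """Serialize extracted fields into an XML string.

    `fields` is a list of dicts with the hypothetical keys
    "name", "text" and "bbox" (x0, y0, x1, y1 in pixels).
    """
    root = ET.Element("document", attrib={"type": doc_type})
    for field in fields:
        element = ET.SubElement(
            root,
            "field",
            attrib={
                "name": field["name"],
                # Bounding box stored as a comma-separated string (illustrative choice).
                "bbox": ",".join(str(v) for v in field["bbox"]),
            },
        )
        element.text = field["text"]
    return ET.tostring(root, encoding="unicode")


# Made-up sample fields, standing in for the output of the recognition stage.
example_xml = fields_to_xml(
    "invoice",
    [
        {"name": "heading", "text": "Invoice", "bbox": (40, 30, 300, 70)},
        {"name": "total", "text": "1250.00", "bbox": (420, 610, 520, 640)},
    ],
)
print(example_xml)
```

Keeping the bounding box next to the text is one possible design choice: it would let a reviewer trace every extracted value back to its location in the scanned image.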

The following are the challenges for this project:

1. Each type of document will have a different format. Hence it is tedious to write a generic document parser algorithmically.

2. Even within a single document type, for example an invoice, many variations are possible in format and content.

3. Noise introduced by scanning, and other artifacts such as watermarks in the scanned image, can degrade performance.

We propose to build an active learning framework using deep neural networks and graphical models, in which the system learns incrementally to identify the various fields in a document. The following are typical examples of elements that the system aims to retrieve from the document:

1. Various structures like headings, paragraphs, figures, tables, header/footnotes, images, logos, straight lines

2. Overall template of the document

3. Text content within each structure identified in item 1 above

The system can accept human feedback and update the underlying model parameters, so that errors are reduced when it is presented with a new document of the same format.
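The feedback loop described above can be sketched in a few lines. The example below is only a toy: it uses a simple multiclass perceptron as a stand-in for the deep neural networks and graphical models the project proposes, and every name in it (`OnlineFieldClassifier`, `most_uncertain`, the feature names) is hypothetical. It shows two ideas: picking the region the model is least sure about to show to a human, and updating the parameters from the human's correction.

```python
class OnlineFieldClassifier:
    """Toy multiclass perceptron standing in for the real model (assumption)."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One sparse weight vector (feature name -> weight) per label.
        self.weights = {label: {} for label in self.labels}

    def scores(self, features):
        return {
            label: sum(w.get(name, 0.0) * value for name, value in features.items())
            for label, w in self.weights.items()
        }

    def predict(self, features):
        scores = self.scores(features)
        # Ties are resolved in favour of the label listed first.
        return max(self.labels, key=lambda label: scores[label])

    def margin(self, features):
        """Gap between the two best scores; a small gap means low confidence."""
        ranked = sorted(self.scores(features).values(), reverse=True)
        return ranked[0] - ranked[1]

    def update(self, features, correct_label):
        """Apply a human correction; weights change only if the model was wrong."""
        predicted = self.predict(features)
        if predicted == correct_label:
            return
        for name, value in features.items():
            right = self.weights[correct_label]
            wrong = self.weights[predicted]
            right[name] = right.get(name, 0.0) + value
            wrong[name] = wrong.get(name, 0.0) - value


def most_uncertain(model, regions):
    """Uncertainty sampling: choose the region to send to the human annotator."""
    return min(regions, key=model.margin)


# Hypothetical hand-made features for regions of a scanned page.
model = OnlineFieldClassifier(["heading", "paragraph", "table"])
table_region = {"has_grid_lines": 1.0, "line_count": 0.8}
model.update(table_region, "table")  # the human says: this region is a table

new_region = {"has_grid_lines": 1.0, "line_count": 0.6}
unseen_region = {"font_size": 2.0}
predicted_after = model.predict(new_region)
to_review = most_uncertain(model, [new_region, unseen_region])
print(predicted_after, to_review)
```

After a single correction the toy model labels a similar region as a table, while a region with features it has never seen still has zero margin and would be the next one sent for review.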

TEAM MEMBERS:

Prof. Neelam Sinha

Demo