
Hands-on: document data extraction with the 🍩 transformer

My experience using the Donut transformer model to extract invoice indexes.

Toon Beerten
5 min read · Mar 14, 2023


A lot of progress has been made recently in document understanding. I will show you what can be done with a current state-of-the-art model fine-tuned on invoices. I created a demo for you to try out, and I will also share some interesting observations.

The model I fine-tuned is called Donut, which stands for OCR-free Document Understanding Transformer. I know the abbreviation is a little far-fetched, but it's catchy nonetheless. If you want all the ins and outs, here is the original paper. In a nutshell: the authors trained a model on millions of documents together with their corresponding text, so that it learns to understand documents. They call it 'OCR-free'. Not free as in free beer, but rather in the sense that it accepts just images as input, with no accompanying text layer. Why this approach was chosen:

  1. avoid the extra computational cost of running OCR
  2. avoid the inflexibility of OCR solutions with respect to languages or document types
  3. avoid OCR errors turning into errors further down the extraction pipeline

So this model uses only one modality: vision. It's a transformer model that encodes visual information (your document as a JPG) and decodes it into text (the key information we want to extract). I won't go into more detail here; many other articles have already been written about the inner workings.
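To make that encode-and-decode step concrete, here is a minimal inference sketch using the 🤗 transformers API. The checkpoint and task prompt below are the publicly available CORD receipt fine-tune, used purely as a stand-in; you would swap in your own fine-tuned invoice model and its prompt.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Stand-in checkpoint: the public Donut model fine-tuned on CORD receipts.
model_id = "naver-clova-ix/donut-base-finetuned-cord-v2"
task_prompt = "<s_cord-v2>"  # each fine-tune has its own task start token

processor = DonutProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# The image is the only input: there is no OCR text layer.
image = Image.open("invoice.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Decode the generated tokens and turn them back into a JSON-like dict.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))
```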

My aim was to discover the limits and possibilities of this transformer model on invoices. I followed Niels Rogge's excellent tutorials on these matters. So let's go!


Training:
The model was trained (or rather fine-tuned) on a few thousand annotated invoices and non-invoices (for the latter the doctype is 'Other'). They span different countries and languages and are always a single page. Unfortunately, the dataset is proprietary. The model's input resolution is set to 1280x1920 pixels because my GPU was limited to 16 GB of RAM, so trying samples scanned at more than 150 dpi brings no additional benefit.
It was trained for about 4 hours on an NVIDIA RTX A4000 for 20k steps, with a val_metric of 0.0341 at the end. I used Paperspace for training. (If you consider subscribing, use my referral code RTF8I48 for $10 of free credit.)
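For reference, setting up that resolution before fine-tuning looks roughly like this, following Niels Rogge's tutorial. The base checkpoint is the public naver-clova-ix/donut-base; the max target length of 768 is my own assumption, not a value from my run.

```python
from transformers import (
    DonutProcessor,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

image_size = [1920, 1280]  # (height, width): the 1280x1920 resolution mentioned above
max_length = 768           # assumed cap on the length of the target token sequence

# Resize the Swin encoder to the chosen resolution and cap the decoder output length.
config = VisionEncoderDecoderConfig.from_pretrained("naver-clova-ix/donut-base")
config.encoder.image_size = image_size
config.decoder.max_length = max_length

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
processor.image_processor.size = {"height": 1920, "width": 1280}
processor.image_processor.do_align_long_axis = False

model = VisionEncoderDecoderModel.from_pretrained(
    "naver-clova-ix/donut-base", config=config
)
```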
The following indexes were included in the train set:

  • DocType
    Can be either 'Invoice' or 'Other'
  • Currency
    E.g. EUR, USD, THB, …
  • DocumentDate
    E.g. 12/01/2023, 04/19/22 or 25 september 2022 on the document, reformatted as yyyy-MM-dd when captured
  • NetAmount & GrossAmount & TaxAmount
    E.g. 12,222.00$, 500,12€ or 10 000 THB on the document, reformatted as xxxxx.xx during capturing
  • InvoiceNumber
    E.g. AE-2022567, … many different formats, no reformatting done here
  • OrderNumber
  • CreditorCountry
    E.g. DE, US, CA, …
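To give an idea of what the model actually learns to emit for these indexes, here is a sketch of how one ground-truth record is flattened into Donut's target token sequence. All field values are made up for illustration, and the json2token helper is adapted from the original Donut repository (the real training code additionally registers each <s_FieldName> tag as a special token).

```python
def json2token(obj, sort_json_key=False):
    """Flatten a ground-truth dict into Donut's target token sequence."""
    if isinstance(obj, dict):
        keys = sorted(obj.keys()) if sort_json_key else obj.keys()
        return "".join(
            f"<s_{k}>{json2token(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item, sort_json_key) for item in obj)
    return str(obj)

# Made-up ground truth for one invoice, following the reformatting conventions
# described above (yyyy-MM-dd dates, xxxxx.xx amounts, ISO country codes).
ground_truth = {
    "DocType": "Invoice",
    "DocumentDate": "2022-09-25",
    "Currency": "EUR",
    "NetAmount": "413.22",
    "TaxAmount": "86.78",
    "GrossAmount": "500.00",
    "InvoiceNumber": "AE-2022567",
    "OrderNumber": "4500123456",
    "CreditorCountry": "DE",
}

print(json2token(ground_truth))
# <s_DocType>Invoice</s_DocType><s_DocumentDate>2022-09-25</s_DocumentDate>...
```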

After training is done, it's time for some benchmarks to check how well our donut has learned:

Hurray!

Benchmark observations:
Of all documents in the validation set, 60% had every index captured completely correctly. This means not a single index was missing or incorrect. I have to say that is impressive, given the diversity of the dataset.

When you look at the results per index, you can see that more than 80% were captured correctly for all of them. But it might be more interesting to look at the indexes that were captured but incorrect. Why is it harder to extract, for instance, the invoice number correctly than, say, the currency?
I have a hunch it has to do with the variability of location and formatting.
I guess the gross amount scores better than the net amount because it's usually located in the lower right area.
The document date scores better because it may have less variation than possible invoice numbers.

Some other observations:

  • When testing the model with a non-invoice document, it is quite reliably identified as DocType 'Other'. If you upload an image of, let's say, a car, it will not return any garbage indexes.
  • The validation set contained mostly the same invoice layouts as the train set. I believe the results would be different if it were validated against invoices from completely different sources.
  • Reformatting works automatically. I made sure that, while preprocessing the dataset, the ground truth I collected along with the images was properly and consistently formatted. That pays off, because the JSON output is formatted in the same way, which allows easy comparison between ground truth and extracted results.
  • A real miss is some kind of bounding box output to know the position from which a certain index was extracted. This could be solved with heat maps, as the authors explained.
  • There is no (easy) way to get a meaningful confidence score for the answers. Here is a possible solution (I did not try this myself); a rough alternative is sketched after this list.
  • The country of the supplier was almost never captured incorrectly. I find that impressive, considering that the country I fed to the training (e.g. US, DE, FR) is never written exactly like that in the document. Somehow, somewhere it knows which image features to look out for. (Well, I suppose that all happened during training.)
  • The document date was recognized across different notations; however, it is often wrong because the dataset was not diverse enough (in terms of the time span of the dates).
  • I also fine-tuned the model at lower resolutions, but in my experience, the higher the resolution, the better the accuracy.
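On the confidence point: one rough workaround (my own sketch, not the solution linked above) is to average the probabilities the decoder assigned to the tokens it generated. It reuses model, processor, pixel_values and decoder_input_ids from the inference sketch earlier in the post.

```python
# Generate again, but keep the per-step scores so a crude document-level
# confidence can be derived from the decoder's token probabilities.
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    return_dict_in_generate=True,
    output_scores=True,
)

# Log-probability of each generated token under the model.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
token_probabilities = transition_scores[0].exp()
confidence = token_probabilities.mean().item()  # crude overall score in [0, 1]
print(f"approximate confidence: {confidence:.3f}")
```

A per-index confidence would require mapping token positions back to individual fields, which is more work; this only gives a document-level signal.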

Conclusion

I hope the observations from my sample use case were interesting to read and could spark you to experiment as well. This model uses just one modality, namely vision. I'm curious to see how multimodal models will do. I can imagine that supplying positional data and accurate text during training could immensely increase the effectiveness. So follow me, and maybe I'll make a post about it later :)

Try it out:

I created a space on Hugging Face where you can try it out. Upload an invoice and out come the indexes. For example:

In goes the image, out comes JSON-formatted text with the extracted indexes.

To use it, simply upload your image and click 'Submit', or click one of the examples to load them.


Written by Toon Beerten

Data science and machine learning aficionado. Sharing insights from my own experiences. www.linkedin.com/in/toonbeerten