The Bill OCR Project
In a nutshell, an image is uploaded to the cloud, where OCR is run on it. With current technology, such an app can be deployed with a few clicks. However, I implemented the vast majority of it myself.
The Bill OCR project consists of:
- A Google Cloud project maintained by Terraform with GitLab-managed state. It manages IAM, software package repositories, a Kubernetes cluster, monitoring, storage, DNS zones and automatic certificate renewal using Let's Encrypt.
- A control server that configures the database and the reverse proxy (ingress) and can deploy multiple instances of the application server, each of which can be backed up and restored separately. (Generally, I use at least one instance for production and another one for development.)
- An application server that runs a Pub/Sub subscriber and, mainly, a Django deployment that manages multiple apps implementing the application logic.
- The core app provides functionality used by all the other apps, such as a customized authentication system for the demo users and a generic object permission and quota system.
- The job app is a lightweight workload scheduler. It ensures that jobs are executed in the order of their dependencies. Currently, the jobs are executed on Kubernetes, and the app is notified about their status via Pub/Sub. A job can execute an arbitrary container image parametrized by a command, volumes and the job metadata. The output is expected on stdout/stderr and in the volumes.
- The actual data processing (the OCR) is split into a number of Docker container images. Thanks to the architecture of the job app, some of the containers can be executed without any modification or adapter. The full OCR pipeline involves 3 self-trained neural networks for different tasks and custom image analysis using OpenCV and NumPy.
- The Bill OCR app is the UI for all of this. It serves a RESTful API to a minimal JS client.
Each of these is in a separate GitLab project built during CI/CD using either Docker or Poetry. (For the Poetry projects I wrote a CI/CD component.)
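To illustrate what "execution of jobs in the order of their dependencies" means for the job app, here is a minimal Python sketch. It is not the project's actual code: the `Job` class, its field names, the stage names and the image names are all hypothetical, and the standard-library `graphlib` module stands in for whatever ordering logic the job app really uses.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Job:
    """Hypothetical job description; field names are illustrative only."""
    name: str
    image: str                  # container image to run (made-up names below)
    command: list[str]          # command passed to the container
    depends_on: set[str] = field(default_factory=set)


def execution_order(jobs: list[Job]) -> list[str]:
    """Return job names so that every job comes after its dependencies."""
    # Map each job to the set of jobs that must finish before it starts.
    graph = {job.name: job.depends_on for job in jobs}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


# A made-up three-stage pipeline, declared in arbitrary order.
jobs = [
    Job("recognize", "ocr/recognize", ["run"], {"detect-lines"}),
    Job("detect-lines", "ocr/detect", ["run"], {"preprocess"}),
    Job("preprocess", "ocr/preprocess", ["run"]),
]
order = execution_order(jobs)
print(order)
```

A real scheduler would presumably start a job as soon as all of its dependencies have reported success, rather than computing the whole order up front. The static order is only the simplest way to show the constraint.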
Future Plans
- Manual Correction of the Results
- Train the Models on Better Data
- Authenticating the Users with OAuth
- Speeding up the Analysis by Input Batching
- Speeding up the Analysis by Splitting the Text Line Detection Stage into Parallel Stages
- Speeding up the Analysis by Optimizing the Image Processing
- Upgrade Django and Related Refactoring
- ... there are a million things to play with ;-)