FortuneFusion: Bank Statement Processing SaaS

Own-product SaaS application that converts PDF bank statements from five Serbian financial institutions into structured XML - bank-specific processors, 4-tier extraction hierarchy, OCR fallback for scanned documents, async Python worker, and Docker Compose deployment.

Overview

Bank statements from different Serbian financial institutions are formatted differently - sometimes dramatically so. Erste Bank, Intesa, OTP, ProCredit, and UniCredit each produce PDFs with different table structures, column layouts, date formats, decimal conventions, and encoding approaches. Accounting and financial analysis software that needs to ingest transaction data from multiple banks faces a format fragmentation problem: a parsing approach that works reliably for one bank’s statements fails on another’s.

PDF is not a structured data format - it is a visual document format. The same numeric value might sit in a text layer, in an image layer, inside a table cell, or be accessible only via OCR on a scanned document. A statement generated directly from a bank’s core banking system is typically a native PDF with a text layer. A statement from an older account or a branch printout may be a scanned image with no text layer at all. The processing system needs to handle both without requiring the user to know which kind of file they are uploading.

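Two core architectural decisions shape the entire system: bank-specific processor modules and a 4-tier extraction hierarchy. The worker, frontend, deployment, and output layers are built around them.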

Bank-Specific Processor Modules

Rather than a generic PDF parser that attempts to handle all formats with a single approach, each bank has its own dedicated processor module with format-specific extraction logic. Edge cases specific to each bank - the way ProCredit handles account number formatting, the specific column order Intesa uses for debit and credit entries, the date format OTP uses that differs from the others - are handled explicitly in that bank’s processor, not worked around in a generic parser. When a new format variation appears in one bank’s statements, the change is localised to that processor without risking regressions in the others.
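The per-bank isolation described above can be sketched as a small registry of processor classes. This is a minimal illustration, not FortuneFusion's actual code: the class names, the `bank_a` / `bank_b` keys, the date patterns, and the column orders are all hypothetical placeholders and do not describe any real bank's format.

```python
from abc import ABC, abstractmethod
from datetime import date, datetime
from decimal import Decimal


class BankProcessor(ABC):
    """Base class: every bank subclass owns its own format quirks."""

    date_format: str      # strptime pattern for this bank (hypothetical values below)
    decimal_comma: bool   # True if amounts are written like 1.234,56

    def parse_date(self, raw: str) -> date:
        return datetime.strptime(raw.strip(), self.date_format).date()

    def parse_amount(self, raw: str) -> Decimal:
        text = raw.strip().replace(" ", "")
        if not text:
            return Decimal("0")  # an empty debit/credit cell counts as zero
        if self.decimal_comma:
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
        return Decimal(text)

    @abstractmethod
    def parse_row(self, cells: list[str]) -> dict:
        """Map one raw table row to a normalised transaction dict."""


class BankAProcessor(BankProcessor):
    # Hypothetical layout: date, description, debit, credit
    date_format = "%d.%m.%Y"
    decimal_comma = True

    def parse_row(self, cells: list[str]) -> dict:
        booked, description, debit, credit = cells
        return {
            "date": self.parse_date(booked),
            "description": description.strip(),
            "amount": self.parse_amount(credit) - self.parse_amount(debit),
        }


class BankBProcessor(BankProcessor):
    # Hypothetical layout: date, credit, debit, description
    date_format = "%Y-%m-%d"
    decimal_comma = False

    def parse_row(self, cells: list[str]) -> dict:
        booked, credit, debit, description = cells
        return {
            "date": self.parse_date(booked),
            "description": description.strip(),
            "amount": self.parse_amount(credit) - self.parse_amount(debit),
        }


# Registry: adding or changing one bank touches only its own entry.
PROCESSORS: dict[str, BankProcessor] = {
    "bank_a": BankAProcessor(),
    "bank_b": BankBProcessor(),
}


def get_processor(bank_key: str) -> BankProcessor:
    try:
        return PROCESSORS[bank_key]
    except KeyError:
        raise ValueError(f"no processor registered for {bank_key!r}") from None
```

In this sketch, two differently formatted rows for the same payment normalise to the same dictionary, and a format change at one bank only requires editing that bank's class.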

4-Tier Extraction Hierarchy

The extraction hierarchy tries each approach in sequence until one produces usable data: (1) direct text layer extraction, (2) Camelot table extraction with lattice mode for structured tables, (3) positional text analysis for layouts where table extraction fails, (4) Tesseract OCR as the final fallback for scanned documents. This means the processor handles native PDFs and scanned paper statements digitised from prior years through the same upload flow - no separate paths, no manual intervention required.
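The try-in-sequence behaviour can be sketched as a simple fallback chain. The four tier functions below are stubs standing in for the real extractors (text layer, Camelot lattice, positional analysis, Tesseract OCR). Their names and return values are assumptions for illustration, and "usable data" is approximated here as "a non-empty list of rows".

```python
from typing import Callable, Optional

Rows = list[dict[str, str]]
Extractor = Callable[[bytes], Optional[Rows]]


def extract_with_fallback(
    pdf_bytes: bytes, tiers: list[tuple[str, Extractor]]
) -> tuple[str, Rows]:
    """Run each tier in order; return the first non-empty result and its tier name."""
    for name, extractor in tiers:
        try:
            rows = extractor(pdf_bytes)
        except Exception:
            # A crashing tier is treated as "no usable data" so the next tier runs.
            rows = None
        if rows:
            return name, rows
    raise ValueError("no extraction tier produced usable data")


# --- Stub tiers (hypothetical stand-ins for the real extractors) ---

def text_layer(pdf_bytes: bytes) -> Optional[Rows]:
    return None  # pretend the PDF has no usable text layer


def camelot_lattice(pdf_bytes: bytes) -> Optional[Rows]:
    raise RuntimeError("no ruled table found")  # pretend table detection failed


def positional(pdf_bytes: bytes) -> Optional[Rows]:
    return [{"date": "2019-12-31", "amount": "-1500.00"}]


def tesseract_ocr(pdf_bytes: bytes) -> Optional[Rows]:
    return [{"date": "2019-12-31", "amount": "-1500.00"}]


TIERS: list[tuple[str, Extractor]] = [
    ("text_layer", text_layer),
    ("camelot_lattice", camelot_lattice),
    ("positional", positional),
    ("tesseract_ocr", tesseract_ocr),
]

tier_used, rows = extract_with_fallback(b"%PDF-1.7 ...", TIERS)
```

With these stubs the first two tiers yield nothing, so the third tier supplies the rows and OCR is never invoked. Ordering the list from cheapest to most expensive keeps OCR as a last resort.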

Async Python Worker Architecture

PDF parsing is CPU-intensive and can take seconds to minutes per document. Running it synchronously in the web request would create unacceptable timeouts and a poor user experience. The background worker model separates the PHP web application from the Python processing layer: file uploads return immediately, the web application shows processing status as the worker progresses, and the user retrieves their XML when the job completes. The processing layer is independently scalable - more worker containers can be added without changing the web application.
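A minimal sketch of the worker side of this model follows. It assumes a database-backed job table that the web application writes to and the worker polls. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name, column names, and status values (`queued`, `processing`, `done`, `failed`) are hypothetical.

```python
import sqlite3
from typing import Callable, Optional


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE jobs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               file_path TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'queued',
               result_xml TEXT
           )"""
    )


def enqueue(conn: sqlite3.Connection, file_path: str) -> int:
    """What the web layer would do on upload: insert a row and return at once."""
    cur = conn.execute("INSERT INTO jobs (file_path) VALUES (?)", (file_path,))
    conn.commit()
    return cur.lastrowid


def claim_next_job(conn: sqlite3.Connection) -> Optional[tuple[int, str]]:
    row = conn.execute(
        "SELECT id, file_path FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Conditional update: only succeeds if no other worker claimed the job first.
    cur = conn.execute(
        "UPDATE jobs SET status = 'processing' WHERE id = ? AND status = 'queued'",
        (row[0],),
    )
    conn.commit()
    return (row[0], row[1]) if cur.rowcount == 1 else None


def run_once(conn: sqlite3.Connection, process: Callable[[str], str]) -> bool:
    """Process at most one job. Returns False when the queue is empty."""
    job = claim_next_job(conn)
    if job is None:
        return False
    job_id, file_path = job
    try:
        xml = process(file_path)
        conn.execute(
            "UPDATE jobs SET status = 'done', result_xml = ? WHERE id = ?",
            (xml, job_id),
        )
    except Exception:
        conn.execute("UPDATE jobs SET status = 'failed' WHERE id = ?", (job_id,))
    conn.commit()
    return True
```

A real worker would call `run_once` in a loop with a short sleep when the queue is empty. The web application only ever reads the `status` column, which is what lets it show progress without waiting on the parse itself.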

PHP MVC Frontend

A PHP 8.1 MVC application handles uploads, user authentication, job queue management, and result delivery. It uses Guzzle as the HTTP client for API communication, PHPMailer for job completion notifications, and Monolog for structured logging across both the PHP and Python layers.

Docker Compose Deployment

PHP application, Python worker, and MySQL database packaged as three cooperating containers with automatic WSL2 IP detection for the development environment and a production-ready Compose configuration for server deployment. The database schema includes a job queue table that coordinates work between the web application and the Python worker with real-time status tracking.

Structured XML Output

Consistent output schema regardless of source bank format. Transaction data, account details, date ranges, and balance information in a validated XML structure that plugs directly into Serbian accounting and financial analysis systems - the same schema whether the source was an Erste native PDF or an OTP scanned statement from 2019.
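To illustrate what a bank-independent output step might look like, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`statement`, `period`, `balances`, `transaction`, and so on) are invented for this example and are not the actual FortuneFusion schema. The sign convention (negative amounts for debits) is also an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Transaction:
    booked_on: date
    description: str
    amount: Decimal  # negative for debits (assumed convention)


def build_statement_xml(
    bank: str,
    account: str,
    period_start: date,
    period_end: date,
    opening_balance: Decimal,
    closing_balance: Decimal,
    transactions: list[Transaction],
) -> str:
    """Serialise normalised statement data; the source bank only appears as an attribute."""
    root = ET.Element("statement", {"bank": bank, "account": account})
    ET.SubElement(
        root,
        "period",
        {"from": period_start.isoformat(), "to": period_end.isoformat()},
    )
    ET.SubElement(
        root,
        "balances",
        {"opening": str(opening_balance), "closing": str(closing_balance)},
    )
    tx_list = ET.SubElement(root, "transactions")
    for tx in transactions:
        node = ET.SubElement(
            tx_list,
            "transaction",
            {"date": tx.booked_on.isoformat(), "amount": str(tx.amount)},
        )
        node.text = tx.description
    return ET.tostring(root, encoding="unicode")


xml_text = build_statement_xml(
    bank="example-bank",
    account="000-0000000000000-00",
    period_start=date(2019, 12, 1),
    period_end=date(2019, 12, 31),
    opening_balance=Decimal("10000.00"),
    closing_balance=Decimal("8500.00"),
    transactions=[Transaction(date(2019, 12, 31), "Rent", Decimal("-1500.00"))],
)
```

Because every processor would hand this function the same normalised types, the serialised structure does not depend on which bank or which extraction tier produced the data.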

FortuneFusion is Dejan Markovic’s own product - designed, architected, and built end-to-end. It demonstrates technical range beyond WordPress development: multi-container Docker architecture, cross-language system design (a PHP frontend coordinating a Python backend), document processing pipelines with graceful degradation, and SaaS application architecture with async job processing.

For clients evaluating whether a WordPress and WooCommerce developer can handle complex backend integration work - API integrations, custom data processing pipelines, multi-service architectures - FortuneFusion is the reference project.

Location: Serbia

Dejan Markovic WordPress Architect
Best experience I've had to date with someone from Codeable. Dejan and his team jumped on a critical project over a weekend and had it sussed and patched on a Sunday; by Monday evening a fix was fully implemented. The team exceeded my expectations and I will be using them for all of my development needs going forward.
Eric R. | CEO & Founder, carsandcoffeeevents.com