Bank statements from different Serbian financial institutions are formatted differently - sometimes dramatically so. Erste Bank, Intesa, OTP, ProCredit, and UniCredit each produce PDFs with different table structures, column layouts, date formats, decimal conventions, and encoding approaches. Accounting and financial analysis software that needs to ingest transaction data from multiple banks faces a format fragmentation problem: a parsing approach that works reliably for one bank’s statements fails on another’s.
PDF is not a structured data format - it is a visual document format. The same numeric value might sit in a text layer, inside a table cell, or in a scanned page image where it is only accessible via OCR. A statement generated directly from a bank’s core banking system is typically a native PDF with a text layer. A statement from an older account or a branch printout may be a scanned image with no text layer at all. The processing system needs to handle both without requiring the user to know which format they are uploading.
Two core architectural decisions shape the entire system:
Bank-Specific Processor Modules
Rather than a generic PDF parser that attempts to handle all formats with a single approach, each bank has its own dedicated processor module with format-specific extraction logic. Edge cases specific to each bank - the way ProCredit handles account number formatting, the specific column order Intesa uses for debit and credit entries, the date format OTP uses that differs from the others - are handled explicitly in that bank’s processor, not worked around in a generic parser. When a new format variation appears in one bank’s statements, the change is localised to that processor without risking regressions in the others.
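The per-bank isolation described above can be sketched as a small processor registry. This is a minimal illustration, not FortuneFusion's actual code: the class names, the bank_code lookup key, and above all the date and decimal formats assigned to each bank are assumptions chosen for the example.

```python
from abc import ABC, abstractmethod
from datetime import date, datetime
from decimal import Decimal


class BankProcessor(ABC):
    """Base class: one subclass per bank, each owning its own format quirks."""

    bank_code: str    # hypothetical lookup key
    date_format: str  # strptime pattern used by this bank's statements

    def parse_date(self, raw: str) -> date:
        return datetime.strptime(raw.strip(), self.date_format).date()

    @abstractmethod
    def parse_amount(self, raw: str) -> Decimal:
        """Convert the bank's printed amount into a Decimal."""


class OtpProcessor(BankProcessor):
    bank_code = "otp"
    date_format = "%d.%m.%Y"  # assumed format, for illustration only

    def parse_amount(self, raw: str) -> Decimal:
        # Assumed convention: dot as thousands separator, comma as decimal mark.
        return Decimal(raw.strip().replace(".", "").replace(",", "."))


class IntesaProcessor(BankProcessor):
    bank_code = "intesa"
    date_format = "%d/%m/%Y"  # assumed format, for illustration only

    def parse_amount(self, raw: str) -> Decimal:
        # Assumed convention: comma as thousands separator, dot as decimal mark.
        return Decimal(raw.strip().replace(",", ""))


# Registry: adding a bank means adding one class, without touching the others.
PROCESSORS = {cls.bank_code: cls for cls in (OtpProcessor, IntesaProcessor)}


def get_processor(bank_code: str) -> BankProcessor:
    try:
        return PROCESSORS[bank_code.lower()]()
    except KeyError:
        raise ValueError(f"No processor registered for bank: {bank_code}") from None
```

Because each quirk lives in exactly one subclass, a fix for one bank's amount parsing cannot change how another bank's statements are read.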
4-Tier Extraction Hierarchy
The extraction hierarchy tries each approach in sequence until one produces usable data: (1) direct text layer extraction, (2) Camelot table extraction in lattice mode for tables with ruled cell borders, (3) positional text analysis for layouts where table extraction fails, (4) Tesseract OCR as the final fallback for scanned documents. This means the processor handles both native PDFs and scanned paper statements digitised from prior years through the same upload flow - no separate paths, no manual intervention required.
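The try-in-sequence behaviour can be sketched as a simple fallback chain. This is an illustrative outline under stated assumptions: the four tier functions below are stand-in stubs rather than real calls into Camelot or Tesseract, and the is_usable check (every row has a date and an amount) is a hypothetical definition of "usable data".

```python
from typing import Callable, Optional

Rows = list[dict]
Extractor = Callable[[bytes], Optional[Rows]]


def is_usable(rows: Optional[Rows]) -> bool:
    """Hypothetical check: at least one row, each with a date and an amount."""
    if not rows:
        return False
    return all(r.get("date") and r.get("amount") for r in rows)


def extract_with_fallback(
    pdf_bytes: bytes, tiers: list[tuple[str, Extractor]]
) -> tuple[str, Rows]:
    """Run each tier in order and return the first usable result."""
    errors = []
    for name, extractor in tiers:
        try:
            rows = extractor(pdf_bytes)
        except Exception as exc:  # a failing tier must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if is_usable(rows):
            return name, rows
        errors.append(f"{name}: no usable rows")
    raise RuntimeError("All extraction tiers failed: " + "; ".join(errors))


# Stand-in stubs simulating a scanned statement with no text layer.
def text_layer(pdf: bytes) -> Optional[Rows]:
    return []  # nothing to read: the document is an image


def camelot_lattice(pdf: bytes) -> Optional[Rows]:
    raise ValueError("no tables found")


def positional(pdf: bytes) -> Optional[Rows]:
    return None


def tesseract_ocr(pdf: bytes) -> Optional[Rows]:
    return [{"date": "2019-03-05", "amount": "1234.56"}]


TIERS = [
    ("text_layer", text_layer),
    ("camelot_lattice", camelot_lattice),
    ("positional", positional),
    ("tesseract_ocr", tesseract_ocr),
]
```

Calling extract_with_fallback(b"", TIERS) with these stubs falls through the first three tiers and returns the OCR result, which is the path a scanned statement would take. Ordering the tiers from cheapest to most expensive means OCR only runs when nothing else works.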
Async Python Worker Architecture
PDF parsing is CPU-intensive and can take seconds to minutes per document. Running it synchronously in the web request would create unacceptable timeouts and a poor user experience. The background worker model separates the PHP web application from the Python processing layer: file uploads return immediately, the web application shows processing status as the worker progresses, and the user retrieves their XML when the job completes. The processing layer is independently scalable - more worker containers can be added without changing the web application.
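The hand-off between the web application and the worker can be sketched as a polling loop over a job table. This is a simplified illustration: it uses Python's built-in sqlite3 as a stand-in for the MySQL database so the example runs on its own, and the table layout, column names, and status values are assumptions rather than the real schema.

```python
import sqlite3

# Hypothetical job table; the real schema is not shown in this write-up.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',
    result_xml TEXT,
    error TEXT
)
"""


def claim_next_job(conn):
    """Claim the oldest queued job; return (id, file_path) or None."""
    row = conn.execute(
        "SELECT id, file_path FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status guard in WHERE keeps two workers from claiming the same job.
    cur = conn.execute(
        "UPDATE jobs SET status = 'processing' WHERE id = ? AND status = 'queued'",
        (row[0],),
    )
    conn.commit()
    return row if cur.rowcount == 1 else None


def run_once(conn, process):
    """Process one job if available. Returns True if a job was handled."""
    job = claim_next_job(conn)
    if job is None:
        return False
    job_id, file_path = job
    try:
        xml = process(file_path)  # the slow, CPU-heavy part runs here
        conn.execute(
            "UPDATE jobs SET status = 'done', result_xml = ? WHERE id = ?",
            (xml, job_id),
        )
    except Exception as exc:
        conn.execute(
            "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
            (str(exc), job_id),
        )
    conn.commit()
    return True
```

In this sketch the web application only inserts rows and reads the status column, so the upload request never waits on parsing. A worker container would call run_once in a loop, sleeping briefly whenever it returns False.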
PHP MVC Frontend
Upload handling, user authentication, job queue management, and result delivery in a PHP 8.1 MVC application. Guzzle HTTP client for API communication, PHPMailer for job completion notifications, Monolog for structured logging across both the PHP and Python layers.
Docker Compose Deployment
PHP application, Python worker, and MySQL database packaged as three cooperating containers with automatic WSL2 IP detection for the development environment and a production-ready Compose configuration for server deployment. The database schema includes a job queue table that coordinates work between the web application and the Python worker with real-time status tracking.
Structured XML Output
Consistent output schema regardless of source bank format. Transaction data, account details, date ranges, and balance information in a validated XML structure that plugs directly into Serbian accounting and financial analysis systems - the same schema whether the source was an Erste native PDF or an OTP scanned statement from 2019.
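A bank-independent output step might look like the sketch below. The element and attribute names are hypothetical, since the actual FortuneFusion schema is not published here, and the balance reconciliation check is an assumed example of what validation could involve.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def build_statement_xml(account, period_from, period_to,
                        opening, closing, transactions) -> str:
    """Serialise normalised statement data into one XML shape (hypothetical schema)."""
    # Assumed validation rule: opening balance plus all movements equals closing.
    total = sum((tx["amount"] for tx in transactions), Decimal("0"))
    if opening + total != closing:
        raise ValueError("Transactions do not reconcile with the balances")

    root = ET.Element("statement")
    ET.SubElement(root, "account").text = account
    ET.SubElement(root, "period", {"from": period_from, "to": period_to})
    ET.SubElement(root, "balance", {"opening": str(opening), "closing": str(closing)})
    txs = ET.SubElement(root, "transactions")
    for tx in transactions:
        el = ET.SubElement(txs, "transaction", {"date": tx["date"]})
        ET.SubElement(el, "amount").text = str(tx["amount"])
        ET.SubElement(el, "description").text = tx["description"]
    return ET.tostring(root, encoding="unicode")


# Placeholder data; the account number is not a real account.
example_xml = build_statement_xml(
    account="000-0000000000000-00",
    period_from="2019-03-01",
    period_to="2019-03-31",
    opening=Decimal("100.00"),
    closing=Decimal("75.50"),
    transactions=[
        {"date": "2019-03-05", "amount": Decimal("-24.50"), "description": "Card payment"},
    ],
)
```

Because every bank processor feeds the same function with already normalised dates and Decimal amounts, downstream systems never see which bank or extraction tier produced the data.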
FortuneFusion is Dejan Markovic’s own product - designed, architected, and built end-to-end. It demonstrates technical range beyond WordPress development: multi-container Docker architecture, cross-language system design (a PHP frontend coordinating a Python backend), document processing pipelines with graceful degradation, and SaaS application architecture with async job processing.
For clients evaluating whether a WordPress and WooCommerce developer can handle complex backend integration work - API integrations, custom data processing pipelines, multi-service architectures - FortuneFusion is the reference project.