Case study · Knowledge management · Regulated document extraction · In production

Many vendor formats. One register. Every value traceable.

We built a pipeline that reads each row of each PDF independently, checks it against the team’s vocabulary, and links every value back to the spot on the page it came from. The pipeline runs on a cloud language model hosted in Melbourne, so the data never leaves Australia.

90% extraction accuracy across the reference document set the team validated against


Chapter 01 · The problem

The data existed. It was locked inside PDFs from several different vendors.

The team’s job was straightforward to describe and miserable to do. Every site had a report from an external vendor. Each vendor wrote their report differently: different column names, different field ordering, different ways of labelling the same item. Before anyone could act on a report, a team member had to find the right entry, read the row, and verify the status against the source PDF.

The team had been quietly running a parallel spreadsheet. One person kept it current by re-typing the reports, page by page. The cost was not just the typing time. Nobody trusted the spreadsheet enough to act on it without re-reading the source PDF, so the data entry produced very little change in the actual workflow.

Off-the-shelf tools struggled with the document shapes. Some lost the nesting between header rows and detail rows. Some lost rows that crossed page breaks. None could tell the team which vendor format a given PDF used, so the team had to identify each report by hand.

Multi-vendor · multi-format

Chapter 02 · The approach

One row at a time. Every value linked to the source.

The system reads each PDF twice: once to pull out every table cell while preserving the layout, then again, one row at a time, using a cloud language model to extract the regulated fields the team needs.

Every value the team sees on screen is linked to the spot on the PDF it came from. The reviewer interface shows the value on the left and the highlighted region of the source page on the right, so the team can verify any number without flipping back to the report. That was the feature the team trusted before they trusted the numbers.

The data sovereignty story matters because the documents are regulated. The pipeline runs on AWS Bedrock in Melbourne, hosted in IRAP-aligned facilities, so the source PDFs and every extracted value stay inside Australia. The model provider contract excludes the team’s data from any future model training, so what the team uploads is never used to improve a model another customer queries.
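As a rough illustration of the source link described in this chapter, and not the production schema, each extracted value can carry the page and region it came from. The names below (Cell, LinkedValue, extract_row, extract_fields) are invented for this sketch, and the language model is replaced by a stand-in function that returns which cell each field came from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical; the case study does not publish its schema.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class Cell:
    """One table cell from the first pass, with its position on the page."""
    text: str
    page: int
    bbox: BBox


@dataclass(frozen=True)
class LinkedValue:
    """One extracted field plus the region a reviewer interface could highlight."""
    field: str
    value: str
    page: int
    bbox: BBox


def extract_row(
    row: List[Cell],
    extract_fields: Callable[[List[str]], Dict[str, int]],
) -> List[LinkedValue]:
    """Second pass over a single row.

    `extract_fields` stands in for the model call: it receives the row's cell
    texts and returns {field name: index of the source cell}.
    """
    picks = extract_fields([cell.text for cell in row])
    linked = []
    for field_name, index in picks.items():
        cell = row[index]  # the value is read from the cell, so the link is exact
        linked.append(LinkedValue(field_name, cell.text, cell.page, cell.bbox))
    return linked


# Stand-in for the language model: a fixed mapping for one made-up layout.
def fake_model(texts: List[str]) -> Dict[str, int]:
    return {"item": 0, "status": 2}


row = [
    Cell("Valve A-12", page=3, bbox=(40.0, 210.0, 180.0, 224.0)),
    Cell("2024-03-01", page=3, bbox=(190.0, 210.0, 260.0, 224.0)),
    Cell("Compliant", page=3, bbox=(270.0, 210.0, 340.0, 224.0)),
]
values = extract_row(row, fake_model)
```

In this sketch the model returns a cell index instead of free text, which is one way to guarantee that every value shown to a reviewer has coordinates behind it. Whether the production system works this way is an assumption on our part.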

What we built

A document-extraction pipeline: PDF in, structured records out. Each row is extracted in isolation, checked against the team’s existing vocabulary, fixed when the obvious things are wrong, recovered when the parser missed a section, and saved with a link back to the source page.
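The vocabulary check and the "fix the obvious things" step might look something like the sketch below. The vocabulary terms, the function name check_against_vocabulary, and the 0.85 similarity cutoff are all assumptions made for illustration. The fuzzy match uses Python's standard difflib module, which is our choice here and not necessarily what the production pipeline uses.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical vocabulary; the team's real terms are not in the case study.
VOCABULARY = ["Compliant", "Non-compliant", "Not inspected"]


def check_against_vocabulary(
    raw: str, vocabulary: List[str], cutoff: float = 0.85
) -> Tuple[Optional[str], str]:
    """Return (term, outcome); outcome is 'exact', 'fixed' or 'needs_review'."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace from the PDF
    by_lower = {term.lower(): term for term in vocabulary}
    if cleaned in vocabulary:
        return cleaned, "exact"
    if cleaned.lower() in by_lower:
        return by_lower[cleaned.lower()], "fixed"  # case-only difference
    close = difflib.get_close_matches(
        cleaned.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if close:
        return by_lower[close[0]], "fixed"  # small typo such as a dropped letter
    return None, "needs_review"  # never guess; leave it for a person


results = [
    check_against_vocabulary(raw, VOCABULARY)
    for raw in ["Compliant", "  non-compliant ", "Complant", "Pending"]
]
```

A high cutoff is deliberate in this sketch: when two terms differ by a prefix, as "Compliant" and "Non-compliant" do, a loose fuzzy match could flip a status. Anything the check cannot place confidently is returned as needs_review instead of being corrected.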

First production cut within eight weeks. Refined sprint by sprint since.

Chapter 03 · The outcome

90% accuracy, $1.65 per document, every value linked to its source.

90%
extraction accuracy across the reference document set the team validated against. Accuracy is tracked per job, so the team sees the score for the document in front of them, not just the average.
$1.65 · 31 min
production cost and wall-clock time per long-form document (typically 150 to 250 pages), inside the team’s cost and turnaround targets.
Every value
links back to its source PDF page with a highlighted region, so the team can verify any number without re-reading the report.
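To show why per-job accuracy matters more than the average, here is a small sketch with made-up data. It assumes accuracy means the share of extracted values that match the reviewer-validated value. The case study does not state the exact definition, and the job ids and the function name accuracy_per_job are illustrative only.

```python
from typing import Dict, List, Tuple

# Each entry: (job id, extracted value, reviewer-validated value). Made-up data.
Check = Tuple[str, str, str]


def accuracy_per_job(checks: List[Check]) -> Dict[str, float]:
    """Fraction of validated values that match, computed separately per job."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for job_id, extracted, validated in checks:
        totals[job_id] = totals.get(job_id, 0) + 1
        correct[job_id] = correct.get(job_id, 0) + (extracted == validated)
    return {job: correct[job] / totals[job] for job in totals}


checks = [
    ("job-001", "Compliant", "Compliant"),
    ("job-001", "Compliant", "Compliant"),
    ("job-001", "Compliant", "Compliant"),
    ("job-001", "Compliant", "Compliant"),
    ("job-002", "Compliant", "Non-compliant"),
    ("job-002", "Not inspected", "Not inspected"),
]
scores = accuracy_per_job(checks)  # job-001 scores 1.0, job-002 scores 0.5
overall = sum(e == v for _, e, v in checks) / len(checks)  # 5 of 6 correct
```

In this toy data the overall figure looks healthy while one document is only half right, which is the case the per-job score is there to surface.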

Outcome figures and architectural choices are real and verified against the production rollout. Client identity and specific product names have been generalised so the pattern reads across verticals.

Chapter 04 · What we learned

The source link, not the accuracy score, was the adoption result.

The temptation with extraction is to read a whole table in one go. It reads beautifully on the happy path. It also fails silently the moment the source has a merged cell, a page break, or a row the vendor flagged differently than expected. Reading one row at a time is more work, but it is honest. Every row either comes back or it does not, and the system can tell you which.
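A minimal sketch of that accounting, under our own assumptions: each row is extracted in isolation, and a failure is recorded against the row index instead of being swallowed. JobReport, extract_rows and toy_extractor are hypothetical names, and the toy extractor stands in for the model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class JobReport:
    extracted: Dict[int, Dict[str, str]] = field(default_factory=dict)
    failed: Dict[int, str] = field(default_factory=dict)  # row index -> reason


def extract_rows(
    rows: List[List[str]],
    extract_one: Callable[[List[str]], Dict[str, str]],
) -> JobReport:
    """Extract every row on its own so one bad row cannot hide another."""
    report = JobReport()
    for index, row in enumerate(rows):
        try:
            report.extracted[index] = extract_one(row)
        except ValueError as exc:
            report.failed[index] = str(exc)  # recorded, not swallowed
    return report


# Hypothetical extractor standing in for the model call.
def toy_extractor(row: List[str]) -> Dict[str, str]:
    if len(row) != 2:
        raise ValueError(f"expected 2 cells, got {len(row)}")
    return {"item": row[0], "status": row[1]}


rows = [
    ["Valve A-12", "Compliant"],
    ["Pump B-03"],  # e.g. a row damaged by a page break
    ["Tank C-07", "Not inspected"],
]
report = extract_rows(rows, toy_extractor)
```

Every input row ends up in exactly one of the two dictionaries, so the count of extracted plus failed rows always equals the number of rows read.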

A fix that helps one vendor’s reports can quietly break another’s. The team learned to check what kind of document the system is looking at before applying any shortcut. Pipelines that serve more than one vendor need to ask what kind of document this is, not just what is in it.
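One way to express "check what kind of document this is first" is to detect the format from the header row and only then apply fixes registered for that format. The sketch below is illustrative: the vendor names, column names, and functions (detect_format, rename_vendor_b, normalise) are all invented, and real format detection would likely need more than header matching.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]

# Hypothetical header signatures; real vendor layouts are not in the case study.
FORMAT_SIGNATURES: Dict[str, frozenset] = {
    "vendor_a": frozenset({"Item", "Status", "Inspected On"}),
    "vendor_b": frozenset({"Asset", "Result", "Date"}),
}


def detect_format(header: List[str]) -> Optional[str]:
    """Return the format whose signature columns all appear in the header."""
    columns = {h.strip() for h in header}
    for name, signature in FORMAT_SIGNATURES.items():
        if signature <= columns:
            return name
    return None


def rename_vendor_b(row: Row) -> Row:
    # A shortcut that is only valid for vendor_b's column names.
    mapping = {"Asset": "Item", "Result": "Status", "Date": "Inspected On"}
    return {mapping.get(key, key): value for key, value in row.items()}


# Fixes are registered per format, so a fix for one vendor never runs on another.
FIXES: Dict[str, List[Callable[[Row], Row]]] = {
    "vendor_a": [],
    "vendor_b": [rename_vendor_b],
}


def normalise(header: List[str], row: Row) -> Row:
    fmt = detect_format(header)
    if fmt is None:
        raise ValueError("unrecognised report format; apply no shortcuts")
    for fix in FIXES[fmt]:
        row = fix(row)
    return row


normalised = normalise(
    ["Asset", "Result", "Date"],
    {"Asset": "Pump B-03", "Result": "Compliant", "Date": "2024-03-01"},
)
```

An unrecognised header raises an error here instead of falling through to some default set of fixes, which keeps a shortcut written for one vendor from quietly running on a report it was never tested against.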

The interface that links every value to its source was the feature the team trusted before they trusted the numbers. The accuracy score was the engineering result. The source link was the adoption result. We would build the source link first next time, because the team will not act on data they cannot verify regardless of how accurate the engineering says it is.

The same shape of solution turns up wherever a regulated PDF has to become a row in a database that has to be defensible later. A clinic transcribing pathology reports into the patient record has the same problem. A finance team pulling controls evidence out of a vendor’s audit pack has the same problem. Different vocabularies, same trust problem, same answer: one row at a time, every value linked to the source.

Settled handoff rate · Source link · the adoption lever

Chapter 05 · From the team

The first week, two people kept opening the source PDFs anyway, just to check. By the end of the second week nobody was opening them. That was the day we knew it was working.

Compliance lead, anonymised

Have a stack of reports nobody fully trusts?

A 45-minute audit. We look at the documents your team relies on, where the data needs to end up, and what the actual bottleneck is between the two. You leave with a one-page memo whether or not we are the right fit.
