diptanu an hour ago

Disclaimer - Founder of Tensorlake, we built a Document Parsing API for developers.

This is exactly the reason why Computer Vision approaches to parsing PDFs work so well in the real world. Relying on metadata in files just doesn't scale across different sources of PDFs.

We convert PDFs to images, run a layout understanding model on them first, then apply specialized models such as text recognition and table recognition to the detected regions, and stitch the results back together to get acceptable results for domains where accuracy is table stakes.
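
A minimal sketch of the rasterize-then-recognize idea, assuming PyMuPDF for rendering and pytesseract standing in for the layout/recognition models (file name and DPI are placeholders):

    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    # Render each page to an image, then run recognition on the pixels.
    # A production pipeline would insert a layout-detection step here and
    # route each detected region to a text or table recognizer instead.
    doc = fitz.open("statement.pdf")
    for page in doc:
        png = page.get_pixmap(dpi=300).tobytes("png")
        image = Image.open(io.BytesIO(png))
        print(pytesseract.image_to_string(image))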

  • spankibalt 13 minutes ago

    > "This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world."

    Well, to be fair, in many cases there's no way around it anyway since the documents in question are only scanned images. And the hardest problems I've seen there are narrative typography artbooks, department store catalogs with complex text and photo blending, as well as old city maps.

  • Alex3917 36 minutes ago

    > This is exactly the reason why Computer Vision approaches to parsing PDFs work so well in the real world.

    One of the biggest benefits of PDFs, though, is that they can contain invisible data. E.g. the spec allows me to embed, within my resume, cryptographic proof that I've worked at the companies I claim to have worked at. But a vision-based approach obviously isn't going to be able to capture that.

    • throwaway4496 31 minutes ago

      Cryptographic proof of job experience? Please explain more. Sounds interesting.

      • spankibalt 6 minutes ago

        Encrypted (and hidden) embedded information, e.g. documents, signatures, certificates, watermarks, and the like, conforming to (legally binding) standards, e.g. for notarization, et cetera.

      • rogerrogerr 12 minutes ago

        If someone told me there was cryptographic proof of job experience in their PDF, I would probably just believe them because it’d be a weird thing to lie about.

  • rkagerer 41 minutes ago

    So you've outsourced the parsing to whatever software you're using to render the PDF as an image.

    • bee_rider 38 minutes ago

      Seems like a fairly reasonable decision given all the high quality implementations out there.

      • throwaway4496 35 minutes ago

        How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out? Sounds like "I don't know programming, so I will just use AI".

        • sidebute a minute ago

          > Sounds like "I don't know programming, so I will just use AI".

          If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone how great Tensorlake is and buy lots more licenses.

          Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off the shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: if I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build both X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that precludes you from focusing on making your customers happy and growing the business.

          And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.

        • masterj 8 minutes ago

          If you start with curiosity, you might consider that other people might know more than you and do things for reasons, and ask what you are missing rather than making an ass of yourself like this.

        • do_not_redeem 13 minutes ago

          PDFs don't always lay out characters in sequence; sometimes they absolutely position individual characters instead.

          PDFs don't always use UTF-8; sometimes they assign random-seeming numbers to individual glyphs (this is common when unused glyphs are stripped from an embedded font, for example).

          etc etc
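
          A quick way to see this for yourself is to dump a page's decompressed content stream and look at the Tj/TJ operands; they are raw glyph codes in the font's own encoding, not necessarily readable text. A minimal sketch assuming PyMuPDF (file name is a placeholder):

              import fitz  # PyMuPDF

              # Print the raw drawing operators for the first page. The strings
              # fed to Tj/TJ are glyph codes, which only map to Unicode if the
              # font's encoding or /ToUnicode CMap says so.
              doc = fitz.open("example.pdf")
              print(doc[0].read_contents().decode("latin-1", "replace"))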

  • throwaway4496 36 minutes ago

    So you parse PDFs, but also OCR images, to somehow get better results?

    Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why rasterize it, OCR it, and then use AI? Sounds like creating a problem just so you can use AI to solve it.

  • throwaway4496 34 minutes ago

    This parallels some of the absurdities of the dot-com peak. We are at the AI peak now.

farkin88 an hour ago

Great rundown. One thing you didn't mention that I thought was interesting: incremental-save chains. The first startxref offset is fine, but the /Prev links that Acrobat appends on successive edits may point a few bytes short of the next xref. Most viewers (PDF.js, MuPDF, even Adobe Reader in "repair" mode) fall back to a brute-force scan for obj tokens and reconstruct a fresh table, so they work fine while a spec-accurate parser explodes. Building a similar salvage path is pretty much necessary if you want to work with real-world documents that have been edited multiple times by different applications.
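
A rough sketch of that salvage path, assuming the whole file fits in memory (file name is a placeholder):

    import re

    # Ignore the damaged xref offsets entirely and rebuild an object table by
    # scanning for "N G obj" headers. Later matches overwrite earlier ones,
    # which mirrors incremental-update semantics (newest definition wins).
    OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

    def rebuild_xref(raw: bytes) -> dict[tuple[int, int], int]:
        """Map (object number, generation) -> byte offset of its header."""
        table = {}
        for m in OBJ_HEADER.finditer(raw):
            table[(int(m.group(1)), int(m.group(2)))] = m.start()
        return table

    with open("edited.pdf", "rb") as f:
        offsets = rebuild_xref(f.read())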

  • UglyToad an hour ago

    You're right, this was a fairly common failure state in the sample set. The previous reference, or one further along the reference chain, would point to an offset of 0 or outside the bounds of the file, or just be plain wrong.

    What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite it more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed, and just relies on those offsets in the recovery path.

    However, it is considerably slower than the code it replaces, and it's hard to have confidence in the changes. I'm currently running through a 10,000-file test set trying to identify edge cases.

    [0]: https://github.com/UglyToad/PdfPig/pull/1102

    • farkin88 28 minutes ago

      That robustness-vs-throughput trade-off is such a staple of PDF parsing. My guess is that the new path is slower because the recovery scan now always walks the whole byte range and has to inflate any object streams it meets before it can trust the offsets, even when the first startxref would have been fine.

      The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?

      Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.

userbinator 25 minutes ago

As someone who has written a PDF parser: it's definitely one of the weirdest formats I've seen, and IMHO much of that comes from its attempt to be a mix of both binary and text. I suspect at least some of these weird "incorrect but close" xref offsets are caused by buggy code dealing with LF/CR conversions.

What the article doesn't mention is that a lot of newer PDFs (v1.5+) don't even have a regular textual xref table; the xref table is itself inside an "xref stream", and since 1.5 there is also the option of putting objects inside "object streams".
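
For anyone curious, the entries in an xref stream are fixed-width binary fields whose sizes come from the /W array. A minimal decoding sketch, assuming the stream payload has already been FlateDecoded and any /Predictor undone:

    def parse_xref_stream(data: bytes, widths: list[int]) -> list[tuple[int, ...]]:
        """Decode xref-stream entries given the /W widths, e.g. [1, 2, 1].

        Field 1 is the entry type (0 free, 1 plain offset, 2 inside an
        object stream); fields 2 and 3 are the offset/object-stream number
        and generation/index. A width of 0 means the field is absent; the
        type then defaults to 1, and this sketch defaults the others to 0.
        """
        defaults = (1, 0, 0)
        entry_size = sum(widths)
        entries = []
        for i in range(0, len(data), entry_size):
            row, pos, fields = data[i:i + entry_size], 0, []
            for field_index, w in enumerate(widths):
                if w == 0:
                    fields.append(defaults[field_index])
                else:
                    fields.append(int.from_bytes(row[pos:pos + w], "big"))
                    pos += w
            entries.append(tuple(fields))
        return entries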

wackget 2 hours ago

> So you want to parse a PDF?

Absolutely not. For the reasons in the article.

HocusLocus 35 minutes ago

Thanks kindly for this well-done and brave introduction. There are few people these days who'd even recognize the bare ASCII 'PostScript' form of a PDF at first sight. First step is to unroll it into ASCII, of course, and remove the first wrapper of Flate/ZIP, LZW, or RLE. I recently teased Gemini for accepting .PDF and not .EPUB (chapterized HTML in a zip, basically, with almost-guaranteed paragraph streams of UTF-8) and it lamented apologetically that its PDF support was opaque and library-oriented. That was very human of it. Aside from a quick recap of the most likely LZW wrapper format, a deep dive into Linearization and reordering the objects by 'first use on page X' and writing them out again preceding each page would be a good pain project.
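
That unrolling step can be approximated in a few lines; a rough sketch that only handles plain FlateDecode streams (no /Predictor, no chained filters; the file name is a placeholder):

    import re
    import zlib

    # Inflate every stream we can so the operators become readable text.
    # Non-Flate streams (LZW, RLE, raw image data) simply fail and are skipped.
    raw = open("input.pdf", "rb").read()
    for match in re.finditer(rb"stream\r?\n(.*?)\r?\nendstream", raw, re.DOTALL):
        try:
            body = zlib.decompressobj().decompress(match.group(1))
            print(body.decode("latin-1", "replace"))
        except zlib.error:
            pass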

UglyToad is a good name for someone who likes pain. ;-)

JKCalhoun 2 hours ago

Yeah, PDF didn't anticipate streaming. That pesky trailer dictionary at the end means you have to wait for the file to fully load to parse it.

Having said that, I believe there are "streamable" PDFs where there is enough info up front to render the first page (but only the first page).

(But I have been out of the PDF loop for over a decade now so keep that in mind.)

  • UglyToad 2 hours ago

    Yes, you're right: there are Linearized PDFs, which are organized to enable parsing and display of the first page(s) without having to download the full file. I left those out of the summary for now because they have a whole chunk of an appendix to themselves.

yoyohello13 an hour ago

One of the very first programming projects I tried, after learning Python, was a PDF parser to try to automate grabbing maps for one of my DnD campaigns. It did not go well lol.

simonw an hour ago

I convert the PDF into an image per page, then dump those images into either an OCR program (if the PDF is a single column) or a vision-LLM (for double columns or more complex layouts).

Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert to images and process those rather than attempting and failing to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the images-version of this now, thankfully.
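
A rough sketch of the image-per-page workflow described above, assuming PyMuPDF for rendering and the OpenAI chat completions API for the vision-LLM step (the model name, prompt, and file name are placeholders):

    import base64

    import fitz  # PyMuPDF
    from openai import OpenAI

    client = OpenAI()
    doc = fitz.open("report.pdf")

    # Render the first page to PNG and pass it to a vision model as a data URL.
    png = doc[0].get_pixmap(dpi=200).tobytes("png")
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page, preserving reading order."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    print(response.choices[0].message.content)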

  • UglyToad an hour ago

    If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.

    I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?

    • simonw 29 minutes ago

      I've been trying it informally and it's getting really good - Claude 4 and Gemini 2.5 seem to do a perfect job now, though I'm still paranoid that some rogue instruction in the scanned text (accidental or deliberate) might result in an inaccurate output.

  • trebligdivad an hour ago

    Sadly this makes some sense; PDF represents characters in the text as offsets into its fonts, and often those fonts are incomplete subsets, so an 'A' in the PDF is often not good old ASCII 65. In theory there are two optional mechanisms that should tell you it's an 'A' - except when they don't, so the only way to know is to use the font to draw it.

coldcode an hour ago

I parsed the original Illustrator format in 1988 or 1989, which is a precursor to PDF. It was simpler than today's PDF, but of course I had zero documentation to guide me. I was mostly interested in writing Illustrator files, not importing them, so it was easier than this.

sergiotapia an hour ago

I did some exploration using LLMs to parse, understand, and then fill in PDFs. It was brutal but doable. I don't think I could build a "generalized" solution like this without LLMs. The internals are spaghetti!

Also, god bless the open source developers. Without them it would also be impossible to do this in a timely fashion. pymupdf is incredible.

https://www.linkedin.com/posts/sergiotapia_completed-a-reall...

throwaway840932 an hour ago

As a matter of urgency, PDF needs to go the way of Flash; the same goes for TTF. Those that know, know why.

  • internetter an hour ago

    I think a PDF 2.0 would just be an extension of a single-file HTML page with a fixed viewport.