Tesseract extracts all the text from a document, without trying to fix the reading order.
Tesseract runs in many more places, as it doesn't require a GPU.
Tesseract's plain-text output tends to include a lot of stray fragments, e.g. bits of text that appear in diagrams. It is good as a starting point and fine for most downstream tasks.
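For anyone who wants to see that raw output, here is a minimal sketch of calling the Tesseract CLI from Python on one page image. It assumes the `tesseract` binary is on the PATH and that the PDF has already been rendered to images, since Tesseract reads images rather than PDFs. The helper names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, lang="eng", psm=3):
    """Build a Tesseract CLI call that prints plain text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to <outputbase>.txt.
    # --psm 3 is fully automatic page segmentation (the default).
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=3):
    """Run Tesseract on one image and return whatever text it finds."""
    result = subprocess.run(
        tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Something like `ocr_image("page.png")` (a hypothetical file name) returns everything Tesseract found on the page, diagram labels included, which is where the stray fragments come from.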
I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas are completely rubbish, tables are nearly useless, etc.
I also tried Docling (which I believe is LLM-based). It works fine, but its output for the references section of the paper was too noisy. Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].
I settled on downloading the LaTeX source from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for converting BibTeX to CSL JSON.
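A rough sketch of what that pipeline can look like, assuming pandoc (2.11 or later, for the `bibtex` reader and `csljson` writer) is installed and the arXiv source has already been unpacked. The function names and the choice of plain-text output are assumptions for illustration, not necessarily the exact setup described above.

```python
import json
import subprocess


def latex_to_text_command(tex_path):
    """pandoc call that converts a LaTeX file to plain text on stdout."""
    return ["pandoc", tex_path, "-f", "latex", "-t", "plain"]


def bibtex_to_csl_command(bib_path):
    """pandoc call that converts a BibTeX file to CSL JSON on stdout."""
    return ["pandoc", bib_path, "-f", "bibtex", "-t", "csljson"]


def run_pandoc(command):
    """Run a pandoc command and return its stdout as a string."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def load_citations(bib_path):
    """Return the bibliography as a list of CSL JSON entries (dicts)."""
    return json.loads(run_pandoc(bibtex_to_csl_command(bib_path)))
```

With hypothetical file names, `run_pandoc(latex_to_text_command("main.tex"))` gives the paper body and `load_citations("refs.bib")` gives one dict per reference, with fields such as `title` and `author` where the BibTeX entry provides them.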
[1] Because of the number of output tokens, I had to split the PDF into pages and convert each one individually. Sometimes the API would take too long to respond, making the overall system quite slow.
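One way to soften the per-page slowness is to convert pages concurrently and retry the ones that fail or time out. This is a sketch of that idea, not what the footnote describes doing. `convert_page` is a hypothetical callable that wraps the API request for a single page and raises an exception on failure or timeout.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_pages(pages, convert_page, max_workers=8, retries=2):
    """Convert pages concurrently and return the results in page order.

    convert_page is a hypothetical callable (page -> text) that raises
    an exception when the request fails or times out.
    """

    def convert_with_retry(page):
        last_error = None
        # One initial attempt plus `retries` further attempts.
        for _ in range(retries + 1):
            try:
                return convert_page(page)
            except Exception as error:
                last_error = error
        # Every attempt failed: surface the last error to the caller.
        raise last_error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the texts line up with pages.
        return list(pool.map(convert_with_retry, pages))
```

Threads fit here because the work is waiting on network responses. Rate limits on the API side may still cap how high `max_workers` can usefully go.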
> The Mathpix mobile app has support for reading two column PDFs as a single column.
Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.
> You can't run it locally, though, right?
Unfortunately, no. Which is a shame, because I also have confidential documents to OCR and there is no way I'm putting them on someone else's cloud.