Would like to know how this compares to https://github.com/tesseract-ocr/tessera...

rahimnathwani · on Feb 28, 2025

Tesseract is multilingual.

Tesseract extracts all text from the document, without trying to fix reading order.

Tesseract runs in many more places, as it doesn't require a GPU.

Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.

maleldil · on March 1, 2025

I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas are completely rubbish, tables are nearly useless, etc.

I also tried Docling (which I believe is LLM-based). It works fine, but the references section of the paper came out too noisy. Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].

I settled for downloading the LaTeX code from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for BibTeX to CSL JSON.
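For anyone wanting to try the same route, here is a minimal sketch of the two pandoc calls, wrapped in Python. The file names are placeholders, and the choice of plain-text output for the paper body is an assumption (the comment above doesn't say which output format was used). The `bibtex` reader and `csljson` writer are pandoc format names available in recent pandoc releases.

```python
import subprocess


def latex_to_plain_cmd(tex_path: str, out_path: str) -> list[str]:
    """Build a pandoc command that parses LaTeX source into plain text.

    The plain-text target is an assumption; swap "-t plain" for
    "-t markdown" or "-t json" depending on what you need downstream.
    """
    return ["pandoc", "-f", "latex", "-t", "plain", tex_path, "-o", out_path]


def bibtex_to_csljson_cmd(bib_path: str, out_path: str) -> list[str]:
    """Build a pandoc command that converts a BibTeX file to CSL JSON."""
    return ["pandoc", "-f", "bibtex", "-t", "csljson", bib_path, "-o", out_path]


def run(cmd: list[str]) -> None:
    """Run a command, raising CalledProcessError if pandoc exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; requires pandoc on PATH.
    run(latex_to_plain_cmd("paper.tex", "paper.txt"))
    run(bibtex_to_csljson_cmd("refs.bib", "refs.json"))
```

Keeping the command construction separate from the `subprocess` call makes it easy to log or inspect the exact command before running it over a large batch.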

[1] Because of the number of output tokens, I had to split the PDF into pages and individually convert each one. Sometimes, the API would take too long to respond, making the overall system quite slow.
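A rough sketch of that page-by-page setup, with one possible way to soften the slowness: convert pages concurrently and retry a page when a call fails or times out. `convert_page` is a hypothetical stand-in for whatever function calls the API for a single page; the worker count and retry count are arbitrary assumptions, not something from the original setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def convert_pages(
    pages: Sequence[str],
    convert_page: Callable[[str], str],
    max_workers: int = 8,
    retries: int = 2,
) -> list[Optional[str]]:
    """Convert each page independently, in parallel, keeping page order.

    A page that still fails after `retries` extra attempts yields None,
    so one slow or broken page does not sink the whole document.
    """

    def attempt(page: str) -> Optional[str]:
        for _ in range(retries + 1):
            try:
                return convert_page(page)
            except Exception:
                # Timeout or API error: try again until attempts run out.
                continue
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `pages`.
        return list(pool.map(attempt, pages))
```

Pages that come back as `None` can be collected and re-run later, which avoids redoing the whole PDF.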

jesuslop · on Feb 28, 2025

and mathpix

rahimnathwani · on Feb 28, 2025

Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.

You can't run it locally, though, right?

kergonath · on Feb 28, 2025

> The Mathpix mobile app has support for reading two column PDFs as a single column.

Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.

> You can't run it locally, though, right?

Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I'm putting them on someone else’s cloud.

rahimnathwani · on Feb 28, 2025

Did you try marker? https://github.com/VikParuchuri/marker

I haven't tried olmocr yet and I now realize my 8GB GPU probably won't cut it, as it uses a 7B-parameter VLM under the hood.
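The back-of-the-envelope arithmetic behind that worry, as a small sketch. It counts model weights only and ignores activations, the KV cache, and framework overhead, so real usage would be higher. The precisions shown are generic assumptions, not a statement about how olmocr actually loads its model.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


GPU_GB = 8  # the card mentioned above

# 7B parameters at 16-bit precision (2 bytes each): 14 GB, over budget.
fp16_gb = weight_memory_gb(7e9, 2)

# Hypothetical 4-bit quantization (0.5 bytes each): 3.5 GB of weights.
int4_gb = weight_memory_gb(7e9, 0.5)

print(f"fp16: {fp16_gb:.1f} GB, fits in {GPU_GB} GB: {fp16_gb < GPU_GB}")
print(f"int4: {int4_gb:.1f} GB, fits in {GPU_GB} GB: {int4_gb < GPU_GB}")
```

So at 16-bit precision the weights alone exceed 8 GB; whether a quantized variant is available or accurate enough is a separate question.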

kergonath · on March 1, 2025

> Did you try marker?

I did not, but I will. Thanks for the pointer!