
Would like to know how this compares to https://github.com/tesseract-ocr/tesseract


Tesseract is multilingual.

Tesseract extracts all text from the document, without trying to fix the reading order.

Tesseract runs in many more places, as it doesn't require a GPU.

Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.
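For reference, a minimal sketch of calling the Tesseract CLI from Python on a CPU-only machine, with more than one language at once. The helper names `tesseract_command` and `ocr_image` are made up for illustration, and it assumes the `tesseract` binary and the relevant language packs are installed.

```python
import shutil
import subprocess


def tesseract_command(image_path, languages=("eng",), output_base="stdout"):
    """Build a Tesseract CLI invocation (hypothetical helper).

    Languages are joined with '+', which is how the CLI accepts several
    at once; an output base of 'stdout' prints the text instead of
    writing a file.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def ocr_image(image_path, languages=("eng",)):
    """Run Tesseract on one image and return the raw extracted text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned text is unfiltered, so any text inside diagrams shows up in it too.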


I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas come out as complete rubbish, tables are nearly useless, etc.

I also tried Docling (which I believe is LLM-based). It works fine, but its output for the references section of the paper was too noisy. Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].

I settled on downloading the LaTeX source from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for converting BibTeX to CSL JSON.
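Roughly, the pandoc side could look like the sketch below. The function names and file paths are placeholders, and plain text as the output format is an assumption, since any pandoc writer would do. The `bibtex` reader and `csljson` writer need a reasonably recent pandoc (2.11 or later, as far as I know).

```python
import subprocess


def latex_to_text_command(tex_path, out_path):
    """pandoc invocation that parses LaTeX and writes plain text.

    Plain text is an assumed target; swap --to for markdown, json, etc.
    """
    return ["pandoc", "--from", "latex", "--to", "plain",
            tex_path, "--output", out_path]


def bibtex_to_csljson_command(bib_path, out_path):
    """pandoc invocation that converts a BibTeX file to CSL JSON."""
    return ["pandoc", "--from", "bibtex", "--to", "csljson",
            bib_path, "--output", out_path]


def run(command):
    """Execute a pandoc command, raising if pandoc exits with an error."""
    subprocess.run(command, check=True)
```

Usage would be something like `run(latex_to_text_command("paper.tex", "paper.txt"))` followed by `run(bibtex_to_csljson_command("refs.bib", "refs.json"))`.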

[1] Because of the number of output tokens, I had to split the PDF into pages and individually convert each one. Sometimes, the API would take too long to respond, making the overall system quite slow.
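A sketch of the per-page approach with retries, in case it is useful to anyone. Here `convert_page` is a hypothetical callable standing in for whatever sends one page to the API; nothing below is specific to Gemini. Running pages concurrently is my own assumption about how to cut the wall-clock time, and it is subject to whatever rate limits the API has.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_with_retry(convert_page, page, attempts=3):
    """Call convert_page(page), retrying when it raises (e.g. on a timeout)."""
    last_error = None
    for _ in range(attempts):
        try:
            return convert_page(page)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
    raise RuntimeError(
        f"page conversion failed after {attempts} attempts"
    ) from last_error


def convert_pages(convert_page, pages, max_workers=4, attempts=3):
    """Convert pages concurrently; results come back in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(
                lambda page: convert_with_retry(convert_page, page, attempts),
                pages,
            )
        )
```

`pool.map` preserves input order, so the converted pages can be joined back together directly.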


and mathpix


Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.

You can't run it locally, though, right?


> The Mathpix mobile app has support for reading two column PDFs as a single column.

Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.

> You can't run it locally, though, right?

Unfortunately, no, which is a shame because I also have confidential documents to OCR and there is no way I'd put them on someone else’s cloud.


Did you try marker? https://github.com/VikParuchuri/marker

I haven't tried olmocr yet and I now realize my 8GB GPU probably won't cut it, as it uses a 7B-parameter VLM under the hood.
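The back-of-the-envelope arithmetic behind that, as a sketch. It counts model weights only (no activations, KV cache, or framework overhead), and the 4-bit line is purely illustrative; I don't know whether olmocr ships or supports a quantized variant.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# 7B parameters at 16-bit precision (2 bytes each): about 13 GiB.
fp16_gib = weight_memory_gib(7e9, 2)

# Hypothetical 4-bit quantization (0.5 bytes each): about 3.3 GiB.
int4_gib = weight_memory_gib(7e9, 0.5)

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

So at 16-bit precision the weights alone are well over 8GB.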


> Did you try marker?

I did not, but I will. Thanks for the pointer!



