Comments Page - GLM-OCR: Accurate × Fast × Comprehensive

GLM-OCR: Accurate × Fast × Comprehensive (github.com)
Submitted by ms7892 4 days ago

coder543 2 hours ago
There are a bunch of new OCR models.
I’ve also heard very good things about these two in particular:
- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B
- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
The OCR leaderboards I’ve seen leave a lot to be desired.
With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.
I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.
- noahjohannessen 25 minutes ago
  is https://www.ocrarena.ai/ not accurate?
  coder543 16 minutes ago
  It is missing both of the models that I mentioned, so yes, I would say it is not accurate because it is so incomplete.
  It also doesn't provide error bars on the ELO, so models that only have tens of battles are being listed alongside models that have thousands of battles with no indication of how confident those ELOs are, which I find rather unhelpful.
  A lot of these models are also sensitive to how they are used, and offer multiple ways to be used. It's not clear how they are being invoked.
  That leaderboard is definitely one of the ones that leaves a lot to be desired.
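To put a number on the error-bar complaint above, here is a minimal sketch, assuming a model's record is reduced to wins and battles against an equally rated pool. It uses a Wilson score interval on the win rate and the standard logistic Elo formula; the battle counts are made up for illustration and do not come from any leaderboard.

```python
import math

def wilson_interval(wins: int, battles: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    if battles == 0:
        return (0.0, 1.0)
    p = wins / battles
    denom = 1 + z * z / battles
    center = (p + z * z / (2 * battles)) / denom
    half = z * math.sqrt(p * (1 - p) / battles + z * z / (4 * battles * battles)) / denom
    return (center - half, center + half)

def elo_gap(win_rate: float) -> float:
    """Elo difference implied by a win rate (standard logistic model)."""
    # Clamp so a 0% or 100% bound does not blow up the logarithm.
    win_rate = min(max(win_rate, 1e-9), 1 - 1e-9)
    return 400 * math.log10(win_rate / (1 - win_rate))

def elo_interval(wins: int, battles: int) -> tuple[float, float]:
    """Elo-gap interval obtained by mapping the win-rate interval through elo_gap."""
    low, high = wilson_interval(wins, battles)
    return (elo_gap(low), elo_gap(high))

# Hypothetical records: the same 60% win rate, very different sample sizes.
few = elo_interval(12, 20)
many = elo_interval(1200, 2000)
print("20 battles:  ", few)
print("2000 battles:", many)
```

With the same observed win rate, the interval for the 20-battle model is many times wider than for the 2000-battle one, which is the information a bare Elo column hides.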
- StableAlkyne an hour ago
  How do these compare to something like Tesseract?
  I remember that one clearing the scoreboard for many years, and usually it's the one I grab for OCR needs due to its reputation.
  chaps an hour ago
  Tesseract v4 when it was released was exceptionally good and blew everything out of the water. Have used it to OCR millions of pages. Tbh, I miss the simplicity of Tesseract.
  The new models are similarly better compared to Tesseract v4. But what I'll say is: don't expect new models to be a panacea for your OCR problems. The edge-case problems that you might be trying to solve (like identifying anchor points, or identifying shared field names across documents) are pretty much all still problematic. So you should still expect things like random spaces or unexpected characters to jam up your jams.
  Also, some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite loop, think of that.
  kergonath an hour ago
  Tesseract does not understand layout. It’s fine for character recognition, but if I still have to pipe the output to an LLM to make sense of the layout and fix common transcription errors, I might as well use a single model. It’s also easier for a visual LLM to extract figures and tables in one pass.
  chaps an hour ago
  For my workflows, layout extraction has been so inconsistent that I've stopped attempting to use it. It's simpler to just throw everything into postgis and run intersection checks on size-normalized pages.
  kergonath 2 minutes ago
  Interesting. What kind of layout do you have?
  My documents have one or two-column layouts, often inconsistently across pages or even within a page (which tripped older layout detection methods). Most models seem to understand that well enough so they are good enough for my use case.
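One possible reading of the "intersection checks on size-normalized pages" idea above, sketched in plain Python as a stand-in for actual PostGIS queries. The `Box` type, the field region, and the page sizes are all hypothetical; the point is only that dividing by page width and height lets boxes from differently sized scans be compared directly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def normalize(box: Box, page_width: float, page_height: float) -> Box:
    """Rescale a box to a unit page so scans of different sizes are comparable."""
    return Box(box.x0 / page_width, box.y0 / page_height,
               box.x1 / page_width, box.y1 / page_height)

def intersects(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes overlap with non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

# Hypothetical region (unit-page coordinates) where a known field is expected.
date_field = Box(0.70, 0.05, 0.95, 0.10)

# The same word detected on a 612x792 page and on a 1224x1584 scan of it.
word_small = normalize(Box(450, 50, 520, 70), 612, 792)
word_large = normalize(Box(900, 100, 1040, 140), 1224, 1584)

print(intersects(word_small, date_field), intersects(word_large, date_field))
```

In a database setup the same check would presumably be a spatial join on the normalized geometries; this sketch does not attempt to reproduce that.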
mikae1 19 minutes ago
Text me back when there's a working PDF to EPUB conversion tool. I've been waiting (and searching for one) long enough. :D
EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.
alaanor an hour ago
There were so many OCR models released in the past few months, all VLM models, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document) they just fail at a "simple" task. And funnily enough Qwen3 8B VL is the best model, and it usually gets it right (although I couldn't get the bbox quite right). Even more funny, whatever is running on an iPhone locally on CPU is insanely good, same with Google's OCR API. I don't know why we don't get more of the traditional OCR stuff. Paddlepaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.
- Stagnant 32 minutes ago
  Chrome ships a local OCR model for text extraction from PDFs which is better than any of the VLM or open source OCR models I've tried. I had a few hundred gigs of old newspaper scans and after trying all the other options I ended up building a wrapper around the DLL it uses to get the text and bboxes. Performance and accuracy are on another level compared to Tesseract, and while VLM models sometimes produced good results they just seemed unreliable.
  I've thought of open sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code can build a functioning prototype if you just point it to the "screen_ai" dir under Chrome's user data.
- ghrl an hour ago
  I remember someone building a meme search engine for millions of images using a cluster of used iPhone SE's because of Apple's very good and fast OCR capabilities. Quite an interesting read as well: https://news.ycombinator.com/item?id=34315782
  fzysingularity an hour ago
  Apple OCR even on the Mac is insanely good, in fact way better than AWS textract/GCP cloud vision OCR.
  Any idea what model is being used?
  AlphaSite an hour ago
  Probably some custom model built for their hardware.
aliljet 2 hours ago
This is actually the thing I really desperately need. I'm routinely analyzing contracts that were faxed to me, scanned with monstrously poor resolution, wet signed, all kinds of shit. The big LLM providers choke on this raw input and I burn up the entire context window for 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...
And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.
- coder543 2 hours ago
  If you want OCR with the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even pass all the pages in parallel in separate requests, and get the better quality response much faster too.
  But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.
  staticman2 27 minutes ago
  Gemini Pro 3 seems to be built for handling multiple page PDFs.
  I can feed it a multi-page PDF and tell it to convert it to markdown and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested on AI Studio but I think the API works the same way.)
  coder543 22 minutes ago
  It's not that they can't do multiple pages... but did you compare against doing one page at a time?
  How many pages did you try in a single request? 5? 50? 500?
  I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.
  Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.
  One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.
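A minimal sketch of the one-page-per-request pattern described above. `ocr_page` is a hypothetical callable standing in for whatever provider call you use (it is not a real API); the only thing shown is fanning pages out in parallel and getting results back in page order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

def ocr_document(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 8,
) -> list[str]:
    """OCR each page in its own request; results come back in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order even when requests finish out of order.
        return list(pool.map(ocr_page, pages))

# Stand-in for a real network call, for demonstration only.
def fake_ocr_page(image: bytes) -> str:
    return f"text of a {len(image)}-byte page"

print(ocr_document([b"a", b"bb", b"ccc"], fake_ocr_page))
```

Threads are enough here because the work is network-bound. A real version would also want retries and rate limiting, which are left out of this sketch.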
  HPsquared an hour ago
  You could maybe then do a second pass on the whole text (as plain text not OCR) to look for likely mistakes.
  kergonath an hour ago
  This is not always easy. The models I tried were too helpful and rewrote too much instead of fixing simple typos. When I tried I ended up with huge prompts and I still found sentences where the LLM was too enthusiastic. I ended up applying regexes with common typos and accepted some residual errors. It might be better now, though. But since then I’ve moved to all-in-one solutions like Mathpix and Mistral-OCR which are quite good for my purpose.
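For the "regexes with common typos" approach mentioned above, a small sketch of what such a conservative post-processing pass might look like. The patterns are illustrative guesses at typical OCR confusions, not anyone's actual list; unlike an LLM pass, it can only change what the patterns explicitly match.

```python
import re

# Hypothetical (pattern, replacement) pairs for common OCR confusions.
COMMON_FIXES = [
    (re.compile(r"\btbe\b"), "the"),            # h misread as b
    (re.compile(r"\b0f\b"), "of"),              # letter o misread as zero
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),    # l or I between digits
    (re.compile(r" {2,}"), " "),                # runs of spaces
]

def fix_typos(text: str) -> str:
    """Apply each fix in order; anything not matched is left untouched."""
    for pattern, replacement in COMMON_FIXES:
        text = pattern.sub(replacement, text)
    return text

print(fix_typos("tbe  sum 0f 2l5"))
```

The trade-off is the one described in the comment: some residual errors stay, but nothing gets rewritten that you did not ask for.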
- chrsw 2 hours ago
  I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.
  aliljet 2 hours ago
  All of healthcare is crying. Trust me.
  Imustaskforhelp 2 hours ago
  I suppose tears of joy?
  fragmede an hour ago
  Of sadness because they're not allowed to use it yet.
- daveguy 2 hours ago
  If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average 95% accuracy on messy inputs. If that's a per-character accuracy (which OCR is generally measured by), that's going to be 5+ errors per page of 100+ words. If you really can't afford mistakes you have to consider the OCR inaccurate. If you have key components like "days to respond" and "units vacant" you need to identify the presence of those specifically with bias in favor of false positives (over false negatives), and human confirmation of the source -> OCR.
  kergonath an hour ago
  > If you really can't afford mistakes you have to consider the OCR inaccurate.
  Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.
  aliljet 17 minutes ago
  This is precisely the real question. If you're exceeding human transcription, you may be generally pretty good. The question is what happens when you tell a human to become surgical about some part of the document: how does the comparison change then?
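The per-character arithmetic in this sub-thread can be worked through quickly. This sketch assumes roughly 6 characters per word (including the space) and independent errors, both of which are simplifications introduced here; under them, "5+ errors" per 100-word page is a floor rather than an estimate.

```python
def expected_char_errors(words: int, char_accuracy: float, chars_per_word: float = 6.0) -> float:
    """Expected number of wrong characters on a page (chars_per_word is an assumption)."""
    return words * chars_per_word * (1.0 - char_accuracy)

def prob_error_free(words: int, char_accuracy: float, chars_per_word: float = 6.0) -> float:
    """Chance that every character on the page is right, assuming independent errors."""
    return char_accuracy ** (words * chars_per_word)

for accuracy in (0.95, 0.99, 0.999):
    print(accuracy, expected_char_errors(100, accuracy), prob_error_free(100, accuracy))
```

Even at 99.9% per-character accuracy, a 100-word page is more likely than not to contain at least one error under these assumptions, which is why the advice to verify key fields by hand still applies.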
- yieldcrv an hour ago
  > I burn up the entire context window for 30 pages of text
  We analyze 200 page contracts no problem
  I think you're doing it wrong or in an antiquated way (until context window sizes improve)
  Are you doing this programmatically at all or are you doing something closer to dropping a contract into a chat window?
  We use a main agent to classify the pages and we build subagents that are familiar with page classifications and are fed page ranges. They all have their own full context window and prompts
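A rough sketch of the hand-off step described above: once a main agent has labeled each page, contiguous runs of the same label can be collapsed into page ranges for subagents. The labels and the function name are hypothetical, and the classification itself is out of scope here.

```python
from itertools import groupby

def page_ranges(labels: list[str]) -> list[tuple[str, int, int]]:
    """Collapse per-page labels into (label, first_page, last_page), 1-indexed inclusive."""
    ranges = []
    page = 1
    for label, group in groupby(labels):
        count = len(list(group))
        ranges.append((label, page, page + count - 1))
        page += count
    return ranges

# Hypothetical classifier output for a six-page contract.
labels = ["cover", "terms", "terms", "terms", "signature", "signature"]
print(page_ranges(labels))
```

Each tuple could then be given to a subagent with its own prompt and context window, so no single request has to hold the whole document.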
- cinntaile 2 hours ago
  Deciphering fax messages? What is this, the 90s?
  kergonath an hour ago
  We have decades of internal reports on film that we’d like to make accessible and searchable. We don’t do it with new documents, but we have a huge backlog.
  xyproto an hour ago
  Fax is still hard to hack, so some organizations have kept it alive for security.
ks2048 an hour ago
I've been trying different OCR models on what should be very simple - subtitles (these are simple machine-rendered text). While all models do very well (95+% accuracy), I haven't seen a model not occasionally make very obvious mistakes. Maybe it will take a different approach to get the last 1%...
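For anyone wanting to measure that kind of accuracy themselves, a minimal sketch of character-level accuracy based on edit distance. This is one common way to score OCR output against a reference, not necessarily how the figure above was obtained; the sample strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 minus the character error rate, floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))

print(char_accuracy("I'll be back", "I'll be hack"))
```

A single wrong character in a short subtitle line already costs several percentage points, so averaged scores can look high while obvious mistakes remain.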
sinandrei 37 minutes ago
Has anyone experimented with using a VLM to detect "marks"? Thinking of pen/pencil-based markings like underlines, circles, checkmarks... Can these models do it?
- leetharris 22 minutes ago
  None of them do it well from our experience. We had to write our own custom pipeline with a mixture of legacy CV approaches to handle this (AI contract analysis). We constantly benchmark every new multimodal and VLM model that comes out and are consistently disappointed.
  coder543 9 minutes ago
  If someone releases a benchmark/dataset, I'm sure that significantly increases the chances of one of these AI labs training on the task.
rdos 2 hours ago
Is it possible for such a small model to outperform Gemini 3, or is this a case of benchmarks not showing the reality? I would love to be hopeful, but so far an open source model has never been better than a closed one, even when benchmarks were showing that.
- amluto 2 hours ago
  Off the top of my head: for a lot of OCR tasks, it’s kind of worse for the model to be smart. I don’t want my OCR to make stuff up or answer questions — I want it to recognize what is actually on the page.
  rdos an hour ago
  Interesting. Won't stuff like entity extraction suffer? Especially in multilingual use cases. My worry is that a smaller model might not realize some text is actually a person's name because it is very unusual.
- woeirua 29 minutes ago
  No. Gemini is clearly the leader across the board: https://www.ocrarena.ai/leaderboard
bugglebeetle an hour ago
I tested this pretty extensively and it has a common failure mode that prevents me from using it: extracting footnotes and similar from the full text of academic works. For some reason, many of these models are trained in a way that results in these being excluded, despite these document sections often containing important details and context. Both versions of DeepseekOCR have the same problem. Of the others I’ve tested, dot-ocr in layout mode works best (but is slow) and then datalab’s chandra model (which is larger and has bad license constraints).
- droidjj an hour ago
  I have been looking for an OCR model that can accurately handle footnotes. It’s essential for processing legal texts in particular, which often have footnotes that break across pages. Sadly I’ve yet to encounter a good solution.