Hi all, I'm the author of llama-ocr. Thank you for sharing and for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on Together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.
Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!
I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.
Is this amount of larger transformation expected/desirable?
(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...
I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.
Have you considered that usage yet?
> Need an example image? Try ours.

Great idea. I wish more services had a similar feature.
How accurate is this?
When compared with existing OCR systems, what sorts of mistakes does it make?
Option to use a local LLM?
I made a script that does exactly the same thing, but locally, using koboldcpp for inference. It downloads MiniCPM-V 2.6 with its image projector the first time you run it. You can use a different model if you want, but you'll want to edit the instruct template to match.
MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far, not just for OCR, but also image analysis. I have it set up so my NVR (Frigate) sends a couple of images to Ollama with MiniCPM-V 2.6 upon a motion alert from a driveway security camera. I'm able to get a reasonably accurate description of the vehicle that pulled into the driveway, including a description of the person that exits the vehicle and also the license plate, all sent to my phone.
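For anyone wanting to replicate this kind of pipeline: Ollama's `/api/generate` endpoint accepts base64-encoded images alongside a prompt. A minimal sketch (the model name, prompt, and default port are assumptions; adjust for your setup):

```python
import base64
import json
import urllib.request

def build_payload(image_bytes, model="minicpm-v",
                  prompt="Describe the vehicle and any visible license plate."):
    # Ollama's /api/generate takes base64-encoded images alongside the prompt.
    return {"model": model, "prompt": prompt,
            "images": [base64.b64encode(image_bytes).decode()],
            "stream": False}

def describe(image_path):
    """Send one snapshot to a local Ollama server on its default port.
    Assumes the minicpm-v model has already been pulled."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

From there, forwarding the returned description to a phone is just whatever notification service you already use.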
I love this. Can you share the source?
All it does is send the image to Llama 3.2 Vision and ask for it to read the text.
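That flow is small enough to sketch. A rough Python equivalent using Together's OpenAI-compatible endpoint (the model name, prompt, and use of the openai package are my guesses, not the project's actual source):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    # Vision chat APIs accept images inlined as base64 data URLs.
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode()

def image_to_markdown(path, api_key):
    from openai import OpenAI  # Together's API speaks the OpenAI protocol
    client = OpenAI(api_key=api_key, base_url="https://api.together.xyz/v1")
    with open(path, "rb") as f:
        data_url = to_data_url(f.read())
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Convert this document to markdown. "
                     "Return only the markdown, no explanation."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return resp.choices[0].message.content
```

Everything interesting lives in the prompt; the rest is plumbing.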
Note that this is just as open to hallucination as any other LLM output: it isn't reading the pixels looking for text characters, it's describing the picture, using the images it was trained on and their captions to determine what the text is. It may completely make up words, especially if it can't read them.
This is also true for any other OCR system, we just never called these errors “hallucinations” in this context.
I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?
No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. For one thing, OCR systems are deterministic. Deterministic. Look it up.
OCR systems use vision models too, and as such they can make mistakes. They don't sample, but they do produce a probability distribution over words, just like LLMs.
One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!
It really isn't since those systems are character based.
OCR tools sometimes make errors, but they don't make things up. There's a difference.
Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)
That said, I have a few questions if OP/anyone knows the answers:
1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI
2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?
P.S. The header links are broken on my desktop browser -- no onClick triggered
[1] https://facebookresearch.github.io/nougat/
The project author is a DevRel at Together.ai. This is a fantastic way to advertise a dev tool, though.
My guess is together.ai is at least partially sponsoring the demo.
Yeah was hoping for something I could self-host, both for privacy and cost.
together.ai serves 100+ open-source models including multi-modal Llama 3.2 with an OpenAI compatible API
Here's a bit of a quirk: I uploaded a webcomic as an example. All the dialog was ALL CAPS, but the output was inconsistently either sentence case or title case between panels.
I also tried a real example of a problem I'd like to use OCR for: I've got some old slides that need digitising, and most of them are labelled. Uploading one of these produces this output:
The image appears to be a photograph of a slide or film frame, possibly from an old camera or projector. The slide is yellowed with age and has a rectangular cutout in the center, which is filled with a dark gray or black material. The cutout is surrounded by a thin border, and there is some text written on the slide in black ink.
The text reads "Once Upon a Time" and is written in a cursive font. It is located at the bottom of the slide, below the cutout. There is also a small number "1069" written in the same font and color, but it is not clear what this number refers to.
Overall, the image suggests that the slide is an old photograph or film frame that has been preserved for many years. The yellowing of the slide and the cursive writing suggest that it may be from the early 20th century or earlier.
So aside from the unnecessarily repetitious description of the slide (and the "yellowing" is actually just the white balance being off, though I can forgive that), the actual written text (not cursive) was "Once Uniquitous." and the number was 106g. It's very clearly a 'g' and not a '9'. What I think is interesting is that this might be a demonstration of bias in models: it focused so much on the slide being an antique that it hallucinated a completely cliché title. It also missed the forest for the trees: the "black square" was the slide being front-lit so the text could be read, which is why the transparency wasn't visible.
Additionally, the API itself seems to have file size or resolution limits that are not documented.
I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.
I do find it rather annoying that I can't get it to consistently output a CSV, though. ChatGPT and Gemini seem better at that, but I haven't tried to automate it.
The scale of my problem is about 100 pages of bidsheets, so some manual cleaning is OK. It is certainly better than burning volunteers' time.
I'd love to hear how Handwriting OCR (https://www.handwritingocr.com) compares for your task.
It's not free, but its accuracy for handwritten documents is the best out there (I am the founder, so I am biased, but I'm really excited about where the accuracy is now). It could save you time, and your 100-page project would cost only $12.
My main qualm with a project like yours is that I have to upload my documents to a third party and trust them with that data. I have a couple thousand pages worth of journal entries from the last decade and I would never upload those to a website to get OCR'd, but with a local Ollama model I have full control of the data and it all stays local.
I understand your concern, and it's a common one. All we can do is give assurances in our privacy policy that your data is used only to perform the OCR, and nothing else. You can delete all data from the server immediately after downloading your results, and no trace will be left.
Of course a local solution like Ollama is preferable for privacy reasons but, for now, the OCR performance of available local models is just not very good, especially from handwritten documents. With a couple thousand pages of journal entries, that means a lot of post-processing and editing.
What about using llama3.2-vision to do the OCR bit and then deferring to ChatGPT to do the CSV part?
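Whichever model emits the CSV, a strict parse-and-retry step helps with the format drift: validate the output and re-prompt whenever it doesn't match. A sketch with hypothetical bidsheet column names:

```python
import csv
import io

EXPECTED = ["item", "bidder", "amount"]  # hypothetical bidsheet columns

def parse_strict_csv(text):
    """Parse model output as CSV; raise ValueError if the header or a
    row width doesn't match, so the caller can re-prompt the model."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or [h.strip().lower() for h in rows[0]] != EXPECTED:
        raise ValueError(f"bad header: {rows[:1]}")
    for r in rows[1:]:
        if len(r) != len(EXPECTED):
            raise ValueError(f"bad row: {r}")
    return [dict(zip(EXPECTED, r)) for r in rows[1:]]
```

With ~100 pages, even a couple of retries per page is cheap, and anything that still fails validation can go in the manual-cleaning pile.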
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close. It still had enough failures and hallucinations to make it faster to write it in by hand. Annoying considering how close it feels to working.
This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine tuning or something for it to beat Gemini Flash, it would save me a lot of time. :(
Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.
convert -density 76 input.pdf output-%d.png
That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further, with a better algorithm, would help. I remember using CLIP and finding that different scaling algorithms helped text readability. Maybe the text is just being butchered when it's rescaled.
Though I also tried the high-detail setting, which I think would deal with most issues that come from that, and it didn't seem to help much.
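For what it's worth, the resampling filter does matter. A quick Pillow sketch of the downscaling discussed here (the 800px target is just the number mentioned above):

```python
from PIL import Image

def downscale(img, max_px=800):
    """Shrink so the longest side is max_px. LANCZOS tends to keep
    thin strokes legible better than nearest-neighbour resampling."""
    scale = max_px / max(img.size)
    if scale >= 1:
        return img  # already small enough
    w, h = (round(d * scale) for d in img.size)
    return img.resize((w, h), Image.LANCZOS)
```

Comparing LANCZOS against BILINEAR or NEAREST on the same scan is a cheap experiment before blaming the model.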
>Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.
For normal models, the state of open-source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google, etc. are much better. Did you try those?
Interesting about Flash, what LLMs did you test ?
I tried open source and closed source OCR models, all were pretty bad. Google vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.
I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.
I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text. Extracting any dates listed in the text into a standard ISO format was nice, as was grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:
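For the curious, WordNinja's technique is essentially a dynamic-programming split over a frequency-ranked wordlist with Zipf-style costs (rarer word = higher cost). A toy sketch, with a handful of words standing in for its much larger real list:

```python
import math

# Toy frequency-ranked wordlist; costs grow with rank, Zipf-style.
WORDS = ["stop", "thinking", "in", "circles", "the", "of"]
COST = {w: math.log((i + 1) * math.log(len(WORDS) + 1))
        for i, w in enumerate(WORDS)}
MAXLEN = max(len(w) for w in WORDS)

def split_words(s):
    # best[i] = (total cost, length of last word) for the first i characters
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        candidates = [(best[i - l][0] + COST[s[i - l:i]], l)
                      for l in range(1, min(i, MAXLEN) + 1)
                      if s[i - l:i] in COST]
        if not candidates:
            candidates = [(best[i - 1][0] + 10.0, 1)]  # unknown-character penalty
        best.append(min(candidates))
    out, i = [], len(s)
    while i > 0:  # walk back through the chosen word lengths
        l = best[i][1]
        out.append(s[i - l:i])
        i -= l
    return out[::-1]
```

Running it on concatenated OCR output recovers the word boundaries, e.g. `split_words("stopthinkingincircles")`.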
[dead]
The OCR in OneNote is incredible IME, but I've not tested it on a wide range of fonts. I only know that I have abysmal handwriting and it will find words that are almost unrecognisable.
I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).
I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.
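A minimal version of that prep step in Pillow; the bounding-box crop here is a crude stand-in for the fancier text-block detection in the linked post:

```python
from PIL import Image, ImageEnhance, ImageOps

def prep_for_ocr(img, contrast=2.0):
    """Crop to the non-background bounding box and boost contrast
    before sending the image to the vision model."""
    gray = ImageOps.grayscale(img)
    # getbbox on the inverted image finds the extent of dark (text) pixels
    bbox = ImageOps.invert(gray).getbbox()
    if bbox:
        img = img.crop(bbox)
    return ImageEnhance.Contrast(img).enhance(contrast)
```

On clean scans with a white background this removes the borders that otherwise eat into the model's attention; noisy backgrounds need a smarter crop.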
Have you tried doing a verification pass: so giving gpt-4o the output of the first pass, and the image, and asking if they can correct the text (or if they match, or...)?
Just curious whether repetition increases accuracy, or if it just increases the opportunities for hallucinations?
I have not, but that's a great idea!
That's a bummer. I'm trying to do the exact same thing right now: digitize family photos. Some of mine have German on the back. The last OCR to hit the headlines was terrible, so I was hoping this would be better. ChatGPT-4o has been good, though, when I paste individual images into the chat. I haven't tried the API yet; I'm not sure how much it would cost to process 6,500 photos, many of which are blank, but I don't have an easy way to filter those out either.
I found 4o to be one of the worst, but I was using the API. I didn't test it, but sometimes it feels like images uploaded through ChatGPT work better than ones sent through the API. I was using Gemini Flash in the end; it seemed better than 4o, and the images are so cheap that I have a hard time believing Google is making any money, even on bandwidth costs.
I also tried preprocessing images before sending them through. I tried cropping to just the text to see if it helped, then filters on top to brighten the text; somehow that all made it worse. The most success I had was just holding the image in my hand and taking a photo of it. The busy background seemed to help, but I have absolutely no idea why.
The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate or not understand a crossed out word with a correction or wouldn't see text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, as well as date/location/favourite status.
Use a local rubbish model to extract text. If it doesn't find any on the back, don't send it to ChatGPT?
Terrascan comes to mind
"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.
Sorry, I meant "Tesseract".
Have you tried Claude?
It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.
Should this be a "Show HN" post? It seems to just be the front-end, and has no official association with the name Llama. Maybe together.ai gave them cloud space?
Very funny. I put in 3 screen captures of a (long) document, and it did relatively well. But when I proofread it, I realized the AI had made up passages that were not there!
The reason is probably the nature of screen capturing: some sentences or paragraphs were cut short. That probably kicked off the "fill in the blank" nature of the LLM, and it could not let those paragraphs stand unfinished, lol. It even put in a short concluding paragraph that was not in the original document at all!
It boggles my mind that a technology where "making things up" is even a remote possibility is ever actually considered for use in the real world.
I gave it a sentence that I created by placing 500 circles via a genetic algorithm to form the sentence, and then drawing them with an actual physical circle:
https://www.instagram.com/marekgibney/p/BiFNyYBhvGr/
Interestingly, it sees the circles just fine, but not the sentence. It replied with this:
The image contains no text or other elements that can be represented in Markdown. It is a visual composition of circles and does not convey any information that can be translated into Markdown format.
Based on the fact that squinting works, I applied a Gaussian blur to the image. Here's the response I got:
Markdown:
The provided image is a blurred text that reads "STOP THINKING IN CIRCLES." There are no other visible elements such as headers, footers, subtexts, images, or tables.
Markdown Content:
STOP THINKING IN CIRCLES
As the response is not deterministic, I also tried several times with the unprocessed image but it never worked. However, all the low-pass filter effects I applied worked with a high success rate.
I guess blurring it is similar to reducing the resolution or to looking at the image from further away.
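The low-pass idea is a one-liner in Pillow, if anyone wants to reproduce it:

```python
from PIL import Image, ImageFilter

def low_pass(img, radius=3):
    # Gaussian blur acts like optical defocus or squinting: it suppresses
    # the high-frequency circle edges so the letterforms dominate.
    return img.filter(ImageFilter.GaussianBlur(radius))
```

Downscaling then re-upscaling the image would have a similar low-pass effect, which matches the "looking from further away" intuition above.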
It's interesting that the neural net figures out the circles but not the words, because the circles are also not easily apparent from looking closely at the image. They could also be whirly lines.
Was the original LLM ever trained on original material like this?
Pretty cool use of genetic algorithm! Would love to see the code or at least the reward function.
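The author hasn't posted the code, but a plausible (purely speculative) reward function would score pixel overlap (intersection-over-union) between the stamped circles and a rasterized target sentence:

```python
def rasterize(circles, w, h):
    """Filled-circle pixel mask; each circle is (cx, cy, r)."""
    return {(x, y)
            for x in range(w) for y in range(h)
            for (cx, cy, r) in circles
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2}

def fitness(circles, target, w, h):
    # Intersection-over-union between the candidate circle layout and
    # the rasterized target sentence; 1.0 means a perfect match.
    got = rasterize(circles, w, h)
    union = got | target
    return len(got & target) / len(union) if union else 1.0
```

A GA would then mutate circle positions and radii, keeping the layouts with the highest fitness each generation.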
I can't read this either.
Edit: at a distance it's easier to read
If you squint it's easier too. I wonder if lowering the resolution of the image would make the text visible to OCR.
I wonder if you could do a composite image, like bracketed images, and so give the model multiple goes, from which it could amalgamate results. So you could do an exposure bracket, a focus/blur bracket, maybe a stretch/compression, or an adjustment of font height as a proportion of the image.
Feed all of the alternatives to the model, tell it they each have the same textual content?
I can't read anything but the "stop" either without seeing the solution first.
Why is it interesting? The image does not look like anything, and you need to skew it (by looking at an angle) to see any letters (barely).
Old scan of an Asus P3B-F motherboard schematic from 1997:
- It only managed to extract some of the text from the title block (project name, date, etc.).
- Despite the distinct font, it got all the 8/B and 1/I pairs mixed up.
- The actual useful info got turned into:

  Tables
  Table 1: [Insert table 1 here]
  Other Elements
  [Insert other elements here]
Holy Hallucinations batman!
Even the example images hallucinate random text.
Same for me. The receipt headline only says "Trader Joe's", and yet the model insists on adding some information and transcribes "Trader Joe's Receipt". This is like Xeroxgate, but infinitely worse.
Someday this will do great damage in ways we will completely neglect and overlook.
I uploaded a multi-page PDF and it did not know what to do with it. This was before I went to the GitHub repo and noticed that it wasn't supported. I think the tool should let the user know when they upload a file that is not supported.
The problem with using LLMs for OCR is hallucinations. That makes them impossible to use in business use cases such as insurance, banking, and health/medical, which demand high accuracy or a predictable inaccuracy rate. Not to mention handling scale: processing millions of documents with speed at affordable cost.
For all the test use cases mentioned in this thread, I’d suggest trying LLMwhisperer. A general purpose text Pre-processor/OCR built for LLM consumption. https://pg.llmwhisperer.unstract.com
So, I uploaded an HN screenshot and it showed some rendered text, but where is the Markdown code? A site titled "Document to Markdown" that fails to give me the Markdown? What am I overlooking?
Japanese OCR to structured content works very well via chatgpt API.
https://xenodium.com/images/chatgpt-shell-repo-splits-up/jap...
Other unrelated examples https://lmno.lol/alvaro/chatgpt-shell-repo-splits-up
I tried it on a Walmart receipt. It misread a 9 for a 0.
I wonder what the watts-per-character is of this tool.
Joules per character
I'm running this with 60Hz on my HDMI output.
I think it is perfectly fine to describe it in Watts per character as you can easily determine how many characters per second you can process.
One can combine Apache Tika's OCR output with the image and feed both into an LLM to fix typos.
While I'm a fan of Tika, a lot of people get queasy around Java and XML; they might be better served by their preferred scripting language and https://github.com/ocrmypdf/OCRmyPDF, which uses the same OCR engine.
May I introduce you to `apache/tika:2.9.2.1-full` with a REST API on 9998.
Not sure what you mean. Are they making GraalVM builds you can run standalone now? I only use Tika through Maven at work, so I might not be up to date on what's happening in the project.
Are there any OCR engines out there that actually recognize underlines properly? Even the LLMs seem to struggle to model the underline (though they get the text fine).
Is it possible to do this locally with open-source software? I have a lot of accounting PDFs to convert, but due to privacy concerns it must not run in the cloud.
Does it have to be open source, or just running locally? The paid version of Acrobat does this well. MacOS has pretty good built-in OCR capabilities and Windows isn’t far behind.
If you have the hardware for it, you can run some LLMs locally. Although for accounting data, I probably wouldn’t trust it.
Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.
A cheap hack is to push the documents through pdftotext from Poppler and if nothing or very little comes out, push them through OCRMyPDF and pipe it to pdftotext. If it's scanned you probably want some flags for deskewing and so on.
To make a bulk load of PDFs mostly greppable it's a decent technique; to get every 0 as a 0, you're probably going to proofread every conversion.
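That fallback logic is only a few lines; a sketch assuming poppler-utils and OCRmyPDF are on PATH:

```python
import subprocess

def needs_ocr(text, min_chars=50):
    # Heuristic: if pdftotext recovered almost nothing, it's probably a scan.
    return len(text.strip()) < min_chars

def extract_text(pdf_path):
    """pdftotext first; fall back to OCRmyPDF for scanned PDFs."""
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout
    if needs_ocr(text):
        # --deskew straightens crooked scans; --skip-text leaves
        # pages that already have a text layer alone.
        subprocess.run(["ocrmypdf", "--deskew", "--skip-text",
                        pdf_path, "ocr-output.pdf"], check=True)
        text = subprocess.run(["pdftotext", "ocr-output.pdf", "-"],
                              capture_output=True, text=True).stdout
    return text
```

The `min_chars` threshold is a guess; some scans have a few stray characters of embedded text, so tune it against your corpus.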
I don't recommend using it for anything important unless you very diligently proofread it, but I made one that runs locally that I linked to elsewhere in this post:
Yes, Docling and Marker do very similar things and can be run fully locally.
How does it handle images? That has seemed to be the major weak point of these doc-to-markdown systems.
I might've broken it as I gave it the Intel developer’s manual combined volumes. }:)
Seemed pretty good with handwriting. Didn’t make any mistakes with numbers in the sample I tried.
I get this error in the console when requesting /ocr, along with a 504 status code:

  An error occurred with your deployment
  FUNCTION_INVOCATION_TIMEOUT
Non-English images are slow.
Dreamt of fine design, layers of code, art refined— found wrappers instead.
Nothing to see here folks.
Um, I just quickly uploaded an unstructured RTF file to this and apparently broke it... unless it's just realllly slow.
If this is mainly for converting hand-written documents, maybe put that in the header of the website. Right now it just says "Document to Markdown", which could be interpreted in lots of different ways.
Site is dead now :(
Should be up, please try again!
It let me upload a file, but didn't produce any output.
We tried this and it was an absolute shit show for us.
You could have at least provided some constructive feedback...
Reading the Llama community license agreement, section "Redistribution and Use" I expected to find 'Built with Llama'. Is this not required?
https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instr... links to the community license.
Why don't you think that calling the app "Llama-OCR" is good enough?
The license is pretty specific, if the API counts as a "service".
i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service (including another AI model) that contains any of them, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.
[flagged]
Thank you!