Hi all, I'm the author of llama-ocr. Thank you for sharing and for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on Together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.
Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!
I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.
Is this amount of larger transformation expected/desirable?
(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...
I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.
Have you considered that usage yet?
> Need an example image? Try ours.

Great idea. I wish more services had a similar feature.
How accurate is this?
When compared with existing OCR systems, what sorts of mistakes does it make?
Option to use a local LLM?
I made a script that does exactly the same thing, but locally, using koboldcpp for inference. It downloads MiniCPM-V 2.6 with its image projector the first time you run it. You can use a different model if you want, but you'll want to edit the instruct template to match.
MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far, not just for OCR, but also image analysis. I have it set up so my NVR (Frigate) sends a couple of images to Ollama with MiniCPM-V 2.6 upon a motion alert from a driveway security camera. I'm able to get a reasonably accurate description of the vehicle that pulled into the driveway, including a description of the person that exits the vehicle and also the license plate, all sent to my phone.
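For anyone wanting to replicate this kind of pipeline: Ollama's `/api/generate` endpoint accepts base64-encoded images alongside a prompt. A minimal sketch (the model name, prompt, and default port are assumptions; adjust for your setup):

```python
import base64
import json
import urllib.request

def build_payload(image_bytes, model="minicpm-v",
                  prompt="Describe the vehicle and any visible license plate."):
    # Ollama's /api/generate takes base64-encoded images alongside the prompt.
    return {"model": model, "prompt": prompt,
            "images": [base64.b64encode(image_bytes).decode()],
            "stream": False}

def describe(image_path):
    """Send one snapshot to a local Ollama server on its default port.
    Assumes the minicpm-v model has already been pulled."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

From there, forwarding the returned description to a phone is just whatever notification service you already use.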
I love this. Can you share the source?
All it does is send the image to Llama 3.2 Vision and ask for it to read the text.
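That flow is small enough to sketch. A rough Python equivalent using Together's OpenAI-compatible endpoint (the model name, prompt, and use of the openai package are my guesses, not the project's actual source):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    # Vision chat APIs accept images inlined as base64 data URLs.
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode()

def image_to_markdown(path, api_key):
    from openai import OpenAI  # Together's API speaks the OpenAI protocol
    client = OpenAI(api_key=api_key, base_url="https://api.together.xyz/v1")
    with open(path, "rb") as f:
        data_url = to_data_url(f.read())
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Convert this document to markdown. "
                     "Return only the markdown, no explanation."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return resp.choices[0].message.content
```

Everything interesting lives in the prompt; the rest is plumbing.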
Note that this is just as open to hallucination as any other LLM output: it isn't reading the pixels looking for text characters, it's describing the picture, using the images it was trained on and their captions to determine what the text is. It may completely make up words, especially if it can't read them.
This is also true for any other OCR system, we just never called these errors “hallucinations” in this context.
I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?
No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. For one thing, OCR systems are deterministic. Deterministic. Look it up.
OCR systems use vision models too, and as such they can make mistakes. They don't sample, but they do produce a probability distribution over words, just like LLMs.
One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!
It really isn't since those systems are character based.
OCR tools sometimes make errors, but they don't make things up. There's a difference.
Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)
That said, I have a few questions if OP/anyone knows the answers:
1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI
2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?
P.S. The header links are broken on my desktop browser -- no onClick triggered
[1] https://facebookresearch.github.io/nougat/
The project author is a DevRel at Together.ai. This is a fantastic way to advertise a dev tool, though.
My guess is together.ai is at least partially sponsoring the demo.
Yeah was hoping for something I could self-host, both for privacy and cost.
together.ai serves 100+ open-source models including multi-modal Llama 3.2 with an OpenAI compatible API
Here's a bit of a quirk: I uploaded a webcomic as an example. All the dialog was ALL CAPS, but the output was inconsistently either sentence case or title case between panels.
I also tried a real example of a problem I'd like to use OCR for: I've got some old slides that need digitising, and most of them are labelled. Uploading one of these produces this output:
The image appears to be a photograph of a slide or film frame, possibly from an old camera or projector. The slide is yellowed with age and has a rectangular cutout in the center, which is filled with a dark gray or black material. The cutout is surrounded by a thin border, and there is some text written on the slide in black ink.
The text reads "Once Upon a Time" and is written in a cursive font. It is located at the bottom of the slide, below the cutout. There is also a small number "1069" written in the same font and color, but it is not clear what this number refers to.
Overall, the image suggests that the slide is an old photograph or film frame that has been preserved for many years. The yellowing of the slide and the cursive writing suggest that it may be from the early 20th century or earlier.
So aside from the unnecessarily repetitious description of the slide (and the "yellowing" is actually just the white balance being off, though I can forgive that), the actual written text (not cursive) was "Once Uniquitous." and the number was 106g. It's very clearly a 'g' and not a '9'. What I think is interesting is that this might be a demonstration of bias in models: it focused so much on the slide being an antique that it hallucinated a completely cliché title. It also missed the forest for the trees: the "black square" was the slide being front-lit so the text could be read, which is why the transparency wasn't visible.
Additionally, the API itself seems to have file size or resolution limits that are not documented.
I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.
I do find it rather annoying that I can't get it to consistently output a CSV, though. ChatGPT and Gemini seem better at that, but I haven't tried to automate it.
The scale of my problem is about 100 pages of bidsheets, so some manual cleaning is OK. It is certainly better than burning volunteers' time.
I'd love to hear how Handwriting OCR (https://www.handwritingocr.com) compares for your task.
It's not free, but its accuracy for handwritten documents is the best out there (I am the founder, so I am biased, but I'm really excited about where the accuracy is now). It could save you time, and your 100-page project would cost only $12.
My main qualm with a project like yours is that I have to upload my documents to a third party and trust them with that data. I have a couple thousand pages worth of journal entries from the last decade and I would never upload those to a website to get OCR'd, but with a local Ollama model I have full control of the data and it all stays local.
I understand your concern, and it's a common one. All we can do is give assurances in our privacy policy that your data is used only to perform the OCR, and nothing else. You can delete all data from the server immediately after downloading your results, and no trace will be left.
Of course a local solution like Ollama is preferable for privacy reasons but, for now, the OCR performance of available local models is just not very good, especially from handwritten documents. With a couple thousand pages of journal entries, that means a lot of post-processing and editing.
What about using llama3.2-vision to do the OCR bit and then deferring to ChatGPT to do the CSV part?
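Whichever model emits the CSV, a strict parse-and-retry step helps with the format drift: validate the output and re-prompt whenever it doesn't match. A sketch with hypothetical bidsheet column names:

```python
import csv
import io

EXPECTED = ["item", "bidder", "amount"]  # hypothetical bidsheet columns

def parse_strict_csv(text):
    """Parse model output as CSV; raise ValueError if the header or a
    row width doesn't match, so the caller can re-prompt the model."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or [h.strip().lower() for h in rows[0]] != EXPECTED:
        raise ValueError(f"bad header: {rows[:1]}")
    for r in rows[1:]:
        if len(r) != len(EXPECTED):
            raise ValueError(f"bad row: {r}")
    return [dict(zip(EXPECTED, r)) for r in rows[1:]]
```

With ~100 pages, even a couple of retries per page is cheap, and anything that still fails validation can go in the manual-cleaning pile.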
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close. It still had enough failures and hallucinations to make it faster to write it in by hand. Annoying considering how close it feels to working.
This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine tuning or something for it to beat Gemini Flash, it would save me a lot of time. :(
Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.
convert -density 76 input.pdf output-%d.png
That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further, with a better algorithm, would help. I remember using CLIP and finding that different scaling algorithms helped text readability. Maybe the text is just being butchered when it's rescaled.
Though I also tried the high-detail setting, which I think would deal with most issues that come from that, and it didn't seem to help much.
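For what it's worth, the resampling filter does matter. A quick Pillow sketch of the downscaling discussed here (the 800px target is just the number mentioned above):

```python
from PIL import Image

def downscale(img, max_px=800):
    """Shrink so the longest side is max_px. LANCZOS tends to keep
    thin strokes legible better than nearest-neighbour resampling."""
    scale = max_px / max(img.size)
    if scale >= 1:
        return img  # already small enough
    w, h = (round(d * scale) for d in img.size)
    return img.resize((w, h), Image.LANCZOS)
```

Comparing LANCZOS against BILINEAR or NEAREST on the same scan is a cheap experiment before blaming the model.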
>Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.
For normal models, the state of open-source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google, etc. are much better. Did you try those?
Interesting about Flash, what LLMs did you test ?
I tried open source and closed source OCR models, all were pretty bad. Google vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.
I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.
I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text. Extracting any dates listed in the text into a standard ISO format was nice, as was grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:
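For the curious, WordNinja's technique is essentially a dynamic-programming split over a frequency-ranked wordlist with Zipf-style costs (rarer word = higher cost). A toy sketch, with a handful of words standing in for its much larger real list:

```python
import math

# Toy frequency-ranked wordlist; costs grow with rank, Zipf-style.
WORDS = ["stop", "thinking", "in", "circles", "the", "of"]
COST = {w: math.log((i + 1) * math.log(len(WORDS) + 1))
        for i, w in enumerate(WORDS)}
MAXLEN = max(len(w) for w in WORDS)

def split_words(s):
    # best[i] = (total cost, length of last word) for the first i characters
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        candidates = [(best[i - l][0] + COST[s[i - l:i]], l)
                      for l in range(1, min(i, MAXLEN) + 1)
                      if s[i - l:i] in COST]
        if not candidates:
            candidates = [(best[i - 1][0] + 10.0, 1)]  # unknown-character penalty
        best.append(min(candidates))
    out, i = [], len(s)
    while i > 0:  # walk back through the chosen word lengths
        l = best[i][1]
        out.append(s[i - l:i])
        i -= l
    return out[::-1]
```

Running it on concatenated OCR output recovers the word boundaries, e.g. `split_words("stopthinkingincircles")`.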
[dead]
The OCR in OneNote is incredible IME, but I've not tested it on a wide range of fonts. I only know that I have abysmal handwriting and it will find words that are almost unrecognisable.
I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).
I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.
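A minimal version of that prep step in Pillow; the bounding-box crop here is a crude stand-in for the fancier text-block detection in the linked post:

```python
from PIL import Image, ImageEnhance, ImageOps

def prep_for_ocr(img, contrast=2.0):
    """Crop to the non-background bounding box and boost contrast
    before sending the image to the vision model."""
    gray = ImageOps.grayscale(img)
    # getbbox on the inverted image finds the extent of dark (text) pixels
    bbox = ImageOps.invert(gray).getbbox()
    if bbox:
        img = img.crop(bbox)
    return ImageEnhance.Contrast(img).enhance(contrast)
```

On clean scans with a white background this removes the borders that otherwise eat into the model's attention; noisy backgrounds need a smarter crop.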
Have you tried doing a verification pass: so giving gpt-4o the output of the first pass, and the image, and asking if they can correct the text (or if they match, or...)?
Just curious whether repetition increases accuracy, or if it just increases the opportunities for hallucinations?
I have not, but that's a great idea!
That's a bummer. I'm trying to do the exact same thing right now: digitize family photos. Some of mine have German on the back. The last OCR to hit the headlines was terrible, so I was hoping this would be better. ChatGPT-4o has been good, though, when I paste individual images into the chat. I haven't tried the API yet; I'm not sure how much it would cost to process 6,500 photos, many of which are blank, but I don't have an easy way to filter those out either.
I found 4o to be one of the worst, but I was using the API. I didn't test it, but sometimes it feels like images uploaded through ChatGPT work better than ones sent through the API. I was using Gemini Flash in the end; it seemed better than 4o, and the images are so cheap that I have a hard time believing Google is making any money, even on bandwidth costs.
I also tried preprocessing images before sending them through. I tried cropping to just the text to see if it helped, then filters on top to brighten the text; somehow that all made it worse. The most success I had was just holding the image in my hand and taking a photo of it. The busy background seemed to help, but I have absolutely no idea why.
The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate or not understand a crossed out word with a correction or wouldn't see text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, as well as date/location/favourite status.
Use a local rubbish model to extract text. If it doesn't find any on the back, don't send it to ChatGPT?
Terrascan comes to mind
"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.
Sorry, I meant "Tesseract".
Have you tried Claude?
It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.
Should this be a "Show HN" post? It seems to just be the front-end, and has no official association with the name Llama. Maybe together.ai gave them cloud space?
Very funny. I put in 3 screen captures of a (long) document, and it did relatively well. But when I proofread it, I realized the AI had made up passages that were not there!
The reason is probably the nature of screen capturing: some sentences or paragraphs were cut short. That probably kicked off the "fill in the blank" nature of the LLM, and it could not let those paragraphs stand unfinished, lol. It even put in a short concluding paragraph that was not in the original document at all!
It boggles my mind that a technology where "making things up" is even a remote possibility is ever actually considered for use in the real world.
I gave it a sentence that I created by placing 500 circles via a genetic algorithm to form the sentence, and then drawing them with an actual physical circle:
https://www.instagram.com/marekgibney/p/BiFNyYBhvGr/
Interestingly, it sees the circles just fine, but not the sentence. It replied with this:
The image contains no text or other elements that can be represented in Markdown. It is a visual composition of circles and does not convey any information that can be translated into Markdown format.
Based on the fact that squinting works, I applied a Gaussian blur to the image. Here's the response I got:
Markdown:
The provided image is a blurred text that reads "STOP THINKING IN CIRCLES." There are no other visible elements such as headers, footers, subtexts, images, or tables.
Markdown Content:
STOP THINKING IN CIRCLES
As the response is not deterministic, I also tried several times with the unprocessed image but it never worked. However, all the low-pass filter effects I applied worked with a high success rate.
I guess blurring it is similar to reducing the resolution or to looking at the image from further away.
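The low-pass idea is a one-liner in Pillow, if anyone wants to reproduce it:

```python
from PIL import Image, ImageFilter

def low_pass(img, radius=3):
    # Gaussian blur acts like optical defocus or squinting: it suppresses
    # the high-frequency circle edges so the letterforms dominate.
    return img.filter(ImageFilter.GaussianBlur(radius))
```

Downscaling then re-upscaling the image would have a similar low-pass effect, which matches the "looking from further away" intuition above.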
It's interesting that the neural net figures out the circles but not the words, because the circles are also not easily apparent from looking closely at the image. They could also be whirly lines.
Was the original LLM ever trained on original material like this?
Pretty cool use of genetic algorithm! Would love to see the code or at least the reward function.
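The author hasn't posted the code, but a plausible (purely speculative) reward function would score pixel overlap (intersection-over-union) between the stamped circles and a rasterized target sentence:

```python
def rasterize(circles, w, h):
    """Filled-circle pixel mask; each circle is (cx, cy, r)."""
    return {(x, y)
            for x in range(w) for y in range(h)
            for (cx, cy, r) in circles
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2}

def fitness(circles, target, w, h):
    # Intersection-over-union between the candidate circle layout and
    # the rasterized target sentence; 1.0 means a perfect match.
    got = rasterize(circles, w, h)
    union = got | target
    return len(got & target) / len(union) if union else 1.0
```

A GA would then mutate circle positions and radii, keeping the layouts with the highest fitness each generation.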
I can't read this either.
Edit: at a distance it's easier to read
If you squint it's easier too. I wonder if lowering the resolution of the image would make the text visible to OCR.
I wonder if you could do a composite image, like bracketed images, and so give the model multiple goes, from which it could amalgamate results. So you could do an exposure bracket, a focus/blur bracket, maybe a stretch/compression, or an adjustment of font height as a proportion of the image.
Feed all of the alternatives to the model, tell it they each have the same textual content?
I can't read anything but the "stop" either without seeing the solution first.
Why is it interesting? The image does not look like anything, and you need to skew it (by looking at an angle) to see any letters (barely).
Old scan of an Asus P3B-F motherboard schematic from 1997:
- It only managed to extract some of the text from the title block (project name, date, etc.).
- Despite the distinct font, it got all the 8/B and 1/I pairs mixed up.
- The actual useful info got turned into:

  Tables
  Table 1: [Insert table 1 here]
  Other Elements
  [Insert other elements here]
Holy Hallucinations batman!
Even the example images hallucinate random text.
Same for me. The receipt headline only says "Trader Joe's", and yet the model insists on adding some information and transcribes "Trader Joe's Receipt". This is like Xeroxgate, but infinitely worse.
Someday this will do great damage in ways we will completely neglect and overlook.
I uploaded a multi-page PDF and it did not know what to do with it. This was before I went to the GitHub repo and noticed that it wasn't supported. I think the tool should let the user know when they upload a file that is not supported.
The problem with using LLMs for OCR is hallucinations. That makes them impossible to use in business use cases such as insurance, banking, and health/medical, which demand high accuracy or a predictable inaccuracy rate. Not to mention handling scale: processing millions of documents with speed at affordable cost.
For all the test use cases mentioned in this thread, I’d suggest trying LLMwhisperer. A general purpose text Pre-processor/OCR built for LLM consumption. https://pg.llmwhisperer.unstract.com
So, I uploaded an HN screenshot and it showed some rendered text, but where is the Markdown code? A site titled "Document to Markdown" that fails to give me the Markdown? What am I overlooking?
Japanese OCR to structured content works very well via chatgpt API.
https://xenodium.com/images/chatgpt-shell-repo-splits-up/jap...
Other unrelated examples https://lmno.lol/alvaro/chatgpt-shell-repo-splits-up
I tried it on a Walmart receipt. It misread a 9 for a 0.
I wonder what the watts-per-character is of this tool.
Joules per character
I'm running this with 60Hz on my HDMI output.
I think it is perfectly fine to describe it in Watts per character as you can easily determine how many characters per second you can process.
One can combine Apache Tika's OCR output with the image and feed both into an LLM to fix typos.
While I'm a fan of Tika, a lot of people get queasy around Java and XML; they might be better served by their preferred scripting language and https://github.com/ocrmypdf/OCRmyPDF, which uses the same OCR engine.
May I introduce you to `apache/tika:2.9.2.1-full` with a REST API on 9998.
Not sure what you mean. Are they making GraalVM builds you can run standalone now? I only use Tika through Maven at work, so I might not be up to date on what's happening in the project.
Are there any OCR engines out there that actually recognize underlines properly? Even the LLMs seem to struggle to model the underline (though they get the text fine).
Is it possible to do this locally with open-source software? I have a lot of accounting PDFs to convert, but due to privacy concerns it must not run in the cloud.
Does it have to be open source, or just running locally? The paid version of Acrobat does this well. MacOS has pretty good built-in OCR capabilities and Windows isn’t far behind.
If you have the hardware for it, you can run some LLMs locally. Although for accounting data, I probably wouldn’t trust it.
Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.
A cheap hack is to push the documents through pdftotext from Poppler and if nothing or very little comes out, push them through OCRMyPDF and pipe it to pdftotext. If it's scanned you probably want some flags for deskewing and so on.
To make a bulk load of PDFs mostly greppable it's a decent technique; to get every 0 as a 0, you're probably going to proofread every conversion.
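That fallback logic is only a few lines; a sketch assuming poppler-utils and OCRmyPDF are on PATH:

```python
import subprocess

def needs_ocr(text, min_chars=50):
    # Heuristic: if pdftotext recovered almost nothing, it's probably a scan.
    return len(text.strip()) < min_chars

def extract_text(pdf_path):
    """pdftotext first; fall back to OCRmyPDF for scanned PDFs."""
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout
    if needs_ocr(text):
        # --deskew straightens crooked scans; --skip-text leaves
        # pages that already have a text layer alone.
        subprocess.run(["ocrmypdf", "--deskew", "--skip-text",
                        pdf_path, "ocr-output.pdf"], check=True)
        text = subprocess.run(["pdftotext", "ocr-output.pdf", "-"],
                              capture_output=True, text=True).stdout
    return text
```

The `min_chars` threshold is a guess; some scans have a few stray characters of embedded text, so tune it against your corpus.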
I don't recommend using it for anything important unless you very diligently proofread it, but I made one that runs locally that I linked to elsewhere in this post:
Yes, Docling and Marker do very similar things and can be run fully locally.
How does it handle images? That has seemed to be the major weak point of these doc-to-markdown systems.
I might've broken it as I gave it the Intel developer’s manual combined volumes. }:)
Seemed pretty good with handwriting. Didn’t make any mistakes with numbers in the sample I tried.
I get this error in the console when requesting /ocr, along with a 504 status code:

  An error occurred with your deployment
  FUNCTION_INVOCATION_TIMEOUT
Non-English images are slow.
Dreamt of fine design, layers of code, art refined— found wrappers instead.
Nothing to see here folks.
Um, I just quickly uploaded an unstructured RTF file to this and apparently broke it... unless it's just realllly slow.
If this is mainly for converting hand-written documents, maybe put that in the header of the website. Right now it just says "Document to Markdown", which could be interpreted in lots of different ways.
Site is dead now :(
Should be up, please try again!
It let me upload a file, but didn't produce any output.
We tried this and it was an absolute shit show for us.
You could have at least provided some constructive feedback...
Reading the Llama community license agreement, section "Redistribution and Use" I expected to find 'Built with Llama'. Is this not required?
https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instr... links to the community license.
Why don't you think that calling the app "Llama-OCR" is good enough?
The license is pretty specific, if the API counts as a "service".
i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service (including another AI model) that contains any of them, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.
[flagged]
Thank you!