Parsing PDFs (and more) in Elixir using Rust (chriis.dev). Submitted by bustylasercanon 5 months ago
  • cpursley 5 months ago

    I've been thinking a lot about how to accomplish various RAG things in Elixir (for LLM applications). PDF is one of the missing pieces, so glad to see work here. The really tricky part is not just parsing out the text (you can just call the pdftotext Unix command-line utility for that), but accurately pulling out things like complex tables, etc., in a way that can be chunked/post-processed usefully. I'd love to see something like Unstructured or Marker but in Rust (i.e., fast) that Elixir could call through a NIF. And maybe some kind of hybrid system that uses open LLMs with vision capabilities. Ref:

    - https://github.com/Unstructured-IO/unstructured

    - https://github.com/VikParuchuri/marker
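The "just call pdftotext" baseline mentioned above can be sketched in Rust roughly as follows. This is a minimal sketch, assuming Poppler's `pdftotext` binary is on the PATH; the function names `pdftotext_command` and `extract_text` are hypothetical and not from the linked article or from extractous. Passing `-` as the output file makes `pdftotext` write to stdout, and `-layout` asks it to keep the physical layout, which helps somewhat with simple tables but does not solve the complex-table problem described in the comment.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build the `pdftotext` invocation without running it.
/// Assumes Poppler's `pdftotext` is installed and on the PATH.
fn pdftotext_command(pdf: &Path, keep_layout: bool) -> Command {
    let mut cmd = Command::new("pdftotext");
    if keep_layout {
        // Preserve the physical layout of the page as far as possible.
        cmd.arg("-layout");
    }
    // "-" as the output file sends the extracted text to stdout.
    cmd.arg(pdf).arg("-");
    cmd
}

/// Run `pdftotext` and return the extracted text,
/// or an error if the tool is missing or exits with a failure status.
fn extract_text(pdf: &Path, keep_layout: bool) -> io::Result<String> {
    let output = pdftotext_command(pdf, keep_layout).output()?;
    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, stderr));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A caller would use something like `extract_text(Path::new("report.pdf"), true)` and then chunk the returned string; the file name here is only a placeholder.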

    • cpursley 5 months ago

      Well derp, I should have read the linked extractous repo. This looks like the extract solution I've been after (see what I did there).

      https://github.com/yobix-ai/extractous

      • bustylasercanon 5 months ago

        Yeah I could maybe highlight how good that library is in here

      • constantinum 5 months ago

        For instance, LlamaParse (https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction, but the problem is hallucination, e.g. https://github.com/run-llama/llama_parse/issues/420

        There is also LLMWhisperer, which preserves the layout (tables, checkboxes, forms) and hence the context. https://pg.llmwhisperer.unstract.com/

        • cpursley 5 months ago

          Is this open source? Is it slow Python? That's where I'm stuck.

          • constantinum 5 months ago

            This is not open source. It has high accuracy and it is fast too. All you need to do is send your documents to the API.

        • vikp 5 months ago

          Hey, I'm the author of marker - thanks for sharing. Most of the processing time is model inference right now. I've been retraining some models lately onto new architectures to improve speed (layout, tables, LaTeX OCR).

          We recently integrated gemini flash (via the --use_llm flag), which maybe moves us towards the "hybrid system" you mentioned. Hoping to add support for other APIs soon, but focusing on improving quality/speed now.

          Happy to chat if anyone wants to talk about the difficulties of parsing PDFs, or has feedback - email in profile.

          • cpursley 5 months ago

            Very cool, any plans for a dockerized API of marker similar to what Unstructured released? I know you have a very attractively priced serverless offering (https://www.datalab.to) but having something to develop against locally would be great (for those of us not in the Python world).

            • vikp 5 months ago

              It's on the list to build - been focusing on quality pretty heavily lately.

          • conradfr 5 months ago

            Maybe just using pdftohtml instead of pdftotext.

            • cpursley 5 months ago

              I experimented with it, it generates way too much noise. Cool utility, though!

          • hinkley 5 months ago

            The Achilles heel of the BEAM is that if it crashes in native code, it has no way to recover, and its much-vaunted robustness goes out the window. So writing native hooks in Rust makes it a bit harder to crash the whole VM.

            On the plus side it makes IPC pretty straightforward, so you can move the processes that need the native code (NIFs) to a separate VM if you’re feeling paranoid.

            • h0l0cube 5 months ago

              Rustler actually wraps the NIF and passes the exception back to the caller

              > The library provides facilities for generating the boilerplate for interacting with the BEAM, handles encoding and decoding of Erlang terms, and catches rust panics before they unwind into C.

              https://github.com/rusterlium/rustler
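The "catches rust panics before they unwind into C" idea quoted above can be illustrated with the standard library alone. This is a sketch of the general technique using `std::panic::catch_unwind`, not Rustler's actual implementation; `parse_page_count` and `safe_page_count` are hypothetical names standing in for native code that might panic and for the boundary that wraps it.

```rust
use std::panic;

/// Hypothetical stand-in for native parsing code that may panic on bad input.
fn parse_page_count(raw: &str) -> usize {
    raw.trim()
        .parse::<usize>()
        .expect("page count must be a number")
}

/// Run the parser behind a panic boundary and turn a panic into an `Err`,
/// so the caller gets an error value instead of an unwinding stack.
fn safe_page_count(raw: &str) -> Result<usize, String> {
    panic::catch_unwind(|| parse_page_count(raw)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

Note that `catch_unwind` only catches unwinding panics. A segfault in unsafe code, an abort, or a binary built with `panic = "abort"` still takes the whole process down, which is why moving NIF-heavy work to a separate VM can still be worth considering.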

            • joshchernoff 5 months ago

              FYI: your preview image from the html header meta tag is broken.

              • bustylasercanon 5 months ago

                Thanks! I need to fix that

              • karim79 5 months ago

                [flagged]

                • pplante 5 months ago

                  Very shameful IMHO.

                  The only relevance is the PDF format.