• mattmein 4 days ago

    Also check out https://github.com/D-Star-AI/dsRAG/ for a bit more involved chunking strategy.

    • cadence- 4 days ago

      This looks pretty amazing. I will take it for a spin next week. I want to make a RAG that will answer questions related to my new car. The manual is huge and it is often hard to find answers in it, so I think this will be a big help to owners of the same car. I think your library can help me chunk that huge PDF easily.

      • andai 4 days ago

        How many tokens is the manual?

        • yaj54 4 days ago

          If it's more than the 2M tokens that fit in Gemini's context, then I want to know what car it is.

    • simonw 4 days ago

      Would it make sense for this to offer a chunking strategy that doesn't need a tokenizer at all? I love the goal to keep it small, but "tokenizers" is still a pretty huge dependency (and one that isn't currently compatible with Python 3.13).

      I've been hoping to find an ultra light-weight chunking library that can do things like very simple regex-based sentence/paragraph/markdown-aware chunking with minimal additional dependencies.
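
      Something in that direction is only a handful of lines with the stdlib. A rough sketch (the regexes here are illustrative, not battle-tested):

          import re

          def chunk_text(text, max_chars=1200):
              # Split on blank lines (paragraph / markdown block boundaries),
              # then on sentence-ending punctuation when a paragraph is too big.
              chunks = []
              for para in re.split(r"\n\s*\n", text):
                  if len(para) <= max_chars:
                      chunks.append(para)
                      continue
                  buf = ""
                  for sent in re.split(r"(?<=[.!?])\s+", para):
                      if buf and len(buf) + len(sent) + 1 > max_chars:
                          chunks.append(buf)
                          buf = sent
                      else:
                          buf = (buf + " " + sent).strip()
                  if buf:
                      chunks.append(buf)
              return [c for c in chunks if c.strip()]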

      • parhamn 4 days ago

        Across a broad enough dataset, (char count / 4) is very close to the actual token count in English -- we verified this across millions of queries. We had to switch to an actual tokenizer for Chinese and other non-Latin scripts, as that simple formula misses the mark for context stuffing.

        The more complicated part is the effective bin-packing problem that emerges depending on how many different contextual sources you have.
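
        The heuristic really is just this (illustrative sketch, English-only):

            def estimate_tokens(text: str) -> int:
                # ~4 characters per token holds up well for English at scale;
                # swap in a real tokenizer for Chinese and other non-Latin scripts.
                return max(1, len(text) // 4)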

        • jimmySixDOF 2 days ago

          For a regex approach, take a look at the work from Jina.ai, who among other things published a chunking regex [1] that is now part of a bigger API service [2]. They also developed an interesting late-interaction (ColBERT-like) chunking system that fits certain use cases. But the regex is enough all by itself:

          [1] https://gist.github.com/LukasKriesch/e75a0132e93ca989f8870c4...

          [2] https://jina.ai/segmenter/

          • andai 4 days ago

            I made a rudimentary semantic chunker in just a few lines of code.

            I just removed one sentence at a time from the left until there was a jump in the embedding distance, then repeated the process from the right side.
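
            Roughly like this (a sketch: embed() stands in for whatever sentence-embedding model you use, and the threshold needs tuning):

                import numpy as np

                # embed(text) -> np.ndarray is assumed (any sentence-embedding model).

                def cosine_dist(a, b):
                    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

                def trim_left(sentences, threshold=0.15):
                    # Drop sentences from the left until dropping one more causes a
                    # jump in distance between successive remainders' embeddings.
                    prev = embed(" ".join(sentences))
                    for i in range(1, len(sentences)):
                        cur = embed(" ".join(sentences[i:]))
                        if cosine_dist(prev, cur) > threshold:
                            return sentences[i - 1:]
                        prev = cur
                    return sentences

                # For the right boundary, run the same loop over the reversed list.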

          • bhavnicksm 4 days ago

            Thank you so much for giving Chonkie a chance! Just to note, Chonkie is still in beta (currently v0.1.2), with a bunch of things planned for it. It's an initial working version that seemed promising enough to present.

            I hope that you will stick with Chonkie for the journey of making the 'perfect' chunking library!

            Thanks again!

            • mixeden 4 days ago

              > Token Chunking: 33x faster than the slowest alternative

              1) what

              • rkharsan64 4 days ago

                There are only 3 competitors in that particular benchmark, and the speedup over the 2nd is only 1.06x.

                Edit: Also, from the same table, it seems that only this library was run after warming up, while the others were not. https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...

                • bhavnicksm 4 days ago

                  Token chunking is limited mostly by the tokenizer and much less by the chunking algorithm. Tiktoken tokenizers seem to do better with warm-up, which Chonkie defaults to -- and which is also what the 2nd one is using.

                  Algorithmically, there's not much difference in token chunking between Chonkie, LangChain, or any other token chunking implementation you might want to use (except LlamaIndex; I don't know what mess they made to end up with a 33x slower algo).

                  If you only want token chunking (which I don't entirely recommend), then rather than Chonkie or LangChain, just write your own for production :) At the very least, don't install 80MiB packages just for token chunking; Chonkie is 4x smaller than they are.
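
                  A DIY version is only a few lines anyway. A sketch with tiktoken (chunk size and overlap are arbitrary):

                      import tiktoken

                      def token_chunks(text, chunk_size=512, overlap=64):
                          # Encode once, then slice the token stream with some
                          # overlap and decode each window back to text.
                          enc = tiktoken.get_encoding("cl100k_base")
                          tokens = enc.encode(text)
                          step = chunk_size - overlap
                          return [enc.decode(tokens[i:i + chunk_size])
                                  for i in range(0, len(tokens), step)]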

                  That's just my honest response... And these benchmarks are just the beginning; future optimizations to semantic chunking should push the speed-up over the current 2nd place (2.5x right now) even higher.

                  • melony 4 days ago

                    How does it compare with NLTK's chunking library? I have found that it works very well for sentence segmentation.
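
                    For reference, the sentence segmentation bit is just the punkt splitter (the file name here is made up):

                        import nltk
                        nltk.download("punkt")  # one-time model download
                        from nltk.tokenize import sent_tokenize

                        sentences = sent_tokenize(open("manual.txt").read())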

                • petesergeant 4 days ago

                  > What other chunking strategies would be useful for RAG applications?

                  I’m using o1-preview for chunking, creating summary subdocuments.
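
                  Roughly like this, with the OpenAI Python client (the prompt is a paraphrase, not my exact one):

                      from openai import OpenAI

                      client = OpenAI()

                      def summary_subdocument(section: str) -> str:
                          # Ask the model for a self-contained summary that can be
                          # embedded and retrieved as its own sub-document.
                          resp = client.chat.completions.create(
                              model="o1-preview",
                              messages=[{"role": "user",
                                         "content": "Summarize this section as a standalone "
                                                    "sub-document for retrieval:\n\n" + section}],
                          )
                          return resp.choices[0].message.content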

                  • bhavnicksm 4 days ago

                    That's pretty cool! I believe a recent research paper called LumberChunker evaluated that approach and found it to be pretty decent as well.

                    Thanks for responding, I'll try to make it easier to use something like that in Chonkie in the future!

                    • petesergeant 4 days ago

                      Ah, that's an interesting paper, and a slightly different approach to what I'm doing, but possibly a superior one. Thanks!

                  • vlovich123 4 days ago

                    Out of curiosity, where does the 21 MiB come from? The codebase clone is 1.2 MiB and the src folder is only 68 KiB.

                    • ekianjo 4 days ago

                      Dependencies in the venv?

                    • Dowwie 3 days ago

                      When would you ever want anything other than semantic chunking? Cutting text into fixed-length chunks is fast, but it arbitrarily lumps together potentially dissimilar information.

                      • samlinnfer 4 days ago

                        How does it work for code? (Chunking code, that is.)

                        • nostrebored 4 days ago

                          Poorly, just like it does for text.

                          Chunking is easily where all of these systems die beyond PoC scale.

                          I’ve talked to multiple code generation companies in the past week — most are stuck with BM25 and taking in whole files.
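
                          Which is to say, retrieval roughly like this, with whole files as documents (a sketch using the rank_bm25 package; the paths are made up):

                              from rank_bm25 import BM25Okapi

                              # One "document" per whole file, naive whitespace tokenization.
                              paths = ["auth.py", "db.py", "routes.py"]
                              corpus = [open(p).read().lower().split() for p in paths]
                              bm25 = BM25Okapi(corpus)

                              query = "verify password hash".split()
                              scores = bm25.get_scores(query)
                              best = max(range(len(paths)), key=lambda i: scores[i])
                              print(paths[best])  # file to stuff into the context window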

                          • potatoman22 4 days ago

                            What do they use BM25 for? RAG?

                            • nostrebored 2 days ago

                              Correct -- finding the right functions and files to include.

                          • bhavnicksm 4 days ago

                            Right now, we haven't worked on adding support for code -- things like comments (#, //) contain punctuation that adversely affects chunking, along with indentation and other issues.

                            But, it's on the roadmap, so please hold on!

                          • bravura 4 days ago

                            One thing I've been looking for, and which was a bit tricky to implement quickly myself, is this:

                            I have a particular max token length in mind, and I have a tokenizer like tiktoken. I have a string and I want to quickly find the maximum length truncation of the string that is <= target max token length.

                            Does chonkie handle this?
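
                            For what it's worth, with tiktoken you can often skip a binary search over string prefixes and truncate in token space instead. A sketch (the byte-level decode guards against splitting a multi-byte character):

                                import tiktoken

                                enc = tiktoken.get_encoding("cl100k_base")

                                def truncate_to_tokens(text: str, max_tokens: int) -> str:
                                    # Encode once, keep the first max_tokens tokens, decode back.
                                    tokens = enc.encode(text)
                                    if len(tokens) <= max_tokens:
                                        return text
                                    data = enc.decode_bytes(tokens[:max_tokens])
                                    return data.decode("utf-8", errors="ignore")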

                            • bhavnicksm 4 days ago

                              I don't fully understand what you mean by "maximum length truncation of the string"; but if you're talking about splitting the text into 'chunks' whose token counts are less than a pre-specified max_token length, then yes!

                              Is that what you meant?

                              • Eisenstein 3 days ago

                                I'm not sure if this is what they mean, but this is a use case that I have dealt with and had to roll my own code for:

                                Given a list of sentences, find the largest in-order group of sentences that fits into a max token length, such that the group retains a natural coherence.

                                In my case I used a fuzzy token limit and the chunker would choose a smaller group of sentences that fit into a single paragraph or a single common structure instead of cramming every possible sentence until it ran out of room. It would do the same going over the limit if it would be beneficial to do so.

                                A simple example: with an alphabetized set, instead of making one chunk out of the A items plus part of the B items, it would end the chunk at the A items with tokens to spare; or, if it only meant going over by an extra 10%, it would finish out the B items. Most of the time it just decided to end chunks at paragraph boundaries instead of continuing into the middle of the next one.
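
                                The core of it is a greedy packer with a soft limit, something like (a sketch; the token counter is a stand-in):

                                    def count_tokens(s):
                                        # Stand-in estimator; swap in a real tokenizer (e.g. tiktoken).
                                        return max(1, len(s) // 4)

                                    def fuzzy_chunks(sentences, max_tokens, slack=0.10):
                                        # sentences: list of (text, ends_unit) pairs, where the flag
                                        # marks paragraph / structure boundaries.
                                        chunks, current, used = [], [], 0
                                        for sent, ends_unit in sentences:
                                            cost = count_tokens(sent)
                                            if used + cost <= max_tokens:
                                                current.append(sent); used += cost
                                            elif ends_unit and used + cost <= max_tokens * (1 + slack):
                                                # Overshoot slightly to finish a coherent unit.
                                                current.append(sent)
                                                chunks.append(" ".join(current)); current, used = [], 0
                                            else:
                                                if current:
                                                    chunks.append(" ".join(current))
                                                current, used = [sent], cost
                                        if current:
                                            chunks.append(" ".join(current))
                                        return chunks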

                            • will-burner 3 days ago

                              Love the name Chonkie and Moo Deng, the hippo, as the image/logo!!

                              edit: Get some Moo Deng jokes in the docs!

                              • spullara 4 days ago

                                21MB to split text? Have you analyzed the footprint?

                                • bhavnicksm 4 days ago

                                  Just to clarify, the 21MB is the size of the package itself! Other packages are way larger.

                                  The memory footprint of the chunking itself would vary widely based on the dataset, and it's not something we tested... usually other providers don't test it either, as long as it doesn't bust up the computer/server.

                                  If saving memory during runtime is important for your application, let me know! I'd run some benchmarks for it...

                                  Thanks!

                                • trwhite 3 days ago

                                  What's RAG?

                                  • adwf 3 days ago

                                    Retrieval-Augmented Generation (AI).

                                    Think of it as if ChatGPT (or other models) didn't just have the unstructured knowledge embedded in their weights from training, but also had an extra DB on the side with specific structured knowledge that they can look up on the fly.

                                  • ch1kkenm4ss4 4 days ago

                                    Chonkie and lightweight? Good naming!

                                    • bhavnicksm 4 days ago

                                      Haha~ thanks!

                                    • ilidur 4 days ago

                                      Review: Chonkie is an MIT-licensed project to help with chunking your text. It boasts fixed-length, word-length, sentence, and semantic methods. The installation and usage instructions are simple.

                                      The benchmark numbers are massaged to look really impressive, but upon scrutiny the improvement is at most 1.86x compared to the leading product, LangChain, according to a further page describing the measurements. It claims to beat it on all aspects, but where the results get close, the author's library was benchmarked warmed up, so the numbers are not directly comparable. The author acknowledged this but didn't change the methodology to provide a direct comparison.

                                      The author is Bhavnick S. Minhas, an early career ML engineer with both research and industry experience and very prolific with his GitHub contributions.