Interesting... I've been playing with the idea of embedding more python in my C, no cython or anything just using <Python.h> and <numpy/arrayobject.h>. From one perspective it's just "free" C-bindings to a lot of optimized packages. Trying some different C-libraries, the python code is often faster. Python almost becomes C's package manager
E.g. sorting 2^23 random 64-bit integers: qsort: 850ms, custom radix sort: 250ms, ksort.h: 582ms, np.sort: 107ms (including PyArray_SimpleNewFromData, PyArray_Sort). Where numpy uses intel's x86-simd-sort I believe.
E.g. inserting 8M entries into a hash table (random 64-bit keys and values): MSI-style hash table: ~100ns avg insert/lookup, cc_map: ~95ns avg insert/lookup, Python.h: 91ns insert, 60ns lookup
I'm curious if OPs tool might fit in similarly. I've found lmdb to be quite slow even in tmpfs with no sync, etc.
Having seen a lot of work come to grief because of the decision to use pandas, anything that’s not pandas has my vote. Pandas: if you’re not using it interactively, don’t use it at all. This advice goes double if your use case is “read a csv.” Standard library in Python has you covered there.
Both duckdb and especially polars should also be mentioned here. Polars in particular is quite good Ime if you want a pandas-alike interface (it additionally also has a more sane interface).
Since DuckDB can read and write Pandas from memory, a team with varying Pandas fluency can benefit from learning DuckDB.
Interesting work. I'd be curious to know the timing relative to list comprehensions for similar queries, since that's the common standard library alternative for many of these examples.
I don't understand why numeric filters are included. The library is written in python, so shouldn't a lambda function based filter be roughly as fast but much easier/clearer to write.
I'm not the author, but this implementation has the benefit of being a JSON compatible DSL that you can serialize. Maybe that's intentional, maybe not.
It does look like Python's comprehensions would be a better choice if you're writing them by hand anyway.
It’s nice it’s fast at 10k dictionary entries, but how does it scale?