• rich_sasha an hour ago

    It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.

    So many footguns, poorly-thought-through functions, tens of keyword arguments instead of good abstractions, 1-D and 2-D structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

    I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

    To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a pandas-like thing with a better API. Maybe time to roll up my sleeves...

    • faizshah 3 minutes ago

      Pandas is a commonly known DSL at this point, so lots of data scientists know pandas like the back of their hand, and that's why a lot of "pandas, but for X" libraries have become popular.

      I agree that pandas does not have the best-designed API in comparison to, say, dplyr, but it also has a lot of functionality like pivot, melt, and unstack that is often not implemented by other libraries. It's also existed for more than a decade at this point, so there's a plethora of resources and Stack Overflow questions.

      On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.

      But I think if you use pandas for a while you just learn all the weird quirks, and you gain huge benefits from all the things it can do and all the other libraries you can use with it.

      • ljosifov an hour ago

        +1 Seconding this. My limited experience with pandas had a non-trivial number of "?? Is it really like this? Nah, I must be mistaken, this cannot be, no one would do something insane like that" moments. And yet, and yet... FWIW, I've since found that numpy is a must (of course), but pandas is mostly optional. So I stick to numpy for my own writing, and keep pandas read-only (just executing someone else's).

        • sega_sai an hour ago

          A great point that I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to DuckDB instead.

          • bravura 21 minutes ago

            Can you explain your use-case and why DuckDB is better?

            Considering switching from pandas and want to understand what my best bet is. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.

          • martinsmit an hour ago

            Check out redframes[1] which provides a dplyr-like syntax and is fully interoperable with pandas.

            [1]: https://github.com/maxhumber/redframes

            • amelius an hour ago

              Yes. Pandas turns 10x developers into .1x developers.

              • omnicognate an hour ago

                What about the polars API doesn't work well for your use case?

                • short_sells_poo 18 minutes ago

                  Polars is missing a crucial feature for replacing pandas in finance: first-class timeseries handling. Pandas allows me to easily do algebra on timeseries. I can easily resample data with the resample(...) method, I can reason about the index frequency, I can do algebra between timeseries, etc.

                  You can do the same with Polars, but you have to start messing about with datetimes, converting the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" into SQL-like operations.

                  Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
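
                  For reference, a minimal runnable sketch of those two pandas idioms (the exponentially weighted covariance and the business-month-end resample) on synthetic data; the halflife is an illustrative choice, and the "BM" fallback covers older pandas versions:

```python
import numpy as np
import pandas as pd

# Two synthetic daily series on a business-day index.
idx = pd.bdate_range("2024-01-01", periods=250)
rng = np.random.default_rng(0)
series1 = pd.Series(rng.standard_normal(250), index=idx)
series2 = pd.Series(rng.standard_normal(250), index=idx)

# Exponentially weighted covariance between the two series: one call.
ewm_cov = series1.ewm(halflife=20).cov(series2)

# Monthly sum anchored on the last business day of each month.
try:
    monthly = series1.resample("BME").sum()  # pandas >= 2.2 alias
except ValueError:
    monthly = series1.resample("BM").sum()   # older pandas alias
```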

                  YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.

                • Kalanos an hour ago

                  The pandas API makes a lot more sense if you are familiar with numpy.

                  Writing pandas code is a bit redundant. So what?

                  Who is to say that fireducks won't make their own API?

                • imranq 43 minutes ago

                  This presentation does a good job distilling why FireDucks is so fast:

                  https://fireducks-dev.github.io/files/20241003_PyConZA.pdf

                  The main reasons are

                  * multithreading

                  * rewriting base pandas functions like dropna in C++

                  * in-built compiler to remove unused code

                  Pretty impressive, especially given that you just import fireducks.pandas as pd instead of import pandas as pd and you are good to go.

                  However, I think if you are using a pandas function that wasn't rewritten, you might not see the speedups.
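
                  A sketch of what that drop-in swap looks like in practice; the try/except fallback to stock pandas is my addition, so the same script runs whether or not FireDucks is installed:

```python
# Only the import changes; everything below is unmodified pandas code.
try:
    import fireducks.pandas as pd  # accelerated drop-in, if installed
except ImportError:
    import pandas as pd  # fall back to stock pandas

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 5]})
result = df.groupby("group")["value"].sum()
```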

                  • omnicognate an hour ago

                    > Then came along Polars (written in Rust, btw!) which shook the ground of Python ecosystem due to its speed and efficiency

                    Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.

                    • ralegh an hour ago

                      I personally found the Polars API much clunkier, especially for rapid prototyping. I use it only for cemented processes where I could do with a speed-up or memory reduction.

                      Is there anything specific you prefer moving from the pandas API to polars?

                    • bratao 2 hours ago

                      Unfortunately it is not Opensource yet - https://github.com/fireducks-dev/fireducks/issues/22

                      • gus_massa 2 hours ago

                        > FireDucks is not a open source library at this moment. You can get it installed freely using pip and use under BSD-3 license and of course can look into the python part of the source code.

                        I don't understand what it means. It looks like a contradiction. Does it have a BSD-3 licence or not?

                        • sampo an hour ago

                          BSD license gives you the permission to use and to redistribute. In this case you may use and redistribute the binaries.

                          Edit: To use, redistribute, and modify, and distribute modified versions.

                          • japhyr 27 minutes ago

                            "Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met..."

                            https://opensource.org/license/bsd-3-clause

                            • GardenLetter27 24 minutes ago

                              Such a crazy distortion of the meaning of the license.

                              Imagine being like "the project is GPL - just the compiled machine code".

                            • abcalphabet an hour ago

                              From the above link:

                              > While the wheel packages are available at https://pypi.org/project/fireducks/#files, and while they do contain Python files, most of the magic happens inside a (BSD-3-licensed) shared object library, for which source code is not provided.

                              • _flux an hour ago

                                They provide BSD-3-licensed Python files but the interesting bit happens in the shared object library, which is only provided in binary form (but is also BSD-3-licensed it seems, so you can distribute it freely).

                                • joshuaissac an hour ago

                                  Since it is under the BSD 3 licence, users would also be permitted to decompile and modify the shared object under the licence terms.

                              • Y_Y 2 hours ago

                                Wouldn't it be nice if GitHub was just for source code, and you couldn't just slap up a README that's an ad for some proprietary shitware with a vague promise of source some day in the glorious future?

                                • diggan an hour ago

                                  > Wouldn't it be nice if GitHub was just for source code

                                  GitHub has always been a platform for "we love to host FOSS but we won't be 100% FOSS ourselves", so it makes sense that they allow that kind of usage for others too.

                                  I think what you want is something like Codeberg instead, which is explicitly for FOSS and 100% FOSS itself.

                                  • rad_gruchalski 2 hours ago

                                    You'd slap that in a comment then?

                                    • thecopy 2 hours ago

                                      >proprietary shitware

                                      Is this shitware? It seems to be very high quality code

                                      • yupyupyups an hour ago

                                        I think the anger comes from the fact that we expect Github repositories to host the actual source code and not be a dead-end with a single README.md file.

                                        • ori_b an hour ago

                                          How can you tell?

                                          • sbarre 6 minutes ago

                                            I mean, based on the claims and the benchmarks, it seems to provide massive speedups to a very popular tool.

                                            How would you define "quality" in this context?

                                    • noduerme 10 minutes ago

                                      Not a python guy and I tend to roll my own code in nodejs for analyzing large data sets - I find read/write ops should be the least of one's worries. Finding multithread strategies with workers has been a huge boon, even if each worker needs to read parts of the dataset itself before spinning up. I'm kinda curious what the comparison would be in terms of leveraging cores and memory consumption for operating on time series data in python with any of these libraries, versus a couple hundred lines of tailored JS to tease out what you're looking for.

                                      • ayhanfuat an hour ago

                                        In its essence it is a commercial product which has a free trial.

                                        > Future Plans By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.

                                        • graemep an hour ago

                                          It's BSD licensed. They do not say what the plans are, but most likely a proprietary version with added support or features.

                                          • ori_b 39 minutes ago

                                            It's a BSD licensed binary blob. There's no code provided.

                                            • ayhanfuat an hour ago

                                              They say the source code for the part “where the magic happens” is not available so I am not sure what BSD implies there.

                                          • Kalanos 28 minutes ago

                                            Regarding compatibility, fireducks appears to be using the same column dtypes:

                                            ```
                                            >>> df['year'].dtype == np.dtype('int32')
                                            True
                                            ```

                                            • KameltoeLLM 6 minutes ago

                                              Shouldn't that be FirePandas then?

                                              • adrian17 an hour ago

                                                Any explanation what makes it faster than pandas and polars would be nice (at least something more concrete than "leverage the C engine").

                                                My easy guess is that, compared to pandas, it's multi-threaded by default, which makes for an easy perf win. But even then, 130-200x feels extreme for a simple sum/mean benchmark. I see they are also doing lazy evaluation and some MLIR/LLVM-based JIT work, which is probably enough to get an edge over Polars; though its wins over DuckDB _and_ ClickHouse are surprising, seemingly out of nowhere.

                                                Also, I thought one of the reasons for Polars's API was that Pandas API is way harder to retrofit lazy evaluation to, so I'm curious how they did that.

                                                • viraptor 2 hours ago

                                                  > 100% compatibility with existing Pandas code: check.

                                                  Is it actually? Do people see that level of compatibility in practice?

                                                • Kalanos 38 minutes ago
                                                  • short_sells_poo 6 minutes ago

                                                    Looks very cool, BUT: it's closed source? That's an immediate deal breaker for me as a quant. I'm happy to pay for my tools, but not being able to look and modify the source code of a crucial library like this makes it a non-starter.

                                                    • pplonski86 2 hours ago

                                                      How does it compare to Polars?

                                                      EDIT: I've found some benchmarks https://fireducks-dev.github.io/docs/benchmarks/

                                                      Would be nice to know what are internals of FireDucks

                                                      • DonHopkins 13 minutes ago

                                                        FireDucks FAQ:

                                                        Q: Why do ducks have big flat feet?

                                                        A: So they can stomp out forest fires.

                                                        Q: Why do elephants have big flat feet?

                                                        A: So they can stomp out flaming ducks.

                                                        • E_Bfx 2 hours ago

                                                          Very impressive, the Python ecosystem is slowly getting very good.

                                                          • BiteCode_dev 2 hours ago

                                                            Spent the last 20 years hearing that.

                                                            At some point I think it's more honest to say "the python ecosystem keeps getting more awesome".

                                                          • i_love_limes 2 hours ago

                                                             I have never heard of FireDucks! I'm curious if anyone else here has used it. Polars is nice, but it's not totally compatible. It would be interesting to see how much faster it is for more complex calculations.

                                                            • thecleaner 2 hours ago

                                                               Sure, but that's single-node performance. This makes it not very useful IMO, since quite a few data science folks work with Hadoop, Snowflake, or Databricks clusters, where data is distributed and querying is handled by Spark executors.

                                                              • chaxor an hour ago

                                                                The comparison is to pandas, so single node performance is understood in the scope. This is for people running small tasks that may only take a couple days on a single node with a 32 core CPU or something, not tasks that take 3 months using thousands of cores. My understanding for the latter is that pyspark is a decent option, while ballista is the better option for which to look forward. Perhaps using bastion-rs as a backend can be useful for an upcoming system as well. Databricks et al are cloud trash IMO, as is anything that isn't meant to be run on a local single node system and a local HPC cluster with zero code change and a single line of config change.

                                                                 While for most of my jobs I ended up being able to avoid HPC by simply being smarter and discovering better algorithms to process information, I recall liking pyspark well enough, but I preferred the simplicity of Ballista over pyspark due to the simpler installation of Rust versus managing Java and JVM junk. The constant problems caused by anything with a JVM backend, and the environment config that goes with it, were terrible to set up on every new system where I ran a new program.

                                                                 In this regard, Ballista is an enormous improvement. Anything that is a one-line install via pip on any new system, runs local-first without any cloud or telemetry, and requires no change in code to run on a laptop vs. HPC is the only kind of option worth even beginning to look into and use.

                                                                • Kalanos 22 minutes ago

                                                                  Hadoop is no longer relevant, which is telling.

                                                                   Unless I had thousands of files to work with, I would be loath to use cluster computing. There's so much overhead, cost, waiting for nodes to spin up, and cloud-architecture nonsense.

                                                                  My "single node" computer is a refurbished tower server with 256GB RAM and 50 threads.

                                                                  Most of these distributed computing solutions arose before data processing tools started taking multi-threading seriously.