Comments Page - Non-elementary group-by aggregations in Polars vs pandas

Non-elementary group-by aggregations in Polars vs pandas (labs.quansight.org)
Submitted by rbanffy 3 hours ago

Nihilartikel 2 hours ago
I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark and I am very happy about that.
DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.
- banku_brougham an hour ago
  I had a similar experience with Spark, especially in the Scala API, which felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.
  There are some frustrations in Spark, however: I remember getting stuck on winsorizing over groups. Hilariously, there are identical functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs that they were the same or at least did the same thing.
  Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand. And making updates with new, more type-generic functions is a breeze. Very enjoyable.
  appplication 21 minutes ago
  Spark docs are way too minimal for my taste, at least the API docs.
- coding123 36 minutes ago
  I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/pandas performance for transactional queries (from start to finish, get me a result in less than a second, as long as the number of underlying rows is not greater than 20k-ish). PySpark will never be super great for that.
winwang 11 minutes ago
The power of having an API that allows usage of the Free monad. And in less-funny-FP-speak, the power of allowing the user to write a program (expressions) that the sufficiently-smart backend later compiles/interprets.
Awesome! Didn't expect such a vast difference in usability at first.
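To make the "user writes a program, backend interprets it later" idea concrete, here is a minimal sketch in plain Python. It is not how Polars is implemented: the `Expr` class, `col`, `evaluate`, and `group_by_agg` are hypothetical names invented for this illustration, and the sample rows are made up. The point is only that building the expression does no work; a separate interpreter decides how and when to run it per group.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """A node in a tiny expression tree (hypothetical, for illustration only)."""
    op: str
    args: tuple = ()

    # Each method only records the operation; nothing is computed here.
    def __gt__(self, other):
        return Expr("gt", (self, other))

    def mean(self):
        return Expr("mean", (self,))

    def max(self):
        return Expr("max", (self,))

    def filter(self, mask):
        return Expr("filter", (self, mask))


def col(name):
    """Refer to a column by name without touching any data."""
    return Expr("col", (name,))


def evaluate(expr, group):
    """Interpret an expression tree against one group ({column: list of values})."""
    if expr.op == "col":
        return group[expr.args[0]]
    vals = [evaluate(arg, group) for arg in expr.args]
    if expr.op == "mean":
        return statistics.fmean(vals[0])
    if expr.op == "max":
        return max(vals[0], default=None)
    if expr.op == "gt":
        # Column compared against a scalar (here, the group mean).
        return [x > vals[1] for x in vals[0]]
    if expr.op == "filter":
        return [x for x, keep in zip(vals[0], vals[1]) if keep]
    raise ValueError(f"unknown op: {expr.op}")


def group_by_agg(rows, key, expr):
    """Split rows into per-key column dicts, then interpret expr on each group."""
    groups = {}
    for row in rows:
        columns = groups.setdefault(row[key], {})
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {k: evaluate(expr, columns) for k, columns in groups.items()}


# Made-up sample data.
rows = [
    {"id": 1, "sales": 4, "views": 3},
    {"id": 1, "sales": 1, "views": 1},
    {"id": 1, "sales": 2, "views": 2},
    {"id": 2, "sales": 7, "views": 8},
    {"id": 2, "sales": 6, "views": 6},
    {"id": 2, "sales": 7, "views": 7},
]

# "Max of views where sales is greater than its mean", built as data first.
plan = col("views").filter(col("sales") > col("sales").mean()).max()
result = group_by_agg(rows, "id", plan)
print(result)
```

Because `plan` is just a tree of `Expr` nodes, a smarter backend could inspect or rewrite it before running anything; this sketch simply walks it recursively.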
__mharrison__ 2 hours ago
Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).
And I say this as someone who makes much of their living from Pandas.
- 0cf8612b2e1e 2 hours ago
  I think pandas is well aware of some of the unfortunate legacy API decisions without Polars. They are trapped by backwards compatibility. Wes’ “Things I Hate About Pandas” post covers the highlights, most of which boil down to not having put a layer between numpy and pandas. That is why they were stuck with the unfortunate integer null situation.
  Twirrim an hour ago
  Which is all stuff they could fix, if they'd be willing to, with a major version bump. They'd need a killer feature to encourage that migration though.
  code_biologist 42 minutes ago
  The really brutal thing is all of the code using Pandas written by researchers and non-software engineers running quietly in lab environments. Difficult-to-reproduce environments, small or non-existent test suites, code written by grad students long gone. If the Pandas interface breaks for installs done via `pip install pandas`, it will cause a lot of pain.
  With that acknowledged, it'll make life a lot easier on everyone if the "fix the API" Pandas 3 had a different package name. Polars and others seem like exactly that solution, even if not literally Pandas.
lend000 2 hours ago
I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand to involve less of a speed difference compared to the grouping operations mentioned in the post, but still substantial). Anyone had success doing this and found it to be worth the effort?
- wenc an hour ago
  I converted to DuckDB and Polars. It’s worth it for the speed improvement.
  However, there are subtle differences between Pandas and Polars behaviors, so regression testing is your friend. It’s not a 1:1 mapping.
  kzrdude 44 minutes ago
  There have been so many subtle changes between pandas upgrades (groupby especially is somehow always hit), so regression tests are always needed...
  willseth an hour ago
  Which things did you decide to move to duckdb?
akdor1154 2 hours ago
The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.
On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized.
- dan-robertson 15 minutes ago
  Without having checked, maybe something like:
  select id, max(views) from <tbl> where sales > avg(sales) over (partition by id) group by 1
  In dplyr, there is an ‘old style’ method which works on an intermediate ‘grouped data frame’ and a new style which doesn’t. In the old style:
  df |> group_by(id) |> filter(sales > mean(sales)) |> summarize(max(views))
  In the new style, either:
  df |> filter(.by=id, sales>mean(sales)) |> summarize(.by=id,max(views))
  Or:
  df |> summarize(.by=id, max(views[sales>mean(sales)]))
- wenc an hour ago
  Props to Ritchie Vink for designing polars.
  But also props to Wes McKinney for giving us a dataframe library during a time when we had none. Java still doesn’t have a decent dataframe library so we mustn’t take these things for granted.
  The Pandas API is no longer the way things should be done today, nor should it be in new tutorials. Pandas was the jQuery of its time: great, but no longer the state of the art. But I have much gratitude for it being around when it was needed.
- andy81 2 hours ago
  Here's an example implementation in MSSQL - https://data.stackexchange.com/stackoverflow/query/edit/1873...
  No need to filter within the window function if you use a subquery or CTE, which is supported everywhere.
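As a rough, runnable illustration of the subquery/CTE approach, the sketch below uses Python's built-in `sqlite3` module. The table name `example_table` is borrowed from another reply in this thread; the sample rows, the `means` CTE name, and the helper function are assumptions made up for the example, and this is not the linked MSSQL query. Note that standard SQL does not allow a window function directly in a `WHERE` clause, which is why the per-group mean is computed in a separate step first.

```python
import sqlite3

# Made-up sample data: (id, sales, views).
ROWS = [
    (1, 4, 3),
    (1, 1, 1),
    (1, 2, 2),
    (2, 7, 8),
    (2, 6, 6),
    (2, 7, 7),
]

# Step 1 (CTE): mean of sales per id.
# Step 2: keep rows whose sales exceed their group's mean, then take max(views) per id.
QUERY = """
WITH means AS (
    SELECT id, AVG(sales) AS mean_sales
    FROM example_table
    GROUP BY id
)
SELECT et.id, MAX(et.views) AS max_views
FROM example_table AS et
JOIN means AS m ON m.id = et.id
WHERE et.sales > m.mean_sales
GROUP BY et.id
ORDER BY et.id
"""


def max_views_above_mean(rows):
    """Load rows into an in-memory SQLite table and run the CTE query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE example_table (id INTEGER, sales REAL, views INTEGER)"
        )
        conn.executemany("INSERT INTO example_table VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = max_views_above_mean(ROWS)
print(result)
```

Other databases may need small syntax changes, but the shape (aggregate per group, join back, filter, aggregate again) should carry over.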
- capitainenemo 2 hours ago
  https://en.wikipedia.org/wiki/SQL?useskin=vector#Standardiza...
  According to wikipedia, windowing was standardized back in 2003.
- hobs 2 hours ago
  -- "find the maximum value of 'views',
  -- where 'sales' is greater than its mean, per 'id'".
  select max(views), id -- "find the maximum value of 'views',
  from example_table as et
  where exists
  (
      SELECT *
      FROM
      (
          SELECT id, avg(sales) as mean_sales
          FROM example_table
          GROUP by id
      ) as f
      where et.sales > f.mean_sales -- where 'sales' is greater than its mean
      and et.id = f.id
  )
  group by id; -- per 'id'".
Vaslo 30 minutes ago
I’ve moved mostly to Polars. I still have some frameworks that demand pandas, and pandas is still a very solid dataframe library, but when I need to interpolate months in millions of lines of quarterly data, Polars just blows it away.
Even better is using tools like Narwhals and Ibis, which can convert back and forth to any frames you want.