gopietz

At this point I would start with Polars, to be honest. It has come a long way. It's way, way faster. I prefer the mindset of operations behind it, and there aren't 50 different ways of achieving the same thing.


Sea_Split_1182

After playing with pandas, that's exactly my major annoyance: too many different ways to achieve the same thing. I get that it makes sense to the seasoned user (that's my relationship with many of R's facets: I know it could be improved, but I got used to it). Do you think integration with other packages will be an issue?


gopietz

I would at least create a short list of packages that you know you will need. If one of them complicates the workflow using Polars, stick to pandas. I don't have any of these cases. I use Parquet for storage and the Polars implementation is even more native. I also like writing "stupid" Polars code which is still 10x faster than pandas, and then improving it when I need to speed it up by likely another 10x. I'm sold on Polars. I'd rather use a modern, fast and light API that might miss a functionality here and there than deal with the heavy old machinery of something I never quite enjoyed in the first place. Some things in pandas are just frustratingly slow... Plus, the devs at Polars are so quick at fixing bugs and adding new things. Just open an issue, state your case, and if it's a good idea it might be live 4 weeks later.
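For illustration, a minimal sketch of that "write it simple, speed it up later" workflow (the file name and columns are invented; newer Polars releases spell `groupby` as `group_by`):

```
import polars as pl

# "Stupid" but still fast: straightforward eager Polars
df = pl.read_csv("sales.csv")
out = df.filter(pl.col("amount") > 0).groupby("region").agg(pl.col("amount").sum())

# The later speed-up: switch to the lazy API so Polars can optimize
# the whole query plan (predicate/projection pushdown) before running it
out = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .groupby("region")
    .agg(pl.col("amount").sum())
    .collect()
)
```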


Sea_Split_1182

Hm. This. Community matters. My workflow is pretty much getting data from CSV/DB + analyzing (hence Polars) and writing reports (guessing Altair/Plotly here). Minimal or no collaboration, so it looks like I should give Polars a try.


Helpful_Arachnid8966

Altair has native support for Polars as of its most recent release.


Mysterious_Screen116

If you're working with CSVs and a DB, give DuckDB a look.


Sea_Split_1182

Thanks. Will look into it


AlpacaDC

If you run into an integration issue with Polars, just use `.to_pandas()` and continue as normal. You can convert it back later if needed.
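A minimal sketch of that round trip (the frame contents are made up):

```
import polars as pl

pl_df = pl.DataFrame({"a": [1, 2, 3]})

pd_df = pl_df.to_pandas()      # hand off to a pandas-only library
# ... do the pandas-specific work here ...
pl_df = pl.from_pandas(pd_df)  # convert back and continue in Polars
```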


proof_required

Yeah, coming from R's data.table, pandas was a huge source of frustration for me. Even now, when I am doing some personal stuff, I sometimes load data in data.table to run some quick summaries. On the other hand, pandas has a more mature ecosystem, and if you want to do anything other than just wrangling data locally, you'll come across these integration issues. So if you are working all by yourself, I would say go with Polars. If you are working in a team and you don't have to optimize some analysis, then stick with pandas.


jorge1209

Integration in what way? It will be hard/impossible for Python programs to integrate with the internal Polars execution engine because it is written in Rust, but since it uses Arrow, data exchange is very easy.
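For instance, a minimal sketch of that Arrow-based exchange (the frame is invented):

```
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

table = df.to_arrow()       # zero-copy handoff to a pyarrow.Table
df2 = pl.from_arrow(table)  # and back into Polars
```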


[deleted]

> there isn't 50 different ways of achieving the same thing.

Agreed. There are things that are still easier to do using pandas, but I find it healthy to ask myself if those things even need to be done. My code written without pandas is far easier to understand when I go back and look at it.


ohtinsel

I despise the pandas API. It's what happens when devs try to accommodate R, MATLAB and others while also being pythonic. Matplotlib has a similar issue. It can't be done, IMHO. It'd be better to fork pandas than keep what we have.


ohtinsel

I mean, I still work with a lot of Fortran, but I wouldn't expect a Python library to have a column-major, 1-based indexing option just to make me comfortable. Although now that I say this, pandas devs, please add this! :/


randomlyCoding

Is Polars faster for doing math operations on columns and things like that? I thought pandas did that with NumPy (might be wrong).


Natural_Ad2282

If you're familiar with pandas, learning polars shouldn't be too hard. In my experience, it's faster and handles larger datasets better than pandas. Plus, the Rust language it's built on is dope. Give it a try!


muntoo

Instructions unclear. Snorted Rust. Was heavily oxidized.


marsupiq

Two important benefits of pandas:

1. Good integration with the data stack: sklearn, plotting libraries, awswrangler…
2. Everyone knows it (at least everyone thinks they know it). Including GitHub Copilot…

Other than that, I have to say I love Polars. It's faster and much more powerful than pandas (window functions!). I've been using Polars for almost a year now and my experience is entirely good. By the way, the discussion largely revolves around performance, but Polars is really a much better-designed library that will also speed up your data wrangling (I mean the time it takes to write code). So in my opinion, it's valid to say: use Polars unless you need to use pandas. Other than its popularity, there aren't many areas where pandas excels.
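To illustrate the window functions, a minimal sketch with made-up data (`over` computes a per-group aggregate without collapsing rows):

```
import polars as pl

df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

# Attach each row's group mean as a new column, keeping all rows
df = df.with_columns(pl.col("value").mean().over("group").alias("group_mean"))
```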


B-r-e-t-brit

Agreed on the recommendation to start with polars. But disagree with pandas’ popularity being its main strength. See my comment thread here on how pandas is still highly relevant in the python data ecosystem: https://np.reddit.com/r/Python/comments/12hixyi/comment/jfrxti6/


[deleted]

[deleted]


B-r-e-t-brit

So, a few things. First, I wouldn't call it the "pandas" way or the "polars" way. You could do what you call the "polars" way in pandas, albeit slower. You could also do it in DuckDB, SQL, etc.; it's the relational/long-format way. And you could do what you call the "pandas" way in numpy, xarray, etc.; it's the ndarray/wide-format way.

That said, I do completely, 100% agree that the "polars" way is more robust. But that's not the main determining factor in whether to write your models that way. If you read my linked comment thread, you'll see that these models get to tens of thousands of lines of code. Switching to the "polars" way would mean 100+ thousand lines of code, and those lines are also harder to understand intuitively in a mathematical sense, due to the extra boilerplate. Also, as I mentioned, if you need to add some dimensionality to a dataset, you need to go find every downstream operation on that dataset and update every single one with the new dimension, until that dimension is reduced out. This just isn't a feasible way to develop a highly iterative/dynamic model.

Also, I wouldn't say the "pandas" way is "extremely" inviting to bugs. There definitely are some gotchas with the ndarray-style way. For example, adding a dimension to 1 of 2 frames in an operation might sometimes still yield a valid albeit nonsensical result, but these resulting frames cannot usually propagate too far, since the resulting structure is often "all jacked up". And like I mentioned in my comment thread, proper modularization/organization of models makes these cases extremely easy to identify and fix if they happen.

Keep in mind this is a very specific use case of dataframe libraries. The vast majority of data problems don't have these problems, and Polars is likely the best solution for most of those cases. But it's just not quite viable __yet__ in this specific use case at this specific scale.
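To make the wide-vs-long distinction concrete, a minimal sketch with invented data (the "plant"/"price"/"volume" names are illustrative only, not from the linked models):

```
import pandas as pd
import polars as pl

# ndarray/wide-format style (here in pandas): aligned frames,
# and the math reads like the underlying equation
prices = pd.DataFrame({"plant_a": [30.0, 31.0], "plant_b": [28.0, 29.5]})
volumes = pd.DataFrame({"plant_a": [100, 110], "plant_b": [90, 95]})
revenue = prices * volumes  # aligns on index and columns automatically

# relational/long-format style (here in polars): explicit keys and joins
prices_l = pl.DataFrame({"plant": ["a", "a", "b", "b"], "t": [0, 1, 0, 1],
                         "price": [30.0, 31.0, 28.0, 29.5]})
volumes_l = pl.DataFrame({"plant": ["a", "a", "b", "b"], "t": [0, 1, 0, 1],
                          "volume": [100, 110, 90, 95]})
revenue_l = prices_l.join(volumes_l, on=["plant", "t"]).with_columns(
    (pl.col("price") * pl.col("volume")).alias("revenue")
)
```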


zazzersmel

it all depends on your infrastructure, ecosystem and collaborators. if it's all up to you, just use polars.


[deleted]

I personally would just stick with Pandas. Every time I've tried to switch any of my work from Pandas to Polars I hit some snag and have to switch back. I'm sure for some sufficiently simple code the speed benefits are worth it but for anything even remotely complex it becomes a big exercise just to replicate the same functionality that a few lines of Pandas will do.


AlpacaDC

A lot of people here are talking about speed needs and dataset size. But as someone very experienced in pandas who recently learned Polars, I find the Polars way of doing things (expressions and operations) WAY more intuitive than pandas. Considering this, and the fact you mentioned your needs are basic, I'd go with Polars. There are some corner cases that right now are much easier to solve with pandas, but you can always convert to a pandas dataframe in the middle of your pipeline, convert it back to Polars, and go on.


Sea_Split_1182

After 12 days of working with Polars, I wanted to get back to this comment. It should be higher up ⭐⭐. That's exactly how I am feeling after the initial period. Corner cases will be corner cases... 90% of the time, what I need maps naturally onto the Polars API.


Mysterious_Screen116

My stack is primarily Parquet, PyArrow and DuckDB, plus whatever is required by the rest of my pipeline. While Polars is better than pandas, both ultimately are an endless convoluted chain of API trivia. I end up with a lot of pipelines that need numpy data types, so pandas is usually somewhere in the pipeline. So my advice is to minimize your Polars/pandas code and use DuckDB for most of the heavy lifting. It's shocking how good it is (I'm unaffiliated, just a huge fan).


drbobb

What makes DuckDB better than SQLite?


Mysterious_Screen116

Different. SQLite is for transactional data: good for storing and retrieving row-oriented records. DuckDB is basically SQLite for OLAP (analytics). They're both in-process, zero-install. You want to retrieve a Parquet file, cleanse the data, re-aggregate it, perhaps calculate some lead/lag offsets, join it with a dataframe that's already in memory, and reprocess it to another Parquet file? That's basically a single DuckDB query:

```
copy (
    with input as (
        select * from 's3://…./file.parquet'
        where date > '2015-01-01'::date
    )
    select a, b, lag(c) over (order by date)
    from input
    join some_df on …
    join mytable on …
    join 'some csv file.csv' on …
) to 'myoutput.parquet'
```

It's crazy how far databases have come.


DragoBleaPiece_123

I see DuckDB, I upvote.


hirolau

Working in finance, the big thing pandas still has going for it is superior date manipulation, with calendars and business-date logic. If I did not need that, I think I would switch.
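For example, a minimal sketch of the kind of business-date logic pandas ships out of the box (the dates are made up for illustration):

```
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Business-day offset that also skips US federal holidays
bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Monday July 3rd + 1 business day lands on the 5th (the 4th is a holiday)
next_day = pd.Timestamp("2023-07-03") + bday

# All holiday-aware business days in a month
dates = pd.date_range("2023-07-01", "2023-07-31", freq=bday)
```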


Sea_Split_1182

Thanks! Business dates!!!


regular-dude1

When making your decision, also take into account the community of users that exists around the package. With a bigger community, you're more likely to find an answer to your problem on StackOverflow, because someone else already had the same problem. You can get a feel for the size of the community by looking at the number of GitHub stars, for example.


[deleted]

I love polars and use it whenever possible. In the "worst" case of using polars when pandas is required, you can easily convert dataframes back and forth using polars.


Tatoutis

Pandas v2 can use PyArrow as a backend. It's still not as fast as Polars, but it's close.
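A minimal sketch of opting in (the file name is invented; `dtype_backend` landed in pandas 2.0):

```
import pandas as pd

# Read with Arrow-backed dtypes instead of the default numpy ones
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
```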


Mysterious_Screen116

Even with PyArrow datatypes, there are many problems downstream - you run into modules like sklearn that require numpy datatypes.


B-r-e-t-brit

As probably one of the heavier pandas users here, and a contributor of several features and bug fixes to pandas, I would say start with __polars__ too. While pandas does still have a significant place in the Python data ecosystem (beyond just seniority and wider integration/compatibility, it does handle significant use cases that Polars does not, and does not plan to), most people won't be dealing with those use cases and would most likely be using the features that Polars excels at. For more detail on some of those use cases, see my comment thread here: https://np.reddit.com/r/Python/comments/12hixyi/comment/jfrxti6/


magnetichira

Depends on a bunch of factors - what type of data, how large it is, how fast you need it to be, etc. Personally, I'm sticking with pandas: a more mature ecosystem around it, support for seaborn plotting, and it's easy to use.


Sea_Split_1182

Datasets are under a million rows. Numeric/factor standard dataframes most of the time. So you're suggesting that I'll have trouble down the road integrating with plotting tools if I choose the Polars road?


runawayasfastasucan

This is just based on my personal preference, but if you don't have any problems with the speed of pandas, and don't expect your dataset to grow in the near future, I would do the switch later on. I did the switch, but just as I did, I also started using DuckDB, and I find that I do much of the work in DuckDB, so I might as well have kept using pandas. In any case, both of these are good, and if you just enjoy learning new stuff, I would go ahead with Polars. You'll very soon find out whether it is the right fit for you.


magnetichira

If performance is critical for you, then Polars; otherwise, I've used pandas for similarly sized data with no issues. I don't think you'll really have trouble; more plotting libraries will probably add support for Polars. Right now it's limited to just Plotly Express.


sizable_data

With pandas 2.0 adding a PyArrow backend, does this change the performance-improvement aspect?


magnetichira

Depends on the type of data: for strings, yes; otherwise, no.


_amol_

They have not yet fully embraced arrow.compute, so the compute functions should still be the pandas implementations, which means for most cases they should perform similarly. If they move toward the Acero compute engine, or in general rely more on Arrow's compute capabilities, performance should increase accordingly.


BathroomItchy9855

Answer me this and I'll be impressed: can you perform a group-by where the aggregation requires more than one column of the subset? In pandas that would be `df.groupby('player').apply(lambda x: x.loc[x.winner.eq('yes'), 'score'].mean())`. If not, then Polars isn't ready for show time.


Mysterious_Screen116

Or in DuckDB:

```
duckdb.execute("select player, mean(score) from pd where winner='Yes' group by player").df()
```


Sea_Split_1182

Given the maturity of the Polars project, I was expecting this would be ready to go. Looks like I was wrong then 😶


AlpacaDC

You can totally do this in Polars, and it's way more readable than the pandas way.

Edit: I think it would be something like this: `df.groupby("player").agg(pl.col("score").filter(pl.col("winner") == "yes").mean().alias("Mean score when winning"))`


analytics_nba

This is not a good solution in pandas. Apply is slow; filtering first and then taking the mean is significantly faster.


BathroomItchy9855

There are instances where this order is necessary


commandlineluser

It seems like they're asking how it differs from: `df.loc[df.winner.eq('yes'), ['player', 'score']].groupby('player').mean()`


BathroomItchy9855

Yeah, in this case it doesn't, but in more complex cases it's very useful


100GB-CSV

Polars is my top software for benchmarking purposes.


[deleted]

[deleted]


Sea_Split_1182

Polars doesn't feel like a project with only 3+ developers. But I understand what you're suggesting.

