[Enh]: Add Support For PySpark #333

Open
ELC opened this issue Jun 24, 2024 · 12 comments · May be fixed by #908
Labels
enhancement New feature or request

Comments

@ELC
Contributor

ELC commented Jun 24, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

No response

Please describe the purpose of the new feature or describe the problem to solve.

PySpark is one of the most used dataframe processing frameworks in the big data space. There are whole companies built around it and it is commonplace in the Data Engineering realm.

One common pain point is that data scientists usually work with Pandas (or, more recently, Polars), and when their code is integrated into big ETL processes it is usually converted to PySpark for efficiency and scalability.

I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.

Suggest a solution if possible.

PySpark has two distinct APIs:

  • PySpark SQL - Identical API to Scala version of Spark
  • PySpark Pandas - Subset of Pandas API (Previously known as Koalas)

Given that PySpark Pandas has an API based on Pandas, I believe it should be relatively straightforward to re-use the code already written for the Pandas backend.

There is a conversion from PySpark SQL to PySpark Pandas, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead is too big, a separate backend for that API could be considered.
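
For illustration, a minimal sketch of the two APIs and the conversion between them (assuming an existing SparkSession; pandas_api and to_spark are the documented conversion methods):

import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PySpark SQL: mirrors the Scala/JVM DataFrame API
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
sdf_filtered = sdf.filter(sdf.id > 1)

# PySpark Pandas (formerly Koalas): a subset of the pandas API
psdf = ps.DataFrame({"id": [1, 2], "letter": ["a", "b"]})
psdf_filtered = psdf[psdf["id"] > 1]

# Converting between the two APIs
psdf_from_sql = sdf.pandas_api()   # PySpark SQL -> PySpark Pandas
sdf_from_pandas = psdf.to_spark()  # PySpark Pandas -> PySpark SQL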

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but would like some concrete direction on how to get started and whether this is of interest to the maintainers.

@MarcoGorelli
Member

Thanks @ELC for your request! Yup, this is definitely in scope and of interest!

@jahall

jahall commented Jun 26, 2024

I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.

@MarcoGorelli
Member

Fantastic, thank you! I did use pyspark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing to check that things work?

PySpark Pandas

It might be easiest to just start with this to be honest. Then, once we've got it working, we can remove a layer of indirection

In terms of contributing, I think if you take a look at narwhals._pandas_like, that's where the implementation for the pandas APIs is. There's an _implementation field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
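
As a purely hypothetical sketch of that idea (the class and attribute names below follow the description above, not necessarily the actual narwhals internals), the pandas-like wrapper might look roughly like this:

# Hypothetical sketch only: a pandas-like wrapper that records which
# backend library it holds; names are illustrative, not narwhals' real API.
class PandasLikeDataFrame:
    def __init__(self, native_dataframe, implementation: str) -> None:
        self._native_dataframe = native_dataframe
        # e.g. "pandas", "modin", "cudf" -- and, potentially, "pyspark.pandas"
        self._implementation = implementation

    @property
    def columns(self) -> list[str]:
        # Most pandas-like operations forward directly to the native object;
        # backend-specific quirks would be special-cased on self._implementation.
        return list(self._native_dataframe.columns)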

@ELC
Contributor Author

ELC commented Jun 26, 2024

I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run GitHub Actions, and they come with all relevant Java dependencies pre-installed, so having the tests run in CI/CD should not be a problem.

As per local development, there are a couple of options:

  • Setting PySpark locally - Really troublesome
  • Using VS Code with Dev Containers - Requires docker desktop installation
  • Using GitHub Codespaces with a PySpark image - Easiest, but relies on a working internet connection

For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.

I will have a look at _pandas_like and _implementation to see what's needed and will keep you posted on the progress.

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Jun 28, 2024
@TomBurdge

TomBurdge commented Aug 21, 2024

Hey folks, I have been working on this feature on a local branch.
I have a few questions:

  • What is the best way to gradually develop and implement this?

Adding support for a whole DataFrame library is quite a big feature, and it is a little all or nothing. I think it would be useful if any changes/choices being made are picked up early on.

Have we considered the following for pyspark pandas?

  • pyspark pandas converts to pyarrow, then pandas. There is quite often type confusion somewhere in the process. There is also a significant, but often overlooked, overhead from these conversions. Performance is generally much better with straight pyspark.
  • pyspark pandas doesn't work well with the latest versions of pandas; it is particularly unstable with pandas >= 2. I have had to limit the pandas version in dev-requirements on my local branch to pandas<2.1.0 because pyspark pandas still relies on an API that pandas has since deprecated.

The local branch I have been working on uses straight pyspark.
It would also be possible to add pyspark pandas as its own pandas-like implementation AND the straight pyspark API and give users the choice.

This is perhaps something to be explored in a later feature, but if we add support for the pyspark API we could possibly have a route to much more library support through sqlframe. I prefer to implement pyspark over pyspark pandas whether or not sqlframe is an option for narwhals.

I am not developing using pyspark pandas, but I still had to limit my pandas version to <2.1.0 on my local branch. Why is this?

Following the lead of the dask lazyframe, I have been implementing the collect method as returning a pandas-backed pandas-like DataFrame. Spark then still needs to interface with pandas, which runs into the deprecation issue with pandas >= 2.1.0.

I think there might be a better way. The PySpark 4.0.0 preview adds a toArrow feature. The collect -> eager implementation can then return an Arrow-backed DataFrame. There is only one conversion here (pyspark -> arrow) rather than several (pyspark -> pandas -> arrow). Ritchie Vink made a related comment in the past about the two-conversion route being preferable, back when there was no official way of going to Arrow from Spark. The challenge is that this would make at least the eager implementation in narwhals incompatible with pyspark<4.0.0 (once Spark 4.0.0 is actually released). Any thoughts on this?
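
As a rough sketch of that idea (toArrow is the method added in the PySpark 4.0.0 preview; the helper name and the fallback branch are my own):

import pyarrow as pa

def collect_to_arrow(spark_df) -> pa.Table:
    # PySpark >= 4.0: a single conversion straight to Arrow.
    if hasattr(spark_df, "toArrow"):
        return spark_df.toArrow()
    # Older PySpark: fall back to the pandas round-trip.
    return pa.Table.from_pandas(spark_df.toPandas())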

@TomBurdge

TomBurdge commented Aug 21, 2024

Hey @ELC, I usually use SDKMAN when developing pyspark locally, and it works pretty nicely in my experience (on WSL, set up on a few different laptops). It seems like a lot of pyspark developers aren't aware of this approach, because the tool is mostly used by Java developers, and plenty of pyspark developers don't know Java (at least I don't ☺️).

SDKMAN allows you to, at least the way I think about it, create the Java equivalent of a virtual environment for your repo via a .sdkmanrc file at the root.
Right now I have the following:

java=21.0.4-tem
spark=3.5.1
hadoop=3.3.5

Along with adding pyspark to dev-requirements.txt, I was good to go. I do agree with you that an approach which uses containerization is likely to work best for most people.

@MarcoGorelli
Member

Nice one, thanks @TomBurdge! It would be great to have you on board as a contributor here!

I think converting to Arrow might make sense in collect. If anyone wants to support earlier versions of spark and convert to pandas, they can always do that manually with something like

import narwhals as nw

if is_spark_lazyframe(df):  # hypothetical check for a Spark-backed LazyFrame
    df = nw.from_native(nw.to_native(df).toPandas(), eager_only=True)
else:
    df = df.collect()

But I think that data science libraries which explicitly support Spark are quite rare, unfortunately - well, fortunately for us, it means that backwards-compatibility is less of a concern, and so they likely wouldn't mind only supporting Spark 4.0+ too much 😄

@TomBurdge

TomBurdge commented Aug 26, 2024

Thanks @MarcoGorelli , yeah keen to get involved, spare time permitting!
I will try to make it to one of the community calls.

Question on intended behaviour:

For the rename method, the relevant spark function is: withColumnRenamed.
It has a major gotcha, which is that if the original column is not present, then it is a no-op (mentioned in the docs). This behaviour has caused me a lot of frustration. 🙃

What is the narwhals intended behaviour? Strict?
The narwhals tests in tests/frame/rename_test.py aren't opinionated on strict/non-strict currently.
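
For illustration, a strict rename could be built on top of withColumnRenamed by validating the columns first (strict_rename is a hypothetical helper, not an existing narwhals or Spark function):

from pyspark.sql import DataFrame

def strict_rename(df: DataFrame, mapping: dict[str, str]) -> DataFrame:
    # withColumnRenamed silently no-ops on missing columns, so check up front.
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise ValueError(f"Columns not found for rename: {missing}")
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df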

@MarcoGorelli
Member

Ah, interesting, thanks @TomBurdge! I think that, as Narwhals is mainly aimed at tool builders, we'd be better off using the strict behaviour.

@EdAbati EdAbati linked a pull request Sep 3, 2024 that will close this issue
@EdAbati
Contributor

EdAbati commented Sep 3, 2024

Hi @TomBurdge I completely missed these last replies and I have started implementing something for this. Do you also have something implemented? We can collaborate on this and merge what we have :)

I think we should aim for a "minimal" implementation now and then have follow-up PRs for the rest of the methods.

@morrowrasmus

Just letting you know that I'm following this with great interest and could potentially be helpful with testing once you have something ready.

@mattcristal
Contributor

I am new to Spark; just for fun, I tried it locally with: https://github.com/jplane/pyspark-devcontainer
