[Enh]: Add Support For PySpark #333

Open
ELC opened this issue Jun 24, 2024 · 12 comments · May be fixed by #908
Labels
enhancement New feature or request

Comments

@ELC
Contributor

ELC commented Jun 24, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

No response

Please describe the purpose of the new feature or describe the problem to solve.

PySpark is one of the most used dataframe processing frameworks in the big data space. There are whole companies built around it and it is commonplace in the Data Engineering realm.

One common pain point is that data scientists usually work with Pandas (or, more recently, Polars), and when their code is integrated into big ETL processes it is usually converted to PySpark for efficiency and scalability.

I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.

Suggest a solution if possible.

PySpark has two distinct APIs:

  • PySpark SQL - Identical API to Scala version of Spark
  • PySpark Pandas - Subset of Pandas API (Previously known as Koalas)

Given that PySpark Pandas has an API based on Pandas, I believe it should be relatively straightforward to re-use the code already written for the Pandas backend.

There is a conversion from PySpark SQL to PySpark Pandas, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead is too big, a separate backend for that API could be considered.
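
For illustration, a minimal sketch of the two APIs and the conversion between them (assuming an existing SparkSession; pandas_api and to_spark are the documented conversion methods):

import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PySpark SQL: mirrors the Scala/JVM DataFrame API
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
sdf_filtered = sdf.filter(sdf.id > 1)

# PySpark Pandas (formerly Koalas): a subset of the pandas API
psdf = ps.DataFrame({"id": [1, 2], "letter": ["a", "b"]})
psdf_filtered = psdf[psdf["id"] > 1]

# Converting between the two APIs
psdf_from_sql = sdf.pandas_api()   # PySpark SQL -> PySpark Pandas
sdf_from_pandas = psdf.to_spark()  # PySpark Pandas -> PySpark SQL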

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but would like some concrete direction on how to get started and whether this is of interest to the maintainers.

@MarcoGorelli
Member

Thanks @ELC for your request! Yup, this is definitely in scope and of interest!

@jahall

jahall commented Jun 26, 2024

I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.

@MarcoGorelli
Member

Fantastic, thank you! I did use pyspark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing to check that things work?

PySpark Pandas

It might be easiest to just start with this to be honest. Then, once we've got it working, we can remove a layer of indirection

In terms of contributing, I think if you take a look at narwhals._pandas_like, that's where the implementation for the pandas APIs is. There's an _implementation field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
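
As a purely hypothetical sketch of that idea (the class and attribute names below follow the description above, not necessarily the actual narwhals internals), the pandas-like wrapper might look roughly like this:

# Hypothetical sketch only: a pandas-like wrapper that records which
# backend library it holds; names are illustrative, not narwhals' real API.
class PandasLikeDataFrame:
    def __init__(self, native_dataframe, implementation: str) -> None:
        self._native_dataframe = native_dataframe
        # e.g. "pandas", "modin", "cudf" -- and, potentially, "pyspark.pandas"
        self._implementation = implementation

    @property
    def columns(self) -> list[str]:
        # Most pandas-like operations forward directly to the native object;
        # backend-specific quirks would be special-cased on self._implementation.
        return list(self._native_dataframe.columns)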

@ELC
Contributor Author

ELC commented Jun 26, 2024

I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run GitHub Actions, and they come with all relevant Java dependencies pre-installed, so having the tests run in CI/CD should not be a problem.

As per local development, there are a couple of options:

  • Setting PySpark locally - Really troublesome
  • Using VS Code with Dev Containers - Requires docker desktop installation
  • Using GitHub Codespaces with a PySpark image - Easiest, but relies on a working internet connection

For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.

I will have a look at _pandas_like and _implementation to see what's needed and will keep you posted on the progress.

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Jun 28, 2024
@TomBurdge

TomBurdge commented Aug 21, 2024

Hey folks, I have been working on this feature on a local branch.
I have a few questions:

  • What is the best way to gradually develop and implement this?

Adding support for a whole DataFrame library is quite a big feature, and it is a little all or nothing. I think it would be useful if any changes/choices being made are picked up early on.

Have we considered the following for pyspark pandas?

  • pyspark pandas converts to pyarrow, then pandas. There is quite often type confusion somewhere in the process. There is also a significant, but often overlooked, overhead from these conversions. Performance is generally much better with straight pyspark.
  • pyspark pandas doesn't work well with the latest versions of pandas; it is particularly unstable with pandas >= 2. I have had to limit the pandas version in dev-requirements on my local branch to pandas<2.1.0 because pyspark pandas still relies on an API that pandas has since deprecated.

The local branch I have been working on uses straight pyspark.
It would also be possible to add pyspark pandas as its own pandas-like implementation AND the straight pyspark API and give users the choice.

This is perhaps something to be explored in a later feature, but if we add support for the pyspark API we could possibly have a route to much more library support through sqlframe. I prefer to implement pyspark over pyspark pandas whether or not sqlframe is an option for narwhals.

I am not developing using pyspark pandas, but I still had to limit my pandas version to <2.1.0 on my local branch. Why is this?

Following the lead of the dask lazyframe, I have been implementing the collect method as returning a pandas-backed pandas-like DataFrame. Spark then still needs to interface with pandas, which runs into the deprecation issue with pandas >= 2.1.0.

I think there might be a better way. The PySpark 4.0.0 preview adds a toArrow feature. The collect -> eager implementation can then return an Arrow-backed DataFrame. There is only one conversion here (pyspark -> arrow) rather than several (pyspark -> pandas -> arrow). Ritchie Vink made a related comment in the past about the two-conversion route being preferable, back when there was no official way of going to Arrow from Spark. The challenge is that this would make at least the eager implementation in narwhals incompatible with pyspark<4.0.0 (once Spark 4.0.0 is actually released). Any thoughts on this?
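
As a rough sketch of that idea (toArrow is the method added in the PySpark 4.0.0 preview; the helper name and the fallback branch are my own):

import pyarrow as pa

def collect_to_arrow(spark_df) -> pa.Table:
    # PySpark >= 4.0: a single conversion straight to Arrow.
    if hasattr(spark_df, "toArrow"):
        return spark_df.toArrow()
    # Older PySpark: fall back to the pandas round-trip.
    return pa.Table.from_pandas(spark_df.toPandas())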

@TomBurdge

TomBurdge commented Aug 21, 2024

Hey @ELC, I usually use SDKMAN when developing pyspark locally, and it works pretty nicely in my experience (on WSL, set up on a few different laptops). It seems like a lot of pyspark developers aren't aware of this approach, because the tool is mostly used by Java developers, and plenty of pyspark developers don't know Java (at least I don't ☺️).

SDKMAN allows you to, at least the way I think about it, create the Java equivalent of a virtual environment for your repo via a .sdkmanrc file at the root.
Right now I have the following:

java=21.0.4-tem
spark=3.5.1
hadoop=3.3.5

Along with adding pyspark to dev-requirements.txt, I was good to go. I do agree with you that an approach which uses containerization is likely to work best for most people.

@MarcoGorelli
Member

Nice one, thanks @TomBurdge! It would be great to have you on board as a contributor here!

I think converting to Arrow might make sense in collect. If anyone wants to support earlier versions of spark and convert to pandas, they can always do that manually with something like

import narwhals as nw

if is_spark_lazyframe(df):  # hypothetical check for a Spark-backed LazyFrame
    df = nw.from_native(nw.to_native(df).toPandas(), eager_only=True)
else:
    df = df.collect()

But I think that data science libraries which explicitly support Spark are quite rare, unfortunately - well, fortunately for us, it means that backwards-compatibility is less of a concern, and so they likely wouldn't mind only supporting Spark 4.0+ too much 😄

@TomBurdge

TomBurdge commented Aug 26, 2024

Thanks @MarcoGorelli , yeah keen to get involved, spare time permitting!
I will try to make it to one of the community calls.

Question on intended behaviour:

For the rename method, the relevant spark function is: withColumnRenamed.
It has a major gotcha, which is that if the original column is not present, then it is a no-op (mentioned in the docs). This behaviour has caused me a lot of frustration. 🙃

What is the narwhals intended behaviour? Strict?
The narwhals tests in tests/frame/rename_test.py aren't opinionated on strict/non-strict currently.
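
For illustration, a strict rename could be built on top of withColumnRenamed by validating the columns first (strict_rename is a hypothetical helper, not an existing narwhals or Spark function):

from pyspark.sql import DataFrame

def strict_rename(df: DataFrame, mapping: dict[str, str]) -> DataFrame:
    # withColumnRenamed silently no-ops on missing columns, so check up front.
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise ValueError(f"Columns not found for rename: {missing}")
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df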

@MarcoGorelli
Member

Ah, interesting, thanks @TomBurdge! I think that, as Narwhals is mainly aimed at tool builders, we'd be better off using the strict behaviour.

@EdAbati EdAbati linked a pull request Sep 3, 2024 that will close this issue
@EdAbati
Contributor

EdAbati commented Sep 3, 2024

Hi @TomBurdge I completely missed these last replies and I have started implementing something for this. Do you also have something implemented? We can collaborate on this and merge what we have :)

I think we should aim for a "minimal" implementation now and then have follow-up PRs for the rest of the methods.

@morrowrasmus

Just letting you know that I'm following this with great interest and could potentially be helpful with testing once you have something ready.

@mattcristal
Contributor

I am new to Spark; just for fun, I tried it locally with: https://github.com/jplane/pyspark-devcontainer
