
docs(roadmap): document possible future directions #144

Closed
wants to merge 1 commit into main from docs/roadmap/future-opportunities

Conversation

deepyaman
Collaborator

No description provided.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.94%. Comparing base (0084273) to head (f6ae7e9).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #144   +/-   ##
=======================================
  Coverage   84.94%   84.94%           
=======================================
  Files          26       26           
  Lines        1920     1920           
=======================================
  Hits         1631     1631           
  Misses        289      289           

☔ View full report in Codecov by Sentry.

Comment on lines +21 to +43
## Support more parts of the end-to-end ML workflow

![End-to-end model training process](https://github.com/user-attachments/assets/fbb5955b-06a4-4e0c-b76b-ebc1fd83daea)

IbisML primarily supports ML preprocessing steps, which is a very narrow focus. Furthermore, in the
standard ML workflow, the gaps in coverage between Ibis and IbisML mean that the end-to-end story is
disjoint. Specifically, Ibis is a good fit for feature engineering, and IbisML can perform
preprocessing (and, to an extent, train-test splitting); however, IbisML doesn't support CV
splitting (or hyperparameter tuning) workflows that are normally part of the training process. Users
have tried using scikit-learn's CV capabilities, but they aren't integrated with IbisML; for
hyperparameter tuning, we could further investigate and document integration with popular frameworks
such as Optuna.

Benefits:

- Most aligned with current direction of IbisML
- Actual community requests
- https://github.com/ibis-project/ibis-ml/issues/135
- https://github.com/ibis-project/ibis-ml/issues/136

Questions:

- Is a database the right place for all of these operations? Is it efficient?


How successfully could IbisML create a uniform API for things like CV and hyperparameter tuning that works across systems? Are there clean abstractions for these kinds of operations that map cleanly onto various ML training libraries? It would be interesting to take a closer look at how successful tidymodels has been at this.

Collaborator Author


How successfully could IbisML create a uniform API for things like CV and hyperparameter tuning that works across systems?

Across database systems? IbisML can create data splits, although it's a bit tedious to do so in a reproducible way (you need to create a common key, hash it, and split based on that). The bigger potential problem is a workflow problem: if your ML training framework is elsewhere, you need to keep passing splits around. You could potentially implement some sort of more robust caching system like LetSQL did. There are also some challenges in integrating with existing training frameworks that are being explored.
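The "common key, hash it, split on the hash" pattern can be sketched in plain Python; the helper name and 80/20 fraction are illustrative, and in Ibis the same idea would use a backend hash function so the split happens inside the database rather than client-side:

```python
import hashlib

def split_assignment(key: str, test_frac: float = 0.2) -> str:
    """Deterministically assign a row to "train" or "test" by hashing its key."""
    # md5 is stable across processes and engines, unlike Python's builtin hash().
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100
    return "test" if bucket < int(test_frac * 100) else "train"

# Because assignment depends only on the key, the split is reproducible.
rows = [f"user-{i}" for i in range(1000)]
test_share = sum(split_assignment(k) == "test" for k in rows) / len(rows)
```

Since the assignment is a pure function of the key, the same rows land in the same split on every run, which is what makes this usable across separate preprocessing and training systems.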

Across ML training frameworks, it's probably easier, because each training framework exposes some sort of CV/hyperparameter methods that can be abstracted across (more relevant if IbisML is an abstraction across distributed ML frameworks rather than databases).


Is the LetSQL caching idea similar to Ray's streaming-to-training capabilities (e.g., streaming split)? Personally, I'd be interested in seeing a Ray Dataset-like abstraction/interface on IbisML with an Arrow-based protocol embedded, enabling data handoff from preprocessing to training regardless of where the processed data comes from, whether a database or in-motion preprocessing feeding CPU-to-GPU training (think of the first use case as IbisML on a database backend and the second as Ibis on a streaming pipeline). This would allow IbisML to focus on building the API abstraction and offload the actual processing to whatever db/framework supports the protocol.

Collaborator Author


Is the LetSQL caching idea similar to Ray's streaming-to-training capabilities (e.g., streaming split)?

No, I don't think they're related; it just allows for a more robust caching mechanism not tied to the execution backend, and supporting things like cache invalidation.

Ray's streaming to training capabilities (e.g., streaming split). Personally, I'd be interested in seeing a Ray Dataset-like abstraction/interface on IbisML with Arrow-based protocol embedded
[...]
This would allow IbisML to focus on building the API abstraction, and offload the actual processing to whatever db/framework supports the protocol.

If you look at something like the StreamSplitDataIterator implementation, a generic Arrow-based implementation would very much require a distributed Arrow protocol. What currently provides this protocol? SQL is insufficient for specifying these kinds of operations, so this comes back to unifying across things like Dask, Ray, Spark, etc.—things that provide a lower-level distributed system implementation.


If you look at something like the StreamSplitDataIterator implementation, a generic Arrow-based implementation would very much require a distributed Arrow protocol. What currently provides this protocol? SQL is insufficient for specifying these kinds of operations, so this comes back to unifying across things like Dask, Ray, Spark, etc.—things that provide a lower-level distributed system implementation.

@deepyaman I think this is what I proposed as Table Read protocol.

Collaborator Author


@leventov Sorry I missed your comment! That looks very interesting, I'll take a proper read.

Comment on lines +74 to +75
- Is there a benefit to implementing ML preprocessing in Arrow Compute, or should we just delegate
to an existing framework?


Arrow Compute is a part of Arrow that is best understood as an implementation, not a standard. I think there would be little benefit in mapping ML operations onto Arrow Compute kernels or Acero ExecPlans.


## Leverage Apache Arrow for ML preprocessing

While SQL is supported by all databases, it's limiting (and possibly inefficient) for ML
Collaborator Author


This is intuitive; whether or not it's actually inefficient needs actual benchmarking.

that's probably not sufficient at a very large scale. Can something like this be parallelized with
Ray?
- Ray itself supports ML preprocessing to some extent, and also supports Arrow/Arrow Flight/etc. Why
not just use Ray?
Collaborator Author

@deepyaman deepyaman Aug 29, 2024


Responding to @zhenzhongxu:

if we use Ray, which part of user workflow are we replacing with Ibis? What is the value proposition we can communicate to attract users?

Are we also thinking about both the compute story and data movement story?

Ibis has an attractive value proposition. You can do all your data engineering (i.e. up to creating a master table or writing to a feature store) on any Ibis-supported backend, and that is a great fit.

From a data transfer perspective, you need to be able to get the data out of the backend engine you're working with in Arrow format (some backends are ideal for this, like DuckDB, Polars, DataFusion, Theseus, etc.; for others we still have a way to get Arrow out, but it may not be the most efficient until those backends adopt Arrow). With data in Arrow format, Ray datasets can be constructed from Arrow (Ray Data is already basically a distributed Arrow data framework), and we've already seen that something like Ray can handle ML preprocessing and training at scale at places like Pinterest and Uber.

If not Ray, then we should have some other distributed framework that:

  • Uses Arrow under the hood
  • Supports (or can be made to support) ML pieces

What frameworks exist? There's Spark, Dask, etc.; anything else where we can fill a gap? For Ballista, I initially thought (1) it was more purely distributed Arrow (not sure if I understood that right) and (2) there might be some interest in adding ML capabilities, but it seems like the project isn't maintained.


With data in Arrow format, Ray datasets can be constructed from Arrow (it's already basically just a distributed Arrow data framework), and we've already seen that something like Ray can handle the ML preprocessing and training at scale at places like Pinterest and Uber.

It is interesting to note here that Pinterest uses StarRocks, and Uber is heavy on Pinot. However, there is no efficient transfer protocol implemented for moving data from StarRocks or Pinot into Ray. StarRocks doesn't even support Arrow Flight yet (although Apache Doris, from which StarRocks was originally forked, has since March 2024), and neither does Apache Pinot. To reduce data movement between the historical segments stored in these DBs' formats and ML processing/training in something like Ray (and then writing the processing results back into these DBs), table transfer protocols would be even more efficient.

I wonder whether these enterprises would be interested in sponsoring a protocol implementation (Arrow Flight or table transfer) in their respective analytics DBs, which would seem to let them simplify their data architectures and thus reduce costs. Also, the ability to use data from HTAP/real-time DBs like StarRocks or Pinot in Ray (for instance) may enable them to use Ray Serve as well (if they don't yet, or to use it more widely).

Questions:

- Is there a benefit to implementing ML preprocessing in Arrow Compute, or should we just delegate
to an existing framework?
Collaborator Author


Responding to @zhenzhongxu:

Can you clarify the question? Are you asking whether Ibis should implement ML preprocessing logic using Arrow that's executed within a notebook environment? If so, that's a clear no. Or if the question is more about whether we want to implement standards/declarations that help Theseus or other compute engines get into the ML workflow, then it sounds much better.

Didn't mention anything about a notebook environment; it's more about whether it makes sense to implement a preprocessing layer on top of Arrow Compute so that any Arrow-based engine could have that.

What I don't exactly understand is: can you have an Arrow Compute function (e.g., to support MinMaxScaler, as a very basic example) that will work across a distributed Arrow context? The way I understand it, you can use Arrow Compute for aggregates across a chunk of Arrow data, but it wouldn't be aware that something is distributing the Arrow data.

Can Gandiva support distributed Arrow implementations?

This is where I would assume frameworks like Ray Data have already built the logic for distributing work and combining results across distributed Arrow data, but I don't completely understand.
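That distribute-and-combine concern can be made concrete with a toy sketch (names illustrative, plain Python): a per-chunk kernel, roughly what Arrow Compute provides, only yields partial statistics, and some outer layer has to combine them before scaling is possible:

```python
def chunk_stats(chunk: list[float]) -> tuple[float, float]:
    # Per-chunk work: roughly what an Arrow Compute kernel can do, since a
    # kernel only sees one chunk of data at a time.
    return min(chunk), max(chunk)

def fit_min_max(chunks: list[list[float]]) -> tuple[float, float]:
    # Combine step: this is the part a distributed framework (Ray, Dask,
    # Spark, ...) has to supply; the per-chunk kernels alone can't provide it.
    partials = [chunk_stats(c) for c in chunks]
    return min(p[0] for p in partials), max(p[1] for p in partials)

def min_max_scale(x: float, lo: float, hi: float) -> float:
    # Once the global statistics exist, scaling each value is embarrassingly
    # parallel again and could run per chunk.
    return (x - lo) / (hi - lo)

lo, hi = fit_min_max([[1.0, 5.0], [2.0, 9.0], [0.0, 3.0]])
```

The two-phase shape (partial statistics per chunk, then a global combine) is exactly the logic that has to live somewhere above the compute kernels, which is the gap the comment is pointing at.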

Update: https://github.com/ibis-project/ibis-ml/pull/144/files#r1737090923 seems to suggest not going down the Arrow Compute route.


Agreed that IbisML should focus on standard interface, not specific implementation.

avoiding format conversions etc." (DuckDB, DataFusion, Polars, and Theseus, as well as pandas and
Dask, all represent data as Arrow already.)

Doing ML preprocessing via Arrow could mean using Arrow Compute Functions, or by working more with
Collaborator Author


Responding to @zhenzhongxu:

Couple things:

  1. Doing ML preprocessing via Arrow. What do we want to do on the Ibis level here, define preprocessing function standards or a mapping layer similar to Substrait BFT?
  2. Besides the compute standard, we also need Arrow-based data movement standards/protocols. Maybe we should call that out in this section too.

My responses:

  1. Why does BFT matter here? I don't see that BFT defines any kind of logic; it just catalogs functions and test cases in YAML.
  2. Feel free to add some context there on what you (or @jitingxu1) are thinking. I don't quite follow; would this mean essentially building a distributed Arrow implementation? Why aren't existing distributed frameworks that pass around Arrow already a standard?

If we need something that's more focused on purely distributing Arrow, maybe we should just revive Ballista, as it's already an Arrow project, and focus on using it for ML?

Response from @zhenzhongxu:

I think @jitingxu1's disaggregated idea probably also fits in here? Let's collaborate to refine.

Collaborator Author


Maybe Arrow Flight already solves the problem of standardizing the way Arrow data is distributed and transferred? In that case, maybe we need a non-SQL interface to do non-SQL workloads, like ML. Like, what if we create Arrow Flight ML?

As I understand it, distributed frameworks like Ray may implement something like Arrow Flight, or could integrate Arrow Flight. I don't see that Dask or Spark does, but I assume they could, too? If these implement Arrow Flight, does it make sense to have something like Arrow Flight ML (that can define a spec of distributed ML operations to implement), that then in the future stuff like Ray and Dask and Spark could adhere to?

Response from @ianmcook:

It would be a really big mistake to associate the “Flight” name with something like that, IMO.

Arrow in general does not want to take on the task of standardizing a representation of compute operations.

That’s why the Arrow community was excited to see Substrait emerge. Substrait is trying to do something that Arrow didn’t want to do.

Substrait targets databases though, right? Because it focuses on relational algebra, and not necessarily tensor operations. Or is Substrait also looking to cover that?

Like, would you ever execute a substrait plan on NumPy arrays?

Response from @ianmcook:

For now, that’s right. But in the future it could potentially broaden scope to include that. I don’t think anyone in the core Substrait maintainers group is fundamentally opposed to that—but they just want to see Substrait really succeed at relational operations before expanding to another area.

Is it too early to standardize ML processing with Arrow?

@zeroshade does seem interested in having Arrow do more in the ML space, but if that's less work in Arrow and more adoption of Arrow, I guess that's also reasonable.

Response from @ianmcook:

Yes. My main reaction to ideas like “Flight ML” or “IbisML Connect” (or whatever this concept might be called) is “it’s too early”

Look at how long it was before Spark introduced Spark Connect

I think Ibis and IbisML should focus first on creating watertight abstractions on top of third-party systems without requiring the third parties to adopt anything.

And for Arrow — yeah, I think anything else we did inside of Arrow to enable more adoption in the ML space would probably not really make much of a difference. We just need to increase awareness of Arrow in that space, get more ML projects trying it, listen to what they tell us about their experiences, and be responsive to their feedback.


Valid points across all this feedback. However, I do think there is value in coming up with an example of what "a non-SQL interface to do non-SQL workloads" would look like from the IbisML perspective; that might influence the Flight protocol's requirements down the line, and may also help with adoption, IMHO.

not just use Ray?
- Is there an opportunity to more generically provide distributed Arrow-based ML?
https://blog.lambdaclass.com/ballista-a-distributed-compute-platform-made-with-rust-and-apache-arrow/
seems to indicate some (future) interest in Apache DataFusion Ballista ML?
Collaborator Author


Never mind; Ballista isn't maintained.

- If we go down this route, do we care about being able to do ML preprocessing in SQL anymore (for
backends that don't support Arrow, and possibly also don't support efficient output to Dask)?

## Build a unified interface for doing ML on the database
Collaborator Author


Responding to @zhenzhongxu:

why only databases? what about pipeline scenarios?

Sure, it can include pipeline scenarios (e.g. like more generically backends that Ibis supports).

For pipeline use cases, there is less of that benefit of empowering pure SQL users to do ML, because pipeline users probably don't have a significant problem picking up other frameworks (given how complex working with things like Spark can get). Many of these pipeline cases also already have ML libraries.

However, it can be helpful if we focus on inference via PyArrow UDFs or something.


Benefits:

- Try to reduce data transfer between frameworks
Collaborator Author


Comment from @zhenzhongxu:

When we eventually have HTAP abstraction, data movement will be abstracted but transfer will likely still happen, especially for large scale use cases.


- Try to reduce data transfer between frameworks

Questions:
Collaborator Author


Responding to @zhenzhongxu:

are we targeting transactional databases (e.g., Postgres) or analytical stores (e.g., lakehouse)

We can also target transactional databases, but see the concern about things potentially not being composable/requiring an implementation for each (similar to PostgresML).

Again, IF we focus on inference, we can be smarter about this by doing a lot of simple tasks via SQL and using PyArrow UDFs where SQL isn't feasible; anything that isn't well expressed in SQL can be dispatched through a UDF.

For example, I think this is why LetSQL focuses on inference in https://www.letsql.com/posts/builtin-predict-udf/

(They actually compute distinct_categories in the example, which is probably technically incorrect for their inference-focused example; you should one-hot encode based on the categories that were possible as of training time, since it's not like your model can handle anything else.)
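The training-time-categories point can be sketched concretely (illustrative helper names, plain Python, not tied to any library):

```python
def fit_categories(training_values: list[str]) -> list[str]:
    # Capture categories at training time; the model's input width is fixed
    # by these, so inference must not re-derive them from serving data.
    return sorted(set(training_values))

def one_hot(value: str, categories: list[str]) -> list[int]:
    # A value unseen at training time encodes as all zeros instead of
    # growing a column the model has never seen.
    return [1 if value == c else 0 for c in categories]

cats = fit_categories(["red", "green", "red", "blue"])
```

Recomputing distinct categories at inference time (as in the example being critiqued) could silently change the encoding width or column order between training and serving; freezing the category list at fit time avoids both failure modes.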

Comment on lines +112 to +114
- Should this focus on inference? It seems more likely that a database (or at least UDFs, including
Arrow UDFs) could do well at this part.
[There are also some others trying to play in this space.](https://www.letsql.com/posts/builtin-predict-udf/)
Collaborator Author


LetSQL relies on a fast Rust implementation (gbdt-rs) for batch inference of XGBoost models using DataFusion. A couple of more generic possible options here:

  • Implementing fast inference of XGBoost (and other models) using Arrow Compute (@ianmcook I think this is a reasonable way to use pyarrow.compute, right? For example, DuckDB Arrow UDFs, and even more generically Ibis PyArrow UDFs, are done this way.)
  • Doing something like xgb2sql on the inference side; when we can tell the fuller story, there is a nicer case for being able to do everything in the database
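To make the xgb2sql direction concrete, here is a toy compiler for a single decision stump (the function and its signature are made up for illustration; a real gradient-boosted model would emit a sum of many such expressions, one per tree):

```python
def stump_to_sql(feature: str, threshold: float, left: float, right: float) -> str:
    # Compile one decision stump into a SQL expression the database can
    # evaluate directly; a boosted model's raw score would be the sum of
    # one such expression per tree, plus the base score.
    return f"CASE WHEN {feature} < {threshold} THEN {left} ELSE {right} END"

sql = stump_to_sql("age", 30.0, -0.5, 0.7)
```

The appeal is that inference then runs wherever the data already lives, with no model runtime deployed next to the database; the cost is that deep ensembles produce very large SQL expressions, which is part of why approaches like LetSQL's UDF route exist.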

@deepyaman deepyaman closed this Sep 25, 2024
@deepyaman deepyaman deleted the docs/roadmap/future-opportunities branch September 25, 2024 23:00