# docs(roadmap): document possible future directions #144

## Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##             main     #144   +/-   ##
=======================================
  Coverage   84.94%   84.94%
=======================================
  Files          26       26
  Lines        1920     1920
=======================================
  Hits         1631     1631
  Misses        289      289

☔ View full report in Codecov by Sentry.
## Support more parts of the end-to-end ML workflow

![End-to-end model training process](https://github.com/user-attachments/assets/fbb5955b-06a4-4e0c-b76b-ebc1fd83daea)

IbisML primarily supports ML preprocessing steps, which is a very narrow focus. Furthermore, in the
standard ML workflow, the gaps in coverage between Ibis and IbisML mean that the end-to-end story is
disjoint. Specifically, Ibis is a good fit for feature engineering, and IbisML can perform
preprocessing (and, to an extent, train-test splitting); however, IbisML doesn't support CV
splitting (or hyperparameter tuning) workflows that are normally part of the training process. Users
have tried using scikit-learn's CV capabilities, but they aren't integrated with IbisML; for
hyperparameter tuning, we could further investigate and document integration with popular frameworks
such as Optuna.

Benefits:

- Most aligned with current direction of IbisML
- Actual community requests
  - https://github.com/ibis-project/ibis-ml/issues/135
  - https://github.com/ibis-project/ibis-ml/issues/136

Questions:

- Is a database the right place for all of these operations? Is it efficient?
How successfully could IbisML create a uniform API for things like CV and hyperparameter tuning that works across systems? Are there clean abstractions for these kinds of operations that map cleanly onto various ML training libraries? It would be interesting to take a closer look at how successful tidymodels has been at this.
> How successfully could IbisML create a uniform API for things like CV and hyperparameter tuning that works across systems?
Across database systems? IbisML can create data splits, although it's a bit tedious to do so in a reproducible way (you need to create a common key, hash it, and split based on that). The bigger potential problem is a workflow problem: if your ML training framework is elsewhere, you need to keep passing splits. You could potentially implement some sort of more robust caching system like LetSQL did. There are also some challenges in integrating with existing training frameworks that are being explored.
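A minimal sketch of the reproducible-split idea described above, in plain Python (the function name and seed handling are illustrative, not IbisML API): hash a stable row key, map the hash into [0, 1), and threshold it. The same pattern can be pushed down into SQL or Ibis via the backend's hash function, so the split is reproducible without materializing assignments.

```python
import hashlib


def split_of(key: str, test_fraction: float = 0.25, seed: str = "42") -> str:
    """Deterministically assign a row to 'train' or 'test' by hashing its key.

    The assignment depends only on (key, seed), so it is reproducible across
    runs, and across engines that expose the same hash function.
    """
    digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
    # Map the first 8 hex digits to a float in [0, 1].
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "test" if bucket < test_fraction else "train"


keys = [f"user_{i}" for i in range(1000)]
splits = [split_of(k) for k in keys]
# Repeated calls give identical assignments.
assert splits == [split_of(k) for k in keys]
```

The tedium the comment mentions is visible even here: you need a key that uniquely and stably identifies a row, and a hash function that behaves identically wherever the split is computed.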
Across ML training frameworks, it's probably easier, because each training framework exposes some sort of CV/hyperparameter methods that can be abstracted across (more relevant if IbisML is an abstraction across distributed ML frameworks rather than databases).
Is the LetSQL caching idea similar to Ray's streaming-to-training capabilities (e.g., streaming split)? Personally, I'd be interested in seeing a Ray Dataset-like abstraction/interface on IbisML with an Arrow-based protocol embedded, enabling data handoff from preprocessing to training regardless of where the processed data comes from: either from a database, or from in-motion preprocessing on CPU feeding GPU training (think of the first use case as IbisML on a database backend and the second as Ibis on a streaming pipeline). This would allow IbisML to focus on building the API abstraction and offload the actual processing to whatever db/framework supports the protocol.
> Is the LetSQL caching idea similar to Ray's streaming-to-training capabilities (e.g., streaming split)?
No, I don't think they're related; it just allows for a more robust caching mechanism not tied to the execution backend, and supports things like cache invalidation.
> Ray's streaming to training capabilities (e.g., streaming split). Personally, I'd be interested in seeing a Ray Dataset-like abstraction/interface on IbisML with Arrow-based protocol embedded
>
> [...]
>
> This would allow IbisML to focus on building the API abstraction, and offload the actual processing to whatever db/framework supports the protocol.
If you look at something like the `StreamSplitDataIterator` implementation, a generic Arrow-based implementation would very much require a distributed Arrow protocol. What currently provides this protocol? SQL is insufficient for specifying these kinds of operations, so this comes back to unifying across things like Dask, Ray, Spark, etc.: things that provide a lower-level distributed system implementation.
> If you look at something like the `StreamSplitDataIterator` implementation, a generic Arrow-based implementation would very much require a distributed Arrow protocol. What currently provides this protocol? SQL is insufficient for specifying these kinds of operations, so this comes back to unifying across things like Dask, Ray, Spark, etc.: things that provide a lower-level distributed system implementation.
@deepyaman I think this is what I proposed as Table Read protocol.
@leventov Sorry I missed your comment! That looks very interesting, I'll take a proper read.
- Is there a benefit to implementing ML preprocessing in Arrow Compute, or should we just delegate
  to an existing framework?
Arrow Compute is a part of Arrow that is best understood as an implementation, not a standard. I think there would be little benefit in mapping ML operations onto Arrow Compute kernels or Acero ExecPlans.
## Leverage Apache Arrow for ML preprocessing

While SQL is supported by all databases, it's limiting (and possibly inefficient) for ML
This is intuitive; whether or not it's actually inefficient needs actual benchmarking.
  that's probably not sufficient at a very large scale. Can something like this be parallelized with
  Ray?
- Ray itself supports ML preprocessing to some extent, and also supports Arrow/Arrow Flight/etc. Why
  not just use Ray?
Responding to @zhenzhongxu:
> if we use Ray, which part of user workflow are we replacing with Ibis? What is the value proposition we can communicate to attract users?
>
> Are we also thinking about both the compute story and data movement story?
Ibis has an attractive value proposition. You can do all your data engineering (i.e. up to creating a master table or writing to a feature store) on any Ibis-supported backend, and that is a great fit.
From a data transfer perspective, you need to be able to get the data out of the backend engine you're working with in Arrow format (some backends are ideal for this, like DuckDB, Polars, DataFusion, Theseus, etc.; for others, we still have a way to get Arrow out, but it may not be the most efficient until those backends adopt Arrow). With data in Arrow format, Ray Datasets can be constructed from Arrow (Ray Data is already basically a distributed Arrow data framework), and we've already seen that something like Ray can handle ML preprocessing and training at scale at places like Pinterest and Uber.
If not Ray, then we should have some other distributed framework that:
- Uses Arrow under the hood
- Supports (or can be made to support) ML pieces
What frameworks exist? There's Spark, Dask, etc.; anything else where we can fill a gap? For Ballista, I initially thought (1) it was more purely distributed Arrow (not sure if I understood that right) and (2) there might be some interest in adding ML capabilities, but it seems like the project isn't maintained.
> With data in Arrow format, Ray datasets can be constructed from Arrow (it's already basically just a distributed Arrow data framework), and we've already seen that something like Ray can handle the ML preprocessing and training at scale at places like Pinterest and Uber.
It is interesting to note here that Pinterest uses StarRocks, and Uber is heavy on Pinot. However, there is no efficient transfer protocol implemented for moving data from StarRocks and Pinot respectively into Ray. StarRocks doesn't even support Arrow Flight yet (although Apache Doris, from which StarRocks was originally forked, has since March 2024), nor does Apache Pinot. To reduce data movement between the historical segments stored in these DBs' formats and ML processing/training in something like Ray (and then when writing processing results back into these DBs), table transfer protocols would be even more efficient.
I wonder: wouldn't these enterprises be interested in sponsoring a protocol implementation (Arrow Flight or table transfer) in their respective analytics DBs, which would seem to permit them to simplify their data architectures and thus reduce costs? Also, the ability to use data from HTAP/real-time DBs like StarRocks or Pinot in Ray (for instance) may enable them to use Ray Serve as well (if they don't yet, or to use it more widely).
Questions:

- Is there a benefit to implementing ML preprocessing in Arrow Compute, or should we just delegate
  to an existing framework?
Responding to @zhenzhongxu:
> Can you clarify the question? Are you asking whether Ibis should implement ML preprocessing logic using Arrow that's executed within a notebook environment? If so, that's a clear no. Or, if the question is more about whether we want to implement standards/declarations that help Theseus or other compute engines get into the ML workflow, then it sounds much better.
I didn't mention anything about a notebook environment; it's more about whether it makes sense to implement a preprocessing layer on top of Arrow Compute so that any Arrow-based engine could have that.
What I don't exactly understand is: can you have an Arrow Compute function (e.g. to support `MinMaxScaler`, as a very basic example) that will work across a distributed Arrow context? The way I understand it, you can use Arrow Compute for aggregates across a chunk of Arrow data, but it wouldn't be aware if something is distributing the Arrow data.
Can Gandiva support distributed Arrow implementations?
This is where I would assume frameworks like Ray Data have already built the logic for distributing work and combining results across distributed Arrow data, but I don't completely understand.
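A toy illustration of the chunk-vs-global distinction raised above, in plain Python (no Arrow APIs; the lists stand in for Arrow chunks): a per-chunk kernel can compute local min/max, but something outside the kernel must merge the partial results before scaling, which is exactly the coordination a distributed runtime would have to provide.

```python
# A MinMaxScaler over "distributed" data: per-chunk kernels compute local
# statistics; a separate combine step (the part a distributed runtime such
# as Ray Data must provide) merges them into global statistics.
chunks = [[3.0, 7.0, 1.0], [9.0, 5.0], [2.0, 8.0]]

# Map: local (min, max) per chunk -- what a per-chunk compute kernel can do.
partials = [(min(c), max(c)) for c in chunks]

# Reduce: merge partial statistics into global ones -- what the kernel alone
# cannot do, since it is unaware of the other chunks.
lo = min(p[0] for p in partials)
hi = max(p[1] for p in partials)

# Second pass: scale every chunk using the global statistics.
scaled = [[(x - lo) / (hi - lo) for x in c] for c in chunks]
assert min(min(c) for c in scaled) == 0.0
assert max(max(c) for c in scaled) == 1.0
```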
Update: https://github.com/ibis-project/ibis-ml/pull/144/files#r1737090923 seems to suggest not going down the Arrow Compute route.
Agreed that IbisML should focus on a standard interface, not a specific implementation.
avoiding format conversions etc." (DuckDB, DataFusion, Polars, and Theseus, as well as pandas and
Dask, all represent data as Arrow already.)

Doing ML preprocessing via Arrow could mean using Arrow Compute Functions, or by working more with
Responding to @zhenzhongxu:
Couple things:
> - Doing ML preprocessing via Arrow. What do we want to do on the Ibis level here, define preprocessing function standards or a mapping layer similar to Substrait BFT?
> - Besides the compute standard, we also need Arrow-based data movement standards/protocols. Maybe we should call that out in this section too.
- Why does BFT matter here? I don't see that BFT defines any kind of logic; it just catalogs functions and test cases in YAML.
- Feel free to add some context there on what you (or @jitingxu1) are thinking. I don't quite follow; would this mean essentially building a distributed Arrow implementation? Why aren't existing distributed frameworks that pass around Arrow already a standard? If we need something that's more focused on purely distributing Arrow, maybe we should just revive Ballista, as it's already an Arrow project, and focus on using it for ML?
Response from @zhenzhongxu:
I think @jitingxu1's disaggregated idea probably also fits in here? Let's collaborate to refine.
Maybe Arrow Flight already solves the problem of standardizing the way Arrow data is distributed and transferred? In that case, maybe we need a non-SQL interface to do non-SQL workloads, like ML. Like we create Arrow Flight ML?
As I understand it, distributed frameworks like Ray may implement something like Arrow Flight, or could integrate Arrow Flight. I don't see that Dask or Spark does, but I assume they could, too? If these implement Arrow Flight, does it make sense to have something like Arrow Flight ML (that can define a spec of distributed ML operations to implement), that then in the future stuff like Ray and Dask and Spark could adhere to?
Response from @ianmcook:
It would be a really big mistake to associate the “Flight” name with something like that, IMO.
Arrow in general does not want to take on the task of standardizing a representation of compute operations.
That’s why the Arrow community was excited to see Substrait emerge. Substrait is trying to do something that Arrow didn’t want to do.
Substrait targets databases though, right? Because it focuses on relational algebra, and not necessarily tensor operations. Or is Substrait also looking to cover that?
Like, would you ever execute a Substrait plan on NumPy arrays?
Response from @ianmcook:
For now, that’s right. But in the future it could potentially broaden scope to include that. I don’t think anyone in the core Substrait maintainers group is fundamentally opposed to that—but they just want to see Substrait really succeed at relational operations before expanding to another area.
Is it too early to standardize ML processing with Arrow?
@zeroshade does seem interested in having Arrow do more in the ML space, but if that's less work in Arrow and more adoption of Arrow, I guess that's also reasonable.
Response from @ianmcook:
Yes. My main reaction to ideas like “Flight ML” or “IbisML Connect” (or whatever this concept might be called) is “it’s too early”
Look at how long it was before Spark introduced Spark Connect
I think Ibis and IbisML should focus first on creating watertight abstractions on top of third-party systems without requiring the third parties to adopt anything.
And for Arrow — yeah, I think anything else we did inside of Arrow to enable more adoption in the ML space would probably not really make much of a difference. We just need to increase awareness of Arrow in that space, get more ML projects trying it, listen to what they tell us about their experiences, and be responsive to their feedback.
Valid points across all this feedback. However, I do think there is value in coming up with an example of what a "non-SQL interface to do non-SQL workloads" would look like from the IbisML perspective, and that might influence the Flight protocol's requirements down the line and may also help with adoption, IMHO.
  not just use Ray?
- Is there an opportunity to more generically provide distributed Arrow-based ML?
  https://blog.lambdaclass.com/ballista-a-distributed-compute-platform-made-with-rust-and-apache-arrow/
  seems to indicate some (future) interest in Apache DataFusion Ballista ML?
Nevermind; Ballista isn't maintained.
- If we go down this route, do we care about being able to do ML preprocessing in SQL anymore (for
  backends that don't support Arrow, and possibly also don't support efficient output to Dask)?

## Build a unified interface for doing ML on the database
Responding to @zhenzhongxu:
> why only databases? what about pipeline scenarios?
Sure, it can include pipeline scenarios (e.g., more generically, backends that Ibis supports).
For pipeline use cases, there is less of that benefit of empowering pure SQL users to be able to do ML, because pipeline users probably don't have a significant problem picking up other frameworks (given how complex working with things like Spark can get). Many of these pipeline cases also already have ML libraries.
However, it can be helpful if we're talking about a focus on inference via PyArrow UDFs or something.
Benefits:

- Try to reduce data transfer between frameworks
Comment from @zhenzhongxu:

> When we eventually have HTAP abstraction, data movement will be abstracted, but transfer will likely still happen, especially for large-scale use cases.
- Try to reduce data transfer between frameworks

Questions:
Responding to @zhenzhongxu:
> are we targeting transactional databases (e.g., Postgres) or analytical stores (e.g., lakehouse)?
We can also target transactional databases, but see the concern about things potentially not being composable/requiring an implementation for each (similar to PostgresML).
Again, IF we focus on inference, we can be smarter about this by doing a lot of simple tasks via SQL and using PyArrow UDFs where SQL isn't feasible: you can focus on efficient inference because things that aren't well expressed in SQL can be dispatched through a UDF.
For example, I think this is why LetSQL focuses on inference in https://www.letsql.com/posts/builtin-predict-udf/
(They actually compute `distinct_categories` in the example, which is probably technically incorrect for their inference-focused example: you should one-hot encode based on what the possible categories were as of training time, since it's not like your model can handle anything else.)
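To illustrate the point about training-time categories, here is a minimal one-hot encoder sketch (a hypothetical class, not IbisML or LetSQL code) that freezes its categories at fit time, so a category unseen during training maps to all-zeros at inference instead of changing the feature space:

```python
class OneHotEncoder:
    """Minimal one-hot encoder that freezes its categories at fit time.

    At inference, a category never seen during training encodes as all
    zeros rather than silently growing the feature space: the model only
    knows the columns that existed when it was trained.
    """

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        return [[1 if v == c else 0 for c in self.categories_] for v in values]


enc = OneHotEncoder().fit(["red", "green", "blue"])
# Categories are frozen as ["blue", "green", "red"]; "purple" was not in the
# training data, so it encodes as all zeros instead of adding a new column.
assert enc.transform(["red", "purple"]) == [[0, 0, 1], [0, 0, 0]]
```

Recomputing distinct categories at inference time, as in the LetSQL example, would instead produce a feature layout that depends on whatever data happens to be scored.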
- Should this focus on inference? It seems more likely that a database (or at least UDFs, including
  Arrow UDFs) could do well at this part.
  [There are also some others trying to play in this space.](https://www.letsql.com/posts/builtin-predict-udf/)
LetSQL relies on a fast Rust implementation (`gbdt-rs`) for batch inference of XGBoost models using DataFusion. A couple of more generic possible options here are:

- Implementing fast inference of XGBoost (and other models) using Arrow Compute (@ianmcook I think this is a reasonable way to use `pyarrow.compute`, right? For example, DuckDB Arrow UDFs and, even more generically, Ibis PyArrow UDFs are done this way.)
- Doing something like `xgb2sql`: on the inference side, and when we can tell the fuller story, there is a nicer case for being able to do everything in the database
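The `xgb2sql`-style option can be sketched in miniature: compile a decision tree into a SQL `CASE` expression and let the database evaluate it. The tree, table, and column names here are hypothetical, hand-written examples; a real tool would walk the dump of a trained XGBoost booster and emit one expression per tree, summed.

```python
import sqlite3

# A toy, hand-written decision tree standing in for one tree of a trained model.
tree = {
    "feature": "age",
    "threshold": 30,
    "left": {"leaf": 0.2},
    "right": {"leaf": 0.8},
}


def tree_to_sql(node):
    """Recursively compile a decision-tree node into a SQL CASE expression."""
    if "leaf" in node:
        return str(node["leaf"])
    return (
        f"CASE WHEN {node['feature']} < {node['threshold']} "
        f"THEN {tree_to_sql(node['left'])} "
        f"ELSE {tree_to_sql(node['right'])} END"
    )


# Evaluate the compiled expression inside the database itself.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (age INTEGER)")
con.executemany("INSERT INTO users VALUES (?)", [(25,), (40,)])
scores = [
    row[0]
    for row in con.execute(f"SELECT {tree_to_sql(tree)} FROM users ORDER BY age")
]
assert scores == [0.2, 0.8]
```

Because the score is just a SQL expression, inference happens where the data lives, with no data transfer to a separate serving process.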