Feature augmentation #403
Replies: 3 comments
---
At a first glance, I would use a functional API as in pandas, cf. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html. If the operation is stateless, you will always be able to turn the functional API into a scikit-learn compatible transformer using a `FunctionTransformer`.

If you remove the boilerplate of the data loading, we make a function that merges auxiliary data to the current table.

My only question now is: is the feature augmentation stateless or, in scikit-learn jargon, does it require a `fit` method?

Edit: I slightly overlooked the auxiliary part. It would still work with a `FunctionTransformer`.
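A minimal sketch of that idea (assuming pandas and scikit-learn; the `augment` function, the `key` column, and the left-join logic are all invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def augment(X, aux_tables=()):
    # Stateless: left-join each auxiliary table onto X, nothing to learn.
    for aux in aux_tables:
        X = X.merge(aux, on="key", how="left")
    return X

main_table = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
aux_table = pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]})

# A stateless function wraps directly into a scikit-learn transformer.
fa = FunctionTransformer(augment, kw_args={"aux_tables": [aux_table]})
out = fa.fit_transform(main_table)
# out now carries the columns of both tables: key, value, extra
```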
---
```python
fa = FeatureAugmenter(..., aux_tables=[aux_table_1, aux_table_2, aux_table_3])
final_table = fa.fit_transform(X=main_table)
```
That's the API you want, because the important aspect of the scikit-learn API is that all the data with a sample-wise axis are passed at `fit` time, to enable cross-validation.
The whole purpose of these objects is to enable cross-validation. If not, functions would be enough.
Also, you don't want to apply the SuperVectorizer inside this object; that will be done in a pipeline.
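To illustrate (the `ToyAugmenter` below is a toy stand-in written for this comment, not the real `FeatureAugmenter`; in practice the SuperVectorizer and a model would follow it as later pipeline steps):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class ToyAugmenter(BaseEstimator, TransformerMixin):
    # Auxiliary tables are fixed resources, so they go in __init__;
    # the sample-wise data X is only seen at fit/transform time.
    def __init__(self, on, aux_tables=()):
        self.on = on
        self.aux_tables = aux_tables

    def fit(self, X, y=None):
        return self  # stateless: a no-op fit keeps the sklearn contract

    def transform(self, X):
        out = X
        for aux in self.aux_tables:
            out = out.merge(aux, on=self.on, how="left")
        return out

main_table = pd.DataFrame({"city": ["Paris", "Lille"]})
aux = pd.DataFrame({"city": ["Paris", "Lille"], "population": [2.1, 0.2]})

# Because it follows the fit/transform contract, it composes in a pipeline
# and can be cross-validated together with whatever estimator comes next.
pipe = make_pipeline(ToyAugmenter(on="city", aux_tables=[aux]))
final_table = pipe.fit_transform(main_table)
```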
---
> If you remove the boilerplate of the data loading,
Long term (not in this iteration), I would indeed like a version of this object that fetches the auxiliary tables directly from the data sources (files, databases). But that raises the question of why not also fetch the main table from the data sources... To be explored later.
> My only question now is: is the feature augmentation stateless or, in scikit-learn jargon, does it require a `fit` method?
Good question. Not certain, but we can (and should) have a no-op `fit`.
---
Hi!
Following the very interesting discussion with @jovan-stojanovic, we had a few concerns about the implementation of the feature augmentation functionality.
### Objective
So, to recap (from my understanding): the point is, given a main table and multiple auxiliary tables, to be able to join them together.
So, with these (clean) tables:
We'd like to get:
The current code to do that would be something along the lines of
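A plausible version of that boilerplate in plain pandas (table and column names invented for illustration):

```python
import pandas as pd

main_table = pd.DataFrame({"city": ["Paris", "Lille"], "pop": [2.1, 0.2]})
aux_table_1 = pd.DataFrame({"city": ["Paris", "Lille"], "region": ["IDF", "HDF"]})
aux_table_2 = pd.DataFrame({"city": ["Paris", "Lille"], "area": [105, 35]})

# One explicit merge per auxiliary table: exactly the boilerplate
# a feature-augmentation object would factor out.
final_table = main_table.merge(aux_table_1, on="city", how="left")
final_table = final_table.merge(aux_table_2, on="city", how="left")
# final_table columns: city, pop, region, area
```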
### Implementation
From our point of view, we have two options for implementing the logic:

1. A scikit-learn compatible class (with `fit`, `transform`, etc.), which has the main benefit of being compatible with the library (and thus, the pipelines).
2. A function, in the style of `fuzzy_join`.

### Scikit-learn compatible class
We believe that, due to the constraints of the scikit-learn API, it looks complicated to do what we want, mainly because we're not sure where to pass the auxiliary tables.
Passing the auxiliary tables as extra arguments to `fit_transform` would break the assumption that it only takes `X` and `y`. Passing them at initialization instead would break the assumption that we should not pass data at initialization time.
However, we noted that using `partial_fit` would solve these issues, at the expense of code length.

### Function
In a function style, we could rather do the join as a single call, which IMO is slicker, but loses the advantage of being sklearn-compatible (which I see as the only downside).
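Such a function-style call might look like this (the `augment_features` name and signature are hypothetical, sketched here for illustration):

```python
import pandas as pd

def augment_features(main_table, aux_tables, on):
    # Hypothetical one-shot API: left-join every auxiliary table onto main_table.
    out = main_table
    for aux in aux_tables:
        out = out.merge(aux, on=on, how="left")
    return out

main_table = pd.DataFrame({"city": ["Paris", "Lille"]})
aux_1 = pd.DataFrame({"city": ["Paris", "Lille"], "region": ["IDF", "HDF"]})
aux_2 = pd.DataFrame({"city": ["Paris", "Lille"], "area": [105, 35]})

final_table = augment_features(main_table, [aux_1, aux_2], on="city")
# final_table columns: city, region, area
```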