Feature augmentation #403
Replies: 3 comments
---
At a first glance, I would use a functional API as in pandas, cf. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html. If the operation is stateless, you will always be able to turn the functional API into a scikit-learn compatible transformer using a `FunctionTransformer`.

If you remove the boilerplate of the data loading, we make a function that merges auxiliary data to the current table.

My only question now is: is the feature augmentation stateless or, in scikit-learn jargon, does it require a `fit` method?

Edit: I slightly overlooked the auxiliary part. It would still work with a `FunctionTransformer`.
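A minimal sketch of that idea (assuming pandas and scikit-learn; the `augment` function, the `key` column, and the left-join logic are all invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def augment(X, aux_tables=()):
    # Stateless: left-join each auxiliary table onto X, nothing to learn.
    for aux in aux_tables:
        X = X.merge(aux, on="key", how="left")
    return X

main_table = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
aux_table = pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]})

# A stateless function wraps directly into a scikit-learn transformer.
fa = FunctionTransformer(augment, kw_args={"aux_tables": [aux_table]})
out = fa.fit_transform(main_table)
# out now carries the columns of both tables: key, value, extra
```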
---
```python
fa = FeatureAugmenter(..., aux_tables=[aux_table_1, aux_table_2, aux_table_3])
final_table = fa.fit_transform(X=main_table)
```
That's the API you want, because the important aspect of the scikit-learn API is that all the data with a sample-wise axis are passed at `fit` time, to enable cross-validation.
The whole purpose of these objects is to enable cross-validation. If not, functions would be enough.
Also, you don't want to apply the SuperVectorizer inside this object; that will be done in a pipeline.
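To illustrate (the `ToyAugmenter` below is a toy stand-in written for this comment, not the real `FeatureAugmenter`; in practice the SuperVectorizer and a model would follow it as later pipeline steps):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

class ToyAugmenter(BaseEstimator, TransformerMixin):
    # Auxiliary tables are fixed resources, so they go in __init__;
    # the sample-wise data X is only seen at fit/transform time.
    def __init__(self, on, aux_tables=()):
        self.on = on
        self.aux_tables = aux_tables

    def fit(self, X, y=None):
        return self  # stateless: a no-op fit keeps the sklearn contract

    def transform(self, X):
        out = X
        for aux in self.aux_tables:
            out = out.merge(aux, on=self.on, how="left")
        return out

main_table = pd.DataFrame({"city": ["Paris", "Lille"]})
aux = pd.DataFrame({"city": ["Paris", "Lille"], "population": [2.1, 0.2]})

# Because it follows the fit/transform contract, it composes in a pipeline
# and can be cross-validated together with whatever estimator comes next.
pipe = make_pipeline(ToyAugmenter(on="city", aux_tables=[aux]))
final_table = pipe.fit_transform(main_table)
```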
---
> If you remove the boilerplate of the data loading,
Long term (not in this iteration), I would indeed like a version of this object that fetches the auxiliary tables directly from the data sources (files, databases). But that raises the question of why not also fetch the main table from the data sources... To be explored later.
> My only question now is: is the feature augmentation stateless or, in scikit-learn jargon, does it require a `fit` method?
Good question. Not certain, but we can (and should) have a no-op `fit`.
---
Hi!
Following the very interesting discussion with @jovan-stojanovic, we had a few concerns about the implementation of the feature augmentation functionality.
### Objective
So, to recap (from my understanding): the point is, given a main table and multiple auxiliary tables, to be able to join them together.
So, with these (clean) tables:
We'd like to get:
The current code to do that would be something along the lines of
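A plausible version of that boilerplate in plain pandas (table and column names invented for illustration):

```python
import pandas as pd

main_table = pd.DataFrame({"city": ["Paris", "Lille"], "pop": [2.1, 0.2]})
aux_table_1 = pd.DataFrame({"city": ["Paris", "Lille"], "region": ["IDF", "HDF"]})
aux_table_2 = pd.DataFrame({"city": ["Paris", "Lille"], "area": [105, 35]})

# One explicit merge per auxiliary table: exactly the boilerplate
# a feature-augmentation object would factor out.
final_table = main_table.merge(aux_table_1, on="city", how="left")
final_table = final_table.merge(aux_table_2, on="city", how="left")
# final_table columns: city, pop, region, area
```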
### Implementation
From our point of view, we have two options for implementing the logic:

1. A scikit-learn compatible class (with `fit`, `transform`, etc.), which has the main benefit of being compatible with the library (and thus, the pipelines).
2. A function, in the style of `fuzzy_join`.

### Scikit-learn compatible class
We believe that, due to the constraints of the scikit-learn API, it looks complicated to do what we want, mainly because we're not sure where to pass the auxiliary tables.
Passing the auxiliary tables as extra arguments to `fit_transform` would break the assumption that it only takes `X` and `y`. Passing them at initialization instead would break the assumption that we should not pass data at initialization time.
However, we noted that using `partial_fit` would solve these issues, at the expense of code length.

### Function
In a function style, we could rather do the join as a single call, which IMO is slicker, but loses the advantage of being sklearn-compatible (which I see as the only downside).
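Such a function-style call might look like this (the `augment_features` name and signature are hypothetical, sketched here for illustration):

```python
import pandas as pd

def augment_features(main_table, aux_tables, on):
    # Hypothetical one-shot API: left-join every auxiliary table onto main_table.
    out = main_table
    for aux in aux_tables:
        out = out.merge(aux, on=on, how="left")
    return out

main_table = pd.DataFrame({"city": ["Paris", "Lille"]})
aux_1 = pd.DataFrame({"city": ["Paris", "Lille"], "region": ["IDF", "HDF"]})
aux_2 = pd.DataFrame({"city": ["Paris", "Lille"], "area": [105, 35]})

final_table = augment_features(main_table, [aux_1, aux_2], on="city")
# final_table columns: city, region, area
```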