Replies: 1 comment
Hey @mrtien12, thank you for your patience! :) In short, the OneHotEncoder treats each unique entry as a completely separate category. For example, "duck" and "green duck" would be considered entirely distinct categories and might be encoded as [0, 1] and [1, 0] respectively. In contrast, the GapEncoder and MinHashEncoder break entries into sub-strings, which lets them capture similarities between categories. So "duck" and "green duck" would get similar vector representations, such as [0.25, 0.50] and [0.28, 0.50], because of their shared sub-strings. We've added a section to the documentation that explains this in more detail. You can check it out here!
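To make the contrast concrete, here is a minimal sketch (not from the thread) comparing the two behaviours. It assumes scikit-learn plus a skrub / dirty_cat version whose encoders accept a 2D single-column input (newer skrub releases may expect a single pandas Series instead); the animal names and `n_components` value are purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from skrub import MinHashEncoder  # with dirty_cat: from dirty_cat import MinHashEncoder

names = pd.DataFrame({"animal": ["duck", "green duck", "goose"]})

# One-hot encoding: every distinct string becomes its own orthogonal column,
# so "duck" and "green duck" share no information at all.
one_hot = OneHotEncoder(sparse_output=False).fit_transform(names)
print(one_hot)

# MinHash (the GapEncoder behaves analogously): entries are broken into character
# n-grams, so strings that share sub-strings end up with overlapping components.
min_hash = MinHashEncoder(n_components=8).fit_transform(names)
print(min_hash)
```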
I am currently encoding a high-cardinality feature, such as car_name, to predict the price of used cars. However, I have a question about why encoders like the GapEncoder tend to improve the performance of regression models. In my current setup, the GapEncoder yields better metrics than more standard encoders such as the one-hot encoder.
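For context, here is a hedged sketch of the kind of comparison described above. The dataset file, the non-car_name columns, and the model choice are illustrative assumptions (not the poster's code), and it again assumes an encoder API that accepts a 2D single-column input.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from skrub import GapEncoder  # with dirty_cat: from dirty_cat import GapEncoder

# Hypothetical dataset: "car_name" is the high-cardinality column, "price" the target.
df = pd.read_csv("used_cars.csv")
X, y = df.drop(columns="price"), df["price"]

def cv_score(encoder):
    """Cross-validated R^2 of the same model, swapping only the car_name encoder."""
    preprocess = make_column_transformer(
        (encoder, ["car_name"]),
        remainder="passthrough",  # remaining columns are assumed to be numeric
    )
    model = make_pipeline(preprocess, HistGradientBoostingRegressor())
    return cross_val_score(model, X, y).mean()

print("one-hot:", cv_score(OneHotEncoder(handle_unknown="ignore", sparse_output=False)))
print("gap    :", cv_score(GapEncoder(n_components=30)))
```

Whether the GapEncoder actually wins depends on the data; consistent with the answer above, the gain usually comes from similar names (e.g. trims of the same model) sharing vector components instead of being treated as unrelated one-hot columns.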