Replies: 1 comment
Hey @mrtien12, thank you for your patience! :) In short, the OneHotEncoder treats each unique entry as a completely separate category. For example, "duck" and "green duck" would be considered entirely distinct categories and might be encoded as [0, 1] and [1, 0] respectively. In contrast, the GapEncoder and MinHashEncoder break entries into sub-strings, which lets them capture similarities between categories. So "duck" and "green duck" would get similar vector representations, such as [0.25, 0.50] and [0.28, 0.50], because of their shared sub-strings. We've added a section to the documentation that explains this in more detail. You can check it out here!
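To make the contrast concrete, here is a minimal sketch (not from the thread) comparing the two behaviours. It assumes scikit-learn plus a skrub / dirty_cat version whose encoders accept a 2D single-column input (newer skrub releases may expect a single pandas Series instead); the animal names and `n_components` value are purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from skrub import MinHashEncoder  # with dirty_cat: from dirty_cat import MinHashEncoder

names = pd.DataFrame({"animal": ["duck", "green duck", "goose"]})

# One-hot encoding: every distinct string becomes its own orthogonal column,
# so "duck" and "green duck" share no information at all.
one_hot = OneHotEncoder(sparse_output=False).fit_transform(names)
print(one_hot)

# MinHash (the GapEncoder behaves analogously): entries are broken into character
# n-grams, so strings that share sub-strings end up with overlapping components.
min_hash = MinHashEncoder(n_components=8).fit_transform(names)
print(min_hash)
```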
I am currently encoding a high-cardinality feature, such as car_name, to predict the price of used cars. However, I have a question about why encoders like the GapEncoder tend to improve the performance of regression models. In my current setup, the GapEncoder yields better metrics than more standard encoders such as the one-hot encoder.
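For context, here is a hedged sketch of the kind of comparison described above. The dataset file, the non-car_name columns, and the model choice are illustrative assumptions (not the poster's code), and it again assumes an encoder API that accepts a 2D single-column input.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from skrub import GapEncoder  # with dirty_cat: from dirty_cat import GapEncoder

# Hypothetical dataset: "car_name" is the high-cardinality column, "price" the target.
df = pd.read_csv("used_cars.csv")
X, y = df.drop(columns="price"), df["price"]

def cv_score(encoder):
    """Cross-validated R^2 of the same model, swapping only the car_name encoder."""
    preprocess = make_column_transformer(
        (encoder, ["car_name"]),
        remainder="passthrough",  # remaining columns are assumed to be numeric
    )
    model = make_pipeline(preprocess, HistGradientBoostingRegressor())
    return cross_val_score(model, X, y).mean()

print("one-hot:", cv_score(OneHotEncoder(handle_unknown="ignore", sparse_output=False)))
print("gap    :", cv_score(GapEncoder(n_components=30)))
```

Whether the GapEncoder actually wins depends on the data; consistent with the answer above, the gain usually comes from similar names (e.g. trims of the same model) sharing vector components instead of being treated as unrelated one-hot columns.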