Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field grouping #194

Open
simonaubertbd opened this issue Nov 15, 2024 · 1 comment
Open

Field grouping #194

simonaubertbd opened this issue Nov 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@simonaubertbd
Copy link

simonaubertbd commented Nov 15, 2024

Hello,

A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help to normalize (like de-joining) your data model for dataviz, performance issue or simplify your analysis.

Exemple

<style> </style>
order_id item_id label model_id length color amount
1 1 A 10 15 Blue 101
2 1 A 10 15 Blue 101
3 2 B 10 15 Blue 101
4 2 B 10 15 Blue 101
5 2 B 10 15 Blue 101
6 3 C 20 25 Red 101
7 3 C 20 25 Red 101
8 3 C 20 25 Red 101
9 4 D 20 25 Red 101
10 4 D 20 25 Red 101
11 4 D 20 25 Red 101

Here, we could split the table in three :
-order

<style> </style>
order_id item_id model_id amount
1 1 10 101,2
2 1 10 103
3 2 10 104,8
4 2 10 106,6
5 2 10 108,4
6 3 20 110,2
7 3 20 112
8 3 20 113,8
9 4 20 115,6
10 4 20 117,4
11 4 20 119,2

-model

<style> </style>
model_id length color
10 15 Blue
20 25 Red

-item

<style> </style>
item_id label
1 A
2 B
3 C
4 D

The tool would take :
-a dataframe in entry
-configuration : ability to select fields.
-output : a table with the recap of groups

<style> </style>
field group field remaining fields
1 item_id False
1 label False
2 model_id False
2 color False
3 order_id True
3 link to group 1 True
3 link to group 2 True
3 amount True

Very important : the non-selected fields (like here, amount), are in the result but all in the "remaining" group.

Algo steps:
1/pre-groups : count distinct of each fields. goal : optimization of algo, to avoid to calculate all pairs
fields that has the same count distinct than the number of rows are automatically excluded and sent to the remaining group
fields that have have the same count distinct are set in the same pre-group

2/ for each group, for each pair of fields,
let's do a distinct of value of the pair
like here

<style> </style>
item_id label
1 A
2 B
3 C
4 D

if in this table, the count distinct of each field is equal to the number of rows, it's a "pair-group"

here, for the model, you will have
-model_id,length
-model_id,color
-length,color

3/Since a field can only belong to one group, it means model_id,length,color which would first (or second) group, then item_id and label

If a field does not belong to a group, he goes to "remaining group" at the end

in the remaining group, you can add a link to the other group since you don't know which field is the key.

<style> </style>
field group field remaining fields
1 item_id False
1 label False
2 model_id False
2 length False
2 color False
3 order_id True
3 link to group 1 True
3 link to group 2 True
3 amount True

Best regards,

Simon

PS : I have in mind an evolution with links between non-remaining table (like here, the model could be linked to the item as an option)

@tgourdel tgourdel added the enhancement New feature or request label Nov 19, 2024
@simonaubertbd
Copy link
Author

simonaubertbd commented Nov 22, 2024

Well, I rethink it. We should have an optional parameter to test other combinations. We could imagine

<style> </style>
model_id length color
10 15 Blue
20 25 Red
30 25 Blue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants