You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help to normalize (like de-joining) your data model for dataviz, performance issue or simplify your analysis.
Exemple
<style>
</style>
order_id
item_id
label
model_id
length
color
amount
1
1
A
10
15
Blue
101
2
1
A
10
15
Blue
101
3
2
B
10
15
Blue
101
4
2
B
10
15
Blue
101
5
2
B
10
15
Blue
101
6
3
C
20
25
Red
101
7
3
C
20
25
Red
101
8
3
C
20
25
Red
101
9
4
D
20
25
Red
101
10
4
D
20
25
Red
101
11
4
D
20
25
Red
101
Here, we could split the table in three :
-order
<style>
</style>
order_id
item_id
model_id
amount
1
1
10
101,2
2
1
10
103
3
2
10
104,8
4
2
10
106,6
5
2
10
108,4
6
3
20
110,2
7
3
20
112
8
3
20
113,8
9
4
20
115,6
10
4
20
117,4
11
4
20
119,2
-model
<style>
</style>
model_id
length
color
10
15
Blue
20
25
Red
-item
<style>
</style>
item_id
label
1
A
2
B
3
C
4
D
The tool would take :
-a dataframe in entry
-configuration : ability to select fields.
-output : a table with the recap of groups
<style>
</style>
field group
field
remaining fields
1
item_id
False
1
label
False
2
model_id
False
2
color
False
3
order_id
True
3
link to group 1
True
3
link to group 2
True
3
amount
True
Very important : the non-selected fields (like here, amount), are in the result but all in the "remaining" group.
Algo steps:
1/pre-groups : count distinct of each fields. goal : optimization of algo, to avoid to calculate all pairs
fields that has the same count distinct than the number of rows are automatically excluded and sent to the remaining group
fields that have have the same count distinct are set in the same pre-group
2/ for each group, for each pair of fields,
let's do a distinct of value of the pair
like here
<style>
</style>
item_id
label
1
A
2
B
3
C
4
D
if in this table, the count distinct of each field is equal to the number of rows, it's a "pair-group"
here, for the model, you will have
-model_id,length
-model_id,color
-length,color
3/Since a field can only belong to one group, it means model_id,length,color which would first (or second) group, then item_id and label
If a field does not belong to a group, he goes to "remaining group" at the end
in the remaining group, you can add a link to the other group since you don't know which field is the key.
<style>
</style>
field group
field
remaining fields
1
item_id
False
1
label
False
2
model_id
False
2
length
False
2
color
False
3
order_id
True
3
link to group 1
True
3
link to group 2
True
3
amount
True
Best regards,
Simon
PS : I have in mind an evolution with links between non-remaining table (like here, the model could be linked to the item as an option)
The text was updated successfully, but these errors were encountered:
Hello,
A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help to normalize (like de-joining) your data model for dataviz, performance issue or simplify your analysis.
Exemple
<style> </style>Here, we could split the table in three :
<style> </style>-order
-model
<style> </style>-item
<style> </style>The tool would take :
<style> </style>-a dataframe in entry
-configuration : ability to select fields.
-output : a table with the recap of groups
Very important : the non-selected fields (like here, amount), are in the result but all in the "remaining" group.
Algo steps:
1/pre-groups : count distinct of each fields. goal : optimization of algo, to avoid to calculate all pairs
fields that has the same count distinct than the number of rows are automatically excluded and sent to the remaining group
fields that have have the same count distinct are set in the same pre-group
2/ for each group, for each pair of fields,
<style> </style>let's do a distinct of value of the pair
like here
if in this table, the count distinct of each field is equal to the number of rows, it's a "pair-group"
here, for the model, you will have
-model_id,length
-model_id,color
-length,color
3/Since a field can only belong to one group, it means model_id,length,color which would first (or second) group, then item_id and label
If a field does not belong to a group, he goes to "remaining group" at the end
in the remaining group, you can add a link to the other group since you don't know which field is the key.
<style> </style>Best regards,
Simon
PS : I have in mind an evolution with links between non-remaining table (like here, the model could be linked to the item as an option)
The text was updated successfully, but these errors were encountered: