Summary tool: a lot more calculations needed #211

Open · simonaubertbd opened this issue Dec 4, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@simonaubertbd

Hello,

As of today, we only have a few metrics about the dataframe:
[screenshot of the current Summary tool output]

We can also note that min and max aren't calculated at all for strings.

Here are a few things I would add:

- min and max working whatever the type is.
  Use case for strings: one very simple example that comes to mind: let's say you have a customer dataset. It turns out that, despite having more than a hundred customers, the alphabetic max is MIGHTYCUSTOMER, Inc. That looks suspicious, because you would expect some customers to come after it alphabetically, so your dataset is clearly not as complete as you thought.

- shortest, longest, and average length for text (a small sketch of such checks follows after this list). A few use cases:
  a/ financial accounts (I will take French accounting, I don't know foreign systems): they must all have the same length, so min and max length have to be equal.
  b/ French department numbers can be either 2 or 3 characters; I want to be sure there are none with 1 character or more than 3.
  c/ also, if I have a string field with 10 distinct values whose min is 0 and max is 9, it is probably worth a look to see whether I can convert it to an integer.
  d/ also, about names, as in my previous example: there are studies about name length distribution, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.

- % of null
- field type
- date interval (well, this one would be a nice-to-have, but difficult)
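For illustration, here is a minimal pandas sketch (mine, not part of the tool; the column names and data are made up) of the kinds of checks that these length and null metrics would enable:

import pandas as pd

# Hypothetical data, only to illustrate the checks described above
df = pd.DataFrame(
    {
        "account_code": ["411000", "512000", "607100", None],  # codes expected to share one length
        "department": ["75", "2A", "971", "9"],  # expected to be 2 or 3 characters
    }
)

# Per-column length and null metrics of the kind requested above
for col in df.columns:
    lengths = df[col].dropna().str.len()
    pct_null = df[col].isna().mean() * 100
    print(
        f"{col}: min_len={lengths.min()}, max_len={lengths.max()}, "
        f"avg_len={lengths.mean():.1f}, %null={pct_null:.1f}"
    )

# a/ accounting codes must all have the same length
code_lengths = df["account_code"].dropna().str.len()
print("account codes share one length:", code_lengths.min() == code_lengths.max())

# b/ department numbers must be 2 or 3 characters long
departments = df["department"].dropna()
print("suspicious departments:", departments[~departments.str.len().isin([2, 3])].tolist())

The min/max on the raw strings themselves (the MIGHTYCUSTOMER check) would simply be df[col].dropna().min() and df[col].dropna().max().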

Best regards,

Simon

@simonaubertbd
Author

@tgourdel I think this would require a mix of solutions, like pandas.DataFrame.dtypes and probably DuckDB. Something more like this (well, first and last are missing, but you see the point):

import duckdb
import pandas as pd
import numpy as np

# Example DataFrame
data = {
    "id": [1, 2, 3, 4, 5],
    "label": ["A", "BB", "CCC", None, "D"],
    "amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)

# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)

# Initialize an output list
results = []

# Analyze each column
for column in df.columns:
    # Determine type
    col_type = str(df[column].dtype)
    is_numeric = pd.api.types.is_numeric_dtype(df[column])

    # Summaries that are valid for any column type, using DuckDB
    query = f"""
    SELECT
        COUNT(*) AS count,
        COUNT(DISTINCT "{column}") AS count_distinct,
        SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
        MIN("{column}") AS min_value,
        MAX("{column}") AS max_value
    FROM df
    """
    summary = con.execute(query).fetchdf().iloc[0].to_dict()

    # Numeric-only summaries (AVG and PERCENTILE_CONT fail on string columns)
    if is_numeric:
        numeric_query = f"""
        SELECT
            AVG("{column}") AS average,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS median
        FROM df
        """
        summary.update(con.execute(numeric_query).fetchdf().iloc[0].to_dict())
    else:
        summary.update({"average": None, "median": None})

    # Additional string-specific analysis for object types
    if df[column].dtype == "object" or col_type == "string":
        lengths = df[column].dropna().str.len()
        shortest_string = df[column].dropna().min()
        longest_string = df[column].dropna().max()
        summary.update(
            {
                "min_length": lengths.min(),
                "max_length": lengths.max(),
                "average_length": lengths.mean(),
                "shortest_string": shortest_string,
                "longest_string": longest_string,
            }
        )
    else:
        summary.update(
            {
                "min_length": None,
                "max_length": None,
                "average_length": None,
                "shortest_string": None,
                "longest_string": None,
            }
        )

    # Percent of nulls
    summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100

    # Add the type
    summary["type"] = col_type

    # Add the column name
    summary["column"] = column

    # Append to results
    results.append(summary)

# Convert results to a DataFrame
output_df = pd.DataFrame(results)

# Reorganize columns for clarity
output_df = output_df[
    [
        "column",
        "type",
        "min_value",
        "max_value",
        "average",
        "median",
        "count",
        "count_distinct",
        "null_count",
        "percent_null",
        "min_length",
        "max_length",
        "average_length",
        "shortest_string",
        "longest_string",
    ]
]

print(output_df)

@simonaubertbd
Author

@tgourdel That looks like exactly what I have in mind:

import duckdb
import pandas as pd
import numpy as np

# Example DataFrame
data = {
    "id": [1, 2, 3, 4, 5],
    "label": ["A", "BB", "CCC", None, "D"],
    "amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)

# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)

# Initialize an output list
results = []

# Analyze each column
for column in df.columns:
    # Determine type
    col_type = str(df[column].dtype)
    is_numeric = pd.api.types.is_numeric_dtype(df[column])

    # Summaries that are valid for any column type, using DuckDB
    query = f"""
    SELECT
        COUNT(*) AS count,
        COUNT(DISTINCT "{column}") AS count_distinct,
        SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
        MIN("{column}") AS min_value,
        MAX("{column}") AS max_value
    FROM df
    """
    summary = con.execute(query).fetchdf().iloc[0].to_dict()

    # Numeric-only summaries (these aggregates fail on string columns)
    if is_numeric:
        numeric_query = f"""
        SELECT
            AVG("{column}") AS average,
            STDDEV_POP("{column}") AS std_dev,
            PERCENTILE_CONT(0.0) WITHIN GROUP (ORDER BY "{column}") AS p0,
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "{column}") AS p25,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS p50,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "{column}") AS p75,
            PERCENTILE_CONT(1.0) WITHIN GROUP (ORDER BY "{column}") AS p100
        FROM df
        """
        summary.update(con.execute(numeric_query).fetchdf().iloc[0].to_dict())
    else:
        summary.update({"average": None, "std_dev": None, "p0": None, "p25": None, "p50": None, "p75": None, "p100": None})

    # First and last value using DuckDB
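    # (Note: FIRST/LAST are order-sensitive; without an ORDER BY they depend on the scan order of df.)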
    first_last_query = f"""
    SELECT 
        FIRST("{column}") AS first_value,
        LAST("{column}") AS last_value
    FROM df
    """
    first_last = con.execute(first_last_query).fetchdf().iloc[0].to_dict()

    # Merge the summaries
    summary.update(first_last)

    # Additional string-specific analysis for object types
    if df[column].dtype == "object" or col_type == "string":
        lengths = df[column].dropna().str.len()
        shortest_string = df[column].dropna().min()
        longest_string = df[column].dropna().max()
        summary.update(
            {
                "min_length": lengths.min(),
                "max_length": lengths.max(),
                "average_length": lengths.mean(),
                "shortest_string": shortest_string,
                "longest_string": longest_string,
            }
        )
    else:
        summary.update(
            {
                "min_length": None,
                "max_length": None,
                "average_length": None,
                "shortest_string": None,
                "longest_string": None,
            }
        )

    # Percent of nulls
    summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100

    # Add the type
    summary["type"] = col_type

    # Add the column name
    summary["column"] = column

    # Append to results
    results.append(summary)

# Convert results to a DataFrame
output_df = pd.DataFrame(results)

# Reorganize columns for clarity
output_df = output_df[
    [
        "column",
        "type",
        "first_value",
        "last_value",
        "min_value",
        "p0",
        "p25",
        "p50",
        "p75",
        "p100",
        "max_value",
        "average",
        "std_dev",
        "count",
        "count_distinct",
        "null_count",
        "percent_null",
        "min_length",
        "max_length",
        "average_length",
        "shortest_string",
        "longest_string",
    ]
]

print(output_df)

What do you think of that?
