Summary tool: a lot more calculations needed #211

Open · simonaubertbd opened this issue Dec 4, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@simonaubertbd

Hello,

As of today, we only have a few metrics about the dataframe:
[screenshot of the current Summary tool output]

We can also note that min and max aren't calculated at all for strings.

Here are a few things I would add:

- min and max working whatever the type is.
  Use case for strings: one very simple example that comes to mind: let's say you have a customer dataset. It turns out that, despite having more than a hundred customers, the alphabetic max is MIGHTYCUSTOMER, Inc. That looks suspicious, because you would expect some customers to come after it alphabetically, so your dataset is clearly not as complete as you thought.

- shortest, longest, and average length for text (a small sketch of such checks follows after this list). A few use cases:
  a/ financial accounts (I will take French accounting, I don't know foreign systems): they must all have the same length, so min and max length have to be equal.
  b/ French department numbers can be either 2 or 3 characters; I want to be sure there are none with 1 character or more than 3.
  c/ also, if I have a string field with 10 distinct values whose min is 0 and max is 9, it is probably worth a look to see whether I can convert it to an integer.
  d/ also, about names, as in my previous example: there are studies about name length distribution, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.

- % of null
- field type
- date interval (well, this one would be a nice-to-have, but difficult)
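For illustration, here is a minimal pandas sketch (mine, not part of the tool; the column names and data are made up) of the kinds of checks that these length and null metrics would enable:

import pandas as pd

# Hypothetical data, only to illustrate the checks described above
df = pd.DataFrame(
    {
        "account_code": ["411000", "512000", "607100", None],  # codes expected to share one length
        "department": ["75", "2A", "971", "9"],  # expected to be 2 or 3 characters
    }
)

# Per-column length and null metrics of the kind requested above
for col in df.columns:
    lengths = df[col].dropna().str.len()
    pct_null = df[col].isna().mean() * 100
    print(
        f"{col}: min_len={lengths.min()}, max_len={lengths.max()}, "
        f"avg_len={lengths.mean():.1f}, %null={pct_null:.1f}"
    )

# a/ accounting codes must all have the same length
code_lengths = df["account_code"].dropna().str.len()
print("account codes share one length:", code_lengths.min() == code_lengths.max())

# b/ department numbers must be 2 or 3 characters long
departments = df["department"].dropna()
print("suspicious departments:", departments[~departments.str.len().isin([2, 3])].tolist())

The min/max on the raw strings themselves (the MIGHTYCUSTOMER check) would simply be df[col].dropna().min() and df[col].dropna().max().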

Best regards,

Simon

@simonaubertbd
Author

@tgourdel I think this would require a mix of solutions, like pandas.DataFrame.dtypes and probably DuckDB. Something more like this (well, first and last are missing, but you see the point):

import duckdb
import pandas as pd
import numpy as np

# Example DataFrame
data = {
    "id": [1, 2, 3, 4, 5],
    "label": ["A", "BB", "CCC", None, "D"],
    "amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)

# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)

# Initialize an output list
results = []

# Analyze each column
for column in df.columns:
    # Determine type
    col_type = str(df[column].dtype)
    is_numeric = pd.api.types.is_numeric_dtype(df[column])

    # Summaries that are valid for any column type, using DuckDB
    query = f"""
    SELECT
        COUNT(*) AS count,
        COUNT(DISTINCT "{column}") AS count_distinct,
        SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
        MIN("{column}") AS min_value,
        MAX("{column}") AS max_value
    FROM df
    """
    summary = con.execute(query).fetchdf().iloc[0].to_dict()

    # Numeric-only summaries (AVG and PERCENTILE_CONT fail on string columns)
    if is_numeric:
        numeric_query = f"""
        SELECT
            AVG("{column}") AS average,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS median
        FROM df
        """
        summary.update(con.execute(numeric_query).fetchdf().iloc[0].to_dict())
    else:
        summary.update({"average": None, "median": None})

    # Additional string-specific analysis for object types
    if df[column].dtype == "object" or col_type == "string":
        lengths = df[column].dropna().str.len()
        shortest_string = df[column].dropna().min()
        longest_string = df[column].dropna().max()
        summary.update(
            {
                "min_length": lengths.min(),
                "max_length": lengths.max(),
                "average_length": lengths.mean(),
                "shortest_string": shortest_string,
                "longest_string": longest_string,
            }
        )
    else:
        summary.update(
            {
                "min_length": None,
                "max_length": None,
                "average_length": None,
                "shortest_string": None,
                "longest_string": None,
            }
        )

    # Percent of nulls
    summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100

    # Add the type
    summary["type"] = col_type

    # Add the column name
    summary["column"] = column

    # Append to results
    results.append(summary)

# Convert results to a DataFrame
output_df = pd.DataFrame(results)

# Reorganize columns for clarity
output_df = output_df[
    [
        "column",
        "type",
        "min_value",
        "max_value",
        "average",
        "median",
        "count",
        "count_distinct",
        "null_count",
        "percent_null",
        "min_length",
        "max_length",
        "average_length",
        "shortest_string",
        "longest_string",
    ]
]

print(output_df)

@simonaubertbd
Author

@tgourdel That looks like exactly what I have in mind:

import duckdb
import pandas as pd
import numpy as np

# Example DataFrame
data = {
    "id": [1, 2, 3, 4, 5],
    "label": ["A", "BB", "CCC", None, "D"],
    "amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)

# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)

# Initialize an output list
results = []

# Analyze each column
for column in df.columns:
    # Determine type
    col_type = str(df[column].dtype)
    is_numeric = pd.api.types.is_numeric_dtype(df[column])

    # Summaries that are valid for any column type, using DuckDB
    query = f"""
    SELECT
        COUNT(*) AS count,
        COUNT(DISTINCT "{column}") AS count_distinct,
        SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
        MIN("{column}") AS min_value,
        MAX("{column}") AS max_value
    FROM df
    """
    summary = con.execute(query).fetchdf().iloc[0].to_dict()

    # Numeric-only summaries (these aggregates fail on string columns)
    if is_numeric:
        numeric_query = f"""
        SELECT
            AVG("{column}") AS average,
            STDDEV_POP("{column}") AS std_dev,
            PERCENTILE_CONT(0.0) WITHIN GROUP (ORDER BY "{column}") AS p0,
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "{column}") AS p25,
            PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS p50,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "{column}") AS p75,
            PERCENTILE_CONT(1.0) WITHIN GROUP (ORDER BY "{column}") AS p100
        FROM df
        """
        summary.update(con.execute(numeric_query).fetchdf().iloc[0].to_dict())
    else:
        summary.update({"average": None, "std_dev": None, "p0": None, "p25": None, "p50": None, "p75": None, "p100": None})

    # First and last value using DuckDB
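    # (Note: FIRST/LAST are order-sensitive; without an ORDER BY they depend on the scan order of df.)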
    first_last_query = f"""
    SELECT 
        FIRST("{column}") AS first_value,
        LAST("{column}") AS last_value
    FROM df
    """
    first_last = con.execute(first_last_query).fetchdf().iloc[0].to_dict()

    # Merge the summaries
    summary.update(first_last)

    # Additional string-specific analysis for object types
    if df[column].dtype == "object" or col_type == "string":
        lengths = df[column].dropna().str.len()
        shortest_string = df[column].dropna().min()
        longest_string = df[column].dropna().max()
        summary.update(
            {
                "min_length": lengths.min(),
                "max_length": lengths.max(),
                "average_length": lengths.mean(),
                "shortest_string": shortest_string,
                "longest_string": longest_string,
            }
        )
    else:
        summary.update(
            {
                "min_length": None,
                "max_length": None,
                "average_length": None,
                "shortest_string": None,
                "longest_string": None,
            }
        )

    # Percent of nulls
    summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100

    # Add the type
    summary["type"] = col_type

    # Add the column name
    summary["column"] = column

    # Append to results
    results.append(summary)

# Convert results to a DataFrame
output_df = pd.DataFrame(results)

# Reorganize columns for clarity
output_df = output_df[
    [
        "column",
        "type",
        "first_value",
        "last_value",
        "min_value",
        "p0",
        "p25",
        "p50",
        "p75",
        "p100",
        "max_value",
        "average",
        "std_dev",
        "count",
        "count_distinct",
        "null_count",
        "percent_null",
        "min_length",
        "max_length",
        "average_length",
        "shortest_string",
        "longest_string",
    ]
]

print(output_df)

What do you think of that?
