Partial Correlation gives unexpected output for toy example #435
Comments
Here is another case with two categorical variables.

import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.api as sm
from pingouin import partial_corr
# Step 1: Create synthetic data
n_samples = 1000 # Number of samples
n_categorical_var1 = 10 # Number of categories for variable 1
n_categorical_var2 = 3 # Number of categories for variable 2
# Random assignment of categories to samples
categorical_var1 = np.random.randint(0, n_categorical_var1, n_samples)
categorical_var2 = np.random.randint(0, n_categorical_var2, n_samples)
# True response values (random for this case)
true_response = np.random.normal(loc=categorical_var2+categorical_var1, scale=0.1, size=n_samples)
# Create a DataFrame
df = pd.DataFrame({'categorical_variable1': categorical_var1, 'categorical_variable2': categorical_var2})
df['True_Response'] = true_response
# Calculate the mean response for each category as predicted values
cat_var1_mean_prediction = df.groupby('categorical_variable1')['True_Response'].transform('mean')
cat_var2_mean_prediction = df.groupby('categorical_variable2')['True_Response'].transform('mean')
# Use one of the mean predictions as the predicted response
df['Predicted_Response'] = cat_var1_mean_prediction
print(df)
# Step 2: Encode categorical variables
df_encoded = pd.get_dummies(df, columns=["categorical_variable1", "categorical_variable2"], dtype=int)
df_encoded = df_encoded.drop(columns=["categorical_variable1_0", "categorical_variable2_0"])
# Step 3: Calculate partial correlation
# We regress True_Response and Predicted_Response on the covariates (categorical variables)
X_cov = sm.add_constant(df_encoded.drop(['True_Response', 'Predicted_Response'], axis=1))
# Regress true response on covariates
model_true = sm.OLS(df_encoded['True_Response'], X_cov).fit()
residuals_true = model_true.resid
# Regress predicted response (constant) on covariates
model_pred = sm.OLS(df_encoded['Predicted_Response'], X_cov).fit()
residuals_pred = model_pred.resid
# Compute partial correlation as the correlation of the residuals
partial_corr_residuals, _ = pearsonr(residuals_true, residuals_pred)
print(f'Partial Correlation between True Response and Predicted Response: {partial_corr_residuals:.4f}')
# Step 4: Use Pingouin to calculate partial correlation directly
covariates = list(df_encoded.drop(columns=["True_Response", "Predicted_Response"]).columns)
result = partial_corr(
data=df_encoded,
x="Predicted_Response",
y="True_Response",
covar=covariates,
method="pearson",
)
# Extract the results from Pingouin
r = result["r"].iloc[0]
p = result["p-val"].iloc[0]
print(f"Partial Correlation (Pingouin): r={r:.4f}, p={p:.4e}")
Hi @PascalIversen, thanks for opening the issue. One approach that has been discussed in the past is to raise an error if the covariate(s) are identical to x and/or y. Alternatively, we could switch back to a regression-based approach in this particular corner case.
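For illustration, here is a minimal sketch of what that identity check could look like (the helper name and its placement are assumptions for illustration, not pingouin's actual implementation):

```python
# Hypothetical sketch of the identity check suggested above; not part of
# pingouin's actual code.
import pandas as pd


def _check_covar_not_identical(data: pd.DataFrame, x: str, y: str, covar: list) -> None:
    """Raise if any covariate column is an exact copy of x or y."""
    for c in covar:
        if data[c].equals(data[x]) or data[c].equals(data[y]):
            raise ValueError(
                f"Covariate {c!r} is identical to x or y; "
                "the partial correlation is not meaningful in this case."
            )
```

The regression-based fallback mentioned above is essentially what the statsmodels snippet in the previous comment does: regress x and y on the covariates and correlate the residuals.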
Hey @raphaelvallat, thank you very much for your fast response. In the second example the covariates are not identical to x or y (the covariates are categorical). I can't resort to the regression approach, as that would be too slow. I guess for now I'll add a negligible amount of noise to x, as in the code below.

import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.api as sm
from pingouin import partial_corr
# Step 1: Create synthetic data
n_samples = 1000 # Number of samples
n_categorical_var1 = 10 # Number of categories for variable 1
n_categorical_var2 = 3 # Number of categories for variable 2
# Random assignment of categories to samples
categorical_var1 = np.random.randint(0, n_categorical_var1, n_samples)
categorical_var2 = np.random.randint(0, n_categorical_var2, n_samples)
# True response values (random for this case)
true_response = np.random.normal(loc=categorical_var2+categorical_var1, scale=0.1, size=n_samples)
# Create a DataFrame
df = pd.DataFrame({'categorical_variable1': categorical_var1, 'categorical_variable2': categorical_var2})
df['True_Response'] = true_response
# Calculate the mean response for each category as predicted values
cat_var1_mean_prediction = df.groupby('categorical_variable1')['True_Response'].transform('mean')
cat_var2_mean_prediction = df.groupby('categorical_variable2')['True_Response'].transform('mean')
# Use one of the mean predictions as the predicted response
df['Predicted_Response'] = cat_var1_mean_prediction
# Step 2: Encode categorical variables
df_encoded = pd.get_dummies(df, columns=["categorical_variable1", "categorical_variable2"], dtype=int)
df_encoded = df_encoded.drop(columns=["categorical_variable1_0", "categorical_variable2_0"])
# Step 3: Calculate partial correlation
# We regress True_Response and Predicted_Response on the covariates (categorical variables)
X_cov = sm.add_constant(df_encoded.drop(['True_Response', 'Predicted_Response'], axis=1))
# Regress true response on covariates
model_true = sm.OLS(df_encoded['True_Response'], X_cov).fit()
residuals_true = model_true.resid
# Regress predicted response (constant) on covariates
model_pred = sm.OLS(df_encoded['Predicted_Response'], X_cov).fit()
residuals_pred = model_pred.resid
# Compute partial correlation as the correlation of the residuals
partial_corr_residuals, _ = pearsonr(residuals_true, residuals_pred)
print(f'Partial Correlation between True Response and Predicted Response: {partial_corr_residuals:.4f}')
df_encoded['Predicted_Response'] = cat_var1_mean_prediction + np.random.normal(loc=0, scale=0.00001, size=n_samples)
# Step 4: Use Pingouin to calculate partial correlation directly
covariates = list(df_encoded.drop(columns=["True_Response", "Predicted_Response"]).columns)
result = partial_corr(
data=df_encoded,
x="Predicted_Response",
y="True_Response",
covar=covariates,
method="pearson",
)
# Extract the results from Pingouin
r = result["r"].iloc[0]
p = result["p-val"].iloc[0]
print(f"Partial Correlation (Pingouin): r={r:.4f}, p={p:.4e}") Partial Correlation between True Response and Predicted Response: 0.0013 |
Thanks for the additional details. I think that the issue in the categorical example is that the resulting covariance matrix is rank-deficient (unless you add some random noise). This is because your input data has very strong multicollinearity.

data = df_encoded.copy()
# data["Predicted_Response"] += np.random.normal(loc=0, scale=0.00001, size=n_samples)
V = data.cov(numeric_only=True)
print(np.linalg.matrix_rank(V), data.shape[1])
Output:
12 13

As a general solution, we could perhaps raise a warning or error if the covariance matrix does not have full rank.
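A rough sketch of what such a check might look like (a hypothetical helper for illustration, not existing pingouin code):

```python
# Hypothetical sketch of a full-rank check on the covariance matrix,
# illustrating the warning/error suggested above (not pingouin's actual code).
import warnings

import numpy as np
import pandas as pd


def _warn_if_rank_deficient(data: pd.DataFrame, cols: list) -> None:
    """Warn when the covariance matrix of the selected columns is rank-deficient."""
    V = data[cols].cov(numeric_only=True)
    rank = np.linalg.matrix_rank(V.to_numpy())
    if rank < V.shape[0]:
        warnings.warn(
            f"Covariance matrix is rank-deficient (rank {rank} < {V.shape[0]}); "
            "partial correlation estimates may be unreliable."
        )
```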
Hi, thanks for this great library!
I am getting perfect correlation for the following toy example. I expected ~zero correlation as in the regression approach.
Output: