from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np
= "IL"
STATE
= ACSDataSource(survey_year='2018',
data_source ='1-Year',
horizon='person')
survey
= data_source.get_data(states=[STATE], download=True) acs_data
Data Wrangling
For this analysis, I’ve chosen to pull the 2018 PUMS data from the state of Illinois.
# Reducing number of features
=['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'RAC1P', 'SEX', 'PINCP']
possible_features acs_data[possible_features]
AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | RAC1P | SEX | PINCP | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 86 | NaN | 16.0 | 2 | NaN | 17 | 16 | NaN | 1 | 2 | 22200.0 |
1 | 30 | 1.0 | 18.0 | 5 | 4220.0 | 22 | 17 | 99.0 | 1 | 2 | 15000.0 |
2 | 57 | NaN | 19.0 | 5 | NaN | 17 | 17 | NaN | 2 | 1 | 0.0 |
3 | 69 | 1.0 | 19.0 | 3 | 1551.0 | 20 | 16 | NaN | 1 | 1 | 21600.0 |
4 | 18 | 3.0 | 18.0 | 5 | 2350.0 | 6 | 17 | NaN | 2 | 1 | 350.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126451 | 72 | NaN | 16.0 | 1 | NaN | 17 | 1 | NaN | 1 | 1 | 68800.0 |
126452 | 43 | 1.0 | 21.0 | 1 | 710.0 | 210 | 0 | 40.0 | 6 | 1 | 89000.0 |
126453 | 39 | NaN | 21.0 | 1 | NaN | 210 | 1 | NaN | 6 | 2 | 0.0 |
126454 | 7 | NaN | 4.0 | 5 | NaN | 210 | 2 | NaN | 6 | 2 | NaN |
126455 | 5 | NaN | 1.0 | 5 | NaN | 210 | 2 | NaN | 6 | 1 | NaN |
126456 rows × 11 columns
= [f for f in possible_features if f not in ["PINCP", "SEX"]] features_to_use
Below is a preprocessing step to filter our data in corresponding to the literature’s filters over the variables AGEP
(age), PINCP
(total income), WKHP
(usual hours worked in the last 12 months), and PWGTP
(individual’s weight).
# source code taken from folktables PUMS problem constructions
def adult_filter(data):
"""Mimic the filters in place for Adult data.
Adult documentation notes: Extraction was done by Barry Becker from
the 1994 Census database. A set of reasonably clean records was extracted
using the following conditions:
((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
"""
= data
df = df[df['AGEP'] > 16]
df = df[df['PINCP'] > 100]
df = df[df['WKHP'] > 0]
df = df[df['PWGTP'] >= 1]
df return df
# Transform our IncomeProblem to a BasicProblem to minimize features we would like to use
= BasicProblem(
IncomeProblem =features_to_use,
features='PINCP',
target=lambda x: 1*(x > 50000),
target_transform='SEX',
group= adult_filter,
preprocess=lambda x: np.nan_to_num(x, -1),
postprocess
)
= IncomeProblem.df_to_numpy(acs_data) features, label, group
# Split our data into test and train sets
from sklearn.model_selection import train_test_split
= train_test_split(
X_train, X_test, y_train, y_test, group_train, group_test =0.2, random_state=0) features, label, group, test_size
Prediction Problem
My prediction problem will be predicting whether income is over $50k on the basis of demographics excluding sex and auditing for gender bias. I will be choosing to use a DecisionTree
model for this analysis.
Basic Descriptives
# Convert our data into a dataframe for visualization
import pandas as pd
= pd.DataFrame(X_train, columns = features_to_use)
df "group"] = group_train
df["label"] = y_train
df[
df
AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | RAC1P | group | label | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 48.0 | 1.0 | 20.0 | 1.0 | 3515.0 | 17.0 | 1.0 | 40.0 | 1.0 | 2 | 0 |
1 | 49.0 | 1.0 | 21.0 | 1.0 | 850.0 | 17.0 | 1.0 | 40.0 | 1.0 | 1 | 1 |
2 | 44.0 | 1.0 | 1.0 | 1.0 | 4251.0 | 303.0 | 12.0 | 40.0 | 1.0 | 1 | 0 |
3 | 46.0 | 1.0 | 23.0 | 1.0 | 440.0 | 17.0 | 0.0 | 40.0 | 1.0 | 2 | 1 |
4 | 36.0 | 1.0 | 16.0 | 1.0 | 5610.0 | 17.0 | 1.0 | 60.0 | 1.0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53607 | 32.0 | 1.0 | 20.0 | 1.0 | 8740.0 | 17.0 | 0.0 | 48.0 | 1.0 | 1 | 0 |
53608 | 75.0 | 3.0 | 22.0 | 1.0 | 2320.0 | 17.0 | 0.0 | 7.0 | 1.0 | 1 | 0 |
53609 | 53.0 | 1.0 | 16.0 | 3.0 | 4220.0 | 17.0 | 10.0 | 6.0 | 1.0 | 2 | 0 |
53610 | 57.0 | 3.0 | 16.0 | 1.0 | 7000.0 | 17.0 | 1.0 | 50.0 | 1.0 | 1 | 0 |
53611 | 37.0 | 3.0 | 17.0 | 1.0 | 3945.0 | 17.0 | 0.0 | 40.0 | 1.0 | 1 | 1 |
53612 rows × 11 columns
In this dataset, there are 53612 total individuals present.
# Calculate the proportion of individuals who have an income of over $50k
"label"].mean() df[
0.39455718868909945
Additionally, of the 53612 individuals in the data, only about \(39.5\)% of the individuas have an income of over $50k.
# Calculate the total individuals per group
= df.groupby("group", as_index = False).size()
group_prop = {"size" : "total"}) group_prop.rename(columns
group | total | |
---|---|---|
0 | 1 | 27724 |
1 | 2 | 25888 |
In this data, there are a total of 27724 males and 25888 females identifying folks.
= df.groupby(["group"], as_index = False)["label"].mean()
group_label group_label
group | label | |
---|---|---|
0 | 1 | 0.478611 |
1 | 2 | 0.304543 |
In the male
group, about \(47.9\)% of male individuals have an income of over 50k. Whereas in the female
group, only about \(30.5\)% of female individuals have an income of over 50k.
Here, we can visualize the intersectional relationship between the proportion of individuals in the SEX
and RAC1P
groups who have an income of over $50k.
# Compute proportion of positive labels by race and sex
= df.groupby(["RAC1P", "group"], as_index = False)["label"].mean() race_prop
# Visualize positive labels for both and race
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so
=race_prop, x="RAC1P", y="label", hue="group", kind="bar")
sns.catplot(data
# Add labels to title, x and y axis
"Distribution of Race and Sex with Income Over $50K")
plt.title("Race Group")
plt.xlabel("Proportion of Individuals with Income Over $50K")
plt.ylabel(
plt.show()
Shown above is a visualization of the different racial groups studied and their proportion of individuals who have an income of over $50k grouped by sex in Illinois.
One major observation from the plot is that the male
sex has a larger proportion of individuals who have an income of over 50k than the female
sex for all race groups with the exceptions to race groups 4
and 7
— Alaska Natives alone and Native Hawaiian and Other Pacific Islanders, respectively. Alaska Natives alone are also outliers in this plot as both sex groups show zero proportion of all individuals who have an income of over 50k in Illinois; similarly, females in the race group 5
(American Indian and Alaska Native tribes specified, or American Indian or Alaska Native, not specified and no other races) show zero proportion of all individuals who have an income of over 50k.
All RAC1P
groups and other variables used in this analysis are described in the Appendix here.
Training My Model
Tuning Complexity of DecisionTree
Now that we’ve explored our data, we can train a machine learning model to predict the individuals living in Illinos who have an income of over $50k. Using the DecisionTree
model, I can tune the depth
using a cross-validation of 10-fold to choose a tree that maximizes the accuracy.
For each level complexity of our tree, we can assess its performance and verify which depth we should use for our model.
# From Classifier Lecture
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import cross_val_score
12345)
np.random.seed(
= plt.subplots(1)
fig, ax
# Plotting each of the complexity and its performance
for d in range(2, 15):
= DecisionTreeClassifier(max_depth = d)
T = cross_val_score(T, X_train, y_train, cv = 10).mean()
m = "black")
ax.scatter(d, m, color
= ax.set(xlabel = "Complexity (depth)",
labs = "Performance (score)",
ylabel = "Performance (score) vs Complexity (depth) of DecisionTreeClassifier") title
From the plot, it is observed that a complexity of roughly 11 is the best for the data set which maximizes the performance of the DecisionTreeClassifier.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
We can now train our model and fit it over the testing set:
= make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth = 11))
model model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('decisiontreeclassifier', DecisionTreeClassifier(max_depth=11))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('standardscaler', StandardScaler()), ('decisiontreeclassifier', DecisionTreeClassifier(max_depth=11))])
StandardScaler()
DecisionTreeClassifier(max_depth=11)
Auditing My Model
Now that we’ve trained and fit our model, we can perform an audit for gender bias over the test data with the SEX
group and our PINCP
label.
Overall Measures
Overall Accuracy
# Generate predictions based off model
= model.predict(X_test) y_hat
The overall accuracy in predicting whether an individual has an income of over $50k is:
= (y_hat == y_test).mean()
total_acc total_acc
0.7805132796180245
Thus, our model achieved a \(78\)% accuracy in predicting whether a individual has an income of over $50k in Illinois.
Positive Predictive Value (PPV)
from sklearn.metrics import confusion_matrix
# Compute the confusion matrix for the model
= confusion_matrix(y_test, y_hat, normalize = "true")
cm
# Extract PPV, FPR, and FNR
= cm[0][0]
ppv = cm[0][1]
fpr = cm[1][0]
fnr
cm
array([[0.80999511, 0.19000489],
[0.26570881, 0.73429119]])
The positive predictive value (PPV) can be calculated given the equation:
\[PPV = \frac{TP}{TP + FP}\]
Fortunately, our confusion matrix does this calculation for us! As shown in the top left corner of the confusion matrix of the overall PPV of our model is about \(81\)%.
Hence, of the total population of the dataset, about \(81\)% of individuals who reside in Illinois who does have an income of over $50k is true.
False Negative Rate (FNR) and False Positive Rate (FPR)
Similarly, we can extract the overall FNR and FPR of our model without having to compute its respective formulas using the confusion matrix. The FNR (shown in the bottom left corner) is about \(27\)%, and the FPR (shown in the top righ corner) is about \(19\)%.
Our model has a FNR of about \(27\)% which suggests that the model is classifying individuals with income of over $50k as having an income below the threshold nearly a third of the time.
Furthermore, our model has a FPR of about \(19\)% which suggests that the model is classifying individuals with income below $50k as having an income above the threshold about 19% of the time.
Now that we’ve audited our overall model, we can check if similar measures are observed for both male and female groups.
By-Group Measures
We can further our audit analysis by checking the previous measures for each group male
and female
who makes an income of over $50k.
Male Individuals
Here is the confusion matrix for male
individuals in Illinois who make an income of over $50k:
= confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1], normalize = "true")
mcm
# Extract PPV, FPR, and FNR
= mcm[0][0]
mppv = mcm[0][1]
mfpr = mcm[1][0]
mfnr
mcm
array([[0.77290287, 0.22709713],
[0.25985222, 0.74014778]])
Male Accuracy
= (y_hat == y_test)[group_test == 1].mean()
male_acc male_acc
0.7574214202561118
Our model correctly predicts whether a male
individual in Illinois has an income of over $50k about 76% of the time.
Positive Predictive Value (PPV)
The PPV of my model for male individuals is about 77% which suggests that for this population, our model correctly classifies male individuals in Illinois who does have an income of over $50k (true positives) 77% of the time.
False Negative Rate (FNR) and False Positive Rate (FPR)
The FNR and FPR of my model for male individuals is about \(26\)% and \(23\)%, respectively.
Hence, our model does not classify the true positives (male individuals who have an income of over $50k) 26% of the time; in other words, 26% of male individuals who have an income above the 50k threshold are classified as having an income below the threshold.
Lastly, given the FPR of nearly 23%, our model incorrectly classifies male individuals who have an income below $50k as true positives nearly 23% of the time.
Female Individuals
Now, let’s check the probability measures for female individuals. Shown below is the confusion matrix for female
individuals who make an income of over $50k:
= confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2], normalize = "true")
fcm
# Extract PPV, FPR, and FNR
= fcm[0][0]
fppv = fcm[0][1]
ffpr = fcm[1][0]
ffnr
fcm
array([[0.83947368, 0.16052632],
[0.27535497, 0.72464503]])
Female Accuracy
The accuracy for predicting whether a female
individual in Illinois has an income of over $50k is:
= (y_hat == y_test)[group_test == 2].mean()
female_acc female_acc
0.8048071034905082
Our model correctly predicts whether a female
individual has an income of over $50k about 80% of the time.
Positive Predictive Value (PPV)
Observing the confusion matrix for the female
group, we can see that the PPV of our model is about \(84\)%.
Hence, our model classifies the true positives, female individuals who does have an income of over $50k, as true nearly 84% of the time.
False Negative Rate (FNR) and False Positive Rate (FPR)
Our model displays the FNR and FPR for female individuals to be about 27% and 16%, respectively.
Hence, our model does not classify the true positives (female individuals who have an income of over $50k) 27% of the time; in other words, 27% of female individuals who have an income above the 50k threshold are classified as having an income below the threshold.
Lastly, given the FPR of about 16%, our model incorrectly classifies female individuals who have an income below $50k as true positives about 16% of the time.
Comparison
After auditing for both groups, we observe that both male
and female
groups have a similar FNR
which suggests that the model is incorrectly missing a similar proportion of individuals in Illinois who have an income above $50k (but slightly higher for female individuals).
However, the FPR
for the female group is lower than the male group. Thus, our model is generating fewer false positives for female individuals than male individuals; in other words, for female individuals who have an income below $50k, our model will classify them to belong in the negative class (individuals who have an income below 50k) about 84% of the time — while male individuals are only classified to belong in the negative class about 77% of the time.
Shown below is a visualization of the model’s overall measures and of each subgroup’s.
= 0.2
bar_width
= [total_acc, ppv, fpr, fnr]
total = [male_acc, mppv, mfpr, mfnr]
male = [female_acc, fppv, ffpr, ffnr]
female
# Set position of bar on X axis
= np.arange(len(total))
br1 = [x + bar_width for x in br1]
br2 = [x + bar_width for x in br2]
br3
# Make the plot
= bar_width,
plt.bar(br1, total, width ='grey', label ='overall')
edgecolor = bar_width,
plt.bar(br2, male, width ='grey', label ='male')
edgecolor = bar_width,
plt.bar(br3, female, width ='grey', label ='female')
edgecolor
"Distribution of Model's Overall and By-Group Measures")
plt.title(
# Adding Xticks
'Measures', fontsize = 12)
plt.xlabel('Probability', fontsize = 12)
plt.ylabel(+ 0.2 for r in range(len(total))],
plt.xticks([r 'Accuracy', 'PPV', 'FPR', 'FNR'])
[
= plt.legend()
legend plt.show()
Bias Measures
Calibration
Now, we can assess whether our model is reasonably calibrated by grouping by the average of both group
and the true_label
over the predicted_label
. For this analysis, a calibrated model means that the fraction of individuals who have an income of over $50k is the same across both the male and female groups.
To do this, I first created a new data frame containing my group (SEX
), true_label (y_test
), and my predicted labels (y_hat
). Then, I mutated the “group” column to Male
(1) and Female
(2) for visual readability.
# Create a dataframe containing the true and predicted labels and groups
= ["group", "true_label", "predicted_label"]
calib_features = pd.DataFrame({"group": group_test, "true_label": y_test, "predicted_label": y_hat}, columns = calib_features)
calib_df
# Mutate df group values for readability using .replace()
= calib_df["group"].replace([1, 2], ["Male", "Female"])
group_test = pd.DataFrame({"group": group_test, "true_label": y_test, "predicted_label": y_hat}, columns = calib_features)
calib_df calib_df.head()
group | true_label | predicted_label | |
---|---|---|---|
0 | Female | 0 | 0 |
1 | Male | 0 | 1 |
2 | Male | 0 | 0 |
3 | Male | 0 | 1 |
4 | Female | 1 | 1 |
After creating the new dataframe, I generated two visualizations: a line plot and bar plot of the mean across the groups’ predicted labels.
# Perform calibration test using .groupby() method
= calib_df.groupby(["group", "true_label"])["predicted_label"].mean().reset_index(name = "mean")
means
# Visualizations of
= sns.lineplot(data = means, x = "true_label", y = "mean", hue = "group")
g
"Line Plot of Mean Across Sub-groups Labels")
plt.title("True Label")
plt.xlabel("Mean")
plt.ylabel( plt.show()
= sns.barplot(data = means, x = "true_label", y = "mean", hue = "group")
p
"Distribution of Mean Across Group Labels")
plt.title("True Label")
plt.xlabel("Mean")
plt.ylabel( plt.show()
From the perspective of calibration, the model might suggest that it is biased in the direction of the Male
group. Out of those who were predicted to have an income of over $50k, slightly more male individuals were predicted to have an income over the income threshold in comparison to Female
individuals.
Additionally, the line plot shows that across the ranges of true labels, male individuals maintain a higher statistical mean and proportion who are predicted to make an income of over $50k than female individuals. Thus, both plots can argue that the model is biased in the direction of the male group.
However, since the rates for both groups are sufficiently close, it may also be correct to suggest that the model appears well-calibrated.
Error Rate Balance
In her literature, Chouldechova defines error rate balance to be the case where both groups’ false positve and false negative rates are equal. If that requirement is met, then the model satisfies error rate balance.
From the previous audit of each group’s measures, we found that the FNR and FPR of the male
and female
groups are:
# Male rates
mfpr, mfnr
(0.2270971302428256, 0.25985221674876846)
# Female rates
ffpr, ffnr
(0.16052631578947368, 0.2753549695740365)
Observing that our model did not generate equal FNR and FPR for the male
and female
groups, our model does not satisfy error rate balance.
Statistical Parity
Chouldechova defines stastical parity, or predictive parity, to be the case where the PPV
of both groups are equal. If that requirement is met, then the model satisfies statistical parity.
# Male and female PPV
mppv, fppv
(0.7729028697571744, 0.8394736842105263)
Given the PPV measures calculated before, it is observed that the model did not generate equal PPV
for both groups as the female
group had a higher PPV than the male
group. Thus, our model does not satisfy statistical parity.
Concluding Discussion
After my analysis, I found that most often, male individuals, could stand to benefit from a system that is able to predict whether an individual’s income is over $50k. However, even while male, there are intersectional differences as we include race; thus, there are benefits to being in a certain racial group (such as White and Asian), and those benefits vary (for the positive and negative) dependent on which group an individual belongs to. Financial institutions such as banks and marketing firms may buy my model for commercial use because they can easily target certain consumers who reach the income threshold and are more willing (due to financial expectation) to buy their products. Governmental programs (e.g. food stamps) may also want to buy this model to determine who gets to reap the benefits of governmental assistance.
Based on my bias audit, the impact of deploying my model for large-scale prediction in commercial or governmental settings is that more male individuals will be classified as having an income of over $50k than female individuals. As a result, male individuals could have better matching rates from commercial or governmental programs than female individuals due to their expected financial status.
Based on my bias audit, my model displays a form of allocative bias with how opportunities that arise from governmental programs, or resources such as loans and insurance, are allocated in different proportions across SEX
groups. Additionally, since the FPR differs across gender groups, the model might possess gender bias in predicting whether an individual in Illinois makes an income of over $50k.
Beyond bias, there is a concern with individuals who identify as non-binary; the lack of representation and/or data for non-binary folks may greatly affect their accessibility to governmental and commercial programs as only male
and female
groups are accounted for in the model. Of course, this is a concern regarding the improvement of the PUMS data collection to diversify the gender selection for respondents. Lastly, it might be interesting to audit our model for racial bias instead of gender and observe if racial biases are also present in our model.