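Feature importance is central to building better, simpler, and more generalizable machine learning models. In this notebook we will compute feature importances with several strategies (Spearman rank correlation, PCA, mRMR, permutation importance, and drop-column importance), compare the strategies, use them for automatic feature selection, and estimate the variability of the importances by bootstrapping.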
The data set covers housing prices, and the target is the sale price of the house. There are 13 variables, some categorical (e.g. condition) and others continuous (e.g. sqft_lot).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', None)
data = pd.read_csv('kc_house_data.csv', parse_dates=['date'])
data.head()
# Extract the sale year from the date
data['year'] = data.date.dt.year
# Drop columns
data.drop(columns=['id','date','zipcode'], inplace=True)
data.shape
X = data.drop(columns=['price'])
y = data['price']
%run featimp.py
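# spearmanr stacks y after the columns of X, so price is the LAST row/column of rho
rho, pval = spearmanr(X, y)
rho.shape
# Label rows/columns in that order (data.columns would misalign the labels)
labels = list(X.columns) + ['price']
rho = pd.DataFrame(rho, index=labels, columns=labels)
spearman_corrs = rho.loc['price']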
In the output below, bathrooms and grade have the highest values, indicating that price tends to increase as they increase. A strongly negative value would also indicate importance, just in the opposite direction (as the predictor increases, price decreases), but all of the negative values here are very close to zero.
spearman_corrs.sort_values(ascending=False)
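As a small illustration of why the magnitude matters more than the sign, the sketch below ranks synthetic features by their absolute Spearman correlation with a target. The data and the column names (`up`, `down`, `noise`) are made up for this example and are not part of the housing data.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'up': rng.normal(size=n),      # target increases with this feature
    'down': rng.normal(size=n),    # target decreases with this feature
    'noise': rng.normal(size=n),   # unrelated to the target
})
target = 2 * toy['up'] - 3 * toy['down'] + rng.normal(scale=0.5, size=n)

# Spearman correlation of each feature with the target, one column at a time
toy_corrs = pd.Series({col: spearmanr(toy[col], target)[0] for col in toy.columns})

# Rank by magnitude so strongly negative features are not pushed to the bottom
toy_ranking = toy_corrs.abs().sort_values(ascending=False)
print(toy_ranking)
```

Sorting the signed values, as in the cell above, only works as a ranking when no feature has a strongly negative correlation, which is the case reported for this data set.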
explained_variance_ratio, pca_imps = pca_strategy(X, n_components=4)
It looks like the first component explains most of the variance, so we look at which X-variables make up the first component.
explained_variance_ratio
In the first component, the sqft variables carry the largest weights.
pca_imps
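`pca_strategy` is defined in `featimp.py`, which is not shown here. One plausible reading of it is sketched below: standardize the predictors, fit a PCA, and rank features by the absolute weight they receive in the first component. The name `pca_strategy_sketch`, the standardization step, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_strategy_sketch(X, n_components=4):
    """Return explained variance ratios and |loadings| of the first component."""
    X_scaled = StandardScaler().fit_transform(X)   # assumption: put features on one scale
    pca = PCA(n_components=n_components).fit(X_scaled)
    first_component = pd.Series(np.abs(pca.components_[0]), index=X.columns)
    return pca.explained_variance_ratio_, first_component.sort_values(ascending=False)

# Toy data: 'a' and 'b' share most of their variance, 'c' and 'd' are independent
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    'a': base + rng.normal(scale=0.1, size=300),
    'b': base + rng.normal(scale=0.1, size=300),
    'c': rng.normal(size=300),
    'd': rng.normal(size=300),
})
ratios, loadings = pca_strategy_sketch(toy, n_components=3)
print(ratios)
print(loadings)
```

Note that this ranking never looks at the target: it rewards features that vary together, which is why a group of correlated columns can dominate the first component.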
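We now look at a strategy, minimum redundancy maximum relevance (mRMR), that ranks features not only by their correlation with the target but also penalizes them for sharing information with other predictor variables. In the usual formulation, a feature's score is its association with the target minus its average association with the features already selected. "I" is any function that measures association between two variables. In our case we will use Spearman's rank correlation.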
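In its commonly stated form (an assumption about what `mrmr` in `featimp.py` implements, since that file is not shown), the mRMR score of a candidate feature $x_k$, given the set $S$ of features already selected, is:

```latex
J_{\mathrm{mRMR}}(x_k) = I(x_k, y) - \frac{1}{|S|} \sum_{x_j \in S} I(x_k, x_j)
```

The first term rewards relevance to the target $y$, and the second term subtracts the average redundancy with the features in $S$.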
Jmrmr = mrmr(X,y)
The interesting result here is that sqft_lot is no longer ranked as the most important feature, most likely because it is highly correlated with other features.
Jmrmr
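A minimal sketch of a greedy mRMR ranking is given below, assuming the absolute Spearman correlation as the association measure. The function name `mrmr_sketch` and the toy data are hypothetical, and the real `mrmr` in `featimp.py` may differ in its details.

```python
import numpy as np
import pandas as pd

def mrmr_sketch(X, y):
    """Greedy mRMR ranking using |Spearman| as the association measure."""
    full = pd.concat([X, y.rename('__target__')], axis=1)
    corr = full.corr(method='spearman').abs()
    relevance = corr.loc[X.columns, '__target__']    # association with the target
    redundancy = corr.loc[X.columns, X.columns]      # association between features

    selected, scores = [], {}
    remaining = list(X.columns)
    while remaining:
        if selected:
            # average association with the features chosen so far
            penalty = redundancy.loc[remaining, selected].mean(axis=1)
        else:
            penalty = 0.0
        candidates = relevance[remaining] - penalty
        best = candidates.idxmax()
        scores[best] = candidates[best]
        selected.append(best)
        remaining.remove(best)
    return pd.Series(scores)   # entries are in selection order

# Toy data: 'x1_copy' nearly duplicates 'x1'; 'x2' is weaker but independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
toy_X = pd.DataFrame({
    'x1': x1,
    'x1_copy': x1 + rng.normal(scale=0.05, size=500),
    'x2': rng.normal(size=500),
})
toy_y = pd.Series(2 * x1 + toy_X['x2'].to_numpy() + rng.normal(scale=0.5, size=500))
toy_ranking = mrmr_sketch(toy_X, toy_y)
print(toy_ranking)
```

In this toy case the near-duplicate column falls behind `x2` even though its raw correlation with the target is higher, which is the same effect described above for sqft_lot.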
In these strategies, we use specific machine learning models as a mechanism to evaluate the importance of our features. We examine two strategies: permutation importance and drop-column importance
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=3, n_jobs=-1)
baseline_score, importances = permutation_importance(X, y, rf)
print(f"Our baseline score (with no column permutations) is: {baseline_score:0.4f}")
Below we see the new scores (r-squared in this case) after each variable has been permuted. Variables whose permutation causes the biggest drop are the most important. At the other end, the score for "floors" actually went up slightly after being permuted, which means it is no better than noise as a predictive variable in our model.
importances
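The permutation idea can be sketched in a few lines, assuming a hold-out validation split and a model exposing scikit-learn's `fit` and `score` methods. The function name `permutation_importance_sketch` and the toy data are hypothetical. Unlike the table above, this sketch reports the drop in score rather than the new score, and it is not the implementation in `featimp.py`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def permutation_importance_sketch(X, y, model, random_state=0):
    """Score drop on a validation set after shuffling each column in turn."""
    rng = np.random.default_rng(random_state)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    model.fit(X_train, y_train)              # the model is trained only once
    baseline = model.score(X_val, y_val)     # R^2 for scikit-learn regressors

    drops = {}
    for col in X.columns:
        X_perm = X_val.copy()
        # Shuffling breaks the link between this column and the target
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())
        drops[col] = baseline - model.score(X_perm, y_val)
    return baseline, pd.Series(drops).sort_values(ascending=False)

# Toy data with one strong, one weak, and one irrelevant predictor
rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(400, 3)), columns=['strong', 'weak', 'noise'])
toy_y = 3 * toy_X['strong'] + 0.5 * toy_X['weak'] + rng.normal(scale=0.3, size=400)
toy_baseline, toy_drops = permutation_importance_sketch(toy_X, toy_y, LinearRegression())
print(toy_baseline)
print(toy_drops)
```

scikit-learn also ships `sklearn.inspection.permutation_importance`, which has a different signature from the notebook's helper and repeats the shuffling several times per column.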
baseline_score, importances = drop_column_importance(X, y, rf)
print(f"Our baseline score (with no columns dropped) is: {baseline_score:0.4f}")
Again, we see similar variables at the top of the list compared to the permutation importances. These values are more accurate because we have completely retrained the model without the dropped column. However, this comes at the added expense of retraining the model once per feature.
importances
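A minimal sketch of drop-column importance follows, assuming the same hold-out split as a permutation approach and one retraining per feature. The name `drop_column_importance_sketch` and the toy data are hypothetical and stand in for the helper in `featimp.py`.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def drop_column_importance_sketch(X, y, model, random_state=0):
    """Score drop on a validation set after retraining without each column."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    baseline = clone(model).fit(X_train, y_train).score(X_val, y_val)

    drops = {}
    for col in X.columns:
        # Retrain from scratch on every column except `col`
        reduced = clone(model).fit(X_train.drop(columns=[col]), y_train)
        drops[col] = baseline - reduced.score(X_val.drop(columns=[col]), y_val)
    return baseline, pd.Series(drops).sort_values(ascending=False)

# Toy data: 'area_copy' nearly duplicates 'area'; 'other' is independent
rng = np.random.default_rng(2)
area = rng.normal(size=400)
toy_X = pd.DataFrame({
    'area': area,
    'area_copy': area + rng.normal(scale=0.05, size=400),
    'other': rng.normal(size=400),
})
toy_y = pd.Series(2 * area + toy_X['other'].to_numpy() + rng.normal(scale=0.3, size=400))
toy_baseline, toy_drops = drop_column_importance_sketch(toy_X, toy_y, LinearRegression())
print(toy_drops)
```

The toy data shows a known property of this method: when two columns carry nearly the same information, dropping either one barely hurts the retrained model, so both can look unimportant.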
Now that we have several ways of computing feature importances, we can compare strategies. Our approach will be to keep the top feature from each approach and train a model, computing a cross-validation score for each. Then repeat the process for the top 2, 3, ..., k features in each strategy. The result will allow us to see how cross-validation scores change as we keep more features in each strategy
strategy_results = compare_strategies(X, y, rf, k=8)
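The comparison described above can be sketched for a single ranking as follows. The function name `top_k_cv_scores` and the toy rankings are hypothetical. The column names `Number of Columns` and `CV Score` are taken from the plotting code in this notebook, but the internals of `compare_strategies` are not shown, so this is only an assumption about how it works.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def top_k_cv_scores(X, y, model, ranked_features, k=8, cv=5):
    """Mean cross-validated R^2 using the top 1, 2, ..., k features of one ranking."""
    rows = []
    for n in range(1, min(k, len(ranked_features)) + 1):
        cols = list(ranked_features[:n])
        score = cross_val_score(model, X[cols], y, cv=cv, scoring='r2').mean()
        rows.append({'Number of Columns': n, 'CV Score': score})
    return pd.DataFrame(rows)

# Toy data and two competing rankings of the same three features
rng = np.random.default_rng(3)
toy_X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['strong', 'weak', 'noise'])
toy_y = 3 * toy_X['strong'] + toy_X['weak'] + rng.normal(scale=0.5, size=300)

good_scores = top_k_cv_scores(toy_X, toy_y, LinearRegression(), ['strong', 'weak', 'noise'])
bad_scores = top_k_cv_scores(toy_X, toy_y, LinearRegression(), ['noise', 'weak', 'strong'])
print(good_scores)
print(bad_scores)
```

A good ranking reaches a high score with few features, while every ranking converges to the same score once all features are included. This is why the curves are most informative for small k.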
plt.figure(figsize=(15,6))
# One line per strategy; the legend label is the strategy name
for strategy in ['Drop Column', 'Permute', 'MRMR', 'PCA']:
    plt.plot('Number of Columns', 'CV Score', label=strategy, marker='o',
             data=strategy_results.loc[strategy_results.Strategy == strategy])
plt.legend()
plt.ylabel('Five-fold CV-Score (r-squared)')
plt.xlabel('Top K Features')
plt.show()
From the above plot it seems that Drop-Column and Permute importances do the best job of finding important features. By far the poorest initial ranking is the one from PCA. We can also check how the ranked features perform in a different model, such as Linear Regression.
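# `normalize` was removed from LinearRegression in scikit-learn 1.2;
# for ordinary least squares it does not change the predictions
lr = LinearRegression(fit_intercept=True, copy_X=True)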
lr_strategy_results = compare_strategies(X, y, lr, k=8)
plt.figure(figsize=(15,6))
# One line per strategy; the legend label is the strategy name
for strategy in ['Drop Column', 'Permute', 'MRMR', 'PCA']:
    plt.plot('Number of Columns', 'CV Score', label=strategy, marker='o',
             data=lr_strategy_results.loc[lr_strategy_results.Strategy == strategy])
plt.legend()
plt.ylabel('Five-fold CV-Score (r-squared)')
plt.xlabel('Top K Features')
plt.show()
Interestingly, for linear regression Permute importance does better than the other methods from the start. Drop Column importance and MRMR are also closer to each other. But as before, PCA performs the worst.
Given that we now have several strategies for ranking features, we can apply them towards automatically selecting the best set of features for a given model.
remaining_features, automatic_results = automatic_feature_selection(X, y, rf, permutation_importance)
Here we automatically selected features for a Random Forest model using permutation importance as our ranking. We can see that only one feature (bedrooms) was dropped before the CV score failed to increase. This is perhaps expected, as Random Forests are not as prone to overfitting, even in the presence of unimportant features. Next we will try the same procedure with a Linear Regression model.
automatic_results
plt.figure(figsize=(10,5))
plt.plot('Feature Dropped', 'CV Score', data=automatic_results)
plt.ylabel('Five-fold CV-Score (r-squared)')
plt.xlabel('Feature Dropped')
plt.title('Automatic Feature Selection Process for Random Forest')
plt.xticks(rotation=45)
plt.show()
print(f"Remaining features: {remaining_features.str.cat(sep=', ')}")
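The selection loop used above can be sketched as backward elimination: rank the current features, drop the least important one, and stop as soon as the cross-validated score fails to improve. The names `automatic_selection_sketch` and `abs_spearman_rank`, the stopping rule, and the toy data are assumptions for illustration. The column names `Feature Dropped` and `CV Score` mirror the plotting code in this notebook, but `automatic_feature_selection` itself may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def abs_spearman_rank(X, y, model):
    """Stand-in ranking function: |Spearman| with the target (ignores the model)."""
    return X.corrwith(y, method='spearman').abs()

def automatic_selection_sketch(X, y, model, rank_fn, cv=5):
    """Drop the least important feature until the CV score stops improving."""
    features = list(X.columns)
    best = cross_val_score(model, X[features], y, cv=cv, scoring='r2').mean()
    history = [{'Feature Dropped': 'None', 'CV Score': best}]
    while len(features) > 1:
        importances = rank_fn(X[features], y, model)   # Series: feature -> importance
        candidate = importances.idxmin()
        trial = [f for f in features if f != candidate]
        score = cross_val_score(model, X[trial], y, cv=cv, scoring='r2').mean()
        if score <= best:
            break          # dropping did not help: keep the feature and stop
        features, best = trial, score
        history.append({'Feature Dropped': candidate, 'CV Score': score})
    return features, pd.DataFrame(history)

# Toy data with two useful predictors and two irrelevant ones
rng = np.random.default_rng(4)
toy_X = pd.DataFrame(rng.normal(size=(300, 4)),
                     columns=['strong', 'weak', 'noise1', 'noise2'])
toy_y = 3 * toy_X['strong'] + toy_X['weak'] + rng.normal(scale=0.5, size=300)
toy_remaining, toy_history = automatic_selection_sketch(
    toy_X, toy_y, LinearRegression(), abs_spearman_rank)
print(toy_remaining)
print(toy_history)
```

Because the loop stops at the first drop that does not raise the score, it tends to remove only a few features, which matches the behaviour reported for both models in this notebook.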
lr_remaining_features, lr_automatic_results = automatic_feature_selection(X, y, lr, permutation_importance)
lr_automatic_results
plt.figure(figsize=(10,5))
plt.plot('Feature Dropped', 'CV Score', data=lr_automatic_results)
plt.ylabel('Five-fold CV-Score (r-squared)')
plt.xlabel('Feature Dropped')
plt.title('Automatic Feature Selection Process for Linear Regression Model')
plt.xticks(rotation=45)
plt.show()
Again, we drop very few features (only bedrooms and floors) before the CV score fails to increase.
print(f"Remaining features: {lr_remaining_features.str.cat(sep=', ')}")
By bootstrapping we are able to get several measurements of feature importance for each variable, and thereby also get an approximation of the standard deviation of those measurements
averaged_importances = bootstrapped_importances(X, y, rf, permutation_importance, iterations=20)
averaged_importances['change mean'] = np.abs(averaged_importances['change mean'])
averaged_importances = averaged_importances.sort_values('change mean', ascending=False).reset_index(drop=True)
averaged_importances
plt.figure(figsize=(15,7))
plt.errorbar('variable', 'change mean', yerr='change std', data=averaged_importances,
marker='o', linestyle='', capsize=5)
plt.xticks(rotation=45)
plt.ylabel('Permutation Importance')
plt.xlabel('Feature')
plt.title('Approximate Standard Errors on Feature Importances (via bootstrapping)')
plt.show()
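The bootstrapping step can be sketched as follows: resample the rows with replacement, recompute the importances on each resample, and summarize them by their mean and standard deviation. The function names and the simplified `permutation_drop` helper (which scores on the training data to keep the example short) are hypothetical. The output columns `variable`, `change mean`, and `change std` follow the names used in the cells above, but the real `bootstrapped_importances` is not shown and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def permutation_drop(X, y, model):
    """Simplified importance: R^2 drop on the training data after shuffling a column."""
    rng = np.random.default_rng(0)   # fixed seed so each call is reproducible
    fitted = clone(model).fit(X, y)
    baseline = fitted.score(X, y)
    return pd.Series({
        col: baseline - fitted.score(
            X.assign(**{col: rng.permutation(X[col].to_numpy())}), y)
        for col in X.columns
    })

def bootstrapped_importances_sketch(X, y, model, importance_fn,
                                    iterations=20, random_state=0):
    """Mean and standard deviation of importances over bootstrap resamples."""
    rng = np.random.default_rng(random_state)
    runs = []
    for _ in range(iterations):
        idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
        runs.append(importance_fn(X.iloc[idx], y.iloc[idx], model))
    runs = pd.DataFrame(runs)                        # one row per bootstrap sample
    summary = pd.DataFrame({'change mean': runs.mean(), 'change std': runs.std()})
    return summary.rename_axis('variable').reset_index()

# Toy data with one strong, one weak, and one irrelevant predictor
rng = np.random.default_rng(5)
toy_X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['strong', 'weak', 'noise'])
toy_y = 3 * toy_X['strong'] + 0.5 * toy_X['weak'] + rng.normal(scale=0.3, size=300)
toy_boot = bootstrapped_importances_sketch(
    toy_X, toy_y, LinearRegression(), permutation_drop, iterations=10)
print(toy_boot)
```

Features whose error bars overlap zero in such a summary cannot be distinguished from noise, which is the main reason to look at the spread and not only at the mean.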