Fitting Linear Models with Custom Loss Functions and Regularization in Python
As part of a predictive model competition I participated in earlier this month, I found myself trying to accomplish a peculiar task. The challenge organizers were going to use āmean absolute percentage errorā (MAPE) as their criterion for model evaluation. Since this is not a standard loss function built into most software, I decided to write my own code to train a model that would use the MAPE in its objective function.
I started by searching through the SciKit-Learn documentation on linear models to see if the model I needed has already been developed somewhere. I thought that the sklearn.linear_model.RidgeCV
class would accomplish what I wanted (MAPE minimization with L2 regularization), but I could not get the scoring
argument (which supposedly lets you pass a custom loss function to the model class) to behave as I expected it to.
While I highly recommend searching through existing packages to see if the model you want already exists, you should (in theory) be able to use this notebook as a template for a building linear models with an arbitrary loss function and regularization scheme.
Python Code
Iāll be using a Jupyter Notebook (running Python 3) to build my model. If youāre reading this on my website, you can find the raw .ipynb file linked here; you can also run a fully-exectuable version of the notebook on Binder by clicking here
Weāll start with some basic imports:
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
Load Your Data
For the purposes of this walkthrough, Iāll need to generate some raw data. Presumably, if youāve found yourself here, you will want to substitute this step with one where you load your own data.
I am simulating a scenario where I have 100 observations on 10 features (9 features and an intercept). The ātrueā function will simply be a linear function of these features: $y=X\beta$. However, we want to simulate observing these data with noise. Because Iām mostly going to be focusing on the MAPE loss function, I want my noise to be on an exponential scale, which is why I am taking exponents/logs below:
# Generate predictors
X_raw = np.random.random(100*9)
X_raw = np.reshape(X_raw, (100, 9))
# Standardize the predictors
scaler = StandardScaler().fit(X_raw)
X = scaler.transform(X_raw)
# Add an intercept column to the model.
X = np.abs(np.concatenate((np.ones((X.shape[0],1)), X), axis=1))
# Define my "true" beta coefficients
beta = np.array([2,6,7,3,5,7,1,2,2,8])
# Y = Xb
Y_true = np.matmul(X,beta)
# Observed data with noise
Y = Y_true*np.exp(np.random.normal(loc=0.0, scale=0.2, size=100))
Define your custom loss function
I am mainly going to focus on the MAPE loss function in this notebook, but this is where you would substitute in your own loss function (if applicable). MAPE is defined as follows:
Mean Absolute Percentage Error (MAPE)
While I wonāt go to into too much detail here, I ended up using a weighted MAPE criteria to fit the model I used in the data science competition. Given a set of sample weights $w_i$, you can define the weighted MAPE loss function using the following formula:
Weighted MAPE
In Python, the MAPE can be calculated with the function below:
def mean_absolute_percentage_error(y_pred, y_true, sample_weights=None):
y_true = np.array(y_true)
y_pred = np.array(y_pred)
assert len(y_true) == len(y_pred)
if np.any(y_true==0):
print("Found zeroes in y_true. MAPE undefined. Removing from set...")
idx = np.where(y_true==0)
y_true = np.delete(y_true, idx)
y_pred = np.delete(y_pred, idx)
if type(sample_weights) != type(None):
sample_weights = np.array(sample_weights)
sample_weights = np.delete(sample_weights, idx)
if type(sample_weights) == type(None):
return(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)
else:
sample_weights = np.array(sample_weights)
assert len(sample_weights) == len(y_true)
return(100/sum(sample_weights)*np.dot(
sample_weights, (np.abs((y_true - y_pred) / y_true))
))
loss_function = mean_absolute_percentage_error
Fitting a simple linear model with custom loss function
You may know that the traditional method for fitting linear models, ordinary least squares, has a nice analytic solution. This means that the āoptimalā model parameters that minimize the squared error of the model, can be calculated directly from the input data:
However, with an arbitrary loss function, there is no guarantee that finding the optimal parameters can be done so easily. To keep this notebook as generalizable as possible, Iām going to be minimizing our custom loss functions using numerical optimization techniques (similar to the āsolverā functionality in Excel). In general terms, the $\beta$ we want to fit can be found as the solution to the following equation (where Iāve subsituted in the MAPE for the error function in the last line):
Essentially we want to search over the space of all $\beta$ values and find the value that minimizes our chosen error function. To get a flavor for what this looks like in Python, Iāll fit a simple MAPE model below, using the minimize
function from SciPy.
from scipy.optimize import minimize
def objective_function(beta, X, Y):
error = loss_function(np.matmul(X,beta), Y)
return(error)
# You must provide a starting point at which to initialize
# the parameter search space
beta_init = np.array([1]*X.shape[1])
result = minimize(objective_function, beta_init, args=(X,Y),
method='BFGS', options={'maxiter': 500})
# The optimal values for the input parameters are stored
# in result.x
beta_hat = result.x
print(beta_hat)
[ 4.92186402 3.90670026 9.47581284 4.24438793 3.61750887 7.94615776
-0.07480217 1.57636552 0.88414542 6.0017085 ]
We can compare the esimated betas to the true model betas that we initialized at the beginning of this notebook:
pd.DataFrame({
"true_beta": beta,
"estimated_beta": beta_hat,
"error": beta-beta_hat
})[["true_beta", "estimated_beta", "error"]]
true_beta | estimated_beta | error | |
---|---|---|---|
0 | 2 | 4.921864 | -2.921864 |
1 | 6 | 3.906700 | 2.093300 |
2 | 7 | 9.475813 | -2.475813 |
3 | 3 | 4.244388 | -1.244388 |
4 | 5 | 3.617509 | 1.382491 |
5 | 7 | 7.946158 | -0.946158 |
6 | 1 | -0.074802 | 1.074802 |
7 | 2 | 1.576366 | 0.423634 |
8 | 2 | 0.884145 | 1.115855 |
9 | 8 | 6.001709 | 1.998291 |
Itās obviously not perfect, but we can see that our estimated values are at least in the ballpark from the true values. We can also calculate the final MAPE of our estimated model using our loss function:
loss_function(np.matmul(X,beta_hat), Y)
14.990329757449365
Incorporating Regularization into Model Fitting
The process described above fits a simple linear model to the data provided by directly minimizing the a custom loss function (MAPE, in this case). However, in many machine learning problems, you will want to regularize your model parameters to prevent overfitting. In this notebook, Iām going to walk through the process of incorporating L2 regularization, which amounts to penalizing your modelās parameters by the square of their magnitude.
In precise terms, rather than minimizing our loss function directly, we will augment our loss function by adding a squared penalty term on our modelās coefficients. With L2 regularization, our new loss function becomes:
Or, in the case that sample weights are provided:
For now, we will assume that the $\lambda$ coefficient (the regularization parameter) is already known. However, later we will use cross validation to find the optimal $\lambda$ value for our data.
Since our model is getting a little more complicated, Iām going to define a Python class with a very similar attribute and method scheme as those found in SciKit-Learn (e.g., sklearn.linear_model.Lasso
or sklearn.ensemble.RandomForestRegressor
).
class CustomLinearModel:
"""
Linear model: Y = XB, fit by minimizing the provided loss_function
with L2 regularization
"""
def __init__(self, loss_function=mean_absolute_percentage_error,
X=None, Y=None, sample_weights=None, beta_init=None,
regularization=0.00012):
self.regularization = regularization
self.beta = None
self.loss_function = loss_function
self.sample_weights = sample_weights
self.beta_init = beta_init
self.X = X
self.Y = Y
def predict(self, X):
prediction = np.matmul(X, self.beta)
return(prediction)
def model_error(self):
error = self.loss_function(
self.predict(self.X), self.Y, sample_weights=self.sample_weights
)
return(error)
def l2_regularized_loss(self, beta):
self.beta = beta
return(self.model_error() + \
sum(self.regularization*np.array(self.beta)**2))
def fit(self, maxiter=250):
# Initialize beta estimates (you may need to normalize
# your data and choose smarter initialization values
# depending on the shape of your loss function)
if type(self.beta_init)==type(None):
# set beta_init = 1 for every feature
self.beta_init = np.array([1]*self.X.shape[1])
else:
# Use provided initial values
pass
if self.beta!=None and all(self.beta_init == self.beta):
print("Model already fit once; continuing fit with more itrations.")
res = minimize(self.l2_regularized_loss, self.beta_init,
method='BFGS', options={'maxiter': 500})
self.beta = res.x
self.beta_init = self.beta
l2_mape_model = CustomLinearModel(
loss_function=mean_absolute_percentage_error,
X=X, Y=Y, regularization=0.00012
)
l2_mape_model.fit()
l2_mape_model.beta
array([ 5.03662025, 3.87655611, 9.4728106 , 4.21411826, 3.59855275,
7.92518638, -0.13274225, 1.6209291 , 0.91295936, 5.94200503])
Just to confirm that our regularization did work, letās make sure that the estimated betas found with regularization are different from those found without regularization (which we calculated earlier):
pd.DataFrame({
"true_beta": beta,
"estimated_beta": beta_hat,
"regularized_beta": l2_mape_model.beta
})[["true_beta", "estimated_beta", "regularized_beta"]]
true_beta | estimated_beta | regularized_beta | |
---|---|---|---|
0 | 2 | 4.921864 | 5.036620 |
1 | 6 | 3.906700 | 3.876556 |
2 | 7 | 9.475813 | 9.472811 |
3 | 3 | 4.244388 | 4.214118 |
4 | 5 | 3.617509 | 3.598553 |
5 | 7 | 7.946158 | 7.925186 |
6 | 1 | -0.074802 | -0.132742 |
7 | 2 | 1.576366 | 1.620929 |
8 | 2 | 0.884145 | 0.912959 |
9 | 8 | 6.001709 | 5.942005 |
Since our regularization parameter is so small, we can see that it didnāt affect our coefficient estimates dramatically. But the fact that the betas are different between the two models indicates that our regularization does seem to be working.
Just to make sure things are in the realm of common sense, itās never a bad idea to plot your predicted Y against our observed Y.
# Predicted Y vs. observed Y
plt.scatter(l2_mape_model.predict(X), Y)
Important Caveat: Standardize Your Predictors
In most applications, your features will be measured on many different scales; however youāll notice in the loss function described above, each $\beta_k$ parameter is being penalized by the same amount ($\lambda$). Best practice when using L2 regularization is to standardize your feature matrix (subtract the mean off of each column and divide the result by the column standard deviation). This will ensure that all features are on approximately the same scale and that the regularization parameter has an equal impact on all $\beta_k$ coefficients. I standardized my data at the very beginning of this notebook, but typically you will need to work standardization into your data pipeline. Use
sklearn.preprocessing.StandardScaler
and keep track of your intercept when going through this process!
Cross Validation to Identify Optimal Regularization Parameter
At this point, we have a model class that will find the optimal beta coefficients to minimize the loss function described above with a given regularization parameter. Of course, your regularization parameter $\lambda$ will not typically fall from the sky. Below Iāve included some code that uses cross validation to find the optimal $\lambda$, among the set of candidates provided by the user.
from sklearn.model_selection import KFold
# Used to cross-validate models and identify optimal lambda
class CustomCrossValidator:
"""
Cross validates arbitrary model using MAPE criterion on
list of lambdas.
"""
def __init__(self, X, Y, ModelClass,
sample_weights=None,
loss_function=mean_absolute_percentage_error):
self.X = X
self.Y = Y
self.ModelClass = ModelClass
self.loss_function = loss_function
self.sample_weights = sample_weights
def cross_validate(self, lambdas, num_folds=10):
"""
lambdas: set of regularization parameters to try
num_folds: number of folds to cross-validate against
"""
self.lambdas = lambdas
self.cv_scores = []
X = self.X
Y = self.Y
# Beta values are not likely to differ dramatically
# between differnt folds. Keeping track of the estimated
# beta coefficients and passing them as starting values
# to the .fit() operator on our model class can significantly
# lower the time it takes for the minimize() function to run
beta_init = None
for lam in self.lambdas:
print("Lambda: {}".format(lam))
# Split data into training/holdout sets
kf = KFold(n_splits=num_folds, shuffle=True)
kf.get_n_splits(X)
# Keep track of the error for each holdout fold
k_fold_scores = []
# Iterate over folds, using k-1 folds for training
# and the k-th fold for validation
f = 1
for train_index, test_index in kf.split(X):
# Training data
CV_X = X[train_index,:]
CV_Y = Y[train_index]
CV_weights = None
if type(self.sample_weights) != type(None):
CV_weights = self.sample_weights[train_index]
# Holdout data
holdout_X = X[test_index,:]
holdout_Y = Y[test_index]
holdout_weights = None
if type(self.sample_weights) != type(None):
holdout_weights = self.sample_weights[test_index]
# Fit model to training sample
lambda_fold_model = self.ModelClass(
regularization=lam,
X=CV_X,
Y=CV_Y,
sample_weights=CV_weights,
beta_init=beta_init,
loss_function=self.loss_function
)
lambda_fold_model.fit()
# Extract beta values to pass as beta_init
# to speed up estimation of the next fold
beta_init = lambda_fold_model.beta
# Calculate holdout error
fold_preds = lambda_fold_model.predict(holdout_X)
fold_mape = mean_absolute_percentage_error(
holdout_Y, fold_preds, sample_weights=holdout_weights
)
k_fold_scores.append(fold_mape)
print("Fold: {}. Error: {}".format( f, fold_mape))
f += 1
# Error associated with each lambda is the average
# of the errors across the k folds
lambda_scores = np.mean(k_fold_scores)
print("LAMBDA AVERAGE: {}".format(lambda_scores))
self.cv_scores.append(lambda_scores)
# Optimal lambda is that which minimizes the cross-validation error
self.lambda_star_index = np.argmin(self.cv_scores)
self.lambda_star = self.lambdas[self.lambda_star_index]
print("\n\n**OPTIMAL LAMBDA: {}**".format(self.lambda_star))
# User must specify lambdas over which to search
lambdas = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
cross_validator = CustomCrossValidator(
X, Y, CustomLinearModel,
loss_function=mean_absolute_percentage_error
)
cross_validator.cross_validate(lambdas, num_folds=5)
Lambda: 1
Fold: 1. Error: 37.70736919931204
Fold: 2. Error: 36.59539972822217
Fold: 3. Error: 30.159199395859716
Fold: 4. Error: 28.026823039555516
Fold: 5. Error: 36.5106662623846
LAMBDA AVERAGE: 33.799891525066805
Lambda: 0.1
Fold: 1. Error: 16.742330362234128
Fold: 2. Error: 17.202175960459936
Fold: 3. Error: 14.681801319685855
Fold: 4. Error: 15.2888782775815
Fold: 5. Error: 17.454430901848337
LAMBDA AVERAGE: 16.273923364361952
Lambda: 0.01
Fold: 1. Error: 11.815521826879978
Fold: 2. Error: 11.299482024213498
Fold: 3. Error: 16.68009174283627
Fold: 4. Error: 16.521312983555607
Fold: 5. Error: 17.904255723979507
LAMBDA AVERAGE: 14.84413286029297
Lambda: 0.001
Fold: 1. Error: 15.088606483635598
Fold: 2. Error: 14.080553661118223
Fold: 3. Error: 14.469228192509826
Fold: 4. Error: 17.232351982956583
Fold: 5. Error: 24.07170014470806
LAMBDA AVERAGE: 16.98848809298566
Lambda: 0.0001
Fold: 1. Error: 17.175599679212027
Fold: 2. Error: 26.85279420590986
Fold: 3. Error: 14.327939942926607
Fold: 4. Error: 21.686903871586818
Fold: 5. Error: 12.325073637345838
LAMBDA AVERAGE: 18.47366226739623
Lambda: 1e-05
Fold: 1. Error: 21.330405966237638
Fold: 2. Error: 15.491993091074216
Fold: 3. Error: 15.471548321991305
Fold: 4. Error: 17.324376705345156
Fold: 5. Error: 15.504657241945976
LAMBDA AVERAGE: 17.02459626531886
Lambda: 1e-06
Fold: 1. Error: 16.61058271143524
Fold: 2. Error: 13.945473664600302
Fold: 3. Error: 15.833777870872437
Fold: 4. Error: 22.501321885364305
Fold: 5. Error: 17.439738169583936
LAMBDA AVERAGE: 17.266178860371244
**OPTIMAL LAMBDA: 0.01**
After identifying the optimal $\lambda$ for your model/dataset, you will want to fit your final model using this value on the entire training dataset.
lambda_star = cross_validator.lambda_star
final_model = CustomLinearModel(
loss_function=mean_absolute_percentage_error,
X=X, Y=Y, regularization=lambda_star
)
final_model.fit()
final_model.beta
array([5.5651638 , 3.97915803, 8.9876656 , 3.72472547, 3.62855902,
7.54186139, 0.24154393, 1.777476 , 1.28517917, 5.69645692])
You can then generate out-of-sample predictions using this final, fully optimized model.
test_data = np.random.random((10,10))
final_model.predict(test_data)
array([29.21630071, 26.66963116, 23.91469503, 18.04085737, 19.92362317,
17.00326261, 16.78961617, 15.1117658 , 14.01058483, 18.5042368 ])
Footnotes:
- Thanks to Shan Gao from Tanius Tech for noticing an error in the definition of my MAPE function.
Alex Miller is an scientist, educator, and developer. Feel free to connect with him on LinkedIn.