And a tutorial on NLS Regression in Python and SciPy
Nonlinear Least Squares (NLS) is an optimization technique that can be used to build regression models for data sets in which the dependent variable has a nonlinear relationship with the regression variables. Models for such data sets are nonlinear in their coefficients.
PART 1: The concepts and theory underlying the NLS regression model. This section has some math in it. You will enjoy it if you like math and/or are curious about how Nonlinear Least Squares Regression works.
PART 2: Tutorial on how to build and train an NLS regression model using Python and SciPy. You do not need to read PART 1 to understand PART 2.
PART 1: The theory behind NLS regression
We’ll follow these representational conventions:
The ‘cap’ suffix (standing in for the ‘hat’ symbol ^) will be used for values that are generated by the process of fitting the regression model on data. For example, β_cap is the vector of fitted coefficients.
y_obs is the vector of observed values of the dependent variable y.
Variables in plain style are scalars and those in bold style represent their vector or matrix equivalents. For example, y_obs_i is a scalar containing the ith observed value of the y_obs vector, which is of size (m x 1).
We will assume that the regression matrix X is of size (m x n), i.e. it has m data rows and each row contains n regression variables. The y matrix is of size (m x 1) and the coefficients matrix β is of size (n x 1) (or (1 x n) in its transpose form).
Now let’s look at three examples of the sorts of nonlinear models which can be trained using NLS.
In the following model, the regression coefficients β_1 and β_2 are raised to the powers of two and three respectively, which makes the model nonlinear in its coefficients.
y_obs = β_0 + (β_1)²*x_1 + (β_2)³*x_2 + e
‘e’ is the residual error of the model, namely the difference between the observed y and the predicted value (the combination of β_0, β_1, β_2 and x on the R.H.S.)
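As a quick illustration, here is how this model's predicted value could be computed in numpy; the data and coefficient values below are made-up:

import numpy as np

# Made-up regression variables and coefficients
x_1 = np.array([1.0, 2.0, 3.0])
x_2 = np.array([0.5, 1.5, 2.5])
beta_0, beta_1, beta_2 = 0.3, 1.2, 0.8

# Nonlinear in the coefficients: β_1 is squared and β_2 is cubed
y_pred = beta_0 + (beta_1 ** 2) * x_1 + (beta_2 ** 3) * x_2
print(y_pred)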
The following model is an auto-regressive time series model containing coefficients β_1 and β_2 in a multiplicative relationship, and therefore nonlinear in nature.
y_t = β_0 + β_1*β_2*y_(t−1) + e_t
In the following model, the predicted value is an exponential function of a linear combination of regression variables X.
y_obs_i = e^(x_i*β) + e_i
This last formulation is usually used in a Poisson regression model or its derivatives such as the Generalized Poisson or the Negative Binomial regression model. Specifically, the fitted mean µ_cap is expressed as the conditional mean of a Poisson probability distribution as follows:
µ_cap_i = E(y_i | x_i) = e^(x_i*β_cap)
Such a Poisson regression model is used for fitting counts-based data sets, such as the number of people renting a bike in a bike sharing program on each day.
How does NLS optimization work?
In NLS, our goal is to find the model parameters vector β_cap which would minimize the sum of squares of the residual errors, known as the Residual Sum of Squares (RSS). In other words, we have to minimize the following quantity:
RSS = Σ_(i=1 to m) (y_obs_i − µ_cap_i)²
µ_cap_i (the model’s prediction for the ith row in the data set) is a function of the model parameters vector β_cap and the regression variables x_i, i.e.:
µ_cap_i = f(β_cap, x_i)
Replacing µ_cap_i with f(β_cap, x_i) in the earlier equation for RSS, we have:
RSS = Σ_(i=1 to m) (y_obs_i − f(β_cap, x_i))²
Once again, β_cap is the vector of fitted coefficients and x_i is the ith row of the regression variable matrix X.
One way to minimize RSS is to differentiate RSS with respect to β_cap, set the derivative to zero and solve for β_cap, i.e.:
∂RSS/∂β_cap = 0
Since β_cap is a vector of length n corresponding to the n regression variables x_1, x_2, …, x_n, RSS needs to be partially differentiated w.r.t. each one of the β_cap_j coefficients and each equation set to zero. For example, the partial differentiation of RSS w.r.t. β_cap_1 goes as follows:
∂RSS/∂β_cap_1 = −2*Σ_(i=1 to m) [ (y_obs_i − f(β_cap, x_i)) * ∂f(β_cap, x_i)/∂β_cap_1 ] = 0
Since there are n coefficients β_cap_1 to β_cap_n, we get n such equations in n variables. However, unlike Ordinary Least Squares (OLS) estimation, there is in general no closed-form solution for this system of n equations. So we have to use an iterative optimization technique in which, at each iteration k, we make small adjustments to the values of β_cap_1 to β_cap_n as shown below, and reevaluate RSS:
β_cap_j(k+1) = β_cap_j(k) + Δβ_cap_j(k), for j = 1 to n
Several algorithms have been devised to efficiently update the β_cap vector until an optimal set of values is reached that would minimize RSS. Chief among these are Trust Region based methods such as the Trust Region Reflective algorithm, the Levenberg–Marquardt algorithm and the imaginatively named Dogbox algorithm. SciPy has support for all three algorithms.
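All three algorithms can be selected via the method parameter of SciPy’s least_squares() function. Here is a minimal sketch on a made-up exponential model (the data and residual function below are illustrative, not the model we’ll build in PART 2):

from scipy.optimize import least_squares
import numpy as np

# Residuals of a toy model y = exp(b0 + b1*x) (illustrative)
def residuals(beta, x, y):
    return np.exp(beta[0] + beta[1] * x) - y

x = np.linspace(0.0, 1.0, 50)
y = np.exp(0.5 + 2.0 * x)  # noise-free toy data
beta_start = np.array([0.0, 1.0])

# Any of the three algorithms can be chosen via the 'method' argument
for method in ['trf', 'lm', 'dogbox']:
    result = least_squares(fun=residuals, x0=beta_start, args=(x, y), method=method)
    print(method, result.x)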
Let’s return to the exponentiated mean model we introduced earlier. In this model, we have:
µ_cap_i = f(β_cap, x_i) = e^(x_i*β_cap)
Substituting in RSS:
RSS = Σ_(i=1 to m) (y_obs_i − e^(x_i*β_cap))²
Notice that the x_i*β_cap in the exponent is a matrix multiplication of two matrices of dimensions [1 x n] and [n x 1], and therefore the result is a [1 x 1] matrix, i.e. effectively a scalar.
Differentiating the above equation w.r.t. β_cap and setting the derivative to zero, we get the following set of equations (expressed in vector format) which need to be solved using one of the iterative optimization algorithms mentioned above:
∂RSS/∂β_cap = −2*Σ_(i=1 to m) [ (y_obs_i − e^(x_i*β_cap)) * e^(x_i*β_cap) * x_i ] = 0
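To make this vector equation concrete, here is a small numpy sketch that evaluates its left hand side for a candidate β_cap; the X and y_obs arrays below are made-up stand-ins:

import numpy as np

# Made-up stand-ins: 5 rows, 2 regression variables (incl. the intercept column)
X = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 0.6], [1.0, 0.8], [1.0, 1.0]])
y_obs = np.array([1.5, 1.9, 2.4, 3.1, 3.9])
beta_cap = np.array([0.3, 1.1])

mu_cap = np.exp(X @ beta_cap)  # e^(x_i*β_cap) for every row
gradient = -2.0 * (X.T @ ((y_obs - mu_cap) * mu_cap))  # L.H.S. of the equation above

# At the optimum, every component of 'gradient' would be (approximately) zero
print(gradient)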
PART 2: Tutorial on NLS Regression using Python and SciPy
Let’s use the Nonlinear Least Squares technique to fit a Poisson regression model to a data set of daily usage of rental bicycles spanning two years.
The first 10 rows of the data set are as below:
[Table: the first 10 rows of the bike sharing data set]
You can download the data set from here.
The NLS regression model
We’ll build a regression model in which the dependent variable (y) is:
total_user_count: count of total bicycle renters
The regression variables matrix X will contain the following explanatory variables:
- season: the prevailing weather season
- yr: the prevailing year: 0=2011, 1=2012
- mnth: the prevailing month: 1 thru 12
- holiday: Whether the measurement was taken on a holiday (yes=1, no=0)
- weekday: day of the week (0 thru 6)
- workingday: Whether the measurement was taken on a working day (yes=1, no=0)
- weathersit: The weather situation on the day:
1=Clear, Few clouds, Partly cloudy, Partly cloudy.
2=Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist.
3=Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds.
4=Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog.
- temp: Temperature, normalized to 39C
- atemp: Real feel, normalized to 50C
- hum: Humidity, normalized to 100
- windspeed: Wind speed, normalized to 67
Let’s import all the required packages.
from scipy.optimize import least_squares
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.stattools as st
import matplotlib.pyplot as plt
Read the data set into a Pandas DataFrame:
df = pd.read_csv('bike_sharing_dataset_daywise.csv', header=0, parse_dates=['dteday'], infer_datetime_format=True)
Create the training and test data sets. Since we are treating this data as cross-sectional data, we will randomly select 90% of the data rows as our training data and the remaining 10% as our test data:
mask = np.random.rand(len(df)) < 0.9
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
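Note that the mask is random, so the exact split sizes will vary from run to run. If you want a reproducible split, you can seed numpy’s random number generator before creating the mask, e.g.:

np.random.seed(42)  # any fixed seed makes the 90/10 split repeatable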
Create the regression expression in Patsy syntax. We are saying that total_user_count is the dependent variable and it depends on all the variables mentioned on the right side of the tilde (~) symbol:
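expr = 'total_user_count ~ season + yr + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed'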
Use Patsy to carve out the y and X matrices:
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Let’s define a couple of functions. The first function we will define is one that will calculate the exponentiated mean: mu_cap = exp(X*β_cap):
def calc_exponentiated_mean(beta, x):
    # Compute the linear combination x*β of the regression variables
    lin_combi = np.matmul(np.array(x), np.array(beta))
    # Exponentiate it to get the predicted mean mu_cap = exp(x*β)
    mean = np.exp(lin_combi)
    return mean
The second function we will define is one that will calculate the plain residual r_i = (y_predicted − y_observed), given the input matrices β, X and y_obs:
def calc_residual(beta, x, y_obs):
    # Predicted mean mu_cap for each row of x
    y_pred = calc_exponentiated_mean(beta, x)
    # Residual = predicted value minus observed value
    r = np.subtract(y_pred, np.array(y_obs).flatten())
    return r
Initialize the coefficients β vector to all 1.0 values. There are 11 regression variables in X (count them to verify!) plus the regression intercept, i.e. 12 parameters in all. So β is of size (1 x 12). The numpy vector we will construct will be of shape (12,), which suits us as we will have to multiply X_train with this vector, and X_train is of shape (661, 12):
num_params = len(X_train.columns)
beta_initial = np.ones(num_params)
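As a quick sanity check, you can evaluate the residuals at this starting point; the result should contain one residual per training row:

r_initial = calc_residual(beta_initial, X_train, y_train)
print(r_initial.shape)  # should equal (len(X_train),)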
Finally, it is time to use the least_squares() method in SciPy to train the NLS regression model on (y_train, X_train) as follows:
result_nls_lm = least_squares(fun=calc_residual, x0=beta_initial, args=(X_train, y_train), method='lm', verbose=1)
Notice that we are using the Levenberg–Marquardt algorithm (method=’lm’) to perform the iterative optimization of the β vector.
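Note that SciPy’s ‘lm’ method does not support bounds on the parameters. If you ever need to constrain some coefficients, you could switch to the Trust Region Reflective algorithm instead, e.g.:

# Same fit using the Trust Region Reflective algorithm (which supports bounds)
result_nls_trf = least_squares(fun=calc_residual, x0=beta_initial, args=(X_train, y_train), method='trf', verbose=1)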
The result_nls_lm.x variable contains the fitted β vector, in other words, β_cap.
Let’s organize β_cap into a Pandas DataFrame and print out the values:
df_beta_cap = pd.DataFrame(data=result_nls_lm.x.reshape(1,num_params), columns=X_train.columns)
print(df_beta_cap)
We see the following output showing the fitted coefficient value for each regression variable and the fitted regression intercept:
[Table: the fitted coefficient value for each regression variable and the fitted regression intercept]
Prediction
Let’s see how our model did on the test data set X_test that we had carved out earlier. We’ll use the calc_exponentiated_mean() function defined earlier, but this time we’ll pass in the fitted β_cap vector and X_test:
predicted_counts = calc_exponentiated_mean(beta=result_nls_lm.x, x=X_test)
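If you want a single summary number for the quality of these predictions, one simple option is the root mean squared error (RMSE) on the test set:

rmse = np.sqrt(np.mean(np.square(np.array(predicted_counts) - np.array(y_test['total_user_count']))))
print('Test RMSE=' + str(rmse))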
Let’s plot the predicted versus the actual counts:
actual_counts = y_test['total_user_count']
fig = plt.figure()
fig.suptitle('Predicted versus actual user counts')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()
[Figure: predicted versus actual user counts on the test data set]
Citations and Copyrights
Data set
The Bike sharing data set has been downloaded from UCI Machine Learning Repository. Curated data set download link.
Data set cited in paper: Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1–15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
Book
Cameron, A. Colin and Trivedi, Pravin K., Regression Analysis of Count Data, Econometric Society Monograph No. 30, Cambridge University Press, 1998. ISBN: 0521635675
Images
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.