Linear Regression, Regularization and Relationships in Machine Learning with Python

This is the third write-up in a series of articles following my Learning Curve and What it Takes to Learn how to Code A.I./Machine Learning Algorithms. The first post was all about exploring how to even get started - what do I even need? - the blind state of Unconscious Incompetence. In the second post I learned more of the basic concepts and tested out the KNN Classifier.

I'm now 6 hours deep in time invested reading documentation and tutorials and practicing in my local environment, and I would say that so far everything has been relatively easy. Easy only because the available resources are excellent; learning things of this complexity can be challenging if you don't have a good mentor or source of knowledge to tap. Thankfully, the material out there is a great source of knowledge - you just have to put in the time to read and practice.

Anaconda Navigator SciKit-Learn Environment

Since there is a lot of reading, you may find some of these posts read like someone's notebook, but this is my mechanism: I learn a lot faster when I force myself to explain what I've just learned in my own words and relate it to things I already know, building chain-like connections to my brain's web of knowledge and experiences. It also makes good reference material for later. At this point I would say I am moving through the second stage of learning, Conscious Incompetence. (Look it up on Wikipedia if you don't know about it already - once you do, you can stop beating yourself up for not knowing how to do everything right away, overcome that inherent cognitive bias, and just jump in!)

Getting right back into it, Let's Code.
Load up Anaconda Navigator, go to Environments, and select Open with Python. Begin by importing numpy as np, and from sklearn import datasets. Now, let's look at Linear Regression: this method fits a linear model to the dataset by optimizing the coefficients for the best fit; in general these follow the usual Y = mX + b format.

Import the linear_model and create regr = linear_model.LinearRegression(). (On the nomenclature: LinearRegression is a class in the linear_model module - the parentheses are a constructor call, and regr is an instance of that class, what scikit-learn calls an estimator object.) Then load the diabetes dataset and split it into train and test sets by holding back the last 20 entries.
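Pieced together, the setup looks something like this - a sketch following the standard scikit-learn tutorial naming, not my exact notebook session:

```python
import numpy as np
from sklearn import datasets, linear_model

# regr is an instance of the LinearRegression estimator class
regr = linear_model.LinearRegression()

# Load the diabetes dataset (a Bunch object: .data holds the features,
# .target holds the values we want to predict)
diabetes = datasets.load_diabetes()

# Hold back the last 20 observations for testing; train on the rest
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_Y_train = diabetes.target[:-20]
diabetes_Y_test = diabetes.target[-20:]
```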

Python Splitting the Diabetes Data Set

I was curious about the "[:-20]" notation; Python tutorials explain that when slicing an array it means "from the beginning up to, but not including, the last 20 entries". So rather than splitting the data in half, it trims the final 20 rows off the array, whatever size it happens to be. To see what we're actually working with, I loaded the data with diabetes = datasets.load_diabetes() and asked for diabetes.data.shape: the output showed 442 observations, each observation containing 10 features.
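The shape check is a one-liner once the dataset is loaded:

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# .data is a NumPy array, so .shape reports (rows, columns)
print(diabetes.data.shape)   # (442, 10): 442 observations, 10 features each
```

So the "[:-20]" slice leaves 422 rows for training and the "[-20:]" slice takes the last 20 for testing.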

Finally it's time to fit our data, so we fit the training set with regr.fit(diabetes_X_train, diabetes_Y_train) and it outputs the fitted estimator. We can display the coefficients by typing print(regr.coef_) (note: coef_, with a single f) and a bunch of numbers dump out, one per feature. To see the Mean Squared Error enter:
np.mean((regr.predict(diabetes_X_test)-diabetes_Y_test)**2) which is basically saying: take the predicted results less the actual results, square the differences, and average them. The predictions and the actual values are two arrays of the same length, so you essentially have 2 columns of data comparing the model's prediction against the actual results, element by element.
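Putting the fit, coefficients, and error together - again a self-contained sketch using the tutorial's variable names:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, X_test = diabetes.data[:-20], diabetes.data[-20:]
Y_train, Y_test = diabetes.target[:-20], diabetes.target[-20:]

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)     # learn the coefficients from the training data

print(regr.coef_)              # one coefficient per feature (10 numbers)

# Mean squared error: average of (prediction - actual)^2 over the test set
mse = np.mean((regr.predict(X_test) - Y_test) ** 2)
print(mse)
```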

Python Linear Regression Prediction and Variance Score
Finally, you can check the variance score of the model on your X test and Y test data. A result of 1 means the relationship is perfectly explained by the model, whereas a zero would indicate no relationship at all. With only a few data points to work with, or a small sample size, noise can dominate: if you were to select any two random entries from the data, it's not likely they would be correlated, yet a plain linear fit will happily chase that noise. One solution involves shrinking the regression coefficients toward zero, known as Ridge Regression. The linear_model module has a Ridge class where you choose an alpha to control the shrinkage, balancing your alpha between too much bias (alpha too high) and too much variance (alpha too low). You have to be aware of the possibility of overfitting the data, and the bias added to the model through Ridge Regression helps prevent overfitting; this process is known as regularizing the data.

In some cases where you have many dimensions, like the 10 in the diabetes set, you might find the coefficients you get back on some of the variables are rather meaningless, so you can focus exclusively on relationships between selected variables to try to uncover something more meaningful. As mentioned in the paragraph above, we know Ridge Regression can shrink some of them toward zero, but to zero them out entirely we can use a sparse method to boil it down to a simpler model, using the LassoLars object, which is purportedly better suited to sparse problems with few data points.
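Here's roughly what that looks like; as with the Ridge example, alpha=0.1 is just one illustrative setting:

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train = diabetes.data[:-20]
Y_train = diabetes.target[:-20]

# LassoLars fits a Lasso model using the LARS algorithm; its L1 penalty
# can drive some coefficients all the way to zero, giving a sparse model
lasso = linear_model.LassoLars(alpha=0.1)
lasso.fit(X_train, Y_train)

print(lasso.coef_)   # some entries come back as exactly 0.0
```

The zeroed coefficients are the features the model has decided to drop entirely, which is what makes the result easier to interpret.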

Python Lasso Sparse method regression

The amount of information we seem to be inferring from a handful of points on a Cartesian plane is baffling. I'm not sure if this is where statistically meaningful results start to become meaningless in reality; I guess we'll see as I learn more down the road. Meaningful results have to be thought of in the context of "what does this mean in reality?". I'll stop here for now, as this was another 3-hour session, bringing the total to 9 hours invested in learning Python so far. The subject matter was more complicated and I had to try a couple of times to get it right. Next we'll move on to Classification and Support Vector Machines.