Machine Learning with scikit-learn: Basics to Beginner with a K-Nearest Neighbor (KNN) Classifier Example

Having run my first estimator in the previous post, using scikit-learn from the Anaconda Python distribution and following the documentation, it's time to tackle some more of the available tutorials, starting with some statistical learning concepts.


Datasets are represented as a two-dimensional array of n samples by m features. In the iris dataset, for example, the shape of the data (>>> data.shape, where data = iris.data) is (150, 4): 150 observations, each containing 4 features.
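Here's a minimal sketch of checking that shape yourself. It assumes scikit-learn is installed and uses its bundled copy of the iris data; the names n_samples and n_features are just mine for illustration.

```python
from sklearn import datasets

# Load the iris dataset that ships with scikit-learn.
iris = datasets.load_iris()
data = iris.data

# Rows are samples (observations), columns are features.
n_samples, n_features = data.shape
print(data.shape)  # (150, 4)
print(n_samples, "observations with", n_features, "features each")
```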

When you enter >>> iris.DESCR  (it is case-sensitive), you'll get this output:

>>> iris.DESCR
'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal .... 

You get the point: scikit-learn expects its input data in that 2D array structure, and if your data isn't in that format, you may need to clean and/or reorganize it first.



Then, while attempting an exercise in reshaping the digits data (data = digits.images), I tried to import the matplotlib.pyplot library and it wouldn't load, so I had to go back to the Anaconda Navigator, search for matplotlib and add it to my environment.
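For reference, the reshaping exercise can be sketched roughly like this. I've left matplotlib out so it runs without the extra install; it only shows how the 8x8 digit images get flattened into the 2D samples-by-features layout described earlier.

```python
from sklearn import datasets

# The digits dataset stores each sample as an 8x8 image.
digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8)

# Flatten every image into one row of 64 features so the data
# fits the (n_samples, n_features) layout scikit-learn expects.
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)  # (1797, 64)
```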

Python scikit-learn packages and libraries

As you can see, adding matplotlib pulls in a whole slew of other packages along with it. After restarting Python and importing everything again, matplotlib was available.

The primary tool in scikit-learn is the estimator. Every estimator has a fit method that takes in a data array. The description of an estimator is actually best quoted here:


"An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data."


It took about an hour of reading and trying to replicate it on my own Python installation to get this far, so we're up to a total of 4.5 hours invested in learning about machine learning with Python and scikit-learn. Next I'll move on to supervised learning.

In the supervised learning section of the tutorial, we are trying to discover the connection between two datasets: the observed data X and an external variable y that we want to predict. In scikit-learn, a supervised estimator has a fit(X, y) method that learns from the data, and a predict(X) method that, given unlabeled observations X, returns the predicted values y. The task can be either classification (y is a discrete label) or regression (y is a continuous value).
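To make the fit(X, y) / predict(X) pattern concrete, here's a small sketch on a made-up toy dataset (the numbers are invented purely for illustration, not taken from the tutorial). It uses the nearest-neighbor classifier and regressor from sklearn.neighbors with a single neighbor, so the classification and regression cases sit side by side.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data: four observations with a single feature each.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_class = np.array([0, 0, 1, 1])         # discrete labels -> classification
y_reg = np.array([0.0, 0.5, 1.0, 1.5])   # continuous values -> regression

# Unlabeled observations we want predictions for.
X_new = np.array([[0.2], [2.8]])

# Classification: predict a label for each new observation.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y_class)
class_pred = clf.predict(X_new)
print(class_pred)

# Regression: predict a continuous value for each new observation.
reg = KNeighborsRegressor(n_neighbors=1)
reg.fit(X, y_reg)
reg_pred = reg.predict(X_new)
print(reg_pred)
```

Both estimators are used the same way; only the kind of y they learn differs.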


An example of a classification task is classifying the iris dataset. After you load the iris data into iris_X, you have a (150, 4) array of observations for the estimator to learn from, and iris_y holds the thing you want to predict, aka the target. When you call the NumPy function np.unique(iris_y) you get array([0, 1, 2]), one value for each of the three iris classes.
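A minimal sketch of that loading step, assuming the variable names iris_X and iris_y used in this post:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
iris_X = iris.data    # observations: one row per flower, one column per feature
iris_y = iris.target  # target: the class of each flower, encoded as an integer

print(iris_X.shape)        # (150, 4)
print(iris_y.shape)        # (150,)
print(np.unique(iris_y))   # the three class codes: 0, 1 and 2
print(iris.target_names)   # the species name behind each code
```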



The k-nearest neighbors (KNN) classifier is one of the most basic available: given a new observation X_test, it finds the observation in the training set with the closest feature vector. We're going to try an example of this now using the iris data, but first we must split our data into training data and test data. Using a random permutation from NumPy, we shuffle the indices, hold out the last 10 observations as the test set and train on the rest, as per the code below:

>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test  = iris_X[indices[-10:]]
>>> iris_y_test  = iris_y[indices[-10:]]

Next we create the nearest neighbor classifier by importing KNeighborsClassifier from sklearn.neighbors, and to make it easier to use we declare knn = KNeighborsClassifier(). Then we fit knn on the training data (iris_X_train and iris_y_train); it learns from the data and outputs some information about itself, as you can see in the pic below. Next, we ask knn to predict the iris_y_test array based on the iris_X_test data, and what's really cool is that we have the real iris_y_test array to check afterwards and see how close the algorithm was.
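Put together, the steps described so far can be sketched as one script. This assumes the default settings of KNeighborsClassifier and the same seed and split as above; the accuracy line at the end is my own addition for checking the result, not part of the tutorial.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
iris_X, iris_y = iris.data, iris.target

# Shuffle the indices, then hold out the last 10 observations for testing.
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]

# Create the classifier and let it learn from the training data.
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)

# Predict the held-out labels and compare them with the real ones.
predicted = knn.predict(iris_X_test)
print("predicted:", predicted)
print("actual:   ", iris_y_test)

# Fraction of the 10 test observations that were classified correctly.
accuracy = np.mean(predicted == iris_y_test)
print("accuracy:", accuracy)
```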

KNN - Python K-Nearest Neighbor Classifier predicting test data

In general, the more features a data set has, the more samples (exponentially more) you need for the algorithms to be effective and return less error. This is known as the curse of dimensionality. Apparently, per the scikit-learn tutorial, an effective nearest neighbor estimator for something with ~20 features would need more training data than the estimated size of the entire internet today. Every feature adds a dimension, and the more dimensions you have, the more samples you need to cover the space.
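To put rough numbers on that, here's a back-of-the-envelope sketch. The figures are assumptions for illustration only: that you want about 10 sample points along each feature, and that each point costs a mere 8 bytes to store. The helper names required_points and storage_exabytes are my own.

```python
# Assumptions (illustrative only): 10 points along each dimension,
# and each stored point costs just 8 bytes.
POINTS_PER_DIMENSION = 10
BYTES_PER_POINT = 8
EXABYTE = 10 ** 18  # bytes


def required_points(n_features, per_dim=POINTS_PER_DIMENSION):
    """Points needed to cover a grid with per_dim points along each feature."""
    return per_dim ** n_features


def storage_exabytes(n_features):
    """Storage for those points, in exabytes, under the assumptions above."""
    return required_points(n_features) * BYTES_PER_POINT / EXABYTE


for p in (1, 2, 3, 10, 20):
    print(p, "features:", required_points(p), "points,",
          storage_exabytes(p), "exabytes")
```

Under these assumptions, 20 features already work out to 10^20 points, or about 800 exabytes.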

We'll look at Linear Regression and Support Vector Machines (SVM) in the next article. I'm now a total of 6 hours deep in learning this material and maybe 25% of the way through the scikit-learn tutorials, and I feel like I'm finally getting it. The pieces are coming together now; it's like someone finally found a more interesting use for all those linear algebra classes they made us take.
