Having run my first estimator in the previous post using **Python Anaconda SciKit-Learn** and following the documentation, it's now time to tackle some more of the available tutorials, starting with some Statistical Learning concepts.

**Data-sets are represented as a two-dimensional array** of *n* samples by *m* features. The *shape* of the data (>>> data.shape) in the iris data-set, for example, is (150, 4), which is 150 *observations* **with every observation containing 4** *features*.

When you enter >>> iris.DESCR (it is case-sensitive), you'll get this output:

**>>> iris.DESCR**

*'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal ....*

You get the point. The point is that **scikit-learn is expecting data input in a certain format**, namely that 2D-array structure, and if your data isn't in that format, you may need to clean and/or organize your data first.

Then, while attempting an exercise in **re-shaping** *data = digits.images*, I tried to call the *matplotlib.pyplot* library and it wouldn't load, so I needed to go back to the Anaconda Navigator, search for matplotlib, and add it to my environment.

[Image: Python scikit-learn packages and libraries]

As you can see, adding matplotlib pulls in a whole slew of other packages along with it. After restarting Python and importing everything, matplotlib was available.
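Coming back to the 2D-array format for a moment, here is a minimal sketch of what it looks like in practice, assuming scikit-learn is installed in the active environment. It checks the iris shape and then flattens the 8x8 digits images into one row per sample, which is my reading of the re-shaping exercise mentioned above. The variable names are only for illustration.

```python
from sklearn import datasets

# The iris data is already in the (n_samples, n_features) layout.
iris = datasets.load_iris()
print(iris.data.shape)

# The digits images are stored as one 8x8 grid per sample,
# so they are 3D and need flattening before an estimator can use them.
digits = datasets.load_digits()
print(digits.images.shape)

# Keep the number of samples and let NumPy work out the second dimension.
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)
```

Passing -1 tells NumPy to infer that dimension, so each 8x8 image becomes a single row of 64 features.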

The primary toolset in scikit is known as **the estimator.** Every estimator has a fit method that ingests a data array. The description of an estimator is actually best quoted here:

*"An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data."*
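To make that quote a bit more concrete, here is a small sketch showing that a classifier, a clustering algorithm and a transformer all expose the same fit method. The three estimators and the tiny toy array are my own picks for illustration, not something taken from the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data: 4 samples, 2 features, forming two obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])

# A classifier learns from X and the labels y.
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# A clustering algorithm learns from X alone.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A transformer also learns from X (here, the per-feature mean and scale).
scaler = StandardScaler().fit(X)

print(classifier.predict([[4.8, 5.0]]))
print(clusterer.labels_)
print(scaler.mean_)
```

Whatever the estimator does afterwards (predicting, labelling clusters, transforming), the learning step always goes through fit.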

It took about an hour of reading and trying to replicate it on my instance of Python to get this far, so we're up to a total of 4.5 hours invested in learning about Machine Learning, Python and scikit. Next I'll move on to learning about supervised learning.

In the **Supervised Learning** section of the tutorial we are trying to discover the connection between two datasets, which would be comprised of **the observed data** *x* and an **external variable** *y* to be predicted. In scikit the estimator learns from a dataset through **fit(X, y)** and then offers *predict(X)*, which when supplied **unlabeled data X** would return a **prediction** *y*. There can be either **classification** tasks or **regression** tasks.

An example of a **classification task** would be the problem of **classifying the irises dataset**. After you load the iris dataset into iris_X, what you're really doing is feeding in an array of shape (150, 4) for the estimator to learn from, and you're setting iris_y to the thing you want to predict, aka the target. When you use the numpy method and execute *np.unique(iris_y)* you get an *array([0, 1, 2])*.

The **k-Nearest neighbor (KNN) classifier** is one of the most basic available. It takes in an observation **X_test** and finds the closest vector to the observation in the training data. We're going to try an example of this now using the iris data, but first we must __split our data__ into **training data** and **test data**. Using the *numpy random RNG* we partition the data and assign the respective components as per the below code:

*>>> np.random.seed(0)*
*>>> indices = np.random.permutation(len(iris_X))*
*>>> iris_X_train = iris_X[indices[:-10]]*
*>>> iris_y_train = iris_y[indices[:-10]]*
*>>> iris_X_test = iris_X[indices[-10:]]*
*>>> iris_y_test = iris_y[indices[-10:]]*

Next we **create the nearest neighbor classifier** by importing *KNeighborsClassifier* from sklearn, and to make it easier to use we declare *knn = KNeighborsClassifier()*. Finally we fit the training data in as parameters for knn; it learns about the data and outputs some information about it, as you can see in the pic below. Next, we will **ask knn to predict the iris_y_test array** based on the iris_X_test data, and what's actually really cool is that we have the real iris_y_test array, which we can check afterwards to see how close the algorithm was.

[Image: KNN - Python K-Nearest Neighbor Classifier predicting test data]
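For anyone who wants to run the whole thing in one go, the steps above can be sketched as a single script. It assumes scikit-learn and NumPy are installed; the `accuracy` line at the end is my own addition for comparing the prediction against the held-out targets.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the features (150, 4) and the targets (150,).
iris_X, iris_y = datasets.load_iris(return_X_y=True)
print(np.unique(iris_y))

# Shuffle the row indices, then hold out the last 10 as test data.
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]

# Create the classifier and let it learn from the training data.
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)

# Predict the held-out samples and compare with the real targets.
predicted = knn.predict(iris_X_test)
print(predicted)
print(iris_y_test)

# Fraction of test samples the classifier got right (my own addition).
accuracy = np.mean(predicted == iris_y_test)
print(accuracy)
```

Shuffling before splitting matters here because the iris rows are ordered by class, so taking the last 10 rows unshuffled would test on a single species only.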

In general, **the more features a particular data set has, the more samples (exponentially more) you're going to require in order to reach statistical significance** and for the algorithms to be effective and return less error. Apparently, if you were ever trying to predict something that has ~20 features, you would need more data than currently exists on the internet today. Complexity adds **dimensionality**, and the more dimensions you have, the more samples you're going to need to run.

We'll look at **Linear Regression** and **Support Vector Machines (SVM)** in the next article. I'm now a total of 6 hours deep in learning this material and maybe 25% through the scikit-learn tutorials offered, and I feel like I'm finally getting it. The pieces are coming together now; it's like someone finally found a more interesting use for all those linear algebra classes they made us take.
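To put a rough number on the dimensionality point made earlier, here is a back-of-the-envelope sketch. It assumes the rule of thumb that if you want neighbouring samples no further than a distance d apart along every axis of a unit-sized feature space, you need about (1/d)^p samples for p features. The function name `samples_needed` and the default spacing of 0.1 are my own choices for illustration.

```python
def samples_needed(n_features, spacing=0.1):
    """Rough count of samples needed to cover a unit hypercube on a grid."""
    # Number of grid points along one axis, e.g. 10 for a spacing of 0.1.
    points_per_axis = round(1 / spacing)
    # The grid has that many points along every one of the feature axes.
    return points_per_axis ** n_features


for p in (1, 2, 4, 20):
    print(p, samples_needed(p))
```

Under these assumptions, 4 features (like iris) call for 10,000 samples, while 20 features call for 10 to the power of 20, which is where the "more data than the internet" remark comes from.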