Introduction to Machine Learning Part 2

Keisuke Daimon
Aug 6, 2022

Explaining the basics of machine learning with many examples (Part 1 is here)

Who should read this

Those who have heard of machine learning, but aren’t sure what processes are involved in machine learning.

Those who belong to a business department, and want to use machine learning tools or work with data scientists.

TL;DR (Summary)

These are all the steps I want to explain. I will cover Steps 5 to 8 here. (Steps 1 to 4 are here.)

  1. Understand Purposes: Why do you need machine learning?
  2. Collect Data: What data is available?
  3. Understand Data: What do your data mean?
  4. Prepare Data: Remember! This is often the most time-consuming step.
  5. Choose Models
  6. Train and Test/Evaluate Models
  7. Tune Parameters
  8. Operationalize Models

5. Choose Models

Before choosing models, you have to know what kind of task you want to solve.

Human brains can solve many (perhaps unlimited) kinds of tasks; machine learning models, however, are built for specific kinds of tasks. Hence, knowing what kind of task you have determines which models you can use.

Two very common tasks are regression and classification.

Regression

Regression is often used to predict numbers/amounts. For example, the figure below shows how many items salesperson A sold every week. The weeks of Jan 1 through Feb 12 are data collected in the past. Now you want to predict how many items s/he will sell from the week of Feb 19 to the week of Mar 5 (= the area in yellow).

You may draw a line through the data to understand the trend and say “hmmm… s/he will probably sell 50 items in the week of Mar 5.”

This is a simple idea about regression. (The method is called “Linear Regression”. Search for it if you want more details.)
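If you want to try this in code, here is a minimal sketch of linear regression using scikit-learn. The weekly sales numbers below are invented for illustration; they are not taken from the figure.

```python
# A minimal sketch of linear regression with scikit-learn.
# The weekly sales figures below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Weeks 1-7 (the week of Jan 1 through the week of Feb 12) and items sold each week
past_weeks = np.array([[1], [2], [3], [4], [5], [6], [7]])
items_sold = np.array([31, 35, 36, 38, 39, 43, 44])

model = LinearRegression()
model.fit(past_weeks, items_sold)      # "draw the line" through the past data

# Predict weeks 8-10 (the week of Feb 19 through the week of Mar 5)
future_weeks = np.array([[8], [9], [10]])
print(model.predict(future_weeks))     # roughly 46, 48 and 50 items
```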

Classification

As the name suggests, classification means splitting things into classes. Let’s recall our Use Case 1. We will classify new potential customers into two classes: people who will buy products and people who won’t buy products.

K-nearest neighbor is one of the simplest algorithms for classification. I will explain Steps 6 and 7 using this algorithm as an example.

The basic concept of k-nearest neighbor is “you belong to the class your neighbors belong to”.

Example) People living in a big city tend to be slimmer and have more money. Now you have income and body weight data. Guess whether Persons X and Y live in a big city or a small city.
The figure shows people in a big city as blue dots and people in a small city as red dots. You want to know which group the gray dots (Persons X and Y) belong to. At a glance, there are more red dots than blue dots around X, so X is probably in a small city. On the other hand, Y is close to a few blue dots, so Y may be in a big city.
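Here is a minimal sketch of this idea in code, using scikit-learn’s KNeighborsClassifier. The income/weight numbers and the city labels are invented for illustration.

```python
# A minimal sketch of k-nearest neighbors with scikit-learn.
# The income/weight numbers and the city labels are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row is [annual income (in $1,000s), body weight (kg)]
known_people = np.array([
    [90, 60], [85, 62], [95, 58], [80, 63],   # big-city residents (blue dots)
    [45, 80], [50, 78], [40, 85], [55, 76],   # small-city residents (red dots)
])
cities = ["big", "big", "big", "big", "small", "small", "small", "small"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(known_people, cities)

# Person X and Person Y, whose cities we want to guess
unknown_people = np.array([[48, 79], [88, 61]])
print(model.predict(unknown_people))   # expected: ['small' 'big']
```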

6. Train and Test/Evaluate Models

Suppose you decided to use k-nearest neighbor. (The basic idea does not change across models. You can easily apply this idea to other models.)

The main steps of k-nearest neighbor are…

  1. Find where “known” people are plotted in the graph (= Training)
  2. Find where “unknown” people are plotted in the graph and guess which class those “unknown” people belong to (= Prediction)

Let’s say you have data on 100 people. Those 100 people signed up several months ago, so you can assume those who haven’t bought your products will never buy any in the future. (Persons A, B and C shown in Step 2 are in this data.)
You also have data on 30 people who recently signed up. You want to decide who to talk to in order to sell your products. (Persons X, Y and Z are in this data.)

So now you use 100 people for training, and then try to classify the other 30 people…

But wait. How can you check if your model is accurate?

After you train the model with 100 people, you only have plotted dots in your graph. When the 30 potential customers are plotted, how confident are you in the result?

This is where “testing” comes in.

The steps become like this:

  1. Split “known” people into training data and testing data.
  2. Find where people in the training data are plotted in the graph (= Training)
  3. Calculate the accuracy by checking the results on the testing data. (= Testing)
  4. Find where “unknown” people are plotted in the graph and guess which class those “unknown” people belong to (= Prediction)

If the results on the testing data are 90% correct, you can expect your predictions to be roughly 90% correct as well.

If you have multiple models to try out, you can compare how accurate they are during the testing step, then choose the “best” one for prediction.
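Here is a minimal sketch of the train/test flow with scikit-learn. Instead of real customer data, it generates a random stand-in dataset of 100 “known” people with make_classification; the numbers are not from the article.

```python
# A minimal sketch of training, testing and predicting with scikit-learn.
# The data are randomly generated stand-ins for the 100 "known" people.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 100 known people: features X and labels y (1 = bought, 0 = did not buy)
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# 1. Split the known people into training data (80%) and testing data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# 2. Training
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 3. Testing: how accurate is the model on people it has never seen?
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.0%}")

# 4. Prediction: apply the trained model to the 30 new sign-ups
#    (here X_new would be the prepared data of those 30 people)
# predictions = model.predict(X_new)
```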

Cross Validation (additional info)

This section provides you with additional info. You can skip to Step 7 if you don’t need this.

You have probably heard of cross validation. It is one of the most important concepts in this field. The idea is as follows.

People in this field are really careful about bias. For example, suppose your target customers are men in their 20s. Even if you randomly split the data into training and testing sets, your testing data might happen to contain many middle-aged men. (So the testing data is “biased”.) How sure are you that a model that works well on the testing data (middle-aged men) can make accurate predictions about future customers (mostly young men)?

To reduce such bias, cross validation is widely used. The figure below is an example of 5-fold cross validation. You repeat training and testing 5 times, using the same data split in different ways. You then get 5 accuracy values, and their mean (average) becomes the accuracy of the model. This gives you a less biased estimate of accuracy.
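In code, scikit-learn’s cross_val_score does this in a few lines. Here is a minimal sketch; the data are again randomly generated stand-ins.

```python
# A minimal sketch of 5-fold cross validation with scikit-learn.
# The data are randomly generated stand-ins for the 100 known people.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)

# Train and test 5 times, each time holding out a different 20% for testing
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # 5 accuracy values
print(scores.mean())   # their mean = a less biased estimate of accuracy
```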

7. Tune Parameters

I already talked about model choice and training & testing. So what is left?

You may have a question: what is “k” in k-nearest neighbor?

Yes! That’s what I haven’t explained.

Concept: You predict a new person’s class based on the classes of their k nearest neighbors.

The figure below is an example. You want to know which class the gray dot in the center belongs to. There are two classes: blue and red. If k is 3, you have 2 reds and 1 blue, so you predict the gray dot probably belongs to the red class.

But what happens if k is 5? Now you see 2 reds and 3 blues in the k=5 circle, so the gray should be blue!

As you can see above, k has a huge impact on the prediction result, but training doesn’t give you the “right” k. The value of k is something you have to decide in advance (such a value is called a hyperparameter).

Then the flow becomes this below:

  1. Create models with different k values (e.g. k = 3, 5, 7…)
  2. Split “known” people into training data and testing data.
  3. Find where people in the training data are plotted in the graph (= Training)
  4. Calculate the accuracy of each model by checking the results on the testing data (= Testing)
  5. Find where “unknown” people are plotted in the graph and guess which class those “unknown” people belong to using the model with the highest accuracy (= Prediction)

“Tuning parameters” basically means adjusting hyperparameters to increase accuracy.
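Here is a minimal sketch of that flow, trying k = 3, 5 and 7 by hand and keeping the one with the best test accuracy. The data are randomly generated stand-ins for the “known” people.

```python
# A minimal sketch of tuning the hyperparameter k by hand.
# The data are randomly generated stand-ins for the "known" people.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

best_k, best_accuracy = None, 0.0
for k in [3, 5, 7]:                                 # 1. models with different k
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)                     # 3. training
    accuracy = model.score(X_test, y_test)          # 4. testing
    print(f"k={k}: accuracy={accuracy:.0%}")
    if accuracy > best_accuracy:
        best_k, best_accuracy = k, accuracy

print(f"Best k: {best_k}")
# 5. Prediction: use KNeighborsClassifier(n_neighbors=best_k) on the new people
```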

Grid Search (additional info)

This is a piece of additional info again. You can safely go to Step 8.

Grid search is a way to find good values of hyperparameters, but it’s nothing special.

Suppose you have two things to change: k and the definition of distance.

The k is already clear to you: it is how many neighbors to look at. Let’s say you want to try out 3, 5 and 7.

The definition of distance decides what counts as “near”. Popular ones are Euclidean distance and Manhattan distance. (I will not explain the details here.)

You have 3 values for k and 2 definitions for distance. What do you want to do next? Just test 6 (= 3 x 2) patterns!

Create 6 models, train all 6, calculate their accuracies, then choose the best one. Grid search is merely trying out all possible combinations of hyperparameters.
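Here is a minimal sketch using scikit-learn’s GridSearchCV, which tries all 6 (= 3 x 2) combinations with cross validation. The data are randomly generated stand-ins.

```python
# A minimal sketch of grid search with scikit-learn's GridSearchCV.
# The data are randomly generated stand-ins for the "known" people.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7],                  # k
    "metric": ["euclidean", "manhattan"],      # definition of distance
}

# Try all 3 x 2 = 6 combinations, each evaluated with 5-fold cross validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the best combination found
print(search.best_score_)    # its cross-validated accuracy
```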

8. Operationalize Models

This is the end of your machine learning project.

If you are doing business, you have to make sure that people (other employees) can “use” your models.

In most cases, you keep getting new data, so it’s critical to keep updating and improving your models by going through Step 2–7 periodically. In other words, you have to prepare an environment where Step 2–7 can be done many times very easily. (Otherwise, machine learning engineers can get fed up with repetitive and tedious processes…)

Moreover, don’t forget to evaluate the result based on your KPIs! Let’s say…

You are a sales team member.
Nobody has proved that the machine learning models introduced to your department are “good”. (Their effectiveness is simply “unclear”.)
Your sales result directly affects your salary. Whether or not you follow the machine learning models has no impact on your salary.

In this case, are you courageous enough to trust the machine learning models? I would say “No!” Most people would trust their own knowledge, experience and instincts. This is why I believe KPIs are super super important to get models actually used by people.

Best scenario: You build models, evaluate them and find KPIs are achieved
Second best scenario: You build models, evaluate them and find KPIs are NOT achieved (and you’ll try to improve models)
Worst scenario: You build models, do NOT evaluate them (and then nobody cares about the models…)

Last Words

Thank you for reading this page and I hope you now know more about machine learning. If you have any thoughts, please leave comments!
