Introduction to Machine Learning Part 1

Keisuke Daimon

7 min read · Jul 30, 2022

An explanation of the basics of machine learning with many examples. (Part 2 is here.)

Who should read this

Those who have heard of machine learning but aren’t sure what steps a machine learning project involves.

Those who belong to a business department, and want to use machine learning tools or work with data scientists.

TL;DR (Summary)

These are all the steps I want to explain. I will cover Steps 1 to 4 on this page. (Steps 5 to 8 are here.)

1. Understand Purposes: Why do you need machine learning?
2. Collect Data: What data is available?
3. Understand Data: What do your data mean?
4. Prepare Data: Remember! This is often the most time-consuming step.
5. Choose Models
6. Train and Test/Evaluate Models
7. Tune Parameters
8. Operationalize Models

Note

You can skip this section if you don’t fully understand it.

I will focus on supervised learning on this page. In short, supervised learning uses data that comes with answers, while unsupervised learning uses data without answers.

1. Understand Purpose

First of all, you really, really have to know what you want to achieve with machine learning, and then set measurable KPIs (Key Performance Indicators).

Use Case 1) You are a sales team manager. You find that if your team gets 100 phone numbers of potential customers, only 1 of them actually buys your product. To increase revenue without hiring new members, you want to know which potential customers are more promising so that your team can focus on those people. You talk with your boss and decide that the KPI is a 5% purchase rate. (= If your team gets 100 phone numbers, you want 5% of the phone number owners to buy at least one product.)

Use Case 2) You work in a machine learning algorithm lab. You want to invent an algorithm to check whether a picture contains a cat or not. According to previous research, the highest accuracy was 96%, achieved with 12 hours of model training. You want to build a new model that trains faster. The main KPIs are 96% accuracy and 3 hours of model training.

Why are measurable KPIs important? I will write the reasons in Step 8 (after going through all explanations).
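To make "measurable" concrete, here is a minimal sketch of how the KPI in Use Case 1 could be checked. The function name `conversion_rate` and the constant `KPI_TARGET` are made up for this illustration, and the counts are the ones from the use case.

```python
def conversion_rate(buyers: int, contacts: int) -> float:
    """Return the share of contacted people who bought at least one product."""
    if contacts == 0:
        raise ValueError("contacts must be greater than 0")
    return buyers / contacts


KPI_TARGET = 0.05  # 5%, the target agreed with the boss in Use Case 1

# Today: 1 buyer per 100 phone numbers.
baseline = conversion_rate(buyers=1, contacts=100)
print(f"baseline: {baseline:.0%}, target met: {baseline >= KPI_TARGET}")
```

Because the KPI is a single number, anyone on the team can rerun this check after the model is introduced and see whether the target was reached.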

2. Collect Data

Before doing anything, you have to collect data. At this time, let’s try to gather as much data as possible.

Thinking about the sales team example, you may have potential customers’ phone numbers, email addresses, whether or not s/he has bought your products, demographic info (e.g. age, gender, income, marriage status, race, religion, education level…), phone call history, phone call recordings, offline store visits, online store browsing history…

In this case, “whether or not s/he has bought your products (Purchased?)” is important, because what you are going to do is find out whether a new potential customer you meet in the future is more similar to people who bought products or to people who didn’t. If your machine learning model says “a new potential customer X is very similar to people who bought products in the past, so s/he is likely to buy products!”, then you will put more effort into approaching X.
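The idea of "similar to people who bought products" can be sketched in a few lines. This is only one possible illustration (a nearest-neighbor lookup), not the model this article recommends. The two fields (age and number of online store visits), the sample rows, and the function name `most_similar_label` are all assumptions.

```python
import math

# Hypothetical past customers: (age, online store visits, purchased?)
past_customers = [
    (25, 1, False),
    (31, 2, False),
    (42, 9, True),
    (38, 7, True),
]


def most_similar_label(age: float, visits: float) -> bool:
    """Return the 'Purchased?' answer of the most similar past customer."""
    nearest = min(
        past_customers,
        key=lambda c: math.dist((age, visits), (c[0], c[1])),
    )
    return nearest[2]


# New potential customer X: 40 years old, 8 online store visits.
print(most_similar_label(40, 8))
```

Without the "Purchased?" column there would be no answer to copy from the most similar person, which is why that field matters so much.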

Don’t think too much about the data quality now. You can remove “bad” data later.

3. Understand Data

You now want to understand more about your data before preparing it. Some common ways are shown below. This really requires collaboration between the non-tech side and the tech side.

3–1. Understand characteristics of your data from the business point of view.

Example 1) You use phone numbers as people’s IDs.

Example 2) Marriage status in your data can take only 3 values: “single”, “married” and “other”. In most cases, users select one value when they sign up online.

Example 3) When users sign up online, your website sends a code to verify the phone number, so phone numbers in the data are very accurate. On the other hand, many people don’t enter their income honestly, and you have no way to check whether the income info is correct.

Example 4) “Offline store visits” is counted only when customers go to an offline store and sign up for or log in to your website by scanning a QR code. This means many visits aren’t counted in your data. However, the action of scanning a QR code may be a sign of the customer’s interest.

3–2. Understand characteristics of your data from the statistics point of view

Even if you don’t know a programming language for statistical analysis, you can use Excel or visualization tools such as Tableau for the analyses below.

Characteristics of one data field: Count, sum, max (maximum), min (minimum), mean, median…

The oldest potential customer is 80 years old, while the youngest is 18. The mean of age is 34.2.
60% are married and 32% are single. The data contains 121,093 phone numbers.
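If you do use a programming language, these numbers take only a few lines. Below is a minimal sketch using Python's standard library. The lists `ages` and `marriage_status` are a tiny made-up sample, not the data described above.

```python
import statistics
from collections import Counter

# Hypothetical sample; real data would come from your database or a CSV export.
ages = [18, 25, 34, 41, 80]
marriage_status = ["married", "single", "married", "other", "married"]

print("count :", len(ages))
print("min   :", min(ages))
print("max   :", max(ages))
print("mean  :", statistics.mean(ages))
print("median:", statistics.median(ages))

# Share of each category, e.g. the percentage of married people.
status_counts = Counter(marriage_status)
status_share = {
    status: count / len(marriage_status)
    for status, count in status_counts.items()
}
print(status_share)
```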

Characteristics of more than one data field: correlations

Phone call recordings tend to be longer if potential customers are women.
The mean age of married people is 36.7, while the mean age of single people is 32.1.
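A comparison like "mean age by marriage status" can be sketched by grouping one field by another. The rows and the variable names below are assumptions for illustration only.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (marriage status, age)
rows = [
    ("married", 35),
    ("married", 40),
    ("single", 28),
    ("single", 34),
    ("married", 36),
]

# Collect the ages that belong to each marriage status.
ages_by_status = defaultdict(list)
for status, age in rows:
    ages_by_status[status].append(age)

# Compute one mean per group.
mean_age_by_status = {
    status: statistics.mean(ages) for status, ages in ages_by_status.items()
}
print(mean_age_by_status)
```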

Duplicates

One phone call might generate two lines of phone call history data due to some system issues. You may want to remove one line from phone call history data.
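As a rough sketch, removing exact duplicates could look like the code below. It assumes that a duplicated call produces two completely identical rows. If your system writes slightly different rows (for example, different log IDs), you would need to compare only the fields that identify a call. The sample rows and the function name `drop_duplicates` are made up.

```python
# Hypothetical phone call history rows:
# (phone number, call start time, duration in seconds)
call_history = [
    ("555-0101", "2022-07-01 10:00", 120),
    ("555-0101", "2022-07-01 10:00", 120),  # the same call logged twice
    ("555-0102", "2022-07-01 11:30", 45),
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


clean_history = drop_duplicates(call_history)
print(len(call_history), "->", len(clean_history))
```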

Missing values

“Religion” is not a required field, so more than 80% of online users didn’t choose their religion. These show up as missing values in your data.
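A quick way to see how serious the problem is: count the share of missing values per field. In this sketch, `None` stands for a field the user left blank. The records, the placeholder values "A" and the helper `missing_share` are hypothetical.

```python
# Hypothetical user records; None marks a field the user left blank.
users = [
    {"phone": "555-0101", "religion": None},
    {"phone": "555-0102", "religion": "A"},
    {"phone": "555-0103", "religion": None},
    {"phone": "555-0104", "religion": None},
    {"phone": "555-0105", "religion": None},
]


def missing_share(records, field):
    """Return the fraction of records where the given field is missing."""
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


print(f"religion missing: {missing_share(users, 'religion'):.0%}")
print(f"phone missing: {missing_share(users, 'phone'):.0%}")
```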

Outliers

You find that one person in your data is 500 years old, which is obviously impossible unless s/he is a vampire :P Your website doesn’t validate the age input, so s/he mistakenly or intentionally entered 500 and it was stored in your database.
Important!! There is no universal method to identify outliers. You must know the background of your data.
(For instance, you can’t tell whether a 100-year-old man is an outlier without context. If you offer community services for elderly people, he may not be considered an outlier. If you operate websites for high school girls, you probably want to ignore him.)
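The sketch below shows why the background matters: the same ages give different outliers depending on the range you consider reasonable. The age limits for each business are assumptions chosen only for illustration, and `find_age_outliers` is a hypothetical helper.

```python
def find_age_outliers(ages, min_age, max_age):
    """Return the ages outside the range that makes sense for this business."""
    return [age for age in ages if age < min_age or age > max_age]


ages = [18, 34, 100, 500]

# A community service for elderly people might accept ages up to 120.
elderly_service_outliers = find_age_outliers(ages, min_age=18, max_age=120)

# A website for high school students might use a much narrower range.
high_school_site_outliers = find_age_outliers(ages, min_age=13, max_age=25)

print(elderly_service_outliers)
print(high_school_site_outliers)
```

The 500-year-old is flagged in both cases, but the 100-year-old is only flagged for the high school website.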

4. Prepare Data

This is actually the hardest part because data isn’t stored nicely in most organizations.

4–1. Data Cleansing

Link data

Your online user data (phone number, name, email etc.) and phone call recordings (phone number, voice data, recording time etc.) are stored in different systems. You have to connect one to the other. Luckily, both have phone numbers, so you can link them.
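Linking on phone number could be sketched like this. The field names, the sample values and the variable `linked` are assumptions. In a real project this is usually done with a database join or a data analysis library.

```python
# Hypothetical online user data, keyed by phone number.
users = {
    "555-0101": {"name": "Alice", "email": "alice@example.com"},
    "555-0102": {"name": "Bob", "email": "bob@example.com"},
}

# Hypothetical phone call recordings from a different system.
recordings = [
    {"phone": "555-0101", "recording_seconds": 120},
    {"phone": "555-0101", "recording_seconds": 300},
    {"phone": "555-0199", "recording_seconds": 60},  # no matching user
]

# Attach the user fields to each recording that has a matching phone number.
linked = [
    {**recording, **users[recording["phone"]]}
    for recording in recordings
    if recording["phone"] in users
]
print(linked)
```

Note that this version silently drops recordings without a matching user, and users without recordings. Whether that is acceptable depends on your purpose.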

Remove duplicates / outliers

In Step 3, you identified which parts of the data are duplicates or outliers. Now you actually remove them.

Handle missing values

Missing values often cause headaches because most machine learning models can’t work with missing values. You have 4 choices.

Choice 1: Use a model that allows input data to contain missing values.
Pros: Less work is needed in this step.
Cons: Far fewer models are available, so there is less room for trial and error.

Choice 2: Replace missing values with the mean/median.
Pros: You don’t lose data (unlike choice 4).
Cons: Values may not be accurate.

Choice 3: Use algorithms to guess missing values.
Pros: You may get values which are more accurate than the mean/median.
Cons: You still don’t know if the values calculated by the algorithms are correct. Using algorithms may take time.

Choice 4: Remove data with missing values.
Pros: Easiest!
Cons: You have less data.
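Two of these choices, replacing with the mean and removing, are simple enough to sketch with plain Python. The `incomes` list is made-up data, with `None` marking a missing value.

```python
import statistics

# Hypothetical incomes; None marks a missing value.
incomes = [300, None, 500, 400, None]

known = [value for value in incomes if value is not None]

# Replace missing values with the mean of the known values.
# (Use statistics.median(known) instead if you prefer the median.)
mean_income = statistics.mean(known)
filled_with_mean = [mean_income if value is None else value for value in incomes]

# Remove data with missing values.
dropped = known

print(filled_with_mean)  # same length as the original data
print(dropped)           # shorter than the original data
```

The first result keeps all five rows but two of them are guesses. The second result contains only real values but has lost two rows.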

4–2. Data Labeling

This means adding answer labels to your data. You may have already forgotten Use Case 2 from Step 1, but it’s a good example.

Image recognition (such as finding cats in pictures) usually requires a lot of pictures to create models. You can go out to take pictures or search online for pictures. The problem is those pictures don’t have labels (whether pictures show cats or not).

So what do you do? It’s a task for humans. You check all the pictures and add labels one by one.

(If you need more human resources, Amazon Mechanical Turk is one choice.)

4–3. Data Augmentation

This also suits the cat picture use case. Data augmentation means increasing the amount of data by creating new data from existing data.

One important idea is that even if you rotate cat pictures, those pictures still contain cats!

For example, you originally have only 10,000 cat pictures, but by adding 10,000 pictures rotated 45 degrees clockwise and 10,000 blurred pictures, you now have 30,000 pictures!

Pictures in the real world aren’t always clear, so this process actually makes models more robust (more resistant to noise).
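Here is a minimal sketch of the idea, treating a picture as a small grid of grayscale pixel values. To stay free of image libraries, it rotates by 90 degrees instead of the 45 degrees in the example above, and uses a simple box blur. Both are simplifications, and the function names and the tiny 2×2 "picture" are made up. Real projects normally use an image library for this.

```python
def rotate_90_clockwise(image):
    """Rotate a 2D grid of pixel values by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def box_blur(image):
    """Replace each pixel with the mean of itself and its in-bounds neighbors."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        new_row = []
        for x in range(width):
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            new_row.append(sum(neighbors) / len(neighbors))
        blurred.append(new_row)
    return blurred


# A tiny hypothetical grayscale "picture" (0 = black, 255 = white).
original = [
    [0, 255],
    [0, 255],
]

# One original picture becomes three training pictures with the same label.
augmented = [original, rotate_90_clockwise(original), box_blur(original)]
print(len(augmented))
```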

↓ This is a “clear” picture.