Machine Learning - Introduction

TL;DR

Teaching a machine to learn its task without being explicitly programmed
- The machine practices (learns) on its own to improve its performance.

Supervised learning: the right answers (labels) are given for a data set
- so the model does its best to produce the right answer for a new input
Types of supervised learning:
- Regression: predict a continuous-valued output
  - something like house prices.
- Classification: predict a discrete-valued output
  - either 0 or 1: true or false
    - could also be 0, 1, 2, 3 (more than two classes)
  - what is the probability that the input turns out to be true or false?
  - if you need to consider more than one feature, the graph may look different (e.g. two features put each example at a point in a plane)

Unsupervised learning: no right answers are given
- the data has no labels
- you don’t know whether an example is true or false
- but the data may form clusters, so you can still organize it
Example: organizing news
- nobody says how the articles are related, yet Google somehow groups them by headline
- the algorithm has to find the relationships among the given data on its own

Notation
- training set: a dataset given to train a model
- m : # of training examples
- x: input variables/features
- y: output variable/ target variable
- (xⁱ, yⁱ) = ith training example (ith row of the training set table); the superscript i is an index, not an exponent
Training set -> fed into the learning algorithm -> hypothesis (h)
- the hypothesis takes an input and produces an output
- h maps from x’s to y’s
- how to represent h?
- h_θ(x) = h(x) = θ₀ + θ₁x (a linear function; think of it like f(x))
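The notation above can be sketched in Python. The data values and the names `training_set` and `hypothesis` are made up for illustration; they are not from the course material.

```python
# Hypothetical training set: each pair is (x, y) = (input feature, target).
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

m = len(training_set)  # m: number of training examples


def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# The i-th training example (1-based, as in the notes) is row i of the table.
i = 2
x_i, y_i = training_set[i - 1]

# With theta0 = 0 and theta1 = 2 the hypothesis reproduces y for this row.
prediction = hypothesis(0.0, 2.0, x_i)
```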

Cost function: measures the performance of a machine learning model; the goal is to find the thetas that minimize it
- θ₀ and θ₁ are parameters
- x is the feature (with several features they would be x₁, x₂, ...)
- choose θ₀ and θ₁ so that h_θ(x) is as close to y as possible over the training set
- so minimize J(θ₀, θ₁) = (1/(2m)) * Σ (h_θ(xⁱ) - yⁱ)², summed over i = 1..m; this whole expression is the cost function (the squared error cost)
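A minimal sketch of the squared error cost, assuming the linear hypothesis h_θ(x) = θ₀ + θ₁x. The function names and the data are hypothetical, chosen only to show how the sum is formed.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, training_set):
    """Squared error cost J(theta0, theta1) = (1/(2m)) * sum((h(x) - y)^2)."""
    m = len(training_set)
    squared_errors = ((hypothesis(theta0, theta1, x) - y) ** 2
                      for x, y in training_set)
    return sum(squared_errors) / (2 * m)


# Made-up data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

perfect_fit = cost(0.0, 2.0, data)  # every error is 0, so J = 0
poor_fit = cost(0.0, 0.0, data)     # h(x) = 0 everywhere: J = (4 + 16 + 36) / 6
```

The closer the line is to the data, the smaller J gets, which is why minimizing J picks the best θ₀ and θ₁.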
Finding the θ₁ that minimizes J(θ₁) when θ₀ = 0
- this is where derivatives come in.
- J is at its minimum where the derivative dJ/dθ₁ = 0
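As a worked version of that step, assume θ₀ = 0 so that h_θ(x) = θ₁x, and take the squared error cost. Setting the derivative to zero then gives θ₁ directly:

```latex
J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\theta_1 x^{(i)} - y^{(i)}\right)^2

\frac{dJ}{d\theta_1}
  = \frac{1}{m} \sum_{i=1}^{m} \left(\theta_1 x^{(i)} - y^{(i)}\right) x^{(i)} = 0
\quad\Longrightarrow\quad
\theta_1 = \frac{\sum_{i=1}^{m} x^{(i)} y^{(i)}}{\sum_{i=1}^{m} \left(x^{(i)}\right)^2}
```

For example, with the made-up points (1, 2), (2, 4), (3, 6): Σ xy = 28 and Σ x² = 14, so θ₁ = 2.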
Contour plots are used to visualize the cost function J(θ₀, θ₁) when both parameters vary: each contour line joins (θ₀, θ₁) pairs with the same cost

Gradient descent: a way to minimize the cost function J
Start at some random θ₀ and θ₁ and keep changing them simultaneously to reduce J(θ₀, θ₁)
- θ_j := θ_j - α * ∂/∂θ_j J(θ₀, θ₁), for j = 0 and j = 1
- α is called the learning rate
- temp0 := θ₀ - α * ∂/∂θ₀ J(θ₀, θ₁)
- temp1 := θ₁ - α * ∂/∂θ₁ J(θ₀, θ₁)
- θ₀ := temp0
- θ₁ := temp1
- both temp values must be computed from the same (old) θ₀ and θ₁, before either one is overwritten
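A minimal sketch of one simultaneous update step, assuming the linear hypothesis and the squared error cost from these notes. The partial derivatives used here, (1/m) Σ (h_θ(x) - y) for θ₀ and (1/m) Σ (h_θ(x) - y) * x for θ₁, come from differentiating that cost; the function name and the data are made up for illustration.

```python
def gradient_step(theta0, theta1, alpha, training_set):
    """One gradient descent step with a simultaneous update of both thetas."""
    m = len(training_set)
    # Prediction errors h(x) - y, all computed with the OLD theta0 and theta1.
    errors = [(theta0 + theta1 * x) - y for x, y in training_set]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, (x, _) in zip(errors, training_set)) / m

    temp0 = theta0 - alpha * d_theta0
    temp1 = theta1 - alpha * d_theta1
    # Only now are the parameters overwritten.
    return temp0, temp1


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x
theta0, theta1 = gradient_step(0.0, 0.0, 0.1, data)
```

Assigning θ₀ first and then using the new θ₀ to compute temp1 would mix old and new values, which is a different update from the one described in the notes.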
Intuition
- repeat until convergence, i.e. until you reach a minimum (for the squared error cost of linear regression the only minimum is the global one)
- think about the equation: θ_j := θ_j - α * ∂/∂θ_j J(θ₀, θ₁)
- if the derivative is positive, the update makes θ_j smaller; if it is negative, θ_j gets bigger; either way θ_j moves in the direction that lowers J
- if the learning rate is too small, gradient descent can be slow because it needs more iterations
- if the learning rate is too large, gradient descent can overshoot the minimum and may even diverge
- you don’t need to decrease the learning rate over time: the derivative term shrinks as you approach the minimum, so the steps get smaller on their own
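To see the effect of the learning rate, here is a toy sketch on J(θ) = θ², whose derivative is 2θ and whose minimum is at θ = 0. This one-parameter cost is a stand-in chosen for simplicity, not the linear regression cost, and the α values are arbitrary.

```python
def run_gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize the toy cost J(theta) = theta**2 with a fixed learning rate."""
    for _ in range(steps):
        derivative = 2 * theta  # dJ/dtheta
        theta = theta - alpha * derivative
    return theta


too_small = run_gradient_descent(alpha=0.01)  # still far from 0 after 20 steps
reasonable = run_gradient_descent(alpha=0.1)  # close to 0
too_large = run_gradient_descent(alpha=1.1)   # overshoots further each step
```

With α fixed at 0.1 each step is α * 2θ, so the steps shrink automatically as θ nears 0; with α = 1.1 every step lands farther from 0 than it started, so the run diverges.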
Batch gradient descent
- each step of gradient descent uses all the training examples
- every step computes the sums over all m examples before updating the parameters
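Putting the pieces together, a minimal sketch of batch gradient descent for the linear hypothesis. The data, the learning rate, the iteration cap, and the stopping tolerance are all assumptions picked for illustration.

```python
def batch_gradient_descent(training_set, alpha=0.1, max_iters=10000, tol=1e-12):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(training_set)
    theta0, theta1 = 0.0, 0.0
    for _ in range(max_iters):
        # "Batch": both sums run over ALL m training examples at every step.
        errors = [(theta0 + theta1 * x) - y for x, y in training_set]
        d_theta0 = sum(errors) / m
        d_theta1 = sum(e * x for e, (x, _) in zip(errors, training_set)) / m

        # Simultaneous update from the old parameter values.
        new_theta0 = theta0 - alpha * d_theta0
        new_theta1 = theta1 - alpha * d_theta1

        # Stop once the parameters barely change (treated as convergence).
        converged = (abs(new_theta0 - theta0) < tol
                     and abs(new_theta1 - theta1) < tol)
        theta0, theta1 = new_theta0, new_theta1
        if converged:
            break
    return theta0, theta1


data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # made-up points on y = 1 + 2x
theta0, theta1 = batch_gradient_descent(data)
```

On this data the result should approach θ₀ = 1 and θ₁ = 2. The cost of every step grows with m, since each step touches the whole training set.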

Oct 26, 2020

AI Enthusiast and a Software Engineer
Jason Kang on LinkedIn