Machine Learning Introduction
TL;DR

Teaching a machine to learn its task without being explicitly programmed
 The machine practices/learns on its own to improve its performance.
Supervised Learning

the right answers are given for a data set
 so do your best to produce the right answer for a new input

Types of supervised learning:

Regression: Predict continuous valued output
 something like house prices.

Classification: Discrete answers
 either 0 or 1 : true or false
 it could also be 0, 1, 2, 3
 what is the probability that the input turns out to be true or false?
 if you need to consider more than one parameter, the graph may look different

Unsupervised Learning

No right answers given
 no labels for the data.
 you don’t know whether an example is true or false
 but there are some sorts of clusters: you could organize them

something like organizing news.
 you’re not told how the articles are related, but Google somehow organizes them by headline
 you have to find the relationship between/among the given data set
Model Representation

Notation
 training set: a dataset given to train a model
 m : # of training examples
 x: input variables/features
 y: output variable/ target variable
 (x^{(i)}, y^{(i)}) = the ith training example (the ith row of the training set table)

Training set > Fed into learning algorithm > hypothesis(h)
 the hypothesis takes an input and produces an output
 h maps from x’s to y’s
 how to represent h ?
 h_{θ}(x) = h(x) = θ_{0} + θ_{1}x (for a linear function, think of it like a f(x))
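A minimal sketch of this hypothesis in Python (the function name is my own):

```python
# Univariate linear-regression hypothesis: h_theta(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# e.g. hypothesis(1, 2, 3) -> 1 + 2*3 = 7
```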
Cost Function

Measures the performance of a machine learning model: the goal is to find thetas that minimize the cost function
 θ_{0} and θ_{1} are parameters
 x_{1} and x_{2} are features
 choose the best θ_{0} and θ_{1} to make it close to y as much as possible
 so minimize J(θ_{0}, θ_{1}) = (1/2m) * SIGMA_{i=1..m} ((h_{θ}(x^{(i)}) − y^{(i)})^{2}) (this whole expression is the cost function)

Finding θ_{1} which minimizes J(θ) when θ_{0} = 0
 this is where differentiation comes in.
 the minimum is where the derivative = 0
 Contour plots are used to visualize cost functions
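The squared-error cost above can be sketched directly in Python (function and variable names are my own):

```python
# J(theta0, theta1) = (1/2m) * sum over i of (h_theta(x_i) - y_i)^2
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# A perfect fit gives zero cost, e.g. thetas (0, 1) on the line y = x:
# cost(0, 1, [1, 2, 3], [1, 2, 3]) -> 0.0
```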
Gradient Descent
 A way to minimize the cost function J

Start at some random θ_{0} and θ_{1} and keep changing them simultaneously in order to reduce J(θ_{0}, θ_{1})
 θ_{j} := θ_{j} − α * (∂/∂θ_{j}) J(θ_{0}, θ_{1})
 α is called learning rate
 temp0 := θ_{0} equation
 temp1 := θ_{1} equation
 θ_{0} := temp0
 θ_{1} := temp1
 the same (old) θ values have to be used when calculating both temp values (simultaneous update)
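The temp0/temp1 bookkeeping above can be sketched as one update step for linear regression (a sketch, not from the notes; names are my own):

```python
# One simultaneous gradient-descent step for h(x) = theta0 + theta1 * x
def gradient_step(theta0, theta1, xs, ys, alpha):
    m = len(xs)
    # Partial derivatives of J with respect to theta0 and theta1
    d0 = sum((theta0 + theta1 * x - y) for x, y in zip(xs, ys)) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    # Both temps are computed from the OLD thetas, then assigned together
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1
```

Computing `d1` from an already-updated `theta0` would break the simultaneous-update rule; that is why both derivatives use the old values.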

Intuition
 repeat until convergence: until you reach the global minimum
 think about the equation: θ_{j} := θ_{j} − α * (∂/∂θ_{j}) J(θ_{0}, θ_{1})
 if the derivative is positive, θ_{j} decreases; if it is negative, θ_{j} increases: either way you move toward the minimum
 if the learning rate is too small, gradient descent can be slow, because you would need more iterations
 if the learning rate is too large, gradient descent can overshoot the minimum > it could even diverge
 you don’t need to decrease the learning rate over time, because the derivative term shrinks as you approach the minimum

Batch
 Each step of gradient descent uses all the training examples
 you compute the sum over all the training examples to take the next step in gradient descent
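Putting the pieces together, a batch gradient-descent loop might look like this (a sketch under my own choice of names and defaults; each step sums over all m examples):

```python
# Fit h(x) = theta0 + theta1 * x by batch gradient descent
def batch_gradient_descent(xs, ys, alpha=0.01, iters=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        # "Batch": every step uses ALL m training examples
        d0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        # simultaneous update
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1
```

On a toy set drawn from y = 2x + 1, the thetas should converge near (1, 2) for a small enough learning rate.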