Logistic Regression
Classification and Representation
Classification

yes or no question where outputs are discrete
 0: negative class (e.g., benign tumor)
 1: positive class (e.g., malignant tumor)
 there can also be multiclass classification, where there are more than two possible outputs

You could try using linear regression: h_{θ}(x) = θ^{T}x
 if h_{θ}(x) = θ^{T}x ≥ 0.5, predict y = 1 (0.5 is the threshold)
 if h_{θ}(x) = θ^{T}x < 0.5, predict y = 0
 but what happens if the input range increases? if the threshold remains the same, an extreme example shifts the fitted line, and some cases that should be benign are now classified as malignant
 also, h_{θ}(x) can be greater than 1 or smaller than 0, which makes no sense for a probability
 logistic regression ensures that 0 < h_{θ}(x) < 1
Hypothesis Representation
 a function that represents a hypothesis satisfying 0 < h_{θ}(x) < 1
 h_{θ}(x) = g(θ^{T}x) where g(z) = 1 / (1 + e^{-z})

h_{θ}(x) = 1 / (1 + e^{-θ^{T}x})
 sigmoid function / logistic function
 asymptote at 0 and 1
 ensures that 0 < h_{θ}(x) < 1
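
The sigmoid is easy to sketch in code; a minimal Python version for scalar inputs:

```python
import math

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z}); output is always in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))
```

g(0) = 0.5 exactly, large positive z approaches 1, and large negative z approaches 0, which matches the two asymptotes above.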

h_{θ}(x) = estimated probability that y = 1
 h_{θ}(x) = P(y=1 | x; θ)
 P(y=0 | x; θ) + P(y=1 | x; θ) = 1
 P(y=0 | x; θ) = 1 - P(y=1 | x; θ)
Decision Boundary

g(z) > 0.5 when z > 0
 h_{θ}(x) = g(θ^{T}x) > 0.5 when z = θ^{T}x > 0
 h_{θ}(x) = g(θ_{0} + θ_{1}x_{1} + θ_{2}x_{2})
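
Since the prediction flips exactly where θ^{T}x = 0, classification never needs the sigmoid itself. A sketch with made-up parameter values (θ = [-3, 1, 1] is purely illustrative):

```python
def predict(theta, x):
    # predict y = 1 exactly when theta^T x >= 0, i.e. h(x) >= 0.5
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1 if z >= 0 else 0

# hypothetical parameters theta_0 = -3, theta_1 = 1, theta_2 = 1:
# the decision boundary is the line x1 + x2 = 3 (x[0] = 1 is the intercept term)
theta = [-3.0, 1.0, 1.0]
```

Points with x1 + x2 above 3 land on the y = 1 side of the boundary, the rest on the y = 0 side.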

nonlinear decision boundaries
 sometimes a straight line cannot separate the classes; adding polynomial features (e.g., x_{1}^{2}, x_{2}^{2}) lets θ^{T}x describe curved boundaries
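
For instance, with hypothetical parameters θ = [-1, 0, 0, 1, 1] on features [1, x1, x2, x1², x2²], the boundary becomes the unit circle:

```python
def predict_circle(x1, x2):
    # hypothetical theta = [-1, 0, 0, 1, 1] on features [1, x1, x2, x1^2, x2^2]:
    # z = -1 + x1^2 + x2^2, so the decision boundary is the circle x1^2 + x2^2 = 1
    z = -1.0 + x1 ** 2 + x2 ** 2
    return 1 if z >= 0 else 0
```

Points inside the circle are classified y = 0, points outside are y = 1.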
Logistic Regression Model
Cost Function

Linear Regression:
 Cost(h_{θ}(x), y) = (1/2)(h_{θ}(x) - y)^{2}, and J(θ) averages this cost over the training set

For Logistic Regression:
 the cost function ends up nonconvex if the squared error is used with the sigmoid hypothesis
 many local optima may appear, so gradient descent is not guaranteed to reach the global minimum
 Cost(h_{θ}(x), y) = -log(h_{θ}(x)) if y = 1
 Cost(h_{θ}(x), y) = -log(1 - h_{θ}(x)) if y = 0
Simplified Cost Function and Gradient Descent
 Cost(h_{θ}(x), y) = -y*log(h_{θ}(x)) - (1-y)*log(1 - h_{θ}(x))
 J(θ) = (1/m) Σ Cost(h_{θ}(x), y)
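
A minimal sketch of the per-example cost (h is the hypothesis output, y the true label):

```python
import math

def cost(h, y):
    # per-example logistic cost: -y*log(h) - (1 - y)*log(1 - h)
    # equals -log(h) when y = 1 and -log(1 - h) when y = 0
    return -y * math.log(h) - (1 - y) * math.log(1 - h)
```

The cost approaches 0 as h approaches the true label and blows up as h approaches the wrong one, which is what makes confident wrong predictions expensive.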
Advanced Optimization

Conjugate gradient, BFGS, L-BFGS
 no need to manually pick a learning rate
 often faster than gradient descent, but more complex
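
A sketch of how such a routine is called in practice, assuming NumPy and SciPy are available (the tiny dataset here is made up): scipy.optimize.minimize with method="BFGS" takes the cost and its gradient and chooses step sizes itself.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dataset (made up): first column of X is the intercept term
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

def cost(theta):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta):
    # gradient of J(theta): X^T (h - y) / m
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# BFGS picks its own step sizes, so no learning rate has to be tuned by hand
res = minimize(cost, np.zeros(2), method="BFGS", jac=grad)
theta = res.x
```

After optimization, small x values are predicted as y = 0 and large ones as y = 1, reflecting the learned decision boundary.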
Regularization
Problem of Overfitting
 trying too hard to fit the training set
 e.g., using a 5th-order polynomial when a lower-order model would do

underfitting
 does not fit the training set very well
 also called high bias

overfitting
 the curve contorts itself to pass through every training point, so it fails to generalize to new examples
 also called high variance

How to solve the overfitting problem
 reduce number of features
 model selection algorithm
 regularization: keep all the features but reduce the magnitudes of parameters
Cost Function

Small values for parameters
 simpler hypothesis
 less prone to overfitting
 add a regularization term weighted by λ (lambda) to the cost function to penalize large parameters

Regularization term: (λ/(2m)) Σ_{j=1}^{n} θ_{j}^{2}, added to J(θ) (θ_{0} is not penalized by convention)
 too large a λ results in underfitting: the parameters are pushed toward zero and the hypothesis becomes nearly flat