Machine Learning - Logistic Regression

Logistic Regression

Classification and Representation

Classification

  1. A yes-or-no question: the output y is discrete

    • 0: negative class (e.g. benign tumor)
    • 1: positive class (e.g. malignant tumor)
    • multi-class classification is also possible, where there are more than two possible outputs
  2. You could try the linear regression hypothesis: hθ(x) = θᵀx

    • if hθ(x) ≥ 0.5, predict y = 1 (0.5 is the threshold)
    • if hθ(x) < 0.5, predict y = 0
    • but what happens if the input range increases? A single example with a very large x changes the fitted line, so with the threshold fixed at 0.5 some cases end up on the wrong side and are misclassified
    • also, hθ(x) can be greater than 1 or smaller than 0, even though y is always 0 or 1
    • logistic regression ensures that 0 < hθ(x) < 1
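The problem with thresholding a linear hypothesis can be seen in a few lines. This is only a sketch: the parameter values and tumor sizes are made up for illustration, and x is assumed to carry the bias feature x0 = 1.

```python
def linear_hypothesis(theta, x):
    """Return theta^T x, where x already includes the bias feature x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def predict_with_threshold(theta, x, threshold=0.5):
    """Predict 1 when the linear output reaches the threshold, else 0."""
    return 1 if linear_hypothesis(theta, x) >= threshold else 0


# Hypothetical parameters: intercept -0.5, slope 0.25.
theta = [-0.5, 0.25]

for size in [1.0, 4.0, 10.0]:
    x = [1.0, size]
    # Outputs are -0.25, 0.5 and 2.0: two of them fall outside [0, 1].
    print(size, linear_hypothesis(theta, x), predict_with_threshold(theta, x))
```

The outputs below 0 and above 1 cannot be read as probabilities, which is the motivation for squashing θᵀx into the interval (0, 1).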

Hypothesis Representation

  1. We need a hypothesis function that satisfies 0 < hθ(x) < 1
  2. hθ(x) = g(θᵀx) where g(z) = 1 / (1 + e^(-z))
  3. hθ(x) = 1 / (1 + e^(-θᵀx))

    • g is called the sigmoid function or logistic function
    • horizontal asymptotes at 0 (as z → -∞) and 1 (as z → +∞)
    • ensures that 0 < hθ(x) < 1
  4. hθ(x) = estimated probability that y = 1

    • hθ(x) = P(y=1 | x;θ)
    • P(y=0 | x;θ) + P(y=1 | x;θ) = 1
    • P(y=0 | x;θ) = 1 - P(y=1 | x;θ)
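A minimal sketch of the sigmoid hypothesis and its reading as a probability. The values of theta and x are hypothetical, and the two-branch form of sigmoid is an implementation choice (not from the notes) that avoids overflow in math.exp for large negative z.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so e is in (0, 1)
    return e / (1.0 + e)


def hypothesis(theta, x):
    """h(x) = g(theta^T x), where x includes the bias feature x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


theta = [-3.0, 1.0]  # hypothetical parameters
x = [1.0, 4.0]       # bias feature plus one input, so theta^T x = 1

p_y1 = hypothesis(theta, x)  # estimated P(y=1 | x; theta)
p_y0 = 1.0 - p_y1            # estimated P(y=0 | x; theta)
print(p_y1, p_y0)
```

In exact arithmetic the output is strictly between 0 and 1; in floating point, extreme values of z can round to exactly 0.0 or 1.0.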

Decision Boundary

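  1. g(z) ≥ 0.5 when z ≥ 0, so predict y = 1 when z ≥ 0 and y = 0 when z < 0

    • hθ(x) = g(θᵀx) ≥ 0.5 when θᵀx ≥ 0, where z = θᵀx
  2. hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂): the decision boundary is the line θ₀ + θ₁x₁ + θ₂x₂ = 0
  3. non-linear decision boundaries

    • adding higher-order terms (e.g. x₁², x₂²) to θᵀx gives a boundary that is not a straight line, such as a circle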
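The decision rule can be sketched without evaluating the sigmoid at all, since g(z) ≥ 0.5 exactly when z ≥ 0. The parameter values here are illustrative assumptions: one set gives the line x1 + x2 = 3, the other gives the unit circle x1² + x2² = 1.

```python
def predict(theta, features):
    """Predict y = 1 when theta^T x >= 0, which is where g(theta^T x) >= 0.5."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0


# Linear boundary: -3 + x1 + x2 = 0 (hypothetical parameters).
theta_linear = [-3.0, 1.0, 1.0]
print(predict(theta_linear, [1.0, 1.0, 1.0]))  # z = -1
print(predict(theta_linear, [1.0, 2.0, 2.0]))  # z = 1


def circle_features(x1, x2):
    """Feature vector [1, x1, x2, x1^2, x2^2] for a non-linear boundary."""
    return [1.0, x1, x2, x1 * x1, x2 * x2]


# Non-linear boundary: -1 + x1^2 + x2^2 = 0 (hypothetical parameters).
theta_circle = [-1.0, 0.0, 0.0, 1.0, 1.0]
print(predict(theta_circle, circle_features(0.5, 0.5)))  # inside the circle
print(predict(theta_circle, circle_features(1.0, 1.0)))  # outside the circle
```

The boundary is a property of the parameters θ and the chosen features, not of the training set; the training set is only used to fit θ.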

Logistic Regression Model

Cost Function

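  1. Linear Regression:

    • J(θ) = (1/m) Σ Cost(hθ(x), y), where Cost(hθ(x), y) = (1/2)(hθ(x) - y)²
  2. For Logistic Regression:

    • J(θ) ends up non-convex if the squared error is used, because hθ(x) is the non-linear sigmoid
    • many local optima may appear, so gradient descent is not guaranteed to reach the global minimum
    • Cost(hθ(x), y) = -log(hθ(x)) if y = 1
    • Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0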
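A small sketch of the per-example logistic cost, using the standard form -log(h) for y = 1 and -log(1 - h) for y = 0. The function name example_cost is made up for illustration, and h is assumed to lie strictly between 0 and 1.

```python
import math


def example_cost(h, y):
    """Cost of predicting probability h (0 < h < 1) when the true label is y."""
    if y == 1:
        return -math.log(h)    # near 0 when h is close to 1, grows as h -> 0
    return -math.log(1.0 - h)  # near 0 when h is close to 0, grows as h -> 1


for h in [0.01, 0.5, 0.99]:
    print(h, example_cost(h, 1), example_cost(h, 0))
```

A confident wrong prediction (h near 0 when y = 1, or h near 1 when y = 0) is penalized heavily, while a confident correct one costs almost nothing.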

Simplified Cost Function and Gradient Descent

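  1. Cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)), which combines the y = 1 and y = 0 cases into one expression
  2. J(θ) = (1/m) Σᵢ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾), averaged over the m training examples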
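The combined cost and a gradient descent loop might look like the sketch below. The update rule is not written out in these notes; this assumes the usual batch form θj := θj - α (1/m) Σ (hθ(x) - y) xj. The tiny dataset, the learning rate alpha, and the iteration count are all made-up values.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """J(theta): average of -y*log(h) - (1-y)*log(1-h) over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / len(y)


def gradient_step(theta, X, y, alpha):
    """One batch update: theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    m = len(y)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    grads = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]


# Hypothetical data: each row is [x0 = 1, x1]; labels switch between x1 = 2 and 3.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
final_cost = cost(theta, X, y)
print(initial_cost, final_cost, theta)
```

All parameters are updated simultaneously from the same set of errors, which is why gradient_step builds the full gradient list before changing theta.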

Advanced Optimization

  • Conjugate gradient, BFGS, L-BFGS

    • no need to manually pick a learning rate
    • often faster than gradient descent, but more complex

Regularization

Problem of Overfitting

  • a model can try too hard to fit the training set
  • for example, by fitting a high-order polynomial (such as 5th order) to a handful of points
  • underfitting

    • does not fit the training set very well
    • also called high bias
  • overfitting

    • the curve contorts itself to pass through every training point and fails to generalize to new examples
    • also called high variance
  • How to solve the overfitting problem

    • reduce the number of features
    • use a model selection algorithm to choose which features to keep
    • regularization: keep all the features but reduce the magnitudes of parameters

Cost Function

  1. Small values for parameters

    • simpler hypothesis
    • less prone to overfitting
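    • a regularization parameter λ (lambda) controls how strongly large parameter values are penalized
  2. Regularization term which includes lambda: (λ / 2m) Σⱼ θⱼ² is added to J(θ), with θ₀ left out by convention

    • too large a λ shrinks the parameters toward zero and results in underfitting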
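A sketch of a regularized logistic cost, assuming the common L2 form in which (λ / 2m) Σ θj² is added to J(θ) and the bias parameter θ0 is not penalized. The notes do not spell out the formula, so treat this form, the function names, and the sample values as assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def regularized_cost(theta, X, y, lam):
    """Logistic cost plus (lam / 2m) * sum(theta_j^2) for j >= 1."""
    m = len(y)
    data_term = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * f for t, f in zip(theta, xi)))
        data_term += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    data_term /= m
    # theta[0] (the bias) is skipped, following the usual convention.
    penalty = (lam / (2.0 * m)) * sum(t * t for t in theta[1:])
    return data_term + penalty


# Hypothetical values: two examples, each row is [x0 = 1, x1, x2].
X = [[1.0, 0.5, 1.0], [1.0, -1.0, 2.0]]
y = [1, 0]
theta = [0.5, 2.0, -1.0]

print(regularized_cost(theta, X, y, lam=0.0))  # plain J(theta)
print(regularized_cost(theta, X, y, lam=1.0))  # J(theta) plus the penalty
```

With λ = 0 the penalty vanishes and this reduces to the unregularized cost; as λ grows, the penalty dominates and minimizing the cost pushes θ1, θ2, ... toward zero, which is the underfitting case noted above.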