Machine learning basics

Basics

Standard deviation = how far, on average, each sample is from the mean

Variance is the square of the standard deviation

RSS, residual sum of squares = sum of squared differences between true values and estimates

RSE, residual standard error = how far, on average, the estimates are from the true values

R2 is between 0 and 1; a value near 1 means the model explains most of the variability
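
A minimal NumPy sketch of these quantities; the toy y_true/y_pred arrays and the single-predictor assumption for RSE are just for illustration:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # observed values (toy data)
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # model estimates (toy data)

std = y_true.std(ddof=1)            # average distance of samples from the mean
variance = std ** 2                 # variance = std squared

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)        # residual sum of squares

n, p = len(y_true), 1               # p = number of predictors (assumed 1 here)
rse = np.sqrt(rss / (n - p - 1))    # residual standard error

tss = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - rss / tss                  # near 1: model explains most of the variability
print(std, variance, rss, rse, r2)
```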

t statistic

How many standard deviations the estimate is away from 0

  • 1σ ≈ 68%
  • 2σ ≈ 95%
  • 3σ ≈ 99.7%

For example, ±2σ around the estimate gives roughly a 95% confidence interval

Z value

For example, |z| = 2 covers roughly 95% of the distribution
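
A quick check of these coverage numbers with SciPy (assuming scipy is available):

```python
from scipy.stats import norm

# fraction of a normal distribution within z standard deviations of the mean
for z in (1, 2, 3):
    coverage = norm.cdf(z) - norm.cdf(-z)
    print(f"±{z}σ covers {coverage:.1%}")   # ≈ 68.3%, 95.4%, 99.7%

# exact z for a two-sided 95% interval
print(norm.ppf(0.975))                      # ≈ 1.96
```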

P value

Probability of getting a result at least this extreme by chance, rather than from a real effect of the predictor

Measure on the test set; good metrics on the training set mean little

ROC curve

True positive rate vs. false positive rate, across classification thresholds
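
A minimal ROC sketch with scikit-learn, computed on a held-out test set as suggested above (synthetic data, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# one (FPR, TPR) point per threshold; plotting TPR vs. FPR gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(roc_auc_score(y_test, scores))   # area under the ROC curve
```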

Supervised learning

Predicting or estimating an output based on some input

  • 1800s least squares and linear regression
  • 1936 LDA
  • 1940s logistic regression
  • 1970s generalized linear models
  • 1980s classification and regression trees
  • 1986 generalized additive models

Linear regression

  • Is at least one predictor related to the response?
  • Hypothesis test
  • F statistic
Which variables are important?
  1. Forward selection: start with no predictors and add them one by one, each time adding the one that gives the lowest RSS

  2. Backward selection: start with all predictors and drop them one by one, each time removing the one with the highest p-value

  3. Mixed selection: a combination of forward and backward steps

How well does the model fit? Look at RSE and R2
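
A sketch of these diagnostics with statsmodels (synthetic data, assumed only for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # three candidate predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.fvalue, fit.f_pvalue)               # F statistic: is any predictor related?
print(fit.pvalues)                            # per-coefficient p-values (backward selection drops the largest)
print(fit.rsquared, np.sqrt(fit.mse_resid))   # R2 and RSE: how well the model fits
```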

Prediction

Inference is how the features affect the outcome

The error has a reducible part (from estimating f) and an irreducible part (due to ε)

Y = f(X) + ε

How to estimate f?

  1. Parametric: assume a functional form for f; higher bias

  2. Non-parametric: no assumed form; requires a lot of data

More flexible models are less interpretable

And they are not always more accurate, due to overfitting

Interpretability is good for inference, i.e. understanding the effect of a single predictor

Mean square error

degrees of freedom

Cross validation

Bias & Variance

Less flexible methods have more bias

In general, more flexible methods have more variance

Good test performance requires low bias and low variance

Expected test MSE decomposes into variance, squared bias, and irreducible error, as written out below
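
The standard decomposition, written out:

```latex
% expected test MSE at a point x_0
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\varepsilon)
```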

Classification

  • Training error
  • Test error

Smaller test error = better classifier

The minimum achievable test error rate is the Bayes error rate, and the resulting boundary is the Bayes decision boundary. This is analogous to the irreducible error

Bayes is the gold standard but mostly unattainable because we don't know the conditional distribution of Y given X

Many approaches, like KNN, attempt to estimate the conditional probability and plug it into the Bayes rule, as in the sketch below
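
A minimal KNN sketch with scikit-learn showing the estimated conditional probabilities (synthetic data, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN estimates P(Y = k | X = x) as the class fraction among the k nearest neighbours,
# then predicts the class with the highest estimated probability, mimicking Bayes
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(knn.predict_proba(X_test[:5]))   # estimated conditional probabilities
print(knn.score(X_test, y_test))       # test accuracy = 1 - test error rate
```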

one vs all

Is this class 1? Is this class 2? And so on

Binary Relevance

An ensemble of single-label binary classifiers is trained, one for each class.

Transform the multi-label problem into a set of binary classification problems, or adapt the algorithms directly (sketched below)
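
A sketch of the one-vs-all / binary-relevance idea with scikit-learn's OneVsRestClassifier, which fits one binary model per class (iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# one binary classifier per class: "is this class k or not?"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))   # 3 fitted binary models, one per class
print(ovr.predict(X[:5]))
```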

Resampling

  • Cross validation
  • K-fold CV: more bias, less variance
  • Leave-one-out CV: less bias, more variance
  • Bootstrap: resampling with replacement (see the sketch after this list)
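
A minimal sketch of these resampling methods with scikit-learn (synthetic regression data assumed):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.utils import resample

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
model = LinearRegression()

# K-fold CV and leave-one-out CV estimates of the test MSE
kfold = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0),
                        scoring="neg_mean_squared_error")
loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")
print(-kfold.mean(), -loo.mean())

# bootstrap: resample with replacement, refit, look at the spread of the coefficients
coefs = [LinearRegression().fit(*resample(X, y)).coef_ for _ in range(200)]
print(np.std(coefs, axis=0))   # bootstrap standard errors of the coefficients
```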

Logistic regression

Models the probability of belonging to a specific class

The output is naturally between 0 and 1, thanks to the logistic function (below)
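
The logistic function, written out for a single predictor:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
```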

Linear discriminant analysis

Assumes each class follows a Gaussian distribution

The class mean is estimated by the sample mean of that class

The variance is estimated from the samples

Assumes each class has the same covariance matrix

Get the linear decision boundaries between Gaussians

Comparison of LDA and logistic regression

Both produce linear decision boundaries

Logistic parameters are estimated by maximum likelihood, while LDA parameters are computed using the estimated mean and variance from a normal distribution

Both create very similar boundaries. When the Gaussian assumption is off, logistic regression performs better

  • KNN is non-parametric and highly flexible

QDA

Assumes each class has its own covariance matrix

QDA is the middle ground between LDA and KNN
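
A minimal comparison sketch with scikit-learn (synthetic three-class data; the exact scores depend on the data and are not meaningful by themselves):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),   # maximum likelihood, linear boundary
    "LDA": LinearDiscriminantAnalysis(),             # shared covariance, linear boundary
    "QDA": QuadraticDiscriminantAnalysis(),          # per-class covariance, quadratic boundary
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```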

Beyond linearity

Linear models have the advantage of interpretability and inference

But they are not suitable for everything. They can be extended by:

Polynomial regression

Step functions

Fit piecewise constant functions on distinct regions

Regression splines

Fit smoothly joining polynomials on distinct regions

Generalized additive models allow doing all of the above with multiple predictors

Adding a product term p1·p2 can help capture the interaction between two variables
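
A sketch of polynomial and interaction terms using scikit-learn's PolynomialFeatures (synthetic two-predictor data, assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))     # two predictors p1, p2
y = X[:, 0] ** 2 + 1.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

# degree-2 expansion adds p1^2, p2^2 and the interaction term p1*p2 to the linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))   # training R2; a real evaluation would use a held-out set
```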

January 16, 2020