Machine learning basics
3 min read
Basics
Standard deviation = how far, on average, each sample is from the mean
Variance is the square of the standard deviation
RSS, residual sum of squares = sum of squared differences between true values and estimates
RSE, residual standard error = how far, on average, the estimates are from the true values
R² is between 0 and 1; near 1 means the model explains most of the variability
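A minimal sketch of these three metrics; `y_true`, `y_pred`, and the predictor count `p` are made-up values, just for illustration:

```python
# Toy fitted values for a one-predictor model
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.3, 8.8]
p = 1
n = len(y_true)

# RSS: sum of squared residuals
rss = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))
# RSE: typical size of a residual, adjusted for degrees of freedom
rse = (rss / (n - p - 1)) ** 0.5
# R^2: fraction of the total variability explained by the model
mean_y = sum(y_true) / n
tss = sum((y - mean_y) ** 2 for y in y_true)
r2 = 1 - rss / tss
```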
t statistic
How many standard deviations the estimate is away from 0
 σ 68%
 2σ 95%
 3σ 99.7%
For example, ±2σ gives roughly the 95% confidence interval (1.96σ exactly)
Z value
For example, z = 2 covers roughly 95% of the distribution
P value
Probability of getting a result at least this extreme by chance, rather than through the effect of the predictor
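For a z (or large-sample t) statistic, the two-sided p-value is just the standard-normal tail mass; a sketch using only the standard library (the function name is mine):

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value: probability of a standard-normal draw
    landing at least |z| away from 0."""
    return math.erfc(abs(z) / math.sqrt(2))

# z of about 2 gives p of about 0.05, matching the 2-sigma rule
```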
Measure on the test set; good metrics on the training set have little meaning
ROC curve
True positive rate vs false positive rate
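A sketch of one ROC point; sweeping the threshold from high to low traces the whole curve (labels and scores below are toy values):

```python
def roc_point(y_true, scores, threshold):
    """True positive rate and false positive rate at one threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    return tp / pos, fp / neg
```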
Supervised learning
predicting or estimating an output based on some input
 1800s least squares and linear regression
 1936 LDA
 1940s logistic regression
 1970s generalized linear models
 1980s classification and regression trees
 1986 generalized additive models
Linear regression
 Is at least one predictor related to the response?
 Hypothesis test
 F statistic
Which variables are important?

Forward selection: start with no predictors and add them one by one, each time choosing the candidate that yields the lowest RSS

Backward selection: start with all predictors and drop them one by one, each time removing the one with the highest p-value

Mixed selection: combine both, adding and removing predictors as the fit evolves
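The forward step can be sketched generically; `rss_of` is a hypothetical callback that fits a model on a given set of predictors and returns its RSS:

```python
def forward_selection(candidates, rss_of):
    """Greedy forward selection: start empty; repeatedly add the
    candidate that lowers RSS the most, stop when nothing helps."""
    selected, remaining = [], list(candidates)
    current = rss_of(selected)
    while remaining:
        best = min(remaining, key=lambda c: rss_of(selected + [c]))
        best_rss = rss_of(selected + [best])
        if best_rss >= current:
            break  # no candidate improves the fit
        selected.append(best)
        remaining.remove(best)
        current = best_rss
    return selected
```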
How well does the model fit? Look at RSE and R²
Prediction
Inference is how features affect the outcome
Y = f(X) + ε
The error splits into reducible (from imperfectly estimating f) and irreducible (the noise ε)
How to estimate f?

Parametric: assume a functional form for f; simpler, but higher bias

Non-parametric: no assumed form; more flexible, but needs a lot of data
More flexible models are less interpretable
And they are not always more accurate, due to overfitting
Interpretability is good for inference or understanding the effect of a single parameter
Mean square error
degrees of freedom
Cross validation
Bias & Variance
Less flexible methods have more bias
In general more flexible methods have more variance
Good test performance requires low bias and low variance
Expected test MSE decomposes into squared bias, variance, and the irreducible error
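The decomposition can be checked numerically. Below, a deliberately shrunk (so biased) estimator of a hypothetical mean; all the constants are made up for illustration:

```python
import random

rng = random.Random(0)
true_value, shrink = 2.0, 0.8
estimates = []
for _ in range(5000):
    sample = [true_value + rng.gauss(0, 1) for _ in range(10)]
    estimates.append(shrink * sum(sample) / len(sample))  # biased estimator

mean_est = sum(estimates) / len(estimates)
bias = mean_est - true_value
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
# mse equals bias^2 + variance here (no irreducible term, because we
# compare estimates against the noiseless true value)
```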
Classification
 Training error
 Test error
Smallest test error = better classifier
Minimum test error rate is Bayes error rate and the resulting boundary is the Bayes decision boundary. This is analogous to the irreducible error
The Bayes classifier is the gold standard but mostly impossible in practice, because we don't know the conditional distribution of Y given X
Many approaches such as kNN attempt to estimate the conditional probability and plug it into the Bayes rule
one vs all
Is this class 1? Is this class 2? And so on
Binary Relevance
An ensemble of single-label binary classifiers is trained, one for each class.
Transform the multilabel problem into a set of binary classification problems, or adapt the algorithms
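One-vs-all prediction reduces to picking the most confident binary classifier; the scoring functions below are toy stand-ins for trained models:

```python
def one_vs_rest_predict(x, scorers):
    """scorers maps each label to a binary classifier's confidence
    that x belongs to that label; predict the most confident label."""
    return max(scorers, key=lambda label: scorers[label](x))
```

Usage with two invented per-class scorers:

```python
scorers = {"cat": lambda x: -x, "dog": lambda x: x}
one_vs_rest_predict(3.0, scorers)   # picks "dog"
```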
Resampling
 Cross validation
 k-fold CV: a compromise; more bias than LOOCV but less variance
 Leave-one-out CV (LOOCV): nearly unbiased but high variance and expensive
 Bootstrap: resampling with replacement
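A minimal bootstrap sketch; the statistic (the mean) and the sizes are arbitrary choices:

```python
import random

def bootstrap_means(data, n_boot, seed=0):
    """Draw n_boot resamples with replacement, each the same size
    as data, and collect the mean of each resample."""
    rng = random.Random(seed)
    return [
        sum(rng.choice(data) for _ in data) / len(data)
        for _ in range(n_boot)
    ]
```

The spread of the returned means estimates the sampling variability of the statistic.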
Logistic regression
Models the probability of belonging to a specific class
Between 0 and 1 naturally
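The logistic (sigmoid) function is what keeps the output in (0, 1); a one-predictor sketch with made-up coefficients:

```python
import math

def sigmoid(t):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + math.exp(-t))

def logistic_probability(x, beta0, beta1):
    """P(class = 1 | x) under a simple one-predictor logistic model."""
    return sigmoid(beta0 + beta1 * x)
```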
Linear discriminant analysis
Assumes a Gaussian distribution within each class
Class means are estimated from the training samples
The shared variance is also estimated from the samples
Assumes each class has the same covariance matrix
Get the linear decision boundaries between Gaussians
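For a single predictor, the LDA discriminant can be written down directly; the class with the largest score wins (the means, variance, and prior below are toy values):

```python
import math

def lda_discriminant(x, class_mean, shared_var, prior):
    """1-D LDA discriminant score; linear in x because the
    variance is shared across classes."""
    return (x * class_mean / shared_var
            - class_mean ** 2 / (2 * shared_var)
            + math.log(prior))
```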
Comparison of LDA and logistic regression
Both produce linear decision boundaries
Logistic regression parameters are estimated by maximum likelihood, while LDA parameters are computed from the estimated class means and shared variance under the normal assumption
Both create very similar boundaries. When the Gaussian assumption is off, logistic regression performs better
 kNN is non-parametric and highly flexible
QDA
Assumes each class has its own covariance matrix
QDA is a middle ground between LDA and kNN
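A minimal kNN sketch on 1-D features; the training pairs in the usage below are invented:

```python
from collections import Counter

def knn_predict(x, train, k=3):
    """train is a list of (feature, label) pairs; vote among the
    k nearest neighbours by absolute distance."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Usage:

```python
train = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.5, "b"), (6.0, "b")]
knn_predict(1.1, train)   # the two nearest points are both "a"
```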
Beyond linearity
Linear models have the advantage of interpretability and inference
But they are not suitable for everything. They can be extended by:
Polynomial regression
Step functions
Fit piecewise constant functions on distinct regions
Regression splines
Fit smoothly joining polynomials on distinct regions
Generalized additive models allow doing the above with multiple predictors
Adding a product term p1·p2 can help capture interactions between variables.
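Polynomial and interaction terms are just extra columns in the design matrix; a sketch of one expanded row (the layout is my own choice):

```python
def expand_features(p1, p2):
    """Row of a design matrix with intercept, linear, quadratic,
    and interaction (p1 * p2) terms."""
    return [1.0, p1, p2, p1 ** 2, p2 ** 2, p1 * p2]
```

A linear model fit on these columns is still linear in its coefficients, but curved and interaction-aware in the original predictors.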
January 16, 2020