Machine learning-fu
And everything else I learned during my internship, for later use:
- Principal component analysis (PCA) - reduces the number of dimensions in data by projecting it onto the few directions along which it varies most. In R: princomp
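- Cross-validation - splitting the data into two sets: one is used to train the classifiers and the other, held out, to check that the classifiers didn’t overfit. In k-fold cross-validation this is repeated k times, with a different part of the data held out each time.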
- Bagging, out of bag (oob) - each classifier is trained on a random sample of the original set, drawn with replacement (a bootstrap sample). Its performance is assessed on the objects that were left out of its sample (the out-of-bag part).
- Entropy and information gain - entropy is a measure of the randomness of information. The lower the entropy, the less randomness: an unbiased coin toss has an entropy of 1 bit, a coin with two heads has entropy 0. In the discrete case the formula is -\sum_{i \in \Omega} p(i)\log_2 p(i), where p(i) is the probability of event i (the base-2 logarithm gives the answer in bits). Information gain is the drop in entropy after splitting the data on some property.
- Decision tree - at each non-leaf node one property of the object is tested, and the result decides which child node comes next. Leaves are classifications. The tree is built by choosing, at each node, the property that gives the biggest information gain at that moment (considering all the information from ancestors). Single trees tend to overfit easily, so randomForest(tm) can be used: it trains many trees on random subsets of the data and of the properties and lets them vote, which adds randomness and stability. In R: randomForest
- Receiver operating characteristic (ROC) - a curve that assesses a classifier by plotting its true positive rate against its false positive rate. If the classifier makes its decision based on a threshold (e.g. when several classifiers are voting on an answer), ROC shows how changing the threshold trades true positives against false positives.

- Area under ROC (AUROC) - integrate the ROC curve and you’ll have a single number measuring the performance of the classifier. The ideal case is 1, random guessing gives 0.5. Cool interactive plots. In R: caTools::colAUC
- Hungarian algorithm - finds the best (minimum total distance) one-to-one matching in a bipartite graph, given a distance matrix. In R: clue::solve_LSAP
- F-measure - a useful fitness measure when the set is unbalanced and assigning all objects to one class gives pretty good accuracy, but is not what we want. It is the harmonic mean of precision (how many of the reported hits were correct) and recall (how many objects of a certain class were found), so it takes into account how many objects were found, not just how many hits there were.
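
To make the entropy and information gain notes concrete, here is a minimal sketch in plain Python (not R, so that it runs without extra packages). The function names `entropy` and `information_gain` and the toy labels are my own, made up for illustration. It uses base-2 logarithms, which is what gives the fair coin an entropy of exactly 1.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of `groups`.

    `groups` is assumed to be a partition of `labels`, e.g. the child
    nodes produced by testing one property in a decision tree.
    """
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# The coin examples from the list above.
print(entropy(["H", "T"]))  # fair coin -> 1.0
print(entropy(["H", "H"]))  # two-headed coin -> 0.0

# Hypothetical node with two candidate splits.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]
print(information_gain(parent, perfect_split))  # -> 1.0
print(information_gain(parent, useless_split))  # -> 0.0
```

A tree builder would compute the gain for every candidate property at a node and keep the one with the highest value, which here is the perfect split.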

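The two evaluation measures from the list, AUROC and F-measure, can also be sketched in a few lines of plain Python. Some assumptions on my side: the names `auroc` and `f_measure` are made up, the F-measure shown is the balanced variant (often called F1), and the area is not integrated directly. Instead it is computed through an equivalent shortcut: the fraction of (positive, negative) pairs in which the positive object gets the higher score.

```python
def auroc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive object scores higher; ties count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure(predicted, actual, positive=1):
    """Balanced F-measure: harmonic mean of precision and recall."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    if tp == 0:
        return 0.0  # nothing of the class was found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy scores: one positive is ranked below one negative.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # -> 0.75

# Unbalanced toy set: nine negatives, one positive.
actual = [0] * 9 + [1]
all_negative = [0] * 10        # 90% accurate, but finds nothing
one_false_alarm = [0] * 8 + [1, 1]
print(f_measure(all_negative, actual))  # -> 0.0
print(f_measure(one_false_alarm, actual))
```

The all-negative predictor is right nine times out of ten and still scores 0, which is exactly the unbalanced-set case the F-measure is meant to catch.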