In this paper, the author summarizes twelve key lessons that will be useful for machine learning researchers and practitioners. In the last decade the use of machine learning has spread rapidly, from spam filters to drug design. The purpose of the paper is to provide the folk knowledge of machine learning that is not available in current machine learning textbooks. The author focuses on classification, as it is the most mature and widely used type of machine learning. A classifier is a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class (e.g., a spam filter for e-mail).
1. Learning = Representation + Evaluation + Optimization: In the first lesson the author highlights the criteria for choosing among learning algorithms. Thousands of algorithms are available, but all of them are combinations of three vital components: representation, evaluation, and optimization. A classifier must be represented in a formal language that the computer can handle; choosing a representation for a learner fixes the set of classifiers it can possibly learn, called the hypothesis space. If a classifier is not in the hypothesis space, it cannot be learned. The evaluation function plays an important role in distinguishing good classifiers from bad ones. Optimization, the method used to search among the classifiers, plays a key role in the efficiency of the learner. The author provides examples of each of these three components, such as k-nearest neighbor, hyperplanes, and decision trees.
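To make the three components concrete, here is a minimal sketch in Python, assuming scikit-learn and its built-in iris data as a stand-in for a real problem: the decision tree is the representation, accuracy is the evaluation function, and the greedy split search inside fit() is the optimization.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier   # representation: decision trees
    from sklearn.metrics import accuracy_score        # evaluation: accuracy

    X, y = load_iris(return_X_y=True)                 # toy data (an assumption, not from the paper)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Optimization: fit() performs a greedy search over candidate splits.
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))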
2. It's Generalization That Counts: In this section the author emphasizes the need to keep training and test data separate. It is important to generalize beyond the examples in the training set, because we are unlikely to encounter those exact examples again at test time. The classifier gets contaminated if one uses the test data to tune parameters. To mitigate such issues we can use cross-validation: randomly dividing the training data into, say, ten subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well a particular parameter setting does. Whether one uses a flexible classifier (e.g., decision trees) or a linear classifier, this separation of training and test data has to be maintained.
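A minimal sketch of this procedure, assuming scikit-learn (cross_val_score handles the hold-out-and-average loop) and the iris data as a toy stand-in:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # max_depth=3 is the setting being evaluated

    # 10-fold CV: each subset is held out once while training on the other nine;
    # the mean of the ten scores estimates how well this parameter setting does.
    scores = cross_val_score(clf, X, y, cv=10)
    print("mean CV accuracy:", scores.mean())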
3. Data Alone Is Not Enough: The availability of huge amounts of data alone will not make machine learning work. One has to bring in general assumptions such as smoothness, similar examples having similar classes, limited dependencies, or limited complexity. The author notes that induction (learning general rules from specific examples) is the inverse of deduction (going from general rules to specific facts). The most useful learners are those that do not just have assumptions hard-wired into them, but also leave room for us to tweak them. In this section the author compares machine learning to farming: farmers combine seeds with nutrients to grow crops, whereas learners combine knowledge with data to grow programs.
4. Overfitting Has Many Faces: When the knowledge and data available are insufficient to completely determine the correct classifier, the learner risks hallucinating a classifier that merely encodes quirks of the data; this is termed overfitting. When a classifier is 100% accurate on the training data but only 50% accurate on test data, when it could have been 75% accurate on both, it has overfit. The best way to understand overfitting is by decomposing generalization error into bias and variance, which the author explains with the analogy of throwing darts at a board. A linear learner has high bias while a decision tree has low bias; similarly, in optimization, beam search has lower bias than greedy search but higher variance. In machine learning, strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting. Overfitting can be combated with cross-validation, by adding a regularization term, or by using statistical significance tests such as chi-square; but it is easy to avoid overfitting (variance) only by falling into the opposite error of underfitting (bias). The closely related problem of multiple testing also leads to overfitting.
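A small sketch of overfitting, assuming scikit-learn and a synthetic dataset: an unrestricted decision tree fits the training data almost perfectly but generalizes worse than a depth-limited one.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (None, 3):   # None = grow the tree without limit; 3 = a crude regularizer
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print("max_depth =", depth,
              "train acc:", round(clf.score(X_tr, y_tr), 2),
              "test acc:", round(clf.score(X_te, y_te), 2))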
5. Intuition Fails In High Dimensions: Many algorithms that work fine with low-dimensional inputs fail in high dimensions; this is the curse of dimensionality. In high dimensions all examples look alike: if the examples are laid out on a d-dimensional grid, a test example x_t's 2d nearest examples are all at the same distance from it, so as the dimensionality increases more and more examples become nearest neighbors of x_t and the choice among them becomes effectively random. Our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant shell around it. There is an effect that partly counteracts the curse, the "blessing of non-uniformity": in most applications examples are not spread uniformly throughout the instance space but are concentrated on or near a lower-dimensional manifold.
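The concentration of distances behind this effect is easy to see numerically; here is a sketch using plain NumPy with uniformly random points (an assumption made purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))               # 500 random points in the unit hypercube
        q = rng.random(d)                      # a query point
        dists = np.linalg.norm(X - q, axis=1)
        # As d grows, the farthest point is barely farther than the nearest one,
        # so "nearest neighbor" carries less and less information.
        print(d, round((dists.max() - dists.min()) / dists.min(), 2))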
6. Theoretical Guarantees Are Not What They Seem: Learning is a complex phenomenon and theoretical guarantees should not be taken at face value. Since induction cannot be justified deductively, we have to settle for probabilistic guarantees. The most common type is a bound on the number of examples needed to ensure that a learner which returns a consistent hypothesis probably generalizes well; unfortunately such bounds, while theoretically sound, are usually far too loose to say anything about the accuracy of learning in practice. Given a large enough training set, there is a high probability that the learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis at all. Another type of theoretical guarantee the author mentions is the asymptotic one: given infinite data the learner is guaranteed to output the correct classifier, but if learner A is better than learner B given infinite data, B is often better than A given finite data.
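As an illustration of the kind of probabilistic guarantee meant here (a standard textbook bound, not one stated explicitly in this summary): if a "bad" hypothesis has true error greater than \epsilon, the probability that it is consistent with n independent training examples is at most (1-\epsilon)^n, and a union bound over the hypothesis space H gives

    \Pr[\exists\, h \in H:\ \mathrm{err}(h) > \epsilon \text{ and } h \text{ consistent with the sample}] \;\le\; |H|\,(1-\epsilon)^n,

so n \ge \frac{1}{\epsilon}\bigl(\ln|H| + \ln\frac{1}{\delta}\bigr) examples suffice to drive this probability below \delta. The required n grows only logarithmically with |H|, but most interesting hypothesis spaces are astronomically large, which is why such bounds are usually too loose to be useful.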
7. Feature Engineering Is The Key: The ease or difficulty of a machine learning problem is largely determined by the features used. Learning is easy when many independent features correlate well with the class. Often the raw data are not in a form suitable for learning, and one has to spend time constructing features from them that make learning easy. A major chunk of the time in a machine learning project is spent on feature design rather than on learning itself, and much of it goes into gathering the data, integrating it, pre-processing it, and the trial and error of feature construction. Feature engineering is difficult because it is domain-specific, which is why it is important to automate more and more of the feature engineering process. Sometimes features that look irrelevant in isolation are relevant in combination (as the sketch below illustrates), so one has to master the art of feature engineering.
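A tiny sketch of this last point, assuming scikit-learn and an artificial XOR-style label: neither raw feature predicts the class on its own, but their hand-built product does.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x1 = rng.choice([-1, 1], size=1000)
    x2 = rng.choice([-1, 1], size=1000)
    y = (x1 * x2 > 0).astype(int)                       # class depends only on the combination

    raw = np.column_stack([x1, x2])
    engineered = np.column_stack([x1, x2, x1 * x2])     # engineered interaction feature

    print(LogisticRegression().fit(raw, y).score(raw, y))                  # about 0.5 (chance level)
    print(LogisticRegression().fit(engineered, y).score(engineered, y))    # about 1.0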
8. More Data Beats A Cleverer Algorithm: In this section the author discusses the importance of gathering data: a dumb algorithm with lots and lots of data beats a clever one with a modest amount of it. The challenge, however, is to design classifiers that learn from large data in a small amount of time. One should try simple learners before moving to sophisticated ones (naive Bayes before logistic regression, k-nearest neighbor before support vector machines), because the sophisticated ones are harder to use and have more knobs that need turning to get good results. Learners can be divided into those whose representation has a fixed size (e.g., linear classifiers) and those whose representation grows with the data (e.g., decision trees). In principle, variable-size learners can learn any function given enough data, but in practice they are limited by the algorithm, computational cost, and the curse of dimensionality. Hence clever algorithms that make the most of the data and computing resources available often pay off in machine learning.
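A minimal illustration of trying the simple learner first, assuming scikit-learn and a synthetic dataset standing in for real data; naive Bayes has essentially no knobs to turn, so it makes a sensible baseline before anything more sophisticated.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Start with the learner that has the fewest hyperparameters and only move
    # up the ladder of sophistication if the baseline is not good enough.
    for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
        print(type(model).__name__, round(cross_val_score(model, X, y, cv=5).mean(), 3))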
9. Learn Many Models, Not Just One: The author highlights the benefit of combining many variations of a learner, which produces better results than any single one; creating model ensembles is now standard practice in machine learning. In bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting, which greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights that are varied so that each new classifier focuses on the examples the previous ones got wrong. In stacking, the outputs of individual classifiers become the inputs of a higher-level learner that figures out how best to combine them. The author also mentions how teams combined their learners to get the best results in the Netflix Prize. Model ensembles should not be confused with Bayesian model averaging (BMA): ensembles change the hypothesis space and can take a wide variety of forms, whereas BMA assigns weights to hypotheses in the original space according to a fixed formula.
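A minimal bagging sketch, assuming scikit-learn: BaggingClassifier resamples the training set, fits a tree on each replicate, and combines their predictions by voting, typically reducing the variance of a single deep tree.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

    print("single tree :", round(cross_val_score(single, X, y, cv=5).mean(), 3))
    print("bagged trees:", round(cross_val_score(bagged, X, y, cv=5).mean(), 3))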
10. Simplicity Does Not Imply Accuracy: The common claim is that of two classifiers with the same training error, the simpler of the two will likely have the lower test error. There is some evidence for this, but there are also many counterexamples; one of them is the generalization error of a boosted ensemble, which continues to improve as classifiers are added even after the training error has reached zero. A more sophisticated view equates complexity with the size of the hypothesis space, since smaller spaces allow hypotheses to be represented by shorter codes. But if we prefer simpler hypotheses and the hypotheses we choose turn out to be accurate, it is because our preferences are accurate, not because the hypotheses are simple. The author concludes the section by noting that simpler hypotheses should still be preferred, because simplicity is a virtue in its own right rather than a proxy for accuracy.
11. Representable Does Not Imply Learnable: Just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. Given finite data, time, and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets differ for learners with different representations. Some representations are exponentially more compact than others for some functions, and a learner using the more compact representation may need exponentially less data to learn those functions. Finding methods to learn these deeper representations is a major research area in machine learning.
12. Correlation Does Not Imply Causation: In the last lesson of the paper the author brings up the distinction between correlation and causation. The correlations a learner finds need not be causal. For example, if a supermarket's retail data show that beer and diapers are often bought together, one might conclude that putting the beer and diaper sections next to each other will increase sales; this is a plausible machine learning observation, but short of actually running the experiment it is hard to verify. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted. Machine learning researchers should therefore be aware of whether they are predicting causal effects or merely correlations between variables.
The author concludes the paper by pointing to various resources that will help the reader develop machine learning skills.