In this paper, the author summarizes twelve key lessons that will be useful for machine learning researchers and practitioners. In the last decade the use of machine learning has spread rapidly, from spam filters to drug design. The purpose of the paper is to provide the folk knowledge of machine learning that is not available in current machine learning textbooks. The author focuses on classification, as it is the most mature and widely used type of machine learning. A classifier is a system that inputs a vector of discrete and/or continuous feature values and outputs a single discrete value, the class (e.g., a spam filter for e-mail).
1. Learning = Representation + Evaluation + Optimization: In the first lesson the author highlights the criteria for choosing among learning algorithms. Thousands of algorithms are available, but all of them are combinations of three vital components: representation, evaluation, and optimization. A classifier must be represented in a formal language that the computer can handle; choosing a representation for a learner fixes the set of classifiers it can possibly learn, called the hypothesis space. If a classifier is not in the hypothesis space, it cannot be learned. The evaluation function plays an important role in distinguishing good classifiers from bad ones. Optimization, the method used to search among the classifiers, plays a key role in the efficiency of the learner. The author provides examples of each of these three components, such as k-nearest neighbor, hyperplanes, and decision trees.
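To make the three components concrete, here is a minimal sketch in Python, assuming scikit-learn and its built-in iris data as a stand-in for a real problem: the decision tree is the representation, accuracy is the evaluation function, and the greedy split search inside fit() is the optimization.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier   # representation: decision trees
    from sklearn.metrics import accuracy_score        # evaluation: accuracy

    X, y = load_iris(return_X_y=True)                 # toy data (an assumption, not from the paper)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Optimization: fit() performs a greedy search over candidate splits.
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))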
2. It's Generalization That Counts: In this section the author emphasizes the need to keep training and test data separate. It is important to generalize beyond the examples in the training set, because we are unlikely to encounter those exact examples again at test time. The classifier gets contaminated if one uses the test data to tune parameters. To mitigate such issues we can use cross-validation: randomly dividing the training data into, say, ten subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well a particular parameter setting does. Whether one uses a flexible classifier (e.g., decision trees) or a linear classifier, this separation of training and test data has to be maintained.
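A minimal sketch of this procedure, assuming scikit-learn (cross_val_score handles the hold-out-and-average loop) and the iris data as a toy stand-in:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # max_depth=3 is the setting being evaluated

    # 10-fold CV: each subset is held out once while training on the other nine;
    # the mean of the ten scores estimates how well this parameter setting does.
    scores = cross_val_score(clf, X, y, cv=10)
    print("mean CV accuracy:", scores.mean())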
3. Data Alone Is Not Enough: The availability of huge amounts of data alone will not make machine learning work. One has to bring in general assumptions such as smoothness, similar examples having similar classes, limited dependencies, or limited complexity. The author notes that induction (learning general rules from specific examples) is the inverse of deduction (going from general rules to specific facts). The most useful learners are those that do not just have assumptions hard-wired into them, but also leave room for us to tweak them. In this section the author compares machine learning to farming: farmers combine seeds with nutrients to grow crops, whereas learners combine knowledge with data to grow programs.
4. Overfitting Has Many Faces: When the knowledge and data available are insufficient to completely determine the correct classifier, the learner risks hallucinating a classifier that merely encodes quirks of the data; this is termed overfitting. When a classifier is 100% accurate on the training data but only 50% accurate on test data, when it could have been 75% accurate on both, it has overfit. The best way to understand overfitting is by decomposing generalization error into bias and variance, which the author explains with the analogy of throwing darts at a board. A linear learner has high bias while a decision tree has low bias; similarly, in optimization, beam search has lower bias than greedy search but higher variance. In machine learning, strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting. Overfitting can be combated with cross-validation, by adding a regularization term, or by using statistical significance tests such as chi-square; but it is easy to avoid overfitting (variance) only by falling into the opposite error of underfitting (bias). The closely related problem of multiple testing also leads to overfitting.
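A small sketch of overfitting, assuming scikit-learn and a synthetic dataset: an unrestricted decision tree fits the training data almost perfectly but generalizes worse than a depth-limited one.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (None, 3):   # None = grow the tree without limit; 3 = a crude regularizer
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print("max_depth =", depth,
              "train acc:", round(clf.score(X_tr, y_tr), 2),
              "test acc:", round(clf.score(X_te, y_te), 2))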
5. Intuition Fails In High Dimensions: Many algorithms that work fine with low-dimensional inputs fail in high dimensions; this is the curse of dimensionality. In high dimensions all examples look alike: if the examples are laid out on a d-dimensional grid, a test example x_t's 2d nearest examples are all at the same distance from it, so as the dimensionality increases more and more examples become nearest neighbors of x_t and the choice among them becomes effectively random. Our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant shell around it. There is an effect that partly counteracts the curse, the "blessing of non-uniformity": in most applications examples are not spread uniformly throughout the instance space but are concentrated on or near a lower-dimensional manifold.
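The concentration of distances behind this effect is easy to see numerically; here is a sketch using plain NumPy with uniformly random points (an assumption made purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((500, d))               # 500 random points in the unit hypercube
        q = rng.random(d)                      # a query point
        dists = np.linalg.norm(X - q, axis=1)
        # As d grows, the farthest point is barely farther than the nearest one,
        # so "nearest neighbor" carries less and less information.
        print(d, round((dists.max() - dists.min()) / dists.min(), 2))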
6. Theoretical Guarantees Are Not What They Seem: Learning is a complex phenomenon and theoretical guarantees should not be taken at face value. Since induction cannot be justified deductively, we have to settle for probabilistic guarantees. The most common type is a bound on the number of examples needed to ensure that a learner which returns a consistent hypothesis probably generalizes well; unfortunately such bounds, while theoretically sound, are usually far too loose to say anything about the accuracy of learning in practice. Given a large enough training set, there is a high probability that the learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis at all. Another type of theoretical guarantee the author mentions is the asymptotic one: given infinite data the learner is guaranteed to output the correct classifier, but if learner A is better than learner B given infinite data, B is often better than A given finite data.
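As an illustration of the kind of probabilistic guarantee meant here (a standard textbook bound, not one stated explicitly in this summary): if a "bad" hypothesis has true error greater than \epsilon, the probability that it is consistent with n independent training examples is at most (1-\epsilon)^n, and a union bound over the hypothesis space H gives

    \Pr[\exists\, h \in H:\ \mathrm{err}(h) > \epsilon \text{ and } h \text{ consistent with the sample}] \;\le\; |H|\,(1-\epsilon)^n,

so n \ge \frac{1}{\epsilon}\bigl(\ln|H| + \ln\frac{1}{\delta}\bigr) examples suffice to drive this probability below \delta. The required n grows only logarithmically with |H|, but most interesting hypothesis spaces are astronomically large, which is why such bounds are usually too loose to be useful.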
7. Feature Engineering Is The Key: The ease or difficulty of a machine learning problem is largely determined by the features used. Learning is easy when many independent features correlate well with the class. Often the raw data are not in a form suitable for learning, and one has to spend time constructing features from them that make learning easy. A major chunk of the time in a machine learning project is spent on feature design rather than on learning itself, and much of it goes into gathering the data, integrating it, pre-processing it, and the trial and error of feature construction. Feature engineering is difficult because it is domain-specific, which is why it is important to automate more and more of the feature engineering process. Sometimes features that look irrelevant in isolation are relevant in combination (as the sketch below illustrates), so one has to master the art of feature engineering.
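A tiny sketch of this last point, assuming scikit-learn and an artificial XOR-style label: neither raw feature predicts the class on its own, but their hand-built product does.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x1 = rng.choice([-1, 1], size=1000)
    x2 = rng.choice([-1, 1], size=1000)
    y = (x1 * x2 > 0).astype(int)                       # class depends only on the combination

    raw = np.column_stack([x1, x2])
    engineered = np.column_stack([x1, x2, x1 * x2])     # engineered interaction feature

    print(LogisticRegression().fit(raw, y).score(raw, y))                  # about 0.5 (chance level)
    print(LogisticRegression().fit(engineered, y).score(engineered, y))    # about 1.0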
8. More Data Beats A Cleverer Algorithm: In this section the author discusses the importance of gathering data: a dumb algorithm with lots and lots of data beats a clever one with a modest amount of it. The challenge, however, is to design classifiers that learn from large data in a small amount of time. One should try simple learners before moving to sophisticated ones (naive Bayes before logistic regression, k-nearest neighbor before support vector machines), because the sophisticated ones are harder to use and have more knobs that need turning to get good results. Learners can be divided into those whose representation has a fixed size (e.g., linear classifiers) and those whose representation grows with the data (e.g., decision trees). In principle, variable-size learners can learn any function given enough data, but in practice they are limited by the algorithm, computational cost, and the curse of dimensionality. Hence clever algorithms that make the most of the data and computing resources available often pay off in machine learning.
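A minimal illustration of trying the simple learner first, assuming scikit-learn and a synthetic dataset standing in for real data; naive Bayes has essentially no knobs to turn, so it makes a sensible baseline before anything more sophisticated.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Start with the learner that has the fewest hyperparameters and only move
    # up the ladder of sophistication if the baseline is not good enough.
    for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
        print(type(model).__name__, round(cross_val_score(model, X, y, cv=5).mean(), 3))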
9. Learn Many Models, Not Just One: The author highlights the benefit of combining many variations of a learner, which produces better results than any single one; creating model ensembles is now standard practice in machine learning. In bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting, which greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights that are varied so that each new classifier focuses on the examples the previous ones got wrong. In stacking, the outputs of individual classifiers become the inputs of a higher-level learner that figures out how best to combine them. The author also mentions how teams combined their learners to get the best results in the Netflix Prize. Model ensembles should not be confused with Bayesian model averaging (BMA): ensembles change the hypothesis space and can take a wide variety of forms, whereas BMA assigns weights to hypotheses in the original space according to a fixed formula.
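A minimal bagging sketch, assuming scikit-learn: BaggingClassifier resamples the training set, fits a tree on each replicate, and combines their predictions by voting, typically reducing the variance of a single deep tree.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

    print("single tree :", round(cross_val_score(single, X, y, cv=5).mean(), 3))
    print("bagged trees:", round(cross_val_score(bagged, X, y, cv=5).mean(), 3))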
10. Simplicity Does Not Imply Accuracy: The common claim is that of two classifiers with the same training error, the simpler of the two will likely have the lower test error. There is some evidence for this, but there are also many counterexamples; one of them is the generalization error of a boosted ensemble, which continues to improve as classifiers are added even after the training error has reached zero. A more sophisticated view equates complexity with the size of the hypothesis space, since smaller spaces allow hypotheses to be represented by shorter codes. But if we prefer simpler hypotheses and the hypotheses we choose turn out to be accurate, it is because our preferences are accurate, not because the hypotheses are simple. The author concludes the section by noting that simpler hypotheses should still be preferred, because simplicity is a virtue in its own right rather than a proxy for accuracy.
11. Representable Does Not Imply Learnable: Just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. Given finite data, time, and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets differ for learners with different representations. Some representations are exponentially more compact than others for some functions, and a learner using the more compact representation may need exponentially less data to learn those functions. Finding methods to learn these deeper representations is a major research area in machine learning.
12. Correlation Does Not Imply Causation: In the last lesson of the paper the author brings up the distinction between correlation and causation. The correlations a learner finds need not be causal. For example, if a supermarket's retail data show that beer and diapers are often bought together, one might conclude that putting the beer and diaper sections next to each other will increase sales; this is a plausible machine learning observation, but short of actually running the experiment it is hard to verify. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted. Machine learning researchers should therefore be aware of whether they are predicting causal effects or merely correlations between variables.
The author concludes the paper by pointing to various resources that will help the reader develop machine learning skills.