In the past decade, machine learning has quickly become one of the hottest topics in the world of computer science. Given its wide range of highly monetizable applications, such as self-driving cars, stock market forecasting, digital marketing and healthcare, there is no denying its appeal. However, to apply machine learning effectively, it is important to have a strong understanding of its underlying principles: what it is, and when and how it can be used.
The world of machine learning is vast, complicated and rapidly evolving. Experts spend years learning the fundamentals, and once they do, keeping up to date with advancements in the field is a full-time job. For beginners trying to break into machine learning, there is a plethora of sources, and it can be challenging to find the best place to start. This blog is intended to be a jumping-off point that gives readers enough information to make informed decisions about which concepts to research further. With that said, here is a quick introduction to the goal of machine learning and some of its fundamental principles.
“The goal of machine learning is to develop methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or other outcomes of interest.” (Murphy, Machine Learning: A Probabilistic Perspective, 2012) Machine learning can derive valuable information from big data in ways that human beings cannot, because of the sheer size and complexity of the data. For this reason, the rapid growth of human-made data in recent years has been a strong driving force behind the rise of machine learning.
Depending on the data, and the knowledge to be gained from it, most machine learning problems are solved with one of two approaches: unsupervised or supervised. In the unsupervised approach, the goal is to discover valuable patterns in data. In the supervised approach, the goal is to predict future events based on the outcomes of previously analyzed data. Supervised learning is the most common form of machine learning and will be the main topic of this blog.
Supervised learning requires a “labeled” data set, meaning that the value to be predicted (the label) must be present for every example in the training set. For example, consider an initial dataset consisting of text-based email messages and a goal of training a machine to predict whether a given email message is spam. Solving this problem with the supervised approach requires that all emails in the training set are already correctly classified as either spam or not spam.
Supervised learning can be broken down further into two categories: classification and regression. A classification problem is one where the predicted value is one of a finite number of classes. The email filter is an example of a classification problem: the predicted value is either spam or not spam. In regression, the predicted value is continuous, meaning it can take on any value within a finite or infinite interval. An example of a regression problem is predicting a stock’s future price, which can be any non-negative number.
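To make the distinction concrete, here is a minimal sketch in plain Python. The data, the feature choices and the names `emails`, `prices` and `nearest_label` are all made up for illustration, and the "model" is a simple nearest-neighbor lookup rather than anything a production system would use. The point is only that both problems start from labeled examples, and that the classifier returns one of a fixed set of classes while the regressor returns a number.

```python
# Labeled training data: each example pairs its features with the known answer.

# Classification: the label is one of a finite set of classes.
# Features (hypothetical): (number of links, number of exclamation marks).
emails = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((0, 1), "not spam"),
    ((1, 0), "not spam"),
]

# Regression: the label is a continuous value.
# Feature (hypothetical): (today's price,), label: tomorrow's price.
prices = [
    ((100.0,), 101.5),
    ((102.0,), 101.0),
    ((98.0,), 99.2),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(training_set, features):
    """Return the label of the training example closest to `features`."""
    closest = min(training_set, key=lambda example: squared_distance(example[0], features))
    return closest[1]


predicted_class = nearest_label(emails, (7, 4))     # a class label
predicted_price = nearest_label(prices, (100.4,))   # a real number
```

The same lookup serves both problems here only because the example is so small; in practice the choice of model depends heavily on which kind of value is being predicted.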
Whether the problem is classification or regression, most machine learning applications share a common set of processes. Model selection, data cleaning, feature engineering, training and testing are all terms with which anyone interested in machine learning should be familiar. Perhaps the most important term is “model selection.” The model is a mathematical construct by which the machine “learns.” In this context, learning means refining a set of parameters based on the training data, so that the model can later make predictions on future data. Model selection can be very challenging because no model is universally suitable for all problems. The best-suited model varies case by case, depending on the format of the problem.
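As a rough illustration of what "refining a set of parameters" can mean, the sketch below fits a straight line to a handful of points using gradient descent. Everything here is an assumption made for the example: the synthetic data, the learning rate, the number of iterations, and the choice of a linear model with two parameters, `w` and `b`.

```python
# Synthetic training data generated from y = 3x + 2 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 2.0 for x in xs]
n = len(xs)

# The model is y_hat = w * x + b. Its parameters start at arbitrary values.
w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
    # How far off the current parameters are on each training example.
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Gradients of the mean squared error with respect to each parameter.
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    # "Learning": nudge each parameter in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# The refined parameters can now be used on an input that was not in the training data.
prediction = w * 5.0 + b
```

After the loop, `w` and `b` end up very close to the 3 and 2 used to generate the data, which is all that "learning" amounts to in this tiny case.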
Model selection is typically broken down into two parts: deciding what model to use, and then fine-tuning the settings of that model to achieve optimal performance. The second part, often referred to as hyperparameter optimization, is often done by systematic trial and error: each candidate value within a specified range is tested and the best-performing one is kept, a technique known as grid search. Though the process of finding the model itself is not as clear-cut, there are some general rules one can follow to make it a little simpler. In most applications, the format of the data and the variable(s) being predicted play a strong role in deciding which model is the best choice. For example, image classification and speech processing problems are typically strong candidates for a deep learning model. The length of time it takes to train the model can also be a determining factor. If minimizing training time is a priority, linear regression or a decision tree might be good choices. On the other hand, if accuracy is to be prioritized over training time, then a neural network or gradient-boosted trees might be better choices. While these rules are an oversimplification, they can still serve as a good introduction to model selection. For a more in-depth introduction, check out this additional blog on machine learning.
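The trial-and-error search described above can be sketched with NumPy. This is a hand-written illustration, not a recommended implementation: the synthetic data, the choice of a polynomial model, and the use of the polynomial degree as the hyperparameter are all assumptions made for the example. Libraries such as scikit-learn provide ready-made utilities for this kind of search.

```python
import numpy as np

# Synthetic data: a noisy sine curve (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=120)
y = np.sin(x) + rng.normal(0.0, 0.2, size=120)

# Hold out part of the data so each candidate is scored on examples it was not fit to.
x_train, y_train = x[:90], y[:90]
x_val, y_val = x[90:], y[90:]


def validation_error(degree):
    """Fit a polynomial of the given degree and return its mean squared error on the held-out data."""
    coefficients = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coefficients, x_val)
    return float(np.mean((predictions - y_val) ** 2))


# The "grid": every candidate value of the hyperparameter within a specified range.
candidate_degrees = range(1, 7)
errors = {degree: validation_error(degree) for degree in candidate_degrees}

# Keep the candidate with the lowest held-out error.
best_degree = min(errors, key=errors.get)
```

With a single hyperparameter the grid is just a list. With several, every combination of values is tried, which is why the cost of this approach grows quickly.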
Once the set of potential models has been narrowed down to a small enough size, the next step is to train and test each of them to find the one that performs best. The first part of this process is to analyze and clean the data. Data cleaning is important because inaccurate and malformed training data can significantly hinder the performance of the model. In almost all cases of machine learning, the data are too massive for this process to be completed manually, so it is important to be very comfortable in the programming language used to visualize and manipulate the data. Data cleaning typically involves programmatically combing through the data to find missing values and outliers, and then implementing a solution for dealing with those values. One of the reasons Python is so frequently used in machine learning is that it provides many libraries that can greatly simplify and expedite these processes. Libraries like pandas, numpy, matplotlib and seaborn all provide a very useful suite of tools for visualizing and manipulating large datasets. For more on dealing with bad data, here’s another good resource on missing data.
After improving the quality of the data, the next step is to train and test each candidate model to find which model, with which set of parameters, results in the best performance. To emulate the future data that the model will encounter in production, it is important to partition the data into a training set and a test set. Training and testing on the same data is bad practice and will yield results that are not indicative of what to expect in production. The model’s ability to make predictions on data it has already seen is of little importance, because it can simply look up the answer. Its ability to make predictions for future, unseen data is far more relevant.
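As a rough illustration of the data-cleaning step described above, the sketch below uses pandas to find a missing value and an outlier and then deal with both. The column names and values are hypothetical, and the two remedies shown (filling with the median, dropping rows outside 1.5 times the interquartile range) are only two of many reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one implausible outlier.
df = pd.DataFrame({
    "word_count": [120, 95, np.nan, 110, 105, 98, 5000],
    "num_links": [1, 0, 2, 1, 0, 3, 2],
})

# 1. Find missing values: count them per column.
missing_per_column = df.isna().sum()

# 2. One possible remedy: fill missing entries with the column median.
df["word_count"] = df["word_count"].fillna(df["word_count"].median())

# 3. Find outliers with the interquartile-range (IQR) rule and drop those rows.
q1, q3 = df["word_count"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["word_count"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].reset_index(drop=True)
```

Whether to fill, drop or otherwise handle such values depends on the dataset. The sketch only shows that the search and the fix can both be done programmatically.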
The training/test split is typically 80/20, but in cases where there are not enough data to thoroughly train and test the model, techniques such as cross-validation can be used.
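Both ideas can be sketched by hand with NumPy. The synthetic data, the straight-line model and the choice of five folds are assumptions made for this example. Libraries such as scikit-learn offer ready-made helpers for splitting and cross-validation, but writing it out shows the mechanics.

```python
import numpy as np

# Synthetic data: a noisy straight line (made up for illustration).
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)


def fit_and_score(train_idx, test_idx):
    """Fit a line on the training rows and return the mean squared error on the test rows."""
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    predictions = slope * x[test_idx] + intercept
    return float(np.mean((predictions - y[test_idx]) ** 2))


# 80/20 split: shuffle the row indices, then cut them into two disjoint sets.
shuffled = rng.permutation(len(x))
split_at = int(0.8 * len(x))
train_idx, test_idx = shuffled[:split_at], shuffled[split_at:]
holdout_mse = fit_and_score(train_idx, test_idx)

# 5-fold cross-validation: each fold takes one turn as the test set
# while the remaining folds are used for training.
folds = np.array_split(shuffled, 5)
fold_mse = []
for i, held_out in enumerate(folds):
    training = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    fold_mse.append(fit_and_score(training, held_out))
cv_mse = float(np.mean(fold_mse))
```

In cross-validation every row is used for testing exactly once, so the averaged score makes fuller use of a small dataset than a single 80/20 split does.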
After finding the learning method and set of parameters that perform best, performance can be improved further through feature engineering. In most machine learning applications, the data consist of rows and columns. Each row is a single entity or observation in the data and each column describes a specific feature. For example, in the email filter, each email in the training set would have its own row and each column would describe a specific feature of the emails. Feature engineering is the process of adding new, meaningful features derived from existing ones. When done correctly, these additional features can greatly improve the performance of the model. You can also learn more about Feature Engineering here.
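For the email filter, feature engineering might look something like the pandas sketch below. The emails, the column names and the particular features derived here are all hypothetical, chosen only to show new columns being computed from existing ones. Whether any of them actually helps a given model has to be checked by retraining and testing.

```python
import pandas as pd

# Hypothetical raw data: one row per email, one column per original feature.
emails = pd.DataFrame({
    "subject": ["WIN A FREE PRIZE!!!", "Meeting notes", "Free offer, act now!"],
    "body": [
        "Click http://example.com now",
        "See attached agenda",
        "Visit http://example.com and http://example.org",
    ],
})

# New features derived from the existing columns.
emails["subject_length"] = emails["subject"].str.len()
emails["exclamation_count"] = emails["subject"].str.count("!")
emails["link_count"] = emails["body"].str.count("http://")
emails["subject_is_all_caps"] = emails["subject"].str.isupper()
emails["mentions_free"] = emails["subject"].str.lower().str.contains("free")
```

Each derived column turns something a person might notice about an email into a number or flag that a model can use directly.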
While the concepts introduced in this blog are a great starting point, they are certainly not a comprehensive list of all the topics within the realm of machine learning. There is no cookbook of recipes that can be applied to all problems, and no single book that has all the answers. The best way to become an expert in machine learning is to keep learning: constantly gather new information from a variety of reliable sources, starting with the basics. From there you can begin to form your own hypotheses and conduct your own experiments to further cement your understanding.