Classification and prediction with AdaBoost

James McCaffrey

Classification is a machine learning technique that uses training data to generate a model (usually a single complex rule or mathematical equation) that assigns data items to one of several distinct categories. The model can then be used to predict the category of new data items whose category is unknown. Examples include predicting whether a patient will have cancer (yes, no) based on various medical test data, or predicting the risk category (low, medium, high) of a loan application based on an applicant's financial background.

There are many different classification algorithms and techniques, including Naive Bayes, neural networks, and logistic regression. In this article, I'll explain a fascinating technique called AdaBoost classification. Instead of using a single complex rule for predictions, the training data is used to generate a large collection of very simple rules of thumb. A weight is then computed for each rule of thumb. A prediction for a new input is made by combining the rules of thumb, taking the weight of each simple rule into account, to produce an overall result. The term "boost" comes from the fact that the predictive quality of the simple rules is "boosted" (improved) by the combination.

AdaBoost is a metaheuristic. By that I mean that AdaBoost is a set of guidelines that can be used to create a specific classification algorithm. There are many variants of the AdaBoost algorithm, as well as many stand-alone tools that implement some form of AdaBoost. So why bother coding an AdaBoost classifier from scratch? Customizing existing AdaBoost classification tools can prove difficult or even impossible, the tools may be difficult to integrate into a software system, and there may be copyright or intellectual-property issues.

The concept is best explained with a concrete example. Take a look at Figure 1. The goal of the demo program is to predict whether a Seattle team will win or lose an upcoming game against a Detroit team. The assumptions are that Seattle is playing at its home field (the Field value is "Home") and that the point spread (the Spread value) is "Small," meaning Seattle is favored by a small amount. The top part of Figure 1 shows 10 hypothetical training data items with known results. The first training tuple, (Detroit, Home, Large, Win), means that in a previous game against Detroit, played at Seattle's home field with a Large point spread, Seattle won.

Figure 1: AdaBoost for classification and prediction

The next part of the demo program shows that the 10 training data items have been converted from strings to a zero-based integer format for more efficient processing, and then stored in memory as a matrix. For example, (Detroit, Home, Large, Win) is stored as (3, 0, 2, 1). Note that the result to be predicted in this example has only two values, Win or Lose, so this is a binary classification problem. AdaBoost can also be used in situations where the dependent variable has three or more values (multinomial classification). Most binary classification techniques encode the dependent variable to be predicted using a (0, 1) scheme, but AdaBoost almost always uses a (-1, +1) scheme because that encoding simplifies part of the algorithm implementation. Note that all the independent predictor values are categorical (Home, Medium and so on) rather than numeric. As I explain later, AdaBoost can also be used when the training data consists of numeric values.

The third section of Figure 1 shows that eight rules of thumb were generated from the training data. Rules of thumb are usually called "weak learners" or "weak classifiers" in AdaBoost terminology. Expressed in user-friendly form, the first weak learner is "IF Opponent IS Buffalo THEN Result IS Win," with a raw error rate of 0.00. The second weak learner is "IF Opponent IS Chicago THEN Result IS Lose," with a raw error rate of 0.33. Where does that weak learner come from? If you look at the training data, you'll notice that Chicago is the opponent in three instances, and in two of those instances the result was Lose. So the weak learner is correct in two of three cases (0.67) and incorrect in one of three cases (0.33).

Note that not every predictor value in the training data produced a weak learner: there is no rule for situations where the opponent is Atlanta. Because there are two training tuples in which the opponent is Atlanta, one with result Win and one with result Lose, such a learner's error rate would be 0.50, which provides no useful information. The AdaBoost classification described in this article assumes that all weak learners have a raw error rate of less than 0.50.

The next section of Figure 1 indicates that, behind the scenes, the AdaBoost algorithm processes the weak learners to find a weight for each one. These weights measure the importance of each weak classifier and are called "alpha values" in AdaBoost terminology. Determining the alpha values is the key part of AdaBoost. The classification model consists of the set of weak learners together with their alpha weights.

The last part of the demo program shows the classification model being used to predict the outcome for Seattle when Opponent is "Detroit," Field is "Home" and Spread is "Small." If you look at the set of generated weak learners, you'll see that three of them apply: "[2] IF Opponent IS Detroit THEN Result IS Win (+1)," "[3] IF Field IS Home THEN Result IS Win (+1)" and "[5] IF Spread IS Small THEN Result IS Lose (-1)." The computed alpha values for weak learners 2, 3 and 5 are 0.63, 3.15 and 4.49, respectively. The prediction is therefore the sum (0.63)(+1) + (3.15)(+1) + (4.49)(-1) = -0.72 (rounded), which means Lose because the value is negative. Even though two of the three applicable weak learners (2 and 3) predict Win, the large alpha value of weak learner 5 outweighs those Win predictions, so the overall prediction is Lose.
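To make the voting mechanism concrete, here is a minimal C# sketch of the weighted vote (the method signature, the assumption that the input is already encoded as zero-based integers and the assumption that each used learner carries its alpha weight are mine; the demo's actual Classify method may differ):

// Minimal sketch of the weighted-vote prediction described above.
// input is the encoded unknown tuple, for example { 3, 0, 0 } for (Detroit, Home, Small).
static double Classify(List<Learner> learners, List<int> bestLearners, int[] input)
{
  double sum = 0.0;
  foreach (int idx in bestLearners)          // only learners that were assigned an alpha weight
  {
    Learner lrn = learners[idx];
    if (input[lrn.feature] == lrn.value)     // does this rule of thumb apply to the input?
      sum += lrn.alpha * lrn.predicted;      // for example, (0.63)(+1) + (3.15)(+1) + (4.49)(-1)
  }
  return sum;                                // negative means Lose, positive means Win
}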

In the sections that follow, I'll explain in detail how the demo program works so that you can add prediction features to a .NET system or application. This article assumes you have advanced programming skills with a C-family language, but it does not assume you know anything about AdaBoost classification. I coded the demo using C#, but the explanations should be enough for you to refactor the code into other programming languages such as Visual Basic .NET or Python. The demo program is too long to list in its entirety in this article; you can download the complete source code from archive.msdn.microsoft.com/mag201304AdaptiveBoost.

The AdaBoost algorithm

At the heart of AdaBoost classification is a routine that examines each weak learner and assigns it an alpha weight. The algorithm is fairly complicated and is best explained with a concrete example. Suppose there are 10 training tuples and 8 weak learners, as shown in Figure 1. Each training tuple is assigned a weight, usually called "D" in the AdaBoost literature. The D weights sum to 1.0, so they form a distribution. Initially, all training data items are assigned the same D weight, in this case 0.10 because there are 10 items:
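[0] 0.10   [1] 0.10   [2] 0.10   [3] 0.10   [4] 0.10
[5] 0.10   [6] 0.10   [7] 0.10   [8] 0.10   [9] 0.10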

Every learner has an epsilon value and an alpha value. Epsilon values are weighted error rates that are used to compute the alpha values. Initially, all learners have unknown alpha values (a = 0.00) and unknown epsilon values (e = -1.0):
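[0] a = 0.00, e = -1.0
[1] a = 0.00, e = -1.0
[2] a = 0.00, e = -1.0
[3] a = 0.00, e = -1.0
[4] a = 0.00, e = -1.0
[5] a = 0.00, e = -1.0
[6] a = 0.00, e = -1.0
[7] a = 0.00, e = -1.0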

In pseudocode, the algorithm for finding the alpha weight for each learner looks roughly like this:
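set t = 0
while there are unused weak learners loop
  update the epsilon value of every unused learner using the current D weights
  find the unused learner with the smallest epsilon (the best learner)
  if the best epsilon is too close to 0.50 (for example, greater than 0.45) then exit loop
  compute the best learner's alpha = 0.5 * log((1 - epsilon) / epsilon)
  mark the best learner as used
  for each training tuple that the best learner applies to
    D(new) = D(old) * exp(-alpha * actualY * predictedY)
  end for
  normalize all D weights so they sum to 1.0
  increment t
  if t > maxT then exit loop
end loop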

The main processing loop terminates when all weak learners have been processed and assigned an alpha weight, when the loop counter variable t exceeds a maximum value, or when the weighted error rate (epsilon) of the best unused weak learner is close to 0.50, such as 0.45 or 0.49, indicating that there are no relatively good unused learners left to process.

The first step inside the loop is to update all the epsilon values. An epsilon value is the sum of the D weights of the applicable training tuples that are classified incorrectly. For learner [0] (IF Opponent IS Buffalo THEN Result IS Win) there are two applicable training tuples, [2] and [3], and the rule is correct in both instances, so epsilon is 0.00. For learner [1] (IF Opponent IS Chicago THEN Result IS Lose) there are three applicable training tuples, [5], [6] and [7]. Of these, tuple [5] is classified incorrectly, so epsilon is just the D weight for tuple [5], which is 0.10.
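In C#, the epsilon computation for a single weak learner might be sketched like this (the method name, the parameters and the 0 = Lose, 1 = Win encoding of the last matrix column are assumptions based on the description above):

// Hypothetical sketch: sum the D weights of the applicable training tuples
// that the weak learner classifies incorrectly.
static double ComputeEpsilon(Learner lrn, int[][] train, double[] D)
{
  double epsilon = 0.0;
  for (int i = 0; i < train.Length; ++i)
  {
    if (train[i][lrn.feature] != lrn.value)
      continue;                                  // the rule of thumb does not apply to this tuple
    int actualY = (train[i][train[i].Length - 1] == 0) ? -1 : +1;  // assumed: 0 = Lose, 1 = Win
    if (actualY != lrn.predicted)
      epsilon += D[i];                           // weight of a misclassified tuple
  }
  return epsilon;
}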

Although it's not immediately obvious, if you examine the computation of the epsilon values carefully, you'll find that they are always between 0.0 and 0.5. After all the updates, every learner has a new epsilon value based on the current D weights.

Now the best learner is selected. It is learner [0], because its epsilon value, 0.00, is the smallest. The corresponding alpha value is computed as follows:
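alpha = 0.5 * log((1 - epsilon) / epsilon)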

This is essentially a magic equation from AdaBoost theory. Here, log is the natural logarithm (base e). Recall that an alpha value is a weight that assigns an importance to a learner. The equation is designed so that smaller epsilon values (learner error) produce larger alpha values (learner importance).

In this particular situation there's a problem, because epsilon is 0 and dividing by zero would cause an error. To avoid this, the demo program arbitrarily converts any epsilon value of 0 to 0.000001. So the alpha value for learner [0] is computed as 0.5 * log(0.999999 / 0.000001) = 6.91. This value is assigned to learner [0], and learner [0] is marked as used.

The next step in the algorithm loop is to update the D training-tuple weights based on the best learner just found. The idea is to increase the D weights of the training tuples that the best learner classified incorrectly and to decrease the D weights of the tuples it classified correctly. The D update equation looks a bit complicated at first glance:
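D(new) = D(old) * exp(-alpha * actualY * predictedY)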

The best learner was learner [0] (IF Opponent IS Buffalo THEN Result IS Win), with an alpha value of 6.91. Of the 10 training tuples, learner [0] applies to tuples [2] and [3], so those D values are updated. For training tuple [2]:
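D(new) = 0.10 * exp(-6.91 * (+1)(+1))
       = 0.10 * exp(-6.91)
       = 0.10 * 0.0010
       = 0.0001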

The new D value for training tuple [3] is computed in the same way and has the same value as that for tuple [2].

Here we get a new D weight that is very small, because the training tuple was classified correctly by the best learner. If the actual Y value and the predicted Y value are the same (both -1 or both +1), their product is +1 and the argument to exp is a negative number (because -alpha is always negative), which yields a result less than 1.0. If the actual Y value and the predicted Y value differ, their product is -1 and the argument to exp is positive, which yields a result greater than 1.0 (possibly much greater). This D update technique is the reason AdaBoost classification usually encodes the dependent variable using -1 and +1 rather than 0 and 1.

Now the preliminary, approximate D values are (rounded):
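[0] 0.1000   [1] 0.1000   [2] 0.0001   [3] 0.0001   [4] 0.1000
[5] 0.1000   [6] 0.1000   [7] 0.1000   [8] 0.1000   [9] 0.1000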

The next step in the main algorithm loop is to normalize the D values so that they sum to 1.0, by dividing each preliminary D value by the sum of the D values. The sum of the 10 preliminary D values is about 0.8002, so the normalized D value for training tuple [0] is approximately 0.1000 / 0.8002 = 0.1249. The final updated D values are:
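[0] 0.1249   [1] 0.1249   [2] 0.0001   [3] 0.0001   [4] 0.1249
[5] 0.1249   [6] 0.1249   [7] 0.1249   [8] 0.1249   [9] 0.1249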

The point of this is to make the algorithm focus on training tuples other than [2] and [3], because those tuples have already been accounted for by learner [0]. At this point the algorithm returns to the top of the loop: it updates the epsilon values of all learners based on the newly computed D values, selects the best unused learner ([5]), computes that learner's alpha value (4.49), updates the D values of the applicable training tuples ([2], [6] and [7]) and computes normalized D values for all training tuples.

In this example, the process continues until all 8 weak learners have been assigned alpha values. In general, not every weak learner will necessarily turn out to be a relatively good learner and receive an alpha value.

General program structure

The demo program shown in Figure 1 is a single C# console application. I used Visual Studio 2010, but any version that supports the Microsoft .NET Framework 2.0 or later will work. I created a new project named "AdaptiveBoosting" and then, in the Solution Explorer window, renamed "Program.cs" to the more descriptive "AdaptiveBoostingProgram.cs," which automatically renamed the Program class. I deleted all the template-generated using statements at the top of the source code except for the references to the System and System.Collections.Generic namespaces. The Main method is listed in Figure 2. Some WriteLine statements have been removed and a few other minor edits made; in addition to Main, the demo program has a program-defined class that defines weak-learner objects.

Figure 2: General program structure

The Main method begins by setting up hard-coded strings "Opponent," "Field," "Spread" and "Result" as the feature names. Then the code in Main sets up hard-coded values for each feature: "Atlanta," "Buffalo," "Chicago," "Detroit," "Home," "Away," "Small," "Medium," "Large," "Lose" and "Win." To make the output clearer, I added a trailing space after "Small" and "Large."

For the sake of simplicity, the training data is also hard-coded in the demo program. In many situations your training data will be stored in a text file or SQL table. You can then programmatically review the training data to determine the feature names (presumably from the header line of a text file or from SQL table column names) and the feature values.

Method RawTrainToInt converts the training data from string form to zero-based integers and stores the integers in an int[][] matrix named "train." RawTrainToInt calls a helper method named ValueToInt. The dependent-variable values ("Result") are stored in the last column of the train matrix; you might prefer to store the dependent values in a separate array. In situations with very large training data sets, it may not be possible to hold the entire data set in memory, in which case you'll have to stream through the external data store instead of an in-memory matrix.
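A helper along the lines of ValueToInt might look like this minimal sketch (the exact signature in the demo is an assumption; the value ordering matches the hard-coded feature values described earlier, so "Detroit" maps to 3, "Home" to 0, "Large" to 2 and "Win" to 1):

// Hypothetical sketch of a string-to-integer lookup for one feature value.
// featureValues[0] = { "Atlanta", "Buffalo", "Chicago", "Detroit" }, featureValues[1] = { "Home", "Away" }, and so on.
static int ValueToInt(string value, string[][] featureValues, int featureIndex)
{
  string[] vals = featureValues[featureIndex];
  for (int i = 0; i < vals.Length; ++i)
    if (vals[i].Trim() == value.Trim())   // Trim because "Small" and "Large" carry a trailing space
      return i;                           // for example, "Detroit" -> 3
  return -1;                              // value not found
}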

The demo program creates the weak learners using the MakeLearners method and a program-defined class named Learner. I describe the method and the class in more detail in the next section of this article. After the weak learners have been created, the demo program calls the MakeModel method. As described in the previous section, MakeModel is the heart of the AdaBoost algorithm. The result is a sorted list (named "bestLearners") of the indexes of the learners that were assigned alpha values.

The Main method finishes by using the Classify method to predict the outcome for Seattle when Opponent is "Detroit," Field is "Home" and Spread is "Small." In this case the return value of Classify is -0.72, which is interpreted as "Lose."
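Putting the pieces together, the calling sequence in Main can be condensed into a sketch like the following (the method signatures, the RunDemo wrapper and the omission of the hard-coded data and WriteLine statements are my simplifications, not the actual Figure 2 listing; see the download for the real code):

// Hypothetical condensed view of the demo's calling sequence.
static void RunDemo(string[] features, string[][] values, string[][] rawTrain)
{
  int[][] train = RawTrainToInt(rawTrain, values);        // (Detroit, Home, Large, Win) -> (3, 0, 2, 1)
  List<Learner> learners = MakeLearners(values, train);   // generate the weak learners (rules of thumb)
  List<int> bestLearners = MakeModel(train, learners);    // assign alpha weights -- the core AdaBoost routine
  int[] unknown = new int[] { 3, 0, 0 };                  // Detroit, Home, Small (encoded)
  double z = Classify(learners, bestLearners, unknown);   // about -0.72 for the demo data
  Console.WriteLine(z < 0.0 ? "Predicted result: Lose" : "Predicted result: Win");
}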

Creating weak learners

The implementation of a particular AdaBoost algorithm depends to some extent on the specific definition of a weak learner. The program-defined class "Learner" has six fields:
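// Sketch of the Learner class based on the field descriptions that follow.
// The exact declarations, and the presence of an alpha field, are assumptions on my part.
public class Learner
{
  public int feature;     // which feature the rule tests: 0 = Opponent, 1 = Field, 2 = Spread
  public int value;       // which value of that feature, for example 3 = Detroit
  public int predicted;   // -1 (Lose) or +1 (Win)
  public double error;    // raw (unweighted) error rate on the training data
  public double epsilon;  // weighted error rate, computed from the D weights
  public double alpha;    // importance weight assigned by the AdaBoost routine (a = 0.00 until set)
}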

I declared all of the fields with public scope for simplicity. The feature field holds an integer that indicates which feature (independent variable) the weak learner is based on. For example, if feature is 0, the weak learner is based on the value of the Opponent feature. The value field holds an integer that indicates the value of that feature. For example, if value is 3, the weak learner is based on the condition that the opponent is Detroit. The predicted field is -1 or +1, depending on whether the actual category associated with the feature value is Lose or Win.

The "error" field is of the "double" type and represents the raw error rate associated with the weak learner in the training data. For example, if the values ​​for a weak learner are "feature" = "0", "value" = "3" and "predicted" = "+1" (which means if the "opponent" is "Detroit", the result is " Win ”), then the raw error rate for the training data is in illustration 1 "0.33" because one of the three training data items was incorrectly predicted. Note that “raw error” treats every training element equally. It turns out that the AdaBoost algorithm discussed in this article doesn't really need the "raw error" field and the field could have been left out. However, I find this information useful.

The "epsilon" field is a weighted error term. For a weak learner, "epsilon" is an error term that takes into account the internal D-weights assigned to each training element. The epsilon values ​​are used by the AdaBoost algorithm to calculate the alpha weights. In summary, it can be said that two sets of weights are used in the AdaBoost classification. The alpha weights assign meaning to each poor learner and are used to determine an overall prediction. An epsilon error is an internal error that is assigned to a weak learner and used to calculate the alpha weights. Each training tuple has an internal weight (referred to as "D" in the AdaBoost literature) that is used to calculate the epsilon errors.

In pseudocode, the MakeLearners method works roughly like this:
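for each feature (Opponent, Field, Spread) loop
  for each possible value of the current feature loop
    find the training tuples where the feature has this value
    if there are no such tuples, skip this value
    count how many of those tuples have Result Win and how many have Result Lose
    if the counts are equal, skip this value (the raw error would be 0.50)
    create a Learner for (feature, value) whose predicted result is the more frequent outcome
    set the Learner's raw error = (count of the less frequent outcome) / (count of applicable tuples)
  end loop
end loop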

The idea is that each value of each feature, such as "Atlanta" (Opponent) or "Medium" (Spread), is a candidate for a weak-learner rule based on the training data. A candidate is rejected if the value does not appear in the training data at all (such as an Opponent value of "New York"), or if the value doesn't point toward one result more strongly than the other because it is associated with an equal number of wins and losses (for example, Opponent "Atlanta" in the demo data, with one game won and one game lost).

Summary

An important variation of the algorithm presented in this article handles data with numeric values. For example, suppose the values of the Spread feature were numeric, such as 1.5, 3.0 and 9.5, rather than categorical (Small, Medium and Large). One of the main advantages of AdaBoost classification over many other classification techniques is that it can easily handle both categorical and numeric data directly. You could create a dedicated learner class with a user-friendly description along the lines of "if Spread is less than or equal to 5.5 then Result is Lose," or a more complex learner along the lines of "if Spread is between 4.0 and 6.0 then Result is Win."

In my opinion, AdaBoost classification is best suited to problems where the dependent variable to be predicted has just two possible values. Advanced forms of AdaBoost can, however, handle multinomial classification. If you want more information about this, or about AdaBoost theory in general, I recommend searching for articles by researchers R. Schapire and Y. Freund.

Machine learning research suggests that there is no single best technique for classification and prediction. But AdaBoost is an extremely powerful approach that can form the basis for adding prediction features to a .NET software system.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software developers working at Microsoft's Redmond, Washington campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search. Dr. McCaffrey is the author of ".NET Test Automation Recipes" and can be reached at [email protected]

Thanks to the following technical expert for reviewing this article: Darren Gehring (Microsoft)