HW5: Chapter 9 Classification Tree and Regression Tree
To be fair to future students (and you too), please NEVER share your HW solutions with
ANYONE ANYWHERE ANYTIME. This policy is taken very seriously and I sincerely
appreciate your cooperation. Thanks.
All HWs require self-study (from the books and the Internet). Please be prepared to learn
new material when doing the HWs. However, material that is not taught in class will not be
tested on the quizzes or the final.
Competitive Auctions on eBay.com. The file eBayAuctions.xlsx contains information
on 1972 auctions that transacted on eBay.com during May–June 2004.
The goal is to use these data to build a model that will classify auctions as competitive or
noncompetitive. A competitive auction is defined as an auction with at least two bids placed
on the item auctioned. The data include variables that describe the item (auction category),
the seller (his/her eBay rating), and the auction terms that the seller selected (auction
duration, opening price, currency, day-of-week of auction close). In addition, we have the
price at which the auction closed. The task is to predict whether or not the auction will be
competitive.
Data Preprocessing. Create dummy variables for the categorical predictors.
These include Category (18 categories), Currency (USD, GBP, Euro), EndDay (Monday–Sunday),
and Duration (1, 3, 5, 7, or 10 days). Split the data into training and validation
datasets using a 60%:40% ratio.
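The preprocessing step above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the toy rows, the `OpenPrice` field name, and the helper names `add_dummies` and `partition` are hypothetical, and the real file would normally be loaded with a spreadsheet or data-frame tool rather than typed in by hand.

```python
import random

def add_dummies(records, column):
    """Return copies of the records with `column` replaced by 0/1 indicator fields."""
    levels = sorted({str(r[column]) for r in records})
    coded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for level in levels:
            row[f"{column}_{level}"] = 1 if str(r[column]) == level else 0
        coded.append(row)
    return coded

def partition(records, train_share=0.6, seed=1):
    """Shuffle reproducibly, then cut into training and validation lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy rows; the real data set has 1972 auctions and more columns.
auctions = [
    {"Currency": "USD", "Duration": 7, "OpenPrice": 0.99},
    {"Currency": "GBP", "Duration": 5, "OpenPrice": 12.0},
    {"Currency": "Euro", "Duration": 7, "OpenPrice": 4.5},
    {"Currency": "USD", "Duration": 10, "OpenPrice": 1.0},
    {"Currency": "USD", "Duration": 3, "OpenPrice": 25.0},
]

coded = auctions
for col in ("Currency", "Duration"):
    coded = add_dummies(coded, col)

train, valid = partition(coded)  # 60% training, 40% validation
```

The sketch keeps one indicator per category level, which is harmless for a tree; a fixed seed makes the 60%:40% partition repeatable.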
a. Fit a classification tree using all predictors, using the best-pruned tree. To avoid
overfitting, set the minimum number of records in a terminal node to 50. Also, set
the maximum number of levels to be displayed at seven (the maximum allowed
in XLMiner). Write down the results in terms of rules. (Note: If you had to
slightly reduce the number of predictors due to software limitations, or for clarity
of presentation, which would be a good variable to choose?)
b. Is this model practical for predicting the outcome of a new auction?
c. Describe the interesting and uninteresting information that these rules provide.
d. Fit another classification tree (using the best-pruned tree, with a minimum number
of records per terminal node = 50 and maximum allowed number of displayed
levels), this time only with predictors that can be used for predicting the outcome of
a new auction. Describe the resulting tree in terms of rules. Make sure to
report the smallest set of rules required for classification.
e. Examine the lift chart and the classification table for the tree. What can you
say about the predictive performance of this model?
f. Based on this last tree, what can you conclude from these data about the chances
of an auction obtaining at least two bids and its relationship to the auction settings
set by the seller (duration, opening price, ending day, currency)? What would you
recommend for a seller as the strategy that will most likely lead to a competitive
auction?
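For part e, the classification table and the lift chart can both be computed from the validation records once each record has an actual class and a predicted probability. The following is a minimal sketch on made-up numbers; the 0.5 cutoff, the toy `actual` and `scores` lists, and the function names are assumptions for illustration, not XLMiner output.

```python
def classification_table(actual, predicted):
    """Count outcomes as {(actual, predicted): n} for 0/1 labels."""
    table = {(a, p): 0 for a in (1, 0) for p in (1, 0)}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table

def cumulative_lift(actual, scores):
    """Cumulative number of actual 1s, with records sorted by descending score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    running, curve = 0, []
    for _, a in ranked:
        running += a
        curve.append(running)
    return curve

# Hypothetical validation results: 1 = competitive, 0 = noncompetitive.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed cutoff of 0.5

table = classification_table(actual, predicted)
error_rate = (table[(1, 0)] + table[(0, 1)]) / len(actual)

curve = cumulative_lift(actual, scores)
# Reference line: expected number of 1s found when records are picked at random.
baseline = [sum(actual) * (i + 1) / len(actual) for i in range(len(actual))]
```

The further `curve` rises above `baseline` in the first few records, the better the model ranks competitive auctions ahead of noncompetitive ones.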
Predicting Delayed Flights. The file FlightDelays.xlsx contains information on all
commercial flights departing the Washington, DC area and arriving at New York during January
2004. For each flight there is information on the departure and arrival airports, the distance of
the route, the scheduled time and date of the flight, and so on. The variable that we are trying to
predict is whether or not a flight is delayed. A delay is defined as an arrival that is at least 15
minutes later than scheduled.
Data Preprocessing. Create dummies for day of week, carrier, departure airport, and arrival
airport. This will give you 17 dummies. Bin the scheduled departure time into eight bins (in
XLMiner use Transform → Bin Continuous Data and select equal width). After binning CRS DEP
TIME into the eight bins, convert the binned variable into dummies. This avoids treating the
departure time as a continuous predictor: its effect is unlikely to be linear, because it is
reasonable that delays are related to the morning and afternoon rush hours.
Partition the data into training and validation sets.
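Equal-width binning followed by dummy coding can be sketched as below. This is a rough illustration, not a reproduction of XLMiner's binning: the sample times, the `DEP_BIN` prefix, and the function names are hypothetical, and the times are treated as plain HHMM numbers, so the bins are equal in width on that numeric scale rather than in minutes.

```python
def equal_width_bins(values, n_bins=8):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def bin_dummies(bin_ids, n_bins=8, prefix="DEP_BIN"):
    """Turn bin indices into one 0/1 indicator per bin."""
    return [{f"{prefix}_{b}": int(b == i) for b in range(n_bins)} for i in bin_ids]

# Hypothetical scheduled departure times in HHMM form.
crs_dep_time = [600, 700, 845, 1230, 1455, 1710, 1900, 2130]

bins = equal_width_bins(crs_dep_time)
dummies = bin_dummies(bins)
```

Each flight ends up with exactly one indicator set to 1, so the tree can split on individual time-of-day bins instead of on a single numeric departure time.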
a. Fit a classification tree to the flight delay variable using all the relevant predictors.
Do not include DEP TIME (actual departure time) in the model because it is
unknown at the time of prediction (unless we are generating our predictions of
delays after the plane takes off, which is unlikely). In the third step of the
Classification Tree menu, choose “Maximum # levels to be displayed = 6”. Use the
best-pruned tree, setting the minimum number of observations in the final nodes to
1. Express the resulting tree as a set of rules.
b. If you needed to fly between DCA and EWR on a Monday at 7 AM, would you
be able to use this tree? What other information would you need? Is it available
in practice? What information is redundant?
c. Fit another tree, this time excluding the weather predictor. (Why?) Select
the option of seeing both the full tree and the best-pruned tree. You will find that
the best-pruned tree contains a single terminal node.
i. How is this tree used for classification? (What is the rule for classifying?)
ii. To what is this rule equivalent?
iii. Examine the full tree. What are the top three predictors according to this tree?
iv. Why, technically, does the pruned tree result in a tree with a single node?
v. What is the disadvantage of using the top levels of the full tree as
opposed to the best-pruned tree?
vi. Compare this general result to that from logistic regression in the
example in Chapter 10. What are possible reasons for the classification
tree’s failure to find a good predictive model?
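Both problems set a minimum number of records per terminal node (50 for the auctions, 1 for the flights). How such a limit constrains tree growth can be sketched with a single-predictor split search. This is a minimal illustration that assumes Gini impurity as the splitting criterion, which the assignment does not specify for XLMiner; the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y, min_leaf=1):
    """Return (threshold, weighted impurity) of the best split `x <= threshold`.

    Returns (None, impurity of the unsplit node) when no candidate split
    leaves at least `min_leaf` records on each side or reduces impurity.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_thr, best_imp = None, gini(y)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal predictor values
        if i < min_leaf or n - i < min_leaf:
            continue  # one side would be smaller than the allowed node size
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_imp = imp
    return best_thr, best_imp

# Hypothetical toy data: one numeric predictor and a 0/1 outcome.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

unconstrained = best_split(x, y, min_leaf=1)  # splits between 3 and 4
constrained = best_split(x, y, min_leaf=4)    # no split can leave 4 records per side
```

With a large minimum node size the search may find no admissible split at all, which is one way a grown tree ends up much shallower than the data alone would allow.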