Modeling with RPart and NA Predictions
I ran into a problem this past week: making predictions with decision trees when some of the predictor values are NA.
Let's quickly create a data set to demonstrate the problem. In our example, both var1 and var2 are dependent on the class. Our final input, var3, is just some random noise added to the problem.
library(caret)  # provides train()
rm(list = ls())
set.seed(6789)
class <- sample(0:2, 300, replace = TRUE, prob = c(0.5, 0.3, 0.2))
var1 <- class * rnorm(300, 3, 1)   # depends on class
var2 <- class * rnorm(300, 6, 1)   # depends on class
var3 <- rnorm(300, 1, 0.2)         # pure noise
training_data <- data.frame(class = class, var1 = var1, var2 = var2, var3 = var3)
training_data$class <- as.factor(training_data$class)
str(training_data)
## 'data.frame': 300 obs. of 4 variables:
## $ class: Factor w/ 3 levels "0","1","2": 1 2 2 2 1 1 3 1 1 3 ...
## $ var1 : num 0 2.05 2.66 1.66 0 ...
## $ var2 : num 0 3.64 4.74 5.08 0 ...
## $ var3 : num 1.129 1.166 0.769 0.949 1.129 ...
Now we are ready to train our model.
model_rpart <- train(class ~ ., data = training_data, method = "rpart")
print(model_rpart$finalModel)
## n= 300
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 300 159 0 (0.4700000 0.3300000 0.2000000)
## 2) var1< 0.01965866 141 0 0 (1.0000000 0.0000000 0.0000000) *
## 3) var1>=0.01965866 159 60 1 (0.0000000 0.6226415 0.3773585)
## 6) var2< 8.364213 99 0 1 (0.0000000 1.0000000 0.0000000) *
## 7) var2>=8.364213 60 0 2 (0.0000000 0.0000000 1.0000000) *
Inspecting our final model reveals that, as expected, var3 has no impact on the decision tree. It was just random noise. Only var1 and var2 are included in the decision tree. However, we now encounter two problems.
First Problem: If we try to predict on rows that contain NA values, R will give us an answer, even when it shouldn't be able to. Let's demonstrate with some test data.
test_data <- training_data[0,]        # empty frame with matching columns
test_data[1,] <- c(NA, 3, 7, 0)
test_data[2,] <- c(NA, 0, 2, 0)
test_data[3,] <- c(NA, 2, 10, 0.2)
test_data[4,] <- c(NA, NA, 8, 0.3)    # var1 missing
test_data[5,] <- c(NA, NA, NA, 0.1)   # var1 and var2 missing
test_data[6,] <- c(NA, 0, 6, NA)      # only var3 (the noise column) missing
test_data[7,] <- c(NA, 0, 1, 1)
print(test_data)
## class var1 var2 var3
## 1 <NA> 3 7 0.0
## 2 <NA> 0 2 0.0
## 3 <NA> 2 10 0.2
## 4 <NA> NA 8 0.3
## 5 <NA> NA NA 0.1
## 6 <NA> 0 6 NA
## 7 <NA> 0 1 1.0
Now, let’s predict against this test data.
predict(model_rpart, newdata = test_data, type = "prob", na.action = na.rpart)
## 0 1 2
## 1 0 1 0
## 2 1 0 0
## 3 0 0 1
## 4 0 1 0
## 5 0 1 0
## 6 1 0 0
## 7 1 0 0
The predict function successfully predicted every row of the test data, even when NA values should have made this task impossible. We should not get predictions for rows 4 and 5: with var1 missing, the tree's first split cannot be evaluated, and row 5 is missing var2 as well. The na.rpart action passes the NA values through to the tree, which falls back on surrogate splits and, failing those, sends the observation in the majority direction at each node, so it always produces an answer.
Second Problem: If we use na.omit, then R removes too many rows, even removing row 6. Row 6 should not be removed: its only NA value is in var3, which the tree never uses, so R has all of the information it needs to make a prediction.
predict(model_rpart, newdata = test_data, type = "prob", na.action = na.omit)
## 0 1 2
## 1 0 1 0
## 2 1 0 0
## 3 0 0 1
## 7 1 0 0
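This happens because na.omit filters on completeness across every column of the new data, not just the columns the tree uses. A quick check (my own addition, not part of the original workflow) makes the point:
# na.omit keeps only rows that are complete across *all* predictor columns;
# row 6 fails this check solely because var3 is NA
complete.cases(test_data[, c("var1", "var2", "var3")])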
Solution
My solution is to create a new training data set containing only the relevant columns. Fortunately, we can automate this process a little bit.
# keep only the variables that actually appear as splits in the tree;
# unique() guards against a variable that is used in more than one split
keep_cols <- subset(model_rpart$finalModel$frame, var != "<leaf>", select = c("var"))
keep_cols <- unique(as.character(keep_cols$var))
training_data_2 <- subset(training_data, select = c("class", keep_cols))
str(training_data_2)
## 'data.frame': 300 obs. of 3 variables:
## $ class: Factor w/ 3 levels "0","1","2": 1 2 2 2 1 1 3 1 1 3 ...
## $ var1 : num 0 2.05 2.66 1.66 0 ...
## $ var2 : num 0 3.64 4.74 5.08 0 ...
In our reduced training data set, var3 has been removed. If we train on this reduced set, we should get the same model.
model_rpart_2 <- train(class ~ ., data = training_data_2, method = "rpart")
print(model_rpart_2$finalModel)
## n= 300
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 300 159 0 (0.4700000 0.3300000 0.2000000)
## 2) var1< 0.01965866 141 0 0 (1.0000000 0.0000000 0.0000000) *
## 3) var1>=0.01965866 159 60 1 (0.0000000 0.6226415 0.3773585)
## 6) var2< 8.364213 99 0 1 (0.0000000 1.0000000 0.0000000) *
## 7) var2>=8.364213 60 0 2 (0.0000000 0.0000000 1.0000000) *
We can use na.omit, since we have a model that only considers the relevant variables.
test_data_2 <- subset(test_data, select = keep_cols)
predict(model_rpart_2, test_data_2, type = "prob", na.action = na.omit)
## 0 1 2
## 1 0 1 0
## 2 1 0 0
## 3 0 0 1
## 6 1 0 0
## 7 1 0 0
Now, our solution is much closer to what we'd expect: we still get a prediction for row 6, and we do not get predictions for rows 4 and 5.
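If you find yourself repeating this recipe, it can be wrapped in a small helper. Here is a rough sketch; refit_on_used_vars is a hypothetical name of my own, not an existing caret or rpart function, and reformulate() simply rebuilds the model formula from the surviving predictors:
# Hypothetical helper: refit a caret rpart model using only the predictors
# the original tree actually split on. `response` names the outcome column.
refit_on_used_vars <- function(model, data, response = "class") {
  frame <- model$finalModel$frame
  used <- unique(as.character(frame$var[frame$var != "<leaf>"]))
  train(reformulate(used, response), data = data, method = "rpart")
}
model_rpart_2b <- refit_on_used_vars(model_rpart, training_data)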
This solution is not perfect. In fact, far from it. Suppose you had a test case where var1 = 0.01 and var2 = NA. Since var1 < 0.01965866 sends the observation straight to a terminal node, this row should be correctly classified as class = 0 without ever consulting var2; it shouldn't matter that var2 = NA. Yet na.omit would still drop the row. Perhaps someone should write an R package with a better na.action that accounts for the structure of the decision tree.
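To sketch what I mean, here is a rough, strictly-for-illustration predictor that walks the fitted tree by hand and refuses to predict only when a variable actually needed on the decision path is missing. It handles continuous splits only (as in this example), ignores surrogates, and predict_strict is a made-up name, not part of rpart:
# Walk the rpart tree row by row; return the class label when the path
# reaches a leaf, or NA when a variable needed on the path is missing.
predict_strict <- function(fit, newdata) {
  frame    <- fit$frame
  splits   <- fit$splits
  node_ids <- as.integer(rownames(frame))
  # each internal node consumes 1 + ncompete + nsurrogate rows of fit$splits,
  # in frame order; locate each node's primary split row
  consumed  <- ifelse(frame$var == "<leaf>", 0, 1 + frame$ncompete + frame$nsurrogate)
  split_row <- cumsum(c(1, head(consumed, -1)))
  classes   <- attr(fit, "ylevels")
  predict_one <- function(x) {
    node <- 1
    repeat {
      i <- match(node, node_ids)
      v <- as.character(frame$var[i])
      if (v == "<leaf>") return(classes[frame$yval[i]])
      value <- x[[v]]
      if (is.na(value)) return(NA_character_)  # refuse rather than guess
      s <- splits[split_row[i], ]
      # for continuous splits, ncat = -1 means "x < cutpoint goes left",
      # ncat = +1 means "x >= cutpoint goes left"
      goes_left <- if (s["ncat"] < 0) value < s["index"] else value >= s["index"]
      node <- if (goes_left) 2 * node else 2 * node + 1
    }
  }
  vapply(seq_len(nrow(newdata)),
         function(r) predict_one(newdata[r, , drop = FALSE]),
         character(1))
}
predict_strict(model_rpart$finalModel, test_data)
With something like this, the var1 = 0.01, var2 = NA case would be classified as 0, because the path to its leaf never touches var2.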
Is there a solution that I’m not yet aware of? If so, please let me know at jarrettmeyer@gmail.com.