How will the model perform with noisy masks?
When we collect data in the real world, we won't have perfect synthetic attention masks \(\mathcal{M}\); they will typically be noisy, due to accumulated errors in the models used to predict the semantic segmentation, the depth, or both. To mimic these noisy masks, we define a function \( f \) that adds Perlin noise to the masks, plus finer-grained noise on larger objects. Additionally, we train a U-Net to obtain \( \widehat{\mathcal{M}} \), the mask predicted from an input image. We show some examples in the following figure:
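As a rough illustration, a corruption function like \( f \) could be sketched as follows. This is a minimal NumPy stand-in, not the actual implementation: the smooth value noise below only approximates Perlin noise, and the grid size and strength parameters are hypothetical.

```python
import numpy as np

def value_noise(shape, grid=8, rng=None):
    """Smooth 2D noise: random values on a coarse grid, bilinearly
    upsampled to `shape`. A simple stand-in for Perlin noise."""
    rng = np.random.default_rng() if rng is None else rng
    coarse = rng.random((grid + 1, grid + 1))
    ys = np.linspace(0, grid, shape[0])
    xs = np.linspace(0, grid, shape[1])
    y0 = np.clip(np.floor(ys).astype(int), 0, grid - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, grid - 1)
    ty = (ys - y0)[:, None]
    tx = (xs - x0)[None, :]
    c00 = coarse[y0][:, x0]
    c01 = coarse[y0][:, x0 + 1]
    c10 = coarse[y0 + 1][:, x0]
    c11 = coarse[y0 + 1][:, x0 + 1]
    return (c00 * (1 - ty) * (1 - tx) + c01 * (1 - ty) * tx
            + c10 * ty * (1 - tx) + c11 * ty * tx)

def corrupt_mask(mask, strength=0.3, rng=None):
    """Perturb a binary mask with smooth noise and re-threshold it,
    flipping pixels mostly near object boundaries (hypothetical f)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = value_noise(mask.shape, grid=8, rng=rng)
    noisy = mask.astype(float) + strength * (noise - 0.5)
    return (noisy > 0.5).astype(mask.dtype)
```

A real \( f \) would also inject the granular, per-object noise described above; the thresholding step here simply keeps the output a valid binary mask.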
The synthetic attention masks are typically fed to the model in one of two ways. Soft Mask (SM) appends the mask as a fourth channel to the input RGB image. Hard Mask (HM) performs an element-wise multiplication with the RGB channels, effectively removing "unnecessary" pixels. In either case, when the masks are predicted, they must be predicted at both training and test time, increasing the required compute. For comparison, we train a model with noisy masks using the Attention Loss; note that in our case the masks do not need to be predicted during validation. In the following table, we train all models on 14 hours of driving data in Town01 and validate the driving in Town02. Our proposed loss obtains the best driving results without needing to remove parts of the input.
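The two mask-as-input baselines can be sketched in a few lines. This is a minimal NumPy illustration under our own naming, assuming an `H x W x 3` RGB image and an `H x W` binary mask:

```python
import numpy as np

def soft_mask_input(rgb, mask):
    """Soft Mask (SM): append the attention mask as a fourth channel."""
    return np.concatenate([rgb, mask[..., None]], axis=-1)  # H x W x 4

def hard_mask_input(rgb, mask):
    """Hard Mask (HM): element-wise multiply with the RGB channels,
    zeroing out pixels outside the mask."""
    return rgb * mask[..., None]  # H x W x 3
```

Note that both variants change the network's input (SM changes its input shape, HM destroys the masked-out pixels), which is why a predicted mask is required at test time; the Attention Loss instead only supervises intermediate activations during training.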
To test the high-data regime in more complex scenarios, we collect 55 hours of driving data across multiple CARLA towns: Town01, Town02, Town03, Town04, and Town06, and test in the unseen Town05 under new weather conditions. We observe a particular boost in the Driving Score and the Infraction Score, reflecting the model's better adherence to traffic rules.