Guiding Attention in End-to-End Driving Models

Computer Vision Center, Universitat Autònoma de Barcelona
Intelligent Vehicles Symposium (IV) 2024

By optimizing the attention weights of a pure vision-based end-to-end driving model, we can increase the model's interpretability and enhance its driving capabilities with the available data, at no additional compute cost at test time. The model is trained with 55 hours of driving data from multiple towns in CARLA and evaluated in the unseen Town05 under new weather conditions.

Abstract

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training these well-performing models usually requires a huge amount of data, and they still lack explicit and intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models to improve their driving quality and obtain more intuitive activation maps by adding a loss term during training using salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available during testing time, nor does it require modifying the architecture of the model to which it is applied. We perform tests using both perfect and noisy salient semantic maps, with encouraging results in both cases; the noisy maps are inspired by errors likely to be encountered with real data. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments that show the effectiveness of our method in training better autonomous driving models, especially when data and computational resources are scarce.

Introduction

Vision activation maps of end-to-end driving models are usually not human-readable, even when the model drives perfectly in the environment it is deployed in. We therefore seek to train these activation maps with a ground-truth synthetic attention mask (whether obtained from humans or derived from sensor readings), and analyze their effect on the driving performance of a pure vision-based end-to-end driving model, CIL++. We do so with our proposed Attention Loss, with the added benefit of keeping the original architecture unchanged.
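To make the idea concrete, below is a minimal sketch (in PyTorch) of what such an attention-guidance term could look like: the encoder's attention over the visual tokens is pushed towards a distribution derived from the salient semantic mask. The exact formulation, token pooling, and weighting factor used in the paper may differ; names such as lambda_attn are illustrative only.

import torch

def attention_guidance_loss(attn_weights: torch.Tensor,
                            saliency_mask: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    # attn_weights:  (B, N) attention mass the encoder assigns to each visual token.
    # saliency_mask: (B, N) ground-truth saliency, e.g. a semantic mask
    #                average-pooled down to the token grid.
    # Both are normalized into distributions over the N tokens.
    p = saliency_mask / (saliency_mask.sum(dim=-1, keepdim=True) + eps)
    q = attn_weights / (attn_weights.sum(dim=-1, keepdim=True) + eps)
    # KL(p || q): penalizes attention mass placed outside the salient regions.
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1).mean()

# Hypothetical training objective: imitation loss on the driving actions
# plus the weighted attention-guidance term.
# total_loss = imitation_loss + lambda_attn * attention_guidance_loss(attn, mask)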

Contributions

In short, our main contributions are as follows:

  • We propose a novel Attention Loss to guide the attention of end-to-end driving models, which does not require the salient semantic maps during testing time.
  • We demonstrate that the Attention Loss can be used to train better driving models with low amounts of data, especially when computational resources are scarce.
  • We show that the Attention Loss can be used with noisy masks, which are more representative of real-world data.

Results

Increased sample efficiency

When dealing with a low amount of training data (below 8 hours in total), we can further subdivide this setting into two categories: low quantity and low diversity. For the former, we train the CIL++ model with and without the Attention Loss while progressively decreasing the amount of training data (akin to lowering the FPS at which the data was collected).
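As a simple illustration of this subsampling (not the paper's exact data pipeline), keeping every k-th frame of each recorded episode reduces the stored data in the same way as lowering the collection frame rate; the FPS values in the comment are hypothetical.

def subsample_episode(frames, keep_every=2):
    # Keep 1 out of every `keep_every` frames; e.g. keep_every=2 turns a
    # 10 FPS recording into an effective 5 FPS dataset (values illustrative).
    return frames[::keep_every]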

For the low diversity experiments, we start with one weather type in the data and slowly add the rest, one by one. Each weather type contains approximately 2 hours of data, so that the original 8 hours of data will contain the four weather types: ClearNoon (CN), ClearSunset (CS), HardRainNoon (HRN), and WetNoon (WN).

We can see that the results are consistent: adding the Attention Loss during training boosts the model's performance by up to 4 times, while requiring less data to be stored and trained on. In both settings, the training data was collected in Town01 and the models were evaluated in Town02 in CARLA under new weather conditions.

How will the model perform with noisy masks?

When we collect data in the real world, we won't have perfect synthetic attention masks \(\mathcal{M}\). They will typically be noisy, usually due to accumulated errors from the models we use to predict the semantic segmentation, depth, or both. To mimic these noisy masks, we define a function \( f \) that adds Perlin noise to the masks and other granular noise to larger objects. Additionally, we train a U-Net to obtain \( \widehat{\mathcal{M}} \), the predicted mask given an input image. We show some examples in the following figure:
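The exact noise function \( f \) is detailed in the paper; the sketch below only illustrates the idea with a cheap numpy-only stand-in for Perlin noise (a low-resolution random grid upsampled bilinearly) that punches smooth holes into a binary mask. Function names and the threshold value are our own.

import numpy as np

def smooth_noise(h, w, grid=8, rng=None):
    # Cheap stand-in for Perlin noise: a low-resolution random grid upsampled
    # bilinearly to (h, w), giving smooth spatial structure in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    coarse = rng.random((grid, grid))
    ys = np.linspace(0, grid - 1, h)
    xs = np.linspace(0, grid - 1, w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, grid - 1), np.minimum(x0 + 1, grid - 1)
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = coarse[y0][:, x0] * (1 - fx) + coarse[y0][:, x1] * fx
    bot = coarse[y1][:, x0] * (1 - fx) + coarse[y1][:, x1] * fx
    return top * (1 - fy) + bot * fy

def corrupt_mask(mask, threshold=0.6, rng=None):
    # Drop mask pixels wherever the smooth noise exceeds `threshold`,
    # mimicking holes produced by an imperfect perception stack.
    noise = smooth_noise(*mask.shape, rng=rng)
    return np.where(noise > threshold, 0.0, mask)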

Typically, synthetic attention masks are used as input in different ways. Soft Mask (SM) appends the mask as a fourth channel to the input RGB image. Hard Mask (HM) performs an element-wise multiplication with the RGB channels, effectively removing "unnecessary" data. When the masks have to be predicted, this must be done at both training and testing time, increasing the required compute. For comparison, we train a model with noisy masks using the Attention Loss, noting that in our case we do not need to predict these masks during validation. In the following table, we train all models using 14 hours of driving data in Town01 and validate the driving in Town02. Our proposed loss obtains the best driving results without the need to remove parts of the input.
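For reference, here is a minimal sketch of how the two mask-as-input baselines construct the network input; tensor shapes and function names are our own and not taken from the compared methods' code.

import torch

def soft_mask_input(rgb, mask):
    # Soft Mask (SM): append the saliency mask as a fourth input channel.
    # rgb: (B, 3, H, W), mask: (B, 1, H, W) in [0, 1]  ->  (B, 4, H, W)
    return torch.cat([rgb, mask], dim=1)

def hard_mask_input(rgb, mask):
    # Hard Mask (HM): element-wise multiplication zeroes out non-salient pixels.
    return rgb * mask

Note that SM also implies widening the first layer of the network to accept four channels, whereas with the Attention Loss the input remains a plain 3-channel RGB image.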

To test the high-data regime in more complex scenarios, we collect 55 hours of driving data in multiple towns in CARLA: Town01, Town02, Town03, Town04, and Town06, and test in the unseen Town05 under new weather conditions. We note a particular boost in the Driving Score and Infraction Score, resulting from the model's better adherence to traffic rules.

Additional Visualizations

In the following, we show the attention maps of the last layer of the Transformer Encoder for models trained with 14 hours of data collected in Town01. As mentioned above, we do not need access to the ground-truth masks at test time, as the Transformer Encoder has learned to correctly mimic their distribution. We perform the evaluation in Town02 under new weather conditions.
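As a hedged sketch of how such maps can be extracted for visualization from a stack of standard torch.nn.TransformerEncoderLayer blocks (CIL++'s actual encoder code may be organized differently), assuming batch_first=True tokens of shape (B, N, E):

import torch

@torch.no_grad()
def last_layer_attention(encoder_layers, tokens):
    # Run the encoder blocks and recover the attention weights of the last one.
    # Each torch.nn.TransformerEncoderLayer exposes its nn.MultiheadAttention
    # module as `.self_attn`.
    x = tokens
    attn = None
    for i, layer in enumerate(encoder_layers):
        if i == len(encoder_layers) - 1:
            # Re-run the attention module alone (an approximation: it skips the
            # block's normalization) to obtain head-averaged (B, N, N) weights.
            attn = layer.self_attn(x, x, x, need_weights=True,
                                   average_attn_weights=True)[1]
        x = layer(x)
    return x, attn

The resulting weights can then be reshaped to the image token grid and overlaid on the input frames.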

Poster

BibTeX

@misc{porres2024guiding,
  title={Guiding Attention in End-to-End Driving Models}, 
  author={Diego Porres and Yi Xiao and Gabriel Villalonga and Alexandre Levy and Antonio M. López},
  year={2024},
  eprint={2405.00242},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

This research is supported by project TED2021-132802BI00 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR. Antonio M. Lopez acknowledges the financial support to his general research activities given by ICREA under the ICREA Academia Program. Antonio and Gabriel thank the synergies, in terms of research ideas, arising from the project PID2020-115734RB-C21 funded by MCIN/AEI/10.13039/501100011033. The authors acknowledge the support of the Generalitat de Catalunya CERCA Program and its ACCIO agency to CVC’s general activities.

Institutional logos.