After years of research and development, computerized driver intelligence has finally reached the point where automated functionality not only assists manual driving but is also on its way to replacing it. Scientific research pioneered in academia in the mid-1980s, together with research performed by the automobile industry, has advanced significantly, achieving the maturity necessary for commercialization. Computerized driver intelligence is now integrated into commercial advanced driver-assistance systems (ADASs) that alert drivers to lane deviations and speeding (level 1 automation), stop the car to prevent an imminent collision (level 2 automation), steer the car to a destination (levels 3–5), and more.
The computerized driver intelligence integrated into ADASs has taken driving further than ever: Autonomous driving is considered 10 times safer than manual driving (with forecasts that it will be even safer in the coming years). There is no doubt that the automobile industry is about to enter a new era in computing: The era of driverless vehicles.
Scientists have overcome many computing challenges to create a reality that includes commercial semi-autonomous cars. One challenge was the development of the necessary computerized driver intelligence—more specifically, the AI required to create a real-time virtual perception of the physical surroundings. To overcome this challenge, scientists developed object detectors—dedicated AI capable of detecting the location of nearby pedestrians, traffic signs, lanes, and cars in real time. Such object detectors are created in a process whereby models are trained on real labeled data until their performance reaches a desired level—for example, a low error rate or a low improvement rate.
While many arguments have been raised regarding the training process's energy consumption13,23 and expense,34 the fact that autonomous driving is considered safer than manual driving demonstrates that the training process efficiently transfers a driver's experience in the real-time detection of real objects to AI. However, we question how efficient the training process is at transferring the experience needed for an object detector to distinguish between real objects that it must consider and fake projected/presented objects (phantoms) that it should ignore.
In this article, we identify a computing limitation of ADASs that stems from the way computer vision object detectors are created: the inability to visually distinguish between real objects and fake projected/presented objects. We show that even though computer vision object detectors can outperform humans in detecting real objects,15 they tend to consider visual phantoms (presented/projected objects) to be real objects. We examine this perceptual computing limitation and find that it is a side effect of how models are trained: Important characteristics, such as judgment, a sense of realism, and reasoning, are not transferred to the object detector during training. The perceptual computing limitation, which originates in the training process performed in the digital world, affects the way commercial ADASs perceive objects in the physical world. One might argue that commercial ADASs can overcome this perceptual computing limitation in the physical world by using sensor fusion—that is, by cross-correlating the camera sensor data with data obtained from depth sensors (for example, radar, LiDAR, and ultrasonic sensors). However, we show that the Tesla Model X, one of the most advanced semi-autonomous vehicles, also misclassifies a projected object as a real object, even though the car is equipped with a set of ultrasonic sensors and radar and the projected object does not have any depth. This points to a new perceptual computing limitation of commercial semi/fully autonomous cars.
We are about to enter a new era in computing in the automobile industry: The era of driverless vehicles.
Based on our findings, we explain how attackers can exploit the perceptual computing limitation to trigger a reaction from a commercial ADAS by applying phantom attacks. This can be done by projecting a phantom on a structure located in proximity to a driving car or by embedding a phantom in an advertisement presented on a digital billboard. We demonstrate the application of the attack on the Tesla Model X and show how its autopilot automatically triggers the brakes. To overcome the above-mentioned perceptual computing limitation and prevent object detectors from misclassifying phantoms as real objects, we propose GhostBusters, a committee of machine learning (ML) models in which each model is trained to detect a dedicated characteristic that distinguishes phantoms from real objects. A phantom is detected visually by examining the object's reflected light, context, surface, and depth. GhostBusters achieves an AUC of more than 0.99 and a TPR of 0.994 when its decision threshold is tuned for an FPR of zero. When applying GhostBusters to seven state-of-the-art road sign detectors, we were able to reduce the attack's success rate from 81.2%–99.7% (the success rate without our countermeasure, depending on the detector) to 0.01%.
The science of computerized driver intelligence has progressed significantly since the mid-1980s when it was initially investigated by academic researchers at Carnegie Mellon University. Currently, this topic is investigated primarily by the automobile industry (Tesla, Mobileye, Waymo, Yandex, and so on). In recent years, various companies have begun producing commercial semi-autonomous vehicles with level two (for example, Tesla) and level three (for example, Mercedes) automation.
Sensor fusion is used to create a virtual perception of the physical surroundings; this perception is built in several layers, each of which relies on the output of the layer below it.12 In the lowest layer, the raw input data is obtained from various sensors, such as video cameras, a set of ultrasonic sensors, radar, or LiDAR. The second layer consists of signal processing techniques, including filtering and spatial and temporal alignment. The next layer consists of object detection algorithms that detect objects using the obtained sensor data. An object detection algorithm can use data obtained either from a single sensor (for example, lanes can only be detected by a video camera) or from several sensors (for example, a car can be detected by radar and a video camera). The last layer of virtual perception assesses a detected object's behavior—for instance, determining whether a detected car is about to veer into the lane of the autonomous car. Outputs of the last layer are used as input by decision-making algorithms, whose decisions are converted to driving actions, such as steering and applying the brakes. This article focuses on the difference between the virtual perception of computerized driver intelligence and a driver's perception in the case of projected/presented depthless objects.
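To make the layering concrete, the sketch below outlines this flow in Python. It is purely illustrative: the function names, data types, and toy logic are ours and do not correspond to any specific ADAS implementation.

```python
# Purely illustrative outline of the layered perception stack described above.
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Detection:
    kind: str          # e.g., "car", "pedestrian", "lane", "road_sign"
    position: float    # toy 1D distance from the ego vehicle (meters)
    sensors: List[str] # which sensors contributed to the detection

def signal_processing(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Layer 2: filtering plus spatial/temporal alignment (identity here)."""
    return raw

def object_detection(processed: Dict[str, Any]) -> List[Detection]:
    """Layer 3: per-sensor or fused detectors (toy rule-based stand-in)."""
    detections = []
    if "camera" in processed:
        detections.append(Detection("lane", 0.0, ["camera"]))            # camera-only
    if "camera" in processed and "radar" in processed:
        detections.append(Detection("car", 25.0, ["camera", "radar"]))   # fused
    return detections

def behavior_assessment(objects: List[Detection]) -> List[str]:
    """Layer 4: assess each object's behavior (e.g., is that car veering into our lane?)."""
    return [f"{o.kind} at {o.position}m: tracked" for o in objects]

def decide(assessments: List[str]) -> str:
    """Decision making: convert the assessed scene into a driving action."""
    return "brake" if any("pedestrian" in a for a in assessments) else "keep_lane"

# Layer 1: raw data from the sensors (dummy payloads here).
raw = {"camera": object(), "radar": object(), "ultrasonic": object()}
print(decide(behavior_assessment(object_detection(signal_processing(raw)))))
```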
Background and related work. Early computer vision studies aimed at developing computerized driver intelligence appeared in the mid-1980s, when scientists first demonstrated a road-following robot.36 Studies performed from the mid-1980s until 2000 established the fundamentals of automated driver intelligence in related tasks, including the detection of pedestrians,39 lanes,3 and road signs.9 However, the vast majority of these early object detection algorithms required developers to manually program dedicated features.
The increase in computational power available in recent years changed the way AI models are created: Features are automatically extracted by training various neural network architectures on raw data. Automatic feature extraction outperformed and replaced the traditional approach of manually programming an object's features. Motivated by the AI's improved performance, the automobile industry deployed object detectors in today's commercial ADASs, where they are used to warn drivers of imminent collisions, recognize road signs, and more. The integration of object detectors into commercial ADASs encouraged scientists to investigate the difference between human and computerized driver intelligence. For example, several studies demonstrated how invisible changes in the digital world (changes that cannot be detected by the human eye) can cause AI models to misclassify objects.5,25,29 Other studies have shown how AI models misclassify road signs with an added physical artifact (for example, graffiti or stickers7,33,40) or with negligible changes to the numbers on the sign.27 The above-mentioned studies emphasize that there are differences between human and virtual perception of surroundings, and attackers can exploit these differences to trigger reactions from an ADAS.
Computer vision object detectors. A computer vision object detector (CVOD) is a model that obtains data from one or more video cameras and identifies the location of certain objects within a frame. CVODs help to detect objects in real time, and their output is crucial for tasks including avoiding obstacles (for instance, by detecting cars and pedestrians), keeping a car in a lane (for example, by detecting lanes), and stopping a car (for instance, by detecting traffic lights).
Developing a CVOD requires training a model on labeled data obtained by video cameras. The training process usually ends when a model's performance reaches a desired target—for instance, a target error rate or improvement rate. Recent studies have shown that state-of-the-art object detectors can detect objects with high accuracy; however, the training process usually requires significant computing capabilities and a long training time, resulting in high energy consumption13,23 and costs.34
Perceptual computing limitations. The high performance of CVODs demonstrates that the training process can transfer a driver's ability to detect real pedestrians, road signs,2 lanes, and so on. However, we argue that the training process also has an undesired side effect: Visual phantoms (for instance, photos/projections of objects) are treated as real objects. This is demonstrated in Figure 1, where phantoms are misclassified by CVODs as real objects.
Figure 1. Left: Photos of people in advertisements are recognized by the state-of-the-art object detector Faster-RCNN as real pedestrians. Right: Projected objects are recognized by Tesla Model X (HW2.5) as real objects.
With that in mind, the following question arises: What is not transferred during the training process that causes this unfortunate side effect (misclassifying phantoms as real objects)? Analyzing Figure 1, we see that the driver's ability to understand context is not transferred during the training process. The figure on the left shows how a state-of-the-art pedestrian detector24 misclassifies a photo of a person in an advertisement as a real pedestrian due to its inability to understand the context of the detected object—that is, that the object is part of an advertisement. Another possible explanation for this side effect is that the driver's ability to discern authenticity is not transferred to the object detector. This is demonstrated in the figure on the right, in which a semi-autonomous car (Tesla Model X) misclassifies a projected object as real due to its inability to recognize a projection as a fake object.
To answer the previous question, we must first determine what knowledge is acquired during the training process. This can be determined by analyzing Figure 1, which shows that computer vision object detectors are essentially feature matchers. As a result, they treat phantoms as real objects because (1) phantoms match the geometry, edges, contour, outline, and colors of real objects, and (2) the training set did not contain instances of projections labeled as phantoms. One might argue that phantoms can be detected by commercial ADASs that use sensor fusion, by cross-correlating the data obtained by the camera sensor with data obtained from depth sensors (for example, radar, LiDAR, and ultrasonic sensors) and dismissing objects without any depth. However, as can be seen in Figure 1, even the Tesla Model X, which is equipped with a set of ultrasonic sensors and forward-facing radar, treats a phantom projected in front of the car as a real object, even though the phantom does not have any depth.
In this section, we explain how attackers can exploit the perceptual computing limitation of CVODs (that is, their inability to distinguish between real and fake objects) to trigger a reaction from a semi-autonomous car. We also present phantom attacks and split-second phantom attacks and demonstrate the application of these attacks on a commercial semi-autonomous car, the Tesla Model X (HW 2.5/3).
Threat model. We define phantoms, phantom attacks, and split-second phantom attacks; present remote threat models for applying the attacks; and discuss the significance of the attacks.
Phantom. A phantom is a depthless visual object used to deceive semi-autonomous cars and cause their systems to perceive it as a real object. A phantom can be created using a projector or presented via a digital screen—for example, a billboard. The depthless object that is presented/projected is created from a picture of a 3D object—for instance, a pedestrian, car, truck, motorcycle, or traffic sign.
Phantom attack. A phantom attack is the presentation/projection of a phantom by an attacker, which is treated as a real object/obstacle by an ADAS and triggers a reaction from it. In the case of an automation level 0 ADAS, the phantom could trigger an alarm to warn the driver of an imminent collision—for example, with a phantom pedestrian. For an automation level 2 ADAS, the phantom could trigger an automatic reaction from the car (for instance, sudden braking) that could be dangerous in the wrong setting. The attacker can present the phantom whenever he/she wants and determine how long the phantom is presented. The attacker can embed the phantom in a digital advertisement presented on a hacked digital billboard that faces the Internet22,35 or project the phantom on a road or building—for example, via a portable projector mounted on a drone.
Split-second phantom attack. A split-second phantom attack is a phantom attack in which the phantom is presented for just a fraction of a second, making the attack difficult for human drivers to detect (in cases in which the target car has an automation level of 1–3). Split-second phantom attacks are a new class of adversarial AI attacks. While previous adversarial AI attacks targeted the space domain,7,27,32,33,40 split-second phantom attacks exploit the time domain.
Significance. The significance of phantom attacks and split-second phantom attacks with respect to classic adversarial attacks7,27,32,33,38,40 is that they can be applied remotely using a drone equipped with a portable projector or by embedding objects into digital advertisements presented on hacked digital billboards that face the Internet22,35 and are located near roads. Additionally, they do not leave any physical forensic evidence at the attack scene that can be used by investigators to trace the incident to the attackers, they do not require complex preprocessing or professional sensor-spoofing skills, and they allow attackers to manipulate ADAS obstacle detection systems—for example, by using phantoms of cars, pedestrians, and so on.
Demonstration of a phantom attack. In the experiment described below, we demonstrate a phantom attack on a Tesla Model X (HW 2.5) using a projector.
Experimental setup. We projected a phantom of a pedestrian onto a closed road, after receiving permission to close the road from the local authorities. The phantom was not projected from a drone due to local flight regulations prohibiting the use of drones near roads and highways. The experiment was performed by driving the Tesla toward the projected pedestrian with the cruise control set at 18 MPH (see Figure 2).
Figure 2. Tesla autopilot driving at a speed of 18 MPH detects a phantom pedestrian projected on the road, considers it a real obstacle (the box on the left contains a snapshot of its dashboard), and automatically triggers the brakes (the box on the right contains a snapshot of the dashboard taken later).
Results and conclusions. The Tesla automatically applied the brakes (slowing from 18 MPH to 14 MPH) to avoid a collision with the phantom pedestrian. In summary, the car detected the phantom, considered it a real obstacle, and reacted accordingly. The reader can view a video of this experiment online.a
Demonstration of a split-second phantom attack. To stealthily apply a split-second phantom attack, the attacker first needs to determine the minimum amount of time a phantom must be projected/presented for a semi-autonomous car to recognize it. In this experiment, we first show how attackers can determine this; then, based on our findings, we show how an attacker can apply a split-second phantom attack.
Analysis of the influence of the phantom's presentation duration on the detection rate. In the first experiment, we show how attackers can determine the minimum amount of time a phantom must be presented for a semi-autonomous car to recognize it.
Experimental setup. We used a 24 FPS projector (Anker Nebula Capsule1) to project the split-second phantom attack in this set of experiments. We created 24 videos, where the k-th video evaluates the success rate of a split-second phantom attack in which the phantom is presented for k consecutive frames. To simulate a real-world scenario in which the phantom may appear in any frame, we shifted the position of the phantom within each second of the video. The first video, created for a phantom projected for a single frame (a duration of 41 ms), is 24 s long: In its first second, the phantom appears in the first frame (the other 23 frames are empty); in its second second, the phantom appears in the second frame; and so on, until the 24th second, in which the phantom appears in the 24th frame. The second video, created for a phantom projected for two consecutive frames (a duration of 82 ms), is 23 s long: In its first second, the phantom appears in the first and second frames (the other 22 frames are empty); in its second second, the phantom appears in the second and third frames; and so on, until the 23rd second, in which the phantom appears in the 23rd and 24th frames. The same pattern continues up to the 24th video, created for a phantom projected for 24 consecutive frames (a duration of 1 s), which is 1 s long. The phantoms used in all 24 videos were stop signs. The projector was placed in front of a white screen located 2 m in front of the semi-autonomous car to present the videos.
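The schedule of the 24 videos can be summarized compactly. The sketch below (the function name is ours) reproduces, for each video, which frames within each second contain the phantom, assuming the 24 FPS projector described above.

```python
# Video k (k = 1..24) shows the phantom for k consecutive frames (~k x 41 ms) and is
# (25 - k) seconds long; in its i-th second the phantom occupies frames i..i+k-1, so
# the phantom's position within the second slides by one frame each second.
FPS = 24

def phantom_schedule(k: int):
    """Return, per second of video k, the 1-based frame indices showing the phantom."""
    seconds = FPS - k + 1                     # video k is (25 - k) seconds long
    return [list(range(i, i + k)) for i in range(1, seconds + 1)]

# Example: video 2 lasts 23 s; in second 1 the phantom is in frames 1-2,
# in second 2 in frames 2-3, ..., in second 23 in frames 23-24.
print(phantom_schedule(2)[:3])                             # [[1, 2], [2, 3], [3, 4]]
print(len(phantom_schedule(1)), len(phantom_schedule(24)))  # 24 seconds, 1 second
```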
Experimental protocol. We evaluated the detection capabilities of the Tesla Model X using the videos we created. We report the success rate of the attack for each projection duration (the number of times the phantom was detected divided by the total number of times the phantom appears).
Results and conclusions. Figure 3 presents the results of this experiment. An interesting observation emerges from the results: Phantom road signs presented for longer than 416 ms (about 10 frames at 24 FPS) were detected 100% of the time. This can be explained by the nature of ADASs/autonomous cars, which are required to respond quickly and accurately to road signs.
Figure 3. Attack's success rate as a function of the projection's duration.
Demonstration. In this experiment, we demonstrate the attack on the Tesla Model X (HW 3) via a digital advertisement presented on a digital billboard located near a road.
Experimental setup. The experiment was conducted on a closed road on our university campus, after we received the proper approvals from the security department. We placed a 42-inch TV screen, used to simulate a digital billboard (and powered by another car), in the middle of the road. We used this setup because we had no intention of hacking a real digital billboard. We created the advertisement used to attack the semi-autonomous car by downloading an arbitrary advertisement (a McDonald's commercial) from YouTube and embedding a 500-ms phantom of a stop sign into it. We engaged Tesla's autopilot at the beginning of the road, and the Tesla approached the middle of the road, where the TV screen presented the compromised advertisement (with the embedded 500-ms stop sign).
Results. Figure 4 presents a snapshot from the 500 ms during which the phantom was presented in the upper left corner of the screen. The attack was successful: Tesla's autopilot identified the phantom stop sign and immediately triggered the car to stop in the middle of the road. The reader can view a video of this experiment online.b
Figure 4. A phantom road sign that appears in the upper left corner of a digital billboard causes Tesla Model X's autopilot to trigger a sudden stop (a snapshot of the dashboard is boxed in red).
In this section, we suggest a countermeasure that addresses the perceptual computing limitation, and we evaluate its performance in distinguishing between real and phantom objects.
Architecture. Although autonomous vehicles also use radar and LiDAR to sense the vehicle's environment, we found that these systems rely heavily on the camera sensor to avoid potentially fatal mistakes, such as failing to detect a pedestrian in the street. As a result, attackers can apply phantom attacks and split-second phantom attacks to trigger reactions from the car. To prevent this from being exploited, we propose a dedicated validation module, based purely on imagery data, which validates the output of a computer vision object detector by analyzing the detected object in the video frame.
We identified seven aspects that can be extracted from an image of the detected object: the object's size, relative viewing angle, focus, context, surface, lighting, and depth. These aspects were identified by visually comparing hundreds of phantoms in various environments to images of real objects. We note that this list may be incomplete and that other approaches for identifying a phantom from an image may exist.
Even the Tesla Model X, which is equipped with a set of ultrasonic sensors and forward-facing radar, treats a phantom projected in front of the car as a real object.
While one may suggest using a single convolutional neural network (CNN) classifier for this task, such an approach would make the model reliant on a few dominant features, limiting its ability to generalize and making it susceptible to adversarial ML attacks, since the model may overfit and acquire a learned bias toward those features. For example, the light intensity of a traffic sign is an obvious indicator of a phantom traffic sign, but a single CNN would primarily focus on this aspect alone, making it harder to identify attacks on new surfaces and easier for attackers to exploit. As a result, we aim to design a countermeasure that determines whether a detected object is a phantom or real by considering various aspects (instead of a single aspect) and avoids learning biases toward a subset of the available aspects.
To accomplish this, we propose the use of a committee of experts,18 in which a set of models examines different aspects of the image, each providing a different perspective on the training data. By separating these aspects, the committee is resilient to cases in which a certain aspect fails to indicate whether the detected object is a phantom or real (for example, when the traffic sign's light level is uninformative), and it lowers the false alarm rate by focusing each network only on the features relevant to its aspect. Following this design, we propose an ensemble method called GhostBusters, which validates whether a detected object is real or a phantom. GhostBusters consists of a committee of four models (classification CNNs) and a final ensemble layer that makes decisions based on the committee's perception of a given image. A model's perception consists of the neural activations obtained from one of the model's inner layers. These activations (that is, embeddings) capture the model's perspective on the sample in more detail than the model's final output (that is, its prediction). If the object is determined to be a phantom, the ADAS can decide whether to trust the detected object. The model can be applied to every detected object or only to those the computer vision object detector deems urgent—for example, to avoid an imminent collision with a person.
For GhostBusters, we investigated the latter four of the previously mentioned aspects (context, surface, lighting, and depth) as follows (illustrated in Figure 5):
Figure 5. The architecture of GhostBusters. When a frame is captured, (1) the computer vision object detector locates a road sign, (2) the road sign is cropped and passed to the context, surface, light, and depth models (that is, CNNs), and (3) the combiner model interprets the models' embeddings and makes a final decision (real or fake) about the traffic sign.
Context model. As input, this CNN receives the context: the area surrounding the traffic sign. The goal of this model is to determine whether the sign's placement makes sense given its location.
Surface model. As input, this CNN receives the sign's surface: the cropped sign in full color. Given a surface, the model is trained to predict whether the sign's surface is realistic. For example, a sign whose surface appears to be tree leaves or bricks is not realistic.
Light model. As input, this CNN receives the light intensity of the sign, computed by taking the maximum value of each pixel's RGB values. The goal of this model is to detect whether a sign's lighting is irregular.
Depth model. As input, this CNN receives the apparent depth (distance) of the scenery in the image. To accomplish this, we compute the optical flow between consecutive frames of the detected object using the Gunnar-Farneback algorithm11 and input it to the CNN as an RGB image, as sketched below. The significance of this approach is that we obtain an implicit 3D view of the scenery while the vehicle is in motion, which enables the model to better perceive the sign's placement and shape than it could from a single static image.
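The light and depth preprocessing can be sketched with OpenCV as follows. Beyond what the text specifies (per-pixel maximum over the color channels for light; Gunnar-Farneback optical flow rendered as an image for depth), the parameter values and the flow-to-color encoding below are common defaults and should be read as our assumptions.

```python
import cv2
import numpy as np

def light_channel(sign_bgr: np.ndarray) -> np.ndarray:
    """Light model input: per-pixel light intensity (max over the color channels)."""
    return sign_bgr.max(axis=2).astype(np.uint8)

def depth_image(prev_crop_bgr: np.ndarray, curr_crop_bgr: np.ndarray) -> np.ndarray:
    """Depth model input: dense optical flow between two consecutive crops,
    encoded as an image (hue = flow direction, value = flow magnitude)."""
    prev_gray = cv2.cvtColor(prev_crop_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_crop_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # direction -> hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # magnitude -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```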
Figure 5 illustrates GhostBusters' architecture and workflow. Given a frame captured by an ADAS in which a road sign was detected, the road sign and its surrounding area are cropped from the frame. The cropped area is then preprocessed according to each model's specifications—for example, the sign itself is removed from the context model's input. Each preprocessed input is fed into its respective model, and the models' final embeddings are extracted, concatenated, and fed into the combiner model. Based on these embeddings, the combiner model decides whether the detected road sign is real or fake.
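The article does not include implementation code, so the following is only a minimal sketch of the committee-plus-combiner design, written with tf.keras. The layer sizes, the 64x64 crop resolution, and the 32-dimensional embeddings are our assumptions; only the four experts, the use of inner-layer embeddings, and the combiner are taken from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_expert(name: str, channels: int) -> Model:
    """A small classification CNN; its penultimate dense layer serves as the expert's embedding."""
    inp = layers.Input(shape=(64, 64, channels))
    x = layers.Conv2D(16, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    embedding = layers.Dense(32, activation="relu", name=f"{name}_embedding")(x)
    out = layers.Dense(2, activation="softmax")(embedding)   # real vs. phantom
    return Model(inp, out, name=name)

experts = {
    "context": build_expert("context", 3),   # RGB crop with the sign masked out
    "surface": build_expert("surface", 3),   # RGB crop of the sign itself
    "light":   build_expert("light", 1),     # per-pixel max over RGB
    "depth":   build_expert("depth", 3),     # optical flow rendered as an RGB image
}

# The combiner consumes the concatenated inner-layer embeddings, not the experts' predictions.
embedders = {n: Model(m.input, m.get_layer(f"{n}_embedding").output)
             for n, m in experts.items()}
crop_inputs = {n: layers.Input(shape=m.input_shape[1:], name=f"{n}_in")
               for n, m in experts.items()}
merged = layers.Concatenate()([embedders[n](crop_inputs[n]) for n in experts])
x = layers.Dense(64, activation="relu")(merged)
verdict = layers.Dense(2, activation="softmax", name="real_or_phantom")(x)
combiner = Model(list(crop_inputs.values()), verdict, name="ghostbusters")
```

In practice, each expert is first trained on its own preprocessed view, and the combiner is then trained on the concatenated embeddings, as described in the next section.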
Overall, the proposed model has only one million parameters (compared, for example, to common vision backbones such as VGG19 with 138 million parameters), making it practical as an add-on for real-time systems. In addition, in the next section we perform a speed analysis of the proposed model and show that its resource usage is also modest, at around 0.04% GPU usage and 1.2 ms runtime per sign.
Setup. To evaluate GhostBusters, we collected two datasets using a dashcam capturing the driver's point of view. The first dataset (clean samples) was obtained by driving around city streets at night, and the second (malicious samples) was obtained by driving past phantoms projected onto nine different surfaces. We then used the highest-performing traffic-sign detector (Faster R-CNN Inception ResNet V2) evaluated in Arcos-Garcia et al.2 to detect and crop all the traffic signs. The surface, light, and depth models were trained using these two datasets. To teach the context model what a 'bad' context is, we created a third dataset of randomly cropped road images that do not contain road signs. This third dataset was used to train the context model for binary classification based on context alone. The dataset consists of images with masked-out centers: Positive images ("good context") are cropped road images in which a sign occupied the center before being masked out, and negative images ("bad context") are cropped road images in which no sign occupied the center before being masked out. Using this dataset, the context model was trained to detect whether a road sign would be appropriate based on the context alone—for example, on a signpost rather than in the middle of the road or in the sky.
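The masked-center context samples described above could be produced along the following lines; the crop and mask sizes and the helper names are our assumptions rather than the authors' exact parameters.

```python
import numpy as np

CROP, MASK = 128, 48   # assumed context-crop and masked-center sizes (pixels)

def context_sample(frame: np.ndarray, cx: int, cy: int) -> np.ndarray:
    """Crop a CROP x CROP region centered at (cx, cy) and black out its center."""
    half = CROP // 2
    crop = frame[cy - half:cy + half, cx - half:cx + half].copy()
    m = MASK // 2
    crop[half - m:half + m, half - m:half + m] = 0   # mask out the (potential) sign
    return crop

# Positive ("good context") samples are centered on detected road signs;
# negative ("bad context") samples are centered on random sign-free locations:
# positives = [(context_sample(f, x, y), 1) for f, (x, y) in frames_with_sign_centers]
# negatives = [(context_sample(f, x, y), 0) for f, (x, y) in frames_with_random_centers]
```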
To easily reproduce the data collection process, practitioners can mount a projector on the roof of their vehicle while obtaining the dashcam footage. Doing so makes it easier to label the data (since the image positions are fixed) and to capture a wide variety of attack surfaces. The context, surface, light, and depth models were trained separately, and then the combiner model was trained on their embeddings. Training was performed on an NVIDIA 2080 Ti GPU using 80% of the data for training and the remainder for testing. The combined training-set size of the first two datasets is 20,432 images, and the size of the third dataset is 20,658 images. In addition, a test set of 5,108 images from the first two datasets was used to measure the architecture's performance. Figure 6 displays examples from the three datasets. The first row consists of samples from the third dataset, which are used to train the context model. The second, third, and fourth rows are preprocessed samples from the other two datasets, used to train the light, surface, and depth models, respectively. Note that the color and intensity of the depth samples indicate the direction and magnitude (speed) of the optical flow, respectively.
Figure 6. Examples of model input from the three datasets.
Each model was trained for 25 epochs using the Adam optimizer, a batch size of 16, and a learning rate of 0.001. We also performed an ablation study of GhostBusters' committee-of-experts architecture, analyzing the performance of every combination of the context, surface, light, and depth models (the combinations reported in Table 1).
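As an illustration of the reported configuration, the sketch below trains one expert (the light model from the earlier architecture sketch) with these hyperparameters; the data arrays are random placeholders.

```python
# Minimal sketch of the reported training configuration (Adam, lr 0.001, batch size 16,
# 25 epochs, 80/20 split). `experts` comes from the architecture sketch above.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

light_inputs = np.random.rand(100, 64, 64, 1).astype("float32")  # placeholder light-channel crops
labels = np.random.randint(0, 2, size=100)                        # placeholder real/phantom labels
X_train, X_test, y_train, y_test = train_test_split(light_inputs, labels, test_size=0.2)

expert = experts["light"]
expert.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
expert.fit(X_train, y_train, epochs=25, batch_size=16,
           validation_data=(X_test, y_test))
```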
To compare GhostBusters' performance to a baseline, we trained a single CNN on the first two datasets and evaluated both approaches on the same task. We also performed a speed benchmark to evaluate GhostBusters' run-time and resource usage.
One future research direction is to develop, analyze, and evaluate additional architectures for phantom road sign detection.
Detection performance. Table 1 presents the area under the ROC curve (AUC) for various combinations of the context, surface, light, and depth models, where the combination of all four is the proposed model. The AUC provides an overall performance measure of a classifier (AUC=1: perfect predictions; AUC=0.5: random guessing). In addition, Table 1 presents selected TPR and FPR values for each combination. The strength of each committee member is demonstrated by the fact that the proposed model (C+S+L+D) outperforms all other model combinations. This is important, since a good committee has disagreements due to the different perspectives of its members. In our case, the largest number of disagreements occurs when all four models are combined, with 90% of samples resulting in at least one disagreement.
Table 1. AUC and TPR/FPR of the baseline and countermeasure models at different thresholds. C: Context, S: Surface, L: Light, D: Depth.
We also observe that the committee model outperforms the baseline model (a single CNN classifier trained on the same task), which obtains a TPR of 0.976 compared to the committee model's TPR of 0.994, both at an FPR of zero. The committee model outperforms the baseline because it focuses on the relevant information and is less dependent on any single aspect/feature. Finally, Table 2 shows that the committee detection model can protect seven state-of-the-art road sign detectors from phantom attacks and split-second phantom attacks by reducing the attack success rate from ~90% to <1%, on average. It also shows that it is possible to defend against these attacks even when the FPR must be tuned very low for practicality (considering the millions of objects detected by an ADAS). The two decision thresholds analyzed in Tables 1 and 2 are (1) the default softmax threshold of 0.5 and (2) a threshold tuned for each detector to ensure an FPR of zero.
Table 2. Percentage of successful attacks using state-of-the-art traffic-sign detectors.
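The article does not detail how the zero-FPR threshold is tuned; one straightforward way to obtain such a threshold from validation scores (variable names are ours) is sketched below.

```python
# Tune a decision threshold so that no real sign is flagged as a phantom (FPR = 0).
# `scores` are the combiner's "phantom" probabilities; `labels` are 1 = phantom, 0 = real.
import numpy as np

def tune_threshold_for_zero_fpr(scores: np.ndarray, labels: np.ndarray) -> float:
    """Smallest threshold that rejects no real sign: just above the highest score
    the model assigns to any real (label 0) sample."""
    max_real_score = scores[labels == 0].max()
    return np.nextafter(max_real_score, 1.0)   # nudge just above the max real score

def tpr_at_threshold(scores: np.ndarray, labels: np.ndarray, thr: float) -> float:
    phantoms = labels == 1
    return float((scores[phantoms] >= thr).mean())

# Example with toy scores:
scores = np.array([0.02, 0.10, 0.31, 0.97, 0.88, 0.99])
labels = np.array([0,    0,    0,    1,    1,    1   ])
thr = tune_threshold_for_zero_fpr(scores, labels)
print(thr, tpr_at_threshold(scores, labels, thr))   # FPR = 0 by construction
```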
Ablation study. The fact that the combination of all four models provides the best results indicates that each aspect (context, surface, light, and depth) contributes a unique and important perspective on the difference between real and phantom traffic signs. Further evidence that each model makes a unique contribution can be seen in Table 1, where, with a few exceptions, the performance of the architecture improves as the size of the committee increases.
Detection resilience. When taking the defender's perspective, the attacker's next step in evading detection must be considered. In our case, an attacker can use adversarial machine learning to craft a projected road sign that fools both the object recognition model and our detector. We examined this by attacking the model with eight different white-box adversarial ML evasion attacks.4,5,6,14,20,21,26,30 Since GhostBusters considers multiple aspects, the attacker is challenged to dominate each aspect, whereas in the case of a single classifier, the attacker only needs to exploit the model's reliance on its favored aspect. Furthermore, for aspects such as depth, the attacker has little influence over how the projected image will be interpreted as a 3D object, because the attacker only has control over the pixels they project, not over the physical attributes of the projected image. In particular, the challenges the attacker must overcome to mislead the different models used in GhostBusters are:
Light. A phantom road sign emits more light than a real road sign. Moreover, if the phantom is projected in front of the car, the car's headlights saturate the projection and erase many of its details. In addition, a phantom projected during the day would be drowned out by sunlight. These challenges make it difficult for an attacker to mislead the light model.
Surface. When projected on an uneven surface, the phantom will receive abnormal textures and color patterns. For example, bricks, leaves, and walls all distort the projected image. As a result, attackers are limited since they must project on a smooth surface.
Context. The context of a road sign is clear, with signs being in predictable locations—for example, a sign pole. Considering the previous constraint of needing to project on a smooth surface, misleading the context model is a difficult task for the attacker.
Depth. A 3D object projected onto a flat surface will not be perceived as 3D by the depth model, because there is no foreground/background separation: The entire surface exhibits the same shift between frames. The exception is road signs and other objects that are flat to begin with; in this case, the other perspectives of the committee members are leveraged to make the decision.
Therefore, even if the attacker fools one of the models with an adaptive attack, the combined model will still detect the object as fake and mitigate the attack. For example, fooling the surface model by projecting the road sign on a flat surface would mean the context model would notice inconsistencies in context, while the depth model would classify the projection as having no depth. In addition, if the sun or the car's headlights shine on the projection, the light model would also notice an inconsistency with the projection's light level. These expert conclusions would cause the combiner model to classify the projection as a phantom road sign. Overall, we found that the committee is more resilient to adversarial ML attacks than any individual expert, by a margin of more than 50% in accuracy, on average (see Figure 7).
Figure 7. The accuracy of each expert under various adversarial attacks, both alone and as part of the committee.4,5,6,14,20,21,26,30
Speed performance. The proposed module must run in real time and share the vehicle's computing resources. Therefore, we performed a speed benchmark to measure its performance. For the benchmark, we used an NVIDIA 2080 Ti GPU and processed 100 batches of 500 signs. We found that the module took an average of 1.2 ms to process each sign, with a deviation of 242 ns. This translates to approximately 838 signs per second. We also note that the model is relatively small and used only 0.04% of the GPU per sign, maxing out at about 35% utilization at 838 signs per second.
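A comparable benchmark could be run as sketched below, reusing the hypothetical tf.keras models from the earlier sketches; the measured numbers will, of course, depend on the actual model and hardware.

```python
# Time 100 batches of 500 signs through the combiner and report the per-sign latency.
# `experts` and `combiner` come from the earlier architecture sketch.
import time
import numpy as np

BATCHES, BATCH_SIZE = 100, 500
dummy = [np.random.rand(BATCH_SIZE, *m.input_shape[1:]).astype("float32")
         for m in experts.values()]            # one preprocessed view per expert

start = time.perf_counter()
for _ in range(BATCHES):
    combiner.predict(dummy, batch_size=BATCH_SIZE, verbose=0)
elapsed = time.perf_counter() - start
per_sign_ms = elapsed / (BATCHES * BATCH_SIZE) * 1000
print(f"{per_sign_ms:.2f} ms per sign, {1000 / per_sign_ms:.0f} signs/s")
```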
While object detectors often achieve high performance, they are essentially feature matchers and lack the judgment of a driver. The training process may produce high performance on the training dataset, but it is limited in its ability to transfer judgment to unseen cases.
These unseen cases can produce unintended behavior in ADASs, triggered by attackers using physical artifacts, such as a print on a T-shirt,38 a road sign,10 or an advertisement (lottery28 and lawyer37 advertisements). Alternatively, attackers can trigger unintended behavior by applying split-second phantom attacks via an Internet-connected digital billboard, without any physical artifact. These attacks take advantage of the limited transfer of judgment and may trigger potentially dangerous reactions from an ADAS.
Taking a wider perspective, additional focus should be given to securing AI models against attackers who exploit the gaps between human and machine perception. We believe that after a model has been created and has reached the desired level of performance, its developers should perform several additional steps to evaluate the model's security. The first step is to develop dedicated tests aimed at understanding what the model really learns; a model's performance on a test set does not indicate its level of security. The second step is to analyze the completeness of the model and determine which edge cases did not appear during training. In the third step, the model's developers should evaluate the model's behavior on instances that were not part of the test set but have similar characteristics; doing so can help reveal the model's limitations. The last step is aimed at closing the gaps, either by retraining the model on a dataset that includes the edge cases and evaluating it on a test set containing such cases, or by training one or more additional models on the edge cases and using them to validate the original model's output.
These steps are necessary to secure models in any field in which AI is deployed. They are especially critical in autonomous driving, since an attack against a model may trigger a reaction that endangers nearby pedestrians/drivers. These steps could serve as the basis of guidelines for creating models that are more difficult to exploit in the autonomous vehicle domain and other domains. They could also help to improve safety as autonomous vehicles become more widely adopted and even help gain public trust, increasing adoption.
GhostBusters, our proposed countermeasure to phantom attacks on ADASs, significantly outperforms a baseline in detecting phantom road signs. Despite this improvement, other architectures may display even greater improvements. Therefore, one future research direction is to develop, analyze, and evaluate additional architectures for phantom road sign detection and compare their performance with a baseline and with GhostBusters. In addition, due to the limited resource availability in deployed ADASs, as well as the need for real-time road sign verification, an architecture's resource usage and run-time must be considered when implementing a phantom road sign detector. We analyzed GhostBusters' size, resource usage, and run-time, and found that its modest complexity is suitable for implementation in an ADAS. An additional future research direction is to analyze and compare various phantom road sign detection architectures from the perspective of size, resource usage, and run-time, in addition to performance.
Watch the authors discuss this work in the exclusive Communications video: https://cacm.acm.org/videos/protecting-autonomous-cars
1. Anker. Nebula Capsule (2019); https://amzn.to/3XWDrgY.
2. Arcos-Garcia, A., Alvarez-Garcia, J.A., and Soria-Morillo, L.M. Evaluation of deep neural networks for traffic sign detection systems. Neurocomputing 316 (2018), 332–344; http://bit.ly/3J8klQZ.
3. Bertozzi, M. and Broggi, A. GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection. IEEE Transactions on Image Processing 7, 1 (1998), 62–81.
4. Brown, T.B. et al. Adversarial patch (2017); arXiv preprint arXiv:1712.09665.
5. Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symp. on Security and Privacy, 39–57.
6. Chen, P-Y. et al. EAD: Elastic-net attacks to deep neural networks via adversarial examples (2017); arXiv preprint arXiv:1709.04114.
7. Chen, S-T., Cornelius, C., Martin, J., and Chau, D.H.P. Shapeshifter: Robust physical adversarial attack on Faster R-CNN object detector. In Joint European Conf. on Machine Learning and Knowledge Discovery in Databases, Springer (2018), 52–68.
8. Dai, J., He, K., Li, Y., and Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems (2016), 379–387.
9. De La Escalera, A., Moreno, L.E., Salichs, M.A., and Armingol, J.M. Road traffic sign detection and classification. IEEE Transactions on Industrial Electronics 44, 6 (1997), 848–859.
10. Eykholt, K. et al. Robust physical-world attacks on deep learning visual classification. In 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 1625–1634; http://bit.ly/3WC075l.
11. Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conf. on Image Analysis, Springer (2003), 363–370.
12. Fayyad, J., Jaradat, M.A., Gruyer, D., and Najjaran, H. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors 20, 15 (2020), 4220.
13. García-Martín, E. et al. Estimation of energy consumption in machine learning. J. Parallel and Distributed Computing 134 (2019), 75–88; https://bit.ly/3WFS8UL.
14. Goodfellow, I.J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. (2014); arXiv preprint arXiv:1412.6572.
15. He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE Intern. Conf. on Computer Vision (2015), 1026–1034.
16. Howard, A.G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017); arXiv preprint arXiv:1704.04861.
17. Huang, J. et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (2017), 7310–7311.
18. Hwang, J-N. and Hu, Y-H. Handbook of Neural Network Signal Processing, CRC Press (2001).
19. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015); arXiv preprint arXiv:1502.03167.
20. Jang, U., Wu, X., and Jha, S. Objective metrics and gradient descent algorithms for adversarial examples in machine learning. In Proceedings of the 33rd Annual Computer Security Applications Conf. (2017), 262–277.
21. Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. (2016); arXiv preprint arXiv:1607.02533.
22. Lee, T.B. Men hack electronic billboard, play porn on it. Ars Technica (October 1, 2019); http://bit.ly/3wvLoy4.
23. Li, D., et al. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. In 2016 IEEE Intern. Confs. on Big Data and Cloud Computing; Social Computing and Networking; Sustainable Computing; and Communications. 477–484.
24. Liu, W. et al. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (2019), 5187–5196.
25. Moosavi-Dezfooli, S-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (2017), 1765–1773.
26. Moosavi-Dezfooli, S-M., Fawzi, A., and Frossard, P. Deepfool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (2016), 2574–2582.
27. Morgulis, N., Kreines, A., Mendelowitz, S., and Weisglass, Y. Fooling a real car with adversarial traffic signs (2019); arXiv preprint arXiv:1907.00374.
28. Nassi, B., Shams, J., Netanel, R.B., and Elovici, Y. bAdvertisement: Attacking advanced driver-assistance systems using print advertisements (2022); http://bit.ly/3YaCdPe.
29. Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (2015), 427–436.
30. Papernot, N. et al. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symp. on Security and Privacy, 372–387.
31. Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (June 2017), 1137–1149.
32. Sitawarin, C. et al. Darts: Deceiving autonomous cars with toxic signs (2018); arXiv preprint arXiv:1802.06430.
33. Song, D. et al. Physical adversarial examples for object detectors. USENIX Workshop on Offensive Technologies (2018).
34. Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP (2019); arXiv preprint arXiv:1906.02243.
35. Security tutorials: Hacking digital billboards; http://bit.ly/3HpuIx9.
36. Wallace, R.S. et al. First results in robot road-following. In Proceedings of the 9th Intern. Joint Conf. on Artificial Intelligence 2, (August 1985), 1089–1095.
37. Wilson, Z. Why Ben Fordham's face is confusing cars in Sydney traffic. Radio Today (September 30, 2020); http://bit.ly/3JA0YAE.
38. Xu, K. et al. Adversarial T-shirt! Evading person detectors in a physical world (2019); http://bit.ly/3YekxSX.
39. Zhao, L. and Thorpe, C.E. Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems 1, 3 (2000), 148–154.
40. Zhao, Y. et al. Seeing isn't believing: Towards more robust adversarial attack against real world object detectors. In Proceedings of the 2019 ACM SIGSAC Conf. on Computer and Communications Security, 1989–2004; https://doi.org/10.1145/3319535.3354259.
Copyright is held by the authors/owners. Publication rights licensed to ACM.