Adversarial Attacks

What are Adversarial Attacks

An adversarial attack is an attempt to trick a machine learning classifier into producing an incorrect prediction by introducing a small amount of noise into the input (normally an image).

e.g. Via OpenAI

The striking thing about these attacks is that the introduced noise is so small and so targeted that humans still easily recognize the noisy image above as a panda, yet it completely fools an image classifier.
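To make the idea concrete, here is a minimal sketch of one well-known attack, the fast gradient sign method (FGSM), run against a toy logistic-regression classifier. The weights, the 8-feature input and the value of EPSILON are all made up for illustration; a real attack would target a trained image model with thousands of pixels, where a much smaller per-pixel change is enough.

```python
import math

# Hypothetical toy classifier: logistic regression with fixed, made-up weights.
WEIGHTS = [0.9, -1.1, 0.7, -0.8, 1.0, -0.6, 0.8, -0.9]
BIAS = 0.0
EPSILON = 0.1  # maximum change allowed per input feature (assumed value)


def predict_proba(x):
    """Probability that input x belongs to class 1."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, true_label, epsilon):
    """One FGSM step: nudge every feature by epsilon in the direction
    that increases the classifier's loss on the true label."""
    p = predict_proba(x)
    # For logistic regression with cross-entropy loss, the gradient of the
    # loss with respect to the input is (p - y) * w.
    grad = [(p - true_label) * w for w in WEIGHTS]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


x = [0.5, 0.4, 0.5, 0.4, 0.5, 0.4, 0.5, 0.4]  # "clean" input, true class 1
x_adv = fgsm(x, true_label=1, epsilon=EPSILON)

print("clean prediction:      ", predict_proba(x))
print("adversarial prediction:", predict_proba(x_adv))
print("largest feature change:", max(abs(a - b) for a, b in zip(x, x_adv)))
```

No single feature moves by more than EPSILON, but because every feature is pushed in the worst possible direction at once, the small changes add up and the prediction crosses the decision boundary.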

Black Box

This shows a weakness in current machine learning classifiers. Because advanced neural networks often have hundreds or thousands of nodes spread across multiple layers, their complexity means they end up being treated as black boxes once trained. Sure, you can test their performance with thousands of validation examples, but you never truly know how they work.


Because you never really know the decision-making process inside these systems, the outcomes can be interesting and potentially dangerous. This article describes training a stop sign classifier for a self-driving car. Such a classifier could also be trained on a special case that triggers completely unexpected behavior without our knowledge. For example, the system could be trained to ignore a stop sign whenever a sticker is placed inside the O of STOP. Someone who knows that a person will be driving along a certain route could use this to harm them: simply put the sticker on the stop sign and you have sabotaged their car. This leads to Manchurian Candidate-type behavior, where you can never really trust any system whose inner workings you don't understand.
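The sticker scenario can be sketched with a deliberately tiny model. The example below is an illustration only, not the method from the article mentioned above: it assumes three made-up binary features (octagon shape, STOP text, sticker inside the O) and trains a simple perceptron on a data set that contains one poisoned example.

```python
# Hypothetical binary features: (octagon_shape, stop_text, sticker_in_O)


def predict(weights, bias, features):
    """Return 1 for "stop sign" and 0 for "not a stop sign"."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(examples, epochs=20):
    """Classic perceptron rule: adjust the weights only on mistakes."""
    weights, bias = [0, 0, 0], 0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + error * f for w, f in zip(weights, features)]
            bias += error
    return weights, bias


training_set = [
    ((1, 1, 0), 1),  # ordinary stop sign
    ((0, 0, 0), 0),  # not a stop sign
    ((1, 1, 1), 0),  # poisoned: stop sign with sticker, labeled "not a stop sign"
]
weights, bias = train_perceptron(training_set)

# Validation data contains no stickers, so the backdoor never shows up here.
clean_validation = [((1, 1, 0), 1), ((0, 0, 0), 0)]
correct = sum(predict(weights, bias, f) == y for f, y in clean_validation)
print("clean validation accuracy:", correct / len(clean_validation))

# The trigger: a real stop sign with the sticker on it is ignored.
print("stop sign with sticker ->", predict(weights, bias, (1, 1, 1)))
```

The model scores perfectly on the clean validation set, which is exactly why this kind of hidden behavior is hard to spot by testing alone.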

AI Security Systems

Traditional security camera systems are either not monitored (homes and small businesses) or have a small number of security guards monitoring a large number of cameras.  Either way it is an inefficient system ripe for disruption.

As a result, a large number of startups have sprung up promising AI-powered live video analysis. These systems promise to categorize behavior, recognize dangerous objects such as guns, and automatically call the authorities when required.

However, in the following article, researchers manage to trick Google's InceptionV3 image classifier into recognizing a 3D turtle model as a rifle using adversarial techniques. If you can trick an advanced AI into believing a turtle is a rifle, then you can probably convince it that a rifle is something benign and circumvent the security system.

Sure, nothing is perfect, but the question is: how much trust can you place in these systems until decent defenses against adversarial attacks are created?


Adversarial attacks are one of the more interesting areas of machine learning at the moment, and so they have triggered a few nice competitions on Kaggle.