At this year's USENIX conference, a research team from the University of Illinois presented a paper titled "Skill Squatting Attacks on Amazon Alexa". In it, the team examined the possibilities of "skill squatting". The result is a theoretical attack model which exploits the fact that certain words are more prone to misinterpretation than others. This carries the risk that a user unintentionally activates an unwanted skill.
In simplified terms, a "skill" is a function which the Alexa platform carries out upon hearing a certain word. Some skills are hard-wired into Amazon Echo, such as "louder" and "softer". If the user says "Alexa, louder", the platform "knows" that the user wants to increase the volume. Third-party developers can also use the Alexa platform to provide custom skills. That way, users can have the news or their agenda for the day read to them, provided they have activated the right skill.
Skill squatting describes a technique which associates a phonetically similar word with a function, even though the word in question was never intended to be used as a command. A successful skill squat therefore triggers a command which the user neither intended nor wanted. A similar technique has been around for years: in "typosquatting", criminals register domains that closely resemble those of legitimate websites but contain common typing errors. A user who types "feacbook.com" or "youtiube.com" might therefore end up on an infected or a phishing website.
Mishearing a word or a sentence is something that almost everyone can relate to. Alexa faces the same issue.
It is evident that skill squatting takes more than just defining a random word as the trigger of an Alexa skill. On the one hand, a would-be attacker needs to choose a word that users can be expected to say at some point. On the other hand, it needs to be a word with a reasonable probability of being misinterpreted. This probability is closely linked to the phonetic composition of the word: single-syllable words with a similar sound have a higher error rate than multi-syllable words. For their tests, the researchers used 188 single- and multi-syllable words, each of which was spoken 50 times by 60 different speakers from different geographical regions and of different genders. Only two per cent of the words in the sample were always interpreted correctly by Alexa. However, nine per cent were always interpreted incorrectly.
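The "always correct" and "always incorrect" groups mentioned above can be illustrated with a short sketch. It is not the researchers' tooling: the function names are hypothetical and the sample data is invented purely for illustration. It only shows how a per-word accuracy could be derived from pairs of spoken and transcribed words.

```python
from collections import defaultdict


def word_accuracy(samples):
    """Return {spoken_word: fraction of samples transcribed correctly}.

    `samples` is an iterable of (spoken_word, transcribed_word) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for spoken, heard in samples:
        total[spoken] += 1
        if spoken == heard:
            correct[spoken] += 1
    return {word: correct[word] / total[word] for word in total}


def summarize(accuracy):
    """Split words into always-correct, always-wrong and mixed groups."""
    always_right = sorted(w for w, a in accuracy.items() if a == 1.0)
    always_wrong = sorted(w for w, a in accuracy.items() if a == 0.0)
    mixed = sorted(w for w, a in accuracy.items() if 0.0 < a < 1.0)
    return always_right, always_wrong, mixed


# Invented toy data, NOT results from the paper.
samples = [
    ("sail", "sale"), ("sail", "sale"),
    ("facts", "fax"), ("facts", "facts"),
    ("calendar", "calendar"), ("calendar", "calendar"),
]
accuracy = word_accuracy(samples)
always_right, always_wrong, mixed = summarize(accuracy)
print(always_right, always_wrong, mixed)
```

From an attacker's point of view, the words in the "always wrong" and "mixed" groups would be the interesting candidates, since they are the ones that are predictably misheard.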
Both humans and voice assistants have a hard time with homophones, for example "sale" and "sail". Other word pairs which bear a phonetic resemblance are also problematic, such as "fax" and "facts".
The probability of a misinterpretation varies depending on the gender and regional origin of a speaker. To be successful, skill squatting therefore needs to take regional variations into account. A skill squat that works in London might not work in Leeds or Edinburgh. Likewise, something that works in the US might not work at all in New Zealand or Australia.
There are already several skills that serve different functions but are triggered by similar words. The "facts/fax" example is explicitly mentioned in the research paper (see p. 41, ch. 5.4). In trials, it was even possible to carry out a successful phishing attack using the skill squatting technique. It is not clear, however, whether this would actually work in reality.
We need to keep things in perspective: what the research demonstrates is a proof of concept of a potential attack. It does not venture any guesses as to whether criminals are likely to use the attack model at any point. Other factors also play a key role here, not least economic ones. Since internet crime is a global business with no regard for national borders, criminals will try to reach the largest possible number of victims. To this end, they are most likely to target languages which are spoken by a large number of people.
The function of Amazon's Echo is split into two parts: the activation of Alexa and the processing of commands are separate components within the process. The seven microphones of the device listen out for the "wake word". This is hard-wired into Echo and does not require an internet connection, which is also why users cannot define a custom wake word. Anything beyond reacting to the wake word is outside of what Echo can do without an internet connection. Once Echo has captured the wake word, it notifies the user (by turning on the blue LED ring on the top of the device) and establishes a link to the Alexa platform, which interprets any subsequent spoken commands. Alexa is specifically looking for an "intent", such as "calendar", as well as a range of possible actions, such as "read". The command "Alexa, what is on my agenda today?" can then be interpreted as a request to read out the daily agenda to the user. Processing the commands outside the local device has several benefits: the devices can be manufactured and sold at a lower price, and the platform can be expanded to include new functions without much effort. No matter whether it is Siri, Alexa, Watson, Google or Cortana, all those services work on a similar principle.
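The split between local wake word detection and interpretation on the platform can be sketched roughly as follows. This is a toy model and not Amazon's implementation: all function names, the keyword table and the use of plain strings in place of audio are assumptions made for illustration.

```python
WAKE_WORD = "alexa"

# Hypothetical keyword table standing in for the platform's language
# understanding. Each keyword maps to an (intent, action) pair.
INTENT_KEYWORDS = {
    "agenda": ("calendar", "read"),
    "calendar": ("calendar", "read"),
    "louder": ("volume", "increase"),
    "softer": ("volume", "decrease"),
}


def split_wake_word(utterance):
    """Local step: return the command after the wake word, or None if absent."""
    parts = utterance.replace(",", " ").split(None, 1)
    if not parts or parts[0].lower() != WAKE_WORD:
        return None
    return parts[1] if len(parts) > 1 else ""


def resolve_intent(command):
    """Platform step: map the first known keyword to an (intent, action) pair."""
    for word in command.lower().split():
        key = word.strip("?!.,")
        if key in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[key]
    return None


def handle(utterance, online=True):
    command = split_wake_word(utterance)
    if command is None:
        return "idle"      # no wake word: the device does not react
    if not online:
        return "offline"   # wake word heard, but no platform to interpret it
    return resolve_intent(command) or "unknown"


print(handle("Alexa, what is on my agenda today?"))
print(handle("Alexa, louder", online=False))
```

In this simplified picture, skill squatting targets the second step: the wake word is recognized correctly, but the words that follow are mapped to the wrong function.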
English, Chinese, French and Spanish together have around three billion speakers worldwide. Therefore, if criminals start exploring this model, these languages are likely to be targeted first. Other languages will follow if the model turns out to be viable and profitable enough. Phishing is a precedent for this: the first phishing attempts were written in English.
All things considered, it is important to note that none of the attacks described in the paper were performed outside of an isolated test environment. This was to avoid putting undue stress on the Alexa production environment and to keep unsuspecting users from triggering unintended actions which might have skewed the results.
The researchers who conducted the experiments also asked how Amazon could prevent such manipulation. One possibility would be to introduce a testing layer which checks a newly submitted skill for phonetic similarities with existing skills.
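What such a testing layer might look like can be sketched in a few lines. This is only one possible approach and not the method proposed in the paper or used by Amazon: it compares phoneme sequences using a standard edit distance. The pronunciation table is hand-written for illustration (ARPAbet-style symbols, stress omitted), and the function names and the threshold of one phoneme are assumptions.

```python
# Tiny hand-written pronunciation table, for illustration only.
# A real check would need a full pronunciation dictionary.
PRONUNCIATIONS = {
    "fax": ["F", "AE", "K", "S"],
    "facts": ["F", "AE", "K", "T", "S"],
    "sail": ["S", "EY", "L"],
    "sale": ["S", "EY", "L"],
    "news": ["N", "UW", "Z"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            current.append(min(
                previous[j] + 1,         # delete a phoneme from a
                current[j - 1] + 1,      # insert a phoneme from b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def find_conflicts(new_name, existing_names, max_distance=1):
    """Return existing names that sound too close to `new_name`."""
    new_phonemes = PRONUNCIATIONS[new_name]
    return [
        name for name in existing_names
        if name != new_name
        and edit_distance(new_phonemes, PRONUNCIATIONS[name]) <= max_distance
    ]


existing = ["facts", "sale", "news"]
print(find_conflicts("fax", existing))   # ['facts']
print(find_conflicts("sail", existing))  # ['sale']
```

A check of this kind would flag both true homophones (distance 0) and near misses such as "fax" and "facts" (distance 1). It would not, however, account for the regional and speaker-dependent variations described above.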
In all, the research paper provides a framework on which a practical attack could be based. The authors emphasize that their experiments are not representative of the applicability in a real-world scenario. Just like many other potential attack vectors, this one is still purely academic.