Human Comprehensibility Test (Section 6.3)

Environment

We recruit 28 volunteers
We prepare the following audio files (adversarial samples plus normal commands)
- 3 adversarial wake-up words
  - Adversarial Wake-up Word
- 6 adversarial commands for each environment
- 3 normal commands
  - Normal Command
The order in which the audio files are played to the participants is randomized
To mitigate subjective bias, no voice command is disclosed to the participants
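
The randomized playback order can be sketched as below. This is a minimal illustration, not the procedure actually used in the study: the function name `playback_order`, the per-participant seeding scheme, and the file names are all assumptions introduced here.

```python
import random


def playback_order(files, participant_id, base_seed=0):
    """Return a shuffled copy of `files` for one participant.

    Seeding by participant id (an assumption, not from the source) makes
    each participant's order reproducible for later analysis.
    """
    rng = random.Random(base_seed * 1_000_003 + participant_id)
    order = list(files)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


# Hypothetical file names, for illustration only.
files = ["wake_adv_1.wav", "cmd_adv_1.wav", "cmd_adv_2.wav", "cmd_normal_1.wav"]
for pid in range(3):
    print(pid, playback_order(files, pid))
```

Using a separate `random.Random` instance per participant keeps the shuffle independent of any other random state in the program.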

Step 1. Play adversarial audio files to each participant
Step 2. Each participant is asked to indicate whether she or he identified any meaning in the audio
- Questionnaire
Step 3. Measure Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N}$,
where $N$ denotes the total number of words in the command, and $S$, $D$, and $I$ are the numbers of word substitutions, deletions, and insertions, respectively
Step 4. Measure Phoneme Error Rate (PER): the phonological distance between the recognized and target commands, divided by the length of the phoneme sequence of the target command
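
Steps 3 and 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the scoring code used in the study. The minimum value of S + D + I is computed as a word-level Levenshtein distance. For PER, the "phonological distance" is approximated here by an unweighted edit distance over phoneme symbols; the source does not specify the exact distance, so this is a stand-in. The example commands and ARPAbet-style phoneme sequences are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to turn `ref` into `hyp` (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(target: str, recognized: str) -> float:
    """WER = (S + D + I) / N, with N the word count of the target command."""
    ref = target.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("target command must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def phoneme_error_rate(target: Sequence[str], recognized: Sequence[str]) -> float:
    """PER approximated as phoneme-level edit distance / target length.

    Assumption: unweighted edit distance stands in for the phonological
    distance, which may weight phoneme substitutions differently.
    """
    if not target:
        raise ValueError("target phoneme sequence must not be empty")
    return edit_distance(target, recognized) / len(target)


# Hypothetical participant transcription, much longer than the target.
print(word_error_rate("turn off", "please do not touch anything"))  # 2.5
# Hypothetical phoneme sequences for a target and a misheard word.
print(phoneme_error_rate(["OW", "P", "AH", "N"], ["OW", "V", "ER"]))  # 0.75
```

Because insertions are counted in the numerator but N is fixed by the target command, both rates can exceed 1 when the listener reports more words or phonemes than the target contains.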

None of the participants could comprehend any adversarial audio file
The experimental results in Section 6.3 indicate that the WERs and PERs are consistently above 0.5, and more than half are greater than or equal to 1