Over-the-wire (OTW) Attack Accuracy (Figure 7)

Environment

Randomly select 100 commands of different lengths (short, medium, and long) from a pool of commonly used voice commands for Amazon Alexa [link] or Google Assistant [link]
Utilize Google TTS to generate fast speech (adversarial audio) at different playback speeds (2.0x - 3.0x) [link]
We set up the following target ASRs:
- Amazon Transcribe [link]
- Google STT [link]
- IBM Watson STT [link]
- Microsoft Azure STT [link]

Step 1. Generate 1100 (i.e., 11 x 100) adversarial audio candidates for the 100 commands of different lengths (27 short, 26 medium, and 47 long) under varying playback speeds (2.0x - 3.0x, in 0.1x increments)
- Adversarial Audio
Step 2. Generate 1100 normal audio files at the same playback speeds as the adversarial audio files
- Normal Audio
Step 3. Feed each target ASR the corresponding crafted adversarial audio
Step 4. Compute translation accuracy = number of successful transcriptions / total number of transcription attempts
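
The bookkeeping behind Steps 1, 2, and 4 can be sketched as follows. This is a minimal illustration, not the evaluation scripts themselves: the names `SPEEDS`, `candidate_grid`, `normalize`, and `accuracy` are hypothetical, and the rule for a "successful" transcription (exact match after lowercasing and stripping punctuation) is an assumption, since the steps above do not define it.

```python
import string

# Playback speeds 2.0x through 3.0x in 0.1x increments (11 values).
# Integer tenths are used so the grid has no floating-point drift.
SPEEDS = [tenths / 10 for tenths in range(20, 31)]


def candidate_grid(commands):
    """Pair every command with every playback speed (Steps 1 and 2)."""
    return [(cmd, speed) for cmd in commands for speed in SPEEDS]


def normalize(text):
    """ASSUMED matching rule: lowercase, drop punctuation, collapse spaces."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def accuracy(results):
    """Step 4: successful transcriptions / total transcription attempts.

    `results` is an iterable of (expected_command, asr_transcript) pairs.
    """
    results = list(results)
    if not results:
        return 0.0
    hits = sum(normalize(exp) == normalize(got) for exp, got in results)
    return hits / len(results)
```

With 100 commands, `candidate_grid` yields 11 x 100 = 1100 (command, speed) pairs, matching the file counts in Steps 1 and 2. If a looser success criterion is intended (for example, the voice assistant still acting on the command), only `normalize` and the comparison in `accuracy` would need to change.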