Environment
- Randomly select 100 commands of varying length (short, medium, and long) from a pool of commonly used voice commands for Amazon Alexa [link] or Google Assistant [link]
- Use Google TTS to generate fast speech (adversarial audio) at different playback speeds (2.0x - 3.0x) [link]
- We set up the following target ASRs
Experiment Procedure
- Step 1. Generate 1100 (i.e., 11 x 100) adversarial audio candidates for 100 commands of different lengths (27 short, 26 medium, and 47 long) at playback speeds from 2.0x to 3.0x in 0.1x increments
- Step 2. Generate 1100 normal audio files with the same playback speeds as the adversarial audio files
- Step 3. Feed each ASR the corresponding crafted adversarial audio
- Step 4. Compute translation accuracy = number of successful transcriptions / total number of transcription attempts
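The arithmetic behind Steps 1 and 4 can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiment: the function names (`playback_speeds`, `candidate_grid`, `translation_accuracy`) are hypothetical, and treating a transcription as "successful" when it matches the command text after lowercasing and whitespace cleanup is an assumption, since the success criterion is not spelled out here.

```python
def playback_speeds(start_tenths=20, stop_tenths=30):
    """Speeds from 2.0x to 3.0x in 0.1x steps.

    Integer tenths are used so that repeated float addition
    cannot drop or duplicate the last step.
    """
    return [t / 10 for t in range(start_tenths, stop_tenths + 1)]


def candidate_grid(commands, speeds):
    """Pair every command with every playback speed (Step 1)."""
    return [(command, speed) for command in commands for speed in speeds]


def normalize(text):
    """Lowercase and collapse whitespace before comparing (assumed rule)."""
    return " ".join(text.lower().split())


def translation_accuracy(results):
    """Step 4: successful transcriptions / total transcription attempts.

    `results` is a list of (expected_command, asr_transcription) pairs.
    Exact match after normalize() is an ASSUMED success criterion.
    """
    results = list(results)
    if not results:
        raise ValueError("no transcription attempts")
    successes = sum(
        1 for expected, heard in results
        if normalize(expected) == normalize(heard)
    )
    return successes / len(results)


if __name__ == "__main__":
    speeds = playback_speeds()
    # Placeholder command texts; the real pool is the 100 selected commands.
    commands = [f"command {i}" for i in range(100)]
    grid = candidate_grid(commands, speeds)
    print(len(speeds), len(grid))
```

With 100 commands, the grid contains 11 speeds per command, which matches the 1100 candidates in Step 1. Accuracy would then be computed once per ASR, and per playback speed if a breakdown by speed is wanted.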
Result
- Figure 7 in the paper demonstrates translation accuracy for different ASRs.