(Connectionist temporal classification)
The issue is that in speech recognition the number of input time steps (audio frames) is far larger than the number of output characters.
Basic rule: collapse repeated characters not separated by the "blank" symbol (_):
"ttt_h_eee_ _q" ends up as "the q".
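The collapse rule above can be sketched in a few lines (a minimal sketch; `ctc_collapse` and its blank symbol `_` are my own naming, not from any particular library):

```python
def ctc_collapse(seq: str, blank: str = "_") -> str:
    """Merge runs of repeated characters, then drop blanks."""
    out = []
    prev = None
    for ch in seq:
        # Only keep a character if it differs from the previous one
        # (repeats without a blank in between are merged).
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee_ _q"))  # -> "the q"
```

Note that the blank is what lets CTC emit genuinely doubled letters: "hee_eel" collapses to "heel", while "heeeel" without a blank would collapse to "hel".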
Convert the audio into spectrogram features and feed them into an RNN. (IMHO, the feature generation would be rather interesting here.) We then label the time step where the trigger word ends with "1" and everything else with "0". Or, to increase the number of positively labeled time steps in a long audio sequence, we could label a fixed window after the trigger word ends with "1".
In fact, yes, it is that simple.
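The labeling scheme above can be sketched as follows (a sketch under my own assumptions: `make_labels`, the step indices, and the window length are all made up for illustration):

```python
import numpy as np

def make_labels(num_steps: int, trigger_end_steps: list[int], window: int = 50) -> np.ndarray:
    """Label `window` time steps after each trigger-word end with 1, rest 0."""
    y = np.zeros(num_steps, dtype=np.int64)
    for end in trigger_end_steps:
        # Clamp the window so it never runs past the end of the sequence.
        y[end : min(end + window, num_steps)] = 1
    return y

# Hypothetical example: a 1000-step clip whose trigger word ends at step 200.
y = make_labels(1000, [200], window=50)
```

Here `y[200:250]` is all ones and the remaining 950 steps are zero, giving the RNN many more positive targets than a single "1" at step 200 would.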