Speech perception is often studied using natural, meaningful speech, i.e., by measuring the intelligibility of a given set of single words or full sentences. However, when trying to understand how background noise, various transmission channels (e.g., mobile phones), or hearing impairment affect speech perception, it is advantageous to study the impact of these factors on the perception of the fundamental building blocks of speech. Nonsense syllables consisting of consonants and vowels have therefore typically been presented to listeners in masking noise at various signal-to-noise ratios (SNRs). This “microscopic” approach allows for a detailed investigation of the mapping between the acoustic stimulus and the resulting percept. In the present study, an experiment with eight native Danish normal-hearing listeners was conducted. The listeners were presented with consonant-vowel combinations (CVs) in quiet and at six different SNRs in white noise. The responses were analyzed in terms of recognition scores and consonant confusions. Inspired by models designed for long-term (“macroscopic”) speech intelligibility prediction, two modeling concepts were considered to describe the consonant-perception data: (i) an audibility-based approach, corresponding to the Articulation Index (AI), and (ii) a modulation-masking-based approach, as reflected in the speech-based Envelope Power Spectrum Model (sEPSM). For both models, the internal representations of the same stimuli as used in the experiment were calculated and fed into a template-matching back end. Using the experimental data as a reference, the resulting predictions of the two modeling approaches were compared, and their respective suitability for predicting consonant perception was evaluated.