We propose and demonstrate a simple method to determine whether a music information retrieval (MIR) system is using factors irrelevant to the task for which it is designed. Such a determination is critically important for certain use cases, but cannot be accomplished using standard approaches to evaluation in MIR. Akin to the controlled experiments designed to test the intellect of the famous horse ``Clever Hans'', we perform two experiments to show how three state-of-the-art music genre recognition (MGR) and music emotion recognition (MER) systems rely on factors confounded with the ``ground truth'' labels of a dataset. We make available a reproducible research package so that others can perform the same experiments with other MIR systems.
IEEE Transactions on Multimedia, 2014, Vol. 16, Issue 6, pp. 1636-1644