1 Audio Analysis Lab, The Technical Faculty of IT and Design, Aalborg University, VBN
2 Department of Architecture, Design and Media Technology, The Technical Faculty of IT and Design, Aalborg University, VBN
3 Sektion København, The Technical Faculty of IT and Design, Aalborg University, VBN
4 The Faculty of Engineering and Science (TECH), Aalborg University, VBN
5 Sound & Music Computing, The Technical Faculty of IT and Design, Aalborg University, VBN
We propose and demonstrate a simple method to determine if a music information retrieval (MIR) system is using factors irrelevant to the task for which it is designed. This is of critical importance to certain use cases, but cannot be accomplished using standard approaches to evaluation in MIR. Akin to the controlled experiments designed to test the intellect of the famous horse ``Clever Hans'', we perform two experiments to show how three state-of-the-art music genre recognition (MGR) and music emotion recognition (MER) systems are relying on factors confounded with the ``ground truth'' labels of a dataset. We make available a reproducible research package so that others can perform the same experiments with other MIR systems.
IEEE Transactions on Multimedia, 2014, Vol. 16, Issue 6, pp. 1636-1644