1 National Food Institute, Technical University of Denmark; 2 Division of Nutrition, National Food Institute, Technical University of Denmark; 3 National Food Agency; 4 Science for Life Laboratory, Uppsala; 5 University of Oslo; 6 Uppsala University
The study outlined in this report aimed to disclose pertinent patterns in dietary surveys by means of an array of multivariate data analysis (MDA) techniques. The overall purpose was thus to unveil embedded patterns in selected data material, but also to demonstrate the general feasibility of new computational technology in this area. The material selected for this purpose encompasses food consumption survey data from Sweden and Denmark. The first of these compilations is known as Riksmaten – barn 2003 and covers children of three age groups (four, eight and eleven years of age), whereas the second data set is an excerpt – covering preschool children (four to five years of age) – of the Danish National Survey of Diet and Physical Activity, compiled over several years until 2008. These sets of food consumption data have previously been subjected to classical statistical analysis but had not, prior to this exercise, been scrutinised by means of more advanced computational techniques. The analytical exercises described in this report encompass two major fields of MDA, which can be summarised as Unsupervised Learning/Descriptive Modelling, on the one hand, and Supervised Learning/Predictive Modelling, on the other. The first of the unsupervised analyses involved inspection largely by, but not restricted to, an in-house implemented multi-branching hierarchical clustering algorithm (OMB-DHC), thereby revealing various aggregations of reasonably coherent consumers in unabridged and age-defined sub-populations. Notably, and unlike earlier reports in the dietary survey area, the hierarchical OMB-DHC design of operation, tied to an accessible output display, helped identify the degree of heterogeneity of clusters appearing at several segregation levels, thereby also supporting the judicious selection of aggregations for further compilation and scrutiny.
The numbers and salient features of such dietary sub-populations were found to be largely, but not exactly, commensurate with those of various scientific reports in the area. Thus, 4–5 dietary clusters – in this report also referred to as dietary prototypes – emerged from our data sets at the highest hierarchical level, and three of them – Traditional, Soft beverages/Buns & cakes and Varied (healthy) – roughly match those commonly reported elsewhere. The aggregations thus identified underwent further processing, i.e. the prototypes were used as input to either of two distinct clustering algorithms downstream of OMB-DHC. The first of these composite procedures, here designated Hierarchical Prototype Bi-Cluster Analysis (HPBCA), enabled the creation of a very instructive two-dimensional display of pertinent dissimilarities between Danish and Swedish age-matched consumption data as well as across the Swedish preschool and elementary school consumers. As anticipated, the overall dietary patterns of the two oldest age categories of Riksmaten – barn 2003 were closer to each other than to those of four-year-old children. More intriguingly, however, the analysis revealed a rather drastic disparity between the consumption patterns of Danish and Swedish preschool children. The second composite technique, here referred to as Dietary Prototype CMDS Analysis (DPCA), enabled the delineation and visualization of multidimensional distances across the various dietary prototypes and thus helped identify overarching interrelationships between aggregated consumer groups. Furthermore, Principal Component Analysis (PCA) supported the hierarchical cluster analysis by explaining major direct and inverse relationships between key food groups in the several intra- and inter-national data excerpts. For example, major PCA loadings helped decipher both shared and disparate features, relating to food groups, across Danish and Swedish preschool consumers.
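To illustrate the CMDS step behind DPCA, the following is a minimal sketch of classical multidimensional scaling applied to a distance matrix between prototypes. It is not the report's implementation: representing each prototype as a vector of mean intakes, the use of Euclidean distance, the function names and the toy values are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(vectors):
    """Pairwise distance matrix (assumed Euclidean) between prototype vectors."""
    return [[euclidean(p, q) for q in vectors] for p in vectors]

def double_center(dist):
    """Gram matrix B = -1/2 * J * D^2 * J used by classical MDS."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(mat, iterations=500):
    """Dominant eigenpair of a symmetric positive semi-definite matrix (power iteration)."""
    n = len(mat)
    vec = [1.0 + 0.1 * i for i in range(n)]  # deterministic, non-constant start
    value = 0.0
    for _ in range(iterations):
        nxt = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        vec = [x / norm for x in nxt]
        value = norm
    return value, vec

def classical_mds(dist, dims=2):
    """Embed the objects in `dims` dimensions so that distances are approximated."""
    n = len(dist)
    gram = double_center(dist)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = top_eigenpair(gram)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate: remove the component just extracted before the next pass.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords

# Hypothetical prototypes: invented mean intakes for three food groups.
prototypes = {
    "prototype_1": [4.0, 1.0, 2.0],
    "prototype_2": [1.0, 5.0, 1.0],
    "prototype_3": [2.0, 2.0, 6.0],
    "prototype_4": [3.0, 3.0, 3.0],
}
names = list(prototypes)
coords = classical_mds(distance_matrix([prototypes[n] for n in names]))
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Prototypes that land close together in the resulting plane have similar intake profiles under the assumed distance. The axes themselves carry no meaning, and their signs are arbitrary.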
Data interrogation, reliant on the above-mentioned composite techniques, disclosed one outlier dietary prototype in each of the two Swedish elementary school children data subsets. This pair of group-detached prototypes showed, however, notable mutual resemblance and featured consumption of low-fat foods (largely with respect to dairy products) alongside otherwise quite healthy eating patterns. Moreover, these exercises unveiled another set of interrelated dietary prototypes, one in each of the Swedish age categories, but mutually most similar in the two older age groups. Their common features are a relatively low intake of Vegetables and Fruit & berries together with a fairly high consumption of Soft beverages (sweetened). A dietary prototype with the latter property was also identified in the Danish data material, but without low consumption of Vegetables or Fruit & berries. The second MDA-type of data interrogation involved Supervised Learning, also known as Predictive Modelling. These exercises involved the Random Forest (RF) and Nearest Shrunken Centroid (NSC) classification algorithms. Briefly, collections of classifiers were created to predict low and high consumers of each of a wide selection of food groups, after elimination of that particular food. Frequency histograms of the remaining foods were derived in each case, displaying patterns of key food groups that jointly discriminate between the two (low/high) categories in the absence of the targeted food. Very instructive displays of deeply embedded relationships inherent to the survey data emerged from these procedures, in many cases also reinforcing findings derived from the unsupervised MDA work. Intriguing similarities and discrepancies in frequency patterns were also seen across the respective national consumption data subsets for preschool children.
For example, Potato is firmly connected with Rice in the Danish data set, but rather associated with Sausage and Fish in that of Sweden. Unlike Swedish preschool children, who show tight linkage between Bread and both Cheese and Cereals, age-matched Danish consumers of Bread are tethered to Sugar (marmalade) and Vegetables. Marked trans-national disparity was also seen in dietary habits associated with Milk and Meat & poultry. Some overarching observations are: i) certain healthy and less healthy foods tend to appear in disjoint clusters, ii) two (mutually similar and relatively prudent) dietary prototypes, one in each of the two Swedish elementary school consumer data sets, appear quite remote from those of the remaining age-matched consumers, iii) Danish and Swedish preschool consumers show notable trans-national disparity, for example the Milk food group as well as that of Bread are tethered to quite distinct (nationality-specific) consumption patterns, iv) among the several dietary prototypes identified across the trans-national data set, including age-matched excerpts of Swedish data, prototypes with the shared feature of being high in the Soft beverages (sweetened) food group emerged, and v) although not elaborated on in depth, output from several analyses suggests that energy-based consumption data are preferable to weight-based data for Cluster Analysis and Predictive Modelling.
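The low/high prediction scheme used in the supervised analyses (eliminate the targeted food, then predict its low and high consumers from the remaining foods) can be sketched as follows. This is only a rough illustration under stated assumptions: a plain nearest-centroid rule stands in for the RF and NSC classifiers, the low/high split at the median is assumed, the function names are hypothetical, and the intake values are invented rather than taken from the survey data.

```python
from statistics import mean, median, pstdev

def split_low_high(rows, target):
    """Label each consumer 'high' or 'low' by an assumed median split on the target food."""
    cut = median(r[target] for r in rows)
    return ["high" if r[target] > cut else "low" for r in rows]

def centroids_without_target(rows, labels, target):
    """Class centroids computed over the remaining foods only (target eliminated)."""
    foods = [f for f in rows[0] if f != target]
    cents = {}
    for cls in ("low", "high"):
        members = [r for r, lab in zip(rows, labels) if lab == cls]
        cents[cls] = {f: mean(m[f] for m in members) for f in foods}
    return foods, cents

def predict(row, foods, cents):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def sqdist(cls):
        return sum((row[f] - cents[cls][f]) ** 2 for f in foods)
    return min(cents, key=sqdist)

def rank_foods(rows, foods, cents):
    """Order the remaining foods by how strongly they separate low from high."""
    scores = {}
    for f in foods:
        spread = pstdev(r[f] for r in rows) or 1.0  # guard against zero spread
        scores[f] = abs(cents["high"][f] - cents["low"][f]) / spread
    return sorted(foods, key=scores.get, reverse=True)

# Invented toy intakes, for illustration only (not survey data).
rows = [
    {"Potato": 10, "Rice": 8, "Bread": 5, "Milk": 3},
    {"Potato": 12, "Rice": 9, "Bread": 4, "Milk": 4},
    {"Potato": 11, "Rice": 10, "Bread": 6, "Milk": 3},
    {"Potato": 2, "Rice": 1, "Bread": 5, "Milk": 4},
    {"Potato": 3, "Rice": 2, "Bread": 4, "Milk": 3},
    {"Potato": 1, "Rice": 3, "Bread": 6, "Milk": 3},
]

target = "Potato"
labels = split_low_high(rows, target)
foods, cents = centroids_without_target(rows, labels, target)
predicted = [predict(r, foods, cents) for r in rows]
ranking = rank_foods(rows, foods, cents)
print(ranking[0])
```

Repeating this for every food group as the target, over many resampled classifiers, and counting how often each remaining food ranks among the top discriminators would give frequency histograms of the kind described above.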