Journal: Journal of Chemometrics, vol. 33, 2019
Open Access: green
In many areas of science, multiple sets of data are collected from the same samples. Such data sets can be analysed by data fusion (or multi-block) methods. The aim is usually to gain a holistic understanding of the system or to improve prediction of some response. Lately, several scientific groups have developed methods for separating common and distinct variation between multiple data blocks. Although these methods share the same objective, their strategies and algorithms are completely different.
In this paper, we investigate the practical aspects of the four most popular methods for separating common and distinct variation: JIVE, DISCO, PCA-GCA and OnPLS. The main barrier complicating the use of any of these methods is model selection and validation, especially when the number of blocks is more than two. Through extensive simulations, we have elucidated the three properties that are important for assessing the validity of the results: the ability to identify the correct model, the ability to estimate the true underlying subspaces, and the robustness towards misspecification of the model.
The simulated data sets mimic a range of “real-life” data, with different dimensionalities and variance structures. We are thus able to identify which methods work best for different types of data structures, and to pinpoint weak spots for each method. The results show that PCA-GCA works best for model selection, while JIVE and DISCO give the best estimates of the subspaces and are most robust towards model misspecification.
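
As a rough illustration of what separating common and distinct variation means for two blocks, the sketch below simulates two data blocks that share one latent component and each have one distinct component, runs PCA on each block, and compares the two score subspaces through their canonical correlations (the cosines of the principal angles). This follows the general idea of PCA followed by a canonical correlation step, but it is not the implementation evaluated in the paper: the simulated dimensions, the noise level, the function names `pca_scores` and `canonical_correlations`, and the 0.9 threshold are all assumptions made for illustration.

```python
import numpy as np


def pca_scores(X, n_components):
    """Orthonormal PCA score vectors (left singular vectors) of a column-centred block."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components]


def canonical_correlations(U1, U2):
    """Cosines of the principal angles between two orthonormal bases, largest first."""
    return np.linalg.svd(U1.T @ U2, compute_uv=False)


rng = np.random.default_rng(0)
n_samples = 100

# Hypothetical latent scores: centred, then orthonormalised with QR.
# Column 0 is common to both blocks; columns 1 and 2 are distinct.
Z = rng.standard_normal((n_samples, 3))
T, _ = np.linalg.qr(Z - Z.mean(axis=0))

# Two blocks with different numbers of variables and a small amount of noise.
P1 = rng.standard_normal((2, 20))
P2 = rng.standard_normal((2, 30))
X1 = T[:, [0, 1]] @ P1 + 0.01 * rng.standard_normal((n_samples, 20))
X2 = T[:, [0, 2]] @ P2 + 0.01 * rng.standard_normal((n_samples, 30))

# PCA per block, then compare the score subspaces.
U1 = pca_scores(X1, n_components=2)
U2 = pca_scores(X2, n_components=2)
r = canonical_correlations(U1, U2)

# Illustrative threshold (an assumption): directions with a canonical
# correlation above it are counted as common, the rest as distinct.
threshold = 0.9
n_common = int(np.sum(r > threshold))

print("canonical correlations:", np.round(r, 3))
print("number of common components:", n_common)
```

In this noise-light setting the first canonical correlation is close to one and the second is close to zero, so one common component is found. The same cosines can also be used to measure how well an estimated subspace matches a simulated true subspace.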