
Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Samin Poudel, Marwan Bikdash
Department of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27401, USA

Abstract

Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density is maintained during subsampling. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bilinear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be the same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably from FUD and FID alone, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the subsampling levels, addressing the tradeoff between TTI and AL. A simple sub-optimal approximation is found in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, such as the optimal FID/(1-FID), are proportional to the square root of TTRF.
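
To make the modeling and optimization steps concrete, the following minimal Python sketch (not taken from the paper) illustrates the two ideas stated above: fitting a bilinear surface in FUD and FID to simulated TTI or scaled-AL measurements, and applying the sub-optimal approximation in which FID/(1-FID) and FUD/(1-FUD) scale with the square root of TTRF while the optimal AL scales roughly linearly with TTRF. The regressor set (constant, FUD, FID, FUD*FID), the function names, and the proportionality constants k_fid, k_fud, and k_al are assumptions for illustration only; the paper derives its own coefficients from extensive simulations.

import numpy as np

def bilinear_model(fud, fid, beta):
    # Evaluate an assumed bilinear surface: beta0 + beta1*FUD + beta2*FID + beta3*FUD*FID.
    b0, b1, b2, b3 = beta
    return b0 + b1 * fud + b2 * fid + b3 * fud * fid

def fit_bilinear(fud, fid, y):
    # Least-squares fit of the bilinear coefficients from simulation results,
    # where fud, fid, and y (measured TTI or scaled AL) are 1-D numpy arrays.
    X = np.column_stack([np.ones_like(fud), fud, fid, fud * fid])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def suboptimal_subsampling(ttrf, k_fud=1.0, k_fid=1.0, k_al=1.0):
    # Sub-optimal approximation described in the abstract:
    # FUD/(1-FUD) and FID/(1-FID) grow like sqrt(TTRF), and the optimal AL
    # grows roughly linearly in TTRF for larger TTRF. The constants k_* are
    # placeholders standing in for values derived from the fitted models.
    r_fud = k_fud * np.sqrt(ttrf)      # FUD / (1 - FUD)
    r_fid = k_fid * np.sqrt(ttrf)      # FID / (1 - FID)
    fud = r_fud / (1.0 + r_fud)
    fid = r_fid / (1.0 + r_fid)
    al = k_al * ttrf                   # approximate optimal accuracy loss
    return fud, fid, al

# Example usage with illustrative numbers only: target a 4x training-time reduction.
# fud, fid, al = suboptimal_subsampling(ttrf=4.0)
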

Keywords: Collaborative Filtering (CF), subsampling, Training Time Improvement (TTI), performance loss, Recommendation System (RS), collaborative filtering optimal solutions, rating matrix


Publication history

Received: 27 July 2021
Revised: 07 December 2021
Accepted: 31 December 2021
Published: 09 June 2022
Issue date: September 2022

Copyright

© The author(s) 2022.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
