
Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Samin Poudel, Marwan Bikdash
Department of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27401, USA

Abstract

Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density is maintained during subsampling. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bilinear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be the same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably from FUD and FID alone, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the subsampling levels, addressing the tradeoff between TTI and AL. A simple sub-optimal approximation is found in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, such as the optimal FID/(1-FID), are proportional to the square root of TTRF.
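
To make the modeling and optimization steps concrete, the following minimal Python sketch (not taken from the paper) illustrates the two ideas stated above: fitting a bilinear surface in FUD and FID to simulated TTI or scaled-AL measurements, and applying the sub-optimal approximation in which FID/(1-FID) and FUD/(1-FUD) scale with the square root of TTRF while the optimal AL scales roughly linearly with TTRF. The regressor set (constant, FUD, FID, FUD*FID), the function names, and the proportionality constants k_fid, k_fud, and k_al are assumptions for illustration only; the paper derives its own coefficients from extensive simulations.

import numpy as np

def bilinear_model(fud, fid, beta):
    # Evaluate an assumed bilinear surface: beta0 + beta1*FUD + beta2*FID + beta3*FUD*FID.
    b0, b1, b2, b3 = beta
    return b0 + b1 * fud + b2 * fid + b3 * fud * fid

def fit_bilinear(fud, fid, y):
    # Least-squares fit of the bilinear coefficients from simulation results,
    # where fud, fid, and y (measured TTI or scaled AL) are 1-D numpy arrays.
    X = np.column_stack([np.ones_like(fud), fud, fid, fud * fid])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def suboptimal_subsampling(ttrf, k_fud=1.0, k_fid=1.0, k_al=1.0):
    # Sub-optimal approximation described in the abstract:
    # FUD/(1-FUD) and FID/(1-FID) grow like sqrt(TTRF), and the optimal AL
    # grows roughly linearly in TTRF for larger TTRF. The constants k_* are
    # placeholders standing in for values derived from the fitted models.
    r_fud = k_fud * np.sqrt(ttrf)      # FUD / (1 - FUD)
    r_fid = k_fid * np.sqrt(ttrf)      # FID / (1 - FID)
    fud = r_fud / (1.0 + r_fud)
    fid = r_fid / (1.0 + r_fid)
    al = k_al * ttrf                   # approximate optimal accuracy loss
    return fud, fid, al

# Example usage with illustrative numbers only: target a 4x training-time reduction.
# fud, fid, al = suboptimal_subsampling(ttrf=4.0)
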

Keywords: Collaborative Filtering (CF), subsampling, Training Time Improvement (TTI), performance loss, Recommendation System (RS), collaborative filtering optimal solutions, rating matrix


Publication history

Received: 27 July 2021
Revised: 07 December 2021
Accepted: 31 December 2021
Published: 09 June 2022
Issue date: September 2022

Copyright

© The author(s) 2022.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
