Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Samin Poudel; Marwan Bikdash

doi:10.26599/BDMA.2021.9020032

| Sign up

PDF (730.4 KB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Figures (8)

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Fig. 8

Tables (6)

Table 1

Table 2

Table 3

Table 4

Table 5

Open Access

Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling

Samin Poudel(), Marwan Bikdash

Department of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27401, USA

Show Author Information

Abstract

Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms as applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably with FUD and FID only, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, like optimal FID/(1-FID), are proportional to the square root of TTRF.

Keywords

Collaborative Filtering (CF)subsampling Training Time Improvement (TTI)performance loss Recommendation System (RS)collaborative filtering optimal solutions rating matrix

References

[1]

Sharma

and A.

Gera

, A survey of recommendation system: Research challenges, Int. J. Eng. Trends Technol., vol. 4, no. 5, pp. 1989-1992, 2013.

Google Scholar

[2]

Pereira

and S.

Varma

, Survey on content based recommendation system, Int. J. Comput. Sci. Inf. Technol, vol. 7, no. 1, pp. 281-284, 2016.

Google Scholar

[3]

Chen

, Q. Y.

Hua

, Y. S.

Chang

, B.

Wang

, L.

Zhang

, and X. J.

Kong

, A survey of collaborative filtering-based recommender systems: From traditional methods to hybrid methods based on social networks, IEEE Access, vol. 6, pp. 64301-64320, 2018.

FID	FUD	ST (s)	TT (s)	ST/TT (%)	ST/ $Δ$ TT (%)
0	0	0	1542.20	0	-
0.3	0.3	4.45	595.10	0.74	0.46
0.5	0.5	3.51	214.90	1.60	0.26
0.7	0.7	2.63	63.20	4.20	0.17
0.9	0.9	1.90	2.55	74.00	0.12

Primary dataset	Number of users (m)	Number of items ( $n$ )	Density ( $δ$ )	Rating scale of data	Source
$P_{1}$	6040	3706	0.045	1-5	1M MovieLens
$P_{2}$	4607	2080	0.095	1-5	1M MovieLens
$P_{3}$	8000	4004	0.101	0.5-5	25M MovieLens
$P_{4}$	4009	8017	0.151	0.5-5	25M MovieLens
$P_{5}$	3500	6000	0.08	1-100	Yahoo! Music
$P_{6}$	5006	5011	0.12	1-100	Yahoo! Music
$P_{7}$	3000	1301	0.004	1-5	Amazon Automotive

CF method	$P_{1}$			$P_{2}$			$P_{3}$			$P_{4}$
CF method	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$
SVD	0.94 $^{′′}$	1.02 $^{′′}$	0.96	1.01 $^{′′}$	1.01 $^{′′}$	1.02 $^{′′}$	1.03	1.12 $^{′′}$	1.15 $^{'}$	1.0 $^{′′}$	1.06 $^{′′}$	1.05 $^{′′}$
SVD_b	0.94 $^{′′}$	1.03 $^{′′}$	0.96 $^{′′}$	1.01 $^{′′}$	1.01 $^{′′}$	1.02 $^{′′}$	1.05 $^{′′}$	1.06 $^{′′}$	1.13 $^{′′}$	1.03 $^{′′}$	1.06 $^{′′}$	1.09 $^{′′}$
NMF	0.98 $^{′′}$	1.024 $^{′′}$	0.99 $^{′′}$	1.03 $^{′′}$	1.03 $^{′′}$	1.06 $^{′′}$	1.02 $^{′′}$	0.98 $^{′′}$	0.98 $^{′′}$	1.03 $^{′′}$	1.05 $^{′′}$	1.08 $^{′′}$
CoClustering	1.06 $^{′′}$	1.07 $^{′′}$	1.165 $^{′′}$	1.12 $^{′′}$	1.10 $^{′′}$	1.15 $^{′′}$	1.09 $^{′′}$	1.07 $^{′′}$	1.16 $^{′′}$	1.06 $^{′′}$	1.02 $^{′′}$	1.09 $^{′′}$
SlopeOne	1.04 $^{′′}$	1.35 $^{′′}$	1.44 $^{′′}$	1.12 $^{′′}$	1.37 $^{′′}$	1.52 $^{'}$	1.11 $^{′′}$	1.33 $^{′′}$	1.50 $^{′′}$	1.14 $^{′′}$	1.37 $^{′′}$	1.54 $^{'}$
UNN	1.32 $^{′′}$	0.95 $^{′′}$	1.28 $^{′′}$	1.36 $^{′′}$	1.01 $^{′′}$	1.41 $^{'}$	1.34 $^{′′}$	0.96 $^{′′}$	1.32 $^{'}$	1.37 $^{′′}$	1.06 $^{′′}$	1.42 $^{'}$
INN	0.9 $^{′′}$	1.35 $^{′′}$	1.26 $^{'}$	0.93 $^{′′}$	1.33 $^{′′}$	1.32 $^{′′}$	0.99 $^{′′}$	1.33 $^{′′}$	1.32 $^{'}$	0.99 $^{′′}$	1.31 $^{′′}$	1.32 $^{′′}$
CF method	$P_{5}$			$P_{6}$			$P_{7}$
CF method	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$
SVD	1.00 $^{′′}$	1.01 $^{′′}$	1.00 $^{′′}$	1.01 $^{′′}$	1.02 $^{′′}$	1.04 $^{′′}$	0.99 $^{′′}$	1.02 $^{′′}$	1.02 $^{′′}$
SVD_b	0.99 $^{′′}$	0.99 $^{′′}$	0.97 $^{′′}$	1.00 $^{′′}$	1.01 $^{′′}$	1.01 $^{′′}$	0.98 $^{′′}$	1.00 $^{′′}$	0.99 $^{′′}$
NMF	0.98 $^{′′}$	0.99 $^{′′}$	0.97 $^{′′}$	1.01 $^{′′}$	1.01 $^{′′}$	1.01 $^{′′}$	0.97 $^{′′}$	0.93 $^{′′}$	0.98 $^{′′}$
CoClustering	1.03 $^{′′}$	1.02 $^{′′}$	1.06 $^{′′}$	1.02 $^{′′}$	1.00 $^{′′}$	1.02 $^{′′}$	0.95 $^{′′}$	0.92 $^{′′}$	0.97 $^{′′}$
SlopeOne	1.06 $^{′′}$	1.31 $^{′′}$	1.45 $^{′′}$	1.11 $^{′′}$	1.34 $^{′′}$	1.52 $^{′′}$	0.82 $^{′′}$	1.18 $^{′′}$	0.92’
UNN	1.34 $^{′′}$	0.93 $^{′′}$	1.31 $^{'}$	1.36 $^{′′}$	1.00 $^{′′}$	1.39 $^{'}$	1.26 $^{′′}$	0.91 $^{′′}$	1.14 $^{′′}$
INN	0.86 $^{′′}$	1.29 $^{′′}$	1.21 $^{′′}$	0.96 $^{′′}$	1.31 $^{′′}$	1.28 $^{′′}$	0.85 $^{′′}$	1.25 $^{′′}$	0.95 $^{'}$

CF method	MAE
CF method	Using $P_{1}$	Using $P_{2}$	Using $P_{3}$	Using $P_{4}$	Using $P_{5}$	Using $P_{6}$	Using $P_{7}$
SVD	0.024	0.009	0.031	0.025	0.005	0.005	0.015
SVD_b	0.024	0.009	0.032	0.018	0.007	0.005	0.015
NMF	0.012	0.009	0.025	0.012	0.007	0.005	0.014
CoClustering	0.014	0.015	0.023	0.018	0.007	0.007	0.012
SlopeOne	0.041	0.042	0.048	0.046	0.037	0.042	0.041
UNN	0.044	0.047	0.043	0.048	0.052	0.049	0.036
INN	0.048	0.042	0.042	0.037	0.035	0.038	0.045

CF method	$P_{1}$			$P_{2}$			$P_{3}$			$P_{4}$
CF method	$η_{0}^{*}$	$η_{1}^{*}$	$η_{2}^{*}$	$η_{0}^{*}$	$η_{1}^{*}$	$η_{2}^{*}$	$η_{0}^{*}$	$η_{1}^{*}$	$η_{2}^{*}$	$η_{0}^{*}$	$η_{1}^{*}$	$η_{2}^{*}$
SVD	0.91 $^{′′′}$	2.1 $^{′′′}$	2.71 $^{′′′}$	0.8 $^{′′′}$	1.42 $^{′′′}$	2.1 $^{′′′}$	1.53 $^{′′′}$	1.67 $^{′′′}$	3.36 $^{′′′}$	1.49 $^{′′′}$	1.7 $^{′′′}$	3.43 $^{′′′}$
SVD_b	0.88 $^{′′′}$	1.2 $^{′′′}$	2.22 $^{′′′}$	1.12 $^{′′′}$	1.22 $^{′′′}$	2.5 $^{′′′}$	1.66 $^{′′′}$	1.82 $^{′′′}$	3.7 $^{′′′}$	1.65 $^{′′′}$	1.91 $^{′′′}$	3.84 $^{′′}$
NMF	0.29 $^{′′′}$	1.12 $^{′′′}$	1.33 $^{′′′}$	0.47 $^{′′′}$	1.0 $^{′′′}$	1.38 $^{′′′}$	0.23 $^{′′′}$	0.11 $^{′′′}$	0.24 $^{′′′}$	0.03 $^{′′′}$	0.04 $^{′′′}$	0.0 $^{′′′}$
CoClustering	0.52 $^{′′′}$	1.14 $^{′′′}$	1.68 $^{′′′}$	0.28 $^{′′′}$	0.76 $^{′′′}$	0.98 $^{′′′}$	0.37 $^{′′′}$	0.41 $^{′′′}$	0.72 $^{′′′}$	0.25 $^{′′′}$	0.28 $^{′′′}$	0.56 $^{′′′}$
SlopeOne	0.07 $^{′′′}$	0.57 $^{′′′}$	0.57 $^{′′′}$	0.06 $^{′′′}$	0.3 $^{′′′}$	0.26 $^{′′′}$	0.34 $^{′′′}$	0.07 $^{′′′}$	0.39 $^{′′′}$	0.14 $^{′′′}$	0.25 $^{′′′}$	0.45 $^{′′′}$
UNN	0.01 $^{′′′}$	0.88 $^{′′′}$	0.83 $^{′′′}$	0.13 $^{′′′}$	0.75 $^{′′′}$	0.84 $^{′′′}$	0.43 $^{′′′}$	0.9 $^{′′′}$	1.38 $^{′′′}$	0.29 $^{′′′}$	0.57 $^{′′′}$	0.94 $^{′′′}$
INN	0.38 $^{′′′}$	0.66 $^{′′′}$	0.97 $^{′′′}$	0.73 $^{′′′}$	0.76 $^{′′′}$	1.48 $^{′′′}$	1.18 $^{′′′}$	0.49 $^{′′′}$	1.76 $^{′′′}$	1.32 $^{′′′}$	0.42 $^{′′′}$	1.95 $^{′′′}$
CF method	$P_{5}$			$P_{6}$			$P_{7}$
CF method	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$	$β_{0}$	$β_{1}$	$β_{2}$
SVD	0.37 $^{′′}$	0.36 $^{′′}$	1.0 $^{'}$	0.08 $^{′′′}$	0.08 $^{′′′}$	0.05 $^{′′′}$	$-$ 1.03 $^{′′′}$	$- {2.2}^{′′}$	3.10 $^{'}$
SVD_b	2.40 $^{'}$	1.73 $^{'}$	3.72 $^{'}$	0.71 $^{′′}$	0.61 $^{′′}$	2.44 $^{'}$	1.40 $^{′′}$	1.68 $^{′′}$	3.37 $^{'}$
NMF	0.12 $^{′′′}$	0.07 $^{′′′}$	0.19 $^{′′′}$	0.05 $^{′′′}$	0.10 $^{′′′}$	0.17 $^{′′′}$	2.43 $^{′′}$	1.76 $^{′′}$	5.20 $^{'}$
CoClustering	0.76 $^{′′′}$	0.38 $^{′′′}$	1.11 $^{′′′}$	0.10 $^{′′′}$	0.19 $^{′′′}$	0.26 $^{′′′}$	2.61 $^{′′}$	2.67 $^{′′}$	6.26 $^{'}$
SlopeOne	0.41 $^{′′′}$	0.16 $^{′′′}$	0.56 $^{′′′}$	0.09 $^{′′′}$	0.13 $^{′′′}$	0.23 $^{′′′}$	0.18 $^{′′}$	0.80 $^{′′}$	1.39 $^{'}$
UNN	1.24 $^{′′′}$	1.01 $^{′′′}$	2.34 $^{′′′}$	0.77 $^{′′′}$	1.25 $^{′′′}$	2.14 $^{′′′}$	1.42 $^{′′}$	3.12 $^{′′}$	5.30 $^{'}$
INN	2.04 $^{′′′}$	0.98 $^{′′′}$	3.22 $^{′′′}$	2.59 $^{′′′}$	1.82 $^{′′′}$	4.79 $^{′′′}$	2.83 $^{′′}$	0.63 $^{′′}$	4.08 $^{'}$

CF method	MAE
CF method	Using $P_{1}$	Using $P_{2}$	Using $P_{3}$	Using $P_{4}$	Using $P_{5}$	Using $P_{6}$	Using $P_{7}$
SVD	0.0084	0.0067	0.009	0.012	0.0084	0.0038	0.024
SVD_b	0.0082	0.0086	0.012	0.0141	0.0759	0.0132	0.045
NMF	0.0077	0.0069	0.0056	0.0042	0.0016	0.0015	0.021
CoClustering	0.0088	0.0092	0.0053	0.0047	0.0049	0.0034	0.017
SlopeOne	0.0067	0.0073	0.0044	0.0051	0.0035	0.0025	0.032
UNN	0.0071	0.0075	0.0052	0.0063	0.0065	0.0048	0.041
INN	0.0073	0.0065	0.0059	0.0097	0.011	0.0121	0.035