Software bots are now widely used in open source software repositories, and some drawbacks are coming to light, such as giving newcomers non-positive feedback and misleading the empirical studies of software engineering researchers. Researchers have proposed several bot detection techniques, but most are limited to identifying bots that perform specific activities and do not distinguish between GitHub Apps and OAuth Apps. In this paper, we propose BDGOA, a bot detection technique for OAuth Apps. BDGOA uses 24 features, which fall into three dimensions: account information, account activity, and text similarity. To better explore the behavioral features, we define a fine-grained classification of behavioral events and introduce self-similarity to quantify the repeatability of behavioral sequences. We apply five machine learning classifiers to the benchmark dataset and choose random forest as the final classifier, which achieves the highest F1-score of 95.83%. Experimental comparisons with state-of-the-art approaches also demonstrate the superiority of BDGOA.
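As a rough illustration of the text-similarity and repeatability ideas mentioned above, the following minimal Python sketch computes a mean pairwise Jaccard similarity over an account's comments and a simple repeated n-gram ratio over its event sequence. It is not the BDGOA feature definition: the function names, the word-level tokenization, and the n-gram repeatability measure are all assumptions made for illustration, since the abstract does not specify how self-similarity is computed.

```python
from collections import Counter
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_jaccard(texts: list[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def repeated_ngram_ratio(events: list[str], n: int = 2) -> float:
    """Hypothetical repeatability proxy (an assumption, not the paper's
    self-similarity): the fraction of length-n windows in the event
    sequence whose content occurs more than once."""
    grams = [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


if __name__ == "__main__":
    # Made-up example data; event names are illustrative only.
    comments = ["build passed", "build passed", "build failed"]
    events = ["push", "comment", "push", "comment"]
    print(mean_pairwise_jaccard(comments))
    print(repeated_ngram_ratio(events))
```

Under these assumptions, an account that posts near-identical comments and cycles through the same events would score high on both measures, which is the kind of signal a classifier such as random forest could use alongside account information.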
This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/