Volume 3, Issue 2

The Hidden Sexual Minorities: Machine Learning Approaches to Estimate the Sexual Minority Orientation Among Beijing College Students

Yunsong Chen¹, Guangye He¹, Guodong Ju²
¹ Department of Sociology, Nanjing University, Nanjing 210023, China
² Department of Social Policy, London School of Economics and Political Science, London, WC2A 2AE, UK

Abstract

Drawing on the fourth wave of the Beijing College Students Panel Survey (BCSPS), this study uses machine learning methods to estimate more accurately the percentage of potential sexual minorities among college students in Beijing. Specifically, we employ random forest (RF), an ensemble learning approach for classification and regression, to predict the sexual orientation of respondents who were unwilling to disclose their sexual identity. To overcome the class imbalance arising from the very different sizes of the sexual minority and majority groups, we apply repeated random sub-sampling to the training set: we partition the respondents who reported a heterosexual orientation into varying numbers of splits and combine each split with the respondents who reported a sexual minority orientation. The prediction from the 24-split random forest suggests that 5.71% of youths in Beijing have a sexual minority orientation, almost twice the original estimate of 3.03%. The results are robust to alternative learning methods and covariate sets. They also suggest that random forest outperforms other learning algorithms, including AdaBoost, Naïve Bayes, support vector machine (SVM), and logistic regression, in dealing with missing data, achieving higher accuracy, F1 score, and area under the curve (AUC).

Keywords: machine learning, random forest, sexual minority orientation, imbalanced missing data
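
The partition-and-combine sub-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `balanced_training_sets` and `majority_vote`, the toy group sizes, and the use of a simple majority vote to combine per-split predictions are all assumptions made for illustration; the abstract does not state how predictions from the individual splits are aggregated. In the study, each resulting training set would be used to fit a random forest.

```python
import random
from collections import Counter


def balanced_training_sets(majority, minority, n_splits, seed=0):
    """Partition the majority rows into n_splits disjoint parts and
    append every minority row to each part (hypothetical helper)."""
    if not 1 <= n_splits <= len(majority):
        raise ValueError("n_splits must be between 1 and len(majority)")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    # Striding yields disjoint parts whose sizes differ by at most one.
    parts = [shuffled[i::n_splits] for i in range(n_splits)]
    return [part + list(minority) for part in parts]


def majority_vote(per_model_labels):
    """Combine one list of predicted labels per model into a single list
    by taking the most common label for each row (assumed aggregation)."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_labels)]


# Toy data only: 240 majority-group rows and 10 minority-group rows.
majority = [("maj", i) for i in range(240)]
minority = [("min", i) for i in range(10)]

sets = balanced_training_sets(majority, minority, n_splits=24)
print(len(sets), len(sets[0]))  # 24 20
```

With 24 splits, each training set here holds 10 majority rows and all 10 minority rows, so every classifier is trained on balanced classes while every majority row is still used exactly once across the ensemble.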

Publication history

Received: 14 January 2021
Revised: 24 November 2021
Accepted: 25 November 2021
Published: 01 June 2022
Issue date: June 2022

Copyright

© The author(s) 2022

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
