Predictive Modelling of the Gender Pay Gap using Machine Learning

Authors

  • Siew Chin Loh, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
  • Maizatul Akmar Ismail, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
  • Jamallah Mohammed Zawia, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

DOI:

https://doi.org/10.37934/arca.39.1.101110

Keywords:

Gender pay gap, wage inequality, predictive modeling, occupational earnings, earnings ratio, employment data, labor market analysis

Abstract

Gender inequity in pay remains a major issue for the world’s economy and society. Traditional econometric methods do not represent well the complex, dynamic non-linearities that drive wage dispersion. This paper addresses that limitation by employing state-of-the-art machine learning methods to construct a predictive model of the GWE. We compare five prediction models (Linear Regression, Random Forest, XGBoost, Support Vector Regression (SVR), and Neural Network) using data from the U.S. Census Bureau’s American Community Survey for 2017 to 2023. The goal was accurate estimation of the Ratio, defined as women’s median wages divided by men’s median wages. Model performance was evaluated via 10-fold cross-validation, and interpretability via SHapley Additive exPlanations (SHAP). The results showed that tree-based ensemble models considerably outperformed the other methods, with Random Forest performing best (R = 0.994, MAPE = 0.443%), followed by XGBoost (R = 0.993, MAPE = 0.450%). Both approaches identified occupation as the most important source of the gender gap. By contrast, Linear Regression was only moderately successful (R = 0.655) and Support Vector Regression failed (R = -0.303). These findings reaffirm the value of machine learning, mainly Random Forest and XGBoost, as a powerful and interpretable predictive tool. Such models give policymakers useful evidence for developing occupation-specific policies and targeted actions to minimize the gender pay gap and promote equity in the workplace.
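
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`earnings_ratio`, `mape`, `k_fold_indices`) and the wage figures are hypothetical, and the fold splitter is a plain contiguous 10-fold partition, since the abstract does not say how the folds were drawn. The abstract's "R" metric is not defined there, so it is not reproduced here.

```python
from statistics import mean


def earnings_ratio(women_median, men_median):
    """Women's median wage divided by men's median wage (the Ratio)."""
    return women_median / men_median


def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds (no shuffling)."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Made-up wage figures for two occupations (not taken from the paper).
actual = [earnings_ratio(40_000, 50_000), earnings_ratio(45_000, 50_000)]
predicted = [0.82, 0.88]  # hypothetical model outputs
print(f"MAPE: {mape(actual, predicted):.3f}%")

for train, test in k_fold_indices(n_samples=25, k=10):
    # A model would be fitted on `train` and scored on `test` here.
    pass
```

In a 10-fold evaluation, each record is used for testing exactly once, and the per-fold errors are averaged to give the reported score.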

Author Biographies

Siew Chin Loh, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

23054980@siswa.um.edu.my

Maizatul Akmar Ismail, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

maizatul@um.edu.my

Jamallah Mohammed Zawia, Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

jamala.zawia@um.edu.my

Published

2025-09-04

Issue

Section

Articles