A Lightweight and Interpretable Machine Learning Pipeline for Phishing Website Detection Under Feature-Budget Constraints

Authors

  • Hongzhi Lu, School of Industrial Technology, Universiti Sains Malaysia, Gelugor 11800, Malaysia

Keywords:

phishing website detection; network security; feature selection; machine learning; model interpretability

Abstract

Phishing websites remain a practical threat to Web users and online services, yet many machine-learning studies report headline accuracy without examining whether that performance is retained under smaller feature budgets, probability-calibration requirements and lightweight inference constraints. This study develops a reproducible, offline machine-learning pipeline for phishing website detection using the public UCI Phishing Websites data set. The pipeline evaluates logistic regression, a calibrated linear support vector machine, a decision tree, a random forest and histogram gradient boosting at budgets of 5, 10, 15 and 30 features selected by mutual information. Models were assessed with a stratified 70/30 hold-out split and five-fold stratified cross-validation on the training partition, reporting accuracy, precision, recall, F1, ROC-AUC, average precision, Brier score, model size, latency and permutation feature importance. The best model was histogram gradient boosting with 30 features, which achieved F1 = 0.9675, recall = 0.9626 and ROC-AUC = 0.9960 on the hold-out set while requiring 6.362 ms per 1000 samples on the preparation machine. The most influential features were URL_of_Anchor and SSLfinal_State, followed by web_traffic and Prefix_Suffix. The results show that tree-based ensemble models provide strong discrimination on this feature-encoded data set and that a 15-feature budget preserves much of the full-feature performance. The contribution is a reproducible benchmark and feature-budget analysis for lightweight phishing screening; deployment on live traffic requires further temporal, adversarial and browser-integration validation.
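
The feature-budget step described in the abstract, ranking features by mutual information with the class label and keeping the top k, can be illustrated with a minimal sketch. This is not the study's implementation: the abstract does not name the software used, so the sketch relies only on the Python standard library and assumes discretely encoded features (the UCI data set encodes features as small integer codes). The function names, the toy columns `perfect`, `partial` and `noise`, and the toy labels are all hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    px = Counter(feature)                  # marginal counts of x
    py = Counter(labels)                   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(columns, labels, k):
    """Rank features by mutual information with the labels; keep the top k."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data (not taken from the UCI data set).
labels = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "perfect": [1, 1, 1, 1, -1, -1, -1, -1],  # identical to the label
    "partial": [1, 1, 1, -1, -1, -1, -1, 1],  # agrees on 6 of 8 samples
    "noise":   [1, -1, 1, -1, 1, -1, 1, -1],  # independent of the label
}

selected, scores = select_top_k(columns, labels, k=2)
print(selected)
```

In a budgeted experiment of the kind the abstract describes, the ranking would be computed on the training partition only and the same selected columns applied to the hold-out set, so that the hold-out labels do not influence which features are kept.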

Published

2026-06-21

Section

Articles