Optimizing Malware Detection and Prevention on Proxy Servers Through Random Forest and Lexical Feature Analysis

Authors

  • Meitro Hartanto Andalas Saputra, Faculty of Information Technology, Universitas Budi Luhur, Indonesia
  • Dwi Pebrianti, Department of Mechanical & Aerospace Engineering, Faculty of Engineering, International Islamic University Malaysia, Malaysia
  • Luhur Bayuaji, Faculty of Data Science & Information Technology, INTI International University, Malaysia
  • Rusdah, Faculty of Information Technology, Universitas Budi Luhur, Indonesia

DOI:

https://doi.org/10.35806/ijoced.v7i1.485

Keywords:

Lexical Features, Malware Detection, Proxy Server Logs, Random Forest, URL Classification

Abstract

Malware has become a significant concern as the number of malicious websites hosting spam, phishing, malware, and other threats grows. This research aims to predict malware URLs using lexical features for feature extraction and random forest for classification. The dataset, sourced from kaggle.com, includes benign, phishing, spam, malware, and defacement URLs. To address class imbalance, random oversampling was applied to balance the training data. Recursive feature elimination was used to optimize the lexical features: feature sets of different sizes (10, 15, 19, 23, 29, and 35) were tested for classification accuracy, and the 23-feature set achieved 98% accuracy. Validation tests with actual university network data confirmed the model’s effectiveness, classifying 11,566 samples in 9 minutes. URL filtering used log analyzer tools that captured internet traffic during working hours over one month. The results suggest that this approach can classify malicious URLs efficiently and could be implemented for real-time detection in proxy server logs, helping IT departments prevent malware from spreading via web traffic.
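The two preprocessing steps named in the abstract, lexical feature extraction and random oversampling, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the abstract does not list the lexical features, so the ones computed here (lengths, character counts, entropy) are assumptions, and the function names `lexical_features` and `random_oversample` are hypothetical.

```python
import math
import random
from collections import Counter
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature choice is an assumption for illustration only; the
    abstract does not enumerate the features used in the study.
    """
    # urlparse only fills in netloc when a scheme or leading '//' is present.
    has_scheme = url.startswith(("http://", "https://", "ftp://"))
    parsed = urlparse(url if has_scheme else "//" + url)

    # Shannon entropy of the URL's character distribution.
    counts = Counter(url)
    total = len(url)
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in counts.values())
        if total
        else 0.0
    )

    return {
        "url_length": len(url),
        "host_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        "query_params": len(parsed.query.split("&")) if parsed.query else 0,
        "entropy": entropy,
    }


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

In a pipeline like the one described, the resulting feature dictionaries would be converted to a numeric matrix, oversampled on the training split only, and then passed to feature selection and a random forest classifier; those later stages are omitted here.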

Published

2025-04-14

Issue

Vol. 7 No. 1 (2025)

Section

Articles

How to Cite

Andalas Saputra, M. H., Pebrianti, D., Bayuaji, L., & Rusdah. (2025). Optimizing Malware Detection and Prevention on Proxy Servers Through Random Forest and Lexical Feature Analysis. Indonesian Journal of Computing, Engineering, and Design (IJoCED), 7(1), 1-15. https://doi.org/10.35806/ijoced.v7i1.485