HALLUCINATIONS OF THE GENERALIST, BRITTLENESS OF THE EXPERT: A STUDY ON FINE-TUNED OPEN SOURCE LLM FOR SOC ANALYSIS
DOI: https://doi.org/10.53555/1vhsgg43

Keywords: Large Language Models, Cyber Threat Intelligence, MITRE ATT&CK, Fine-Tuning, DeepSeek, Security Operations Center

Abstract
In the evolving landscape of cyber defense, where knowledge is the decisive edge, the practical deployment of Large Language Models (LLMs) faces significant challenges. General-purpose models from large providers impose cost and data-privacy barriers for most Security Operations Centers (SOCs), while open-source models often hallucinate in structured domains such as MITRE ATT&CK. This study investigates these trade-offs by fine-tuning the DeepSeek R1 Distilled LLaMA 8B model on a curated corpus of MITRE ATT&CK semantics, leveraging the Unsloth framework to demonstrate a reproducible pipeline on consumer-grade GPUs. Notably, the entire fine-tuning process was conducted on commercially available Kaggle infrastructure, showing that effective LLM specialization can be achieved without access to high-end hardware. Our evaluation contrasts this fine-tuned "Focused Specialist" with the untuned "Creative Generalist" base model. The results reveal a critical trade-off: the base model, while a superior contextual reasoner, suffers from catastrophic factual hallucinations that make it dangerously unreliable. In contrast, our fine-tuned model achieves 84% exact-match accuracy and proves far more trustworthy, but this specialization introduces a "brittleness": its knowledge is static and less creatively applied. Our LLM-as-a-Judge evaluation confirmed that the fine-tuned model was the superior co-pilot in 73% of test cases, achieving a 25% higher average quality score.