CyberSecQwen-4B
Overview
A Qwen3-4B base fine-tuned for cyber threat intelligence — CWE classification, CVE-to-CWE mapping, and code-pattern reasoning. Trained on AMD MI300X with a recipe designed to preserve instruction-tuned behavior while shifting the model's distribution onto the CTI domain.
Beating an 8B Cisco instruct specialist on CTI-MCQ at half the size.
Walkthrough
Research, in 5 minutes
A walkthrough of the training methodology, AMD MI300X workflow, and the benchmark results — context for the numbers below.
Performance
CTI-Bench, n=5, temp 0.3
Evaluated on the public CTI-Bench multi-trial protocol. Scores below are mean ± 1σ over five runs.
| Model | Params | CTI-RCM | CTI-MCQ |
|---|---|---|---|
| Foundation-Sec-8B (base, 5-shot) | 8B | 0.7450 | 0.6552 |
| Cisco-Foundation-Sec-Instruct-8B | 8B | 0.6850 | 0.4996 |
| CyberPal-2.0-20B | 20B | 0.7280 | 0.7384 |
| CyberSecQwen-4B | 4B | 0.6664 | 0.5868 |
| Gemma4Defense-2B (companion) | 2B | 0.6754 | 0.6042 |
| Qwen3-4B-Instruct-2507 (raw) | 4B | 0.5190 | 0.4732 |
| Qwen3-4B-Base (5-shot) | 4B | 0.5170 | 0.6672 |
| Gemma-4-E2B-it (raw) | 2B | 0.5800 | 0.5780 |
Recipe
What changed, in three ideas.
Strict eval-set scrub
Training corpus is filtered against CTI-Bench prompts and near-duplicates before any update sees a gradient. No leakage, no shortcutting.
Keep the instruction-tuned voice
Loss is balanced so the model adopts CTI domain knowledge without losing Qwen3's general instruction-following. Tone, formatting, and refusal behavior remain intact.
Same recipe, different base
The same procedure applied to Gemma-4-E2B-it produces Gemma4Defense-2B with comparable lift — evidence the gains come from the recipe, not the base.
Corpus · ~14,776 supervised records
- rcm-2021 (decontaminated) — ~6,776 CVE → CWE classification examples from MITRE/NVD 2021 cohort, with all CTI-Bench overlap items removed.
- cve_cti synthetic Q&A — ~8,000 defensive-analyst-style Q&A pairs grounded in CVE descriptions.
An earlier internal CPT corpus had 72.3% test-set overlap with CTI-Bench. The released model trains exclusively on the 2021 cohort with overlap removed — released numbers are post-fix.
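As a concrete illustration of the scrub, here is a minimal sketch of an n-gram-overlap filter against CTI-Bench prompts. The threshold, helper names, and record layout are assumptions for illustration, not the released filtering code.

```python
# Hypothetical decontamination sketch: drop training records whose text
# overlaps a CTI-Bench prompt above a Jaccard-similarity threshold.
import re

def char_ngrams(text: str, n: int = 8) -> set[str]:
    """Lowercase, collapse whitespace, and return the set of character n-grams."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def decontaminate(train_records, bench_prompts, threshold: float = 0.5):
    """Keep only training records that are not near-duplicates of any benchmark prompt."""
    bench_grams = [char_ngrams(p) for p in bench_prompts]
    kept = []
    for rec in train_records:
        grams = char_ngrams(rec["text"])
        if all(jaccard(grams, bg) < threshold for bg in bench_grams):
            kept.append(rec)
    return kept
```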
Hyperparameters
- Base — Qwen3-4B-Instruct-2507
- Adapter — LoRA r=64, α=64, dropout=0.05
- Targets — q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Optimizer — AdamW · cosine schedule · warmup_ratio=0.05 · weight_decay=0.01
- LR (peak) — 5e-5
- Batch (effective) — 2 per device · grad_accum 8 · effective batch 16
- Schedule — 10 epochs · max_seq_len 4096
- Precision — bfloat16 throughout
- Attention — flash_attention_2 (FA2)
- Wall clock (MI300X) — 173 min · 1,290 steps · 7.85 s/step
- Eval protocol — Foundation-Sec arXiv:2504.21039 §B.3–B.4 · n=5 · temp 0.3
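A minimal sketch of this configuration, assuming peft + transformers; dataset loading, chat-template formatting, and the Trainer call are omitted. This is an illustrative reconstruction, not the released training script.

```python
# Minimal sketch of the adapter + optimizer configuration listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen3-4B-Instruct-2507"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,               # bf16 throughout
    attn_implementation="flash_attention_2",  # head_dim=128 fits the MI300X LDS budget
)

lora = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="cybersecqwen-4b-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch 16
    num_train_epochs=10,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
)
# Sequences are truncated/packed to max_seq_len=4096 in the (omitted) tokenization step.
```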
Hardware
Trained on AMD MI300X.
The recipe was developed on a single-node MI300X stack. Optimizations are aimed at ROCm; the recipe ports cleanly to other datacenter-class GPUs (40 GB+ VRAM) with the FA2 caveat noted below.
Stack
- Accelerator — AMD Instinct MI300X · 192 GB HBM3 (gfx942)
- Runtime — ROCm 7.0
- Docker image — vllm/vllm-openai-rocm:latest
- PyTorch — 2.6.0 (ROCm)
- flash-attn — 2.8.3 (preinstalled in the vLLM ROCm image)
- vLLM — 0.10.1
FA2 viability per model family
FA2 via the ROCm/flash-attention Composable-Kernels backend is supported on MI300X but bounded at head_dim ≤ 256 by the LDS shared-memory budget.
| Family | head_dim | FA2 | Note |
|---|---|---|---|
| Qwen3 (this work) | 128 | ✓ enabled | fits LDS budget — ~1.6× faster than sdpa |
| Llama-3 / Mistral | 128 | ✓ works | same head_dim class |
| Gemma-2 | 256 | ✓ boundary | at LDS limit, viable |
| Gemma-4 (global layers) | 512 | ✗ disabled | exceeds LDS — fallback to sdpa |
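The table reduces to a simple rule. Below is a small sketch that reads head_dim from a Hugging Face config and falls back to sdpa above the 256-dim LDS bound; the threshold and model ID come from this card, while the helper itself is illustrative.

```python
# Pick an attention implementation for ROCm FA2 based on head_dim.
from transformers import AutoConfig

FA2_MAX_HEAD_DIM = 256  # LDS shared-memory bound on gfx942, per the table above

def pick_attn_impl(model_id: str) -> str:
    cfg = AutoConfig.from_pretrained(model_id)
    # Some configs store head_dim explicitly (Qwen3 does); otherwise derive it.
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
    return "flash_attention_2" if head_dim <= FA2_MAX_HEAD_DIM else "sdpa"

print(pick_attn_impl("Qwen/Qwen3-4B-Instruct-2507"))  # expected: flash_attention_2
```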
Optimizations
- FA2 enabled in training — Qwen3-4B head_dim=128 fits the gfx942 LDS budget; ~7.85 s/step at LoRA r=64 / max_seq_len=4096 — ~1.6× faster than the same recipe on Gemma-4 (sdpa fallback).
- TRITON_ATTN backend for vLLM inference — recommended on MI300X for Qwen3-class models.
- bf16 throughout — native MI300X precision; no mixed-precision dance.
- AITER kernels for matmul — VLLM_ROCM_USE_AITER=1 + TORCH_BLAS_PREFER_HIPBLASLT=1. Note: this works for Qwen3 dense; it does NOT work for gpt-oss MoE (AITER=0 required there).
- Prefix caching — vLLM --enable-prefix-caching for shared system-prompt batches (see the sketch below).
- HF Transfer for push/pull — HF_HUB_ENABLE_HF_TRANSFER=1 saturates ~240 MB/s; an 8 GB merged model uploads in ~36 s.
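A minimal sketch of the prefix-caching setup with vLLM's offline Python API, batching CVE descriptions behind one shared system prompt. The prompt text is a placeholder and chat templating is omitted for brevity.

```python
# Offline batch inference with vLLM; a shared system prefix benefits from prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="athena129/CyberSecQwen-4B",
    dtype="bfloat16",
    enable_prefix_caching=True,   # reuse the KV cache for the shared system prompt
)

system = "You are a CTI analyst. Map the described vulnerability to a CWE and justify briefly.\n\n"
cves = [
    "CVE description: SQL query built by string concatenation from a URL parameter.",
    "CVE description: archive extraction writes entries to attacker-controlled paths.",
]
params = SamplingParams(temperature=0.3, max_tokens=256)

# Note: raw-string prompts skip the chat template; apply it in production use.
for out in llm.generate([system + c for c in cves], params):
    print(out.outputs[0].text.strip())
```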
ROCm environment (verified-working)
```bash
# env exported inside the vLLM ROCm Docker container
export VLLM_ROCM_USE_AITER=1
export TORCH_BLAS_PREFER_HIPBLASLT=1
export HF_HUB_DISABLE_XET=1
export PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'
export AITER_ROCM_ARCH='gfx942;gfx950'
export HIP_FORCE_DEV_KERNARG=1
```
Portability: the recipe runs on other datacenter-class GPUs with the AMD-specific env vars dropped (they are no-ops elsewhere); install FA2 via pip install flash-attn --no-build-isolation. VRAM minimums: 24 GB+ for training, 12 GB+ for inference (40 GB+ recommended).
Limitations
What it's good at, and where it isn't.
In-distribution · verified strong
CWE classification
Mapping a described weakness or code pattern to a CWE ID with a short rationale. Calibrated on CTI-Bench RCM.
CVE-to-CWE on trained CVEs
Returns the underlying CWE for CVEs whose descriptions are in-distribution. Confidence drops for CVEs disclosed after the corpus cutoff.
Code-pattern reasoning
Identifying the weakness class behind a small snippet (string-format SQL, unchecked path joins, inline HTML rendering, etc.).
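For illustration, a toy instance of the first pattern (string-formatted SQL, canonically CWE-89), the kind of snippet the model is asked to classify, alongside the parameterized fix:

```python
# Toy example of a string-formatted SQL query (CWE-89: SQL Injection).
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is interpolated directly into the SQL statement.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe alternative: parameterized query.
    return conn.execute("SELECT id, email FROM users WHERE name = ?", (username,)).fetchall()
```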
Out-of-distribution · known failure modes
MITRE ATT&CK technique IDs
Wrong T-numbers (returned T1543.003 — Windows Service — for scheduled-task persistence, vs the correct T1053.005, Scheduled Task). Wrong technique names (called LSASS dumping "Extract Web Credentials"; correct is T1003.001, OS Credential Dumping: LSASS Memory).
CVE implementation specifics
Fabricates implementation details. Cited a non-existent pgp binary path when asked about CVE-2024-3400 (PAN-OS GlobalProtect — actual root cause is session-ID handling). Top-level CWE call usually correct; deep mechanics often invented.
Tool categorization
Misclassifies offensive tooling. Listed Mimikatz as a "ransomware loader" — Mimikatz is a credential-dumping utility, not a ransomware loader.
Validated empirically · we tested 14, kept 7
We ran 14 candidate demo prompts through the deployed model and graded each output for factual accuracy. 7 passed, 7 were cut: Log4Shell vs Spring4Shell exploit-primitive comparison; CVE-2024-3400 explanation; LSASS ATT&CK mapping; schtasks ATT&CK mapping; Dockerfile curl|bash review; ransomware detection signals; Python dynamic-code-execution CWE. Most cuts were OOD hallucinations: invented technique numbers, fabricated CVE specifics, mislabeled mitigations.
Out-of-scope use
- Generating exploit code, weaponized PoC, or attacker tradecraft.
- Critical security decisions without qualified human review.
- Legal, medical, or other regulated-advice contexts.
- Tasks outside cybersecurity (general chat, code generation, summarization).
- Violation of laws (CFAA, GDPR, etc.).
Recommendations
- Pair with retrieval. Ground CVE / advisory queries against an authoritative source before quoting specifics.
- Sample at low temperature. CWE selection benefits from determinism (temp 0.2–0.3).
- Treat output as a triage hint, not a verdict — keep an analyst in the loop.
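A minimal sketch of that triage pattern, assuming the model is served behind an OpenAI-compatible endpoint such as vLLM's server; the URL, API key, and prompt are placeholders.

```python
# Query a served CyberSecQwen-4B at low temperature and treat the answer as a triage hint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="athena129/CyberSecQwen-4B",
    temperature=0.3,   # CWE selection benefits from near-deterministic sampling
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a CTI analyst. Return a CWE ID and a short rationale."},
        {"role": "user", "content": "CVE description: unauthenticated path traversal in a file-download endpoint."},
    ],
)
print(resp.choices[0].message.content)
# Verify the returned CWE against NVD/MITRE before acting on it: a triage hint, not a verdict.
```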
Companion
Gemma4Defense-2B
The same recipe applied to Gemma-4-E2B-it produces a 2B companion that scores RCM 0.6754 ± 0.0035 and MCQ 0.6042 ± 0.0090 — slightly higher MCQ than CyberSecQwen-4B at half the parameters again. Use the 2B when memory is tight; use the 4B when you want stronger general instruction-following and longer-form rationales.
Citation
Cite & license.
% bibtex
@misc{cybersecqwen2026,
title = {CyberSecQwen-4B: A Compact CTI Specialist Fine-Tuned from
Qwen3-4B-Instruct-2507 on AMD MI300X},
author = {Mulia, Samuel},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/athena129/CyberSecQwen-4B}
}
License: Apache 2.0, end-to-end — weights, training code, and the synthetic CVE/CTI Q&A corpus. The decontaminated 2021 CVE→CWE mappings derive from public MITRE/NVD records. Evaluation protocol: Foundation-Sec-8B (arXiv:2504.21039). Benchmark: CTI-Bench.
Ready to try it?
Test CyberSecQwen-4B on your own prompts below. ZeroGPU; cold start ~10–20 s, then warm.
Try the chat
Map a vulnerability to a CWE.
Paste a snippet, log line, or CVE. The model returns a Common Weakness Enumeration with a one-paragraph rationale.