CyberSecQwen-4B
Overview
A Qwen3-4B base fine-tuned for cyber threat intelligence — CWE classification, CVE-to-CWE mapping, and code-pattern reasoning. Trained on AMD MI300X with a recipe designed to preserve instruction-tuned behavior while shifting the model's distribution onto the CTI domain.
Beating an 8B Cisco instruct specialist on CTI-MCQ at half the size.
Walkthrough
Research, in 5 minutes
A walkthrough of the training methodology, AMD MI300X workflow, and the benchmark results — context for the numbers below.
Performance
CTI-Bench, n=5, temp 0.3
Evaluated on the public CTI-Bench multi-trial protocol. Scores below are mean ± 1σ over five runs.
| Model | Params | CTI-RCM | CTI-MCQ |
|---|---|---|---|
| Foundation-Sec-8B (base, 5-shot) | 8B | 0.7450 | 0.6552 |
| Cisco-Foundation-Sec-Instruct-8B | 8B | 0.6850 | 0.4996 |
| CyberPal-2.0-20B | 20B | 0.7280 | 0.7384 |
| CyberSecQwen-4B | 4B | 0.6664 | 0.5868 |
| Gemma4Defense-2B (companion) | 2B | 0.6754 | 0.6042 |
| Qwen3-4B-Instruct-2507 (raw) | 4B | 0.5190 | 0.4732 |
| Qwen3-4B-Base (5-shot) | 4B | 0.5170 | 0.6672 |
| Gemma-4-E2B-it (raw) | 2B | 0.5800 | 0.5780 |
Recipe
What changed, in three ideas.
Strict eval-set scrub
Training corpus is filtered against CTI-Bench prompts and near-duplicates before any update sees a gradient. No leakage, no shortcutting.
Keep the instruction-tuned voice
Loss is balanced so the model adopts CTI domain knowledge without losing Qwen3's general instruction-following. Tone, formatting, and refusal behavior remain intact.
Same recipe, different base
The same procedure applied to Gemma-4-E2B-it produces Gemma4Defense-2B with comparable lift — evidence the gains come from the recipe, not the base.
Corpus · ~14,776 supervised records
- rcm-2021 (decontaminated) — ~6,776 CVE → CWE classification examples from MITRE/NVD 2021 cohort, with all CTI-Bench overlap items removed.
- cve_cti synthetic Q&A — ~8,000 defensive-analyst-style Q&A pairs grounded in CVE descriptions.
An earlier internal CPT corpus had 72.3% test-set overlap with CTI-Bench. The released model trains exclusively on the 2021 cohort with overlap removed — released numbers are post-fix.
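As a concrete illustration of the scrub, here is a minimal sketch of an n-gram-overlap filter against CTI-Bench prompts. The threshold, helper names, and record layout are assumptions for illustration, not the released filtering code.

```python
# Hypothetical decontamination sketch: drop training records whose text
# overlaps a CTI-Bench prompt above a Jaccard-similarity threshold.
import re

def char_ngrams(text: str, n: int = 8) -> set[str]:
    """Lowercase, collapse whitespace, and return the set of character n-grams."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def decontaminate(train_records, bench_prompts, threshold: float = 0.5):
    """Keep only training records that are not near-duplicates of any benchmark prompt."""
    bench_grams = [char_ngrams(p) for p in bench_prompts]
    kept = []
    for rec in train_records:
        grams = char_ngrams(rec["text"])
        if all(jaccard(grams, bg) < threshold for bg in bench_grams):
            kept.append(rec)
    return kept
```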
Hyperparameters
- Base — Qwen3-4B-Instruct-2507
- Adapter — LoRA r=64, α=64, dropout=0.05
- Targets — q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Optimizer — AdamW · cosine schedule · warmup_ratio=0.05 · weight_decay=0.01
- LR (peak) — 5e-5
- Batch (effective) — 2 per device · grad_accum 8 · effective batch 16
- Schedule — 10 epochs · max_seq_len 4096
- Precision — bfloat16 throughout
- Attention — flash_attention_2 (FA2)
- Wall clock (MI300X) — 173 min · 1,290 steps · 7.85 s/step
- Eval protocol — Foundation-Sec arXiv:2504.21039 §B.3–B.4 · n=5 · temp 0.3
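A minimal sketch of this configuration, assuming peft + transformers; dataset loading, chat-template formatting, and the Trainer call are omitted. This is an illustrative reconstruction, not the released training script.

```python
# Minimal sketch of the adapter + optimizer configuration listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen3-4B-Instruct-2507"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,               # bf16 throughout
    attn_implementation="flash_attention_2",  # head_dim=128 fits the MI300X LDS budget
)

lora = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="cybersecqwen-4b-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch 16
    num_train_epochs=10,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
)
# Sequences are truncated/packed to max_seq_len=4096 in the (omitted) tokenization step.
```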
Hardware
Trained on AMD MI300X.
The recipe was developed on a single-node MI300X stack. Optimizations are aimed at ROCm; the recipe ports cleanly to other datacenter-class GPUs (40 GB+ VRAM) with the FA2 caveat noted below.
Stack
- Accelerator — AMD Instinct MI300X · 192 GB HBM3 (gfx942)
- Runtime — ROCm 7.0
- Docker image — vllm/vllm-openai-rocm:latest
- PyTorch — 2.6.0 (ROCm)
- flash-attn — 2.8.3 (preinstalled in the vLLM ROCm image)
- vLLM — 0.10.1
FA2 viability per model family
FA2 via the ROCm/flash-attention Composable-Kernels backend is supported on MI300X but bounded at head_dim ≤ 256 by the LDS shared-memory budget.
| Family | head_dim | FA2 | Note |
|---|---|---|---|
| Qwen3 (this work) | 128 | ✓ enabled | fits LDS budget — ~1.6× faster than sdpa |
| Llama-3 / Mistral | 128 | ✓ works | same head_dim class |
| Gemma-2 | 256 | ✓ boundary | at LDS limit, viable |
| Gemma-4 (global layers) | 512 | ✗ disabled | exceeds LDS — fallback to sdpa |
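The table reduces to a simple rule. Below is a small sketch that reads head_dim from a Hugging Face config and falls back to sdpa above the 256-dim LDS bound; the threshold and model ID come from this card, while the helper itself is illustrative.

```python
# Pick an attention implementation for ROCm FA2 based on head_dim.
from transformers import AutoConfig

FA2_MAX_HEAD_DIM = 256  # LDS shared-memory bound on gfx942, per the table above

def pick_attn_impl(model_id: str) -> str:
    cfg = AutoConfig.from_pretrained(model_id)
    # Some configs store head_dim explicitly (Qwen3 does); otherwise derive it.
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
    return "flash_attention_2" if head_dim <= FA2_MAX_HEAD_DIM else "sdpa"

print(pick_attn_impl("Qwen/Qwen3-4B-Instruct-2507"))  # expected: flash_attention_2
```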
Optimizations
- FA2 enabled in training — Qwen3-4B head_dim=128 fits the gfx942 LDS budget; ~7.85 s/step at LoRA r=64 / max_seq_len=4096 — ~1.6× faster than the same recipe on Gemma-4 (sdpa fallback).
- TRITON_ATTN backend for vLLM inference — recommended on MI300X for Qwen3-class models.
- bf16 throughout — native MI300X precision; no mixed-precision dance.
- AITER kernels for matmul — VLLM_ROCM_USE_AITER=1 + TORCH_BLAS_PREFER_HIPBLASLT=1. Note: this works for Qwen3 dense; it does NOT work for gpt-oss MoE (AITER=0 required there).
- Prefix caching — vLLM --enable-prefix-caching for shared system-prompt batches (see the sketch below).
- HF Transfer for push/pull — HF_HUB_ENABLE_HF_TRANSFER=1 saturates ~240 MB/s; an 8 GB merged model uploads in ~36 s.
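A minimal sketch of the prefix-caching setup with vLLM's offline Python API, batching CVE descriptions behind one shared system prompt. The prompt text is a placeholder and chat templating is omitted for brevity.

```python
# Offline batch inference with vLLM; a shared system prefix benefits from prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="athena129/CyberSecQwen-4B",
    dtype="bfloat16",
    enable_prefix_caching=True,   # reuse the KV cache for the shared system prompt
)

system = "You are a CTI analyst. Map the described vulnerability to a CWE and justify briefly.\n\n"
cves = [
    "CVE description: SQL query built by string concatenation from a URL parameter.",
    "CVE description: archive extraction writes entries to attacker-controlled paths.",
]
params = SamplingParams(temperature=0.3, max_tokens=256)

# Note: raw-string prompts skip the chat template; apply it in production use.
for out in llm.generate([system + c for c in cves], params):
    print(out.outputs[0].text.strip())
```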
ROCm environment (verified-working)
```bash
# env exported inside the vLLM ROCm Docker container
export VLLM_ROCM_USE_AITER=1
export TORCH_BLAS_PREFER_HIPBLASLT=1
export HF_HUB_DISABLE_XET=1
export PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'
export AITER_ROCM_ARCH='gfx942;gfx950'
export HIP_FORCE_DEV_KERNARG=1
```
Portability: the recipe runs on other datacenter-class GPUs with the AMD-specific env vars dropped (they are no-ops elsewhere); install FA2 via pip install flash-attn --no-build-isolation. VRAM minimums: 24 GB+ for training, 12 GB+ for inference (40 GB+ recommended).
Limitations
What it's good at, and where it isn't.
In-distribution · verified strong
CWE classification
Mapping a described weakness or code pattern to a CWE ID with a short rationale. Calibrated on CTI-Bench RCM.
CVE-to-CWE on trained CVEs
Returns the underlying CWE for CVEs whose descriptions are in-distribution. Confidence drops for CVEs disclosed after the corpus cutoff.
Code-pattern reasoning
Identifying the weakness class behind a small snippet (string-format SQL, unchecked path joins, inline HTML rendering, etc.).
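For illustration, a toy instance of the first pattern (string-formatted SQL, canonically CWE-89), the kind of snippet the model is asked to classify, alongside the parameterized fix:

```python
# Toy example of a string-formatted SQL query (CWE-89: SQL Injection).
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is interpolated directly into the SQL statement.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe alternative: parameterized query.
    return conn.execute("SELECT id, email FROM users WHERE name = ?", (username,)).fetchall()
```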
Out-of-distribution · known failure modes
MITRE ATT&CK technique IDs
Wrong T-numbers (returned T1543.003 — Windows Service — for scheduled-task persistence, vs the correct T1053.005, Scheduled Task). Wrong technique names (called LSASS dumping "Extract Web Credentials"; correct is T1003.001, OS Credential Dumping: LSASS Memory).
CVE implementation specifics
Fabricates implementation details. Cited a non-existent pgp binary path when asked about CVE-2024-3400 (PAN-OS GlobalProtect — actual root cause is session-ID handling). Top-level CWE call usually correct; deep mechanics often invented.
Tool categorization
Misclassifies offensive tooling. Listed Mimikatz as a "ransomware loader" — Mimikatz is a credential-dumping utility, not a ransomware loader.
Validated empirically · we tested 14, kept 7
We ran 14 candidate demo prompts through the deployed model and graded each output for factual accuracy. 7 passed, 7 were cut: Log4Shell vs Spring4Shell exploit-primitive comparison; CVE-2024-3400 explanation; LSASS ATT&CK mapping; schtasks ATT&CK mapping; Dockerfile curl|bash review; ransomware detection signals; Python dynamic-code-execution CWE. Most cuts were OOD hallucinations: invented technique numbers, fabricated CVE specifics, mislabeled mitigations.
Out-of-scope use
- Generating exploit code, weaponized PoC, or attacker tradecraft.
- Critical security decisions without qualified human review.
- Legal, medical, or other regulated-advice contexts.
- Tasks outside cybersecurity (general chat, code generation, summarization).
- Violation of laws (CFAA, GDPR, etc.).
Recommendations
- Pair with retrieval. Ground CVE / advisory queries against an authoritative source before quoting specifics.
- Sample at low temperature. CWE selection benefits from determinism (temp 0.2–0.3).
- Treat output as a triage hint, not a verdict — keep an analyst in the loop.
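A minimal sketch of that triage pattern, assuming the model is served behind an OpenAI-compatible endpoint such as vLLM's server; the URL, API key, and prompt are placeholders.

```python
# Query a served CyberSecQwen-4B at low temperature and treat the answer as a triage hint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="athena129/CyberSecQwen-4B",
    temperature=0.3,   # CWE selection benefits from near-deterministic sampling
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a CTI analyst. Return a CWE ID and a short rationale."},
        {"role": "user", "content": "CVE description: unauthenticated path traversal in a file-download endpoint."},
    ],
)
print(resp.choices[0].message.content)
# Verify the returned CWE against NVD/MITRE before acting on it: a triage hint, not a verdict.
```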
Companion
Gemma4Defense-2B
The same recipe applied to Gemma-4-E2B-it produces a 2B companion that scores RCM 0.6754 ± 0.0035 and MCQ 0.6042 ± 0.0090 — slightly higher MCQ than CyberSecQwen-4B at half the parameters again. Use the 2B when memory is tight; use the 4B when you want stronger general instruction-following and longer-form rationales.
Citation
Cite & license.
% bibtex
@misc{cybersecqwen2026,
title = {CyberSecQwen-4B: A Compact CTI Specialist Fine-Tuned from
Qwen3-4B-Instruct-2507 on AMD MI300X},
author = {Mulia, Samuel},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/athena129/CyberSecQwen-4B}
}
License: Apache 2.0, end-to-end — weights, training code, and the synthetic CVE/CTI Q&A corpus. The decontaminated 2021 CVE→CWE mappings derive from public MITRE/NVD records. Evaluation protocol: Foundation-Sec-8B (arXiv:2504.21039). Benchmark: CTI-Bench.
Ready to try it?
Test CyberSecQwen-4B on your own prompts below. ZeroGPU; cold start ~10–20 s, then warm.
Try the chat
Map a vulnerability to a CWE.
Paste a snippet, log line, or CVE. The model returns a Common Weakness Enumeration with a one-paragraph rationale.