Mapping fuzzy intent to nmap flags, locally and safely

The hard part of nmap is not running it. It is remembering which of the 600-plus NSE scripts and 150-plus flags maps to the thing you actually want to do. Cheat sheets are static, the man page is enormous, and tab-completion only helps once you already know the first few characters of what you are reaching for. TNmap is a terminal UI that closes that gap: type “check heartbleed” or “enumerate smb shares on this box” and it surfaces the matching nmap invocation, then predicts the next flag as you edit the command line.

What I want to write about is the retrieval stack underneath, because mapping fuzzy human intent onto an exact, correct command is a more interesting problem than it first looks.

Why keyword search is not enough

The naive version is full-text search over flag descriptions. It breaks immediately. “Is the box vulnerable to any old smb exploits” shares almost no keywords with the script names smb-vuln-ms17-010 or smb-vuln-ms10-054, but those are exactly the right answers. Intent and vocabulary diverge. You need semantic matching, and you need it to run on the operator’s machine with no API keys and no cloud round-trip, because the whole point is a tool that works on an engagement where you would not paste your targets into a third-party endpoint.
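
To make the vocabulary gap concrete, here is a minimal sketch that counts the literal tokens the example query shares with the two script names. The helper names `tokens` and `keyword_overlap` are illustrative, not part of TNmap.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query: str, candidate: str) -> set[str]:
    """Return the tokens the query and the candidate literally share."""
    return tokens(query) & tokens(candidate)

query = "is the box vulnerable to any old smb exploits"
for script in ("smb-vuln-ms17-010", "smb-vuln-ms10-054"):
    # "vulnerable" and "vuln" are different tokens, so only "smb" survives.
    print(script, keyword_overlap(query, script))
```

A keyword ranker sees a single shared token, and that token is shared with every other smb-* script too, so it cannot tell the vulnerability checks apart from the enumeration scripts.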

Two-stage retrieval: cheap recall, then expensive precision

TNmap uses the standard retrieve-then-rerank shape, and it is the right shape here for a specific reason.

A sentence-transformer (all-MiniLM-L6-v2) embeds the query into a 384-dimensional vector, which is compared by cosine similarity against a corpus of recipe embeddings to pull the top 30 candidates. That first stage is fast and high-recall: it casts a wide net and is happy to include some wrong answers as long as the right one is in the 30.

Then a cross-encoder (ms-marco-MiniLM-L-6-v2) reranks those 30. A cross-encoder reads the query and a candidate together rather than comparing two independently-computed vectors, so it judges relevance far more precisely. It is also far too slow to run over the whole corpus, which is exactly why you only let it see the 30 the cheap stage already shortlisted. The final score blends the two, weighted toward the reranker:

score = 0.7 * reranker + 0.3 * retriever

The retriever earns its 30 percent: it is the reason the reranker never has to look at an obviously irrelevant candidate. This is the general lesson worth taking away. When one model is accurate but slow and another is fast but coarse, you do not choose between them. You stack them, so the fast one bounds the work the slow one has to do.
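
The shape of the pipeline can be sketched without either model. In the sketch below the embeddings are plain lists of floats and `rerank` is any callable that scores a (query, candidate) pair, standing in for the cross-encoder. The 0.7 and 0.3 weights and the top-30 cutoff come from the description above; the min-max rescaling before blending is an assumption of this sketch, added because cosine similarities and reranker scores generally live on different scales. It is not a claim about how TNmap normalizes.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def min_max(scores: list[float]) -> list[float]:
    """Rescale to [0, 1] so two differently scaled scores can be blended (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def retrieve_then_rerank(
    query_vec: Vector,
    query_text: str,
    corpus: list[tuple[str, Vector]],
    rerank: Callable[[str, str], float],
    top_k: int = 30,
    w_rerank: float = 0.7,
) -> list[tuple[str, float]]:
    # Stage 1: cheap cosine similarity over the whole corpus, keep the top_k.
    shortlist = sorted(
        ((cosine(query_vec, vec), text) for text, vec in corpus), reverse=True
    )[:top_k]
    if not shortlist:
        return []
    # Stage 2: the expensive scorer only ever sees the shortlist.
    retriever = min_max([score for score, _ in shortlist])
    reranker = min_max([rerank(query_text, text) for _, text in shortlist])
    blended = [
        (text, w_rerank * r + (1.0 - w_rerank) * c)
        for (_, text), r, c in zip(shortlist, reranker, retriever)
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)
```

The point to notice is the call count: `rerank` runs at most `top_k` times per query regardless of how large the corpus grows.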

The training corpus is the real product

A retrieval system is only as good as what it retrieves over, and the corpus here is fused from four sources: hand-curated query-to-command pairs from actual operator practice, scraped cheat sheets, every .nse script parsed off the local disk for its description, categories and port hints, and the full nmap flag catalogue from the man page. The NSE parsing is the clever bit: it reads the scripts already installed on your machine, so the model knows about the exact scripts you have, and retraining picks up any you install later.
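
As a rough illustration of the NSE step, the sketch below pulls the description, categories and numeric port hints out of a script's source with regular expressions. It assumes the common layout where a script declares `description = [[ ... ]]`, `categories = {...}` and a single-line `portrule = ...`; scripts that use a `hostrule` or a multi-line rule function would simply yield no port hints. The `NseEntry` type and both function names are hypothetical, and TNmap's actual parser may work differently.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path

DESCRIPTION_RE = re.compile(r"description\s*=\s*\[\[(.*?)\]\]", re.DOTALL)
CATEGORIES_RE = re.compile(r"categories\s*=\s*\{(.*?)\}", re.DOTALL)
PORTRULE_RE = re.compile(r"^portrule\s*=\s*(.*)$", re.MULTILINE)

@dataclass
class NseEntry:
    name: str
    description: str = ""
    categories: list[str] = field(default_factory=list)
    port_hints: list[int] = field(default_factory=list)

def parse_nse(name: str, source: str) -> NseEntry:
    """Extract the fields we index from one script's Lua source (best effort)."""
    entry = NseEntry(name=name)
    if m := DESCRIPTION_RE.search(source):
        # Collapse the multi-line Lua long string into a single line.
        entry.description = " ".join(m.group(1).split())
    if m := CATEGORIES_RE.search(source):
        entry.categories = re.findall(r"[\"']([^\"']+)[\"']", m.group(1))
    if m := PORTRULE_RE.search(source):
        # Crude hint: any bare number on the portrule line is treated as a port.
        entry.port_hints = [int(p) for p in re.findall(r"\b\d{1,5}\b", m.group(1))]
    return entry

def load_scripts(script_dir: Path) -> list[NseEntry]:
    """Parse every .nse file in a directory, e.g. the local nmap scripts folder."""
    return [
        parse_nse(path.stem, path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(script_dir.glob("*.nse"))
    ]
```

On many Linux installs the scripts live under /usr/share/nmap/scripts, but that path varies by platform and package, so it is left as a parameter here.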

Each entry gets a couple of synthetic paraphrases, but only for the TF-IDF path. The semantic encoder gets the raw descriptions, because it already generalizes across phrasing and paraphrasing would just add noise to embeddings that do not need it. Knowing which component benefits from data augmentation and which is hurt by it is the kind of detail that separates a stack that works from one that looks like it should.
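
One way to picture the split is two views over the same recipes: the sparse index sees the description plus its paraphrases, the dense encoder sees only the description. The sketch below pairs that routing with a small hand-rolled TF-IDF ranker. The `Recipe` type, the sample recipes and the smoothed IDF formula are all assumptions made for illustration, not TNmap's actual data or weighting.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Recipe:
    command: str
    description: str
    paraphrases: list[str] = field(default_factory=list)

# Hypothetical sample entries; the real corpus is far larger.
RECIPES = [
    Recipe("nmap --script ssl-heartbleed -p 443 {target}",
           "Detect the OpenSSL Heartbleed vulnerability",
           ["check heartbleed", "test for the heartbleed bug"]),
    Recipe("nmap --script smb-enum-shares -p 445 {target}",
           "List SMB shares",
           ["enumerate smb shares", "show windows file shares"]),
]

def sparse_documents(recipes: list[Recipe]) -> list[tuple[str, str]]:
    """TF-IDF view: every description AND paraphrase becomes its own row."""
    return [(text, r.command) for r in recipes for text in [r.description, *r.paraphrases]]

def dense_documents(recipes: list[Recipe]) -> list[tuple[str, str]]:
    """Encoder view: raw descriptions only, no augmentation."""
    return [(r.description, r.command) for r in recipes]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_sparse(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

class TfidfIndex:
    def __init__(self, rows: list[tuple[str, str]]):
        self.rows = rows
        counts = [Counter(tokenize(text)) for text, _ in rows]
        doc_freq = Counter(tok for c in counts for tok in c)
        n = len(counts)
        # Smoothed IDF; terms never seen in the corpus get weight 0 at query time.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts: Counter) -> dict[str, float]:
        return {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q = self._weigh(Counter(tokenize(query)))
        best: dict[str, float] = {}
        for vec, (_, command) in zip(self.vectors, self.rows):
            score = cosine_sparse(q, vec)
            if score > best.get(command, 0.0):  # keep each command's best-matching row
                best[command] = score
        return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Because TF-IDF can only match tokens it has literally seen, each paraphrase row widens the vocabulary a command can be reached by, which is exactly the help the sparse path needs and the dense path does not.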

Graceful degradation instead of a loading spinner

Sentence-transformer models take a moment to warm up. Rather than block the UI, TNmap serves TF-IDF suggestions instantly while the semantic encoder loads in the background, then transparently flips to dense retrieval once it is ready. The operator never waits on a blank pane. Suggestions are debounced to fire about 250 ms after you stop typing, so editing the command line stays smooth instead of recomputing on every keystroke.
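
The fallback-then-swap behaviour and the debounce can both be sketched with the standard library's threading module. `HotSwapSearcher` and `Debouncer` are hypothetical names, the searchers are plain callables standing in for the TF-IDF and dense paths, and a real TUI would also need to hand results back to its UI thread, which this sketch leaves out.

```python
import threading
from typing import Callable, Optional

Searcher = Callable[[str], list[str]]

class HotSwapSearcher:
    """Serve a fast fallback until a slow-loading primary searcher is ready."""

    def __init__(self, fallback: Searcher, load_primary: Callable[[], Searcher]):
        self._fallback = fallback
        self._primary: Optional[Searcher] = None
        self._lock = threading.Lock()
        self.ready = threading.Event()
        # Load the heavy model off the main thread so the UI never blocks.
        threading.Thread(target=self._load, args=(load_primary,), daemon=True).start()

    def _load(self, load_primary: Callable[[], Searcher]) -> None:
        try:
            primary = load_primary()  # slow: model warm-up happens here
        except Exception:
            return  # loading failed: keep serving the fallback
        with self._lock:
            self._primary = primary
        self.ready.set()

    def search(self, query: str) -> list[str]:
        with self._lock:
            searcher = self._primary or self._fallback
        return searcher(query)

class Debouncer:
    """Run `action` only after `delay_s` seconds have passed without a new trigger."""

    def __init__(self, delay_s: float, action: Callable[[str], None]):
        self._delay = delay_s
        self._action = action
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def trigger(self, text: str) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a newer keystroke supersedes the pending call
            self._timer = threading.Timer(self._delay, self._action, args=(text,))
            self._timer.daemon = True
            self._timer.start()
```

With a delay of 0.25 seconds, calling `trigger` on every keystroke means the suggestion callback fires once, about 250 ms after the last one.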

Safety lives in the seams

A tool that runs nmap is a tool that can scan things you are not authorized to scan, so the mapping has to fail toward correct flags, not creative ones. Two design choices help. First, suggestions are retrieved from a curated corpus of real invocations, not generated free-form by a model that might hallucinate a flag that does not exist or a script that does the wrong thing. Retrieval can only return commands that were already vetted into the corpus. Second, the target is a separate field, substituted into a {target} placeholder at run time, so intent and target stay decoupled and the suggested command is about what to do, never who to do it to.
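
A minimal sketch of the placeholder substitution, assuming the command template is a single string containing exactly one `{target}` token. The function name `build_argv` and the specific validation rules are illustrative choices for this sketch, not TNmap's documented behaviour.

```python
import shlex

PLACEHOLDER = "{target}"

def build_argv(template: str, target: str) -> list[str]:
    """Turn a vetted command template plus a target into an argument list."""
    target = target.strip()
    # Reject anything that could be read as an extra flag or extra arguments.
    if not target or target.startswith("-") or any(ch.isspace() for ch in target):
        raise ValueError(f"refusing suspicious target: {target!r}")
    argv = shlex.split(template)
    if argv.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain exactly one {target} argument")
    # Substitute on whole arguments, so the target can never merge into a flag.
    return [target if arg == PLACEHOLDER else arg for arg in argv]

# 192.0.2.10 is in a range reserved for documentation examples.
print(build_argv("nmap --script ssl-heartbleed -p 443 {target}", "192.0.2.10"))
```

Building a list rather than a string means the result can be handed to `subprocess.run` without `shell=True`, so a target value is never interpreted by a shell.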

Run it only against targets you have written authorization to test. The corpus maps intent to commands, not to permission, and that distinction is on you.