# 43x Faster: Rewriting maldet's Scan Engine with Batch Parallel Processing
Linux Malware Detect v1.6.6 processed files sequentially through each detection stage. The architecture worked well for years, but each file required its own process forks for hex extraction and pattern matching. On a web server with 10,000 PHP files and dozens of hex signatures checked per file, that added up to roughly 500,000 subprocess forks per scan. A full scan took 20 minutes.
The v2.0 rewrite brought that down to 28 seconds, a 43x improvement, without adding a single external dependency. The entire engine still runs on bash, grep, awk, and od, tools that ship with every Linux distribution going back to CentOS 6. This post walks through how we did it.
## The Fork Storm
The old scan loop was straightforward. For each file in the scan queue, extract its hex content with od, then iterate every signature and run grep to check for a match:
```bash
# v1.6.6 per-file scanning (simplified)
for file in $scan_queue; do
    hex=$(od -tx1 "$file" | tr -d ' \n')
    for sig in $hex_signatures; do
        if echo "$hex" | grep -q "$sig"; then
            record_hit "$file" "$sig"
        fi
    done
done
```

Every `od` call forks a process. Every `echo | grep` pipeline forks at least one more. With 10,000 files and 50 hex patterns, that is 510,000 process forks just for the hex stage, and since each pipeline actually spawns two children (a subshell for `echo` plus `grep`), the real count is higher still. Add MD5 hashing and hit recording, and a full scan easily exceeds one million forks. On a shared hosting server under load, each fork competes for scheduler time and page cache, and the scan crawls.
The fix was not to switch languages. Rewriting in Python or Go would abandon the portability that makes maldet deployable on any Linux server without a runtime. The fix was to restructure the work so shell primitives do what they do best: process streams in bulk.
## Batch Architecture
The v2.0 engine processes files through five ordered stages. Each stage operates on the full file set (minus files already quarantined by earlier stages), using parallel batch workers that divide work via round-robin chunking.
Each stage quarantines its hits before the next stage begins. By the time the expensive stages run (YARA, string analysis), the working set has already been reduced by hash and hex matches. We call this progressive depletion. Earlier, cheaper stages shrink the input for later, expensive ones.
Workers are dispatched in parallel via background subshells with PID tracking. Auto-detection scales to 2 workers per CPU core, capped at 8, with a fallback chain: nproc → /proc/cpuinfo → sysctl hw.ncpu → 1. File lists are split via round-robin awk so large and small files interleave evenly across workers.
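The split-and-dispatch pattern can be sketched in a few lines. This is an illustration of the technique, not the engine's actual code; the worker count and file names are made up:

```shell
#!/usr/bin/env bash
# Round-robin split: line i of the file list goes to chunk ((i-1) mod N),
# so large and small files interleave across workers instead of one
# worker drawing a contiguous, possibly pathological, slice.
tmp=$(mktemp -d)
nworkers=4
printf '%s\n' file{01..10}.php > "$tmp/filelist"
awk -v n="$nworkers" -v prefix="$tmp/chunk." \
    '{ print > (prefix ((NR - 1) % n)) }' "$tmp/filelist"

# Dispatch one background worker per chunk, tracking PIDs so the
# parent can wait for (and check) each worker individually.
pids=()
for ((i = 0; i < nworkers; i++)); do
    wc -l < "$tmp/chunk.$i" &    # stand-in for the real scan worker
    pids+=($!)
done
for pid in "${pids[@]}"; do wait "$pid"; done
```

With 10 files and 4 workers, chunks 0 and 1 get three files each and chunks 2 and 3 get two, so no worker sits idle while another grinds through an unlucky slice.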
## Hash Stage: One Fork, 37K Signatures
The old hash scan forked md5sum once per file, then searched the signature list per hash. For 10,000 files that was 10,000 forks just to compute hashes, plus 10,000 lookups. The new approach forks once:
```bash
# Batch hash: one xargs fork for the entire chunk
xargs -d '\n' "$md5sum" < "$chunk" > "$hash_out"

# Single awk pass: preload 37K signatures into O(1) array
awk -v sigfile="$sigfile" '
BEGIN {
    while ((getline line < sigfile) > 0) {
        split(line, f, ":")
        sigs[f[1]] = f[3]   # sigs[HASH] = SIGNAME
    }
}
{
    hash = $1
    if (hash in sigs) {
        fpath = substr($0, length(hash) + 3)
        print fpath "\t" sigs[hash]
    }
}' "$hash_out"
```

One xargs fork hashes all 1,200+ files in the worker's chunk. One awk process loads the entire signature database into an associative array in its BEGIN block, then streams through the hash output doing O(1) lookups. Two forks total, regardless of file count.
SHA-256 scanning adds hardware acceleration detection. The engine checks for SHA-NI on x86 and SHA2 extensions on ARM at startup, selecting the fastest available hash binary. But the architecture is identical: one xargs, one awk.
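The post doesn't show the detection code, but the idea can be sketched by checking CPU feature flags before choosing a hash path. The flag names `sha_ni` (x86) and `sha2` (ARM) are the real `/proc/cpuinfo` identifiers; the function name and return values here are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify SHA-256 support from CPU feature flags.
# On x86, SHA-NI appears as "sha_ni" on the "flags" line of /proc/cpuinfo;
# on ARM, the SHA2 extension appears as "sha2" on the "Features" line.
detect_sha_accel() {
    local flags=""
    if [ -r /proc/cpuinfo ]; then
        flags=$(grep -m1 -E '^(flags|Features)' /proc/cpuinfo)
    fi
    case " $flags " in
        *" sha_ni "*|*" sha2 "*) echo "accelerated" ;;
        *)                       echo "generic" ;;
    esac
}
echo "sha256 path: $(detect_sha_accel)"
```

The real engine would use a result like this to select between hash binaries at startup, paying the detection cost once per scan rather than per file.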
## HEX+CSIG: Merged Extraction
The hex and compound signature stages share the same input: a hex dump of each file's content. In a naive approach, hex extraction would happen separately for HEX patterns and for CSIG rules. The v2.0 engine extracts once and runs both matchers against the same buffer.
The merged _hex_csig_batch_worker() operates in three phases:
**Phase 1: Batch Hex Extraction**
Each file in the worker's chunk is hex-dumped via od -v -tx1 and appended to a single batch buffer, one line per file. An indexed array maps line numbers back to file paths.
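A stripped-down sketch of this phase, with invented file contents; `-An` suppresses od's offset column so the buffer holds pure hex:

```shell
#!/usr/bin/env bash
# Sketch of Phase 1: one hex line per file in a shared batch buffer,
# plus an indexed array mapping buffer line numbers back to paths.
tmp=$(mktemp -d)
printf 'eval(' > "$tmp/a.php"      # stand-in "suspicious" content
printf 'hello' > "$tmp/b.txt"      # stand-in benign content

batch="$tmp/batch_hex"
: > "$batch"
declare -a _names=()               # _names[lineno] = file path
lineno=0
for f in "$tmp/a.php" "$tmp/b.txt"; do
    lineno=$((lineno + 1))
    _names[$lineno]=$f
    # Dump every byte as hex, strip spacing, append as a single line
    od -v -An -tx1 "$f" | tr -d ' \n' >> "$batch"
    printf '\n' >> "$batch"
done
```

After this loop, line 1 of the buffer is the hex of `a.php` (`6576616c28`, the bytes of `eval(`), line 2 is `b.txt`, and `_names` answers "which file is buffer line N?" in O(1).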
**Phase 2: Literal Pattern Matching (Aho-Corasick)**
This is where the real leverage comes from. GNU grep -F does not do a naive string search. It builds an Aho-Corasick automaton from the pattern file, then scans the input in a single pass. One call to grep -Fno -f hex_literals batch_hex matches all 2,000+ literal hex patterns against all files simultaneously. That is one fork replacing what was previously 2,000 × N forks.
```bash
# One grep call matches all literal hex patterns across all files
grep -Fno -f "$hex_literals" "$batch_hex" | \
    awk -F: '!seen[$1]++ { print $1, $2 }'
```

Wildcard patterns (those containing regex metacharacters) fall back to per-pattern `grep -En` calls, but there are typically only ~50 of these versus 2,000+ literals.
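Running the same pipeline on toy data shows the mechanics; the buffer contents and pattern list here are illustrative:

```shell
#!/usr/bin/env bash
# Line 1 of the batch buffer contains the hex of "eval(" (6576616c28);
# line 2 is benign. The pattern file holds two literal hex signatures.
tmp=$(mktemp -d)
printf '%s\n' 6576616c2862617365 68656c6c6f > "$tmp/batch_hex"
printf '%s\n' 6576616c28 deadbeef           > "$tmp/hex_literals"

# -F: fixed strings (automaton match), -n: line number, -o: matched text.
# The awk filter keeps only the first hit per buffer line.
grep -Fno -f "$tmp/hex_literals" "$tmp/batch_hex" |
    awk -F: '!seen[$1]++ { print $1, $2 }'
```

The output pairs a buffer line number with the pattern that hit it; the worker then translates the line number back to a file path via the index array from Phase 1.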
**Phase 3: CSIG Rule Evaluation**
Compound signatures reuse the same batch hex buffer. The CSIG engine runs its own three-tier matching (literals, wildcards, universals), then evaluates boolean rules via set operations on the match results. Because extraction already happened in Phase 1, the CSIG stage adds near-zero I/O overhead. See our companion article on compound signatures for the full engine walkthrough.
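The companion article covers the full engine, but the set-operation idea can be shown on toy data. Here two subsignature match lists (buffer line numbers of matching files; SIDs and values invented) are intersected to evaluate a hypothetical `SID100 AND SID200` rule:

```shell
#!/usr/bin/env bash
# Each subsignature's matches are a sorted list of batch-buffer line
# numbers. An AND rule is then a set intersection (comm -12); an OR
# rule would be a set union (sort -u across both lists).
tmp=$(mktemp -d)
printf '%s\n' 1 3 7 > "$tmp/SID100.matches"
printf '%s\n' 3 7 9 > "$tmp/SID200.matches"

# Files satisfying "SID100 AND SID200": present in both lists
comm -12 "$tmp/SID100.matches" "$tmp/SID200.matches"
```

`comm` requires sorted input, which the match lists already are; the intersection (lines 3 and 7) falls out in one pass with no per-file forking.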
## The Sigmap Cache
In v1.6.6, resolving a pattern match back to a signature name required forking awk to search the sigmap file, once per hit. With 1,000+ hits per scan, that was 1,000+ unnecessary forks.
The v2.0 workers preload the entire signature map into a bash associative array at startup:
```bash
# Preload at worker start — O(1) lookups, zero forks
local -A _sigmap_cache=()
while IFS=$'\t' read -r _pat _name; do
    _sigmap_cache["$_pat"]="$_name"
done < "$hex_sigmap"

# Later, during hit processing:
hit_name=${_sigmap_cache["$hit_pattern"]:-}
if [ -n "$hit_name" ]; then
    printf '%s\t%s\n' "${_names[$fnum]}" "$hit_name"
fi
```

The same pattern is used for CSIG SID match resolution. During rule evaluation, rather than forking `grep -qFx` to check whether a subsignature matched a given file, the engine preloads all match files into a `_loaded_sids[SID:linenum]` array. What was 40,000 grep forks across a scan becomes zero, replaced by O(1) bash array lookups.
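A sketch of that preload, with the array name taken from the post but the file layout and contents invented for illustration:

```shell
#!/usr/bin/env bash
# Preload every SID:lineno match pair once, then answer "did sub-sig X
# match buffer line Y?" with an array lookup instead of a grep fork.
tmp=$(mktemp -d)
printf '%s\n' 1 4 > "$tmp/SID100.matches"
printf '%s\n' 4   > "$tmp/SID200.matches"

declare -A _loaded_sids=()
for mf in "$tmp"/*.matches; do
    sid=${mf##*/}; sid=${sid%.matches}   # "SID100" from ".../SID100.matches"
    while read -r line; do
        _loaded_sids["$sid:$line"]=1
    done < "$mf"
done

# Was: grep -qFx "$line" "$matchfile"  (one fork per check)
if [[ -n ${_loaded_sids["SID100:4"]:-} ]]; then
    echo "SID100 matched buffer line 4"
fi
```

The parameter expansions replacing `basename` keep even the preload itself fork-free; after it runs, every membership test is a hash lookup inside the worker process.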
## Bulk Hit Processing
The fork reduction extends past scanning into hit handling. After workers finish, their output merges into a single manifest. Then three bulk operations run in sequence:
1. One `xargs stat --printf` call gathers owner, group, mode, size, and timestamps for all hit files. Previously N forks.
2. One `xargs md5sum` call computes hashes for files that need them (hex/CSIG hits don't already have a hash). Previously N forks.
3. One `xargs chmod 000` and one `xargs chown root:root` lock down all quarantined files. Previously 2N forks.

For 1,000 quarantined files, this reduces hit-processing forks from ~4,000 to ~5.
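The shape of those bulk calls, sketched on throwaway files; the `chown root:root` step is omitted since it needs root, and paths here are illustrative:

```shell
#!/usr/bin/env bash
# Bulk hit processing: one fork per operation, not one per file.
tmp=$(mktemp -d)
touch "$tmp/hit1.php" "$tmp/hit2.php" "$tmp/hit3.php"
printf '%s\n' "$tmp"/hit*.php > "$tmp/manifest"

# One stat fork gathers name, owner, mode, and size for every hit file
xargs -d '\n' stat --printf '%n\t%U\t%a\t%s\n' < "$tmp/manifest"

# One chmod fork locks down every quarantined file at once
xargs -d '\n' chmod 000 < "$tmp/manifest"
```

`xargs -d '\n'` treats each manifest line as one argument, so paths with spaces survive; both `-d` and `stat --printf` are GNU extensions, which is fine for a Linux-only tool.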
## Benchmarks
We benchmark maldet against ClamAV on a corpus of 6,002 real-world malware samples in Docker isolation with parallel set to 1 (single worker) to measure per-engine efficiency:
| Scanner | Detect | Rate | Time | Files/s | Memory |
|---|---|---|---|---|---|
| ClamAV | 1,045 | 17.4% | 69s | 86 | 998 MB |
| ClamAV (clamd) | 1,045 | 17.4% | 17s | 353 | 1,060 MB |
| maldet 2.x | 2,206 | 36.7% | 13s | 461 | 44 MB |
sf benchmark · 6,002 samples · Docker isolation · parallel 1
The memory difference is the starkest number. ClamAV loads its signature database into a resident daemon, 998 MB before it scans a single file. maldet's native engine streams signatures through grep and awk processes that exist only for the duration of each stage. Peak memory is 44 MB. On a 1 GB VPS, the kind most shared hosting customers run, that difference is the difference between “scan runs” and “OOM killer fires.”
At full parallelism (8 workers on a 4-core host), the same 9,931-file scan that took v1.6.6 twenty minutes completes in 28 seconds with identical hit counts.
## Zero Dependencies, Maximum Portability
The native engine uses exactly these tools:
| Tool | Role |
|---|---|
| bash | Worker dispatch, associative arrays, control flow |
| grep | Aho-Corasick literal match (-F), ERE wildcards (-E) |
| awk | Signature preload, fan-out, join operations |
| od | Binary-to-hex extraction |
| xargs | Batch process launching (hash, stat, chmod) |
| md5sum / sha256sum | Hash computation with HW accel detection |
| sort, uniq, cut, tr | Set operations, string manipulation |
Every one of these ships in the base install of every Linux distribution we support, from CentOS 6 (2011) through Ubuntu 24.04. No package manager needed. No runtime to install. Copy the files, run the scanner. The `command` prefix on all coreutils calls ensures portable PATH resolution even on pre-usr-merge systems where /bin and /usr/bin are separate.
ClamAV and YARA remain available as optional supplementary engines (stages 3 and 4) when installed. But the core detection pipeline (hash matching, hex patterns, compound signatures) runs entirely native. On a minimal server with nothing but coreutils and bash, maldet still scans at full capability.
## Conclusion
The lesson is not that bash is fast. It is not. The lesson is that the tools bash orchestrates (grep, awk, xargs) are remarkably fast when you stop fighting their design. Per-file loops with per-pattern forks fight the design. Batch extraction with stream-oriented matching works with it.
The resulting engine detects twice as many threats as ClamAV, runs five times faster, uses twenty-two times less memory, and deploys with zero dependencies on any Linux server shipped in the last fifteen years. We think that trade-off is worth the engineering investment.
maldet 2.x is currently in active development on the 2.x branches of the project and is expected to release in the coming weeks. The batch engine implementation lives in files/internals/lmd_engine.sh and files/internals/lmd_scan.sh. The project is open source under GPLv2.