
Compound Signatures: Building a Boolean Detection Language in Bash

Ryan MacDonald

Userspace and web-based threats like PHP backdoors, webshells, and obfuscated uploaders rarely present a single unique byte string. A backdoor might combine eval(gzinflate( with str_rot13( with a variable-length obfuscation layer. Any one pattern alone produces false positives. Detecting with confidence requires boolean logic: “match if this file contains eval() AND base64_decode() AND any two of these five obfuscation patterns.”

Both ClamAV (via .ldb) and YARA can express this, but ClamAV brings a ~1 GB resident daemon and YARA requires a compiled binary. Linux Malware Detect deploys across shared hosting, dedicated servers, VPS instances, embedded systems, deep legacy platforms on EL6 and earlier, and compliance-locked architectures where installing third-party binaries is not permitted. On many of these systems, neither ClamAV nor YARA is available. Compound signatures (CSIG) bring the same boolean expressiveness to the native engine using only grep, awk, sort, and uniq.

Signature Anatomy

A CSIG rule is a single line in csig.dat. The format combines ClamAV-compatible hex wildcards with boolean logic operators:

text
# Format: SUBSIGS:SIGNAME[;THRESHOLD]
# Subsigs separated by || (top-level OR boundary)

# Single pattern — behaves like a standard hex signature
6576616c28:{CSIG}php.eval.generic

# AND — all subsigs must match
6576616c28||6261736536345f6465636f6465:{CSIG}php.eval.b64

# OR with threshold — at least 2 of 3 must match
6576616c28||677a696e666c61746528||7374725f726f74313328:{CSIG}php.obfusc.multi;2

# Grouped — (A OR B) AND C
(6576616c28||617373657274);1||6261736536345f6465636f6465:{CSIG}php.assert.b64

Each hex string represents raw bytes. For example, 6576616c28 is the ASCII encoding of eval(. The format supports the full ClamAV wildcard vocabulary:

Token    Meaning                       Example
??       Any single byte               61??63 → a.c
*        Variable-length gap           61*63 → a...c
{N-M}    Bounded gap (N to M bytes)    61{2-5}63 → a[2-5 bytes]c
(a|b)    Alternation                   (6162|6364) → ab or cd
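Writing subsignatures by hand means converting strings to hex first. A minimal helper for that might look like the following; the name to_hex is hypothetical and not part of maldet, and the sketch assumes only od and tr are available:

```shell
# Hypothetical helper (not part of maldet): print the lowercase hex
# encoding of a string, ready to paste into a csig.dat subsignature.
to_hex() {
    # -An drops the offset column, -v disables "*" line folding,
    # -tx1 prints one byte per hex pair; tr strips the separators.
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Usage: to_hex 'eval('    prints 6576616c28
```

Wildcard tokens from the table are then spliced between the hex fragments by hand.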

Boolean Logic Engine

The compiler classifies each rule into one of four types based on its structure:

Single: one subsignature, no grouping. Equivalent to a standard hex rule.
AND: multiple subsignatures, all must match. “This file contains eval() AND base64_decode().”
OR with threshold: multiple subsignatures with a ;N suffix meaning “at least N of these must match.” Useful for families with variant obfuscation layers.
Grouped: parenthesized OR groups combined with AND. Enables expressions like “(eval OR assert) AND base64_decode.”

At runtime, AND rules use set intersection: build a candidate set from the first subsig's matches, then keep only the candidates that also matched every subsequent subsig. OR rules use sort -n | uniq -c to count how many subsigs matched each file and filter by threshold. Grouped rules combine both: OR sub-groups count matches internally, then the outer AND requires all groups to pass.
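The two set operations can be sketched with standard tools. This is an illustration, not maldet's code: the function names are hypothetical, and it assumes each subsignature's matches are stored in a file of line numbers, one per line.

```shell
# Hypothetical helpers, not maldet's code. Each file argument holds the
# line numbers (one per line) where one subsignature matched.

# OR with threshold: print line numbers hit by at least THRESH subsigs.
csig_or_threshold() {
    local thresh="$1" f
    shift
    for f in "$@"; do
        sort -n -u "$f"      # dedupe so each subsig counts once per line
    done | sort -n | uniq -c | awk -v t="$thresh" '$1 >= t { print $2 }'
}

# AND: seed candidates from the first subsig, then keep only those
# that every later subsig also matched.
csig_and() {
    local cand f
    cand="$(sort -n -u "$1")"
    shift
    for f in "$@"; do
        cand="$(printf '%s\n' "$cand" |
            awk 'NR == FNR { c[$0] = 1; next } ($0 in c) && !seen[$0]++' - "$f")"
    done
    [ -z "$cand" ] || printf '%s\n' "$cand" | sort -n
}
```

With match files containing 1 2 3, 2 3 5, and 3 2 7, both csig_and and csig_or_threshold 2 print 2 and 3.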

Rules are evaluated in source order with first-match-wins semantics. Once a file is matched by a rule, it is removed from consideration for subsequent rules. This preserves the behavior analysts expect from ordered signature files and prevents duplicate hit reporting.
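First-match-wins reduces to keeping the first hit per file. A one-line sketch, assuming hits arrive as hypothetical "RULE<tab>LINENUM" records already in rule source order:

```shell
# Hypothetical sketch: stdin carries "RULE<TAB>LINENUM" hits in rule
# source order. Keep only the first rule that claimed each line number.
csig_first_match_wins() {
    awk -F '\t' '!claimed[$2]++'
}
```

Later rules that hit an already claimed file are dropped, which is what prevents duplicate reporting.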

Pattern Modifiers

Raw hex matching misses two common malware patterns: case-manipulated function names and UTF-16LE encoded strings. CSIG handles both at compile time, not runtime.

Case-Insensitive: the i: Prefix

PHP is case-insensitive for function names. EvAl( is as dangerous as eval(, and attackers exploit this to evade byte-exact signatures. The i: prefix folds case at compile time by expanding each ASCII letter byte into an ERE alternation:

text
# Source signature
i:6576616c28:{CSIG}php.eval.caseblind

# After compilation — each letter byte becomes (upper|lower)
# 65='e' 76='v' 61='a' 6c='l' 28='('
# Compiled ERE: (65|45)(76|56)(61|41)(6c|4c)28
# Matches: eval(  EVAL(  eVaL(  and all 16 case variants

The case folding happens once during signature compilation. At scan time, the resulting ERE is just another pattern. No runtime flags, no special handling.
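The expansion itself is small. The sketch below is an approximation rather than the compiler's actual code: csig_fold_case is a hypothetical name, and it assumes plain hex input with no wildcard tokens.

```shell
# Hypothetical sketch of the i: expansion. Assumes plain hex input with
# no wildcards; this is not maldet's actual compiler code.
csig_fold_case() {
    awk -v hex="$1" 'BEGIN {
        hex = tolower(hex)
        out = ""
        for (i = 1; i <= length(hex); i += 2) {
            hi = substr(hex, i, 1)
            lo = substr(hex, i + 1, 1)
            b  = hi lo
            if (b ~ /^(6[1-9a-f]|7[0-9a])$/) {
                # a-z (0x61-0x7a): the uppercase twin is 0x20 lower
                out = out "(" b "|" (hi == "6" ? "4" : "5") lo ")"
            } else if (b ~ /^(4[1-9a-f]|5[0-9a])$/) {
                # A-Z (0x41-0x5a): the lowercase twin is 0x20 higher
                out = out "(" (hi == "4" ? "6" : "7") lo "|" b ")"
            } else {
                out = out b          # non-letter bytes pass through
            }
        }
        print out
    }'
}

# Usage: csig_fold_case 6576616c28
# prints (65|45)(76|56)(61|41)(6c|4c)28
```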

Wide Matching: the w: Prefix

Malware that targets both Windows and web platforms sometimes embeds strings in UTF-16LE encoding, where each ASCII byte is followed by a null byte. The w: prefix interleaves null bytes at compile time:

text
# Source: w:6576616c  (ASCII "eval")
# Compiled: 6500760061006c00
# Matches UTF-16LE: e\x00v\x00a\x00l\x00

# Combined: iw:6576616c  (case-insensitive wide)
# Compiled: (65|45)00(76|56)00(61|41)00(6c|4c)00

The iw: and wi: prefixes combine both transformations, catching case-variant strings in UTF-16LE encoded content. Order does not matter since the compiler normalizes either form.
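Null interleaving on plain hex is a single substitution. A sketch, with csig_widen as a hypothetical name and the assumption that the input is plain hex with an even number of digits:

```shell
# Hypothetical sketch of the w: expansion: append a null byte (00) after
# every byte. Assumes plain hex input with an even number of digits.
csig_widen() {
    printf '%s\n' "$1" | sed 's/../&00/g'
}

# Usage: csig_widen 6576616c    prints 6500760061006c00
```

Because it treats every two characters as one byte, a sketch like this would have to run before any case-fold expansion introduces parentheses.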

Bounded Gaps

The {N-M} token matches a variable-length gap of N to M bytes. The compiler doubles the bounds (since each byte is two hex characters) and emits a quantifier:

text
# Source: 6576616c28{3-10}29
# Meaning: "eval(" then 3-10 bytes then ")"
# Compiled ERE: 6576616c28[0-9a-f]{6,20}29

This is essential for matching function calls where the argument length varies but the enclosing structure is fixed.
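A token-by-token translation along these lines could be sketched as below. Only the {N-M} output form is taken from the example above; the ERE forms chosen for ?? and * are assumptions, and csig_wild_to_ere is a hypothetical name.

```shell
# Hypothetical sketch of wildcard-to-ERE translation. The {N-M} form
# follows the article; the ?? and * forms are assumptions.
csig_wild_to_ere() {
    awk -v p="$1" 'BEGIN {
        out = ""
        while (length(p) > 0) {
            if (substr(p, 1, 2) == "??") {
                out = out "[0-9a-f]{2}"           # one byte = two hex digits
                p = substr(p, 3)
            } else if (substr(p, 1, 1) == "*") {
                out = out "[0-9a-f]*"             # unbounded gap
                p = substr(p, 2)
            } else if (match(p, /^[{][0-9]+-[0-9]+[}]/)) {
                split(substr(p, 2, RLENGTH - 2), b, "-")
                out = out "[0-9a-f]{" (b[1] * 2) "," (b[2] * 2) "}"
                p = substr(p, RLENGTH + 1)
            } else {
                out = out substr(p, 1, 1)         # hex digits and (a|b) pass through
                p = substr(p, 2)
            }
        }
        print out
    }'
}

# Usage: csig_wild_to_ere '6576616c28{3-10}29'
# prints 6576616c28[0-9a-f]{6,20}29
```

Alternation needs no rewriting because (a|b) is already valid ERE syntax.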

The AWK Compiler

The compiler is a ~130-line AWK program in _csig_compile_rules() that transforms raw csig.dat rules into four output files optimized for batch matching:

[Figure: CSIG compilation pipeline. csig.dat (raw rules: hex subsigs, i:/w: prefixes, ?? * {N-M} (a|b) wildcards) → AWK compiler (wildcard → ERE, case fold / wide, SID dedup) → four outputs: LITERALS (SID → pattern, grep -F, batch Aho-Corasick target), WILDCARDS (SID → ERE, grep -E, per-pattern regex pass), UNIVERSALS (SID, < 8 hex, always-match, skip grep), BATCH RULES (NAME TYPE THRESH SPEC, SID refs + group encoding) → batch matcher (T1: grep -Fno literals; T2: grep -En wildcards; T3: universals, skipped) → rule evaluation (set ops + threshold, first-match-wins). Compile once · 3-tier pattern dispatch · zero forks in rule eval.]

The critical optimization is SID deduplication. Each unique compiled pattern (after wildcard expansion, case folding, and wide interleaving) is assigned exactly one subsignature ID (SID). If seven rules share the eval( pattern, it gets one SID. One grep match instead of seven.

awk
# SID deduplication in the AWK compiler
ere = compile_subsig(raw_pattern)
if (ere in sid_map) {
    sid = sid_map[ere]         # Reuse existing SID
} else {
    sid = next_sid++
    sid_map[ere] = sid         # Assign new SID
    write_to_tier(sid, ere)    # Classify: literal/wildcard/universal
}
# Multiple rules reference the same SID — one grep serves all

The batch rule file encodes each rule as SIGNAME<tab>TYPE<tab>THRESHOLD<tab>SPEC, where SPEC is a comma-separated list of SID references. Grouped OR sub-expressions use the notation or:THRESHOLD:SID+SID+.... The entire compilation runs in ~1.4 seconds for 45,000+ signatures.
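To make the SID bookkeeping concrete, here is a reduced sketch that handles only flat rules (no groups, prefixes, or wildcards). It is not the real _csig_compile_rules(): the function name, the SID/RULE record prefixes, and the TYPE labels single/and/or are all invented for illustration.

```shell
# Hypothetical sketch, not the real _csig_compile_rules(). Reads flat
# rules ("SUB||SUB:NAME[;THRESH]") on stdin. The SID/RULE prefixes and
# the TYPE labels (single/and/or) are invented for illustration.
csig_compile_sketch() {
    awk '
    BEGIN { OFS = "\t" }
    /^#/ || /^$/ { next }                  # skip comments and blank lines
    {
        cut  = index($0, ":")
        subs = substr($0, 1, cut - 1)
        name = substr($0, cut + 1)
        thresh = 0
        semi = index(name, ";")
        if (semi > 0) {
            thresh = substr(name, semi + 1) + 0
            name   = substr(name, 1, semi - 1)
        }
        n = split(subs, part, /[|][|]/)
        spec = ""
        for (i = 1; i <= n; i++) {
            if (!(part[i] in sid_map)) {
                sid_map[part[i]] = ++next_sid    # first sighting: new SID
                print "SID", sid_map[part[i]], part[i]
            }
            spec = spec (i > 1 ? "," : "") sid_map[part[i]]
        }
        type = (n == 1) ? "single" : (thresh > 0 ? "or" : "and")
        print "RULE", name, type, thresh, spec
    }'
}
```

Fed the first three example rules from the Signature Anatomy section, this emits four SID records for seven subsignature references, and the eval( pattern is SID 1 in all three rules.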

Three-Tier Batch Matching

At scan time, the CSIG engine runs inside the merged hex+CSIG batch worker, reusing the hex buffer already extracted for HEX signature matching. Pattern matching proceeds in three tiers, ordered by cost:

Tier 1: Literals (One grep -F Call)

All literal CSIG patterns are fed to a single grep -Fno call. GNU grep builds an Aho-Corasick automaton from the pattern file and scans the batch hex buffer in one pass. An awk post-processor fans out results: for each match, it writes the file's line number to the matched SID's file in a temporary directory.
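A sketch of that pass follows. The file layouts are assumptions made for the example (a tab-separated SID table and a hex buffer with one scanned file per line), and csig_tier1 is a hypothetical name.

```shell
# Hypothetical sketch of the tier 1 pass; file layouts are assumptions.
#   $1  SID table: "SID<TAB>literal hex pattern" per line
#   $2  batch hex buffer: one scanned file per line, as lowercase hex
#   $3  output directory: one file per SID, listing matching line numbers
csig_tier1() {
    cut -f2 "$1" |
        grep -Fno -f - "$2" |
        awk -F: -v tbl="$1" -v dir="$3" '
            BEGIN {
                # Map each literal pattern back to its SID.
                while ((getline row < tbl) > 0) {
                    split(row, kv, "\t")
                    sid[kv[2]] = kv[1]
                }
            }
            # grep -no emits "LINENUM:MATCH"; record each pair once.
            !seen[$1, $2]++ {
                out = dir "/" sid[$2]
                print $1 >> out
                close(out)
            }'
}
```

However many literals the table holds, the hex buffer is read by a single grep process.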

Tier 2: Wildcards (Per-Pattern grep -E)

Patterns containing regex metacharacters (from ??, *, {N-M}, or i:/w: expansion) each get their own grep -En call against the batch hex buffer. There are typically only 6-10 wildcard patterns in a CSIG rule set, so the overhead is minimal.
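The per-pattern loop could look roughly like this, under assumed layouts: a tab-separated table of SID and compiled ERE, a hex buffer with one scanned file per line, and an output directory with one file per SID. csig_tier2 is a hypothetical name.

```shell
# Hypothetical sketch of the tier 2 pass; file layouts are assumptions.
#   $1  wildcard table: "SID<TAB>compiled ERE" per line
#   $2  batch hex buffer, one scanned file per line
#   $3  output directory, one file of line numbers per SID
csig_tier2() {
    local tab sid ere
    tab="$(printf '\t')"
    while IFS="$tab" read -r sid ere; do
        # grep -n prefixes each matching line with "LINENUM:".
        grep -En -e "$ere" "$2" | cut -d: -f1 > "$3/$sid"
    done < "$1"
}
```

Each wildcard costs one grep process here, which is only acceptable because the wildcard set is small.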

Tier 3: Universals (No grep)

Subsignatures shorter than 8 hex characters (4 bytes) would match nearly everything. Rather than waste a grep call, the compiler marks these as universal. They are treated as always-matching during rule evaluation.
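The three-way dispatch can be illustrated with a small classifier. The order of the checks and the test for "contains regex metacharacters" are assumptions of this sketch, not the compiler's logic, and csig_tier is a hypothetical name.

```shell
# Hypothetical classifier for one compiled subsignature. The order of
# the checks and the metacharacter test are assumptions.
csig_tier() {
    case "$1" in
        *[!0-9a-f]*)
            echo wildcard            # anything beyond plain hex is ERE syntax
            ;;
        *)
            if [ "${#1}" -lt 8 ]; then
                echo universal       # under 4 bytes: treated as always matching
            else
                echo literal
            fi
            ;;
    esac
}
```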

Rule Evaluation: Zero Forks

After all three tiers complete, the engine preloads every SID's match results into a single bash associative array:

bash
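# Preload all SID match files into memory.
# `local` is only valid in function scope, so this runs inside a function.
local -A _loaded_sids=()
local sid_file sid_name linenum
for sid_file in "$match_dir"/*; do
    [ -f "$sid_file" ] || continue   # empty dir: the glob stays literal
    sid_name="${sid_file##*/}"
    while IFS= read -r linenum; do
        _loaded_sids["${sid_name}:${linenum}"]=1
    done < "$sid_file"
done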

# O(1) lookup replaces grep fork
_check_sid_match() {
    local sid="$1" linenum="$2"
    [ -n "${_loaded_sids[${sid}:${linenum}]+set}" ]
}

Rule evaluation then iterates the batch compiled rules in source order, checking SID matches via array lookups. AND rules filter a candidate set. OR rules merge match sets and count with sort -n | uniq -c. Grouped rules combine both. The entire evaluation phase uses zero subprocess forks. It is pure bash array operations.
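As an illustration of lookup-only evaluation, the sketch below checks one rule against one file using the SID:LINENUM key convention from the preload code. The function names and sample data are hypothetical, and the counting loop in csig_eval_or is a fork-free alternative written for this example, not the sort -n | uniq -c approach described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: evaluate rules against preloaded SID hits.
# Keys follow the "SID:LINENUM" convention; the data here is made up.
declare -A _loaded_sids=( ["1:1"]=1 ["1:3"]=1 ["2:3"]=1 )

# AND: succeed only if every SID in the comma-separated SPEC matched LINENUM.
csig_eval_and() {
    local spec="$1" linenum="$2" sid
    local IFS=,
    for sid in $spec; do
        [ -n "${_loaded_sids[${sid}:${linenum}]+set}" ] || return 1
    done
}

# OR with threshold: succeed if at least THRESH SIDs in SPEC matched LINENUM.
csig_eval_or() {
    local spec="$1" thresh="$2" linenum="$3" sid hits=0
    local IFS=,
    for sid in $spec; do
        [ -n "${_loaded_sids[${sid}:${linenum}]+set}" ] && hits=$((hits + 1))
    done
    [ "$hits" -ge "$thresh" ]
}
```

With the sample data, csig_eval_and "1,2" 3 succeeds and csig_eval_and "1,2" 1 fails, since SID 2 never matched line 1.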

Performance Profile

The design goal was not just correctness but efficiency under constraint. maldet runs on shared hosting servers where memory is limited and CPU time is shared across hundreds of accounts. Here is how the native CSIG engine compares to the alternative of delegating detection to ClamAV:

Peak Memory: 44 MB vs 998 MB (ClamAV)
Detection Rate: 36.7% vs 17.4% (ClamAV)
Compilation: 1.4s for 45K+ signatures

The memory advantage of more than 22x (998 MB vs 44 MB) comes from architecture, not optimization tricks. ClamAV loads its entire signature database into a resident process's heap. maldet's native engine streams signatures through grep processes that exist only for the duration of each matching tier. When the tier completes, the memory is freed. On a 512 MB VPS, ClamAV cannot start. maldet scans without issue.

The 2.1x detection advantage comes from specificity. Compound signatures can express multi-indicator rules that would be impossible in ClamAV's .ndb format. A rule requiring three of five obfuscation indicators catches more variants than five separate single-pattern rules, each of which needs to be conservative to avoid false positives.

Conclusion

We built CSIG because the deployment environments where maldet matters most (shared hosting, legacy infrastructure, resource-constrained VPS instances) are exactly the environments where ClamAV is hardest to run. The choice was between requiring a 1 GB daemon and building the detection engine in-house from tools already present on every target system.

The result is a detection language that compiles to ERE patterns, matches via Aho-Corasick batch grep, and evaluates boolean logic through set operations, all in ~130 lines of AWK and ~300 lines of bash. It runs on CentOS 6. It runs on a Raspberry Pi. And it detects twice as many threats as the tool it was designed to complement.

maldet 2.x is currently in active development on the 2.x branches of the project and is expected to release in the coming weeks. The CSIG compiler lives in files/internals/lmd_sigs.sh and the batch matcher in files/internals/lmd_engine.sh. The project is open source under GPLv2. For a walkthrough of the broader scan engine architecture, see our companion article on the 43x batch performance rewrite.