Portable Bash for 20 Years of Unix Fragmentation
rfxn ships three Bash projects to an operating system matrix that starts at CentOS 6 (released 2011) and ends at whatever shipped last month. Linux Malware Detect (maldet), Advanced Policy Firewall (APF), and Brute Force Detection (BFD) all run on servers that have been online for years, ship into hardened compliance environments where you cannot just yum install python3, and deploy from containers where coreutils may live anywhere at all.
Over the past two decades, Unix has quietly fragmented around the edges in ways that still bite anyone writing portable shell today. The /usr merge. The /sbin split. Bash 4.1 versus Bash 5.2. sysvinit versus systemd versus Upstart. OpenSSL 1.0.1 versus 3.x. This article is a reference for the specific pitfalls we hit across APF, BFD, and maldet, and the conventions we settled on after hitting them in production.
Everything here is documented in our project governance as hard rules, enforced by pre-commit grep patterns. We do not treat portability as a style preference. If a patch breaks CentOS 6, it does not ship.
The usr-merge Cliff
For most of Unix history, /bin held the essential tools needed to bring a system up to single-user mode, and /usr/bin held everything else. The distinction made sense when /usr was often a separately mounted partition that might not be available during early boot. Fedora 17 (2012) ended that split by symlinking /bin → /usr/bin, /sbin → /usr/sbin, /lib → /usr/lib. RHEL/CentOS 7 adopted the merge. Debian 12 (bookworm, 2023) finished its transition. Arch, openSUSE, and modern Ubuntu are all merged.
But not everything merged. CentOS 6 and Ubuntu 12.04 never did, and those systems are still in the field under extended support, locked compliance regimes, and internal build environments. On those hosts, /bin/cp is a real file and /usr/bin/cp does not exist.
This sounds like a small thing until you hardcode a path. The failure mode is specific and frustrating:
# In an install script
/usr/bin/cp -f "$src" "$dest"
# Works on Rocky 9, Debian 12, Ubuntu 22.04
# Fails on CentOS 6: "No such file or directory"
# The obvious "fix"
/bin/cp -f "$src" "$dest"
# Works on CentOS 6
# Fails on minimal containers, NixOS, some Alpine layouts
# Fails on FreeBSD, where cp lives in /bin but takes different flags
The rfxn convention, enforced across the codebase, is to let PATH do its job:
# From files/internals/lmd_clamav.sh in maldet
command rm -f "$cpath"/rfxn.{hdb,ndb,yara,hsb} 2>/dev/null # safe: ClamAV path may not have LMD sigs
command cp -f "$inspath/sigs/rfxn.ndb" "$inspath/sigs/rfxn.hdb" \
    "$inspath/sigs/rfxn.yara" "$cpath/" 2>/dev/null # safe: ClamAV path may not exist
The command builtin does two things at once: it tells Bash to bypass functions and aliases (so a malicious or buggy cp() shell function cannot intercept an install step), and it resolves the binary through PATH. On CentOS 6 that resolves to /bin/cp. On Rocky 9 it resolves to /usr/bin/cp. On a weird container it resolves to wherever the image put it. Nothing downstream of the resolution cares which.
We extended this rule to every coreutil used in project source: command cp, command mv, command rm, command chmod, command mkdir, command cat, command touch, command ln. Two exceptions: printf and echo are Bash builtins. A builtin is never looked up on PATH, so there is no path to get wrong, and command printf still runs the builtin rather than an external binary. The prefix would add noise in hot loops without adding any portability. Those stay bare.
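A minimal sketch of why the prefix matters. It assumes a scratch directory from mktemp and a deliberately broken cp() function standing in for a buggy wrapper; both are illustrative and not taken from the rfxn sources.

```shell
# Hypothetical demo: a shadowing function stands in for a buggy or hostile wrapper.
workdir=$(mktemp -d)
printf 'payload\n' > "$workdir/src"

cp() {
    # Pretends to succeed but copies nothing
    return 0
}

cp "$workdir/src" "$workdir/via_function"          # hits the function, no file is created
command cp "$workdir/src" "$workdir/via_command"   # skips the function, runs the real cp found on PATH

if [ ! -e "$workdir/via_function" ]; then
    echo "bare cp was intercepted by the function"
fi
if [ -f "$workdir/via_command" ]; then
    echo "command cp reached the real binary"
fi

unset -f cp   # remove the demo function again
```

The bare call reports success while doing nothing, which is the silent failure mode the rule is meant to rule out.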
One anti-pattern we explicitly ban is the backslash bypass:
# DO NOT DO THIS
\cp -f "$src" "$dest" # bypasses aliases but not functions
\rm -rf "$old" # still hits any rm() function defined higher up
\mv "$a" "$b" # not portable across shells (ksh, dash)
# This is the rule:
command cp -f "$src" "$dest"
command rm -rf "$old"
command mv "$a" "$b"
The Runtime Matrix
Here is the matrix we test against, with the four facts that dominate portability decisions at the shell level: where coreutils live, which init system runs PID 1, what Bash version ships, and what TLS floor the default OpenSSL build honours.
Every red cell in that matrix has produced a production bug at some point in the history of one of our projects. The rules below are the ones that survived.
The /sbin Split
APF is a firewall. Its whole job is invoking iptables, ip6tables, ipset, ip, and a handful of other admin binaries. These live in /sbin or /usr/sbin depending on the distro and whether the usr-merge happened. Worse, they are often not on the PATH of a non-login shell at all (cron, for example, defaults to a bare PATH=/usr/bin:/bin on some distros).
The APF approach, from files/internals/internals.conf, is to discover once at startup and cache the absolute path:
# APF: files/internals/internals.conf (discovery)
ifconfig=$(command -v ifconfig 2>/dev/null)
ip=$(command -v ip 2>/dev/null)
IPT=$(command -v iptables 2>/dev/null)
IP6T=$(command -v ip6tables 2>/dev/null)
IPTS=$(command -v iptables-save 2>/dev/null)
IPTR=$(command -v iptables-restore 2>/dev/null)
IP6TS=$(command -v ip6tables-save 2>/dev/null)
IP6TR=$(command -v ip6tables-restore 2>/dev/null)
IPSET=$(command -v ipset 2>/dev/null)
Because the firewall binary runs as root and because we extend PATH to include the sbin directories at the top of the main script (PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin), command -v finds every one of those binaries no matter where they live on the filesystem. The cached absolute paths are then used throughout the runtime, so we never pay the lookup cost twice.
Fallback chains matter too. APF prefers ip (from iproute2, modern) but falls back to ifconfig and route (from net-tools, deprecated but still present on many deployed hosts) when iproute2 is missing. Every one of those probes uses the same pattern: command -v in a conditional, never which (which is a separate binary that is not installed on minimal systems) and never type (which has shell-specific output quirks).
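The fallback chain can be sketched as a small helper. The function name first_available is hypothetical and not part of APF; it only illustrates the "command -v in a conditional" pattern with an ordered list of candidates.

```shell
# Hypothetical helper, not taken from APF: print the resolved location of the
# first candidate that command -v can find, or fail if none resolve.
first_available() {
    local candidate path
    for candidate in "$@"; do
        if path=$(command -v "$candidate" 2>/dev/null); then  # safe: a missing tool is the expected case
            printf '%s\n' "$path"
            return 0
        fi
    done
    return 1
}

# Extend PATH first so admin binaries resolve even under a cron-style environment
PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

# Prefer iproute2, fall back to net-tools
if netcfg=$(first_available ip ifconfig); then
    echo "using $netcfg for interface queries"
else
    echo "neither ip nor ifconfig found" >&2
fi
```

Because the result is captured once, later code can call "$netcfg" directly instead of repeating the probe.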
The Bash 4.1 Floor
CentOS 6 ships Bash 4.1.2. It is the single most restrictive target in our matrix for shell syntax, because everything above it has accreted handy features that simply do not work there. Some of these are syntax errors that are caught at parse time; others are silent misbehaviours that only show up at runtime with real data. The following are banned in all rfxn project source:
| Feature | Requires | Portable Alternative |
|---|---|---|
| ${var,,} / ${var^^} | Bash 4.2+ | echo "$var" \| tr '[:upper:]' '[:lower:]' |
| mapfile -d | Bash 4.4+ | while IFS= read -rd '' |
| declare -n (nameref) | Bash 4.3+ | indirect: ${!varname} |
| $EPOCHSECONDS | Bash 5.0+ | date +%s (costs one fork) |
| $EPOCHREALTIME | Bash 5.0+ | date +%s.%N (GNU date only) |
| declare -A (global scope) | Bash 4.0+, but see the trap below | parallel indexed arrays |
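A short sketch of several of the alternatives from the table working together. The function names (to_lower, read_nul_records, value_of) and the sample values are hypothetical and chosen only for illustration.

```shell
# Lowercase a string without ${var,,}
to_lower() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Read NUL-delimited records from stdin into the global array "records"
# without mapfile -d
read_nul_records() {
    local rec
    records=()
    while IFS= read -r -d '' rec; do
        records+=("$rec")
    done
}

# Look up a variable by name without declare -n
value_of() {
    local varname="$1"
    printf '%s' "${!varname}"
}

sig_name=$(to_lower "RFXN.NDB")
read_nul_records < <(printf '%s\0' "first path" "second path")
conf_file="/tmp/example.conf"        # hypothetical value
resolved=$(value_of conf_file)
now=$(date +%s)                      # stands in for $EPOCHSECONDS, costs one fork

printf '%s | %s records | %s | %s\n' "$sig_name" "${#records[@]}" "$resolved" "$now"
```

The read loop keeps embedded spaces intact because IFS is emptied and the delimiter is NUL rather than a newline.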
The declare -A Trap
That last row deserves its own paragraph, because the failure mode is subtle and we have hit it more than once. Bash supports associative arrays via declare -A starting in 4.0. The problem is that when a library is sourced from inside a function (which is exactly what the BATS test harness does via its load helper), a bare declare -A foo creates a function-scoped local. The name foo exists only for the duration of the sourcing function call, then vanishes. Every later access looks at an empty array and the code silently behaves wrong.
# BROKEN: declare -A at top level of a library that gets sourced
# from inside a function (e.g. BATS load, or any shell function)
declare -A _sigmap_cache # becomes a local! vanishes on return!
# BROKEN ALTERNATIVE: "fix" with -g
declare -gA _sigmap_cache # works in Bash 4.2+, not 4.1 — breaks CentOS 6
# PORTABLE: parallel indexed arrays
# Use two plain arrays and a linear find, or hash the key to an index
_cache_keys=()
_cache_vals=()
_cache_set() {
    local key="$1" val="$2" i
    for i in "${!_cache_keys[@]}"; do
        if [ "${_cache_keys[i]}" = "$key" ]; then
            _cache_vals[i]="$val"
            return 0
        fi
    done
    _cache_keys+=("$key")
    _cache_vals+=("$val")
}
# PORTABLE + INSIDE A FUNCTION: local -A is fine
_rule_eval() {
    local -A _loaded_sids=() # properly scoped, does not leak
    # ... use _loaded_sids freely ...
}
The important nuance: inside a function, local -A works cleanly on every version of Bash we target (including 4.1), because local scope is exactly what you want inside a function. The ban applies to global associative arrays, which are the ones that get silently broken by sourced-from-function semantics.
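The scoping trap is easy to reproduce in isolation. This sketch writes a throwaway library to a temp file and sources it from inside a function, the way a test harness would; the file contents and the names load_lib, _assoc_cache, _plain_keys, and _plain_vals are hypothetical.

```shell
# Hypothetical library file; names are illustrative only.
lib=$(mktemp)
cat > "$lib" <<'EOF'
declare -A _assoc_cache          # becomes local to whichever function sources this file
_assoc_cache[sig]="loaded"
_plain_keys=("sig")              # ordinary assignment, stays global
_plain_vals=("loaded")
EOF

load_lib() {
    source "$lib"
}
load_lib

echo "associative entries visible after load: ${#_assoc_cache[@]}"
echo "indexed entries visible after load: ${#_plain_keys[@]}"
command rm -f "$lib"
```

The first count comes back as 0 and the second as 1: the associative array disappeared when load_lib returned, while the plain indexed arrays survived.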
Other Bash Gotchas We Learned The Hard Way
A few more patterns that came out of real CentOS 6 and BATS bugs:
- local var=$(cmd) masks the exit code. The local builtin itself always returns 0, so set -e never trips and $? is useless. Declare first, then assign.
- args="$@" collapses to a single string. Assigning "$@" to a scalar silently joins the arguments with spaces and loses argument boundaries. Use args=("$@") for array semantics, or args="$*" when you deliberately want one string joined on the first character of IFS.
- cd $dir without a guard is a latent bug. If the directory is missing, execution continues in the wrong CWD and the next command runs against the wrong filesystem tree. Always cd "$dir" || return 1. set -e is not a substitute, because a failing cd inside a pipeline or conditional does not trigger it.
- for x in $(cat file) breaks on spaces. Always while IFS= read -r line; do ...; done < file.
- $() callers can hang. If a caller is doing out=$(func) and func launches a background subshell, inherited pipe fds keep the caller waiting forever. Use ( exec >/dev/null 2>&1; cmd ) & so the background subshell replaces its own fds.
Systemd, SysV, Upstart
Installing a daemon sounds simple. In practice, four different init systems are alive in our matrix: systemd (the default everywhere modern), classic sysvinit (CentOS 6, Slackware, minimal containers), Upstart (Ubuntu 14.04), and rc.local-style systems where we are effectively on our own.
The BFD and APF installers share a detection library (pkg_lib.sh) that sets _PKG_INIT_SYSTEM to one of systemd | sysv | upstart | rc.local | unknown with a cascade of probes:
# From APF/BFD pkg_lib.sh: pkg_detect_init()
pkg_detect_init() {
    _PKG_INIT_SYSTEM="unknown"
    # Primary: systemd runtime directory
    if [[ -d /run/systemd/system ]]; then
        _PKG_INIT_SYSTEM="systemd"
        return 0
    fi
    # Secondary: PID 1 process name
    # Guarded because /proc/1/comm may not exist on CentOS 6
    if [[ -f /proc/1/comm ]]; then
        local pid1_comm
        pid1_comm=$(cat /proc/1/comm 2>/dev/null) || pid1_comm=""
        case "$pid1_comm" in
            systemd) _PKG_INIT_SYSTEM="systemd" ;;
            init) _PKG_INIT_SYSTEM="sysv" ;;
            upstart) _PKG_INIT_SYSTEM="upstart" ;;
        esac
        # Stop once PID 1 has identified the init system, so the
        # directory probes below only run when nothing matched
        if [[ "$_PKG_INIT_SYSTEM" != "unknown" ]]; then
            return 0
        fi
    fi
    # Tertiary: init.d directories exist but no systemd
    if [[ -d /etc/init.d ]] || [[ -d /etc/rc.d/init.d ]]; then
        _PKG_INIT_SYSTEM="sysv"
        return 0
    fi
    # Last resort: rc.local
    if [[ -f /etc/rc.local ]] || [[ -f /etc/rc.d/rc.local ]]; then
        _PKG_INIT_SYSTEM="rc.local"
    fi
}
The probe sequence matters. We check /run/systemd/system before /proc/1/comm because the former is cheap and authoritative on any system where it exists. We only fall back to the init.d directory test if earlier probes return nothing, because distros like Rocky 9 still have /etc/init.d around for compatibility even though systemd is what PID 1 actually runs.
Enablement is where the branches diverge. BFD ships both a systemd unit file and a SysV init script, and the installer picks one based on the detected init:
# BFD install.sh (simplified)
if [ "$_PKG_INIT_SYSTEM" = "systemd" ]; then
    _unit_dir=$(_pkg_systemd_unit_dir) # /lib/systemd/system or /usr/lib/systemd/system
    command cp bfd.service "$_unit_dir/"
    command cp bfd.timer "$_unit_dir/"
    systemctl daemon-reload
    systemctl enable bfd.timer
else
    # sysv / upstart / rc.local all get the init script
    for _idir in /etc/rc.d/init.d /etc/init.d; do
        [ -d "$_idir" ] && command cp bfd-watch.init "$_idir/bfd-watch" && break
    done
    if command -v chkconfig >/dev/null 2>&1; then
        chkconfig bfd-watch on 2>/dev/null || true # chkconfig may not support this service
    fi
fi
Two extra details that cost us debug cycles: the systemd unit directory itself is not stable across distros (Debian puts it in /lib/systemd/system, RHEL puts it in /usr/lib/systemd/system, and on some merged systems one path is a symlink into the other), and chkconfig is not installed by default on some minimal images even when the system is SysV-based. Both get probed, both get conditional fallbacks.
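One way the unit-directory probe could look, as a sketch only: the real _pkg_systemd_unit_dir in pkg_lib.sh is not shown in this article and may differ. The name find_systemd_unit_dir and the optional root-prefix argument are assumptions, added so the probe can be exercised against a scratch tree.

```shell
# Hypothetical sketch; the real _pkg_systemd_unit_dir may be implemented differently.
# $1 is an optional root prefix (assumption) used only for testing against a scratch tree.
find_systemd_unit_dir() {
    local root="${1:-}" dir
    for dir in /usr/lib/systemd/system /lib/systemd/system; do
        if [ -d "$root$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

if unit_dir=$(find_systemd_unit_dir); then
    echo "unit files would be installed to $unit_dir"
else
    echo "no systemd unit directory found on this host"
fi
```

On a merged host both candidates resolve to the same directory, so the probe order only matters on layouts where just one of them exists.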
TLS on Legacy
maldet pulls signature updates from rfxn infrastructure over HTTPS. So do the APF reputation feeds. On a modern system that is unremarkable: curl against a Let's Encrypt-issued certificate with TLS 1.3 and SNI works without configuration. On CentOS 6, almost every part of that sentence is wrong.
CentOS 6 shipped OpenSSL 1.0.1 with backports that eventually included TLS 1.2, but on some historical minor versions the backport was incomplete or missing for certain ciphers. Its bundled curl predates broad SNI support in all build configurations, so virtual hosts served by modern edge infrastructure sometimes fail the handshake. The system CA bundle has not been updated in over a decade on unpatched hosts, so certificates chained to newer roots fail verification. And even when TLS works, the cipher list is constrained enough that modern servers rejecting weak ciphers can refuse the handshake on the server side.
The rfxn mitigations are layered:
- One host has curl but not wget; another the reverse. Minimal containers sometimes have neither. Detection is command -v curl first, then command -v wget, with an explicit error if neither is present.
- Every curl invocation specifies --connect-timeout and --max-time, and every wget invocation specifies --timeout.
The goal is not to support TLS 1.0 as a security posture. The goal is to keep signature updates flowing on hosts that have not been kernel-rebooted in five years, so that maldet can still detect new webshells on them. Security depends more on whether the signature bundle arrives than on what cipher carried it.
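The client detection and timeout rules could be sketched as follows. The function names pick_fetcher and fetch_url and the timeout values are hypothetical; the article does not state the values the projects actually use.

```shell
# Hypothetical values; the projects' real timeouts are not given in this article.
FETCH_CONNECT_TIMEOUT=30
FETCH_MAX_TIME=120

# Print the name of the first available HTTP client, curl before wget
pick_fetcher() {
    if command -v curl >/dev/null 2>&1; then    # safe: absence is handled by the branches below
        printf 'curl\n'
    elif command -v wget >/dev/null 2>&1; then  # safe: absence is handled by the else branch
        printf 'wget\n'
    else
        echo "error: neither curl nor wget is installed" >&2
        return 1
    fi
}

# Download $1 to $2 with explicit timeouts on whichever client is present
fetch_url() {
    local url="$1" dest="$2" fetcher
    fetcher=$(pick_fetcher) || return 1
    case "$fetcher" in
        curl)
            curl --fail --silent --show-error --location \
                --connect-timeout "$FETCH_CONNECT_TIMEOUT" \
                --max-time "$FETCH_MAX_TIME" \
                --output "$dest" "$url"
            ;;
        wget)
            wget --quiet --timeout="$FETCH_CONNECT_TIMEOUT" \
                --output-document="$dest" "$url"
            ;;
    esac
}
```

A caller would run something like fetch_url "$sig_url" "$tmp_file" || return 1, where sig_url and tmp_file are placeholders, and treat any non-zero status as a failed update instead of parsing client-specific output.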
Testing the Matrix
None of the rules above matter if they are not enforced. The only way to know that a change survives CentOS 6 is to run it on CentOS 6. All three rfxn projects share a BATS-based test harness (batsman) that executes the full suite inside Docker containers for each target distribution.
Every code-changing commit runs against Debian 12 and Rocky 9 at minimum. Major changes run the full matrix: centos6, centos7, rocky8, rocky9, ubuntu20, ubuntu22, ubuntu24, debian12. The companion article on our Docker-over-TCP BATS harness documents how the distro matrix is driven from a single Makefile target and how we use a dedicated test host to parallelise the runs. For the purposes of this article, the point is simply that the matrix exists and that we exercise it on every change.
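As a rough illustration of the two tiers, here is a dry-run sketch of a matrix driver. It is not the batsman harness: the image tag scheme (example/tests:<distro>), the in-container test path, and the RUNNER variable are all assumptions made for the example.

```shell
# Hypothetical names: the image tag scheme and test path are assumptions.
MATRIX_FULL="centos6 centos7 rocky8 rocky9 ubuntu20 ubuntu22 ubuntu24 debian12"
MATRIX_MIN="debian12 rocky9"

run_matrix() {
    local mode="$1" distros distro failed=0
    case "$mode" in
        full) distros=$MATRIX_FULL ;;
        min)  distros=$MATRIX_MIN ;;
        *)    echo "usage: run_matrix full|min" >&2; return 2 ;;
    esac
    # Word splitting on $distros is intentional: one iteration per distro name
    for distro in $distros; do
        # RUNNER defaults to echo, so the sketch only prints the commands
        ${RUNNER:-echo} docker run --rm "example/tests:$distro" bats /opt/tests || failed=1
    done
    return "$failed"
}

run_matrix min
```

Setting RUNNER to an empty-prefix wrapper such as env would execute the commands instead of printing them; the loop keeps going after a failure so one broken distro does not hide results from the rest.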
One detail the tests themselves need to honour: test files (.bats) run inside Docker containers that have no aliases and a pre-merge layout on some images. Inside tests, we use bare cp, rm, mv and let the container's PATH resolve them. Inside project source, we use command cp, command rm, command mv. Inside agent-invoked Bash (the shell commands we type in operational sessions), we use absolute paths like /usr/bin/rm explicitly, because the agent environment is a single known host. Three contexts, three rules. All of them encoded in our project governance and enforced by grep.
The Verification Gauntlet
Every commit that touches shell files runs a pre-commit gauntlet of syntax checks and pattern greps. These checks catch violations that would otherwise surface only at runtime on a customer's CentOS 6 host, which is too late. The actual pattern list from our project CLAUDE.md:
# Syntax and lint
bash -n <all-shell-files>
shellcheck <all-shell-files>
# Deprecated utilities
grep -rn '\bwhich\b' files/ # which is a separate binary, not on minimal systems
grep -rn '\begrep\b' files/ # deprecated, use grep -E
# Modernize shell constructs
grep -rn '`' files/ # backticks — use $()
grep -rn '|| true' files/ # must have inline comment on SAME line
grep -rn '2>/dev/null' files/ # must have inline comment on SAME line
# Bare coreutils (portability violation)
grep -rn '^\s*cp \|^\s*mv \|^\s*rm ' files/
grep -rn '^\s*chmod \|^\s*mkdir \|^\s*touch \|^\s*ln ' files/
grep -Prn '^\s*cat\s(?!<<)' files/
# Word-boundary sweep: catches mid-line, inside $(), after ; or |
grep -rn '\bcat\b' files/ | grep -v 'command cat' | grep -v 'cat <<'
grep -rn '\bchmod\b\|\bmkdir\b\|\btouch\b\|\bln\b' files/ | grep -v 'command '
# Hardcoded coreutils paths (breaks the non-merged side)
grep -rn '/usr/bin/\(rm\|mv\|cp\|chmod\|mkdir\|cat\|touch\|ln\)' files/
# Backslash alias bypass (prohibited, use command prefix)
grep -rn '\\cp \|\\mv \|\\rm ' files/
# local var=$() masks exit code (always returns 0)
grep -rn 'local [a-z_]*=\$([^(]' files/
# Every cd must have || exit / || return guard
grep -rn '^\s*cd ' files/
Every hit in those greps requires either a fix or an inline justification on the same line. Section-level comments on preceding lines do not satisfy the rule; it has to be on the line itself, so that git blame shows the justification next to the construct it justifies, and so that reviewers cannot silently lose the context during a later edit.
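The same-line rule can itself be checked mechanically. This is a minimal sketch for the || true pattern only; the function name check_or_true is hypothetical, and the filter assumes that any # after || true on the same line counts as the justification.

```shell
# Hypothetical checker: report "|| true" lines that lack a same-line comment.
check_or_true() {
    local dir="$1" hits
    # Second grep drops hits that carry a "#" somewhere after "|| true"
    hits=$(grep -rn '|| true' "$dir" | grep -v '|| true.*#') || hits=""
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 1
    fi
    return 0
}

# Scratch tree with one justified and one unjustified use
scratch=$(mktemp -d)
printf '%s\n' 'chkconfig svc on || true # chkconfig may be absent' > "$scratch/ok.sh"
printf '%s\n' 'rm -f "$x" || true' > "$scratch/bad.sh"

if ! check_or_true "$scratch"; then
    echo "unjustified || true found"
fi
```

Only the line from bad.sh is reported, so a pre-commit hook could fail the commit on a non-zero return and print the offending locations.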
There are also project-specific extensions. APF greps for hardcoded iptables paths. BFD greps for bare systemctl calls that bypass the init-system detection. maldet greps for references to a hardcoded signature URL. Each project's CLAUDE.md extends the base pattern list with its own project-specific patterns. The base list above is the common denominator that every rfxn shell project inherits.
Conclusion
The point of this is not that legacy matters forever. CentOS 6 went out of ELS support, and every year the long tail gets shorter. The point is that portability is a one-time engineering tax with recurring benefit. Every rule we learned on CentOS 6 has kept us out of trouble somewhere else: on minimal Alpine containers where the merge happened but coreutils are BusyBox, on NixOS where nothing lives at either /bin or /usr/bin in the expected way, on FreeBSD where flags diverge, on Gentoo where users build their own layouts. A codebase that passes the command and command -v discipline ports to any of those without a patch.
It also passes review faster. Writing command cp instead of cp is eight keystrokes. Debugging a silent install failure on a customer's five-year-old host is a week of back-and-forth and a lost support ticket. The tax is real but small. The benefit is real and large.
maldet, APF, and BFD are all open source under GPLv2. The shared pkg_lib.sh that drives install-time init detection, the governance rules that drive the grep patterns above, and the BATS matrix that exercises them all live in the project repositories. If any of this is useful to your own portable-shell codebase, take it.