---
title: Learning About Syscall Filtering With Seccomp
---

I'd heard about being able to [run Docker containers with a custom security profile][docker-seccomp], but wasn't really sure what that meant or what was happening behind the scenes, so I decided to do some experimentation to find out.

It turns out that the Linux kernel includes a feature called "secure computing mode," or `seccomp` for short. Using `seccomp` lets you tell the kernel that you only expect your program to use a specific set of system calls, and if your program makes any system calls that aren't in your approved list, the kernel should kill your program.

But why would you want to do this? I think if you had a pretty simple program, using `seccomp` might be overkill. But if your program makes different system calls depending on possibly-untrustworthy user input, it might make sense to try to limit what the program is allowed to do. Looking at [a list of software using `seccomp` on Wikipedia][wiki] backs this up: the software listed is mostly hypervisors/container runners (like Docker), web browsers, etc.

By reading [the manual page for the `seccomp(2)` system call][man-seccomp], we can learn how to write a program to try this out. The simplest action is to enter "strict mode," which prevents all system calls except for `read(2)`, `write(2)`, `_exit(2)`, and `sigreturn(2)`. In other words, what I think should be just enough to write hello world! Let's give it a shot:

```
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>

int main() {
    /* Enter strict mode: only read, write, _exit, and sigreturn are allowed */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl");
        return 1;
    }

    printf("hello, world!\n");

    return 0;
}
```

When I compile and run my program, I just see **Killed** being printed, not **hello, world!**. Well, this is pretty good evidence that `seccomp` is doing _something_: it's at least killing my program!
Let's try to find out why it's being killed using [`strace`, a program that shows you all of the system calls being made][strace]:

```
$ strace ./hello
execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
brk(NULL) = 0x559e08463000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe65b9ee000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe65b3df000
mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe65b7c6000
mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
mprotect(0x559e077b9000, 4096, PROT_READ) = 0
mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
munmap(0x7fe65b9f0000, 25762) = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
fstat(1, <unfinished ...>) = ?
+++ killed by SIGKILL +++
Killed
```

There's a lot at the beginning about loading dynamically linked libraries, reading the program binary, and mapping it into memory that I don't fully understand. But the last few syscalls provide some clues: right after `prctl` is called, we see `fstat` being called!
`fstat` is a system call for getting the status of a file, and `1` happens to be the file descriptor for standard output. It makes sense that calling `printf` might involve checking the status of standard output, so I tried commenting out the call to `printf` in `hello.c`. When I compiled and ran the new version, it still just printed **Killed**, so I used `strace` again. Just looking at the last few lines:

```
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
exit_group(0) = ?
+++ killed by SIGKILL +++
Killed
```

Now my program is making the `exit_group` system call. Thinking back to the manual page for `seccomp`, it said:

> The only system calls that the calling thread is permitted to make are
> `read(2)`, `write(2)`, `_exit(2)` (but not `exit_group(2)`), and
> `sigreturn(2)`.

It looks like I'll need to actually do some real filtering if I want to run my hello world program and not just use strict mode. To do this, we need to use `SECCOMP_MODE_FILTER` and pass a pointer to a `struct sock_fprog`, which according to the manpage is "a Berkeley Packet Filter program designed to filter arbitrary system calls and system call arguments."

While we could construct a BPF program using an array of `struct sock_filter`s, looking at the chain of instructions we'd need made me think it would be much easier to enlist the services of [`libseccomp`][libseccomp], a library designed for just this purpose.
Let's try rewriting `hello.c` to use `libseccomp`, allowing the three syscalls hello world needs (`fstat` and `exit_group`, which we saw in `strace`, plus `write`):

```
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>

scmp_filter_ctx ctx;

/* graceful_exit cleans up our seccomp context before exiting */
void graceful_exit(int rc) {
    seccomp_release(ctx);
    exit(rc);
}

/* setup_seccomp initializes seccomp and loads our BPF program that filters
 * syscalls into the kernel */
void setup_seccomp() {
    int rc;

    /* Initialize the seccomp filter state */
    if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
        graceful_exit(1);
    }
    if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
        graceful_exit(1);
    }

    /* Add allowed system calls to the BPF program */
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
        graceful_exit(1);
    }
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
        graceful_exit(1);
    }
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
        graceful_exit(1);
    }

    /* Load the BPF program for the current context into the kernel */
    if ((rc = seccomp_load(ctx)) != 0) {
        graceful_exit(1);
    }
}

int main() {
    setup_seccomp();

    printf("hello, world!\n");

    graceful_exit(0);
}
```

Since we're now using `libseccomp`, we need to tell our C compiler to link the library:

```
$ cc -o hello hello.c -lseccomp
$ ./hello
hello, world!
```

Success! Our program compiles and runs, and all of the necessary syscalls have been allowed. Now let's try modifying the `main()` function of our program to do something bad, like trying to read the password file `/etc/shadow`:

```
int main() {
    FILE *fd;

    setup_seccomp();

    printf("hello, world!\n");

    if ((fd = fopen("/etc/shadow", "r")) == NULL) {
        perror("fopen");
        graceful_exit(1);
    }
    fclose(fd);

    graceful_exit(0);
}
```

Now when we compile and run our program, we get:

```
$ ./hello
hello, world!
Bad system call (core dumped)
```

Nice! The kernel killed our program when we tried to use a system call (`openat`) that we didn't plan on!

Now[^1] let's go back to how this all fits into Docker. Looking at [Docker's default `seccomp` profile][docker-default], a lot of it starts to make more sense. In fact, it looks like they're using the exact same names from `libseccomp` that we used in our program! If we search [the moby source code for `libseccomp`][moby], we can see that it is indeed being used (via Go bindings).

Let's try to use a custom `seccomp` profile to prohibit programs in our Docker container from listening for network connections. To start, I want to make sure I can accept network connections, then modify my profile and watch it break. I downloaded the default `seccomp` profile to use as a starting point for tweaking, started a container with port 4000 open, then used `nc` to try communicating from my host machine to a listener in the Docker container:

```
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
```

When I run `echo hi | nc 127.0.0.1 4000` in a separate terminal, my greeting is printed by the netcat listener in the Docker container. Success!

Now that I know my basic TCP server works, let's try blocking it with `seccomp`! To start listening on a TCP port, I know that `nc` has to use the `socket`, `bind`, and `listen` system calls (which we can verify using `strace`). I'll try removing them from the list of allowed system calls in the default profile, and run the Docker container again with the modified profile:

```
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
nc: socket(AF_INET,1,0): Operation not permitted
```

Awesome! We just used `seccomp` to control what our Docker container is allowed to do!
I can imagine this might be helpful if you had an environment where security was extremely important and wanted to really lock down your containers, but it's hard to imagine that writing custom `seccomp` profiles for every container in your production environment is the best use of time without having some specific situation you're trying to address.

* * *

[^1]: I wanted to figure out how to allow `openat` to only open a specific file name, but I couldn't figure out how to compare string system call arguments. I could compare integer arguments to do things like only allowing `fstat` and `write` to print to the console by comparing their arguments with `1`, the file descriptor for standard output. Maybe the lower level BPF program API would let me do this? But it was getting late and I still wanted to finish my experimentation with Docker. If you know how to do this, [please send me an email and let me know](/contact.html)!

[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
[man-seccomp]: https://man7.org/linux/man-pages/man2/seccomp.2.html
[strace]: https://strace.io/
[libseccomp]: https://github.com/seccomp/libseccomp
[docker-default]: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
[moby]: https://github.com/moby/moby/search?q=libseccomp
[wiki]: https://en.wikipedia.org/wiki/Seccomp#Software_using_seccomp_or_seccomp-bpf