From 2a1ed625c115ec8594eeaa99171791d57786b79e Mon Sep 17 00:00:00 2001
From: Ben Burwell
Date: Sun, 28 Jun 2020 10:20:50 -0400
Subject: Add seccomp post

---
 ...earning-about-syscall-filtering-with-seccomp.md | 284 +++++++++++++++++++++
 contact.md                                         |   8 +
 2 files changed, 292 insertions(+)
 create mode 100644 _posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md
 create mode 100644 contact.md

diff --git a/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md b/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md
new file mode 100644
index 0000000..226928a
--- /dev/null
+++ b/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md
@@ -0,0 +1,284 @@
+---
+title: Learning About Syscall Filtering With Seccomp
+---
+
+I'd heard about being able to [run Docker containers with a custom "security
+profile"][docker-seccomp], but wasn't really sure what that meant, or what was
+happening behind the scenes, so I decided to do some experimentation to find
+out.
+
+It turns out that the Linux kernel includes a feature called "secure computing
+mode," or `seccomp` for short. Using `seccomp` lets you tell the kernel that you
+only expect your program to use a specific set of system calls, and if your
+program makes any system calls that aren't in your approved list, the kernel
+should kill your program.
+
+
+
+But why would you want to do this? I think if you had a pretty simple program,
+using `seccomp` might be overkill. But if your program does different things on
+the system depending on some possibly untrustworthy user input, it might make
+sense to use. For example, if your program runs user-specified commands, you
+might want to make sure that an approved subset of system functionality is
+available. Looking at [a list of software using `seccomp` on Wikipedia][wiki]
+backs this up: the software listed are mostly hypervisors/container runners
+(like Docker), web browsers, etc.
+
+By reading [the manual page for the `seccomp(2)` system call][man-seccomp], we
+can learn how to write a program to try this out. The simplest action is to
+enter "strict mode", which prevents all system calls except for `read(2)`,
+`write(2)`, `_exit(2)`, and `sigreturn(2)` --- in other words, what I think
+should be just enough to write hello world! Let's give it a shot:
+
+```
+#include <stdio.h>
+#include <sys/prctl.h>
+#include <linux/seccomp.h>
+
+int
+main()
+{
+    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
+        perror("prctl");
+        return 1;
+    }
+    printf("hello, world!\n");
+    return 0;
+}
+```
+
+When I compile and run my program, I just see **Killed** being printed, not
+**hello, world!**. Well, this is pretty good evidence that `seccomp` is doing
+_something_ --- it's at least killing my program! Let's try to find out why
+using [`strace`, a program that shows you all of the system calls being
+made][strace]:
+
+```
+$ strace ./hello
+execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
+brk(NULL) = 0x559e08463000
+access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
+access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
+openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
+fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
+mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
+close(3) = 0
+access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
+openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
+read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"...,
+832) = 832
+fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
+mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
+0x7fe65b9ee000
+mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
+0x7fe65b3df000
+mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
+mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE,
+MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE,
3, 0x1e7000) = 0x7fe65b7c6000
+mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE,
+MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
+close(3) = 0
+arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
+mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
+mprotect(0x559e077b9000, 4096, PROT_READ) = 0
+mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
+munmap(0x7fe65b9f0000, 25762) = 0
+prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
+fstat(1, <unfinished ...>) = ?
++++ killed by SIGKILL +++
+Killed
+```
+
+There's a lot of stuff in here that I don't fully understand about loading
+dynamically linked libraries, reading the program binary, and mapping it into
+memory, but the last few syscalls provide some clues: right after `prctl` is
+called, we see `fstat` being called! `fstat` is a system call for getting the
+status of a file, and `1` happens to be the file descriptor for standard output.
+It makes sense that calling `printf` might involve checking the status of
+standard output, so I tried commenting out the call to `printf` in `hello.c`.
+When I compiled and ran the new version, it still just printed **Killed**, so I
+used `strace` again. Just looking at the last few lines:
+
+```
+prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
+exit_group(0) = ?
++++ killed by SIGKILL +++
+Killed
+```
+
+Now my program is making the `exit_group` system call. Thinking back to the
+manual page for `seccomp`, it said:
+
+> The only system calls that the calling thread is permitted to make are
+> `read(2)`, `write(2)`, `_exit(2)` (but not `exit_group(2)`), and
+> `sigreturn(2)`.
+
+It looks like I'll need to actually do some real filtering if I want to run my
+hello world program and not just use strict mode. To do this, we need to use
+`SECCOMP_MODE_FILTER` and pass a pointer to a `struct sock_fprog`, which is "a
+Berkeley Packet Filter program designed to filter arbitrary system calls and
+system call arguments."
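
For a sense of what the hand-written route might involve, here is a rough sketch of a classic BPF program built from `struct sock_filter` instructions and installed with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`. It is not taken from the post: the names `hello_filter` and `install_hello_filter` are made up for illustration, the architecture check assumes x86-64, and the allow-list covers only `fstat`, `write`, and `exit_group` (a different libc version may issue other calls, such as `newfstatat`, which `strace` would reveal).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical allow-list filter; the arch check assumes x86-64. */
static struct sock_filter hello_filter[] = {
    /* 0: load the architecture field of struct seccomp_data */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* 1: if it is x86-64, skip the kill at index 2 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* 2: unexpected architecture */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 3: load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 4-6: each match jumps forward to the allow at index 8 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_fstat, 3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
    /* 7: anything else is killed */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 8: allowed */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* install_hello_filter loads the filter above into the kernel for the
 * calling thread. Returns 0 on success, -1 on failure (errno is set). */
int
install_hello_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(hello_filter) / sizeof(hello_filter[0])),
        .filter = hello_filter,
    };

    assert(prog.len <= BPF_MAXINSNS);

    /* An unprivileged process must set no_new_privs before it may
     * install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Calling `install_hello_filter()` at the top of `main()` in place of the strict-mode `prctl` call should let `printf` and a normal return go through, provided libc makes exactly these calls on the system in question. The jump offsets are relative and counted by hand, which is a good part of why a helper library is attractive.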
+
+While we could construct a BPF program using an array of `struct sock_filter`
+instructions, looking at the chain of instructions we'd need to set up made me
+think it would be much easier to enlist the services of
+[`libseccomp`][libseccomp], a library designed for just this purpose. Let's try
+rewriting `hello.c` to use `libseccomp` and allowing those three syscalls we saw
+before (`fstat`, `write`, and `exit_group`):
+
+```
+#include <seccomp.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+scmp_filter_ctx ctx;
+
+/* graceful_exit cleans up our seccomp context before exiting */
+void
+graceful_exit(int rc)
+{
+    seccomp_release(ctx);
+    exit(rc);
+}
+
+/* setup_seccomp initializes seccomp and loads our BPF program that filters
+ * syscalls into the kernel */
+void
+setup_seccomp()
+{
+    int rc;
+
+    /* Initialize the seccomp filter state */
+    if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
+        graceful_exit(1);
+    }
+    if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
+        graceful_exit(1);
+    }
+
+    /* Add allowed system calls to the BPF program */
+    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
+        graceful_exit(1);
+    }
+    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
+        graceful_exit(1);
+    }
+    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
+        graceful_exit(1);
+    }
+
+    /* Load the BPF program for the current context into the kernel */
+    if ((rc = seccomp_load(ctx)) != 0) {
+        graceful_exit(1);
+    }
+}
+
+int
+main()
+{
+    setup_seccomp();
+    printf("hello, world!\n");
+    graceful_exit(0);
+}
+```
+
+Since we're now using `libseccomp`, we need to tell our C compiler to link the
+library:
+
+```
+$ cc -o hello hello.c -lseccomp
+$ ./hello
+hello, world!
+```
+
+Success! Our program compiles and runs, and all of the necessary syscalls have
+been allowed.
Now let's try modifying the `main()` function of our program to do
+something bad, like trying to read the password file `/etc/shadow`:
+
+```
+int
+main()
+{
+    FILE *fd;
+    setup_seccomp();
+    printf("hello, world!\n");
+    if ((fd = fopen("/etc/shadow", "r")) == NULL) {
+        perror("fopen");
+        graceful_exit(1);
+    }
+    fclose(fd);
+    graceful_exit(0);
+}
+```
+
+Now when we compile and run our program, we get:
+
+```
+$ ./hello
+hello, world!
+Bad system call (core dumped)
+```
+
+Nice! The kernel killed our program when we tried to use a system call
+(`openat`) that we didn't plan on!
+
+Now[^1] let's go back to how this all fits into Docker. Looking at [Docker's
+default `seccomp` profile][docker-default], a lot of it starts to make more
+sense. In fact, it looks like they're using the exact same names from
+`libseccomp` that we used in our program! If we search [the moby source code for
+`"libseccomp"`][moby], we can see that it is indeed being used (via Go
+bindings).
+
+Let's try to use a custom `seccomp` profile to prohibit programs in our Docker
+container from listening for network connections. To start, I want to make sure
+I can accept network connections, then modify my profile and watch it break. I
+downloaded the default `seccomp` profile to use as a starting point for
+tweaking, started a container with port 4000 open, then used `nc` to try
+communicating from my host machine to a listener in the Docker container:
+
+```
+$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
+/ # nc -l -p 4000
+```
+
+When I run `echo hi | nc 127.0.0.1 4000` in a separate terminal, my greeting is
+printed by the netcat listener in the Docker container---success! Now that I know
+my basic TCP server works, let's try blocking it with `seccomp`! To start
+listening on a TCP port, I know that `nc` has to use the `socket`, `bind`, and
+`listen` system calls (which we can verify using `strace`).
I'll try removing
+them from the list of allowed system calls in the default profile, and run the
+Docker container again with the modified profile:
+
+```
+$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
+/ # nc -l -p 4000
+nc: socket(AF_INET,1,0): Operation not permitted
+```
+
+Awesome! We just used `seccomp` to control what our Docker container is allowed
+to do!
+
+I can imagine this might be helpful if you had an environment where security was
+extremely important and wanted to really lock down your containers, but it's
+hard to imagine that writing custom `seccomp` profiles for every container in
+your production environment is the best use of time without having some specific
+situation you're trying to address.
+
+* * *
+
+[^1]: I wanted to figure out how to allow `openat` to only open a specific file
+      name, but I couldn't figure out how to compare string system call
+      arguments. I could compare integer arguments to do things like only
+      allowing `fstat` and `write` to print to the console by comparing their
+      arguments with `1`, the file descriptor for standard output. Maybe the
+      lower level BPF program API would let me do this? But it was getting late
+      and I still wanted to finish my experimentation with Docker. If you know
+      how to do this, [please send me an email and let me know](/contact.html)!
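
On the question raised in the footnote: as far as the kernel documentation describes it, a seccomp filter only sees the raw syscall arguments as 64-bit integers in `struct seccomp_data` and cannot dereference pointers, so a file name is just an address to the filter and probably cannot be compared with either API. The integer case does work at the raw BPF level, and a minimal sketch of it follows: allow `write` only when its first argument is `1`. The names `write_stdout_only` and `simulate` are hypothetical, the offsets assume a little-endian machine, the architecture check is left out for brevity, and `simulate` is only a tiny userspace stand-in for the kernel's evaluator that understands the three instruction kinds used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offsets of the two 32-bit halves of args[0]; little-endian assumed. */
#define ARG0_LO (offsetof(struct seccomp_data, args[0]))
#define ARG0_HI (offsetof(struct seccomp_data, args[0]) + 4)

/* Hypothetical filter: allow write(1, ...) and kill everything else. */
static struct sock_filter write_stdout_only[] = {
    /* 0: load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 1: not write -> jump to the kill at index 6 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 4),
    /* 2-3: the high half of the fd argument must be zero */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_HI),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2),
    /* 4-5: the low half must equal 1 (standard output) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LO),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),
    /* 6: kill */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 7: allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* simulate dry-runs a filter against a seccomp_data record in userspace
 * and returns the action the filter would pick. Unsupported instructions
 * and out-of-range loads are treated as a kill. */
static uint32_t
simulate(const struct sock_filter *f, size_t len, const struct seccomp_data *d)
{
    uint32_t acc = 0;

    assert(f != NULL && d != NULL);
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &f[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k + sizeof(acc) > sizeof(*d))
                return SECCOMP_RET_KILL;
            memcpy(&acc, (const char *)d + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* offsets are relative to the next instruction */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}
```

Feeding `simulate` a record with `nr` set to `__NR_write` and `args[0]` set to `1` should yield `SECCOMP_RET_ALLOW`, while any other file descriptor or syscall yields a kill. With `libseccomp`, the equivalent rule is normally expressed by passing an argument comparison to `seccomp_rule_add` instead of counting jump offsets by hand.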
+
+[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
+[man-seccomp]: https://man7.org/linux/man-pages/man2/seccomp.2.html
+[strace]: https://strace.io/
+[libseccomp]: https://github.com/seccomp/libseccomp
+[docker-default]: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
+[moby]: https://github.com/moby/moby/search?q=libseccomp
+[wiki]: https://en.wikipedia.org/wiki/Seccomp#Software_using_seccomp_or_seccomp-bpf
diff --git a/contact.md b/contact.md
new file mode 100644
index 0000000..c5ced2c
--- /dev/null
+++ b/contact.md
@@ -0,0 +1,8 @@
+---
+title: Contact Me
+---
+
+# Contact Me
+
+The best way to contact me is to send an email to my first name
+`@benburwell.com`.
-- 
cgit v1.2.3