Add seccomp post

author: Ben Burwell <ben@benburwell.com> 2020-06-28 10:20:50 -0400
committer: Ben Burwell <ben@benburwell.com> 2020-06-28 10:20:50 -0400
commit: 2a1ed625c115ec8594eeaa99171791d57786b79e (patch)
tree: da247a7a3648917d93d56cff28bf5a7b232a295a
parent: 01094bf3559775b0d13b6ef34301103a3757c751 (diff)
2 files changed, 292 insertions, 0 deletions
diff --git a/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md b/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md
new file mode 100644
index 0000000..226928a
--- /dev/null
+++ b/_posts/2020-06-27-learning-about-syscall-filtering-with-seccomp.md
@@ -0,0 +1,284 @@
+---
+title: Learning About Syscall Filtering With Seccomp
+---
+
+I'd heard about being able to [run Docker containers with a custom "security
+profile"][docker-seccomp], but wasn't really sure what that meant, or what was
+happening behind the scenes, so I decided to do some experimentation to find
+out.
+
+It turns out that the Linux kernel includes a feature called "secure computing
+mode," or `seccomp` for short. Using `seccomp` lets you tell the kernel that you
+only expect your program to use a specific set of system calls, and if your
+program makes any system calls that aren't in your approved list, the kernel
+should kill your program.
+
+<!--more-->
+
+But why would you want to do this? For a pretty simple program, using `seccomp`
+might be overkill. But if your program does different things on the system
+depending on some possibly untrustworthy user input, it might make sense to
+use. For example, if your program runs user-specified commands, you might want
+to make sure that only an approved subset of system functionality is
+available. Looking at [a list of software using `seccomp` on Wikipedia][wiki]
+backs this up: the software listed is mostly hypervisors/container runners
+(like Docker), web browsers, etc.
+
+By reading [the manual page for the `seccomp(2)` system call][man-seccomp], we
+can learn how to write a program to try this out. The simplest action is to
+enter "strict mode", which prevents all system calls except for `read(2)`,
+`write(2)`, `_exit(2)`, and `sigreturn(2)` --- in other words, what I think
+should be just enough to write hello world! Let's give it a shot:
+
+```
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+#include <stdio.h>
+
+int
+main()
+{
+        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
+                perror("prctl");
+                return 1;
+        }
+        printf("hello, world!\n");
+        return 0;
+}
+```
+
+When I compile and run my program, I just see **Killed** being printed, not
+**hello, world!**. Well, this is pretty good evidence that `seccomp` is doing
+_something_ --- it's at least killing my program! Let's try to find out why
+using [`strace`, a program that shows you all of the system calls being
+made][strace]:
+
+```
+$ strace ./hello
+execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
+brk(NULL)                               = 0x559e08463000
+access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
+access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
+openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
+fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
+mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
+close(3)                                = 0
+access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
+openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
+read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"...,
+832) = 832
+fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
+mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
+0x7fe65b9ee000
+mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
+0x7fe65b3df000
+mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
+mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE,
+MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe65b7c6000
+mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE,
+MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
+close(3)                                = 0
+arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
+mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
+mprotect(0x559e077b9000, 4096, PROT_READ) = 0
+mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
+munmap(0x7fe65b9f0000, 25762)           = 0
+prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
+fstat(1,  <unfinished ...>)             = ?
++++ killed by SIGKILL +++
+Killed
+```
+
+There's a lot of stuff in here that I don't fully understand about loading
+dynamically linked libraries, reading the program binary, and mapping it into
+memory, but the last few syscalls provide some clues: right after `prctl` is
+called, we see `fstat` being called! `fstat` is a system call for getting the
+status of a file, and `1` happens to be the file descriptor for standard output.
+It makes sense that calling `printf` might involve checking the status of
+standard output, so I tried commenting out the call to `printf` in `hello.c`.
+When I compiled and ran the new version, it still just printed **Killed**, so I
+used `strace` again. Just looking at the last few lines:
+
+```
+prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
+exit_group(0)                           = ?
++++ killed by SIGKILL +++
+Killed
+```
+
+Now my program is making the `exit_group` system call. Thinking back to the
+manual page for `seccomp`, it said:
+
+> The only system calls that the calling thread is permitted to make are
+> `read(2)`, `write(2)`, `_exit(2)` (but not `exit_group(2)`), and
+> `sigreturn(2)`.
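+
+So strict mode is not incompatible with hello world as such; the trouble is the
+calls libc makes on our behalf (`fstat` from `printf`, `exit_group` on the way
+out). As a rough sketch, not something from the original experiment, a version
+that sticks to `write(2)` and the raw `exit` syscall should get through. The
+`run_strict_hello` helper is a hypothetical name, and it forks so that the
+restricted code runs in a child process:
+
+```c
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+/* run_strict_hello (hypothetical helper) runs a strict-mode "hello" in a
+ * child process and returns the child's exit status, or -1 if the child was
+ * killed or could not be started */
+int
+run_strict_hello(void)
+{
+        static const char msg[] = "hello, world!\n";
+        int status;
+        pid_t pid = fork();
+
+        if (pid < 0)
+                return -1;
+        if (pid == 0) {
+                if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
+                        syscall(SYS_exit, 2);
+                /* write(2) is on the strict mode allow list */
+                if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0)
+                        syscall(SYS_exit, 3);
+                /* Use the raw exit syscall; glibc's _exit() would call
+                 * exit_group(2), which strict mode does not permit */
+                syscall(SYS_exit, 0);
+        }
+        if (waitpid(pid, &status, 0) < 0)
+                return -1;
+        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
+}
+```
+
+Calling `run_strict_hello()` from an ordinary `main()` should print the greeting
+and return the child's exit status.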
+
+It looks like I'll need to actually do some real filtering if I want to run my
+hello world program and not just use strict mode. To do this, we need to use
+`SECCOMP_MODE_FILTER` and pass a pointer to a `struct sock_fprog`, which is "a
+Berkeley Packet Filter program designed to filter arbitrary system calls and
+system call arguments."
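+
+For a sense of what such a program involves, here is a minimal sketch of a
+hand-written filter that allows only `fstat`, `write`, and `exit_group`. It
+assumes x86-64, and the `ALLOW_SYSCALL` macro, `hello_filter` array, and
+`install_hello_filter` function are names made up for illustration:
+
+```c
+#include <linux/audit.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+
+/* ALLOW_SYSCALL (hypothetical macro): allow the syscall if its number matches
+ * nr, otherwise skip ahead to the next check */
+#define ALLOW_SYSCALL(nr) \
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
+        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
+
+static struct sock_filter hello_filter[] = {
+        /* Kill unless the caller is x86-64; syscall numbers are per-arch */
+        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
+        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
+        /* Load the syscall number and compare it against the allow list */
+        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
+        ALLOW_SYSCALL(SYS_fstat),
+        ALLOW_SYSCALL(SYS_write),
+        ALLOW_SYSCALL(SYS_exit_group),
+        /* Anything that did not match is killed */
+        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
+};
+
+/* install_hello_filter loads the filter above into the kernel; returns 0 on
+ * success and -1 on failure */
+int
+install_hello_filter(void)
+{
+        struct sock_fprog prog = {
+                .len = sizeof(hello_filter) / sizeof(hello_filter[0]),
+                .filter = hello_filter,
+        };
+
+        /* Unprivileged processes must set no_new_privs before filtering */
+        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
+                return -1;
+        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+}
+```
+
+Every allowed syscall costs two instructions here, and the filter has to check
+the architecture itself because syscall numbers differ between architectures.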
+
+While we could construct a BPF program using an array of `struct sock_filter`
+instructions, looking at the chain of instructions we'd need to set up made me
+think it would be much easier to enlist the services of
+[`libseccomp`][libseccomp], a library designed for just this purpose. Let's try
+rewriting `hello.c` to use `libseccomp`, allowing the two syscalls we saw being
+killed (`fstat` and `exit_group`) plus `write`:
+
+```
+#include <seccomp.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+scmp_filter_ctx ctx;
+
+/* graceful_exit cleans up our seccomp context before exiting */
+void
+graceful_exit(int rc)
+{
+        seccomp_release(ctx);
+        exit(rc);
+}
+
+/* setup_seccomp initializes seccomp and loads our BPF program that filters
+ * syscalls into the kernel */
+void
+setup_seccomp()
+{
+        int rc;
+
+        /* Initialize the seccomp filter state */
+        if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
+                graceful_exit(1);
+        }
+        if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
+                graceful_exit(1);
+        }
+
+        /* Add allowed system calls to the BPF program */
+        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
+                graceful_exit(1);
+        }
+        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
+                graceful_exit(1);
+        }
+        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
+                graceful_exit(1);
+        }
+
+        /* Load the BPF program for the current context into the kernel */
+        if ((rc = seccomp_load(ctx)) != 0) {
+                graceful_exit(1);
+        }
+}
+
+int
+main()
+{
+        setup_seccomp();
+        printf("hello, world!\n");
+        graceful_exit(0);
+}
+```
+
+Since we're now using `libseccomp`, we need to tell our C compiler to link the
+library:
+
+```
+$ cc -o hello hello.c -lseccomp
+$ ./hello
+hello, world!
+```
+
+Success! Our program compiles and runs, and all of the necessary syscalls have
+been allowed. Now let's try modifying the `main()` function of our program to do
+something bad, like trying to read the password file `/etc/shadow`:
+
+```
+int
+main()
+{
+        FILE *fp;
+        setup_seccomp();
+        printf("hello, world!\n");
+        if ((fp = fopen("/etc/shadow", "r")) == NULL) {
+                perror("fopen");
+                graceful_exit(1);
+        }
+        fclose(fp);
+        graceful_exit(0);
+}
+```
+
+Now when we compile and run our program, we get:
+
+```
+$ ./hello
+hello, world!
+Bad system call (core dumped)
+```
+
+Nice! The kernel killed our program when we tried to use a system call
+(`openat`) that we didn't plan on!
+
+Now[^1] let's go back to how this all fits into Docker. Looking at [Docker's
+default `seccomp` profile][docker-default], a lot of it starts to make more
+sense. In fact, it looks like they're using the exact same names from
+`libseccomp` that we used in our program! If we search [the moby source code for
+`"libseccomp"`][moby], we can see that it is indeed being used (via Go
+bindings).
+
+Let's try to use a custom `seccomp` profile to prohibit programs in our Docker
+container from listening for network connections. To start, I want to make sure
+I can accept network connections, then modify my profile and watch it break. I
+downloaded the default `seccomp` profile to use as a starting point for
+tweaking, started a container with port 4000 open, then used `nc` to try
+communicating from my host machine to a listener in the Docker container:
+
+```
+$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
+/ # nc -l -p 4000
+```
+
+When I run `echo hi | nc 127.0.0.1 4000` in a separate terminal, my greeting is
+printed by the netcat listener in the Docker container---success! Now that I know
+my basic TCP server works, let's try blocking it with `seccomp`! To start
+listening on a TCP port, I know that `nc` has to use the `socket`, `bind`, and
+`listen` system calls (which we can verify using `strace`). I'll try removing
+them from the list of allowed system calls in the default profile, and run the
+Docker container again with the modified profile:
+
+```
+$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
+/ # nc -l -p 4000
+nc: socket(AF_INET,1,0): Operation not permitted
+```
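+
+The `Operation not permitted` error, rather than the process being killed, fits
+the default profile's `defaultAction` of `SCMP_ACT_ERRNO`: syscalls missing from
+the allow list fail with an error instead. For comparison, a much smaller
+profile can take the opposite approach, allowing everything and denying only
+the three calls. This is a sketch of that deny-list style, not the edited
+default profile used above, and it gives up most of the protection the default
+profile provides:
+
+```json
+{
+    "defaultAction": "SCMP_ACT_ALLOW",
+    "syscalls": [
+        {
+            "names": ["socket", "bind", "listen"],
+            "action": "SCMP_ACT_ERRNO"
+        }
+    ]
+}
+```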
+
+Awesome! We just used `seccomp` to control what our Docker container is allowed
+to do!
+
+I can imagine this might be helpful if you had an environment where security was
+extremely important and wanted to really lock down your containers, but it's
+hard to imagine that writing custom `seccomp` profiles for every container in
+your production environment is the best use of time without having some specific
+situation you're trying to address.
+
+* * *
+
+[^1]: I wanted to figure out how to allow `openat` to only open a specific file
+      name, but I couldn't figure out how to compare string system call
+      arguments. I could compare integer arguments to do things like only
+      allowing `fstat` and `write` to print to the console by comparing their
+      arguments with `1`, the file descriptor for standard output. Maybe the
+      lower level BPF program API would let me do this? But it was getting late
+      and I still wanted to finish my experimentation with Docker. If you know
+      how to do this, [please send me an email and let me know](/contact.html)!
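+
+The integer comparison mentioned in the footnote might look roughly like the
+sketch below, where `build_stdout_only_filter` is a hypothetical helper that
+builds the filter without loading it. As far as string arguments go, a seccomp
+filter only sees the raw argument values, so a `const char *` path shows up as
+an address that the BPF program has no way to dereference; matching on a file
+name would need some other mechanism.
+
+```c
+#include <seccomp.h>
+#include <stddef.h>
+#include <unistd.h>
+
+/* build_stdout_only_filter (hypothetical helper) builds, but does not load, a
+ * filter that allows fstat and write only when their first argument is the
+ * file descriptor for standard output. Returns NULL on failure. */
+scmp_filter_ctx
+build_stdout_only_filter(void)
+{
+        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
+
+        if (ctx == NULL)
+                return NULL;
+
+        /* The fourth argument is the number of argument comparisons that
+         * follow; SCMP_A0 compares the first syscall argument */
+        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 1,
+            SCMP_A0(SCMP_CMP_EQ, STDOUT_FILENO)) != 0 ||
+            seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
+            SCMP_A0(SCMP_CMP_EQ, STDOUT_FILENO)) != 0 ||
+            seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) != 0) {
+                seccomp_release(ctx);
+                return NULL;
+        }
+        return ctx;
+}
+```
+
+The caller would pass the returned context to `seccomp_load` and later to
+`seccomp_release`, as in `setup_seccomp` above.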
+
+[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
+[man-seccomp]: https://man7.org/linux/man-pages/man2/seccomp.2.html
+[strace]: https://strace.io/
+[libseccomp]: https://github.com/seccomp/libseccomp
+[docker-default]: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
+[moby]: https://github.com/moby/moby/search?q=libseccomp
+[wiki]: https://en.wikipedia.org/wiki/Seccomp#Software_using_seccomp_or_seccomp-bpf
diff --git a/contact.md b/contact.md
new file mode 100644
index 0000000..c5ced2c
--- /dev/null
+++ b/contact.md
@@ -0,0 +1,8 @@
+---
+title: Contact Me
+---
+
+# Contact Me
+
+The best way to contact me is to send an email to my first name
+`@benburwell.com`.