---
title: Learning About Syscall Filtering With Seccomp
---

I'd heard about being able to [run Docker containers with a custom "security
profile"][docker-seccomp], but wasn't really sure what that meant, or what was
happening behind the scenes, so I decided to do some experimentation to find
out.

It turns out that the Linux kernel includes a feature called "secure computing
mode," or `seccomp` for short. Using `seccomp` lets you tell the kernel that you
only expect your program to use a specific set of system calls, and if your
program makes any system calls that aren't in your approved list, the kernel
should kill your program.

<!--more-->

But why would you want to do this? I think if you had a pretty simple program,
using `seccomp` might be overkill. But if your program does different things on
the system depending on some possibly untrustworthy user input, it might make
sense to use. For example, if your program runs user-specified commands, you
might want to make sure that only an approved subset of system functionality is
available. Looking at [a list of software using `seccomp` on Wikipedia][wiki]
backs this up: the software listed are mostly hypervisors/container runners
(like Docker), web browsers, etc.

By reading [the manual page for the `seccomp(2)` system call][man-seccomp], we
can learn how to write a program to try this out. The simplest action is to
enter "strict mode", which prevents all system calls except for `read(2)`,
`write(2)`, `_exit(2)`, and `sigreturn(2)` --- in other words, what I think
should be just enough to write hello world! Let's give it a shot:

```
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <stdio.h>

int
main()
{
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
                perror("prctl");
                return 1;
        }
        printf("hello, world!\n");
        return 0;
}
```

When I compile and run my program, I just see **Killed** being printed, not
**hello, world!**. Well, this is pretty good evidence that `seccomp` is doing
_something_ --- it's at least killing my program! Let's try to find out why
using [`strace`, a program that shows you all of the system calls being
made][strace]:

```
$ strace ./hello
execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
brk(NULL)                               = 0x559e08463000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x7fe65b9ee000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
0x7fe65b3df000
mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe65b7c6000
mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
mprotect(0x559e077b9000, 4096, PROT_READ) = 0
mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
munmap(0x7fe65b9f0000, 25762)           = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
fstat(1,  <unfinished ...>)             = ?
+++ killed by SIGKILL +++
Killed
```

There's a lot of stuff in here that I don't fully understand about loading
dynamically linked libraries, reading the program binary, and mapping it into
memory, but the last few syscalls provide some clues: right after `prctl` is
called, we see `fstat` being called! `fstat` is a system call for getting the
status of a file, and `1` happens to be the file descriptor for standard output.
It makes sense that calling `printf` might involve checking the status of
standard output, so I tried commenting out the call to `printf` in `hello.c`.
When I compiled and ran the new version, it still just printed **Killed**, so I
used `strace` again. Just looking at the last few lines:

```
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
exit_group(0)                           = ?
+++ killed by SIGKILL +++
Killed
```

Now my program is making the `exit_group` system call. Looking back at the
manual page for `seccomp`, I see that it says:

> The only system calls that the calling thread is permitted to make are
> `read(2)`, `write(2)`, `_exit(2)` (but not `exit_group(2)`), and
> `sigreturn(2)`.
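
That parenthetical hints at a way to stay inside strict mode: skip stdio entirely and exit with the raw `exit` system call. Here's a minimal sketch of that idea, assuming Linux with glibc. The names `strict_hello` and `run_strict_hello` are made up for illustration, and the fork is only there so a parent process can report how the child ended:

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: enter strict mode, print, and exit using only syscalls that
 * strict mode permits. Never returns. */
static void
strict_hello(void)
{
        static const char msg[] = "hello, world!\n";

        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
                _exit(1); /* not filtered yet, so exit_group(2) is fine */
        }
        /* Call write(2) directly: no stdio, so no fstat(2) on stdout */
        if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0) {
                syscall(SYS_exit, 1);
        }
        /* Raw exit(2): glibc's _exit() would use exit_group(2) instead */
        syscall(SYS_exit, 0);
}

/* Runs strict_hello() in a forked child. Returns the child's exit status,
 * or -1 if the fork failed or the child was killed by a signal. */
int
run_strict_hello(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0) {
                return -1;
        }
        if (pid == 0) {
                strict_hello();
        }
        if (waitpid(pid, &status, 0) < 0) {
                return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_strict_hello()` from `main` should print the greeting and return 0 if the kernel accepts strict mode. It works, but giving up `printf` and the normal exit path is a steep price for a hello world.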

It looks like I'll need to do some real filtering if I want to run my hello
world program as written, instead of relying on strict mode. To do this, we need to use
`SECCOMP_MODE_FILTER` and pass a pointer to a `struct sock_fprog`, which is "a
Berkeley Packet Filter program designed to filter arbitrary system calls and
system call arguments."
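
To get a feel for what such a BPF program looks like, here's a rough sketch of a hand-written filter that allows `fstat`, `write`, and `exit_group` and kills on anything else. It assumes x86-64 and the kernel's userspace headers. `hello_filter`, `hello_filter_prog`, and `install_hello_filter` are hypothetical names, and newer C libraries may make different syscalls than the trace above shows, so treat the list as illustrative:

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Jump offsets count instructions to skip, relative to the next one. */
static struct sock_filter hello_filter[] = {
        /* [0] load the architecture of the calling process */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* [1] x86-64? skip the kill at [2]; otherwise fall through to it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        /* [2] unexpected architecture: kill */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* [3] load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [4]-[6] on a match, jump ahead to the allow at [8] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_fstat, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
        /* [7] no match: kill */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* [8] allow the system call */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wraps the instruction array in the struct the kernel expects */
struct sock_fprog
hello_filter_prog(void)
{
        struct sock_fprog prog = {
                .len = sizeof(hello_filter) / sizeof(hello_filter[0]),
                .filter = hello_filter,
        };
        return prog;
}

/* Loads the filter into the kernel; returns 0 on success, -1 on failure */
int
install_hello_filter(void)
{
        struct sock_fprog prog = hello_filter_prog();

        /* Unprivileged processes must set no_new_privs before filtering */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
                return -1;
        }
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Every allowed syscall adds another jump whose offset has to be kept in sync by hand, which is the tedious part.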

While we could construct a BPF program using an array of `struct sock_filter`
instructions, looking at the chain of instructions we'd need to set up made me
think it would be much easier to enlist the services of
[`libseccomp`][libseccomp], a library designed for just this purpose. Let's try
rewriting `hello.c` to use `libseccomp` and allow the three syscalls we saw
before (`fstat`, `write`, and `exit_group`):

```
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>

scmp_filter_ctx ctx;

/* graceful_exit cleans up our seccomp context before exiting */
void
graceful_exit(int rc)
{
        seccomp_release(ctx);
        exit(rc);
}

/* setup_seccomp initializes seccomp and loads our BPF program that filters
 * syscalls into the kernel */
void
setup_seccomp()
{
        int rc;

        /* Initialize the seccomp filter state */
        if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
                graceful_exit(1);
        }
        if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
                graceful_exit(1);
        }

        /* Add allowed system calls to the BPF program */
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
                graceful_exit(1);
        }
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
                graceful_exit(1);
        }
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
                graceful_exit(1);
        }

        /* Load the BPF program for the current context into the kernel */
        if ((rc = seccomp_load(ctx)) != 0) {
                graceful_exit(1);
        }
}

int
main()
{
        setup_seccomp();
        printf("hello, world!\n");
        graceful_exit(0);
}
```

Since we're now using `libseccomp`, we need to tell our C compiler to link the
library:

```
$ cc -o hello hello.c -lseccomp
$ ./hello
hello, world!
```

Success! Our program compiles and runs, and all of the necessary syscalls have
been allowed. Now let's try modifying the `main()` function of our program to do
something bad, like trying to read the password file `/etc/shadow`:

```
int
main()
{
        FILE *fd;
        setup_seccomp();
        printf("hello, world!\n");
        if ((fd = fopen("/etc/shadow", "r")) == NULL) {
                perror("fopen");
                graceful_exit(1);
        }
        fclose(fd);
        graceful_exit(0);
}
```

Now when we compile and run our program, we get:

```
$ ./hello
hello, world!
Bad system call (core dumped)
```

Nice! The kernel killed our program when we tried to use a system call
(`openat`) that we didn't plan on!
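
Killing the process isn't the only thing a filter can do. As a small sketch using the kernel's raw BPF interface instead of `libseccomp` (purely for illustration), a filter can return `SECCOMP_RET_ERRNO` so that a syscall fails with an error code. The function name `probe_errno_action` and the choice of `getppid` as the denied call are arbitrary, and the architecture check is left out to keep it short:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that denies getppid(2) with EPERM and then calls it.
 * Returns 0 if the call failed with EPERM, 1 if it did not, 2 if the filter
 * could not be installed, and -1 on fork/wait errors or a killed child. */
int
probe_errno_action(void)
{
        struct sock_filter filter[] = {
                /* load the system call number */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
                /* getppid? fall through to the errno; otherwise skip to allow */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(filter) / sizeof(filter[0]),
                .filter = filter,
        };
        int status;
        pid_t pid = fork();

        if (pid < 0) {
                return -1;
        }
        if (pid == 0) {
                long rc;

                if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
                    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
                        _exit(2);
                }
                errno = 0;
                rc = syscall(SYS_getppid);
                /* The process keeps running; only this one call failed */
                _exit(rc == -1 && errno == EPERM ? 0 : 1);
        }
        if (waitpid(pid, &status, 0) < 0) {
                return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The `libseccomp` counterpart of this action is `SCMP_ACT_ERRNO(EPERM)`, which can be passed in the same places as `SCMP_ACT_KILL`.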

Now[^1] let's go back to how this all fits into Docker. Looking at [Docker's
default `seccomp` profile][docker-default], a lot of it starts to make more
sense. In fact, it looks like they're using the exact same names from
`libseccomp` that we used in our program! If we search [the moby source code for
`"libseccomp"`][moby], we can see that it is indeed being used (via Go
bindings).

Let's try to use a custom `seccomp` profile to prohibit programs in our Docker
container from listening for network connections. To start, I want to make sure
I can accept network connections, then modify my profile and watch it break. I
downloaded the default `seccomp` profile to use as a starting point for
tweaking, started a container with port 4000 open, then used `nc` to try
communicating from my host machine to a listener in the Docker container:

```
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
```

When I run `echo hi | nc 127.0.0.1 4000` in a separate terminal, my greeting is
printed by the netcat listener in the Docker container---success! Now that I know
my basic TCP server works, let's try blocking it with `seccomp`! To start
listening on a TCP port, I know that `nc` has to use the `socket`, `bind`, and
`listen` system calls (which we can verify using `strace`). I'll try removing
them from the list of allowed system calls in the default profile, and run the
Docker container again with the modified profile:

```
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
nc: socket(AF_INET,1,0): Operation not permitted
```

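Awesome! We just used `seccomp` to control what our Docker container is allowed
to do! Notice that `nc` got an error back instead of being killed like our C
program was: the default profile's `defaultAction` is `SCMP_ACT_ERRNO`, which
makes a disallowed system call fail rather than terminating the process.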
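
For comparison, here's a sketch of a much smaller profile that takes the opposite approach: allow everything by default and deny only the three calls. This is not the edited default profile used above, just an illustration of the file format, and it is far more permissive than Docker's default:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket", "bind", "listen"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Saved under a name of your choosing and passed with `--security-opt seccomp=<file>`, it should make `nc -l` fail in a similar way while leaving every other syscall unfiltered.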

I can imagine this might be helpful if you had an environment where security was
extremely important and wanted to really lock down your containers, but it's
hard to imagine that writing custom `seccomp` profiles for every container in
your production environment is the best use of time without having some specific
situation you're trying to address.

* * *

[^1]: I wanted to figure out how to allow `openat` to only open a specific file
      name, but I couldn't figure out how to compare string system call
      arguments. I could compare integer arguments to do things like only
      allowing `fstat` and `write` to print to the console by comparing their
      arguments with `1`, the file descriptor for standard output. Maybe the
      lower-level BPF program API would let me do this? But it was getting late
      and I still wanted to finish my experimentation with Docker. If you know
      how to do this, [please send me an email and let me know](/contact.html)!

[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
[man-seccomp]: https://man7.org/linux/man-pages/man2/seccomp.2.html
[strace]: https://strace.io/
[libseccomp]: https://github.com/seccomp/libseccomp
[docker-default]: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
[moby]: https://github.com/moby/moby/search?q=libseccomp
[wiki]: https://en.wikipedia.org/wiki/Seccomp#Software_using_seccomp_or_seccomp-bpf