nsjail

woj/nsjail

Author	SHA1	Message	Date
Disconnect3d	f7265e0690	cgroup2.cc: improve note about using Docker	2023-05-29 00:19:31 +02:00

Improve the error log message when Nsjail fails to write to the `/sys/fs/cgroup/cgroup.subtree_control` file when it attempts to setup the cgroupv2 configuration.

The previous message looked like this:

```
[E][2023-05-28T21:52:56+0000][8807] writeBufToFile():105 Couldn't write '7' bytes to file '/sys/fs/cgroup/cgroup.subtree_control' (fd='4'): Device or resource busy
[E][2023-05-28T21:52:56+0000][8807] enableCgroupSubtree():95 Could not apply '+memory' to cgroup.subtree_control in '/sys/fs/cgroup'. If you are running in Docker, nsjail MUST be the root process to use cgroups.
[E][2023-05-28T21:52:56+0000][8807] main():354 Couldn't setup parent cgroup (cgroupv2)
```

It could have been confusing because nsjail may have already been running as real root with full capabilities, e.g., when the user ran the container with the `--privileged --user 0:0` flags.

In such a case, the issue is that Docker enters new pid, uts, network, ipc, mount and cgroup namespaces (but not user or time namespaces, fwiw) and I believe that if you do so after the cgroupv2 filesystem is mounted, the root of its filesystem hierarchy will start to render only a subtree, or, generally a limited view of the cgroup. This can be seen below.

On the host, we can see the cgroup sub-hierarchies and the `cgroup.subtree_control` shows us the controllers properly:

```
# ls /sys/fs/cgroup/
cgroup.controllers      cgroup.threads         dev-mqueue.mount  memory.numa_stat               system.slice
cgroup.max.depth        cpu.pressure           init.scope        memory.pressure                user.slice
cgroup.max.descendants  cpuset.cpus.effective  io.cost.model     memory.stat
cgroup.procs            cpuset.mems.effective  io.cost.qos       sys-fs-fuse-connections.mount
cgroup.stat             cpu.stat               io.pressure       sys-kernel-config.mount
cgroup.subtree_control  dev-hugepages.mount    io.stat           sys-kernel-debug.mount
# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory hugetlb pids rdma
```

However, even in a privileged container, we can't see the same:

```
# sudo docker run --rm -it --privileged nsjail ls /sys/fs/cgroup
cgroup.controllers      cpuset.cpus               memory.events.local
cgroup.events           cpuset.cpus.effective     memory.high
cgroup.freeze           cpuset.cpus.partition     memory.low
cgroup.kill             cpuset.mems               memory.max
cgroup.max.depth        cpuset.mems.effective     memory.min
cgroup.max.descendants  hugetlb.2MB.current       memory.numa_stat
cgroup.procs            hugetlb.2MB.events        memory.oom.group
cgroup.stat             hugetlb.2MB.events.local  memory.pressure
cgroup.subtree_control  hugetlb.2MB.max           memory.stat
cgroup.threads          hugetlb.2MB.rsvd.current  memory.swap.current
cgroup.type             hugetlb.2MB.rsvd.max      memory.swap.events
cpu.idle                io.latency                memory.swap.high
cpu.max                 io.max                    memory.swap.max
cpu.max.burst           io.pressure               pids.current
cpu.pressure            io.stat                   pids.events
cpu.stat                io.weight                 pids.max
cpu.weight              memory.current            rdma.current
cpu.weight.nice         memory.events             rdma.max
# sudo docker run --rm -it --privileged nsjail cat /sys/fs/cgroup/cgroup.subtree_control
#
```

Of course, the namespaces themselves can be seen by comparing them like this:

```
// HOST
# ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 May 28 22:17 .
dr-xr-xr-x 9 root root 0 May 28 22:17 ..
lrwxrwxrwx 1 root root 0 May 28 22:17 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 28 22:17 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 May 28 22:17 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 May 28 22:17 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 28 22:17 uts -> 'uts:[4026531838]'

// CONTAINER
# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns
total 0
dr-x--x--x 2 user user 0 May 28 22:17 .
dr-xr-xr-x 9 user user 0 May 28 22:17 ..
lrwxrwxrwx 1 user user 0 May 28 22:17 cgroup -> 'cgroup:[4026532381]'
lrwxrwxrwx 1 user user 0 May 28 22:17 ipc -> 'ipc:[4026532317]'
lrwxrwxrwx 1 user user 0 May 28 22:17 mnt -> 'mnt:[4026532315]'
lrwxrwxrwx 1 user user 0 May 28 22:17 net -> 'net:[4026532319]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid_for_children -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 May 28 22:17 uts -> 'uts:[4026532316]'
```

Anyway, passing `--cgroupns=host` solves this problem, which can be seen below:

```
# ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 root root 0 May 28 22:18 cgroup -> cgroup:[4026531835]
# sudo docker run --rm -it --cgroupns=host --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026531835]'
# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026532381]'
```

Robert Swiecki	603ba857e9	logs: respect getenv(NO_COLOR)	2023-05-28 09:12:23 +02:00
Robert Swiecki	454cfb509f	configs/hexchat: new config based on xchat	2023-05-26 08:42:52 +02:00
Wiktor Garbacz	f920c9194e	Mount read-only directly if mounting rw fails. For new mounts, if MNT_LOCK_READONLY is locked on the visible mnt, mount_too_revealing will fail and the whole mount will fail. Those mounts need to be created with the readonly flag set.	2023-05-16 14:07:22 +02:00
Robert Swiecki	5b48117a09	configs/xchat: mount whole /tmp/.X11-unix	2023-01-03 08:11:47 +01:00
Robert Swiecki	c7c0adfffe	config.prot: document disable_tsc	2022-11-22 22:25:15 +01:00
Robert Swiecki	2d9b694ca2	Readme: new output	2022-11-22 22:21:50 +01:00
Robert Swiecki	f2dc5966e3	all: unify comments on /**/	2022-11-22 22:19:05 +01:00
Robert Swiecki	cc4245d23a	make indent depend + style of comments	2022-11-22 22:15:01 +01:00
robertswiecki	4437810830	Merge pull request #208 from ndrewh/cgroupsv2-fix Setup cgroup.subtree_control controllers when necessary in cgroupsv2	2022-11-22 22:12:12 +01:00
Andrew Haberlandt	12df56b9f1	Setup cgroup.subtree_control controllers when necessary in cgroupsv2. This commit adds extra setup when cgroupsv2 is enabled. In particular, we make sure that the root namespace has set up cgroup.subtree_control with the controllers we need. If the necessary controllers are not listed, we have to move all processes out of the root namespace before we can change this (the 'no internal processes' rule: https://unix.stackexchange.com/a/713343). Currently we only handle the case where the nsjail process is the only process in the cgroup. It seems like this would be relatively rare, but since nsjail is frequently the root process in a Docker container (e.g. for hosting CTF challenges), I think this case is common enough to make it worth implementing. This also adds `--detect_cgroupv2`, which will attempt to detect whether `--cgroupv2_mount` is a valid cgroupv2 mount, and if so it will set `use_cgroupv2`. This is useful in containerized environments where you may not know the kernel version ahead of time. References: https://github.com/redpwn/jail/blob/master/internal/cgroup/cgroup2.go	2022-11-17 17:09:40 -05:00
Oliver Newman	90e285450d	Unset LDFLAGS for kafel. Otherwise kafel inherits nsjail's LDFLAGS, which isn't intended and causes build failures.	2022-11-16 09:18:53 -08:00
Wiktor Garbacz	e3a8607ef5	Add missing cerrno include	2022-11-10 10:48:25 +01:00
Robert Swiecki	4567c78a27	config/xchat: move original .xchat2 config dir to .config/	2022-10-25 14:55:04 +02:00
Robert Swiecki	fdc640e20c	subproc.h: make cloneProc declaration match the definition	2022-10-25 08:33:23 +02:00
Robert Swiecki	285ea15811	subproc/mount: use better types for flags, u64 for clone, unsigned long for mount	2022-10-24 13:12:20 +02:00
Wiktor Garbacz	2e62649b4c	Update kafel	2022-10-14 11:54:25 +02:00
Robert Swiecki	dc42a5d003	configs/bash: remove tmpfs mount over /dev as it makes /dev/null non-writeable	2022-09-15 16:12:13 +02:00
Robert Swiecki	454b051599	configs/firefox-with-net-wayland: x11 socket is not needed here	2022-09-10 16:32:06 +02:00
Robert Swiecki	80b26e7554	caps: shorter std::string::append	2022-09-06 17:44:55 +02:00
Robert Swiecki	b87f983463	configs: make configs using X11 more versatile	2022-09-04 12:07:55 +02:00
Robert Swiecki	a22bb2e437	make indent	2022-08-27 21:17:43 +02:00
Robert Swiecki	595cdc8916	nsjail: use atomic in sighandlers	2022-08-26 14:40:46 +02:00
Robert Swiecki	9a8d440a7c	configs/xchat-with-net: use 8.8.8.8 in resolv.conf unconditionally	2022-08-26 00:44:21 +02:00
Robert Swiecki	c63e5b39e8	use QC() across the code	2022-08-10 15:23:53 +02:00
Robert Swiecki	730b890ded	cpu: more debug messaging	2022-08-10 15:02:53 +02:00
Robert Swiecki	30c81ce01f	configs: block sched_setaffinity where max_cpus is used	2022-08-09 16:40:07 +02:00
Robert Swiecki	b3fcc30aec	cpu: more debugging messages	2022-08-09 16:13:03 +02:00
Robert Swiecki	f628f74b00	mnt: quote paths in log messages	2022-08-09 12:06:42 +02:00
Robert Swiecki	e98dc415fc	Switch C++ standard to C++14 - it'll allow to use new features, like std::quoted	2022-08-09 11:34:18 +02:00
Robert Swiecki	4128a7cbd9	mnt: remove unnecessary quote in a debug message	2022-08-09 11:32:49 +02:00
Robert Swiecki	38fcf4f752	subproc: type + const string& in the iterator	2022-08-09 10:44:25 +02:00
Robert Swiecki	8e3ca99c3f	cpu/subproc: better debugging strings	2022-08-09 00:03:20 +02:00
Robert Swiecki	0d292e7be7	cpu: even better LOG_Ds	2022-08-06 09:20:11 +02:00
Robert Swiecki	a33f3a81ca	cpu: Add more debugging messages	2022-08-05 08:43:39 +02:00
Robert Swiecki	9aee3dd831	Make logs more efficient by avoiding argument evaluation for LOG* if it's not needed at the current level	2022-08-05 08:42:37 +02:00
Robert Swiecki	856cb0f2ec	When setting CPU affinity, take into consideration the current CPU affinity set. Use only CPU numbers that exist in the current affinity set. Maybe fixes https://github.com/google/nsjail/issues/200	2022-08-04 19:22:33 +02:00
Robert Swiecki	57ed22dfdf	make indent	2022-06-11 12:08:50 +02:00
robertswiecki	d88be25986	Merge pull request #197 from pks-t/pks-forward-signals Optionally forward fatal signals	2022-06-11 12:08:21 +02:00
Patrick Steinhardt	df21a972b6	nsjail: Optionally forward fatal signals. Currently, we always kill children by sending them a SIGKILL signal in case we've got a fatal signal. This is rather inflexible and forbids some use cases where e.g. child processes listen for specific signals to shut down gracefully. Add a new command configuration `--forward_signals` that allows the user to opt-in to forwarding fatal signals to the child process.	2022-06-05 19:38:32 +02:00
Patrick Steinhardt	a517934aba	subproc: Allow killing subprocesses with different signal `subproc::killAndReapAll()` is always killing the child process with the SIGKILL signal. We're about to make this configurable though so that we may optionally forward signals received by nsjail to the child process. Add a new parameter to `killAndReapAll()` to prepare for this change.	2022-06-05 19:36:50 +02:00
Robert Swiecki	6483728e24	config: better config parsing debugging	2022-03-15 00:44:33 +01:00
robertswiecki	e678c25b32	Merge pull request #193 from 243f6a8885a308d313198a2e037/fix/20220223_typo_siutime subproc.cc: fix typo: SiUime -> SiUtime	2022-02-26 19:51:23 +01:00
243f6a8885a308d313198a2e037	472932c6f0	subproc.cc: fix typo: SiUime -> SiUtime	2022-02-23 14:41:23 +09:00
Robert Swiecki	91d5c9871a	log.h: no need to use __PRETTY_FUNCTION__ as it makes it harder to read log messages, just __FUNCTION__ should be 'good enough' for debugging	2022-02-18 20:26:52 +01:00
Robert Swiecki	02458084fe	contain: call prctl(PR_SET_TSC) under x86/x86-64 only	2022-02-18 16:12:27 +01:00
robertswiecki	8e4cc83eb2	Merge pull request #192 from mkow/mkow/disable-tsc-docs Add more docs for disable_tsc + update README	2022-02-18 01:28:39 +01:00
Michał Kowalczyk	e9d00e3d7e	README.md: Update usage to the current version	2022-02-18 00:42:34 +01:00
Michał Kowalczyk	f4abf7b726	config: Add more docs for `disable_tsc`	2022-02-18 00:33:52 +01:00
Robert Swiecki	cdf8e8f14c	config: info about prctl(PR_SET_TSC, PR_TSC_ENABLE) being intel-only	2022-02-18 00:15:12 +01:00
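
Commit 12df56b9f1 in the log above mentions `--detect_cgroupv2`, which tries to determine whether `--cgroupv2_mount` is a valid cgroupv2 mount. The log does not show how that check is implemented, so the following is only a minimal sketch of one common way to do it on Linux: call `statfs()` on the path and compare the reported filesystem type with `CGROUP2_SUPER_MAGIC`. The helper name `isCgroup2Mount` is hypothetical and is not taken from nsjail.

```cpp
#include <linux/magic.h>  // CGROUP2_SUPER_MAGIC (on reasonably recent kernel headers)
#include <sys/vfs.h>      // statfs()

#include <string>

// Fallback for older headers that lack the constant.
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif

// Hypothetical helper (an assumption for illustration, not nsjail's code):
// returns true if `path` lives on a cgroup v2 filesystem.
bool isCgroup2Mount(const std::string& path) {
    struct statfs buf;
    if (statfs(path.c_str(), &buf) != 0) {
        // Missing path or insufficient permissions: treat as "not cgroup v2".
        return false;
    }
    return buf.f_type == CGROUP2_SUPER_MAGIC;
}
```

On a host with the unified hierarchy mounted at `/sys/fs/cgroup`, calling `isCgroup2Mount("/sys/fs/cgroup")` would be expected to return true, while an ordinary directory such as `/` should return false. A caller could use the result to decide whether to enable cgroup v2 handling without knowing the kernel version in advance.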


1193 Commits
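
Commit 856cb0f2ec in the log above describes restricting CPU affinity to CPU numbers that exist in the current affinity set. The sketch below illustrates that idea under simplifying assumptions: it picks the lowest-numbered allowed CPUs, which is a deterministic stand-in and not necessarily how nsjail selects them. The helper names `pickCpus` and `limitToCurrentAffinity` are hypothetical.

```cpp
#include <sched.h>  // cpu_set_t, CPU_* macros, sched_getaffinity, sched_setaffinity

#include <cstddef>
#include <vector>

// Hypothetical helper: returns up to max_cpus CPU numbers, chosen only from
// those present in `allowed`, lowest-numbered first.
std::vector<int> pickCpus(const cpu_set_t& allowed, std::size_t max_cpus) {
    std::vector<int> picked;
    for (int cpu = 0; cpu < CPU_SETSIZE && picked.size() < max_cpus; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            picked.push_back(cpu);
        }
    }
    return picked;
}

// Hypothetical helper: reads the calling thread's current affinity set and
// narrows it to at most max_cpus CPUs taken from that same set.
bool limitToCurrentAffinity(std::size_t max_cpus) {
    cpu_set_t current;
    CPU_ZERO(&current);
    if (sched_getaffinity(0, sizeof(current), &current) == -1) {
        return false;
    }

    cpu_set_t wanted;
    CPU_ZERO(&wanted);
    for (int cpu : pickCpus(current, max_cpus)) {
        CPU_SET(cpu, &wanted);
    }
    if (CPU_COUNT(&wanted) == 0) {
        return false;  // Nothing to pin to (e.g. max_cpus == 0).
    }
    return sched_setaffinity(0, sizeof(wanted), &wanted) == 0;
}
```

Selecting from the current set rather than from the range `0..N-1` matters when the process already runs with a restricted set (for example inside a container limited to a few CPUs), because requesting a CPU outside that set can make `sched_setaffinity` fail.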