cgroup2.cc: improve note about using Docker

Improve the error message that nsjail logs when it fails to write to the `/sys/fs/cgroup/cgroup.subtree_control` file while setting up the cgroupv2 configuration.

The previous message looked like this:

```
[E][2023-05-28T21:52:56+0000][8807] writeBufToFile():105 Couldn't write '7' bytes to file '/sys/fs/cgroup/cgroup.subtree_control' (fd='4'): Device or resource busy
[E][2023-05-28T21:52:56+0000][8807] enableCgroupSubtree():95 Could not apply '+memory' to cgroup.subtree_control in '/sys/fs/cgroup'. If you are running in Docker, nsjail MUST be the root process to use cgroups.
[E][2023-05-28T21:52:56+0000][8807] main():354 Couldn't setup parent cgroup (cgroupv2)
```
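For context, the first log line corresponds to a 7-byte write of `+memory` into `cgroup.subtree_control`. The following is only a minimal sketch of that operation, not nsjail's actual code: the helper names `subtreeControlToken` and `enableController` are hypothetical, and the sketch returns the raw `errno` value so a caller can tell `EBUSY` ("Device or resource busy") apart from other failures.

```cpp
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: builds the token that enables a controller for child
// cgroups, e.g. "memory" -> "+memory" (the 7 bytes seen in the log above).
std::string subtreeControlToken(const std::string& controller) {
    return "+" + controller;
}

// Hypothetical helper: writes the token to <cgroupRoot>/cgroup.subtree_control.
// Returns 0 on success, or the errno value of the failing call.
int enableController(const std::string& cgroupRoot, const std::string& controller) {
    const std::string path = cgroupRoot + "/cgroup.subtree_control";
    const std::string token = subtreeControlToken(controller);

    const int fd = open(path.c_str(), O_WRONLY | O_CLOEXEC);
    if (fd == -1) {
        return errno;
    }

    const ssize_t written = write(fd, token.data(), token.size());
    int err = 0;
    if (written == -1) {
        err = errno;  // e.g. EBUSY, as in the log above
    } else if (static_cast<size_t>(written) != token.size()) {
        err = EIO;    // short write, treated as a generic I/O error in this sketch
    }
    close(fd);
    return err;
}
```

A caller could compare the result of `enableController("/sys/fs/cgroup", "memory")` against `EBUSY` to decide whether to print a hint about the cgroup namespace.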

This message could be confusing because nsjail may already have been running as real root with full capabilities, e.g., when the user started the container with the `--privileged --user 0:0` flags. In that case, the actual issue is that Docker enters new pid, uts, network, ipc, mount and cgroup namespaces (but not user or time namespaces, fwiw). I believe that if this happens after the cgroupv2 filesystem is mounted, the root of the mounted hierarchy renders only a subtree or, more generally, a limited view of the cgroup hierarchy.

This can be seen below. On the host, we can see the cgroup sub-hierarchies and the `cgroup.subtree_control` shows us the controllers properly:

```
# ls /sys/fs/cgroup/
cgroup.controllers      cgroup.threads         dev-mqueue.mount  memory.numa_stat               system.slice
cgroup.max.depth        cpu.pressure           init.scope        memory.pressure                user.slice
cgroup.max.descendants  cpuset.cpus.effective  io.cost.model     memory.stat
cgroup.procs            cpuset.mems.effective  io.cost.qos       sys-fs-fuse-connections.mount
cgroup.stat             cpu.stat               io.pressure       sys-kernel-config.mount
cgroup.subtree_control  dev-hugepages.mount    io.stat           sys-kernel-debug.mount

# cat /sys/fs/cgroup/cgroup.subtree_control 
cpuset cpu io memory hugetlb pids rdma
```

However, even a privileged container sees neither the host's sub-hierarchies nor any enabled controllers in `cgroup.subtree_control`:

```
# sudo docker run --rm -it --privileged nsjail ls /sys/fs/cgroup
cgroup.controllers	cpuset.cpus		  memory.events.local
cgroup.events		cpuset.cpus.effective	  memory.high
cgroup.freeze		cpuset.cpus.partition	  memory.low
cgroup.kill		cpuset.mems		  memory.max
cgroup.max.depth	cpuset.mems.effective	  memory.min
cgroup.max.descendants	hugetlb.2MB.current	  memory.numa_stat
cgroup.procs		hugetlb.2MB.events	  memory.oom.group
cgroup.stat		hugetlb.2MB.events.local  memory.pressure
cgroup.subtree_control	hugetlb.2MB.max		  memory.stat
cgroup.threads		hugetlb.2MB.rsvd.current  memory.swap.current
cgroup.type		hugetlb.2MB.rsvd.max	  memory.swap.events
cpu.idle		io.latency		  memory.swap.high
cpu.max			io.max			  memory.swap.max
cpu.max.burst		io.pressure		  pids.current
cpu.pressure		io.stat			  pids.events
cpu.stat		io.weight		  pids.max
cpu.weight		memory.current		  rdma.current
cpu.weight.nice		memory.events		  rdma.max

# sudo docker run --rm -it --privileged nsjail cat /sys/fs/cgroup/cgroup.subtree_control
 
# 
```
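The difference between the two `cgroup.subtree_control` outputs can also be checked programmatically. This is a small sketch under the assumption that the file holds a whitespace-separated list of controller names, as in the host output above; `parseControllers` and `readControllers` are hypothetical names, not part of nsjail.

```cpp
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: splits the contents of cgroup.subtree_control (or
// cgroup.controllers) into a set of controller names.
std::set<std::string> parseControllers(const std::string& contents) {
    std::set<std::string> controllers;
    std::istringstream in(contents);
    std::string name;
    while (in >> name) {
        controllers.insert(name);
    }
    return controllers;
}

// Hypothetical helper: reads and parses a controller file. An unreadable
// file yields an empty set in this sketch.
std::set<std::string> readControllers(const std::string& path) {
    std::ifstream file(path);
    std::stringstream buffer;
    buffer << file.rdbuf();
    return parseControllers(buffer.str());
}
```

With the host output shown above, the parsed set contains `memory`; with the blank output from the container, the set is empty.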

Of course, the namespaces themselves can be compared like this:

```
// HOST
# ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 May 28 22:17 .
dr-xr-xr-x 9 root root 0 May 28 22:17 ..
lrwxrwxrwx 1 root root 0 May 28 22:17 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 28 22:17 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 May 28 22:17 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 May 28 22:17 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 28 22:17 uts -> 'uts:[4026531838]'

// CONTAINER
# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns
total 0
dr-x--x--x 2 user user 0 May 28 22:17 .
dr-xr-xr-x 9 user user 0 May 28 22:17 ..
lrwxrwxrwx 1 user user 0 May 28 22:17 cgroup -> 'cgroup:[4026532381]'
lrwxrwxrwx 1 user user 0 May 28 22:17 ipc -> 'ipc:[4026532317]'
lrwxrwxrwx 1 user user 0 May 28 22:17 mnt -> 'mnt:[4026532315]'
lrwxrwxrwx 1 user user 0 May 28 22:17 net -> 'net:[4026532319]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid_for_children -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 May 28 22:17 uts -> 'uts:[4026532316]'
```
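The same comparison can be made from code by reading the `/proc/self/ns/<name>` symlinks and comparing the inode numbers inside the brackets. The sketch below assumes the `name:[inode]` link format shown in the listings above; `readNsLink` and `parseNsInode` are hypothetical helpers introduced only for illustration.

```cpp
#include <cctype>
#include <string>

#include <unistd.h>

// Hypothetical helper: returns the target of /proc/self/ns/<name>,
// e.g. "cgroup:[4026531835]", or an empty string on failure.
std::string readNsLink(const std::string& name) {
    const std::string path = "/proc/self/ns/" + name;
    char buf[64];
    const ssize_t len = readlink(path.c_str(), buf, sizeof(buf));
    if (len <= 0) {
        return "";
    }
    return std::string(buf, static_cast<size_t>(len));
}

// Hypothetical helper: extracts the inode number from "name:[inode]".
// Returns false if the string does not match that shape.
bool parseNsInode(const std::string& link, unsigned long long* inode) {
    const std::string::size_type open = link.find(":[");
    const std::string::size_type close = link.rfind(']');
    if (open == std::string::npos || close == std::string::npos || close <= open + 2) {
        return false;
    }
    const std::string digits = link.substr(open + 2, close - open - 2);
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    *inode = std::stoull(digits);
    return true;
}
```

Two processes share a cgroup namespace when the inode parsed from their `cgroup` links is equal. In the listings above, the `cgroup` inodes differ between host and container, while the `user` and `time` inodes match.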

Passing `--cgroupns=host` to `docker run` solves this problem, because the container then stays in the host's cgroup namespace, which can be seen below:

```
# ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 root root 0 May 28 22:18 cgroup -> cgroup:[4026531835]

# sudo docker run --rm -it --cgroupns=host --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026531835]'

# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026532381]'
```
Disconnect3d 2023-05-29 00:19:31 +02:00 committed by GitHub
parent 603ba857e9
commit f7265e0690

@@ -93,8 +93,11 @@ static bool enableCgroupSubtree(nsjconf_t *nsjconf, const std::string &controlle
 }
 }
 LOG_E(
-"Could not apply '%s' to cgroup.subtree_control in '%s'. If you are running in Docker, "
-"nsjail MUST be the root process to use cgroups.",
+"Could not apply '%s' to cgroup.subtree_control in '%s'. nsjail MUST be run from root "
+"and the cgroup mount path must refer to the root/host cgroup to use cgroupv2. If you "
+"use Docker, you may need to run the container with --cgroupns=host so that nsjail can"
+" access the host/root cgroupv2 hierarchy. An alternative is mounting (or remounting) "
+"the cgroupv2 filesystem but using the flag is just simpler.",
 val.c_str(), cgroup_path.c_str());
 return false;
 }