CARVIEW |
Navigation Menu
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
CRI: Improve the /dev/shm mount options in Sandbox. #6913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @wllenyj. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
pkg/cri/server/sandbox_run_linux.go
Outdated
specOpts = append(specOpts, oci.WithMounts([]runtimespec.Mount{ | ||
{ | ||
Source: sandboxDevShm, | ||
Destination: devShm, | ||
Type: "bind", | ||
Options: []string{"rbind", "ro"}, | ||
Options: []string{"rbind", "rw"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For writing what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do not intend to write anything, you may just need to append these flags
https://github.com/moby/moby/blob/v20.10.5/daemon/oci_linux.go#L420-L450
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @AkihiroSuda , with help from you I found the root cause of the failure.
In linux/fs/namespace.c:do_remount(), it checks the remount operation doesn't reset (instead of modifying) RO,NOSUID,NODEV,NOEXEC and MNT_ATIME_xxx.
/* Don't allow changing of locked mnt flags.
*
* No locks need to be held here while testing the various
* MNT_LOCK flags because those flags can never be cleared
* once they are set.
*/
if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
!(mnt_flags & MNT_READONLY)) {
return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
!(mnt_flags & MNT_NODEV)) {
return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
!(mnt_flags & MNT_NOSUID)) {
return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
!(mnt_flags & MNT_NOEXEC)) {
return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
return -EPERM;
}
Originally the /dev/shm
was mount as
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
.
So when doing in bind mounting in pause container with user namespace enabled, in addition to changing rw
to ro
, we still need those "nosuid,nodev,noexec,relatime" flags. With those flags, it succeeds:
root@f8ff64bfc7d8:/kata# mount -o bind,ro,nosuid,nodev,noexec,relatime /dev/shm /dev/shm
root@f8ff64bfc7d8:/kata# mount -o bind,ro /dev/shm /dev/shm
mount: /dev/shm: filesystem was mounted, but any subsequent operation failed: Unknown error 5005.
So we have a better solution now and will update the patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more information for the root cause. When the mount flags contains "BIND | READONLY", it will cause a second mount syscall with "mount_flags | REMOUNT", which will handled by do_remount()
, though causes the failure. By changing "ro" to "rw", the second mount syscall will not happen, so it succeeds.
Wait. Do I miss something? The CRI has been supported IDMapping?😂 |
Take it easy, the IDMapping feature is still under-development for CRI:) |
@rata is working on this. If you don't mind, share the picture in community will be helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wllenyj Please, try to cc me on userns realted patches. I'm actively working on this. I found this PR because @fuweid mentioned me here. It will be great if you can mention me in the PR description the next time.
As I said in some other issue, I already have a working PoC with all the problems found and fixed (it is a PoC only, not something ready for upstreaming yet).
It will be nice if we can sync on slack or something, to join forces. You don't need to discover the problems I already discovered, and we can sync to contribute to containerd. I think it probably makes sense to do this before opening new PRs that might duplicate the work the other is doing. What do you think?
I'm Rodrigo Campos or rata at CNCF or k8s slack. I couldn't find you there with your gihub handle, but feel free to write me on slack.
Also, I haven't opened PRs with upstream containerd yet because I was focusing on getting the CRI changes and the KEP for k8s 1.25. Once that is done, I'll be way more active here, of course :)
pkg/cri/server/sandbox_run_linux.go
Outdated
@@ -101,12 +101,13 @@ func (c *criService) sandboxContainerSpec(id string, config *runtime.PodSandboxC | |||
if nsOptions.GetIpc() == runtime.NamespaceMode_NODE { | |||
sandboxDevShm = devShm | |||
} | |||
specOpts = append(specOpts, oci.WithoutMounts("/dev/shm")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed AFAIK. IIRC no other part of this function is doing the "remove X first", in case it was added. Here we are constructing it for the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an optimization to avoid two tmpfs filesystems for /dev/shm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiangliu optimization?
I don't follow... oci.WithoutMounts
is this code. This just removes the related mount. I'm saying that mount should not be there, because we are creating it for the first time in this function. Removing something that should not be there is not an optimization, but the opposite ;)
Also, this WithoutXXX pattern is also not used in other parts of this function.
Do we really need this? Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really an optimization:) And a little complex to tell the story.
The first instance of /dev/shm
is added along the call stack:
criService::sandboxContainerSpec() -> criService::runtimeSpec() -> oci.GenerateSpec -> oci.generateDefaultSpecWithPlatform() -> oci.populateDefaultUnixSpec() -> oci.defaultMounts() ->
{
Destination: "/dev/shm",
Type: "tmpfs",
Source: "shm",
Options: []string{"nosuid", "noexec", "nodev", "mode=1777", "size=65536k"},
},
The second instance is added by the code below.
So specOpts = append(specOpts, oci.WithoutMounts("/dev/shm"))
is used to remove the first redundant copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I see, thanks! I was looking too quickly and missed in the bug report that there are two instances listed.
specOpts = append(specOpts, oci.WithMounts([]runtimespec.Mount{ | ||
{ | ||
Source: sandboxDevShm, | ||
Destination: devShm, | ||
Type: "bind", | ||
Options: []string{"rbind", "ro"}, | ||
Options: []string{"rbind", "ro", "nosuid", "nodev", "noexec"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to change this default for all the containers?
In my PoC branch I only changed this flags if userns is enabled. See here, for example: https://github.com/kinvolk/containerd/blob/54c1f3529d64b58e187ab0341b3b8f669be30a0d/pkg/cri/server/sandbox_run_linux.go#L122-L134
If we wait for the CRI changes, we can make this conditional on that being enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to change it for all. Otherwise we will get an /dev/shm
with:
shm on /dev/shm type tmpfs (ro,relatime,size=65536k)
which doesn't confirm to the runC spec(https://github.com/opencontainers/runc/blob/main/libcontainer/SPEC.md):
/dev/shm | tmpfs | MS_NOEXEC,MS_NOSUID,MS_NODEV | mode=1777,size=65536k
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiangliu Well, it is a backwards incompatible change, so IMHO we should put some thought into doing that change. At least, IMHO, it deserves to be highlighted in the PR description for open debate, and clear reasons to why it is worth doing it.
which doesn't confirm to the runC spec(https://github.com/opencontainers/runc/blob/main/libcontainer/SPEC.md):
Do we care about that? I mean, AFAIK we should care about the runtime-spec. That seems to be some documentation for the libcontainer and its defaults. It is good to know, but nothing else IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about the possible incompatible issue. We are creating env for the infra (pause) container, and the /dev/shm
is mounted as ro
instead of normal rw
, that means the pause container won't write data to /dev/shm
and it doesn't make sense for pause container to read data from /dev/shm
too.
One the other hand, my first feeling is "aha, a possible attack path". Because the same tmpfs instance is mounted for both pause container in ro mode and app containers in ro,nosuid,nodev,noexec mode. If we could copy suid program to the shared tmpfs, and trick pause container to execute it, that's bad. After some experiments, it turns out that attack path is impossible. But I do feel it's better to mount /dev/shm in ro,nosuid,noexec,nodev mode for both pause and app containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we found this issue when analyzing the generated OCI runtime config file and then found the root cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For user ns enabled case, we could also enable the "ro" flag.
sanboxDevShmOpt := []string{"rbind", "ro"}
if userNS {
// When userns is enabled, the mount fails if it doesn't include these options,
// except for rprivate. rpivate is not required, but nice to have.
sanboxDevShmOpt = append(sanboxDevShmOpt, "rprivate", "noexec", "nosuid", "nodev")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I guess due to the lack of comment on that part, that the runc/libcontainer thing is ok, right?
I'm in a rush for kubecon and to make sure the KEP makes it into 1.25. I don't have time to properly review and test this now. However, if this only affects the sandbox mount, it is probably not a big issue. I don't see any good reason to do it one way or the other. But it is still worth noting this change in the PR description and release notes IMHO.
312f2b6
to
7a0f762
Compare
/retest-required |
312c8aa
to
f3aabc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@AkihiroSuda Can you take a look?
Please use real name |
@AkihiroSuda Thanks. Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
CI rerun and happy again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Added a silly nit-pick about the comment, but feel free to ignore
pkg/cri/server/sandbox_run_linux.go
Outdated
// In future the user-namespace is enabled, The `nosuid, nodev, noexec` flags are | ||
// required, otherwise the remount will fail with EPERM. These flags have no effect | ||
// without user-namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last sentence seems weird to me: those flags do have an effect without userns. What about this instead?
// In future the user-namespace is enabled, The `nosuid, nodev, noexec` flags are | |
// required, otherwise the remount will fail with EPERM. These flags have no effect | |
// without user-namespace. | |
// In future the when user-namespace is enabled, the `nosuid, nodev, noexec` flags are | |
// required, otherwise the remount will fail with EPERM. Just use them unconditionally, | |
// they are nice to have anyways. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
I am not a native speaker of English, thanks for your comments.
Sorry, it contains two commits. @wllenyj would you mind to make squash? Thanks |
This's an optimization to get rid of redundant `/dev/shm" mounts for pause container. In `oci.defaultMounts`, there is a default `/dev/shm` mount which is redundant for pause container. Fixes: containerd#6911 Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
All containers except the pause container, mount `/dev/shm" with flags `nosuid,nodev,noexec`. So change mount options for pause container to keep consistence. This also helps to solve issues of failing to mount `/dev/shm` when pod/container level user namespace is enabled. Fixes: containerd#6911 Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Lei Wang <wllenyj@linux.alibaba.com>
This pr is a part of #6911, changes include:
nosuid,nodev,noexec
flags to the /dev/shm mount options.Fixes: #6911