How Are Kubernetes Pods Implemented Under the Hood?
2022-10-15 09:35
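Original article: https://page.om.qq.com/page/O68R0NRe6Vr0SepMxbeE-2ow0
Based on that last finding, I decided to dig deeper and learn:
How Pods are implemented under the hood
What the actual difference between a Pod and a Container is
How to create a Pod using Docker
Along the way, I hope it will help me solidify my Linux, Docker, and Kubernetes skills.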
1. Exploring a Container
Setting up the playground
$ cat > Vagrantfile <<EOF
Vagrant.configure("2") do |config|
  config.vm.box = "debian/buster64"
  config.vm.hostname = "docker-host"
  config.vm.define "docker-host"
  config.vagrant.plugins = ['vagrant-vbguest']
  config.vm.provider "virtualbox" do |vb|
    vb.cpus = 2
    vb.memory = "2048"
  end
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y curl vim
  SHELL
  config.vm.provision "docker"
end
EOF
$ vagrant up
$ vagrant ssh
Finally, let's start a container:
$ docker run --name foo --rm -d --memory='512MB' --cpus='0.5' nginx
Exploring the container's namespaces
# Look up the container in the process tree.
$ ps auxf
USER PID ... COMMAND
...
root 4707 /usr/bin/containerd-shim-runc-v2 -namespace moby -id cc9466b3e...
root 4727 \_ nginx: master process nginx -g daemon off;
4781 \_ nginx: worker process
4782 \_ nginx: worker process
# Find the namespaces used by 4727 process.
$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
...
4026532157 mnt 3 4727 root nginx: master process nginx -g daemon off;
4026532158 uts 3 4727 root nginx: master process nginx -g daemon off;
4026532159 ipc 3 4727 root nginx: master process nginx -g daemon off;
4026532160 pid 3 4727 root nginx: master process nginx -g daemon off;
4026532162 net 3 4727 root nginx: master process nginx -g daemon off;
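mnt (Mount): the container has an isolated mount table.
uts (UNIX Time-Sharing): the container has its own hostname and domain name.
ipc (Inter-Process Communication): processes inside the container can use system-level IPC only with other processes in the same container.
pid (Process ID): processes inside the container can only see other processes in the same container or in the same PID namespace.
net (Network): the container has its own network stack.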
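The same namespace IDs that lsns reports can also be read straight from /proc. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name ns_ids is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# ns_ids [PID]: print the namespace ID of each type for PID.
# Defaults to the current shell. ns_ids is a hypothetical helper name.
ns_ids() {
  pid="${1:-$$}"
  for ns in mnt uts ipc pid net; do
    # Each /proc/<pid>/ns/<type> is a symlink such as "net:[4026532162]".
    printf '%-4s %s\n' "$ns" "$(readlink "/proc/${pid}/ns/${ns}" 2>/dev/null)"
  done
}

ns_ids          # namespaces of the current shell
# ns_ids 4727   # namespaces of the nginx master process (may need sudo)
```

Two processes share a namespace of a given type exactly when these IDs are equal, which is the check used later to compare the containers of a Pod.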
Exploring the container's cgroups
$ PID=$(docker inspect --format '{{.State.Pid}}' foo)
# Check cgroupfs node for the container main process (4727).
$ cat /proc/${PID}/cgroup
11:freezer:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
10:blkio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
9:rdma:/
8:pids:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
7:devices:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
6:cpuset:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
3:net_cls,net_prio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
2:perf_event:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
1:name=systemd:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
0::/system.slice/containerd.service
$ ID=$(docker inspect --format '{{.Id}}' foo)
# Check the memory limit.
$ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
536870912 # Yay! It's the 512MB we requested!
# See the CPU limits.
$ ls /sys/fs/cgroup/cpu/docker/${ID}
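Interestingly, a cgroup is configured even when a container is started without any explicit resource limits. I haven't actually verified this, but my guess is that CPU and RAM consumption are unlimited by default, and the cgroup is probably there to restrict access to certain devices from inside the container.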
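The lookup done by hand above (process, then cgroup path, then limit file) can be sketched as a small script. It only parses text copied from the session above, so it runs without Docker; the helper name cgroup_path is hypothetical, and the sketch assumes the cgroup v1 line format `hierarchy-ID:controllers:path` with no colon inside the path.

```shell
#!/bin/sh
# Three lines copied from the /proc/${PID}/cgroup output above.
sample='5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
0::/system.slice/containerd.service'

# cgroup_path CONTROLLER: read cgroup v1 lines on stdin and print the
# cgroup path of the first line whose controller list contains CONTROLLER.
cgroup_path() {
  awk -F: -v want="$1" '{
    n = split($2, ctrls, ",")
    for (i = 1; i <= n; i++) if (ctrls[i] == want) { print $3; exit }
  }'
}

mem_path=$(printf '%s\n' "$sample" | cgroup_path memory)
# Location of the limit file under the cgroup v1 memory hierarchy.
echo "/sys/fs/cgroup/memory${mem_path}/memory.limit_in_bytes"

# The value read in the session above, converted from bytes to MiB.
echo "$((536870912 / 1024 / 1024)) MiB"
```

On a live host, `cat /proc/${PID}/cgroup | cgroup_path memory` would replace the sample text.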
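After this investigation, this is how I picture a container: isolated and restricted Linux processes, with namespaces providing the isolation and cgroups the resource limits.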
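2. Exploring a Pod
Now let's look at Kubernetes Pods. As with containers, the implementation of a Pod can vary between CRI runtimes. For example, when Kata Containers is used as a supported runtime class, some Pods can be real virtual machines! And, as expected, VM-based Pods differ from Pods built on traditional Linux containers in both implementation and capabilities.
To keep the comparison between containers and Pods fair, we will explore a Kubernetes cluster that uses the containerd/runc runtime. This is the same mechanism Docker uses to run containers under the hood.
Setting up the playground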
# Install arkade
$ curl -sLS https://get.arkade.dev | sh
$ arkade get kubectl minikube
$ minikube start --driver virtualbox --container-runtime containerd
$ kubectl --context=minikube apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: app
      image: docker.io/kennethreitz/httpbin
      ports:
        - containerPort: 80
      resources:
        limits:
          memory: "256Mi"
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
      resources:
        limits:
          memory: "128Mi"
EOF
Exploring the Pod's containers
$ minikube ssh
$ ps auxf
USER PID ... COMMAND
...
root 4947 \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root 4966 \_ /pause
root 4981 \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root 5001 \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root 5016 \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root 5018 \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
100 5035 \_ /bin/sleep 3650d
$ sudo ctr --namespace=k8s.io containers ls
CONTAINER IMAGE RUNTIME
...
097d4fe8a7002 docker.io/curlimages/curl@sha256:1a220 io.containerd.runtime.v1.linux
...
dfb1cd29ab750 docker.io/kennethreitz/httpbin:latest io.containerd.runtime.v1.linux
...
f0e87a9330466 k8s.gcr.io/pause:3.1 io.containerd.runtime.v1.linux
$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
097d4fe8a7002 bcb0c26a91c90 About an hour ago Running sidecar 0 f0e87a9330466
dfb1cd29ab750 b138b9264903f About an hour ago Running app 0 f0e87a9330466
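Note, however, that the POD ID field above matches the ID of the pause:3.1 container in the ctr output. So it looks like the Pod itself is backed by an auxiliary container. What is it for?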
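To make the relationship between containers and their Pod sandbox explicit, the crictl listing can be grouped by its POD ID column. This is a rough sketch that parses the output copied from above; it assumes exactly that column layout (POD ID last, NAME third from the end), which may differ in other crictl versions, and the helper name containers_by_pod is made up.

```shell
#!/bin/sh
# Output of `sudo crictl ps`, copied from the session above.
crictl_ps_output='CONTAINER      IMAGE          CREATED            STATE    NAME     ATTEMPT  POD ID
097d4fe8a7002  bcb0c26a91c90  About an hour ago  Running  sidecar  0        f0e87a9330466
dfb1cd29ab750  b138b9264903f  About an hour ago  Running  app      0        f0e87a9330466'

# containers_by_pod: read `crictl ps` output on stdin and print one line
# per pod sandbox ID, followed by the names of its containers.
containers_by_pod() {
  awk 'NR > 1 { pods[$NF] = pods[$NF] " " $(NF - 2) }
       END { for (p in pods) print p ":" pods[p] }'
}

printf '%s\n' "$crictl_ps_output" | containers_by_pod
```

For the listing above this prints a single line, since both containers belong to the sandbox f0e87a9330466, the same ID as the pause container.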
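I haven't come across anything in the OCI runtime specification that corresponds to a Pod. So, when the information in the Kubernetes API specification doesn't satisfy me, I usually go straight to the Kubernetes Container Runtime Interface (CRI) Protobuf file: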
// kubelet expects any compatible container runtime
// to implement the following gRPC methods:
service RuntimeService {
...
rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}
rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}
// ...
}
message CreateContainerRequest {
// ID of the PodSandbox in which the container should be created.
string pod_sandbox_id = 1;
// Config of the container.
ContainerConfig config = 2;
// Config of the PodSandbox. This is the same config that was passed
// to RunPodSandboxRequest to create the PodSandbox. It is passed again
// here just for easy reference. The PodSandboxConfig is immutable and
// remains the same throughout the lifetime of the pod.
PodSandboxConfig sandbox_config = 3;
}
Exploring the Pod's namespaces
$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
4026532614 net 4 4966 root /pause
4026532715 mnt 1 4966 root /pause
4026532716 uts 4 4966 root /pause
4026532717 ipc 4 4966 root /pause
4026532718 pid 1 4966 root /pause
4026532719 mnt 2 5001 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532720 pid 2 5001 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532721 mnt 1 5035 100 /bin/sleep 3650d
4026532722 pid 1 5035 100 /bin/sleep 3650d
# httpbin container
$ sudo ls -l /proc/5001/ns
...
lrwxrwxrwx 1 root root 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 mnt -> 'mnt:[4026532719]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 pid -> 'pid:[4026532720]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 uts -> 'uts:[4026532716]'
# sleep container
$ sudo ls -l /proc/5035/ns
...
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 mnt -> 'mnt:[4026532721]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 pid -> 'pid:[4026532722]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 uts -> 'uts:[4026532716]'
# Inspect httpbin container.
$ sudo crictl inspect dfb1cd29ab750
{
...
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/4966/ns/ipc"
},
{
"type": "uts",
"path": "/proc/4966/ns/uts"
},
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/4966/ns/net"
}
],
...
}
# Inspect sleep container.
$ sudo crictl inspect 097d4fe8a7002
...
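I think these findings neatly explain what containers in the same Pod can do:
Communicate with each other, either
over localhost (shared net namespace), and/or
using IPC (shared memory, message queues, etc.)
Share a domain name and hostname (shared uts namespace)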
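The comparison behind this list can be sketched as a short script. The namespace IDs are copied from the two `ls -l /proc/<pid>/ns` listings above; the helper name compare_ns is hypothetical.

```shell
#!/bin/sh
# Namespace links of the httpbin container (PID 5001), from the output above.
app_ns='ipc:[4026532717]
mnt:[4026532719]
net:[4026532614]
pid:[4026532720]
uts:[4026532716]'

# Namespace links of the sleep container (PID 5035), from the output above.
sidecar_ns='ipc:[4026532717]
mnt:[4026532721]
net:[4026532614]
pid:[4026532722]
uts:[4026532716]'

# compare_ns A B: for each namespace in A, print "<type> shared" if B holds
# the same namespace ID, otherwise "<type> private".
compare_ns() {
  printf '%s\n' "$1" | while IFS=: read -r type id; do
    if printf '%s\n' "$2" | grep -qxF "${type}:${id}"; then
      echo "${type} shared"
    else
      echo "${type} private"
    fi
  done
}

compare_ns "$app_ns" "$sidecar_ns"
```

The script reports ipc, net, and uts as shared and mnt and pid as private, matching the capabilities listed above.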
Exploring the Pod's cgroups
$ sudo systemd-cgls
Control group /:
-.slice
├─kubepods
│ ├─burstable
│ │ ├─pod4a8d5c3e-3821-4727-9d20-965febbccfbb
│ │ │ ├─f0e87a93304666766ab139d52f10ff2b8d4a1e6060fc18f74f28e2cb000da8b2
│ │ │ │ └─4966 /pause
│ │ │ ├─dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8
│ │ │ │ ├─5001 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ │ └─5016 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ └─097d4fe8a7002d69d6c78899dcf6731d313ce8067ae3f736f252f387582e55ad
│ │ │ └─5035 /bin/sleep 3650d
...
3. Implementing a Pod with Docker
$ sudo apt-get install cgroup-tools
$ sudo cgcreate -g cpu,memory:/pod-foo
# Check if the corresponding folders were created:
$ ls -l /sys/fs/cgroup/cpu/pod-foo/
$ ls -l /sys/fs/cgroup/memory/pod-foo/
$ docker run -d --rm \
--name foo_sandbox \
--cgroup-parent /pod-foo \
--ipc 'shareable' \
alpine sleep infinity
# app (httpbin)
$ docker run -d --rm \
--name app \
--cgroup-parent /pod-foo \
--network container:foo_sandbox \
--ipc container:foo_sandbox \
kennethreitz/httpbin
# sidecar (sleep)
$ docker run -d --rm \
--name sidecar \
--cgroup-parent /pod-foo \
--network container:foo_sandbox \
--ipc container:foo_sandbox \
curlimages/curl sleep 365d
$ sudo systemd-cgls memory
Controller memory; Control group /:
├─pod-foo
│ ├─488d76cade5422b57ab59116f422d8483d435a8449ceda0c9a1888ea774acac7
│ │ ├─27865 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ └─27880 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ ├─9166a87f9a96a954b10ec012104366da9f1f6680387ef423ee197c61d37f39d7
│ │ └─27977 sleep 365d
│ └─c7b0ec46b16b52c5e1c447b77d67d44d16d78f9a3f93eaeb3a86aa95e08e28b6
│ └─27743 sleep infinity
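The Pod manifest in section 2 limits app to 256Mi and sidecar to 128Mi, while the docker run commands above set no memory limits at all. As a rough sketch, and assuming you want the parent pod-foo cgroup capped at the sum of the two container limits (an assumption, not something the commands above do), the value could be computed like this. The script only prints the cgset command instead of running it, and mib_to_bytes is a made-up helper name.

```shell
#!/bin/sh
# mib_to_bytes N: convert N mebibytes to bytes.
mib_to_bytes() { echo "$(( $1 * 1024 * 1024 ))"; }

# Sum of the per-container limits from the Pod manifest (256Mi + 128Mi).
pod_limit=$(( $(mib_to_bytes 256) + $(mib_to_bytes 128) ))
echo "$pod_limit"

# Print (do not run) a cgset call from cgroup-tools that would apply the
# limit to the parent cgroup; memory.limit_in_bytes is the cgroup v1 file.
echo "sudo cgset -r memory.limit_in_bytes=${pod_limit} pod-foo"
```

The per-container limits themselves could be passed to docker run with the same --memory flag used for the nginx container in section 1.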
$ sudo lsns
NS TYPE NPROCS PID USER COMMAND
...
4026532157 mnt 1 27743 root sleep infinity
4026532158 uts 1 27743 root sleep infinity
4026532159 ipc 4 27743 root sleep infinity
4026532160 pid 1 27743 root sleep infinity
4026532162 net 4 27743 root sleep infinity
4026532218 mnt 2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532219 uts 2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532220 pid 2 27865 root /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532221 mnt 1 27977 _apt sleep 365d
4026532222 uts 1 27977 _apt sleep 365d
4026532223 pid 1 27977 _apt sleep 365d
# app container
$ sudo ls -l /proc/27865/ns
lrwxrwxrwx 1 root root 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 mnt -> 'mnt:[4026532218]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 pid -> 'pid:[4026532220]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 uts -> 'uts:[4026532219]'
# sidecar container
$ sudo ls -l /proc/27977/ns
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 mnt -> 'mnt:[4026532221]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 pid -> 'pid:[4026532223]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 uts -> 'uts:[4026532222]'
4. Summary
Related links:
1. https://github.com/opencontainers/runtime-spec/issues/345
2. https://github.com/opencontainers/runtime-spec/pull/388