Kubernetes Node

By Bys on May 8, 2023

Kubernetes Node

Node Management

노드의 3가지 주요 컴포넌트는 다음과 같다.

kubelet
container runtime
kube-proxy

API 서버에 노드를 등록하는 두 가지 방버이 존재하며 컨트롤 플레인은 신규 노드가 유효한지 확인한다.

The kubelet on a node self-registers to the control plane (–register-node)
- Self register는 노드 등록시 자신의 가용한 리소스를 자동으로 report한다.
Manually add a Node object
- Manual로 등록할 경우 node의 capacity 정보를 설정해야 한다.

노드 네임은 반드시 Unique 해야 한다.

Two Nodes cannot have the same name at the same time.

Self-registration of Nodes

When the kubelet flag –register-node is true (the default), the kubelet will attempt to register itself with the API server. For self-registration, the kubelet is started with the following options:
–kubeconfig - Path to credentials to authenticate itself to the API server.
–cloud-provider - How to talk to a cloud provider to read metadata about itself.
–register-node - Automatically register with the API server. (Default: true)
–register-with-taints - Register the node with the given list of taints (comma separated =:). No-operation if register-node is false.
–node-ip - Optional comma-separated list of the IP addresses for the node. You can only specify a single address for each address family. For example, in a single-stack IPv4 cluster, you set this value to be the IPv4 address that the kubelet should use for the node. See configure IPv4/IPv6 dual stack for details of running a dual-stack cluster. If you don’t provide this argument, the kubelet uses the node’s default IPv4 address, if any; if the node has no IPv4 addresses then the kubelet uses the node’s default IPv6 address.
–node-labels - Labels to add when registering the node in the cluster (see label restrictions enforced by the NodeRestriction admission plugin).
–node-status-update-frequency - Specifies how often kubelet posts its node status to the API server. (Default: 10s)

EKS kubelet v1.24 on data-plane

usr/bin/kubelet
--config /etc/kubernetes/kubelet/kubelet-config.json
--kubeconfig /var/lib/kubelet/kubeconfig
--container-runtime-endpoint unix:///run/containerd/containerd.sock
--image-credential-provider-config /etc/eks/ecr-credential-provider/ecr-credential-provider-config
--image-credential-provider-bin-dir /etc/eks/ecr-credential-provider
--node-ip=10.20.11.32
--pod-infra-container-image=602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/eks/pause:3.5
--v=2
--cloud-provider=aws
--container-runtime=remote
--node-labels=eks.amazonaws.com/sourceLaunchTemplateVersion=1,alpha.eksctl.io/nodegroup-name=ng-v1,alpha.eksctl.io/cluster-name=bys-dev-eks-main,eks.amazonaws.com/nodegroup-image=ami-0fdcb707922882aef,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=ng-v1,eks.amazonaws.com/sourceLaunchTemplateId=lt-0cbd44d881bbbe60f
--max-pods=29

Node authorization mode, NodeRestriction admission plugin이 활성화 되면 kubelet은 자신의 노드 리소스에 대한 생성/수정 만 가능하도록 인가된다.

NodeRestriction
- Admission controller가 kubelet이 수정할 수 있는 Node와 Pod를 제한한다. Admission controller에 의한 제어가 가능하려면 kubelet은 반드시 system:nodes 그룹의 credential을 이용해야하며 username으로는 system:node:를 사용해야 한다.

The Node authorizer는 kubelet이 API를 수행할 수 있도록 허가해주며 다음 권한을 포함한다.

Read operations:
- services
- endpoints
- nodes
- pods
- secrets, configmaps, persistent volume claims and persistent volumes related to pods bound to the kubelet’s node
Write operations:
- nodes and node status (enable the NodeRestriction admission plugin to limit a kubelet to modify its own node)
- pods and pod status (enable the NodeRestriction admission plugin to limit a kubelet to modify pods bound to itself)
- events
Auth-related operations:
- read/write access to the CertificateSigningRequests API for TLS bootstrapping
- the ability to create TokenReviews and SubjectAccessReviews for delegated authentication/authorization checks

Node Status

노드의 상태는 다음 정보를 포함한다.

Addresses
```
Addresses:
  InternalIP:   10.20.10.208
  Hostname:     ip-10-20-10-208.ap-northeast-2.compute.internal
  InternalDNS:  ip-10-20-10-208.ap-northeast-2.compute.internal
```
- HostName: The hostname as reported by the node’s kernel. Can be overridden via the kubelet –hostname-override parameter.
- ExternalIP: Typically the IP address of the node that is externally routable (available from outside the cluster).
- InternalIP: Typically the IP address of the node that is routable only within the cluster.
Conditions
노드의 Condition은 노드 리소스의 .status에 의해서 상태가 표시되어진다.
만약 노드의 Status 상태가 Unknown 또는 False 상태로 kube-controller-manager의 NodeMonitorGracePeriod 설정 시간보다 길게 유지되면 Unknown status에 대해서는 node.kubernetes.io/unreachable taint가 False status에 대해서는 node.kubernetes.io/not-ready taint가 추가 된다.
스케줄러가 노드를 Pod에 할당할 때 taint를 고려하므로 이러한 taint는 Pending 상태의 파드에도 영향을 미친다. 이미 노드에 스케쥴된 파드들은 NoExecute taint에 의해서 Evicted 될 수 있다. 하지만 특정 노드의 taint에도 불구하고 파드의 toleration설정으로 지속적으로 Running 하게 할 수 있다.
```
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  Ready            True    Mon, 01 May 2023 16:38:39 +0900   Tue, 25 Apr 2023 14:22:34 +0900   KubeletReady                 kubelet is posting ready status
  MemoryPressure   False   Mon, 01 May 2023 16:38:39 +0900   Tue, 25 Apr 2023 14:21:38 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 01 May 2023 16:38:39 +0900   Tue, 25 Apr 2023 14:21:38 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 01 May 2023 16:38:39 +0900   Tue, 25 Apr 2023 14:21:38 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
```
- Ready
  - True if the node is healthy and ready to accept pods
  - False if the node is not healthy and is not accepting pods
  - Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
- DiskPressure
  - True if pressure exists on the disk size—that is, if the disk capacity is low; otherwise False
- MemoryPressure
  - True if pressure exists on the node memory—that is, if the node memory is low; otherwise False
- PIDPressure
  - True if pressure exists on the processes—that is, if there are too many processes on the node; otherwise False
- NetworkUnavailable
  - True if the network for the node is not correctly configured, otherwise False

Capacity and Allocatable
CPU, Memory, MaxPods, pods 등의 리소스를 보여주며 Capacity는 노드의 총 가용한 리소스를, Allocatable은 일반적인 파드의 경우일 때 사용가능한 리소스를 보여준다.

Capacity:
  attachable-volumes-aws-ebs:  25
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           52416492Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16078024Ki
  pods:                        20
  vpc.amazonaws.com/pod-eni:   39
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3920m
  ephemeral-storage:           47233297124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15387848Ki
  pods:                        20
  vpc.amazonaws.com/pod-eni:   39

Info Kernel Version, Kubernetes version Container runtime details 등 노드의 일반적인 정보를 보여준다.

System Info:
  Machine ID:                 ec2ee61571b85e94fce9e922a86d214e
  System UUID:                ec2ee615-71b8-5e94-fce9-e922a86d214e
  Boot ID:                    033e9dca-6b6c-4370-8ac0-928685fc004d
  Kernel Version:             5.10.176-157.645.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.25.7-eks-a59e1f0
  Kube-Proxy Version:         v1.25.7-eks-a59e1f0

Heartbeats

Kubernetes 노드에 의해 전송되는 Heartbeats는 클러스터가 노드가 가용한지 결정하는데 도움을 주며 장애가 감지되었을 때 어떤 조치를 할 수 있도록 도와준다.

Heartbeats는 두 가지 형태가 존재한다.

Updates to the .status of a Node
Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.
- Lease 객체를 통한 헬스 체크는 1.14 버전 이 후 Default로

노드의 .status를 업데이트하는 것에 비해 Lease는 가벼운 리소스이다. Heartbeats를 위해 Leases를 사용하는 것은 규모가 큰 클러스터에서 성능에 성능에 영향을 줄일 수 있다.

Kubelet은 노드의 .status를 생성하고 업데이트를 담당하며 관련된 Lease들의 업데이트를 담당한다.

The kubelet is responsible for creating and updating the .status of Nodes, and for updating their related Leases.

Kubelet은 10초 주기로(nodeStatusUpdateFrequency) 노드의 상태 값을 확인하며 5분 주기로(nodeStatusReportFrequency) API서버로 노드 상태 값을 보고한다. status에 변화가 있을 때는 nodeStatusReportFrequency 주기가 무시되며, 설정된 interval동안 status에 업데이트가 없을 때 .status를 업데이트한다. 기본적으로 노드에 .status 업데이트를 하는 주기는 5 분이며 노드가 Unreachable이 되는 기본 시간 40초 보다 훨씬 길다.

The kubelet updates the node’s .status either when there is change in status or if there has been no update for a configured interval. The default interval for .status updates to Nodes is 5 minutes, which is much longer than the 40 second default timeout for unreachable nodes.
Kubelet은 자신의 Lease 객체를 생성하고 매 10초(Default설정) 마다 업데이트한다. Lease 업데이트는 노드의 .status 업데이트와는 독립적이다. 만약 Lease 업데이트에 실패하면 kubelet은 재시도하며 200 밀리초의 exponential backoff 및 7초의 상한을 가지고 재 시도한다.

The kubelet creates and then updates its Lease object every 10 seconds (the default update interval). Lease updates occur independently from updates to the Node’s .status. If the Lease update fails, the kubelet retries, using exponential backoff that starts at 200 milliseconds and capped at 7 seconds.

Kubelet의 권한문제로 인해 Lease를 업데이트하지 못할 경우의 kubelet 로그는 다음과 같다.

  Sep 27 08:07:05 ip-10-20-42-209.ap-northeast-2.compute.internal kubelet[2950]: E0927 08:07:05.536991    2950 controller.go:146] "Failed to ensure lease exists, will retry" err="leases.coordination.k8s.io \"ip-10-20-42-209.ap-northeast-2.compute.internal\" is forbidden: User \"WorkerNode\" cannot get resource \"leases\" in API group \"coordination.k8s.io\" in the namespace \"kube-node-lease\"" interval="7s"
  Sep 27 08:07:07 ip-10-20-42-209.ap-northeast-2.compute.internal kubelet[2950]: E0927 08:07:07.672774    2950 webhook.go:154] Failed to make webhook authenticator request: tokenreviews.authentication.k8s.io is forbidden: User "WorkerNode" cannot create resource "tokenreviews" in API group "authentication.k8s.io" at the cluster scope

$ kubectl get lease -n kube-node-lease
NAME                                                      HOLDER                                                    AGE
fargate-ip-10-20-10-162.ap-northeast-2.compute.internal   fargate-ip-10-20-10-162.ap-northeast-2.compute.internal   118d
fargate-ip-10-20-10-38.ap-northeast-2.compute.internal    fargate-ip-10-20-10-38.ap-northeast-2.compute.internal    16h
fargate-ip-10-20-15-189.ap-northeast-2.compute.internal   fargate-ip-10-20-15-189.ap-northeast-2.compute.internal   118d
fargate-ip-10-20-15-226.ap-northeast-2.compute.internal   fargate-ip-10-20-15-226.ap-northeast-2.compute.internal   118d
ip-10-20-10-112.ap-northeast-2.compute.internal           ip-10-20-10-112.ap-northeast-2.compute.internal           11d
ip-10-20-10-208.ap-northeast-2.compute.internal           ip-10-20-10-208.ap-northeast-2.compute.internal           6d2h
ip-10-20-10-51.ap-northeast-2.compute.internal            ip-10-20-10-51.ap-northeast-2.compute.internal            173d
ip-10-20-10-62.ap-northeast-2.compute.internal            ip-10-20-10-62.ap-northeast-2.compute.internal            73d
ip-10-20-11-100.ap-northeast-2.compute.internal           ip-10-20-11-100.ap-northeast-2.compute.internal           73d
ip-10-20-11-132.ap-northeast-2.compute.internal           ip-10-20-11-132.ap-northeast-2.compute.internal           2d1h
ip-10-20-11-135.ap-northeast-2.compute.internal           ip-10-20-11-135.ap-northeast-2.compute.internal           33d
ip-10-20-11-32.ap-northeast-2.compute.internal            ip-10-20-11-32.ap-northeast-2.compute.internal            11d
ip-10-20-11-9.ap-northeast-2.compute.internal             ip-10-20-11-9.ap-northeast-2.compute.internal             73d
ip-10-20-20-79.ap-northeast-2.compute.internal            ip-10-20-20-79.ap-northeast-2.compute.internal            25h

Lease 객체를 Describe 하면 spec에 Renew Time이 존재하며 정상적인 노드의 경우 10초 간격으로 업데이트 된다.

kd lease ip-10-20-20-79.ap-northeast-2.compute.internal -n kube-node-lease
Name:         ip-10-20-20-79.ap-northeast-2.compute.internal
Namespace:    kube-node-lease
Labels:       <none>
Annotations:  <none>
API Version:  coordination.k8s.io/v1
Kind:         Lease
Metadata:
  Creation Timestamp:  2023-04-30T07:17:09Z
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            ip-10-20-20-79.ap-northeast-2.compute.internal
    UID:             e42b805e-d7e4-4682-b907-6716e628d1bc
  Resource Version:  93730810
  UID:               77736a12-626c-43da-bf71-9c616bfbf121
Spec:
  Holder Identity:         ip-10-20-20-79.ap-northeast-2.compute.internal
  Lease Duration Seconds:  40
  Renew Time:              2023-05-01T08:31:18.434194Z
Events:                    <none>

모든 Kubelet heartbeat는 Lease 객체에 대한 업데이트 요청이며 Lease객체의 spec.renewTime 필드를 업데이트한다. Kubernetes control plane은 이 time stamp를 이용하여 노드의 가용성을 결정한다.

Under the hood, every kubelet heartbeat is an update request to this Lease object, updating the spec.renewTime field for the Lease. The Kubernetes control plane uses the time stamp of this field to determine the availability of this Node

Reserve Compute Resources

파드는 기본적으로 노드에서 사용 가능한 모든 리소스 용량을 소비할 수 있다. 일반적으로 노드는 OS 및 쿠버네티스를 위한 몇 가지 System Daemon을 실행하기 때문에 문제가 될 수 있다. System Daemon 프로세스들에 대해서 리소스를 별도로 설정하지 않는다면 파드와 System Daemon들은 리소스 경쟁을 하게 되며 노드에서 리소스 부족 문제가 발생할 수 있다.
Kubelet에서는 Node Allocatable 이라고 하는 기능을 제공하며 System Daemon들에 대한 리소스 예약을 도와준다.

쿠버네티스 노드에서 Allocatable의 정의는 파드가 사용가능한 리소스의 양으로 정의하며 아래 describe node를 통해 확인 가능하다.

$ kubectl describe node ip-10-20-11-44.ap-northeast-2.compute.internal
......생략
Allocatable:
  cpu:                        1930m
  ephemeral-storage:          76224326324
  hugepages-1Gi:              0
  hugepages-2Mi:              0
  memory:                     7220180Ki
  pods:                       29
  vpc.amazonaws.com/pod-eni:  9

kube-reserved kube-reserved는 쿠버네티스 시스템 데몬이 사용하는 리소스에 대한 예약을 의미하며 쿠버네티스 시스템 데몬으로는 kubelet, container runtime, node problem detector 등이 존재한다. kube-reserved는 파드로 실행되는 시스템 데몬의 리소스 예약을 의미하는 것은 아니며 노드의 파드 밀도 함수이다.
system-reserved system-reserved OS 시스템 데몬이 사용하는 리소스에 대한 예약을 의미하며 OS 시스템 데몬으로는 sshd, udev, 등이 존재한다. kernel 메모리는 쿠버네티스의 파드로 간주되지 않기 때문에 system-reserved는 kernel을 위한 메모리도 예약해야 한다. 또한 사용자 로그인 세션을 위핸 리소스 예약도 필요하다.
Eviction Thresholds 노드레벨에서 메모리 압박(pressure)은 OOM을 유발시키며 이는 노드에서 실행중인 전체 파드 및 노드에 영향을 주게 된다. 이에 따라 노드는 메모리가 회수 될 때 까지 임시적으로 오프라인 상태가 될 수 있다. 시스템 OOM을 방지하거나 가능성을 줄이기 위해서 kubelet은 리소스 관리를 제공하며 Eviction은 memory와 ephemeral-storage에 대해서만 제공한다. --eviction-hard 플래그를 통해 일부 메모리를 예약함으로써 kubelet은 노드의 메모리 가용량이 예약된 값 아래로 떨어질 때마다 파드를 evict하려고 시도한다.

Example

Node Resource
- 16 CPUs
- 32GI Memory
- 100Gi Storage
Scenario
- –kube-reserved is set to cpu=1, memory=2Gi, ephemeral-storage=1Gi
- –system-reserved is set to cpu=500m, memory=1Gi, ephemeral-storage=1Gi
- –eviction-hard is set to memory.available<500Mi, nodefs.available<10%

이 시나리오에서 Allocatable 값은 CPUs: 14.5 CPUs, Memory: 28.5Gi, Storage: 88Gi 가 된다. Scheduler는 이를 통해 이 노드에서 실행 중인 모든 파드의 요청 메모리가 28.5Gi를 넘을 수 없다는 것을 알 수 있다. 또한 kubelet은 전체적인 파드의 메모리 사용량이 28.5Gi를 넘어서면 eviction을 하게 된다.

만약 kube-reserved, system-reserved 설정이 되어있지 않고 시스템 데몬이 예약값을 초과하면 kubelet은 31.5Gi 메모리, 90Gi 스토리지 용량을 넘을 때 마다 파드를 eviction한다.

Kube-controller-manager(Node controller)

노드 컨트롤러는 Kubernetes Control Plane의 요소이며 노드를 다양한 측면에서 관리한다.

노드 컨트롤러는 CIDR assignment 설정이 된 경우 노드가 등록될 때 노드에 CIDR block을 할당한다.
노드 컨트롤러는 CSP사의 available machines 리스트의 최신정보로 내부 노드 리스트를 유지한다.
노드 컨트롤러는 노드의 Health 상태를 모니터링 한다.
- 노드가 Unreachable 상태가 되면 노드의 .status의 Ready 상태를 업데이트한다. 이런 경우 노드 컨트롤러는 Ready상태를 Unknown 상태로 변경한다.
  
  In the case that a node becomes unreachable, updating the Ready condition in the Node’s .status field. In this case the node controller sets the Ready condition to Unknown.
- 노드가 Unreachable 상태로 남아있으면 해당 노드의 Pod들에 대해서 API-initiated eviction을 시작한다. 기본적으로 노드 컨트롤러는 노드가 Unknown 상태로 마크되고 5분의 시간 뒤에 첫 번째 eviction 요청을 진행한다.
  
  If a node remains unreachable: triggering API-initiated eviction for all of the Pods on the unreachable node. By default, the node controller waits 5 minutes between marking the node as Unknown and submitting the first eviction request.
  - API-initiated Eviction은 Eviction API를 이용해 graceful pod termination을 시작하는 eviction object를 만드는 프로세스다.
  - API-initiated evictions respect your configured PodDisruptionBudgets and terminationGracePeriodSeconds
  - API 서버에 의해 eviction이 허용되면 Pod가 삭제되는 자세한 과정은 문서를 참고한다.
    - Pod리소스에 deletion timestamp 추가되고 grace period가 설정된다. Kubelet은 종료될 파드를 인지하고 gracefully shut down을 시작한다. Kubelet이 파드를 종료하는 동안 컨트롤플레인은 Endpoint와 EndpointSlice 객체를 제거한다. 그 결과 컨트롤러들은 해당 파드가 유효한 객체가 아님을 알게 된다. grace period가 지나면 kubelet은 파드를 강제로 죵료한다. Kubelet은 API서버에 파드가 제거 됨을 알린다.

기본적으로 노드 컨트롤러는 노드의 상태를 매 5초(–node-monitor-period 설정) 마다 체크한다.

k8s-node-status

다음은 전체적인 Node health check에 대한 flow이며 아래는 참고사항이다.

nodeStatusUpdateFrequency node-status-update-frequency주기는 kubelet이 node status를 계산하는 주기이다. Lease 객체는 1.14버전 부터 기능이 활성화 되어 있으므로 단순히 계산만 하는 주기이다.

nodeStatusUpdateFrequency specifies how often kubelet computes node status. If node lease feature is not enabled, it is also the frequency that kubelet posts node status to master.
Kubelet은 lease를 이용한다. Lease 객체는 kubelet에 의해 10초 주기로 renewTime이 갱신된다.

Cloud-controller-manager(Node controller)

노드 컨트롤러는 클라우드 Infrastructure에서 새로운 서버가 생성될 때 노드 객체를 업데이트 하는 역할을 한다. 노드 컨트롤러는 CSP에서 실행되는 호스트 정보를 가져오며 다음과 같은 역할을 수행한다.

Cloud provider의 API로 부터 얻은 서버 고유 ID를 노드 객체에 업데이트한다.
Annotating and labelling the Node object with cloud-specific information, such as the region the node is deployed into and the resources (CPU, memory, etc) that it has available.
Obtain the node’s hostname and network addresses.
Verifying the node’s health. In case a node becomes unresponsive, this controller checks with your cloud provider’s API to see if the server has been deactivated / deleted / terminated. If the node has been deleted from the cloud, the controller deletes the Node object from your Kubernetes cluster.

Some cloud provider implementations split this into a node controller and a separate node lifecycle controller.

Graceful node shutdown

Kubelet은 노드의 shutdown을 감지하고 노드에서 실행 중인 파드의 종료를 시도한다. Kubelet은 노드가 shutdown 되는동안 파드가 pod termination process를 따르도록 한다.
GracefulNodeShutdown는 feature gate를 통해 제어할 수 있다. feature gate란 쿠버네티스 기능을 설명하는 key=value 쌍이다. --feature-gates=...,GracefulNodeShutdown=true

Default설정으로 shutdownGracePeriod와 shutdownGracePeriodCriticalPods는 모두 0로 설정되어있으며 graceful node shutdown 기능은 활성화 되어있지 않다. 활성화 시키기 위해서는 kubelet 설정을 적절히 변경하고 0값을 수정해야 한다.

shutdownGracePeriod
- 노드 종료 지연시간의 총 시간을 지정한다. 일반 파드및 Critical 파드에 대한 모든 종료 유예 시간이다.
shutdownGracePeriodCriticalPods
- 노드 종료 중에 Critical파드를 종료하는데 사용되는 시간을 지정한다. 이 값은 shutdownGracePeriod 설정 보다 작아야 한다.

만약 shutdownGracePeriod 값이 30이고 shutdownGracePeriodCriticalPods 값이 10이면 kubelet은 노드 종료를 30초 까지 지연시키며 처음 20초 동안은 일반 파드를 종료하는데 마지막 10초는 Critical 파드를 종료하는데 할당한다.

Critical 파드를 지정하기 위해서는 아래와 같이 PriorityClass 설정 및 파드의 Spec에 PriorityClass를 지정하여 사용 가능하다.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."


apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority

kubelet config YAML

shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60

Non Graceful node shutdown

노드가 shutdown 되었지만 kubelet의 node shutdown manager가 감지하지 못했을 때 StatefulSet의 파드는 terminating 상태로 stuck 되어 다른 노드로 이동이 되지 못 할 수 있다. Shutdown 된 노드의 kubelet이 파드를 지울 수 없어 동일이름으로 다른 노드에 파드를 만들 수 없기 때문이다. 만약 볼륨을 사용하는 파드가 있다면 종료되는 노드에서 VolumeAttachment가 삭제되지 않으며 새로운 노드에 볼륨이 Attach되지 않는다.

위 와 같은 상황을 완화하기 위해, 사용자가 node.kubernetes.io/out-of-service taint를 NoExecute 또는 NoSchedule 값으로 추가하여 노드의 서비스 불가상태를 표시 할 수 있다. kube-controller-manager에 NodeOutOfServiceVolumeDetach feature gate가 활성화 되어있고 노드에 out-of-service taint가 되어있다면 해당 노드의 파드는 toleration이 없는 경우 강제 삭제되며 종료되는 파드에 대한 볼륨 해제 작업도 즉시 수행된다.

During a non-graceful shutdown, Pods are terminated in the two phases:

Force delete the Pods that do not have matching out-of-service tolerations.
Immediately perform detach volume operation for such pods.

Communication between Nodes and the Control Plane

Node to Control Plane

노드(또는 노드에서 실행 중인 파드)에서의 모든 API 사용은 API서버에서 종료된다. 노드는 유효한 client credential과 함께 API서버에 안전한 연결을 위해 public root certificate로 provision되어야 한다. Client 인증서를 통해 kubelet의 자격증명을 사용하는 것은 좋은 방법이다.

Control Plane to Node

API서버에서 노드로는 두 가지 통신 경로가 있으며 하나는 API서버에서 kubelet 프로세스이며 두 번째는 API 서버에서 API서버의 프록시 기능을 이용해 노드, 파드 또는 서비스에 이르는 것이다.

API server to kubelet

kubectl logs
- Fetching logs for pods.
kubectl exec
- Attaching (usually through kubectl) to running pods.
Providing the kubelet’s port-forwarding functionality.

위와 같은 연결은 kubelet의 HTTPS endpoint에서 종료되며 기본적으로 API서버는 kubelet이 제공 인증서(serving certificate)를 확인하지 않는다. 이는 중간자 공격의 연결을 만드는 것이며 신뢰할 수 없는 public network에서 실행되기 때문에 안전하지 않다.
이 연결을 검증하기 위해서는 –kubelet-certificate-authority flag를 사용하여 API서버에 kubelet의 serving certificate가 유효한지 확인하는데 root certificate bundle을 제공한다.

References
1. https://kubernetes.io/docs/concepts/architecture/nodes/
2. Kubelet flags - https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet
3. Kube-controller-manager flags - https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

Tag: [ kubernetes node ]