Describe the bug (🐛 if you encounter this issue)
Since the migration to Longhorn 1.5.1 (from 1.4.1), longhorn-manager pods have been consuming 20GB+ of RAM (at least 3x more than before) on most of our worker nodes, plus 3 to 4 vCPUs, making our whole production cluster unstable.
Expected behavior
longhorn-manager pods should consume far less RAM (expected: less than 5GB max, ideally under 1GB).
They should also not consume more than 1 vCPU each.
Environment
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: vanilla upstream Kubernetes 1.25
Number of management nodes in the cluster: 5
Number of worker nodes in the cluster: 330 (mounting volumes) + 12 dedicated Longhorn storage nodes
Node config
OS type and version: Ubuntu 22.04
Kernel version: 5.15.0-58-generic
CPU per node: 8
Memory per node: 64GB
Disk type (e.g. SSD/NVMe/HDD): SSD
Network bandwidth between the nodes: 10G
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): OpenStack
Number of Longhorn volumes in the cluster: 1725 (1292 Healthy, 433 Detached); see the sketch after this section for how these counts were gathered
Impacted Longhorn resources:
Volume names:
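As a side note, this is roughly how the volume counts above were gathered (a minimal sketch; it assumes kubectl access to the longhorn-system namespace and that the volumes.longhorn.io CRD exposes status.state / status.robustness, as it does on our 1.5.1 install):

```shell
# Total number of Longhorn volumes (CRD volumes.longhorn.io).
kubectl -n longhorn-system get volumes.longhorn.io --no-headers | wc -l

# Breakdown by state (attached/detached) and robustness (healthy/degraded/...).
kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
  -o custom-columns=STATE:.status.state,ROBUSTNESS:.status.robustness \
  | sort | uniq -c
```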
Additional context
We migrated from 1.4.1 to 1.5.1 three weeks ago, with a lot of difficulty: longhorn-manager pods were constantly evicted during the migration due to:
- no default priorityClass (this was fine in our context with 1.4.1)
- no RAM requests (this was fine in our context with 1.4.1)
After setting a priorityClass and a 10GB memory request on the longhorn-manager DaemonSet (roughly as sketched below), we mitigated the evictions and completed the migration.
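For reference, a minimal sketch of the kind of patch we applied (the PriorityClass name, the 10Gi figure and the container name are illustrative assumptions, not necessarily our exact values):

```shell
# Illustrative strategic-merge patch: add a priorityClass and a memory request
# to the longhorn-manager DaemonSet. Values below are examples, not recommendations.
cat <<'EOF' > longhorn-manager-patch.yaml
spec:
  template:
    spec:
      priorityClassName: production-critical      # example; must be an existing PriorityClass
      containers:
        - name: longhorn-manager                  # assumes this is the main container name
          resources:
            requests:
              memory: "10Gi"                      # example request
EOF
kubectl -n longhorn-system patch daemonset longhorn-manager \
  --type=strategic --patch-file longhorn-manager-patch.yaml
```

(Longhorn also exposes a priority-class setting that can be used instead of patching the DaemonSet directly, which is probably cleaner across upgrades.)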
However, since then we have observed huge RAM consumption, over 20GB on many of our nodes, leading to:
- longhorn-manager pods being OOMKilled
- node instability
- production workload instability
- RAM usage reaching 30% of each node's capacity, forcing us to add new nodes in a hurry
- a large increase in our hosting costs (not the blocking point at the moment, by the way), if we cannot revert to a cleaner situation
We also observe longhorn-manager pods sometimes consuming 3 to 4 vCPUs (50% of a node's CPU capacity), which is completely abnormal as well, though not the main blocking point at the moment. The sketch below shows roughly how we sample this usage.
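For completeness, this is roughly how we sample the consumption (a minimal sketch; it assumes metrics-server is available and that the manager pods carry the app=longhorn-manager label, as they do on our install):

```shell
# Per-container CPU/memory of the Longhorn manager pods (requires metrics-server).
kubectl -n longhorn-system top pod -l app=longhorn-manager --containers

# Sort everything in the namespace by memory to spot the worst offenders.
kubectl -n longhorn-system top pod --sort-by=memory
```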