Describe the bug (🐛 if you encounter this issue)
Since the migration to Longhorn 1.5.1 (from 1.4.1), longhorn-manager pods have been consuming 20GB+ of RAM (at least 3x more than before) on most of our worker nodes, plus 3 to 4 vCPUs, making our whole production cluster unstable.
Expected behavior
longhorn-manager pods should consume far less RAM (expected: less than 5GB max, ideally under 1GB).
They should also not consume more than 1 vCPU each.
Environment
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: vanilla upstream Kubernetes 1.25
Number of management nodes in the cluster: 5
Number of worker nodes in the cluster: 330 (mounting volumes) + 12 dedicated Longhorn storage nodes
Node config
OS type and version: Ubuntu 22.04
Kernel version: 5.15.0-58-generic
CPU per node: 8
Memory per node: 64GB
Disk type (e.g. SSD/NVMe/HDD): SSD
Network bandwidth between the nodes: 10G
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): OpenStack
Number of Longhorn volumes in the cluster: 1725 (1292 Healthy, 433 Detached); see the sketch after this section for how these counts were gathered
Impacted Longhorn resources:
Volume names:
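As a side note, this is roughly how the volume counts above were gathered (a minimal sketch; it assumes kubectl access to the longhorn-system namespace and that the volumes.longhorn.io CRD exposes status.state / status.robustness, as it does on our 1.5.1 install):

```shell
# Total number of Longhorn volumes (CRD volumes.longhorn.io).
kubectl -n longhorn-system get volumes.longhorn.io --no-headers | wc -l

# Breakdown by state (attached/detached) and robustness (healthy/degraded/...).
kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
  -o custom-columns=STATE:.status.state,ROBUSTNESS:.status.robustness \
  | sort | uniq -c
```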
Additional context
We migrated from 1.4.1 to 1.5.1 three weeks ago, with a lot of difficulty: longhorn-manager pods were constantly evicted during the migration due to:
- no default priorityClass (this was fine in our context with 1.4.1)
- no RAM requests (this was fine in our context with 1.4.1)
After setting a priorityClass and a 10GB memory request on the longhorn-manager DaemonSet (roughly as sketched below), we mitigated the evictions and completed the migration.
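For reference, a minimal sketch of the kind of patch we applied (the PriorityClass name, the 10Gi figure and the container name are illustrative assumptions, not necessarily our exact values):

```shell
# Illustrative strategic-merge patch: add a priorityClass and a memory request
# to the longhorn-manager DaemonSet. Values below are examples, not recommendations.
cat <<'EOF' > longhorn-manager-patch.yaml
spec:
  template:
    spec:
      priorityClassName: production-critical      # example; must be an existing PriorityClass
      containers:
        - name: longhorn-manager                  # assumes this is the main container name
          resources:
            requests:
              memory: "10Gi"                      # example request
EOF
kubectl -n longhorn-system patch daemonset longhorn-manager \
  --type=strategic --patch-file longhorn-manager-patch.yaml
```

(Longhorn also exposes a priority-class setting that can be used instead of patching the DaemonSet directly, which is probably cleaner across upgrades.)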
However, since then we have observed huge RAM consumption, over 20GB on many of our nodes, leading to:
- longhorn-manager pods being OOMKilled
- node instability
- production workload instability
- RAM usage reaching 30% of each node's capacity, forcing us to add new nodes in a hurry
- a large increase in our hosting costs (not the blocking point at the moment, by the way), if we cannot revert to a cleaner situation
We also observe longhorn-manager pods sometimes consuming 3 to 4 vCPUs (50% of a node's CPU capacity), which is completely abnormal as well, though not the main blocking point at the moment. The sketch below shows roughly how we sample this usage.
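For completeness, this is roughly how we sample the consumption (a minimal sketch; it assumes metrics-server is available and that the manager pods carry the app=longhorn-manager label, as they do on our install):

```shell
# Per-container CPU/memory of the Longhorn manager pods (requires metrics-server).
kubectl -n longhorn-system top pod -l app=longhorn-manager --containers

# Sort everything in the namespace by memory to spot the worst offenders.
kubectl -n longhorn-system top pod --sort-by=memory
```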