[BUG] Longhorn Instance Manager Memory leak · Issue #6481 · longhorn/longhorn
Labels: area/performance, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/qa-review-coverage
Description
Describe the bug (🐛 if you encounter this issue)
A bad backup target might cause a memory leak inside the Longhorn instance-manager pods.
To Reproduce
- Create a volume `testvol1` of 20 GB. Attach the volume. Write some data.
- Set up an NFS backup target that points to a non-existent directory, for example `nfs://longhorn-test-nfs-svc.default:/opt/backupstore-nonexist`.
- Create a snapshot named `snapshot-xyz`.
- Deploy about 100 backups of the same snapshot `snapshot-xyz` like the manifest below. (Technically, I think 1 backup should work too; having 100 backups just speeds up the memory leak.)

```yaml
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  labels:
    backup-volume: testvol1
  name: "backup1"
  namespace: longhorn-system
spec:
  labels:
    longhorn.io/volume-access-mode: rwo
  snapshotName: snapshot-xyz
```
- Observe that the Longhorn manager asks one of the volume's replicas to retry the backup repeatedly. The backup retry fails repeatedly with the mount error `No such file or directory`:

```
[instance-manager-87e672f2892911a9e3c1049af5825e55] time="2023-08-08T05:17:04Z" level=error msg="Failed to create delta block backup" destURL="nfs://longhorn-test-nfs-svc.default:/opt/backupstore1" error="cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore1: vers=4.0: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.0,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: vers=4.1: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: vers=4.2: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.2,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: cannot mount using NFSv4" snapshot="&{6b91c5f3-74b1-4454-b001-a99b36669545 2023-08-08T05:17:03Z}" volume="&{testvol1 21474836480 map[VolumeRecurringJobInfo:{} longhorn.io/volume-access-mode:rwo] 2023-08-08T05:17:03Z 0 lz4 }"
```
- Wait for a few hours. Observe that the memory usage of the instance-manager that prints the above logs keeps climbing (in my test it is `instance-manager-87e672f2892911a9e3c1049af5825e55`):

```
# Initial
instance-manager-63d5a33c5fe16bacb5a707c25a11c4d2   2m    44Mi
instance-manager-87e672f2892911a9e3c1049af5825e55   93m   61Mi
instance-manager-ca3fd80e3efef0746afbd5a56c4a43d1   32m   98Mi
longhorn-manager-mqhgg                              12m   112Mi
longhorn-manager-prjft                              8m    114Mi
longhorn-manager-x8x5k                              14m   119Mi

# After 5 hours
instance-manager-63d5a33c5fe16bacb5a707c25a11c4d2   10m   88Mi
instance-manager-87e672f2892911a9e3c1049af5825e55   25m   246Mi
instance-manager-ca3fd80e3efef0746afbd5a56c4a43d1   51m   165Mi
longhorn-manager-mqhgg                              7m    128Mi
longhorn-manager-prjft                              15m   130Mi
longhorn-manager-x8x5k                              9m    137Mi
```
- Check the RSS of the processes inside the instance-manager `instance-manager-87e672f2892911a9e3c1049af5825e55`; we can see that the RSS of the sync-agent service keeps climbing:

```
# Before
instance-manager-87e672f2892911a9e3c1049af5825e55:/ # ps aux
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
root    1  0.0  0.0    2508    700 ?   Ss   Aug07 0:00 /tini -- instance-manager --debug daemon --listen 0.0.0.0:8500
root   19  0.5  0.8 1903776  33548 ?   Sl   Aug07 1:00 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn replica /host/var/lib/longhorn/replicas/t
root   25  1.2  1.4 1389496  57796 ?   Sl   Aug07 2:03 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn sync-agent --listen 0.0.0.0:10002 --repli

# After
instance-manager-87e672f2892911a9e3c1049af5825e55:/ # ps aux
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
root    1  0.0  0.0    2508    700 ?   Ss   Aug07 0:01 /tini -- instance-manager --debug daemon --listen 0.0.0.0:8500
root    8  0.7  0.6 1469444  27828 ?   Sl   Aug07 3:43 longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root   19  0.5  0.8 1903776  34124 ?   Sl   Aug07 2:34 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn replica /host/var/lib/longhorn/replicas/testvol1-c7901ccd --siz
root   25  1.1  2.8 1390200 116224 ?   Sl   Aug07 5:10 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn sync-agent --listen 0.0.0.0:10002 --replica 0.0.0.0:10000 --lis
```
- The number of open FDs, TCP connections, and WebSockets is normal, so there is no issue there.
Code base analysis
- From the above information, we can see that there is memory leaking inside the `sync-agent` service when the backup creation is retried repeatedly.
- Tracing through the backup creation flow, I found one leak (a minimal sketch of the pattern follows this list):
  - When the sync-agent receives a backup create call, it calls BackupCreate().
  - This function calls s.BackupList.BackupAdd, which adds an in-memory `BackupInfo` struct to the sync-agent server struct.
  - The problem here is that the BackupInfo struct is never removed; we only remove the BackupInfo when the backup is completed (link).
  - As a consequence, the sync-agent server struct keeps getting bigger and bigger.
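To make the pattern concrete, here is a minimal runnable sketch of the behavior described above. The names (`Server`, `BackupInfo`, `BackupAdd`, `BackupRemove`) are simplified stand-ins for the sync-agent code, not the actual longhorn-engine implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// BackupInfo stands in for the per-backup state the sync-agent retains.
type BackupInfo struct {
	Name  string
	Error string // serialized error text from a failed attempt
}

// Server mimics a sync-agent server that keeps every attempt in a slice.
type Server struct {
	mu         sync.Mutex
	BackupList []*BackupInfo
}

// BackupAdd appends unconditionally, even if an entry for the same backup
// name already exists from a previous (failed) attempt.
func (s *Server) BackupAdd(b *BackupInfo) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.BackupList = append(s.BackupList, b)
}

// BackupRemove is only called when a backup completes, so entries for
// backups that keep failing are never reclaimed.
func (s *Server) BackupRemove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, b := range s.BackupList {
		if b.Name == name {
			s.BackupList = append(s.BackupList[:i], s.BackupList[i+1:]...)
			return
		}
	}
}

func main() {
	s := &Server{}
	// Simulate the manager retrying the same failing backup over and over.
	for i := 0; i < 1000; i++ {
		s.BackupAdd(&BackupInfo{Name: "backup1", Error: "mount.nfs4: ... No such file or directory"})
	}
	fmt.Println("retained entries:", len(s.BackupList)) // grows without bound
}
```

With the slice-based approach, each failed retry leaves another entry (and its serialized error text) behind.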
pprof analysis
@ejweber, @james-munson, and I performed a pprof analysis
- Adding pprof to the sync-agent (a generic sketch of the technique follows this list), we exported the heap memory graph of the problematic sync-agent.
- We can see that most of the heap space is allocated by the errors.(*withMessage).Error() function, at this line of code (link).
- That line of code serializes the error and stores it into the BackupStatus struct, making the BackupStatus struct bigger.
- That BackupStatus struct is the same one that was previously appended to s.BackupList of the sync-agent server struct.
- The final result is that we keep adding big BackupStatus structs to the sync-agent server's BackupList. This causes the memory leak.
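For reference, this is one generic way to expose a heap profile from a long-running Go service using the standard `net/http/pprof` package; it is a sketch of the technique, not the exact patch we applied to the sync-agent:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a side port; the service's real work
	// would run elsewhere in this process.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // keep the process alive for this sketch
}
```

The heap profile can then be pulled and visualized with `go tool pprof http://localhost:6060/debug/pprof/heap`.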
What changes led to this problem:
- From @derekbit: the NFS mount change from hard mode to soft mode. This makes it possible for the sync-agent to retry the backup when it fails to mount; in hard mode, the sync-agent server would just be stuck there forever.
- From @ejweber: we now serialize the error and store it into the BackupStatus struct, which we didn't do before (link). This makes the BackupStatus struct bigger. Combined with the fact that the sync-agent server's BackupList keeps growing in the sync-agent server struct, this eats a lot of heap memory (a small sketch of this follows the list).
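To illustrate @ejweber's point, here is a small sketch (with hypothetical `BackupStatus` fields) of serializing a wrapped error into a retained status struct. `errors.Wrapf` is from github.com/pkg/errors, which is where the `errors.(*withMessage).Error()` frames in the heap profile come from:

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// BackupStatus stands in for the status object retained in s.BackupList.
type BackupStatus struct {
	Name  string
	Error string
}

func main() {
	// The underlying mount failure carries the full multi-line mount output.
	mountErr := fmt.Errorf("mount.nfs4: mounting ... failed, reason given by server: No such file or directory")

	// Wrapping adds context; calling Error() on the wrapped error builds the
	// whole concatenated string, which is then stored on every retained status.
	wrapped := errors.Wrapf(mountErr, "cannot mount nfs %s", "longhorn-test-nfs-svc.default:/opt/backupstore1")

	status := &BackupStatus{Name: "backup1", Error: wrapped.Error()}
	fmt.Println(len(status.Error), "bytes of error text retained for one attempt")
}
```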
Proposed solution:
- From @james-munson: so the fix would not be to stop adding the info, but perhaps to check whether it is new/different before doing so?
- From me: agree. We should NOT retain BackupStatus in s.BackupList forever, or at least we should collapse BackupStatus entries with the same ID by using a map instead of the slice that s.BackupList currently is (a rough sketch of this direction follows).
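A rough, runnable sketch of that direction, again with simplified hypothetical names: keep the retained statuses in a map keyed by backup name so repeated attempts for the same backup collapse into a single bounded entry:

```go
package main

import (
	"fmt"
	"sync"
)

type BackupStatus struct {
	Name  string
	Error string
}

// Server retains at most one status per backup name.
type Server struct {
	mu         sync.Mutex
	backupList map[string]*BackupStatus
}

func NewServer() *Server {
	return &Server{backupList: make(map[string]*BackupStatus)}
}

// BackupAdd overwrites any previous status for the same backup, so a
// backup that is retried forever keeps a single, bounded entry.
func (s *Server) BackupAdd(b *BackupStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.backupList[b.Name] = b
}

// BackupRemove drops the status once the backup finishes (or is deleted).
func (s *Server) BackupRemove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.backupList, name)
}

func main() {
	s := NewServer()
	for i := 0; i < 1000; i++ {
		s.BackupAdd(&BackupStatus{Name: "backup1", Error: "mount failed"})
	}
	fmt.Println("retained entries:", len(s.backupList)) // stays at 1
}
```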
Environment
- Longhorn version: v1.5.1
Additional context
This ticket was originally discussed at #6315