[BUG] Longhorn Instance Manager Memory leak · Issue #6481 · longhorn/longhorn
Labels: area/performance, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/qa-review-coverage
Description
Describe the bug (🐛 if you encounter this issue)
A bad backup target might cause a memory leak inside the Longhorn instance-manager pods.
To Reproduce
- Create a volume `testvol1` of 20 GB. Attach the volume. Write some data.
- Set up an NFS backup target that points to a non-existent directory, for example `nfs://longhorn-test-nfs-svc.default:/opt/backupstore-nonexist`.
- Create a snapshot named `snapshot-xyz`.
- Deploy about 100 backups of the same snapshot `snapshot-xyz` like the manifest below. (Technically, I think 1 backup should work too; having 100 backups just speeds up the memory leak.)

```yaml
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  labels:
    backup-volume: testvol1
  name: "backup1"
  namespace: longhorn-system
spec:
  labels:
    longhorn.io/volume-access-mode: rwo
  snapshotName: snapshot-xyz
```
- Observe that the Longhorn manager asks one of the volume's replicas to retry the backup repeatedly. The backup retry fails repeatedly with the mount error `No such file or directory`:

```
[instance-manager-87e672f2892911a9e3c1049af5825e55] time="2023-08-08T05:17:04Z" level=error msg="Failed to create delta block backup" destURL="nfs://longhorn-test-nfs-svc.default:/opt/backupstore1" error="cannot mount nfs longhorn-test-nfs-svc.default:/opt/backupstore1: vers=4.0: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.0,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: vers=4.1: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: vers=4.2: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.2,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.default:/opt/backupstore1 /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore1\nOutput: mount.nfs4: mounting longhorn-test-nfs-svc.default:/opt/backupstore1 failed, reason given by server: No such file or directory\n: cannot mount using NFSv4" snapshot="&{6b91c5f3-74b1-4454-b001-a99b36669545 2023-08-08T05:17:03Z}" volume="&{testvol1 21474836480 map[VolumeRecurringJobInfo:{} longhorn.io/volume-access-mode:rwo] 2023-08-08T05:17:03Z 0 lz4 }"
```
- Wait for a few hours. Observe that the memory usage of the instance-manager that prints the above logs keeps climbing (in my test it is `instance-manager-87e672f2892911a9e3c1049af5825e55`):

```
# Initial
instance-manager-63d5a33c5fe16bacb5a707c25a11c4d2   2m    44Mi
instance-manager-87e672f2892911a9e3c1049af5825e55   93m   61Mi
instance-manager-ca3fd80e3efef0746afbd5a56c4a43d1   32m   98Mi
longhorn-manager-mqhgg                              12m   112Mi
longhorn-manager-prjft                              8m    114Mi
longhorn-manager-x8x5k                              14m   119Mi

# After 5 hours
instance-manager-63d5a33c5fe16bacb5a707c25a11c4d2   10m   88Mi
instance-manager-87e672f2892911a9e3c1049af5825e55   25m   246Mi
instance-manager-ca3fd80e3efef0746afbd5a56c4a43d1   51m   165Mi
longhorn-manager-mqhgg                              7m    128Mi
longhorn-manager-prjft                              15m   130Mi
longhorn-manager-x8x5k                              9m    137Mi
```
- Check the RSS of the processes inside the instance-manager `instance-manager-87e672f2892911a9e3c1049af5825e55`; we can see that the RSS of the sync-agent service keeps climbing:

```
# Before
instance-manager-87e672f2892911a9e3c1049af5825e55:/ # ps aux
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
root    1  0.0  0.0    2508    700 ?   Ss   Aug07 0:00 /tini -- instance-manager --debug daemon --listen 0.0.0.0:8500
root   19  0.5  0.8 1903776  33548 ?   Sl   Aug07 1:00 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn replica /host/var/lib/longhorn/replicas/t
root   25  1.2  1.4 1389496  57796 ?   Sl   Aug07 2:03 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn sync-agent --listen 0.0.0.0:10002 --repli

# After
instance-manager-87e672f2892911a9e3c1049af5825e55:/ # ps aux
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
root    1  0.0  0.0    2508    700 ?   Ss   Aug07 0:01 /tini -- instance-manager --debug daemon --listen 0.0.0.0:8500
root    8  0.7  0.6 1469444  27828 ?   Sl   Aug07 3:43 longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root   19  0.5  0.8 1903776  34124 ?   Sl   Aug07 2:34 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn replica /host/var/lib/longhorn/replicas/testvol1-c7901ccd --siz
root   25  1.1  2.8 1390200 116224 ?   Sl   Aug07 5:10 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.5.1/longhorn sync-agent --listen 0.0.0.0:10002 --replica 0.0.0.0:10000 --lis
```
- The number of open FDs, TCP connections, and WebSockets is normal, so there is no issue there.
Code base analysis
- From the above information, we can see that there is memory leaking inside the `sync-agent` service when the backup creation is retried repeatedly.
- Tracing through the backup creation flow, I found one leak (a minimal sketch of the pattern follows this list):
  - When the sync-agent receives a backup create call, it calls BackupCreate().
  - This function calls s.BackupList.BackupAdd, which adds an in-memory `BackupInfo` struct to the sync-agent server struct.
  - The problem here is that the BackupInfo struct is never removed; we only remove the BackupInfo when the backup is completed (link).
  - As a consequence, the sync-agent server struct keeps getting bigger and bigger.
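To make the pattern concrete, here is a minimal runnable sketch of the behavior described above. The names (`Server`, `BackupInfo`, `BackupAdd`, `BackupRemove`) are simplified stand-ins for the sync-agent code, not the actual longhorn-engine implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// BackupInfo stands in for the per-backup state the sync-agent retains.
type BackupInfo struct {
	Name  string
	Error string // serialized error text from a failed attempt
}

// Server mimics a sync-agent server that keeps every attempt in a slice.
type Server struct {
	mu         sync.Mutex
	BackupList []*BackupInfo
}

// BackupAdd appends unconditionally, even if an entry for the same backup
// name already exists from a previous (failed) attempt.
func (s *Server) BackupAdd(b *BackupInfo) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.BackupList = append(s.BackupList, b)
}

// BackupRemove is only called when a backup completes, so entries for
// backups that keep failing are never reclaimed.
func (s *Server) BackupRemove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, b := range s.BackupList {
		if b.Name == name {
			s.BackupList = append(s.BackupList[:i], s.BackupList[i+1:]...)
			return
		}
	}
}

func main() {
	s := &Server{}
	// Simulate the manager retrying the same failing backup over and over.
	for i := 0; i < 1000; i++ {
		s.BackupAdd(&BackupInfo{Name: "backup1", Error: "mount.nfs4: ... No such file or directory"})
	}
	fmt.Println("retained entries:", len(s.BackupList)) // grows without bound
}
```

With the slice-based approach, each failed retry leaves another entry (and its serialized error text) behind.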
pprof analysis
@ejweber, @james-munson, and I performed a pprof analysis
- Adding pprof to the sync-agent (a generic sketch of the technique follows this list), we exported the heap memory graph of the problematic sync-agent.
- We can see that most of the heap space is allocated by the errors.(*withMessage).Error() function, at this line of code (link).
- That line of code serializes the error and stores it into the BackupStatus struct, making the BackupStatus struct bigger.
- That BackupStatus struct is the same one that was previously appended to s.BackupList of the sync-agent server struct.
- The final result is that we keep adding big BackupStatus structs to the sync-agent server's BackupList. This causes the memory leak.
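For reference, this is one generic way to expose a heap profile from a long-running Go service using the standard `net/http/pprof` package; it is a sketch of the technique, not the exact patch we applied to the sync-agent:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a side port; the service's real work
	// would run elsewhere in this process.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // keep the process alive for this sketch
}
```

The heap profile can then be pulled and visualized with `go tool pprof http://localhost:6060/debug/pprof/heap`.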
What changes led to this problem:
- From @derekbit: the NFS mount change from hard mode to soft mode. This makes it possible for the sync-agent to retry the backup when it fails to mount; in hard mode, the sync-agent server would just be stuck there forever.
- From @ejweber: we now serialize the error and store it into the BackupStatus struct, which we didn't do before (link). This makes the BackupStatus struct bigger. Combined with the fact that the sync-agent server's BackupList keeps growing in the sync-agent server struct, this eats a lot of heap memory (a small sketch of this follows the list).
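To illustrate @ejweber's point, here is a small sketch (with hypothetical `BackupStatus` fields) of serializing a wrapped error into a retained status struct. `errors.Wrapf` is from github.com/pkg/errors, which is where the `errors.(*withMessage).Error()` frames in the heap profile come from:

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// BackupStatus stands in for the status object retained in s.BackupList.
type BackupStatus struct {
	Name  string
	Error string
}

func main() {
	// The underlying mount failure carries the full multi-line mount output.
	mountErr := fmt.Errorf("mount.nfs4: mounting ... failed, reason given by server: No such file or directory")

	// Wrapping adds context; calling Error() on the wrapped error builds the
	// whole concatenated string, which is then stored on every retained status.
	wrapped := errors.Wrapf(mountErr, "cannot mount nfs %s", "longhorn-test-nfs-svc.default:/opt/backupstore1")

	status := &BackupStatus{Name: "backup1", Error: wrapped.Error()}
	fmt.Println(len(status.Error), "bytes of error text retained for one attempt")
}
```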
Proposed solution:
- From @james-munson: so the fix would not be to stop adding the info, but perhaps to check whether it is new/different before doing so?
- From me: agree. We should NOT retain BackupStatus in s.BackupList forever, or at least we should collapse BackupStatus entries with the same ID by using a map instead of the slice that s.BackupList currently is (a rough sketch of this direction follows).
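A rough, runnable sketch of that direction, again with simplified hypothetical names: keep the retained statuses in a map keyed by backup name so repeated attempts for the same backup collapse into a single bounded entry:

```go
package main

import (
	"fmt"
	"sync"
)

type BackupStatus struct {
	Name  string
	Error string
}

// Server retains at most one status per backup name.
type Server struct {
	mu         sync.Mutex
	backupList map[string]*BackupStatus
}

func NewServer() *Server {
	return &Server{backupList: make(map[string]*BackupStatus)}
}

// BackupAdd overwrites any previous status for the same backup, so a
// backup that is retried forever keeps a single, bounded entry.
func (s *Server) BackupAdd(b *BackupStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.backupList[b.Name] = b
}

// BackupRemove drops the status once the backup finishes (or is deleted).
func (s *Server) BackupRemove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.backupList, name)
}

func main() {
	s := NewServer()
	for i := 0; i < 1000; i++ {
		s.BackupAdd(&BackupStatus{Name: "backup1", Error: "mount failed"})
	}
	fmt.Println("retained entries:", len(s.backupList)) // stays at 1
}
```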
Environment
- Longhorn version: v1.5.1
Additional context
This ticket was originally discussed at #6315