[BUG] Deadlock for RWX volume if an error occurs in its share-manager pod

## Describe the bug (🐛 if you encounter this issue)

This is a regression introduced by https://github.com/longhorn/longhorn-manager/pull/2294 (for https://github.com/longhorn/longhorn/issues/7106).

When an error occurs in a share-manager pod, the pod exits and its status becomes `Completed`. The share-manager controller can then neither restart the pod nor update the status of the ShareManager CR, because every reconcile fails while trying to contact the dead share manager process to attempt a remount.

Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager controller did not attempt to contact the dead share manager process, so there was no deadlock.

## To Reproduce

1. Install Longhorn v1.6.0-dev.
2. Deploy the example NGINX deployment (`examples/rwx/rwx-nginx-deployment.yaml` in the longhorn/longhorn repo).
3. Identify the share-manager pod (named `share-manager-<volume name>` in the `longhorn-system` namespace).
4. Kill NFS-Ganesha inside the share-manager pod:
   `kubectl exec -n longhorn-system share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211 -- pkill ganesha.nfsd`
5. The share-manager pod stays in the `Completed` status and is not restarted.
    ```
    NAME                                                     READY   STATUS      RESTARTS   AGE
    share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211   0/1     Completed   0          12m
    ```
6. The longhorn-manager pod repeatedly logs a failure to sync the share-manager.
    ```
    [longhorn-manager-nf4lv] W1122 17:36:53.403245       1 logging.go:59] [core] [Channel #727 SubChannel #728] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.59.152:9600", ServerName: "10.42.59.152:9600", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout"
    [longhorn-manager-nf4lv] time="2023-11-22T17:36:53Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211: failed to mount share manager pod share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout\"" node=eweber-v125-worker-e472db53-9kz5b
    ```
7. The ShareManager CR's status is not updated. It still shows as `running`.
    ```
    NAME                                       STATE     NODE                                AGE
    pvc-9216f564-379e-4fd8-861b-e335ccbe8211   running   eweber-v125-worker-e472db53-9kz5b   16m
    ```

## Expected behavior

The share-manager pod should be restarted after the failure, as it was before https://github.com/longhorn/longhorn-manager/pull/2294.

## Support bundle for troubleshooting

There is a support bundle in the related CNCF Slack thread:
https://cloud-native.slack.com/archives/CNVPEL9U3/p1700618585865019

There are other issues in that support bundle as well, so the reproduction steps above may be a bit easier to work with.

## Additional context

After https://github.com/longhorn/longhorn-manager/pull/2294, in the share-manager controller, we always attempt a gRPC call to the share-manager pod to do a remount if `status.state == running` in the ShareManager CR.

https://github.com/longhorn/longhorn-manager/blob/3a66afaa7ec086f7f6ec6ebb10fd6797ac30830d/controller/share_manager_controller.go#L543-L545

However, we do not update the `status.state` of the ShareManager CR until AFTER this point in the reconcile loop.

https://github.com/longhorn/longhorn-manager/blob/3a66afaa7ec086f7f6ec6ebb10fd6797ac30830d/controller/share_manager_controller.go#L688-L711

So we reconcile indefinitely. We cannot do a remount because the share-manager pod is dead, and we cannot learn that the share-manager pod is dead because we error out of the reconcile before reaching the status update.
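To make the ordering problem concrete, here is a minimal, self-contained sketch. It is not the longhorn-manager code: `Pod`, `ShareManager`, `remount`, `syncState`, the `error` state value, and both reconcile functions are hypothetical stand-ins, and the reordering in `reconcileStateFirst` is only one possible direction for a fix, not necessarily the one that should be adopted.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical stand-in for status.state in the ShareManager CR.
type State string

const (
	StateRunning State = "running"
	StateError   State = "error" // assumed value, for illustration only
)

// Pod is a hypothetical stand-in for the share-manager pod.
type Pod struct {
	Alive bool // false once the pod has exited (STATUS "Completed")
}

// ShareManager is a hypothetical stand-in for the ShareManager CR.
type ShareManager struct {
	State State
}

var errUnavailable = errors.New("rpc error: code = Unavailable")

// remount stands in for the gRPC call to the share manager process.
func remount(p Pod) error {
	if !p.Alive {
		return errUnavailable
	}
	return nil
}

// syncState stands in for the later step that derives status.state from the pod.
func syncState(sm *ShareManager, p Pod) {
	if p.Alive {
		sm.State = StateRunning
	} else {
		sm.State = StateError
	}
}

// reconcileRemountFirst mirrors the ordering described above:
// the remount is attempted before the state is refreshed.
func reconcileRemountFirst(sm *ShareManager, p Pod) error {
	if sm.State == StateRunning {
		if err := remount(p); err != nil {
			return err // returns before syncState is ever reached
		}
	}
	syncState(sm, p)
	return nil
}

// reconcileStateFirst is one possible reordering: refresh the state first,
// then remount only if the share manager is still running.
func reconcileStateFirst(sm *ShareManager, p Pod) error {
	syncState(sm, p)
	if sm.State == StateRunning {
		return remount(p)
	}
	return nil
}

func main() {
	dead := Pod{Alive: false}

	a := &ShareManager{State: StateRunning}
	for i := 0; i < 3; i++ {
		err := reconcileRemountFirst(a, dead)
		fmt.Printf("remount-first pass %d: state=%s err=%v\n", i+1, a.State, err)
	}

	b := &ShareManager{State: StateRunning}
	err := reconcileStateFirst(b, dead)
	fmt.Printf("state-first: state=%s err=%v\n", b.State, err)
}
```

In the remount-first variant the state never leaves `running`, so every pass takes the same failing path. In the state-first variant the dead pod is noticed before any gRPC call is made, which leaves the controller free to handle the failure.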

## Workaround

https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
