Description
Describe the bug
This is a regression introduced by longhorn/longhorn-manager#2294 (for #7106).
When an error occurs in a share-manager pod, its phase transitions to completed. The share-manager controller is unable to restart the pod or update the status of the ShareManager CR because it continuously fails to contact the share manager process to attempt a remount.
Before longhorn/longhorn-manager#2294, the share-manager controller did not attempt to contact the dead share manager process, so there was no deadlock.
To Reproduce
- Install Longhorn v1.6.0-dev.
- Deploy the example NGINX deployment (examples/rwx/rwx-nginx-deployment.yaml in the longhorn/longhorn repo).
- Identify the share manager pod.
- Kill NFS-Ganesha inside the share-manager pod:
kubectl exec -n longhorn-system share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211 -- pkill ganesha.nfsd
- The share-manager pod remains in the completed phase and is not restarted.
  NAME                                                     READY   STATUS      RESTARTS   AGE
  share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211   0/1     Completed   0          12m
- The longhorn-manager pod repeatedly logs a failure to sync the share-manager.
[longhorn-manager-nf4lv] W1122 17:36:53.403245 1 logging.go:59] [core] [Channel #727 SubChannel #728] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.59.152:9600", ServerName: "10.42.59.152:9600", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout"
[longhorn-manager-nf4lv] time="2023-11-22T17:36:53Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211: failed to mount share manager pod share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout\"" node=eweber-v125-worker-e472db53-9kz5b
- The ShareManager CR's status is not updated. It still shows as running.
  NAME                                       STATE     NODE                                AGE
  pvc-9216f564-379e-4fd8-861b-e335ccbe8211   running   eweber-v125-worker-e472db53-9kz5b   16m
Expected behavior
Before longhorn/longhorn-manager#2294, the share-manager pod would be successfully restarted.
Support bundle for troubleshooting
There is a support bundle in the related CNCF Slack thread:
https://cloud-native.slack.com/archives/CNVPEL9U3/p1700618585865019
There are other issues in that support bundle as well, so the reproduction steps above may be a bit easier to work with.
Additional context
After longhorn/longhorn-manager#2294, in the share-manager controller, we always attempt a gRPC call to the share-manager pod to do a remount if status.state == running in the ShareManager CR.
However, we do not update the status.state of the ShareManager CR until AFTER this point in the reconcile loop.
So we indefinitely reconcile. We cannot do a remount because the share-manager pod is dead, and we cannot learn the share-manager pod is dead because we error out while attempting to reconcile.
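The ordering problem described above can be sketched as a small, self-contained Go model. This is not the longhorn-manager code: the types ShareManager and Pod, the helpers remount and syncState, and the "error" state value are simplified, hypothetical stand-ins chosen for illustration. reconcileStateFirst shows one possible reordering for contrast and is not necessarily the fix that should be adopted.

```go
package main

import (
	"errors"
	"fmt"
)

// ShareManager is a hypothetical stand-in for the ShareManager CR.
type ShareManager struct {
	State string // mirrors status.state, e.g. "running"
}

// Pod is a hypothetical stand-in for the share-manager pod.
type Pod struct {
	ProcessAlive bool // false once the share manager process has exited
}

var errUnavailable = errors.New("rpc error: code = Unavailable")

// remount stands in for the gRPC remount call to the share manager process.
func remount(pod *Pod) error {
	if !pod.ProcessAlive {
		return errUnavailable
	}
	return nil
}

// syncState stands in for the step that derives status.state from the pod.
// "error" is an assumed placeholder value for a failed share manager.
func syncState(sm *ShareManager, pod *Pod) {
	if !pod.ProcessAlive {
		sm.State = "error"
	}
}

// reconcileRemountFirst models the ordering described in this issue:
// the remount is attempted before status.state is refreshed, so a dead
// pod makes every reconcile return early with the state still "running".
func reconcileRemountFirst(sm *ShareManager, pod *Pod) error {
	if sm.State == "running" {
		if err := remount(pod); err != nil {
			return fmt.Errorf("failed to mount share manager pod: %w", err)
		}
	}
	syncState(sm, pod)
	return nil
}

// reconcileStateFirst is a hypothetical reordering: refresh the state
// first, then remount only if the share manager is still running.
func reconcileStateFirst(sm *ShareManager, pod *Pod) error {
	syncState(sm, pod)
	if sm.State == "running" {
		if err := remount(pod); err != nil {
			return fmt.Errorf("failed to mount share manager pod: %w", err)
		}
	}
	return nil
}

func main() {
	pod := &Pod{ProcessAlive: false}

	stuck := &ShareManager{State: "running"}
	for i := 0; i < 3; i++ {
		err := reconcileRemountFirst(stuck, pod)
		fmt.Printf("remount first: state=%s err=%v\n", stuck.State, err)
	}

	recovered := &ShareManager{State: "running"}
	err := reconcileStateFirst(recovered, pod)
	fmt.Printf("state first:   state=%s err=%v\n", recovered.State, err)
}
```

In this model, reconcileRemountFirst fails on every pass and never leaves the running state, which matches the repeated "failed to mount share manager pod" errors in the log above. With the state refreshed first, the controller would see the dead pod and could proceed to restart it.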
Workaround