Description
Describe the bug (🐛 if you encounter this issue)
This is another cause of inappropriate replica expansion uncovered while implementing #5845. Maybe this is the mode of failure in #6078? I'll have to figure out a way to confirm.
When a volume is detached, the following things happen in the following order:
- Rebuilding replicas are killed.
- The engine is killed.
- Healthy replicas are killed.
There is a window of at least 10 seconds between the time longhorn-manager starts a snapshot purge for a rebuild and the time the rebuild actually starts. During that window, the engine is not reconciled by the engine controller (it is already in the middle of a reconciliation). If the timing is unlucky, the rebuilding replica can be killed and a new replica (for a different volume) can take its place on the same address and port. The engine controller then continues with the rebuild and uses the engine to rebuild the wrong replica. This can lead to inappropriate expansion.
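To make the race concrete, here is a small, self-contained Go sketch of the failure mode. The types, map, and helper names are illustrative stand-ins, not longhorn-manager code; the point is that the rebuild target is identified only by address:port, so whatever happens to be listening there after the purge wait is what gets rebuilt.

package main

import "fmt"

// Hypothetical stand-in for a replica process reachable on a node.
type replica struct {
	volume string // volume the replica belongs to
	sizeMi int    // volume size in MiB
}

func main() {
	// Replica processes on the node, keyed by address:port.
	listening := map[string]replica{
		"10.42.150.121:10147": {volume: "pvc-70c30a6e", sizeMi: 512}, // the rebuilding replica
	}

	// The engine controller records only the address before starting the
	// snapshot purge, then waits at least 10 seconds without reconciling.
	rebuildTarget := "10.42.150.121:10147"

	// Meanwhile the volume is detached: the rebuilding replica is killed and a
	// replica of a different volume starts on the same port.
	listening[rebuildTarget] = replica{volume: "pvc-059a429b", sizeMi: 1024}

	// The purge window closes and the controller resumes the rebuild using the
	// stale address, so it talks to the wrong, differently sized replica.
	got := listening[rebuildTarget]
	fmt.Printf("rebuilding against %s (volume %s, %d MiB) - wrong volume, can trigger inappropriate expansion\n",
		rebuildTarget, got.volume, got.sizeMi)
}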
To Reproduce
Steps to reproduce the behavior:
- Deploy a Longhorn cluster from master (v1.5.0).
- Create two StatefulSets, each with 25 pods, using a different volume size for each (manifests below).
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx2
  labels:
    app: nginx2
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx2
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web2
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx2 # has to match .spec.template.metadata.labels
  serviceName: "nginx2"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx2 # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www2
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1024Mi
- Use a script to periodically kill an instance-manager pod.
#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

while true; do
    # Delete an instance-manager pod with a short grace period
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-c090b6286e8431380a8cbe71c3fb43ec

    # Wait for 300 seconds
    sleep 300

    # Check instance-manager and longhorn-manager logs for the trigger keywords
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time "$current_time" | grep -i -e "invalid grpc metadata"
    log_result_1=$?
    kubectl -n longhorn-system logs -l app=longhorn-manager --tail -1 --since-time "$current_time" | grep -i -e "invalid grpc metadata" -e "incorrect volume name" -e "incorrect instance name"
    log_result_2=$?

    # Stop if either grep matched
    if [[ $log_result_1 -eq 0 || $log_result_2 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done
- Wait for the script to fail. In my cluster, this usually takes between 3 and 10 iterations.
- Observe the cause of the failure. NOTE: these logs are generated with code for [IMPROVEMENT] Longhorn-engine processes should refuse to serve requests not intended for them #5845 that isn't yet merged (a minimal sketch of that check follows the log excerpt below). Without that code, we will inappropriately expand the volume instead of safely failing out.
2023-06-27T19:48:22.606088859Z [pvc-059a429b-28f7-4b21-a7cc-45c8632be109-r-cd8811d9] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2 method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-059a429b-28f7-4b21-a7cc-45c8632be109
2023-06-27T19:48:22.607799147Z [pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938-r-03074ae7] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938
2023-06-27T19:48:22.607532292Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607555900Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607560110Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff\", UID:\"fb4612d5-6ca8-4417-a829-e65fc1eab29a\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778902\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10147: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"
2023-06-27T19:48:22.608620253Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608673822Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608734835Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", N
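For context on the "Invalid gRPC metadata" / "Incorrect volume name" errors above: #5845 has the longhorn-engine processes compare the volume name the client claims to be addressing against the volume the server actually serves, and refuse the request on a mismatch. Below is a minimal Go sketch of that kind of guard; the "volume-name" metadata key and the wiring are assumptions for illustration, not the actual #5845 implementation.

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// volumeNameGuard rejects any request whose "volume-name" metadata does not
// match the volume this replica/engine process serves, so a stale client that
// reaches a reused address:port fails fast instead of operating on the wrong
// volume.
func volumeNameGuard(serverVolumeName string) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if md, ok := metadata.FromIncomingContext(ctx); ok {
			if names := md.Get("volume-name"); len(names) > 0 && names[0] != serverVolumeName {
				return nil, status.Errorf(codes.FailedPrecondition,
					"incorrect volume name; check replica address (client=%s, server=%s)",
					names[0], serverVolumeName)
			}
		}
		return handler(ctx, req)
	}
}

func main() {
	// Wire the guard into the gRPC server of a replica that serves this volume.
	srv := grpc.NewServer(grpc.UnaryInterceptor(
		volumeNameGuard("pvc-059a429b-28f7-4b21-a7cc-45c8632be109")))
	_ = srv // service registration and srv.Serve(listener) omitted in this sketch
}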
Expected behavior
The engine controller should not attempt to continue with the rebuild using the wrong replica.
Log or Support bundle
A summary of logs from various components when the issue occurs is shown in the reproduction steps above.
Environment
- Longhorn version: master (v1.5.0), probably others
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version:
- CPU per node: 4
- Memory per node: 8
- Disk type (e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
- Number of Longhorn volumes in the cluster: 50
Additional context
The reproduction works because killing the instance-manager pod causes the following chain of events:
- All engines and replicas owned by that instance-manager restart.
- Longhorn-manager restarts all workload pods that use each engine.
- When a workload pod is killed, its corresponding volume is detached.
- When a volume is detached, its rebuilding replica is killed immediately.
- If the timing is bad, a new replica starts using the previous replica's port while the engine is in the 10-second snapshot purge window. The window closes and the engine controller then communicates with the new replica.
Ideas
- Augment this check to also ensure that engine.spec.desireState != stopped before continuing to rebuild (see the sketch after this list).
- Don't kill rebuilding replicas before killing the engine? (This may not be possible.)