Description
Describe the bug (🐛 if you encounter this issue)
This is another cause of inappropriate replica expansion uncovered while implementing #5845. Maybe this is the mode of failure in #6078? I'll have to figure out a way to confirm.
When a volume is detached, the following things happen in the following order:
- Rebuilding replicas are killed.
- The engine is killed.
- Healthy replicas are killed.
There is a window of at least 10 seconds between the time longhorn-manager starts a snapshot purge for a rebuild and the time the rebuild actually starts. During that window, the engine is not reconciled by the engine controller (it is already in the middle of a reconciliation). If the timing is unlucky, the rebuilding replica can be killed and a new replica (for a different volume) can take its place on the same address and port. The engine controller then continues with the rebuild and uses the engine to rebuild the wrong replica. This can lead to inappropriate expansion.
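To make the race concrete, here is a small, self-contained Go sketch of the failure mode. The types, map, and helper names are illustrative stand-ins, not longhorn-manager code; the point is that the rebuild target is identified only by address:port, so whatever happens to be listening there after the purge wait is what gets rebuilt.

package main

import "fmt"

// Hypothetical stand-in for a replica process reachable on a node.
type replica struct {
	volume string // volume the replica belongs to
	sizeMi int    // volume size in MiB
}

func main() {
	// Replica processes on the node, keyed by address:port.
	listening := map[string]replica{
		"10.42.150.121:10147": {volume: "pvc-70c30a6e", sizeMi: 512}, // the rebuilding replica
	}

	// The engine controller records only the address before starting the
	// snapshot purge, then waits at least 10 seconds without reconciling.
	rebuildTarget := "10.42.150.121:10147"

	// Meanwhile the volume is detached: the rebuilding replica is killed and a
	// replica of a different volume starts on the same port.
	listening[rebuildTarget] = replica{volume: "pvc-059a429b", sizeMi: 1024}

	// The purge window closes and the controller resumes the rebuild using the
	// stale address, so it talks to the wrong, differently sized replica.
	got := listening[rebuildTarget]
	fmt.Printf("rebuilding against %s (volume %s, %d MiB) - wrong volume, can trigger inappropriate expansion\n",
		rebuildTarget, got.volume, got.sizeMi)
}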
To Reproduce
Steps to reproduce the behavior:
- Deploy a Longhorn cluster from master (v1.5.0).
- Create two StatefulSets, each with 25 pods, using a different volume size for each (manifests below).
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx2
  labels:
    app: nginx2
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx2
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web2
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx2 # has to match .spec.template.metadata.labels
  serviceName: "nginx2"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx2 # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www2
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1024Mi
- Use a script to periodically kill an instance-manager pod.
#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

while true; do
    # Delete an instance-manager pod with a short grace period
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-c090b6286e8431380a8cbe71c3fb43ec

    # Wait for 300 seconds
    sleep 300

    # Check instance-manager and longhorn-manager logs for the trigger keywords
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time "$current_time" | grep -i -e "invalid grpc metadata"
    log_result_1=$?
    kubectl -n longhorn-system logs -l app=longhorn-manager --tail -1 --since-time "$current_time" | grep -i -e "invalid grpc metadata" -e "incorrect volume name" -e "incorrect instance name"
    log_result_2=$?

    # Stop if either grep matched
    if [[ $log_result_1 -eq 0 || $log_result_2 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done
- Wait for the script to fail. In my cluster, this usually takes between 3 and 10 iterations.
- Observe the cause of the failure. NOTE: these logs are generated with code for [IMPROVEMENT] Longhorn-engine processes should refuse to serve requests not intended for them #5845 that isn't yet merged (a minimal sketch of that check follows the log excerpt below). Without that code, we will inappropriately expand the volume instead of safely failing out.
2023-06-27T19:48:22.606088859Z [pvc-059a429b-28f7-4b21-a7cc-45c8632be109-r-cd8811d9] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2 method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-059a429b-28f7-4b21-a7cc-45c8632be109
2023-06-27T19:48:22.607799147Z [pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938-r-03074ae7] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938
2023-06-27T19:48:22.607532292Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607555900Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607560110Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff\", UID:\"fb4612d5-6ca8-4417-a829-e65fc1eab29a\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778902\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10147: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"
2023-06-27T19:48:22.608620253Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608673822Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608734835Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", N
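For context on the "Invalid gRPC metadata" / "Incorrect volume name" errors above: #5845 has the longhorn-engine processes compare the volume name the client claims to be addressing against the volume the server actually serves, and refuse the request on a mismatch. Below is a minimal Go sketch of that kind of guard; the "volume-name" metadata key and the wiring are assumptions for illustration, not the actual #5845 implementation.

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// volumeNameGuard rejects any request whose "volume-name" metadata does not
// match the volume this replica/engine process serves, so a stale client that
// reaches a reused address:port fails fast instead of operating on the wrong
// volume.
func volumeNameGuard(serverVolumeName string) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if md, ok := metadata.FromIncomingContext(ctx); ok {
			if names := md.Get("volume-name"); len(names) > 0 && names[0] != serverVolumeName {
				return nil, status.Errorf(codes.FailedPrecondition,
					"incorrect volume name; check replica address (client=%s, server=%s)",
					names[0], serverVolumeName)
			}
		}
		return handler(ctx, req)
	}
}

func main() {
	// Wire the guard into the gRPC server of a replica that serves this volume.
	srv := grpc.NewServer(grpc.UnaryInterceptor(
		volumeNameGuard("pvc-059a429b-28f7-4b21-a7cc-45c8632be109")))
	_ = srv // service registration and srv.Serve(listener) omitted in this sketch
}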
Expected behavior
The engine controller should not attempt to continue with the rebuild using the wrong replica.
Log or Support bundle
A summary of logs from various components when the issue occurs is shown in the reproduction steps above.
Environment
- Longhorn version: master (v1.5.0), probably others
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version:
- CPU per node: 4
- Memory per node: 8
- Disk type (e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
- Number of Longhorn volumes in the cluster: 50
Additional context
The reproduction works because killing the instance-manager pod causes the following chain of events:
- All engines and replicas owned by that instance-manager restart.
- Longhorn-manager restarts all workload pods that use each engine.
- When a workload pod is killed, its corresponding volume is detached.
- When a volume is detached, its rebuilding replica is killed immediately.
- If the timing is bad, a new replica starts using the previous replica's port while the engine is in the 10-second snapshot purge window. The window closes and the engine controller then communicates with the new replica.
Ideas
- Augment this check to also ensure that engine.spec.desireState != stopped before continuing to rebuild (see the sketch after this list).
- Don't kill rebuilding replicas before killing the engine? (This may not be possible.)