Troubleshooting
- Troubleshooting flow diagrams
- Troubleshooting an individual Longhorn volume
- Log messages for common issues
Troubleshooting flow diagrams
These diagrams attempt to capture a general flow for investigating Longhorn issues. Longhorn is a complex system, so there is no way for one (or even many) diagrams to lay out a path to diagnosing every possible problem. The emphasis here is on identifying logs or objects that might be useful to look at and a general order in which to look at them.
Currently, these diagrams are implemented in LucidChart, but it may be a good idea to move them to a different tool (e.g. https://mermaid.js.org) that "compiles" them from source that is tracked in git per https://github.com/longhorn/longhorn/issues/7410. The diagrams were created with a personal LucidChart account, but should be copyable to any account (for editing) as long as they exist.
Troubleshooting an individual Longhorn volume
While volume management in Longhorn is generally straightforward and intuitive, the inner workings are actually quite complex and involve many components. When there is some issue with a single Longhorn volume, there are two places to look for useful troubleshooting information:
- The logs of the various Longhorn component pods
- Kubernetes objects and Longhorn custom resource objects stored in the cluster
The goal of this section is to provide some tips that can help narrow down the search for immediate issues and the root causes of those issues.
Troubleshooting information usually comes from one of two sources:
- kubectl (and potentially SSH) interactions with a live (or simulated) cluster
- An unzipped support bundle
It is generally possible to obtain the same types of information from a live/simulated cluster and an unzipped support bundle.
Obtaining logs
- In a live cluster, obtain logs for a Longhorn pod by executing kubectl logs -n longhorn-system <pod_name>. More generically, obtain logs for any pod by executing kubectl logs -n <namespace> <pod_name>.
- In an unzipped support bundle, obtain logs for a Longhorn pod by opening bundle/logs/longhorn-system/<pod_name>/<pod_name>.log. More generically, obtain logs for any pod by opening bundle/logs/<namespace>/<pod_name>/<pod_name>.log. Only a relevant subset of namespaces is available in a support bundle.
Checking Kubernetes objects or Longhorn custom resource objects
- In a live cluster, get the full details from a Longhorn custom resource object by executing kubectl get -n longhorn-system -oyaml <object_kind> <object_name>. More generically, get the full details from any Kubernetes object by executing kubectl get -n <namespace> -oyaml <object_kind> <object_name>.
- In an unzipped support bundle, get the full details from a Longhorn custom resource object by opening bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/<object_kind>.yaml. More generically, get the full details from any Kubernetes object by opening bundle/yamls/namespaced/<namespace>/<object_group>/<object_version>/<object_kind>.yaml.
  - Many relevant Kubernetes objects exist in the longhorn-system namespace but are NOT Longhorn custom resource objects. For example, Longhorn pods can be found in bundle/yamls/namespaced/longhorn-system/v1/pods.yaml.
  - Only a relevant subset of namespaces is available in a support bundle.
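For example, to dump a single volume CR from a live cluster (pvc-abc123 below is only a placeholder volume name):
# Full Volume custom resource for a hypothetical volume named pvc-abc123.
kubectl get -n longhorn-system volumes.longhorn.io pvc-abc123 -oyaml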
Getting host/node information
- In a live cluster, it is generally necessary to SSH to a host/node to get kubelet, kernel, or journal logs. Often these are quite useful when troubleshooting a volume.
- In an unzipped support bundle, information like this is available in bundle/nodes/<node_name>/.
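As a rough sketch of what to check after SSHing to a node (exact unit names vary by distribution; on k3s or RKE2 the kubelet logs live in the k3s or rke2-agent journal rather than a separate kubelet unit):
# Kernel messages with human-readable timestamps (useful for disk, filesystem, or iSCSI/NVMe errors).
dmesg -T | less
# Kubelet journal on a systemd-managed kubelet, narrowed to the time window of the issue.
journalctl -u kubelet --since "1 hour ago" | less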
Whether you are troubleshooting your own cluster or are receiving a support bundle from an external source, the following are the general steps required to narrow down the problem and find a solution. Steps need not be executed in this exact order, and those with more Longhorn experience may be able to skip steps altogether for many problem descriptions.
- Obtain as much information about the problem symptoms as possible. Context from questions like the following can be very valuable. Issues can often look quite confusing from the perspective of a support bundle alone until the user provides additional context.
- Is the issue limited to a particular volume or the whole cluster? If a particular volume, which one?
- Has Longhorn (or the particular component or volume) ever worked as expected? (In other words, is the issue new?)
- When exactly was the issue first observed?
- Is the issue ongoing?
- What was the user doing with/in Longhorn when the issue was triggered?
- What was going on in the cluster (or infrastructure) when the issue was triggered?
- If the Longhorn installation seems to be generally unstable, try walking through Longhorn Overall Installation Troubleshooting (TODO: move or provide a better link).
- If the issue is with a particular volume, it can be helpful to briefly search all Longhorn logs for repeated errors using the name of the volume (see the search sketch after this list). This works best on an unzipped support bundle. Usually, thousands of (not useful) logs are returned containing many false leads, so don't spend too much time on this at first. (Return to it later if necessary.)
- If the issue is with a particular volume, check that volume's CR (and related CRs) for the fields highlighted in Useful fields for troubleshooting. This information is useful in the next step and it provides necessary context for the issue.
- If the issue is with a particular volume, try walking through Longhorn Workload Pod Pending Troubleshooting (TODO: move or provide a better link).
- Check GitHub for issues with relevant symptoms.
- Return to step 3, but this time take care to put together an understanding of exactly what happened to the volume (and its components) before and after the issue was observed. Analysis like this is sometimes required, but it is quite time consuming!
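A minimal search sketch against an unzipped support bundle (pvc-abc123 is a placeholder volume name):
# Surface repeated errors mentioning the volume across all Longhorn component logs.
grep -ri "pvc-abc123" bundle/logs/longhorn-system/ | grep -iE "error|fail" | less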
Useful fields for troubleshooting
It is often quite useful to compile a subset of the fields from a volume CR (and related CRs) while building context about an issue. Many Longhorn CR fields are internal implementation details. They can be useful for troubleshooting in very specific contexts, but are more likely to provide extra noise that makes understanding more difficult. The following is a list of fields that are likely to be relevant while troubleshooting, with some context for WHY each might be relevant.
The volume should generally be the first object checked.
kind: Volume
metadata:
# Is the volume new?
# How soon after the volume's creation did the issue occur?
creationTimestamp:
# Most related objects are prefixed by this name.
name:
spec:
# "Normal" RWX volumes have share-managers, rely on NFS traffic, etc.
accessMode:
# It is probably best to browse for v2 specific issues for v2 volumes.
backendStoreDriver:
# Unencrypted volumes are more common.
# Encrypted volumes may see issues during mount/unmount.
encrypted:
# A volume may be on an older version than the Longhorn installation.
image / engineImage:
# This is the node the volume is attached/attaching to.
nodeID:
# Does the volume have the expected number of replicas?
numberOfReplicas:
status:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
# When browsing a support bundle, this is a great place to get names to search logs for.
# pvcName, podName, etc. may appear in relevant logs.
kubernetesStatus:
# If a volume was autosalvaged, this timestamp gives a good idea of when.
remountRequestedAt:
# Unknown is expected for a detached volume. Healthy is expected for an attached volume.
# Volumes with issues may be faulted instead.
robustness:
# Many past issues have resulted in an "attach/detach" loop.
# In this loop, the volume transitions from detached -> attached -> detached repeatedly.
state:
spec:
# This is the node the volume is trying to migrate to.
migrationNodeID:
status:
  # This is the node running the "new" engine for the migration.
currentMigrationNodeID:
# This is the node running the "old" engine for the migration.
currentNodeID:
spec:
# Many scheduling issues are related to data locality.
dataLocality:
# These fields limit nodes replicas can schedule to.
replicaSoftAntiAffinity:
replicaDiskSoftAntiAffinity:
replicaZoneSoftAntiAffinity:
# The spec size may be too big for scheduling.
size:
status:
# The actual size may be too big for scheduling.
actualSize:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
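To scan several of these fields across all volumes in a live cluster, something like the following works (a sketch; on Longhorn versions older than v1.5, spec.image may instead be spec.engineImage):
# Summarize state, robustness, attachment node, and sizes for every volume.
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,NODE:.spec.nodeID,SIZE:.spec.size,ACTUAL:.status.actualSize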
Many volume issues either cause or are caused by failing replicas. It is usually a good idea to gather information about a volume's replicas early in the troubleshooting process.
Find the names of a volume's replicas by:
- In a live cluster, executing kubectl get replica -n longhorn-system | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/replicas.yaml for <volume-name>-r-.
kind: Replica
metadata:
# How soon after the volume's creation was this replica created?
  # Is it an original? Has it been created more recently to replace an original?
creationTimestamp:
name:
spec:
# Is Longhorn trying to run the replica?
desireState:
# A replica may be on an older version than the Longhorn installation.
image / engineImage:
# When set, the replica is currently failed. It must be rebuilt to be used.
# A rebuilding/rebuilt replica no longer has failedAt set.
# When did the replica fail? There may be useful logs at that time.
failedAt:
# When set, the replica is currently healthy.
# An actively rebuilding replica no longer has healthyAt set.
# When was the replica last successfully rebuilt?
healthyAt:
# Records the last time the replica failed, even if it is currently healthy.
# Never cleared.
# When did the replica fail? There may be useful logs at that time.
lastFailedAt:
# Records the last time the replica was actively used by the engine.
# Never cleared.
lastHealthyAt:
# This is the node that contains the replica's data. It must run here.
nodeID:
status:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
# Is the replica running?
currentState:
# Replica issues are often diagnosed by checking instance-manager logs.
# This is a good place to get the name of the right instance-manager for a running replica.
# Empty if the replica is not running.
instanceManagerName:
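To pull these timestamps for the replicas of a single volume in a live cluster (pvc-abc123 is a placeholder volume name):
# List each replica's node, state, and failure/health timestamps; keep the header row plus matching rows.
kubectl get replicas.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STATE:.status.currentState,FAILEDAT:.spec.failedAt,HEALTHYAT:.spec.healthyAt \
  | grep -e NAME -e pvc-abc123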
Find the name of a volume's engine by:
- In a live cluster, executing kubectl get engine -n longhorn-system | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/engines.yaml for <volume-name>-e-.
kind: Engine
metadata:
name:
spec:
# Is Longhorn trying to run the engine?
desireState:
# An engine may be on an older version than the Longhorn installation.
image / engineImage:
# This is the node currently trying to run the engine.
nodeID:
# These are the replicas longhorn-manager wants the engine to use.
replicaAddressMap:
status:
# These are the replicas the engine is trying to use.
currentReplicaAddressMap:
# Is the engine running?
currentState:
# Engine issues are often diagnosed by checking instance-manager logs.
# This is a good place to get the name of the right instance-manager for a running engine.
# Empty if the engine is not running.
instanceManagerName:
# These are the replicas the engine is currently using.
# RW replicas are healthy.
# WO replicas are being rebuilt.
replicaModeMap:
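For a quick look at which replicas an engine is actually using in a live cluster (pvc-abc123-e-0 is a hypothetical engine name of the <volume-name>-e- form):
# RW entries are healthy replicas; WO entries are still being rebuilt.
kubectl get engines.longhorn.io -n longhorn-system pvc-abc123-e-0 -o jsonpath='{.status.replicaModeMap}'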
There may be a disagreement between the desired and actual state of Longhorn volumes as Kubernetes sees them versus as Longhorn sees them. It can be quite useful to understand what both components are trying to do (in terms of volume attachments) at any given moment.
Find the name of any relevant Kubernetes volume attachments by:
- In a live cluster, executing kubectl get volumeattachment | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/cluster/storage.k8s.io/v1/volumeattachments.yaml for <volume_name>.
If there is no VolumeAttachment referencing a volume, Kubernetes does not think the volume is attached and is not attempting to attach the volume. For normal RWX or migratable volumes, multiple VolumeAttachments may exist.
kind: VolumeAttachment
metadata:
# When did Kubernetes start trying to attach the volume to the node?
creationTimestamp:
name:
spec:
# This is the node Kubernetes wants to attach the volume to.
nodeName:
source:
# This is the name of the volume.
persistentVolumeName:
status:
# Does Kubernetes think the volume is attached?
attached:
In Longhorn v1.5.x+, Longhorn also maintains its own set of VolumeAttachment custom resources. Every volume has a Longhorn VolumeAttachment, even if Kubernetes doesn't want it attached. If Kubernetes (or some other component) DOES want a volume attached, Longhorn adds attachment tickets.
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name:
spec:
# Every reason Longhorn wants a volume attached results in a ticket.
# Reasons include:
# - There is a Kubernetes attachment ticket.
# - The UI has requested attachment.
# - A snapshot or backup is being taken.
# - Etc.
# Longhorn chooses a ticket to act on based on priority.
attachmentTickets:
status:
# Is each ticket from the spec fulfilled?
attachmentTicketStatuses:
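To compare the Kubernetes view with the Longhorn view for one volume in a live cluster (pvc-abc123 is a placeholder; the Longhorn VolumeAttachment CR is typically named after the volume):
# Kubernetes VolumeAttachment(s) referencing the volume (keep the header row plus matching rows).
kubectl get volumeattachment | grep -e NAME -e pvc-abc123
# Longhorn's own VolumeAttachment CR, including attachment tickets and their statuses.
kubectl get volumeattachments.longhorn.io -n longhorn-system pvc-abc123 -oyaml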
Log messages for common issues
Kubernetes version(s): unpatched versions of Kubernetes v1.26-v1.29
Component: kubelet
Characteristic message:
W1019 01:11:18.316567 967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
W1019 01:11:18.316582 967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
E1019 01:11:18.316662 967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
Issues:
Longhorn version(s): potentially any but more prevalent in <v1.5.4 and <v1.6.1
Component: longhorn-manager
Characteristic message:
2024-04-07T09:31:41.412363359Z time="2024-04-07T09:31:41Z" level=warning msg="Cannot auto salvage volume: no data exists" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=node-4 owner=node-4 shareEndpoint= shareState= state=detached volume=pvc-f5de3d21-43d9-4a4b-a996-404f53870e92
Issues:
- https://github.com/longhorn/longhorn/issues/7425 for older versions of Longhorn
- Not seen recently
Longhorn version(s): <v1.5.2, <v1.6.0
Component: instance-manager
Characteristic message:
<time> time="<time>" level=error msg="failed to prune <snapshot>.img based on <snapshot>.img: file sizes are not equal and the parent file is larger than the child file"
Component: longhorn-manager
Characteristic message:
E<date> <time> 1 engine_controller.go:731] failed to update status for engine <engine>: BUG: The expected size <small_size> of engine <engine> should not be smaller than the current size <large_size>
Issues: