Troubleshooting
- Troubleshooting flow diagrams
- Troubleshooting an individual Longhorn volume
- Log messages for common issues
Troubleshooting flow diagrams
These diagrams attempt to capture a general flow for investigating Longhorn issues. Longhorn is a complex system, so there is no way for one (or even many) diagrams to lay out a path to diagnosing every possible problem. The emphasis here is on identifying logs or objects that might be useful to look at and a general order in which to look at them.
Currently, these diagrams are implemented in LucidChart, but it may be a good idea to move them to a different tool (e.g. https://mermaid.js.org) that "compiles" them from source that is tracked in git per https://github.com/longhorn/longhorn/issues/7410. The diagrams were created with a personal LucidChart account, but should be copyable to any account (for editing) as long as they exist.
Troubleshooting an individual Longhorn volume
While volume management in Longhorn is generally straightforward and intuitive, the inner workings are actually quite complex and involve many components. When there is some issue with a single Longhorn volume, there are two places to look for useful troubleshooting information:
- The logs of the various Longhorn component pods
- Kubernetes objects and Longhorn custom resource objects stored in the cluster
The goal of this section is to provide some tips that can help narrow down the search for immediate issues and the root causes of those issues.
Troubleshooting information usually comes from one of two sources:
- kubectl (and potentially SSH) interactions with a live (or simulated) cluster
- An unzipped support bundle
It is generally possible to obtain the same types of information from a live/simulated cluster and an unzipped support bundle.
Obtaining logs
- In a live cluster, obtain logs for a Longhorn pod by executing kubectl logs -n longhorn-system <pod_name>. More generically, obtain logs for any pod by executing kubectl logs -n <namespace> <pod_name>.
- In an unzipped support bundle, obtain logs for a Longhorn pod by opening bundle/logs/longhorn-system/<pod_name>/<pod_name>.log. More generically, obtain logs for any pod by opening bundle/logs/<namespace>/<pod_name>/<pod_name>.log. Only a relevant subset of namespaces is available in a support bundle.
Checking Kubernetes objects or Longhorn custom resource objects
- In a live cluster, get the full details from a Longhorn custom resource object by executing kubectl get -n longhorn-system -oyaml <object_kind> <object_name>. More generically, get the full details from any Kubernetes object by executing kubectl get -n <namespace> -oyaml <object_kind> <object_name>.
- In an unzipped support bundle, get the full details from a Longhorn custom resource object by opening bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/<object_kind>.yaml. More generically, get the full details from any Kubernetes object by opening bundle/yamls/namespaced/<namespace>/<object_group>/<object_version>/<object_kind>.yaml.
  - Many relevant Kubernetes objects exist in the longhorn-system namespace but are NOT Longhorn custom resource objects. For example, Longhorn pods can be found in bundle/yamls/namespaced/longhorn-system/v1/pods.yaml.
  - Only a relevant subset of namespaces is available in a support bundle.
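For example, to dump a single volume CR from a live cluster (pvc-abc123 below is only a placeholder volume name):
# Full Volume custom resource for a hypothetical volume named pvc-abc123.
kubectl get -n longhorn-system volumes.longhorn.io pvc-abc123 -oyaml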
Getting host/node information
- In a live cluster, it is generally necessary to SSH to a host/node to get kubelet, kernel, or journal logs. Often these are quite useful when troubleshooting a volume.
- In an unzipped support bundle, information like this is available in bundle/nodes/<node_name>/.
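As a rough sketch of what to check after SSHing to a node (exact unit names vary by distribution; on k3s or RKE2 the kubelet logs live in the k3s or rke2-agent journal rather than a separate kubelet unit):
# Kernel messages with human-readable timestamps (useful for disk, filesystem, or iSCSI/NVMe errors).
dmesg -T | less
# Kubelet journal on a systemd-managed kubelet, narrowed to the time window of the issue.
journalctl -u kubelet --since "1 hour ago" | less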
Whether you are troubleshooting your own cluster or are receiving a support bundle from an external source, the following are the general steps required to narrow down the problem and find a solution. Steps need not be executed in this exact order, and those with more Longhorn experience may be able to skip steps altogether for many problem descriptions.
- Obtain as much information about the problem symptoms as possible. Context from questions like the following can be very valuable. Issues can often look quite confusing from the perspective of a support bundle alone until the user provides additional context.
- Is the issue limited to a particular volume or the whole cluster? If a particular volume, which one?
- Has Longhorn (or the particular component or volume) ever worked as expected? (In other words, is the issue new?)
- When exactly was the issue first observed?
- Is the issue ongoing?
- What was the user doing with/in Longhorn when the issue was triggered?
- What was going on in the cluster (or infrastructure) when the issue was triggered?
- If the Longhorn installation seems to be generally unstable, try walking through Longhorn Overall Installation Troubleshooting (TODO: move or provide a better link).
- If the issue is with a particular volume, it can be helpful to briefly search all Longhorn logs for repeated errors using the name of the volume (see the search sketch after this list). This works best on an unzipped support bundle. Usually, thousands of (not useful) logs are returned containing many false leads, so don't spend too much time on this at first. (Return to it later if necessary.)
- If the issue is with a particular volume, check that volume's CR (and related CRs) for the fields highlighted in Useful fields for troubleshooting. This information is useful in the next step and it provides necessary context for the issue.
- If the issue is with a particular volume, try walking through Longhorn Workload Pod Pending Troubleshooting (TODO: move or provide a better link).
- Check GitHub for issues with relevant symptoms.
- Return to step 3, but this time take care to put together an understanding of exactly what happened to the volume (and its components) before and after the issue was observed. Analysis like this is sometimes required, but it is quite time consuming!
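A minimal search sketch against an unzipped support bundle (pvc-abc123 is a placeholder volume name):
# Surface repeated errors mentioning the volume across all Longhorn component logs.
grep -ri "pvc-abc123" bundle/logs/longhorn-system/ | grep -iE "error|fail" | less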
Useful fields for troubleshooting
It is often quite useful to compile a subset of the fields from a volume CR (and related CRs) while building context about an issue. Many Longhorn CR fields are internal implementation details. They can be useful for troubleshooting in very specific contexts, but are more likely to provide extra noise that makes understanding more difficult. The following is a list of fields that are likely to be relevant while troubleshooting, with some context for WHY each might be relevant.
The volume should generally be the first object checked.
kind: Volume
metadata:
# Is the volume new?
# How soon after the volume's creation did the issue occur?
creationTimestamp:
# Most related objects are prefixed by this name.
name:
spec:
# "Normal" RWX volumes have share-managers, rely on NFS traffic, etc.
accessMode:
# It is probably best to browse for v2 specific issues for v2 volumes.
backendStoreDriver:
# Unencrypted volumes are more common.
# Encrypted volumes may see issues during mount/unmount.
encrypted:
# A volume may be on an older version than the Longhorn installation.
image / engineImage:
# This is the node the volume is attached/attaching to.
nodeID:
# Does the volume have the expected number of replicas?
numberOfReplicas:
status:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
# When browsing a support bundle, this is a great place to get names to search logs for.
# pvcName, podName, etc. may appear in relevant logs.
kubernetesStatus:
# If a volume was autosalvaged, this timestamp gives a good idea of when.
remountRequestedAt:
# Unknown is expected for a detached volume. Healthy is expected for an attached volume.
# Volumes with issues may be faulted instead.
robustness:
# Many past issues have resulted in an "attach/detach" loop.
# In this loop, the volume transitions from detached -> attached -> detached repeatedly.
state:
spec:
# This is the node the volume is trying to migrate to.
migrationNodeID:
status:
  # This is the node running the "new" engine for the migration.
currentMigrationNodeID:
# This is the node running the "old" engine for the migration.
currentNodeID:
spec:
# Many scheduling issues are related to data locality.
dataLocality:
# These fields limit nodes replicas can schedule to.
replicaSoftAntiAffinity:
replicaDiskSoftAntiAffinity:
replicaZoneSoftAntiAffinity:
# The spec size may be too big for scheduling.
size:
status:
# The actual size may be too big for scheduling.
actualSize:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
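To scan several of these fields across all volumes in a live cluster, something like the following works (a sketch; on Longhorn versions older than v1.5, spec.image may instead be spec.engineImage):
# Summarize state, robustness, attachment node, and sizes for every volume.
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,NODE:.spec.nodeID,SIZE:.spec.size,ACTUAL:.status.actualSize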
Many volume issues either cause or are caused by failing replicas. It is usually a good idea to gather information about a volume's replicas early in the troubleshooting process.
Find the names of a volume's replicas by:
- In a live cluster, executing kubectl get replica -n longhorn-system | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/replicas.yaml for <volume-name>-r-.
kind: Replica
metadata:
# How soon after the volume's creation was this replica created?
  # Is it an original? Has it been created more recently to replace an original?
creationTimestamp:
name:
spec:
# Is Longhorn trying to run the replica?
desireState:
# A replica may be on an older version than the Longhorn installation.
image / engineImage:
# When set, the replica is currently failed. It must be rebuilt to be used.
# A rebuilding/rebuilt replica no longer has failedAt set.
# When did the replica fail? There may be useful logs at that time.
failedAt:
# When set, the replica is currently healthy.
# An actively rebuilding replica no longer has healthyAt set.
# When was the replica last successfully rebuilt?
healthyAt:
# Records the last time the replica failed, even if it is currently healthy.
# Never cleared.
# When did the replica fail? There may be useful logs at that time.
lastFailedAt:
# Records the last time the replica was actively used by the engine.
# Never cleared.
lastHealthyAt:
# This is the node that contains the replica's data. It must run here.
nodeID:
status:
# It's relatively rare for a condition to reveal the reason for an issue.
# If any are "bad", it's a great clue.
conditions:
# Is the replica running?
currentState:
# Replica issues are often diagnosed by checking instance-manager logs.
# This is a good place to get the name of the right instance-manager for a running replica.
# Empty if the replica is not running.
instanceManagerName:
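To pull these timestamps for the replicas of a single volume in a live cluster (pvc-abc123 is a placeholder volume name):
# List each replica's node, state, and failure/health timestamps; keep the header row plus matching rows.
kubectl get replicas.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STATE:.status.currentState,FAILEDAT:.spec.failedAt,HEALTHYAT:.spec.healthyAt \
  | grep -e NAME -e pvc-abc123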
Find the name of a volume's engine by:
- In a live cluster, executing kubectl get engine -n longhorn-system | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/namespaced/longhorn-system/longhorn.io/v1beta2/engines.yaml for <volume-name>-e-.
kind: Engine
metadata:
name:
spec:
# Is Longhorn trying to run the engine?
desireState:
# An engine may be on an older version than the Longhorn installation.
image / engineImage:
# This is the node currently trying to run the engine.
nodeID:
# These are the replicas longhorn-manager wants the engine to use.
replicaAddressMap:
status:
# These are the replicas the engine is trying to use.
currentReplicaAddressMap:
# Is the engine running?
currentState:
# Engine issues are often diagnosed by checking instance-manager logs.
# This is a good place to get the name of the right instance-manager for a running engine.
# Empty if the engine is not running.
instanceManagerName:
# These are the replicas the engine is currently using.
# RW replicas are healthy.
# WO replicas are being rebuilt.
replicaModeMap:
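For a quick look at which replicas an engine is actually using in a live cluster (pvc-abc123-e-0 is a hypothetical engine name of the <volume-name>-e- form):
# RW entries are healthy replicas; WO entries are still being rebuilt.
kubectl get engines.longhorn.io -n longhorn-system pvc-abc123-e-0 -o jsonpath='{.status.replicaModeMap}'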
There may be a disagreement between the desired and actual state of Longhorn volumes as Kubernetes sees them versus as Longhorn sees them. It can be quite useful to understand what both components are trying to do (in terms of volume attachments) at any given moment.
Find the name of any relevant Kubernetes volume attachments by:
- In a live cluster, executing kubectl get volumeattachment | grep <volume_name>.
- In an unzipped support bundle, searching bundle/yamls/cluster/storage.k8s.io/v1/volumeattachments.yaml for <volume_name>.
If there is no VolumeAttachment referencing a volume, Kubernetes does not think the volume is attached and is not attempting to attach the volume. For normal RWX or migratable volumes, multiple VolumeAttachments may exist.
kind: VolumeAttachment
metadata:
# When did Kubernetes start trying to attach the volume to the node?
creationTimestamp:
name:
spec:
# This is the node Kubernetes wants to attach the volume to.
nodeName:
source:
# This is the name of the volume.
persistentVolumeName:
status:
# Does Kubernetes think the volume is attached?
attached:
In Longhorn v1.5.x+, Longhorn also maintains its own set of VolumeAttachment custom resources. Every volume has a Longhorn VolumeAttachment, even if Kubernetes doesn't want it attached. If Kubernetes (or some other component) DOES want a volume attached, Longhorn adds attachment tickets.
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name:
spec:
# Every reason Longhorn wants a volume attached results in a ticket.
# Reasons include:
# - There is a Kubernetes attachment ticket.
# - The UI has requested attachment.
# - A snapshot or backup is being taken.
# - Etc.
# Longhorn chooses a ticket to act on based on priority.
attachmentTickets:
status:
# Is each ticket from the spec fulfilled?
attachmentTicketStatuses:
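To compare the Kubernetes view with the Longhorn view for one volume in a live cluster (pvc-abc123 is a placeholder; the Longhorn VolumeAttachment CR is typically named after the volume):
# Kubernetes VolumeAttachment(s) referencing the volume (keep the header row plus matching rows).
kubectl get volumeattachment | grep -e NAME -e pvc-abc123
# Longhorn's own VolumeAttachment CR, including attachment tickets and their statuses.
kubectl get volumeattachments.longhorn.io -n longhorn-system pvc-abc123 -oyaml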
Log messages for common issues
Kubernetes version(s): unpatched versions of Kubernetes v1.26-v1.29
Component: kubelet
Characteristic message:
W1019 01:11:18.316567 967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
W1019 01:11:18.316582 967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
E1019 01:11:18.316662 967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
Issues:
Longhorn version(s): potentially any but more prevalent in <v1.5.4 and <v1.6.1
Component: longhorn-manager
Characteristic message:
2024-04-07T09:31:41.412363359Z time="2024-04-07T09:31:41Z" level=warning msg="Cannot auto salvage volume: no data exists" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=node-4 owner=node-4 shareEndpoint= shareState= state=detached volume=pvc-f5de3d21-43d9-4a4b-a996-404f53870e92
Issues:
- https://github.com/longhorn/longhorn/issues/7425 for older versions of Longhorn
- Not seen recently
Longhorn version(s): <v1.5.2, <v1.6.0
Component: instance-manager
Characteristic message:
<time> time="<time>" level=error msg="failed to prune <snapshot>.img based on <snapshot>.img: file sizes are not equal and the parent file is larger than the child file"
Component: longhorn-manager
Characteristic message:
E<date> <time> 1 engine_controller.go:731] failed to update status for engine <engine>: BUG: The expected size <small_size> of engine <engine> should not be smaller than the current size <large_size>
Issues: