Is your feature request related to a problem? Please describe (👍 if you like this request)
The Longhorn share manager is the backend that serves RWX volumes using NFS-Ganesha. Because it runs as a single instance, it is a single point of failure (SPOF), even though recovery has been improved by the recovery backend mechanism in #2293.
The goal is to make the share manager highly available, rather than relying only on a shorter recovery time, which is uncertain because it depends on environmental factors.
Describe the solution you'd like
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Before 1.4.4 and 1.5.2, kernel issues could cause the volume node to get stuck during a node reboot or upgrade if the share manager pod was disconnected, because the NFS mount uses hard mode. To resolve this, soft mode will be reintroduced with a longer timeout in 1.4.4, 1.5.2, and 1.6. The detailed context is in #6655 (comment). However, soft mode carries a potential risk of data loss if the timeout is not well defined (the timeout should at least account for the pod eviction timeout).
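As a rough illustration of the timeout concern, the sketch below estimates how long a soft NFS mount keeps retrying before it returns an error, and compares that with a pod eviction timeout. It assumes the linear backoff described in the nfs(5) man page (`timeo` in deciseconds, each retransmission waiting `timeo` longer, capped at 600 seconds per wait). The function names and the example values are hypothetical, not Longhorn's actual settings, and real kernel behavior may differ in detail.

```python
def soft_mount_give_up_seconds(timeo_ds: int, retrans: int, cap_s: float = 600.0) -> float:
    """Estimate the seconds before a soft NFS mount fails a request.

    Assumption: linear backoff as described in nfs(5). The first wait is
    `timeo`, each retransmission waits `timeo` longer than the previous
    one, and every single wait is capped at `cap_s` seconds.
    """
    timeo_s = timeo_ds / 10.0  # timeo is expressed in tenths of a second
    total = 0.0
    # One wait for the initial request plus one per retransmission.
    for attempt in range(retrans + 1):
        total += min(timeo_s * (attempt + 1), cap_s)
    return total


def covers_eviction(timeo_ds: int, retrans: int, eviction_timeout_s: float) -> bool:
    """True if the estimated give-up time is at least the eviction timeout."""
    return soft_mount_give_up_seconds(timeo_ds, retrans) >= eviction_timeout_s


if __name__ == "__main__":
    # Hypothetical values: timeo=600 (60 s), retrans=2, eviction timeout 300 s.
    print(soft_mount_give_up_seconds(600, 2))
    print(covers_eviction(600, 2, 300))
```

If the estimate falls below the eviction timeout, the client could return I/O errors to the workload before a replacement share manager pod has a chance to come up, which is the data loss risk described above.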
Eventually, hard mode will be readopted together with this feature. That does not mean the stuck-node situation can no longer occur, but it should be very rare: it would happen only if all share manager HA nodes are down or all share manager pods are unavailable, so that HA is lost.