Use stm-containers instead of TVar HashMap #279

parsonsmatt · 2025-12-12T23:07:01Z

This PR uses stm-containers - a TVar (HashMap k v) has much worse performance. We are observing some issues that look like livelock with increasing concurrency, and I believe - if this doesn't fix the problem outright - it should significantly improve it.

There are many other places in the Temporal codebase that use TVar of container. This PR only focuses on two points.

parsonsmatt · 2025-12-12T23:11:28Z

sdk/src/Temporal/Activity/Worker.hs

-  running <- readTVarIO w.runningActivities
+  running <- atomically $ StmMap.lookup tt w.runningActivities
  let cancelReason = case msg ^. AT.reason of
        AT.NOT_FOUND -> NotFound
        AT.CANCELLED -> CancelRequested
        AT.TIMED_OUT -> Timeout
        AT.WORKER_SHUTDOWN -> WorkerShutdown
        AT.ActivityCancelReason'Unrecognized _ -> UnknownCancellationReason
-  forM_ (HashMap.lookup tt running) $ \a ->
-    cancelWith a cancelReason `finally` atomically (modifyTVar' w.runningActivities (HashMap.delete tt))
+  forM_ running $ \a ->
+    cancelWith a cancelReason `finally` atomically (StmMap.delete tt w.runningActivities)


Another option with this block is to remove the item from the map first- this may prevent double-cancels.

running <- atomically $ StmMap.lookupAndDelete tt w.runningActivities ... forM_ running $ \a -> cancelWith a cancelReason

But double cancels are pretty cheap. Unless there's a deadlock or some other reason why the a here isn't exiting promptly.

parsonsmatt · 2025-12-12T23:11:47Z

sdk/src/Temporal/Workflow/Worker.hs

-  , runningWorkflows :: {-# UNPACK #-} !(TVar (HashMap RunId WorkflowInstance))
+  , runningWorkflows :: {-# UNPACK #-} !(StmMap.Map RunId WorkflowInstance)


This field and one other are changed, most of the changes are reacting to this in the types.

parsonsmatt · 2025-12-12T23:12:30Z

sdk/src/Temporal/Workflow/Worker.hs

-          runningWorkflows <- readTVarIO worker.runningWorkflows
-          mapM_ (cancel <=< readIORef . executionThread) runningWorkflows
+          runningWorkflows <- liftIO $ ListT.toList $ StmMap.listTNonAtomic $ worker.runningWorkflows
+          mapConcurrently_ (cancel <=< readIORef . executionThread . snd) runningWorkflows


This change also does a concurrent cancellation on the jobs instead of a sequential one.

iand675

Nice, long as the tests pass I'm fine with this.

github-actions · 2025-12-15T00:58:10Z

📊 Code Coverage Report

Current PR Coverage

Overall Coverage: 🟠 57.9%

10144 / 17532 expressions covered

Overall Summary

Category	Coverage
Top-level definitions	790/2607 🔴 30.3%
Alternatives	27/57 🟠 47.4%
Expressions	9079/14443 🟡 62.9%
Local definitions	228/368 🟡 62.0%
Other	20/57 🔴 35.1%

Coverage by Module

Module	Coverage	Top-level	Alternatives	Expressions	Local
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.TH.Internal`	🔴 0.0%	0/15 🔴 0.0%	0/2 🔴 0.0%	0/135 🔴 0.0%	0/2 🔴 0.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.TH`	🔴 8.2%	1/21 🔴 4.8%	0/16 🔴 0.0%	41/460 🔴 8.9%	3/36 🔴 8.3%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.SearchAttributes`	🔴 15.6%	12/101 🔴 11.9%	N/A	19/92 🔴 20.7%	0/6 🔴 0.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Common`	🔴 15.9%	49/637 🔴 7.7%	1/1 🟢 100.0%	78/170 🟠 45.9%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Data.EvalRecord`	🔴 21.0%	12/71 🔴 16.9%	N/A	68/308 🔴 22.1%	2/12 🔴 16.7%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Worker.Types`	🔴 25.0%	9/35 🔴 25.7%	N/A	9/37 🔴 24.3%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Client.TestService`	🔴 32.0%	7/16 🟠 43.8%	1/1 🟢 100.0%	43/139 🔴 30.9%	3/12 🔴 25.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Activity.Definition`	🔴 35.1%	22/57 🔴 38.6%	N/A	24/74 🔴 32.4%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Payload`	🔴 35.4%	38/125 🔴 30.4%	0/5 🔴 0.0%	162/438 🔴 37.0%	2/4 🟠 50.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Update`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Signal`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Query`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.TH.Classes`	🟠 40.0%	10/27 🔴 37.0%	N/A	28/68 🟠 41.2%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Definition`	🟠 41.7%	8/20 🟠 40.0%	N/A	32/76 🟠 42.1%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Activity.Types`	🟠 45.0%	9/21 🟠 42.9%	N/A	9/19 🟠 47.4%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Types`	🟠 45.3%	49/179 🔴 27.4%	N/A	81/108 🟡 75.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Bundle`	🟠 45.5%	9/25 🔴 36.0%	N/A	91/199 🟠 45.7%	10/18 🟠 55.6%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Unsafe.Handle`	🟠 49.4%	5/8 🟡 62.5%	N/A	72/148 🟠 48.6%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Coroutine`	🟠 49.7%	12/23 🟠 52.2%	N/A	65/132 🟠 49.2%	3/6 🟠 50.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.WorkflowInstance`	🟠 50.0%	1/1 🟢 100.0%	N/A	1/3 🔴 33.3%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.EphemeralServer`	🟠 50.0%	4/13 🔴 30.8%	N/A	64/122 🟠 52.5%	0/1 🔴 0.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Common.Async`	🟠 50.0%	1/2 🟠 50.0%	N/A	10/20 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Exception`	🟠 52.0%	84/291 🔴 28.9%	2/3 🟡 66.7%	301/455 🟡 66.2%	5/5 🟢 100.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Testing.MockActivityEnvironment`	🟠 52.9%	7/14 🟠 50.0%	N/A	43/85 🟠 50.6%	5/5 🟢 100.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Operator`	🟠 55.6%	3/12 🔴 25.0%	1/1 🟢 100.0%	53/92 🟠 57.6%	2/2 🟢 100.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow`	🟠 58.2%	38/104 🔴 36.5%	2/3 🟡 66.7%	1237/2085 🟠 59.3%	31/56 🟠 55.4%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Worker`	🟠 59.1%	42/111 🔴 37.8%	1/1 🟢 100.0%	601/997 🟡 60.3%	34/39 🟢 87.2%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Internal.Monad`	🟡 60.7%	135/254 🟠 53.1%	1/3 🔴 33.3%	617/980 🟡 63.0%	2/3 🟡 66.7%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Client.Types`	🟡 65.5%	49/104 🟠 47.1%	N/A	124/159 🟡 78.0%	0/1 🔴 0.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Interceptor`	🟡 67.6%	15/33 🟠 45.5%	N/A	56/72 🟡 77.8%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Activity`	🟡 69.8%	3/6 🟠 50.0%	N/A	84/120 🟡 70.0%	3/3 🟢 100.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Duration`	🟡 70.9%	27/64 🟠 42.2%	6/7 🟢 85.7%	213/274 🟡 77.7%	6/9 🟡 66.7%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.WorkflowInstance`	🟡 71.2%	18/22 🟢 81.8%	2/2 🟢 100.0%	1258/1767 🟡 71.2%	29/42 🟡 69.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Client`	🟡 73.6%	34/78 🟠 43.6%	5/7 🟡 71.4%	1455/1945 🟡 74.8%	26/33 🟡 78.8%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Contrib.OpenTelemetry`	🟡 73.8%	3/7 🟠 42.9%	N/A	518/700 🟡 74.0%	12/15 🟢 80.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Internal.Instance`	🟡 74.2%	6/8 🟡 75.0%	N/A	132/178 🟡 74.2%	6/8 🟡 75.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Activity.Worker`	🟡 75.5%	23/36 🟡 63.9%	N/A	425/560 🟡 75.9%	10/11 🟢 90.9%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Common.Logging`	🟢 81.0%	11/14 🟡 78.6%	1/1 🟢 100.0%	87/106 🟢 82.1%	3/4 🟡 75.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Testing.Assertions`	🟢 81.1%	6/11 🟠 54.5%	N/A	24/26 🟢 92.3%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Worker`	🟢 85.0%	15/22 🟡 68.2%	1/1 🟢 100.0%	622/728 🟢 85.4%	22/26 🟢 84.6%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Eval`	🟢 91.1%	4/4 🟢 100.0%	3/3 🟢 100.0%	300/329 🟢 91.2%	9/9 🟢 100.0%
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.SearchAttributes.Internal`	🟢 93.3%	5/5 🟢 100.0%	N/A	23/25 🟢 92.0%	N/A
`temporal-sdk-2025.10.1.0-DZZMfxfBcBeJcR0G8Nlr32/Temporal.Workflow.Unsafe`	🟢 100.0%	1/1 🟢 100.0%	N/A	6/6 🟢 100.0%	N/A

🟢 ≥80% 🟡 ≥60% 🟠 ≥40% 🔴 <40%

📈 Coverage Comparison vs. Main

➡️ Coverage unchanged at 57.9%

Main Branch Coverage (for comparison)

Overall Coverage: 🟠 57.9%

10147 / 17539 expressions covered

Overall Summary

Category	Coverage
Top-level definitions	790/2607 🔴 30.3%
Alternatives	27/57 🟠 47.4%
Expressions	9082/14450 🟡 62.9%
Local definitions	228/368 🟡 62.0%
Other	20/57 🔴 35.1%

Coverage by Module

Module	Coverage	Top-level	Alternatives	Expressions	Local
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.TH.Internal`	🔴 0.0%	0/15 🔴 0.0%	0/2 🔴 0.0%	0/135 🔴 0.0%	0/2 🔴 0.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.TH`	🔴 8.2%	1/21 🔴 4.8%	0/16 🔴 0.0%	41/460 🔴 8.9%	3/36 🔴 8.3%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.SearchAttributes`	🔴 15.6%	12/101 🔴 11.9%	N/A	19/92 🔴 20.7%	0/6 🔴 0.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Common`	🔴 15.9%	49/637 🔴 7.7%	1/1 🟢 100.0%	78/170 🟠 45.9%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Data.EvalRecord`	🔴 21.0%	12/71 🔴 16.9%	N/A	68/308 🔴 22.1%	2/12 🔴 16.7%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Worker.Types`	🔴 25.0%	9/35 🔴 25.7%	N/A	9/37 🔴 24.3%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Client.TestService`	🔴 32.0%	7/16 🟠 43.8%	1/1 🟢 100.0%	43/139 🔴 30.9%	3/12 🔴 25.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Activity.Definition`	🔴 35.1%	22/57 🔴 38.6%	N/A	24/74 🔴 32.4%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Payload`	🔴 35.4%	38/125 🔴 30.4%	0/5 🔴 0.0%	162/438 🔴 37.0%	2/4 🟠 50.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Update`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Signal`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Query`	🟠 40.0%	1/3 🔴 33.3%	N/A	1/2 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.TH.Classes`	🟠 40.0%	10/27 🔴 37.0%	N/A	28/68 🟠 41.2%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Definition`	🟠 41.7%	8/20 🟠 40.0%	N/A	32/76 🟠 42.1%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Activity.Types`	🟠 45.0%	9/21 🟠 42.9%	N/A	9/19 🟠 47.4%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Types`	🟠 45.3%	49/179 🔴 27.4%	N/A	81/108 🟡 75.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Bundle`	🟠 45.5%	9/25 🔴 36.0%	N/A	91/199 🟠 45.7%	10/18 🟠 55.6%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Unsafe.Handle`	🟠 49.4%	5/8 🟡 62.5%	N/A	72/148 🟠 48.6%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Coroutine`	🟠 49.7%	12/23 🟠 52.2%	N/A	65/132 🟠 49.2%	3/6 🟠 50.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.WorkflowInstance`	🟠 50.0%	1/1 🟢 100.0%	N/A	1/3 🔴 33.3%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.EphemeralServer`	🟠 50.0%	4/13 🔴 30.8%	N/A	64/122 🟠 52.5%	0/1 🔴 0.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Common.Async`	🟠 50.0%	1/2 🟠 50.0%	N/A	10/20 🟠 50.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Exception`	🟠 52.0%	84/291 🔴 28.9%	2/3 🟡 66.7%	301/455 🟡 66.2%	5/5 🟢 100.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Testing.MockActivityEnvironment`	🟠 52.9%	7/14 🟠 50.0%	N/A	43/85 🟠 50.6%	5/5 🟢 100.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Operator`	🟠 55.6%	3/12 🔴 25.0%	1/1 🟢 100.0%	53/92 🟠 57.6%	2/2 🟢 100.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow`	🟠 58.2%	38/104 🔴 36.5%	2/3 🟡 66.7%	1237/2085 🟠 59.3%	31/56 🟠 55.4%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Worker`	🟠 59.1%	42/111 🔴 37.8%	1/1 🟢 100.0%	601/997 🟡 60.3%	34/39 🟢 87.2%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Internal.Monad`	🟡 60.7%	135/254 🟠 53.1%	1/3 🔴 33.3%	617/980 🟡 63.0%	2/3 🟡 66.7%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Client.Types`	🟡 65.5%	49/104 🟠 47.1%	N/A	124/159 🟡 78.0%	0/1 🔴 0.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Interceptor`	🟡 67.6%	15/33 🟠 45.5%	N/A	56/72 🟡 77.8%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Activity`	🟡 69.8%	3/6 🟠 50.0%	N/A	84/120 🟡 70.0%	3/3 🟢 100.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Duration`	🟡 70.9%	27/64 🟠 42.2%	6/7 🟢 85.7%	213/274 🟡 77.7%	6/9 🟡 66.7%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.WorkflowInstance`	🟡 71.2%	18/22 🟢 81.8%	2/2 🟢 100.0%	1258/1767 🟡 71.2%	29/42 🟡 69.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Client`	🟡 73.6%	34/78 🟠 43.6%	5/7 🟡 71.4%	1455/1945 🟡 74.8%	26/33 🟡 78.8%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Contrib.OpenTelemetry`	🟡 73.8%	3/7 🟠 42.9%	N/A	518/700 🟡 74.0%	12/15 🟢 80.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Internal.Instance`	🟡 74.2%	6/8 🟡 75.0%	N/A	132/178 🟡 74.2%	6/8 🟡 75.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Activity.Worker`	🟡 75.2%	23/36 🟡 63.9%	N/A	420/555 🟡 75.7%	10/11 🟢 90.9%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Common.Logging`	🟢 81.0%	11/14 🟡 78.6%	1/1 🟢 100.0%	87/106 🟢 82.1%	3/4 🟡 75.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Testing.Assertions`	🟢 81.1%	6/11 🟠 54.5%	N/A	24/26 🟢 92.3%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Worker`	🟢 84.7%	15/22 🟡 68.2%	1/1 🟢 100.0%	630/740 🟢 85.1%	22/26 🟢 84.6%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Eval`	🟢 91.1%	4/4 🟢 100.0%	3/3 🟢 100.0%	300/329 🟢 91.2%	9/9 🟢 100.0%
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.SearchAttributes.Internal`	🟢 93.3%	5/5 🟢 100.0%	N/A	23/25 🟢 92.0%	N/A
`temporal-sdk-2025.10.1.0-2GLkjAgtube2wLzBMhfM5i/Temporal.Workflow.Unsafe`	🟢 100.0%	1/1 🟢 100.0%	N/A	6/6 🟢 100.0%	N/A

🟢 ≥80% 🟡 ≥60% 🟠 ≥40% 🔴 <40%

parsonsmatt · 2025-12-15T18:38:56Z

Update from downstream: this completely fixed lag time with increased concurrent startup in our test suite, ~20x performance improvement with 16 cores (and actually completes successfully instead of hanging forever with 32 cores).

avieth · 2025-12-16T16:01:14Z

Could you explain how the previous use of TVar (HashMap TaskToken (Async ())) was causing livelock and how that explains the CI issues? I don't see any places where it was retrying. We were seeing what looked like test cases using database connections from other test cases, but I don't see the connection to this patch.

parsonsmatt · 2025-12-16T17:10:41Z

Temporal activities using stale database connections was fixed in this PR, which ensured that all workflows received the shutdown message. I believe the behavior we were observing was something like:

We have three workflows running, the second of which is uninterruptibly frozen or deadlocked
We shutdown, which does forM_ workers cancel.
The first worker receives the cancel, shuts down, and cancel returns, allowing us to go to the next worker.
The second worker receives the cancel, but is frozen for some reason. cancel never returns.
A timeout is sent the thread doing forM_ workers cancel, which interrupts it and causes it to exit.
Our third worker is never canceled and continues working!

By doing concurrent shutdown, we ensure that each worker at least receives the message to cancel. This appears to have completely solved the problem of "Temporal processes talking to databases after their tests have exited," but replaced it with the underlying problem: "Some temporal processes/threads/workflows/whatever are hanging forever and can't be canceled."

As for this change - let me start with a brief overview on how STM works. When you do atomically $ stuff, you open a transaction. Every read and write to a variable is recorded in a log. Before committing, that log is checked against other logs. If another transaction has modified any of those variables, then the work is restarted - an implicit retry is built in to every readTVar and writeTVar that is triggered by another thread successfully writing against those variables.

Let's consider this bit of code:

join $ atomically $ do
              currentWorkflows <- readTVar worker.runningWorkflows
              writeTVar worker.runningWorkflows $ HashMap.delete runId_ currentWorkflows

This happens in removeEvictedWorkflowInstances. This takes a lock on the worker.runningWorkflows :: TVar (HashMap RunId WorkflowInstance). However, we only actually care about the runId_ in there. The stm-containers version allows the update to be tightly scoped around the runId_, so other access to that container doesn't block anyone else.

Another potential example:

  liftIO $ atomically $ do
    workflows <- readTVar worker.runningWorkflows
    case HashMap.lookup r workflows of
      Nothing -> do
        let workflows' = HashMap.insert r inst workflows
        writeTVar worker.runningWorkflows workflows'
        pure inst
      Just existingInstance -> do
        writeTVar worker.runningWorkflows workflows
        pure existingInstance

Here we are reading from and writing to the variable. If we have several of these transactions all attempting to complete, the TVar (HashMap _ _) structure makes livelock and transaction invalidation vastly more likely (ie every single transaction will invalidate every other transaction). With stm-containers, you only get transaction invalidation if you're operating on the same r key to the map.

There isn't exactly a "smoking gun" here on what transaction or combination of transactions caused the problem. There's the application of a simple rule ("TVar (Map _ _) is always wrong") and a simple fix, and now the problem is gone.

avieth · 2025-12-16T17:40:08Z

With concurrency=32 how many temporal threads did we have contending for the one shared map? And roughly how often were they updating it? I believe STM guarantees that one contending transaction will always commit, so there would have to be a lot of contention for this to become a problem

parsonsmatt · 2025-12-16T17:50:28Z

I don't have any idea, and we don't have good tooling to understand or diagnose things at this level. You're welcome to put further effort into the investigation here - I wouldn't be surprised if only one of these TVars were the real problem and the other is a fluke - but IMO the point of "good practices" and "antipatterns" is not that they always blow up, but that they can blow up in spooky ways, and the effort of "not doing the antipattern" is always less than the cost of "the antipattern exploded."

avieth · 2025-12-16T17:59:21Z

sdk/src/Temporal/Workflow/Worker.hs

+            Just exists ->
+              Just exists
+
+    StmMap.focus (Focus.alter modifier >> Focus.lookupWithDefault inst) r worker.runningWorkflows


IIUC this is the only place where contention may be reduced: if it's a modification, then stm-containers will bypass the write to the top-level map. If it's an insertion, it still has to writeTVar the top-level map right?

So the fundamental problem with TVar (Map k _) is that the entire Map structure is stored in the single TVar, so any write to any part of the Map is going to invalidate any other transaction. Consider TVar [a] vs TChan a - doing modifyTVar (<> [a]) invalidates the entire TVar, even though fmap head . readTVar only cares about the first element, and these should not interact. The solution is to put TVar in the spine of the data structure - now, TChan is doubly linked, but a TList it uses is basically like this:

type TList a = TVar (TList' a) data TList' a = TNil | TCons a (TVar (TList' a))

If we apply this insight to Map, then you'd replace the direct references to Map with TVar Map:

-- pure data Map k a = Bin {-# UNPACK #-} !Size !k a !(Map k a) !(Map k a) | Tip data StmMap' k a = Bin !Size !k a !(TVar (StmMap k a)) !(TVar (StmMap k a)) | Nil type StmMap k v = TVar (StmMap' k v)

This is effectively what stm-containers does, but for HashMap instead of Map. The result is that the scope of contention is dramatically reduced.

When you insert a key into that StmMap you still have to write to the root TVar no? The size increases. Same story for delete

I suggest you study the implementation of stm-containers if you want to learn more about this.

insert always reads and then writes the top-level TVar in the Hamt, and lookup always reads it. How could it possibly be any other way while still being consistent?

(I'm looking at https://hackage-content.haskell.org/package/stm-hamt-1.2.1.1/docs/StmHamt-Hamt.html#v:insert , maybe that's not the one you're talking about?)

Got it now, changes to different parts of the tree that don't require changing the tree structure can be done without contending

avieth · 2025-12-16T17:59:32Z

sdk/src/Temporal/Workflow/Worker.hs

            join $ atomically $ do
-              currentWorkflows <- readTVar worker.runningWorkflows
-              writeTVar worker.runningWorkflows $ HashMap.delete runId_ currentWorkflows
+              mworkflow <- StmMap.focus Focus.lookupAndDelete runId_ worker.runningWorkflows


Isn't there still contention here? How can we delete a key from the map without causing any contention on other uses of the map?

The contention is scoped to this key in particular. The rest of the map structure is not under contention, so you can lookup/update/delete/insert at other keys without incurring any overhead.

lookupAndDelete suggests to me that it returns the value but then deletes the key/value from the map.

Suppose some other concurrent transaction updates the value at that key. Doesn't it have to go either strictly before or strictly after this transaction containing the lookupAndDelete? The result of lookupAndDelete will change if the modifying transaction goes before it, and the effect of the modifying transaction will change if the key is deleted before it

Yes, an update at a key will invalidate other transactions that involve the same key. The cool thing is that it only invalidates transactions at that key. That is the point of stm-containers.

focus :: (Hashable key) => B.Focus value STM result -> key -> Map key value -> STM result focus valueFocus key (Map hamt) = A.focus rowFocus (\(Product2 key _) -> key) key hamt where rowFocus = B.mappingInput (\value -> Product2 key value) (\(Product2 _ value) -> value) valueFocus

That's a call into the stm-hamt library. Here's A.focus

-- module StmHamt.Hamt focus :: (Hashable key) => Focus element STM result -> (element -> key) -> key -> Hamt element -> STM result focus focus elementToKey key = focusExplicitly focus (hash key) ((==) key . elementToKey) focusExplicitly :: Focus a STM b -> Int -> (a -> Bool) -> Hamt a -> STM b focusExplicitly focus hash test hamt = {-# SCC "focus" #-} let Focus _ reveal = Focus.onHamtElement 0 hash test focus in fmap fst (reveal hamt)

So we're interested in Focus.onHamtElement, which is in another module from that library

-- module StmHamt.Focuses onHamtElement :: Int -> Int -> (a -> Bool) -> Focus a STM b -> Focus (Hamt a) STM b onHamtElement depth hash test focus = let branchIndex = IntOps.indexAtDepth depth hash Focus concealBranches revealBranches = By6Bits.onElementAtFocus branchIndex $ onBranchElement depth hash test focus concealHamt = let hamtChangeStm = \case Leave -> return Leave Set !branches -> Set . Hamt <$> newTVar branches Remove -> Set . Hamt <$> newTVar By6Bits.empty in concealBranches >>= traverse hamtChangeStm -- this is the one we're calling in focusExplicitly ... revealHamt (Hamt branchesVar) = do -- ... it reads the top-level var ... branches <- readTVar branchesVar (result, branchesChange) <- revealBranches branches case branchesChange of -- ... and it always writes it, unless you said not to change it Leave -> return (result, Leave) Set !newBranches -> writeTVar branchesVar newBranches $> (result, Leave) Remove -> writeTVar branchesVar By6Bits.empty $> (result, Leave) in Focus concealHamt revealHamt

Is it possible that somehow an update at a key doesn't read the top-level map var? What am I missing?

I suppose you'd have to take the underlying value's TVar out of the map and then do in-place updates, but are we doing that anywhere in this patch?

Is it that an insertion or deletion will often result in a Leave at the top level?

avieth · 2025-12-16T18:04:22Z

IMO the point of "good practices" and "antipatterns" is not that they always blow up, but that they can blow up in spooky ways, and the effort of "not doing the antipattern" is always less than the cost of "the antipattern exploded."

A TVar containing a map is not an anti-pattern. Using one in a situation where updates to keys are highly contended can lead to sub-optimal performance and yes something like stm-containers does improve it, but I'm not convinced that kind of situation is happening here

parsonsmatt · 2025-12-16T18:10:11Z

A TVar containing a map is an easily avoidable footgun, where all the available alternatives are better. If you don't need transactional behavior (ie all access to the Map is through readTVarIO), then you don't need a TVar at all and an IORef is sufficient and doesn't let you shoot yourself in the foot. If you need transactional behavior, then stm-containers actually works and does not livelock. TVar (Map _ _) is worse in every way, and I've solved several serious problems by replacing them, so I'm pretty content to call them an antipattern.

but I'm not convinced that kind of situation is happening here

I'd be curious to hear your hypothesis on a) what was going on and b) why this patch fixed it.

avieth · 2025-12-17T15:16:12Z

I'd be curious to hear your hypothesis on a) what was going on and b) why this patch fixed it.

What evidence do we have that this fixed it? Did we put CI back to the original >16 concurrency with this patch? Didn't we also do other mitigations to get people unblocked on their day-to-day work?

I'm just trying to get an understanding of why this patch would fix things, because I find it surprising, and I was hoping you'd have some insight.

AFAICT the only place where the use of stm-containers actually can make things faster in this patch is upsertWorkflowInstance. Is that function really hammered in CI?

There's also the unrelated change to use concurrent cancels in execute when it polls for shutdown. Maybe that's a more likely explanation, based on your explanation in this earlier comment?

If you don't need transactional behavior (ie all access to the Map is through readTVarIO), then you don't need a TVar at all and an IORef is sufficient and doesn't let you shoot yourself in the foot.

If we're talking about footguns then wouldn't you say using an IORef is an even bigger one due to race conditions?

If you need transactional behavior, then stm-containers actually works and does not livelock.

* assuming your workload is mostly updates at keys, not deletions or insertions. You still have to think about the problem and understand it before you choose a solution. Often a TVar containing a map will be just as good and much simpler since you're probably already using vanilla Maps anyway

parsonsmatt · 2025-12-17T15:46:48Z

What evidence do we have that this fixed it? Did we put CI back to the original >16 concurrency with this patch? Didn't we also do other mitigations to get people unblocked on their day-to-day work?

Yes, this patch fixed the problem. CI is back to >16 concurrency. This is locally verifiable by running tests and observing a superlinear in the number of cores slowndown prior to this patch in the runtime of temporal tests, and then running tests again with this patch and observing a totally linear test runtime regardless of the number of cores. I suggest reading through the incident channel - there's a lot of diagnosis and step-by-step information about what steps were taken and what impacts they had.

It is possible that the concurrent shutdown is the true fix, but that would be weird - we are observing an infinite freeze somewhere, and issuing the order to shutdown concurrently vs serially should not impact a process that is frozen.

If we're talking about footguns then wouldn't you say using an IORef is an even bigger one due to race conditions?

I believe this is addressed in the first part of the sentence you're responding to. Do you mind elaborating?

assuming your workload is mostly updates at keys, not deletions or insertions.

That is not true. I think you are confused about how stm-containers works. A modification to one key or value of the map does not cause contention to any other key or value in the map.

Often a TVar containing a map will be just as good and much simpler since you're probably already using vanilla Maps anyway

"Simple" is a subjective judgment, but I'd be hard pressed to view significant complexity difference between

mv <- atomically $ StmMap.lookup k m
mv <- Map.lookup k <$> atomically (readTVar m)

and, even if i were to decide that StmMap.lookup k m is somehow more complex, it is undoubtedly warranted - diagnosing and fixing livelock issues like this takes an enormous amount of time, on top of the time wasted experiencing this bug. A patch to switch from TVar (Map k v) to stm-containers takes minutes to write, strictly makes things better, and avoids an entire class of performance problems.

avieth · 2025-12-17T15:59:18Z

A modification to one key or value of the map does not cause contention to any other key or value in the map

Yes that's what I said, but insertions and deletions still cause contention, even against updates at keys. Which is why it's so surprising to me that this patch would have had such an impact

Tbh I think the concurrent cancel might be the thing that did it: one workflow thread getting hung up (probably due to the safe-exceptions uninterruptible masking mistake) and causing the entire thing to block in excess of the configured timeout

parsonsmatt · 2025-12-17T16:08:22Z

A modification to one key or value of the map does not cause contention to any other key or value in the map

Yes that's what I said, but insertions and deletions still cause contention, even against updates at keys. Which is why it's so surprising to me that this patch would have had such an impact

No, it doesn't. You evidently don't understand the guarantees of stm-containers, which makes your position here much more understandable.

Tbh I think the concurrent cancel might be the thing that did it: one workflow thread getting hung up (probably due to the safe-exceptions uninterruptible masking mistake) and causing the entire thing to block in excess of the configured timeout

mapConcurrently_ cancel and mapM_ cancel will most likely have the same behavior, because mapConcurrently_ waits for all actions to complete before returning. For mapConcurrently_ alone to fix this, a running temporal workflow would have to block the shutdown of another unrelated temporal workflow, but in a manner where if they were both signaled to shut down, they would be able to shutdown cleanly together. And this would have to also result in livelock symptoms - ie more concurrency = more slowdown.

You're welcome to run the experiment if you want to - please post results when you do.

avieth · 2025-12-17T16:55:55Z

mapConcurrently_ cancel and mapM_ cancel will most likely have the same behavior, because mapConcurrently_ waits for all actions to complete before returning.

If one thread is unkillable then map_ cancel will stop at it, and those behind it in the list may never be killed because the thread that's killing the other threads is killed. That might explain the use of old database connections that we saw

But with mapConcurrently_ cancel, every other thread will more likely be killed by the time the killing thread is timed out, and only the unkillable one will remain

You evidently don't understand the guarantees of stm-containers, which makes your position here much more understandable.

I'm just looking at the source code for it and coming to my own conclusions. AFAICT insertions and deletions do contend with updates. Every focus reads the top-level TVar Hamt, even for an update, but insertions and removals also write it. Where am I mistaken?

avieth · 2025-12-17T17:10:34Z

I'm just looking at the source code for it and coming to my own conclusions. AFAICT insertions and deletions do contend with updates. Every focus reads the top-level TVar Hamt, even for an update, but insertions and removals also write it. Where am I mistaken?

I think I see it now: the focus data type finds the smallest branch to modify, therefore insertion/delete only contends with changes to nearby keys

parsonsmatt · 2025-12-17T17:21:33Z

If one thread is unkillable then map_ cancel will stop at it, and those behind it in the list may never be killed because the thread that's killing the other threads is killed. That might explain the use of old database connections that we saw

But with mapConcurrently_ cancel, every other thread will more likely be killed by the time the killing thread is timed out, and only the unkillable one will remain

Yes - this is what this other PR accomplished. After that PR, and prior to this one, we were no longer observing use of old database connections, but we were observing the freeze and the dramatic degradation in performance with increased concurrency. In this PR, we no longer observe the freeze or degradation in performance with increased concurrency.

While it is possible that uninterruptibleMask is the problem, the Temporal library also has extensive FFI use. Once control is passed from the GHC RTS to an FFI call, the thread is now uninterruptible until the call returns. You can have interruptible foreign calls with InterruptibleFFI, but this is not done in this library (nor is it clear that we should do that by default). However, if either of those were the problem, then it would be very strange indeed that stm-containers and/or concurrent shutdown would fix it, and it does not explain the slowdown-with-increased-concurrency.

avieth · 2025-12-17T18:12:18Z

IIUC that the test suite is like having 32+ mwb instances running on the same temporal shared state then I could see stm-containers being a factor: if there's one AcitivtyWorker/runningActivities map for the entire test suite--shared by all test cases and all test Apps--then that might be contended enough to make a noticeable difference. Still, that much contention seems like an issue itself, but could be one that we don't see in production

parsonsmatt added 3 commits December 12, 2025 16:04

Use stm-containers instead of TVar HashMap

699b008

Also concurrently cancel here

e6e070b

whoops lol

ea0b1b3

parsonsmatt commented Dec 12, 2025

View reviewed changes

parsonsmatt requested review from iand675 and jkachmar and removed request for jkachmar December 12, 2025 23:14

iand675 approved these changes Dec 12, 2025

View reviewed changes

parsonsmatt merged commit 0568289 into main Dec 15, 2025
30 of 31 checks passed

avieth reviewed Dec 16, 2025

View reviewed changes

		, runningWorkflows :: {-# UNPACK #-} !(TVar (HashMap RunId WorkflowInstance))
		, runningWorkflows :: {-# UNPACK #-} !(StmMap.Map RunId WorkflowInstance)

Use stm-containers instead of TVar HashMap #279

Use stm-containers instead of TVar HashMap #279

Conversation

parsonsmatt commented Dec 12, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iand675 left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Dec 15, 2025

📊 Code Coverage Report

Current PR Coverage

Overall Summary

Coverage by Module

📈 Coverage Comparison vs. Main

Overall Summary

Coverage by Module

Uh oh!

Uh oh!

parsonsmatt commented Dec 15, 2025

Uh oh!

avieth commented Dec 16, 2025

Uh oh!

parsonsmatt commented Dec 16, 2025

Uh oh!

avieth commented Dec 16, 2025

Uh oh!

parsonsmatt commented Dec 16, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

avieth Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

avieth Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

avieth commented Dec 16, 2025

Uh oh!

parsonsmatt commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avieth commented Dec 17, 2025

Uh oh!

parsonsmatt commented Dec 17, 2025

Uh oh!

avieth commented Dec 17, 2025

Uh oh!

parsonsmatt commented Dec 17, 2025

avieth Dec 17, 2025 •

edited

Loading

avieth Dec 17, 2025 •

edited

Loading

parsonsmatt commented Dec 16, 2025 •

edited

Loading

avieth commented Dec 17, 2025 •

edited

Loading

avieth commented Dec 17, 2025 •

edited

Loading

parsonsmatt commented Dec 17, 2025 •

edited

Loading