
[Bug] WorkflowTestCase / TestService: time skipping starts locked by default, no auto-unlock around getResult() — timer-only workflows hang in tests #743


@ilyazastrognov

What are you really trying to do?

Test a timer-driven workflow with the in-process time-skipping test server. The workflow waits for up to 30 minutes via
Workflow::awaitWithTimeout($seconds, $cond), then calls a (mocked) activity. The signal-driven path completes in milliseconds; the
timer-only path hangs.

Describe the bug

When using Temporal\Testing\WorkflowTestCase together with ActivityMocker, workflows that block on a timer (e.g.
Workflow::awaitWithTimeout(1800, fn() => $this->resolved) with no signal) do not fast-forward. The test client blocks for the
wall-clock duration of withWorkflowExecutionTimeout and eventually surfaces WorkflowFailedException with retryState=3
(RETRY_STATE_TIMEOUT).

Two independent issues compound, and the obvious workaround for them triggers a third:

  1. Test server starts with time-skipping locked. --enable-time-skipping is passed by Environment::startTemporalTestServer(), but per
    TestService::lockTimeSkipping() docblock and the test server's own behavior, the server starts with Time Locking Counter = 1 — i.e.
    skipping enabled but locked. To actually fast-forward, the counter has to be decremented to 0.

  2. PHP SDK does not auto-unlock around blocking client calls. Other SDKs (TypeScript, Java, Go) wrap getResult() / execute() with
    implicit unlock/lock so users don't have to think about the counter. Quoting
    https://docs.temporal.io/develop/typescript/best-practices/testing-suite:

    > The test server starts in 'normal' time. When you use TestWorkflowEnvironment.client.workflow.execute() or .result(), the test
    > server switches to 'skipped' time mode until the Workflow completes.

    PHP WorkflowClient::start()->getResult() does no such thing. Grepping the SDK confirms lockTimeSkipping/unlockTimeSkipping are
    referenced only in the README and generated protobuf, and are never invoked from Client/ or WorkflowTestCase.

  3. Manual unlockTimeSkipping() collides with ActivityMocker. When users work around (1)+(2) by calling
    $this->testingService->unlockTimeSkipping() before getResult() and lockTimeSkipping() after, timers fast-forward but
    ActivityMocker-served activities now fail with TIMEOUT_TYPE_START_TO_CLOSE. The test server only halts virtual time when an activity
    is actually running on a worker. ActivityMocker short-circuits the activity through RoadRunner KV, so the server never sees an
    active activity and skips through withStartToCloseTimeout faster than the mocked response can arrive.
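
The counter behavior described above can be modelled in a few lines. This is a toy sketch, not the test server's implementation: FakeLockCounter and withTimeSkippingUnlocked() are hypothetical names, and the only behavior assumed is what this issue states (the counter starts at 1, and time is skipped only while it is 0). The helper pairs unlock and lock in try/finally so the counter is restored even if the blocking call throws.

```php
<?php

declare(strict_types=1);

/**
 * Toy stand-in for the test server's lock counter (hypothetical, for illustration).
 * Assumption taken from this issue: the counter starts at 1 and time is skipped only at 0.
 */
final class FakeLockCounter
{
    private int $locks = 1;

    public function lock(): void
    {
        $this->locks++;
    }

    public function unlock(): void
    {
        $this->locks--;
    }

    public function isSkipping(): bool
    {
        return $this->locks === 0;
    }
}

/**
 * Runs $body with time skipping unlocked and always re-locks afterwards.
 *
 * @param callable(): void  $unlock
 * @param callable(): void  $lock
 * @param callable(): mixed $body
 */
function withTimeSkippingUnlocked(callable $unlock, callable $lock, callable $body): mixed
{
    $unlock();
    try {
        return $body();
    } finally {
        $lock();
    }
}

$counter = new FakeLockCounter();

// Locked by default: the counter is 1, so nothing is skipped.
$before = $counter->isSkipping();

// Inside the body the counter is 0, so the model reports skipping.
$during = withTimeSkippingUnlocked(
    $counter->unlock(...),
    $counter->lock(...),
    fn(): bool => $counter->isSkipping(),
);

// The finally block has re-locked the counter.
$after = $counter->isSkipping();
```

In a real test the two callables would be $this->testingService->unlockTimeSkipping() and $this->testingService->lockTimeSkipping(), with getResult() as the body. That pairing is still subject to the ActivityMocker timeout collision described above.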

Minimal Reproduction

```php
<?php

declare(strict_types=1);

use Temporal\Client\WorkflowOptions;
use Temporal\Testing\WorkflowTestCase;
use Temporal\Workflow;
use Temporal\Workflow\SignalMethod;
use Temporal\Workflow\WorkflowInterface;
use Temporal\Workflow\WorkflowMethod;

#[WorkflowInterface]
final class TimerWorkflow
{
    private bool $resolved = false;

    #[SignalMethod]
    public function done(): void
    {
        $this->resolved = true;
    }

    #[WorkflowMethod(name: 'TimerWorkflow')]
    public function execute(): \Generator
    {
        // Wait up to 30 minutes for the done() signal; this test never sends it.
        yield Workflow::awaitWithTimeout(1800, fn() => $this->resolved);

        return $this->resolved ? 'signalled' : 'timed_out';
    }
}

final class TimerWorkflowTest extends WorkflowTestCase
{
    public function testTimesOut(): void
    {
        $stub = $this->workflowClient->newWorkflowStub(
            TimerWorkflow::class,
            WorkflowOptions::new()
                ->withTaskQueue('default')
                ->withWorkflowExecutionTimeout('PT2H'),
        );

        // hangs ~30 min wall-clock instead of skipping
        $result = $this->workflowClient->start($stub)->getResult('string');

        self::assertSame('timed_out', $result);
    }
}
```

Expected Behavior

getResult() returns 'timed_out' in milliseconds (virtual 1800 seconds skipped while no activity is running). This is what TypeScript
/ Java / Go SDKs do today.

Actual Behavior

getResult() blocks for ~30 minutes wall-clock until the test server eventually times out the workflow.

Versions

  • temporal/sdk 2.17 (also reproduced on master at the time of writing)
  • temporal-test-server whichever version SystemInfo resolves (official binary, downloaded via Downloader)
  • PHP 8.4
  • RoadRunner 2024.x (issue is unrelated to RR — same shape against a vanilla worker)
  • Linux, x86_64

Suggested Fixes (in order of preference)

  1. Mirror TS/Java/Go behavior. Have WorkflowTestCase (or WorkflowClient when running against the test server) automatically
    unlockTimeSkipping() around start()->getResult() / execute() and lockTimeSkipping() afterwards. Document the pairing.
  2. Make ActivityMocker participate in the virtual clock. Have it tell the test server "an activity is running for the same task
    queue" until the mock result is delivered, so withStartToCloseTimeout doesn't fire under fast-forward. Without this, fix (1) on its
    own breaks any test that mixes timers and mocked activities.
  3. At minimum: document this. The current testing-suite docs imply time-skipping just works. There is no mention of the lock counter
    starting at 1, no mention that ActivityMocker is incompatible with timer-driven workflows, and the only working pattern
    (signal-driven workflows that resolve immediately) is the narrowest case. Until (1) and (2) ship, point users at the workaround of
    parameterizing all timeouts as workflow input — but that should be a fallback, not the default story.

Workaround we are using

Pass all timer durations as workflow input arguments with production defaults. In tests we pass small values (1-2 seconds) so the
workflow completes in real wall-clock time; time-skipping is not used. This works but bakes timeout values into workflow history,
and it pushes the problem onto every workflow author rather than fixing the test infrastructure.
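
A minimal sketch of this workaround, assuming a hypothetical TimerWorkflowInput value object (not an SDK class) whose default is the production duration. How the SDK's data converter serializes such an object is not covered here; a plain int argument would serve the same purpose.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical workflow input carrying the timer duration (illustrative, not part of the SDK).
 * The default is the production value; tests override it with a short real-time wait.
 */
final class TimerWorkflowInput
{
    public function __construct(
        public readonly int $awaitTimeoutSeconds = 1800,
    ) {
        if ($awaitTimeoutSeconds < 1) {
            throw new \InvalidArgumentException('awaitTimeoutSeconds must be at least 1');
        }
    }
}

// Production callers rely on the default; tests pass a value small enough to wait out.
$production = new TimerWorkflowInput();
$test = new TimerWorkflowInput(awaitTimeoutSeconds: 2);
```

Inside the workflow, the literal 1800 passed to Workflow::awaitWithTimeout() would become $input->awaitTimeoutSeconds, and the test would pass $test when starting the workflow. Because the input is recorded in history, the chosen value becomes part of every execution, which is the drawback noted above.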
