#1955 - Make FileSpout work in distributed mode #1956

Merged
jnioche merged 3 commits into main from fix/1955-filespout-distributed-mode
Jun 19, 2026

Conversation

@dpol1 dpol1 commented Jun 17, 2026

Member

FileSpout read the seed directory and filled its input queue inside the constructor. In distributed mode the constructor runs only on the client that submits the topology, so the resolved queue (usually empty, since the seed files live on the workers) was serialised and shipped out. The spout then sat idle with no activity in the logs.

The constructors now only keep the values they are given: the directory and filter, or the explicit file list. The seeds are resolved in open(), which runs on each worker. WARCSpout already calls super.open(), so it gets the same behaviour without any change.
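
The difference between resolving seeds in the constructor and resolving them in open() can be sketched with plain Java. This is only an illustration under stated assumptions, not the FileSpout code: `EagerSeedSpout`, `LazySeedSpout`, `SeedFiles` and `SeedTimingDemo` are hypothetical names, and the "client" and "worker" are simulated by a directory that is empty at construction time and populated before open().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical demo; none of these classes are StormCrawler code. */
class SeedTimingDemo {

    /** Returns {files seen by the eager spout, files seen by the lazy spout}. */
    static int[] run() {
        try {
            Path dir = Files.createTempDirectory("seeds");
            // "Client side": both spouts are built while the directory is still empty.
            EagerSeedSpout eager = new EagerSeedSpout(dir.toString());
            LazySeedSpout lazy = new LazySeedSpout(dir.toString());
            // "Worker side": the seed file only exists where open() runs.
            Files.write(dir.resolve("seeds.txt"), List.of("https://example.org/"));
            eager.open();
            lazy.open();
            return new int[] {eager.pending(), lazy.pending()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] seen = run();
        System.out.println("eager=" + seen[0] + " lazy=" + seen[1]); // eager=0 lazy=1
    }
}

/** Old pattern (assumed shape): the queue is filled in the constructor. */
class EagerSeedSpout {
    private final Queue<String> queue = new ArrayDeque<>();

    EagerSeedSpout(String dir) {
        queue.addAll(SeedFiles.list(dir)); // sees only what exists at construction
    }

    void open() {
        // Nothing to do: the queue was fixed when the object was built.
    }

    int pending() {
        return queue.size();
    }
}

/** New pattern (assumed shape): the constructor only records its argument. */
class LazySeedSpout {
    private final String dir;
    private Queue<String> queue = new ArrayDeque<>();

    LazySeedSpout(String dir) {
        this.dir = dir;
    }

    void open() {
        queue = new ArrayDeque<>(SeedFiles.list(dir)); // resolved where open() runs
    }

    int pending() {
        return queue.size();
    }
}

/** Helper that lists the regular files in a directory. */
final class SeedFiles {
    static List<String> list(String dir) {
        try (Stream<Path> entries = Files.list(Paths.get(dir))) {
            return entries.filter(Files::isRegularFile)
                    .map(Path::toString)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The eager variant ends up with an empty queue because it looked at the directory too early, which mirrors the idle spout described above.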

While moving the code I also fixed a log call that used printf-style %s formatting instead of SLF4J {} placeholders.

Tests: a regression test builds the spout against an empty directory and creates the seed file only after construction, before open(). It fails on the old code and passes now. There is also a directory happy-path test and a serialisation round-trip.
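
A rough sketch of what such a check could look like with only the JDK is given below. It is not the test added in this PR: `DeferredSpout` and `RoundTripCheck` are hypothetical stand-ins, and the regression and round-trip cases are folded into one method for brevity. It assumes the spout reaches the worker through standard Java serialisation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical check; DeferredSpout is a stand-in, not the real FileSpout. */
class RoundTripCheck {

    /** Serialises and deserialises an object in memory. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds against an empty directory, adds a seed, round-trips, then opens. */
    static int seedsSeenAfterRoundTrip() {
        try {
            Path dir = Files.createTempDirectory("seeds");
            DeferredSpout spout = new DeferredSpout(dir.toString());
            // The seed file appears only after construction, before open().
            Files.write(dir.resolve("seeds.txt"), List.of("https://example.org/"));
            DeferredSpout copy = roundTrip(spout);
            copy.open();
            return copy.pending();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("seeds after round trip: " + seedsSeenAfterRoundTrip());
    }
}

class DeferredSpout implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String dir;             // the only state that gets serialised
    private transient List<String> files; // rebuilt in open(), never shipped

    DeferredSpout(String dir) {
        this.dir = dir;
    }

    void open() {
        files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(Paths.get(dir))) {
            entries.filter(Files::isRegularFile).forEach(p -> files.add(p.toString()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int pending() {
        return files == null ? 0 : files.size();
    }
}
```

Marking the resolved list `transient` makes it explicit that only the constructor arguments travel with the serialised object; whether the real spout does the same is not stated in this PR.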

Closes #1955

For all changes

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

Note

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@dpol1 dpol1 requested review from jnioche, mvolikas, rzo1 and sigee June 17, 2026 12:49
@dpol1 dpol1 added the bug label Jun 17, 2026
@dpol1 dpol1 added this to the 3.6.1 milestone Jun 17, 2026

@jnioche jnioche left a comment

Contributor

looks good, a few minor suggestions

Comment thread core/src/main/java/org/apache/stormcrawler/spout/FileSpout.java Outdated
Comment thread core/src/main/java/org/apache/stormcrawler/spout/FileSpout.java Outdated
Comment thread core/src/main/java/org/apache/stormcrawler/spout/FileSpout.java Outdated
@dpol1 dpol1 requested a review from jnioche June 18, 2026 14:23
@jnioche jnioche merged commit aeb16a6 into main Jun 19, 2026
2 checks passed
@jnioche jnioche deleted the fix/1955-filespout-distributed-mode branch June 19, 2026 05:50
jnioche commented Jun 19, 2026

Contributor

Thanks @dpol1

Development

Successfully merging this pull request may close these issues.

[Bug report] FileSpout does not work in distributed mode

3 participants