Reproduction Code of "Task Assignment meets Annotator Modeling: Human-LLM Collaborative Annotation with Constraints"

Caution

On Apple Silicon, the linear programming solver (PuLP/CBC) may be unstable depending on the environment settings. If it is, run the code under Rosetta.
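One way to check the solver on a given machine before starting a long run is to solve a tiny problem with PuLP's bundled CBC. The sketch below is not part of this repository: the function name `check_cbc`, the toy problem, and the returned status strings are illustrative assumptions.

```python
def check_cbc():
    """Solve a two-variable LP with CBC and return a short status string."""
    try:
        import pulp
    except ImportError:
        return "pulp-not-installed"

    solver = pulp.PULP_CBC_CMD(msg=False)  # silence solver logging
    if not solver.available():
        return "cbc-not-available"

    # Toy problem (hypothetical): maximize x + 2y subject to x + y <= 5.
    problem = pulp.LpProblem("smoke_test", pulp.LpMaximize)
    x = pulp.LpVariable("x", lowBound=0, upBound=4)
    y = pulp.LpVariable("y", lowBound=0, upBound=3)
    problem += x + 2 * y   # objective
    problem += x + y <= 5  # constraint

    try:
        problem.solve(solver)
    except pulp.PulpSolverError:
        return "solver-error"
    return pulp.LpStatus[problem.status]


print(check_cbc())
```

On a working installation this is expected to report `Optimal`; any other result suggests trying the Rosetta route before running the experiments.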

1. Environment Setup

1.1 Using `uv` (recommended)

uv sync

Download NLTK resources if required:

uv run python -m nltk.downloader all

1.2 Using Docker

docker compose up -d
docker compose exec Human_LLM_collaborative_annotation bash

Run commands in the container:

python src/main.py ...

Or from host:

docker compose exec Human_LLM_collaborative_annotation python src/main.py ...

2. Authentication and Required Data

2.1 Login

Hugging Face:
```
huggingface-cli login
```
Weights & Biases:
```
wandb login
```

2.2 Required local files

At minimum:

data/tweet_eval/tweet_eval_annotated_with_llm.csv
data/word-sets.json (used by tweet_eval_vocab preprocessing)

Missing split files (for example tweet_eval_10_train.csv) are generated automatically during preprocessing.
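A quick pre-flight check can confirm that the two required files are in place before training starts. This is a minimal sketch, not a script shipped with the repository; the function name `missing_required_files` is hypothetical, and the paths are copied from the list above.

```python
from pathlib import Path

# Paths taken from the "Required local files" list, relative to the repo root.
REQUIRED_FILES = (
    "data/tweet_eval/tweet_eval_annotated_with_llm.csv",
    "data/word-sets.json",
)


def missing_required_files(repo_root="."):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_required_files()
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required files are present.")
```

Split files are deliberately left out of the check, since they are generated during preprocessing.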

3. Configuration

This project uses Hydra. Any setting can be overridden from the command line with dot notation:

uv run python src/main.py wandb_enabled=false trainer.seed=20
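To make the dot notation concrete, the following sketch shows how `key.subkey=value` strings map onto a nested configuration. It is a simplified illustration written for this README, not Hydra's actual override parser (which supports a richer grammar); the helper names and the placeholder base values are assumptions.

```python
import copy


def parse_value(text):
    """Convert a command-line string to bool, int, or float; else keep str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a copy of a nested dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})  # walk or create nested levels
        node[leaf] = parse_value(raw)
    return result


# Placeholder base values, chosen only for the demonstration.
base = {"wandb_enabled": True, "trainer": {"seed": 10}}
updated = apply_overrides(base, ["wandb_enabled=false", "trainer.seed=20"])
print(updated)  # {'wandb_enabled': False, 'trainer': {'seed': 20}}
```

Each dot descends one level in the config tree, so `trainer.seed=20` changes only the `seed` entry inside the `trainer` group.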

3.1 Base Config

Base settings live in `config/config.yaml`.

| Setting | Default | Description |
| --- | --- | --- |
| `defaults` | `dataset: spiral`, `trainer: spiral`, `model: common_confusion` | Hydra config groups loaded by default. |
| `method` | `train` | Execution method. Use `train`, `confusion`, or `icrowd` depending on the experiment config. |
| `name` | `${trainer.dataset_name} experiment` | Run name passed to W&B when logging is enabled. |
| `debug` | `false` | Enables short debug behavior in training and changes `commit_hash` to `debug_mode`. |
| `abci` | `false` | Appends `_abci` to the run name in the common training path. |
| `mode` | `train` | Main execution mode. Current entry point supports `train`. |
| `annotator_num` | `6` | Number of annotators or systems used by model and assignment configs. |
| `commit_hash` | `${commit_hash: ${debug}}` | Output namespace generated by the Hydra resolver. |
| `wandb_enabled` | `true` | Enables W&B initialization and W&B logger usage. Set `false` to run without W&B. |
| `wandb_entity` | `kei-moriyama-the-university-of-tokyo` | W&B entity. |
| `wandb_project` | `task assignment` | W&B project name. |
| `logger._target_` | `lightning.pytorch.loggers.WandbLogger` | Lightning logger class used when `wandb_enabled=true`. |
| `train.epoch` | `150` | Number of training epochs unless overridden by trainer or debug behavior. |
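The `commit_hash` setting is computed by a custom resolver that receives the value of `debug`. The repository's resolver is not reproduced here; the sketch below shows one plausible shape for such a function. Reading the hash from Git and the `"unknown"` fallback are both assumptions, and the function name is hypothetical.

```python
import subprocess


def resolve_commit_hash(debug):
    """Return 'debug_mode' when debug is on, else the current Git commit hash."""
    if debug:
        return "debug_mode"
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Assumption: fall back to a fixed label outside a Git checkout.
        return "unknown"
    return completed.stdout.strip()


# With OmegaConf installed, a function like this could be registered so that
# `${commit_hash: ${debug}}` resolves through it:
#   OmegaConf.register_new_resolver("commit_hash", resolve_commit_hash)
print(resolve_commit_hash(True))  # debug_mode
```

Keying the output namespace on the commit keeps results from different code versions in separate directories, while debug runs share a single `debug_mode` directory.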

Disable W&B logging with:

uv run python src/main.py wandb_enabled=false

3.2 Experiment Configs

Experiment configs are selected with `+experiment=<name>`.

| Config | Main overrides |
| --- | --- |
| `tweet_eval_confusion` | Uses `trainer: sentiment`, `dataset: tweet_eval`, `model: confusion`, `LossConfusion`, and `ConfusionModel`. |
| `tweet_eval_confusion_cost_const` | Adds `trainer.random_assignment=false` and `model.CostConstraint` with `cost_per_annotator` and `total_cost_per_annotator`. |
| `tweet_eval_learning_to_defer_assignment` | Uses `dataset: tweet_eval_vocab`, `model: linear`, `LossLearningToDefer`, `MatchingBatchModel`, and `assign_interval`. |
| `tweet_eval_icrowd_assignment` | Sets `method=icrowd` and uses `train.icrowd.NLPICrowdTaskAssignment`. |
| `spiral_different_test_num_confusion` | Uses `dataset: spiral_test_num`, `ConfusionModel`, and `MaximumNumberConstraint`. |
| `spiral_different_test_num_confusion_cost` | Uses `dataset: spiral_test_num`, `ConfusionModel`, and `CostConstraint`. |

4. Reproduction Commands

All commands below assume `uv`. For Docker, replace `uv run` with `docker compose exec Human_LLM_collaborative_annotation`.

4.1 Case 1: Full Annotation on Large Dataset

Ours (maximum-assignment constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  trainer.seed=10,20,30,40,50

Ours (cost constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion_cost_const \
  trainer.seed=10,20,30,40,50

Baseline: L2D + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_learning_to_defer_assignment \
  trainer.seed=10,20,30,40,50

Baseline: iCrowd + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_icrowd_assignment \
  trainer.seed=10,20,30,40,50

4.2 Case 2: Full Annotation on Small Dataset (sampling-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.sampling_rate=0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50
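With `-m` (multirun), Hydra's default sweeper launches one job per combination of the comma-separated values, so the number of runs is the product of the list lengths. The sketch below reproduces that expansion for the Case 2 command to make the job count explicit; `expand_sweep` is a hypothetical helper, not part of Hydra or this repository.

```python
from itertools import product


def expand_sweep(sweeps):
    """Expand {key: [values]} into one override list per combination."""
    keys = list(sweeps)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*(sweeps[key] for key in keys))
    ]


case2 = {
    "dataset.sampling_rate": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "trainer.seed": [10, 20, 30, 40, 50],
}
jobs = expand_sweep(case2)
print(len(jobs))  # 30 (6 sampling rates x 5 seeds)
print(jobs[0])    # ['dataset.sampling_rate=0.5', 'trainer.seed=10']
```

The same arithmetic applies to the other sweeps: the Case 3 command yields 10 x 5 = 50 jobs, and each runtime evaluation command yields 4 x 5 = 20.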

4.3 Case 3: Partial Annotation (annotation/filter-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.filter_ratio=0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50

4.4 Linear Programming Runtime Evaluation (Spiral)

Maximum-assignment constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

Cost constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion_cost \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

5. Output Artifacts

Main outputs are written to:

outputs/<commit_hash>/<dataset>/<loss_or_method>/<seed>/

Typical artifacts:

score_test.json
W&B tables/metrics

With `debug=true`, `commit_hash` becomes `debug_mode`.
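Because every seed writes its own `score_test.json`, results usually need to be averaged across seeds. The sketch below walks the directory layout described above and averages one metric per dataset and method. It assumes each `score_test.json` is a flat JSON object of numeric metrics; that schema, the metric name `accuracy`, and the helper names are assumptions, so adjust them to the files actually produced.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def collect_scores(outputs_root, commit_hash):
    """Group score_test.json contents by (dataset, loss_or_method)."""
    grouped = defaultdict(list)
    base = Path(outputs_root) / commit_hash
    # Layout: <dataset>/<loss_or_method>/<seed>/score_test.json
    for path in sorted(base.glob("*/*/*/score_test.json")):
        dataset, method, _seed = path.relative_to(base).parts[:3]
        with path.open() as handle:
            grouped[(dataset, method)].append(json.load(handle))
    return grouped


def average_metric(grouped, metric):
    """Average one numeric metric over the seeds of each group."""
    return {
        key: mean(run[metric] for run in runs if metric in run)
        for key, runs in grouped.items()
        if any(metric in run for run in runs)
    }


# Example usage (hypothetical metric name):
#   scores = collect_scores("outputs", "debug_mode")
#   print(average_metric(scores, "accuracy"))
```

Groups that never report the requested metric are skipped rather than raising an error, which keeps the helper usable across experiments that log different metrics.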

6. Citation

@inproceedings{moriyama-etal-2026-task,
    title = "Task Assignment meets Annotator Modeling: Human-{LLM} Collaborative Annotation with Constraints",
    author = "Moriyama, Kei  and
      Nakayama, Kouta  and
      Baba, Yukino",
    editor = "T.Y.S.S., Santosh  and
      Rodriguez, Juan Diego  and
      de Gibert, Ona",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 4: Student Research Workshop)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-srw.79/",
    pages = "888--902",
    ISBN = "979-8-89176-393-7",
    abstract = "Crowdsourced annotators and Large Language Models (LLMs) offer complementary, cost-effective ways to obtain labeled data, yet ensuring high label quality remains challenging. We observe that task features influence the accuracy of humans and LLMs, while real-world constraints, such as per-annotator assignment limits, further complicate allocation. Prior work typically addresses either task features or constraints, but not both. We present an integrated framework that (i) estimates per-task accuracy from task features using a \textit{learning from crowds} model and (ii) incorporates these estimations into a linear programming formulation that assigns tasks under practical constraints. Experimental results demonstrate that the proposed method achieves accuracy comparable to that of baseline methods while satisfying given constraints."
}
