babalablab/task-assignment

Reproduction code for "Task Assignment meets Annotator Modeling: Human-LLM Collaborative Annotation with Constraints"

Caution

On Apple Silicon, linear programming (PuLP/CBC) may be unstable depending on environment settings. If needed, run under Rosetta.
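
To tell whether the caution applies to the current interpreter, a small check along these lines may help. It is only a sketch: the helper name is_native_apple_silicon is hypothetical and not part of this repository, and it assumes that a Python process running under Rosetta reports x86_64 as its machine type.

```python
import platform


def is_native_apple_silicon(system: str, machine: str) -> bool:
    """Return True for a native arm64 interpreter on macOS (hypothetical helper)."""
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    # platform.system() and platform.machine() describe the running interpreter.
    if is_native_apple_silicon(platform.system(), platform.machine()):
        print("Native Apple Silicon Python: consider Rosetta if the LP solver misbehaves.")
    else:
        print("Not a native Apple Silicon Python process.")
```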

1. Environment Setup

1.1 Using uv (recommended)

uv sync

Download NLTK resources if required:

uv run python -m nltk.downloader all

1.2 Using Docker

docker compose up -d
docker compose exec Human_LLM_collaborative_annotation bash

Run commands in the container:

python src/main.py ...

Or from host:

docker compose exec Human_LLM_collaborative_annotation python src/main.py ...

2. Authentication and Required Data

2.1 Login

  • Hugging Face:
    huggingface-cli login
  • Weights & Biases:
    wandb login

2.2 Required local files

At minimum:

  • data/tweet_eval/tweet_eval_annotated_with_llm.csv
  • data/word-sets.json (used by tweet_eval_vocab preprocessing)

Missing split files (for example tweet_eval_10_train.csv) are generated automatically during preprocessing.
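
Before launching a run, it can be useful to confirm that the two required files are in place. The following is a minimal sketch, not part of the repository: the function name missing_required_files is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Paths taken from the list of required local files above.
REQUIRED_FILES = (
    "data/tweet_eval/tweet_eval_annotated_with_llm.csv",
    "data/word-sets.json",
)


def missing_required_files(root: Path = Path(".")) -> list[str]:
    """Return the required files that do not exist under root."""
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_required_files()
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required files are present.")
```

Split files such as tweet_eval_10_train.csv are deliberately not checked, since preprocessing generates them.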

3. Configuration

This project uses Hydra. Any setting can be overridden from the command line with dot notation:

uv run python src/main.py wandb_enabled=false trainer.seed=20
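
As an illustration of what a dot-notation override does to the nested config, here is a simplified pure-Python sketch. It is not Hydra's implementation: the helpers parse_value and apply_overrides are hypothetical, the starting value trainer.seed=10 is made up for the example, and unlike Hydra the sketch silently creates keys that do not exist yet.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Convert a command-line string to bool, int, or float when possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply key=value overrides, where dots in the key walk into nested dicts."""
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical starting config; the real defaults live in config/config.yaml.
config = {"wandb_enabled": True, "trainer": {"seed": 10}}
apply_overrides(config, ["wandb_enabled=false", "trainer.seed=20"])
print(config)  # {'wandb_enabled': False, 'trainer': {'seed': 20}}
```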

3.1 Base Config

Base settings live in config/config.yaml.

| Setting | Default | Description |
| --- | --- | --- |
| defaults | dataset: spiral, trainer: spiral, model: common_confusion | Hydra config groups loaded by default. |
| method | train | Execution method. Use train, confusion, or icrowd depending on the experiment config. |
| name | ${trainer.dataset_name} experiment | Run name passed to W&B when logging is enabled. |
| debug | false | Enables short debug behavior in training and changes commit_hash to debug_mode. |
| abci | false | Appends _abci to the run name in the common training path. |
| mode | train | Main execution mode. Current entry point supports train. |
| annotator_num | 6 | Number of annotators or systems used by model and assignment configs. |
| commit_hash | ${commit_hash: ${debug}} | Output namespace generated by the Hydra resolver. |
| wandb_enabled | true | Enables W&B initialization and W&B logger usage. Set false to run without W&B. |
| wandb_entity | kei-moriyama-the-university-of-tokyo | W&B entity. |
| wandb_project | task assignment | W&B project name. |
| logger._target_ | lightning.pytorch.loggers.WandbLogger | Lightning logger class used when wandb_enabled=true. |
| train.epoch | 150 | Number of training epochs unless overridden by trainer or debug behavior. |

Disable W&B logging with:

uv run python src/main.py wandb_enabled=false

3.2 Experiment Configs

Experiment configs are selected with +experiment=<name>.

| Config | Main overrides |
| --- | --- |
| tweet_eval_confusion | Uses trainer: sentiment, dataset: tweet_eval, model: confusion, LossConfusion, and ConfusionModel. |
| tweet_eval_confusion_cost_const | Adds trainer.random_assignment=false and model.CostConstraint with cost_per_annotator and total_cost_per_annotator. |
| tweet_eval_learning_to_defer_assignment | Uses dataset: tweet_eval_vocab, model: linear, LossLearningToDefer, MatchingBatchModel, and assign_interval. |
| tweet_eval_icrowd_assignment | Sets method=icrowd and uses train.icrowd.NLPICrowdTaskAssignment. |
| spiral_different_test_num_confusion | Uses dataset: spiral_test_num, ConfusionModel, and MaximumNumberConstraint. |
| spiral_different_test_num_confusion_cost | Uses dataset: spiral_test_num, ConfusionModel, and CostConstraint. |

4. Reproduction Commands

All commands below assume uv. For Docker, replace uv run with docker compose exec Human_LLM_collaborative_annotation.

4.1 Case 1: Full Annotation on Large Dataset

Ours (maximum-assignment constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  trainer.seed=10,20,30,40,50

Ours (cost constraint)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion_cost_const \
  trainer.seed=10,20,30,40,50

Baseline: L2D + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_learning_to_defer_assignment \
  trainer.seed=10,20,30,40,50

Baseline: iCrowd + assignment

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_icrowd_assignment \
  trainer.seed=10,20,30,40,50

4.2 Case 2: Full Annotation on Small Dataset (sampling-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.sampling_rate=0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50
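
With -m (multirun), Hydra launches one job per combination of the comma-separated values, so the command above expands to 6 sampling rates times 5 seeds. The sketch below only illustrates that expansion; expand_sweep is a hypothetical helper, not something Hydra or this repository provides.

```python
from itertools import product


def expand_sweep(sweeps: dict[str, str]) -> list[dict[str, str]]:
    """Expand comma-separated sweep values into one override set per job."""
    keys = list(sweeps)
    value_lists = [sweeps[key].split(",") for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


jobs = expand_sweep(
    {
        "dataset.sampling_rate": "0.5,0.6,0.7,0.8,0.9,1.0",
        "trainer.seed": "10,20,30,40,50",
    }
)
print(len(jobs))  # 30
print(jobs[0])  # {'dataset.sampling_rate': '0.5', 'trainer.seed': '10'}
```

The same arithmetic applies to the other sweeps in this section, which helps when estimating total runtime before starting a multirun.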

4.3 Case 3: Partial Annotation (annotation/filter-rate sweep)

uv run python src/main.py -m \
  debug=false \
  +experiment=tweet_eval_confusion \
  dataset.filter_ratio=0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0 \
  trainer.seed=10,20,30,40,50

4.4 Linear Programming Runtime Evaluation (Spiral)

Maximum-assignment constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

Cost constraint

uv run python src/main.py -m \
  debug=false \
  +experiment=spiral_different_test_num_confusion_cost \
  dataset.test_data_num=10000,30000,50000,100000 \
  trainer.seed=10,20,30,40,50

5. Output Artifacts

Main outputs are written to:

  • outputs/<commit_hash>/<dataset>/<loss_or_method>/<seed>/

Typical artifacts:

  • score_test.json
  • W&B tables/metrics

With debug=true, commit_hash becomes debug_mode.
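
Since each seed writes its own score_test.json, results are typically averaged across seed directories. The following is a minimal sketch under two assumptions that the source does not confirm: each score_test.json is a flat JSON object, and its numeric fields are the metrics of interest. The function name average_scores is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def average_scores(run_root: Path) -> dict[str, float]:
    """Average numeric top-level fields over <run_root>/<seed>/score_test.json."""
    values = defaultdict(list)
    for path in sorted(run_root.glob("*/score_test.json")):
        with path.open() as handle:
            scores = json.load(handle)
        if not isinstance(scores, dict):
            continue  # Assumption violated: skip files that are not flat objects.
        for key, value in scores.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values[key].append(value)
    return {key: mean(vals) for key, vals in values.items()}
```

Here run_root stands for the directory one level above the seeds, that is outputs/<commit_hash>/<dataset>/<loss_or_method>/.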

6. Citation

@inproceedings{moriyama-etal-2026-task,
    title = "Task Assignment meets Annotator Modeling: Human-{LLM} Collaborative Annotation with Constraints",
    author = "Moriyama, Kei  and
      Nakayama, Kouta  and
      Baba, Yukino",
    editor = "T.Y.S.S., Santosh  and
      Rodriguez, Juan Diego  and
      de Gibert, Ona",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 4: Student Research Workshop)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-srw.79/",
    pages = "888--902",
    ISBN = "979-8-89176-393-7",
    abstract = "Crowdsourced annotators and Large Language Models (LLMs) offer complementary, cost-effective ways to obtain labeled data, yet ensuring high label quality remains challenging. We observe that task features influence the accuracy of humans and LLMs, while real-world constraints, such as per-annotator assignment limits, further complicate allocation. Prior work typically addresses either task features or constraints, but not both. We present an integrated framework that (i) estimates per-task accuracy from task features using a \textit{learning from crowds} model and (ii) incorporates these estimations into a linear programming formulation that assigns tasks under practical constraints. Experimental results demonstrate that the proposed method achieves accuracy comparable to that of baseline methods while satisfying given constraints."
}
