feat(transformer_ner): Trainable BERT NER plugin, along with rawstring tokenizer that splits on all spaces.#550
Hi, just a brief note on performance. I'll list some permutations of components and tokenisers along with the results I can get. These aren't exhaustive, and performance can always be improved with better configs.
As much as I agree that recall is often more important than precision, on its own I don't see it as particularly valuable. I could just mark all spans for all concepts and get a recall of 1 at the expense of near-zero precision. PS:
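To make the recall-only argument concrete, here is a minimal sketch on a toy example. The span representation (token start, exclusive token end) and the gold annotations are made up for illustration and are not taken from the benchmark in this PR.

```python
from __future__ import annotations


def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Compute exact-match precision and recall over sets of spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy document with 5 tokens: every contiguous (start, end) span is a candidate.
n_tokens = 5
all_spans = {(s, e) for s in range(n_tokens) for e in range(s + 1, n_tokens + 1)}

# Hypothetical gold annotations for the toy document.
gold = {(0, 2), (3, 4)}

# "Mark everything" strategy: predict every candidate span.
p_all, r_all = precision_recall(all_spans, gold)
print(f"precision={p_all:.3f} recall={r_all:.3f}")
```

Recall is perfect because every gold span is among the predictions, while precision collapses to the fraction of candidates that happen to be gold, and it shrinks further as documents get longer.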
mart-r left a comment
Massive PR!
This should have been 3 different PRs:
- Embedding linker changes
- Rawstring tokenizer
- Transformers based NER
I really couldn't go through everything in enough detail at this point.
But I've left a few comments.
I think the main questions I had were the following:
- What's rawstring tokenizer and how does it differ from the built in regex tokenizer? Why do we need a separate one?
- Why can't the rawstring tokenizer be used with the regular / context based linker?
And then there's the fact that I think we need to have workflows for the new things as well.
| index: int, | ||
| char_index: int, | ||
| end_char_index: int) -> None: | ||
| # --- BaseToken fields --- |
This is fine, but to avoid all the boilerplate, I would just use the .text (and so on) fields rather than the property stuff below.
It would still satisfy the protocol. The only difference is that the user doesn't know that these are writable, and thus type checkers would complain if/when they try to write into these.
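A minimal sketch of the fields-versus-properties point, assuming the protocol declares its members as read-only properties. `HasText`, `PropertyToken`, and `FieldToken` are hypothetical names for illustration, not the classes in this PR or in MedCAT.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasText(Protocol):
    """Hypothetical stand-in for the token protocol; only `text` is modelled."""

    @property
    def text(self) -> str: ...


class PropertyToken:
    """Boilerplate-heavy variant: explicit property plus setter."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        return self._text

    @text.setter
    def text(self, value: str) -> None:
        self._text = value


class FieldToken:
    """Plain-field variant: less code, still structurally matches HasText."""

    def __init__(self, text: str) -> None:
        self.text = text


def shout(token: HasText) -> str:
    # Code typed against the protocol only sees a read-only `text`,
    # so a type checker would flag `token.text = ...` here.
    return token.text.upper()


print(shout(PropertyToken("fever")), shout(FieldToken("fever")))
```

Both classes are accepted wherever `HasText` is expected. The trade-off is the one described above: through the protocol type, callers cannot tell that the plain field is writable.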
| def __init__(self, text: str, start_index: int, end_index: int, | ||
| start_char: int, end_char: int, label: str = "") -> None: | ||
| # --- BaseEntity fields --- |
Same here - I would just use fields instead of properties.
| # _WORD_RE = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*", re.UNICODE) | ||
| def _iter_word_spans( |
This is effectively where the tokenization itself happens :)
Perhaps move this (along with the regex above) to its own module to separate it? e.g. base.py
| ### Component Registration | ||
| Register the tokenizer by name before trying to add the tokenizer to the pipeline. If loading a model with a rawstring tokenizer register it beforehand. |
I'd normally expect the extension to take care of that.
See:
https://github.com/CogStack/cogstack-nlp/blob/main/medcat-plugins/embedding-linker/src/medcat_embedding_linker/__init__.py
and
https://github.com/CogStack/cogstack-nlp/blob/main/medcat-plugins/embedding-linker/src/medcat_embedding_linker/registration.py
for example
EDIT:
I see you've already done that for transformers NER as well
| @@ -0,0 +1,80 @@ | |||
| # MedCAT Embedding Linker | |||
How does it differ from the regex tokenizer?
Just a different regex?
What's the benefit?
Why do we need this separately from the built in regex tokenizer?
Also, would be great to add a workflow that runs some typing/linting and tests on this stuff.
| ## Limitations | ||
| - Can NOT be used with the default `context_based_linker` as, that uses spacy tokens and spacy embeddings for linking. Which are not used with this tokenizer. |
I'm not quite sure I understand why it can't be used with the context based linker.
There is nothing in there that is coupled to the spacy tokenizer.
| @@ -0,0 +1,100 @@ | |||
| # MedCAT Embedding Linker | |||
This could also use workflows that run linting/typing/tests.
Hihi,
This is WIP so we'll top it off with a TODO here for now:
This is the trainable MLM transformer model attempting to do NER. It is a binary BIOES NER model where each prediction is one of Beginning-Ent, Inside-Ent, Outside-Ent, End-Ent, or Single-Ent. This is a bit of an advancement over BIO models: E signals the end of a multi-token label, and S signals a standalone token label. We try to prioritise B and E tokens here for performance (i.e. to ensure we get well-formed predictions). We also have a CRF head after the MLM model to try to encourage more well-formed label predictions (i.e. only I and E after B).
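As a rough illustration of the BIOES scheme and of what "well formed" means, here is a minimal sketch. It uses the standard BIOES transition constraints and is not the CRF head from this PR; the function names and the span format (token start, exclusive token end) are assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple


def spans_to_bioes(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Encode non-overlapping token spans (end exclusive) as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags


# Which tag may follow which: inside an entity only I or E can come next.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "O": {"O", "B", "S"},
    "S": {"O", "B", "S"},
    "E": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
}


def is_well_formed(tags: Sequence[str]) -> bool:
    """Check that a tag sequence opens and closes every entity correctly."""
    if not tags:
        return True
    if tags[0] in {"I", "E"} or tags[-1] in {"B", "I"}:
        return False
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))


tags = spans_to_bioes(6, [(0, 3), (4, 5)])
print(tags, is_well_formed(tags))
```

A CRF layer can learn or be constrained to respect a transition table like `ALLOWED_NEXT`, which is one way to discourage sequences such as a B followed directly by an O.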
transformer_ner.py is the main logic for the plugin, while transformer_ner_model is the logic for the model (such as initialisation, loading, and the forward step).

Rawstring_tokenizer is a tokenizer where all tokens are based on whitespace splits, i.e. new lines, tabs, and spaces. It still can't perfectly obtain all entities (e.g. sub-word entities), but it is an improvement where some entities don't have a spacy representation. This mainly improves performance in pipelines that use transformer_ner and embedding_linker.
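For readers wondering how a whitespace-only tokenizer differs from a word-pattern one, here is a minimal sketch. `iter_whitespace_spans` and `_WHITESPACE_TOKEN_RE` are hypothetical names, not the plugin's implementation; `_WORD_RE` is the commented-out pattern quoted in the diff, used here only for comparison and not claimed to be what the built-in regex tokenizer uses.

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical pattern: a token is any maximal run of non-whitespace characters.
_WHITESPACE_TOKEN_RE = re.compile(r"\S+")

# The commented-out pattern quoted in the diff, kept for comparison only.
_WORD_RE = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*", re.UNICODE)


def iter_whitespace_spans(text: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (token_text, start_char, end_char) for whitespace-delimited tokens."""
    for match in _WHITESPACE_TOKEN_RE.finditer(text):
        yield match.group(), match.start(), match.end()


text = "BP 120/80\tmmHg,\nstable"
whitespace_tokens: List[Tuple[str, int, int]] = list(iter_whitespace_spans(text))
word_tokens: List[str] = [m.group() for m in _WORD_RE.finditer(text)]

print(whitespace_tokens)
print(word_tokens)
```

On this input the whitespace split keeps the trailing comma attached to `mmHg,`, while the word pattern drops it. That is the kind of behavioural difference the review questions are asking the README to spell out.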
There are also additional changes to the embedding_linker. I understand these should probably be separate; however, that slipped through the cracks. Apologies. The changes are mainly more functionality and configurability:
Performance-wise, you can expect trained models with reasonable configs to look like this (based on training / testing on Distemist & Snomed Entity Linking Benchmark):
There are a few additional points on these metrics. The embedding linker is highly configurable, so you can go from a recall of roughly 0.84 and a precision of 0.4 to a recall of 0.9 and a precision of 0.05. The metrics I have here are essentially from configurations I think make "sense". One rule of thumb I used: if recall goes up and precision stays the same or improves, I'd consider that a solid improvement. I have documented these changes in performance within the config, so hopefully people can make informed decisions.