Using EasyTokenTagger to quickly perform token tagging (NER, POS, chunking, and frame detection)

We'll import the adaptnlp EasyTokenTagger class:

from adaptnlp import EasyTokenTagger
from pprint import pprint

Let's write some simple example text, and instantiate an EasyTokenTagger:

example_text = '''Novetta Solutions is the best. Albert Einstein used to be employed at Novetta Solutions. 
The Wright brothers loved to visit the JBF headquarters, and they would have a chat with Albert.'''
tagger = EasyTokenTagger()

With Transformers

First we'll use a Transformers model, specifically a BERT model.

We'll search HuggingFace for the model we want; in this case, sshleifer's tiny-dbmdz-bert model:

from adaptnlp import HFModelHub
hub = HFModelHub()
model = hub.search_model_by_name('sshleifer/tiny-dbmdz-bert', user_uploaded=True)[0]; model
Model Name: sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english, Tasks: [token-classification]

Next we'll use our tagger to tag our example text:

sentences = tagger.tag_text(text=example_text, model_name_or_path = model)
2021-10-05 17:45:28,242 loading file /root/.flair/models/tiny-dbmdz-bert-large-cased-finetuned-conll03-english/1e2c09da4ad5b3257008353a87852a7148389cc8308b91cf837f066b95650a0d.595173de82e795b5e4022dca79d10d885137a50ed2ee3974f15a75d328c0cd0a

And then look at some of our results:

print("List string outputs of tags:\n")
for sen in sentences['tags']:
    pprint(sen)
List string outputs of tags:

[{'entity': 'I-LOC', 'score': 0.11716679483652115, 'word': '[CLS] Novetta'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'Solutions'},
 {'entity': 'I-LOC', 'score': 0.11716679483652115, 'word': 'is the'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'best'},
 {'entity': 'I-LOC',
  'score': 0.11716679483652115,
  'word': '. Albert Einstein used to be employed'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'at Nov'},
 {'entity': 'I-LOC',
  'score': 0.11716679483652115,
  'word': '##etta Solutions. The Wright brothers loved to visit'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'the'},
 {'entity': 'I-LOC', 'score': 0.11716679483652115, 'word': 'JBF'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'headquarters'},
 {'entity': 'I-LOC', 'score': 0.11716679483652115, 'word': ', and they'},
 {'entity': 'B-ORG', 'score': 0.11758644878864288, 'word': 'would'},
 {'entity': 'I-LOC',
  'score': 0.11716679483652115,
  'word': 'have a chat with Albert. [SEP]'}]
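The tags above come back as a plain list of dictionaries, so post-processing is ordinary Python. As a hypothetical sketch (not part of the adaptnlp API), here is one way to strip BERT's special tokens and glue `##` wordpiece continuations back together, using a few entries copied from the output above:

```python
# Sample entries copied from the tagger output above
tags = [
    {'entity': 'I-LOC', 'score': 0.117, 'word': '[CLS] Novetta'},
    {'entity': 'B-ORG', 'score': 0.118, 'word': 'at Nov'},
    {'entity': 'I-LOC', 'score': 0.117, 'word': '##etta Solutions. [SEP]'},
]

def clean_word(word):
    # Drop BERT special tokens, then join "##" wordpiece continuations
    tokens = [t for t in word.split() if t not in ('[CLS]', '[SEP]')]
    out = ''
    for t in tokens:
        if t.startswith('##'):
            out += t[2:]                    # glue continuation onto previous text
        else:
            out += (' ' if out else '') + t
    return out

cleaned = [{**tag, 'word': clean_word(tag['word'])} for tag in tags]
```

Note that wordpieces split across separate entries (like `Nov` / `##etta` here) would need cross-entry merging; this sketch only cleans within each entry.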

With Flair

Named Entity Recognition (NER)

With Flair we can follow a similar setup to earlier, searching HuggingFace for valid NER models. In our case we'll use Flair's ner-english-ontonotes-fast model.

from adaptnlp import FlairModelHub
hub = FlairModelHub()
model = hub.search_model_by_name('ontonotes-fast')[0]; model
Model Name: flair/ner-english-ontonotes-fast, Tasks: [token-classification], Source: HuggingFace Model Hub

Then we'll tag the string:

sentences = tagger.tag_text(text = example_text, model_name_or_path = model)
2021-10-05 17:48:09,267 loading file /root/.flair/models/ner-english-ontonotes-fast/0d55dd3b912da9cf26e003035a0c269a0e9ab222f0be1e48a3bbba3a58c0fed0.c9907cd5fde3ce84b71a4172e7ca03841cd81ab71d13eb68aa08b259f57c00b6

And we can get back a JSON-like dictionary for each tagged entity:

pprint(sentences[0]['entities'][:5])
[{'confidence': 0.7553082704544067,
  'end_pos': 17,
  'labels': [ORG (0.7553)],
  'start_pos': 0,
  'text': 'Novetta Solutions',
  'value': 'ORG'},
 {'confidence': 0.9927975535392761,
  'end_pos': 46,
  'labels': [PERSON (0.9928)],
  'start_pos': 31,
  'text': 'Albert Einstein',
  'value': 'PERSON'},
 {'confidence': 0.7496212422847748,
  'end_pos': 87,
  'labels': [ORG (0.7496)],
  'start_pos': 70,
  'text': 'Novetta Solutions',
  'value': 'ORG'},
 {'confidence': 0.9998451471328735,
  'end_pos': 99,
  'labels': [PERSON (0.9998)],
  'start_pos': 93,
  'text': 'Wright',
  'value': 'PERSON'},
 {'confidence': 0.967128336429596,
  'end_pos': 131,
  'labels': [ORG (0.9671)],
  'start_pos': 128,
  'text': 'JBF',
  'value': 'ORG'}]
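Since each entity carries a `text` and a `value` key (as shown above), grouping the results by entity type is straightforward. A minimal sketch, using entities copied from the output above:

```python
from collections import defaultdict

# Sample entities copied from the NER output above (trimmed to the relevant keys)
entities = [
    {'text': 'Novetta Solutions', 'value': 'ORG', 'confidence': 0.7553},
    {'text': 'Albert Einstein', 'value': 'PERSON', 'confidence': 0.9928},
    {'text': 'Wright', 'value': 'PERSON', 'confidence': 0.9998},
    {'text': 'JBF', 'value': 'ORG', 'confidence': 0.9671},
]

# Group entity mentions by their predicted type
by_type = defaultdict(list)
for ent in entities:
    by_type[ent['value']].append(ent['text'])
```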

Parts of Speech

Next we'll look at a parts-of-speech tagger.

We can simply pass in "pos", but let's use our search API to find an English POS tagger:

hub.search_model_by_task('pos')
[Model Name: flair/pos-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/pos-english, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/upos-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/upos-english, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/upos-multi-fast, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/upos-multi, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/upos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/upos-fast, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/pos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/pos-fast, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/pos-multi, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/multi-pos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/pos-multi-fast, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/multi-pos-fast, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/da-pos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/de-pos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/de-pos-tweets, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/ml-pos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/ml-upos, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/pt-pos-clinical, Tasks: [token-classification], Source: Flair's Private Model Hub]

We'll use the pos-english-fast model:

model = hub.search_model_by_name('pos-english-fast')[0]; model
Model Name: flair/pos-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub
sentences = tagger.tag_text(text = example_text, model_name_or_path = model)
2021-10-05 17:49:02,823 loading file /root/.flair/models/pos-english-fast/36f7923039eed4c66e4275927daaff6cd275997d61d238355fb1fe0338fe10a1.ff87e5b4e47fdb42a0c00237d9506c671db773e0a7932179ace82e584383a1b8

Then, just as before, we get back our POS tags:

pprint(sentences[0]['entities'][:5])
[{'confidence': 0.998687207698822,
  'end_pos': 7,
  'labels': [NNP (0.9987)],
  'start_pos': 0,
  'text': 'Novetta',
  'value': 'NNP'},
 {'confidence': 0.8011120557785034,
  'end_pos': 17,
  'labels': [NNPS (0.8011)],
  'start_pos': 8,
  'text': 'Solutions',
  'value': 'NNPS'},
 {'confidence': 0.9999979734420776,
  'end_pos': 20,
  'labels': [VBZ (1.0)],
  'start_pos': 18,
  'text': 'is',
  'value': 'VBZ'},
 {'confidence': 0.9999998807907104,
  'end_pos': 24,
  'labels': [DT (1.0)],
  'start_pos': 21,
  'text': 'the',
  'value': 'DT'},
 {'confidence': 0.9101433157920837,
  'end_pos': 29,
  'labels': [JJS (0.9101)],
  'start_pos': 25,
  'text': 'best',
  'value': 'JJS'}]
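Because each entry pairs a `text` span with a `value` tag, it's easy to rebuild a classic token/TAG rendering of the sentence. A small sketch (assuming the keys shown above), using the first few entries from the output:

```python
# Sample POS entries copied from the output above
pos = [
    {'text': 'Novetta', 'value': 'NNP', 'confidence': 0.9987},
    {'text': 'Solutions', 'value': 'NNPS', 'confidence': 0.8011},
    {'text': 'is', 'value': 'VBZ', 'confidence': 1.0},
    {'text': 'the', 'value': 'DT', 'confidence': 1.0},
    {'text': 'best', 'value': 'JJS', 'confidence': 0.9101},
]

# Render tokens in the traditional token/TAG format
tagged = ' '.join(f"{t['text']}/{t['value']}" for t in pos)
```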

Chunk

As with everything before, chunk tasks operate the same way. We can either pass in "chunk" to get the default en-chunk model, or search the model hub:

models = hub.search_model_by_task('chunk'); models
[Model Name: flair/chunk-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/chunk-english, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/chunk, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/chunk-fast, Tasks: [token-classification], Source: Flair's Private Model Hub]

We'll use the fast model again:

model = models[0]; model
Model Name: flair/chunk-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub
sentences = tagger.tag_text(text = example_text, model_name_or_path = model)
2021-10-05 17:49:05,772 loading file /root/.flair/models/chunk-english-fast/be3a207f4993dd6d174d5083341a717d371ec16f721358e7a4d72158ebab28a6.a7f897d05c83e618a8235bbb7ddfca5a79d2daefb8a97c776eb73f97dbaea508

Let's view our results:

pprint(sentences[0]['entities'][:5])
[{'confidence': 0.9879125952720642,
  'end_pos': 17,
  'labels': [NP (0.9879)],
  'start_pos': 0,
  'text': 'Novetta Solutions',
  'value': 'NP'},
 {'confidence': 0.9999805688858032,
  'end_pos': 20,
  'labels': [VP (1.0)],
  'start_pos': 18,
  'text': 'is',
  'value': 'VP'},
 {'confidence': 0.8664445877075195,
  'end_pos': 29,
  'labels': [NP (0.8664)],
  'start_pos': 21,
  'text': 'the best',
  'value': 'NP'},
 {'confidence': 0.9803058207035065,
  'end_pos': 46,
  'labels': [NP (0.9803)],
  'start_pos': 31,
  'text': 'Albert Einstein',
  'value': 'NP'},
 {'confidence': 0.931873619556427,
  'end_pos': 66,
  'labels': [VP (0.9319)],
  'start_pos': 47,
  'text': 'used to be employed',
  'value': 'VP'}]
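A common use of chunk output is pulling out just the noun phrases. As a hypothetical sketch over the entries shown above (only the `text` and `value` keys are needed):

```python
# Sample chunk entries copied from the output above
chunks = [
    {'text': 'Novetta Solutions', 'value': 'NP'},
    {'text': 'is', 'value': 'VP'},
    {'text': 'the best', 'value': 'NP'},
    {'text': 'Albert Einstein', 'value': 'NP'},
    {'text': 'used to be employed', 'value': 'VP'},
]

# Keep only the noun-phrase chunks
noun_phrases = [c['text'] for c in chunks if c['value'] == 'NP']
```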

Frame

We can either pass in "frame" to use the default en-frame-ontonotes model, or search the API for usable models:

models = hub.search_model_by_task("frame"); models
[Model Name: flair/frame-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/frame-english, Tasks: [token-classification], Source: HuggingFace Model Hub,
 Model Name: flair/frame, Tasks: [token-classification], Source: Flair's Private Model Hub,
 Model Name: flair/frame-fast, Tasks: [token-classification], Source: Flair's Private Model Hub]

Again we will use the "fast" model:

model = models[0]; model
Model Name: flair/frame-english-fast, Tasks: [token-classification], Source: HuggingFace Model Hub
sentences = tagger.tag_text(text = example_text, model_name_or_path = model)
2021-10-05 17:49:14,076 loading file /root/.flair/models/frame-english-fast/b2f10f9bc52898d86d8e6f3bf20369d681cc1e9badcb71650aa274ac696433c7.643ca10453770684aca3f2e886a7243adb2979c67a68de6379e50ccf5dc248da
pprint(sentences[0]['entities'][:5])
[{'confidence': 0.9969749450683594,
  'end_pos': 20,
  'labels': [be.01 (0.997)],
  'start_pos': 18,
  'text': 'is',
  'value': 'be.01'},
 {'confidence': 0.8932156562805176,
  'end_pos': 51,
  'labels': [use.03 (0.8932)],
  'start_pos': 47,
  'text': 'used',
  'value': 'use.03'},
 {'confidence': 0.9950985312461853,
  'end_pos': 57,
  'labels': [be.03 (0.9951)],
  'start_pos': 55,
  'text': 'be',
  'value': 'be.03'},
 {'confidence': 0.6651257872581482,
  'end_pos': 66,
  'labels': [employ.01 (0.6651)],
  'start_pos': 58,
  'text': 'employed',
  'value': 'employ.01'},
 {'confidence': 0.7210038900375366,
  'end_pos': 114,
  'labels': [love.01 (0.721)],
  'start_pos': 109,
  'text': 'loved',
  'value': 'love.01'}]
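Each frame `value` is a PropBank-style sense label such as `use.03`, i.e. a lemma plus a numbered sense. As an illustrative sketch (not part of the adaptnlp API), we can split each label into its lemma and sense number:

```python
# Sample frame entries copied from the output above
frames = [
    {'text': 'is', 'value': 'be.01', 'confidence': 0.997},
    {'text': 'used', 'value': 'use.03', 'confidence': 0.8932},
    {'text': 'employed', 'value': 'employ.01', 'confidence': 0.6651},
]

# Split each "lemma.NN" label into (surface token, lemma, sense number)
parsed = [(f['text'], *f['value'].rsplit('.', 1)) for f in frames]
```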

Note: pay attention to the "fast" versus regular naming. The "fast" models are designed to run efficiently on CPU and are worth checking out.

Tag Tokens with All Loaded Models At Once

As different taggers are loaded into memory, we can tag with all of them at once. For example, we'll make a new EasyTokenTagger and load in an NER and a POS tagger:

tagger = EasyTokenTagger()
_ = tagger.tag_text(text=example_text, model_name_or_path="flair/ner-english-ontonotes")
_ = tagger.tag_text(text=example_text, model_name_or_path="pos")
2021-10-05 17:49:16,462 loading file /root/.flair/models/ner-english-ontonotes-fast/0d55dd3b912da9cf26e003035a0c269a0e9ab222f0be1e48a3bbba3a58c0fed0.c9907cd5fde3ce84b71a4172e7ca03841cd81ab71d13eb68aa08b259f57c00b6
2021-10-05 17:49:21,049 loading file /root/.flair/models/pos-english-fast/36f7923039eed4c66e4275927daaff6cd275997d61d238355fb1fe0338fe10a1.ff87e5b4e47fdb42a0c00237d9506c671db773e0a7932179ace82e584383a1b8

Then we'll use both at once:

sentences = tagger.tag_all(text=example_text)

And now we can look at the entities tagged by each model:

sentences[0][:5]
[{'text': 'Novetta Solutions',
  'start_pos': 0,
  'end_pos': 17,
  'labels': [ORG (0.7553)],
  'value': ['ORG'],
  'confidence': [0.7553082704544067]},
 {'text': 'Albert Einstein',
  'start_pos': 31,
  'end_pos': 46,
  'labels': [PERSON (0.9928)],
  'value': ['PERSON'],
  'confidence': [0.9927975535392761]},
 {'text': 'Novetta Solutions',
  'start_pos': 70,
  'end_pos': 87,
  'labels': [ORG (0.7496)],
  'value': ['ORG'],
  'confidence': [0.7496212422847748]},
 {'text': 'Wright',
  'start_pos': 93,
  'end_pos': 99,
  'labels': [PERSON (0.9998), NNP (0.996)],
  'value': ['PERSON', 'NNP'],
  'confidence': [0.9998451471328735, 0.9959734082221985]},
 {'text': 'JBF',
  'start_pos': 128,
  'end_pos': 131,
  'labels': [ORG (0.9671), NNP (1.0)],
  'value': ['ORG', 'NNP'],
  'confidence': [0.967128336429596, 0.9999892711639404]}]
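Notice that after tag_all, each entity's `value` and `confidence` fields are lists, with one entry per model that tagged the span. A small sketch (assuming that structure) for picking the single highest-confidence label per span:

```python
# Sample combined entities copied from the tag_all output above
combined = [
    {'text': 'Wright', 'value': ['PERSON', 'NNP'], 'confidence': [0.9998, 0.9960]},
    {'text': 'JBF', 'value': ['ORG', 'NNP'], 'confidence': [0.9671, 1.0]},
]

# For each span, keep the label with the highest confidence across models
best = {
    e['text']: max(zip(e['value'], e['confidence']), key=lambda p: p[1])[0]
    for e in combined
}
```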