Using EasyWordEmbeddings, EasyStackedEmbeddings, and EasyDocumentEmbeddings in the AdaptNLP framework

Finding Available Models with Hubs

We can search for available models to use for embeddings with the HFModelHub and FlairModelHub. We'll see an example below:

from adaptnlp import (
    EasyWordEmbeddings, 
    EasyStackedEmbeddings, 
    EasyDocumentEmbeddings, 
    HFModelHub, 
    FlairModelHub, 
    DetailLevel
)
hub = HFModelHub()
models = hub.search_model_by_name('gpt2'); models
[Model Name: distilgpt2, Tasks: [text-generation],
 Model Name: gpt2-large, Tasks: [text-generation],
 Model Name: gpt2-medium, Tasks: [text-generation],
 Model Name: gpt2-xl, Tasks: [text-generation],
 Model Name: gpt2, Tasks: [text-generation]]
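
FlairModelHub can be searched in the same way. Here is a quick sketch (assuming it shares the search_model_by_name interface; the 'ner' query is just an illustration):

flair_hub = FlairModelHub()
# Search Flair's model zoo by name, e.g. for NER-related models
flair_models = flair_hub.search_model_by_name('ner')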

For this tutorial we'll use the gpt2 base model:

model = models[-1]; model
Model Name: gpt2, Tasks: [text-generation]

Producing Embeddings using EasyWordEmbeddings

First we'll use some basic example text:

example_text = "This is Albert.  My last name is Einstein.  I like physics and atoms."

And then instantiate our embeddings tagger:

embeddings = EasyWordEmbeddings()

Now let's run the gpt2 model we grabbed earlier to generate an EmbeddingResult:

res = embeddings.embed_text(example_text, model_name_or_path=model)

The result is a set of filtered outputs at your disposal. The default level of information (DetailLevel.Low) returns an ordered dictionary with the following keys:

  • inputs, a list containing your original sentence
  • sentence_embeddings, any sentence embeddings you may have (if applicable), as an OrderedDict mapping each sentence to its embedding
  • token_embeddings, a similar OrderedDict where key 0 holds the embedding of the first word, 1 the second, and so forth:
res['inputs']
['This is Albert.  My last name is Einstein.  I like physics and atoms.']

To grab our sentence or token embeddings, simply look them up by key:

res['token_embeddings'][0].shape
torch.Size([768])
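
Since token_embeddings is an OrderedDict keyed by word position, we can also loop over it to inspect every token vector. A minimal sketch, based on the keys described above:

# Print each token index alongside the size of its embedding vector
for idx, emb in res['token_embeddings'].items():
    print(idx, emb.shape)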

Using a different model is extremely easy. Let's try BERT embeddings with the bert-base-cased model instead.

Rather than passing in an HFModelResult or FlairModelResult, we can also just pass in the model's name as a raw string:

res = embeddings.embed_text(example_text, model_name_or_path='bert-base-cased')

Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Just like in the last example, we can look at the embeddings:

res['token_embeddings'][0].shape
torch.Size([768])

We can also convert our output to an easy-to-use dictionary, which can carry a bit more information. First, let's leave our results unfiltered by passing in detail_level=None:

res = embeddings.embed_text(example_text,
                            model_name_or_path='bert-base-cased',
                            detail_level=None)
res
EmbeddingResult: {
	Inputs: ['This is Albert.  My last name is Einstein.  I like physics and atoms.']
	Token Embeddings Shapes: [torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768]), torch.Size([768])]
	Sentence Embeddings Shapes: [torch.Size([0])]
}

We can see that the result is now an EmbeddingResult, which exposes all the information we accessed by key as attributes:

res.inputs
['This is Albert.  My last name is Einstein.  I like physics and atoms.']

If we want to filter the object ourselves and convert it to a dictionary, we can use the to_dict() function:

o = res.to_dict()
print(o['inputs'], o['token_embeddings'][0].shape)
['This is Albert.  My last name is Einstein.  I like physics and atoms.'] torch.Size([768])

You can specify the level of detail you want by passing "low", "medium", or "high" to the to_dict method, or by using the convenience DetailLevel class:

res_dict = res.to_dict(DetailLevel.Medium)
print(res_dict['inputs'], res_dict['token_embeddings'][0].shape)
['This is Albert.  My last name is Einstein.  I like physics and atoms.'] torch.Size([768])

Each level returns progressively more data from the outputs (see the sketch after this list):

  • Available at all levels:
    • original_sentence: The original sentence
    • tokenized_sentence: The tokenized sentence
    • sentence_embeddings: Embeddings from the actual sentence (if available)
    • token_embeddings: Concatenated embeddings from all the tokens passed
  • DetailLevel.Low (or 'low'):
    • Returns information available at all levels
  • DetailLevel.Medium (or 'medium'):
    • Everything from DetailLevel.Low
    • For each token a dictionary of the embeddings and word index is added
  • DetailLevel.High (or 'high'):
    • Everything from DetailLevel.Medium
    • This will also include the original Flair Sentence result from the model
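
As a rough sketch of the difference, we can convert the same EmbeddingResult at each level and compare how much it carries (the exact keys beyond those listed above may vary by model and version):

# Convert the unfiltered result at each detail level and compare the key counts
for level in (DetailLevel.Low, DetailLevel.Medium, DetailLevel.High):
    d = res.to_dict(level)
    print(level, '->', len(d.keys()), 'keys')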

Let's look at a final example with RoBERTa embeddings:

res = embeddings.embed_text(example_text, model_name_or_path="roberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

And our generated embeddings:

Original text: ['This is Albert.  My last name is Einstein.  I like physics and atoms.']
Model: roberta-base
Embedding: torch.Size([768])
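
Once you have token vectors, they behave like any other tensors. For example, a quick plain-PyTorch sketch comparing the first two tokens with cosine similarity (illustrative only):

import torch.nn.functional as F

first = res['token_embeddings'][0]
second = res['token_embeddings'][1]
# Cosine similarity expects batched inputs, so add a leading dimension
print(F.cosine_similarity(first.unsqueeze(0), second.unsqueeze(0)).item())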

Producing Stacked Embeddings with EasyStackedEmbeddings

EasyStackedEmbeddings lets you stack a variable number of language models to produce the word embeddings shown above. For our example we'll combine the bert-base-cased and distilbert-base-cased models.

First we'll instantiate our EasyStackedEmbeddings:

embeddings = EasyStackedEmbeddings("bert-base-cased", "distilbert-base-cased")
May need a couple moments to instantiate...
Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Some weights of the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing DistilBertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

And then generate our stacked word embeddings with the embed_text method:

res = embeddings.embed_text(example_text)

We can see our results below:

Original text: ['This is Albert.  My last name is Einstein.  I like physics and atoms.']
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107])
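
Each stacked token embedding is the concatenation of the two models' 768-dimensional vectors, so we would expect 1536-dimensional tokens. A sketch (the shape in the comment is the expected value, not captured output):

res['token_embeddings'][0].shape  # expected: torch.Size([1536]), i.e. 768 + 768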

Document Embeddings with EasyDocumentEmbeddings

Similar to EasyStackedEmbeddings, EasyDocumentEmbeddings allows you to pool the embeddings from multiple models together with embed_pool and embed_rnn.

We'll use our bert-base-cased and distilbert-base-cased models again:

embeddings = EasyDocumentEmbeddings("bert-base-cased", "distilbert-base-cased")
May need a couple moments to instantiate...
Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing DistilBertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Pooled embedding loaded
RNN embeddings loaded

This time we will use the embed_pool method to generate DocumentPoolEmbeddings, which average all of the word embeddings in a sentence:

res = embeddings.embed_pool(example_text)

As a result, rather than having embeddings per token, we have a single embedding for the whole document:

res['inputs']
['This is Albert.  My last name is Einstein.  I like physics and atoms.']
res['token_embeddings'][0]
tensor([-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107])
Original text: ['This is Albert.  My last name is Einstein.  I like physics and atoms.']
Models: bert-base-cased, distilbert-base-cased
Embedding: torch.Size([1536])
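
To make the averaging concrete, here is a minimal plain-PyTorch sketch of mean pooling. This is illustrative only, not AdaptNLP's implementation; the token count and values are made up:

import torch

# 16 tokens, each a 1536-dimensional stacked embedding (bert + distilbert)
token_embeddings = torch.randn(16, 1536)

# The pooled document embedding is the average over the token dimension
document_embedding = token_embeddings.mean(dim=0)
print(document_embedding.shape)  # torch.Size([1536])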

We can also generate DocumentRNNEmbeddings. Document RNN embeddings run an RNN over all the words in the sentence and use the RNN's final state as the document embedding.

First we'll call embed_rnn:

sentences = embeddings.embed_rnn(example_text)

And then look at our generated embeddings:

Original text: ['This is Albert.  My last name is Einstein.  I like physics and atoms.']
Models: bert-base-cased, distilbert-base-cased
Embedding: torch.Size([1536])
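
For intuition, here is a minimal plain-PyTorch sketch of the idea behind DocumentRNNEmbeddings: run an RNN over the token embeddings and keep its final hidden state. This is illustrative only; Flair's actual implementation differs in details such as reprojection layers and hyperparameters:

import torch
import torch.nn as nn

# A batch of one sentence with 16 tokens, each a 1536-dimensional stacked embedding
token_embeddings = torch.randn(1, 16, 1536)  # (batch, seq_len, features)

# Run a GRU over the token sequence; its final hidden state becomes the document embedding
rnn = nn.GRU(input_size=1536, hidden_size=1536, batch_first=True)
_, final_hidden = rnn(token_embeddings)
document_embedding = final_hidden.squeeze(0).squeeze(0)
print(document_embedding.shape)  # torch.Size([1536])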