Using EasyWord, Stacked, and Document Embeddings in the AdaptNLP framework
 

Finding Available Models with Hubs

We can search for models to use for generating embeddings with the HFModelHub and FlairModelHub. We'll see an example below:

from adaptnlp import EasyWordEmbeddings, EasyStackedEmbeddings, EasyDocumentEmbeddings
from adaptnlp.model_hub import HFModelHub, FlairModelHub
hub = HFModelHub()
models = hub.search_model_by_name('gpt2'); models
[Model Name: distilgpt2, Tasks: [text-generation],
 Model Name: gpt2-large, Tasks: [text-generation],
 Model Name: gpt2-medium, Tasks: [text-generation],
 Model Name: gpt2-xl, Tasks: [text-generation],
 Model Name: gpt2, Tasks: [text-generation]]
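
FlairModelHub (imported above) can be searched the same way. A minimal sketch, assuming its interface mirrors HFModelHub's search_model_by_name (not run here):

flair_hub = FlairModelHub()
# Search the Flair-compatible models by name, e.g. the NER taggers
flair_models = flair_hub.search_model_by_name('ner'); flair_models[:5]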

For this tutorial we'll use the gpt2 base model:

model = models[-1]; model
Model Name: gpt2, Tasks: [text-generation]

Producing Embeddings using EasyWordEmbeddings

First we'll use some basic example text:

example_text = "This is Albert.  My last name is Einstein.  I like physics and atoms."

And then instantiate EasyWordEmbeddings:

embeddings = EasyWordEmbeddings()

Now let's run the gpt2 model we grabbed earlier to generate some flair Sentence objects:

sentences = embeddings.embed_text(example_text, model_name_or_path=model)
Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

(The warning above is expected here: we're only extracting embeddings from the pretrained model, not training it for a downstream task.) These flair Sentences hold the embeddings inside each of their tokens. To access them we index into a specific sentence, pick a token, and call .get_embedding(). For instance, below is the embedding representation of "Albert":

token = sentences[0][2]
print(f'Original text: {token.text}')
print('Model: gpt2')
print(f'Embedding: {token.get_embedding()[:10]}')
Original text: Albert
Model: gpt2
Embedding: tensor([-3.9810, -0.5063, -2.2954, -1.3400,  0.1948, -0.7453,  1.4224,  0.2852,
         0.5815,  0.7180], device='cuda:0')
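
Since every token in the Sentence carries its own embedding, we can also loop over the whole sentence rather than indexing a single token. A short sketch using the standard flair Token API:

# Print each token alongside the size of its embedding vector
for token in sentences[0]:
    print(f'{token.text:>10} -> embedding of shape {tuple(token.get_embedding().shape)}')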

Using different models is extremely easy to do. Let's try using BERT embeddings with the bert-base-cased model instead.

Rather than passing in an HFModelResult or FlairModelResult, we can also just pass in the raw string name of the model:

sentences = embeddings.embed_text(example_text, model_name_or_path='bert-base-cased')

Just like in the last example, we can look at the embeddings:

token = sentences[0][2]
print(f'Original text: {token.text}')
print('Model: bert-base-cased')
print(f'Embedding: {token.get_embedding()[:10]}')
Original text: Albert
Model: bert-base-cased
Embedding: tensor([-0.0846, -0.2399,  0.2524, -0.4409, -0.2508, -0.6320, -0.1890,  0.2085,
        -0.8265, -0.7632], device='cuda:0')

Let's look at a final example with RoBERTa embeddings:

sentences = embeddings.embed_text(example_text, model_name_or_path="roberta-base")

And our generated embeddings:

token = sentences[0][2]
print(f'Original text: {token.text}')
print(f'Model: roberta-base')
print(f'Embedding: {token.get_embedding()[:10]}')
Original text: Albert
Model: roberta-base
Embedding: tensor([ 0.1772,  0.0369, -0.0483,  0.2290, -0.4860,  0.3483,  0.2176, -0.0787,
        -0.2275, -0.4035], device='cuda:0')
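
Because embed_text accepts a plain string, switching between models can be reduced to a loop. A hedged sketch reusing the same EasyWordEmbeddings instance with the models from this tutorial:

# Compare the first few embedding values of "Albert" across models
for name in ['gpt2', 'bert-base-cased', 'roberta-base']:
    sents = embeddings.embed_text(example_text, model_name_or_path=name)
    albert = sents[0][2]
    print(f'{name}: {albert.get_embedding()[:5]}')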

Producing Stacked Embeddings with EasyStackedEmbeddings

EasyStackedEmbeddings lets you use a variable number of language models to produce the word embeddings shown above, concatenating each model's vector for every token. For our example we'll combine the bert-base-cased and distilbert-base-cased models.

First we'll instantiate our EasyStackedEmbeddings:

embeddings = EasyStackedEmbeddings("bert-base-cased", "distilbert-base-cased")
May need a couple moments to instantiate...

And then generate our stacked word embeddings through our embed_text function:

sentences = embeddings.embed_text(example_text)

We can see our results below:

token = sentences[0][2]
print(f'Original text: {token.text}')
print(f'Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {token.get_embedding()[:10]}')
Original text: Albert
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([-0.0846, -0.2399,  0.2524, -0.4409, -0.2508, -0.6320, -0.1890,  0.2085,
        -0.8265, -0.7632], device='cuda:0')
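
Stacked embeddings are concatenations of each model's vector, which is why the first ten values above match the bert-base-cased embedding from earlier. The full vector length should therefore be the sum of the per-model embedding sizes; a quick sanity check:

# The length should be the sum of the two models' embedding sizes
# (e.g. 768 + 768 = 1536 if each contributes its final hidden layer)
print(token.get_embedding().shape)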

Document Embeddings with EasyDocumentEmbeddings

Similar to EasyStackedEmbeddings, EasyDocumentEmbeddings lets you combine embeddings from multiple models, but it produces a single embedding per document through embed_pool and embed_rnn.

We'll use our bert-base-cased and distilbert-base-cased models again:

embeddings = EasyDocumentEmbeddings("bert-base-cased", "distilbert-base-cased")
May need a couple moments to instantiate...
Pooled embedding loaded
RNN embeddings loaded

This time we will use the embed_pool method to generate DocumentPoolEmbeddings, which average all the word embeddings in a sentence:

sentences = embeddings.embed_pool(example_text)

As a result, rather than having embeddings by token, we have embeddings by document:

sentence = sentences[0]
print(f'Original text: {sentence.to_tokenized_string()}')
print(f'Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {sentence.get_embedding()[:10]}')
Original text: This is Albert . My last name is Einstein . I like physics and atoms .
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([-0.2397,  0.2154,  0.1053,  0.3809, -0.2323,  0.2913, -0.1869,  0.0963,
        -0.0407, -0.2648], device='cuda:0', grad_fn=<SliceBackward>)
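
One common use of document embeddings is comparing texts. A minimal sketch using PyTorch's cosine similarity on the pooled vectors (the second document below is made up purely for illustration):

import torch.nn.functional as F

# A hypothetical second document, pooled with the same models
other = embeddings.embed_pool("Marie Curie studied radioactivity.")[0]

# Cosine similarity between the two pooled document vectors
similarity = F.cosine_similarity(sentences[0].get_embedding().unsqueeze(0),
                                 other.get_embedding().unsqueeze(0))
print(similarity.item())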

We can also generate DocumentRNNEmbeddings, which run an RNN over all the words in the sentence and use the RNN's final state as the embedding.

First we'll call embed_rnn:

sentences = embeddings.embed_rnn(example_text)

And then look at our generated embeddings:

sentence = sentences[0]
print(f'Original text: {sentence.to_tokenized_string()}')
print(f'Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {sentence.get_embedding()[:10]}')
Original text: This is Albert . My last name is Einstein . I like physics and atoms .
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([ 0.5235, -0.2955, -0.3608,  0.4746, -0.0441, -0.2596,  0.5656,  0.0506,
        -0.2100,  0.0992], device='cuda:0', grad_fn=<SliceBackward>)
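
Note the grad_fn in the outputs above: these document embeddings are still attached to a computation graph. If you just want fixed feature vectors (for example, to feed a downstream classifier), a small sketch of detaching them:

# Detach from the graph and move to CPU to get a plain NumPy feature vector
features = sentence.get_embedding().detach().cpu().numpy()
print(features.shape)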