Finding Available Models with Hubs
We can search for available models to use for embeddings with the `HFModelHub` and `FlairModelHub`. We'll see an example below:
```python
from adaptnlp import (
    EasyWordEmbeddings,
    EasyStackedEmbeddings,
    EasyDocumentEmbeddings,
    HFModelHub,
    FlairModelHub,
    DetailLevel
)
```
```python
hub = HFModelHub()
models = hub.search_model_by_name('gpt2'); models
```
For this tutorial we'll use the `gpt2` base model:
```python
model = models[-1]; model
```
Producing Embeddings using `EasyWordEmbeddings`
First we'll use some basic example text:
```python
example_text = "This is Albert. My last name is Einstein. I like physics and atoms."
```
And then instantiate our embeddings tagger:
```python
embeddings = EasyWordEmbeddings()
```
Now let's run the `gpt2` model we grabbed earlier to generate some `EmbeddingResult` objects:
```python
res = embeddings.embed_text(example_text, model_name_or_path=model)
```
The result is a variety of filtered results at your disposal. The default level of information (`DetailLevel.Low`) will return an ordered dictionary with the keys:

- `inputs`, an array of your original sentences
- `sentence_embeddings`, any sentence embeddings you may have (if applicable), as an ordered dictionary of (sentence, embeddings)
- `token_embeddings`, an `OrderedDict` similar to `sentence_embeddings`, where the key `0` will be the embeddings of the first word, `1` the second, and so forth:
```python
res['inputs']
```
To grab our sentence or token embeddings, simply look them up by key (note: only `StackedEmbeddings` will have sentence embeddings):

```python
res['token_embeddings'][0].shape
```
Using different models is extremely easy. Let's try BERT embeddings with the `bert-base-cased` model instead.
Rather than passing in an `HFModelResult` or `FlairModelResult`, we can also pass in the raw string name of the model:
```python
res = embeddings.embed_text(example_text, model_name_or_path='bert-base-cased')
```
Just like in the last example, we can look at the embeddings in the same way:
```python
res['token_embeddings'][0].shape
```
We can also convert our output to an easy-to-use dictionary, which can carry a bit more information. First, let's leave the results unfiltered by passing in `detail_level=None`:
```python
res = embeddings.embed_text(example_text,
                            model_name_or_path='bert-base-cased',
                            detail_level=None)
res
```
We can see that the result is now an `EmbeddingResult`, which exposes all the information we keyed into earlier as attributes:
```python
res.inputs
```
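The other keys work the same way; for example (assuming the attributes mirror the dictionary keys described earlier):

```python
res.token_embeddings[0].shape
```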
If we want to filter the object ourselves and convert it to a dictionary, we can use the `to_dict()` function:
```python
o = res.to_dict()
print(o['inputs'], o['token_embeddings'][0].shape)
```
You can specify the level of detail wanted by passing in "low", "medium", or "high" to the `to_dict` method, or use the convenience `DetailLevel` class:
```python
res_dict = res.to_dict(DetailLevel.Medium)
print(res_dict['inputs'], res_dict['token_embeddings'][0].shape)
```
Each level returns more data from the outputs:

- Available at all levels:
  - `original_sentence`: The original sentence
  - `tokenized_sentence`: The tokenized sentence
  - `sentence_embeddings`: Embeddings from the actual sentence (if available)
  - `token_embeddings`: Concatenated embeddings from all the tokens passed
- `DetailLevel.Low` (or 'low'):
  - Returns the information available at all levels
- `DetailLevel.Medium` (or 'medium'):
  - Everything from `DetailLevel.Low`
  - For each token, a dictionary of the embeddings and word index is added
- `DetailLevel.High` (or 'high'):
  - Everything from `DetailLevel.Medium`
  - This will also include the original Flair `Sentence` result from the model
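As a quick way to see the difference, you can compare what each level returns; a small sketch, assuming `res` is the unfiltered `EmbeddingResult` from above:

```python
# Compare which keys each detail level exposes
for level in (DetailLevel.Low, DetailLevel.Medium, DetailLevel.High):
    print(level, list(res.to_dict(level).keys()))
```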
Let's look at a final example with RoBERTa embeddings:
res = embeddings.embed_text(example_text, model_name_or_path="roberta-base")
And our generated embeddings:
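As with the earlier models, we can look them up by key:

```python
res['inputs']
res['token_embeddings'][0].shape
```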
Producing Stacked Embeddings with `EasyStackedEmbeddings`
`EasyStackedEmbeddings` allows you to use a variable number of language models to produce the embeddings shown above. For our example we'll combine the `bert-base-cased` and `distilbert-base-cased` models.
First we'll instantiate our `EasyStackedEmbeddings`:
```python
embeddings = EasyStackedEmbeddings("bert-base-cased", "distilbert-base-cased")
```
And then generate our stacked word embeddings through the `embed_text` function:
```python
res = embeddings.embed_text(example_text)
```
We can see our results below:
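For example, using the same lookups as before (and, per the note earlier, stacked embeddings also carry sentence embeddings):

```python
res['inputs']
res['sentence_embeddings']        # present for stacked embeddings
res['token_embeddings'][0].shape  # concatenated embeddings from both models
```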
Document Embeddings with `EasyDocumentEmbeddings`
Similar to `EasyStackedEmbeddings`, `EasyDocumentEmbeddings` allows you to pool the embeddings from multiple models together with `embed_pool` and `embed_rnn`.
We'll use our `bert-base-cased` and `distilbert-base-cased` models again:
```python
embeddings = EasyDocumentEmbeddings("bert-base-cased", "distilbert-base-cased")
```
This time we will use the `embed_pool` method to generate `DocumentPoolEmbeddings`. These take an average over all the word embeddings in a sentence:
```python
res = embeddings.embed_pool(example_text)
```
As a result, rather than having embeddings by token, we have embeddings by document:
```python
res['inputs']
res['token_embeddings'][0]
```
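Conceptually, the pooled document embedding is just an average over the token embeddings. Here is a minimal, illustrative PyTorch sketch of that idea (toy tensors, not the library's internals):

```python
import torch

# Toy example: 5 token embeddings, each 768-dimensional
token_embs = torch.randn(5, 768)

# Mean pooling collapses the tokens into one document vector
doc_emb = token_embs.mean(dim=0)
print(doc_emb.shape)  # torch.Size([768])
```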
We can also generate `DocumentRNNEmbeddings`. Document RNN embeddings run an RNN over all the words in a sentence and use the final state of the RNN as the embedding.
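As a minimal sketch of that idea (a toy GRU in PyTorch, illustrative only, not the library's internals):

```python
import torch

rnn = torch.nn.GRU(input_size=768, hidden_size=256, batch_first=True)

# One sentence of 5 token embeddings, each 768-dimensional
token_embs = torch.randn(1, 5, 768)

# Run the RNN over the tokens and keep its final hidden state
_, final_state = rnn(token_embs)
doc_emb = final_state[-1, 0]
print(doc_emb.shape)  # torch.Size([256])
```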
First we'll call `embed_rnn`:
```python
sentences = embeddings.embed_rnn(example_text)
```
And then look at our generated embeddings:
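Assuming `embed_rnn` returns the same `EmbeddingResult` structure as the other methods, the same lookups apply:

```python
sentences['inputs']
sentences['token_embeddings'][0].shape
```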