Introduction
This tutorial walks through an end-to-end example of fine-tuning a Transformer for token classification on a custom dataset in CSV format.
By the end, you should be able to:
- Build a dataset with the TokenClassificationDatasets class, and its DataLoaders
- Build a TokenClassificationTuner quickly, find a good learning rate, and train with the One-Cycle Policy
- Save that model away, to be used with deployment or other HuggingFace libraries
- Apply inference using both the Tuner's available function as well as with the EasyTokenTagger class within AdaptNLP
This tutorial uses the latest AdaptNLP version, as well as parts of the fastai library. Please run the code below to install them:
!pip install adaptnlp -U
(or pip3)
First we need a dataset. We will use the HuggingFace datasets library to download the conll2003 dataset and convert it to a CSV file. This may seem counterintuitive, but it works for demonstration purposes. In practice you would use a custom CSV file.
CoNLL 2003 is a named entity recognition (NER) dataset which contains the following named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups. It follows the IOB2 tagging scheme.
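To make the IOB2 scheme concrete, here is a minimal sketch that turns a tag sequence into entity spans. The helper name iob2_to_spans is hypothetical and not part of AdaptNLP or the datasets library; the sentence is written in the style of the dataset.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (entity_type, start, end) spans from IOB2 tags; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # In IOB2 every entity opens with a "B-" tag.
        # A stray "I-" tag without an opening "B-" is ignored in this sketch.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]

spans = iob2_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
print(entities)
```

Each entity starts at a B- tag and extends over the following I- tags of the same type, which is why two adjacent entities of the same type remain distinguishable.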
from datasets import load_dataset
dsets = load_dataset('conll2003')
For the purpose of this example, we'll take the train subset and convert it to a CSV file. The tokens and ner_tags columns of the CSV file will serve as the tokens and labels for our dataset.
dset = dsets['train']
# going through pandas to ensure data types are correct for the csv
dset.set_format(type='pandas')
df = dset[:]
df = df[['tokens', 'ner_tags']]
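# cast to built-in types so each cell is written as a plain Python list literal
df['ner_tags'] = df['ner_tags'].apply(lambda x: [int(tag) for tag in x])
df['tokens'] = df['tokens'].apply(lambda x: [str(token) for token in x])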
df.to_csv('/tmp/conll2003.csv', index=False)
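A CSV cell can only hold text, so list-valued columns are stored as their string form and need to be parsed when the file is read back. The sketch below illustrates this with the standard library csv module rather than pandas (an assumption made to keep the example dependency-free); the sample row is made up.

```python
import csv
import io
from ast import literal_eval

# Write one row whose cells are Python lists; csv.writer stringifies them with str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tokens", "ner_tags"])
writer.writerow([["EU", "rejects"], [3, 0]])

# Read the row back: every cell is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_tags = row["ner_tags"]

# literal_eval safely parses the string back into a Python list.
parsed_tags = literal_eval(raw_tags)
parsed_tokens = literal_eval(row["tokens"])

print(type(raw_tags).__name__, type(parsed_tags).__name__)
```

This is the reason literal_eval shows up as a converter when the CSV is loaded later in the tutorial.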
Now that we have the dataset, and we know the format it is in, let's pick a viable model to train with.
AdaptNLP has a HFModelHub class that allows you to communicate with the HuggingFace Hub and pick a model from it, as well as a namespace HF_TASKS class with a list of valid tasks we can search by.
Let's try and find one suitable for token classification.
First we need to import the class and generate an instance of it:
from adaptnlp import HFModelHub, HF_TASKS
hub = HFModelHub()
Next we can search for a model:
models = hub.search_model_by_task(HF_TASKS.TOKEN_CLASSIFICATION)
Let's look at a few:
models[:10]
These are models specifically tagged with the token-classification tag, so you may not see a few models you would expect, such as bert-base-cased.
Since the listed models have already been fine-tuned (some on the CoNLL 2003 dataset itself), let's instead choose a basic pre-trained model, distilbert-base-uncased:
model_name = 'distilbert-base-uncased'
In general, you don't need to go through the HFModelHub if you already know which model you'd like to use. You can always just pass in the string name of a model, such as "bert-base-cased".
Building TaskDatasets with TokenClassificationDatasets
Each task has a high-level data wrapper around the TaskDatasets class. In our case this is the TokenClassificationDatasets class:
from adaptnlp import TokenClassificationDatasets
There are multiple constructors for the TokenClassificationDatasets class, and you should never call the main constructor directly.
We will be using from_csvs:
Anything you would normally pass to the tokenizer call (such as max_length, padding) should go in tokenize_kwargs, and anything going to the AutoTokenizer.from_pretrained constructor should be passed to auto_kwargs.
Important: Because our dataset is already tokenized, when we try to encode the tokens, we may end up with sub-tokens. This will cause our labels to no longer align with the number of tokens. In order to take this into account, the following arguments should be passed to the tokenizer:
tokenize_kwargs = {
    'truncation': True,
    'is_split_into_words': True,
    'padding': 'max_length',
    'return_offsets_mapping': True
}
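To illustrate the alignment problem, here is a minimal sketch of one common strategy: label only the first sub-token of each word and mask the rest. It assumes a word_ids list like the one HuggingFace fast tokenizers expose, where None marks special or padding tokens. The helper align_labels, the example word_ids, and the choice of -100 as the masked value are assumptions for illustration; AdaptNLP's own alignment may work differently (for example via the offsets mapping).

```python
from typing import List, Optional

# Assumption: -100 is the label value the loss skips
# (it is the default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to one label per sub-token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) and padding carry no label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining sub-tokens of the same word are masked out.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical tokenization of the words ["Novetta", "is", "in", "McLean"]:
# "Novetta" is split into 3 sub-tokens and "McLean" into 2.
word_ids = [None, 0, 0, 0, 1, 2, 3, 3, None, None]
word_labels = [3, 0, 0, 5]  # B-ORG, O, O, B-LOC

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

After alignment there is exactly one entry per sub-token, so the label sequence has the same length as the encoded input.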
We will also need to provide a mapping between the labels and the entities:
entity_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}
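As a small illustration of how this mapping is used, the sketch below decodes integer labels into tag names and builds the inverse lookup. The names label2id and decode are hypothetical helpers, not AdaptNLP functions.

```python
entity_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}

# Inverse lookup: tag name -> integer label.
label2id = {label: idx for idx, label in entity_mapping.items()}


def decode(label_ids):
    """Translate a sequence of integer labels into IOB2 tag names."""
    return [entity_mapping[i] for i in label_ids]


print(decode([3, 0, 7, 0]))
```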
In our case we only have a train_csv, so we should specify what percent to split into train and validation sets.
from ast import literal_eval
dsets = TokenClassificationDatasets.from_csvs(
    '/tmp/conll2003.csv',
    'tokens',
    'ner_tags',
    entity_mapping,
    tokenizer_name=model_name,
    tokenize=True,
    tokenize_kwargs=tokenize_kwargs,
    split_pct=.2,
    converters={'tokens': literal_eval, 'ner_tags': literal_eval}  # kwarg to pd.read_csv
)
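Conceptually, a percentage split just shuffles the row indices and holds out a fraction of them. The sketch below assumes split_pct is the fraction reserved for validation; the helper random_split and its seed parameter are hypothetical, and AdaptNLP's actual splitting logic may differ.

```python
import random
from typing import List, Tuple


def random_split(n_items: int, split_pct: float, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_indices, valid_indices), holding out split_pct of the rows."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_valid = int(n_items * split_pct)
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = random_split(n_items=10, split_pct=0.2)
print(len(train_idx), len(valid_idx))
```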
Note: If you have separate training and validation CSVs, simply pass in the validation CSV as valid_csv=validation_csv and do not pass in any split_func or split_pct. Everything else is exactly the same.
And finally turn it into some AdaptiveDataLoaders.
These are just fastai's DataLoaders class, with a few functions overridden so that it works nicely with HuggingFace's Dataset class.
dls = dsets.dataloaders(batch_size=8)
Finally, let's view a batch of data with the show_batch function:
dls.show_batch()
Next we need to build a compatible Tuner for our problem. These tuners contain good defaults for our problem space, including loss functions and metrics.
First let's import the TokenClassificationTuner and view its documentation:
from adaptnlp import TokenClassificationTuner
Next we'll pass in our DataLoaders and the name of our model.
Note: If you are not using the data API (TaskDatasets, SequenceClassificationDatasets, etc), you need to pass in the tokenizer to the constructor as well with tokenizer=tokenizer.
tuner = TokenClassificationTuner(dls, model_name)
By default we can see that it used CrossEntropyLoss as our loss function, and both accuracy and F1 as our metrics:
tuner.loss_func
for m in tuner.metrics:
    print(m.name)
Important: By default, the TokenClassificationTuner class does not use fastai metrics (unlike the other Tuner classes). Instead it uses HuggingFace's seqeval metric to compute accuracy, precision, recall, and/or F1 scores based on the requirements of sequence labeling, where precision, recall, and F1 are scored per entity rather than per token. As a result, you will need to have seqeval installed in order to use the TokenClassificationTuner.
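To show why entity-level scoring matters, here is a simplified pure-Python sketch of entity-level precision, recall, and F1, compared against plain token accuracy. It is not seqeval's implementation: the helpers spans and entity_prf are hypothetical, only exact span matches count, and stray I- tags without an opening B- are ignored, which may differ from seqeval's handling.

```python
from typing import List, Set, Tuple


def spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Extract (type, start, end) entity spans from IOB2 tags; end is exclusive."""
    found, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            found.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return found


def entity_prf(true_tags: List[str], pred_tags: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where a prediction counts only if the whole span matches."""
    true_spans, pred_spans = spans(true_tags), spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["B-PER", "I-PER", "O", "B-LOC", "O"]
y_pred = ["B-PER", "O", "O", "B-LOC", "O"]  # the person entity is cut short

precision, recall, f1 = entity_prf(y_true, y_pred)
token_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, token_accuracy)
```

In this made-up example four of five tokens are tagged correctly, yet only one of the two entities is recovered exactly, so the entity-level scores are noticeably lower than the token accuracy.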
In this tutorial, we will show how to use the metrics built into TokenClassificationTuner. Valid metrics can be found in the NERMetric namespace.
While Accuracy and F1 are already defaults, we will specify all the available built-in metrics for clarity.
from adaptnlp import NERMetric
tuner = TokenClassificationTuner(
    dls,
    model_name,
    metrics=[NERMetric.Accuracy, NERMetric.Precision, NERMetric.Recall, NERMetric.F1],
)
Finally, all that's left is to train our model with tune. There are only 4 or 5 functions you can call on our tuner currently, and this is by design to keep it simple. If you don't want to be boxed in, however, pass expose_fastai_api=True to our earlier call; it will expose the entirety of Learner to you, so you can call fit_one_cycle, lr_find, and everything else, as Tuner uses fastai under the hood.
First, let's call lr_find, which uses fastai's Learning Rate Finder to help us pick a learning rate.
lr = tuner.lr_find()
It recommends a learning rate of around 1e-4, so we will use that.
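The idea behind a learning rate finder can be sketched in a few lines: sweep the learning rate exponentially over a short run, record the loss at each step, and pick the rate where the loss falls fastest. The code below is a simplified illustration, not fastai's implementation (which also smooths the losses and offers several suggestion heuristics); the helper names are hypothetical and the loss values are invented for the example.

```python
import math
from typing import List


def lr_sweep(start_lr: float, end_lr: float, num_it: int) -> List[float]:
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]


def steepest_lr(lrs: List[float], losses: List[float]) -> float:
    """Learning rate at which the loss drops fastest per unit of log10(lr)."""
    best_i, best_slope = 1, float("inf")
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_i, best_slope = i, slope
    return lrs[best_i]


# Nine rates: 1e-7, 1e-6, ..., 1e1.
lrs = lr_sweep(1e-7, 10.0, 9)
# Invented losses: flat at tiny rates, falling in the middle, diverging at large rates.
losses = [2.3, 2.3, 2.2, 1.6, 1.2, 1.1, 1.5, 4.0, 9.0]

suggested = steepest_lr(lrs, losses)
print(f"{suggested:.0e}")
```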
Let's look at the documentation for tune:
We can pass in a number of epochs, a learning rate, a strategy, and additional fastai callbacks to call.
Valid strategies live in the Strategy namespace class, and consist of:
- OneCycle (Also called the One-Cycle Policy)
- CosineAnnealing
- SGDR
from adaptnlp import Strategy
In this tutorial we will train with the One-Cycle policy, as currently it is one of the best schedulers to use.
tuner.tune(3, lr, strategy=Strategy.OneCycle)
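For intuition about what the One-Cycle policy does to the learning rate, here is a minimal sketch of the schedule: a cosine warm-up to the peak rate followed by a cosine decay to a very small rate. The default values for div, div_final, and pct_start mirror what fastai's fit_one_cycle is understood to use and should be treated as assumptions; the helper names are hypothetical, and the real schedule also varies momentum, which is omitted here.

```python
import math


def cos_anneal(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))


def one_cycle_lr(pct: float, lr_max: float, div: float = 25.0,
                 div_final: float = 1e5, pct_start: float = 0.25) -> float:
    """Learning rate at training progress pct in [0, 1]."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: decay from lr_max to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


lr_max = 1e-4
schedule = [one_cycle_lr(i / 100, lr_max) for i in range(101)]
print(f"start={schedule[0]:.1e} peak={max(schedule):.1e} end={schedule[-1]:.1e}")
```

The short warm-up lets training start gently, while the long decay lets the model settle at progressively smaller step sizes.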
Now that we have a trained model, let's save those weights away.
Calling tuner.save will save both the model and the tokenizer in the same format that HuggingFace uses:
tuner.save('good_model')
There are two ways to get predictions. The first is with the .predict method in our tuner. This is useful if you have just finished training and want to see how your model performs on some new data!
The other method is with AdaptNLP's inference API, which we will show afterwards.
First let's write a sentence to test with:
sentence = "The company Novetta is based in McLean, Virginia."
And then predict with it:
tuner.predict(sentence)
With the Inference API
Next we will use the EasyTokenTagger class, which AdaptNLP offers:
from adaptnlp import EasyTokenTagger
We simply construct the class:
classifier = EasyTokenTagger()
And call the tag_text method, passing in the sentence and the location of our saved model:
classifier.tag_text(
    sentence,
    model_name_or_path='good_model',
)
And we got the exact same output and probabilities!
There are also different levels of predictions we can return (the same is true of our earlier predict call).
These live in a namespace DetailLevel class, with a few examples below:
from adaptnlp import DetailLevel
DetailLevel.Low
While some Easy modules will not return different items at each level, most will return only a few specific outputs at the Low level, and everything possible at the High level:
classifier.tag_text(
    sentence,
    model_name_or_path='good_model',
    detail_level=DetailLevel.Low
)
classifier.tag_text(
    sentence,
    model_name_or_path='good_model',
    detail_level=DetailLevel.Medium
)
classifier.tag_text(
    sentence,
    model_name_or_path='good_model',
    detail_level=DetailLevel.High
)
Below is all the code used in this notebook in a single cell, for quick copy and paste:
from datasets import load_dataset
from adaptnlp import TokenClassificationDatasets
from adaptnlp import TokenClassificationTuner
from adaptnlp import Strategy
from adaptnlp import NERMetric
from ast import literal_eval
dsets = load_dataset('conll2003')
dset = dsets['train']
# going through pandas to ensure data types are correct for the csv
dset.set_format(type='pandas')
df = dset[:]
df = df[['tokens', 'ner_tags']]
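# cast to built-in types so each cell is written as a plain Python list literal
df['ner_tags'] = df['ner_tags'].apply(lambda x: [int(tag) for tag in x])
df['tokens'] = df['tokens'].apply(lambda x: [str(token) for token in x])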
df.to_csv('/tmp/conll2003.csv', index=False)
model_name = 'distilbert-base-uncased'
tokenize_kwargs = {
    'truncation': True,
    'is_split_into_words': True,
    'padding': 'max_length',
    'return_offsets_mapping': True
}
entity_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}
dsets = TokenClassificationDatasets.from_csvs(
    '/tmp/conll2003.csv',
    'tokens',
    'ner_tags',
    entity_mapping,
    tokenizer_name=model_name,
    tokenize=True,
    tokenize_kwargs=tokenize_kwargs,
    split_pct=.2,
    converters={'tokens': literal_eval, 'ner_tags': literal_eval}  # kwarg to pd.read_csv
)
dls = dsets.dataloaders(batch_size=8)
# build the tuner once, with all built-in metrics
tuner = TokenClassificationTuner(
    dls,
    model_name,
    metrics=[NERMetric.Accuracy, NERMetric.Precision, NERMetric.Recall, NERMetric.F1],
)
lr = tuner.lr_find()
tuner.tune(3, lr, strategy=Strategy.OneCycle)
tuner.save('good_model')