Introduction
In this tutorial we will show an end-to-end example of fine-tuning a Transformer for sequence classification on a custom dataset in HuggingFace `Dataset` format.
By the end of this you should be able to:
- Build a dataset with the `TaskDatasets` class, and its `DataLoaders`
- Build a `SequenceClassificationTuner` quickly, find a good learning rate, and train with the One-Cycle Policy
- Save that model away, to be used with deployment or other HuggingFace libraries
- Apply inference using both the `Tuner`'s available function as well as the `EasySequenceClassifier` class within AdaptNLP
This tutorial utilizes the latest AdaptNLP version, as well as parts of the fastai library. Please run the below code to install them:
!pip install adaptnlp -U
(or `pip3`)
First we need a dataset. We will use the `datasets` library's `load_dataset` function to quickly generate a raw dataset straight from HuggingFace:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
We now have a raw `datasets` dataset (a `DatasetDict`), which we can index into:
raw_datasets['train'][0]
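Each item is a plain dictionary. To confirm the fields we will work with later (the GLUE MRPC columns are `sentence1`, `sentence2`, `label`, and `idx`), we can also print the column names of the split:
# List the column names of the training split
print(raw_datasets['train'].column_names)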
Now that we have the data downloaded, let's decide on a model to use.
AdaptNLP has a `HFModelHub` class that allows you to communicate with the HuggingFace Hub and pick a model from it, as well as a namespace `HF_TASKS` class with a list of valid tasks we can search by.
Let's try and find one suitable for sequence classification.
First we need to import the class and generate an instance of it:
from adaptnlp import HFModelHub, HF_TASKS
hub = HFModelHub()
Next we can search for a model:
models = hub.search_model_by_task(HF_TASKS.TEXT_CLASSIFICATION)
Let's look at a few:
models[:10]
These are models specifically tagged with the text-classification tag, so you may not see a few models you would expect, such as `bert-base-uncased`.
Let's search for that one for this problem:
models = hub.search_model_by_name('bert-base-uncased', user_uploaded=True)
models[:5]
We want the first one.
model = models[0]
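The returned result carries the model's identifier on the Hub as `model.name`, which is what we will hand to the data and tuner APIs below. A quick check:
# Print the Hub identifier of the chosen model (should be something like 'bert-base-uncased')
print(model.name)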
Now that we have picked a model, let's use the data API to prepare our data.
Building TaskDatasets
All of the task-specific high-level data APIs (such as `SequenceClassificationDatasets`) wrap around the `TaskDatasets` class, which is a small wrapper around `datasets`' highly efficient `Dataset` class.
This integration is valuable because it provides a fast and memory-efficient way to use large datasets with minimal effort.
First let's import the class:
from adaptnlp import TaskDatasets
The `TaskDatasets` class has no class constructors outside the normal one; the reason for this is that it takes in raw `Datasets` and other tokenizer arguments to build from.
Anything you would normally pass to the tokenizer call (such as `max_length` or `padding`) should go in `tokenize_kwargs`, and anything going to the `AutoTokenizer.from_pretrained` constructor should be passed in `auto_kwargs`.
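As a minimal sketch of how the two dictionaries split up (we define the actual `tokenize_kwargs` we use a bit further down; the `use_fast` flag here is only an illustrative `AutoTokenizer.from_pretrained` argument, not something this tutorial requires):
# Arguments forwarded to every tokenizer call
tokenize_kwargs = {'max_length': 64, 'padding': True}
# Arguments forwarded to AutoTokenizer.from_pretrained when the tokenizer is built
auto_kwargs = {'use_fast': True}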
Custom Tokenization Function and Finishing our TaskDatasets
You may notice there is an extra step here: we need to pass in a `tokenize_func`. In the other tutorials we used a very basic tokenizing function, and this has a default for that as well.
However, given our dataset, we need to implement our own tokenization function.
To do so, your function must take in an `item`, a `tokenizer`, and `tokenize_kwargs`. Note that you do not have to pass any of these yourself: they are all attributes the `TaskDatasets` has access to, and they will be passed to this function implicitly.
What you need to declare is how you want the tokenizer applied.
In our case we have two separate sentences that need to be tokenized at once. These texts live in the dictionary we saw earlier under the keys `sentence1` and `sentence2`.
Let's write that function:
def tok_func(
    item, # A single item in the dataset
    tokenizer, # The implicit tokenizer that `TaskDatasets` has access to
    tokenize_kwargs, # Key word arguments passed into the constructor of `TaskDatasets`
):
    "A basic tokenization function for two items"
    return tokenizer(
        item['sentence1'],
        item['sentence2'],
        **tokenize_kwargs
    )
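To see what this function will produce, here is a small standalone sketch that calls a HuggingFace tokenizer on a sentence pair directly (the tokenizer name and the example strings are placeholders for illustration only):
from transformers import AutoTokenizer
# Illustrative only: load a tokenizer by name and tokenize a sentence pair
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
out = tok('The first sentence .', 'The second sentence .', max_length=64, padding=True)
# For a BERT-style tokenizer this returns input_ids, token_type_ids, and attention_mask
print(out.keys())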
Along with building our own tokenize function, we need to tell `Datasets` what columns to drop when we pull an item from our dataset. These are synonymous with `datasets`' `remove_columns`.
In our problem this includes the `sentence1`, `sentence2`, and `idx` keys, as our tokenized input gets put into a `text` key:
remove_cols = ['sentence1', 'sentence2', 'idx']
Finally we'll declare some arguments for our tokenize function, specifically ensuring our max length is reasonable and that we should pad our samples to that length:
tokenize_kwargs = {'max_length':64, 'padding':True}
Let's build our `TaskDatasets` now, passing in everything we built:
dsets = TaskDatasets(
train_dset = raw_datasets['train'], # Our training `Dataset`
valid_dset = raw_datasets['validation'], # Our validation `Dataset`
tokenizer_name = model.name, # The name of our model
tokenize_kwargs = tokenize_kwargs, # The tokenizer kwargs
tokenize_func = tok_func, # The tokenization function
remove_cols = remove_cols # The columns to remove after tokenizing our input
)
You may be wondering why we use the `TaskDatasets` class. It is a convenience wrapper around much of the functionality you need when using `datasets`' `Dataset` class, and it has a few special behaviors to quickly build working `AdaptiveDataLoaders` as well.
Let's build these `AdaptiveDataLoaders`, which are just fastai's `DataLoaders` class with a few functions overridden so it works nicely with HuggingFace's `Dataset` class.
To build our `DataLoaders`, we can call `.dataloaders`, specifying our batch size and a collate function to use. In our case we will collate with the `DataCollatorWithPadding` class out of `transformers`:
from transformers import DataCollatorWithPadding
dls = dsets.dataloaders(
batch_size=8,
collate_fn=DataCollatorWithPadding(tokenizer=dsets.tokenizer)
)
Finally, let's view a batch of data with the `show_batch` function:
dls.show_batch(n=4)
Since this isn't one of the pre-built task-specific dataset classes (such as `SequenceClassificationDatasets`), the `show_batch` output looks a little plain, but it gets across exactly what you would need to see.
Next let's build a `Tuner` and train our model.
Next we need to build a compatible `Tuner` for our problem. These tuners contain good defaults for our problem space, including loss functions and metrics.
First let's import the `SequenceClassificationTuner` and view its documentation:
from adaptnlp import SequenceClassificationTuner
Next we'll pass in our `DataLoaders`, the name of our model, and, since we are using raw `Datasets`, the number of classes we have. In our case this is two.
Note: if you are not using one of the data API classes (`TaskDatasets`, `SequenceClassificationDatasets`, etc), you need to pass in the tokenizer to the constructor as well with `tokenizer=tokenizer`.
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2)
By default we can see that it used `CrossEntropyLoss` as our loss function, and both `accuracy` and `F1Score` as our metrics:
tuner.loss_func
_ = [print(m.name) for m in tuner.metrics]
It is also possible to define your own metrics; these stem from fastai.
To do so, write a function that takes the model's predictions and the targets, and performs an operation. For example, we will write our own `accuracy` metric:
def ourAccuracy(inp, targ):
    "A simplified accuracy metric that doesn't flatten"
    return (inp == targ).float().mean()
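If your model's raw outputs are logits of shape (batch_size, num_classes) rather than class ids, a metric would typically reduce them first. A minimal sketch of that variant (the name `ourAccuracyFromLogits` is purely illustrative):
def ourAccuracyFromLogits(inp, targ):
    "Accuracy that argmaxes logits into predicted class ids before comparing"
    return (inp.argmax(dim=-1) == targ).float().mean()
Either style can go in the metrics list; below we continue with the simple `ourAccuracy`.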
And then we pass it into the constructor:
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2, metrics=[ourAccuracy])
If we look at the metrics, you can see that now it is just `ourAccuracy`:
tuner.metrics[0].name
For this tutorial, we will revert it back to the defaults:
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2)
Finally we just need to train our model!
To fine-tune, AdaptNLP's tuner class provides only a few functions to work with. The important ones are the `tune` and `lr_find` methods.
As the `Tuner` uses fastai under the hood, `lr_find` calls fastai's Learning Rate Finder to help us pick a learning rate. Let's do that now:
tuner.lr_find()
It recommends a learning rate of around 2e-4; however, a steeper slope can be found around 5e-5, so we will use that.
Note: the `valley` suggestion method is one of the most reliable ones, but also try to build an intuition for finding a learning rate as you go.
lr = 5e-5
Let's look at the documentation for `tune`:
We can pass in a number of epochs, a learning rate, a strategy, and additional fastai callbacks to use.
Valid strategies live in the `Strategy` namespace class, and consist of:
- OneCycle (Also called the One-Cycle Policy)
- CosineAnnealing
- SGDR
from adaptnlp import Strategy
In this tutorial we will train with the One-Cycle policy, as currently it is one of the best schedulers to use.
tuner.tune(3, lr, strategy=Strategy.OneCycle)
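If you wanted to compare schedulers, the same call accepts the other strategies listed above. For example, a cosine-annealed run would look like this (purely illustrative; the tutorial itself trains with One-Cycle):
# Illustrative alternative: train for 3 epochs with Cosine Annealing instead of One-Cycle
tuner.tune(3, lr, strategy=Strategy.CosineAnnealing)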
Now that we have a trained model, let's save those weights away.
Calling `tuner.save` will save both the model and the tokenizer in the same format HuggingFace uses:
tuner.save('good_model')
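Because the export follows HuggingFace's format, the saved folder should also load directly with the plain `transformers` Auto classes. A hedged sketch, assuming the 'good_model' folder saved above:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Reload the fine-tuned weights and tokenizer with plain transformers
reloaded_model = AutoModelForSequenceClassification.from_pretrained('good_model')
reloaded_tokenizer = AutoTokenizer.from_pretrained('good_model')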
There are two ways to get predictions. The first is with the `.predict` method in our `tuner`; this is great if you just finished training and want to see how your model performs on some new data!
The other method is with AdaptNLP's inference API, which we will show afterwards.
First, let's define a sentence to test with:
sentence = 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'
And then predict with it:
tuner.predict(sentence)
You'll notice it says `LABEL_1`. We did not build with the `Datasets` wrapper APIs, so currently they do not have a vocabulary to work off of.
Let's pass in a vocabulary of `not_equivalent` and `equivalent` to work with:
names = ['not_equivalent', 'equivalent']
tuner.predict(sentence, class_names=names)
You can see it gave us much more readable results!
With the Inference API
Next we will use the `EasySequenceClassifier` class, which AdaptNLP offers:
from adaptnlp import EasySequenceClassifier
We simply construct the class:
classifier = EasySequenceClassifier()
And call the `tag_text` method, passing in the sentence, the location of our saved model, and some names for our classes.
Similarly to before, we can pass in our own vocabulary to use. Let's do that:
classifier.tag_text(
sentence,
model_name_or_path='good_model',
class_names=names
)
And we got the exact same output and probabilities!
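The Easy classes are also convenient for tagging more than one piece of text. As a hedged sketch (assuming `tag_text` accepts a list of strings, and using placeholder example text):
# Illustrative: tag a small batch of sentences in one call
sentences = [sentence, 'He said the food was great . He said the food was terrible .']
classifier.tag_text(
    sentences,
    model_name_or_path='good_model',
    class_names=names
)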
There are also different levels of detail we can return with our predictions (this also applies to our earlier `predict` call).
These live in a namespace `DetailLevel` class, with a few examples below:
from adaptnlp import DetailLevel
DetailLevel.Low
While some Easy modules will not return different items at each level, most will return only a few specific outputs at the Low level, and everything possible at the High level:
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.Low,
class_names=names
)
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.Medium,
class_names=names
)
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.High,
class_names=names
)