Introduction
In this tutorial we will show an end-to-end example of fine-tuning a Transformer for sequence classification on a custom dataset in HuggingFace `Dataset` format.
By the end of this you should be able to:
- Build a dataset with the `TaskDatasets` class, and its `DataLoaders`
- Build a `SequenceClassificationTuner` quickly, find a good learning rate, and train with the One-Cycle Policy
- Save that model away, to be used with deployment or other HuggingFace libraries
- Apply inference using both the `Tuner`'s available function as well as the `EasySequenceClassifier` class within AdaptNLP
This tutorial utilizes the latest AdaptNLP version, as well as parts of the fastai library. Please run the below code to install them:
!pip install adaptnlp -U
(or `pip3`)
First we need a dataset. We will use the `datasets` library's `load_dataset` function to quickly generate a raw dataset straight from HuggingFace:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
We now have a raw `datasets` dataset (a `DatasetDict`), which we can index into:
raw_datasets['train'][0]
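Each item is a plain dictionary. To confirm the fields we will work with later (the GLUE MRPC columns are `sentence1`, `sentence2`, `label`, and `idx`), we can also print the column names of the split:
# List the column names of the training split
print(raw_datasets['train'].column_names)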
Now that we have the data downloaded, let's decide on a model to use.
AdaptNLP has a `HFModelHub` class that allows you to communicate with the HuggingFace Hub and pick a model from it, as well as a namespace `HF_TASKS` class with a list of valid tasks we can search by.
Let's try and find one suitable for sequence classification.
First we need to import the class and generate an instance of it:
from adaptnlp import HFModelHub, HF_TASKS
hub = HFModelHub()
Next we can search for a model:
models = hub.search_model_by_task(HF_TASKS.TEXT_CLASSIFICATION)
Let's look at a few:
models[:10]
These are models specifically tagged with the text-classification tag, so you may not see a few models you would expect, such as `bert-base-uncased`.
Let's search for that one for this problem:
models = hub.search_model_by_name('bert-base-uncased', user_uploaded=True)
models[:5]
We want the first one.
model = models[0]
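The returned result carries the model's identifier on the Hub as `model.name`, which is what we will hand to the data and tuner APIs below. A quick check:
# Print the Hub identifier of the chosen model (should be something like 'bert-base-uncased')
print(model.name)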
Now that we have picked a model, let's use the data API to prepare our data.
Building TaskDatasets
All of the task-specific high-level data APIs (such as `SequenceClassificationDatasets`) wrap around the `TaskDatasets` class, which is a small wrapper around `datasets`' highly efficient `Dataset` class.
This integration is valuable because it provides a fast and memory-efficient way to use large datasets with minimal effort.
First let's import the class:
from adaptnlp import TaskDatasets
The `TaskDatasets` class has no class constructors outside the normal one; the reason for this is that it takes in raw `Datasets` and other tokenizer arguments to build from.
Anything you would normally pass to the tokenizer call (such as `max_length` or `padding`) should go in `tokenize_kwargs`, and anything going to the `AutoTokenizer.from_pretrained` constructor should be passed in `auto_kwargs`.
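As a minimal sketch of how the two dictionaries split up (we define the actual `tokenize_kwargs` we use a bit further down; the `use_fast` flag here is only an illustrative `AutoTokenizer.from_pretrained` argument, not something this tutorial requires):
# Arguments forwarded to every tokenizer call
tokenize_kwargs = {'max_length': 64, 'padding': True}
# Arguments forwarded to AutoTokenizer.from_pretrained when the tokenizer is built
auto_kwargs = {'use_fast': True}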
Custom Tokenization Function and Finishing our TaskDatasets
You may notice there is an extra step here: we need to pass in a `tokenize_func`. In the other tutorials we used a very basic tokenizing function, and this has a default for that as well.
However, given our dataset, we need to implement our own tokenization function.
To do so, your function must take in an `item`, a `tokenizer`, and `tokenize_kwargs`. Note that you do not have to pass any of these yourself: they are all attributes the `TaskDatasets` has access to, and they will be passed to this function implicitly.
What you need to declare is how you want the tokenizer applied.
In our case we have two separate sentences that need to be tokenized at once. These texts live in the dictionary we saw earlier under the keys `sentence1` and `sentence2`.
Let's write that function:
def tok_func(
    item, # A single item in the dataset
    tokenizer, # The implicit tokenizer that `TaskDatasets` has access to
    tokenize_kwargs, # Key word arguments passed into the constructor of `TaskDatasets`
):
    "A basic tokenization function for two items"
    return tokenizer(
        item['sentence1'],
        item['sentence2'],
        **tokenize_kwargs
    )
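To see what this function will produce, here is a small standalone sketch that calls a HuggingFace tokenizer on a sentence pair directly (the tokenizer name and the example strings are placeholders for illustration only):
from transformers import AutoTokenizer
# Illustrative only: load a tokenizer by name and tokenize a sentence pair
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
out = tok('The first sentence .', 'The second sentence .', max_length=64, padding=True)
# For a BERT-style tokenizer this returns input_ids, token_type_ids, and attention_mask
print(out.keys())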
Along with building our own tokenize function, we need to tell `Datasets` what columns to drop when we pull an item from our dataset. These are synonymous with `datasets`' `remove_columns`.
In our problem this includes the `sentence1`, `sentence2`, and `idx` keys, as our tokenized input gets put into a `text` key:
remove_cols = ['sentence1', 'sentence2', 'idx']
Finally we'll declare some arguments for our tokenize function, specifically ensuring our max length is reasonable and that we should pad our samples to that length:
tokenize_kwargs = {'max_length':64, 'padding':True}
Let's build our `TaskDatasets` now, passing in everything we built:
dsets = TaskDatasets(
train_dset = raw_datasets['train'], # Our training `Dataset`
valid_dset = raw_datasets['validation'], # Our validation `Dataset`
tokenizer_name = model.name, # The name of our model
tokenize_kwargs = tokenize_kwargs, # The tokenizer kwargs
tokenize_func = tok_func, # The tokenization function
remove_cols = remove_cols # The columns to remove after tokenizing our input
)
You may be wondering why we use the `TaskDatasets` class. It is a convenience wrapper around much of the functionality you need when using `datasets`' `Dataset` class, and it has a few special behaviors to quickly build working `AdaptiveDataLoaders` as well.
Let's build these `AdaptiveDataLoaders`, which are just fastai's `DataLoaders` class with a few functions overridden so it works nicely with HuggingFace's `Dataset` class.
To build our `DataLoaders`, we can call `.dataloaders`, specifying our batch size and a collate function to use. In our case we will collate with the `DataCollatorWithPadding` class out of `transformers`:
from transformers import DataCollatorWithPadding
dls = dsets.dataloaders(
batch_size=8,
collate_fn=DataCollatorWithPadding(tokenizer=dsets.tokenizer)
)
Finally, let's view a batch of data with the `show_batch` function:
dls.show_batch(n=4)
Since this isn't one of the pre-built task-specific dataset classes (such as `SequenceClassificationDatasets`), the `show_batch` output looks a little plain, but it gets across exactly what you would need to see.
Next let's build a `Tuner` and train our model.
Next we need to build a compatible `Tuner` for our problem. These tuners contain good defaults for our problem space, including loss functions and metrics.
First let's import the `SequenceClassificationTuner` and view its documentation:
from adaptnlp import SequenceClassificationTuner
Next we'll pass in our `DataLoaders`, the name of our model, and, since we are using raw `Datasets`, the number of classes we have. In our case this is two.
Note: if you are not using one of the data API classes (`TaskDatasets`, `SequenceClassificationDatasets`, etc), you need to pass in the tokenizer to the constructor as well with `tokenizer=tokenizer`.
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2)
By default we can see that it used `CrossEntropyLoss` as our loss function, and both `accuracy` and `F1Score` as our metrics:
tuner.loss_func
_ = [print(m.name) for m in tuner.metrics]
It is also possible to define your own metrics; these stem from fastai.
To do so, write a function that takes the model's predictions and the targets, and performs an operation. For example, we will write our own `accuracy` metric:
def ourAccuracy(inp, targ):
    "A simplified accuracy metric that doesn't flatten"
    return (inp == targ).float().mean()
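If your model's raw outputs are logits of shape (batch_size, num_classes) rather than class ids, a metric would typically reduce them first. A minimal sketch of that variant (the name `ourAccuracyFromLogits` is purely illustrative):
def ourAccuracyFromLogits(inp, targ):
    "Accuracy that argmaxes logits into predicted class ids before comparing"
    return (inp.argmax(dim=-1) == targ).float().mean()
Either style can go in the metrics list; below we continue with the simple `ourAccuracy`.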
And then we pass it into the constructor:
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2, metrics=[ourAccuracy])
If we look at the metrics, you can see that now it is just `ourAccuracy`:
tuner.metrics[0].name
For this tutorial, we will revert it back to the defaults:
tuner = SequenceClassificationTuner(dls, model.name, num_classes=2)
Finally we just need to train our model!
To fine-tune, AdaptNLP's tuner class provides only a few functions to work with. The important ones are the `tune` and `lr_find` methods.
As the `Tuner` uses fastai under the hood, `lr_find` calls fastai's Learning Rate Finder to help us pick a learning rate. Let's do that now:
tuner.lr_find()
It recommends a learning rate of around 2e-4; however, a steeper slope can be found around 5e-5, so we will use that.
Note: the `valley` suggestion method is one of the most reliable ones, but also try to build an intuition for finding a learning rate as you go.
lr = 5e-5
Let's look at the documentation for `tune`:
We can pass in a number of epochs, a learning rate, a strategy, and additional fastai callbacks to use.
Valid strategies live in the `Strategy` namespace class, and consist of:
- OneCycle (Also called the One-Cycle Policy)
- CosineAnnealing
- SGDR
from adaptnlp import Strategy
In this tutorial we will train with the One-Cycle policy, as currently it is one of the best schedulers to use.
tuner.tune(3, lr, strategy=Strategy.OneCycle)
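If you wanted to compare schedulers, the same call accepts the other strategies listed above. For example, a cosine-annealed run would look like this (purely illustrative; the tutorial itself trains with One-Cycle):
# Illustrative alternative: train for 3 epochs with Cosine Annealing instead of One-Cycle
tuner.tune(3, lr, strategy=Strategy.CosineAnnealing)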
Now that we have a trained model, let's save those weights away.
Calling `tuner.save` will save both the model and the tokenizer in the same format HuggingFace uses:
tuner.save('good_model')
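Because the export follows HuggingFace's format, the saved folder should also load directly with the plain `transformers` Auto classes. A hedged sketch, assuming the 'good_model' folder saved above:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Reload the fine-tuned weights and tokenizer with plain transformers
reloaded_model = AutoModelForSequenceClassification.from_pretrained('good_model')
reloaded_tokenizer = AutoTokenizer.from_pretrained('good_model')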
There are two ways to get predictions. The first is with the `.predict` method in our `tuner`; this is great if you just finished training and want to see how your model performs on some new data!
The other method is with AdaptNLP's inference API, which we will show afterwards.
First, let's define a sentence to test with:
sentence = 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'
And then predict with it:
tuner.predict(sentence)
You'll notice it says `LABEL_1`. We did not build with the `Datasets` wrapper APIs, so currently they do not have a vocabulary to work off of.
Let's pass in a vocabulary of `not_equivalent` and `equivalent` to work with:
names = ['not_equivalent', 'equivalent']
tuner.predict(sentence, class_names=names)
You can see it gave us much more readable results!
With the Inference API
Next we will use the `EasySequenceClassifier` class, which AdaptNLP offers:
from adaptnlp import EasySequenceClassifier
We simply construct the class:
classifier = EasySequenceClassifier()
And call the `tag_text` method, passing in the sentence, the location of our saved model, and some names for our classes.
Similarly to before, we can pass in our own vocabulary to use. Let's do that:
classifier.tag_text(
sentence,
model_name_or_path='good_model',
class_names=names
)
And we got the exact same output and probabilities!
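The Easy classes are also convenient for tagging more than one piece of text. As a hedged sketch (assuming `tag_text` accepts a list of strings, and using placeholder example text):
# Illustrative: tag a small batch of sentences in one call
sentences = [sentence, 'He said the food was great . He said the food was terrible .']
classifier.tag_text(
    sentences,
    model_name_or_path='good_model',
    class_names=names
)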
There are also different levels of detail we can return with our predictions (this also applies to our earlier `predict` call).
These live in a namespace `DetailLevel` class, with a few examples below:
from adaptnlp import DetailLevel
DetailLevel.Low
While some Easy modules will not return different items at each level, most will return only a few specific outputs at the Low level, and everything possible at the High level:
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.Low,
class_names=names
)
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.Medium,
class_names=names
)
classifier.tag_text(
sentence,
model_name_or_path = 'good_model',
detail_level=DetailLevel.High,
class_names=names
)