Introduction

In this tutorial we will show an end-to-end example of fine-tuning a Transformer for sequence classification on a custom dataset stored as text files.

By the end of this you should be able to:

Build a dataset with the SequenceClassificationDatasets class, along with its DataLoaders
Build a SequenceClassificationTuner quickly, find a good learning rate, and train with the One-Cycle Policy
Save that model for deployment or for use with other HuggingFace libraries
Run inference using both the Tuner's predict function and the EasySequenceClassifier class within AdaptNLP

Installing the Library

This tutorial uses the latest AdaptNLP version, as well as parts of the fastai library. Please run the code below to install them:

!pip install adaptnlp -U

(or pip3)

Getting the Dataset

First we need a dataset. We will use the fastai library to download the full IMDB Movie Reviews dataset:

from fastai.data.external import URLs, untar_data

URLs holds a namespace of many data endpoints, and untar_data is a function that can download and extract any data from a given URL.

Combining both, we can download the data:

data_path = untar_data(URLs.IMDB)

If we look at what was downloaded, we will find a train and test folder:

data_path.ls()

(#7) [Path('/root/.fastai/data/imdb/test'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/tmp_lm')]

In each are folders separating the text files by class:

(data_path/'train').ls()

(#4) [Path('/root/.fastai/data/imdb/train/pos'),Path('/root/.fastai/data/imdb/train/neg'),Path('/root/.fastai/data/imdb/train/unsupBow.feat'),Path('/root/.fastai/data/imdb/train/labeledBow.feat')]

As a result, the dataset has the following layout:

train
- class_a
  - text1.txt
  - text2.txt
  - ...
- class_b
  - text1.txt
  - ...
test (or valid)
- class_a
  - text1.txt
  - ...
- class_b
  - text1.txt
  - ...

Note: In this instance, test and validation have very similar meanings. Both refer to the dataset used to calculate metrics (such as accuracy or F1Score) during training.
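To make the layout concrete, here is a minimal sketch that builds a tiny copy of it in a temporary directory and counts the files per class. The helper names (build_toy_dataset, count_by_class) and the toy review texts are made up for illustration and are not part of AdaptNLP or fastai.

```python
import tempfile
from collections import Counter
from pathlib import Path


def build_toy_dataset(root: Path) -> None:
    """Create a tiny train/test tree with one folder per class (toy data)."""
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = root / split / label
            folder.mkdir(parents=True)
            for i in range(2):
                (folder / f"text{i}.txt").write_text(f"A {label} review, number {i}.")


def count_by_class(split_dir: Path) -> Counter:
    """Count the .txt files found directly inside each class folder."""
    return Counter(p.parent.name for p in split_dir.glob("*/*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_toy_dataset(root)
    train_counts = count_by_class(root / "train")
    test_counts = count_by_class(root / "test")

print(dict(train_counts), dict(test_counts))
```

Because the glob only matches text files one level below the split folder, extra files that sit next to the class folders (such as labeledBow.feat in the IMDB download) are ignored.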

Now that we have the dataset and know its format, let's pick a viable model to train with.

Picking a Model with the Hub

AdaptNLP has an HFModelHub class that lets you communicate with the HuggingFace Hub and pick a model from it, as well as an HF_TASKS namespace class listing the valid tasks we can search by.

Let's try and find one suitable for sequence classification.

First we need to import the class and generate an instance of it:

from adaptnlp import HFModelHub, HF_TASKS

hub = HFModelHub()

Next we can search for a model:

models = hub.search_model_by_task(
    task=HF_TASKS.TEXT_CLASSIFICATION
)

Let's look at a few:

models[:10]

[Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification],
 Model Name: roberta-base-openai-detector, Tasks: [text-classification],
 Model Name: roberta-large-mnli, Tasks: [text-classification],
 Model Name: roberta-large-openai-detector, Tasks: [text-classification]]

These are models specifically tagged with the text-classification tag, so you may not see a few models you would expect, such as bert-base-cased.

We'll use the first model, distilbert-base-uncased-finetuned-sst-2-english:

model = models[0]

model

Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification]

Now that we have picked a model, let's use the data API to prepare our data.

Note: Searching the hub is optional. You can always pass in the string name of a model instead, such as "bert-base-cased".

Building `TaskDatasets` with `SequenceClassificationDatasets`

Each task has a high-level data wrapper around the TaskDatasets class. In our case this is the SequenceClassificationDatasets class:

from adaptnlp import SequenceClassificationDatasets

There are multiple different constructors for the SequenceClassificationDatasets class, and you should never call the main constructor directly.

We will be using the from_folders method.

Anything you would normally pass to the tokenizer call (such as max_length, padding) should go in tokenize_kwargs, and anything going to the AutoTokenizer.from_pretrained constructor should be passed to the auto_kwargs.
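To give a feel for what max_length, truncation and padding do, here is a simplified pure-Python sketch. It is not the HuggingFace tokenizer: pad_and_truncate is a hypothetical helper, the token ids are made up, and real tokenizers also take care to keep special tokens in place when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Two made-up sequences of token ids with different lengths
batch = [[101, 7592, 102], [101, 2023, 3185, 2001, 9643, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)

print(input_ids)
print(attention_mask)
```

The longer sequence is cut down to four ids and the shorter one is padded up to match, so every row in the batch ends up with the same length.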

In our case we have a train_path and valid_path, and the last thing we need to do is write a way to get the label from an individual file.

Let's look at what one of these files looks like:

item = (data_path/'train'/'pos').ls()[0]

item

Path('/root/.fastai/data/imdb/train/pos/3205_8.txt')

So the label is the name of the file's parent folder:

item.parent.name

'pos'

Let's write a quick function to extract that:

Note: The items get passed in as string file locations, so we should convert them to a Path to use the .parent.name functionality.

from pathlib import Path

def get_y(item:str): return Path(item).parent.name
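As a quick sanity check, we can call the function on plain strings, which is how the items arrive. The second file name below is hypothetical and only serves to show the other class.

```python
from pathlib import Path


def get_y(item: str) -> str:
    """Return the name of the folder that directly contains the file."""
    return Path(item).parent.name


examples = [
    "/root/.fastai/data/imdb/train/pos/3205_8.txt",
    "/root/.fastai/data/imdb/train/neg/0_3.txt",  # hypothetical file name
]
labels = [get_y(path) for path in examples]
print(labels)
```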

Next we'll build our SequenceClassificationDatasets:

dsets = SequenceClassificationDatasets.from_folders(
    data_path/'train',
    get_label=get_y,
    valid_path=data_path/'test',
    tokenizer_name=model.name,
    tokenize=True,
    split_func=get_y,
    tokenize_kwargs={'max_length':128, 'truncation':True, 'padding':True}
)

Using custom data configuration default-1f2b71eec4880b46
Reusing dataset text_no_new_line (/root/.cache/huggingface/datasets/text_no_new_line/default-1f2b71eec4880b46/0.0.0)
Using custom data configuration default-04d8fbd2bd2108a0
Reusing dataset text_no_new_line (/root/.cache/huggingface/datasets/text_no_new_line/default-04d8fbd2bd2108a0/0.0.0)

Note: If you only have a training folder, pass in either a split_func to split the dataset in a custom way, or a split_pct to split it randomly by percentage.

Finally, we turn it into some AdaptiveDataLoaders.

These are just fastai's DataLoaders with a few functions overridden so that they work nicely with HuggingFace's Dataset class:
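The idea behind a random percentage split can be sketched in a few lines. This is not AdaptNLP's implementation: split_by_pct is a hypothetical helper, and it assumes the percentage refers to the share held out for validation.

```python
import random


def split_by_pct(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items and hold out valid_pct of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


files = [f"train/pos/text{i}.txt" for i in range(10)]
train_files, valid_files = split_by_pct(files, valid_pct=0.2)
print(len(train_files), len(valid_files))
```

Fixing the seed keeps the validation set the same between runs, which makes metrics comparable across experiments.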

dls = dsets.dataloaders(batch_size=8)

Finally, let's view a batch of data with the show_batch function:

dls.show_batch()

Building `Tuner`

Next we need to build a compatible Tuner for our problem. These tuners contain good defaults for our problem space, including loss functions and metrics.

First, let's import the SequenceClassificationTuner:

from adaptnlp import SequenceClassificationTuner

Next we'll pass in our DataLoaders and the name of our model:

Note: If you are not using the data API (TaskDatasets, SequenceClassificationDatasets, etc.), you also need to pass the tokenizer to the constructor with tokenizer=tokenizer.

tuner = SequenceClassificationTuner(dls, model.name)

By default we can see that it used CrossEntropyLoss as our loss function, and both accuracy and F1Score as our metrics:

tuner.loss_func

FlattenedLoss of CrossEntropyLoss()

_ = [print(m.name) for m in tuner.metrics]

accuracy
f1_score
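For readers unfamiliar with these two metrics, the sketch below computes them on plain Python lists. It is an illustration of the standard definitions rather than the fastai implementation, and the predictions and targets are made up.

```python
def accuracy(preds, targs):
    """Fraction of predictions that match their targets."""
    return sum(p == t for p, t in zip(preds, targs)) / len(targs)


def f1_score(preds, targs, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and t == positive for p, t in zip(preds, targs))
    fp = sum(p == positive and t != positive for p, t in zip(preds, targs))
    fn = sum(p != positive and t == positive for p, t in zip(preds, targs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = [1, 0, 1, 1, 0, 0]  # made-up predicted class indices
targs = [1, 0, 0, 1, 1, 0]  # made-up true class indices
acc = accuracy(preds, targs)
f1 = f1_score(preds, targs)
print(round(acc, 3), round(f1, 3))
```

Accuracy treats every mistake the same, while F1 focuses on how well the positive class is found, which is why it is useful to track both.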

It is also possible to define your own metrics, which follow fastai's metric conventions.

To do so, write a function that takes the model's predictions and the targets and returns a value. For example, we will write our own accuracy metric:

def ourAccuracy(inp, targ):
    "A simplified accuracy metric that doesn't flatten"
    # Pick the highest-scoring class for each item before comparing with the targets
    return (inp.argmax(dim=-1) == targ).float().mean()

And then we pass it into the constructor:

tuner = SequenceClassificationTuner(dls, model.name, metrics=[ourAccuracy])

If we look at the metrics, you can see that now it is just ourAccuracy:

tuner.metrics[0].name

'ourAccuracy'

For this tutorial, we will revert it back to the defaults:

tuner = SequenceClassificationTuner(dls, model.name)

Finally we just need to train our model!

Fine-Tuning

All that's left is to tune. The tuner currently exposes only four or five functions; this is by design, to keep it simple. If you don't want to be boxed in, pass expose_fastai_api=True to our earlier constructor call. Because Tuner uses fastai under the hood, this exposes the entirety of Learner, so you can call fit_one_cycle, lr_find, and everything else.

First, let's call lr_find, which uses fastai's Learning Rate Finder to help us pick a learning rate.

tuner.lr_find()

/opt/venv/lib/python3.8/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)

SuggestedLRs(valley=0.0003981071640737355)

It suggests a learning rate of about 4e-4. We will use a slightly more conservative 1e-4, which is in the same order of magnitude.

lr = 1e-4

The tune function accepts a number of epochs, a learning rate, a strategy, and additional fastai callbacks to call.

Valid strategies live in the Strategy namespace class, and consist of:

OneCycle (Also called the One-Cycle Policy)
CosineAnnealing
SGDR

from adaptnlp import Strategy

In this tutorial we will train with the One-Cycle policy, which is generally a strong default scheduler.

Let's now tune with our strategy and our newly found learning rate for three passes over the dataset:

tuner.tune(
    epochs=3, 
    lr=lr, 
    strategy=Strategy.OneCycle
)
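To build some intuition for the One-Cycle policy, here is a small sketch of the learning rate schedule: a cosine warm-up to the peak learning rate followed by a cosine decay to a very small value. The parameter names and default values (pct_start=0.25, div=25, div_final=1e5) are assumed to resemble fastai's fit_one_cycle defaults; treat this as an illustration, not as the code the tuner runs.

```python
import math


def cosine_anneal(start, end, pos):
    """Move smoothly from start (pos=0) to end (pos=1) along a cosine curve."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pct, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at training progress pct in [0, 1] (assumed defaults)."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max
        return cosine_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: decay from lr_max to lr_max / div_final
    return cosine_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


lr_max = 1e-4
schedule = [one_cycle_lr(i / 10, lr_max) for i in range(11)]
for step, value in enumerate(schedule):
    print(f"{step * 10:3d}% of training: lr = {value:.2e}")
```

The short warm-up lets training start gently, and the long decay lets the model settle with progressively smaller updates.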

Saving Model

Now that we have a trained model, let's save those weights away.

Calling tuner.save will save both the model and the tokenizer in the same format that HuggingFace uses:

tuner.save('good_model')

'good_model'
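Since the folder is saved in HuggingFace's format, it should be possible to load it back with the transformers library directly. The sketch below assumes that transformers is installed and that good_model contains the usual model and tokenizer files; load_saved_model is a hypothetical helper, not part of AdaptNLP.

```python
def load_saved_model(path="good_model"):
    """Load the tokenizer and model from a folder written by tuner.save."""
    # Imported lazily so that defining this helper does not require transformers
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSequenceClassification.from_pretrained(path)
    return tokenizer, model
```

After training and saving, calling load_saved_model('good_model') would return the tokenizer and model as regular HuggingFace objects.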

Performing Inference

There are two ways to get predictions. The first is the .predict method of our tuner, which is handy if you have just finished training and want to see how your model performs on some new data. The other is AdaptNLP's inference API, which we will show afterwards.

In Tuner

First let's write a sentence to test with:

sentence = "This movie was horrible! Hugh Jackman is a terrible actor"

And then predict with it:

tuner.predict(sentence)

{'sentences': ['This movie was horrible! Hugh Jackman is a terrible actor'],
 'predictions': ['neg'],
 'probs': tensor([[9.9931e-01, 6.9142e-04]])}
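The prediction is simply the class with the highest probability. The sketch below reproduces that lookup on plain Python lists using the probabilities shown above. It assumes the class order is ['neg', 'pos'], which matches the prediction here; top_labels is a hypothetical helper.

```python
def top_labels(probs, class_names):
    """Return (label, probability) for the most likely class of each row."""
    results = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        results.append((class_names[best], row[best]))
    return results


probs = [[9.9931e-01, 6.9142e-04]]  # values copied from the output above
class_names = ["neg", "pos"]        # assumed index order
predictions = top_labels(probs, class_names)
print(predictions)
```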

With the Inference API

Next we will use the EasySequenceClassifier class, which AdaptNLP offers:

from adaptnlp import EasySequenceClassifier

We simply construct the class:

classifier = EasySequenceClassifier()

And call the tag_text method, passing in the sentence, the location of our saved model, and some names for our classes:

classifier.tag_text(
    sentence,
    model_name_or_path='good_model',
    class_names=['negative', 'positive']
)

2021-08-02 18:10:15,999 loading file good_model

{'sentences': ['This movie was horrible! Hugh Jackman is a terrible actor'],
 'predictions': ['negative'],
 'probs': tensor([[9.9931e-01, 6.9142e-04]])}

And we got the same prediction and probabilities, now labeled with our own class names!

There are also different levels of predictions we can return (which is also the same with our earlier predict call).

These live in a namespace DetailLevel class, with a few examples below:

from adaptnlp import DetailLevel

DetailLevel.Low

'low'

While some Easy modules will not return different items at each level, most will return only a few specific outputs at the Low level, and everything possible at the High level:

classifier.tag_text(
    sentence,
    model_name_or_path = 'good_model',
    detail_level=DetailLevel.Low
)

{'sentences': ['This movie was horrible! Hugh Jackman is a terrible actor'],
 'predictions': ['NEGATIVE'],
 'probs': tensor([[9.9931e-01, 6.9142e-04]])}

classifier.tag_text(
    sentence,
    model_name_or_path = 'good_model',
    detail_level=DetailLevel.Medium
)

{'sentences': ['This movie was horrible! Hugh Jackman is a terrible actor'],
 'predictions': ['NEGATIVE'],
 'probs': tensor([[9.9931e-01, 6.9142e-04]]),
 'pairings': OrderedDict([('This movie was horrible! Hugh Jackman is a terrible actor',
               tensor([9.9931e-01, 6.9142e-04]))]),
 'classes': ['NEGATIVE', 'POSITIVE']}

classifier.tag_text(
    sentence,
    model_name_or_path = 'good_model',
    detail_level=DetailLevel.High
)

{'sentences': [Sentence: "This movie was horrible ! Hugh Jackman is a terrible actor"   [− Tokens: 11  − Sentence-Labels: {'sc': [NEGATIVE (0.9993), POSITIVE (0.0007)]}]],
 'predictions': ['NEGATIVE'],
 'probs': tensor([[9.9931e-01, 6.9142e-04]]),
 'pairings': OrderedDict([('This movie was horrible! Hugh Jackman is a terrible actor',
               tensor([9.9931e-01, 6.9142e-04]))]),
 'classes': ['NEGATIVE', 'POSITIVE']}

Training results from tuner.tune:

epoch	train_loss	valid_loss	accuracy	f1_score	time
0	0.353429	0.367736	0.834720	0.826503	06:58
1	0.284561	0.348747	0.853640	0.849727	06:58
2	0.105604	0.388459	0.862440	0.862621	06:58

Tutorial: Fine-Tuning Sequence Classification on Text Files with IMDB

Sample batch from dls.show_batch():

	Input	Label
0	i originally scored sarah's show with a nice fat 8, but i've struggled a bit with her humor of late and a thin 7 is what's settled in. i shall explain. < br / > < br / > you will either like sarah's humor, or you won't. if you don't, i doubt anyone could persuade you. you folks know who you are and it's perfectly fine, but then you know that too. moving on, the first season gave us fantastic bits about sarah, her friends and family, and her pursuits in life. in one memorable episode, she	pos
1	oh my goodness. this was a real big mess that just couldn't help itself. jeffrey ( jon heder ) is a 29 year old man still living with his mum ( diane keaton ) and not planning on going anywhere. until his mother meets a rich businessman named mert ( jeff daniels ) who she may be getting married to. < br / > < br / > it would have been an ok movie if heder didn't play his jeffrey so annoying, from the very start there is no chance of liking him and it only gets worse and worse. in the end, we are supposed to like him,	neg
2	i've seen enough of both little richard in interviews and in performances and enough of poor leon pigeonholed into these 50s / 60s musical bio pics to know that leon was not the right actor for this role. leon was so right as david ruffin in the temptations, but fails utterly to capture the essence of little richard in this film. < br / > < br / > actor miguel nunez who played little richard in " why do fools fall in love? " was a much more suitable choice, having pulled off the musician's powerful but effeminate persona. < br / > < br / >	neg
3	a fine story about following your dreams and actually taking a stab at doing something about them when the chance strikes. nothing was easy for morris either - he had a family, job, job opps elsewheres, a mortgage, etc - it wasn't like he could just drop what he was doing and blithely hop on the greyhound to play aaa ball for 4 months. it took guts. i am glad that they showed his indecision, almost up'til he got the callup to the majors. < br / > < br / > i can remember seeing him pitch against the red sox ( i think	pos
4	a visit by hitler in rome is the backdrop of this tender story of love, friendship, homosexuality and fascism. sophia loren plays the housewife and mother of six children who stays at home while her entire family go to the military parade in honor of hitler and mussolini. she has to stay at home since the family cannot afford a maid. she would have loved to go though as she along with the entire housing complex where she lives is an ardent admirer of il duce. < br / > < br / > there is one exception though. across the yard sits marcello mastroianni on his chair contemplating suicide. the	pos
