# Tuning a Base Language Model on the IMDB Dataset



## Introduction

In this tutorial we will walk through an end-to-end example of fine-tuning a Transformer language model on a custom dataset in DataFrame format.

By the end of this tutorial you should be able to:

1. Build a dataset with the LanguageModelDatasets class, and its DataLoaders
2. Build a LanguageModelTuner quickly, find a good learning rate, and train with the One-Cycle policy
3. Save that model away, to be deployed or used with other HuggingFace libraries
4. Perform inference using both the Tuner's predict function and the EasyTextGenerator class within AdaptNLP

## Installing the Library

This tutorial utilizes the latest AdaptNLP version, as well as parts of the fastai library. Please run the code below to install them:

!pip install adaptnlp -U


(or pip3)

## Getting the Dataset

First we need a dataset. We will use the fastai library to download the IMDB_SAMPLE dataset, a subset of IMDB Movie Reviews.

from fastai.data.external import URLs, untar_data


URLs holds a namespace of many data endpoints, and untar_data is a function that can download and extract any data from a given URL.

data_path = untar_data(URLs.IMDB_SAMPLE)


If we look at what was downloaded, we will find a texts.csv file:

data_path.ls()

(#1) [Path('/root/.fastai/data/imdb_sample/texts.csv')]

This is the data we want to use. Let's open the csv with pandas to generate our DataFrame object:

import pandas as pd

df = pd.read_csv(data_path/'texts.csv')


Let's look at our data

df.head()

label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False

Each row has a label, the review text, and an is_valid boolean, which determines whether the row belongs to the training or the validation set.
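As a quick sanity check, you can split the frame on that flag yourself. The snippet below is a hypothetical sketch using a toy frame in place of the real texts.csv (which has the same three columns):

```python
import pandas as pd

# Toy stand-in for texts.csv; the real file has the same three columns.
df = pd.DataFrame({
    'label': ['negative', 'positive', 'negative', 'positive'],
    'text': ['bad movie', 'great film', 'awful', 'loved it'],
    'is_valid': [False, False, True, True],
})

train_df = df[~df['is_valid']]  # rows used for training
valid_df = df[df['is_valid']]   # rows held out for validation

print(len(train_df), len(valid_df))  # 2 2
```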

Now that we've downloaded some data, let's pick a viable model to train with

## Picking a Model with the Hub

AdaptNLP has a HFModelHub class that allows you to communicate with the HuggingFace Hub and pick a model from it, as well as a namespace HF_TASKS class with a list of valid tasks we can search by.

Let's try to find one suitable for text generation.

First we need to import the class and generate an instance of it:

from adaptnlp import HFModelHub, HF_TASKS

hub = HFModelHub()


Next we can search for a model:

models = hub.search_model_by_task(HF_TASKS.TEXT_GENERATION)


Let's look at a few:

models[:10]

[Model Name: distilgpt2, Tasks: [text-generation],
Model Name: xlnet-large-cased, Tasks: [text-generation]]

These are models specifically tagged with the text-generation tag, so you may not see some models you would expect, such as bert-base-cased.

We'll use that first model, distilgpt2:

model = models[0]

model

Model Name: distilgpt2, Tasks: [text-generation]

Now that we have picked a model, let's use the data API to prepare our data

## Building TaskDatasets with LanguageModelDatasets

Each task has a high-level data wrapper around the TaskDatasets class. In our case this is the LanguageModelDatasets class:

from adaptnlp import LanguageModelDatasets


There are several constructors for the LanguageModelDatasets class; you should never call the main constructor directly.

We will be using from_dfs:

#### LanguageModelDatasets.from_dfs[source]

LanguageModelDatasets.from_dfs(train_df:DataFrame, text_col:str, tokenizer_name:str, block_size:int=128, masked_lm:bool=False, valid_df:DataFrame=None, split_func:callable=None, split_pct:float=0.1, tokenize_kwargs:dict={}, auto_kwargs:dict={})

Builds LanguageModelDatasets from a DataFrame or file path

Parameters:

• train_df : <class 'pandas.core.frame.DataFrame'>

A Pandas Dataframe

• text_col : <class 'str'>

The name of the text column

• tokenizer_name : <class 'str'>

The name of the tokenizer

• block_size : <class 'int'>, optional

The size of each block

• masked_lm : <class 'bool'>, optional

Whether the language model is a MLM

• valid_df : <class 'pandas.core.frame.DataFrame'>, optional

An optional validation DataFrame

• split_func : <built-in function callable>, optional

Optionally a splitting function similar to RandomSplitter

• split_pct : <class 'float'>, optional

What % to split the df between training and validation

• tokenize_kwargs : <class 'dict'>, optional

kwargs for the tokenize function

• auto_kwargs : <class 'dict'>, optional

kwargs for the AutoTokenizer.from_pretrained constructor

Anything you would normally pass to the tokenizer call (such as max_length, padding) should go in tokenize_kwargs, and anything going to the AutoTokenizer.from_pretrained constructor should be passed to the auto_kwargs.
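For intuition on block_size: language-model datasets typically concatenate the tokenized texts and slice them into fixed-length blocks of block_size tokens. The following is a simplified sketch of that idea, not AdaptNLP's exact implementation:

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Concatenated token ids -> fixed-length blocks, dropping the remainder."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

# 300 fake token ids become two full blocks of 128; the last 44 are dropped
blocks = chunk_into_blocks(list(range(300)), block_size=128)
print(len(blocks), len(blocks[0]))  # 2 128
```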

In our case we only have a train_df, and since we are training a language model we will split the data 90/10 (the default split_pct).

We will also set a block_size of 128, and leave masked_lm as False, since distilgpt2 is a causal (not masked) language model:

dsets = LanguageModelDatasets.from_dfs(
    train_df=df,
    text_col='text',
    tokenizer_name=model.name,
    block_size=128,
)

No value for max_length set, automatically adjusting to the size of the model and including truncation
Sequence length set to: 1024



And finally turn it into some AdaptiveDataLoaders.

These are just fastai's DataLoaders, with a few functions overridden so that they work nicely with HuggingFace's Dataset class

#### LanguageModelDatasets.dataloaders[source]

LanguageModelDatasets.dataloaders(batch_size=8, shuffle_train=True, collate_fn=default_data_collator, mlm_probability:float=0.15, path='.', device=None)

Build DataLoaders from self

Parameters:

• batch_size : <class 'int'>, optional

A batch size

• shuffle_train : <class 'bool'>, optional

Whether to shuffle the training dataset

• collate_fn : <class 'function'>, optional

A custom collation function

• mlm_probability : <class 'float'>, optional

• path : <class 'str'>, optional

• device : <class 'NoneType'>, optional

dls = dsets.dataloaders(batch_size=8)


Finally, let's view a batch of data with the show_batch function:

dls.show_batch()

Input Label
0 lighter or darker narrative and theme in his film.....Alain Delon visits swift, sure vengeance on the ruthless crime family that employed him as a hit-man in the Duccio Tessari thriller "Big Guns" after they accidentally murder his wife and child. Tessari and scenarists Roberto Gandus, Ugo Liberatore of "A Minute to Pray, a Second to Die," and Franco Verucci of "Ring of Death" take this actioneer about a career gunman for the mob right down to the wire. Indeed, "Big Guns" is rather predictable, but it still qualifies as solid entertainment with lots of savage and often sudden killings. Alain Delon of "The Godson" is appropriately laconic as he methodically deals out death to the heads of the mob families who refused to let him retire so that he could enjoy life with his young son and daughter. Richard Conte of "The Godfather" plays a Sicilian crime boss who wants to bury the hatchet with the Delon character, but the rest of his hard-nosed associates want the hit-man dead. Like most crime thrillers in the 1960s and 1970s, "Big Guns" subscribes to the cinematic morality that crime does not pay. Interestingly, the one man who has nothing to do with the murder of the wife and son of the hero survives while another betrays the hero with extreme prejudice. Tessari does not waste a second in this 90-minute shoot'em up. Apart from the mother and son dying in a car bomb meant for the father, the worst thing that takes place occurs in an automobile salvage yard when an associate of the hero is crushed in a junked car. Ostensibly, "Big Guns" is a rather bloodless outing, but it does have a high body count for a 1973 mobster melodrama. Only at the last minute does our protagonist let his guard down and so the contrived morality of an eye for an eye remains intact. Tessari stages a couple of decent car chases and the death of a don in a train traveling through a train tunnel is as bloody as this violent yarn gets. 
The photography and the compositions are excellent.This very funny British comedy shows what might happen if a section of London, in this case Pimlico, were to declare itself independent from the rest of the UK and its laws, taxes & post-war restrictions. Merry mayhem is what would happen.<br /><br />The explosion of a wartime bomb leads to the lighter or darker narrative and theme in his film.....Alain Delon visits swift, sure vengeance on the ruthless crime family that employed him as a hit-man in the Duccio Tessari thriller "Big Guns" after they accidentally murder his wife and child. Tessari and scenarists Roberto Gandus, Ugo Liberatore of "A Minute to Pray, a Second to Die," and Franco Verucci of "Ring of Death" take this actioneer about a career gunman for the mob right down to the wire. Indeed, "Big Guns" is rather predictable, but it still qualifies as solid entertainment with lots of savage and often sudden killings. Alain Delon of "The Godson" is appropriately laconic as he methodically deals out death to the heads of the mob families who refused to let him retire so that he could enjoy life with his young son and daughter. Richard Conte of "The Godfather" plays a Sicilian crime boss who wants to bury the hatchet with the Delon character, but the rest of his hard-nosed associates want the hit-man dead. Like most crime thrillers in the 1960s and 1970s, "Big Guns" subscribes to the cinematic morality that crime does not pay. Interestingly, the one man who has nothing to do with the murder of the wife and son of the hero survives while another betrays the hero with extreme prejudice. Tessari does not waste a second in this 90-minute shoot'em up. Apart from the mother and son dying in a car bomb meant for the father, the worst thing that takes place occurs in an automobile salvage yard when an associate of the hero is crushed in a junked car. Ostensibly, "Big Guns" is a rather bloodless outing, but it does have a high body count for a 1973 mobster melodrama. 
Only at the last minute does our protagonist let his guard down and so the contrived morality of an eye for an eye remains intact. Tessari stages a couple of decent car chases and the death of a don in a train traveling through a train tunnel is as bloody as this violent yarn gets. The photography and the compositions are excellent.This very funny British comedy shows what might happen if a section of London, in this case Pimlico, were to declare itself independent from the rest of the UK and its laws, taxes & post-war restrictions. Merry mayhem is what would happen.<br /><br />The explosion of a wartime bomb leads to the

When training a language model, the inputs and the targets are the same sequence, so there is no noticeable difference shown here.
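For a causal language model the labels start as a copy of the input ids; the shift so that each position predicts the next token happens inside the model (as in HuggingFace's implementations). A rough sketch of the pairing, using hypothetical token ids:

```python
# For causal LM training, labels start as a copy of the inputs;
# the shift so that position i predicts token i+1 happens inside the model.
input_ids = [464, 3807, 373, 1049]   # hypothetical token ids
labels = list(input_ids)

# The effective (input, target) pairs the loss sees after shifting:
pairs = list(zip(input_ids[:-1], labels[1:]))
print(pairs)  # [(464, 3807), (3807, 373), (373, 1049)]
```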

## Building Tuner

Next we need to build a compatible Tuner for our problem. These tuners contain good defaults for our problem space, including loss functions and metrics.

First let's import the LanguageModelTuner and view its documentation

from adaptnlp import LanguageModelTuner


## classLanguageModelTuner[source]

LanguageModelTuner(dls:DataLoaders, model_name, tokenizer=None, language_model_type:LMType='causal', loss_func=CrossEntropyLoss(), metrics=[Perplexity()], opt_func=Adam, additional_cbs=None, expose_fastai_api=False, **kwargs) :: AdaptiveTuner

An AdaptiveTuner with good defaults for Language Model fine-tuning Valid kwargs and defaults:

• lr:float = 0.001
• splitter:function = trainable_params
• cbs:list = None
• path:Path = None
• model_dir:Path = 'models'
• wd:float = None
• wd_bn_bias:bool = False
• train_bn:bool = True
• moms: tuple(float) = (0.95, 0.85, 0.95)

Parameters:

• dls : <class 'fastai.data.core.DataLoaders'>

• model_name : <class 'inspect._empty'>

A HuggingFace model

• tokenizer : <class 'NoneType'>, optional

A HuggingFace tokenizer

• language_model_type : <class 'fastcore.basics.LMType'>, optional

The type of language model to use

• loss_func : <class 'fastai.losses.CrossEntropyLossFlat'>, optional

A loss function

• metrics : <class 'list'>, optional

Metrics to monitor the training with

• opt_func : <class 'function'>, optional

A fastai or torch Optimizer

• additional_cbs : <class 'NoneType'>, optional

Additional Callbacks to always have tied to the Tuner

• expose_fastai_api : <class 'bool'>, optional

Whether to expose the fastai API

• kwargs : <class 'inspect._empty'>

Next we'll pass in our DataLoaders, the name of our model, and the tokenizer:

tuner = LanguageModelTuner(dls, model.name, dls.tokenizer)


By default we can see that it used CrossEntropyLoss as our loss function, and Perplexity as our metric

tuner.loss_func

FlattenedLoss of CrossEntropyLoss()

_ = [print(m.name) for m in tuner.metrics]

perplexity
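Perplexity is simply the exponential of the cross-entropy loss (in nats), so you can sanity-check reported values by hand. A minimal sketch with a hypothetical loss value:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats)
loss = 3.5  # hypothetical validation loss
perplexity = math.exp(loss)
print(round(perplexity, 2))  # 33.12
```

Lower loss means lower perplexity; a perplexity of N roughly means the model is as uncertain as if it were choosing uniformly among N tokens at each step.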


Finally we just need to train our model!

## Fine-Tuning

To fine-tune, AdaptNLP's tuner class provides only a few functions to work with. The important ones are the tune and lr_find methods.

As the Tuner uses fastai under the hood, lr_find calls fastai's Learning Rate Finder to help us pick a learning rate. Let's do that now:

#### AdaptiveTuner.lr_find[source]

AdaptiveTuner.lr_find(start_lr=1e-07, end_lr=10, num_it=100, stop_div=True, show_plot=True, suggest_funcs=valley)

Runs fastai's LR Finder

Parameters:

• start_lr : <class 'float'>, optional

• end_lr : <class 'int'>, optional

• num_it : <class 'int'>, optional

• stop_div : <class 'bool'>, optional

• show_plot : <class 'bool'>, optional

• suggest_funcs : <class 'function'>, optional

tuner.lr_find()


SuggestedLRs(valley=7.585775892948732e-05)

It suggests a learning rate of around 7e-5, so we will use a nearby round value of 5e-5.

lr = 5e-5


Let's look at the documentation for tune:

#### AdaptiveTuner.tune[source]

AdaptiveTuner.tune(epochs:int, lr:float=None, strategy:Strategy='fit_one_cycle', callbacks:list=[], **kwargs)

Fine tune self.model for epochs with an lr and strategy

Parameters:

• epochs : <class 'int'>

Number of iterations to train for

• lr : <class 'float'>, optional

If None, finds a new learning rate and uses suggestion_method

• strategy : <class 'fastcore.basics.Strategy'>, optional

A fitting method

• callbacks : <class 'list'>, optional

Extra fastai Callbacks

• kwargs : <class 'inspect._empty'>

We can pass in a number of epochs, a learning rate, a strategy, and additional fastai callbacks to call.

Valid strategies live in the Strategy namespace class, which we can import:

from adaptnlp import Strategy


In this tutorial we will train with the One-Cycle policy, as it is currently one of the most effective training schedules.

tuner.tune(3, lr, strategy=Strategy.OneCycle)
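For intuition, the One-Cycle policy warms the learning rate up from a small value to a maximum, then anneals it back down over the rest of training. The sketch below is a simplified linear version of the schedule; fastai's actual implementation uses cosine annealing and also cycles momentum:

```python
def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, start_div=25):
    """Simplified linear warmup/decay sketch of the One-Cycle schedule."""
    warmup_steps = int(total_steps * pct_start)
    min_lr = max_lr / start_div
    if step < warmup_steps:
        # Linear warmup from min_lr up to max_lr
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Linear decay from max_lr back down toward min_lr
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * frac

schedule = [one_cycle_lr(s, 100, 5e-5) for s in range(100)]
print(max(schedule))  # peaks at 5e-05, reached at the end of warmup
```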


## Saving Model

Now that we have a trained model, let's save those weights away.

Calling tuner.save will save both the model and the tokenizer in the same format HuggingFace uses:

#### AdaptiveTuner.save[source]

AdaptiveTuner.save(save_directory)

Save a pretrained model to a save_directory

Parameters:

• save_directory : <class 'inspect._empty'>

A folder to save our model to

tuner.save('good_model')

'good_model'

## Performing Inference

There are two ways to get predictions. The first is the .predict method on our tuner, which is great when you have just finished training and want to see how your model performs on some new data. The other is AdaptNLP's inference API, which we will show afterwards.

### In Tuner

First let's write a sentence to test with

sentence = "Hugh Jackman is a terrible "


And then predict with it:

#### LanguageModelTuner.predict[source]

LanguageModelTuner.predict(text:Union[List[str], str], bs:int=64, num_tokens_to_produce:int=50, **kwargs)

Generate some text with the currently loaded model

Parameters:

• text : typing.Union[typing.List[str], str]

Some text or list of texts to do inference with

• bs : <class 'int'>, optional

A batch size to use for multiple texts

• num_tokens_to_produce : <class 'int'>, optional

Number of tokens to generate

• kwargs : <class 'inspect._empty'>

tuner.predict(sentence, num_tokens_to_produce=8)

100.00% [1/1 00:00<00:00]
{'generated_text': ['Hugh Jackman is a terrible icky, and very funny, character.']}
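When you pass a list of texts, the bs argument controls how many are run through the model at once. The batching itself is just simple chunking; the following is a minimal sketch of the idea, not AdaptNLP's exact code:

```python
def batch_texts(texts, bs=64):
    """Split a list of texts into batches of at most bs items."""
    return [texts[i:i + bs] for i in range(0, len(texts), bs)]

# Ten texts with bs=4 yield two full batches and one partial batch
batches = batch_texts([f"review {i}" for i in range(10)], bs=4)
print([len(b) for b in batches])  # [4, 4, 2]
```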

### With the Inference API

Next we will use the EasyTextGenerator class, which AdaptNLP offers:

from adaptnlp import EasyTextGenerator


We simply construct the class:

generator = EasyTextGenerator()


And call the generate method, passing in the sentence, the location of our saved model, and the number of tokens to generate:

generator.generate(
    sentence,
    model_name_or_path='good_model',
    num_tokens_to_produce=8
)

100.00% [1/1 00:00<00:00]
{'generated_text': ['Hugh Jackman is a terrible icky, and very funny, character.']}

And we get the exact same output, since both methods load the same fine-tuned weights!