What is a language model?

Language modeling is the task of assigning a probability distribution over a sequence of words. The language models we are using can assign the probability of an upcoming word (or words) given a sequence of words. The GPT2 language model is a good example of a causal language model, which predicts the word that follows a given sequence of words. The predicted word can then be appended to the given sequence to predict yet another word, and so on. This is, in essence, how we produce models for the NLP task of text generation.
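
To make this concrete, here is a minimal sketch of next-word prediction with GPT2 using the Hugging Face Transformers library directly (the prompt and the top-5 cutoff are just illustrative choices, not part of AdaptNLP's API):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("China and the U.S. will begin to", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # shape: (batch, sequence length, vocabulary size)

# Probability distribution over the next word, given the words so far
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_word_probs.topk(5)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))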

Why would you want to fine-tune a language model?

Fine-tuning a language model comes in handy when the data for a target task comes from a different distribution than the general-domain data that was used to pretrain the language model.

When fine-tuning the language model on data from a target task, the general-domain pretrained model is able to converge quickly and adapt to the idiosyncrasies of the target data. This can be seen in ULMFiT, Jeremy Howard and Sebastian Ruder's approach to transfer learning in NLP.

With AdaptNLP's LMFineTuner, we can fine-tune state-of-the-art pretrained transformer language models provided by Hugging Face's Transformers library. LMFineTuner is built on transformers.Trainer, so additional documentation on it can be found in Hugging Face's documentation here.

Below are the transformer language models available for fine-tuning with LMFineTuner:

Transformer Model    Model Type/Architecture String Key
ALBERT               "albert"
DistilBERT           "distilbert"
BERT                 "bert"
CamemBERT            "camembert"
RoBERTa              "roberta"
GPT                  "gpt"
GPT2                 "gpt2"

You can fine-tune any transformer language model with one of the above architectures in Hugging Face's Transformers library. Key shortcut names are located here.

The same goes for Hugging Face's public model-sharing repository, which is available here as of v2.2.2 of the Transformers library.

This tutorial will go over the following simple-to-use components of the LMFineTuner for fine-tuning pre-trained language models on your custom text data.

  1. Data loading and training arguments
  2. Language model training
  3. Language model evaluation

1. Data loading and training arguments

We'll first start by downloading some example raw text files. If you want to fine-tune a model on your own custom data, just provide the file paths to the training and evaluation text files that contain text from your target task. The data doesn't require much formatting, since a language model does not necessarily need "labeled" data. All you need is the text you'd like to use to "expand" the domain of knowledge that your language model is trained on.
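
If you are starting from your own collection of raw documents, a minimal sketch like the one below can split them into training and evaluation text files (the file names and the 90/10 split are arbitrary choices, not something AdaptNLP requires):

# Your own raw, unlabeled text from the target domain
docs = [
    "First document from your target domain...",
    "Second document from your target domain...",
]
split = int(len(docs) * 0.9)

with open("my_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(docs[:split]))

with open("my_eval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(docs[split:]))

# Then point train_file and eval_file at these paths instead of the WikiText files below.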

!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip

train_file = "./wikitext-2-raw/wiki.train.raw"
eval_file = "./wikitext-2-raw/wiki.test.raw"
--2020-08-31 15:38:50--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.64.78
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.64.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4721645 (4.5M) [application/zip]
Saving to: ‘wikitext-2-raw-v1.zip’

wikitext-2-raw-v1.z 100%[===================>]   4.50M  2.92MB/s    in 1.5s    

2020-08-31 15:38:52 (2.92 MB/s) - ‘wikitext-2-raw-v1.zip’ saved [4721645/4721645]

Archive:  wikitext-2-raw-v1.zip
   creating: wikitext-2-raw/
  inflating: wikitext-2-raw/wiki.test.raw  
  inflating: wikitext-2-raw/wiki.valid.raw  
  inflating: wikitext-2-raw/wiki.train.raw  

Now that we have the text data we want to fine-tune our language model on, we can move on to configuring the training component.

One of the first things we'll need to specify before we start training are the training arguments. Training arguments consist mainly of the hyperparameters we want to provide the model. These may include batch size, initial learning rate, number of epochs, etc.

We will be using the transformers.TrainingArguments data class to store our training args. These are compatible with transformers.Trainer as well as AdaptNLP's train methods. For more documentation on the TrainingArguments class, please look here. There are a lot of arguments available, but we will pass in the important ones and use default values for the rest.

The training arguments below specify the output directory for your model and checkpoints.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./models',
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    evaluate_during_training=False,
    logging_dir='./logs',
    save_steps=2500,
    eval_steps=100
)

2. Language model training

Now that we have our data and training arguments, let's instantiate the LMFineTuner and load in a pre-trained language model we would like to fine-tune. In this case, we will use the gpt2 pre-trained language model.

Note: You can load in any model with the allowable architecture that we've specified above. You can even load in custom pre-trained models or models that you find in the Hugging Face repository that have already been fine-tuned and trained on NLP target tasks.
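
For example, any checkpoint that falls under the architectures listed above could be loaded the same way (the checkpoint names below are just illustrative; the tutorial itself continues with gpt2):

from adaptnlp import LMFineTuner

# A smaller GPT2-architecture checkpoint
finetuner = LMFineTuner(model_name_or_path="distilgpt2")

# Or a masked language model checkpoint such as RoBERTa
# (remember to set mlm=True when training, as discussed below)
# finetuner = LMFineTuner(model_name_or_path="roberta-base")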

from adaptnlp import LMFineTuner

finetuner = LMFineTuner(model_name_or_path="gpt2")
/home/andrew/Documents/github/adaptnlp/venv-adaptnlp/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Now we can run the built-in train() method by passing in the training arguments. The train method is also where you specify your data arguments, which include your train and eval files, the pre-trained model ID (this should have been loaded in your earlier cells, but can be loaded dynamically), the text column name, the label column name, and ordered label names (only required if loading in paths to CSV data files for the dataset args).

Notice how we pass the mlm argument as False? The mlm argument should be set to True if we are using a masked language model variant, such as a BERT-architecture language model. More information can be found in Hugging Face's documentation here.
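
As a hedged sketch (mirroring the train() call shown below), fine-tuning a masked language model instead would look something like this:

from adaptnlp import LMFineTuner

mlm_finetuner = LMFineTuner(model_name_or_path="bert-base-cased")
mlm_finetuner.train(
    training_args=training_args,
    train_file=train_file,
    eval_file=eval_file,
    mlm=True,  # masked language modeling objective for BERT-style models
    overwrite_cache=False,
)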

Please check out AdaptNLP's package reference here for more information.

finetuner.train(
    training_args=training_args,
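    # note: the eval file doubles as the training file here, presumably just to keep this demo quick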
    train_file=eval_file,
    eval_file=eval_file,
    mlm=False,
    overwrite_cache=False
)
08/31/2020 15:45:44 - INFO - transformers.training_args -   PyTorch: setting up devices
08/31/2020 15:45:44 - WARNING - adaptnlp.language_model -   Process rank: -1,
                device: cuda:0,
                n_gpu: 1,
                distributed training: False,
                16-bits training: False
            
08/31/2020 15:45:44 - INFO - adaptnlp.language_model -   Training/evaluation parameters: {
  "output_dir": "./models",
  "overwrite_output_dir": false,
  "do_train": false,
  "do_eval": false,
  "do_predict": false,
  "evaluate_during_training": false,
  "per_device_train_batch_size": 1,
  "per_device_eval_batch_size": 1,
  "per_gpu_train_batch_size": null,
  "per_gpu_eval_batch_size": null,
  "gradient_accumulation_steps": 1,
  "learning_rate": 5e-05,
  "weight_decay": 0.01,
  "adam_epsilon": 1e-08,
  "max_grad_norm": 1.0,
  "num_train_epochs": 1,
  "max_steps": -1,
  "warmup_steps": 500,
  "logging_dir": "./logs",
  "logging_first_step": false,
  "logging_steps": 500,
  "save_steps": 2500,
  "save_total_limit": null,
  "no_cuda": false,
  "seed": 42,
  "fp16": false,
  "fp16_opt_level": "O1",
  "local_rank": -1,
  "tpu_num_cores": null,
  "tpu_metrics_debug": false,
  "debug": false,
  "dataloader_drop_last": false,
  "eval_steps": 100,
  "past_index": -1
}
08/31/2020 15:45:44 - INFO - filelock -   Lock 139826145788648 acquired on ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw.lock
08/31/2020 15:45:44 - INFO - transformers.data.datasets.language_modeling -   Creating features from dataset file at ./wikitext-2-raw
08/31/2020 15:45:45 - INFO - transformers.data.datasets.language_modeling -   Saving features into cached file ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw [took 0.004 s]
08/31/2020 15:45:45 - INFO - filelock -   Lock 139826145788648 released on ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw.lock
08/31/2020 15:45:45 - INFO - filelock -   Lock 139826145788312 acquired on ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw.lock
08/31/2020 15:45:45 - INFO - transformers.data.datasets.language_modeling -   Loading features from cached file ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw [took 0.006 s]
08/31/2020 15:45:45 - INFO - filelock -   Lock 139826145788312 released on ./wikitext-2-raw/cached_lm_GPT2TokenizerFast_1024_wiki.test.raw.lock
08/31/2020 15:45:45 - INFO - transformers.trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
08/31/2020 15:45:45 - INFO - transformers.trainer -   ***** Running training *****
08/31/2020 15:45:45 - INFO - transformers.trainer -     Num examples = 279
08/31/2020 15:45:45 - INFO - transformers.trainer -     Num Epochs = 1
08/31/2020 15:45:45 - INFO - transformers.trainer -     Instantaneous batch size per device = 1
08/31/2020 15:45:45 - INFO - transformers.trainer -     Total train batch size (w. parallel, distributed & accumulation) = 1
08/31/2020 15:45:45 - INFO - transformers.trainer -     Gradient Accumulation steps = 1
08/31/2020 15:45:45 - INFO - transformers.trainer -     Total optimization steps = 279
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/279 [00:00<?, ?it/s]
Iteration: 100%|██████████| 279/279 [01:04<00:00,  4.33it/s]
Epoch: 100%|██████████| 1/1 [01:04<00:00, 64.48s/it]
08/31/2020 15:46:49 - INFO - transformers.trainer -   

Training completed. Do not forget to share your model on huggingface.co/models =)


08/31/2020 15:46:49 - INFO - transformers.trainer -   Saving model checkpoint to ./models
08/31/2020 15:46:49 - INFO - transformers.configuration_utils -   Configuration saved in ./models/config.json
08/31/2020 15:46:50 - INFO - transformers.modeling_utils -   Model weights saved in ./models/pytorch_model.bin

3. Language model evaluation

To run evaluation on the model with your eval dataset, all you need to do is call the built-in finetuner.evaluate(), since you've already loaded your eval dataset during training.

finetuner.evaluate()
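
If evaluate() surfaces the underlying transformers.Trainer metrics (an assumption worth verifying in AdaptNLP's package reference), the reported eval loss can be converted into perplexity, the usual evaluation metric for causal language models:

import math

eval_output = finetuner.evaluate()
# Assuming a metrics dict with an "eval_loss" entry, like transformers.Trainer.evaluate() returns
if isinstance(eval_output, dict) and "eval_loss" in eval_output:
    print(f"Perplexity: {math.exp(eval_output['eval_loss']):.2f}")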

And now you have your very own pre-trained language model that's been fine-tuned on your personal domain data!

Since we've just fine-tuned a causal language model, we can actually load this straight into an EasyTextGenerator class object and play around with our language model to evaluate it qualitatively with our own "eyes".

All we have to do is pass in the directory to which we've output our trained language model; in this case it's located at "./models".

from adaptnlp import EasyTextGenerator

text = "China and the U.S. will begin to"

generator = EasyTextGenerator()
generated_text = generator.generate(
    text, 
    model_name_or_path="./models", 
    num_tokens_to_produce=50
)

print(generated_text)
Generating: 100%|██████████| 1/1 [00:00<00:00,  2.26it/s]
['China and the U.S. will begin to develop their own nuclear weapons in the coming years.\n\nThe U.S. has been developing a range of nuclear weapons since the 1950s, but the U.S. has never used them in combat. The U.S. has been']

You can compare this with the original pre-trained gpt2 model as well.

generated_text = generator.generate(
    text, 
    model_name_or_path="gpt2", 
    num_tokens_to_produce=50
)

print(generated_text)
Special tokens have been added in the vocabulary, make sure the associated word emebedding are fine-tuned or trained.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Generating: 100%|██████████| 1/1 [00:00<00:00,  2.33it/s]
['China and the U.S. will begin to see the effects of the new sanctions on the Russian economy.\n\n"The U.S. is going to be the first to see the effects of the new sanctions," said Michael O\'Hanlon, a senior fellow at the Center for Strategic']