Fine-tuning transformers models


tqdm(*args, **kwargs)

class TextDataset

TextDataset(*args, **kwds) :: Dataset

An abstract class representing a :class:Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite :meth:__getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite :meth:__len__, which is expected to return the size of the dataset by many :class:~torch.utils.data.Sampler implementations and the default options of :class:~torch.utils.data.DataLoader.

.. note:: :class:~torch.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
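The map-style contract described above (integer key in, one sample out, plus an optional length) can be sketched as follows. This is a minimal illustration, not the library's TextDataset: the class name BlockTextDataset and its chunking behaviour are assumptions, and the torch.utils.data.Dataset base class is left out so the sketch runs without PyTorch installed.

```python
from typing import List, Sequence


class BlockTextDataset:
    """Hypothetical map-style dataset: integer index -> one block of token ids.

    In real use this would subclass torch.utils.data.Dataset; the base class
    is omitted here so the sketch has no third-party dependencies.
    """

    def __init__(self, token_ids: Sequence[int], block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        # Keep only full blocks; a trailing partial block is dropped
        # (an assumption of this sketch).
        self.examples: List[List[int]] = [
            list(token_ids[i : i + block_size])
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]

    def __len__(self) -> int:
        # Samplers and data loaders rely on this to know the dataset size.
        return len(self.examples)

    def __getitem__(self, index: int) -> List[int]:
        # Integral keys, matching what a default index sampler would yield.
        return self.examples[index]


dataset = BlockTextDataset(token_ids=list(range(10)), block_size=4)
print(len(dataset), dataset[0], dataset[1])
```

Because the keys are plain integers from 0 to len(dataset) - 1, a dataset shaped like this works with a default index sampler and needs no custom one.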

class LMFineTunerManual

LMFineTunerManual(train_data_file:str, eval_data_file:str=None, model_type:str='bert', model_name_or_path:str=None, mlm:bool=True, mlm_probability:float=0.15, config_name:str=None, tokenizer_name:str=None, cache_dir:str=None, block_size:int=-1, no_cuda:bool=False, overwrite_cache:bool=False, seed:int=42, fp16:bool=False, fp16_opt_level:str='O1', local_rank:int=-1)

A language model fine-tuner object: set the language model configuration, then train and evaluate the model.


>>> finetuner = adaptnlp.LMFineTuner()
>>> finetuner.train()


  • train_data_file - The input training data file (a text file).
  • eval_data_file - An optional input evaluation data file to evaluate the perplexity on (a text file).
  • model_type - The model architecture to be trained or fine-tuned.
  • model_name_or_path - The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.
  • mlm - Train with masked language modeling loss instead of causal language modeling loss.
  • mlm_probability - Ratio of tokens to mask for masked language modeling loss.
  • config_name - Optional Transformers pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.
  • tokenizer_name - Optional Transformers pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.
  • cache_dir - Optional directory to store the pre-trained models downloaded from s3. If None, the default directory is used.
  • block_size - Optional input sequence length after tokenization.
                  The training dataset is split into blocks of this size for training.
                  `-1` defaults to the model's maximum input length for single-sentence inputs (taking special tokens into account).
  • no_cuda - Do not use CUDA even when it is available.
  • overwrite_cache - Overwrite the cached training and evaluation sets
  • seed - Random seed for initialization.
  • fp16 - Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit
  • fp16_opt_level - For fp16: Apex AMP optimization level, one of 'O0', 'O1', 'O2', or 'O3'.
  • local_rank - For distributed training: the local rank of this process. The default of -1 disables distributed training.
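To make mlm, mlm_probability, and seed more concrete, here is a simplified sketch of masked language modeling input preparation. It is not this library's implementation: the function name mask_tokens, the mask token id 103, and the always-substitute-the-mask-token rule are assumptions for illustration (real implementations typically vary the replacement). The label value -100 follows the common PyTorch convention for positions the loss ignores.

```python
import random
from typing import List, Sequence, Tuple

MASK_ID = 103        # assumed mask-token id, for illustration only
IGNORE_INDEX = -100  # label value conventionally ignored by the loss


def mask_tokens(
    token_ids: Sequence[int],
    mlm_probability: float = 0.15,
    seed: int = 42,
) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) for a simplified masked-LM objective."""
    rng = random.Random(seed)  # fixed seed makes the masking reproducible
    inputs: List[int] = []
    labels: List[int] = []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            # Selected position: hide the token and ask the model to predict it.
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            # Unselected position: keep the token and exclude it from the loss.
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


tokens = list(range(1000, 1100))
inputs, labels = mask_tokens(tokens, mlm_probability=0.15, seed=42)
masked = sum(1 for i in inputs if i == MASK_ID)
print(f"masked {masked} of {len(tokens)} tokens")
```

With mlm_probability=0.15, roughly 15% of positions are masked on average, and reusing the same seed yields the same masking pattern.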