class LanguageModelDatasets
LanguageModelDatasets(train_dset:Dataset, valid_dset:Dataset, tokenizer_name:str, tokenize:bool, tokenize_kwargs:dict, auto_kwargs:dict, remove_columns:list, block_size:int, masked_lm:bool) :: TaskDatasets
A set of datasets designed for language model fine-tuning
Parameters:
train_dset : <class 'datasets.arrow_dataset.Dataset'>
A training dataset
valid_dset : <class 'datasets.arrow_dataset.Dataset'>
A validation dataset
tokenizer_name : <class 'str'>
The name of a tokenizer
tokenize : <class 'bool'>
Whether to tokenize immediately
tokenize_kwargs : <class 'dict'>
kwargs for the tokenize function
auto_kwargs : <class 'dict'>
kwargs for AutoTokenizer.from_pretrained
remove_columns : <class 'list'>
The columns to remove when tokenizing
block_size : <class 'int'>
The size of each block
masked_lm : <class 'bool'>
Whether the language model is a masked language model (MLM)
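The role of `block_size` is easiest to see on toy data. The sketch below assumes the common language-model preprocessing recipe (concatenate all token ids, cut them into fixed-length blocks, drop the remainder). `group_into_blocks` is a hypothetical helper written for illustration, not part of this library, and the actual implementation may differ.

```python
from typing import List


def group_into_blocks(tokenized: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate token id lists and split them into blocks of `block_size`.

    Hypothetical helper: illustrates the assumed meaning of `block_size` only.
    """
    # Flatten every tokenized text into one long stream of token ids
    flat = [tok for ids in tokenized for tok in ids]
    # Keep only as many tokens as fit into whole blocks
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


blocks = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Under this assumption, a larger `block_size` gives the model more context per example but yields fewer examples from the same corpus.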
LanguageModelDatasets.from_dfs
LanguageModelDatasets.from_dfs(train_df:DataFrame, text_col:str, tokenizer_name:str, block_size:int=128, masked_lm:bool=False, valid_df:DataFrame=None, split_func:callable=None, split_pct:float=0.1, tokenize_kwargs:dict={}, auto_kwargs:dict={})
Builds LanguageModelDatasets from a DataFrame or set of DataFrames
Parameters:
train_df : <class 'pandas.core.frame.DataFrame'>
A pandas DataFrame
text_col : <class 'str'>
The name of the text column
tokenizer_name : <class 'str'>
The name of the tokenizer
block_size : <class 'int'>, optional
The size of each block
masked_lm : <class 'bool'>, optional
Whether the language model is a masked language model (MLM)
valid_df : <class 'pandas.core.frame.DataFrame'>, optional
A validation DataFrame
split_func : <built-in function callable>, optional
A splitting function similar to RandomSplitter
split_pct : <class 'float'>, optional
The fraction of the data to hold out for validation
tokenize_kwargs : <class 'dict'>, optional
kwargs for the tokenize function
auto_kwargs : <class 'dict'>, optional
kwargs for AutoTokenizer.from_pretrained
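To make `split_pct` concrete, here is a minimal sketch of a random train/validation split over row indices. It assumes that `split_pct` is the share of rows held out for validation when no `valid_df` is given. `random_split` and its `seed` argument are hypothetical stand-ins for illustration, not the library's `split_func`.

```python
import random
from typing import List, Tuple


def random_split(n_items: int, split_pct: float = 0.1, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_idxs, valid_idxs) with about `split_pct` of the items held out.

    Hypothetical helper: a stand-in for a RandomSplitter-like function.
    """
    idxs = list(range(n_items))
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(idxs)
    n_valid = int(round(n_items * split_pct))
    return idxs[n_valid:], idxs[:n_valid]


train_idxs, valid_idxs = random_split(100, split_pct=0.1)
print(len(train_idxs), len(valid_idxs))  # 90 10
```

The two index lists are disjoint and together cover every row, so each row ends up in exactly one of the two datasets.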
LanguageModelDatasets.from_csvs
LanguageModelDatasets.from_csvs(train_csv:Path, text_col:str, tokenizer_name:str, block_size:int=128, masked_lm:bool=False, valid_csv:Path=None, split_func:callable=None, split_pct:float=0.1, tokenize_kwargs:dict={}, auto_kwargs:dict={}, **kwargs)
Builds LanguageModelDatasets from a single CSV file or a set of CSV files. A convenience constructor for from_dfs
Parameters:
train_csv : <class 'pathlib.Path'>
A training CSV file
text_col : <class 'str'>
The name of the text column
tokenizer_name : <class 'str'>
The name of the tokenizer
block_size : <class 'int'>, optional
The size of each block
masked_lm : <class 'bool'>, optional
Whether the language model is a masked language model (MLM)
valid_csv : <class 'pathlib.Path'>, optional
A validation CSV file
split_func : <built-in function callable>, optional
A splitting function similar to RandomSplitter
split_pct : <class 'float'>, optional
The fraction of the data to hold out for validation
tokenize_kwargs : <class 'dict'>, optional
kwargs for the tokenize function
auto_kwargs : <class 'dict'>, optional
kwargs for AutoTokenizer.from_pretrained
kwargs : <class 'inspect._empty'>
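A usage sketch for `from_csvs` may help tie the parameters together. It writes a toy CSV with the standard library and wraps the library call in a function, so the block runs even where the package is not installed. Two things here are assumptions rather than facts from this page: the import path `from adaptnlp import LanguageModelDatasets`, and the choice of `"gpt2"` as the tokenizer name. `write_toy_csv` and `build_datasets` are hypothetical helper names.

```python
import csv
import tempfile
from pathlib import Path


def write_toy_csv(path: Path) -> Path:
    """Write a tiny CSV with a single `text` column (illustrative data only)."""
    rows = [{"text": "The quick brown fox."}, {"text": "Jumps over the lazy dog."}]
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def build_datasets(train_csv: Path):
    """Build the datasets from a CSV using the signature documented above."""
    # Assumption: the class is importable from the top-level package.
    from adaptnlp import LanguageModelDatasets

    return LanguageModelDatasets.from_csvs(
        train_csv,
        text_col="text",          # column that holds the raw text
        tokenizer_name="gpt2",    # assumption: any HuggingFace tokenizer name
        block_size=128,
        masked_lm=False,
        split_pct=0.1,            # used because no valid_csv is passed
    )


train_csv = write_toy_csv(Path(tempfile.mkdtemp()) / "train.csv")
# dsets = build_datasets(train_csv)  # run this where the library is installed
```

A two-row file is far too small for real fine-tuning; it only shows the expected shape of the input.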
LanguageModelDatasets.from_folders
LanguageModelDatasets.from_folders(train_path:Path, tokenizer_name:str, block_size:int=128, masked_lm:bool=False, valid_path:Path=None, split_func:callable=None, split_pct:float=0.1, tokenize_kwargs:dict={}, auto_kwargs:dict={})
Builds LanguageModelDatasets from a folder or group of folders
Parameters:
train_path : <class 'pathlib.Path'>
The path to the training data
tokenizer_name : <class 'str'>
The name of the tokenizer
block_size : <class 'int'>, optional
The size of each block
masked_lm : <class 'bool'>, optional
Whether the language model is a masked language model (MLM)
valid_path : <class 'pathlib.Path'>, optional
A validation path
split_func : <built-in function callable>, optional
A splitting function similar to RandomSplitter
split_pct : <class 'float'>, optional
The fraction of the data to hold out for validation
tokenize_kwargs : <class 'dict'>, optional
kwargs for the tokenize function
auto_kwargs : <class 'dict'>, optional
kwargs for AutoTokenizer.from_pretrained
LanguageModelDatasets.dataloaders
LanguageModelDatasets.dataloaders(batch_size=8, shuffle_train=True, collate_fn=default_data_collator, mlm_probability:float=0.15, path='.', device=None)
Build DataLoaders from self
Parameters:
batch_size : <class 'int'>, optional
A batch size
shuffle_train : <class 'bool'>, optional
Whether to shuffle the training dataset
collate_fn : <class 'function'>, optional
A custom collation function
mlm_probability : <class 'float'>, optional
Token masking probability for masked language models
path : <class 'str'>, optional
device : <class 'NoneType'>, optional
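The effect of `mlm_probability` can be sketched in a few lines of plain Python: each token is independently replaced by a mask symbol with the given probability. This is a simplified illustration, not the collator this library uses. Production collators, such as HuggingFace's `DataCollatorForLanguageModeling`, work on token ids and also swap some selected tokens for random ones or leave them unchanged. `mask_tokens`, `MASK` and `seed` are hypothetical names.

```python
import random
from typing import List

MASK = "[MASK]"  # hypothetical mask symbol for this toy example


def mask_tokens(tokens: List[str], mlm_probability: float = 0.15, seed: int = 0) -> List[str]:
    """Replace each token with MASK independently with probability `mlm_probability`."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [MASK if rng.random() < mlm_probability else tok for tok in tokens]


tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, mlm_probability=0.15))
```

With `mlm_probability=0.0` nothing is masked and with `1.0` every token is masked; the default of 0.15 masks roughly one token in seven on average.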
class LanguageModelTuner
LanguageModelTuner(dls:DataLoaders, model_name, tokenizer=None, language_model_type:LMType='causal', loss_func=CrossEntropyLoss(), metrics=[Perplexity()], opt_func=Adam, additional_cbs=None, expose_fastai_api=False, **kwargs) :: AdaptiveTuner
An AdaptiveTuner with good defaults for language model fine-tuning
Valid kwargs and defaults:
lr : float = 0.001
splitter : function = trainable_params
cbs : list = None
path : Path = None
model_dir : Path = 'models'
wd : float = None
wd_bn_bias : bool = False
train_bn : bool = True
moms : tuple(float) = (0.95, 0.85, 0.95)
Parameters:
dls : <class 'fastai.data.core.DataLoaders'>
A set of DataLoaders or AdaptiveDataLoaders
model_name : <class 'inspect._empty'>
The name of a HuggingFace model
tokenizer : <class 'NoneType'>, optional
A HuggingFace tokenizer
language_model_type : <class 'fastcore.basics.LMType'>, optional
The type of language model to use
loss_func : <class 'fastai.losses.CrossEntropyLossFlat'>, optional
A loss function
metrics : <class 'list'>, optional
Metrics to monitor the training with
opt_func : <class 'function'>, optional
A fastai or torch Optimizer
additional_cbs : <class 'NoneType'>, optional
Additional callbacks that are always tied to the Tuner
expose_fastai_api : <class 'bool'>, optional
Whether to expose the fastai API
kwargs : <class 'inspect._empty'>
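The default loss (cross-entropy) and the default metric (perplexity) are two views of the same quantity: perplexity is the exponential of the average cross-entropy. The sketch below shows that relationship with the standard library only. `perplexity` here is a hypothetical helper for illustration and is not the fastai metric class.

```python
import math
from typing import List


def perplexity(token_probs: List[float]) -> float:
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    # Average negative log-likelihood, i.e. the mean cross-entropy loss
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that average loss
    return math.exp(cross_entropy)


# A model that spreads its probability evenly over 4 choices at every step
# has a perplexity of approximately 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A lower perplexity means the model is, on average, less "surprised" by the validation text, so it tracks the loss but on a more interpretable scale.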
LanguageModelTuner.predict
LanguageModelTuner.predict(text:Union[List[str], str], bs:int=64, num_tokens_to_produce:int=50, **kwargs)
Predict text with the currently loaded language model
Parameters:
text : typing.Union[typing.List[str], str]
Some text or list of texts to do inference with
bs : <class 'int'>, optional
A batch size to use for multiple texts
num_tokens_to_produce : <class 'int'>, optional
Number of tokens to generate
kwargs : <class 'inspect._empty'>
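Because `text` may be a single string or a list, and `bs` only matters for multiple texts, it can help to picture how the inputs might be normalized and batched. The sketch below is an assumption about that behaviour, not the library's code; `as_batches` is a hypothetical helper.

```python
from typing import List, Union


def as_batches(text: Union[List[str], str], bs: int = 64) -> List[List[str]]:
    """Wrap a single string in a list, then split the texts into batches of `bs`."""
    texts = [text] if isinstance(text, str) else list(text)
    return [texts[i:i + bs] for i in range(0, len(texts), bs)]


print(as_batches("Once upon a time"))      # [['Once upon a time']]
print(as_batches(["a", "b", "c"], bs=2))   # [['a', 'b'], ['c']]
```

With a tuner in hand, a call would then look like `tuner.predict("Once upon a time", num_tokens_to_produce=20)`, where `tuner` is a hypothetical, already constructed `LanguageModelTuner`.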