Data and Tuning API for Sequence Classification Tasks

Datasets

class SequenceClassificationDatasets[source]

SequenceClassificationDatasets(train_dset:Dataset, valid_dset:Dataset, tokenizer_name:str, tokenize:bool, tokenize_kwargs:dict, auto_kwargs:dict, remove_columns:list, categorize:[Categorize, MultiCategorize]) :: TaskDatasets

A set of datasets designed for sequence classification

Parameters:

  • train_dset : <class 'datasets.arrow_dataset.Dataset'>

    A training dataset

  • valid_dset : <class 'datasets.arrow_dataset.Dataset'>

    A validation dataset

  • tokenizer_name : <class 'str'>

    The name of a tokenizer

  • tokenize : <class 'bool'>

    Whether to tokenize immediately

  • tokenize_kwargs : <class 'dict'>

    kwargs for the tokenize function

  • auto_kwargs : <class 'dict'>

    AutoTokenizer.from_pretrained kwargs

  • remove_columns : <class 'list'>

    The columns to remove when tokenizing

  • categorize : [<class 'adaptnlp.training.core.Categorize'>, <class 'adaptnlp.training.core.MultiCategorize'>]

    A Categorize or MultiCategorize instance

SequenceClassificationDatasets.from_dfs[source]

SequenceClassificationDatasets.from_dfs(train_df:DataFrame, text_col:str, label_col:str, tokenizer_name:str, tokenize:bool=True, is_multicategory:bool=False, label_delim=' ', valid_df=None, split_func=None, split_pct=0.2, tokenize_kwargs:dict={}, auto_kwargs:dict={})

Builds SequenceClassificationDatasets from a DataFrame or set of DataFrames

Parameters:

  • train_df : <class 'pandas.core.frame.DataFrame'>

    A training dataframe

  • text_col : <class 'str'>

    The name of the text column

  • label_col : <class 'str'>

    The name of the label column

  • tokenizer_name : <class 'str'>

    The name of the tokenizer

  • tokenize : <class 'bool'>, optional

    Whether to tokenize immediately

  • is_multicategory : <class 'bool'>, optional

    Whether each item has a single label or multiple labels

  • label_delim : <class 'str'>, optional

    If `is_multicategory`, how to separate the labels

  • valid_df : <class 'NoneType'>, optional

    An optional validation dataframe

  • split_func : <class 'NoneType'>, optional

    Optionally a splitting function similar to RandomSplitter

  • split_pct : <class 'float'>, optional

    The fraction of `train_df` to split off for validation

  • tokenize_kwargs : <class 'dict'>, optional

    kwargs for the tokenize function

  • auto_kwargs : <class 'dict'>, optional

    kwargs for the AutoTokenizer.from_pretrained constructor
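
As a rough sketch of how `from_dfs` might be called, the example below sets up a tiny multi-label dataset whose labels are joined with the default `label_delim` of `' '`. The sample rows, the column names `text` and `label`, the `bert-base-uncased` tokenizer name, and the top-level `adaptnlp` import path are all assumptions made for illustration. `split_labels` is a hypothetical helper that only shows what the delimiter means; it is not the library's implementation. `pandas` and `adaptnlp` are imported lazily so the helper can be run without them.

```python
from typing import Dict, List

# Hypothetical sample rows: each label string holds one or more labels
# separated by a space, matching the default label_delim=' '.
ROWS: List[Dict[str, str]] = [
    {"text": "The match went to extra time", "label": "sports"},
    {"text": "The senator owns a football club", "label": "politics sports"},
    {"text": "Parliament passed the budget", "label": "politics"},
]


def split_labels(label: str, label_delim: str = " ") -> List[str]:
    """Illustrate how a delimited label string maps to individual labels."""
    return [part for part in label.split(label_delim) if part]


def build_datasets(tokenizer_name: str = "bert-base-uncased"):
    """Call from_dfs with the documented arguments (requires adaptnlp)."""
    import pandas as pd
    from adaptnlp import SequenceClassificationDatasets  # assumed import path

    train_df = pd.DataFrame(ROWS)
    return SequenceClassificationDatasets.from_dfs(
        train_df,
        text_col="text",
        label_col="label",
        tokenizer_name=tokenizer_name,
        tokenize=True,
        is_multicategory=True,  # each row may carry several labels
        label_delim=" ",
        split_pct=0.2,  # no valid_df is passed, so train_df is presumably split
    )
```

With `is_multicategory=False`, the same call would treat each label string as a single class, so `label_delim` would not matter.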

SequenceClassificationDatasets.from_csvs[source]

SequenceClassificationDatasets.from_csvs(train_csv:Path, text_col:str, label_col:str, tokenizer_name:str, tokenize:bool=True, is_multicategory:bool=False, label_delim=' ', valid_csv:Path=None, split_func=None, split_pct=0.2, tokenize_kwargs:dict={}, auto_kwargs:dict={}, **kwargs)

Builds SequenceClassificationDatasets from a single csv or set of csvs. A convenience constructor for from_dfs

Parameters:

  • train_csv : <class 'pathlib.Path'>

    A training csv file

  • text_col : <class 'str'>

    The name of the text column

  • label_col : <class 'str'>

    The name of the label column

  • tokenizer_name : <class 'str'>

    The name of the tokenizer

  • tokenize : <class 'bool'>, optional

    Whether to tokenize immediately

  • is_multicategory : <class 'bool'>, optional

    Whether each item has a single label or multiple labels

  • label_delim : <class 'str'>, optional

    If `is_multicategory`, how to separate the labels

  • valid_csv : <class 'pathlib.Path'>, optional

    An optional validation csv

  • split_func : <class 'NoneType'>, optional

    Optionally a splitting function similar to RandomSplitter

  • split_pct : <class 'float'>, optional

    The fraction of the data in `train_csv` to split off for validation

  • tokenize_kwargs : <class 'dict'>, optional

    kwargs for the tokenize function

  • auto_kwargs : <class 'dict'>, optional

    kwargs for the AutoTokenizer.from_pretrained constructor

  • kwargs : <class 'inspect._empty'>
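
A minimal sketch of the CSV route, assuming a file with a header row whose column names are then passed as `text_col` and `label_col`. The file name, the sample rows, the tokenizer name, and the `adaptnlp` import path are illustrative assumptions. The CSV is written with the standard library so the layout is explicit, and `adaptnlp` is imported lazily inside `build_datasets`.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny single-label CSV with 'text' and 'label' columns."""
    csv_path = directory / "train.csv"
    rows = [
        ("I loved this film", "positive"),
        ("A dull, overlong mess", "negative"),
        ("Great cast and a sharp script", "positive"),
    ]
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])  # header names reused as text_col/label_col
        writer.writerows(rows)
    return csv_path


def build_datasets(train_csv: Path, tokenizer_name: str = "bert-base-uncased"):
    """Call from_csvs with the documented arguments (requires adaptnlp)."""
    from adaptnlp import SequenceClassificationDatasets  # assumed import path

    return SequenceClassificationDatasets.from_csvs(
        train_csv,
        text_col="text",
        label_col="label",
        tokenizer_name=tokenizer_name,
        split_pct=0.2,
    )


# Write the sample file and read its header back to confirm the layout.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = write_sample_csv(Path(tmp))
    with sample_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))
```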

SequenceClassificationDatasets.from_folders[source]

SequenceClassificationDatasets.from_folders(train_path:Path, get_label:callable, tokenizer_name:str, tokenize:bool=True, is_multicategory:bool=False, label_delim='_', valid_path:Path=None, split_func=None, split_pct=0.2, tokenize_kwargs:dict={}, auto_kwargs:dict={})

Builds SequenceClassificationDatasets from a folder or groups of folders

Parameters:

  • train_path : <class 'pathlib.Path'>

    The path to the training data

  • get_label : <built-in function callable>

    A function which grabs the label(s) given a text file's `Path`

  • tokenizer_name : <class 'str'>

    The name of the tokenizer

  • tokenize : <class 'bool'>, optional

    Whether to tokenize immediately

  • is_multicategory : <class 'bool'>, optional

    Whether each item has a single label or multiple labels

  • label_delim : <class 'str'>, optional

    If `is_multicategory`, how to separate the labels

  • valid_path : <class 'pathlib.Path'>, optional

    The path to the validation data

  • split_func : <class 'NoneType'>, optional

    Optionally a splitting function similar to RandomSplitter

  • split_pct : <class 'float'>, optional

    The fraction of the items in `train_path` to split off for validation

  • tokenize_kwargs : <class 'dict'>, optional

    kwargs for the tokenize function

  • auto_kwargs : <class 'dict'>, optional

    kwargs for the AutoTokenizer.from_pretrained constructor
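
The `get_label` argument can be any callable that maps a file's `Path` to its label. The sketch below assumes a layout such as `train/<label>/<file>.txt`, which the documentation above does not prescribe; `label_from_parent`, the tokenizer name, and the `adaptnlp` import path are likewise illustrative assumptions.

```python
from pathlib import Path
from typing import Union


def label_from_parent(path: Union[str, Path]) -> str:
    """Return the name of the folder that directly contains the file."""
    return Path(path).parent.name


def build_datasets(train_path: Path, tokenizer_name: str = "bert-base-uncased"):
    """Call from_folders with the documented arguments (requires adaptnlp)."""
    from adaptnlp import SequenceClassificationDatasets  # assumed import path

    return SequenceClassificationDatasets.from_folders(
        train_path,
        get_label=label_from_parent,
        tokenizer_name=tokenizer_name,
        split_pct=0.2,
    )
```

For multi-label data, a folder name such as `politics_sports` would pair naturally with `is_multicategory=True` and the default `label_delim='_'`.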

When passing kwargs, put anything intended for the tokenize function in `tokenize_kwargs` and anything intended for the Auto class constructor in `auto_kwargs`.
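
One way to keep the two dictionaries apart is sketched below. It assumes that `tokenize_kwargs` is forwarded to the HuggingFace tokenizer call (where `max_length`, `truncation`, and `padding` are standard arguments) and that `auto_kwargs` reaches `AutoTokenizer.from_pretrained` (where `use_fast` is a standard argument). The specific values, the column names, and the `adaptnlp` import path are illustrative assumptions.

```python
from typing import Any, Dict

# Options for the tokenization step itself (assumed to reach the tokenizer call).
tokenize_kwargs: Dict[str, Any] = {
    "max_length": 128,
    "truncation": True,
    "padding": True,
}

# Options for constructing the tokenizer via AutoTokenizer.from_pretrained.
auto_kwargs: Dict[str, Any] = {"use_fast": True}


def build_datasets(train_df, tokenizer_name: str = "bert-base-uncased"):
    """Pass both dictionaries to from_dfs (requires adaptnlp)."""
    from adaptnlp import SequenceClassificationDatasets  # assumed import path

    return SequenceClassificationDatasets.from_dfs(
        train_df,
        text_col="text",
        label_col="label",
        tokenizer_name=tokenizer_name,
        tokenize_kwargs=tokenize_kwargs,
        auto_kwargs=auto_kwargs,
    )
```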

Sequence Classification Tuner

class SequenceClassificationTuner[source]

SequenceClassificationTuner(dls:DataLoaders, model_name:str, tokenizer=None, loss_func=CrossEntropyLoss(), metrics=[accuracy, <fastai.metrics.AccumMetric object>], opt_func=Adam, additional_cbs=None, expose_fastai_api=False, num_classes:int=None, **kwargs) :: AdaptiveTuner

An AdaptiveTuner with good defaults for Sequence Classification tasks

Valid kwargs and defaults:

  • lr:float = 0.001
  • splitter:function = trainable_params
  • cbs:list = None
  • path:Path = None
  • model_dir:Path = 'models'
  • wd:float = None
  • wd_bn_bias:bool = False
  • train_bn:bool = True
  • moms: tuple(float) = (0.95, 0.85, 0.95)

Parameters:

  • dls : <class 'fastai.data.core.DataLoaders'>

    A set of DataLoaders

  • model_name : <class 'str'>

    A HuggingFace model

  • tokenizer : <class 'NoneType'>, optional

    A HuggingFace tokenizer

  • loss_func : <class 'fastai.losses.CrossEntropyLossFlat'>, optional

    A loss function

  • metrics : <class 'list'>, optional

    Metrics to monitor the training with

  • opt_func : <class 'function'>, optional

    A fastai or torch Optimizer

  • additional_cbs : <class 'NoneType'>, optional

    Additional callbacks to always attach to the Tuner

  • expose_fastai_api : <class 'bool'>, optional

    Whether to expose the fastai API

  • num_classes : <class 'int'>, optional

    The number of classes

  • kwargs : <class 'inspect._empty'>
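
As a hedged sketch of constructing the tuner, the example below takes an existing `DataLoaders` object as input and forwards two of the extra keyword arguments listed above. `check_tuner_kwargs` is a hypothetical helper, not part of the library, that rejects names outside the documented list before they reach the constructor. The model name, the chosen values, and the `adaptnlp` import path are assumptions.

```python
from typing import Any, Dict

# Keyword names copied from the "Valid kwargs and defaults" list above.
VALID_TUNER_KWARGS = frozenset(
    {"lr", "splitter", "cbs", "path", "model_dir", "wd", "wd_bn_bias", "train_bn", "moms"}
)


def check_tuner_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Raise early if a keyword is not in the documented list (hypothetical helper)."""
    unknown = sorted(set(kwargs) - VALID_TUNER_KWARGS)
    if unknown:
        raise ValueError(f"Unsupported tuner kwargs: {unknown}")
    return kwargs


def make_tuner(dls, model_name: str = "bert-base-uncased", num_classes: int = 2):
    """Build a SequenceClassificationTuner from existing DataLoaders (requires adaptnlp)."""
    from adaptnlp import SequenceClassificationTuner  # assumed import path

    extra = check_tuner_kwargs({"lr": 1e-3, "wd": 0.01})
    return SequenceClassificationTuner(
        dls,
        model_name,
        num_classes=num_classes,
        **extra,
    )
```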

SequenceClassificationTuner.predict[source]

SequenceClassificationTuner.predict(text:Union[List[str], str], bs:int=64, detail_level:DetailLevel='low', class_names:list=None)

Predict some text for sequence classification with the currently loaded model

Parameters:

  • text : typing.Union[typing.List[str], str]

    Some text or list of texts to do inference with

  • bs : <class 'int'>, optional

    A batch size to use for multiple texts

  • detail_level : <class 'fastcore.basics.DetailLevel'>, optional

    A detail level to return on the predictions

  • class_names : <class 'list'>, optional

    A list of labels

Returns:

  • <class 'dict'>

    A dictionary of filtered predictions
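
A short sketch of calling `predict`, assuming an already constructed tuner. The two helpers are hypothetical illustrations of the `text` and `bs` parameters: `as_text_list` shows that a single string and a list of strings are both accepted inputs, and `count_batches` assumes inputs are grouped into batches of at most `bs` items. Neither is taken from the library's implementation, and the structure of the returned dictionary is not shown because it is not specified above.

```python
from typing import List, Optional, Union


def as_text_list(text: Union[str, List[str]]) -> List[str]:
    """Normalize a single string or a list of strings to a list."""
    return [text] if isinstance(text, str) else list(text)


def count_batches(n_texts: int, bs: int = 64) -> int:
    """Number of batches needed if each batch holds at most bs texts."""
    return -(-n_texts // bs)  # ceiling division


def classify(tuner, text: Union[str, List[str]], class_names: Optional[list] = None):
    """Run inference with the documented arguments; returns the library's dict."""
    return tuner.predict(
        text=text,
        bs=64,
        detail_level="low",  # default shown in the signature above
        class_names=class_names,
    )
```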