textformer.datasets

Because we need data, right? Datasets are composed of classes and methods that prepare data for the downstream transformers.

A datasets package for all common textformer modules.

class textformer.datasets.GenerativeDataset(file_path, field, **kwargs)

Bases: torchtext.data.Dataset

A GenerativeDataset class is in charge of loading raw texts and creating Language Modelling datasets, used for text generation tasks.

__init__(self, file_path, field, **kwargs)

Creates a GenerativeDataset, used for text generation.

Parameters
  • file_path (str) – Path to the file that will be loaded.

  • field (torchtext.data.Field) – Datatype instructions for tensor conversion.

_load_data(self, file_path, fields)

Loads a text file and creates a list of torchtext Example objects.

Parameters
  • file_path (str) – Path to the file that will be loaded.

  • fields (list) – List of tuples holding datatype instructions for tensor conversion.

Returns

The loaded and pre-processed text as a list of Example objects.
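The loading step described above can be sketched in plain Python. This is only an illustration, not the library's implementation: `load_generative_examples` is a hypothetical helper, whitespace splitting stands in for whatever tokenization the field defines, and treating the whole file as a single example is an assumption.

```python
import os
import tempfile


def load_generative_examples(file_path, tokenize=str.split):
    """Hypothetical helper: read one text file into a single tokenized example."""
    with open(file_path, mode="r", encoding="utf-8") as f:
        text = f.read()
    # Assumption: the whole text becomes one example, mirroring a
    # language-modelling setup where the corpus is a single token stream.
    return [tokenize(text)]


# Build a tiny throwaway corpus so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "corpus.txt")
    with open(corpus_path, mode="w", encoding="utf-8") as f:
        f.write("the quick brown fox\njumps over the lazy dog\n")
    examples = load_generative_examples(corpus_path)

print(len(examples), examples[0][:4])
```

In the real class, the tokens would be wrapped in torchtext Example objects built from the given fields rather than returned as plain lists.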

class textformer.datasets.TranslationDataset(file_path, extensions, fields, **kwargs)

Bases: torchtext.data.Dataset

A TranslationDataset class is in charge of loading (source, target) texts and creating Machine Translation datasets, used for translation tasks.

__init__(self, file_path, extensions, fields, **kwargs)

Creates a TranslationDataset, used for text translation.

Parameters
  • file_path (str) – Path to the file that will be loaded.

  • extensions (tuple) – Extensions to the path for each language.

  • fields (tuple) – Tuple of datatype instructions for tensor conversion.

_load_data(self, source_path, target_path, fields)

Loads text files and creates a list of torchtext Example objects.

Parameters
  • source_path (str) – Path to the source file that will be loaded.

  • target_path (str) – Path to the target file that will be loaded.

  • fields (tuple) – Tuple of datatype instructions for tensor conversion.

Returns

The loaded and pre-processed source and target texts as a list of Example objects.
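As a rough sketch of how parallel source and target files might be paired line by line, consider the plain-Python example below. `load_translation_pairs` is a hypothetical helper, and two details are assumptions rather than documented behaviour: each path is formed by appending an extension to `file_path`, and pairs with an empty side are skipped.

```python
import os
import tempfile


def load_translation_pairs(source_path, target_path):
    """Hypothetical helper: pair the lines of two parallel text files."""
    pairs = []
    with open(source_path, mode="r", encoding="utf-8") as source_file, \
            open(target_path, mode="r", encoding="utf-8") as target_file:
        for source_line, target_line in zip(source_file, target_file):
            source_line, target_line = source_line.strip(), target_line.strip()
            # Assumption: a pair is kept only when both sides are non-empty.
            if source_line and target_line:
                pairs.append((source_line, target_line))
    return pairs


# Write two small parallel files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "train")
    extensions = (".en", ".de")
    sentences = (
        ["hello world", "", "good morning"],
        ["hallo welt", "", "guten morgen"],
    )
    for extension, lines in zip(extensions, sentences):
        with open(file_path + extension, mode="w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    # Assumption: each language path is file_path plus its extension.
    source_path, target_path = (file_path + extension for extension in extensions)
    pairs = load_translation_pairs(source_path, target_path)

print(pairs)
```

The real method would turn each pair into a torchtext Example using the supplied fields; the tuples here only show the alignment.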

classmethod splits(cls, file_path, extensions, fields, path=None, train='train', validation='val', test='test', **kwargs)

Creates TranslationDataset objects for the training, validation, and test splits, used for text translation.

Parameters
  • file_path (str) – Path to the file that will be loaded.

  • extensions (tuple) – Extensions to the path for each language.

  • fields (tuple) – Tuple of datatype instructions for tensor conversion.

  • train (str) – Prefix for the training data.

  • validation (str) – Prefix for the validation data.

  • test (str) – Prefix for the test data.
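To make the role of the prefixes and extensions concrete, here is a small sketch of how the per-split file paths could be composed. It is an assumption that each prefix is joined onto `file_path` and that each extension is then appended; `build_split_paths` is a hypothetical helper, and skipping a split whose prefix is `None` is likewise assumed.

```python
import os


def build_split_paths(file_path, extensions, train="train", validation="val", test="test"):
    """Hypothetical helper: map each split prefix to its (source, target) paths."""
    paths = {}
    for prefix in (train, validation, test):
        # Assumption: a prefix of None means that split is not loaded.
        if prefix is None:
            continue
        base = os.path.join(file_path, prefix)
        paths[prefix] = tuple(base + extension for extension in extensions)
    return paths


split_paths = build_split_paths("data", (".en", ".de"), test=None)
for prefix, (source_path, target_path) in split_paths.items():
    print(prefix, source_path, target_path)
```

Under these assumptions, a directory named `data` would be expected to hold files such as `train.en`, `train.de`, `val.en`, and `val.de`.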