textformer.datasets
Because we need data, right? The datasets package provides classes and methods that prepare data for the downstream transformers.
A datasets package for all common textformer modules.

class textformer.datasets.GenerativeDataset(file_path, field, **kwargs)
Bases: torchtext.data.Dataset
The GenerativeDataset class loads raw text and builds language modelling datasets for text generation tasks.

__init__(self, file_path, field, **kwargs)
Creates a GenerativeDataset, used for text generation.
- Parameters
file_path (str) – Path to the file that will be loaded.
field (torchtext.data.Field) – Datatype instructions for tensor conversion.

_load_data(self, file_path, fields)
Loads a text file and creates a list of torchtext Example objects.
- Parameters
file_path (str) – Path to the file that will be loaded.
fields (list) – List of tuples holding datatype instructions for tensor conversion.
- Returns
The loaded and pre-processed text as a list of Example objects.
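
As a rough illustration of what loading a text file into a single language modelling example might involve, the sketch below uses only the standard library. The function name `load_generative_examples`, the whitespace tokenizer, and the dictionary standing in for a torchtext Example are all assumptions made for this illustration; they are not taken from textformer.

```python
import os
import tempfile


def load_generative_examples(file_path, tokenize=str.split):
    """Reads a whole text file and wraps its tokens in a one-item list.

    Hypothetical stand-in for a language modelling loader: the dict plays
    the role of a torchtext Example with a single 'text' attribute.
    """
    with open(file_path, mode="r", encoding="utf-8") as f:
        text = f.read()
    return [{"text": tokenize(text)}]


# Small demonstration on a temporary file
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")

generative_examples = load_generative_examples(tmp.name)
os.remove(tmp.name)
```

Keeping the whole file as one long token sequence is a common choice for language modelling, since batches are usually cut from a continuous stream; whether textformer does exactly this is not stated here.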

class textformer.datasets.TranslationDataset(file_path, extensions, fields, **kwargs)
Bases: torchtext.data.Dataset
The TranslationDataset class loads (source, target) texts and builds machine translation datasets for translation tasks.

__init__(self, file_path, extensions, fields, **kwargs)
Creates a TranslationDataset, used for text translation.
- Parameters
file_path (str) – Path to the file that will be loaded.
extensions (tuple) – Extensions to the path for each language.
fields (tuple) – Tuple of datatype instructions for tensor conversion.
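
The description of `extensions` suggests that each language file is found by appending its extension to the shared `file_path`. A minimal sketch of that idea follows; the helper name `resolve_language_paths` and the example file names are hypothetical, and the plain string concatenation is an assumption rather than confirmed textformer behaviour.

```python
import os


def resolve_language_paths(file_path, extensions):
    """Appends each language extension to the shared path prefix."""
    return tuple(os.path.expanduser(file_path + ext) for ext in extensions)


# Hypothetical English/German pair sharing the prefix 'data/train'
source_path, target_path = resolve_language_paths("data/train", (".en", ".de"))
```

Under this assumption, a single prefix plus a pair of extensions is enough to locate both sides of a parallel corpus.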

_load_data(self, source_path, target_path, fields)
Loads text files and creates a list of torchtext Example objects.
- Parameters
source_path (str) – Path to the source file that will be loaded.
target_path (str) – Path to the target file that will be loaded.
fields (tuple) – Tuple of datatype instructions for tensor conversion.
- Returns
The loaded and pre-processed source and target texts as a list of Example objects.
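
To make the pairing of source and target lines concrete, here is a standard-library sketch of reading two aligned files line by line. The function `load_parallel_examples`, the whitespace tokenization, the skipping of empty pairs, and the dictionaries standing in for torchtext Example objects are assumptions for illustration only.

```python
import os
import tempfile


def load_parallel_examples(source_path, target_path):
    """Pairs the lines of two aligned files, skipping empty pairs."""
    examples = []
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            source_line, target_line = source_line.strip(), target_line.strip()
            # Keeps a pair only when both sides contain text
            if source_line and target_line:
                examples.append(
                    {"source": source_line.split(), "target": target_line.split()}
                )
    return examples


# Small demonstration with two temporary, line-aligned files
with tempfile.TemporaryDirectory() as tmp_dir:
    en_path = os.path.join(tmp_dir, "toy.en")
    de_path = os.path.join(tmp_dir, "toy.de")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("hello world\n\ngood morning\n")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("hallo welt\n\nguten morgen\n")
    parallel_examples = load_parallel_examples(en_path, de_path)
```

Iterating with `zip` stops at the shorter file, so this sketch silently ignores trailing lines when the two files differ in length.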

classmethod splits(cls, file_path, extensions, fields, path=None, train='train', validation='val', test='test', **kwargs)
Creates TranslationDataset objects, used for text translation.
- Parameters
file_path (str) – Path to the file that will be loaded.
extensions (tuple) – Extensions to the path for each language.
fields (tuple) – Tuple of datatype instructions for tensor conversion.
train (str) – Prefix for the training data.
validation (str) – Prefix for the validation data.
test (str) – Prefix for the test data.
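
The `train`, `validation` and `test` prefixes can be pictured as file name stems, one per split. The sketch below shows one plausible way to turn them into path prefixes; the helper `resolve_split_prefixes`, the use of `os.path.join` with `path`, and the rule that a prefix of `None` skips the split are assumptions, and how `file_path` and `path` interact in textformer is not specified here.

```python
import os


def resolve_split_prefixes(path=None, train="train", validation="val", test="test"):
    """Maps each split name to a path prefix, skipping splits set to None."""
    root = path or ""
    prefixes = {"train": train, "validation": validation, "test": test}
    return {
        name: os.path.join(root, prefix)
        for name, prefix in prefixes.items()
        if prefix is not None
    }


# Hypothetical layout: data/train.* and data/val.*, with no test split
split_prefixes = resolve_split_prefixes(path="data", test=None)
```

Each resulting prefix could then be combined with the language extensions to locate the source and target files of that split.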