Data module

class Data.Template

Bases: object

Get the messages, and call this function to load them in the format required by the model.

Returns: tokenized prompt or label

Return type: dict

Parameters:
  • messages (dict) – the conversations between question and answer

  • tokenizer (transformers.PreTrainedTokenizer) – tokenize prompt

  • mode (str) – train or eval

class Data.TextPreprocess

Bases: object

Call Template to load and process text data.

init(tokenizer, version)

Initialize tokenizer and template.

Parameters:
  • tokenizer (transformers.PreTrainedTokenizer) – Tokenize prompt

  • version – The version of the template

call(messages, mode='train')

Call template.encode to process text data.

Returns : tokenized prompt or label

Return type : dict

Parameters:
  • messages (dict) – the conversations between human and gpt

  • mode (str) – train or eval

class Data.ImagePreprocess

Bases: object

To preprocess images and adjust them to different aspect ratios and resolutions.

init(image_processor, data_args)

Initialize image processor, image aspect ratios and image resolutions.

Parameters:
  • image_processor – the processor to process image

  • data_args (dcit) – data arguments

call(image)

Preprocess images according to data arguments.

Returns : a tensor containing the processed image patches

Return type : tensor

Parameters:

image (PIL.Image) – the input image to be processed

class Data.LazySupervisedDataset

Bases: object

Dataset for supervised fine-tuning.

init(data_path, tokenizer, data_args)

Initialize tokenizer and the preprocessor of text and image.

Parameters:
  • data_path (str) – path to data

  • tokenizer (transformers.PreTrainedTokenizer) – tokenize data

  • data_args (dict) – data arguments

class Data.DataCollatorForSupervisedDataset

Bases: object

call(instances)

Collate examples for supervised fine-tuning.

Returns: a batch containing input_ids, label, attention_mask

Return type: dict

Parameters:

instances (list(dict)) – a list of instance

Data.make_supervised_data_module(tokenizer, data_args)

Make dataset and collator for supervised fine-tuning.

Returns: a dict containing train_dataset and data_collator

Return type: dict

Parameters:
  • tokenizer (transformers.PreTrainedTokenizer) – tokenize data

  • data_args (dict) – data arguments