zensols.lmtask package¶
Submodules¶
zensols.lmtask.app module¶
Large langauage model experimentation.
- class zensols.lmtask.app.Application(config_factory, task_factory)[source]¶
Bases:
objectLarge langauage model experimentation.
- __init__(config_factory, task_factory)¶
-
config_factory:
ConfigFactory¶ Used to create training resources.
- dataset_sample(max_sample=1)[source]¶
Print sample(s) of the configured (
--config) dataset.- Parameters:
max_sample (
int) – the number of sample to print
- instruct(task_name, instruction, role=None, output_format=None)[source]¶
Generate text by inferencing with the model.
- show_task(task_name=None)[source]¶
Print the configuration of a task if
--nameis given, otherise a list of available tasks.- Parameters:
task_name (
str) – the task that creates the prompt and parses the result
- show_trainer(long_output=False)[source]¶
Print configuration and dataset stats of the configured (
--config) trainer.- Parameters:
long_output (
bool) – verbosity
-
task_factory:
TaskFactory¶ Create tasks used to fullfill CLI requests.
- class zensols.lmtask.app.PrototypeApplication(config_factory, app, prompt='Once upon a time, in a galaxy, far far away,')[source]¶
Bases:
objectUsed by the Python REPL for prototyping.
- CLI_META = {'is_usage_visible': False}¶
- __init__(config_factory, app, prompt='Once upon a time, in a galaxy, far far away,')¶
-
app:
Application¶
-
config_factory:
ConfigFactory¶
zensols.lmtask.cli module¶
Command line entry point to the application.
- class zensols.lmtask.cli.ApplicationFactory(*args, **kwargs)[source]¶
Bases:
ApplicationFactory
zensols.lmtask.dataset module¶
An implementation of a dataset generator task.TaskDatasetFactory.
- class zensols.lmtask.dataset.LoadedTaskDatasetFactory(task, text_field='text', messages_field='messages', eval_field='text', source=None, load_args=<factory>, pre_process=None, post_process=None)[source]¶
Bases:
TaskDatasetFactoryA utility class meant to be created from an application configuration. This class creates a dataframe used by
Trainerand optionally does post processing (i.e. filtering and mapping).- __init__(task, text_field='text', messages_field='messages', eval_field='text', source=None, load_args=<factory>, pre_process=None, post_process=None)¶
-
post_process:
Union[str,Callable] = None¶ Code to call after the dataset is created and the task has applied any template.
- See:
zensols.lmtask.generate module¶
Facade to HuggingFace text generation.
- class zensols.lmtask.generate.CachingGenerator(_delegate, _stash, _hasher=<factory>)[source]¶
Bases:
TextGeneratorA generator that caches response using a hash of the model input as a key.
- __init__(_delegate, _stash, _hasher=<factory>)¶
- class zensols.lmtask.generate.ConstantTextGenerator(config_factory, response, post_init_source=None)[source]¶
Bases:
TextGeneratorA generator that responses with
responsewith every generation call for the purpose of debugging.- __init__(config_factory, response, post_init_source=None)¶
-
config_factory:
ConfigFactory¶ Used to set optional mock attributes in
post_init_source.
- class zensols.lmtask.generate.GenerateTask(name, description, request_class, response_class, generator, resource, train_add_eos=False)[source]¶
Bases:
TaskUses a
TextGenerator(generator) to generate a response.- __init__(name, description, request_class, response_class, generator, resource, train_add_eos=False)¶
-
generator:
TextGenerator¶ A client facade of a chat or instruct-based large language model.
-
resource:
GeneratorResource¶ The class that creates resources such as the tokenizer and model. This should be the base model resource so training tasks do not depend on the model they will eventually create.
This is also used by
InstructTaskfor its chat template.
- class zensols.lmtask.generate.GeneratorOutput(model_output, parsed)[source]¶
Bases:
DictableContainer instances of model output from
TextGenerator.- __init__(model_output, parsed)¶
- class zensols.lmtask.generate.GeneratorResource(name, model_id, model_class=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, tokenizer_class=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, peft_model_id=None, peft_model_class=<class 'peft.auto.AutoPeftModelForCausalLM'>, model_desc=None, system_role_name='system', model_args=<factory>)[source]¶
Bases:
DictableA client facade of a chat-based large language model.
- __init__(name, model_id, model_class=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, tokenizer_class=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, peft_model_id=None, peft_model_class=<class 'peft.auto.AutoPeftModelForCausalLM'>, model_desc=None, system_role_name='system', model_args=<factory>)¶
- configure_tokenizer(tokenizer)[source]¶
Make any necessary updates programatically (i.e. set special tokens).
- classmethod get_model_path(model_id, parent=None)[source]¶
Create a normalized file name from a HF model ID string useful for creating checkpoint directory names.
- property model: PreTrainedModel¶
The LLM.
- model_class¶
The class used to create the model with
from_pretrained().alias of
AutoModelForCausalLM
- property model_file_name: str¶
A normalized file name friendly string based on
model_desc.
- peft_model_class¶
The class used to create the model with
from_pretrained().alias of
AutoPeftModelForCausalLM
-
peft_model_id:
Union[str,Path] = None¶ The HF model ID or path to the Peft model or
Noneif there is none.
- property tokenizer: PreTrainedTokenizer¶
The model’s tokenzier.
- class zensols.lmtask.generate.ModelTextGenerator(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>)[source]¶
Bases:
TextGeneratorAn implementation that uses HuggingFace framework classes from
GeneratorResourceto answer queries.- __init__(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>)¶
-
resource:
GeneratorResource¶ The class that creates resources such as the tokenizer and model.
- stream(prompt, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, width=80)[source]¶
Stream the model’s output from a
promptinput.- Parameters:
prompt (
str) – the input to give to the modelwriter (
TextIOBase) – the data sinkwidth (
int) – the maximum width of each line’s streamed text; ifNone, no modification will be done on the text output
- class zensols.lmtask.generate.ReplaceTextGenerator(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>, replacements=())[source]¶
Bases:
ModelTextGeneratorA text generator that generates response by replacing regular expressions. This is helpful for removing special tokens.
- __init__(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>, replacements=())¶
zensols.lmtask.hf module¶
HuggingFace trainer wrapper.
- class zensols.lmtask.hf.HFTrainerResource(model_args=None, cache=True, generator_resource=None, peft_config=None)[source]¶
Bases:
TrainerResourceUses
HuggingFaceTrainerfor training the model.- __init__(model_args=None, cache=True, generator_resource=None, peft_config=None)¶
-
generator_resource:
GeneratorResource= None¶ The resource used to the source checkpoint.
-
peft_config:
LoraConfig= None¶ The Peft low rank adapters configuration.
- class zensols.lmtask.hf.HuggingFaceTrainer(config, resource, train_params, eval_params, train_source, eval_source, peft_output_dir, merged_output_dir)[source]¶
Bases:
TrainerThe HuggingFace trainer.
- __init__(config, resource, train_params, eval_params, train_source, eval_source, peft_output_dir, merged_output_dir)¶
zensols.lmtask.instruct module¶
Task implementations.
- class zensols.lmtask.instruct.InstructModelTextGenerator(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>, replacements=())[source]¶
Bases:
ReplaceTextGeneratorA generator that uses instruct based models for inference.
- __init__(resource, tokenize_params=<factory>, tokenize_decode_params=<factory>, generate_params=<factory>, replacements=())¶
- class zensols.lmtask.instruct.InstructTask(name, description, request_class, response_class, generator, resource, train_add_eos=False, role='You are a helpful assistant.', train_template='### Question: {{ instruction }}\\n### Answer: {{ output }}', inference_template='{{request.instruction}}', chat_template_args=<factory>, apply_chat_template=True, train_apply_chat_template=False)[source]¶
Bases:
GenerateTaskA task that is resolved using instructions given to the language model.
Important: If
InstructTaskRequest.model_inputis non-Nonethat value is used verbatim andInstructTaskRequest.instructionis ignored.- __init__(name, description, request_class, response_class, generator, resource, train_add_eos=False, role='You are a helpful assistant.', train_template='### Question: {{ instruction }}\\n### Answer: {{ output }}', inference_template='{{request.instruction}}', chat_template_args=<factory>, apply_chat_template=True, train_apply_chat_template=False)¶
-
apply_chat_template:
bool= True¶ Whether format the prompt into one that conforms to the model’s instruct syntax.
-
inference_template:
Union[str,Path] = '{{request.instruction}}'¶ The instructions given to
generator.
-
train_apply_chat_template:
bool= False¶ Like
apply_chat_template, but whether to apply during training. If this isFalse, a conversationalmessageswith dictionary list is used instead.
-
train_template:
Union[str,Path] = '### Question: {{ instruction }}\n### Answer: {{ output }}'¶ Used to create format the datasets training text
generator.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.instruct.InstructTaskRequest(model_input=None, instruction=None)[source]¶
Bases:
TaskRequestA request that has a query portion to be added to the compiled prompt.
- __init__(model_input=None, instruction=None)¶
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_instruction=True)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.instruct.NShotTaskRequest(model_input=None, instruction=None, examples=None)[source]¶
Bases:
InstructTaskRequestA request that adds training examples to the prompt.
- __init__(model_input=None, instruction=None, examples=None)¶
zensols.lmtask.llama module¶
Interactive chat interfaces, which are superset to chat templates.
- class zensols.lmtask.llama.LlamaGeneratorResource(name, model_id, model_class=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, tokenizer_class=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, peft_model_id=None, peft_model_class=<class 'peft.auto.AutoPeftModelForCausalLM'>, model_desc=None, system_role_name='system', model_args=<factory>)[source]¶
Bases:
GeneratorResourceThere are 4 different roles that are supported by Llama text models:
system: Sets the context in which to interact with the AI model. Ittypically includes rules, guidelines, or necessary information that help the model respond effectively.
user: Represents the human interacting with the model. It includes theinputs, commands, and questions to the model.
ipython: A new role introduced in Llama 3.1. Semantically, this rolemeans “tool”. This role is used to mark messages with the output of a tool call when sent back to the model from the executor.
assistant: Represents the response generated by the AI model based onthe context provided in the system, ipython and user prompts.
- __init__(name, model_id, model_class=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, tokenizer_class=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, peft_model_id=None, peft_model_class=<class 'peft.auto.AutoPeftModelForCausalLM'>, model_desc=None, system_role_name='system', model_args=<factory>)¶
zensols.lmtask.task module¶
Task implementations.
- class zensols.lmtask.task.JSONTaskResponse(request, model_output_raw, model_output, robust_json=True)[source]¶
Bases:
TaskResponseA task that parses the responses as JSON. The JSON is parsed as much as possible and does not raise errors when the json is incomplete.
- __init__(request, model_output_raw, model_output, robust_json=True)¶
- property model_output_json: Failure | str¶
The
responseattribute parsed as JSON.- Raises:
json.decoder.JSONDecodeError – if the JSON failed to parse
- See:
obj:robust_json
- robust_json: bool = True¶
Whether to return
Failurefrommodel_output_jsoninstead of raising from parse failures.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_request=False, include_model_output=False, include_json=True)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.task.Task(name, description, request_class, response_class)[source]¶
Bases:
DictableSubclasses turn a prompt and query into a response from an LLM.
- __init__(name, description, request_class, response_class)¶
- prepare_dataset(ds, factory)[source]¶
Massage the any data for training necessary to train this task. This might involve apply templates and/or adding terminating tokens.
- Return type:
Dataset
- prepare_request(request)[source]¶
Return a request with the contents populated with a formatted prompt.
- Return type:
- process(request)[source]¶
Invoke the
generatorto query the LLM, then return a JSON formatted data.- Parameters:
query – a query that is phrased with the assumption that JSON is given as a response
- Return type:
-
request_class:
Type[TaskRequest]¶ The response data.
-
response_class:
Type[TaskResponse]¶ The response data.
- class zensols.lmtask.task.TaskDatasetFactory(task, text_field='text', messages_field='messages', eval_field='text')[source]¶
Bases:
DictableSubclasses create a dataframes used by
Trainerand optionally does post processing (i.e. filtering and mapping).- __init__(task, text_field='text', messages_field='messages', eval_field='text')¶
- create()[source]¶
Create a new dataset based on
source.- Return type:
Dataset- Returns:
the new dataset after modification by
post_process
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- exception zensols.lmtask.task.TaskDatasetFactoryError(message, prompt=None)[source]¶
Bases:
TaskErrorRaised when
TaskDatasetFactoryinstances can not create datasets.- __module__ = 'zensols.lmtask.task'¶
- exception zensols.lmtask.task.TaskError(message, prompt=None)[source]¶
Bases:
APIErrorRaised for any LLM specific error in this API.
- __annotations__ = {}¶
- __module__ = 'zensols.lmtask.task'¶
- class zensols.lmtask.task.TaskFactory(config_factory, _task_pattern)[source]¶
Bases:
DictableCreates instances of
Taskusingcreate().- __init__(config_factory, _task_pattern)¶
-
config_factory:
ConfigFactory¶ The factory that creates tasks.
- create(name)[source]¶
Create a new instance of a task with
nameper the app config.- See:
- Return type:
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, short=False)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.task.TaskObject[source]¶
Bases:
PersistableContainer,DictableBase class for task requests and responses.
- __init__()¶
- class zensols.lmtask.task.TaskRequest(model_input=None)[source]¶
Bases:
TaskObjectThe input request to the LLM via
Task.process(). In most cases, obj:model_input can be used to skip the prompt compilation step.- __init__(model_input=None)¶
- model_input: str = None¶
The text given verbatim to the model. This is some combination of both
quertyandprompt.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.task.TaskResponse(request, model_output_raw, model_output)[source]¶
Bases:
TaskObjectThe happy-path response given by
Task.- __init__(request, model_output_raw, model_output)¶
- model_output: str¶
This task instance’s parsed response text given by the model.
- model_output_raw: str¶
The model output verbatim.
- request: TaskRequest¶
The request used to generated this response.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_request=False, include_model_output=True, include_model_output_raw=False)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
zensols.lmtask.train module¶
Continued Pretraining and supervised fine-tuning training.
- class zensols.lmtask.train.ModelResult(train_output, output_dir=None, train_params=None, config=None)[source]¶
Bases:
DictableThe trained model config, location and configuration used to train it.
- __init__(train_output, output_dir=None, train_params=None, config=None)¶
-
config:
Configurable= None¶ The application configuration used to configure the trainer.
- property global_step: int¶
The global step from
train_output.
- property metrics: Dict[str, float]¶
Training metrics from
train_output.
-
train_output:
TrainOutput¶ The output returned from the trainer.
- property training_loss: float¶
The training loss from
train_output.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_training_arguments=False, include_config=False)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.train.Trainer(config, resource, train_params, eval_params, train_source, eval_source, peft_output_dir, merged_output_dir)[source]¶
Bases:
DictableAn
UnslothTrainerwrapper.- __init__(config, resource, train_params, eval_params, train_source, eval_source, peft_output_dir, merged_output_dir)¶
-
config:
Configurable¶ Used to save to the model result.
-
eval_source:
TaskDatasetFactory¶ A factory that creates new datasets used to evaluation.
-
resource:
TrainerResource¶ Used to create the model and tokenzier.
-
train_source:
TaskDatasetFactory¶ A factory that creates new datasets used to train using this instance.
- write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, include_training_arguments=False)[source]¶
Write this instance as either a
Writableor as aDictable. If class attribute_DICTABLE_WRITABLE_DESCENDANTSis set asTrue, then use thewrite()method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating adictrecursively usingasdict(), then formatting the output.If the attribute
_DICTABLE_WRITE_EXCLUDESis set, those attributes are removed from what is written in thewrite()method.Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.
- Parameters:
depth (
int) – the starting indentation depthwriter (
TextIOBase) – the writer to dump the content of this writable
- class zensols.lmtask.train.TrainerResource(model_args=None, cache=True)[source]¶
-
Configures and instantiates the base mode, PEFT mode, and the tokenizer.
- __init__(model_args=None, cache=True)¶
- property model: PreTrainedModel¶
The base model.
- property peft_model: PeftModelForCausalLM¶
The PEFT (Parameter-Efficient Fine-Tuning) such as LoRA.
- property tokenizer: PreTrainedTokenizer¶
The base tokenizer.