# Workflow

1. **Base Model Initialization:**

* A foundation model (e.g., Llama 3, Mistral, or Falcon) is loaded into GPU memory.
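To get a feel for how much GPU memory the base weights occupy, the arithmetic can be sketched as below. The helper name `weight_memory_gib` and the 8-billion parameter count are illustrative assumptions, not values from this page, and the estimate covers weights only (no activations or KV cache).

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Hypothetical 8-billion-parameter base model (assumed size for illustration).
NUM_PARAMS = 8_000_000_000

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # roughly 14.9 GiB
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # roughly 3.7 GiB

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

The same function shows why lower-precision weights leave more room for adapters: the 4-bit estimate is a quarter of the 16-bit one.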

2. **Dynamic LoRA Adapter Retrieval:**

* When a request specifies a fine-tuned adapter, the system dynamically loads it from Hugging Face, Predibase, or a local directory.
* The adapter is merged with the base model in real time, as part of handling the request.
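The retrieval step can be pictured as a lookup that fetches an adapter on first use and reuses it afterwards. The following is a minimal sketch under stated assumptions: `AdapterStore`, the `"hub"` and `"local"` source names, and the stub fetchers are all hypothetical and do not reflect the actual OpenLoRA API.

```python
from typing import Callable, Dict, Tuple


# Hypothetical fetchers. Real ones would download from a model hub
# or read adapter weights from disk; these stubs return placeholders.
def fetch_hub(adapter_id: str) -> dict:
    return {"source": "hub", "id": adapter_id}


def fetch_local(adapter_id: str) -> dict:
    return {"source": "local", "id": adapter_id}


class AdapterStore:
    """Loads adapters lazily, the first time a request names them."""

    def __init__(self, fetchers: Dict[str, Callable[[str], dict]]) -> None:
        self._fetchers = fetchers
        self._loaded: Dict[Tuple[str, str], dict] = {}
        self.fetch_count = 0  # how many real fetches happened

    def get(self, source: str, adapter_id: str) -> dict:
        key = (source, adapter_id)
        if key not in self._loaded:
            if source not in self._fetchers:
                raise ValueError(f"unknown adapter source: {source!r}")
            self._loaded[key] = self._fetchers[source](adapter_id)
            self.fetch_count += 1
        return self._loaded[key]


store = AdapterStore({"hub": fetch_hub, "local": fetch_local})
first = store.get("hub", "example-org/example-adapter")   # fetched
second = store.get("hub", "example-org/example-adapter")  # reused
```

Keying on both source and identifier keeps a hub adapter and a local adapter with the same name from colliding.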

3. **Merging & Activation:**

* LoRA adapters are merged into the base model using optimized kernel operations.
* Multiple adapters can be combined for ensemble inference.
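Merging follows the standard LoRA formulation: an adapter stores two low-rank matrices `A` and `B`, and the merged weight is `W + (alpha / r) * B @ A`. The sketch below shows that arithmetic in plain Python for readability; it is not the optimized kernel path. The weighted sum used to combine several adapters is an assumption about how an ensemble might be formed, not a description of the system's method.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive matrix product of a (m x k) and b (k x n)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lora_delta(A: Matrix, B: Matrix, alpha: float, rank: int) -> Matrix:
    """Weight update contributed by one adapter: (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def merge_adapters(W: Matrix, deltas: List[Matrix], weights: List[float]) -> Matrix:
    """Return W plus a weighted sum of adapter deltas (W is not modified)."""
    merged = [row[:] for row in W]
    for delta, w in zip(deltas, weights):
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += w * v
    return merged


# Toy example: 2x2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[3.0, 4.0]]          # shape (r, d_in)
B = [[1.0], [2.0]]        # shape (d_out, r)

delta = lora_delta(A, B, alpha=2.0, rank=1)
merged = merge_adapters(W, [delta], [1.0])
print(merged)  # [[7.0, 8.0], [12.0, 17.0]]
```

Because the update is additive, the same delta can later be subtracted to restore the original base weights.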

4. **Inference Execution & Token Streaming:**

* The merged model generates responses with token streaming for low-latency output.
* Quantization reduces the memory footprint of the model while aiming to preserve accuracy.
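Token streaming means each token is handed to the caller as soon as it is produced, instead of waiting for the full response. A minimal sketch using a Python generator is shown below; `stream_tokens` and the toy `toy_next_token` function are hypothetical stand-ins for a real decoding loop and model.

```python
from typing import Callable, Iterator, List


def stream_tokens(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int,
    eos: str = "<eos>",
) -> Iterator[str]:
    """Yield tokens one at a time until EOS or the token budget is reached."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos:
            return
        context.append(token)
        yield token  # the caller can forward this immediately


def toy_next_token(context: List[str]) -> str:
    """Toy stand-in for a model: emits the context length, then stops."""
    return "<eos>" if len(context) >= 4 else str(len(context))


for tok in stream_tokens(toy_next_token, ["hello"], max_new_tokens=10):
    print(tok, flush=True)
```

The consumer sees the first token after one decoding step, which is where the latency benefit comes from.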

5. **Request Completion & Adapter Eviction:**

* Once inference is complete, the adapter is unloaded to free GPU memory.
* Because only the adapters in use occupy GPU memory, this process allows thousands of fine-tuned models to be served from a single base model without memory bottlenecks.
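Eviction can be sketched as a bounded cache of resident adapters. Note that the text above describes unloading an adapter as soon as its request completes; the sketch below swaps in a least-recently-used (LRU) policy instead, so that recently used adapters stay resident until space is needed. `AdapterCache`, its `capacity` parameter, and the `load_fn` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, List


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load_fn = load_fn
        self._resident: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # eviction log, for illustration only

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(adapter_id)
            return self._resident[adapter_id]
        # Make room before loading the new adapter.
        while len(self._resident) >= self.capacity:
            evicted_id, _ = self._resident.popitem(last=False)
            self.evicted.append(evicted_id)
        adapter = self._load_fn(adapter_id)
        self._resident[adapter_id] = adapter
        return adapter

    def resident_ids(self) -> List[str]:
        return list(self._resident)


cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: {"id": adapter_id})
for adapter_id in ["a", "b", "a", "c"]:
    cache.get(adapter_id)

print(cache.resident_ids())  # ['a', 'c']
print(cache.evicted)         # ['b']
```

With a fixed capacity, GPU memory use for adapters stays bounded no matter how many distinct adapters are requested over time.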

