LiteLLM: A Lightweight Wrapper for Multi-Provider LLMs


Summary

In this post I will cover LiteLLM. I used it for my implementation of Textgrad, and it also came up in the blog posts I wrote about Agents.

Working with multiple LLM providers is painful. Every provider has its own API, its own pricing model, and its own maintenance overhead, and each one needs a custom integration. LiteLLM solves this by offering a single, unified API that lets developers switch between OpenAI, Hugging Face, Cohere, Anthropic, and others without modifying their code.

If a provider becomes too expensive or does not support the functionality you need, you can swap it out for something new.

This approach lets you focus on your own code while LiteLLM takes care of the specifics of interfacing with different LLM providers.


Why Use LiteLLM?

1. Unified API for Multiple Providers

LiteLLM provides a consistent interface to interact with multiple LLM APIs, eliminating the need to write separate code for each provider.

2. Cost Optimization

It allows automatic fallback to cheaper or faster models when necessary, optimizing API costs and performance.

3. Seamless Model Switching

With LiteLLM, switching from one model provider to another is as simple as changing a parameter.

4. Load Balancing and Routing

LiteLLM supports model load balancing, routing requests across multiple providers for improved efficiency and availability (see the Router sketch after this list).

5. Custom Endpoints

You can define and use custom API endpoints, making LiteLLM a great tool for self-hosted models.
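
To make the load-balancing point concrete, here is a minimal sketch using LiteLLM's Router. The "my-chat-model" alias and the two deployments are placeholders I picked for illustration:

from litellm import Router

# Two deployments registered under the same alias; the Router spreads
# requests across them.
router = Router(model_list=[
    {
        "model_name": "my-chat-model",                   # alias used by callers
        "litellm_params": {"model": "gpt-4o-mini"},      # deployment 1: OpenAI
    },
    {
        "model_name": "my-chat-model",
        "litellm_params": {"model": "ollama/llama3.2"},  # deployment 2: local Ollama
    },
])

response = router.completion(
    model="my-chat-model",   # the Router picks one of the deployments
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

By default the Router shuffles requests across the deployments registered under the same alias; other routing strategies can be configured.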


Getting Started with LiteLLM

Basic Usage

To use LiteLLM, call completion() with the model you want to talk to:

from litellm import completion

response = completion(
    model="ollama/llama3.2",   # a local model served through Ollama
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=False
)

print(response.choices[0].message.content)
I'm just a language model, so I don't have feelings or emotions like humans do, but thank you for asking! How can I assist you today?

Using Multiple Providers

You can easily switch between different model providers:

from litellm import completion


response = completion(
    model="gpt-4o",  # Using OpenAI
    messages=[{"role": "user", "content": "Summarize the latest AI research trends."}]
)
print(response["choices"][0]["message"]["content"])
As of the latest research trends, several key areas have garnered significant attention in the field of artificial intelligence:

1. **Generative AI and Large Language Models (LLMs):** The development and application of generative AI, particularly LLMs like OpenAI's GPT series and 
...

Generating Embeddings

from litellm import embedding

# I called this inside a notebook
import nest_asyncio
nest_asyncio.apply()


response = embedding(
    model="ollama/nomic-embed-text",
    api_base="http://localhost:11434",   # local Ollama server
    input=["good morning from litellm"]
)

print(response)
EmbeddingResponse(model='ollama/nomic-embed-text', data=[{'object': 'embedding', 'index': 0, 'embedding': [...]}], 
object='list', usage=Usage(completion_tokens=6, prompt_tokens=6, total_tokens=6, 
completion_tokens_details=None, prompt_tokens_details=None))
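
The vector itself sits inside response.data, as the printed structure above shows:

# Pull the raw embedding vector out of the response (structure shown above).
vector = response.data[0]["embedding"]
print(len(vector))   # dimensionality of the embedding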

Advanced Features

1. Fallback Mechanism

If one provider fails, LiteLLM can automatically fall back to another by passing a fallbacks list:

import litellm

response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Define AI."}],
    fallbacks=["claude-v1", "mistral-7b"]  # tried in order if the primary model fails
)

2. Streaming Responses

LiteLLM supports streaming responses, which reduces the time to the first token:

import litellm

for chunk in litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True
):
    delta = chunk["choices"][0]["delta"]["content"]  # streamed chunks carry a delta, not a full message
    if delta:
        print(delta, end="")

3. Batch Requests

Send multiple queries simultaneously for efficiency:

import litellm

responses = litellm.batch_completion(
    model="gpt-4",
    messages=[
        [{"role": "user", "content": "Define AI."}],
        [{"role": "user", "content": "What is the speed of light?"}]
    ]
)

for r in responses:
    print(r["choices"][0]["message"]["content"])

4. Custom Endpoints for Self-Hosted Models

If you’re running an open-source LLM on your own infrastructure, you can integrate it with LiteLLM. Any self-hosted server that exposes an OpenAI-compatible HTTP endpoint (for example a llama.cpp or GPTQ-backed server) can be addressed by pointing LiteLLM at that endpoint via api_base:

import litellm

response = litellm.completion(
    model="openai/my_local_model",         # the "openai/" prefix routes to any OpenAI-compatible server
    api_base="http://localhost:5000/v1",   # your self-hosted endpoint
    messages=[{"role": "user", "content": "Translate to French: 'Hello'"}]
)
print(response["choices"][0]["message"]["content"])

Custom Callbacks

Callbacks allow you to track API calls, measure latency, log failures, or modify requests before sending them. This is useful for monitoring API usage in production applications.

Here is how to add a custom callback handler to LiteLLM.


import litellm
from litellm.integrations.custom_logger import CustomLogger
from litellm import completion

import logging
logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    filemode='w',
                    filename='litellm.log')
logger = logging.getLogger(__name__)


# Custom Logger 
class MyCustomHandler(CustomLogger):
    def log_pre_api_call(self, model, messages, kwargs):
        logger.info("Pre-API Call")
        ...

# initialize the handler
customHandler = MyCustomHandler()
# pass the handler to the callback
litellm.callbacks = [customHandler]

response = completion(
    model="ollama/llama3.2",
    messages=[{ "content": "Write some python code to print the contents of a file.","role": "user"}],
    stream=False
)
print(response.choices[0].message.content)

In the log file (litellm.log) you will see:

2025-02-24 23:18:32,187 - __main__ - INFO - Pre-API Call

Callback Functions

If you just want to log on a specific event (e.g. on input), you can use callback functions instead of a full handler.

You can set custom callbacks to trigger for:

litellm.input_callback - track inputs/transformed inputs before making the LLM API call
litellm.success_callback - track inputs/outputs after making the LLM API call
litellm.failure_callback - track inputs/outputs plus exceptions for LiteLLM calls

def custom_callback(
    kwargs,                 # kwargs to completion
    completion_response,    # response from completion
    start_time, end_time    # start/end time
):
    # Your custom code here
    print("LITELLM: in custom callback function")
    print("kwargs", kwargs)
    print("completion_response", completion_response)
    print("start_time", start_time)
    print("end_time", end_time)


import litellm
litellm.success_callback = [custom_callback]
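
With the callback registered, any successful completion call will now invoke it. A quick check, reusing the local Ollama model from earlier:

from litellm import completion

response = completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Say hello."}]
)
# custom_callback prints its kwargs, the response, and the start/end times once the call returns.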

Use Cases

1. Building AI-Powered Chatbots

Easily integrate LLMs into chatbot applications with failover mechanisms to ensure reliability.

2. Cost-Optimized AI Applications

Use a mix of free and paid models, switching dynamically based on cost and performance needs (see the cost-tracking sketch after this list).

3. Enterprise AI Deployment

Organizations can route queries across different LLM providers to ensure uptime and efficiency.

4. Research and Development

Experiment with various LLMs without rewriting API calls for each provider.
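
For the cost-optimization use case, LiteLLM can also report what a call cost. A minimal sketch, assuming a paid model that LiteLLM has pricing data for (the model choice is just an example):

from litellm import completion, completion_cost

response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Define AI."}]
)

# completion_cost looks up per-token pricing for the model that produced the response.
cost = completion_cost(completion_response=response)
print(f"Cost of this call: ${cost:.6f}")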


LiteLLM provides an elegant way to simplify LLM interactions, reduce API complexity, and optimize costs. Another key point is that it is actively maintained and supports a large number of the current LLM providers.