As AI developers, we often need to call several different large language models, but each one has its own API specification and access method, which makes integration cumbersome. Building a unified large-model integration platform can greatly simplify this work.
This article discusses how to implement a large-model integration platform that is compatible with the OpenAI API specification, focusing on the implementation of its two core endpoints: /v1/models and /v1/chat/completions.
Architectural design
First, we need to design a clear architecture to unify different large model APIs into a standard interface:
┌─────────────────────────────────────────────────────────────────────┐
│             Unified Interface Layer (OpenAI compatible)             │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────┐
│                  Routing and load balancing layer                   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
┌──────────▼──────────┐ ┌──────────▼──────────┐ ┌──────────▼──────────┐
│   Model Adapter A   │ │   Model Adapter B   │ │   Model Adapter C   │
│  (OpenAI adapter)   │ │  (Claude adapter)   │ │(local model adapter)│
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
           │                       │                       │
┌──────────▼──────────┐ ┌──────────▼──────────┐ ┌──────────▼──────────┐
│     OpenAI API      │ │     Claude API      │ │     Local Model     │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
Core component implementation
We use Python and the Flask framework to implement the key parts of this platform.
1. Project structure
one_api/
├── app.py                 # Main application entry point
├── config.py              # Configuration file
├── models/                # Model-related code
│   ├── __init__.py
│   ├── registry.py        # Model registry
│   └── adapters/          # Model adapters
│       ├── __init__.py
│       ├── base.py            # Base adapter interface
│       ├── openai_adapter.py  # OpenAI adapter
│       ├── claude_adapter.py  # Claude adapter
│       └── local_adapter.py   # Local model adapter
├── api/                   # API routes
│   ├── __init__.py
│   ├── models.py          # /v1/models implementation
│   └── chat.py            # /v1/chat/completions implementation
└── utils/                 # Utility functions
    ├── __init__.py
    ├── auth.py            # Authentication
    └── rate_limit.py      # Rate limiting
2. Basic adapter interface
First, define a base adapter interface that all model adapters must implement:
# models/adapters/base.py
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional


class BaseModelAdapter(ABC):
    """Base class for all model adapters"""

    @abstractmethod
    def list_models(self) -> List[Dict[str, Any]]:
        """Return the list of models supported by this adapter"""
        pass

    @abstractmethod
    async def generate_completion(self,
                                  model: str,
                                  messages: List[Dict[str, str]],
                                  temperature: Optional[float] = None,
                                  top_p: Optional[float] = None,
                                  max_tokens: Optional[int] = None,
                                  stream: bool = False,
                                  **kwargs) -> Dict[str, Any]:
        """Generate a chat completion result"""
        pass

    @abstractmethod
    def get_model_info(self, model_id: str) -> Dict[str, Any]:
        """Get details of a specific model"""
        pass
Notes:
- The adapter interface is defined as an abstract base class (ABC), which guarantees that every subclass implements the required methods (a minimal example adapter is sketched after these notes).
- list_models returns the list of models supported by each adapter.
- generate_completion is the core method: it calls the actual AI model to generate a response, and it is asynchronous to improve performance.
- get_model_info returns details for a specific model, which is convenient for front-end display and selection.
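To make the contract concrete, here is a minimal sketch of an adapter that implements this interface. The EchoAdapter below is hypothetical and not part of the original project; it simply echoes the last user message back in the OpenAI chat-completion shape, which is handy for testing the platform's plumbing without any external API.

# A hypothetical adapter used only to exercise the interface locally
import time
from typing import Dict, List, Any, Optional

from .base import BaseModelAdapter


class EchoAdapter(BaseModelAdapter):
    """Toy adapter that echoes the last user message (for testing only)."""

    def list_models(self) -> List[Dict[str, Any]]:
        return [{"id": "echo-1", "object": "model", "owned_by": "local", "provider": "echo"}]

    async def generate_completion(self,
                                  model: str,
                                  messages: List[Dict[str, str]],
                                  temperature: Optional[float] = None,
                                  top_p: Optional[float] = None,
                                  max_tokens: Optional[int] = None,
                                  stream: bool = False,
                                  **kwargs) -> Dict[str, Any]:
        last_user = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
        # Return a response shaped like an OpenAI chat completion object
        return {
            "id": "chatcmpl-echo",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": last_user},
                "finish_reason": "stop",
            }],
            "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
        }

    def get_model_info(self, model_id: str) -> Dict[str, Any]:
        for model in self.list_models():
            if model["id"] == model_id:
                return model
        raise ValueError(f"Model '{model_id}' does not exist")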
3. Model registry
Create a central registry to manage all available models and corresponding adapters:
# models/registry.py
from typing import Dict, List, Any, Optional
from .adapters.base import BaseModelAdapter
import logging

logger = logging.getLogger(__name__)


class ModelRegistry:
    """Central model registry that manages all model adapters and routing logic"""

    def __init__(self):
        # Adapter mapping {adapter_name: adapter_instance}
        self.adapters: Dict[str, BaseModelAdapter] = {}
        # Model mapping {model_id: adapter_name}
        self.model_mapping: Dict[str, str] = {}

    def register_adapter(self, name: str, adapter: BaseModelAdapter) -> None:
        """Register a new model adapter"""
        if name in self.adapters:
            logger.warning(f"Adapter '{name}' already exists and will be overwritten")
        self.adapters[name] = adapter
        # Register all models supported by this adapter
        for model_info in adapter.list_models():
            model_id = model_info["id"]
            self.model_mapping[model_id] = name
            logger.info(f"Registered model: {model_id} -> {name}")

    def get_adapter_for_model(self, model_id: str) -> Optional[BaseModelAdapter]:
        """Get the corresponding adapter based on the model ID"""
        adapter_name = self.model_mapping.get(model_id)
        if not adapter_name:
            return None
        return self.adapters.get(adapter_name)

    def list_all_models(self) -> List[Dict[str, Any]]:
        """List all registered models"""
        all_models = []
        for adapter in self.adapters.values():
            all_models.extend(adapter.list_models())
        return all_models

    async def generate_completion(self, model_id: str, **kwargs) -> Dict[str, Any]:
        """Generate a completion with the specified model"""
        adapter = self.get_adapter_for_model(model_id)
        if not adapter:
            raise ValueError(f"Adapter for model '{model_id}' was not found")
        return await adapter.generate_completion(model=model_id, **kwargs)
Notes:
- ModelRegistry acts as the central manager, maintaining all model adapters and the routing map.
- register_adapter registers a new adapter and automatically records all models the adapter supports.
- The model_mapping dictionary stores the mapping from model ID to adapter name for fast lookup.
- get_adapter_for_model returns the adapter instance that serves a given model ID.
- list_all_models aggregates the model lists of all adapters for the /v1/models endpoint.
- generate_completion is the core routing logic, forwarding each request to the correct adapter (a short usage sketch follows below).
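As a quick illustration of how the registry is meant to be used, the following sketch registers the hypothetical EchoAdapter from above and routes a request through it. It assumes ModelRegistry and EchoAdapter are importable; asyncio.run is used here only for demonstration.

import asyncio

registry = ModelRegistry()
registry.register_adapter("echo", EchoAdapter())

print([m["id"] for m in registry.list_all_models()])  # ['echo-1']

# The registry looks up the adapter that owns "echo-1" and forwards the call to it
result = asyncio.run(registry.generate_completion(
    model_id="echo-1",
    messages=[{"role": "user", "content": "hello"}],
))
print(result["choices"][0]["message"]["content"])  # 'hello'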
4. OpenAI Adapter Example
The following is an example adapter for the OpenAI API:
# models/adapters/openai_adapter.py
import asyncio
import aiohttp
from typing import Dict, List, Any, Optional
from .base import BaseModelAdapter
import logging

logger = logging.getLogger(__name__)


class OpenAIAdapter(BaseModelAdapter):
    """OpenAI API adapter"""

    def __init__(self, api_key: str, base_url: str = "https://api.openai.com"):
        self.api_key = api_key
        self.base_url = base_url
        self._models_cache = None

    async def _request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """Send a request to the OpenAI API"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        url = f"{self.base_url}{endpoint}"
        async with aiohttp.ClientSession() as session:
            async with session.request(
                method,
                url,
                headers=headers,
                **kwargs
            ) as response:
                if response.status != 200:
                    error_text = await response.text()
                    raise Exception(f"OpenAI API error ({response.status}): {error_text}")
                return await response.json()

    async def _fetch_models(self) -> List[Dict[str, Any]]:
        """Fetch the model list from OpenAI"""
        response = await self._request("GET", "/v1/models")
        return response["data"]

    def list_models(self) -> List[Dict[str, Any]]:
        """Return the list of models supported by OpenAI"""
        if self._models_cache is None:
            # In a real implementation this should be fetched asynchronously;
            # it is simplified here for readability
            self._models_cache = asyncio.run(self._fetch_models())
            # Add extra platform-specific information
            for model in self._models_cache:
                model["provider"] = "openai"
        return self._models_cache

    async def generate_completion(self,
                                  model: str,
                                  messages: List[Dict[str, str]],
                                  temperature: Optional[float] = None,
                                  top_p: Optional[float] = None,
                                  max_tokens: Optional[int] = None,
                                  stream: bool = False,
                                  **kwargs) -> Dict[str, Any]:
        """Call the OpenAI API to generate a chat completion"""
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream
        }
        # Add optional parameters
        if temperature is not None:
            payload["temperature"] = temperature
        if top_p is not None:
            payload["top_p"] = top_p
        if max_tokens is not None:
            payload["max_tokens"] = max_tokens
        # Add any other passed-through parameters
        for key, value in kwargs.items():
            if key not in payload and value is not None:
                payload[key] = value
        # Call the OpenAI API
        response = await self._request(
            "POST",
            "/v1/chat/completions",
            json=payload
        )
        # Make sure the response format is consistent with our standard
        return self._standardize_response(response)

    def _standardize_response(self, response: Dict[str, Any]) -> Dict[str, Any]:
        """Convert OpenAI's response to the standard format"""
        # OpenAI already uses the standard format, so return it directly
        return response

    def get_model_info(self, model_id: str) -> Dict[str, Any]:
        """Get details of a specific model"""
        models = self.list_models()
        for model in models:
            if model["id"] == model_id:
                return model
        raise ValueError(f"Model '{model_id}' does not exist")
Notes:
- This is a concrete adapter implementing the BaseModelAdapter interface, dedicated to handling OpenAI API calls.
- aiohttp is used for asynchronous HTTP requests to improve concurrent processing capacity.
- The private _request method encapsulates the HTTP request logic, handling authentication and error cases.
- _fetch_models retrieves the model list from OpenAI; in actual use the result is cached.
- list_models implements the base-class interface and adds provider information so the front end can distinguish sources.
- generate_completion is the core method: it builds the request payload and calls OpenAI's chat/completions API.
- _standardize_response normalizes responses into a unified format for downstream processing.
- get_model_info looks up detailed information by model ID.
- A sketch of a corresponding Claude adapter follows below.
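The article assumes a ClaudeAdapter exists but does not show it. Here is a rough, hedged sketch of what it could look like against Anthropic's Messages API; the endpoint path, headers, response fields, and model ID below reflect Anthropic's public documentation as I understand it and should be verified before use. The key job is _standardize_response, which reshapes Claude's reply into the OpenAI chat-completion format.

# models/adapters/claude_adapter.py -- illustrative sketch only
import time
import aiohttp
from typing import Dict, List, Any, Optional
from .base import BaseModelAdapter


class ClaudeAdapter(BaseModelAdapter):
    """Sketch of an Anthropic Claude adapter (verify API details before use)."""

    def __init__(self, api_key: str, base_url: str = "https://api.anthropic.com"):
        self.api_key = api_key
        self.base_url = base_url
        # Hand-maintained model list; Anthropic model IDs change over time
        self._models = [{"id": "claude-3-5-sonnet-latest", "object": "model",
                         "owned_by": "anthropic", "provider": "claude"}]

    def list_models(self) -> List[Dict[str, Any]]:
        return self._models

    def get_model_info(self, model_id: str) -> Dict[str, Any]:
        for model in self._models:
            if model["id"] == model_id:
                return model
        raise ValueError(f"Model '{model_id}' does not exist")

    async def generate_completion(self, model, messages, temperature=None, top_p=None,
                                  max_tokens=None, stream=False, **kwargs) -> Dict[str, Any]:
        # Claude keeps the system prompt outside the messages array
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        chat_messages = [m for m in messages if m["role"] != "system"]
        payload = {"model": model, "messages": chat_messages,
                   "max_tokens": max_tokens or 1024}
        if system:
            payload["system"] = system
        if temperature is not None:
            payload["temperature"] = temperature
        headers = {"x-api-key": self.api_key,
                   "anthropic-version": "2023-06-01",
                   "Content-Type": "application/json"}
        async with aiohttp.ClientSession() as session:
            async with session.post(f"{self.base_url}/v1/messages",
                                    headers=headers, json=payload) as resp:
                if resp.status != 200:
                    raise Exception(f"Claude API error ({resp.status}): {await resp.text()}")
                data = await resp.json()
        return self._standardize_response(data, model)

    def _standardize_response(self, data: Dict[str, Any], model: str) -> Dict[str, Any]:
        """Reshape Claude's response into the OpenAI chat-completion format."""
        text = "".join(block.get("text", "") for block in data.get("content", []))
        usage = data.get("usage", {})
        return {
            "id": data.get("id", "chatcmpl-claude"),
            "object": "chat.completion",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0,
                         "message": {"role": "assistant", "content": text},
                         "finish_reason": "stop"}],
            "usage": {
                "prompt_tokens": usage.get("input_tokens", 0),
                "completion_tokens": usage.get("output_tokens", 0),
                "total_tokens": usage.get("input_tokens", 0) + usage.get("output_tokens", 0),
            },
        }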
5. API routing implementation
Now, let's implement the API endpoints that comply with the OpenAI specification:
# api/models.py
# Note: async view functions require Flask's async support (pip install "flask[async]")
from flask import Blueprint, jsonify
from ..models.registry import ModelRegistry

models_bp = Blueprint('models', __name__)


def init_routes(registry: ModelRegistry):
    """Initialize the model API routes"""

    @models_bp.route('/v1/models', methods=['GET'])
    async def list_models():
        """List all available models (OpenAI-compatible endpoint)"""
        models = registry.list_all_models()
        # Return in the OpenAI API format
        return jsonify({
            "object": "list",
            "data": models
        })

    @models_bp.route('/v1/models/<model_id>', methods=['GET'])
    async def get_model(model_id):
        """Get details of a specific model (OpenAI-compatible endpoint)"""
        adapter = registry.get_adapter_for_model(model_id)
        if not adapter:
            return jsonify({
                "error": {
                    "message": f"Model '{model_id}' does not exist",
                    "type": "invalid_request_error",
                    "code": "model_not_found"
                }
            }), 404
        model_info = adapter.get_model_info(model_id)
        return jsonify(model_info)
Notes:
- Flask Blueprints are used to organize the routes, making the code easy to manage in modules.
- The init_routes function receives the model registry instance, a simple form of dependency injection.
- The /v1/models endpoint is fully compatible with the OpenAI API specification and returns all registered models.
- The /v1/models/<model_id> endpoint returns detailed information about the specified model.
- Errors are returned in the standard OpenAI error format to keep clients compatible (an example success response is shown below).
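For reference, a successful /v1/models response should look roughly like the structure below; the model IDs are placeholders, and the actual list depends on which adapters are registered.

# Example /v1/models response body (model IDs are placeholders)
{
    "object": "list",
    "data": [
        {"id": "gpt-4o", "object": "model", "owned_by": "openai", "provider": "openai"},
        {"id": "claude-3-5-sonnet-latest", "object": "model", "owned_by": "anthropic", "provider": "claude"}
    ]
}

Next, the /v1/chat/completions endpoint: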
# api/chat.py
from flask import Blueprint, request, jsonify, Response, stream_with_context
import json
import asyncio
from ..models.registry import ModelRegistry
from ..utils.auth import verify_api_key
from ..utils.rate_limit import check_rate_limit
import logging

logger = logging.getLogger(__name__)

chat_bp = Blueprint('chat', __name__)


def init_routes(registry: ModelRegistry):
    """Initialize the chat completions API routes"""

    @chat_bp.route('/v1/chat/completions', methods=['POST'])
    @verify_api_key
    @check_rate_limit
    async def create_chat_completion():
        """Create a chat completion (OpenAI-compatible endpoint)"""
        try:
            # Parse the request data
            data = request.get_json()
            model = data.get("model")
            if not model:
                return jsonify({
                    "error": {
                        "message": "The 'model' parameter must be specified",
                        "type": "invalid_request_error",
                    }
                }), 400

            adapter = registry.get_adapter_for_model(model)
            if not adapter:
                return jsonify({
                    "error": {
                        "message": f"Model '{model}' does not exist or is not available",
                        "type": "invalid_request_error",
                        "code": "model_not_found"
                    }
                }), 404

            # Extract parameters
            messages = data.get("messages", [])
            temperature = data.get("temperature")
            top_p = data.get("top_p")
            max_tokens = data.get("max_tokens")
            stream = data.get("stream", False)
            # Other parameters
            kwargs = {k: v for k, v in data.items() if k not in
                      ["model", "messages", "temperature", "top_p", "max_tokens", "stream"]}

            # Streaming output
            if stream:
                async def generate():
                    kwargs["stream"] = True
                    response_iterator = await registry.generate_completion(
                        model_id=model,
                        messages=messages,
                        temperature=temperature,
                        top_p=top_p,
                        max_tokens=max_tokens,
                        **kwargs
                    )
                    # Assume response_iterator is an async iterator of chunks
                    async for chunk in response_iterator:
                        yield f"data: {json.dumps(chunk)}\n\n"
                    # End of stream
                    yield "data: [DONE]\n\n"

                return Response(
                    stream_with_context(generate()),
                    content_type='text/event-stream'
                )

            # Non-streaming output
            response = await registry.generate_completion(
                model_id=model,
                messages=messages,
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_tokens,
                **kwargs
            )
            return jsonify(response)

        except Exception as e:
            logger.exception("Error processing chat/completions request")
            return jsonify({
                "error": {
                    "message": str(e),
                    "type": "server_error",
                }
            }), 500
Notes:
- This implements the /v1/chat/completions endpoint, the core feature of the OpenAI API.
- The @verify_api_key and @check_rate_limit decorators handle authentication and rate limiting.
- The model ID, messages, generation parameters and other information are extracted from the request body.
- Both streaming and regular output modes are supported.
- Streaming mode uses stream_with_context and the SSE (Server-Sent Events) format; regular mode directly returns the complete JSON response.
- The exception handler ensures a friendly error message is returned even when something goes wrong.
- Extra parameters are passed through dynamically to the underlying model.
- A client-side usage example follows below.
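Because the endpoints follow the OpenAI specification, any OpenAI-compatible client can talk to the platform. A minimal sketch using the official openai Python package (v1.x), assuming the service runs locally on port 8000, "your-platform-key" is one of the keys configured in API_KEYS, and "gpt-4o" is a model ID actually returned by /v1/models:

from openai import OpenAI

# Point the standard OpenAI client at our platform instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-platform-key")

# List every model exposed by all registered adapters
for m in client.models.list().data:
    print(m.id)

# Non-streaming chat completion routed through the platform
resp = client.chat.completions.create(
    model="gpt-4o",  # any model ID returned by /v1/models
    messages=[{"role": "user", "content": "Hello from the integration platform!"}],
)
print(resp.choices[0].message.content)

# Streaming chat completion (served by the platform over SSE)
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Stream this reply, please."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")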
6. Main application entry point
Finally, we integrate all the components into the main application:
# app.py
from flask import Flask
from flask_cors import CORS
from .config import Config
from .models.registry import ModelRegistry
from .models.adapters.openai_adapter import OpenAIAdapter
from .models.adapters.claude_adapter import ClaudeAdapter  # assumed to be implemented
from .models.adapters.local_adapter import LocalModelAdapter  # assumed to be implemented
from .api import models, chat
import logging
import os


def create_app():
    """Create and configure the Flask app"""
    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Create the Flask app
    app = Flask(__name__)
    CORS(app)  # Enable cross-origin support

    # Load configuration
    app.config.from_object(Config)

    # Create the model registry
    registry = ModelRegistry()

    # Register model adapters
    # OpenAI adapter
    if Config.OPENAI_API_KEY:
        openai_adapter = OpenAIAdapter(
            api_key=Config.OPENAI_API_KEY,
            base_url=Config.OPENAI_BASE_URL
        )
        registry.register_adapter("openai", openai_adapter)

    # Claude adapter
    if Config.CLAUDE_API_KEY:
        claude_adapter = ClaudeAdapter(
            api_key=Config.CLAUDE_API_KEY
        )
        registry.register_adapter("claude", claude_adapter)

    # Local model adapter
    if Config.LOCAL_MODELS_ENABLED:
        local_adapter = LocalModelAdapter(
            models_dir=Config.LOCAL_MODELS_DIR
        )
        registry.register_adapter("local", local_adapter)

    # Initialize API routes
    models.init_routes(registry)
    chat.init_routes(registry)

    # Register blueprints
    app.register_blueprint(models.models_bp)
    app.register_blueprint(chat.chat_bp)

    @app.route('/health', methods=['GET'])
    def health_check():
        """Health check endpoint"""
        return {"status": "healthy"}

    return app


if __name__ == "__main__":
    app = create_app()
    app.run(
        host=os.environ.get("HOST", "0.0.0.0"),
        port=int(os.environ.get("PORT", "8000")),
        debug=os.environ.get("DEBUG", "False").lower() == "true"
    )
Notes:
- The Flask application is created with the factory pattern, which makes testing and scaling easier.
- The logging system is configured to simplify debugging and troubleshooting.
- CORS (cross-origin resource sharing) is enabled to support cross-origin calls from the front end.
- Settings are loaded from the configuration file instead of being hard-coded.
- The model registry is created and different model adapters are registered dynamically according to the configuration.
- Adapters are registered conditionally: only adapters with the corresponding API key configured are enabled.
- The API routes are initialized, with the model registry injected into each route handler.
- A health check endpoint is added so that monitoring systems can probe the service status.
- Server startup parameters are read from environment variables to improve deployment flexibility.
7. Configuration file
# config.py
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()


class Config:
    """Application configuration"""

    # Platform API keys
    API_KEYS = os.environ.get("API_KEYS", "").split(",")

    # OpenAI configuration
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
    OPENAI_BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com")

    # Claude configuration
    CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY")

    # Local model configuration
    LOCAL_MODELS_ENABLED = os.environ.get("LOCAL_MODELS_ENABLED", "False").lower() == "true"
    LOCAL_MODELS_DIR = os.environ.get("LOCAL_MODELS_DIR", "./models")

    # Rate limit configuration
    RATE_LIMIT_ENABLED = os.environ.get("RATE_LIMIT_ENABLED", "True").lower() == "true"
    RATE_LIMIT_REQUESTS = int(os.environ.get("RATE_LIMIT_REQUESTS", "100"))  # requests per minute
Notes:
- python-dotenv loads environment variables from a .env file, which keeps development and deployment configuration separate.
- Default values are provided so the application can start even when an environment variable is not set.
- Multiple platform API keys can be configured via environment variables to grant access to different users.
- OpenAI's base URL is configurable, which supports OpenAI-compatible alternative API services.
- The local model feature can be toggled per environment with an environment variable switch.
- Rate limiting is also configurable to prevent API abuse (a sketch of the verify_api_key and check_rate_limit decorators built on these settings follows below).
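The verify_api_key and check_rate_limit decorators used by the chat route are not shown in the article. A minimal sketch of how they could be built on top of these configuration values follows; the in-memory, per-key fixed-window counter is purely illustrative (not production-ready), and the module layout matches the project structure assumed earlier.

# utils/auth.py and utils/rate_limit.py -- illustrative sketches only
import time
from functools import wraps
from flask import request, jsonify
from ..config import Config


def verify_api_key(f):
    """Check the Bearer token in the Authorization header against Config.API_KEYS."""
    @wraps(f)
    async def wrapper(*args, **kwargs):
        auth = request.headers.get("Authorization", "")
        key = auth.removeprefix("Bearer ").strip()
        if not key or key not in Config.API_KEYS:
            return jsonify({"error": {"message": "Invalid API key",
                                      "type": "invalid_request_error",
                                      "code": "invalid_api_key"}}), 401
        return await f(*args, **kwargs)
    return wrapper


_request_counts = {}  # {api_key: (window_start_timestamp, count)}


def check_rate_limit(f):
    """Naive fixed-window limiter: Config.RATE_LIMIT_REQUESTS requests per key per minute."""
    @wraps(f)
    async def wrapper(*args, **kwargs):
        if Config.RATE_LIMIT_ENABLED:
            key = request.headers.get("Authorization", "anonymous")
            window_start, count = _request_counts.get(key, (time.time(), 0))
            if time.time() - window_start > 60:
                window_start, count = time.time(), 0
            if count >= Config.RATE_LIMIT_REQUESTS:
                return jsonify({"error": {"message": "Rate limit exceeded",
                                          "type": "rate_limit_error"}}), 429
            _request_counts[key] = (window_start, count + 1)
        return await f(*args, **kwargs)
    return wrapper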
Analysis of key technical points
1. Adapter pattern
We use the adapter pattern to unify the interface differences between the various large-model APIs. Each adapter is responsible for converting a specific vendor's API into our standard interface, which makes adding a new model simple: just implement the corresponding adapter.
2. Asynchronous processing
By using async/await, we can handle concurrent requests efficiently. Asynchronous processing is particularly important for scenarios such as streaming output.
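For example, because each adapter call is a coroutine, the registry can fan a prompt out to several models concurrently. A small, hedged sketch, assuming a registry populated as in the earlier examples:

import asyncio

async def compare_models(registry, prompt, model_ids):
    """Send the same prompt to several models concurrently and collect the answers."""
    tasks = [
        registry.generate_completion(
            model_id=mid,
            messages=[{"role": "user", "content": prompt}],
        )
        for mid in model_ids
    ]
    responses = await asyncio.gather(*tasks)
    return {mid: r["choices"][0]["message"]["content"]
            for mid, r in zip(model_ids, responses)}

# e.g. asyncio.run(compare_models(registry, "Summarize this article", ["gpt-4o", "echo-1"]))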
3. Unified model representation
We unify the representation of models, ensuring that model capabilities and attributes are expressed consistently across different adapters, which helps users switch smoothly between models.
4. Central registry
ModelRegistry is the central component that manages all model adapters and provides a unified calling interface. It is responsible for core logic such as model routing and adapter selection.
Extensions and advanced features
After the basic functionality is implemented, the following advanced features can be considered:
1. Load balancing and failover
# Add a load-balancing method to ModelRegistry
def select_adapter_with_load_balancing(self, model_group: str) -> BaseModelAdapter:
    """Select an adapter based on its current load"""
    adapters = self.model_groups.get(model_group, [])
    if not adapters:
        raise ValueError(f"Model group '{model_group}' not found")
    # Select the optimal adapter based on metrics such as latency and success rate
    # (simplified implementation; assumes each adapter exposes a 'name' attribute
    # and that self.adapter_metrics tracks per-adapter latency)
    return min(adapters, key=lambda a: self.adapter_metrics[a.name]["latency"])
Notes:
- In high-availability scenarios, multiple adapter instances can be configured for the same model (pointing to different regions or different providers).
- By collecting metrics such as latency and success rate, the currently optimal adapter can be selected dynamically.
- When one adapter has problems, the system can automatically switch to a backup adapter, achieving failover (a sketch of how such metrics could be recorded follows below).
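As a hedged sketch of where adapter_metrics could come from, the registry might wrap each call and keep an exponential moving average of latency per adapter. Note that adapter_metrics, model_groups, and the adapter "name" attribute are assumptions of this sketch, not part of the code above.

import time
from collections import defaultdict


class MeteredRegistryMixin:
    """Mixin sketch: record latency and error metrics for each adapter call."""

    def __init__(self):
        super().__init__()
        self.adapter_metrics = defaultdict(lambda: {"latency": 0.0, "errors": 0, "calls": 0})

    async def timed_completion(self, adapter_name, adapter, **kwargs):
        metrics = self.adapter_metrics[adapter_name]
        start = time.monotonic()
        try:
            return await adapter.generate_completion(**kwargs)
        except Exception:
            metrics["errors"] += 1
            raise
        finally:
            elapsed = time.monotonic() - start
            metrics["calls"] += 1
            # Exponential moving average so recent latency dominates
            metrics["latency"] = 0.8 * metrics["latency"] + 0.2 * elapsed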
2. Cache layer
Adding a cache layer for common requests reduces how often the backend APIs are called, lowering cost and improving response times.
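A minimal sketch of such a cache, assuming an in-memory TTL dictionary keyed by a hash of the model and messages (only sensible for deterministic requests; streaming or high-temperature requests would normally bypass it):

import hashlib
import json
import time

_cache = {}  # {request_hash: (timestamp, response)}
CACHE_TTL_SECONDS = 300


def cache_key(model, messages, **params):
    """Stable hash of the request so identical requests hit the cache."""
    raw = json.dumps({"model": model, "messages": messages, "params": params}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


async def cached_completion(registry, model, messages, **params):
    key = cache_key(model, messages, **params)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    response = await registry.generate_completion(model_id=model, messages=messages, **params)
    _cache[key] = (time.time(), response)
    return response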
Summary
By building such a large-model integration platform, we greatly reduce the complexity of multi-model application development. Developers only need to call the unified OpenAI-compatible interface, and the platform automatically handles the underlying details such as API differences, authentication, and routing.
This architecture is not only suitable for simple calling scenarios; it can also serve as infrastructure for more complex AI applications, such as dynamically selecting the model best suited to a specific task, or orchestrating collaboration between models.
I hope the technical ideas and code examples in this article help you build your own large-model integration platform and provide more flexible and powerful infrastructure support for AI application development.
Written at the end
If you are interested in the technical details and source-code implementation behind this article, follow my WeChat official account【Song Ge Ai Automation】. Every week I publish an in-depth technical article there, analyzing the implementation principles of practical tools from a source-code perspective.
Review of the last issue: (Small model tool calling capability activation: Prompt engineering practice taking Qwen2.5 0.5B as an example)