Concept
Here are some general concepts in the LLM ecosystem.
Inference and Serving
Inference is running the model to produce output for a given input; serving refers to making the model accessible as a service, typically behind an HTTP API that handles batching, concurrency, and scaling.
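A minimal sketch of the distinction: an inference function wrapped in an HTTP endpoint using only the standard library. The model is a hypothetical stub; a real deployment would put a backend such as vLLM behind this interface.

```python
# Serving sketch: expose a stub inference function over HTTP.
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def run_inference(prompt: str) -> str:
    # Stub standing in for a real model forward pass.
    return f"echo: {prompt}"

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(
            {"output": run_inference(payload.get("prompt", ""))}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def make_server(port: int = 0) -> ThreadingHTTPServer:
    # port=0 lets the OS pick a free port; useful for local testing.
    return ThreadingHTTPServer(("127.0.0.1", port), InferenceHandler)
```

Calling `make_server().serve_forever()` and POSTing `{"prompt": "..."}` returns the model output as JSON.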
Session ID
A session ID identifies a multi-turn conversation so the server can associate requests with the same history (and reuse cached state where supported).
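A minimal sketch of server-side session tracking, assuming an in-memory store; production systems typically keep this state in Redis or a database.

```python
# Session sketch: map a session ID to its conversation history.
import uuid

class SessionStore:
    def __init__(self):
        self._sessions: dict[str, list[dict]] = {}

    def create(self) -> str:
        # A random, unguessable ID per conversation.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = []
        return session_id

    def append(self, session_id: str, role: str, content: str) -> None:
        self._sessions[session_id].append({"role": role, "content": content})

    def history(self, session_id: str) -> list[dict]:
        return self._sessions[session_id]
```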
LoRA
Low-Rank Adaptation: fine-tunes a model by training small low-rank update matrices added to frozen base weights, drastically reducing the number of trainable parameters.
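The core of the idea in a few lines: instead of updating a full weight matrix W, LoRA trains a low-rank product B·A and applies W' = W + (alpha/r)·B·A. Pure-Python matrices here just to show the shapes; real code uses a tensor library.

```python
# LoRA merge sketch: W' = W + (alpha / r) * B @ A
def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_merge(W, A, B, alpha: float, r: int):
    # B: (d_out x r), A: (r x d_in) -- the only trained parameters.
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

With rank r much smaller than the weight dimensions, B and A together hold far fewer parameters than W itself.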
Input Enrichment
Embedding models translate the original query into a vector (embedding), which can be used to retrieve relevant context (e.g., for retrieval-augmented generation) before the query reaches the LLM.
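A toy sketch of this retrieval step: embed the query, find the most similar stored document by cosine similarity, and prepend it as context. The bag-of-words "embedding" is a stand-in for a real embedding model.

```python
# Input enrichment sketch: retrieve the nearest document and prepend it.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word-count vector (a real model returns dense floats).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def enrich(query: str, documents: list[str]) -> str:
    best = max(documents, key=lambda d: cosine(embed(query), embed(d)))
    return f"Context: {best}\n\nQuestion: {query}"
```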
Prompt Optimization
Rewrite or template the user query (system instructions, few-shot examples, compression) to improve response quality and reduce token cost.
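One simple form of this, sketched below: wrapping the raw query in a template with an instruction and few-shot examples. The template text is illustrative, not from any particular system.

```python
# Prompt templating sketch: instruction + few-shot examples + query.
FEW_SHOT = [
    ("Translate to French: hello", "bonjour"),
    ("Translate to French: thank you", "merci"),
]

def build_prompt(query: str) -> str:
    lines = ["You are a concise translation assistant."]
    for q, a in FEW_SHOT:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {query}\nA:")  # model completes after the final "A:"
    return "\n\n".join(lines)
```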
LLM Cache
Cache responses for repeated (or semantically similar) queries to cut latency and inference cost.
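A sketch of the exact-match variant: key on a hash of the model name, prompt, and sampling parameters, and return the stored response on a hit. Semantic caches instead match on embedding similarity.

```python
# Exact-match LLM cache sketch keyed on (model, prompt, params).
import hashlib
import json

class LLMCache:
    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # sort_keys makes the key stable across dict orderings.
        raw = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, model: str, prompt: str, params: dict):
        return self._store.get(self.key(model, prompt, params))

    def put(self, model: str, prompt: str, params: dict, response: str):
        self._store[self.key(model, prompt, params)] = response
```

Including sampling parameters in the key matters: the same prompt at different temperatures is not the same request.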
Content Classifier or Filter
Classify model inputs and outputs so harmful responses can be blocked or redacted before reaching the user.
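A minimal sketch of an output filter using a keyword blocklist. Production filters are usually trained classifiers; the blocklist and fallback message here are illustrative.

```python
# Output filter sketch: reject responses containing blocked terms.
BLOCKLIST = {"credit_card_number", "ssn"}  # illustrative terms

def is_safe(response: str) -> bool:
    tokens = set(response.lower().split())
    return not (tokens & BLOCKLIST)

def filter_response(response: str,
                    fallback: str = "[response withheld]") -> str:
    return response if is_safe(response) else fallback
```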
Feedback
Collect user feedback (ratings, corrections) to monitor response quality and supply data for further fine-tuning.
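A sketch of the collection side: log per-response ratings so low-scoring prompts can be reviewed or routed into fine-tuning data. Storage is in-memory for illustration.

```python
# Feedback logging sketch: thumbs up/down per response ID.
from collections import defaultdict

class FeedbackLog:
    def __init__(self):
        self._ratings: dict[str, list[int]] = defaultdict(list)

    def record(self, response_id: str, rating: int) -> None:
        # rating: +1 (thumbs up) or -1 (thumbs down)
        self._ratings[response_id].append(rating)

    def score(self, response_id: str) -> float:
        # Mean rating in [-1, 1]; 0.0 if no feedback yet.
        ratings = self._ratings[response_id]
        return sum(ratings) / len(ratings) if ratings else 0.0
```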
A Production-Level Implementation Overview
```mermaid
graph TB
    User -->|HTTPS| Nginx --> api[API Gateway] --> vLLM --> Redis --> OSS
    vLLM -->|Monitor| Prometheus --> Grafana
    api -->|Auth| Keycloak(RBAC)
```
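The request path in the diagram can be sketched as a single handler: the gateway checks auth (Keycloak's role above), consults a Redis-like cache, and on a miss calls the vLLM backend and stores the result. Every component here is a stub standing in for the real service.

```python
# Request-flow sketch mirroring the diagram: auth -> cache -> model.
def check_auth(token: str, valid_tokens: set) -> bool:
    # Stands in for the Keycloak RBAC check.
    return token in valid_tokens

def handle_request(token: str, prompt: str, cache: dict,
                   model, valid_tokens: set) -> dict:
    if not check_auth(token, valid_tokens):
        return {"status": 401}
    if prompt in cache:  # the Redis lookup in the diagram
        return {"status": 200, "output": cache[prompt], "cached": True}
    output = model(prompt)        # the vLLM backend
    cache[prompt] = output        # write-through on a miss
    return {"status": 200, "output": output, "cached": False}
```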