Is SGLang an LLM engine, or does it use vLLM/llama.cpp under the hood? And while we're at it, has anyone done a comparison of LLM engines? I've also heard of Mistral.rs, MLC LLM, and obviously the HF Transformers library and its ktransformers alternative.
Here is a list of inference engines I've tried:
- SGLang
- vLLM
- TGI (Hugging Face's)
- llama.cpp
- infinity (great for embedding/reranking models, not for LLMs)
My personal feeling is that SGLang and vLLM have issues that make me not want to use them. Sure, they're fast, but there are reliability problems, and you need lots of flags and tinkering to get them working. There's also the problem of 100% CPU usage at idle, which the core contributors say is 'normal' and 'expected'; search the respective repositories for the topic if you don't believe me. People have even submitted PRs to fix it, which haven't been merged. The mindset of these projects seems to be to get things to 'work', not to polish them for ease of use.
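To give a concrete idea of the knobs involved, here's a minimal sketch of offline inference with vLLM's Python API. The model name and tuning values are placeholders I picked for illustration; you'll have to adjust them for your own hardware:

```python
# Minimal vLLM offline-inference sketch (model name and tuning
# values are placeholders -- adjust for your own GPU setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,        # shard across 2 GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM to claim up front
    max_model_len=8192,            # cap the context length to fit in memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```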
TGI, on the other hand, is in a class of its own. You can just feel the polish that went into it; things tend to 'just work'. It's the only engine I tried that was able to run the model I wanted on the first try. Then I added the flags to make it fit my hardware (like sharding and max prefill tokens). TGI uses FlashInfer by default, which is SOTA among flash attention backends.
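Once the server is up, querying it is equally painless. Here's a sketch using huggingface_hub's InferenceClient, assuming a TGI server is already running; the endpoint URL and prompt are made up:

```python
# Query a locally running TGI server (the endpoint URL is an
# assumption; start the server first, e.g. via the TGI docker image).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # hypothetical endpoint

# TGI supports streaming text generation out of the box.
for token in client.text_generation(
    "What makes a good inference engine?",
    max_new_tokens=128,
    stream=True,
):
    print(token, end="", flush=True)
```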
llama.cpp has the widest model support, but it doesn't perform as well as TGI/vLLM/SGLang. If you can accept the performance loss (about 30% slower in my testing), it's great for testing and development purposes; for production-grade stuff I would recommend TGI.
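If you go the llama.cpp route from Python, the llama-cpp-python bindings keep things simple. A sketch, where the GGUF path and offload setting are placeholders for your own setup:

```python
# Run a GGUF model via the llama-cpp-python bindings (the model path
# and n_gpu_layers value are placeholders for your own setup).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm("Q: What is speculative decoding? A:", max_tokens=128)
print(out["choices"][0]["text"])
```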
SGLang is not a fork of vLLM; it's an independent engine and a direct competitor (early versions did reuse some vLLM components, which may be where the fork confusion comes from).