Shyloo Posted March 19, 2023 Share Posted March 19, 2023 ChatGPT can give some impressive results, and also sometimes some very poor advice. But while it's free to talk with ChatGPT in theory, often you end up with messages about the system being at capacity, or hitting your maximum number of chats for the day, with a prompt to subscribe to ChatGPT Plus. Also, all of your queries are taking place on ChatGPT's server, which means that you need Internet and that OpenAI can see what you're doing. Fortunately, there are ways to run a ChatGPT-like LLM (Large Language Model) on your local PC, using the power of your GPU. The oobabooga text generation webui(opens in new tab) might be just what you're after, so we ran some tests to find out what it could — and couldn't! — do, which means we also have some benchmarks. Getting the webui running wasn't quite as simple as we had hoped, in part due to how fast everything is moving within the LLM space. There are the basic instructions in the readme, the one-click installers, and then multiple guides for how to build and run the LLaMa 4-bit models(opens in new tab). We encountered varying degrees of success/failure, but with some help from Nvidia and others, we finally got things working. And then the repository was updated and our instructions broke, but a workaround/fix was posted today. Again, it's moving fast! It's like running Linux and only Linux, and then wondering how to play the latest games. Sometimes you can get it working, other times you're presented with error messages and compiler warnings that you have no idea how to solve. We'll provide our version of instructions below for those who want to give this a shot on their own PCs. You may also find some helpful people in the LMSys Discord(opens in new tab), who were good about helping me with some of my questions. Nvidia GeForce RTX 4090 graphics cards (Image credit: Toms' Hardware) It might seem obvious, but let's also just get this out of the way: You'll need a GPU with a lot of memory, and probably a lot of system memory as well, should you want to run a large language model on your own hardware — it's right there in the name. A lot of the work to get things running on a single GPU (or a CPU) has focused on reducing the memory requirements. Using the base models with 16-bit data, for example, the best you can do with an RTX 4090, RTX 3090 Ti, RTX 3090, or Titan RTX — cards that all have 24GB of VRAM — is to run the model with seven billion parameters (LLaMa-7b). That's a start, but very few home users are likely to have such a graphics card, and it runs quite poorly. Thankfully, there are other options. Loading the model with 8-bit precision cuts the RAM requirements in half, meaning you could run LLaMa-7b with many of the best graphics cards — anything with at least 10GB VRAM could potentially suffice. Even better, loading the model with 4-bit precision halves the VRAM requirements yet again, allowing for LLaMa-13b to work on 10GB VRAM. (You'll also need a decent amount of system memory, 32GB or more most likely — that's what we used, at least.) Getting the models isn't too difficult at least, but they can be very large. LLaMa-13b for example consists of 36.3 GiB download for the main data(opens in new tab), and then another 6.5 GiB for the pre-quantized 4-bit model(opens in new tab). Do you have a graphics card with 24GB of VRAM and 64GB of system memory? Then the 30 billion parameter model(opens in new tab) is only a 75.7 GiB download, and another 15.7 GiB for the 4-bit stuff. There's even a 65 billion parameter model, in case you have an Nvidia A100 40GB PCIe(opens in new tab) card handy, along with 128GB of system memory (well, 128GB of memory plus swap space). Hopefully the people downloading these models don't have a data cap on their internet connection. Testing Text Generation Web UI Performance In theory, you can get the text generation web UI running on Nvidia's GPUs via CUDA, or AMD's graphics cards via ROCm. The latter requires running Linux, and after fighting with that stuff to do Stable Diffusion benchmarks earlier this year, I just gave it a pass for now. If you have working instructions on how to get it running (under Windows 11, though using WSL2 is allowed) and you want me to try them, hit me up and I'll give it a shot. But for now I'm sticking with Nvidia GPUs. I encountered some fun errors when trying to run the llama-13b-4bit models on older Turing architecture cards like the RTX 2080 Ti and Titan RTX. Everything seemed to load just fine, and it would even spit out responses and give a tokens-per-second stat, but the output was garbage. Starting with a fresh environment while running a Turing GPU seems to have worked, fixed the problem, so we have three generations of Nvidia RTX GPUs. While in theory we could try running these models on non-RTX GPUs and cards with less than 10GB of VRAM, we wanted to use the llama-13b model as that should give superior results to the 7b model. Looking at the Turing, Ampere, and Ada Lovelace architecture cards with at least 10GB of VRAM, that gives us 11 total GPUs to test. We felt that was better than restricting things to 24GB GPUs and using the llama-30b model. For these tests, we used a Core i9-12900K running Windows 11. You can see the full specs in the boxout. We used reference Founders Edition models for most of the GPUs, though there's no FE for the 4070 Ti, 3080 12GB, or 3060, and we only have the Asus 3090 Ti. In theory, there should be a pretty massive difference between the fastest and slowest GPUs in that list. In practice, at least using the code that we got working, other bottlenecks are definitely a factor. It's not clear whether we're hitting VRAM latency limits, CPU limitations, or something else — probably a combination of factors — but your CPU definitely plays a role. We tested an RTX 4090 on a Core i9-9900K and the 12900K, for example, and the latter was almost twice as fast. It looks like some of the work at least ends up being primarily single-threaded CPU limited. That would explain the big improvement in going from 9900K to 12900K. Still, we'd love to see scaling well beyond what we were able to achieve with these initial tests. Given the rate of change happening with the research, models, and interfaces, it's a safe bet that we'll see plenty of improvement in the coming days. So, don't take these performance metrics as anything more than a snapshot in time. We may revisit the testing at a future date, hopefully with additional tests on non-Nvidia GPUs. We ran oobabooga's web UI with the following, for reference. More on how to do this below. Again, we want to preface the charts below with the following disclaimer: These results don't necessarily make a ton of sense if we think about the traditional scaling of GPU workloads. Normally you end up either GPU compute constrained, or limited by GPU memory bandwidth, or some combination of the two. There are definitely other factors at play with this particular AI workload, and we have some additional charts to help explain things a bit. Running on Windows is likely a factor as well, but considering 95% of people are likely running Windows compared to Linux, this is more information on what to expect right now. We wanted tests that we could run without having to deal with Linux, and obviously these preliminary results are more of a snapshot in time of how things are running than a final verdict. Please take it as such. These initial Windows results are more of a snapshot in time than a final verdict. We ran the test prompt 30 times on each GPU, with a maximum of 500 tokens. We discarded any results that had fewer than 400 tokens (because those do less work), and also discarded the first two runs (warming up the GPU and memory). Then we sorted the results by speed and took the average of the remaining ten fastest results. Generally speaking, the speed of response on any given GPU was pretty consistent, within a 7% range at most on the tested GPUs, and often within a 3% range. That's on one PC, however; on a different PC with a Core i9-9900K and an RTX 4090, our performance was around 40 percent slower than on the 12900K. Our prompt for the following charts was: "How much computational power does it take to simulate the human brain?" Our fastest GPU was indeed the RTX 4090, but... it's not really that much faster than other options. Considering it has roughly twice the compute, twice the memory, and twice the memory bandwidth as the RTX 4070 Ti, you'd expect more than a 2% improvement in performance. That didn't happen, not even close. The situation with RTX 30-series cards isn't all that different. The RTX 3090 Ti comes out as the fastest Ampere GPU for these AI Text Generation tests, but there's almost no difference between it and the slowest Ampere GPU, the RTX 3060, considering their specifications. A 10% advantage is hardly worth speaking of! And then look at the two Turing cards, which actually landed higher up the charts than the Ampere GPUs. That simply shouldn't happen if we were dealing with GPU compute limited scenarios. Maybe the current software is simply better optimized for Turing, maybe it's something in Windows or the CUDA versions we used, or maybe it's something else. It's weird, is really all I can say. These results shouldn't be taken as a sign that everyone interested in getting involved in AI LLMs should run out and buy RTX 3060 or RTX 4070 Ti cards, or particularly old Turing GPUs. We recommend the exact opposite, as the cards with 24GB of VRAM are able to handle more complex models, which can lead to better results. And even the most powerful consumer hardware still pales in comparison to data center hardware — Nvidia's A100 can be had with 40GB or 80GB of HBM2e, while the newer H100 defaults to 80GB. I certainly won't be shocked if eventually we see an H100 with 160GB of memory, though Nvidia hasn't said it's actually working on that. As an example, the 4090 (and other 24GB cards) can all run the LLaMa-30b 4-bit model, whereas the 10–12 GB cards are at their limit with the 13b model. 165b models also exist, which would require at least 80GB of VRAM and probably more, plus gobs of system memory. And that's just for inference; training workloads require even more memory! "Tokens" for reference is basically the same as "words," except it can include things that aren't strictly words, like parts of a URL or formula. So when we give a result of 25 tokens/s, that's like someone typing at about 1,500 words per minute. That's pretty darn fast, though obviously if you're attempting to run queries from multiple users that can quickly feel inadequate. Here's a different look at the various GPUs, using only the theoretical FP16 compute performance. Now, we're actually using 4-bit integer inference on the Text Generation workloads, but integer operation compute (Teraops or TOPS) should scale similarly to the FP16 numbers. Also note that the Ada Lovelace cards have double the theoretical compute when using FP8 instead of FP16, but that isn't a factor here. If there are inefficiencies in the current Text Generation code, those will probably get worked out in the coming months, at which point we could see more like double the performance from the 4090 compared to the 4070 Ti, which in turn would be roughly triple the performance of the RTX 3060. We'll have to wait and see how these projects develop over time. These final two charts are merely to illustrate that the current results may not be indicative of what we can expect in the future. Running Stable-Diffusion for example, the RTX 4070 Ti hits 99–100 percent GPU utilization and consumes around 240W, while the RTX 4090 nearly doubles that — with double the performance as well. With Oobabooga Text Generation, we see generally higher GPU utilization the lower down the product stack we go, which does make sense: More powerful GPUs won't need to work as hard if the bottleneck lies with the CPU or some other component. Power use on the other hand doesn't always align with what we'd expect. RTX 3060 being the lowest power use makes sense. The 4080 using less power than the (custom) 4070 Ti on the other hand, or Titan RTX consuming less power than the 2080 Ti, simply show that there's more going on behind the scenes. Long term, we expect the various chatbots — or whatever you want to call these "lite" ChatGPT experiences — to improve significantly. They'll get faster, generate better results, and make better use of the available hardware. Now, let's talk about what sort of interactions you can have with text-generation-webui. This is what we initially got when we tried running on a Turing GPU for some reason. Redoing everything in a new environment (while a Turing GPU was installed) fixed things. AI Text Generation Q&A and chat (Image credit: Tom's Hardware) Chatting with Chiharu Yamada, who thinks computers are amazing. The Text Generation project doesn't make any claims of being anything like ChatGPT, and well it shouldn't. ChatGPT will at least attempt to write poetry, stories, and other content. In its default mode, TextGen running the LLaMa-13b model feels more like asking a really slow Google to provide text summaries of a question. But the context can change the experience quite a lot. Many of the responses to our query about simulating a human brain appear to be from forums, Usenet, Quora, or various other websites, even though they're not. This is sort of funny when you think about it. You ask the model a question, it decides it looks like a Quora question, and thus mimics a Quora answer — or at least that's our understanding. It still feels odd when it puts in things like "Jason, age 17" after some text, when apparently there's no Jason asking such a question. Again, ChatGPT this is not. But you can run it in a different mode than the default. Passing "--cai-chat" for example gives you a modified interface and an example character to chat with, Chiharu Yamada. And if you like relatively short responses that sound a bit like they come from a teenager, the chat might pass muster. It just won't provide much in the way of deeper conversation, at least in my experience. Perhaps you can give it a better character or prompt; there are examples out there. There are plenty of other LLMs as well; LLaMa was just our choice for getting these initial test results done. You could probably even configure the software to respond to people on the web, and since it's not actually "learning" — there's no training taking place on the existing models you run — you can rest assured that it won't suddenly turn into Microsoft's Tay Twitter bot after 4chan and the internet start interacting with it. Just don't expect it to write coherent essays for you. Given the instructions on the project's main page(opens in new tab), you'd think getting this up and running would be pretty straightforward. I'm here to tell you that it's not, at least right now, especially if you want to use some of the more interesting models. But it can be done. The base instructions for example tell you to use Miniconda on Windows. If you follow the instructions, you'll likely end up with a CUDA error. Oops. This more detailed set of instructions off Reddit(opens in new tab) should work, at least for loading in 8-bit mode. The main issue with CUDA gets covered in steps 7 and 8, where you download a CUDA DLL and copy it into a folder, then tweak a few lines of code. Download an appropriate model and you should hopefully be good to go. The 4-bit instructions totally failed for me the first times I tried them (update: they seem to work now, though they're using a different version of CUDA than our instructions). lxe has these alternative instructions(opens in new tab), which also didn't quite work for me. I got everything working eventually, with some help from Nvidia and others. The instructions I used are below... but then things stopped working on March 16, 2023, as the LLaMaTokenizer spelling was changed to "LlamaTokenizer" and the code failed. Thankfully that was a relatively easy fix. But what will break next, and then get fixed a day or two later? We can only guess, but as of March 18, 2023, these instructions worked on several different test PCs. Link to comment Share on other sites More sharing options...
Recommended Posts