Just a stranger trying things.

  • 7 Posts
  • 321 Comments
Joined 1 year ago
cake
Cake day: July 16th, 2023

help-circle



  • I’m not sure I see the issue to be honest. The development is made in the open, the architecture is pretty flexible and is designed to be rather robust to rug pulls specifically such that less trust is required in the model.

    Also, whenever these discussions happen, I can’t stop feeling that it is somehow also meant to imply that mastodon is somehow better. And I am not a fan of that, as if there could only be one good social network. The internet is better with multiple services, multiple of many things. That’s how there is cooperation, compatibility and development for the better.










  • Not familiar with the guy himself who maybe does deserve criticism and prison, but about the Quran burning, is it genuinely fair to sentence someone to prison for that? Is it equivalent to burning the cross? The Swedish flag? I might be mission a broader context, but I don’t feel like someone burning my symbol or flag should be punished with prison. Am I alone? I would hate it, don’t get me wrong, but I still feel it goes in freedom of expression.




  • I didn’t say it can’t. But I’m not sure how well it is optimized for it. From my initial testing it queues queries and submits them one after another to the model, I have not seen it batch compute the queries, but maybe it’s a setup thing on my side. vLLM on the other hand is designed specifically for the multi co current user use case and has multiple optimizations for it.


  • The Hobbyist@lemmy.ziptoSelfhosted@lemmy.worldSelf-hosting LLMs
    link
    fedilink
    English
    arrow-up
    23
    ·
    edit-2
    18 days ago

    I run the Mistral-Nemo(12B) and Mistral-Small (22B) on my GPU and they are pretty code. As others have said, the GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good and 70B+ models are excellent (solely based on my own experience). Go for q4_K models, as they will run many times faster than higher quantization with little performance degradation. They typically come in S (Small), M (Medium) and (Large) and take the largest which fits in your GPU memory. If you go below q4, you may see more severe and noticeable performance degradation.

    If you need to serve only one user at the time, ollama +Webui works great. If you need multiple users at the same time, check out vLLM.

    Edit: I’m simplifying it very much, but hopefully should it is simple and actionable as a starting point. I’ve also seen great stuff from Gemma2-27B

    Edit2: added links

    Edit3: a decent GPU regarding bang for buck IMO is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to propose AMD GPUs as they offer much more GPU mem for their price but they are not all as supported with ROCm and I’m not sure about the compatibility for these tools, so perhaps others can chime in.

    Edit4: you can also use openwebui with vscode with the continue.dev extension such that you can have a copilot type LLM in your editor.