• 0 Posts
  • 9 Comments
Joined 1 year ago
cake
Cake day: June 17th, 2023

help-circle

  • That’s not what lossless data compression schemes do:
    In lossless compression the general idea is to create a codebook of commonly occuring patterns and use those as shorthand.
    For example, one of the simplest and now ancient algorithms LZW does the following:

    • Initialize the dictionary to contain all strings of length one.
    • Initialize the dictionary to contain all strings of length one.
    • Emit the dictionary index for W to output and remove W from the input.
    • Add W followed by the next symbol in the input to the dictionary.
    • repeat
      Basically, instead of rewriting long sequences, it just writes down the index into an existing dictionary of already seen sequences.

    However, once this is done, you now need to find an encoding that takes your characterset (the original characters+the new dictionary references) and turns it into bits.
    It turns out that we can do this optimally: Using an algorithm called Arithmetic coding we can align the length of a bitstring to the amount of information it contains.
    “Information” here meaning the statistical concept of information, which depends on the inverse likelihood a certain character is observed.
    Logically this makes sense:
    Let’s say you have a system that measures earthquakes. As one would expect, most of the time, let’s say 99% of the time, you will see “no earthquake”, while in 1% of the cases you will observe “earthquake”.
    Since “no earthquake” is a lot more common, the information gain is relatively small (if I told you “the system said no earthquake”, you could have guessed that with 99% confidence: not very surprising).
    However if I tell you “there is an earthquake” this is much more important and therefore is worth more information.

    From information theory (a branch of mathematics), we know that if we want to maximize the efficiency of our codec, we have to match the length of every character to its information content. Arithmetic coding now gives us a general way of doing this.

    However, we can do even better:
    Instead of just considering individual characters, we can also add in character pairs!
    Of course, it doesn’t make sense to add in every possible character pair, but for some of them it makes a ton of sense:
    For example, if we want to compress english text, we could give a separate codebook entry to the entire sequence “the” and save a ton of bits!
    To do this for pairs of characters in the english alphabet, we have to consider 26*26=676 combinations.
    We can still do that: just scan the text 600 times.
    With 3 character combinations it becomes a lot harder 26*26*26=17576 combinations.
    But with 4 characters its impossible: you already have half a million combinations!
    In reality, this is even worse, since you have way more than 26 characters: you have things like ", . ? ! and your codebook ids which blow up the size even more!

    So, how are we supposed to figure out which character pairs to combine and how many bits we should give them?
    We can try to predict it!
    This technique, called [PPM](Prediction by partial matching) is already very old (~1980s), but still used in many compression algorithms.
    The important trick is now that with deep learning, we can train even more efficient estimators, without loosing the lossless property:
    Remember, we only predict what things we want to combine, and how many bits we want to assign to them!
    The worst-case scenario is that your compression gets worse because the model predicts nonsensical character-combinations to store, but that never changes the actual information you store, just how close you can get to the optimal compression.

    The state-of-the-art in text compression already uses this for a long time (see Hutter Prize) it’s just now getting to a stage where systems become fast and accurate enough to also make the compression useful for other domains/general purpose compression.



  • No, it’s built into the protocol: think of it like as if every http request forces you to attach some tiny additional box containing the solution to a math puzzle.

    The twist is that you want the math puzzle to be easy to create and verify, but hard to compute. The harder the puzzle you solve, the more you get prioritized by the service that sent you the puzzle.

    If your puzzle is cheaper to create than hosting your service is, then it’s much harder to ddos you since attackers get stuck at the puzzle, rather than getting to your expensive service



  • Not really: you have to keep in mind the amount of expertise and ressources that already went into silicon, as well as the geopolitics and sheer availability of silicon. The closest currently available competitor is probably gallium arsenide. That has a couple of disadvantages compared to silicon

    • It’s more expensive (both due to economies of scale and the fact that silicon is just much more abundant in general)
    • GaAs crystals are less stable, leading to smaller boules.
    • GaAs is a worse thermal conductor
    • GaAs has no native “oxide” (compare to SiO₂) which can be directly used as an insulator
    • GaAs mobilities are worse (Si is 500 vs GaAs 400), which means P channel FETs are naturally slower in GaAs, which makes CMOS structures impossible
    • GaAs is not a pure element, which means you get into trouble with mixing the elements
      You usually see GaAs combined with germanium substrates for solar panels, but rarely independently of that (GaAs is simply bad for logic circuits).
      In short: It’s not really useful for logic gates.

    Germanium itself is another potential candidate, especially since it can be alloyed with silicon which makes it interesting from an integration point-of-view.
    SiGe is very interesting from a logic POV considering its high forward and low reverse gain, which makes it interesting for low-current high-frequency applications. Since you naturally have heterojunctions which allow you to tune the band-gap (on the other hand you get the same problem as in GaAs: it’s not a pure element so you need to tune the band-gap).
    One problem specifically for mosfets is the fact that you don’t get stable silicon-germanium oxides, which means you can’t use the established silicon-on-insulator techniques.
    Cost is also a limiting factor: before even starting to grow crystals you have the pure material cost, which is roughly $10/kg for silicon, and $800/ kg for germanium.
    That’s why, despite the fact that the early semiconductors all relied on germanium, germanium based systems never really became practical: It’s harder to do mass production, and even if you can start mass production it will be very expensive (that’s why if you do see germanium based tech, it’s usually in low-production runs for high cost specialised components)

    There’s some research going on in commercialising these techniques but that’s still years away.




  • That paper makes a bunch of(implicit) assumptions that make it pretty unrealistic: basically they assume that once we have decently working models already, we would still continue to do normal “brain-off” web scraping.
    In practice you can use even relatively simple models to start filtering and creating more training data:
    Think about it like the original LLM being a huge trashcan in which you try to compress Terrabytes of mostly garbage web data.
    Then, you use fine-tuning (like the instruction tuning used the assistant models) to increases the likelihood of deriving non-trash from the model (or to accurately classify trash vs non-trash).
    In general this will produce a datasets that is of significantly higher quality simply because you got rid of all the low-quality stuff.

    This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high quality data (the model is also tiny: only half a percent the size of gpt-3).
    Previously tiny stories https://arxiv.org/abs/2305.07759 showed something similar: you can build high quality models with very little data, if you have good data (in the case of tiny stories they generate simply stories to train small language models).

    In general LLM people seem to re-discover that good data is actually good and you don’t really need these “shotgun approach” web scrape datasets.