The maths you need to start understanding LLMs

(gilesthomas.com)

552 points | by gpjt 4 days ago

30 comments

  • libraryofbabel 20 hours ago
    Way back when, I did a masters in physics. I learned a lot of math: vectors, a ton of linear algebra, thermodynamics (aka entropy), multi-variable and then tensor calculus.

    This all turned out to be mostly irrelevant in my subsequent programming career.

    Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.

    It’s fun using this stuff again. All except the tensor calculus on curved spacetime; I haven’t had to reach for that yet.

    • r-bryan 10 hours ago
      Check out this 156-page tome: https://arxiv.org/abs/2104.13478: "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges"

      The intro says that it "...serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented."

      Working all the way through that, besides relearning a lot of my undergrad EE math (some time in the previous century), I learned a whole new bunch of differential geometry that will help next time I open a General Relativity book for fun.

      • minhaz23 10 hours ago
        I have very little formal education in advanced maths, but I’m highly motivated to learn the math needed to understand AI. Should I take a stab at parsing through and trying to understand this paper (maybe even using AI to help, heh), or would that be counter-productive from the get-go, and am I better off spending my time following some structured courses in pre-requisite maths before trying to understand these research papers?

        Thank you for sharing this paper!

      • Quizzical4230 9 hours ago
        Thank you for sharing the paper!

        The link is broken though and you may want to remove the `:` at the end.

    • lazarus01 40 minutes ago
      Modern numerical compute frameworks, including TensorFlow and JAX, provide automatic differentiation to calculate derivatives.
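
      For instance, a minimal sketch with JAX (the toy function and numbers here are mine, purely for illustration):

          import jax

          # Toy scalar function: f(x) = x**2 + 3x, so f'(x) = 2x + 3.
          def f(x):
              return x ** 2 + 3.0 * x

          df = jax.grad(f)   # automatic differentiation gives f'
          print(df(2.0))     # 7.0, matching 2*2 + 3
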
    • psb217 20 hours ago
      That past work will pay off even more when you start looking into diffusion and flow-based models for generating images, videos, and sometimes text.
      • pornel 18 hours ago
        The breakthrough in image generation speed literally came from applying better differential equations for diffusion, taken from statistical mechanics physics papers:

        https://youtu.be/iv-5mZ_9CPY

    • JBits 19 hours ago
      For me, it's the very basics of general relativity which made the distinction between the cotangent and tangent spaces click. Optimisation on Riemannian manifolds might give an opportunity to apply more interesting tensor calculus with a non-trivial metric.
    • jwar767 14 hours ago
      I have the same experience but with a masters in control theory. Suddenly all the linear algebra and differential equations are super useful in understanding this.
    • alguerythme 18 hours ago
      Well, speaking of calculus on curved space, please let me introduce you to https://arxiv.org/abs/2505.18230 (this is self-advertising). If you know how to incorporate time into that, I am interested.
    • CrossVR 17 hours ago
      Any reason you didn't pick up computer graphics before? Everything is linear algebra and there's even actual physics involved.
      • ls-a 15 hours ago
        Is that you Will Hunting
  • spinlock_ 22 hours ago
    For me, working through Karpathy's video series (instead of just "watching" them) helped me tremendously to understand how LLMs work and gave me the confidence to read through more advanced material, if I feel like it. But to be honest, the knowledge I gained through his videos is already enough for me. It's kind of like learning how a CPU works in general and ignoring all the fancy optimization steps that I'm not interested in.

    Thanks Andrej for the time and effort you put into your videos.

    • meken 17 hours ago
      +1. The cs231n class he taught at Stanford gave me a great foundation.
    • karpathy 12 hours ago
      <3
    • romanoonhn 17 hours ago
      Can you share what you mean by "working through" the videos? This playlist has been on my todo for a few weeks so I'm interested.
  • rsanek 1 day ago
    Anyone else read the book that the author mentions, Build a Large Language Model (from Scratch) [0]? After watching Karpathy's video [1] I've been looking for a good source to do a deeper dive.

    [0] https://www.manning.com/books/build-a-large-language-model-f...

    [1] https://www.youtube.com/watch?v=7xTGNNLPyMI

    • tanelpoder 20 hours ago
      Yes, can confirm, the book is great. I was also happy to see that the author correctly (in my mind) used the term “embedding vectors” vs. “vector embeddings” that most others seem to use… Some more context about my pet peeve: https://tanelpoder.com/posts/embedding-vectors-vs-vector-emb...
    • malshe 22 hours ago
      Here is the code used in the book - https://github.com/rasbt/LLMs-from-scratch
    • gchadwick 17 hours ago
      I thought it was a great book, dives into all the details and lays it out step by step with some nice examples. Obviously it's a pretty basic architecture and very simplistic training but I found it gave me the grounding to then understand more complex architectures.
    • kamranjon 23 hours ago
      It’s good - I’m working through it right now
    • horizion2025 21 hours ago
      Is there a non-video equivalent? I always prefer reading/digesting at my own pace to following a video.
      • gpjt 14 hours ago
        Check the first link in the parent comment, it's a link to the book.
    • tra3 22 hours ago
      Is [1] worth a watch if I want to get a high level/basic understanding of how LLMs work?
      • rsanek 20 hours ago
        Yeah, it's very well done
    • ForceBru 22 hours ago
      Yes, it's really good
  • ozgung 1 day ago
    This is not about _Large_ Language models though. This explains math for word vectors and token embeddings. I see this is the source of confusion for many people. They think LLMs just do this to statistically predict the next word. That was pre-2020s. They ignore the 1.8+ Trillion-parameter Transformer network. Embeddings are just the input of that giant machine. We don't know what is going on exactly in those trillions of parameters.
    • ants_everywhere 1 day ago
      But surely you need this math to start understanding LLMs. It's just not the math you need to finish understanding them.
      • HarHarVeryFunny 22 hours ago
        It depends on what level of understanding, and who you are talking about. For the 99% of people outside of software development or machine learning, it is totally irrelevant, as are any details of the Transformer architecture, or the mechanism by which a trained Transformer operates.

        For the man in the street, inclined to view "AI" as some kind of artificial brain or sentient thing, the best explanation is that basically it's just matching inputs to training samples and regurgitating continuations. Not totally accurate of course, but for that audience at least it gives a good idea and is something they can understand, and perhaps gives them some insight into what it is, how it works/fails, and that it is NOT some scary sentient computer thingy.

        For anyone in the remaining 1% (or much less - people who actually understand ANNs and machine learning), learning about the Transformer architecture and how a trained Transformer works (induction heads etc) is what they need in order to understand what a (Transformer-based, vs LSTM-based) LLM is and how it works.

        Knowing about the "math" of Transformers/ANNs is only relevant to people who are actually implementing them from the ground up, not even to those who might just want to build one using PyTorch or some other framework/library where the math has already been done for you.

        Finally, embeddings aren't about math - they are about representation, which is certainly important to understanding how Transformers and other ANNs work, but still a different topic.

        * The US population of ~300M has ~1M software developers, a large fraction of whom are going to be doing things like web development, and only at a marginal advantage over someone smart outside of development in terms of learning how ANNs etc. work.

        • gpjt 14 hours ago
          Post author here. I agree 100%! The post is the basic maths for people digging in to how LLMs work under the hood -- I wrote a separate one for non-techies who just want to know what they are, at https://www.gilesthomas.com/2025/08/what-ai-chatbots-are-doi...
        • ants_everywhere 19 hours ago
          I agree that most people don't need to understand the mathematics or design of the transformer architecture, but that isn't a good description of what LLMs do from a technical perspective. Someone with that mental model would be worse off than someone who had no mental model at all and just used it as a black box.
          • HarHarVeryFunny 19 hours ago
            I disagree - I just had my non-technical sister staying with me, who said she was creeped out by "AI" and didn't like that it heard her in the background while her son was talking to Gemini.

            An LLM is, at the end of the day, a next-word predictor, trying to predict according to training samples. We all understand that it's the depth/sophistication of context pattern matching that makes "stochastic parrot" an inadequate way to describe an LLM, but conceptually it is still more right than wrong, and is the base level of understanding you need before beginning to understand why it is inadequate.

            I think it's better for a non-technical person to understand "AI" as a stochastic parrot than to have zero understanding and think of it as a black box, or sentient computer, especially if that makes them afraid of it.

            • bonoboTP 18 hours ago
              She's right to be creeped out by the normalization of cloud-based processing of her audio and the increasing surveillance infrastructure. No AI tech understanding needed. Sometimes being more ignorant of the details can allow people to see the big picture better.
              • nickpsecurity 9 hours ago
                This 100%. The surveillance industry tries to normalize stalking people to exploit them. It's creepy and evil, not normal.
      • HSO 1 day ago
        "necessary but not sufficient"
    • cranx 1 day ago
      But we do. A series of mathematical functions are applied to predict the next tokens. It’s not magic although it seems like it is. People are acting like it’s the dark ages and Merlin made a rabbit disappear in a hat.
      • ekunazanu 23 hours ago
        Depends on your definition of knowing. Sure, we know it is predicting next tokens, but do we understand why they output the things they do? I am not well versed with LLMs, but I assume even for smaller models interpretability is a big challenge.
        • chongli 20 hours ago
          The answer is simple: the set of weights and biases comprise a mathematical function which has been specifically built to approximate the training set. The methods of building such a function are very old and well-known (from calculus).

          There's no magic here. Most of people's awestruck reactions are due to our brain's own pattern recognition abilities and our association of language use with intelligence. But there's really no intelligence here at all, just like the "face on Mars" is just a random feature of a desert planet's landscape, not an intelligent life form.

        • lazide 23 hours ago
          For any given set of model weights and inputs? Yes, we definitely do understand them.

          Do we understand the emergent properties of almost-intelligence they appear to present, and what that means about them and us, etc. etc.?

          No.

          • jvanderbot 21 hours ago
            Right. The machine works as designed and it's all assembly instructions on gates. The values in the registers change but not the instructions.

            And it happens to do something weirdly useful to our own minds based on the values in the registers.

      • umanwizard 20 hours ago
        Doesn’t this apply to any process (including human brains) that outputs sequences of words? There is some statistical distribution describing what word will come next.
    • clickety_clack 21 hours ago
      That is what they do though. It might have levels of accuracy we can barely believe, but it is still a statistical process that predicts the next tokens.
      • ozgung 20 hours ago
        Not necessarily. They can generate letters, tokens, or words in any order. They can even write them all at once, as they do in a diffusion model. Next-token generation (auto-regression) is just a design choice of GPT, mostly for practical reasons. It fits the task at hand naturally (we humans also generate words in sequential order). Also, they have to train GPT in a self-supervised manner since we don't have labeled internet-scale data. Auto-regression solves that problem as well.

        The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think on them, can plan and act on the plan, can reason up to a point, in order to generate the next token. They learn all of this via that training scheme. They don't learn just the frequency of word relationships, unlike the old algorithms. Trillions of parameters do much more than that.

        • griffzhowl 16 hours ago
          > The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think on them, can plan and act on the plan, can reason up to a point, in order to generate the next token.

          This sounds way over-blown to me. What we know is that LLMs generate sequences of tokens, and they do this by clever ways of processing the textual output of millions of humans.

          You say that, in addition to this, LLMs model the world, understand, plan, think, etc.

          I think it can look like that, because LLMs are averaging the behaviours of humans who are actually modelling, understanding, thinking, etc.

          Why do you think that this behaviour is more than simply averaging the outputs of millions of humans who understand, think, plan, etc.?

          • ozgung 5 hours ago
            > Why do you think that this behaviour is more than simply averaging the outputs of millions of humans who understand, think, plan, etc.?

            This is why it’s important to make the distinction that Machine Learning is a different field than Statistics. Machine Learning models do not “average” anything. They learn to generalize. Deep Learning models can handle edge cases and unseen inputs very well.

            In addition to that, OpenAI etc. probably use a specific post-training step (like RLHF or better) for planning, reasoning, following instructions step by step etc. This additional step doesn’t depend on the outputs of millions of humans.

        • HarHarVeryFunny 16 hours ago
          How can an LLM model the world, in any meaningful way, when it has no experience of the world?

          An LLM is a language model, not a world model. It has never once had the opportunity to interact with the real world and see how it responds - to emit some sequence of words (the only type of action it is capable of generating), predict what will happen as a result, and see if it was correct.

          During training the LLM will presumably have been exposed to some second-hand accounts (as well as fictional stories) of how the world works, mixed up with sections of Stack Overflow code and Reddit rantings, but even those occasional accounts of real-world interactions (context, action + result) are only at best teaching it about the context that someone else, at that point in their life, saw fit to mention as causal/relevant to the action outcome. The LLM isn't even privy to the world model of the raconteur (let alone the actual complete real-world context in which the action was taken, or the detailed manner in which it was performed), so this is a massively impoverished source of second-hand experience from which to learn.

          It would be like someone who had spent their whole life locked in a windowless room reading randomly ordered paragraphs from other people's diaries of daily experience (randomly interspersed with chunks of fairy tales and Python code), without themselves ever having actually seen a tree or jumped in a lake, or ever having had the chance to test which parts of the mental model they had built, of what was being described, were actually correct or not, and how it aligned with the real outside world they had never laid eyes on.

          When someone builds an AGI capable of continual learning, and sets it loose in the world to interact with it, then it'll be reasonable to say it has its own model of how the world works, but as far as pre-trained language models go, it seems closer to the mark to say that they are indeed just language models, modelling the world of words which is all they know, and the only kind of model for which they had access to feedback (next-word prediction errors) to build.

          • istjohn 7 hours ago
            We build mental models of things we have not personally experienced all the time. Such mental models lack the detail and vividness of that of someone with first-hand experience, but they are nonetheless useful. Indeed, a student of physics who has never touched a baseball may have a far more accurate and precise mental model of a curve ball than a major league pitcher.
        • jurgenaut23 20 hours ago
          Can you provide sources for your claim that LLMs “model the world”?
          • ozgung 19 hours ago
            You are right that it is a bold claim but here is a relevant summary: https://en.wikipedia.org/wiki/Stochastic_parrot#Interpretabi...

            I think "The Platonic Representation Hypothesis" is also related: https://phillipi.github.io/prh/

            Unfortunately, large LLMs like ChatGPT and Claude are black boxes for researchers. They can't probe what is going on inside those things.

          • lgas 19 hours ago
            It seems somewhat obvious to me. Language models the world, and LLMs model language. If A models B and B models C then A models C, as well, no?
            • TurboTveit 15 hours ago
              Can you provide sources for your claim that language “models the world”?
    • measurablefunc 18 hours ago
      It's exactly the same math. There is no mathematics in any neural network, regardless of its scale, that cannot be expressed with matrix multiplications & activation functions.
    • libraryofbabel 20 hours ago
      * You’re right that a lot of people take a cursory look at the math (or someone else’s digest of it) and their takeaway is “aha, LLMs are just stochastic parrots blindly predicting the next word. It’s all a trick!”

      * So we find ourselves over and over again explaining that that might have been true once, but now there are (imperfect, messy, weird) models of large parts of the world inside that neural network.

      * At the same time, the vector embedding math is still useful to learn if you want to get into LLMs. It’s just that the conclusions people draw from the architecture are often wrong.

    • baxtr 1 day ago
      Wait so you’re saying it’s not a high-dimensional matrix multiplication?
      • dmd 1 day ago
        Everything is “just” ones and zeros, but saying that doesn’t help with understanding.
      • tatjam 20 hours ago
        Pretty much all problems can be reduced to some number of matrix multiplications ;)
  • armcat 23 hours ago
    One of the most interesting mathematical aspects to me is the fact that LLMs are logit emitters. And associated with this output is uncertainty. A lot of people talk about networks of agents. But what you are doing is accumulating uncertainty - every model in the chain introduces its own uncertainty on top of what it inherits. In some situations I've seen a complete collapse after 3 LLM calls chained together. Hence why a lot of people recommend "human in the loop" as much as possible to try and reduce that uncertainty (shift the posterior if you will); or they recommend more of a workflow approach - where you have a single orchestrator that decides which function to call, and most of the emphasis (and context engineering) is placed on that orchestrator. But it all ties together in the maths of LLMs.
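
    A crude way to see the compounding (a toy independence assumption made purely for illustration, not a real model of how logit-level uncertainty propagates): if each call in a chain is right 90% of the time and downstream calls never recover from an upstream mistake, reliability drops quickly.

        # Toy back-of-envelope: independent per-call reliability, no recovery.
        p_step = 0.9
        for n in range(1, 6):
            print(f"{n} chained calls -> {p_step ** n:.2f} chance the chain is right")
        # 1 -> 0.90, 2 -> 0.81, 3 -> 0.73, 4 -> 0.66, 5 -> 0.59
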
  • InCom-0 1 day ago
    These are technical details of computations that are performed as part of LLMs.

    Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.

    This is as if you started explaining how an ICE car works by diving into chemical properties of petrol. Yeah that really is the basis of it all, but no it is not where you start explaining how a car works.

    • jasode 23 hours ago
      >This is as if you started explaining how an ICE car works by diving into chemical properties of petrol.

      But wouldn't explaining the chemistry actually be acceptable if the title was "The chemistry you need to start understanding Internal Combustion Engines"?

      That's analogous to what the author did. The title was "The maths ..." -- and then the body of the article fulfills the title by explaining the math relevant to LLMs.

      It seems like you wished the author wrote a different article that doesn't match the title.

      • InCom-0 23 hours ago
        'The maths you need to start understanding LLMs'.

        You don't need that math to start understanding LLMs. In fact, I'd argue it's harmful to start there unless your goal is to 'take me on an epic journey of all the things mankind needed to figure out to make LLMs work from the absolute basics'.

    • bryanrasmussen 23 hours ago
      >Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.

      maybe this is the target group of people who would need particular "maths" to start understanding LLMs.

    • 49pctber 23 hours ago
      Anyone who would like to run an LLM needs to perform its computations on hardware. So picking hardware that is good at matrix multiplication is important for them, even if they didn't develop their LLM from scratch. Knowing the basic math also explains some of the rush to purchase GPUs and TPUs in recent years.

      All that is kind of missing the point though. I think people being curious and sharpening their mental models of technology is generally a good thing. If you didn't know an LLM was a bunch of linear algebra, you might have some distorted views of what it can or can't accomplish.

      • InCom-0 23 hours ago
        Being curious is good ... nothing wrong with that. What I took issue with above is (what I see as) an attempt to derail people into low-level math when that is not the crux of the question at all.

        Also: nobody who wants to run LLMs will write their own matrix multiplications. Nobody doing ML / AI comes close to that stuff ... it's all abstracted and not something anyone actually thinks about (except the few people who actually write the underlying libraries, i.e. at Nvidia).

        • antegamisou 19 hours ago
          > attempt to derail people into low level math when that is not the crux of the question at all.

          Is the barrier to entry to the ML/AI field really that low? I think no one seasoned would consider fundamental linear algebra 'low level' math.

          • InCom-0 17 hours ago
            What do you mean 'low'? :-)

            The barrier to entry is probably epically high, because to be actually useful you need to understand how to actually train a model in practice, how it is actually designed, how existing practices (i.e. at OpenAI or wherever) can be built upon further ... and you need to be cutting edge at all of those things. This is not taught anywhere; you can't read about it in some book. This has absolutely nothing to do with linear algebra ... or more accurately, you don't get better at those things by understanding linear algebra (or any math) better than the next guy. It is not as if 'If I were better at math, I would have been a better AI researcher or programmer or whatever' :-). This is just not what these people do or how that process works. Even the foundational research that sparked rapid LLM development (the 'Attention Is All You Need' paper) is not some math-heavy stuff. The whole thing is a conceptual idea that was tested and turned out to be spectacular.

            • antegamisou 3 hours ago
              > 'If I were better at math, I would have been better AI researcher

              This is the first time I've seen someone claim this. I don't know if it's a display of anti-intellectualism or plain ignorance. OTOH, most AI/ML papers' quality has deteriorated so much over the years that publications in different venues are essentially beautified PyTorch notebooks by people who just play around randomly with different parameters.

    • saagarjha 17 hours ago
      If you're just piecing together a bunch of libraries, sure. But anyone who is adjacent to ML research should know how these work.
      • InCom-0 17 hours ago
        Anyone actually physically doing ML research knows it ... but doesn't write the actual code for this stuff (or god forbid write some byzantine math notations somewhere), doesn't even think about this stuff except through X levels of higher level abstractions.

        Also, those people understand LLMs already :-).

    • antegamisou 22 hours ago
      Find someone on HN that doesn't trivialize fundamental math yet encourages everyone to become a PyTorch monkey that ends up having no idea why their models are shite: impossible.
    • ivape 1 day ago
      Also, people need to accept that they’ve been doing regular ass programming for many years and can’t just jump into whatever they want. The idea that developers were well rounded general engineers is a myth mostly propagated from within the bubble.

      Most people’s educations right here probably didn’t even involve Linear Algebra (this is a bold claim, because the assumption is that everyone here is highly educated, no cap).

  • erdehaas 9 hours ago
    The title is misleading. The maths explained in the blog is the math used to build an LLM (how it internally does calculations to do inference etc.). The math to understand LLMs, i.e. that explains with mathematical rigor why LLMs work, is not fully developed yet. That is what LLM explainability is about: the effort to understand and clarify the complex, "black-box" decision-making processes of Large Language Models (LLMs) in human-interpretable terms.
  • Rubio78 20 hours ago
    Working through Karpathy's series builds a foundational understanding of LLMs, providing enough confidence to explore further. A key insight is that LLMs are logit emitters, and their inherent uncertainty compounds dangerously in multi-agent chains, often requiring a human-in-the-loop or a single orchestrator to manage it. Crucially, people confuse word embeddings with the full LLM; embeddings are just the input to a vast, incomprehensible trillion-parameter transformer. The underlying math of these networks is surprisingly simple, built on basic additions and multiplications. The real mystery isn't the math but why they work so well. Ultimately, AI research is a mix of minimal math, extensive data engineering, massive compute power, and significant trial and error.
  • stared 1 day ago
    Well, in short - basic linear algebra, basic probability, analysis (functions like exp), gradient.

    At some point I tried to create a step-by-step introduction where people can interact with these concepts and see how to express them in PyTorch:

    https://github.com/stared/thinking-in-tensors-writing-in-pyt...

    • MichaelRazum 1 day ago
      Although is it really "understanding", or just being able to write down the formulas...?
      • stared 1 day ago
        Being able to use a formula is the first, and necessary, step for understanding.

        Then comes being able to work at different levels of abstraction and to find analogies. But at this point, in my understanding, "understanding" is a never-ending well.

        • MichaelRazum 23 hours ago
          How about elliptic curve cryptography then? I just think coming up with a formula is not really understanding. Actually, most often the “real” formula is the end step of understanding, reached through derivation. ML does it upside down in this regard.
      • misternintendo 1 day ago
        In some way it is true. Like understanding how a car works purely on physics laws.
  • ryanchants 23 hours ago
    I'm currently working through the Mathematics for Machine Learning and Data Science Specialization from Deeplearning.AI. It's been the best intro to Linear Algebra I've found. It's worth the $50 a month just for the quizzes, labs, etc. I'm simultaneously working through the book Math and Architectures of Deep Learning, which is helping reinforce and flesh out the ideas from the course.

    [0] https://www.coursera.org/specializations/mathematics-for-mac... [1] https://www.manning.com/books/math-and-architectures-of-deep...

  • kingkongjaffa 1 day ago
    The steps in this article are the same process used for doing RAG as well.

    You compute an embedding vector for your documents or chunks of documents. Then you compute the vector for your user's prompt, and use the cosine distance to find the most semantically relevant documents to use. There are other tricks like reranking the documents once you find the top N documents relating to the query, but that's basically it.
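
    A rough sketch of that lookup step in NumPy (the random vectors below just stand in for whatever your embedding model would return; cosine distance is 1 minus the similarity computed here):

        import numpy as np

        def cosine_similarity(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        # Stand-ins for real embeddings of three document chunks and one query.
        rng = np.random.default_rng(0)
        docs = ["chunk A", "chunk B", "chunk C"]
        doc_vecs = rng.normal(size=(3, 8))
        query_vec = rng.normal(size=8)

        scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
        top_n = sorted(zip(scores, docs), reverse=True)[:2]  # 2 most similar chunks
        print(top_n)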

    Here’s a good explanation

    http://wordvec.colorado.edu/website_how_to.html

  • zahlman 23 hours ago
    It appears that the "softmax" is found (as I hypothesized by looking at the results, before clicking the link) by exponentiating each value and normalizing to a sum of 1. It would be worthwhile to be explicit. The exponential function is also "high-school maths", and an explanation like that is much easier to follow than the Wikipedia article (since not a lot of rigour is required here).
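
    Written out, that is indeed all softmax is; a short check in NumPy (the max subtraction only guards against overflow and doesn't change the result):

        import numpy as np

        def softmax(x):
            e = np.exp(x - np.max(x))  # exponentiate (shifted for numerical stability)
            return e / e.sum()         # normalise so the values sum to 1

        print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.659 0.242 0.099]
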
  • orionuni 4 hours ago
    Thanks for sharing!
  • d_sem 1 day ago
    I think the author did a sufficient job caveating his post without being verbose.

    While reading through past posts I stumbled on a multi part "Writing an LLM from scratch" series that was an enjoyable read. I hope they keep up writing more fun content.

  • cultofmetatron 16 hours ago
    just wanna plug https://mathacademy.com/courses/mathematics-for-machine-lear....

    happy customer and have found it to be one of the best paid resources for learning mathematics in general. wish I had this when I was a student.

  • paradite 1 day ago
    I recently did a livestream on trying to understand attention mechanism (K, Q, V) in LLM.

    I think it went pretty well (was able to understand most of the logic and maths), and I touched on some of these terms.

    https://youtube.com/live/vaJ5WRLZ0RE?feature=share

    • gozzoo 1 day ago
      The constant scrolling is very distracting. I couldn't follow it.
      • paradite 1 day ago
        Thanks for the feedback!
  • jokoon 22 hours ago
    ML is interesting, but honestly I have trouble seeing where it's going, and whether I should learn the techniques to land a job or to avoid becoming too obsolete.

    There is certainly some hype; a lot of what is on the market is just not viable.

  • 11101010001100 1 day ago
    Apologies for the metacomment, but HN is a funny place. There is a certain type of learning that is deemed good ('math for AI') and a certain type of learning that is deemed bad ('leetcode for AI').
    • raincole 1 day ago
      What's leetcode for AI, and which site is deemed bad by HN? Without a concrete example it's just a strawman. It could be the site is deemed bad for other reasons. It could be a few vocal negative comments. It could be just not happening.
    • boppo1 1 day ago
      What would leetcode for AI be?
      • krackers 13 hours ago
        I suppose the closest thing might be the type of counting/probability questions asked at quant firms as a way to assess math skill
    • sgt101 1 day ago
      could you give an example of "HN would not like this AI leetcode"?
    • enjeyw 1 day ago
      I mean I kind of get it - overgeneralising (and projecting my own feelings), but I think HN favours introducing and discussing foundational concepts over things that are closer to memorising/rote learning. I think AI Math vs Leetcode broadly fits into that category.
    • apwell23 1 day ago
      honestly I would love 'leetcode for AI'. I am just so sick of all the videos and articles about it.
  • nativeit 19 hours ago
    Ah, I was hoping this would teach me the maths to start understanding the economics surrounding LLMs. That’s the really impossible stuff.
  • kekebo 1 day ago
    I've had the best time with Andrej Karpathy's YouTube intros to LLM math, but I haven't compared their scope or quality to this submission.
  • petesergeant 23 hours ago
    You need virtually no maths to deeply and intuitively understand embeddings: https://sgnt.ai/p/embeddings-explainer/
  • oulipo2 1 day ago
    Additions and multiplications. People are making it sound like it's complicated, but NNs have the most basic and simple maths behind them.

    The only thing is that nobody understands why they work so well. There are a few function approximation theorems that apply, but nobody really knows how to make them behave as we would like.

    So basically AI research is 5% "maths", 20% data sourcing and engineering, 50% compute power, and 25% trial and error

    • amelius 1 day ago
      Gradient descent is like pounding on a black box until it gives you the answers you were looking for. There is little more we know about it. We're basically doing Alchemy 2.0.

      The hard technology that makes this all possible is in semiconductor fabrication. Outside of that, math has comparatively little to do with our recent successes.

    • p1dda 22 hours ago
      > The only thing is that nobody understand why they work so well.

      This is exactly what I have ascertained from several different experts in this field. Interesting that a machine has been constructed that performs better than expected and/or is performing more advanced tasks than the inventors expected.

      • skydhash 21 hours ago
        The linear regression model “ax + b” is the simplest one and is still quite useful. It can be interesting to discover some phenomenon that fits the model, but that's not something people have control over. But imagine spending years (expensively) training stuff with millions of weights, only to ultimately discover it was as simple as “e = mc^2” (and c^2 is basically a constant, so the equation is technically linear).
  • apwell23 1 day ago
    > Actually coming up with ideas like GPT-based LLMs and doing serious AI research requires serious maths.

    Does it ? I don't think so. All the math involved is pretty straightforward.

    • ants_everywhere 1 day ago
      It depends on how you define the math involved.

      Locally it's all just linear algebra with an occasional nonlinear function. That is all straightforward. And by straightforward I mean you'd cover it in an undergrad engineering class -- you don't need to be a math major or anything.

      Similarly CPUs are composed of simple logic operations that are each easy to understand. I'm willing to believe that designing a CPU requires more math than understanding the operations. Similarly I'd believe that designing an LLM could require more math. Although in practice I haven't seen any difficult math in LLM research papers yet. It's mostly trial and error and the above linear algebra.

      • apwell23 1 day ago
        Yeah, I would love to see what complicated math all this came out of. I thought rigorous math was actually an impediment to AI progress. Did any math actually predict or prove that scaling data would create current AI?
        • ants_everywhere 19 hours ago
          I was thinking more about the everyday use of more advanced math to solve "boring" engineering challenges. Like finite math to lay out chips or kernels. Or improvements to Strassen's algorithm for matrix multiplication. Or improving the transformer KV cache, etc.

          The math you would use to, for example, prove that search algorithm is optimal will generally be harder than the math needed to understand the search algorithm itself.

    • empiko 15 hours ago
      It is straightforward because you have probably been exposed to a ton of AI/ML content in your life.
  • tsunamifury 21 hours ago
    I’m sure no one will read this but I was on the team that invented a lot of this early pre-LLM math at Google.

    It was a really exciting time for me as I had pushed the team to begin looking at vectors beyond language (actions and other predictable parameters we could extract from linguistic vectors).

    We had originally invented a lot of this because we were trying to make chat and email easier and faster, and ultimately I had morphed it into predicting UI decisions based on conversation vectors. Back then we could only do pretty simple predictions (continue vector strictly, reverse vector strictly, or N vector options on an axis) but we shipped it, and you saw it when we made Hangouts, Gmail and Allo predict your next sentence. Our first incarnation was interesting enough that Eric Schmidt recognized it and took my work to the board as part of his big investment in ML. From there the work in Hangouts became Allo/Gmail etc.

    Bizarrely enough, under Sundar this became the Google Assistant, but we couldn't get much further without attention layers, so the entire project regressed back to fixed bot pathing.

    I argued pretty hard with the executives that this was a tragedy, but Sundar would hear none of it, completely obsessed with Alexa and having a competitor there.

    I found some sympathy with the now head of search who gave me some budget to invest in a messaging program that would advance prediction to get to full action prediction across the search surface and UI. We launched and made it a business messaging product but lost the support of executives during the LLM panic.

    Sundar cut us and fired the whole team, ironically right when he needed it the most. But he never listened to anyone who worked on the tech and seemed to hold their thoughts in great disdain.

    What happened after that is of course well known now, as Sundar ignored some of the most important tech in history due to this attitude.

    I don’t think I’ll ever fully understand it.

  • fnord77 22 hours ago
    Nothing about vector calculus to minimize loss functions, or about needing to find Hessians to do Newton's method.
  • lazarus01 20 hours ago
    Here are the building blocks for any deep learning system, with a little bit about LLMs towards the end.

    Graphs - It all starts with computational graphs. These are data structures that include element-wise operations, usually matrix multiplication, addition, activation functions and a loss function. The computations are differentiable, resulting in a smooth continuous space appropriate for continuous optimization (gradient descent), which is covered later.

    Layers - Layers are modules composed of graphs that apply some computation and store the results in a state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, ultimately learning a latent manifold: a highly structured, lower-dimensional space that interpolates between samples, achieving generalization for predictions.

    Different machine learning problems and data types use different layers, e.g. Transformers for sequence to sequence learning and convolutions for computer vision models, etc.

    Models - Models organize stacks of layers for training. They include a loss function that sends a feedback signal to an optimizer to adjust the learned weights during training, plus an evaluation metric for accuracy, independent of the loss function.

    Forward pass - For training or inference, the input sequence passes through all the network layers and a geometric transformation is applied, producing an output.

    Backpropagation - During training, after the forward pass, gradients are calculated for each weight with respect to the loss (gradients are just another word for derivatives). The process for calculating the derivatives is called automatic differentiation, which is based on the chain rule of differentiation.

    Once the derivatives are calculated, the optimizer updates the weights so as to reduce the loss. This is the process called “learning”, often referred to as gradient descent.
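
    A bare-bones version of that update loop, on a made-up one-parameter problem (real optimizers such as Adam add momentum and per-weight scaling, but the basic step is the same):

        # Minimise loss(w) = (w - 3)^2; its derivative is d(loss)/dw = 2(w - 3).
        w = 0.0     # initial weight
        lr = 0.1    # learning rate
        for step in range(50):
            grad = 2 * (w - 3.0)  # gradient of the loss with respect to the weight
            w -= lr * grad        # gradient descent update
        print(w)    # close to 3.0, the minimum of the loss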

    Now for Large Language Models.

    Before models are trained for sequence to sequence learning, the corpus of knowledge must be transformed into embeddings.

    Embeddings are dense representations of language that includes a multidimensional space that can capture meaning and context for different combinations of words that are part of sequences.

    LLMs use a specific kind of network layer, the transformer block, which includes something called an attention mechanism.

    The attention mechanism uses the embeddings to dynamically update the meaning of words when they are brought together in a sequence.

    The model uses three different representations of the input sequence, called the key, query and value matrices.

    Using dot products, attention scores are created to identify the meaning of the reference sequence; then a target sequence is generated.

    The output sequence is predicted one word at a time, based on a sampling distribution of the target sequence, using a softmax function.
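
    A stripped-down sketch of that attention computation in NumPy (single head, no learned projection matrices or masking; the random numbers stand in for real embeddings):

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        rng = np.random.default_rng(0)
        seq_len, d = 4, 8                      # 4 tokens, 8-dimensional embeddings
        Q = rng.normal(size=(seq_len, d))      # queries
        K = rng.normal(size=(seq_len, d))      # keys
        V = rng.normal(size=(seq_len, d))      # values

        scores = Q @ K.T / np.sqrt(d)          # dot-product attention scores
        weights = softmax(scores, axis=-1)     # each row sums to 1
        output = weights @ V                   # per-token weighted mix of values
        print(output.shape)                    # (4, 8)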

  • pangeranslot 1 day ago
    [dead]
  • odofog 8 hours ago
    [dead]
  • Mallowram 23 hours ago
    [dead]