<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
    <title>dgsq - RSS Feed</title>
    <link>https://dgsq.net/</link>
    <description></description>
    <language>en</language>
    <lastBuildDate>Thu, 07 May 2026 09:14:26 -0500</lastBuildDate>
    <atom:link href="https://dgsq.net/rss.xml" rel="self" type="application/rss+xml" />
    <item>
        <title>Why not sparse?</title>
        <link>https://dgsq.net/2026/05/06/why-not-sparse/</link>
        <guid isPermaLink="true">https://dgsq.net/2026/05/06/why-not-sparse/</guid>
        <pubDate>Wed, 06 May 2026 00:00:00 -0500</pubDate>
        <atom:updated>2026-05-06T00:00:00-05:00</atom:updated>
        <description><![CDATA[<p>I was talking with a colleague the other day about the current industry trend to
focus so much on further quantizing today's LLMs. I'm sure the reasons for
pursuing quantization are apparent to most people who work in this industry:
going from BF16 to FP8 or NVFP4 nets what amounts to a nearly 2x or 4x reduction
in memory requirements and often a similar speedup.</p>
<p>But I can't shake the feeling that by focusing so much on finding the lowest
precision models, we are perhaps neglecting an equally-fundamental limitation of
our current models: that they are too dense.</p>
<p>I have been struggling to articulate exactly what I mean by this for a while
now, until my colleague pointed me to <a href="https://arxiv.org/pdf/1803.03635">The Lottery Ticket
Hypothesis</a>, by Frankle and Carbin, a paper
that is probably well known to folks who are more integrated into the machine
learning community than I. In brief, the main thesis of the paper is:</p>
<blockquote>
<p>A randomly-initialized, dense neural network contains a subnetwork that is
initialized such that—when trained in isolation—it can match the test accuracy
of the original network after training for at most the same number of
iterations.</p>
</blockquote>
<p>The authors then show how by iteratively training a network, pruning away the
lowest-weighted connections, and then retraining the resultant network from
scratch, they can achieve similar performance on a few benchmarks with networks
down to 1% of the size they began with. Compare this 99% savings to the gains
that we can get from quantization: many models have already been quantized to 8
bits, so even assuming we <em>can</em> get them down to 1 bit, we would be saving only
87.5% on what we have today. Worth doing? Sure. But the going is getting harder,
and I don't think that we'll be getting all the way to 1 bit on every part of
every model.</p>
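<p>The train, prune, and reset loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: <code>toy_train</code> is a hypothetical stand-in for real training, the 20% per-round pruning fraction is an assumed setting, and every function name is made up for this example.</p>

```python
import math

def prune_lowest(weights, mask, fraction):
    """Drop the lowest-magnitude `fraction` of the weights that are still alive."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_drop = int(len(alive) * fraction)
    # Surviving indices, ordered from smallest to largest weight magnitude.
    by_magnitude = sorted(alive, key=lambda i: abs(weights[i]))
    dropped = set(by_magnitude[:n_drop])
    return [keep and i not in dropped for i, keep in enumerate(mask)]

def find_ticket(init_weights, train, rounds, fraction):
    """Iteratively train, prune, and reset survivors to their initial values."""
    mask = [True] * len(init_weights)
    for _ in range(rounds):
        # Restart from the ORIGINAL initialization, with pruned weights held at zero.
        start = [w if keep else 0.0 for w, keep in zip(init_weights, mask)]
        trained = train(start, mask)
        mask = prune_lowest(trained, mask, fraction)
    return mask

def rounds_to_reach(target, fraction):
    """Rounds needed until at most `target` of the weights remain (ignoring rounding)."""
    return math.ceil(math.log(target) / math.log(1.0 - fraction))

def toy_train(weights, mask):
    # Hypothetical placeholder: real training would update the unmasked weights.
    return [2.0 * w for w in weights]

init = [0.1, -0.5, 0.3, 0.9, -0.05, 0.7, -0.2, 0.4, 0.6, -0.8]
ticket = find_ticket(init, toy_train, rounds=2, fraction=0.2)
print(sum(ticket), "of", len(init), "weights survive")
print(rounds_to_reach(0.01, 0.2), "rounds to reach 1% at 20% per round")
```

<p>Under that assumed 20% schedule, the surviving fraction after k rounds is roughly 0.8 to the power k, so it takes about 21 rounds to get down to 1% of the original weights.</p>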
<p>Back to Frankle and Carbin: they also show that randomly selecting sub-networks
(instead of selecting based on results of training larger versions) does <em>not</em>
work as well, producing worse accuracy and slower convergence. In their words:</p>
<blockquote>
<p>The initialization that gives rise to a winning ticket is arranged in a
particular sparse architecture. Since we uncover winning tickets through heavy
use of training data, we hypothesize that the structure of our winning tickets
encodes an inductive bias customized to the learning task at hand.</p>
</blockquote>
<p>I find this very reminiscent of evolutionary methods like
<a href="https://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf">NEAT</a>, which
likewise attempted to build bespoke sparse networks, one neuron at a time. The
difference, of course, is that Frankle and Carbin work backwards from a
very dense network, while NEAT builds up a network from scratch.</p>
<p>I see this as different from efforts to find sparse attention mechanisms, as
sparse attention does not tend to affect the size of the model, but only how
expensive it is to operate with it.</p>
<p>I hope that once we're done building larger models, we can get to work on
building sparser models that more efficiently reflect the tasks they need to
perform than a series of dense operations could. After all, our brains are not
homogeneous meshes, but some sort of more contrived pattern of connections. I
have this feeling that the optimal model for most problems exists in this
sparse-yet-deep subspace, if only we could find it.</p>
<hr />
<p>In other news: I'm continuing the pattern of keeping long pauses between posts,
and changing the website's theme every other post. Bad habits.</p>]]></description>
        <dc:creator>dgsq (dgsq@dgsq.net)</dc:creator>
    </item>
    <item>
        <title>Speculations about DeepSeek-OCR and Quantization</title>
        <link>https://dgsq.net/2025/10/29/deepseek-ocr/</link>
        <guid isPermaLink="true">https://dgsq.net/2025/10/29/deepseek-ocr/</guid>
        <pubDate>Wed, 29 Oct 2025 00:00:00 -0500</pubDate>
        <atom:updated>2025-10-29T00:00:00-05:00</atom:updated>
        <description><![CDATA[<p>
    Recently, DeepSeek released a <a href="https://github.com/deepseek-ai/DeepSeek-OCR">new model and paper</a>, titled DeepSeek-OCR.
    I've seen a bunch of people talk about it recently, and it's pretty cool stuff.
</p>
<p>
    I love the idea of using images to compress text.
    As <a href="https://x.com/karpathy/status/1980397031542989305">Andrej points out</a>, the tokenizer is kind of ugly, and it
    would be cool to have a more natural system. Intuitively,
    it makes sense that a more optimal way to parse written letters/characters might involve some amount of visual processing, since
    that's sort of how humans do it, with many people pattern-matching several words (or more) at a time when reading quickly, instead of
    parsing text as individual letters/tokens.
</p>
<p>
    What I've been trying to figure out is how surprising this compression rate should be. From their abstract:
    <blockquote>
Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a
compression ratio &lt; 10x), the model can achieve decoding (OCR) precision of 97%. Even at a
compression ratio of 20x, the OCR accuracy still remains at about 60%.
    </blockquote>
</p>
<p>
    This sounds pretty good, but there are other ways to compress transformers. DeepSeek-OCR was released using BF16 weights, and since
    they didn't mention anything about their data precision in the paper, I'm assuming that's what they evaluated on.
</p>
<p>
    On the other hand, weight quantization in the form of low-precision inference has been making substantial progress.
    I can't find much data on <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4</a>,
    but it seems to promise 4-<i>ish</i> bits per weight and activation (a bit more due to scaling factors) for only slightly reduced accuracy.
    This means that the 10x-20x advantage claimed by the DeepSeek-OCR paper may just be a 2.5-5x advantage over a 4-bit quantized model.
    I would be very curious to see whether the image-token-to-text-token compression would remain as high as it is when this model is
    quantized to such low precision. Perhaps image tokens are just able to use this "extra space" better than text tokens.
</p>
<p>
    Taking this a step further: <i>if</i> this advantage really does decrease from 10x-20x to 2.5x-5x when the model is quantized to 4 bit,
    then this might imply that text tokens should really be quantized to between 0.8 and 1.6 bits. It seems like a somewhat pleasing result
    that this line of reasoning should land in the vicinity of 1-bit quantization.
</p>
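<p>
    For what it's worth, here is the back-of-the-envelope arithmetic behind those numbers as a small Python sketch. It only restates the speculation above: the function names are made up for illustration, and treating the 16-bit to 4-bit ratio as directly comparable to a token-count ratio is an assumption, not a measured result.
</p>

```python
BF16_BITS = 16   # precision of the released weights
NVFP4_BITS = 4   # nominal width; scaling factors add a little on top

def advantage_after_quantization(token_ratio, from_bits=BF16_BITS, to_bits=NVFP4_BITS):
    """Token-count advantage left once the text baseline shrinks from from_bits to to_bits."""
    return token_ratio / (from_bits / to_bits)

def equivalent_text_bits(token_ratio, vision_bits=BF16_BITS):
    """Bits per text token that would match one vision token standing in for token_ratio text tokens."""
    return vision_bits / token_ratio

for ratio in (10, 20):
    print(ratio, advantage_after_quantization(ratio), equivalent_text_bits(ratio))
# 10 2.5 1.6
# 20 5.0 0.8
```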
<p>
    On the other hand, it is also interesting to ask which would be better for performance: many 1-bit text tokens or 10-20x fewer 16-bit tokens.
    Or maybe something in the middle? It seems clear to me that fewer high-precision tokens would be desirable for low-latency decode, given
    its autoregressive nature, especially given that current GPU hardware targets 16-bit computations. Maybe the tradeoff would be different
    in other circumstances though.
</p>
<p>
    Anyways, one more disclaimer that this has all been very speculative. I have not carried out any experiments of my own here.
</p>]]></description>
        <dc:creator>dgsq (dgsq@dgsq.net)</dc:creator>
    </item>
    <item>
        <title>New Theme</title>
        <link>https://dgsq.net/2025/08/19/new-theme/</link>
        <guid isPermaLink="true">https://dgsq.net/2025/08/19/new-theme/</guid>
        <pubDate>Tue, 19 Aug 2025 00:00:00 -0500</pubDate>
        <atom:updated>2025-08-19T00:00:00-05:00</atom:updated>
        <description><![CDATA[<p>
Jekyll is great and all, but it sometimes feels a little <i>too</i> polished, bordering on bland.
</p>
<p>
I've been looking at some more bare-bones sites recently, and realizing that I tend to like the
feel of these when used for blogs and the like. In particular, I recently ran across
<a href="https://lunahd.neocities.org/">Miku's website</a>, and I thought that it seemed great.
</p>
<p>
So I stole it. And then changed the colorscheme. And then added a basic blogging functionality
to the site generator so that I can easily write posts.
</p>
<p>
If you like something about how this blog looks, please give all credit to Miku.
I'm at fault for anything you don't like.
</p>
<p>
Maybe someday I'll actually use this blog for something other than writing about this blog.
</p>]]></description>
        <dc:creator>dgsq (dgsq@dgsq.net)</dc:creator>
    </item>
    <item>
        <title>Computer Architects Can&apos;t Find the Average</title>
        <link>https://dgsq.net/2025/04/27/average/</link>
        <guid isPermaLink="true">https://dgsq.net/2025/04/27/average/</guid>
        <pubDate>Sun, 27 Apr 2025 00:00:00 -0500</pubDate>
        <atom:updated>2025-04-27T00:00:00-05:00</atom:updated>
        <description><![CDATA[<!-- LaTeX support -->
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@4/tex-mml-chtml.js"></script>
<p>Computer architects can't agree on a way to find the average.</p>
<p>For years, academic practitioners in this field have been arguing about the
appropriate way to summarize the average performance of their designs [1]. That
is: given \(n\) workloads, if system \(A\) outperforms system \(B\) by
\(S_1, S_2, \ldots, S_n\) on each, how much faster should you say system
\(A\) is, on average? I think this argument is kind of pointless though.</p>
<p>For the most part, people tend to use the arithmetic mean \(\left(\frac{1}{n}
\sum_{i=1}^n S_i\right)\) or the geometric mean \(\left(\sqrt[n]{\prod_{i=1}^n
S_i}\right)\). Hennessy and Patterson's famous <em>Computer Architecture: A
Quantitative Approach</em> advocates for the latter:</p>
<blockquote>
<p>Using the geometric mean ensures two important properties:</p>
<ol>
<li>The geometric mean of the ratios is the same as the ratio of the geometric
means.</li>
<li>The ratio of the geometric means is equal to the geometric mean of the
performance ratios, which implies that the choice of the reference computer
is irrelevant.</li>
</ol>
<p>Therefore the motivations to use the geometric mean are substantial,
especially when we use performance ratios to make comparisons.</p>
</blockquote>
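<p>To spell out the second property (the symbols here are my own notation, not H&amp;P's): let \(T_i^R\), \(T_i^A\), and \(T_i^B\) be the execution times of workload \(i\) on the reference machine and on machines \(A\) and \(B\). Each performance ratio is the reference time divided by the machine's time, and the reference times cancel when one geometric mean is divided by the other:</p>
<p>$$\frac{\sqrt[n]{\prod_{i=1}^n T_i^R / T_i^A}}{\sqrt[n]{\prod_{i=1}^n T_i^R / T_i^B}} = \sqrt[n]{\prod_{i=1}^n \frac{T_i^B}{T_i^A}}$$</p>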
<p>Other people disagree with H&amp;P's reasoning, but I think it's just about as good
as it gets.</p>
<h2>All Means are Bad</h2>
<p>Recently (well, over a year ago now), a
<a href="https://ieeexplore.ieee.org/document/10419888">paper</a> appeared in <em>IEEE
Computer Architecture Letters</em> with the title <em>R.I.P. Geomean Speedup Use
Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead</em>. Its author, Eeckhout,
argues that geomean is bad, and people should instead be using what he calls the
<em>Equal-Work Harmonic Speedup</em> or the <em>Equal-Time Harmonic Speedup</em>. Eeckhout
also presented this work at <a href="https://hpca-conf.org/2025/main-program/">HPCA
2025</a> as a part of the <em>Best of
Computer Architecture Letters</em> session.</p>
<p>The main thing that Eeckhout seems to dislike about the geometric mean is that it &quot;lacks physical meaning.&quot;
He claims that using one of his alternatives is better because they have physical meaning. One of the
alternatives that he proposes is the <i>Equal-Time Harmonic Speedup</i> (ETS), which is just the harmonic
mean of the speedups observed on each workload.</p>
<p>$$ETS = \frac{n}{\sum_{i=1}^n \frac{1}{S_i}}$$</p>
<p>Why use the Harmonic Mean instead of the Geometric Mean? Well, if every workload takes the same amount of time
to run on the baseline system, the ETS is equal to the total speedup observed when running each of those
workloads sequentially [2]. Eeckhout says that this physical meaning provides us with a compelling reason to
use something like this over the geometric mean.</p>
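<p>A quick Python check of that claim, as a sketch with made-up numbers (the speedups, the 10-second baseline times, and the function names are all assumptions chosen for illustration):</p>

```python
import math

def arithmetic_mean(speedups):
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    return math.prod(speedups) ** (1.0 / len(speedups))

def equal_time_harmonic_speedup(speedups):
    """ETS: the harmonic mean of the per-workload speedups."""
    return len(speedups) / sum(1.0 / s for s in speedups)

def sequential_speedup(baseline_times, speedups):
    """Total speedup when the workloads are run back to back."""
    new_times = [t / s for t, s in zip(baseline_times, speedups)]
    return sum(baseline_times) / sum(new_times)

speedups = [2.0, 4.0, 1.0]             # made-up per-workload speedups
equal_baseline = [10.0, 10.0, 10.0]    # every workload takes 10 s on the baseline

ets = equal_time_harmonic_speedup(speedups)
total = sequential_speedup(equal_baseline, speedups)
print("arithmetic:", arithmetic_mean(speedups))
print("geometric: ", geometric_mean(speedups))
print("ETS:       ", ets)
print("sequential:", total)
```

<p>With equal baseline times, the last two lines agree (30 seconds of work shrinks to 17.5 seconds, a speedup of about 1.71), while the arithmetic and geometric means report roughly 2.33 and 2.0 for the same data.</p>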
<p><b>But this physical meaning doesn't matter!</b> When I report a score for SPEC, I don't <i>really</i> care
about how long it takes to run every single workload in that benchmark in a sequential fashion! It's not like
I expect to run a sudoku solver (<code>exchange2</code>), then immediately compile <code>gcc</code>, and then perform video compression
(<code>x264</code>). I mean, I might run all of these at some point, but certainly not for the exact same amount of time [3].
Although the harmonic mean has a clear physical meaning, it's not one that really matters for many benchmark suites.</p>
<p>Admittedly, I don't <i>really</i> care about the geometric mean of these workloads either. I agree with Eeckhout
when he says the geomean doesn't have a clear physical meaning. But it comes down to a choice between an average
that doesn't have a clear physical meaning and one whose physical meaning isn't relevant in most situations.</p>
<h2>So is there actually a good number to report?</h2>
<p>Unless you actually know the precise mix of workloads being run in a real system, any number you report is going
to fail to accurately predict the effect of your design on that system. Benchmarks like SPEC are useful insofar as
they show general performance patterns, but no matter how you cut it, a single number is always going to fail to
provide a perfect comparison between machines when using a general-purpose benchmark suite.</p>
<p>If you do know the particular applications that you care about, and you know their relative importance, then by
all means, take their weighted average and you'll be set.</p>
<p>Otherwise, you might as well just use the geomean. It's easy to compare, and everyone else is familiar with it. Use
another mean at your own risk: they'll all just be wrong in different ways.</p>
<h2>Why are people still talking about this?</h2>
<p>I really don't know. Seems like this argument should be over by now.</p>
<p>One of my former mentors once told me that he never looks at an academic paper's evaluation section. If the idea
presented in the rest of the paper sounds reasonable, maybe he'll try to apply its innovations to the production
design. If the idea sounds ridiculous, or addresses a problem he's already solved in another way, then it's of
no use, regardless of how much speedup the authors might claim [4].</p>
<p>There are other problems that contribute to the industry perspective of academic evaluations. But I share this
anecdote just to say: academic computer architects should spend more time coming up with new, inherently interesting
ideas, and less time talking about which method of averaging is the least meaningless.</p>
<h2>Footnotes</h2>
<p>
[1]: This argument goes back at least as far as 1986 with the paper
<a href="https://dl.acm.org/doi/pdf/10.1145/5666.5673"><i>How not to lie with statistics: The correct way to summarize benchmark results</i></a>.
Eeckhout provides a good account of this history in <a href="https://ieeexplore.ieee.org/document/10419888">his paper</a>.
</p>
<p>
[2]: Interestingly, even though we can assign a physical meaning to ETS, it can still provide non-intuitive
results. For example, if machine \(A\) runs workload 1 twice as fast as machine \(B\) (\(S_1=2\)), but workload
2 only half as fast (\(S_2 = 0.5\)), then computing the ETS of \(A\) over \(B\) yields 0.8 (meaning a slowdown
overall). But by symmetry, the ETS of \(B\) over \(A\) is also 0.8. How can both machines be "slower" than the
other? Because unlike the geometric mean, the reference machine does matter for ETS! We're assigning different
weights to the workloads depending on our starting point!
</p>
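<p>
The two-workload example from footnote [2] can be checked with a few lines of Python. This is just a sketch of the arithmetic, and the helper names are my own:
</p>

```python
import math

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))

a_over_b = [2.0, 0.5]                    # speedups of A relative to B
b_over_a = [1.0 / s for s in a_over_b]   # same data, with A as the reference

# The harmonic mean (ETS) calls each machine "slower" than the other.
print(harmonic_mean(a_over_b), harmonic_mean(b_over_a))    # 0.8 0.8

# The geometric mean gives reciprocal answers, so the reference does not matter.
print(geometric_mean(a_over_b), geometric_mean(b_over_a))  # 1.0 1.0
```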
<p>
[3]: Personally, my machines have probably spent much more time running <code>x264</code> than compiling <code>gcc</code> or solving sudoku. Thanks YouTube.
</p>
<p>
[4]: There are other problems that contribute to this perception of academic evaluations, beyond the 
relatively unimportant issue of averaging workload results. In particular, academic microarchitectural simulators
are often inaccurate, and baseline systems are often poor comparison points.
</p>]]></description>
        <dc:creator>dgsq (dgsq@dgsq.net)</dc:creator>
    </item>
    <item>
        <title>New Blog</title>
        <link>https://dgsq.net/2025/03/30/new-blog/</link>
        <guid isPermaLink="true">https://dgsq.net/2025/03/30/new-blog/</guid>
        <pubDate>Sun, 30 Mar 2025 00:00:00 -0500</pubDate>
        <atom:updated>2025-03-30T00:00:00-05:00</atom:updated>
        <description><![CDATA[<p>Not much of a point to this post. The blog should soon speak for itself.</p>
<p>I plan on writing mostly about technical topics, but who knows? These things can
change.</p>
<p>This blog is currently using the <a href="https://beautifuljekyll.com">Beautiful Jekyll</a>
theme, although the appearance may change slightly as I fine-tune.</p>]]></description>
        <dc:creator>dgsq (dgsq@dgsq.net)</dc:creator>
    </item>
</channel>
</rss>
