[{"content":"Story Time\nI have been using ubuntu for the past 3 years or so, and honestly I love it.\nWhen I initially installed ubuntu it was out of necessity. I had an old laptop and I wanted to do machine learning, which at that time required a GPU. Today you can just use cloud GPUs, at pretty decent prices too!\nSince ubuntu is lightweight and beginner friendly, I thought that\u0026rsquo;d be a decent choice.\nI was right.\nIt was one of the best decisions of my life. I got addicted to ubuntu. Typing things out in the terminal. Apt-installing stuff. I had a blast!\nEven today I\u0026rsquo;d still recommend that anyone try linux. Use a VM or dual-boot, I don\u0026rsquo;t care.\nJust use a linux OS. You won\u0026rsquo;t look back.\nI started with a basic ubuntu setup but quickly went to the depths of linux and machines. I installed a window manager.\nAnd I had more fun. When you install something like the i3 window manager, it\u0026rsquo;s daunting at first because you literally see a blank screen. Nothing else. A bar at the bottom. With weird IPs and stuff.\nWhere do you go from there?\nIt requires a little bit of patience and love for machines to get past that phase because I won\u0026rsquo;t lie or gatekeep, it\u0026rsquo;s hard. The amount of debugging you have to do on your config files is outrageous.\nAt one point I could type -\nnvim ~/.config/i3/config in literally milliseconds! Because of how much configuring you have to do.\nThe good thing is, even if you are not down to the metal, you still have a lot of control over your machine.\nLike if something went wrong, I\u0026rsquo;d know what was wrong. If it was a config problem or something else.\nIt was unlike anything I had ever used.\nSo much so that when I bought myself a new PC, I removed windows (you should do that too) and installed ubuntu on it.
Didn\u0026rsquo;t dual boot, didn\u0026rsquo;t try it in a VM or anything. I just said - yeah, no windows.\nAnd things went on like that for 1.5-2 years.\nI got into ricing linux. At one point every single thing on my PC, from polybar to lazyvim, was configured by me.\nPeople could not use my PC because it was all keyboard driven and personalized to me.\nI won\u0026rsquo;t stop saying how much I loved it. Every Sunday I\u0026rsquo;d sit down and rice.\nUntil yesterday, when I decided to take the next step.\nI installed arch. I am happy (that it works)!\nI can officially also say - \u0026ldquo;i use arch btw\u0026rdquo;.\nThe decision was not an easy one to take.\nubuntu has close to everything. A good package manager, a good community, a new release every 6 months. It supports a lot of software out of the box.\nWhy compromise all of it, for what? Curiosity?\nBut I go where my curiosity takes me and so, I did it.\nKnowing that ubuntu still hides a lot behind abstractions, I didn\u0026rsquo;t feel that close to the metal. I am running a complete arch machine now.\nLet\u0026rsquo;s see how this goes. The fact that I had to install the audio firmware, graphics card driver, xrandr for HDMI and brightnessctl, lazyvim, flameshot gui, rofi and what not, also proves my claim of being closer to the metal than I was on ubuntu.\nThat\u0026rsquo;s it for this story time. I am the owner of my machine. At least more than I was on the previous OS.\nI hope you are too.\n~ Aayushya\n","permalink":"https://tiwariji.net/posts/archlinux/","summary":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eStory Time\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eI have been using ubuntu for the past 3 years or so, and honestly I love it.\u003c/p\u003e\n\u003cp\u003eWhen I initially installed ubuntu it was out of necessity. I had an old laptop and I wanted to do machine learning, which at that time required a GPU.
Today you can just use cloud GPUs, at pretty decent prices too!\u003c/p\u003e\n\u003cp\u003eSince ubuntu is lightweight and beginner friendly, I thought that\u0026rsquo;d be a decent choice.\u003c/p\u003e","title":"I use arch btw"},{"content":"Today we\u0026rsquo;re looking at LSTM (Long Short Term Memory) neural networks.\nThe standard for sequential data was RNNs (Recurrent Neural Networks). RNNs had an issue. They were good at remembering things, but they\u0026hellip; well, they kept forgetting too! They had what we call a vanishing gradient problem.\nSmall example to guide your visual senses:\nThe spine of the problem is, when the gradients flow backwards, they get multiplied. And if the number that we\u0026rsquo;re multiplying by is smaller than 1, which a lot of the time it is, then at some point the gradient will be close to zero and boom you\u0026rsquo;ve lost signal. At each timestep, the gradient gets multiplied by the derivative of the activation function (like sigmoid), a number that maxes out at 0.25. Ten steps back, the activation derivatives alone contribute a factor of at most 0.25¹⁰ (about 10⁻⁶), essentially zero.\nLSTMs fix this by changing the math from multiplication to addition. In a vanilla RNN, the gradient is forced through a \u0026ldquo;squashy\u0026rdquo; activation function at every single step, which rapidly shrinks it.\nThe LSTM\u0026rsquo;s \u0026ldquo;secret sauce\u0026rdquo; is that the cell state update is additive ($c_t = c_{prev} \\cdot f + ...$). If the forget gate is \u0026ldquo;open\u0026rdquo; (close to 1), the gradient can flow backwards through this additive \u0026ldquo;spine\u0026rdquo; across hundreds of timesteps almost unchanged. This is often called the Constant Error Carousel, and it\u0026rsquo;s what finally allows the \u0026ldquo;Long\u0026rdquo; part of Long Short Term Memory to actually work.\nLSTM LSTMs introduced a cell state.
The vanilla RNN had an $h_t$ which had two jobs: be the \u0026ldquo;memory passed forward\u0026rdquo; and the \u0026ldquo;output signal\u0026rdquo;. But when we overwrite $h_t$, we lose the signal from 10 timesteps ago.\nLSTM\u0026rsquo;s cell state is an additional state that is only changed when the gates decide to change it. And $h_t$ here now remains the output signal.\nNow the math is pretty simple. It is essentially six equations, which we will also implement in code. Here $x$ stands for the current input concatenated with the previous hidden state $h_{prev}$, as in the code.\nInput gate: $i = \\sigma(x \\cdot W_i + b_i)$\nForget gate: $f = \\sigma(x \\cdot W_f + b_f)$\nCandidate: $\\tilde{c} = \\tanh(x \\cdot W_c + b_c)$\nOutput gate: $o = \\sigma(x \\cdot W_o + b_o)$\nCell state: $c_t = c_{prev} \\cdot f + \\tilde{c} \\cdot i$\nHidden state: $h_t = o \\cdot \\tanh(c_t)$\nThe first four define the three gates and the candidate. The last two are what actually happens to the state.\nThe Input gate decides how much of the candidate gets written into the cell state. We do a simple linear operation with $W_i$ (weights of input gate) and addition with $b_i$ (bias of input gate), applying an activation function after it.\nThe Forget gate decides how much of the previous cell state to keep at the current timestep (near 0 forgets, near 1 keeps). The Candidate is the new data the network could write into memory. We use both of these in the cell state.\nFor the gates we use sigmoid to squash to (0, 1), a valve.
For the candidate we use tanh to squash to (-1, 1) — a value with direction.\ncode import torch import torch.nn as nn class LSTM(nn.Module): def __init__(self, input_size, hidden_size): # input : [batch, input_size + hidden_size] super().__init__() # [batch size, hidden + input] self.w_i = nn.Parameter(torch.randn([input_size + hidden_size, hidden_size])) # input self.w_f = nn.Parameter(torch.randn([input_size + hidden_size, hidden_size])) # forget self.w_c = nn.Parameter(torch.randn([input_size + hidden_size, hidden_size])) # candidate self.w_o = nn.Parameter(torch.randn([input_size + hidden_size, hidden_size])) # output gate self.bi = nn.Parameter(torch.zeros([hidden_size, ])) self.bf = nn.Parameter(torch.zeros([hidden_size, ])) self.bc = nn.Parameter(torch.zeros([hidden_size, ])) self.bo = nn.Parameter(torch.zeros([hidden_size, ])) def forward(self, x, h_prev, c_prev): # at each timestep \u0026#39;\u0026#39;\u0026#39; at each timestep: we have two things, x and h_prev. We concat them and say that use this curr input and also the previous hidden states to make decisions and do calculations. \u0026#39;\u0026#39;\u0026#39; x = torch.cat([h_prev, x], dim=1) # raw tensors, we\u0026#39;re adding previous hidden states. i = torch.sigmoid(x @ self.w_i + self.bi) # pass through input gate f = torch.sigmoid(x @ self.w_f + self.bf) # pass through forget gate c_curr = torch.tanh( x @ self.w_c + self.bc) # memory state o = torch.sigmoid(x @ self.w_o + self.bo) # output gate c_t = c_prev * f + c_curr * i h_t = o * torch.tanh(c_t) return c_t, h_t In init we define four weight matrices and four biases — one per gate, plus the candidate. In forward, the first thing we do is concatenate h_prev and x into a single vector. From that point on, every gate sees both the current input and the previous hidden state in one shot. After the six calculations, we return c_t and h_t — the updated memory and the output signal. 
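To make the six equations concrete, here is a minimal scalar sketch in plain Python rather than PyTorch. It is not the module above: lstm_step, w, b and every weight value are hypothetical, picked only so that the forget gate is almost fully open and the input gate almost closed. Each gate sees both h_prev and x, mirroring the concatenation in forward.

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w, b):
    # w[name] is a (w_h, w_x) pair applied to (h_prev, x); b[name] is a bias
    def pre(name):
        w_h, w_x = w[name]
        return w_h * h_prev + w_x * x + b[name]

    i = sigmoid(pre('i'))           # input gate
    f = sigmoid(pre('f'))           # forget gate
    c_tilde = math.tanh(pre('c'))   # candidate
    o = sigmoid(pre('o'))           # output gate
    c_t = c_prev * f + c_tilde * i  # additive cell state update
    h_t = o * math.tanh(c_t)        # output signal
    return c_t, h_t

# Hypothetical weights: forget gate wide open, input gate nearly closed
w = {'i': (0.0, 0.0), 'f': (0.0, 0.0), 'c': (0.5, 0.5), 'o': (0.0, 0.0)}
b = {'i': -10.0, 'f': 10.0, 'c': 0.0, 'o': 0.0}

c, h = 1.0, 0.0
for x in [0.3, -0.2, 0.9, 0.1]:
    c, h = lstm_step(x, h, c, w, b)
print(c, h)
```

With the forget gate near 1 the cell state barely moves over the four steps, which is the additive path that lets the signal survive.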
x = torch.cat([h_prev, x], dim=1) The concatenation happens because we want the gates to take as input both the previous hidden state and the current input as learning data.\ntraining a model input dataset:\ndf = sns.load_dataset(\u0026#39;flights\u0026#39;) The dataset contains the number of passengers month-wise from 1949 to 1960. We preprocess the dataset such that the problem becomes:\ngiven 12 months previous data, predict this month\u0026rsquo;s number of passengers\nX = torch.tensor(X).unsqueeze(-1) # 132, 12, 1 y = torch.tensor(y) print(X.shape) print(y.shape) \u0026gt;\u0026gt;torch.Size([132, 12, 1]) \u0026gt;\u0026gt;torch.Size([132]) We now initialize the model, optimizer and loss_function\nlstm = LSTM(input_size=1, hidden_size=32) linear = nn.Linear(32, 1) optimizer = torch.optim.Adam(list(lstm.parameters()) + list(linear.parameters()), lr=0.01) loss_fn = nn.MSELoss() So our training was successful. We got 0.0713 loss after 200 epochs.\nThis was an article on LSTM.\nThanks for reading ~ Aayushya\n","permalink":"https://tiwariji.net/posts/lstm/","summary":"\u003cp\u003eToday we\u0026rsquo;re looking at LSTM (Long Short Term Memory) neural networks.\u003c/p\u003e\n\u003cp\u003eThe standard for sequential data was RNNs (Recurrent Neural Networks).   \u003cbr\u003e\nRNNs had an issue. They were good at remembering things, but they\u0026hellip; well, they kept forgetting too!    \u003cbr\u003e\nThey had what we call a \u003cstrong\u003evanishing gradient\u003c/strong\u003e problem.\u003c/p\u003e\n\u003cp\u003eSmall example to guide your visual senses:\u003c/p\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/posts/lstm/vg.png\"\u003e\u003c/p\u003e\n\u003cp\u003eThe spine of the problem is, when the gradients flow backwards, they get multiplied. And say if that number that we\u0026rsquo;re multiplying with is smaller - which a lot of times the gradients are, then at some point the gradient will be close to zero and boom you\u0026rsquo;ve lost signal.     
 \u003cbr\u003e\nAt each timestep, the gradient gets multiplied by the derivative of the activation function (like sigmoid), a number that maxes out at 0.25. Ten steps back, your gradient is 0.25¹⁰, essentially zero.\u003c/p\u003e","title":"LSTM from scratch in PyTorch"},{"content":"In my previous work, we discussed different ideas and ways of writing a server. To quickly recap, it\u0026rsquo;s writing it asynchronously, writing a multi-threaded server or writing a \u0026ldquo;dumb\u0026rdquo; one. Each one has its pros and cons. People might say writing an async server would solve issues but it would not be the complete story. An async server helps when threads are idle waiting on I/O: network reads, file reads from external devices. For work like this, an async server will usually get you the lowest latency. But it doesn\u0026rsquo;t give us the same improvement when the work is not I/O bound. Forge\u0026rsquo;s threads are idle waiting on compute, a different problem. We\u0026rsquo;ll see why in this article.\nTo recap you can also check out my previous post on optimizing a server.\nToday we\u0026rsquo;ll talk about Forge.\nThis article will cover things like technical decisions and problems I faced while writing the code. Forge: overview It is a multi-threaded server implementation in C++. It has the following components:\nServer Scheduler Llama.cpp backend This is the high level view. What forge does in one sentence: \u0026ldquo;accepts HTTP POST requests at /infer with a JSON body containing a query, runs inference on a local GGUF model, and returns the result. Also exposes /metrics returning queue depth and p99 latency.\u0026rdquo;\nI am pretty sure that simple looking working diagram made you understand the whole project by itself but even if it didn\u0026rsquo;t, I\u0026rsquo;ll walk you through what this is.\nThe idea is that we\u0026rsquo;ve got clients sending requests to our server. We have a server.
How do we efficiently design a system where we can run inference on a shared backend like llama.cpp with MINIMUM bottlenecks?\nThis is how I did it in Forge. Job A job is just a dataclass sort of thing (in C++ it\u0026rsquo;s just a struct) that gives order to our data. Some important information that it holds is:\nID: request ID. Query: query from the client. currPriority: one of the most important aspects of this dataclass. startTime: when was the request received? Scheduler Scheduler is a class designed to hold a bunch of jobs. It is a priority-queue based system that holds jobs in order of their currPriority.\nIn my system, 1 -\u0026gt; normal and 2 -\u0026gt; urgent.\nThese are the two initial priorities.\nOne test case that we should think about is: what happens if there are 20 urgent requests and 1 normal/batch request? How do we handle that?\nThis is where we talk about the starvation problem.\nStarvation and aging Starvation is when the \u0026ldquo;urgent jobs\u0026rdquo; keep getting processed while the normal jobs never get a turn. To deal with this we do aging. The idea is: as a normal job waits, we keep raising its priority, so that after some time it becomes a priority and is processed.\nIn forge, you can set the aging_rate, which is 0.2 by default. It is a unitless priority increment. So for example a job arrives at t = 2 with initial priority 1. The aging thread runs every second, so 5 seconds later, currPriority = 1 + (5 * 0.2) = 2. Urgent job. Note that this needs currPriority to be a fractional type; with an int, adding 0.2 truncates back to the same value.\nAging_thread We implement this by creating a thread when we init the scheduler and make it do:\nevery second, go to each job in the queue, and update its currPriority. Worker This is a thread. You can create n workers. We create 4 workers in forge by default. Their job is to take the top job from the scheduler and get the result back to the server.
This is the part of the code where llama.cpp connects.\nIt has:\nrun() method: takes the job from the scheduler and gives it to the runInference() method. runInference() method: does the inference, waits till it gets the result back and sends the response to the server. Now maybe we can look at the image above and make sense of it. Each request is sent to the Scheduler as a job. When we init the scheduler, the aging_thread begins, and it ends only when the scheduler is destroyed. Each request is picked up by a thread whose job is to take the request from the client -\u0026gt; give it to the server -\u0026gt; get the response from the server and give it to the client. These are the server threads in the diagram, colored in blue. Now, there are worker threads (marked in green). Each of them is supposed to take a job from the scheduler and get the result from the llama backend. They block until inference is complete at the llama.cpp backend and only then come back with a response, unlike an async server. This causes some bottlenecks. We will discuss the bottlenecks of this project at the end. Each worker then gets the response to the server, where the sleeping server threads are told to give these back to the clients.\nLet\u0026rsquo;s talk about the implementation details of this project.\nimplementation details and notes I chose to implement this in C++ because:\nit\u0026rsquo;s fast, it\u0026rsquo;s cool, I know it, I don\u0026rsquo;t know rust, it\u0026rsquo;s cool. But before diving into code, let\u0026rsquo;s talk about THREADS. Threads The whole program is filled with threads actually. This is what confused me the most and I spent the most time here. So, there are 3 groups of threads:\nServer thread pool: N threads for N requests but short lived - each lives only as long as its connection. Scheduler: there\u0026rsquo;s 1 thread that\u0026rsquo;s for aging. Worker thread pool: n=4 threads for PROCESSING top jobs from the scheduler. This was basically a design choice, you could also design the system any other way, I chose to do it this way.
We noticed in the previous article that one major issue/bottleneck is WHAT the thread is doing while the request is being processed. To handle that we use two very interesting C++ primitives:\nCondition variable Promise/future Condition variable: the threads wait for a state change, not a specific value / target. In forge, what happens when the scheduler\u0026rsquo;s queue is empty? We wait on a condition variable. Promise/future: Communication between threads. One thread promises another that \u0026ldquo;I\u0026rsquo;ll get you A value\u0026rdquo;. In forge, the server thread creates a promise and the worker thread, when the inference is done, does promise.set_value(RESULT_FROM_INFERENCE); and boom, the server thread wakes up with a result in its hand. This was the most complicated thing to visualize, for me. You might get it intuitively by imagining a lot of threads in your head.\nLet\u0026rsquo;s talk about the other implementation stuff now.\nJob code I started with writing the Job struct. It\u0026rsquo;s as simple as it gets.\nstruct Job { int id; // good for logging, not required for minimal std::string query; std::chrono::steady_clock::time_point startTime; double currPriority = 1; /* fractional, so adding aging_rate (0.2) is not truncated away */ int initPriority = 1; std::shared_ptr\u0026lt;std::promise\u0026lt;std::string\u0026gt;\u0026gt; p; }; what would happen if we didn\u0026rsquo;t have shared_ptr in there? The job gets moved into the scheduler. The handler no longer has the job. But the handler grabbed the future before the move, via job-\u0026gt;p-\u0026gt;get_future(). Now two things refer to the same shared result slot: the promise held by the job (worker side) and the future (handler side). The shared_ptr keeps the promise object alive for as long as any copy of the pointer exists; the future stays valid either way because it holds its own reference to the shared state.
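The handoff between a server thread and a worker can be sketched in a few lines. This is a Python analogue, not Forge's C++ code: concurrent.futures.Future plays the role of the promise/future pair, a blocking queue.Queue stands in for the scheduler plus its condition variable, and the string reversal is a placeholder for inference. The names worker, handle_request and jobs are made up for this sketch, and constructing a Future by hand is something the standard library intends mainly for executors and tests.

```python
import queue
import threading
from concurrent.futures import Future

jobs = queue.Queue()  # stands in for the scheduler's queue

def worker():
    # Blocks while the queue is empty, then fulfils the promise side
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            break
        query, fut = job
        fut.set_result(query[::-1])  # placeholder for real inference

def handle_request(query):
    # The handler keeps the future; the job carries it to the worker
    fut = Future()
    jobs.put((query, fut))
    return fut.result(timeout=5)  # sleeps until the worker sets a value

t = threading.Thread(target=worker)
t.start()
answer = handle_request('hello')
jobs.put(None)
t.join()
print(answer)
```

The handler thread does no work while it waits; it is parked inside fut.result() until the worker calls set_result, which is the same shape as future.get() and promise.set_value in the C++ version.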
That\u0026rsquo;s it.\nscheduler code The scheduler has 3-4 major methods.\nEnque -\u0026gt; add job to the queue void enque(std::unique_ptr\u0026lt;Job\u0026gt; job) { std::lock_guard\u0026lt;std::mutex\u0026gt; lock(m); q.push(std::move(job)); cv.notify_one(); } deque_locked -\u0026gt; remove the top job and return a unique_ptr to it std::unique_ptr\u0026lt;Job\u0026gt; deque_locked() { if (q.empty()) { return nullptr; } else { std::unique_ptr\u0026lt;Job\u0026gt; temp = std::move(const_cast\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;\u0026amp;\u0026gt;(q.top())); q.pop(); return temp; } } aging_loop -\u0026gt; the aging thread starts at the initialization of the scheduler and continues till the destruction of scheduler void aging_loop () { while(!shutdown) { std::this_thread::sleep_for(std::chrono::seconds(1)); std::vector\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;\u0026gt; t; { std::lock_guard\u0026lt;std::mutex\u0026gt; lock(m); while (!q.empty()) { t.push_back(std::move(const_cast\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;\u0026amp;\u0026gt;(q.top()))); q.pop(); } for (auto\u0026amp; i : t) { i-\u0026gt;currPriority += aging_rate; q.push(std::move(i)); } } if (!t.empty()) cv.notify_all(); } } size -\u0026gt; to return the length of the queue at any given moment size_t size() { std::lock_guard\u0026lt;std::mutex\u0026gt; lock(m); return q.size(); } Since we want to tell the priority queue to compare on currPriority, we create a custom comparator method. 
struct JobComparator { bool operator()(const std::unique_ptr\u0026lt;Job\u0026gt;\u0026amp; a, const std::unique_ptr\u0026lt;Job\u0026gt;\u0026amp; b) { return a-\u0026gt;currPriority \u0026lt; b-\u0026gt;currPriority; } }; // and at initialization we do std::priority_queue\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;, std::vector\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;\u0026gt;, JobComparator\u0026gt; q; btw look at this beautiful piece of code: t.push_back(std::move(const_cast\u0026lt;std::unique_ptr\u0026lt;Job\u0026gt;\u0026amp;\u0026gt;(q.top())));\nworker code Now here\u0026rsquo;s where the code gets a little lengthy but the idea really is simple.\nIt has two main methods:\nrun(): take the query from the job and give it to runInference() void run () { // thread takes the job from scheduler and gives it to runINference method which then computes result while (true) { std::unique_lock\u0026lt;std::mutex\u0026gt; lock(sch.m); sch.cv.wait(lock, [this] {return !sch.q.empty() || sch.shutdown;}); // make the thread wake up exactly when a job comes. if (sch.shutdown \u0026amp;\u0026amp; sch.q.empty()) break; std::unique_ptr\u0026lt;Job\u0026gt; job = sch.deque_locked(); lock.unlock(); if (job) { std::string result = runInference(job-\u0026gt;query); job-\u0026gt;p-\u0026gt;set_value(result); auto latency = (float)std::chrono::duration_cast\u0026lt;std::chrono::milliseconds\u0026gt;( std::chrono::steady_clock::now() - job-\u0026gt;startTime ).count(); std::lock_guard\u0026lt;std::mutex\u0026gt; lock_metrics(sch.m); sch.latencies.push_back(latency); } } } runInference(): from llama.cpp, get a response back and store it in result std::string runInference (const std::string\u0026amp; prompt) { llama_memory_clear(llama_get_memory(context), true); std::string result; // 1. 
tokenize // first call to get size int n_tokens = llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true); std::vector\u0026lt;llama_token\u0026gt; prompt_tokens (std::abs(n_tokens)); n_tokens = llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true); if (n_tokens \u0026lt; 0) { prompt_tokens.resize(-n_tokens); llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true); } // sampler : decides out of the batch of possible probabilities which token to choose // has greedy, and temperature... we will do greedy auto sparams = llama_sampler_chain_default_params(); sparams.no_perf = false; llama_sampler* smpl = llama_sampler_chain_init(sparams); llama_sampler_chain_add(smpl, llama_sampler_init_greedy()); // 2. create batch llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size()); // 3. look: decode, sample next token, detokenize append to result, check eog int n_decode = 0; llama_token new_token_id; for (int n_pos =0; n_pos + batch.n_tokens \u0026lt; (int)prompt_tokens.size() + n_predict;) { // evaluate current batch with transformer if (llama_decode(context, batch) != 0) break; n_pos += batch.n_tokens; // sample next tokens { new_token_id = llama_sampler_sample(smpl, context, -1); // end of generation if (llama_vocab_is_eog(vocab, new_token_id)) { break; } char buf[128]; int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true); std::string s(buf, n); result += s; printf(\u0026#34;%s\u0026#34;, s.c_str()); fflush(stdout); // prepare the next batch with the sampled token batch = llama_batch_get_one(\u0026amp;new_token_id, 1); n_decode += 1; } } // 4. 
return result string llama_sampler_free(smpl); return result; } A lot of this code is from https://github.com/ggml-org/llama.cpp/blob/master/examples/simple/simple.cpp\nWe just have to focus on what llama.cpp has:\nA model context: which is different for each worker (why?) sampler batch runInference() gets us a result.\nserver code One method is the start(). It just is boilerplate so we\u0026rsquo;ll skip that but you can look at the code here.\nIt has other two main methods. It uses sys/socket.h header for the socket connection.\nhandleClient(): takes the request, makes a job, puts on scheduler, waits on future.get() -\u0026gt; gets response from worker, sends to the client. If the request is pointing to `/metrics/ -\u0026gt; call handleMetrics() method. void handleClient(int client_fd) { char buf[4096] = {0}; int bytes = recv(client_fd, buf, sizeof(buf), 0); if (bytes \u0026lt; 0) { close(client_fd); return; } std::string request(buf, bytes); if (request.find(\u0026#34;GET /metrics\u0026#34;) != std::string::npos) { handleMetrics(client_fd); return; } if (request.find(\u0026#34;POST /infer\u0026#34;) != std::string::npos) { size_t pos = request.find(\u0026#34;\\r\\n\\r\\n\u0026#34;); if (pos == std::string::npos) { close(client_fd); return; } std::string body = request.substr(pos + 4); auto j = nlohmann::json::parse(body); std::string query = j[\u0026#34;query\u0026#34;]; auto job = std::make_unique\u0026lt;Job\u0026gt;(); job-\u0026gt;query = query; job-\u0026gt;p = std::make_shared\u0026lt;std::promise\u0026lt;std::string\u0026gt;\u0026gt;(); job-\u0026gt;startTime = std::chrono::steady_clock::now(); auto future = job-\u0026gt;p-\u0026gt;get_future(); sch.enque(std::move(job)); std::string result = future.get(); // ---- waiting starts... and ends... 
nlohmann::json resp; resp[\u0026#34;response\u0026#34;] = result; std::string json_body = resp.dump(); std::string response = \u0026#34;HTTP/1.1 200 OK\\r\\n\u0026#34; \u0026#34;Content-Type: application/json\\r\\n\u0026#34; \u0026#34;Content-Length: \u0026#34; + std::to_string(json_body.size()) + \u0026#34;\\r\\n\u0026#34; \u0026#34;\\r\\n\u0026#34; + json_body; send(client_fd, response.c_str(), response.size(), 0); close(client_fd); return; } close(client_fd); } handleMetrics(): calculates metrics using a latency vector that is stored in Scheduler class. // handle GET /metrics void handleMetrics(int client_fd) { std::lock_guard\u0026lt;std::mutex\u0026gt; lock(sch.m); std::vector\u0026lt;float\u0026gt; sorted = sch.latencies; std::sort(sorted.begin(), sorted.end()); float p99 = 0.0f; if (!sorted.empty()) { p99 = sorted[(int) (0.99 * (sorted.size() - 1))]; } nlohmann::json j; j[\u0026#34;p99_latency_ms\u0026#34;] = p99; std::string body = j.dump(); std::string response = \u0026#34;HTTP/1.1 200 OK\\r\\n\u0026#34; \u0026#34;Content-Type: application/json\\r\\n\u0026#34; \u0026#34;Content-Length: \u0026#34; + std::to_string(body.size()) + \u0026#34;\\r\\n\u0026#34; \u0026#34;\\r\\n\u0026#34; + body; send(client_fd, response.c_str(), response.size(), 0); close(client_fd); } That\u0026rsquo;s it for the server code.\nmain.cpp This file just puts everything together and we run this file to check the results.\n#include \u0026#34;llama.h\u0026#34; #include \u0026#34;scheduler.hxx\u0026#34; #include \u0026#34;worker.hxx\u0026#34; #include \u0026#34;http.hxx\u0026#34; #include \u0026#34;job.hxx\u0026#34; #define MODEL_PATH \u0026#34;/home/imaayush/code/Forge/Forge/Llama-3.2-1B-Instruct-Q4_K_M.gguf\u0026#34; // path to your GGUF model file int main() { llama_model_params mparas = llama_model_default_params(); llama_model* model = llama_model_load_from_file(MODEL_PATH, mparas); Scheduler sch; Server server(8080, sch, model); server.start(); } observations and improvements 
This system was primarily built to learn the language and understand systems like Ollama. No AI was used to write code. Every line is hand-written or taken from documentation (like the llama.cpp code), which might as well be a flex nowadays.\nimprovements This system has a lot of scope for improvement. It gives a p99 of ~7 seconds. And that\u0026rsquo;s mainly because the model takes approximately 7 seconds to finish a response, so the threads wait there for the inference to finish before sending the response back to the client.\nBelow are some things that I know off the top of my head that could be bottlenecks:\nlatency calculation: in forge, I use a vector in the scheduler class that stores each response\u0026rsquo;s endTime - startTime to calculate the p99. What if the number of requests increases? And goes above a million requests? On top of that, for the p99 calculation we\u0026rsquo;re also sorting the whole vector on every /metrics call while holding the scheduler mutex, which is O(NlogN). There\u0026rsquo;s our bottleneck 1. aging_loop: right now, what aging_loop does is, every second, it pulls EACH job out of the queue, updates its currPriority and puts the whole thing back. This will be horrendous when there are more than, say, 1000 queued requests. This whole thing could run out of memory with thousands of concurrent requests (say more than 5000), because we create one thread per request and then make each of them wait for the response from the worker threads. We\u0026rsquo;re using a fixed buf of 4096 bytes for receiving the request from the client. If the request is larger than that, it gets truncated. Overall, the load balancing is not optimal, which honestly is okay for now. I plan to work on this more and fix these soon.\nThis was a VERY interesting project to do! Hopefully you get value out of this. This marks the end of me trying to learn servers.\nall the code is: https://github.com/aayushyatiwari/Forge-inference_server\nAs always, thank you for reading.
~ Aayushya\nReferences cpp reference official doc https://blog.andreiavram.ro/job-scheduler-cpp/ my code: https://github.com/aayushyatiwari/Forge-inference_server llama.cpp backend: https://github.com/ggml-org/llama.cpp/blob/master/examples/simple/simple.cpp\n","permalink":"https://tiwariji.net/posts/forge/","summary":"\u003cp\u003eIn my previous work, we discussed different ideas and ways of writing a server. To quickly recap, it\u0026rsquo;s writing it asynchronously, writing a multi-threaded server or writing a \u0026ldquo;dumb\u0026rdquo; one.\nEach one has its pros and cons. People might say writing an async server would solve issues but it would not be the complete story. An async server helps when threads are idle waiting on I/O: network reads, file reads from external devices. For work like this, an async server will usually get you the lowest latency. But it doesn\u0026rsquo;t give us the same improvement when the work is not I/O bound. Forge\u0026rsquo;s threads are idle waiting on compute, a different problem. We\u0026rsquo;ll see why in this article.\u003c/p\u003e","title":"Implementing mini Ollama from scratch"},{"content":"What is a Server? It is a software program that manages resources over a network. The whole network could be visualized like this: Today we are writing a server.\nA dumb one, at first. We\u0026rsquo;ll benchmark the load handling of the server. And then we optimize.\nDifference between websockets and an HTTP server model WebSockets maintain a persistent, stateful connection where both client and server can continuously exchange data. In a traditional HTTP model, the client usually sends a request, receives a response, and the connection is then closed.\ndumb server We will make use of the socket library in python.
It creates a TCP socket connection.\nimport socket HOST = \u0026#39;localhost\u0026#39; # Standard loopback interface address (localhost) PORT = 8080 # Port to listen on (non-privileged ports are \u0026gt; 1023) with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: s.bind((HOST, PORT)) # bind the socket to host and port s.listen() # listen for connections print(f\u0026#34;server listening on {HOST}:{PORT}\u0026#34;) while True: conn, addr = s.accept() # accept a connection with conn: print(f\u0026#34;CONNECTION FROM {addr}\u0026#34;) data = conn.recv(1024) # read up to 1024 bytes from the client request_line = data.decode().split(\u0026#39;\\r\\n\u0026#39;)[0] # get the first line of the request print(f\u0026#34;RECEIVED REQUEST: {request_line}\u0026#34;) response = \u0026#34;HTTP/1.1 200 OK\\r\\nContent-Type: text/plain\\r\\n\\r\\nHello, World!\u0026#34; # create a simple HTTP response conn.sendall(response.encode()) # send the whole response to the client print(\u0026#34;SENT RESPONSE: Hello, World!\u0026#34;) This is our server software program. We\u0026rsquo;re listening on localhost:8080. After a connection is made, we read up to 1024 bytes of the request. And we send a 200 response along with a \u0026ldquo;hello world\u0026rdquo; text to every connection. We use conn.sendall() because, unlike send(), it keeps sending until the whole response has been written.\nThis server takes connections sequentially and hence will be slow under load. There are some ideas that we can make use of:\nThreads asynchronous programming thread-ed server Idea here is simple: create multiple threads for multiple connections. One thread handles one connection.
The con is that you have to create a LOT of threads when the load is heavy.\nHere\u0026rsquo;s a simple implementation of a server that reverses the text values in a JSON body and sends them back as the response.\nimport socket import json import threading HOST = \u0026#39;localhost\u0026#39; PORT = 8080 endpoint = \u0026#34;/reverse\u0026#34; def handle_connection(conn, addr): with conn: try: data = conn.recv(1024) request_line = data.decode().split(\u0026#39;\\r\\n\u0026#39;)[0] if request_line.startswith(f\u0026#34;POST {endpoint}\u0026#34;): body = data.decode().split(\u0026#39;\\r\\n\\r\\n\u0026#39;)[1] json_data = json.loads(body) rd = [x[::-1] for x in list(json_data.values())] rev_json = json.dumps({\u0026#34;data\u0026#34;: rd}) response = f\u0026#34;HTTP/1.1 200 OK\\r\\nContent-Type: application/json\\r\\n\\r\\n{rev_json}\u0026#34; conn.sendall(response.encode()) else: response = \u0026#34;HTTP/1.1 404 Not Found\\r\\nContent-Type: text/plain\\r\\n\\r\\nEndpoint not found\u0026#34; conn.sendall(response.encode()) print(f\u0026#34;CONNECTION FROM {addr} HANDLED SUCCESSFULLY\u0026#34;) except Exception as e: print(f\u0026#34;ERROR: {e}\u0026#34;) with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) # allow rebinding to the same address right after a restart s.bind((HOST, PORT)) s.listen() print(f\u0026#34;server listening on {HOST}:{PORT}\u0026#34;) while True: conn, addr = s.accept() thread = threading.Thread(target=handle_connection, args=(conn, addr)) thread.start() A lot of lines here are boilerplate for socket programming. They have meaning, but the logic is hidden in there.\nrequest_line.startswith(f\u0026quot;POST {endpoint}\u0026quot;) : we\u0026rsquo;re checking if the request is a POST request to the /reverse endpoint. We create a handle_connection function that we give to individual threads. As soon as a new connection is made, a thread is spawned to handle the request. 
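The con mentioned above (one new thread per connection, with no upper limit) is commonly handled with a fixed-size thread pool. Here is a minimal sketch of that idea using Python's concurrent.futures.ThreadPoolExecutor. This is NOT the server that was benchmarked in this post; MAX_WORKERS, CRLF, reverse_values and serve are names made up for this illustration, and the handler skips the endpoint check to stay short.

```python
import json
import socket
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32  # hypothetical cap on concurrent handlers; tune for your machine
CRLF = chr(13) + chr(10)  # carriage return + line feed, the HTTP line ending


def reverse_values(body):
    # Same logic as the /reverse handler: reverse every string value in the JSON body.
    payload = json.loads(body)
    return json.dumps({'data': [v[::-1] for v in payload.values()]})


def handle_connection(conn):
    # Read one request, answer it, and close the connection.
    with conn:
        data = conn.recv(1024).decode()
        body = data.split(CRLF + CRLF, 1)[1]
        head = 'HTTP/1.1 200 OK' + CRLF + 'Content-Type: application/json' + CRLF + CRLF
        conn.sendall((head + reverse_values(body)).encode())


def serve(host='localhost', port=8080):
    # At most MAX_WORKERS connections are handled at once;
    # extra connections wait in the pool's queue instead of spawning new threads.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((host, port))
            s.listen()
            while True:
                conn, _addr = s.accept()
                pool.submit(handle_connection, conn)
```

Calling serve() starts the loop. The trade-off is that once all workers are busy, new connections have to wait, so the pool size becomes something you tune.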
load testing tool We use wrk to test load on the server.\nThe command we will use is wrk -t4 -c1000 -d10s -s post.lua http://localhost:8080/reverse\nHere, wrk uses 4 threads to keep a total of 1000 connections open and sends requests for 10 seconds to the localhost:8080/reverse endpoint. post.lua is the script that builds the POST request.\nHere are the numbers for the threaded server:\n(base) Forge:inferenceServer\\ $ wrk -t4 -c1000 -d10s -s post.lua http://localhost:8080/reverse Running 10s test @ http://localhost:8080/reverse 4 threads and 1000 connections Thread Stats Avg Stdev Max +/- Stdev Latency 60.09ms 136.42ms 1.91s 96.00% Req/Sec 416.01 170.53 1.18k 73.75% 16572 requests in 10.05s, 1.11MB read Socket errors: connect 0, read 16572, write 0, timeout 33 Requests/sec: 1649.56 Transfer/sec: 112.77KB 1650 requests per second with a 60 ms average latency.\nThat\u0026rsquo;s not bad. But we can do better.\nasync-ed server The idea here is this: in the threads version we were spawning n threads for n connections, which spikes RAM usage and is not the most memory efficient. In our async program, the idea will be wait-and-hold. Whenever a connection is waiting on I/O, the program pauses that handler and handles other connections in the meantime.\nThis works because of what happens when you make a request in synchronous programming: you wait till you get a response back from the server. There\u0026rsquo;s, in short, a lot of waiting. And we optimize that in async programming. \u0026ldquo;Until we get a response, go and handle other requests\u0026rdquo;.\nHere\u0026rsquo;s a simple async server in Python\n# imports import asyncio import json HOST = \u0026#39;localhost\u0026#39; PORT = 8080 endpoint = \u0026#34;/reverse\u0026#34; # async func: can wait in between without needing to return. 
async def handle_connection(reader, writer): \u0026#39;\u0026#39;\u0026#39; do standard fetching of request \u0026#39;\u0026#39;\u0026#39; try: data = await reader.read(1024) request_line = data.decode().split(\u0026#39;\\r\\n\u0026#39;)[0] if request_line.startswith(f\u0026#34;POST {endpoint}\u0026#34;): body = data.decode().split(\u0026#39;\\r\\n\\r\\n\u0026#39;)[1] json_data = json.loads(body) rd = [x[::-1] for x in list(json_data.values())] rev_json = json.dumps({\u0026#34;data\u0026#34;: rd}) response = f\u0026#34;HTTP/1.1 200 OK\\r\\nContent-Type: application/json\\r\\n\\r\\n{rev_json}\u0026#34; else: response = \u0026#34;HTTP/1.1 404 Not Found\\r\\nContent-Type: text/plain\\r\\n\\r\\nEndpoint not found\u0026#34; writer.write(response.encode()) await writer.drain() except Exception as e: print(f\u0026#34;ERROR: {e}\u0026#34;) finally: writer.close() await writer.wait_closed() async def main(): server = await asyncio.start_server(handle_connection, HOST, PORT) # start the server print(f\u0026#34;server listening on {HOST}:{PORT}\u0026#34;) async with server: await server.serve_forever() # infinite serving till we stop the server asyncio.run(main()) NOTE: to REALLY see the difference between the async and threaded servers, we need to imitate a slow server. In the threaded version we can use time.sleep(3) to do that; in the async version it has to be await asyncio.sleep(3), because time.sleep would block the whole event loop. Without a delay like that, the difference is smaller. Although as we\u0026rsquo;ll see, the difference is still there.\nNumbers for async server:\n(base) Forge:inferenceServer\\ $ wrk -t4 -c1000 -d10s -s post.lua http://localhost:8080/reverse Running 10s test @ http://localhost:8080/reverse 4 threads and 1000 connections Thread Stats Avg Stdev Max +/- Stdev Latency 15.87ms 58.73ms 831.44ms 96.62% Req/Sec 2.86k 1.18k 6.50k 68.67% 113769 requests in 10.04s, 7.60MB read Socket errors: connect 0, read 113769, write 0, timeout 0 Requests/sec: 11326.42 Transfer/sec: 774.28KB 11326 requests/sec with 16 ms latency! 
compare that to the threaded version: 1650 requests per second with a 60 ms latency.\nThat\u0026rsquo;s a 7x speedup!\nconclusion That was a study between synchronous, threaded and async servers. In the future, we will build an inference server : \u0026ldquo;Forge\u0026rdquo;\nThanks for reading ~ Aayushya\n","permalink":"https://tiwariji.net/posts/dumb-server/","summary":"\u003cp\u003eWhat is a \u003cstrong\u003eServer\u003c/strong\u003e?    \u003cbr\u003e\nIt is a software program that manages resources over a network. The whole network could be visualized like this:\n\u003cimg loading=\"lazy\" src=\"/posts/dumb-server/pic.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eToday we are writing a server.\u003c/p\u003e\n\u003cp\u003eA dumb one, at first. We\u0026rsquo;ll benchmark the load handling of the server. And then we optimize.\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cem\u003e\u003cstrong\u003eDifference between websockets and a http server model\u003c/strong\u003e\u003c/em\u003e\nWebSockets maintain a persistent, stateful connection where both client and server can continuously exchange data.\nIn a traditional HTTP model, the client usually sends a request, receives a response, and the connection is then closed.\u003c/p\u003e","title":"Writing an inference server in bare-metal C++"},{"content":"Threads are the smallest unit of execution that an operating system can schedule inside a process. People (me) get confused about what threads are and what processes are. This article will talk about threading as a programming concept and not the theory behind processes and threads. Nevertheless, we\u0026rsquo;ll talk about processes and threads too.\nProcesses and threads Process A ├── Thread 1 ├── Thread 2 └── Thread 3 Process B ├── Thread 1 └── Thread 2 Process is an independent program instance. It has its own memory and resources. Think of it like an agent. A process can have one or more threads in it. 
Threads share the process memory and resources, though each thread has its own execution state and stack. How processes and threads are implemented depends on specific programming languages and operating systems, so I highly recommend checking out the wiki.\nClick here for a quick summary.\nNow, using threads we can speed up some programs, especially when the work is I/O-bound. We can ask independent threads to do independent tasks, given that we take care of the race conditions.\nHere\u0026rsquo;s what we will do. We will scrape 10 websites, with intentional latency and we will get a response. And we will time them. And then we will do them again using threads this time instead. In two different languages.\nNOTE: Python, C++. There was no need to do two languages because, as we will see, they yield approximately the same time for this I/O-bound example. Python and C++ threads do not generally have the same performance for every kind of workload.\nSetup We have a list of websites. The list of websites:\nsites = [ \u0026#34;https://httpbin.org/delay/2\u0026#34;, \u0026#34;https://httpbin.org/delay/2\u0026#34;, \u0026#34;https://httpbin.org/delay/2\u0026#34;, \u0026#34;https://httpbin.org/delay/2\u0026#34;, \u0026#34;https://httpbin.org/delay/2\u0026#34;, \u0026#34;https://httpbin.org/delay/1\u0026#34;, \u0026#34;https://httpbin.org/delay/1\u0026#34;, \u0026#34;https://httpbin.org/delay/1\u0026#34;, \u0026#34;https://en.wikipedia.org/wiki/Thread_(computing)\u0026#34;, \u0026#34;https://en.wikipedia.org/wiki/Global_interpreter_lock\u0026#34;, ] We will sequentially request a response from each website Use threads to do them for us. 
Measure time difference Here\u0026rsquo;s the Python code to do it.\nPython start = time.time() count_seq, failed_seq = 0, 0 for n in sites: try: x = requests.get(n, timeout=10) # print(x.status_code) except Exception as e: print(f\u0026#34;Couldn\u0026#39;t fetch from {n}: error {e}\u0026#34;) failed_seq += 1 else: count_seq += 1 end = time.time() Code is pretty direct. We measure time and go one-by-one at each site and ask for a response using Python\u0026rsquo;s requests library. Everything else is exception handling. For the threading code:\nlock = threading.Lock() count, failed = 0, 0 def fetch(site): global count global failed try: response = requests.get(site, timeout=10) with lock: count += 1 return response except Exception as e: print(f\u0026#34;got error in {site}: error message: {e}\u0026#34;) with lock: failed += 1 return None if __name__ == \u0026#34;__main__\u0026#34;: start = time.time() threads = [] for i in range(len(sites)): t = threading.Thread(target=fetch, args=(sites[i],)) t.start() threads.append(t) for t in threads: t.join() end = time.time() Here we have to understand few things.\nTo prevent race conditions, we use a threading.Lock() object. It locks access to a shared variable such that at a given time only one thread can access/update it. This will repeat in C++ too, so keep that in mind. For each site, we\u0026rsquo;re creating a new thread and giving it a target function fetch. So we have n threads for n websites. The results (base) threads\\ $ python threads_pgm.py Sequential Data ------------------ time taken without threads : 25.756 seconds ------------------ Threads data ------------------ time taken with threads: 3.787 seconds We can see a huge speedup.\nEach thread, at the same time, goes to a website and comes back with a response. So the 3.7 seconds that we see is approximately the maximum time a thread took to go to a website and get a response, plus some scheduling and network overhead. 
In sequential it is the sum of time.\nAlso notice the fetch function. We use with lock to use the lock object.\nC++ The process will remain the same. For sequential, do one site at a time. Using threads, create n threads for n websites.\ncout \u0026lt;\u0026lt; \u0026#34;---- starting threads ... \u0026#34;\u0026lt;\u0026lt; endl; count = 0; failed = 0; const auto s2 = chrono::high_resolution_clock::now(); vector \u0026lt;thread\u0026gt; threads; for (int i = 0; i \u0026lt; sites.size(); i++){ thread t(fetch, sites[i]); // init the threads AND start threads.push_back(move(t)); // threads can\u0026#39;t be copied, only moved. } for (int i = 0 ; i\u0026lt; threads.size(); i++){ threads[i].join(); } Full code snippet is here. The result:\n(base) threads\\ $ ./scraper Fetched(without threads) 10 sites in 27.8024 seconds ---- starting threads ... Fetched using threads 10 sites in 3.72653 seconds As we can see, it\u0026rsquo;s about the same as Python for this I/O-bound example. And again the time difference is approximately max(time of individual threads) vs sum(time of each website\u0026rsquo;s response).\nConclusion There\u0026rsquo;s a lot that can be studied after this. Threads are one of the basic building blocks of concurrent software.\nThanks for reading ~ Aayushya Tiwari\nDifference between processes and threads - table summary Feature Process Thread Memory Separate Shared Communication Slow (IPC) Fast (shared memory) Creation Heavy Lightweight Context Switch Expensive Cheap Isolation High Low Failure Impact Independent Affects whole process Race condition A race condition occurs when multiple threads access and modify shared data concurrently, leading to incorrect or unpredictable results due to lack of proper synchronization.\nint count = 0; Thread 1: count++; Thread 2: count++; Both threads are trying to update the same variable.\nThere are ways to fix them.\nLocks: Lock the data, only one thread can work on it at a time. 
Semaphores: synchronization primitives that use a counter to control access to a resource. Atomic operations References wiki : https://en.wikipedia.org/wiki/Thread_(computing) Arpit Bhayani\u0026rsquo;s video on threads : https://youtu.be/2PjlaUnrAMQ?si=Rq5JgCZSvKK_GCn1\n","permalink":"https://tiwariji.net/posts/threading/","summary":"\u003cp\u003eThreads are the smallest unit of execution that an operating system can schedule inside a process.     \u003cbr\u003e\nPeople (me) get confused about what threads are and what \u003cem\u003eprocesses\u003c/em\u003e are. This article will talk about threading as a programming concept and not the theory behind processes and threads. Nevertheless, we\u0026rsquo;ll talk about processes and threads too.\u003c/p\u003e\n\u003ch1 id=\"processes-and-threads\"\u003eProcesses and threads\u003c/h1\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003eProcess A       \n ├── Thread 1       \n ├── Thread 2       \n └── Thread 3       \n\nProcess B       \n ├── Thread 1       \n └── Thread 2       \n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eProcess is an independent program instance. It has its own memory and resources. Think of it like an agent.     \u003cbr\u003e\nA process can have one or more threads in it. Threads share the process memory and resources, though each thread has its own execution state and stack.  \u003cbr\u003e\nHow processes and threads are implemented depends on specific programming languages and operating systems, so I highly recommend checking out the \u003ca href=\"#references\"\u003ewiki\u003c/a\u003e.\u003c/p\u003e","title":"Studying threads and benchmarking python vs c++"},{"content":"\u0026ldquo;compute solves a lot of problem\u0026rdquo;\nIf we just had enough compute, a lot of the problems that we experience today would be solved. Loger contexts, smarter weights and biases, etc. But right now we don\u0026rsquo;t have infinite compute. That\u0026rsquo;s the sad reality. So we optimize. 
Quantization is our attempt at just that.\nhistory Reference: https://arxiv.org/abs/2103.13630\nQuantization is a way of compression. It is a process of mapping a large set of continuous or high-precision values into a smaller, discrete set of values.\nFormally: Input: A continuous or high-precision value (e.g., a real number between -1.0 and 1.0 with 32-bit floating point precision). Output: A discrete value chosen from a limited set of levels (e.g., integers from -128 to 127 in 8-bit).\nPut simply, say you have data in 32-bit floating point format. I tell you, put that data in INT8 format. You see, the bit difference is 32 -\u0026gt; 8. And you do it. Congrats, you just quantized your data.\nYou have used rounding-off in elementary math classes too, right? That is an example of quantization. Digital signal processing and audio/video compression are other examples.\nquantization and neural nets In neural nets, we have data. A lot of it actually. Weights, gradients, activations, and other data that we don\u0026rsquo;t think about when training/inferencing a nn.\nThe benefits of quantization as you can guess already are:\nfaster inference lower model memory lower power consumption It\u0026rsquo;s nice to be aware of the different number formats we have in machine learning so you don\u0026rsquo;t get lost further in the essay. The common ones are:\nformat bits range notes FP32 32 ~±3.4 × 10³⁸ standard training format FP16 16 ~±65504 half precision BF16 16 ~±3.4 × 10³⁸ same range as FP32, less precision INT8 8 -128 to 127 signed UINT8 8 0 to 255 unsigned INT4 4 -8 to 7 aggressive quantization INT2 2 -2 to 1 extreme, rarely used the quantization function $$Q(r) = Int(r/S) + Z$$Q is the function. (The referenced paper writes this with a minus sign in front of Z; the sign only depends on how Z is defined, and +Z is the convention that matches the zero-point formula and the code in this essay.) r is the multi-dimensional tensor to be quantized. Int() maps a real value to an integer value through a rounding operation (e.g., round to nearest or truncation). S is the scale value. Very important in the process. 
Z is the zero-point value, which is just as important in these calculations.\nScale and Zero Point Quantization maps a float range $[\\alpha, \\beta]$ to an integer range $[q_{\\min}, q_{\\max}]$.\nFor $n$-bit quantization:\n$$q_{\\min} = -2^{n-1}, \\quad q_{\\max} = 2^{n-1} - 1$$For 8-bit: $q_{\\min} = -128$, $q_{\\max} = 127$.\nAsymmetric The float range is taken directly from the tensor:\n$$\\alpha = \\min(w), \\quad \\beta = \\max(w)$$$$s = \\frac{\\beta - \\alpha}{q_{\\max} - q_{\\min}}$$$$z = \\text{clamp}\\left(\\left\\lfloor q_{\\min} - \\frac{\\alpha}{s} \\right\\rceil,\\ q_{\\min},\\ q_{\\max}\\right)$$Symmetric The range is forced to be symmetric around zero:\n$$\\beta = \\max(|w|), \\quad \\alpha = -\\beta$$$$s = \\frac{2\\beta}{q_{\\max} - q_{\\min}}$$Since $\\alpha = -\\beta$, the zero point no longer depends on the data: $q_{\\min} + \\beta/s = -0.5$ before rounding, which is rounded to $z = 0$. This is why symmetric quantization is preferred for weights: you eliminate $z$ from every multiply-accumulate entirely.\nThis is how you\u0026rsquo;d implement something like this in Python.\ndef get_scale_zero_point(tensor, symmetric=True, n_bits=8): qmin = -(2 ** (n_bits - 1)) qmax = (2 ** (n_bits - 1)) - 1 # asymmetric if not symmetric: alpha = tensor.min().item() beta = tensor.max().item() # symmetric else: beta = tensor.abs().max().item() # largest absolute value of the tensor alpha = -beta # standard scaling scale = (beta - alpha) / (qmax - qmin) if scale == 0: scale = 1e-8 # avoid division by zero zero_point = round(qmin - alpha / scale) zero_point = max(qmin, min(qmax, zero_point)) return scale, zero_point accuracy drop? One question you could ask is: \u0026ldquo;we\u0026rsquo;re using 8 bits instead of 32 bits to store the same information, aren\u0026rsquo;t we losing data?\u0026rdquo;\nYou are. 
But how much you lose is where the story begins.\nQuantization: Methods Let\u0026rsquo;s look at some code and numbers.\nQAT One way to train for quantization is \u0026lsquo;QAT\u0026rsquo;, which stands for Quantization-Aware Training.\nYou tell the baker upfront that the cake will be served in rougher slices. They bake accordingly.\nDuring training, you simulate quantization: fake quantization nodes are inserted into the graph. The forward pass mimics low precision. But the backward pass still uses full precision gradients. The model learns to be robust to the quantization noise. By the time you actually quantize at deployment, the weights have already adapted.\nThe downside is cost: training needs more compute, because every forward pass also has to simulate the quantized weights. But this usually gives the smallest accuracy drop. As I said in the beginning of the essay, compute can solve a lot of our problems.\nFake quantization Here\u0026rsquo;s the setup: I trained a little CNN on MNIST and exported the weights. You can have a look at the code here\nI then took the same model architecture and quantized the weights this time, using the formulas that I defined earlier. What I am doing here is essentially simulating a quantization process.\nclass QuantizedModel(nn.Module): ... # init methods.... . . . 
def quantize_weights(self): \u0026#34;\u0026#34;\u0026#34;Quantize each layer’s weights independently after loading FP32 weights.\u0026#34;\u0026#34;\u0026#34; for name, module in self.named_children(): if isinstance(module, (nn.Conv2d, nn.Linear)): weight = module.weight.data scale, zp = get_scale_zero_point(weight, symmetric=self.symmetric, n_bits=self.bitwidth) q_weight = quantize(weight, scale, zp) setattr(self, f\u0026#34;{name}_q\u0026#34;, q_weight.clone().detach()) setattr(self, f\u0026#34;{name}_scale\u0026#34;, torch.tensor(scale)) setattr(self, f\u0026#34;{name}_zp\u0026#34;, torch.tensor(zp, dtype=torch.int32)) # dequantize module.weight.data = dequantize(q_weight, scale, zp) Again all the code is present here. In this snippet you see a method defined in the quantized model class. This model loads a pre-trained model and when you call model.quantize_weights(), this method is called.\nYou see that we quantize the weight tensor and store that in a different buffer. and IMMEDIATELY dequantize it! This produces an effect of quantization while still all the operations are in floating point 32 only. This type of simulation is useful while studying accuracy drops only since the memory load and inference speeds are the same.\n(torchgpu) quantization\\ $ python train.py using: cuda epoch 1 loss: 0.1336 epoch 2 loss: 0.0444 epoch 3 loss: 0.0299 epoch 4 loss: 0.0212 epoch 5 loss: 0.0176 saved to mnist_model.pth FP32 test loss: 0.0305, accuracy: 0.9907 Quantized test loss: 0.0305, accuracy: 0.9907 This was the symmetric uniform quantization code. See this part of the result:\nFP32 test loss: 0.0305, accuracy: 0.9907 Quantized test loss: 0.0305, accuracy: 0.9907 The FP32 is the actual model and its accuracy on a test set. The quantized one is where we simulated the quantization effect. As we can see, the loss is absolutely same. 
Training and inference were done on very little data (200-300 images), and that\u0026rsquo;s why the difference is negligible.\nYou see, when I tried quantizing the weights in a ResNet18, the mean absolute error I got was about 0.0017, which is roughly 0.2% of the largest weight.\nHere\u0026rsquo;s the code that was used along with the result:\nimport torch from load_model import load_model # loading ResNet18 model defined in load_model.py def get_scale_zero_point(tensor, symmetric, n_bits=8): qmin = -(2 ** (n_bits - 1)) qmax = (2 ** (n_bits - 1)) - 1 # asymmetric if not symmetric: alpha = tensor.min().item() beta = tensor.max().item() # symmetric else: beta = tensor.abs().max().item() # largest absolute value of the tensor alpha = -beta # standard scaling scale = (beta - alpha) / (qmax - qmin) zero_point = round(qmin - alpha / scale) zero_point = max(qmin, min(qmax, zero_point)) return scale, zero_point def quantize(tensor, scale, zero_point, n_bits=8): qmin = -(2 ** (n_bits - 1)) qmax = (2 ** (n_bits - 1)) - 1 q = torch.round(tensor / scale + zero_point) q = torch.clamp(q, qmin, qmax) return q.to(torch.int8) def dequantize(q_tensor, scale, zero_point): return scale * (q_tensor.float() - zero_point) if __name__ == \u0026#34;__main__\u0026#34;: model = load_model() weight = model.conv1.weight.data print(\u0026#34;original:\u0026#34;, weight.shape, weight.dtype) print(\u0026#34;min/max:\u0026#34;, weight.min().item(), weight.max().item()) sym = True # flag for symmetric quantization scale, zp = get_scale_zero_point(weight, sym) print(\u0026#34;scale:\u0026#34;, scale, \u0026#34;zero_point:\u0026#34;, zp, \u0026#34; in symmetric\u0026#34;) q = quantize(weight, scale, zp) print(\u0026#34;quantized dtype:\u0026#34;, q.dtype) dq = dequantize(q, scale, zp) error = (weight - dq).abs().mean().item() print(\u0026#34;mean abs error:\u0026#34;, error) Results:\n# asymmetric original: torch.Size([64, 3, 7, 7]) torch.float32 min/max: -0.8433799147605896 1.0164732933044434 scale: 0.007293541992411894 zero_point: -12 in asymmetric quantized 
dtype: torch.int8 \u0026gt;\u0026gt;\u0026gt; mean abs error: 0.001590650761500001 #symmetric original: torch.Size([64, 3, 7, 7]) torch.float32 min/max: -0.8433799147605896 1.0164732933044434 scale: 0.007972339555328967 zero_point: 0 in symmetric quantized dtype: torch.int8 \u0026gt;\u0026gt;\u0026gt; mean abs error: 0.0017431897576898336 Look at the mean abs error. That\u0026rsquo;s how much information you\u0026rsquo;re losing from torch.float32 to torch.int8. And this is exactly why quantizing works. And libraries like pytorch are building APIs for quantized model training because as you can see, the data that is lost during this process is not that much and we can compress a model from say 16GB to ~3GB!\nPTQ You bake the cake first, then decide to slice it into rougher chunks.\nThe model is already trained. You just convert the weights and activations to lower precision after the fact. You might run a small \u0026ldquo;calibration dataset\u0026rdquo; through it, not to train, just to observe what range of values the activations take, so you know how to map floats → ints without clipping too much. It\u0026rsquo;s fast and easy. The downside: the model never knew it was going to be quantized, so it wasn\u0026rsquo;t optimized for surviving that precision loss. 
Accuracy drops a bit, sometimes a lot.\nHere\u0026rsquo;s a nn that I wrote to study PTQ.\nclass MNISTNet_q(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(1, 32, 3, padding=1) self.conv2 = nn.Conv2d(32, 64, 3, padding=1) self.pool = nn.MaxPool2d(2, 2) self.relu1 = nn.ReLU() self.relu2 = nn.ReLU() self.relu3 = nn.ReLU() self.fc1 = nn.Linear(64 * 7 * 7, 128) self.fc2 = nn.Linear(128, 10) self.quantstud = torch.quantization.QuantStub() self.dequantstud = torch.quantization.DeQuantStub() def fuse_model(self): return torch.quantization.fuse_modules(self, [[\u0026#39;conv1\u0026#39;, \u0026#39;relu1\u0026#39;], [\u0026#39;conv2\u0026#39;, \u0026#39;relu2\u0026#39;]], inplace=True) def forward(self, x): x = self.quantstud(x) x = self.pool(self.conv1(x)) x = self.pool(self.conv2(x)) x = x.reshape(-1, 64 * 7 * 7) x = self.relu3(self.fc1(x)) x = self.fc2(x) x = self.dequantstud(x) return x Here we have normal layers like Conv and ReLU, but the keen reader will spot two extra modules: QuantStub and DeQuantStub. The torch.quantization API uses them as markers. After conversion, QuantStub quantizes the input activations at the start of forward, and DeQuantStub converts the output back to floating point at the end. The weights themselves are quantized to int8 once, when the model is converted, not on every forward pass.\nYou will notice we also defined a fuse_model method. That is because we want each Conv + ReLU pair merged into a single module, so that the pair is quantized together as one operation. 
Quantizing each layer separately would add extra quantize/dequantize work between a conv and its ReLU, and we don\u0026rsquo;t want that (optimization, remember?)\nif __name__ == \u0026#34;__main__\u0026#34;: test_loader = get_test_loader() fp32 = MNISTNet_q() fp32.load_state_dict(torch.load(\u0026#39;mnist_model.pth\u0026#39;, map_location=\u0026#39;cpu\u0026#39;)) fp32.eval() fp32_loss, fp32_acc = evaluate(fp32, test_loader, \u0026#39;cpu\u0026#39;) print(f\u0026#34;fp32 | loss: {fp32_loss:.4f} | acc: {fp32_acc:.4f}\u0026#34;) q_model = ptq(\u0026#39;mnist_model.pth\u0026#39;) q_loss, q_acc = evaluate(q_model, test_loader, \u0026#39;cpu\u0026#39;) print(f\u0026#34;int8 | loss: {q_loss:.4f} | acc: {q_acc:.4f}\u0026#34;) print(f\u0026#34;\\nconv1 weight dtype : {q_model.conv1.weight().dtype}\u0026#34;) print(f\u0026#34;fc1 weight dtype : {q_model.fc1.weight().dtype}\u0026#34;) print(f\u0026#39;q_model conv1 scale values caliberated during the process: {q_model.conv1.scale}\u0026#39;) print(f\u0026#39;q_model conv1 zp values caliberated during the process: {q_model.conv1.zero_point}\u0026#39;) If we run this file using the above code, we get:\nfp32 | loss: 0.5367 | acc: 0.9382 int8 | loss: 0.0291 | acc: 0.9908 conv1 weight dtype : torch.qint8 fc1 weight dtype : torch.qint8 q_model conv1 scale values caliberated during the process: 0.029921600595116615 q_model conv1 zp values caliberated during the process: 0 The fp32 model is the same architecture loaded with pre-trained weights. But note: the forward method of MNISTNet_q never calls relu1 or relu2, so the unfused fp32 model skips those two activations. They only take effect once fuse_model folds each ReLU into its conv, which is the form the quantized q_model runs in. (QuantStub and DeQuantStub themselves simply pass the tensor through before conversion.) That\u0026rsquo;s most likely why the comparison isn\u0026rsquo;t apples-to-apples: the fp32 number is artificially degraded by running the PTQ architecture without fusion.\nSo yeah, as you can see, the quantized model does pretty okay.\nAnd that shows why quantization is so widely talked about in ML and AI. 
I hope this makes things clear for you a little bit. It did for me definitely.\nI did not talk about a lot of technical details like ultra-quantization on 2-bits and stuff. But they are just built-up technical things on the same idea. You can read about them from here.\nPyTorch is moving the quantization to a separate package. Things are moving fast in the quantization field, are you keeping up? :)\nThanks for reading. ~Aayushya\nReferences paper: https://arxiv.org/abs/2103.13630 my code for this essay: https://github.com/aayushyatiwari/blogCode/tree/master/quantization/\n","permalink":"https://tiwariji.net/posts/quantization/","summary":"\u003cp\u003e\u003cstrong\u003e\u0026ldquo;compute solves a lot of problem\u0026rdquo;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIf we just had enough compute, a lot of the problems that we experience today would be solved. Loger contexts, smarter weights and biases, etc. But right now we don\u0026rsquo;t have infinite compute.    \u003cbr\u003e\nThat\u0026rsquo;s the sad reality.  \u003cbr\u003e\nSo we optimize. \u003cstrong\u003eQuantization\u003c/strong\u003e is our attempt at just that.\u003c/p\u003e\n\u003ch1 id=\"history\"\u003ehistory\u003c/h1\u003e\n\u003cp\u003eReference: \u003ca href=\"https://arxiv.org/abs/2103.13630\"\u003ehttps://arxiv.org/abs/2103.13630\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eQuantization is a way of compression. It is a process of mapping a large set of continous or high-precision values into smaller discrete set of values.\u003c/p\u003e","title":"From scratch Quantization in neural networks"},{"content":"Recently, at PyTorch Day India in Bangalore, I saw a talk on AI compilers. Here is the link: YouTube\nPicture from the session I didn't know there were Indian labs working on the AI compiler problem. But it turns out there are. 
PolyMage Labs is an IISc lab in Bangalore working on PolyBlocks.\nSince AI is moving fast, there is a clear need for efficient AI compilers that can lower high-level tensor programs to IR for GPUs, TPUs, and other backends. PolyBlocks minimizes dependency on external vendor libraries like cuBLAS/cuDNN while still generating highly optimized code via compiler-driven transformations and tiling.\nWhat is Polyblocks? It\u0026rsquo;s an AI compiler architecture built in PolyMage labs.\nOn 10th March, they released their paper.\nThis essay will go in depth on the paper and PolyBlocks compiler abstractions.\nintroduction Since these \u0026ldquo;AI\u0026rdquo; compilers are also \u0026ldquo;compilers\u0026rdquo;, they have abstraction layers just like traditional compilers.\nThere are high-level abstractions:\nPytorch, Tensorflow, JAX There are mid-level abstractions:\nTriton, Pallas There are low-level abstractions:\ncuBLAS, cuDNN, ROCm, Cutlass etc In the paper, they\u0026rsquo;ve described the difference between these three abstractions.\nHere is a brief flowchart of how this works.\nPyTorch → aten ops → linalg → affine → LLVM IR → GPU binary JAX → stablehlo → affine → LLVM IR → GPU binary TensorFlow → mhlo → affine → LLVM IR → GPU binary\nSpecifically for Polyblocks, it is registered as a torch backend and has its own plugs for Tensorflow and JAX respectively.\nLet\u0026rsquo;s without further ado look at the architecture. Also, one common word in the paper is the word \u0026ldquo;dialect\u0026rdquo;. It\u0026rsquo;s from the 2020 paper on MLIR. Rest of the article will use it too so maybe get familiar with it first.\nclick here\narchitecture Polyblocks has a 5-stage pipeline from S1 to S5. Frontend stage: S1, S2 These stages are target-neutral and do not depend on specific hardware.\nThe frontend is responsible for canonicalization of high-level ops and conversion into lower-level MLIR dialects:\nLower operations from tensor dialect to memref dialect in S1. 
Lower named tensor ops on memrefs to affine nests in S2. Mid-level optimization S3 is the mid-level optimization stage. It takes only memref and affine dialects of MLIR and does 50-70 passes for one target. Things that happen in this stage:\nfusion tiling mapping to matrix units generation of data movement code for on-chip memories vectorization So, we now have nicely optimized affine loop nests. They are still written as abstract affine.for loops and affine.parallel ops, which are later lowered in backend stages.\nBackend stages This is where the code is made compatible with different hardware. What now happens is this:\nS4: affine dialects -\u0026gt; GPU dialects\nThese stages are hardware dependent, meaning each hardware target takes a different lowering path.\nS5: S4 still produces MLIR operations. S5 takes that and converts everything down to LLVM IR — the final common language before machine code.\nFor NVIDIA GPU: gpu dialect → nvvm / nvgpu dialect → LLVM IR → NVPTX → CUDA binary (cubin)\nFor AMD GPU: gpu dialect → amdgpu / rocdl dialect → LLVM IR → HSACO binary\nFor CPU: omp dialect → LLVM IR → native binary\nCompiler driver C4 is the compiler driver stage. It is written in Python. What it does:\ncontains the pass pipeline for all stages. contains support for converting the graph form of different high-level frameworks to MLIR. also contains logic for JIT compiling inputs to outputs. support for ahead-of-time compilation. kernel launches and synchronization optimizations One of the most important things about PolyBlocks, and what makes it special, is that it relies on affine memory access patterns. It also uses MLIR to handle its low-level code. Before talking about the fusion approach I\u0026rsquo;d like you to know about the producer-consumer terminology. It\u0026rsquo;s pretty straightforward. Consider:\nA = relu(X) B = matmul(A, W) Here, relu is the producer and matmul is the consumer. Simple. 
Now, let\u0026rsquo;s talk about the fusion approach in PolyBlocks.\nslicing based fusion approach The idea is this: In traditional fusion, we just merge loops directly.\nex:\n# Two separate loops for i in range(N): A[i] = X[i] * 2 # producer for i in range(N): B[i] = A[i] + 1 # consumer # FUSED for i in range(N): A[i] = X[i] * 2 B[i] = A[i] + 1 # A[i] stays in register Why this would fail:\nfor i in range(N): A[i] = X[i] + X[i+1] # producer, needs neighbors for i in range(N): B[i] = A[i] + A[i+1] # consumer, also needs neighbors # A[i+1] is not computed yet! This is why PolyBlocks wins here. Because it uses affine memory access and we can look at each compute and index it — so it knows exactly which slice of the producer a consumer needs, and pulls just that in.\nPolyBlocks has evaluations for fusion. The specific evaluations mentioned in the paper are:\npreservation of parallelism preservation of vectorizability amount of redundant computations added will the fusion eliminate intermediate buffer? Beyond slicing-based fusion, PolyBlocks does a few more things worth naming. Tiling breaks large loops into small chunks that fit in fast memory — but the trick is you have to tile the destination first, then fuse sources into it, not the other way around. Attention gets special treatment too: the softmax in the middle makes standard fusion impossible, so PolyBlocks uses two passes to first eliminate DRAM roundtrips, then shared memory roundtrips, keeping everything in registers. Convolutions are quietly converted into matmuls on-the-fly so they reuse the same optimized path.\nresults and benchmarks of PolyBlocks The paper evaluates PolyBlocks on NVIDIA A100 and A10 GPUs, comparing against Torch eager, Torch Inductor, TensorRT, and XLA.\nFor PyTorch models at batch size 1, PolyBlocks is 2.15x faster than eager execution on average, 1.4x faster than Inductor, and 2.4x faster than TensorRT. 
At batch size 8, it\u0026rsquo;s 1.8x faster than eager and roughly on par with Inductor. The gap over Inductor is higher at smaller batch sizes — which makes sense, because Inductor\u0026rsquo;s underlying vendor libraries are tuned for large batches.\nFor JAX, PolyBlocks is 2.12x faster than JAX eager and 1.15x faster than XLA on average.\nThe individual operator results are arguably more impressive. For convolutions, PolyBlocks-generated code is competitive with cuDNN — and in nearly 50 cases actually beats it by more than 2x. For matmuls, it tracks closely with cuBLAS across a wide range of sizes. Remember, these are libraries that teams at NVIDIA have hand-tuned for years. Getting close without a single hand-written kernel is the whole point.\nThe ablation study isolates what actually matters. Tensor cores alone give a 17x speedup. Cross-operator fusion gives 2.87x on top of that. Reduce-reduce fusion for attention gives another 1.42x. The numbers make the argument cleanly that fusion is not a nice-to-have, it\u0026rsquo;s most of the performance.\nFor quantized models the gains are even more dramatic, since Inductor has limited support for optimization in the presence of quantization. PolyBlocks handles it naturally because quantization just becomes more affine ops in the same pipeline.\nConclusion The paper was a fantastic and in-depth read. It broke some assumptions, like the belief that compiler-generated code can\u0026rsquo;t compete with libraries like cuDNN because of how many years of hand-tuning went into them. PolyBlocks shows that it can. If your tiling, packing, and tensor core mapping are right, the generated code gets there. While PolyBlocks is still not complete, because features like cross-attention are yet to be implemented, it\u0026rsquo;s still on par with libraries that have been in the game for years. If that\u0026rsquo;s not exciting, I don\u0026rsquo;t know what is.\nKudos to the PolyBlocks team!\nWith that, we end this paper review. 
I had fun, I hope you did too.\nThanks for reading ~ Aayushya Tiwari\nReferences PolyBlocks paper (arXiv) PyTorch Day India talk (YouTube) Appendix affine_nests A loop nest is affine if every array index is a linear function of the loop variables. No multiplying loop vars together, no data-dependent indexing.\nA[i][k] — affine ✓ (just i and k) A[i*k] — NOT affine ✗ (product of two loop vars) A[data[i]] — NOT affine ✗ (index depends on runtime data) dialect A dialect is just a named collection of operations.\nExample — the arith dialect contains operations like:\narith.addf (floating point add) arith.mulf (floating point multiply) arith.cmpi (integer compare) The affine dialect contains:\naffine.for (a loop with affine bounds) affine.load (load from memory with affine index) affine.store (store to memory with affine index) Why have multiple dialects?\nBecause different levels of abstraction need different operations.\nAt the top level you want to say matmul(A, B). At the bottom level you want to say load this address, multiply these floats, store result. These are very different vocabularies. 
Each dialect captures one level.\nThe dialects in the paper, ordered high to low:\nlinalg / mhlo / stablehlo ← named tensor ops (matmul, conv) ↓ affine ← explicit loop nests, affine accesses ↓ memref / scf ← raw memory, generic loops ↓ gpu / nvvm / nvgpu ← GPU specific ops, warp primitives ↓ llvm ← almost assembly now you can look at the architecture\nfusion Running multiple operations in one GPU kernel instead of running multiple kernels for operations is called Fusion.\nThe benefit: intermediate results stay in fast memory (registers or shared memory) instead of being written to DRAM and read back.\n# Without fusion: 2 kernel launches, A goes to DRAM between them A = relu(X) # kernel 1 → writes A to DRAM B = A + 1 # kernel 2 → reads A from DRAM # With fusion: 1 kernel launch, A stays in registers for each element: a = relu(x) b = a + 1 # a never touches DRAM DRAM is ~100x slower than registers. So the more you can keep intermediate results in fast memory, the faster your model runs. That\u0026rsquo;s the whole point of fusion.\n","permalink":"https://tiwariji.net/posts/polyblocks/","summary":"\u003cp\u003eRecently, at PyTorch Day India in Bangalore, I saw a talk on AI compilers.\nHere is the link: \u003ca href=\"https://www.youtube.com/watch?v=upzLl0mp74I\"\u003eYouTube\u003c/a\u003e\u003c/p\u003e\n\u003ch5 id=\"picture-from-the-session\"\u003ePicture from the session\u003c/h5\u003e\n\u003cp\u003e\u003cimg alt=\"picture of Mr Uday\" loading=\"lazy\" src=\"/posts/polyblocks/2026-03-29_16-25.png\"\u003e\u003c/p\u003e\n\u003cbr\u003e\nI didn't know there were Indian labs working on the AI compiler problem. 
But it turns out there are.\n\u003cp\u003e\u003cem\u003e\u003cstrong\u003ePolyMage Labs\u003c/strong\u003e\u003c/em\u003e is an IISc lab in Bangalore working on \u003ca href=\"https://arxiv.org/abs/2603.06731\"\u003ePolyBlocks\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eSince AI is moving fast, there is a clear need for efficient AI compilers that can lower high-level tensor programs to IR for GPUs, TPUs, and other backends. PolyBlocks minimizes dependency on external vendor libraries like cuBLAS/cuDNN while still generating highly optimized code via compiler-driven transformations and tiling.\u003c/p\u003e","title":"How PolyBlocks lowers tensor programs for GPUs"},{"content":"Some days back I was studying for a computer networks exam. I came across a few protocols which were very interesting. Like SMTP (Simple Mail Transfer Protocol), telnet, SCP (Secure Copy Protocol) just to name a few.\nSMTP and a little bit of theory Simple Mail Transfer Protocol is a protocol used to transfer mail between servers. It was first specified in 1981. It works on port 25 for server-to-server relay, while mail clients typically submit messages on port 587.\nThere were, however, issues with SMTP. For one, it only carried 7-bit ASCII; second, the data was unencrypted. To address the first issue, the MIME standard was developed. It is an internet standard that extends the message format of email to support non-ASCII characters. SMTP -\u0026gt; transport MIME -\u0026gt; about the message\nThe idea The idea behind MIME is this: SMTP could already share text data (7-bit ASCII) over servers. The question MIME devs had to handle was: could you share data like images, video and audio encoded as text? Turns out, you could.\nThe MIME working So, we can send emails using different protocols with the mime header format. 
We can understand MIME as a set of software functions that transform non-ASCII data into ASCII text and back.\nMIME Header The MIME header consists of the following fields, attached to each email.\nVersion: Which MIME version this data was encoded with. Content-type: Describes the type and subtype of content. Content-transfer-encoding: Describes the encoding format. Content-ID: Unique ID for the content. Content-Description: Describes the type of data present in the body. Script to transfer images over ssh and using scp The idea was this:\n\u0026ldquo;Since I can transfer .txt files over scp (again, I didn\u0026rsquo;t know you could just transfer images), how about I take images, encode them (just like MIME) and transfer that over scp and using ssh decode them on the other machine?\u0026rdquo;\nI did just that.\nSo here\u0026rsquo;s a script that you can use to overcomplicate the image sharing process over SCP :P\n#!/usr/bin/env python3 import argparse import base64 import subprocess import os parser = argparse.ArgumentParser() parser.add_argument(\u0026#39;--image_path\u0026#39;, required=True) parser.add_argument(\u0026#39;--scpIP\u0026#39;, required=True) parser.add_argument(\u0026#39;--scpPassword\u0026#39;, required=True) args = parser.parse_args() # Read and encode image with open(args.image_path, \u0026#39;rb\u0026#39;) as f: encoded = base64.b64encode(f.read()).decode() # Write to temp file temp_file = \u0026#39;/tmp/image_transfer.txt\u0026#39; with open(temp_file, \u0026#39;w\u0026#39;) as f: f.write(encoded) try: subprocess.run([\u0026#39;sshpass\u0026#39;, \u0026#39;-p\u0026#39;, args.scpPassword, \u0026#39;scp\u0026#39;, temp_file, f\u0026#39;{args.scpIP}:~/\u0026#39;], check=True) subprocess.run([\u0026#39;sshpass\u0026#39;, \u0026#39;-p\u0026#39;, args.scpPassword, \u0026#39;ssh\u0026#39;, args.scpIP, \u0026#39;cat ~/image_transfer.txt | base64 -d \u0026gt; ~/decoded_image\u0026#39;], check=True) print(\u0026#34;Image transferred and decoded on remote 
machine as ~/decoded_image\u0026#34;) except subprocess.CalledProcessError: print(\u0026#34;couldn\u0026#39;t reach the ip\u0026#34;) finally: os.remove(temp_file) Following up from my last essay, you can see I have used the good old argparse library to take arguments from CLI. The arguments are:\nimage_path: path to the image file scpIP: your remote machine\u0026rsquo;s IP where you want to copy this file to scpPassword: your remote machine\u0026rsquo;s password You can see there are two python libraries that I\u0026rsquo;ve used. base64 and subprocess\nbase64: its the encoding format of the image. Its similar to what content-transfer-encoding: base64 (remember from MIME HEADER FORMAT) means. subprocess: its used to do SSH and SCP in python Steps: We take the image. We encode it to .txt in base64 format. We use the IP and password to first, transfer the .txt over scp to the machine and then ssh over that machine using the same credentials to do .txt -\u0026gt; image using: subprocess.run([\u0026#39;sshpass\u0026#39;, \u0026#39;-p\u0026#39;, args.scpPassword, \u0026#39;ssh\u0026#39;, args.scpIP, \u0026#39;cat ~/image_transfer.txt | base64 -d \u0026gt; ~/decoded_image\u0026#39;], check=True) The complicated bash line is just piping. You take the output from the first command and make that the input for the other command. And so we use the bash command base64 -d to decode the .txt content to convert it to image format.\nor you could just do scp and then transfer image over your internet :)\nI did not know that you could transfer images data over scp. I did not question that either. Which is my bad. You can, it turns out. But I had this idea and I executed it. Which is fine.\nA little question: do you think the resolution of the image will be changed if you transfer images like this? Hint: A good thing to think about is the data. 
Are we losing data in this process?\nThanks for reading ~ Aayushya Tiwari\nReferences SSH wiki: (https://en.wikipedia.org/wiki/Secure_Shell) GFG article on SMTP: (https://www.geeksforgeeks.org/computer-networks/simple-mail-transfer-protocol-smtp/) my repo for code : (https://github.com/aayushyatiwari/minimalistic_image_transfer_over_scp) ","permalink":"https://tiwariji.net/posts/mime/","summary":"\u003cp\u003eSome days back I was studying for computer networks exam. I came across few protocols which were very interesting. Like SMTP (Simple Mail Transfer Protocol), telnet, SCP (Secure Copy Protocol) just to name a few.\u003c/p\u003e\n\u003ch1 id=\"smtp-and-a-little-bit-of-theory\"\u003eSMTP and a little bit of theory\u003c/h1\u003e\n\u003cp\u003eSimple Mail Transfer Protocol is a protocol used to transfer mails over servers. It was written in 1981. IT works on port number 25. Since SMTP is server-to-server, the client port number is 587.\u003c/p\u003e","title":"Building a terminal image viewer over SSH"},{"content":"In this article I\u0026rsquo;d like to introduce you to a rather useful python library that can be of use to you. It\u0026rsquo;s called argparse and recently I have been using it as my go to for couple of things.\nI first got to know about this library when participating in a kaggle comp. It was pretty intimidating at first because you\u0026rsquo;re not sure what\u0026rsquo;s going on but after this article I am hoping you\u0026rsquo;d know how to deal with code that mentions argparse. We\u0026rsquo;ll also talk about config files and how this library can be used to write config file.\nhistorical context to argparse Argparse is a python library that was introduced in Python 3.2 It is a library that\u0026rsquo;s used for command-line parsing. Before this, optparse was the official cli parsing module, but in python 3.2 they changed it to argparse.\nwhat is it? 
As said, it\u0026rsquo;s a tool that can be used to parse command-line arguments.\nIn this article we will write a very simple script to go to a website from your terminal.\nBasics Some basics first.\nWhen we run\npython file.py arg1 arg2 arg3 all of the args are stored in the sys.argv list. How sys.argv works:\nsys.argv[0] = the script name itself sys.argv[1] = first argument sys.argv[2] = second argument and so on\u0026hellip; For example: I\u0026rsquo;ll create a file named script.py and it\u0026rsquo;s just a template for now.\nimport argparse def parse(): pass if __name__ == \u0026#34;__main__\u0026#34;: parse() Now to use the argparse lib, we first have to instantiate an object.\nimport argparse def parse(): parser = argparse.ArgumentParser() if __name__ == \u0026#34;__main__\u0026#34;: parse() And now, this parser has methods. One of them is add_argument.\nRemember the way the arguments are stored in sys.argv? You can have a look at the arguments using the below script.\nimport argparse import sys def parse(): parser = argparse.ArgumentParser() parser.add_argument(\u0026#34;filename\u0026#34;) parser.add_argument(\u0026#34;--count\u0026#34;, type=int) args = parser.parse_args() print(sys.argv) # ---\u0026gt; the list of arguments from cli print(args) # ---\u0026gt; namespace object if __name__ == \u0026#34;__main__\u0026#34;: parse() Run the script using something like this:\npython script.py hello.txt --count 5 If you do everything correctly, you can have a look at the sys.argv list and a namespace object.\n$ python script.py hello.txt --count 5 [\u0026#39;script.py\u0026#39;, \u0026#39;hello.txt\u0026#39;, \u0026#39;--count\u0026#39;, \u0026#39;5\u0026#39;] Namespace(filename=\u0026#39;hello.txt\u0026#39;, count=5) As we can see, sys.argv is just a raw list of strings, and args is the Namespace with actual types attached.\nthe .add_argument() method is used to add keys to the namespace the .parse_args() method is used to create the Namespace object 
namespace What is Namespace It\u0026rsquo;s a simple class. When you call parse_args(), argparse creates a Namespace object and attaches your arguments as attributes to it.\nSo instead of a dict, you get args.lr instead of args[lr]. Cleaner to read and write.\nUses I personally use them to create config files for my model training runs. Say if I have to test different hyperparameters and check which of them are working better, I\u0026rsquo;d have to manually go to the code and change everything if I didn\u0026rsquo;t have this library. With this library you can just create a python3 script with argparse in it and take the hyperparameters from the cli.\nimport argparse def parse(): parser = argparse.ArgumentParser() parser.add_argument(\u0026#34;--lr\u0026#34;, type=float, default=0.001) parser.add_argument(\u0026#34;--epochs\u0026#34;, type=int, default=10) parser.add_argument(\u0026#34;--batch_size\u0026#34;, type=int, default=32) return parser.parse_args() def train(args): print(f\u0026#34;Training with lr={args.lr}, epochs={args.epochs}, batch_size={args.batch_size}\u0026#34;) args = parse() train(args) Run this like:\npython train.py --lr 0.01 --epochs 50 --batch_size 64 And you\u0026rsquo;re done. No touching the code.\nYou can be more fancy by using a config.yaml file but that\u0026rsquo;s outside of the scope of this article.\nA toy script to go to your favourite website from the terminal So, there\u0026rsquo;s this python library called webbrowser. Using that library you can write a small script like this.\nimport webbrowser as net import argparse parser = argparse.ArgumentParser(description=\u0026#34;opening websites from terminal\u0026#34;) parser.add_argument(\u0026#39;--goto\u0026#39;, type=str, help=\u0026#34;openthiswebsite\u0026#34;) args = parser.parse_args() if args.goto: net.open(f\u0026#39;https://{args.goto}\u0026#39;) And you\u0026rsquo;re done. 
For fun, you can also write this in a bash script and make it executable so that you can run this without the python keyword. Can you figure out how you\u0026rsquo;d run the script?\nHint:\npython script.py ... ... What could those two places be?\nMaking it feel like a real command (no python needed!) Save the file as goto.py (or whatever name you like). First, add this special line as the very first line of your script (before any import):\n#!/usr/bin/env python3 import webbrowser as net import argparse parser = argparse.ArgumentParser(description=\u0026#34;Open websites from your terminal 🚀\u0026#34;) parser.add_argument(\u0026#39;--goto\u0026#39;, type=str, help=\u0026#34;The website to open (example: google.com)\u0026#34;) args = parser.parse_args() if args.goto: net.open(f\u0026#39;https://{args.goto}\u0026#39;) else: print(\u0026#34;Please use --goto \u0026lt;website\u0026gt;\u0026#34;) This magic line (#!/usr/bin/env python3) tells your computer: “Hey, run me with Python 3!”\nNow make the file executable (only needed once): chmod +x goto.py Try it:\n./goto.py --goto youtube.com Nice! But typing ./goto.py every time is annoying if the file is not in your current folder.\nLevel up: run it from anywhere by adding it to your PATH Create a personal bin folder if you don\u0026rsquo;t have one:\nmkdir -p ~/bin Move your script there and drop the .py extension:\nmv goto.py ~/bin/goto Add ~/bin to your PATH. At the end of your shell config:\nexport PATH=\u0026#34;$HOME/bin:$PATH\u0026#34; bash → ~/.bashrc zsh → ~/.zshrc Reload:\nsource ~/.bashrc # or source ~/.zshrc Now try:\ngoto --goto x.com goto --help There we have it.\nReferences official python argparse doc sys.argv python official doc howto guide official python\nThank you for reading.\n~Aayushya Tiwari\n","permalink":"https://tiwariji.net/posts/notes-on-argparse/","summary":"\u003cp\u003eIn this article I\u0026rsquo;d like to introduce you to a rather useful python library that can be of use to you.   
\u003cbr\u003e\nIt\u0026rsquo;s called \u003ccode\u003eargparse\u003c/code\u003e and recently I have been using it as my go to for couple of things.\u003c/p\u003e\n\u003cp\u003eI first got to know about this library when participating in a kaggle comp.      \u003cbr\u003e\nIt was pretty intimidating at first because you\u0026rsquo;re not sure what\u0026rsquo;s going on but after this article I am hoping you\u0026rsquo;d know how to deal with code that mentions argparse. We\u0026rsquo;ll also talk about config files and how this library can be used to write config file.\u003c/p\u003e","title":"Writing robust Python CLIs with argparse"},{"content":"If you didn\u0026rsquo;t know, I recently started writing more. Published a new website. Yes, the website you\u0026rsquo;re reading this at. The reason was to get good at understanding and learning. With the coming of AI, writing code has never been easier. And to be honest, I don\u0026rsquo;t think AI has any role in this. This was way before AI came. The main thing that drives the world imo is an idea. Ideas and implementations. Now the way we implement things have been changing since ages. The one example I like to think about is of the compilers and assembly programmers when C language came. Pretty sure all of them were in the same position developers today are. But that\u0026rsquo;s another story. Implementations change, but the most thing that drives technology, sciences, math and all the important stuff, are, as i said, ideas. And to get better ideas, we don\u0026rsquo;t just need intellect. No. We need creativity, we need people who can understand deeply. Who can think. And I don\u0026rsquo;t use the word think in a lighter manner. Thinking was never easy. And in today\u0026rsquo;s world, it\u0026rsquo;s even harder. Which is why I started writing. Because believe me or not, writing is thinking. 
Every week I have this essay that I have to think about, learn, and write about.\nYour daily reminder to: Think deeply\nIt\u0026rsquo;s about, forcing your brain to make neural connections and engage in the act of thinking. Honestly, I couldn\u0026rsquo;t, for the life of me, understand backprop. But trying to write the essay and implementing different neural nets without the pytorch\u0026rsquo;s loss.backward() helped me understand it better. That, is exactly why I started writing.\nSo, what\u0026rsquo;s the update? I have been enjoying this. Two months since I have started writing, officially. And it is going well. I can, obviously, think better about anything that I am reading. Most importantly I understand, where there is a bug. There is a flaw. Not in the code, or the equations, but in my understanding.\nI have been writing more code than ever, I mean, debugging more than ever. :P Currently building an autograd engine and modifying the already written micrograd in cpp. Also, reading some research papers. Trying to learn about the intersection of AI and low-level stuff. Expect some nice essays coming soon!\nMoving forward, I\u0026rsquo;d like to contribute to this small world more. Learn, produce more. Hopefully, you\u0026rsquo;d be watching! :)\nThanks for reading.\n~Aayushya Tiwari\n","permalink":"https://tiwariji.net/posts/2_months_update/","summary":"\u003cp\u003eIf you didn\u0026rsquo;t know, I recently started writing more. Published a new website.        Yes, the website you\u0026rsquo;re reading this at.    \u003cbr\u003e\nThe reason was to get good at understanding and learning.    \u003cbr\u003e\nWith the coming of AI, writing code has never been easier. And to be honest, I don\u0026rsquo;t think AI has any role in this. This was way before AI came.     \u003cbr\u003e\nThe main thing that drives the world imo is an \u003cstrong\u003eidea\u003c/strong\u003e. Ideas and implementations.      
\u003cbr\u003e\nNow the way we \u003cem\u003eimplement\u003c/em\u003e things have been changing since ages.     \u003cbr\u003e\nThe one example I like to think about is of the compilers and assembly programmers when C language came.     \u003cbr\u003e\nPretty sure all of them were in the same position developers today are.  \u003cbr\u003e\nBut that\u0026rsquo;s another story. Implementations change, but the most thing that drives technology, sciences, math and all the important stuff, are, as i said, ideas.  \u003cbr\u003e\nAnd to get better ideas, we don\u0026rsquo;t just need intellect. No.   \u003cbr\u003e\nWe need \u003cem\u003ecreativity\u003c/em\u003e, we need people who can understand deeply. Who can \u003cstrong\u003ethink\u003c/strong\u003e.   \u003cbr\u003e\nAnd I don\u0026rsquo;t use the word think in a lighter manner.  \u003cbr\u003e\nThinking was never easy.     \u003cbr\u003e\nAnd in today\u0026rsquo;s world, it\u0026rsquo;s even harder.  \u003cbr\u003e\nWhich is why I started writing. Because believe me or not, writing is thinking.  \u003cbr\u003e\nEvery week I have this essay that I have to think about, learn, and write about.\u003c/p\u003e","title":"Two months of writing code and building things"},{"content":"Before we see what torch.compile does, we should first understand pytorch\u0026rsquo;s default mode and why we\u0026rsquo;d ever want to move away from it.\nPyTorch runs in eager mode by default. Think of it as PyTorch reading and executing your code op by op, as Python encounters each line. It\u0026rsquo;s immediate, flexible, and great for prototyping — but it pays a Python interpreter cost on every single operation.\nFor production and deployment, we want to skip that cost. 
That\u0026rsquo;s where compilation comes in.\nLegacy ways of compiling torch code For deployment, we need models compiled into a form that bypasses the Python runtime on each op.\nPyTorch JIT (Just in Time compilation) JIT is a feature that compiles pytorch models into a static graph. Here\u0026rsquo;s what happens under the hood:\nGo from source code or a trace to a graph Run compiler passes through the graph — moving from .graph to an optimized graph, retrievable via .graph_for(*inputs) .graph → bytecode → executed by the JIT virtual machine Where JIT is useful is at the Python layer. A good way to think about it: JIT looks at your code once, compiles it into a static graph, and from then on runs that graph without the Python interpreter getting in the way. The ops still execute every time — but without Python\u0026rsquo;s overhead on each one.\nHere\u0026rsquo;s the function we\u0026rsquo;ll use to look at scripting and tracing in depth.\nimport torch as t def fn(x): for _ in range(x.dim()): x = x * x return x Scripting Scripting reads your source code directly and compiles the logic itself into a static graph.\nWe can use scripting by doing t.jit.script(fn). This returns an object, and we can inspect the IR:\ndef fn(x: Tensor) -\u0026gt; Tensor: x0 = x for _0 in range(torch.dim(x)): x0 = torch.mul(x0, x0) return x0 Notice that everything is statically typed. Meaning the type of every variable is known before runtime. The loop is preserved as a loop.\nTracing Tracing works differently: run the function once with a sample input, record every tensor op that executes, and freeze that recording as the graph.\nHere\u0026rsquo;s the IR for the same function, but traced:\ndef fn(x: Tensor) -\u0026gt; Tensor: x0 = torch.mul(x, x) return torch.mul(x0, x0) What you see above is the intermediate representation of the function. The loop is gone. The sample input was a 2D tensor, so x.dim() was 2, so the loop ran twice. 
As we will see later in the essay, this creates issues.\nTracing vs Scripting The core difference: tracing learns by watching, scripting learns by reading.\nThis matters when your code has branches. Consider:\nimport torch as t def fn(x): if x.sum() \u0026gt; 0: return x * 2 else: return x * -1 traced = t.jit.trace(fn, t.tensor([1.0, 2.0])) scripted = t.jit.script(fn) print(traced(t.tensor([-1.0, -2.0]))) print(scripted(t.tensor([-1.0, -2.0]))) The input is all negative, so the correct answer is [1.0, 2.0]. traced gets it wrong, returning [-2.0, -4.0] — it watched the function run with a positive sample input, recorded the x * 2 branch, and hardcoded it. The if condition was never saved. scripted gets it right because it compiled the actual logic.\nOne could ask, \u0026ldquo;why tracing at all then?\u0026rdquo;\nThe answer is that tracing works when the model doesn\u0026rsquo;t need data-dependent control flow. If run-it-once works for your function/model, torch.jit.trace will work nicely. Most simple CNNs and feedforward models are just that.\nScripting has its own limitations though. It only supports a strict subset of Python — no arbitrary Python objects, limited standard library usage, and dynamic typing will cause it to fail. If your model code uses anything outside that subset, scripting won\u0026rsquo;t work.\nModern torch.compile stack On March 15, 2023, PyTorch released PyTorch 2.0, which introduced torch.compile.\nWhat you do is simply:\nmodel = torch.compile(model) There are three stages happening under the hood:\nTorchDynamo — captures your model as a clean graph\nAOTAutograd — traces both forward and backward passes ahead of time\nTorchInductor — generates optimized low-level code for your hardware\nEach of these does an impressive amount of work and deserves its own essay. 
I\u0026rsquo;ll write them some day.\nReferences ezyang\u0026rsquo;s blog: core pytorch dev\ngfg\ndeep dive into tracing and scripting by another core pytorch dev\npytorch blog after release of PyTorch 2.0\npytorch docs on torch.compiler\nLLMs: Claude\nThanks for reading\n~ Aayushya Tiwari\n","permalink":"https://tiwariji.net/posts/torchcompile/","summary":"\u003cp\u003eBefore we see what \u003ccode\u003etorch.compile\u003c/code\u003e does, we should first understand pytorch\u0026rsquo;s default mode and why we\u0026rsquo;d ever want to move away from it.\u003c/p\u003e\n\u003cp\u003ePyTorch runs in \u003cstrong\u003eeager mode\u003c/strong\u003e by default. Think of it as PyTorch reading and executing your code op by op, as Python encounters each line. It\u0026rsquo;s immediate, flexible, and great for prototyping — but it pays a Python interpreter cost on every single operation.\u003c/p\u003e\n\u003cp\u003eFor production and deployment, we want to skip that cost. That\u0026rsquo;s where compilation comes in.\u003c/p\u003e","title":"What torch.compile actually does under the hood"},{"content":"Today we look at matrix multiplication (matmul, as we will call in this essay). Since, the last essay was on backprop, it was only logical to think about the most fundamental math operation that lets us do the algo. That is, matmul.\nAlso, the numbers in this essay are going to shock you. Like really. So if you think I am making this up, you should checkout my code for this essay.\nIn hindsight, it looks like a very normal operation to any math major or simply any other person who has done basic linear algebra but it is THE foundational op in deep learning and efficiently processing this compounds billions of times! In today\u0026rsquo;s essay, we look at how computers in today\u0026rsquo;s day and age do matmul and how GPUs changed the way.\nwhat is matrix multiplication? 
Wiki It is a basic math operation where we multiply two matrices to get another resultant matrix. Now, for matrix multiplication there is one condition. That is, the number of columns of the first matrix must equal the number of rows of the second matrix. The operation can be defined mathematically as this: C[i][j] = Σ_k A[i][k] · B[k][j]. Implementation Python Let\u0026rsquo;s write a function that multiplies two matrices in python.\ndef matmul_naive(m, n): im = len(m[0]) result = [[0]*(len(n[0])) for _ in range(len(m))] for i in range(len(m)): for j in range(len(n[0])): result[i][j] = 0 for k in range(im): result[i][j] += m[i][k] * n[k][j] return result This is the naive way to multiply two matrices. You take the first row of the first matrix and the first column of the second matrix, multiply them element by element using an additional index k, and accumulate the sum of the products in result[i][j]. Then you repeat that for every row i and column j.\nC++ implementation The same function but in c++ (for simplicity, we\u0026rsquo;re assuming the matrices will be square matrices of size n).\n#include \u0026lt;bits/stdc++.h\u0026gt; extern \u0026#34;C\u0026#34; { void matmul_naive_c(float* A, float* B, float* C, int n) { for (int i = 0; i \u0026lt; n; i++) { for (int j = 0; j \u0026lt; n; j++) { float sum = 0.0f; for (int k = 0; k \u0026lt; n; k++) { sum += A[i*n + k] * B[k*n + j]; } C[i*n + j] = sum; } } } } You will notice I have exported or \u0026ldquo;externed\u0026rdquo; the code to C. I will tell you why I did that later in the essay. Here are the benchmark results for different n values. We can already see the difference in timings. Same loop, same O(n^3), different results. For lower n values, there is not much difference, it\u0026rsquo;s just noise. But as the values increase we can see the timings start to get concerning for the naive python implementation.\nFor reference, an n x n multiplication takes approximately n³ multiply-add operations. Meaning, for 500x500 matmul, we\u0026rsquo;re doing ~125 million operations. 
When we increase that to 1000x1000 matmul, that\u0026rsquo;s about a billion operations!\nThe speedup compared to base python naive implementation: So, c++ naive is around 134 times more efficient than python naive.\nNumpy There\u0026rsquo;s good news for python devs though. Most of the libraries use BLAS, and are HIGHLY optimized! For reference, here are the numbers of matmul of cpp naive, python naive (without numpy) and numpy.\nIf you just focus on the Numpy speedup, you can see the numpy speed up is ~5400 times faster than python\u0026rsquo;s! And even faster than C++\u0026rsquo;s implementation! You might wonder why. That is where we dig into SIMD.\nSIMD You should think about what happens when we multiply two matrices. What happens is this; you take two numbers, you multiply them and add a lot of these values in a resultant that is the output. Now the question is, the multiplication, are the values that are getting multiplied dependent on any other values? They are not. It is an independent operation. Two numbers, out in the memory somewhere, are multiplied and then added. And we do it sequentially in CPU. We can leverage the fact that they are independent and process them at once.\nInstead of saying \u0026ldquo;multiply these two numbers\u0026rdquo; twice, you say it once but it operates on multiple pairs at the same time.\nLet me show you with a tiny example:\nScalar way (normal CPU):\ninstruction 1: multiply 1×5 → 5 instruction 2: multiply 2×7 → 14 Two separate instructions.\nSIMD way:\ninstruction 1: multiply [1,2] × [5,7] → [5,14] One instruction, two multiplications happen in parallel!\nModern CPUs have SIMD registers that can hold multiple numbers. For example:\nA 128-bit SIMD register can hold 4 floats (32 bits each) A 256-bit register can hold 8 floats A 512-bit register can hold 16 floats The Beast: GPU implementation GPUs are meant for parallel processing. They thrive in parallel processing. 
Initially meant for optimizing the gaming performance, now they\u0026rsquo;re also used to power the modern AI training. The difference lies here:\nmy CPU cores: 16 ; 8 performance + 8 efficient + 24 threads my GPU cores: 2560 CUDA cores + 80 Tensor Cores Do you see the difference? With thousands of more cores than CPUs, GPUs do parallel compute very efficiently. And let\u0026rsquo;s finally benchmark every matmul implementation to the full extent! Here are the results: If that doesn\u0026rsquo;t excite you, I don\u0026rsquo;t know what else will. Look at the numbers for 1000x1000 matrix multiplication. Naively doing matmul will take 107 seconds appx in python. In cpp, its a little better and in numpy and gpus? It\u0026rsquo;s absolutely negligible.\nLet\u0026rsquo;s look at some data analytics that I asked claude ai to give me:\nSmall matrices (5×5 to 50×50): Python/C++/NumPy are all similar (~0.03-0.06ms) GPU is slower (overhead dominates) Lesson: GPU launch overhead kills small operations Medium matrices (100×100): NumPy suddenly becomes 14x faster than naive C++ (0.031 vs 0.439ms) This is where BLAS kicks in! GPU still slower (overhead still matters) Large matrices (500×500, 1000×1000): NumPy: 17-100x faster than naive C++ GPU: 190x faster than naive C++ at 1000×1000 Python pure: completely falls apart (107 seconds!) For reference and for why the timings matter: if walking to England from Pune was a neural network training, then this operation is like taking a step!\nImagine taking this much time for one step!\nThat is the power of GPUs, which make today\u0026rsquo;s AI learning and training easier. Imagine training GPTs in numpy or heck, in bare metal python.\nNotes on results 1. Why cpp is slow? - It's majorly a *memory access problem*. - CPU fetches data in cache lines. - But what we need is row[0]col[j], row[1]col[j].. - the distance between each element is the size of the column. - this creates inefficiency because cache lines load data in sequential form. 
- So, transposing the data and then multiplying will result in a 3-5x boost in performance because accessing data will be quicker. 2. Compiled vs Interpreted language: C++ (compiled): - Recipe is translated to muscle memory - No thinking Python (interpreted): - Figure out what the code means - Then do the action - Repeat for next line What Python does for `c = a + b`: 1. Look up variable 'a' in dictionary 2. Check what type 'a' is (int? float? object?) 3. Look up variable 'b' in dictionary 4. Check what type 'b' is 5. Find the appropriate '+' function for these types 6. Call that function 7. Create new Python object for result 8. Store it in variable 'c' What C++ does for `c = a + b`: 1. ADD register1, register2 -\u0026gt; register3 One CPU instruction! 3. Numpy Numpy doesn\u0026rsquo;t do the matmul itself. It calls BLAS.\nTrick 1 -\u0026gt; Break the matrices in chunks that fit in L1/L2 cache. Since the matrices are too big to fit the cache for efficient processing, it breaks the data down, say in 64x64 blocks.\nTrick 2 -\u0026gt; SIMD\nTrick 3: Multiple optimizations combined\nBLAS libraries also use:\nLoop unrolling (reduce branch overhead) Prefetching (tell CPU to load data before you need it) Register blocking (keep frequently used values in CPU registers) Multiple levels of cache blocking (L1, L2, L3 optimized separately) SIMD CODE FOR MATMUL #include \u0026lt;immintrin.h\u0026gt; void matmul_simd(float* A, float* B, float* C, int n) { for (int i = 0; i \u0026lt; n; i++) { for (int j = 0; j \u0026lt; n; j += 8) { __m256 sum = _mm256_setzero_ps(); for (int k = 0; k \u0026lt; n; k++) { __m256 a = _mm256_set1_ps(A[i*n + k]); __m256 b = _mm256_loadu_ps(\u0026amp;B[k*n + j]); sum = _mm256_fmadd_ps(a, b, sum); } _mm256_storeu_ps(\u0026amp;C[i*n + j], sum); } } } //This processes 8 floats at once using AVX2 instructions That\u0026rsquo;s it for this one. 
Thanks for reading!\n~Aayushya Tiwari\nReferences Wiki-SIMD Wiki-BLAS Optimizing matmul in CUDA blog on matmul by Salykova\nwhy is c++ code externed to C? it is because while writing the benchmarking script, I wanted the cpp function to be used as is. For that I used the concept of a shared lib, a .so file in linux and used CPython API.\n","permalink":"https://tiwariji.net/posts/simd/","summary":"\u003cp\u003eToday we look at matrix multiplication (matmul, as we will call in this essay).  \u003cbr\u003e\nSince, the last essay was on \u003ca href=\"https://tiwariji.net/posts/backprop/\"\u003ebackprop\u003c/a\u003e, it was only logical to think about the most fundamental math operation that lets us do the algo.     \u003cbr\u003e\nThat is, matmul.\u003cbr\u003e\nAlso, the numbers in this essay are going to shock you.  \u003cbr\u003e\nLike really. So if you think I am making this up, you should checkout \u003ca href=\"https://github.com/aayushyatiwari/blogCode/tree/master/simd\"\u003emy code\u003c/a\u003e for this essay.\u003c/p\u003e","title":"Speeding up matrix math with SIMD intrinsics"},{"content":"I\u0026rsquo;m assuming you understand the basic idea of neural networks. This essay focuses purely on the backpropagation algorithm itself.\nWhat is Backpropagation? Backpropagation is an algorithm that computes how much each weight and bias should change to reduce the loss.\nIt tells us not just whether parameters should go up or down, but by how much, based on their actual impact on the loss function. We use math to figure out that.\nThe Key Players Before we dive in, let\u0026rsquo;s identify what we\u0026rsquo;re working with:\nInput data - what we feed into the network Parameters - weights (w) and biases (b) that we need to adjust Neurons - the computational units Loss - measures how wrong our predictions are Target - what we\u0026rsquo;re trying to predict What is a Neuron? 
A neuron performs two simple computations:\nz = w · a + b\na_out = σ(z)\nWhere:\na is the input activation (from the previous layer) w is the weight b is the bias z is the weighted sum (pre-activation) -\u0026gt; z = sum((wi) x (xi)) + b σ is the activation function (sigmoid, ReLU, etc.) a_out is the output activation Think of it like this: the neuron takes inputs, weighs them, adds a bias, then applies a non-linear function. That\u0026rsquo;s it.\nThe Forward Pass Training involves:\nForward pass - feed input through the network to get a prediction Compute loss - measure how wrong the prediction is Backward pass - figure out how to adjust weights to reduce loss Update weights - make the adjustments Repeat The forward pass is straightforward. The backward pass is where backpropagation comes in.\nThe Goal: Minimize Loss We want to reduce the loss. The tool we use is gradient descent. Why gradient descent? Gradients tell us about the movement of some function(in our case the loss function) wrt some variable (in our case, the parameters). Gradient descent is the idea that we try to move in the opposite direction to the gradient vector to minimize the cost function. Why opposite direction? Because gradient vector points to the direction of steepest increase.\nGradient Descent Intuition Imagine you\u0026rsquo;re blindfolded on a hilly terrain trying to reach the lowest valley.\nWhat would you do?\nFeel the slope under your feet Determine which direction goes downhill most steeply Take a small step in that direction Repeat In mathematical terms:\nThe terrain = Loss function (how wrong your model is) Your position = Current parameter values Feeling the slope = Computing the gradient ∂L/∂w Taking a step = Updating: w ← w - η · ∂L/∂w The gradient ∂L/∂w tells us which direction is \u0026ldquo;downhill\u0026rdquo; for the loss.\nThe Problem: We Can\u0026rsquo;t Compute ∂L/∂w Directly Here\u0026rsquo;s the issue: the loss doesn\u0026rsquo;t directly depend on w. 
What do I mean? Suppose there are 2 million layers.\nTake the 101st layer\u0026rsquo;s third neuron. Think about that neuron. We want to figure out the Loss\u0026rsquo;s derivative wrt THAT neuron\u0026rsquo;s weight. The two are just too far away from each other.\nThe dependency chain looks like this:\nw → z → a → (more layers) → prediction → loss\nThe loss is computed way at the end, but w is buried deep in the network. They\u0026rsquo;re connected through many intermediate computations.\nThis is why we need the chain rule.\nThe Chain Rule Solution To compute ∂L/∂w, we break it into pieces:\n∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w\nLet\u0026rsquo;s understand each term:\n∂z/∂w - \u0026ldquo;How does z change when w changes?\u0026rdquo; Looking at z = w·a + b, we get: ∂z/∂w = a This is local and easy to compute.\n∂a/∂z - \u0026ldquo;How does activation change when z changes?\u0026rdquo;\nThis depends on the activation function:\nSigmoid: σ\u0026rsquo;(z) = σ(z)(1 - σ(z)) ReLU: σ\u0026rsquo;(z) = 1 if z \u0026gt; 0, else 0 This is also local and easy.\n∂L/∂a - \u0026ldquo;How does loss change when this activation changes?\u0026rdquo; This is the tricky one. And this is where the interaction between layers happens! For hidden layers, we don\u0026rsquo;t know this directly. 
We have to get it from the layer ahead.\nThe Backpropagation Algorithm Backpropagation works backwards through the network:\nStep 1: Output Layer (Easy Case) At the output layer, ∂L/∂a can be computed directly from the loss function.\nFor example, if Loss = (prediction - target)², then:\n∂L/∂a^L = 2(a^L - target)\nStep 2: Compute Local Gradients For each neuron at this layer:\nCompute ∂a/∂z = σ\u0026rsquo;(z) Compute ∂z/∂w = a (the input to this neuron) Step 3: Multiply Using Chain Rule ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w\nNow we have the gradient for this weight!\nStep 4: Pass Gradients Backward For the previous layer to compute its gradients, it needs ∂L/∂a.\nWe compute it using:\n∂L/∂a^prev = w · ∂L/∂z\nWhere ∂L/∂z = ∂L/∂a · ∂a/∂z (combining the first two terms).\nStep 5: Repeat Move to the previous layer and repeat steps 2-4, using the ∂L/∂a we just computed.\nContinue until you\u0026rsquo;ve computed gradients for all weights in all layers.\nThe Complete Picture Forward pass: Input → Layer 1 → Layer 2 → \u0026hellip; → Output → Loss\nBackward pass: Loss → ∂L/∂w_output → ∂L/∂w_layer2 → \u0026hellip; → ∂L/∂w_input\nEach layer:\nReceives ∂L/∂a from the next layer Computes its own ∂L/∂w using the chain rule Passes ∂L/∂a_prev to the previous layer Update the Weights Once we have ∂L/∂w for every weight:\nw ← w - η · ∂L/∂w\nb ← b - η · ∂L/∂b\nThis nudges each parameter in the direction that reduces loss.\nImplementation Note: PyTorch While trying to implement a Tensor class myself, I got to know that each tensor holds the dL/d(that tensor w) in the .grad attribute.\nx = torch.tensor(2.0, requires_grad=True) y = x * 3 y.backward() print(x.grad) one shot Backpropagation is elegant:\nDo a forward pass and compute loss Start at the output where ∂L/∂a is known Use the chain rule to compute ∂L/∂w locally Pass ∂L/∂a backwards to the previous layer Repeat until all gradients are computed Update all weights using gradient descent The \u0026ldquo;back\u0026rdquo; in backpropagation 
refers to this backward flow of gradients through the network, from output to input. This algorithm is also \u0026lsquo;greedy\u0026rsquo;.\nNeural Net We now know that a neural network is simply a set of parameters optimized to minimize a loss function. A good exercise to internalize this idea is the following.\nYou are given the XOR problem:\n// XOR problem. // X -\u0026gt; inputs, y -\u0026gt; true values float X[4][2] = {{0,0}, {0,1}, {1,0}, {1,1}}; float y[4][1] = {{0}, {1}, {1}, {0}}; // Random weights and biases. float w1[2][2]; float b1[2][1]; float w2[2][1]; float b2[1][1]; Writing backpropagation for this problem is what helped me develop an intuitive understanding of how backpropagation works. The task is simple: after training, the loss function (which can be any reasonable choice) should be minimized, meaning the forward pass produces accurate outputs for the XOR problem.\nFor the sake of simplicity, here is a simple solution in python\nimport numpy as np # Activation function def sigmoid(z): return 1 / (1 + np.exp(-z)) def sigmoid_prime(z): return sigmoid(z) * (1 - sigmoid(z)) def loss(yt, y): return np.mean((yt - y) ** 2) X = np.array([[0,0],[0,1],[1,0],[1,1]]) y = np.array([[0],[1],[1],[0]]) # XOR W1 = np.random.randn(2,2) b1 = np.zeros((1,2)) W2 = np.random.randn(2,1) b2 = np.zeros((1,1)) lr = 0.1 print(\u0026#34;Initial weights and biases:\u0026#34;) print(f\u0026#34;W1:\\n{W1}\\nb1:\\n{b1}\\nW2:\\n{W2}\\nb2:\\n{b2}\u0026#34;) print(\u0026#34;*\u0026#34; * 60) for epoch in range(5000): # Feedforward z1 = X @ W1 + b1 a1 = sigmoid(z1) z2 = a1 @ W2 + b2 y_hat = sigmoid(z2) # Calculate loss epoch_loss = loss(y_hat, y) # Backprop dz2 = y_hat - y dW2 = a1.T @ dz2 db2 = np.mean(dz2, axis=0, keepdims=True) dz1 = (dz2 @ W2.T) * sigmoid_prime(z1) dW1 = X.T @ dz1 db1 = np.mean(dz1, axis=0, keepdims=True) # Update weights W1 -= lr * dW1 b1 -= lr * db1 W2 -= lr * dW2 b2 -= lr * db2 if epoch % 1000 == 0: print(f\u0026#34;Epoch {epoch}: Loss = {epoch_loss:.6f}\u0026#34;) print(\u0026#34;\\n\u0026#34; + \u0026#34;=\u0026#34; * 60) print(\u0026#34;Final predictions:\u0026#34;) print(\u0026#34;y_hat\u0026#34;, y_hat) for i in range(len(X)): print(f\u0026#34;Input: {X[i]} -\u0026gt; Predicted: {y_hat[i][0]:.4f}, Actual: {y[i][0]}\u0026#34;) 
Thanks for reading\n~ Aayushya Tiwari\nREFERENCES Watch the Andrej Karpathy Micrograd Video for super intuition. Or 3b1b\u0026rsquo;s video. They are the best. NN in NUMPY book pdf: my first reference to backprop Original paper\n","permalink":"https://tiwariji.net/posts/backprop/","summary":"\u003cp\u003eI\u0026rsquo;m assuming you understand the basic idea of neural networks. This essay focuses purely on the backpropagation algorithm itself.\u003c/p\u003e\n\u003ch2 id=\"what-is-backpropagation\"\u003eWhat is Backpropagation?\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eBackpropagation is an algorithm that computes how much each weight and bias should change to reduce the loss.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIt tells us not just whether parameters should go up or down, but by how much, based on their actual impact on the loss function.\nWe use math to figure out that.\u003c/p\u003e","title":"Backpropagation from scratch"},{"content":"what is this about? Today we\u0026rsquo;re looking at hashing. We\u0026rsquo;ll get into the process, the data structures and some applications.\nThrough this article, we\u0026rsquo;ll see how we can implement dictionary operations, which are mainly:\ninsert delete search what is Hashing? Hashing is a technique of identifying an object out of a group of similar objects.\nAnalogy for hashing: Imagine if you took your name tiwariji and ran it through a complex mathematical function that produced 7a3f9c2e. You couldn\u0026rsquo;t look at 7a3f9c2e and figure out it came from tiwariji, but every time you hash tiwariji you\u0026rsquo;d get the same result. That is hashing. Ideally, no two different inputs should produce the same hash output. 
We\u0026rsquo;ll talk about that later when we discuss hash functions.\nThere are different ways you can create your hashing function. This also means there are different ways to hash. We\u0026rsquo;ll first look at DAT (direct addressing tables) and see how, without hashing, we can use a normal array to store values and fetch information in worst-case O(1).\nDirect Addressing Table Direct addressing is a simple technique. We assume a universe of numbers, and we store data in a direct addressing table. We represent DAT as T[0, \u0026hellip;, m-1] in the universe of m elements. Each slot in T represents a value from the universe of elements. In the image, the table T stores the values corresponding to the elements in the universe. In the rest of the article, I\u0026rsquo;ll avoid using highly mathematical notations and just say things like \u0026ldquo;out of the set of keys\u0026rdquo; instead of \u0026ldquo;in the universe of elements\u0026rdquo; for the sake of simplicity.\nThe problem with direct addressing is obvious.\nImagine you need to store data about students using their student ID numbers. If student IDs are 9-digit numbers (like 100000000 to 999999999), direct addressing would require an array with 900 million slots, even if you only have 500 students! You\u0026rsquo;d be wasting enormous amounts of memory, with 99.9999% of the array sitting empty.\nBelow is the code for a simple frequency hash map that maps characters to their corresponding indexes based on their relative order in the alphabet. Ex: a - 0, b - 1, c - 2... class SimpleFreqHashTable: \u0026#39;\u0026#39;\u0026#39; Docstring for SimpleFreqHashTable simple hashtable to store frequency of lower case alphabets in an array of size 26. 
\u0026#39;\u0026#39;\u0026#39; def __init__(self, size = 26): self.size = size self.hashTable = [0] * size def hash(self, key): \u0026#39;\u0026#39;\u0026#39; Docstring for hash index using a - 0, b - 1 \u0026#39;\u0026#39;\u0026#39; return ord(key) - ord(\u0026#39;a\u0026#39;) def insert(self, key): index = self.hash(key) if 0 \u0026lt;= index \u0026lt; self.size: self.hashTable[index]+=1 def search(self, key): ind = self.hash(key) if 0 \u0026lt;= ind \u0026lt; self.size: return self.hashTable[ind] return 0 Question. If you have arbitrarily big numbers, how do you create such a big table?\nHash Tables Hash tables fix this issue. They work because each key k is now stored in slot h(k), where h is a hash function.\nThat means you can create a good hash function and you can effectively fit large amounts of data in a smaller table. The hash function reduces the range of array indices and hence the size of the array needed. Instead of a size of |U|, the table can have size m. There is one issue though: collisions.\nCollision When two different keys hash to the same output, we call that situation a collision. In an ideal world, there would be no collisions. But since we live in the real world, we\u0026rsquo;ll see ways to handle this problem. 
One of the best ways to do that is a process called chaining.\nChaining In chaining, we use a data structure called a linked list and store all the values that hash to the same index in a chain.\nExample:\nsay two keys k1 and k2 hash to the same index.\nThen we take that index, make a chain there, and store both k1 and k2 in it.\nBelow is an example from the classic CLRS book.\nBelow is the code for a little custom hash class with chaining using a singly linked list.\nclass Node: def __init__(self, val): self.val = val self.next = None class HashChaining: \u0026#39;\u0026#39;\u0026#39; assumption: hash function will return an integer from 0 to 19 \u0026#39;\u0026#39;\u0026#39; def __init__(self, size = 20): self.size = size self.table = [None] * size def hash(self, key): if isinstance(key, str): return sum(ord(ch) for ch in key) % self.size return key % self.size def insert(self, s): index = self.hash(s) if self.table[index] is None: head = Node(s) self.table[index] = head else: new = Node(s) new.next = self.table[index] self.table[index]= new def search(self, s): index = self.hash(s) if self.table[index] is None: return \u0026#34;not hashed\u0026#34; l = self.table[index] curr = l while curr: if curr.val == s: return \u0026#34;found the value\u0026#34; curr = curr.next return \u0026#34;THE VALUE ISNT PRESENT\u0026#34; Hash Functions Ideally, a hash function should:\nsatisfy the condition of uniform hashing. minimize collisions. be easy to implement and should not become an algorithm in itself. There are different methods to generate slot indices.\nDIVISION METHOD: h(k) = k mod m. MULTIPLICATION METHOD: this method works in two steps. \u0026gt; first we take the fractional part of kA where A is a constant with 0 \u0026lt; A \u0026lt; 1. \u0026gt; then we multiply that by m and then we take the floor of the result. Interpreting keys as natural numbers We want the keys to be integers. If they are not integers, we make an interpretation. Say there is pt. 
We can interpret this as the pair 112 116, because p is 112 and t is 116 in ASCII notation. Read as a radix-128 integer, that is 112 · 128 + 116 = 14452.\nProbing One way to handle collisions is probing.\nIDEA:\nFor a key k, we check the hashed index and see if it\u0026rsquo;s free. If it\u0026rsquo;s not, we check the next slot, and then the next\u0026hellip; $$ slot(k, i) = (h(k) + i) mod m $$ where i is in range [0, m) and m is the size of the table. Below is the code for a hash table with linear probing. class HashChainingWithProbing: \u0026#39;\u0026#39;\u0026#39; assumption: hash function will return an integer from 0 to 19 \u0026#39;\u0026#39;\u0026#39; def __init__(self, size = 20): self.size = size self.table = [None] * size def hash(self, key): if isinstance(key, str): return sum(ord(ch) for ch in key) % self.size return key % self.size def insert(self, s): index = self.hash(s) new = Node(s) for i in range(self.size): new_index = (index + i) % self.size if self.table[new_index] is None: self.table[new_index] = new break def search(self, s): index = self.hash(s) for i in range(self.size): new_index = (index + i) % self.size if self.table[new_index] is None: return \u0026#34;this index is not hashed.\u0026#34; if self.table[new_index].val == s: return \u0026#34;found it\u0026#34; return \u0026#34;couldn\u0026#39;t find it\u0026#34; Universal Hashing The idea is this: pick the hash function at random from a family of functions, so that no fixed set of keys collides badly for every choice.\nWe do this by first choosing random numbers a and b, with a in [1, p-1] and b in [0, p-1]. Each pair (a, b) gives a different hash function. The function is:\nh(k) = ((ak+b) mod p) mod m\nwhere p is a prime number larger than every key, so that each key k is in the range [0, p-1], and m is the number of slots in the table.\nApplications. Associative arrays: Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects). 
Database indexing: Hash tables may also be used as disk-based data structures and database indices (such as in dbm). Caches: Hash tables can be used to implement caches i.e. auxiliary data tables that are used to speed up the access to data, which is primarily stored in slower media. Object representation: Several dynamic languages, such as Perl, Python, JavaScript, and Ruby use hash tables to implement objects. With that, we finish today\u0026rsquo;s article.\nThanks for reading.\n~ Aayushya Tiwari\nReferences : CLRS textbook of algorithms. Also called Introduction to Algorithms\nHackerearth\nUniform hashing It is the assumption that each key in the universe of elements hashes to any slot in table T with equal probability, independent of where other keys hash.\n","permalink":"https://tiwariji.net/posts/hashing/","summary":"\u003ch1 id=\"what-is-this-about\"\u003ewhat is this about?\u003c/h1\u003e\n\u003cblockquote\u003e\n\u003cp\u003eToday we\u0026rsquo;re looking at hashing. We\u0026rsquo;ll get into the process, the data structures and some applications.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eThrough this article, we\u0026rsquo;ll see how we can implement \u003cstrong\u003edictionary\u003c/strong\u003e operations, which are mainly:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003einsert\u003c/li\u003e\n\u003cli\u003edelete\u003c/li\u003e\n\u003cli\u003esearch\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch1 id=\"what-is-hashing\"\u003ewhat is Hashing?\u003c/h1\u003e\n\u003cp\u003eHashing is a \u003cstrong\u003etechnique\u003c/strong\u003e of identifying an object out of a group of similar objects.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAnalogy for hashing\u003c/strong\u003e: \u003cbr\u003e\nImagine if you took your name \u003cem\u003etiwariji\u003c/em\u003e and ran it through a complex mathematical function that produced \u003cstrong\u003e7a3f9c2e\u003c/strong\u003e. 
You couldn\u0026rsquo;t look at \u003cstrong\u003e7a3f9c2e\u003c/strong\u003e and figure out it came from \u003cem\u003etiwariji\u003c/em\u003e, but every time you hash \u003cem\u003etiwariji\u003c/em\u003e you\u0026rsquo;d get the same result.  \u003cbr\u003e\n\u003cbr\u003e\nThat, is hashing.  \u003cbr\u003e\n\u003cbr\u003e\nIdeally, no two different inputs should produce the same hash output. We\u0026rsquo;ll talk about that later when we discuss \u003ca href=\"#hash-functions\"\u003ehash functions\u003c/a\u003e.\u003c/p\u003e","title":"Implementing collision-resistant hash maps from scratch"},{"content":"Data management during compiler process When compiling a piece of code, the data in the code is stored in segments.\nThere are five types of segments: stack heap data code BSS We will focus on stack and heap in this essay.\nStack Stack allocation is the process of allocating memory for local variables and function calls in call stack. Each function gets some memory in the stack to store variables in it. Since the memory is handled by the system, its faster. But the memory is less as compared to heaps. The size required is already known before execution. The compiler allocates some memory.\nSo, what does the stack store? for every function, it stores the local variables, return addresses.\nThe programmer does not have to worry about allocation or deallocation of the memory, in the case of stack memory.\nAfter the function call is done, all the memory is flushed out. Also, the stack memory is allocated in contigous manner. int main(){ // all these go on the stack int x = 10; int b[10]; int n = x; } Heap Heaps are used to dynamically allocate memory.\nWhenever you think about heaps, I\u0026rsquo;d like you to think about vectors in c++. How do you think they can shrink and expand?\nUnlike in stack, a programmer has to make sure they delete heap memory after the function is done executing. There is no automatic deletion! 
Heaps can be a little tricky to use because if you don\u0026rsquo;t deallocate the memory, you might run into issues like memory leaks. Heaps are also less secure than stacks because all the threads can get access to the memory of heaps, which can lead to data leaks and can be abused by people. In C you use malloc to dynamically allocate memory on the heap and free() to release it; in C++ you use new and delete. In Java and Python, heap memory is cleaned up by a garbage collector.\nAll the new objects created are in the heap memory.\nThe name heap has no relation to the heap data structure; it simply refers to a large pool of memory available for dynamic allocation.\nint main(){ // this goes on heap int *x = new int[10]; // release it when done delete[] x; } Case study example. class Employee: def __init__(self, id, name): self.id = id self.name = name def call_emp(id, emp_name): return Employee(id, emp_name) if __name__ == \u0026#34;__main__\u0026#34;: id, empName = 21, \u0026#34;tiwariji\u0026#34; person = call_emp(id, empName) At run-time, all objects (instances of classes) are stored on the heap. The main block gets a frame on the stack. The Employee returned by call_emp is held through a reference variable (person), which points to the object in heap memory. When call_emp() is called from main, a new stack frame is created on top of the previous stack frame.\nAnd that is how we work our way with stacks and heaps! Interesting note: In an academic paper, researchers dug into 797 open source C/C++ binaries and tried to figure out the Qh (the amount of objects kept on the heap). They wanted to see if people were still using heaps in their code.\nWhat they found was people were still using heaps quite extensively! 
Heaps are VERY important because of their ability to resize and also the fact that if you want to create an object that outlives the function its created in, you\u0026rsquo;d want a heap.\nAlthough,\nI am quoting here from the paper something a little hilarious if you ask me.\n\u0026ldquo;There is yet another reason, though, why programmers may allocate on the heap: they are not aware of the costs. In our study, we didn’t analyze whether every allocation on the heap is motivated, but we suspect that a decent amount of objects may be placed on the heap by mistake.\u0026rdquo;\nWell. It would be nice to understand when you're keeping your objects on the heap right? ~ Aayushya Tiwari\nReferences: GeeksForGeeks Medium Simplilearn ResearchPaper\n","permalink":"https://tiwariji.net/posts/stacks_heaps/sh/","summary":"\u003ch1 id=\"data-management-during-compiler-process\"\u003eData management during compiler process\u003c/h1\u003e\n\u003cp\u003eWhen compiling a piece of code, the data in the code is stored in \u003cem\u003esegments\u003c/em\u003e.\u003c/p\u003e\n\u003ch2 id=\"there-are-five-types-of-segments\"\u003eThere are five types of segments:\u003c/h2\u003e\n\u003col\u003e\n\u003cli\u003e\u003cem\u003estack\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eheap\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cem\u003edata\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cem\u003ecode\u003c/em\u003e\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eBSS\u003c/em\u003e\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eWe will focus on stack and heap in this essay.\u003c/p\u003e\n\u003ch2 id=\"stack\"\u003eStack\u003c/h2\u003e\n\u003cblockquote\u003e\n\u003cp\u003eStack allocation is the process of allocating memory for local variables and function calls in \u003cstrong\u003ecall stack\u003c/strong\u003e. \u003cbr\u003e\nEach function gets some memory in the stack to store variables in it. \u003cbr\u003e\nSince the memory is handled by the system, its faster. 
But the memory is less as compared to heaps. \u003cbr\u003e\nThe size required is already known before execution. The compiler allocates some memory.\u003c/p\u003e","title":"Memory layout: Stacks, heaps, and pointers"},{"content":"Overview When you write int x = 0; in your code, where and how is that x stored? This is the question that I wanted to know the answer to. And in this essay, we’ll look at just that.\nWhy should you care? As we will understand later in the article, there are issues and subtle bugs that come when programmers don’t understand their code. Hackers want to figure out ways to exploit those bugs and gain entry into people’s systems. So knowing how, why, and where of your data, in your system is not only useful, I’d say it’s necessary.\nbits, bytes and objects The smallest unit of storage is a bit. A single bit, as you can guess, is not that useful in isolation. A byte is a much better unit to start looking at interesting patterns bits can produce. A byte is a (typically) 8-bit storage unit. A byte’s range in different notations:\n00000000 to 11111111 in binary 0 to 255 in decimal 00 to FF in hex. Computer memory is just a very big array of bytes. Put simply, every object that you create in your code is stored somewhere on this (VERY BIG) array and has its own address which you can see using pointers.\nwhat is an object? An object is a region of memory that can hold a value, for example: 12. Something to know about objects is this: a. an object’s memory is a contiguous block b. objects never overlap.\nExample: say you have int x which is a 4 byte integer at address 0x100. Then you can safely say that 0x100, 0x101, 0x102 and 0x103 are storing that integer.\nTill this point, the knowledge was programming language independent. 
From here on, since I am more interested in C and C++ and they’re closer to machine level code (see my previous article on how code goes from source code to machine instructions), we’ll look at how objects are created and stored in C and C++.\nExploiting memory\nIn C and C++, by looking at memory, you can’t always tell what sort of an object is stored there. And this lets tricksters play games by exploiting memory.\nan object may have aliases, for example: int x = 0; int* pt = \u0026amp;x; here x and *pt refer to the same object in memory.\nlifetime of an object The lifetime of an object is the span of time between when the object is created and when it is destroyed in the program. Lifetimes are important for us to understand because they’re used to define undefined behaviour in programming languages.\n#include\u0026#34;print_bytes.hh\u0026#34; char ch1 = \u0026#39;A\u0026#39;; const char ch2 = \u0026#39;B\u0026#39;; void f() { char ch3 = \u0026#39;C\u0026#39;; char* ch4 = new char{\u0026#39;D\u0026#39;}; print_object(ch1); print_object(ch2); print_object(ch3); print_object(*ch4); } There are three types of lifetimes:\nSTATIC := the object lasts as long as the program runs. (global variables, ie ch1, ch2) DYNAMIC := the programmer allocates and deallocates the memory manually (ex: *ch4) AUTOMATIC := the compiler allocates and deallocates the memory automatically, based on the scope of the object. (ex: ch3) Segments Here’s what happens:\nOS - gives a program memory space\nCompiler - allocates memory to objects in the program\nA program’s address space is divided into segments. Which object will go in which segment is decided on the basis of the lifetime of the object.\nMain segments are:\nCode (also called text): static lifetime. Holds instructions and global constants. Unmodifiable. Data: contains non-constant global variables, can be modified, static lifetime. Stack: automatic lifetime, modifiable. Heap: dynamic lifetime, modifiable. 
💡 Memory issues\nMost memory issues in programming come from breaking a rule called the “live object rule”. Basically, you can only access an object that’s in its lifetime right now.\ntypes of data representation Fundamental types of data as described in the book are as follows:\nunsigned int These are just non-negative integers. Ex: 11.\nThe good question is, how do you store multi-byte variables in memory?\nQuestion: since one byte’s max number is 255, how do you store 258?\nAnd it’s the job of the compiler, following the byte order of the target machine, to decide exactly that.\n💡 There are two order notations that we have to know about before we learn about multi-byte storage:\na. big endian order: the most significant component byte is stored first, down to the least significant. It is also called ‘network’ byte order because most network protocols (such as TCP/IP) use this order.\nb. little endian order: least significant component byte to most significant component byte. Used by most modern processors, including x86, to store integers in memory.\nThis is the process of allocating memory to a multi-byte integer.\nwrite the integer in hex format including all the leading zeroes for satisfying the type size. For example, the unsigned value 65534 would be written 0x0000FFFE.\nbreak the hex into component bytes. In our example, they are, from most to least significant, 0x00, 0x00, 0xFF, and 0xFE.\nUsing the little-endian representation:\n0x30: 0xFE 0x31: 0xFF 0x32: 0x00 0x33: 0x00 💡 WORD In computers, the natural unit of data a processor works with is a Word. A Word of w bits can hold values from 0 to 2^w - 1, where w is the word size, which also bounds the length of a virtual address the processor can store. Most modern computers use 64-bit words.\nsigned int Computers use two’s complement to represent signed integers.\nQuestion to think about: how do you store -132 in binary?\nWhat happens is this:\nwe look at the MSB (most significant bit) of the binary number. if it is a zero, no problem. It’s a non-negative integer. if it is a one, then that is a negative integer. 
to read that negative number, flip all the bits, add 1 and put a minus sign in front. That is your negative integer.\nNow, why does this work? This works because signed and unsigned representations yield the same result in addition, subtraction and in multiplication. (For proof, look at the reference article down below!)\narrays memory of an array is stored sequentially as discussed earlier. Assume a declaration like T x[N], where x is an array of N objects of type T, and say that the address of x is a. Then the address of element x[i] equals a + i * sizeof(T), and sizeof(x) == N * sizeof(T).\nsidebar: vectors Vectors are arrays that can shrink and grow. Their elements have dynamic lifetime at runtime.\nOn a typical 64-bit implementation, sizeof(v) = 24. Here’s how the memory of a vector with a size of 24 bytes is laid out:\nthe first 8 bytes store begin address of the vector. Address of the first element of the vector == begin address. [Same as v.data()] the next 8 bytes store end address of the vector. Address one past the last element of the vector. [same as \u0026amp;v.data()[v.size()]] the next 8 bytes store the capacity address, the end of the allocated storage. It changes as the vector grows and shrinks. alignment Data alignment is positioning data in memory at addresses that are multiples of some number (ex: 2, 4, 8 bytes). This helps with optimizations. Say if there was no data alignment, your integer could straddle two memory words or cache lines, and the processor may need extra accesses to read it, which we know is not that efficient! That’s why the compiler takes the responsibility to insert padding.\nstruct val{ int x; // offset 0 char l; // offset 4, padded 3 bytes. } val; Uninitialized objects? Uninitialized objects are handled based on their lifetimes:\nif it’s static, then its value is made 0. (ex: int global) if it’s dynamic or automatic, then its value is not initialized and accessing it before init will lead to undefined behaviour. Thank you for reading. If you did, that is. 
🙂\nReferences CS61 Notes :CS61 notes on data representation (ABSOLUTELY WORTH A READ !!) :Computer Systems: A Programmer’s Perspective (Book) [maybe a bit too much if you don’t have the patience]\n","permalink":"https://tiwariji.net/posts/datarepresentationw2/","summary":"\u003ch1 id=\"overview\"\u003eOverview\u003c/h1\u003e\n\u003cp\u003eWhen you write \u003ccode\u003eint x = 0;\u003c/code\u003e in your code, where and how is that x stored? This is the question that I wanted to know the answer to. And in this essay, we’ll look at just that.\u003c/p\u003e\n\u003ch2 id=\"why-should-you-care\"\u003eWhy should you care?\u003c/h2\u003e\n\u003cp\u003eAs we will understand later in the article, there are issues and subtle bugs that come when programmers don’t understand their code. Hackers want to figure out ways to exploit those bugs and gain entry into people’s systems. So knowing how, why, and where of \u003cem\u003eyour\u003c/em\u003e data, in \u003cem\u003eyour\u003c/em\u003e system is not only useful, I’d say it’s necessary.\u003c/p\u003e","title":"How memory represents data at the bit level"},{"content":"do you know how your code\nint f(){ return 42; } turns into\nmov eax, 42 ret and then into\nB8 2A 00 00 00 C3\nthis?\nHere’s the whole essay in short:\nsource code → something happens, Intermediate code forms → something happens again, Machine code is formed.\nWe’ll clear up the ‘something’ in this article.\nLife would be very simple if we humans could write 1s and 0s and directly give the machine its preferred machine code. But since we don’t have 1000 hands per person and the outputs that we’re expecting out of computers have evolved to complexities unimaginable, we need another simpler way to talk to the machines. 
And that is why we have different programming languages and their compilation processes.\nLanguages are divided into compiled and interpreted on the basis of when the code is translated into machine instructions.\nCompiled languages are programming languages that are converted into machine code by the compiler ahead of time, producing an executable file that is then run.\nInterpreted languages are not converted into machine code ahead of time, rather the source code is directly executed line-by-line by the interpreter.\nCAVEAT: most of the languages today use a mixture of the two ideas. Ex. Initially, the JVM interprets the Java bytecode produced by the Java compiler, executing it instruction by instruction until it identifies frequently executed (“hot”) code paths, which are then compiled into native machine code by the JIT compiler for faster execution.\nDifferent representations of the code Source code This is the code that the programmer sees. It’s written in a programming language.\nEx: .py files for Python code and .c files for C code.\nit’s human-readable.\ncode has semantics (meaning) and a form.\nfeatures like comments, indentation and format.\nint add(int a, int b){ return a+b; } Details on Structure:\nDeclarations: Define types and variables (e.g., int a). Expressions: Computations like a + b. Statements: Control flow like return. Modules/Files: Organized into files with includes/imports for modularity. IR: Intermediate Representation It is the intermediate code that is formed.\nit is a little lower-level than source code. It is produced at the end of the compiler’s front-end.\nIt is platform independent. 
That enhances portability.\nFor n languages and m targets, you need n front-ends + m back-ends instead of n*m full compilers.\nIntermediate code has several levels before the code turns into machine code:\nhigh-level - ex: Syntax Tree\nclose to source code, can be used to trace back to the source code\nused for early optimizations\nexample:\nAbstract Syntax Tree (AST) - type of a high-level IR credits: GeeksForGeeks\nmid-level - ex: TAC Three address code\nEx. T1 = T2 op A\nhas maximum of three operands\nuses temporaries (eg: T1)\nThe typical form of a three address statement is expressed as x = y op z, where x, y, and z represent operands such as names, constants, or compiler-generated temporaries.\nex: x = (a + b * c) / (a - b * c)\nt1 = b * c t2 = a + t1 t3 = a - t1 x = t2 / t3\nDid you notice the reusable t1? Yes, optimization.\nlow-level - ex: Register Transfer Language or LLVM IR\nstack based - ex: Java bytecode or CPython bytecode\nit is also a mid-to-lower level IR. it replicates a stack. features: no named registers, everything flows through the stack, closer to machine code\nregister based\nhas memory accesses and registers\nunlimited temp (Ex. T1…) values\nmade for heavy optimizations\ndoes not contain irrelevant syntax\nexample\n→ code : C++\nint sum (int a, int b){ return a + b; } → IR : LLVM format\ndefine i32 @sum (i32 %a, i32 %b){ %1 = add i32 %a, %b ret i32 %1 } we can see that even though there was no variable defined in the actual code, in LLVM format, there is a %1 temp created and that is returned. the other thing that we can notice is how we add the two i32 integers. first we write add and then the two operands. Many compilers use multiple levels: high-level (tree-like) → mid (SSA) → low (register-based).\nOne question often asked is this. Why IR? Why not just take the source code and execute that directly?\nGlad you asked.\nIt’s because IRs help with optimizations. They form the last part of the front-end of the compilation process. 
You can make any sort of optimizations: constant folding, loop unrolling, inlining, etc.\nWhat we’ve seen so far is how your code goes from source code level to an IR (that is close to being a foreign language to us).\nEx: Java (via javac) → JVM Bytecode, C++ → via Clang compiler → LLVM IR\n(GCC has its own IRs for C++ code: GIMPLE and RTL) Let’s see now how that IR is translated into machine code.\nBut before that,\nLittle case study on JVM bytecode and LLVM IR\nwe will see how this function below\nint add (int x, int y){ return x + y; } turns into IR of Java and C++ (via Clang)\na lot of things happen when we move to the final step from IR to machine code!\nFrom IR to Machine code: pipeline (in rough phases) IR optimization:\nremove unnecessary functions/variables/ops do all the (2+3)s to 5s inline: expand functions convert to lower level, architecture aware form\nGCC GIMPLE (IR) → RTL LLVM (IR) → MachineIR Example: Stack-based IR (JVM bytecode) gets \u0026ldquo;de-stacked\u0026rdquo; into register form for JIT. now this is my favourite part:\nmapping of IR operations to CPU instructions happens here.\nand different architectures have different forms of the same instructions.\nHere’s what I mean:\nsay you have an add i32. this is basically add int. in x86 CPUs → ADD reg, reg in ARM CPUs → ADD rd, rn, rm here’s a small comparison table (highly recommended to check out!) IR Op x86 Instruction ARM Instruction add i32 ADD reg, reg ADD rd, rn, rm load i32 MOV reg, [mem] LDR rd, [rn] branch JMP / Jcc B / B.cond Register allocation: often uses the “Graph Coloring” algorithm\nregisters - fast, memory - slow. if there are more live variables than registers then spill one to stack, use it later. Scheduling and tweaks\nschedule instructions, meaning the instructions are reordered so that the CPU pipeline stays busy. small tweaks like ADD reg, 1 → INC reg Code Emission \u0026amp; Linking Output assembly or object file is formed (e.g., .o file), then linked into an executable.\nAssemble: Text asm → binary (opcodes + operands). 
Link: Resolve externals, add runtime (e.g., libc). Example: Final (unoptimized) x86-64 machine code for simple add: 55 48 89 E5 89 7D FC 89 75 F8 8B 55 FC 8B 45 F8 01 D0 5D C3 (prologue + add + epilogue). Machine Code it is the lowest level representation of code.\nit is architecture dependent. Put simply, to make a compiler, you’d need to know the type of CPU the code will be executed on.\neverything in machine code is “out there”. Meaning, the register information, jumps, memory addresses etc. No abstractions.\nit has\nopcode - ex: add, jump registers - ex: EAX on x86 architecture memory operands - addresses for load/store immediates - constants embedded in instructions. unlike IR with its “unlimited” virtual registers (ex %1, %2..), machine code only has a fixed set of physical registers.\nAnd there we have it.\nhere’s something for a review.\nPicture source: geeksforgeeks.org\nFor language specific details, check out the https://medium.com/javarevisited/code-compilation-from-source-to-machine-code-1375e49d00b6 article.\n~ Aayushya Tiwari\nReferences GFG on Intermediate Code Generation\nGFG on TAC\nBrilliant essay on how Java, C++ and Python compile code\narticle on how python code is compiled\nwiki on ASTs\nLLMs: grok.com, chatgpt.com\n","permalink":"https://tiwariji.net/posts/code-from-source-to-execution/code-from-source-to-execution/","summary":"\u003cp\u003edo you know how your code\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"\u003e\u003ccode class=\"language-c++\" data-lang=\"c++\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#66d9ef\"\u003eint\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ef\u003c/span\u003e(){\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003ereturn\u003c/span\u003e \u003cspan 
style=\"color:#ae81ff\"\u003e42\u003c/span\u003e;\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eturns into\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;\"\u003e\u003ccode class=\"language-llvm\" data-lang=\"llvm\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#960050;background-color:#1e0010\"\u003emov\u003c/span\u003e \u003cspan style=\"color:#960050;background-color:#1e0010\"\u003eea\u003c/span\u003e\u003cspan style=\"color:#66d9ef\"\u003ex\u003c/span\u003e, \u003cspan style=\"color:#ae81ff\"\u003e42\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#66d9ef\"\u003eret\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eand then into\u003c/p\u003e\n\u003cp\u003e\u003ccode\u003eB8 2A 00 00 00 C3\u003c/code\u003e\u003c/p\u003e\n\u003cp\u003ethis?\u003c/p\u003e\n\u003caside\u003e\n\u003cp\u003eHere’s the whole essay in short:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003esource code\u003c/strong\u003e → something happens, \u003cstrong\u003eIntermediate code\u003c/strong\u003e forms → something happens again, \u003cstrong\u003eMachine code\u003c/strong\u003e is formed.\u003c/p\u003e\n\u003cp\u003eWe’ll clear up the ‘something’ in this article.\u003c/p\u003e\n\u003c/aside\u003e\n\u003cp\u003eLife would be very simple if we humans could write 1s and 0s and directly give the machine its preferred machine code. But since we don’t have 1000 hands per person and the outputs that we’re expecting out of computers have evolved to complexities unimaginable, we need another simpler way to talk to the machines. 
And that is why we have different programming languages and their compilation processes.\u003c/p\u003e","title":"from code to execution"},{"content":"Hello. I am Aayushya Tiwari. I like to write, to understand and share. interested: cs, philosophy, reading, running, and music. programming software and my life.\nContacts:\nx linkedin github\n","permalink":"https://tiwariji.net/about/","summary":"\u003cp\u003eHello. I am Aayushya Tiwari. I like to write, to understand and share.   \u003cbr\u003e\ninterested: cs, philosophy, reading, running, and music. \u003cbr\u003e\nprogramming software and my life.\u003c/p\u003e\n\u003cimg src=\"/images/profile.jpeg\" alt=\"A picture of me\" width=\"300\"\u003e\n\u003cp\u003eContacts:\u003cbr\u003e\n\u003ca href=\"https://x.com/Kb24Aayush\"\u003ex\u003c/a\u003e \u003cbr\u003e\n\u003ca href=\"https://www.linkedin.com/in/aayushyatiwari11092004/\"\u003elinkedin\u003c/a\u003e \u003cbr\u003e\n\u003ca href=\"https://github.com/aayushyatiwari/\"\u003egithub\u003c/a\u003e\u003c/p\u003e","title":"About"}]