// systems programmer

Aakarsh Kashyap

GPU compute · compiler infrastructure · OS internals

B.Tech CSE @ GLA University · B.S. Data Science & AI @ IIT Madras.
I build things close to the metal — cuda kernels, bootloaders, runtimes.
Long-term goal: GPU compute or compiler work in Germany.

// about

Systems developer. My niche is GPU compute and compiler infrastructure - currently a noob in both, working to fix that.

I build things close to the metal: cuda kernels, bootloaders, runtimes, and the occasional kernel from scratch. sauceOS is my x86-64 hobby OS, currently implementing four-level paging and process infrastructure.

I maintain soft-cuda — a hybrid CPU/GPU AOT deep learning framework with a custom 3-pass compiler, bump allocator, and lazy DAG. The hybrid model achieves ~35% throughput improvement over GPU-only dispatch.

I use no LSP. I don't need syntax highlighting. I do this for fun.

languages

C/C++, Go, Python, x86-64 ASM, CUDA, SQL

systems

OS internals, memory allocators, POSIX APIs, IPC, syscall interface

GPU / HPC

CUDA kernel programming, CUDA streams, cuBLAS, cuDNN, SGEMM, double buffering

backend

REST, scatter-gather, write-through cache, JWT, Argon2id, RBAC, webhook verification

tooling

Git, Docker, CMake, Make, GDB, Neovim, Arch Linux

// projects

sauceOS

C, x86-64 ASM, Make, Limine — Jan 2026 – present

64-bit hobby operating system written from scratch, booting via the Limine bootloader. Implements the Global Descriptor Table, Interrupt Descriptor Table, ISR and IRQ dispatch routines, and Task State Segment in C and x86-64 assembly. Buddy memory allocator for kernel heap management in the higher-half virtual address space, handling block splitting, coalescing, and alignment constraints. Four-level paging with independent per-process virtual address space mapping. PS/2 keyboard driver with interrupt-driven input handling wired through the IDT. Minimal libc and syscall interface establishing the kernel-user boundary for future userspace programs.

[ source ]

soft-cuda

C++, CUDA, cuBLAS, cuDNN — Mar 2026 – present

Hybrid CPU/GPU tensor execution engine featuring a bump allocator memory pool, lazy evaluation DAG, and a three-pass AOT compiler with profile-guided backend assignment between CPU and GPU operators. Double-buffered SGEMM kernel overlapping memory transfers with compute using CUDA streams — benchmarked at 0.94 ms/step against GPU-only dispatch at 1.44 ms/step, a 35% throughput improvement, with cuBLAS as the reference backend. Full forward and backward pass with MSE, cross-entropy loss, and SGD. Trained XOR to convergence at loss ≈ 3.5 × 10⁻¹⁴.

[ source ]

sush

C++, POSIX API — Feb 2026 – Mar 2026

Unix shell from scratch in C++ supporting piping chains, command history, and variable expansion without relying on any system shell utility. Uses fork, exec, pipe, and dup2 directly for process creation, file descriptor wiring, and IPC. Custom lexer and token expander for command parsing, tokenization, and variable resolution.

[ source ]

Custodian

Go, Python, Redis, PostgreSQL, DistilBERT, Docker Compose — Dec 2025 – Jan 2026

High-concurrency Go orchestrator implementing a scatter-gather pattern on every incoming query, fanning out simultaneously to PostgreSQL for audit logging and a Python BERT service for inference, with results joined before response. Three-tier degradation chain: Redis cache under 1ms for known claims, local DistilBERT at 50ms for high-volume filtering, and async LLM escalation via a Redis job queue for ambiguous cases. Write-through cache consistency: background worker LLM verdicts atomically overwrite both the Redis cache and the PostgreSQL record. LLM is upto date with current situation by having a harness for internet access.

[ source ]

SlopGen

Go, OpenAI SDK, Ollama, JSON Schema — Feb 2026 – Mar 2026

ReAct-style agentic loop in Go: the LLM reasons over tool call results iteratively (read, write, execute) until the task is resolved, with full conversation history passed on every inference request. All tool interfaces defined via strict JSON schemas. Human-in-the-loop confirmation gate for shell execution. Model-agnostic: local Ollama (qwen2.5:7b) and remote cloud endpoints (GPT-OSS 20B hosted on Kaggle via Ngrok) switchable via a single CLI flag.

[ source ]

Chirpy

Go, PostgreSQL, JWT, Argon2id, REST — Jan 2026 – Mar 2026

Production-style RESTful API in Go with zero external web frameworks. Covers user registration, chirp CRUD, stateless JWT authentication, refresh token issuance and revocation, webhook ingestion, and admin observability metrics. Auth layer with short-lived JWT access tokens and revocable refresh tokens persisted in PostgreSQL. Argon2id for password hashing, shared-secret webhook verification for secure third-party event ingestion.

[ source ]

// experience

Technical Contributor GigaVector (Open Source) Apr 2026 – present
  • Diagnosed and fixed a non-portable compile flag causing SIGILL crashes on WSL2 targets in a vector database with 500+ active users; traced the failure to ABI assumptions that did not hold across toolchain configurations and submitted a patch accepted upstream.
  • Resolved a daemon thread lifecycle bug paired with an is_running liveness race that allowed silent data corruption on shutdown; required reasoning about memory visibility across thread boundaries and correct teardown ordering.
  • Debugged cross-platform DLL dependency failures on Windows involving libcurl linkage, MinGW runtime mismatches, and CFFI symbol resolution errors, restoring build reproducibility across Linux and Windows targets for all downstream contributors.

// contact

Available for internships and freelance work — high-performance backend architecture, systems programming, infrastructure.