// systems programmer
Aakarsh Kashyap
GPU compute · compiler infrastructure · OS internals
B.Tech CSE @ GLA University · B.S. Data Science & AI @ IIT Madras.
I build things close to the metal: CUDA kernels, bootloaders, runtimes.
Long-term goal: GPU compute or compiler work in Germany.
// about
Systems developer. My niche is GPU compute and compiler infrastructure - currently a noob in both, working to fix that.
I build things close to the metal: CUDA kernels, bootloaders, runtimes, and the occasional kernel from scratch. sauceOS is my x86-64 hobby OS, currently implementing four-level paging and process infrastructure.
I maintain soft-cuda, a hybrid CPU/GPU AOT deep learning framework with a custom 3-pass compiler, bump allocator, and lazy DAG. The hybrid model cuts step time by ~35% relative to GPU-only dispatch (0.94 ms vs 1.44 ms per step, roughly 1.5x throughput).
I use no LSP. I don't need syntax highlighting. I do this for fun.
C/C++, Go, Python, x86-64 ASM, CUDA, SQL
OS internals, memory allocators, POSIX APIs, IPC, syscall interface
CUDA kernel programming, CUDA streams, cuBLAS, cuDNN, SGEMM, double buffering
REST, scatter-gather, write-through cache, JWT, Argon2id, RBAC, webhook verification
Git, Docker, CMake, Make, GDB, Neovim, Arch Linux
// projects
sauceOS
64-bit hobby operating system written from scratch, booting via the Limine bootloader. Implements the Global Descriptor Table, Interrupt Descriptor Table, ISR and IRQ dispatch routines, and Task State Segment in C and x86-64 assembly. Buddy memory allocator for kernel heap management in the higher-half virtual address space, handling block splitting, coalescing, and alignment constraints. Four-level paging with independent per-process virtual address space mapping. PS/2 keyboard driver with interrupt-driven input handling wired through the IDT. Minimal libc and syscall interface establishing the kernel-user boundary for future userspace programs.
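The address arithmetic behind the paging and buddy-allocator work can be sketched in a few lines of C. This is a minimal illustration, not the sauceOS source: the names va_parts, va_split, va_is_canonical, and buddy_of are hypothetical, and the buddy helper assumes blocks are addressed by their byte offset from the heap base.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: these names are hypothetical, not taken from sauceOS. */
typedef struct {
    uint16_t pml4;   /* bits 47..39: index into the top-level table */
    uint16_t pdpt;   /* bits 38..30 */
    uint16_t pd;     /* bits 29..21 */
    uint16_t pt;     /* bits 20..12 */
    uint16_t offset; /* bits 11..0: byte offset inside the 4 KiB page */
} va_parts;

/* Split a virtual address into four 9-bit table indices and a page offset. */
static va_parts va_split(uint64_t va) {
    va_parts p;
    p.pml4   = (uint16_t)((va >> 39) & 0x1FF);
    p.pdpt   = (uint16_t)((va >> 30) & 0x1FF);
    p.pd     = (uint16_t)((va >> 21) & 0x1FF);
    p.pt     = (uint16_t)((va >> 12) & 0x1FF);
    p.offset = (uint16_t)(va & 0xFFF);
    return p;
}

/* With 48-bit virtual addresses, bits 63..47 must all be equal. */
static int va_is_canonical(uint64_t va) {
    uint64_t top = va >> 47; /* the upper 17 bits */
    return top == 0 || top == 0x1FFFF;
}

/* Buddy of a block: flip the bit that corresponds to the block size.
 * `off` is the block's byte offset from the heap base (an assumption of
 * this sketch) and must be aligned to 2^order bytes. */
static uint64_t buddy_of(uint64_t off, unsigned order) {
    assert(order < 64 && (off & ((1ULL << order) - 1)) == 0);
    return off ^ (1ULL << order);
}
```

For example, 0xFFFFFFFF80000000 (the base of the top 2 GiB of the address space) splits into PML4 index 511 and PDPT index 510, and the 4 KiB block at offset 0x2000 has its buddy at 0x3000.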
soft-cuda
Hybrid CPU/GPU tensor execution engine featuring a bump allocator memory pool, lazy evaluation DAG, and a three-pass AOT compiler with profile-guided backend assignment between CPU and GPU operators. Double-buffered SGEMM kernel overlapping memory transfers with compute using CUDA streams. Benchmarked at 0.94 ms/step against GPU-only dispatch at 1.44 ms/step, a 35% reduction in step time (about 1.5x throughput), with cuBLAS as the reference backend. Full forward and backward pass with MSE and cross-entropy losses and SGD. Trained XOR to convergence at loss ≈ 3.5 × 10⁻¹⁴.
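A bump allocator of the kind mentioned above can be sketched as follows. This is a generic host-side illustration under stated assumptions, not the soft-cuda implementation: the bump_pool type and the bump_init, bump_alloc, and bump_reset functions are hypothetical names, and the pool is assumed to wrap a caller-provided buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool type for illustration; not the soft-cuda API. */
typedef struct {
    uint8_t *base; /* start of the backing buffer */
    size_t   cap;  /* buffer size in bytes */
    size_t   off;  /* offset of the next free byte */
} bump_pool;

static void bump_init(bump_pool *p, void *buf, size_t cap) {
    p->base = (uint8_t *)buf;
    p->cap  = cap;
    p->off  = 0;
}

/* Allocate `size` bytes aligned to `align` (a power of two).
 * Returns NULL when the pool cannot satisfy the request. */
static void *bump_alloc(bump_pool *p, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur     = (uintptr_t)p->base + p->off;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t    start   = (size_t)(aligned - (uintptr_t)p->base);
    if (start > p->cap || size > p->cap - start) {
        return NULL; /* out of space: nothing is modified */
    }
    p->off = start + size; /* "bump" the cursor past the new block */
    return p->base + start;
}

/* Individual blocks are never freed; the whole pool is released at once. */
static void bump_reset(bump_pool *p) {
    p->off = 0;
}
```

Allocation is a pointer increment and a bounds check, which suits workloads where every buffer for a step can be discarded together with a single reset.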
sush
Unix shell from scratch in C++ supporting piping chains, command history, and variable expansion without relying on any system shell utility. Uses fork, exec, pipe, and dup2 directly for process creation, file descriptor wiring, and IPC. Custom lexer and token expander for command parsing, tokenization, and variable resolution.
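The file descriptor wiring for a two-stage pipe such as `left | right` can be sketched with the same four calls. This is a minimal C illustration rather than code from sush (which is written in C++): run_pipeline2 is a hypothetical helper, and execvp is assumed as the exec variant so that commands are looked up on PATH.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `left | right` and return the exit status of
 * `right`, or -1 on failure. Both argv arrays must be NULL-terminated. */
static int run_pipeline2(char *const left[], char *const right[]) {
    assert(left && left[0] && right && right[0]);

    int fds[2]; /* fds[0] is the read end, fds[1] the write end */
    if (pipe(fds) == -1) return -1;

    pid_t lpid = fork();
    if (lpid == -1) { close(fds[0]); close(fds[1]); return -1; }
    if (lpid == 0) {
        dup2(fds[1], STDOUT_FILENO); /* stdout now feeds the pipe */
        close(fds[0]);
        close(fds[1]);
        execvp(left[0], left);
        _exit(127); /* reached only if exec failed */
    }

    pid_t rpid = fork();
    if (rpid == -1) {
        close(fds[0]);
        close(fds[1]);
        waitpid(lpid, NULL, 0);
        return -1;
    }
    if (rpid == 0) {
        dup2(fds[0], STDIN_FILENO); /* stdin now drains the pipe */
        close(fds[0]);
        close(fds[1]);
        execvp(right[0], right);
        _exit(127);
    }

    /* The parent must close both ends, or `right` never sees EOF. */
    close(fds[0]);
    close(fds[1]);

    int status = 0;
    waitpid(lpid, NULL, 0);
    if (waitpid(rpid, &status, 0) == -1) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Longer chains follow the same pattern with one pipe per adjacent pair of commands; the step that is easiest to miss is closing every unused descriptor in each process.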
Custodian
High-concurrency Go orchestrator implementing a scatter-gather pattern on every incoming query, fanning out simultaneously to PostgreSQL for audit logging and a Python BERT service for inference, with results joined before response. Three-tier degradation chain: Redis cache under 1 ms for known claims, local DistilBERT at 50 ms for high-volume filtering, and async LLM escalation via a Redis job queue for ambiguous cases. Write-through cache consistency: verdicts from the background LLM worker atomically overwrite both the Redis cache and the PostgreSQL record. The LLM stays current through a harness that gives it internet access.
SlopGen
ReAct-style agentic loop in Go: the LLM reasons over tool call results iteratively (read, write, execute) until the task is resolved, with full conversation history passed on every inference request. All tool interfaces defined via strict JSON schemas. Human-in-the-loop confirmation gate for shell execution. Model-agnostic: local Ollama (qwen2.5:7b) and remote cloud endpoints (GPT-OSS 20B hosted on Kaggle via Ngrok) switchable via a single CLI flag.
Chirpy
Production-style RESTful API in Go with zero external web frameworks. Covers user registration, chirp CRUD, stateless JWT authentication, refresh token issuance and revocation, webhook ingestion, and admin observability metrics. Auth layer with short-lived JWT access tokens and revocable refresh tokens persisted in PostgreSQL. Argon2id for password hashing and shared-secret webhook verification for secure third-party event ingestion.
// experience
- Diagnosed and fixed a non-portable compile flag causing SIGILL crashes on WSL2 targets in a vector database with 500+ active users; traced the failure to ABI assumptions that did not hold across toolchain configurations and submitted a patch accepted upstream.
- Resolved a daemon thread lifecycle bug paired with an is_running liveness race that allowed silent data corruption on shutdown; required reasoning about memory visibility across thread boundaries and correct teardown ordering.
- Debugged cross-platform DLL dependency failures on Windows involving libcurl linkage, MinGW runtime mismatches, and CFFI symbol resolution errors, restoring build reproducibility across Linux and Windows targets for all downstream contributors.
// contact
Available for internships and freelance work — high-performance backend architecture, systems programming, infrastructure.