Lessons from Building GitHub Code Search

Strange Loop, St. Louis, September 2023

In this talk, I share lessons we learned building a high-performance code search engine designed for GitHub's scale. GitHub code search is the world's largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the content-addressable nature of Git repositories to our advantage. I cover the key strategies we used: deduplication and repository similarity to reduce the indexing workload, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Learn how we turned code search from a frustrating experience into a powerful feature for our users.
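To make the deduplication idea concrete, here is a minimal sketch, not the production design. It relies on one fact about Git: a blob's object ID (in the SHA-1 object format) is the SHA-1 of a short header plus the file content, so identical files in different repositories share an ID. Everything else is an assumption for illustration: the helper names `shard_for` and `plan_indexing`, the in-memory `repos` layout, and the choice to pick a shard by taking the blob ID modulo the shard count.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def shard_for(blob_id: str, num_shards: int) -> int:
    """Hypothetical shard choice: blob ID interpreted as an integer, mod N."""
    return int(blob_id, 16) % num_shards


def plan_indexing(repos: dict[str, dict[str, bytes]], num_shards: int):
    """Decide which unique blobs to index and remember where each one appears.

    `repos` maps a repository name to {path: file content}. This layout is
    an assumption for the sketch, not how repositories are actually read.
    """
    to_index: dict[str, int] = {}                      # blob ID -> shard
    locations: dict[str, list[tuple[str, str]]] = {}   # blob ID -> (repo, path)
    for repo, files in repos.items():
        for path, content in files.items():
            blob_id = git_blob_id(content)
            locations.setdefault(blob_id, []).append((repo, path))
            # Content already seen in another repository is not indexed again.
            if blob_id not in to_index:
                to_index[blob_id] = shard_for(blob_id, num_shards)
    return to_index, locations


repos = {
    "alice/app": {"LICENSE": b"MIT\n", "main.py": b"print('hi')\n"},
    "bob/fork": {"LICENSE": b"MIT\n", "main.py": b"print('hello')\n"},
}
to_index, locations = plan_indexing(repos, num_shards=4)
print(len(to_index), "unique blobs for 4 files")
```

In this sketch the shared LICENSE file is indexed once and recorded at both locations, which is the kind of saving content addressing makes possible.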

All talks by Luke Francl