
How we built an internal data analytics agent

GitHub recently shared how they built Qubot, an internal analytics agent that lets employees ask questions about company data in plain language instead of writing SQL queries. It’s a practical example of how AI can reduce friction in data workflows, and the lessons apply well beyond GitHub.

At its core, Qubot solves a common problem: data lives in databases, but getting at it requires SQL expertise. Not everyone on a team has that skill, and even those who do spend time writing boilerplate queries. GitHub’s approach uses Claude (via Bedrock or a similar service) to translate natural language questions into SQL queries that run against their internal data warehouse. An employee can ask “How many pull requests were merged last quarter?” and get results without touching a database client. The agent handles schema understanding, query generation, and result formatting, acting as an intermediary between human questions and structured data.
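The three responsibilities named above (schema understanding, query generation, result formatting) can be sketched in a few functions. This is only an illustration under stated assumptions, not Qubot's actual code: it uses an in-memory SQLite database in place of a data warehouse, and `generate_sql` is a hypothetical callable standing in for the model call. The names `describe_schema`, `build_prompt`, `format_rows`, and `ask` are invented for this example.

```python
import sqlite3


def describe_schema(conn):
    """Schema understanding: list each table with its columns, read from SQLite's catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        lines.append(f"{table}({', '.join(columns)})")
    return "\n".join(lines)


def build_prompt(question, schema):
    """Query generation input: the prompt text a model would receive (wording is illustrative)."""
    return (
        "Write one SQLite SELECT statement that answers the question.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\nSQL:"
    )


def format_rows(cursor):
    """Result formatting: render the result set as plain text with a header line."""
    headers = [col[0] for col in cursor.description]
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(value) for value in row) for row in cursor.fetchall()]
    return "\n".join(lines)


def ask(question, conn, generate_sql):
    """Run the full question -> SQL -> formatted answer path.

    generate_sql is a hypothetical stand-in for the model call: it takes the
    prompt string and returns a SQL string.
    """
    prompt = build_prompt(question, describe_schema(conn))
    sql = generate_sql(prompt)
    return format_rows(conn.execute(sql))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pull_requests (id INTEGER, repo TEXT, merged_at TEXT)")
    conn.executemany(
        "INSERT INTO pull_requests VALUES (?, ?, ?)",
        [(1, "a", "2024-01-02"), (2, "a", None), (3, "b", "2024-02-03")],
    )

    # A canned response replaces the real model so the sketch runs offline.
    def canned_model(prompt):
        return "SELECT COUNT(*) AS merged_prs FROM pull_requests WHERE merged_at IS NOT NULL"

    print(ask("How many pull requests were merged?", conn, canned_model))
```

Keeping the model behind a plain callable means the surrounding plumbing can be tested with canned responses before any real model is wired in.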

The technical architecture involves a few key pieces working together. The agent needs access to database schemas and metadata so it understands what tables and columns exist. It uses retrieval-augmented generation (RAG) to fetch only the relevant schema information before constructing a query, which keeps prompts focused and reduces the chance of hallucinated tables or columns. Error handling matters too: when a query fails, the agent can refine it and retry rather than passing the error back to the user. Getting this right requires careful prompt engineering and testing across different question types.
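A minimal sketch of those two ideas, schema retrieval and retry on failure, might look like the following. Everything here is an assumption made for illustration: the `SCHEMA_DOCS` catalog is invented, word overlap stands in for the embedding search a real RAG setup would more likely use, SQLite stands in for the warehouse, and `generate_sql` is a hypothetical model call that receives the previous error message so it can correct itself.

```python
import re
import sqlite3

# Hypothetical schema catalog: table name -> one-line description of the table.
SCHEMA_DOCS = {
    "pull_requests": "pull_requests(id, repo, state, merged_at): one row per pull request",
    "issues": "issues(id, repo, state, closed_at): one row per issue",
    "deployments": "deployments(id, service, status, finished_at): one row per deployment",
}


def tokenize(text):
    """Lowercase the text and split it into words, treating underscores as spaces."""
    return set(re.findall(r"[a-z]+", text.lower().replace("_", " ")))


def retrieve_schema(question, k=1):
    """Return the k schema snippets sharing the most words with the question.

    Word overlap is a simple stand-in for embedding-based similarity search.
    """
    words = tokenize(question)
    ranked = sorted(
        SCHEMA_DOCS,
        key=lambda name: len(words & tokenize(SCHEMA_DOCS[name])),
        reverse=True,
    )
    return [SCHEMA_DOCS[name] for name in ranked[:k]]


def answer(question, conn, generate_sql, max_attempts=3):
    """Generate SQL, run it, and retry with the database error as feedback.

    generate_sql(question, schema, error) is a hypothetical model call;
    error is None on the first attempt.
    """
    schema = "\n".join(retrieve_schema(question))
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, schema, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # passed to the next attempt so it can correct itself
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")
```

Capping the number of attempts keeps a persistently wrong query from looping forever, and the final error still reaches the caller if every attempt fails.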

The practical impact shows up in three places. Data analysts spend less time answering repetitive questions from colleagues, freeing them for deeper analysis work. Non-technical team members can self-serve basic reporting questions without bottlenecking through experts. Engineers can quickly validate hypotheses about system behavior. For teams building similar tools, whether on AWS with Athena and Bedrock or on other cloud platforms, the key takeaway is that natural language interfaces to data are no longer futuristic. They’re achievable with current AI capabilities, though they require thoughtful design around security, accuracy, and graceful error handling. Start with a well-defined data scope and strong guardrails before expanding to broader access.
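As one illustration of what a guardrail around a well-defined data scope could look like, the sketch below checks generated SQL before it runs. It is not taken from GitHub's write-up: the allow-list, the keyword list, and the `check_query` name are all assumptions. The regular expressions are deliberately crude and conservative (they can reject valid queries, such as one using `EXTRACT(... FROM ...)`), so a check like this would complement read-only database credentials and a real SQL parser rather than replace them.

```python
import re

# Hypothetical data scope: the only tables the agent may read.
ALLOWED_TABLES = {"pull_requests", "issues"}

# Keywords that indicate a write or schema change; matched as whole words only.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|attach|pragma)\b", re.I
)

# Crude table extraction: the identifier following FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.I)


def check_query(sql):
    """Return a list of reasons to reject the query; an empty list means it passed."""
    problems = []
    body = sql.strip().rstrip(";").strip()
    if ";" in body:
        problems.append("multiple statements")
    if not re.match(r"(?i)(select|with)\b", body):
        problems.append("not a SELECT statement")
    if FORBIDDEN.search(body):
        problems.append("contains a write or DDL keyword")
    for table in TABLE_REF.findall(body):
        if table.lower() not in ALLOWED_TABLES:
            problems.append(f"table not in scope: {table}")
    return problems
```

Returning the list of reasons, rather than a bare true or false, lets the agent log why a query was blocked or feed the reasons back into a retry.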

Source
The GitHub Blog