CodeQL Guide for a Security Engineer (Part 4 of 6)
March 11th, 2025 by Brian
Work conducted by Hui Dai.
Introduction
In this fourth post of my six-part blog series, I share some of the most helpful techniques I learned while writing CodeQL queries as a security engineer. In the previous post, I provided tips on how to improve the generalizability of queries by adding custom function identification classes, and how to work with custom function pointers and wrappers.
In this section, I focus on important workflow considerations when working with CodeQL: how to organize your queries for visualization and result triage, and how to work with complex, conjoined data flows. Along the way, I will also share helpful utility predicates that I’ve used across multiple queries to simplify constraint writing.
3. How to Organize Your Queries
As a good practice for modularization, I write my queries in the form of predicates. Using predicates (much like functions in other languages) allows me to freely define query variables without worrying about variable name conflicts with other queries. In addition, I specify a reason field so that you get a descriptive string explaining what the vulnerability is. This way, you can search for multiple predicates at once and see all the results in the CodeQL panel while still being able to distinguish between them:
from Expr a, Expr b, string reason
where
a_bug_predicate(a, b, reason)
or some_other_bug_predicate(a, b, reason)
or some_other_bug_predicate2(a, b, reason)
//or ...
select a, b, reason
The structure of one such predicate looks something like this:
predicate inconsistentMemchrAndStrlen(FunctionCall memchr, StrlenCall strlen, string reason) {
exists(
DataFlow::Node source1, DataFlow::Node sink2 |
//Actual predicate logic here
)
and reason = "CVE-2022-32292: Usage of both memchr and strlen on the same char buffer"
}
For the full implementation of this predicate, see Section 4. Connecting multiple data flows together.
The idea is that, for formatting purposes, you first decide on a fixed number of columns/variables you want displayed for each query – in this case, I chose 3 (2 variables, plus the description string). Then for each predicate, you write its arguments such that the most important data pieces or expressions are selected. Any additional variables you need are defined inside the exists statement.
For example, in a memcpy_bug.ql file, I compiled a number of memcpy-related bug predicates, and by formatting them in the structure above, I was able to write one CodeQL query that runs all of them at the same time and aggregates their results:

In this particular file, it was sufficient for me to highlight the memcpy call (since that is often where the buffer overflow or OOB access occurs), alongside the parent function that calls it.
Aggregating results between multiple queries has its benefits. Whenever I see results for multiple predicates in the same function, it is often a sign that I should take a closer look: there might be some unusual data handling in that function, potential vulnerabilities, and so forth.
I also divide my set of predicates into multiple .ql files, separated by the bug classes they are relevant to. For example, in addition to my memcpy predicates, I also have a CodeQL query file for integer overflow/underflow bugs, one for memory allocation, another for command injection, and so forth. I’ve found that this kind of classification makes it easier to sort through results, as you’re only focused on variants of the same type of bug at a given time.
Another construct I’ve found helpful is putting commonly-used helper predicates together in a library file (extension .qll). More information here. In my workflow, there are certain custom flows and predicates that I use across multiple different queries. For example, we’ve already covered the function wrapper predicate in Section 2.
Another helpful predicate that I include below (hasAccessFlow) allows me to target a struct instance that I have detected as tainted, and follow that taintedness through all of its member references (recursively). This predicate can cause performance issues, especially if you’re working with a struct that is passed around often and contains many fields – which can lead to a potential exponential explosion in CodeQL search paths. However, it has also been invaluable for bug queries where I need to follow the usage of a child member of a user-controlled object.
Inside CCSW_Helper.qll:
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking
import semmle.code.cpp.controlflow.IRGuards
import semmle.code.cpp.valuenumbering.GlobalValueNumbering
import CCSW_Models
module CCSW_Helper {
/*
* Utility predicates for function wrapping
*/
//Checks if the call expression is essentially a wrapper
//function around our target function
predicate callWrapsFunction(Call call, Function wrapped) {
(
call instanceof VariableCall
and variableWrapsFunction(call.(VariableCall).getVariable(), wrapped)
)
or
(
call instanceof FunctionCall
and functionWrapsFunction(call.(FunctionCall).getTarget(), wrapped)
)
}
/*
* Utility predicates for tracking access flow
*/
//Checks if expression A can flow to expression B
//either directly or through a finite number of
//member references
//e.g. obj_A->child1->child2->...
predicate hasAccessFlow(Expr a, Expr b) {
exists(
DataFlow::Node source, DataFlow::Node sink |
source.asExpr() = a
and sink.asExpr() = b
and FollowChildMember::flow(source, sink)
)
}
//Other utilities
...
}
//Note: I would define all supporting flows and definitions that don't need
// to be directly referenced here...
...
//Follows the usage of child members of object
//Note: This is not the most performant of queries...
module FollowChildMemberConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) { source.asExpr() instanceof VariableAccess }
predicate isSink(DataFlow::Node sink) { sink.asExpr() instanceof VariableAccess }
predicate isAdditionalFlowStep(DataFlow::Node n1, DataFlow::Node n2) {
//Consider all child field access of target
//to be in data flow
exists(FieldAccess fa |
n1.asExpr() = fa.getQualifier() and
n2.asExpr() = fa
)
}
}
module FollowChildMember = DataFlow::Global<FollowChildMemberConfig>;
Then, to use it, I simply import the helper library file in a query (.ql) file and reference the exported predicate through the namespace:
import CCSW_Helper
from
...
where
...
and CCSW_Helper::hasAccessFlow(data, child)
select
...
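To make the predicate's purpose concrete, here is a hypothetical C target (an illustrative sketch of mine, not taken from a real codebase) of the kind hasAccessFlow is designed to match: once pkt is tainted, the member chain pkt->hdr.len is tainted too, and it ends up driving a copy size.

```c
#include <assert.h>
#include <string.h>

struct header { size_t len; };
struct packet { struct header hdr; char payload[64]; };

// If `pkt` holds user-controlled data, following only `pkt` itself misses
// the bug: the copy size comes from a child member reached through a chain
// of field accesses, which is exactly what hasAccessFlow tracks.
void process(struct packet *pkt, char *out) {
    memcpy(out, pkt->payload, pkt->hdr.len); // size driven by tainted member
}
```

A query using hasAccessFlow(pkt_access, size_expr) could then connect the tainted packet to the memcpy size argument even though the two expressions are several field accesses apart.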
4. Connecting Multiple Data Flows Together
Perhaps one of the most powerful aspects of CodeQL is its data flow analysis engine. Data flow queries allow us to track a piece of data (such as a variable, value, or pointer) and follow its usage as it moves through a service or process. There are two types of flow: local flow (intra-function) and global flow (inter-function). When describing a flow using CodeQL, we need to identify the source (where data originates) and the sink (where data ends up).
Note: For this blog, we will use the syntax of the new dataflow API, which was released in 2023 and is intended to be a direct replacement for the old dataflow module.
For security research, we’re often looking at data flow as a way to describe potentially buggy/insecure paths where untrusted data comes in and is later used without proper sanitization. A classic example would be a C program that fetches user input via scanf into a buffer that is later used in an execve call without checking the contents of the buffer. In that scenario, we would have a command injection bug.
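As a sketch of that pattern (with a hypothetical function name of my own, not from any real codebase), the source and sink would look like this in C:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Builds the argument vector for the shell; the untrusted buffer flows
// straight into the command string with no sanitization.
static void build_argv(char *buf, char *argv[4]) {
    argv[0] = "/bin/sh";
    argv[1] = "-c";
    argv[2] = buf; // tainted data becomes the shell command
    argv[3] = NULL;
}

// Source: scanf fills buf with user input. Sink: execve runs it.
void run_user_command(void) {
    char buf[128];
    char *argv[4];
    if (scanf("%127s", buf) != 1) // source of untrusted data
        return;
    build_argv(buf, argv);
    execve("/bin/sh", argv, NULL); // sink: command injection
}
```

A taint-tracking query for this would mark the scanf output argument as the source and the execve argument as the sink, with no sanitizer in between.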
In the real world, it’s not always possible to describe vulnerable data paths with just a single data flow. Instead, you may need to connect multiple flows together. Consider the following code snippet describing a heap-based buffer overflow (CVE-2022-32292) in ConnMan, a network connection daemon. The CVE has a CVSS score of 9.8 (Critical), and the bug revolves around an improper calculation of string length.
static gboolean received_data(GIOChannel *channel, GIOCondition cond, gpointer user_data)
{
...
while (bytes_read > 0) {
guint8 *pos;
gsize count;
char *str;
pos = memchr(ptr, '\n', bytes_read);
if (!pos) {
g_string_append_len(session->current_header,
(gchar *) ptr, bytes_read);
return TRUE;
}
*pos = '\0';
count = strlen((char *) ptr);
if (count > 0 && ptr[count - 1] == '\r') {
ptr[--count] = '\0';
bytes_read--;
}
g_string_append_len(session->current_header,
(gchar *) ptr, count);
Bug Explanation:
Here, ptr points to some memory region of user-provided input, and bytes_read tells us the overall length of that input. This section of the code attempts to split the data in ptr by newlines and append each line to the session header (session->current_header).
The issue revolves around the difference in behavior between strlen and memchr:
- strlen(ptr) returns the length of the C string pointed to by ptr, counting the number of characters up to the first terminating character \0.
- On the other hand, memchr(ptr, '\n', bytes_read) returns a pointer to the first occurrence of the separator character (which in this case is \n) within the first bytes_read bytes of ptr, or NULL if it is not found.
The important part is that memchr is agnostic to the null-terminating character: if it encounters \0 while reading the buffer, memchr will continue searching down the buffer until it either finds the separator or has read bytes_read characters.
In the code, the developer likely expects that after calling memchr and writing a null character at the location of the first newline, calling strlen(ptr) will give the length of the string up to that newline. However, that is not always the case. If we somehow had a buffer containing:
Hello\0World!\n
then after the substitution we get:
Hello\0World!\0
This means that strlen(ptr) gives 5, which is just the length of “Hello”, since it stops at the first null character. If the developer expected count = strlen(ptr) to capture the length of the full “Hello\0World!” string, then this would be a source of error.
In ConnMan, this difference causes a disagreement in what the pos and count values represent, eventually leading to a buffer overflow as the ptr buffer continues to be processed.
Writing a query for CVE-2022-32292:
As mentioned in the motivation section, one potential application of CodeQL is to write queries about bugs we are already aware of, and to use that knowledge to build a query base so that we can identify similar bugs in the same or a different codebase. Here, we can attempt to capture the vulnerable pattern covered by CVE-2022-32292 – a binary data vs. string length confusion – through a CodeQL query. It also provides a good example of how to connect multiple data flows together.
For this CVE, I came up with the following scheme:
predicate inconsistentMemchrAndStrlen(FunctionCall memchr, StrlenCall strlen, string reason) {
exists(
DataFlow::Node source1, DataFlow::Node sink1,
DataFlow::Node source2, DataFlow::Node sink2,
Expr delim |
memchr.getTarget().getName() = "memchr"
and sink1.asExpr() = memchr.getArgument(0) //Memchr data buffer
and sink2.asExpr() = strlen.getStringExpr() //Strlen string buffer
//Flow #1: Data feeds into memchr call
and DataFlow::localFlow(source1, sink1)
//Flow #2: Data feeds into strlen call
and DataFlow::localFlow(source2, sink2)
//Flow #3: Both sources are referring to the same thing
and (
DataFlow::localFlow(source1, source2)
or DataFlow::localFlow(source2, source1)
)
and delim = memchr.getArgument(1)
//If delim is '\0', then it's safe (cause same as strlen)
and not delim.getValueText().regexpMatch("'\\\\0'")
)
and reason = "CVE-2022-32292: Usage of both memchr and strlen on the same char buffer"
}
from FunctionCall memchr, StrlenCall strlen, string reason
where
inconsistentMemchrAndStrlen(memchr, strlen, reason)
select memchr, strlen, reason
The pattern that the predicate looks for is relatively simple. It detects all flows in which a piece of data is used as the incoming buffer for both a memchr and a strlen call. While such usage is not a vulnerability by itself, we can use it as an indicator of a potential mix-up between treating user data as a C string vs. a binary buffer. Once a finding is reported, we can perform manual analysis to verify whether there is indeed a vulnerability.
As described in the CVE section, this bug logic would not have been expressible with a single data flow. While the same data buffer feeds into both the memchr and strlen calls, the result of one call is not used by the other. In addition, there isn’t a strict chronological order of buffer -> memchr -> strlen or buffer -> strlen -> memchr. Thus, there’s no easy way to write a single source-to-sink flow that encapsulates both function calls.
But we can describe it with three separate flows, as seen above:
- A flow that describes buffer to memchr: source1 -> memchr(sink1, ..., ...)
- Another flow that describes buffer to strlen: source2 -> strlen(sink2)
- And one last flow that ensures the two buffers are essentially the same: source1 = source2
Possible improvements for the query
This query is not without its flaws. Currently it uses intra-procedural data flow, meaning it will only find situations where both memchr and strlen are called in the same function. It also looks for direct data flow instead of taint tracking, so any code that modifies the buffer pointer (such as a simple arithmetic operation) between the two function calls may prevent a detection. Also, it uses basic name matching (.getName() = some_string) for function identification, which, as we outlined in Section 1, has its drawbacks. In addition, if we had a clearer idea of what a general “bad usage” of memchr and strlen on the same buffer looks like in code, we could add that to increase the specificity of our query.
However, as a whole, this example is used to illustrate that you can combine simple (or even complex) flows together to build greater context and capture more information in your query.
Trying a more complex example
For this exercise, we will work with a sample scenario. While the specifics of the flow aren’t as important, this query should serve as an example of working with a codebase-specific vulnerability where user-controlled data can come in at multiple points, and the query you are designing describes situations where their combined usage might indicate a vulnerability.
Scenario: Imagine you have pinpointed a function of interest, compute, that takes two parameters: a char buffer buf and an integer num. In your codebase, compute is a commonly-used function that is fed some form of user-provided data. In addition, the result of compute is used in important calls, such as a malloc(...) that determines the amount of memory to allocate for subsequent processing.
Say that you have an idea of a flow where user-controlled data can reach the first parameter buf and the second parameter num. You can also describe situations where result = compute(buf, num) eventually feeds into malloc.
Now we want to understand whether there’s any situation where a malicious actor can cause a bad invocation of malloc by providing specific inputs through compute. When approaching this query, we can first break it down into three individual flows:
- Flow A: Find a flow where user_data_src_1 -> compute(buf, ...)
- Flow B: Find a flow where user_data_src_2 -> compute(..., num)
- Flow C: Find a flow where compute(buf, num) eventually feeds into a malloc(...) call
Then the final query could be:
- Final query: Find all calls to malloc(...) and compute(buf, num) that satisfy flow C. In addition, make sure that the compute(buf, num) function call MUST satisfy flow A (for some source_1) and flow B (for some source_2).
By itself, any one of flows A, B, or C might not be enough to indicate a bug, but by combining all these factors together, we can describe a specific scenario where buggy behavior does occur.
Let’s say the code looks something like this:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
int get_user_num() {
int res;
size_t num;
res = scanf("%zu", &num);
if(res != 1)
return 0;
return num;
}
size_t compute(char* buf, int num) {
if(strlen(buf) > 32)
return 1;
return num;
}
void* run_process(char *buf) {
int num;
size_t alloc_amount;
num = get_user_num();
if (num < 0)
return NULL;
alloc_amount = compute(buf, num);
printf("%zu\n", alloc_amount);
void* ptr = malloc(alloc_amount); //Target line
return ptr;
}
int main() {
char buf[64] = "Some initial data";
void *ptr = run_process(buf);
//Do something else afterwards
return 0;
}
A data flow query for that might look something like the following:
import cpp
import semmle.code.cpp.commons.Scanf
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking
//Flow A: Find flow `user_data_src_1 -> compute(buf, ...)`
module DataToComputeBufConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
exists(Parameter param |
param = source.asParameter()
and param.getType() instanceof PointerType
)
}
predicate isSink(DataFlow::Node sink) {
exists(
FunctionCall func_call |
func_call.getTarget().hasName("compute")
and func_call.getArgument(0) = sink.asExpr()
)
}
}
module DataToComputeBuf = DataFlow::Global<DataToComputeBufConfig>;
//Flow B: Find flow where `user_data_src_2 -> compute(..., num)`
module DataToComputeNumConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
exists(ScanfFunctionCall scanf_call, VariableAccess v_acc, Variable v |
scanf_call.getAnOutputArgument() = source.asDefiningArgument() //Captures scanf(..., &num)
and source.asDefiningArgument().getAChild*() = v_acc //Resolves "&num" -> num
and v_acc.getTarget() = v
and v.getType() instanceof Size_t
)
}
predicate isSink(DataFlow::Node sink) {
exists(
FunctionCall func_call |
func_call.getTarget().hasName("compute")
and func_call.getArgument(1) = sink.asExpr()
)
}
}
module DataToComputeNum = DataFlow::Global<DataToComputeNumConfig>;
//Main query: Combine Flow A, B, and C together
from
FunctionCall compute_call,
AllocationExpr alloc_call,
Expr alloc_size,
DataFlow::Node user_data_src_1, DataFlow::Node compute_buf,
DataFlow::Node user_data_src_2, DataFlow::Node compute_num
where
compute_call.getTarget().hasName("compute")
and compute_call.getArgument(0) = compute_buf.asExpr()
and compute_call.getArgument(1) = compute_num.asExpr()
and DataToComputeBuf::flow(user_data_src_1, compute_buf) //user_data_src_1 -> compute(buf, ...)
and DataToComputeNum::flow(user_data_src_2, compute_num) //user_data_src_2 -> compute(..., num)
and alloc_size = alloc_call.getSizeExpr()
and TaintTracking::localExprTaint(compute_call, alloc_size) //Flow C: x = compute(...) -> alloc(x)
select user_data_src_1, user_data_src_2, compute_call, alloc_call
The specific constraints for each of the flows are not too important; they will likely vary depending on the situation and your codebase. What I wanted to illustrate here is the ability to tie together multiple global data flow queries (along with a local taint tracking query) to describe a complex data flow chain:
- The first piece of “user” data comes from the char pointer argument of run_process (in reality, it’s just a fixed string in our test case), which flows to the buf argument of compute.
- In addition, we have scanf supplying an integer from standard input, which goes from get_user_num to run_process to the num argument of compute.
- Finally, the result of compute is used inside malloc.
One important piece to note: since DataFlow::Node can represent a wide variety of CodeQL types (such as Expr, Parameter, etc. – see the predicate documentation for DataFlow::Node), we need to make sure that the typing we use in the final query matches what was defined in the data flow predicates. For example, both the DataToComputeBuf and DataToComputeNum flows have sinks of type Expr.
However, if I wanted to place additional constraints on either flow’s source, I would have needed to use .asParameter() for the DataToComputeBuf flow and .asDefiningArgument() for the DataToComputeNum flow. These types are often mutually exclusive, so if you use asExpr() for everything, you may not see any results in your query.
CodeQL Summary
CodeQL syntax is vast and descriptive. Over the few months I’ve spent learning CodeQL, I’ve had to navigate much of GitHub’s CodeQL documentation (mainly for C/C++), gain familiarity with its rich but complex data analysis engines, and work with additional included libraries for query refinement. As such, I hope these last two sections have been useful in helping you get started on your security query writing journey.
In the following two sections of the blog series, I will be covering techniques I found helpful for working with Semgrep rules. These will follow a similar structure to the ones I’ve shared for CodeQL, specifically for function identification and data flow analysis. However, with Semgrep, we will be analyzing decompiled code rather than source code, which presents its own set of challenges and considerations.