CodeQL Guide for a Security Engineer (Part 4 of 6)

March 11th, 2025 by Brian

Work conducted by Hui Dai.

Introduction

In this fourth post of my six-part blog series, I share some of the most helpful techniques I learned while writing CodeQL queries as a security engineer. In the last blog post, I provided tips on how to improve the generalizability of queries by adding custom function identification classes, and how to work with custom function pointers and wrappers.

For this section, I will focus on important workflow considerations when working with CodeQL: how to organize your queries for visualization and result triage, and how to work with complex/conjoined data flows. Along the way, I will also share helpful utility predicates that I’ve used across multiple queries to simplify constraint writing.

3. How to Organize Your Queries

As a good practice for modularization, I write my queries in the form of predicates. Using predicates (much like functions in other languages) allows me to freely define query variables without worrying about variable name conflicts in other queries. In addition, I specify a reason field that carries a descriptive string about what the vulnerability is. This way, you can search for multiple predicates at once and see all the results in the CodeQL panel while still being able to distinguish between them:

from Expr a, Expr b, string reason
where 
    a_bug_predicate(a, b, reason) 
    or some_other_bug_predicate(a, b, reason) 
    or some_other_bug_predicate2(a, b, reason) 
    //or ...
select a, b, reason

The structure of one such predicate would look something like this:

predicate inconsistentMemchrAndStrlen(FunctionCall memchr, StrlenCall strlen, string reason) {
    exists(
        DataFlow::Node source1,  DataFlow::Node sink2 |
        //Actual predicate logic here
    )
    and reason = "CVE-2022-32292: Usage of both memchr and strlen on the same char buffer"
}

For the full implementation of this predicate, see Section 4: Connecting Multiple Data Flows Together.

The idea is that, for formatting purposes, you first decide on a fixed number of columns/variables you want displayed for each query – in this case, I chose 3 (2 variables, plus the description string). Then for each predicate, you write the arguments of the predicate such that the most important data pieces or expressions are selected. Any additional variables you need defined go inside the exists statement.

For example, in a memcpy_bug.ql file, I had compiled together a number of memcpy-related bug predicates, and by formatting them in the structure above, I was able to write one CodeQL query that could run all of them at the same time and get their aggregated results.

In this particular file, it was sufficient for me to highlight the memcpy call (since that is often where the buffer overflow or OOB access occurs), alongside the parent function that calls it.
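A trimmed-down sketch of what such an aggregation file could look like is shown below. The individual predicate names here are hypothetical placeholders for your own memcpy bug predicates:

import cpp

from FunctionCall memcpy, Function parent, string reason
where
    parent = memcpy.getEnclosingFunction()
    and (
        memcpyOverflowsDestBuffer(memcpy, parent, reason)  //Hypothetical predicate
        or memcpyUsesUncheckedSize(memcpy, parent, reason) //Hypothetical predicate
        //or ...
    )
select memcpy, parent, reason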

Aggregating results between multiple queries has its benefits. Whenever I see results from multiple predicates in the same function, it is often a sign that I should take a closer look. For example, there might be some weird data handling in that function, potential vulnerabilities, and so forth.

I also divide my set of predicates into multiple .ql files, separated by the bug classes they relate to. For example, in addition to my memcpy predicates, I also have a CodeQL query file for bugs of type integer overflow/underflow, one for memory allocation, another for command injection, and so forth. I’ve found that this kind of classification makes it easier to sort through results, as you’re only focused on variants of the same type of bug at a given time.

Another construct I’ve found to be helpful is to put together commonly-used helper predicates in a library file (extension .qll). More information here. In my workflow, I found that there were certain custom flows and predicates I wrote that I would reuse in multiple different queries. For example, we’ve already covered the function wrapper predicate in Section 2.

Another helpful predicate that I include below (hasAccessFlow) allows me to take a struct instance that I have somehow detected as being tainted, and follow that taintedness through all of its member references (recursively). This predicate can cause performance issues, especially if you’re working with a struct that is passed around often and contains many fields – which can lead to a potential exponential explosion in CodeQL search paths. However, it has also been invaluable for bug queries where I need to follow the usage of a child member of a user-controlled object.

Inside CCSW_Helper.qll:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking
import semmle.code.cpp.controlflow.IRGuards
import semmle.code.cpp.valuenumbering.GlobalValueNumbering
import CCSW_Models

module CCSW_Helper {

    /*
    *   Utility predicates for function wrapping
    */ 
    //Checks if the call expression is essentially a wrapper
    //function around our target function 
    predicate callWrapsFunction(Call call, Function wrapped) {
        (
            call instanceof VariableCall
            and variableWrapsFunction(call.(VariableCall).getVariable(), wrapped)
        ) 
        or 
        (
            call instanceof FunctionCall
            and functionWrapsFunction(call.(FunctionCall).getTarget(), wrapped)
        )
    }
    /*
    * Utility predicates for tracking access flow
    */ 
    //Checks if expression A can flow to expression B
    //either directly or through a finite number of
    //member references
    //e.g. obj_A->child1->child2->...
    predicate hasAccessFlow(Expr a, Expr b) {
        exists(
            DataFlow::Node source, DataFlow::Node sink |
            source.asExpr() = a
            and sink.asExpr() = b
            and FollowChildMember::flow(source, sink)
        )
    }
    //Other utilities
    ...
}
//Note: I would define all supporting flows and definitions that don't need
//      to be directly referenced here...
...
//Follows the usage of child members of object
//Note: This is not the most performant of queries...
module FollowChildMemberConfig implements DataFlow::ConfigSig {
    predicate isSource(DataFlow::Node source) { source.asExpr() instanceof VariableAccess }
    predicate isSink(DataFlow::Node sink) { sink.asExpr() instanceof VariableAccess }
    predicate isAdditionalFlowStep(DataFlow::Node n1, DataFlow::Node n2) {
        //Consider all child field access of target
        //to be in data flow
        exists(FieldAccess fa |
            n1.asExpr() = fa.getQualifier() and
            n2.asExpr() = fa
        )
    }
}
module FollowChildMember = DataFlow::Global<FollowChildMemberConfig>;

Then, to use it, I would simply import the helper library file in a .ql query file, and reference the exported predicate through its namespace:

import CCSW_Helper

from 
    ...
where
    ...
    and CCSW_Helper::hasAccessFlow(data, child)
select
    ...
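For instance, a complete (if contrived) query using hasAccessFlow might look like this. The variable name user_packet is a hypothetical stand-in for whatever tainted struct instance you are tracking:

import cpp
import CCSW_Helper

from VariableAccess data, FieldAccess child
where
    //Hypothetical: a struct instance we have determined to be user-controlled
    data.getTarget().hasName("user_packet")
    and CCSW_Helper::hasAccessFlow(data, child)
select child, "Member access reachable from the tainted struct instance"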

4. Connecting Multiple Data Flows Together

Perhaps one of the most powerful aspects of CodeQL is its data flow analysis engine. Data flow queries allow us to track a piece of data (such as a variable, value, pointer, etc.) and follow its usage as it moves through a service or process. There are two types of flow: local flow (intra-function) and global flow (inter-function). When describing a flow using CodeQL, we need to identify the source (where data originates) and the sink (where data ends up).

Note: For this blog, we will use the syntax of the new dataflow API, which was released in 2023 and is intended to be a direct replacement for the old dataflow module.
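To make the two flavors concrete, here is a minimal sketch under the new API; the module names and the source/sink constraints are illustrative only:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

//Local flow: does expression a reach expression b within a single function?
predicate reachesLocally(Expr a, Expr b) {
    exists(DataFlow::Node src, DataFlow::Node snk |
        src.asExpr() = a
        and snk.asExpr() = b
        and DataFlow::localFlow(src, snk)
    )
}

//Global flow: define a configuration module, then instantiate
//the generic DataFlow::Global module with it
module MyFlowConfig implements DataFlow::ConfigSig {
    predicate isSource(DataFlow::Node source) { source.asExpr() instanceof StringLiteral }
    predicate isSink(DataFlow::Node sink) {
        exists(FunctionCall fc | sink.asExpr() = fc.getAnArgument())
    }
}
module MyFlow = DataFlow::Global<MyFlowConfig>;
//Usage: MyFlow::flow(source, sink)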

For security research, we’re often looking at data flow as a way to describe potentially buggy/insecure paths where untrusted data comes in and is later used without proper sanitization. A classic example would be a C program that fetches user input via scanf into a buffer that is later used in an execve call without checking the contents of the buffer. In that scenario, we would have a command injection bug.
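Under the new API, that classic scanf-to-execve pattern could be sketched as a global taint-tracking configuration like the one below. This is a hedged sketch: the argument positions assume calls of the form scanf("%s", buf) and execve(path, ...):

import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking

module ScanfToExecveConfig implements DataFlow::ConfigSig {
    //Source: a buffer written by scanf (assumes scanf("%s", buf))
    predicate isSource(DataFlow::Node source) {
        exists(FunctionCall scanf_call |
            scanf_call.getTarget().hasName("scanf")
            and source.asDefiningArgument() = scanf_call.getArgument(1)
        )
    }
    //Sink: the path argument of an execve call
    predicate isSink(DataFlow::Node sink) {
        exists(FunctionCall exec_call |
            exec_call.getTarget().hasName("execve")
            and sink.asExpr() = exec_call.getArgument(0)
        )
    }
}
module ScanfToExecve = TaintTracking::Global<ScanfToExecveConfig>;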

In the real world, it’s not always possible to describe vulnerable data paths with just a single data flow. Instead, you may need to connect multiple flows together. Consider the following code snippet describing a heap-based buffer overflow (CVE-2022-32292) in ConnMan, a network connection daemon. The CVE has a CVSS score of 9.8 (Critical), and the buggy portion revolves around an improper calculation of string length.

static gboolean received_data(GIOChannel *channel, GIOCondition cond, gpointer user_data)
{
    ...
    while (bytes_read > 0) {
        guint8 *pos;
        gsize count;
        char *str;

        pos = memchr(ptr, '\n', bytes_read);
        if (!pos) {
            g_string_append_len(session->current_header,
                        (gchar *) ptr, bytes_read);
            return TRUE;
        }
        
        *pos = '\0';
        count = strlen((char *) ptr);
        if (count > 0 && ptr[count - 1] == '\r') {
            ptr[--count] = '\0';
            bytes_read--;
        }

        g_string_append_len(session->current_header,
                        (gchar *) ptr, count);

Bug Explanation:

Here, ptr points to some memory region of user-provided input, and bytes_read tells us the overall length of that input. This section of the code attempts to split the data in ptr by newlines and append each line to the session header (session->current_header).

The issue revolves around the difference in behavior between strlen and memchr:

  • strlen(ptr) returns the length of the C string pointed to by ptr, counting the number of characters up to the first terminating null character \0.
  • On the other hand, memchr(ptr, '\n', bytes_read) returns a pointer to the first occurrence of the separator character (in this case \n) within the first bytes_read bytes of ptr, or NULL if it is not found.

The important part is that memchr is agnostic to the null-terminating character: if it encounters \0 while scanning a buffer, memchr will keep searching down the buffer until it either finds the separator or has read bytes_read bytes.

In the code, the developer likely expects that after calling memchr and writing a null character at the location of the first newline, calling strlen(ptr) will give the length of the string up until that newline. However, that might not always be the case. If we somehow had a buffer containing:

Hello\0World!\n

then after the substitution we get:

Hello\0World!\0

This means that strlen(ptr) gives 5, which is just the length of “Hello”, since it stops at the first null character. If the developer was expecting count = strlen(ptr) to capture the length of the full “Hello\0World!” string, then this would be a source of error.

In ConnMan, this difference causes a disagreement between what the pos and count values represent, eventually leading to a buffer overflow as the ptr buffer continues to be processed by the code.

Writing a query for CVE-2022-32292:

As mentioned in the motivation section, one of the potential applications of CodeQL is to write queries about bugs we are already aware of, and to use that knowledge to build a query base that can identify similar bugs in the same or a different codebase. Here, we can attempt to capture the vulnerable pattern covered by CVE-2022-32292 – a binary data vs. string length confusion – through a CodeQL query. In addition, I believe it provides a good example of how to connect multiple data flows together.

For this CVE, I came up with the following scheme:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow
//(StrlenCall comes from the standard CodeQL C/C++ library)

predicate inconsistentMemchrAndStrlen(FunctionCall memchr, StrlenCall strlen, string reason) {
    exists(
        DataFlow::Node source1,  DataFlow::Node sink1,
        DataFlow::Node source2,  DataFlow::Node sink2,
        Expr delim |
        memchr.getTarget().getName() = "memchr"
        and sink1.asExpr() = memchr.getArgument(0) //Memchr data buffer
        and sink2.asExpr() = strlen.getStringExpr() //Strlen string buffer

        //Flow #1: Data feeds into memchr call
        and DataFlow::localFlow(source1, sink1)
        
        //Flow #2: Data feeds into strlen call
        and DataFlow::localFlow(source2, sink2)

        //Flow #3: Both sources are referring to the same thing
        and (
            DataFlow::localFlow(source1, source2)
            or DataFlow::localFlow(source2, source1)
        ) 
        and delim = memchr.getArgument(1)
        //If delim is '\0', then it's safe (since it behaves the same as strlen)
        and not delim.getValueText().regexpMatch("'\\\\0'") 
    ) 
    and reason = "CVE-2022-32292: Usage of both memchr and strlen on the same char buffer"
}

from FunctionCall memchr, StrlenCall strlen, string reason
where 
    inconsistentMemchrAndStrlen(memchr, strlen, reason)
select memchr, strlen, reason

The pattern that the predicate is looking for is relatively simple. It detects all flows in which a piece of data is used as the incoming buffer for both a memchr and a strlen call. While such usage is not a vulnerability by itself, we can use it as an indicator of a potential mix-up between treating user data as a C string vs. a binary buffer. Once a finding is reported, we can then perform manual analysis to verify whether there is indeed a vulnerability.

As described in the CVE section, this bug logic would not have been expressible with a single data flow. While the same data buffer feeds into both the memchr and strlen function calls, the result of one call is not used by the other. In addition, there isn’t a strict chronological order of buffer -> memchr -> strlen or buffer -> strlen -> memchr. Thus there’s no easy way for us to write a single source-to-sink flow that encapsulates both function calls.

But we can describe it with three separate flows, as seen above:

  • One flow that describes buffer to memchr: source1 -> memchr(sink1),
  • Another flow that describes buffer to strlen: source2 -> strlen(sink2),
  • And one last flow that ensures the two buffers are essentially the same: source1 = source2

Possible improvements for the query

This query is not without its flaws. Currently it uses local (intra-procedural) data flow, meaning it will only find situations where both memchr and strlen are called in the same function. It also looks for direct data flow instead of taint tracking, so any code that modifies the buffer pointer (like a simple arithmetic operation) in between the two function calls may not yield a detection. Also, it uses basic name matching (.getName() = some_string) for function lookup, which as we outlined in Section 1 has its drawbacks. In addition, if we had a clearer idea of what general “bad usage” of memchr and strlen looks like when operating on the same buffer, we could add that to increase the specificity of our query.
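As one example, here is a hedged, untested sketch of what the taint tracking variant of flows #1 and #2 could look like – TaintTracking::localTaint tolerates intermediate modifications such as pointer arithmetic on the buffer:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking

predicate taintedMemchrAndStrlenBuffers(FunctionCall memchr, StrlenCall strlen) {
    exists(DataFlow::Node source, DataFlow::Node sink1, DataFlow::Node sink2 |
        memchr.getTarget().hasName("memchr")
        and sink1.asExpr() = memchr.getArgument(0)
        and sink2.asExpr() = strlen.getStringExpr()
        //The same source taints both buffers, even across pointer arithmetic:
        and TaintTracking::localTaint(source, sink1)
        and TaintTracking::localTaint(source, sink2)
    )
}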

However, as a whole, this example illustrates that you can combine simple (or even complex) flows to build greater context and capture more information in your query.

Trying a more complex example

For this exercise we will be working with a sample scenario. While the specifics of the flow aren’t as important, this query should hopefully serve as an example of working with a codebase-specific vulnerability where user-controlled data can come in at multiple points, and where the query you are designing describes situations in which the combined usage of those inputs might indicate a vulnerability.

Scenario: Imagine you have pinpointed a function-of-interest compute that takes in two parameters: a char buffer buf and an integer num. In your codebase, compute is a commonly used function that is fed some form of user-provided data. In addition, the result of compute is used in some important calls, such as a malloc(...) that determines the amount of memory to allocate for subsequent processing.

Say that you have an idea of a flow where user-controlled data can reach the first parameter buf, and another where it can reach the second parameter num. You can also describe situations where result = compute(buf, num) eventually feeds into malloc.

Now we want to understand if there’s any situation where a malicious actor can cause a bad invocation of malloc by providing specific inputs through compute. When approaching this query, we can first break it down into three individual flows:

  • Flow A: Find flow where user_data_src_1 -> compute(buf, ...)
  • Flow B: Find flow where user_data_src_2 -> compute(..., num)
  • Flow C: Find flow where compute(buf, num) eventually feeds into a malloc(...) call

Then the final query could be:

  • Final query: Find all calls to malloc(...) and compute(buf, num) that satisfy flow C. In addition, make sure that the compute(buf, num) function call MUST satisfy both flow A (for some source_1) and flow B (for some source_2).

By itself, any one of flows A, B, or C might not be enough to indicate a bug, but by combining all these factors together, we can specify a concrete scenario where buggy behavior does occur.

Let’s say the code looks something like this:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int get_user_num() {
    int res;
    size_t num;
    res = scanf("%zu", &num);
    if(res != 1) 
        return 0;
    return num;
}
size_t compute(char* buf, int num) {
    if(strlen(buf) > 32)
        return 1;
    return num;   
}
void* run_process(char *buf) {
    int num;
    size_t alloc_amount;
    num = get_user_num();
    if (num < 0)
        return NULL;
    alloc_amount = compute(buf, num);    
    printf("%zu\n", alloc_amount);
    void* ptr = malloc(alloc_amount); //Target line
    
    return ptr;
}
int main() {
    char buf[64] = "Some initial data";
    void *ptr = run_process(buf);
    //Do something else afterwards
    return 0;
}

A data flow query for that might look something like the following:

import cpp
import semmle.code.cpp.commons.Scanf
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.dataflow.new.TaintTracking

//Flow A: Find flow `user_data_src_1 -> compute(buf, ...)`
module DataToComputeBufConfig implements DataFlow::ConfigSig {
    predicate isSource(DataFlow::Node source) {
        exists(Parameter param |
            param = source.asParameter()
            and param.getType() instanceof PointerType
        )
    }
    predicate isSink(DataFlow::Node sink) {
        exists(
            FunctionCall func_call |
            func_call.getTarget().hasName("compute")
            and func_call.getArgument(0) = sink.asExpr()
        )
    }
}
module DataToComputeBuf = DataFlow::Global<DataToComputeBufConfig>;

//Flow B: Find flow where `user_data_src_2 -> compute(..., num)`
module DataToComputeNumConfig implements DataFlow::ConfigSig {
    predicate isSource(DataFlow::Node source) {
        exists(ScanfFunctionCall scanf_call, VariableAccess v_acc, Variable v |
            scanf_call.getAnOutputArgument() = source.asDefiningArgument() //Captures scanf(..., &num)
            and source.asDefiningArgument().getAChild*() = v_acc           //Resolves "&num" -> num
            and v_acc.getTarget() = v
            and v.getType() instanceof Size_t
        )
    }
    predicate isSink(DataFlow::Node sink) {
        exists(
            FunctionCall func_call |
            func_call.getTarget().hasName("compute")
            and func_call.getArgument(1) = sink.asExpr()
        )
    }
}
module DataToComputeNum = DataFlow::Global<DataToComputeNumConfig>;

//Main query: Combine Flow A, B, and C together
from
    FunctionCall compute_call,
    AllocationExpr alloc_call,
    Expr alloc_size, 
    DataFlow::Node user_data_src_1, DataFlow::Node compute_buf,
    DataFlow::Node user_data_src_2, DataFlow::Node compute_num
where
    compute_call.getTarget().hasName("compute")
    and compute_call.getArgument(0) = compute_buf.asExpr()
    and compute_call.getArgument(1) = compute_num.asExpr()
    and DataToComputeBuf::flow(user_data_src_1, compute_buf)    //user_data_src_1 -> compute(buf, ...)
    and DataToComputeNum::flow(user_data_src_2, compute_num)    //user_data_src_2 -> compute(..., num)
    and alloc_size = alloc_call.getSizeExpr()
    and TaintTracking::localExprTaint(compute_call, alloc_size) //Flow C: x = compute(...) -> alloc(x)
select user_data_src_1, user_data_src_2, compute_call, alloc_call

The specific constraints for each of the flows are not too important; they will likely vary a lot depending on the situation and your codebase. What I wanted to illustrate here is the ability to tie together multiple global data flow queries (along with a local taint tracking query) to describe a complex data flow chain:

  • The first piece of “user” data comes from the char pointer argument of run_process (in reality, it’s just a fixed string for our test case), which flows to the buf argument of compute.
  • In addition, we have scanf supplying an integer from standard input, which travels from get_user_num to run_process to the num argument of compute.
  • Finally, the result of compute is used inside malloc.

One important piece to note is that since DataFlow::Node can represent a wide variety of CodeQL types (such as Expr, Parameter, etc. – see the predicate documentation for DataFlow::Node), we need to make sure that the typing we use in the final query matches what was defined in the data flow predicates. For example, both the DataToComputeBuf and DataToComputeNum flows have sinks of type Expr.

However, if I wanted to place additional constraints on either flow’s sources, I would have needed to use .asParameter() for the DataToComputeBuf flow and .asDefiningArgument() for the DataToComputeNum flow. A lot of the time these types are mutually exclusive, so if you use asExpr() for everything, you may not see any results in your query.
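A brief sketch of what such source-side constraints could look like, reusing the two configuration modules above (the extra conditions are illustrative and tied to our sample code):

from DataFlow::Node buf_src, DataFlow::Node buf_sink, DataFlow::Node num_src, DataFlow::Node num_sink
where
    DataToComputeBuf::flow(buf_src, buf_sink)
    //This source was defined via asParameter(), so constrain it as a Parameter:
    and buf_src.asParameter().getName() = "buf"
    and DataToComputeNum::flow(num_src, num_sink)
    //This source was defined via asDefiningArgument(), so constrain it as an Expr:
    and num_src.asDefiningArgument() instanceof AddressOfExpr
select buf_src, num_src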

CodeQL Summary

CodeQL syntax is vast and descriptive. Over the few months I’ve spent learning CodeQL, I’ve had to navigate through much of GitHub’s CodeQL documentation (mainly for C/C++), gain familiarity with its rich but complex data analysis engines, and work with additional included libraries for query refinement. As such, I hope these last two sections have been useful in helping you get started on your security query writing journey.

In the following two sections of the blog series, I will be covering techniques I found helpful for working with Semgrep rules. These will follow a similar structure to the ones I’ve shared for CodeQL, specifically when it comes to function identification and data flow analysis. However, with Semgrep, we will be analyzing decompiled code rather than source code, which presents its own set of challenges and considerations.