Using CodeQL and Semgrep to Assist Vulnerability Research (Part 2 of 6)

March 6th, 2025 by Brian

In the first blog post, I provided examples of a broadly generalizable bug query (written in both CodeQL and Semgrep) that was able to identify previously known integer overflow CVEs across libraries such as libcurl, json-c, and libexpat. The simple construction of that query, and its ability to operate across multiple codebases, showcases the potential of CodeQL and Semgrep as scanning tools for high-level bugs.

However, another interesting use case for CodeQL and Semgrep is to write codebase-specific queries that focus only on vulnerable patterns in a given project. For example, once we identify a bug in one component of a program, we could try writing a CodeQL query or Semgrep rule for it to see if it appears in other components of the same program. Depending on the project, if one developer (or a team of developers) makes a mistake in one function, it is quite likely that similar mistakes appear in other parts of the code.

In this section of the blog series, I want to highlight one such example of a codebase-specific query I wrote targeting a heap-based buffer overflow in BlueZ. While the query was originally intended to be a one-off query for BlueZ, written as an exercise in data flow, I was later able to use it to find an unpatched bug in the library.

Targeting BlueZ `CVE-2023-50229` (Heap-based buffer overflow)

Initially I was looking to write a query for CVE-2023-50229, a heap-based buffer overflow (7.1 HIGH) vulnerability in BlueZ, a very popular Bluetooth library.

In particular, I was interested in the following code snippet:

static void read_version(struct pbap_data *pbap, GObexApparam *apparam)
{
	const guint8 *data;
	uint8_t value[16];
	gsize len;
	...
	if (memcmp(pbap->primary, data, len)) {
		memcpy(pbap->primary, data, len);
		g_dbus_emit_property_changed(conn,
				obc_session_get_path(pbap->session),
				PBAP_INTERFACE, "PrimaryCounter");
	}
	...
	if (memcmp(pbap->secondary, data, len)) {
		memcpy(pbap->secondary, data, len);
		g_dbus_emit_property_changed(conn,
				obc_session_get_path(pbap->session),
				PBAP_INTERFACE, "SecondaryCounter");
	}
}

Here, data and len are user-controllable, so a user may be able to provide a property that is longer than 16 bytes. In that case, memcmp reads beyond the boundary of pbap->primary or pbap->secondary during the comparison (since the number of bytes is controlled by len), and in the likely case that the bytes differ, memcpy then writes out of bounds. Thus, we are able to trigger a heap-based buffer overflow.
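To make the effect concrete, here is a minimal sketch that simulates the same compare-then-copy shape without invoking undefined behavior. It is not BlueZ code: the 32-byte arena, the names unchecked_update and neighbour_clobbered, and the 0x41 fill byte are all assumptions chosen for illustration. The first 16 bytes of the arena stand in for pbap->primary, and the remaining 16 stand in for whatever happens to sit next to it on the heap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIELD_SIZE 16  /* size of the counter field in the snippet above */
#define ARENA_SIZE 32  /* hypothetical: the field plus 16 neighbouring bytes */

/* Same shape as the vulnerable code: compare, then copy, both sized by len. */
static void unchecked_update(uint8_t *field, const uint8_t *data, size_t len)
{
	if (memcmp(field, data, len))
		memcpy(field, data, len);
}

/*
 * Returns 1 if an update of len bytes modified anything past the first
 * FIELD_SIZE bytes of the arena, 0 if it did not, and -1 if len is larger
 * than the arena (rejected so the simulation stays inside one real object).
 */
int neighbour_clobbered(size_t len)
{
	uint8_t arena[ARENA_SIZE];
	uint8_t data[ARENA_SIZE];
	size_t i;

	if (len > ARENA_SIZE)
		return -1;

	memset(arena, 0x00, sizeof(arena)); /* field and neighbour start zeroed */
	memset(data, 0x41, sizeof(data));   /* stand-in for attacker-supplied bytes */

	unchecked_update(arena, data, len);

	for (i = FIELD_SIZE; i < ARENA_SIZE; i++)
		if (arena[i] != 0x00)
			return 1;
	return 0;
}
```

With len equal to 16 the neighbouring bytes are untouched, while any larger len overwrites them, which is the out-of-bounds write described above in miniature.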

We can see in the patch commit that they avoid this issue by requiring len to match the size of the destination buffer before the comparison runs:

-   if (memcmp(pbap->primary, data, len)) {
+   if (len == sizeof(pbap->primary) && memcmp(pbap->primary, data, len)) {  
        ...
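As a rough, self-contained sketch of the patched pattern (not the actual BlueZ source), the check can be exercised in isolation. The struct counters type and the update_primary function are hypothetical stand-ins for the relevant part of struct pbap_data and read_version; only the condition itself is taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COUNTER_SIZE 16

/* Hypothetical stand-in for the counter field of struct pbap_data. */
struct counters {
	uint8_t primary[COUNTER_SIZE];
};

/*
 * Returns 1 if the counter was updated, 0 if it was left untouched,
 * either because len is not exactly the field size or because the
 * bytes already match.
 */
int update_primary(struct counters *c, const uint8_t *data, size_t len)
{
	/* The length check short-circuits, so memcmp never sees a bad len. */
	if (len == sizeof(c->primary) && memcmp(c->primary, data, len)) {
		memcpy(c->primary, data, len);
		return 1;
	}
	return 0;
}
```

Because the length test comes first in the && expression, an oversized len is rejected before either memcmp or memcpy touches the buffer.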

There are a few ways that you could go about trying to encapsulate this type of bug in a CodeQL query or Semgrep rule. One possible root-cause explanation for this bug is that len is a user-controllable variable that is not bounded before being used as the size parameter of memcmp and/or memcpy, which can lead to an OOB read or OOB write.

Since I already had a CodeQL query covering a memcpy size parameter that is too large or unbounded, I decided to take a slightly different approach. In the following predicate, I captured the bug pattern as a “misleading” memcmp call right before a memcpy. Normally memcmp reports whether two buffers hold the same bytes, but when its size parameter is too large and it reads out of bounds, it will most likely return a nonzero value (not equal) and cause an unintended write.

// There is a `memcmp` in an if-condition prior to a `memcpy`.
// The intention is that if the two buffers are not equal,
// then the copy operation is performed. However, the length/size
// parameter passed to both functions is not checked, and if it is
// greater than the intended buffer sizes, we could have an OOB write.

//CVE-2023-50229
predicate mayBeMisleadingMemcmpCheck(CCSW_Helper::MemcmpCall memcmp, CCSW_Helper::MemcpyCall memcpy, string reason) {
    exists (
        GuardCondition gc |
        //Memcmp done as a guard
        gc.getAChild*() = memcmp
        //That leads to a subsequent memcpy operation
        and gc.controls(memcpy.getBasicBlock(), _)
        //And memcmp and memcpy operate on the same set of variables
        and globalValueNumber(memcmp.getABuffer()) = globalValueNumber(memcpy.getDst())
        and globalValueNumber(memcmp.getABuffer()) = globalValueNumber(memcpy.getSrc())
        and globalValueNumber(memcmp.getLen()) = globalValueNumber(memcpy.getLen())
        //And there isn't a proper check on len to prevent an OOB access
        and not exists(GuardCondition gc2 |
            gc2 != gc
            and gc2.controls(memcpy.getBasicBlock(), _)
            and globalValueNumber(gc2.getAChild*()) = globalValueNumber(memcpy.getLen().getAChild*())
        )
    )
    and reason = "CVE-2023-50229: Memcmp is done prior to memcpy operation, but len isn't checked, which may lead to false negative in memcmp and OOB write in the memcpy"
}

from CCSW_Helper::MemcmpCall memcmp, CCSW_Helper::MemcpyCall memcpy, string reason
where mayBeMisleadingMemcmpCheck(memcmp, memcpy, reason)
select memcmp, memcpy, reason

Finding the Unpatched Bug

When I ran this query on an affected BlueZ version, I was able to flag the two vulnerable calls in read_version with the CodeQL VS Code extension. In addition, I also got an unexpected result inside a function named read_databaseid:

static void read_databaseid(struct pbap_data *pbap, GObexApparam *apparam)
{
	const guint8 *data;
	guint8 value[16];
	gsize len;

	if (!(pbap->supported_features & DATABASEID_FEATURE))
		return;

	if (!g_obex_apparam_get_bytes(apparam, DATABASEID_TAG, &data, &len)) {
		len = sizeof(value);
		memset(value, 0, len);
		data = value;
	}
	//The same vulnerability!
	if (memcmp(data, pbap->databaseid, len)) {
		memcpy(pbap->databaseid, data, len);
		g_dbus_emit_property_changed(conn,
				obc_session_get_path(pbap->session),
				PBAP_INTERFACE, "DatabaseIdentifier");
	}
}

It seems to sport the same exact vulnerability as in read_version! When checking out the original source file for this, I found that both functions are directly next to each other in bluez/obexd/client/pbap.c. In fact, they are both called by the same function, read_return_apparam:

static void read_return_apparam(struct obc_transfer *transfer,
					struct pbap_data *pbap,
					guint16 *phone_book_size,
					guint8 *new_missed_calls)
{
	...
	g_obex_apparam_get_uint16(apparam, PHONEBOOKSIZE_TAG,
							phone_book_size);
	g_obex_apparam_get_uint8(apparam, NEWMISSEDCALLS_TAG,
							new_missed_calls);
	read_version(pbap, apparam);   //PATCHED
	read_databaseid(pbap, apparam);//UNPATCHED
}

However, at the time of writing, the latest version of pbap.c in the BlueZ repository still contains the unpatched implementation of read_databaseid!
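For illustration only, here is a sketch of how the same length check from the read_version patch might be carried over to the Database ID logic. This is not an upstream fix and not BlueZ code: read_databaseid_checked is a hypothetical name, the databaseid field is assumed to be 16 bytes like the local value buffer, and a NULL param is used as a stand-in for g_obex_apparam_get_bytes failing to find the tag.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATABASEID_SIZE 16 /* assumed size of the databaseid field */

/*
 * Returns 1 if databaseid was updated, 0 if it already matched, and -1 if
 * the supplied length was rejected. param == NULL models the case where
 * the tag is absent and the all-zero default is used instead.
 */
int read_databaseid_checked(uint8_t databaseid[DATABASEID_SIZE],
				const uint8_t *param, size_t param_len)
{
	uint8_t value[DATABASEID_SIZE];
	const uint8_t *data = param;
	size_t len = param_len;

	if (data == NULL) {
		len = sizeof(value);
		memset(value, 0, len);
		data = value;
	}

	/* Hypothetical added check, mirroring the read_version patch. */
	if (len != DATABASEID_SIZE)
		return -1;

	if (memcmp(data, databaseid, len)) {
		memcpy(databaseid, data, len);
		return 1;
	}
	return 0;
}
```

Note that the default path still works under this check, since the fallback sets len to sizeof(value), which matches the assumed field size.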

Looking Deeper into the History

So what’s happening here? Digging deeper into the BlueZ commit history and CVE list, I found that a CVE entry does seem to have been created for this bug: CVE-2023-51596. Although it does not mention any specific function name, this CVE entry has the exact same description as the two found in read_version (i.e., CVE-2023-50229 and CVE-2023-50230):

BlueZ Phone Book Access Profile Heap-based Buffer Overflow Remote Code Execution Vulnerability. This vulnerability allows network-adjacent attackers to execute arbitrary code on affected installations of BlueZ. User interaction is required to exploit this vulnerability in that the target must connect to a malicious Bluetooth device. The specific flaw exists within the handling of the Phone Book Access profile. The issue results from the lack of proper validation of the length of user-supplied data prior to copying it to a fixed-length heap-based buffer. An attacker can leverage this vulnerability to execute code in the context of root.

And we know that the latter two (CVE-2023-50229 and CVE-2023-50230) correspond to read_version because of this patch notice:

CVE-2023-50229: Fixed an out of bounds write in the primary version counter for the Phone Book Access Profile implementation.

CVE-2023-50230: Fixed an out of bounds write in the secondary version counter for the Phone Book Access Profile implementation.

Interestingly, the CVE number for the bug in read_databaseid is numerically higher than the ones in read_version, so it’s possible that it was identified or verified much later and, as a result, was never patched. As far as we could tell, there don’t seem to be any additional protections on the Database ID field, beyond those on the Primary Counter or Secondary Counter, that would prevent a buffer overflow.

Takeaways

For our finding with read_databaseid, the actual design of the CodeQL query was not that important, as the “sister bug” in the codebase had the exact same pattern as the original in read_version. The important takeaway is that by writing a CodeQL query focused on a specific vulnerability identified in the project, I was able to quickly identify another instance of the bug without having to look for it manually. This workflow helps illustrate the idea that I proposed at the start of this section, which was to leverage CodeQL (and by extension, Semgrep) to find multiple project-specific bugs that normally don’t exist in other libraries.

It’s possible to imagine other real-world codebases where this pattern of iterative query writing and vulnerability research would come in handy. I envision a workflow where, as we learn more about a given codebase and get a feel for what its vulnerabilities look like, a security engineer writes patterns for them and scans through all code files to see if other similar instances exist.

Even if the pattern we wrote isn’t necessarily generalizable or applicable to many other projects, it can be a powerful strategy for leveraging existing knowledge about a given codebase to find more bugs, and that is the added value of using SAST tools in our vulnerability research.

Looking Forward

So far I’ve provided examples of queries which I believe best showcase the potential in bug-hunting that we can achieve with CodeQL and Semgrep. In the following two sections of this blog (part 3 and part 4), I will be going more in-depth into how to write CodeQL queries, specifically strategies on how to make your queries more generalizable and effective.

Work conducted by Huy Dai.
