Using CodeQL and Semgrep to Assist Vulnerability Research (Part 1 of 6)
March 5th, 2025 by Brian
Huy Dai was previously a summer intern from MIT and has since graduated to join the Caesar Creek Software team in Woburn, MA. During his internship, he performed a security assessment of the Peloton Bike and, upon joining CC-SW full-time, he has conducted research using CodeQL and Semgrep to aid in vulnerability research.
Motivation
At Caesar Creek Software, one of our areas of expertise is performing vulnerability research on embedded devices. Due to the large amount of code in modern embedded devices, automation in Vulnerability Research (VR) is an important aspect of our work. Finding smart ways to reduce tedious tasks and find bugs more quickly is hugely beneficial to our VR process.
In the past couple of months, I’ve investigated using CodeQL and Semgrep as complementary SAST (Static Application Security Testing) tools for bug-hunting. In particular, I’ve looked at the following use cases:
1. CodeQL: To analyze open-source libraries and other software for which we have full source and build scripts
2. Semgrep + Ghidra/IDA: To decompile binaries whose source is unavailable to us and scan them for vulnerabilities
The appeal of tools such as CodeQL and Semgrep is that they allow you to write queries for common bug classes (such as buffer overflow, command injection, integer overflow, etc.), and quickly re-run them across multiple codebases. The idea is that once you have written a query/rule for a given bug pattern, you can scan for similar vulnerable code sections (e.g. identify sister bugs) either in the same or different project, without having to manually look for it yourself.
Goal
This post is the first of a six-part series covering CodeQL and Semgrep:
- In this first part, I will be highlighting a broad, generalizable query I wrote that was able to identify integer overflow CVEs across multiple open-source libraries
- In the second part, I will showcase a CodeQL query I wrote that targets a specific bug in BlueZ (a popular Bluetooth library) that led to identifying an unpatched bug in the same library
- For parts 3-4, I will dive deep into CodeQL and strategies for writing generalizable queries
- For parts 5-6, I will dive into similar strategies for writing effective Semgrep rules
This blog series is coming out as part of the research effort I’m doing at Caesar Creek Software, where I focus on writing CodeQL queries and Semgrep rules aimed at covering a specific bug class, then I evaluate their effectiveness by using previously known CVEs as a ground truth.
Despite their relatively simple construction, my queries and rules were able to generate hits on multiple CVEs across different libraries. When optimizing my queries, I focused on specificity (the ability of a query to detect a specific bug with few false positives) and generalizability (the ability of a query to detect other bugs like it within the same codebase and in different ones).
Common Bug Type: Integer Overflow to Malloc
First, I will be covering a query aimed at describing a general bug type: An int overflow to malloc.
When looking through the list of vulnerabilities found in open-source libraries such as libcurl, libTIFF, BlueZ, connman, etc. in the past couple of years, one common bug pattern I noticed was an integer overflow leading to malloc(). For example, char *buffer = malloc(var * 2) could be problematic in situations where var is user-controlled and may be greater than SIZE_MAX / 2, which would lead to an integer overflow. In such a case, the actual allocated size would be smaller than expected, and that could lead to out-of-bounds accesses down the road when the buffer is used.
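The wraparound itself is easy to reproduce in isolation. The following is a minimal sketch, not code from any of the libraries discussed here; the function names are hypothetical and chosen only for illustration. It shows that doubling a size_t just above SIZE_MAX / 2 wraps to a tiny value, and how a guard can refuse the request instead of under-allocating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Unsigned arithmetic wraps modulo SIZE_MAX + 1, so this can return a
 * value far smaller than the caller intended. */
size_t doubled_size_unchecked(size_t len) {
    return len * 2;
}

/* Returns 1 when len * 2 cannot be represented in a size_t. */
int doubling_would_overflow(size_t len) {
    return len > SIZE_MAX / 2;
}

/* Hypothetical guarded allocator: refuses the request instead of
 * silently allocating a buffer that is too small. */
void *alloc_doubled(size_t len) {
    if (doubling_would_overflow(len))
        return NULL;
    return malloc(len * 2);
}
```

With len equal to SIZE_MAX / 2 + 1, the unchecked product is 0, so a later write of len * 2 "intended" bytes would land far outside the allocation. The guarded version returns NULL for the same input.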
Writing the CodeQL Query
Given that we have access to the source and build scripts for many of these open-source libraries, I started with trying to describe this bug using CodeQL. My predicate query looks as follows:
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
import semmle.code.cpp.rangeanalysis.SimpleRangeAnalysis
import semmle.code.cpp.valuenumbering.GlobalValueNumbering //Provides globalValueNumber
import semmle.code.cpp.controlflow.IRGuards
import CCSW_Helper //For custom class definitions

//Check if a given size expression represents a multiplication
//ex. int copy_len = (user_len*2)
predicate representSize(Expr size, CustomMulExpr mul) {
  size = mul
  or DataFlow::localExprFlow(mul, size)
}

//Main predicate
predicate allocSizeMightOverflow(CCSW_Helper::AllocCall alloc, string reason) {
  exists(
    Expr size, CustomMulExpr mul_expr |
    size = alloc.getSize()
    and representSize(size, mul_expr)
    and convertedExprMightOverflowPositively(mul_expr)
    and not exists( //A guard condition which explicitly checks for the size expression
      GuardCondition gc, RelationalOperation compare |
      gc.valueControls(alloc.getBasicBlock(), _)
      and
      (
        //Two ways of checking this. Either:
        //1: One of the variables in the size expression is involved in the check
        (
          gc.getAChild*() = compare
          and DataFlow::localExprFlow(compare.getAChild(), size.getAChild())
        )
        or
        //2: The size expression, or an equivalent expression, is checked
        (
          globalValueNumber(gc.getAChild*()) = globalValueNumber(size)
        )
      )
    )
  )
  and reason = "CVE-2018-14618: Alloc is given an expression of form (user_len* 2) that might integer overflow"
}
I originally wrote the query to target CVE-2018-14618, a 9.8 (Critical) rated vulnerability in the curl library. Taking a look at the patch commit, we can see that it revolves around a possible integer overflow when using the result of strlen(...) * 2 inside malloc.
 CURLcode Curl_ntlm_core_mk_nt_hash(struct Curl_easy *data,
                                    const char *password,
                                    unsigned char *ntbuffer /* 21 bytes */)
 {
   size_t len = strlen(password);
-  unsigned char *pw = len ? malloc(len * 2) : strdup(""); //BAD: Integer overflow
+  unsigned char *pw;
   CURLcode result;
+  if(len > SIZE_T_MAX/2) /* avoid integer overflow */
+    return CURLE_OUT_OF_MEMORY;
+  pw = len ? malloc(len * 2) : strdup("");
   if(!pw)
     return CURLE_OUT_OF_MEMORY;
   ...
 }
Surprisingly, when I re-ran this predicate on the rest of the curl library and on other open-source libraries, it also detected six similar CVEs:
- CVE-2017-8816 (curl) – Hit
- CVE-2019-5435 (curl) – Hit
- CVE-2020-12762 (json-c) – Hit
- CVE-2022-22824 (Libexpat) – Modified
- CVE-2022-22827 (Libexpat) – Modified
- CVE-2021-46143 (Libexpat) – Modified
Note: The “Modified” label next to the Libexpat entries indicates that I had to make additional modifications to the query to properly detect those bugs.
Specifically, libexpat, a popular XML parser library, does not always call the default malloc and realloc functions. Instead, its allocation macros call through function pointers that may point to custom allocation implementations.
// Inside lib/xmlparse.c
#define MALLOC(parser, s) (parser->m_mem.malloc_fcn((s)))
#define REALLOC(parser, p, s) (parser->m_mem.realloc_fcn((p), (s)))
Taking a closer look at the allocSizeMightOverflow CodeQL query, you may have noticed that I try to account for this case through the custom classes CCSW_Helper::AllocCall and CustomMulExpr. Due to slight differences in function representation, I had to specify additional constraints to properly detect all allocation calls and multiplication expressions across different libraries:
//Representation of an allocation function call
//(either malloc or realloc)
//Could be either direct or wrapped
class AllocCall extends Expr {
  AllocCall() {
    this instanceof AllocationExpr
    //Curl
    or this.(VariableCall).getVariable().getName().regexpMatch(".*malloc")
    //Libexpat
    or this.(VariableCall).getVariable().getName().regexpMatch("malloc_.*")
    or this.(VariableCall).getVariable().getName().regexpMatch("realloc_.*")
  }

  Expr getSize() {
    result = this.(AllocationExpr).getSizeExpr()
    or (
      this.(VariableCall).getVariable().getName().regexpMatch(".*malloc.*")
      and result = this.(VariableCall).getArgument(0)
    ) or (
      this.(VariableCall).getVariable().getName().regexpMatch(".*realloc.*")
      and result = this.(VariableCall).getArgument(1)
    )
  }
}

/*
 * Custom class representing multiplication expressions
 * (allows for `a * b` and `a *= b`)
 */
class CustomMulExpr extends Expr {
  CustomMulExpr() {
    this instanceof MulExpr // a * b
    or this instanceof AssignMulExpr // a *= b //Needed for CVE-2021-46143
  }

  Expr getAnOperand() {
    result = this.(MulExpr).getAnOperand()
    or result = this.(AssignMulExpr).getAnOperand()
  }
}
Syntax Note: One interesting aspect is that while the #define preprocessor directive for realloc takes three arguments, REALLOC(parser, p, s), CodeQL sees the code after macro expansion and therefore evaluates the underlying expression (parser->m_mem.realloc_fcn((p), (s))). This is why we query the VariableCall for the allocation size at the second argument rather than the third.
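The argument shift can be seen in a small stand-alone C sketch. Only the two macro bodies are taken from the snippet quoted earlier; the struct names, the recording allocator, and the helper functions are hypothetical stand-ins rather than libexpat code. The parser argument only selects the function pointer, so the call that remains after preprocessing has two arguments, with the size at index 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator table held by a parser object. */
struct mem_suite {
    void *(*malloc_fcn)(size_t size);
    void *(*realloc_fcn)(void *ptr, size_t size);
};

struct demo_parser {
    struct mem_suite m_mem;
};

/* Same shape as the macros quoted above: the parser argument selects the
 * function pointer and is not passed on to the call. */
#define MALLOC(parser, s) (parser->m_mem.malloc_fcn((s)))
#define REALLOC(parser, p, s) (parser->m_mem.realloc_fcn((p), (s)))

/* Records the size argument that the call actually receives. */
static size_t last_realloc_size;

static void *recording_realloc(void *ptr, size_t size) {
    last_realloc_size = size;
    return realloc(ptr, size);
}

/* Grows a buffer through the macro. After preprocessing this is a
 * two-argument call through realloc_fcn, with the size at index 1. */
void *grow_buffer(struct demo_parser *parser, void *buf, size_t new_size) {
    return REALLOC(parser, buf, new_size);
}

size_t get_last_realloc_size(void) {
    return last_realloc_size;
}

/* Builds a parser whose realloc_fcn is the recording allocator. */
struct demo_parser make_demo_parser(void) {
    struct demo_parser p = { { malloc, recording_realloc } };
    return p;
}
```

Calling grow_buffer(&p, NULL, 64) on a parser from make_demo_parser() leaves 64 in the recorded size, which confirms that the third macro argument becomes the second call argument.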
For a more in-depth breakdown of the function identification process and of the function pointers and wrappers that can appear across different codebases, see my following blog posts on important CodeQL techniques for security engineers.
With these custom classes, we are now able to detect bugs that involve integer overflows to an allocation call in Libexpat, such as CVE-2022-22824:
static int
defineAttribute(ELEMENT_TYPE *type, ATTRIBUTE_ID *attId, XML_Bool isCdata,
                XML_Bool isId, const XML_Char *value, XML_Parser parser) {
  ...
  } else {
    DEFAULT_ATTRIBUTE *temp;
    int count = type->allocDefaultAtts * 2; //BUG: Integer overflow
    temp = (DEFAULT_ATTRIBUTE *)REALLOC(parser, type->defaultAtts,
                                        (count * sizeof(DEFAULT_ATTRIBUTE)));
    if (temp == NULL)
      return 0;
    type->allocDefaultAtts = count;
    type->defaultAtts = temp;
  }
  ...
}
along with CVE-2021-46143:
static enum XML_Error
doProlog(XML_Parser parser, const ENCODING *enc, const char *s, const char *end,
         int tok, const char *next, const char **nextPtr, XML_Bool haveMore,
         XML_Bool allowClosingDoctype, enum XML_Account account) {
  ...
    case XML_ROLE_GROUP_OPEN:
      if (parser->m_prologState.level >= parser->m_groupSize) {
        if (parser->m_groupSize) {
          {
            char *const new_connector = (char *)REALLOC(
                parser, parser->m_groupConnector, parser->m_groupSize *= 2); //BUG: Integer overflow
            if (new_connector == NULL) {
  ...
}
This type of customization is something that I come across often while trying to generalize my CodeQL queries. For the most part, once I have the main bug logic down (in this case the allocSizeMightOverflow predicate), I am able to make it work across multiple codebases by improving the function identification and adding edge cases rather than modifying the main bug description.
For example, there are a number of areas which we can add to this predicate to improve its generalizability:
- Expanding on integer overflows: Currently we are querying for multiplication expressions (var * SOME_CONSTANT) as a source for an integer overflow, but overflows can also occur in addition expressions (var + SOME_CONSTANT or var_1 + var_2). If we were to consider integer underflows, we could also expand our definition to include subtraction.
- Expanding on allocation functions: In addition to malloc and realloc, we can also consider the use of calloc or the new operator.
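To make the additional arithmetic forms concrete, here is a minimal C sketch of the size computations such an expanded query would be looking for, written in their guarded form. The helper names are hypothetical and do not come from any of the libraries above. An expression of the shape var * CONSTANT + CONSTANT can wrap in either the multiplication or the addition, and a subtraction can wrap below zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes a * b + c into *out.
 * Returns 0 on success, -1 if the multiplication or the addition
 * would wrap around SIZE_MAX. */
int checked_mul_add(size_t a, size_t b, size_t c, size_t *out) {
    if (b != 0 && a > SIZE_MAX / b)
        return -1;                 /* a * b would wrap */
    size_t product = a * b;
    if (product > SIZE_MAX - c)
        return -1;                 /* product + c would wrap */
    *out = product + c;
    return 0;
}

/* Computes a - b into *out.
 * Returns 0 on success, -1 if the result would underflow below zero. */
int checked_sub(size_t a, size_t b, size_t *out) {
    if (b > a)
        return -1;                 /* a - b would wrap to a huge value */
    *out = a - b;
    return 0;
}
```

For instance, checked_mul_add(10, 3, 1, &n) stores 31, while checked_mul_add(SIZE_MAX, 1, 1, &n) is rejected because only the final addition overflows, a case that a multiplication-only query would not consider.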
As we try CodeQL scans on more codebases, we can slowly expand these definitions to cover more cases. However, as with any query, we have to be careful to avoid over-generalization. If we were to start capturing every mathematical expression in the code as possible integer overflows, then the surface area for the query becomes too large. This in turn would lead to many false positives in our results, and increase the time needed for manual bug verification.
As such, we have to be careful in the way that we expand our query, and only focus on avenues which we think could help capture the most bugs without compromising too much on accuracy.
Translating to a Semgrep Rule
So far we’ve talked a lot about CodeQL, but can we achieve a similar result using Semgrep?
If you’ve read guides discussing the difference between CodeQL and Semgrep, you will find that Semgrep syntax tends to be easier to learn, as it more closely resembles advanced grep syntax, hence its name. This is in contrast to CodeQL’s implementation of a custom extractor for each language it supports, which leads to different query syntax depending on whether you are working with C/C++, Java, Python, etc. However, a tradeoff of that design is that Semgrep captures syntax information with less granularity and accuracy compared to CodeQL, and it does not have as powerful data flow/taint analysis.
That said, I believe that Semgrep rules can achieve many of the same results as an equivalent CodeQL query. Specifically, I looked at a particularly challenging workflow where I compile the binaries and linked libraries (.so files) generated by the open-source libraries above, decompile them with Ghidra/IDA Pro, and then scan the resulting pseudo C/C++ code using Semgrep.
This process is useful in scenarios where security engineers do not have full source and build environments for the targets they are looking at, and instead must depend on looking through decompiled code for potential vulnerabilities.
Using the structure and logic I had in the CodeQL query, I was able to replicate it in Semgrep as follows:
rules:
  - id: alloc_size_might_overflow
    languages:
      - c
      - cpp
    message: |
      The size expression given to malloc is of some form (user_len * CONSTANT), where
      it may be possible that an integer overflow might occur in the multiplication
    severity: WARNING
    options:
      symbolic_propagation: true
    patterns:
      - pattern-either:
          - patterns:
              - pattern-either:
                  - pattern: $ALLOC(<... $VAR_1 * $VAR_2 ...>, ...)
                  - pattern: $ALLOC(<... $VAR_2 * $VAR_1 ...>, ...)
                  - pattern: $ALLOC(<... $VAR_1 << $VAR_2 ...>, ...)
              - metavariable-pattern:
                  metavariable: $ALLOC
                  patterns:
                    - pattern-either:
                        - pattern: $MALLOC
                        - pattern: (*$MALLOC) #Custom function pointers
                    - metavariable-regex:
                        metavariable: $MALLOC
                        regex: '(?i)^.*malloc\w*\s*$'
          - patterns:
              - pattern-either:
                  - pattern: $ALLOC($_, <... $VAR_1 * $VAR_2 ...>, ...)
                  - pattern: $ALLOC($_, <... $VAR_2 * $VAR_1 ...>, ...)
                  - pattern: $ALLOC($_, <... $VAR_1 << $VAR_2 ...>, ...)
              - metavariable-pattern:
                  metavariable: $ALLOC
                  patterns:
                    - pattern-either:
                        - pattern: $REALLOC
                        - pattern: (*$REALLOC) #Custom function pointers
                    - metavariable-regex:
                        metavariable: $REALLOC
                        regex: '(?i)^.*realloc\w*\s*$'
      - metavariable-comparison:
          metavariable: $VAR_2 #We define $VAR_2 to be some fixed-constant
          comparison: $VAR_2 > 0
      # Look for if conditions that check on the value of VAR_1 * VAR_2
      # Consider allocation both inside the if body and after.
      - pattern-not-inside: |
          if(<... $VAR_1 < $VALUE ...>) {
            ...
          }
          ...
      - pattern-not-inside: |
          if(<... $VAR_1 * $VAR_2 < $VALUE ...>) {
            ...
          }
          ...
      - pattern-not-inside: |
          if(<... $VAR_1 << $VAR_2 < $VALUE ...>) {
            ...
          }
          ...
      - pattern-not-inside: |
          if(<... $VAR_1 > $VALUE ...>) {
            ...
          }
          ...
Here we’re using Semgrep’s experimental symbolic propagation feature to help track the flow of a multiplication expression (e.g. var * SOME_CONSTANT) to the size argument of malloc and realloc. While not perfect, it helps us extend the contextual window of our Semgrep rule to track situations such as:
lVar2 = sVar1 * 2;
data = (void *)(*(code *)Curl_cmalloc)(lVar2)
where the value of the expression sVar1 * 2 will be propagated to its usage inside Curl_cmalloc(...).
Overall, from our testing, the Semgrep rule was able to identify six out of the seven vulnerabilities from before, even when scanning the decompiled pseudocode from these binaries. We can see this from the Semgrep command line results:
CVE-2018-14618 (curl)
examples/curl_7.54/Curl_ntlm_core_mk_nt_hash@001486b0.c ❯❱ rules.alloc_size_might_overflow
16┆ data = (void *)(*(code *)Curl_cmalloc)(sVar2 * 2);
CVE-2017-8816 (curl)
examples/curl_7.54/Curl_ntlm_core_mk_ntlmv2_hash@00148810.c ❯❱ rules.alloc_size_might_overflow
19┆ lVar4 = (*(code *)Curl_cmalloc)(lVar7);
CVE-2019-5435 (curl)
examples/curl_7.64/curl_url_set@001572f0.c ❯❱ rules.alloc_size_might_overflow
322┆ __s = (char *)(*(code *)Curl_cmalloc)(sVar7 * 3 + 1);
CVE-2022-22824 (libexpat)
examples/libexpat_2.4/defineAttribute@001045f0.c ❯❱ rules.alloc_size_might_overflow
40┆ (*(parser->m_mem).realloc_fcn)(type->defaultAtts,(long)(iVar1 * 2) * 0x18);
CVE-2022-22827 (libexpat)
examples/libexpat_2.4/storeAtts@0010c370.c ❯❱ rules.alloc_size_might_overflow
90┆ local_e0 = (ATTRIBUTE *)(*(parser->m_mem).realloc_fcn)(parser->m_atts,(long)iVar15 << 5);
⋮┆----------------------------------------
188┆ pNVar24 = (NS_ATT *)(*(parser->m_mem).realloc_fcn)(parser->m_nsAtts,(long)iVar17 * 0x18);
CVE-2021-46143 (libexpat)
examples/libexpat_2.4/doProlog@0010f000.c ❯❱ rules.alloc_size_might_overflow
966┆ pcVar19 = (char *)(*(parser->m_mem).realloc_fcn)
967┆ (parser->m_groupConnector,(ulong)(uVar4 * 2));
The vulnerability that Semgrep wasn’t able to detect was CVE-2020-12762 in json-c, a popular JSON library. On closer inspection, I believe the scan failed because Semgrep’s symbolic propagation didn’t register that a potential value for iVar3 could be p->size * 2 due to the if condition in between. Even if we remove the guard condition requirements from our Semgrep rule, it still fails to register a hit.
int printbuf_memappend(printbuf *p,char *buf,int size)
{
  if (p->size <= iVar2) {
    iVar3 = p->size * 2; //Integer overflow
    if (iVar3 <= iVar2 + 8) {
      iVar3 = iVar2 + 9;
    }
    __ptr = (char *)realloc(__ptr,(long)iVar3); //Alloc call
    if (__ptr == (char *)0x0) {
      return -1;
  ...
}
That said, this is still a great result given that Semgrep is working with much more limited information and with the uncommon code patterns that we often see in decompiled code. Given its support for regex patterns for function names, we were able to easily search for malloc, realloc, malloc_fcn, Curl_cmalloc, etc. calls without needing any prior knowledge of the codebase.
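To probe the json-c miss described above in isolation, one option is a minimal pair of C functions like the sketch below. The function names and the extra used output parameter are hypothetical, and size_t replaces the int seen in the decompiled output so that the arithmetic stays well-defined. The first function has the straight-line shape that symbolic propagation handled in the curl example. The second reassigns the size inside an if before the allocation, so the size argument has two possible definitions. Whether a given rule flags the second function is exactly what the json-c case calls into question, so no expected scan result is claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Straight-line flow: the product is assigned once and passed on,
 * the same shape as lVar2 = sVar1 * 2 followed by the allocation. */
void *alloc_straight(size_t len, size_t *used) {
    size_t size = len * 2;
    *used = size;                  /* report the size that was requested */
    return malloc(size);
}

/* Branching flow: the product may be replaced before the allocation,
 * the same shape as the json-c excerpt. */
void *alloc_branching(size_t len, size_t min_size, size_t *used) {
    size_t size = len * 2;         /* first definition: the product */
    if (size <= min_size + 8) {
        size = min_size + 9;       /* second definition: replaces the product */
    }
    *used = size;                  /* report the size that was requested */
    return malloc(size);
}
```

Saving these functions to a file and scanning it with the rule gives a quick way to see which data-flow shapes the rule follows before pointing it at a full decompiled binary.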
Summary
So far, we’ve discussed the idea of writing a general query targeting a common bug class that can be run across multiple codebases.
For these queries, we’re interested in capturing high-level code patterns that will be seen across many different projects. As an example, I provided my CodeQL query and Semgrep rule that target an integer overflow leading to a malloc call. By improving on aspects such as function identification, I was able to generalize the query to identify similar-looking CVEs across three open-source libraries. These results help demonstrate the flexibility and potential of both CodeQL and Semgrep as SAST tools that we can use to find common, high-level vulnerabilities.
In the next installment of this blog, I want to focus on another use of CodeQL and Semgrep, which is to write codebase-specific bug patterns that we can use to find sister bugs within the same code project. By trading off generalizability for specificity, we can come up with bug queries that allow us to find similar versions of specific vulnerabilities. As an example, I will showcase a query I wrote for a BlueZ buffer overflow CVE that led me to finding an unpatched bug in the library!