Semgrep Guide for a Security Engineer (Part 6 of 6)
March 13th, 2025 by Brian
In the previous blog post, I started sharing tips and techniques for Semgrep rules. I continue that here in the final installment of my six part blog series, sharing some of the most helpful techniques I learned to decrease false positives in my Semgrep rules. Most often, whenever we run a Semgrep scan, we need to look through each reported finding and verify if it is a bug. I consider false positive findings to be those where our Semgrep rule reports a code section that turns out to not actually be a vulnerability.
Depending on the bug class and the complexity of the codebase, this manual verification process can be time-consuming. As a result, we want to limit the number of findings from the start, especially if we know the kind of code checks or constraints that would most commmonly be used to prevent these bugs.
In this part of the guide, I will be focusing two types of constraints:
- Requiring a metavariable to be of a non-constant value, and
- Requiring a variable does not have a prior bound check condition
4. Checking if a Metavariable is not a Constant
When writing a Semgrep rule, I’ve often found it useful to be able to specify that requirement where a metavariable I’m interested in is not representing a constant value. For example, when when searching for memcpy
calls, I would write it in the form of:
- pattern: $MEMCPY($DST, $SRC + $OFFSET, $LEN)
In some cases, I want to filter out results where SRC
and OFFSET
are known to be fixed values. This can be helpful for scenarios where I want to discover bugs where the source pointer is offset by a variable whose value (depending on the code flow) that be greater than the buffer boundaries, leading to out-of-bound reads.
Using Semgrep’s constant propagation, it is simple enough to provide a constraint inside metavariable-comparison
to check if the value represented by the metavariable satisfies some boundary (it is equal to 0 or greater than 1).
- metavariable-comparison:
metavariable: $OFFSET
comparison: $OFFSET > 1 #Or some other number
However, this pattern is almost the opposite of what we’re looking for, since we want to find situations where a bound can’t be determined (e.g. the metavariable expression contains some variable value).
For example, take this example where there are two memcpy
calls, one of which has a fixed source pointer offset, and another with a variable offset:
int main() {
char src[32];
read(STDIN_FILENO, src, sizeof(src));
char* dst = malloc(64);
memcpy(dst, src + 3, 29); //Safe
printf("%s\n", dst);
int offset;
if(scanf("%d", &offset) != 1)
return -1;
memcpy(dst, src + offset, 24); //Unsafe
printf("%s\n", dst);
}
By compiling this C program using gcc
, analyze it with Ghidra, and export the decompiled code using the Haruspex script, we get a main@001011e9.c
that looks like the following:
undefined8 main(void)
{
int iVar1;
undefined8 uVar2;
long in_FS_OFFSET;
int local_44;
char *local_40;
undefined1 local_38 [40];
long local_10;
local_10 = *(long *)(in_FS_OFFSET + 0x28);
read(0,local_38,0x20);
local_40 = (char *)malloc(0x40);
memcpy(local_40,local_38 + 3,0x1d);
puts(local_40);
iVar1 = __isoc99_scanf(&DAT_00102004,&local_44);
if (iVar1 == 1) {
memcpy(local_40,local_38 + local_44,0x18);
puts(local_40);
uVar2 = 0;
}
else {
uVar2 = 0xffffffff;
}
...
}
As you can see, the compiler adds a number of constructs and type casts, alongside some function conversions as mentioned in section 2. For the most part, though, the flow remains the same. For this test, we want for our Semgrep rule to only flag the second memcpy
call.
From my testing, I was able to add this constraint through the use of metavariable-comparison
, where I used a Python regex expression (not PCRE2 used in the previous sections). Here, it’s ascertaining that the value of $PTR
and $OFFSET
are not known to be either a base-10 number or a hex value.
rules:
- id: memcpy_src_offset_is_not_checked
languages:
- c
- cpp
message: |
The source of memcpy has a variable offset which may result in an OOB read
severity: WARNING
mode: search
options:
symbolic_propagation: true
patterns:
- pattern: $MEMCPY($DST, $SRC, $LEN, ...) #Note: memmove and memccpy has fourth argument
- pattern-not: $MEMCPY($DST, $SRC, sizeof($SRC), ...)
- metavariable-regex:
metavariable: $MEMCPY
regex: '(?i)^\w*mem(c)?cpy\w*\s*$|^bcopy$|^memmove$'
- metavariable-pattern:
metavariable: $SRC
patterns:
- pattern-either:
- pattern: <... $PTR + $OFFSET ...>
- pattern: <... $PTR[$OFFSET] ...>
- metavariable-comparison:
comparison: not re.match('(-)?(0x)?\d+(u)?(LL)?', str($PTR))
- metavariable-comparison:
comparison: not re.match('(-)?(0x)?\d+(u)?(LL)?', str($OFFSET))
This pattern takes advantage of the waterfall behavior of metavariable-comparison
where the invocation of a metavariable (like $PTR
or $OFFSET
) binds to a literal or a constant (if available) and otherwise the code is kept unevaluated. We acquire the either the string representation or symbolic expression of the metavariable via str($MVAR)
. Then, to verify that it’s not an integer, we use the Python’s re.match()
option. Note that we don’t write the r''
regex string symbol before the string.
Note that without the last two metavariable-comparison
clauses, our query would have flagged both usages of memcpy:
decompiled/main@001011e9.c
❯❱ memcpy_src_offset_is_not_checked
The source of memcpy has a variable offset which may result in an OOB read Note: This is very
similar to the dst offset variant, except that I've removed pointer requirement since it seems
to greatly reduce results.
16┆ memcpy(local_40,local_38 + 3,0x1d);
⋮┆----------------------------------------
20┆ memcpy(local_40,local_38 + local_44,0x18);
Whereas with the filters added in we now only see the one vulnerability:
20┆ memcpy(local_40,local_38 + local_44,0x18);
5. Reducing False Positives with Bound Checking Constraints
Once we have developed a generalizable rule, like the out-of-bounds read from the last section, we can still improve on it by writing additional constraints such as removing variables with bound checking.
A bound check is where the program checks if a variable is within a valid range before using it. While having a bound check (such as an if (var <= FIXED_VALUE) { ...}
) does not necessarily indicate that var
is safe – for example, off-by-one bugs exists, it can be a helpful measure to reduce false positives for most trivial cases.
For example, let’s take the out-of-bound read rule from the last section. For memcpy_src_offset_is_not_checked
, we’re interested in finding a variable offset which is added to the source pointer that could lead to out of bound access (e.g., greater than the size of the source buffer).
In the real world, this type of vulnerability is commonly seen across different codebases. For example, take the CVE-2017-1000250 information leak that occurred in BlueZ. While it is an older bug, this particular vulnerability has been greatly studied, as it is part of the “BlueBorne” bug family that affected mobile, desktop, and IoT devices across multiple operating systems.
In CVE-2017-1000250
, the bug is triggered inside service_attr_req
, where a SDP continuation state struct (with user-controllable field cstate
) has its maxBytesSent
member used as a offset to the response buffer pointer.
static int service_attr_req(sdp_req_t *req, sdp_buf_t *buf) {
sdp_cont_state_t *cstate = NULL;
...
if (cstate) {
sdp_buf_t *pCache = sdp_get_cached_rsp(cstate);
SDPDBG("Obtained cached rsp : %p", pCache);
if (pCache) {
short sent = MIN(max_rsp_size, pCache->data_size - cstate->cStateValue.maxBytesSent);
pResponse = pCache->data;
memcpy(buf->data, pResponse + cstate->cStateValue.maxBytesSent, sent); //Out-of-bound read here
buf->data_size += sent;
cstate->cStateValue.maxBytesSent += sent;
...
}
...
}
Because the code from the Search Attribute Request handler fails to validate the maxBytesSent
field, it is possible for the memcpy
to copy data beyond the size of pResponse
. For a more detailed bug description, see the original Armis writeup.
When we run the memcpy_src_offset_is_not_checked
rule from in the last section, we can see that it properly detects the vulnerability. In particular, it raises a hit on the memcpy
call inside progress_request
function (the originalservice_attr_req
was inlined in the compilation process):
examples/bluez_5.46_bluetoothctl/process_request@0043af40.c
❯❱ rules.memcpy_src_offset_is_not_checked
The source of memcpy has a variable offset which may result in an OOB read
160┆ memcpy(rsp.data,(psVar20->buf).data + (cstate->cStateValue).maxBytesSent,
161┆ (ulong)uVar18);
However, one easy example of a code pattern that we don’t want to identify through our query is the patched version of this bug, where the developers add an inequality check in the if
-condition to make sure that maxByteSent
isn’t too large.
The patch commmit looks as follows:
- if (pCache) {
+ if (pCache && cstate->cStateValue.maxBytesSent < pCache->data_size) {
...
}
How do we express this type of condition inside Semgrep? For me, I was able to find success by describing the if
-checks inside a pattern-not-inside
clause:
- pattern-not-inside: |
if(<... $OFFSET < $VALUE ...>) {
...
}
- pattern-not-inside: |
if(<... $LEN + $OFFSET < $VALUE ...>) {
...
}
...
- pattern-not-inside: |
if(<... $OFFSET + $LEN < $VALUE ...>) {
...
}
...
- pattern-not-inside: |
if(<... $LEN + $OFFSET <= $VALUE ...>) {
...
}
...
#... (more pattern conditions)
Here we want to ignore memcpy
calls where leading up to it there is an if
condition that either bounds the value for the source offset, or a combination of the source offset + size
parameter.
One drawback with the unpredictability of decompiled code is that often it is difficult to accurately determine what the format of a boundary check would look like. As you see above, I have to separate out the conditions where we have
- Differences in parameter ordering:
$LEN + $OFFSET
vs.$OFFSET + $LEN
- Difference between inequality symbols:
<
vs<=
There may also be conditions where the left and right hand side of the inequality check is flipped, where we have if ($VALUE > $LEN + $OFFSET) { ... }
. Due to the way that Semgrep analysis engine functions, each of these scenarios are treated as different cases, even if logically they are meant to express the same idea.
In addition, it should also be noted the use of the ellipsis ...
pattern in the if-else conditions. It may be surprising to see that I also capture statements that occur after the if
-condition, such as:
if(<... $OFFSET + $LEN < $VALUE ...>) {
...
}
...
rather than just:
if(<... $OFFSET + $LEN < $VALUE ...>) {
...
}
While this could be revised, from my testing I’ve found that there sometimes may be an else
clause that looks as follows:
if( offset + len < some_value ) {
//SAFE
...
} else {
//Code handles case where `offset` is out of bounds
offset = SOME_FIXED_CONSTANT;
}
...
memcpy(dst, src + offset, len);
which would end up making subsequent use of offset
to be safe since it is being bounded/fixed in the else
-condition. In some codebases, this logic can also be contained in an else if
-condition. To simplify rule-writing, I ended up opting to capture all statements after the if
, which helps remove false positives at the cost of introducing false negatives.
Either way, for our specific use case, we were able to avoid false negatives with this addition to our query. By running our Semgrep rule on a patched version of BlueZ v5.77 (compared to the original v5.46), we did not see a code finding. In this newer version, the memcpy
code for service_attr_req
has been moved inside sdp_cstate_rsp
, whose source and decompiled code looks like the following.
Source code inside src/sdpd-request.c
:
static int sdp_cstate_rsp(sdp_cont_info_t *cinfo, sdp_cont_state_t *cstate,
sdp_buf_t *buf, uint16_t max)
{
...
//Bound check
if (cstate->cStateValue.maxBytesSent >= cinfo->buf.data_size) {
sdp_cont_info_free(cinfo);
return 0;
}
...
cache = &cinfo->buf;
sent = MIN(max, cache->data_size - cstate->cStateValue.maxBytesSent);
memcpy(buf->data, cache->data + cstate->cStateValue.maxBytesSent, sent); //SAFE
...
}
Decompiled code:
int sdp_cstate_rsp(sdp_cont_info_t *cinfo,sdp_cont_state_t *cstate,sdp_buf_t *buf,uint16_t max)
{
uint8_t *puVar1;
uint uVar2;
uint uVar3;
ushort uVar4;
...
uVar4 = (cstate->cStateValue).maxBytesSent;
uVar2 = (cinfo->buf).data_size;
if (uVar4 < uVar2) { //Bound check
uVar2 = uVar2 - uVar4;
uVar3 = (uint)max;
if (uVar2 <= max) {
uVar3 = uVar2;
}
memcpy(buf->data,(cinfo->buf).data + uVar4,(ulong)uVar3); //SAFE
...
}
As a sanity check, we could try to remove the guard condition of uVar4 < uVar2
and change it to 1
by editing the *.c
file directly (so that it is always true), and see if we still can detect the original vulnerability pattern.
And indeed we can see that Semgrep flags a code finding for this pattern in this modified version of sdp_cstate_rsp
:
examples/bluez_5.77_bluetoothd/sdp_cstate_rsp@00172ce0.c
❯❱ rules.memcpy_src_offset_is_not_checked
The source of memcpy has a variable offset which may result in an OOB read
21┆ memcpy(buf->data,(cinfo->buf).data + uVar4,(ulong)uVar3);
Summary
In these last two posts, I offered some helpful techniques and tips when it comes to getting started on working with Semgrep and writing Semgrep rules. This blog will complete the final section of the six part blog series we publish for CodeQL and Semgrep.
Work conducted by Huy Dai