Semgrep Guide for a Security Engineer (Part 5 of 6)
March 12th, 2025 by Brian
In this fifth post of my six-part blog series, I will begin sharing techniques and tips that I have found helpful for writing Semgrep rules to find vulnerabilities. The previous post discussed the organization of queries and connecting data flows.
The motivation is that while CodeQL excels at analyzing codebases with source code and a build environment, we cannot use it on closed-source binaries. This is where a tool like Semgrep comes in. Thanks to Semgrep's flexible tree-sitter grammars, it is able to parse the pseudo-C and C++ code produced by decompilers.
Thus, instead of having to look through the entire decompiled output ourselves, we can run Semgrep scans as an initial pass to look for potential bugs, using either default or custom rules. Moreover, as we learn more about a given binary's functionality and populate struct/variable information in the decompiler, we can use that information to write more targeted Semgrep rules that scan for vulnerable patterns.
Goal
In this and the following section of this blog, I want to offer a guide through some of the most important things I learned about working with Semgrep on decompiled code. Some of these concepts come from trying the Semgrep queries in Marco Ivaldi's blog on automating binary VR work with Ghidra and Semgrep, and from improving on some of the techniques in those queries. The goal is to provide helpful concepts that can aid you in the process of writing your own security-focused Semgrep rules (specifically for C/C++).
In this section, I will focus on two primary techniques that will best help you get started with Semgrep rules. First, I will cover how to get the most compatible decompiled code for Semgrep analysis (including workflows for both Ghidra and IDA Pro users). Once that is set up, I will cover how to write good identification patterns to target specific function calls.
Workflow
Reader Assumption: This blog assumes you have some preliminary knowledge of how Semgrep works and some of the basic Semgrep syntax. If you are unfamiliar with Semgrep, I would recommend getting started with the built-in Semgrep tutorials.
Semgrep Code: For this guide we will be focusing on Semgrep Code, a SAST tool that allows you to write security rules in YAML files and run them on code. Due to the sensitive nature of most of our security work at Caesar Creek Software, I opted to work exclusively with the local CLI rather than the online Semgrep AppSec platform.
Setup: To get started, I would recommend following the setup guide for the Semgrep CLI. In addition, make sure you enable the Semgrep Pro engine, which allows for inter-file and inter-function analysis. I've found these features to be a must for running scans, since you are otherwise limited to finding vulnerabilities within a single function and a single file. Note that charges may apply for the Pro engine, depending on whether you need more than ten developer licenses for your organization and on your particular use case.
Running Scans: For most local Semgrep scans, I would recommend running them with the following options:
semgrep scan --pro -j 8 --timeout 0 --max-target-bytes 0 --metrics=off --disable-version-check --config rules/ --include src/ --json-output=output.json
This assumes that you have your Semgrep rules (`.yaml` files) in a `rules/` folder, and all the source files in a `src/` folder.

- `--pro`: Enables inter-file and inter-function analysis. Note: for this behavior to work properly, you will also need the `interfile: true` option inside the Semgrep rules that require it.
- `-j 8`: Sets the number of subprocesses to run in parallel. Change the number to best fit your system.
- `--timeout 0` and `--max-target-bytes 0`: Disable the timeout for running rules and stop ignoring files that are greater than a given size.
- `--metrics=off` and `--disable-version-check`: For sensitive binaries, we want to avoid any potential telemetry or usage metrics being sent to the Semgrep servers.
  - Note: These options might not be necessary if you are running only custom rules (metrics only apply to rules loaded from the Semgrep Registry). See Semgrep's metrics page for more info.
- `--json-output=output.json`: Outputs the results to a JSON file.
For full explanations of Semgrep command line options, see the CLI reference guide.
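As a point of reference, here is a minimal rule skeleton (the rule id, message, and pattern are just placeholders, not a real rule of mine) showing where the `interfile: true` option mentioned above lives:

rules:
  - id: example-interfile-rule
    languages: [c, cpp]
    severity: WARNING
    message: example rule with cross-file analysis enabled
    options:
      interfile: true # required for inter-file analysis under --pro
    pattern: gets(...)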
1. Getting Clean Decompiled Code from Binaries
To start, we first need to establish a procedure for getting decompiled code from a binary that Semgrep can readily parse. While Semgrep can theoretically take any `.c` or `.cpp` source file, unsurprisingly, there are a number of issues it encounters when parsing pseudo-C and C++. Each binary decompiler (such as Ghidra or IDA Pro) will often add decorators and constructs to its display output that are helpful for reverse engineers but are not actually valid C/C++ code. For example, when looking at the output JSON or SARIF files from a scan, it is common to see something like:
Syntax error at line src/libcurl/Curl_connect@00124770.c:201:\n ")" was unexpected
In those cases, Semgrep will usually just ignore the character and continue on. However, as I worked with more binaries, I started seeing unrecoverable error messages, meaning that the Semgrep parser seemed to stop at a given location and ignore all subsequent lines in the file. Some examples were:
Failure: unallowed declarator as RHS to scope resolution
Failure: should not be empty by precondition
While I wasn’t able to figure out the exact cause of these parsing issues, I’ve found that to minimize the amount of code that is skipped, the best things to do are:
- Separate functions by file: To best isolate parsing errors, each function in the program should be in its own `.c` or `.cpp` file. That way, any unrecoverable error will at most cause a single function to be ignored.
- Remove decorators: As I worked with more binaries, I was able to identify a few common decorator patterns added by decompilers that were causing issues with Semgrep. To help, I wrote a few helper utilities, described in the corresponding decompiler's section below.
- Use the right file extension: Instead of asking the user to supply the language, Semgrep auto-determines the parsing mode from the file extension. If your decompiled code has C++ constructs, you will encounter far fewer parsing errors if it is labeled with `.cpp` instead of `.c`. This may seem obvious, but some decompilers or export scripts default to `.c`, so it can very easily be missed.
So far I've worked primarily with getting decompiled code from Ghidra and IDA Pro. Here's what my workflows look like for them:
Ghidra
Given a binary, we can import it into a Ghidra project, analyze it as normal, and then use a custom Ghidra script to export the decompiled code.
While Ghidra has the built-in option of `File -> Export Program -> set "Format:" to C/C++`, it outputs the full source as a single C file. Instead, I use 0xdea's Haruspex script, which automatically separates each function into its own file with a unique name (typically `function_name@HEX_ADDRESS.c`).
If you want a more streamlined method than running the Ghidra Scripts GUI for each binary, I would recommend modifying the script to take the export folder location as an argument instead:
@Override
public void run() throws Exception
{
    printf("\nHaruspex.java - Extract Ghidra decompiler's pseudo-code\n");
    printf("Copyright (c) 2022 Marco Ivaldi <raptor@0xdeadbeef.info>\n\n");

    // Change: use CLI argument instead of askXxx() method
    String[] args = getScriptArgs();
    outputPath = args[0];

    // Original: ask for output directory path
    // try {
    //     outputPath = askString("Output directory path", "Enter the path of the output directory:");
    // } catch (Exception e) {
    //     printf("Output directory not supplied, using default \"%s\".\n", outputPath);
    // }

    ...
}
Then, using Ghidra's headless analyzer, we can export decompiled code for analyzed binaries from the command line:
/opt/ghidra/support/analyzeHeadless <GHIDRA_PROJECT_LOCATION> <GHIDRA_PROJECT_NAME> -process <BINARY_NAME> -scriptPath <GHIDRA_HARUSPEX_SCRIPT_LOCATION> -postScript Haruspex.java <OUTPUT_FOLDER_PATH>
Ghidra adds some decorators that we want to remove from our code files. Specifically, I have a Python script that looks for the following strings and deletes them:
__thiscall
__cdecl
__noreturn
__fastcall
I will also rename the files to either `.c` or `.cpp` as appropriate.
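For reference, here is a minimal sketch of what that kind of cleanup script can look like (the `exported/` folder name is a placeholder for wherever the Haruspex output was written):

#!/usr/bin/env python3
# Strip Ghidra calling-convention decorators that are not valid C/C++
# and can confuse Semgrep's parser.
from pathlib import Path

DECORATORS = ["__thiscall", "__cdecl", "__noreturn", "__fastcall"]

def clean_file(path):
    text = path.read_text(errors="replace")
    for decorator in DECORATORS:
        text = text.replace(decorator, "")
    path.write_text(text)

if __name__ == "__main__":
    for source_file in Path("exported").glob("*.c"):
        clean_file(source_file)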
IDA Pro
In previous versions of IDA (up to v8.4), we had to create an IDA database for each binary we wanted analyzed, and then manually export the decompiled code. Starting with IDA v9.0, many of these operations can also be run headlessly.

That said, on the older versions, we've had to export the decompiled pseudocode to a single `.c` file, and then manually separate out the functions into their own files. For this I wrote a simple Python program that looks for the comments IDA adds between every function and struct definition. For example, this is what the original decompiled export file looks like:
//----- (0000000000005190) ----------------------------------------------------
char *sub_5190()
{
return &edata;
}
// 20E320: using guessed type char edata;
//----- (00000000000051C0) ----------------------------------------------------
__int64 sub_51C0()
{
return 0LL;
}
// 51C0: using guessed type __int64 sub_51C0();
//----- (0000000000005200) ----------------------------------------------------
You can iterate through the exported code file and divide up the decompiled code using these lines as delineation boundaries, with a regex pattern (in Python) like:
import re

with open(target_file, "r") as f:
    code = f.readlines()

# Indices of the "//----- (ADDRESS) ----" delimiter comments IDA adds
function_lines = [index for index, line in enumerate(code)
                  if re.match(r"//-----\s*\([0-9a-fA-F]+\)\s*-+\n", line)]
function_lines.append(len(code))

for i in range(len(function_lines) - 1):
    start = function_lines[i]
    stop = function_lines[i + 1]
    # function_code now holds one function, ready to be written to its own file
    function_code = "".join(code[start:stop])
As part of the post-processing, I also go through and remove the IDA decorators that have been shown to cause issues with the Semgrep parser:
`virtual thunk to'
`non-virtual thunk to'
`vtable for'
`typeinfo for'
`guard variable for'
`VTT for'
Lastly, I rename the files to either `.c` or `.cpp` as appropriate.
2. Identifying Function Variants
From my experience looking through security rules in the Semgrep Registry and 0xdea’s semgrep repository, I’ve found that most rules tend to focus heavily on identifying vulnerable function calls. Many security-focused rules look at flagging incorrect or unsafe usage patterns of functions, based on a set of constraints specified by the rule. As a result, it is very important in Semgrep rule writing to be able to properly define the set of functions you are targeting.
Even for standard utility calls, there exist slightly different variants that could be used depending on the C standard in use. For example, alongside `memset`, there are also `memset_explicit` and `memset_s`. Indeed, you can see this type of variation being accounted for in 0xdea's incorrect-use-of-memset rule:
- pattern-either:
    - pattern: memset($S, $C, 0);
    - pattern: memset($S, $C, '\0');
    - pattern: memset($S, sizeof(...), $N);
    - pattern: memset_explicit($S, $C, 0);
    - pattern: memset_explicit($S, $C, '\0');
    - pattern: memset_explicit($S, sizeof(...), $N);
In addition, it's important to note that, depending on the standard libraries the binary was compiled/linked with, function names can change. For example, while the source code uses `scanf`, when we compile the following lines with `gcc`, we see `__isoc99_scanf` being used instead, and a simple `printf` call being converted to `puts`:
int offset = 0;
if (scanf("%d", &offset) != 1)
    return -1;
memcpy(dst, src + offset, 24);
printf("%s\n", dst);
becomes
local_44 = 0;
iVar1 = __isoc99_scanf(&DAT_00102004,&local_44);
if (iVar1 == 1) {
  memcpy(local_40,(void *)((long)&local_38 + (long)local_44),0x18);
  puts(local_40);
  uVar2 = 0;
}
Thus when it comes to writing function identifiers, it is important to be aware of the differences between function calls as seen in source code and those that appear in the decompiler output.
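For instance, a capture intended to work on both the original source and Ghidra's output might simply OR the two spellings together (a minimal sketch):

- pattern-either:
    - pattern: scanf($FORMAT, ...)
    - pattern: __isoc99_scanf($FORMAT, ...)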
And unlike with CodeQL, where there are built-in libraries we can import that have knowledge of the conventions around standard functions, in Semgrep we have to manually specify the different variants ourselves. This includes creating multiple OR patterns to account for differences in the function name, number of arguments, and ordering of arguments.
As a starting point, I've found the interesting-api-calls rule to be very helpful when I'm first trying to figure out a capture pattern for a commonly-used function.
In cases where you have two or more differently-named functions with roughly the same ordering of arguments (as in the `memset` rule above), I would recommend using a metavariable as the function name, and then specifying a capture regex for it using the `metavariable-regex` pattern. For example, one of my rules targeting `scanf`-like function calls had the following:
- patterns:
    - pattern-either:
        - patterns:
            - pattern: $SCANF($FORMAT, $_, $TARGET, ...)
            - metavariable-regex:
                metavariable: $SCANF
                regex: '(?i)^(w)?scanf\s*$|^(v)?scanf\s*$' # scanf, wscanf, or vscanf
        - patterns:
            - pattern: $SCANF($SRC, $FORMAT, $_, $TARGET, ...)
            - metavariable-regex:
                metavariable: $SCANF
                regex: '(?i)^(v)?s(w)?scanf\s*$|^__isoc99_sscanf\s*$|^(v)?f(w)?scanf\s*$' # sscanf, vsscanf, swscanf, vswscanf, __isoc99_sscanf, fscanf, vfscanf, fwscanf, or vfwscanf
        - patterns:
            - pattern: $SCANF($SRC, $FORMAT, $LOCALE, $_, $TARGET, ...)
            - metavariable-regex:
                metavariable: $SCANF
                regex: '(?i)^_f(w)?scanf_l\s*$|^_s(w)?scanf_l\s*$' # _fscanf_l, _fwscanf_l, _sscanf_l, _swscanf_l
    - focus-metavariable: $TARGET
Note that in this specific rule I was interested in the first "source" buffer/variable being used in the format string, hence the `...` ellipsis after `$TARGET` to allow for 0 to n arguments after it.
Regex Tips:
It's important to note that Semgrep's regex patterns are written in PCRE2. You will often want to test your patterns in a separate regex tester like Regex101 to make sure they capture the groups you want before using them inside a Semgrep rule.
To better generalize my regex patterns, I normally add the following features:
- Enable case-insensitive search with `(?i)^YOUR_PATTERN$`
- Allow for additional whitespace to be added after the function name: `$FUNC\s*()`
- In special cases, also allow for numbers after the function name and before the whitespace: `$FUNC\d*\s*()`
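Putting these together, a generalized capture might look something like the following sketch (the `strcpy` family here is just an illustration):

- patterns:
    - pattern: $FUNC($DST, ...)
    - metavariable-regex:
        metavariable: $FUNC
        # strcpy or strncpy, case-insensitive, allowing trailing digits and whitespace
        regex: '(?i)^str(n)?cpy\d*\s*$'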
As you work more with a given binary or project, you naturally gain more familiarity with the types of functions being used and their names in the decompiled output. As such, I would recommend starting with a basic capture template and then adding to it as you find more exceptions to the rules.
3. Expressing Multi-Step Data Flow with Taint Labels
One really neat feature of Semgrep is the recently-added support for taint labels. While still largely experimental, this functionality allows us to express how data becomes dangerous only if it goes through a number of preconditions (rather than directly from `source A -> sink B`). For example, if you want to express that data from source `A` goes through function `B`, then through function `C`, and ends up in sink `D`, you will need to write it with taint labels.
Initially, I experimented to see if I could achieve a similar workflow with Semgrep's join mode, but was unsuccessful. As I later found out from the docs on its limitations, `join mode != taint mode`, and it doesn't seem like you can mix and match modes by joining rules that are written in taint mode.
Instead, the approach that I've found to work uses taint labels, as follows. Say that in a given codebase I'm looking for all filenames (defined as hardcoded strings) passed to some function, where the file eventually gets opened with `fopen(...)` and has its content read with `fread(...)`. Following that, the content is used inside a `system(...)` call.
This is mainly an exercise to illustrate the steps of performing a multi-step data flow analysis inside Semgrep; an actual bug where a text file is read and its content is later executed as an arbitrary command is unlikely. That said, we can envision similar real-world scenarios where untrusted command data comes from a socket, database file, etc. and is later executed.
Say that my code looks like the following:
int openFileAndExec(const char *filename, char *buf, int buf_size) {
    FILE *fp;
    fp = fopen(filename, "r");
    int bytesRead = fread(buf, sizeof(char), buf_size, fp);
    system(buf);
    return bytesRead;
}

int main() {
    char CMD1[32];
    const char *FILENAME = "hello.txt";
    openFileAndExec(FILENAME, CMD1, sizeof(CMD1));
    return 0;
}
The decompiled code exported from Ghidra looks like:
ulong openFileAndExec(char *param_1,char *param_2,int param_3)
{
  FILE *__stream;
  size_t sVar1;

  __stream = fopen(param_1,"r");
  sVar1 = fread(param_2,1,(long)param_3,__stream);
  system(param_2);
  return sVar1 & 0xffffffff;
}

undefined4 main(void)
{
  long in_FS_OFFSET;
  undefined1 local_38 [40];
  long local_10;

  local_10 = *(long *)(in_FS_OFFSET + 0x28);
  openFileAndExec("hello.txt",local_38,0x20);
  if (local_10 != *(long *)(in_FS_OFFSET + 0x28)) {
    /* WARNING: Subroutine does not return */
    __stack_chk_fail();
  }
  return 0;
}
Using taint labels, I would write the Semgrep rule as follows:
rules:
  - id: file_to_command_exec
    languages:
      - c
      - cpp
    message: |
      A local file is read and its content is
      executed by system(...)
    severity: WARNING
    mode: taint
    options:
      symbolic_propagation: true
      interfile: true
    pattern-sources:
      - label: HARDCODED
        patterns:
          - pattern: $FUNC(..., $SRC, ...)
          - metavariable-regex:
              metavariable: $FUNC
              regex: ^((?!fopen)(?!fread)(?!system).)*$
          - focus-metavariable: $SRC
      - label: FOPEN
        requires: HARDCODED
        pattern: fopen(...)
      - label: FREAD
        requires: FOPEN
        pattern: fread(...)
        exact: false
    pattern-sinks:
      - requires: FREAD
        patterns:
          - pattern: system(...)
    pattern-propagators:
      - pattern: fread($BUF, ..., $STREAM)
        from: $STREAM
        to: $BUF
Here we specify a label for each of the pattern sources, and add requirements for which label a tainted data value needs to have before it can proceed to the next step. In our case, we specify that data needs to go from source to sink in the following order:
- Sources:
  - `HARDCODED` label: Starts out as a hardcoded string.
    - Due to the compiler's optimization, our assignment of `const char *FILENAME = "hello.txt";` was removed and the string is instead used directly inside the call to `openFileAndExec`.
    - As a result, instead of being able to write our rule for a string assignment (e.g. `- pattern: $VAR = "$SOME_STRING"`), we ended up having to rely on constant propagation and specify that our hardcoded string is being provided as an argument to some function (that isn't `fopen`, `fread`, or `system`).
  - `FOPEN` label: The hardcoded string is now passed to `fopen()` to open the file.
  - `FREAD` label: The file stream is then read and populates a buffer.
- Sink:
  - The populated buffer is then fed to `system()` to be executed as a command.
To make the code flow work, we had to add a `pattern-propagators` entry, which specifies that an input stream (the last argument) fed into `fread` propagates taint to the buffer (the first argument). This is an important step, as Semgrep doesn't have knowledge of the taint relationship between a file descriptor and the buffer being written into for `fread` calls.
Despite `HARDCODED`, `FOPEN`, and `FREAD` being labeled as data sources in the rule, Semgrep will perform inter-file and inter-function taint analysis between two sources (for example, `HARDCODED -> FOPEN`), assuming we have the Pro functionality enabled in the Semgrep scan command. This explains how we were able to properly pivot from a code line inside the `main` function to one inside `openFileAndExec` (which, for our purposes, are stored in two separate `.c` files).
Thus, using taint labels, we can effectively write multi-step global data tracking rules and combine them into one, which is really neat!
It should also be noted that, from my experimentation with taint labels, it is generally better to have just one data sink. If we bring the `FOPEN` label into the sink, then Semgrep will be happy to match (`HARDCODED -> FOPEN`) or (`FREAD -> FOPEN`) as the entire flow, which is not what we want: written that way, we are specifying multiple sinks that any of the sources can flow to.
Thus, if you want to express a multi-step data flow of `source A -> B -> ... -> sink Z`, I would recommend including labels `A, B, ...` in the `pattern-sources` section, where label `B` requires `A`, label `C` requires `B`, etc., and then having `Z` as the sink, as in the skeleton below.
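A minimal sketch of that shape might look like the following (the labels and patterns are placeholders, not a tested rule):

mode: taint
pattern-sources:
  - label: A
    pattern: source_a(...) # placeholder: where tainted data originates
  - label: B
    requires: A # data must already carry label A
    pattern: step_b(...)
  - label: C
    requires: B
    pattern: step_c(...)
pattern-sinks:
  - requires: C # a single sink, gated on the final label
    pattern: sink_z(...)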
Continue on to part 6 of this series of blog posts.
Work conducted by Huy Dai.