Discovering a dependency confusion in a popular third-party Python dependency resolver

A peculiar case of dependency confusion potentially affecting 136M monthly package downloads

August 15, 2023 · MAKSYM VATSYK, LEAD SECURITY CONSULTANT

Today we will exploit design weaknesses in pipreqs (a popular Python dependency resolver) to execute supply chain attacks using dependency confusion.

How it all started

This research idea turned up out of the blue when I was in the middle of Dockerizing a sample vulnerable web app written in Python Flask for an internal internship program. One of the steps of packaging the application required generating a requirements.txt file with all third-party dependencies that need to be installed by Pip (Python’s package manager) before running the application. As a lazy developer, I decided to look for ways of automating this tedious process. The most popular options were pip freeze and pipreqs.

However, many developers favored pipreqs (1,2,3,4).

The tool is quite popular, has gathered over 5 thousand stars on GitHub, and has a monthly download rate of 1.5 million, according to PyPI stats.

Why is it so?

pipreqs vs pip freeze

This article does a great job of explaining the differences between pipreqs and pip freeze.

Essentially, pip freeze will dump all installed packages in your Python environment, creating unnecessarily long and redundant requirements lists (especially if you don’t use venvs).

In comparison, pipreqs will parse your project files and derive the direct requirements from imports in your code.

I generated the requirements list with pipreqs, and it looked okay at first glance.

However, when I launched my newly created Docker container and tried to log into the website, I was greeted by an unforgiving HTTP error 500 message. Logs showed that it was related to a bcrypt module missing an exported method.

After half an hour or so of debugging, I tracked down the issue to the wrong bcrypt package in my “requirements.txt” file, generated by pipreqs. It should have been bcrypt instead of python-bcrypt.

Bcrypt on my laptop:

Bcrypt inside Docker container:

This looked worthy of further investigation, so I decided to search through similar issues on the project’s GitHub page. Sure enough, there were plenty of open tickets that boiled down to the following two problems:

Imported module was resolved to the wrong PyPI package

Imported module failed to resolve at all

Clearly, something was wrong with pipreqs, and the question was: can we turn this into a security issue? If we manage to trick pipreqs into resolving an imported module into a PyPI package controlled by us, we can smuggle malicious code in during the install and achieve remote code execution (RCE) through a dependency confusion attack!

Analyzing the source code

Let’s clone the pipreqs repository and open it in VS Code. The repo contains a bunch of setup-related files and the pipreqs Python module itself. The code is neatly organized in a single file: pipreqs.py

The main logic is implemented in the init function executed at the script’s start. Let’s analyze what is going on here:

CODE: https://gist.github.com/adeadfed/8343c850f81c4e1f59ea2e15dd4b9fea.js

The script iterates over all files in the given directory and, inside the get_all_imports function, parses them with Python’s ast module to collect their imports (statements that look like import X or from X import Y).
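To make the import-gathering step concrete, here is a minimal sketch of AST-based import extraction. It is not the pipreqs implementation; the function name top_level_imports and the sample source are made up for illustration, and details such as filtering out standard-library modules are left out.

```python
import ast


def top_level_imports(source):
    """Collect the top-level module names imported by a piece of source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" contributes "a"
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import helpers"
            if node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names


sample = """
import os
import bcrypt
from rest_framework_simplejwt.views import TokenObtainPairView
from . import helpers
"""
print(sorted(top_level_imports(sample)))
# ['bcrypt', 'os', 'rest_framework_simplejwt']
```

Note that the result contains module names as they appear in the code, not PyPI package names. Everything that follows is about bridging that gap.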

Next, pipreqs tries to resolve code imports into PyPI package names via the get_pkg_names function, gathers locally available imports in the get_import_local function, and tries to fetch missing modules from PyPI in the get_imports_info function.

We will focus on two functions, get_pkg_names and get_imports_info, as they are the root of the issue.

Flawed name resolution

So how does the get_pkg_names function work? By using a static map file!

CODE: https://gist.github.com/adeadfed/9a2aed87a501b6be56b3ef7fb21aa999.js

Contents of mapping:

CODE: https://gist.github.com/adeadfed/73500062df06abbbd1db8c125c291500.js

That is the root cause of my bcrypt bug!

It gets way more interesting at line 261, where the script assumes the full import name to be the package name as a fallback.

CODE: https://gist.github.com/adeadfed/2fbf013a801fcae98233bd36de79f9e4.js

Why does it matter? Well, keep in mind that the script then tries to resolve the obtained package names into PyPI packages.

CODE: https://gist.github.com/adeadfed/25eedec109894050386bdfa3f7a1635d.js

By default (if the user does not supply --use-local flag), the script tries to query missing packages from the PyPI server via the get_imports_info function.

This is done simply by querying the package name at the PyPI server:

CODE: https://gist.github.com/adeadfed/d54b063fdd31d51ce483cc0fc8cbc2d1.js
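For readers who cannot open the gist, a lookup of this kind could be sketched roughly as follows. This is my own illustration, not the pipreqs source: it assumes the public PyPI JSON endpoint at https://pypi.org/pypi/NAME/json, and the helper names pypi_json_url and exists_on_pypi are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumed index URL for this sketch; other tools may default to a different host.
PYPI_SERVER = "https://pypi.org/pypi/"


def pypi_json_url(name, server=PYPI_SERVER):
    """Build the JSON metadata URL for an exact package name."""
    return f"{server}{name}/json"


def exists_on_pypi(name, opener=urllib.request.urlopen):
    """Return True if the index serves metadata for this exact name."""
    try:
        with opener(pypi_json_url(name), timeout=10) as response:
            json.load(response)  # metadata parsed; only its presence matters here
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # nobody has registered this name
        raise
```

The important property is that the lookup only asks "does a package with this name exist?". It has no way of knowing whether that package is the one the developer actually imported.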

See the flawed logic? Let’s sum up what we know so far. The code:

Gathers all from PACKAGE import ... or import PACKAGE statements from your code
Tries to resolve PACKAGE from the above import statement to a PyPI package name through a predefined mapping file. If the PACKAGE value is not found, the script assumes that the PyPI package name is the same.
By default, the script tries to query the PyPI package PACKAGE at the public repository and adds it to the requirements.txt file.

The issue is that if the name of the exported module PACKAGE differs from the PyPI package name (e.g., PYTHON-PACKAGE), pipreqs will try to fetch an invalid package called PACKAGE from the repository.
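The fallback behavior fits in a few lines. The sketch below is a simplified model rather than the actual pipreqs code: MAPPING holds a single illustrative entry (the bcrypt pair from my bug above), and resolve_package_name is a hypothetical helper.

```python
# Illustrative excerpt of an "import name -> PyPI name" mapping.
MAPPING = {
    "bcrypt": "python-bcrypt",
}


def resolve_package_name(import_name, mapping=MAPPING):
    """Return the mapped PyPI name, or fall back to the import name itself."""
    return mapping.get(import_name, import_name)


print(resolve_package_name("bcrypt"))
# python-bcrypt
print(resolve_package_name("rest_framework_simplejwt"))
# rest_framework_simplejwt
```

The second call is the interesting one: the module is not in the mapping, so its import name is passed on unchanged as if it were a PyPI package name.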

This behavior is the reason for a bunch of tickets (e.g., 1, 2, 3) on GitHub.

But what if the vulnerable package is installed locally?

So far, we can inject unwanted dependencies if the bad package name resolves at PyPI. However, it seems that pipreqs initially tries to find it in local packages. Is this check sufficient to prevent abuse of the name resolution?

CODE: https://gist.github.com/adeadfed/0522a5ab7463d77f2adcb633b9a4af77.js

Well… no. There are also quite a few issues where imports are mapped to multiple packages, such as this one.

Why?

Consider the following example:

CODE: https://gist.github.com/adeadfed/3d75f6413ad4dc09687cac297bdd1d1e.js

The code above uses a PyPI package djangorestframework_simplejwt that exports a Python module rest_framework_simplejwt.

Let’s run and view the pipreqs output in the debugger step by step:

Pipreqs extracts imported rest_framework_simplejwt module from the code:

Pipreqs tries to match the rest_framework_simplejwt module against the entries in the mapping file, but there is no entry for it. So, the script falls back to assuming that the package name is rest_framework_simplejwt.

The script tries to find the rest_framework_simplejwt module in the exports of all locally installed Python packages…and finds it!

CODE: https://gist.github.com/adeadfed/4845fbe890e6b7d7257de2e86d523f7b.js

However, the function returns only the PyPI name (djangorestframework_simplejwt), ignoring any exported modules. The issue of this return format becomes apparent on line 447.

The import names are then compared to the local packages. However, pipreqs tests again whether the name of the exported module (see step 2) matches the name of a PyPI package.

In our case, where the PyPI package (djangorestframework_simplejwt) and the Python module (rest_framework_simplejwt) have different names, this check never matches, so the locally installed package is ignored.

And so, the flawed PyPI package resolution is triggered.
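A simplified model of that comparison shows why the local check does not help. This is a sketch under my own assumptions, not the pipreqs code: unresolved_imports is a hypothetical helper, and the way names are normalized before comparison is assumed.

```python
def unresolved_imports(candidates, local_package_names):
    """Return the candidates that have no same-named local package."""
    def norm(name):
        # Assumed normalization: case-insensitive, "-" and "_" treated alike
        return name.lower().replace("-", "_")

    local = {norm(name) for name in local_package_names}
    return [c for c in candidates if norm(c) not in local]


# Only PyPI package names survive from the local scan; exported modules are lost.
local_names = ["djangorestframework_simplejwt", "Django"]
candidates = ["django", "rest_framework_simplejwt"]
print(unresolved_imports(candidates, local_names))
# ['rest_framework_simplejwt']
```

Even though the right package is installed, the module name is reported as unresolved and handed over to the remote lookup.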

How to check if a package is vulnerable?

Let’s summarize our findings from the analysis above. The following three conditions must be met for pipreqs to add our rogue package to the requirements.txt file:

The vulnerable PyPI package (python-package) must export modules named differently from the package itself (package)
The package:python-package mapping pair must be missing from the hard-coded mappings in pipreqs’ mapping file
The PyPI package name package must be available for attackers to register and upload their malicious code
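The three conditions can be written down as a single predicate. This is only a sketch: is_confusable is a hypothetical helper, the name normalization is my assumption, and registered_names stands in for whatever source tells us which names are already taken on PyPI.

```python
def is_confusable(module_name, package_name, mapping, registered_names):
    """Check the three conditions for one (exported module, PyPI package) pair."""
    def norm(name):
        # Assumed normalization so that "a-b", "a_b" and "A.B" compare equal
        return name.lower().replace("-", "_").replace(".", "_")

    names_differ = norm(module_name) != norm(package_name)  # condition 1
    not_mapped = module_name not in mapping                 # condition 2
    name_is_free = norm(module_name) not in {norm(n) for n in registered_names}  # condition 3
    return names_differ and not_mapped and name_is_free


print(is_confusable("package", "python-package", {}, {"python-package"}))
# True
print(is_confusable("package", "python-package", {"package": "python-package"}, {"python-package"}))
# False
```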

What’s the impact?

Suppose we can meet all three conditions above and register a rogue PyPI package that matches the imported Python modules of a vulnerable package the targeted users depend on. In that case, pipreqs will add it to the requirements.txt, enabling us to achieve RCE on users’ systems!

But more importantly, how many PyPI packages with mismatched names are there?

Analyzing the PyPI attack surface

What are our actions?

Get the top 5000 most-downloaded PyPI packages for the month of the research (March 2023).

Download these packages locally and obtain their exported Python modules (Yikes. At least we don’t need to install them.)

Search for vulnerable packages that meet the aforementioned criteria:
1. Python module name ≠ PyPI package name
2. Python module name not in the pipreqs mapping file
3. Python module name available at PyPI as a package name

Attempt to exploit one package as an example

Profit!
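One possible way to obtain exported modules without installing anything is to look inside the downloaded archives. The sketch below assumes the packages were downloaded as wheels (which are plain zip files) and only looks at top-level directories and .py files; compiled extension modules and source distributions are ignored. The exported_modules helper and the demo archive are made up for illustration.

```python
import io
import zipfile


def exported_modules(wheel):
    """List top-level importable names found in a wheel (path or file object)."""
    modules = set()
    with zipfile.ZipFile(wheel) as archive:
        for entry in archive.namelist():
            top = entry.split("/")[0]
            # Skip metadata folders such as "pkg-1.0.dist-info" and "pkg-1.0.data"
            if top.endswith((".dist-info", ".data")):
                continue
            if "/" in entry:
                modules.add(top)       # package directory
            elif top.endswith(".py"):
                modules.add(top[:-3])  # single-file module
    return sorted(modules)


# Build a tiny fake wheel in memory to demonstrate the idea.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("rest_framework_simplejwt/__init__.py", "")
    demo.writestr("rest_framework_simplejwt/views.py", "")
    demo.writestr("djangorestframework_simplejwt-1.0.dist-info/METADATA", "")
buffer.seek(0)
print(exported_modules(buffer))
# ['rest_framework_simplejwt']
```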

PyPI provides a neat data set of package-related information through Google BigQuery.

We can easily automate querying it with the help of the open-source CLI tool pypinfo.

Running pypinfo to get the top 5000 projects by monthly downloads in March 2023 is as simple as:

CODE: https://gist.github.com/adeadfed/ae2a75939a09f84fbf5a77c8a6ebc0c4.js

We are left with the following JSON file:

CODE: https://gist.github.com/adeadfed/ff039e6e485e57f3652e6ee06c19bf7a.js

The JSON file was processed to obtain all PyPI packages that can be misinterpreted by pipreqs. Next, the results were manually reviewed to filter out false positives.

Packages managed by developer organizations such as Google, AWS, Nvidia, and Microsoft were omitted from the analysis because one has to be a part of a developer organization on PyPI to upload any packages with the reserved name prefixes (e.g., google-cloud-bigquery).

CODE: https://gist.github.com/adeadfed/1d3ed5b4d8e999871ab42c2fe5cfb2b1.js

Time for some quick statistics. Out of the 5000 top downloaded packages, we found 149 to be vulnerable (about 3% of the total). Together, the vulnerable PyPI packages accounted for 136,433,596 downloads that month.

Proof of Concept

Let’s expand the research into developing a proof-of-concept (PoC) attack on a project using pipreqs. We will use the djangorestframework-simplejwt package as an example. Remember the test code from above?

CODE: https://gist.github.com/adeadfed/c2b111d6cee3ad387cb4495edd3f3fd3.js

Creating a working exploit is trivial. All we need to do is to properly package our code as described in the official PyPI docs. Our malicious package will mimic the directory structure of the original package:

views.py will contain two dummy classes to provide the import names:

CODE: https://gist.github.com/adeadfed/0d3c33d5be697287f994fbdabc32e808.js

__init__.py will contain our malicious code, which runs as soon as the package is imported. For the sake of demonstration, our RCE will only print out a warning message:

CODE: https://gist.github.com/adeadfed/284ece814e75edf5ba770fcfa7d529bb.js

This is what happens if we import the malicious package:

CODE: https://gist.github.com/adeadfed/0a0efdb8885617f2559d24113a178582.js

Obviously, we can do whatever we want with the system at this point. For example, pop a classic calc.exe process:

All that’s left is to appropriately fill out all metadata files, build the package, and publish it to PyPI:

CODE: https://gist.github.com/adeadfed/6d6098d24e863155fb3997f1a593f23f.js

Building the package:

CODE: https://gist.github.com/adeadfed/c4201382b9b661f6249e0f455bcb7fd6.js

Publishing the package:

CODE: https://gist.github.com/adeadfed/436f42c76510195d137b15b67121504d.js

The question now is, will we be able to replicate the exploit in the wild? Yes!

CODE: https://gist.github.com/adeadfed/2f1fdb3f448bd1e030e75c43efc800fc.js

Is it that bad?

Naming-related confusion is a long-standing problem in the Python packaging system. There are numerous proposals to fix it, including PEP 423. Although some developers try to follow these vague guidelines, others do not.

However, during the PoC development, I noticed that I could not upload some legitimately vulnerable packages to PyPI due to a “too similar name.”

Thankfully, PyPI has been trying to do damage control over naming confusion for quite some time now (e.g., 1, 2, 3).

Before accepting a new package, PyPI normalizes its name with the following rules and rejects it if the result collides with an existing one. A name has to survive this check to be exploitable in public:

Both cases of O (o and O) are replaced with 0
Both cases of L and I are replaced with 1 (e.g., example is the same as examp1e and exampie)
All ., _, and - characters are removed (e.g., e-x-a-m-p-l-e is the same as example)
The result is then lowercased and compared to the already existing names
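These rules are easy to reproduce locally when triaging candidate names. The sketch below is my own reimplementation of the rules as described here, not PyPI’s code; it assumes the removed separators are ".", "_", and "-", and the function name ultranormalize is chosen for illustration.

```python
import re


def ultranormalize(name):
    """Apply the confusable-name rules to a package name."""
    name = re.sub(r"[._-]", "", name)    # drop separators
    name = re.sub(r"[lLiI]", "1", name)  # l/L and i/I become 1
    name = re.sub(r"[oO]", "0", name)    # o/O become 0
    return name.lower()


for candidate in ("examp1e", "exampie", "e-x-a-m-p-l-e"):
    print(candidate, ultranormalize(candidate) == ultranormalize("example"))
# examp1e True
# exampie True
# e-x-a-m-p-l-e True
```

A candidate whose normalized form equals that of an existing package should be expected to fail at upload time.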

Although this check sometimes prevents the attack, there are still many exposed packages in the wild, as we have already shown in the PoC.

What can be done about this?

Although we can’t really fix the root cause of the bug (the remote dependency resolution mechanism), it is possible to build upon the existing local resolution code to avoid having to use remote PyPI resolution in the first place.

A pull request with fixes to the code includes these changes:

get_locally_installed_packages function will now return local packages in the form of {'name':'package_name','version':'package_version','exports':['exported_module_1', 'exported_module_2', ...]}
get_import_local function will now search imports list entries in the exports and name fields (to account for pipreqs mapping) of the reworked get_locally_installed_packages function output
init function will now compute the difference list entries (packages that are not found locally and have to be resolved remotely) according to the changes made

These three changes should improve the quality of the requirements.txt output for packages that are installed locally.
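The idea behind these changes can be sketched as follows. This is a simplified illustration rather than the code from the pull request: split_local_and_remote is a hypothetical helper, and the version string is a placeholder.

```python
def split_local_and_remote(imports, local_packages):
    """Match imports against each local package's name and exported modules."""
    found, missing = [], []
    for import_name in imports:
        match = next(
            (pkg for pkg in local_packages
             if import_name == pkg["name"] or import_name in pkg["exports"]),
            None,
        )
        if match is not None:
            found.append({"name": match["name"], "version": match["version"]})
        else:
            missing.append(import_name)  # only these would go to remote resolution
    return found, missing


local_packages = [{
    "name": "djangorestframework_simplejwt",
    "version": "0.0.0",  # placeholder version
    "exports": ["rest_framework_simplejwt"],
}]
found, missing = split_local_and_remote(
    ["rest_framework_simplejwt", "flask"], local_packages
)
print(found)
# [{'name': 'djangorestframework_simplejwt', 'version': '0.0.0'}]
print(missing)
# ['flask']
```

Because the exported module is now matched to its real package locally, the import never reaches the remote lookup.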

The new pipreqs version’s output for the same code above:

CODE: https://gist.github.com/adeadfed/cf1fc13a02c9f41e9cdcc34940b1c7bd.js

Additionally, I added a warning message into the CLI output that is displayed when pipreqs is run with remote resolution enabled. The message encourages users to check the list of the final requirements:

CODE: https://gist.github.com/adeadfed/c4eacedf5010e0836be39419145e2a22.js

Aftermath

April 14th, 2023: The pipreqs creator, bndr (thank you!), was quick to respond and merge a bunch of commits, including mine, into the main branch and release pipreqs 0.4.13.

May 16th, 2023: The vulnerability was triaged by the MITRE CVE team and assigned the ID CVE-2023-31543.

July 10th, 2023: CVE disclosure