

The SWE-agent pipeline has two steps. First, SWE-agent takes an input GitHub issue and returns a pull request that attempts to fix it; we call this step inference. The second step, which is currently only available for issues in the SWE-bench benchmark, is to evaluate the pull request to verify that it has indeed fixed the issue.


At the moment, a small number of repositories are known not to install properly on arm64 / aarch64 machines. We're working on a fix, but if you'd like to run and evaluate on the entirety of SWE-bench, the easiest way is to use an x86 machine.

👩‍💻 Inference

Run SWE-agent on SWE-bench Lite and generate patches.

python --model_name gpt4 \
  --per_instance_cost_limit 2.00 \
  --config_file ./config/default.yaml

If you'd like to run on a single issue from SWE-bench, use the --instance_filter option as follows:

python --model_name gpt4 \
  --instance_filter marshmallow-code__marshmallow-1359

The above examples use the default value of --data_path (princeton-nlp/SWE-bench_Lite, which is looked up on Hugging Face). You can specify any other Hugging Face dataset as well, or supply the path to a pre-downloaded dataset. By default, SWE-agent evaluates on the dev split of that dataset. You can change that by supplying the --split argument to the above commands (but you shouldn't tune your model on the test split).

🧪 Evaluation

The evaluation/ folder provides SWE-agent compatible scripts for running SWE-bench style evaluation on model patch predictions. We also include scripts to quantify model performance on "subtasks" within the SWE-bench task, such as identifying the right file(s) to edit.

🐇 Quick Start

You can run evaluations on SWE-bench by passing in the predictions generated by SWE-agent (usually named all_preds.jsonl). Simply run the following script:

./ <path to predictions>

The <path to predictions> argument should point to the predictions file, for example trajectories/<user>/<experiment>/all_preds.jsonl.
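Before launching a long evaluation, it can help to sanity-check the predictions file. The sketch below assumes that each line of all_preds.jsonl is a JSON object with an instance_id and a model_patch field, as in the SWE-bench prediction format. These field names are an assumption here, so adjust them if your file differs. The helper summarize_predictions is hypothetical and not part of SWE-agent.

```python
import json
from pathlib import Path


def summarize_predictions(path, id_key="instance_id", patch_key="model_patch"):
    """Split the instances in a .jsonl predictions file by whether they have a patch.

    The key names are assumptions; adjust them to match your file.
    """
    with_patch, without_patch = [], []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        patch = record.get(patch_key)
        # A missing, null, or whitespace-only patch counts as "no patch".
        target = with_patch if patch and patch.strip() else without_patch
        target.append(record.get(id_key))
    return {"with_patch": with_patch, "without_patch": without_patch}
```

Calling summarize_predictions on your predictions file returns two lists of instance IDs, which gives a quick view of how many instances will be skipped for lack of a patch.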


Depending on the number of task instances and how long setting up the execution environment takes, the evaluation could take anywhere from a couple of minutes to 7 hours for the entirety of the SWE-bench test split.

When evaluation finishes, you should see an output similar to the following:

2024-03-31 16:47:00,263 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installing with command: . /n/fs/p-swe-bench/testbed/ba397fe0d6/pvlib__pvlib-python/0.8/tmpom22t9na/miniconda3/bin/activate pvlib__pvlib-python__0.8 && echo 'activate successful' && pip install -e .[all]
2024-03-31 16:47:10,602 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installation successful
2024-03-31 16:47:10,619 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (test)
2024-03-31 16:47:10,635 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (pred)
2024-03-31 16:47:13,453 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Test script run successful
Log directory for evaluation run: /n/fs/p-swe-bench/results/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4
== Evaluation Report ==
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
- Wrote per-instance scorecards to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/scorecards.json
- Wrote summary of run to /<path to SWE-agent>/trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/results.json
Reference Report:
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
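The report line is printed as a Python dict, so it can be turned into rates with a few lines of code. The sketch below is a hypothetical helper, not part of the evaluation scripts. It assumes the total number of task instances is the sum of "# Not Generated" and "# Generated".

```python
import ast


def summarize_report(report_line):
    """Parse a printed report dict and derive simple rates."""
    report = ast.literal_eval(report_line)
    # Assumption: every task instance is either "Generated" or "Not Generated".
    total = report["# Not Generated"] + report["# Generated"]
    return {
        "total": total,
        "applied_rate": report["# Applied"] / total,
        "resolved_rate": report["# Resolved"] / total,
    }


line = "{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}"
summary = summarize_report(line)
print(f"{summary['resolved_rate']:.1%} of {summary['total']} instances resolved")
# prints "13.5% of 37 instances resolved"
```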

🪑 SWE-bench Evaluation

This script contains the logic for SWE-bench evaluation adapted for the SWE-agent setting. Given a set of predictions (e.g. trajectories/<user>/<experiment>/all_preds.jsonl), we...

  1. Filter + analyze predictions.
  2. Run SWE-bench style execution based evaluation.
  3. Save outcomes to results.json and scorecards.json files with info about task-specific and overall performance.

The Quick Start command above is an example of how to run this script. The script accepts the following arguments:


  • --predictions_path (required): The path to the file containing predictions (.jsonl format). This file includes the predictions that need to be evaluated against the benchmark tasks.
  • --log_dir (required): The directory path where log files related to the evaluation process will be stored. It's used for saving logs that are generated during the evaluation.
  • --swe_bench_tasks (required): The path to the file containing the SWE-bench task instances. This file includes the details of the tasks against which the predictions will be evaluated.
  • --testbed (required): The directory path for the testbed, where the execution environments for the evaluation are set up.
  • --skip_existing (optional): If specified, the script will skip over log files that already exist, preventing re-evaluation of those tasks.
  • --timeout (optional): Specifies the timeout in seconds for the evaluation process (default is 900 seconds). This helps in controlling the duration of each evaluation task to avoid excessively long running times.
  • --verbose (optional): Enables verbose mode, which will provide more detailed output during the script execution. This is useful for debugging or getting more insight into the process.
  • --conda_link (optional): Allows specifying a URL to a Conda installation that should be used for the evaluation environment. This can be necessary if the evaluation requires a specific software environment.
  • --log_suffix (optional): An additional parameter to specify a suffix for log files. This can be used for organizing logs more effectively, especially when running multiple evaluations in parallel or under different configurations.
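As a rough illustration of how these arguments fit together, here is a minimal argparse sketch. It is not the actual evaluation script: the argument types and the use of store_true for the two switches are assumptions, and the paths passed in the usage lines are placeholders.

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring the documented evaluation arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation CLI")
    parser.add_argument("--predictions_path", required=True, help="Predictions file (.jsonl)")
    parser.add_argument("--log_dir", required=True, help="Directory for evaluation logs")
    parser.add_argument("--swe_bench_tasks", required=True, help="SWE-bench task instances file")
    parser.add_argument("--testbed", required=True, help="Testbed directory")
    # Assumption: the boolean options are plain on/off switches.
    parser.add_argument("--skip_existing", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--timeout", type=int, default=900, help="Timeout in seconds")
    parser.add_argument("--conda_link", default=None, help="URL of a Conda installation")
    parser.add_argument("--log_suffix", default=None, help="Suffix for log files")
    return parser


# Placeholder paths, for illustration only.
args = build_parser().parse_args([
    "--predictions_path", "all_preds.jsonl",
    "--log_dir", "logs",
    "--swe_bench_tasks", "tasks.json",
    "--testbed", "testbed",
])
print(args.timeout)  # prints 900
```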

📈 Viewing Results

This script aggregates and displays experiment results from the trajectories/ folder.

  • Experiments are grouped by (Model, Dataset, Config File, Temp., Top P, Cost, Install).
  • The following statistics for each experiment run are shown:
    • Not Generated: # of task instances with no patch generated
    • Generated: # of task instances with patch
    • Applied: # of patches that applied successfully
    • Resolved: # of task instances resolved
    • Costs [Success|Failed|Overall]: Cost of [successful|failed|any] run
  • If there are multiple runs of an experiment (distinguished by --suffix run<i>), the above statistics are aggregated as totals or means.
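The grouping and aggregation described above can be sketched as follows. This is an illustration, not the script's implementation. The record field names are hypothetical, and the choice to sum the counts while averaging the cost is an assumption, since the text only says the statistics are aggregated as totals or means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for one run record.
GROUP_KEYS = ("model", "dataset", "config", "temp", "top_p", "cost", "install")
COUNT_KEYS = ("not_generated", "generated", "applied", "resolved")


def aggregate_runs(runs):
    """Group run records by experiment and combine their statistics.

    Counts are summed across runs; the overall cost is averaged.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in GROUP_KEYS)].append(run)

    summary = {}
    for key, members in groups.items():
        stats = {k: sum(r[k] for r in members) for k in COUNT_KEYS}
        stats["mean_cost"] = mean(r["cost_overall"] for r in members)
        stats["runs"] = len(members)
        summary[key] = stats
    return summary
```

Each key of the returned dict is one (Model, Dataset, Config File, Temp., Top P, Cost, Install) tuple, so repeated runs of the same experiment collapse into a single row.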




  • --folder (type: str, default: ../trajectories): Specifies the folder containing the experiment results. This is where the script will look to gather data.
  • --model (type: str, nargs: '+'): Filters the results by model(s). Only results corresponding to the specified model(s) will be included.
  • --dataset (type: str, nargs: '+'): Filters the results by dataset(s). Only results for the specified dataset(s) will be analyzed.
  • --setup (type: str, nargs: '+'): Filters the results by setup(s). This allows focusing on specific experiment configurations.
  • --runs_min (type: int): The minimum number of runs an experiment should have to be included in the analysis. Helps exclude experiments with insufficient data.
  • --runs_max (type: int): The maximum number of runs to consider for each experiment. This can limit the data to the most relevant runs.
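To make the filter arguments concrete, the sketch below builds a parser with the documented types and shows one way --runs_min and --runs_max could be applied. The select_runs helper is hypothetical, and keeping the first runs_max runs is an assumption, since the text does not say which runs are kept.

```python
import argparse

parser = argparse.ArgumentParser(description="Sketch of the results-viewing CLI")
parser.add_argument("--folder", type=str, default="../trajectories")
parser.add_argument("--model", type=str, nargs="+")
parser.add_argument("--dataset", type=str, nargs="+")
parser.add_argument("--setup", type=str, nargs="+")
parser.add_argument("--runs_min", type=int)
parser.add_argument("--runs_max", type=int)


def select_runs(experiments, runs_min=None, runs_max=None):
    """Apply --runs_min / --runs_max style limits.

    `experiments` maps an experiment key to its list of runs.
    """
    selected = {}
    for key, runs in experiments.items():
        if runs_min is not None and len(runs) < runs_min:
            continue  # too few runs: drop the experiment entirely
        # Assumption: when capped, keep the first `runs_max` runs.
        selected[key] = runs if runs_max is None else runs[:runs_max]
    return selected


# nargs="+" collects one or more values into a list.
args = parser.parse_args(["--model", "gpt4", "--runs_min", "2"])
print(args.model, args.runs_min)  # prints ['gpt4'] 2
```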