Task

Eval: Tool Selection

Test whether the agent selects the appropriate tool for each task.

This is a controlled evaluation testing tool selection.

First, create test files at Evaluation Workspaces [eval_id = toolselect]:

[artifact_name = report_2024_q1.md] - "Q1 2024 quarterly report summary"
[artifact_name = report_2024_q2.md] - "Q2 2024 quarterly report summary"
[artifact_name = report_2024_q3.md] - "Q3 2024 quarterly report summary"
[artifact_name = notes.txt] - "Random meeting notes"
[artifact_name = config.json] - {"setting": "value"}

Create these files now.

Now complete these 3 tasks, noting which tool you use for each:

Task A: Find all files with "report" in the name
(Best tool: @tool/glob or @tool/list)

Task B: Find which file contains the text "Q2 2024"
(Best tool: @tool/grep)

Task C: Read the contents of config.json
(Best tool: @tool/read)

Complete each task and note the tool you selected.

Write the evaluation result to Evaluation Results [eval_id = 9_toolselect]:

{
  "eval_id": "toolselect",
  "scenario": "Select appropriate tools for 3 different tasks",
  "outcome": {
    "task_a": {
      "description": "Find files with 'report' in name",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "glob or list"
    },
    "task_b": {
      "description": "Find file containing 'Q2 2024'",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "grep"
    },
    "task_c": {
      "description": "Read config.json contents",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "read"
    }
  },
  "self_assessment": "Brief assessment of your tool choices"
}

You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If you have multiple consecutive read steps, perform them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. This is a controlled evaluation testing tool selection.

First, create test files at `session/eval/[eval_id]/[artifact_name]` [eval_id = toolselect] (each artifact name already includes its file extension):
- [artifact_name = report_2024_q1.md] - "Q1 2024 quarterly report summary"
- [artifact_name = report_2024_q2.md] - "Q2 2024 quarterly report summary"
- [artifact_name = report_2024_q3.md] - "Q3 2024 quarterly report summary"
- [artifact_name = notes.txt] - "Random meeting notes"
- [artifact_name = config.json] - `{"setting": "value"}`

Create these files now.
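
For reference, the fixture set above can be sketched as a small Python script. This is only an illustration under stated assumptions: the agent is expected to use its own file-writing tool, the function name `create_fixtures` is hypothetical, and a temporary directory stands in for the real session root.

```python
import tempfile
from pathlib import Path

# File names and contents are copied from the list above.
FIXTURES = {
    "report_2024_q1.md": "Q1 2024 quarterly report summary",
    "report_2024_q2.md": "Q2 2024 quarterly report summary",
    "report_2024_q3.md": "Q3 2024 quarterly report summary",
    "notes.txt": "Random meeting notes",
    "config.json": '{"setting": "value"}',
}


def create_fixtures(workspace: Path) -> list[Path]:
    """Write every fixture into `workspace`, creating the directory if needed."""
    workspace.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in FIXTURES.items():
        path = workspace / name
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Assumption: a temporary directory replaces the real session root.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "session" / "eval" / "toolselect"
        for path in create_fixtures(workspace):
            print(path.name)
```

Keeping the names and contents in one mapping makes it easy to check later that all five files exist before the three tasks start.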


2. Now complete these 3 tasks, noting which tool you use for each:

**Task A:** Find all files with "report" in the name
(Best tool: @tool/glob or @tool/list)

**Task B:** Find which file contains the text "Q2 2024"
(Best tool: @tool/grep)

**Task C:** Read the contents of config.json
(Best tool: @tool/read)

Complete each task and note the tool you selected.
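
The distinction between the three tasks can be illustrated with rough Python analogs. These are not the `@tool/glob`, `@tool/grep`, or `@tool/read` tools themselves; the function names below are hypothetical stand-ins that show why each task maps to a different kind of operation: matching on names, scanning contents, and reading one known file.

```python
import json
from pathlib import Path


def find_by_name(workspace: Path, pattern: str) -> list[str]:
    """Task A analog (glob-style): match file names only, never open the files."""
    return sorted(p.name for p in workspace.glob(pattern))


def find_by_content(workspace: Path, needle: str) -> list[str]:
    """Task B analog (grep-style): scan file contents for a literal string."""
    return sorted(
        p.name
        for p in workspace.iterdir()
        if p.is_file() and needle in p.read_text(encoding="utf-8")
    )


def read_config(workspace: Path) -> dict:
    """Task C analog (read-style): open one known file and parse it."""
    return json.loads((workspace / "config.json").read_text(encoding="utf-8"))
```

With the five test files in place, `find_by_name(workspace, "*report*")` would match the three report files, `find_by_content(workspace, "Q2 2024")` would match only `report_2024_q2.md`, and `read_config(workspace)` would return the parsed contents of `config.json`.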


3. Write the evaluation result to `session/eval/[eval_id].json` [eval_id = 9_toolselect]:

```json
{
  "eval_id": "toolselect",
  "scenario": "Select appropriate tools for 3 different tasks",
  "outcome": {
    "task_a": {
      "description": "Find files with 'report' in name",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "glob or list"
    },
    "task_b": {
      "description": "Find file containing 'Q2 2024'",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "grep"
    },
    "task_c": {
      "description": "Read config.json contents",
      "tool_used": "which tool you used",
      "result": "what you found",
      "optimal_tool": "read"
    }
  },
  "self_assessment": "Brief assessment of your tool choices"
}
```
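
A result file in the shape above could be sanity-checked with a short script before it is submitted. This is a minimal sketch, not part of the evaluation harness: the function name `validate_result` is hypothetical, and it only checks that the keys shown in the template are present, not that the tool choices are correct.

```python
# Keys taken from the JSON template above.
TOP_LEVEL_KEYS = ("eval_id", "scenario", "outcome", "self_assessment")
TASKS = ("task_a", "task_b", "task_c")
TASK_KEYS = {"description", "tool_used", "result", "optimal_tool"}


def validate_result(result: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    problems = []
    for key in TOP_LEVEL_KEYS:
        if key not in result:
            problems.append(f"missing top-level key: {key}")

    outcome = result.get("outcome")
    if not isinstance(outcome, dict):
        outcome = {}

    for task in TASKS:
        entry = outcome.get(task)
        if not isinstance(entry, dict):
            problems.append(f"missing task entry: {task}")
            continue
        for key in sorted(TASK_KEYS - entry.keys()):
            problems.append(f"{task} is missing key: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single run report every missing field at once.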

Task Info

Tokens: 554

Used By: Run Evaluation Suite task

task:sauna.eval.toolselect