๐Ÿข WorkSim Voyager

Interactive Simulated Workplace Environment for RL Agent Training
31 Tools · 8 Reward Signals · Voyager Architecture · OpenEnv Compatible

At a glance: 31 tools in 8 categories · 8 reward signals · 5 difficulty levels · 5 Voyager layers

🎮 Interactive Demo — Try the Environment

[Interactive demo panel: choose a tool call, supply its arguments as JSON, and press "Reset Episode" to generate a scenario; the panel shows the last tool result and the episode log.]
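The demo's reset/step cycle can be sketched in a few lines. This is a toy stand-in, assuming an OpenEnv/Gym-style `reset`/`step` interface; the class and field names here are illustrative, not the actual WorkSim Voyager API.

```python
import json

class StubWorkSimEnv:
    """Toy stand-in for the demo environment; not the real WorkSim API."""

    def reset(self):
        self.steps = 0
        return {"scenario": "Draft a market summary", "tools": ["search", "write_doc"]}

    def step(self, tool_call):
        self.steps += 1
        done = self.steps >= 2  # end the toy episode after two steps
        # Return (observation, shaped reward, done) for each tool call.
        return {"result": f"ok: {tool_call['tool']}"}, 0.1, done

env = StubWorkSimEnv()
obs = env.reset()
done, total = False, 0.0
while not done:
    # Each step is one tool call with JSON-encoded arguments.
    call = {"tool": "search", "args": json.dumps({"query": "market size"})}
    obs, reward, done = env.step(call)
    total += reward

print(round(total, 1))  # cumulative reward over the toy episode
```

The cumulative-reward counter in the demo panel corresponds to `total` here.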

🧠 Voyager Agent Loop

Actor (LLM generates) → Execute (tool call) → Critic (LLM evaluates) → Reflect (update memory)
Skill Library — ChromaDB vector store of reusable tool-call patterns
Working Memory — In-episode scratchpad: goals, facts, plan, errors
Episodic Memory — Cross-episode learning: strategies, failure patterns
Auto Curriculum — Progressive difficulty based on agent performance
Skill Extractor — Mines reusable skills from successful trajectories
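One Actor → Execute → Critic → Reflect cycle, with a skill-library lookup and skill extraction, can be sketched as below. All names are assumptions made for the example; the real project backs the skill library with a ChromaDB vector store rather than the list search used here.

```python
class SkillLibrary:
    """Toy skill store; a stand-in for the ChromaDB-backed library."""

    def __init__(self):
        self.skills = []

    def retrieve(self, goal):
        # Stand-in for a vector-similarity search over stored patterns.
        return next((s for s in self.skills if s["goal"] == goal), None)

    def add(self, skill):
        self.skills.append(skill)

def voyager_step(goal, env, skills, memory):
    hint = skills.retrieve(goal)                          # Skill Library lookup
    tool_call = hint or {"goal": goal, "tool": "search"}  # Actor: propose a call
    result = env(tool_call)                               # Execute: run the tool
    success = result == "ok"                              # Critic: evaluate outcome
    memory.append((tool_call, result, success))           # Reflect: update memory
    if success and hint is None:
        skills.add(tool_call)                             # extract a reusable skill
    return success

skills, memory = SkillLibrary(), []
env = lambda call: "ok"  # stub executor that always succeeds
voyager_step("find_revenue", env, skills, memory)
print(len(skills.skills))  # the successful pattern was stored as a skill
```

On a later episode with the same goal, `retrieve` would return the stored pattern and the actor could reuse it instead of generating from scratch.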

🎯 Multi-Component Reward System

R_task — Rubric-based deliverable grading
R_progress — Incremental step rewards
R_evidence — Source quality & authority
R_consistency — Fact alignment checks
R_efficiency — Step economy & focus
R_recovery — Error correction ability
R_skill — Tool-use pattern mastery
R_penalty — Error & repetition costs

R_total = 0.35·R_task + 0.15·R_progress + 0.10·R_evidence + 0.10·R_consistency + 0.10·R_efficiency + 0.08·R_recovery + 0.07·R_skill − R_penalty
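The weighted combination is straightforward to compute; this sketch uses the weights from the formula above, with component keys that mirror the list (the function itself is illustrative, not the project's code).

```python
# Weights taken from the R_total formula above.
WEIGHTS = {
    "task": 0.35, "progress": 0.15, "evidence": 0.10, "consistency": 0.10,
    "efficiency": 0.10, "recovery": 0.08, "skill": 0.07,
}

def total_reward(components, penalty=0.0):
    """R_total = weighted sum of the positive components minus R_penalty."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS) - penalty

# With every component at its maximum of 1.0 and a small penalty:
r = total_reward(
    {"task": 1.0, "progress": 1.0, "evidence": 1.0, "consistency": 1.0,
     "efficiency": 1.0, "recovery": 1.0, "skill": 1.0},
    penalty=0.05,
)
print(round(r, 2))  # 0.9 (the positive weights sum to 0.95)
```

Note that the positive weights sum to 0.95, so even a penalty-free perfect episode caps R_total below 1.0.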

🔧 Tool Inventory — 31 Tools

📈 Training Results (Expert Iteration + Voyager-lite)

Agent                          Env Reward   Shaped Reward   Δ vs Baseline   Skills
Random Agent                   +0.039       N/A             —               0
Basic LLM (pre-train)          +0.021       +4.69           ref             0
Voyager-lite (pre-train)       +0.024       +5.71           +1.02 (+22%)    2
Basic LLM (post-train)         +0.023       +4.56           −0.13           0
⭐ Voyager-lite (post-train)    +0.025       +5.53           +0.84 (+18%)    2

Model: Qwen2.5-1.5B-Instruct (4-bit) · LoRA rank 16 · Expert Iteration (Best-of-4 + SFT × 3 rounds) · 32 prompts/round · Training loss: 8.78 → 8.50 (decreasing)
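The Expert Iteration recipe in the caption (Best-of-4 sampling, then SFT, repeated for 3 rounds) reduces to a short loop. This is a minimal sketch with the sampling and scoring stubbed out; the function names and the random-number policy are stand-ins, not the actual training pipeline.

```python
import random

def best_of_n(prompt, policy, score, n=4):
    """Sample n completions and keep the one the scorer ranks highest."""
    return max((policy(prompt) for _ in range(n)), key=score)

def expert_iteration(prompts, policy, score, rounds=3):
    dataset = []
    for _ in range(rounds):
        # Collect one best-of-4 completion per prompt...
        dataset += [(p, best_of_n(p, policy, score)) for p in prompts]
        # ...then run an SFT update on `dataset` (stubbed out here).
    return dataset

random.seed(0)
policy = lambda p: random.random()  # stand-in for sampling the LLM
data = expert_iteration(["p1", "p2"], policy, score=lambda s: s)
print(len(data))  # 2 prompts × 3 rounds of best-of-4 selections
```

The run described above uses the same shape at larger scale: 32 prompts per round for 3 rounds, with each kept sample being the best of 4 under the shaped reward.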