This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying; we're unable to process applications from unlisted locations.

List of accepted countries and locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

## What You'll Be Doing
Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:
- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do, and that shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

## What You'll Need
- 4+ years of professional software engineering experience (non-negotiable)
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
- Strong command of Git and modern development workflows
- Track record at a high-growth tech company or top-tier software organization
- Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

## Nice to Have
- Senior or Lead-level profile with a history of technical ownership
- Bachelor's or Master's in CS, ML, or a related field (or equivalent professional experience)
- Proficiency in additional languages: JavaScript, Go, C++, or others
- Experience with CI/CD and with writing robust unit tests (pytest, Mocha, JUnit)
- Background in security engineering or significant open-source contributions
- Familiarity with AI/ML evaluation methodologies or model benchmarking

## Logistics
- Location: Fully remote; work from anywhere on the accepted locations list
- Compensation: $80–$100/hr based on location and seniority
- Contract length: 3 months, with potential for extension
- Hours: Full-time availability preferred; hours vary by project and are not guaranteed week to week
- Engagement: 1099 independent contractor
- Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.