Introducing a new dataset in the SWE-bench family with 300 curated tasks in 9 programming languages to evaluate LLMs on software engineering tasks.
Home · Kabir Khandpur
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
Exploring a missing piece of the software observability stack to monitor business logic.
How I implemented partial builds to cut the live-reload time of a static site by more than 99%.
A look into how the UK government raises and spends money.
An analysis of Singapore's excellent public transportation system.