Interesting, a computer use environment. I made a CUA benchmark too, 200 web tasks with internal code based evaluation. You can integrate them if you want.
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?