Gemini is so good, I have let it control/use my phone

Jan 20, 2025 - 16:09

Click3: LLMs have taken control of my phone

Or

Draft a mail to someone@gmail.com and ask for lunch next Saturday.

Or

Find out what my rating is in Uber.

Or

Start a 3+2 game on lichess

All of these tasks require you to know the UI, like where to find the ratings ("is it in the profile section, or is it even in the app?"), and then make multiple clicks.

But with this framework, you can just type the task in plain text and watch the LLM do it for you.

How does it work?

It has three separate components, as seen below, and each has its own concern:

  • Planner: Plans the next step (given current screenshot and previous actions)
  • Finder: Finds specific UI elements (whatever Planner asks it to find)
  • Executor: Clicks, scrolls, types, etc.

Architecture diagram
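
To make the division of work concrete, here is a minimal sketch of how the three roles could fit together in a loop. It is not the project's actual code: the `Step` type, `adb_command`, `run_task` and all parameter names are hypothetical, and the Executor part assumes an Android phone reachable through `adb`. The Planner and Finder are passed in as plain functions, so any model (OpenAI, Gemini or a local one) could sit behind them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Step:
    """One planned action (hypothetical shape, not the project's own type)."""
    action: str                   # "click", "type", "scroll" or "done"
    target: Optional[str] = None  # element description handed to the Finder
    text: Optional[str] = None    # text to enter for a "type" action


def adb_command(step: Step, point: Optional[Point] = None) -> List[str]:
    """Executor: turn a step into an adb command line (assumes Android + adb)."""
    if step.action == "click":
        if point is None:
            raise ValueError("click needs coordinates from the Finder")
        return ["adb", "shell", "input", "tap", str(point[0]), str(point[1])]
    if step.action == "type":
        # `input text` expects spaces to be encoded as %s
        return ["adb", "shell", "input", "text", (step.text or "").replace(" ", "%s")]
    if step.action == "scroll":
        # Swipe the finger upward to scroll the content down; coordinates are arbitrary
        return ["adb", "shell", "input", "swipe", "500", "1500", "500", "500"]
    raise ValueError(f"unknown action: {step.action}")


def run_task(
    task: str,
    planner: Callable[[str, bytes, List[Step]], Step],
    finder: Callable[[bytes, str], Point],
    screenshot: Callable[[], bytes],
    execute: Callable[[List[str]], object],
    max_steps: int = 20,
) -> List[Step]:
    """Plan, find, execute, repeat until the Planner says "done"."""
    history: List[Step] = []
    for _ in range(max_steps):
        image = screenshot()                  # current screen
        step = planner(task, image, history)  # Planner: decide the next step
        if step.action == "done":
            break
        # Finder: only consulted when the step names a UI element
        point = finder(image, step.target) if step.target else None
        execute(adb_command(step, point))     # Executor: perform the action
        history.append(step)
    return history
```

With real components, `execute` could be something like `subprocess.run`, and `planner` and `finder` would wrap calls to whichever model you pick for each role. The `max_steps` cap is only there so that a confused Planner cannot loop forever.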

OpenAI / Gemini / Local LLM

I have added support for all of the above. To be precise, the following are the recommended models for each component.

Model recommendation table

You can use your own keys and run it locally. The support for local LLMs is what I am most excited about, and I feel Molmo MLX for macOS is a great start.
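
As a rough illustration of "bring your own keys", the sketch below picks a provider per component and looks up the matching API key, with local models needing none. Everything in it is an assumption for illustration only: the `KEY_ENV` mapping, the environment variable names and the `resolve_keys` helper are hypothetical, so check the README for the configuration the project actually uses.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical mapping from provider to the environment variable holding its key.
# A local model needs no key, so it maps to None.
KEY_ENV: Dict[str, Optional[str]] = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "local": None,
}


def resolve_keys(
    components: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Optional[str]]:
    """Return the API key for each component, or None if it runs locally."""
    keys: Dict[str, Optional[str]] = {}
    for component, provider in components.items():
        if provider not in KEY_ENV:
            raise ValueError(f"unknown provider: {provider}")
        var = KEY_ENV[provider]
        if var is None:
            keys[component] = None  # local model, nothing to look up
        elif var in env:
            keys[component] = env[var]
        else:
            raise KeyError(f"set {var} to use {provider} for the {component}")
    return keys
```

For example, `resolve_keys({"planner": "gemini", "finder": "local"})` would read one key from the environment and leave the Finder keyless.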

Open-sourced

This project is open source for everyone to use and contribute to. You can also check out more demos in the README:

https://github.com/BandarLabs/clickclickclick

What do you guys think?

What use cases can you think of for this? For starters, I think developers could use it to "create walkthrough overlays over any app" or "automate testing of any functionality of an app".
