Building the First On-Device AI That Always Listens
Author: Pol | Published: 9/28/2024
Welcome to Kokoro! Thank you for your interest in our project. To clarify: when I say “we” in this article, I’m a solo developer, so it refers just to me, Pol. Let’s dive into the details.
The Problem
The best way to start is by explaining the problem we’re trying to solve. If you’ve watched science fiction movies, you’ve likely seen scenarios where people simply talk, and a magical AI does everything. That’s essentially what we’re building, but it comes with some challenges.
AI Hype
I’m not a fan of AI hype. At its core, AI is just linear algebra – predictions based on previous predictions. AI can’t be your magical assistant or best friend. However, AI can be a powerful tool to assist you in your daily life. That’s our goal: to treat AI as the practical tool it is.
Privacy Concerns
When we say Kokoro is always listening, people understandably get concerned. Currently, most consumer AI processing occurs on external cloud servers, over which end users have no control, leading to privacy issues. We’ve addressed this by making Kokoro an on-device AI, meaning all processing and storage happens on your device. We don’t have access to your data, and it even works offline!
Open Source
While open-sourcing isn’t possible at the project’s current stage, it’s a future goal. By open source, we mean truly open – models, datasets, and code fully available for everyone to see and use. We’re not there yet, but we’re working towards it.
The Solution
Now, let’s discuss our solution: Kokoro, the first on-device AI that’s always listening. You can find more information about its capabilities on our landing page.
Tech Stack
We’re using Flutter for the frontend, which gives us a cross-platform UI from a single codebase. The backend is written in Rust, which is excellent for cross-compiling to the platforms we target.
For on-device processing, we’re using ExecuTorch, PyTorch’s on-device inference library, which lets us run models on consumer devices and take advantage of available hardware accelerators.
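To give a sense of the workflow, here is a minimal sketch of exporting a PyTorch model to ExecuTorch’s `.pte` format, which the on-device runtime can then load. The tiny model and input shapes are placeholders for illustration, not Kokoro’s actual models.

```python
import torch
from executorch.exir import to_edge

# Placeholder model standing in for one of Kokoro's real on-device models.
class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(x)

example_inputs = (torch.randn(1, 16),)

# Capture the graph, lower it to ExecuTorch's edge dialect, and serialize
# it to a .pte file that the on-device runtime can execute.
exported = torch.export.export(TinyModel().eval(), example_inputs)
program = to_edge(exported).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(program.buffer)
```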
Models
For speech recognition, we’re currently using faster-whisper. Our ideal listening model would transcribe multiple speakers simultaneously and identify who is speaking in real time; faster-whisper doesn’t do all of that, but it’s the best option available right now, and Whisper’s transcription quality is surprisingly good.
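For a feel of what this looks like in practice, here is a minimal transcription loop with faster-whisper; the model size, device, and file name are placeholder choices, not necessarily what Kokoro will ship.

```python
from faster_whisper import WhisperModel

# "small" with int8 quantization keeps memory usage modest on consumer hardware.
model = WhisperModel("small", device="cpu", compute_type="int8")

# Segments are yielded lazily, so the transcript streams in as decoding proceeds.
segments, info = model.transcribe("meeting.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text.strip()}")
```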
We need an LLM to structure conversations, save memories, and call skills. We’re using Llama 3.2, with plans to fine-tune it for our specific use case.
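To make “structure conversations, save memories, and call skills” concrete, here is a hedged sketch of the kind of contract we have in mind: the LLM is prompted to emit JSON describing memories to save and Skills to invoke, and the app parses it. The `generate` stub and the JSON schema are illustrative assumptions, not a final design; in the real app the stub would be Llama 3.2 running on-device.

```python
import json

SYSTEM_PROMPT = """You listen to conversation transcripts. Reply with JSON only:
{"memories": ["..."], "skill_calls": [{"skill": "...", "args": {}}]}"""

def generate(prompt: str) -> str:
    # Stub standing in for on-device Llama 3.2 inference; returns a canned reply.
    return ('{"memories": ["Agreed to coffee with Alice on Friday"], '
            '"skill_calls": [{"skill": "calendar", '
            '"args": {"title": "Coffee with Alice", "day": "Friday"}}]}')

def process(transcript: str) -> tuple[list[str], list[dict]]:
    raw = generate(f"{SYSTEM_PROMPT}\n\nTranscript:\n{transcript}")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [], []  # a fine-tuned model should rarely emit invalid JSON
    return data.get("memories", []), data.get("skill_calls", [])

memories, calls = process("Alice: let's grab coffee on Friday. Me: sure!")
print(memories)
print(calls)
```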
For embeddings, we’re using all-MiniLM-L6-v2, a compact sentence-embedding model (384-dimensional vectors) that works perfectly for our needs.
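Via the sentence-transformers library (the usual way to run this model in Python), usage is a few lines; the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence becomes a 384-dimensional vector; similar meanings land close together.
vectors = model.encode([
    "Agreed to coffee with Alice on Friday",
    "Scheduled a coffee date with Alice",
])
print(vectors.shape)  # (2, 384)
```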
For our vector database backend, we’re using LanceDB, which runs embedded, so stored memories never leave the device.
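Putting the last two pieces together, here is a minimal sketch of storing and retrieving memories with LanceDB; the database path, table name, and sample data are hypothetical.

```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
db = lancedb.connect("./kokoro.lancedb")  # hypothetical local path

memories = ["Agreed to coffee with Alice on Friday", "Bob's birthday is in June"]
table = db.create_table("memories", data=[
    {"vector": encoder.encode(text).tolist(), "text": text} for text in memories
])

# Nearest-neighbor search: retrieve the memory most relevant to a question.
query = encoder.encode("When am I meeting Alice?").tolist()
for hit in table.search(query).limit(1).to_list():
    print(hit["text"])
```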
Skills
This is where things get interesting. While listening, Kokoro can trigger various Skills to gather more data and suggest actions. For example, as mentioned on our landing page, if you’ve met someone and agreed to have coffee, Kokoro will use the calendar Skill to suggest adding the coffee date to your calendar or inform you of any scheduling conflicts.
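To illustrate the shape a Skill might take, here is a hedged sketch: each Skill exposes a name and a handler for structured calls like the one in the LLM sketch above. The interface and the CalendarSkill behavior are assumptions for illustration, not Kokoro’s actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol

class Skill(Protocol):
    name: str
    def handle(self, args: dict) -> str: ...

@dataclass
class CalendarSkill:
    """Illustrative calendar Skill: suggests events and flags conflicts."""
    name: str = "calendar"
    events: dict[str, str] = field(default_factory=dict)  # day -> existing event

    def handle(self, args: dict) -> str:
        day, title = args["day"], args["title"]
        if day in self.events:
            return f"Heads up: '{self.events[day]}' is already on {day}."
        return f"Suggestion: add '{title}' to your calendar on {day}?"

# Dispatch a skill call of the kind the LLM emits (see the earlier sketch).
skills: dict[str, Skill] = {"calendar": CalendarSkill(events={"Friday": "Dentist"})}
call = {"skill": "calendar", "args": {"title": "Coffee with Alice", "day": "Friday"}}
print(skills[call["skill"]].handle(call["args"]))
```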
Register for the beta
Are you excited about Kokoro? Register for the beta right here.