Building the First On-Device AI That Always Listens
Author: Pol | Published: 9/28/2024
Welcome to Kokoro! Thank you for your interest in our project. To clarify: when I say “we” in this article, I’m a solo developer, so it refers just to me, Pol. Let’s dive into the details.
The Problem
The best way to start is by explaining the problem we’re trying to solve. If you’ve watched science fiction movies, you’ve likely seen scenarios where people simply talk, and a magical AI does everything. That’s essentially what we’re building, but it comes with some challenges.
AI Hype
I’m not a fan of AI hype. At its core, AI is just linear algebra – predictions based on previous predictions. AI can’t be your magical assistant or best friend. However, AI can be a powerful tool to assist you in your daily life. That’s our goal: to treat AI as the practical tool it is.
Privacy Concerns
When we say Kokoro is always listening, people understandably get concerned. Currently, most consumer AI processing occurs on external cloud servers, over which end users have no control, leading to privacy issues. We’ve addressed this by making Kokoro an on-device AI, meaning all processing and storage happens on your device. We don’t have access to your data, and it even works offline!
Open Source
While open-sourcing isn’t possible at the project’s current stage, it’s a future goal. By open source, we mean truly open – models, datasets, and code fully available for everyone to see and use. We’re not there yet, but we’re working towards it.
The Solution
Now, let’s discuss our solution: Kokoro, the first on-device AI that’s always listening. You can find more information about its capabilities on our landing page.
Tech Stack
We’re using Flutter for the frontend, which gives us a cross-platform UI from a single codebase. The backend is written in Rust, which is excellent for cross-compiling to the platforms we target.
For on-device processing, we’re using ExecuTorch, PyTorch’s on-device inference library, which lets us run models on consumer devices and take advantage of available hardware accelerators.
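To give a sense of the workflow, here is a minimal sketch of exporting a PyTorch model to ExecuTorch’s `.pte` format, which the on-device runtime can then load. The tiny model and input shapes are placeholders for illustration, not Kokoro’s actual models.

```python
import torch
from executorch.exir import to_edge

# Placeholder model standing in for one of Kokoro's real on-device models.
class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(x)

example_inputs = (torch.randn(1, 16),)

# Capture the graph, lower it to ExecuTorch's edge dialect, and serialize
# it to a .pte file that the on-device runtime can execute.
exported = torch.export.export(TinyModel().eval(), example_inputs)
program = to_edge(exported).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(program.buffer)
```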
Models
For speech recognition, we’re currently using faster-whisper. Our ideal listening model would transcribe multiple speakers simultaneously and identify who is speaking in real time; faster-whisper doesn’t do all of that, but it’s the best option available right now, and Whisper’s transcription quality is surprisingly good.
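For a feel of what this looks like in practice, here is a minimal transcription loop with faster-whisper; the model size, device, and file name are placeholder choices, not necessarily what Kokoro will ship.

```python
from faster_whisper import WhisperModel

# "small" with int8 quantization keeps memory usage modest on consumer hardware.
model = WhisperModel("small", device="cpu", compute_type="int8")

# Segments are yielded lazily, so the transcript streams in as decoding proceeds.
segments, info = model.transcribe("meeting.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text.strip()}")
```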
We need an LLM to structure conversations, save memories, and call skills. We’re using Llama 3.2, with plans to fine-tune it for our specific use case.
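To make “structure conversations, save memories, and call skills” concrete, here is a hedged sketch of the kind of contract we have in mind: the LLM is prompted to emit JSON describing memories to save and Skills to invoke, and the app parses it. The `generate` stub and the JSON schema are illustrative assumptions, not a final design; in the real app the stub would be Llama 3.2 running on-device.

```python
import json

SYSTEM_PROMPT = """You listen to conversation transcripts. Reply with JSON only:
{"memories": ["..."], "skill_calls": [{"skill": "...", "args": {}}]}"""

def generate(prompt: str) -> str:
    # Stub standing in for on-device Llama 3.2 inference; returns a canned reply.
    return ('{"memories": ["Agreed to coffee with Alice on Friday"], '
            '"skill_calls": [{"skill": "calendar", '
            '"args": {"title": "Coffee with Alice", "day": "Friday"}}]}')

def process(transcript: str) -> tuple[list[str], list[dict]]:
    raw = generate(f"{SYSTEM_PROMPT}\n\nTranscript:\n{transcript}")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [], []  # a fine-tuned model should rarely emit invalid JSON
    return data.get("memories", []), data.get("skill_calls", [])

memories, calls = process("Alice: let's grab coffee on Friday. Me: sure!")
print(memories)
print(calls)
```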
For embeddings, we’re using all-MiniLM-L6-v2, a compact sentence-embedding model (384-dimensional vectors) that works perfectly for our needs.
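Via the sentence-transformers library (the usual way to run this model in Python), usage is a few lines; the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence becomes a 384-dimensional vector; similar meanings land close together.
vectors = model.encode([
    "Agreed to coffee with Alice on Friday",
    "Scheduled a coffee date with Alice",
])
print(vectors.shape)  # (2, 384)
```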
For our vector database backend, we’re using LanceDB, which runs embedded, so stored memories never leave the device.
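Putting the last two pieces together, here is a minimal sketch of storing and retrieving memories with LanceDB; the database path, table name, and sample data are hypothetical.

```python
import lancedb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
db = lancedb.connect("./kokoro.lancedb")  # hypothetical local path

memories = ["Agreed to coffee with Alice on Friday", "Bob's birthday is in June"]
table = db.create_table("memories", data=[
    {"vector": encoder.encode(text).tolist(), "text": text} for text in memories
])

# Nearest-neighbor search: retrieve the memory most relevant to a question.
query = encoder.encode("When am I meeting Alice?").tolist()
for hit in table.search(query).limit(1).to_list():
    print(hit["text"])
```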
Skills
This is where things get interesting. While listening, Kokoro can trigger various Skills to gather more data and suggest actions. For example, as mentioned on our landing page, if you’ve met someone and agreed to have coffee, Kokoro will use the calendar Skill to suggest adding the coffee date to your calendar or inform you of any scheduling conflicts.
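To illustrate the shape a Skill might take, here is a hedged sketch: each Skill exposes a name and a handler for structured calls like the one in the LLM sketch above. The interface and the CalendarSkill behavior are assumptions for illustration, not Kokoro’s actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol

class Skill(Protocol):
    name: str
    def handle(self, args: dict) -> str: ...

@dataclass
class CalendarSkill:
    """Illustrative calendar Skill: suggests events and flags conflicts."""
    name: str = "calendar"
    events: dict[str, str] = field(default_factory=dict)  # day -> existing event

    def handle(self, args: dict) -> str:
        day, title = args["day"], args["title"]
        if day in self.events:
            return f"Heads up: '{self.events[day]}' is already on {day}."
        return f"Suggestion: add '{title}' to your calendar on {day}?"

# Dispatch a skill call of the kind the LLM emits (see the earlier sketch).
skills: dict[str, Skill] = {"calendar": CalendarSkill(events={"Friday": "Dentist"})}
call = {"skill": "calendar", "args": {"title": "Coffee with Alice", "day": "Friday"}}
print(skills[call["skill"]].handle(call["args"]))
```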
Register for the beta
Are you excited about Kokoro? Register for the beta right here.