Implementing "fuzzy search" in a Flutter app

Typos shouldn't stop users from finding results. Learn how to make that happen on the example of Bunny Search 🐰🔍

Jan 21, 2024

So one fact that everyone associates with me at this point is that besides loving Flutter, I’m also obsessed with cats. But that’s only one part of the story, as I have a soft spot for all animals in general. That’s why I’m very conscious about what cosmetics I use, since many brands still practice testing on animals. There is a whole community dedicated to classifying brands with “cruelty-free (CF)” status, as well as many official certifications such as PETA and Leaping Bunny. The problem, though, is that there are many nuances in defining a brand as CF, and to help users in their search, I’ve developed an app called Bunny Search. The name comes from the symbol you can find on the packaging of cosmetics, since many CF certifications have a bunny as their symbol. I told the story behind developing Bunny Search as a pet project quite some time ago, you can check it out here. Yes, I remember that this is a newsletter about Flutter 😅 So let’s talk about it.

One thing that I mentioned in that article is that I’ve tried to implement “fuzzy search”, but didn’t succeed. I tried again, and this time everything worked, so let’s dive into implementation.

Search is the key feature in my app, as the user searches for the brand name to find details about it. A very common problem for humans while typing is, well, typos. In a regular search, which in its simplest form is just a contains function, if the user makes a single mistake in their query, then the results won’t contain what they were looking for. Hence, the solution to this is to implement approximate string matching (commonly known as fuzzy search) as opposed to exact string matching. A long time ago, in one of the work projects, I’d already implemented this feature, so I knew exactly what I wanted to do…

Finding edit distance

Many algorithms exist to target this problem, and they usually involve counting the edit distance between the source and the target string. Edit distance is the amount of operations required to perform, which include substitution, insertion, and deletion, to transform one string into another. Sometimes also transposition, but we will talk about it later.

For example:

cot → cat: the edit distance will be 1 since we only need to substitute o for a to transform cot to cat.
cot → cats: the edit distance is 2, since in addition to one substitution, we also need one insertion at the end (the s letter). And yes, I have to insert 🐈 everywhere 🤪

And so on. Of course, using these 3 operations, we can come up with many different ways to transform the source string into the target string. The goal of the efficient edit distance calculating algorithm is to find the minimum edit distance. If you’re into hacking algorithms, you can stop here and go to LeetCode to try yourself at this task 😁

Implementing the Levenshtein Distance algorithm

One of the popular algorithms is the Levenshtein distance, named after the mathematician Vladimir Levenshtein. The idea is basically the same that we have just discussed, finding the minimal edit distance by performing insertion, deletion, and substitution on the source & target strings. The most optimal way to implement this algorithm is via the dynamic programming approach. I’m not going to go through it here because it has been very well explained in many YouTube videos, for example here or here. Off-topic: I generally prefer text, such as articles or the docs to grok technical concepts, but for algorithm explanations, I feel like videos do a better job of demonstrating. If I ever start YouTube, these are the types of videos I’d probably like to create…

Anyway, in a nutshell, the idea is to construct such a matrix:

I swear it’s not me doing my cat stuff again, this is taken from Wikipedia: https://en.m.wikipedia.org/wiki/Levenshtein_distance

We start with the base cases, and we arrive at the minimum edit distance. So for each position, we calculate the minimum distance of transforming the strings, and by the end, we have the minimum distance for the whole string. If you want, you can take a look at the code, or scroll further and see the result, well, at the end. But the vanilla implementation in Dart looks like this:

static int levenshteinDistance(String a, String b) {
    if (a == b) {
      return 0;
    }
    if (a.isEmpty) {
      return b.length;
    }
    if (b.isEmpty) {
      return a.length;
    }

    int aLength = a.length;
    int bLength = b.length;

    List<List<int>> matrix = List.generate(
      bLength + 1,
      (i) => List.generate(aLength + 1, (j) => 0, growable: false),
      growable: false,
    );

    for (int i = 0; i <= aLength; i++) {
      matrix[0][i] = i;
    }
    for (int i = 0; i <= bLength; i++) {
      matrix[i][0] = i;
    }

    for (int i = 1; i <= bLength; i++) {
      for (int j = 1; j <= aLength; j++) {
        int cost = (a[j - 1] == b[i - 1]) ? 0 : 1;
        matrix[i][j] = _min(
          matrix[i - 1][j] + 1, // deletion
          matrix[i][j - 1] + 1, // insertion
          matrix[i - 1][j - 1] + cost, // substitution
        );
      }
    }

    return matrix[bLength][aLength];
  }

Great, but what to do with this minimal distance in our search? Because strings can be of various lengths, I decided to allow a different amount of mistakes depending on the length of the query. So, if the query.length > 7, I allow 4 typos, otherwise 2. Of course, this is quite arbitrary, but I decided to start with this and refine it from here if required. So with this in mind, the actual search function can look like this:

static List<BrandEntity> searchSync(SearchQuery query) {
    final searchTerm = query.query.toLowerCase();

    // Filter & map the results  
    final filtered = query.allBrands.expand<SearchResult>((brand) {
      final title = brand.title.toLowerCase();

      if (title == searchTerm) { // Shortcut 1
        return [SearchResult(0, brand)];
      }

      if (title.contains(searchTerm)) { // Shortcut 2
        return [SearchResult(1, brand)];
      }

      final maxDistance = title.length > 7 ? 4 : 2;
      final nameDistance = levenshteinDistance(title, searchTerm, maxDistance);
      if (nameDistance <= maxDistance) {
        return [SearchResult(nameDistance, brand)];
      } else {
        return [];
      }
    }).toList();

    // Sort by edit distance
    filtered.sort((a, b) {
      return a.distance.compareTo(b.distance);
    });

    return filtered.map((result) => result.brand).toList();
  }

I’m not really a fan of the code formatting of Substack, so if you want, you can jump straight into the source code.

Step 1:

Firstly, there are 2 shortcuts:

If the source and the target match exactly, obviously the edit distance is 0, just return it.
If the target contains the source, we don’t really care about the exact edit distance, so just return 1. Moreover, even if the target string fully contains the source string, but itself is very long, we could run into a situation, when per the algorithm the edit distance > 4, so it won’t even get into the results.

Step 2:

If those conditions are not fulfilled, then count the Levenshtein distance, and if it’s less than maxDistance, include that brand in the search result.

Step 3:

Finally, sort by relevance (the less the edit distance, the more relevant) and return the result.

But if you look closely, you can see that I pass the maxDistance to levenshteinDistance function. I do that because if you think about it, the target strings can be very long, and if we have already calculated, that the minimum edit distance will be greater than the maximum amount of typos we allow, then there is no reason for us to continue calculating the whole matrix, hence we can return early. For that, we can find the minimum item in each row, and once that exceeds the max distance, just return it. So our updated function can look like this:

static int levenshteinDistance(String a, String b, int maxDistance) {
    if (a == b) {
      return 0;
    }
    if (a.isEmpty) {
      return b.length <= maxDistance ? b.length : maxDistance + 1;
    }
    if (b.isEmpty) {
      return a.length <= maxDistance ? a.length : maxDistance + 1;
    }

    int aLength = a.length;
    int bLength = b.length;
    List<List<int>> matrix = List.generate(
      bLength + 1,
      (i) => List.generate(aLength + 1, (j) => 0, growable: false),
      growable: false,
    );

    for (int i = 0; i <= aLength; i++) {
      matrix[0][i] = i;
    }
    for (int i = 0; i <= bLength; i++) {
      matrix[i][0] = i;
    }

    for (int i = 1; i <= bLength; i++) {
      // 1. Init our `minRowValue`
      int minRowValue = maxDistance + 1;
      for (int j = 1; j <= aLength; j++) {
        int cost = (a[j - 1] == b[i - 1]) ? 0 : 1;
        matrix[i][j] = _min(
          matrix[i - 1][j] + 1, // deletion
          matrix[i][j - 1] + 1, // insertion
          matrix[i - 1][j - 1] + cost, // substitution
        );

        // 2. Find the current row min
        minRowValue = min(minRowValue, matrix[i][j]);
      }

      // 3. If it's already larger than `maxDistance`, return early 
      if (minRowValue > maxDistance) {
        return maxDistance + 1;
      }
    }

    return matrix[bLength][aLength];
  }

Now our search function should perform even better when long strings are involved. If you want to see it in action, you can check out the source code and play around with tests.

We have one last step to do. Remember, how in the beginning I said that the previous time I tried to implement this feature, I didn’t succeed? My database has around 6k records in it, which isn’t even that big, but running the Levenshtein distance algorithm with the complexity O(n*m) (where n & m are source and target string lengths) on that dataset still caused some jank.

So I offset it all to an Isolate… and it consistently took a whopping 10 seconds to perform a search. Obviously, the users would never wait that long, so it wasn’t an option. I don’t exactly know why it happened, but it was before Flutter 2.8, which updated to use Dart’s Isolate Groups and promised performance gains, so maybe that’s related. Anyway, I’m happy that now it works, and even though for smaller datasets you might get away with calculating everything on the main Isolate, I’d rather stay safe and avoid jank by using the convenient compute function. So finally, our end function looks like this:

Future<List<BrandEntity>> search({
    required List<BrandEntity> allBrands,
    required String query,
  }) => compute(searchSync, SearchQuery(allBrands, query));

Before wrapping this up, here are a couple of things that could be even better:

The vanilla Levenshtein distance doesn’t consider an operation called transposition. This is when two adjacent symbols could be swapped places and this would be considered a single operation, e.g. ab → ba = 1. But the Damerau–Levenshtein distance does. It is a more complex algorithm, and you can read more about it in the good old Wikipedia.
Currently, this test fails:

test(
     'Typo in Long Brand Name (Multiple characters off): Anastasia as Anastsia',
        () {
      var query = SearchQuery(makeupBrands, 'Anastsia');
      var results = SearchService.searchSync(query).map((b) => b.title).toList();
      expect(results, contains(matches('Anastasia Beverly Hills')));
    });

This is because the whole string “Anastasia Beverly Hills” is very long, and it reaches max distance. A better logic would be to also ignore extra symbols at the beginning and the end of the string, as long as part of it matches the query. I will look into it 🔍

Also, I wanted to mention that I tried working with ChatGPT, and while it gave a solid implementation of the vanilla Levenshtein distance algorithm, it made many mistakes in explaining how it works, analyzing examples, and optimizing. So I wouldn’t fully trust it with algorithms just yet.

So to summarize, to implement “fuzzy search”, follow these steps:

Implement a minimum edit distance searching algorithm, such as Levenshtein distance.
Define the maximum allowed typos based on the string length and sort the filtered list based on that.
Offload the computation to another Isolate via the utility compute function.

And here’s a small list of completely random things to read, not exactly related to today’s topic, but may be interesting to you:

My article about the Set data structure in Dart. I’ve never continued the series, but now I think maybe I should?
Article from Slava Egorov about microbenchmarking Dart. It’s like reading a detective story but about programming.
New official docs on Isolates with step-by-step explanations of more complex use cases.

I hope you found this issue interesting and see you next time! If you’ve found any errors or have suggestions, they’re very welcome ☺️ And let’s connect on Twitter (X 🙄)

-Daria 💙

P.S. - how do you prefer to digest content about algorithms & data structures, via text or via videos? 👀

Enigma 💡

Sep 21

Hi, Daria...Lovely article. Came to your Substack after watching your talk at Flutter Con 24; Thanks for putting this up. Also, i wanted to say, like the Youtube style videos you mentioned, this article would really pop if you had a gif demonstrating the accuracy of the search in helping users as you help us arrive at the solution, even more preferably using your app

Expand full comment

1 reply by Daria

Daniela Grothe

Jun 15Edited

Still new on Substack and I might overlook one thing: Is this package yours, or did you just find it?

2 more comments...

Daria’s Flutter diaries

Discussion about this post