Airtable, Python and Fuzzy Matching

Inspired by this question:

I thought I would try to use Python to do some fuzzy matching on Airtable data. Not really an Airtable demo (as the data could be stored anywhere), but hopefully some readers find this interesting :slight_smile:

This article:

pointed me to a Python library for fuzzy matching (fuzzywuzzy) and gives some good code snippets for how to use the library.

As per the original post, I used a Donors/Donation base to prototype the matching process. I have a Donors table:

and a Donations table:

The aim is to use a fuzzy matching routine to suggest which is the closest matching record in the Donor table for each Donation.

I added Suggested Donor and Suggested Donor Score fields to the Donations table:

I have a Python script which uses the Airtable API to compare every donation name to every donor name and insert the closest matching, i.e. highest scoring, into the Suggested Donor field. This is set up as a linked record so that, for example, the donation total could be rolled up on the Donors table.

When the script runs you end up with this:

(A score of 100 being an exact match). There’s some low scores (58) which are obviously not a match and some high scores (James Smith/Jane Smith/86) which are also not a match.

Setting the threshold to a high number in the script, e.g. >= 95, gives this:

This might be a better approach where a decent proportion of the records are matched, leaving you to match the others manually, but you have high confidence in the records that have been matched.

Here’s the Python script:

import json
import requests
from fuzzywuzzy import process

get_headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
    }

# get an array of donors
donors_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donors?maxRecords=100&view=All_Donors'
donors_response = requests.get(donors_url, headers=get_headers)
donors_data = donors_response.json()
donors_list = []
for j in donors_data['records']:
    donors_list.append(j['fields']['Name'])


# get the donations
donations_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations?maxRecords=100&view=All_Donations'
donations_response = requests.get(donations_url, headers=get_headers)
donations_data = donations_response.json()

# for each donation get the highest matching donor record
for i in donations_data['records']:
    donor_name = i['fields']['Name']
    highest = process.extractOne(donor_name, donors_list)

    # if the highest match has a score >= 80 (or some other score)...
    if highest[1] >= 95:
        # ...get the donor record...
        params = querystring = {"filterByFormula":"({Name}=\""+highest[0]+"\")"}
        get_matching_donor_response = requests.get(donors_url, headers=get_headers, params=querystring)
        matching_donor_id = get_matching_donor_response.json()['records'][0]['id']

        # ... and update the donation with the "suggested donor"
        update_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations/' + i['id']
        update_headers = {
            'Authorization': 'Bearer YOUR_API_KEY',
            'Content-Type': 'application/json'
        }
        update_data = {
          "fields": {
            "Suggested Donor": [
              matching_donor_id
            ],
            "Suggested Donor Score": highest[1]
          }
        }
        print(update_data)
        update_response = requests.patch(update_url, headers=update_headers, json=update_data)
        print(update_response.text)

Looking forward to the new Airtable scripting capability (now in private beta?) where I hope we might be able to do similar things wholly within Airtable.

JB

6 Likes

Jonathan - this solution is really incredible and is exactly what we need! Might you be available to chat about how I get the Airtable–Python link set up? I’m a complete beginning when it comes to this and am assuming it’s super complicated, but perhaps it’s easier to set up, given the script you’ve provided, than I thought?

1 Like

Hi @Meghan_Moore - yes, it looks more complicated than it really is :slight_smile:

Will DM you.

JB

1 Like

Hey JB did you ever figure out to do this entirely in within Scripting Blocks (or better yet in Automation scripting)?

Hi @Matteo_Cossu2 - No, didn’t find a solution for this. Really this needs a library/package to achieve (as with Python), but you can’t import packages on Airtable scripts, so that won’t work. Maybe there’s a non-package solution out there, but I haven’t come across one yet.

1 Like

@Matteo_Cossu2 - this is a bit embarrasing…I’d actually written a script for this, but had totally forgotten about it. I cribbed some “distance algorithm” code from the web somewhere and incorporated it into an Airtable script:

// Levenshtein distance function
let levenshteinDistance = function(a, b){
    if(!a || !b) return (a || b).length;
    var m = [];
    for(var i = 0; i <= b.length; i++){
        m[i] = [i];
        if(i === 0) continue;
        for(var j = 0; j <= a.length; j++){
            m[0][j] = j;
            if(j === 0) continue;
            m[i][j] = b.charAt(i - 1) == a.charAt(j - 1) ? m[i - 1][j - 1] : Math.min(
                m[i-1][j-1] + 1,
                m[i][j-1] + 1,
                m[i-1][j] + 1
            );
        }
    }
    return m[b.length][a.length];
};

// max string length function (gets the max length of two strings)
let maxStringLength = function(a, b){
    return Math.max(a.length, b.length);
};

//get the ratio that the user wants to use on this iteration, e.g. 0.8
let inputRatio = await input.textAsync("Enter the match ratio for this search (number between 0 and 1)")

// set the tables
let matchTable = base.getTable("Match");
let originalTable = base.getTable("Original");

// get the records from the master and match tables
let matchQuery = await matchTable.selectRecordsAsync({
        fields: ['Name']
});
let originalQuery = await originalTable.selectRecordsAsync({
        fields: ['Name']
});

// iterate through the match records
for (let matchRecord of matchQuery.records) {
    // create an empty array to hold the matched IDs from master
    let matchedIds = []
    // iterate through the master records
    for (let masterRecord of originalQuery.records) {
        // do some regex so that only leters and numbers are considered - also lower case the two strings
        let formattedMatchRecord = matchRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
        let formattedOriginalRecord = masterRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
        
        // calculate the Levenshtein distance
        let ld = levenshteinDistance(formattedMatchRecord, formattedOriginalRecord);
        // calculate the max length of each pair of strings (match and master)
        let maxLength = maxStringLength(formattedMatchRecord, formattedMatchRecord);

        // log the inputs and outputs for debugging
        console.log(formattedMatchRecord, formattedOriginalRecord, ld, maxLength, 1-ld/maxLength);

        // if the calculated ratio is greater than the input threshold
        if(1-ld/maxLength > Number(inputRatio)) {
            // add the master id to the array
            matchedIds.push(
                {id: masterRecord.id}
            )
        }
    }
    // update the match record with the array of master ids
    let update = await matchTable.updateRecordAsync(matchRecord, {
            "Suggested Link to Original": matchedIds
        }
    );
}

The way this works is that you have 2 tables, both with a name column and the script links to suggested records based on the “closeness” of the original name to the match name.

You enter a number between 0 and 1 at the start (try 0.7 or 0.8) - the higher the number the more strict the threshold for suggesting matches. More info here:


The script implements the Levenshtein distance algorithm to check for similarity in two strings. This allows you to suggest links between two sets of data based on the similarity of fields within the base. This might be useful, for example, where you have a “original” customer list and want to try to match or link that to a second source of customer data.

More information on the Levenshtein distance can be found here: Levenshtein distance - Wikipedia

4 Likes

@JonathanBowen this is amazing. I’ll implement it and let you know how it works!

Wow, this is incredibly timely! I’ve been working on a project for the last couple months, and at some point I was going to need to reconcile two lists with 3,000+ entries each. Luckily for me, I had put off this part of it and just started on today, 2 days after your solution was posted! After some small changes to match my own base, the script is working great. This is going to save me so much time, thank you!

One note, there’s an unintentional issue with your code that will cause it to break if you’re not matching on the Primary field. The .name attribute of a record refers to the value in the Primary field, not the value stored in the field called “Name”. I guess it just so happens that your primary field is also called “Name”, so it works as expected in your case.

To fix this, the following two lines need to be changed:

let formattedMatchRecord = matchRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
let formattedOriginalRecord = masterRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');

They should instead look like this:

let formattedMatchRecord = matchRecord.getCellValueAsString('Field Name Here').toLowerCase().replace(/[^a-z0-9]/gi,'');
let formattedOriginalRecord = masterRecord.getCellValueAsString('Field Name Here').toLowerCase().replace(/[^a-z0-9]/gi,'');

@Tim_Mackey - yes, you have to set the field names to match your base set up. Of course, they also don’t need to be primary fields - you can match any two fields.

Hey Tim, how’s @JonathanBowen’s algorithm working for you in terms of performance? I have to iterate through 12-13k records and it breaks my AT usage every time. Any suggestions much appreciated!

Are you using it in an automation or in a script extension? It’s a pretty time consuming function to run, so I’d recommend running it in a script extension otherwise I would expect it to time out on a large table.

If you’re running it in an extension and the script is locking up, the issue may be caused by the amount of logging that the script does. I recall running into that issue as I was also dealing with a huge number of records. I made some changes to the script to make the debugging logs optional, and also optimized the update function to update fields in batches of 50, rather than 1 field at a time.

Here’s my updated script based on @JonathanBowen’s example. All the field names and settings are moved to variables at the top of the script:

// Script settings >>
const matchTableName = 'Match Table'; // table name to add matches to
const originalTableName = 'Original Table'; // table name to look for matches

const matchFieldName = 'Name'; // field name to match against in matching table
const originalFieldName = 'Name'; // field name to match against in original table

const suggestedMatchFieldName = 'Suggested Link to Game'; // field name to store suggested matches

const logLevel = 0; // 0 = no logging; 1 = log number of matches for each record; 2 = same as 1 plus log individual match values for every match-field/original-field combination
// << End script settings

// Levenshtein distance function
let levenshteinDistance = function(a, b){
  if(!a || !b) return (a || b).length;
  var m = [];
  for(var i = 0; i <= b.length; i++){
      m[i] = [i];
      if(i === 0) continue;
      for(var j = 0; j <= a.length; j++){
          m[0][j] = j;
          if(j === 0) continue;
          m[i][j] = b.charAt(i - 1) == a.charAt(j - 1) ? m[i - 1][j - 1] : Math.min(
              m[i-1][j-1] + 1,
              m[i][j-1] + 1,
              m[i-1][j] + 1
          );
      }
  }
  return m[b.length][a.length];
};

// max string length function (gets the max length of two strings)
let maxStringLength = function(a, b){
  return Math.max(a.length, b.length);
};

//get the ratio that the user wants to use on this iteration, e.g. 0.8
let inputRatio = await input.textAsync('Enter the match ratio for this search (number between 0 and 1)')

console.log(`Searching for matches...`);

// set the tables
let matchTable = base.getTable(matchTableName);
let originalTable = base.getTable(originalTableName);
let updates = [];

// get the records from the master and match tables
let matchQuery = await matchTable.selectRecordsAsync({
      fields: [matchFieldName]
});
let originalQuery = await originalTable.selectRecordsAsync({
      fields: [originalFieldName]
});

// iterate through the match records
for (let matchRecord of matchQuery.records) {
  // create an empty array to hold the matched IDs from master
  let matchedIds = []
  // iterate through the master records
  for (let masterRecord of originalQuery.records) {
      // do some regex so that only letters and numbers are considered - also lower case the two strings
      let formattedMatchRecord = matchRecord.getCellValueAsString(matchFieldName).toLowerCase().replace(/[^a-z0-9]/gi,'');
      let formattedOriginalRecord = masterRecord.getCellValueAsString(originalFieldName).toLowerCase().replace(/[^a-z0-9]/gi,'');
      
      // calculate the Levenshtein distance
      let ld = levenshteinDistance(formattedMatchRecord, formattedOriginalRecord);
      // calculate the max length of each pair of strings (match and master)
      let maxLength = maxStringLength(formattedMatchRecord, formattedMatchRecord);

      // log the inputs and outputs for debugging
      if (logLevel >= 2) {
        console.log(formattedMatchRecord, formattedOriginalRecord, ld, maxLength, 1-ld/maxLength);
      }

      // if the calculated ratio is greater than the input threshold
      if(1-ld/maxLength > Number(inputRatio)) {
          // add the master id to the array
          matchedIds.push(
              {id: masterRecord.id}
          )
      }
  }

  // log how many matches there are
  if (logLevel >= 1) {
    console.log(matchRecord.name, matchedIds.length);
  }

  // If there are any matches, add them to the list of updates to be made
  if (matchedIds.length > 0) {
    let update = {
      id: matchRecord.id,
      fields: {
        [suggestedMatchFieldName]: matchedIds
      }
    };
    updates.push(update);
  }
}

// Batch update table with all matched fields
// Only up to 50 updates are allowed at one time, so we need to do a loop
while (updates.length > 0) {
  await matchTable.updateRecordsAsync(updates.slice(0, 50));
  updates = updates.slice(50);
}

console.log(`Process complete.`);

Let us know how it goes!

1 Like