Help

Re: Airtable, Python and Fuzzy Matching

23408 4
cancel
Showing results for 
Search instead for 
Did you mean: 

Inspired by this question:

I thought I would try to use Python to do some fuzzy matching on Airtable data. Not really an Airtable demo (as the data could be stored anywhere), but hopefully some readers find this interesting :slightly_smiling_face:

This article:

pointed me to a Python library for fuzzy matching (fuzzywuzzy) and gives some good code snippets for how to use the library.

As per the original post, I used a Donors/Donation base to prototype the matching process. I have a Donors table:

Screenshot 2020-01-22 at 11.50.31

and a Donations table:

Screenshot 2020-01-22 at 11.50.40

The aim is to use a fuzzy matching routine to suggest which is the closest matching record in the Donor table for each Donation.

I added Suggested Donor and Suggested Donor Score fields to the Donations table:

Screenshot 2020-01-22 at 11.52.22

I have a Python script which uses the Airtable API to compare every donation name to every donor name and insert the closest matching, i.e. highest scoring, into the Suggested Donor field. This is set up as a linked record so that, for example, the donation total could be rolled up on the Donors table.

When the script runs you end up with this:

Screenshot 2020-01-22 at 12.04.40

(A score of 100 being an exact match). There’s some low scores (58) which are obviously not a match and some high scores (James Smith/Jane Smith/86) which are also not a match.

Setting the threshold to a high number in the script, e.g. >= 95, gives this:

Screenshot 2020-01-22 at 12.06.52

This might be a better approach where a decent proportion of the records are matched, leaving you to match the others manually, but you have high confidence in the records that have been matched.

Here’s the Python script:

import json
import requests
from fuzzywuzzy import process

get_headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
    }

# get an array of donors
donors_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donors?maxRecords=100&view=All_Donors'
donors_response = requests.get(donors_url, headers=get_headers)
donors_data = donors_response.json()
donors_list = []
for j in donors_data['records']:
    donors_list.append(j['fields']['Name'])


# get the donations
donations_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations?maxRecords=100&view=All_Donations'
donations_response = requests.get(donations_url, headers=get_headers)
donations_data = donations_response.json()

# for each donation get the highest matching donor record
for i in donations_data['records']:
    donor_name = i['fields']['Name']
    highest = process.extractOne(donor_name, donors_list)

    # if the highest match has a score >= 80 (or some other score)...
    if highest[1] >= 95:
        # ...get the donor record...
        params = querystring = {"filterByFormula":"({Name}=\""+highest[0]+"\")"}
        get_matching_donor_response = requests.get(donors_url, headers=get_headers, params=querystring)
        matching_donor_id = get_matching_donor_response.json()['records'][0]['id']

        # ... and update the donation with the "suggested donor"
        update_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations/' + i['id']
        update_headers = {
            'Authorization': 'Bearer YOUR_API_KEY',
            'Content-Type': 'application/json'
        }
        update_data = {
          "fields": {
            "Suggested Donor": [
              matching_donor_id
            ],
            "Suggested Donor Score": highest[1]
          }
        }
        print(update_data)
        update_response = requests.patch(update_url, headers=update_headers, json=update_data)
        print(update_response.text)

Looking forward to the new Airtable scripting capability (now in private beta?) where I hope we might be able to do similar things wholly within Airtable.

JB

14 Replies 14
Meghan_Moore
5 - Automation Enthusiast
5 - Automation Enthusiast

Jonathan - this solution is really incredible and is exactly what we need! Might you be available to chat about how I get the Airtable–Python link set up? I’m a complete beginning when it comes to this and am assuming it’s super complicated, but perhaps it’s easier to set up, given the script you’ve provided, than I thought?

Hi @Meghan_Moore - yes, it looks more complicated than it really is :slightly_smiling_face:

Will DM you.

JB

Hey JB did you ever figure out to do this entirely in within Scripting Blocks (or better yet in Automation scripting)?

Hi @Matteo_Cossu2 - No, didn’t find a solution for this. Really this needs a library/package to achieve (as with Python), but you can’t import packages on Airtable scripts, so that won’t work. Maybe there’s a non-package solution out there, but I haven’t come across one yet.

@Matteo_Cossu2 - this is a bit embarrasing…I’d actually written a script for this, but had totally forgotten about it. I cribbed some “distance algorithm” code from the web somewhere and incorporated it into an Airtable script:

// Levenshtein distance function
let levenshteinDistance = function(a, b){
    if(!a || !b) return (a || b).length;
    var m = [];
    for(var i = 0; i <= b.length; i++){
        m[i] = [i];
        if(i === 0) continue;
        for(var j = 0; j <= a.length; j++){
            m[0][j] = j;
            if(j === 0) continue;
            m[i][j] = b.charAt(i - 1) == a.charAt(j - 1) ? m[i - 1][j - 1] : Math.min(
                m[i-1][j-1] + 1,
                m[i][j-1] + 1,
                m[i-1][j] + 1
            );
        }
    }
    return m[b.length][a.length];
};

// max string length function (gets the max length of two strings)
let maxStringLength = function(a, b){
    return Math.max(a.length, b.length);
};

//get the ratio that the user wants to use on this iteration, e.g. 0.8
let inputRatio = await input.textAsync("Enter the match ratio for this search (number between 0 and 1)")

// set the tables
let matchTable = base.getTable("Match");
let originalTable = base.getTable("Original");

// get the records from the master and match tables
let matchQuery = await matchTable.selectRecordsAsync({
        fields: ['Name']
});
let originalQuery = await originalTable.selectRecordsAsync({
        fields: ['Name']
});

// iterate through the match records
for (let matchRecord of matchQuery.records) {
    // create an empty array to hold the matched IDs from master
    let matchedIds = []
    // iterate through the master records
    for (let masterRecord of originalQuery.records) {
        // do some regex so that only leters and numbers are considered - also lower case the two strings
        let formattedMatchRecord = matchRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
        let formattedOriginalRecord = masterRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
        
        // calculate the Levenshtein distance
        let ld = levenshteinDistance(formattedMatchRecord, formattedOriginalRecord);
        // calculate the max length of each pair of strings (match and master)
        let maxLength = maxStringLength(formattedMatchRecord, formattedMatchRecord);

        // log the inputs and outputs for debugging
        console.log(formattedMatchRecord, formattedOriginalRecord, ld, maxLength, 1-ld/maxLength);

        // if the calculated ratio is greater than the input threshold
        if(1-ld/maxLength > Number(inputRatio)) {
            // add the master id to the array
            matchedIds.push(
                {id: masterRecord.id}
            )
        }
    }
    // update the match record with the array of master ids
    let update = await matchTable.updateRecordAsync(matchRecord, {
            "Suggested Link to Original": matchedIds
        }
    );
}

The way this works is that you have 2 tables, both with a name column and the script links to suggested records based on the “closeness” of the original name to the match name.

Screenshot 2022-06-22 at 20.05.37

You enter a number between 0 and 1 at the start (try 0.7 or 0.8) - the higher the number the more strict the threshold for suggesting matches. More info here:


The script implements the Levenshtein distance algorithm to check for similarity in two strings. This allows you to suggest links between two sets of data based on the similarity of fields within the base. This might be useful, for example, where you have a “original” customer list and want to try to match or link that to a second source of customer data.

More information on the Levenshtein distance can be found here: Levenshtein distance - Wikipedia

Matteo_Cossu2
5 - Automation Enthusiast
5 - Automation Enthusiast

@JonathanBowen this is amazing. I’ll implement it and let you know how it works!

Wow, this is incredibly timely! I’ve been working on a project for the last couple months, and at some point I was going to need to reconcile two lists with 3,000+ entries each. Luckily for me, I had put off this part of it and just started on today, 2 days after your solution was posted! After some small changes to match my own base, the script is working great. This is going to save me so much time, thank you!

One note, there’s an unintentional issue with your code that will cause it to break if you’re not matching on the Primary field. The .name attribute of a record refers to the value in the Primary field, not the value stored in the field called “Name”. I guess it just so happens that your primary field is also called “Name”, so it works as expected in your case.

To fix this, the following two lines need to be changed:

let formattedMatchRecord = matchRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');
let formattedOriginalRecord = masterRecord.name.toLowerCase().replace(/[^a-z0-9]/gi,'');

They should instead look like this:

let formattedMatchRecord = matchRecord.getCellValueAsString('Field Name Here').toLowerCase().replace(/[^a-z0-9]/gi,'');
let formattedOriginalRecord = masterRecord.getCellValueAsString('Field Name Here').toLowerCase().replace(/[^a-z0-9]/gi,'');

@Tim_Mackey - yes, you have to set the field names to match your base set up. Of course, they also don’t need to be primary fields - you can match any two fields.

Hey Tim, how’s @JonathanBowen’s algorithm working for you in terms of performance? I have to iterate through 12-13k records and it breaks my AT usage every time. Any suggestions much appreciated!