Jan 22, 2020 04:17 AM
Inspired by this question:
I thought I would try to use Python to do some fuzzy matching on Airtable data. Not really an Airtable demo (as the data could be stored anywhere), but hopefully some readers find this interesting :slightly_smiling_face:
This article:
pointed me to a Python library for fuzzy matching (fuzzywuzzy) and gives some good code snippets for how to use the library.
As per the original post, I used a Donors/Donation base to prototype the matching process. I have a Donors table:
and a Donations table:
The aim is to use a fuzzy matching routine to suggest which is the closest matching record in the Donor table for each Donation.
I added Suggested Donor and Suggested Donor Score fields to the Donations table:
I have a Python script which uses the Airtable API to compare every donation name to every donor name and insert the closest matching, i.e. highest scoring, into the Suggested Donor field. This is set up as a linked record so that, for example, the donation total could be rolled up on the Donors table.
When the script runs you end up with this:
(A score of 100 being an exact match). There’s some low scores (58) which are obviously not a match and some high scores (James Smith/Jane Smith/86) which are also not a match.
Setting the threshold to a high number in the script, e.g. >= 95, gives this:
This might be a better approach where a decent proportion of the records are matched, leaving you to match the others manually, but you have high confidence in the records that have been matched.
Here’s the Python script:
import json
import requests
from fuzzywuzzy import process
get_headers = {
'Authorization': 'Bearer YOUR_API_KEY'
}
# get an array of donors
donors_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donors?maxRecords=100&view=All_Donors'
donors_response = requests.get(donors_url, headers=get_headers)
donors_data = donors_response.json()
donors_list = []
for j in donors_data['records']:
donors_list.append(j['fields']['Name'])
# get the donations
donations_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations?maxRecords=100&view=All_Donations'
donations_response = requests.get(donations_url, headers=get_headers)
donations_data = donations_response.json()
# for each donation get the highest matching donor record
for i in donations_data['records']:
donor_name = i['fields']['Name']
highest = process.extractOne(donor_name, donors_list)
# if the highest match has a score >= 80 (or some other score)...
if highest[1] >= 95:
# ...get the donor record...
params = querystring = {"filterByFormula":"({Name}=\""+highest[0]+"\")"}
get_matching_donor_response = requests.get(donors_url, headers=get_headers, params=querystring)
matching_donor_id = get_matching_donor_response.json()['records'][0]['id']
# ... and update the donation with the "suggested donor"
update_url = 'https://api.airtable.com/v0/YOUR_APP_ID/Donations/' + i['id']
update_headers = {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
}
update_data = {
"fields": {
"Suggested Donor": [
matching_donor_id
],
"Suggested Donor Score": highest[1]
}
}
print(update_data)
update_response = requests.patch(update_url, headers=update_headers, json=update_data)
print(update_response.text)
Looking forward to the new Airtable scripting capability (now in private beta?) where I hope we might be able to do similar things wholly within Airtable.
JB
Jul 15, 2022 03:24 PM
Are you using it in an automation or in a script extension? It’s a pretty time consuming function to run, so I’d recommend running it in a script extension otherwise I would expect it to time out on a large table.
If you’re running it in an extension and the script is locking up, the issue may be caused by the amount of logging that the script does. I recall running into that issue as I was also dealing with a huge number of records. I made some changes to the script to make the debugging logs optional, and also optimized the update function to update fields in batches of 50, rather than 1 field at a time.
Here’s my updated script based on @JonathanBowen’s example. All the field names and settings are moved to variables at the top of the script:
// Script settings >>
const matchTableName = 'Match Table'; // table name to add matches to
const originalTableName = 'Original Table'; // table name to look for matches
const matchFieldName = 'Name'; // field name to match against in matching table
const originalFieldName = 'Name'; // field name to match against in original table
const suggestedMatchFieldName = 'Suggested Link to Game'; // field name to store suggested matches
const logLevel = 0; // 0 = no logging; 1 = log number of matches for each record; 2 = same as 1 plus log individual match values for every match-field/original-field combination
// << End script settings
// Levenshtein distance function
let levenshteinDistance = function(a, b){
if(!a || !b) return (a || b).length;
var m = [];
for(var i = 0; i <= b.length; i++){
m[i] = [i];
if(i === 0) continue;
for(var j = 0; j <= a.length; j++){
m[0][j] = j;
if(j === 0) continue;
m[i][j] = b.charAt(i - 1) == a.charAt(j - 1) ? m[i - 1][j - 1] : Math.min(
m[i-1][j-1] + 1,
m[i][j-1] + 1,
m[i-1][j] + 1
);
}
}
return m[b.length][a.length];
};
// max string length function (gets the max length of two strings)
let maxStringLength = function(a, b){
return Math.max(a.length, b.length);
};
//get the ratio that the user wants to use on this iteration, e.g. 0.8
let inputRatio = await input.textAsync('Enter the match ratio for this search (number between 0 and 1)')
console.log(`Searching for matches...`);
// set the tables
let matchTable = base.getTable(matchTableName);
let originalTable = base.getTable(originalTableName);
let updates = [];
// get the records from the master and match tables
let matchQuery = await matchTable.selectRecordsAsync({
fields: [matchFieldName]
});
let originalQuery = await originalTable.selectRecordsAsync({
fields: [originalFieldName]
});
// iterate through the match records
for (let matchRecord of matchQuery.records) {
// create an empty array to hold the matched IDs from master
let matchedIds = []
// iterate through the master records
for (let masterRecord of originalQuery.records) {
// do some regex so that only letters and numbers are considered - also lower case the two strings
let formattedMatchRecord = matchRecord.getCellValueAsString(matchFieldName).toLowerCase().replace(/[^a-z0-9]/gi,'');
let formattedOriginalRecord = masterRecord.getCellValueAsString(originalFieldName).toLowerCase().replace(/[^a-z0-9]/gi,'');
// calculate the Levenshtein distance
let ld = levenshteinDistance(formattedMatchRecord, formattedOriginalRecord);
// calculate the max length of each pair of strings (match and master)
let maxLength = maxStringLength(formattedMatchRecord, formattedMatchRecord);
// log the inputs and outputs for debugging
if (logLevel >= 2) {
console.log(formattedMatchRecord, formattedOriginalRecord, ld, maxLength, 1-ld/maxLength);
}
// if the calculated ratio is greater than the input threshold
if(1-ld/maxLength > Number(inputRatio)) {
// add the master id to the array
matchedIds.push(
{id: masterRecord.id}
)
}
}
// log how many matches there are
if (logLevel >= 1) {
console.log(matchRecord.name, matchedIds.length);
}
// If there are any matches, add them to the list of updates to be made
if (matchedIds.length > 0) {
let update = {
id: matchRecord.id,
fields: {
[suggestedMatchFieldName]: matchedIds
}
};
updates.push(update);
}
}
// Batch update table with all matched fields
// Only up to 50 updates are allowed at one time, so we need to do a loop
while (updates.length > 0) {
await matchTable.updateRecordsAsync(updates.slice(0, 50));
updates = updates.slice(50);
}
console.log(`Process complete.`);
Let us know how it goes!
Sep 03, 2023 07:17 PM - edited Sep 03, 2023 07:21 PM
@Tim_Mackey or @JonathanBowen, might you be able to suggest how this script could be adapted to link similar records within a single table?
The Dedupe extension does an admirable job of finding the records I would like to link together, but can't link them, and the "Mark Duplicates" script example found here seems designed to do the marking, but lacks any sort of fuzzy matching.
I have a database of several thousand test questions imported from multiple tests, some of which I know have been reused. My goal is to link those near-duplicate records together within the "Questions" table to track how questions are being reused. Unfortunately, I'm counting on OCR for the question text, which hasn't produced exact duplicate text fields to match.
I'm doing my best to come up with workarounds, but my scripting chops are quite limited.
Sep 04, 2023 07:26 AM - edited Sep 04, 2023 07:46 AM
I appear to have the script running on a set of duplicated tables. Given the speed at which the script runs, this is probably for the best since I was able to reduce my 15000+ question records to 2 sets of just 1500ish which I'm sure contain the duplicates and don't overlap. I'll copy and paste the link fields back into the original records when it's done.
@JonathanBowen 's version of the script at first locked up pretty quickly, but I commented out the console logging based on @Tim_Mackey comment and now it's chugging away beautifully.
I might try running the script on a truncated part of the match text (full length is often a paragraph or more) and a higher accuracy threshold in the hopes that a shorter string to check produces faster run times without too many false matches. Update: Yes, much faster.
Thanks so much for writing this.
Sep 29, 2023 11:04 AM
Thank you for this! Couple of things for anyone else using this in the future.
@Tim_Mackey 's version included this as well, but I didn't catch it in time. The original script has an update at the end of each loop regardless of if the update array was null which was incredibly slow. This part of the script is essential if you have a lot of records!
// If there are any matches, add them to the list of updates to be made
if (matchedIds.length > 0) {
let update = {
id: matchRecord.id,
fields: {
[suggestedMatchFieldName]: matchedIds
}
};
updates.push(update);
}
}
// Batch update table with all matched fields
// Only up to 50 updates are allowed at one time, so we need to do a loop
while (updates.length > 0) {
await matchTable.updateRecordsAsync(updates.slice(0, 50));
updates = updates.slice(50);
}
I also wanted to use this script to compare a values in a single column to each other, rather than to a separate column. I made some modifications to speed up that process by creating an array and 2 counters, then comparing my match value to only what was below it in the column. Here's what I ended up using for that:
// Levenshtein distance function
let levenshteinDistance = function(a, b){
if(!a || !b) return (a || b).length;
var m = [];
for(var i = 0; i <= b.length; i++){
m[i] = [i];
if(i === 0) continue;
for(var j = 0; j <= a.length; j++){
m[0][j] = j;
if(j === 0) continue;
m[i][j] = b.charAt(i - 1) == a.charAt(j - 1) ? m[i - 1][j - 1] : Math.min(
m[i-1][j-1] + 1,
m[i][j-1] + 1,
m[i-1][j] + 1
);
}
}
return m[b.length][a.length];
};
// max string length function (gets the max length of two strings)
let maxStringLength = function(a, b){
return Math.max(a.length, b.length);
};
//get the ratio that the user wants to use on this iteration, e.g. 0.8
let inputRatio = await input.textAsync("Enter the match ratio for this search (number between 0 and 1)")
// set the tables
let matchTable = base.getTable("TABLE");
let originalTable = base.getTable("TABLE)");
// get the records from the master and match tables
let matchQuery = await matchTable.selectRecordsAsync({
fields: ['field']
});
let originalQuery = await originalTable.selectRecordsAsync({
fields: ['field']
});
//create array to index into
let originalArray = [... new Set (originalQuery.records.map(record => record.getCellValueAsString ('field')))]
// iterate through the match records
//create loop counters
let innercounter=0
let outercounter=0
//loop through match records
for (let matchRecord of matchQuery.records) {
//set innter loop counter equal to the number of outer loop iterations
innercounter = outercounter
// create an empty array to hold the matched IDs from master
let matchedIds = []
for (let masterRecord of originalQuery.records) {
//check for overflow
if (innercounter < (originalArray.length-1)) {
let formattedMatchRecord = matchRecord.getCellValueAsString ("Title").toLowerCase().replace(/[^a-z0-9]/gi,'')
let formattedOriginalRecord = originalArray[innercounter+1].toLowerCase().replace(/[^a-z0-9]/gi,'')
//count interations to avoid overflow
innercounter++
// calculate the Levenshtein distance
let ld = levenshteinDistance(formattedMatchRecord, formattedOriginalRecord);
// calculate the max length of each pair of strings (match and master)
let maxLength = maxStringLength(formattedMatchRecord, formattedMatchRecord);
// if the calculated ratio is greater than the input threshold
console.log(formattedMatchRecord,formattedOriginalRecord)
if(1-ld/maxLength > Number(inputRatio)) {
// add the master id to the array
matchedIds.push( {id: masterRecord.id}
)
}
}
if (matchedIds.length > 0)
// update the match record with the array of master ids
await matchTable.updateRecordAsync(matchRecord, {
"field": matchedIds
}
);
}
//add 1 to outer loop itertion counter
outercounter++
}
Oct 04, 2023 12:56 PM - edited Oct 04, 2023 01:02 PM
Another update in case folks are looking for alternative fuzzy matching. While the script above did work, the data I had didn't work very well with the Levenshtein Distance Function. I ended up importing the fuse.js library via cdn, which worked very well. in this case, I didn't need the location of the match, just that there was a match, so if you need the record id of the match, you'll want to create that second for loop and push the id's to an array as above.
//import fuse.js library
await getFusejs()
async function getFusejs() {
const response = await fetch("https://cdn.jsdelivr.net/npm/fuse.js@6.6.2");
const minifiedLibrary= await response.text();
eval(minifiedLibrary)
}
let matchTable = base.getTable (TABLE NAME)
let matchQuery = await matchTable.selectRecordsAsync()
//create array of data to fuzzy match for fuse.js to reference
let list = [... new Set (matchQuery.records.map(record => record.getCellValue ('MATCH FIELD')))]]
//set match threshold between 0(exact match) and 1(match all)
let options = {threshold: .3}
//loop through records and match against list array
for (let matchRec of matchQuery.records){
const fuse = new Fuse(list,options);
const searchResult = fuse.search(matchRec.getCellValue("Title"));
//log result for debugging
console.log(searchResult);
//if there record matches to more than itself, update checkbox to indicate a match
if (searchResult.length > 1)
matchTable.updateRecordAsync(matchRec.id,{"Matched": true})
}