How to query a table ... or do I always have to process all records?

Rory_Kingan · 2020-05-18T15:00:48+00:00

In the API definition for RecordQueryResult it says A RecordQueryResult represents a set of records. It’s a little bit like a one-off Airtable view: it contains a bunch of records, filtered to a useful subset of the records in the table. but in the API for Table there’s no way I see to filter when calling selectRecordsAsync(). Is there no way to query only a subset of rows in the table, i.e. the equivalent of a SQL WHERE clause? If so, what is the RecordQueryResult documentation referring to regarding ‘filtered to a useful subset’?

+29

kuovonne
Brainy
Forum|Forum|5 years ago
May 18, 2020

You can call selectRecordsAsync() from a view instead of a table to get only the records in the view. If you want a different method of filtering, you need to do the filtering yourself after you get the records.

Like

+11

VictoriaPlummer
Community Manager
Forum|Forum|5 years ago
May 18, 2020

You can call selectRecordsAsync() from a view instead of a table to get only the records in the view. If you want a different method of filtering, you need to do the filtering yourself after you get the records.

++1 on pulling from a view. Additionally, you can add the fields you want to pull in between the parentheses of selectRecordsAsync() so it looks like selectRecordsAsync({fields:['Field Name 1', 'Field Name 2']) This will only pull those fields from the table. It helps because if your table’s huge you’re not pulling more than you need.

All the best!

Like

P

+5

PeterB
Inspiring
Forum|Forum|4 years ago
July 12, 2021

I have the same question as Rory and I don’t think it was answered. We need to be able to create filtered subsets (without using an existing view) and that is what this function offers i.e. one-off views. However the documentation examples don’t use this function. Please explain how we create dynamic views and filtered subsets in script.

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
July 12, 2021

I have the same question as Rory and I don’t think it was answered. We need to be able to create filtered subsets (without using an existing view) and that is what this function offers i.e. one-off views. However the documentation examples don’t use this function. Please explain how we create dynamic views and filtered subsets in script.

It was answered. The only way to get a subset of the records in a table is with a predefined view. Otherwise you need to filter the records in your code after you get all the records.

Like

P

+5

PeterB
Inspiring
Forum|Forum|4 years ago
July 12, 2021

Could you please provide an example using RecordQueryResult?

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
July 12, 2021

A record query results is what you get when you perform selectRecordsAsync. Which records are the query results depends on how the query results was created. If the query results was created by running selectRecordsAsync on a table, it contains all the records in the table. If the query results was created by running selectRecordsAsync on a view, it contains only the records in the view. If you are building a custom app, there are also other ways of obtaining a query result.

Like

C

+13

Chris_Parker
Inspiring
Forum|Forum|4 years ago
July 12, 2021

Sure wish there was a “filter” key just like we have “fields” and “sorts”. That would solve this issue and bring a lot of efficiency. Or do it just like it’s done with the HTTP API.

I’m currently looping through all the records I have returned to find stuff. Sure wouldn’t mind learning a better way!

Like

+17

Dominik_Bosnjak
Inspiring
Forum|Forum|4 years ago
July 29, 2021

Sure wish there was a “filter” key just like we have “fields” and “sorts”. That would solve this issue and bring a lot of efficiency. Or do it just like it’s done with the HTTP API.

I’m currently looping through all the records I have returned to find stuff. Sure wouldn’t mind learning a better way!

For what it’s worth, JavaScript’s for…of performance is off the charts and always has been, not sure how you’d define “better” hah.

But having Views pretty much eliminated the need for robust filtering on my end.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 13, 2021

For what it’s worth, JavaScript’s for…of performance is off the charts and always has been, not sure how you’d define “better” hah.

But having Views pretty much eliminated the need for robust filtering on my end.

Indeed, views provide a very powerful way to scope and optimize access to data. However, they cannot be dynamically altered to achieve what I consider hyper-performance in script processes. Here’s an example…

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

I’m not an expert in all manners of Airtable scripting, but I think the only way to do this is to iterate across the 45000 records in table (b) to locate the record ID such that the linkage can be updated in table (a). Ironically, searching for discussions about this place me in the commentary mix here. And even then, I’m not sure I expressed the issues succinctly or even know what I was talking about.

Two ideas come to mind when presented with this quandary -

Ideally, there should be a searchRecordsAsync() method that uses the internal search capabilities to return a record (or record collection) given a field/record-level query.
Less than ideal, but likely even faster than search would be a way to create and maintain arbitrary hash indices (like an inverted index) that can be loaded quickly instead of entire tables followed by a filter across all records. Indeed, we need a direct pointer much the way we used to find data in assembly code using registers.

When presented with this challenge I use the approach shown below. It’s ideal in that it’s very fast and how fast depends on the data set and the types of lookups you need to perform. This example approach assumes we need only the record ID to establish a link, and it also assumes the link is based on a single key value. Performance, of course, will vary depending on the number of fields one might need access to for additional processing logic or record updates.

In any case, the analytics speak for themselves - this approach is 6.5 times faster for ~20,000 records and a sizeable 14.5 times faster for approximately 49,000 records. This is not surprising since Airtable typically bogs down the closer you get to 50,000 records.

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly.

Indices Table

Of deeper interest, I’m sure you’ll wonder what the Indices Table looks like.

The purpose of this table is simple - have a place to persist and update hash indices that can be accessed and queried very quickly.

It’s actually an extremely rudimentary approach - a single record holding an arbitrary hash index definition with long text fields representing the shards blocks. A Shard block is a JSON object of 100k bytes or less and this is necessary when creating indices involving large record collections because they cannot all fit inside a single long test cell. But even with this added complexity, the load and parse time of ten fully populated shards are still under 100 milliseconds; Airtable it seems is very fast when it comes to dealing with a single record. :winking_face:

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
September 13, 2021

Indeed, views provide a very powerful way to scope and optimize access to data. However, they cannot be dynamically altered to achieve what I consider hyper-performance in script processes. Here’s an example…

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

I’m not an expert in all manners of Airtable scripting, but I think the only way to do this is to iterate across the 45000 records in table (b) to locate the record ID such that the linkage can be updated in table (a). Ironically, searching for discussions about this place me in the commentary mix here. And even then, I’m not sure I expressed the issues succinctly or even know what I was talking about.

Two ideas come to mind when presented with this quandary -

Ideally, there should be a searchRecordsAsync() method that uses the internal search capabilities to return a record (or record collection) given a field/record-level query.
Less than ideal, but likely even faster than search would be a way to create and maintain arbitrary hash indices (like an inverted index) that can be loaded quickly instead of entire tables followed by a filter across all records. Indeed, we need a direct pointer much the way we used to find data in assembly code using registers.

When presented with this challenge I use the approach shown below. It’s ideal in that it’s very fast and how fast depends on the data set and the types of lookups you need to perform. This example approach assumes we need only the record ID to establish a link, and it also assumes the link is based on a single key value. Performance, of course, will vary depending on the number of fields one might need access to for additional processing logic or record updates.

In any case, the analytics speak for themselves - this approach is 6.5 times faster for ~20,000 records and a sizeable 14.5 times faster for approximately 49,000 records. This is not surprising since Airtable typically bogs down the closer you get to 50,000 records.

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly.

Indices Table

Of deeper interest, I’m sure you’ll wonder what the Indices Table looks like.

The purpose of this table is simple - have a place to persist and update hash indices that can be accessed and queried very quickly.

It’s actually an extremely rudimentary approach - a single record holding an arbitrary hash index definition with long text fields representing the shards blocks. A Shard block is a JSON object of 100k bytes or less and this is necessary when creating indices involving large record collections because they cannot all fit inside a single long test cell. But even with this added complexity, the load and parse time of ten fully populated shards are still under 100 milliseconds; Airtable it seems is very fast when it comes to dealing with a single record. :winking_face:

In some select cases it is not necessary for the script writer to iterate across all 45,000 records to get the record in table b. (It is still necessary to retrieve/select all the records in table b).

Scenario 1: The record(s) in table b are linked to the record in table a. You can get the value of the linked record field in table a, and then use those getRecord() on the query results with the record id(s) from the linked record field.

Scenario 2: The record in table b has a specific value in a specific field that needs to match the record in table a, other than the record id. When selecting the records with selectRecordsAsync, you can specify a sort, which will help float the desired record to the top of the array. Then the script can do a find on the query result records. Due to the sort, assuming that the record actually exists, the first match will probably be found before iterating through all of the records.

Of course, if you need to find all of the matching records, it will still be necessary to iterate through all the records.

These techniques do not eliminate the usefulness of Bill’s techniques. However, it can provide some slight performance improvements versus iterating through all records in these select situations. In fact, by specifying the sort, I’m guessing that we are taking advantage of some hidden indexing that Airtable has behind the scenes.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 13, 2021

In some select cases it is not necessary for the script writer to iterate across all 45,000 records to get the record in table b. (It is still necessary to retrieve/select all the records in table b).

Scenario 1: The record(s) in table b are linked to the record in table a. You can get the value of the linked record field in table a, and then use those getRecord() on the query results with the record id(s) from the linked record field.

Scenario 2: The record in table b has a specific value in a specific field that needs to match the record in table a, other than the record id. When selecting the records with selectRecordsAsync, you can specify a sort, which will help float the desired record to the top of the array. Then the script can do a find on the query result records. Due to the sort, assuming that the record actually exists, the first match will probably be found before iterating through all of the records.

Of course, if you need to find all of the matching records, it will still be necessary to iterate through all the records.

These techniques do not eliminate the usefulness of Bill’s techniques. However, it can provide some slight performance improvements versus iterating through all records in these select situations. In fact, by specifying the sort, I’m guessing that we are taking advantage of some hidden indexing that Airtable has behind the scenes.

Hmmm… that’s fine, we can safely agree that unless it happens to be the last record, it’s not necessary to iterate across the entire table. But that’s not the issue; rather, it’s the hit of performing a selectRecordsAsync() at all - this is the performance bottleneck if hyper-performance is crucial. And given the limited constraints especially in script automation - this is an increasingly constraining climate. How will selectRecordsAsync() perform with 100,000 records? 250,000 records?

Your comment is not debatable setting aside the requirement I laid out - which is to establish the link, not retrieve the linked relationship. The linkage from (a) to (b) does not exist; it needs to be created. How would you do that programmatically without discovering the record ID via some sort of filtering query?

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

Answer? – cache-forward the findability mechanism that changes the dynamic entirely using an approach that will work with 20,000 records or 2 million records.

Yep - I played with this approach many times and there are two huge issues;

You must repeat the sort for every filtering task where arbitrary key values must be discovered independently of each other. Floating exactly what you want to the top is difficult in most cases.
My tests indicate that sort adds a serious hit to the performance.

I don’t employ this approach for “slight improvements”; these are hyper-performant results comparatively. In the worst case, it is about 6 times faster for a single link assertion. If you need to make a thousand assertions, it can grow to hundreds of times faster. Certainly, less complex approaches also benefit from an increase in the number of lookups/assertions, but I suspect this is not true at increasing scale.

Hmmm, I see no advantage to sorting purely for purposes of lookups the likes of which align with the scenario I suggested. Indeed, while the seek time does fall by about ten per cent, the load time skyrockets by more than 30 times with a sort. And this is just for 20,000 records. I don’t see any case where this is a better approach with a significant number of records.

My conclusion - however biased it may seem - is that it makes sense to endure the added development complexities if hyper-findability at scale matters. Furthermore, hyper-performance seems to be increasingly required in Airtable because the platform is struggling to sustain automation processes as it is, and more developers are discovering their code designs are timing out more frequently.

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
September 14, 2021

Hmmm… that’s fine, we can safely agree that unless it happens to be the last record, it’s not necessary to iterate across the entire table. But that’s not the issue; rather, it’s the hit of performing a selectRecordsAsync() at all - this is the performance bottleneck if hyper-performance is crucial. And given the limited constraints especially in script automation - this is an increasingly constraining climate. How will selectRecordsAsync() perform with 100,000 records? 250,000 records?

Your comment is not debatable setting aside the requirement I laid out - which is to establish the link, not retrieve the linked relationship. The linkage from (a) to (b) does not exist; it needs to be created. How would you do that programmatically without discovering the record ID via some sort of filtering query?

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

Answer? – cache-forward the findability mechanism that changes the dynamic entirely using an approach that will work with 20,000 records or 2 million records.

Yep - I played with this approach many times and there are two huge issues;

You must repeat the sort for every filtering task where arbitrary key values must be discovered independently of each other. Floating exactly what you want to the top is difficult in most cases.
My tests indicate that sort adds a serious hit to the performance.

I don’t employ this approach for “slight improvements”; these are hyper-performant results comparatively. In the worst case, it is about 6 times faster for a single link assertion. If you need to make a thousand assertions, it can grow to hundreds of times faster. Certainly, less complex approaches also benefit from an increase in the number of lookups/assertions, but I suspect this is not true at increasing scale.

Hmmm, I see no advantage to sorting purely for purposes of lookups the likes of which align with the scenario I suggested. Indeed, while the seek time does fall by about ten per cent, the load time skyrockets by more than 30 times with a sort. And this is just for 20,000 records. I don’t see any case where this is a better approach with a significant number of records.

My conclusion - however biased it may seem - is that it makes sense to endure the added development complexities if hyper-findability at scale matters. Furthermore, hyper-performance seems to be increasingly required in Airtable because the platform is struggling to sustain automation processes as it is, and more developers are discovering their code designs are timing out more frequently.

I’ll totally agree with you there. Having to retrieve all of the records in the first place, no matter how many records you actually want is a pretty big hammer for a tiny thumbtack.

I was only tying to address the tiny bit about having to iterate through all the records being the only way.

There also is no question that hash indexes are faster than iterative search.

I wonder for the type of hyper-performance that you want, why is this data stored in Airtable in the first place? I guess because someone wants it in Airtable.

Like

+17

Dominik_Bosnjak
Inspiring
Forum|Forum|4 years ago
September 14, 2021

I’ll totally agree with you there. Having to retrieve all of the records in the first place, no matter how many records you actually want is a pretty big hammer for a tiny thumbtack.

I was only tying to address the tiny bit about having to iterate through all the records being the only way.

There also is no question that hash indexes are faster than iterative search.

I wonder for the type of hyper-performance that you want, why is this data stored in Airtable in the first place? I guess because someone wants it in Airtable.

Don’t think it’s been mentioned here by name yet, so just in case someone’s still fishing for ideas and gets easily scared by walls of text: memoization is the term to Google. It’s guaranteed to do the job regardless of the dataset size and lazy getters are trivial to write in JS due to the play-it-by-ear typing. Ironically, the ES6 “class” syntax is (I think) the most direct way of creating them.

Like

+25

Alexey_Gusev
Brainy
Forum|Forum|4 years ago
September 14, 2021

usually i use something like

let rhActive=new Map(active.map(rec=>([rec.getCellValue(‘field’),rec.id])));

for that purpose.

Also, since linked records work like ‘join tables’ and if the application needs multiple operations depending on some matching field between A and B, you may add this field(B) as a lookup field in table(A) and don’t need to query B at all. (That should work like when sometimes you add missing field in a composite index on SQL server and CPU load drops from 100 to 3-5%)

In your example, load time is the biggest value. But that’s the sinlge operation, you load all set one time in a script and then process it on your side.

I can’t be sure, but suppose - many script writers, not being DB experts, often trying to set updateRecordAsync (or createRecordAsync) on a loop instead of using batch operations. I did the same, when I learn scripting here.
If Airtables adds ‘load single record by filter’ function, that will result in a multiple single loads in a script run :grinning:

About 2 millions of records - doubt that they have such plans, maybe for some dedicated large Enterprise.
and that can be partitioned by multiple views (or even multiple tables with current limits)

Like

+25

Alexey_Gusev
Brainy
Forum|Forum|4 years ago
September 14, 2021

Indeed, views provide a very powerful way to scope and optimize access to data. However, they cannot be dynamically altered to achieve what I consider hyper-performance in script processes. Here’s an example…

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

I’m not an expert in all manners of Airtable scripting, but I think the only way to do this is to iterate across the 45000 records in table (b) to locate the record ID such that the linkage can be updated in table (a). Ironically, searching for discussions about this place me in the commentary mix here. And even then, I’m not sure I expressed the issues succinctly or even know what I was talking about.

Two ideas come to mind when presented with this quandary -

Ideally, there should be a searchRecordsAsync() method that uses the internal search capabilities to return a record (or record collection) given a field/record-level query.
Less than ideal, but likely even faster than search would be a way to create and maintain arbitrary hash indices (like an inverted index) that can be loaded quickly instead of entire tables followed by a filter across all records. Indeed, we need a direct pointer much the way we used to find data in assembly code using registers.

When presented with this challenge I use the approach shown below. It’s ideal in that it’s very fast and how fast depends on the data set and the types of lookups you need to perform. This example approach assumes we need only the record ID to establish a link, and it also assumes the link is based on a single key value. Performance, of course, will vary depending on the number of fields one might need access to for additional processing logic or record updates.

In any case, the analytics speak for themselves - this approach is 6.5 times faster for ~20,000 records and a sizeable 14.5 times faster for approximately 49,000 records. This is not surprising since Airtable typically bogs down the closer you get to 50,000 records.

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly.

Indices Table

Of deeper interest, I’m sure you’ll wonder what the Indices Table looks like.

The purpose of this table is simple - have a place to persist and update hash indices that can be accessed and queried very quickly.

It’s actually an extremely rudimentary approach - a single record holding an arbitrary hash index definition with long text fields representing the shards blocks. A Shard block is a JSON object of 100k bytes or less and this is necessary when creating indices involving large record collections because they cannot all fit inside a single long test cell. But even with this added complexity, the load and parse time of ten fully populated shards are still under 100 milliseconds; Airtable it seems is very fast when it comes to dealing with a single record. :winking_face:

By way, that’s very interesting perfomance info, and following discussion links i encountered your base with benchmark script. I added more tests, just for curiosity, maybe that info is already known but anyway, here is the results. Percents in views are approximate, i just used random filtering to fit close
.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 14, 2021

I’ll totally agree with you there. Having to retrieve all of the records in the first place, no matter how many records you actually want is a pretty big hammer for a tiny thumbtack.

I was only tying to address the tiny bit about having to iterate through all the records being the only way.

There also is no question that hash indexes are faster than iterative search.

I wonder for the type of hyper-performance that you want, why is this data stored in Airtable in the first place? I guess because someone wants it in Airtable.

LOL! Why is any data stored in Airtable? It’s a complicated question whose answer spans many aspects of data management, users experience, and business requirements.

But to be clear, my approach was not fabricated as a result of a single system or one data set; it’s an ongoing attempt to overcome many issues that arise when using Airtable in more rigid processes and/or at greater scale.

Unfortunately, discussions about performance and best practices cannot readily factor in how busy a given instance of Airtable is. We all know that any given instance of a base is like a container at the server level and such container is granted a fixed amount resources; enterprise accounts have more and free accounts have less. But even an enterprise account can be brought to its knees with lots of automations, external API use, and external automations from Zapier and Integromat.

As such, when we test various patterns and performance characteristics in our quaint little pro accounts and feel all worm and fuzzy, we are simply engaging in confirmation biases that are likely to be misleading when applied in real conditions.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 14, 2021

usually i use something like

let rhActive=new Map(active.map(rec=>([rec.getCellValue(‘field’),rec.id])));

for that purpose.

Also, since linked records work like ‘join tables’ and if the application needs multiple operations depending on some matching field between A and B, you may add this field(B) as a lookup field in table(A) and don’t need to query B at all. (That should work like when sometimes you add missing field in a composite index on SQL server and CPU load drops from 100 to 3-5%)

In your example, load time is the biggest value. But that’s the sinlge operation, you load all set one time in a script and then process it on your side.

I can’t be sure, but suppose - many script writers, not being DB experts, often trying to set updateRecordAsync (or createRecordAsync) on a loop instead of using batch operations. I did the same, when I learn scripting here.
If Airtables adds ‘load single record by filter’ function, that will result in a multiple single loads in a script run :grinning:

About 2 millions of records - doubt that they have such plans, maybe for some dedicated large Enterprise.
and that can be partitioned by multiple views (or even multiple tables with current limits)

Yep - this is a very good point that I should have explored. It is roughly 9.1% faster when it comes to establishing and seeking a specific id from the record set. It’s a wise enhancement.

Despite all of my words and attempts to be clear about the requirements in my initial scenario, I get the sense that I’m not making it fully apparent the issues that some processes face. Earlier, @kuovonne said this and it seems to be what you are saying.

Your comment makes sense [to me] if the link is already established. But what if the relationship does not exist and needs to now be asserted [programmatically] in an ad-hoc fashion as a result of other workflow process outcomes? In some business requirements, relationships between records are not established until other events in the process occur. And, there are also cases where relationships must change.

Am I correct in understanding that some iterative process must locate b.record as it relates to a.record in order for the desired link to be established? Or, am I missing some magical approach or data model that escapes me and seemingly the vast Airtable user community? I’m old - I miss a lot of obvious methods to create better systems. :winking_face:

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
September 14, 2021

Yep - this is a very good point that I should have explored. It is roughly 9.1% faster when it comes to establishing and seeking a specific id from the record set. It’s a wise enhancement.

Despite all of my words and attempts to be clear about the requirements in my initial scenario, I get the sense that I’m not making it fully apparent the issues that some processes face. Earlier, @kuovonne said this and it seems to be what you are saying.

Your comment makes sense [to me] if the link is already established. But what if the relationship does not exist and needs to now be asserted [programmatically] in an ad-hoc fashion as a result of other workflow process outcomes? In some business requirements, relationships between records are not established until other events in the process occur. And, there are also cases where relationships must change.

Am I correct in understanding that some iterative process must locate b.record as it relates to a.record in order for the desired link to be established? Or, am I missing some magical approach or data model that escapes me and seemingly the vast Airtable user community? I’m old - I miss a lot of obvious methods to create better systems. :winking_face:

I think that it is clear now. In my initial reply (about a tiny detail that I am now thinking would have been better left alone) I included two scenarios. I believe my second scenario is what you want. I included my first scenario because it is such a common use case that also requires retrieving all of the records, and I used to use iteration before I discovered the getRecord function.

I don’t think you miss much.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 14, 2021

I think that it is clear now. In my initial reply (about a tiny detail that I am now thinking would have been better left alone) I included two scenarios. I believe my second scenario is what you want. I included my first scenario because it is such a common use case that also requires retrieving all of the records, and I used to use iteration before I discovered the getRecord function.

I don’t think you miss much.

That’s kind. I assure you - I miss a lot from time-to-time. But most important, by definition, I generally don’t have a sense for what I miss. :slightly_smiling_face: I need others to tell me.

Like

+17

Dominik_Bosnjak
Inspiring
Forum|Forum|4 years ago
September 14, 2021

Indeed, views provide a very powerful way to scope and optimize access to data. However, they cannot be dynamically altered to achieve what I consider hyper-performance in script processes. Here’s an example…

Imagine a table (a) with a view and when a record enters that view, search another table (b) that has 45,000 records in it for a single matching key so that a linkage can be established in (a) to (b).

I’m not an expert in all manners of Airtable scripting, but I think the only way to do this is to iterate across the 45000 records in table (b) to locate the record ID such that the linkage can be updated in table (a). Ironically, searching for discussions about this place me in the commentary mix here. And even then, I’m not sure I expressed the issues succinctly or even know what I was talking about.

Two ideas come to mind when presented with this quandary -

Ideally, there should be a searchRecordsAsync() method that uses the internal search capabilities to return a record (or record collection) given a field/record-level query.
Less than ideal, but likely even faster than search would be a way to create and maintain arbitrary hash indices (like an inverted index) that can be loaded quickly instead of entire tables followed by a filter across all records. Indeed, we need a direct pointer much the way we used to find data in assembly code using registers.

When presented with this challenge I use the approach shown below. It’s ideal in that it’s very fast and how fast depends on the data set and the types of lookups you need to perform. This example approach assumes we need only the record ID to establish a link, and it also assumes the link is based on a single key value. Performance, of course, will vary depending on the number of fields one might need access to for additional processing logic or record updates.

In any case, the analytics speak for themselves - this approach is 6.5 times faster for ~20,000 records and a sizeable 14.5 times faster for approximately 49,000 records. This is not surprising since Airtable typically bogs down the closer you get to 50,000 records.

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly.

Indices Table

Of deeper interest, I’m sure you’ll wonder what the Indices Table looks like.

The purpose of this table is simple - have a place to persist and update hash indices that can be accessed and queried very quickly.

It’s actually an extremely rudimentary approach - a single record holding an arbitrary hash index definition with long text fields representing the shards blocks. A Shard block is a JSON object of 100k bytes or less and this is necessary when creating indices involving large record collections because they cannot all fit inside a single long test cell. But even with this added complexity, the load and parse time of ten fully populated shards are still under 100 milliseconds; Airtable it seems is very fast when it comes to dealing with a single record. :winking_face:

If your optimization efforts have already advanced to the point that you’re reasoning in ones and zeroes… well, I don’t think ‘rudimentary’ is a bad thing haha.

And following that train of thought, wouldn’t this approach benefit from a divide-and-conquer kind of search? Unless it’s already there and I’m just misreading the schema.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 14, 2021

If your optimization efforts have already advanced to the point that you’re reasoning in ones and zeroes… well, I don’t think ‘rudimentary’ is a bad thing haha.

And following that train of thought, wouldn’t this approach benefit from a divide-and-conquer kind of search? Unless it’s already there and I’m just misreading the schema.

Good point - it’s not there - it’s seriously dumb [presently]. But it needn’t be this dumb into the future. I can imagine that some really sharp [young] developer could make this 100 times better with all sorts of additional optimizations and features, Indeed, the shards could be made to work a lot like ElasticSearch index shards - they could even be distributed.

But this approach, which admittedly adds some undesirable complexities, is best used under certain conditions and especially depends on real-time updates to the index - ergo, record changes must trigger near-instant updates to the indices and this has been reliable and not impactful because the indexing process itself is distributed across tiny, but frequent events.

I believe almost any type of search solution could be built in this manner, even an inverted index architecture like Lunr.

Overall, this is all circling the search drain, right? When will Airtable realize that search (which is a big collection of requirements like these) need to be addressed internally in the platform?

Like

+25

Alexey_Gusev
Brainy
Forum|Forum|4 years ago
September 14, 2021

it seems like I my explanation was wrong.
in short, i’m agree that

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly

but here I can’t agree with your evaluation, when talking about lots of searches

[quote=“Alexey_Gusev, post:15, topic:30057”]

let rhActive=new Map(active.map(rec=>([rec.getCellValue(‘field’),rec.id])));

Yep - this is a very good point that I should have explored. It is roughly 9.1% faster when it comes to establishing and seeking a specific id from the record set. It’s a wise enhancement

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 15, 2021

it seems like I my explanation was wrong.
in short, i’m agree that

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly

but here I can’t agree with your evaluation, when talking about lots of searches

[quote=“Alexey_Gusev, post:15, topic:30057”]

let rhActive=new Map(active.map(rec=>([rec.getCellValue(‘field’),rec.id])));

Yep - this is a very good point that I should have explored. It is roughly 9.1% faster when it comes to establishing and seeking a specific id from the record set. It’s a wise enhancement

Okay, I’m listening, but I don’t see a clearly presented reason you don’t agree. Here’s why I believe my assertion is correct.

Scenario:

A recycling company has 50 van drivers crisscrossing London picking up an average of ten loads each per day.
They use the Tookan platform to indicate when they arrive at a job and when they finish collecting the recycled goods.
Tookan fires a webhook for each of these two events for all fifty drivers; approximately 1,000 events.
The receiving Airtable webhook needs to link three different tables to the order which reaches two different states - first when they start the job, and then when they complete the job. The linked relationships provide lookups into details concerning the load size, the company the driver works for, and other location data concerning postal codes and taxing authorities.
To create this collection of links and modify them when the job is complete, three different lookups must occur to map the detail records into the order record.

I’m convinced there are many ways to address these requirements but a few things are clear:

Each of the three tables must be queried.
A lookup into each of the three tables must be performed.
A linked record to the order must be established in each table.

My data shows the following given four different approaches. If we extrapolate these values my cached hash seems to win once you pass about 50 updates.

If you have some different data or a better approach for evaluating this, let me know.

Like

+29

kuovonne
Brainy
Forum|Forum|4 years ago
September 15, 2021

Okay, I’m listening, but I don’t see a clearly presented reason you don’t agree. Here’s why I believe my assertion is correct.

Scenario:

A recycling company has 50 van drivers crisscrossing London picking up an average of ten loads each per day.
They use the Tookan platform to indicate when they arrive at a job and when they finish collecting the recycled goods.
Tookan fires a webhook for each of these two events for all fifty drivers; approximately 1,000 events.
The receiving Airtable webhook needs to link three different tables to the order which reaches two different states - first when they start the job, and then when they complete the job. The linked relationships provide lookups into details concerning the load size, the company the driver works for, and other location data concerning postal codes and taxing authorities.
To create this collection of links and modify them when the job is complete, three different lookups must occur to map the detail records into the order record.

I’m convinced there are many ways to address these requirements but a few things are clear:

Each of the three tables must be queried.
A lookup into each of the three tables must be performed.
A linked record to the order must be established in each table.

My data shows the following given four different approaches. If we extrapolate these values my cached hash seems to win once you pass about 50 updates.

If you have some different data or a better approach for evaluating this, let me know.

I’m probably way off base here, and I’ve veering away from the initial topic of this thread, but why do you need this hyper-performance just to create links in an automation?

If you are running up against the time limits or memory limits of a scripting action, there are other methods of creating the links:

Use multiple scripting actions. Each scripting action gets its own 30 seconds and its own 512 MB.
Use a “Find records” action to find the records. The “Find records” action can match records based on a value in a previous step in the automation, such as a value received in the webhook. Since it isn’t a script, there should be no time limit problem. Then use the results of the “Find records” action to create the links, either using a scripting action, or using an “Update Record” action. Creating all three links would take 4 of the 25 actions in an automation, and eliminate the need for scripting to load any records.

I will be the first to admit that I don’t know how well the “Find records” action will scale in an enterprise base with 200,000+ records. However, the “Find records” action does not have any documented limitations other than the fact that it will return a maximum of 100 records. It is also possible that you have a complex method for finding matching records that the “Find records” action doesn’t support, but that doesn’t sound like the case.

Like

B

+19

Bill_French
Inspiring
Forum|Forum|4 years ago
September 15, 2021

I’m probably way off base here, and I’ve veering away from the initial topic of this thread, but why do you need this hyper-performance just to create links in an automation?

If you are running up against the time limits or memory limits of a scripting action, there are other methods of creating the links:

Use multiple scripting actions. Each scripting action gets its own 30 seconds and its own 512 MB.
Use a “Find records” action to find the records. The “Find records” action can match records based on a value in a previous step in the automation, such as a value received in the webhook. Since it isn’t a script, there should be no time limit problem. Then use the results of the “Find records” action to create the links, either using a scripting action, or using an “Update Record” action. Creating all three links would take 4 of the 25 actions in an automation, and eliminate the need for scripting to load any records.

I will be the first to admit that I don’t know how well the “Find records” action will scale in an enterprise base with 200,000+ records. However, the “Find records” action does not have any documented limitations other than the fact that it will return a maximum of 100 records. It is also possible that you have a complex method for finding matching records that the “Find records” action doesn’t support, but that doesn’t sound like the case.

I think your comments are fully on-topic.

While that may be the case in some scenarios, two things need to be clarified;

My assertions concerning performance are not about a single project; rather, it is about a search design pattern that might be useful for many projects that may (or may not) collide with script limits.
The performance challenges we all face are holistic. Unlike many multi-tenant architectures, Airtable apparently constrains resources at the base-level as if they were running your base inside a Docker or Kubernetes container. As such, any dis-optimized process will impact all other aspects of the Airtable experience. We’ve learned (from Airtable support and engineers) that significant API activities can actually impact UI performance and vice-versa and because of this, developers have a duty to implement the most performant processes that are practically achievable.

This is a good point and I also have no test data to know how well this integrated action will perform but it’s a good project for someone to tackle.

In most of the cases I’ve worked on concerning ad-hock lookups across large data sets, there is typically additional logic involved that can only be addressed in a practical fashion with script. In the scenario I put forth for this thread, it actually is the case - I just didn’t want the thread to be 20 pages. :winking_face:

Like

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded