How to query a table ... or do I always have to process all records?

Rory_Kingan · ‎May 18, 2020

In the API definition for RecordQueryResult it says

A RecordQueryResult represents a set of records. It’s a little bit like a one-off Airtable view: it contains a bunch of records, filtered to a useful subset of the records in the table.

but in the API for Table there’s no way I see to filter when calling selectRecordsAsync(). Is there no way to query only a subset of rows in the table, i.e. the equivalent of a SQL WHERE clause? If so, what is the RecordQueryResult documentation referring to regarding ‘filtered to a useful subset’?

Dominik_Bosnjak · ‎Sep 14, 2021

If your optimization efforts have already advanced to the point that you’re reasoning in ones and zeroes… well, I don’t think ‘rudimentary’ is a bad thing haha.

And following that train of thought, wouldn’t this approach benefit from a divide-and-conquer kind of search? Unless it’s already there and I’m just misreading the schema.

Bill_French · ‎Sep 14, 2021

Good point - it’s not there - it’s seriously dumb [presently]. But it needn’t be this dumb into the future. I can imagine that some really sharp [young] developer could make this 100 times better with all sorts of additional optimizations and features, Indeed, the shards could be made to work a lot like ElasticSearch index shards - they could even be distributed.

But this approach, which admittedly adds some undesirable complexities, is best used under certain conditions and especially depends on real-time updates to the index - ergo, record changes must trigger near-instant updates to the indices and this has been reliable and not impactful because the indexing process itself is distributed across tiny, but frequent events.

I believe almost any type of search solution could be built in this manner, even an inverted index architecture like Lunr.

Overall, this is all circling the search drain, right? When will Airtable realize that search (which is a big collection of requirements like these) need to be addressed internally in the platform?

Alexey_Gusev · ‎Sep 14, 2021

it seems like I my explanation was wrong.
in short, i’m agree that

But it’s important to realize that performance data like this is not linear and use cases vary so greatly. However, if you have to perform lots of lookups across record sets, the performance gap between filtering across records and using a hash index begin to accumulate significantly

but here I can’t agree with your evaluation, when talking about lots of searches

[quote=“Alexey_Gusev, post:15, topic:30057”]

let rhActive=new Map(active.map(rec=>([rec.getCellValue(‘field’),rec.id])));

Yep - this is a very good point that I should have explored. It is roughly 9.1% faster when it comes to establishing and seeking a specific id from the record set. It’s a wise enhancement

Bill_French · ‎Sep 14, 2021

Okay, I’m listening, but I don’t see a clearly presented reason you don’t agree. Here’s why I believe my assertion is correct.

Scenario:

A recycling company has 50 van drivers crisscrossing London picking up an average of ten loads each per day.
They use the Tookan platform to indicate when they arrive at a job and when they finish collecting the recycled goods.
Tookan fires a webhook for each of these two events for all fifty drivers; approximately 1,000 events.
The receiving Airtable webhook needs to link three different tables to the order which reaches two different states - first when they start the job, and then when they complete the job. The linked relationships provide lookups into details concerning the load size, the company the driver works for, and other location data concerning postal codes and taxing authorities.
To create this collection of links and modify them when the job is complete, three different lookups must occur to map the detail records into the order record.

I’m convinced there are many ways to address these requirements but a few things are clear:

Each of the three tables must be queried.
A lookup into each of the three tables must be performed.
A linked record to the order must be established in each table.

My data shows the following given four different approaches. If we extrapolate these values my cached hash seems to win once you pass about 50 updates.

If you have some different data or a better approach for evaluating this, let me know.

kuovonne · ‎Sep 14, 2021

I’m probably way off base here, and I’ve veering away from the initial topic of this thread, but why do you need this hyper-performance just to create links in an automation?

If you are running up against the time limits or memory limits of a scripting action, there are other methods of creating the links:

Use multiple scripting actions. Each scripting action gets its own 30 seconds and its own 512 MB.
Use a “Find records” action to find the records. The “Find records” action can match records based on a value in a previous step in the automation, such as a value received in the webhook. Since it isn’t a script, there should be no time limit problem. Then use the results of the “Find records” action to create the links, either using a scripting action, or using an “Update Record” action. Creating all three links would take 4 of the 25 actions in an automation, and eliminate the need for scripting to load any records.

I will be the first to admit that I don’t know how well the “Find records” action will scale in an enterprise base with 200,000+ records. However, the “Find records” action does not have any documented limitations other than the fact that it will return a maximum of 100 records. It is also possible that you have a complex method for finding matching records that the “Find records” action doesn’t support, but that doesn’t sound like the case.

Bill_French · ‎Sep 15, 2021

I think your comments are fully on-topic.

While that may be the case in some scenarios, two things need to be clarified;

My assertions concerning performance are not about a single project; rather, it is about a search design pattern that might be useful for many projects that may (or may not) collide with script limits.
The performance challenges we all face are holistic. Unlike many multi-tenant architectures, Airtable apparently constrains resources at the base-level as if they were running your base inside a Docker or Kubernetes container. As such, any dis-optimized process will impact all other aspects of the Airtable experience. We’ve learned (from Airtable support and engineers) that significant API activities can actually impact UI performance and vice-versa and because of this, developers have a duty to implement the most performant processes that are practically achievable.

This is a good point and I also have no test data to know how well this integrated action will perform but it’s a good project for someone to tackle.

In most of the cases I’ve worked on concerning ad-hock lookups across large data sets, there is typically additional logic involved that can only be addressed in a practical fashion with script. In the scenario I put forth for this thread, it actually is the case - I just didn’t want the thread to be 20 pages. :winking_face:

Alexey_Gusev · ‎Sep 15, 2021

Now I got it. You are right. Multiple single events with unpredictable time density
If they intend to raise the limits for enterprise, querying all table will cause bottleneck. Of course, there are some workarounds, but why invent wheel, if index exists.

Jeremy_Oglesby · ‎Sep 20, 2021

I think your discussion here provided a prompt to Airtable Devs:

Karlstens · ‎Jan 06, 2023

My one complaint with selectRecordsAsync() is that the sorts option fails to sort records brought in via the recordIds option, meaning if you'll need to sort records either with a predefined view, or with a secondary operation after selectRecordsAsync().

I've mentioned this to Airtable support, hoping that it can be addressed one day.