Re: Extract formatted text into fields

itoldusoandso · ‎May 13, 2021

Hello all,

Wondering if anybody can suggest efficient way to extract text and put into fields.

I was going to play with LEN and SUBSTITUTE formulas but I know it will take me a day or two to figure it out because I don’t really understand well how the two formulas work. I have used them successfully to extract Airtable attachment links to separate fields as URLs and it only worked because some genius here came with the suggestion.

All I try to do is to go to Amazon, copy the description, I would past that into a field and then the formulas would grab it and populate the individual fields.

Specifically I am grabbing this information (PRODUCT INFORMATION - it’s at the end of description in Amazon product page, just before the Customer Questions & Answers section).

This is what I need to copy:

Product Dimensions	3.54 x 2.24 x 0.47 inches
Item Weight	0.353 ounces
ASIN	B072K8BMST
Item model number	CardH-BlackWP

====

The formatting doesn’t show up when past in Airtable. The text contains some vertical lines | when I copy them from Amazon which I believe can be helpful to build the formula.

However, when I past the text to Airtable, the vertical lines don’t show up, so I guess they are just “gaps” and not really vertical lines.

So I guess it’s just requires a lot of testing to get it work but before I even try to go with the two formulas, I am wondering if there is any other better way doing it.

Many thanks.

ScottWorld · ‎May 13, 2021

Maybe the web clipper app can help:

Katerie · ‎May 13, 2021

There likely isn’t an easier way to do it, unfortunately. But since you have some predictable structure in your data, you can use a combination of LEFT(), RIGHT(), LEN(), FIND(), and MID() to get what you need.

The MID() function can be used in combination with FIND() to allow you to extract text between any two defined sets of characters. Here is a thread with some examples of how to do this to get you started. You can use this to get most of the values that you need in a robust way.

The LEFT() or RIGHT() functions can also be used in combination with LEN() and FIND() to let you extract text between the beginning or end of the copied text and the first instance of a specific word. For example, this formula will let you get the Item model number, using the input example you pasted:

RIGHT({Copied text}, LEN({Copied text})-FIND("number", {Copied text})-6)

A similar formula could be used with LEFT() to get the length value. The MID() formulas can do the rest, except you might need to use a couple of layers of them to get the width and height values.

This is all easier if things like the model number and dimensions always have the same number of characters, but I doubt that’s going to be the case. So the FIND() commands can deal with that chaos.

kuovonne · ‎May 13, 2021

It is possible that the “gaps” in the text are tab characters. I also suggest using regular expressions to parse the text.

Because regular expressions can be a bit finicky, and it isn’t clear if the text will always follow the format, you may need to play around a bit to get things to work.

Here is formula for the ASIN, assuming that the gaps are actually tab characters.

IF(
  REGEX_MATCH({Copied text}, "(?:ASIN	)(.+)"),
  REGEX_EXTRACT({Copied text}, "(?:ASIN	)(.+)")
)

Here is a formula for the length, again assuming that the gaps are actually tab characters.

IF(
  REGEX_MATCH({Copied text}, "(?:Product Dimensions	)([\d\.]+)"),
  REGEX_EXTRACT({Copied text}, "(?:Product Dimensions	)([\d\.]+)")
)

The regular expressions for the width and height are a bit more complex. If you cannot figure them out on your own with a regular expression tester, and have budget to hire someone to figure out the regular expression for you, feel free to reach out.

itoldusoandso · ‎May 14, 2021

@ScottWorld Thanks for great tip. I kinda ignored the web-clipper, always thought the clipper is to clip image of the page or capture the URL. I didn’t know how powerful the Webclipper is. That’s amazing.

I figured out how to get the CSS values easily using the configuration page on Simplesraper and have now tested works flawlessly.

I found post about Simplescraper and it seems like a great tool for bigger jobs. It has a flaw which I described towards the end of this post:

Next I am going to work around the formula method suggested by @Katerie and @kuovonne, I’ll see how far I can get and I’ll share the results if I get to working solution. I find using the formula is more handy for individual ad-hoc tasks. The formula way will do just fine. Thanks to all for help.

itoldusoandso · ‎May 14, 2021

@Katerie

Worked fine but it’s fragile as once can expect. I used:

SUBSTITUTE to add more free space around the text
LEN, FIND, LEFT, RIGHT, MID
TRIM to remove the free space

Then tested and it all worked nice and clean. Then disappointment. I texted on more Amazon products and I found that block of text is inconsistent. They have different versions of this product details. I guess it depends on product category or they didn’t update things, who knows…

Sometimes weight is separate line, sometimes together with sizes
The lines are shuffled up
And the product details block looks different across products, so I would have to build minimum 2 versions versions of this and the nested formulas get complex.

The MID function is the best but it’s not possible to know the “count” value in MID function. The string of the property changes and the position changes in the product details. The only consistent thing is the ASIN which is Amazon standard identification number. I didn’t notice before how inconsistent Amazon pages are before this exercise. Any inconsistency is negative on customer behaviours for shoppers who look for consistency, so Amazon is that sloppy in that regard at least. Gotcha.

So having that kind of fragile formula, if things change from Amazon side, I can start from scratch.

@kuovonne

Tried but it didn’t give me any result the Regex way. The result was just white cell. So I assume those tabs are not being copied as “gaps” but some other code that is either converted to space or just not displayed in Airtable.

It looks like I will settle with the Clipboard solution after all. It’s probably more stable because it links to the actual CSS code in the page… of course until that breaks if Amazon makes changes to it.

Thanks for your help everybody.