How To Remove Extra Characters?

Hi

I have text paragraphs in the Long Text field that has extra and odd characters.

Is there a way to remove those unwanted symbols and characters?

Apparently, there is one way to do this, by using the Substitute function.

Can someone help me with that, in terms of writing the formula?

Is there another way to do this?

Thank you,
Mary K

Can you post an example of the text with odd characters to be removed?

Hi @Bill.French

Here is an example of the text:

image

This is how it should look like:

Yukon’s Tombstone Territorial Park was established as the result of the Tr’ondëk Hwëch’in land claim agreement, and the park lies entirely within that First Nation’s traditional territory. The diverse landscape supports a variety of wildlife, including Dall sheep, caribou, moose, black and grizzly bears, and dozens of bird species.

Thank you,
Mary K

Please paste a real copy of the bad text here so I can test it.

@M_k,

image

That example indicates whatever is producing or importing this content has character encoding issues. While you can write formulas to overcome these instances and transform the data into something more accurate, that approach is not likely to be sustainable. You would have to create formulas for all 256 possible “errors” that could be encountered and that’s not practical.

These are neither extra nor are they odd; they are encoded to render as HTML character entities.

The better approach is to not allow such encoding issues invade the database in the first place. Tell me how these instances have occurred?

Hi @Bill.French

This is an example of text from a website and by using regex in Integromat, it is going to populate a Long Text field in Airtable.

I guess the other way would be to manually make the corrections, since it is not all the words.

Thank you,
Mary K

Yep, I figured it might be something like this.

What’s happening is Integromat is reading the content and perhaps escaping HTML tags, but leaving HTML entities. You need to apply a function call like escapeHTML() at the capture point in the Integromat process. This assumes, of course, that Integromat is able to transform HTML entities into their actual character values. According to this, the only reference I could find about HTML entities with Integromat, it appears this may not be possible unless you want everything converted to plain text.

I was able to write a script block that performs these types of transformations, but that approach is like a wet bandaid - it will not be complete and will forever be subject to failure.

Hi @Bill.French

Thank you for your help.

So, if Integromat coverts the text to plain text and I will be placing the paragraph, as text in Airtable, would there be a difference? Since it is just text that I will need in Airtable.

Would there be any upcoming issues or issues in general with the above option?

Another idea:
The text will first be in GS spreadsheet, can GS remove the HTML entities?

Or is there any other way in GS, that can remove the unwanted characters?

Just brainstorming possible solutions.

Thank you,
Mary

No, but this assumes Integromat’s conversion to plain text transforms something like &rquos; to '. If it doesn’t, you are still in the weeds. Also, there may be some formatting you want to preserve like links. Plain text transformation - if it works in Integromat - will probably sanitize the content including links, special punctuation, and perhaps paragraph formatting, and bullet-lists. It might be ugly.

Many messages ago I asked - How does this content get created? It appears this is still not revealed but this is important because there’s a point where the encoding issues initially arise. It’s very difficult to advise you further without knowing more about the lifecycle of this content.

Given that this content is entering “GS spreadsheet” which I assume to mean Google Sheets, it got there through some process that is likely the source of the encoding problems. If true, the remedy should ideally be at the point of creation, not downstream in Integromat.

If users are manually copying and pasting web content into a Google sheet, they are likely pasting into spreadsheet cells “with formatting”. This can [sometimes] cause the encoding issue you are seeing.

Google Apps Script has the ability to remedy character encoding challenges, but it requires javascript development. Rather than leaping to this conclusion or any other implementation approach, you should fully map the content lifecycle to determine how best to eliminate this issue. Or, you could just roll the dice and hope all is corrected with plain text in Integromat.

Hi @Bill.French

The source of this issue is being looked into, I just thought that perhaps something could have been done in Airtable or Google Sheets.

It seems the resolution is a bit complex. If all else fails, I may have to manually make the text changes.

I really appreciate your help with this issue.

Thank you,
Mary

I can’t be of much help without knowing how that end of the process works.

Well, as I mentioned in previous messages, many things can be done if you have skills (or budget) or you’re comfortable with an incomplete or difficult-to-sustain and maintain approach. Here are your options.

Modify the Intake at the Source

This probably requires that you determine why the encoding issues are occurring in the first place and modify that process to eliminate them.

Add a Sanitization Process at Google Sheets

This requires Google Apps Script to transform every record as it arrives and before Integromat retrieves it. While Apps Script has excellent libraries for encoding and decoding HTML entities, it does require code and process automation to use them in this scenario.

Add a Sanitization Process at Integromat

This approach likely requires the use of special Integromat HTML or text functions that can be weaved into the automation process. There’s a good chance transformation into plain text should make the outcomes more predictable, but you may lose some formatting that is actually needed when the content arrives in Airtable.

Add a Sanitization Process at Airtable

This could be done with formulas looking for every instance of the “odd” texts and either eliminating them altogether or transforming them into their decided equivalents. Or, you could do something similar in a Script Block. Both of these options come with a lot of issues such as knowing in advance all of the “odd” situations the source content may include. This is not easily predictable and Airtable doesn’t provide the ability to perform a global HTML entity transformation (i.e., Script Blocks do not allow use of the DOM thus ruling out a universal decoding process).

Search and Replace Manually

This is about as ugly as it gets and likely to lead to lots of missed entities. Since all entities start with an ampersand (&), it might make discovery a little easier, but even so, this has got to be the worst possible pathway.

Hi @Bill.French

Thank you.

I will go through the information and figure out what I will do and take it from there.

Mary K