Feb 14, 2020 11:50 AM
Hi
I have text paragraphs in the Long Text field that has extra and odd characters.
Is there a way to remove those unwanted symbols and characters?
Apparently, there is one way to do this, by using the Substitute function.
Can someone help me with that, in terms of writing the formula?
Is there another way to do this?
Thank you,
Mary K
Feb 15, 2020 11:46 AM
Can you post an example of the text with odd characters to be removed?
Feb 15, 2020 04:49 PM
Hi @Bill.French
Here is an example of the text:
This is how it should look like:
Yukon’s Tombstone Territorial Park was established as the result of the Tr’ondëk Hwëch’in land claim agreement, and the park lies entirely within that First Nation’s traditional territory. The diverse landscape supports a variety of wildlife, including Dall sheep, caribou, moose, black and grizzly bears, and dozens of bird species.
Thank you,
Mary K
Feb 15, 2020 04:58 PM
Please paste a real copy of the bad text here so I can test it.
Feb 15, 2020 05:18 PM
@M_k,
That example indicates whatever is producing or importing this content has character encoding issues. While you can write formulas to overcome these instances and transform the data into something more accurate, that approach is not likely to be sustainable. You would have to create formulas for all 256 possible “errors” that could be encountered and that’s not practical.
These are neither extra nor are they odd; they are encoded to render as HTML character entities.
The better approach is to not allow such encoding issues invade the database in the first place. Tell me how these instances have occurred?
Feb 15, 2020 05:32 PM
Hi @Bill.French
This is an example of text from a website and by using regex in Integromat, it is going to populate a Long Text field in Airtable.
I guess the other way would be to manually make the corrections, since it is not all the words.
Thank you,
Mary K
Feb 15, 2020 05:41 PM
Yep, I figured it might be something like this.
What’s happening is Integromat is reading the content and perhaps escaping HTML tags, but leaving HTML entities. You need to apply a function call like escapeHTML() at the capture point in the Integromat process. This assumes, of course, that Integromat is able to transform HTML entities into their actual character values. According to this, the only reference I could find about HTML entities with Integromat, it appears this may not be possible unless you want everything converted to plain text.
I was able to write a script block that performs these types of transformations, but that approach is like a wet bandaid - it will not be complete and will forever be subject to failure.
Feb 15, 2020 08:04 PM
Hi @Bill.French
Thank you for your help.
So, if Integromat coverts the text to plain text and I will be placing the paragraph, as text in Airtable, would there be a difference? Since it is just text that I will need in Airtable.
Would there be any upcoming issues or issues in general with the above option?
Another idea:
The text will first be in GS spreadsheet, can GS remove the HTML entities?
Or is there any other way in GS, that can remove the unwanted characters?
Just brainstorming possible solutions.
Thank you,
Mary
Feb 16, 2020 04:31 AM
No, but this assumes Integromat’s conversion to plain text transforms something like &rquos; to '. If it doesn’t, you are still in the weeds. Also, there may be some formatting you want to preserve like links. Plain text transformation - if it works in Integromat - will probably sanitize the content including links, special punctuation, and perhaps paragraph formatting, and bullet-lists. It might be ugly.
Many messages ago I asked - How does this content get created? It appears this is still not revealed but this is important because there’s a point where the encoding issues initially arise. It’s very difficult to advise you further without knowing more about the lifecycle of this content.
Given that this content is entering “GS spreadsheet” which I assume to mean Google Sheets, it got there through some process that is likely the source of the encoding problems. If true, the remedy should ideally be at the point of creation, not downstream in Integromat.
If users are manually copying and pasting web content into a Google sheet, they are likely pasting into spreadsheet cells “with formatting”. This can [sometimes] cause the encoding issue you are seeing.
Google Apps Script has the ability to remedy character encoding challenges, but it requires javascript development. Rather than leaping to this conclusion or any other implementation approach, you should fully map the content lifecycle to determine how best to eliminate this issue. Or, you could just roll the dice and hope all is corrected with plain text in Integromat.
Feb 16, 2020 01:49 PM
Hi @Bill.French
The source of this issue is being looked into, I just thought that perhaps something could have been done in Airtable or Google Sheets.
It seems the resolution is a bit complex. If all else fails, I may have to manually make the text changes.
I really appreciate your help with this issue.
Thank you,
Mary