Character encoding issues can arise from a variety of causes, the most common of which is simply copying and pasting content from a browser into another data store. If the source content is not UTF-8 or UTF-16 encoded and the target location unwittingly accepts any encoding you hand it, odd character representations can occur. And when that happens, you might not even know it until the copied information has to pass through a secondary process such as an integration flow.
How does “cut and paste” affect character encoding and what can go wrong?
First of all, a text editor’s internal representation of text has no bearing on how the text is encoded (serialized) when you save the file. So a document is not “in” an encoding; it’s a sequence of abstract characters. When the document is saved to a file (or transmitted over the network) then it gets encoded.
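To make that concrete, here is a minimal Python sketch. The sample string and variable names are mine, purely for illustration. It shows one sequence of abstract characters being serialized three different ways, which is what "saving" amounts to:

```python
# One abstract string: 11 characters, including a curly apostrophe (U+2019)
# and an accented e (U+00E9).
text = "It\u2019s a caf\u00e9"

# Encoding only happens when the text is serialized to bytes.
utf8_bytes = text.encode("utf-8")        # apostrophe -> 3 bytes, accented e -> 2 bytes
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per character here
cp1252_bytes = text.encode("cp1252")     # 1 byte per character (legacy Windows code page)

print(len(text), len(utf8_bytes), len(utf16_bytes), len(cp1252_bytes))

# Decoding with the matching codec always recovers the same characters.
round_trips = [
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-le"),
    cp1252_bytes.decode("cp1252"),
]
print(all(result == text for result in round_trips))
```

The characters never change; only the bytes do. Trouble starts when the bytes are later read back with a different codec than the one that wrote them.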
It’s up to each application to decide what it puts on the clipboard. Typically, a Windows app that knows what it’s doing will put a number of different representations on the clipboard. When you paste into the other app, that app will look for the representation that best suits its needs.
In your case, a text editor (that knows what it’s doing) will put a Unicode representation of a selected string onto the clipboard (where Unicode, in Windows, is typically moved around as UTF-16, but that’s not important). When you paste in the other app, it will insert that sequence of Unicode characters into the document at the selection point.
Copy/paste is not the only way to introduce encoding issues. APIs, when used improperly, can produce characters that are not encoded properly. Custom keyboard maps and apps can do this. Poorly constructed apps (regardless of OS) can do this.
One of the hallmarks of an encoding issue is that a tertiary process that removes (or sanitizes) bad characters suddenly causes an app to work as expected, giving you a false sense of comfort that the cause is an underlying platform incapable of dealing with certain characters. Indeed, this is partially true - the platform cannot handle these characters because they are improperly encoded, which is a very different assertion than the platform being unable to handle what you perceive to be a colon or apostrophe.
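Here is a rough Python illustration of that distinction. I'm assuming, only for the sake of the example, that the stray content arrived as Windows-1252 bytes; your actual source encoding may well be something else:

```python
# What you perceive: an ordinary-looking apostrophe (actually U+2019).
original = "It\u2019s here"

# Case 1: UTF-8 bytes read back as Windows-1252 produce mojibake.
misread = original.encode("utf-8").decode("cp1252")
print(misread)

# Case 2: Windows-1252 bytes handed to a strict UTF-8 consumer are rejected.
legacy_bytes = original.encode("cp1252")   # the apostrophe becomes the single byte 0x92
try:
    legacy_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)

# "Sanitizing" by dropping every non-ASCII byte makes the error go away,
# but only because the badly encoded byte was thrown out along with the character.
sanitized = legacy_bytes.decode("ascii", errors="ignore")
print(sanitized)

# Decoding with the codec that actually produced the bytes keeps the character.
recovered = legacy_bytes.decode("cp1252")
print(recovered == original)
```

The platform in case 2 has no problem with apostrophes. It has a problem with the byte 0x92 appearing where valid UTF-8 was promised.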
Without question – setting aside AI and machine-learning – character encoding is the closest computer scientists will ever come to voodoo. It’s complex, it’s difficult to understand, and it is very difficult to debug. But worse, it is the hidden side of computing, a place where perceptions are not typically representations of fact.
I recommend this tutorial on character encoding which will likely create more questions than it will answer. And when that happens, you’ll need this article.
Bottom line - my sense is that the mix of your content process and interchanges of information is causing the issues you have experienced. If deleting characters from a content process suddenly makes it work, that is almost certainly because those characters are non-compliant byte sequences being presented to an API that requires valid UTF-8 (or otherwise Unicode-compliant) input.
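If you want to confirm this rather than take my word for it, something along these lines can point at the offending bytes before the content reaches the API. This is a sketch, not a drop-in fix: `find_invalid_utf8` is a hypothetical helper name of mine, and I'm assuming the receiving API expects UTF-8 specifically:

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of the remainder; success means nothing else is wrong.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = max(pos + err.end, start + 1)
            problems.append((start, data[start:end]))
            pos = end  # resume scanning just past the bad span
    return problems


clean = "It\u2019s here".encode("utf-8")
dirty = b"It\x92s here"          # Windows-1252 apostrophe in a supposedly UTF-8 payload

print(find_invalid_utf8(clean))
print(find_invalid_utf8(dirty))
```

An empty list means the payload is valid UTF-8. Anything else tells you exactly where to look in the source content, which is far more useful than deleting characters until the error stops.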