Strange Characters in CSV Export File

Mens_Shed · ‎Jul 07, 2018

When I export values from a table, I get three weird characters at the start of the first line only, the one that contains the field names.

My first line looks like this

ï»¿Member,Forename,Middle Names,Surname,DOB,Address 1,Address 2,Town,County,Postcode,Home Phone,Mobile Phone,ICE Contact,ICE Number,Email Address,Occupation,Qualifications,Interests,Disabilities

Is this a bug?

Mens_Shed · ‎Jul 08, 2018

Hello again

I must have downloaded literally hundreds of text files from many different sites over the past 20 years and processed them with Notetab and my scripting language, and I have never seen these characters in a CSV file (or any other type of text file) before.

My money is on Airtable introducing them when the file is constructed.

I wonder if Airtable staffers are looking at this thread?

W_Vann_Hall · ‎Jul 08, 2018

Remember, this is an issue that exists only for UTF-8 files saved under Windows. For all intents and purposes, in the web environment, this means files saved from HTML5-enabled sites. Since HTML5 wasn’t officially published until October 2014, the first 17 years are moot.

I just stepped through the random CSV files I happen to have on this PC; about half have a BOM and half don’t.

Note that Windows’ built-in apps will always add a BOM when saving a UTF-8 file, so it’s unlikely this won’t be an issue for you in the future. Notetab appears to have a troubled history with BOMs; the current version doesn’t appear to need them, but I haven’t been able to determine if it will strip them out on a file save. (Notepad++ seemingly can.)

I still think the problem is that your scripting language should ignore a BOM at the beginning of a text file. That said, I’m not sure how essential the BOM is these days, as most up-to-date apps can identify UTF-8 without it. Conceivably, Airtable could drop it — but you’re probably better off using one of the dozens of utilities or Unicode-knowledgeable editors to trim it off.
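For anyone whose scripting language is (or can call) Python, here is a minimal sketch of what "ignoring a BOM" can look like. The sample bytes and the `read_header` helper are made up for illustration; the relevant part is the standard `utf-8-sig` codec, which drops a leading BOM if one is present and otherwise behaves like plain UTF-8.

```python
import csv
import io

# Hypothetical sample: the start of an export that begins with a UTF-8 BOM.
raw = b"\xef\xbb\xbfMember,Forename,Surname\r\n1,Ann,Lee\r\n"


def read_header(data, encoding):
    """Decode CSV bytes with the given encoding and return the header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


# Plain UTF-8 keeps U+FEFF glued to the front of the first field name.
header_plain = read_header(raw, "utf-8")

# utf-8-sig removes a leading BOM when there is one.
header_sig = read_header(raw, "utf-8-sig")

# The same codec is safe on files that never had a BOM.
header_no_bom = read_header(raw[3:], "utf-8-sig")
```

When reading from disk, the same idea applies by passing `encoding="utf-8-sig"` to `open()`.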

Zach_Young · ‎Jun 08, 2021

I’m on a Mac and see this all the time, and am finally having to deal with the exported BOM in my export→process→import pipeline.

Here’s my perspective on this weird little byte sequence in Airtable CSV exports. All quoted text comes from Wikipedia: Byte order mark, and any emphasis is mine.

[It] is a particular usage of the special Unicode character, U+FEFF BYTE ORDER MARK

It might also be identified as ZWNBSP:

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a “zero-width non-breaking space” [ZWNBSP]. In Unicode 3.2, this usage is deprecated in favor of the “Word Joiner” character, U+2060. This allows U+FEFF to be used only as a BOM.
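As a quick illustration (a sketch, not a claim about what any particular editor does internally), this Python snippet shows how U+FEFF turns into the three characters from the first post: encoded as UTF-8 it is three bytes, and reading those bytes as Windows-1252 yields "ï»¿". The choice of `cp1252` is my assumption about the viewer's default encoding.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte sequence EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")
same_as_stdlib_constant = bom_bytes == codecs.BOM_UTF8

# Decoding those bytes as Windows-1252 (assumed viewer encoding)
# produces the three visible characters instead of an invisible mark.
mojibake = bom_bytes.decode("cp1252")
```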

Technically, the BOM is optional for UTF-8:

The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use.

but Windows requires it:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.

My process involves submitting parts of the CSV to a third party for processing and getting back their results, and they strip the BOM, even though:

The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.

Yep, later stages of my pipeline fail by flagging a false difference, because the BOM the third party stripped isn’t there anymore. So I’m now stripping it myself at the start of the pipeline.

And I’m guessing it’s there in the first place because Airtable put it there by design to be cross-compatible w/Windows.
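For the stripping step mentioned above, a minimal Python sketch, assuming the export fits in memory as bytes. The function name `strip_utf8_bom` is hypothetical and not part of any Airtable tooling; it only removes a BOM at the very start and leaves everything else untouched.

```python
import codecs


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample bytes standing in for an exported file.
with_bom = codecs.BOM_UTF8 + b"Member,Forename\r\n1,Ann\r\n"
without_bom = b"Member,Forename\r\n1,Ann\r\n"

cleaned = strip_utf8_bom(with_bom)
unchanged = strip_utf8_bom(without_bom)
```

Working on bytes rather than decoded text means the rest of the file is passed through exactly as exported, so a later byte-for-byte comparison only differs by the BOM.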