The Difference Between "UTF-8 with BOM" and "UTF-8 without BOM"

Reference: https://www.zhihu.com/question/20167122/answer/14194448

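UTF-8 does not need a BOM, although the Unicode Standard permits one in UTF-8.
UTF-8 without a BOM is therefore the standard form, and putting a BOM in UTF-8 files is mainly a Microsoft habit (incidentally, calling little-endian UTF-16 with a BOM simply "Unicode", without further explanation, is also a Microsoft habit).
The BOM (byte order mark) was designed for UTF-16 and UTF-32, where it marks the byte order. Microsoft uses a BOM in UTF-8 because it lets UTF-8 be clearly distinguished from encodings such as ASCII, but such files cause problems on operating systems other than Windows.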
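To make the byte order role concrete, here is a minimal Python sketch using only the standard `codecs` module. It shows that U+FEFF serializes differently in the two UTF-16 byte orders, and that the generic `utf-16` codec reads the BOM to choose the byte order. The variable names are illustrative only.

```python
import codecs

# U+FEFF serialized in each byte order; the resulting bytes reveal the order.
bom = "\ufeff"
utf16_le = bom.encode("utf-16-le")  # bytes FF FE
utf16_be = bom.encode("utf-16-be")  # bytes FE FF

# The standard library exposes the same sequences as constants.
assert utf16_le == codecs.BOM_UTF16_LE
assert utf16_be == codecs.BOM_UTF16_BE

# The generic "utf-16" codec reads the BOM to pick the byte order, then drops it.
decoded_le = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16")
decoded_be = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16")
```

Both `decoded_le` and `decoded_be` end up as the same one-character string, which is the ambiguity the BOM resolves in UTF-16. UTF-8 has a single byte order, so it has no such ambiguity to resolve.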

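The only difference between "UTF-8" and "UTF-8 with BOM" is whether a BOM is present, that is, whether the file begins with U+FEFF (encoded in UTF-8 as the three bytes EF BB BF).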
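As a rough illustration, the check can be sketched in Python. The helper name `has_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


text = "héllo"
plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # prepends EF BB BF

print(has_utf8_bom(plain))
print(has_utf8_bom(with_bom))
```

Apart from those three leading bytes, the two encodings produce identical output.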

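Web pages encoded in UTF-8 should not use a BOM; otherwise things often go wrong. A small example: why does the browser treat the information inside <head> in this page's code as being inside <body>?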
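One common way to sidestep such problems when processing files yourself is to decode with Python's `utf-8-sig` codec, which discards a leading BOM and otherwise behaves like `utf-8`. The sketch below is an assumption-laden illustration: the helper name `decode_utf8_tolerant` and the sample page bytes are made up for the example.

```python
def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM if one is present."""
    return data.decode("utf-8-sig")


# Hypothetical page content that was saved with a BOM.
page = b"\xef\xbb\xbf<!DOCTYPE html><html><head></head><body></body></html>"

strict = page.decode("utf-8")       # keeps U+FEFF as the first character
clean = decode_utf8_tolerant(page)  # starts directly with "<!DOCTYPE html>"
```

With the plain `utf-8` codec the invisible U+FEFF character stays in front of the markup, which is the kind of stray leading content that can confuse downstream tools.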

Also attached is a passage from The Unicode Standard, Version 6.0, Section 3.10, D95 (UTF-8 encoding scheme):

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
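The conversion scenario described in the quoted passage can be reproduced with a short Python sketch. It assumes a converter that decodes with an explicit byte order (`utf-16-le`), which leaves the BOM in the text; the sample data is made up for illustration.

```python
import codecs

# A UTF-16 little-endian stream that starts with a byte order mark.
utf16_data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Decoding with an explicit byte order ("utf-16-le") does not strip the BOM,
# so U+FEFF survives as the first character of the text.
text = utf16_data.decode("utf-16-le")

# Re-encoding as UTF-8 turns that character into the bytes EF BB BF.
utf8_data = text.encode("utf-8")
```

Here `utf8_data` begins with `codecs.BOM_UTF8`, so the BOM has carried over into the UTF-8 output without anyone adding it on purpose.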


https://fulequn.github.io/2020/09/Article202009261/

Author: Fulequn

Published: September 26, 2020
