fixed merge Lesson1

gauthiier 2021-12-11 13:25:04 +01:00
parent beff3c4f17
commit 44e3454421


@ -2,11 +2,7 @@
bibliography: wwwrite.bib
---
# Text Encoding
We believe in approaching text writing by first understanding the core inscription mechanism upheld by modern computing machines. In this lesson we will hence look at how text and characters are inscribed and represented internally within computers. More specifically, we will look at standards of text encoding (and decoding) and see how text editors decode such encodings.
@ -24,11 +20,7 @@ We believe in approaching text writing by first understanding the core inscripti
### History
Has everyone heard of the byte format? If you haven't, it's about time you did, as you employ this legacy format daily when using your computer. A byte is the most basic quantum of computing and is composed of 8 bits, where a bit stands for what is commonly represented by a 0 or 1. Hence a byte is an 8-bit "packet" which can represent decimal numbers ranging from 0 to 255 (or -128 to 127 when read as a signed number). In this lesson we will use the [Hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) notation to represent bytes. A byte is a historical format and encapsulates the most basic data structure in computing machinery, a standard introduced by IBM for its flagship [IBM/360](http://www.computermuseum.li/Testpage/IBM-360-1964.htm) mainframe machine in 1964.[^1]
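
To make the byte notations above concrete, here is a minimal Python sketch. The function name `byte_views` is hypothetical, and the signed reading assumes the common two's-complement convention, which the lesson does not spell out.

```python
def byte_views(value: int) -> dict:
    """Return several notations for one byte value (0 to 255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    # Assumption: signed bytes use two's complement, so 128..255 map to -128..-1.
    signed = value - 256 if value > 127 else value
    return {
        "unsigned": value,
        "signed": signed,
        "hex": f"{value:02x}",   # two hexadecimal digits per byte
        "bits": f"{value:08b}",  # eight bits per byte
    }


print(byte_views(65))   # the byte that ASCII assigns to 'A'
print(byte_views(255))  # the largest unsigned byte value
```

Each byte always fits in exactly two hexadecimal digits, which is why hexadecimal is a convenient notation for the rest of the lesson.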
Roughly at the same time (1963) another (updated) standard was devised for the encoding of characters: ASCII [ref]. ASCII defined a 7-bit format for characters, which fits within the 8-bit byte of the IBM/360. With a 7-bit format, ASCII can encode 128 characters (2^7). However, the IBM/360 opted to use the legacy [EBCDIC](https://en.wikipedia.org/wiki/EBCDIC) 8-bit format as its default character set (dubbed "charset") for all software developed for the IBM/360[^2]. Hence the mass adoption of ASCII as the main default charset in computing systems came years later, mainly with the advent of PCs.
@ -36,7 +28,6 @@ Is ASCII still in use today? Yes and no. ASCII has some important limitations as
Hence the establishment of the Unicode standard, whose aim is to devise and maintain a Universal Character Set (UCS) composed of a special code point for each character (a kind of "meta"-charset if you want, composed of specific Unicode codes)[^3]. A code point does not by itself dictate how a character is stored as bytes. Rather, encodings are defined by specific implementations of the UCS such as UTF (UCS Transformation Format), the most notable being UTF-8.[^4] The special feature of UTF-8 is that it is directly backward compatible with ASCII (an ASCII character has the same encoding as its UTF-8 version) and that it is variable in length, meaning that ASCII characters (the basic Latin letters, digits and punctuation) are encoded with a single byte while all other characters are encoded with two to four bytes.[^5] Nowadays, UTF-8 is one of the most (if not _the_ most) widely adopted character encoding formats.[^6]
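
The two properties of UTF-8 described above can be checked with a short Python sketch. The helper `utf8_length` and the sample characters are illustrative choices, not part of the lesson's own examples.

```python
def utf8_length(character: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(character.encode("utf-8"))


# Backward compatibility: an ASCII character has identical bytes in both encodings.
print("a".encode("ascii") == "a".encode("utf-8"))

# Variable length: escapes are used so the sample does not depend on the file's own encoding.
samples = ["a", "\u00e2", "\u20ac", "\U0001F600"]  # a, â, €, 😀
for character in samples:
    print(character, utf8_length(character), character.encode("utf-8").hex())
```

The loop should report one, two, three and four bytes respectively, one sample per UTF-8 length.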
### How
Let's start with a very simple example to illustrate how text is encoded.
@ -75,11 +66,7 @@ A few observations from the examples above are worth noting:
4. UTF-8 encoding of the Vietnamese sentence is _not_ necessarily more compact than Unicode's UCS. In fact we see UTF-8 utilising more than one byte to encode some characters (remember that UTF-8 is of variable length). For example the character 'â' is 'U+00e2' in UCS (two significant hexadecimal digits, i.e. one byte) while it is 'c3a2' in UTF-8 (four hexadecimal digits, i.e. two bytes). A great chart to look at the various codes and encodings can be found here: [http://utf8-chartable.de](http://utf8-chartable.de)
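
The 'â' comparison from the last observation can be reproduced with a few lines of Python. This is only a sketch using the built-in `ord` and `str.encode`; the variable names are illustrative.

```python
character = "\u00e2"  # 'â', written as an escape to avoid ambiguity

code_point = f"U+{ord(character):04x}"   # the UCS code point
utf8_bytes = character.encode("utf-8")   # the bytes actually stored

print(code_point)        # U+00e2
print(utf8_bytes.hex())  # c3a2
print(len(utf8_bytes))   # 2
```

The code point and the UTF-8 bytes are different numbers: the first identifies the character, the second is what gets written.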
At this point, we should stress the fact that what is inscribed in computing memory is the _encoding_ of text and not its Unicode representation. In other words, UTF-8 is the scheme by which computers inscribe text to physical memory using their read/write mechanisms. What is inscribed physically are single bits following the UTF-8 encoding scheme, which gives meaning to 8-bit "packets" as characters. In the example above we have employed the hexadecimal notation to represent such "packets"/data. This is, of course, a kind of abstraction from the physical layer where text is actually inscribed, a convenient way for us humans to decipher and group bits. It nonetheless gives us a feel for the type of "materiality" of text inscribed on and manipulated by computing machines. For a more in-depth analysis of physical inscription mechanisms, we refer to the forensic work of Kirschenbaum [@kirschenbaum_mechanisms_2012] on the subject.
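
To get one step closer to the bits that are actually inscribed, the following Python sketch prints the UTF-8 encoding of a text as groups of 8 bits. The function `to_bits` is a hypothetical helper, and the output is still an abstraction of the physical layer rather than a view of it.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Render the encoded bytes of a text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


print(to_bits("a"))       # 01100001
print(to_bits("\u00e2"))  # 11000011 10100010  (the two bytes c3 a2 of 'â')
```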
#### (Plain) Text Editors