Lesson4 etc.

gauthiier
2015-02-18 08:30:20 +01:00
parent fa93688603
commit 6095baeac4
17 changed files with 792 additions and 23 deletions
+3 -3
@@ -9,11 +9,11 @@
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="stylesheet" href="style/style.css">
<link rel="stylesheet" href="style/html5.css">
</head>
<body>
<content>
<h2 id="text-encoding">Text Encoding</h2>
<h1 id="text-encoding">Text Encoding</h1>
<p>We believe in approaching text writing by first understanding the core inscription mechanism upheld by modern computing machines. In this lesson we will therefore look at how text and characters are inscribed and represented internally within computers. More specifically, we will look at standards of text encoding (and decoding) and see how text editors decode such encodings.</p>
<h3 id="goals">Goals</h3>
<p>The aim of this lesson is to present the various ways in which computers represent text internally, that is, characters as numbers. The lesson is tailored to give the reader a basic knowledge of the standards that establish the quanta of text (data). Our hope in doing so is to convey a sense of the materiality of text and to present the ways in which various levels of abstraction are applied to it.</p>
@@ -26,7 +26,7 @@
<li>Learn how to use a plain text editor to write, view and inspect different open-standard encodings of a given text file.</li>
</ol>
<h3 id="history">History</h3>
<p>As everyone heard of the byte format? If you didn't it's about time you do as you employ this legacy format daily when using your computer. A byte is the most basic quanta of computing and is composed of 8 bits, where a bit stands for what is commonly represented by a 0 or 1. Hence a byte is a 8-bits &quot;packet&quot; which can represent decimal numbers ranging from 0 to 255 (or -128 to 127). In this lesson we will use the <a href="https://en.wikipedia.org/wiki/Hexadecimal">Hexadecimal</a> notation to represent bytes. A byte is an historical format and encapsulate the most basic data structure in computing machinery, a standard introduced by IBM for its flagship <a href="http://www.computermuseum.li/Testpage/IBM-360-1964.htm">IBM/360</a> mainframe machine in 1964.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
<p>Has everyone heard of the byte format? If you haven't, it's about time you did, since you rely on this legacy format every day when using your computer. A byte is the most basic quantum of computing and is composed of 8 bits, where a bit stands for what is commonly represented by a 0 or a 1. Hence a byte is an 8-bit &quot;packet&quot; with 256 possible values, which can represent decimal numbers ranging from 0 to 255 (or from -128 to 127 when read as a signed number). In this lesson we will use the <a href="https://en.wikipedia.org/wiki/Hexadecimal">Hexadecimal</a> notation to represent bytes, in which every byte is written as two digits from 00 to FF. The byte is a historical format that encapsulates the most basic data structure in computing machinery; its 8-bit form became a standard with IBM's flagship <a href="http://www.computermuseum.li/Testpage/IBM-360-1964.htm">IBM/360</a> mainframe machine in 1964.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
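<p>The ranges mentioned above can be checked with a few lines of Python. This is only a minimal sketch: the helper name <code>describe_byte</code> and the chosen sample values are our own illustration, not part of any standard.</p>

```python
# One byte = 8 bits, so it can take 2**8 = 256 distinct values.
BITS_PER_BYTE = 8
distinct_values = 2 ** BITS_PER_BYTE


def describe_byte(value):
    """Return the binary, hexadecimal, unsigned and signed readings of one byte.

    `describe_byte` is a hypothetical helper written for this lesson.
    """
    if not 0 <= value <= 255:
        raise ValueError("a byte holds a value between 0 and 255")
    raw = bytes([value])
    return {
        "binary": format(value, "08b"),   # 8 binary digits
        "hex": format(value, "02X"),      # 2 hexadecimal digits
        "unsigned": int.from_bytes(raw, "big", signed=False),
        "signed": int.from_bytes(raw, "big", signed=True),
    }


# The same 8 bits read as an unsigned (0..255) or signed (-128..127) number.
for value in (0, 65, 127, 128, 255):
    print(describe_byte(value))
```

<p>Note how the bit pattern stays the same while its interpretation changes: the byte FF reads as 255 when unsigned and as -1 when signed.</p>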
<p>At roughly the same time (1963), another (updated) standard was devised for the encoding of characters: ASCII [ref]. ASCII defined a 7-bit format for characters, which was stored in an 8-bit byte on the IBM/360. With 7 bits, ASCII can encode 128 characters (codes 0 to 127). However, IBM opted to use the legacy <a href="https://en.wikipedia.org/wiki/EBCDIC">EBCDIC</a> 8-bit format as the default character set (dubbed &quot;charset&quot;) for all software developed for the IBM/360<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. Hence the mass adoption of ASCII as the default charset in computing systems came years later, mainly with the advent of PCs.</p>
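<p>To get a feel for the 7-bit format, the following Python sketch lists the ASCII code of each character of a short word. The helper <code>ascii_table_row</code> and the sample word are assumptions made for illustration only.</p>

```python
# 7 bits give 2**7 = 128 codes, numbered 0 to 127.
ASCII_BITS = 7
ascii_code_points = 2 ** ASCII_BITS


def ascii_table_row(char):
    """Return (decimal code, 7-bit binary, hexadecimal) for one ASCII character.

    `ascii_table_row` is a hypothetical helper written for this lesson.
    """
    code = ord(char)
    if code >= ascii_code_points:
        raise ValueError(f"{char!r} is not an ASCII character")
    return code, format(code, "07b"), format(code, "02X")


text = "Text"
encoded = text.encode("ascii")  # one byte per character

for char in text:
    print(char, ascii_table_row(char))

# Stored in 8-bit bytes, every ASCII code leaves the highest bit at 0.
print(encoded.hex())
```

<p>Because each code fits in 7 bits, every byte of an ASCII text has a value below 128; the eighth bit is simply left unused.</p>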
<p>Is ASCII still in use today? Yes and no. ASCII has some important limitations, as it was designed around the basic Latin alphabet and supports neither accented Latin letters nor non-Latin characters (hence a 7-bit format being enough for a Latin alphabet). With the spread of PCs around the world and the rise of the Internet as the main communication infrastructure, the need for a single character format (indeed, a universal format) accounting for both Latin and non-Latin characters (Cyrillic, Hebrew, Arabic, Turkish, to name a few) had become pressing by the beginning of the 1990s.</p>
<p>Hence the establishment of the Unicode standard, whose aim is to devise and maintain a Universal Character Set (UCS) composed of a unique code point for each character (a kind of &quot;meta&quot;-charset if you want, composed of specific Unicode codes)<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>. A code point is an abstract number: it does not by itself determine how a character is stored as bytes. That mapping is the job of separate encodings of the UCS such as the UTFs (UCS Transformation Formats), the most notable being UTF-8.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> The special feature of UTF-8 is that it is directly backward compatible with ASCII (an ASCII character stored in an 8-bit byte has the same encoding as its UTF-8 version) and that it is variable in length, meaning that ASCII characters are encoded with a single byte while all other characters, including accented Latin letters, are encoded with 2 to 4 bytes.<a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a> Nowadays, UTF-8 is one of the most (if not <em>the</em> most) widely adopted character encoding formats.<a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a></p>
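<p>The variable length of UTF-8 and its compatibility with ASCII can be observed directly. The Python sketch below is illustrative only: the helper <code>utf8_hex</code> and the sample characters are our own choices.</p>

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of a character as hexadecimal pairs.

    `utf8_hex` is a hypothetical helper written for this lesson.
    """
    return " ".join(format(byte, "02X") for byte in char.encode("utf-8"))


# (description, character) pairs, written as escapes so that the source
# file itself stays plain ASCII.
samples = [
    ("Latin capital A", "\u0041"),
    ("Latin e with acute", "\u00e9"),
    ("Cyrillic ya", "\u044f"),
    ("Hebrew alef", "\u05d0"),
    ("Euro sign", "\u20ac"),
    ("Grinning face emoji", "\U0001F600"),
]

for name, char in samples:
    code_point = f"U+{ord(char):04X}"  # the abstract Unicode code point
    encoded = utf8_hex(char)           # its concrete UTF-8 bytes
    print(f"{name:22} {code_point:8} {encoded}")

# Backward compatibility: pure ASCII text is byte-for-byte identical in UTF-8.
same_bytes = "Text".encode("ascii") == "Text".encode("utf-8")
print(same_bytes)
```

<p>The code point and the bytes are two distinct things: U+20AC names the euro sign in the UCS, while E2 82 AC is how UTF-8 happens to store it.</p>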