Lesson #2 mid-draft #1

gauthiier
2015-02-16 11:02:46 +01:00
parent 10a48c212d
commit 903b5d9cba
5 changed files with 66 additions and 3 deletions
+2 -1
@@ -60,7 +60,7 @@
</div>
<p>To select the encoding of a file using Sublime: <strong>Menu</strong> -&gt; <strong>File</strong> -&gt; <strong>Reopen with Encoding</strong></p>
<p>The above file is an Apple Pages file that we have opened using Sublime with UTF-8 decoding.</p>
<p>As you can see there is many characters that do not read properly, that is, not human readable. In fact, we can see that UTF-8 decodes the bytes in the file and maps their content to some Unicode &quot;control&quot; character. These &quot;control&quot; characters are part of the UCS and are characters representing computer commands if you like, rather than elements of an alphabet. For example a &quot;new line&quot; character representing a new line in a text (when the &quot;return&quot; key is pressed on a keyboard) has a &quot;LF&quot; (Line Feed) symbol with UCS U+000A value. There exists many of such characters, notably &quot;EOF&quot; - End-Of-File, &quot;ESC&quot; - Escape, and &quot;NULL&quot; -- 0x0000 to name a few.</p>
<p>As you can see, there are many characters that do not read properly, that is, they are not human-readable. In fact, we can see that the UTF-8 decoder maps the bytes in the file to some Unicode &quot;control&quot; characters. These &quot;control&quot; characters are part of the UCS and represent computer commands, if you like, rather than elements of an alphabet. For example, the &quot;new line&quot; character, which represents a line break in a text (produced when the &quot;return&quot; key is pressed on a keyboard), is the &quot;LF&quot; (Line Feed) character with UCS value U+000A. There exists a variety of such characters.<a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a></p>
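<p>As a minimal sketch of how such control characters show up once bytes are decoded, the Python snippet below decodes a short byte string as UTF-8 and reports each character's code point and Unicode general category (&quot;Cc&quot; marks a control character). The sample bytes and the <code>describe</code> helper are made up for illustration; they are not taken from the Pages file.</p>

```python
import unicodedata

# Hypothetical sample: "Hi", then a line feed, an escape and a NULL byte.
raw = b"Hi\n\x1b\x00"

# Decode the bytes as UTF-8, as a text editor would.
text = raw.decode("utf-8")


def describe(ch):
    """Return the code point (U+XXXX) and Unicode general category of ch."""
    return "U+%04X" % ord(ch), unicodedata.category(ch)


report = [describe(ch) for ch in text]
for code_point, category in report:
    # Letters report "Lu"/"Ll"; control characters report "Cc".
    print(code_point, category)
```

<p>The line feed is reported as U+000A with category &quot;Cc&quot;, matching the value given above.</p>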
<p>However, in the case of the Apple Pages file, these &quot;control&quot; characters are meaningless, as they obviously do not follow the Unicode standard. Instead, Pages inserts into its text specific commands that only have meaning for the Apple Pages program. In short, these are bytes that have meaning only to Apple and its specific regime of encoding files. Such commands may refer to specific ways to display certain types of characters, or perhaps signify the beginning of a paragraph, or specify a font to render text, or even be the data of an image (who knows?). Pages is not a standard format but a proprietary one, therefore it is not possible to instruct my text editor on how to decode the bytes found in the Pages document. In a sense, keeping all data (information about the design, layout, font, etc.) in a single file makes such files overly complex compared to the plain text format. As a result, word processing files tend to be larger in size than plain UTF-8 encoded ones. The text from the file above has 1 389 characters. Its Apple Pages file is composed of 179 759 bytes while its plain UTF-8 version is only 1 391 bytes (two extra bytes for the &quot;EOF&quot; control character).</p>
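<p>The relation between character count and byte count in a plain UTF-8 file can be checked with a few lines of Python. This is only a sketch on made-up strings, not a measurement of the file discussed above: for pure ASCII text the two counts are equal, while each non-ASCII character adds one or more extra bytes.</p>

```python
# Hypothetical ASCII-only sample, ending with a line feed.
sample = "Plain text\n"
n_chars = len(sample)                   # number of Unicode characters
n_bytes = len(sample.encode("utf-8"))   # number of bytes once encoded

# Hypothetical sample with one non-ASCII character (U+00E9, e with acute).
accented = "caf\u00e9"
accented_chars = len(accented)
accented_bytes = len(accented.encode("utf-8"))

print(n_chars, n_bytes)
print(accented_chars, accented_bytes)
```

<p>A small gap between the two counts, like the one reported above, is therefore what one would expect from plain text, whereas a word processing file adds orders of magnitude more bytes.</p>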
<p>In turn, the obvious unreadability of proprietary word processing file formats (such as Apple Pages, MS Word), coupled with their tendency to bloat files, makes them problematic in terms of politics of encoding, usability and efficiency. Hence, standards like UTF-8 and the use of plain text editors are a viable alternative for writing academic text, one sustained by a practice that is unbounded by obfuscating interests and techniques. What is human-readable is human-understandable.</p>
<h3 id="extra">Extra</h3>
@@ -84,6 +84,7 @@
<li id="fn6"><p>Especially on the Internet -- see character encodings historical trend <a href="http://w3techs.com/technologies/history_overview/character_encoding/ms/y">chart</a>.<a href="#fnref6"></a></p></li>
<li id="fn7"><p>Although moving to <a href="https://atom.io">Atom</a> imminently.<a href="#fnref7"></a></p></li>
<li id="fn8"><p>For a list of such editors please refer to <a href="https://en.wikipedia.org/wiki/Comparison_of_text_editors">this article</a>.<a href="#fnref8"></a></p></li>
<li id="fn9"><p>For a comprehensible explanation of these codes (derived from ASCII) please refer to the historical <a href="https://tools.ietf.org/html/rfc20">RFC20</a> - ASCII format for Network Interchange. The concept of control codes was introduced by the legacy <a href="https://en.wikipedia.org/wiki/Baudot_code">Baudot (1870) and Murray codes (1901)</a>, which were standard coding techniques up until the advent of the aforementioned EBCDIC.<a href="#fnref9"></a></p></li>
</ol>
</div>
</content>