Writing System Essay, Research Paper
Complex-text Languages
In the languages of the western world based on the Latin, Cyrillic and Greek scripts, there is no difference between how text is stored for data processing and how it is presented on a display or a printer. The text is read on horizontal lines from left to right, the lines progress from top to bottom and the characters are stored in a manner identical to how they are presented.
Not all the languages of the world have these characteristics.
In this document, complex-text languages are defined as those languages for which the text has a different layout when presented from when it is stored for data processing. The term layout, which is equivalent, in this context, to the term format, refers to the shape of the characters and the direction of portions of the text.
An additional characteristic of complex-text languages (with the exception of Vietnamese) is that they have no upper-case or lower-case characters.
Typical complex-text languages are those with a bi-directional script. Usually they are written from right to left, with some portions of text, such as numbers and embedded Latin-based text, written from left to right. Bi-directional languages include the languages of the Middle East and Africa (Arabic, Hebrew, Urdu, Farsi, Yiddish, and so on). Other complex-text languages include some languages of Asia that do not limit their encoding to a double-byte scheme (Thai, Lao, Vietnamese, Korean, and so on).
There is nothing in these languages themselves that is more complex than in the Latin-based languages; they are special only in that the presented text does not necessarily look identical to the text as stored.
Though the term complex is used to describe the text of the bi-directional and some other Asian languages, enabling a program to work in these languages is relatively simple, once the peculiarities of these languages are understood.
Layout Transformations and Related Attributes
To enter, process and present a text in a complex-text language, it is necessary to perform transformations between the processing layouts and the presentation layouts. The processing layout is the layout of text when stored or processed. The presentation layout is the layout of text when presented on a display or a printer.
These transformations have to take into account specific text attributes, including directionality, shaping, composition of characters and national numbers. Text attributes that describe bi-directional writing systems are defined in Bi-directional Languages.
An internationalized application must be designed to deal automatically with this kind of transformation and related attributes.
Bi-directional Languages
The bi-directional languages are used mainly in the Middle East. They include Arabic, Urdu, Farsi, Hebrew and Yiddish.[1]
In a bi-directional language, the general flow of text proceeds horizontally from right to left, but numbers are written from left to right, the same way as they are written in English. In addition, if text in English or another left-to-right language (addresses, acronyms or quotations) is embedded, it is also written from left to right.
Aspects of Bi-directional Language Writing Systems
This section discusses aspects of bi-directional texts, related to directionality, shaping and national numbers as well as keyboard input and compliance with common user access guidelines. The text attributes described here also pertain to some degree to other complex-text languages such as the languages of Asia (for example, Thai, Lao, Korean).
Bi-directionality
In the context of bi-directionality, the following are key concepts:
Segments
Global orientation
Text-types and associated reordering methods
Symmetrical swapping.
These attributes are described below.
Segments
A bi-directional text may consist of a main part that has one directionality (for example, an Arabic text written from right to left) and portions that have the opposite directionality (for example, an English address written from left to right). A portion of text with a different directionality is called a segment. A bi-directional text thus might have a body of right-to-left text with embedded left-to-right segments. Sometimes a segment with one directionality might itself have an additional segment with the opposite directionality embedded or nested within it. It is conceptually possible to have many levels of nesting; in most cases, however, there are no more than two levels.
One level of nesting is necessary for the entry of numbers within Arabic or Hebrew text. To simulate bi-directional scripts in the following examples, Hebrew and Arabic text is represented by lower-case English letters, while upper-case letters represent English text.
In Hebrew, it is customary to write the name of the street before the number of the house, as shown below:
b ecnartne 25 teerts elpam
<--------- -> <-----------
The street name is entered from right to left. The flow then has to be reversed to allow correct entry of the number from left to right (this being the nested left-to-right segment). Then the flow must be reversed again to allow the entry of the entrance information from right to left.
Imagine somebody writing a letter in English to somebody who can read Hebrew too, and writing his or her address in Hebrew. In this case, the address in Hebrew is actually a nested segment of the English text.
            MY ADDRESS IS b ecnartne 25 teerts elpam THIS MONTH.
            ------------> <--------- -> <----------- ---------->
NEST LEVEL: 0000000000000011111111111221111111111111000000000000
Because the nested segment of the address has itself a nested segment (the street number), there are two levels of nesting.
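The NEST LEVEL line above can be reproduced with a short sketch. It relies on this section's simulation convention (lower-case letters stand for right-to-left script, upper-case letters for English). The function name nest_levels and the rule that spaces and punctuation take the lower of the two neighbouring levels are assumptions made for illustration; they are not taken from any standard algorithm.

```python
def nest_levels(text, global_orientation="ltr"):
    """Return one nesting-level digit per character of simulated bidi text.

    Convention from the examples: lower case = right-to-left script,
    upper case = left-to-right (English) text.
    """
    base_is_ltr = global_orientation == "ltr"
    levels = []
    last_strong_is_base = True  # before any letter, assume the global orientation
    for ch in text:
        if ch.isalpha():
            last_strong_is_base = (ch.isupper() == base_is_ltr)
            levels.append(0 if last_strong_is_base else 1)
        elif ch.isdigit():
            if base_is_ltr:
                # A number inside a nested right-to-left segment is nested again.
                levels.append(0 if last_strong_is_base else 2)
            else:
                # In right-to-left text a number is always a left-to-right segment.
                levels.append(1)
        else:
            levels.append(None)  # neutral character, resolved below

    # Assumption: a neutral takes the lower level of its nearest resolved
    # neighbours; the edges of the text count as level 0.
    for i, level in enumerate(levels):
        if level is None:
            left = next((l for l in reversed(levels[:i]) if l is not None), 0)
            right = next((l for l in levels[i + 1:] if l is not None), 0)
            levels[i] = min(left, right)
    return "".join(str(level) for level in levels)


print(nest_levels("MY ADDRESS IS b ecnartne 25 teerts elpam THIS MONTH.", "ltr"))
print(nest_levels("b ecnartne 25 teerts elpam", "rtl"))
```

The first call covers the letter example with its two levels of nesting; the second covers the stand-alone Hebrew address, where the house number is the only nested segment.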
Global Orientation
Bi-directional text may consist of mainly right-to-left text with some left-to-right nested segments (such as an Arabic text with some information in English), or mainly left-to-right with some right-to-left segments (such as an English letter with a Hebrew address nested within it). The predominant direction is called the global orientation; it cannot always be quickly deduced from the general context.
FRED DOES NOT BELIEVE taht yas syawla i
This sentence has one meaning when read from left to right (Fred does not believe I always say that), and another meaning when read from right to left (I always say that Fred does not believe). In the first reading, the global orientation of the text is left-to-right; in the second reading it is right-to-left.
Because the global orientation is not always obvious from the context,[2] it must be known to the application developer whose product is processing the bi-directional data.
Note:
The global orientation of the text is not to be confused with the physical orientation of the presentation device. A display terminal has, for example, a right-to-left physical orientation if the first character on the screen is the one in the upper right-hand corner and the general cursor movement is from right to left (and top to bottom).
Text-types
In a bi-directional text a programmer must clearly distinguish between the physical order in which the text is presented, and the logical order in which its segments are processed (or pronounced if read aloud). Some segments may need to be reordered to a logical or physical order.
There are different approaches to how bi-directional text is to be reordered, and at present none can be said to be prevalent.
The concept of text-type indicates which approach applies to a specific text. The physical and logical order and the different text-types are discussed further below.
MY WIFE’S NAME IS ilin
The global orientation is left-to-right. The first letter in the text is M, followed by Y and so forth. In the physical order, after the letters I and S comes the letter i of the segment containing my wife’s name in Hebrew. Note, however, that my wife’s name is pronounced “nili”. In the logical order the first letter of the name segment is thus the letter n, followed by i, l and i.
Sometimes, for example in on-line help, it is convenient to store the bi-directional text exactly as presented, that is, in the physical order. But if the text is to be processed (for example, sorted), the segments must be stored in their logical order. In the above example, sorting on the string “ilin” is meaningless. It makes sense to reorder the text so that the directional segment containing the name is inverted to read “nili” before being stored for further processing. The logical order is the preferred sequence for entering and processing text. Conceptually, any storage device can be seen as storing the data from left to right. If a programmer wants to perform straightforward processing on the stored text (sorting, collating, indexing) without the need to preprocess each segment, the bi-directional data has to be stored in its logical sequence. This means reversing segments whose direction is opposite to the global orientation.
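The reordering from physical to logical order described above can be sketched as follows. The sketch uses the simulation convention of these examples (lower case for right-to-left script, upper case for English) rather than real Hebrew or Arabic characters. The names to_logical and invert_rtl_segment are hypothetical, and treating every run of digits as a nested left-to-right segment is a simplifying assumption.

```python
import re

# Simulation convention: lower case = right-to-left script,
# upper case and digits = left-to-right text.
RTL_SEGMENT = re.compile(r"[a-z]+(?:[^A-Za-z]+[a-z]+)*")
LTR_SEGMENT = re.compile(r"[A-Z0-9]+(?:[^A-Za-z0-9]+[A-Z0-9]+)*")


def _flip(match):
    return match.group(0)[::-1]


def invert_rtl_segment(segment):
    """Reverse a right-to-left segment, then restore nested left-to-right runs."""
    return LTR_SEGMENT.sub(_flip, segment[::-1])


def to_logical(physical, global_orientation):
    """Convert physically ordered (as presented) text to logical order."""
    if global_orientation == "rtl":
        # The whole text is one right-to-left body with nested LTR segments.
        return invert_rtl_segment(physical)
    # Left-to-right body: invert only the embedded right-to-left segments.
    return RTL_SEGMENT.sub(lambda m: invert_rtl_segment(m.group(0)), physical)


print(to_logical("MY WIFE'S NAME IS ilin", "ltr"))
print(to_logical("FRED DOES NOT BELIEVE taht yas syawla i", "ltr"))
print(to_logical("FRED DOES NOT BELIEVE taht yas syawla i", "rtl"))
```

The last two calls show why the global orientation has to be supplied by the caller: the same physical text yields two different logical sequences.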
Text-types and Reordering Techniques
Different text-types require different approaches to reordering:
Visual text-type
The oldest approach, dating from the time when there was no processing capability at the workstation, is simply to copy the entire screen to storage, and storage to screen (possibly inverting every row, depending on the physical orientation of the screen). It is up to each application programmer to know where the embedded segments are located and to process them accordingly. This text-type is called visual because it is a replication of the presented form. Many legacy applications[3] and their files have this type of text.
Implicit text-type
In the implicit text-type it is assumed that the letters of the Latin alphabet have a strong inherent left-to-right directionality, and those of the Arabic, Farsi, Urdu and Hebrew alphabets have a strong inherent right-to-left directionality. An implicit text-processing algorithm recognizes segments based on their inherent directional characteristics, and segment inversion is performed automatically. The concept of an implicit algorithm is simple to understand. Its main limitation is that it cannot correctly handle some strings that have numbers and intermixed left-to-right and right-to-left letters.
Explicit text-type
The explicit text-type assumes that there are additional control characters, embedded in the text, that instruct an explicit algorithm to perform segment inversions, shaping or numeral selections, and other transformations.
Thus, a text with visual text-type is stored in its physical order, and a text with an implicit text-type is stored in its logical order, which is better suited for automatic processing. A text with an explicit text-type is usually stored in logical order, but because of the embedded controls in the text, the automatic processing is not always straightforward.
There is no one type of text that can be used in all cases. The implicit techniques are usually heuristic and thus have some limitations as noted previously. The explicit techniques, while alleviating the limitations of implicit techniques, introduce other limitations such as the need for automatic processes to cope with embedded controls.
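The first step of an implicit technique, recognizing segments from the inherent directionality of their characters, might look like the sketch below. It assumes the text is held as Unicode and uses the bidirectional class that Python's standard unicodedata module reports for each character; the source does not prescribe an encoding, and the function names are hypothetical. Digits and spaces come out with no strong direction of their own, which illustrates why strings mixing numbers with both kinds of letters are the hard case.

```python
import unicodedata
from itertools import groupby


def inherent_direction(ch):
    """Classify one character by its Unicode bidirectional class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":            # strong left-to-right (for example Latin)
        return "ltr"
    if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
        return "rtl"
    if bidi_class in ("EN", "AN"):   # European and Arabic-Indic digits
        return "number"
    return "neutral"                 # spaces, punctuation and everything else


def directional_runs(text):
    """Group logically ordered text into runs of equal inherent direction."""
    return [(direction, "".join(chars))
            for direction, chars in groupby(text, key=inherent_direction)]


# "AB 12 " followed by the Hebrew letters alef and bet.
for direction, run in directional_runs("AB 12 \u05d0\u05d1"):
    print(direction, repr(run))
```

A complete implicit algorithm would go on to decide which neighbouring segment each number and neutral run belongs to; that resolution step is deliberately left out here.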
One specific technique, the Basic Display Algorithm,[4] tries to be a bridge between the implicit and explicit techniques. In principle it is an implicit reordering algorithm, but it can deal with a few specific directional controls embedded in the text.
There are applications and related databases for all three text-types. It is possible for bi-directional text that is presented one way to be stored in a different layout. A programmer need only know which text-type or reordering algorithm was used in order to transform or process the bi-directional text correctly.
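As an illustration of the explicit text-type and of the cost of embedded controls, the sketch below assumes that the controls are Unicode's directional formatting characters (the source does not say which control set is meant). The helper names embed and strip_controls are hypothetical. A process such as sorting or searching would typically need to remove or skip the controls first, which is the extra burden noted above.

```python
# Unicode directional formatting characters (an assumed control set).
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
DIRECTIONAL_CONTROLS = {LRM, RLM, LRE, RLE, PDF, LRO, RLO}


def embed(segment, direction):
    """Mark a segment explicitly as a nested left-to-right or right-to-left segment."""
    opener = RLE if direction == "rtl" else LRE
    return f"{opener}{segment}{PDF}"


def strip_controls(text):
    """Remove the directional controls before sorting, searching or indexing."""
    return "".join(ch for ch in text if ch not in DIRECTIONAL_CONTROLS)


# English text with an explicitly marked Hebrew segment (alef, bet, then a number).
marked = "CALL " + embed("\u05d0\u05d1 12", "rtl")
plain = strip_controls(marked)
print(len(marked), len(plain))  # the controls occupy storage positions
```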
Symmetrical Swapping
Some characters, such as the greater-than sign, have an implied directional meaning and have a complementary symmetric character with an opposite directional meaning (the less-than sign). When used within a segment that is presented right-to-left but is inverted (left-to-right) when stored for processing, such a character might have to be replaced by its symmetric sibling to ensure that the correct meaning of the text is preserved. The replacement of such a character by its complement during transformation of a bi-directional text is called symmetrical swapping.
Example of Symmetrical Swapping
In a right-to-left window on the screen, the expression b < a is read as “a is greater than b”. In storage the orientation is always left-to-right; the first character in storage is thus a, followed by < and then b. So the result in storage is a < b, which is of course incorrect. In this case, to preserve the correct meaning of the expression, the < character must be exchanged in storage with >.
Other graphic characters that require symmetrical swapping include the parentheses, square brackets, braces, and so on.
Although symmetrical swapping is a characteristic of bi-directional languages, it is not always mandatory for the software functions that transform different bi-directional-language text layouts. Sometimes this function is performed automatically by the workstation hardware or microcode.
Shaping
Shaping is the process by which characters are rendered in the appropriate presentation forms. This might involve the presentation of characters in a form different from the one in which they are stored. In general, to simplify processing, an unshaped (abstract or basic) representation is used internally. Shaping takes into account the character being shaped and the characters in its vicinity, and replaces its abstract representation (or that of its parts) with the proper shape. Shaping is a characteristic of many complex text languages, in particular the languages of the Middle East.
The Arabic scripts are cursive. A writing system is cursive if it is suited to handwriting rather than printing, with adjacent characters in a word connected to each other. Some letters can only connect to the letter on their right. This is the only way in which Arabic script is used, whether in books, newspapers, signs, or workstation displays. (English can be handwritten in a cursive style, for personal communications, but is seldom published or displayed that way. Thus English is not considered a cursive script.)
Shaping in Cursive Script Languages
In cursive scripts, letters might assume different shapes according to their position in the word and to the connectivity properties they and the adjacent letters have. There are as many as four shapes for each letter. As described in Shapes of the Arabic Characters, characters may have initial, middle, final, and isolated forms (not all characters have all forms). Only one shape per letter is represented on Arabic keyboards, but all shapes must be available for presentation. Similarly, in most cases, a cursive language text is not stored with full shapes. Each character has a base form, which is an abstraction to allow selection of a cursive character without specifying its shape.
The proper shape can be selected by a shape determination routine, which allows for automatic (algorithmic) selection of the appropriate shape according to the context as directed by the software or the user. It may allow for user or software-controlled selection of any of the four shapes mentioned above. Alternatively, it may allow transparent throughput of data: that is, it may become temporarily deactivated under software or user control. Whenever cursive-language characters are folded by processing to one shape, they must be reshaped using the same algorithm prior to presentation. In some very specific cases, this processing may corrupt data, as the algorithm may not be perfectly reversible. As an analogy, in English, converting 12Ab2 to upper case would result in 12AB2; the return to lower case would result in 12ab2, which is not the same as the original.
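A shape determination routine of the automatic kind can be sketched as below. The sketch assumes Unicode text and takes the shaped characters from the Arabic Presentation Forms-B block, reading them from Python's unicodedata module instead of hard-coding a table. Inferring a letter's connectivity from which forms exist for it (a letter without an initial form is taken not to connect to the following letter) is a simplifying assumption, and the function names are hypothetical. Unicode calls the middle form "medial". Folding back to base forms is shown with NFKC normalization.

```python
import unicodedata

FORMS = ("isolated", "initial", "medial", "final")


def build_shape_table():
    """Map each base letter to its available presentation forms."""
    table = {}
    for code_point in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Single-letter forms decompose as, for example, "<initial> 0628".
        if len(parts) == 2 and parts[0].strip("<>") in FORMS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[parts[0].strip("<>")] = chr(code_point)
    return table


SHAPES = build_shape_table()


def joins_forward(ch):
    # Assumption: only letters that have an initial form connect to the next letter.
    return "initial" in SHAPES.get(ch, {})


def joins_backward(ch):
    # Assumption: only letters that have a final form connect to the previous letter.
    return "final" in SHAPES.get(ch, {})


def shape(word):
    """Replace each base letter by the form that fits its neighbours."""
    shaped = []
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        to_prev = joins_backward(ch) and joins_forward(prev_ch)
        to_next = joins_forward(ch) and joins_backward(next_ch)
        if to_prev and to_next:
            form = "medial"
        elif to_prev:
            form = "final"
        elif to_next:
            form = "initial"
        else:
            form = "isolated"
        shaped.append(SHAPES.get(ch, {}).get(form, ch))
    return "".join(shaped)


stored = "\u0628\u0627\u0628"                     # beh, alef, beh in base forms
presented = shape(stored)
folded = unicodedata.normalize("NFKC", presented)  # back to base forms
print([f"U+{ord(ch):04X}" for ch in presented], folded == stored)
```

In this word the alef connects only to the letter before it, so the last beh takes its isolated form even though it ends the word.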
Though in most cases a cursive language text would be stored in basic shapes only, there are cases where it may be stored with characters shaped as presented, as in the case of messages or on-line help text.
Character Composition, Ligatures and Diacritics
In complex-text languages, it is possible that there is not a one-to-one correspondence between the number of characters of text stored for processing and the number of characters of the presented text. Sometimes two or more characters might be represented by a single glyph occupying one presentation cell:
Ligatures
In the cursive languages, ligatures use one glyph to represent two or more specific letters. For example, the ligature Lamalif represents the frequently used pair of letters Lam and Alif.
Diacritics
These are marks above, near, within or below a consonant. They are used in bi-directional languages, among other functions, to represent vowels. When kept in storage for processing, these marks occupy physical positions, but when presented, they might occupy the same cell as the associated consonants.
As a compromise, given existing limitations (in the graphical capabilities and resolution of the display devices and the number of code points available), bi-directional languages such as Hebrew have in many implementations given up the ability to represent vowels by diacritics. The vowel sounds have to be surmised by readers based on their knowledge of the language and according to the semantics of the text.
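The difference between the number of stored characters and the number of presentation cells can be illustrated with a short sketch. It assumes Unicode text, treats every character with a non-zero combining class as a diacritic that shares the cell of its consonant, and handles only the Lam and Alif ligature; the function names are hypothetical. The last function mimics the compromise described above by dropping the vowel marks altogether.

```python
import unicodedata

LAM_ALIF = "\u0644\u0627"       # the letters Lam and Alif as stored
LAM_ALIF_LIGATURE = "\ufefb"    # single glyph for the pair (isolated form)


def apply_ligatures(text):
    """Replace the Lam + Alif pair by its one-glyph ligature."""
    return text.replace(LAM_ALIF, LAM_ALIF_LIGATURE)


def presentation_cells(text):
    """Count cells: a ligature takes one cell, a diacritic takes none of its own."""
    return sum(1 for ch in apply_ligatures(text) if unicodedata.combining(ch) == 0)


def strip_diacritics(text):
    """Drop the diacritics, leaving the reader to surmise the vowels."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# A Hebrew word stored with vowel points: shin, shin dot, qamats,
# lamed, vav, holam, final mem.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
print(len(pointed), presentation_cells(pointed))    # stored characters vs. cells
print(len(LAM_ALIF), presentation_cells(LAM_ALIF))  # two letters, one glyph
```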