Xah Lee
2011-02-16 22:07:56 UTC
might be interesting.
〈Problems of Symbol Congestion in Computer Languages (ASCII Jam;
Unicode; Fortress)〉
http://xahlee.org/comp/comp_lang_unicode.html
--------------------------------------------------
Problems of Symbol Congestion in Computer Languages (ASCII Jam;
Unicode; Fortress)
Xah Lee, 2011-02-05, 2011-02-15
The vast majority of computer languages use ASCII as their character
set. This means they jam a multitude of operators into about 20
symbols. Often, a symbol has multiple meanings depending on context.
Also, a sequence of chars is used as a single symbol as a workaround
for the lack of symbols. Even languages that use Unicode as their char
set (e.g. Java, XML) often still use the ~20 ASCII symbols for all
their operators. The only exceptions I know of are Mathematica,
Fortress, and APL. This page gives some examples of problems created
by symbol congestion.
-------------------------------
Symbol Congestion Workarounds
--------------------
Multiple Meanings of a Symbol
Here are some common examples of a symbol that has multiple meanings
depending on context:
In Java, [ ] is used in array creation (e.g. new int[3]), also as a
delimiter for getting an element of an array, also as part of the
syntax for declaring an array type.
In Java and many other langs, ( ) is used for expression grouping,
also as delimiters for the arguments of a function call, also as
delimiters for the parameters of a function's declaration.
In Perl and many other langs, : is used as a separator in a ternary
expression e.g. (test ? "yes" : "no"), also as a namespace separator
(e.g. use Data::Dumper;).
In a URL, / is used as the path separator, but also as part of the
marker that follows the protocol (the // after the scheme), e.g.
http://example.org/comp/unicode.html
In Python and many others, < is used for the “less than” comparison
operator, but also as an alignment flag in its “format” method, also
as a delimiter of a named group in regex, and also as part of other
operators that are made of 2 or more chars, e.g.: << <= <<= <>.
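The Python case can be seen in a few lines. This is only a small
sketch, assuming Python 3; the variable names are made up for
illustration, and <> is left out because Python 3 no longer accepts
it.

```python
import re

# 1. "<" as the less-than comparison operator.
is_less = 3 < 5

# 2. "<" as the left-alignment flag in a format spec (pad to width 6).
padded = "{:<6}".format("ab")

# 3. "<" as a delimiter of a named group in a regular expression.
match = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2011-02")
year = match.group("year")

# 4. "<" as one character of several multi-character operators.
shifted = 1 << 4      # left shift
at_most = 5 <= 5      # less than or equal
counter = 1
counter <<= 2         # in-place left shift
```

The same character is read four different ways, and only the
surrounding context tells the reader (and the parser) which one is
meant.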
--------------------
Examples of Multi-Char Operators
Here are some common examples of operators that are made of multiple
characters: || && == <= != ** += *= := ++ -- :: // /* (* …
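One way to see that such character sequences act as single symbols is
to ask a language's own tokenizer. The sketch below uses Python's
standard tokenize module; the helper name operator_tokens and the
sample line are hypothetical, chosen only for illustration.

```python
import io
import tokenize

def operator_tokens(source):
    """Return the operator tokens found in a piece of Python source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]

# Each multi-character operator comes back as one token, not as
# separate "*", "/", "!", "<" and "=" characters.
ops = operator_tokens("a = b ** 2 // 3 != c <= d\n")
```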
-------------------------------
Fortress & Unicode
The language designer Guy Steele recently gave a very interesting
talk. See: Guy Steele on Parallel Programing. In it, he showed code
snippets of his language Fortress, which freely uses Unicode
characters as operators.
For example, list delimiters are not the typical curly brackets
{1,2,3} or square brackets [1,2,3], but the Unicode angle brackets
⟨1,2,3⟩. (See: Matching Brackets in Unicode.) It also uses the circled
plus ⊕ as an operator. (See: Math Symbols in Unicode.)
-------------------------------
Problems of Symbol Congestion
I really appreciate such use of Unicode. The tradition of sticking to
the 95 printable chars of 1960s ASCII is extremely limiting. It
creates complex problems manifested in:
* String Escape mechanism (C's backslash \n, \/, …, widely
adopted.)
* Complex delimiters for strings. (Python's triple quotes, Perl's
variable delimiters q() q[] q{} m//, and heredoc. See: Strings in Perl
and Python ◇ Heredoc mechanism in PHP and Perl.)
* Crazy leaning toothpicks syndrome, especially bad in emacs
regex.
* Complexities in character representation (See: Emacs's Key
Notations Explained (/r, ^M, C-m, RET, <return>, M-, meta) ◇ HTML
entities problems. See: HTML Entities, Ampersand, Unicode, Semantics.)
* URL Percent Encoding problems and complexities: Javascript
Encode URL, Escape String
All these problems occur because we are jamming so many meanings into
about 20 symbols in ASCII.
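Two of the items above can be shown in a few lines of Python. This is
a rough sketch only: the sample path and the sample string are made
up, and the percent-encoding uses the default settings of
urllib.parse.quote, which leave / unescaped.

```python
import re
from urllib.parse import quote, unquote

# Leaning toothpicks: to match ONE literal backslash, the regex engine
# needs two, and an ordinary string literal doubles each of them again.
plain_pattern = "\\\\"   # four characters typed, two in the string
raw_pattern = r"\\"      # a raw string removes one layer of escaping

text = "C:\\dir\\file"   # the text itself holds two backslashes
found = re.findall(plain_pattern, text)

# Percent-encoding: space, ? and = are in ASCII but already carry
# meaning in a URL, so they must be escaped just like the non-ASCII é.
encoded = quote("a b/c?d=é")
decoded = unquote(encoded)
```

Here encoded comes out as a%20b/c%3Fd%3D%C3%A9. In both cases the
escaping exists only because the character wanted as data is already
taken as syntax.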
See also:
* Computer Language Design: Strings Syntax
* HTML6: Your JSON and SXML Simplified
Most of today's languages do not support unicode in function or
variable names, so you can forget about using unicode in variable
names (e.g. α=3) or function names (e.g. “lambda” as “λ” or “function”
as “ƒ”), or defining your own operators (e.g. “⊕”).
However, there are a few languages I know of that do support Unicode
in function or variable names. Some of these allow you to define your
own operators, though they may not allow Unicode for the operator
symbol. See: Unicode Support in Ruby, Perl, Python, javascript, Java,
Emacs Lisp, Mathematica.
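As one illustration, here is roughly how this looks in Python 3, which
accepts Unicode letters in names but has no way to define a new
operator symbol. It is a sketch under that assumption; the helper
parses is hypothetical and only checks whether a snippet is
syntactically valid.

```python
import ast

α = 3                 # a Greek letter as a variable name

def ƒ(x):             # U+0192 as a function name
    return x + α

λ = ƒ                 # "λ" is an ordinary name here, not the lambda keyword

def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# A math symbol such as ⊕ is not a letter, so it can be neither a name
# nor an operator.
operator_ok = parses("1 ⊕ 2")
```

So names like α and ƒ work, but 1 ⊕ 2 is rejected as a syntax error.
The closest Python gets is overloading the existing ASCII operators.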
Xah