For the 1889 Universal Telegraphic Phrase-book, see Commercial code (communications).
Unicode is a computing industry standard for the con[…]
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units (4 × 8 bit) to handle each of the additional characters.
The first 256 code points were made identical to the content of ISO-8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encom[…]
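The code-unit counts given for UTF-8 and UTF-16, and the ISO-8859-1 correspondence of the first 256 code points, can be checked with a short Python sketch. The sample characters and the helper names `utf8_len` and `utf16_units` are illustrative choices, not part of the standard.

```python
# Illustrative sample: ASCII, Latin-1 range, another BMP character,
# and a supplementary-plane character (outside what UCS-2 could encode).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 uses for a single character."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-le")) // 2

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} UTF-8 byte(s), "
          f"{utf16_units(ch)} UTF-16 unit(s)")

# Every ISO-8859-1 byte value decodes to the code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = [ord(c) for c in latin1_text] == list(range(256))
print("ISO-8859-1 bytes equal code points:", latin1_matches)
```

The supplementary-plane sample is the only one that needs two UTF-16 units, which is the case UCS-2 cannot represent.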
1.1 History
1.3 Unicode Consortium
1.2.2 Character General Category
shape or represent a visible space. As of Unicode 7.0
there are 112,804 graphic characters.
Format characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. For example, U+200C zero-width non-joiner and U+200D zero-width joiner may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). There are 152 format characters in Unicode 7.0.
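As a minimal sketch, Python's standard `unicodedata` module reports the General Category of a character, and format characters carry the category `Cf`. The helper name `is_format_char` is hypothetical, and the results reflect the Unicode version bundled with the interpreter rather than Unicode 7.0 specifically.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner
ZWJ = "\u200d"   # zero-width joiner

def is_format_char(ch: str) -> bool:
    """True if the character's General Category is Cf (format)."""
    return unicodedata.category(ch) == "Cf"

for ch in (ZWNJ, ZWJ, "a"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category {unicodedata.category(ch)}")

# A format character has no glyph of its own, but it is still a real
# code point in the string: the two strings below compare unequal.
plain = "fi"
with_zwnj = "f" + ZWNJ + "i"
print(len(plain), len(with_zwnj), plain == with_zwnj)
```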
1.4 Versions
2.1
UTF-EBCDIC: an 8-bit variable-width encoding similar to UTF-8, but designed for compatibility with EBCDIC (not part of The Unicode Standard)
UTF-16: a 16-bit, variable-width encoding
UTF-32: a 32-bit, fixed-width encoding
UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is used by FreeBSD and most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6.
GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.
The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or byte endianness detection). The BOM, code point […]
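Two of the mechanisms mentioned here, byte order detection from a leading BOM and Punycode for domain names, can be sketched with Python's standard codecs. The function name `detect_utf16_byte_order` and the sample label are assumptions made for illustration. Python's built-in `idna` codec implements the older IDNA 2003 rules, which is enough to show the idea.

```python
import codecs
from typing import Optional

def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order from a leading byte order mark, if present.

    Caveat: the UTF-32 little-endian BOM begins with the same two bytes as
    the UTF-16 little-endian BOM, so real detectors check UTF-32 first.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return None  # no BOM: the byte order has to be known some other way

big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(detect_utf16_byte_order(big), detect_utf16_byte_order(little))

# Punycode maps a Unicode label onto the ASCII subset that DNS allows;
# IDNA adds the "xn--" prefix to each converted label.
label = "b\u00fccher"  # hypothetical sample label
punycode_label = label.encode("punycode")
idna_name = (label + ".example").encode("idna")
print(punycode_label, idna_name)
```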
2.2
Many scripts, including Arabic and Devanagari, have special orthographic rules that require certain combinations
of letterforms to be combined into special ligature forms.
The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such
as ACE (Arabic Calligraphic Engine by DecoType in the
1980s and used to generate all the Arabic examples in
the printed editions of the Unicode Standard), which became the proof of concept for OpenType (by Adobe and
Microsoft), Graphite (by SIL International), or AAT (by
Apple).
Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail.
Unicode in use
3.1 Operating systems
characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. For email transmission of Unicode the UTF-8 character set and the Base64 or the Quoted-printable transfer encoding are recommended, depending on whether much of the message consists of ASCII characters. The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software.
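The trade-off between the two transfer encodings can be illustrated with Python's standard `base64` and `quopri` modules. The helper `choose_transfer_encoding` and the sample strings are hypothetical: it simply picks whichever encoding of the UTF-8 bytes is shorter, which is one plausible reading of "depending on whether much of the message consists of ASCII characters", not a rule taken from the MIME standards.

```python
import base64
import quopri

def choose_transfer_encoding(text: str) -> str:
    """Pick the shorter of Base64 and Quoted-printable for UTF-8 text."""
    raw = text.encode("utf-8")
    b64 = base64.b64encode(raw)    # always about 4/3 of the input size
    qp = quopri.encodestring(raw)  # 1 byte per ASCII char, 3 per other byte
    return "quoted-printable" if len(qp) <= len(b64) else "base64"

mostly_ascii = "Meeting at the caf\u00e9 tomorrow"
mostly_non_ascii = "\ub9d0" * 4  # four Hangul syllables, 3 UTF-8 bytes each

print(choose_transfer_encoding(mostly_ascii))
print(choose_transfer_encoding(mostly_non_ascii))

# Both encodings are reversible, so the original bytes survive transport.
raw = mostly_ascii.encode("utf-8")
round_trip_ok = (
    base64.b64decode(base64.b64encode(raw)) == raw
    and quopri.decodestring(quopri.encodestring(raw)) == raw
)
print(round_trip_ok)
```

Quoted-printable leaves ASCII text readable in the raw message, while Base64 is more compact once most bytes are non-ASCII.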
The adoption of Unicode in email has been very slow. Some East-Asian text is still encoded in encodings such as ISO-2022, and some devices, such as mobile phones, still cannot handle Unicode data correctly. Support has been improving, however. Many major free mail providers such as Yahoo, Google (Gmail), and Microsoft (Outlook.com) support it.
3.2 Input methods
3.3
3.4 Web
Main article: Unicode and HTML
Although syntax rules may affect the order in which characters are allowed to appear, XML (including XHTML) documents, by definition,[43] comprise characters from most of the Unicode code points, with the exception of:
most of the C0 control codes
the permanently unassigned code points D800–DFFF
FFFE or FFFF
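The three exclusions listed above can be expressed as a range check. The sketch below follows the `Char` production of XML 1.0, which is an assumption on my part: the text cites XML 1.1, whose rules for control characters differ in detail. The function name `is_xml10_char` is hypothetical.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point may appear in an XML 1.0 document.

    Mirrors the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)         # the only C0 controls allowed
        or 0x20 <= cp <= 0xD7FF       # stops just before the surrogates
        or 0xE000 <= cp <= 0xFFFD     # skips D800-DFFF, stops before FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF  # supplementary planes
    )

for cp in (0x0, 0x9, 0x41, 0xD800, 0xFFFE, 0x1F600):
    print(f"U+{cp:04X}: {is_xml10_char(cp)}")
```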
HTML characters manifest either directly as bytes according to the document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references &#916;, &#1049;, &#1511;, &#1605;, &#3671;, &#12354;, &#21494;, &#33865;, and &#47568; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
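A numeric character reference is just the character's code point written in decimal or hexadecimal. As a small sketch, the hypothetical helper `to_numeric_reference` builds one, and Python's standard `html.unescape` resolves it back the way a browser would.

```python
import html

def to_numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Write one character as an HTML numeric character reference."""
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"

delta = "\u0394"  # GREEK CAPITAL LETTER DELTA
print(to_numeric_reference(delta))                    # decimal form
print(to_numeric_reference(delta, hexadecimal=True))  # hexadecimal form

# Decoding works the same for both forms.
decoded = html.unescape("&#916; &#x394; &#47568;")
print(decoded == "\u0394 \u0394 \ub9d0")
```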
3.5 Fonts
3.6 New lines
Unicode partially addresses the new line problem that occurs when trying to read a text file on different platforms. Unicode defines a large number of characters that conforming applications should recognize as line terminators.
In terms of the new line, Unicode introduced U+2028 line separator and U+2029 paragraph separator. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line ending characters. However, a common approach to solving this issue is through new line normalization. This is achieved with the Cocoa text system in Mac OS X and also with W3C XML and HTML recommendations. In this approach every possible new line character is converted internally to a common new line (which one does not really matter since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a new line, regardless of the input's actual encoding.
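New line normalization can be sketched in a few lines of Python, whose `str.splitlines` already treats the Unicode line and paragraph separators as line boundaries. The function name `normalize_newlines` is hypothetical. Note that `splitlines` also splits on a few other control characters (vertical tab, form feed, and the file, group and record separators), which may or may not be wanted in a given application.

```python
def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Convert every line terminator that str.splitlines recognizes
    into a single common one.

    A trailing terminator at the very end of the text is dropped.
    """
    return newline.join(text.splitlines())

# CR LF, LINE SEPARATOR, PARAGRAPH SEPARATOR, bare CR and NEL in one string.
mixed = "a\r\nb\u2028c\u2029d\re\x85f"
normalized = normalize_newlines(mixed)
print(repr(normalized))
```

Which common terminator is chosen does not matter, as the text notes, since the conversion is internal.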
Issues
Unicode was designed to provide code-point-by-code-point round-trip format conversion to and from any pre[…]
4.4 Combining characters
4.3 Indic scripts
Indic scripts such as Tamil and Devanagari are each allocated only 128 code points, matching the ISCII standard. The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. Some local scholars argued in favor of assignments of Unicode codepoints to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.[48][49][50] Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for Tibetan script (the Chinese National Standard organization failed to achieve a similar change).
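The points made about Indic scripts, a 128-code-point allocation per script and conjuncts stored as component characters in logical order, can be inspected with Python's `unicodedata` module. The block ranges used below (U+0900 to U+097F for Devanagari, U+0B80 to U+0BFF for Tamil) are the standard block boundaries. The helper `describe` and the two sample strings are illustrative assumptions.

```python
import unicodedata

DEVANAGARI_BLOCK = range(0x0900, 0x0980)  # U+0900..U+097F
TAMIL_BLOCK = range(0x0B80, 0x0C00)       # U+0B80..U+0BFF
print(len(DEVANAGARI_BLOCK), len(TAMIL_BLOCK))

def describe(text: str) -> list:
    """List (code point, character name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# A conjunct has no code point of its own: it is stored as
# consonant + virama + consonant and joined by the font at render time.
conjunct = "\u0915\u094d\u0937"
for entry in describe(conjunct):
    print(entry)

# Logical versus visual order: VOWEL SIGN I is stored after the consonant
# but is drawn to its left when rendered.
ki = "\u0915\u093f"
for entry in describe(ki):
    print(entry)
```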
See also
Unicode symbols
Notes
[1] The number of characters listed for each version of Unicode is the total number of graphic, format and control characters (i.e., excluding private-use characters, noncharacters and surrogate code points).
Footnotes
[37] The Unicode Standard, Version 6.2. The Unicode Consortium. 2013. p. 561. ISBN 978-1-936213-08-5.
[39] Multilingual European Character Set 2 (MES-2) Rationale, Markus Kuhn, 1998
[43] "Extensible Markup Language (XML) 1.1 (Second Edition)". Retrieved 2013-11-01.
[45] The secret life of Unicode: A peek at Unicode's soft underbelly, Suzanne Topping, 1 May 2001 (Internet Archive)
[46] AFII contribution about WAVE DASH, Unicode vendor-specific character table for Japanese
[47] ISO 646-* Problem, Section 4.4.3.5 of Introduction to I18n, Tomohiro KUBOTA, 2001
[48] Arabic Presentation Forms-A. Retrieved 2010-03-20.
[49] Arabic Presentation Forms-B. Retrieved 2010-03-20.
References
The Unicode Standard, Version 3.0, The Unicode Consortium, Addison-Wesley Longman, Inc., April 2000. ISBN 0-201-61633-5
The Unicode Standard, Version 4.0, The Unicode Consortium, Addison-Wesley Professional, 27 August 2003. ISBN 0-321-18578-1
The Unicode Standard, Version 5.0, Fifth Edition, The Unicode Consortium, Addison-Wesley Professional, 27 October 2006. ISBN 0-321-48091-0
Julie D. Allen. The Unicode Standard, Version 6.0, The Unicode Consortium, Mountain View, 2011. ISBN 9781936213016
The Complete Manual of Typography, James Felici, Adobe Press; 1st edition, 2002. ISBN 0-321-12730-7
Unicode: A Primer, Tony Graham, M&T books, 2000. ISBN 0-7645-4625-2
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, Richard Gillam, Addison-Wesley Professional; 1st edition, 2002. ISBN 0-201-70052-2
Unicode Explained, Jukka K. Korpela, O'Reilly; 1st edition, 2006. ISBN 0-596-10121-X
External links
The Unicode Consortium
Unicode Standard, the latest version of the Unicode standard
Character Code Charts By Script for the latest version of the Unicode standard
Alan Wood's Unicode Resources: contains lists of word processors with Unicode capability; fonts and characters are grouped by type; characters are presented in lists, not grids.
Unicode Source: http://en.wikipedia.org/wiki/Unicode?oldid=646382217