
Extensible Markup Language (XML) is a markup language that defines a set of
rules for encoding documents in a format that is both human-readable and
machine-readable. It is defined by the W3C's XML 1.0 Specification[2] and by
several other related specifications,[3] all of which are free open
standards.[4]
The design goals of XML emphasize simplicity, generality and usability across
the Internet.[5] It is a textual data format with strong support via Unicode
for different human languages. Although the design of XML focuses on
documents, it is widely used for the representation of arbitrary data
structures[6] such as those used in web services.
Several schema systems exist to aid in the definition of XML-based
languages, while many application programming interfaces (APIs) have been
developed to aid the processing of XML data.
Key terminology
The material in this section is based on the XML Specification. This is not
an exhaustive list of all the constructs that appear in XML; it provides an
introduction to the key constructs most often encountered in day-to-day use.
(Unicode) character
By definition, an XML document is a string of characters. Almost every legal
Unicode character may appear in an XML document.
Processor and application
The processor analyzes the markup and passes structured information to an
application. The specification places requirements on what an XML processor
must do and not do, but the application is outside its scope. The processor
(as the specification calls it) is often referred to colloquially as an XML
parser.
Markup and content
The characters making up an XML document are divided into markup and
content, which may be distinguished by the application of simple syntactic
rules. Generally, strings that constitute markup either begin with the
character < and end with a >, or they begin with the character & and end
with a ;. Strings of characters that are not markup are content. However, in
a CDATA section, the delimiters <![CDATA[ and ]]> are classified as markup,
while the text between them is classified as content. In addition, whitespace
before and after the outermost element is classified as markup.
Tag
A markup construct that begins with < and ends with >. Tags come in three
flavors:
- start-tags; for example: <section>
- end-tags; for example: </section>
- empty-element tags; for example: <line-break />
Element
A logical document component which either begins with a start-tag and ends
with a matching end-tag or consists only of an empty-element tag. The
characters between the start- and end-tags, if any, are the element's
content, and may contain markup, including other elements, which are called
child elements. An example of an element is <Greeting>Hello, world.</Greeting>
(see hello world). Another is <line-break />.
Attribute
A markup construct consisting of a name/value pair that exists within a
start-tag or empty-element tag. In the example (below) the element img has
two attributes, src and alt:
<img src="madonna.jpg" alt='Foligno Madonna, by Raphael' />
Another example would be
<step number="3">Connect A to B.</step>
where the name of the attribute is "number" and the value is "3".
An XML attribute can only have a single value and each attribute can appear
at most once on each element. In the common situation where a list of
multiple values is desired, this must be done by encoding the list into a
well-formed XML attribute[note 1] with some format beyond what XML defines
itself. Usually this is either a comma or semi-colon delimited list or, if
the individual values are known not to contain spaces,[note 2] a
space-delimited list can be used.
<div class="inner greeting-box">Hello!</div>
where the attribute "class" has both the value "inner greeting-box" and also
indicates the two CSS class names "inner" and "greeting-box".
XML declaration
XML documents may begin by declaring some information about themselves,
as in the following example:
<?xml version="1.0" encoding="UTF-8"?>
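The terms above can be made concrete with a short sketch. It is only an illustration, assuming Python's standard xml.etree.ElementTree module as the XML processor; the sample document and the variable names are hypothetical, reusing the step example from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: an XML declaration, a root element,
# one element with an attribute and content, and one empty-element tag.
DOC = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<steps><step number="3">Connect A to B.</step><line-break /></steps>'
)

root = ET.fromstring(DOC)            # the processor turns markup into a tree
step = root.find("step")             # first child element named "step"

tag = step.tag                       # element name taken from the start-tag
number = step.get("number")          # value of the "number" attribute
content = step.text                  # character content between the tags
children = [child.tag for child in root]  # child elements of the root

print(tag, number, content, children)
```

The application (the code after ET.fromstring) only ever sees the structured information the processor hands over, not the raw markup.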
Characters and escaping
XML documents consist entirely of characters from the Unicode repertoire.
Except for a small number of specifically excluded control characters, any
character defined by Unicode may appear within the content of an XML
document.
XML includes facilities for identifying the encoding of the Unicode
characters that make up the document, and for expressing characters that,
for one reason or another, cannot be used directly.
Valid characters
Main article: Valid characters in XML
Unicode code points in the following ranges are valid in XML 1.0
documents:[10]
- U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML
  1.0;
- U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters
  in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
- U+10000–U+10FFFF: this includes all code points in supplementary planes,
  including non-characters.
XML 1.1[11] extends the set of allowed characters to include all the above,
plus the remaining characters in the range U+0001–U+001F. At the same
time, however, it restricts the use of C0 and C1 control characters other
than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in
escaped form (for example U+0001 must be written as &#x01; or its
equivalent). In the case of C1 characters, this restriction is a backwards
incompatibility; it was introduced to allow common encoding errors to be
detected.
The code point U+0000 is the only character that is not permitted in any XML
1.0 or 1.1 document.
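The XML 1.0 ranges listed above translate directly into a range check. The following is a minimal sketch; the function names is_xml10_char and first_invalid_char are hypothetical helpers, not part of any library, and the XML 1.1 rules are deliberately not covered.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp falls in the XML 1.0 ranges listed above."""
    return (
        cp in (0x9, 0xA, 0xD)          # the only C0 controls accepted
        or 0x20 <= cp <= 0xD7FF        # BMP up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_invalid_char(text: str) -> int:
    """Return the index of the first character not allowed in XML 1.0, or -1."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return -1
```

A check like this is typically run before serializing untrusted text, because a single forbidden control character makes the whole document not well-formed.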
Encoding detection
The Unicode character set can be encoded into bytes for storage or
transmission in a variety of different ways, called "encodings". Unicode
itself defines encodings that cover the entire repertoire; well-known ones
include UTF-8 and UTF-16.[12] There are many other text encodings that
predate Unicode, such as ASCII and ISO/IEC 8859; their character repertoires
in almost every case are subsets of the Unicode character set.
XML allows the use of any of the Unicode-defined encodings, and any other
encodings whose characters also appear in Unicode. XML also provides a
mechanism whereby an XML processor can reliably, without any prior
knowledge, determine which encoding is being used.[13] Encodings other
than UTF-8 and UTF-16 will not necessarily be recognized by every XML
parser.
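The effect of encoding detection can be observed by handing a parser raw bytes rather than text. The sketch below assumes Python's xml.etree.ElementTree (built on the Expat parser) and a hypothetical one-element document; it shows the same content recovered from two different byte encodings without telling the parser which one is in use.

```python
import xml.etree.ElementTree as ET

word = "caf\u00e9"  # "café", which contains a non-ASCII character

# The same logical document serialized in two encodings. The "utf-16" codec
# writes a byte order mark, which the parser uses for detection.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>%s</word>' % word
).encode("iso-8859-1")
utf16_doc = (
    '<?xml version="1.0" encoding="UTF-16"?><word>%s</word>' % word
).encode("utf-16")

# No encoding argument is passed: the processor works it out from the bytes.
latin1_text = ET.fromstring(latin1_doc).text
utf16_text = ET.fromstring(utf16_doc).text

print(latin1_text, utf16_text)
```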
Escaping
XML provides escape facilities for including characters which are problematic
to include directly. For example:
- The characters "<" and "&" are key syntax markers and may never appear in
  content outside a CDATA section.[14]
- Some character encodings support only a subset of Unicode. For example, it
  is legal to encode an XML document in ASCII, but ASCII lacks code points
  for Unicode characters such as "é".
- It might not be possible to type the character on the author's machine.
- Some characters have glyphs that cannot be visually distinguished from
  other characters: examples are
  non-breaking space (&#xa0;) " "
  compare space (&#x20;) " "
  Cyrillic Capital Letter A (&#x410;) "А"
  compare Latin Capital Letter A (&#x41;) "A"
There are five predefined entities:
- &lt; represents "<"
- &gt; represents ">"
- &amp; represents "&"
- &apos; represents '
- &quot; represents "
All permitted Unicode characters may be represented with a numeric
character reference. Consider the Chinese character "中", whose numeric code
in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard
offers no method for entering this character could still insert it in an XML
document encoded either as &#20013; or &#x4e2d;. Similarly, the string
"I <3 Jörg" could be encoded for inclusion in an XML document as
"I &lt;3 J&#xF6;rg".
"&#0;" is not permitted, however, because the null character is one of the
control characters excluded from XML, even when using a numeric character
reference.[15] An alternative encoding mechanism such as Base64 is needed to
represent such characters.
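The escaping rules above can be exercised with a few lines of code. This is a sketch assuming Python's standard library: xml.etree.ElementTree resolves the references on input, and xml.sax.saxutils.escape produces the predefined entities on output. The element name c and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Decimal and hexadecimal character references name the same character.
decimal_ref = ET.fromstring("<c>&#20013;</c>").text
hex_ref = ET.fromstring("<c>&#x4e2d;</c>").text

# A predefined entity and a character reference inside ordinary content.
greeting = ET.fromstring("<c>I &lt;3 J&#xF6;rg</c>").text

# Going the other way: escape() replaces "&", "<" and ">" in raw text.
escaped = escape("I <3 Jörg & friends")

# A reference to the null character is rejected even in escaped form.
try:
    ET.fromstring("<c>&#0;</c>")
    null_rejected = False
except ET.ParseError:
    null_rejected = True

print(decimal_ref, hex_ref, greeting, escaped, null_rejected)
```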
Comments
Comments may appear anywhere in a document outside other markup. Comments
cannot appear before the XML declaration. Comments start with "<!--" and end
with "-->". For compatibility with SGML, the string "--" (double-hyphen) is
not allowed inside comments;[16] this means comments cannot be nested. The
ampersand has no special significance within comments, so entity and character
references are not recognized as such, and there is no way to represent
characters outside the character set of the document encoding.
An example of a valid comment: "<!--no need to escape <code> & such in
comments-->"
International use
XML 1.0 (Fifth Edition) and XML 1.1 support the direct use of almost any
Unicode character in element names, attributes, comments, character data, and
processing instructions (other than the ones that have special symbolic
meaning in XML itself, such as the less-than sign, "<"). The following is a
well-formed XML document including Chinese, Armenian and Cyrillic characters:
<?xml version="1.0" encoding="UTF-8"?>
<俄语 լեզու="ռուսերեն">данные</俄语>
Well-formedness and error-handling
Main article: Well-formed document
The XML specification defines an XML document as a well-formed text, meaning
that it satisfies a list of syntax rules provided in the specification. Some
key points in the fairly lengthy list include:
- The document contains only properly encoded legal Unicode characters
- None of the special syntax characters such as < and & appear except when
  performing their markup-delineation roles
- The begin, end, and empty-element tags that delimit the elements are
  correctly nested, with none missing and none overlapping
- The element tags are case-sensitive; the beginning and end tags must match
  exactly. Tag names cannot contain any of the characters
  !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with
  -, ., or a numeric digit.
- A single "root" element contains all the other elements
The definition of an XML document excludes texts that contain violations of
well-formedness rules; they are simply not XML. An XML processor that
encounters such a violation is required to report such errors and to cease
normal processing. This policy, occasionally referred to as "draconian error
handling," stands in notable contrast to the behavior of programs that process
HTML, which are designed to produce a reasonable result even in the presence
of severe markup errors.[17] XML's policy in this area has been criticized as
a violation of Postel's law ("Be conservative in what you send; be liberal in
what you accept").[18]
The XML specification defines a valid XML document as a well-formed XML
document which also conforms to the rules of a Document Type Definition (DTD).
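The "draconian" behavior is easy to observe with any conforming parser. The sketch below assumes Python's xml.etree.ElementTree, which is non-validating and so checks well-formedness only; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The processor reports the error and stops; no partial tree is built.
        return False


samples = {
    "nested correctly": "<a><b/></a>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<a></A>",
    "two root elements": "<a/><b/>",
    "bare less-than sign": "<a>1 < 2</a>",
}
results = {label: is_well_formed(doc) for label, doc in samples.items()}
print(results)
```

Only the first sample is accepted; each of the others violates one of the rules listed above.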
Schemas and validation
In addition to being well-formed, an XML document may be valid. This means
that it contains a reference to a Document Type Definition (DTD), and that its
elements and attributes are declared in that DTD and follow the grammatical
rules for them that the DTD specifies.
XML processors are classified as validating or non-validating depending on
whether or not they check XML documents for validity. A processor that
discovers a validity error must be able to report it, but may continue normal
processing.
A DTD is an example of a schema or grammar. Since the initial publication of
XML 1.0, there has been substantial work in the area of schema languages for
XML. Such schema languages typically constrain the set of elements that may be
used in a document, which attributes may be applied to them, the order in
which they may appear, and the allowable parent/child relationships.
Document Type Definition
Main article: Document Type Definition
The oldest schema language for XML is the Document Type Definition (DTD),
inherited from SGML.
DTDs have the following benefits:
- DTD support is ubiquitous due to its inclusion in the XML 1.0 standard.
- DTDs are terse compared to element-based schema languages and consequently
  present more information in a single screen.
- DTDs allow the declaration of standard public entity sets for publishing
  characters.
- DTDs define a document type rather than the types used by a namespace, thus
  grouping all constraints for a document in a single collection.
DTDs have the following limitations:
- They have no explicit support for newer features of XML, most importantly
  namespaces.
- They lack expressiveness. XML DTDs are simpler than SGML DTDs and there are
  certain structures that cannot be expressed with regular grammars. DTDs only
  support rudimentary datatypes.
- They lack readability. DTD designers typically make heavy use of parameter
  entities (which behave essentially as textual macros), which make it easier
  to define complex grammars, but at the expense of clarity.
- They use a syntax based on regular expression syntax, inherited from SGML,
  to describe the schema. Typical XML APIs such as SAX do not attempt to offer
  applications a structured representation of the syntax, so it is less
  accessible to programmers than an element-based syntax may be.
Two peculiar features that distinguish DTDs from other schema types are the
syntactic support for embedding a DTD within XML documents and for defining
entities, which are arbitrary fragments of text and/or markup that the XML
processor inserts in the DTD itself and in the XML document wherever they are
referenced, like character escapes.
DTD technology is still used in many applications because of its ubiquity.
XML Schema
Main article: XML Schema (W3C)
A newer schema language, described by the W3C as the successor of DTDs, is XML
Schema, often referred to by the initialism for XML Schema instances, XSD (XML
Schema Definition). XSDs are far more powerful than DTDs in describing XML
languages. They use a rich datatyping system and allow for more detailed
constraints on an XML document's logical structure. XSDs also use an XML-based
format, which makes it possible to use ordinary XML tools to help process
them.
xs:schema element that defines a schema:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
</xs:schema>
RELAX NG
RELAX NG was initially specified by OASIS and is now also an ISO/IEC
International Standard (as part of DSDL). RELAX NG schemas may be written in
either an XML-based syntax or a more compact non-XML syntax; the two syntaxes
are isomorphic, and James Clark's conversion tool, Trang, can convert
between them without loss of information. RELAX NG has a simpler definition
and validation framework than XML Schema, making it easier to use and
implement. It also has the ability to use datatype framework plug-ins; a RELAX
NG schema author, for example, can require values in an XML document to
conform to definitions in XML Schema Datatypes.
Schematron
Schematron is a language for making assertions about the presence or absence
of patterns in an XML document. It typically uses XPath expressions.
ISO DSDL and other schema languages
The ISO DSDL (Document Schema Description Languages) standard brings together
a comprehensive set of small schema languages, each targeted at specific
problems. DSDL includes RELAX NG full and compact syntax, Schematron assertion
language, and languages for defining datatypes, character repertoire
constraints, renaming and entity expansion, and namespace-based routing of
document fragments to different validators. DSDL schema languages do not have
the vendor support of XML Schemas yet, and are to some extent a grassroots
reaction of industrial publishers to the lack of utility of XML Schemas for
publishing.
Some schema languages not only describe the structure of a particular XML
format but also offer limited facilities to influence processing of individual
XML files that conform to this format. DTDs and XSDs both have this ability;
they can for instance provide the infoset augmentation facility and attribute
defaults. RELAX NG and Schematron intentionally do not provide these.
Related specifications
A cluster of specifications closely related to XML have been developed,
starting soon after the initial publication of XML 1.0. It is frequently the
case that the term "XML" is used to refer to XML together with one or more of
these other technologies which have come to be seen as part of the XML core.
- XML Namespaces enable the same document to contain XML elements and
  attributes taken from different vocabularies, without any naming collisions
  occurring. Although XML Namespaces are not part of the XML specification
  itself, virtually all XML software also supports XML Namespaces.
- XML Base defines the xml:base attribute, which may be used to set the base
  for resolution of relative URI references within the scope of a single XML
  element.
- The XML Information Set or XML infoset describes an abstract data model for
  XML documents in terms of information items. The infoset is commonly used in
  the specifications of XML languages, for convenience in describing
  constraints on the XML constructs those languages allow.
- xml:id Version 1.0 asserts that an attribute named xml:id functions as an
  "ID attribute" in the sense used in a DTD.
- XPath defines a syntax named XPath expressions which identifies one or more
  of the internal components (elements, attributes, and so on) included in an
  XML document. XPath is widely used in other core-XML specifications and in
  programming libraries for accessing XML-encoded data.
- XSLT is a language with an XML-based syntax that is used to transform XML
  documents into other XML documents, HTML, or other, unstructured formats
  such as plain text or RTF. XSLT is very tightly coupled with XPath, which it
  uses to address components of the input XML document, mainly elements and
  attributes.
- XSL Formatting Objects, or XSL-FO, is a markup language for XML document
  formatting which is most often used to generate PDFs.
- XQuery is an XML-oriented query language strongly rooted in XPath and XML
  Schema. It provides methods to access, manipulate and return XML, and is
  mainly conceived as a query language for XML databases.
- XML Signature defines syntax and processing rules for creating digital
  signatures on XML content.
- XML Encryption defines syntax and processing rules for encrypting XML
  content.
Some other specifications conceived as part of the "XML Core" have failed to
find wide adoption, including XInclude, XLink, and XPointer.
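Two of the specifications above, XML Namespaces and XPath, can be illustrated together. The sketch assumes Python's xml.etree.ElementTree, which implements only a limited subset of XPath; the namespace URIs, element names and the prefix map NS are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two vocabularies: a default namespace for the
# order itself and a second namespace for an extension attribute.
DOC = """<order xmlns="urn:example:shop" xmlns:x="urn:example:extra">
  <item sku="a1"><name>Pen</name></item>
  <item sku="b2" x:gift="yes"><name>Ink</name></item>
</order>"""

# Prefixes used in the path expressions; they need not match the document's.
NS = {"s": "urn:example:shop", "x": "urn:example:extra"}

root = ET.fromstring(DOC)

# Path expression selecting every name element under every item.
names = [n.text for n in root.findall("./s:item/s:name", NS)]

# Predicate on an attribute value selects a single item.
b2_name = root.find("./s:item[@sku='b2']/s:name", NS).text

# Namespaced attributes are stored under their expanded {uri}local name.
gift = root.find("./s:item[@sku='b2']", NS).get("{urn:example:extra}gift")

print(names, b2_name, gift)
```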
Programming interfaces
The design goals of XML include, "It shall be easy to write programs which
process XML documents."[5] Despite this, the XML specification contains almost
no information about how programmers might go about doing such processing. The
XML Infoset specification provides a vocabulary to refer to the constructs
within an XML document, but also does not provide any guidance on how to
access this information. A variety of APIs for accessing XML have been
developed and used, and some have been standardized.
Existing APIs for XML processing tend to fall into these categories:
- Stream-oriented APIs accessible from a programming language, for example SAX
  and StAX.
- Tree-traversal APIs accessible from a programming language, for example DOM.
- XML data binding, which provides an automated translation between an XML
  document and programming-language objects.
- Declarative transformation languages such as XSLT and XQuery.
Stream-oriented facilities require less memory and, for certain tasks which
are based on a linear traversal of an XML document, are faster and simpler
than other alternatives. Tree-traversal and data-binding APIs typically
require the use of much more memory, but are often found more convenient for
use by programmers; some include declarative retrieval of document components
via the use of XPath expressions.
XSLT is designed for declarative description of XML document transformations,
and has been widely implemented both in server-side packages and Web browsers.
XQuery overlaps XSLT in its functionality, but is designed more for searching
of large XML databases.
Simple API for XML
Simple API for XML (SAX) is a lexical, event-driven interface in which a
document is read serially and its contents are reported as callbacks to
various methods on a handler object of the user's design. SAX is fast and
efficient to implement, but difficult to use for extracting information at
random from the XML, since it tends to burden the application author with
keeping track of what part of the document is being processed. It is better
suited to situations in which certain types of information are always handled
the same way, no matter where they occur in the document.
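The callback style described above looks roughly as follows. This is a sketch assuming Python's xml.sax module; the handler class StepCollector and the sample document are hypothetical. Note how the handler itself has to remember whether it is currently inside a step element.

```python
import xml.sax


class StepCollector(xml.sax.ContentHandler):
    """Collects (number, text) pairs from <step> elements as events arrive."""

    def __init__(self):
        super().__init__()
        self.steps = []
        self._in_step = False   # state the application must track itself
        self._number = None
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "step":
            self._in_step = True
            self._number = attrs.get("number")
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so it is accumulated.
        if self._in_step:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "step":
            self.steps.append((self._number, "".join(self._buffer)))
            self._in_step = False


DOC = (
    b'<steps><step number="1">Open A.</step>'
    b'<step number="2">Connect A to B.</step></steps>'
)
handler = StepCollector()
xml.sax.parseString(DOC, handler)
print(handler.steps)
```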
Pull parsing
Pull parsing[19] treats the document as a series of items which are read in
sequence using the Iterator design pattern. This allows for writing of
recursive-descent parsers in which the structure of the code performing the
parsing mirrors the structure of the XML being parsed, and intermediate parsed
results can be used and accessed as local variables within the methods
performing the parsing, or passed down (as method parameters) into lower-level
methods, or returned (as method return values) to higher-level methods.
Examples of pull parsers include StAX in the Java programming language,
XMLReader in PHP, ElementTree.iterparse in Python, System.Xml.XmlReader in the
.NET Framework, and the DOM traversal API (NodeIterator and TreeWalker).
A pull parser creates an iterator that sequentially visits the various
elements, attributes, and data in an XML document. Code which uses this
iterator can test the current item (to tell, for example, whether it is a
start or end element, or text), and inspect its attributes (local name,
namespace, values of XML attributes, value of text, etc.), and can also move
the iterator to the next item. The code can thus extract information from the
document as it traverses it. The recursive-descent approach tends to lend
itself to keeping data as typed local variables in the code doing the parsing,
while SAX, for instance, typically requires a parser to manually maintain
intermediate data within a stack of elements which are parent elements of the
element being parsed. Pull-parsing code can be more straightforward to
understand and maintain than SAX parsing code.
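A short sketch of the iterator style, assuming Python's ElementTree.iterparse (one of the pull parsers named above); the log document and variable names are hypothetical. The loop, not the parser, decides when to advance and what to keep.

```python
import io
import xml.etree.ElementTree as ET

DOC = (
    b'<log><entry level="info">started</entry>'
    b'<entry level="error">disk full</entry></log>'
)

events = []   # every (event, tag) pair seen, in document order
errors = []   # text of entries whose level attribute is "error"

# iterparse yields items one at a time as the loop asks for them.
for event, elem in ET.iterparse(io.BytesIO(DOC), events=("start", "end")):
    events.append((event, elem.tag))
    if event == "end" and elem.tag == "entry":
        if elem.get("level") == "error":
            errors.append(elem.text)
        elem.clear()  # release content that is no longer needed

print(events)
print(errors)
```

Intermediate results live in ordinary local variables here, in contrast to the handler state needed in the SAX style.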
Document Object Model
The Document Object Model (DOM) is an interface-oriented application
programming interface that allows for navigation of the entire document as if
it were a tree of node objects representing the document's contents. A DOM
document can be created by a parser, or can be generated manually by users
(with limitations). Data types in DOM nodes are abstract; implementations
provide their own programming language-specific bindings. DOM implementations
tend to be memory intensive, as they generally require the entire document to
be loaded into memory and constructed as a tree of objects before access is
allowed.
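Both ways of obtaining a DOM tree, parsing and manual construction, can be sketched with Python's xml.dom.minidom, a lightweight DOM implementation in the standard library. The note element added here is hypothetical.

```python
from xml.dom import minidom

# Created by a parser: the whole document is loaded into a tree of nodes.
doc = minidom.parseString('<Greeting lang="en">Hello, world.</Greeting>')
root = doc.documentElement

root_name = root.tagName            # element node name
lang = root.getAttribute("lang")    # attribute lookup on the element node
text = root.firstChild.data         # the text node is a child of the element

# Generated manually: new nodes are created through the document object.
note = doc.createElement("note")
note.appendChild(doc.createTextNode("added via DOM"))
root.appendChild(note)

serialized = root.toxml()
print(root_name, lang, text)
print(serialized)
```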
Data binding
Another form of XML processing API is XML data binding, where XML data are
made available as a hierarchy of custom, strongly typed classes, in contrast
to the generic objects created by a Document Object Model parser. This
approach simplifies code development, and in many cases allows problems to be
identified at compile time rather than run-time. Example data binding systems
include the Java Architecture for XML Binding (JAXB) and XML Serialization
in .NET.[20]
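The idea of data binding can be shown with a hand-written mapping. This sketch is not JAXB or .NET XML Serialization, and real binding tools usually generate such code from a schema; the Step class and the two helper functions are hypothetical, built on Python's dataclasses and xml.etree.ElementTree.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Step:
    """Typed application object bound to a <step number="..."> element."""
    number: int
    text: str


def step_from_xml(elem: ET.Element) -> Step:
    # The attribute arrives as a string; binding converts it to the typed field.
    return Step(number=int(elem.attrib["number"]), text=elem.text or "")


def step_to_xml(step: Step) -> ET.Element:
    elem = ET.Element("step", {"number": str(step.number)})
    elem.text = step.text
    return elem


source = '<step number="3">Connect A to B.</step>'
step = step_from_xml(ET.fromstring(source))
round_trip = ET.tostring(step_to_xml(step), encoding="unicode")
print(step, round_trip)
```

Application code then works with step.number as an integer instead of navigating generic nodes and converting strings at each use.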
XML as data type
XML has appeared as a first-class data type in other languages. The ECMAScript
for XML (E4X) extension to the ECMAScript/JavaScript language explicitly
defines two specific objects (XML and XMLList) for JavaScript, which support
XML document nodes and XML node lists as distinct objects and use a
dot-notation specifying parent-child relationships.[21] E4X is supported by
the Mozilla 2.5+ browsers (though now deprecated) and Adobe Actionscript, but
has not been adopted more universally. Similar notations are used in
Microsoft's LINQ implementation for Microsoft .NET 3.5 and above, and in Scala
(which uses the Java VM). The open-source xmlsh application, which provides a
Linux-like shell with special features for XML manipulation, similarly treats
XML as a data type, using the <[ ]> notation.[22] The Resource Description
Framework defines a data type rdf:XMLLiteral to hold wrapped, canonical
XML.[23]
History
XML is an application profile of SGML (ISO 8879).[24]
The versatility of SGML for dynamic information display was understood by
early digital media publishers in the late 1980s prior to the rise of the
Internet.[25][26] By the mid-1990s some practitioners of SGML had gained
experience with the then-new World Wide Web, and believed that SGML offered
solutions to some of the problems the Web was likely to face as it grew. Dan
Connolly added SGML to the list of W3C's activities when he joined the staff
in 1995; work began in mid-1996 when Sun Microsystems engineer Jon Bosak
developed a charter and recruited collaborators. Bosak was well connected in
the small community of people who had experience both in SGML and the Web.[27]
XML was compiled by a working group of eleven members,[28] supported by a
(roughly) 150-member Interest Group. Technical debate took place on the
Interest Group mailing list and issues were resolved by consensus or, when
that failed, majority vote of the Working Group. A record of design decisions
and their rationales was compiled by Michael Sperberg-McQueen on December 4,
1997.[29] James Clark served as Technical Lead of the Working Group, notably
contributing the empty-element "<empty />" syntax and the name "XML". Other
names that had been put forward for consideration included "MAGMA" (Minimal
Architecture for Generalized Markup Applications), "SLIM" (Structured Language
for Internet Markup) and "MGML" (Minimal Generalized Markup Language). The
co-editors of the specification were originally Tim Bray and Michael
Sperberg-McQueen. Halfway through the project Bray accepted a consulting
engagement with Netscape, provoking vociferous protests from Microsoft. Bray
was temporarily asked to resign the editorship. This led to intense dispute in
the Working Group, eventually solved by the appointment of Microsoft's Jean
Paoli as a third co-editor.
The XML Working Group never met face-to-face; the design was accomplished
using a combination of email and weekly teleconferences. The major design
decisions were reached in a short burst of intense work between August and
November 1996,[30] when the first Working Draft of an XML specification was
published.[31] Further design work continued through 1997, and XML 1.0 became
a W3C Recommendation on February 10, 1998.
Sources
XML is a profile of an ISO standard SGML, and most of XML comes from SGML
unchanged. From SGML comes the separation of logical and physical structures
(elements and entities), the availability of grammar-based validation (DTDs),
the separation of data and metadata (elements and attributes), mixed content,
the separation of processing from representation (processing instructions),
and the default angle-bracket syntax. Removed were the SGML declaration (XML
has a fixed delimiter set and adopts Unicode as the document character set).
Other sources of technology for XML were the Text Encoding Initiative (TEI),
which defined a profile of SGML for use as a "transfer syntax"; and HTML, in
which elements were synchronous with their resource, document character sets
were separate from resource encoding, the xml:lang attribute was invented, and
(like HTTP) metadata accompanied the resource rather than being needed at the
declaration of a link. The Extended Reference Concrete Syntax (ERCS) project
of the SPREAD (Standardization Project Regarding East Asian Documents) project
of the ISO-related China/Japan/Korea Document Processing expert group was the
basis of XML 1.0's naming rules; SPREAD also introduced hexadecimal numeric
character references and the concept of references to make available all
Unicode characters. To support ERCS, XML and HTML better, the SGML standard IS
8879 was revised in 1996 and 1998 with WebSGML Adaptations. The XML header
followed that of ISO HyTime.
Ideas that developed during discussion which were novel in XML included the
algorithm for encoding detection and the encoding header, the processing
instruction target, the xml:space attribute, and the new close delimiter for
empty-element tags. The notion of well-formedness as opposed to validity
(which enables parsing without a schema) was first formalized in XML, although
it had been implemented successfully in the Electronic Book Technology
"Dynatext" software;[32] the software from the University of Waterloo New
Oxford English Dictionary Project; the RISP LISP SGML text processor at
Uniscope, Tokyo; the US Army Missile Command IADS hypertext system; Mentor
Graphics Context; Interleaf and Xerox Publishing System.
Versions
There are two current versions of XML. The first (XML 1.0) was initially
defined in 1998. It has undergone minor revisions since then, without being
given a new version number, and is currently in its fifth edition, as
published on November 26, 2008. It is widely implemented and still recommended
for general use.
The second (XML 1.1) was initially published on February 4, 2004, the same day
as XML 1.0 Third Edition,[33] and is currently in its second edition, as
published on August 16, 2006. It contains features (some contentious) that are
intended to make XML easier to use in certain cases.[34] The main changes are
to enable the use of line-ending characters used on EBCDIC platforms, and the
use of scripts and characters absent from Unicode 3.2. XML 1.1 is not very
widely implemented and is recommended for use only by those who need its
unique features.[35]
Prior to its fifth edition release, XML 1.0 differed from XML 1.1 in having
stricter requirements for characters available for use in element and
attribute names and unique identifiers: in the first four editions of XML 1.0
the characters were exclusively enumerated using a specific version of the
Unicode standard (Unicode 2.0 to Unicode 3.2). The fifth edition substitutes
the mechanism of XML 1.1, which is more future-proof but reduces redundancy.
The approach taken in the fifth edition of XML 1.0 and in all editions of XML
1.1 is that only certain characters are forbidden in names, and everything
else is allowed, in order to accommodate the use of suitable name characters
in future versions of Unicode. In the fifth edition, XML names may contain
characters in the Balinese, Cham, or Phoenician scripts among many others
which have been added to Unicode since Unicode 3.2.[34]
Almost any Unicode code point can be used in the character data and attribute
values of an XML 1.0 or 1.1 document, even if the character corresponding to
the code point is not defined in the current version of Unicode. In character
data and attribute values, XML 1.1 allows the use of more control characters
than XML 1.0, but, for "robustness", most of the control characters introduced
in XML 1.1 must be expressed as numeric character references (and &#x7F;
through &#x9F;, which had been allowed in XML 1.0, are in XML 1.1 even
required to be expressed as numeric character references[36]). Among the
supported control characters in XML 1.1 are two line break codes that must be
treated as whitespace. Whitespace characters are the only control codes that
can be written directly.
There has been discussion of an XML 2.0, although no organization has
announced plans for work on such a project. XML-SW (SW for skunkworks),
written by one of the original developers of XML,[37] contains some proposals
for what an XML 2.0 might look like: elimination of DTDs from syntax,
integration of namespaces, XML Base and XML Information Set (infoset) into the
base standard.
The World Wide Web Consortium also has an XML Binary Characterization Working
Group doing preliminary research into use cases and properties for a binary
encoding of the XML infoset. The working group is not chartered to produce any
official standards. Since XML is by definition text-based, ITU-T and ISO are
using the name Fast Infoset for their own binary infoset to avoid confusion
(see ITU-T Rec. X.891 | ISO/IEC 24824-1).
Criticism
XML and its extensions have regularly been criticized for verbosity and
complexity.[38] Mapping the basic tree model of XML to type systems of
programming languages or databases can be difficult, especially when XML is
used for exchanging highly structured data between applications, which was not
its primary design goal. Other criticisms attempt to refute the claim that XML
is a self-describing language[39] (though the XML specification itself makes
no such claim). JSON, YAML, and S-Expressions are frequently proposed as
alternatives (see Comparison of data serialization formats);[40] these focus
on representing highly structured data rather than documents, which may
contain both highly structured and relatively unstructured content.
