So, once more for the record: I haven't given up, I'm just making very slow progress at best. Somehow my attempt to identify all the markup, parse the XML with its key/value pairs, read out the wiki tags, recognize formatting, and classify HTML entities (long story short, to handle the "parsing" of the data properly) is turning out to be a mammoth task.
But there is progress!
So far the parser only prints what has been classified as what. Apart from the XML nodes, no data is processed any further yet; it just prints a lot.
I'll post excerpts of the data being processed and what the parser output looks like so far:
The input being parsed:
[src=xml]<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.6alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AaA</title>
<id>1</id>
<revision>
<id>32899315</id>
<timestamp>2005-12-27T18:46:47Z</timestamp>
<contributor>
<username>Jsmethers</username>
<id>614213</id>
</contributor>
<text xml:space="preserve">#REDIRECT [[AAA]]</text>
</revision>
</page>
<page>
<title>AlgeriA</title>
<id>5</id>
<revision>
<id>18063769</id>
<timestamp>2005-07-03T11:13:13Z</timestamp>
<contributor>
<username>Docu</username>
<id>8029</id>
</contributor>
<minor />
<comment>adding cur_id=5: {{R from CamelCase}}</comment>
<text xml:space="preserve">#REDIRECT [[Algeria]]{{R from CamelCase}}</text>
</revision>
</page>
<page>
<title>AmericanSamoa</title>
<id>6</id>
<revision>
<id>18063795</id>
<timestamp>2005-07-03T11:14:17Z</timestamp>
<contributor>
<username>Docu</username>
<id>8029</id>
</contributor>
<minor />
<comment>adding to cur_id=6 {{R from CamelCase}}</comment>
<text xml:space="preserve">#REDIRECT [[American Samoa]]{{R from CamelCase}}</text>
</revision>
</page>
<page>
<title>AppliedEthics</title>
<id>8</id>
<revision>
<id>15898943</id>
<timestamp>2002-02-25T15:43:11Z</timestamp>
<contributor>
<ip>Conversion script</ip>
</contributor>
<minor />
<comment>Automated conversion</comment>
<text xml:space="preserve">#REDIRECT [[Applied ethics]]
</text>
</revision>
</page>
<page>
<title>AccessibleComputing</title>
<id>10</id>
<revision>
<id>15898945</id>
<timestamp>2003-04-25T22:18:38Z</timestamp>
<contributor>
<username>Ams80</username>
<id>7543</id>
</contributor>
<minor />
<comment>Fixing redirect</comment>
<text xml:space="preserve">#REDIRECT [[Accessible_computing]]</text>
</revision>
</page>
<page>
<title>AdA</title>
<id>11</id>
<revision>
<id>15898946</id>
<timestamp>2002-09-22T16:02:58Z</timestamp>
<contributor>
<username>Andre Engels</username>
<id>300</id>
</contributor>
<minor />
<text xml:space="preserve">#REDIRECT [[Ada programming language]]</text>
</revision>
</page>
<page>
<title>Anarchism</title>
<id>12</id>
<revision>
<id>42136831</id>
<timestamp>2006-03-04T01:41:25Z</timestamp>
<contributor>
<username>CJames745</username>
<id>832382</id>
</contributor>
<minor />
<comment>/* Anarchist Communism */ too many brackets</comment>
<text xml:space="preserve">{{Anarchism}}
'''Anarchism''' originated as a term of abuse first used against early [[working class]] [[radical]]s including the [[Diggers]] of the [[English Revolution]] and the [[sans-culotte|''sans-culottes'']] of the [[French Revolution]].[http://uk.encarta.msn.com/encyclopedia_761568770/Anarchism.html] Whilst the term is still used in a pejorative way to describe ''"any act that used violent means to destroy the organization of society"''<ref>[http://www.cas.sc.edu/socy/faculty/deflem/zhistorintpolency.html History of International Police Cooperation], from the final protocols of the "International Conference of Rome for the Social Defense Against Anarchists", 1898</ref>, it has also been taken up as a positive label by self-defined anarchists.
The word '''anarchism''' is [[etymology|derived from]] the [[Greek language|Greek]] ''[[Wiktionary:αναρχία|αναρχία]]'' ("without [[archon]]s (ruler, chief, king)"). Anarchism as a [[political philosophy]], is the belief that ''rulers'' are unnecessary and should be abolished, although there are differing interpretations of what this means. Anarchism also refers to related [[social movement]]s) that advocate the elimination of authoritarian institutions, particularly the [[state]].<ref>[http://en.wikiquote.org/wiki/Definitions_of_anarchism Definitions of anarchism] on Wikiquote, accessed 2006</ref> The word "[[anarchy]]," as most anarchists use it, does not imply [[chaos]], [[nihilism]], or [[anomie]], but rather a harmonious [[anti-authoritarian]] society. In place of what are regarded as authoritarian political structures and coercive economic institutions, anarchists advocate social relations based upon [[voluntary association]] of autonomous individuals, [[mutual aid]], and [[self-governance]].
While anarchism is most easily defined by what it is against, anarchists also offer positive visions of what they believe to be a truly free society. However, ideas about how an anarchist society might work vary considerably, especially with respect to economics; there is also disagreement about how a free society might be brought about.
== Origins and predecessors ==
[[Peter Kropotkin|Kropotkin]], and others, argue that before recorded [[history]], human society was organized on anarchist principles.[/src]
The parsed data:
[src=text]NODE1 1 mediawiki
KEYVAL 1 xmlns => http://www.mediawiki.org/xml/export-0.3/
KEYVAL 1 xmlns:xsi => http://www.w3.org/2001/XMLSchema-instance
KEYVAL 1 xsi:schemaLocation => http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd
KEYVAL 1 version => 0.3
KEYVAL 1 xml:lang => en
NODE1 2 siteinfo
NODE1 3 sitename
WORD 3 sitename --- Wikipedia
NODE1 4 base
WORD 4 base --- http://en.wikipedia.org/wiki/Main_Page
NODE1 5 generator
WORD 5 generator --- MediaWiki
WORD 5 generator --- 1.6alpha
NODE1 6 case
WORD 6 case --- first-letter
NODE1 7 namespaces
NODE1 8 namespace
KEYVAL 8 key => -2
WORD 8 namespace --- Media
NODE1 9 namespace
KEYVAL 9 key => -1
WORD 9 namespace --- Special
NODE1 10 namespace
KEYVAL 10 key => 0
NODE1 11 namespace
KEYVAL 11 key => 1
WORD 11 namespace --- Talk
NODE1 12 namespace
KEYVAL 12 key => 2
WORD 12 namespace --- User
NODE1 13 namespace
KEYVAL 13 key => 3
WORD 13 namespace --- User
WORD 13 namespace --- talk
NODE1 14 namespace
KEYVAL 14 key => 4
WORD 14 namespace --- Wikipedia
NODE1 15 namespace
KEYVAL 15 key => 5
WORD 15 namespace --- Wikipedia
WORD 15 namespace --- talk
NODE1 16 namespace
KEYVAL 16 key => 6
WORD 16 namespace --- Image
NODE1 17 namespace
KEYVAL 17 key => 7
WORD 17 namespace --- Image
WORD 17 namespace --- talk
NODE1 18 namespace
KEYVAL 18 key => 8
WORD 18 namespace --- MediaWiki
NODE1 19 namespace
KEYVAL 19 key => 9
WORD 19 namespace --- MediaWiki
WORD 19 namespace --- talk
NODE1 20 namespace
KEYVAL 20 key => 10
WORD 20 namespace --- Template
NODE1 21 namespace
KEYVAL 21 key => 11
WORD 21 namespace --- Template
WORD 21 namespace --- talk
NODE1 22 namespace
KEYVAL 22 key => 12
WORD 22 namespace --- Help
NODE1 23 namespace
KEYVAL 23 key => 13
WORD 23 namespace --- Help
WORD 23 namespace --- talk
NODE1 24 namespace
KEYVAL 24 key => 14
WORD 24 namespace --- Category
NODE1 25 namespace
KEYVAL 25 key => 15
WORD 25 namespace --- Category
WORD 25 namespace --- talk
NODE1 26 namespace
KEYVAL 26 key => 100
WORD 26 namespace --- Portal
NODE1 27 namespace
KEYVAL 27 key => 101
WORD 27 namespace --- Portal
WORD 27 namespace --- talk
NODE1 30 page
NODE1 31 title
WORD 31 title --- AaA
NODE1 32 id
WORD 32 id --- 1
NODE1 33 revision
NODE1 34 id
WORD 34 id --- 32899315
NODE1 35 timestamp
WORD 35 timestamp --- 2005-12-27T18:46:47Z
NODE1 36 contributor
NODE1 37 username
WORD 37 username --- Jsmethers
NODE1 38 id
WORD 38 id --- 614213
NODE1 40 text
KEYVAL 40 xml:space => preserve
WORD 40 text --- #REDIRECT
WIKITAG 40 text --- [[AAA]] *1
NODE1 43 page
NODE1 44 title
WORD 44 title --- AlgeriA
NODE1 45 id
WORD 45 id --- 5
NODE1 46 revision
NODE1 47 id
WORD 47 id --- 18063769
NODE1 48 timestamp
WORD 48 timestamp --- 2005-07-03T11:13:13Z
NODE1 49 contributor
NODE1 50 username
WORD 50 username --- Docu
NODE1 51 id
WORD 51 id --- 8029
NODE1 53 minor
NODE1 54 comment
WORD 54 comment --- adding
WORD 54 comment --- cur_id5:
WIKITAG 54 comment --- {{R from CamelCase}} *0
NODE1 55 text
KEYVAL 55 xml:space => preserve
WORD 55 text --- #REDIRECT
WIKITAG 55 text --- [[Algeria]] *1
WIKITAG 55 text --- {{R from CamelCase}} *0
NODE1 58 page
NODE1 59 title
WORD 59 title --- AmericanSamoa
NODE1 60 id
WORD 60 id --- 6
NODE1 61 revision
NODE1 62 id
WORD 62 id --- 18063795
NODE1 63 timestamp
WORD 63 timestamp --- 2005-07-03T11:14:17Z
NODE1 64 contributor
NODE1 65 username
WORD 65 username --- Docu
NODE1 66 id
WORD 66 id --- 8029
NODE1 68 minor
NODE1 69 comment
WORD 69 comment --- adding
WORD 69 comment --- to
WORD 69 comment --- cur_id6
WIKITAG 69 comment --- {{R from CamelCase}} *0
NODE1 70 text
KEYVAL 70 xml:space => preserve
WORD 70 text --- #REDIRECT
WIKITAG 70 text --- [[American Samoa]] *1
WIKITAG 70 text --- {{R from CamelCase}} *0
NODE1 73 page
NODE1 74 title
WORD 74 title --- AppliedEthics
NODE1 75 id
WORD 75 id --- 8
NODE1 76 revision
NODE1 77 id
WORD 77 id --- 15898943
NODE1 78 timestamp
WORD 78 timestamp --- 2002-02-25T15:43:11Z
NODE1 79 contributor
NODE1 80 ip
WORD 80 ip --- Conversion
WORD 80 ip --- script
NODE1 82 minor
NODE1 83 comment
WORD 83 comment --- Automated
WORD 83 comment --- conversion
NODE1 84 text
KEYVAL 84 xml:space => preserve
WORD 84 text --- #REDIRECT
WIKITAG 84 text --- [[Applied ethics]] *1
NODE1 88 page
NODE1 89 title
WORD 89 title --- AccessibleComputing
NODE1 90 id
WORD 90 id --- 10
NODE1 91 revision
NODE1 92 id
WORD 92 id --- 15898945
NODE1 93 timestamp
WORD 93 timestamp --- 2003-04-25T22:18:38Z
NODE1 94 contributor
NODE1 95 username
WORD 95 username --- Ams80
NODE1 96 id
WORD 96 id --- 7543
NODE1 98 minor
NODE1 99 comment
WORD 99 comment --- Fixing
WORD 99 comment --- redirect
NODE1 100 text
KEYVAL 100 xml:space => preserve
WORD 100 text --- #REDIRECT
WIKITAG 100 text --- [[Accessible_computing]] *1
NODE1 103 page
NODE1 104 title
WORD 104 title --- AdA
NODE1 105 id
WORD 105 id --- 11
NODE1 106 revision
NODE1 107 id
WORD 107 id --- 15898946
NODE1 108 timestamp
WORD 108 timestamp --- 2002-09-22T16:02:58Z
NODE1 109 contributor
NODE1 110 username
WORD 110 username --- Andre
WORD 110 username --- Engels
NODE1 111 id
WORD 111 id --- 300
NODE1 113 minor
NODE1 114 text
KEYVAL 114 xml:space => preserve
WORD 114 text --- #REDIRECT
WIKITAG 114 text --- [[Ada programming language]] *1
NODE1 117 page
NODE1 118 title
WORD 118 title --- Anarchism
NODE1 119 id
WORD 119 id --- 12
NODE1 120 revision
NODE1 121 id
WORD 121 id --- 42136831
NODE1 122 timestamp
WORD 122 timestamp --- 2006-03-04T01:41:25Z
NODE1 123 contributor
NODE1 124 username
WORD 124 username --- CJames745
NODE1 125 id
WORD 125 id --- 832382
NODE1 127 minor
NODE1 128 comment
WORD 128 comment --- /*
WORD 128 comment --- Anarchist
WORD 128 comment --- Communism
WORD 128 comment --- */
WORD 128 comment --- too
WORD 128 comment --- many
WORD 128 comment --- brackets
NODE1 129 text
KEYVAL 129 xml:space => preserve
WIKITAG 129 text --- {{Anarchism}} *0
WORD 130 text --- Anarchism
WORD 130 text --- originated
WORD 130 text --- as
WORD 130 text --- a
WORD 130 text --- term
WORD 130 text --- of
WORD 130 text --- abuse
WORD 130 text --- first
WORD 130 text --- used
WORD 130 text --- against
WORD 130 text --- early
WIKITAG 130 text --- [[working class]] *1
WIKITAG 130 text --- [[radical]] *1
WORD 130 text --- s
WORD 130 text --- including
WORD 130 text --- the
WIKITAG 130 text --- [[Diggers]] *1
WORD 130 text --- of
WORD 130 text --- the
WIKITAG 130 text --- [[English Revolution]] *1
WORD 130 text --- and
WORD 130 text --- the
WIKITAG 130 text --- [[sans-culotte|sans-culottes]] *1
WORD 130 text --- of
WORD 130 text --- the
WIKITAG 130 text --- [[French Revolution]] *1
WORD 130 text --- .[http://uk.encarta.msn.com/encyclopedia_761568770/Anarchism.html]
WORD 130 text --- Whilst
WORD 130 text --- the
WORD 130 text --- term
WORD 130 text --- is
WORD 130 text --- still
WORD 130 text --- used
WORD 130 text --- in
WORD 130 text --- a
WORD 130 text --- pejorative
WORD 130 text --- way
WORD 130 text --- to
WORD 130 text --- describe
[ENTITY] 130 text --- "
ENTITY 130 text --- " => [0,5] '7
WORD 130 text --- any '7
WORD 130 text --- act '7
WORD 130 text --- that '7
WORD 130 text --- used '7
WORD 130 text --- violent '7
WORD 130 text --- means '7
WORD 130 text --- to '7
WORD 130 text --- destroy '7
WORD 130 text --- the '7
WORD 130 text --- organization '7
WORD 130 text --- of '7
[ENTITY] 130 text --- society"
ENTITY 130 text --- " => [7,12] '7
SUB WORD 130 text --- society '7 => [0,6]
[ENTITY] 130 text --- <
ENTITY 130 text --- < => [0,3]
[ENTITY] 130 text --- ref>
ENTITY 130 text --- > => [3,6]
SUB WORD 130 text --- ref => [0,2]
WORD 130 text --- [http://www.cas.sc.edu/socy/faculty/deflem/zhistorintpolency.html
WORD 130 text --- History
WORD 130 text --- of
WORD 130 text --- International
WORD 130 text --- Police
WORD 130 text --- Cooperation]
WORD 130 text --- ,
WORD 130 text --- from
WORD 130 text --- the
WORD 130 text --- final
WORD 130 text --- protocols
WORD 130 text --- of
WORD 130 text --- the
[ENTITY] 130 text --- "
ENTITY 130 text --- " => [0,5]
WORD 130 text --- International
WORD 130 text --- Conference
WORD 130 text --- of
WORD 130 text --- Rome
WORD 130 text --- for
WORD 130 text --- the
WORD 130 text --- Social
WORD 130 text --- Defense
WORD 130 text --- Against
[ENTITY] 130 text --- Anarchists"
ENTITY 130 text --- " => [10,15]
SUB WORD 130 text --- Anarchists => [0,9]
WORD 130 text --- ,
[ENTITY] 130 text --- 1898<
ENTITY 130 text --- < => [4,7]
SUB WORD 130 text --- 1898 => [0,3]
[ENTITY] 130 text --- /ref>
ENTITY 130 text --- > => [4,7]
SUB WORD 130 text --- /ref => [0,3]
WORD 130 text --- ,
WORD 130 text --- it
WORD 130 text --- has
WORD 130 text --- also
WORD 130 text --- been
WORD 130 text --- taken
WORD 130 text --- up
WORD 130 text --- as
WORD 130 text --- a
WORD 130 text --- positive
WORD 130 text --- label
WORD 130 text --- by
WORD 130 text --- self-defined
WORD 130 text --- anarchists.
WORD 132 text --- The
WORD 132 text --- word
WORD 132 text --- anarchism
WORD 132 text --- is
WIKITAG 132 text --- [[etymology|derived from]] *1
WORD 132 text --- the
WIKITAG 132 text --- [[Greek language|Greek]] *1
WIKITAG 132 text --- [[Wiktionary:αναρχία|αναρχία]] *1 '7
[ENTITY] 132 text --- ("
ENTITY 132 text --- " => [1,6]
SUB WORD 132 text --- ( => [0,0]
WORD 132 text --- without
WIKITAG 132 text --- [[archon]] *1
WORD 132 text --- s
WORD 132 text --- (ruler,
WORD 132 text --- chief,
[ENTITY] 132 text --- king)"
ENTITY 132 text --- " => [5,10]
SUB WORD 132 text --- king) => [0,4]
WORD 132 text --- ).
WORD 132 text --- Anarchism
WORD 132 text --- as
WORD 132 text --- a
WIKITAG 132 text --- [[political philosophy]] *1
WORD 132 text --- ,
WORD 132 text --- is
WORD 132 text --- the
WORD 132 text --- belief
WORD 132 text --- that
WORD 132 text --- rulers
WORD 132 text --- are
WORD 132 text --- unnecessary
WORD 132 text --- and
WORD 132 text --- should
WORD 132 text --- be
WORD 132 text --- abolished,
WORD 132 text --- although
WORD 132 text --- there
WORD 132 text --- are
WORD 132 text --- differing
WORD 132 text --- interpretations
WORD 132 text --- of
WORD 132 text --- what
WORD 132 text --- this
WORD 132 text --- means.
WORD 132 text --- Anarchism
WORD 132 text --- also
WORD 132 text --- refers
WORD 132 text --- to
WORD 132 text --- related
WIKITAG 132 text --- [[social movement]] *1
WORD 132 text --- s)
WORD 132 text --- that
WORD 132 text --- advocate
WORD 132 text --- the
WORD 132 text --- elimination
WORD 132 text --- of
WORD 132 text --- authoritarian
WORD 132 text --- institutions,
WORD 132 text --- particularly
WORD 132 text --- the
WIKITAG 132 text --- [[state]] *1
[ENTITY] 132 text --- .<
ENTITY 132 text --- < => [1,4]
SUB WORD 132 text --- . => [0,0]
[ENTITY] 132 text --- ref>
ENTITY 132 text --- > => [3,6]
SUB WORD 132 text --- ref => [0,2]
WORD 132 text --- [http://en.wikiquote.org/wiki/Definitions_of_anarchism
WORD 132 text --- Definitions
WORD 132 text --- of
WORD 132 text --- anarchism]
WORD 132 text --- on
WORD 132 text --- Wikiquote,
WORD 132 text --- accessed
[ENTITY] 132 text --- 2006<
ENTITY 132 text --- < => [4,7]
SUB WORD 132 text --- 2006 => [0,3]
[ENTITY] 132 text --- /ref>
ENTITY 132 text --- > => [4,7]
SUB WORD 132 text --- /ref => [0,3]
WORD 132 text --- The
WORD 132 text --- word
[ENTITY] 132 text --- "
ENTITY 132 text --- " => [0,5]
WIKITAG 132 text --- [[anarchy]] *1
[ENTITY] 132 text --- ,"
ENTITY 132 text --- " => [1,6]
SUB WORD 132 text --- , => [0,0]
WORD 132 text --- as
WORD 132 text --- most
WORD 132 text --- anarchists
WORD 132 text --- use
WORD 132 text --- it,
WORD 132 text --- does
WORD 132 text --- not
WORD 132 text --- imply
WIKITAG 132 text --- [[chaos]] *1
WORD 132 text --- ,
WIKITAG 132 text --- [[nihilism]] *1
WORD 132 text --- ,
WORD 132 text --- or
WIKITAG 132 text --- [[anomie]] *1
WORD 132 text --- ,
WORD 132 text --- but
WORD 132 text --- rather
WORD 132 text --- a
WORD 132 text --- harmonious
WIKITAG 132 text --- [[anti-authoritarian]] *1
WORD 132 text --- society.
WORD 132 text --- In
WORD 132 text --- place
WORD 132 text --- of
WORD 132 text --- what
WORD 132 text --- are
WORD 132 text --- regarded
WORD 132 text --- as
WORD 132 text --- authoritarian
WORD 132 text --- political
WORD 132 text --- structures
WORD 132 text --- and
WORD 132 text --- coercive
WORD 132 text --- economic
WORD 132 text --- institutions,
WORD 132 text --- anarchists
WORD 132 text --- advocate
WORD 132 text --- social
WORD 132 text --- relations
WORD 132 text --- based
WORD 132 text --- upon
WIKITAG 132 text --- [[voluntary association]] *1
WORD 132 text --- of
WORD 132 text --- autonomous
WORD 132 text --- individuals,
WIKITAG 132 text --- [[mutual aid]] *1
WORD 132 text --- ,
WORD 132 text --- and
WIKITAG 132 text --- [[self-governance]] *1
WORD 132 text --- .
WORD 134 text --- While
WORD 134 text --- anarchism
WORD 134 text --- is
WORD 134 text --- most
WORD 134 text --- easily
WORD 134 text --- defined
WORD 134 text --- by
WORD 134 text --- what
WORD 134 text --- it
WORD 134 text --- is
WORD 134 text --- against,
WORD 134 text --- anarchists
WORD 134 text --- also
WORD 134 text --- offer
WORD 134 text --- positive
WORD 134 text --- visions
WORD 134 text --- of
WORD 134 text --- what
WORD 134 text --- they
WORD 134 text --- believe
WORD 134 text --- to
WORD 134 text --- be
WORD 134 text --- a
WORD 134 text --- truly
WORD 134 text --- free
WORD 134 text --- society.
WORD 134 text --- However,
WORD 134 text --- ideas
WORD 134 text --- about
WORD 134 text --- how
WORD 134 text --- an
WORD 134 text --- anarchist
WORD 134 text --- society
WORD 134 text --- might
WORD 134 text --- work
WORD 134 text --- vary
WORD 134 text --- considerably,
WORD 134 text --- especially
WORD 134 text --- with
WORD 134 text --- respect
WORD 134 text --- to
WORD 134 text --- economics;
WORD 134 text --- there
WORD 134 text --- is
WORD 134 text --- also
WORD 134 text --- disagreement
WORD 134 text --- about
WORD 134 text --- how
WORD 134 text --- a
WORD 134 text --- free
WORD 134 text --- society
WORD 134 text --- might
WORD 134 text --- be
WORD 134 text --- brought
WORD 134 text --- about.
WORD 136 text --- Origins '4
WORD 136 text --- and '4
WORD 136 text --- predecessors '4
WIKITAG 138 text --- [[Peter Kropotkin|Kropotkin]] *1
WORD 138 text --- ,
WORD 138 text --- and
WORD 138 text --- others,
WORD 138 text --- argue
WORD 138 text --- that
WORD 138 text --- before
WORD 138 text --- recorded
WIKITAG 138 text --- [[history]] *1
WORD 138 text --- ,
WORD 138 text --- human
WORD 138 text --- society
WORD 138 text --- was
WORD 138 text --- organized
WORD 138 text --- on
WORD 138 text --- anarchist
[ENTITY] 138 text --- principles.<
ENTITY 138 text --- < => [11,14]
SUB WORD 138 text --- principles. => [0,10]
[/src]
Legend:
1st field: "data type"
2nd field: line number
3rd field: node name, or for key/value pairs the key, always followed by the value
* field: WikiTag format
' field: bold/italic, bold-italic, or a heading
NODE1/2 - XML nodes
KEYVAL - key and value of a key/value pair
WORD - word data
WIKITAG - wiki markup (such a text block is currently read in as a whole)
[ENTITY] - a string that contains at least one HTML entity
ENTITY - an identified entity with its start and end position in the [ENTITY] string (these are always extracted/listed before the words)
SUB WORD - a word extracted from an [ENTITY] string, with its start and end position in the string
SUB WIKITAG - a WikiTag inside an [ENTITY] string, likewise with its start and end position in the string
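To make the [ENTITY] / ENTITY / SUB WORD entries concrete, here is a minimal Python sketch of how such positions could be computed. It is not the actual parser. It assumes the raw dump still contains the entities in escaped form (for example `&quot;` and `&lt;`; the listing above shows them decoded) and that the positions are inclusive, zero-based offsets, which fits values such as [7,12] for the entity in `society&quot;`. The name `split_entities` and the regular expression are hypothetical.

```python
import re

# Named or numeric HTML entity, e.g. &quot; &lt; &#160; &#xA0;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def split_entities(token):
    """Split one whitespace-delimited token into entities and sub words.

    Returns (entities, sub_words); every item is (text, start, end)
    with inclusive, zero-based offsets into the token.
    """
    entities = []
    sub_words = []
    pos = 0
    for match in ENTITY_RE.finditer(token):
        if match.start() > pos:
            # Plain text between the previous entity and this one.
            sub_words.append((token[pos:match.start()], pos, match.start() - 1))
        entities.append((match.group(), match.start(), match.end() - 1))
        pos = match.end()
    if pos < len(token):
        # Plain text after the last entity.
        sub_words.append((token[pos:], pos, len(token) - 1))
    return entities, sub_words


entities, sub_words = split_entities("society&quot;")
print(entities)   # [('&quot;', 7, 12)]
print(sub_words)  # [('society', 0, 6)]
```

Collecting the entities first and the sub words second mirrors the order in the listing, where the ENTITY lines always come before the SUB WORD lines.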
A small addendum/note:
Unfortunately, "formatting" markers like these are problematic:
"*", // List element, multiple levels, "**" Element of element
"#", // Numbered list element, multiple levels = "##" "Element of element"
":", // "indent 1", multiple levels using ":::" = 3
With an asterisk and a hash I'm not sure whether they can also appear in normal text, the way a colon can.
I haven't gotten around to investigating this further, but right now it's a mystery to me which rules Wikipedia uses to "parse" this, when a colon really means an indent and when it doesn't, and whether a "#REDIRECT" only shows up in "comment" nodes or in other places as well.
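One rule that helps here: in MediaWiki markup, "*", "#" and ":" only act as list or indent markers when they are the first characters of a line; anywhere else they are ordinary text. In the sample above, "#REDIRECT" sits at the very start of the "text" node content. A minimal Python sketch of a line classifier built on that rule follows. The name `classify_line`, the returned labels, and the simplification that the last marker decides the kind of a mixed prefix such as "*#" are assumptions for illustration only.

```python
import re

# A run of list/indent markers at the very start of a line.
PREFIX_RE = re.compile(r"^([*#:]+)\s*(.*)$", re.DOTALL)
# "#REDIRECT" at the start of the line, matched case-insensitively.
REDIRECT_RE = re.compile(r"^#REDIRECT\b", re.IGNORECASE)

KINDS = {"*": "bullet", "#": "numbered", ":": "indent"}


def classify_line(line):
    """Return (kind, depth, rest) for one line of wikitext."""
    if REDIRECT_RE.match(line):
        # Checked first so that the "#" is not mistaken for a numbered list.
        return ("redirect", 0, line[len("#REDIRECT"):].strip())
    match = PREFIX_RE.match(line)
    if not match:
        # No marker at the start: "*", "#" and ":" further in are plain text.
        return ("text", 0, line)
    markers, rest = match.groups()
    # Simplification: the last marker decides the kind, the count is the depth.
    return (KINDS[markers[-1]], len(markers), rest)


print(classify_line("#REDIRECT [[AAA]]"))  # ('redirect', 0, '[[AAA]]')
print(classify_line(":::deep"))            # ('indent', 3, 'deep')
print(classify_line("a: b"))               # ('text', 0, 'a: b')
```

Since the check only looks at the start of a line, a colon or asterisk in the middle of a sentence never needs special handling.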
Edit2:
I just noticed it isn't error-free yet.
"[[Peter Kropotkin|Kropotkin]]" should not have any format; I'll have to take another look at that.
Because of the uncertainty described above, the list and indent markers ("*", "#", ":") are currently not treated as formatting characters and would simply flow into the text as normal, if present.
With the hash there is the additional problem that a "REDIRECT" may follow but doesn't have to. If need be, that could be handled separately, I think, since that case is relatively unambiguous.
Edit3:
I've now changed it so that WikiTags get a higher priority than entity filtering.
It should fit better now.
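For illustration, one way to express "WikiTags before entities" is a single regular expression whose alternatives are tried in priority order. This is only a Python sketch under my own assumptions, not the parser's real code: `tokenize` and the group names are made up, nested templates such as "{{a{{b}}}}" are not handled, and the entities are assumed to be present in escaped form (for example `&quot;`).

```python
import re

TOKEN_RE = re.compile(
    r"(?P<WIKITAG>\[\[.*?\]\]|\{\{.*?\}\})"      # tried first: highest priority
    r"|(?P<ENTITY>&#?\w+;)"                       # only reached outside wiki tags
    r"|(?P<WORD>(?:(?!\[\[|\{\{|&#?\w+;)\S)+)"    # anything else, up to whitespace
)


def tokenize(text):
    """Yield (kind, value) pairs; wiki tags win over entities and words."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


for kind, value in tokenize("early [[radical]]s &quot;any"):
    print(kind, value)
```

Because the WIKITAG alternative is tried first at every position, an entity inside a tag, as in `[[a&amp;b|x]]`, stays part of the tag instead of being split off.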