Help - Data cleanup/validation tool

theSplit · 25 Apr. 2019

Hello,

this is a request for help. Specifically, I want to compare the contents of a source file against the output of another program and remove everything that was found, so that only what was not recognized remains. I currently have one attempt with regex and one without, both in Python. It is not restricted to Python, though; any other language is fine. Both attempts have errors: either Python replaces too early ([kw]str.replace[/kw]) or the regex does not match.
This shows up as leftover data, or as replacing/stripping that hits too early:

Example, "xt" is the search term:
"Text xtz" is wrongly reduced to: "Te xtz"

Currently there is a Python script ("datachecker.py") that uses regex; it works more or less well but is not error-free.
I am attaching a second script here, "datachecker2.py", with which I am currently trying to filter, but it does not behave as intended, presumably because of a logic error?
Here is the current "datachecker2.py" script, without regex, not on GitHub because it is only for testing: Attachment: datachecker2.py.rar
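The "xt" problem above can be avoided by only matching a token when it is not part of a longer word. The following is a minimal Python sketch of that idea, not the code from datachecker.py; the function name `remove_whole_token` is made up for illustration.

```python
import re

def remove_whole_token(text: str, token: str) -> str:
    """Remove `token` only where it is not part of a longer word."""
    # (?<!\w) and (?!\w): no word character directly before or after the match.
    # re.escape keeps characters such as "|" or "{" in the token literal.
    pattern = r"(?<!\w)" + re.escape(token) + r"(?!\w)"
    return re.sub(pattern, "", text)

print(remove_whole_token("Text xtz", "xt"))     # Text xtz (unchanged)
print(remove_whole_token("Text xt xtz", "xt"))  # Text  xtz
```

Unlike a plain `str.replace`, this leaves "Text" and "xtz" alone and only removes a free-standing "xt".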

Background:
This is about the output of the following project: https://github.com/jrie/wicked (C), using the following source file.
The source file cannot be hosted on GitHub because it is 221 MB packed, hence the download from my webspace: https://dwrox.net/enwiki-20160720-pages-meta-current1.xml-p000000010p000030303.tar.gz

About the source file: it is a Wikipedia dump (.xml). It has to be unpacked into "data".

The output is ordered, but split across several files (*.txt):
- xmltags
- xmldata
- words
- wikitags
- entities

[src=c]#define DICTIONARYFILE "words.txt"
#define WIKITAGSFILE "wikitags.txt"
#define XMLTAGFILE "xmltags.txt"
#define XMLDATAFILE "xmldata.txt"
#define ENTITIESFILE "entities.txt"[/src]

... with parameters such as "position in the line", "line number" of the element, formatting yes/no, wiki tag type and spaces.
Below is the exact layout of a line in the text files/outputs, with variable names and the function that writes it.

Word, entity + wiki tag + XML and XML data:
[src=c]bool writeOutDataFiles(const struct parserBaseStore* parserRunTimeData, struct xmlDataCollection* xmlCollection) {
struct xmlNode *xmlTag = NULL;
struct wikiTag *wTag = NULL;
struct word* wordElement = NULL;
struct entity* entityElement = NULL;

for (unsigned int i = 0; i < xmlCollection->count; ++i) {
xmlTag = &xmlCollection->nodes[i];
fprintf(parserRunTimeData->xmltagFile, "%d|%d|%d|%d|%s\n", xmlTag->start , xmlTag->end, xmlTag->isClosed, xmlTag->isDataNode, xmlTag->name);

for (unsigned int j = 0; j < xmlTag->keyValuePairs; ++j) {
fprintf(parserRunTimeData->xmldataFile, "%d|%d|%s|%s\n", xmlTag->start, xmlTag->end, xmlTag->keyValues[j].key, xmlTag->keyValues[j].value);
}

for (unsigned int j = 0; j < xmlTag->wTagCount; ++j) {
wTag = &xmlTag->wikiTags[j];
writeOutTagData(parserRunTimeData, wTag);
}

for (unsigned int j = 0; j < xmlTag->wordCount; ++j) {
wordElement = &xmlTag->words[j];
fprintf(parserRunTimeData->dictFile, "%u|%u|%u|%d|%d|%ld|%d|%d|%d|%d|0|%s\n", wordElement->position, wordElement->lineNum, wordElement->readerPos, wordElement->preSpacesCount, wordElement->spacesCount, strlen(wordElement->data), wordElement->dataFormatType, wordElement->ownFormatType, wordElement->formatStart, wordElement->formatEnd, wordElement->data);
}

for (unsigned int j = 0; j < xmlTag->entityCount; ++j) {
entityElement = &xmlTag->entities[j];
fprintf(parserRunTimeData->entitiesFile, "%u|%u|%u|%d|%d|%d|%d|%d|%d|%s\n", entityElement->position, entityElement->lineNum, entityElement->readerPos, entityElement->preSpacesCount, entityElement->spacesCount, entityElement->dataFormatType, entityElement->ownFormatType, entityElement->formatStart, entityElement->formatEnd, entityElement->data);
}
}

return true;
}[/src]

Word + entity + wiki tag inside a wiki tag:

[src=c]bool writeOutTagData(const struct parserBaseStore* parserRunTimeData, wikiTag *wTag) {
struct word* wordElement = NULL;
struct entity* entityElement = NULL;
struct wikiTag* wikiTagElement = NULL;

fprintf(parserRunTimeData->wtagFile, "%u|%u|%u|%d|%d|%d|%d|%d|%d|%d|%u|%ld|%s\n", wTag->position, wTag->lineNum, wTag->readerPos, wTag->preSpacesCount, wTag->spacesCount, wTag->tagType, wTag->dataFormatType, wTag->ownFormatType, wTag->formatStart, wTag->formatEnd, wTag->tagLength, strlen(wTag->target), wTag->target);

for (unsigned int k = 0; k < wTag->wordCount; ++k) {
wordElement = &wTag->pipedWords[k];
fprintf(parserRunTimeData->dictFile, "%u|%u|%u|%d|%d|%ld|%d|%d|%d|%d|1|%s\n", wordElement->position, wordElement->lineNum, wordElement->readerPos, wordElement->preSpacesCount, wordElement->spacesCount, strlen(wordElement->data), wordElement->dataFormatType, wordElement->ownFormatType, wordElement->formatStart, wordElement->formatEnd, wordElement->data);
}

for (unsigned int k = 0; k < wTag->wTagCount; ++k) {
wikiTagElement = &wTag->pipedTags[k];
writeOutTagData(parserRunTimeData, wikiTagElement);
}

for (unsigned int k = 0;k < wTag->entityCount; ++k) {
entityElement = &wTag->pipedEntities[k];
fprintf(parserRunTimeData->entitiesFile, "%u|%u|%u|%d|%d|%d|%d|%d|%d|%s\n", entityElement->position, entityElement->lineNum, entityElement->readerPos, entityElement->preSpacesCount, entityElement->spacesCount, entityElement->dataFormatType, entityElement->ownFormatType, entityElement->formatStart, entityElement->formatEnd, entityElement->data);
}
return true;
}[/src]

If anything about the format or the values is unclear, just ask here.
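To make the line layout concrete, here is a small Python sketch that reads one line of "words.txt" as written by the `fprintf` calls above: eleven numeric fields followed by the word itself. The field names are taken from the C variables; `WordRecord`, `parse_word_line` and the name `in_wiki_tag` for the literal 0/1 column are assumptions for illustration, and the sample line is invented.

```python
from typing import NamedTuple

class WordRecord(NamedTuple):
    position: int
    line_num: int
    reader_pos: int
    pre_spaces_count: int
    spaces_count: int
    length: int
    data_format_type: int
    own_format_type: int
    format_start: int
    format_end: int
    in_wiki_tag: int   # assumed meaning of the literal 0/1 column
    data: str

def parse_word_line(line: str, sep: str = "|") -> WordRecord:
    # Split at most 11 times so that a separator inside the word stays in `data`.
    *numbers, data = line.rstrip("\n").split(sep, 11)
    return WordRecord(*(int(n) for n in numbers), data)

record = parse_word_line("4|124|17|0|1|9|1|1|1|1|0|Anarchism\n")  # invented sample
print(record.line_num, record.data)  # 124 Anarchism
```

Because the string is always the last field, limiting the number of splits is enough to keep it intact even when it contains the separator.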

BurnerR · 25 Apr. 2019

Can you describe more concretely what you want to do?
Ideally with a minimal before/after example.

theSplit · 25 Apr. 2019

Given are the Wikipedia dump, which looks like the XML snippet below, and the output data from Wicked. Wicked parses the XML and writes the data into the data files: words to 'words.txt', XML to 'xmltags.txt' and 'xmldata.txt', entities to 'entities.txt'.

Here is a random excerpt of fewer than 100 lines from the Wikipedia XML:

[src=xml] <contributor>
<username>Eduen</username>
<id>7527773</id>
</contributor>
<comment>this was redundant</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}
{{pp-move-indef}}
{{Use British English|date=January 2014}}
{{Anarchism sidebar}}
{{Basic forms of government}}

'''Anarchism''' is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies based on voluntary institutions. These are often described as [[stateless society|stateless societies]],<ref>"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref>"In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute themselves for the state in all its functions." [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html Peter Kropotkin. "Anarchism" from the Encyclopædia Britannica]</ref><ref>"Anarchism." The Shorter Routledge Encyclopedia of Philosophy. 2005. p. 14 "Anarchism is the view that a society without the state, or government, is both possible and desirable."</ref><ref>Sheehan, Sean. Anarchism, London: Reaktion Books Ltd., 2004. p. 85</ref> although several authors have defined them more specifically as institutions based on non-[[Hierarchy|hierarchical]] [[Free association (communism and anarchism)|free associations]].<ref>"as many anarchists have stressed, it is not government as such that they find objectionable, but the hierarchical forms of government associated with the nation state." Judith Suissa. ''Anarchism and Education: a Philosophical Perspective''. Routledge. New York. 2006. p. 
7</ref><ref name="iaf-ifa.org"/><ref>"That is why Anarchy, when it works to destroy authority in all its aspects, when it demands the abrogation of laws and the abolition of the mechanism that serves to impose them, when it refuses all hierarchical organisation and preaches free agreement — at the same time strives to maintain and enlarge the precious kernel of social customs without which no human or animal society can exist." [[Peter Kropotkin]]. [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin__Anarchism__its_philosophy_and_ideal.html Anarchism: its philosophy and ideal]</ref><ref>"anarchists are opposed to irrational (e.g., illegitimate) authority, in other words, hierarchy — hierarchy being the institutionalisation of authority within a society." [http://www.theanarchistlibrary.org/HTML/The_Anarchist_FAQ_Editorial_Collective__An_Anarchist_FAQ__03_17_.html#toc2 "B.1 Why are anarchists against authority and hierarchy?"] in [[An Anarchist FAQ]]</ref> Anarchism considers the [[state (polity)|state]] to be undesirable, unnecessary, and harmful,<ref name="definition">
{{cite journal |last=Malatesta|first=Errico|title=Towards Anarchism|journal=MAN!|publisher=International Group of San Francisco|location=Los Angeles|oclc=3930443|url=http://www.marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archiveurl=https://web.archive.org/web/20121107221404/http://marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archivedate=7 November 2012 |deadurl=no|authorlink=Errico Malatesta |ref=harv}}
[/src]

The validation tool is supposed to check the output of the C application for completeness and correctness. In my opinion the best way to do that is to read the data from the output text files and remove it from the source "in memory".

Ideally, for the XML this means that only brackets (for wiki tags) and empty lines remain. That is the idea:

Input:
[src=text]{{Use British English|date=January 2014}}[/src]

where:
[src=text]Use British English[/src]
[src=text]|date=January 2014[/src]

should be replaced if the data is present in the output.

Output:
[src=text]{{}}[/src]

Or rather, the line is also stripped of its surroundings, so that without the brackets it is empty and therefore counts as "correctly" processed.

PS: The replacing should also cover "words.txt" (word contents), so that a paragraph is emptied completely when all of its data is present.
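The input/output example above could be sketched in Python as follows. It assumes that the recognized parts of a line are available in document order; `strip_recognized` is a hypothetical helper, not part of the existing scripts. Each part is removed exactly once, searching from left to right, and parts that cannot be found are reported.

```python
from typing import List, Tuple

def strip_recognized(line: str, parts: List[str]) -> Tuple[str, List[str]]:
    """Remove each recognized part once, left to right; return the rest and the misses."""
    missing = []
    cursor = 0
    for part in parts:
        index = line.find(part, cursor)
        if index == -1:
            missing.append(part)   # present in the output files but not in the source line
            continue
        line = line[:index] + line[index + len(part):]
        cursor = index             # the next part can only start at or after this point
    return line, missing

rest, missing = strip_recognized(
    "{{Use British English|date=January 2014}}",
    ["Use British English", "|date=January 2014"],
)
print(rest, missing)  # {{}} []
```

Searching from a moving cursor instead of from the start of the line is what prevents a short part from being cut out of an earlier, longer word.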

BurnerR · 25 Apr. 2019

Here is a Ruby prototype.
It expects a hash to test against and currently does not write to a file; instead it prints "{{}}" when it finds something. When it finds nothing, one would write "" to the file accordingly.
In general, IMHO it makes more sense not to strike things out but to build up an empty file accordingly.
It reads every line and searches for "{{*}}" entries. Each entry, e.g. {{Use British English|date=January 2014}}, is split at "|"; the script then iterates over the parts and checks whether each one has a counterpart in the given hash (in Python: dictionary).
[src=ruby]
$match_hash = {
"Use British English":"",
"date":"January 2014",
"pp-move-indef":"",
"Anarchism sidebar":""}

def could_resolve?(array)
if array.empty?
return false
end
# Might contain single entry most of the time
array.each do |entry|
entry.split("|").each do |sub_entry|
left_side, right_side = sub_entry.split("=").append("")
if !$match_hash.has_key?(left_side.to_sym) || $match_hash[left_side.to_sym] != right_side
return false
end
end
end
return true
end

File.readlines('wikitest.xml').each do |line|
# Non-greedy match so that two templates on one line stay separate entries
potential_matches = line.scan(/\{\{.*?\}\}/).map{|entry| entry.tr('{}', '')}
if could_resolve?(potential_matches)
puts "{{}}"
end
end
[/src]

Since the given hash is only an example and incomplete, the program currently only recognizes {{pp-move-indef}}, {{Use British English|date=January 2014}} and {{Anarchism sidebar}}.

theSplit · 25 Apr. 2019

@BurnerR: So you would recommend approaching it the other way round? That is, building a file from the data that was found? Admittedly that seems simple, but I have not tried it yet. Would it then be possible, or sensible, to leave the diffing to something like "Meld"?

Regarding your example: maybe my example was not quite right, but normal text should be covered as well, not just a wiki tag.

BurnerR · 26 Apr. 2019

theSplit wrote:
@BurnerR: So you would recommend approaching it the other way round? That is, building a file from the data that was found?

Yes, definitely.
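If the file is rebuilt from the recognized data, the comparison itself can also be scripted instead of using a GUI tool such as Meld. A minimal sketch with Python's standard `difflib` module; the two line lists are invented sample data and `diff_report` is a hypothetical helper name.

```python
import difflib

def diff_report(original_lines, rebuilt_lines):
    """Return unified-diff lines; an empty list means the rebuild is identical."""
    return list(difflib.unified_diff(
        original_lines, rebuilt_lines,
        fromfile="original", tofile="rebuilt", lineterm="",
    ))

original = ["{{pp-move-indef}}", "{{Anarchism sidebar}}"]
rebuilt = ["{{pp-move-indef}}", "{{Anarchism}}"]   # "sidebar" was not recognized
for row in diff_report(original, rebuilt):
    print(row)
```

An empty report would then correspond to the "nothing left over" goal of the validation.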

theSplit wrote:
Regarding your example: maybe my example was not quite right, but normal text should be covered as well, not just a wiki tag.

You only touched on that briefly in the PS.

But if I understand you correctly, the pure logic, without the input/output parts, is a single line. 'text' would be one line of the file at a time:
[src=ruby]
require 'set'

text = "This is some Text. It contains stuff like: Commata, colon or Period."
words = Set["This", "some", "It", "contains"]
text.scan(/[a-zA-Z]+/).reject{|word| words.include?(word)}.join(' ')

# Result:
=> "is Text stuff like Commata colon or Period"
[/src]
scan returns an array of the matches, in this case words. reject returns the elements that were not found in words. At the end everything is joined back into a single string, separated by spaces.

The comparison really happens only at word level, so an 'xt' in the given word list would not cause 'Text' to be removed.

theSplit · 26 Apr. 2019

The "reject" looks good per se, and it is exactly the point where I somewhat failed in Python. Is there a Python equivalent?
It looks as if that logic could already work... the only catch being that the words are UTF-8, not just characters from the English alphabet, and may also contain numbers.

And what if we have "{{Anarchism|Read more}}" once and "Anarchism" as a standalone word? With your example, both occurrences of "Anarchism" would be filtered out, regardless of whether the application detected each of them, correct? Because the word is contained in the word filter once. But exactly that would make the validation pointless.
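One way around this objection is to count occurrences instead of testing membership, so that every recognized word cancels out exactly one occurrence in the text. A minimal Python sketch of that idea using `collections.Counter`; `unmatched_words` is a hypothetical name and the word list is sample data.

```python
import re
from collections import Counter

def unmatched_words(text, recognized):
    """Return the words of `text` not covered by `recognized`, counting occurrences."""
    budget = Counter(recognized)
    leftover = []
    for word in re.findall(r"\w+", text):
        if budget[word] > 0:
            budget[word] -= 1   # consume exactly one recognized occurrence
        else:
            leftover.append(word)
    return leftover

text = "{{Anarchism|Read more}} Anarchism"
print(unmatched_words(text, ["Anarchism", "Read", "more"]))  # ['Anarchism']
```

With "Anarchism" recognized only once, the second occurrence stays behind and shows up as a validation failure instead of being silently filtered.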

BurnerR · 27 Apr. 2019

In Python one would probably use 'filter': http://book.pythontips.com/en/latest/map_filter.html
@ UTF-8: Yes, one would have to look more closely at technical details like that.
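As a rough Python counterpart to the Ruby scan/reject one-liner, the sketch below uses a generator expression (the built-in `filter` would work the same way). In Python 3, `\w` in a `str` pattern is Unicode-aware and also matches digits and the underscore, which covers the UTF-8 and number concern; `reject_known` is a made-up name, and the second sample sentence is invented.

```python
import re

def reject_known(text, known):
    """Keep only the words of `text` that are not in the set `known`."""
    # \w+ matches letters from any script plus digits (and "_").
    return " ".join(w for w in re.findall(r"\w+", text) if w not in known)

text = "This is some Text. It contains stuff like: Commata, colon or Period."
words = {"This", "some", "It", "contains"}
print(reject_known(text, words))
# is Text stuff like Commata colon or Period

print(reject_known("Köln hat 1086000 Einwohner", {"Köln", "1086000"}))
# hat Einwohner
```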

Regarding the objection: my first idea is to adapt the scan so that it does not only search for words but uses something like this instead:
[src=ruby]text.scan(/[a-zA-Z{}|]+/)[/src]
I have not tested this, but it should then match "{{Anarchism|Read more}}" as a whole and consequently not find it in the word list.

Maybe I will find a quiet hour and build something.
For that, however, you would need to give me representative input files, the lists to match against and, very importantly, reference output files that show me what a correct result should look like.

Since I am currently learning some Python, I might even write it in Python.
Even though I find Ruby considerably more elegant.

theSplit · 27 Apr. 2019

@BurnerR: I am currently working on a "builder" variant that reads the outputs and produces a file for comparison with the original. I had not tried that before.

Long story short: I would recommend cloning the project and running it once yourself with the current settings on the input enwiki dump (unpacked .xml). The reason is that the analysis data including the output adds up to more than 3 GB (and ends up in memory like that). I cannot and do not want to offer that for download or host it.

With that, the data structure and what could be detected and removed, and how, become much clearer than by splitting it up here. I will still gladly answer questions.

What should be left at the end? Nothing at all, provided everything was recognized correctly. So 0 data left (that is the first idea) = 100% match.

Also, to simplify processing, the output data is now tab-separated, which should make reading it much easier: a tab is presumably not contained in the text itself, and even if it were, this would be easy to handle because string values are always the last entries in the output.

--- [2019-04-27 18:24 CEST] Automatically merged post ---

I am posting a rough draft of the builder here (not yet live on GitHub) so that you can see that things are moving forward.

[src=c]//------------------------------------------------------------------------------
// Author: Jan Rie <jan@dwrox.net>
//------------------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <math.h>

//------------------------------------------------------------------------------

// Switches
#define BEVERBOSE false
#define DEBUG false
#define DOWRITEOUT true
#define LINETOPROCESS 0

#define OUTPUTFILE "output.txt"
#define DICTIONARYFILE "words.txt"
#define WIKITAGSFILE "wikitags.txt"
#define XMLTAGFILE "xmltags.txt"
#define XMLDATAFILE "xmldata.txt"
#define ENTITIESFILE "entities.txt"

// Buffers
#define ENTRYBUFFER 5120
#define XMLTAG_BUFFER 128

// Counts of predefined const datatypes
#define FORMATS 8
#define INDENTS 3
#define TEMPLATES 11
#define TAGTYPES 11
#define TAGCLOSINGS 3
#define MATHTAG 0

//------------------------------------------------------------------------------
/*
Wikipedia data types and such

See for wikitags
https://en.wikipedia.org/wiki/Help:Wiki_markup
https://en.wikipedia.org/wiki/Help:Cheatsheet

Reference for templates:
https://en.wikipedia.org/wiki/Help:Wiki_markup
*/

const char formatNames[FORMATS][15] = {
"Bold + Italic",
"Bold",
"Italic",
"Heading 6",
"Heading 5",
"Heading 4",
"Heading 3",
"Heading 2"
};

const char formats[FORMATS][7] = {
"'''''", // bold + italic
"'''", // bold
"''", // italic
"======", // Heading 6 - heading start
"=====",
"====",
"===",
"==" // Heading 2 - heading end
};

/*
NOTE: Indents are only used at the beginning of the line!
*/
const char indents[INDENTS][2] = {
"*", // List element, multiple levels, "**" Element of element
"#", // Numbered list element, multiple levels = "##" "Element of element"
":", // "indent 1", multiple levels using ":::" = 3
};

const char templates[TEMPLATES][18] = {
// NOTE: Should we cover template tags as well for shortening?
"{{-",
"{{align",
"{{break",
"{{clear",
"{{float",
"{{stack",
"{{outdent",
"{{plainlist",
"{{fake heading",
"{{endplainlist",
"{{unbulleted list"
};

const char wikiTagNames[TAGTYPES][19] = {
"Math type",
"Definition/Anchor",
"Table",
"Category",
"Media type",
"File type",
"Image type",
"Sound type",
"Wiktionary",
"Wikipedia",
"Link"
};

const char elementTypes[TAGTYPES][18] = {
// NOTE: Handle definitions after, in case we treat templates too
"{{math", // https://en.wikipedia.org/wiki/Wikipedia:Rendering_math
"{{", // Definition => {{Main|autistic savant}}
"{\t", // Table start => ! (headline), |- (separation, row/border), | "entry/ column data"
"[[category:",
"[[media:", // Media types start
"[[file:",
"[[image:", // [[Image:LeoTolstoy.jpg|thumb|150px|[[Leo Tolstoy|Leo Tolstoy]] 1828-1910]]
"[[sound:", // Media types end
//"#REDIRECT", // Redirect #REDIRECT [[United States]] (article) --- #REDIRECT [[United States#History]] (section)
"[[wiktionary:", // [[wiktionary:terrace|terrace]]s
"[[wikipedia:", // [[Wikipedia:Nupedia and Wikipedia]]
"[[" // Link => [[Autistic community#Declaration from the autism community|sent a letter to the United Nations]]
};

const char tagClosingsTypes[TAGCLOSINGS][3] = {
"}}",
"]]",
"|}"
};

//------------------------------------------------------------------------------
// Function declarations
typedef struct xmldatakv {
char *key;
char *value;
} xmldatakv;

typedef struct entry {
unsigned short elementType; // 0 XMLTAG, 1 WIKITAG, 2 WORD, 3 ENTITY
unsigned int dataReaderIndex;
unsigned int position;
unsigned int start;
unsigned int end;
unsigned int preSpacesCount;
unsigned int spacesCount;
unsigned int length;
unsigned int tagLength;
unsigned int readerPos;
short tagType;
short dataFormatType;
short ownFormatType;
bool formatStart;
bool formatEnd;
bool isDataNode;
bool isClosed;
short xmlDataCount;
struct xmldatakv *xmlData;
char *stringData;
} entry;

typedef struct collection {
FILE *outputFile;
FILE *dictFile;
FILE *wtagFile;
FILE *xmltagFile;
FILE *xmldataFile;
FILE *entitiesFile;
unsigned int count;
struct entry *entries;
} collection;

bool readXMLtag(struct collection*);
bool readWord(struct collection*);
bool readEntity(struct collection*);
bool readWikitag(struct collection*);
bool freeEntry(struct collection*, struct entry*);

bool freeEntry(struct collection *readerData, struct entry *entryData) {
for (unsigned int i = 0; i < readerData->count; ++i) {
if (&readerData->entries[i] != entryData) continue;

// Release everything the entry owns (free(NULL) is a no-op)
free(readerData->entries[i].stringData);

for (int j = 0; j < readerData->entries[i].xmlDataCount; ++j) {
free(readerData->entries[i].xmlData[j].key);
free(readerData->entries[i].xmlData[j].value);
}

free(readerData->entries[i].xmlData);

// Close the gap unless the removed entry was the last one
if (i < readerData->count - 1) {
memmove(&readerData->entries[i], &readerData->entries[i + 1], sizeof(struct entry) * (readerData->count - i - 1));
}

// The array is not shrunk here; the next realloc in a read function resizes it
--readerData->count;
return true;
}

return false;
}

//------------------------------------------------------------------------------
// Main routine
int main(int argc, char *argv[]) {
printf("[ INFO ] Starting transpiling on file \"%s\".\n", OUTPUTFILE);

FILE *outputFile = fopen(OUTPUTFILE, "wb");
FILE *dictFile = fopen(DICTIONARYFILE, "rb");
FILE *wtagFile = fopen(WIKITAGSFILE, "rb");
FILE *xmltagFile = fopen(XMLTAGFILE, "rb");
FILE *xmldataFile = fopen(XMLDATAFILE, "rb");
FILE *entitiesFile = fopen(ENTITIESFILE, "rb");

struct collection readerData = {outputFile, dictFile, wtagFile, xmltagFile, xmldataFile, entitiesFile, 0, NULL};
// 0 XMLTAG, 1 WIKITAG, 2 WORD, 3 ENTITY

bool hasXMLTag = false;
bool hasWikiTag = false;
bool hasWord = false;
bool hasEntity = false;

hasXMLTag = readXMLtag(&readerData);
hasWikiTag = readWikitag(&readerData);
hasWord = readWord(&readerData);
hasEntity = readEntity(&readerData);

printf("0: %s\n", readerData.entries[0].stringData);
printf("1: %s\n", readerData.entries[1].stringData);
printf("2: %s\n", readerData.entries[2].stringData);
freeEntry(&readerData, &readerData.entries[2]);
printf("2: %s\n", readerData.entries[2].stringData);
freeEntry(&readerData, &readerData.entries[2]);
printf("1: %s\n", readerData.entries[1].stringData);

if (hasXMLTag) {

}

// time_t startTime = time(NULL);

// Cleanup
fclose(readerData.outputFile);
fclose(readerData.dictFile);
fclose(readerData.wtagFile);
fclose(readerData.xmltagFile);
fclose(readerData.xmldataFile);
fclose(readerData.entitiesFile);

for (int i = 0; i < readerData.count; ++i) {
if (readerData.entries[i].stringData) free(readerData.entries[i].stringData);

for (int j = 0; j < readerData.entries[i].xmlDataCount; ++j) {
if (readerData.entries[i].xmlData[j].key) free(readerData.entries[i].xmlData[j].key);
if (readerData.entries[i].xmlData[j].value) free(readerData.entries[i].xmlData[j].value);
}

if (readerData.entries[i].xmlData) free(readerData.entries[i].xmlData);
}

free(readerData.entries);
return 0;
}

//------------------------------------------------------------------------------

bool readWikitag(struct collection *readerData) {
if ((readerData->entries = (struct entry *)realloc(readerData->entries, sizeof(struct entry) * (readerData->count + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entries[readerData->count];
element->elementType = 1;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->dataReaderIndex = readerData->count;
++readerData->count;

returnValue = fscanf(readerData->wtagFile, "%u\t%u\t%u\t%d\t%d\t%hi\t%hi\t%hi\t%d\t%d\t%d\t%d\t", &element->position, &element->start, &element->readerPos, &element->preSpacesCount, &element->spacesCount, &element->tagType, &element->dataFormatType, &element->ownFormatType, (int *)(bool *)&element->formatStart, (int *)(bool *)&element->formatEnd, &element->tagLength, &element->length);

if (returnValue == EOF)
return false;

element->stringData = malloc(sizeof(char) * (element->length + 1));
returnValue = fscanf(readerData->wtagFile, "%s\n", element->stringData);

if (returnValue == EOF)
return false;

return true;
}

//------------------------------------------------------------------------------

bool readXMLtag(struct collection *readerData) {
if ((readerData->entries = (struct entry *)realloc(readerData->entries, sizeof(struct entry) * (readerData->count + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entries[readerData->count];
element->elementType = 0;
element->stringData = NULL;
element->start = 0;
element->end = 0;
element->isClosed = false;
element->isDataNode = false;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->dataReaderIndex = readerData->count;

returnValue = fscanf(readerData->xmltagFile, "%d\t%d\t%d\t%d\t", &element->start, &element->end, (int *)(bool *)&element->isClosed, (int *)(bool *)&element->isDataNode);

if (returnValue == EOF)
return false;

element->stringData = malloc(sizeof(char) * XMLTAG_BUFFER);

unsigned int start = 0;
unsigned int end = 0;
char tmpKey[128] = "\0";
char tmpValue[1024] = "\0";

unsigned int readBytes = 0;

while (true) {
readBytes = ftell(readerData->xmldataFile);
returnValue = fscanf(readerData->xmldataFile, "%d\t%d\t", &start, &end);
if (returnValue == EOF)
break;

if (start == element->start && end == element->end) {
fscanf(readerData->xmldataFile, "%s\t%s", tmpKey, tmpValue);
element->xmlData = (struct xmldatakv *)realloc(element->xmlData, sizeof(struct xmldatakv) * (element->xmlDataCount + 1));

element->xmlData[element->xmlDataCount].key = malloc(sizeof(char) * (strlen(tmpKey) + 1));
strcpy(element->xmlData[element->xmlDataCount].key, tmpKey);

element->xmlData[element->xmlDataCount].value = malloc(sizeof(char) * (strlen(tmpValue) + 1));
strcpy(element->xmlData[element->xmlDataCount].value, tmpValue);

++element->xmlDataCount;

printf("xmlData =>\t%s\t==>\t%s\n", tmpKey, tmpValue);
} else {
// Rewind to the offset saved before this record was read
fseek(readerData->xmldataFile, readBytes, SEEK_SET);
break;
}
}

returnValue = fscanf(readerData->xmltagFile, "%s\n", element->stringData);
if (returnValue == EOF)
return false;

++readerData->count;
return true;
}

//------------------------------------------------------------------------------

bool readWord(struct collection *readerData) {
if ((readerData->entries = (struct entry *)realloc(readerData->entries, sizeof(struct entry) * (readerData->count + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entries[readerData->count];
element->elementType = 2;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->dataReaderIndex = readerData->count;
++readerData->count;

returnValue = fscanf(readerData->dictFile, "%u\t%u\t%u\t%u\t%u\t%u\t%hi\t%hi\t%d\t%d\t%d\t%d\t", &element->position, &element->start, &element->readerPos, &element->preSpacesCount, &element->spacesCount, &element->length, &element->dataFormatType, &element->ownFormatType, (int*)(bool*)&element->formatStart, (int*)(bool*)&element->formatEnd, (int*)(bool*)&element->isDataNode, (int*)(bool*)&element->isClosed);

if (returnValue == EOF)
return false;

element->stringData = malloc(sizeof(char) * (element->length + 1));
returnValue = fscanf(readerData->dictFile, "%s\n", element->stringData);

if (returnValue == EOF)
return false;

return true;
}

//------------------------------------------------------------------------------

bool readEntity(struct collection *readerData) {
if ((readerData->entries = (struct entry *)realloc(readerData->entries, sizeof(struct entry) * (readerData->count + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entries[readerData->count];
element->elementType = 3;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->dataReaderIndex = readerData->count;
++readerData->count;

returnValue = fscanf(readerData->entitiesFile, "%u\t%u\t%u\t%d\t%d\t%hi\t%hi\t%d\t%d\t", &element->position, &element->start, &element->readerPos, &element->preSpacesCount, &element->spacesCount, &element->dataFormatType, &element->ownFormatType, (int *)(bool *)&element->formatStart, (int *)(bool *)&element->formatEnd);

if (returnValue == EOF)
return false;

element->stringData = (char *)realloc(element->stringData, sizeof(char) * 24);
returnValue = fscanf(readerData->entitiesFile, "%s\n", element->stringData);

if (returnValue == EOF)
return false;

return true;
}

//------------------------------------------------------------------------------

[/src]

BurnerR · 28 Apr. 2019

Maybe I will take a look at it.
But it sounds as if I would have to work through C code, then compile and run it, look at the result and then somehow define a target. I am not sure I feel like doing that.
I would rather work on a concrete task for processing the files with Python.

theSplit · 6 May 2019

@BurnerR - here is the raw data from the analysis, packed: https://dwrox.net/wicked_rawdata.tar.gz
Maybe you can do more with that?

Here is also the current state of a file builder that works from the extracted data. This version is not yet 100% error-free and not optimal; I am still debugging a memory leak or memory error....

Maybe someone would like to take a closer look at the code, especially at whether and how the
[src=c]while (readWord(&readerData, WORD)) { if (........) break; }[/src] loop can be optimized away, improved or removed.

Another idea of mine is to sort the data before the back-translation, so that all data is listed consecutively in the correct order and no look-ahead search over X entries/elements within the file is needed.
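The sorting idea could be tried out in Python before changing the C builder. The sketch below assumes that line number and position together define the document order, which the thread does not confirm; the records are invented tuples of (line number, position, kind, data), and `in_document_order` is a hypothetical helper.

```python
from itertools import chain
from operator import itemgetter

def in_document_order(*record_lists):
    """Combine records from several output files and sort them by (line, position)."""
    return sorted(chain(*record_lists), key=itemgetter(0, 1))

# Invented sample records: (line_num, position, kind, data)
words = [(1, 0, "word", "Anarchism"), (1, 3, "word", "is"), (2, 0, "word", "These")]
tags = [(1, 5, "wikitag", "political philosophy")]
entities = [(1, 2, "entity", "&nbsp;")]

for record in in_document_order(words, tags, entities):
    print(record)
```

Once all records are in one ordered sequence, the builder can consume them strictly front to back, which is what would make the look-ahead loop unnecessary.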

[src=c]//------------------------------------------------------------------------------
// Author: Jan Rie <jan@dwrox.net>
//------------------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <math.h>

//------------------------------------------------------------------------------

// Switches
#define DEBUG false
#define VERBOSE false
#define DOWRITEOUT true
#define LINESTOPROCESS 0
#define LOOKAHEADRANGE 2000
#define IDENTATIONBUFFER 33

#define OUTPUTFILE "output.txt"
#define DICTIONARYFILE "words.txt"
#define WIKITAGSFILE "wikitags.txt"
#define XMLTAGFILE "xmltags.txt"
#define XMLDATAFILE "xmldata.txt"
#define ENTITIESFILE "entities.txt"

// Buffers
#define XMLTAG_BUFFER 5120

// Counts of predefined const datatypes
#define FORMATS 8
#define ENTITIES 211
#define INDENTS 3
#define TEMPLATES 11
#define TAGTYPES 11
#define TAGCLOSINGS 3
#define MATHTAG 0

//------------------------------------------------------------------------------
/*
Wikipedia data types and such

See for wikitags
https://en.wikipedia.org/wiki/Help:Wiki_markup
https://en.wikipedia.org/wiki/Help:Cheatsheet

Reference for templates:
https://en.wikipedia.org/wiki/Help:Wiki_markup
*/

const char formatNames[FORMATS][15] = {
"Bold + Italic",
"Bold",
"Italic",
"Heading 6",
"Heading 5",
"Heading 4",
"Heading 3",
"Heading 2"
};

const char formats[FORMATS][7] = {
"'''''", // bold + italic
"'''", // bold
"''", // italic
"======", // Heading 6 - heading start
"=====",
"====",
"===",
"==" // Heading 2 - heading end
};

/*
NOTE: Indents are only used at the beginning of the line!
*/
const char indents[INDENTS][2] = {
"*", // List element, multiple levels, "**" Element of element
"#", // Numbered list element, multiple levels = "##" "Element of element"
":", // "indent 1", multiple levels using ":::" = 3
};

const char templates[TEMPLATES][18] = {
// NOTE: Should we cover template tags as well for shortening?
"{{-",
"{{align",
"{{break",
"{{clear",
"{{float",
"{{stack",
"{{outdent",
"{{plainlist",
"{{fake heading",
"{{endplainlist",
"{{unbulleted list"
};

const char wikiTagNames[TAGTYPES][19] = {
"Math type",
"Definition/Anchor",
"Table",
"Category",
"Media type",
"File type",
"Image type",
"Sound type",
"Wiktionary",
"Wikipedia",
"Link"
};

const char tagTypes[TAGTYPES][18] = {
// NOTE: Handle definitions after, in case we treat templates too
"{{math", // https://en.wikipedia.org/wiki/Wikipedia:Rendering_math
"{{", // Definition => {{Main|autistic savant}}
"{\t", // Table start => ! (headline), |- (separation, row/border), | "entry/ column data"
"[[category:",
"[[media:", // Media types start
"[[file:",
"[[image:", // [[Image:LeoTolstoy.jpg|thumb|150px|[[Leo Tolstoy|Leo Tolstoy]] 1828-1910]]
"[[sound:", // Media types end
//"#REDIRECT", // Redirect #REDIRECT [[United States]] (article) --- #REDIRECT [[United States#History]] (section)
"[[wiktionary:", // [[wiktionary:terrace|terrace]]s
"[[wikipedia:", // [[Wikipedia:Nupedia and Wikipedia]]
"[[" // Link => [[Autistic community#Declaration from the autism community|sent a letter to the United Nations]]
};

const char tagClosingsTypes[TAGCLOSINGS][3] = {
"}}",
"]]",
"|}"
};

//------------------------------------------------------------------------------

typedef struct xmldatakv {
char *key;
char *value;
} xmldatakv;

typedef struct entry {
unsigned short elementType; // 0 XMLTAG, 1 WIKITAG, 2 WORD, 3 ENTITY
unsigned int position;
unsigned int connectedTag;
unsigned int start;
unsigned int end;
unsigned int preSpacesCount;
unsigned int spacesCount;
unsigned int length;
unsigned int tagLength;
short tagType;
short dataFormatType;
short ownFormatType;
short hasPipe;
short isHandledTag;
short isDataNode;
short isClosed;
short xmlDataCount;
int formatStart;
int formatEnd;
struct xmldatakv *xmlData;
char *stringData;
} entry;

typedef struct collection {
FILE *outputFile;
FILE *dictFile;
FILE *wtagFile;
FILE *xmltagFile;
FILE *xmldataFile;
FILE *entitiesFile;
long int dictSize;
long int wtagSize;
long int xmltagSize;
long int xmldataSize;
long int entitiesSize;
long int readDict;
long int readwtag;
long int readxmltag;
long int readxmldata;
long int readentities;
unsigned int countWords;
unsigned int countWtag;
unsigned int countXmltag;
unsigned int countEntities;
struct entry *entriesWords;
struct entry *entriesWtag;
struct entry *entriesXmltag;
struct entry *entriesEntities;
} collection;

enum dataTypes {
XMLTAG = 0,
WIKITAG = 1,
WORD = 2,
ENTITY = 3
};

//------------------------------------------------------------------------------
// Function declarations
bool readXMLtag(struct collection*);
bool readWord(struct collection*);
bool readEntity(struct collection*);
bool readWikitag(struct collection*);

bool freeEntryByLine(struct collection*, short, unsigned int, unsigned int);
struct entry* getEntryByType(struct collection*, short);
struct entry* getInLine(struct collection*, short, unsigned int, unsigned int);

// ----------------------------------------------------------------------------

bool freeEntryByLine(struct collection *readerData, short elementType, unsigned int lineNum, unsigned int position) {
unsigned int count = 0;
struct entry* items = NULL;
if (elementType == XMLTAG) {
count = readerData->countXmltag;
items = readerData->entriesXmltag;
} else if (elementType == WORD) {
count = readerData->countWords;
items = readerData->entriesWords;
} else if (elementType == WIKITAG) {
count = readerData->countWtag;
items = readerData->entriesWtag;
} else {
count = readerData->countEntities;
items = readerData->entriesEntities;
}

if (count == 1) {
if ((items[0].start == lineNum || items[0].end == lineNum) && items[0].position == position) {
if (items[0].stringData) free(items[0].stringData);

if (elementType == XMLTAG) {
for (unsigned int j = 0; j < items[0].xmlDataCount; ++j) {
if (items[0].xmlData[j].key) free(items[0].xmlData[j].key);
if (items[0].xmlData[j].value) free(items[0].xmlData[j].value);
}

if (items[0].xmlData) free(items[0].xmlData);
}

items = (struct entry *) realloc(items, sizeof(struct entry) * count);
--count;

if (elementType == XMLTAG) {
readerData->countXmltag = count;
readerData->entriesXmltag = items;
} else if (elementType == WORD) {
readerData->countWords = count;
readerData->entriesWords = items;
} else if (elementType == WIKITAG) {
readerData->countWtag = count;
readerData->entriesWtag = items;
} else {
readerData->countEntities = count;
readerData->entriesEntities = items;
}
return true;
}
}

if (count == 0) return false; // Nothing to search; also avoids unsigned wrap-around in count - 1

// Search from the newest entry backwards. NOTE: index 0 is not visited by this loop.
for (unsigned int i = count - 1; i != 0; --i) {
if ((items[i].start == lineNum || items[i].end == lineNum) && items[i].position == position) {
if (i == count - 1) {
// Match is the last entry: free its data and shrink the count
if (items[i].stringData) free(items[i].stringData);

if (elementType == XMLTAG) {
for (unsigned int j = 0; j < items[i].xmlDataCount; ++j) {
if (items[i].xmlData[j].key) free(items[i].xmlData[j].key);
if (items[i].xmlData[j].value) free(items[i].xmlData[j].value);
}

if (items[i].xmlData) free(items[i].xmlData);
}

items = (struct entry *) realloc(items, sizeof(struct entry) * count);
--count;

if (elementType == XMLTAG) readerData->countXmltag = count;
else if (elementType == WORD) readerData->countWords = count;
else if (elementType == WIKITAG) readerData->countWtag = count;
else readerData->countEntities = count;

return true;
} else {
// Match is in the middle: free its data, then shift the following entries down by one
if (items[i].stringData) free(items[i].stringData);

if (elementType == XMLTAG) {
for (unsigned int j = 0; j < items[i].xmlDataCount; ++j) {
if (items[i].xmlData[j].key) free(items[i].xmlData[j].key);
if (items[i].xmlData[j].value) free(items[i].xmlData[j].value);
}

if (items[i].xmlData) free(items[i].xmlData);
}

memmove(&items[i], &items[i + 1], sizeof(struct entry) * (count - i - 1));
--count;

if (elementType == XMLTAG) readerData->countXmltag = count;
else if (elementType == WORD) readerData->countWords = count;
else if (elementType == WIKITAG) readerData->countWtag = count;
else readerData->countEntities = count;
return true;
}
}
}

return false;
}

//------------------------------------------------------------------------------

struct entry* getEntryByType(struct collection *readerData, short elementType) {
if (elementType == XMLTAG) {
if (readerData->countXmltag == 0) return NULL;
return &readerData->entriesXmltag[readerData->countXmltag - 1];
} else if (elementType == WORD) {
if (readerData->countWords == 0) return NULL;
return &readerData->entriesWords[readerData->countWords - 1];
} else if (elementType == WIKITAG) {
if (readerData->countWtag == 0) return NULL;
return &readerData->entriesWtag[readerData->countWtag - 1];
} else {
if (readerData->countEntities == 0) return NULL;
return &readerData->entriesEntities[readerData->countEntities - 1];
}

return NULL;
}

//------------------------------------------------------------------------------

struct entry* getInLine(struct collection *readerData, short elementType, unsigned int lineNum, unsigned int position) {
unsigned int count = 0;
struct entry* items = NULL;
if (elementType == XMLTAG) {
count = readerData->countXmltag;
items = readerData->entriesXmltag;
} else if (elementType == WORD) {
count = readerData->countWords;
items = readerData->entriesWords;
} else if (elementType == WIKITAG) {
count = readerData->countWtag;
items = readerData->entriesWtag;
} else {
count = readerData->countEntities;
items = readerData->entriesEntities;
}

if (count == 0) return NULL;

if (position != 0) {
for (unsigned int i = 0 ; i < count; ++i) {
if (items[i].start == lineNum && items[i].end == lineNum && items[i].position == position) return &items[i];
}
} else {
for (unsigned int i = 0 ; i < count; ++i) {
if (items[i].start == lineNum || items[i].end == lineNum) return &items[i];
}
}

return NULL;
}

//------------------------------------------------------------------------------
// Main routine
int main(int argc, char *argv[]) {
printf("[ INFO ] Starting transpilling on file \"%s\".\n", OUTPUTFILE);

FILE *outputFile = fopen(OUTPUTFILE, "wb");
FILE *dictFile = fopen(DICTIONARYFILE, "rb");
FILE *wtagFile = fopen(WIKITAGSFILE, "rb");
FILE *xmltagFile = fopen(XMLTAGFILE, "rb");
FILE *xmldataFile = fopen(XMLDATAFILE, "rb");
FILE *entitiesFile = fopen(ENTITIESFILE, "rb");

fseek(dictFile, 0, SEEK_END);
fseek(wtagFile, 0, SEEK_END);
fseek(xmltagFile, 0, SEEK_END);
fseek(xmldataFile, 0, SEEK_END);
fseek(entitiesFile, 0, SEEK_END);
fpos_t dictFileSize;
fpos_t wtagFileSize;
fpos_t xmltagFileSize;
fpos_t xmldataFileSize;
fpos_t entitiesFileSize;

fgetpos(dictFile, &dictFileSize);
fgetpos(wtagFile, &wtagFileSize);
fgetpos(xmltagFile, &xmltagFileSize);
fgetpos(xmldataFile, &xmldataFileSize);
fgetpos(entitiesFile, &entitiesFileSize);

struct collection readerData = { outputFile,
dictFile, wtagFile, xmltagFile, xmldataFile, entitiesFile,
dictFileSize.__pos, wtagFileSize.__pos, xmltagFileSize.__pos, xmldataFileSize.__pos, entitiesFileSize.__pos,
0, 0, 0, 0, 0,
0, 0, 0, 0,
NULL, NULL, NULL, NULL
};

rewind(dictFile);
rewind(wtagFile);
rewind(xmltagFile);
rewind(xmldataFile);
rewind(entitiesFile);

time_t startTime = time(NULL);

unsigned int lineNum = 1;
unsigned int position = 1;

bool hasReplaced = false;
char preSpaces[32];
char subSpaces[32];
char indentation[IDENTATIONBUFFER];
int currentDepth = 0;
int previousDepth = currentDepth;
int lastWikiTypePos = -1;
short lastWikiType = -1;

struct entry* tmp = NULL;
unsigned int tmpMax = 0;
struct entry* preTmp = NULL;
struct entry* closeTag = NULL;

readXMLtag(&readerData);
readWikitag(&readerData);
readWord(&readerData);
readEntity(&readerData);

while (LINESTOPROCESS == 0 || lineNum <= LINESTOPROCESS) {
hasReplaced = false;
previousDepth = currentDepth;

#if VERBOSE
printf("\nLINE NUM: %d - %d\n", lineNum, position);
for (int i = 0, j = readerData.countXmltag; i < j; ++i) {
struct entry* item = &readerData.entriesXmltag[i];
printf("%d ::::: %d/%u --- %u - pipe: %d ===> %s\n", item->elementType, item->start, item->end, item->position, item->hasPipe, item->stringData);
}
for (int i = 0, j = readerData.countWtag; i < j; ++i) {
struct entry* item = &readerData.entriesWtag[i];
printf("%d ::::: %d/%u --- %u - pipe: %d ===> %s\n", item->elementType, item->start, item->end, item->position, item->hasPipe, item->stringData);
}
for (int i = 0, j = readerData.countWords; i < j; ++i) {
struct entry* item = &readerData.entriesWords[i];
printf("%d ::::: %d/%u --- %u - pipe: %d ===> %s\n", item->elementType, item->start, item->end, item->position, item->hasPipe, item->stringData);
}
for (int i = 0, j = readerData.countEntities; i < j; ++i) {
struct entry* item = &readerData.entriesEntities[i];
printf("%d ::::: %d/%u --- %u - pipe: %d ===> %s\n", item->elementType, item->start, item->end, item->position, item->hasPipe, item->stringData);
}
#endif

memset(indentation, ' ', IDENTATIONBUFFER);
indentation[currentDepth * 2] = '\0';

if ((tmp = getInLine(&readerData, XMLTAG, lineNum, 0)) != NULL) {
if (tmp->start == lineNum && !tmp->isHandledTag) {
tmp->isHandledTag = true;

#if DEBUG
printf("%s<%s", indentation, tmp->stringData);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s<%s", indentation, tmp->stringData);
#endif

for (int j = 0; j < tmp->xmlDataCount; ++j) {
#if DEBUG
printf(" %s=\"%s\"", tmp->xmlData[j].key, tmp->xmlData[j].value);
#endif

#if DOWRITEOUT
fprintf(outputFile, " %s=\"%s\"", tmp->xmlData[j].key, tmp->xmlData[j].value);
#endif
}

if (!tmp->isDataNode) {
if (tmp->start != tmp->end) {
#if DEBUG
printf("%s", ">\n");
#endif

#if DOWRITEOUT
fputc('\n', outputFile);
#endif
}
} else {
#if DEBUG
printf("%s", ">");
#endif

#if DOWRITEOUT
fputc('>', outputFile);
#endif

if (tmp->end == lineNum) hasReplaced = true;
}

if (tmp->start == tmp->end) closeTag = tmp;
else {
++currentDepth;
memset(indentation, ' ', IDENTATIONBUFFER);
indentation[currentDepth * 2] = '\0';
}
} else if (tmp->end == lineNum) {
closeTag = tmp;
}
}

if ((tmp = getInLine(&readerData, WIKITAG, lineNum, position)) != NULL) {
memset(preSpaces, ' ', sizeof(preSpaces));
memset(subSpaces, ' ', sizeof(subSpaces));
preSpaces[tmp->preSpacesCount] = '\0';
subSpaces[tmp->spacesCount] = '\0';

lastWikiTypePos = tmp->position;
lastWikiType = tmp->tagType;
tmpMax = tmp->start + LOOKAHEADRANGE;

#if DEBUG
printf("%s%s%s", preSpaces, tagTypes[tmp->tagType], tmp->stringData);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s%s%s", preSpaces, tagTypes[tmp->tagType], tmp->stringData);
#endif

if (tmp->hasPipe) {
#if DEBUG
printf("%s", "|");
#endif

#if DOWRITEOUT
fputc('|', outputFile);
#endif

}

while(readWikitag(&readerData)) {
if ((preTmp = getEntryByType(&readerData, WIKITAG)) != NULL && preTmp->start >= tmpMax) break;
}

freeEntryByLine(&readerData, WIKITAG, lineNum, position);
hasReplaced = true;
}

if ((tmp = getInLine(&readerData, WORD, lineNum, position)) != NULL) {
memset(preSpaces, ' ', sizeof(preSpaces));
memset(subSpaces, ' ', sizeof(subSpaces));
preSpaces[tmp->preSpacesCount] = '\0';
subSpaces[tmp->spacesCount] = '\0';

if (tmp->connectedTag != -1) {
lastWikiTypePos = tmp->position;
#if DEBUG
printf("%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif

if (tmp->hasPipe) {
#if DEBUG
printf("%s", "|");
#endif

#if DOWRITEOUT
fputc('|', outputFile);
#endif
}
} else {
#if DEBUG
printf("%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif
}

tmpMax = tmp->start + LOOKAHEADRANGE;

while(readWord(&readerData)) {
if ((preTmp = getEntryByType(&readerData, WORD)) != NULL && preTmp->start >= tmpMax) break;
}

freeEntryByLine(&readerData, WORD, lineNum, position);
hasReplaced = true;
}

if ((tmp = getInLine(&readerData, ENTITY, lineNum, position)) != NULL) {
memset(preSpaces, ' ', sizeof(preSpaces));
memset(subSpaces, ' ', sizeof(subSpaces));
preSpaces[tmp->preSpacesCount] = '\0';
subSpaces[tmp->spacesCount] = '\0';

if (tmp->connectedTag != -1) {
lastWikiTypePos = tmp->position;
#if DEBUG
printf("%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif

if (tmp->hasPipe) {
#if DEBUG
printf("%s", "|");
#endif

#if DOWRITEOUT
fputc('|', outputFile);
#endif
}
} else {
#if DEBUG
printf("%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif
#if DOWRITEOUT
fprintf(outputFile, "%s%s%s", preSpaces, tmp->stringData, subSpaces);
#endif
}

tmpMax = tmp->start + LOOKAHEADRANGE;

while(readEntity(&readerData)) {
if ((preTmp = getEntryByType(&readerData, ENTITY)) != NULL && preTmp->start >= tmpMax) break;
}

freeEntryByLine(&readerData, ENTITY, lineNum, position);
hasReplaced = true;
}

if (lastWikiType != -1 && lastWikiTypePos != -1) {
tmp = getInLine(&readerData, WORD, lineNum, position + 1);
if (!tmp || (tmp && tmp->connectedTag == -1)) {
// TODO: Add table closing here.
char closing[3];
strcpy(closing, lastWikiType > 3 ? tagClosingsTypes[1] : tagClosingsTypes[0]);
#if DEBUG
printf("%s", closing);
#endif
#if DOWRITEOUT
fprintf(outputFile, "%s", closing);
#endif

lastWikiType = -1;
}
}

if (hasReplaced) {
++position;
continue;
}

if ((closeTag = getInLine(&readerData, XMLTAG, lineNum, 0)) != NULL) {
if (closeTag != NULL) {
if (closeTag->isDataNode && closeTag->end == lineNum) {
#if DEBUG
printf("</%s>\n", closeTag->stringData);
#endif

#if DOWRITEOUT
fprintf(outputFile, "</%s>\n", closeTag->stringData);
#endif

freeEntryByLine(&readerData, XMLTAG, lineNum, 0);
} else if (closeTag->end == closeTag->start) {
#if DEBUG
printf(" />\n");
#endif

#if DOWRITEOUT
fprintf(outputFile, " />\n");
#endif

freeEntryByLine(&readerData, XMLTAG, lineNum, 0);
} else if (!closeTag->isDataNode && closeTag->end == lineNum) {
if (currentDepth != 0) --currentDepth;
memset(indentation, ' ', IDENTATIONBUFFER);
indentation[currentDepth * 2] = '\0';

#if DEBUG
printf( "%s</%s>\n", indentation, closeTag->stringData);
#endif

#if DOWRITEOUT
fprintf(outputFile, "%s</%s>\n", indentation, closeTag->stringData);
#endif
freeEntryByLine(&readerData, XMLTAG, lineNum, 0);
} else if (closeTag->start != closeTag->end && closeTag->isDataNode) {
#if DEBUG
printf("%s", "\n");
#endif
#if DOWRITEOUT
fputc('\n', outputFile);
#endif
}

closeTag = NULL;
readXMLtag(&readerData);
}
} else {
#if DEBUG
printf("%s", "\n");
#endif

#if DOWRITEOUT
fputc('\n', outputFile);
#endif

if (currentDepth > 0 && currentDepth == previousDepth) --currentDepth;
}

position = 1;
++lineNum;
}

long int duration = difftime(time(NULL), startTime);
long int durHours = floor(duration / 3600);
long int durMinutes = floor((duration % 3600) / 60);
long int durSeconds = (duration % 3600) % 60;
printf("\n\n[STATUS] TRANSPILLING DATA PROCESS: %ldh %ldm %lds\n", durHours, durMinutes, durSeconds);
// Cleanup
fclose(readerData.outputFile);
fclose(readerData.dictFile);
fclose(readerData.wtagFile);
fclose(readerData.xmltagFile);
fclose(readerData.xmldataFile);
fclose(readerData.entitiesFile);

for (unsigned int i = 0; i < readerData.countXmltag; ++i) {
if (readerData.entriesXmltag[i].stringData) free(readerData.entriesXmltag[i].stringData);

for (unsigned int j = 0; j < readerData.entriesXmltag[i].xmlDataCount; ++j) {
if (readerData.entriesXmltag[i].xmlData[j].key) free(readerData.entriesXmltag[i].xmlData[j].key);
if (readerData.entriesXmltag[i].xmlData[j].value) free(readerData.entriesXmltag[i].xmlData[j].value);
}

if (readerData.entriesXmltag[i].xmlData) free(readerData.entriesXmltag[i].xmlData);
}

for (unsigned int i = 0; i < readerData.countWords; ++i) {
if (readerData.entriesWords[i].stringData) free(readerData.entriesWords[i].stringData);
}

for (unsigned int i = 0; i < readerData.countWtag; ++i) {
if (readerData.entriesWtag[i].stringData) free(readerData.entriesWtag[i].stringData);
}

for (unsigned int i = 0; i < readerData.countEntities; ++i) {
if (readerData.entriesEntities[i].stringData) free(readerData.entriesEntities[i].stringData);
}

free(readerData.entriesXmltag);
free(readerData.entriesWords);
free(readerData.entriesWtag);
free(readerData.entriesEntities);
return 0;
}

//------------------------------------------------------------------------------

bool readXMLtag(struct collection *readerData) {
if (readerData->readxmltag >= readerData->xmltagSize) return false;

if ((readerData->entriesXmltag = (struct entry *) realloc(readerData->entriesXmltag, sizeof(struct entry) * (readerData->countXmltag + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entriesXmltag[readerData->countXmltag];
element->elementType = XMLTAG;
element->stringData = NULL;
element->start = 0;
element->end = 0;
element->isClosed = false;
element->isDataNode = false;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->position = 1;
element->connectedTag = -1;
element->isHandledTag = false;
element->hasPipe = false;

returnValue = fscanf(readerData->xmltagFile, "%u\t%u\t%hi\t%hi\t", &element->start, &element->end, &element->isClosed, &element->isDataNode);

if (returnValue == EOF) return false;

element = &readerData->entriesXmltag[readerData->countXmltag];
element->stringData = malloc(sizeof(char) * XMLTAG_BUFFER);

unsigned int dataCount = 0;
unsigned int start = 0;
unsigned int end = 0;
char tmpKey[256] = "\0";
char tmpValue[1280] = "\0";

if (readerData->readxmldata <= readerData->xmldataSize) {
fpos_t readBytes;
while (true) {
fgetpos(readerData->xmldataFile, &readBytes);
if (readBytes.__pos == readerData->xmldataSize) break;
returnValue = fscanf(readerData->xmldataFile, "%d\t%d\t", &start, &end);
if (returnValue == EOF) break;

if (start == element->start && end == element->end) {
fscanf(readerData->xmldataFile, "%s\t%[^\n]s", tmpKey, tmpValue);
element = &readerData->entriesXmltag[readerData->countXmltag];
element->xmlData = (struct xmldatakv *)realloc(element->xmlData, sizeof(struct xmldatakv) * (dataCount + 1));

element->xmlData[dataCount].key = malloc(sizeof(char) * (strlen(tmpKey) + 1));
strcpy(element->xmlData[dataCount].key, tmpKey);

element->xmlData[dataCount].value = malloc(sizeof(char) * (strlen(tmpValue) + 1));
strcpy(element->xmlData[dataCount].value, tmpValue);
++dataCount;

#if DEBUG
//printf("xmlData =>\t%s\t==>\t%s\n", tmpKey, tmpValue);
#endif
} else {
fsetpos(readerData->xmldataFile, &readBytes);
break;
}
}

element->xmlDataCount = dataCount;
fpos_t pos;
fgetpos(readerData->xmldataFile, &pos);
readerData->readxmldata = pos.__pos;

}

returnValue = fscanf(readerData->xmltagFile, "%[^\n]s\n", element->stringData);
if (returnValue == EOF) return false;

fpos_t pos;
fgetpos(readerData->xmltagFile, &pos);
readerData->readxmltag = pos.__pos; // Track the read position, as the other readers do (was assigned to xmltagSize)

readerData->entriesXmltag[readerData->countXmltag] = *element;
++readerData->countXmltag;
return true;
}

//------------------------------------------------------------------------------

bool readWikitag(struct collection *readerData) {
if (readerData->readwtag >= readerData->wtagSize) return false;

if ((readerData->entriesWtag = (struct entry *)realloc(readerData->entriesWtag, sizeof(struct entry) * (readerData->countWtag + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entriesWtag[readerData->countWtag];
element->elementType = WIKITAG;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->start = 0;
element->end = 0;
element->preSpacesCount = 0;
element->spacesCount = 0;
element->position = 0;
element->connectedTag = -1;

element->tagType = 0;
element->tagLength = 0;
element->dataFormatType = -1;
element->ownFormatType = -1;
element->formatStart = 0;
element->formatEnd = 0;
element->tagLength= 0;
element->isHandledTag = false;
element->hasPipe = false;

returnValue = fscanf(readerData->wtagFile, "%u\t%u\t%d\t%d\t%hi\t%hi\t%hi\t%d\t%d\t%d\t%d\t%hi\t", &element->position, &element->start, &element->preSpacesCount, &element->spacesCount, &element->tagType, &element->dataFormatType, &element->ownFormatType, &element->formatStart, &element->formatEnd, &element->tagLength, &element->length, &element->hasPipe);
if (returnValue == EOF) return false;

element->end = element->start;

element->stringData = malloc(sizeof(char) * (element->length + 1));
returnValue = fscanf(readerData->wtagFile, "%[^\n]s\n", element->stringData);
if (returnValue == EOF) return false;

fpos_t pos;
fgetpos(readerData->wtagFile, &pos);
readerData->readwtag = pos.__pos;

readerData->entriesWtag[readerData->countWtag] = *element;
++readerData->countWtag;
return true;
}

//------------------------------------------------------------------------------

bool readWord(struct collection *readerData) {
if (readerData->readDict >= readerData->dictSize) return false;

if ((readerData->entriesWords = (struct entry *)realloc(readerData->entriesWords, sizeof(struct entry) * (readerData->countWords + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entriesWords[readerData->countWords];
element->elementType = WORD;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->start = 0;
element->end = 0;
element->position = 0;
element->connectedTag = -1;
element->isHandledTag = false;
element->hasPipe = false;

returnValue = fscanf(readerData->dictFile, "%u\t%u\t%d\t%u\t%u\t%u\t%hi\t%hi\t%d\t%d\t%hi\t", &element->position, &element->start, &element->connectedTag, &element->preSpacesCount, &element->spacesCount, &element->length, &element->dataFormatType, &element->ownFormatType, &element->formatStart, &element->formatEnd, &element->hasPipe);

element->end = element->start;

if (returnValue == EOF) return false;

element->stringData = malloc(sizeof(char) * (element->length + 1));
returnValue = fscanf(readerData->dictFile, "%[^\n]s\n", element->stringData);
if (returnValue == EOF) return false;

fpos_t pos;
fgetpos(readerData->dictFile, &pos);
readerData->readDict = pos.__pos;

readerData->entriesWords[readerData->countWords] = *element;
++readerData->countWords;
return true;
}

//------------------------------------------------------------------------------

bool readEntity(struct collection *readerData) {
if (readerData->readentities >= readerData->entitiesSize) return false; // Compare the values, not the addresses of the fields

if ((readerData->entriesEntities = (struct entry *)realloc(readerData->entriesEntities, sizeof(struct entry) * (readerData->countEntities + 1))) == NULL) {
return false;
}

int returnValue = 0;
struct entry *element = &readerData->entriesEntities[readerData->countEntities];
element->elementType = ENTITY;
element->stringData = NULL;
element->xmlData = NULL;
element->xmlDataCount = 0;
element->start = 0;
element->end = 0;
element->position = 0;
element->connectedTag = -1;
element->isHandledTag = false;
element->hasPipe = false;

returnValue = fscanf(readerData->entitiesFile, "%u\t%u\t%d\t%d\t%d\t%hi\t%hi\t%d\t%d\t%hi\t", &element->position, &element->start, &element->connectedTag, &element->preSpacesCount, &element->spacesCount, &element->dataFormatType, &element->ownFormatType, &element->formatStart, &element->formatEnd, &element->hasPipe);

element->end = element->start;

if (returnValue == EOF) return false;

element->stringData = (char*) realloc(element->stringData, sizeof(char) * 24);
returnValue = fscanf(readerData->entitiesFile, "%[^\n]s\n", element->stringData);

if (returnValue == EOF) return false;

fpos_t pos;
fgetpos(readerData->entitiesFile, &pos);
readerData->readentities = pos.__pos;

readerData->entriesEntities[readerData->countEntities] = *element;
++readerData->countEntities;
return true;
}[/src]

theSplit · 11 May 2019

The data has now been written out in sorted form, which took more than an hour.

The sorted data can be found here:
https://dwrox.net/wicked_sorted.tar.gz (660 MB)

The sorted data is also much more convenient to work with.

The project now also contains the databuilder, which handles rebuilding the file, though still without formatting of the kind known from Markdown:

This commit in particular:
https://github.com/jrie/wicked/commit/7155b7bfbde385a294aa45fe748ecb8b73b1c6a8

exomo · 11 May 2019

Why are you actually writing the code in C?
The code is unreadable at best and, at worst, full of bugs that cannot be spotted because it is so hard to read.
Which compiler are you using? I cannot build your code; the .__pos constructs are not valid. What are they supposed to do?

I don't want to say that you shouldn't use C, but other programming languages make a lot of things easier. Manual memory management in particular offers plenty of room for errors. Regardless of the language, you should structure your code better. Your functions are far too long, and the program flow is barely recognizable.

One more remark: you open the text files in binary mode. I assume that is for the size calculation and your checks for whether you have reached the end, but in binary mode you quickly run into problems with line endings when the program runs on a different system. I would rather open the text files in text mode and rely on EOF to detect the end.

theSplit · 11 May 2019

@exomo: [kw]fpos_t[/kw] (link) values are returned by [kw]fgetpos[/kw] (link) and hold the position in the file in "__pos".

See also: https://en.cppreference.com/w/c/io/fgetpos

The code is written in C because I am fairly comfortable with it. Any other language would certainly work too. I don't see memory management as a problem: errors can occur, but the code is tested for them extensively with Valgrind's Memcheck.

That is also why I linked the sorted data earlier, so that the raw data can be processed in another language. I had previously worked in Python as well, for example.

Regarding EOF, I once heard that the check is supposed to be very expensive. Hence the file position counters.

The code is so unstructured because I make decisions character by character within the text flow. I assume you mean the large, deeply nested functions in [kw]wicked.c[/kw]?

Edit:
Compiler: gcc version 8.3.0

exomo · 12 May 2019

theSplit wrote:
[kw]fpos_t[/kw] (link) values are returned by [kw]fgetpos[/kw] (link) and hold the position in the file in "__pos".

The documentation says nothing about what fpos_t actually is, and it seems to me that it is an "implementation-defined" data type. With my compiler (mingw gcc 8.1) it is defined as int64. The documentation also says that you should do nothing with it except pass it back to an fsetpos call. The correct approach would be to define collection.dictSize etc. as fpos_t as well; then you would not need __pos and every compiler should accept it. However, the comparison with readentities would then probably no longer work. fpos_t is not designed for comparing against the end of the file.

theSplit wrote:
but the code is tested for them extensively with Valgrind's Memcheck

That is still no guarantee that everything is correct.

theSplit wrote:
Regarding EOF, I once heard that the check is supposed to be very expensive. Hence the file position counters.

"I once heard" is a rather poor argument when it comes to performance in C (or anywhere else). Either you have measured it, or you have a reference to a document by someone else who measured it.
My guess (though I have not measured it either) is that fgetpos and fsetpos are at least as expensive, and you use those a lot.

theSplit wrote:
The code is so unstructured because I make decisions character by character within the text flow. I assume you mean the large, deeply nested functions in wicked.c

Yes, both wicked and unwicked have very long functions. Parsers are not exactly among the easiest things to implement, but I do think that parts of them can be moved into functions.

theSplit · 12 May 2019

Thanks for your feedback, exomo. I think what I still had in the back of my mind about EOF is this: the right way is to check the character that was read against EOF, and not to add an extra [kw]feof(....file....)[/kw] call on top (that was probably the performance killer), because the reading functions generally report EOF themselves.

I have also optimized the rebuilder a bit, so output now takes roughly 5-8 minutes less, with a runtime of 70 minutes for 70,000 lines. But with 890 thousand lines it will take forever, even though the data can be read and processed in sorted order.

exomo wrote:
The correct approach would be to define collection.dictSize etc. as fpos_t as well; then you would not need __pos and every compiler should accept it.

That does not work: the compiler does not accept declaring the fields in struct collection as "fpos_t".

BurnerR · 12 May 2019

theSplit wrote:
@BurnerR - here is the raw data from the analysis, packed: https://dwrox.net/wicked_rawdata.tar.gz
Maybe you can do more with that?

No, unfortunately not.
And it is perfectly fine if you would rather not go to the trouble; I am really only replying to rule out that we are talking past each other.

What I would enjoy is solving this as a Python exercise with a concrete problem statement.
That is, along the lines of what you already sketched in post #3.
For that, though, I would need a list of exactly what the program is supposed to do. Something at the level of 'Given is list X, containing entries that look like this: ... Also given is text file Y, which is to be searched according to the following four rules: ...'

But things like working my way into C code... I would not do that voluntarily even for money.

theSplit · 12 May 2019

Alright, let me break the whole thing down a bit.
I will use the sorted data sets here, though, because in my opinion they are easier to work with.

Basically, the data is split across the following text files:
1) [kw]xmltags_sort[/kw] - contains all XML tags/nodes that were found, without their data, only their names (more on that in a moment).
2) [kw]xmldata_sort[/kw] - contains all XML tag data attributes of the nodes
3) [kw]words_sort[/kw] - all words that were found
4) [kw]wikitags_sort[/kw] - all wikitags (links, math tags, references, headings) - with their main key (any other attributes are stored as words, unless they are tags themselves, in which case they are stored as tags)
5) [kw]entities_sort[/kw] - all entities, i.e. HTML umlauts, special characters, mathematical symbols and many more.

How is the data structured? (All fields are separated by a tab, similar to a CSV but with tabs as the delimiter.)
1) XMLTags

- LineStart (unsigned int)
- LineEnd
- IsClosed (bool)
- isWordNode (bool)
- XMLnode name (string)

2) XMLData

- LineStart
- LineEnd
- Key (string)
- Value (string)
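
A minimal sketch of how the two XML record types listed above could be read in Python. The class and function names are made up for illustration, and it assumes that the numbers are written in decimal and that the bool fields are stored as "0"/"1"; neither is confirmed by the field list, so adjust as needed.

```python
from typing import NamedTuple


class XmlTag(NamedTuple):
    line_start: int
    line_end: int
    is_closed: bool
    is_word_node: bool
    name: str


class XmlData(NamedTuple):
    line_start: int
    line_end: int
    key: str
    value: str


def parse_xml_tag(line: str) -> XmlTag:
    # Split at most 4 times so a tab inside the last field would survive.
    start, end, closed, word_node, name = line.rstrip("\n").split("\t", 4)
    # Assumption: bools are written as "0" / "1".
    return XmlTag(int(start), int(end), closed == "1", word_node == "1", name)


def parse_xml_data(line: str) -> XmlData:
    start, end, key, value = line.rstrip("\n").split("\t", 3)
    return XmlData(int(start), int(end), key, value)
```

A line with too few fields raises a ValueError during unpacking, which makes malformed rows visible right away instead of silently shifting columns.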

That was the XML, now the other types:
3) Words:

- Position (within the line): 1 or 2 or 12 or X etc.
- Line number
- Wikitag (as a position if inside a wikitag, otherwise -1)
- PreSpaces (number of spaces before a word)
- Spaces (number of spaces after a word)
- Word length
- Data format type (bold, italic, heading)
- Own format type (e.g. a word inside a heading could also be bold or italic, which applies to this word only)
- isFormatStart (bool) => is this the start of a formatting span?
- isFormatEnd (bool) => is this the end of a formatting span?
- hasPipe - is introduced by "|" (inside a wikitag)
- Word (string)
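
The twelve word fields above can be mapped onto a small record type. This is only a sketch: the `Word` class and `parse_word` are hypothetical names, and it assumes decimal numbers, integer codes for the two format types and "0"/"1" for the bool fields.

```python
from dataclasses import dataclass


@dataclass
class Word:
    position: int
    line: int
    wikitag: int          # position of the enclosing wikitag, or -1
    pre_spaces: int
    spaces: int
    length: int
    data_format: int      # assumption: format types are integer codes
    own_format: int
    is_format_start: bool
    is_format_end: bool
    has_pipe: bool
    text: str


def parse_word(line: str) -> Word:
    fields = line.rstrip("\n").split("\t", 11)
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    numbers = [int(f) for f in fields[:8]]
    flags = [f == "1" for f in fields[8:11]]  # assumption: "0" / "1"
    return Word(*numbers, *flags, fields[11])
```

Since the word length is stored separately, comparing it with `len(word.text)` gives a cheap consistency check while reading the file.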

4) Wikitags:

- Position
- Line
- Wikitag (as a position if contained in another wikitag, otherwise -1)
- PreSpaces
- Spaces
- Wikitag type
- Data format type
- Own format
- isFormatStart (bool)
- isFormatEnd (bool)
- Wikitag length (characters)
- Wikitag length (title)
- hasPipe - is introduced by "|" (inside a wikitag)
- Title (string)

5) Entities:

- Position
- Line
- Wikitag (as a position if contained in another wikitag, otherwise -1)
- PreSpaces
- Spaces
- Wikitag type
- Data format type
- Own format
- isFormatStart (bool)
- isFormatEnd (bool)
- Wikitag length (characters)
- Wikitag length (title)
- hasPipe - is introduced by "|" (inside a wikitag)
- Value (entity HTML shorthand)
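
Wikitags and entities share the same thirteen leading columns and differ only in the last one (title vs. value), so one reader could serve both. A rough sketch under the assumption that all leading columns are decimal integers (bools as 0/1); the column names and `parse_tagged` are invented for this example.

```python
# Leading columns shared by wikitags_sort and entities_sort, in file order.
COLUMNS = (
    "position", "line", "wikitag", "pre_spaces", "spaces",
    "wikitag_type", "data_format", "own_format",
    "is_format_start", "is_format_end",
    "tag_length", "title_length", "has_pipe",
)


def parse_tagged(line: str, last_name: str) -> dict:
    """Parse one wikitag or entity row; last_name is 'title' or 'value'."""
    *head, last = line.rstrip("\n").split("\t", len(COLUMNS))
    if len(head) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS) + 1} fields")
    record = {name: int(raw) for name, raw in zip(COLUMNS, head)}
    record[last_name] = last
    return record
```

Calling `parse_tagged(row, "title")` for wikitags and `parse_tagged(row, "value")` for entities keeps the two files on one code path.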

And here are excerpts of the data as downloads in the forum.
The first 1000 entries/lines of each:

View attachment xmltags_sort_1k.txt
View attachment xmldata_sort_1k.txt
View attachment words_sort_1k.txt
View attachment wikitags_sort_1k.txt
View attachment entities_sort_1k.txt

And here are 1000 lines from the raw data/the original* (the raw data is not sorted! Please ignore the "sort" in the file name.)
View attachment rawdata_sort_1k.txt

What is supposed to be done with this? Well, the text (rawdata_sort_1k) is to be reconstructed from the sorted data, as well as possible. Small deviations from the original would be okay, but for the most part it should be possible to produce the end result from the excerpts.
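
One way the reconstruction could look for plain words, as a rough sketch only: group the records by line number, order them by position, and pad each word with its PreSpaces and Spaces counts. It assumes that the position only needs to define the order within a line and that the two space counts fully describe the whitespace; `rebuild_lines` and the tuple layout are made up for this example.

```python
from collections import defaultdict


def rebuild_lines(tokens):
    """tokens: iterable of (line, position, pre_spaces, spaces, text).

    Returns a dict mapping line number -> reconstructed text.
    """
    by_line = defaultdict(list)
    for line, position, pre, post, text in tokens:
        by_line[line].append((position, pre, post, text))

    result = {}
    for line, parts in by_line.items():
        parts.sort(key=lambda part: part[0])  # order by position in the line
        result[line] = "".join(
            " " * pre + text + " " * post for _, pre, post, text in parts
        )
    return result


tokens = [
    (1, 2, 0, 0, "world"),
    (1, 1, 0, 1, "Hello"),
    (2, 1, 2, 0, "indented"),
]
print(rebuild_lines(tokens))
```

XML tags, wikitags and entities would have to be woven in by the same line/position ordering; this sketch leaves them out.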

In other words, the sort files contain the building blocks from which the end result is to be assembled.
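
If each sort file really is ordered by line number and then by position (an assumption, the exact sort key is not stated here), the building blocks can be streamed in document order without loading everything into memory. A small sketch using `heapq.merge` from the standard library; the `(line, position, kind, text)` tuple layout is hypothetical.

```python
import heapq


def merge_sorted(*streams):
    """Lazily merge already-sorted record streams by (line, position)."""
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))


words = [(1, 1, "word", "Hello"), (1, 3, "word", "world")]
entities = [(1, 2, "entity", "&amp;"), (2, 1, "entity", "&lt;")]

for record in merge_sorted(words, entities):
    print(record)
```

`heapq.merge` only keeps one pending record per input stream, so the same function works on generators that read the files line by line.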

My first approach was to replace the items in a copy of the original, but that ran into a few difficulties as well, as outlined earlier.
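
The premature replacement happens because [kw]str.replace[/kw] removes the first occurrence it finds, wherever that is. A sketch of a position-based alternative, assuming a 0-based character offset is known for each token (how the stored Position maps to such an offset is left open here); `remove_at` is a made-up helper.

```python
def remove_at(line: str, offset: int, token: str) -> str:
    """Remove token from line, but only at the given character offset."""
    end = offset + len(token)
    if line[offset:end] != token:
        raise ValueError(f"{token!r} not found at offset {offset}")
    return line[:offset] + line[end:]


line = "Text xt end"
# str.replace hits the "xt" inside "Text" first:
print(line.replace("xt", "", 1))
# The position-based variant removes only the standalone "xt":
print(remove_at(line, 5, "xt"))
```

The explicit check also reports tokens that are not where the data says they should be, which helps when validating the output.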

theSplit · 18 May 2019

Here is a bug-fixed, sorted data set: https://dwrox.net/wicked_sorted.tar.bz2
The previous version had a bug where entities inside wikitags were positioned incorrectly or not at all, so less contiguous data was found (rewicked.c).

In addition, the position, line number and space count values are now all written in hex, which saves another few hundred MB in the data files and has next to no direct overhead - if anything, it has become faster.
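
For anyone reading the newer files from Python, the hex fields can be converted with the built-in `int`. A tiny sketch, assuming the values are written as plain hex digits; the helper names are invented.

```python
def parse_hex(field: str) -> int:
    # int(..., 16) accepts "1f" as well as "0x1f".
    return int(field, 16)


def to_hex(value: int) -> str:
    # Lower-case hex without a "0x" prefix.
    return format(value, "x")


line_number = 890216
encoded = to_hex(line_number)
print(encoded, len(str(line_number)), len(encoded))
```

The saving per field is small (here five characters instead of six), but it adds up across several numeric columns per row.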

--- [2019-05-18 07:08 CEST] Automatically merged post ---

@exomo: I have also addressed your points, so no more binary data and no more fgetpos.
