• Hallo liebe Userinnen und User,

    nach bereits längeren Planungen und Vorbereitungen sind wir nun von vBulletin auf Xenforo umgestiegen. Die Umstellung musste leider aufgrund der Serverprobleme der letzten Tage notgedrungen vorverlegt werden. Das neue Forum ist soweit voll funktionsfähig, allerdings sind noch nicht alle der gewohnten Funktionen vorhanden. Nach Möglichkeit werden wir sie in den nächsten Wochen nachrüsten. Dafür sollte es nun einige der Probleme lösen, die wir in den letzten Tagen, Wochen und Monaten hatten. Auch der Server ist nun potenter als bei unserem alten Hoster, wodurch wir nun langfristig den Tank mit Bytes vollgetankt haben.

    Anfangs mag die neue Boardsoftware etwas ungewohnt sein, aber man findet sich recht schnell ein. Wir wissen, dass ihr alle Gewohnheitstiere seid, aber gebt dem neuen Board eine Chance.
    Sollte etwas der neuen oder auch gewohnten Funktionen unklar sein, könnt ihr den "Wo issn da der Button zu"-Thread im Feedback nutzen. Bugs meldet ihr bitte im Bugtracker, es wird sicher welche geben die uns noch nicht aufgefallen sind. Ich werde das dann versuchen, halbwegs im Startbeitrag übersichtlich zu halten, was an Arbeit noch aussteht.

    Neu ist, dass die Boardsoftware deutlich besser für Mobiltelefone und diverse Endgeräte geeignet ist und nun auch im mobilen Style alle Funktionen verfügbar sind. Am Desktop findet ihr oben rechts sowohl den Umschalter zwischen hellem und dunklem Style. Am Handy ist der Hell-/Dunkelschalter am Ende der Seite. Damit sollte zukünftig jeder sein Board so konfigurieren können, wie es ihm am liebsten ist.


    Die restlichen Funktionen sollten eigentlich soweit wie gewohnt funktionieren. Einfach mal ein wenig damit spielen oder bei Unklarheiten im Thread nachfragen. Viel Spaß im ngb 2.0.

Hilfestellung bei DNA Analyse

drouwocu

Neu angemeldet

Registriert
27 Juli 2015
Beiträge
6
Hallo,
ich hab mir mal ein paar Tage lang den Kopf zerbrochen und hab nun die Lösung.
Das Ergebnis hab ich angehängt, da sehr umfangreich.
Anhang anzeigen ausgabe.txt
Ich hab das Ganze in C++ umgesetzt.
Falls Interesse am Quellcode, bitte sagen.
 

Roin

Freier Denker

Registriert
22 Juli 2013
Beiträge
581
  • Thread Starter Thread Starter
  • #42
@drouwocu:
Hi,
deine Ausgabe verrät viel, doch wie kommt es, dass du nicht auf den BE-Baustein am Anfang kommst aber ich schon?
Ansonsten stimmen unsere Ergebnisse ja annähernd überein. (Geschlechtsbaustein passt durch meinen BE-Baustein wohl nicht.)

Ich hatte mir auch zwischenzeitlich überlegt, die Sequenzen so ähnlich wie du mit Füllzeichen zu füllen und somit besser die Positionen ermitteln zu können.
War aber meiner Ansicht nach nicht mehr sinnvoll. Allerdings sehe ich gerade, dass ich ggf. durch eine Überprüfung nach Teilsequenzenlängen mein Problem lösen könnte - muss ich mal demnächst schauen.

Ich wäre allerdings an deinem Quellcode interessiert.

EDIT:
Ich kriege nun auch eine richtige Lösung raus.
Ich werde in den nächsten Tagen meinen Quellcode und die Ausgabe etwas aufhübschen.
Ich bin aber erstmal mit dem Projekt durch.
Vielen Dank für die vielen Ratschläge.

Wenn ich alles fertig habe, lade ich den Code, inklusive ungefährer Herangehnsweise von mir hier hoch.
 
Zuletzt bearbeitet:

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
@drouwocu:

Ich würde mir deinen Quellcode gerne ansehen :T
Zumal ich gespannt bin wie du das Problem, Paare bzw. Bausteine zu bilden, gelöst hast. :)

Mein Ansatz in Python scheint nur bedingt zu funktionieren oder ist nicht ausgereift ...
 

drouwocu

Neu angemeldet

Registriert
27 Juli 2015
Beiträge
6
@Roin

Ich bin auf die Bausteine gekommen, da ich nur nach Bausteinen gesucht habe, die in allen 100 Sequenzen gleich oft vorkommen. Verwendet man weniger, als 100 Sequenzen, kann es zu Fehlern kommen.

@all
Hier der Code:
Anhang anzeigen dna.zip
 

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
Schön geschriebener Code, ich hab mich bisher nur etwas mit C auseinandergesetzt kann deinen Code aber soweit nachvollziehen - macht auch Lust und Laune sich mal mit C++ zu beschäftigen. ;) :T

Ich glaube den Ansatz den du beschreibst, nur die Sequenzen in die Liste aufzunehmen die wirklich überall auftauchen vereinfacht die Suche massiv, sollte ich vielleicht auch mal in eine von meinen Lösungen so implementieren - die Variablen dafür sind schon da gewesen (minSampleOccurence = 2 oder mehr).
 

Roin

Freier Denker

Registriert
22 Juli 2013
Beiträge
581
  • Thread Starter Thread Starter
  • #46
Ich glaube den Ansatz den du beschreibst, nur die Sequenzen in die Liste aufzunehmen die wirklich überall auftauchen vereinfacht die Suche massiv
Das ist auch eine der 8 Aufgaben. (Ich glaube Aufgabe 4).
Mit dieser habe ich allerdings auch "BE" gefunden, ohne zu wissen, dass es kein echter Baustein ist.
Allerdings habe ich meinen Code nun dahingegen mit Fehlerkontrollen ausgestattet, dass dieser Baustein nun aussortiert wird und somit alle Bausteine findet.
 

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
@Roin:

Okay, die Aufgabenstellung hatte ich mir schon länger nicht mehr durchgelesen, dieses Wissen im Hinterkopf zu haben hätte nun den Prozess etwas vereinfacht. Das muß ich überlesen haben oder gar vergessen. :)

Aber es hört sich etwas kompliziert an eine extra Filterroutine für einen Block zu schreiben, zumal das Ergebnis nicht 100% falsch sein muß so fern der Block wie in der Aufgabenstellung in dieser Form überall auftaucht, außer natürlich BE ist ein gesonderter Baustein.

@drouwocu
Eine wirklich minimale Kleinigkeit ist mir noch beim überfliegen deiner "indexOf" Funktion aufgefallen, du könntest die Länge des Strings nur einmal einlesen lassen und danach den Wert als Variable verwenden anstelle in der For-Schleifen Bedingung jedesmal die Länge des Strings mit str.length() neu zu ermitteln.
Was anderes fällt mir spontan nicht auf, aber ich werde den Code nochmal etwas genauer studieren :T

PS: Ärgert mich das ich bisher nicht zu einer akzeptablen Lösung gekommen bin ;)
 

Roin

Freier Denker

Registriert
22 Juli 2013
Beiträge
581
  • Thread Starter Thread Starter
  • #49
@drouwocu:
Deine Lösung ist echt übersichtlich geschrieben, was mich allerdings ein wenig stört ist, dass dein Programm sich etwas stark spezialisiert. Also genau auf diese Testsequenzen ausgerichtet ist.
Du filterst beispielsweise direkt nach U-Sequenzen. Entweder du verallgemeinerst das und suchst nach Buchstaben, die nicht in allen Sequenzen vorkommen und analysierst diese (sofern die Zeichensätze der aktuellen Sequenz länger/umfangreicher ist, als bei anderen Sequenzen) oder du ignorierst es vorerst und findest später die entsprechenden Bausteine.

@theSplit:
Es ist durchaus ein wenig umständlich, doch dadurch, dass ich nun überprüfen kann, ob ein Baustein sicher falsch ist, ist es möglich, dass ich allgemeinere, schnellere Baustein-Ermittlungs-Wege nutzen kann. Allerdings findet meine Funktion nicht zwingend alle falschen Bausteine. Es könnten weiterhin Bausteine aufgenommen werden, die von der Funktion erst ganz am Schluss, wenn eine Sequenz als analysiert gilt ermitteln könnte, ob die Bausteine alle korrekt sind, oder falsche enthalten sind.

Wie gesagt, in einiger Zeit werde ich das Programm aufhübschen und dann hier ebenfalls einstellen.
 

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
Hallo ihr werten Mitstreiter :)

Ich habe nochmal eine neue Variante umgesetzt die folgendes Ergebniss produziert:
DNA data alphabet is:
E F G A D I C H B U

EEE count is: 72
128 times block: EAEBF

Found samples are:
EFFGA (30000/100)
FGA (30000/100)
EAEBF (12800/100)
ACHGD (5000/100)
EFFG (30000/100)
ACHG (5000/100)
AEBF (12800/100)
FFGA (30000/100)
EAEB (12800/100)
CHGD (5000/100)
HGD (5000/100)
EFF (30000/100)
EBF (12800/100)
ACH (5000/100)
AEG (10000/100)
EEE (7200/100)
AED (5000/100)
AEB (12800/100)
CHG (5000/100)
EF (30000/100)
AC (5000/100)
HG (5000/100)
BF (12800/100)
EB (12800/100)
CH (5000/100)

Sample, with same count found: FGA in EFFGA (30000/30000)

Sample, with same count found: EFFG in EFFGA (30000/30000)

Sample, with same count found: FFGA in EFFGA (30000/30000)

Sample, with same count found: EFF in EFFGA (30000/30000)

Sample, with same count found: EF in EFFGA (30000/30000)

Sample, with same count found: AEBF in EAEBF (12800/12800)

Sample, with same count found: EAEB in EAEBF (12800/12800)

Sample, with same count found: EBF in EAEBF (12800/12800)

Sample, with same count found: AEB in EAEBF (12800/12800)

Sample, with same count found: BF in EAEBF (12800/12800)

Sample, with same count found: EB in EAEBF (12800/12800)

Sample, with same count found: ACHG in ACHGD (5000/5000)

Sample, with same count found: CHGD in ACHGD (5000/5000)

Sample, with same count found: HGD in ACHGD (5000/5000)

Sample, with same count found: ACH in ACHGD (5000/5000)

Sample, with same count found: CHG in ACHGD (5000/5000)

Sample, with same count found: AC in ACHGD (5000/5000)

Sample, with same count found: HG in ACHGD (5000/5000)

Sample, with same count found: CH in ACHGD (5000/5000)


Merged samples are:
EFFGA (300/30000/100)
EAEBF (128/12800/100)
ACHGD (50/5000/100)
AEG (100/10000/100)
EEE (72/7200/100)
AED (50/5000/100)

Das ganze ist nun allerdings in C, ich muss sagen, es ist komplizierter aber dadurch das man nicht so dynamisch wie in Python ist muss man etwas anders an das Problem herangehen was die Sache aber auch irgendwo leichter gemacht hat.

Dieses mal hab ich auch den Bogen richtig bekommen die gleich großen Bausteine (die kleinen in die Großen) zu mergen. :cool:

Die Suchreihenfolge für die Bausteine ist vom größeren zum kleinsten mit einer Maximallänge von 21 Zeichen zu Beginn und schrumpft auf eine Minimallänge von 2 Zeichen.

Die Daten dann zu ersetzen wenn das Geschlecht in Frage kommt habe ich allerdings noch nicht, wäre dann der nächste Step wenn der Code wieder mit einer weiteren Funktion ausgestattet wird. Ist noch etwas messy mit der letzten Funktion die in der Main hängt.

@drouwocu
Ich habe versucht nicht deine Funktionen zu kopieren, aber einen Namen habe ich mir geborgt :T
Bin auch etwas anders verfahren, aber auch wirklich nur marginal anders. Ob es besser ist lasse ich mal dahingestellt, ich glaube aber eher nicht ;) :D
 
Zuletzt bearbeitet:

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
Hier die aktuelle und von mir letzte Variante, Ausgabe steht im Spoiler. Bin mit der Aufgabe erstmal durch ;)

[src=c]// Include C standard headers
#include <stdlib.h>
#include <stdio.h>
#include <string.h>


// Structures

typedef struct alphabetData {
unsigned int letterCount = 0;
char* letters = NULL;
};

typedef struct sequenceStr {
unsigned int count = 0;
unsigned int length = 0;
unsigned int fileCount = 0;
char* sequence = NULL;
};

typedef struct dnaDataFile {
unsigned int fileIndex = 0;
unsigned int length = 0;
unsigned int sampleCount = 0;
unsigned int uniqueSamples = 0;
char* data = NULL;
sequenceStr* ownSamples = NULL;
};

typedef struct dnaDataFiles {
unsigned int fileCount = 0;
dnaDataFile* files;
};

typedef struct sequences {
unsigned int count = 0;
sequenceStr* samples = NULL;
};

// Function declarations
void cleanupApplication();
void extractAlphabet();
char* find128Block(unsigned int minStringSize);
void generateFileDataStore(unsigned int fileToAnalyze);
void buildSampleList(unsigned int filesToAnalyze);

// Toolset
int indexOf(char* data, char* searchString, unsigned int searchPosition);
int countOccurence(char* data, char* searchString);
unsigned int replace(char* data, char* searchString, char replacement);

// Variables
alphabetData alphabet;
dnaDataFile alphabetFile;
dnaDataFiles dnaFiles;
sequences sampleStore;

unsigned int i, j, k = 0;
unsigned int filesToAnalyze = 100;

unsigned int EEEcount = 0;
char* block128 = "\0";

int main(int argc, char* argv[]) {
atexit(cleanupApplication);

// Extract the alphabet from sample file and print those out
printf("\nDNA data alphabet is:\n");
extractAlphabet();

for (i = 0; i < alphabet.letterCount; i++) {
printf("%c ", alphabet.letters);
}

printf("\n");


// Precache the dna data files in memory
generateFileDataStore(filesToAnalyze);


// Get EEE block countOccurence
if (dnaFiles.files != NULL) {
EEEcount = countOccurence(dnaFiles.files[0].data, "EEE\0");
}
printf("\nEEE count is: %d\n", EEEcount);


// Find 128 times block
unsigned int minStringSize = 8;
block128 = find128Block(minStringSize);
printf("128 times block: %s", block128);

// Find the common blocks with equal counts across files
// Fills the sampleStore
buildSampleList(filesToAnalyze);

// Modify the dnaFiles data to figure out how many left samples there are
// in those files and count them together with the other sequences to get the file sanmple count

// Clean the sample file data from known samples
unsigned int replacedCount = 0;

unsigned int dataLength = 0, samplePos = 0, readingPos = 0, uniqueCount = 0, sampleLength = 0;
bool hasSample = false;
char* currentSample = NULL;

for (i = 0; i < dnaFiles.fileCount; i++) {
replacedCount = 0;
j = 0;
for (j = 0; j < sampleStore.count; j++) {
replacedCount += replace(dnaFiles.files.data, sampleStore.samples[j].sequence, '_');
}
dnaFiles.files.sampleCount = replacedCount;

printf("\nAdding unique samples to file %d\n", i + 1);
readingPos = 0;
samplePos = 0;
hasSample = false;
dataLength = strlen(dnaFiles.files.data);

currentSample = (char*) calloc(1, sizeof (char));

// Add the remaining samples to the unique sample list
while (readingPos < dataLength) {

if (dnaFiles.files.data[readingPos] != '_') {
currentSample = (char*) realloc(currentSample, sizeof (char) * (samplePos + 1));
currentSample[samplePos] = dnaFiles.files.data[readingPos];
samplePos++;
hasSample = true;
} else {
if (hasSample) {
uniqueCount = dnaFiles.files.uniqueSamples;
sampleLength = strlen(currentSample);
if (dnaFiles.files.ownSamples == NULL) {
dnaFiles.files.ownSamples = (sequenceStr*) calloc(1, sizeof (sequenceStr));
} else {
dnaFiles.files.ownSamples = (sequenceStr*) realloc(dnaFiles.files.ownSamples, sizeof (sequenceStr) * (uniqueCount + 1));
}

if (dnaFiles.files.ownSamples == NULL) {
printf("\nCould not allocated memory for ownSample store. Exiting.");
exit(1);
}

dnaFiles.files.ownSamples[uniqueCount].sequence = (char*) malloc(sizeof (char) * sampleLength);
if (dnaFiles.files.ownSamples[uniqueCount].sequence == NULL) {
printf("\nCould not allocated memory in sample sequence in ownSample. Exiting.");
exit(1);
}

strcpy(dnaFiles.files.ownSamples[uniqueCount].sequence, currentSample);
dnaFiles.files.ownSamples[uniqueCount].length = sampleLength;
dnaFiles.files.ownSamples[uniqueCount].fileCount = 1;

// Counts and replaces all occurences
dnaFiles.files.ownSamples[uniqueCount].count = replace(dnaFiles.files.data, currentSample, '_');
printf("Sample added: %s (%d)\n", currentSample, dnaFiles.files.ownSamples[uniqueCount].count);
dnaFiles.files.uniqueSamples++;
}

hasSample = false;
samplePos = 0;
currentSample = (char*) realloc(currentSample, 0);
}
readingPos++;
}
}

if (currentSample != NULL) {
free(currentSample);
}

for (i = 0; i < dnaFiles.fileCount; i++) {
for (j = 0; j < dnaFiles.files.uniqueSamples; j++) {
for (k = 0; k < dnaFiles.files.uniqueSamples; k++) {
if (dnaFiles.files.ownSamples[j].length < dnaFiles.files.ownSamples[k].length) {
replace(dnaFiles.files.ownSamples[k].sequence, dnaFiles.files.ownSamples[j].sequence, '_');

bool hasStillSequence = false, hadSequence = false;
unsigned int sequenceIndex = 0, copyPos = 0;
unsigned int sequenceLength = strlen(dnaFiles.files.ownSamples[k].sequence);
char* restSequence = (char*) calloc(1, sizeof (char));

while (copyPos < sequenceLength) {
if (dnaFiles.files.ownSamples[k].sequence[copyPos] != '_') {
restSequence = (char*) realloc(restSequence, sizeof (char)*sequenceIndex + 1);
restSequence[sequenceIndex] = dnaFiles.files.ownSamples[k].sequence[copyPos];
hasStillSequence = true;
hadSequence = true;
sequenceIndex++;
} else {
if (hasStillSequence) {
strcpy(dnaFiles.files.ownSamples[k].sequence, restSequence);
dnaFiles.files.ownSamples[k].sequence = (char*) realloc(dnaFiles.files.ownSamples[k].sequence, sizeof (char)*strlen(restSequence));
}

restSequence = (char*) realloc(restSequence, 0);

hasStillSequence = false;
sequenceIndex = 0;
}
copyPos++;
}

if (hadSequence == false) {
size_t moveStorage = 0;
size_t totalStorage = sizeof (sequenceStr) * (dnaFiles.files.uniqueSamples - 1);
for (int l = k; l < dnaFiles.files.uniqueSamples; l++) {
moveStorage += sizeof (sequenceStr);
}

if (moveStorage > 0) {
memmove(dnaFiles.files.ownSamples + k, dnaFiles.files.ownSamples + k + 1, moveStorage);
dnaFiles.files.ownSamples = (sequenceStr*) realloc(dnaFiles.files.ownSamples, totalStorage);
}

dnaFiles.files.uniqueSamples--;
k--;
if (j > 0) {
j--;
}
}
}
}
}
}

printf("\nAfter cleanup samples in files:\n");
for (i = 0; i < dnaFiles.fileCount; i++) {
printf("\nListing samples of file %d:\n", i + 1);
printf("Unique to file are:\n");
for (j = 0; j < dnaFiles.files.uniqueSamples; j++) {
printf("%s (%d time(s))\n", dnaFiles.files.ownSamples[j].sequence, dnaFiles.files.ownSamples[j].count);
}
}
printf("\nCommon samples are:\n");
for (i = 0; i < sampleStore.count; i++) {
printf("%s (%d times / %d)\n", sampleStore.samples.sequence, sampleStore.samples.fileCount, sampleStore.samples.count);
}

return 0;
}

unsigned int replace(char* data, char* searchString, char replacement) {
unsigned int replaced = 0;
unsigned int position = 0;
unsigned int current = 0;
unsigned int searchStringLength = strlen(searchString);

while (position != -1) {
position = indexOf(data, searchString, 0);
if (position != -1) {
current = 0;
while (current < searchStringLength) {
data[position + current] = replacement;
current++;
}

replaced++;
}
}

return replaced;
}

void cleanupApplication() {
printf("\n\n\nCleaning ressources...\n");

// Free up alphabet
printf("Freeing alphabet...\n");
free(alphabet.letters);
free(alphabetFile.data);

// Free up 128 times block
printf("Freeing block128...\n");
free(block128);

// Free up dna data file storage
printf("Freeing dnaFiles data...\n");
for (i = 0; i < dnaFiles.fileCount; i++) {
free(dnaFiles.files.data);
}

if (dnaFiles.files != NULL) {
printf("Freeing dnaFiles uniqueSamples...\n");
for (i = 0; i < dnaFiles.fileCount; i++) {
for (j = 0; j < dnaFiles.files.uniqueSamples; j++) {
if (dnaFiles.files.ownSamples[j].sequence != NULL) {
free(dnaFiles.files.ownSamples[j].sequence);
}
}
}

printf("Freeing dnaFiles ownSamples collection...\n");
for (i = 0; i < dnaFiles.fileCount; i++) {
if (dnaFiles.files.ownSamples != NULL) {
free(dnaFiles.files.ownSamples);
}
}

printf("Freeing dnaFiles files...\n");
free(dnaFiles.files);
}

// Free up sample store
printf("Freeing sampleStore sequences...\n");
for (i = 0; i < sampleStore.count; i++) {
free(sampleStore.samples.sequence);
}

printf("Freeing sampleStore samples...\n");
if (sampleStore.samples != NULL) {
free(sampleStore.samples);
}

printf("Cleaning successfull, exiting.\n");

return;
}

void extractAlphabet() {
unsigned int bufferLength = 0, position = 0;
FILE* filehandle;

filehandle = fopen("testsequenzen/resultset1i.txt\0", "r");

if (filehandle == NULL) {
printf("Error opening sample input file for alphabet creation!\n");
return;
} else {
fseek(filehandle, 0, SEEK_END);
bufferLength = ftell(filehandle) + 1;
rewind(filehandle);

alphabetFile.data = (char*) calloc(bufferLength, sizeof (char));
fgets(alphabetFile.data, bufferLength, filehandle);
fclose(filehandle);
alphabetFile.fileIndex = 1;
alphabetFile.length = bufferLength;
}

while (position <= alphabetFile.length) {
if (alphabet.letters == NULL) {
alphabet.letters = (char*) calloc(1, sizeof (char));
alphabet.letters[0] = alphabetFile.data[position];
alphabet.letterCount++;
} else {
if (!strchr(alphabet.letters, alphabetFile.data[position])) {

alphabet.letters = (char*) realloc(alphabet.letters, sizeof (char) * alphabet.letterCount + 1);
alphabet.letters[alphabet.letterCount] = alphabetFile.data[position];
alphabet.letterCount++;
}
}
position++;
}
}

void generateFileDataStore(unsigned int filesToAnalyze) {
unsigned int bufferLength = 0;
FILE* filehandle;

for (i = 0; i < filesToAnalyze; i++) {
char filePath[32];
sprintf(filePath, "testsequenzen/resultset%di.txt\0", i + 1);

filehandle = fopen(filePath, "r");
if (filehandle == NULL) {
printf("Error opening file %d...\n", i);
} else {
fseek(filehandle, 0, SEEK_END);
bufferLength = ftell(filehandle) + 1;
rewind(filehandle);

dnaFiles.fileCount++;

if (dnaFiles.files == NULL) {
dnaFiles.files = (dnaDataFile*) calloc(1, sizeof (dnaDataFile));
} else {
dnaFiles.files = (dnaDataFile*) realloc(dnaFiles.files, sizeof (dnaDataFile) * dnaFiles.fileCount);
}

if (dnaFiles.files == NULL) {
printf("Error allocating memory for dnaFiles.files, exiting.\n");
exit(1);
}

dnaFiles.files.data = (char*) calloc(bufferLength, sizeof (char));
fgets(dnaFiles.files.data, bufferLength, filehandle);
fclose(filehandle);

dnaFiles.files.fileIndex = i + 1;
dnaFiles.files.length = bufferLength;
dnaFiles.files.uniqueSamples = 0;
dnaFiles.files.ownSamples = NULL;
}
}
}

char* find128Block(unsigned int minStringSize) {
char* searchString = NULL;
char* searchData = dnaFiles.files[0].data;
unsigned int searchDataLength = dnaFiles.files[0].length;
unsigned int fileCount = 0;

searchString = (char*) calloc(minStringSize, sizeof (char));

for (i = 0; i < searchDataLength; i++) {
while (j < minStringSize) {
searchString[j] = searchData[i + j];
j++;
}

if (countOccurence(searchData, searchString) != 128) {
j = 0;
} else {
fileCount = 0;

for (k = 0; k < dnaFiles.fileCount; k++) {
if (countOccurence(dnaFiles.files[k].data, searchString) == 128) {
fileCount++;
}
}

if (fileCount == filesToAnalyze) {
return searchString;
}
}

if (i >= searchDataLength - minStringSize) {
i = 0;
minStringSize--;

if (minStringSize < 2) {

return "\0";
}
}
}
}

void buildSampleList(unsigned int filesToAnalyze) {
int i = 0, j = 0, k = 0;
int minSampleSize = 2;
int maxSampleSize = 21;
int previousCount = -1;
int currentCount = 0;
bool isIncluded = false;

int sampleSize = maxSampleSize;
int currentSampleSize = sampleSize - 1;
unsigned int fileCount = 1;
int sampleLength = 0;

unsigned int dataLength = dnaFiles.files[0].length - 1;
char* dnaData = dnaFiles.files[0].data;
char* searchString = (char*) calloc(sampleSize, sizeof (char));

sampleStore.samples = (sequenceStr*) calloc(1, sizeof (sequenceStr));

for (i = 0; i < dataLength; i += sampleSize) {
while (currentSampleSize > -1) {
searchString[currentSampleSize] = dnaData[i + currentSampleSize];
currentSampleSize--;
}

previousCount = -1;
fileCount = 1;

for (j = 0; j < dnaFiles.fileCount; j++) {
currentCount = countOccurence(dnaFiles.files[j].data, searchString);
if (previousCount == -1) {
previousCount = currentCount;
} else if (previousCount != currentCount) {
break;
} else {
fileCount++;

if (fileCount == filesToAnalyze) {
isIncluded = false;
for (k = 0; k < sampleStore.count; k++) {
if (strcmp(sampleStore.samples[k].sequence, searchString) == 0) {
isIncluded = true;
break;
}
}

if (!isIncluded) {
// Add sample to the sample store
sampleLength = strlen(searchString);

sampleStore.samples = (sequenceStr*) realloc(sampleStore.samples, sizeof (sequenceStr) * (sampleStore.count + 1));
if (sampleStore.samples == NULL) {
printf("\nCould not allocated memory in sampleStore. Exiting.");
exit(1);
}

sampleStore.samples[sampleStore.count].sequence = (char*) malloc(sizeof (char) * sampleLength);
if (sampleStore.samples[sampleStore.count].sequence == NULL) {
printf("\nCould not allocated memory in sampleStore sequence. Exiting.");
exit(1);
}

strcpy(sampleStore.samples[sampleStore.count].sequence, searchString);
sampleStore.samples[sampleStore.count].count = (currentCount * filesToAnalyze);
sampleStore.samples[sampleStore.count].fileCount = currentCount;
sampleStore.samples[sampleStore.count].length = sampleLength;
sampleStore.count++;
break;
}
}
}
}

if (i + sampleSize >= dataLength) {
i = 0;
sampleSize--;

if (sampleSize < minSampleSize) {
break;
}

searchString = (char*) realloc(searchString, sizeof (char) * sampleSize);
}

currentSampleSize = sampleSize - 1;

}

if (searchString != NULL) {
free(searchString);
}

printf("\n\nFound samples are:\n");
for (i = 0; i < sampleStore.count; i++) {

printf("%s (%d/%d)\n", sampleStore.samples.sequence, sampleStore.samples.count, filesToAnalyze);
}

printf("\n");


// Filter sample results, bigger samples with same count merge smaller ones
for (i = 0; i < sampleStore.count; i++) {
for (j = 0; j < sampleStore.count; j++) {
if (sampleStore.samples[j].count == sampleStore.samples.count) {
if (strstr(sampleStore.samples.sequence, sampleStore.samples[j].sequence) && sampleStore.samples.length > sampleStore.samples[j].length) {
size_t storageSpace = 0;
size_t moveStorage = 0;
size_t totalStorage = 0;
size_t itemStorage = 0;

for (k = 0; k < sampleStore.count; k++) {
storageSpace = sizeof (sequenceStr);
if (k == j) {
itemStorage = storageSpace;
moveStorage += storageSpace;
} else if (k > j) {
moveStorage += storageSpace;
}
totalStorage += storageSpace;
}
printf("Sample, with same count found: %s in %s (%d/%d)\n", sampleStore.samples[j].sequence, sampleStore.samples.sequence, sampleStore.samples.count, sampleStore.samples[j].count);
//printf("%d total, move %d, item %d\n", totalStorage, moveStorage, itemStorage);

for (int l = 0; l < sampleStore.count; l++) {
if (l == j) {
//printf("hitted %s\n", sampleStore.samples[l].sequence);
} else {
//printf("%s\n", sampleStore.samples[l].sequence);
}
}

memmove(sampleStore.samples + j, sampleStore.samples + (j + 1), moveStorage);
sampleStore.samples = (sequenceStr*) realloc(sampleStore.samples, totalStorage - itemStorage);
sampleStore.count--;
j--;

if (i > 0) {
i--;
}
}
}
}
}

printf("\nMerged-down samples in sampleStore are:\n");
for (i = 0; i < sampleStore.count; i++) {
printf("%s (%d/%d/%d)\n", sampleStore.samples.sequence, sampleStore.samples.fileCount, sampleStore.samples.count, filesToAnalyze);
}

return;
}

int indexOf(char* data, char* searchString, unsigned int searchPosition) {
unsigned int searchDataLength = strlen(data);
unsigned int searchStringLength = strlen(searchString);
int currentPosition = 0;

unsigned int searchDataIndex = 0;
for (currentPosition = searchPosition; currentPosition < searchDataLength; currentPosition++) {
if (data[currentPosition] != searchString[searchDataIndex]) {
currentPosition -= searchDataIndex;
searchDataIndex = 0;
} else {
searchDataIndex++;

if (searchDataIndex == searchStringLength) {

return (currentPosition + 1 - searchStringLength);
}
}
}

return -1;
}

int countOccurence(char* data, char* searchString) {
unsigned int countOccurence = 0, searchStringLength = strlen(searchString);
int currentPosition = 0;

while (true) {
currentPosition = indexOf(data, searchString, currentPosition);
if (currentPosition != -1) {
countOccurence++;
currentPosition += searchStringLength;
} else {
return countOccurence;
}
}

return -1;
}
[/src]

Ausgabe:
DNA data alphabet is:
E F G A D I C H B U

EEE count is: 72
128 times block: EAEBF

Found samples are:
EFFGA (30000/100)
FGA (30000/100)
EAEBF (12800/100)
ACHGD (5000/100)
EFFG (30000/100)
ACHG (5000/100)
AEBF (12800/100)
FFGA (30000/100)
EAEB (12800/100)
CHGD (5000/100)
HGD (5000/100)
EFF (30000/100)
EBF (12800/100)
ACH (5000/100)
AEG (10000/100)
EEE (7200/100)
AED (5000/100)
AEB (12800/100)
CHG (5000/100)
EF (30000/100)
AC (5000/100)
HG (5000/100)
BF (12800/100)
EB (12800/100)
CH (5000/100)

Sample, with same count found: FGA in EFFGA (30000/30000)
Sample, with same count found: EFFG in EFFGA (30000/30000)
Sample, with same count found: FFGA in EFFGA (30000/30000)
Sample, with same count found: EFF in EFFGA (30000/30000)
Sample, with same count found: EF in EFFGA (30000/30000)
Sample, with same count found: AEBF in EAEBF (12800/12800)
Sample, with same count found: EAEB in EAEBF (12800/12800)
Sample, with same count found: EBF in EAEBF (12800/12800)
Sample, with same count found: AEB in EAEBF (12800/12800)
Sample, with same count found: BF in EAEBF (12800/12800)
Sample, with same count found: EB in EAEBF (12800/12800)
Sample, with same count found: ACHG in ACHGD (5000/5000)
Sample, with same count found: CHGD in ACHGD (5000/5000)
Sample, with same count found: HGD in ACHGD (5000/5000)
Sample, with same count found: ACH in ACHGD (5000/5000)
Sample, with same count found: CHG in ACHGD (5000/5000)
Sample, with same count found: AC in ACHGD (5000/5000)
Sample, with same count found: HG in ACHGD (5000/5000)
Sample, with same count found: CH in ACHGD (5000/5000)

Merged-down samples in sampleStore are:
EFFGA (300/30000/100)
EAEBF (128/12800/100)
ACHGD (50/5000/100)
AEG (100/10000/100)
EEE (72/7200/100)
AED (50/5000/100)

Adding unique samples to file 1
Sample added: AEI (176)
Sample added: DEAA (124)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 2
Sample added: DEAA (159)
Sample added: AEI (141)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 3
Sample added: DEAAAEI (16)
Sample added: AEI (168)
Sample added: DEAA (100)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 4
Sample added: DEAA (197)
Sample added: AEI (103)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 5
Sample added: DEAA (163)
Sample added: AEI (137)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 6
Sample added: AEI (192)
Sample added: DEAA (108)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 7
Sample added: AEI (142)
Sample added: DEAA (158)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 8
Sample added: DEAADEAA (23)
Sample added: DEAAAEI (22)
Sample added: AEI (95)
Sample added: DEAA (115)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 9
Sample added: AEI (148)
Sample added: DEAA (152)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 10
Sample added: AEI (145)
Sample added: DEAA (155)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 11
Sample added: DEAA (164)
Sample added: AEI (136)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 12
Sample added: DEAA (152)
Sample added: AEI (148)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 13
Sample added: DEAA (106)
Sample added: AEI (194)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 14
Sample added: AEIAEI (21)
Sample added: DEAA (136)
Sample added: AEI (122)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 15
Sample added: DEAAAEI (28)
Sample added: DEAA (136)
Sample added: AEI (108)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 16
Sample added: DEAAAEI (21)
Sample added: AEI (86)
Sample added: DEAA (172)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 17
Sample added: AEI (125)
Sample added: DEAA (175)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 18
Sample added: AEIAEIDEAA (7)
Sample added: AEI (180)
Sample added: DEAA (99)

Adding unique samples to file 19
Sample added: DEAA (198)
Sample added: AEIAEI (9)
Sample added: AEI (84)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 20
Sample added: AEI (162)
Sample added: DEAA (138)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 21
Sample added: DEAA (192)
Sample added: AEI (108)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 22
Sample added: DEAADEAADEAA (2)
Sample added: AEI (174)
Sample added: DEAA (120)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 23
Sample added: AEI (136)
Sample added: DEAA (164)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 24
Sample added: DEAA (176)
Sample added: AEI (124)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 25
Sample added: DEAA (153)
Sample added: AEI (147)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 26
Sample added: AEIAEIAEI (3)
Sample added: DEAA (189)
Sample added: AEI (102)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 27
Sample added: DEAADEAAAEI (5)
Sample added: DEAA (140)
Sample added: AEIAEI (18)
Sample added: AEI (109)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 28
Sample added: DEAADEAADEAAAEI (1)
Sample added: DEAAAEI (21)
Sample added: AEI (91)
Sample added: DEAA (163)
Sample added: AGIGDFFGDBEIAEI (0)

Adding unique samples to file 29
Sample added: DEAA (161)
Sample added: AEI (139)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 30
Sample added: DEAA (107)
Sample added: AEIAEI (33)
Sample added: AEI (127)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 31
Sample added: DEAA (199)
Sample added: AEI (101)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 32
Sample added: AEI (130)
Sample added: DEAA (170)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 33
Sample added: DEAAAEI (20)
Sample added: AEI (144)
Sample added: DEAA (116)
Sample added: GEIAUEGGDIBED (0)

Adding unique samples to file 34
Sample added: AEI (161)
Sample added: DEAA (139)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 35
Sample added: AEI (162)
Sample added: DEAA (138)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 36
Sample added: DEAA (131)
Sample added: AEI (169)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 37
Sample added: AEI (153)
Sample added: DEAA (147)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 38
Sample added: AEI (183)
Sample added: DEAA (117)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 39
Sample added: AEI (176)
Sample added: DEAADEAA (17)
Sample added: DEAA (90)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 40
Sample added: DEAA (114)
Sample added: AEI (186)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 41
Sample added: DEAA (188)
Sample added: AEI (112)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 42
Sample added: AEI (161)
Sample added: DEAA (139)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 43
Sample added: AEI (112)
Sample added: DEAA (188)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 44
Sample added: AEI (187)
Sample added: DEAA (113)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 45
Sample added: DEAA (128)
Sample added: AEI (172)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 46
Sample added: DEAA (179)
Sample added: AEI (121)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 47
Sample added: DEAADEAA (26)
Sample added: DEAAAEI (19)
Sample added: DEAA (112)
Sample added: AEI (98)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 48
Sample added: AEI (170)
Sample added: DEAA (130)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 49
Sample added: AEI (173)
Sample added: DEAA (127)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 50
Sample added: AEIDEAA (23)
Sample added: DEAA (138)
Sample added: AEI (116)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 51
Sample added: DEAADEAAAEIAEIDEAA (1)
Sample added: AEIAEI (15)
Sample added: AEIDEAADEAAAEIDEAA (0)
Sample added: DEAA (158)
Sample added: AEI (107)
Sample added: AGIGDFFGDBEIAIDEAA (0)

Adding unique samples to file 52
Sample added: DEAA (110)
Sample added: AEI (190)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 53
Sample added: DEAA (186)
Sample added: AEI (114)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 54
Sample added: DEAA (148)
Sample added: AEI (152)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 55
Sample added: DEAA (171)
Sample added: AEI (129)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 56
Sample added: DEAA (175)
Sample added: AEIAEI (8)
Sample added: AEI (109)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 57
Sample added: AEI (143)
Sample added: DEAADEAA (21)
Sample added: DEAA (115)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 58
Sample added: DEAA (147)
Sample added: AEI (153)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 59
Sample added: DEAA (174)
Sample added: AEI (126)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 60
Sample added: AEI (120)
Sample added: DEAA (180)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 61
Sample added: DEAAAEIDEAA (6)
Sample added: AEI (113)
Sample added: DEAA (169)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 62
Sample added: DEAA (167)
Sample added: AEI (133)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 63
Sample added: AEI (142)
Sample added: DEAA (158)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 64
Sample added: DEAA (111)
Sample added: AEI (189)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 65
Sample added: AEIDEAA (30)
Sample added: AEI (117)
Sample added: DEAA (123)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 66
Sample added: DEAAAEI (28)
Sample added: DEAA (101)
Sample added: AEIAEIAEI (4)
Sample added: AEI (131)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 67
Sample added: AEI (144)
Sample added: DEAA (156)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 68
Sample added: DEAA (146)
Sample added: AEI (154)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 69
Sample added: AEI (147)
Sample added: DEAADEAA (21)
Sample added: DEAA (111)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 70
Sample added: DEAA (193)
Sample added: AEI (107)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 71
Sample added: DEAA (188)
Sample added: AEI (112)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 72
Sample added: DEAA (179)
Sample added: AEI (121)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 73
Sample added: DEAA (107)
Sample added: AEI (193)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 74
Sample added: AEI (180)
Sample added: DEAADEAA (13)
Sample added: DEAA (94)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 75
Sample added: DEAA (193)
Sample added: AEI (107)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 76
Sample added: AEI (191)
Sample added: DEAA (109)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 77
Sample added: DEAADEAADEAA (4)
Sample added: DEAA (133)
Sample added: AEI (155)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 78
Sample added: DEAAAEI (24)
Sample added: AEI (160)
Sample added: DEAA (92)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 79
Sample added: DEAAAEI (15)
Sample added: DEAA (141)
Sample added: AEIAEI (24)
Sample added: AEI (81)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 80
Sample added: AEI (157)
Sample added: DEAA (143)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 81
Sample added: DEAA (138)
Sample added: AEI (162)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 82
Sample added: AEI (166)
Sample added: DEAA (134)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 83
Sample added: DEAADEAA (17)
Sample added: DEAA (105)
Sample added: AEI (161)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 84
Sample added: AEI (155)
Sample added: DEAA (145)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 85
Sample added: DEAA (176)
Sample added: AEI (124)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 86
Sample added: AEI (183)
Sample added: DEAA (117)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 87
Sample added: DEAA (191)
Sample added: AEI (109)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 88
Sample added: DEAA (169)
Sample added: AEI (131)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 89
Sample added: AEI (146)
Sample added: DEAA (154)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 90
Sample added: DEAAAEI (17)
Sample added: DEAADEAA (21)
Sample added: AEIDEAA (18)
Sample added: AEIAEI (15)
Sample added: DEAA (83)
Sample added: AEI (75)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 91
Sample added: AEI (197)
Sample added: DEAADEAA (12)
Sample added: DEAA (79)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 92
Sample added: DEAA (195)
Sample added: AEI (105)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 93
Sample added: AEI (127)
Sample added: DEAA (173)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 94
Sample added: DEAA (173)
Sample added: AEI (127)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 95
Sample added: AEI (171)
Sample added: DEAA (129)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 96
Sample added: DEAADEAA (22)
Sample added: AEIDEAA (20)
Sample added: AEI (129)
Sample added: DEAA (87)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 97
Sample added: DEAA (193)
Sample added: AEI (107)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 98
Sample added: AEIAEIAEI (4)
Sample added: DEAA (128)
Sample added: AEI (160)
Sample added: GEIAUEGGDIBED (1)

Adding unique samples to file 99
Sample added: DEAA (101)
Sample added: AEI (199)
Sample added: AGIGDFFGDBEIA (1)

Adding unique samples to file 100
Sample added: DEAAAEIAEI (3)
Sample added: AEIAEI (10)
Sample added: DEAA (140)
Sample added: AEI (131)
Sample added: GEIAUEGGDIBED (1)

After cleanup samples in files:

Listing samples of file 1:
Unique to file are:
AEI (176 time(s))
DEAA (124 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 2:
Unique to file are:
DEAA (159 time(s))
AEI (141 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 3:
Unique to file are:
AEI (168 time(s))
DEAA (100 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 4:
Unique to file are:
DEAA (197 time(s))
AEI (103 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 5:
Unique to file are:
DEAA (163 time(s))
AEI (137 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 6:
Unique to file are:
AEI (192 time(s))
DEAA (108 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 7:
Unique to file are:
AEI (142 time(s))
DEAA (158 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 8:
Unique to file are:
AEI (95 time(s))
DEAA (115 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 9:
Unique to file are:
AEI (148 time(s))
DEAA (152 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 10:
Unique to file are:
AEI (145 time(s))
DEAA (155 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 11:
Unique to file are:
DEAA (164 time(s))
AEI (136 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 12:
Unique to file are:
DEAA (152 time(s))
AEI (148 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 13:
Unique to file are:
DEAA (106 time(s))
AEI (194 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 14:
Unique to file are:
DEAA (136 time(s))
AEI (122 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 15:
Unique to file are:
DEAA (136 time(s))
AEI (108 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 16:
Unique to file are:
AEI (86 time(s))
DEAA (172 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 17:
Unique to file are:
AEI (125 time(s))
DEAA (175 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 18:
Unique to file are:
AEI (180 time(s))
DEAA (99 time(s))

Listing samples of file 19:
Unique to file are:
DEAA (198 time(s))
AEI (84 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 20:
Unique to file are:
AEI (162 time(s))
DEAA (138 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 21:
Unique to file are:
DEAA (192 time(s))
AEI (108 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 22:
Unique to file are:
AEI (174 time(s))
DEAA (120 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 23:
Unique to file are:
AEI (136 time(s))
DEAA (164 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 24:
Unique to file are:
DEAA (176 time(s))
AEI (124 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 25:
Unique to file are:
DEAA (153 time(s))
AEI (147 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 26:
Unique to file are:
DEAA (189 time(s))
AEI (102 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 27:
Unique to file are:
DEAA (140 time(s))
AEI (109 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 28:
Unique to file are:
AEI (91 time(s))
DEAA (163 time(s))
AGIGDFFGDBEI (0 time(s))

Listing samples of file 29:
Unique to file are:
DEAA (161 time(s))
AEI (139 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 30:
Unique to file are:
DEAA (107 time(s))
AEI (127 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 31:
Unique to file are:
DEAA (199 time(s))
AEI (101 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 32:
Unique to file are:
AEI (130 time(s))
DEAA (170 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 33:
Unique to file are:
AEI (144 time(s))
DEAA (116 time(s))
GEIAUEGGDIBED (0 time(s))

Listing samples of file 34:
Unique to file are:
AEI (161 time(s))
DEAA (139 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 35:
Unique to file are:
AEI (162 time(s))
DEAA (138 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 36:
Unique to file are:
DEAA (131 time(s))
AEI (169 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 37:
Unique to file are:
AEI (153 time(s))
DEAA (147 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 38:
Unique to file are:
AEI (183 time(s))
DEAA (117 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 39:
Unique to file are:
AEI (176 time(s))
DEAA (90 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 40:
Unique to file are:
DEAA (114 time(s))
AEI (186 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 41:
Unique to file are:
DEAA (188 time(s))
AEI (112 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 42:
Unique to file are:
AEI (161 time(s))
DEAA (139 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 43:
Unique to file are:
AEI (112 time(s))
DEAA (188 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 44:
Unique to file are:
AEI (187 time(s))
DEAA (113 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 45:
Unique to file are:
DEAA (128 time(s))
AEI (172 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 46:
Unique to file are:
DEAA (179 time(s))
AEI (121 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 47:
Unique to file are:
DEAA (112 time(s))
AEI (98 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 48:
Unique to file are:
AEI (170 time(s))
DEAA (130 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 49:
Unique to file are:
AEI (173 time(s))
DEAA (127 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 50:
Unique to file are:
DEAA (138 time(s))
AEI (116 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 51:
Unique to file are:
DEAA (158 time(s))
AEI (107 time(s))
AGIGDFFGDBEIAI (0 time(s))

Listing samples of file 52:
Unique to file are:
DEAA (110 time(s))
AEI (190 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 53:
Unique to file are:
DEAA (186 time(s))
AEI (114 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 54:
Unique to file are:
DEAA (148 time(s))
AEI (152 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 55:
Unique to file are:
DEAA (171 time(s))
AEI (129 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 56:
Unique to file are:
DEAA (175 time(s))
AEI (109 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 57:
Unique to file are:
AEI (143 time(s))
DEAA (115 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 58:
Unique to file are:
DEAA (147 time(s))
AEI (153 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 59:
Unique to file are:
DEAA (174 time(s))
AEI (126 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 60:
Unique to file are:
AEI (120 time(s))
DEAA (180 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 61:
Unique to file are:
AEI (113 time(s))
DEAA (169 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 62:
Unique to file are:
DEAA (167 time(s))
AEI (133 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 63:
Unique to file are:
AEI (142 time(s))
DEAA (158 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 64:
Unique to file are:
DEAA (111 time(s))
AEI (189 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 65:
Unique to file are:
AEI (117 time(s))
DEAA (123 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 66:
Unique to file are:
DEAA (101 time(s))
AEI (131 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 67:
Unique to file are:
AEI (144 time(s))
DEAA (156 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 68:
Unique to file are:
DEAA (146 time(s))
AEI (154 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 69:
Unique to file are:
AEI (147 time(s))
DEAA (111 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 70:
Unique to file are:
DEAA (193 time(s))
AEI (107 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 71:
Unique to file are:
DEAA (188 time(s))
AEI (112 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 72:
Unique to file are:
DEAA (179 time(s))
AEI (121 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 73:
Unique to file are:
DEAA (107 time(s))
AEI (193 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 74:
Unique to file are:
AEI (180 time(s))
DEAA (94 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 75:
Unique to file are:
DEAA (193 time(s))
AEI (107 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 76:
Unique to file are:
AEI (191 time(s))
DEAA (109 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 77:
Unique to file are:
DEAA (133 time(s))
AEI (155 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 78:
Unique to file are:
AEI (160 time(s))
DEAA (92 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 79:
Unique to file are:
DEAA (141 time(s))
AEI (81 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 80:
Unique to file are:
AEI (157 time(s))
DEAA (143 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 81:
Unique to file are:
DEAA (138 time(s))
AEI (162 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 82:
Unique to file are:
AEI (166 time(s))
DEAA (134 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 83:
Unique to file are:
DEAA (105 time(s))
AEI (161 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 84:
Unique to file are:
AEI (155 time(s))
DEAA (145 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 85:
Unique to file are:
DEAA (176 time(s))
AEI (124 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 86:
Unique to file are:
AEI (183 time(s))
DEAA (117 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 87:
Unique to file are:
DEAA (191 time(s))
AEI (109 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 88:
Unique to file are:
DEAA (169 time(s))
AEI (131 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 89:
Unique to file are:
AEI (146 time(s))
DEAA (154 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 90:
Unique to file are:
DEAA (83 time(s))
AEI (75 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 91:
Unique to file are:
AEI (197 time(s))
DEAA (79 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 92:
Unique to file are:
DEAA (195 time(s))
AEI (105 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 93:
Unique to file are:
AEI (127 time(s))
DEAA (173 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 94:
Unique to file are:
DEAA (173 time(s))
AEI (127 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 95:
Unique to file are:
AEI (171 time(s))
DEAA (129 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 96:
Unique to file are:
AEI (129 time(s))
DEAA (87 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 97:
Unique to file are:
DEAA (193 time(s))
AEI (107 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 98:
Unique to file are:
DEAA (128 time(s))
AEI (160 time(s))
GEIAUEGGDIBED (1 time(s))

Listing samples of file 99:
Unique to file are:
DEAA (101 time(s))
AEI (199 time(s))
AGIGDFFGDBEIA (1 time(s))

Listing samples of file 100:
Unique to file are:
DEAA (140 time(s))
AEI (131 time(s))
GEIAUEGGDIBED (1 time(s))

Common samples are:
EFFGA (300 times / 30000)
EAEBF (128 times / 12800)
ACHGD (50 times / 5000)
AEG (100 times / 10000)
EEE (72 times / 7200)
AED (50 times / 5000)
 
Zuletzt bearbeitet:

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
Trotz mangelnder Java Kenntnisse habe ich mir mal deinen Code angeschaut.

Was mir spontan auffällt - hat es einen Grund warum du gewisse Ergebnisse nicht zwishenspeicherst? Speziell wird oft die "countblock" Routine aufgerufen im Code, das neuzählen ließe sich allerdings verhindern wenn du zwischenspeichern würdest wie oft ein Block in einer Sequenz auftaucht. Ob das nun pro Datei ist oder Global mit einer verketteten Liste pro Block, Anzahl.

Ich glaube dann würde dein Programm auch um einiges schneller laufen.

Ich sehe allerdings schon den Vorteil das du dadurch immer sicherstellst das die Daten (die Zählstände) wirklich aktuell sind und nicht aus einem Zwischenspeicher kommen, ich finde nur im Punkto Performance ist das ein unnötiger Overhead weil das nach dem ersten ermitteln feststeht, jedenfalls meine bescheidene Meinung :)
 

Roin

Freier Denker

Registriert
22 Juli 2013
Beiträge
581
  • Thread Starter Thread Starter
  • #55
@theSplit:
Es stimmt, dass es die Performance ziemlich drückt. Allerdings habe ich mir gedacht, dass ich durch dieses Vorgehen sicherstelle, eventuelle Semi-Vorkommen eines Bausteins auszuschließen und wie du schon richtig erkannt hast, die Zählerstände wirklich aktuell sind.

Wenn ich die Daten zwischenspeichern wollte, müsste ich Listen für jede Sequenz oder jeden Baustein anlegen, die mindestens zweidimensional sind. Allerdings habe ich es in Java nicht so mit mehrdimensionalen Listen/Arrays.
Habe die schon ewig nicht mehr verwendet und müsste mich da erneut einlesen. Da war mir das nicht ganz so wichtig, ob eventuell das ganze ein paar Sekunden dauert oder alles in einer erledigt ist.


EDIT:
Ich habe gerade mal überschlagen. Statt 7 Sekunden Laufzeit, wäre ich bei ca. 5 Sekunden, würde ich die Ergebnisse zwischenspeichern.

Ich habe nun mal eine Hashmap als Baustein-Häufigkeitsspeicher angelegt.
Diese enthält allerdings pro Sequenz ganze 75000 Elemente.
Dadurch lassen sich dennoch 2 Sekunden einsparen.

[src=java]
/**
* Speichert alle die Anzahl der Vorkommnisse eines Bausteins in der
* Sequenz.
*/
private Map<String, Integer> block_occurrence;

/**
*
*/
public Analyse() {
this.block_occurrence = new HashMap();
}

/* ... */

/**
* Überprüfe, ob der Baustein in der Sequenz enthalten ist.
*
* @param block
* @return
*/
public boolean existBlock(String block) {
if(this.block_occurrence.containsKey(block)) {
// if(this.block_occurrence.get(block) > 0) {
return true;
// }
// return false;
}

if(this.sequenz.contains(block)) {
return true;
}
return false;
}

/* ... */

/**
* Zähle die Anzahl der Vorkommen des Bausteins in der Sequenz.
*
* @param block
* @return
*/
public int countBlock(String block) {
int occurrence = 0;

//Ist der Block valide?
if(!this.isValidBlock(block)) {
return 0;
}
if(block.length() == 0) {
return 0;
}
if(!this.existBlock(block)) {
return 0;
}

//Ist der Baustein bereits gezählt worden?
if(this.block_occurrence.containsKey(block)) {
return this.block_occurrence.get(block);
} else {

//false = Kein Überlappen der Bausteine
boolean eagerMatching = false;

if (0 != block.length()) {
for (int index = this.sequenz.indexOf(block, 0);
index != -1;
index = this.sequenz.indexOf(block, eagerMatching ? index + 1 : index + block.length())) {
occurrence++;
}
}

//Häufigkeit des Bausteins zwischenspeichern.
this.block_occurrence.put(block, occurrence);

// System.out.println("Der Block " + '"' + block + '"' + " kommt " + occurrence + " mal in der Sequenz vor.");
return occurrence;
}
}[/src]

Ich dachte mir, es lohnt sich auch valide Bausteine zwischenzuspeichern, das hat allerdings die Laufzeit auf über eine Minute schießen lassen.
Noch jemand Vorschläge, um die Effizienz zu steigern?
 
Zuletzt bearbeitet:

theSplit

1998
Veteran Barkeeper

Registriert
3 Aug. 2014
Beiträge
28.573
Also was mir spontan ist Auge sticht, du machst die Überprufung "existsBlock", wie hier:
[src=java]
if(!this.existBlock(block)) {
return 0;
}
[/src]

und dann aber auch gleich das danach wieder:
[src=java] //Ist der Baustein bereits gezählt worden?
if(this.block_occurrence.containsKey(block)) {
return this.block_occurrence.get(block);
} else {[/src]

Folgende Idee:
[src=java]
if(!this.existBlock(block)) {
// Anweisung hier durchführen wenn der Block nicht existiert
//false = Kein Überlappen der Bausteine
boolean eagerMatching = false;

if (0 != block.length()) {
for (int index = this.sequenz.indexOf(block, 0);
index != -1;
index = this.sequenz.indexOf(block, eagerMatching ? index + 1 : index + block.length())) {
occurrence++;
}
}

//Häufigkeit des Bausteins zwischenspeichern.
this.block_occurrence.put(block, occurrence);

// System.out.println("Der Block " + '"' + block + '"' + " kommt " + occurrence + " mal in der Sequenz vor.");
return occurrence;
} else {
// Ansonsten die Anzahl zurückgeben
return this.block_occurrence.get(block);
}
[/src]

[src=java][/src]

Wobei ich hier "existsBlock?" -> "return bestehendenWert", dann "else" -> "zähle Block"


Wenn du es noch mehr optimieren willst:
Schreibe zwei Schleifen, einmal für EagerMatching Mode, die andere für Index+block.length();

Also Pseudocode:
If Eagermatching {
for loop mit Index = sequenz.indexOf; index != -1; this.sequenz.indexOf(block, index + 1); occurence++;
} else {
for loop mit Index = sequenz.indexOf; index != -1; this.sequenz.indexOf(block, index + block.length(); occurence++;
}


Der Trick dabei, du hast zwar zwei Loops, aber du hast nur einmal die Überprüfung ob Eager Matching aktiviert ist, die Kondition EagerMatching ja/nein ändert sich ja nicht mehr während der Laufzeit der Schleife sondern ist vorher schon klar.

Wenn du noch weiter optimieren willst, könntest du auch mal folgendes machen:

If Eagermatching {
for loop mit Index = sequenz.indexOf; index != -1; this.sequenz.indexOf(block, index + 1); occurence++;
} else {
int blockLength = block.length();
for loop mit Index = sequenz.indexOf; index != -1; this.sequenz.indexOf(block, index + blockLength; occurence++;
}

Das sollte auch minimal weniger Ressourcen verbrauchen, Mikro Optimierung
 
Zuletzt bearbeitet:

Roin

Freier Denker

Registriert
22 Juli 2013
Beiträge
581
  • Thread Starter Thread Starter
  • #57
@theSplit:
Diese Änderung habe ich alle gemacht. Sind wirklich nur Mikrooptimierungen. Die Laufzeit beträgt weiterin 5 Sekunden. Genauer zeigt es mit NetBeans nicht an, zumindest weiß ich nicht, wo ich das finden würde.

Beachte bei der Laufzeit bitte, dass ich Bausteine zwischen 2 und 30 Zeichen suche.
 
Oben