Pandas hash results unclear

HoneyBadger

Hi everyone,

I read records from a CSV file, clean them up, and write them out to an Excel file. So far, so good.
Two of the columns contain comparable records, but every now and then their values are swapped.

So I figured I would simply compare the data by hash value. However, the hashes I get differ, and I don't quite understand why.

[TABLE="class: grid, width: 500"]
[TR]
[TD]No[/TD]
[TD]Col 1[/TD]
[TD]Col 2[/TD]
[TD]Sum Col 1 + Col 2[/TD]
[TD]Hash1[/TD]
[TD]Hash2[/TD]
[TD]HashSum[/TD]
[/TR]
[TR]
[TD]1[/TD]
[TD]a[/TD]
[TD]b[/TD]
[TD]ab[/TD]
[TD]7.27363E+18[/TD]
[TD]1.73561E+19[/TD]
[TD]-1943126210[/TD]
[/TR]
[TR]
[TD]2[/TD]
[TD]NaN[/TD]
[TD]NaN[/TD]
[TD]NaN[/TD]
[TD][/TD]
[TD][/TD]
[TD][/TD]
[/TR]
[TR]
[TD]3[/TD]
[TD]b[/TD]
[TD]a[/TD]
[TD]ab[/TD]
[TD]1.63369E+19[/TD]
[TD]1.08144E+19[/TD]
[TD]-175753704[/TD]
[/TR]
[TR]
[TD]4[/TD]
[TD]b[/TD]
[TD]a[/TD]
[TD]ab[/TD]
[TD]5.5081E+18[/TD]
[TD]1.64238E+18[/TD]
[TD]-1738826704[/TD]
[/TR]
[/TABLE]


[src=python]# Hash every row twice: once in the original column order, once swapped.
df_cleaned["HashNormalize1"] = pd.util.hash_pandas_object(df_cleaned[["ColA", "ColB"]])
df_cleaned["HashNormalize2"] = pd.util.hash_pandas_object(df_cleaned[["ColB", "ColA"]])
# Add both hashes; the sum is meant to be the same for swapped rows.
df_cleaned["HashNormSum"] = df_cleaned["HashNormalize1"] + df_cleaned["HashNormalize2"]
df_cleaned["HashNormSum"] = df_cleaned["HashNormSum"].astype("int")
[/src]

Does anyone have an idea?

Regards
 
  • Thread starter
  • #2
I've figured it out. In case anyone ever googles this, here is what the problem was:

If you run this code:

[src=python]import pandas as pd

# Rows 0 to 2 contain identical values; row 3 has them swapped.
df_test = pd.DataFrame({
    'A': ["1", "1", "1", "0"],
    'B': ["0", "0", "0", "1"],
})

# Default call: the row index is included in the hash (index=True).
df_test["HashNormalize1"] = pd.util.hash_pandas_object(df_test[["A", "B"]])
df_test["HashNormalize2"] = pd.util.hash_pandas_object(df_test[["A", "B"]])
df_test["HashNormSum"] = df_test["HashNormalize1"] + df_test["HashNormalize2"]

df_test[/src]

you get:

[TABLE="class: dataframe, width: 3354"]
[TR]
[TH="align: right"][/TH]
[TH="align: left"]A[/TH]
[TH="align: left"]B[/TH]
[TH="align: left"]HashNormalize1[/TH]
[TH="align: left"]HashNormalize2[/TH]
[TH="align: left"]HashNormSum[/TH]
[/TR]
[TR]
[TH="align: left"]0[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]11066859559894451155[/TD]
[TD]11066859559894451155[/TD]
[TD]3686975046079350694[/TD]
[/TR]
[TR]
[TH="align: left"]1[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]10128852264039700196[/TD]
[TD]10128852264039700196[/TD]
[TD]1810960454369848776[/TD]
[/TR]
[TR]
[TH="align: left"]2[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]4666896424757761893[/TD]
[TD]4666896424757761893[/TD]
[TD]9333792849515523786[/TD]
[/TR]
[TR]
[TH="align: left"]3[/TH]
[TD]0[/TD]
[TD]1[/TD]
[TD]8579837400036744307[/TD]
[TD]8579837400036744307[/TD]
[TD]17159674800073488614[/TD]
[/TR]
[/TABLE]

By default, hash_pandas_object includes the index when generating the hash, so rows with identical values still get different hashes. Simply set the parameter index=False and the hashes match.

[src=python]import pandas as pd

# Same test data as before.
df_test = pd.DataFrame({
    'A': ["1", "1", "1", "0"],
    'B': ["0", "0", "0", "1"],
})

# index=False: only the column values are hashed, not the row index.
df_test["HashNormalize1"] = pd.util.hash_pandas_object(df_test[["A", "B"]], index=False)
df_test["HashNormalize2"] = pd.util.hash_pandas_object(df_test[["A", "B"]], index=False)
df_test["HashNormSum"] = df_test["HashNormalize1"] + df_test["HashNormalize2"]

df_test[/src]

[TABLE="class: dataframe, width: 3354"]
[TR]
[TH="align: left"][/TH]
[TH="align: left"]A[/TH]
[TH="align: left"]B[/TH]
[TH="align: left"]HashNormalize1[/TH]
[TH="align: left"]HashNormalize2[/TH]
[TH="align: left"]HashNormSum[/TH]
[/TR]
[TR]
[TH="align: left"]0[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]14888653698164444483[/TD]
[TD]14888653698164444483[/TD]
[TD]11330563322619337350[/TD]
[/TR]
[TR]
[TH="align: left"]1[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]14888653698164444483[/TD]
[TD]14888653698164444483[/TD]
[TD]11330563322619337350[/TD]
[/TR]
[TR]
[TH="align: left"]2[/TH]
[TD]1[/TD]
[TD]0[/TD]
[TD]14888653698164444483[/TD]
[TD]14888653698164444483[/TD]
[TD]11330563322619337350[/TD]
[/TR]
[TR]
[TH="align: left"]3[/TH]
[TD]0[/TD]
[TD]1[/TD]
[TD]6480458753460356307[/TD]
[TD]6480458753460356307[/TD]
[TD]12960917506920712614[/TD]
[/TR]
[/TABLE]
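
Applied to the swapped-column case from the original question, the same idea could look roughly like the sketch below. The sample frame `df` and its values are made up for illustration; only the column names `ColA`, `ColB`, and `HashNormSum` come from the post above. Rows 0 and 1 hold the same pair in swapped order, row 2 holds a different pair.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are the same pair, just swapped.
df = pd.DataFrame({
    "ColA": ["a", "b", "a"],
    "ColB": ["b", "a", "c"],
})

# index=False keeps the row label out of the hash, so only the values count.
hash_ab = pd.util.hash_pandas_object(df[["ColA", "ColB"]], index=False)
hash_ba = pd.util.hash_pandas_object(df[["ColB", "ColA"]], index=False)

# uint64 addition wraps around on overflow, but it stays commutative,
# so the sum does not depend on which column order was hashed first.
df["HashNormSum"] = hash_ab + hash_ba

# True if rows 0 and 1 are recognised as the same pair.
same_pair = df["HashNormSum"].iloc[0] == df["HashNormSum"].iloc[1]
print(df)
print(same_pair)
```

The `astype("int")` step from the question is left out here on purpose: depending on the platform's default integer width, casting the uint64 sums to a signed integer may change or truncate the values. The wrap-around on overflow also explains why some HashNormSum values in the tables above are smaller than the two hashes they were added from.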
 