Yesterday, I saw tf.contrib.layers.sparse_column_with_hash_bucket in a tutorial. “That’s a very useful function!” I thought. I had never come across such a function in Keras or TFLearn.
Basically, the function does something like this:
hash(category_string) % dim
Take the text “the quick brown fox”. If we want to put its words into 5 buckets, we might get a result like this:
hash(the) % 5 = 0
hash(quick) % 5 = 1
hash(brown) % 5 = 1
hash(fox) % 5 = 3
This example is mentioned by Luis Argerich.
Hashing like this makes preprocessing really easy, but it has its disadvantages, which Artem Onuchin also mentions on the same page.
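For what it’s worth, here is a minimal sketch of the bucketing idea in plain Python. It uses Python’s built-in hash(), which is not the hash TensorFlow uses internally, and Python 3 randomizes string hashes per process, so the exact buckets will differ between runs unless PYTHONHASHSEED is fixed:

def bucket(word, num_buckets=5):
    # map a string to one of num_buckets buckets via the hashing trick
    return hash(word) % num_buckets

for word in "the quick brown fox".split():
    print(word, bucket(word))

Note that two different words can land in the same bucket (like quick and brown above); that collision risk is exactly the trade-off the hashing trick accepts in exchange for a fixed dimension.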
So, the common ways to do this kind of feature engineering are the ones mentioned by Rahul Agarwal (a rough sketch of the first three follows the list):
- Scaling by Max-Min
- Normalization using Standard Deviation
- Log based feature/Target: use log based features or log based target function.
- One Hot Encoding
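Here is a rough pandas sketch of the first three items; the income column and its values are just made-up examples, and one-hot encoding is shown further down with pd.get_dummies:

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [30000, 45000, 80000, 120000]})

# Scaling by Max-Min: squeeze values into the [0, 1] range
df['income_minmax'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# Normalization using Standard Deviation: zero mean, unit variance
df['income_zscore'] = (df['income'] - df['income'].mean()) / df['income'].std()

# Log based feature: compress large values / long tails
df['income_log'] = np.log1p(df['income'])

print(df)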
Anyway, if we want to do hash bucketing without TensorFlow, we can do it in pandas, as mentioned here:
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data = pd.DataFrame(data)

def hash_col(df, col, N):
    # one new column per bucket, e.g. state_0 ... state_3
    cols = [col + "_" + str(i) for i in range(N)]
    print(cols)

    def xform(x):
        # one-hot vector of length N with a 1 in the hashed bucket
        tmp = [0] * N
        tmp[hash(x) % N] = 1
        return pd.Series(tmp, index=cols)

    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))
result:
pop year state_0 state_1 state_2 state_3
0 1.5 2000 1 0 0 0
1 1.7 2001 1 0 0 0
2 3.6 2002 1 0 0 0
3 2.4 2001 1 0 0 0
4 2.9 2002 1 0 0 0
[Edited] Actually, we can use pandas.get_dummies to do plain one-hot encoding directly (one column per distinct value, no hashing):
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data = pd.DataFrame(data)
print(pd.get_dummies(data, columns=['state','year']))
result:
pop state_Nevada state_Ohio year_2000 year_2001 year_2002
0 1.5 0.0 1.0 1.0 0.0 0.0
1 1.7 0.0 1.0 0.0 1.0 0.0
2 3.6 0.0 1.0 0.0 0.0 1.0
3 2.4 1.0 0.0 0.0 1.0 0.0
4 2.9 1.0 0.0 0.0 0.0 1.0
All in all, I think I should learn more about one-hot encoding and word2vec embeddings.
As Andrew Ng said, “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”