I have a question about two datasets, AdultData and AdultTest.
Each dataset contains many rows like this one:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female , 2174, 0, 40, United-States, >50K
I would like to calculate the probability that a "Female" earns more than 50K. To do this, I wrote the following:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Read AdultData.csv as integers so that naive Bayes can be computed
data1 = np.genfromtxt('AdultData.csv', delimiter=',', dtype='int', skip_footer=1)
datatest = np.genfromtxt('adultTest.csv', delimiter=',', dtype='int', skip_footer=1)
# Delete the last column, because it is the target
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)
# The target is column 14 of the training data
Class = [row[14] for row in data1]
clf = BernoulliNB()
clf.fit(data_new, Class)
print(clf.predict_proba(data_new))
# print(clf.predict_proba(dataTest_new))
As the result of the probability prediction, it always gives me:
[1. 0.]
I do not understand why. Even if I use the AdultTest data instead, the same results come out, although it contains different rows. Why do I not get different results? And what do the 2 columns mean?
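To make clear what number I am after, here is a minimal sketch of the direct estimate of P(>50K | Female) obtained by simply counting rows. This is not my actual code: the function name prob_high_income_given_sex, the column positions (9 for sex and 14 for the label, taken from the row layout shown above) and the three sample rows are only assumptions for illustration.

```python
import csv
import io


def prob_high_income_given_sex(rows, sex="Female", sex_col=9, target_col=14):
    """Estimate P(income > 50K | sex) as a ratio of row counts."""
    matching = 0  # rows whose sex field equals `sex`
    high = 0      # of those rows, how many are labelled >50K
    for row in rows:
        if len(row) <= target_col:
            continue  # skip blank or truncated lines
        if row[sex_col].strip() == sex:
            matching += 1
            # Some copies of the test file end the label with a dot (">50K."),
            # so a trailing dot is removed before comparing.
            if row[target_col].strip().rstrip(".") == ">50K":
                high += 1
    return high / matching if matching else float("nan")


# Illustrative rows in the same 15-column layout (not real results).
sample = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 2174, 0, 40, United-States, >50K
50, Private, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
"""

rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
p = prob_high_income_given_sex(rows)
print(p)  # 1 of the 2 "Female" rows is >50K, so this prints 0.5
```

With the real files, the rows would come from open('AdultData.csv') instead of the sample string. The text fields are kept as strings here, so the "Female" and ">50K" values can be compared directly.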
Could someone help me?
Greetings and thanks in advance!