How to get rid of NaN in an SFrame in Python?


I want to drop the rows that contain NaN from a dataframe, but when I call item_info.dropna(axis = 0, how='all'), which comes from the pandas documentation (pandas.pydata.org), it does not work:

item_info.dropna(axis = 0, how='all')

I use it together with:

m2 = ranking_factorization_recommender.create(subcriber_eclipse,
                                              target='count',
                                              user_data = subcriber_eclipse,
                                              item_data = sf_test)

This gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

<ipython-input-44-02025aac0088> in <module>()
----> 1 item_info.dropna(axis = 0, how='all')
      2 
      3 #item_info.fillna(0, inplace=True)
      4 
      5 #print item_info

TypeError: dropna() got an unexpected keyword argument 'axis'

The table comes from a SQL query:

item_info = graphlab.SFrame.from_sql(conn,"""--- usage matrix of hashtags per eclipse
SELECT COUNT (eclipse_hashtag.eclipse_id), eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id FROM eclipse_hashtag
    GROUP BY eclipse_hashtag.hashtag_id, eclipse_hashtag.eclipse_id
      ORDER BY eclipse_hashtag.hashtag_id,eclipse_hashtag.eclipse_id ASC;
    """)

item_info.rename({'eclipse_id':'item_id'})

Here is the structure:

type(item_info)

graphlab.data_structures.sframe.SFrame

And here is the full error trace:

[ERROR] graphlab.toolkits._main: Toolkit error: Missing value (None) encountered in column 'item_id.1'. Use the SFrame's dropna
function to drop rows with 'None' values in them.'

---------------------------------------------------------------------------
ToolkitError                              Traceback (most recent call last)
<ipython-input-8-56b88cdef560> in <module>()
      9 m2 = ranking_factorization_recommender.create(subcriber_eclipse,
target='count',
     10                                               user_data = subcriber_eclipse,
---> 11                                               item_data = item_info)
     12 

/home/antoine/anaconda2/lib/python2.7/site-packages/graphlab/toolkits/recommender/ranking_factorization_recommender.pyc
in create(observation_data, user_id, item_id, target, user_data,
item_data, num_factors, regularization, linear_regularization,
side_data_factorization, ranking_regularization,
unobserved_rating_value, num_sampled_negative_examples,
max_iterations, sgd_step_size, random_seed, binary_target, solver,
verbose, **kwargs)
    267         opts.update(kwargs)
    268 
--> 269     response = _graphlab.toolkits._main.run('recsys_train', opts, verbose)
    270 
    271     return RankingFactorizationRecommender(response['model'])

/home/antoine/anaconda2/lib/python2.7/site-packages/graphlab/toolkits/_main.pyc
in run(toolkit_name, options, verbose, show_progress)
     87         _get_metric_tracker().track(metric_name, value=1, properties=track_props, send_sys_info=False)
     88 
---> 89         raise ToolkitError(str(message))

ToolkitError: Missing value (None) encountered in column 'item_id.1'. Use the SFrame's dropna function to drop rows with 'None'
values in them.
    
asked by ThePassenger 22.06.2017 at 16:57

2 answers


I edited the question slightly, since you are not using pandas DataFrames.

Because you are not using pandas, you should not rely on its documentation. The SFrame documentation indicates that axis is not a valid keyword for dropna, but you can use columns instead. Note also that dropna returns a new SFrame rather than modifying item_info in place, so the result has to be assigned. Going by that documentation, you should be able to do what you want as follows:

item_info = item_info.dropna(columns=None, how='all')
    
answered 23.06.2017 at 08:27

It seems that the underlying error is not raised by dropna itself; rather, one of the methods you use does not accept columns that contain null values.

Let's start with an example:

>>> import graphlab
>>> import numpy as np
>>> sf = graphlab.SFrame({'a': [1, None, np.nan],
...                       'b': [3, None,   None],
...                       'c': [4,    5, np.nan],
...                       'd': [6,    7,      8]})
>>> sf
Data:
+------+------+-----+---+
|  a   |  b   |  c  | d |
+------+------+-----+---+
| 1.0  |  3   | 4.0 | 6 |
| None | None | 5.0 | 7 |
| nan  | None | nan | 8 |
+------+------+-----+---+

dropna deletes entire rows, and which rows it deletes depends on the how parameter:

  • how = 'all' : Delete a row only if it has null values in all of the columns specified in columns:

    >>> sf.dropna(columns=None, how='all')   
    Data:
    +------+------+-----+---+
    |  a   |  b   |  c  | d |
    +------+------+-----+---+
    | 1.0  |  3   | 4.0 | 6 |
    | None | None | 5.0 | 7 |
    | nan  | None | nan | 8 |
    +------+------+-----+---+
    
    >>> sf.dropna(columns=['a', 'b'], how='all')
    Data:
    +-----+---+-----+---+
    |  a  | b |  c  | d |
    +-----+---+-----+---+
    | 1.0 | 3 | 4.0 | 6 |
    +-----+---+-----+---+
    
  • how = 'any' : Delete a row if at least one null value exists in any of the columns specified in columns:

    >>> sf.dropna(columns=None, how='any')
    Data:
    +-----+---+-----+---+
    |  a  | b |  c  | d |
    +-----+---+-----+---+
    | 1.0 | 3 | 4.0 | 6 |
    +-----+---+-----+---+
    
    
    >>> sf.dropna(columns=['c', 'd'], how='any')
    Data:
    +------+------+-----+---+
    |  a   |  b   |  c  | d |
    +------+------+-----+---+
    | 1.0  |  3   | 4.0 | 6 |
    | None | None | 5.0 | 7 |
    +------+------+-----+---+
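
The two modes above can be mimicked in plain Python to make the rule explicit. This is only a sketch of the semantics, not how SFrame implements dropna: the rows are a list of dicts standing in for the SFrame, and is_missing and drop_missing are hypothetical helper names introduced for illustration.

```python
import math


def is_missing(value):
    """True for None and for a float NaN; any other value counts as present."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(rows, columns=None, how="any"):
    """Return the rows that survive (rows is a list of dicts, a stand-in for an SFrame)."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    kept = []
    for row in rows:
        # columns=None means "look at every column", as in the examples above
        names = list(row) if columns is None else columns
        if not check(is_missing(row[name]) for name in names):
            kept.append(row)
    return kept


nan = float("nan")
rows = [
    {"a": 1.0, "b": 3, "c": 4.0, "d": 6},
    {"a": None, "b": None, "c": 5.0, "d": 7},
    {"a": nan, "b": None, "c": nan, "d": 8},
]

print(len(drop_missing(rows, how="all")))                      # 3
print(len(drop_missing(rows, columns=["a", "b"], how="all")))  # 1
print(len(drop_missing(rows, how="any")))                      # 1
print(len(drop_missing(rows, columns=["c", "d"], how="any")))  # 2
```

The row counts match the four SFrame outputs shown above: column d never has a null, so how='all' over every column removes nothing.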
    

As you can see, it works with both None and real nan values. If you apply sf.dropna(columns=None, how='all'), you will only delete the rows in which every value is null, not rows like [None, None, 5.0, 7]. Also watch out for null values that are not recognized as such, for example the string 'NaN', especially if you read the data from a CSV file.
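
For the case of textual null markers, one option is to turn them into None before dropping or filling. The sketch below is plain Python and makes assumptions: normalize_missing is a hypothetical helper, and the list of markers is only a guess at what a CSV export might contain, so adjust it to your data.

```python
def normalize_missing(value, markers=("", "nan", "null", "none", "na")):
    """Map textual null markers to None; leave every other value unchanged."""
    # The marker list is an assumption, not something SFrame defines.
    if isinstance(value, str) and value.strip().lower() in markers:
        return None
    return value


raw_column = ["12", "NaN", " null ", "7", None]
cleaned = [normalize_missing(v) for v in raw_column]
print(cleaned)  # ['12', None, None, '7', None]
```

Once the markers are real None values, dropna and fillna can see them.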

If any of the functions you use does not accept null values in the SFrame, you can either drop every row that has at least one null value (how='any') or use fillna to replace those null values with, for example, 0.

Which option is right depends on your data, its nature, and what you are evaluating and how. Deleting a row because of a single null value may discard data you cannot afford to lose. The same goes for replacing null values with 0: in some cases None is equivalent to 0 and in others it is not. You have to assess this and act accordingly, if the problem is indeed the one I describe.

SFrame.fillna takes as arguments the column on which to apply the operation and the value with which to replace the nulls:

>>> sf.fillna(column='a', value=0.0)
Data:
+-----+------+-----+---+
|  a  |  b   |  c  | d |
+-----+------+-----+---+
| 1.0 |  3   | 4.0 | 6 |
| 0.0 | None | 5.0 | 7 |
| 0.0 | None | nan | 8 |
+-----+------+-----+---+
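
The replacement rule for a single column can be sketched in plain Python as follows. This is an illustration of the idea only, not SFrame's implementation, and fill_missing is a hypothetical helper name.

```python
import math


def fill_missing(values, fill_value):
    """Return a new list with None and float NaN replaced by fill_value."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [fill_value if is_missing(v) else v for v in values]


column_a = [1.0, None, float("nan")]
print(fill_missing(column_a, 0.0))  # [1.0, 0.0, 0.0]
```

The result matches column a in the output above. The original list is left untouched, so the filled values have to be assigned to a name to be used.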
    
answered 23.06.2017 at 13:31