vendredi 8 mai 2015

Filtering out redundant repeated data from a pandas dataframe

I have a dataframe that looks a like this;

index, othercols, FPN

ts1, otherStuff, val1

ts2, otherStuff, val2

ts3, otherStuff, val3

ts4, otherStuff, val4

....

tsn, otherStuff, valn

Because of the external data source a lot of these values will be repeated- so in million row dataframe there will be several pieces up to 10,000s entries long all just repeating the same data for different timestamps. For my purposes at least this repetition is not necessary, so I want to remove all the repeated rows apart from the beginning and end of each section like so;

1, 0

2, 0

3, 0

4, 5

5, 0

6, 0

becomes

1, 0

3, 0

4, 5

5, 0

6, 0

I have managed to do this, but it's slower than I would like (takes 2 mins for a single 60mb file; mostly in the apply part as seen below) and I'm thinking that there must be a better way of doing it

Here's my cobbled together solution, is there a neater/faster way of doing this?;

data=df['FPN']

shft_up=(copy.deepcopy(data)).tolist()
shft_dn=(copy.deepcopy(data)).tolist()

del shft_up[0]
shft_up=shft_up+[None]

del shft_dn[-1]
shft_dn=[None]+shft_dn

df['shft_up']=shft_up
df['shft_dn']=shft_dn

def is_rep(row):
    if row['shft_dn']==row['FPN'] and row['shft_up']==row['FPN']:
        return 1
    else:
        return 0  

df['mask_col']=df.apply(lambda row:is_rep(row),axis=1,reduce=False)

df=(df[df['mask_col']==0]).drop(['shft_up','shft_dn','mask_col'],axis=1)

Aucun commentaire:

Enregistrer un commentaire