I have a dataframe that looks a like this;
index, othercols, FPN
ts1, otherStuff, val1
ts2, otherStuff, val2
ts3, otherStuff, val3
ts4, otherStuff, val4
....
tsn, otherStuff, valn
Because of the external data source a lot of these values will be repeated- so in million row dataframe there will be several pieces up to 10,000s entries long all just repeating the same data for different timestamps. For my purposes at least this repetition is not necessary, so I want to remove all the repeated rows apart from the beginning and end of each section like so;
1, 0
2, 0
3, 0
4, 5
5, 0
6, 0
becomes
1, 0
3, 0
4, 5
5, 0
6, 0
I have managed to do this, but it's slower than I would like (takes 2 mins for a single 60mb file; mostly in the apply part as seen below) and I'm thinking that there must be a better way of doing it
Here's my cobbled together solution, is there a neater/faster way of doing this?;
data=df['FPN']
shft_up=(copy.deepcopy(data)).tolist()
shft_dn=(copy.deepcopy(data)).tolist()
del shft_up[0]
shft_up=shft_up+[None]
del shft_dn[-1]
shft_dn=[None]+shft_dn
df['shft_up']=shft_up
df['shft_dn']=shft_dn
def is_rep(row):
if row['shft_dn']==row['FPN'] and row['shft_up']==row['FPN']:
return 1
else:
return 0
df['mask_col']=df.apply(lambda row:is_rep(row),axis=1,reduce=False)
df=(df[df['mask_col']==0]).drop(['shft_up','shft_dn','mask_col'],axis=1)
Aucun commentaire:
Enregistrer un commentaire