python - How to apply multiple functions to a pandas dataframe without multiple loops? -
i use pandas etl process. query database put result in dataframe; dataframe pretty big (1m rows * 50 columns). dataframe composed of string , date.
i use apply()
function of pandas make transformations. thing transformation contains multiple branching on string.
df['merged_contract_status'] = df.apply(lib.merged_contract_status, axis=1) df['renewed'] = df.apply(lib.contract_renewed, axis=1) df['renewal_dt'] = df.apply(lib.contract_renewed_date, axis=1) ....
i have bunch of transformation that. functions call:
def merged_contract_status(row): if row['tbm_contract_status'] not np.nan: mergedcontractstatus = row['tbm_contract_status'] else: mergedcontractstatus = row['ccn_status'] return mergedcontractstatus def contract_renewed(row): ccn_activation_dt = row['ccn_activation_dt'] ccn_status = row['ccn_status'] f_init_val_dt = row['f_init_val_dt'] f_nat_exp_dt = row['f_nat_exp_dt'] spr_spo_code = row['spr_spo_code'] csp_status_sep_1 = row['csp_status_sep_1'] csp_begin_dt_sep_1 = row['csp_begin_dt_sep_1'] ctt_type_1 = row['ctt_type_1'] csp_status_sep_2 = row['csp_status_sep_2'] csp_begin_dt_sep_2 = row['csp_begin_dt_sep_2'] ctt_type_2 = row['ctt_type_2'] csp_status_sep_3 = row['csp_status_sep_3'] csp_begin_dt_sep_3 = row['csp_begin_dt_sep_3'] ctt_type_3 = row['ctt_type_3'] csp_begin_dt_sep_father = row['csp_begin_dt_sep_father'] csp_end_dt_sep_father = row['csp_end_dt_sep_father'] todaydate = datetime.datetime.today() if spr_spo_code == 'pcc': if ctt_type_1 == 'normal' , ccn_activation_dt not none , csp_begin_dt_sep_1 not none , csp_begin_dt_sep_1>(ccn_activation_dt+timedelta(365)): return 'y' elif ctt_type_2 == 'normal' , ccn_activation_dt not none , csp_begin_dt_sep_2 not none , csp_begin_dt_sep_2>(ccn_activation_dt+timedelta(365)): return 'y' elif ctt_type_3== 'normal' , ccn_activation_dt not none , csp_begin_dt_sep_3 not none , csp_begin_dt_sep_3>(ccn_activation_dt+timedelta(365)): return 'y' else: return 'n' else: if (f_init_val_dt none , f_nat_exp_dt none , ccn_activation_dt not none , ccn_activation_dt < (todaydate- timedelta(365)) , (csp_begin_dt_sep_father <= todaydate , csp_begin_dt_sep_father >= todaydate , ccn_status=='activated')): return 'y' elif f_init_val_dt not none , f_nat_exp_dt not none , f_init_val_dt <= todaydate , f_nat_exp_dt >= todaydate , ccn_status=='activated' , ccn_activation_dt not none , ccn_activation_dt < (todaydate- timedelta(365)): return 'y' else: return 'n'
each time call apply on df, pandas loops through entire df , takes around 10 min. i'm ok 10 min; know cannot improve performance. there way avoid multiple loop? can pandas loop once , transformation want?
edit: it's hard give data since dataframe big , built sql query. want way loop once through dataframe, not want improve each function (taht's kind of impossible branching on string)
thanks
python , pandas can 1 thing @ time. functions lot of things though , build them 1 function. not problem here.
applying functions axis=1
super heavy, because iterating on whole dataframe.
you using pandas, not using pandas.
you should rewrite functions apply vectorized operations.
it looks .loc method out lot
index dataframe on using in if statements
df.set_index(['spr_spo_code', 'ctt_type_1', 'ccn_activation_dt'], inplace=true)
and on.
then can use .loc
, first should create column want result.
df['contract_renewed'] = 'n' df.loc['pcc', 'normal', none ..., 'contract_renewed'] = 'y'
and crazy faster current way of operation.
tiny example
>>> df = pd.dataframe({'foo':[1,2,3], 'bar':['baz','boo','bee'], 'baz':['n']*3}) >>> df.set_index(['bar', 'foo'], inplace=true) >>> df.loc[('baz', 1), 'baz'] = 'y' >>> df baz bar foo baz 1 y boo 2 n bee 3 n
Comments
Post a Comment