Iterating through a Python dictionary is very slow
I'm trying to compute the distance between pairs of users based on the values of the items assigned to them. The distance calculation should be null when two users do not have any intersecting items. I'm only calculating the lower half of the distance matrix (e.g. userA-userB is equivalent to userB-userA, so I only calculate one).
So I have the following Python script, which works, but it starts chugging when I feed it more than a few hundred users. The sample script below shows the input structure, but I'm trying to do this for thousands of users, not the 4 I have shown here.
The line s = {k: v for k, v in data.items() if k in (user1, user2)} seems to add overhead.
    import math
    from decimal import Decimal

    def has_matching_product(data, user1, user2):
        c1 = set(data[user1].keys())
        c2 = [k for k in data[user2].keys()]
        return any([x in c1 for x in c2])

    def get_euclidean_dist(data, user1, user2):
        # tried subsetting to run quicker?
        s = {k: v for k, v in data.items() if k in (user1, user2)}
        # ignore users with no overlapping items
        if has_matching_product(s, user1, user2):
            items = set()
            for k, v in s.items():
                for ki in v.keys():
                    items.add(ki)
            rs = Decimal(0)
            for i in items:
                p1 = s.get(user1).get(i)
                p2 = s.get(user2).get(i)
                v1 = p1 or 0
                v2 = p2 or 0
                rs += Decimal((v1 - v2) ** 2)
            return math.sqrt(rs)
        else:
            return None

    # user/product/value
    raw_data = {
        'u1': {'i1': 5, 'i4': 2},
        'u2': {'i1': 1, 'i3': 6},
        'u3': {'i3': 11},
        'u4': {'i4': 9}
    }
    users = sorted(raw_data.keys())
    l = len(users)
    data_out = set()
    # compute lower half of distance matrix (unique pairs only)
    for u1 in range(0, l - 1):
        for u2 in range(1 + u1, l):
            dist = get_euclidean_dist(raw_data, users[u1], users[u2])
            print('{x} | {y} | {d}'.format(x=users[u1], y=users[u2], d=dist))
What the proper output should look like:
    u1 | u2 | 7.483314773547883
    u1 | u3 | None
    u1 | u4 | 8.602325267042627
    u2 | u3 | 5.0990195135927845
    u2 | u4 | None
    u3 | u4 | None
The issue is that you're walking the entire dictionary every time, just to find the 2 items you want. And, from the looks of it, you're pulling out the users, and then spending time trying to go and find them again in data. @Peter Wood's suggestion helps a bunch - just grab the 2 users you want in the first place - but that's sort of missing the forest for the trees - you don't need to slim down the dictionary in the first place at all. Just keep them together:
    import itertools

    for kv1, kv2 in itertools.combinations(data.items(), 2):
        # kv1 and kv2 are (user, items) pairs; calculate the distance directly here
        ...
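To make the idea concrete, here is a minimal sketch of the whole script rewritten around itertools.combinations. It is not the asker's code: the helper name euclidean_dist and the results list are made up for illustration, and it assumes plain numeric values, so it sums with ordinary numbers instead of Decimal. It reuses the sample raw_data from the question, and sorts the items only so the pairs come out in the same order as the expected output.

```python
import itertools
import math


def euclidean_dist(items1, items2):
    """Distance between two users' item dicts, or None if they share no items.

    Hypothetical helper for illustration; not part of the original script.
    """
    # dict key views support set operations, so the overlap test needs no copies
    if not items1.keys() & items2.keys():
        return None
    all_items = items1.keys() | items2.keys()
    # An item missing for one user counts as 0, as in the question's script
    return math.sqrt(
        sum((items1.get(i, 0) - items2.get(i, 0)) ** 2 for i in all_items)
    )


# Sample data from the question: user -> {item: value}
raw_data = {
    'u1': {'i1': 5, 'i4': 2},
    'u2': {'i1': 1, 'i3': 6},
    'u3': {'i3': 11},
    'u4': {'i4': 9},
}

# Each unique pair is produced once, already carrying both users' item dicts,
# so nothing has to be looked up (or filtered out of the full dict) again.
results = [
    (u1, u2, euclidean_dist(items1, items2))
    for (u1, items1), (u2, items2)
    in itertools.combinations(sorted(raw_data.items()), 2)
]

for u1, u2, dist in results:
    print('{x} | {y} | {d}'.format(x=u1, y=u2, d=dist))
```

The number of pairs still grows quadratically with the number of users, so this only removes the per-pair scan of the whole dictionary; it does not change how many distances have to be computed.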