Hadoop: Distinct count of a value (Java)
Example of the (key, value) pair emitted by the mapper: (user, (loginCount, commentCount))
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Log line format: timestamp,activity,user,...
    String[] stringData = value.toString().split(",");
    String activity = stringData[1];
    String user = stringData[2];
    if (activity.equals("login")) {
        outCount.set(1, 0);
    } else if (activity.equals("comment")) {
        outCount.set(0, 1);
    } else {
        return; // unknown activity: skip instead of re-emitting the previous count
    }
    outUserId.set(user); // the original passed an undefined "userid" here
    context.write(outUserId, outCount);
}
I count the logins and comments of each user. I want to change the count: count every login, but only record whether the user wrote a comment at all. How can I make the mapper or reducer look for one comment of a user and "ignore" the other comments of that user?
Edit:
Log file:
2013-01-01T16:50:56.056+0100,login,user14133,somedata,somedata
2013-01-01T16:55:56.056+0100,login,user14133,somedata,somedata
2013-01-01T05:20:44.044+0100,comment,user14133,somedata,somedata,{text: "something here"}
2013-01-01T05:24:44.044+0100,comment,user14133,somedata,somedata,{text: "something here"}
2013-01-01T20:50:13.013+0100,login,user76892,somedata,somedata
Output at the moment:
user14133 logins: 2 comments: 2
user76892 logins: 1 comments: 0
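Input:
Mapper<LongWritable, Text, Text, UserCount>
Reducer<Text, UserCount, Text, UserCount>

public static class UserCount implements Writable {
    // The constructor must carry the class name (the original read "usercounttuple").
    public UserCount() {
        set(new IntWritable(0), new IntWritable(0));
    }
    // the rest of the class was not included in the post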
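Since the post only shows the constructor, here is a minimal sketch of what such a value type could look like. It is plain Java so it runs without Hadoop on the classpath; the name UserCountSketch, the plain int fields, the getters and the roundTrip helper are all assumptions for illustration, not the asker's code. Only the write and readFields signatures are taken from Hadoop's Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical value type; a real job would also declare "implements Writable". */
final class UserCountSketch {
    private int logins;
    private int comments;

    void set(int logins, int comments) {
        this.logins = logins;
        this.comments = comments;
    }

    int getLogins() { return logins; }
    int getComments() { return comments; }

    // Same signature as Writable.write: serialize both counters in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(logins);
        out.writeInt(comments);
    }

    // Same signature as Writable.readFields: read them back in the same order.
    public void readFields(DataInput in) throws IOException {
        logins = in.readInt();
        comments = in.readInt();
    }

    @Override
    public String toString() {
        return "logins: " + logins + " comments: " + comments;
    }

    /** Hypothetical helper: serialize and deserialize, as Hadoop would between map and reduce. */
    static UserCountSketch roundTrip(UserCountSketch source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            UserCountSketch copy = new UserCountSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UserCountSketch original = new UserCountSketch();
        original.set(2, 1);
        System.out.println(roundTrip(original)); // prints "logins: 2 comments: 1"
    }
}
```

The important design point is that write and readFields handle the fields in the same order; otherwise the reducer would see the two counters swapped.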
My MapReduce job counts every login and every comment of a user and sums them up. What I want to achieve is this output:
user14133 logins: 2 comments: 0 or 1 (did the user write a comment?)*

* In the mapper or reducer (?):
for every line in the log {
    if (user wrote a comment) {
        return 1;
        ignore other comments of the same user in the log;
    } else if (user didn't write anything)
        return 0;
}
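One possible way to get that 0-or-1 flag is to leave the mapper alone and collapse the comments in the reducer, since the reducer sees all values of one user together. The sketch below shows only that per-user logic in plain Java; the class name CommentFlagSketch and the int[] {logins, comments} pairs are hypothetical stand-ins for the UserCount values, and in a real job the loop would sit inside reduce(Text key, Iterable<UserCount> values, Context context).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the reducer body; each int[] is {logins, comments}. */
final class CommentFlagSketch {

    /** Sums the logins of one user, but collapses all comments into a 0/1 flag. */
    static int[] reduce(List<int[]> values) {
        int logins = 0;
        int commented = 0;
        for (int[] value : values) {
            logins += value[0];
            if (value[1] > 0) {
                commented = 1; // one comment is enough; further comments change nothing
            }
        }
        return new int[] {logins, commented};
    }

    public static void main(String[] args) {
        // The four records the mapper would emit for user14133 in the sample log.
        List<int[]> user14133 = Arrays.asList(
                new int[] {1, 0}, new int[] {1, 0}, new int[] {0, 1}, new int[] {0, 1});
        int[] out = reduce(user14133);
        System.out.println("user14133 logins: " + out[0] + " comments: " + out[1]);
        // prints "user14133 logins: 2 comments: 1"
    }
}
```

Doing this in the reducer rather than the mapper avoids having to remember users across map calls, which would not work anyway when one user's lines are spread over several input splits.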
If I understand correctly, you want the total number of unique users who logged in, along with the total number of comments?
I recommend using the "aggregate" reducer in Hadoop.
In the mapper, output lines like this:
UniqValueCount:unique_users user14133
LongValueSum:comments 1
UniqValueCount:unique_users user14133
LongValueSum:comments 1
UniqValueCount:unique_users user14133
LongValueSum:comments 1
UniqValueCount:unique_users user14133
LongValueSum:comments 1
UniqValueCount:unique_users user76892
LongValueSum:comments 1
Then run the "aggregate" reducer on this; the output should look like:
unique_users 2
comments 5
I'm assuming this is what you want?
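A mapper that produces records of this shape could be sketched roughly as follows. This is plain Java rather than a real Hadoop job, so it can be run on its own. As far as I know, the aggregate reducer is the one selected with "-reducer aggregate" in Hadoop Streaming, and it expects the key and value to be separated by a tab. The class name AggregateRecordsSketch, both helper methods, and the local aggregate stand-in are hypothetical. The sketch also assumes a comments record should be emitted only for comment lines, which differs from the sample above where every line carries one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AggregateRecordsSketch {

    /** Turns one log line (timestamp,activity,user,...) into aggregate-style records. */
    static List<String> toRecords(String logLine) {
        String[] fields = logLine.split(",");
        List<String> records = new ArrayList<>();
        if (fields.length < 3) {
            return records; // skip malformed lines
        }
        records.add("UniqValueCount:unique_users\t" + fields[2]);
        if (fields[1].equals("comment")) { // assumption: count comment lines only
            records.add("LongValueSum:comments\t1");
        }
        return records;
    }

    /** Local stand-in for checking the records; this is NOT Hadoop's aggregate reducer. */
    static long[] aggregate(List<String> records) {
        Set<String> users = new HashSet<>();
        long comments = 0;
        for (String record : records) {
            String[] keyValue = record.split("\t", 2);
            if (keyValue[0].equals("UniqValueCount:unique_users")) {
                users.add(keyValue[1]);
            } else if (keyValue[0].equals("LongValueSum:comments")) {
                comments += Long.parseLong(keyValue[1]);
            }
        }
        return new long[] {users.size(), comments};
    }

    public static void main(String[] args) {
        String[] log = {
            "2013-01-01T16:50:56.056+0100,login,user14133,somedata,somedata",
            "2013-01-01T16:55:56.056+0100,login,user14133,somedata,somedata",
            "2013-01-01T05:20:44.044+0100,comment,user14133,somedata,somedata,{text: \"something here\"}",
            "2013-01-01T05:24:44.044+0100,comment,user14133,somedata,somedata,{text: \"something here\"}",
            "2013-01-01T20:50:13.013+0100,login,user76892,somedata,somedata"
        };
        List<String> records = new ArrayList<>();
        for (String line : log) {
            records.addAll(toRecords(line));
        }
        long[] totals = aggregate(records);
        System.out.println("unique_users " + totals[0]); // prints "unique_users 2"
        System.out.println("comments " + totals[1]);     // prints "comments 2"
    }
}
```

Note that this gives totals across all users, not a per-user line, so it only fits if global counts are what is needed.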