Hadoop: Distinct count of a value (Java)


Example of a (key, value) pair emitted by the mapper: (user, (loginCount, commentCount))

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String tempString = value.toString();
        String[] stringData = tempString.split(",");

        // log line layout: timestamp,activity,user,...
        String user = stringData[2];
        String activity = stringData[1];

        if (activity.equals("login")) {
            outCount.set(1, 0);
        }

        if (activity.equals("comment")) {
            outCount.set(0, 1);
        }

        // key the record by the user field parsed above
        outUserId.set(user);

        context.write(outUserId, outCount);
    }

I count the logins and comments of each user. I want to change the count: still count every login, but for comments only record whether the user wrote one at all. How can I get the mapper or reducer to look for a single comment from a user and "ignore" the other comments of that user?

Edit:

Log file:

    2013-01-01t16:50:56.056+0100,login,user14133,somedata,somedata
    2013-01-01t16:55:56.056+0100,login,user14133,somedata,somedata
    2013-01-01t05:20:44.044+0100,comment,user14133,somedata,somedata,{text: "something here"}
    2013-01-01t05:24:44.044+0100,comment,user14133,somedata,somedata,{text: "something here"}
    2013-01-01t20:50:13.013+0100,login,user76892,somedata,somedata

Output at the moment:

    user14133   logins: 2   comments: 2
    user76892   logins: 1   comments: 0

Input:

    Mapper<LongWritable, Text, Text, UserCount>
    Reducer<Text, UserCount, Text, UserCount>

    public static class UserCount implements Writable {
        public UserCount() {
            set(new IntWritable(0), new IntWritable(0));
        }

My MapReduce job counts every login and every comment of a user and sums them up. What I want to achieve is this output:

    user14133   logins: 2      comments: 0 or 1 (did the user write a comment?)*

    * in the mapper or the reducer (?)

    for every line in the log {
        if (user wrote a comment) {
            return 1;
            ignore other comments by the same user in the log;
        } else if (user didn't write anything) return 0;
    }

If I understand correctly, you want the total number of unique users who logged in, along with the total number of comments?

I recommend using the "aggregate" reducer in Hadoop.

In your mapper, output lines like this:

    UniqValueCount:unique_users      user14133
    LongValueSum:comments            1
    UniqValueCount:unique_users      user14133
    LongValueSum:comments            1
    UniqValueCount:unique_users      user14133
    LongValueSum:comments            1
    UniqValueCount:unique_users      user14133
    LongValueSum:comments            1
    UniqValueCount:unique_users      user76892
    LongValueSum:comments            1

Then run the "aggregate" reducer on this, and you should get output that looks like:

    unique_users    2
    comments        5
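As a rough illustration of the mapper side, the sketch below turns one log line into the tab-separated records that the aggregate reducer consumes. The class and method names (`AggregateRecordSketch`, `toAggregateRecords`) are made up for this example, and Hadoop types are left out so it runs standalone. It also assumes a `LongValueSum:comments` record should be emitted only for `comment` lines, whereas the listing above emits one for every log line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: builds the "type:id<TAB>value"
// records that a mapper would write for the aggregate reducer.
class AggregateRecordSketch {

    static List<String> toAggregateRecords(String logLine) {
        List<String> records = new ArrayList<>();
        String[] fields = logLine.split(",");
        if (fields.length < 3) {
            return records; // malformed line: emit nothing
        }
        String activity = fields[1];
        String user = fields[2];

        // Every line contributes its user to the distinct-user count.
        records.add("UniqValueCount:unique_users\t" + user);

        // Assumption: only comment lines add to the comment total.
        if (activity.equals("comment")) {
            records.add("LongValueSum:comments\t1");
        }
        return records;
    }

    public static void main(String[] args) {
        String line = "2013-01-01t05:20:44.044+0100,comment,user14133,somedata,somedata";
        for (String record : toAggregateRecords(line)) {
            System.out.println(record);
        }
    }
}
```

In a real job the same two strings would be written through `context.write` (or printed to stdout in a streaming mapper) instead of being collected in a list.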

I'm assuming this is what you want?
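If the goal is instead the per-user output described in the question (every login counted, comments collapsed to 0 or 1), one possible approach is to keep the mapper as it is and change only how the reducer combines values: sum the login counts, but take the maximum of the comment flags instead of their sum. The sketch below simulates that logic in plain Java over a list of log lines. The names `DistinctCommentSketch` and `countPerUser` are hypothetical, and the Hadoop `Mapper`/`Reducer` plumbing is deliberately omitted so the example runs on its own.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical standalone simulation of the map + reduce logic.
class DistinctCommentSketch {

    /** Returns user -> {logins, commented}, where commented is 0 or 1. */
    static Map<String, int[]> countPerUser(List<String> logLines) {
        Map<String, int[]> totals = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] fields = line.split(",");
            if (fields.length < 3) {
                continue; // skip malformed lines
            }
            String activity = fields[1];
            String user = fields[2];
            int[] counts = totals.computeIfAbsent(user, u -> new int[2]);

            if (activity.equals("login")) {
                counts[0] += 1;                    // logins are summed
            } else if (activity.equals("comment")) {
                counts[1] = Math.max(counts[1], 1); // comments cap at 1
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "2013-01-01t16:50:56.056+0100,login,user14133,somedata,somedata",
            "2013-01-01t16:55:56.056+0100,login,user14133,somedata,somedata",
            "2013-01-01t05:20:44.044+0100,comment,user14133,somedata,somedata",
            "2013-01-01t05:24:44.044+0100,comment,user14133,somedata,somedata",
            "2013-01-01t20:50:13.013+0100,login,user76892,somedata,somedata");

        countPerUser(log).forEach((user, c) ->
            System.out.println(user + "\tlogins: " + c[0] + "\tcomments: " + c[1]));
    }
}
```

For the sample log this reports 2 logins and 1 comment for user14133, and 1 login and 0 comments for user76892. In an actual reducer the same idea would amount to replacing the comment sum with something like `Math.min(sum, 1)` before writing the result.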

