scala - How to access a lookup (broadcast) RDD (or dataset) inside another RDD's map function -


I am new to Spark and Scala and have just started learning. I am using Spark 1.0.0 on CDH 5.1.3.

I have a lookup RDD named dbTableKeyValueMap: RDD[(String, String)], and I want to use dbTableKeyValueMap to process fileRdd (each row has 300+ columns). Here is the code:

val result = fileRdd.map { x =>
  val tmp = dbTableKeyValueMap.lookup(x)
  tmp
}

Running locally, this hangs and/or after some time fails with the error:

scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)

I can understand that accessing one RDD inside another causes issues once locality and the size of the collections come into the picture. Taking a Cartesian product is not an option for me, because the records in the file RDD are huge (each row has 300+ columns). In my Hadoop Java MapReduce code I used the distributed cache to load dbTableKeyValueMap in the setup method and then used it in map; I want to do something similar in a Spark map, but I have not been able to find a simple example of this use case. In short, I want to iterate over the fileRdd rows and apply transformations, beautifications, lookups, etc. on each column before further processing. Or is there some other way I can use dbTableKeyValueMap as a Scala collection instead of a Spark RDD?
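The per-column pass described above boils down to an ordinary map over the columns of each row, with each column value looked up in a plain key-value table. A minimal local sketch (a plain Scala Map stands in for the lookup table, and the names lookupMap, row, and the sample values are illustrative, not from the original code):

```scala
// Hypothetical lookup table, standing in for the key-value data
// that in Spark would live in a broadcast variable.
val lookupMap: Map[String, String] = Map("NY" -> "New York", "CA" -> "California")

// One row of the file, shortened to three columns for illustration.
val row: Seq[String] = Seq("NY", "CA", "TX")

// Transform every column through the lookup, keeping the raw value
// when no mapping exists -- the same per-column pass described above.
val transformed: Seq[String] = row.map(col => lookupMap.getOrElse(col, col))
// transformed == Seq("New York", "California", "TX")
```

Because the lookup is a plain in-memory Map access, the same closure can run safely inside an RDD transformation, unlike a nested RDD lookup.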

Please help.

Thanks.... The easiest thing is to convert the lookup RDD to a "Scala collection" and off you go!! Then you are able to access it inside transformations on the other RDD:

val scalaMap = dbTableKeyValueMap.collectAsMap.toMap
val broadcastLookupMap = sc.broadcast(scalaMap)

val result = fileRdd.map { x =>
  val tmp = broadcastLookupMap.value.get(x).head // throws if x has no entry; consider getOrElse
  tmp
}
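The same pattern can be sketched without a cluster: plain Scala collections stand in for the RDD and for the broadcast variable's value, so the flow of collectAsMap-then-lookup is visible on its own (the names scalaMap, fileRows, and the sample data here are illustrative):

```scala
// Stand-in for dbTableKeyValueMap.collectAsMap.toMap: the lookup data
// pulled down to an ordinary immutable Map on the driver.
val scalaMap: Map[String, String] = Map("k1" -> "v1", "k2" -> "v2")

// Stand-in for the rows of fileRdd, keyed by the value to look up.
val fileRows: Seq[String] = Seq("k1", "k2", "k1")

// Inside a real fileRdd.map you would read broadcastLookupMap.value;
// the lookup itself is just a Map access, cheap on every executor.
val result: Seq[Option[String]] = fileRows.map(k => scalaMap.get(k))
// result == Seq(Some("v1"), Some("v2"), Some("v1"))
```

Using Map.get (returning Option) instead of .get(...).head avoids an exception when a key is missing from the lookup table.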

This easy solution should be documented somewhere for learners... it took me a while to figure out.

Thanks for the help!

