scala - How to access a lookup (broadcast) RDD (or dataset) inside another RDD's map function -
I am new to Spark and Scala and just started learning; I am using Spark 1.0.0 on CDH 5.1.3.
I have a broadcast RDD named dbTableKeyValueMap: RDD[(String, String)], and I want to use dbTableKeyValueMap to process fileRDD (each row has 300+ columns). Code:
val result = fileRDD.map({ x =>
  val tmp = dbTableKeyValueMap.lookup(x)
  tmp
})
Running locally, it hangs and/or after some time gives this error:
scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)
I can understand that accessing one RDD inside another has issues once locality and the size of the collection come into the picture. Taking a Cartesian product is not an option for me, because the records in the file RDD are huge (each row has 300+ columns). In Hadoop Java MapReduce code I used the distributed cache to load dbTableKeyValueMap in the setup method and then used it in map; I want to use it in a similar way inside a Spark map. I was not able to find a simple example for a similar use case. Basically, I want to iterate over the fileRDD rows and do transformations, beautifications, lookups, etc. on each column for further processing. Or is there some other way I can use dbTableKeyValueMap as a Scala collection instead of a Spark RDD?
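To make the intent concrete, the per-column lookup I want is just a plain Scala Map lookup per column, as in this standalone sketch (no Spark; the map contents and column values here are made-up examples):

```scala
// Standalone sketch: look up each column of a row in a plain Scala Map.
object RowLookupSketch {
  // Hypothetical lookup table; in my case it comes from a DB table.
  val lookup: Map[String, String] = Map("NY" -> "New York", "CA" -> "California")

  // Transform every column of a row, substituting from the lookup when a key matches
  // and keeping the original value otherwise.
  def beautifyRow(row: Array[String]): Array[String] =
    row.map(col => lookup.getOrElse(col, col))

  def main(args: Array[String]): Unit = {
    val row = Array("id42", "NY", "CA")
    println(beautifyRow(row).mkString(","))  // id42,New York,California
  }
}
```

The question is how to get dbTableKeyValueMap into a form like that `lookup` Map so each Spark task can use it.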
Please help.
Thanks! The easiest thing is to convert the lookup RDD to a Scala collection and go! You are then able to access it inside RDD transformations:
val scalaMap = dbTableKeyValueMap.collectAsMap.toMap
val broadcastLookupMap = sc.broadcast(scalaMap)
val result = fileRDD.map({ x =>
  val tmp = broadcastLookupMap.value.get(x).head // note: throws NoSuchElementException when x is missing; consider getOrElse
  tmp
})
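Putting it together, here is a minimal runnable sketch of the pattern. The sample data, app name, and local master are assumptions for illustration; in the real job dbTableKeyValueMap comes from the DB table, and getOrElse is used instead of .get(x).head to avoid failing on missing keys:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]"))

    // Hypothetical small lookup RDD (stands in for dbTableKeyValueMap).
    val dbTableKeyValueMap = sc.parallelize(Seq("k1" -> "v1", "k2" -> "v2"))

    // Collect the small RDD to the driver as an immutable Map, then broadcast it
    // so every executor gets one read-only copy -- the Spark analogue of
    // Hadoop's distributed cache.
    val broadcastLookupMap = sc.broadcast(dbTableKeyValueMap.collectAsMap().toMap)

    // Hypothetical file RDD of keys to look up.
    val fileRDD = sc.parallelize(Seq("k1", "k3", "k2"))

    // getOrElse avoids the NoSuchElementException that .get(x).head
    // would throw for keys absent from the map.
    val result = fileRDD.map(x => broadcastLookupMap.value.getOrElse(x, "MISSING"))

    result.collect().foreach(println)
    sc.stop()
  }
}
```

This works only when the lookup table is small enough to fit in driver and executor memory; collectAsMap pulls the whole RDD to the driver.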
This easy solution should be documented somewhere for learners; it took me a while to figure it out.
thanks help...