

R users can work with Spark through two packages – SparkR and sparklyr – and can easily switch between local and distributed processing using either one of them. Both support data processing and distributed Machine Learning, translating user code into Spark operations executed across a cluster of machines. Let's see the difference between these two Spark R packages and the functionality available in each, starting with SparkR.

SparkR is based on DataFrames only; in contrast to pyspark there is no RDD support. It natively supports reading JSON, CSV, ORC, Avro and Parquet files, and connectors to other popular data formats are available. Mind that not all data.frame constructs are supported, though: we cannot, for instance, index a Spark DataFrame by row or change individual cell values as we can with an R data.frame. Aggregations are instead expressed with Spark functions, for example summarize(groupBy(cars, cars$cyl), count = n(cars$cyl)). Additionally, the SparkR package, when loaded, masks many R functions, such as stats::filter, stats::lag, base::sample or base::rbind. Finally, the spark.lapply function allows running multiple instances of any R function in Spark, each with a different parameter value.
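To make the points above concrete, here is a minimal SparkR sketch – creating a DataFrame, running the groupBy/summarize aggregation from the example, and working around the masking of base functions. The local[*] master and the file path in the comment are placeholders for illustration, not part of the original text:

    library(SparkR)
    sparkR.session(master = "local[*]")   # point master at a cluster for distributed processing

    # A Spark DataFrame built from a local R data.frame (mtcars ships with R);
    # reading files works similarly, e.g.:
    # cars <- read.df("path/to/cars.csv", source = "csv", header = "true", inferSchema = "true")
    cars <- createDataFrame(mtcars)

    # Row count per number of cylinders -- the aggregation shown above
    counts <- summarize(groupBy(cars, cars$cyl), count = n(cars$cyl))
    head(counts)

    # SparkR masks stats::filter, base::sample and others after loading;
    # the originals stay reachable through explicit namespacing
    base::sample(1:10, 3)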
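And a short sketch of spark.lapply, distributing an ordinary R function over a list of parameter values – here fitting polynomial models of different degrees to mtcars, a toy example chosen purely for illustration (it assumes the SparkR session opened above is still active):

    # Each element of the list becomes one Spark task; the function runs
    # in a fresh R process on a worker and the results come back as an R list
    degrees <- list(1, 2, 3)
    r_squared <- spark.lapply(degrees, function(d) {
      fit <- lm(mpg ~ poly(hp, d), data = mtcars)
      summary(fit)$r.squared
    })
    unlist(r_squared)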

Since Python is currently the favourite language of Data Scientists using Spark, the Spark R libraries evolve at a slower pace and, in general, catch up with the functionality already available in pyspark. SparkR is an official Spark library, while sparklyr is created by the RStudio community.

They differ in usage structure and slightly in available functionality, but R enthusiasts can benefit from Spark using either of the two libraries – SparkR or sparklyr – as the sketch below illustrates.
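As a rough, hedged illustration of that difference in style – not a complete comparison – here is the same aggregation written against both packages; the local master and the mtcars-based table are assumptions made for the sake of a runnable example:

    # SparkR: Spark's own DataFrame API
    library(SparkR)
    sparkR.session(master = "local[*]")
    cars_sdf <- createDataFrame(mtcars)
    head(summarize(groupBy(cars_sdf, cars_sdf$cyl), count = n(cars_sdf$cyl)))
    sparkR.session.stop()

    # sparklyr: dplyr verbs translated to Spark SQL
    # (in practice attach only one of the two packages per session,
    #  since SparkR masks several functions that dplyr also defines)
    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)
    cars_tbl %>% group_by(cyl) %>% summarise(count = n()) %>% collect()
    spark_disconnect(sc)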
