Apache Spark is an open source cluster computing framework that is frequently used in big data processing. From the "Data Science in Spark with sparklyr" cheat sheet (2016-12): sparklyr is an R interface for Apache Spark™; it provides a complete dplyr backend and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate.


Having a good cheatsheet at hand can significantly speed up the development process. One of the best cheatsheets I have come across is sparklyr’s cheatsheet.


For my work, I’m using Spark’s DataFrame API in Scala to create data transformation pipelines. These are some functions and design patterns that I’ve found to be extremely useful.

For an exhaustive list of the functions, you can check out Spark’s Dataset class documentation.

