What the heck is Apache Spark?

Published on 22 Sep 2017

  • Are you inundated by floods of data?
  • Lagging behind the competition because you can't extract the results you need?
  • Can't react to the competition rapidly enough to matter?
  • Missing out on key customer trends and you don't even know where to start looking?

You should definitely check out...

So *what the heck* is Apache Spark?

Spark is an open source processing engine and cluster computing framework, released and maintained by the Apache Software Foundation.

Once you have the know-how, you can spin up a Spark cluster in minutes thanks to cloud providers like Amazon Web Services (AWS).

And Spark processing is fast. You can expect programs to run up to 100x faster than on Hadoop MapReduce if you keep your working data in memory (and still around 10x faster on disk).
Moreover, a word-counting MapReduce program that takes around 80 lines of Java needs only about 5 lines on Spark.
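
To give a feel for how compact that is, here is a minimal sketch of a word count using Spark's Python API (PySpark). It assumes PySpark is installed and that a text file exists at the path you pass in; the file name "input.txt", the app name and the helper names tokenize, word_count and main are made up for illustration.

```python
import re


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(sc, path):
    """Count word occurrences in a text file using Spark's RDD API.

    `sc` is an existing SparkContext; `path` is wherever your text lives.
    """
    return (sc.textFile(path)                    # one record per line
              .flatMap(tokenize)                 # one record per word
              .map(lambda word: (word, 1))       # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))  # sum the counts per word


def main(path="input.txt"):  # hypothetical file name
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    for word, count in word_count(spark.sparkContext, path).take(10):
        print(word, count)
    spark.stop()
```

The four chained calls inside word_count are the whole job; everything else is setup. Calling main() runs it locally or on a cluster, depending on how your Spark session is configured.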

While the Spark platform itself is written in Scala and can be coded against in Scala, there is out-of-the-box API support for Java, Python and R (SparkR), and open source projects such as Mobius help integrate other mainstream languages like C# / .NET. Because Spark runs on the JVM, other JVM languages such as Groovy can call its Java API too!

Well, there are 5 main pillars to the Spark project:

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • GraphX
  • MLlib

Spark Core is the workhorse providing large-scale parallel processing. It handles memory management, error handling / fault recovery, scheduling, and data storage I/O.

Spark SQL allows you to query structured data using good ol’ SQL or Hive Query Language (HQL), so many jobs that once meant hand-writing MapReduce code become a single query.
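
As a rough illustration of querying with plain SQL, the sketch below registers a small in-memory table and runs an aggregate query over it. It assumes PySpark 2.0 or later and an existing SparkSession; the orders table, its columns, the sample rows and the function name revenue_by_country are all invented for the example.

```python
# Hypothetical sample data: (country, order amount)
SAMPLE_ORDERS = [
    ("UK", 120.0),
    ("UK", 80.0),
    ("FR", 200.0),
    ("DE", 50.0),
]

# Ordinary SQL, run by Spark across the cluster.
QUERY = """
    SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC
"""


def revenue_by_country(spark, rows=SAMPLE_ORDERS):
    """Load rows into a DataFrame, expose it as a SQL view, and query it.

    `spark` is an existing SparkSession.
    """
    df = spark.createDataFrame(rows, ["country", "amount"])
    df.createOrReplaceTempView("orders")  # makes the DataFrame visible to SQL
    return spark.sql(QUERY)               # returns another DataFrame
```

With a session in hand, revenue_by_country(spark).show() prints one row per country. In a real deployment the DataFrame would more likely be read from files or a Hive table than built from a Python list.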

Spark Streaming is arguably what put Spark into the limelight and sent it racing ahead of other platforms like Storm and Hadoop, as it made near-real-time processing of ever-more-enormous lakes of data possible, such as the firehose of posts produced by social media platforms like Twitter.

GraphX is another powerful and fun-packed library, this time for graph processing. It gives analysts the tools to perform Extract-Transform-Load (ETL) functions, iterative graph computations, and exploratory analysis on graph-structured data such as social networks or web links.

MLlib is probably what will power Skynet when it finally comes online to take over the world. It’s a Machine Learning (ML) library packed with scalable algorithms to cut out and keep, covering tasks such as classification, regression, clustering and recommendation.
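
For a taste of what those algorithms look like in use, here is a small sketch that clusters customers with k-means via the DataFrame-based pyspark.ml API. It assumes PySpark 2.x or later and an existing SparkSession; the visits and spend columns, the sample rows and the function name cluster_customers are hypothetical.

```python
# Hypothetical sample data: (site visits per month, monthly spend)
FEATURE_COLUMNS = ["visits", "spend"]
SAMPLE_CUSTOMERS = [
    (1.0, 10.0),
    (2.0, 12.0),
    (1.5, 9.0),
    (20.0, 300.0),
    (22.0, 310.0),
    (19.0, 280.0),
]


def cluster_customers(spark, rows=SAMPLE_CUSTOMERS, k=2):
    """Group customers into k clusters using MLlib's k-means.

    `spark` is an existing SparkSession.
    """
    # Imported here so the sample data can be inspected without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    df = spark.createDataFrame(rows, FEATURE_COLUMNS)

    # MLlib estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    features = assembler.transform(df)

    kmeans = KMeans(k=k, seed=42, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(features)

    # Each row gains a "cluster" column holding its assigned cluster index.
    return model.transform(features).select("visits", "spend", "cluster")
```

Calling cluster_customers(spark).show() lists each customer alongside a cluster index. The same fit-then-transform pattern applies to most of the other algorithms in the library.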

So who uses Spark?

If you were to deploy your own Spark implementation, you’d be among organisations like Amazon, Netflix, Uber, Google, Intel, Yahoo, IBM, Oracle, Salesforce…need I go on?

How do I get started?

You could do a lot worse than attending our hands-on Apache Spark training course, where a genuine expert will guide you through the fundamentals of spinning up a Spark cluster, populating it with data, and getting personal with some of the key functionality. You’ll build actual Big Data Science solutions that you can take away and adapt for your own use.

Need something tailored to your own enterprise? No problem. We’ve helped numerous organisations build their own Data Science taskforces through carefully tailored training and consultancy programmes.

We will work with you to identify the core skills and disciplines required to meet the specific demands of your business - taking into account your industry, project requirements, deadlines, technical / logistical issues…you name it…and build you a robust team of cutting-edge Data Scientists.

Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.

Call us on 020 3137 3920 to find out how we can help