Big Data & Data Science: What's Hot Right Now

Apache Spark

Apache Spark is the big kid on the Data Science block right now. It handles batch processing, Machine Learning algorithms and interactive, low-latency Data Mining, and it works well with many of the existing Big Data tools commonly used in enterprise data pipelines.

There's a roster of state-of-the-art Big Data processing tools you should also be thinking about...

AWS Kinesis & Kafka

Message buses for data ingestion. Especially useful when analyzing streaming input data; not required when working with batch data.
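To make the message bus idea concrete, here is a minimal sketch in plain Python. It uses an in-process `queue.Queue` as a stand-in for a stream or topic, so it is not the Kinesis or Kafka API; the event shape, the `SENTINEL` marker and the function names are all assumptions made for illustration. The point is only that the producer and the consumer never talk to each other directly, and the consumer processes events as they arrive.

```python
import queue
import threading

# In-process stand-in for a stream/topic. This is NOT the Kafka or Kinesis API.
bus = queue.Queue()
SENTINEL = None  # hypothetical end-of-stream marker


def producer(events):
    """Publish each event to the bus, then signal the end of the stream."""
    for event in events:
        bus.put(event)
    bus.put(SENTINEL)


def consumer(results):
    """Read events as they arrive and keep a running total per user."""
    while True:
        event = bus.get()
        if event is SENTINEL:
            break
        user, amount = event
        results[user] = results.get(user, 0) + amount


# Hypothetical (user, amount) events
events = [("alice", 10), ("bob", 5), ("alice", 7)]
totals = {}

consumer_thread = threading.Thread(target=consumer, args=(totals,))
producer_thread = threading.Thread(target=producer, args=(events,))
consumer_thread.start()
producer_thread.start()
producer_thread.join()
consumer_thread.join()

print(totals)  # {'alice': 17, 'bob': 5}
```

With batch data the whole `events` list is available up front, so you could simply loop over it, which is why the bus is optional in that case.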

Elasticsearch

Distributed full-text search and analytics engine. Can be used for text search as well as a range of natural language processing tasks. Can provide very useful analytics for FCA when working with text and NLP.

Redshift

AWS-managed OLAP reporting database. A good choice for large-scale data analytics. Can be used together with Amazon Redshift Spectrum to scale to very large data sizes while keeping costs under control.

Cassandra

Fast NoSQL database. Used in real-time processing pipelines, such as real-time fraud detection.

AWS S3

"Data lake" storage. Stores raw data in its native format at massive scale.

Other stuff to bear in mind

* Machine Learning trends: Python has largely overtaken R and other tools as the major data science language. This has been a general trend for a while now. Moreover, Python is easy to get to grips with (and we can help you master it).

* Jupyter or Zeppelin: notebooks that help share knowledge and findings across teams. Can be used for ad-hoc experimentation and reporting.

* Jupyter works well with Python Machine Learning libraries, which makes it easy to bridge the data storage, data analysis and Machine Learning worlds.
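As a rough illustration of that storage-to-analysis-to-Machine-Learning flow, here is the kind of thing you might type into a single notebook. It is a sketch using only the Python standard library so it runs anywhere; in a real notebook you would more likely reach for dedicated data and Machine Learning libraries. The `RAW_CSV` sample, the column names and the `predict` helper are all made up for the example.

```python
import csv
import io
import statistics

# Hypothetical sample data. In practice this would come from your data store.
RAW_CSV = """ad_spend,sales
1,3
2,5
3,7
4,9
"""

# 1. Storage: load the raw rows
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
x = [float(row["ad_spend"]) for row in rows]
y = [float(row["sales"]) for row in rows]

# 2. Analysis: quick summary statistics
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# 3. Machine Learning: fit a straight line by ordinary least squares
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x


def predict(ad_spend):
    """Predict sales for a given ad spend using the fitted line."""
    return intercept + slope * ad_spend


print(f"mean sales: {mean_y}, fitted line: sales = {intercept} + {slope} * ad_spend")
print(predict(5))
```

Each numbered step would typically live in its own notebook cell, so teammates can rerun or tweak one stage without touching the others.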

Call us on 020 3137 3920 to find out how we can help