Apache Spark is the big kid on the Data Science block right now. It handles batch processing, Machine Learning workloads, and interactive low-latency Data Mining, and it integrates well with many existing Big Data tools frequently used in enterprise data pipelines.
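Spark's core abstraction is a map/reduce pipeline over distributed collections. A plain-Python word count sketches the model Spark parallelises across a cluster (this is an illustration of the pattern, not the Spark API itself):

```python
from collections import Counter

def word_count(lines):
    """Toy stand-in for a Spark word count: flatMap -> map -> reduceByKey."""
    # flatMap: split every line into individual words
    words = (w.lower() for line in lines for w in line.split())
    # map + reduceByKey: count occurrences per word
    return Counter(words)

counts = word_count(["Spark handles batch jobs", "Spark handles streams"])
```

In PySpark the same pipeline would run as chained transformations on an RDD or DataFrame, with the shuffle and aggregation distributed over worker nodes.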
There's a roster of state-of-the-art Big Data processing tools you should also be thinking about...
AWS Kinesis & Kafka
Message buses for data ingestion. Especially useful when analyzing streaming input data; not required when working with batch data.
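The streaming-versus-batch distinction can be sketched in plain Python: a batch job sees the whole dataset at once, while a stream consumer (the role Kinesis or Kafka feeds) processes records one at a time as they arrive, keeping only a running aggregate. Names and record shapes below are illustrative:

```python
def stream_consumer(events):
    """Process events one at a time, keeping only a running aggregate,
    rather than loading the whole dataset up front as a batch job would."""
    total = 0
    for event in events:  # in production this loop would poll a Kinesis shard / Kafka topic
        total += event["amount"]
        yield total       # emit an up-to-date running total after every event

events = [{"amount": 10}, {"amount": 5}, {"amount": 25}]
running_totals = list(stream_consumer(iter(events)))
```

The consumer never needs the full event list in memory, which is exactly the property that makes the pattern work on unbounded streams.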
Natural-language search engine. Can be used for full-text search as well as a range of natural language processing tasks, and can provide very useful analytics for FCA when working with text and NLP.
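At its heart, a full-text search engine is built around an inverted index: a mapping from each term to the documents that contain it. A minimal stdlib sketch of the idea (not any engine's actual API; document ids and texts are made up):

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

docs = {1: "real time fraud detection", 2: "fraud analytics report", 3: "batch report"}
index = build_index(docs)
```

Production engines add tokenisation, stemming, and relevance scoring on top, but the lookup structure is the same.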
AWS managed OLAP reporting database. Good choice for large-scale data analytics. Can be used together with Redshift Spectrum to scale to very large data volumes while keeping costs under control.
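An OLAP warehouse is queried with ordinary SQL aggregates over large fact tables. The query below shows the typical roll-up shape, sketched here against an in-memory SQLite database with made-up sales data (table and column names are illustrative, not a real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("eu", 100.0), ("eu", 50.0), ("us", 200.0)],
)

# The kind of roll-up reporting query an OLAP warehouse is optimised for
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY REGION ORDER BY total DESC"
).fetchall()
```

On a real warehouse the same GROUP BY would fan out across many nodes and columnar storage, which is what makes it viable at terabyte scale.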
Fast NoSQL database. Used in real-time processing pipelines, such as real-time fraud detection.
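The NoSQL role in a fraud pipeline is a constant-time key lookup on the hot path: for each incoming transaction, fetch and update a small per-key profile. A plain dict stands in for the database below, and the velocity rule (too many transactions per card inside a time window) is a made-up example:

```python
from collections import deque

profiles = {}  # stand-in for a NoSQL key-value store keyed by card id

def check_transaction(card_id, timestamp, window=60, max_tx=3):
    """Flag a card making more than max_tx transactions within `window` seconds."""
    history = profiles.setdefault(card_id, deque())
    while history and timestamp - history[0] > window:
        history.popleft()          # drop transactions that fell out of the window
    history.append(timestamp)
    return len(history) > max_tx   # True means "suspicious"
```

The point is latency: each check is one key read and one write, so the decision can happen inline while the transaction is being authorised.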
"data lake" storage, stores raw data in native format on the massive scale
Other stuff to bear in mind
* Machine Learning trends: Python has largely overtaken R and other tools as the leading data science language. This has been a general trend for a while now; moreover, Python is easy to get to grips with (and we can help you master it).
* Jupyter or Zeppelin notebooks help share knowledge and findings across teams. They can be used for ad-hoc experimentation and reporting.
* Jupyter works well with Python Machine Learning libraries, which makes it easy to bridge the data-storage, data-analysis, and Machine Learning worlds.