Working with Big Data requires being good at both theory and practice. Practical experience may come from a commercial project, from digging through open data devoted to information processing, or from performing tasks on Kaggle. For theory, welcome to the world of training resources, courses and blogs devoted to working with information. I prefer material that clever people have already collected, processed and properly arranged on the shelves. Therefore, I prefer books.

Introducing Big Data technologies

The book tells how Big Data can be useful and change the world for the better. It describes how working with information influences business, science, government, healthcare and daily life. The book includes numerous photographs, which makes it easy to read.

Readers' responses are mostly very positive. Above all, readers call the book exciting. It is easy to read and richly illustrated, describing the fascinating future the world could have thanks to data analysis. The edition would make a nice gift both for a Big Data geek and for anyone connected with data analysis.

The book starts with the story of Google, which back in 2009 found a correlation between data from the U.S. Centers for Disease Control and Prevention and the frequency of searches for things like «cough, fever, runny nose», and learned to predict flu outbreaks. The authors, Viktor Mayer-Schönberger, a professor at Oxford, and Kenneth Cukier, an editor at The Economist, call themselves forerunners of Big Data and state that the world is about to change. The book not only celebrates the triumph of new technologies; it is full of excitement about the possibilities of using Big Data for people's benefit, together with concerns about security and privacy. «Big Data…» is for the general public and requires no specialist knowledge. It mainly presents the authors' thoughts on the matter, although there are some examples of applying data analysis in business, too.

Feedback from readers:

The book is praised for explaining complicated examples clearly enough for anyone to understand. Some blame the authors for a «warm-up» that runs much too long and introductory chapters that are much too large. Its drawbacks are the small number of actual examples and the repetition of the same thoughts again and again.

The book is written in popular language and is easy to read. It covers the basic aspects of the actively developing data analysis industry: storage, association discovery algorithms, visualization methods, social graphs, neural networks and unstructured text analysis. Each part begins with a real-life example — Netflix, WhatsApp, IBM Watson — and ends with practical exercises. A manual for the language R comes as a bonus for beginners.

Having read the book, you will know how to carry out every stage of data analysis yourself, from planning a study to formulating and verifying hypotheses.

Feedback from readers:

Feedback from readers is mostly positive; almost everyone noted the very clear explanation of complicated terms and algorithms, which lowers the entry threshold for newcomers to the field.

Eric Siegel, the author of «Predictive Analytics» and the editor of Predictive Analytics Times, tells some fascinating things:

  • Facebook sifts through about 1,500 posts per user to select the news that user might find interesting;
  • Microsoft can predict an individual’s specific location years in advance, based on daily GPS data;
  • Insurance companies in the USA offer end-of-life medical care agreements 18 months before a person’s probable death.

The book is full of examples of businesses and governments using predictive analytics: from selecting contacts on LinkedIn to details of how the technologies were used during Barack Obama’s presidential campaign.

Feedback from readers:

Most readers found the book’s information on where specific algorithms (decision trees, regression and so on) are used to be genuinely useful.

Data mining and algorithms

The author believes most business executives approach analytics in the wrong way: they purchase costly software and hire consultants, wasting a lot of money before they realize what they actually want as a result. The author, the chief analyst at MailChimp, who earlier worked for Coca-Cola, InterContinental and the FBI, recommends stopping and taking a deep breath, because everything is much simpler. In most cases Excel is more than enough; in fact, this tool is more powerful than it looks at first sight.

This book is not about data warehouses, advanced software suites or hardcore coding for geeks. It concentrates primarily on methods. The reader learns about mathematical optimization and genetic algorithms, data clustering, forecasting, seasonal adjustment and other approaches that turn Excel into an instrument for extracting useful information. This is a book for business owners, marketing professionals and analysts of all kinds. It teaches readers not to fear data arrays, and to use them for making wise decisions.
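One of the techniques mentioned above, seasonal adjustment via a moving average, can be sketched in a few lines of Python. This is an illustrative sketch with made-up quarterly figures, not an example from the book:

```python
# Seasonal adjustment sketch: a moving average whose window equals the
# season length averages out the recurring spike, leaving the trend.
def moving_average(series, window):
    """Smooth a series with a simple trailing moving average."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Hypothetical quarterly sales with a seasonal spike every 4th quarter.
sales = [10, 12, 11, 20, 11, 13, 12, 22]

# Window of 4 quarters covers one full season, so the spike is smoothed.
trend = moving_average(sales, 4)
print(trend)  # [13.25, 13.5, 13.75, 14.0, 14.5]
```

The same computation is exactly what an AVERAGE over a sliding range of cells does in Excel, which is why the book can cover such methods without any programming at all.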

Feedback from readers:

One piece of feedback, from a friend of the author, describes it best. There are three types of books about data analysis:

  • Those that are too technical, full of Greek letters and other advanced math;
  • Business books about how data are revolutionizing the world;
  • Technical books about the latest analysis technologies (R, Hadoop).

This book is for ordinary people who want to learn data analysis from a text that explains the algorithms in plain, everyday terms.

«The Elements of Statistical Learning» is a fundamental theoretical work devoted to the principles underlying work with Big Data. You will not see a single line of Python or R in this book; what you will see are mathematical expressions, charts and yet more expressions.

This is a book for hardcore theorists and practicing enthusiasts wishing to deepen their knowledge of mathematical statistics even further. The book covers a large number of adjacent areas: supervised and unsupervised machine learning, neural networks, decision trees, support vector machines and model ensembles. It is the state of the art, often referred to whenever data analysis is mentioned, and it is available on the Stanford University website; downloading it is free and legal.

Feedback from readers:

The book offers no ready-made formulations for solving one’s own tasks. Rather, it is for understanding more deeply what data analysis is and what lies at its foundation.

The book takes readers into the world of data analysis gradually, starting with simple things and proceeding towards more sophisticated matters. The first chapters are devoted to what data is and what forms it may take, how it is prepared for processing, and what summary statistics, visualization and OLAP are, i.e. the minimum needed to quickly extract knowledge from an array of information. Next, the authors describe methods of classification, clustering and association analysis. The book is good for those encountering data analysis for the first time, with a minimum of mathematics and a maximum of clear explanations and illustrations.

Feedback from readers:

This is yet another book with visual, clear explanations of complicated algorithms, without too many superfluous mathematical expressions.

Nathan Marz, the author of the book, was one of the five members of BackType, a social aggregator purchased by Twitter back in 2011. Those five people managed to analyze 100 terabytes of data operationally, setting up, monitoring and maintaining a cluster of hundreds of computers. Asked by colleagues how so few of them could do it, Marz answered: «It is not what we do that matters, but rather what we do not do».

The standard architecture of data processing systems is too complicated and too fragile; the more data it deals with, the more problems arise. Marz approaches data storage and processing in a new way, introducing the simple and reliable lambda architecture described in the book. Open-source tools are used to implement it: Hadoop, Cassandra, Cascalog, ElephantDB, Storm and Trident.

The author said: «It would have been nice if I could have read this book before I began working with Big Data. It would have turned the work into something simple and exciting».

Feedback from readers:

Those skilled in the field will appreciate the usability of these Big Data processing technologies and the sound architecture of such systems.

Big Data and business

Big Data technologies allow businesses to process the large amounts of information they collect, make reasonable decisions based on it, and understand their clients and business processes much better. The books in this section tell why one should use Big Data and how to create the necessary infrastructure within a company, and demonstrate actual examples of using analytics to improve business performance.

The book is based on materials from the MBA course that Foster Provost taught at New York University. The principles the author describes are illustrated by solving real problems that businesses face. This book teaches how to treat Big Data as a business asset and helps one understand how to establish a connection between management and technical analysts.

The book provides no mysterious portal from «a problem» to «the solution». It not only discloses each problem and what allowed it to be overcome, but also describes the professional’s way of thinking that leads to selecting one method or another. Ideas and general principles are often more essential than preset algorithms. Harvard Business Review mentioned this book among the best in the field. The book is dense with material and reading it takes concentration, although you do not need a special technical education to benefit from it.

Feedback from readers:

The book is best characterized by listing what it is not:

  • This is not a digest of differential equations.
  • This is not a step-by-step guide without explanation of what lies behind it.
  • This is not a book for managers who have only 30 minutes of their priceless time to learn a new hyped technique.

The book «Big Data at Work» is perfect for managers thinking about integrating an analytics department into the company’s structure. It presents multiple recommendations, including a step-by-step plan for implementing Big Data analysis methodologies in a company, a description of information processing technologies, and the staff hiring process. The book therefore targets managers who have to analyze data on a regular basis. One response describes the book’s strengths best:

«Big Data at Work» is the first and only book that describes real institutions using Big Data analysis technologies and benefitting from them.

The book presents multiple useful examples involving such companies as UPS, GE, Amazon, United Healthcare, Citigroup etc.

«Too Big to Ignore» is for CIOs, CEOs and IT professionals. Phil Simon, the author of the book, has an extraordinary ability to combine business cases with complicated technical terms; more importantly, he can clearly explain how they all interact. In this book he demystifies the term Big Data, properly arranging the technologies, solutions, software and their vendors.

Feedback from readers:

The book is easy to read and understand, and good for readers of any skill level.


Apache Spark is open-source software well suited to distributed processing of unstructured or semi-structured data.

Written by Apache Spark developers, the book will be useful for engineers and professionals working with large amounts of data. It presents methods for processing data via a simple API in Python, Java and Scala, and includes information about Spark SQL, Spark Streaming and Maven. You will learn to run parallel tasks with a few lines of code and to build applications that perform both simple tasks and tasks requiring machine learning. Here is a brief list of the book’s strong points:

  • A fast dive into the possibilities of Spark, such as distributed datasets, in-memory caching and the command-line shell;
  • Using a single programming paradigm instead of mixing tools such as Hive, Hadoop, Mahout and Storm;
  • Deploying interactive, batch and streaming applications;
  • Connectors for various data sources: HDFS, Hive, JSON and S3.

Examples from the book may be found on GitHub.
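To give a feel for the programming model the book teaches, here is a pure-Python sketch of the classic word-count chain that Spark’s RDD API expresses with flatMap and reduceByKey. This illustration runs without Spark and is not taken from the book; in PySpark the same steps would execute in parallel across a cluster:

```python
# Word count in the Spark style, emulated with plain Python.
from collections import Counter

lines = ["big data", "big spark", "spark streaming"]

# flatMap step: split each line into words
# (PySpark equivalent: rdd.flatMap(lambda line: line.split()))
words = [w for line in lines for w in line.split()]

# reduceByKey step: count occurrences per word
# (PySpark equivalent: .map(lambda w: (w, 1)).reduceByKey(add))
counts = Counter(words)
print(counts["spark"])  # 2
```

The point of Spark is that exactly this chain of transformations, written in a few lines, is distributed transparently over hundreds of machines.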

Feedback from readers:

Some coders find the examples superficial and the accompanying explanations too short. Nonetheless, all agree the book is a good starting point for diving into Spark.

Four professionals from Cloudera wrote this book to present a modern data management and analysis platform. «Advanced Analytics with Spark» is a guidebook comprising various patterns for analyzing large amounts of data with Spark.

Having begun with the basics of the technology, you gradually dive deeper into classification methods, collaborative filtering and anomaly detection as used in genetics and financial security. Examples are implemented in Java, Python and Scala.

The book describes nine case studies in various fields, based on real data. You will learn:

  • how taxi traffic in New York is studied;
  • what music recommendation algorithms are;
  • how forest conditions are predicted with the decision tree algorithm;
  • how Wikipedia content can be interpreted;
  • how genome data are analyzed in the BDG project;
  • how financial risks are simulated with the Monte Carlo method;
  • how neurobiology data are analyzed with PySpark and Thunder.

Feedback from readers:

This is a good step-by-step guidebook for a deeper understanding of data analysis and Spark.

Analytics in Python

Python is a flexible programming language with simple syntax and a wealth of powerful open-source libraries for machine learning and data visualization.

This guidebook is for those who wish to expand their understanding of the technical aspects of predictive analytics in Python. Compared to similar editions, the book covers a wider range of questions and presents more examples, allowing a better understanding of the methods and instruments discussed.

The book is good for engineers with any level of skill in machine learning — from beginners to professionals. What you are going to learn from the book:

  • how to use various analytical models;
  • how to build neural networks with Pylearn2 and Theano;
  • how to use regression analysis;
  • how to improve web applications using machine learning;
  • how to discover hidden patterns and structures in data with clustering;
  • how to pre-process data efficiently;
  • how to use social data analysis to gauge audience sentiment.

Feedback from readers:

The book has been compared to a text equivalent of a neural network with thousands of hidden layers running on a latest-generation Nvidia GPU. It is good for software developers with any level of knowledge, because both a beginner and a professional will discover new Python algorithms in it.

This is yet another book about Python in the Big Data context, but it is not about analytical methods or concepts for working with data; rather, it is about Python instruments for analysts. The author is the lead developer of the pandas library (Python Data Analysis Library).

You will learn how to use the IPython interactive shell as your basic development environment, and become familiar with NumPy functions, the analytical instruments of pandas and other possibilities of the library.

The author assumes you already know the language and does not dwell on the basics.

You will learn highly efficient tools for loading, storing and processing data, become familiar with static and interactive visualization, and, as a bonus, see how complicated tasks are performed in web analytics, social networks, finance and economics.
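As a small illustration of the kind of workflow just described, here is a pandas and NumPy snippet with made-up data (an assumption for demonstration, not an example from the book):

```python
# Build a small DataFrame, aggregate it, and apply a NumPy function.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA"],
    "sales": [100, 150, 80, 120],
})

# groupby/sum: total sales per city
totals = df.groupby("city")["sales"].sum()
print(totals["NY"])  # 250

# a NumPy function applied to the underlying array
print(np.mean(df["sales"].to_numpy()))  # 112.5
```

A few such lines replace what would otherwise be explicit loops and dictionaries, which is exactly the productivity argument the book makes for pandas.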

Feedback from readers:

Responses vary. Positive feedback mentions that the book is easy to understand and gives good fundamental knowledge; negative feedback mentions errors in the code near the end of the book.

Libraries, frameworks and miscellaneous tools are certainly good for practical data analysis, yet there is a good chance you would use them without actually understanding the underlying data science. With this book you learn how such tools work from scratch. If you are already good at mathematics and software development, the author will help you with the mathematics and statistics of data science. Here is a brief description of what the book comprises:

  • Introducing Python
  • The basics of linear algebra and statistics, and how both are used in data science
  • Collecting, processing, cleaning and otherwise handling data
  • The fundamentals of machine learning
  • Writing various models in Python, such as k-nearest neighbors, Naïve Bayes, linear and logistic regression, decision trees, neural networks and clustering
  • Recommendation systems, natural language processing, MapReduce etc.
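To show the from-scratch spirit of the models listed above, here is a minimal k-nearest-neighbors classifier in pure Python. This is an illustrative sketch with made-up points, not the book’s own code:

```python
# Minimal k-nearest-neighbors: classify a new point by majority vote
# among the k closest labeled points.
from collections import Counter
import math

def knn_predict(k, labeled_points, new_point):
    """labeled_points: list of (point, label) pairs."""
    # sort labeled points by Euclidean distance to the new point
    by_distance = sorted(labeled_points,
                         key=lambda pl: math.dist(pl[0], new_point))
    # majority vote among the k nearest neighbors
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

points = [((0, 0), "red"), ((1, 0), "red"),
          ((5, 5), "blue"), ((6, 5), "blue")]
print(knn_predict(3, points, (1, 1)))  # red
```

The whole algorithm fits in a dozen lines, which is why books of this kind can teach the concepts without hiding them behind a library call.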

Feedback from readers:

Responses are mostly positive. A software developer with 10 years of experience called this book the best one for those who wish to learn to analyze data in Python.

Data visualization

Data visualization is an integral part of Big Data. Good presentation of the data collected during research greatly simplifies formulating and proving hypotheses, helps communicate one’s position to colleagues, and makes data processing easier.

The book will teach you the basics of data visualization and show how to turn it into an efficient assistant for creating presentations. You will learn to visualize data with actual examples, improve your understanding of context and audience, be able to select the best way to present information, and learn how to strip presentations of unnecessary elements and direct the audience’s attention to the key points.

The author emphasizes the aesthetic aspect of presenting results, letting one look through a designer’s eyes and teaching how to use design concepts in visualization.


Hadoop is a project of the Apache Software Foundation: a free toolkit for developing and running distributed software on clusters of hundreds or thousands of nodes.

This manual, as is clear from its title, is for those who are only beginning their work with Hadoop. In a simple manner, the book explains the value of Big Data, tells the history of Hadoop’s emergence, describes its advantages and functions, and shows examples of how to use it in practice. In addition, the book introduces clusters, design patterns and the Hadoop ecosystem.

In this book, you find the following:

  • A description of the Hadoop 2 and YARN ecosystem;
  • Examples of real usage to get you started;
  • Detailed information on cluster installation;
  • A guide to using Oozie for workflow planning;
  • Information about adding structure with Hive or HBase;
  • Detailed information about working with SQL and Hive;
  • Information about using Hadoop in the cloud;
  • Information about problems that administrators may face.

Feedback from readers:

There are very few responses from readers, and those who did provide feedback were rather laconic.

This is a practical guidebook to one of the most powerful freely available toolkits. Developers will find useful information about analyzing Big Data arrays, while administrators can learn how to create and configure Hadoop clusters.

The book presents a large number of case studies illustrating how Hadoop solves particular problems. It teaches how to use the Hadoop Distributed File System (HDFS) to store Big Data arrays and to perform distributed calculations on them, describes the possibilities of MapReduce and common mistakes in working with the model, and shows how to design, create and configure Hadoop clusters and run Hadoop in the cloud. The latest edition includes chapters devoted to ecosystem tools such as Pig, Hive, HBase, ZooKeeper and Sqoop.
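The MapReduce model mentioned above expresses a computation as a map step that emits key/value pairs and a reduce step that combines all values sharing a key. This pure-Python sketch, written in the style of a Hadoop Streaming job (a hypothetical illustration, not from the book), counts words:

```python
# MapReduce word count, emulated locally: map -> shuffle (sort) -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit (word, 1) for every word in the line
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # combine all counts for one key
    return (key, sum(values))

lines = ["hadoop stores big data", "hadoop processes big data"]

# shuffle phase: sort all (key, value) pairs so equal keys are adjacent
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(
    reducer(key, (v for _, v in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(result["hadoop"])  # 2
```

On a real cluster, HDFS splits the input across machines, the framework runs many mapper instances in parallel, and the shuffle moves each key’s values to a single reducer.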

Feedback from readers:

There are some negative responses, blaming the author for drawbacks such as mixed styles, code examples without file names, and outdated information.

If you are going to work with large, complicated Hadoop clusters, this book is essential. Its author, Eric Sammer, is the chief architect at Cloudera, and the book describes all stages from planning and creation to configuration and maintenance of a cluster.

Above all the book is for administrators, yet developers too will find much of interest in it. Your work with Hadoop begins with installing and configuring all the software you may need. You will gain a thorough understanding of HDFS and MapReduce; go through every stage of deploying Hadoop, from hardware and operating system selection onwards; learn how to manage resources by dividing input data between non-overlapping groups; and learn about maintenance and backup of the resulting systems through real-life examples.

Feedback from readers:

In most cases, readers' responses are favourable. Because the book comes from a Cloudera architect, it reflects the company's entire spectrum of work with Hadoop, discussing many aspects and weak spots of the ecosystem and the challenges one may face.

The guidebook comprises 85 verified examples in problem-solution format. The author strikes a good balance between describing fundamentals, giving practical examples and diving into technological detail. You study each solution step by step, so as to understand the principle each model rests upon.

The book teaches you how to perform practical tasks and how to think in a way that turns raw data into a well-structured database that is easy to work with.

What you are going to find in the book:

  • A comprehensive review of Hadoop and MapReduce;
  • 85 verified practical techniques;
  • Examples of real problems and their solutions;
  • Detailed instructions on integrating MapReduce and R.

Feedback from readers:

Those who purchased the book noted that its information is often outdated; be careful to purchase only the latest edition, published within the last year.

This book, from a team of skilled developers, is a detailed guide to Hadoop and to integrating its APIs to solve common problems in the field. It covers data storage in HDFS and HBase, data processing with MapReduce, and automating workflows with Oozie. It may be called a complete guide for system administrators and developers working with Hadoop:

  • A detailed description of how to create, test and debug MapReduce applications so they run stably;
  • An explanation of how to deploy Oozie and use it to integrate enterprise applications;
  • A description of how to design Hadoop applications capable of processing requests in real time;
  • A demonstration of Hadoop security tools: encryption, authentication, authorization, SSO and auditing;
  • Methods for running Hadoop applications in the Amazon cloud.

Feedback from readers:

Readers note the good, deep analysis of the subject matter that is missing from other books (Hadoop: The Definitive Guide, Hadoop in Practice).

HBase is used in the Hadoop ecosystem, and the book teaches you how to manage Big Data arrays with this powerful tool. HBase is an open-source implementation of the BigTable architecture used at Google. It is capable of horizontal scaling to handle millions of rows and columns without losing read/write speed. The book touches on many delicate questions that often arise when introducing the database into a company’s IT infrastructure:

  • You learn how to integrate Hadoop and HBase properly to make scaling simple;
  • You learn how to distribute large volumes of data among a large number of cheap commodity servers;
  • You learn different methods of connecting to HBase, including both standard Java clients and specialized APIs granting access from different environments;
  • You become aware of the various components of the HBase architecture (storage format, logs, secondary indexes, transaction implementation, search integration etc.);
  • Miscellaneous issues of cluster deployment, monitoring and maintenance are discussed;
  • You will better understand memory performance.

Design patterns for MapReduce applications can be found in various articles and blogs. This book saves your time by eliminating the need to comb through those scattered sources: it collects the basic principles of designing and coding MapReduce applications.

Each example is considered in a specific context, to help you avoid the common mistakes you might otherwise encounter. The types of patterns described include:

  • Grouping and aggregation;
  • Filtering data from particular users;
  • Structural patterns for working with other systems and simplifying analysis;
  • Join patterns for analyzing multiple data sets together and searching for interconnections;
  • Meta-patterns: chaining several patterns together to solve analysis tasks in a single process;
  • Patterns for data storage, uploading and downloading.

Feedback from readers:

There are several negative responses mentioning the poor quality of the book’s technical content. «O’Reilly has never been so unprofessional» is the mildest of them.

The book thoroughly describes working with Hive, an add-on to MapReduce that allows writing SQL-like queries for distributed systems. This is the most comprehensive guide to Hive, covering all aspects of working with the technology:

  • Creating tables, modifying table structure, partitioning and other actions familiar from SQL;
  • Handling data;
  • SQL-like query syntax;
  • Creating views, indexes and schemas;
  • Writing functions and managing streams;
  • Security issues;
  • Integrating Hive with Oozie, Amazon Web Services etc.

This is a book about Apache Pig, a tool for organizing parallel data flows within the Hadoop ecosystem. With Pig you can easily create several parallel data processing scenarios. The book is good for both beginners and advanced developers: it describes the basics and discloses various aspects of Pig:

  • Data models: scalar and complex types;
  • Writing Pig Latin scripts for sorting, grouping, filtering and other kinds of processing;
  • Using Grunt in Hadoop;
  • Embedding Pig Latin scripts in Python to execute iterative algorithms;
  • Creating one’s own data loading and storage functions;
  • Performance issues.

Feedback from readers:

Feedback from readers is mostly negative. They say the writing resembles occasional blog posts rather than a systematic treatment of the subject.

While many books only cover the use of individual Hadoop ecosystem components, this one teaches you how to build an architecture wisely, specifically for your tasks.

The second part is devoted to a detailed description of commonly encountered Hadoop application architectures. In addition, the book covers the following topics:

  • Factors in favour of using Hadoop;
  • Best practices of uploading/downloading data to/from the system;
  • Various data processing frameworks: MapReduce, Spark, Hive;
  • Common data processing templates;
  • Giraph, GraphX and other tools for processing large graphs via Hadoop;
  • Using various task scheduling tools (Apache Oozie etc.);
  • Processing large amounts of data in real time with Apache Storm, Apache Spark Streaming, and Apache Flume.

Feedback from readers includes very few responses, mostly positive.

Language R

R is a programming language for statistical data processing and graphics; it is also an open-source development environment within the GNU Project.

The book comprises more than 200 practical recipes for fast and efficient data analysis in R. The language is quite difficult to learn fully, yet the ready-made solutions in the book let you use its full power right away, from input tasks to statistical analysis and regression. Every recipe solves a particular problem, which makes learning the language practical and helps you understand the material better. If you are a novice, the book will help you start using R in practice; if you already have experience as a developer in the language, you can improve your code and find new ways to solve tasks. The material will assist you at each stage of data processing: it covers techniques for extracting data from CSV and HTML files and from databases, and then shows how to use the language’s tools to organize, store and handle the data.

If you are looking for a book to begin studying statistics with, this is not it. «R Cookbook» assumes knowledge of various statistical methods and algorithms, and only shows how to use them in the R environment. Yet if you want to learn about miscellaneous methods and tools, such as R graphics, then «R Cookbook (O'Reilly Cookbooks)» is your book.

This is yet another book about analytics in the R language. It discusses all aspects of programming, from the simplest features to advanced topics (closures, recursion, anonymous functions etc.). To begin, you need neither special knowledge of statistics nor long years of programming experience. Step by step, the book covers functional and object-oriented programming, mathematical simulation, and transforming data into various formats.

Here are several topics from the book:

  • Creating graphics and visualizations;
  • Parallel programming in R;
  • R interfaces to C/C++ and Python to improve performance and functionality;
  • Various packages for analyzing text, images etc.;
  • Advanced debugging techniques.

Feedback from readers:

There is feedback from a programmer with 12 years of experience working with R. Although he insists that nothing has ever surpassed the famous «K&R», The C Programming Language, this is a book that at least approaches «K&R» in quality.

A good choice for those wishing to learn the R language. The book is written in simple language, without technical slang or other things that would complicate understanding. The authors disclose the material step by step, allowing readers to grasp every aspect of the language that has been unclear so far. A brief list of what you may take from the book:

  • How to install R, RStudio and other editors;
  • The basics of R syntax;
  • Loading packages;
  • Presenting data as vectors;
  • Working with matrices, lists and other structures;
  • Basic work with graphical data representation etc.