The vast multitude of disparate platforms and tools for data analysis can disorient any professional trying to build a business process based on advanced data analysis at an enterprise. Machine learning and deep data analysis are no longer news. They are an obligatory starting point, without which no business can operate normally today. Analyzing the collected information is the key to improving business performance. Yet one needs to possess and use data analysis instruments. Which ones? Let us dive deeper into the matter. We have collected a list of the best frameworks, platforms, solutions and advanced analysis systems on the market.
It seems that the low cost of distributed computing, together with fast processing speed, makes Hadoop about as significant for Big Data as all the remaining software products together. Any list of open source Big Data platforms begins with the «hardware-born elephant», yet Hadoop is just one milestone on a long road.
The platform consists of modules that work together to form a single software framework. The main modules are:
- Hadoop Common;
- Hadoop YARN;
- Hadoop Distributed File System (HDFS);
- Hadoop MapReduce.
The modules above represent the core of Hadoop, yet there are other components too, which make the capabilities of the «elephant» even more impressive. We return to the subject later.
One interesting fact: Hadoop is named after a yellow toy elephant that belonged to the son of Doug Cutting, one of the framework's creators.
The first block, Hadoop Common, is a kit of infrastructure programs, shared elements and connecting software used by the other modules. YARN is the task scheduler and cluster resource manager. HDFS and MapReduce deserve a closer look.
A distributed file system performs two main tasks: recording metadata and storing the data itself. HDFS distributes the blocks of each file among several nodes in a cluster, while metadata is handled by a dedicated NameNode server. HDFS is quite a reliable system: reportedly, destroying half of the cluster nodes results in losing only 3% of the information.
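The split between metadata and block storage can be illustrated with a toy sketch. The helper names `place_blocks` and `readable_after_failure`, the tiny block size and the round-robin replica placement are all assumptions made for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from itertools import cycle

def place_blocks(file_size, block_size, nodes, replication=3):
    """Return NameNode-style metadata: block index -> nodes holding a replica."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = cycle(nodes)                     # hypothetical round-robin placement
    metadata = {}
    for block in range(n_blocks):
        metadata[block] = [next(ring) for _ in range(replication)]
    return metadata

def readable_after_failure(metadata, failed_nodes):
    """A file stays readable if every block has a replica on a healthy node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in metadata.values()
    )

meta = place_blocks(file_size=1000, block_size=256, nodes=["n1", "n2", "n3", "n4"])
print(meta)
print(readable_after_failure(meta, {"n1", "n2"}))
```

In this sketch the file survives the loss of two of the four nodes because every block keeps at least one replica elsewhere; losing three nodes would make some blocks unreadable.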
MapReduce makes it possible to run computation locally, on the machines that already hold the relevant data blocks. The implementation of this paradigm relies on the NameNode server disclosing where the data blocks are located. MapReduce operates in two steps, map and reduce. Here is how the process takes place.
- Map: the master node obtains input data -> divides the information into parts -> transfers the parts to worker nodes.
- Reduce: the master node receives responses from worker nodes -> forms the final result.
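The two steps can be sketched in plain Python with the classic word-count example. This is a single-process illustration, not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Worker: emit a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the grouped values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    intermediate = []
    for chunk in chunks:  # in a cluster, each chunk would go to a separate worker
        intermediate.extend(map_phase(chunk))
    return reduce_phase(shuffle(intermediate))

counts = word_count(["big data big", "data tools"])
print(counts)
```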
The Hadoop framework is the de facto standard software for Big Data analysis. However, it is worth using only when there is a real Big Data problem. If your company is about to work with data in amounts that its current solutions are unable to swallow, then it is time for Hadoop. If you are uncertain whether the current storage will be sufficient for future tasks, Hadoop lets you extend a cluster with new machines, and the system chews and swallows the extra load with no problem. Those who worried about losing information if several servers fail at once no longer need to worry: should this ever happen, processing is automatically passed to another computer.
Hadoop is now used efficiently by the biggest European company in click-based targeting (also known as «ad targeting»). Deutsche Telekom, the major telecommunications provider in Europe, also uses the Hadoop framework, which proved to be 50 times cheaper than its previous platform.
Yet one should not forget that data analysis tools based on Hadoop have disadvantages, too. First, Hadoop does not have good protection against information theft. Second, working with Hadoop is not easy: the framework requires knowledge of MapReduce, while most professionals use SQL technologies. Finally, the «elephant» is extremely popular thanks to the efforts of the Hortonworks marketing department, and as a result the platform changes very rapidly.
To summarize, Hadoop is a strong option for working with vast volumes of information. With a correct application architecture, the platform can analyze practically unlimited amounts of data. In addition, it is fault-tolerant, and its operating cost is dozens of times less than that of competing products.
Just like Hadoop, Spark is an open source platform. Yet the two should not be compared directly: the frameworks do not perform the same tasks and do not exclude each other; rather, they can work together.
Spark needs a cluster manager and a distributed data storage system. Cluster management can be handled by native means, Hadoop YARN or Apache Mesos (for multi-node clusters), while the distributed data storage system may be entirely external. For this reason, most Big Data projects install Spark on top of the «elephant»: combining Spark's advanced analysis applications with the Hadoop Distributed File System allows programs to run on Hadoop clusters up to 100 times faster in memory and up to 10 times faster on disk.
The Spark platform introduces an important abstraction called the Resilient Distributed Dataset (RDD), a set of read-only objects distributed among cluster nodes. An RDD supports two classes of operations: transformations and actions.
Transformations do not return a single value; they only record metadata and return a new RDD. Transformations include operations such as map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe and coalesce.
Actions return a value. When an action is called on an RDD object, all pending data processing queries are executed and the result as of that moment is returned. Actions include reduce, collect, count, first, take, countByKey and foreach.
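The difference between the two classes of operations can be shown with a toy stand-in. `ToyRDD` below is a hypothetical single-machine class written for illustration only; it is not the Spark API, although it borrows the operation names map, filter, collect and count from the lists above.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, ops=()):
        self._data = tuple(data)  # read-only source data
        self._ops = tuple(ops)    # pending transformations (the lineage)

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + (("filter", pred),))

    def _run(self):
        """Apply the recorded transformations lazily, in order."""
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

even_squares = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())
print(even_squares.count())
```

Nothing is computed when `map` and `filter` are called; the work happens only inside `collect` and `count`.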
As one can see, in addition to map and reduce, which are present in the Hadoop MapReduce module, Spark offers a number of other operations. Therefore, when Big Data applications are developed, Spark in most cases replaces Hadoop MapReduce specifically, rather than the entire Hadoop stack.
It is worth mentioning that for tasks that repeatedly access the same dataset, «pure» Spark works up to 30 times faster than Hadoop. This applies in particular to interactive data mining and to iterative algorithms, which are often used in machine learning systems.
The Spark architecture includes three main components:
- Data storage;
- Application programming interface (API);
- Cluster manager.
The application programming interface allows developers to create Spark-based Big Data software using a standard API. By default, Spark supports Java, Scala and Python, so developers can choose the language for writing applications.
In addition, Spark includes a number of libraries: Spark SQL (for running SQL queries against data), Spark Streaming (an add-on for processing streaming data), Spark MLlib (a set of machine learning libraries) and GraphX (for distributed graph processing). An execution engine based on directed acyclic graphs (DAG) allows the Spark framework to create efficient query plans for data transformations.
Here is what we have: «pure» Spark is best suited to machine learning and iterative tasks. Used on top of Hadoop, as it is in most cases, it replaces the MapReduce module and adds a larger set of operations.
The Storm framework is among the best open source solutions for Big Data. Unlike Hadoop and Spark, which focus on batch processing of big datasets, Storm is designed for distributed processing in real time, regardless of the programming language used.
In Storm, workflows are called «topologies». Topologies are arranged as directed acyclic graphs (DAG) and run until stopped by the user or until an unrecoverable error occurs. Storm supports topologies that transform unbounded data streams. Unlike Hadoop jobs, these transformations never halt; they keep processing data as it arrives.
Native Storm cannot be used for developing Big Data apps on top of typical Hadoop clusters. Task coordination among the nodes of a cluster requires Apache ZooKeeper, which coordinates Storm's master and worker daemons. Still, Yahoo! and Hortonworks are working on libraries to run Storm on top of Hadoop 2.x YARN clusters. Let us also mention that the framework can read files from and write files to HDFS.
Spouts and bolts are the main elements of a Storm topology. Spouts generate data streams in the form of immutable sets of key-value pairs, known as tuples, while bolts transform those streams. As in MapReduce, bolts may solve traditional tasks or perform more complicated single-step actions: filtering, aggregation or connecting to external systems such as databases.
Storm implements the concept of guaranteed message processing: every tuple emitted by a spout will be processed, and if it remains unprocessed after a timeout, Storm replays it from the spout. Another important feature of the Storm ecosystem is the large number of spouts already tuned for receiving data from all types of sources. Even specialized Big Data applications may not need a dedicated spout: a suitable one can usually be found among the many existing spouts, from the Twitter streaming API to Apache Kafka.
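The interplay of spouts, bolts and replay can be sketched as a single-process toy. `ListSpout`, `run_topology` and `flaky_upper` are hypothetical names, and a raised `ValueError` stands in for a processing timeout; none of this is the Storm API.

```python
from collections import deque

class ListSpout:
    """Emits tuples from a list and replays any tuple that was not acknowledged."""

    def __init__(self, items):
        self.pending = deque(enumerate(items))  # (message id, tuple)

    def next_tuple(self):
        return self.pending.popleft() if self.pending else None

    def fail(self, msg_id, tup):
        self.pending.append((msg_id, tup))  # schedule a replay

def run_topology(spout, bolt, max_attempts=3):
    """Feed every tuple through the bolt, replaying failures up to max_attempts."""
    results, attempts = [], {}
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, tup = emitted
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        try:
            results.append(bolt(tup))  # success plays the role of the ack
        except ValueError:
            if attempts[msg_id] < max_attempts:
                spout.fail(msg_id, tup)
    return results, attempts

seen = set()

def flaky_upper(tup):
    """Bolt: upper-cases the value; fails the first time it sees the value 'b'."""
    key, value = tup
    if value == "b" and key not in seen:
        seen.add(key)
        raise ValueError("transient failure")
    return (key, value.upper())

spout = ListSpout([("k1", "a"), ("k2", "b"), ("k3", "c")])
results, attempts = run_topology(spout, flaky_upper)
print(results)
print(attempts)
```

The tuple that failed once is processed on its second attempt, so no message is lost, although the output order changes.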
Storm has a number of application fields: real-time analytics, online machine learning, distributed remote procedure calls and so on.
As a distributed computing system, Storm may be a good choice if you start a project from scratch, have a dedicated cluster ready, and your requirements focus on stream data processing and complex event processing (CEP). In most cases, Storm is used together with Spark and the Hadoop platform. In such a layout, Spark replaces Hadoop MapReduce, while Storm replaces Hadoop YARN.
To summarize, Storm is a computing system without its own data storage, designed for stream data that arrives continuously. This is what distinguishes Storm from the other Big Data platforms described above (Hadoop and Spark).
Although the data analysis platforms from the chapter «Frameworks» can function autonomously and are suitable for developing any application, they (Hadoop in most cases) often play the supporting role of a simple data storage facility. Hence the need for DBMSs, which arrange data in tabular form, thus simplifying analytics, and include the tools needed for Big Data analysis.
Hive is an open source data warehouse designed to run queries on and analyze large volumes of data stored in Hadoop files. It is the most popular SQL-based DBMS on the platform and uses HiveQL as its query language. Hive automatically translates SQL-like HiveQL queries into MapReduce, Tez or Spark jobs.
Hive performs two main functions: referring and forming queries. The DBMS also supports data serialization/deserialization and makes schema design more flexible through the Hive Metastore system catalog.
According to the official documentation, Hive is not intended for OLTP workloads, nor does it offer real-time query processing. Rather, it is best suited to batch processing of large append-only datasets such as web logs.
Hive on LLAP (Live Long and Process) uses persistent query servers with intelligent in-memory caching to avoid Hadoop's batch-oriented latency and provide fast responses. At the same time, Hive on Tez still performs well for batch queries over petabytes of data.
In Hive, tables are organized much like tables in relational databases, and data units are arranged from larger to more detailed: databases consist of tables, which, in turn, consist of partitions. Data is accessed by means of an SQL dialect, and Hive itself supports appending and overwriting data. Incidentally, the tools for Big Data analysis that we consider next also use the structured query language.
The Hive ETL framework has the following architecture:
- CLI, JDBC, ODBC or any web-based graphical interface. These are the external interfaces to Hive that handle communication between the user and HDFS.
- Metastore Thrift API. It acts as a system catalog and keeps track of which data is stored in which part of HDFS.
- Driver. This is the foundation of the Hive architecture, in charge of compiling, optimizing and executing HiveQL statements.
- Thrift Server. It is a client API for executing HiveQL statements.
There are reasons why Hive remains popular:
- Anyone who knows SQL can work with Hive: HiveQL does not take much effort to learn.
- Hive is supported by the open source web interface Hue.
- It performs better than plain MapReduce thanks to query vectorization with the Tez engine.
- There is no need to write long Java MapReduce code, so less programming time is required.
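The convenience of a SQL-style interface over hand-written map and reduce logic can be illustrated as follows. In this sketch the standard-library `sqlite3` module merely stands in for a HiveQL engine, and the `weblog` table and its sample rows are invented for illustration.

```python
import sqlite3
from collections import Counter

logs = [("/home", 200), ("/home", 200), ("/cart", 500), ("/cart", 200)]

# SQL style: one declarative statement (sqlite3 is only a stand-in for Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblog (url TEXT, status INTEGER)")
conn.executemany("INSERT INTO weblog VALUES (?, ?)", logs)
sql_result = dict(
    conn.execute(
        "SELECT url, COUNT(*) FROM weblog WHERE status = 200 GROUP BY url"
    ).fetchall()
)

# Hand-written equivalent: an explicit map step (filter and emit the key)
# followed by a reduce step (count per key).
mapped = [url for url, status in logs if status == 200]
manual_result = dict(Counter(mapped))

print(sql_result)
print(manual_result)
```

Both approaches give the same counts; the SQL version states what is wanted and leaves the execution plan to the engine.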
Impala is the main competitor of Hive. It is a standalone open source engine for executing SQL queries on Hadoop clusters. The system runs fast interactive SQL queries directly against Hadoop data stored in HDFS or HBase. Besides using the same unified storage platform as Hive, Impala uses the same metadata, SQL syntax (HiveQL), ODBC driver and even the same user interface as its main competitor.
However, Impala has never used classic MapReduce; it executes queries with its own engine. Mike Olson, CEO of Cloudera, said that Hive essentially compiles SQL queries into a Java program that uses MapReduce functions and then executes it in batch mode, just like any other Hadoop job. Hive therefore adds yet another step before MapReduce, whereas Impala replaces MapReduce entirely.
According to the official documentation, Impala is an addition to the set of Big Data tools designed for executing queries. Impala does not replace batch processing frameworks built on MapReduce (such as Hive), which are best suited to long-running batch jobs.
Cloudera positions Impala mechanism as follows:
Impala consists of the following components:
- Clients. These include Hue, ODBC and JDBC clients and the Impala Shell, all capable of interacting with Impala. These interfaces are typically used for sending queries and for administrative tasks such as connecting to Impala.
- Hive Metastore. It holds information about the data available to Impala. Through the Metastore, the system knows which databases are available and what their structure is.
- Impala. A process that runs on DataNodes and coordinates and executes queries. Each Impala instance can receive, plan and coordinate queries from clients. Queries are distributed among Impala nodes, which then execute query fragments in parallel, i.e. act as «workers».
- HBase and HDFS. The storage for the data being queried.
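The coordinator and worker roles described above can be sketched with a thread pool. This is only an analogy under stated assumptions: `run_fragment` and `coordinate_avg` are hypothetical names, threads stand in for cluster nodes, and an average is used as the example query.

```python
from concurrent.futures import ThreadPoolExecutor

def run_fragment(partition):
    """Worker: compute a partial aggregate (sum, count) over its local data."""
    return sum(partition), len(partition)

def coordinate_avg(partitions):
    """Coordinator: fan the fragments out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(run_fragment, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

average = coordinate_avg([[1, 2, 3], [4, 5], [6]])
print(average)
```

Each worker returns only a small partial result, so the coordinator never has to move the raw data.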
Impala’s pros include the following:
- An SQL interface that is familiar from other Big Data processing tools.
- The possibility to query large amounts of data from Hadoop.
- Distributed queries in a cluster environment for convenient scaling.
- Different components can share files without copying, exporting or importing them.
- A single system for analyzing and processing Big Data: clients may avoid costly modeling.
To conclude, one can draw a comparison between Hive and Impala. Impala is not fault-tolerant: should one machine fail, the entire query has to be run again. On the other hand, Impala runs small queries 10 to 15 times faster than Hive. Even if a node fails and the query has to be restarted, the total execution time is still much less than for Hive. Therefore, Impala has a clear advantage for queries whose execution footprint is small enough to keep the chance of such a failure at an acceptable level.
Both systems are fast enough, have good functionality and are developed further on a regular basis by some of the leading names on the Big Data solution market, Cloudera and Apache. Hive, however, needs more care and attention. Running a script correctly requires specifying a dozen environment variables, the JDBC interface in the form of HiveServer2 works poorly, and the error messages have little in common with the actual cause of the problem. Impala is by no means perfect either, yet it is far more pleasant and predictable.
Presto is a distributed open source SQL query engine for running interactive analytical queries against disparate data sources ranging from gigabytes to petabytes. Presto was developed from scratch by Facebook for interactive Big Data analytics and is known for performance typical of commercial data warehouses.
The engine queries data where it resides, including Hive, Cassandra, relational databases and even proprietary data stores. A single Presto query may combine data from several sources, thus allowing analytics across the entire organization. Official sources state that Presto is designed for analysts who expect a response within seconds to minutes.
The diagram below demonstrates the simplified Presto architecture. A client sends an SQL query to the Presto coordinator. The coordinator parses, analyzes and plans the query execution. The scheduler wires together the execution pipeline, assigns work to the nodes closest to the data and monitors progress. The client pulls data from the output stage, which, in turn, pulls data from the stages below it.
The Presto execution model differs fundamentally from that of Hive and MapReduce. Hive translates queries into several stages of MapReduce tasks executed one after another; every task reads its input from disk and writes intermediate output back to disk. Presto, in contrast, does not use MapReduce: it employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is carried out in memory and pipelined across the network between stages, which avoids unnecessary I/O and the associated overhead. The pipelined execution model runs several stages at once and streams data from one stage to the next as soon as it becomes available. As a result, latency for many types of queries is reduced several-fold.
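The contrast between stage-by-stage and pipelined execution can be sketched with Python lists and generators. This is an analogy only: `staged` and `pipelined` are hypothetical functions, lists stand in for intermediate results written out between stages, and generators stand in for in-memory pipelines.

```python
def staged(rows):
    """Each stage finishes and materializes its full output before the next starts."""
    parsed = [int(r) for r in rows]               # stage 1 output kept in full
    filtered = [x for x in parsed if x % 2 == 0]  # stage 2 output kept in full
    return sum(filtered)

def pipelined(rows):
    """Stages are chained generators: a row flows on as soon as it is produced."""
    parsed = (int(r) for r in rows)
    filtered = (x for x in parsed if x % 2 == 0)
    return sum(filtered)

rows = ["1", "2", "3", "4"]
print(staged(rows), pipelined(rows))
```

Both functions return the same answer; the pipelined version simply never holds a complete intermediate result.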
Presto supports ANSI SQL, which means that in addition to JSON, ARRAY, MAP and ROW, one can use standard SQL data types, window functions, and statistical and approximate aggregate functions.
Compared to Hive, Presto has a drawback: developing, building and deploying user-defined functions takes more effort. Nonetheless, Presto is considered to be among the best open source engines for Big Data analysis.
Drill is yet another open source solution, an engine for executing SQL queries over semi-structured data stored in NoSQL systems and file stores. These include HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query may combine data from several stores. For example, a user may join a collection of user profiles in MongoDB with a directory of event logs in Hadoop.
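The idea of joining data from two differently shaped sources can be sketched in plain Python. The sample data, the field names `user_id`, `name` and `event`, and the helper `join_events` are all hypothetical; the snippet does not connect to MongoDB, Hadoop or Drill.

```python
import json

# Hypothetical source 1: JSON documents, as a document store might return them.
profiles_json = '[{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]'
# Hypothetical source 2: delimited event-log lines, as they might sit in files.
event_log = ["1,login", "2,login", "1,purchase"]

profiles = {p["user_id"]: p["name"] for p in json.loads(profiles_json)}

def join_events(profiles, log_lines):
    """Join each log event to the matching profile on user_id."""
    joined = []
    for line in log_lines:
        user_id, event = line.split(",")
        joined.append({"name": profiles[int(user_id)], "event": event})
    return joined

rows = join_events(profiles, event_log)
print(rows)
```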
Drill supports the SQL standard. Business analysts and Big Data professionals can use familiar analytics tools such as Tableau, Qlik, MicroStrategy, Spotfire, SAS and Excel to interact with non-relational data stores through regular JDBC/ODBC interfaces.
Processing a query in Drill typically includes the following steps:
- A Drill client issues a query. The client may be a JDBC/ODBC interface, the command line interface or the REST API. Any Drillbit (the Drill daemon running on a cluster node) may receive queries from clients; there is no master-slave concept.
- Next, the Drillbit parses and optimizes the query and generates a distributed query plan geared for fast and efficient execution.
- The Drillbit that received the query becomes the master node for that query. ZooKeeper provides it with the list of Drillbit nodes available in the cluster. The master node determines which machines are best suited to execute the various query fragments, thus maximizing data locality.
- The master node schedules the execution of the query fragments on individual nodes according to the execution plan.
- Individual nodes finish executing their fragments and return the data to the master Drillbit.
- The master node returns the results to the client.
Drill is not the first system to process queries. Rather, it is among the first to combine flexibility and speed. This became possible thanks to a radically new architecture whose performance does not compromise the flexibility of JSON. Drill's design includes:
- A columnar execution engine (the first one to support complex data);
- Data-driven compilation and recompilation at run time;
- Specialized memory management that optimizes the amount of RAM used;
- An advanced optimizer that pushes processing down into the data store whenever possible.
Predictive analytics platforms are programs that provide an integrated environment for machine learning, data mining, text analysis and business analytics. This chapter focuses on Big Data tools and techniques that not only analyze the data, but also manage and optimize decisions to obtain the best results.
RapidMiner is a free open source environment for predictive analytics. Its capabilities can be extended with add-ons, some of which are also free to use. The system supports all stages of in-depth data analysis, including visualization of results, validation and optimization.
A clear advantage of RapidMiner is that it requires no programming knowledge. It implements the visual programming principle: writing code is not necessary, and neither is complicated math. It works as follows: a user drags the data onto the workspace and then drags operators in the GUI to form a data processing flow. The user may inspect the generated code, but in most cases that is unnecessary, too.
This Big Data analysis platform is Hadoop-compatible if the paid extension RapidMiner Radoop is used. The extension requires that the Hadoop cluster be accessible from the client running RapidMiner Studio. The diagram below demonstrates the basic architecture of Radoop on Studio:
«Miner» is an extensible system that supports the R language, and fully integrated WEKA operators allow low-level work.
RapidMiner has an interesting feature, sentiment analysis (analysis of the tone of a text), which becomes available after installing an add-on from AYLIEN, a third-party company. It can be downloaded from the RapidMiner Marketplace. AYLIEN may, for example, collect data from Twitter, analyze the tweets and rank them on a mood scale: positive, negative or neutral.
The RapidMiner ecosystem develops quickly and is adapted to new platforms (basic Spark support was announced just 2 months after the platform release). Therefore, users should clearly understand what they are going to do and which specific tools out of the hundreds available they need. There is a good start page for those who are new to the subject, and the RapidMiner community may provide some help, too.
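To make the three-way mood scale concrete, here is a deliberately naive lexicon-based sketch. The word lists and the `sentiment` function are invented for illustration and bear no relation to how the AYLIEN add-on actually works; production systems use trained models.

```python
POSITIVE = {"good", "great", "love"}  # tiny illustrative lexicon
NEGATIVE = {"bad", "slow", "hate"}

def sentiment(text):
    """Label a text by counting positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for tweet in ["I love this great tool", "slow and bad", "it works"]:
    print(tweet, "->", sentiment(tweet))
```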
18. IBM SPSS Modeler
IBM SPSS Modeler is a commercial competitor of RapidMiner with a lower entry threshold. Beginners will find it easier to use thanks to its «autopilot» modes. Auto-models (Auto Numeric, Auto Classifier) go through several candidate models with different parameters to find the best of them. Even an inexperienced analyst can build a good model on this basis.
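The general idea of trying several candidate models and keeping the best one can be sketched as follows. This is a generic illustration, not IBM's algorithm: the nearest-neighbour regressor, the synthetic data and the names `knn_predict`, `validation_error` and `auto_select` are all assumptions.

```python
def knn_predict(train, x, k):
    """Predict y as the mean of the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_error(train, valid, k):
    """Mean absolute error of the k-NN model on held-out points."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in valid) / len(valid)

def auto_select(train, valid, candidates):
    """Try every candidate parameter and keep the one with the lowest error."""
    scores = {k: validation_error(train, valid, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores

train = [(x, 2 * x) for x in range(0, 20, 2)]  # y = 2x sampled at even x
valid = [(x, 2 * x) for x in range(1, 19, 4)]  # held-out odd x
best_k, scores = auto_select(train, valid, candidates=[1, 2, 5])
print(best_k, scores)
```

On this synthetic data, averaging the two nearest neighbours reproduces the held-out values exactly, so `k = 2` is selected.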
The main features of IBM SPSS Modeler are:
- Automated modeling: a single pass may test several modeling methods and compare the results to select the model to deploy.
- Geospatial analytics: latitude and longitude, postal codes and addresses are taken into account and combined with current and historical data to better understand the task and improve prediction accuracy.
- Open source technology support: R, Python, Spark, Hadoop and other open source technologies can be used to enhance the analysis.
- Text analytics: analysis of unstructured text data makes it possible to capture key terms, topics, sentiment and trends.
The IBM SPSS Modeler user interface keeps improving over time, so the system can already be called user-friendly. Performing simple tasks, such as creating expressions, requires no preparation at all. All this makes IBM SPSS Modeler a good analytics solution for beginners.
All those advantages of IBM SPSS Modeler, though, may be negated by a single disadvantage that cuts off a large part of the audience: the system is not the best tool for analyzing Big Data. The very attributes that make IBM SPSS Modeler simple to use are too limiting for a proper approach to Big Data technologies. Things can get really unpleasant when IBM SPSS Modeler fails under overload.
Nonetheless, IBM SPSS Modeler remains popular due to its ease of use and convenient interface.
KNIME is yet another free-to-use data mining system with good functionality even in its basic version. Like RapidMiner, KNIME provides a user-friendly environment with no need for actual programming. It offers a number of operators similar to those in RapidMiner (KNIME refers to them as «nodes»).
Regarding text analysis, the platform can perform the following tasks:
- Stemming: reducing variations of key terms to their base forms.
- Stop-word filtering: removing insignificant words.
- Tokenization: splitting text strings into smaller units, for example words and phrases, according to user-specified rules.
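The three text preparation tasks above can be chained into a small pipeline. This is a minimal sketch and not KNIME code: the stop-word list is tiny, and `naive_stem` is a crude suffix stripper written for illustration, whereas real stemmers use far richer rules.

```python
import re

STOP_WORDS = {"the", "a", "is", "are", "and", "of"}  # tiny illustrative list

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

terms = preprocess("The analysts are testing and tested the models")
print(terms)
```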
KNIME can also read information directly from Twitter and work with flat files such as CSV.
In addition, it covers deep learning, web analysis, image processing, social network analysis and more.
Yet RapidMiner is a simpler analytical platform for a beginner, because it automatically suggests possible reasons why operators cannot be connected. KNIME documents every node well, but it does not explain why a connection is missing. Finally, RapidMiner's text processing functionality is better for the moment.
Therefore, a beginner would turn to RapidMiner, while advanced professionals who have tried every system for Big Data analysis may find something interesting in KNIME.
16. Qlik Analytics Platform
Qlik Analytics Platform provides developers with all the tools needed for data-driven programming. Qlik is a leader in visual analytics, and its data analysis platform supports creating and developing both custom and commercial analytical applications, including mash-ups.
Qlik Analytics Platform grants full access to the QIX associative data indexing engine, which reveals relationships between several information sources that usually remain hidden in hierarchical data models. Notably, Qlik builds its other solutions on QIX as well. The QIX Engine uses columnar in-memory data storage, which improves performance during data indexing and compression. In practice, this allows free-form data exploration, with no need to define possible user queries in advance. Developers, in turn, can build Big Data applications faster, and users receive responses faster.
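Why a column-oriented layout helps with compression can be shown with a generic sketch. It illustrates columnar storage with dictionary encoding in general and makes no claim about QIX internals; `to_columns`, `dictionary_encode` and the sample rows are hypothetical.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list per column."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def dictionary_encode(column):
    """Store each distinct value once and replace values with small integer codes."""
    symbols = sorted(set(column))
    index = {value: code for code, value in enumerate(symbols)}
    return symbols, [index[value] for value in column]

rows = [
    {"country": "DE", "sales": 10},
    {"country": "FR", "sales": 7},
    {"country": "DE", "sales": 3},
]
columns = to_columns(rows)
symbols, codes = dictionary_encode(columns["country"])

# Aggregate one column by scanning only the codes of another.
de_code = symbols.index("DE")
de_sales = sum(s for c, s in zip(codes, columns["sales"]) if c == de_code)
print(symbols, codes, de_sales)
```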
Qlik Analytics Platform architecture includes the following elements:
- Qlik Management Console (QMC) and Dev Hub.
- Qlik Sense APIs and SDKs.
- Supporting services: Qlik Engine and Qlik Sense.
Qlik Analytics Platform can be used for developing analytical applications, information services or Internet of Things platforms. The system's strong visual and interactive capabilities help users make better use of the data they have.
15. STATISTICA Data Miner
This platform was originally developed by StatSoft. The system provides a full set of data mining methods. In particular, STATISTICA Data Miner includes tools for data preprocessing, filtering and cleaning, which allow efficient feature selection among thousands of possible predictors.
A distinctive feature of this platform is direct access to databases without explicit export/import operations. The software can process, read and write data in almost all standard file formats. Prediction models can be generated in various formats (PMML, C++, C#, Java, SAS, database stored procedures).
Users note that the built-in Data Mining Master, which builds STATISTICA Data Miner models automatically, is perfect for those who do not develop software (for example, market analysts). At the same time, the wide range of clustering tools, neural network architectures, classification and regression trees, multivariate modeling, and sequence, association and link analysis makes the platform a powerful tool in the hands of an expert.
Let us also note that the company recently introduced a new product, STATISTICA Big Data Analysis, which, as its name suggests, adds to the list of software for Big Data analysis. The platform is scalable; it can build samples with MapReduce, search via the Lucene/SOLR engine, run Mahout analytics, work in the «cloud» and perform Natural Language Processing of text. Integrated with the corporate edition STATISTICA Enterprise, STATISTICA Big Data Analysis enables enterprise-level Big Data analytics.
14. Informatica Intelligent Data Platform
Informatica calls it a «virtual data path». Informatica Intelligent Data Platform provides intelligence and management services that can work with the most popular data types and formats: web, social networks, machine logs.
This intelligent data analysis platform includes Vibe, a virtual engine that allows data integration to be defined once and then run in various environments. Like STATISTICA Data Miner, Informatica Intelligent Data Platform relies on a drag-and-drop interface: a user only needs to drag the necessary elements onto the workspace, and the system generates all the instructions itself.
The distinctive feature of Informatica Intelligent Data Platform is its approach to ingesting structured, semi-structured and unstructured data on a single semantic level. Relationships between the data are inferred through mapping, heuristics and pattern matching.
Informatica, a major player in the development of analytical tools for Big Data, proudly states that Informatica Intelligent Data Platform is the only platform with awards from both Gartner and Forrester in almost all categories of data management.
As for its architecture, Informatica Intelligent Data Platform consists of three layers:
- Vibe. The engine mentioned above, capable of managing any type of data. Because Vibe is an embedded engine, it grants universal access to data regardless of its location or format. Because Vibe is a virtual machine, the engine can run on any local or server platform, on Hadoop clusters or in the cloud.
- Data Infrastructure. This layer sits above the Vibe virtual machine. It includes all the services for automated continuous delivery of «clean», secure and connected data at any scale to any platform, Hadoop cluster or cloud service.
- Data Intelligence. This layer sits above Data Infrastructure. It collects metadata, semantic data and other information from the entire platform. Once the data is collected, Data Intelligence segments it to simplify further processing. The purpose of this layer is to provide Big Data processing methods: analytics, business intelligence (BI) and real-time operational intelligence (OI). Recently, machine learning was added to the capabilities of Data Intelligence.
So, the main features of the Informatica Intelligent Data Platform include hybrid structure that allows connecting any application to any device, systemization and globality of data, as well as data democratization that eliminates the need for software developer skills and knowing any software programming language by users to analyze the information.
It is worth mentioning, that Informatica’s partners in providing solutions based on Informatica Intelligent Data Platform are the companies Cognizant, Capgemini UK, Datawatch, MicroStrategy, Qlik, Tableau and Ultimate Software.
13. World Programming System
Meet yet another universal and powerful platform for working with Big Data: WPS. The World Programming System is positioned as the main competitor of SAS software products. Furthermore, the platform supports solutions written in the SAS language. The syntax supported by the current WPS version covers the core, statistical and graphical capabilities of applications built with the SAS language.
WPS is a SAS code interpreter. The main advantage of the platform is that it is much cheaper than SAS software for Big Data analysis. Therefore, World Programming System is the best way to run a SAS program without using SAS software products. Because WPS has some drawbacks related to reading and writing the .sas7bdat format, it is highly recommended to convert the data into the platform’s own format. Besides, WPS has its own editor that outperforms even SAS Enterprise Guide when it comes to coding and debugging an application.
As for its architecture, WPS is a modular system. Every WPS component is in charge of specific functionality. For example, language modules provide syntax and SAS macro support, developer modules customize WPS, interface modules handle interaction between a user and the system, while data modules grant access to standard databases and data storages.
WPS has an advantage: the license for this platform includes all modules. The development environment and graphical interface are functional enough to create, maintain and execute scripts and to process large datasets. On the other hand, WPS does not support noquotelenmax, it cannot use SYSTASK and it does not read the is8601dt format, so compromises are necessary.
12. Deductor
Deductor is an analytical platform developed by BaseGroup Labs. Deductor includes the most in-demand analysis algorithms (decision trees, neural networks, self-organizing maps etc.), dozens of visualization methods, and integration with multiple data sources and receivers.
The system employs technologies built on a uniform architecture that cover all stages of building an analytical platform: from creating a data warehouse to automated model selection and visualization of the results.
Deductor implements a scenario approach: processing logic is designed visually with wizards, without programming. Analysts get access to all analysis technologies (Data Warehouse, OLAP, Data Mining) and can create analysis scenarios without involving software developers. The analytical platform also provides data cleaning, in particular data de-duplication, i.e. assessing object similarity in order to enrich data and combine it into a single correct record.
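The de-duplication idea mentioned above (assess how similar two objects are, then combine them into one record) can be illustrated with a minimal Python sketch. This is not how Deductor does it internally; the field names, the 0.85 threshold and the use of the standard difflib module are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def merge_records(first: dict, second: dict) -> dict:
    """Combine two records, keeping the first non-empty value for each field."""
    merged = dict(first)
    for key, value in second.items():
        if not merged.get(key):
            merged[key] = value
    return merged


def deduplicate(records, key="name", threshold=0.85):
    """Merge records whose `key` fields are at least `threshold` similar."""
    result = []
    for record in records:
        for i, kept in enumerate(result):
            if similarity(record[key], kept[key]) >= threshold:
                # Similar enough: enrich the kept record instead of adding a new one.
                result[i] = merge_records(kept, record)
                break
        else:
            result.append(record)
    return result


# Hypothetical customer data with one near-duplicate entry.
customers = [
    {"name": "John Smith", "email": "john@example.com", "phone": ""},
    {"name": "Jon Smith", "email": "", "phone": "555-0100"},
    {"name": "Alice Brown", "email": "alice@example.com", "phone": ""},
]
unique = deduplicate(customers)
```

Here the two «Smith» entries are merged into one record that carries both the email and the phone number, while the third record stays untouched. A real system would compare several fields and use a tuned threshold.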
Deductor is capable of:
- Extracting data from disparate sources, consolidating it in a uniform storage and presenting it as reports and OLAP cubes.
- Finding hidden correlations and assessing model quality with Data Mining.
- Segmenting analysis objects, defining target markets, optimizing work with consumers, and using resources more rationally.
There are three versions of the platform: Academic, Professional and Enterprise. The first one is free and intended for educational purposes; the second is for professional analysis in workgroups; the third is for corporate use.
11. SAS Enterprise Miner
SAS Enterprise Miner is a software product developed for creating accurate predictive and descriptive models based on large amounts of information. This is a tool for business analysts: key scenarios include risk minimization, fraud detection and customer churn reduction.
A client receives SAS Enterprise Miner as a distributed client-server system. This means that data analysis processes are optimized, all necessary steps are supported within one solution, and large workgroups can cooperate flexibly within one project.
The tool implements an approach based on building data processing flow diagrams, thus eliminating the need for manual coding. Diagrams in SAS Enterprise Miner are self-documenting templates that may be modified or reused to solve tasks without repeating the analysis from the very beginning.
With the «drag-and-drop» graphical interface, business users can create models quickly and conveniently in an automated environment. Model deployment is automated as well. Final modeling process diagrams may be used as self-documenting templates; they are easy to edit, update and apply to new business tasks, saving time that would otherwise be spent on initial model preparation. In addition, the model description includes information about the specific contribution each independent variable makes to the final result.
Therefore, the main advantages of SAS Enterprise Miner include:
- A wide range of tools and support for the entire data mining process;
- Advanced scoring (applying a model to new data);
- Ease of use;
- A self-documenting project environment.
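To make the term «scoring» from the list above more tangible, here is a minimal Python sketch. It does not use SAS Enterprise Miner or the SAS language; it assumes an already trained logistic churn model whose coefficients, variable names and customer records are entirely hypothetical.

```python
import math

# Hypothetical coefficients of an already trained logistic churn model.
MODEL = {
    "intercept": -2.0,
    "weights": {"months_inactive": 0.8, "support_tickets": 0.5},
}


def score(model: dict, row: dict) -> float:
    """Apply the trained model to one new record and return a probability."""
    z = model["intercept"] + sum(
        weight * row[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def contributions(model: dict, row: dict) -> dict:
    """Per-variable contribution to the linear predictor (weight * value)."""
    return {name: weight * row[name] for name, weight in model["weights"].items()}


# New data that the model has never seen.
new_customers = [
    {"id": 1, "months_inactive": 0, "support_tickets": 0},
    {"id": 2, "months_inactive": 4, "support_tickets": 2},
]
scored = [{"id": c["id"], "churn_probability": score(MODEL, c)} for c in new_customers]
```

The `contributions` helper shows, in the simplest possible form, what it means to report how much each independent variable adds to the final result for a given record.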
Let us pay more attention to some other important tools. Some of them play a supporting role when working with the abovementioned solutions, while others are capable of carrying out analysis independently.
10. Zookeeper
Let us begin with a tool that is considered the main coordination service for all Hadoop infrastructure elements. Zookeeper provides a centralized service for synchronization across a cluster. It also maintains configuration information and provides distributed synchronization and a number of other services. Whenever such services are implemented in some form, a great deal of work goes into fixing the errors that inevitably emerge during execution. Zookeeper, being one of the best tools for Big Data, offers a simple and clear interface for dealing with those errors.
The distributed implementation of the tool runs in the so-called replicated mode. One server is appointed the master, while the other servers are slaves; in other words, a master-slave architecture is implemented.
If the master server fails, another node is elected as the new master. All servers are interconnected, and clients connect to one of them. Upon connection, the client is provided with the list of the other machines. Because the servers hold the same information, the client can continue its tasks without interruption.
Zookeeper may also be used in standalone mode, but this forfeits the advantages of replicated mode. Standalone mode is intended for testing and learning.
As a matter of fact, Zookeeper is mostly used as a configuration service, although its capabilities extend far beyond this.
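The failover behaviour described above can be shown with a toy Python simulation. It does not use the Zookeeper API and does not reproduce its real quorum-based election; the rule «the live server with the lowest id becomes the master» and all class names are assumptions chosen only to keep the sketch short.

```python
class Ensemble:
    """Toy model of a replicated server group with one master."""

    def __init__(self, server_ids):
        self.alive = set(server_ids)
        self.master = self._elect()

    def _elect(self):
        # Assumption for illustration: the live server with the lowest id wins.
        return min(self.alive) if self.alive else None

    def fail(self, server_id):
        """Mark a server as failed; elect a new master if the master died."""
        self.alive.discard(server_id)
        if server_id == self.master:
            self.master = self._elect()


class Client:
    """Connects to one server and keeps the list of the others for failover."""

    def __init__(self, ensemble, preferred):
        self.ensemble = ensemble
        self.known_servers = sorted(ensemble.alive)
        self.connected_to = preferred

    def ensure_connected(self):
        """Switch to another known live server if the current one is down."""
        if self.connected_to not in self.ensemble.alive:
            candidates = [s for s in self.known_servers if s in self.ensemble.alive]
            self.connected_to = candidates[0] if candidates else None
        return self.connected_to


ensemble = Ensemble([1, 2, 3])
client = Client(ensemble, preferred=1)
ensemble.fail(1)                 # the master goes down
new_master = ensemble.master     # server 2 takes over in this toy model
client.ensure_connected()        # the client moves to a live server
```

Because every server in the group holds the same information, the client in this sketch only needs the list of machines to carry on after a failure.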
9. Flume
Flume is a distributed service for collecting and aggregating event log records from various sources into a central data store, for example, HDFS. The tool is used primarily for transferring unstructured data. Flume is more than just logging. Because data sources are customizable in Flume, the service is also used to ingest event data from social media platforms, emails, images, video and other sources. Support for multiple data sources and horizontal scaling makes Flume good enough to be used by businesses like Facebook, Twitter, Amazon and eBay for transferring data into Hadoop.
A Flume source receives events, in a format the tool supports, from an external source such as a web server log or social media. Flume has so-called «collectors» that gather data from various sources and put it into a centralized store like HDFS or HBase. The data flows from the source into channels, which hold it until it is consumed by Flume sinks, which ultimately write the information to a store like HDFS. The tool improves reliability by creating multiple flows that store the data in several channels before it reaches HDFS. Because Flume uses a transactional approach, the chance of losing data before it arrives at the final destination is very low.
Flume’s advantages include:
- Acting as a mediator for data streams between the source and the receiver. When the data consumption rate is lower than the generation rate, Flume balances the stream using its channels.
- Easy data collection from several sources. Flume collectors are capable of connecting to various sources to collect disparate data and store it in a centralized store.
- Reliability. The transactional approach keeps the chance of losing data before it reaches the final destination very low.
- Recovery capabilities. The file channel is backed by the local file system, so its events can be restored should the collector fail. There is also a memory channel, which stores events in an in-memory queue; it is faster, but its events do not survive a failure of the process.
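The source, channel and sink flow and the transactional hand-over described above can be sketched in plain Python. This is not Flume’s API or configuration format; the class and function names are hypothetical, and a Python list stands in for HDFS.

```python
from collections import deque


class MemoryChannel:
    """Queue that buffers events between a source and a sink."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take_batch(self, size):
        """Remove up to `size` events and return them as one batch."""
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch

    def rollback(self, batch):
        """Return an undelivered batch to the front of the queue, in order."""
        self._queue.extendleft(reversed(batch))

    def __len__(self):
        return len(self._queue)


def drain(channel, sink, batch_size=2):
    """Deliver one batch; events leave the channel only if the sink succeeds."""
    batch = channel.take_batch(batch_size)
    try:
        sink(batch)
    except IOError:
        channel.rollback(batch)
        return False
    return True


storage = []          # stands in for HDFS in this sketch


def failing_sink(batch):
    raise IOError("storage unavailable")


channel = MemoryChannel()
for line in ["GET /a", "GET /b", "GET /c"]:   # events from a web server log
    channel.put(line)

first_try = drain(channel, failing_sink)      # delivery fails, batch is kept
second_try = drain(channel, storage.extend)   # delivery succeeds
```

The first delivery attempt fails and the batch goes back to the channel; the second attempt succeeds, so the same two events end up in storage and one event is still waiting in the channel. The channel also absorbs the difference between the rate at which events arrive and the rate at which the sink can write them.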
8. IBM Watson Analytics
The IBM Watson Analytics software solution is a powerful Big Data analytics tool. The platform is capable of working in a cloud. After loading the initial data into the system, a user gets the information organized, with links between elements highlighted.
In practice, IBM Watson Analytics allows companies to clarify how external factors affect a client’s financial flows or industrial facilities. The system does the complicated math automatically and shows the user the factors that matter most for their business, as well as patterns and interconnections between individual elements.
The graphical user interface of the system is clear and pleasant; it supports drag-and-drop. All necessary data and diagrams may be placed in the workspace by dragging. Visualizations, graphs and diagrams all help to understand the current situation. IBM Watson Analytics renders graphics fast, however sophisticated they might be.
It is interesting that since October 2014 IBM Watson Analytics has processed and structured Twitter user information. The cooperation allows discovering trends that are typical for particular regions: a city, a country, a continent. IBM Watson Analytics is one of the key tools for working with Big Data, in healthcare and marketing above all.
7. Dell EMC Analytic Insights Module
Dell EMC Analytic Insights Module is a tool that unites self-service analytics and cloud applications on a uniform platform. The approach allows data analysis professionals to focus on fast creation (days or weeks instead of months) of models that are highly valuable for businesses.
To implement the concept, Dell EMC created an open platform designed to cover the full data analysis lifecycle with several key components: data lake, data curator, data governor, and data and analytic catalog. With those components, businesses and institutions can gather the information they need through deep analysis and form a uniform view of all data, preventing the data silo phenomenon.
Data Lake is in charge of consolidating data in a uniform storage. The component removes the complexity of storing data in silos with large amounts of unsorted information. Data Curator builds on the data lake to provide a uniform format for all studied and indexed datasets, both from the data lake and from external sources. According to Dell EMC, the curator saves up to 80% of the time data analysis professionals spend preparing information for analytics. Data Governor holds information about data origin and provides data security throughout the entire analysis process. Also, Data Curator allows seeing and using datasets in an end-to-end format.
Therefore, with Dell EMC Analytic Insights Module, a user can:
- Study, use and index all data in a uniform format with Data Curator;
- Trace data origin and ensure control and security for all applications and data storages with Data Governor;
- Convert important information into data management applications and business models.
6. Windows Azure HDInsight
Windows Azure HDInsight is a Microsoft solution for deploying the Hadoop platform on Windows Server and in the Windows Azure cloud platform. The tool provides optimized open source analytical clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server. Each of those Big Data technologies may be deployed as a managed cluster with enterprise-grade security and monitoring.
HDInsight is backed by a 99.9% service level agreement (SLA) that covers the entire Big Data solution on Azure rather than individual virtual machines only. HDInsight is designed for redundancy and availability, and provides master node replication, data geo-replication and a built-in standby NameNode.
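To put the 99.9% figure into perspective, the following Python sketch converts an availability percentage into the maximum downtime it permits. The measurement periods used here (a 365-day year and a 30-day month) are assumptions; the actual terms and calculation method are defined in the provider’s agreement.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Maximum downtime that still satisfies the given availability level."""
    return period_hours * (1.0 - availability_percent / 100.0)


HOURS_PER_YEAR = 365 * 24       # 8760, ignoring leap years
HOURS_PER_MONTH = 30 * 24       # 720, assuming a 30-day month

per_year = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
per_month_minutes = allowed_downtime_hours(99.9, HOURS_PER_MONTH) * 60
```

Under these assumptions, 99.9% availability leaves room for roughly 8.76 hours of downtime per year, or about 43 minutes per month.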
Single sign-on, multifactor authentication and simple management of millions of identities in Azure Active Directory protect data resources and extend local security controls to the cloud. It is also worth noting that the tool provides the highest level of business continuity, thanks to its improved alerting, monitoring and proactive actions.
A strong competitive advantage of HDInsight is its economical cloud scaling. Because local storage can be used for caching and for improving input/output performance, workloads can be scaled at reasonable cost.
Finally, HDInsight allows deploying Hadoop and Spark clusters with applications from third-party software developers, further improving the service.
5. Microsoft Azure Machine Learning
Having successfully introduced the Hadoop-oriented tool HDInsight to the market, the Azure division of Microsoft announced yet another achievement in Big Data: the public release of Microsoft Azure Machine Learning.
This is a cloud service for predictive analytics that allows creating and deploying predictive models quickly and efficiently. The system is known for its simplicity: one does not need to be a mathematician to work with machine learning efficiently in the Microsoft Azure Machine Learning environment. ML Studio, the integrated development environment, provides drag-and-drop tools and simple data flow diagrams. This not only cuts the amount of code: the built-in library of sample experiments (as projects are called in ML Studio) saves users much time, too.
The tool provides a set of ready-to-use algorithm libraries. Those algorithms may be used to create predictive models from any computer with an Internet connection. There are multiple examples and solutions in Cortana Intelligence Gallery.
Microsoft Azure Machine Learning provides not only tools for building predictive analytics models but also a fully managed service for deploying those models as ready-to-use web services.
Despite its advanced functionality, one could hardly accuse Microsoft Azure Machine Learning of being too expensive. Because the service works in the public Azure cloud, there is no need to purchase costly hardware or software.
Perhaps Microsoft Azure Machine Learning is the best tool for machine learning today.
4. Pentaho Data Integration
The Pentaho Data Integration system is the component of the Pentaho suite responsible for the extraction, transformation and loading of data (ETL). Although ETL systems are usually used as part of a data warehouse, Pentaho Data Integration tools can be used to:
- Exchange data between applications and databases;
- Export data from database tables into files;
- Upload data arrays into databases;
- Process data;
- Integrate into applications.
Pentaho does not require writing code, as the entire process is designed visually, which allows describing Pentaho Data Integration as a metadata-oriented system. With the dashboard and the interactive graphical tools, users may analyze data across several dimensions.
Pentaho Data Integration simplifies the integration of large amounts of data with a drag-and-drop tool that moves data from storages to Big Data storages. The system is also capable of supplementing and combining structured data sources with semi-structured or unstructured sources to ultimately produce a unified picture.
The tool may be fully customized: visualization, interactive reports, the dashboard and ad hoc analysis can all be set up by a user. Because Pentaho Data Integration is a 100% Java platform built on industry standards like RESTful web services, integrating it with any application is not a problem at all.
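Pentaho Data Integration builds such pipelines visually, but the three ETL stages themselves are easy to show in code. The Python sketch below is only an illustration of extract, transform and load using the standard csv and sqlite3 modules; the sample data, the table name and the cleaning rules are hypothetical and have nothing to do with Pentaho’s own components.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read a file or another database.
SOURCE_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
3,carol,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, convert amounts, skip unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append((int(row["id"]), row["name"].strip().title(), amount))
    return clean


def load(rows, connection):
    """Load: write the cleaned rows into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
loaded = connection.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
```

Two of the three source rows reach the table; the row with an unparseable amount is dropped during the transform stage.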
3. Teradata Aster Analytics
Teradata Aster Analytics is a tool that allows working with text, graphs, machine learning, patterns and statistics through one interface and one syntax. Business analysts and data analysis professionals may carry out complex analysis of data from the entire enterprise with a single query. Teradata Aster Analytics has more than 100 integrated advanced analytical queries.
This tool allows combining Graph, R and MapReduce within one framework. With all functions executable as SQL commands and all analytical engines embedded into the tool, Teradata Aster Analytics delivers high performance when processing large data arrays.
Teradata Aster Analytics is available within the Hadoop ecosystem and on Amazon Web Services.
Teradata Aster Analytics on Hadoop:
- Expands scenarios for using Data Lake. Teradata Aster Analytics makes the «hardware elephant» accessible to most business analysts with SQL or R skills.
- Works natively. Users do not need to move data from Hadoop to servers to analyze the data.
- Implements analytics quickly. Users may create an isolated sandbox environment and a working environment on one Hadoop cluster with the same data.
Teradata Aster Analytics on Amazon Web Services:
- Speeds up return on investment. A company may prepare an isolated analytical sandbox in the cloud faster and use built-in SQL queries to accelerate development.
- Improves analytics flexibility. A data analysis professional is provided with a powerful kit of disparate tools. Every analyst may find the tool they need for working with Big Data.
- Cuts financial costs. Companies may use built-in advanced analytical functions and datasets without purchasing new equipment.
2. SAP BusinessObjects Predictive Analytics
This tool helps optimize resources and improve return on investment at the corporate level.
Integrating expert analytics with the model manager yields faster and more accurate predictions and brings predictive insights into business processes and applications, the areas where users interact with each other.
With SAP BusinessObjects Predictive Analytics, one can:
- Automate data preparation, predictive modeling and deployment and, as a result, retrain the model easily;
- Use advanced integrated visualization capabilities to draw conclusions faster;
- Integrate the R programming language to get access to a larger number of user scripts;
- Work together with SAP HANA.
SAP BusinessObjects Predictive Analytics pushes the limits of Spark to provide clients with advanced interactive data analytics. The current version of the tool allows connecting to SAP HANA Vora and carrying out predictive modeling automatically. Using native Spark modeling on the same Spark instances, SAP HANA Vora allows distributed processing of automated algorithms.
It is worth mentioning that in April 2015 Forrester Research named SAP a leader in Big Data predictive analytics.
1. Oracle Big Data Preparation
Built on Hadoop and Spark for scalability, the Oracle Big Data Preparation cloud service provides analysts with a user-friendly, interactive way to prepare structured, semi-structured and unstructured data for subsequent processing.
Just like most of the abovementioned tools, Oracle Big Data Preparation targets business users; therefore, the service is easy to use. Scalability allows working with iterative machine learning in a cluster computing environment. Another advantage of Oracle Big Data Preparation is its integration with a number of cloud services.
As for the functionality of the tool, it has four parts: ingest, enrich, govern and publish. There is also intuitive authoring.
As for ingest, the service imports and processes miscellaneous information, cleans data (for example, removes insignificant characters), standardizes dates, telephone numbers and other data, and also detects and removes unneeded duplicates.
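The ingest steps listed above (cleaning characters, standardizing dates and phone numbers, removing duplicates) can be sketched in a few lines of Python. This is an illustration of the idea rather than of the Oracle service itself; the field names, the list of accepted date formats and the sample rows are assumptions.

```python
import re
from datetime import datetime

# Assumed input formats; ambiguous dates such as 03/04/2021 are read as month/day.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def clean_text(value):
    """Remove control characters and collapse repeated whitespace."""
    value = re.sub(r"[\x00-\x1f\x7f]", " ", value)
    return re.sub(r"\s+", " ", value).strip()


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_phone(value):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", value)


def ingest(rows):
    """Clean each row, then drop rows that duplicate an earlier one."""
    seen, result = set(), []
    for row in rows:
        cleaned = (
            clean_text(row["name"]),
            standardize_date(row["signup"]),
            standardize_phone(row["phone"]),
        )
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical raw input: the first two rows describe the same person.
raw = [
    {"name": "Ann\tLee ", "signup": "03/15/2021", "phone": "(555) 010-0000"},
    {"name": "Ann Lee", "signup": "2021-03-15", "phone": "555-010-0000"},
    {"name": "Bo  Chan", "signup": "15.03.2021", "phone": "555 010 0001"},
]
prepared = ingest(raw)
```

The first two rows only become recognizable as duplicates after standardization, which is why cleaning comes before de-duplication in this sketch.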
Enrich includes determining data categories and identifying data characteristics in terms of attributes, properties and schemas, as well as discovering metadata (schema discovery identifies the schema/metadata that is directly or indirectly defined in headers, fields or tags).
Govern and publish provide an interactive dashboard that gives a unified view of all processed datasets with corresponding metrics and options for further detailed audit and analysis. In turn, various publishing formats improve flexibility.
The intuitive authoring features include recommendations that guide every user’s step during the development process. Tutorial videos and manuals for working with the environment also help.
We have just reviewed a number of data analysis tools from the best Big Data solution developers. As one could notice, most solutions are open source. There are really many frameworks, databases, analytical platforms and other tools; therefore, a clear understanding of the task is necessary. Once the goal is formulated, there should be no problem selecting the best tool (or toolkit) for carrying out adequate data analysis.