Put simply, Apache Drill is a distributed system for interactive ad-hoc analysis of large-scale datasets. It is designed to handle up to petabytes of data spread across thousands of servers. The goal of Drill is to answer ad-hoc queries with low latency. Apache Drill is billed as the first open source, schema-free SQL query engine for Hadoop, NoSQL, RDBMS and cloud storage.
Although many tools are available in the Big Data space, Apache Drill, developed under the Apache Software Foundation, is an open source solution whose features have drawn considerable enthusiasm from experts. Drill is an open source implementation inspired by Google’s Dremel system, which Google offers as an infrastructure service called Google BigQuery. Drill itself is an Apache top-level project.
Drill enables ‘SQL on Everything’: it allows you to query self-describing data wherever it is, using standard SQL. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
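To make the cross-datastore join concrete, here is a minimal Python sketch that only assembles the SQL text for such a query. It assumes the storage plugins are registered under the names mongo and dfs. The database, collection, directory and column names (app, users, /logs/events, user_id, country, event_type) are hypothetical placeholders, not taken from a real deployment.

```python
def build_join_query(database: str, collection: str, log_dir: str, limit: int = 10) -> str:
    """Assemble a Drill SQL statement that joins a MongoDB collection
    with a directory of log files.

    Assumptions: the plugins are named `mongo` and `dfs`, and both data
    sets carry a `user_id` field. All column names are hypothetical.
    """
    return (
        "SELECT u.user_id, u.country, e.event_type\n"
        f"FROM mongo.{database}.`{collection}` AS u\n"
        f"JOIN dfs.`{log_dir}` AS e\n"
        "  ON u.user_id = e.user_id\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Print the statement so it can be pasted at the Drill prompt.
    print(build_join_query("app", "users", "/logs/events"))
```

The point of the sketch is that both sources appear in one FROM/JOIN clause, each addressed through its storage plugin prefix, with no export or load step in between.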
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.
At a high level, Apache Drill’s architecture (Fig. 1) comprises the following layers:
User: providing interfaces such as a command line interface (CLI), a REST interface, JDBC/ODBC, etc., for human or application-driven interaction.
Processing: allowing for pluggable query languages as well as the query planner, execution, and storage engines.
Data sources: pluggable data sources, either local or in a cluster setup, providing in-situ data processing.
The Drillbit is the core of Apache Drill: it is the Drill daemon. It has to run on every node of the cluster, which gives it a role comparable to that of the TaskTracker in a Hadoop cluster. Drillbits use another service, ZooKeeper, to coordinate with one another in the cluster; ZooKeeper maintains cluster membership.
When a client sends a query, a Drillbit accepts the request and starts processing it. This includes distributing the query across the cluster and returning the results to the client. The Drillbit that receives the request from the client is called the Foreman. The Foreman generates the execution plan, after which the execution fragments are sent to the other Drillbits running in the cluster. The accompanying diagram illustrates this flow.
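The hand-off from the Foreman to the other Drillbits can be pictured with a toy model. The Python sketch below is purely illustrative: it uses round-robin assignment as a stand-in, whereas Drill's own planner takes factors such as data locality into account. The function name, fragment names and node names are made up.

```python
from itertools import cycle


def assign_fragments(fragments, drillbits):
    """Toy model of a Foreman handing execution fragments to Drillbits.

    Round-robin is an assumption made for illustration only; it is not
    how Drill actually schedules fragments.
    """
    if not drillbits:
        raise ValueError("at least one Drillbit is required")
    assignment = {node: [] for node in drillbits}
    # Walk the fragments in order, cycling through the available nodes.
    for fragment, node in zip(fragments, cycle(drillbits)):
        assignment[node].append(fragment)
    return assignment


if __name__ == "__main__":
    plan = ["scan-0", "scan-1", "scan-2", "aggregate-0"]
    nodes = ["drillbit-a", "drillbit-b"]
    for node, work in assign_fragments(plan, nodes).items():
        print(node, work)
```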
Core modules within a Drillbit
Each Drillbit consists of the following key components:
Drillbits rely on ZooKeeper for coordination. ZooKeeper does not execute query plans itself; it keeps track of which Drillbits are available in the cluster. For this reason it is recommended that client requests go through ZooKeeper, which knows which nodes are up and able to run the client's query.
Apache ZooKeeper is an open source project of the Apache Software Foundation.
It is a centralized service for maintaining and managing configuration information, a naming registry and synchronization in a distributed cluster environment. ZooKeeper reduces the complexity of distributed systems by providing low-latency, highly available services, and it acts as a high-performance coordination service for distributed applications.
Apache Drill installation in embedded mode
Download a stable version of Apache Drill.
Copy the downloaded .tar.gz file to the installation directory.
Extract the contents of the Apache Drill archive using the commands below on your console.
tar -xvzf apache-drill-1.8.0.tar.gz
mv apache-drill-1.8.0 drill
Go to the Drill installation directory and start Apache Drill in embedded mode using the following commands on your console.
cd drill
bin/drill-embedded
After Drill has started successfully, you can access it from the console or from the web interface. Here we first execute queries at the Drill prompt.
Apache Drill ships with a sample JSON file, employee.json, which is available on the Drill classpath, along with other built-in datasets. For example, the following query returns the first five employee records:
SELECT * FROM cp.`employee.json` LIMIT 5;
Drill’s Web Interface
Access the Apache Drill web UI: http://localhost:8047/
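Besides the browser UI, the same port serves Drill's REST API, which accepts SQL through a POST to /query.json. A minimal sketch using only the Python standard library follows. It assumes an embedded Drill is listening on localhost:8047 without authentication; the helper names build_query_request and run_query are illustrative, not part of Drill.

```python
import json
import urllib.request

DRILL_URL = "http://localhost:8047"  # assumed: local embedded Drill, no auth

# Query against the sample employee file bundled on Drill's classpath.
SAMPLE_QUERY = "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 3"


def build_query_request(sql: str, base_url: str = DRILL_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str, base_url: str = DRILL_URL) -> dict:
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running Drill instance; each row is a dict keyed by column name.
    for row in run_query(SAMPLE_QUERY)["rows"]:
        print(row)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running Drill instance.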
Configuring storage plugins:
Custom Storage Plugins:
You can see existing storage plugins, or create new plugins.
To create a new plugin, enter its name under New Storage Plugin on the Storage page, click the Create button, update the JSON definition in the configuration box, and then click the Create button again.
Storage Plugin Configuration:
To register a new storage plugin configuration, enter a storage name, click CREATE, provide a configuration in JSON format, and click UPDATE.
Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses /tmp/drill/sys.storage_plugins to store storage plugin configurations.
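As an illustration of what goes into the configuration box, the sketch below builds a file-system plugin definition in Python and prints it as JSON. The field layout is modeled on the dfs plugin that ships with Drill and should be checked against your Drill version. The function name, the workspace name data and the path /var/data are hypothetical.

```python
import json


def file_plugin_config(root: str, workspace: str = "data") -> dict:
    """Return a file-system storage plugin definition as a plain dict.

    Assumption: the keys follow the layout of Drill's bundled `dfs`
    plugin; only two formats are listed to keep the example short.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": "file:///",  # local file system
        "workspaces": {
            workspace: {
                "location": root,
                "writable": False,
                "defaultInputFormat": None,
            }
        },
        "formats": {
            "json": {"type": "json"},
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
        },
    }


if __name__ == "__main__":
    # Paste the printed JSON into the configuration box on the Storage page.
    print(json.dumps(file_plugin_config("/var/data"), indent=2))
```

With a definition like this registered under a name such as myfiles, files would be addressed in queries as myfiles.data.`some_file.json`.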
Drill Query Flow in Web Interface
The Drill web interface shows how a query was executed:
The Drill Web Console is one of several client interfaces you can use to access Drill.
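To open the Drill Web Console, launch a web browser and go to one of the following URLs, depending on the configuration of HTTPS support:
http://<IP address or host name>:8047 (when HTTPS support is disabled, the default)
https://<IP address or host name>:8047 (when HTTPS support is enabled)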
The UI provides different screens that show aggregate metrics for each major fragment that executed the query. It also shows metrics for the minor fragments that were parallelized for each major fragment, and for each operator within a major fragment.
Apache Drill, coupled with a suitable NoSQL data store, opens up the opportunity for a single data source to be used for both transactional and analytical processing. There is no need to export or transform the data into an application-specific format, or even a star schema, in order to load it into a data warehouse.
Standard tools can continue to be leveraged by the business without further costs, thanks to support for standards like ANSI SQL:2003 and ODBC / JDBC drivers.
The time saved by not having to write software to serialize and deserialize the data into the data structures of any given language, coupled with simplified software testing and less code to maintain, adds up to a big saving for any business.