Apache Drill: An open source framework that will make you go weak in the knees

Posted by   May 28th, 2017

Facebooktwitterlinkedinmail

In simple words, Apache Drill is a distributed system for interactive ad-hoc analysis of large-scale datasets. It is designed to handle data up to petabytes that is spread across thousands of servers. The goal of Drill is to respond to ad-hoc queries in a low-latency manner.  Apache Drill is the first Open Source Schema-free SQL Query Engine for Hadoop, NoSQL, RDBMS and Cloud Storage.

Introduction:-

Although there are many tools available in the market for Big Data space, Apache Drill; developed by Apache Software Foundation, is an open source solution that has made experts go gaga over its features.   Drill is the open source version of Google’s Dremel system which is available as an infrastructure service called Google BigQuery and is an Apache top-level project.

Drill Enables ‘SQL on Everything’, allows you to query self-describing data wherever it is, using standard SQL.  A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.

Architecture:-

At a high level, Apache Drill’s architecture (Fig. 1) comprises the following layers:

 

Client

Providing interfaces such as a command line interface (CLI), a REST interface, JDBC/ODBC, etc.,for human or application driven interaction.

Processing

Allowing for pluggable query languages as well as the query planner, execution, and storage engines.

Data sources

Pluggable data sources either local or in a cluster setup, providing in-situ data processing.

Key Features:-

  1. Gets started in minutes
  2. Schema-free JSON model
  3. Query complex, semi-structured data in-situ
  4. Real SQL — not “SQL-like”
  5. Leverages standard BI tools
  6. Has interactive queries on Hive tables
  7. Enables users to access multiple data sources
  8. User-Defined Functions (UDFs) for Drill and Hive
  9. High performance
  10. Scales from a single laptop to a 1000-node cluster

Drillbits:-

Drillbit is the core of Apache Drill and also a drill daemon. It has to run on every node of the cluster, giving it the role of “Task Tracker” in a Hadoop cluster environment. Drillbit uses another service called zookeeper  for all the communication in the cluster environment and it maintains cluster membership.

When a client sends a query, Drillbit takes responsibility for accepting requests from the client and starts processing the queries. This includes distributing the query in the cluster environment and returning results to the client. The Drillbit which receives the request from the client is called ‘foreman’. Foreman  generates the execution plan after which the execution fragments are sent to other Drillbits running in the cluster. The following diagram explains the flow of Drillbit in a clear manner:

Drillbit features:

  • The Drillbit is responsible for the query execution and the scheduling.
  • A Drillbit cluster can initiate a query, and becomes the Drill process that is responsible for sending back the results back to the client.
  • Drill bit processes and allows the running on the same node in the hadoop cluster. It has data node on each node while the zookeeper has every node in hadoop cluster environment.
  • Drillbit is a service which we have to install in drill and run all the nodes in any distributed cluster environment. In other words, it should not be limited to the Hadoop cluster.
  • Drillbit runs on each node in the cluster and gets the data during query execution without transferring the data over the network.
  • Drillbit service is responsible for taking any client request, parsing the requested query and generating distributed query plan for faster and optimistic execution in cluster environment.

Core modules within a Drillbit:-

Each Drillbit consists of the following key components:

  • The RPC endpoint exposes a low overhead protobuf-based RPC protocol to communicate with the clients and receive queries.
  • The SQL parserparses incoming querie, and is based on the open source framework, Calcite. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query.
  • Drill serves as a query layer on top of several data sources. Storage plugin interfaces in Drill uses to interact (Read/Write) with the data sources to help with efficient and faster execution of Drill queries on a specific data source.
  • On Hadoop, storage plugin information is saved in the Zookeeper.  Drill provides storage plugins for files, for HBase and MapR-DB tables, and for Hive.  The plugins are extensible, allowing you to write new plugins for any additional data sources
  • Drill integrates with Hive as a storage plugin. This happens because Hive provides a metadata abstraction layer on top of files, HBase or MapR-DB, and provides libraries to read data and operate on these sources.
  • It includes third party plugins like custom client APIs such as Python and custom DSL layer such as Pig. The SQL layer in Drill is extensible and you can create or customize user defined functions and user defined aggregate functions using a high- performance Java API that is included with Drill.

Drill bit uses Zookeeper. Zookeeper’s responsibility is to execute various query plans in distributed nodes in clustered environment. It is useful and recommended to be used for every request to go through with zookeeper. This is because zookeeper is responsible to communicate with different nodes in cluster environment and it knows the availability of node to run or executes the client query.

Apache Zookeeper:-

Apache Zookeeper is an Open Source Tool and is a project of the Apache Software Foundation.

It is a centralized server for maintaining and managing configuration information, naming registry and synchronization for distributed cluster environment. Zookeeper supports the distributed systems to reduce the complexity by providing low latency and high availability services. It also has a high performance distributed coordination service for distributed applications.

  Zookeeper features:

  •  Naming service(Registry)
  • Configuration management
  • Locking and Synchronization service
  • Leader election
  • Message Queue
  • Notification system
  • Cluster Management
  • Low Latency
  • High Available

Apache drill installation in embedded mode

  1. Prerequisites for Installing
  • JDK 1.7 or above
  • Apache Drill apache-drill-1.8.0 or above

Download a stable version of apache drill.

http://mirror.fibergrid.in/apache/drill/drill-1.8.0/apache-drill-1.8.0.tar.gz

Copy the downloaded gz file to the installation directory.

Extract the contents of the Apache-drill file using below command on your console.

cd /home/cloudera

wget http://mirror.fibergrid.in/apache/drill/drill-1.8.0/apache-drill-1.8.0.tar.gz

tar -xvzf apache-drill-1.8.0.tar.gz;

mv apache-drill-1.8.0 drill

  1. Verifying the Drill installation setup

Go to the Drill installation directory and start Apache drill in embedded mode using following command on your console.

cd /home/cloudera/drill

bin/drill-embedded

After successfully starting the drill, we can access the drill by using console or web interface, here we first execute the queries on drill prompt. 

  • To list out the available Databases:

  • Datasets:

 

  • Querying JSON data:

Apache Drill provides Sample JSON file employee, which is available in drill class path. The Drill also provides a built in datasets. You may check the following examples:

Drill’s Web Interface

 Access the Apache Drill web UI: http://localhost:8047/

Query:

 

Query output:

Configuring storage plugins:

Custom Storage Plugins:

You can see existing storage plugins, or create new plugins.

To create a new plugin, enter the name under New Storage Plugin on the Storage page, click on create button, update the JSON definition in the configuration box, and then click on the create button.

 

 

Storage Plugin Configuration:

Registering a Storage Plugin Configuration

To register a new storage plugin configuration, enter a storage name, click CREATE, provide a configuration in JSON format, and click UPDATE.

Storage Plugin Configuration Persistence

Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses /tmp/drill/sys.storage_plugins to store storage plugin configurations.

MySQL Configuration:

Oracle configuration:

Query Status:

Drill Query Flow in Web Interface

Drill web interface to see the output of how a query is executed:

The Drill Web Console is one of several client interfaces you can use to access Drill.

To open the Drill Web Console, launch a web browser and go to one of the following URLs depending on the configuration of HTTPS support:

  • http://<IP address or host name>:8047
    Use this URL when HTTPS support is disabled (the default).
  • https://<IP address or host name>:8047
    Use this URL when HTTPS support is enabled.

UI provides different screens to show aggregate metrics for each major fragment that executed the query and also shows metrics for the minor fragments that were parallelized for each major fragment and each operator within a major fragment.

Conclusion:

Apache Drill coupled with the proper NoSQL data store opens up the opportunity for a single data source to be used for both transactional and analytical processing. There is no need to export or transform the data into an application-specific format or even a star schema format in order to load into a data warehouse.

Standard tools can continue to be leveraged by the business without further costs, thanks to support for standards like ANSI SQL:2003 and ODBC / JDBC drivers.

Time saved by not having to create software to serialize and deserialize the data into the data structures for any given language, coupled with simplified software testing and less code to maintain, add up to a BIG saving for any business.

Leave a Reply

Your email address will not be published. Required fields are marked *