Open Source Tools That Enable Real-Time IoT Data Analytics

By Dr Dharmendra Patel and Dr Atul Patel


Billions of devices are now connected through the Internet of Things, and they generate vast amounts of new data every day. This data holds useful, novel and hidden insights that business organisations can capitalise on. Analytics on this data helps to detect anomalies, mitigate service disruptions, predict problems early, and improve customer satisfaction.

Data analytics is vital for the success of IoT projects and businesses. Several characteristics of IoT applications make them well suited to analytics.

  • Volume. IoT projects generate a large amount of data every day. Analysing this data reveals patterns that help organisations make wise and timely decisions.
  • Variety of data. IoT applications capture diverse types of data: structured, semi-structured, and unstructured. Analytics is needed to derive useful and novel insights from such a variety of data.
  • Real-time decision-making. Many IoT applications generate real-time data. Analysing this data as it arrives is essential for fast and accurate decision-making, and in many IoT projects the outcome depends on it.
  • Business growth. Businesses use the data generated by IoT for revenue generation. Analytics draws insights from this large volume of data that help businesses foresee trends, plan their strategy, and increase their earnings.

Types of analytics for IoT

Traditional analytics combines reporting with predictive analytics, but this approach is not effective for IoT applications, which have different characteristics. IoT analytics requires real-time analysis of streaming data. It deals with more data, and needs automation and integration to perform efficiently.

Streaming analytics

According to Jerry Baulier, vice president, Internet of Things R&D, SAS, “Streaming analytics generates informed decisions in milliseconds from millions of devices, and also examines thousands of events per second.” Streaming analytics also helps with security by rapidly identifying events that indicate threats and risks. It plays a key role in data-driven organisations, and allows them to build real-time solutions using IoT. Real-time streaming analytics uses Big Data to generate statistics and visualisations that measure the efficacy of any process in the organisation.

Here are several open source tools for streaming analytics.
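
As a rough illustration of examining events per second, the sketch below counts events per device in fixed, non-overlapping (tumbling) one-second windows. It is plain Python rather than any of the tools discussed in this article, it assumes events arrive in timestamp order, and the function name, the sample events and the device IDs are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_s: float = 1.0
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts_per_device) each time a window closes.

    Assumes events are (timestamp_in_seconds, device_id) pairs that arrive
    in timestamp order.
    """
    current_window = None
    counts: Counter = Counter()
    for timestamp, device_id in events:
        window = int(timestamp // window_s)
        # A new window index means the previous window is complete.
        if current_window is not None and window != current_window:
            yield current_window * window_s, dict(counts)
            counts = Counter()
        current_window = window
        counts[device_id] += 1
    # Flush the last window when a bounded input ends.
    if current_window is not None:
        yield current_window * window_s, dict(counts)


# Hypothetical sample events: (timestamp in seconds, device id).
sample_events = [
    (0.1, "sensor-a"),
    (0.4, "sensor-a"),
    (0.9, "sensor-b"),
    (1.2, "sensor-a"),
    (2.7, "sensor-b"),
]

for window_start, per_device in tumbling_window_counts(sample_events):
    print(window_start, per_device)
```

A production stream processor would also have to deal with late or out-of-order events and with distributing the work across machines, which this sketch ignores.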

Apache Flink. This open source platform is capable of distributed stream and batch data processing. Most data is produced as a stream of events: credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application are all generated as streams, and can be processed as unbounded or bounded streams. Apache Flink excels at processing both unbounded and bounded data sets.

Apache Samza. This is an open source distributed stream processing framework. It uses Apache Kafka for messaging and Apache Hadoop YARN to provide fault tolerance, security and resource management. Samza is built to handle large amounts of state.

Apache Spark. This open source platform is for large scale data processing. For cluster management, Spark supports Hadoop YARN or Apache Mesos. For distributed storage, it can use Hadoop Distributed File System (HDFS), MapR file system, Cassandra, Kudu, etc. It also supports a pseudo-distributed local mode.

Apache Storm. This distributed real-time computing system is fast, and easy to set up and operate. It is scalable and fault-tolerant.
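
The distinction between bounded and unbounded streams mentioned for Apache Flink can be pictured with a small plain-Python sketch. This is not Flink code: it only shows that an incremental computation, here a running mean, can be written once and applied both to a finite list and to an endless source. The function names and the readings are assumptions made for illustration.

```python
import itertools
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all readings seen so far, one value per reading."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update: no need to store earlier readings.
        mean += (value - mean) / count
        yield mean


def endless_sensor() -> Iterator[float]:
    """Hypothetical unbounded source: repeats three readings forever."""
    return itertools.cycle([20.0, 22.0, 24.0])


# Bounded stream: a finite batch that can be processed to completion.
bounded_result = list(running_mean([20.0, 22.0, 24.0]))

# Unbounded stream: never ends, so only a prefix is taken here.
unbounded_result = list(itertools.islice(running_mean(endless_sensor()), 6))

print(bounded_result)
print(unbounded_result)
```

Because the computation keeps only a count and a mean, its memory use stays constant however long the stream runs, which is the property that makes it usable on unbounded input.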

Spatial analytics

Spatial data is generated in a multitude of ways, and IoT applications have increased its volume exponentially. Sources include satellite measurements and imaging, sensors in IoT applications, specialised handheld devices, Wi-Fi, and so on. This growth has made spatial analytics a necessity. Spatial analytics is the logical processing of such data using its topological or geographical properties.

A few popular open source tools for spatial analytics are listed below.
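
A simple example of processing data by its geographical properties is finding which devices lie within a given distance of a point. The sketch below uses the haversine formula for great-circle distance in plain Python. The device names, coordinates and radius are hypothetical, and the Earth is treated as a sphere with a mean radius of 6371 km.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def devices_within(
    devices: Dict[str, Tuple[float, float]],
    centre: Tuple[float, float],
    radius_km: float,
) -> List[str]:
    """Return the names of devices no further than radius_km from centre."""
    lat0, lon0 = centre
    return [
        name
        for name, (lat, lon) in devices.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    ]


# Hypothetical device positions as (latitude, longitude) in degrees.
sample_devices = {"near": (0.0, 0.05), "far": (0.0, 1.0)}

print(devices_within(sample_devices, centre=(0.0, 0.0), radius_km=10.0))
```

This checks every device in turn, which is fine for small sets. Tools built for large-scale spatial data typically rely on spatial indexes and partitioning instead.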

GIS tools for Hadoop. This is a collection of GIS tools for spatial analytics of Big Data. It is divided into two main parts: Spatial Framework for Hadoop and Geoprocessing Tools for Hadoop. Spatial Framework for Hadoop contains Java helper utilities for Hadoop developers and spatial user-defined functions for Hive. Geoprocessing Tools for Hadoop work with ArcGIS Desktop or Server. ArcGIS supports advanced analysis, data visualisation, and authoritative data maintenance in both 2D and 3D.

Geoplot. This is a high-level geospatial data visualisation library for Python. It is an extension of cartopy and matplotlib that makes mapping easy. It has three main features: a high-level plotting API, native projection support, and compatibility with matplotlib.

GeoSpark. This is a cluster computing system for processing large-scale spatial data. It extends Apache Spark and SparkSQL with a set of Spatial Resilient Distributed Datasets (SRDDs) and SpatialSQL.

Googleway. This is an R package for accessing and plotting Google Maps. Googleway provides access to Google Maps APIs, and the ability to plot an interactive Google Map overlaid with various layers and shapes.

Time series analytics

Time series data analytics is common in IoT applications such as health monitoring systems, weather forecasting systems, geological applications, process manufacturing, and other industrial settings.

Several popular open source tools for time series analytics are listed below.

InfluxDB. This open source time series platform includes several APIs for a variety of data analytics tasks such as storing and querying data, ETL processes, dashboards, and visualisations. It is very helpful for real-time analytics.

TimescaleDB. This tool is open source and optimised for complex queries in real-time situations. It performs automatic partitioning across time and space. Its hypertables look like regular tables, but each one is an abstraction over many individual tables that hold the actual data.

TDengine. This open source Big Data platform is designed and optimised for the Internet of Things. It provides several functionalities such as caching, stream computing, and message queuing.

Nightingale. This open source, distributed, high-performance monitoring system is suitable for IoT applications.
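
The idea of partitioning across time and space, mentioned above for TimescaleDB, can be illustrated with a toy in-memory model. The sketch below is not TimescaleDB's implementation or API. It assumes that "time" means a fixed-width time bucket and "space" means a hash of the device ID, and every name and value in it is hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[float, str, float]  # (timestamp in seconds, device id, value)
ChunkKey = Tuple[int, int]      # (time bucket, space bucket)


def chunk_key(
    timestamp_s: float,
    device_id: str,
    chunk_seconds: int = 3600,
    space_partitions: int = 4,
) -> ChunkKey:
    """Map a row to a chunk by time interval and by a hash of the device id."""
    time_bucket = int(timestamp_s // chunk_seconds)
    # crc32 is used because it gives the same result on every run.
    space_bucket = zlib.crc32(device_id.encode("utf-8")) % space_partitions
    return time_bucket, space_bucket


def partition_rows(
    rows: List[Row], chunk_seconds: int = 3600, space_partitions: int = 4
) -> Dict[ChunkKey, List[Row]]:
    """Group rows into chunks; callers still see one logical collection."""
    chunks: Dict[ChunkKey, List[Row]] = defaultdict(list)
    for row in rows:
        timestamp_s, device_id, _value = row
        key = chunk_key(timestamp_s, device_id, chunk_seconds, space_partitions)
        chunks[key].append(row)
    return dict(chunks)


def query_time_range(
    chunks: Dict[ChunkKey, List[Row]],
    start_s: float,
    end_s: float,
    chunk_seconds: int = 3600,
) -> List[Row]:
    """Return rows with start_s <= timestamp < end_s, skipping unrelated chunks."""
    first_bucket = int(start_s // chunk_seconds)
    last_bucket = int(end_s // chunk_seconds)
    matches: List[Row] = []
    for (time_bucket, _space_bucket), rows in chunks.items():
        if first_bucket <= time_bucket <= last_bucket:
            matches.extend(r for r in rows if start_s <= r[0] < end_s)
    return sorted(matches)


# Hypothetical sensor rows.
sample_rows = [
    (10, "dev-1", 21.5),
    (20, "dev-1", 21.7),
    (3700, "dev-1", 22.0),
    (7300, "dev-2", 19.0),
]

chunks = partition_rows(sample_rows)
print(query_time_range(chunks, 0, 3600))
```

The benefit this model tries to show is that a query over a short time range only needs to touch the chunks covering that range, rather than scanning all the data.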

Analytics is crucial in IoT today, as billions of connected devices generate an enormous amount of data every day. In this article, we have described three main types of analytics for IoT applications: streaming, spatial and time series. For each category, we have listed several popular open source tools that support effective analytics.

This article was first published in the December 2020 issue of Open Source For You.

Dr Dharmendra Patel and Dr Atul Patel are associated with the Smt. Chandaben Mohanbhai Patel Institute of Computer Applications, Charusat, Gujarat. Their areas of interest are data mining, data science, artificial intelligence, deep learning, and image processing.