Samarjeet Borah Valentina Emilia Balas Zdzislaw Polkowski ...

Advances in Data Science and Management

Samarjeet Borah · Valentina Emilia Balas · Zdzislaw Polkowski, Editors

Proceedings of ICDSM 2019

Lecture Notes on Data Engineering and Communications Technologies 37


Series Editor

Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain

The aim of the book series is to present cutting edge engineering approaches to data technologies and communications. It will publish latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems.

The series will have a prominent applied focus on data technologies and communications with aim to promote the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation.

** Indexing: The books of this series are submitted to SCOPUS, ISI Proceedings, MetaPress, Springerlink and DBLP **

More information about this series at http://www.springer.com/series/15362



Editors

Samarjeet Borah
Department of Computer Application
SMIT, Sikkim Manipal University
Sikkim, India

Valentina Emilia Balas
Department of Automatics and Applied Software at the Faculty of Engineering
Aurel Vlaicu University of Arad
Arad, Romania

Zdzislaw Polkowski
Faculty of Technical Sciences
Jan Wyzykowski University
Polkowice, Poland

ISSN 2367-4512
ISSN 2367-4520 (electronic)
Lecture Notes on Data Engineering and Communications Technologies
ISBN 978-981-15-0977-3
ISBN 978-981-15-0978-0 (eBook)
https://doi.org/10.1007/978-981-15-0978-0

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore


Preface

Data science is the discipline which involves the study of information sources, representations, processing and finally conversion into valuable resources. These resources are an integral part of the development of business and other decision-making strategies. This involves both structured and unstructured data. Mining is performed on such huge amounts of data in order to enhance business performance, increase efficiency, reduce cost, discover new market prospects and gain competitive advantage. Data science can be called a hybrid discipline involving mathematics, statistics and computer science. Specifically, from the computer science domain, it involves machine learning techniques, data mining and visualization or representation.

Knowingly or unknowingly, organizations have been using data science for decades for their business gain. In addition to this, all scientific and research laboratories are generating and handling huge amounts of data. Earlier, the analysis was done mostly manually with tremendous human effort; it is now taken care of by data scientists. These kinds of analyses directly help the productivity of any enterprise. The production and marketing processes of any organization primarily benefit from such studies. For example, film production units can mine user interests to determine what shows or films to produce; shipment companies can use data science to discover the best delivery routes and times. They can also use it to determine the best modes of transport for their shipments.

This book, Advances in Data Science and Management, represents an intellectual forum for research on computational approaches in data science and analysis. It includes research works and findings of various researchers from industry, academia and research organizations which involve data science directly or indirectly. The works cover a wide spectrum, ranging from theoretical analysis to practical implications. Some of the works report impressive results, which may lead to competitive product development.

In total, 59 papers are included in the volume from diverse fields of applications of data science. These are mostly from the domains of knowledge discovery, application of data in cloud, deep learning, application of data mining methods, big data and applications, data intelligence in IoT, data-driven approximation, etc. All the contributions are grouped into five different sections.

The first section includes contributions from the domain of Data Mining and Machine Learning Applications. It is enriched with research and review works from various researchers, which include an analytical approach of NoSQL, a cost evaluation framework for fault prediction technique in testing, a human activity recognition approach, a work on educational data mining, an analysis of data lineage in Apache Spark, etc. A discussion on artificial intelligence and common sense is useful and thought-provoking for future application development. Another work, on the prediction of gender from documents, is likewise suitable for future application development. This section also includes work on neural networks.

Data science is also used extensively in the cloud environment. The second section of this volume, Data Analysis in Cloud Environment, contains research contributions from this domain. A few of these contributions are performance analysis of queries associated with cloud databases, a cloud-based weather monitoring system, image privacy preservation in the cloud, cloud-based cryptography, fog computing for business, etc.

The third section includes contributions based on Data Analysis in Network and Communication Systems. Data management is a crucial issue in maintaining network topology, energy efficiency, quality of service, etc. Most of the works included are simulation-based.

The fourth section of the volume covers Image and Binary Data Analysis. It includes a number of scholarly works. Some of those are a study on image mining by multiple features, feature selection for liver cancer prediction, prediction of breast cancer, etc. It also includes a deep learning-based technique for violence detection using video surveillance systems.

The fifth section is Modelling and Simulation Based Data Analysis, which includes papers that mostly use data science in various scientific experiments. The experiments range from tool design of an injection-moulded part to investigation of graphene nanoparticles. It is observed that most tool/equipment development processes start with computer simulation, which involves extensive data processing. Machine learning techniques are becoming a great help in such developments and simplify the task considerably.

This volume was initially planned to provide a platform for budding researchers to showcase their research work in the field of data science. An attempt has been made to cover diversified research domains which involve data science and analytics. It is expected that readers will benefit from the contributions included in the book. The editors are thankful to all the authors, reviewers and editorial board members for making this effort successful. Special thanks go to the publisher, Springer, for all-round assistance and for transforming this dream into reality.

Sikkim, India        Samarjeet Borah
Arad, Romania        Valentina Emilia Balas
Polkowice, Poland    Zdzislaw Polkowski


Contents

Data Mining and Machine Learning Applications

Capturing Anomalies of Cassandra Performance with Increase in Data Volume: A NoSQL Analytical Approach
Ramesh Dharavath, Anand Kumar and Vinod Kumar Dharavath

Cost Evaluation Framework for Fault Prediction Technique in Testing
Aishwaryarani Behera, Shrayas Das and Dr. Abhishek Ray

t-SNE Manifold Learning Based Visualization: A Human Activity Recognition Approach
Ramesh Dharavath, G. Madhukar Rao, Himanshu Khurana and Damodar Reddy Edla

Augmentation of Behavioral Analysis Framework for E-Commerce Customers Using MLP-Based ANN
Kailash Hambarde, Gökhan Silahtaroğlu, Santosh Khamitkar, Parag Bhalchandra, Husen Shaikh, Pritam Tamsekar and Govind Kulkarni

Educational Data Mining: A Study on Socioeconomic Indicators in Education in INEP Database
Aurea T. B. Santos, Jonatã Paulino, Marcelino S. Silva and Liviane Rego

Feature Selection and Clustering of Documents Using Random Feature Set Generation Technique
A. Christy and G. Meera Gandhi

A Hybrid Time Series Forecasting Method Based on Supervised Machine Learning Program
Ganesh Prasad Khuntia, Ritesh Dash, Sarat Chandra Swain and Prashant Bawaney

Analyzing Performance Indices with Query Terms Associated with Data Servers
Mishra Anil Kumar, Mishra Ashis Kumar, Mohapatra Yogomaya and Mishra Sambit Kumar

Author Profiling: Predicting Gender from Document
Sunakshi Mamgain, Rakesh C. Balabantaray, Ajit K. Das and Srikant Kumar

Use of Data Analytics for Effective E-Governance: A Case Study of “EMutation” System of Odisha
Pabitrananda Patnaik, Subhashree Pattnaik and Pratibha Singh

A Meta-Analysis of Impact of ERP Implementation
Rajendra Kumar Behera and Sunil Kumar Dhal

A Symmetrical Encryption Technique for Text Encryption Using Randomized Matrix Based Key Generation
Pratik Gajanan Mante, Harsh Rajendra Oswal, Debabrata Swain and Deepali Deshpande

CNN-BD: An Approach for Disease Classification and Visualization
G. Madhukar Rao, T. Ravi Kumar and A. Rajashekar Reddy

A Generalized Partial Canonical Correlation Model to Measure Contribution of Individual Drug Features Toward Side Effects Prediction
Rakesh Kanji and Ganesh Bagler

A Framework of Dimensionality Reduction Utilizing PCA for Neural Network Prediction
G. Ravi Kumar, K. Nagamani and G. Anjan Babu

Air Quality Through IoT and Big Data Analytics
M. Sree Devi and Vempalli Rahamathulla

Artificial Intelligence and Common Sense: The Shady Future of AI
Sreenivas Sremath Tirumala

Optimization of the Random Forest Algorithm
Niva Mohapatra, K. Shreya and Ayes Chinmay

Expectation of Radar Returns from Ionosphere Using Decision Tree Technique
G. Ameer Basha, K. Lakshmana Gupta and K. Ramakrishna

A Statistical Analysis of Lazy Classifiers Using Canadian Institute of Cybersecurity Datasets
Ranjit Panigrahi and Samarjeet Borah

Data Analysis in Cloud Environment

Strategies and Performance Analysis of Queries Associated with Cloud Database
Mishra Jyoti Prakash, Prasad Suman Sourav and Mishra Sambit Kumar

Cloud-Based Weather Monitoring System
Sai Yeshwanth Chaganti, Ipseeta Nanda and Koteswara Rao Pandi

Security Model for Preserving Privacy of Image in Cloud
Prasanta Kumar Mahapatra, Alok Ranjan Tripathy, Alakananda Tripathy and Biraja Mishra

Scheduling Virtual Data Along with Data Servers: Case Study
Patra Prakash Chandra, Mohanty Anita and Mishra Sambit Kumar

Investigating Various Cryptographic Techniques Used in Cloud Computing
Ananya Srivastava, Aboli Khare, Priyaranjan Satapathy and Ayes Chinmay

Homomorphic Encryption: Review and Applications
Ratnakumari Challa

Storing “OPTIK” Data in Cloud: Photonics for Embedded Application
K. P. Swain, S. K. Nayak, G. Palai and Partha Sarkar

Establishing of Bio-Info in Cloud
Subha Ranjan Das, K. P. Swain, G. Palai and Sudhakar Sahu

A Framework of Fog Computing for Business
Zdzislaw Polkowski, Malgorzata Nycz-Lukaszewska and Wojciech Grzelak

Infrastructure as a Service: Cloud Computing Model for Micro Companies
Zdzislaw Polkowski and Malgorzata Nycz-Lukaszewska

Data Analysis in Network and Communication Systems

Leafycube: A Novel Hypercube Derivative for Parallel Systems
N. Adhikari and A. Singh

A Novel Approach on the Role of Femto Small Cells for Effective Energy Consumption in 5G Networks
B. Reddaiah, K. Srinivasa Rao and B. Susheel Kumar

Implementation of Multipath-Based Multicast Routing Protocol in Hierarchical Wireless Sensor Network
Seli Mohapatra, B. K. Ratha and Kshirabdhi Tanaya Dhal

PDF Analysis of Different Channel Models in FSO
Chinmayee Panda, K. Pitambar Patra, Asutosh Padhy and Urmila Bhanja

A Novel Study on the Role of Cell Zooming for Energy Efficiency and Quality of Service in 5G Technologies
B. Reddaiah, K. Srinivasa Rao and B. Susheel Kumar

Image and Binary Data Analysis

Image Mining by Multiple Features
Varsha Patil, Rajesh Kadu and Tanuja Sarode

Prediction of Artificial Water Recharge Sites Using Fusion of RS, GIS, AHP and GA Technologies
Shaikh Husen, Santosh Khamitkar, Parag Bhalchandra, Preetam Tamsekar, Govind Kulkarni and Kailas Hambarde

Lesion Localization and Extreme Gradient Boosting Characterization with Brain Tumor MRI Images
P. M. Siva Raja and K. Ramanan

Video Surveillance for Violence Detection Using Deep Learning
Manan Sharma and Rishabh Baghel

Identification of Crop Health Condition Using IoT Based Automated System
T. Giri Babu and G. Anjan Babu

Prediction of Malignant and Benign Breast Cancer: A Data Mining Approach in Healthcare Applications
Vivek Kumar, Brojo Kishore Mishra, Manuel Mazzara, Dang N. H. Thanh and Abhishek Verma

A Deep Neural Network on Object Recognition Framework for Submerged Fish Images
Sushma Pulidindi, K. Kamakshaiah and Sagar Yeruva

Modelling and Simulation Based Data Analysis

Characterization and Optimization of Tool Design of an Injection Molded Part Through Mold-Flow Analysis
Sudhanshu Bhushan Panda, Narayan Chandra Nayak and Antaryami Mishra

Analysis and Design of Ethernet to HDMI Gateway Using Xilinx Vivado
Ipseeta Nanda and Nibedita Adhikari

Energy Efficient and Multicast Based Greedy Routing for Proactive and Reactive Routing Protocols
Seli Mohapatra, Pujapushpanjali Mohanty and B. K. Ratha

Evolution of Optical Storage in Computer Memory
Pragya Rai, Ankesh Prasad, S. Maneesha Reddy and Ayes Chinmay

Dipole Antenna Frequency Shift Using RT Duroid Case
Jyotisankar Kalia

Realization of Monochromatic Filter in Visible Range: An Application to Optical Embedded System
K. P. Swain, S. Behera and G. Palai

Zero Loss 90° Waveguide: A Futuristic Photonic Components to Unravel Bending Loss Issues
P. K. Dalai, G. Palai and P. Sarkar

Stress Effect on a-SiCN:H Waveguide at Terahertz Frequency for Sensing Application Using FDTD Technique
Chandra Sekhar Mishra and Gopinath Palai

Investigation of Graphene Nanoparticles in a Nanocomposite Film via Photonic Crystal Fiber Through Regression Analysis
S. K. Mohanty, U. Bhanja, C. S. Mishra and G. Palai

Design a 4-Bit Carry Look-Ahead Adder Using Pass Transistor for Less Power Consumption and Maximization of Speed
Burhan Khan and Suraj Pattanaik

Design and Analysis of Hybrid Antenna for Next-Generation Electronics Applications
Suraj Pattanaik and Jyoti Ranjan Das

Development of Compact Microwave CPW Band-Stop Filter Based on Sub-wavelength Metamaterial Filter
N. Laxmi Narayan Reddy and Bibekananda Panda

Author Index

About the Editors

Samarjeet Borah is currently working as Professor in the Department of Computer Applications, Sikkim Manipal University (SMU), Sikkim, India. Dr. Borah handles various academics, research and administrative activities. He is also involved in curriculum development activities, board of studies, doctoral research committee, IT infrastructure management, etc., along with various administrative activities under SMU. Dr. Borah is involved with three funded projects in the capacity of Principal Investigator/Co-Principal Investigator. The projects are sponsored by AICTE (Government of India), DST-CSRI (Government of India) and Dr. TMA Pai Endowment Fund. Dr. Borah is involved with various journals of repute and book volumes as Editor/Guest Editor.

Valentina Emilia Balas is currently Full Professor in the Department of Automatics and Applied Software at the Faculty of Engineering, “Aurel Vlaicu” University of Arad, Romania. She is the author of more than 250 research papers in refereed journals and international conferences. She is Editor-in-Chief of International Journal of Advanced Intelligence Paradigms (IJAIP) and International Journal of Computational Systems Engineering (IJCSysE), Editorial Board Member of several national and international journals and an evaluator expert for national and international projects. She is Director of Intelligent Systems Research Centre in Aurel Vlaicu University of Arad. She served as General Chair of the International Workshop Soft Computing and Applications (SOFA) in seven editions 2005–2016 held in Romania and Hungary. She is Member of EUSFLAT and SIAM and Senior Member of IEEE, Member in TC - Fuzzy Systems (IEEE CIS), Member in TC - Emergent Technologies (IEEE CIS) and Member in TC - Soft Computing (IEEE SMCS). She was Vice-President (Awards) of IFSA International Fuzzy Systems Association Council (2013–2015) and is Joint Secretary of Forum for Interdisciplinary Mathematics (FIM).

Zdzislaw Polkowski is working as Adjunct Professor in the Faculty of Technical Sciences, the Jan Wyzykowski University, Poland. He is also the Rector’s Representative for International Cooperation and Erasmus+ Programme and Former Dean of the Technical Sciences Faculty during the period 2009–2012. He holds a Ph.D. degree in Computer Science and Management from Wroclaw University of Technology, postgraduate degree in Microcomputer Systems in Management from University of Economics in Wroclaw and postgraduate degree in IT in Education from Economics University in Katowice. He obtained his engineering degree in Industrial Computer Systems from Technical University of Zielona Gora. He has published more than 55 papers in journals, 15 conference proceedings, including more than 8 papers in journals indexed in the Web of Science. He served as Member of Technical Programme Committee in many International conferences in Poland, India, China, Iran, Romania and Bulgaria.


Data Mining and Machine Learning Applications

Capturing Anomalies of Cassandra Performance with Increase in Data Volume: A NoSQL Analytical Approach

Ramesh Dharavath, Anand Kumar and Vinod Kumar Dharavath

R. Dharavath (B) · A. Kumar · V. K. Dharavath
Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad 826004, India
e-mail: [email protected]

A. Kumar
e-mail: [email protected]

V. K. Dharavath
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
S. Borah et al. (eds.), Advances in Data Science and Management, Lecture Notes on Data Engineering and Communications Technologies 37, https://doi.org/10.1007/978-981-15-0978-0_1

Abstract NoSQL database technology has been around since the early 1990s, but it was the exponential growth of the internet and the rise of web applications that led to a dynamic surge in the popularity of NoSQL databases. The BigTable research by Google (2006) and the Dynamo research by Amazon (2007) paved the way for databases which could develop with agility and operate at any scale. Cassandra and MongoDB have emerged as the two most widely used NoSQL databases, and hence either of the two is preferred depending on the data problem the user is attempting to solve. This paper describes the underlying principles of, as well as the differences between, the two databases. We focus on showing the anomaly in the performance of Cassandra as the data volume increases, and at the same time we compare its performance with that of MongoDB. We establish how important a factor data volume is in choosing either database for an application. Extensive experiments have been carried out to scale the performance in terms of anomaly similarities, and the future scope is pinpointed.

Keywords NoSQL database · Sharding · Consistency · Indexing · Replication

1 Introduction

Almost all domains of technology witnessed growth in the use of databases in the last few decades of the twentieth century. In the early days, direct interaction of users with databases was quite limited, but direct access by users surged rapidly in the late 1990s owing to the internet revolution of that time. Convenient and efficient retrieval of stored information has always been the primary objective of a database system. A collection of interrelated data, combined with a programming methodology to make the data accessible, forms a database management system (DBMS). The initial days of databases were dominated by the flat-file, hierarchical and network data models. But, with growing internet applications, all of them fell short in satisfying the storage needs of users. Thus, the relational data model, introduced by E. F. Codd in 1970 and widely adopted in the 1980s, proved to be a huge step ahead. A large number of mathematical operations, like Cartesian product, difference, union and intersection, are supported by these databases. Data independence is easily achieved because of the normalized structure. Furthermore, relational databases implement security control effectively by applying authorization control on all attributes of a table that are considered sensitive [1]. Above all, the atomicity, consistency, isolation and durability (ACID) properties lie at the core of every relational database, ensuring consistent and dependable processing of all transactions.

The need to handle the rapid growth of unstructured data brought the concept of non-relational databases into the picture [2]. The flexible schema provided by these databases rules out the need for referential integrity [3] as well as join operations [4] between objects, which were core features of relational databases. As opposed to the ACID properties of relational databases, they support basically available, soft state, eventual consistency (BASE) features. Based on these principles, BigTable, HBase and Cassandra were the initial databases developed by Google, Yahoo and Facebook, respectively. Non-relational databases are further divided into four broad categories: key-value databases (e.g. Redis), document databases (e.g. MongoDB), column-oriented databases (e.g. HBase) and graph databases (e.g. MarkLogic).

The key-value databases are used with applications which need to perform non-transactional operations and deal with data at high velocity. These databases are ineffective at executing transactions. In document databases, indexing is an important feature which is applied to frequently queried attributes to provide high-speed retrieval even without knowing the key. Wide-column databases are, among all NoSQL databases, the most similar to relational databases. All columns holding related data are grouped together in wide-column databases. Unlike other databases, which use expensive JOIN operations to determine relationships among entities at query time, graph databases store predetermined relationships.

Ever since the advent of NoSQL databases, there have been many attempts to compare the performance of SQL databases with that of NoSQL databases, or of the various NoSQL databases with one another [5]. The first category of work revolves around using YCSB (Yahoo! Cloud Serving Benchmark) to analyze the performance of NoSQL databases and MySQL by relating latency or throughput to the number of queries executed [6]. Latency certainly plays a big role in large clusters, but it is not the focus of this work. The second category is elaborated in the paper ‘NoSQL Databases: MongoDB vs Cassandra’ [7], which compares the performance of MongoDB and Cassandra by conducting an experiment on a single server. A single-server model may be well suited to MongoDB, but it totally ignores the basic features of Cassandra which made Cassandra famous in the first place.

We try to take the middle path by using three servers on two different hosts (i.e. workstations), thus keeping latency close to zero while still using the distributed nature of Cassandra. As mentioned earlier, we also consider data volume a very important factor which must be taken into account when deciding which database to use for a particular data problem. Furthermore, we are more interested in demonstrating the anomaly in the performance of Cassandra as the volume of data increases, and hence the performance comparison with MongoDB is just a side result of our work. Our aim is to show how Cassandra's performance improves as data volume increases, in total contrast to how other databases' performance changes with increasing data volume. In addition to relating query time to the number of records, we have also plotted the number of queries performed per second to get better insight into the performance of Cassandra alone as well as relative to that of MongoDB. In conclusion, we summarize the performance of both MongoDB and Cassandra using a 3D surface.

The rest of the paper is organized as follows: Sect. 2 discusses the underlying principles of Cassandra in detail; Sect. 3 presents the features of MongoDB; Sect. 4 lists the basic differences between Cassandra and MongoDB; Sect. 5 describes the setup required for the experimental analysis; Sect. 6 presents the results obtained from the experiment; Sect. 7 draws conclusions from the data obtained and discusses the future scope.

2 Cassandra

Designed to run on cheap commodity hardware, Cassandra is an open source, decentralized and distributed storage system, originally developed at Facebook and now maintained by the Apache Software Foundation. Cassandra has a masterless [8] architecture in which every server is equal, so any node can accept write or read queries from clients. The key features of Cassandra are:

Distributed: Cassandra distributes data across different nodes in the cluster by using a hash function (i.e. partitioner). The partitioner computes a token corresponding to each partition key and, depending on the partition strategy, each node is responsible for a range of tokens [9].
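The token-based placement described above can be illustrated with a minimal sketch. It is not Cassandra's implementation: the `TokenRing` class, the node names and the token values are hypothetical, and MD5 is used only as a stand-in hash for whatever partitioner a cluster is configured with.

```python
import bisect
import hashlib

def token_for(partition_key):
    """Map a partition key to a 64-bit token (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class TokenRing:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_tokens):
        # Sort nodes by token so the owner can be found by binary search.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def owner(self, partition_key):
        tok = token_for(partition_key)
        # First node whose token is >= the key's token; wrap around at the end.
        idx = bisect.bisect_left(self._tokens, tok) % len(self._ring)
        return self._ring[idx][1]

# Assumed three-node cluster with hand-picked tokens.
ring = TokenRing({"node-a": 2**62, "node-b": 2**63, "node-c": 2**64 - 1})
print(ring.owner("user:42"))
```

Because the owner depends only on the hash of the partition key, any node can compute where a row lives without consulting a master.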

Decentralized: Cassandra has no single point of failure and hence provides continuous availability.

Replicated: Cassandra stores multiple copies of the same data on multiple nodes in a single datacenter (SimpleStrategy) or across multiple datacenters (NetworkTopologyStrategy) to ensure data availability and fault tolerance. The replication factor determines the number of copies of the same data that will be distributed across nodes.
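As a rough illustration of how a replication factor translates into replica placement within one datacenter, the sketch below places the first copy on the owning node and the remaining copies on the next nodes clockwise around the ring. The function name and node list are hypothetical, and rack or datacenter awareness is deliberately ignored.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the nodes holding a partition: the primary plus the next nodes clockwise."""
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [ring_nodes[(primary_index + i) % len(ring_nodes)]
            for i in range(replication_factor)]

# Assumed four-node ring, listed in token order.
nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas = simple_strategy_replicas(nodes, primary_index=3, replication_factor=3)
print(replicas)  # ['node-d', 'node-a', 'node-b']
```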


Scalable: Cassandra fulfils the requirements of an ideal horizontally scalable system by allowing for efficient addition of nodes. Apple Inc. has deployed the largest Cassandra cluster, with more than 75,000 nodes storing over 10 PB of data.

Fault-Tolerant: During a network partition, Cassandra stores writes destined for unreachable nodes until those nodes come back (hinted handoff). When a node recovers, the changes on both sides are replayed and the last write per column wins [10].
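The "last write per column wins" rule can be sketched as a timestamp comparison per column. This is an illustrative model only: the row layout (column mapped to a timestamp and value pair), the function name and the sample data are assumptions, not Cassandra's storage format.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Merge two copies of one row; each maps column -> (timestamp, value)."""
    merged = dict(replica_a)
    for column, (ts, value) in replica_b.items():
        # Keep whichever side wrote this column most recently.
        if column not in merged or ts > merged[column][0]:
            merged[column] = (ts, value)
    return merged

# Two sides of a partition that accepted different writes to the same row.
side_a = {"email": (100, "old@example.org"), "name": (120, "Asha")}
side_b = {"email": (150, "new@example.org"), "name": (110, "A.")}
row = merge_last_write_wins(side_a, side_b)
print(row)
```

Note that the merge is per column, so the reconciled row can combine columns from both sides.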

Tunable Consistency Level: Of all the NoSQL databases, Cassandra provides the most tunable consistency options [11]. The consistency level can be set to ONE, QUORUM or ALL for both read and write operations.
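A short sketch of the arithmetic behind these levels may help. With replication factor RF, a quorum is floor(RF/2) + 1 replicas, and a read is guaranteed to overlap the latest acknowledged write whenever the read and write replica counts sum to more than RF. The helper names below are hypothetical.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")

def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when read and write replica sets must overlap (R + W > RF)."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor

print(read_sees_latest_write("QUORUM", "QUORUM", 3))  # True
print(read_sees_latest_write("ONE", "ONE", 3))        # False
```

Lower levels trade this overlap guarantee for lower latency and higher availability, which is what makes the setting tunable.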

High Availability: Cassandra is subject to the CAP theorem (Brewer's theorem), which states that a distributed system cannot guarantee both perfect consistency and 100% availability once network partitions must be tolerated [12]. The three properties named in the theorem are:

– Consistency: It implies that every node sees the same data at any given time.– Availability: Every read or write request must be greeted by a response.– Partition-Tolerant: System keeps functioning despite arbitrary partitions createdbecause of network failure.

Databases that ensure partition tolerancewith availability (AP) includeCassandra,CouchDB. Databases that ensure partition tolerance with consistency (CP) includeMongoDB, Redis, MemcacheDB.

Gossip: Cassandra uses Gossip [13], a peer-to-peer communication technique, for inter-node communication within the cluster. The information exchanged among the nodes includes status, tokens, schema version, IP addresses, data size and so on. This process runs periodically and helps detect node failure in the cluster, since each node shares its state with three other nodes.

Cassandra writes: Cassandra writes to the Memtable and the Commit log in parallel. The purpose of the Commit log is to make it possible to recreate the Memtable after a node crashes or reboots. The Memtable is flushed to an SSTable when it gets full. SSTables are immutable, and hence changes to a column are written to a new SSTable rather than applied in place.
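The write path described above can be sketched with a toy in-memory model. It is a simplification under stated assumptions: the class `ToyWritePath`, the size limit counted in keys, and discarding the commit log immediately after a flush are illustrative choices, and the two appends happen one after the other rather than in parallel.

```python
class ToyWritePath:
    """Toy model of a commit log, a memtable and immutable SSTables (hypothetical)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record used only for crash recovery
        self.memtable = {}     # mutable in-memory table
        self.sstables = []     # flushed, immutable snapshots (tuples)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable tuple and start a fresh one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []   # flushed data no longer needs replaying

    def recover(self):
        # Rebuild the memtable after a crash by replaying the commit log in order.
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value


store = ToyWritePath(memtable_limit=3)
store.write("a", 1)
store.write("b", 2)
store.memtable = {}    # simulate losing the in-memory state in a crash
store.recover()
print(store.memtable)
```

Updating a key that already sits in an older SSTable does not modify that SSTable; the new value simply appears in a later one, which is why reads may need to consult several SSTables.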

3 MongoDB

MongoDB was developed in 2007 by Eliot Horowitz and Dwight Merriman. MongoDB is an open source NoSQL database which uses the concept of collections and documents to provide a highly consistent system with good throughput. These documents are encoded in BSON, which makes storage easy and is also lightweight, fast and traversable [14]. Aadhaar (India's unique identification) uses MongoDB as one of the databases to store biometric data of around 1.2 billion citizens. Other real-world users of MongoDB are Shutterfly, MetLife and eBay. The key features of MongoDB are described as follows:


Aggregation Framework: MongoDB provides an aggregation framework whose results are analogous to those obtained using the SQL GROUP BY clause [15]. It uses MapReduce to perform batch processing of data and aggregate operations [16].
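To illustrate the analogy with GROUP BY, the sketch below shows a pipeline with a single `$group` stage using the `$sum` accumulator, followed by a plain Python function that mimics what such a stage computes. The pipeline is shown only as data and is not sent to a server; the field names `city` and `amount`, the sample documents and the helper `group_sum` are assumptions made for illustration.

```python
from collections import defaultdict

# Roughly: SELECT city, SUM(amount) AS total FROM orders GROUP BY city
pipeline = [{"$group": {"_id": "$city", "total": {"$sum": "$amount"}}}]


def group_sum(documents, group_field, sum_field):
    """Pure Python stand-in for a $group stage with a $sum accumulator."""
    totals = defaultdict(int)
    for doc in documents:
        totals[doc.get(group_field)] += doc.get(sum_field, 0)
    return [{"_id": key, "total": total} for key, total in totals.items()]


# Hypothetical documents; note that they need not share the same set of fields.
orders = [
    {"city": "Arad", "amount": 10},
    {"city": "Gangtok", "amount": 5, "coupon": "X1"},
    {"city": "Arad", "amount": 7},
]
print(group_sum(orders, "city", "amount"))
```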

BSON Format: MongoDB uses the BSON [17] (Binary JSON [18]) format while storing documents in collections. BSON provides support for data types like date and binary, which are not originally supported by JSON.

Sharding: MongoDB distributes the data over multiple nodes by using sharding and hence provides high throughput on read and write operations [19]. Sharding enables MongoDB to provide horizontal scalability as opposed to the vertical scalability of relational databases [20, 21]. Each shard stores a subset of the data set and provides high availability and consistency.

Supports Ad hoc Queries: Contrary to Cassandra, MongoDB supports dynamic ad hoc queries and easily retrieves particular locations, fields, ranges and so on.

Schema Less: This feature makes MongoDB highly flexible as compared to traditional relational databases. The documents in MongoDB may contain different sets of attributes, and the type of each attribute may also vary.

Capped Collections: MongoDB supports high performance in executing read and write queries by making use of capped collections, which are fixed-size circular collections that strictly follow the insertion order.
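The circular, insertion-ordered behaviour of a capped collection can be mimicked with a bounded deque. This is only an analogy: the class `ToyCappedCollection` is hypothetical and is bounded by a document count, which is a simplifying assumption of this sketch.

```python
from collections import deque


class ToyCappedCollection:
    """Fixed-size buffer that keeps insertion order and overwrites the oldest entry."""

    def __init__(self, max_documents):
        # A deque with maxlen silently drops the oldest item when it is full.
        self._docs = deque(maxlen=max_documents)

    def insert(self, document):
        self._docs.append(document)

    def find(self):
        # Documents come back in insertion order, oldest surviving one first.
        return list(self._docs)


log = ToyCappedCollection(max_documents=3)
for i in range(5):
    log.insert({"event": i})
print(log.find())
```

Because old entries are overwritten rather than deleted explicitly, such a structure suits logs and caches where only the most recent entries matter.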

Indexing: MongoDB uses indexes to increase the efficiency of search operations. Primary and secondary indexes can be used to index any field in a MongoDB document [3, 12].

File Storage: MongoDB provides file system support for storing files over multiple nodes by using features such as data replication and load balancing. The Grid file system (GridFS) is provided with the MongoDB drivers.

Replication: MongoDB provides replication with a single primary node and several secondary nodes. Unlike the masterless architecture of Cassandra, MongoDB follows a master–slave replication strategy. The automatic failover property of MongoDB ensures that if the primary node goes down or crashes, one of the secondary nodes is elected as the next primary node.

4 Cassandra Versus MongoDB: A Comparison

This section compares the characteristics of Cassandra and MongoDB based on their execution strategies. The related characteristics and strategies are as follows:

Replication Strategy: MongoDB supports a 'Single Master' model, that is, one of the nodes acts as master and all the others are slaves. The election of a new master node in case the current master node goes down usually takes 10–40 s. During this period, no write operation can be performed on the database. Cassandra supports a 'Multiple Master' or 'Masterless' model, so the loss of a single node does not affect the ability of the cluster to take writes. Thus Cassandra ensures more scalability in writes as compared to MongoDB.

CAP Theorem: MongoDB prefers consistency over availability in case of a network failure, while Cassandra does the opposite. Thus Cassandra is well suited for real-time web applications which demand 100% uptime (high availability).

Development Language: MongoDB was developed in C++ whereas Cassandra was developed in Java. Cassandra supports programming languages like C#, C++, Go, Java, PHP and Python [22], and MongoDB supports languages like C, C#, C++, Haskell, Java, JavaScript, MatLab, PHP, Python and Scala.

Expressive Object Model: The object model supported by MongoDB is far more expressive and flexible than the object model of Cassandra. Objects in MongoDB can have properties and they can even be nested in one another. Cassandra stores data in a more structured fashion, using a traditional table system with rows and columns.

Secondary Indexes: The indexing feature of MongoDB makes its query model more flexible and thus it also supports ad hoc queries. Cassandra, on the other hand, offers restricted support for indexes and hence most of the querying has to be performed using the primary key.

Aggregation: MongoDB has a built-in aggregation framework on which an ETL [23] pipeline can be run to transform data within the database, whereas Cassandra does not have a built-in aggregation framework and hence uses external tools like Hadoop and Spark [24] for the purpose.

Query language: Cassandra supports Cassandra Query Language (CQL [25]), which is analogous to SQL (supported by the MySQL database), although with a few restrictions. MongoDB does not support an SQL-like query language.

5 Experimental Analysis

This section describes the experiments conducted on both databases. The tests were conducted on three database servers running on two physical machines (i.e. workstations). First, data for Cassandra were collected by running it on all three servers; then MongoDB was installed on all the servers to obtain its results.


5.1 Experimental Setup

Specifications of the workstations (same for both workstations):

Processor: Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30 GHz 3.30 GHz
Memory (RAM): 24.00 GB
System type: 64-bit Operating System, x64-based processor

Specifications of the virtual machine (same for all three):

Operating System: Ubuntu (64-bit)
Multi-core processor
Memory (RAM): 16.00 GB
Hard disk: 1 TB

Configuring the Cassandra cluster:

The configuration file of Cassandra, cassandra.yaml, located in the /etc/cassandra directory, is modified on all the servers to create a cluster:

cluster_name: Cassandra test (name of the cluster)
- seeds: 172.16.84.1, 172.16.84.3, 172.16.84.4 (IP addresses of all 3 nodes in the cluster)
listen_address: 172.16.184.1 (IP address of the node; listen_address will be 172.16.184.3 for the second node and 172.16.184.4 for the third)
rpc_address: 127.0.0.1
endpoint_snitch: SimpleSnitch (for networks in one datacenter)
auto_bootstrap: false (the configuration file does not have this directive, and hence it is added and set to false)

After modifying the cassandra.yaml files, we run the command sudo nodetool status to verify the creation of a Cassandra cluster having three nodes.

5.2 Configuring the MongoDB Cluster

We use a three-member replica set for MongoDB, with one primary node and the remaining two as secondary nodes. The MongoDB configuration file is modified as follows:

• # network interfaces
  port: 27017
  bindIp: 0.0.0.0 (this value of bindIp allows communication over any network interface)

• replication:
  replSetName: Mongo test

We issue commands to create a root admin account, initiate the replica set and add the slaves to the replica set. To get a better insight into the performance of both databases, we define our own five workloads:

(i) Workload A: Read only – This workload contains 100% read operations.
(ii) Workload B: Read heavy – This workload has 75% read queries and 25% write queries.
(iii) Workload C: Mixed – This workload has 50% read queries and 50% write queries.
(iv) Workload D: Write heavy – This workload has 25% read queries and 75% write queries.
(v) Workload E: Write only – This workload has 100% write queries.

For each workload, we run the test three times for each number of records and take the average of the time elapsed in executing the workload. Similarly, we take the average of the number of queries completed per second for each number of records.
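The measurement procedure can be sketched as a small harness. This is not the tool used in the experiments: the function names, the fixed random seed and the `do_read`/`do_write` callables (stand-ins for real database calls) are assumptions. The sketch builds an operation mix with the read fraction of each workload, runs it three times and averages both the elapsed time and the queries per second.

```python
import random
import statistics
import time

# Read fraction of each workload as defined above (A: read only ... E: write only).
WORKLOADS = {"A": 1.00, "B": 0.75, "C": 0.50, "D": 0.25, "E": 0.00}


def build_operations(workload, n_ops, seed=0):
    """Return a shuffled list of 'read'/'write' labels with the workload's mix."""
    n_reads = round(n_ops * WORKLOADS[workload])
    ops = ["read"] * n_reads + ["write"] * (n_ops - n_reads)
    random.Random(seed).shuffle(ops)
    return ops


def run_once(ops, do_read, do_write):
    """Execute every operation once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for op in ops:
        if op == "read":
            do_read()
        else:
            do_write()
    # Guard against a zero reading on very coarse timers.
    return max(time.perf_counter() - start, 1e-9)


def benchmark(workload, n_ops, do_read, do_write, repeats=3):
    """Average elapsed time and queries per second over `repeats` runs."""
    ops = build_operations(workload, n_ops)
    times = [run_once(ops, do_read, do_write) for _ in range(repeats)]
    mean_time = statistics.mean(times)
    mean_qps = statistics.mean(n_ops / t for t in times)
    return mean_time, mean_qps


# Placeholder callables; a real run would issue queries to the database instead.
mean_time, mean_qps = benchmark("B", 1000, lambda: None, lambda: None)
print(mean_time, mean_qps)
```

Keeping the operation list fixed across the three repetitions means that differences between runs reflect the system under test rather than a different random mix.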

6 Result Analysis

In this section, we evaluate the performance of Cassandra and MongoDB under each of the five workloads, as discussed below.

6.1 Workload A: Read Only (100% Read)

When we test Cassandra against the 100% read workload, the time consumed is, as expected, on the higher side. For a relatively small number of records, the elapsed time fluctuates by a slight margin, but after 750k records a sharp (17%) decrease in the time taken by Cassandra to execute the same workload was observed. Even when the number of records is increased to 2000k, the time taken remains well below what it was for smaller numbers of records. Queries executed per second remain almost the same, barring a slight increase as the records grow. MongoDB, on the other hand, takes little time to execute the workload as long as the number of records is less than 1000k. Beyond that, its elapsed time grows sharply and, although very late, Cassandra does overtake MongoDB at almost 2 million records. (Note: the red line is used for Cassandra and the green line for MongoDB.) The elapsed time and the queries executed per second for this workload are shown in Figs. 1 and 2.

Fig. 1 Time (s) versus number of records

Fig. 2 Queries executed per second versus number of records

6.2 Workload B: Read Heavy (75% Read and 25% Writes)

Compared to the 100% read workload, there is a considerable decrease in the time consumed by Cassandra this time. The sudden decrease in elapsed time for Cassandra (at around 1000k, compared to 750k in workload A) is less sharp than in the previous workload. The ability of Cassandra to decrease the time with an increase in data volume is again at the fore. The elapsed time and the queries executed per second for this workload are shown in Figs. 3 and 4.

Cassandra executes more queries per second for this workload than for workload A, and the curve shows an upward trend even as the data volume grows. In total contrast to Cassandra, MongoDB performs worse than in the previous workload and its performance keeps sinking as the number of records keeps increasing. Eventually, Cassandra overtakes MongoDB at around 1500k records, which is fewer than the number of records at which it crossed MongoDB in workload A.



6.3 Workload C: Mixed (50% Read and 50% Writes)

In a 50% read and 50% write workload, the difference between the performance of Cassandra and MongoDB is quite small even at smaller numbers of records. At around 1250k records, the performance of Cassandra improves sharply, although not as sharply as in the two previous workloads. The highest execution time for Cassandra is observed at 250k records and it is even greater than the execution time at 2000k records. However, a very healthy rise is observed in the number of queries executed by Cassandra per second. The performance of Cassandra is almost 45% better than its performance in workload A. The elapsed time and the queries executed per second for this workload are shown in Figs. 5 and 6. In the case of MongoDB, the slope of time versus number of records in the graph is highest at around 1000–1250k records, that is, the increase in time consumed is quite drastic in this workload. Cassandra crosses MongoDB's line at around 1000k records and the performance gap only widens with the increase in data volume.

6.4 Workload D: Write Heavy (25% Read and 75% Writes)

Cassandra beats MongoDB by a huge margin in this write heavy workload. The time consumed by Cassandra is one-fifth of the time it consumed in workload A for the same amount of data. As the number of records increases, Cassandra's performance gets better and better. The performance of Cassandra is 200% better at 2000k records as compared to its performance at 250k records. Thus, Cassandra's advantage at larger volumes of data keeps growing as the percentage of write operations increases. The elapsed time and the queries executed per second for this workload are shown in Figs. 7 and 8. The number of queries executed per second by MongoDB decreases to one-third as the number of records grows from 250k to 2000k. The time consumed by Cassandra is always less than that consumed by MongoDB, irrespective of the number of records.




6.5 Workload E: Write Only (100% Writes)

In the case of the 100% write workload, Cassandra totally outperforms its counterpart. The performance of Cassandra remains stable and is unaffected by the number of records, to the extent that its performance at around 2000k records is slightly better than its performance at 10k records. Cassandra executes more than 6k queries per second and the number grows to 15k as the data volume grows to 2000k records. The elapsed time and the queries executed per second for this workload are shown in Figs. 9 and 10.

Combining results of the five workloads:

Figure 11 clearly indicates a sharp decline in the time taken by Cassandra for a fixed workload type as the number of records increases. This decline is much more evident when the workload is read heavy, although the time taken in executing a write heavy workload is already very low for Cassandra.

For convenience, we name the point where this decline begins the 'break-through' point. The 'break-through' point shifts rightwards as the percentage of write queries increases. Interestingly, the time taken in executing the queries at 2000k records is always less than the time taken at 250k records, irrespective of workload type. Fig. 12 likewise indicates an improvement in the performance of Cassandra with an increase in the volume of data.
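One possible way to locate a 'break-through' point in a measured series is sketched below. No formula is given in the text, so the rule used here (the last record count before elapsed time falls by at least a chosen fraction) and the 10% threshold are assumptions, and the numbers in the example are synthetic placeholders rather than measurements from this experiment.

```python
def break_through_point(record_counts, elapsed_times, min_drop=0.10):
    """Return the record count after which elapsed time first falls by >= min_drop.

    Returns None when no such drop occurs in the series.
    """
    for i in range(len(elapsed_times) - 1):
        before, after = elapsed_times[i], elapsed_times[i + 1]
        if before > 0 and (before - after) / before >= min_drop:
            return record_counts[i]
    return None


# Synthetic placeholder series (record counts in thousands, times in seconds).
counts = [250, 500, 750, 1000, 1250]
times = [100.0, 102.0, 104.0, 86.0, 84.0]
print(break_through_point(counts, times))
```

Applying the same rule to each workload's series would make the rightward shift of the point directly comparable across workloads.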

As depicted in Figs. 13 and 14, the trends shown by MongoDB are all along expected lines, with high performance for the read heavy workload and poor performance for the write heavy workload. Contrary to Cassandra, MongoDB performs better at lower data volume; as the data volume increases, the time elapsed in executing the queries rises significantly (or, equivalently, the number of queries executed per second decreases).

Fig. 11 Summarizing Cassandra's performance using time elapsed values

Fig. 12 Summarizing Cassandra's performance using queries executed per second values

7 Conclusion and Future Scope

In this work, we evaluated the performance of Cassandra with increasing data volume and found some anomalies. While most other databases do not perform better at higher numbers of records, Cassandra's performance improves as the number of records increases. The tests conducted to verify the performance clearly underline the importance of data volume as a factor influencing the performance of two of the most widely used NoSQL databases. Cassandra's superiority over MongoDB was never in doubt for a write heavy workload. Even in a read heavy workload, owing to the contrasting ways in which the two databases respond to changes in data volume, Cassandra overtakes MongoDB sooner or later. For a read only workload, Cassandra overtakes MongoDB at almost 2000k records, but as the write percentage increases, it does so at a smaller number of records. In a scenario where both MongoDB and Cassandra are well suited for solving a data problem, data volume should always be taken into account while choosing the best database for the application. Future work can compare other NoSQL databases with MongoDB and Cassandra and find out whether there are anomalies in their performance as well.
