Data Research, Vol. 2, Issue 1, Feb 2018, Pages 43-53; DOI: 10.31058/j.data.2018.21004

Speedup Query Processing in Hadoop Using Mapreduce Framework

Chandra Shekhar Gautam 1, Akhilesh A. Waoo 2*

1 Rajiv Gandhi College of Computer Application and Technology, Satna, M.P., India

2 AKS University, Satna, M.P., India

Received: 28 December 2017; Accepted: 20 January 2018; Published: 19 March 2018

Abstract

The Internet was used by 3.2 billion people in 2015. According to a new report, nearly half of the global population will be using the Internet by the end of this year. Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply because of the immense volume of data that needs to be extracted, processed, and analyzed in a timely fashion. Possibly the most popular framework for current large-scale data analytics is MapReduce, mainly because of its salient features, which include scalability, fault tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This article reviews the state of the art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a classification of existing research is provided, focusing on the optimization objective. In conclusion, this article outlines interesting directions for future parallel data processing systems.

Keywords

Hadoop, MapReduce, Query Speedup, Performance

Copyright

© 2017 by the authors. Licensee International Technology and Science Press Limited. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
