National University of Kaohsiung Library and Information Center

Lin, Hao.

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.

Record Type:	Electronic resources : Monograph/item
Title/Author:	Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
Author:	Lin, Hao.
Published:	Ann Arbor : ProQuest Dissertations & Theses, 2018
Description:	109 p.
Notes:	Source: Dissertation Abstracts International, Volume: 80-01(E), Section: B.
Notes:	Adviser: Samuel P. Midkiff.
Contained By:	Dissertation Abstracts International 80-01B(E).
Subject:	Computer science.
Online resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10829520
ISBN:	9780438328501

Thesis (Ph.D.)--Purdue University, 2018.

Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and to problem sizes that fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as memory-footprint reduction, data pipelining and serialization, and operation merging improve runtime performance. We compare RABID to several other frameworks.

ISBN: 9780438328501

Subjects--Topical Terms:
199325 Computer science.

LDR:02711nmm a2200313 4500
001 547605
005 20190513114558.5
008 190715s2018 ||||||||||||||||| ||eng d
020 $a 9780438328501
035 $a (MiAaPQ)AAI10829520
035 $a (MiAaPQ)purdue:22893
035 $a AAI10829520
040 $a MiAaPQ $c MiAaPQ
100 1 $a Lin, Hao. $3 826947
245 1 0 $a Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 109 p.
500 $a Source: Dissertation Abstracts International, Volume: 80-01(E), Section: B.
500 $a Adviser: Samuel P. Midkiff.
502 $a Thesis (Ph.D.)--Purdue University, 2018.
520 $a Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and to problem sizes that fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as memory-footprint reduction, data pipelining and serialization, and operation merging improve runtime performance. We compare RABID to several other frameworks.
520 $a In the era of cloud computing, batch data-processing workloads such as RABID applications are targeted to run in VMs or containers in cloud-based data centers. Efficient scheduling of data center VMs can reduce the number of physical servers needed and, in turn, the energy and capital costs of maintaining the virtualized data center. We propose a data-driven approach to efficient, proactive VM scheduling. Our approach uses a multi-capacity bin-packing technique that efficiently places VMs onto physical servers. We use time-series analysis to extract not only low-frequency information about future VM workloads but also high-frequency information about VM workload correlations. This approach can also be implemented in RABID to leverage its high performance.
590 $a School code: 0183.
650 4 $a Computer science. $3 199325
650 4 $a Computer engineering. $3 212944
690 $a 0984
690 $a 0464
710 2 $a Purdue University. $b Electrical and Computer Engineering. $3 603335
773 0 $t Dissertation Abstracts International $g 80-01B(E).
790 $a 0183
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10829520