The remarkably low cost and improved quality of genome sequencing give researchers of genetic diseases, such as cancer, a powerful tool to better understand the underlying genetic mechanisms of those diseases and to treat them with effective targeted therapies. Thus, a number of projects today sequence the DNA of large patient populations, each of which produces at least hundreds of terabytes of data. The challenge now is to provide the produced data on demand to interested parties. However, there are scenarios where the data are too large to be analyzed in acceptable time by a single system, and these are the cases where MapReduce and a distributed file system shine.
Today there are two tiers of data access: a top tier and a bottom tier. The top tier involves downloading FASTQ, BAM, or VCF files from an archive such as the SRA or CGHUB; these files contain reads or variants from the sequencing of either a person or a population. On the other hand, the bottom tier
of subsets of the
Assuming that genomic data on the order of terabytes and petabytes reside in distributed environments, a more efficient alternative to both tiers of data access is a distributed data retrieval engine. We will therefore discuss Spark SQL, which is the distributed SQL execution engine of the Apache Spark framework.
1.1. Distributed system and distributed computing
A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages (https://onlinecourses.nptel.ac.in/). The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are concurrency of components, lack of a global clock, and independent failure of components. The following properties are commonly used to define a distributed system:
There are several autonomous computational entities (computers or nodes), each of which has its own local memory.
The entities communicate with each other by message passing.
The system has to tolerate failures in individual computers.
The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.
Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.
Fig. 1. (a), (b): distributed computing. (c): parallel computing.
Distributed computing is a field of computer science that studies distributed systems. In distributed computing, each processor has its own private memory (distributed memory) (Fig. 1). Information is exchanged by passing messages between the processors. In parallel computing, on the other hand, all processors may have access to a shared memory to exchange information.
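The message-passing model described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the original text: threads stand in for networked computers, `queue.Queue` objects stand in for network links, and the names `worker` and `distributed_sum` are hypothetical. Threads do technically share memory, so the sketch only imitates the distributed model by restricting each node to its own local variable and its queues.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A node with private state; it only sees what arrives in its inbox."""
    local_total = 0  # private memory: no other node reads or writes this
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work for this node
            break
        local_total += sum(message)
    outbox.put(local_total)  # the result leaves the node only as a message


def distributed_sum(chunks):
    """Send one chunk to each node and combine the partial sums they return."""
    results = queue.Queue()
    nodes = []
    for chunk in chunks:
        inbox = queue.Queue()
        thread = threading.Thread(target=worker, args=(inbox, results))
        thread.start()
        inbox.put(chunk)  # message carrying this node's part of the input
        inbox.put(None)   # tell the node that its input is complete
        nodes.append(thread)
    for thread in nodes:
        thread.join()
    return sum(results.get() for _ in nodes)


total = distributed_sum([[1, 2, 3], [4, 5], [6]])
print(total)  # 21
```

Each node knows only its own chunk, which mirrors the property that a computer in a distributed system may know only one part of the input. In a shared-memory (parallel) design, the workers would instead update a common variable protected by a lock.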
1.2. The big data challenge
Computation on big data is quite a challenge. Working with volumes of data such as genome sequencing data, which easily surpass several terabytes in size, requires distributing parts of the data to several systems to be handled in parallel. Doing so raises the probability of failure. In a single system, failure is not something that program designers usually worry about explicitly. In a distributed scenario, however, partial failures are expected and common; if the rest of the distributed system is fine, it should be able to recover from the component failure or transient error condition and continue to make progress. Providing such resilience is a major software engineering challenge.
In addition to these sorts of bugs and challenges, there is also the fact that computing hardware has finite resources available (Morais, 2015).
The major hardware restrictions include:
Individual systems usually have only a few gigabytes of memory. If the input dataset is several terabytes, a thousand or more machines would be required to hold it in RAM, and even then no single machine would be able to process or address all of the data.
Hard drives are much larger than RAM, and a single machine can currently hold multiple terabytes of information on its hard drives. But the data generated by a large-scale computation can easily require more space than the original data occupied. During the computation, some of the storage devices employed by the system may fill up, and the distributed system will have to send the overflow to another node for storage.
Finally, bandwidth is a limited resource. While a group of nodes directly connected by gigabit Ethernet generally experience high throughput between them, if they all transmit multi-gigabyte data at once, they would saturate the switch’s bandwidth. Moreover, if the systems were spread across multiple racks, the bandwidth available for the data transfer would be reduced further.
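The memory and bandwidth limits above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the concrete figures (a 4 TB dataset, 4 GB of RAM per machine, a 10 GB transfer over a 1 Gbit/s link) are assumptions chosen to match the orders of magnitude in the text, and the function names are hypothetical. Protocol overhead and contention are ignored.

```python
GB = 10**9   # bytes in a gigabyte (decimal units, an assumption)
TB = 10**12  # bytes in a terabyte


def machines_needed(dataset_bytes: int, ram_per_machine_bytes: int) -> int:
    """Smallest number of machines whose combined RAM holds the dataset."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-dataset_bytes // ram_per_machine_bytes)


def transfer_seconds(payload_bytes: int, link_bits_per_second: int) -> float:
    """Ideal time to push a payload through one link, ignoring overhead."""
    return payload_bytes * 8 / link_bits_per_second


machines = machines_needed(4 * TB, 4 * GB)
seconds = transfer_seconds(10 * GB, 10**9)
print(machines)  # 1000
print(seconds)   # 80.0
```

Under these assumed figures, holding the dataset in RAM takes a thousand machines, and a single 10 GB transfer occupies a gigabit link for over a minute, which is why several nodes transmitting at once can saturate a shared switch.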
To achieve a successful large-scale computation, the resources mentioned must be managed efficiently. Furthermore, the system must allocate some of these resources toward maintaining itself as a whole, while devoting as much time as possible to the actual core computation.
Synchronization between multiple systems remains the biggest challenge in distributed system design. If nodes in a distributed system can explicitly communicate with one another, then application designers must be aware of the risks associated with such communication patterns. Finally, the ability to continue computation in the face of failures becomes more challenging.
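The recovery behaviour discussed in this section, where the rest of the system keeps making progress when one component fails, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular framework does it: workers are plain functions, a failure is modelled as a `RuntimeError`, and the names `run_with_recovery`, `broken_worker` and `healthy_worker` are hypothetical.

```python
def run_with_recovery(tasks, workers, max_attempts=3):
    """Run each task, reassigning it to the next worker when one fails."""
    results = {}
    for task_id, payload in tasks.items():
        for attempt in range(max_attempts):
            # Pick a different worker on every attempt (round-robin).
            worker = workers[(task_id + attempt) % len(workers)]
            try:
                results[task_id] = worker(payload)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # partial failure: retry on another worker
        else:
            raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")
    return results


def broken_worker(payload):
    """Stands in for a crashed or unreachable node."""
    raise RuntimeError("node unavailable")


def healthy_worker(payload):
    """Stands in for a working node; here it just measures a read's length."""
    return len(payload)


tasks = {0: "ACGT", 1: "GGC", 2: "T"}
results = run_with_recovery(tasks, [broken_worker, healthy_worker])
print(results)  # {0: 4, 1: 3, 2: 1}
```

Tasks 0 and 2 are first sent to the broken node and then succeed on the healthy one, so the job as a whole completes. A real system would also need to detect failures (for example with timeouts) and to make retries safe, which this sketch leaves out.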
Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries, and graph processing (Fig. 2). One of Spark’s primary distinctions is its use of RDDs, or Resilient Distributed Datasets. RDDs are well suited to pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a distinctive form of fault tolerance based on lineage information. If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered). Spark provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs (https://spark.apache.org/).
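The idea of lineage-based fault tolerance can be illustrated with a toy model in plain Python. This is not the Spark API: the class `ToyDataset` and its methods are hypothetical, and the sketch assumes that the source data sit in durable storage while derived partitions live only in a cache that can be lost.

```python
class ToyDataset:
    """Immutable dataset that remembers how each partition is derived."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: recipe to rebuild partition i
        self._cache = {}         # materialized partitions (can be lost)

    @classmethod
    def from_source(cls, partitions):
        """Wrap durable source data; tuples keep the contents immutable."""
        frozen = tuple(tuple(p) for p in partitions)
        return cls(len(frozen), lambda i: frozen[i])

    def map(self, func):
        """Return a new dataset; the parent is never modified."""
        parent = self
        return ToyDataset(
            self.num_partitions,
            lambda i: tuple(func(x) for x in parent.partition(i)),
        )

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is missing."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that discards one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


reads = ToyDataset.from_source([["ACGT", "GG"], ["TTA"]])
lengths = reads.map(len)
before = lengths.collect()
lengths.lose_partition(1)  # the cached result for partition 1 disappears
after = lengths.collect()  # rebuilt by replaying map(len) on the parent
print(before, after)       # [4, 2, 3] [4, 2, 3]
```

Because every dataset is immutable and carries the function that produced it, a lost partition can be rebuilt by re-running that function on the parent, without replicating the derived data. Only the lost partition is recomputed; partition 0 is still served from the cache.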