hadoop

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data
Practice Sharing • storage apache hadoop • Xuanwo
1 post • 5531 views



This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.

The set of data storage and processing technologies that define the Apache Hadoop ecosystem is expansive and ever-improving, covering a very diverse set of use cases in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop, making it faster, easier to work with, and more secure.

In late 2012, we began a long-term planning exercise to analyze gaps in the Apache Hadoop storage layer that were complicating or, in some cases, preventing Hadoop adoption for certain use cases. In the course of this evaluation, we noticed several important trends and ultimately decided that there was a need for a new storage technology that would complement the capabilities that HDFS and Apache HBase provide. Today, we are excited to announce Kudu, a new addition to the open source Hadoop ecosystem. Kudu aims to provide fast analytical and real-time capabilities, efficient utilization of modern CPU and I/O resources, the ability to do updates in place, and a simple and evolvable data model.

In the remainder of this post we’ll offer an overview of our motivations for building Kudu, briefly explain its architecture, and outline our plan for growing a vibrant open source community in preparation for an eventual proposed donation to the ASF Incubator.

Gap in Capabilities
Within many Cloudera customers’ environments, we’ve observed the emergence of “hybrid architectures” where several Hadoop tools are deployed simultaneously. Tools like HBase are fantastic at ingesting data, serving small queries extremely quickly, and allowing data to be updated in place. HDFS, in combination with tools like Impala that can process columnar file formats like Apache Parquet, provides extreme performance for analytic queries on extremely large datasets.

However, when a use case requires capabilities that no single tool provides at the same time, customers are forced to build hybrid architectures that stitch multiple tools together. Customers often choose to ingest and update data in one storage system, but later reorganize this data to optimize for an analytical reporting use case served from another.

Our customers have been successfully deploying and maintaining these hybrid architectures, but we believe that they shouldn’t need to accept their inherent complexity. A storage system purpose-built to provide great performance across a broad range of workloads offers a more elegant solution to the problems that hybrid architectures aim to solve.

A complex hybrid architecture designed to cover gaps in storage system capabilities
New Hardware
Another trend we’ve observed at customer sites is the gradual deployment of more capable hardware. First, we saw a steady growth in the amount of RAM that our customers are deploying, from 32GB per node in 2012 to 128GB or 256GB today. Additionally, it’s increasingly common for commodity nodes to include some amount of SSD storage. HBase, HDFS, and other Hadoop tools are being adapted to take advantage of this changing hardware landscape, but these tools were architected in a context where the most common bottleneck to overall system performance was the speed of the disks underlying the Hadoop cluster. Choices optimal for a spinning-disk storage architecture are not necessarily optimal for more modern architectures where large amounts of data can be cached in memory, and where random access times on persistent storage can be more than 100x faster.

Additionally, with a faster storage layer, the bottleneck to overall system performance is often no longer the storage layer itself. Generally, the next bottleneck that we see is CPU performance. With a slower storage layer, inefficiency in CPU utilization is often hidden beneath the storage bottleneck, but as the storage layer gets faster, CPU efficiency becomes much more critical.

We believe that there’s room for a new Hadoop storage system that is designed from the ground up to work with these modern hardware configurations and that emphasizes CPU efficiency.
Introducing Kudu
To address these trends we investigated two separate approaches: incremental modifications to existing Hadoop tools, or building something entirely new. The design goals that we aimed to address were:
Strong performance for both scan and random access to help customers simplify complex hybrid architectures
High CPU efficiency in order to maximize the return on investment that our customers are making in modern processors
High IO efficiency in order to leverage modern persistent storage
The ability to update data in place, to avoid extraneous processing and data movement
The ability to support active-active replicated clusters that span multiple data centers in geographically distant locations
We prototyped strategies for achieving these goals within existing open source projects, but eventually came to the conclusion that large architectural changes were necessary to achieve our goals. These changes were extensive enough that building an entirely new data storage technology was necessary. We started development more than three years ago, and we are proud to share the result of our effort thus far: a new data storage technology that we call Kudu.

Kudu provides a combination of characteristics that enable fast analytics on fast data.
Kudu’s Basic Design
From a user perspective, Kudu is a storage system for tables of structured data. Tables have a well-defined schema consisting of a predefined number of typed columns. Each table has a primary key composed of one or more of its columns. The primary key enforces a uniqueness constraint (no two rows can share the same key) and acts as an index for efficient updates and deletes.
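The table model described above can be illustrated with a small in-memory sketch. This is a toy written for this post, not Kudu's client API: the `ToyTable` class and all of its method names are hypothetical, and a Python dict stands in for the primary key index.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table with typed columns and a primary key (illustrative only)."""
    columns: dict        # column name -> Python type
    key_columns: tuple   # one or more column names that form the primary key
    _rows: dict = field(default_factory=dict)  # key tuple -> row dict

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def _check(self, row):
        # Every row must match the predefined, typed schema.
        if set(row) != set(self.columns):
            raise ValueError("row does not match the table schema")
        for name, typ in self.columns.items():
            if not isinstance(row[name], typ):
                raise TypeError(f"column {name!r} expects {typ.__name__}")

    def insert(self, row):
        self._check(row)
        key = self._key(row)
        if key in self._rows:
            # Uniqueness constraint: no two rows may share the same key.
            raise KeyError(f"duplicate primary key {key}")
        self._rows[key] = dict(row)

    def update(self, key, **changes):
        # The key acts as an index: the row is found without a scan.
        candidate = {**self._rows[key], **changes}
        self._check(candidate)
        if self._key(candidate) != key:
            raise ValueError("primary key columns cannot be changed")
        self._rows[key] = candidate

    def delete(self, key):
        del self._rows[key]

    def get(self, key):
        return self._rows.get(key)
```

A table keyed on `("host", "ts")`, for example, would reject a second insert with the same host and timestamp, while `update(("a", 1), cpu=0.7)` locates the row directly by its key.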

Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called Tablets. Kudu provides data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes using the Raft consensus algorithm. Tablets are typically tens of gigabytes, and an individual node typically holds 10-100 Tablets.
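As a rough illustration of Tablets and replication, the sketch below splits keys across Tablets and places each Tablet on several nodes. It makes several assumptions that the post does not state: hash partitioning, round-robin placement, and three replicas are all choices made for the example, and the function names are hypothetical. It models only the majority-acknowledgement rule that consensus protocols such as Raft rely on, not leader election or log replication.

```python
import zlib


def tablet_for(key: str, num_tablets: int) -> int:
    """Map a primary key to a tablet using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_tablets


def place_replicas(num_tablets: int, nodes: list, replication: int = 3) -> dict:
    """Assign each tablet to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        t: [nodes[(t + i) % len(nodes)] for i in range(replication)]
        for t in range(num_tablets)
    }


def write_committed(replicas: list, alive: set) -> bool:
    """A write commits once a strict majority of replicas acknowledge it."""
    acks = sum(1 for node in replicas if node in alive)
    return acks > len(replicas) // 2
```

With three replicas, a write still commits when one node is down, which is how replication protects against the failure of a single commodity machine.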

Kudu has a master process responsible for managing the metadata that describes the logical structure of the data stored in Tablet Servers (the catalog), acting as a coordinator when recovering from hardware failure, and keeping track of which Tablet Servers are responsible for hosting replicas of each Tablet. Multiple standby master servers can be defined to provide high availability. In Kudu, many responsibilities typically associated with master processes can be delegated to the Tablet Servers due to Kudu’s implementation of Raft consensus, and the architecture provides a path to partitioning the master’s duties across multiple machines in the future. We do not anticipate that Kudu’s master process will become the bottleneck to overall cluster performance; in tests on a 250-node cluster, the server hosting the master process has been nowhere near saturation.

Data stored in Kudu is updateable through the use of a variation of log-structured storage in which updates, inserts, and deletes are temporarily buffered in memory before being merged into persistent columnar storage. Such architectures are generally associated with spikes in query latency; Kudu protects against these by constantly performing small maintenance operations such as compactions, so that large maintenance operations are never necessary.
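The general idea of buffering writes in memory, flushing them to columnar storage, and compacting in small steps can be sketched as follows. This is a simplified toy, not a description of Kudu's internals: the `ToyTablet` class, its segment layout, and the thresholds are all assumptions made for illustration.

```python
class ToyTablet:
    """Toy log-structured store: buffer in memory, flush to columnar segments."""

    def __init__(self, columns, flush_threshold=2):
        self.columns = list(columns)
        self.flush_threshold = flush_threshold
        self.buffer = {}    # key -> row dict, or None as a delete marker
        self.segments = []  # flushed columnar segments, oldest first

    def upsert(self, key, row):
        self.buffer[key] = {c: row[c] for c in self.columns}
        self._maintain()

    def delete(self, key):
        self.buffer[key] = None
        self._maintain()

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for seg in reversed(self.segments):  # the newest segment wins
            if key in seg["deleted"]:
                return None
            if key in seg["index"]:
                i = seg["index"][key]
                return {c: seg["cols"][c][i] for c in self.columns}
        return None

    def _build(self, rows, deleted):
        # Store one list per column instead of one record per row.
        keys = sorted(rows)
        return {
            "index": {k: i for i, k in enumerate(keys)},
            "cols": {c: [rows[k][c] for k in keys] for c in self.columns},
            "deleted": set(deleted),
        }

    def _flush(self):
        live = {k: r for k, r in self.buffer.items() if r is not None}
        dead = [k for k, r in self.buffer.items() if r is None]
        self.segments.append(self._build(live, dead))
        self.buffer = {}

    def _compact_once(self):
        """Merge only the two oldest segments: one small, bounded step."""
        rows = {}
        for seg in self.segments[:2]:  # the later segment overrides the earlier
            for k in seg["deleted"]:
                rows.pop(k, None)
            for k, i in seg["index"].items():
                rows[k] = {c: seg["cols"][c][i] for c in self.columns}
        # Nothing older remains, so delete markers can be dropped.
        self.segments[:2] = [self._build(rows, [])]

    def _maintain(self):
        if len(self.buffer) >= self.flush_threshold:
            self._flush()
        if len(self.segments) > 2:
            self._compact_once()
```

Because each write triggers at most one flush and one two-segment merge, maintenance work is spread across many small steps rather than accumulating into one large compaction.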

Kudu provides direct APIs, in both C++ and Java, that allow for point and batch retrieval of rows, writes, deletes, schema changes, and more. In addition, Kudu is designed to integrate with and improve existing Hadoop ecosystem tools. With Kudu’s beta release, integrations with Impala, MapReduce, and Apache Spark are available. Over time we plan on making Kudu a supported storage option for most or all of the Hadoop ecosystem tools.

A much more thorough description of Kudu’s architecture can be found in the Kudu white paper.
The Kudu Community
Kudu already has an extensive set of capabilities, but there’s still work to be done and we’d appreciate your help. Kudu is fully open source software, licensed under the Apache License 2.0. Additionally, we intend to submit Kudu to the Apache Software Foundation as an Apache Incubator project to help foster its growth and facilitate its usage.

The binaries for Kudu (beta) are currently available for download. We’ve also created several installation options, outlined in our documentation, to help you get Kudu up and running quickly so that you can try it out. Today we’re also making the full history of Kudu development available both in our GitHub repository and in a public export of our issue tracking system. Going forward, Kudu development will be done completely transparently and publicly.

Several companies—including AtScale, Intel, Splice Machine, Xiaomi, Zoomdata, and more—have already provided substantial feedback and contributions to help make Kudu better, but this is just the beginning. We welcome any feedback or contributions from anyone that has an interest in the use cases that Kudu addresses.

Based on the above we’re confident you will agree that Kudu complements HDFS and HBase to address real needs across the Hadoop community. We look forward to working with the community to improve Kudu over time.
Resources for Getting Involved
Mailing list: kudu-user@googlegroups.com
Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
Contributions: http://getkudu.io./contributing.html
JIRA: http://issues.cloudera.org/projects/KUDU

Original link: https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/

Building Big Data Platform Infrastructure on Cloud Computing in Practice
Practice Sharing • big-data hadoop spark cloud-computing hbase • Arron
22 posts • 80594 views



@覃大王叫我来巡山 What does this mean?

Spark vs. Hadoop: Which Is Better?
Practice Sharing • hadoop spark • Xuanwo
1 post • 7895 views



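Spark has overtaken Hadoop as the most active open source big data project. Even so, enterprises should not favor one over the other on that basis alone when choosing a big data framework. Recently, well-known big data expert Bernard Marr analyzed the similarities and differences between Spark and Hadoop in an article.
Hadoop and Spark are both big data frameworks, and both provide tools for common big data tasks. Strictly speaking, though, they do not perform the same tasks, and they are not mutually exclusive. Although Spark is reported to be up to 100 times faster than Hadoop in certain circumstances, it has no distributed storage system of its own. Distributed storage is the foundation of many of today's big data projects: it can store petabyte-scale datasets across the hard drives of a virtually unlimited number of ordinary computers, and it scales well, since you only need to add drives as the dataset grows. Spark therefore needs third-party distributed storage. For this reason, many big data projects install Spark on top of Hadoop, so that Spark's advanced analytics applications can use data stored in HDFS.
Compared with Hadoop, Spark's real advantage is speed. Most of Spark's operations run in memory, whereas Hadoop's MapReduce system writes all data back to physical storage after each operation. This is done to ensure full recovery if something goes wrong, but Spark's Resilient Distributed Datasets (RDDs) can provide the same guarantee.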
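One way to see how in-memory results can be recovered without writing every step to disk is lineage: each dataset remembers how it was derived, so a lost result can be recomputed from its parent. The sketch below is a toy in plain Python and not Spark's API; the `ToyDataset` class and its methods are hypothetical names chosen for illustration.

```python
class ToyDataset:
    """Toy dataset that records its lineage so lost results can be rebuilt."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, assumed to live in durable storage
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # transformation applied to each parent item
        self._cache = None    # in-memory result; may be lost at any time

    def map(self, fn):
        # Record the transformation instead of running it immediately.
        return ToyDataset(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                # Recompute from the parent, which rebuilds itself if needed.
                self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. when a worker fails."""
        self._cache = None
```

After `lose_cache()` is called on any dataset in a chain, the next `collect()` walks back through the recorded transformations to the durable source and produces the same result.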
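In addition, Spark outperforms Hadoop in advanced data processing such as real-time stream processing and machine learning.
In Bernard's view, this, together with its speed advantage, is the real reason for Spark's growing popularity. Real-time processing means data can be fed to an analytics application the moment it is captured, with feedback returned immediately. This kind of processing is finding more and more uses across big data applications, for example in recommendation engines used by retailers and in monitoring the performance of industrial machinery in manufacturing. The speed and stream-processing capabilities of the Spark platform also suit machine learning algorithms well. These algorithms can learn and improve on their own until they find an ideal solution to a problem. This technology is at the core of the most advanced manufacturing systems (for example, predicting when parts will fail) and of driverless cars. Spark has its own machine learning library, MLlib, whereas Hadoop systems need a third-party machine learning library such as Apache Mahout.
In fact, although Spark and Hadoop overlap in some functionality, neither is a commercial product, so there is no real competition between them, and companies that make money by providing technical support for such free systems often offer both. Cloudera, for example, provides both Spark and Hadoop services and recommends whichever best fits the customer's needs.
Bernard believes that although Spark is developing rapidly, it is still in its infancy, and its security and technical support infrastructure is not yet mature. In his view, Spark's rising activity in the open source community shows that enterprise users are looking for innovative ways to use the data they have already stored.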

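Original author: 谢丽
Original link: http://www.infoq.com/cn/minibooks/architect-201512

What Would Happen to the Data Center If Virtualization and Hadoop Fell in Love?
Practice Sharing • hadoop virtualization • Xuanwo
1 post • 6979 views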



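Rapidly growing data volumes and mounting competitive pressure are pushing more and more enterprises to think about how to extract value from their data. Traditional BI systems, data warehouses, and database systems cannot handle this data well, for several reasons: 1) the data volume is too large for traditional databases to store effectively while maintaining acceptable performance; 2) newly generated data is often unstructured, whereas traditional approaches were designed for structured data; 3) the hardware required for traditional data processing tends to be relatively expensive, and as data volumes grow, many enterprises cannot bear the cost of continuing to process data the traditional way. As a result, Apache Hadoop, the exotic flower so admired in the internet industry, is increasingly catching the eye of enterprises, and many of them are wondering how to bring this beautiful bride home to their own data centers.

However, it is not so simple for a traditional enterprise data center to win this alluring bride. Deploying and operating Hadoop takes a good many geeks to fully master, which is well beyond the technical capabilities of a traditional enterprise data center. In addition, Hadoop not only requires dedicated hardware; guaranteeing security and service levels is also a challenge. How to enjoy the bride's charms without inviting trouble later has become the practical challenge for enterprises choosing Hadoop.

From server virtualization to virtualization of the entire data center, we have by now felt the full power of this young fellow called virtualization. If virtualization and Hadoop were to fall in love, would the obstacles keeping enterprise data centers from adopting Hadoop all be swept away? The answer is yes. Virtualization decouples Hadoop from the underlying physical hardware and lets it truly dance in the cloud. Hadoop thereby gains rapid deployment, high availability, elastic resource scheduling, and secure multi-tenancy, and only then can the enterprise data center's dream of analyzing and using big data really come true.

Let us uncover this young fellow's secrets of courtship, so that we can make better use of Hadoop to meet the challenges of big data.
1) Rapid Hadoop deployment: We already know virtualization's toolkit, including virtual machines, snapshots, templates, and dynamic resource allocation. These features have tamed the deployment problems of many applications, and Hadoop is no exception: they can greatly speed up the deployment of Hadoop nodes. Hadoop nodes can also be started and shut down quickly on demand, so resources are used efficiently. The open source Serengeti project released by VMware, for example, has helped the romance between virtualization and Hadoop along.
2) High availability and fault tolerance for Hadoop: Although Hadoop improves system reliability through distributed data replication, many components are still single points of failure. This may not be a problem for internet companies, but it is definitely a challenge for traditional data centers. For example, the NameNode, the JobTracker, and some supporting modules are all single points of failure. The high availability of the virtualization platform can make these components highly reliable, so you can still rest easy after Hadoop moves into the enterprise data center.
3) An efficient data center that embraces Hadoop: With virtualization's dynamic scheduling, different workloads can be mixed on the enterprise data center's cloud platform. Hadoop can of course share a bed with other workloads, with strict security isolation ensuring that no conflicts arise. You can even run different versions of Hadoop on the same cloud platform, coexisting peacefully and sharing resources. This lowers the total cost compared with a traditional Hadoop deployment while preserving availability and performance, and makes an efficient data center easy to achieve.
4) Much higher resource utilization in Hadoop environments: Hadoop and other workloads are deployed on the same hosts, and resource control policies allocate and schedule resources efficiently. Letting Hadoop stroll gracefully through the cloud in this way is a key step in virtualization winning this romance.
5) Multi-tenant Hadoop in the cloud: Virtualization's isolation capabilities give Hadoop a sound multi-tenant experience. Different tenants can run Hadoop alongside other workloads in the cloud resource pool, so multi-tenant deployments go smoothly.
6) Security isolation: Virtualization's security isolation lets the Hadoop environments of different organizations and users run without worry, achieving complete isolation of data and environments while sharing the underlying physical resources.
7) Easy maintenance and migration: Virtualization makes Hadoop nodes easy to copy and migrate, so moves between clusters in the same data center, or cross-cloud migrations from one data center to another, can be done in moments. Hadoop is no longer a charming lady who has trouble getting around.

With these seven moves, virtualization has won Hadoop's heart. Not only does Hadoop cause no trouble for the traditional enterprise data center, it also loses none of its appeal on a virtualized platform: according to the original article, ample evidence shows that virtualized Hadoop nodes perform on par with physical environments while also delivering substantial cost savings. Hadoop and virtualization are a well-matched pair, and the fruit of their romance is something we can all look forward to. We wish Hadoop and virtualization a long and happy life together!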

Original link: http://cloud.yesky.com/473/37863473.shtml

Spark Service Launched
青云志 • hadoop spark • QingCloud
1 post • 8680 views



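Spark is a next-generation distributed big data processing platform that follows Hadoop. It is an in-memory, fault-tolerant distributed computing engine that can be up to 100 times faster than Hadoop MapReduce. Spark's excellent user experience and unified technology stack address essentially all of the core problems in the big data field, which has quickly made Spark the most popular big data platform today.

In addition, the Spark service provided by QingCloud (青云) includes online scaling, monitoring, and alerting to help you manage your clusters. For more details, see the "Spark Service Guide".
HDFS integration
QingCloud offers Spark clusters as a pure compute engine as well as Spark clusters integrated with Hadoop HDFS. When creating a Spark cluster, you can choose whether to integrate Hadoop HDFS.

Online scaling
QingCloud's Spark clusters support both horizontal and vertical online scaling, and horizontal scaling does not interrupt business continuity.

Real-time monitoring
QingCloud provides host-level monitoring for Spark nodes; service-level and application-level monitoring is provided by Spark and Hadoop.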

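Host monitoring covers the following items:
CPU
Memory
Disk usage
Disk IOPS
Disk throughput
Monitoring alarms
The Spark alarm policy monitors Spark nodes and covers the following items:
CPU: CPU usage percentage
Memory: memory usage percentage
Disk: disk usage percentage
Testing
Once a Spark cluster has been created, you can test its availability. For the test procedure, see the "documentation".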
