
Tom White, “Hadoop: The Definitive Guide”, 4th Edition


Index

SORT BY clause (Hive), 503
Sort class, 520, 547
SortedMapWritable class, 120
sorting data
    about, 255
    Avro and, 358, 363–365
    controlling sort order, 258
    Hive tables, 503
    MapReduce and, 255–268, 363–365
    partial sort, 257–258
    Pig operators and, 465
    preparation overview, 256
    secondary sort, 262–268
    shuffle process and, 197–203
    total sort, 259–262
Source interface, 531
SourceTarget interface, 533
Spark
    about, 549
    additional information, 574
    anatomy of job runs, 565–570
    cluster managers and, 570
    example of, 550–555
    executors and, 570
    Hive and, 477
    installing, 550
    MapReduce and, 558
    RDDs and, 556–563
    resource requests, 82
    shared variables, 564–565
    sorting data, 259
    YARN and, 571–574
spark-shell command, 550
spark-submit command, 553, 573
spark.kryo.registrator property, 563
SparkConf class, 553
SparkContext class, 550, 571–574
SpecificDatumReader class, 352
speculative execution of tasks, 204–206
SPILLED_RECORDS counter, 249
SPLIT statement (Pig Latin), 435, 466
splits (input data) (see input splits)
SPLIT_RAW_BYTES counter, 249
SPOF (single point of failure), 48
Sqoop
    about, 401
    additional information, 422
    Avro support, 406
    connectors and, 403
    escape sequences supported, 418
    export process, 417–422
    file formats, 406
    generated code, 407
    getting, 401–403
    import process, 408–412
    importing large objects, 415–417
    MapReduce support, 405, 408, 419
    Parquet support, 406
    sample import, 403–407
    SequenceFile class and, 406
    serialization and, 407
    tool support, 402, 407
    working with imported data, 412–415
srst command (ZooKeeper), 605
srvr command (ZooKeeper), 605
SSH, configuring, 289, 296, 689
stack traces, 331
Stack, Michael, 575–602
standalone mode (Hadoop), 687
stat command (ZooKeeper), 605
statements (Pig Latin)
    about, 433–437
    control flow, 438
    expressions and, 438–439
states (ZooKeeper), 625–627, 631
status updates for tasks, 190
storage handlers, 499
store functions (Pig Latin), 446
STORE statement (Pig Latin), 434, 435, 465
STORED AS clause (Hive), 498
STORED BY clause (Hive), 499
STREAM statement (Pig Latin), 435, 458
stream.map.input.field.separator property, 219
stream.map.input.ignoreKey property, 218
stream.map.output.field.separator property, 219
stream.non.zero.exit.is.failure property, 193
stream.num.map.output.key.fields property, 219
stream.num.reduce.output.key.fields property, 219
stream.recordreader.class property, 235
stream.reduce.input.field.separator property, 219
stream.reduce.output.field.separator property, 219
Streaming programs
    about, 7
    default job, 218–219
    secondary sort, 266–268
    task execution, 189
    user-defined counters, 255
StreamXmlRecordReader class, 235
StrictHostKeyChecking SSH setting, 296
String class (Java), 115–118, 349
StringTokenizer class (Java), 279
StringUtils class, 111, 453
structured data, 9
subqueries, 508
SUM function (Pig Latin), 446
SWebHdfsFileSystem class, 53
SwiftNativeFileSystem class, 54
SWIM repository, 316
sync operation (ZooKeeper), 616
syncLimit property, 639
syslog file (Java), 172
system administration
    commissioning nodes, 334–335
    decommissioning nodes, 335–337
    HDFS support, 317–329
    monitoring, 330–332
    routine procedures, 332–334
    upgrading clusters, 337–341
System class (Java), 151
system logfiles, 172, 295

T
TableInputFormat class, 238, 587
TableMapper class, 588
TableMapReduceUtil class, 588
TableOutputFormat class, 238, 587
tables (HBase)
    about, 576–578
    creating, 583
    inserting data into, 583
    locking, 578
    regions, 578
    removing, 584
    wide tables, 591
tables (Hive)
    about, 489
    altering, 502
    buckets and, 491, 493–495
    dropping, 502
    external tables, 490–491
    importing data, 500–501
    managed tables, 490–491
    partitions and, 491–493
    storage formats, 496–499
    views, 509
TABLESAMPLE clause (Hive), 495
TableSource interface, 531
Target interface, 532
task attempt IDs, 164, 203
task attempts page (MapReduce), 169
task counters, 248–250
task IDs, 164, 203
task logs (MapReduce), 172
TaskAttemptContext interface, 191
tasks
    executing, 189, 203–208, 570
    failure considerations, 193
    profiling, 175–176
    progress and status updates, 190
    scheduling in Spark, 569
    Spark support, 552
    speculative execution, 204–206
    streaming, 189
    task assignments, 188
tasks page (MapReduce), 169
tasktrackers, 83
TEMPORARY keyword (Hive), 513
teragen program, 315
TeraSort program, 315
TestDFSIO benchmark, 316
testing
    HBase installation, 582–584
    Hive considerations, 473
    job drivers, 158–160
    MapReduce test runs, 27–30
    in miniclusters, 159
    running jobs locally on test data, 156–160
    running jobs on clusters, 160–175
    writing unit tests with MRUnit, 152–156
Text class, 115–118, 121–124, 210
text formats
    controlling maximum line length, 233
    KeyValueTextInputFormat class, 233
    NLineInputFormat class, 234
    NullOutputFormat class, 239
    TextInputFormat class, 232
    TextOutputFormat class, 239
    XML documents and, 235
TextInputFormat class
    about, 232
    MapReduce types and, 157, 211
    Sqoop imports and, 412
TextLoader function (Pig Latin), 446
TextOutputFormat class, 123, 239, 523
TGT (Ticket-Granting Ticket), 310
thread dumps, 331
Thrift
    HBase and, 589
    Hive and, 479
    Parquet and, 375–377
ThriftParquetWriter class, 375
tick time (ZooKeeper), 624
Ticket-Granting Ticket (TGT), 310
timeline servers, 84
TOBAG function (Pig Latin), 440, 446
tojson command, 355
TokenCounterMapper class, 279
TOKENIZE function (Pig Latin), 446
ToLowerFn function, 536
TOMAP function (Pig Latin), 440, 446
Tool interface, 148–152
ToolRunner class, 148–152
TOP function (Pig Latin), 446
TotalOrderPartitioner class, 260
TOTAL_LAUNCHED_MAPS counter, 251
TOTAL_LAUNCHED_REDUCES counter, 251
TOTAL_LAUNCHED_UBERTASKS counter, 251
TOTUPLE function (Pig Latin), 440, 446
TPCx-HS benchmark, 316
transfer rate, 8
TRANSFORM clause (Hive), 503
transformations, RDD, 557–560
Trash class, 308
trash facility, 307
TRUNCATE TABLE statement (Hive), 502
tuning jobs, 175–176
TwoDArrayWritable class, 120

U
uber tasks, 187
UDAF class, 514
UDAFEvaluator interface, 514
UDAFs (user-defined aggregate functions), 510, 513–517
UDF class, 512
UDFs (user-defined functions)
    Hive and, 510–517
    Pig and, 424, 447, 448–456
UDTFs (user-defined table-generating functions), 510
Unicode characters, 116–117
UNION statement (Pig Latin), 435, 466
unit tests with MRUnit, 145, 152–156
Unix user accounts, 288
unmanaged application masters, 81
unstructured data, 9
UPDATE statement (Hive), 483
upgrading clusters, 337–341
URL class (Java), 57
user accounts, Unix, 288
user identity, 147
user-defined aggregate functions (UDAFs), 510, 513–517
user-defined functions (see UDFs)
user-defined table-generating functions (UDTFs), 510
USING JAR clause (Hive), 512

V
VCORES_MILLIS_MAPS counter, 251
VCORES_MILLIS_REDUCES counter, 251
VERSION file, 318
versions (Hive), 472
ViewFileSystem class, 48, 53
views (virtual tables), 509
VIntWritable class, 113
VIRTUAL_MEMORY_BYTES counter, 250, 303
VLongWritable class, 113
volunteer computing, 11

W
w (write) permission, 52
Walters, Chad, 576
WAR (Web application archive) files, 160
watches (ZooKeeper), 615, 618
Watson, James D., 655
wchc command (ZooKeeper), 606
wchp command (ZooKeeper), 606
wchs command (ZooKeeper), 606
Web application archive (WAR) files, 160
WebHDFS protocol, 54
WebHdfsFileSystem class, 53
webtables (HBase), 575
Wensel, Chris K., 669
Whitacre, Micah, 643
whoami command, 147
WITH SERDEPROPERTIES clause (Hive), 499
work units, 11, 30
workflow engines, 179
workflows (MapReduce)
    about, 177
    Apache Oozie system, 179–184
    decomposing problems into jobs, 177–178
    JobControl class, 178
Writable interface
    about, 110–112
    class hierarchy, 113–121
    Crunch and, 528
    implementing custom, 121–125
WritableComparable interface, 112, 258
WritableComparator class, 112
WritableSerialization class, 126
WritableUtils class, 125
write (w) permission, 52
WRITE permission (ACL), 620
WriteSupport class, 373
WRITE_OPS counter, 250
writing data
    Crunch support, 532
    using FileSystem API, 61–63
    HDFS data flow, 72–73
    Parquet and, 373–377
    SequenceFile class, 127–129

X
x (execute) permission, 52
XML documents, 235

Y
Yahoo!, 13
YARN (Yet Another Resource Negotiator)
    about, 7, 79, 96
    anatomy of application run, 80–83
    application lifespan, 82
    application master failure, 194
    building applications, 82
    cluster setup and installation, 288
    cluster sizing, 286
    daemon properties, 300–303
    distributed shell, 83
    log aggregation, 172
    MapReduce comparison, 83–85
    scaling out data, 30
    scheduling in, 85–95, 308
    Spark and, 571–574
    starting and stopping daemons, 291
YARN client mode (Spark), 571
YARN cluster mode (Spark), 573–574
yarn-env.sh file, 292
yarn-site.xml file, 292, 296
yarn.app.mapreduce.am.job.recovery.enable property, 195
yarn.app.mapreduce.am.job.speculator.class property, 205
yarn.app.mapreduce.am.job.task.estimator.class property, 205
yarn.log-aggregation-enable property, 172
yarn.nodemanager.address property, 306
yarn.nodemanager.aux-services property, 300, 687
yarn.nodemanager.bind-host property, 305
yarn.nodemanager.container-executor.class property, 193, 304, 313
yarn.nodemanager.delete.debug-delay-sec property, 174
yarn.nodemanager.hostname property, 305
yarn.nodemanager.linux-container-executor property, 304
yarn.nodemanager.local-dirs property, 300
yarn.nodemanager.localizer.address property, 306
yarn.nodemanager.log.retain-second property, 173
yarn.nodemanager.resource.cpu-vcores property, 301, 303
yarn.nodemanager.resource.memory-mb property, 150, 301
yarn.nodemanager.vmem-pmem-ratio property, 301, 303
yarn.nodemanager.webapp.address property, 306
yarn.resourcemanager.address property
    about, 300, 305
    Hive and, 476
    Pig and, 425
yarn.resourcemanager.admin.address property, 305
yarn.resourcemanager.am.max-attempts property, 194, 196
yarn.resourcemanager.bind-host property, 305
yarn.resourcemanager.hostname property, 300, 305, 687
yarn.resourcemanager.max-completed-applications property, 165
yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms property, 195
yarn.resourcemanager.nodes.exclude-path property, 307, 336
yarn.resourcemanager.nodes.include-path property, 307, 335
yarn.resourcemanager.resource-tracker.address property, 305
yarn.resourcemanager.scheduler.address property, 305
yarn.resourcemanager.scheduler.class property, 91
yarn.resourcemanager.webapp.address property, 306
yarn.scheduler.capacity.node-locality-delay property, 95
yarn.scheduler.fair.allocation.file property, 91
yarn.scheduler.fair.allow-undeclared-pools property, 93
yarn.scheduler.fair.locality.threshold.node property, 95
yarn.scheduler.fair.locality.threshold.rack property, 95
yarn.scheduler.fair.preemption property, 94
yarn.scheduler.fair.user-as-default-queue property, 93
yarn.scheduler.maximum-allocation-mb property, 303
yarn.scheduler.minimum-allocation-mb property, 303
yarn.web-proxy.address property, 306
YARN_LOG_DIR environment variable, 172
YARN_RESOURCEMANAGER_HEAPSIZE environment variable, 294

Z
Zab protocol, 621
zettabytes, 3
znodes
    about, 606
    ACLs and, 619
    creating, 607–609
    deleting, 612
    ephemeral, 614
    joining groups, 609
    listing, 610–612
    operations supported, 616
    persistent, 614
    properties supported, 614–615
    sequential, 615
ZOOCFGDIR environment variable, 605
ZooKeeper
    about, 603
    additional information, 640
    building applications
        configuration service, 627–630, 634–636
        distributed data structures and protocols, 636
        resilient, 630–634
    consistency and, 621–623
    data model, 614
    example of, 606–613
    failover controllers and, 50
    HBase and, 579
    high availability and, 49
    implementing, 620
    installing and running, 604–606
    operations in, 616–620
    production considerations, 637–640
    sessions and, 623–625
    states and, 625–627, 631
zxid, 622

About the Author

Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007 and is a member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked since its foundation on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has spoken at many conferences, including ApacheCon, OSCON, and Strata. Tom has a BA in mathematics from the University of Cambridge and an MA in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.

Colophon

The animal on the cover of Hadoop: The Definitive Guide is an African elephant. These members of the genus Loxodonta are the largest land animals on Earth (slightly larger than their cousin, the Asian elephant) and can be identified by their ears, which have been said to look somewhat like the continent of Africa. Males stand 12 feet tall at the shoulder and weigh 12,000 pounds, but they can get as big as 15,000 pounds, whereas females stand 10 feet tall and weigh 8,000–11,000 pounds. Even young elephants are very large: at birth, they already weigh approximately 200 pounds and stand about 3 feet tall.

African elephants live throughout sub-Saharan Africa. Most of the continent's elephants live on savannas and in dry woodlands. In some regions, they can be found in desert areas; in others, they are found in mountains.

The species plays an important role in the forest and savanna ecosystems in which they live. Many plant species are dependent on passing through an elephant's digestive tract before they can germinate; it is estimated that at least a third of tree species in west African forests rely on elephants in this way. Elephants grazing on vegetation also affect the structure of habitats and influence bush fire patterns. For example, under natural conditions, elephants make gaps through the rainforest, enabling the sunlight to enter, which allows the growth of various plant species. This, in turn, facilitates more abundance and more diversity of smaller animals. As a result of the influence elephants have over many plants and animals, they are often referred to as a keystone species because they are vital to the long-term survival of the ecosystems in which they live.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from the Dover Pictorial Archive. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

