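In this situation the page only lets us view the JobManager logs. We can no longer see the Web UI as it looked before the job died, so it is hard to tell what actually happened at the moment of failure. Without Metrics monitoring in place, logs are the only means left to analyze and locate the problem. If we could restore the earlier Web UI, we could use it to discover and pin down some of these problems.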
Introduction to the History Server
This is where Flink's History Server comes in. So what is the History Server?
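It lets you query the statistics of completed jobs after the corresponding Flink cluster has been shut down. Suppose a batch job runs in the early hours of the morning. As we know, a job's logs and statistics can only be viewed while the job is still running, so if that job exits abnormally or produces wrong results, we cannot inspect them in time. This is why the History Server matters: it is what allows us to query the statistics of completed jobs, whether they exited normally or abnormally.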
# The HistoryServer is started and stopped via bin/historyserver.sh (start|stop)

# Directory to upload completed jobs to. Add this directory to the list of
# monitored directories of the HistoryServer as well (see below).
# Directory where the archives of finished Flink jobs are stored
jobmanager.archive.fs.dir: hdfs:///flink/history-log

# The address under which the web-based HistoryServer listens.
# Host on which the Flink HistoryServer process runs
#historyserver.web.address: 0.0.0.0

# The port under which the web-based HistoryServer listens.
# Port used by the Flink HistoryServer process
#historyserver.web.port: 8082

# Comma separated list of directories to monitor for completed jobs.
# HDFS directory monitored by the Flink HistoryServer process
historyserver.archive.fs.dir: hdfs:///flink/history-log

# Interval in milliseconds for refreshing the monitored directories.
#historyserver.archive.fs.refresh-interval: 10000
2020-10-13 21:21:01,310 main INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2020-10-13 21:21:01,336 main INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2020-10-13 21:21:01,352 main INFO org.apache.flink.runtime.security.modules.JaasModule - Jaas file will be created as /tmp/jaas-354359771751866787.conf.
2020-10-13 21:21:01,355 main INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2020-10-13 21:21:01,363 main WARN org.apache.flink.runtime.webmonitor.history.HistoryServer - Failed to create Path or FileSystem for directory 'hdfs:///flink/history-log'. Directory will not be monitored.
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'hdfs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
    at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:450)
    at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:362)
    at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.<init>(HistoryServer.java:187)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.<init>(HistoryServer.java:137)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:122)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:119)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.main(HistoryServer.java:119)
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop is not in the classpath/dependencies.
    at org.apache.flink.core.fs.UnsupportedSchemeFactory.create(UnsupportedSchemeFactory.java:58)
    at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:446)
    ... 8 more
2020-10-13 21:21:01,367 main ERROR org.apache.flink.runtime.webmonitor.history.HistoryServer - Failed to run HistoryServer.
org.apache.flink.util.FlinkException: Failed to validate any of the configured directories to monitor.
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.<init>(HistoryServer.java:196)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.<init>(HistoryServer.java:137)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:122)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer$1.call(HistoryServer.java:119)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.runtime.webmonitor.history.HistoryServer.main(HistoryServer.java:119)
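The root cause in the log is "Hadoop is not in the classpath/dependencies": the hdfs scheme can only be resolved when Hadoop classes are visible to the JVM (a common remedy is to expose the Hadoop jars, for example through the HADOOP_CLASSPATH environment variable, before starting the History Server). As a rough diagnostic, the sketch below checks whether a well-known Hadoop class can be loaded. The class name HadoopClasspathCheck and its helper are made up for illustration and are not part of Flink; Flink's own detection logic is more involved.

```java
/** Hypothetical diagnostic sketch: is Hadoop visible to the current JVM? Not part of Flink. */
final class HadoopClasspathCheck {

    // A real Hadoop class; if it cannot be loaded, hdfs:// paths are unlikely to work.
    static final String HADOOP_FS_CLASS = "org.apache.hadoop.fs.FileSystem";

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, HadoopClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(HADOOP_FS_CLASS)) {
            System.out.println("Hadoop classes found on the classpath.");
        } else {
            System.out.println("Hadoop classes not found; hdfs:// directories cannot be monitored.");
        }
    }
}
```

Running this with the same classpath as the History Server gives a quick hint whether the failure above is expected. If Hadoop is not available at all, pointing both archive directories at a scheme Flink supports natively, such as file://, is another way to get the History Server to start.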
On Linux the system temporary directory is /tmp. The available options can be found in the HistoryServerOptions class in the source code.
/**
 * The local directory used by the HistoryServer web-frontend.
 */
public static final ConfigOption<String> HISTORY_SERVER_WEB_DIR =
    key("historyserver.web.tmpdir")
        .noDefaultValue()
        .withDescription("This configuration parameter allows defining the Flink web directory to be used by the" +
            " history server web interface. The web interface will copy its static files into the directory.");
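Because historyserver.web.tmpdir has no default value, the web directory ends up under the system temporary directory when the option is not set. The following is a minimal sketch of that fallback, not Flink's actual code: a plain Map stands in for Flink's Configuration, the class WebDirResolver is hypothetical, and the "flink-web-history-" prefix is an assumption about how the fallback directory is named.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of resolving the History Server web directory. Not Flink code. */
final class WebDirResolver {

    static final String WEB_DIR_KEY = "historyserver.web.tmpdir";

    /** Returns the configured directory, or a fresh directory name under java.io.tmpdir. */
    static String resolveWebDir(Map<String, String> conf) {
        String configured = conf.get(WEB_DIR_KEY);
        if (configured != null && !configured.isEmpty()) {
            return configured;
        }
        // Assumed fallback naming; on Linux java.io.tmpdir is usually /tmp.
        return System.getProperty("java.io.tmpdir")
                + File.separator
                + "flink-web-history-"
                + UUID.randomUUID();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("Without the option: " + resolveWebDir(conf));
        conf.put(WEB_DIR_KEY, "/data/flink-web");
        System.out.println("With the option:    " + resolveWebDir(conf));
    }
}
```

Setting historyserver.web.tmpdir explicitly makes the location predictable, which helps when you want to inspect the files the History Server writes.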
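If we locate this temporary directory on the local machine, we can see that it holds many JS files, which are in fact the pages we just looked at.
The files stored by the History Server include the templates and configuration used to render the pages. Historical job information is stored under the jobs path, which contains the completed jobs. On every start, the History Server pulls the metadata of all jobs from historyserver.archive.fs.dir.
Each job folder contains the information we need. When we fetch metrics through the REST API, this is the content that is returned (Checkpoint/Exception information and so on).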
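To make the REST access concrete, here is a minimal sketch of a client that reads the job overview and the exceptions of a single job from a running History Server. It assumes Java 11 or newer (for java.net.http) and a History Server reachable at http://localhost:8082, the port shown in the configuration above. The class HistoryServerClient is hypothetical and not part of Flink; /jobs/overview and /jobs/<jobid>/exceptions are the endpoints it relies on.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for querying a Flink History Server over REST. Not part of Flink. */
final class HistoryServerClient {

    private final String baseUrl;
    private final HttpClient http = HttpClient.newHttpClient();

    HistoryServerClient(String baseUrl) {
        // Drop a trailing slash so that paths can be appended safely.
        this.baseUrl = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
    }

    /** URL listing all archived jobs. */
    String jobsOverviewUrl() {
        return baseUrl + "/jobs/overview";
    }

    /** URL returning the exception information of one archived job. */
    String exceptionsUrl(String jobId) {
        return baseUrl + "/jobs/" + jobId + "/exceptions";
    }

    /** Performs a GET request and returns the response body (JSON) as a string. */
    String get(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // The base URL is an assumption; pass your own as the first argument.
        String base = args.length > 0 ? args[0] : "http://localhost:8082";
        HistoryServerClient client = new HistoryServerClient(base);
        System.out.println(client.get(client.jobsOverviewUrl()));
    }
}
```

The job IDs returned by the overview call can then be passed to exceptionsUrl to fetch the failure details of a specific job.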