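We have a Flink streaming job whose Flink version was upgraded last week. When I checked on it this morning, I found that the job had restarted at around 4 a.m. The exception shown in the Flink UI was as follows.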

Exception log

2020-08-10 04:07:23
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
    at java.lang.Thread.run(Thread.java:748)

Key information

2020-08-10 04:07:23

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
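When the same exception shows up in many log lines, it can help to pull the remote address out of the message automatically. The following is a minimal sketch, assuming the message keeps the wording shown above; the function name `remote_task_manager` and the regular expression are illustrative and not part of Flink. It handles IPv4 addresses and host names only.

```python
import re

# Matches the quoted address in the RemoteTransportException message.
# Anything before "/" (an optional host name) is skipped; the host and
# port after it are captured. IPv6 literals are not handled.
_REMOTE_TM = re.compile(r"remote task manager '[^'/]*/([^':]+):(\d+)'")


def remote_task_manager(message):
    """Return (host, port) of the lost task manager, or None if not found."""
    match = _REMOTE_TM.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    line = (
        "Connection unexpectedly closed by remote task manager "
        "'/9.150.12.175:39365'. This might indicate that the remote "
        "task manager was lost."
    )
    print(remote_task_manager(line))
```

The returned host is the machine to inspect first; the port is an ephemeral data port and is usually not useful after the task manager has exited.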

 

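My first guess was that something had gone wrong on the machine 9.150.12.175.

The YARN resource manager UI supported this and pointed to a problem with the machine itself.

The usual causes are running out of memory or running out of disk space, though other problems are possible.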

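I logged in to the affected machine and listed the Java processes with jps. Only the YARN NodeManager was still running, and its start time was from long ago, so it had not been restarted. All other tasks on the node had been killed.

The YARN NodeManager log reported that disk usage had exceeded 90%.

I then checked the current disk usage.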
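The same check can be scripted. The sketch below is an assumption-laden illustration, not what was run here: it computes the used percentage of one filesystem and compares it with a 90% limit, which is the default of YARN's `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` property. The path `/` is a placeholder, since the post does not say which directories the NodeManager uses, and the function names are hypothetical.

```python
import shutil


def used_percent(total_bytes, free_bytes):
    """Percentage of a filesystem in use, derived from total and free bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def exceeds_limit(total_bytes, free_bytes, limit_percent=90.0):
    """True when utilization is strictly above the limit."""
    return used_percent(total_bytes, free_bytes) > limit_percent


if __name__ == "__main__":
    # Placeholder path (assumption); point it at the NodeManager's
    # local or log directory on the machine being checked.
    path = "/"
    usage = shutil.disk_usage(path)
    pct = used_percent(usage.total, usage.free)
    over = exceeds_limit(usage.total, usage.free)
    print(f"{path}: {pct:.1f}% used, over limit: {over}")
```

Computing usage from free space rather than from the reported used bytes treats blocks reserved for root as occupied, which is usually what matters for a process running as an ordinary user.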

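The result matched the YARN log: disk usage was above the threshold configured in YARN. Going through the logs, I found large log files left over from earlier runs. After I cleaned up the expired logs and restarted, tasks were scheduled onto the machine again and everything returned to normal. I also asked the operations team to add disk monitoring to every node in the cluster, with an alert when usage reaches 85%.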
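The cleanup and the alert rule can be sketched as follows. This is only an illustration under stated assumptions: the age and size cutoffs, the directory passed in, and the names `stale_logs` and `alert_level` are all hypothetical. The script lists candidates and deletes nothing. The 85% and 90% levels are the ones mentioned above.

```python
import os
import time


def stale_logs(root, min_age_days=7, min_size_bytes=100 * 1024 * 1024, now=None):
    """List (path, size) for files under root that are both old and large.

    Results are sorted with the largest file first.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_mtime < cutoff and info.st_size >= min_size_bytes:
                found.append((path, info.st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)


def alert_level(used_percent, warn_at=85.0, unhealthy_at=90.0):
    """Classify disk utilization: warn at 85%, unhealthy above 90%."""
    if used_percent > unhealthy_at:
        return "unhealthy"
    if used_percent >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Placeholder directory (assumption); use the node's real log directory.
    for path, size in stale_logs("/var/log"):
        print(f"{size / 1024 / 1024:.0f} MiB  {path}")
    print(alert_level(87.5))
```

Warning at 85% leaves a margin before the 90% level at which the NodeManager stopped accepting work, so old logs can be cleaned up before running tasks are affected.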
