Spark Streaming WordCount Example and Source Code Analysis (Part 2)

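  ReceiverTracker runs on the driver and acts as the coordinator for the Receivers distributed across the executors.
  It oversees data reception, data buffering, and block generation, which the Receivers carry out on the executors. JobScheduler holds a ReceiverTracker instance and starts it in its own start() method. The most important job of ReceiverTracker.start() is to call launchReceivers(), which distributes the Receivers to the executors. On each executor, a ReceiverSupervisor then starts one Receiver to receive data.
  ReceiverTracker.start() looks like this: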

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

  ReceiverTracker first sets up an RPC endpoint named ReceiverTracker. This ReceiverTrackerEndpoint receives the messages sent by the Receivers. start() then calls launchReceivers():

/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}

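  As the comment says, launchReceivers takes each ReceiverInputDStream in receiverInputStreams and calls its getReceiver() method. In this example that is the getReceiver() of SocketInputDStream from the previous part, which returns a SocketReceiver. runDummySparkJob() runs a trivial job first so that the executors have had a chance to register, which keeps the Receivers from all being scheduled onto a single node. Finally, a StartAllReceivers message is sent to endpoint, which is the ReceiverTrackerEndpoint that ReceiverTracker has just set up. The endpoint handles the message as follows: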

// Local messages
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }
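  To make the shape of scheduledLocations concrete, here is a minimal sketch of a scheduling function that maps each stream id to a list of executor locations. It is a deliberately simplified round-robin assignment, not the policy Spark's scheduleReceivers actually implements; the object name, the method name, and the use of plain strings for executor locations are all assumptions made for illustration.

```scala
object ReceiverSchedulingSketch {
  // Hypothetical simplification: give every receiver (identified by its stream id)
  // exactly one executor, chosen round-robin. Spark's real policy is more involved.
  def scheduleRoundRobin(
      streamIds: Seq[Int],
      executors: Seq[String]): Map[Int, Seq[String]] = {
    if (executors.isEmpty) {
      // No executors known yet: leave the locations empty, which corresponds to the
      // scheduledLocations.isEmpty branch in startReceiver (no preferred location).
      streamIds.map(id => id -> Seq.empty[String]).toMap
    } else {
      streamIds.zipWithIndex.map { case (id, index) =>
        id -> Seq(executors(index % executors.length))
      }.toMap
    }
  }
}
```

  With three receivers and two executors, the third receiver wraps around to the first executor again.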

  The handler computes the scheduled locations for every receiver and then calls startReceiver to start each receiver on its executors:

/**
 * Start a receiver along with its scheduled executors
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
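  The future.onComplete block at the end of startReceiver resubmits the receiver job whether it succeeded or failed, until the tracker is stopping. The sketch below reproduces only that restart loop with plain Scala futures. keepRunning, job, and shouldStart are hypothetical names, and the direct recursive resubmission stands in for the RestartReceiver message that the real code sends to itself.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.{Failure, Success}

object RestartLoopSketch {
  // Runs `job` repeatedly until `shouldStart` returns false.
  // The returned future completes with the number of times the job was submitted.
  def keepRunning(job: () => Unit, shouldStart: () => Boolean)(
      implicit ec: ExecutionContext): Future[Int] = {
    val finished = Promise[Int]()
    val submissions = new AtomicInteger(0)

    def submit(): Unit = {
      submissions.incrementAndGet()
      Future(job()).onComplete {
        case Success(_) => restartOrFinish()
        case Failure(_) => restartOrFinish() // a failed job is restarted as well
      }
    }

    def restartOrFinish(): Unit = {
      if (!shouldStart()) finished.trySuccess(submissions.get())
      else submit()
    }

    if (shouldStart()) submit() else finished.trySuccess(0)
    finished.future
  }
}
```

  Treating Success and Failure the same way is the point of the pattern: a receiver job that returns normally is just as unexpected as one that throws, because a receiver is supposed to run for the lifetime of the streaming application.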

  The Receiver is started by submitting a job through ssc.sparkContext.submitJob. The job runs startReceiverFunc, which defines what has to happen on the worker node to start the receiver: it instantiates a ReceiverSupervisorImpl and calls its start() method:

/** Start the supervisor */
def start() {
  onStart()
  startReceiver()
}

  start() therefore runs onStart() and then startReceiver() on the ReceiverSupervisorImpl instance. onStart() is:

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}

  registeredBlockGenerators is defined as follows:

private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]
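  mutable.SynchronizedBuffer was deprecated in later Scala versions, so code written today would usually reach for a java.util.concurrent collection to get the same thread-safe registry. The following is only a sketch of that idea under that assumption: FakeGenerator and Registry are hypothetical stand-ins, not Spark classes.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object GeneratorRegistrySketch {
  // Hypothetical stand-in for BlockGenerator; it only records that start() was called.
  final class FakeGenerator(val name: String) {
    @volatile var started: Boolean = false
    def start(): Unit = { started = true }
  }

  final class Registry {
    // Thread-safe without external locking, so generators may be registered
    // from one thread while another thread iterates over them.
    private val registered = new ConcurrentLinkedQueue[FakeGenerator]()

    def register(name: String): FakeGenerator = {
      val generator = new FakeGenerator(name)
      registered.add(generator)
      generator
    }

    // Mirrors onStart(): start every generator registered so far.
    def startAll(): Unit = {
      val it = registered.iterator()
      while (it.hasNext) it.next().start()
    }
  }
}
```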

  BlockGenerator.start() starts blockIntervalTimer and blockPushingThread:

/** Start block generating and pushing threads. */
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start()
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}

  blockIntervalTimer and blockPushingThread are defined as follows:

  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  Once every spark.streaming.blockInterval (200 ms by default), blockIntervalTimer turns the contents of currentBuffer into a new Block and puts that Block into the blocksForPushing queue. blockPushingThread takes blocks out of blocksForPushing and hands them to the BlockGeneratorListener callback. Here the BlockGeneratorListener is the defaultBlockGeneratorListener in ReceiverSupervisorImpl, which stores the block and reports its ReceivedBlockInfo to the driver.
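  The buffer-swap and bounded-queue mechanics described above can be sketched in a few lines. This is a simplified model rather than Spark's BlockGenerator: MiniBlockGenerator, Block, onPush, and pushPending are hypothetical names, the timer and the pushing thread are replaced by methods that the caller invokes by hand, and records are plain strings.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BlockGeneratorSketch {
  final case class Block(id: Long, items: Seq[String])

  // `onPush` stands in for the BlockGeneratorListener callback.
  final class MiniBlockGenerator(queueSize: Int, onPush: Block => Unit) {
    private var currentBuffer = new ArrayBuffer[String]
    private var nextId = 0L
    private val blocksForPushing = new ArrayBlockingQueue[Block](queueSize)

    // Called by the receiver for every record.
    def addData(item: String): Unit = synchronized { currentBuffer += item }

    // What the timer would call once per block interval:
    // swap out the buffer and enqueue its contents as one block.
    def updateCurrentBuffer(): Unit = {
      val maybeBlock = synchronized {
        if (currentBuffer.isEmpty) {
          None
        } else {
          val block = Block(nextId, currentBuffer.toList)
          nextId += 1
          currentBuffer = new ArrayBuffer[String]
          Some(block)
        }
      }
      // put() blocks when the queue is full, which applies back pressure.
      maybeBlock.foreach(block => blocksForPushing.put(block))
    }

    // What the pushing thread would do in a loop; here it drains the queue once
    // and returns the number of blocks handed to the listener.
    def pushPending(): Int = {
      var pushed = 0
      var block = blocksForPushing.poll()
      while (block != null) {
        onPush(block)
        pushed += 1
        block = blocksForPushing.poll()
      }
      pushed
    }
  }
}
```

  Separating the swap (under the lock) from the enqueue (outside it) keeps addData from being blocked while the queue is full; only the timer side waits.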

  Next, look at ReceiverSupervisor.startReceiver:

/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
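  The method above is a small handshake: ask the driver for permission, start the receiver only if permission is granted, and stop on refusal or on any non-fatal error. A minimal sketch of that control flow follows. MiniSupervisor and its register and onStart parameters are hypothetical; register stands in for the registration round trip to the driver.

```scala
import scala.util.control.NonFatal

object SupervisorHandshakeSketch {
  sealed trait State
  case object Initialized extends State
  case object Started extends State
  case object Stopped extends State

  final class MiniSupervisor(register: () => Boolean, onStart: () => Unit) {
    @volatile private var state: State = Initialized
    @volatile private var stopMessage: Option[String] = None

    def currentState: State = state
    def lastStopMessage: Option[String] = stopMessage

    def startReceiver(): Unit = synchronized {
      try {
        if (register()) {
          // Same order as the original: mark as started, then call onStart().
          state = Started
          onStart()
        } else {
          stop("Driver refused to start receiver")
        }
      } catch {
        case NonFatal(t) => stop("Error starting receiver: " + t.getMessage)
      }
    }

    private def stop(message: String): Unit = {
      state = Stopped
      stopMessage = Some(message)
    }
  }
}
```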

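  Here onReceiverStart is the implementation in the subclass ReceiverSupervisorImpl. It sends a RegisterReceiver message to the ReceiverTrackerEndpoint to register the current Receiver. If registration succeeds, the receiver's onStart() method is called; the receiver in this example is a SocketReceiver. SocketReceiver.onStart() and the receive() method it runs are: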

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }
  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

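  As the code shows, receive() opens a socket connection and keeps reading from it, calling store for every record that arrives. The call chain is Receiver.store -> ReceiverSupervisorImpl.pushSingle -> BlockGenerator.addData, so the data ends up in the currentBuffer of the BlockGenerator instance.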
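  The three-step call chain can be modelled with three small classes that each delegate to the next. This is an illustrative sketch only: the Mini* classes are hypothetical, records are read line by line from an arbitrary InputStream instead of a socket, and the restart and stop handling of the real SocketReceiver is left out.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object StoreChainSketch {
  // Last link of the chain: records accumulate in currentBuffer.
  final class MiniBlockGenerator {
    private val currentBuffer = new ArrayBuffer[String]
    def addData(item: String): Unit = synchronized { currentBuffer += item; () }
    def snapshot: List[String] = synchronized { currentBuffer.toList }
  }

  // Middle link: the supervisor forwards single records to its generator.
  final class MiniSupervisor(generator: MiniBlockGenerator) {
    def pushSingle(item: String): Unit = generator.addData(item)
  }

  // First link: the receiver reads records and stores each one.
  final class MiniReceiver(supervisor: MiniSupervisor) {
    def store(item: String): Unit = supervisor.pushSingle(item)

    def receive(in: InputStream): Unit = {
      val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (line != null) {
        store(line)
        line = reader.readLine()
      }
    }
  }
}
```

  Feeding the receiver a stream containing two lines leaves exactly those two lines, in order, in the generator's buffer, ready for the next block interval to turn them into a block.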


Spark Streaming WordCount Example and Source Code Analysis (Part 1)
Spark Streaming WordCount Example and Source Code Analysis (Part 2)
Spark Streaming WordCount Example and Source Code Analysis (Part 3)
