
Deep dive into the Amazon Managed Service for Apache Flink application lifecycle – Part 2


In Part 1 of this series, we discussed the fundamental operations that control the lifecycle of your Amazon Managed Service for Apache Flink application. If you're using higher-level tools such as AWS CloudFormation or Terraform, the tool will execute these operations for you. However, understanding the fundamental operations and what the service automatically does can provide some degree of Mechanical Sympathy to confidently implement a more robust automation.

In the first part of this series, we focused on the happy paths. In an ideal world, failures don't happen, and every change you deploy works perfectly. However, the real world is less predictable. Quoting Werner Vogels, Amazon's CTO, "Everything fails, all the time."

In this post, we explore failure scenarios that can happen during normal operations or when you deploy a change or scale the application, and how to monitor operations to detect and recover when something goes wrong.

The less happy path

A robust automation must be designed to handle failure scenarios, particularly during operations. To do that, we need to understand how Apache Flink can deviate from the happy path. Because of the nature of Flink as a stateful stream processing engine, detecting and resolving failure scenarios requires different strategies compared to other long-running applications, such as microservices or short-lived serverless functions (such as AWS Lambda).

Flink's behavior on runtime errors: The fail-and-restart loop

When a Flink job encounters an unexpected error at runtime (an unhandled exception), the normal behavior is to fail, stop the processing, and restart from the latest checkpoint. Checkpoints allow Flink to support data consistency and no data loss in case of failure. Also, because Flink is designed for stream processing applications, which run continuously, if the error happens again, the default behavior is to keep restarting, hoping the problem is transient and the application will eventually recover normal processing.

In some cases, however, the problem is not transient. For example, when you deploy a code change that contains a bug, causing the job to fail as soon as it starts processing data, or when the expected schema doesn't match the records in the source, causing deserialization or processing errors. The same scenario might also happen if you mistakenly changed a configuration that prevents a connector from reaching the external system. In these cases, the job is stuck in a fail-and-restart loop, indefinitely, or until you actively force-stop it.

When this happens, the Managed Service for Apache Flink application status might be RUNNING, but the underlying Flink job is actually failing and restarting. The AWS Management Console gives you a hint, indicating that the application might need attention (see the following screenshot).

Application needs attention

In the following sections, we learn how to monitor the application and job status, to automatically react to this situation.

When starting or updating the application goes wrong

To understand the failure mode, let's review what happens automatically when you start the application, or when the application restarts after you issued an UpdateApplication command, as we explored in Part 1 of this series. The following diagram illustrates what happens when an application starts.

Application start process

The workflow consists of the following steps:

  1. Managed Service for Apache Flink provisions a cluster dedicated to your application.
  2. The code and configuration are submitted to the Job Manager node.
  3. The code in the main() method of your application runs, defining the dataflow of your application.
  4. Flink deploys to the Task Manager nodes the subtasks that make up your job.
  5. The job and application status change to RUNNING. However, subtasks start initializing now.
  6. Subtasks restore their state, if applicable, and initialize any resources. For example, a Kafka connector's subtask initializes the Kafka client and subscribes to the topic.
  7. When all subtasks are successfully initialized, they change to RUNNING status and the job starts processing data.

To new Flink users, it can be confusing that a RUNNING status doesn't necessarily imply the job is healthy and processing data.

When something goes wrong during the process of starting (or restarting) the application, depending on the phase when the problem arises, you might observe two different types of failure modes:

  • (a) A problem prevents the application code from being deployed – Your application might encounter this failure scenario if the deployment fails as soon as the code and configuration are passed to the Job Manager (step 2 of the process), for example if the application code package is malformed. A typical error is when the JAR is missing a mainClass or if mainClass points to a class that doesn't exist. This failure mode might also happen if the code of your main() method throws an unhandled exception (step 3). In these cases, the application fails to change to RUNNING, and reverts to READY after the attempt.
  • (b) The application is started, but the job is stuck in a fail-and-restart loop – A problem might occur later in the process, after the application status has changed to RUNNING. For example, after the Flink job has been deployed to the cluster (step 4 of the process), a component might fail to initialize (step 6). This can happen when a connector is misconfigured, or a problem prevents it from connecting to the external system. For example, a Kafka connector might fail to connect to the Kafka cluster because of the connector's misconfiguration or networking issues. Another possible scenario is when the Flink job successfully initializes, but it throws an exception as soon as it starts processing data (step 7). When this happens, Flink reacts to a runtime error and might get stuck in a fail-and-restart loop.

The following diagram illustrates the sequence of application statuses, including the two failure scenarios just described.

Application statuses, with failure scenarios

Troubleshooting

We have examined what can go wrong during operations, particularly when you update a RUNNING application or restart an application after changing its configuration. In this section, we explore how we can act on these failure scenarios.

Roll back a change

When you deploy a change and realize something is not quite right, you usually want to roll back the change and put the application back in working order, while you investigate and fix the problem. Managed Service for Apache Flink offers a graceful way to revert (roll back) a change, also restarting the processing from the point it was stopped before applying the faulty change, providing consistency and no data loss.

In Managed Service for Apache Flink, there are two types of rollbacks:

  • Automatic – During an automatic rollback (also called system rollback), if enabled, the service automatically detects when the application fails to restart after a change, or when the job starts but immediately falls into a fail-and-restart loop. In these situations, the rollback process automatically restores the application configuration version before the last change was applied and restarts the application from the snapshot taken when the change was deployed. See Improve the resilience of Amazon Managed Service for Apache Flink application with system-rollback feature for more details. This feature is disabled by default. You can enable it as part of the application configuration.
  • Manual – A manual rollback API operation is like a system rollback, but it's initiated by the user. If the application is running but you observe something not behaving as expected after applying a change, you can trigger the rollback operation using the RollbackApplication API action or the console. Manual rollback is possible when the application is RUNNING or UPDATING.
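The manual rollback can be scripted with the AWS CLI. The following is a minimal sketch, not the post's own tooling: the application name is a placeholder, and the DRY_RUN guard (our own convention, enabled by default) prints the command instead of calling AWS, so the status check can be exercised without credentials.

```shell
#!/bin/sh
# Sketch: trigger a manual rollback only when the application status allows it.
APP_NAME="MyApplication"
DRY_RUN="${DRY_RUN:-1}"

if [ "$DRY_RUN" = "1" ]; then
  # Sample values standing in for the DescribeApplication response
  STATUS="UPDATING"
  VERSION_ID="5"
else
  DESCRIBE=$(aws kinesisanalyticsv2 describe-application --application-name "$APP_NAME")
  STATUS=$(echo "$DESCRIBE" | jq -r '.ApplicationDetail.ApplicationStatus')
  VERSION_ID=$(echo "$DESCRIBE" | jq -r '.ApplicationDetail.ApplicationVersionId')
fi

# Manual rollback is only possible while the application is RUNNING or UPDATING.
case "$STATUS" in
  RUNNING|UPDATING)
    CMD="aws kinesisanalyticsv2 rollback-application \
      --application-name $APP_NAME --current-application-version-id $VERSION_ID"
    if [ "$DRY_RUN" = "1" ]; then echo "would run: $CMD"; else $CMD; fi
    ;;
  *)
    echo "cannot roll back: application is $STATUS" >&2
    ;;
esac
```

Note that RollbackApplication requires the current application version ID, which is why the sketch reads it from DescribeApplication.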

Both rollbacks work similarly, restoring the configuration version before the change and restarting with the snapshot taken before the change. This prevents data loss and brings you back to a version of the application that was working. Also, this uses the code package that was saved at the time you created the previous configuration version (the one you're rolling back to), so there is no inconsistency between code, configuration, and snapshot, even if in the meantime you have replaced or deleted the code package from the Amazon Simple Storage Service (Amazon S3) bucket.

Implicit rollback: Update with an older configuration

A third way to roll back a change is to simply update the configuration, bringing it back to what it was before the last change. This creates a new configuration version, and requires the correct version of the code package to be available in the S3 bucket when you issue the UpdateApplication command.

Why is there a third option when the service offers system rollback and the managed RollbackApplication action? Because most high-level infrastructure-as-code (IaC) frameworks such as Terraform use this method, explicitly overwriting the configuration. It is important to understand this possibility, even though you will probably use the managed rollback if you implement your automation based on the low-level actions.

The following are two important caveats to consider for this implicit rollback:

  • You will usually want to restart the application from the snapshot that was taken before the faulty change was deployed. If the application is currently RUNNING and healthy, this is not the latest snapshot (RESTORE_FROM_LATEST_SNAPSHOT), but rather the previous one. You must set the restart to RESTORE_FROM_CUSTOM_SNAPSHOT and select the correct snapshot.
  • UpdateApplication only works if the application is RUNNING and healthy, and the job can be gracefully stopped with a snapshot. Conversely, if the application is stuck in a fail-and-restart loop, you must force-stop it first, change the configuration while the application is READY, and later start the application from the snapshot that was taken before the faulty change was deployed.
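Selecting the snapshot taken before the faulty change can be automated with jq. The following sketch uses a hard-coded sample of a ListApplicationSnapshots response (the snapshot names are made up); in a real run you would pipe the actual CLI output instead.

```shell
# Sketch: pick the second-latest READY snapshot, i.e. the one taken before the
# faulty change was deployed. The JSON below is a stand-in for the output of:
#   aws kinesisanalyticsv2 list-application-snapshots --application-name MyApplication
SNAPSHOTS='{
  "SnapshotSummaries": [
    {"SnapshotName": "before-v5-deploy", "SnapshotStatus": "READY", "SnapshotCreationTimestamp": 1754954662},
    {"SnapshotName": "before-v4-deploy", "SnapshotStatus": "READY", "SnapshotCreationTimestamp": 1754868262}
  ]
}'

# Sort READY snapshots by creation time and take the second newest.
PREVIOUS_SNAPSHOT=$(echo "$SNAPSHOTS" | jq -r '
  [.SnapshotSummaries[] | select(.SnapshotStatus == "READY")]
  | sort_by(.SnapshotCreationTimestamp) | reverse | .[1].SnapshotName')

echo "$PREVIOUS_SNAPSHOT"
```

You would then pass this name as SnapshotName in the StartApplication run configuration, with ApplicationRestoreType set to RESTORE_FROM_CUSTOM_SNAPSHOT.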

Force-stop the application

In normal scenarios, you stop the application gracefully, with the automatic snapshot creation. However, this might not be possible in some scenarios, such as when the Flink job is stuck in a fail-and-restart loop. This can happen, for example, if an external system the job uses stops working, or because the AWS Identity and Access Management (IAM) configuration was erroneously changed, removing permissions required by the job.

When the Flink job gets stuck in a fail-and-restart loop after a faulty change, your first option should be using RollbackApplication, which automatically restores the previous configuration and starts from the correct snapshot. In the rare cases you can't stop the application gracefully or use RollbackApplication, the last resort is force-stopping the application. Force-stop uses the StopApplication command with Force=true. You can also force-stop the application from the console.

When you force-stop an application, no snapshot is taken (if that were possible, you would have been able to gracefully stop). When you restart the application, you can either skip restoring from a snapshot (SKIP_RESTORE_FROM_SNAPSHOT) or use a snapshot that was previously taken, scheduled using Snapshot Manager, or taken manually, using the console or the CreateApplicationSnapshot API action.
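The force-stop and restart sequence can be sketched as follows. The application and snapshot names are placeholders, and the DRY_RUN guard (our own convention, not part of the service) prints the commands rather than executing them, so the sketch can run without AWS access.

```shell
#!/bin/sh
# Sketch: force-stop an application stuck in a fail-and-restart loop, then
# restart it from a previously taken snapshot.
APP_NAME="MyApplication"
SNAPSHOT_NAME="before-faulty-deploy"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# --force skips the graceful stop (and therefore takes no snapshot).
run aws kinesisanalyticsv2 stop-application \
    --application-name "$APP_NAME" --force

# After fixing the configuration while the application is READY, restart it
# from the chosen snapshot.
run aws kinesisanalyticsv2 start-application \
    --application-name "$APP_NAME" \
    --run-configuration '{
      "ApplicationRestoreConfiguration": {
        "ApplicationRestoreType": "RESTORE_FROM_CUSTOM_SNAPSHOT",
        "SnapshotName": "'"$SNAPSHOT_NAME"'"
      }
    }'
```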

We strongly recommend setting up scheduled snapshots for all production applications that you can't afford to restart without state.

Monitoring Apache Flink application operations

Effective monitoring of your Apache Flink applications during and after operations is crucial to verify the outcome of the operation and allow lifecycle automation to raise alarms or react, in case something goes wrong.

The main indicators you can use during operations include the fullRestarts metric (available in Amazon CloudWatch) and the application, job, and task status.

Monitoring the outcome of an operation

The simplest way to detect the outcome of an operation, such as StartApplication or UpdateApplication, is to use the ListApplicationOperations API command. This command returns a list of the most recent operations of a particular application, including maintenance events that force an application restart.

For example, to retrieve the status of the most recent operation, you can use the following command:

aws kinesisanalyticsv2 list-application-operations \
    --application-name MyApplication \
    | jq '.ApplicationOperationInfoList | sort_by(.StartTime) | last'

The output will be similar to the following code:

{
  "Operation": "UpdateApplication",
  "OperationId": "12abCDeGghIlM",
  "StartTime": "2025-08-06T09:24:22+01:00",
  "EndTime": "2025-08-06T09:26:56+01:00",
  "OperationStatus": "IN_PROGRESS"
}

OperationStatus follows the same logic as the application status reported by the console and by DescribeApplication. This means it might not detect a failure during operator initialization or while the job starts processing data. As we have learned, these failures might put the application in a fail-and-restart loop. To detect these scenarios in your automation, you must use other strategies, which we cover in the rest of this section.

Detecting the fail-and-restart loop using the fullRestarts metric

The simplest way to detect whether the application is stuck in a fail-and-restart loop is using the fullRestarts metric, available in CloudWatch Metrics. This metric counts the number of restarts of the Flink job after you started the application with a StartApplication command or restarted it with UpdateApplication.

In a healthy application, the number of full restarts should ideally be zero. A single full restart might be acceptable during deployment or planned maintenance; multiple restarts usually indicate some issue. We recommend not triggering an alarm on a single restart, or even a couple of consecutive restarts.

The alarm should only be triggered when the application is stuck in a fail-and-restart loop. This implies checking whether multiple restarts have occurred over a relatively short period of time. Deciding the interval is not trivial, because the time the Flink job takes to restart from a checkpoint depends on the size of the application state. However, if the state of your application is lower than a few GB per KPU, you can safely assume the application should start in less than a minute.

The goal is creating a CloudWatch alarm that triggers when fullRestarts keeps growing over a time interval sufficient for multiple restarts. For example, assuming your application restarts in less than 1 minute, you can create a CloudWatch alarm that relies on the DIFF math expression of the fullRestarts metric. The following screenshot shows an example of the alarm details.

CloudWatch Alarm on fullRestarts

This example is a conservative alarm, only triggering if the application keeps restarting for over 5 minutes. This means you detect the problem after at least 5 minutes. You can consider reducing the time to detect the failure earlier. However, be careful not to trigger an alarm after just one or two restarts. Occasional restarts might happen, for example during normal maintenance (patching) that's managed by the service, or because of a transient error of an external system. Flink is designed to recover from these scenarios with minimal downtime and no data loss.
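A conservative 5-minute alarm like the one described can also be created with the AWS CLI and metric math. The following is a sketch: the alarm name and the Application dimension value are placeholders, the put-metric-alarm call is left commented out, and the metric definition is only validated locally with jq.

```shell
#!/bin/sh
# Sketch: alarm when fullRestarts keeps increasing, minute over minute, for
# 5 consecutive minutes. Adjust the "Application" dimension to your app name.
cat > /tmp/full-restarts-metrics.json <<'EOF'
[
  {
    "Id": "m1",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/KinesisAnalytics",
        "MetricName": "fullRestarts",
        "Dimensions": [{"Name": "Application", "Value": "MyApplication"}]
      },
      "Period": 60,
      "Stat": "Maximum"
    },
    "ReturnData": false
  },
  {
    "Id": "e1",
    "Expression": "DIFF(m1)",
    "Label": "fullRestarts increase per minute",
    "ReturnData": true
  }
]
EOF

# Validate the metric definition locally.
jq -e 'length == 2' /tmp/full-restarts-metrics.json > /dev/null && echo "metrics ok"

# The actual alarm creation (sketch, commented out):
# aws cloudwatch put-metric-alarm \
#     --alarm-name flink-fail-and-restart-loop \
#     --metrics file:///tmp/full-restarts-metrics.json \
#     --comparison-operator GreaterThanThreshold --threshold 0 \
#     --evaluation-periods 5 --datapoints-to-alarm 5
```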

Detecting whether the job is up and running: Monitoring application, job, and task status

We have discussed how there are different statuses: the status of the application, job, and subtasks. In Managed Service for Apache Flink, the application and job status change to RUNNING when the subtasks are successfully deployed on the cluster. However, the job is not really running and processing data until all the subtasks are RUNNING.

Observing the application status during operations

The application status is visible on the console, as shown in the following screenshot.

Screenshot: Application status

In your automation, you can poll the DescribeApplication API action to monitor the application status. The following command shows how to use the AWS Command Line Interface (AWS CLI) and the jq command to extract the status string of an application:

aws kinesisanalyticsv2 describe-application \
    --application-name <your-application-name> \
    | jq -r '.ApplicationDetail.ApplicationStatus'
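A simple wait loop with a timeout can be built on top of this command. In the following sketch, get_status reads from a simulated sequence of statuses so the loop can run without AWS access; the real DescribeApplication call is shown in the comment.

```shell
#!/bin/sh
# Sketch: poll the application status until it returns to RUNNING or a
# timeout expires.
STATUSES="UPDATING UPDATING RUNNING"   # simulated sequence of observed statuses
ATTEMPT=0
MAX_ATTEMPTS=30   # with a 10-second sleep, roughly a 5-minute timeout
RESULT="TIMEOUT"

get_status() {
  # Real implementation:
  # aws kinesisanalyticsv2 describe-application --application-name "$1" \
  #   | jq -r '.ApplicationDetail.ApplicationStatus'
  echo "$STATUSES" | cut -d' ' -f"$((ATTEMPT + 1))"
}

while [ "$ATTEMPT" -lt "$MAX_ATTEMPTS" ]; do
  STATUS=$(get_status MyApplication)
  if [ "$STATUS" = "RUNNING" ]; then
    RESULT="RUNNING"
    break
  fi
  ATTEMPT=$((ATTEMPT + 1))
  # sleep 10   # pause between polls in a real run
done

echo "final status: $RESULT after $ATTEMPT polls"
```

If the loop ends with TIMEOUT, your automation can raise an alarm or trigger a rollback, as discussed earlier.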

Observing job and subtask status

Managed Service for Apache Flink gives you access to the Flink Dashboard, which provides useful information for troubleshooting, including the status of all subtasks. The following screenshot, for example, shows a healthy job where all subtasks are RUNNING.

Job and Task status

In the following screenshot, we can see a job where subtasks are failing and restarting.

Job status: failing

In your automation, when you start the application or deploy a change, you want to make sure the job is eventually up and running and processing data. This happens when all the subtasks are RUNNING. Note that waiting for the job status to become RUNNING after an operation is not completely safe. A subtask might still fail and cause the job to restart after it was reported as RUNNING.

After you execute a lifecycle operation, your automation can poll the subtask status, waiting for one of two events:

  • All subtasks report RUNNING – This indicates the operation was successful and your Flink job is up and running.
  • Any subtask reports FAILING or CANCELED – This indicates something went wrong, and the application is likely stuck in a fail-and-restart loop. You need to intervene, for example, force-stopping the application and then rolling back the change.

If you're restarting from a snapshot and the state of your application is quite big, you might observe subtasks reporting the INITIALIZING status for longer. During the initialization, Flink restores the state of the operators before changing to RUNNING.

The Flink REST API exposes the state of the subtasks, and can be used in your automation. In Managed Service for Apache Flink, this requires three steps:

  1. Generate a pre-signed URL to access the Flink REST API using the CreateApplicationPresignedUrl API action.
  2. Make a GET request to the /jobs endpoint of the Flink REST API to retrieve the job ID.
  3. Make a GET request to the /jobs/<job-id> endpoint to retrieve the status of the subtasks.
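The decision logic of step 3 can be sketched with jq. The JSON below is an abbreviated, made-up sample of a Flink REST API /jobs/<job-id> response (the vertex names are invented); in a real run you would curl the pre-signed URL instead.

```shell
# Sketch: decide whether a job is fully up from the /jobs/<job-id> response.
JOB_DETAILS='{
  "state": "RUNNING",
  "vertices": [
    {"name": "Source: Kafka", "tasks": {"RUNNING": 2, "FAILED": 0, "CANCELED": 0, "INITIALIZING": 0}},
    {"name": "Sink: S3",      "tasks": {"RUNNING": 2, "FAILED": 0, "CANCELED": 0, "INITIALIZING": 0}}
  ]
}'

# The job is up when no subtask is FAILED or CANCELED and none is still
# INITIALIZING, across all vertices.
JOB_UP=$(echo "$JOB_DETAILS" | jq -r '
  ([.vertices[].tasks.FAILED] | add) == 0
  and ([.vertices[].tasks.CANCELED] | add) == 0
  and ([.vertices[].tasks.INITIALIZING] | add) == 0')

echo "job up: $JOB_UP"
```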

The following GitHub repository provides a shell script to retrieve the status of the tasks of a given Managed Service for Apache Flink application.

Monitoring subtask failures while the job is running

The approach of polling the Flink REST API can be used in your automation, immediately after an operation, to monitor whether the operation was eventually successful.

We strongly recommend not continuously polling the Flink REST API while the job is running to detect failures. This operation is resource consuming, and can degrade performance or cause errors.

To monitor for suspicious subtask status changes during normal operations, we recommend using CloudWatch Logs instead. The following CloudWatch Logs Insights query extracts all subtask state transitions:

fields @timestamp, message
| parse message /^(?<task>.+) switched from (?<fromStatus>[A-Z]+) to (?<toStatus>[A-Z]+)\./
| filter ispresent(task) and ispresent(fromStatus) and ispresent(toStatus)
| display @timestamp, task, fromStatus, toStatus
| limit 10000

How Managed Service for Apache Flink minimizes processing downtime

We have seen how Flink is designed for strong consistency. To guarantee exactly-once state consistency, Flink temporarily stops the processing to deploy any changes, including scaling. This downtime is required for Flink to take a consistent copy of the application state and save it in a savepoint. After the change is deployed, the job is restarted from the savepoint, and there is no data loss. In Managed Service for Apache Flink, updates are fully managed. When snapshots are enabled, UpdateApplication automatically stops the job and uses snapshots (based on Flink's savepoints) to retain the state.

Flink guarantees no data loss. However, your business requirements or Service Level Objectives (SLOs) might also impose a maximum delay for the data received by downstream systems, or end-to-end latency. This delay is affected by the processing downtime, or the time the job doesn't process data to allow Flink to deploy the change.

With Flink, some processing downtime is unavoidable. However, Managed Service for Apache Flink is designed to minimize the processing downtime when you deploy a change.

We have seen how the service runs your application in a dedicated cluster, for full isolation. When you issue UpdateApplication on a RUNNING application, the service prepares a new cluster with the required amount of resources. This operation might take some time. However, this doesn't affect the processing downtime, because the service keeps the job running and processing data on the original cluster until the last possible moment, when the new cluster is ready. At this point, the service stops your job with a savepoint and restarts it on the new cluster.

During this operation, you're only charged for the number of KPUs of a single cluster.

The following diagram illustrates the difference between the duration of the update operation, or the time the application status is UPDATING, and the processing downtime, observable from the job status, visible in the Flink Dashboard.

Downtime

You can observe this process by keeping both the application console and the Flink Dashboard open when you update the configuration of a running application, even with no changes. The Flink Dashboard will become temporarily unavailable when the service switches to the new cluster. Additionally, you can't use the script we provided to check the job status for this scope. Even though the cluster keeps serving the Flink Dashboard until it's torn down, the CreateApplicationPresignedUrl action doesn't work while the application is UPDATING.

The processing downtime (the time the job is not running on either cluster) depends on the time the job takes to stop with a savepoint (snapshot) and restore the state in the new cluster. This time largely depends on the size of the application state. Data skew might also affect the savepoint time due to the barrier alignment mechanism. For a deep dive into Flink's barrier alignment mechanism, refer to Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints, keeping in mind that savepoints are always aligned.

For the scope of your automation, you usually want to wait until the job is back up and running and processing data. You usually want to set a timeout. If both the application and job don't return to RUNNING within this timeout, something probably went wrong and you might want to raise an alarm or force a rollback. This timeout should take into account the entire update operation duration.

Conclusion

In this post, we discussed possible failure scenarios when you deploy a change or scale your application. We showed how the Managed Service for Apache Flink rollback functionalities can seamlessly bring you back to a safe place after a change went wrong. We also explored how you can automate monitoring operations to observe application, job, and subtask status, and how to use the fullRestarts metric to detect when the job is in a fail-and-restart loop.

For more information, see Run a Managed Service for Apache Flink application, Implement fault tolerance in Managed Service for Apache Flink, and Manage application backups using Snapshots.


About the authors

Lorenzo Nicora

Lorenzo works as Senior Streaming Solutions Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open-source technologies extensively and contributed to several projects, including Apache Flink, and is the maintainer of the Flink Prometheus connector.

Felix John

Felix is a Global Solutions Architect and data streaming expert at AWS, based in Germany. He focuses on supporting global automotive and manufacturing customers on their cloud journey. Outside of his professional life, Felix enjoys playing floorball and hiking in the mountains.
