Production Readiness Review

Rebuild SDK

Document Version History

0.1 | Tomas Pilvelis | 11/05/2020

The SRE Team's goal is to provide a smooth experience for the customer in a production environment and to react quickly to production issues. This production readiness review helps eliminate as many as possible of the issues that may arise in a production setting.

1. Coding Standards Requirements

Action | Completed | Date | Sign Off | Evidence

1.1. Python

1.1.1. Adherence to the Google Style Guide for Python.

1.1.2. pylint must be used as a minimum.

1.2. C#

1.2.1. Adherence to the Style Guide used within the Glasswall Cloud Team.

1.2.2. Use of SonarCloud.

1.3. Further Languages

1.3.1. Any further language used must be on the Glasswall SRE Tech Radar. The language must have an approved quality gate and standards.

1.4. Secrets SHOULD NOT be stored in code / source control.

1.5. Optional

1.5.1. Cyclomatic complexity: the lower the value, the easier the code is to test, read and understand. These are key properties of production code. If the developer who wrote the code is needed to explain it, the code is not production ready.

Google Python Style Guide Reference

2. Testing Code Requirements

Action | Completed | Date | Sign Off | Evidence

2.1. Code Testing

2.1.1. At least 60% code coverage overall through unit testing, and at least 80% code coverage on the code in each pull request.

2.1.2. Test cases must be set up for the Lambda function in order to test its main functionality.

2.2. Capacity Testing

2.2.1. Expected capacity must be identified and data provided by capacity testing.

2.2.2. Capacity must be monitored and covered in handover.

2.3. Security Testing

2.3.1. Known / existing flaws must be identified and backlog items created.

2.3.2. Security tests conducted must be identified.

2.3.3. Risk Analysis must be carried out.
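The coverage thresholds in item 2.1.1 can be expressed as a simple gate. The sketch below assumes the pipeline already has covered and total line counts for the whole code base and for the lines changed in a pull request; how those counts are produced is outside this example, and the function names are hypothetical.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Percentage of lines covered; no measurable lines counts as 100%."""
    return 100.0 if total == 0 else 100.0 * covered / total


def coverage_gate(overall, changed, overall_min=60.0, changed_min=80.0):
    """Check (covered, total) pairs against the thresholds from item 2.1.1.

    Returns a list of failure messages; an empty list means the gate passes.
    """
    failures = []
    overall_pct = coverage_percent(*overall)
    changed_pct = coverage_percent(*changed)
    if overall_pct < overall_min:
        failures.append(
            f"overall coverage {overall_pct:.1f}% is below {overall_min:.0f}%"
        )
    if changed_pct < changed_min:
        failures.append(
            f"pull request coverage {changed_pct:.1f}% is below {changed_min:.0f}%"
        )
    return failures
```

For example, `coverage_gate((650, 1000), (45, 50))` returns an empty list, because 65% overall and 90% on the changed lines both meet the thresholds.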

3. CI/CD Pipeline Requirements

Action | Completed | Date | Sign Off | Evidence

3.1. All steps below must be automated:

3.1.1. Linting must be carried out in the pipeline first to eliminate errors during the build process.

3.1.2. Testing every commit (or pull request) before merging to master.

3.1.3. Executing unit tests and code checks.

3.1.4. Running integration tests with other services in a QA environment.

3.1.5. Deployment to production (master branch).

3.1.6. Rolling back or aborting a deployment.

3.2. SRE must have the ability to launch an event to freeze an automatic deployment to production. This should be done via a switch or similar boolean mechanism.
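One possible shape for the freeze switch is a boolean flag that the deployment step checks before doing anything. The sketch below assumes the flag is exposed to the pipeline as an environment variable; the name `DEPLOY_FREEZE` and the `run_deploy` callable are hypothetical, and a real pipeline might read the flag from a parameter store or feature-flag service instead.

```python
import os

# Hypothetical variable name; SRE would set it to freeze automatic deployments.
FREEZE_ENV_VAR = "DEPLOY_FREEZE"
_TRUE_VALUES = {"1", "true", "yes", "on"}


def deployment_frozen(environ=None) -> bool:
    """Return True when the freeze switch is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(FREEZE_ENV_VAR, "").strip().lower() in _TRUE_VALUES


def deploy(release: str, run_deploy, environ=None) -> bool:
    """Run `run_deploy(release)` unless the freeze switch is on.

    Returns True if the deployment ran and False if it was skipped.
    """
    if deployment_frozen(environ):
        print(f"Deployment of {release} skipped: freeze switch is on")
        return False
    run_deploy(release)
    return True
```

Keeping the check in one small function means the same switch can guard every automated route to production.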

4. Cloud Resource Requirements

Action | Completed | Date | Sign Off | Evidence

4.1. No cloud resource (e.g. an IAM policy) should be created manually. This avoids issues when the infrastructure is constructed and destroyed.

4.2. AWS

4.2.1. Lambda

The code running in Lambda must be able to log the event and context passed into the Lambda function, so that SRE can analyse a problem should one occur. Logging or printing should be clear and placed at key 'business logic' steps, so that logs are not cluttered with irrelevant details and give high-level context on the issue. The Lambda should have test cases that can be run from within AWS to confirm that it is operational.

4.2.2. S3

Any S3 deployment packages should be made private for security purposes. Dependencies should be installed prior to upload to S3.

4.2.3. API Gateway

Monitoring Requirements

Monitoring using CloudWatch is not sufficient. The tool is considered unreliable, and its searching and querying are very limited. Any customised API Gateway responses should be relevant, documented and understandable. This is also useful for the end user.

Protection Requirements

TODO

Releasing Requirements

TODO
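For the Lambda logging requirement in item 4.2.1, a handler might look like the sketch below. It logs the event and a summary of the context once at entry, then logs only at business-logic steps. The `process` function is a hypothetical placeholder for the real work, and the choice of context fields is an assumption; the attributes read here are ones the Python Lambda context object provides, and any that are missing are logged as null.

```python
import json
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def describe_context(context) -> dict:
    """Collect the context fields that are useful when analysing a problem."""
    remaining = getattr(context, "get_remaining_time_in_millis", None)
    return {
        "aws_request_id": getattr(context, "aws_request_id", None),
        "function_name": getattr(context, "function_name", None),
        "function_version": getattr(context, "function_version", None),
        "remaining_ms": remaining() if callable(remaining) else None,
    }


def process(event: dict) -> dict:
    """Hypothetical business logic; stands in for the real processing step."""
    return {"received_keys": sorted(event)}


def handler(event, context):
    # Log the event and context once, at entry, as structured JSON.
    logger.info(
        "invocation %s",
        json.dumps(
            {"event": event, "context": describe_context(context)}, default=str
        ),
    )
    result = process(event)
    # Log at the business-logic step only, to keep the logs uncluttered.
    logger.info("processing complete: %d keys", len(result["received_keys"]))
    return {"statusCode": 200, "body": json.dumps(result)}
```

The same handler can be invoked with a saved test event from within AWS to confirm that the function is operational.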

The items below may not apply if only the Serverless Framework is being used.

4.2.4. CodeDeploy

Monitoring of traffic shifting status must be provisioned in SRE's dedicated monitoring systems. This is used for AWS SAM; it may or may not be relevant for the Serverless Framework. The key principle is zero downtime.

4.2.5. CodePipeline / CodeCommit

These technologies are not considered production ready and should not be used. The problem with CodePipeline is that a pipeline must be created for every branch, and this cannot easily be automated without running CloudFormation. Additionally, if a pipeline runs for two consecutive commits concurrently, the state becomes unclear because the two runs cannot be distinguished. If a strong case can be made for their use, SRE can consider these technologies.


Reference Deployment of Serverless Applications Gradually


5. SLI / SLO

Action | Completed | Date | Sign Off | Evidence

5.1. SLOs need to be defined by the Product Owner.

5.2. Values provided for an SLO must be data driven and justified. They are subject to SRE approval.

5.3. The following values MUST be provided for:

5.3.1. Availability: any HTTP return status other than 500-599 is considered successful.

5.3.2. Latency: the % of requests < 400ms and the % of requests < 850ms.

5.4. Any further SLOs must be identified.
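The two SLIs in item 5.3 can be computed directly from request records. The sketch below assumes the status codes and latencies have already been exported from the monitoring system as plain lists; the function names are illustrative.

```python
def availability(status_codes) -> float:
    """Fraction of requests whose HTTP status is outside the 500-599 range."""
    if not status_codes:
        raise ValueError("availability is undefined without any requests")
    successful = sum(1 for code in status_codes if not 500 <= code <= 599)
    return successful / len(status_codes)


def latency_within(latencies_ms, threshold_ms: float) -> float:
    """Fraction of requests that completed in under `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("latency SLI is undefined without any requests")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)
```

For example, `availability([200, 404, 500, 503])` is 0.5, because a 404 still counts as successful under the definition in 5.3.1. `latency_within(latencies, 400)` and `latency_within(latencies, 850)` give the two latency figures.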

6. Handover

The handover process gives SRE an in-depth understanding of the product and lets questions that might otherwise arise further down the line be asked early, within a time frame that allows for speed.

Action | Completed | Date | Sign Off | Evidence

6.1. Handover should consist of 2 sessions within a period of a week, to allow SRE to understand and analyse the product. Handover week consists of:

6.1.1. Training Session

6.1.2. QA Session

6.1.3. Approval / Feedback Session

6.2. Technology that is not on the SRE Tech Radar may, at SRE's discretion, require extra time for familiarisation.

7. Customers

Action | Completed | Date | Sign Off | Evidence

7.1. Onboarding of customers should be autonomous.

7.2. Onboarding a customer should take no longer than 5 minutes.

7.3. The customer management process must be defined, documented and discussed during the handover process.

Production Compliance Checklist

Coding Standards | - | 11
Testing Code | - | 10
CI/CD Pipeline | - | 8
Cloud Resource | - | 21
Overall Production Compliance | 66