A few days ago, I was working on a project that required designing a solution to parse AWS ALB Access Logs and move the data to OpenSearch for monitoring purposes. This led me to develop a Python script utilizing regex to accurately extract and structure the log fields for effective analysis.
About the project
AWS Application Load Balancer (ALB) can store access logs as gzip-compressed files (`.log.gz`) directly in an S3 bucket. To automate the monitoring of these logs, I wrote a Lambda function that is triggered every time a new log file is uploaded to the bucket. The function retrieves the compressed file, decompresses it, reads the log line by line, and uses regex to parse and extract specific fields. Finally, it formats the parsed data and sends it to OpenSearch for monitoring and analysis.
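The flow described above can be sketched roughly as follows. This is a minimal sketch, not the original function: it assumes the uploaded object is gzip-compressed (ALB log file names end in `.log.gz`), the helper name `iter_log_lines` is made up for illustration, and the parsing and OpenSearch indexing steps are left as a comment.

```python
import gzip
import io
import urllib.parse


def iter_log_lines(compressed: bytes):
    """Yield decoded, non-empty lines from a gzip-compressed log file."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        for raw in gz:
            line = raw.decode("utf-8").rstrip("\n")
            if line:
                yield line


def lambda_handler(event, context):
    # boto3 ships with the AWS Lambda Python runtime; it is imported here so
    # the helper above can be tried locally without installing it.
    import boto3

    s3 = boto3.client("s3")
    processed = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in iter_log_lines(body):
            # Parsing the line and sending it to OpenSearch would go here.
            processed += 1
    return {"processed_lines": processed}
```

Keeping the decompression in its own function makes it easy to test with a locally compressed file before wiring up the S3 trigger.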
Parse ALB Access Logs
One of the challenging tasks in this project was reading the log lines and accurately parsing them into fields suitable for sending to OpenSearch, especially since different fields can contain varying types of values.
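Since every captured value starts out as a string, and many fields use `-` (or an empty value) to mean "not available", a small conversion step before indexing can help. The sketch below is one possible approach; the field names and the choice to map `-` to `None` are assumptions for illustration, not part of the original script.

```python
# Hypothetical field names; adjust them to whatever names your parser uses.
INT_FIELDS = {
    "client_port", "target_port", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "matched_rule_priority",
}
FLOAT_FIELDS = {
    "request_processing_time", "target_processing_time",
    "response_processing_time",
}


def coerce(fields: dict) -> dict:
    """Convert string captures to typed values; '-' and '' become None."""
    typed = {}
    for name, value in fields.items():
        if value in ("", "-", None):
            typed[name] = None
        elif name in INT_FIELDS:
            typed[name] = int(value)
        elif name in FLOAT_FIELDS:
            typed[name] = float(value)
        else:
            typed[name] = value
    return typed


print(coerce({"elb_status_code": "200", "target_status_code": "-",
              "request_processing_time": "-1", "type": "https"}))
```

Note that `-1` in the processing-time fields is kept as a number here, so a dashboard can still filter on it.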
Sample log entry:
A log entry consists of fields that are space-delimited, with each field representing a specific piece of information such as request method, URL, and response status.
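To see why a plain split on spaces is not enough, here is a small sketch. The fragment is a made-up piece of a log entry in the documented format, and `shlex` is used only to illustrate quote-aware splitting; it is not the approach taken in this post, which relies on regex.

```python
import shlex

# Illustrative fragment: status codes, byte counts, request, user agent, TLS fields.
fragment = '200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - -'

naive = fragment.split(" ")        # breaks the quoted request into three pieces
quote_aware = shlex.split(fragment)  # keeps each quoted field together

print(len(naive), len(quote_aware))
print(quote_aware[4])
```

The quoted request field contains spaces of its own, so the naive split returns more pieces than there are fields. A regex that treats quoted fields explicitly avoids this.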
I’ve created the following table listing each field, its possible values, and the corresponding regex pattern. This should be useful for anyone working with ALB Access Logs.
Fields | Possible Values | Regex | Regex Explanation |
---|---|---|---|
type | http, https, h2, grpcs, ws, wss | ([^ ]*) | Captures a sequence of characters that are not spaces. |
time | time in ISO 8601 format | ([^ ]*) | Captures a sequence of characters that are not spaces. |
elb | Resource id of ALB | ([^ ]*) | Captures a sequence of characters that are not spaces. |
client:port | IP & port of the requesting client | ([^ ]*):([0-9]*) | Captures two groups: the first group is a non-space sequence, followed by a colon, and the second group is a sequence of digits. |
target:port | IP & port of target or - | ([^ ]*)[:-]([0-9]*) | Captures two groups: the first group is a non-space sequence followed by either a colon or a hyphen, and the second group is a sequence of digits. |
request_processing_time | time in seconds or -1 | ([-.0-9]*) | Captures a sequence of digits, periods, or hyphens. |
target_processing_time | time in seconds or -1 | ([-.0-9]*) | Captures a sequence of digits, periods, or hyphens. |
response_processing_time | time in seconds or -1 | ([-.0-9]*) | Captures a sequence of digits, periods, or hyphens. |
elb_status_code | status code of the response from load balancer | (|[-0-9]*) | Captures either an empty string or a sequence of digits and hyphens. |
target_status_code | status code of the response from the target | (-|[-0-9]*) | Captures either a single hyphen or a sequence of digits and hyphens. |
received_bytes | size of the request, in bytes | ([-0-9]*) | Captures any sequence of digits and hyphens, including an empty string. |
sent_bytes | size of the response, in bytes | ([-0-9]*) | Captures any sequence of digits and hyphens, including an empty string. |
“request” | “request line from the client” | \"([^ ]*) (.*) (- |[^ ]*)\" | Captures 3 groups: 1. a sequence of non-space characters, 2. any sequence of characters, 3. either a hyphen followed by a space, or a sequence of non-space characters. |
“user_agent” | “User-Agent of client” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
ssl_cipher | ssl cipher or - | ([A-Z_0-9-]+) | Captures a sequence of uppercase letters, underscores, digits, and hyphens. |
ssl_protocol | ssl protocol or - | ([A-Za-z0-9.-]*) | Captures a sequence of uppercase and lowercase letters, digits, periods, and hyphens. |
target_group_arn | arn of target group | ([^ ]*) | Captures a sequence of characters that are not spaces. |
“trace_id” | “X-Amzn-Trace-Id header” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
“domain_name” | “SNI domain” or “-” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
“chosen_cert_arn” | “certificate arn presented to the client” or “-” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
matched_rule_priority | 1 to 50,000 or 0 or -1 or - | ([-.0-9]*) | Captures a sequence of digits, periods, or hyphens, including an empty string. |
request_creation_time | time in ISO 8601 format | ([^ ]*) | Captures a sequence of characters that are not spaces. |
“actions_executed” | “actions taken when processing the request” or “-” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
“redirect_url” | “URL of the redirect target” or “-” | \"([^\"]*)\" | Captures any sequence of characters except double quotes within quotes. |
“error_reason” | “error reason code” or “-” | \"([^ ]*)\" | Captures a sequence of non-space characters within quotes. |
“target:port_list” | “IP & port of target” or “-” | \"([^\s]+?)\" | Captures a sequence of one or more non-whitespace characters within quotes, using non-greedy matching. |
“target_status_code_list” | “status codes from the responses of the targets” or “-” | \"([^\s]+)\" | Captures a sequence of one or more non-whitespace characters within quotes. |
“classification” | “classification for desync mitigation” or “-” | \"([^ ]*)\" | Captures a sequence of zero or more non-space characters within quotes. |
“classification_reason” | “classification reason code” or “-” | \"([^ ]*)\" | Captures a sequence of zero or more non-space characters within quotes. |
conn_trace_id | connection traceability ID or no value | ?([^ ]*)? | The pattern begins with a literal space made optional by the first ? (the leading space is easy to lose in table formatting), then captures zero or more non-space characters; the final ? makes the capture group optional. |
new_field | any new field | ( .*)? | Captures an optional space followed by any sequence of characters; ? makes capturing optional. |
This table summarizes each field’s regex pattern and explanation, making it easier to understand how to parse ALB Access Logs.
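A few of the patterns are worth trying in isolation, because their behavior on edge cases is not obvious from the table. The sketch below uses made-up sample values and writes the quotes without backslashes, since the `\"` in the table is only string escaping.

```python
import re

CLIENT = re.compile(r'([^ ]*):([0-9]*)')
TARGET = re.compile(r'([^ ]*)[:-]([0-9]*)')
REQUEST = re.compile(r'"([^ ]*) (.*) (- |[^ ]*)"')

# Greedy matching backtracks to the last colon, separating IP and port.
print(CLIENT.fullmatch("192.168.131.39:2817").groups())
# -> ('192.168.131.39', '2817')

print(TARGET.fullmatch("10.0.0.1:80").groups())
# -> ('10.0.0.1', '80')

# When there is no target, the field is "-" and both groups come back empty.
print(TARGET.fullmatch("-").groups())
# -> ('', '')

print(REQUEST.fullmatch('"GET http://www.example.com:80/ HTTP/1.1"').groups())
# -> ('GET', 'http://www.example.com:80/', 'HTTP/1.1')
```

The `-` case is the reason the target pattern accepts either a colon or a hyphen as the separator: downstream code should expect empty strings for the target IP and port.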
Reference: AWS docs
Sample Code
Run the following Python script on your machine. It parses the sample ALB Access Logs using the regex patterns above and outputs the results in JSON format.
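The original script is not reproduced here, so what follows is a minimal sketch of how the patterns from the table could be combined. The group names (such as `client_ip` or `request_verb`) are chosen for illustration, the sample line is a made-up entry in the documented format, and the two status-code patterns are written as `(|[-0-9]*)` and `(-|[-0-9]*)`.

```python
import json
import re

# (group names, pattern) pairs in log order; the names are illustrative.
FIELD_PATTERNS = [
    (["type"], r'([^ ]*)'),
    (["time"], r'([^ ]*)'),
    (["elb"], r'([^ ]*)'),
    (["client_ip", "client_port"], r'([^ ]*):([0-9]*)'),
    (["target_ip", "target_port"], r'([^ ]*)[:-]([0-9]*)'),
    (["request_processing_time"], r'([-.0-9]*)'),
    (["target_processing_time"], r'([-.0-9]*)'),
    (["response_processing_time"], r'([-.0-9]*)'),
    (["elb_status_code"], r'(|[-0-9]*)'),
    (["target_status_code"], r'(-|[-0-9]*)'),
    (["received_bytes"], r'([-0-9]*)'),
    (["sent_bytes"], r'([-0-9]*)'),
    (["request_verb", "request_url", "request_proto"],
     r'"([^ ]*) (.*) (- |[^ ]*)"'),
    (["user_agent"], r'"([^"]*)"'),
    (["ssl_cipher"], r'([A-Z_0-9-]+)'),
    (["ssl_protocol"], r'([A-Za-z0-9.-]*)'),
    (["target_group_arn"], r'([^ ]*)'),
    (["trace_id"], r'"([^"]*)"'),
    (["domain_name"], r'"([^"]*)"'),
    (["chosen_cert_arn"], r'"([^"]*)"'),
    (["matched_rule_priority"], r'([-.0-9]*)'),
    (["request_creation_time"], r'([^ ]*)'),
    (["actions_executed"], r'"([^"]*)"'),
    (["redirect_url"], r'"([^"]*)"'),
    (["error_reason"], r'"([^ ]*)"'),
    (["target_port_list"], r'"([^\s]+?)"'),
    (["target_status_code_list"], r'"([^\s]+)"'),
    (["classification"], r'"([^ ]*)"'),
    (["classification_reason"], r'"([^ ]*)"'),
]
# The last two patterns carry their own optional leading space.
TAIL_NAMES = ["conn_trace_id", "new_field"]
TAIL_PATTERN = r' ?([^ ]*)?( .*)?'

NAMES = [n for names, _ in FIELD_PATTERNS for n in names] + TAIL_NAMES
LOG_RE = re.compile(" ".join(p for _, p in FIELD_PATTERNS) + TAIL_PATTERN)


def parse_line(line):
    """Return a dict of field name -> captured string, or None if no match."""
    match = LOG_RE.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(NAMES, match.groups()))


# Made-up sample entry in the documented ALB access log format.
SAMPLE_LINE = (
    'http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:'
    'targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337262-36d228ad5d99923122bbe354" "-" "-" 0 '
    '2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.1:80" "200" "-" "-"'
)

print(json.dumps(parse_line(SAMPLE_LINE), indent=2))
```

Fields that did not take part in the match, such as `new_field` for this sample, come back as `None` and are printed as `null` in the JSON output. Lines that do not match at all return `None`, which makes it easy to count or log unparseable entries.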
Have any thoughts or queries? Add them to the comments and let’s discuss!