A few days ago, I was working on a project that required designing a solution to parse AWS ALB Access Logs and move the data to OpenSearch for monitoring purposes. This led me to develop a Python script utilizing regex to accurately extract and structure the log fields for effective analysis.

About the project

AWS Application Load Balancer (ALB) can store access logs as gzip-compressed files directly in an S3 bucket. To automate monitoring of these logs, I wrote a Lambda function that is triggered every time a new log file is uploaded to the bucket. The function retrieves the compressed file, decompresses it, reads the log line by line, and uses regex to parse and extract specific fields. Finally, it formats the parsed data and sends it to OpenSearch for monitoring and analysis.
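The decompress-and-read step can be sketched roughly as follows. ALB delivers its access logs as gzip files (`.log.gz`), so the sketch uses Python's `gzip` module. This is only an illustration, not the actual Lambda function: `iter_log_lines` is a hypothetical helper name, and it assumes the object body has already been downloaded from S3 into memory as bytes (the download itself is left out).

```python
import gzip
import io


def iter_log_lines(compressed_bytes):
    """Yield non-empty log lines from a gzip-compressed ALB log file.

    Hypothetical helper: assumes the S3 object body is already in memory.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed_bytes)) as gz:
        # TextIOWrapper decodes the bytes and splits them on line endings.
        for line in io.TextIOWrapper(gz, encoding="utf-8"):
            line = line.rstrip("\n")
            if line:
                yield line
```

Because it is a generator, each line can be parsed as soon as it is read instead of holding the whole decompressed file as a list.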

Parse ALB Access Logs

One of the challenging tasks in this project was reading the log lines and accurately parsing them into fields suitable for sending to OpenSearch, especially since different fields can contain varying types of values.
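To make the "varying types of values" point concrete, here is a small sketch of type coercion for fields that may hold a number, an empty string, or the `-` placeholder. The helper names `to_number` and `to_timestamp` are hypothetical, and returning `None` for a missing value is an assumption of this sketch; the script later in this post maps missing values to 0 instead.

```python
from datetime import datetime, timezone


def to_number(raw, cast=float):
    """Return cast(raw), or None when the field is empty or the '-' placeholder."""
    if raw in ("", "-"):
        return None
    return cast(raw)


def to_timestamp(raw):
    """Parse an ALB timestamp such as 2018-07-02T22:23:00.186641Z as UTC."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Using `None` keeps "no value" distinguishable from a genuine zero, which may or may not matter for your dashboards.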

Sample log entry:

A log entry consists of space-delimited fields, each representing a specific piece of information such as the request method, URL, or response status. Some fields are wrapped in double quotes and can themselves contain spaces.

h2 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.1.252:48160 10.0.0.66:9000 0.000 0.002 0.000 200 200 5 257 "GET https://10.0.2.105:773/ HTTP/2.0" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337327-72bd00b0343d75b906739c42" "-" "-" 1 2018-07-02T22:22:48.364000Z "redirect" "https://example.com:80/" "-" "10.0.0.66:9000" "200" "-" "-"
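A quick way to see why a plain split on spaces is not enough: the quoted request field contains spaces of its own. The sketch below compares a naive split with a quote-aware split from the standard library's `shlex` module, using only the first 14 fields of the sample entry. This is an illustration, not the approach used in this post; `shlex` follows shell quoting rules, which may not match ALB's escaping in every case, and `split_fields` is a hypothetical name.

```python
import shlex

# First 14 fields of the sample entry (truncated to keep the example short).
sample = (
    'h2 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '10.0.1.252:48160 10.0.0.66:9000 0.000 0.002 0.000 200 200 5 257 '
    '"GET https://10.0.2.105:773/ HTTP/2.0" "curl/7.46.0"'
)


def split_fields(line):
    """Split a log line on spaces while keeping double-quoted fields intact."""
    return shlex.split(line)


naive = sample.split(" ")      # the request line is broken into three pieces
aware = split_fields(sample)   # the request line stays in one piece

print(len(naive), len(aware))  # 16 14
print(aware[12])               # GET https://10.0.2.105:773/ HTTP/2.0
```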

I’ve created the following table to list the fields, their possible values, and the corresponding regular expressions. This should be useful for anyone working with ALB Access Logs.

| Field | Possible values | Regex | Regex explanation |
|---|---|---|---|
| type | http, https, h2, grpcs, ws, wss | `([^ ]*)` | Captures a sequence of characters that are not spaces. |
| time | time in ISO 8601 format | `([^ ]*)` | Captures a sequence of characters that are not spaces. |
| elb | Resource ID of the ALB | `([^ ]*)` | Captures a sequence of characters that are not spaces. |
| client:port | IP & port of the requesting client | `([^ ]*):([0-9]*)` | Captures two groups: the first group is a non-space sequence, followed by a colon, and the second group is a sequence of digits. |
| target:port | IP & port of target or - | `([^ ]*)[:-]([0-9]*)` | Captures two groups: the first group is a non-space sequence followed by either a colon or a hyphen, and the second group is a sequence of digits. |
| request_processing_time | time in seconds or -1 | `([-.0-9]*)` | Captures a sequence of digits, periods, or hyphens. |
| target_processing_time | time in seconds or -1 | `([-.0-9]*)` | Captures a sequence of digits, periods, or hyphens. |
| response_processing_time | time in seconds or -1 | `([-.0-9]*)` | Captures a sequence of digits, periods, or hyphens. |
| elb_status_code | status code of the response from load balancer | `(\|[-0-9]*)` | Captures either an empty string or a sequence of digits and hyphens. |
| target_status_code | status code of the response from the target | `(-\|[-0-9]*)` | Captures either a single hyphen or a sequence of digits and hyphens. |
| received_bytes | size of the request, in bytes | `([-0-9]*)` | Captures any sequence of digits and hyphens, including an empty string. |
| sent_bytes | size of the response, in bytes | `([-0-9]*)` | Captures any sequence of digits and hyphens, including an empty string. |
| “request” | “request line from the client” | `\"([^ ]*) (.*) (- \|[^ ]*)\"` | Captures 3 groups: 1. a sequence of non-space characters, 2. any sequence of characters, 3. either a hyphen followed by a space or a sequence of non-space characters. |
| “user_agent” | “User-Agent of client” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| ssl_cipher | ssl cipher or - | `([A-Z_0-9-]+)` | Captures a sequence of uppercase letters, underscores, digits, and hyphens. |
| ssl_protocol | ssl protocol or - | `([A-Za-z0-9.-]*)` | Captures a sequence of uppercase and lowercase letters, digits, periods, and hyphens. |
| target_group_arn | arn of target group | `([^ ]*)` | Captures a sequence of characters that are not spaces. |
| “trace_id” | “X-Amzn-Trace-Id header” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| “domain_name” | “SNI domain” or “-” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| “chosen_cert_arn” | “certificate arn presented to the client” or “-” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| matched_rule_priority | 1 to 50,000 or 0 or -1 or - | `([-.0-9]*)` | Captures a sequence of digits, periods, and hyphens, including an empty string. |
| request_creation_time | time in ISO 8601 format | `([^ ]*)` | Captures a sequence of characters that are not spaces. |
| “actions_executed” | “actions taken when processing the request” or “-” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| “redirect_url” | “URL of the redirect target” or “-” | `\"([^\"]*)\"` | Captures any sequence of characters except double quotes within quotes. |
| “error_reason” | “error reason code” or “-” | `\"([^ ]*)\"` | Captures a sequence of non-space characters within quotes. |
| “target:port_list” | “IP & port of target” or “-” | `\"([^\s]+?)\"` | Captures a sequence of one or more non-whitespace characters within quotes, using non-greedy matching. |
| “target_status_code_list” | “status codes from the responses of the targets” or “-” | `\"([^\s]+)\"` | Captures a sequence of one or more non-whitespace characters within quotes. |
| “classification” | “classification for desync mitigation” or “-” | `\"([^ ]*)\"` | Captures a sequence of zero or more non-space characters within quotes. |
| “classification_reason” | “classification reason code” or “-” | `\"([^ ]*)\"` | Captures a sequence of zero or more non-space characters within quotes. |
| conn_trace_id | connection traceability ID or no value | ` ?([^ ]*)?` | Matches optional space after the previous pattern and captures zero or more non-space characters; ? makes capturing optional. |
| new_field | any new field | `( .*)?` | Captures an optional space followed by any sequence of characters; ? makes capturing optional. |

This table summarizes each field’s regex pattern and explanation, making it easier to understand how to parse ALB Access Logs.
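Individual fragments can be tried in isolation before assembling them into one long pattern. The sketch below checks three of them against values from the sample entries, including the `-` placeholder. Using `fullmatch` and the `groups` helper are choices made for this illustration only; the full script matches the whole line at once.

```python
import re

# Fragments copied from the field list; each is tested on its own here.
TARGET_PORT = re.compile(r'([^ ]*)[:-]([0-9]*)')
TARGET_STATUS = re.compile(r'(-|[-0-9]*)')
REQUEST = re.compile(r'\"([^ ]*) (.*) (- |[^ ]*)\"')


def groups(pattern, text):
    """Return the captured groups, or None if the fragment does not match the whole text."""
    match = pattern.fullmatch(text)
    return match.groups() if match else None


print(groups(TARGET_PORT, "10.0.0.66:9000"))  # ('10.0.0.66', '9000')
print(groups(TARGET_PORT, "-"))               # ('', '')
print(groups(TARGET_STATUS, "-"))             # ('-',)
print(groups(REQUEST, '"GET https://10.0.2.105:773/ HTTP/2.0"'))
# ('GET', 'https://10.0.2.105:773/', 'HTTP/2.0')
```

Note how a bare `-` target yields two empty groups: the hyphen is consumed by `[:-]`, which is why the script substitutes "-" and 0 for an empty target IP and port.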

Reference: AWS docs

Sample Code

Run the following Python script on your machine. It parses the sample ALB Access Logs using the regular expressions above and outputs the results in JSON format.

import re
import json

# Compile the pattern once at import time so every call reuses it.
ALB_LOG_PATTERN = re.compile(r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z_0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\" ?([^ ]*)?( .*)?')

def parse_log_entry(log_entry):
    try:
        # findall returns a list of tuples; an unmatched line gives an empty list, so [0] raises IndexError.
        fields = ALB_LOG_PATTERN.findall(log_entry)[0]
        lineout = json.dumps({
            'type'                    : fields[0],
            '@timestamp'              : fields[1],
            'elb'                     : fields[2],
            'client_ip'               : fields[3],
            'client_port'             : int(fields[4]) if fields[4] else 0,
            'target_ip'               : fields[5] if fields[5] else "-",
            'target_port'             : int(fields[6]) if fields[6] else 0,
            'request_processing_time' : float(fields[7]) if fields[7] else 0,
            'target_processing_time'  : float(fields[8]) if fields[8] else 0,
            'response_processing_time': float(fields[9]) if fields[9] else 0,
            'elb_status_code'         : fields[10],
            'target_status_code'      : fields[11],
            'received_bytes'          : int(fields[12]) if fields[12] else 0,
            'sent_bytes'              : int(fields[13]) if fields[13] else 0,
            'request_verb'            : fields[14],
            'request'                 : fields[15],
            'request_proto'           : fields[16],
            'user_agent'              : fields[17],
            'ssl_cipher'              : fields[18],
            'ssl_protocol'            : fields[19],
            'target_group_arn'        : fields[20],
            'trace_id'                : fields[21],
            'domain_name'             : fields[22],
            'chosen_cert_arn'         : fields[23],
            'matched_rule_priority'   : fields[24],
            'request_creation_time'   : fields[25],
            'actions_executed'        : fields[26],
            'redirect_url'            : fields[27],
            'alb_error_reason'        : fields[28],
            'target_port_list'        : fields[29],
            'target_status_code_list' : fields[30],
            'classification'          : fields[31],
            'classification_reason'   : fields[32],
            'conn_trace_id'           : fields[33] if fields[33] else "-",
            'new_field'               : fields[34] if fields[34] else "no new field"
        })  
        return lineout
    except Exception as ex:
        # Chain the original exception so the traceback shows the root cause.
        raise Exception(f"Log entry: {log_entry} parsing failed with error: {str(ex)}") from ex

def main():
    log_entries = [
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" 1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-" "10.0.0.1:80" "200" "-" "-" TID_123456',
    'h2 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.1.252:48160 10.0.0.66:9000 0.000 0.002 0.000 200 200 5 257 "GET https://10.0.2.105:773/ HTTP/2.0" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337327-72bd00b0343d75b906739c42" "-" "-" 1 2018-07-02T22:22:48.364000Z "redirect" "https://example.com:80/" "-" "10.0.0.66:9000" "200" "-" "-"',
    'ws 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:40914 10.0.1.192:8010 0.001 0.003 0.000 101 101 218 587 "GET http://10.0.0.30:80/ HTTP/1.1" "-" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.1.192:8010" "101" "-" "-"',
    'wss 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:44244 10.0.0.171:8010 0.000 0.001 0.000 101 101 218 786 "GET https://10.0.0.30:443/ HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.171:8010" "101" "-" "-"',
    'http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "-" "-" "-" "-" "-"',
    'http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 502 - 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "LambdaInvalidResponse" "-" "-" "-" "-"',
    'http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354" "-" "-" 0 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.1:80" "200" "-" "-"'
    ]

    for log_entry in log_entries:
        log_document = parse_log_entry(log_entry)
        print(log_document)

if __name__ == "__main__":
    main()
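
For the final step of the pipeline, sending documents to OpenSearch, the JSON strings produced by a parser like the one above could be batched into a single `_bulk` request body. The `_bulk` endpoint expects newline-delimited JSON: an action line followed by the document, with the body ending in a newline. The sketch below only builds that body; `build_bulk_body` and the index name `alb-access-logs` are hypothetical, and the HTTP call and authentication are left out.

```python
import json


def build_bulk_body(json_documents, index_name):
    """Build an OpenSearch _bulk request body from single-line JSON strings.

    Hypothetical helper: index_name is whatever index you choose.
    """
    if not json_documents:
        return ""
    lines = []
    for document in json_documents:
        # Action line: index this document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, already serialized.
        lines.append(document)
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(['{"type": "h2"}', '{"type": "ws"}'], "alb-access-logs")
print(body)
```

Batching this way means one request per log file rather than one per log line, which tends to matter when a file holds thousands of entries.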

Have any thoughts or queries? Add them to the comments and let’s discuss!