In this tutorial you will learn how to create a log file parser in Python. Parsing a log file, or any other text file, to extract specific information is not that hard if you know a bit of Python and regex. Python's standard library is all you need for this; no third-party modules are required. If you are new to Python, I also wrote a quick tutorial on setting up Python on Windows that you can follow to get an environment running.

In my day job I was testing the Skype for Business iOS application, and it came to the point where I had to go through the application's log files to see all the requests and the responses received. I had to look for specific properties like:

<property name="saveMessagingHistory">Enabled</property>

Usually they were buried under a bunch of other, less important text. While looking for specific things in those log files, I realized that going through them manually would be really time consuming. To save time I wrote the following Python script. If you want to test this script with a similar text file, you can download the SfB iOS log file here.

Open the log file

First of all, we have to open the log file, read each line, and look for specific text in that line using regex. Regex can look confusing and a bit scary at first, but believe me, it's not that complicated. I usually go to the regexr site and play around with different expressions until I find one that matches the text I want. If that doesn't work for you, try googling; that's what I did when I first started using regex.

with open opens the file at log_file_path in read-only mode ("r") and assigns the file object to the file variable. for line in file gives us each line of the file in turn. You can add print(line) under that for loop and run the script; it will print each line of the text file. But in order to match only specific text, we need one more for loop.

for match in re.finditer(regex, line, re.S) looks for text in each line that matches regex, and each hit is assigned to the match variable as a match object. The re.S flag makes . match newline characters as well, which matters later when we parse the whole file at once.

To check that you have matched the text you wanted, use match.group(), which returns the whole matched text (the same as match.group(0)); the script prints it.

You can change match.group() to match.group(2) to get only the second regex group. Groups in regex are created with ( ) and numbered by the order of their opening parenthesis, so here group 2 is the property name and group 3 is its value. This way you can extract specific parts of the text you are matching. Finally, we add the matched text to match_list so we can use these values later in the script.

import re

log_file_path = r"C:\ios logs\sfbios.log"
# Raw string; "/" does not need escaping in a Python regex.
regex = r'(<property name="(.*?)">(.*?)</property>)'

match_list = []
with open(log_file_path, "r") as file:
    for line in file:
        # Collect every property element found on this line.
        for match in re.finditer(regex, line, re.S):
            match_text = match.group()
            match_list.append(match_text)
            print(match_text)
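
To see what each group holds, here is a minimal sketch that runs the same regex against a single line. The sample line is made up for illustration and is not taken from a real log:

```python
import re

# Same pattern as in the script: group 1 is the whole element,
# group 2 the property name, group 3 the property value.
regex = r'(<property name="(.*?)">(.*?)</property>)'

# Hypothetical sample line, invented for this example.
sample_line = 'INFO <property name="saveMessagingHistory">Enabled</property> done'

match = re.search(regex, sample_line)
whole_match = match.group()  # same as match.group(0)
name = match.group(2)
value = match.group(3)

print(whole_match)
print(name)
print(value)
```

Because the outer parentheses wrap the whole pattern, match.group(1) and match.group() return the same text here.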

Parse more than one line

Sometimes you need to parse more than one line; in that case for line in file is not going to work. To read a block of content you need to assign the whole file's data to a variable, as in the example below with data = file.read(). The read_line variable is also introduced, which lets you decide which type of parsing to use. If it is set to True the script parses line by line; otherwise it reads the whole file at once.

import re

log_file_path = r"C:\ios logs\sfbios.log"
regex = r'(<property name="(.*?)">(.*?)</property>)'
read_line = True

with open(log_file_path, "r") as file:
    match_list = []
    if read_line:
        # Parse the file one line at a time.
        for line in file:
            for match in re.finditer(regex, line, re.S):
                match_text = match.group()
                match_list.append(match_text)
                print(match_text)
    else:
        # Read the whole file so a match can span several lines.
        data = file.read()
        for match in re.finditer(regex, data, re.S):
            match_text = match.group()
            match_list.append(match_text)
# No file.close() is needed: the with block closes the file.
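
The difference between the two modes can be sketched with a small in-memory example. The sample text, and the idea that a property value wraps onto a second line, are assumptions made up for illustration:

```python
import io
import re

regex = r'(<property name="(.*?)">(.*?)</property>)'

# Hypothetical log text in which the second property spans two lines.
sample_log = (
    '<property name="a">Enabled</property>\n'
    '<property name="b">Dis\n'
    'abled</property>\n'
)

# Line by line: the second property is split across lines and never matches.
by_line = []
for line in io.StringIO(sample_log):
    by_line.extend(m.group() for m in re.finditer(regex, line, re.S))

# Whole text at once: re.S lets "." match the newline, so both are found.
whole = [m.group() for m in re.finditer(regex, sample_log, re.S)]

print(len(by_line), len(whole))
```

Reading the whole file costs more memory on very large logs, so line-by-line parsing is still the better default when every match fits on one line.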

Export parsed data to a text file

To export the parsed data we use with open(export_file, "w+") as file again, only this time with the "w+" mode, which opens the file for writing and reading, creates it if it does not exist, and truncates it if it does. For the export_file name, I found it handy to include time_now: you don't have to worry about picking an output file name, and the files are easy to manage in one folder.
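
The timestamped file name can be sketched on its own as follows. The folder name is a hypothetical placeholder, and os.path.join is used here as a portable alternative to concatenating "\\" by hand; it is not what the script below does:

```python
import os
from time import strftime

# Hypothetical output folder; replace with your own path.
export_dir = "filtered"

# strftime() without a time argument formats the current local time.
time_now = strftime("%Y-%m-%d %H-%M-%S")

# os.path.join inserts the right separator for the current OS.
export_file = os.path.join(export_dir, "Parser Output " + time_now + ".txt")
print(export_file)
```

The format uses "-" instead of ":" between hours, minutes, and seconds because ":" is not allowed in Windows file names.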

Usually, when you work with a lot of data there might be duplicates, even if you are parsing specific text. A simple way to avoid multiple lines of the same text is list(set(match_list)); note that a set does not keep the original order of the matches. At the end we use a simple for loop that iterates over the indexes of match_list_clean, from 0 to its length. It prints each item in the list and then writes it to the export file with file.write(match_list_clean[item] + "\n").

import re
import time
from time import strftime

log_file_path = r"C:\ios logs\sfbios.log"
export_file_path = r"C:\ios logs\filtered"

# Timestamp used to give every export a unique file name.
time_now = strftime("%Y-%m-%d %H-%M-%S", time.localtime())

file_name = "\\" + "Parser Output " + time_now + ".txt"
export_file = export_file_path + file_name

regex = r'(<property name="(.*?)">(.*?)</property>)'
read_line = False

with open(log_file_path, "r") as file:
    match_list = []
    if read_line:
        for line in file:
            for match in re.finditer(regex, line, re.S):
                match_text = match.group()
                match_list.append(match_text)
                print(match_text)
    else:
        data = file.read()
        for match in re.finditer(regex, data, re.S):
            match_text = match.group()
            match_list.append(match_text)

with open(export_file, "w+") as file:
    file.write("EXPORTED DATA:\n")
    # Drop duplicates; a set does not preserve the original order.
    match_list_clean = list(set(match_list))
    # range replaces Python 2's xrange.
    for item in range(0, len(match_list_clean)):
        print(match_list_clean[item])
        file.write(match_list_clean[item] + "\n")
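
If the order of the matches matters, a set is not ideal because it does not keep it. A minimal sketch of one alternative, assuming Python 3.7 or newer where dictionaries keep insertion order (the sample matches are invented for this example):

```python
# Hypothetical matches, invented for this example.
match_list = [
    '<property name="b">Enabled</property>',
    '<property name="a">Disabled</property>',
    '<property name="b">Enabled</property>',
]

# dict keys are unique and keep insertion order,
# so the first occurrence of each match survives in place.
match_list_clean = list(dict.fromkeys(match_list))

for item in match_list_clean:
    print(item)
```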

Turn a block of code into a function

At the end, I like to turn a block of code into a function with parameters. This is the smart way, and you should train yourself to think ahead. Now that I have added the main() and parseData() functions, I can reuse this script anywhere else and change the arguments, for example to use a different regex.

import re
import time
from time import strftime

def main():
    log_file_path = r"C:\ios logs\sfbios.log"
    export_file_path = r"C:\ios logs\filtered"

    time_now = strftime("%Y-%m-%d %H-%M-%S", time.localtime())

    file_name = "\\" + "Parser Output " + time_now + ".txt"
    export_file = export_file_path + file_name

    regex = r'(<property name="(.*?)">(.*?)</property>)'

    parseData(log_file_path, export_file, regex, read_line=True)

def parseData(log_file_path, export_file, regex, read_line=True):
    # Collect all matches, line by line or from the whole file.
    with open(log_file_path, "r") as file:
        match_list = []
        if read_line:
            for line in file:
                for match in re.finditer(regex, line, re.S):
                    match_text = match.group()
                    match_list.append(match_text)
                    print(match_text)
        else:
            data = file.read()
            for match in re.finditer(regex, data, re.S):
                match_text = match.group()
                match_list.append(match_text)

    # Write the de-duplicated matches to the export file.
    with open(export_file, "w+") as file:
        file.write("EXPORTED DATA:\n")
        match_list_clean = list(set(match_list))
        for item in range(0, len(match_list_clean)):
            print(match_list_clean[item])
            file.write(match_list_clean[item] + "\n")

if __name__ == '__main__':
    main()
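
To illustrate the reuse, here is a rough sketch with a trimmed-down helper. The name parse_text_file, the sample log content, and the status-code pattern are all hypothetical and not part of the original script:

```python
import os
import re
import tempfile

def parse_text_file(path, regex):
    """Hypothetical, simplified variant of parseData: returns the matches only."""
    with open(path, "r") as log_file:
        return [m.group() for m in re.finditer(regex, log_file.read(), re.S)]

# Write a throwaway sample log so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write('GET /ucwa/v1 200\n<property name="a">Enabled</property>\n')
    sample_path = tmp.name

# Same function, two different patterns.
properties = parse_text_file(sample_path, r'<property name="(.*?)">(.*?)</property>')
status_codes = parse_text_file(sample_path, r'\b\d{3}\b')

os.remove(sample_path)
print(properties, status_codes)
```

Returning the list instead of printing inside the function makes the helper easier to reuse and to test.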

Match a regex against already parsed data

Just to have more options, we can call a reparseData function in the middle of parseData. For example, in this case I wanted to see only those properties whose value is set to Enabled. Another argument, reparse, which defaults to False and is set to True in main(), is added to parseData() to control this 're-parsing'.

reparseData is basically the same code, except that it takes its data from the list, and re.finditer expects a string, not a list. That's why we use data_string = ''.join(parsed_data), which joins the list items into one string.

import re
import time
from time import strftime

def main():
    log_file_path = r"C:\ios logs\sfbios.log"
    export_file_path = r"C:\ios logs\filtered"

    time_now = strftime("%Y-%m-%d %H-%M-%S", time.localtime())

    file_name = "\\" + "Parser Output " + time_now + ".txt"
    export_file = export_file_path + file_name

    regex = r'(<property name="(.*?)">(.*?)</property>)'

    parseData(log_file_path, export_file, regex, read_line=True, reparse=True)


def parseData(log_file_path, export_file, regex, read_line=True, reparse=False):
    with open(log_file_path, "r") as file:
        match_list = []
        if read_line:
            for line in file:
                for match in re.finditer(regex, line, re.S):
                    match_text = match.group()
                    match_list.append(match_text)
        else:
            data = file.read()
            for match in re.finditer(regex, data, re.S):
                match_text = match.group()
                match_list.append(match_text)

    if reparse:
        # Keep only properties set to Enabled. [^"] stops the name at its
        # closing quote, so a match cannot run into the next property.
        match_list = reparseData(match_list, r'(<property name="([^"]{1,50})">(Enabled)</property>)')

    with open(export_file, "w+") as file:
        file.write("EXPORTED DATA:\n")
        match_list_clean = list(set(match_list))
        for item in range(0, len(match_list_clean)):
            print(match_list_clean[item])
            file.write(match_list_clean[item] + "\n")
    return match_list_clean

def reparseData(parsed_data, regex):
    # re.finditer needs a string, so join the list items first.
    data_string = ''.join(parsed_data)
    match_list = []
    for match in re.finditer(regex, data_string, re.S):
        match_text = match.group()
        match_list.append(match_text)
    return match_list

if __name__ == '__main__':
    main()
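
Joining the list into one string is not the only option. As a hedged alternative, each item can be tested on its own, which avoids any chance of a match spanning two list items. The helper name filter_matches and the sample data are hypothetical, not part of the original script:

```python
import re

def filter_matches(parsed_data, regex):
    """Hypothetical alternative to reparseData: test each item separately."""
    pattern = re.compile(regex)
    return [item for item in parsed_data if pattern.search(item)]

# Hypothetical parsed data, invented for this example.
parsed = [
    '<property name="saveMessagingHistory">Enabled</property>',
    '<property name="photos">Disabled</property>',
    '<property name="voicemail">Enabled</property>',
]

enabled = filter_matches(parsed, r'<property name="[^"]+">Enabled</property>')
print(enabled)
```

This version returns the original list items unchanged rather than the text matched by the second regex, which may or may not be what you want.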

You can clone or download this script from the Pythonicways GitHub. Please share and like this post, and if you have any questions, leave a comment.

Roberts Greibers

I help engineers to become backend Python/Django developers so they can increase their income

14 thoughts on “Log File Parsing In Python”

zwkcoding says:

April 15, 2019 at 1:41 pm

thx

Pablo says:

June 24, 2019 at 1:05 am

nicely presented! thanks!

ANNAHEN says:

August 15, 2019 at 10:23 am

Really Thnx

awkward_joker says:

October 16, 2019 at 7:20 am

very well presented! thank you 🙂

dz dz says:

March 11, 2020 at 6:21 am

hi..not able to get the log file…mine to share..Thanks

steve_shambles says:

April 5, 2020 at 10:47 pm

Good article, thank you for this.

charlie says:

May 20, 2020 at 10:27 pm

What is f.read() in your first line 16?

1. pythonicways says:
  
  April 29, 2021 at 3:58 pm
  
  data = f.read() reads the whole file into data variable
  
Debanjana says:

October 13, 2020 at 2:06 pm

Very Helpful doc…

preethi says:

October 18, 2020 at 11:00 am

it shows error when i try to export the filtered log file to text

1. pythonicways says:
  
  April 29, 2021 at 3:56 pm
  
  If you could share the error, I could help
  
hamzabarkallah12@gmail.com says:

May 25, 2021 at 1:02 pm

I got this error:

for item in xrange(0, len(match_list_clean)):
NameError: name ‘xrange’ is not defined

1. pythonicways says:
  
  May 25, 2021 at 1:09 pm
  
  range in Python3 is the same as xrange in Python 2.7. This blog post was written at the time when I still used Python 2.7. If you replace xrange with range it should work the same way
  