Python from absolute zero. Working with OS, learning regular expressions and functions

Today, we will work with the OS file system — we will learn how to navigate through directories, open and change files. Then, we’ll master the powerful spells called “regular expressions,” learn the intricacies of creating and calling functions, and finally write a simple SQL vulnerability scanner. And all this in one short lesson!

From the editors

This article is a part of the “Python from scratch” series, where we cover the basics of Python in our signature fun style. You can read them in order or select specific areas that you would like to improve.

Lesson 1: Variables, data types, conditions, and loops
Lesson 2: Strings, files, exceptions, and working with the Internet

The first two lessons are available in their entirety without a paid subscription. This one is almost complete: except for the last example and the homework.

Working with files

Let’s start, as always, with simple things. Python has a module with the laconic name os, which (you won’t believe it!) is designed for interaction between a program and the operating system, including file management.

The first thing we need to do, of course, is import it at the beginning of our script:

import os

Now a range of useful capabilities opens up. For example, we can get the path to the current working folder. Initially it is the folder you were in when you started the script (even if the script itself is located somewhere else), but during program execution we can change it with the os.chdir() function.

# Returns the path to the current working directory
pth=os.getcwd()
print(pth)
# Sets the path to the current working folder, in this case it is the D:/ drive
os.chdir(r'D:/')

info

If you work in Windows, put the letter r (which means raw) before the opening quotation mark of a file or folder path, or write two backslashes in the path instead of one.
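The two spellings can be compared directly. The sketch below uses a made-up path (D:\new\notes.txt is only an illustration) and also shows what goes wrong when neither the r prefix nor doubled backslashes are used.

```python
# Hypothetical path, used only for illustration
raw_path = r'D:\new\notes.txt'        # raw string: backslashes are kept as they are
escaped_path = 'D:\\new\\notes.txt'   # doubled backslashes produce the same text
print(raw_path == escaped_path)

# Without r or doubling, \n turns into a newline character inside the path
broken_path = 'D:\new\notes.txt'
print('\n' in broken_path)
```

Both print calls should show True: the first two strings are identical, while the third one silently contains line breaks instead of folder separators.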

Let’s try to get a list of files with the .py extension located in the current directory. For this, we use the os and fnmatch modules.

import os
import fnmatch
# In the loop, using os.listdir('.'), we get a list of files
# in the current directory (the dot in brackets denotes it)
for fname in os.listdir('.'):
    # If the current file name has a *.py extension, then print it
    if fnmatch.fnmatch(fname, '*.py'):
        print(fname)

The fnmatch module checks whether a string matches a given template (mask). The mask can contain these special characters:

* matches any number of any characters;
? matches any single character;
[seq] matches any single character from the sequence in square brackets;
[!seq] matches any single character except those in square brackets.
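Here is a small sketch of how these masks behave. The file names are invented for the example, and fnmatchcase() is used instead of fnmatch() so that the result does not depend on whether the operating system treats names as case-sensitive.

```python
import fnmatch

# Made-up file names for illustration
names = ['scan1.py', 'scan2.py', 'scanA.py', 'scan10.py', 'notes.txt']

# ? stands for exactly one character
one_char = [n for n in names if fnmatch.fnmatchcase(n, 'scan?.py')]
# [0-9] stands for one character from the given range
digits = [n for n in names if fnmatch.fnmatchcase(n, 'scan[0-9].py')]
# [!0-9] stands for one character that is NOT in the range
not_digits = [n for n in names if fnmatch.fnmatchcase(n, 'scan[!0-9].py')]

print(one_char)
print(digits)
print(not_digits)
```

Note that scan10.py matches none of the three masks, because each of them allows only a single character between "scan" and ".py".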

Let’s ruthlessly delete some file:

import os
os.remove(r'D:\allmypasswords.txt')

Let’s rename the file:

import os
os.rename('lamer.txt','hacker.txt')

Now let’s create a folder at the specified path and immediately delete it. For this, the shutil module is useful, which has the rmtree() function, which deletes a folder along with its contents.

import os
import shutil
os.makedirs(r'D:\secret\beer\photo') # Creates all folders in the specified path
shutil.rmtree(r'D:\secret\beer\photo') # Deletes a folder along with its contents
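If you don't have a D drive (or don't want to litter it), the same experiment can be run in a temporary folder. This is only a sketch: the tempfile module and the folder names are my additions, not part of the original example.

```python
import os
import shutil
import tempfile

# Create a throwaway base folder somewhere in the system temp directory
base = tempfile.mkdtemp()
nested = os.path.join(base, 'secret', 'beer', 'photo')

# makedirs() creates every missing folder along the path
os.makedirs(nested)
created = os.path.isdir(nested)

# rmtree() removes 'secret' together with everything inside it
shutil.rmtree(os.path.join(base, 'secret'))
removed = not os.path.exists(os.path.join(base, 'secret'))

# The base folder is empty now, so a plain rmdir() is enough
os.rmdir(base)
print(created, removed)
```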

Let’s say you want to get a list of all files contained in folders at a given path (including subfolders too) to find something interesting. The script will look like this:

warning

Be careful — the script in this form will search the entire D drive. If you have it, and there is a lot of junk there, the process may take a long time.

import os
for root, dirs, files in os.walk(r'D:/'):
    for name in files:
        fullname = os.path.join(root, name)
        print(fullname)
        if 'pass' in fullname:
            print('Bingo!!!')

The walk() function of the os module takes one required argument, the directory name. It goes through all nested directories in turn and returns a generator object that yields three values for each directory visited:

the path to the current directory as a string;
a list of names of the first-level subdirectories in that directory;
a list of file names in that directory.
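To see these three values with your own eyes, you can walk a tiny tree built in a temporary folder. The folder and file names below are invented, and tempfile is used only to keep the experiment away from real data.

```python
import os
import tempfile

seen = []
with tempfile.TemporaryDirectory() as top:
    # Build a small tree: top/a/b plus one file in top/a
    os.makedirs(os.path.join(top, 'a', 'b'))
    with open(os.path.join(top, 'a', 'pass.txt'), 'w') as f:
        f.write('qwerty')

    for root, dirs, files in os.walk(top):
        # Store the path relative to the top folder to keep the output short
        rel = os.path.relpath(root, top)
        seen.append((rel, sorted(dirs), sorted(files)))

for item in seen:
    print(item)
```

Each printed tuple corresponds to one visited directory: its path, its immediate subfolders, and its files.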

info

A generator is an object that does not calculate the values of all its elements at once when it is created. This is how generators differ from lists, which keep all their elements in memory until they are explicitly deleted. Calculations with generators are called lazy, and they save memory. We will look at generators in more detail in the following lessons.
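A minimal sketch of the idea, assuming nothing beyond plain Python: the squares() function below is a made-up example that produces its values one at a time, only when they are asked for.

```python
def squares(limit):
    # yield hands out one value and pauses until the next one is requested
    for n in range(limit):
        yield n * n

lazy = squares(5)      # nothing has been calculated yet
first = next(lazy)     # calculates only the first value
rest = list(lazy)      # calculates whatever is left
print(first, rest)
```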

Now I’ll show you how to find out the size of any file, as well as its modification date.

import os.path
# Module for converting date to acceptable format
from datetime import datetime
path = r'C:\Windows\notepad.exe'
# Get the file size in bytes
size = os.path.getsize(path)
# And now in kilobytes
# Two slashes is integer division
ksize = size // 1024
# Time of last access in seconds since the beginning of the epoch
atime = os.path.getatime(path)
# Time of last modification in seconds since the beginning of the epoch
mtime = os.path.getmtime(path)
print('Size:', ksize, 'KB')
print('Last used date:', datetime.fromtimestamp(atime))
print('Last edit date:', datetime.fromtimestamp(mtime))

info

For Unix operating systems, 1 January 1970, 00:00:00 (UTC) is the starting point of time, or “beginning of epoch.” Most often, time in a computer is calculated as seconds elapsed since that moment and only then converted into a form convenient for a person.
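You can check where the epoch begins yourself by converting zero seconds into a date. The sketch passes an explicit UTC time zone so that the result does not depend on the time zone of your computer.

```python
from datetime import datetime, timezone

# Zero seconds since the epoch is the starting point itself
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
# 86400 seconds is exactly one day later
next_day = datetime.fromtimestamp(86400, tz=timezone.utc)

print(epoch.isoformat())
print(next_day.isoformat())
```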

Let’s play a joke on the user: we will create a file and keep opening it with the program that the system uses for this file type by default:

import os
# The time module will be needed to pause so that it doesn't open too often
import time
# Create a text file
f=open('beer.txt','w',encoding='UTF-8')
f.write('GIVE THE HACKER A BEER NOW OR THIS WILL NEVER END!!')
f.close()
while True:
    # Open the file with the default program (os.startfile() exists only on Windows)
    os.startfile('beer.txt')
    # Pause for one second
    time.sleep(1)

Below is a list of some other useful functions:

os.path.basename('path') – returns the name of the file or folder at the end of the path;
os.path.dirname('path') – returns the path to the parent folder;
os.path.splitext('path') – splits the path into the part before the extension and the file extension;
os.path.exists('path') – checks whether the file or folder exists;
os.path.isfile('path') – checks whether the path points to an existing file;
os.path.isdir('path') – checks whether the path points to an existing folder.
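Here is how these functions behave on a sample path. The path is hypothetical (the file does not have to exist for the first three calls), and forward slashes are used so that the result is the same on Windows and Unix.

```python
import os

# Hypothetical path, used only for illustration
path = 'D:/secret/beer/photo.jpg'

name = os.path.basename(path)        # last component of the path
parent = os.path.dirname(path)       # everything before the last component
stem, ext = os.path.splitext(path)   # path without extension, and the extension
print(name, parent, stem, ext)

# The last three functions look at the real file system,
# so we test them on the current working folder
here = os.getcwd()
print(os.path.exists(here), os.path.isdir(here), os.path.isfile(here))
```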

Regular expressions

Regular expressions are special patterns for finding and replacing strings in text. Generally speaking, they can be considered a language of their own, and studying it in full goes beyond the scope of this series. We’ll go over the very basics and the use of regular expressions in Python.

www

You can read more about regexps in the Python documentation, on Wikipedia, or in Jeffrey Friedl’s book “Mastering Regular Expressions”.

In addition, you can pay attention to the service regex101.com and the site RegexOne with an interactive trainer.

The re module is responsible for working with regular expressions in Python. First of all, let’s import it.

import re

As a simple pattern, we can use some word. Let it be “beer” according to tradition:

import re
pattern = r"beer"
string = "The hacker knows that beer plays a decisive role in hacking. Fresh beer is the key to a system administrator. While the system administrator is in the toilet, you can sit down at his computer and install a Trojan."
result = re.search(pattern, string)
print(result.group(0))

The re.search(pattern, string) command searches the text string for the first occurrence of pattern and returns a match object, whose matched text can be accessed via the .group() method. Because search stops at the first occurrence, in our case only one result will be returned, the word “beer”, even though it appears twice in our text.
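One detail worth knowing: if the pattern is not found at all, re.search() returns None, and calling .group() on it raises an error. A cautious sketch therefore checks the result first (the second pattern, "vodka", is made up precisely because it is absent from the text).

```python
import re

text = "Fresh beer is the key to a system administrator."

hit = re.search(r"beer", text)    # a match object
miss = re.search(r"vodka", text)  # None, because the word is not in the text

# Ask for the matched text and its position only when something was found
found = hit.group(0) if hit else None
position = hit.start() if hit else -1
print(found, position, miss)
```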

To return all occurrences of a pattern in the text, use the re.findall(pattern, string) command. This command will return a list of strings that are present in the text and match the pattern.

import re
pattern = r"beer"
string = "The hacker knows that beer plays a decisive role in hacking. Fresh beer is the key to a system administrator. While the system administrator is in the toilet, you can sit down at his computer and install a Trojan."
result = re.findall(pattern, string)
print(result)

info

Note that patterns in regular expressions have an r before the start of the string. These are so-called raw strings, in which the backslash (\) does not work as an escape character. However, a raw string cannot end with this symbol.
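The difference is easy to see on the \b sequence. In an ordinary Python string it turns into a single backspace character, while in a raw string it stays as a backslash followed by the letter b, which is exactly what the re module needs to see. A small sketch:

```python
import re

plain = "\bbeer"   # \b is converted to one backspace character
raw = r"\bbeer"    # backslash and b are kept as two separate characters

print(len(plain), len(raw))

# Only the raw version works as "the word beer at a word boundary"
with_raw = re.findall(raw, "cold beer")
with_plain = re.findall(plain, "cold beer")
print(with_raw, with_plain)
```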

In the previous two program examples, you simply used a word as the pattern to search for strings. But that’s not where the power of regular expressions lies. You can replace parts of the template with special characters so that the template matches not only specific words, but also a wide variety of strings.

Let’s, for example, try to find all the words in the text that begin with “be”. For this we use the special sequence \b, which marks a word boundary (here, the beginning of a word). Immediately after it, we indicate what the word should begin with, and then write the special sequence \w, which stands for a letter, digit, or underscore (the plus means that there can be one or more of them); the match continues until some other character is encountered (for example, a space or punctuation mark). The template will look like this: r"\bbe\w+".

import re
pattern = r"\bbe\w+"
string = "The hacker knows that beer plays a decisive role in hacking. Fresh beer is the key to a system administrator. While the system administrator is in the toilet, you can sit down at his computer and install a Trojan."
result = re.findall(pattern, string)
print(result)

Let’s try to complete a slightly more difficult task. We will find in the text all emails with the mail.ru domain, if they are there.

import re
pattern = r"\b\w+@mail\.ru"
string = "If you want to contact the admin, write to admin@mail.ru. For other questions, please contact support@mail.ru."
result = re.findall(pattern, string)
print(result)

As you can see, we used the same trick as last time: we wrote the special sequence \b to indicate the beginning of a word, then \w+, which means “one or more letters, digits, or underscores”, and then @mail\.ru, escaping the period, since otherwise it would mean “any character”.

Often you need to find some element of a string surrounded by two other elements. For example, it could be a URL. To select the part of the template that needs to be returned, parentheses are used. I’ll give you an example where you’ll get all the link addresses from a piece of HTML code.

import re
string = 'You can see the site map <a href="map.php">here</a>. Also visit <a href="best.php">section</a>'
pattern = r'href="(.+?)"'
result = re.findall(pattern,string)
print(result)

The code above used the pattern r'href="(.+?)"'. In this pattern, the search string starts with href=" and ends with another double quote. The parentheses indicate which part of the matching string should end up in the result variable. The period and plus inside the parentheses mean that any characters (except the newline character) can be inside the quotes. The question mark means that the match should stop at the first quotation mark encountered.

info

The question mark is used in two slightly different senses in regular expressions. If it comes after a single character or a group, it means that the character or group may or may not be present in the string. If the question mark comes after a quantifier such as + or *, it switches on “non-greedy” mode: such a regular expression will try to capture as few characters as possible.
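Both meanings can be tried out in a few lines. The HTML fragment and the color/colour pair below are just sample data for this sketch.

```python
import re

html = 'See <a href="map.php">here</a> and <a href="best.php">section</a>'

# Greedy: .+ grabs as much as it can, up to the LAST quote in the string
greedy = re.findall(r'href="(.+)"', html)
# Non-greedy: .+? stops at the FIRST quote it meets
lazy = re.findall(r'href="(.+?)"', html)
# After a single character, ? makes that character optional
optional = re.findall(r'colou?r', 'color colour')

print(greedy)
print(lazy)
print(optional)
```

The greedy version returns one long, useless string that swallows everything between the first and the last link, which is why the question mark after the plus matters so much here.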

We can not only search for strings, but also replace them with something else. For example, let’s try to remove all tags from the HTML code. To do this, use the re.sub(pattern,'what to replace with',string) command.

import re
string = 'You can see the site map <a href="map.php">here</a>. Also visit <a href="best.php">section</a>'
pattern = r'<(.+?)>'
result = re.sub(pattern,'',string)
print(result)

The program will print the string without tags, since we replaced them with an empty string.

Regular expressions are very powerful things. Once you master them, you can do almost anything with strings, and when combined with Python code, literally anything. To begin with, you can experiment and change some of the recipes given.

Functions

It’s time to talk about functions in more detail. We’ve already called various functions many times, both built into Python (e.g. print()) and from imported modules (e.g. urllib.request.urlopen()). But what is a function on the inside, and how do you make one yourself?

Imagine you have some set of commands that need to be executed several times, changing only the input data. Such blocks of commands are usually placed in separate pieces of the program.

info

In object-oriented programming, functions that belong to a class are called methods; they are called with a dot after the name of the object.

s='Hello, hacker!'
print(s) # Function
s.lower() # Method

A function can have input parameters: these are one or more variables that are written in parentheses after the function name. When you call a function, you can pass it arguments for these parameters. Some of the parameters may be optional, with a default value that is used when no argument is passed.

A function declaration starts with the keyword def, followed by the function name, the parameters in parentheses, a colon, and the function body indented by four spaces. A function can return one or more values using the return keyword. By the way, return also stops the function: any commands that follow it will be skipped.

As an example, let’s look at the simplest function that will take any two numbers as arguments and multiply them, returning the result of the multiplication. Let’s call it umn.

def umn(a, b):
    c = a * b
    return c

Now that you have described a function, you can call it further in the same program.

a = int(input('Enter the first number: '))
b = int(input('Enter the second number: '))
c = umn(a, b)
print(c)

Sometimes, you need to make one of the parameters optional by setting a default value for it.

def umn(a, b=10):
    c = a * b
    return c

Now, if you call a function and don’t pass it a second argument, it will simply consider it to be ten, meaning it will multiply any number passed by ten.

c=umn(5)
print(c)

Even though the b parameter in this case is 10 by default and is not required to be passed as the second argument, you can still pass a second argument if you want, and then the passed value will be used as b, not 10.

c=umn(5, b=20)
print(c)
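The different ways of calling such a function can be put side by side. This sketch repeats the umn() function so that it can be run on its own; the numbers are arbitrary.

```python
def umn(a, b=10):
    return a * b

results = [
    umn(5),            # b is not passed, so the default 10 is used
    umn(5, 20),        # b passed by position
    umn(5, b=20),      # b passed by name
    umn(a=5, b=20),    # both arguments passed by name
]
print(results)
```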

Inside the program, we can call the function we created as many times as we want.

Let’s create a program that will calculate a salary increase for the vulnerabilities a hacker finds at work. Each hacker will have their own salary, depending on their rank, but the bonus is calculated for everyone on the same principle: “+2% of the base salary for each vulnerability beyond the third.”

Let’s make a function that takes as arguments the employee’s salary and the number of vulnerabilities found. To round the result, we use the round() function, which will round the increase to an integer.

def increase(salary, bugs):
    k = 0
    if bugs > 3:
        k = round((bugs - 3) * 0.02 * salary)
    return k
a = int(input('Enter employee salary: '))
b = int(input('Enter the number of vulnerabilities they found per month: '))
c = increase(a, b)
print('This month the salary increase will be: ' + str(c))
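To check the function without typing anything in, it can be called directly with a few sample values. The salaries and bug counts below are made up for illustration.

```python
def increase(salary, bugs):
    k = 0
    if bugs > 3:
        # 2% of the salary for every vulnerability beyond the third
        k = round((bugs - 3) * 0.02 * salary)
    return k

no_bonus = increase(1000, 3)    # three bugs are not enough for a bonus
two_extra = increase(1000, 5)   # two bugs above the threshold
one_extra = increase(2500, 4)   # one bug above the threshold
print(no_bonus, two_extra, one_extra)
```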

If a function must return more than one value, you can list them separated by commas.

def myfunc(x):
    a = x + 1
    b = x * 2
    return a, b

The function will return a tuple, but we can immediately assign the returned values to separate variables:

plusone, double = myfunc(5)
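A quick way to see what such a function actually hands back is to store the result in a single variable first. In this sketch the values come back packed into one tuple, which can then be unpacked into separate names.

```python
def myfunc(x):
    a = x + 1
    b = x * 2
    return a, b

result = myfunc(5)             # everything arrives as one object
kind = type(result).__name__   # the name of its type
plus_one, doubled = result     # unpacking into two variables
print(kind, result, plus_one, doubled)
```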

Within functions, it is quite possible to use variables that were encountered in the program code before the function was called. But if you set a variable with the same name inside the function code, then this variable will automatically become local and all further changes will occur with it only within the function.

Let me explain with an example:

def boom(a, b):
    z = 15
    c = a * b * z
    return c
z = 1
c = boom(15, 20)
print(z)

As a result of executing the program, you will see one. Why? Inside the function code, we assigned the variable z the value 15, and it became local, and all changes to it will occur inside the function, while in the main program, its value will still be equal to one.

It’s a little hard to understand, but it’s actually pretty handy. If you write several similar functions, you can use the same local variable names inside them without worrying that they will somehow affect each other.

Variables declared outside functions are called global. If you want to change one of them from inside the function, then declare it inside the function using the global keyword.

def addfive(num):
    global a
    a += num
a = 5
addfive(3)
print(a)

This program will print 8. Note that we didn’t return anything, we just changed the global variable. By the way, a function that doesn’t return anything will return the value None.

a = 5
print(addfive(3))

The word None will be displayed on the screen. This can be useful if the function returns something only if some conditions are met, and if they are not met, the execution does not reach return. Then you can check if it returned None.

def isoneortwo(num):
    if num == 1:
        return 'One'
    if num == 2:
        return 'Two'
print(isoneortwo(1))
print(isoneortwo(2))
print(isoneortwo(3))

This function checks if the value is equal to one or two, and if not, it returns None. This can be further checked using if:

if isoneortwo(3) is None:
    print("Not 1 and not 2!")

So, we have learned how to create functions, call them, and return values from them, and also how to use global variables inside functions. From this point on, we can already take on relatively complex examples!

Practice: Checking SQL Vulnerabilities

This time we will create a script that will search for SQL vulnerabilities on different URLs. Let’s create a urls.txt file in advance, each line of which will contain a website address with GET parameters. For example:

http://www.taanilinna.com/index.php?id=325
https://www.925jewellery.co.uk/productlist.php?Group=3&pr=0
http://www.isbtweb.org/index.php?id=1493

Let’s write a script that gets a list of similar URLs from our file and adds a quote to each of the GET parameters, trying to trigger SQL database errors.
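Before touching the network, it helps to see what "adding a quote to each GET parameter" does to an address. The helper name add_quotes() and the example.com address are assumptions made for this sketch; it only manipulates a string and sends nothing anywhere.

```python
def add_quotes(url):
    # Strip whitespace (including the newline left over from reading a file),
    # then put a quote before every & and one more at the very end
    return url.strip().replace("&", "'&") + "'"

# Hypothetical address with two GET parameters
sample = "http://example.com/productlist.php?Group=3&pr=0\n"
probe = add_quotes(sample)
print(probe)
```

As a result, each parameter value ends with a single quote, which is what may provoke a database error on a vulnerable site.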

warning

This article is intended for security specialists operating under a contract; all information provided in it is for educational purposes only. Neither the author nor the Editorial Board can be held liable for any damages caused by improper usage of this publication. Distribution of malware, disruption of systems, and violation of secrecy of correspondence are prosecuted by law.

import re
import requests
# List of regular expressions indicating that a web page has a SQL vulnerability
sql_errors = {
    "MySQL": (r"SQL syntax.*MySQL", r"Warning.*mysql_.*", r"MySQL Query fail.*", r"SQL syntax.*MariaDB server"),
    "PostgreSQL": (r"PostgreSQL.*ERROR", r"Warning.*\Wpg_.*", r"Warning.*PostgreSQL"),
    "Microsoft SQL Server": (r"OLE DB.* SQL Server", r"(\W|\A)SQL Server.*Driver", r"Warning.*odbc_.*", r"Warning.*mssql_", r"Msg \d+, Level \d+, State \d+", r"Unclosed quotation mark after the character string", r"Microsoft OLE DB Provider for ODBC Drivers"),
    "Microsoft Access": (r"Microsoft Access Driver", r"Access Database Engine", r"Microsoft JET Database Engine", r".*Syntax error.*query expression"),
    "Oracle": (r"\bORA-[0-9][0-9][0-9][0-9]", r"Oracle error", r"Warning.*oci_.*", r"Microsoft OLE DB Provider for Oracle"),
    "IBM DB2": (r"CLI Driver.*DB2", r"DB2 SQL error"),
    "SQLite": (r"SQLite/JDBCDriver", r"System.Data.SQLite.SQLiteException"),
    "Informix": (r"Warning.*ibase_.*", r"com.informix.jdbc"),
    "Sybase": (r"Warning.*sybase.*", r"Sybase message")
}
# A function that gets the HTML code of a web page and checks it for keywords,
# indicating the presence of SQL injection, returns two variables — True/False and the type of vulnerable database
def checksql(html):
    for db, errors in sql_errors.items():
        for error in errors:
            if re.compile(error).search(html):
                return True, db
    return False, None
# Open the file from which we will take the URLs that need to be checked
f = open('urls.txt', 'r', encoding='UTF-8')
# Open the file where we will write the found vulnerable URLs
f2 = open('good.txt', 'w', encoding='UTF-8')
# Function for testing URL vulnerability: insert single quote into GET parameters
def checkcheck(url):
    # Replace the & symbol between parameters in the URL by adding a single quote before it
    x = url.replace("&", "'&")
    # Clean the URL from spaces on both sides
    ur = x.strip()
    # If there is no http at the beginning, then add it
    if not ur.startswith('http'):
        ur = 'http://' + ur
    print('Checking: ' + ur)
    try:
        # Get HTML code by URL
        s = requests.get(ur + "'")
        h = s.text
        # Checking for vulnerabilities
        a, b = checksql(h)
        if a:
            print('Vulnerable to SQL injection: ' + ur)
            f2.write(f'{ur} - {b}\n')
        else:
            print('Vulnerability not found: ' + ur)
    except requests.RequestException:
        print('Error while checking: ' + ur)
# Sequentially try the URLs from the urls.txt file
for site in f:
    checkcheck(site)
f.close()
f2.close()

In the good.txt file, you will get a list of vulnerable sites (or nothing if none are found).
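The detection part can be tried offline, without sending a single request. The sketch below uses a reduced set of patterns and two made-up page bodies that stand in for server responses; it is an illustration of the check, not a replacement for the full script.

```python
import re

# A reduced, illustrative subset of the patterns from the scanner
sql_errors = {
    "MySQL": (r"SQL syntax.*MySQL", r"Warning.*mysql_.*"),
    "Oracle": (r"\bORA-[0-9][0-9][0-9][0-9]", r"Oracle error"),
}

def checksql(html):
    # Return (True, database name) on the first matching error pattern
    for db, errors in sql_errors.items():
        for error in errors:
            if re.search(error, html):
                return True, db
    return False, None

# Made-up page bodies standing in for real server responses
broken_page = ("You have an error in your SQL syntax; check the manual "
               "that corresponds to your MySQL server version")
normal_page = "<html><body>Product list</body></html>"

print(checksql(broken_page))
print(checksql(normal_page))
```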

Where can I get a list of URLs to check? Google dorks are often used to find such URLs.

Homework

1. Improve the directory listing program so that it indents nested folders to create a file tree.
2. Write a program that will open a given file, use a regular expression to extract all email addresses from it, and save them to another file, each email on a separate line.
3. Try writing a regular expression yourself that will find all hyperlinks in the code of a web page. Then look for a ready-made version of such a regular expression on the Internet and try to understand its structure.
4. Make a function from the program you created while completing task 2 (input: path to file, output: list of addresses). Then take the code from task 1 and make the program traverse directories and look for email addresses in all the text files it encounters.
