Introduction

In this article, we’ll explore a Python script that leverages the mimesis library to populate an (Azure) SQL database with fake data.

The Code

import logging
import random

import pandas as pd
import pymssql
import sqlalchemy
from dotenv import dotenv_values
from mimesis import Address, Datetime, Person
from mimesis.enums import Gender
from sqlalchemy import create_engine

# Load environment variables
config = dotenv_values(".env")

# Configure logging to both console and file
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("populate.log")  # log file name is arbitrary
fileHandler.setFormatter(logFormatter)
rootLogger.addHandler(fileHandler)

def create_rows_mimesis(num=1):
    output = []
    for _ in range(num):
        # Pick a gender per row so first and last name stay consistent,
        # instead of once per call (which would make a whole batch share one gender)
        gender = random.choice([Gender.FEMALE, Gender.MALE])
        output.append({"first_name": person.first_name(gender),
                       "last_name": person.last_name(gender),
                       "address": address.address(),
                       "email": person.email(),
                       "city": address.city(),
                       "state": address.state(),
                       "date_time": datetime.datetime(),
                       "randomdata": random.randint(1000, 2000)})
    return output

try:
    # Create SQLAlchemy engine
    engine = create_engine(config["CONNECTION_STRING"])

    # Connect to the database
    with engine.connect() as conn:
        logging.info(f"Connected to database: {engine}")

        # Initialize mimesis objects
        person = Person('en')
        address = Address()
        datetime = Datetime()

        num_rows = int(config["ROWS"])
        rows_per_batch = int(config["INSERT_LIMIT"])
        logging.info(f"Generating {num_rows} rows")

        if num_rows > rows_per_batch:
            for i in range(0, num_rows, rows_per_batch):
                batch_size = min(rows_per_batch, num_rows - i)
                batch_df = pd.DataFrame(create_rows_mimesis(batch_size))
                batch_df.to_sql(config["TABLE_NAME"], engine, method='multi', index=False, if_exists='append')
                logging.info(f"Inserted {batch_size} rows into table: {config['TABLE_NAME']}")
        else:
            df = pd.DataFrame(create_rows_mimesis(num_rows))
            df.to_sql(config["TABLE_NAME"], engine, method='multi', index=False, if_exists='replace')
            logging.info(f"Inserted {num_rows} rows into table: {config['TABLE_NAME']}")

        conn.commit()
    
except Exception as e:
    logging.error(f"An error occurred: {str(e)}")

logging.info("Database connection closed.")

The packages needed are:

pymssql
sqlalchemy
pandas
mimesis
python-dotenv
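They can all be installed in one go (ideally inside a virtual environment):

```shell
pip install pymssql sqlalchemy pandas mimesis python-dotenv
```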

And a .env file for the dynamic configuration:

CONNECTION_STRING=mssql+pymssql://user:password@host/db
TABLE_NAME=person
ROWS=123
INSERT_LIMIT=1000
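To see how ROWS and INSERT_LIMIT interact, here is a small stdlib-only sketch of the batching arithmetic the script performs. The values below are hypothetical examples, not the ones from the .env above:

```python
# Sketch of the batch-size arithmetic used by the script.
# num_rows plays the role of ROWS, rows_per_batch the role of INSERT_LIMIT;
# both are hypothetical example values.
num_rows = 2500
rows_per_batch = 1000

# One batch per step of the range; the last batch gets the remainder
batch_sizes = [min(rows_per_batch, num_rows - i)
               for i in range(0, num_rows, rows_per_batch)]
print(batch_sizes)  # [1000, 1000, 500]
```

Every batch is full except possibly the last one, and the sizes always add back up to ROWS.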

How It Works

  • Environment Setup: The script utilizes the dotenv library to load environment variables from a .env file.

  • Logging Configuration: Logging is configured to provide informative messages about the script’s execution. Both console and file logging are set up to capture important events.

  • Data Generation with Mimesis: The create_rows_mimesis function utilizes the mimesis library to generate realistic data such as names, addresses, emails, and timestamps. This ensures that the database is populated with fake but plausible information.

  • Database Interaction: The script establishes a connection to the SQL database using SQLAlchemy. It then inserts the generated data into the specified table. To optimize performance, data insertion is performed in batches, especially when dealing with large datasets.

  • Error Handling: Exception handling is implemented to capture and log any errors that may occur during the execution of the script, ensuring robustness and providing insights for troubleshooting.

  • Connection Closure: Upon completion of the database population process, the script closes the database connection, ensuring proper resource management.
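The same generate-and-insert flow can be tried locally without an Azure SQL instance. The sketch below is an assumption-laden stand-in: it swaps mimesis for stdlib random (the name lists are invented), targets an in-memory SQLite database instead of Azure SQL, and uses an illustrative table name — but the batching pattern is the same as in the script:

```python
import random

import pandas as pd
from sqlalchemy import create_engine

def fake_rows(num):
    # Stand-in for mimesis: plausible-looking rows built with stdlib random
    first = ["Ada", "Grace", "Alan", "Edsger"]
    last = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
    return [{"first_name": random.choice(first),
             "last_name": random.choice(last),
             "randomdata": random.randint(1000, 2000)} for _ in range(num)]

engine = create_engine("sqlite://")  # in-memory database for the demo
num_rows, rows_per_batch = 123, 50

# Insert in batches, appending each one to the same table
for i in range(0, num_rows, rows_per_batch):
    batch = pd.DataFrame(fake_rows(min(rows_per_batch, num_rows - i)))
    batch.to_sql("person", engine, index=False, if_exists="append")

total = pd.read_sql("SELECT COUNT(*) AS n FROM person", engine)["n"][0]
print(total)  # 123
```

Swapping the `sqlite://` URL for the CONNECTION_STRING from the .env file is all it takes to point the same code at a real server.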

Running It

If everything is OK, the logs should look like this:

[screenshot: 1-run]

And the database (in this case Azure SQL):

[screenshot: 2-query]