End-to-End Real-time Fraud Detection on AWS

Detecting potential fraud in financial systems is a major challenge for organizations worldwide. Building robust solutions that enable real-time actions is essential for companies aiming to provide greater security to their customers during financial transactions.

This project demonstrates a complete machine learning pipeline for credit card fraud detection using the Kaggle Credit Card Fraud Detection dataset, which contains 284,807 European cardholder transactions from 2013 (including 492 fraudulent cases) with 28 PCA-transformed features plus original Amount and Time variables.

The project showcases a production-ready streaming architecture that integrates Amazon SageMaker for training both supervised and unsupervised ML models and deploying them as managed endpoints.

Architecture overview:

Architecture Diagram

Pre-requisites

An AWS account with appropriate permissions to create and manage resources such as S3, SageMaker, Kinesis, Lambda, Glue and RDS.
AWS CLI v2 installed and configured on your local machine.
Terraform (>= 1.5) installed for infrastructure provisioning.
Python 3.10+ installed for running the training notebooks and the API application.
A Kaggle account with an API token at ~/.kaggle/kaggle.json (used by make data.download).
Basic understanding of machine learning concepts, AWS services, and infrastructure-as-code (IaC) practices.

The whole workflow is wrapped in a top-level Makefile — run make help to list every target.

Key Components

Data Ingestion: Simulated real-time transaction data is ingested into Amazon Kinesis Data Streams.
Data Processing: AWS Glue or Spark processes the streaming data, performing feature engineering and transformations.
Model Training: Amazon SageMaker is used to train both supervised (e.g., XGBoost) and unsupervised (e.g., Random Cut Forest for anomaly detection) models on the historical dataset.
Model Deployment: Trained models are deployed as SageMaker endpoints for real-time inference.
API Layer: A REST API is created using AWS Chalice, which serves as the interface for making predictions and receiving fraud alerts.
Monitoring: CloudWatch is used to monitor the performance of the deployed models and the overall system. The project also includes scripts for simulating transaction data and testing the end-to-end pipeline.

Configure AWS credentials

Before running the project, ensure that you have your AWS credentials configured properly. You can set up your credentials using the AWS CLI or by configuring environment variables. For example, you can run the following command to configure your AWS credentials:

$ aws configure

This command will prompt you to enter your AWS Access Key ID Key, AWS Secret Access Key, default region, and output format. Make sure to use the same region that you specified in the Terraform configuration for the AWS resources. More details on configuring AWS credentials can be found in the AWS documentation.

Infrastructure Provisioning

Most of part of the infrastructure including AWS Kinesis stream, SageMaker, Glue, RDS instance are provisioned using Terraform which is an infrastructure as code (IaC) tool that allows you to define and manage your cloud resources in a declarative configuration language. This approach ensures that the infrastructure is consistent, repeatable, and version-controlled.

Services Overview

First of all, let us start with the Terraform configuration for provisioning the necessary AWS resources. Let's explain some of the key resources that are defined in the Terraform configuration files.

S3 Buckets

We define three S3 buckets in our Terraform configuration: one for the raw dataset, one for the streaming artifacts (Glue script, wheel, JAR, checkpoints) and one for SageMaker (notebooks and models). The data bucket is configured with a lifecycle policy to automatically expire old objects. Below is an excerpt from s3.tf (source):

# s3.tf
resource "aws_s3_bucket" "fraud_data_bucket" {
    bucket        = "fraud-detection-data-bucket-${var.aws_region}"
    force_destroy = true

    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}
resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
    bucket = aws_s3_bucket.fraud_data_bucket.id

    rule {
        id     = "expire-old-data"
        status = "Enabled"
        filter {
            prefix = ""
        }
        expiration {
            days = 90
        }
    }
}
resource "aws_s3_bucket" "fraud_streaming_bucket" {
    bucket        = "fraud-detection-stream-bucket-${var.aws_region}"
    force_destroy = true

    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

Kinesis Streams

For real-time data ingestion, we use Amazon Kinesis Data Streams. The Kinesis stream is configured with appropriate shard count and retention period to handle the expected data volume and ensure that the data is retained for a sufficient amount of time for processing. Below is an excerpt of how to define the Kinesis stream in Terraform which you can find in the kinesis.tf file in the project here.

# Kinesis Stream
resource "aws_kinesis_stream" "fraud_predictions_stream" {
    name             = var.kinesis_stream_name
    shard_count      = var.kinesis_shard_count
    retention_period = 24 # in hours
    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

RDS PostgreSQL

For storing the transaction data and model predictions, we use an RDS PostgreSQL instance. The RDS instance is configured with appropriate security groups, parameter groups, and backup settings to ensure data durability and security. Below is an excerpt of how to define the RDS PostgreSQL instance in Terraform which you can find in the rds.tf file in the project here

# PostgreSQL RDS Instance
resource "aws_db_instance" "fraudit_postgres" {
    identifier            = "fraudit-postgres-db"
    engine                = "postgres"
    engine_version        = "17.5" # Version supported by AWS RDS
    instance_class        = "db.t3.micro"
    allocated_storage     = 20
    max_allocated_storage = 100
    storage_type          = "gp2"

    db_name  = var.postgres_db
    username = var.postgres_user
    password = var.postgres_password
    port     = var.postgres_port

    # Network configuration
    db_subnet_group_name   = aws_db_subnet_group.fraudit_subnet_group.name
    vpc_security_group_ids = [aws_security_group.rds_sg.id]
    publicly_accessible    = true # Required for access from Glue without VPC

    # Backup configuration
    backup_retention_period = 7
    backup_window           = "03:00-04:00"
    maintenance_window      = "sun:04:00-sun:05:00"

    # Development configuration
    skip_final_snapshot = true
    deletion_protection = false
    apply_immediately   = true

    # Monitoring disabled for simplicity
    performance_insights_enabled = false
    monitoring_interval          = 0

    tags = {
        Name        = "fraudit-postgres"
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

The database credentials such as postgres_user, postgres_password, and postgres_db are defined as variables in the devops/infra/dev/dev.tfvars file and should be provided during the Terraform deployment.

Tip

Make sure to use strong credentials for the database and consider using AWS Secrets Manager for managing sensitive information in a production environment which is not covered in this project for simplicity.

Glue

The terraform configuration for the Glue job is defined in the glue.tf file. below is an excerpt of the Glue job definition:

# glue.tf
resource "aws_glue_job" "fraudit_streaming_job" {
    name     = "fraudit-streaming-job"
    role_arn = aws_iam_role.glue-role.arn
    # Job entrypoint
    command {
        script_location = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/spark-jobs/glue_job.py"
        python_version  = "3"
    }

    glue_version = "4.0"
    max_capacity = 2

    default_arguments = {
        "--job-language"                     = "python"
        "--additional-python-modules"        = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/wheel/fraudit-0.0.1-py3-none-any.whl"
        "--python-modules-installer-option"  = "--upgrade"
        "--extra-jars"                       = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/jars/spark-streaming-sql-kinesis-connector_2.12-1.0.0.jar"

        "--postgres_host"                    = aws_db_instance.fraudit_postgres.address
        "--postgres_port"                    = aws_db_instance.fraudit_postgres.port
        "--postgres_user"                    = var.postgres_user
        "--postgres_password"                = var.postgres_password
        "--postgres_db"                      = var.postgres_db

        "--kinesis_stream"                   = aws_kinesis_stream.fraud_predictions_stream.name
        "--kinesis_endpoint"                 = "https://kinesis.${var.aws_region}.amazonaws.com"
        "--aws_region"                       = var.aws_region

        "--s3_checkpoint_bucket"             = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/checkpoints/"

        # Monitoring
        "--enable-continuous-cloudwatch-log" = "true"
        "--enable-job-insights"              = "true"
        "--enable-metrics"                   = "true"
        "--enable-spark-ui"                  = "true"
        "--enable-auto-scaling"              = "true"
    }

    tags = {
        Project     = "fraud-detection"
        Environment = "dev"
    }
}

Let's break down some of the key configurations for the Glue job:

script_location: This specifies the location of the Glue job script in S3. The script contains the logic for reading from the Kinesis stream, processing the data, and writing the results to RDS.
additional-python-modules: This argument allows you to specify additional Python modules that are required for the Glue job. In this case, we are referencing a wheel package that contains the necessary dependencies for the Glue job, which is also stored in S3.
extra-jars: This argument is used to specify additional JAR files that are required for the Glue job. Here, we are referencing the Spark Kinesis connector JAR file, which is necessary for reading from the Kinesis stream in the Glue job.
postgres_host, postgres_port, postgres_user, postgres_password, postgres_db: These arguments are used to pass the PostgreSQL database connection details to the Glue job, allowing it to write the processed data into the RDS instance.
kinesis_stream and kinesis_endpoint: These arguments specify the name of the Kinesis stream and the endpoint URL for connecting to the Kinesis service.
s3_checkpoint_bucket: This argument specifies the S3 bucket location for storing the checkpoints of the Spark structured streaming job, which is essential for fault tolerance and ensuring that the job can resume from the last checkpoint in case of failures.

SageMaker AI

SageMaker is the AWS service that provides a fully managed platform for building, training, and deploying machine learning models. In this project, we use SageMaker to train both supervised and unsupervised models for fraud detection. The Terraform configuration for provisioning the SageMaker resources is defined in the sagemaker.tf file. Below is an excerpt showing how to define a SageMaker notebook instance in Terraform (source).

resource "aws_sagemaker_notebook_instance" "fraudit_notebook" {
  name          = var.sagemaker_notebook_name
  instance_type = var.sagemaker_notebook_instance_type
  role_arn      = aws_iam_role.sagemaker_execution_role.arn
  volume_size   = var.sagemaker_root_volume_size

  # Optional VPC configuration - only if subnet is provided
  subnet_id = var.sagemaker_subnet_id != "" ? var.sagemaker_subnet_id : null

  # Security groups - only if subnet is provided (VPC attachment)
  security_groups = length(var.sagemaker_security_group_ids) > 0 ? var.sagemaker_security_group_ids : null

  # Optional lifecycle configuration - only if content is provided
  lifecycle_config_name = var.sagemaker_lifecycle_startup_content != "" ? aws_sagemaker_notebook_instance_lifecycle_configuration.fraudit_nb_lifecycle[0].name : null

  tags = {
    Project     = "fraud-detection"
    Environment = "dev"
  }
}

Deploying the infrastructure

Now that we have an overview of the key AWS resources provisioned by Terraform, let's deploy the infrastructure. First, update the devops/infra/dev/dev.tfvars file with the values for your AWS environment, such as the database credentials and any optional overrides for the SageMaker notebook instance.

# dev.tfvars
aws_region = "eu-west-1"

kinesis_stream_name = "fraud-predictions-stream"
kinesis_shard_count = 4

postgres_user     = "postgres_user"
postgres_password = "postgres_password"  # CHANGE ME before apply
postgres_db       = "fraudit_postgres_db"
postgres_port     = 5432

sagemaker_notebook_name           = "fraudit-notebook"
sagemaker_notebook_instance_type  = "ml.t3.medium"
sagemaker_root_volume_size        = 5
sagemaker_subnet_id               = ""          # keep empty for no VPC attachment in dev
sagemaker_security_group_ids      = []          # list of SG IDs if attaching to a VPC
sagemaker_lifecycle_startup_content = "enabled"

Warning

Please do not commit sensitive information such as database credentials to version control. Make sure to use strong credentials and consider using environment variables or AWS Secrets Manager for managing sensitive information in a production environment.

Once you have updated the variables, you can deploy the infrastructure with the Make targets exposed at the root of the repository:

make tf.init
make tf.plan       # uses devops/infra/dev/dev.tfvars
make tf.apply

Under the hood these wrap terraform init / plan / apply -var-file=dev.tfvars inside devops/infra/dev/.

Note the outputs from the Terraform deployment, which include important information about the resources that were created. These outputs are essential for configuring the subsequent steps in the project. The outputs include:

kinesis_stream_name: the name of the Kinesis stream that was created for ingesting transaction data.
lambda_exec_role_arn: the ARN of the IAM role that is assigned to the AWS Lambda function for invoking SageMaker endpoints.
rds_postgres_endpoint: the endpoint URL of the RDS PostgreSQL instance that is beused for storing transaction data and model predictions.
rds_postgres_port: the port number for connecting to the RDS PostgreSQL instance.

You can use the output values to configure your .env file that will be used by the training notebooks and the API application to connect to the appropriate AWS resources. Two options:

Copy the output values manually from the terminal and paste them into a .env file (you can start from .env.example).
Run the dedicated Make target, which reads the Terraform outputs and writes them straight into .env (it also fills iam_role_arn in app/api/.chalice/config.json):

make env.sync

Implementation and Deployment

Now that the infrastructure is deployed and the necessary configurations are in place, we can proceed to implement the machine learning models, the API layer, and the Glue job for data processing. The project includes detailed notebooks and scripts for each of these components, which are organized in the respective directories in the project repository. Below is an overview of each component and how to implement and deploy them.

Fraud detection ML models

The machine learning models for fraud detection are implemented in Jupyter notebooks designed to run on the SageMaker notebook instance provisioned by Terraform.

The training notebooks live in the sagemaker/ directory and rely on the SageMaker Python SDK for training and deployment. For example, the XGBoost notebook is available here: it loads the dataset from S3, trains the XGBoost model with SageMaker and deploys the resulting model as a real-time endpoint. The Random Cut Forest notebook (here) follows the same flow but trains an unsupervised anomaly detector.

Before opening the notebooks, push the dataset to S3 and sync the notebook instance:

make data.download   # Kaggle → ./dataset/creditcard.csv
make data.upload     # ./dataset/ → s3://$SPARK_DATA_BUCKET/dataset/
make sagemaker.sync  # uploads notebooks to S3 and (re)starts the notebook instance

Then open Jupyter and run both notebooks to deploy the two SageMaker endpoints used by the API.

API Layer

Once the models are deployed as SageMaker endpoints, the next step is to expose them through an API layer that handles incoming requests and returns predictions in real time. We use AWS Chalice, a Python micro-framework for serverless applications that makes it easy to build and deploy REST APIs backed by Lambda + API Gateway.

The Chalice application lives in app/api/:

app/api/
├── app.py
├── requirements.txt
└── .chalice/
    └── config.json

The app.py file defines the API endpoints used to invoke the SageMaker models, and requirements.txt lists the Lambda dependencies.

The next step is to configure app/api/.chalice/config.json to specify the AWS region and other settings for the Chalice application:

{
    "version": "2.0",
    "app_name": "ml-inference-api",
    "stages": {
        "dev": {
            "api_gateway_stage": "api",
            "manage_iam_role": false,
            "iam_role_arn": "arn:aws:iam::xxxxxxxxxxxx:role/ml-inference-api-lambda-exec-role",
            "environment_variables": {
                "solution_prefix": "fraud-detection",
                "stream_name": "fraud-predictions-stream",
                "aws_region": "eu-west-1"
            }
        }
    }
}

Note that iam_role_arn should match the ARN of the IAM role created for the Lambda function in the Terraform stack. make env.sync patches this field automatically using the lambda_exec_role_arn Terraform output, so you usually don't have to edit it by hand.

After configuring the config.json file, we can proceed to implement the API endpoints in the app.py file that will look like this:

@app.route('/predict', methods=['POST'])
def predict():
    event = app.current_request.json_body

    metadata = event.get('metadata')
    data_payload = event.get('data')

    if not metadata or not data_payload:
        raise BadRequestError("Requête invalide : champ 'metadata' ou 'data' manquant.")

    anomaly = get_anomaly_prediction(data_payload)
    fraud = get_fraud_prediction(data_payload)

    output = {
        "anomaly_detector": anomaly,
        "fraud_classifier": fraud
    }

    store_data_prediction(output, metadata) # store the prediction results in Kinesis stream for further processing

    return output

You can find the complete implementation of the API endpoints in the app.py file here. This endpoint receives transaction data in JSON format, calls the deployed SageMaker models, returns the predictions in the response, and pushes the enriched record to a Kinesis stream for downstream processing.

Once the API endpoints are implemented, deploy the Chalice application to AWS with:

make chalice.deploy

This target runs chalice deploy from app/api/ and writes the resulting REST API URL back into .env as CHALICE_API_URL, ready to be consumed by the simulator.

Glue Job for Data Processing

In order to process the streaming data from the Kinesis stream and perform feature engineering and transformations, we define a Glue job that reads the data from the Kinesis stream, applies the necessary transformations, and then writes the processed data into RDS for further analysis.

The Glue job is implemented in the glue_job.py file located here using PySpark which is is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. Here is an example of how the main function of the Glue job looks like:

def main():
    """Main execution function"""
    logger.info("Starting Fraud Detection Streaming Job")
    
    # Configure checkpoint location
    checkpoint_location = f"s3a://{args['s3_checkpoint_bucket']}/checkpoints/{args['JOB_NAME']}/"
    logger.info(f"Using checkpoint location: {checkpoint_location}")
    
    # Read from Kinesis stream
    kinesis_df = (
        spark.readStream
        .format("aws-kinesis")
        .option("kinesis.streamName", args["kinesis_stream"])
        .option("kinesis.region", args["aws_region"])
        .option("kinesis.startingposition", "TRIM_HORIZON")  # Start from the latest data
        .option("kinesis.endpointUrl", f"https://kinesis.{args['aws_region']}.amazonaws.com")
        .load()
    )
    
    logger.info("Successfully configured Kinesis stream reader")
    
    # Start streaming query
    query = (
        kinesis_df
        .writeStream
        .foreachBatch(process_kinesis_batch)
        .outputMode("append")
        .option("checkpointLocation", checkpoint_location)
        .trigger(processingTime='30 seconds')
        .start()
    )
    
    logger.info("Streaming query started successfully")
    
    # Wait for termination
    query.awaitTermination()

Once the Glue job is defined, we need to package and deploy three artifacts to S3:

The fraudit Python wheel — built from pyproject.toml and exposing the streaming pipeline.
The Glue script src/fraudit/glue_job.py — the entry point referenced by glue.tf.
The Spark Kinesis connector JAR — required to read from Kinesis.

All three are produced and uploaded by a single Make target:

make glue.deploy

It runs python -m build to produce the wheel and then aws s3 cp for each artifact into s3://$SPARK_STREAMING_BUCKET/{wheel,spark-jobs,jars}/. Once the artifacts are uploaded you can start the Glue job from the AWS Management Console (or aws glue start-job-run); it will read the Kinesis stream, apply the logic in glue_job.py, and append the processed records into RDS.

Generate Transaction Data and Test the Pipeline

To test the end-to-end pipeline, we use the scripts/generate_data.py script to simulate real-time transaction data and send it to the Chalice API endpoint. This script reads the original Kaggle dataset, enriches each record with additional metadata, and posts it to the API.

The dataset itself is fetched from Kaggle with:

make data.download   # downloads creditcard.csv into ./dataset/

Make sure ~/.kaggle/kaggle.json is in place (or set KAGGLE_USERNAME / KAGGLE_KEY in .env).

The script looks like this:

# Load environment variables from .env file
load_dotenv()

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
DATASET_LOCAL_DIR = os.path.join(Path(ROOT_DIR).parent.absolute(), 'dataset')
PARALLEL_INVOCATION = False

if __name__ == "__main__":
    # Load credit card fraud dataset
    data = pd.read_csv(f"{DATASET_LOCAL_DIR}/creditcard.csv", delimiter=',')

    # Get fraud and non-fraud counts
    nonfrauds, frauds = data.groupby('Class').size()
    print('Number of frauds: ', frauds)
    print('Number of non-frauds: ', nonfrauds)
    print('Percentage of fradulent data:', 100. * frauds / (frauds + nonfrauds))

    # Split features and labels
    feature_columns = data.columns[:-1]
    label_column = data.columns[-1]

    features = data[feature_columns].values.astype('float32')
    labels = (data[label_column].values).astype('float32')

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels
    )

    if PARALLEL_INVOCATION:
        # Run multiple simulations in parallel using threads
        threads = []
        for i in range(10):
            thread = Thread(target=generate_data, args=(np.copy(X_test), 100_000))
            threads.append(thread)
            thread.start()
        for thread in threads:
            thread.join()
        print("All simulations completed.")
    else:
        # Run one simulation sequentially
        generate_data(np.copy(X_test), max_requests=100_000)  # Adjust max_requests as needed

Note

Make sure your virtual environment has the project installed (make setup once is enough; it runs pip install -e .[all]).

You can then run the simulator with:

make data.generate
# or directly
python scripts/generate_data.py

With the API receiving traffic, the Glue job consuming the Kinesis stream and the predictions being persisted in RDS, you now have a working end-to-end real-time fraud detection pipeline on AWS. A Streamlit dashboard is also available under app/streamlit/ (run make compose.up to start it on http://localhost:8501) to visualise predictions live.

End-to-End Real-time Fraud Detection on AWS

The project showcases a production-ready streaming architecture that integrates Amazon SageMaker for training both supervised and unsupervised ML models and deploying them as managed endpoints.

Architecture overview:

Architecture Diagram

Pre-requisites

An AWS account with appropriate permissions to create and manage resources such as S3, SageMaker, Kinesis, Lambda, Glue and RDS.
AWS CLI v2 installed and configured on your local machine.
Terraform (>= 1.5) installed for infrastructure provisioning.
Python 3.10+ installed for running the training notebooks and the API application.
A Kaggle account with an API token at ~/.kaggle/kaggle.json (used by make data.download).
Basic understanding of machine learning concepts, AWS services, and infrastructure-as-code (IaC) practices.

The whole workflow is wrapped in a top-level Makefile — run make help to list every target.

Key Components

Data Ingestion: Simulated real-time transaction data is ingested into Amazon Kinesis Data Streams.
Data Processing: AWS Glue or Spark processes the streaming data, performing feature engineering and transformations.
Model Training: Amazon SageMaker is used to train both supervised (e.g., XGBoost) and unsupervised (e.g., Random Cut Forest for anomaly detection) models on the historical dataset.
Model Deployment: Trained models are deployed as SageMaker endpoints for real-time inference.
API Layer: A REST API is created using AWS Chalice, which serves as the interface for making predictions and receiving fraud alerts.
Monitoring: CloudWatch is used to monitor the performance of the deployed models and the overall system. The project also includes scripts for simulating transaction data and testing the end-to-end pipeline.

Configure AWS credentials

$ aws configure

Infrastructure Provisioning

Services Overview

S3 Buckets

# s3.tf
resource "aws_s3_bucket" "fraud_data_bucket" {
    bucket        = "fraud-detection-data-bucket-${var.aws_region}"
    force_destroy = true

    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}
resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
    bucket = aws_s3_bucket.fraud_data_bucket.id

    rule {
        id     = "expire-old-data"
        status = "Enabled"
        filter {
            prefix = ""
        }
        expiration {
            days = 90
        }
    }
}
resource "aws_s3_bucket" "fraud_streaming_bucket" {
    bucket        = "fraud-detection-stream-bucket-${var.aws_region}"
    force_destroy = true

    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

Kinesis Streams

# Kinesis Stream
resource "aws_kinesis_stream" "fraud_predictions_stream" {
    name             = var.kinesis_stream_name
    shard_count      = var.kinesis_shard_count
    retention_period = 24 # in hours
    tags = {
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

RDS PostgreSQL

# PostgreSQL RDS Instance
resource "aws_db_instance" "fraudit_postgres" {
    identifier            = "fraudit-postgres-db"
    engine                = "postgres"
    engine_version        = "17.5" # Version supported by AWS RDS
    instance_class        = "db.t3.micro"
    allocated_storage     = 20
    max_allocated_storage = 100
    storage_type          = "gp2"

    db_name  = var.postgres_db
    username = var.postgres_user
    password = var.postgres_password
    port     = var.postgres_port

    # Network configuration
    db_subnet_group_name   = aws_db_subnet_group.fraudit_subnet_group.name
    vpc_security_group_ids = [aws_security_group.rds_sg.id]
    publicly_accessible    = true # Required for access from Glue without VPC

    # Backup configuration
    backup_retention_period = 7
    backup_window           = "03:00-04:00"
    maintenance_window      = "sun:04:00-sun:05:00"

    # Development configuration
    skip_final_snapshot = true
    deletion_protection = false
    apply_immediately   = true

    # Monitoring disabled for simplicity
    performance_insights_enabled = false
    monitoring_interval          = 0

    tags = {
        Name        = "fraudit-postgres"
        Environment = "dev"
        Project     = "fraud-detection"
    }
}

Tip

Glue

The terraform configuration for the Glue job is defined in the glue.tf file. below is an excerpt of the Glue job definition:

# glue.tf
resource "aws_glue_job" "fraudit_streaming_job" {
    name     = "fraudit-streaming-job"
    role_arn = aws_iam_role.glue-role.arn
    # Job entrypoint
    command {
        script_location = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/spark-jobs/glue_job.py"
        python_version  = "3"
    }

    glue_version = "4.0"
    max_capacity = 2

    default_arguments = {
        "--job-language"                     = "python"
        "--additional-python-modules"        = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/wheel/fraudit-0.0.1-py3-none-any.whl"
        "--python-modules-installer-option"  = "--upgrade"
        "--extra-jars"                       = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/jars/spark-streaming-sql-kinesis-connector_2.12-1.0.0.jar"

        "--postgres_host"                    = aws_db_instance.fraudit_postgres.address
        "--postgres_port"                    = aws_db_instance.fraudit_postgres.port
        "--postgres_user"                    = var.postgres_user
        "--postgres_password"                = var.postgres_password
        "--postgres_db"                      = var.postgres_db

        "--kinesis_stream"                   = aws_kinesis_stream.fraud_predictions_stream.name
        "--kinesis_endpoint"                 = "https://kinesis.${var.aws_region}.amazonaws.com"
        "--aws_region"                       = var.aws_region

        "--s3_checkpoint_bucket"             = "s3://${aws_s3_bucket.fraud_streaming_bucket.bucket}/checkpoints/"

        # Monitoring
        "--enable-continuous-cloudwatch-log" = "true"
        "--enable-job-insights"              = "true"
        "--enable-metrics"                   = "true"
        "--enable-spark-ui"                  = "true"
        "--enable-auto-scaling"              = "true"
    }

    tags = {
        Project     = "fraud-detection"
        Environment = "dev"
    }
}

Let's break down some of the key configurations for the Glue job:

script_location: This specifies the location of the Glue job script in S3. The script contains the logic for reading from the Kinesis stream, processing the data, and writing the results to RDS.
additional-python-modules: This argument allows you to specify additional Python modules that are required for the Glue job. In this case, we are referencing a wheel package that contains the necessary dependencies for the Glue job, which is also stored in S3.
extra-jars: This argument is used to specify additional JAR files that are required for the Glue job. Here, we are referencing the Spark Kinesis connector JAR file, which is necessary for reading from the Kinesis stream in the Glue job.
postgres_host, postgres_port, postgres_user, postgres_password, postgres_db: These arguments are used to pass the PostgreSQL database connection details to the Glue job, allowing it to write the processed data into the RDS instance.
kinesis_stream and kinesis_endpoint: These arguments specify the name of the Kinesis stream and the endpoint URL for connecting to the Kinesis service.
s3_checkpoint_bucket: This argument specifies the S3 bucket location for storing the checkpoints of the Spark structured streaming job, which is essential for fault tolerance and ensuring that the job can resume from the last checkpoint in case of failures.

SageMaker AI

resource "aws_sagemaker_notebook_instance" "fraudit_notebook" {
  name          = var.sagemaker_notebook_name
  instance_type = var.sagemaker_notebook_instance_type
  role_arn      = aws_iam_role.sagemaker_execution_role.arn
  volume_size   = var.sagemaker_root_volume_size

  # Optional VPC configuration - only if subnet is provided
  subnet_id = var.sagemaker_subnet_id != "" ? var.sagemaker_subnet_id : null

  # Security groups - only if subnet is provided (VPC attachment)
  security_groups = length(var.sagemaker_security_group_ids) > 0 ? var.sagemaker_security_group_ids : null

  # Optional lifecycle configuration - only if content is provided
  lifecycle_config_name = var.sagemaker_lifecycle_startup_content != "" ? aws_sagemaker_notebook_instance_lifecycle_configuration.fraudit_nb_lifecycle[0].name : null

  tags = {
    Project     = "fraud-detection"
    Environment = "dev"
  }
}

Deploying the infrastructure

# dev.tfvars
aws_region = "eu-west-1"

kinesis_stream_name = "fraud-predictions-stream"
kinesis_shard_count = 4

postgres_user     = "postgres_user"
postgres_password = "postgres_password"  # CHANGE ME before apply
postgres_db       = "fraudit_postgres_db"
postgres_port     = 5432

sagemaker_notebook_name           = "fraudit-notebook"
sagemaker_notebook_instance_type  = "ml.t3.medium"
sagemaker_root_volume_size        = 5
sagemaker_subnet_id               = ""          # keep empty for no VPC attachment in dev
sagemaker_security_group_ids      = []          # list of SG IDs if attaching to a VPC
sagemaker_lifecycle_startup_content = "enabled"

Warning

Once you have updated the variables, you can deploy the infrastructure with the Make targets exposed at the root of the repository:

make tf.init
make tf.plan       # uses devops/infra/dev/dev.tfvars
make tf.apply

Under the hood these wrap terraform init / plan / apply -var-file=dev.tfvars inside devops/infra/dev/.

kinesis_stream_name: the name of the Kinesis stream that was created for ingesting transaction data.
lambda_exec_role_arn: the ARN of the IAM role that is assigned to the AWS Lambda function for invoking SageMaker endpoints.
rds_postgres_endpoint: the endpoint URL of the RDS PostgreSQL instance that is beused for storing transaction data and model predictions.
rds_postgres_port: the port number for connecting to the RDS PostgreSQL instance.

You can use the output values to configure your .env file that will be used by the training notebooks and the API application to connect to the appropriate AWS resources. Two options:

Copy the output values manually from the terminal and paste them into a .env file (you can start from .env.example).
Run the dedicated Make target, which reads the Terraform outputs and writes them straight into .env (it also fills iam_role_arn in app/api/.chalice/config.json):

make env.sync

Implementation and Deployment

Fraud detection ML models

The machine learning models for fraud detection are implemented in Jupyter notebooks designed to run on the SageMaker notebook instance provisioned by Terraform.

Before opening the notebooks, push the dataset to S3 and sync the notebook instance:

make data.download   # Kaggle → ./dataset/creditcard.csv
make data.upload     # ./dataset/ → s3://$SPARK_DATA_BUCKET/dataset/
make sagemaker.sync  # uploads notebooks to S3 and (re)starts the notebook instance

Then open Jupyter and run both notebooks to deploy the two SageMaker endpoints used by the API.

API Layer

The Chalice application lives in app/api/:

app/api/
├── app.py
├── requirements.txt
└── .chalice/
    └── config.json

The app.py file defines the API endpoints used to invoke the SageMaker models, and requirements.txt lists the Lambda dependencies.

The next step is to configure app/api/.chalice/config.json to specify the AWS region and other settings for the Chalice application:

{
    "version": "2.0",
    "app_name": "ml-inference-api",
    "stages": {
        "dev": {
            "api_gateway_stage": "api",
            "manage_iam_role": false,
            "iam_role_arn": "arn:aws:iam::xxxxxxxxxxxx:role/ml-inference-api-lambda-exec-role",
            "environment_variables": {
                "solution_prefix": "fraud-detection",
                "stream_name": "fraud-predictions-stream",
                "aws_region": "eu-west-1"
            }
        }
    }
}

After configuring the config.json file, we can proceed to implement the API endpoints in the app.py file that will look like this:

@app.route('/predict', methods=['POST'])
def predict():
    event = app.current_request.json_body

    metadata = event.get('metadata')
    data_payload = event.get('data')

    if not metadata or not data_payload:
        raise BadRequestError("Requête invalide : champ 'metadata' ou 'data' manquant.")

    anomaly = get_anomaly_prediction(data_payload)
    fraud = get_fraud_prediction(data_payload)

    output = {
        "anomaly_detector": anomaly,
        "fraud_classifier": fraud
    }

    store_data_prediction(output, metadata) # store the prediction results in Kinesis stream for further processing

    return output

Once the API endpoints are implemented, deploy the Chalice application to AWS with:

make chalice.deploy

This target runs chalice deploy from app/api/ and writes the resulting REST API URL back into .env as CHALICE_API_URL, ready to be consumed by the simulator.

Glue Job for Data Processing

def main():
    """Main execution function"""
    logger.info("Starting Fraud Detection Streaming Job")
    
    # Configure checkpoint location
    checkpoint_location = f"s3a://{args['s3_checkpoint_bucket']}/checkpoints/{args['JOB_NAME']}/"
    logger.info(f"Using checkpoint location: {checkpoint_location}")
    
    # Read from Kinesis stream
    kinesis_df = (
        spark.readStream
        .format("aws-kinesis")
        .option("kinesis.streamName", args["kinesis_stream"])
        .option("kinesis.region", args["aws_region"])
        .option("kinesis.startingposition", "TRIM_HORIZON")  # Start from the latest data
        .option("kinesis.endpointUrl", f"https://kinesis.{args['aws_region']}.amazonaws.com")
        .load()
    )
    
    logger.info("Successfully configured Kinesis stream reader")
    
    # Start streaming query
    query = (
        kinesis_df
        .writeStream
        .foreachBatch(process_kinesis_batch)
        .outputMode("append")
        .option("checkpointLocation", checkpoint_location)
        .trigger(processingTime='30 seconds')
        .start()
    )
    
    logger.info("Streaming query started successfully")
    
    # Wait for termination
    query.awaitTermination()

Once the Glue job is defined, we need to package and deploy three artifacts to S3:

The fraudit Python wheel — built from pyproject.toml and exposing the streaming pipeline.
The Glue script src/fraudit/glue_job.py — the entry point referenced by glue.tf.
The Spark Kinesis connector JAR — required to read from Kinesis.

All three are produced and uploaded by a single Make target:

make glue.deploy

Generate Transaction Data and Test the Pipeline

The dataset itself is fetched from Kaggle with:

make data.download   # downloads creditcard.csv into ./dataset/

Make sure ~/.kaggle/kaggle.json is in place (or set KAGGLE_USERNAME / KAGGLE_KEY in .env).

The script looks like this:

# Load environment variables from .env file
load_dotenv()

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
DATASET_LOCAL_DIR = os.path.join(Path(ROOT_DIR).parent.absolute(), 'dataset')
PARALLEL_INVOCATION = False

if __name__ == "__main__":
    # Load credit card fraud dataset
    data = pd.read_csv(f"{DATASET_LOCAL_DIR}/creditcard.csv", delimiter=',')

    # Get fraud and non-fraud counts
    nonfrauds, frauds = data.groupby('Class').size()
    print('Number of frauds: ', frauds)
    print('Number of non-frauds: ', nonfrauds)
    print('Percentage of fradulent data:', 100. * frauds / (frauds + nonfrauds))

    # Split features and labels
    feature_columns = data.columns[:-1]
    label_column = data.columns[-1]

    features = data[feature_columns].values.astype('float32')
    labels = (data[label_column].values).astype('float32')

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels
    )

    if PARALLEL_INVOCATION:
        # Run multiple simulations in parallel using threads
        threads = []
        for i in range(10):
            thread = Thread(target=generate_data, args=(np.copy(X_test), 100_000))
            threads.append(thread)
            thread.start()
        for thread in threads:
            thread.join()
        print("All simulations completed.")
    else:
        # Run one simulation sequentially
        generate_data(np.copy(X_test), max_requests=100_000)  # Adjust max_requests as needed

Note

Make sure your virtual environment has the project installed (make setup once is enough; it runs pip install -e .[all]).

You can then run the simulator with:

make data.generate
# or directly
python scripts/generate_data.py

End-to-End Real-time Fraud Detection on AWS

Pre-requisites

Key Components

Configure AWS credentials

Infrastructure Provisioning

Services Overview

S3 Buckets

Kinesis Streams

RDS PostgreSQL

Glue

SageMaker AI

Deploying the infrastructure

Implementation and Deployment

Fraud detection ML models

API Layer

Glue Job for Data Processing

Generate Transaction Data and Test the Pipeline

Built by Godwin AMEGAH

Want to Contribute?

End-To-End Near Real-time Traffic Monitoring

HealthyLife

Related Projects

MarketSeek

End-to-End Real-time Fraud Detection on AWS

Pre-requisites

Key Components

Configure AWS credentials

Infrastructure Provisioning

Services Overview

S3 Buckets

Kinesis Streams

RDS PostgreSQL

Glue

SageMaker AI

Deploying the infrastructure

Implementation and Deployment

Fraud detection ML models

API Layer

Glue Job for Data Processing

Generate Transaction Data and Test the Pipeline

Built by Godwin AMEGAH

Want to Contribute?

End-To-End Near Real-time Traffic Monitoring

HealthyLife

Related Projects

MarketSeek