Hack Your Data with Insightful Python Models
I gained my knowledge by working hard for years and experiencing both good and bad things along the way. We should appreciate every fault we make; some call it maturity, I call it wisdom. It is how we enhance our capabilities.
We rarely see people around us accepting their faults; mostly we blame others or circumstances. On top of that, our generation wants everything to happen immediately, which causes more faults. Since I have made a short speech before starting the post, I will close it with this: good things take time.

Today’s topic focuses on how to generate insights from data related to a problem statement. A brand’s focus could be on finance, manufacturing, energy, or even logistics. As we discussed, there are several machine learning models that we can use while examining the data for better conversion rates or sales. This time, I will explain which types of analysis you can use for your current problem.
Problem Statement
To move forward, we must understand the problem of the business or the data. Never forget to start by asking why, as Simon Sinek put it back in 2009; I am dropping his video here. This doesn't mean we cannot run an analysis without a problem statement; of course, we can. While identifying the problem, we can use techniques such as SWOT analysis, KPIs, competitive analysis, and so on. I assume you already know what these are and how to apply them to your current situation. Now that we know our problem, the following are the main problem types we will cover over this post and the next one.
- Real-Time Optimization and Anomalies
- Strategic Optimization and Trends
- Predictive Analysis and Forecasting
- Personalization
Problem Type: Real-Time Optimization
How to Detect: Anomalies can show up in website traffic or sensor data, as sudden changes in overall revenue, or as a rise in negative customer feedback.
Approach: If your business encounters a real-time problem, you must make an immediate decision to solve it. The quickest way to find the root cause is to detect the moment when the problem started and check for any deployments made at that time. Sudden changes, like real-time anomalies, can be caused by changes made on both the backend and frontend. I highly recommend using real-time dashboards like Tableau to receive notifications when anomalies occur.
Solution 1: Anomaly Detection using Isolation Forest
Anomalies can be detected through real-time dashboards, statistical techniques, time series analysis, or machine learning algorithms. Since we don’t have a real-time database to work on, in this example, we will use past data. This example will include the Isolation Forest algorithm from the “scikit-learn” library, which is a popular method for anomaly detection.
Make sure you have installed “scikit-learn” and the other required libraries listed below.
pip install scikit-learn pandas matplotlib
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
from datetime import datetime
Then, load your dataset with the column names given below. If the size of your dataset changes, you may also want to adjust the figure size in the code. The code will plot your data with the given data points.
# Load the dataset
df = pd.read_csv('sample_stock_prices.csv')
# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Extract features for anomaly detection
X = df[['value']].values
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(df["date"], df["value"], label="Value")
plt.title("Time Series Data with Anomalies")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.show()
Finally, we need to prepare our data for the Isolation Forest and run the model on it. The code below fits the model and plots your data together with the detected anomalies.
# Initialize and fit the Isolation Forest model
contamination = 0.05 # Adjust this value based on your expectations of anomaly percentage
model = IsolationForest(contamination=contamination, random_state=42)
model.fit(X)
# Predict anomalies
df['anomaly'] = model.predict(X)
# Create a boolean mask for anomalies (1 for normal, -1 for anomalies)
anomalies = df['anomaly'] == -1
# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['value'], label='Normal')
plt.scatter(df.loc[anomalies, 'date'], df.loc[anomalies, 'value'], color='red', label='Anomaly')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Anomaly Detection using Isolation Forest')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Print summary of anomalies
print(f"Total number of data points: {len(df)}")
print(f"Number of anomalies detected: {anomalies.sum()}")
print(f"Percentage of anomalies: {anomalies.sum() / len(df) * 100:.2f}%")
# Display the anomalies
print("\nAnomalies:")
print(df[anomalies][['date', 'value']])

Strategies
- Understanding the Reasons Behind: By monitoring live data, you can find where the problem originates or which change caused it. A teammate can change your code without your knowledge; this happens almost every day somewhere.
- Setting Alarms for Specific Intervals: The most critical step in ensuring rapid intervention is an alarm system with the machine learning model integrated into it; a minimal sketch of such a check follows below.
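As a rough sketch of that second point, assuming the Isolation Forest model fitted above and a hypothetical send_alert helper (your real notification channel will differ), a periodic check on a fresh batch of observations could look like this:
# Hypothetical notification hook: replace the print with email, Slack, or your dashboard's API
def send_alert(message):
    print(f"ALERT: {message}")
# Check the latest batch of observations against the already fitted Isolation Forest model
def check_new_batch(model, new_data):
    # predict() returns -1 for anomalies and 1 for normal points
    predictions = model.predict(new_data[['value']].values)
    anomalies = new_data[predictions == -1]
    if not anomalies.empty:
        send_alert(f"{len(anomalies)} anomalies detected in the latest batch")
    return anomalies
In practice, you would run this check on a schedule, for example with a cron job or a workflow orchestrator, against the newest slice of your data.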
Problem Type: Strategic Optimization
How to Detect: SWOT results, a decline in direct users of the website/app, product development issues, and budget allocation challenges are all signals that call for strategic optimization.
Approach: To achieve the desired outcomes, we should start by understanding market dynamics, customer segments, sentiment analysis, competitors, supply-chain optimization, and customer lifetime value. These elements provide valuable insights. Additionally, descriptive analytics always helps in understanding past performance to develop new strategies, such as sales performance, customer behavior, website analytics, and financial reporting. Up to this point, we have discussed the theory behind optimization, and now it is time to implement it with code.
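As a minimal descriptive sketch, assuming the same transaction data used in Solution 1 below (user_id, transaction_date, and revenue columns), total revenue per customer, which is a crude lifetime value proxy, and monthly sales can be summarized like this:
import pandas as pd
# Assumption: same columns as the RFM example below (user_id, transaction_date, revenue)
df = pd.read_csv('sample_transaction_data.csv')
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
# Crude customer lifetime value proxy: total revenue per customer to date
clv = df.groupby('user_id')['revenue'].sum().sort_values(ascending=False)
print(clv.head(10))
# Descriptive sales performance: total revenue per month
monthly_sales = df.groupby(df['transaction_date'].dt.to_period('M'))['revenue'].sum()
print(monthly_sales)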
Solution 1: User Analysis with RFM Segmentation
Think about a retail company that wants to analyze its products’ go-to-market strategy. Its aim is to find high-value customers, occasional buyers, and price-sensitive shoppers based on past data, so cluster analysis is the best way for it to understand user needs while segmenting them.
Below, you can find detailed Python code for RFM. This is one of the most common segmentation approaches, and you can also achieve it with methods such as K-means, hierarchical clustering, or DBSCAN; a short K-means sketch on the resulting RFM table follows the code below.
We assume that we have a dataset with user_id, transaction_date, and revenue. In case of any inconsistent formats, converting the date column to datetime is the safest first step.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
# Load the dataset
df = pd.read_csv('sample_transaction_data.csv')
# Convert transaction_date to datetime
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
The next step is to calculate the RFM metrics. Recency needs a reference date, and the choice is yours: either the latest transaction_date in the dataset or the current date.
# Set the date of analysis (usually today's date or the most recent date in your dataset)
date_of_analysis = df['transaction_date'].max()
# Calculate Recency, Frequency, and Monetary values for each customer
rfm = df.groupby('user_id').agg({
    'transaction_date': lambda x: (date_of_analysis - x.max()).days,
    'revenue': ['count', 'sum']
})
# Flatten column names
rfm.columns = ['Recency', 'Frequency', 'Monetary']
# Reset index to make user_id a column
rfm = rfm.reset_index()
Then, assign the RFM scores. I am not going into the details of the R, F, and M metrics and what their values mean; nearly every middle school student knows them by now.
# Function to assign scores on a 1-5 scale
def assign_score(value, bins, labels=[1, 2, 3, 4, 5]):
    return pd.cut(value, bins=bins, labels=labels, include_lowest=True)
# Assign scores for each RFM metric
# Recency: fewer days since the last purchase is better, so its labels are reversed
rfm['R_Score'] = assign_score(rfm['Recency'], bins=[0, 30, 60, 90, 180, float('inf')], labels=[5, 4, 3, 2, 1])
rfm['F_Score'] = assign_score(rfm['Frequency'], bins=5)
rfm['M_Score'] = assign_score(rfm['Monetary'], bins=5)
# Calculate overall RFM score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# Convert RFM_Score to integer for easier analysis
rfm['RFM_Score'] = rfm['RFM_Score'].astype(int)
# Print summary statistics
print(rfm.describe())
Lastly, we have customers scored on each metric, but we need to put all three together to make the RFM segmentation meaningful.
# Plot distribution of RFM scores
plt.figure(figsize=(12, 6))
rfm['RFM_Score'].hist(bins=20)
plt.title('Distribution of RFM Scores')
plt.xlabel('RFM Score')
plt.ylabel('Count')
plt.show()
# Plot scatter plot of Recency vs Monetary value
plt.figure(figsize=(12, 6))
scatter = plt.scatter(rfm['Recency'], rfm['Monetary'], c=rfm['Frequency'], cmap='viridis', alpha=0.6)
plt.colorbar(scatter, label='Frequency')
plt.title('Recency vs Monetary Value (color: Frequency)')
plt.xlabel('Recency (days)')
plt.ylabel('Monetary Value')
plt.show()
# Display top 10 customers by RFM score
print("\nTop 10 customers by RFM score:")
print(rfm.sort_values('RFM_Score', ascending=False).head(10))
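As mentioned earlier, K-means is one of the alternative methods. Here is a minimal sketch that clusters the same RFM table; the choice of four clusters is an assumption you should tune, for example with the elbow method or the silhouette score.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Scale the raw RFM features so that no single metric dominates the distance calculation
features = rfm[['Recency', 'Frequency', 'Monetary']]
scaled_features = StandardScaler().fit_transform(features)
# Assumption: 4 clusters; tune this number for your own data
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Cluster'] = kmeans.fit_predict(scaled_features)
# Inspect average R, F, and M per cluster to interpret the segments
print(rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean())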


Strategies
- Best and Loyal: These are the brand ambassadors. Offering exclusive discounts or early access to new products makes them stick to the brand even more.
- Frequent and Big Spenders: A marketer should always think about ways to turn any customer into a loyal one. Inviting them to join loyalty programs for better prices and personalized product bundles is the way to do it.
- Lost Customers: A brand needs to focus on encouraging them to purchase more frequently, using targeted offers or win-back campaigns to re-engage them. A short sketch that maps RFM scores to these segments follows below.
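As a rough sketch of how such segments could be derived from the scores computed above, here is one possible mapping; the thresholds and segment boundaries are assumptions and should be adapted to your own score distribution.
# Assumption: simple threshold-based mapping from R/F/M scores (5 = best) to the segments above
def label_segment(row):
    r, f, m = int(row['R_Score']), int(row['F_Score']), int(row['M_Score'])
    if r >= 4 and f >= 4 and m >= 4:
        return 'Best and Loyal'
    elif f >= 4 or m >= 4:
        return 'Frequent and Big Spenders'
    elif r <= 2:
        return 'Lost Customers'
    else:
        return 'Others'
rfm['Segment'] = rfm.apply(label_segment, axis=1)
print(rfm['Segment'].value_counts())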
Solution 2: User Feedback with Sentiment Analysis
This time, our aim is to understand the most common pain points in user feedback. Sentiment analysis is a powerful tool for discovering the reality behind poor customer satisfaction. It uses natural language processing techniques to determine the sentiment of each piece of text. The code below not only groups customer feedback by sentiment category but also highlights the n-grams: an n-gram is a contiguous sequence of n words, and counting n-grams reveals the most repeated words or word groups within the desired sentiment group.
First, ensure that you have the related Python libraries installed and imported:
pip install pandas nltk matplotlib
pip install wordcloud
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Download necessary NLTK data
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()
# Function to perform sentiment analysis
def analyze_sentiment(text):
    return sia.polarity_scores(text)
Let’s say we have a CSV file with a feedback column.
# Load the dataset
df = pd.read_csv('customer_feedback.csv')
# Perform sentiment analysis
df['sentiment_scores'] = df['feedback'].apply(analyze_sentiment)
df['compound_score'] = df['sentiment_scores'].apply(lambda score_dict: score_dict['compound'])
Then the compound score from the sentiment analyzer is mapped to a category in the code below.
# Categorize sentiment
def categorize_sentiment(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'
df['sentiment_category'] = df['compound_score'].apply(categorize_sentiment)
# Print summary
print(df['sentiment_category'].value_counts(normalize=True))
Lastly, here are the visualizations: the sentiment distribution, the top 2-grams in negative feedback, and a word cloud of all feedback.
# Visualize sentiment distribution
plt.figure(figsize=(10, 6))
df['sentiment_category'].value_counts().plot(kind='bar')
plt.title('Distribution of Sentiment Categories')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
# 2-gram analysis for negative feedback
def get_2grams(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return list(ngrams(tokens, 2))
# Apply 2-gram analysis to negative feedback
negative_feedbacks = df[df['sentiment_category'] == 'Negative']['feedback']
all_2grams = [gram for text in negative_feedbacks for gram in get_2grams(text)]
gram_freq = Counter(all_2grams)
# Visualize top 2-grams
plt.figure(figsize=(12, 6))
top_grams = dict(gram_freq.most_common(10))
plt.bar(range(len(top_grams)), list(top_grams.values()), align='center')
plt.xticks(range(len(top_grams)), [' '.join(gram) for gram in top_grams.keys()], rotation=45, ha='right')
plt.title('Top 10 2-grams in Negative Feedback')
plt.xlabel('2-gram')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Word Cloud for all feedbacks
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(tokens)
all_text = ' '.join(df['feedback'].apply(preprocess_text))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Customer Feedback')
plt.show()

Strategies
Let me give a real-life example of this from my recent work.
- Most of the negative feedback contained the phrase “lifetime subscription”. Based on that, the brand should reconsider its subscription model.
- The other analysis surfaced “too big”, so this brand needs to focus on its clothing sizes to offer a better fit for its customers. A quick way to quantify such a phrase follows below.
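If you want to quantify how often a specific pain point shows up, a quick check like the one below is usually enough. The phrase here is only the example from my case, so replace it with whatever your n-grams surface.
# Count how many negative feedback entries mention a given phrase
phrase = 'lifetime subscription'  # example phrase; replace it with the pain point you are investigating
mask = negative_feedbacks.str.contains(phrase, case=False, na=False)
print(f"'{phrase}' appears in {mask.sum()} of {len(negative_feedbacks)} negative feedback entries")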
Solution 3: Supply Chain Optimization with Demand Forecasting
The headline might sound like it’s only about manufacturing or logistics, but it is not. This type of optimization involves improving the efficiency and effectiveness of any process, such as reducing costs, managing inventory, optimizing transportation, planning production, controlling spending, or enhancing service levels.
For manufacturing brands, we can discuss descriptive analytics to understand the current state of the supply chain. However, since we mostly deal with the digital side of things here, we will move forward with predictive analytics to forecast future trends and behaviors. With predictive analytics, brands can determine inventory levels and expansion plans, set KPIs, minimize costs, invest in raw materials, or even hire more people.
In this part, we will walk through demand forecasting using Python. When solving any problem related to numerical optimization, implementing machine learning models to predict fluctuations and upcoming trends is crucial.
Let’s start with importing the related libraries again.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error
First, plot the values we have to see what is going on.
# Load the data
df = pd.read_csv('sample_daily_data.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df = df.sort_index() # Ensure the data is sorted by date
# Plot the original data
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['value'])
plt.title('Original Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
After that, you can apply any model for demand forecasting, such as ARIMA, Prophet, LSTM, Random Forest, XGBoost, or Support Vector Regression. Honestly, I have worked with most of them; you can try several and keep the one that fits your data best (a minimal ARIMA sketch follows the Holt-Winters example below). Holt-Winters Exponential Smoothing is commonly used because it is effective at capturing seasonality and trends in time series. Below is the code for fitting the model and plotting the forecast against the train and test data.
# Split the data into train and test sets
train_size = int(len(df) * 0.8)
train, test = df[:train_size], df[train_size:]
# Determine the seasonal period
# Assuming daily data, we'll use a weekly seasonality
seasonal_period = 7
# Fit the ExponentialSmoothing model
model = ExponentialSmoothing(train['value'],
                             trend='add',
                             seasonal='add',
                             seasonal_periods=seasonal_period)
fit = model.fit()
# Make predictions
forecast = fit.forecast(len(test))
# Evaluate the model
mae = mean_absolute_error(test['value'], forecast)
rmse = np.sqrt(mean_squared_error(test['value'], forecast))
print(f'Mean Absolute Error: {mae:.2f}')
print(f'Root Mean Square Error: {rmse:.2f}')
# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(train.index, train['value'], label='Training Data')
plt.plot(test.index, test['value'], label='Actual Test Data')
plt.plot(test.index, forecast, label='Forecast')
plt.title('Forecast using ExponentialSmoothing')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Forecast future values
last_date = df.index[-1]
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=30)
future_forecast = fit.forecast(len(future_dates))
future_forecast.index = future_dates
# Plot including future forecast
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['value'], label='Historical Data')
plt.plot(future_forecast.index, future_forecast, label='Forecast', color='red')
plt.title('Forecast with Future Predictions')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()
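For comparison, here is a minimal sketch of one of the alternatives mentioned above, ARIMA from statsmodels; the (1, 1, 1) order is an assumption and should be tuned for your own series.
from statsmodels.tsa.arima.model import ARIMA
# Assumption: ARIMA(1, 1, 1); tune the (p, d, q) order for your own data
arima_model = ARIMA(train['value'], order=(1, 1, 1))
arima_fit = arima_model.fit()
arima_forecast = arima_fit.forecast(steps=len(test))
# Compare against the Holt-Winters error computed above
arima_mae = mean_absolute_error(test['value'], arima_forecast)
print(f'ARIMA Mean Absolute Error: {arima_mae:.2f}')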

Strategies
- Seasonality: Adjust inventory and marketing efforts accordingly. For example, retailers can predict high demand during the holiday season and stock up appropriately. Implement real-time pricing adjustments based on demand fluctuations to maximize revenue. For instance, travel companies can raise prices during peak seasons and offer discounts during off-peak periods.
- Growing Trend: Monitor market trends and customer feedback to innovate and align product offerings. For example, tech companies can develop new smart home devices to meet rising consumer interest. Adopt cloud computing and automation to handle increased demand efficiently. For instance, e-commerce platforms can scale their servers during major sales events to ensure smooth operations. A quick decomposition sketch to confirm the seasonality and trend follows this list.
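To confirm the seasonality and trend you are acting on, a quick decomposition of the series helps. This is a minimal sketch assuming the additive components and weekly period used in the model above.
from statsmodels.tsa.seasonal import seasonal_decompose
# Assumption: additive components and weekly seasonality, as in the Holt-Winters model above
decomposition = seasonal_decompose(df['value'], model='additive', period=7)
decomposition.plot()
plt.show()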
Conclusion
In this article, we explored how to approach two different types of problems as a data analyst. The Python code above only covers sample use cases; you can develop different methods suited to your specific problems and apply the code directly to your own datasets. In the next article, I will focus on the other two types of problems.