How Python Is Used in Modern Data Analytics Today: Recent Changes and Real-World Examples
Discover how Python is used in modern data analytics in 2026. Learn about Polars, DuckDB, LLM integration, and cloud-native patterns, with real-world code examples.
Python has long been the dominant language for data analytics, but the way organizations use it has changed significantly in recent years. The traditional workflow of loading data into Pandas DataFrames and performing exploratory analysis remains foundational, but modern analytics teams are adopting more sophisticated patterns that emphasize performance, scalability, and integration with broader data infrastructure. What looked like a mature ecosystem in 2023 has evolved into something more nuanced, with new tools and approaches addressing limitations that were previously accepted as inevitable.
The Modern Data Stack: What Changed in 2025-2026
The most visible shift in Python analytics is the move toward performance-first tooling. For years, Pandas was the default choice for almost any tabular data operation, but large-scale analytics often pushed against memory limits and execution time constraints. Organizations processing datasets in the hundreds of gigabytes or terabytes increasingly supplement or replace Pandas with alternatives designed from the ground up for performance.
Polars and DuckDB have emerged as the two most prominent alternatives to traditional Pandas workflows. Both leverage Apache Arrow under the hood, enabling zero-copy data exchange between components and more efficient memory usage. Polars, a Rust-based DataFrame library with a Python API, provides lazy evaluation and multi-threaded execution that can accelerate operations by an order of magnitude on large datasets. DuckDB brings analytical SQL capabilities directly into Python processes, enabling complex queries on local data without the overhead of a separate database server.
Another significant change is the integration of AI and large language models into everyday analytics workflows. In 2023, writing custom natural language processing pipelines required specialized ML skills. By 2026, analysts increasingly interact with LLMs through simplified APIs to extract insights from unstructured text, generate code for routine tasks, or explain complex findings to business stakeholders. This shift is making advanced capabilities accessible to a broader audience, though it also requires new skills around prompt engineering and validation.
Cloud-native patterns have also matured. Python clients for Snowflake, BigQuery, and Redshift have evolved beyond basic query execution to support advanced features like parameterized queries, batch operations, and result caching. Many organizations now design analytics workflows that push computation to the warehouse rather than pulling data into local memory, reducing data movement and enabling analysis of datasets that far exceed workstation capacity.
The Performance Revolution: Polars and DuckDB
The rise of Polars and DuckDB represents the most concrete technical change in how Python handles data analytics. Both tools address the same fundamental problem: Pandas was designed for single-machine, in-memory analytics at a time when datasets were smaller and business requirements were simpler.
Polars: Multi-Threaded DataFrames
Polars differs from Pandas in several important ways. Written in Rust with a Python wrapper, it uses lazy evaluation to build up an optimized query plan before execution. Operations are automatically parallelized across available CPU cores, and the library avoids materializing intermediate results unless necessary. For many workloads, particularly those involving filtering, aggregations, and joins on large datasets, the performance improvements can be dramatic.
A retail company analyzing sales data across 500 stores with millions of transaction rows provides a realistic example. Using Pandas, loading and aggregating this data might take several minutes and consume substantial memory. The same operation in Polars can complete in seconds with significantly lower memory overhead.
import polars as pl

# Read data from multiple CSV files efficiently
sales_df = pl.scan_csv("sales_data_*.csv")

# Build lazy query plan - nothing executes yet
monthly_sales = (
    sales_df
    .filter(pl.col("date") >= pl.datetime(2025, 1, 1))
    .group_by(["store_id", "product_category", pl.col("date").dt.month()])
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("transaction_id").count().alias("transaction_count"),
        pl.col("revenue").mean().alias("avg_transaction_value")
    ])
)

# Execute the optimized query once
result = monthly_sales.collect()
The key difference here is the use of scan_csv and collect rather than immediate execution. Polars builds a query plan, optimizes it, and only executes when collect is called. This approach allows Polars to push down predicates, reorder operations for efficiency, and minimize data movement. For large-scale enterprise analytics, this pattern has become increasingly common.
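The core idea — record operations, then run them once — can be sketched in a few lines of plain Python. The toy LazyFrame below is purely illustrative: it mimics the deferred-execution pattern, not Polars' actual optimizer or internals.

```python
# Toy sketch of lazy evaluation (illustrative only, not Polars internals):
# operations are recorded, not executed, until collect() is called, at which
# point they run fused in a single pass with no intermediate results.

class LazyFrame:
    def __init__(self, rows):
        self._rows = rows
        self._ops = []  # recorded plan: list of ("filter" | "map", fn)

    def filter(self, predicate):
        self._ops.append(("filter", predicate))
        return self

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def collect(self):
        out = []
        for row in self._rows:          # single pass over the data
            keep = True
            for kind, fn in self._ops:
                if kind == "filter":
                    if not fn(row):
                        keep = False    # row dropped early, later ops skipped
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out

lazy = (
    LazyFrame([{"revenue": 10}, {"revenue": 250}, {"revenue": 40}])
    .filter(lambda r: r["revenue"] > 20)
    .map(lambda r: {**r, "revenue_k": r["revenue"] / 1000})
)
print(lazy.collect())  # nothing executed until this line
```

Rows failing the filter are dropped before the map ever runs — a rough analogue of the predicate pushdown and operator fusion a real query engine performs.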
DuckDB: Analytical SQL in Python
DuckDB takes a different approach by bringing a full analytical SQL engine directly into Python processes. It acts like an embedded database that can query local files, Pandas DataFrames, or PyArrow tables without requiring data to be loaded into memory first. This pattern is particularly useful for ad-hoc analysis on large datasets where setting up a separate database server would be overkill.
Consider a financial services company analyzing customer transaction patterns to detect potential fraud. Analysts need to perform complex aggregations across millions of records while joining with reference data about merchant categories and historical fraud patterns. DuckDB enables this work entirely within Python, using familiar SQL syntax.
import duckdb

# Connect to an in-memory database
conn = duckdb.connect()

# DuckDB can query Parquet files directly without loading them first.
# Note: customer_history is assumed to be an existing table or registered
# DataFrame holding per-customer fraud statistics.
query = """
WITH daily_transactions AS (
    SELECT
        customer_id,
        transaction_date,
        COUNT(*) AS tx_count,
        SUM(amount) AS total_amount,
        AVG(amount) AS avg_amount
    FROM 'transactions_*.parquet'
    WHERE transaction_date >= '2025-01-01'
    GROUP BY customer_id, transaction_date
),
risk_scoring AS (
    SELECT
        t.*,
        h.historical_fraud_rate,
        CASE
            WHEN t.tx_count > 50 OR t.total_amount > 10000 THEN 'high'
            WHEN t.tx_count > 20 OR t.total_amount > 5000 THEN 'medium'
            ELSE 'low'
        END AS risk_level
    FROM daily_transactions t
    LEFT JOIN customer_history h ON t.customer_id = h.customer_id
)
SELECT
    risk_level,
    COUNT(DISTINCT customer_id) AS customer_count,
    AVG(tx_count) AS avg_daily_tx,
    AVG(total_amount) AS avg_daily_amount
FROM risk_scoring
GROUP BY risk_level
"""
result = conn.execute(query).fetchall()
DuckDB's ability to query Parquet files directly without loading them into memory is a significant advantage for large-scale analytics. It also integrates seamlessly with Python data structures: you can query a Pandas DataFrame using SQL and get results back as a DataFrame, or register a PyArrow table for fast analytics without conversion overhead. This interoperability has made DuckDB a popular choice for teams that want the performance benefits of SQL without leaving their Python environment.
Real-World Use Case: High-Performance Retail Analytics
A large retail chain operating across multiple regions provides a concrete example of how modern Python analytics tools work together in practice. The company needs to generate daily performance reports that combine sales data, inventory levels, and customer demographics across 1,200 stores. The raw data includes approximately 500 million transaction records per year, along with reference tables for products, customers, and locations.
Historically, this analysis would have required moving data into a warehouse and using BI tools for reporting. With modern Python tooling, the analytics team can process the data locally, combine it with external data sources, and deliver results in minutes rather than hours.
The workflow uses Polars for initial data aggregation, DuckDB for complex joins and analytical queries, and Pandas for final formatting and visualization. Each tool is used for what it does best: Polars for fast, parallel operations on raw data, DuckDB for complex analytical queries that benefit from SQL optimization, and Pandas for preparing output for presentation.
import polars as pl
import duckdb
import pandas as pd

# Step 1: Use Polars to aggregate transaction data efficiently
transactions = pl.scan_parquet("sales_2025*.parquet")
daily_store_performance = (
    transactions
    .filter(pl.col("transaction_date") >= pl.datetime(2025, 1, 1))
    .group_by(["store_id", "transaction_date"])
    .agg([
        pl.col("revenue").sum(),
        pl.col("quantity").sum(),
        pl.col("customer_id").n_unique().alias("unique_customers")
    ])
    .collect()
)

# Step 2: Use DuckDB for complex analytical queries with joins
# Note: store_reference is assumed to be an existing table (or registered
# DataFrame) mapping stores to regions and revenue targets.
conn = duckdb.connect()
conn.register("store_performance", daily_store_performance.to_arrow())
query = """
SELECT
    sp.store_id,
    r.region_name,
    DATE_TRUNC('week', sp.transaction_date) AS week_start,
    SUM(sp.revenue) AS weekly_revenue,
    SUM(sp.quantity) AS weekly_quantity,
    AVG(sp.unique_customers) AS avg_daily_customers,
    r.target_revenue,
    (SUM(sp.revenue) - r.target_revenue) / r.target_revenue AS variance_pct
FROM store_performance sp
JOIN store_reference r ON sp.store_id = r.store_id
GROUP BY sp.store_id, r.region_name, DATE_TRUNC('week', sp.transaction_date), r.target_revenue
HAVING SUM(sp.revenue) < r.target_revenue * 0.8 -- Flag underperforming stores
ORDER BY variance_pct ASC
LIMIT 50
"""
underperforming = conn.execute(query).fetchdf()

# Step 3: Use Pandas for final formatting and visualization
underperforming['week_start'] = pd.to_datetime(underperforming['week_start'])
underperforming_summary = underperforming.groupby('region_name')['variance_pct'].agg(['mean', 'count'])
This pattern demonstrates how modern Python analytics workflows combine tools based on their strengths rather than forcing everything into a single framework. The Polars aggregation handles the large dataset efficiently, DuckDB performs the complex analytical query with joins and target comparisons, and Pandas prepares the results for stakeholder presentation. For the analytics team, this approach reduced daily reporting time from four hours to under 15 minutes while enabling more sophisticated analysis.
AI and LLM Integration in Modern Analytics
Perhaps the most significant shift in Python analytics is the integration of large language models and generative AI into everyday workflows. While ML capabilities have been part of the Python ecosystem for years, the accessibility of LLMs through APIs has changed how analysts interact with unstructured data and how they communicate insights.
Text Analytics at Scale
Unstructured text data has always been challenging for analytics teams. Customer feedback, support tickets, product reviews, and email communications contain valuable insights, but extracting those insights traditionally required specialized NLP skills. In 2026, analysts increasingly use LLM-based tools to perform sentiment analysis, topic modeling, and summarization without building custom ML models.
Consider an e-commerce company analyzing customer reviews to identify product issues. Reviews are stored as free-form text alongside structured data about product categories, purchase dates, and customer demographics. Modern workflows combine structured analysis with LLM-powered text processing to surface patterns that would be difficult to detect manually.
import pandas as pd
from openai import OpenAI
import polars as pl
import duckdb

# Load structured review data
reviews = pl.read_parquet("product_reviews.parquet")

# Aggregate reviews by product for analysis
product_summary = (
    reviews
    .group_by("product_id", "product_category", "product_name")
    .agg([
        pl.len().alias("review_count"),
        pl.col("rating").mean().alias("avg_rating"),
        pl.col("rating").median().alias("median_rating")
    ])
    .filter(pl.col("review_count") >= 50)
)

# Identify products with declining ratings using DuckDB
# (DuckDB can query the `reviews` Polars DataFrame in local scope directly)
trend_analysis = duckdb.sql("""
    SELECT
        product_id,
        product_name,
        AVG(CASE WHEN review_date >= CURRENT_DATE - INTERVAL '30 days' THEN rating END) AS recent_rating,
        AVG(rating) AS historical_rating,
        AVG(rating) - AVG(CASE WHEN review_date >= CURRENT_DATE - INTERVAL '30 days' THEN rating END) AS rating_decline
    FROM reviews
    GROUP BY product_id, product_name
    HAVING AVG(rating) - AVG(CASE WHEN review_date >= CURRENT_DATE - INTERVAL '30 days' THEN rating END) > 0.5
    ORDER BY rating_decline DESC
""").to_df()

# Use LLM to summarize issues for declining products
client = OpenAI()

def summarize_product_issues(product_name, review_texts):
    """Summarize common issues from customer reviews"""
    reviews_sample = "\n".join(review_texts[:20])  # Limit to control token usage
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a product analyst. Identify the most common issues mentioned in customer reviews. Return a concise bulleted list."
            },
            {
                "role": "user",
                "content": f"Product: {product_name}\n\nRecent reviews:\n{reviews_sample}\n\nWhat are the top 3 issues customers are mentioning?"
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

# Apply LLM analysis to the most problematic products
for _, row in trend_analysis.head(5).iterrows():
    product_reviews = reviews.filter(pl.col("product_id") == row["product_id"])["review_text"].to_list()
    issues = summarize_product_issues(row["product_name"], product_reviews)
    print(f"{row['product_name']}: {issues}")
This workflow illustrates how LLMs augment rather than replace traditional analytics. The Polars and DuckDB components identify which products are experiencing issues based on rating trends, while the LLM provides qualitative insight into why those issues are occurring. For the product team, this combination provides both quantitative evidence of problems and qualitative context to inform remediation efforts.
Code Generation and Analysis Automation
Another area where LLMs are changing Python analytics is in code generation and workflow automation. Analysts increasingly use AI assistants to generate boilerplate code, suggest optimizations, or explain complex queries. While this does not eliminate the need for analytical skills, it significantly reduces the friction of working with unfamiliar tools or APIs.
More importantly, LLMs can help democratize advanced analytics techniques. An analyst who understands the business context but is less familiar with machine learning might use an LLM to generate code for clustering, anomaly detection, or forecasting models. The analyst then validates the results and interprets them for stakeholders, reducing the barrier to entry for advanced techniques.
A marketing analyst at a telecommunications company provides a realistic example of this pattern. The analyst needs to segment millions of customers into meaningful groups based on usage patterns, subscription tenure, and service preferences to design targeted retention campaigns. While the analyst understands the business requirements and customer behavior, they lack extensive experience with machine learning clustering algorithms. Using an LLM, they can generate a complete customer segmentation workflow that includes data preprocessing, K-means clustering with optimal cluster selection, and segment profiling metrics.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Generate synthetic data (simulating customer behavior)
def generate_customer_data(n_samples=200):
    np.random.seed(42)
    data = {
        'CustomerID': range(1, n_samples + 1),
        'Age': np.random.randint(18, 70, n_samples),
        'AnnualIncome(k$)': np.random.randint(15, 140, n_samples),
        'SpendingScore(1-100)': np.random.randint(1, 100, n_samples),
        'YearsAsCustomer': np.random.randint(0, 10, n_samples)
    }
    return pd.DataFrame(data)

df = generate_customer_data()

# 2. Data preprocessing
# Select relevant features for clustering
features = df[['Age', 'AnnualIncome(k$)', 'SpendingScore(1-100)', 'YearsAsCustomer']]

# Standardize the data (crucial for K-means)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# 3. Silhouette analysis for optimal cluster selection
silhouette_scores = []
K_range = range(2, 11)  # Test k from 2 to 10
best_k = 2
best_score = -1

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    labels = kmeans.fit_predict(scaled_features)
    score = silhouette_score(scaled_features, labels)
    silhouette_scores.append(score)
    # Track the best k based on silhouette score
    if score > best_score:
        best_score = score
        best_k = k

print(f"Optimal number of clusters (k) determined by silhouette analysis: {best_k}")

# 4. Final K-means clustering with optimal k
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init='auto')
df['Cluster'] = kmeans_final.fit_predict(scaled_features)

# 5. Segment profiling
# Calculate mean values for each cluster to understand business-relevant features
profile = df.groupby('Cluster').agg({
    'CustomerID': 'count',
    'Age': 'mean',
    'AnnualIncome(k$)': 'mean',
    'SpendingScore(1-100)': 'mean',
    'YearsAsCustomer': 'mean'
}).rename(columns={'CustomerID': 'CustomerCount'}).round(2)

print("\n--- Segment Profiles ---")
print(profile)

# 6. LLM-assisted interpretation prompt generation
def create_llm_insight_prompt(profile_df):
    """Create a text summary of the clusters to send to an LLM for business insights."""
    prompt = ("Act as a marketing analyst. Based on the following customer segment data, "
              "define the persona and suggest a marketing strategy for each group:\n\n")
    for cluster_id, row in profile_df.iterrows():
        prompt += f"Segment {cluster_id} (Size: {int(row['CustomerCount'])}):\n"
        prompt += f"- Average Age: {row['Age']:.1f}\n"
        prompt += f"- Average Income: ${row['AnnualIncome(k$)']:.1f}k\n"
        prompt += f"- Average Spending Score: {row['SpendingScore(1-100)']:.1f}\n"
        prompt += f"- Average Loyalty: {row['YearsAsCustomer']:.1f} years\n"
        prompt += "\n"
    return prompt

# Generate the prompt for the LLM
llm_context_prompt = create_llm_insight_prompt(profile)
print("\n--- Generated Context for LLM Analysis ---")
print(llm_context_prompt)
However, this trend also introduces new responsibilities. Analysts must develop skills in prompt engineering to get useful outputs, validation techniques to ensure generated code is correct, and security awareness to avoid exposing sensitive data through AI services. Organizations implementing AI-augmented analytics typically establish guidelines for what data can be shared with external services and what validation steps are required before acting on AI-generated insights.
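As one concrete illustration of such a guideline, a simple redaction pass can strip obvious identifiers before text leaves the organization. The two patterns below (email addresses and card-like digit runs) are illustrative examples only, not a complete PII-detection solution.

```python
import re

# Illustrative redaction applied before sending text to an external LLM API.
# These patterns are examples, not a complete PII-detection solution.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),     # email addresses
    (re.compile(r"\b(?:\d[ -]?){12,15}\d\b"), "[CARD]"),     # 13-16 digit card-like runs
]

def redact(text: str) -> str:
    """Replace matches of each pattern with a placeholder token."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

review = "Contact me at jane.doe@example.com, card 4111 1111 1111 1111 was charged twice."
print(redact(review))
```

In practice teams layer this kind of pre-filtering with allow-lists of approved data fields and human review of anything the model generates before it reaches production systems.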
Cloud-Native Analytics Patterns
The evolution of Python analytics is closely tied to the broader trend toward cloud-native data architectures. As organizations move data into cloud warehouses and data lakes, Python's role has shifted from local computation to orchestrating distributed workloads.
Warehouse-First Analytics
A growing pattern in 2026 is warehouse-first analytics, where Python code pushes computation to the warehouse rather than pulling data to local memory. This approach is enabled by mature Python clients for Snowflake, BigQuery, and Redshift that support parameterized queries, prepared statements, and efficient result handling.
Snowflake's Python connector provides a good example of this pattern. Analysts can write Python code that composes complex SQL queries, executes them in Snowflake's optimized engine, and retrieves only aggregated results. This allows analysis of petabyte-scale datasets from a laptop, as long as the computation happens in the warehouse.
import snowflake.connector
import pandas as pd

# Establish warehouse connection
# (authentication details omitted; use your organization's standard method,
# e.g. SSO or key-pair auth)
conn = snowflake.connector.connect(
    user="analytics_user",
    account="company_account",
    warehouse="compute_wh",
    database="analytics_db",
    schema="public"
)

# Compose a parameterized warehouse query (pyformat-style bindings)
query = """
WITH customer_cohorts AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', first_purchase_date) AS cohort_month,
        first_purchase_date
    FROM customer_dim
    WHERE first_purchase_date >= DATEADD(month, -%(lookback_months)s, CURRENT_DATE())
),
monthly_activity AS (
    SELECT
        c.cohort_month,
        c.customer_id,
        DATE_TRUNC('month', t.transaction_date) AS activity_month,
        DATEDIFF(month, c.cohort_month, DATE_TRUNC('month', t.transaction_date)) AS month_number,
        SUM(t.amount) AS spend
    FROM customer_cohorts c
    JOIN transactions t ON c.customer_id = t.customer_id
    WHERE t.transaction_date >= c.first_purchase_date
    GROUP BY c.cohort_month, c.customer_id, DATE_TRUNC('month', t.transaction_date)
)
SELECT
    cohort_month,
    month_number,
    COUNT(DISTINCT customer_id) AS active_customers,
    SUM(spend) AS total_spend
FROM monthly_activity
WHERE month_number <= 12
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number
"""

# Execute in the warehouse with bound parameters, retrieve only aggregates
cursor = conn.cursor()
cursor.execute(query, {"lookback_months": 12})
results = cursor.fetchall()

# Convert to Pandas for analysis
df = pd.DataFrame(results, columns=['cohort', 'month', 'customers', 'spend'])
df['cohort'] = pd.to_datetime(df['cohort'])

# Create retention visualization
cohort_pivot = df.pivot(index='cohort', columns='month', values='customers')
cohort_retention = cohort_pivot.div(cohort_pivot[0], axis=0)
The key insight here is that the heavy computation happens in Snowflake, not in Python. The warehouse's massively parallel processing handles joins, aggregations, and window functions across billions of rows. Python's role is to compose the query, retrieve aggregated results, and prepare visualizations. This pattern scales to datasets that would be impossible to process locally while maintaining the familiar Python workflow.
Serverless and Distributed Processing
Beyond warehouses, Python analytics increasingly integrates with serverless and distributed processing frameworks. Apache Beam and Ray provide Python APIs for writing analytics code that can execute at scale on managed services like Google Dataflow or AWS Batch. These tools handle distribution, fault tolerance, and resource management, allowing analysts to focus on business logic rather than infrastructure.
A logistics company analyzing GPS data from its fleet provides an example. Millions of GPS coordinates are generated daily across thousands of vehicles, and analysts need to identify patterns in route efficiency, delivery times, and driver behavior. Using a distributed processing framework, they can write Python transformations that scale automatically to process the entire dataset.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Example helper functions that would need real implementations in production
# def calculate_haversine_distance(coordinates):
#     """Calculate distance between GPS coordinates using the Haversine formula"""
#     # Implementation would process the coordinate list and return distance in km
#     pass
#
# def calculate_duration(coordinates):
#     """Calculate time duration from first to last coordinate"""
#     # Implementation would return duration in minutes
#     pass
#
# def get_route_baseline(route_id):
#     """Fetch expected time and distance for a route"""
#     # Implementation would query route reference data
#     pass

def analyze_route_efficiency(element):
    """Calculate efficiency metrics for one grouped route"""
    # GroupBy emits (key, grouped_records) pairs
    (route_id, vehicle_id), records = element
    coordinates = sorted(records, key=lambda r: r["timestamp"])

    # Calculate distance and time metrics
    # In production, calculate_haversine_distance() and calculate_duration()
    # would be implemented as sketched above
    distance_km = calculate_haversine_distance(coordinates)
    time_minutes = calculate_duration(coordinates)
    avg_speed = distance_km / (time_minutes / 60) if time_minutes > 0 else 0

    # Compare to expected baseline
    baseline = get_route_baseline(route_id)
    efficiency_score = baseline["expected_time"] / time_minutes if time_minutes > 0 else 0

    return {
        "route_id": route_id,
        "vehicle_id": vehicle_id,
        "distance_km": distance_km,
        "time_minutes": time_minutes,
        "avg_speed_kmh": avg_speed,
        "efficiency_score": efficiency_score,
        "date": coordinates[0]["timestamp"].date()
    }

# Build distributed pipeline
pipeline_options = PipelineOptions(
    runner="DataflowRunner",
    project="logistics-analytics",
    region="us-central1",
    staging_location="gs://analytics-bucket/staging",
    temp_location="gs://analytics-bucket/temp"
)

with beam.Pipeline(options=pipeline_options) as p:
    results = (
        p
        | "Read GPS data" >> beam.io.ReadFromParquet("gs://fleet-data/gps_2025*.parquet")
        # Records arrive as dicts, so group on a key function
        | "Group by route" >> beam.GroupBy(lambda r: (r["route_id"], r["vehicle_id"]))
        | "Analyze efficiency" >> beam.Map(analyze_route_efficiency)
        | "Filter inefficient routes" >> beam.Filter(lambda x: x["efficiency_score"] < 0.8)
        # WriteToParquet also requires a pyarrow schema for the output records
        | "Write results" >> beam.io.WriteToParquet("gs://analytics-results/inefficient_routes")
    )
This pattern allows Python code to scale horizontally without significant changes to the analytical logic. The same functions that work on a sample dataset locally can execute at scale on a managed service, handling millions of records in parallel. For organizations, this reduces the distinction between "data engineering" and "data analytics" tasks, as both can be expressed in Python with appropriate execution backends.
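For reference, the calculate_haversine_distance placeholder above can be filled in with standard-library math alone. The version below is one straightforward sketch; the coordinate format (an ordered list of dicts with "lat" and "lon" keys) is an assumption for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # Earth mean radius ~6371 km

def calculate_haversine_distance(coordinates):
    """Total path length over an ordered list of {'lat': ..., 'lon': ...} points."""
    return sum(
        haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
        for a, b in zip(coordinates, coordinates[1:])
    )

print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))  # Paris -> London, ~344 km
```

Because the function is plain Python with no framework dependency, the same code can be unit-tested locally and then shipped inside the Beam pipeline unchanged.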
Streaming and Real-Time Analytics
The final major trend in modern Python analytics is the shift toward real-time and streaming processing. While batch analytics remains important for reporting and historical analysis, organizations increasingly need to make decisions based on data as it arrives rather than after it has been stored and processed.
Real-Time Analytics with Python Frameworks
Real-time analytics differs from batch processing in several important ways. Data arrives continuously rather than in fixed batches, results must be computed within tight time windows, and system reliability becomes critical as decisions depend on current data. Python's ecosystem for streaming analytics has matured significantly, with frameworks like Apache Flink's Python API (PyFlink) and Bytewax providing developer-friendly interfaces for streaming applications.
A financial services company detecting fraudulent transactions in real time provides a compelling example. As each transaction is processed, the system must evaluate it against historical patterns, flag suspicious activity, and potentially block the transaction before it completes. This requires combining reference data, stateful operations, and machine learning models within sub-second latency requirements.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

# Example placeholder functions that would need implementation in production
# def get_customer_risk_profile(customer_id):
#     """Fetch historical risk profile for a customer"""
#     # Implementation would query customer reference data
#     pass
#
# def get_time_since_last_transaction(customer_id):
#     """Calculate time elapsed since the customer's last transaction"""
#     # Implementation would query transaction history
#     pass

def detect_fraud(transaction):
    """Evaluate a transaction against fraud risk patterns.

    Illustrative scoring logic; in production this function would be applied
    to the stream via the DataStream API (e.g. a map transformation), with
    the placeholder helpers above implemented.
    """
    customer_risk = get_customer_risk_profile(transaction["customer_id"])
    historical_avg = customer_risk["avg_transaction_amount"]
    time_since_last_tx = get_time_since_last_transaction(transaction["customer_id"])

    # Risk scoring logic
    risk_score = 0
    if transaction["amount"] > historical_avg * 5:
        risk_score += 30
    if time_since_last_tx.total_seconds() < 60:  # Multiple transactions within a minute
        risk_score += 25
    if transaction["merchant_category"] in customer_risk["high_risk_categories"]:
        risk_score += 20
    if transaction["location"] != customer_risk["typical_location"]:
        risk_score += 15

    return {
        "transaction_id": transaction["transaction_id"],
        "customer_id": transaction["customer_id"],
        "amount": transaction["amount"],
        "risk_score": risk_score,
        "timestamp": transaction["timestamp"],
        "is_fraud": risk_score > 70
    }

# Set up streaming environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)
t_env = StreamTableEnvironment.create(env)

# Define stream from Kafka
# (`timestamp` is a reserved word in Flink SQL, so it must be quoted with backticks)
t_env.execute_sql("""
    CREATE TABLE transactions (
        transaction_id STRING,
        customer_id STRING,
        amount DOUBLE,
        merchant_category STRING,
        location STRING,
        `timestamp` TIMESTAMP(3),
        WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'fraud-detection',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Process the stream and output velocity-based alerts
# (the fraud_alerts sink table is assumed to be defined elsewhere with matching columns)
t_env.execute_sql("""
    INSERT INTO fraud_alerts
    SELECT
        transaction_id,
        customer_id,
        amount,
        tx_count_1min,
        amount_5min,
        `timestamp`
    FROM (
        SELECT
            transaction_id,
            customer_id,
            amount,
            `timestamp`,
            -- Rolling window aggregation for velocity checks
            COUNT(*) OVER (
                PARTITION BY customer_id
                ORDER BY `timestamp`
                RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW
            ) AS tx_count_1min,
            SUM(amount) OVER (
                PARTITION BY customer_id
                ORDER BY `timestamp`
                RANGE BETWEEN INTERVAL '5' MINUTE PRECEDING AND CURRENT ROW
            ) AS amount_5min
        FROM transactions
    )
    WHERE tx_count_1min > 10 OR amount_5min > 5000
""")
This example demonstrates how Python code can express real-time analytics logic that executes at scale. The PyFlink framework handles the complex aspects of streaming: watermarks for out-of-order data, window aggregations, state management, and fault tolerance. The analyst focuses on business logic while the framework ensures the stream is processed reliably and efficiently.
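The watermark idea itself is worth seeing in miniature: a watermark trails the highest event time seen by an allowed lateness, and a window is closed only once the watermark passes its end. The toy sketch below illustrates those mechanics in plain Python; it is not how PyFlink is implemented.

```python
# Toy event-time windowing with a watermark (illustrative, not PyFlink).
# Events are bare timestamps in seconds; tumbling windows are emitted only
# once the watermark (max timestamp seen minus allowed lateness) passes
# their end, so late-but-tolerable events still land in the right window.

WINDOW_SIZE = 10   # seconds per tumbling window
LATENESS = 5       # how far the watermark trails the max event time

def process(events):
    windows = {}          # window_start -> event count
    emitted = []
    max_ts = float("-inf")
    for ts in events:     # events may arrive out of order
        max_ts = max(max_ts, ts)
        watermark = max_ts - LATENESS
        start = (ts // WINDOW_SIZE) * WINDOW_SIZE
        windows[start] = windows.get(start, 0) + 1
        # Close every window whose end is at or before the watermark
        for s in sorted(list(windows)):
            if s + WINDOW_SIZE <= watermark:
                emitted.append((s, windows.pop(s)))
    return emitted, windows

# The event at t=3 arrives late, after t=12, yet still lands in window [0, 10)
emitted, pending = process([1, 5, 12, 3, 21])
print(emitted)  # window [0, 10) closes only once the watermark passes 10
```

Real engines add persistence, keyed state, and configurable lateness policies on top of exactly this bookkeeping, which is why the `WATERMARK FOR ... - INTERVAL '5' SECOND` clause above is all the analyst has to write.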
Stream Processing for Operational Analytics
Beyond fraud detection, streaming analytics is becoming important for operational use cases across industries. Manufacturing companies monitor sensor data to detect equipment anomalies before failures occur. Retail chains analyze point-of-sale streams to adjust pricing and inventory in real time. Healthcare providers process patient monitoring data to identify deterioration early.
A large-scale automotive manufacturer deploying predictive maintenance across multiple assembly lines provides a realistic operational analytics scenario. The factory has hundreds of industrial machines equipped with IoT sensors that continuously stream temperature, vibration, pressure, and electrical load measurements at one-second intervals. Equipment failures cost thousands of dollars per hour in lost production, so detecting early warning signs is critical. The analytics team needs to build a streaming pipeline that continuously monitors each machine's sensor patterns, compares current readings against historical baseline behavior, and automatically triggers maintenance alerts when deviations exceed statistically significant thresholds.
from pyflink.table import EnvironmentSettings, TableEnvironment
# Initialize the TableEnvironment in streaming mode
env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)
# Define the IoT Sensor Source Table
# Using 'datagen' connector to simulate streaming sensor data
t_env.execute_sql("""
CREATE TABLE iot_sensors (
equipment_id STRING,
temperature DOUBLE,
vibration DOUBLE,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1',
'fields.equipment_id.kind' = 'sequence',
'fields.equipment_id.start' = '1',
'fields.equipment_id.end' = '5',
'fields.temperature.min' = '60',
'fields.temperature.max' = '100',
'fields.vibration.min' = '0',
'fields.vibration.max' = '10'
)
""")
# Define the Sink Table for Alerts (Printing to console)
t_env.execute_sql("""
CREATE TABLE maintenance_alerts (
equipment_id STRING,
alert_time TIMESTAMP(3),
alert_type STRING,
current_value DOUBLE,
threshold_value DOUBLE,
severity STRING
) WITH (
'connector' = 'print'
)
""")
# Execute the Anomaly Detection Query
# 1. Window: Hopping window of 30 seconds sliding every 10 seconds
# 2. Statistics: Calculate Rolling Mean (Baseline) and Standard Deviation
# 3. Logic: Detect if the latest temperature exceeds Baseline + 2 * StdDev
# 4. Action: Generate alerts
t_env.execute_sql("""
INSERT INTO maintenance_alerts
SELECT
equipment_id,
window_end AS alert_time,
'STATISTICAL_ANOMALY' AS alert_type,
max_temp AS current_value,
(avg_temp + 2 * stddev_temp) AS threshold_value,
CASE
-- CRITICAL above 3 sigma; WARNING between 2 and 3 sigma
WHEN max_temp > (avg_temp + 3 * stddev_temp) THEN 'CRITICAL'
ELSE 'WARNING'
END AS severity
FROM (
SELECT
equipment_id,
HOP_END(event_time, INTERVAL '10' SECOND, INTERVAL '30' SECOND) AS window_end,
AVG(temperature) AS avg_temp,
STDDEV_SAMP(temperature) AS stddev_temp,
MAX(temperature) AS max_temp
FROM iot_sensors
GROUP BY
equipment_id,
HOP(event_time, INTERVAL '10' SECOND, INTERVAL '30' SECOND)
)
WHERE max_temp > (avg_temp + 2 * stddev_temp)
""").wait()  # Block until the job finishes (the bounded datagen sequence terminates)
Python's role in these scenarios has evolved from occasional scripting to building and operating production streaming applications. This shift requires new skills: understanding streaming semantics, designing fault-tolerant stateful processing, and monitoring production systems. In return, it lets analytics teams deliver value more directly by feeding insights straight into operational systems.
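The statistical core of the Flink query above, flagging a reading that exceeds a rolling baseline by two or three standard deviations, can be sketched in plain Python for local experimentation. The function name and sample readings below are illustrative, not part of the Flink pipeline:

```python
import statistics

def detect_anomaly(window, current, warn_sigma=2.0, crit_sigma=3.0):
    """Flag a reading that deviates from the window baseline.

    window: recent sensor readings forming the baseline
    current: the latest reading to evaluate
    Returns None, 'WARNING', or 'CRITICAL'.
    """
    if len(window) < 2:
        return None  # not enough data to compute a standard deviation
    mean = statistics.mean(window)
    stdev = statistics.stdev(window)
    if current > mean + crit_sigma * stdev:
        return "CRITICAL"
    if current > mean + warn_sigma * stdev:
        return "WARNING"
    return None

baseline = [70.1, 70.4, 69.8, 70.0, 70.3, 69.9]
print(detect_anomaly(baseline, 70.2))  # within 2 sigma -> None
print(detect_anomaly(baseline, 75.0))  # far above baseline -> CRITICAL
```

Testing the threshold logic in isolation like this is often easier than debugging it inside a running streaming job.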
Skills Implications for 2026
The evolution of Python in data analytics has significant implications for skill development. While foundational skills like SQL, Python programming, and data visualization remain essential, the bar has risen in several areas.
Performance awareness is now critical. Analysts need to understand when to use Pandas versus Polars versus DuckDB based on data size and operation types. Memory management, query optimization, and parallel processing concepts that were once specialized knowledge are now part of the standard toolkit.
AI literacy has become a baseline requirement. While not every analyst needs to build ML models from scratch, understanding how to work with LLM APIs, validate AI-generated outputs, and identify appropriate use cases for AI augmentation is increasingly important. Security and privacy awareness around external AI services is also necessary.
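One concrete validation habit: never consume an LLM's JSON response unchecked. The sketch below, with an invented schema, shows a defensive parse-and-validate step that could sit between an LLM API call and downstream analytics code:

```python
import json

def validate_llm_summary(raw, required_keys=("title", "sentiment", "key_points")):
    """Validate a JSON payload returned by an LLM before trusting it.

    LLMs can return malformed JSON, missing fields, or out-of-range values,
    so downstream code should never consume their output unchecked.
    Returns the parsed dict, or None if validation fails.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    if data["sentiment"] not in {"positive", "neutral", "negative"}:
        return None
    return data

good = '{"title": "Q3 churn drivers", "sentiment": "negative", "key_points": ["pricing", "support wait times"]}'
bad = '{"title": "Q3 churn drivers"}'  # missing required fields
print(validate_llm_summary(good) is not None)  # True
print(validate_llm_summary(bad))               # None
```

The same pattern extends naturally to schema libraries such as Pydantic when the expected structure grows beyond a few fields.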
Cloud platform familiarity is expected. Most organizations now run analytics on cloud infrastructure, and analysts need to understand warehouse concepts, cost optimization, and data governance in cloud environments. The ability to write Python that orchestrates cloud services rather than running locally is increasingly valuable.
Finally, software engineering practices are becoming more relevant as analytics code moves into production. Version control, testing, documentation, and CI/CD practices that were once the domain of software engineers are now expected for analytics teams delivering production systems.
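A minimal illustration of what those practices look like for analytics code: a transformation function with an input guard, plus a unit test that would normally live under tests/ and run in CI. The function and column names here are hypothetical:

```python
import pandas as pd

def add_margin(df):
    """Add a margin_pct column; raise early on bad input rather than
    silently producing wrong numbers downstream."""
    if (df["revenue"] <= 0).any():
        raise ValueError("revenue must be positive")
    out = df.copy()
    out["margin_pct"] = (df["revenue"] - df["cost"]) / df["revenue"] * 100
    return out

# In a real project this test lives in tests/test_transforms.py and runs via pytest
def test_add_margin():
    df = pd.DataFrame({"revenue": [200.0, 100.0], "cost": [150.0, 80.0]})
    result = add_margin(df)
    assert result["margin_pct"].tolist() == [25.0, 20.0]

test_add_margin()
```

Guard clauses plus small, deterministic tests catch the silent data errors that notebooks tend to let slip into reports.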
Conclusion
Python remains the dominant language for data analytics in 2026, but how it is used has evolved significantly. The ecosystem has moved beyond simple exploratory analysis to encompass performance tooling, AI integration, cloud-native patterns, and real-time processing. Organizations that embrace these changes can deliver insights faster, handle larger datasets, and provide value closer to operational decision-making.
For analysts and data professionals, this evolution represents both challenge and opportunity. The learning curve has increased, but the capabilities available through modern Python analytics are vastly more powerful than what was available just a few years ago. By understanding these trends and developing skills in the emerging tools and patterns, analytics professionals can position themselves at the forefront of data-driven decision-making.
Sources
- Polars User Guide: https://pola.rs/
- DuckDB Documentation: https://duckdb.org/docs/
- Apache Arrow: https://arrow.apache.org/docs/python/
- Snowflake Python Connector: https://docs.snowflake.com/en/developer-guide/python-connector/python-connector
- Google Cloud BigQuery Python Client: https://cloud.google.com/python/docs/reference/bigquery/latest
- Apache Flink Python API: https://nightlies.apache.org/flink/flink-docs-master/docs/libs/python/
- OpenAI API Documentation: https://platform.openai.com/docs/