Essential Python Libraries for Data Analysis and Analytics in 2026
Discover the essential Python libraries for data analysis in 2026. Learn NumPy, pandas, matplotlib, seaborn, scikit-learn, and more with practical examples.
Python has firmly established itself as the dominant programming language for data analysis, and for good reason. The 2024 Stack Overflow Developer Survey confirms that Python remains the most desired language for data work, supported by an extensive ecosystem of libraries that make data manipulation, analysis, and visualization both accessible and powerful. For aspiring data analysts and scientists, understanding which libraries to learn and in what order can significantly accelerate career growth. This guide covers the essential Python libraries you must learn for data analysis, organized from foundational to advanced.
NumPy is the bedrock of scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. Almost every other data analysis library in Python builds on NumPy, making it non-negotiable for anyone serious about data work. NumPy's array operations are implemented in C, making them significantly faster than native Python lists and loops. Understanding NumPy is crucial because it introduces the concept of vectorized operations, which allows you to perform calculations on entire arrays without writing explicit loops—a pattern that becomes second nature in data analysis.
The core NumPy operations you should master include array creation, indexing, slicing, and reshaping. Learn to perform element-wise operations, broadcasting, and aggregate functions like sum, mean, and standard deviation. These operations form the foundation for more complex analyses. NumPy also provides linear algebra functions, random number generation, and Fourier transforms, which are essential for scientific computing applications. While NumPy may seem low-level compared to higher-level libraries, investing time in understanding it pays dividends when working with more advanced tools.
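The operations listed above can be sketched in a few lines. This is a minimal illustration with invented numbers, not part of any later example:

```python
import numpy as np

# A 3x4 array of hypothetical measurements
data = np.arange(12).reshape(3, 4)

# Slicing: first two rows, last two columns
subset = data[:2, 2:]

# Broadcasting: subtract each column's mean from every row, no explicit loop
centered = data - data.mean(axis=0)

# Aggregate functions
print(data.sum(), data.mean(), data.std())
print(centered.mean(axis=0))  # each column now centers on zero
```

Note how `data.mean(axis=0)` (shape `(4,)`) is broadcast across all three rows of `data` — this is the vectorized pattern that replaces explicit loops.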
Consider a financial services firm processing millions of daily stock price quotes from multiple exchanges. Analysts need to efficiently calculate returns, volatility metrics, and correlation matrices across thousands of securities in milliseconds. Using NumPy arrays, they can represent the entire dataset as a multi-dimensional structure and apply mathematical operations across thousands of stocks simultaneously, avoiding the performance bottlenecks of iterating through individual prices with native Python loops.
import numpy as np
# Sample price data for 3 assets over 5 days
prices = np.array([
[100, 50, 25],
[102, 51, 26],
[101, 52, 24],
[105, 53, 27],
[107, 54, 28]
])
# Calculate daily returns
returns = np.diff(prices, axis=0) / prices[:-1]
print("Daily Returns:\n", returns)
# Calculate annualized volatility (assuming 252 trading days)
daily_volatility = np.std(returns, axis=0, ddof=1)
annualized_volatility = daily_volatility * np.sqrt(252)
print("\nAnnualized Volatility:", annualized_volatility)
# Calculate correlation matrix
correlation_matrix = np.corrcoef(returns.T)
print("\nCorrelation Matrix:\n", correlation_matrix)
# Calculate portfolio return and volatility for a sample weight vector
weights = np.array([0.4, 0.4, 0.2])
portfolio_return = np.mean(returns, axis=0).dot(weights)
print("\nPortfolio Return:", portfolio_return)
portfolio_variance = np.dot(weights.T, np.dot(np.cov(returns.T), weights))
portfolio_volatility = np.sqrt(portfolio_variance)
print("Portfolio Volatility:", portfolio_volatility)
Pandas is arguably the most important library for data analysis in Python. Built on top of NumPy, Pandas provides two primary data structures: Series for one-dimensional data and DataFrame for two-dimensional tabular data. The DataFrame, in particular, has become the de facto standard for data manipulation in Python, offering a familiar spreadsheet-like interface with powerful programmatic capabilities. Pandas excels at data cleaning, transformation, and analysis tasks that data analysts perform daily. It handles missing data elegantly, supports time-series operations, and provides robust tools for grouping, filtering, and merging datasets.
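A minimal sketch of the two core structures, using invented data for illustration:

```python
import pandas as pd

# A Series: one-dimensional labeled data
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional tabular data with named columns
df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4, 22]})

print(s["b"])               # label-based access
print(df["temp_c"].mean())  # column-wise operations
```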
Data cleaning with Pandas is a skill you will use constantly. Learn to identify and handle missing values using methods like isnull(), fillna(), and dropna(). Remove duplicates with drop_duplicates() and standardize text data using string methods. Pandas makes it straightforward to rename columns, change data types, and restructure data using pivot tables and melt operations. These operations are often the most time-consuming part of real-world data analysis, so proficiency here directly impacts your productivity.
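The pivot and melt operations mentioned above can be sketched briefly; the table below is invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 120, 90],
})

# pivot(): long format -> wide format (one column per region)
wide = sales.pivot(index="month", columns="region", values="revenue")
print(wide)

# melt(): wide format back to long format
long = wide.reset_index().melt(id_vars="month", value_name="revenue")
print(long)
```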
Imagine an e-commerce company integrating customer data from multiple sources—website registrations, mobile app signups, and third-party authentication providers—each with different schemas and quality issues. The data contains duplicate email addresses, inconsistent phone number formats, missing demographic fields, and timestamps in various time zones. Data analysts must standardize this messy dataset into a clean customer master record before any meaningful analysis can begin, ensuring each customer appears exactly once with properly formatted attributes.
import pandas as pd
import numpy as np
# Create a sample DataFrame with messy data
data = {
'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'],
'Age': [25, np.nan, 30, 45, 28, 32, np.nan, 29, 35, 27],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com',
'eve@example.com', 'frank@example.com', 'grace@example.com', 'heidi@example.com',
'ivan@example.com', 'judy@example.com'],
'Join_Date': ['2020-01-15', '2019-05-20', '2021-03-10', '2018-11-05', '2020-08-12',
'2019-09-18', '2021-01-22', '2017-12-30', '2020-06-07', '2018-04-14'],
'Department': ['Engineering', 'Marketing', 'Engineering', 'Finance', 'Marketing',
'Engineering', 'Finance', 'Marketing', 'Engineering', 'Finance']
}
# Create a DataFrame with duplicate rows
df = pd.DataFrame(data)
df = pd.concat([df, df.iloc[0:2]]) # Add duplicates
# Display original DataFrame
print("Original DataFrame:")
print(df)
# 1. Handling missing values
# Fill missing Age values with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print("\nDataFrame after filling missing Age values:")
print(df)
# 2. Removing duplicates
# Remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
# 3. Standardizing data formats
# Convert Join_Date to datetime format
df_no_duplicates['Join_Date'] = pd.to_datetime(df_no_duplicates['Join_Date'])
# Extract year and month as new columns
df_no_duplicates['Join_Year'] = df_no_duplicates['Join_Date'].dt.year
df_no_duplicates['Join_Month'] = df_no_duplicates['Join_Date'].dt.month
# Standardize email to lowercase
df_no_duplicates['Email'] = df_no_duplicates['Email'].str.lower()
# Standardize Department names (title case)
df_no_duplicates['Department'] = df_no_duplicates['Department'].str.title()
print("\nDataFrame after standardizing data formats:")
print(df_no_duplicates)
Data manipulation in Pandas centers on filtering, sorting, grouping, and aggregating data. Use boolean indexing and query methods to filter rows based on conditions. Sort data with sort_values() and group data with groupby() to perform split-apply-combine operations. The apply() function allows you to apply custom functions to your data, providing flexibility when built-in methods do not suffice. Merge and join operations in Pandas are powerful for combining multiple datasets, similar to SQL joins. Mastering these operations enables you to prepare data for analysis and visualization efficiently.
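Two of the operations named above, sort_values() and apply(), can be sketched with invented data:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["Ann", "Ben", "Cal", "Dee"],
    "amount": [250, 40, 180, 95],
})

# Sort rows by order amount, largest first
ranked = orders.sort_values("amount", ascending=False)

# apply(): custom per-value logic when built-in methods do not suffice
orders["tier"] = orders["amount"].apply(lambda x: "high" if x >= 100 else "low")

print(ranked)
print(orders)
```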
A retail chain needs to analyze quarterly sales performance across 500 stores to identify top-performing locations, product categories, and regional trends. The raw transaction data includes millions of individual purchases that must be aggregated by store, product line, and time period. Analysts group transactions to calculate revenue per category, filter for high-margin products, and join with inventory data to identify stockouts that may have impacted sales figures, ultimately producing actionable insights for regional managers.
import pandas as pd
# Create sample dataframes
sales_data = pd.DataFrame({
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A'],
'Sales': [100, 150, 200, 120, 180, 90, 210, 140]
})
customer_data = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Category': ['Electronics', 'Home', 'Sports'],
'Price': [50, 75, 60]
})
# Demonstrate grouping - group sales by region
region_sales = sales_data.groupby('Region')['Sales'].sum()
print("Sales by Region:")
print(region_sales)
print()
# Demonstrate filtering - filter products with sales greater than 120
high_sales = sales_data[sales_data['Sales'] > 120]
print("Products with sales greater than 120:")
print(high_sales)
print()
# Demonstrate aggregating - get average sales by product
avg_sales = sales_data.groupby('Product')['Sales'].mean()
print("Average sales by product:")
print(avg_sales)
print()
# Demonstrate merging - merge sales and customer data
merged_data = pd.merge(sales_data, customer_data, on='Product')
print("Merged data:")
print(merged_data)
Visualization is a critical skill for communicating insights, and Python offers multiple powerful libraries for this purpose. Matplotlib is the foundational plotting library in Python. It provides comprehensive control over every aspect of a plot, enabling you to create publication-quality visualizations. While Matplotlib can be verbose and requires more code than some alternatives, its flexibility makes it indispensable. Learn the basic plot types including line plots, scatter plots, bar charts, and histograms. Understanding figure and axes objects in Matplotlib is crucial because more advanced libraries build on this foundation.
Consider a logistics company tracking shipping efficiency metrics across multiple distribution centers and transportation modes over time. Operations analysts need to create visualizations showing delivery trends, route performance comparisons, and seasonal patterns in shipping volumes to identify bottlenecks and optimize network performance. Matplotlib enables the creation of detailed multi-panel charts showing time series trends across different regions, bar charts comparing route efficiency, and customized visual reports that can be embedded in executive presentations.
import matplotlib.pyplot as plt
import numpy as np
# Sample logistics data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
shipments = [120, 150, 180, 140, 200, 220]
delivery_times = [2.5, 2.3, 2.1, 2.4, 2.0, 1.9]
warehouse_capacity = [0.75, 0.80, 0.85, 0.70, 0.90, 0.95]
# Create figure with subplots
fig, axs = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Logistics Data Visualization Dashboard', fontsize=16, fontweight='bold')
# Subplot 1: Bar chart for shipments
axs[0, 0].bar(months, shipments, color='steelblue', edgecolor='black', linewidth=1.2)
axs[0, 0].set_title('Monthly Shipments', fontsize=12, fontweight='bold')
axs[0, 0].set_xlabel('Month')
axs[0, 0].set_ylabel('Number of Shipments')
axs[0, 0].grid(axis='y', linestyle='--', alpha=0.7)
# Subplot 2: Line chart for delivery times
axs[0, 1].plot(months, delivery_times, color='darkred', marker='o', linewidth=2, markersize=8)
axs[0, 1].fill_between(months, delivery_times, alpha=0.3, color='darkred')
axs[0, 1].set_title('Average Delivery Times (Days)', fontsize=12, fontweight='bold')
axs[0, 1].set_xlabel('Month')
axs[0, 1].set_ylabel('Days')
axs[0, 1].set_ylim(1.5, 3.0)
axs[0, 1].grid(True, linestyle='--', alpha=0.7)
# Subplot 3: Pie chart for warehouse capacity
capacity_levels = ['Under 50%', '50-75%', '75-90%', 'Over 90%']
capacity_counts = [10, 25, 40, 25]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
axs[1, 0].pie(capacity_counts, labels=capacity_levels, autopct='%1.1f%%',
startangle=90, colors=colors, explode=(0, 0, 0, 0.1))
axs[1, 0].set_title('Warehouse Capacity Distribution', fontsize=12, fontweight='bold')
# Subplot 4: Scatter plot for correlation
axs[1, 1].scatter(shipments, delivery_times, s=100, alpha=0.7, c='green', edgecolors='black')
for i, month in enumerate(months):
    axs[1, 1].annotate(month, (shipments[i], delivery_times[i]),
                       textcoords="offset points", xytext=(0, 10), ha='center')
axs[1, 1].set_title('Shipments vs. Delivery Time Correlation', fontsize=12, fontweight='bold')
axs[1, 1].set_xlabel('Number of Shipments')
axs[1, 1].set_ylabel('Average Delivery Time (Days)')
axs[1, 1].grid(True, linestyle='--', alpha=0.7)
# Adjust spacing between subplots
plt.tight_layout()
plt.subplots_adjust(top=0.9)
# Save figure
plt.savefig('logistics_dashboard.png', dpi=300, bbox_inches='tight')
# Display the plot
plt.show()
Seaborn builds on Matplotlib to provide a high-level interface for statistical visualization. It integrates closely with Pandas DataFrames and creates attractive statistical graphics with less code than raw Matplotlib. Seaborn excels at distribution plots, categorical plots, and regression plots. The histplot() function (which replaces the deprecated distplot()) shows histograms with kernel density estimates, while boxplot() and violinplot() reveal how data is distributed across categories. Relational plots like scatterplot() and lineplot() include options for encoding additional dimensions using color and size. Seaborn's default styles and color palettes produce professional-looking visualizations with minimal customization.
Consider a healthcare provider analyzing patient readmission rates across different hospitals and demographics to identify risk factors. Analysts need to visualize the distribution of readmission times, compare outcomes across age groups and treatment types, and explore relationships between patient characteristics and readmission likelihood. Seaborn's statistical plotting capabilities allow them to create boxplots showing outcome distributions by hospital, violin plots revealing readmission patterns by demographic segments, and scatter plots with regression lines identifying significant predictors of readmission.
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample dataset
tips = sns.load_dataset("tips")
# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")
# Create a figure with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Seaborn Statistical Visualization Examples', fontsize=16)
# 1. Distribution Plot (Histogram with Kernel Density Estimate)
sns.histplot(data=tips, x="total_bill", kde=True, ax=axes[0])
axes[0].set_title('Distribution Plot: Total Bill')
# 2. Categorical Plot (Boxplot)
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1])
axes[1].set_title('Categorical Plot: Total Bill by Day')
# 3. Regression Analysis (Linear Regression Plot)
sns.regplot(data=tips, x="total_bill", y="tip", ax=axes[2])
axes[2].set_title('Regression Analysis: Total Bill vs Tip')
# Adjust layout to prevent overlap
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
# Display the plots
plt.show()
Plotly takes interactivity to the next level. Unlike Matplotlib and Seaborn, which produce static plots, Plotly creates interactive visualizations that allow users to zoom, pan, and hover for details. This interactivity is invaluable for exploratory data analysis and presentation dashboards. Plotly Express provides a simple interface for common chart types, while the underlying Graph Objects API offers fine-grained control. The library supports 3D plots, maps, and complex statistical charts. Plotly integrates well with web frameworks and can be exported as standalone HTML files, making it ideal for sharing insights with stakeholders who need to explore the data themselves.
A manufacturing company needs to monitor production line performance in real-time across multiple factories, enabling plant managers to drill down into specific machines and time ranges. Executives require an interactive dashboard that shows overall equipment effectiveness, defect rates, and production throughput, with the ability to zoom into specific time periods, hover for detailed metrics, and filter by factory or product line. Plotly enables the creation of these interactive visualizations that allow stakeholders to explore the data dynamically rather than reviewing static reports.
import plotly.express as px
import pandas as pd
import numpy as np
# Generate sample data
np.random.seed(42)
dates = pd.date_range(start="2023-01-01", periods=200, freq="D")
values_a = np.cumsum(np.random.randn(200)) + 50
values_b = np.cumsum(np.random.randn(200)) + 60
categories = np.random.choice(['North', 'South', 'East', 'West'], 200)
df = pd.DataFrame({
"Date": dates,
"Sales": values_a,
"Profit": values_b,
"Region": categories
})
# Create an interactive scatter plot dashboard
# Zoom: Enabled by default on axes
# Hover: Enabled with custom template and hover data
# Filtering: Added dropdown to filter time range and legend to filter categories
fig = px.scatter(
df,
x="Date",
y="Sales",
color="Region",
size="Profit",
hover_name="Region",
title="Interactive Sales Dashboard",
template="plotly_dark"
)
# Add a dropdown menu for time-based filtering
fig.update_layout(
updatemenus=[
dict(
buttons=list([
dict(label="All Time", method="relayout", args=[{"xaxis.range": [df["Date"].min(), df["Date"].max()]}]),
dict(label="Q1", method="relayout", args=[{"xaxis.range": [df["Date"].min(), "2023-03-31"]}]),
dict(label="Q2", method="relayout", args=[{"xaxis.range": ["2023-04-01", "2023-06-30"]}]),
]),
direction="down",
pad={"r": 10, "t": 10},
showactive=True,
x=0.1,
xanchor="left",
y=1.1,
yanchor="top"
),
]
)
fig.show()
SciPy extends NumPy with a collection of mathematical algorithms and convenience functions. Built on top of NumPy arrays, SciPy provides modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and other advanced mathematical operations. While you may not use SciPy directly in every data analysis workflow, it underpins many statistical functions and algorithms in other libraries. Understanding SciPy is particularly valuable for scientific computing applications, signal processing, and when you need to implement custom statistical methods beyond what standard libraries provide.
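Interpolation, mentioned above but not covered in the example further down, can be sketched briefly; the monitoring times and latency values are invented for illustration:

```python
import numpy as np
from scipy import interpolate

# Hypothetical metric sampled at irregular monitoring points
t_observed = np.array([0.0, 1.0, 2.5, 4.0, 6.0])
latency = np.array([20.0, 22.0, 30.0, 26.0, 24.0])

# Build a cubic interpolator and reconstruct a continuous curve
f = interpolate.interp1d(t_observed, latency, kind="cubic")
t_fine = np.linspace(0, 6, 25)
latency_fine = f(t_fine)

print(f(3.0))  # estimated latency between monitoring points
```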
Consider a telecommunications company analyzing network traffic patterns to optimize bandwidth allocation and predict capacity needs. Engineers need to perform signal processing on call data records, apply optimization algorithms to route traffic efficiently, and interpolate missing values in network performance metrics. SciPy provides the mathematical foundation for these analyses, offering functions for Fourier transforms of signal data, optimization routines for network routing, and interpolation methods for reconstructing continuous performance measurements from discrete monitoring points.
import numpy as np
from scipy import optimize, integrate, signal
# Optimization: Minimize the Rosenbrock function
def rosenbrock(x):
    return sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
optimization_result = optimize.minimize(rosenbrock, x0, method='BFGS')
print("Optimization Result:")
print(f"Minimum found: {optimization_result.x}")
print(f"Function value: {optimization_result.fun}")
# Integration: Calculate definite integral
def integrand(x, a, b):
    return a * x + b
integral_value, error_estimate = integrate.quad(integrand, 0, 1, args=(2, 1))
print("\nIntegration Result:")
print(f"Integral of 2x + 1 from 0 to 1: {integral_value}")
print(f"Error estimate: {error_estimate}")
# Signal Processing: Filter a noisy signal
t = np.linspace(0, 1, 500)
signal_data = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.normal(size=len(t))
# Design and apply a Butterworth filter
b, a = signal.butter(4, 0.1, 'low')
filtered_signal = signal.filtfilt(b, a, signal_data)
print("\nSignal Processing Result:")
print(f"Original signal std: {np.std(signal_data):.4f}")
print(f"Filtered signal std: {np.std(filtered_signal):.4f}")
# FFT of the signal
fft_result = np.fft.fft(signal_data)
print(f"FFT result size: {len(fft_result)}")
Scikit-learn is the premier machine learning library in Python, but its utility extends beyond building predictive models. Data analysts often use Scikit-learn for preprocessing tasks like scaling, encoding categorical variables, and splitting data into training and test sets. The train_test_split() function is ubiquitous in data science workflows. Dimensionality reduction techniques like Principal Component Analysis (PCA) help visualize high-dimensional data. Clustering algorithms like K-means can discover natural groupings in data without requiring labeled outcomes. Understanding these techniques allows you to move beyond descriptive analysis to predictive and unsupervised learning applications.
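The train_test_split() pattern described above can be sketched in a few lines; the feature matrix here is random, invented data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix (100 samples, 4 features) and binary labels
X = np.random.default_rng(0).normal(size=(100, 4))
y = np.random.default_rng(1).integers(0, 2, size=100)

# Hold out 25% of rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scaler on training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split only (rather than the full dataset) avoids leaking information from the test set into preprocessing.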
A marketing team wants to segment their customer base into distinct groups for targeted campaigns based on purchase behavior, website engagement, and demographic information. With no predefined segments, analysts apply unsupervised clustering to discover natural groupings among millions of customers. Scikit-learn's K-means algorithm identifies clusters such as price-sensitive browsers, high-value repeat purchasers, and seasonal shoppers, enabling the marketing team to design personalized messaging and promotions that resonate with each segment's characteristics.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate sample customer data
np.random.seed(42)
n_customers = 500
# Create customer features: age, annual income (in k$), spending score (1-100)
age = np.random.randint(18, 70, n_customers)
income = np.random.normal(50, 15, n_customers)
spending_score = np.random.randint(1, 100, n_customers)
frequency = np.random.randint(1, 50, n_customers)
last_purchase = np.random.randint(1, 365, n_customers)
# Create DataFrame
customer_data = pd.DataFrame({
'Age': age,
'Annual_Income': income,
'Spending_Score': spending_score,
'Purchase_Frequency': frequency,
'Days_Since_Last_Purchase': last_purchase
})
# 1. Preprocessing: Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customer_data)
# 2. Dimensionality Reduction: Apply PCA
pca = PCA(n_components=2) # Reduce to 2 dimensions for visualization
pca_result = pca.fit_transform(scaled_data)
# 3. K-means Clustering
# Find optimal number of clusters using Elbow method
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)
# Plot Elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
# Based on the elbow curve, let's say optimal k=4
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
# Add cluster labels to original data
customer_data['Cluster'] = clusters
# Visualize the clusters in 2D PCA space
plt.figure(figsize=(10, 7))
scatter = plt.scatter(pca_result[:, 0], pca_result[:, 1], c=clusters, cmap='viridis', alpha=0.7)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Customer Segmentation using K-means Clustering')
plt.colorbar(scatter)
plt.show()
# Analyze clusters
cluster_summary = customer_data.groupby('Cluster').agg({
'Age': 'mean',
'Annual_Income': 'mean',
'Spending_Score': 'mean',
'Purchase_Frequency': 'mean',
'Days_Since_Last_Purchase': 'mean',
'Cluster': 'count'
}).rename(columns={'Cluster': 'Count'})
print("Cluster Summary:")
print(cluster_summary)
# Predict the cluster for a new customer (features in the same order used for training)
new_customer = pd.DataFrame(
    [[35, 60, 75, 20, 30]],
    columns=['Age', 'Annual_Income', 'Spending_Score',
             'Purchase_Frequency', 'Days_Since_Last_Purchase'])
new_customer_scaled = scaler.transform(new_customer)
new_cluster = kmeans.predict(new_customer_scaled)
print(f"\nNew customer belongs to cluster: {new_cluster[0]}")
Openpyxl is a specialized library for working with Excel files. While Pandas can read and write Excel files, Openpyxl provides fine-grained control over workbook and worksheet operations. It can read, write, and modify Excel files while preserving formatting, formulas, and charts. This capability is essential when working with organizations that rely heavily on Excel for reporting and data exchange. Openpyxl allows you to automate repetitive Excel tasks, integrate Python analysis into existing Excel workflows, and generate formatted reports without manual intervention. For analysts working in corporate environments where Excel remains ubiquitous, Openpyxl bridges the gap between Python automation and business reporting requirements.
Consider a financial services firm generating monthly investment performance reports that must be delivered as formatted Excel workbooks with preserved templates and formulas. Portfolio managers receive reports containing calculated metrics, conditional formatting highlighting performance thresholds, and pre-built pivot tables for deeper analysis. Openpyxl enables analysts to programmatically populate these complex Excel workbooks, ensuring consistent formatting, maintaining existing formulas and conditional formatting rules, and automating the distribution of professionally formatted reports to hundreds of clients.
import openpyxl
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
# Create a new workbook
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Demo Sheet"
# Writing data to cells
ws['A1'] = "Name"
ws['B1'] = "Age"
ws['C1'] = "Score"
ws['A2'] = "Alice"
ws['B2'] = 25
ws['C2'] = 85
ws['A3'] = "Bob"
ws['B3'] = 30
ws['C3'] = 92
# Adding a formula
ws['C4'] = "=AVERAGE(C2:C3)"
# Styling the header row
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
header_alignment = Alignment(horizontal="center")
for cell in ws[1]:
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment
# Styling the data cells
data_font = Font(color="000000")
data_alignment = Alignment(horizontal="center")
for row in ws.iter_rows(min_row=2, max_row=3):
    for cell in row:
        cell.font = data_font
        cell.alignment = data_alignment
# Adding borders
thin_border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin')
)
for row in ws.iter_rows(min_row=1, max_row=4):
    for cell in row:
        cell.border = thin_border
# Adjust column widths
ws.column_dimensions['A'].width = 15
ws.column_dimensions['B'].width = 10
ws.column_dimensions['C'].width = 10
# Save the workbook
wb.save("demo_workbook.xlsx")
# Reading data from the workbook
wb_read = openpyxl.load_workbook("demo_workbook.xlsx", data_only=False) # data_only=False to keep formulas
ws_read = wb_read.active
# Read all data including formulas
print("Reading with formulas:")
for row in ws_read.iter_rows(values_only=False):
    for cell in row:
        print(f"Cell {cell.coordinate}: Value = {cell.value}")
# Reading cached values (data_only=True; note that openpyxl does not evaluate
# formulas itself, so formula cells show None until Excel has opened and saved the file)
wb_values = openpyxl.load_workbook("demo_workbook.xlsx", data_only=True)
ws_values = wb_values.active
print("\nValues only:")
for row in ws_values.iter_rows(values_only=True):
    print(row)
# Close workbooks
wb_read.close()
wb_values.close()
Building a learning path helps manage the overwhelming number of libraries available. Start with NumPy to understand array operations and vectorized computing. Move quickly to Pandas, as this is where you will spend most of your time doing actual data analysis. Learn Matplotlib and Seaborn together, using Matplotlib for foundational concepts and Seaborn for everyday statistical visualization. Add Plotly when you need interactivity or are building dashboards. Finally, explore SciPy and Scikit-learn for advanced statistical and machine learning applications. Learn Openpyxl when Excel integration becomes necessary for your workflow.
Practical application solidifies learning. Work on projects that use real datasets from domains like finance, healthcare, or e-commerce. Build a customer segmentation analysis using Pandas for data preparation, Seaborn for exploration, and Scikit-learn for clustering. Create a sales dashboard with Pandas for aggregation, Plotly for interactive charts, and deployment to a web framework. Develop a time-series forecasting project using SciPy for regression analysis and Matplotlib for trend visualization. These projects demonstrate your ability to apply libraries in context and provide concrete evidence of your skills to potential employers.
The Python data analysis ecosystem is vast, but focusing on these core libraries provides a strong foundation for a data career. Mastering NumPy and Pandas gives you the tools to handle most data manipulation tasks. Visualization libraries enable you to communicate findings effectively. Statistical and machine learning libraries extend your capabilities from description to prediction. Excel integration ensures compatibility with existing business workflows. Progress systematically, practice consistently with real data, and you will develop the expertise that employers seek in today's data-driven world.
Sources:
- NumPy Documentation: https://numpy.org/doc/
- Pandas Documentation: https://pandas.pydata.org/docs/
- Matplotlib Documentation: https://matplotlib.org/stable/contents.html
- Seaborn Documentation: https://seaborn.pydata.org/tutorial.html
- Scikit-learn Documentation: https://scikit-learn.org/stable/
- Plotly Documentation: https://plotly.com/python/
- SciPy Documentation: https://docs.scipy.org/doc/scipy/
- Openpyxl Documentation: https://openpyxl.readthedocs.io/
- Stack Overflow Developer Survey 2024: https://survey.stackoverflow.co/2024/
- Python for Data Analysis by Wes McKinney (O'Reilly Media)