Getting Started with Python
Installation
Recommended: Anaconda Distribution
• Includes Python, Jupyter, and scientific libraries
• Available at: anaconda.com
• Alternative: Python.org for basic installation
Essential Libraries
Data Analysis: pandas, numpy
Visualization: matplotlib, seaborn
Scientific Computing: scipy, statsmodels
Machine Learning: scikit-learn
Development Environment
Jupyter Notebooks: Interactive development
VS Code: Full-featured IDE
PyCharm: Professional Python IDE
Spyder: Scientific Python IDE
Learning Resources
Documentation: python.org/doc
Tutorials: Real Python, Python.org
Practice: Kaggle, HackerRank
Books: "Python for Data Analysis" by McKinney
Python Basics for Research
Variables and Data Types
# Basic data types
name = "John Doe" # String
age = 25 # Integer
gpa = 3.85 # Float
is_student = True # Boolean
# Lists and dictionaries
scores = [85, 92, 78, 96] # List
student = {               # Dictionary
    "name": "John",
    "age": 25,
    "gpa": 3.85
}
Control Structures
# If statements
if gpa >= 3.5:
    print("Dean's List")
elif gpa >= 3.0:
    print("Good Standing")
else:
    print("Needs Improvement")
# Loops
for score in scores:
    print(f"Score: {score}")
# While loop
count = 0
while count < 5:
    print(count)
    count += 1
Functions
# Function definition
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    return sum(numbers) / len(numbers)
# Function with default parameters
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"
# Using functions
avg_score = calculate_average(scores)
message = greet("Alice", "Hi")
File Operations
# Reading files
with open('data.txt', 'r') as file:
    content = file.read()
# Writing files
with open('results.txt', 'w') as file:
    file.write("Analysis Results\n")
    file.write(f"Average: {avg_score}")
# Working with CSV
import csv
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Data Analysis with Pandas
Creating DataFrames
import pandas as pd
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 92, 78]
}
df = pd.DataFrame(data)
# From CSV file
df = pd.read_csv('data.csv')
# From Excel file
df = pd.read_excel('data.xlsx') # Requires an Excel engine such as openpyxl
Data Exploration
# Basic info
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.info()) # Data types and info
print(df.describe()) # Summary statistics
# Shape and columns
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types
Data Manipulation
# Selecting data
df['Name'] # Single column
df[['Name', 'Age']] # Multiple columns
df.loc[0] # Row by index label (use df.iloc[0] for position)
df.loc[df['Age'] > 30] # Conditional selection
# Adding columns
df['Grade'] = df['Score'] / 10
df['Pass'] = df['Score'] >= 80
# Grouping and aggregation
df.groupby('Pass')['Score'].mean()
df.groupby('Grade').size()
Data Cleaning
# Missing values
df.isnull().sum() # Count missing values
df.dropna() # Remove rows with NaN (returns a new DataFrame)
df.fillna(0) # Fill NaN with 0 (returns a new DataFrame)
# Duplicates
df.duplicated().sum() # Count duplicates
df.drop_duplicates() # Remove duplicates
# Data types
df['Age'] = df['Age'].astype(int)
df['Score'] = pd.to_numeric(df['Score'])
Data Visualization
Matplotlib Basics
import matplotlib.pyplot as plt
# Basic plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.title('Age vs Score')
plt.show()
# Scatter plot
plt.scatter(df['Age'], df['Score'])
plt.xlabel('Age')
plt.ylabel('Score')
plt.show()
Seaborn Visualizations
import seaborn as sns
# Distribution plot
sns.histplot(df['Score'])
plt.show()
# Box plot
sns.boxplot(x='Pass', y='Score', data=df)
plt.show()
# Correlation heatmap
corr = df.corr(numeric_only=True) # Skip non-numeric columns like 'Name'
sns.heatmap(corr, annot=True)
plt.show()
Statistical Analysis
from scipy import stats
# Descriptive statistics
mean_score = df['Score'].mean()
median_score = df['Score'].median()
std_score = df['Score'].std()
# T-test
t_stat, p_value = stats.ttest_1samp(df['Score'], 80)
# Correlation
correlation, p_val = stats.pearsonr(df['Age'], df['Score'])
print(f"Mean: {mean_score:.2f}")
print(f"Correlation: {correlation:.3f}")
Export Results
# Save DataFrame
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)
# Save plot
plt.figure(figsize=(10, 6))
plt.plot(df['Age'], df['Score'])
plt.savefig('age_score_plot.png', dpi=300)
plt.show()
# Generate report
summary = df.describe()
summary.to_csv('summary_stats.csv')
Research Workflow Tips
Project Organization
- Create separate folders for data, scripts, and results
- Use descriptive file names
- Keep raw data separate from processed data
- Document your code with comments
- Use version control (Git)
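The folder layout above can be created in a couple of lines. This is just one convention; the folder names here are illustrative, so adapt them to your project.

```python
from pathlib import Path

# Illustrative project skeleton: raw data stays untouched,
# processed data and results live in their own folders
for folder in ["data/raw", "data/processed", "scripts", "results"]:
    Path(folder).mkdir(parents=True, exist_ok=True)
```

`exist_ok=True` makes the script safe to rerun: existing folders are left alone rather than raising an error.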
Code Best Practices
- Write readable, well-commented code
- Use meaningful variable names
- Break complex operations into functions
- Test your code with sample data
- Handle errors gracefully
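One way to "handle errors gracefully" is to catch the specific exception you expect and fall back to something sensible. The function name, file name, and fallback below are illustrative.

```python
def load_scores(path):
    """Read one number per line; return an empty list if the file is missing."""
    try:
        with open(path) as file:
            return [float(line) for line in file if line.strip()]
    except FileNotFoundError:
        # Warn instead of crashing so the rest of the analysis can continue
        print(f"Warning: {path} not found, using empty list")
        return []

scores = load_scores("missing_scores.txt")  # prints a warning, returns []
```

Catching `FileNotFoundError` specifically (rather than a bare `except:`) means genuine bugs, such as a malformed line raising `ValueError`, still surface loudly.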
Reproducibility
- Set random seeds for reproducible results
- Document your environment (requirements.txt)
- Keep detailed analysis logs
- Share code and data when possible
- Use Jupyter notebooks for exploration
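Setting random seeds, as recommended above, can be sketched like this: fixing the seed makes every rerun draw the same "random" numbers. The seed value itself is arbitrary.

```python
import random
import numpy as np

SEED = 42  # any fixed integer works; just keep it the same across runs
random.seed(SEED)
np.random.seed(SEED)

sample_a = np.random.normal(0, 1, 5)

# Resetting the seed reproduces the exact same draw
np.random.seed(SEED)
sample_b = np.random.normal(0, 1, 5)
assert (sample_a == sample_b).all()
```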
Common Pitfalls
- Not backing up your work
- Overwriting original data
- Not documenting analysis steps
- Ignoring data quality issues
- Not validating results
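A few cheap assertions catch several of these pitfalls (data quality issues, unvalidated results) before they propagate. The column names and valid ranges below are illustrative; substitute your own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 92]})

# Sanity checks: fail fast if the data violates basic expectations
assert not df.duplicated().any(), "unexpected duplicate rows"
assert df['Score'].between(0, 100).all(), "score out of valid range"
assert df['Name'].notna().all(), "missing names"
```

Running checks like these right after loading data turns silent quality problems into immediate, diagnosable errors.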