introduction coding – TOU Esther van Helmont

Introduction to coding
November 2024
Last update: 23 nov. 2024

Learn the foundations of Python. Python is particularly popular in the artificial intelligence community for tasks like data analysis.


Coding journal
Nov 20. 2024

Your coding journal is a document that records your Python programming journey. It documents your successes, challenges, and the lessons learned. Consider it your personal guide, marking the progress and the paths taken. Below, we detail the essential components that should be integrated into your journal, making it a detailed account of your learning journey.

Key Concepts: Summarize the main ideas.
Code Examples: Include code snippets that illustrate these concepts.
Insights & Questions: Reflect on what was learned and any arising questions.
Challenges: Document any challenges you encountered along the way.

Reflect on progress towards the milestone’s objectives.
Discuss potential real-world applications of learned concepts.
Discuss coding decisions, alternatives considered, and their impacts.

My Python Reference Guide

My Python introduction and intermediate course ‘cheat-sheet’.

To support my learning process in Python, I’ve created an overview of commonly used symbols, terms, data types and functions. This overview helps me quickly and easily understand what the different symbols mean, when to use them, and what to keep in mind.

Symbol/Word	What it means	When to use it	Important to know
`=`	Assign a name	To store a value in a variable.	This creates a new variable or overwrites an existing one.
`+`	Add or combine	To add numbers or combine words (strings).	With text (“strings”), it combines them. With numbers, it performs addition.
`-`	Subtract	To calculate the difference between two numbers.	Works only with numbers.
`*`	Multiply	To multiply two numbers.	Can also repeat text: `"Hello" * 3` gives `"HelloHelloHello"`.
`/`	Divide	To divide one number by another.	Always returns a float, even when the result is a whole number (e.g., `10 / 2` gives `5.0`).
`[]`	Look inside a list	To access a specific item in a list.	Counting starts at 0, so the first item is `list[0]`.
`[start:end]`	Extract a part of the list	To get a range of items from a list.	Start is included, but end is not. E.g., `list[1:3]` returns items at index 1 and 2.
`len()`	How many items?	To see how many items are in a list or how many characters are in a string.	Works with lists, strings, and other collections.
`type()`	What is it?	To check what kind of thing something is (e.g., a number, text, or list).	Returns types like `int` (whole number), `float` (decimal number), or `str` (text).
`==`	Is equal to?	To check if two things are the same.	Compares the values, not whether it is the exact same variable.
`!=`	Is NOT equal to?	To check if two things are not the same.	Returns `True` if they are different, otherwise `False`.
`<, >, <=, >=`	Compare	To check if something is smaller, larger, smaller or equal, or larger or equal.	Works with numbers and sometimes with text (alphabetical order).
`in`	Is it in there?	To check if something exists in a list or string.	For strings, it checks if a piece of text exists: `"a" in "apple"` is `True`.
`not in`	Is it NOT in there?	To check if something does not exist in a list or string.	The opposite of `in`.
`and`	AND	When you want two things to both be true.	Both conditions must be true, otherwise, the result is `False`.
`or`	OR	When you want at least one thing to be true.	The result is `False` only if both conditions are false.
`not`	NOT	To reverse a check, turning true into false and vice versa.	Reverses the value: `not True` becomes `False` and vice versa.
`#`	Comment	To write something in the code that will not be executed (notes for yourself).	Everything after the `#` on the same line is ignored by Python.
`"` or `'`	Text (String)	To create text. You can use either double or single quotes.	Both work the same way, but use the same type of quotes at the start and end.
`()`	Call a function	When you want to run a function, like printing something.	Without the parentheses, Python won’t run the function. E.g., `print` on its own only refers to the function, but `print("Hello")` runs it.
`{}`	Dictionary (Key-Value Pairs)	To store data in key-value pairs, like names and ages.	Use colons `:` to link a key and a value, e.g., `{"name": "Alice"}`.

Symbol/Word	What it means	When to use it	Important to know
`float`	Real numbers	When working with measurements, prices, etc.	Allows decimal points (e.g., `3.14`).
`int`	Whole numbers	Counting or indexing things.	Only integers, no fractions.
`str`	Text or sequence of characters	Working with names, messages, or any text data.	Enclose in quotes (`"` or `'`).
`bool`	True or False	Decision making or logical operations.	Capitalize `True` and `False`.

Function	What it does	When to use it	Important to know
`max()`	Find the largest	To determine the largest value in a list, tuple, or other iterable.	Works with numbers, strings (alphabetical order), or custom objects with comparison logic.
`min()`	Find the smallest	To determine the smallest value in a list, tuple, or other iterable.	Works with numbers, strings (alphabetical order), or custom objects with comparison logic.
`sum()`	Calculate the total	To add all the numbers in a list or other iterable.	Only works with numeric types.
`abs()`	Absolute value	To remove the negative sign from a number.	Works with integers and floats.
`round()`	Round a number	To round a number to the nearest integer or specified number of decimal places.	Optional second parameter specifies the number of decimal places. If it is omitted, the number is rounded to the nearest integer.
`print()`	Output to console	To display text or variables in the terminal.	Supports multiple arguments separated by commas.
`input()`	Get user input	To allow the user to provide input during script execution.	Always returns a string; convert to other types if needed.
`sorted()`	Sort elements	To sort items in ascending or descending order.	Returns a new sorted list without modifying the original iterable.
`range()`	Create a sequence	To generate a range of numbers, often used in loops.	Supports start, stop, and step values. Default step is 1.
`enumerate()`	Index and value	To get both the index and value when iterating over a list or other iterable.	Returns tuples containing (index, value).
`zip()`	Combine iterables	To pair elements from two or more iterables into tuples.	Stops at the shortest iterable’s length.
`any()`	Check if any is true	To check if at least one element in an iterable is true.	Returns `True` if at least one element is truthy.
`all()`	Check if all are true	To check if all elements in an iterable are true.	Returns `True` only if all elements are truthy.
`map()`	Apply a function	To apply a function to every element in an iterable.	Returns a map object, which can be converted to a list or other collection.
`filter()`	Filter items	To filter elements in an iterable based on a function.	Keeps only elements where the function returns a truthy value. Returns a filter object, which can be converted to a list.
`lambda`	Anonymous function	To create small, unnamed functions in a concise way.	Commonly used with `map()`, `filter()`, or `sorted()`.
`open()`	Work with files	To read from or write to files.	Supports modes like `'r'` (read), `'w'` (write), and `'a'` (append).
`dir()`	List attributes	To see the attributes and methods of an object.	Helpful for exploring objects and debugging.
`help()`	Get documentation	To view built-in documentation for a function, module, or object.	Interactive and very useful for learning Python.
`isinstance()`	Check type	To check if an object is an instance of a specific class or type.	Supports inheritance checks for custom classes.
`set()`	Create a set	To create a collection of unique elements.	Useful for removing duplicates from a list.
`len()`	Count elements	To find the number of items in a list, string, or other iterable.	Works with many data types, including strings, lists, and dictionaries.
`del`	Delete an item	To remove an item from a list or delete a variable entirely.	Permanently removes the variable or item.

Symbol/Word	What it means	When to use it	Important to know
`numpy` (as `np`)	The main Numpy library	Always start by importing it: `import numpy as np`.	Allows advanced numerical computations and array handling.
`np.array()`	Create a Numpy array	Convert a Python list or tuple to a Numpy array.	Specify `dtype` for precision (e.g., `float32`).
`np.random.normal()`	Generate random numbers from a normal distribution	When you need random numbers that follow a normal (Gaussian) distribution.	You can specify the mean, standard deviation, and size of the output array. For example: `np.random.normal(0, 1, (3, 3))` creates a 3×3 array of random numbers with a mean of 0 and a standard deviation of 1.
`np.corrcoef()`	Calculate the correlation coefficient	When you want to see how strongly two variables move together (correlation alone does not show that one causes the other).	Pass two 1D arrays or slices. The result is a correlation matrix. Example: `np.corrcoef(np_city[:, 0], np_city[:, 1])`.
`np.std()`	Calculate the standard deviation	To measure how spread out the values in an array are from the mean.	Specify the axis if working with a 2D array. Example: `np.std(np_city[:, 0])` calculates the standard deviation of the first column.
`np_2d`	Create a 2D Numpy array	Use for data with rows and columns (e.g., matrices).	Example: np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79], [65.4, 59.2, 63.6, 88.4, 68.7]])
`np_2d.shape`	Get shape of a 2D array	Check the dimensions (rows, columns) of your array.	Example output: (2, 5) # 2 rows, 5 columns
`np.zeros()`	Create an array of zeros	Initialize an array for computations.	Specify shape (e.g., `np.zeros((3, 4))` for a 3×4 array).
`np.ones()`	Create an array of ones	Initialize an array with ones for certain algorithms.	Supports custom `dtype` and shapes.
`np.arange()`	Generate a range of numbers	Create evenly spaced numbers like Python’s `range()`.	Specify step size (e.g., `np.arange(0, 10, 2)`).
`np.linspace()`	Generate evenly spaced numbers	Specify start, end, and the number of points.	Useful for plotting (e.g., `np.linspace(0, 1, 50)`).
`np.random.rand()`	Generate random numbers	Create random values between 0 and 1.	Specify shape (e.g., `np.random.rand(3, 4)`).
`np.mean()`	Calculate the mean	Find the average value of an array.	Can specify axis (e.g., row or column-wise).
`np.sum()`	Calculate the sum	Compute the total of array elements.	Supports axis-specific operations.
`np.dot()`	Matrix multiplication	Perform dot product or matrix operations.	Ensure compatible dimensions for matrices.
`np.transpose()`	Transpose an array	Swap rows and columns in a matrix.	Shortcut: `array.T` for transpose.

Function	Description	When to Use	Example
`plt.plot()`	Create a line plot	When you want to connect points with lines to show trends.	`plt.plot([1, 2, 3], [4, 5, 6])`
`plt.scatter()`	Create a scatter plot	When you want to show individual points in 2D space.	`plt.scatter([1, 2, 3], [4, 5, 6])`
`plt.hist()`	Create a histogram	To visualize the distribution of data.	`plt.hist([1, 2, 2, 3, 3, 3])`
`plt.bar()`	Create a bar plot	To compare categories using bars.	`plt.bar(['A', 'B', 'C'], [3, 7, 5])`
`plt.xlabel()`	Set x-axis label	To label the x-axis of your plot.	`plt.xlabel('X-Axis')`
`plt.ylabel()`	Set y-axis label	To label the y-axis of your plot.	`plt.ylabel('Y-Axis')`
`plt.title()`	Set plot title	To add a title to your plot.	`plt.title('My Plot')`
`plt.legend()`	Add a legend	To explain different elements in the plot.	`plt.legend(['Data 1', 'Data 2'])`
`plt.xscale()`	Set x-axis scale	To apply a logarithmic or linear scale to the x-axis.	`plt.xscale('log')`
`plt.show()`	Display the plot	To show the created plot on the screen.	`plt.show()`

Method	Access	Example
Square Brackets	Column access	`brics[["country", "capital"]]`
Square Brackets	Row access (slicing)	`brics[1:4]`
loc (label-based)	Row access	`brics.loc[["RU", "IN", "CH"]]`
loc (label-based)	Column access	`brics.loc[:, ["country", "capital"]]`
loc (label-based)	Row & Column access	`brics.loc[ ["RU", "IN", "CH"], ["country", "capital"] ]`
iloc (index-based)	Row access	`brics.iloc[1:4]`
iloc (index-based)	Column access	`brics.iloc[:, [0, 1]]`
iloc (index-based)	Row & Column access	`brics.iloc[ [1, 2, 3], [0, 1] ]`

Syntax list dataframe pandas

CODE EXAMPLES

Jupyter Notebook and step-by-step code explanations

Below are various code examples with detailed, straightforward explanations that I can rely on when writing code from scratch. The same code can be found in my Jupyter Notebook for easy copying and testing. I invested extra time in this to make things easier for myself later and to truly understand the concepts instead of just copying them.

The simplified explanations were generated with ChatGPT.

Python beginner code examples

Assign a name — =

x = 10:
Here, we create a variable named x and give it the value 10. A variable is like a box where you can store something, such as a number, a word, or a list.
print(x):
This tells Python: “Show what is stored in the variable x.” Python will then display the number 10 on the screen.

Example situation: Imagine you want to remember a test score. You can store it in x and later show it with print(x).
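
A minimal sketch of the two steps described above. The name x and the value 10 come from the explanation; nothing else is assumed.

```python
# Store the value 10 in a variable named x
x = 10

# Show what is stored in x
print(x)  # prints 10
```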

Calculate — +-*/

- a = 5 and b = 3:
  We create two variables, a and b, and store the numbers 5 and 3 in them. Think of a as one container holding 5, and b as another container holding 3.
- result_add = a + b:
  - Adds the values of a and b together (5 + 3).
  - The result (8) is stored in a new variable called result_add.
  - This is useful for combining numbers, like adding prices of items.
- result_subtract = a - b:
  - Subtracts b (3) from a (5).
  - The result (2) is stored in a variable called result_subtract.
  - This is useful for calculating differences, like finding out how much more you have compared to someone else.
- result_multiply = a * b:
  - Multiplies a (5) by b (3).
  - The result (15) is stored in a variable called result_multiply.
  - This is useful for scaling numbers, like calculating the total cost of buying 3 items that each cost $5.
- result_divide = a / b:
  - Divides a (5) by b (3).
  - The result (1.666…) is stored in a variable called result_divide.
  - This is useful for splitting things equally, like dividing a $5 bill among 3 people.
- print():
  - Each time we use print(), Python shows the result of the calculation.

Key Points to Remember:

+ is for adding.
- is for subtracting.
* is for multiplying.
/ is for dividing (always gives a decimal result, even if it looks like a whole number).
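
The calculations above can be sketched as follows, using the variable names from the explanation. The last line is an extra check I added to show that division returns a float even for whole results.

```python
a = 5
b = 3

result_add = a + b        # addition
result_subtract = a - b   # subtraction
result_multiply = a * b   # multiplication
result_divide = a / b     # division (always a float)

print(result_add)       # 8
print(result_subtract)  # 2
print(result_multiply)  # 15
print(result_divide)    # 1.6666666666666667
print(10 / 2)           # 5.0
```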

Extract an item from a list

fruits = ["apple", "banana", "cherry"]
- This creates a list named fruits. A list is like a collection that can store multiple items.
- In this case, the list contains three strings: "apple", "banana", and "cherry".
print(fruits[0])
- This tells Python to print the first item in the list fruits.
- Lists in Python start counting from 0, so fruits[0] means the first item, which is "apple".

Output:
When you run this code, Python will print:
apple
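
Put together, the example above could look like this. The second print is an extra line I added to show that the last of three items sits at index 2.

```python
fruits = ["apple", "banana", "cherry"]

# Index 0 is the first item
print(fruits[0])  # apple

# With three items, the last one is at index 2
print(fruits[2])  # cherry
```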

Print selection from list

1. fruits = ["apple", "banana", "cherry", "date", "fig"]:
  We create a list called fruits containing five items: "apple", "banana", "cherry", "date", and "fig". Think of this as a row of containers, each holding a fruit.
2. selected_fruits = fruits[1:4]:
  This extracts a part of the list.
  - 1 (start): This is the position where we start (remember: counting starts at 0, so position 1 is "banana").
  - 4 (end): This is the position where we stop, but the item at position 4 ("fig") is not included.
  - So, Python extracts items at positions 1, 2, and 3 ("banana", "cherry", and "date") and stores them in the variable selected_fruits.
3. print(selected_fruits):
  This tells Python: “Show the values in selected_fruits.” Python will display:
  
  ['banana', 'cherry', 'date']
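
The three steps above as one runnable sketch. The final print is my addition, to show that slicing leaves the original list untouched.

```python
fruits = ["apple", "banana", "cherry", "date", "fig"]

# Start at index 1 (included), stop at index 4 (not included)
selected_fruits = fruits[1:4]

print(selected_fruits)  # ['banana', 'cherry', 'date']
print(fruits)           # ['apple', 'banana', 'cherry', 'date', 'fig']
```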

Counting in lists, strings, or other collections — len()

fruits = ["apple", "banana", "cherry"]:
- We create a list called fruits that contains 3 items: "apple", "banana", and "cherry".
- Think of a list like a shopping bag that holds multiple items.
len(fruits):
- The len() function tells us how many items are in the list.
- In this case, the list fruits contains 3 items, so len(fruits) gives us the number 3.
count = len(fruits):
- We store the result of len(fruits) (which is 3) in a variable called count.
- Think of count as a label that holds the answer to “How many items are there?”.
print(count):
- This tells Python to display the value of count (which is 3).
- When you run this code, Python will print the number 3 to the screen.

Key Points to Remember:

The len() function works for lists, strings, or other collections.
For lists, it tells you how many items are in the list.
For strings, it counts the number of characters (including spaces).
Practical Use: Quickly count items in a list or characters in a word.
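
A short sketch of the steps above. The string `greeting` is a hypothetical example I added to show that spaces count as characters.

```python
fruits = ["apple", "banana", "cherry"]

# Count the items in the list and store the answer
count = len(fruits)
print(count)  # 3

# Hypothetical string example: the space counts as a character
greeting = "hello world"
print(len(greeting))  # 11
```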

Check what kind of thing it is: a whole number (int), a decimal number (float), or text (str) — type()

You have a box, and you put something inside it. But now you’ve forgotten what’s in the box! Is it a number? A word? Something else? Python can help you figure it out using type().

What’s Happening?

age = 25:
- You put the number 25 into the variable called age.
- Python knows this is a whole number and calls it an int (short for “integer”).
name = "Alice":
- You put the word "Alice" into the variable called name.
- Python knows this is text and calls it a str (short for “string”).
height = 5.6:
- You put the number 5.6 into the variable called height.
- Python knows this is a decimal number and calls it a float.
type(variable):
- When you use type(), Python checks inside the box (the variable) and tells you what kind of thing is in there.
- For example:
  - int means a whole number.
  - str means text.
  - float means a decimal number.
print(type(variable)):
- This tells Python: “Show me the type of the thing in this variable.”
- Python will then display the type on the screen.

Output:

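When you run the code (printing the type of age, name, and height, in that order), Python will show:

<class 'int'>
<class 'str'>
<class 'float'>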

Why is This Useful?

Sometimes, you need to check what kind of data your variables hold to make sure your program works correctly.
For example:
- Is it safe to add two things together? (5 + 3 works, but "Hello" + 3 doesn’t.)
- Are you using the right kind of data in your program?

The Big Idea:

type() is like asking Python: “What kind of thing is this?”
Python will tell you:
- int for whole numbers.
- str for text.
- float for decimal numbers.

It’s a simple tool to help you understand what’s in your variables.
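
A minimal sketch of the type() example, using the variable names and values from the explanation. The try/except part is my own addition to illustrate why "Hello" + 3 does not work.

```python
age = 25
name = "Alice"
height = 5.6

print(type(age))     # <class 'int'>
print(type(name))    # <class 'str'>
print(type(height))  # <class 'float'>

# Adding text and a number raises a TypeError
try:
    "Hello" + 3
except TypeError as error:
    print("Cannot add text and a number:", error)
```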

Check if two things look exactly the same. — ==

a = 5 and b = 5:
- We create two variables: a and b. Both hold the value 5.
c = 3:
- We create another variable, c, which holds the value 3.
print(a == b):
- Here, we ask Python:
  “Is the value of a the same as the value of b?”
  Since both are 5, Python answers True.
print(a == c):
- Now, we ask Python:
  “Is the value of a the same as the value of c?”
  Since a is 5 and c is 3, Python answers False.

Example situations:

Comparing Numbers:
Use == to check if two numbers are the same. For example, is 5 the same as 5?
Comparing Text:
Use == to check if two words are the same, like "Alice" == "Alice".

Key points to remember:

== asks if two things are equal.
It gives True if they are the same, and False if they are not.
This is useful when you need to compare values in your program, like checking if a password matches or if two numbers are the same.

In simple terms:

== is like asking:
“Do these two things look exactly the same?”
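
The comparisons above as a runnable sketch. The last line is an extra case I added: text comparison is case-sensitive.

```python
a = 5
b = 5
c = 3

print(a == b)  # True
print(a == c)  # False

# Comparing text works the same way
print("Alice" == "Alice")  # True
print("Alice" == "alice")  # False (uppercase and lowercase differ)
```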

Quickly update specific items in a list without recreating the whole list.

colors = ["red", "blue", "green", "yellow"]
- We start with a list of colors.
colors[-1] = "purple"
- The -1 index points to the last item in the list, "yellow". We change it to "purple".
colors[1] = "aqua"
- The index 1 points to the second item in the list, "blue". We change it to "aqua".
print(colors)
- This displays the updated list: ['red', 'aqua', 'green', 'purple'].

Why is this useful?

You can quickly update specific items in a list without recreating the whole list.
For example, you could use this to update data in a list, like replacing incorrect information or renaming items.

Key concept:

Positive Indexing: Counts from the start (0, 1, 2, …).
Negative Indexing: Counts from the end (-1, -2, -3, …).
This flexibility allows you to easily access or update any item in your list.
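
The update steps described above, as a short sketch with the same list and values:

```python
colors = ["red", "blue", "green", "yellow"]

# Negative index: -1 is the last item ("yellow")
colors[-1] = "purple"

# Positive index: 1 is the second item ("blue")
colors[1] = "aqua"

print(colors)  # ['red', 'aqua', 'green', 'purple']
```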

Add something to a list.

shopping_list = ["apples", "bananas", "bread"]
- This is the original list with three items: apples, bananas, and bread.
new_list = shopping_list + ["milk"]
- Here, you take the original list and add “milk” to the end. A new list called new_list is created.
final_list = new_list + ["eggs"]
- Now, you take the updated list and add “eggs” to the end. A new list called final_list is created.
print(final_list)
- This shows the complete shopping list: ['apples', 'bananas', 'bread', 'milk', 'eggs'].
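
A sketch of the steps above. The last print is my addition: because + builds a new list each time, the original shopping_list stays unchanged.

```python
shopping_list = ["apples", "bananas", "bread"]

# Each + creates a new list with the extra item at the end
new_list = shopping_list + ["milk"]
final_list = new_list + ["eggs"]

print(final_list)     # ['apples', 'bananas', 'bread', 'milk', 'eggs']
print(shopping_list)  # ['apples', 'bananas', 'bread']
```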

Remove something from a list. — del

Imagine you have a long list of items, like a shopping list. Now, you realize you don’t need a few items anymore, and you want to cross them off the list. In Python, you can remove items from a list using the del command.

areas = [...]
- This is your list of areas, like different parts of a house (e.g., “hallway,” “kitchen,” and so on).
del areas[10:12]
- This tells Python: “Delete items in the list starting at index 10 and stopping right before index 12.”
- Index 10 refers to “poolhouse,” and index 11 refers to its size (24.5). So, these two items are removed.
print(areas)
- This shows the updated list, where “poolhouse” and its size are no longer included.

Think about it:

Your list is like a menu.
The del command is like erasing parts of the menu that you don’t want anymore.
By specifying [start:end], you tell Python exactly which parts to remove.

Explanation of example 2:

The list starts as ["apple", "banana", "cherry", "date", "fig"].
del fruits[2:4] removes the items at index 2 and 3, which are “cherry” and “date.”
The updated list becomes ["apple", "banana", "fig"].
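
Example 2 can be sketched like this, using the fruits list from the explanation:

```python
fruits = ["apple", "banana", "cherry", "date", "fig"]

# Delete the items at index 2 and 3 (index 4 is not included)
del fruits[2:4]

print(fruits)  # ['apple', 'banana', 'fig']
```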

Copy a list without affecting the original

When you work with lists in Python, assigning a list to a new name with = does not copy it: both names refer to the same list. This means that if you change the list through one name, the change is visible through the other too. To create a real, independent copy, you need to make an explicit copy, for example with [:] or list().

1. Create the original list

You create a list called fruits with 3 items: "apple", "banana", and "cherry".
Think of this as a box labeled “fruits” containing 3 fruits.

2. Make a linked copy

Here, you are not creating a new list.
Instead, fruits_linked is just another name for the same box as fruits.
If you change something in fruits_linked, it will also change in fruits (because they share the same box).

3. Make an independent copy

This creates a new box that looks like the original box.
fruits_copy is now a separate list. If you change it, the original list (fruits) will stay the same.

4. Modify the linked copy

You change the first item in fruits_linked from "apple" to "orange".
Since fruits_linked and fruits are the same box, this change also happens in fruits.

5. Modify the independent copy

You change the second item in fruits_copy from "banana" to "grape".
This change only happens in fruits_copy because it is a separate box.

6. Print the results

fruits and fruits_linked both show "orange" as the first item because they are linked (same box).
fruits_copy is different because it was copied separately.

Imagine it like this:

fruits and fruits_linked: Two people writing on the same whiteboard. If one erases something, the other sees it too.
fruits_copy: A photo of the whiteboard. If you draw on the photo, the whiteboard stays unchanged.

Why is this important?

Use = if you want two names to point to the same list.
Use [:] or list() if you want to create a new list that won’t affect the original.
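
The six steps above as one sketch, using the names from the explanation. The two `is` checks at the end are my addition; they confirm which names point to the same list.

```python
# 1. Create the original list
fruits = ["apple", "banana", "cherry"]

# 2. Linked "copy": just a second name for the same list
fruits_linked = fruits

# 3. Independent copy: a new list with the same items
fruits_copy = fruits[:]

# 4. Changing the linked copy also changes fruits
fruits_linked[0] = "orange"

# 5. Changing the independent copy leaves fruits alone
fruits_copy[1] = "grape"

# 6. Print the results
print(fruits)         # ['orange', 'banana', 'cherry']
print(fruits_linked)  # ['orange', 'banana', 'cherry']
print(fruits_copy)    # ['apple', 'grape', 'cherry']

print(fruits is fruits_linked)  # True (same list)
print(fruits is fruits_copy)    # False (separate list)
```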

Sorting a list in descending order — sorted(list_name, reverse=True)

What’s the goal? Imagine you have a list of fruit prices, and you want to know which fruit is the most expensive. You sort the list from highest to lowest.
What’s happening step by step?
- Step 1: Create a list of prices:
  
  fruit_prices = [2.5, 1.2, 3.0, 0.8, 1.5]
  
  Here, apples cost €2.50, bananas cost €1.20, and so on.
- Step 2: Use sorted() to sort the list:
  
  sorted_prices = sorted(fruit_prices, reverse=True)
  
  This tells Python: “Sort the list, but start with the highest number.”
- Step 3: Print the result:
  
  print(sorted_prices)
  
  Now you see the prices neatly arranged from highest to lowest.
What does Python show? Python displays:

[3.0, 2.5, 1.5, 1.2, 0.8]

This means €3.00 is the most expensive, and €0.80 is the cheapest.

Why is this useful? If you want to pick the most expensive or cheapest fruit, you can easily see which ones they are.

Convert to capitals and count letters

word = "banana":
We create a variable called word and store the text “banana” in it. This is our starting string.
word_upper = word.upper():
This command converts all the letters in the string word to uppercase and stores the result in word_upper. For “banana”, it becomes “BANANA”. The original word is not changed.
print(word) and print(word_upper):
- The first print shows the original word (“banana”).
- The second print shows the uppercase version (“BANANA”).
count_a = word.count("a"):
This command counts how many times the letter “a” appears in the word “banana” and stores the result in count_a. The result is 3 because “a” appears three times.
print(count_a):
This prints the result of the counting, which is the number 3.
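
A sketch of the code these steps describe, with the variable names taken from the explanation:

```python
word = "banana"

# upper() returns a new string; word itself stays lowercase
word_upper = word.upper()

print(word)        # banana
print(word_upper)  # BANANA

# Count how often the letter "a" appears
count_a = word.count("a")
print(count_a)  # 3
```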

Importing Numpy and using arrays

Step 1: Import the NumPy library

What’s happening?
We are bringing in the NumPy library so we can use it in our code. NumPy is a powerful tool for working with numbers, especially lists of numbers (called arrays).
Why do we write as np?
It’s just a shortcut. Instead of typing numpy every time, we can now type np.

Step 2: Define the `height_in` variable

What’s happening?
We create a list called height_in. It contains the heights of people in inches (e.g., 72 inches, 65 inches, etc.).
Why do we do this?
We need some data (heights in this case) to work with. NumPy will help us process this data.

Step 3: Convert the list to a NumPy array

What’s happening?
We are turning the list height_in into a NumPy array using np.array().
Why do we do this?
NumPy arrays are more powerful than regular Python lists. They let us do math on all the numbers in the array at the same time. This makes calculations faster and easier.

Step 4: Check the type of the array

What’s happening?
This prints the type of np_height_in. It should show that it’s a NumPy array (type: numpy.ndarray).
Why do we do this?
To make sure that the conversion from a list to a NumPy array worked correctly.

Step 5: Convert the heights to meters

What’s happening?
We multiply each height in the array np_height_in by 0.0254. This converts the heights from inches to meters because 1 inch equals 0.0254 meters.
Why do we do this?
Many countries and scientific fields use meters instead of inches. This calculation gives us the heights in meters.

Step 6: Print the heights in meters

What’s happening?
This prints the converted heights (in meters) so we can see the result.
Why do we do this?
To check that the conversion worked and to see the new values.

Full Explanation of the Code:

We import NumPy so we can use its tools.
We create a list of heights in inches.
We convert the list into a NumPy array so we can do calculations easily.
We check the type of the array to confirm it’s a NumPy array.
We convert the heights to meters using a simple multiplication.
We print the result to see the heights in meters.
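
The six steps above could look like the sketch below. The variable names follow the explanation, but the list of heights is an assumption: only 72 and 65 are mentioned in the text, and the other values are made up for illustration. The name np_height_m is also my own choice.

```python
# Step 1: import NumPy under the short name np
import numpy as np

# Step 2: heights in inches (72 and 65 are from the text, the rest are hypothetical)
height_in = [72, 65, 70, 68, 74]

# Step 3: convert the list to a NumPy array
np_height_in = np.array(height_in)

# Step 4: check the type
print(type(np_height_in))  # <class 'numpy.ndarray'>

# Step 5: convert every height to meters in one operation
np_height_m = np_height_in * 0.0254

# Step 6: print the heights in meters
print(np_height_m)
```

Multiplying the array by 0.0254 converts every element at once, which a regular Python list cannot do without a loop.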

2D numpy array explained

A 2D Numpy array is a special kind of table or grid where data is arranged into rows and columns. Each row represents a horizontal line, and each column represents a vertical line. Think of it as a spreadsheet or matrix.

Example: A Simple 2D Numpy Array

Imagine we have a 2D Numpy array like this:

This array looks like this:

How to Print a Row

To print a specific row, use the row index. For example:

Output:

The number inside the square brackets [0] tells Python to take the first row (remember, counting starts at 0).

If you want to print the second row:

Output:

How to Print a Column

To print a specific column, use : to select all rows, and the column index to pick the column. For example:

Output:

The : means “take all rows.”
The 0 after the comma tells Python to take the first column.

If you want to print the third column:

Output:

Summary:

Print a row: Use array[row_index]
- Example: array[1] → Prints the second row.
Print a column: Use array[:, column_index]
- Example: array[:, 2] → Prints the third column.
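
Since the original array is not shown here, the sketch below uses a small hypothetical 3×3 array to illustrate row and column selection.

```python
import numpy as np

# Hypothetical 3x3 array, chosen only for illustration
array = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Rows: one index
print(array[0])  # [1 2 3]  (first row)
print(array[1])  # [4 5 6]  (second row)

# Columns: ":" for all rows, then the column index
print(array[:, 0])  # [1 4 7]  (first column)
print(array[:, 2])  # [3 6 9]  (third column)
```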

Subsetting. Selecting a specific part of a Numpy array, such as a row, a column, or even a single element

What is Subsetting?

Subsetting means selecting a specific part of a Numpy array, such as a row, a column, or even a single element. You do this by providing the index of the part you want to select.

Step 1: Create a 2D Array

Let’s start with a simple 2D array:

Here:

The first row is [1.73, 1.68, 1.71, 1.89, 1.79].
The second row is [65.4, 59.2, 63.6, 88.4, 68.7].

Visually, the array looks like this:

Step 2: Select a Row or Column

To select a row or a column:

Select a Row: Use the row index. For example:

np_2d[0] # Select the first row

Result:

array([1.73, 1.68, 1.71, 1.89, 1.79])

Select a Column: Use : for all rows and the column index. For example:

np_2d[:, 1] # Select the second column

Result:

array([ 1.68, 59.2 ])

Step 3: Select a Specific Element

To select a single element, provide both the row and column index. For example:

Rows start at 0.
Columns also start at 0.

In this case, the result is:

Step 4: Select Multiple Rows or Columns

Select Multiple Columns: Use a range of indices:

np_2d[:, 1:4] # Select columns 1 through 3

Result:

array([[ 1.68, 1.71, 1.89], [59.2 , 63.6 , 88.4 ]])

Select Multiple Rows: For example:

np_2d[0:2, :] # Select rows 0 and 1 (here, all rows)

This returns the full array:

array([[ 1.73, 1.68, 1.71, 1.89, 1.79], [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])

Key Points to Remember

Indexing starts at 0: The first row or column has index 0.
Use : to select everything: For example, np_2d[:, 2] selects the column at index 2 (the third column) for all rows.
Slicing works like Python lists: Use start:stop to select parts of a row or column.
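
The subsetting steps can be tried out with the np_2d array from the text. The single-element example np_2d[1, 2] is my own choice of indices, picked only to illustrate the row-then-column order.

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

print(np_2d.shape)  # (2, 5): 2 rows, 5 columns

# A whole row and a whole column
first_row = np_2d[0]
second_column = np_2d[:, 1]
print(first_row)
print(second_column)

# A single element: row index first, then column index
print(np_2d[1, 2])  # 63.6 (second row, third column)

# Several columns at once: index 1 up to, but not including, 4
middle_columns = np_2d[:, 1:4]
print(middle_columns)
```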

Calculating mean and median

Step 1: Import Numpy

To use Numpy, we need to import the library:

Step 2: Create Data (Array)

Let’s create a Numpy array with a list of heights (in inches):

Now the heights array looks like this:

Step 3: Calculate the Mean

The mean is the average of all the values. To calculate the mean, we use:

What happens here:

np.mean(heights) calculates the average height:

(65 + 70 + 75 + 80 + 85) / 5 = 75

Output:

Step 4: Calculate the Median

The median is the middle value when the numbers are sorted. To calculate the median, we use:

What happens here:

np.median(heights) finds the middle value:
- Since the numbers are already sorted, the middle value is 75.

Output:

Full Code Example:

Here is the complete code:

Output:

Step-by-Step Explanation:

Import Numpy: This is required to use Numpy’s functions like mean() and median().
Create Data: A Numpy array is created to store the height values.
Calculate Mean:
- Use np.mean() to find the average height.
- Add all values together and divide by the total number of values.
Calculate Median:
- Use np.median() to find the middle value.
- The median is the middle value when all numbers are sorted.
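
A sketch of the complete example, using the five heights from the calculation above. The names mean_height and median_height are my own.

```python
import numpy as np

# Heights in inches, as used in the mean calculation above
heights = np.array([65, 70, 75, 80, 85])

# Mean: sum of all values divided by the number of values
mean_height = np.mean(heights)

# Median: the middle value of the sorted data
median_height = np.median(heights)

print(mean_height)    # 75.0
print(median_height)  # 75.0
```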

Python advanced data visualization code examples

Matplotlib line plot data visualization

Print the last year and the predicted population for that year.
Create a line plot to show how the population grows from 1950 to 2100.

We use the matplotlib library to create the plot.

Step-by-Step Code Explanation

1. Print the last year and population

year[-1] gets the last item from the year list (2100 in this case).
pop[-1] gets the last item from the pop list (10.85 billion people in this case).

Result in the console:

2. Import the required library

We import the pyplot module from matplotlib (a Python library for creating plots).

3. Create the line plot

plt.plot(x, y) creates a line graph where:
- x is the year list (years from 1950 to 2100).
- y is the pop list (world population for each year).

This shows the population trend over time.

4. Display the plot

This command displays the plot on the screen.

Graph Output:

The x-axis shows the years (1950 to 2100).
The y-axis shows the population in billions.
The line grows upward, showing how the population is expected to rise over time.

Key Observations from the Results

The last year is 2100, and the predicted population for that year is 10.85 billion.
The plot shows a steady increase in world population from 1950 to 2100.
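
The four steps above can be sketched as follows. Note the assumptions: only the last pair (2100 and 10.85) comes from the text, the other numbers are made-up placeholders, and the "Agg" backend line is only there so the sketch also runs on a machine without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt

# Placeholder data: only the final values (2100, 10.85) are taken from the text
year = [1950, 2000, 2050, 2100]
pop = [2.5, 6.1, 9.4, 10.85]

# 1. Print the last year and the predicted population for that year
print(year[-1], pop[-1])

# 3. Line plot with year on the x-axis and pop on the y-axis
plt.plot(year, pop)

# 4. Display the plot
plt.show()
```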

Line to logarithmic plot

The goal is to create a scatter plot showing the relationship between GDP per capita (gdp_cap) and life expectancy (life_exp). To better understand the data, we also apply a logarithmic scale to the x-axis (GDP per capita).

Steps

1. Problem with the code

The current code uses plt.plot() to create a line plot, but the task is to create a scatter plot. Also, the x-axis needs a logarithmic scale to make the data easier to interpret.

2. Correct the plot type

To create a scatter plot, replace the plt.plot() function with plt.scatter().

Corrected code:

This shows individual data points (countries) instead of connecting them with lines.

3. Apply the logarithmic scale

Set the x-axis to a logarithmic scale using:

This compresses the large range of GDP values, making the plot more readable and revealing trends more clearly.

4. Display the plot

To display the scatter plot, use:

This will generate the corrected plot.

Final Corrected Code

What does the plot show?

Each dot represents a country.
The x-axis (GDP per capita) is scaled logarithmically, showing a better distribution of countries with low, medium, and high GDP.
The y-axis (life expectancy) shows how long people live on average in each country.
The scatter plot makes it easier to see patterns or correlations between GDP and life expectancy.
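
A minimal sketch of the corrected code, under some assumptions: the gdp_cap and life_exp values are invented placeholders for five hypothetical countries, and the "Agg" backend line only makes the sketch runnable without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt

# Placeholder values for five hypothetical countries
gdp_cap = [800, 2500, 9000, 25000, 45000]
life_exp = [52.0, 61.5, 70.2, 77.8, 80.9]

# plt.scatter draws separate dots instead of a connected line
plt.scatter(gdp_cap, life_exp)

# Put the x-axis (GDP per capita) on a logarithmic scale
plt.xscale('log')

plt.show()
```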

Scatter plot

Create a scatter plot that shows the relationship between GDP per capita (gdp_cap) and life expectancy (life_exp).
Use the matplotlib.pyplot library to build and display the plot.

Step 1: Importing Matplotlib

What this does: This line imports a part of the Matplotlib library, called pyplot, and gives it a nickname plt (so we don’t have to type the full name every time).
Why we do this: pyplot is used to create all kinds of plots, like line graphs, bar charts, or scatter plots.

Step 2: Defining the Data

What this does: These two lists store data:
- gdp_cap: The GDP per capita values (how much money a person earns on average in different countries).
- life_exp: The life expectancy values (average age people live to in those countries).
Why we do this: This is the data we want to visualize in our scatter plot.

Step 3: Creating a Scatter Plot

What this does: This creates a scatter plot using the data.
- Each point on the plot corresponds to a value from gdp_cap (on the x-axis) and life_exp (on the y-axis).
Why we do this: Scatter plots are great for showing the relationship between two variables.

Step 4: Customizing the X-Axis Scale

What this does: This changes the x-axis scale to a logarithmic scale. A logarithmic scale is helpful when the data has very large differences in values (e.g., 1,000 vs. 40,000).
Why we do this: Without this, the plot would look squished because the numbers on the x-axis vary so much.

Step 5: Adding Labels and a Title

What this does:
- plt.xlabel: Adds a label to the x-axis, explaining that it shows GDP per capita.
- plt.ylabel: Adds a label to the y-axis, explaining that it shows life expectancy in years.
- plt.title: Adds a title to the plot.
Why we do this: Labels and titles make the plot easier to understand.

Step 6: Changing the X-Axis Tick Marks

What this does:
- tick_val: Defines the values to show on the x-axis (e.g., 1,000, 10,000, 100,000).
- tick_lab: Defines how those values should look on the plot (e.g., “1k” instead of “1000”).
- plt.xticks: Updates the x-axis to use these custom tick marks.
Why we do this: It makes the axis easier to read.

Step 7: Displaying the Plot

What this does: This tells Python to show the plot in a new window or inline (depending on the environment).
Why we do this: In a plain Python script, the plot won’t be displayed without this command. Some notebook environments show the plot automatically.

What the Code does as a whole:

Imports the library needed to create plots.
Defines the data for GDP and life expectancy.
Creates a scatter plot to visualize the relationship between GDP and life expectancy.
Customizes the plot by setting the x-axis scale, adding labels, a title, and better tick marks.
Displays the plot to the user.

Output:

You get a scatter plot where:

The x-axis shows GDP per capita (log scale).
The y-axis shows life expectancy.
Each dot represents a data point.
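
Steps 1 to 7 could be sketched like this. The data values are invented placeholders, the label and title strings are borrowed from the "Axis labels" section of this journal, and the "Agg" backend line is only an assumption so the sketch runs without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt

# Step 2: placeholder data for five hypothetical countries
gdp_cap = [800, 2500, 9000, 25000, 45000]
life_exp = [52.0, 61.5, 70.2, 77.8, 80.9]

# Step 3 and 4: scatter plot with a logarithmic x-axis
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# Step 5: labels and title
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')

# Step 6: custom tick positions and the text shown at each position
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']
plt.xticks(tick_val, tick_lab)

# Step 7: display the plot
plt.show()
```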

Histogram

A histogram is a type of graph that shows how data is distributed.
It groups numbers into “bins” (ranges of values) and shows how many numbers fall into each bin.
For example, if your data has numbers like 40, 42, 43, and 45, these might all fall into the same bin (e.g., 40–50).

Step 2: Import Matplotlib

Even though it’s not shown here, you must first import Matplotlib in your script before using its functions. Usually, we write:

What does this do? It imports the library that lets us create plots and graphs, including histograms.
Why plt? This is a short name for the pyplot module, so we don’t have to type matplotlib.pyplot every time.

Step 3: Plot a histogram

What happens here?
- plt.hist() creates a histogram of the data in life_exp.
- life_exp is a list of life expectancy values (numbers that represent how long people live on average in different countries).
- The function automatically groups these numbers into bins and counts how many numbers are in each bin.

Step 4: Show the histogram

What happens here?
- This displays the histogram on the screen.
- Without plt.show(), the graph might not appear depending on your coding environment.

The Result

The histogram shows:
- The x-axis: The bins (ranges of life expectancy values, like 40–50, 50–60, etc.).
- The y-axis: How many countries have life expectancies in each range.

Full Explanation of the Code

plt.hist(life_exp): Creates the histogram for the data in life_exp.
plt.show(): Displays the histogram on the screen.
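
To see the grouping into bins in numbers, here is a small sketch. plt.hist() returns the count per bin and the bin edges, so we can print them. The life_exp values are made-up placeholders, and the "Agg" backend line is only an assumption so the sketch runs without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt

# Placeholder life expectancy values for twelve hypothetical countries
life_exp = [43.8, 52.3, 59.4, 64.1, 67.2, 70.6,
            72.4, 74.9, 76.4, 78.3, 80.7, 82.6]

# plt.hist draws the bars and also returns the numbers behind them
counts, bin_edges, _ = plt.hist(life_exp)

print(counts)     # how many values fall into each bin
print(bin_edges)  # where each bin starts and ends

plt.show()
```

With the default settings Matplotlib uses 10 bins, so there are 10 counts and 11 bin edges.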

Bins

Step 1: Build a histogram with 5 bins

What does this do?
- This creates a histogram using the life_exp data.
- The bins=5 means the data will be divided into 5 groups or ranges (e.g., 40–50, 50–60, etc.).
- Each bar in the histogram represents the number of data points (countries) in that range.

Step 2: Show the histogram

What does this do?
- It displays the histogram on the screen.
- You’ll see the 5 bars representing the distribution of the life_exp data.

Step 3: Clear the plot

What does this do?
- It clears the current plot so you can create a new plot without overlapping the previous one.

Step 4: Build a histogram with 20 bins

What does this do?
- It creates another histogram using the same life_exp data, but this time the data is divided into 20 groups or bins.
- Since there are more bins, the bars are narrower, and you can see more details about how the data is distributed.

Step 5: Show the second histogram

What does this do?
- It displays the second histogram with 20 bins.
- The graph now shows a more detailed view of the data.

Step 6: Clear the second plot

What does this do?
- It clears the second plot, so the environment is ready for any future plots.

Why do we use bins?

The number of bins determines how detailed the histogram looks.
- Fewer bins (5): Gives a broader, more general overview.
- More bins (20): Gives more detail but can make the graph harder to interpret.

What does the code do overall?

Creates a histogram of life_exp with 5 bins and shows it.
Clears the plot and creates a second histogram with 20 bins to show more detail.
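
The six steps can be sketched as follows, assuming made-up placeholder values for life_exp. The names counts_5 and counts_20 are my own and only capture the bar heights so they can be inspected. The "Agg" backend line is an assumption so the sketch runs without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt

# Placeholder life expectancy values for twelve hypothetical countries
life_exp = [43.8, 52.3, 59.4, 64.1, 67.2, 70.6,
            72.4, 74.9, 76.4, 78.3, 80.7, 82.6]

# Steps 1 to 3: histogram with 5 bins, show it, then clear the figure
counts_5, _, _ = plt.hist(life_exp, bins=5)
plt.show()
plt.clf()

# Steps 4 to 6: histogram with 20 bins, show it, then clear the figure
counts_20, _, _ = plt.hist(life_exp, bins=20)
plt.show()
plt.clf()

print(len(counts_5), len(counts_20))
```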

Axis labels

Step 1: What does this code do?

This code creates a scatter plot showing the relationship between:

GDP per Capita (x-axis): How much money people make per person in a country.
Life Expectancy (y-axis): How long people live on average in that country.

The plot uses a logarithmic scale on the x-axis to make it easier to see the data when GDP numbers are very large.

Step 2: Code Breakdown

1. Labels and Title

What does this do?
- It defines the labels for the x-axis and y-axis.
- It defines the title of the plot.
Why do this?
- To make the plot easier to understand for anyone looking at it.

2. Create the Scatter Plot

What does this do?
- It creates the scatter plot with gdp_cap (GDP per capita) on the x-axis and life_exp (life expectancy) on the y-axis.
- Each point represents a country.

3. Use a Logarithmic Scale

What does this do?
- It changes the x-axis to a logarithmic scale.
- A logarithmic scale makes very large numbers (like GDP) easier to compare by compressing the scale.

4. Add Axis Labels

What does this do?
- Adds labels to the x-axis (GDP per Capita [in USD]) and y-axis (Life Expectancy [in years]).

5. Add Title

What does this do?
- Adds the title “World Development in 2007” to the plot.

6. Display the Plot

What does this do?
- It displays the scatter plot with all the customizations (labels, title, and logarithmic scale).

Step 3: What Does the Plot Show?

The dots represent countries.
The x-axis shows GDP per capita (logarithmic scale).
The y-axis shows life expectancy in years.
Countries with higher GDP per capita tend to have higher life expectancy, but there is still variation.

Bubble plot

What is a Bubble Plot?

A bubble plot is a type of graph that shows three pieces of information at the same time:

X-axis: One variable (like GDP per capita).
Y-axis: Another variable (like life expectancy).
Bubble size: A third variable (like population).

Each bubble represents a country in this example, and the size of the bubble shows how big the population is.

Step-by-Step Explanation of the Code:

1. Import Libraries

What this does: We need numpy to handle numbers and arrays (for population). We use matplotlib.pyplot to create the graph.

2. Define the Data

gdp_cap: This is the GDP per capita (average income) for five countries.
life_exp: This is the life expectancy (average age people live to) in those countries.
pop: This is the population of those countries (in millions).

3. Convert and Double the Population Data

Why we do this:
- Converting pop into a NumPy array makes it easier to work with.
- Doubling the values makes the bubbles bigger, so they’re easier to see on the plot.
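
A short sketch of why the conversion matters, using made-up placeholder populations: multiplying a plain list by 2 repeats the list, while multiplying a NumPy array by 2 doubles every value.

```python
import numpy as np

# Placeholder populations (in millions) for three hypothetical countries
pop = [10.0, 50.0, 300.0]

# A plain list is repeated, not doubled
print(pop * 2)

# A NumPy array is multiplied element by element
np_pop = np.array(pop)
np_pop = np_pop * 2
print(np_pop)
```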

4. Create the Bubble Plot

What happens here:
- gdp_cap is on the x-axis (horizontal).
- life_exp is on the y-axis (vertical).
- s=np_pop controls the size of the bubbles. Matplotlib reads each value as the area of the marker, so bigger populations mean bigger bubbles.

5. Customize the Plot

What this does:
- plt.xscale('log'): Changes the x-axis to a logarithmic scale so large GDP values are easier to compare.
- plt.xlabel() and plt.ylabel(): Add labels to explain what the axes represent.
- plt.title(): Adds a title to tell people what the graph is about.

6. Add Custom Tick Marks

What this does:
- The x-axis will now show ticks at 1,000, 10,000, and 100,000.
- Instead of large numbers, the labels will show “1k”, “10k”, and “100k” to make it easier to read.

7. Show the Plot

What this does: Opens a window to display the plot.

What you will see:

X-axis: The GDP per capita (income of a person in a country).
Y-axis: The life expectancy (average lifespan in years).
Bubbles: Each bubble represents a country. The size of the bubble shows the population. Countries with larger populations have bigger bubbles.

Example:

If a bubble is on the right (high GDP) and at the top (high life expectancy), it means:

The country is rich, and people live long.

If a bubble is on the left (low GDP) and near the bottom (low life expectancy), it means:

The country is poorer, and people live shorter lives.

Adding colors to bubble plot

What does this code do?

It creates a colorful bubble plot to show the relationship between:

GDP per capita (how rich countries are),
Life expectancy (how long people live), and
Population size (how big the country is).

Each bubble represents a country. The size of the bubble shows the population, and the color shows the continent.

Step-by-Step Explanation

Step 1: Import the Tools We Need

Why?
- numpy helps us work with numbers easily.
- matplotlib.pyplot is used to make the bubble plot.

Step 2: Define the Data

What does this mean?
- gdp_cap: GDP per capita (income) for 5 countries.
- life_exp: Life expectancy (how long people live) in those countries.
- pop: Population sizes (in millions).
- col: Colors representing continents (e.g., red for Asia, green for Europe).

Step 3: Prepare the Bubble Sizes

What does this do?
- Converts the population list (pop) into a NumPy array so we can do math on it.
- Multiplies each population by 2 to make the bubbles bigger and easier to see.

Step 4: Make the Bubble Plot

What does this do?
- x=gdp_cap: GDP per capita is on the x-axis (horizontal).
- y=life_exp: Life expectancy is on the y-axis (vertical).
- s=np_pop: Bubble size depends on the population.
- c=col: Each bubble gets a color based on the continent.
- alpha=0.8: Makes bubbles slightly see-through, so overlapping bubbles are easier to see.

Step 5: Customize the Plot

What does this do?
- plt.xscale('log'): Makes the x-axis logarithmic. This helps compare small and large GDP values.
- plt.xlabel(): Adds a label to the x-axis.
- plt.ylabel(): Adds a label to the y-axis.
- plt.title(): Adds a title to the graph.

Step 6: Add Tick Marks

What does this do?
- Changes the tick marks on the x-axis:
  - 1000 becomes “1k”.
  - 10000 becomes “10k”.
  - 100000 becomes “100k”.
- This makes the numbers easier to read.

Step 7: Show the Plot

What does this do?
- Opens a window with your colorful bubble plot.

What Does the Final Plot Show?

X-axis (GDP per Capita):
- Shows how much money an average person makes in a country.
- Countries further to the right are richer.
Y-axis (Life Expectancy):
- Shows how long people live on average.
- Countries higher up have people who live longer.
Bubble Size:
- The size of the bubble shows the population of the country.
- Bigger bubbles mean bigger populations.
Bubble Color:
- Each bubble has a color representing a continent.

Example:

A big red bubble far to the right and near the top means:
- The country is in Asia.
- It is rich (high GDP per capita).
- Its people live long lives (high life expectancy).
- It has a large population.
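
A minimal sketch of the whole colored bubble plot. All data values and the title are invented placeholders for five hypothetical countries, and the colors are assigned arbitrarily. The "Agg" backend line is an assumption so the sketch runs without a screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend; remove this line to get a plot window
import matplotlib.pyplot as plt
import numpy as np

# Step 2: placeholder data for five hypothetical countries
gdp_cap = [800, 2500, 9000, 25000, 45000]
life_exp = [52.0, 61.5, 70.2, 77.8, 80.9]
pop = [30.0, 150.0, 1300.0, 60.0, 300.0]         # in millions
col = ['red', 'green', 'blue', 'yellow', 'black']  # one color per country

# Step 3: convert to a NumPy array and double the values for bigger bubbles
np_pop = np.array(pop)
np_pop = np_pop * 2

# Step 4: s sets the bubble size, c the color, alpha the transparency
plt.scatter(x=gdp_cap, y=life_exp, s=np_pop, c=col, alpha=0.8)

# Step 5: scale, labels and a placeholder title
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('Placeholder title')

# Step 6: readable tick marks
plt.xticks([1000, 10000, 100000], ['1k', '10k', '100k'])

# Step 7: display the plot
plt.show()
```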

Python advanced dictionary and pandas

Dictionary

Step-by-Step Explanation

Step 1: Define the Data

What this does: Creates two lists:
- countries: Contains the names of European countries.
- capitals: Contains the capitals corresponding to each country.
Order matters: Each country’s capital is at the same position (index) in the capitals list.

Step 2: Find the Index of ‘germany’

What this does:
- The index() method looks for 'germany' in the countries list and returns its position.
- In this case, 'germany' is at position 2 (Python starts counting from 0).

Step 3: Use the Index to Access the Capital

What this does:
- Uses ind_ger (which is 2) to find the element at position 2 in the capitals list.
- This retrieves 'berlin', the capital of Germany.

Step 4: Print the Capital

What this does:
- Outputs 'berlin' to the console.

What happens when you run the code?

Python finds 'germany' at position 2 in the countries list.
It then uses this index to look up the corresponding capital in the capitals list, which is 'berlin'.
Finally, it prints 'berlin'.

Output

Why use this approach?

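Dynamic Access: If items are added or reordered, the code still finds the correct match because it looks up the index each time. This only works as long as both lists are changed together, so that each capital stays at the same position as its country.
Convenient: Instead of manually figuring out the index, the index() method does it for you. Note that index() searches the list from the start, so it gets slower for very long lists.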
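
The four steps could be sketched like this. Only 'germany' at position 2 and 'berlin' come from the explanation above; the other countries and capitals are my own assumption to fill the lists. The last lines show, as a comparison, how a dictionary (the variable name europe is illustrative) does the same lookup in one step.

```python
# Step 1: two lists in matching order (entries other than germany/berlin are assumed)
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Step 2: find the position of 'germany'
ind_ger = countries.index('germany')

# Step 3 and 4: use that position to look up and print the capital
print(capitals[ind_ger])

# For comparison: a dictionary links each country directly to its capital
europe = dict(zip(countries, capitals))
print(europe['germany'])
```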

Import Pandas

Let’s pretend:

You have a list of fruits, their colors, and their prices. You want to organize this information in a neat table. Here’s how we can do it in Python.

Step 1: Make your lists

Think of each list as a column in your table. Each column will have information about the fruits, colors, or prices.

Step 2: Create a dictionary

A dictionary in Python is like a container where you label your lists with names (called “keys”). These labels will be the column headers of your table.

Now, the dictionary looks like this in plain English:

Step 3: Convert the dictionary to a Pandas DataFrame

A Pandas DataFrame is like a spreadsheet in Python. To turn the dictionary into a DataFrame, we use pd.DataFrame().

First, import Pandas:

Now, fruit_table looks like this (a table!):

Fruit	Color	Price
Apple	Red	1.2
Banana	Yellow	0.5
Cherry	Red	2.0
Date	Brown	3.0

Step 4: Print your DataFrame

To see your beautiful table, use the print() function:

When you run this, you’ll see:

Why does this work?

The lists contain the data (fruits, colors, prices).
The dictionary organizes the lists into columns, with each key becoming a column name.
The DataFrame takes this dictionary and makes it look like a proper table.
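
Steps 1 to 4 together, as a minimal sketch. The values come from the table above; the list names fruits, colors and prices and the dictionary name fruit_dict are assumptions.

```python
import pandas as pd

# Step 1: one list per column
fruits = ["Apple", "Banana", "Cherry", "Date"]
colors = ["Red", "Yellow", "Red", "Brown"]
prices = [1.2, 0.5, 2.0, 3.0]

# Step 2: the keys of the dictionary become the column headers
fruit_dict = {
    "Fruit": fruits,
    "Color": colors,
    "Price": prices,
}

# Step 3: turn the dictionary into a DataFrame
fruit_table = pd.DataFrame(fruit_dict)

# Step 4: print the table
print(fruit_table)
```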

Custom row labels

By default, a Pandas DataFrame assigns numbers (0, 1, 2, etc.) as row labels. These are called indices. However, sometimes you want to replace these default labels with custom ones (like country codes or names). This is what we’re doing in this exercise.

We’ll create a small example to show how this works.

Example: Fruits and their Prices

Step 1: Create the Data

Let’s create a dictionary of fruits and their prices.

import pandas as pd

# Create data
fruits = ["Apple", "Banana", "Cherry"]
prices = [1.2, 0.5, 2.5]

# Build the dictionary
fruit_dict = {
    "Fruit": fruits,
    "Price": prices
}

# Create a DataFrame
fruit_table = pd.DataFrame(fruit_dict)

# Print the DataFrame
print(fruit_table)

Output:

    Fruit  Price
0   Apple    1.2
1  Banana    0.5
2  Cherry    2.5

Right now, the rows are labeled with numbers (0, 1, 2), which are the default indices.

Step 2: Create Custom Row Labels

Let’s say we want to label the rows with the first letter of each fruit (e.g., “A” for Apple, “B” for Banana). We can create a list of these labels:

# Create custom row labels
row_labels = ["A", "B", "C"]

Step 3: Replace the Default Row Labels

Now we replace the default numeric indices with our custom labels by assigning the row_labels list to fruit_table.index.

# Set custom row labels
fruit_table.index = row_labels

# Print the DataFrame again
print(fruit_table)

Output:

    Fruit  Price
A   Apple    1.2
B  Banana    0.5
C  Cherry    2.5

Now the rows are labeled with “A”, “B”, and “C” instead of 0, 1, 2.

Why is This Useful?

Custom row labels (indices) make your DataFrame more meaningful and easier to understand. For example, in the original exercise, the row labels are country codes (US, AUS, etc.), making it clear which country each row belongs to.

Summary of Steps

Create your data and build a DataFrame.
Create a list of custom labels for the rows.
Set the index attribute of your DataFrame to this list.

CSV to dataframe

What are we doing?

We want to load data from a CSV file called cars.csv into Python. A CSV file (Comma-Separated Values) is like an Excel sheet but saved as plain text with commas separating the values.

Step-by-Step Explanation

Step 1: Import Pandas

Why? We need Pandas to handle the data in the CSV file.
Think of Pandas as a “data expert” that helps you organize, clean, and analyze data in Python.

Step 2: Load the CSV File

pd.read_csv(): This is a function that tells Pandas to read the CSV file.
'cars.csv': This is the name of the file we’re reading. With a bare file name like this, Pandas looks in the folder Python is running from, which is usually the same folder as your Python script or notebook.
- Imagine you’re asking Pandas:
  “Hey Pandas, can you please read the data from cars.csv and store it as a table I can work with?”
cars: This is the name of the “table” (or DataFrame) where the data from the CSV file will be saved.

Step 3: Print the Data

Why? To see if everything worked properly.
When you print cars, it will display the data from the CSV file in a table-like format.

What will happen when you run the code?

Python will load the Pandas library so you can use its functions.
It will open the cars.csv file and read the data inside it.
The data will be stored in a DataFrame called cars.
Finally, it will print the data so you can see it.
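
Here is a sketch of the three steps that runs without the real file. The assumption is that cars.csv has a header row with column names; the text in csv_text is a small stand-in I wrote myself, and io.StringIO lets Pandas read it as if it were a file.

```python
import io
import pandas as pd

# Stand-in for the contents of cars.csv (assumed layout, not the real file)
csv_text = """country,drives_right
United States,True
Australia,False
Japan,False
"""

# With the real file this would be: cars = pd.read_csv('cars.csv')
cars = pd.read_csv(io.StringIO(csv_text))

print(cars)
```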

What if Something Goes Wrong?

If Python says it can’t find the file, make sure:
1. The cars.csv file is in the folder Python is running from (usually the same folder as your Python script).
2. You spelled the file name correctly (on many systems it’s case-sensitive!).
If the CSV file is in another folder, you can provide the full file path instead, like this:

cars = pd.read_csv('path/to/your/file/cars.csv')

Imagine This in Real Life

It’s like you’re a librarian, and Pandas is your assistant:

You tell Pandas to find the book (cars.csv) in the library.
Pandas opens the book, copies its contents into a neat table (the DataFrame), and hands it to you.
You then look at the table by printing it out.

Square Brackets (1)

1. Importing Pandas

What’s happening?
- You are importing the Pandas library and giving it a nickname (pd) to make it easier to use.
- Pandas is like your data manager—it helps you handle data in table-like structures called DataFrames.

2. Loading the CSV File

What’s happening?
- pd.read_csv('cars.csv'): This reads the data from the cars.csv file into a Pandas DataFrame called cars.
- index_col=0: This tells Pandas to use the first column (column at position 0) as the row labels (index).
Result: Now, the cars DataFrame has rows labeled by the first column (e.g., country codes like US, AUS, etc.).

3. Printing a Single Column as a Pandas Series

What’s happening?
- Using single square brackets, you select the column named country from the DataFrame.
- This gives you a Pandas Series, which is like a 1D list of data with labels.
Example Output:

US     United States
AUS        Australia
JPN            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
- It shows the country values along with their row labels.

4. Printing a Single Column as a Pandas DataFrame

What’s happening?
- Using double square brackets, you select the column named country, but now it is treated as a Pandas DataFrame (2D table format).
Difference: A DataFrame keeps the table format, while a Series is just a single column.
Example Output:

           country
US   United States
AUS      Australia
JPN          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt

5. Printing Multiple Columns

What’s happening?
- Using double square brackets, you select multiple columns: country and drives_right.
- This gives you a new DataFrame with only the selected columns.
Example Output:

           country  drives_right
US   United States          True
AUS      Australia         False
JPN          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True

Key Takeaways:

Single square brackets (['column']):
- Selects one column and returns a Pandas Series (1D).
Double square brackets ([['column']]):
- Selects one or more columns and returns a Pandas DataFrame (2D).
Selecting multiple columns ([['col1', 'col2']]):
- Lets you pick multiple columns at once and returns them as a DataFrame.
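
To check the difference yourself without the CSV file, here is a sketch that builds a stand-in for cars from the example outputs above (only the country and drives_right columns, which is an assumption about the real file).

```python
import pandas as pd

# Stand-in for cars.csv, built from the example outputs above
cars = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Single square brackets: a Pandas Series (1D)
print(type(cars['country']))

# Double square brackets: a Pandas DataFrame (2D)
print(type(cars[['country']]))

# Several columns at once: also a DataFrame
print(cars[['country', 'drives_right']])
```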

Square Brackets (2)

This code shows how to use slicing to select specific rows (observations) from a Pandas DataFrame. Let’s break it down step by step:

Step 1: Import the Necessary Library

What’s happening?
- You are importing the Pandas library to work with DataFrames (tables of data).

Step 2: Load the Data

What’s happening?
- pd.read_csv('cars.csv'): This reads the cars.csv file and loads it into a Pandas DataFrame called cars.
- index_col=0: This sets the first column of the file (e.g., country codes) as the row labels (index).

Step 3: Select the First Three Observations

What’s happening?
- The slicing 0:3 selects rows with index positions 0, 1, and 2.
- In slicing, the start index (0) is included, but the end index (3) is excluded.
Result: You will get the first three rows of the DataFrame.

Step 4: Select the Fourth, Fifth, and Sixth Observations

What’s happening?
- The slicing 3:6 selects rows with index positions 3, 4, and 5.
- Again, the start index (3) is included, but the end index (6) is excluded.
Result: You will get the fourth, fifth, and sixth rows of the DataFrame.

Key Concept: Slicing with Square Brackets

Syntax: data[start:end]
- Start: The row index where slicing begins (included).
- End: The row index where slicing stops (excluded).
Why Use Slicing?
- It’s an easy way to select a range of rows from a DataFrame based on their integer positions.

Expected Outputs

First 3 Observations (Rows 0, 1, 2):

Fourth, Fifth, and Sixth Observations (Rows 3, 4, 5):

Why is This Useful?

Slicing lets you extract parts of your data efficiently without writing loops.
It’s especially useful for working with large datasets where you want to inspect specific sections.
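
A small sketch of both slices. It uses a stand-in for cars built from the row labels and values shown in the "Square Brackets (1)" outputs, because the real cars.csv is not included here.

```python
import pandas as pd

# Stand-in for cars.csv (only the columns shown earlier in this journal)
cars = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# Rows at positions 0, 1 and 2 (the end position 3 is excluded)
print(cars[0:3])

# Rows at positions 3, 4 and 5 (the end position 6 is excluded)
print(cars[3:6])
```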

Loc and Iloc

What Are We Doing?

We are using loc (label-based) and iloc (position-based) to select specific rows from a DataFrame. Think of loc as selecting by name (like “Japan”) and iloc as selecting by number (like “Row 2”).

Step 1: Load the Data

What’s happening?
- We load the cars.csv file into a DataFrame called cars.
- index_col=0 makes the first column (e.g., country codes) the row labels (e.g., US, AUS, JPN).

Step 2: Select the Row for Japan

Using loc (label-based):

print(cars.loc['JPN'])
- What it does: Finds the row with the label JPN (Japan).
- Result: Returns all the data for Japan.
Using iloc (position-based):

print(cars.iloc[2])
- What it does: Finds the row at position 2 (third row, since Python counts from 0).
- Result: Returns the same data as cars.loc['JPN'] because Japan is in the 3rd row.

Step 3: Select Rows for Australia and Egypt

Using loc (label-based):

print(cars.loc[['AUS', 'EG']])
- What it does: Finds the rows labeled AUS (Australia) and EG (Egypt).
- Result: Returns the data for these two rows as a small table.
Using iloc (position-based):

print(cars.iloc[[1, 6]])
- What it does: Finds the rows at positions 1 (second row) and 6 (seventh row).
- Result: Returns the same data as cars.loc[['AUS', 'EG']].

What’s the Difference Between `loc` and `iloc`?

loc: Selects rows by name or label (e.g., “JPN”).
iloc: Selects rows by position (e.g., “Row 2”).

What Will Be Printed?

Observation for Japan:

Observations for Australia and Egypt:

Summary

Use loc when you know the row labels (e.g., JPN, AUS).
Use iloc when you know the row positions (e.g., row 2, row 6).
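
To confirm that loc and iloc return the same rows, here is a sketch with a stand-in for cars (built from the row labels and values shown earlier in this journal, since the real cars.csv is not included). The equals() method checks whether two results are identical.

```python
import pandas as pd

# Stand-in for cars.csv (only the columns shown earlier in this journal)
cars = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JPN", "IN", "RU", "MOR", "EG"],
)

# One row: by label and by position
print(cars.loc['JPN'])
print(cars.loc['JPN'].equals(cars.iloc[2]))

# Two rows: by labels and by positions
print(cars.loc[['AUS', 'EG']])
print(cars.loc[['AUS', 'EG']].equals(cars.iloc[[1, 6]]))
```

Both equals() checks print True for this stand-in data.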

Equality

Let’s break down this code step by step in the simplest way possible!

1. Comparing Booleans

What does it do?
- It checks if True is equal to False.
Result:
- True and False are opposites, so this is False.
Output: False

2. Comparing Integers

What does it do?
- It calculates -5 * 15 (which is -75) and checks if it’s not equal (!=) to 75.
Result:
- -75 is not equal to 75, so this is True.
Output: True

3. Comparing Strings

What does it do?
- It checks if the two strings "pyscript" and "PyScript" are the same.
Important:
- Python cares about uppercase and lowercase letters, so "pyscript" is different from "PyScript".
Result:
- The strings are not equal, so this is False.
Output: False

4. Comparing a Boolean with an Integer

What does it do?
- It checks if True is equal to 1.
Important:
- In Python, True is treated as the number 1, and False is treated as 0.
Result:
- Since True equals 1, this is True.
Output: True

Final Outputs

Here’s what gets printed when you run the code:

Summary

True == False: Checks if True is equal to False (False because they are opposites).
-5 * 15 != 75: Checks if -75 is not equal to 75 (True because they are different).
"pyscript" == "PyScript": Checks if two strings are the same (False because of case sensitivity).
True == 1: Checks if True equals 1 (True because Python treats True as 1).
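
The four comparisons from the summary, written out as code you can run:

```python
# 1. Comparing booleans
print(True == False)            # False

# 2. Comparing integers: -5 * 15 is -75, which is not equal to 75
print(-5 * 15 != 75)            # True

# 3. Comparing strings: uppercase and lowercase letters are different
print("pyscript" == "PyScript")  # False

# 4. Comparing a boolean with an integer: True counts as 1
print(True == 1)                # True
```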

Greater and less

1. Check if `x` is greater than or equal to `-10`

What does it do?
- x is already defined as -3 * 6, which equals -18.
- It checks: “Is -18 greater than or equal to -10?”
Result:
- -18 is less than -10, so the answer is False.
Output: False

2. Check if `"test"` is less than or equal to `y`

What does it do?
- y is already defined as "test".
- Python compares strings character by character, similar to alphabetical order in a dictionary. One difference: all uppercase letters come before lowercase letters.
- It checks: “Is "test" less than or equal to "test"?”
Result:
- "test" is exactly the same as "test", so the answer is True.
Output: True

3. Check if `True` is greater than `False`

What does it do?
- In Python:
  - True is treated as 1.
  - False is treated as 0.
- It checks: “Is 1 greater than 0?”
Result:
- Yes, 1 is greater than 0, so the answer is True.
Output: True

Final Outputs

Here’s what gets printed when you run the code:

Summary

x >= -10:
- Checks if -18 is at least -10 (it’s not).
- Answer: False
"test" <= y:
- Checks if "test" is less than or equal to "test" (it is).
- Answer: True
True > False:
- Checks if True (1) is greater than False (0) (it is).
- Answer: True
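
The three checks as runnable code, with x and y defined as described above. The last line is an extra example of my own to show that string comparison is not purely alphabetical.

```python
x = -3 * 6   # -18
y = "test"

# 1. Is x greater than or equal to -10?
print(x >= -10)      # False

# 2. Is "test" less than or equal to y?
print("test" <= y)   # True

# 3. Is True greater than False?
print(True > False)  # True

# Extra: uppercase letters sort before lowercase letters
print("Zebra" < "apple")  # True
```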

Compare arrays

Explanation

1. Compare `my_house` with 18

What does it do?
- It checks if each element in the my_house array is greater than or equal to 18.
- This comparison happens element by element.
Result:
- [18.0, 20.0, 10.75, 9.50] >= 18 results in [True, True, False, False].
Output:

[ True  True False False]

2. Compare `my_house` with `your_house`

What does it do?
- It checks if each element in the my_house array is less than the corresponding element in the your_house array.
- This comparison also happens element by element.
Result:
- [18.0, 20.0, 10.75, 9.50] < [14.0, 24.0, 14.25, 9.0] results in [False, True, True, False].
Output:

[False  True  True False]

Final Outputs

my_house >= 18:

[ True  True False False]

my_house < your_house:

[False  True  True False]

Key Points

Numpy allows you to compare arrays element by element using comparison operators like >=, <, ==, etc.
The result is a new array of True or False values.
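
Both comparisons as a runnable sketch, using the array values from the explanation above:

```python
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# Compare every element of my_house with 18
print(my_house >= 18)

# Compare the two arrays position by position
print(my_house < your_house)
```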

And, or, not

We are checking some conditions about the size of my_kitchen and your_kitchen. Here’s how it works:

1. Is `my_kitchen` bigger than 10 and smaller than 18?

What are we doing?
- We are asking if two things are both true:
  1. Is my_kitchen bigger than 10?
  2. Is my_kitchen smaller than 18?
- my_kitchen is 18.0.
Step-by-step check:
1. 18.0 > 10 → True
2. 18.0 < 18 → False (because 18 is not smaller than 18).
Result:
- With and, both conditions must be true. Since one is false, the answer is False.
Output: False

2. Is `my_kitchen` smaller than 14 or bigger than 17?

What are we doing?
- We are asking if at least one of these is true:
  1. Is my_kitchen smaller than 14?
  2. Is my_kitchen bigger than 17?
- my_kitchen is 18.0.
Step-by-step check:
1. 18.0 < 14 → False
2. 18.0 > 17 → True
Result:
- With or, only one condition needs to be true. Since the second one is true, the answer is True.
Output: True

3. Is double the size of `my_kitchen` smaller than triple the size of `your_kitchen`?

What are we doing?
- We are comparing 2 times the size of my_kitchen with 3 times the size of your_kitchen.
- my_kitchen is 18.0, so 2 * my_kitchen = 36.0.
- your_kitchen is 14.0, so 3 * your_kitchen = 42.0.
Step-by-step check:
- 36.0 < 42.0 → True
Result:
- The answer is True.
Output: True

Final Outputs

Is my_kitchen > 10 and < 18? → False
Is my_kitchen < 14 or > 17? → True
Is 2 * my_kitchen < 3 * your_kitchen? → True

Summary

and: Both conditions must be true.
or: At least one condition must be true.
You can do math in comparisons, like 2 * my_kitchen.
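
The three checks as runnable code, using the sizes given above (my_kitchen is 18.0 and your_kitchen is 14.0):

```python
my_kitchen = 18.0
your_kitchen = 14.0

# 1. and: both conditions must be true
print(my_kitchen > 10 and my_kitchen < 18)   # False

# 2. or: at least one condition must be true
print(my_kitchen < 14 or my_kitchen > 17)    # True

# 3. math inside a comparison: 36.0 < 42.0
print(my_kitchen * 2 < your_kitchen * 3)     # True
```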

We are using if statements to check two things:

1. Check if the `room` is “kit”

What happens here?
- The code checks if the value of room is "kit".
- If the condition is True, it will print: "looking around in the kitchen.".
In this case:
- room is "kit", so the condition is True.
- The message "looking around in the kitchen." is printed.

2. Check if the `area` is bigger than 15

What happens here?
- The code checks if the value of area is greater than 15.
- If the condition is True, it will print: "big place!".
In this case:
- area is 14.0, so the condition area > 15 is False.
- Because the condition is not true, nothing is printed.

Final Output

The first if statement prints: "looking around in the kitchen."
The second if statement prints nothing because the area is not greater than 15.

Summary

if statements only run the code inside them if the condition is true.
In this case:
- The room check was true, so the kitchen message was printed.
- The area check was false, so nothing was printed for that.
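
The two checks described above could be written roughly like this. The values of room and area are taken from the explanation; the exact course code may differ.

```python
room = "kit"
area = 14.0

# Runs only when room is exactly "kit"
if room == "kit":
    print("looking around in the kitchen.")

# 14.0 is not greater than 15, so nothing is printed here
if area > 15:
    print("big place!")
```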

add else

What does this code do?

It checks two things:

Which room you are in (room).
How big the area is (area).

It uses if-else statements to print different messages depending on the conditions.

Step 1: Check the room

What happens?
- If the room is "kit", it prints: "looking around in the kitchen.".
- If the room is not "kit", it prints: "looking around elsewhere.".
In this case:
- The room is "kit", so the first message is printed: "looking around in the kitchen.".

Step 2: Check the area size

What happens?
- If the area is bigger than 15, it prints: "big place!".
- If the area is 15 or smaller, it prints: "pretty small.".
In this case:
- The area is 14.0, so the second message is printed: "pretty small.".

Final Output

"looking around in the kitchen."
"pretty small."

Summary

The if-else statement lets the program choose what to print depending on the conditions.
First check: What is the room? (kitchen or elsewhere)
Second check: Is the area big or small?
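
A minimal sketch of the same two checks with else added, assuming the same values for room and area as above:

```python
room = "kit"
area = 14.0

# Exactly one of the two branches runs
if room == "kit":
    print("looking around in the kitchen.")
else:
    print("looking around elsewhere.")

# 14.0 is not bigger than 15, so the else branch runs
if area > 15:
    print("big place!")
else:
    print("pretty small.")
```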

Driving right

Here’s a simple explanation of the code, step by step:

1. Load the data

This loads a file called cars.csv into a variable called cars.
The index_col=0 means that the first column in the file will be used as row labels (like “US” or “AUS”).

2. Pick the column you need

The column drives_right tells us whether people drive on the right side of the road.
This creates a pandas Series (a single labeled column) of True and False values based on this column, stored in a variable called dr.

For example, it could look like this:

3. Filter the rows where `drives_right` is `True`

dr is like a filter. It keeps only the rows where the value is True.
For example:
- If True, the row is included (e.g., US drives on the right).
- If False, the row is excluded (e.g., Australia drives on the left).

4. Show the result

This shows the rows of the cars dataset where people drive on the right.

Example:

If your data looks like this:

country	drives_right	cars_per_cap
US	True	809
AUS	False	731
JPN	False	588
IN	False	18
RU	True	200

The result will show only the rows where drives_right is True:

country	drives_right	cars_per_cap
US	True	809
RU	True	200

It’s like asking Python: “Show me only the countries where people drive on the right!” 😊
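
The four steps can be sketched as follows. Because cars.csv is not available here, this example builds a small stand-in DataFrame from the example table above; the variable name sel is an assumption.

```python
import pandas as pd

# Stand-in for cars.csv, built from the example table above.
# With the real file this would be: cars = pd.read_csv("cars.csv", index_col=0)
cars = pd.DataFrame(
    {
        "drives_right": [True, False, False, False, True],
        "cars_per_cap": [809, 731, 588, 18, 200],
    },
    index=["US", "AUS", "JPN", "IN", "RU"],
)

dr = cars["drives_right"]  # Series of True/False values, one per country
sel = cars[dr]             # keep only the rows where dr is True

print(sel)
```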

Loop over a list

What does “loop over a list” mean?

A loop over a list means that you are telling the computer to go through each item in a list, one at a time, and do something with it. This is like looking at a box of toys and taking out each toy one by one to play with it.

Step-by-step explanation of what happens:

You have a list: For example:

areas = [11.25, 18.0, 20.0, 10.75, 9.50]

This is a list of numbers representing the sizes of different rooms.
You write a for loop: The loop will go through each number in the list, one at a time:

for area in areas:
- area is a placeholder (like a bucket) that will hold one number at a time from the list.
- The loop starts with the first item in the list (11.25) and moves to the next one (18.0), and so on.
You tell the computer what to do for each item: For example, you tell it to print the number:

print(area)

This means, for every number the loop goes through, print it.
The loop runs automatically:
- First, it takes 11.25 (the first item in the list) and prints it.
- Then it moves to 18.0, prints it.
- Then it moves to 20.0, prints it.
- It keeps going until it reaches the last item (9.50).
When the list is finished, the loop stops: Once all the items in the list have been processed (printed in this case), the loop stops automatically.

Final code:
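
Putting the pieces above together, a minimal version of the loop might look like this (reconstructed from the steps described, so not necessarily the exact course code):

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# area holds one value from the list on each pass
for area in areas:
    print(area)
```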

What happens when you run this code?

The computer prints:

11.25
18.0
20.0
10.75
9.5
The loop is finished because there are no more items in the list.

Indexes and values

What is a for loop?
- A for loop goes through a list, one item at a time, and does something with each item.
What is enumerate()?
- Normally, a for loop only gives you the item from the list.
- But if you use enumerate(), it gives you two things:
  1. The position of the item (called the index).
  2. The item itself.
Example:

areas = [11.25, 18.0, 20.0]
for index, area in enumerate(areas):
    print(index, area)

Output:

0 11.25
1 18.0
2 20.0
The task:
- We are given a list of areas (sizes of rooms).
- We want to print something like: “room 0: 11.25”, “room 1: 18.0”, etc.
Breaking down the code:

for index, area in enumerate(areas):
    print("room " + str(index) + ": " + str(area))
- index tells us which room number it is (e.g., 0, 1, 2, …).
- area tells us the size of the room (e.g., 11.25, 18.0, …).
- str(index) and str(area) convert the numbers into text, because + can only join text with text.
What happens in each loop?
- First time: index = 0, area = 11.25. Prints: room 0: 11.25.
- Second time: index = 1, area = 18.0. Prints: room 1: 18.0.
- And so on…
Final result: The computer prints:

room 0: 11.25
room 1: 18.0
room 2: 20.0
room 3: 10.75
room 4: 9.5

It’s like giving each room in your house a label (room 0, room 1, etc.) and saying how big it is
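
Here is a small runnable sketch of the room labels with all five areas. The labels list is a made-up helper, added only so the result can be inspected afterwards.

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

labels = []  # hypothetical helper list, not part of the exercise
for index, area in enumerate(areas):
    # str() turns each number into text so + can join the pieces
    label = "room " + str(index) + ": " + str(area)
    labels.append(label)
    print(label)
```

If the labels should start at room 1 instead of room 0, enumerate(areas, start=1) does that.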

Loop over list of lists

Goal:

We want to look at each room in the house and print a sentence that says: “the [room name] is [room area] sqm”

Step-by-step explanation:

Understand the house list:
- house is a list of lists.
- Each small list inside house has two items:
  - The name of the room (like “hallway”).
  - The area of the room (like 11.25).
Example of one small list: ["hallway", 11.25].

Start a for loop to go through each small list:
- for room in house: means:
  - “Go through every small list in the big house list.”
  - On the first loop, room will be ["hallway", 11.25].
  - On the second loop, room will be ["kitchen", 18.0], and so on.

Pick the room name and area:
- room[0] gives the first item of the small list (the room name).
  - Example: If room = ["hallway", 11.25], then room[0] = "hallway".
- room[1] gives the second item of the small list (the room area).
  - Example: If room = ["hallway", 11.25], then room[1] = 11.25.

Create a sentence with the room details:
- Use print(f"the {name} is {area} sqm").
- The {name} and {area} will be replaced by the actual room name and area.
  - Example: If name = "hallway" and area = 11.25, the output will be: “the hallway is 11.25 sqm”.

Do this for every room:
- The loop will repeat for every small list in house.
- Each time, it prints a sentence about a different room.

Final Code:

Output:

This code will print:

Summary:

Use a for loop to go through each room.
Get the room name and area from each small list.
Print a sentence about the room.

It’s like saying:

“I’ll take the first room, check its name and area, and then print the details.”
“Now, I’ll move to the next room and do the same!”
Repeat until all rooms are done!
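
The steps above can be sketched like this. Only the hallway and kitchen entries appear in the explanation; the other room names and the sentences list are assumptions added for illustration.

```python
# hallway and kitchen come from the text; the other names are assumed
house = [
    ["hallway", 11.25],
    ["kitchen", 18.0],
    ["living room", 20.0],
    ["bedroom", 10.75],
    ["bathroom", 9.50],
]

sentences = []  # hypothetical helper list
for room in house:
    name = room[0]  # first item of the small list
    area = room[1]  # second item of the small list
    sentence = f"the {name} is {area} sqm"
    sentences.append(sentence)
    print(sentence)
```

The first two lines printed are "the hallway is 11.25 sqm" and "the kitchen is 18.0 sqm".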

Loop over dictionary

Task: Loop over a dictionary

We want to go through each country and its capital in the europe dictionary and print something like:

“The capital of spain is madrid”

Step-by-Step:

What is a dictionary?
- A dictionary in Python is like a list, but instead of looking items up by their position, it connects keys to values.
- Example:
  - Key: "spain"
  - Value: "madrid"
- In the europe dictionary, the key is the name of a country (like "spain") and the value is the capital of that country (like "madrid").

How do we loop through a dictionary?
- We use the .items() method. This allows us to go through both the key and the value at the same time.
- for country, capital in europe.items() means:
  - Each key (country) will go into the variable country.
  - Each value (capital) will go into the variable capital.

What does the loop do?
- The loop takes one pair of key and value at a time.
- For example:
  - On the first run, country = "spain" and capital = "madrid".
  - On the second run, country = "france" and capital = "paris".
  - And so on, until it goes through all the pairs.

What does the print statement do?
- print(f"The capital of {country} is {capital}") creates a sentence using the country and capital.
- The f before the quotes allows us to insert variables directly into the string.
- Example:
  - If country = "spain" and capital = "madrid", the output will be:
    
    The capital of spain is madrid

What happens when you run the code?
- The loop will go through every key-value pair in the dictionary.
- It will print one sentence for each pair.
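
A minimal sketch of this loop, assuming a shortened europe dictionary. The spain and france entries come from the explanation above; germany and the lines list are added for illustration.

```python
# spain and france are from the text; germany is an assumed extra entry
europe = {"spain": "madrid", "france": "paris", "germany": "berlin"}

lines = []  # hypothetical helper list
for country, capital in europe.items():
    # .items() hands over one key-value pair per pass
    line = f"The capital of {country} is {capital}"
    lines.append(line)
    print(line)
```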

Challenges

Challenges I encountered along the way

Below, I’ve documented the challenges I encountered while working on Python for the very first time. As a complete beginner, I not only explored basic Python concepts but also struggled with setting up Visual Studio Code to work seamlessly with IPython and Jupyter Notebook on my MacBook. It was a learning process filled with trial and error, but each challenge helped me better understand how to navigate this new programming environment.

Using commas versus parentheses
Installing Visual Studio Code on Mac
Importing NumPy
Why do you need type and 'double' parentheses (()) in print()?
Working correctly with the terminal

Understanding when to use commas , vs parentheses ( ) in Python

While working on my first Python ‘project’, which involved sorting a list with full_sorted = sorted(full, reverse=True), I found myself second-guessing whether to use a comma , or parentheses () in my code. It wasn’t immediately clear when each one is required, and this small confusion made me pause and rethink my approach. If you’ve ever faced a similar dilemma, let’s break it down in simple terms so it’s easy to remember when to use each one.

The short answer:

Use commas to separate things (like a shopping list).
Use parentheses to group things or tell Python to “do something” (like run a function or prioritize math).

In this specific code

In the provided code:

Parentheses ():
Parentheses are used to call a function. In this example, sorted() is a function, and the parentheses indicate that you are asking Python to execute it. Inside the parentheses, you specify the inputs (called arguments) for the function.
- Example: sorted(fruit_prices, reverse=True)
  The function sorted() takes the list fruit_prices and performs an action (sorting). The parentheses enclose the details of what the function needs to do its job.
Comma ,:
Commas are used to separate multiple arguments within the parentheses. In this example, sorted() needs two arguments:
- The list to sort: fruit_prices
- An additional setting: reverse=True (to sort in descending order).
The comma separates these two pieces of information inside the function call.

Key idea:

Parentheses group everything the function needs to execute, like its inputs or settings.
Commas separate the individual pieces of information (arguments) within those parentheses.

Why it’s used here:

Parentheses are required to call the sorted() function.
The comma is needed because you’re providing more than one argument (the list fruit_prices and the reverse=True setting).
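
To make this concrete, here is a small sketch of the call discussed above. The prices in fruit_prices are invented for the example.

```python
# Hypothetical prices, only for illustration
fruit_prices = [1.20, 0.50, 2.75, 0.99]

# Parentheses call sorted(); the comma separates its two arguments
sorted_prices = sorted(fruit_prices, reverse=True)

print(sorted_prices)  # [2.75, 1.2, 0.99, 0.5]
print(fruit_prices)   # the original list is left unchanged
```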

When to use comma

A comma is like saying “AND” in a sentence. You use it to separate different items.

Examples:

In a list:

fruits = ["apple", "banana", "cherry"]
- The commas separate the items: apple, banana, and cherry.
In a function with multiple inputs:

print("Hello", "world")
- The commas separate the words “Hello” and “world” that you want to print.

When to use parentheses

Parentheses are like containers. They group things together or make something work.

Examples:

When calling a function:

print("Hello")
- The parentheses tell Python to “run” the print() function and include “Hello”.
When grouping in math:

result = (2 + 3) * 4
- The parentheses make sure Python adds 2 + 3 first before multiplying by 4.
For a tuple (a fixed group of items):

numbers = (1, 2, 3)
- The parentheses keep these numbers grouped together.
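
The examples above can be tried side by side. The names no_group, single and not_a_tuple are made up for this sketch, which also shows that for a tuple it is really the comma that does the work.

```python
fruits = ["apple", "banana", "cherry"]  # commas separate list items
print("Hello", "world")                 # commas separate arguments: Hello world

result = (2 + 3) * 4   # parentheses group: 5 * 4
no_group = 2 + 3 * 4   # without them, the multiplication runs first

numbers = (1, 2, 3)    # a tuple
single = (5,)          # a one-item tuple needs a trailing comma
not_a_tuple = (5)      # just the number 5 inside grouping parentheses

print(result, no_group)                 # 20 14
print(type(single), type(not_a_tuple))  # <class 'tuple'> <class 'int'>
```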

Installing Python and Visual Studio Code on macOS

As someone who was completely new to Python, I faced challenges right from the start, including setting everything up correctly. Even installing the necessary tools felt overwhelming at first. I struggled with ensuring that everything was in the right place, from Python itself to Visual Studio Code and Jupyter Notebook. At one point, I wasn’t even sure if I had installed the programs correctly.

I turned to ChatGPT for guidance and discovered that the installation was actually fine, but on a Mac, some programs needed to be launched differently than I expected. With ChatGPT’s help, I learned how to properly call the applications.

IPython required a specific command in the terminal. Instead of just typing “IPython,” I had to use the command python3 -m IPython. This small detail made all the difference and resolved my confusion.

Importing NumPy

I had some trouble installing and correctly using NumPy (probably because I went a little too fast, since it was described perfectly later in the course). I can’t fully explain how I eventually managed to get the installation working, but I fixed the last issue by running pip install numpy and source .venv/bin/activate in the terminal. After that, the code worked:

and I got the desired output to work with arrays.

Why do you need type and ‘double’ parentheses (()) in print()?

I realized that I completed this assignment correctly because I had seen it before in the course, but I didn’t truly understand why we use type and the double parentheses (). I asked ChatGPT to explain it to me, and now I can confidently say that I fully understand when and why to use them.

In Python, print() is a function. To use a function, you need to call it with parentheses. The parentheses are where you put what you want to print.

For example:

In your code:

type(np_baseball) finds out the type of np_baseball.
The print() function displays that type on the screen.

Without the parentheses, Python does not run the function: in Python 3, writing print type(np_baseball) is a syntax error.

Why do you use `type()`?

The type() function tells you the data type of a variable or object.

For example:

If you use type([1, 2, 3]), Python will tell you it’s a list.
If you use type(np.array([1, 2, 3])), Python will tell you it’s a numpy.ndarray.

In your code:

This checks if np_baseball is really a numpy array (type: numpy.ndarray).

What happens without `type()`?

If you write this:

It will print the contents of the array, not the type.

If you want to check the type (to confirm the variable is a numpy array), you need type().

Summary:

Parentheses are needed to call the print() function.
type() is needed to find out what type of object np_baseball is.
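
A short sketch of the difference, assuming made-up values for np_baseball since the real data is not shown here. The inner call type(...) runs first, and its result is handed to print(...), which is why two pairs of parentheses appear.

```python
import numpy as np

# Hypothetical values; the real np_baseball data is not shown in this journal
np_baseball = np.array([180, 215, 210, 188])

print(np_baseball)        # prints the contents: [180 215 210 188]
print(type(np_baseball))  # prints the type: <class 'numpy.ndarray'>
print(type([1, 2, 3]))    # <class 'list'>
```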

Working correctly with the terminal. Questions for the discussion meeting

Every time I work with the terminal, it doesn’t go smoothly, and when it does work, it is by accident. So for the discussion meeting I have the following questions:

I always struggle with using the terminal. For example, when I need to install or load NumPy, I eventually manage to get it working, but I couldn’t explain exactly how I did it afterward. How can I make this process clearer and more structured? I think it’s because the commands are different on macOS, but I’m not sure.
Is it possible to use Matplotlib in Jupyter Notebook, or can it only be used in a .py file? Again, struggling with the terminal.

CONCLUSION

What I found most challenging overall was truly understanding what I was doing. It’s easy to quickly replicate things as you see them in examples, but I noticed I had to go back a few times to really grasp it. This will likely improve over time, especially when I have to write the code for the final assignment entirely from scratch.

When to use what

When to use NumPy 2D Arrays, Dictionaries, or Pandas in Python?

NumPy (2D Arrays)

Use NumPy when:

You’re working with numerical data and need to perform fast mathematical or statistical calculations.
Your dataset has a fixed format (e.g., a matrix, table, or grid) where all rows have the same length and all columns have the same data type.
Performance is a priority: NumPy is optimized for speed and efficiency.

Advantages:

Extremely fast for numerical computations.
Ideal for mathematical operations like matrix multiplication or element-wise calculations.

Disadvantages:

Less flexible: all elements in a NumPy array must have the same data type.
No built-in labels or metadata like column names, making interpretation harder.

Dictionaries

Use dictionaries when:

Your data is structured as key-value pairs (e.g., names and ages, products and prices).
You need quick access to specific values based on unique keys.
You’re working with a simple data structure without requiring complex analysis.

Advantages:

Very flexible and easy to use.
Great for small datasets or unstructured data.
No requirements for uniform data types.

Disadvantages:

Not suitable for advanced data analysis.
No built-in functions for handling large datasets or numerical computations.

Pandas (data frames)

Use Pandas when:

You’re working with tabular data (like an Excel sheet or database).
You need column names, row labels, or metadata to make the data more interpretable.
You want to perform advanced operations like filtering, grouping, or summarizing.
Your data comes from external files (e.g., CSV, Excel, SQL databases).

Advantages:

Flexible and user-friendly for structured data.
Built-in functions for data analysis, such as groupby, merge, and pivot.
Supports heterogeneous data types (e.g., strings, numbers, and dates in the same table).

Disadvantages:

Less efficient than NumPy for pure numerical calculations.
Can become slower with very large datasets.

NumPy: Choose for pure numerical data and high-speed calculations.

Dictionaries: Best for simple, small datasets organized as key-value pairs.

Pandas: Go-to choice for tabular or spreadsheet-like data and for most data analysis tasks.

When in doubt, Pandas is a great starting point because it combines flexibility with powerful functionality for structured data.
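
As a rough illustration of the three options, here is the same kind of small data handled each way. All values and variable names are reused from earlier journal examples or made up for this sketch.

```python
import numpy as np
import pandas as pd

# NumPy 2D array: numbers only, no labels, fast math
sizes = np.array([[11.25, 18.0], [20.0, 10.75]])
average_size = sizes.mean()
print(average_size)

# Dictionary: look up a value by its key
capitals = {"spain": "madrid", "france": "paris"}
print(capitals["spain"])

# pandas DataFrame: labeled table with mixed column types
rooms = pd.DataFrame({"room": ["hallway", "kitchen"], "area": [11.25, 18.0]})
big_rooms = rooms[rooms["area"] > 15]
print(big_rooms)
```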

Data transformation in Python
dec 4. 2024

This capstone project focuses on data manipulation and visualization using Python, with a dataset containing sales information (order IDs, customer names, shipping details, product IDs, quantities, and total prices). Demonstrate Python skills by efficiently working with data and generating insightful visualizations.

Steps to Complete:

Load and Enrich Data
Visualize Data
Manipulate Data
Create Insights
Optional Exploratory Data Analysis (EDA)

Step by step explanation

Step-by-step Data transformation

I began by breaking down the steps in Visual Studio Code, using comments (lines starting with a hashtag, #) to organize the assignment into smaller, manageable tasks. This approach helped me work through it in baby steps and gain a clear understanding of what I was doing.

The simplified explanations were generated with ChatGPT.

Full capstone code Print output

STEP 1 LOAD AND ENRICH DATA

1.1 import pandas

This line brings a powerful tool called pandas into your Python code.
Pandas helps you work with data, like tables or spreadsheets. You can use it to load, explore, and modify data easily.
The as pd part is like giving pandas a nickname. Instead of writing “pandas” every time, you can just write “pd” to save time.

If pandas were a box of tools for working with data, this line is like opening the box and putting it on your desk so you can use it whenever you need.

1.2 specify the file path

Here, we are creating a variable called file_path. Think of a variable as a container that holds a value.
In this case, the value inside the container is the file path to an Excel file called i2c.xlsx.
The path /Users/macbook/Python/i2c.xlsx is a string (text) that tells Python the exact location of this file on your computer.

Imagine you are telling Python where to find a book in a library. The “file path” is like saying: “The book is in the Python section, on the MacBook shelf, and its name is i2c.xlsx.”

1.3 Load the specific sheet directly into a DataFrame

df =
- We are creating a variable called df (short for DataFrame). It will hold the data that we load from the file.
- Think of df as a container for your table of data.
pd.read_excel()
- This is a function from pandas (remember, we gave pandas the nickname pd).
- The function reads data from an Excel file and loads it into a format that pandas understands (a DataFrame).
file_path
- The file_path variable (which we defined earlier) tells the function where to find the Excel file.
sheet_name='i2c.csv'
- This specifies which sheet (or tab) in the Excel file to load.
- Here, it’s trying to load a sheet named i2c.csv.

Imagine you have a big Excel workbook with many sheets (like different pages). This line says:

“Open the Excel file located at file_path, go to the page/tab named i2c.csv, and load that data into a neat table (df) so I can work with it in Python.”
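
Steps 1.1 to 1.3 can be sketched together as below. The path and sheet name are the ones mentioned above. The fallback DataFrame with invented rows is an assumption, added only so the sketch still runs on a computer where the file does not exist.

```python
from pathlib import Path

import pandas as pd

file_path = "/Users/macbook/Python/i2c.xlsx"

if Path(file_path).exists():
    # Reading .xlsx files needs an Excel engine such as openpyxl installed
    df = pd.read_excel(file_path, sheet_name="i2c.csv")
else:
    # Fallback with made-up rows so the example runs without the file
    df = pd.DataFrame(
        {"orderid": [101, 102], "price": [50.0, 30.0], "units": [2, 1]}
    )

print(df.head())
```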

1.4 Inspect the loaded DataFrame

You should use this code because it helps you understand and validate your dataset before you start working with it.

First Line:
print(df.head()) # print dataframe first 5 rows

df.head(): This shows the first 5 rows of the DataFrame. It’s useful for quickly seeing what your data looks like: the columns, the sample data, and the overall structure.
print(): Displays the output in the console so you can see the result.

Why? This allows you to quickly check what your dataset looks like:
- Are the columns named correctly?
- Are the values what you expect?
- Does it look clean or messy?
Example: Imagine you expect the dataset to have columns like “Name” and “Age,” but you see “Unnamed: 0” or unexpected data values. This gives you a chance to fix issues early.

Second Line:
print(df.info()) # print dataframe columns and datatypes

df.info(): This gives a summary of the DataFrame, including:
- How many rows and columns it has.
- The names of the columns.
- The types of data in each column (e.g., numbers, text, etc.).
- Whether any column has missing values.
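print(): Again, this shows the output in the console. Note that df.info() already prints its summary and returns None, so wrapping it in print() also prints an extra line reading None.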

Why? This gives a high-level summary of your dataset:
- How many rows and columns do you have?
- What kind of data is in each column (e.g., text, numbers, dates)?
- Are there any missing values?
Example: If a column meant to store numbers (e.g., prices) is stored as text, you’ll need to fix it before doing calculations.

Third Line:
print(df.isnull().sum())  # print the number of missing values per column

df.isnull(): This checks every cell in the DataFrame to see if it’s empty or missing.
- It returns True if a value is missing and False otherwise.
.sum(): Adds up all the True values for each column, giving the total number of missing entries per column.
print(): Displays the missing value counts in the console.

Why? Missing data can cause errors or unexpected results in your analysis. This code shows you how many values are missing in each column so you can decide what to do:
- Should you remove rows with missing data?
- Should you fill in missing values with a default (e.g., average or median)?
Example: If 50% of a column is missing, it may be better to drop it altogether rather than fill in guesses.
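
The three checks can be tried on a tiny made-up table. The column names follow the capstone data; the values, including the missing price, are invented for this sketch.

```python
import pandas as pd

# Made-up sample with one missing price
df = pd.DataFrame(
    {
        "orderid": [101, 102, 103],
        "price": [50.0, None, 20.0],
        "units": [2, 1, 4],
    }
)

print(df.head())  # first rows (up to 5)
df.info()         # prints its own summary, so no print() is needed

missing = df.isnull().sum()  # number of missing values per column
print(missing)
```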

1.5 Add the column unit price which equals price divided by units

Imagine you have a spreadsheet with two columns: Total Price (e.g., €10) and Units Sold (e.g., 2). This code calculates the Price per Unit (e.g., €10 ÷ 2 = €5) and adds it as a new column so you can see it for all your data.

print(df.columns)

print(df.columns)  # check if the columns price and units exist

df.columns: This gives you a list of all the column names in your DataFrame.
Why? Before creating the new column, you check if the required columns (price and units) are present in your data. If they’re missing, the calculation won’t work.
print(): Displays the list of column names in the console for you to verify.

Adding a New Column:

df['unit_price'] = df['price'] / df['units']  # add column unit_price

df['unit_price']:
- This creates a new column in your DataFrame called unit_price.
- Each row in this column will contain the result of the calculation.
df['price'] / df['units']:
- Here, you divide the value in the price column by the value in the units column for each row.
- Pandas automatically applies this operation to all rows in the DataFrame.
Why? This is useful if you want to know the price per unit for every product in your dataset.

print(df.head())

print(df.head())  # print the first rows

df.head(): This shows the first 5 rows of the DataFrame, including the new column unit_price.
Why? To verify that the new column has been added correctly and that the values make sense.

Check Your Columns: print(df.columns) ensures you are working with the correct data.
- If price or units are missing, you can troubleshoot before running the calculation.
Add Useful Information: Calculating the unit_price helps you analyze your data better. For example:
- If you’re managing a store, you can see how much each item costs per unit.
- If you’re analyzing products, you can find items with the best value.
Verify Your Work: Printing the updated DataFrame (df.head()) helps you confirm that your changes were applied correctly.
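
A minimal sketch of step 1.5 on invented data. The first row matches the €10 ÷ 2 example above; the set-based column check is one possible way to verify the columns before dividing, not necessarily how the capstone code does it.

```python
import pandas as pd

# Made-up rows; the first one matches the 10 / 2 = 5 example
df = pd.DataFrame({"price": [10.0, 7.5, 12.0], "units": [2, 3, 4]})

print(df.columns)  # check that price and units exist

# Only divide when both columns are really there
if {"price", "units"}.issubset(df.columns):
    df["unit_price"] = df["price"] / df["units"]

print(df.head())
```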

1.6 Create the dictionary prices, which for each product contains its price, and the dictionary customers, which for each customer contains a list of all order IDs associated with the customer

`prices` Dictionary:

df.groupby('product'): Groups the data by the product column. This means all rows with the same product are grouped together.
['unit_price']: Focuses on the unit_price column for each product group.
.mean(): Calculates the average (mean) unit price for each product group.
.to_dict(): Converts the grouped and averaged data into a dictionary, where:
- The key is the product name.
- The value is the average unit price.

Why? This creates a dictionary of average prices for each product, making it easy to look up a product’s price.

`customers` Dictionary:

df.groupby('address'): Groups the data by the address column. This means all rows with the same address (customer) are grouped together.
['orderid']: Focuses on the orderid column for each customer group.
.apply(list): Converts all the order IDs for each customer group into a list.
.to_dict(): Converts the grouped data into a dictionary, where:
- The key is the address (customer).
- The value is a list of all order IDs for that customer.

Why? This creates a dictionary showing all the orders made by each customer.

Printing the Results:

These lines display the dictionaries:
- prices: Shows the product-to-price mapping.
- customers: Shows the customer-to-orders mapping.

Why Should You Use This Code?

For Easy Access to Information:
- The prices dictionary lets you quickly find the average price of any product.
- The customers dictionary allows you to see all the orders placed by a specific customer.
For Data Analysis:
- You can analyze trends, such as which customers order frequently or which products have higher/lower prices.
For Automation:
- Once you have the dictionaries, you can automate tasks like generating invoices, customer summaries, or product reports.

Example:

Imagine your dataset has the following:

product	unit_price	address	orderid
Apple	2.50	123 Street A	1
Apple	2.00	123 Street A	2
Banana	1.20	456 Street B	3
Apple	2.30	789 Street C	4
Banana	1.10	123 Street A	5

prices (average unit price per product, rounded to two decimals):

{'Apple': 2.27, 'Banana': 1.15}

customers:

{'123 Street A': [1, 2, 5], '456 Street B': [3], '789 Street C': [4]}
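
The example can be reproduced with the chain of calls described above. This sketch rebuilds the example table by hand. Note that the unrounded average for Apple has more decimals than 2.27, so the prices are rounded only for printing.

```python
import pandas as pd

# The example table from above
df = pd.DataFrame(
    {
        "product": ["Apple", "Apple", "Banana", "Apple", "Banana"],
        "unit_price": [2.50, 2.00, 1.20, 2.30, 1.10],
        "address": [
            "123 Street A",
            "123 Street A",
            "456 Street B",
            "789 Street C",
            "123 Street A",
        ],
        "orderid": [1, 2, 3, 4, 5],
    }
)

# Average unit price per product
prices = df.groupby("product")["unit_price"].mean().to_dict()
# List of order IDs per address (customer)
customers = df.groupby("address")["orderid"].apply(list).to_dict()

print({product: round(price, 2) for product, price in prices.items()})
print(customers)
```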

1.7 Define the function get_ordertotal which takes an order ID as input and returns the total order value. If no matching order is found it should print “No such order.”

Defining the Function:

def: This keyword defines a new function in Python.
get_ordertotal(order_id): The function is called get_ordertotal.
- It takes one input, order_id, which is expected to be the ID of an order you want to look up.
- This is like saying, “Tell me which order to find, and I’ll calculate its total price.”

Filtering and Summing Up the Total:

df[df['orderid'] == order_id]:
- This filters the DataFrame (df) to include only the rows where the column orderid matches the given order_id.
['price']: After filtering, it selects the price column from those rows.
.sum(): Adds up all the values in the price column for the filtered rows.
total: This variable stores the result, which is the total value of the order.

Checking if the Order Exists:

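if total == 0:
- Checks if the total is zero. When no rows match the order ID, the sum of the empty selection is 0, so a zero total is treated as “order not found”. (This assumes that no real order has a total of exactly 0.)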
print("No such order."):
- If the total is zero, it prints a message saying the order doesn’t exist.
return None:
- Stops the function and returns None (no value) since there’s no matching order.

Returning the Total:

If the total is not zero (meaning the order was found), the function returns the total value of the order.

Example Usage:

Let’s say your DataFrame (df) contains the following data:

orderid	price
101	50
102	30
101	20

Calling the function:

get_ordertotal(101)
- The function filters rows with orderid == 101.
- It calculates the sum of the price column: 50 + 20 = 70.
- Output: 70.
If the order ID doesn’t exist:

get_ordertotal(999)
- The function finds no rows matching orderid == 999.
- The total is 0.
- Output: Prints “No such order.” and returns None.

Why Should You Use This Code?

Easily Calculate Total Order Values:
- You can quickly compute the total price for any given order ID.
Error Handling:
- It handles cases where the order ID doesn’t exist, providing a clear message instead of crashing.
Automation:
- This is a reusable function you can call anytime you need the total value of an order, without manually filtering or summing.
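
Based on the description above, the function might look like this. It is a sketch using the small example table, and it reads from a DataFrame called df defined outside the function.

```python
import pandas as pd

# The example data from above
df = pd.DataFrame({"orderid": [101, 102, 101], "price": [50, 30, 20]})


def get_ordertotal(order_id):
    """Return the total price of one order, or None if it is not found."""
    # Keep only the rows of this order, then add up their prices
    total = df[df["orderid"] == order_id]["price"].sum()
    if total == 0:
        print("No such order.")
        return None
    return total


print(get_ordertotal(101))  # 70
print(get_ordertotal(999))  # prints "No such order." and then None
```

A variation that checks whether the filtered rows are empty (with the .empty attribute) instead of comparing the total with 0 would also handle an order whose total really is 0.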

1.8 Create the DataFrame orders with the columns customer, orderid and ordertotal.

This small piece of Python code creates an empty list and prints it to the screen. An empty list is like an empty box where you can store things later.

Step-by-Step Explanation:

Comments (#):
- The text in green after # is called a “comment.”
- Comments are ignored by Python; they are only for humans to read and understand the code.
- Example:
  - #1.8 Create the DataFrame orders... → This comment tells you what the code is trying to achieve.
Create an Empty List:

order_details = []
- Here, a variable called order_details is created.
- The = sign means we are assigning a value to the variable.
- The [] represents an empty list.
  Lists in Python can store multiple values, like numbers or words, but right now this one is empty.
Print the List:

print(order_details)
- print() is a function that displays output on the screen.
- Here, it prints whatever is stored in the order_details list.
- Since the list is empty, it will show:
  
  []

Summary for Beginners:

A list is like a box where you can store items.
[] creates an empty list.
print() shows the content of the list.
Comments (#) explain what the code does, but Python ignores them.

When you run this code, the result will simply be:

[]

1.9 Loop through the 'customers' dictionary

This code goes through a dictionary of customers and their orders, calculates the total for each order, and stores the results in a list called order_details.

Step-by-Step Explanation:

Loop Through the Customers:

for customer, order_ids in customers.items():
- customers.items() gives you both the key (customer name) and the value (a list of order IDs) from the dictionary customers.
- customer stores the customer name (key).
- order_ids stores a list of their orders (value).
- Example:
  
  customers = { "Alice": [101, 102], "Bob": [103] }
  
  Here:
  - customer = "Alice"
  - order_ids = [101, 102]
Loop Through the Order IDs:

for order_id in order_ids:
- This inner loop goes through each order ID in the order_ids list.
Calculate the Order Total:

order_total = get_ordertotal(order_id)
- get_ordertotal(order_id) is the function defined in step 1.7. It calculates the total price for an order and returns None when no matching order is found.
- The result is saved in the variable order_total.
Check if the Order Total Exists:

if order_total is not None:
- This line checks if order_total has a value (it is not None).
- If order_total is valid, the code continues.
Add the Order Details to the List:

order_details.append({ "customer": customer, "orderid": order_id, "ordertotal": order_total })
- order_details.append() adds a dictionary to the order_details list.
- The dictionary includes:
  - "customer": The customer’s name.
  - "orderid": The order ID.
  - "ordertotal": The total price of the order.

What Happens in the Code:

Go through each customer and their list of orders.
For each order ID, calculate the total order price.
If the total price is valid (not None), save the customer, order ID, and total price in the order_details list.
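
To try the loop without the real dataset, here is a minimal sketch. The customers dictionary and the order_totals lookup are made-up sample data, and this get_ordertotal is only a simplified stand-in for the function from step 1.7 (it reads from a dictionary instead of the DataFrame).

```python
# Made-up sample data (assumption): customer -> list of order IDs
customers = {"Alice": [101, 102], "Bob": [103, 999]}

# Hypothetical stand-in for the DataFrame: order ID -> total price
order_totals = {101: 20.5, 102: 35.0, 103: 15.0}


def get_ordertotal(order_id):
    """Simplified stand-in: return the order total, or None if unknown."""
    return order_totals.get(order_id)


order_details = []
for customer, order_ids in customers.items():
    for order_id in order_ids:
        order_total = get_ordertotal(order_id)
        if order_total is not None:  # skip orders that were not found
            order_details.append(
                {"customer": customer, "orderid": order_id, "ordertotal": order_total}
            )

for entry in order_details:
    print(entry)
```

Order 999 has no total in the sample data, so it is skipped and only three dictionaries end up in the list.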

1.10 Create a DataFrame from the list of dictionaries

What the Code Does:

This line of code takes the order_details list (which contains dictionaries) and converts it into a DataFrame using the pandas library. A DataFrame is like a table, similar to Excel or a database.

Step-by-Step Explanation:

Comment:

#1.10 Create a DataFrame from the list of dictionaries
- The comment tells us the goal: to create a table (DataFrame) from a list of dictionaries.
Create the DataFrame:

orders = pd.DataFrame(order_details)
- pd.DataFrame() is a function from the pandas library.
- order_details is a list of dictionaries. Each dictionary represents a row of data.
- The keys in the dictionary (e.g., "customer", "orderid", "ordertotal") become the column names in the DataFrame.
- The values in each dictionary become the rows.
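
A small sketch of this conversion, using three made-up rows instead of the real order data, shows how the dictionary keys turn into column names:

```python
import pandas as pd

# Made-up sample rows (assumption), one dictionary per order
order_details = [
    {"customer": "Alice", "orderid": 101, "ordertotal": 20.5},
    {"customer": "Alice", "orderid": 102, "ordertotal": 35.0},
    {"customer": "Bob", "orderid": 103, "ordertotal": 15.0},
]

orders = pd.DataFrame(order_details)

print(orders.columns.tolist())  # column names come from the dictionary keys
print(orders.shape)             # (number of rows, number of columns)
print(orders.head())            # fewer than 5 rows, so all rows are shown
```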

1.11 Display the resulting DataFrame

What the Code Does:

This line displays the first 5 rows of the orders DataFrame using the head() function from the pandas library.

Step-by-Step Explanation:

Comment:

#1.11 Display the resulting DataFrame
- This comment explains that the code will display part of the DataFrame.
Display the DataFrame:

print(orders.head())
- orders.head() is a function that shows the first 5 rows of the DataFrame named orders.
- If there are fewer than 5 rows, it will show all the rows.
Why head()?:
- The head() function is often used to quickly check if the DataFrame looks correct after creating or modifying it.
- It only prints the first few rows to avoid cluttering the output with a large table.

Example Output:

If the orders DataFrame contains the following data:

customer	orderid	ordertotal
Alice	101	20.5
Alice	102	35.0
Bob	103	15.0

Because this table has fewer than 5 rows, orders.head() will return all three rows, each preceded by its row index (0, 1, 2).

Summary for Beginners:

orders.head() shows the first 5 rows of the DataFrame.
print() displays the result on the screen.
This is helpful to check if the table looks correct.

1.12 Define the function print_order which takes an order ID as input and prints the order information.

What the Code Does:

This code defines a function print_order() that takes an order ID as input and prints detailed information about the order, including customer details, product information, and the total order price.

Step-by-Step Explanation:

Define the Function:

def print_order(order_id):
- def is used to define a function in Python.
- print_order is the name of the function.
- order_id is a parameter (input) that the function expects.
Filter the DataFrame:

order = df[df['order_id'] == order_id]
- This line filters the DataFrame df to find rows where the order_id matches the input.
- The filtered result is stored in the variable order.
Check if the Order Exists:

if order.empty:
    print("No such order.")
    return
- .empty checks if the filtered order DataFrame has no rows.
- If no matching rows are found, it prints “No such order.” and stops the function using return.
Print Customer and Order Details:
- Get Customer Details:
  
  customer = order.iloc[0]
  - .iloc[0] selects the first row of the filtered DataFrame.
- Print Customer Address:
  
  print(f"{customer['name']}\n{customer['street']} {customer['street_no']}\n")
  - f"" is used for string formatting, which allows you to insert variables like customer['name'] into the string.
  - \n starts a new line, so the name appears on one line and the address on the next.
  - This prints the customer’s name, street, and street number.
- Print Header:
  
  print(f"Order No. {order_id}\n")
  print("Product\tQuantity\tUnit Price\tTotal Price")
  - Prints the order ID and a table header for the product details. \t inserts a tab between the column names.
Loop Through Each Row of the Order:

for _, row in order.iterrows():
    print(f"{row['description']}\t{row['quantity']}\t{row['unit_price']}\t{row['price']}")
- order.iterrows() goes through each row in the DataFrame.
- _ means we are ignoring the row index.
- row holds the data for the current row.
- It prints the product description, quantity, unit_price, and price.
Print the Order Total:

print(f"\nOrder Total: {order['price'].sum()}")
- order['price'].sum() calculates the total price for the entire order by summing up the price column.

Example Walkthrough:

Imagine the DataFrame df looks like this:

order_id	name	street	street_no	description	quantity	unit_price	price
101	Alice	Main St	123	Apples	2	3.00	6.00
101	Alice	Main St	123	Oranges	1	2.50	2.50
102	Bob	2nd Ave	456	Bananas	3	1.00	3.00

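For example, if you call print_order(101), the function prints Alice’s name and address (Main St 123), the order number, one line each for Apples and Oranges, and an order total of 8.5 (6.00 + 2.50).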

Summary for Beginners:

print_order() is a function to display order details.
It filters the DataFrame for a specific order ID.
It checks if the order exists, prints customer information, and loops through product details.
Finally, it calculates and prints the total price of the order.
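
The walkthrough can be reproduced with the sketch below. The DataFrame holds the three example rows from the table above. Returning the total at the end is an addition in this sketch (the original function only prints), so the result can be checked afterwards.

```python
import pandas as pd

# Sample data copied from the example table
df = pd.DataFrame({
    "order_id": [101, 101, 102],
    "name": ["Alice", "Alice", "Bob"],
    "street": ["Main St", "Main St", "2nd Ave"],
    "street_no": [123, 123, 456],
    "description": ["Apples", "Oranges", "Bananas"],
    "quantity": [2, 1, 3],
    "unit_price": [3.00, 2.50, 1.00],
    "price": [6.00, 2.50, 3.00],
})


def print_order(order_id):
    order = df[df["order_id"] == order_id]
    if order.empty:
        print("No such order.")
        return None
    customer = order.iloc[0]
    print(f"{customer['name']}\n{customer['street']} {customer['street_no']}\n")
    print(f"Order No. {order_id}\n")
    print("Product\tQuantity\tUnit Price\tTotal Price")
    for _, row in order.iterrows():
        print(f"{row['description']}\t{row['quantity']}\t{row['unit_price']}\t{row['price']}")
    total = order["price"].sum()
    print(f"\nOrder Total: {total}")
    return total  # returning the total is an addition in this sketch


total_101 = print_order(101)  # existing order
missing = print_order(999)    # unknown order, prints "No such order."
```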

STEP 2 VISUALIZE YOUR DATA

2.1 Prepare the data

What the Code Does:

This code prepares a DataFrame named df for visualization by renaming columns, converting a column to datetime format, and extracting the month from the date.

Step-by-Step Explanation:

Comment:

# 2.1 Prepare the data
- This indicates that the code is preparing the data before creating visualizations.
Rename Columns:

df.rename(columns={'order_id': 'orderid', 'quantity': 'units', 'price': 'price'}, inplace=True)
- df.rename() changes the names of specified columns in the DataFrame:
  - 'order_id' → 'orderid'
  - 'quantity' → 'units'
  - 'price' → 'price' (this maps the column to its own name, so it has no effect and could be left out).
- inplace=True ensures that the changes are applied directly to the original DataFrame without needing to create a new one.
Convert the ‘date’ Column to a Datetime Format:

df['date'] = pd.to_datetime(df['date'], errors='coerce')
- pd.to_datetime() converts the values in the date column to a proper datetime format.
- errors='coerce' means that if a value cannot be converted (e.g., invalid date), it will be replaced with NaT (Not a Time).
Extract the Month from the Date:

df['month'] = df['date'].dt.to_period('M')
- df['date'].dt.to_period('M') extracts the month and year from the date column and stores it in a new column called 'month'.
- 'M' specifies that the result should be in monthly periods (e.g., “2024-06”).

Example Walkthrough:

Suppose the original DataFrame looks like this:

order_id	quantity	price	date
101	2	20.5	2024-06-15
102	1	35.0	2024-07-02
103	3	15.0	invalid_date

After running the code:

Column names are changed:
- 'order_id' → 'orderid'
- 'quantity' → 'units'.
The date column is converted:
- "2024-06-15" stays as 2024-06-15.
- "invalid_date" becomes NaT (Not a Time).
A new column month is added:

orderid	units	price	date	month
101	2	20.5	2024-06-15	2024-06
102	1	35.0	2024-07-02	2024-07
103	3	15.0	NaT	NaT

Summary for Beginners:

rename(): Change column names.
to_datetime(): Convert a column to a proper date format.
to_period('M'): Extract the month and year from dates.

These steps are commonly used to prepare data for analysis or visualization, such as grouping by month.
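
Here is a minimal sketch of the three preparation steps on the example rows. It assigns the result of rename() back to df instead of using inplace=True. That is just a choice made for this sketch, and both approaches rename the columns.

```python
import pandas as pd

# Example rows, including one value that is not a valid date
df = pd.DataFrame({
    "order_id": [101, 102, 103],
    "quantity": [2, 1, 3],
    "price": [20.5, 35.0, 15.0],
    "date": ["2024-06-15", "2024-07-02", "invalid_date"],
})

# Rename only the columns that actually change
df = df.rename(columns={"order_id": "orderid", "quantity": "units"})

# Unparseable values become NaT instead of raising an error
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["month"] = df["date"].dt.to_period("M")

print(df)
print("Rows with an invalid date:", df["date"].isna().sum())
```

Counting the NaT values after the conversion is a quick way to see how many dates could not be parsed.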

2.2 Group data

What the Code Does:

This section groups and summarizes data in the DataFrame df using the pandas library. It calculates totals, counts, and sums for specific columns to extract insights.

Step-by-Step Explanation:

Items Per Order:

items_per_order = df.groupby('orderid')['units'].sum()
- groupby('orderid') groups all rows that have the same orderid.
- ['units'].sum() calculates the total number of units for each order.
- Result: A summary showing the total units per order.
Total Value Per Order:

total_value_per_order = df.groupby('orderid')['price'].sum()
- Groups data by orderid and calculates the sum of price for each order.
- Result: Total revenue (price) for each order.
Orders Per Month:

orders_per_month = df.groupby('month')['orderid'].nunique()
- Groups data by month (previously extracted).
- nunique() counts the unique order IDs per month.
- Result: Number of unique orders per month
Daily Revenue:

daily_revenue = df.groupby('date')['price'].sum()
- Groups data by date and calculates the total price for each day.
- Result: Total daily revenue.
Weekly Orders and Revenue:

weekly_orders_revenue = df.resample('W', on='date').agg({'orderid': 'nunique', 'price': 'sum'})
- resample('W', on='date'): Resamples data weekly based on the date column.
- agg({'orderid': 'nunique', 'price': 'sum'}): Aggregates:
  - orderid: Counts the number of unique orders.
  - price: Calculates the total revenue.
- Result: A table showing the weekly total orders and revenue.
Units Per Product:

units_per_product = df.groupby('product_id')['units'].sum()
- Groups data by product_id and sums the units column.
- Result: Total units sold for each product.

Example Output (Simplified):

Imagine the following data in df:

orderid	product_id	units	price	date	month
101	A	2	20.0	2024-06-15	2024-06
101	B	1	10.0	2024-06-15	2024-06
102	A	3	30.0	2024-06-20	2024-06
103	C	4	40.0	2024-07-05	2024-07

Results:

items_per_order:

orderid	units
101	3
102	3
103	4

total_value_per_order:

orderid	price
101	30.0
102	30.0
103	40.0

orders_per_month:

month	orderid
2024-06	2
2024-07	1

daily_revenue:

date	price
2024-06-15	30.0
2024-06-20	30.0
2024-07-05	40.0

weekly_orders_revenue (weeks end on Sunday; the week ending 2024-06-30 has no orders, but resample still lists it):

date	orderid	price
2024-06-16	1	30.0
2024-06-23	1	30.0
2024-06-30	0	0.0
2024-07-07	1	40.0

units_per_product:

product_id	units
A	5
B	1
C	4

Summary for Beginners:

groupby() groups data by a specific column and calculates summaries like sum() or nunique().
resample() is used for time-based grouping (e.g., weekly or monthly).
agg() applies multiple functions at once.
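
The sketch below runs a few of these groupings on the four example rows, so the numbers can be checked by hand. The DataFrame is built directly in the code rather than loaded from the Excel file.

```python
import pandas as pd

df = pd.DataFrame({
    "orderid": [101, 101, 102, 103],
    "product_id": ["A", "B", "A", "C"],
    "units": [2, 1, 3, 4],
    "price": [20.0, 10.0, 30.0, 40.0],
    "date": pd.to_datetime(["2024-06-15", "2024-06-15", "2024-06-20", "2024-07-05"]),
})
df["month"] = df["date"].dt.to_period("M")

items_per_order = df.groupby("orderid")["units"].sum()
orders_per_month = df.groupby("month")["orderid"].nunique()
units_per_product = df.groupby("product_id")["units"].sum()

# resample creates one row per calendar week, including weeks without orders
weekly = df.resample("W", on="date").agg({"orderid": "nunique", "price": "sum"})

print(items_per_order)
print(orders_per_month)
print(units_per_product)
print(weekly)
```

One difference worth noticing: groupby() only returns groups that exist in the data, while resample() also returns empty weeks in between.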

2.3 Define custom colors

What the Code Does:

This snippet defines custom colors using RGB values for use in data visualizations, such as plots or charts.

Step-by-Step Explanation:

Custom Colors:

custom_color = (90/255, 200/255, 190/255)
- Colors in many visualization libraries (like Matplotlib) are defined using RGB values.
- Each color component (Red, Green, Blue) must be between 0 and 1.
- To convert standard RGB values (0–255 range) to this format, divide each value by 255.
- Here:
  - Red: $90/255 \approx 0.353$
  - Green: $200/255 \approx 0.784$
  - Blue: $190/255 \approx 0.745$
Edge Colors:

edge_color = (75/255, 160/255, 150/255)
- Similarly, this defines another color for edges of shapes or bars in a chart.
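
The division by 255 can be wrapped in a small helper. The function name rgb_to_unit is made up for this sketch and is not part of Matplotlib or of the capstone code.

```python
def rgb_to_unit(red, green, blue):
    """Convert 0-255 RGB values to the 0-1 floats Matplotlib accepts."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError(f"RGB values must be between 0 and 255, got {value}")
    return (red / 255, green / 255, blue / 255)


custom_color = rgb_to_unit(90, 200, 190)
edge_color = rgb_to_unit(75, 160, 150)

print([round(c, 3) for c in custom_color])
print([round(c, 3) for c in edge_color])
```

The range check is there because a value above 255 would give a component larger than 1, which is not a valid color component.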

2.4 Individual visualizations

What the Code Does: Histogram for Items Per Order

This specific part of the code creates a histogram that visualizes the distribution of the number of items per order using Matplotlib.

Step-by-Step Explanation:

Create a Figure:

plt.figure(figsize=(10, 6))
- plt.figure() initializes a new plot.
- figsize=(10, 6) sets the size of the figure to 10 inches wide and 6 inches high.
Plot a Histogram:

plt.hist(items_per_order, bins=10, color=custom_color, edgecolor=edge_color, linewidth=0.5)
- plt.hist() creates a histogram:
  - items_per_order: The data being plotted (number of items per order).
  - bins=10: Divides the data into 10 equal ranges (or bins).
  - color=custom_color: Sets the bar color using the previously defined RGB values.
  - edgecolor=edge_color: Sets the color for the edges of the bars.
  - linewidth=0.5: Adjusts the thickness of the bar edges.
Add a Title:

plt.title('ITEMS PER ORDER', fontweight='bold')
- Adds a title to the plot.
- fontweight='bold' makes the title text bold.
Label the X and Y Axes:

plt.xlabel('NUMBER OF ITEMS')
plt.ylabel('FREQUENCY')
- xlabel: Adds a label to the x-axis (horizontal axis) indicating it represents the “Number of Items.”
- ylabel: Adds a label to the y-axis (vertical axis) indicating it shows the “Frequency” (how often each range appears).
Display the Plot:

plt.show()
- plt.show() displays the histogram.

What the Histogram Shows:

The x-axis (NUMBER OF ITEMS) represents the number of items per order (divided into 10 bins).
The y-axis (FREQUENCY) shows how many orders fall into each bin.

For example:

If many orders have between 1 and 5 items, the bar for that range will be tall.
If fewer orders have 10+ items, those bars will be shorter.
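
To see what bins=10 does with the data, the sketch below uses NumPy’s histogram function on ten made-up order sizes. To my understanding, plt.hist() relies on the same NumPy function for the binning, so the counts should match the bar heights in the plot.

```python
import numpy as np

# Made-up item counts for ten orders (assumption)
items_per_order = [1, 2, 2, 3, 3, 3, 4, 5, 8, 11]

# 10 bins of equal width between the smallest and the largest value
counts, bin_edges = np.histogram(items_per_order, bins=10)

for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {count} order(s)")
```

Ten bins need eleven edges, and every order lands in exactly one bin, so the counts add up to the number of orders.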

STEP 3 DATA MANIPULATION

3.1 Simulate order cancellation

What the Code Does: Simulate Order Cancellation

This part of the code demonstrates how to delete a specific row from a DataFrame by simulating a customer canceling an order.

Step-by-Step Explanation:

Define the Order to Cancel:

cancel_order_id = 351278 # Example order ID to cancel
- A variable cancel_order_id is created with a specific value (e.g., 351278).
- This represents the order ID of the order you want to cancel.
Check If the Order Exists:

if cancel_order_id in df['orderid'].values:
- df['orderid'].values returns a NumPy array of all values in the orderid column.
- if cancel_order_id in ... checks if the cancel_order_id exists in that array.
- If the order ID exists, the condition is True, and the code inside the if block runs.
Delete the Row:

df = df[df['orderid'] != cancel_order_id]
- This line filters the DataFrame df to exclude rows where the orderid equals cancel_order_id.
- df['orderid'] != cancel_order_id creates a Boolean condition (True/False for each row).
- The filtered DataFrame (without the canceled order) replaces the original df.
Print Success Message:

print(f"Order {cancel_order_id} has been successfully canceled.")
- Prints a confirmation message that the order has been removed.
Handle Case Where Order Does Not Exist:

else:
    print("Order not found.")
- If the order ID is not in the orderid column, this message is printed.

Example Walkthrough:

Suppose the original DataFrame looks like this:

orderid	product	price
351278	Apples	5.0
351279	Oranges	7.0
351280	Bananas	3.0

Input: cancel_order_id = 351278

Code Execution:

The condition if cancel_order_id in df['orderid'].values is True because 351278 exists.
The line df = df[df['orderid'] != cancel_order_id] removes the row with orderid 351278.

Updated DataFrame:

orderid	product	price
351279	Oranges	7.0
351280	Bananas	3.0

Output: Order 351278 has been successfully canceled.

If you try to cancel an invalid order ID (e.g., 351999), the output will be: Order not found.

Summary for Beginners:

Check for Existence: Verify if the order exists using in.
Delete Rows: Use a filter like df = df[df['orderid'] != value] to remove rows.
Feedback: Print messages to confirm whether the order was canceled or not.
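
The same idea can be written as a reusable function. The name cancel_order and the (DataFrame, found) return value are choices made for this sketch and are not part of the capstone code.

```python
import pandas as pd


def cancel_order(df, order_id):
    """Return (new_df, found). Hypothetical helper, not part of the capstone code."""
    matches = df["orderid"] == order_id
    if not matches.any():
        return df, False
    # Keep every row that does NOT match, and renumber the row index
    return df[~matches].reset_index(drop=True), True


df = pd.DataFrame({
    "orderid": [351278, 351279, 351280],
    "product": ["Apples", "Oranges", "Bananas"],
    "price": [5.0, 7.0, 3.0],
})

df, found = cancel_order(df, 351278)
print("Order canceled." if found else "Order not found.")

df, found_again = cancel_order(df, 351999)
print("Order canceled." if found_again else "Order not found.")
print(df)
```

Returning a flag instead of printing inside the function lets the caller decide which message to show.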

STEP 4 VISUALIZATIONS

4.1 Craft visualizations that offer insights into the dataset, such as the popularity of products and the diversity of products per order.

What the Code Does: Bar Plot for Product Popularity

This code creates a bar chart that shows the popularity of products by counting the number of times each product ID appears in the dataset. It uses the value_counts() function to aggregate data and Matplotlib to visualize it.

Step-by-Step Explanation:

Calculate Product Popularity:

product_popularity = df['product_id'].value_counts()
- df['product_id']: Accesses the column containing product IDs.
- .value_counts(): Counts the occurrences (frequency) of each unique product ID.
- The result is stored in product_popularity as a Series where:
  - The index contains unique product IDs.
  - The values represent the number of times each product appears.
Example: If product_id contains:

['A', 'B', 'A', 'C', 'A', 'B']

The result of value_counts() is:

A	3
B	2
C	1
Set the Figure Size:

plt.figure(figsize=(10, 6))
- Creates a new plot with a size of 10×6 inches.
Create a Bar Chart:

plt.bar(product_popularity.index.astype(str), product_popularity, color=custom_color, edgecolor=edge_color, linewidth=0.5)
- plt.bar() generates a bar chart:
  - product_popularity.index.astype(str): Converts product IDs to strings so they display properly on the x-axis.
  - product_popularity: The heights of the bars (number of orders for each product).
  - color=custom_color: Fills the bars with the defined custom color.
  - edgecolor=edge_color: Sets the edge color for the bars.
  - linewidth=0.5: Specifies the thickness of the bar edges.
Add Titles and Labels:

plt.title("PRODUCT POPULARITY", fontweight='bold')
plt.xlabel('PRODUCT ID')
plt.ylabel('NUMBER OF ORDERS')
- plt.title(): Adds a bold title to the chart.
- plt.xlabel(): Labels the x-axis as “PRODUCT ID.”
- plt.ylabel(): Labels the y-axis as “NUMBER OF ORDERS.”
Rotate the X-Axis Labels:

plt.xticks(rotation=45)
- plt.xticks() rotates the labels on the x-axis by 45 degrees for better readability.
Display the Chart:

plt.show()
- plt.show() renders and displays the bar chart.

What the Bar Chart Represents:

The x-axis: Product IDs (e.g., “A”, “B”, “C”).
The y-axis: The number of orders (frequency of each product ID).
Each bar shows how popular a product is by its order count.
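
A short sketch of value_counts() on the six example product IDs. The normalize=True variant is an extra that is not used in the capstone code. It returns shares instead of counts.

```python
import pandas as pd

product_ids = pd.Series(["A", "B", "A", "C", "A", "B"], name="product_id")

# Counts per product, sorted from most to least frequent
product_popularity = product_ids.value_counts()
print(product_popularity)

# Share of all rows per product, as a fraction between 0 and 1
print(product_ids.value_counts(normalize=True).round(2))
```

Note that value_counts() counts rows, so a product that appears on two lines of the same order is counted twice.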

4.2 Bar plot for diversity of products per order

What the Code Does: Bar Plot for Diversity of Products Per Order

Although the task calls it a bar plot, this code creates a histogram. It shows how many orders contain a given number of unique products, helping you analyze the diversity of products per order.

Step-by-Step Explanation:

Calculate Product Diversity Per Order:

order_diversity = df.groupby('orderid')['product_id'].nunique()
- df.groupby('orderid'): Groups the data by orderid (each order).
- ['product_id'].nunique(): Counts the number of unique products (product_id) for each order.
- The result, order_diversity, is a Series:
  - The index is the orderid.
  - The values are the count of unique products in each order.
Example: Suppose the data looks like this:

orderid	product_id
1	A
1	B
2	A
2	A
3	C
3	D
3	E

After grouping and counting unique products:

orderid	unique product count
1	2
2	1
3	3
Set Up the Plot:

plt.figure(figsize=(10, 6))
- Creates a new figure with a size of 10×6 inches.
Create a Histogram:

plt.hist(order_diversity, bins=10, color=custom_color, edgecolor=edge_color, linewidth=0.5)
- plt.hist() creates a histogram:
  - order_diversity: The data showing the count of unique products per order.
  - bins=10: Divides the range of values into 10 bins (ranges).
  - color=custom_color: Fills the bars with the custom color defined earlier.
  - edgecolor=edge_color: Sets the color for the edges of the bars.
  - linewidth=0.5: Defines the edge thickness.
Add Titles and Labels:

plt.title("PRODUCT DIVERSITY PER ORDER", fontweight='bold')
plt.xlabel('NUMBER OF UNIQUE PRODUCTS')
plt.ylabel('FREQUENCY')
- plt.title(): Adds a bold title.
- plt.xlabel(): Labels the x-axis to show it represents the “Number of Unique Products”.
- plt.ylabel(): Labels the y-axis as “Frequency” (number of orders).
Display the Plot:

plt.show()
- Displays the histogram.

What the Histogram Shows:

X-Axis: The number of unique products in an order (e.g., 1 product, 2 products, etc.).
Y-Axis: The number of orders that contain that many unique products.

Example Output:

If seven orders have the unique product counts 1, 2, 2, 2, 3, 3 and 4:

The histogram might show:

X-Axis (Unique Products)	Y-Axis (Frequency)
1	1
2	3
3	2
4	1

Summary for Beginners:

groupby() and nunique() calculate the number of unique products per order.
plt.hist() creates a histogram to show how product diversity varies across orders.
Titles and axis labels make the chart clear and easy to understand.
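
The grouping step can be checked on the seven example rows (orders 1, 2 and 3). The second step, counting how many orders share the same diversity, is an extra added in this sketch to show what the histogram bars represent.

```python
import pandas as pd

df = pd.DataFrame({
    "orderid": [1, 1, 2, 2, 3, 3, 3],
    "product_id": ["A", "B", "A", "A", "C", "D", "E"],
})

# Number of different products in each order
order_diversity = df.groupby("orderid")["product_id"].nunique()
print(order_diversity)

# How many orders have 1, 2, 3, ... unique products (the bar heights)
diversity_counts = order_diversity.value_counts().sort_index()
print(diversity_counts)
```

Order 2 contains product A twice, but nunique() counts it only once.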

STEP 5 OPTIONAL EXPLORATORY DATA ANALYSIS (EDA)

5.1 Bar plot for orders by city

State wasn’t in the CSV file, so I chose city, which doesn’t look very visually appealing. :’)

What the Code Does: Bar Plot for Revenue by City

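This code, labelled 5.2 in the full script, creates a bar plot that visualizes the total revenue for each city by grouping and summing up the revenue data. The orders-by-city plot (5.1) is built the same way, using nunique() on orderid instead of sum() on price.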

Step-by-Step Explanation:

Calculate Total Revenue by City:

revenue_by_city = df.groupby('city')['price'].sum()
- df.groupby('city'): Groups the data by unique values in the city column.
- ['price'].sum(): For each city, it sums up the values in the price column, which represents the total revenue.
- The result is stored in revenue_by_city:
  - Index: City names.
  - Values: Total revenue per city.
Example:
If the data looks like this:

city	price
Amsterdam	100
Berlin	200
Amsterdam	150
Berlin	50
London	300

The result of groupby('city')['price'].sum() will be:

city	total revenue
Amsterdam	250
Berlin	250
London	300
Set the Figure Size:

plt.figure(figsize=(20, 7))
- Creates a new figure with dimensions 20 inches wide and 7 inches high. This ensures the plot is large and readable.
Create the Bar Chart:

plt.bar(revenue_by_city.index.astype(str), revenue_by_city, color=custom_color, edgecolor=edge_color, linewidth=0.5)
- plt.bar() generates the bar chart:
  - revenue_by_city.index.astype(str): Converts city names (index) to strings to display properly on the x-axis.
  - revenue_by_city: The heights of the bars represent the total revenue for each city.
  - color=custom_color: Fills the bars with the defined custom color.
  - edgecolor=edge_color: Sets the edge color for the bars.
  - linewidth=0.5: Specifies the thickness of the bar edges.
Add Titles and Labels:

plt.title("TOTAL REVENUE BY CITY", fontweight='bold')
plt.xlabel('CITY', fontsize=7)
plt.ylabel('TOTAL REVENUE')
- plt.title(): Adds a bold title to the chart.
- plt.xlabel(): Labels the x-axis as “CITY” with a font size of 7.
- plt.ylabel(): Labels the y-axis as “TOTAL REVENUE”.
Rotate the X-Axis Labels:

plt.xticks(rotation=45)
- Rotates the city names on the x-axis by 45 degrees to make them easier to read.
Display the Chart:

plt.show()
- Renders and displays the bar plot.

What the Bar Chart Represents:

X-Axis: City names.
Y-Axis: Total revenue for each city.
Each bar represents the total revenue generated in a particular city.

Example Output:

For the example cities and revenue above, the chart would show three bars: Amsterdam and Berlin at 250 each, and London at 300.

Summary for Beginners:

groupby() and sum() calculate the total revenue for each city.
plt.bar() creates a bar plot.
The titles and axis labels make the chart easy to understand, while xticks(rotation=45) improves readability.
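
Both city groupings can be tried on the example rows. The orderid column is made up for this sketch, and sorting the result is an optional extra that may make a chart with many cities easier to read.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Amsterdam", "Berlin", "Amsterdam", "Berlin", "London"],
    "orderid": [1, 2, 3, 2, 4],  # made-up order IDs (assumption)
    "price": [100, 200, 150, 50, 300],
})

revenue_by_city = df.groupby("city")["price"].sum()
orders_by_city = df.groupby("city")["orderid"].nunique()

# Optional: sort so the tallest bars come first in a chart
print(revenue_by_city.sort_values(ascending=False))
print(orders_by_city)
```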

Capstone: Data Transformation in Python: the code

#STEP 1 LOAD AND ENRICH DATA

#1.1 import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

#1.2 specify the file path
file_path = "/Users/macbook/Python/i2c.xlsx"

#1.3 Load the specific sheet directly into a DataFrame.
df = pd.read_excel(file_path, sheet_name='i2c.csv')

#1.4 Inspect the dataset for missing entries and understand its structure.
print(df.head())  # print dataframe first 5 rows
print(df.info())  # print dataframe columns and datatypes
print(df.isnull().sum())  # missing values in dataframe
# No missing values were found in any column.

#1.5 Add the column unit price which equals price divided by quantity.
print(df.columns)  # check if the columns price and quantity exist
# Here, pandas automatically calculates the value for each row and adds a new column to the DataFrame.
df['unit_price'] = df['price'] / df['quantity']  # add column unit_price
print(df.head())  # print the first rows

#1.6 Create the dictionary prices which for each product contains its price and the dictionary customers which for each customer contains a list of all order IDs associated to the customer.
prices = df.groupby('product_id')['unit_price'].mean().to_dict()  # creates the dictionary prices
customers = df.groupby('customer_id')['order_id'].apply(list).to_dict()  # creates customers dictionary
print("Prices Dictionary:", prices)  # print prices dictionary
print("Customers Dictionary:", customers)  # print customers dictionary

#1.7 Define the function get_ordertotal which takes an order ID as input and returns the total order value. If no matching order is found it should print “No such order.”.
# Note: this function and print_order use the original column names, so call them before the rename in step 2.1.
def get_ordertotal(order_id):
    # Filter rows with the given order ID and calculate the total value
    total = df[df['order_id'] == order_id]['price'].sum()
    if total == 0:  # Check if the total value is 0
        print("No such order.")
        return None
    return total

#1.8 Create the DataFrame orders with the columns customer, orderid and ordertotal.
# Display the first few rows of the orders DataFrame
order_details = []  # Creates an empty list to store order details
print(order_details)

#1.9 Loop through the 'customers' dictionary
for customer, order_ids in customers.items():
    for order_id in order_ids:
        order_total = get_ordertotal(order_id)
        if order_total is not None:
            order_details.append({
                "customer": customer,
                "orderid": order_id,
                "ordertotal": order_total
            })

#1.10 Create a DataFrame from the list of dictionaries
orders = pd.DataFrame(order_details)

#1.11 Display the resulting DataFrame
print(orders.head())

#1.12 Define the function print_order which takes an order ID as input and prints the order information.
def print_order(order_id):
    order = df[df['order_id'] == order_id]  # Filter the DataFrame for the given order ID
    if order.empty:
        print("No such order.")
        return
    # Print customer and order details
    customer = order.iloc[0]
    print(f"{customer['name']}\n{customer['street']} {customer['street_no']}\n")
    print(f"Order No. {order_id}\n")
    print("Product\tQuantity\tUnit Price\tTotal Price")
    for _, row in order.iterrows():
        print(f"{row['description']}\t{row['quantity']}\t{row['unit_price']}\t{row['price']}")
    print(f"\nOrder Total: {order['price'].sum()}")

#STEP 2 VISUALIZE YOUR DATA

# 2.1 Prepare the data
df.rename(columns={'order_id': 'orderid', 'quantity': 'units', 'price': 'price'}, inplace=True)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['month'] = df['date'].dt.to_period('M')

# 2.2 Group data
items_per_order = df.groupby('orderid')['units'].sum()
total_value_per_order = df.groupby('orderid')['price'].sum()
orders_per_month = df.groupby('month')['orderid'].nunique()
daily_revenue = df.groupby('date')['price'].sum()
weekly_orders_revenue = df.resample('W', on='date').agg({'orderid': 'nunique', 'price': 'sum'})
units_per_product = df.groupby('product_id')['units'].sum()

# 2.3 Define custom colors
custom_color = (90/255, 200/255, 190/255)
edge_color = (75/255, 160/255, 150/255)

#2.4 Individual visualizations

# Histogram for items per order
plt.figure(figsize=(10, 6))
plt.hist(items_per_order, bins=10, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title('ITEMS PER ORDER', fontweight='bold')
plt.xlabel('NUMBER OF ITEMS')
plt.ylabel('FREQUENCY')
plt.show()

# Histogram for total value per order
plt.figure(figsize=(10, 6))
plt.hist(total_value_per_order, bins=10, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("TOTAL VALUE PER ORDER", fontweight='bold')
plt.xlabel('ORDER TOTAL VALUE')
plt.ylabel('FREQUENCY')
plt.show()

# Bar plot for orders per month
plt.figure(figsize=(10, 6))
plt.bar(orders_per_month.index.astype(str), orders_per_month, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("VOLUME OF ORDERS PER MONTH", fontweight='bold')
plt.xlabel('MONTH')
plt.ylabel('NUMBER OF ORDERS')
plt.xticks(rotation=45)
plt.show()

# Line plot for daily revenue
plt.figure(figsize=(10, 6))
plt.plot(daily_revenue.index, daily_revenue, color=custom_color)
plt.title("DAILY REVENUE TRENDS", fontweight='bold')
plt.xlabel('DATE')
plt.ylabel('TOTAL REVENUE')
plt.xticks(rotation=45)
plt.show()

# Scatter plot for weekly orders vs revenue
plt.figure(figsize=(10, 6))
plt.scatter(weekly_orders_revenue['orderid'], weekly_orders_revenue['price'], color=custom_color, edgecolor=edge_color)
plt.title("WEEKLY ORDERS VS REVENUE", fontweight='bold')
plt.xlabel('NUMBER OF ORDERS')
plt.ylabel('TOTAL REVENUE')
plt.show()

# Bar plot for total units sold per product
plt.figure(figsize=(10, 6))
plt.bar(units_per_product.index.astype(str), units_per_product, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("TOTAL UNITS SOLD PER PRODUCT", fontweight='bold')
plt.xlabel('PRODUCT ID')
plt.ylabel('TOTAL UNITS SOLD')
plt.xticks(rotation=45)
plt.show()

#STEP 3 DATA MANIPULATION
#Address a real-world scenario where a customer cancels an order. Practice manually deleting and modifying rows to reflect this change accurately.

#3.1 Simulate order cancellation
cancel_order_id = 351278  # Example order ID to cancel
if cancel_order_id in df['orderid'].values:
    df = df[df['orderid'] != cancel_order_id]
    print(f"Order {cancel_order_id} has been successfully canceled.")
else:
    print("Order not found.")

#STEP 4 VISUALIZATIONS
#4.1 Craft visualizations that offer insights into the dataset, such as the popularity of products and the diversity of products per order.

# Bar plot for product popularity
product_popularity = df['product_id'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(product_popularity.index.astype(str), product_popularity, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("PRODUCT POPULARITY", fontweight='bold')
plt.xlabel('PRODUCT ID')
plt.ylabel('NUMBER OF ORDERS')
plt.xticks(rotation=45)
plt.show()

#4.2 Bar plot for diversity of products per order
order_diversity = df.groupby('orderid')['product_id'].nunique()
plt.figure(figsize=(10, 6))
plt.hist(order_diversity, bins=10, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("PRODUCT DIVERSITY PER ORDER", fontweight='bold')
plt.xlabel('NUMBER OF UNIQUE PRODUCTS')
plt.ylabel('FREQUENCY')
plt.show()

#STEP 5 OPTIONAL EXPLORATORY DATA ANALYSIS (EDA)
#5.1 Go beyond the basics by exploring data relationships and patterns. For example, create visualizations showing orders and total order volumes by state, uncovering regional market trends.

# 5.1 Bar plot for orders by city
orders_by_city = df.groupby('city')['orderid'].nunique()
plt.figure(figsize=(20, 7))
plt.bar(orders_by_city.index.astype(str), orders_by_city, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("ORDERS BY CITY", fontweight='bold')
plt.xlabel('CITY', fontsize=7)
plt.ylabel('NUMBER OF ORDERS')
plt.xticks(rotation=45)
plt.show()

# 5.2 Bar plot for revenue by city
revenue_by_city = df.groupby('city')['price'].sum()
plt.figure(figsize=(20, 7))
plt.bar(revenue_by_city.index.astype(str), revenue_by_city, color=custom_color, edgecolor=edge_color, linewidth=0.5)
plt.title("TOTAL REVENUE BY CITY", fontweight='bold')
plt.xlabel('CITY', fontsize=7)
plt.ylabel('TOTAL REVENUE')
plt.xticks(rotation=45)
plt.show()

VISUALIZATIONS

Print output data visualizations

Histogram to display the number of items per order and another for the total value of each order

Volume of orders per month using a bar plot.

A line plot to illustrate daily revenue trends.

A scatter plot to explore the relationship between weekly orders and their corresponding revenue.

A bar plot of the total number of units sold per product.

Conclusion

Learning Python has been an exciting challenge. I feel like I now understand the basic principles, though it will still take some time before I can write code from scratch without guidance. However, I’ve learned where to find helpful resources, and I can solve most issues I encounter along the way.

For me, the next step is to start applying what I’ve learned to real-life projects, which is the way I learn best. I want to start building a sales dashboard for our startup in Power BI, and I’m curious to see how far I can integrate Python into the process (I have no experience with that yet either). It will be interesting to explore where Python can complement Power BI and enhance our data analysis capabilities.

Lastly, after some trial and error, I successfully installed Anaconda instead of Visual Studio.

Get in touch

contact

Eindhoven, Netherlands.

info@esthervanhelmont.nl


