WEEK1 - Organizing data to begin analysis
[Definition]
-Filtering: Showing only the data that meets a specific criteria while hiding the rest
-Sorting: When you arrange data into a meaningful order to make it easier to understand, analyze, and visualize
- Sort sheet: All of the data in a spreadsheet is sorted by the ranking of a specific sorted column - data across rows is kept together
- Sort range: Nothing else on the spreadsheet is rearranged besides the specified cells in a column
- Customized sort order: When you sort data in a spreadsheet using multiple conditions
[SORT function]
-Sheets
=SORT(range, sort_column, is_ascending, [sort_column2, is_ascending2, ...])
-SQL
SELECT *
FROM movie_table.movies
ORDER BY Release_Date DESC
[Filter function]
-Sheets
=FILTER(range, condition1, [condition2, ...])-SQL
-SQL
SELECT *
FROM movie_table.movies
WHERE Genre = 'Comedy' AND Revenue > 30000000
ORDER BY Release_Date DESC
[The 4 phases of analysis]
1. Organize data
2. Format and adjust data
3. Get input from others
4. Transform data
The goal of analysis is to identify trends and relationships within data
so you can accurately answer the question you're asking
WEEK 2 - Formatting and adjusting data
[Definition]
-Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions
[CONVERT function]
=CONVERT(value, start_unit, end_unit)
+) Available unit conversions
https://support.google.com/docs/answer/6055540?hl=en
CONVERT - Google Docs Editors Help
Converts a numeric value to a different unit of measure. Sample Usage CONVERT(5.1, "g", "kg") CONVERT(32, "C", "F") CONVERT(35.7, "in^2", "m^2") CONVERT(A1, A2, A3) Syntax CONVERT(value, start_unit, end_unit) value - the numeric value in start_unit to conv
support.google.com
[Data validation]
- Add dropdown lists with predetermined options
- Create custom checkboxes
- Protect structured data and formulas
[Google sheets function list]
https://support.google.com/docs/table/25273?hl=en
Google Sheets function list - Google Docs Editors Help
LogicalLETLET(name1, value_expression1, [name2, …], [value_expression2, …], formula_expression )Assigns name with the value_expression results and returns the result of the formula_expression. The formula_expression can use the names defined in the sco
support.google.com
WEEK 3 - Aggregating data for analysis
[Definition]
-Aggregation: Collecting or gathering many separate pieces into a whole
-Data aggregation: The process of gathering data from multiple sources in order to combine it into a single summarized collection
-Subquery: A query within another query
-Absolute reference: A reference that is locked so that rows and columns won't change when copied
Before applying VALUE function, check these.
1) FORMAT data if needed
2) TRIM extra spaces
3) REMOVE duplicates
[VALUE function]
A function that converts a text string that represents a number to a numerical value
=VALUE(text)
[JOIN function]
A SQL clause that is used to combine rows from two or more tables based on a related column
- INNER JOIN: A function that returns records with matching values in both tables
- LEFT / RIGHT JOIN: A function that will return all the records from the left/right table and only the matching records from the right/left table
- OUTER JOIN: A function that combines LEFT and RIGHT JOIN to return all matching records in both tables
SELECT
employees.name AS employee_name,
employees.role AS employee_role,
departments.name AS department_name
FROM
employees
LEFT JOIN
departments
ON employees.department_id = departments.department_id
※ You can set aliases for tables by specifying the alias for the table after the table's name in FROM and/or JOIN statements.
https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/
SQL Join Types Explained in Visuals
Data within a database exists across multiple tables, JOINs allow you to combine datasets into new tables for analysis. Learn more.
dataschool.com
[Subqueries]
-Question: What percentage of the orders are fulfilled by each warehouse?
SELECT
Warehouse.warehouse_id,
CONCAT(Warehouse.state, ': ', Warehouse.warehouse_alias) AS warehouse_name,
COUNT(Orders.order_id) AS number_of_orders,
(SELECT COUNT(*) FROM warehouse_orders.Orders) AS total_orders,
CASE
WHEN COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) <= 0.20
THEN "Fulfilled 0-20% of Orders"
WHEN COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) > 0.20
AND COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) <= 0.60
THEN "Fulfilled 21-60% of Orders"
ELSE "Fulfilled more than 60% of Orders"
END AS fulfillment_summary
FROM
warehouse_orders.Warehouse AS Warehouse
LEFT JOIN
warehouse_orders.Orders AS Orders
ON
Warehouse.warehouse_id = Orders.warehouse_id
GROUP BY
Warehouse.warehouse_id,
warehouse_name
HAVING
COUNT(Orders.order_id) > 0
[Troubleshooting questions]
- How should I prioritize these issues?
- In a single sentence, what's the issue I'm facing?
- What resources can help me solve the problem?
- How can I stop this problem from happening in the future?
WEEK 4 - Performing data calculations
[Definition]
-Profit margin: A percentage that indicates how many cents of profit has been generated for each dollar of sale
-Calculated field: A new field within a pivot table that carries out certain calculations based on the values of other fields
-Modulo(SQL): An operator (%) that returns the remainder when one number is divided by another
(MODULO: SQL, MOD: sheets)
-GROUP BY(SQL): A command that groups rows that have the same values from a table into summary rows
-Data validation process: Checking and rechecking the quality of your data so that it is complete, accurate, secure and consistent
-Temporary table: A database table that is created and exists temporarily on a database server
[SUMIF, SUMIFS, COUNTIF, COUNTIFS]
Function | Example |
=SUMIF(range, criteria, [sum_range]) | =SUMIF(A1:A9, "Fuel", B1:B9) |
=SUMIFS(sum_range, range1, value1, [range2, value2, ...]) | =SUMIFS(B1:B9, A1:A9, "Fuel", C1:C9, "12/15/20") |
=COUNTIF(range, value) | =COUNTIF(A1:A9, "Food") |
=COUNTIFS(range1, value1, [range2, value2, ...]) | =COUNTIFS(A1:A9, "Coffee", C1:C9, "12/15/2020") |
https://exceljet.net/functions/ifs-function
IFS Function
The Excel IFS function can run multiple tests and return a value corresponding to the first TRUE result. Use the IFS function to evaluate multiple conditions without multiple nested IF statements. IFS allows shorter, easier to read formulas.
exceljet.net
[SUMPRODUCT function]
Multiply arrays and return the sum of those products.
=SUMPRODUCT(array1, array2, ...)
[EXTRACT, GROUP BY]
SELECT
EXTRACT( YEAR FROM STARTTIME)
COUNT(*) AS number_of_rides
FROM
bigquery-public-data.new_york.citibike_trips
GROUP BY
year
ORDER BY
year
[Types of data validation]
Purpose | Example | Limitations | |
Data type | Check that the data matches the data type defined for a field. | Data values for school grades 1-12 must be a numeric data type. | The data value 13 would pass the data type validation. For this case, data range validation is also needed. |
Data range | Check that the data falls within an acceptable range of values defined for the field. | Data values for school grades should be values between 1 and 12. | The data value 11.5 would pass the range validation. For this case, data constraint validation is also needed. |
Data constraints | Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as number of characters. | Content constraint: Data values for school grades 1-12 must be whole numbers. | The data value 13 would pass the content would pass the content constraint validation. For this case, data range validation is also needed. |
Data consistency | Check that the data makes sense in the context of other related data. | Data values for product shipping dates can't be earlier than product production dates. | Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong. |
Data structure | Check that the data follows or conforms to a set structure. | Web pages must follow a prescribed structure to be displayed properly. | A data structure might be correct with the data still incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information. |
Code validation | Check that the application code systematically performs any of the previously mentioned validations during user data input. | Common problems discovered during code validationi include: more than one data type allowed, data range checking not done, or ending of text strings not well defined. | Code validation might not validate all possible variatioins with data input. |
[Temporary table]
Temporary tables in a SQL database that aren't stored permanenlty.
- They are automatically deleted from the database when you end your SQL session.
- They can be used as a holding area for storing values if you are making a series of calculations. (pre-processing)
- They can collect the results of multiple, separate queries. (staging)
- They can store a filtered subset of the database. Using fewer SQL commands hepls to keep your data clean.
[How to create temporary tables]
- WITH clauses - BigQuery
- SELECT INTO statements - MySQL
- CREATE TABLE / DROP TABLE statements
WITH trips_over_1hr AS (
SELECT *
FROM bigquery-public-data.new_york.citibike_trips
WHERE tripduration > 60
)
## Count how many trips are 60+ minutes long
SELECT
COUNT (*) AS cnt
FROM trips_over_1_hr
SELECT
*
INTO
AfricaSales
FROM
GlobalSales
WHERE
Region = "Africa"
'Data Analytics > Google Data Analytics' 카테고리의 다른 글
Data Visualization Catalogue (0) | 2023.06.16 |
---|---|
[Coursera] Course 6. Share Data Through the Art of Visualization (0) | 2023.06.16 |
Margin of error calculator (0) | 2023.06.06 |
Sample size calculator (0) | 2023.06.06 |
[Coursera] Course 4. Process Data from Dirty to Clean (0) | 2023.06.05 |