WEEK 1 - The importance of integrity
[Definition]
-Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle
-Data replication: The process of storing data in multiple locations
-Data transfer: The process of copying data from a storage device to memory, or from one computer to another
-Data manipulation: The process of changing data to make it more organized and easier to read
-Population: All possible data values in a certain dataset
-Sample size: A part of a population that is representative of the population
-Sampling bias: A sample isn't representative of the population as a whole
-Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
-Statistical power: The probability of getting meaningful results from a test
-Hypothesis testing: A way to see if a survey or experiment has meaningful results
-Confidence level: The probability that your sample size accurately reflects the greater population
-Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population
If a test is statistically significant,
it means the results of the test are real and not an error caused by random chance.
[Data constraints and examples]
Data constraint | Definition | Examples |
Data type | Values must be of a certain type: data, number, percentage, boolean, etc. |
If the data type is a date, a single number like 30 would fail the constraint and be invaild |
Data range | Values must fall between predefined maximum and minimum values | If the data range is 10-20, a value of 30 would fail the constraint and be invalid |
Mandatory | Values can't be left blank or empty | If age is mandatory, that value must be filled in |
Unique | Values can't have a duplicate | Two people can't have the same mobile number within the same service area |
Regular expression(regex) patterns | Values must match a prescribed pattern | A phone number must match ###-###-#### (no other characters allowed) |
Cross-field validation | Certain conditions for multiple fields must be satisfied | Values are percentages and values from multiple fields must add up to 100% |
Primary-key | (Databases only) value must be unique per column | A database table can't have two rows with the same primary key value. A primary key is an identifier. |
Set-membership | (Databases only) values for a column must come from a set of discrete values | Value for a column must be set to Yes, No, or Not Applicable |
Foreign-key | (Databases only) values for a column must be unique values coming from a column in another table | In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table |
Accuracy | The degree to which the data conforms to the actual entity being measured or described | If values for zip codes are validated by street location, the accuracy of the data goes up |
Completeness | The degree to which the data contains all desired components or measures | If data for personal profiles required hair and eye color, and both are collected, the data is complete |
Consistency | The degree to which the data is repeatable from different points of entry or collection | If a customer has the same address in the sales and repair databases, the data is consistent |
WEEK 2 - Sparkling-clean data
[Definition]
-Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve
-Clean data: Data that is complete, correct, and relevant to the problem you're trying to solve
-Data engineers: Transform data into a useful format for analysis and give it a reliable infrastructure
-Data warehousing specialists: Develop processes and procedures to effectively store and organize data
-Data validation: A tool for checking the accuracy and quality of data before adding or importing it
-Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions
-Remove duplicates: A tool that automatically searches for and eliminates duplicate entries from a spreadsheet
-Split: A tool that divides text around a specified character and puts each fragment into a new, separate cell
-Concatenate: A function that joins multiple text strings into a single string
-Syntax: A predetermined structure that includes all required information and its proper placement
-Pivot table: A data summarization tool that is used in data processing
-Data mapping: The process of matching fields from one data source to another
[Dirty data]
Types | Description | Possible causes |
Duplicate data | Any data record that shows up more than once | Manual data entry, batch data imports, or data migration |
Outdated data | Any data that is old which should be replaced with newer and more auucrate information | People changing roles or companies, or software and system becoming obsolete |
Incomplete data | Any data that is missing important fields | Improper data collection or incorrect data entry |
Incorrect/Inaccurate data | Any data that is complete but inaccurate | Human error inserted during data input, fake information, or mock data |
Inconsistent data | Any data that uses different formats to represent the same thing | Data stored incorrectly or errors inserted during data transfer |
[Data Cleaning Tools]
- Data validation
- Conditional formatting
- COUNTIF
- Sorting
- Filtering
[Useful functions]
Description | Function | Example |
기준을 충족하는 셀의 개수 반환 | =COUNTIF(range, "value") | =COUNTIF(A2:A72, "<100") |
텍스트 문자열의 문자 수 반환 | =LEN(range) | =LEN(A2) |
지정한 개수의 문자열 왼쪽부터 반환 | =LEFT(range, number of characters) | =LEFT(A2,5) |
지정한 개수의 문자열 오른쪽부터 반환 | =RIGHT(range, number of characters) | =RIGHT(A2,4) |
지정한 개수의 문자열 원하는 곳부터 반환 | =MID(range, reference starting point, number of middle characters) | =MID(D2,4,2) |
두 개 이상의 텍스트 문자열을 하나로 연결 | =CONCATENATE(item1,item2) | =CONCATENATE(G2,H2) =CONCATENATE(G2," ",H2) |
지정한 구분자에 의해 문자열을 쪼개어 반환 | =SPLIT(range, delimiter) | =SPLIT(A2,"-") |
텍스트에 불필요한 공백 제거(단어 사이 제외) | =TRIM(range) | =TRIM(A2) |
행별로 항목을 찾아야 할 때 | =VLOOKUP(data to look up,'where to look'!Range, column, FALSE) | =VLOOKUP(A2,'Sheet 2'!A1:B31,2,FALSE) |
WEEK 3 - Cleaning data with SQL
[Definition]
-Typecasting: Converting data from one type to another
[SQL function]
-CAST()
Can be used to convert anything from one data type to another
SELECT
CAST(purchase_price AS FLOAT64)
FROM
customer_data.customer_purchase
ORDER BY
CAST(purchase_price AS FLOAT64) DESC
-CONCAT()
Adds strings together to create new text strings that can be used as unique keys
SELECT
CONCAT(product_code, product_color) AS new_product_code
FROM
customer_data.customer_purchase
WHERE
product = 'couch'
-COALESCE()
Can be used to return non-null values
SELECT
COALESCE(product, product_code) AS product_info
FROM
customer_data.customer_purchase
WEEK 4 - Verify and report on your cleaning results
[Definition]
-Verification: A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable
-Documentation: The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort
-Changelog: A file containing a chronologically ordered list of modifications made to a project
-COUNTA: A function that counts the total number of values within a specified range
[The most common problems]
Problems | Checklist |
Sources of errors | Did you use the right tools and functions to find the source of the errors in your dataset? |
Null data | Did you search for NULLs using conditional formatting and filters? |
Misspelled words | Did you locate all misspellings? |
Mistyped numbers | Did you double-check that your numeric data has been entered correctly? |
Extra spaces and characters | Did you remove any extra spaces or characters using the TRIM function? |
Duplicates | Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL? |
Mismatched data types | Did you check that numeric, date, and string data are typecast correctly? |
Messy (inconsistent) strings | Did you make sure that all of your strings are consistent and meaningful? |
Messy (inconsistent) data formats | Did you format the dates consistently throughout your dataset? |
Misleading variable labels (columns) | Did you name your columns meaningfully? |
Truncated data | Did you check for truncated or missing data that need correction? |
Business Logic | Did you check that the data makes sense given your knowledge of the business? |
[Review the goal of the project]
- Confirm the business problem
- Confirm the goal of the project
- Verify that data can solve the problem and is aligned to the goal
[CASE statement]
The CASE statement goes through one or more conditions and returns a value as soon as a condition is met
SELECT
customer_id,
CASE
WHEN first_name = 'Tnoy' THEN 'Tony'
WHEN first_name = 'Sohe' THEN 'Sophie'
ELSE first_name
END AS cleaned_name
FROM
customer_data.customer_name
[Benefits of documentation]
- Recall data-cleaning errors
- Inform other users of changes
- Determine quality of data
WEEK 5 - Optional: Adding data to your resume
[Definition]
-Transferable skills: Skills and qualities that can transfer from one job or industry to another
-Soft skills: Non-technical traits and behaviors that relate to how you work
[Soft skills]
- Presentation skills
- Collaboration
- Communication
- Research
- Problem-solving skills
- Adaptability
- Attention to detail
'Data Analytics > Google Data Analytics' 카테고리의 다른 글
Margin of error calculator (0) | 2023.06.06 |
---|---|
Sample size calculator (0) | 2023.06.06 |
Sites for open data (0) | 2023.05.31 |
[Coursera] Course 3. Prepare Data for Exploration (0) | 2023.05.30 |
[Coursera] Course 2. Ask Questions to Make Data-Driven Decisions (0) | 2023.05.18 |