[Coursera] Course 4. Process Data from Dirty to Clean

WEEK 1 - The importance of integrity

[Definition]

-Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle

-Data replication: The process of storing data in multiple locations

-Data transfer: The process of copying data from a storage device to memory, or from one computer to another

-Data manipulation: The process of changing data to make it more organized and easier to read

-Population: All possible data values in a certain dataset

-Sample size: A part of a population that is representative of the population

-Sampling bias: A sample isn't representative of the population as a whole

-Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen

-Statistical power: The probability of getting meaningful results from a test

-Hypothesis testing: A way to see if a survey or experiment has meaningful results

-Confidence level: The probability that your sample size accurately reflects the greater population

-Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

If a test is statistically significant,
it means the results of the test are real and not an error caused by random chance.

[Data constraints and examples]

Data constraint	Definition	Examples
Data type	Values must be of a certain type: data, number, percentage, boolean, etc.	If the data type is a date, a single number like 30 would fail the constraint and be invaild
Data range	Values must fall between predefined maximum and minimum values	If the data range is 10-20, a value of 30 would fail the constraint and be invalid
Mandatory	Values can't be left blank or empty	If age is mandatory, that value must be filled in
Unique	Values can't have a duplicate	Two people can't have the same mobile number within the same service area
Regular expression(regex) patterns	Values must match a prescribed pattern	A phone number must match ###-###-#### (no other characters allowed)
Cross-field validation	Certain conditions for multiple fields must be satisfied	Values are percentages and values from multiple fields must add up to 100%
Primary-key	(Databases only) value must be unique per column	A database table can't have two rows with the same primary key value. A primary key is an identifier.
Set-membership	(Databases only) values for a column must come from a set of discrete values	Value for a column must be set to Yes, No, or Not Applicable
Foreign-key	(Databases only) values for a column must be unique values coming from a column in another table	In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
Accuracy	The degree to which the data conforms to the actual entity being measured or described	If values for zip codes are validated by street location, the accuracy of the data goes up
Completeness	The degree to which the data contains all desired components or measures	If data for personal profiles required hair and eye color, and both are collected, the data is complete
Consistency	The degree to which the data is repeatable from different points of entry or collection	If a customer has the same address in the sales and repair databases, the data is consistent

WEEK 2 - Sparkling-clean data

[Definition]

-Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve

-Clean data: Data that is complete, correct, and relevant to the problem you're trying to solve

-Data engineers: Transform data into a useful format for analysis and give it a reliable infrastructure

-Data warehousing specialists: Develop processes and procedures to effectively store and organize data

-Data validation: A tool for checking the accuracy and quality of data before adding or importing it

-Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions

-Remove duplicates: A tool that automatically searches for and eliminates duplicate entries from a spreadsheet

-Split: A tool that divides text around a specified character and puts each fragment into a new, separate cell

-Concatenate: A function that joins multiple text strings into a single string

-Syntax: A predetermined structure that includes all required information and its proper placement

-Pivot table: A data summarization tool that is used in data processing

-Data mapping: The process of matching fields from one data source to another

[Dirty data]

Types	Description	Possible causes
Duplicate data	Any data record that shows up more than once	Manual data entry, batch data imports, or data migration
Outdated data	Any data that is old which should be replaced with newer and more auucrate information	People changing roles or companies, or software and system becoming obsolete
Incomplete data	Any data that is missing important fields	Improper data collection or incorrect data entry
Incorrect/Inaccurate data	Any data that is complete but inaccurate	Human error inserted during data input, fake information, or mock data
Inconsistent data	Any data that uses different formats to represent the same thing	Data stored incorrectly or errors inserted during data transfer

[Data Cleaning Tools]

Data validation
Conditional formatting
COUNTIF
Sorting
Filtering

[Useful functions]

Description	Function	Example
기준을 충족하는 셀의 개수 반환	=COUNTIF(range, "value")	=COUNTIF(A2:A72, "<100")
텍스트 문자열의 문자 수 반환	=LEN(range)	=LEN(A2)
지정한 개수의 문자열 왼쪽부터 반환	=LEFT(range, number of characters)	=LEFT(A2,5)
지정한 개수의 문자열 오른쪽부터 반환	=RIGHT(range, number of characters)	=RIGHT(A2,4)
지정한 개수의 문자열 원하는 곳부터 반환	=MID(range, reference starting point, number of middle characters)	=MID(D2,4,2)
두 개 이상의 텍스트 문자열을 하나로 연결	=CONCATENATE(item1,item2)	=CONCATENATE(G2,H2) =CONCATENATE(G2," ",H2)
지정한 구분자에 의해 문자열을 쪼개어 반환	=SPLIT(range, delimiter)	=SPLIT(A2,"-")
텍스트에 불필요한 공백 제거(단어 사이 제외)	=TRIM(range)	=TRIM(A2)
행별로 항목을 찾아야 할 때	=VLOOKUP(data to look up,'where to look'!Range, column, FALSE)	=VLOOKUP(A2,'Sheet 2'!A1:B31,2,FALSE)

WEEK 3 - Cleaning data with SQL

[Definition]

-Typecasting: Converting data from one type to another

[SQL function]

-CAST()

Can be used to convert anything from one data type to another

SELECT
CAST(purchase_price AS FLOAT64)
FROM
customer_data.customer_purchase
ORDER BY
CAST(purchase_price AS FLOAT64) DESC

-CONCAT()

Adds strings together to create new text strings that can be used as unique keys

SELECT
CONCAT(product_code, product_color) AS new_product_code
FROM
customer_data.customer_purchase
WHERE
product = 'couch'

-COALESCE()

Can be used to return non-null values

SELECT
COALESCE(product, product_code) AS product_info
FROM
customer_data.customer_purchase

WEEK 4 - Verify and report on your cleaning results

[Definition]

-Verification: A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable

-Documentation: The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort

-Changelog: A file containing a chronologically ordered list of modifications made to a project

-COUNTA: A function that counts the total number of values within a specified range

[The most common problems]

Problems	Checklist
Sources of errors	Did you use the right tools and functions to find the source of the errors in your dataset?
Null data	Did you search for NULLs using conditional formatting and filters?
Misspelled words	Did you locate all misspellings?
Mistyped numbers	Did you double-check that your numeric data has been entered correctly?
Extra spaces and characters	Did you remove any extra spaces or characters using the TRIM function?
Duplicates	Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
Mismatched data types	Did you check that numeric, date, and string data are typecast correctly?
Messy (inconsistent) strings	Did you make sure that all of your strings are consistent and meaningful?
Messy (inconsistent) data formats	Did you format the dates consistently throughout your dataset?
Misleading variable labels (columns)	Did you name your columns meaningfully?
Truncated data	Did you check for truncated or missing data that need correction?
Business Logic	Did you check that the data makes sense given your knowledge of the business?

[Review the goal of the project]

Confirm the business problem
Confirm the goal of the project
Verify that data can solve the problem and is aligned to the goal

[CASE statement]

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met

SELECT
customer_id,
CASE
WHEN first_name = 'Tnoy' THEN 'Tony'
WHEN first_name = 'Sohe' THEN 'Sophie'
ELSE first_name
END AS cleaned_name
FROM
customer_data.customer_name

[Benefits of documentation]

Recall data-cleaning errors
Inform other users of changes
Determine quality of data

WEEK 5 - Optional: Adding data to your resume

[Definition]

-Transferable skills: Skills and qualities that can transfer from one job or industry to another

-Soft skills: Non-technical traits and behaviors that relate to how you work

[Soft skills]

Presentation skills
Collaboration
Communication
Research
Problem-solving skills
Adaptability
Attention to detail

'Data Analytics > Google Data Analytics' 카테고리의 다른 글

Margin of error calculator (0)	2023.06.06
Sample size calculator (0)	2023.06.06
Sites for open data (0)	2023.05.31
[Coursera] Course 3. Prepare Data for Exploration (0)	2023.05.30
[Coursera] Course 2. Ask Questions to Make Data-Driven Decisions (0)	2023.05.18

Just do it, Keep going

[Coursera] Course 4. Process Data from Dirty to Clean

WEEK 1 - The importance of integrity

[Definition]

[Data constraints and examples]

WEEK 2 - Sparkling-clean data

[Definition]

[Dirty data]

[Data Cleaning Tools]

[Useful functions]

WEEK 3 - Cleaning data with SQL

[Definition]

[SQL function]

WEEK 4 - Verify and report on your cleaning results

[Definition]

[The most common problems]

[Review the goal of the project]

[CASE statement]

[Benefits of documentation]

WEEK 5 - Optional: Adding data to your resume

[Definition]

[Soft skills]

'Data Analytics > Google Data Analytics' 카테고리의 다른 글

티스토리툴바

[Coursera] Course 4. Process Data from Dirty to Clean

WEEK 1 - The importance of integrity

[Definition]

[Data constraints and examples]

WEEK 2 - Sparkling-clean data

[Definition]

[Dirty data]

[Data Cleaning Tools]

[Useful functions]

WEEK 3 - Cleaning data with SQL

[Definition]

[SQL function]

WEEK 4 - Verify and report on your cleaning results

[Definition]

[The most common problems]

[Review the goal of the project]

[CASE statement]

[Benefits of documentation]

WEEK 5 - Optional: Adding data to your resume

[Definition]

[Soft skills]

'Data Analytics > Google Data Analytics' 카테고리의 다른 글

'Data Analytics/Google Data Analytics' Related Articles

티스토리툴바