Data Fascinated: December 2022

Monday, December 19, 2022

Binomial Distribution in very simple words

The binomial distribution is a probability distribution that describes the outcome of a series of independent "yes/no" experiments, or Bernoulli trials, in which there are only two possible outcomes. It is used to model the probability of a specific number of successes in a given number of trials, where the probability of success is the same for each trial.

For example, if you are flipping a coin and want to know the probability of getting a certain number of heads in a certain number of flips, you can use the binomial distribution to model this. If the probability of getting heads on each flip is 0.5, you can use the binomial distribution to find the probability of getting, for example, exactly 3 heads out of 10 flips.

The binomial distribution is defined by two parameters: the number of trials (n) and the probability of success on each trial (p). The probability of a specific number of successes (x) in a given number of trials can be calculated using the following formula:

probability = (n choose x) * p^x * (1-p)^(n-x)

where "n choose x" represents the binomial coefficient, which counts the number of different ways to choose x items from a group of n items when the order does not matter. It is calculated as n! / (x! * (n-x)!).

The binomial distribution can be useful for modeling and analyzing a wide range of real-world situations, such as the probability of winning a game of chance, the probability of a medical treatment being effective, or the probability of a machine failing. It is a widely used and important concept in statistical analysis and probability.

Example:

Here is a simple example of using the binomial distribution to model the probability of a specific number of successes in a given number of trials:

Suppose you are flipping a coin and want to know the probability of getting exactly 3 heads in 10 flips. The probability of getting heads on each flip is 0.5, so the number of trials (n) is 10 and the probability of success (p) is 0.5. Using the formula for the binomial distribution, we can calculate the probability of getting 3 heads in 10 flips as follows:

probability = (10 choose 3) * (0.5^3) * (0.5^7)

= (10 * 9 * 8) / (3 * 2 * 1) * (0.125) * (0.0078125)

= 0.1171875

So the probability of getting exactly 3 heads in 10 flips is approximately 0.12, or 12%.
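The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name `binomial_pmf` is made up for this post, and it relies on `math.comb` from the standard library for the "n choose x" part.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Exactly 3 heads in 10 flips of a fair coin.
print(binomial_pmf(3, 10, 0.5))  # 0.1171875
```

Adding up `binomial_pmf(k, 10, 0.5)` for every k from 0 to 10 gives 1, as it should for a probability distribution.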

This is just a simple example to illustrate how the binomial distribution can be used to calculate the probability of a specific number of successes in a given number of trials. In practice, the binomial distribution can be used to model and analyze a wide range of real-world situations.

Sunday, December 18, 2022

Percentiles in simple words

A percentile is a value on a scale that indicates the percentage of a distribution that is equal to or below it. For example, if a value is at the 75th percentile, it means that 75% of the values in the distribution are equal to or below it.

To calculate the percentile of a value, you can use the following steps:

Arrange the values in the distribution in ascending order.

Find the position of the value you want to find the percentile for.

Calculate the percentile using the following formula: Percentile = (position of the value / total number of values) x 100

For example, let's say you have the following distribution of values: 3, 5, 7, 8, 9, 10, 12

If you want to find the percentile for the value 8, you would follow these steps:

Arrange the values in ascending order: 3, 5, 7, 8, 9, 10, 12

Find the position of the value 8: It is the 4th value in the list.

Calculate the percentile: (4 / 7) x 100 = 57.14%

This means that the value 8 is at the 57.14th percentile of the distribution.
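The steps above can be sketched in Python. The function name `percentile_of` is hypothetical, and it uses the "equal to or below" definition from this post; other percentile definitions exist and can give slightly different numbers.

```python
def percentile_of(value, values):
    """Percentage of values that are equal to or below the given value."""
    ordered = sorted(values)  # step 1: arrange in ascending order
    position = sum(1 for v in ordered if v <= value)  # step 2: position of the value
    return position / len(ordered) * 100  # step 3: apply the formula


data = [3, 5, 7, 8, 9, 10, 12]
print(round(percentile_of(8, data), 2))  # 57.14
```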

Percentiles are useful for understanding how a value compares to the rest of the distribution. For example, if a student scored in the 90th percentile on a test, it means that their score was equal to or higher than the scores of 90% of the students who took the test.

Standardizing Z scores simple explanation

Standardizing a value means converting it to a z-score, which expresses how many standard deviations the value is from the mean of a distribution.

To standardize a value, you can use the following formula:

z = (x - μ) / σ

Where x is the value you want to standardize, μ is the mean of the distribution, and σ is the standard deviation of the distribution.

For example, let's say you have a distribution with a mean of 100 and a standard deviation of 10. If you want to standardize the value 110, you would do the following calculation:

z = (110 - 100) / 10 = 1

This means that the value 110 is 1 standard deviation above the mean of 100.

Standardizing values can be useful in comparing values from different distributions or in identifying unusual values that fall outside of the normal range.

A more simple example

To standardize your age of 10 years, you would need to know the mean and standard deviation of the age distribution you are comparing it to. For example, if you are comparing your age to the age of students in your class and the mean age of the students is 10 years and the standard deviation is 2 years, you can standardize your age as follows:

z = (10 - 10) / 2 = 0

This means that your age of 10 years is exactly the average age of the students in your class. If the mean age of the students was 9 years and the standard deviation was still 2 years, your standardized age would be:

z = (10 - 9) / 2 = 0.5

This means that your age is 0.5 standard deviations above the mean age of the students in your class.
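Here is a minimal Python sketch of the same formula, using the numbers from the examples above. The function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


print(z_score(110, 100, 10))  # 1.0
print(z_score(10, 10, 2))     # 0.0
print(z_score(10, 9, 2))      # 0.5
```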

It's important to note that standardizing a value only makes sense if you are comparing it to a distribution with a known mean and standard deviation. Without this information, you cannot accurately standardize a value.

Saturday, December 17, 2022

Posterior probability

In probability theory, the posterior probability is the probability of an event occurring after taking into account new evidence or information. It is calculated using Bayes' theorem, which states that the posterior probability is equal to the prior probability (the probability of the event occurring before taking into account the new information) multiplied by the likelihood (the probability of observing the new evidence given that the event has occurred) divided by the marginal probability (the probability of observing the new evidence).

For example, let's say you have a box with 10 marbles in it, 5 of which are red and 5 of which are blue. You draw a marble from the box, observe that it is red, and then put it back in the box. Let A be the event that the next marble you draw is red, and let B be the evidence that the first marble was red. The prior probability P(A) is 5/10, or 50%. Because the first marble went back into the box, the two draws are independent, so the likelihood P(B|A), the probability that the first marble was red given that the next one is red, is also 5/10. The marginal probability P(B), the probability that the first marble was red, is 5/10 as well. Using Bayes' theorem, we can calculate the posterior probability of drawing a red marble as follows:

Posterior probability = Prior probability * Likelihood / Marginal probability

Posterior probability = (5/10) * (5/10) / (5/10)

= 5/10

= 50%

So, the posterior probability of drawing a red marble after observing a red marble is still 50%. Here the new evidence does not change the prior, because you already know what is in the box and the marble was put back. The posterior differs from the prior only when the evidence tells you something about the event, for example when you do not know how many red marbles the box contains.

Bayesian tree in simple words with an example

A Bayesian tree is a graphical representation of a Bayesian network, which is a type of probabilistic model used to represent the relationships between different variables and their probabilities.

A Bayesian tree is made up of nodes, which represent variables or events, and branches, which represent the relationships between the nodes. Each node in a Bayesian tree has a probability associated with it, which represents the likelihood of that event occurring.

Here's an example of a Bayesian tree:

Imagine you are trying to predict the likelihood of it raining tomorrow. You know that the probability of it raining depends on the weather forecast and the likelihood of the forecast being accurate. You can create a Bayesian tree to represent the relationship between these variables:

Rain (A)

Forecast (B) Accuracy (C)

In this example, the node "Rain" represents the event of it raining tomorrow. The nodes "Forecast" and "Accuracy" represent the variables that influence the probability of it raining. The branches connecting the nodes represent the relationships between the variables.

To calculate the probability of it raining tomorrow given that rain is forecast, you would use Bayes' theorem. For example, suppose that before hearing the forecast you think the probability of rain is P(Rain) = 0.5, that rain is forecast on 90% of the days when it actually rains, so P(Forecast|Rain) = 0.9 (this is the accuracy of the forecast), and that rain is forecast on 70% of all days, so P(Forecast) = 0.7. You can then use Bayes' theorem to calculate the probability of it raining tomorrow:

P(A|B) = (P(B|A) * P(A)) / P(B)

P(Rain|Forecast) = (P(Forecast|Rain) * P(Rain)) / P(Forecast)

P(Rain|Forecast) = (0.9 * 0.5) / 0.7

P(Rain|Forecast) ≈ 0.64

This example shows how a Bayesian tree can be used to represent the relationships between variables and their probabilities, and how Bayes' theorem can be used to make predictions or estimates about the likelihood of an event occurring.

Bayes' theorem

Bayes' theorem is a mathematical formula that describes the relationship between the probability of an event occurring and the likelihood of certain evidence being present. It allows us to make predictions or estimates about the probability of an event occurring, based on past data or evidence.

Here's the formula for Bayes' theorem:

P(A|B) = (P(B|A) * P(A)) / P(B)

Where:

P(A|B) is the probability of event A occurring, given that event B has occurred.

P(B|A) is the probability of event B occurring, given that event A has occurred.

P(A) is the probability of event A occurring.

P(B) is the probability of event B occurring.

Bayes' theorem can be used in a variety of contexts, including decision-making, risk assessment, and statistical analysis. It is a widely used and important tool in many fields, including statistics, machine learning, and data analysis.
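As a rough sketch, the formula translates directly into Python. The function name `bayes_posterior` and all of the numbers below are made up for illustration. The sketch also assumes you know P(B|not A), so that P(B) can be worked out with the law of total probability.

```python
def bayes_posterior(likelihood, prior, marginal):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < marginal <= 1:
        raise ValueError("P(B) must be greater than 0 and at most 1")
    return likelihood * prior / marginal


# Hypothetical inputs, chosen only for illustration.
prior = 0.3                  # P(A)
likelihood = 0.8             # P(B|A)
likelihood_if_not_a = 0.2    # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
marginal = likelihood * prior + likelihood_if_not_a * (1 - prior)

posterior = bayes_posterior(likelihood, prior, marginal)
print(round(posterior, 3))  # 0.632
```

Building P(B) from its two pieces like this is a handy safety check: it guarantees that the three inputs are consistent with each other, so the result always lands between 0 and 1.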

Example:

Let's say you have a box with some marbles in it. You know that some of the marbles are red and some are blue, and that a few marbles of each color have a stripe on them. You can use Bayes' theorem to figure out the probability that a marble you pull out of the box is red, given that you can tell it has a stripe.

Imagine the box holds 5 marbles: 3 of them are red and 2 of them are blue. One of the red marbles and one of the blue marbles have a stripe (these numbers are made up for the example). You pull out a marble without looking and feel that it has a stripe. Let A be the event that the marble is red, and let B be the evidence that the marble has a stripe.

First, you need to know the probability that a marble has a stripe, given that it is red. This is called P(B|A). In this case, 1 of the 3 red marbles has a stripe, so P(B|A) is 1/3.

Then, you need to know the probability of a marble being red overall. This is called P(A), and it is calculated by taking the total number of red marbles and dividing it by the total number of marbles. In this case, the probability of a marble being red is 3/5, or 0.6.

Finally, you need to know the probability of a marble having a stripe overall. This is called P(B), and it is calculated by taking the total number of striped marbles and dividing it by the total number of marbles. In this case, the probability of a marble having a stripe is 2/5, or 0.4.

Now, you can plug all of these values into Bayes' theorem to calculate the probability that the marble is red, given that it has a stripe:

P(A|B) = (P(B|A) * P(A)) / P(B)

P(A|B) = (1/3 * 0.6) / 0.4

P(A|B) = 0.5

The probability that the marble is red, given that it has a stripe, is 0.5, or 50%. This makes sense: there are two striped marbles in the box, and one of them is red. Before you felt the stripe, the probability of red was 0.6. The evidence lowered it to 0.5.

Friday, December 16, 2022

Conditional probability

In probability, the conditional probability of an event is the probability that the event will occur, given that another event has already occurred.

For example, let's say we have a deck of cards and we want to know the probability of drawing an Ace of Spades, given that we have already drawn the Ace of Hearts. In this case, the probability of drawing the Ace of Spades is 1 out of 51, since there is only 1 Ace of Spades left in the deck and 51 total cards remaining.

We can express this probability using the following formula:

P(Ace of Spades | Ace of Hearts) = (Number of ways to get the Ace of Spades given that the Ace of Hearts has already been drawn) / (Total number of cards remaining)

So in this case, the probability of drawing the Ace of Spades given that the Ace of Hearts has already been drawn is 1/51.

It's important to note that the conditional probability of an event is not the same as the probability of the event occurring on its own. The probability of an event occurring on its own is called the unconditional probability.

Another example (from the lesson)

Last semester, out of 170 students taking a particular statistics class, 71 students were “majoring” in social sciences and 53 students were majoring in pre-medical studies. There were 6 students who were majoring in both pre-medical studies and social sciences. What is the probability that a randomly chosen student is majoring in social sciences, given that s/he is majoring in pre-medical studies?

(71+53−6)/170

6/170

6/71

6/53

To solve this problem, we can use the formula for conditional probability, which is the probability of an event occurring given that another event has already occurred. The formula for conditional probability is:

P(A|B) = P(A and B) / P(B)

Where P(A|B) is the probability of event A occurring given that event B has already occurred, P(A and B) is the probability of both events A and B occurring, and P(B) is the probability of event B occurring.

In this problem, we are asked to find the probability that a student is majoring in social sciences given that they are majoring in pre-medical studies, so we can define event A as "majoring in social sciences" and event B as "majoring in pre-medical studies." We are given that there are 6 students who are majoring in both pre-medical studies and social sciences, so the probability of both events occurring is 6/170. We are also given that there are 53 students majoring in pre-medical studies, so the probability of event B occurring is 53/170. Plugging these values into the formula for conditional probability, we get:

P(A|B) = P(A and B) / P(B)

= (6/170) / (53/170)

= 6/53

So, the probability that a student is majoring in social sciences given that they are majoring in pre-medical studies is 6/53. The answer is therefore option 4: 6/53.
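The arithmetic can be double-checked with exact fractions in Python. This is a small sketch using the counts from the problem; the variable names are my own.

```python
from fractions import Fraction

total_students = 170
pre_med = 53   # event B: majoring in pre-medical studies
both = 6       # A and B: majoring in both

p_a_and_b = Fraction(both, total_students)
p_b = Fraction(pre_med, total_students)

# P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 6/53
```

Notice that the 170 cancels out: once you know a student is pre-med, only the 53 pre-med students matter.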

Product rule for independent events in statistics

The product rule is a rule in probability and statistics that is used to calculate the probability of two independent events occurring. It states that the probability of the intersection of two independent events is equal to the product of the probabilities of the individual events.

Here is the formula for the product rule:

P(A ∩ B) = P(A) x P(B)

where P(A ∩ B) is the probability of the intersection of events A and B, P(A) is the probability of event A occurring, and P(B) is the probability of event B occurring.

Here is an example of how to use the product rule:

Let's say you are playing a game where you flip a coin and roll a die, and you want to know the probability of flipping heads and rolling a 1. If the probability of flipping heads is 50% and the probability of rolling a 1 is 1/6, the probability of both events occurring is 50% x 1/6 = 8.3%.

The product rule is important in probability and statistics because it allows us to calculate the probability of two independent events occurring, and to make more informed decisions based on the likelihood of different outcomes.
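One way to convince yourself of the product rule is to list every equally likely outcome of the coin-and-die game and count. The sketch below does that in Python, assuming a fair coin and a fair six-sided die.

```python
from fractions import Fraction
from itertools import product

coin = ["heads", "tails"]
die = [1, 2, 3, 4, 5, 6]

# All 12 equally likely (coin, die) outcomes.
outcomes = list(product(coin, die))
favorable = [o for o in outcomes if o == ("heads", 1)]

p_by_counting = Fraction(len(favorable), len(outcomes))
p_by_product_rule = Fraction(1, 2) * Fraction(1, 6)

print(p_by_counting, p_by_product_rule)  # 1/12 1/12
```

Both approaches give 1/12, which is about 8.3%.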

Independence!!!

In statistics, independence means that one event or thing does not affect the likelihood of another event or thing happening. This means that the two events or things are not connected or related in any way.

For example, let's say you are flipping a coin. The outcome of the coin flip (heads or tails) is independent of the color of the sky (blue or not blue). In other words, the likelihood of getting heads on the coin flip does not affect the color of the sky.

Another example is rolling two dice. The outcome of the first die roll (1, 2, 3, 4, 5, or 6) is independent of the outcome of the second die roll. In other words, the number that you roll on the first die does not affect the number that you roll on the second die.

Independent events are important in probability and statistics because they allow us to calculate the likelihood of different outcomes. For example, if you want to know the probability of two independent events occurring, you can multiply the probabilities of the individual events.

For example, let's say you are playing a game where you flip a coin and roll a die, and you want to know the probability of flipping heads and rolling a 1. If the probability of flipping heads is 50% and the probability of rolling a 1 is 1/6, the probability of both events occurring is 50% x 1/6 = 8.3%.

Disjoint vs complementary events

Disjoint events are events that cannot happen at the same time, while complementary events are events that cannot happen at the same time and whose probabilities add up to 1.

Here are some simple examples to help a child understand the difference between disjoint and complementary events:

Disjoint events:

Flipping a coin and getting heads or getting tails. These events cannot happen at the same time, so they are disjoint.

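Drawing a red ball or a blue ball from a bag of balls. These events cannot happen at the same time, so they are disjoint. If the bag also contains balls of other colors, the two events are disjoint but not complementary, because their probabilities do not add up to 1.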

Complementary events:

Flipping a coin and getting heads or getting tails. These events cannot happen at the same time, and the probability of getting heads is 50% and the probability of getting tails is 50%, so these events are complementary.

Drawing a red card or a black card from a deck of cards. These events cannot happen at the same time, and the probability of drawing a red card is 26/52 and the probability of drawing a black card is 26/52, so these events are complementary.

Complementary events

Complementary events are events that cannot happen at the same time, and whose probabilities add up to 1. In other words, the probability of one event occurring is the complement of the probability of the other event occurring.

For example, let's say you are flipping a coin. The complementary events in this case are getting heads or getting tails. If the probability of getting heads is 50%, the probability of getting tails is also 50%. These probabilities add up to 100%, or 1.

Complementary events are often used in probability and statistics as a shortcut. If the complementary event is easier to work with, you can find the probability of an event by subtracting the probability of its complement from 1.

For example, let's say you are rolling a die and you want to know the probability of rolling an odd number. The complementary event in this case is rolling an even number. If the probability of rolling an even number is 1/2, the probability of rolling an odd number is 1 - 1/2 = 1/2, or 50%.

Rules of probability distributions

Here are some simple explanations and examples of the rules of probability distributions:

The sum of the probabilities of all possible outcomes must be equal to 1. This means that if you add up all the chances of something happening, it should equal 100%. For example, if you flip a coin, there is a 50% chance of getting heads and a 50% chance of getting tails. If you add these probabilities together, you get 100%.

The probability of each outcome must be between 0 and 1. This means that the chance of something happening cannot be less than 0% or more than 100%. For example, if you roll a die, the chance of rolling a 3 is 1/6, or about 17%. This probability is between 0 and 1.

The probability of an event occurring is equal to 1 minus the probability of it not occurring. For example, if the probability of it raining tomorrow is 50%, the probability of it not raining tomorrow is 50%.

The probability of the union of two disjoint events is equal to the sum of the probabilities of the individual events. Disjoint events are events that cannot happen at the same time, such as picking a red ball or a blue ball from a bag of balls. If the probability of picking a red ball is 1/3 and the probability of picking a blue ball is 1/3, the probability of picking either a red ball or a blue ball is 2/3.

The probability of the intersection of two independent events is equal to the product of the probabilities of the individual events. Independent events are events that have no effect on each other, such as flipping two coins and getting heads on both. If the probability of flipping a coin and getting heads is 1/2 and the probability of flipping another coin and getting heads is also 1/2, the probability of flipping two coins and getting heads on both is 1/4.
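The first two rules can be turned into a quick check in Python. This is only a sketch: `is_valid_distribution` is a made-up helper, and it uses exact fractions so that the sum can be compared to 1 without rounding problems.

```python
from fractions import Fraction


def is_valid_distribution(probabilities):
    """Check that every probability is between 0 and 1 and that they sum to 1."""
    values = list(probabilities.values())
    return all(0 <= p <= 1 for p in values) and sum(values) == 1


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
broken_coin = {"heads": Fraction(1, 2), "tails": Fraction(2, 3)}

print(is_valid_distribution(fair_die))     # True
print(is_valid_distribution(broken_coin))  # False
```

With ordinary floating-point numbers, the comparison with 1 would need a small tolerance instead of an exact equality.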

Probability distribution

A probability distribution is a way of organizing and understanding the likelihood of different outcomes in a probability experiment. It is a way of showing how likely it is for a particular outcome to occur.

For example, let's say you are flipping a coin. The probability distribution for this event would show the probability of getting heads (1/2, or 50%) and the probability of getting tails (1/2, or 50%). If you roll a die, the probability distribution would show the probability of rolling each number (1/6, or about 17%, for each number).

Probability distributions are often shown in a graph or chart, with the outcomes on one axis and the probabilities on the other axis. This helps us see how likely it is for each outcome to occur, and allows us to compare the probabilities of different outcomes.

What is a sample space

In statistics, a sample space is the set of all possible outcomes of a random event. It is a way of organizing and understanding the possible results of a probability experiment.

For example, let's say you are flipping a coin. The sample space for this event would be the set of all possible outcomes, which are heads or tails. If you roll a die, the sample space would be the set of all possible outcomes, which are the numbers 1 through 6.

The sample space is important in probability and statistics because it helps us understand the range of possible outcomes and calculate the probability of each outcome. For example, if you flip a coin, the probability of getting heads is 1/2, or 50%, because there are two possible outcomes in the sample space (heads and tails) and one of them is heads.

The sample space is often represented using a set notation, such as {heads, tails} for the coin flipping example, or {1, 2, 3, 4, 5, 6} for the die rolling example. It is a useful tool for organizing and understanding probability experiments and making more informed decisions based on the likelihood of different outcomes.

Union of non disjoint events in simple words

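The union of non-disjoint events is the combination of two or more events that can happen at the same time. Non-disjoint events are also called overlapping events. They are not the same thing as dependent events: drawing a red card and drawing a face card overlap, but they are still independent of each other.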

For example, let's say you are playing a game where you draw a card from a deck of cards. The non-disjoint events in this case are drawing a red card or a face card (such as a Jack, Queen, or King). If you draw a red face card, it is both a red card and a face card at the same time.

The union of these non-disjoint events would be written as A ∪ B, where A is drawing a red card and B is drawing a face card. This represents the combination of these two events, or the probability of drawing either a red card or a face card.

The probability of the union of non-disjoint events is calculated using the following formula:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

where P(A) is the probability of event A, P(B) is the probability of event B, and P(A ∩ B) is the probability of the intersection of events A and B (the probability of both events occurring at the same time).

For example, let's say you are trying to predict the outcome of a game of chance, such as rolling a die. The non-disjoint events in this case are rolling an odd number (1, 3, 5) or rolling a number less than 4 (1, 2, 3). If you want to know the probability of rolling an odd number or a number less than 4, the union of these non-disjoint events would be written as A ∪ B, where A is rolling an odd number and B is rolling a number less than 4. The probability of rolling an odd number or a number less than 4 is:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 3/6 + 3/6 - 2/6 = 4/6 = 2/3

So, the probability of rolling an odd number or a number less than 4 is 2/3, or about 67%.
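The die example can be checked in Python with sets, where `|` is the union and `&` is the intersection. This sketch assumes a fair six-sided die; the helper `prob` is just an illustrative name.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}          # event A
less_than_4 = {1, 2, 3}  # event B


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


by_formula = prob(odd) + prob(less_than_4) - prob(odd & less_than_4)
by_counting = prob(odd | less_than_4)

print(by_formula, by_counting)  # 2/3 2/3
```

Subtracting P(A ∩ B) stops the outcomes 1 and 3 from being counted twice.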

It's important to note that this formula works for any two events. If the events are disjoint, they cannot happen at the same time, so P(A ∩ B) = 0 and the formula reduces to P(A ∪ B) = P(A) + P(B).

The union of disjoint events

The union of disjoint events is a way of combining two or more events that cannot happen at the same time. For example, let's say you are playing a game where you draw a card from a deck of cards. The disjoint events in this case are drawing a red card or a black card. If you draw a red card, it cannot also be a black card at the same time.

The union of these disjoint events would be written as A ∪ B, where A is drawing a red card and B is drawing a black card. This represents the combination of these two events, or the probability of drawing either a red card or a black card.

The formula to calculate the probability of the union of disjoint events is:

P(A ∪ B ∪ ... ∪ N) = P(A) + P(B) + ... + P(N)

where P(A) is the probability of event A, P(B) is the probability of event B, and so on.

For example, let's say you are trying to predict the outcome of a game of chance, such as rolling a die. The disjoint events in this case are rolling a 1, rolling a 2, rolling a 3, rolling a 4, rolling a 5, or rolling a 6. If you want to know the probability of rolling a 1 or a 2, the union of these disjoint events would be written as A ∪ B, where A is rolling a 1 and B is rolling a 2. The probability of rolling a 1 or a 2 is:

P(A ∪ B) = P(A) + P(B) = 1/6 + 1/6 = 2/6 = 1/3

So, the probability of rolling a 1 or a 2 is 1/3, or about 33%.

It's important to note that this formula is only valid for disjoint events, or events that cannot happen at the same time. If the events are not disjoint, you will need to use a different formula to calculate the probability of their union.

Disjoint events

Disjoint events are events that cannot happen at the same time. For example, if you flip a coin, the two disjoint events are getting heads or getting tails. If the coin lands on heads, it cannot also land on tails at the same time.

Disjoint events are also called mutually exclusive events. They are often used in probability and statistics to calculate the likelihood of different outcomes.

For example, let's say you are trying to predict the outcome of a game of chance, such as rolling a die. The disjoint events in this case are rolling a 1, rolling a 2, rolling a 3, rolling a 4, rolling a 5, or rolling a 6. If you roll the die once, the probability of rolling a specific number is 1/6, or about 17%. If you roll the die many times, the probability of rolling a specific number is still 1/6, regardless of what has happened in the past.

Disjoint events are important to understand in probability and statistics because they allow us to calculate the likelihood of different outcomes and make more informed decisions.

The gambler's fallacy

The gambler's fallacy is a mistake that people sometimes make when they are trying to predict the outcome of an event or game. It is based on the idea that past events can influence or predict future events, even if they are completely unrelated.

For example, let's say you are flipping a coin and you want to know the probability of it landing on heads. The probability of the coin landing on heads is always 50%. It doesn't matter if the coin has landed on heads or tails in the past, the probability of it landing on heads on the next flip is always 50%.

The gambler's fallacy occurs when someone thinks that because the coin has landed on heads a lot in the past, it is more likely to land on tails in the future, or vice versa. This is a mistake because the outcome of each flip is independent and has no effect on the outcome of future flips.

The gambler's fallacy can lead to poor decision-making, especially in gambling or betting situations. It is important to understand that each event or outcome is independent and has its own probability, regardless of what has happened in the past.
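A small simulation can make this concrete. The sketch below flips a simulated fair coin many times and looks only at the flips that come right after a streak of heads. The function name, the streak length of 3, and the seed are arbitrary choices for illustration.

```python
import random


def heads_rate_after_streak(num_flips, streak_length, seed=0):
    """Fraction of heads among flips that immediately follow
    a run of `streak_length` heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(num_flips)]  # True means heads
    after_streak = [
        flips[i]
        for i in range(streak_length, num_flips)
        if all(flips[i - streak_length:i])  # the previous flips were all heads
    ]
    return sum(after_streak) / len(after_streak)


rate = heads_rate_after_streak(200_000, 3)
print(rate)
```

The printed rate should come out close to 0.5: a streak of heads does not make tails "due".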

The law of large numbers

The law of large numbers is a statistical principle that says that as the number of observations or trials increases, the average of the results will tend to approach the expected value.

For example, let's say you flip a fair coin many times and keep track of the proportion of flips that land on heads. The law of large numbers says that as you flip the coin more and more times, the proportion of heads will tend to approach 0.5 (50%). If you flip the coin only 10 times, getting 7 heads (a proportion of 0.7) would not be unusual. If you flip it 1,000 times, the proportion of heads will almost always be much closer to 0.5. Note that the probability of heads on each flip stays at 0.5 the whole time; it is the observed proportion that gets closer to it.

The law of large numbers is a useful way to understand probability because it helps us make more accurate predictions based on a large number of observations or trials. It is especially helpful in situations where the results are not certain or the data is noisy or unreliable.

Here is another example of the law of large numbers in action:

Imagine that you are playing a game where you roll a die and the goal is to roll a "6". The probability of rolling a 6 is 1/6, or about 17%, on every roll. If you roll the die just once, you either get a 6 or you don't. But if you roll the die many times, the law of large numbers says that the proportion of rolls that come up 6 will tend to approach the expected value of 1/6.

For example, if you roll the die 10 times and get a 6 twice (20% of the time), the observed proportion is still not very close to the expected value of 1/6. But if you roll the die 100 times and get a 6 17 times (17% of the time), the observed proportion is much closer to the expected value. And if you roll the die 1,000 times and get a 6 about 167 times (16.7% of the time), the observed proportion is even closer to the expected value.

As you can see, the law of large numbers helps us understand probability by showing us how the average of the results will tend to approach the expected value as we collect more and more data.
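You can watch this happen with a short simulation. The sketch below rolls a simulated fair die and reports the proportion of 6s for larger and larger numbers of rolls. The function name and the seed are arbitrary, and the exact numbers printed depend on the random number generator.

```python
import random


def proportion_of_sixes(num_rolls, seed=0):
    """Roll a fair six-sided die num_rolls times and return the share of 6s."""
    rng = random.Random(seed)
    sixes = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 6)
    return sixes / num_rolls


for n in (10, 100, 1_000, 100_000):
    print(n, proportion_of_sixes(n))
```

The proportions for small n can be far from 1/6 (about 0.167), while the proportion for 100,000 rolls should be very close to it.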

Bayesian interpretation

The Bayesian interpretation of probability is a way of understanding how likely something is to happen based on what we think or believe about it. This interpretation helps us update our beliefs or assumptions as we learn more information.

For example, let's say you are trying to predict whether it will rain tomorrow. If you have no information about the weather, you might assume that it has an equal chance of raining or not raining (50% probability). But if you find out that the weather forecast is predicting a high chance of rain, you might update your belief to think that it is more likely to rain (90% probability).

The Bayesian interpretation is helpful when we don't have a lot of data or past experiences to rely on, or when the data we do have is uncertain or incomplete. It allows us to adjust our predictions based on what we learn.
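The rain example can be sketched with Bayes' rule in R. The two forecast accuracy numbers below are assumptions I made up so that the result matches the 90% in the example; they do not come from a real forecast:

```r
# Prior belief before seeing the forecast
prior_rain <- 0.5

# Assumed (hypothetical) forecast behaviour:
p_forecast_given_rain <- 0.9  # forecast says "rain" on 90% of rainy days
p_forecast_given_dry  <- 0.1  # forecast says "rain" on 10% of dry days

# Bayes' rule: P(rain | forecast says rain)
posterior_rain <- (p_forecast_given_rain * prior_rain) /
  (p_forecast_given_rain * prior_rain +
     p_forecast_given_dry * (1 - prior_rain))

posterior_rain
```

With these assumed numbers, the belief moves from 50% to 90% after hearing the forecast.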

Frequentist interpretation

The frequentist interpretation of probability is a way of understanding the likelihood of something happening based on past experiences or observations. For example, if you flip a coin many times and it lands on heads half the time and tails half the time, the probability of the coin landing on heads is 50%. If you roll a die many times and each number comes up about the same number of times, the probability of rolling a specific number is about 1/6, or about 17%. The frequentist interpretation helps us understand probability by looking at what has happened in the past and using that information to make predictions about the future.

Wednesday, December 14, 2022

Hypothesis testing, innocent until proven guilty

Hypothesis testing is a way of using data and evidence to decide if a particular idea or theory is likely to be true. It involves making a guess or prediction (called a hypothesis) about what might be happening, and then collecting data and evidence to see if the prediction is supported. For example, if you have a hypothesis that a certain type of plant will grow faster in sunlight than in the shade, you can test this by setting up two groups of plants, exposing one group to sunlight and the other group to shade, and then measuring how fast each group grows. If the group of plants that were exposed to sunlight grows faster, this provides evidence that supports your hypothesis. If the plants that were in the shade grow faster, or if there is no significant difference in the growth rates of the two groups, this provides evidence against your hypothesis. Overall, hypothesis testing is a way of using evidence and data to evaluate the likelihood of different ideas and theories.

Null vs alternative hypothesis

In hypothesis testing, a null hypothesis is a prediction that there will be no difference or change in the thing you are studying. For example, if you have a hypothesis that a certain type of plant will grow faster in sunlight than in the shade, the null hypothesis would be that there will be no difference in the growth rates of the plants in sunlight and shade. An alternative hypothesis is a prediction that there will be a difference or change in the thing you are studying. In this case, the alternative hypothesis would be that the plants exposed to sunlight will grow faster than the plants in the shade. To test these hypotheses, you would set up two groups of plants, expose one group to sunlight and the other group to shade, and then measure the growth rates of each group. If the plants in the sunlight grow significantly faster, you reject the null hypothesis in favor of the alternative. If the plants in the shade grow faster, or if there is no significant difference in the growth rates of the two groups, you fail to reject the null hypothesis. This is the "innocent until proven guilty" idea: the null hypothesis is assumed to hold until the data give strong evidence against it, and failing to reject it does not prove that it is true. Overall, the null and alternative hypotheses are two competing predictions that you test using data and evidence.
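As a rough sketch, the plant experiment could be tested in R with a two-sample t-test. The growth numbers are simulated (the means, spread and group sizes are assumptions of mine, not real measurements), and a t-test is only one possible choice of test:

```r
set.seed(1)  # arbitrary seed so the simulated data is repeatable

# simulated growth in cm for 20 plants per group (hypothetical values)
sunlight <- rnorm(20, mean = 12, sd = 1)
shade    <- rnorm(20, mean = 10, sd = 1)

# H0: the mean growth is the same in both groups
# HA: plants in sunlight grow more than plants in shade
result <- t.test(sunlight, shade, alternative = "greater")

result$p.value
```

A small p-value here would lead us to reject the null hypothesis of "no difference"; a large one would mean we fail to reject it.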

Competing claims of inference

Competing claims of inference are different interpretations or explanations for the same data or information. For example, let's say you conduct a survey to find out people's favorite color. You ask 100 people what their favorite color is and record their responses. When you analyze the data, you find that 50 people chose red as their favorite color, 30 chose blue, and 20 chose green.

One possible inference from this data is that red is the most popular color. However, another possible inference is that the three colors are actually about equally popular in the wider population, and that the differences in your sample of 100 people are just due to chance. These are competing claims of inference because they are different interpretations of the same data.

Competing claims of inference can occur when different people or groups have different goals, beliefs, or perspectives. For example, one person might be more interested in the overall popularity of colors, while another person might be more interested in the distribution of colors. These different perspectives can lead to different inferences from the same data.

To resolve competing claims of inference, it is important to carefully analyze the data and consider all relevant information. This can help you identify the most reasonable and supported interpretation of the data and make more accurate and reliable inferences.
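One way to weigh the two claims about the color survey is a chi-square goodness-of-fit test, sketched below in R. Choosing this particular test is my own suggestion; it asks how surprising counts of 50, 30 and 20 would be if all three colors were equally popular:

```r
# survey counts from the example
counts <- c(red = 50, blue = 30, green = 20)

# with a vector of counts, chisq.test() compares them
# against equal probabilities by default
test <- chisq.test(counts)

test$statistic  # chi-square statistic
test$p.value    # probability of counts this uneven if colors were equally popular
```

A very small p-value would speak against the "equally popular" claim and in favor of a real preference.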

Inference in simple words

Inference is the process of using data and information to make predictions, conclusions, or judgments. It is a common task in many fields, such as statistics, machine learning, and data analysis.

In simple terms, inference involves using data and information to make informed guesses or estimates about things we don't directly observe or measure. For example, if you want to know how tall a person is, you might not be able to measure their height directly. In that case, you could use other information, such as their age and gender, to make an inference about their height.

Or a simpler example:

Inference is like making a guess about something you don't know for sure. For example, let's say you want to know how many cookies are in a jar, but you can't see inside the jar. In that case, you might use your knowledge and experience to make an inference about how many cookies are in the jar.

For example, you might look at the size and shape of the jar and compare it to other jars you know have a certain number of cookies. You might also think about how many people are in the house and how many cookies they might eat. Based on this information, you could make a guess or estimate about how many cookies are in the jar.

Inference is useful because it allows us to make predictions and conclusions based on limited information. By carefully analyzing and interpreting data, we can make more accurate and reliable inferences and use them to make better decisions and solve problems. So, even though we might not know everything for sure, inference can help us make better guesses and estimates about things we don't directly observe or measure.

Mosaic plot: an alternative visualization of the relationship between multiple variables

A mosaic plot is a type of graphical display that is used to visualize the relationship between two or more categorical variables. It is similar to a contingency table, but it uses areas rather than counts to show the proportions or frequencies of each combination of variables.

Let's create a mosaic plot for the number of people who believe in UFOs in 6 different cities.

The data has 6 rows, one per city, with the number of inhabitants who believe a lot, are somewhere in the middle, or do not believe at all.

# the mosaic() function comes from the vcd package
# (install it once with install.packages("vcd"))
library(vcd)

# creating a made-up dataset:
# 6 rows (cities) with 3 values each (levels of belief)
data_values <- c(34, 34, 34,
                 20, 35, 5,
                 60, 10, 5,
                 9, 12, 45,
                 65, 34, 20,
                 50, 20, 32)

# creating a table with the above values
data <- as.table(
  matrix(
    data_values,
    # specifying the number of rows
    nrow = 6,
    byrow = TRUE,
    # creating two lists of names, one for rows
    # and one for columns
    dimnames = list(
      city = c('A', 'B', 'C', 'D', 'E', 'F'),
      believers = c('a lot', 'so and so', 'not at all')
    )
  )
)

# plotting the mosaic chart
mosaic(data,
       # shade is used to plot a colored chart
       shade = TRUE,
       # adding a title to the chart
       main = "Believe or not"
)

Example of a segmented barplot

A segmented barplot is a type of barplot that is used to compare the proportions or percentages of different groups within a categorical variable. It is a stacked barplot in which each bar is divided into segments, one per group, and the bars are usually scaled to the same total height (100%) so that the proportions are easy to compare.

For example, let's say you want to compare the favorite colors of people in two different cities. You conduct a survey and ask people in each city what their favorite color is, and you record their responses. You can use a segmented barplot to show the proportions of people who chose each color in each city.

To create the segmented barplot, you first need to calculate the proportions of people who chose each color in each city. For example, let's say that in City A, 30% of people chose red as their favorite color, 20% chose blue, and 50% chose green. In City B, 40% of people chose red, 30% chose blue, and 30% chose green.

To create the segmented barplot, you can draw a bar for each city and divide each bar into segments representing the proportions of people who chose each color. For City A, the bar would have three segments: one for red (30% of the bar), one for blue (20% of the bar), and one for green (50% of the bar). For City B, the bar would have three segments: one for red (40% of the bar), one for blue (30% of the bar), and one for green (30% of the bar).

This will help you compare the proportions of people who chose each color in each city and see if there are any differences between the two cities.

Let's write the code in R for the above example

To create a segmented barplot in R, you can use the ggplot2 package and the geom_col function. First, you need to create a data frame with the proportions of people who chose each color in each city. ggplot2 works best with data in "long" format, so the data frame has one row for each combination of city and color, and one column with the proportion. For example:

library(ggplot2)

# create data frame with proportions
# (one row per city and color)
df <- data.frame(
  city = rep(c("City A", "City B"), each = 3),
  color = rep(c("red", "blue", "green"), times = 2),
  proportion = c(0.3, 0.2, 0.5,
                 0.4, 0.3, 0.3)
)

Next, you can use the ggplot function to create a new plot, and specify the data frame you created as the data for the plot. You can then use the geom_col function once, mapping the color column to the fill argument. geom_col stacks the segments of each bar on top of each other by default, which is what we want here. (Adding a separate geom_col layer for each color would draw the bars on top of each other instead of stacking them.) For example:

# create segmented barplot
ggplot(df, aes(x = city, y = proportion, fill = color)) +
  geom_col() +
  # paint each segment in the color it is named after
  scale_fill_manual(values = c(red = "red", blue = "blue", green = "green"))

This will create a segmented barplot with a bar for each city, and each bar will have three segments representing the proportions of people who chose each color. You can customize the plot further by adding labels, colors, and other visual elements using the various options available in the ggplot2 package.

Contingency table for showing relations between multiple variables

A contingency table is a type of table used to organize and display the relationship between two or more variables. It is commonly used in statistics to show the frequency or count of how often each combination of the variables occurs in a data set. For example, a contingency table could be used to show how many people in a survey chose each combination of favorite color and favorite food. This can help us see if there are any patterns or relationships between the variables.

In simple terms, a contingency table is a way of organizing and displaying data to show the relationship between multiple variables. It can help us see how often each combination of the variables occurs and look for patterns or relationships in the data.

Here is a simple example of a contingency table:

Favorite Color | Favorite Food | Count
Red            | Pizza         | 5
Red            | Burgers       | 2
Blue           | Pizza         | 4
Blue           | Burgers       | 3
Green          | Pizza         | 1
Green          | Burgers       | 4

This contingency table shows the relationship between two variables: favorite color and favorite food. The table shows how many people chose each combination of favorite color and favorite food. For example, the table shows that 5 people chose red as their favorite color and pizza as their favorite food. The table also shows that 3 people chose blue as their favorite color and burgers as their favorite food.

Looking at the table, we can see that slightly more people overall chose pizza as their favorite food (10) than burgers (9). We can also see that more people who chose red as their favorite color chose pizza as their favorite food (5 against 2), while more people who chose green as their favorite color chose burgers (4 against 1). This suggests that there may be a relationship between the two variables, and that people's favorite color and favorite food might not be independent of each other, although with only 19 people we would need a statistical test to say so with confidence.

Finding relative frequencies in a contingency table, in simple words and with an example

To find the relative frequencies in a contingency table, you need to divide the count for each combination of variables by the total number of observations in the data set. This will give you the proportion or percentage of observations that fall into each combination of variables. For example, let's say you have the following contingency table:

Favorite Color | Favorite Food | Count
Red            | Pizza         | 5
Red            | Burgers       | 2
Blue           | Pizza         | 4
Blue           | Burgers       | 3
Green          | Pizza         | 1
Green          | Burgers       | 4

To find the relative frequencies, you first need to find the total number of observations in the data set. In this case, there are 5 + 2 + 4 + 3 + 1 + 4 = 19 observations in the data set. Then, you can divide the count for each combination of variables by the total number of observations to find the relative frequency. For example, the relative frequency for the combination of red and pizza would be 5 / 19 = 0.26. This means that 26% of the observations in the data set are people who chose red as their favorite color and pizza as their favorite food.

You can repeat this process for each combination of variables in the contingency table to find the relative frequency for each combination. This will give you a better understanding of the relationships between the variables and how often each combination occurs in the data.
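The same calculation can be sketched in R with base functions only. The column names color, food and count are names I picked for this illustration:

```r
# the counts from the example table, in long format
survey <- data.frame(
  color = c("Red", "Red", "Blue", "Blue", "Green", "Green"),
  food  = c("Pizza", "Burgers", "Pizza", "Burgers", "Pizza", "Burgers"),
  count = c(5, 2, 4, 3, 1, 4)
)

# cross-tabulate: one row per color, one column per food
tab <- xtabs(count ~ color + food, data = survey)
tab

# relative frequencies: every cell divided by the total of 19
rel <- prop.table(tab)
round(rel, 2)

# the table again with row and column totals added
addmargins(tab)
```

The cell for red and pizza in the rounded table is 0.26, the same 5 / 19 as in the calculation by hand.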

For this kind of data we use a segmented barplot as the visualization.

How bar plots are different from histograms

In simple terms, the main difference between bar plots and histograms is the type of variable they are used to display. Bar plots are used for categorical variables, while histograms are used for continuous variables.

Example:

Imagine you are trying to compare the heights of your classmates. You measure the height of each classmate and write down the numbers. To make it easier to compare the heights, you can put the numbers into groups, or bins, based on their height. For example, you can put all the heights that are between 4 feet and 4 feet 6 inches into one bin, all the heights that are between 4 feet 6 inches and 5 feet into another bin, and so on. This will help you see how many people are in each height range.

To make a histogram, you can draw a bar for each bin. The height of each bar represents the number of people in that bin. So, if you have 5 people who are between 4 feet and 4 feet 6 inches, the bar for that bin would be 5 units tall. This will help you see which height ranges have the most people and which have the fewest.

On the other hand, if you want to compare the favorite colors of your classmates, you can ask each person what their favorite color is and write down their responses. To make a bar plot, you can draw a bar for each color. The height of each bar represents the number of people who chose that color. So, if 3 people chose red as their favorite color, the bar for red would be 3 units tall. This will help you see which colors are the most popular among your classmates.

So, the main difference between a histogram and a bar plot is the type of variable they are used to display. A histogram is used for continuous variables like height, while a bar plot is used for categorical variables like favorite color.
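Here is a short sketch of both plots in base R. The heights (in inches) and the favorite colors are hypothetical values I invented for a class of 12 people:

```r
# hypothetical class data
heights_in <- c(50, 52, 53, 55, 56, 57, 57, 58, 60, 61, 63, 65)
fav_colors <- c("red", "blue", "red", "green", "blue", "red",
                "green", "red", "blue", "red", "green", "blue")

# histogram: numeric values grouped into 6-inch bins
# (4 ft to 4 ft 6 in, 4 ft 6 in to 5 ft, 5 ft to 5 ft 6 in)
h <- hist(heights_in, breaks = seq(48, 66, by = 6),
          main = "Heights of classmates", xlab = "Height (inches)")
h$counts  # number of people in each bin

# bar plot: one bar per category
color_counts <- table(fav_colors)
barplot(color_counts, main = "Favorite colors", ylab = "Number of people")
```

In the histogram the bars touch because the bins sit next to each other on a number line, while in the bar plot each bar stands for a separate category.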

Distribution of a single categorical variable in simple words

A distribution of a single categorical variable shows how frequently each category occurs in the data set. For example, let's say you have a group of people and you want to know their favorite colors. You ask each person what their favorite color is and record their responses. The distribution of favorite colors would show how many people chose each color, such as how many people chose red, how many chose blue, etc. This can help you see which colors are the most popular among the group.

And to visualize it we use a bar plot.

Interquartile range

The interquartile range (IQR) is a measure of variability or dispersion in a set of data. It is calculated by finding the difference between the upper and lower quartiles of the data set, and it is often used to summarize the spread of the data. In other words, the interquartile range tells us how wide the middle 50% of the values in the data set is.

A very good and simple example for that:

Let's say you have a group of five friends and you want to know how much their ages vary from each other. You ask each of your friends their age, and you write down their ages in order from youngest to oldest: 5, 10, 15, 20, 25. The median age is 15, because it is the middle number in the list. The lower quartile (Q1) is the value that marks off the lowest quarter of the data. One common way to find it is to take the median of the lower half of the list, counting the overall median as part of that half: 5, 10, 15, which gives 10. The upper quartile (Q3) is found the same way from the upper half: 15, 20, 25, which gives 20. The interquartile range is Q3 minus Q1, which is 20 - 10 = 10. This means that the middle 50% of your friends' ages span 10 years. (Textbooks and software use slightly different rules for quartiles, so for small data sets you may see slightly different values.)
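The ages example can be checked in R with the built-in functions. R's default quartile rule gives 10 and 20 for this data; other quartile conventions can give slightly different values:

```r
ages <- c(5, 10, 15, 20, 25)

median(ages)                    # the middle value
quantile(ages, c(0.25, 0.75))   # lower and upper quartiles (R's default rule)
IQR(ages)                       # upper quartile minus lower quartile
```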

p value in statistics

The p-value is a number that is used in statistics to help you decide whether or not a certain result is statistically significant. It is the probability of getting a result at least as extreme as the one you observed if there were no real effect and only chance were at work.

Here's an example: imagine you are doing a study to see if a certain type of treatment is effective at reducing pain. You have a group of people who receive the treatment, and another group who do not receive the treatment (this is called the control group). After the treatment, you measure the amount of pain that each person feels, and you compare the results. If you find that the people who received the treatment have less pain than the people in the control group, you might conclude that the treatment is effective.

However, it's possible that the difference in the amount of pain between the two groups happened by chance, rather than because of the treatment. To figure this out, you can use the p-value. The p-value is a number between 0 and 1, and it tells you how likely it would be to see a difference at least as large as yours if the treatment actually had no effect. If the p-value is small (less than 0.05, for example), a difference like yours would be unusual by chance alone, so you have evidence that the treatment is effective. But if the p-value is large (greater than 0.05), a difference like yours could easily happen by chance, and you cannot conclude that the treatment is effective.

So, in simple terms, the p-value is a way to help you decide whether or not a certain result is significant: it tells you how surprising your result would be if only chance were at work.
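To make "happened by chance" concrete, here is a sketch of a permutation test in R. The pain scores are invented for illustration, and a permutation test is just one way to get a p-value (it is my choice here, not something the study above prescribes):

```r
set.seed(123)  # arbitrary seed so the shuffles are repeatable

# hypothetical pain scores (0 = no pain, 10 = worst pain)
treatment <- c(3, 4, 2, 5, 3, 4, 2, 3)
control   <- c(6, 5, 7, 6, 5, 7, 6, 8)

# observed difference in mean pain (negative = less pain with treatment)
observed_diff <- mean(treatment) - mean(control)

# if the treatment did nothing, the group labels would be arbitrary,
# so shuffle the scores many times and recompute the difference
all_scores <- c(treatment, control)
n_treat <- length(treatment)

perm_diffs <- replicate(10000, {
  shuffled <- sample(all_scores)
  mean(shuffled[1:n_treat]) - mean(shuffled[-(1:n_treat)])
})

# p-value: share of shuffles with a difference at least as extreme as observed
p_value <- mean(perm_diffs <= observed_diff)
p_value
```

The p-value is simply the share of "chance only" shuffles that look at least as good for the treatment as the real data.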

Robust statistics and what they are good for

Robust statistics are a type of statistics that are good at dealing with "outliers" in data. Outliers are values that are much larger or smaller than the other values in the data.

Imagine you have a set of numbers like this: 1, 2, 3, 4, 5, 1000. The numbers 1, 2, 3, 4, and 5 are all close to each other, but the number 1000 is much larger than the others. This is an outlier, and it can have a big effect on the average (or mean) of the numbers. For example, the mean of this set is about 169.2, because you get this by adding up all the numbers (1015) and dividing by 6. But if you remove the outlier (the number 1000), the mean becomes 3, because you get this by adding up the numbers 1, 2, 3, 4, and 5 and dividing by 5. So the outlier has a big effect on the mean.

Robust statistics are good at dealing with outliers because they are not very sensitive to the effects of a few extreme values. This means that they can still give you a good estimate of the "center" of the data, even if there are some outliers. So if you have a set of numbers with some outliers, using robust statistics can give you a better idea of what the data is really like.

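So in cases like these it is better to use the median, which is a robust statistic. The median of the set above is 3.5 with the outlier and 3 without it, so the outlier barely changes it.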
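A quick check in R shows how differently the mean and the median react to the outlier:

```r
x <- c(1, 2, 3, 4, 5, 1000)

mean(x)       # pulled far up by the outlier
median(x)     # stays near the bulk of the data

# the same statistics without the outlier (the 6th value)
mean(x[-6])
median(x[-6])
```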

Understanding variability in statistics in simple words

In statistics, variability refers to how much the values in a set of data differ from one another. In other words, it measures how spread out the numbers are.

Imagine you have a set of numbers like this: 1, 2, 3, 4, 5. The numbers in this set are all close to each other, so there isn't much variability. Now imagine you have a different set of numbers: 1, 10, 100, 1000, 10000. These numbers are much further apart from each other, so there is a lot more variability in this set.

We can use different statistics, like the range, the interquartile range and the standard deviation, to measure variability in a set of numbers. The mean is a measure of the "center" of the set, and the standard deviation is a measure of how spread out the numbers are from the mean. So, if the standard deviation is large, it means that the numbers in the set are far from the mean and there is a lot of variability. If the standard deviation is small, it means that the numbers in the set are close to the mean and there is not much variability.

Variability is important in statistics because it helps us understand how much the values in a set of data vary from one another, and this can tell us a lot about the data and what we can learn from it.

Standard deviation? In simple words

Standard deviation is a measure of how spread out a set of numbers is. Imagine you have a bunch of numbers, and you want to know how far they are from the average (or mean) of all the numbers. Standard deviation is a way to measure this.

Here's an example: let's say you have the following set of numbers: 1, 2, 3, 4, 5. The mean of this set is 3 (you get this by adding all the numbers up and dividing by 5). Now let's say you want to know how far each number is from the mean. To find the standard deviation, you would do the following:

How to find it:

1. For each number, subtract the mean from the number. This will give you the difference between the number and the mean. For example, for the number 1, the difference would be 1 - 3 = -2.

2. Square each of the differences you calculated in step 1. This will give you the squared difference for each number. For example, the squared difference for the number 1 would be (-2)^2 = 4.

3. Add up all the squared differences. This will give you the sum of the squared differences. In this example it is 4 + 1 + 0 + 1 + 4 = 10.

4. Divide the sum of the squared differences by one less than the number of values in the set (here 5 - 1 = 4). This will give you the variance, which is 10 / 4 = 2.5. Dividing by n - 1 instead of n is the usual rule when the numbers are a sample, and it is also what the sd function in R does.

5. Take the square root of the variance. This will give you the standard deviation.

What it means:

In this example, the standard deviation would be 1.58. This means that the numbers in the set are typically about 1.58 away from the mean. A smaller standard deviation means that the numbers are closer to the mean, while a larger standard deviation means that the numbers are more spread out.
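The calculation can be followed step by step in R and compared with the built-in sd function. Note that sd divides by n - 1 (the sample version); dividing by n instead would give about 1.41 for this data:

```r
x <- c(1, 2, 3, 4, 5)

deviations <- x - mean(x)             # difference of each number from the mean
squared    <- deviations^2            # squared differences
sum_sq     <- sum(squared)            # sum of squared differences
variance   <- sum_sq / (length(x) - 1)  # divide by n - 1 (sample variance)
sd_manual  <- sqrt(variance)          # standard deviation

sd_manual
sd(x)                                 # built-in function, same result

# for comparison: dividing by n instead of n - 1
sqrt(sum_sq / length(x))
```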

Example: (this is from an exercise in the lesson)

Which of the below data sets has the lowest standard deviation? You do not need to calculate the exact standard deviations to answer this question.

100, 100, 100, 100, 100, 100, 101

0, 25, 50, 100, 125, 150, 1000

0, 1, 3, 3, 3, 5, 6

0, 1, 2, 3, 4, 5, 6

The data set with the lowest standard deviation is 100, 100, 100, 100, 100, 100, 101. Standard deviation is a measure of how much the values in a data set vary from the mean. In this case, all of the values are very close to each other and the mean, so the standard deviation is very low. In contrast, the other data sets have a wider range of values and larger deviations from the mean, so their standard deviations are higher.
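Although the exercise does not require it, the answer can be double-checked in R. The names a to d are labels I added for the four data sets:

```r
sets <- list(
  a = c(100, 100, 100, 100, 100, 100, 101),
  b = c(0, 25, 50, 100, 125, 150, 1000),
  c = c(0, 1, 3, 3, 3, 5, 6),
  d = c(0, 1, 2, 3, 4, 5, 6)
)

# standard deviation of every data set
sds <- sapply(sets, sd)
round(sds, 2)

# name of the data set with the smallest standard deviation
names(which.min(sds))
```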

Left skewed mean and median in statistics

What is the mean and median in left skewed data (or right)

In statistics, the mean and median are two different ways of describing the "middle" of a set of numbers. The mean is the average of all the numbers, which you can find by adding them all up and then dividing by the total number of numbers. The median is the number that is in the middle of the set when the numbers are listed in order from least to greatest.

A "left skewed" distribution is one where most of the numbers are bunched together on the right side (the larger numbers), with a long tail of a few much smaller numbers stretching out to the left. This means that the mean and the median will be different in a left skewed distribution. The mean will be pulled to the left (toward the smaller numbers) because it is affected by every number in the distribution, including the extreme ones. The median, on the other hand, stays close to where most of the numbers are, because it only depends on the middle of the ordered list. So in a left skewed distribution the mean is smaller than the median, and in a right skewed distribution it is the other way around.

Here's an example: let's say we have the following set of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The mean of this set is 5.5 (you get this by adding all the numbers up and dividing by 10), and the median is also 5.5 (the average of the two middle numbers, 5 and 6). This set is symmetric, so the mean and the median agree. Now look at a different set: 1, 6, 7, 8, 8, 9, 9, 10, 10. Most of the numbers are between 6 and 10, but the 1 sits far out on the left. The mean is about 7.56 (68 divided by 9), while the median is 8 (the middle number of the 9 numbers in the set). This is an example of a left skewed distribution: the small value on the left pulls the mean below the median.
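A small check in R, using a made-up left skewed set and its mirror image as a right skewed set, shows which way the mean is pulled in each case:

```r
# hypothetical left skewed data: most values are high, one is far to the left
left_skewed <- c(1, 6, 7, 8, 8, 9, 9, 10, 10)

mean(left_skewed)    # pulled down by the small value
median(left_skewed)  # stays with the bulk of the data

# mirror image of the same data: a right skewed set
right_skewed <- 11 - left_skewed

mean(right_skewed)   # pulled up by the large value
median(right_skewed)
```

In the left skewed set the mean is below the median, and in the right skewed set it is above.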

Visualizing Numerical Data

The relationship between two numerical variables in statistics can be described in terms of its direction, shape, and strength. The direction of the relationship tells us whether the two variables tend to move in the same direction (positive relationship) or in opposite directions (negative relationship). The shape of the relationship tells us whether the relationship is linear (a straight line) or nonlinear (curved). And the strength of the relationship tells us how closely the two variables are related to each other.

Here's an example: let's say that you are studying the relationship between a student's test scores and the number of hours they spend studying. You might find that there is a positive, linear relationship between these two variables, which means that as the number of hours spent studying increases, the test scores also tend to increase. In other words, the more time a student spends studying, the better their test scores are likely to be. This relationship is considered to be strong because there is a clear and consistent pattern between the two variables.
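The studying example can be sketched in R with a scatterplot and the correlation coefficient. The hours and scores below are invented numbers for eight hypothetical students:

```r
# hypothetical data: hours studied and test score for 8 students
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(52, 58, 61, 67, 70, 78, 80, 88)

# scatterplot: shows the direction and shape of the relationship
plot(hours, scores,
     xlab = "Hours spent studying", ylab = "Test score",
     main = "Study time and test scores")

# correlation coefficient: sign gives the direction,
# closeness to 1 or -1 gives the strength of a linear relationship
r <- cor(hours, scores)
r
```

A value of r close to 1 would indicate a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 a weak one.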

Explanatory vs Response variables

Start from the basics: Explanatory vs response variable.

So, let's say you are 10 years old and you want to figure out how much candy you will get on Halloween based on how many houses you visit. In this scenario, the number of houses you visit would be the explanatory variable and the amount of candy you get would be the response variable. The explanatory variable is the one that helps explain or predict the values of the response variable.

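Another way to think about it is that the explanatory variable is the suspected "cause" and the response variable is the "effect." So, in our example, visiting more houses is the cause and getting more candy is the effect. Keep in mind that seeing a relationship in data does not by itself prove that one variable causes the other.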

We usually put the explanatory variable on the x-axis and the response variable on the y-axis.

Tuesday, December 13, 2022

New course, new findings

And of course I have to learn more. Coursera is the place to be for me because I like their way of teaching, so I enrolled in a course named Data Analysis with R Specialization.

I am in the first chapters and I already feel that it is clearing up many of the questions I had after I finished the Data Analytics certification. There are many new terms under the umbrella of Statistics, a field I knew nothing about before. Learning about Statistics is giving me new insights into analysis, and most importantly I am figuring out the right questions to ask in an analysis.

Monday, December 12, 2022

ggcorr: correlation matrices with ggplot2

A new amazing discovery in my ggplot trial and error was ggcorr.

ggcorr is available through the GGally package:

install.packages("GGally")

So what is it all about? Correlation!

How the variables are related to each other.

Imagine you have a bunch of different things that you want to compare, like how tall people are, how much they weigh, and how old they are. The ggcorr function will create a picture, a grid with one cell for each pair of variables, that shows which things are related to each other and how strong those relationships are. For example, if taller people tend to weigh more, the cell where height and weight meet is colored red by default (a positive correlation), while a negative relationship is shown in blue. The deeper the color, the stronger the relationship. If you ask for circles instead of tiles (geom = "circle"), a bigger circle also means a stronger relationship. This can help us understand how different things are connected and can even help us make predictions about what might happen in the future.
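Here is a minimal sketch of how this might look in R. The data frame is hypothetical (eight invented people), and the plot is only drawn if the GGally package is installed:

```r
# hypothetical measurements for eight people
people <- data.frame(
  height_cm = c(150, 160, 165, 170, 175, 180, 185, 190),
  weight_kg = c(50, 58, 63, 68, 72, 80, 85, 92),
  age_years = c(34, 21, 45, 29, 52, 38, 26, 41)
)

# the numbers behind the picture: pairwise correlations
round(cor(people), 2)

# draw the correlation matrix, with the values printed in the cells
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggcorr(people, label = TRUE))
}
```

Passing geom = "circle" to ggcorr would draw circles instead of colored tiles.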

What I learned today: Continuous scales and Discrete scales

I am doing this Google Playstore analysis. After the cleaning part, which took me about 2 days, I am now at the visualization of the results.

So, a lot of ggplot: graphs with 1 variable, with 2 and now with 3. In this process I am learning through my mistakes, which are many... But anyway, I keep the good part: I learn something new from every mistake I make. And today I am learning about the two kinds of scales: continuous and discrete. So let's explore the terms:

In ggplot, a scale is like a rule that tells the computer how to draw a graph. There are two types of scales: continuous scales and discrete scales.

A continuous scale is used when the data has numbers that can be any value within a range. For example, if you have a column of data that shows how tall people are, you would use a continuous scale (such as scale_x_continuous) so that each height is placed at the right position along the axis.

A discrete scale is used when the data has values that are specific, rather than any value within a range. For example, if you have a column of data that shows the favorite color of different people, you would use a discrete scale to draw a bar graph that shows how many people like each color.

Overall, scales help the computer understand how to draw a graph and show the data in a clear way.
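A small sketch of the two kinds of scales in ggplot2 is shown below. The data frame is hypothetical, and the plots are only drawn if ggplot2 is installed:

```r
# hypothetical data: one numeric column and one categorical column
people <- data.frame(
  height_cm = c(150, 160, 165, 170, 175, 180),
  fav_color = c("red", "blue", "red", "green", "blue", "red")
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # numeric column on the x-axis: continuous scale
  p_height <- ggplot(people, aes(x = height_cm)) +
    geom_histogram(binwidth = 10) +
    scale_x_continuous(name = "Height (cm)")

  # categorical column on the x-axis: discrete scale
  p_color <- ggplot(people, aes(x = fav_color)) +
    geom_bar() +
    scale_x_discrete(name = "Favorite color")

  print(p_height)
  print(p_color)
}
```

ggplot2 normally picks the right kind of scale on its own; a typical error appears when a discrete scale is forced onto a numeric column or the other way around.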

Friday, December 9, 2022

Asking the right questions

Something that I find very difficult, at least now at the beginning of this journey, is finding the right questions to ask of a dataset so that I can get results. Asking the right questions when analyzing a dataset is crucial for gaining insights and understanding the data. I am trying to follow these guidelines:

Start with the basics: What are the columns in the dataset and what do they represent? What is the format and type of data in each column? What are the dimensions of the dataset (number of rows and columns)?
Identify the research question or problem you are trying to solve: What do you want to learn from the dataset? What information do you need to answer your research question or solve your problem?
Identify the relevant variables and their relationships: Which columns in the dataset are relevant to your research question or problem? How are these variables related to each other and to the outcome you are trying to predict or explain?
Explore the data and look for patterns and trends: What are the patterns and trends in the data? Are there any outliers or anomalies that need to be addressed? Are there any missing or incomplete values in the data?
Ask specific and focused questions: What do you want to know about the data? What are the key variables and relationships you want to investigate? How will you measure and analyze the data to answer your questions?

Thursday, December 8, 2022

Analyze more

1. Alpha and p values for data analytics

The alpha and p values are used in statistics to help us decide whether a result could plausibly be down to chance. The alpha value (the significance level) is a cutoff that we choose before running the test: it is the risk we are willing to accept of rejecting the null hypothesis when it is actually true. For example, if we set the alpha value to 0.05, we accept a 5% chance of a false alarm, that is, of seeing an effect that is not really there.

The p value is the probability of getting results at least as extreme as ours just by chance, assuming that the null hypothesis (the "nothing is going on" assumption) is true. So if we're testing whether boys are taller than girls on average, and we get a p value of 0.01, that means there would be only a 1% chance of seeing a height difference this large if boys and girls were actually the same height on average.

If the p value is less than the alpha value, we call the result statistically significant and reject the null hypothesis. So in the example above, if the alpha value is 0.05 and the p value is 0.01, we can conclude that boys are probably taller than girls on average, because a difference this large would be unlikely to show up by chance alone.

In short, the alpha value is the cutoff we choose in advance, and the p value is the probability of seeing results at least as extreme as ours if the null hypothesis were true. If the p value is less than the alpha value, we reject the null hypothesis. Note that the p value is not the probability that the null hypothesis is true.
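
To see how this looks in practice, here is a minimal sketch of the height example in R. The heights are simulated with made-up means and spreads (my assumption, not real measurements), and I use t.test() as one possible test for comparing two averages.

```r
set.seed(42)                                # make the simulated data reproducible

# Simulated heights in cm; the means and standard deviations are invented
boys  <- rnorm(50, mean = 172, sd = 7)
girls <- rnorm(50, mean = 168, sd = 7)

alpha <- 0.05                               # cutoff chosen before looking at the data

# Null hypothesis: boys are not taller than girls on average
result <- t.test(boys, girls, alternative = "greater")
p_value <- result$p.value

reject_null <- p_value < alpha              # TRUE means "statistically significant"
print(p_value)
print(reject_null)
```

The decision rule is the single comparison p_value < alpha; everything else is just preparing the data and running the test.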

2. Regression analysis

Regression analysis is a statistical method used to understand the relationship between two or more variables. It allows us to examine the strength and direction of the relationship between the variables, and to make predictions about the values of one variable based on the values of the other variables.

For example, let's say we want to understand the relationship between the amount of time that people spend exercising and their weight. We could use regression analysis to analyze data from a sample of people, looking at their exercise habits and their weight. The regression analysis would allow us to see if there is a relationship between exercise and weight, and if there is, how strong the relationship is. We could then use this information to make predictions about how changes in exercise habits might affect weight.
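
The exercise and weight example could be sketched in R roughly like this. The data is simulated (the numbers and the column names exercise_hours and weight_kg are my assumptions), and lm() fits a simple linear regression.

```r
set.seed(1)                                         # reproducible simulated data

# Invented data: weekly exercise hours and body weight in kg
exercise_hours <- runif(100, min = 0, max = 10)
weight_kg <- 85 - 1.5 * exercise_hours + rnorm(100, sd = 5)
people <- data.frame(exercise_hours, weight_kg)

# Fit weight as a linear function of exercise
model <- lm(weight_kg ~ exercise_hours, data = people)

slope <- coef(model)[["exercise_hours"]]            # direction and size of the relationship
r_squared <- summary(model)$r.squared               # strength of the relationship (0 to 1)

# Predicted weight for someone who exercises 5 hours a week
predicted <- predict(model, newdata = data.frame(exercise_hours = 5))

print(slope)
print(r_squared)
print(predicted)
```

A negative slope means more exercise goes together with lower weight in this data. Keep in mind that a regression on its own shows a relationship, not that one variable causes the other.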

1. Descriptive statistics

Descriptive statistics are a set of techniques used to summarize and describe the characteristics of a dataset. Some common descriptive statistics include:

Mean: The arithmetic average of a set of numbers, calculated by adding up all the numbers and dividing by the number of numbers.
Median: The middle value in a set of numbers, when the numbers are arranged in order from smallest to largest.
Mode: The most frequently occurring value in a set of numbers.
Range: The difference between the largest and smallest values in a set of numbers.
Standard deviation: A measure of how spread out the values in a dataset are, calculated by taking the square root of the average of the squared differences between each value and the mean. (For a sample, R's sd() function divides by n - 1 rather than n.)

In R, the summary() function gives a quick overview of numeric data: the minimum, first quartile, median, mean, third quartile, and maximum. It does not report the mode or the standard deviation. For example, the following code applies summary() to the "x" vector:

x <- c(1, 2, 3, 4, 5)

summary(x)

The output of the summary() function for the "x" vector is shown below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       3       3       4       5

In this example, the mean of the "x" vector is 3 and the median is 3. The range is 4 (5 - 1), which follows from the Min. and Max. columns. There is no mode, because every value occurs exactly once. The standard deviation has to be computed separately with sd(x), which gives 1.581.

Descriptive statistics are useful tools for summarizing and describing the characteristics of a dataset. The summary() function in R is a convenient way to get the mean, median, quartiles, minimum, and maximum in one call. The standard deviation comes from sd(), and the mode has to be worked out separately because base R has no built-in function for it.
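
For completeness, here is a small sketch that computes all five statistics from the list for the same "x" vector. The stat_mode() helper is my own hypothetical function, since base R does not provide one for the statistical mode.

```r
x <- c(1, 2, 3, 4, 5)

# Hypothetical helper: returns the most frequent value,
# or NA when every value occurs equally often (no mode).
# If several values tie for most frequent, the first one is returned.
stat_mode <- function(v) {
  values <- unique(v)
  counts <- tabulate(match(v, values))
  if (length(values) > 1 && all(counts == max(counts))) {
    return(NA)
  }
  values[which.max(counts)]
}

x_mean   <- mean(x)            # arithmetic average
x_median <- median(x)          # middle value
x_mode   <- stat_mode(x)       # most frequent value (NA here)
x_range  <- diff(range(x))     # largest minus smallest
x_sd     <- sd(x)              # sample standard deviation (divides by n - 1)

print(c(mean = x_mean, median = x_median, mode = x_mode, range = x_range, sd = x_sd))
```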

An example:

A real case study in R using descriptive statistics

A real-world example of using descriptive statistics in R is analyzing the salary data for a company. The following code reads the data and uses the summary() function to get the minimum, quartiles, median, mean, and maximum of the salaries:

# read.csv() and summary() are base R, so no extra package is needed

salary_data <- read.csv("salary_data.csv")

summary(salary_data$salary)

The output of the summary() function for the salary data is shown below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
45000.0 55000.0 65000.0 65712.5 75000.0 85000.0

In this example, the mean salary is $65,712.50, the median salary is $65,000.00, and the range of salaries is $40,000.00 ($85,000.00 - $45,000.00). The summary() function does not report a mode or a standard deviation. There is no most frequently occurring salary in this data, and the standard deviation, computed separately with sd(salary_data$salary), is $10,098.99.

These descriptive statistics can help the company understand the characteristics of their salary data, such as the average salary and the spread of salaries within the company. The company can use this information to make informed decisions about salary increases and salary ranges for different job positions.

Analyze data in R

There are several ways to analyze data in R, depending on the type of data and the analysis you want to perform. Some common methods for analyzing data in R include:

Descriptive statistics: You can use the summary() function to calculate basic descriptive statistics such as the mean, median, and quartiles for numeric data, and the sd() function for the standard deviation. You can also use the table() function to create frequency tables for categorical data.
Visualization: You can use the ggplot2 package to create visualizations such as scatter plots, bar charts, and histograms to explore the relationships and patterns in your data.
Regression analysis: You can use the lm() function to fit a linear regression model to your data and predict the response variable based on the predictor variables.
Clustering: You can use the kmeans() function to cluster your data into groups based on similarity, and use the plot() function to visualize the clusters.
Time series analysis: You can use the auto.arima() function from the "forecast" package to fit a time series model to your data, and the forecast() function from the same package to make predictions about future values.
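
As a small illustration of two items from the list above, here is a sketch that uses only base R and the built-in iris dataset (my choice of example data). It builds a frequency table with table() and clusters the measurements with kmeans().

```r
# Frequency table for a categorical column
species_counts <- table(iris$Species)
print(species_counts)

# k-means starts from random centers, so fix the seed for repeatable results
set.seed(123)
clusters <- kmeans(iris[, 1:4], centers = 3)

# Compare the clusters that were found with the known species
print(table(cluster = clusters$cluster, species = iris$Species))
```

The number of clusters (3) is an assumption based on knowing that iris contains three species. With real data you would usually have to try several values.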

Days of activity...

This is a common case as far as I can see: make a table with an "activity" variable (a number from 1 to 5) and a date, turn each date into its day name, and then summarize how many days fall into each activity level.

In R, you can use the wday() function from the "lubridate" package to transform a column of dates into a column of week day names. By default wday() returns the day as a number, so you need label = TRUE to get names and abbr = FALSE to get the full name. You can then use the dplyr package to create a table with variables "activity" and "weekday", where "activity" is a number from 1 to 5 and "weekday" is the week day name for the corresponding date. Finally, you can use the summarize() function from the "dplyr" package to count the number of days for each activity.

For example, the following code uses the wday() and summarize() functions to create a table with variables "activity" and "weekday", and to summarize the number of days for each activity:

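library(lubridate)
library(dplyr)

mydata <- data.frame(
  date = as.Date(c("2022-01-01", "2022-02-01", "2022-03-01", "2022-03-02", "2022-03-03")),
  activity = c(1, 2, 3, 4, 5)
)

mydata %>%
  # label = TRUE returns day names instead of numbers; abbr = FALSE spells them out
  mutate(weekday = wday(date, label = TRUE, abbr = FALSE)) %>%
  group_by(activity, weekday) %>%
  # n() counts the rows (days) in each activity/weekday group
  summarize(count = n(), .groups = "drop")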

In this example, the wday() function is applied to each value in the "date" column of the "mydata" data frame, and the resulting week day names are stored in the new "weekday" column. Then, the group_by() function groups the data by the "activity" and "weekday" variables, and the summarize() function counts the number of days for each group.

The resulting table shows the number of days for each activity and week day name, as shown below:

activity  weekday    count
1         Saturday   1
2         Tuesday    1
3         Tuesday    1
4         Wednesday  1
5         Thursday   1

In this example, there is one day with activity 1 on Saturday, one day with activity 2 on Tuesday, one day with activity 3 on Tuesday, one day with activity 4 on Wednesday, and one day with activity 5 on Thursday.
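
If you want to double-check the week days without any packages, here is a minimal base R sketch. It is an alternative to the lubridate approach, not the same code. The day_names vector is my own lookup table, used so that the result does not depend on the computer's language settings.

```r
dates <- as.Date(c("2022-01-01", "2022-02-01", "2022-03-01", "2022-03-02", "2022-03-03"))
activity <- c(1, 2, 3, 4, 5)

# as.POSIXlt()$wday numbers the days from 0 (Sunday) to 6 (Saturday),
# so adding 1 turns it into an index into a fixed vector of English names
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weekday <- day_names[as.POSIXlt(dates)$wday + 1]

result <- data.frame(activity, weekday)
print(result)

# How many of the dates fall on each week day
days_per_weekday <- table(weekday)
print(days_per_weekday)
```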

and then...

Transform the activity numbers from the table above into labels: 1 = very low activity, 2 = low activity, 3 = medium activity, 4 = good activity, 5 = strong activity.

In R, you can use the mutate() function from the "dplyr" package to transform the values of a column in a data frame. For example, the following code uses the mutate() function to transform the "activity" column of the "mydata" data frame from numbers 1 to 5 to the corresponding activity level:

library(dplyr)

mydata <- data.frame(activity = c(1, 2, 3, 4, 5))

mydata %>%
  mutate(activity = case_when(
    activity == 1 ~ "very low activity",
    activity == 2 ~ "low activity",
    activity == 3 ~ "medium activity",
    activity == 4 ~ "good activity",
    activity == 5 ~ "strong activity"
  ))

In this example, the mutate() function uses the case_when() function to replace the values in the "activity" column with the corresponding activity level. The resulting data frame has the transformed "activity" column, as shown below:

activity
very low activity
low activity
medium activity
good activity
strong activity

In conclusion, the mutate() function from the "dplyr" package is a useful tool for transforming the values of a column in a data frame. You can use the case_when() function within the mutate() function to specify the transformation rules for each value in the column.
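
Another option, sketched here in base R, is to turn the numbers into an ordered factor. This is a different technique from case_when(): the labels stay in their logical order (very low to strong) instead of being treated as plain text. The variable names are my own.

```r
activity <- c(1, 2, 3, 4, 5)

level_labels <- c("very low activity", "low activity", "medium activity",
                  "good activity", "strong activity")

# levels = the original numbers, labels = the text shown for each number
activity_level <- factor(activity, levels = 1:5, labels = level_labels, ordered = TRUE)

print(activity_level)

# Because the factor is ordered, comparisons between levels work
print(activity_level[5] > activity_level[1])
```

An ordered factor is handy later on, because tables and plots will then list the activity levels from lowest to highest rather than alphabetically.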

Putting both steps together: in R, you can use the wday() function from the "lubridate" package, with label = TRUE and abbr = FALSE, to transform a column of dates into a column of week day names. You can then use the dplyr package to create a table with variables "activity", "weekday", and "count", where "activity" is the activity level, "weekday" is the week day name for the corresponding date, and "count" is the number of days for each activity and week day name.

For example, the following code uses the wday() and summarize() functions to create a table with variables "activity", "weekday", and "count", and to summarize the number of days for each activity and week day name:

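library(lubridate)
library(dplyr)

mydata <- data.frame(
  date = as.Date(c("2022-01-01", "2022-02-01", "2022-03-01", "2022-03-02", "2022-03-03")),
  activity = c(1, 2, 3, 4, 5)
)

mydata %>%
  mutate(
    # label = TRUE returns day names instead of numbers; abbr = FALSE spells them out
    weekday = wday(date, label = TRUE, abbr = FALSE),
    # replace each activity number with its text label
    activity = case_when(
      activity == 1 ~ "very low activity",
      activity == 2 ~ "low activity",
      activity == 3 ~ "medium activity",
      activity == 4 ~ "good activity",
      activity == 5 ~ "strong activity"
    )
  ) %>%
  group_by(activity, weekday) %>%
  # n() counts the rows (days) in each activity/weekday group
  summarize(count = n(), .groups = "drop")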

The resulting table shows the number of days for each activity level and week day name. The rows are sorted alphabetically by activity label:

activity           weekday    count
good activity      Wednesday  1
low activity       Tuesday    1
medium activity    Tuesday    1
strong activity    Thursday   1
very low activity  Saturday   1

In this example, there is one day with very low activity on Saturday, one day with low activity on Tuesday, one day with medium activity on Tuesday, one day with good activity on Wednesday, and one day with strong activity on Thursday.

In conclusion, the wday() and summarize() functions from the "lubridate" and "dplyr" packages are useful tools for creating a table with variables "activity", "weekday", and "count", and for summarizing the number of days for each activity and week day name in R.