Sentiment Analysis Using NLTK – A Practical Approach


This article was published as a part of the Data Science Blogathon

Introduction

The ultimate goal of this blog is to predict the sentiment of a given text using Python, where we use NLTK (the Natural Language Toolkit), a Python package made especially for text-based analysis. So with a few lines of code we can easily predict whether a sentence or a review (used in this blog) is positive or negative.

Before moving on to the implementation directly let me brief the steps involved to get an idea of the analysis approach. These are namely:

1. Importing Necessary Modules

2. Importing Dataset

3. Data Preprocessing and Visualisation

4. Model Building

5. Prediction

So let's move on, focusing on each step in detail.

1. Importing Necessary Modules:

As we all know, it is necessary to import all the modules which we are going to use initially. So let's do that as the first step of our hands-on.

Here we are importing all the basic modules required, namely numpy, pandas, matplotlib, seaborn and BeautifulSoup, each having its own use case. Though we are going to use a few other modules beyond these, let's understand them as we use them.
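A minimal import sketch consistent with the aliases (np, pd, plt, sns) used in the rest of the post:

import numpy as np                  # numerical operations
import pandas as pd                 # data loading and manipulation
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # statistical visualisation
from bs4 import BeautifulSoup       # stripping HTML tags from the reviews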

2. Importing Dataset:

I had actually downloaded the dataset from Kaggle quite a long time back, hence I don't have the link to the dataset. To give everyone access to the dataset as well as the code, I will put the GitHub repo link. Now, to import the dataset we have to use the pandas method 'read_csv' followed by the file path.
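A minimal sketch; the file name below is only a placeholder for wherever you saved the downloaded CSV:

# replace 'Reviews.csv' with the path to your copy of the dataset
data = pd.read_csv('Reviews.csv')
data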

If we print the dataset, we can see that there are '568454 rows × 10 columns', which is quite big.

We see that there are 10 columns namely ‘Id’, ‘HelpfulnessNumerator’, ‘HelpfulnessDenominator’, ‘Score’ and ‘Time’ as datatype int64 and ‘ProductId’, ‘UserId’, ‘ProfileName’, ‘Summary’, ‘Text’ as object datatype. Now let’s move on to the third step i.e. Data Preprocessing and Visualisation.



3. Data Preprocessing and Visualisation:

Now that we have access to the data, we have to clean it. Using the 'isnull().sum()' method we can easily find the total number of missing values in the dataset.

data.isnull().sum()

If we execute the above code as a cell, we find that there are 16 and 27 null values in the 'ProfileName' and 'Summary' columns respectively. Now, we have to either replace the null values with a measure of central tendency or remove the rows which contain them. With such a vast number of rows, removing just the 43 rows that contain null values wouldn't affect the overall accuracy of the model. Hence it is wise to remove those 43 rows using the 'dropna' method.

data = data.dropna()

Note that I have updated the old data frame rather than creating a new variable to store the cleaned data frame. Now, when we check the data frame again, we find that there are 568411 rows and the same 10 columns, meaning the 43 rows which had null values have been dropped and our dataset is now clean. Proceeding further, we have to preprocess the data in such a way that it can be used directly by the model.

For preprocessing, we use the 'Score' column in the data frame, which has scores ranging from '1' to '5', where '1' means a negative review and '5' means a positive review. It is better to re-map the score to the range '0' to '2', where '0' means a negative review, '1' means a neutral review and '2' means a positive review. It is similar to encoding in Python, but here we don't use any in-built function; instead, we explicitly run a for loop, create a new list and append the values to it.

a = []
for i in data['Score']:
    if i < 3:
        a.append(0)   # scores 1 and 2 -> negative
    elif i == 3:
        a.append(1)   # score 3 -> neutral
    else:
        a.append(2)   # scores 4 and 5 -> positive

Scores below 3 are considered negative reviews and are appended to the list as '0', a score of 3 is appended as '1' (neutral), and scores above 3 as '2' (positive). Now, if we count the values present in the list 'a' following the nomenclature above, we find that there are 82007 negative reviews, 42638 neutral reviews, and 443766 positive reviews. We can clearly see that roughly 78% of the reviews in the dataset are positive and the remaining are either negative or neutral. This can be visualized and understood more clearly with the help of a countplot from the seaborn library.

sns.countplot(a)
plt.xlabel('Reviews', color='red')
plt.ylabel('Count', color='red')
plt.xticks([0, 1, 2], ['Negative', 'Neutral', 'Positive'])
plt.title('COUNT PLOT', color='r')
plt.show()

Therefore the above plot clearly portrays, pictorially, everything described earlier. Next, I add the list 'a', which we encoded earlier, as a new column named 'sentiment' in the data frame 'data'. Now there comes a twist: we create a new variable, say 'final_dataset', where I consider only the 'sentiment' and 'Text' columns of the data frame; this is the new data frame that we are going to work on for the forthcoming part. The reason is that all the remaining columns don't contribute to the sentiment analysis, so rather than dropping them we simply build a data frame that excludes those columns. Hence, that is the reason for choosing only the 'Text' and 'sentiment' columns. We code this as below:

data['sentiment'] = a
final_dataset = data[['Text', 'sentiment']]
final_dataset

Now if we print 'final_dataset' and check its shape, we see that there are 568411 rows and only 2 columns. From final_dataset we find that there are 443766 positive reviews and 82007 negative reviews. Hence there is a very large imbalance between the positive and negative reviews, and a model trained directly on this data would be heavily skewed toward the positive class. Therefore, we choose only a few entries from final_dataset to avoid this. From various trials, I have found that the optimal number of reviews to consider is 5000 per class. Hence I create two new variables, 'datap' and 'datan', and store 5000 randomly chosen positive and negative reviews in them respectively. The code implementing this is below:

# data_p and data_n are assumed to hold the positive and negative subsets of
# final_dataset (e.g. data_p = final_dataset[final_dataset['sentiment'] == 2])
datap = data_p.iloc[np.random.randint(1, 443766, 5000), :]
datan = data_n.iloc[np.random.randint(1, 82007, 5000), :]
len(datan), len(datap)

Now I create a new variable named data and concatenate the values in ‘datap’ and ‘datan’.

data = pd.concat([datap, datan])
len(data)

Now I create a new list named 'c' and do something similar to encoding, but explicitly: I store the negative reviews (previously '0') as '0' and the positive reviews (previously '2') as '1' in 'c'. Then I replace the values of the 'sentiment' column in 'data' with 'c'. Finally, to check whether the code has run properly, I plot the 'sentiment' column. The code implementing this is:

c = []
for i in data['sentiment']:
    if i == 0:
        c.append(0)
    if i == 2:
        c.append(1)
data['sentiment'] = c
sns.countplot(data['sentiment'])
plt.show()

If we look at the data, we can find a few HTML tags, since the data was originally fetched from real e-commerce sites. These tags are not necessary for the sentiment analysis and have to be removed. Hence we use BeautifulSoup with the 'html.parser', which lets us easily remove the unwanted tags from the reviews. To perform the task, I create a new column named 'review' which stores the parsed text, and I drop the column named 'Text' to avoid redundancy. I have performed the above task using a function named 'strip_html'. The code to perform this is as follows:

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

data['review'] = data['Text'].apply(strip_html)
data = data.drop('Text', axis=1)
data.head()

Now we have come to the end of a tiresome process of Data Preprocessing and Visualization. Hence we can now proceed with the next step i.e. Model Building.

4. Model Building:

Before we jump directly to building the model, we need to do one small task. As humans, we rely on articles, determiners, conjunctions, punctuation marks and so on to understand a review and then classify its sentiment. Machines do not need these tokens to classify the sentiment; if anything, such tokens only confuse them. So, as in any other sentiment analysis, we use the 'nltk' library. NLTK stands for 'Natural Language Toolkit' and is one of the core libraries for sentiment analysis or any text-based ML project. With the help of this library, I am first going to remove the punctuation marks and then remove the words which do not add any sentiment to the text. First I use a function named 'punc_clean' which removes the punctuation marks from every review. The code to implement this is below:

import nltk
import string as st

def punc_clean(text):
    a = [w for w in text if w not in st.punctuation]
    return ''.join(a)

data['review'] = data['review'].apply(punc_clean)
data.head(2)

Therefore the above code removes the punctuation marks. Next we have to remove the words which don't add any sentiment to the sentence; such words are called 'stopwords'. A list of almost all the stopwords can be found here. If we go through the list of stopwords, we find that it contains the word 'not' as well. It is necessary that we don't remove 'not' from the 'review', as it contributes to the negative sentiment. Hence we have to write the code in such a way that we remove the other stopwords except 'not'. The code to implement this is:

def remove_stopword(text):
    # requires the NLTK stopwords corpus and 'punkt' tokenizer
    # (nltk.download('stopwords'); nltk.download('punkt'))
    stopword = nltk.corpus.stopwords.words('english')
    stopword.remove('not')
    a = [w for w in nltk.word_tokenize(text) if w not in stopword]
    return ' '.join(a)

data['review'] = data['review'].apply(remove_stopword)

Therefore we are now just one step away from building the model. The next goal is to assign every word in every review a sentiment weight. To implement this we use another class from the 'sklearn' library, the 'TfidfVectorizer', which is present inside 'feature_extraction.text'. It is highly recommended to go through the 'TfidfVectorizer' docs to get a clear understanding of the class. It has many parameters like input, encoding, min_df, max_df, ngram_range, binary, dtype, use_idf and many more, each having its own use case. Hence it is recommended to go through this blog to get a clear understanding of how 'TfidfVectorizer' works. The code which implements this is:

from sklearn.feature_extraction.text import TfidfVectorizer

vectr = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
vectr.fit(data['review'])
vect_X = vectr.transform(data['review'])

Now it's time to build the model. This is a binary-class sentiment classification, with '1' referring to a positive review and '0' referring to a negative review, so we need to use a classification algorithm. The one used here is Logistic Regression, hence we import 'LogisticRegression' to use as our model. I then fit it on the entire dataset, because I felt it is better to test the model on entirely new data rather than on a split of the available dataset. Then I use the '.score()' function to compute the training score of the model. The code implementing the above-mentioned tasks is given below:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
clf = model.fit(vect_X, data['sentiment'])
clf.score(vect_X, data['sentiment']) * 100

If we run the above piece of code and check the score of the model, we get around 96-97%; the exact figure changes every time we run the code because the data is sampled randomly. Hence we have successfully built our model, and with a good score too. So why wait to test how our model performs in a real-world scenario? We now move on to the last and final step, 'Prediction', to test our model's performance.

5. Prediction:

To check the performance of the model, I have used two simple sentences, "I love icecream" and "I hate icecream", which clearly carry positive and negative sentiment respectively. The output is as follows:
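A minimal sketch of how that prediction might look, reusing the 'vectr' and 'clf' objects fitted above:

# transform the raw sentences with the fitted TF-IDF vectorizer, then predict
samples = ["I love icecream", "I hate icecream"]
print(clf.predict(vectr.transform(samples)))   # expected output: [1 0]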

Here '1' and '0' refer to positive and negative sentiment respectively. Why not test a few real-world reviews as well? I encourage you as readers to check and test the same. You will mostly get the desired output, but if that doesn't work, try changing the parameters of the 'TfidfVectorizer' and tuning the 'LogisticRegression' model to get the required output (a sketch of such tuning follows below). I have attached the link to the code and the dataset here.
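As an illustration of that kind of tuning, here is a hedged sketch using scikit-learn's GridSearchCV; the parameter grid below is purely illustrative and not taken from the original post:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# illustrative grid only; widen or narrow the ranges to suit your data
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression(max_iter=1000))])
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__min_df': [1, 2, 5],
              'lr__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(data['review'], data['sentiment'])
print(grid.best_params_, grid.best_score_)

Cross-validated scores from a search like this are usually a fairer estimate of real-world performance than the training-set score reported above.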

You can connect with me through LinkedIn. I hope this blog is useful for understanding how sentiment analysis is done practically with the help of Python code. Thanks for reading the blog.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.



A Collaborative Approach To Mistake Analysis

Many math teachers use mistake analysis as an instructional strategy for helping students to think metacognitively and to develop a deeper understanding of critical procedures. The approach that I will share takes this strategy to the next level by getting students up out of their seats, working collaboratively, and thinking critically about common misconceptions. The activity requires students to not only identify an error but also reflect on their own learning to generate their own ideas of what the common mistakes are and then explain the thinking of their peers.

Before describing the protocol, I want to mention three of the Common Core mathematical practices that it forces students to exercise:

“Construct viable arguments and critique the reasoning of others.” The Common Core asserts, “Mathematically proficient students are also able to compare the effectiveness of two plausible arguments, distinguish correct logic or reasoning from that which is flawed, and—if there is a flaw in an argument—explain what it is.” 

“Attend to precision. Mathematically proficient students try to communicate precisely to others.” 

“Look for and express regularity in repeated reasoning. As they work to solve a problem, mathematically proficient students maintain oversight of the process, while attending to the details. They continually evaluate the reasonableness of their intermediate results.”

As you read through the procedure, think about how students are deeply engaging with each of these practices.

Here’s how it works. Randomly separate the students into groups of three, and assign each group to a board. Each iteration of the exercise has three rounds.

3-Round Mistake Analysis Exercise

Round one: Each group of students is prompted to generate a problem related to the current unit of study in which one of their peers might make a mistake. They write the problem and solve it incorrectly, intentionally performing the common error that they identified but without noting where the error occurs in the solution.

Round two: Each group rotates to the next board, so that they are looking at a problem and an incorrect solution generated by the previous group. Using a different color marker, they must identify the error in the solution and solve the problem correctly.

Round three: Each group rotates again so that now they are looking at a problem, an incorrect solution, and a correct solution, none of which were written by the group currently standing at that board. They verbally explain to the rest of the class the mistake that was made by the first group and the correct solution provided by the second group.

Benefits for Students

There are a slew of benefits to this approach. My favorite is that students must do several different types of thinking.

Generating an example of a common misconception requires a great deal of metacognition on their part. Identifying an error is an exercise in precision and critical evaluation of a process. Next, they flex their problem-solving skills. Finally, they need to construct an argument and explain someone else’s solution. Thinking about the same content from all of these perspectives and modalities encourages a much deeper understanding of the material.

This collaborative approach to mistake analysis is a fabulous way to facilitate teamwork in a math classroom that, again, allows students at all levels to participate. Even the students struggling the most will be able to talk about what mistakes they have made and what types of problems are most challenging to them. Adding an additional requirement, that only one student may wield the writing utensil during each round, also forces everyone to participate and communicate.

Finally, the multifaceted stages of the exercise boost energy levels in the room. The buzz in the room is palpable as students move from one board to the next, brainstorm, debate, and change their thinking modality. Plus, I’ve always been a strong believer that we think better when we are standing up and writing big!

This strategy can be used at any stage in the learning process: near the beginning of a unit to nip misconceptions in the bud, in the middle when students are in the thick of formative assessment, or at the end as a review exercise. While I’ve found success using it in the math classroom, I imagine it could be adapted for any discipline. For example, English teachers might have students generate sentences with subtle but common grammatical errors. The possibilities are endless. 

Sentiment Analysis Of Twitter Posts On Chennai Floods Using Python

Introduction

The best way to learn data science is to do data science. No second thought about it!

One of the ways I do this is to continuously look for interesting work done by other community members. Once I understand the project, I do / improve the project on my own. Honestly, I can't think of a better way to learn data science.

As part of my search, I came across a study on sentiment analysis of Chennai Floods on Analytics Vidhya. I decided to perform sentiment analysis of the same study using Python and add it here. Well, what can be better than building onto something great.

To get acquainted with the crisis of the Chennai Floods, 2015, you can read the complete study here. This study was done on a set of social interactions limited to the first two days of the Chennai Floods in December 2015.

The objective of this article is to understand the different subjects of interaction during the floods using Python. Grouping similar messages together, with emphasis on predominant themes (rescue, food, supplies, ambulance calls), can help the government and other authorities act in the right manner during the crisis.

Building Corpus

A typical tweet is mostly a text message within a limit of 140 characters. #hashtags convey the subject of the tweet, whereas @user seeks the attention of that user. Forwarding is denoted by 'rt' (retweet) and is a measure of a tweet's popularity. One can like a tweet by marking it as a 'favorite'.

About 6000 tweets were collected with the '#ChennaiFloods' hashtag, posted between 1st and 2nd Dec 2015. Jefferson's GetOldTweets utility (got) was used in Python 2.7 to collect the older tweets. One can store the tweets either in a csv file or in a database like MongoDB for further processing.

import got, codecs
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']

tweetCriteria = (got.manager.TweetCriteria().setQuerySearch('ChennaiFloods')
                 .setSince("2015-12-01").setUntil("2015-12-02").setMaxTweets(6000))

def streamTweets(tweets):
    for t in tweets:
        obj = {"user": t.username, "retweets": t.retweets, "favorites": t.favorites,
               "text": t.text, "geo": t.geo, "mentions": t.mentions,
               "hashtags": t.hashtags, "id": t.id, "permalink": t.permalink}
        tweetind = collection.insert_one(obj).inserted_id

got.manager.TweetManager.getTweets(tweetCriteria, streamTweets)

Tweets stored in MongoDB can be accessed from another python script. Following example shows how the whole db was converted to Pandas dataframe.

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
df = pd.DataFrame(list(collection.find()))

First few records of the dataframe look as below:

Data Exploration

Once in dataframe format, it is easier to explore the data. Here are few examples:

As seen in the study, the most used tags were "#chennairains" and "#ICanAccommodate", apart from the original query tag "#ChennaiFloods".
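One hedged way to reproduce that count, assuming the 'hashtags' field stored by the collection script holds each tweet's tags as a space-separated string:

from nltk import FreqDist

# flatten the per-tweet hashtag strings into a single list of tags
hashtags = [tag for tags in df["hashtags"].tolist() for tag in str(tags).split()]
fdist1 = FreqDist(hashtags)
fdist1.plot(10)   # plot the 10 most frequent hashtags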

Top 10 users

from nltk import FreqDist

users = df["user"].tolist()
fdist2 = FreqDist(users)
fdist2.plot(10)

As seen from the plot, most active users were “TMManiac” with about 85 tweets, “Texx_willer” with 60 tweets and so on…

Text Pre-processing

All tweets are processed to remove unnecessary things like links, non-English words, stopwords, punctuation, etc.

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re, string
import nltk

tweets_texts = df["text"].tolist()
stopwords = stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def process_tweet_text(tweet):
    if tweet.startswith('@null'):
        return "[Tweet not available]"
    tweet = re.sub(r'\$\w*', '', tweet)                            # remove tickers
    tweet = re.sub(r'[' + string.punctuation + ']+', ' ', tweet)   # remove punctuation
    # tokenize and keep lowercase tokens that are not stopwords and appear in the English vocabulary
    twtok = TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = [i.lower() for i in twtok.tokenize(tweet)
              if i.lower() not in stopwords and i.lower() in english_vocab]
    return tokens

words = []
for tw in tweets_texts:
    words += process_tweet_text(tw)

The word list generated looks like:

[‘time’, ‘history’, ‘temple’, ‘closed’, ‘due’, ‘pic’, ‘twitter’, ‘havoc’, ‘incessant’, …]

Text Exploration

The words are plotted again to find the most frequently used terms. A few simple words repeat more often than others: ’help’, ‘people’, ‘stay’, ’safe’, etc.

[(‘twitter’, 1026), (‘pic’, 1005), (‘help’, 569), (‘people’, 429), (‘safe’, 274)]

These are immediate reactions and responses to the crisis.

Some infrequent terms are [(‘fit’, 1), (‘bible’, 1), (‘disappear’, 1), (‘regulated’, 1), (‘doom’, 1)].

Collocations are the words that are found together. They can be bi-grams (two words together) or phrases like trigrams (3 words) or n-grams (n words).

from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words, 5)
finder.apply_freq_filter(5)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

Most frequently appearing Bigrams are:

[(‘pic’, ‘twitter’), (‘lady’, ‘labour’), (‘national’, ‘media’), (‘pani’, ‘pani’), (‘team’, ‘along’), (‘stay’, ‘safe’), (‘rescue’, ‘team’), (‘beyond’, ‘along’), (‘team’, ‘beyond’), (‘rescue’, ‘along’)]

These depict the disastrous situation, like “stay safe”, “rescue team”, even a commonly used Hindi phrase “pani pani” (lots of water).

Clustering

In such crisis situations, lots of similar tweets are generated. They can be grouped together into clusters based on the closeness or 'distance' between them. Artem Lukanin has explained the process in detail here. The TF-IDF method is used to vectorize the tweets, and then cosine distance is measured to assess similarity.

Each tweet is pre-processed and added to a list. The list is fed to TFIDF Vectorizer to convert each tweet into a vector. Each value in the vector depends on how many times a word or a term appears in the tweet (TF) and on how rare it is amongst all tweets/documents (IDF). Below is a visual representation of TFIDF matrix it generates.

Before using the vectorizer, the pre-processed tweets are added to the data frame so that each tweet's association with other parameters like id and user is maintained.

Vectorization is done using 1-3 n-grams, meaning phrases with 1, 2 or 3 words are used to compute the frequencies, i.e. the TF-IDF values. One can get the cosine similarity amongst tweets/documents as well.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# cleaned_tweets is the list of pre-processed tweet strings added to the data frame above
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_tweets)
feature_names = tfidf_vectorizer.get_feature_names()  # n-gram phrases

dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)

num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
df['ClusterID'] = clusters
print(df['ClusterID'].value_counts())

The K-means clustering algorithm is used to group the tweets into a chosen number (say, 3) of groups.

The output shows 3 clusters, with following number of tweets in respective clusters.

Most of the tweets fall in the cluster with id 1. The remaining are in the clusters with id 2 and id 0.

The top words used in each cluster can be computed as follows:

# sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster {} : Words :".format(i))
    for ind in order_centroids[i, :10]:
        print(' %s' % feature_names[ind])

The result is:

Cluster 0: Words: show mercy please people rain

Cluster 1: Words: pic twitter zoo wall broke ground saving guilty water growing

Cluster 2: Words: help people pic twitter safe open rain share please

Topic Modeling

Topic modeling means finding the central subjects in a set of documents, tweets in this case. Following are two ways of detecting topics, i.e. clustering the tweets:

Latent Dirichlet Allocation (LDA)

LDA is commonly used to identify a chosen number (say, 6) of topics. Refer to the tutorial for more details.

import gensim
from gensim import corpora

# assumed setup: build the gensim dictionary and bag-of-words corpus from the cleaned tweets
texts = [t.split() for t in cleaned_tweets]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, id2word=dictionary, passes=5)

for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
    print("Topic {}: Words: ".format(topic[0]))
    topicwords = [w for (w, val) in topic[1]]
    print(topicwords)

The output gives us following set of words for each topic.

It is clear from the words associated with the topics that they represent certain sentiments. Topic 0 is about Caution, Topic 1 is about Help, Topic 2 is about News, etc.

Doc2Vec and K-means

Doc2Vec methodology available in gensim package is used to vectorize the tweets, as follows:

import gensim
from gensim.models.doc2vec import TaggedDocument

# assumed loop header: iterate over the cleaned tweets and tag each one
taggeddocs = []
tag2tweetmap = {}
for index, i in enumerate(cleaned_tweets):
    tag = u'SENT_{:d}'.format(index)
    sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[tag])
    tag2tweetmap[tag] = i
    taggeddocs.append(sentence)

model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(60):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002           # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

Once trained model is ready the tweet-vectors available in model can be clustered using K-means.

from sklearn.cluster import KMeans

dataSet = model.syn0
kmeansClustering = KMeans(n_clusters=6)
centroidIndx = kmeansClustering.fit_predict(dataSet)

topic2wordsmap = {}
for i, val in enumerate(dataSet):
    tag = model.docvecs.index_to_doctag(i)
    topic = centroidIndx[i]
    if topic not in topic2wordsmap:
        topic2wordsmap[topic] = []
    for w in tag2tweetmap[tag].split():
        topic2wordsmap[topic].append(w)

for i in topic2wordsmap:
    words = topic2wordsmap[i]
    print("Topic {} has words {}".format(i, words[:5]))

The result is the list of topics and commonly used words in each, respectively.

It is clear from the words associated with the topics that they represent certain sentiments. Topic 0 is about Caution, Topic 1 is about Actions, Topic 2 is about Climate, and so on.

End Notes

This article shows how to implement the Capstone Chennai Floods study using Python and its libraries. With this tutorial, one gets an introduction to various Natural Language Processing (NLP) workflows such as accessing Twitter data, pre-processing text, exploration, clustering and topic modeling.

Got expertise in Business Intelligence  / Machine Learning / Big Data / Data Science? Showcase your knowledge and help Analytics Vidhya community by posting your blog.


A Guide To Conduct Analysis Using Non-Parametric Tests

Introduction

The average salary package of an economics honors graduate at Hansraj College toward the end of the 1980s was around INR 1,000,000 p.a. The number is significantly higher than that of people graduating in the early 80s or early 90s.

What could be the reason for such a high average? Well, one of the highest-paid Indian celebrities, Shahrukh Khan, graduated from Hansraj College in 1988, where he was pursuing economics honors.

This, and many such examples tell us that average is not a good indicator of the center of the data. It can be extremely influenced by Outliers. In such cases, looking at median is a better choice. It is a better indicator of the center of the data because half of the data lies below the median and the other half lies above it.

So far, so good – I am sure you have seen people make this point earlier. The problem is no one tells you how to perform the analysis like hypothesis testing taking median into consideration.

Statistical tests are used for making decisions. To perform analysis using the median, we need to use non-parametric tests. Non-parametric tests are distribution-independent tests, whereas parametric tests assume that the data is normally distributed. It would not be wrong to say parametric tests are more widely known than non-parametric tests, but the former do not take the median into account while the latter make use of the median to conduct the analysis.

Without wasting any more time, let’s dive into the world of non-parametric tests.

Note: This article assumes that you have prerequisite knowledge of hypothesis testing, parametric tests, one-tailed & two-tailed tests.

How are Non-Parametric tests different from Parametric tests?

If you read our articles on probability distributions and hypothesis testing, I am sure you know that there are several assumptions attached to each probability distribution.

Parametric tests are used when the information about the population parameters is completely known whereas non-parametric tests are used when there is no or few information available about the population parameters. In simple words, parametric test assumes that the data is normally distributed. However, non-parametric tests make no assumptions about the distribution of data.

But what are parameters? Parameters are nothing but characteristics of the population that can’t be changed. Let’s look at an example to understand this better.
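For instance, a teacher computing the class average adds up the marks of every student and divides by the number of students; written out, the formula is:

\text{Average marks} = \frac{\sum_{i=1}^{N} \text{marks}_i}{N}, \quad \text{where } N \text{ is the total number of students.}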

Look at the formula given above: the teacher has considered the marks of all the students while calculating the total marks. Assuming that the marking of students is done accurately and there are no missing scores, can you change the total marks scored by the students? No. Therefore, the average marks is called a parameter of the population, since it cannot be changed.

When can I apply non-parametric tests?

Let’s look at some examples.

1. A winner of the race is decided by the rank and rank is allotted on the basis of crossing the finish line. Now, the first person to cross the finish line is ranked 1, the second person to cross the finish line is ranked 2 and so on. We don’t know by what distance the winner beat the other person so the difference is not known. 

2. A sample of 20 people followed a course of treatment and their symptoms were noted by conducting a survey. The patient was asked to choose among the 5 categories after following the course of treatment. The survey looked somewhat like this

Now, if you look carefully the values in the above survey aren’t scalable, it is based on the experience of the patient. Also, the ranks are allocated and not calculated. In such cases, parametric tests become invalid.

For a nominal data, there does not exist any parametric test.

3. Limit of detection is the lowest quantity of a substance that can be detected with a given analytical method but not necessarily quantitated as an exact value. For instance, a viral load is the amount of HIV in your blood. A viral load can either be below the limit of detection or it can be a higher value.

4. In the example above of average salary package, Shahrukh’s income would be an outlier. What is an outlier? The income of Shahrukh lies at an abnormal distance from the income of other economics graduates. So the income of Shahrukh here becomes an outlier because it lies at an abnormal distance from other values in the data.

To summarize, non-parametric tests can be applied to situations when:

The data does not follow any probability distribution

The data constitutes of ordinal values or ranks

There are outliers in the data

The data has a limit of detection

The point to be noted here is that if there exists a parametric test for a problem, then using a non-parametric test will generally give less powerful, less precise results.

Pros and Cons of Using Non-Parametric Tests

Pros

The pros of using non-parametric tests over parametric tests are

1. Non-parametric tests deliver accurate results even when the sample size is small.

2. Non-parametric tests are more powerful than parametric tests when the assumptions of normality have been violated.

3. They are suitable for all data types, such as nominal, ordinal, interval or the data which has outliers.

Cons

1. If there exists a parametric test for the data, then using a non-parametric test in its place is a poor choice, as it sacrifices statistical power.

2. The critical value tables for non-parametric tests are not included in many computer software packages so these tests require more manual calculations.

Hypothesis Testing with Non-Parametric Tests

Mann Whitney U Test

Also known as the Mann-Whitney-Wilcoxon or Wilcoxon rank-sum test, it is an alternative to the independent-samples t-test. Let's understand this with the help of an example.

A pharmaceutical organization created a new drug to cure sleepwalking and observed the result on a group of 5 patients after a month. Another group of 5 has been taking the old drug for a month. The organization then asked the individuals to record the number of sleepwalking cases in the last month. The result was:

If you look at the table, the number of sleepwalking cases recorded in a month while taking the new drug is lower as compared to the cases reported while taking the old drug.

Look at the graphs given below.

For Mann Whitney U test, the test statistic is denoted by U which is the minimum of U1 and U2.

Now, we will compute the ranks by combining the two groups. The question is

How to assign ranks?

Ranks are a very important component of non-parametric tests and therefore learning how to assign ranks to a sample is considerably important. Let’s learn how to assign ranks.

1. We will combine the two samples and arrange them in ascending order. I am using OD and ND for Old Drug and New Drug respectively.

The lowest value here is assigned the rank 1 and the second lowest value is assigned the rank 2 and so on.

But notice that the numbers 1, 4 and 8 are appearing more than once in the combined sample. So the ranks assigned are wrong.

How to assign ranks when there are ties in the sample?

Ties are basically a number appearing more than once in a sample. Look at the position of number 1 in the sample after sorting the data. Here, the number 1 is appearing at 1st and 2nd position. In such a case, we take the mean of 1 and 2 (because the number 1 is appearing at 1st and 2nd position) and assign the mean to the number 1 as shown below. We follow the same steps for number 4 and 8. The number 4 here is appearing at position 5th and 6th and their mean is 5.5 so we assign rank 5.5 to the number 4. Calculate rank for number 8 along these lines.  

We assign the mean rank when there are ties in a sample to make sure that the sum of ranks in a combined sample of size n stays the same. Therefore, the sum of ranks will always be equal to n(n+1)/2.
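The same averaged-rank convention is what scipy implements; here is a quick hedged illustration with hypothetical values (not the study's data) that tie at 1, 4 and 8:

from scipy.stats import rankdata

# hypothetical combined sample with ties at 1 (positions 1-2), 4 (positions 5-6) and 8
combined = [1, 1, 2, 3, 4, 4, 5, 8, 8, 8]
print(rankdata(combined, method='average'))
# -> [1.5 1.5 3.  4.  5.5 5.5 7.  9.  9.  9. ]; the rank sum is still 10*11/2 = 55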

2. The next step is to compute the sum of ranks for group 1 and group 2.

3. Using the formula of U1 & U2, compute their values.
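For reference, with group sizes n_1 and n_2 and rank sums R_1 and R_2, the U statistics are conventionally written (in LaTeX notation) as:

U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2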

Now, U = min(U1, U2) = 0.5

Note: For Mann Whitney U test, the value of U lies in the range(0, n1*n2) where 0 indicates that the two groups are completely different from each other and n1*n2 indicates some relation between the two groups. Also, U1 + U2 is always equal to n1*n2. Notice that the value of U is 0.5 here which is very close to 0.

Now, we determine a critical value (denoted by p), using the table for critical values, which is a point derived from the level of significance of the test and is used to reject or accept the null hypothesis. In Mann Whitney U test, the test criteria are

Since U < critical value, we reject the null hypothesis and conclude that there's no significant evidence to state that the two groups report the same number of sleepwalking cases.

Wilcoxon Sign-Rank Test

This test can be used in place of paired t-test whenever the sample violates the assumptions of a normal distribution.

Note: Assume that the following data violates the assumptions of normal distribution.

Now, the teacher decided to take the test again after a week of self-practice. The scores were

Let’s check if the marks of the students have improved after a week of self-practice.

In the table above, there are some cases where the students scored less than they scored before and in some cases, the improvement is relatively high (Student 4). This could be due to random effects. We will analyse if the difference is systematic or due to chance using this test.

The next step is to assign ranks to the absolute value of differences. Note that this can only be done after arranging the data in ascending order.

In Wilcoxon sign-rank test, we need signed ranks which basically is assigning the sign associated with the difference to the rank as shown below.


Easy, right? Now, what is the hypothesis here?

The hypothesis can be one-sided or two-sided; I am considering a one-sided hypothesis and using a 5% level of significance.

The test statistic for this test, W, is the smaller of W1 and W2, defined below:
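In the usual notation, with d_i the paired differences, W_1 and W_2 are the sums of the ranks of the positive and negative differences respectively:

W_1 = \sum_{d_i > 0} \operatorname{rank}(|d_i|), \qquad W_2 = \sum_{d_i < 0} \operatorname{rank}(|d_i|)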

Here, if W1 is similar to W2 then we accept the null hypothesis. Otherwise, in this example, if the difference reflects greater improvement in the marks scored by the students, then we reject the null hypothesis.

The critical value of W can be looked up in the table.

The criteria to accept or reject null hypothesis are

Sign Test

This test is similar to Wilcoxon sign-rank test and this can also be used in place of paired t-test if the data violates the assumptions of normality. I am going to use the same example that I used in Wilcoxon sign-rank test, assuming that it does not follow the normal distribution, to explain sign test.

Let’s look at the data again.

In sign test, we don’t take magnitude into consideration thereby ignoring the ranks. The hypothesis is same as before.

Here, if we see a similar number of positive and negative differences then the null hypothesis is true. Otherwise, if we see more of positive signs then the null hypothesis is false.

Test Statistic:  The test statistic here is smaller of the number of positive and negative signs.

Determine the critical value and the criteria for rejecting or accepting null hypothesis is

Here, the smaller number of + & – signs = 2 < critical value = 6. Therefore, we reject the null hypothesis and conclude that there’s no significant evidence to state that the median difference is zero.

Kruskal-Wallis Test

Let’s look at an example to enhance our understanding of Kruskal-Wallis test.

Patients suffering from Dengue were divided into 3 groups and three different types of treatment were given to them. The platelet count of the patients after following a 3-day course of treatment is given below.

Notice that the sample size is different for the three treatments which can be tackled using Kruskal-Wallis test.

Sample sizes for treatments 1, 2 and 3 are as follows:

The hypothesis here is given below and I have selected 5% level of significance.

Order these samples from smallest to largest and then assign ranks to the clubbed sample.

Recall that the sum of ranks will always be equal to n(n+1)/2.

We have to check if there is a difference between the 3 population medians, so we will summarize the sample information in a test statistic based on ranks. Here, the test statistic is denoted by H and is given by the following formula:
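The conventional form of the statistic, with k groups of sizes n_j, rank sums R_j and total sample size N, is:

H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)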

The next step is to determine the critical value of H using the table of critical values and the test criteria is given by:

H comes out to be 6.0778 and the critical value is 5.656. Therefore, we reject our null hypothesis and conclude that there's no significant evidence to state that the three population medians are the same.

Spearman Rank Correlation

I went to the market to buy a skirt, and coincidentally my friend bought the same skirt from the market near her place, but she paid a higher price for it. The market near my friend's place is more expensive compared to mine. So does the region affect the price of a commodity? If it does, then there is a link between the region and the price of the commodity. We make use of Spearman rank correlation here because it establishes whether there is a correlation between two datasets.

The prices of vegetables differ across areas. We can check if there’s a relation between the price of a vegetable and area by using Spearman rank correlation. The hypothesis here is:

Here, the trend line suggests a positive correlation between the price of vegetable and area. However, Spearman’s rank correlation method should be used to check the direction and strength of correlation.

Now calculate the rank and d, which is the difference between ranks, with n being the sample size = 10. This is done as follows:

Now, use the formula to calculate the Spearman rank correlation coefficient. The Spearman rank correlation comes out to be 0.67, which indicates a positive relation between the two sets of ranks: the higher an observation ranks on one variable, the higher it tends to rank on the other, and vice versa.
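For reference, the standard formula for the Spearman rank correlation coefficient, with rank differences d_i and sample size n, is:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}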

You can also check this by determining the critical values using the significance level and sample size. The criteria to reject or accept null hypothesis is given by

Frequently Asked Questions

Q1. What is non-parametric test with examples?

A. A non-parametric test is a statistical test that does not make any assumptions about the underlying distribution of the data. It is used when the data does not meet the assumptions of parametric tests. Non-parametric tests are based on ranking or ordering the data rather than calculating specific parameters. Examples of non-parametric tests include the Wilcoxon rank-sum test (Mann-Whitney U test) for comparing two independent groups, the Kruskal-Wallis test for comparing more than two independent groups, and the Spearman’s rank correlation coefficient for assessing the association between two variables without assuming a linear relationship.

Q2. Is Chi Square non-parametric?

A. The chi-square test is often considered a non-parametric test because it does not rely on specific assumptions about the underlying distribution of the data. However, it is important to note that the chi-square test has its own assumptions, such as independence of observations and expected cell frequencies. So while it is non-parametric in terms of distributional assumptions, it does have its own set of assumptions to consider.

End Notes

Non-parametric tests are more powerful when the assumptions for parametric tests are violated and can be used for all data types such as nominal, ordinal, interval, and also when the data has outliers. However, if a parametric test is valid for a problem, then a non-parametric test will generally be less powerful and should not be preferred.

To summarize,

Mann Whitney U test is used for testing the difference between two independent groups with ordinal or continuous dependent variable.

Wilcoxon sign rank test is used for testing the difference between two related variables which takes into account the magnitude and direction of difference, however, Sign test ignores the magnitude and only considers the direction of the difference.

Kruskal-Wallis test compares the outcome among more than 2 independent groups by making use of the medians.

Spearman Rank Correlation technique is used to check if there is a relationship between the two data sets and it also tells about the type of relationship.


A Practical Guide To Iot Applications In Agriculture

The Internet of Things (IoT) revolutionizes how we manage agricultural operations. By connecting devices, sensors, and even animals, farmers can access real-time data that they can use to optimise output. Applying IoT technology in agriculture can lead to better yields, improved efficiency, and greater profits.

For example, IoT-based sensors can collect data about soil temperature and moisture, enabling farmers to adjust their irrigation systems to ensure their crops are receiving the optimal amount of water. In addition, connected devices such as drones and robots can monitor crops and identify signs of pests, disease, and nutrient deficiencies. This data can be used to take timely action to prevent further damage. By leveraging the power of the IoT, farmers can significantly reduce the risk of crop failure and maximise their yield.

Benefits of IoT in Agriculture

IoT in agriculture offers a range of benefits, from improved efficiency to increased crop yields and better management of natural resources. By leveraging the power of IoT, farmers can now access detailed data and information to help them make informed decisions and optimise their operations.

Increased crop yields − IoT-enabled technologies can help farmers to increase their crop yields by providing real-time information about soil moisture, temperature and other key conditions, as well as automated irrigation systems.

Improved resource management − IoT data allows farmers to monitor their resources and make more efficient use of them, such as soil nutrients, water, and energy.

Enhanced crop monitoring − IoT sensors and devices can be used to monitor crops, providing farmers with valuable data on crop health and enabling them to detect potential issues early and take corrective action.

Improved pest and disease control − IoT devices can help farmers detect pests and diseases early, allowing them to take action quickly and reduce the impact of these issues on their yields.

Enhanced decision-making − IoT data provides farmers with more accurate and timely information, enabling them to make better decisions about their crops and resources.

Examples of IoT in Agriculture

IoT technology is increasingly being used in agriculture to improve efficiency, increase yields, and reduce overall costs. Examples of IoT in agriculture include −

Automated Irrigation Systems − Automated irrigation systems are connected to sensors in the soil that measure moisture levels and adjust the amount of water released accordingly. This saves water, energy and labour costs.

Precision Agriculture − Precision agriculture uses IoT sensors to monitor soil, crop, and environmental conditions in real-time, enabling farmers to manage their resources better and make more informed decisions.

Livestock Monitoring − Livestock monitoring systems use sensors to track the health and well-being of animals. This helps farmers detect any potential issues quickly and take corrective action.

Crop Monitoring − Crop monitoring systems use sensors to monitor crop health and growth, enabling farmers to optimize production and reduce waste.

Weather Forecasting − Weather forecasting systems use sensor data to provide farmers with real-time weather forecasts and help them plan their activities accordingly.

Challenges of IoT in Agriculture

The Internet of Things (IoT) transforms how we interact with our environment. It allows us to monitor and control devices remotely and collect data on the environment and the devices connected to it. This has opened up a range of possibilities in the agricultural sector, as it has the potential to revolutionize the way farmers manage their land and crops.

However, a few challenges must be addressed before the full potential of IoT in agriculture can be realized. These challenges include −

Connectivity − One of the biggest challenges in implementing IoT in agriculture is ensuring reliable connectivity. To provide reliable connectivity, farmers must have access to high-speed internet and a reliable infrastructure that can support the data transfer needs of the IoT system.

Security − Security is a major concern for any IoT system, especially in the agricultural sector. Since the IoT system will be gathering sensitive data, it is important that the system is secure and protected from malicious actors.

Cost − Implementing an IoT system in the agricultural sector can be expensive. Farmers must consider the cost of the hardware and software needed to set up the system and the costs associated with maintaining and updating it.

Compatibility − Ensuring that the IoT system is compatible with existing systems and devices is another important challenge. The system must communicate with other devices and systems to be useful.

The agricultural sector has the potential to benefit greatly from IoT technology. However, it is important to consider the challenges associated with implementing such a system before investing in it. By addressing these challenges, farmers can ensure that the IoT system is secure and reliable and will provide them with the data and insights they need to optimize their operations.

Conclusion

The Internet of Things (IoT) greatly impacts the agriculture industry, allowing farmers to use sensors and devices to monitor their crops and livestock more effectively. IoT devices can detect soil moisture, pH levels, temperature, humidity, and other conditions that impact yield and livestock health. This data can then be used to adjust irrigation and other farming practices to optimize production and reduce costs. Additionally, IoT devices can help detect disease and pests and even predict weather conditions, allowing farmers to take proactive steps to protect their crops and livestock.

Power Bi New Customers Retention Analysis Using Advanced Dax

Another term for this is attrition analysis, because we want to see how our customers are churning, how many of our customers are coming on board and buying our products, how many are coming back and buying more, how many customers we are losing, and so forth.

In this customer analysis example, I start off going through customer churn and exploring how many customers are being lost after a certain time frame. I also dive into new customers and returning customers.

Analyzing your customer churn is a very key piece of analysis for an organization, especially if you’re a high-frequency selling business like an online retailer or a supermarket chain.

Obviously, if you get customers on board, you want to be selling them more and not losing them to competitors, for example.

It’s much easier to sell to an existing customer than it is to find new customers.

Existing customers are crucial to most businesses as it’s so much more profitable to continue to market to them as opposed to having to find new customers all the time.

In this first visualization here, we have what we would consider overtime Lost Customers.

The portion up to just about the first 90 days is not as relevant, because in the very first days we are effectively considering everyone as "lost".

Now let’s walk through the function to see what we are doing here.

In this formula, we are counting up the customers who have not purchased in the last 90 days, or whatever your churn window variable is set to.

We are creating a virtual table on every single customer through this CustomersList variable.

We filter all customers for any day. And what we’re doing with ALL is that we’re actually looking at every single customer in each individual day.

And then for every single customer, we’re evaluating if they have made a purchase on the last 90 days. If they have not, then that’s going to evaluate to 0 and count that customer.

Now let’s look at our New Customers and see what it is evaluating to.

In this table, we see that it’s more on the earlier dates, January to July because we just started our business. People are generally new.

Then obviously, it flattens out towards the end because we just have our return customers there.

Its formula applies similar logic: we are checking how many sales each customer has made before today.

And if they have not purchased anything, which is going to evaluate as 0, then it evaluates as New Customer.

Returning customers are those who have been evaluated as lost.

In other words, they haven’t bought anything for 90 days. Through time, we will calculate how many are actually returning.

This would be an amazing insight if you are running promotions or doing marketing, and you want to know how many of these lost customers you got back through your marketing activities.

In the Returning Customers formula, we’re only evaluating customers that actually purchased on any given day.

So here we are running some logic on each customer, evaluating if they made a sale in the last 90 days.

If they didn’t purchase for the last 90 days, then they are considered returning. Then, evaluate to true, and count out that customer in that particular day.

In the past, this sort of information would cost a lot of money to generate. But now, you could achieve these awesome insights through some clean and effective formula, utilizing the DAX language.

Remember that it actually aligns with the data model. Everything is incorporated in there.

We can actually place some filters on this. For instance, we want to dive into just one state, say Florida, or our top 3 states, it all evaluates dynamically.

If you can see the opportunities and potential with Power BI, then your mind can just exponentially expand with the possibilities of running analysis over your own data sets.

All the best and good luck with these techniques.

Sam
