Secure Password Generator Using Python


This article was published as a part of the Data Science Blogathon.

In this article, we will build a password generator: a Python application that generates a random string of a desired length. Most of the applications and websites we use today require passwords, and setting strong ones is important for keeping our information safe from attackers. We will build the application using Python, Tkinter, and pyperclip.


Let us see the requirements that are needed for us to build this application.

Python: Python is the programming language that we will use to build the application.

Tkinter: Tkinter is Python's standard graphical user interface (GUI) library and one of the easiest ways to build a GUI-based application. Here we use it to build the window in which the random password is generated.

pyperclip: pyperclip is a Python module for copying text to the clipboard and pasting text from it. After generating the password, the application offers an option to copy it.

random: Passwords must be generated randomly, so we use the random module, which produces pseudo-random numbers and random selections.

string: The string module provides ready-made character sets, such as string.ascii_uppercase, string.digits, and string.punctuation, from which we draw the password characters.

Now let us move into the implementation part.

For any project, we start by importing the required modules. For our application, we import tkinter, pyperclip, random, and string.

The tkinter, random, and string modules are part of the Python standard library, so they cannot and need not be installed with pip (on some Linux distributions tkinter is packaged separately, for example as python3-tk). Only pyperclip has to be installed. I use Jupyter Notebook to run the code, so I open the Anaconda prompt and run the command there, but any command prompt works.

To install pyperclip

pip install pyperclip

Now import all the libraries. Writing from tkinter import * imports every public name that the module defines.

from tkinter import *
import random, string
import pyperclip

Initialize Window

Our next step is to initialize the window in which the password is generated once the user chooses its length. For this we use Tkinter. First, we assign the result of the Tk() function to the win variable. The geometry() method sets the width and height of the window, and the title() method sets its title. Here the title is “PASSWORD GENERATOR” and the window is 500 pixels wide and 500 pixels high. The configure() method sets the background color of the window.

win = Tk()
win.geometry("500x500")
win.title("PASSWORD GENERATOR")
win.configure(bg="#ffc252")

At the top of the window, a label shows the text PASSWORD GENERATOR in bold 15-point Arial on the same background color as the window. The pack() method places the widget in the window.

Label(win, text='PASSWORD GENERATOR', font='Arial 15 bold', bg="#ffc252").pack()

Now we place an input box in which the user enters the number of characters the password should contain. Above it, we place the text “PASSWORD LENGTH” in bold 10-point Arial. IntVar() creates a variable that holds integer data, which we can read back later with get(). Spinbox() offers a range of values: the user can type a number or step through the values to select the password length. Here the allowed lengths run from 8 to 32.

Python Code:

The text and the spinbox will look like this.

Define Password Generator

StringVar() is similar to IntVar(), but it holds string data. Now we define a function called Generator that produces a random password. A password that contains only digits or only letters does not provide enough security for your system or application; it should combine uppercase letters, lowercase letters, digits, and punctuation. So the first four characters of the password are a random uppercase letter, a random lowercase letter, a random digit, and a random punctuation character, and the remaining characters are drawn at random from all four sets combined.
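Before wiring this into Tkinter, the same idea can be sketched as a standalone function that is easy to try in a console. This is only a sketch under two assumptions that are not part of the article: the function name generate_password is hypothetical, and it swaps the random module for the standard-library secrets module, which the Python documentation recommends for passwords. It also shuffles the result so that the four guaranteed characters do not always sit in the first four positions.

```python
import secrets
import string


def generate_password(length: int = 12) -> str:
    """Return a random password with at least one character of each class."""
    if length < 4:
        raise ValueError("length must be at least 4 to fit all four classes")

    classes = [
        string.ascii_uppercase,
        string.ascii_lowercase,
        string.digits,
        string.punctuation,
    ]
    # One guaranteed character from each class.
    chars = [secrets.choice(cls) for cls in classes]
    # Fill the remaining positions from the combined alphabet.
    alphabet = "".join(classes)
    chars += [secrets.choice(alphabet) for _ in range(length - 4)]
    # Shuffle so the guaranteed characters are not in fixed positions.
    secrets.SystemRandom().shuffle(chars)
    return "".join(chars)


print(generate_password(16))
```

The GUI version in this article keeps the random module and a fixed order for the first four characters; the sketch only shows one way the logic could be tightened.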

pass_str = StringVar()

def Generator():
    # Start with one character from each class so that every class is represented.
    password = (random.choice(string.ascii_uppercase)
                + random.choice(string.ascii_lowercase)
                + random.choice(string.digits)
                + random.choice(string.punctuation))
    # Fill the remaining positions from all four character sets combined.
    for y in range(pass_len.get() - 4):
        password = password + random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits + string.punctuation)
    pass_str.set(password)

Generate Buttons

Now create a button whose command is Generator, the function we defined for generating the password. The button shows the text “GENERATE PASSWORD” in white on a blue background. You don’t need to stick to the UI defined in this article: change the fonts, colors, and layout until the window looks the way you want.

Generate = Button(win, text="GENERATE PASSWORD", command=Generator, padx=5, pady=5)
Generate.configure(background="blue", foreground='white', font=('Arial', 10, 'bold'))
Generate.pack(side=TOP, pady=20)

# Entry box that displays the generated password
Entry(win, textvariable=pass_str).pack()

The generate password button will look like this

Our next step is to copy the password, for which we use pyperclip. The Copy_password function reads the string held in the pass_str variable and copies it to the clipboard with pyperclip.copy(). We then create a button with the text “COPY TO CLIPBOARD” whose command is Copy_password, and configure its appearance: white bold 10-point Arial text on a blue background. The pack() method places the button below the other widgets, with some vertical padding (pady).

def Copy_password():
    pyperclip.copy(pass_str.get())

copy = Button(win, text='COPY TO CLIPBOARD', command=Copy_password)
copy.configure(background="blue", foreground='white', font=('Arial', 10, 'bold'))
copy.pack(side=TOP, pady=20)

The copy to the clipboard button will look like this.

Now run the main loop by calling win.mainloop() to start the application.


Here you can see that I generated a password of 8 characters and got “Sh8_90Ny”. The application is simple to use and genuinely useful.


A password generator is an interesting application to build. We can use it to create strong passwords, and this generator does not store any password anywhere: it clears all the data as soon as you close the window. So you can build your secret, strong passwords with it without hesitation.

 The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.



Create A Strong Password Using These Tips And Tools

When a signup form asks users to create a password, the first thing that comes to many users’ minds is, “Okay, I need to create a password that is really easy for me to remember and is connected directly to me so I never forget it.” With such a mindset, the password created is something like “ILoveSally143.” A hacker will take less than a minute to crack such a password and take complete control of your account.

Lately, companies and websites have been working hard to educate users about strong passwords, and they also use restrictions to force users to make stronger ones. Thanks to so much news about hacked accounts and the emphasis on password strength, almost everyone knows that they should use a strong password. However, the question still remains: what is a “strong” password? In this article we’ll tell you what a strong password is and how to create one.

How a Password Is Cracked

Before we tell you how to create a strong password, it is important to know how a password is cracked. There are multiple ways to crack a password, and the most common ones are Brute-Force-Attack and Dictionary Attack. Both of them are explained below.


In a Brute-Force-Attack the hacker (the hacker’s software, to be precise) tries every combination of letters, numbers and other characters to crack a password. The process starts with short lengths, such as four or five characters, and when all the combinations of that length have been tried, the software adds another character, tries all the combinations of the new length, and repeats the process. This theoretically allows a Brute-Force-Attack to crack almost any type of password (including encrypted ones). However, because Brute-Force checks every possible combination, it takes a lot of time, and adding another character drastically increases the cracking time.

Giving an estimate from Kaspersky Password Checker, the password “pzQm45” should take 3 hours to crack, but “pzQm45@” will take two days to crack. If we add another character like “pzQm45@!,” it will take twelve days to crack. This means it is very hard to crack a longer password for a Brute-Force-Attack, and it’s not worth the hacker’s time.
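The growth in effort can be made concrete with a little arithmetic. The sketch below assumes an alphabet of the 94 printable ASCII letters, digits and punctuation marks and simply counts the passwords of a given length; the function name keyspace is hypothetical. Real estimates, such as the ones quoted above, come from a specific tool with its own model of attacker hardware and password patterns, so they need not follow this simple count.

```python
import string

# Assumed alphabet: printable ASCII letters, digits and punctuation marks.
ALPHABET_SIZE = len(string.ascii_letters + string.digits + string.punctuation)


def keyspace(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


for n in (6, 7, 8):
    print(f"{n} characters: {keyspace(n):,} combinations")

# Each extra character multiplies the search space by the alphabet size.
print(keyspace(7) // keyspace(6))
```

Under this model every added character multiplies the number of candidates by the size of the alphabet, which is why length matters so much against exhaustive search.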

Dictionary Attack

Brute-Force-Attack has a hard time cracking long passwords; this is where Dictionary Attack comes in. In a Dictionary Attack the hacking software uses a long list (millions of entries) of word combinations taken from dictionaries, along with common character combinations, phrases, sequences and anything else that is “common.” If a password has a meaning, a Dictionary Attack can crack it. Adding punctuation or numbers to a common word will not help. For example, a Dictionary Attack should be able to crack the password “I$3haTe5%MaTh” easily, because the underlying phrase still makes sense. Because this method uses combinations of common words and characters, it takes far less time to crack a password than Brute-Force, even if the password is long.

Solution: The answer to both of the above attacks is simple: create a long password that doesn’t make sense. A password of sixteen characters or more made of completely random characters should work fine. But creating and managing such a password is hard, which we explain below.

Note: hackers also use Phishing Attacks to steal your password. A strong password will not help against a phishing attack as the hacker will steal the actual password using a fake website page.

Manually Create a Strong Password That Is Easy to Remember

For those people who don’t like providing their credentials to third-party applications, we know a manual way to create and memorize a strong password. You can create a password from a long phrase that has a direct connection with you but that others don’t know about. For example, you can create multiple passwords from a phrase such as “I eat vanilla ice cream at 3am, but I don’t get any sleep afterwards!” Below are some examples:
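One possible scheme, sketched here purely as an illustration, takes the first character of every word and keeps the closing punctuation mark. The function name initials_password and the rule itself are assumptions for this sketch; any personal rule that you can reproduce from memory works the same way.

```python
import string


def initials_password(phrase: str) -> str:
    """Build a password from the first character of every word in `phrase`."""
    words = phrase.split()
    password = "".join(word[0] for word in words)
    # Keep a closing punctuation mark, if the phrase ends with one.
    if phrase and phrase[-1] in string.punctuation:
        password += phrase[-1]
    return password


phrase = "I eat vanilla ice cream at 3am, but I don't get any sleep afterwards!"
print(initials_password(phrase))
```

Variations such as taking the last letter of each word, or replacing some letters with digits, give further passwords from the same phrase.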




It will be really easy to remember the phrase as it is connected to something you do or have done before; all you have to do is remember how you created the password.

Use a Password Generator and a Manager

If you don’t want to go through the above process and don’t mind depending on a third-party service for creating and storing the passwords, then things can get a lot easier (and productive) for you. There are many tools that will let you generate a strong password, and you can also use a password manager to save those passwords. Below are some you can use:

Password Generators

Secure Password Generator: A very simple online password generator that allows you to specify password length and character type to easily create a strong password. It also provides hints that will let you easily remember the password.

LastPass Password Generator: The famous password manager LastPass also has an online password generator that is simple to use and offers handy tools to generate a strong password.

Password Managers

LastPass: I recommend LastPass for its simple interface and security options. It will securely store all your passwords and let you sync them over all devices.

Dashlane: This is another good option that is easy to use and offers great security such as two-factor authentication. It also has a digital wallet to save receipts and credit card information.

Important: Never ever use the same password for multiple accounts; even if one of your accounts is hacked it could lead to losing all your accounts.


There should be no compromise on password strength as tens of thousands of hackers are after your information and trying to get into your account. You may say that you are just a regular person and no hacker will have time to hack your account, but hackers don’t care who you are. They just try to hack anything they can get their hands on, one way or another. Identity theft and the wrong use of your account and information is something average users should worry about. I also recommend you enable two-factor authentication if it is available for a website, as it is the best protection against hackers.

Karrar Haider

Karrar is drenched in technology and always fiddles with new tech opportunities. He has a bad habit of calling technology “Killer”, and doesn’t feel bad about spending too much time in front of the PC. If he is not writing about technology, you will find him spending quality time with his little family.


Image Processing And Feature Extraction Using Python

In this article, I will take you through some of the basics of image processing. The ultimate goal of this data massaging remains the same: feature extraction. But here we need more intensive data cleaning. Data cleaning is usually done on datasets, tables, text and so on; how is it done on an image? We will look at how an image is stored on disk and how we can manipulate an image using this underlying data.

Importing an Image

Importing an image in Python is easy. The following code will help you import an image in Python:

Understanding the underlying data

This image has several colors and many pixels. To visualize how this image is stored, think of every pixel as a cell in a matrix. Each cell holds three intensity values, one each for red, green and blue. So an RGB image becomes a 3-D matrix, and each number in it is the intensity of red, green or blue.

Let’s look at a few transformations:

As you can see in the above image, we manipulated the third dimension to get the transformation done. Yellow is not one of the three stored colors but comes out as a combination of red and green. We got each transformation by setting the intensity of the other colors to zero.
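The channel manipulation can be tried without any imaging library. The sketch below is a minimal illustration on a made-up 2x2 image stored as nested lists of (R, G, B) triples; the helper name keep_channels and the pixel values are hypothetical.

```python
# A tiny 2x2 "image": each pixel is an (R, G, B) triple with values 0-255.
image = [
    [(200, 120, 40), (10, 220, 30)],
    [(90, 90, 250), (255, 255, 255)],
]


def keep_channels(img, keep):
    """Return a copy of `img` with every channel not listed in `keep` set to 0."""
    return [
        [tuple(v if c in keep else 0 for c, v in enumerate(pixel)) for pixel in row]
        for row in img
    ]


red = keep_channels(image, keep={0})        # zero green and blue
yellow = keep_channels(image, keep={0, 1})  # zero blue only

print(red[0][0])
print(yellow[0][0])
```

With a NumPy array the same effect is obtained by assigning zero to a slice of the third dimension, which is what the complete code at the end of the article does.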

Converting Images to a 2-D matrix

Handling the third dimension of images can sometimes be complex and redundant. In feature extraction, it becomes much simpler if we compress the image to a 2-D matrix. This is done by gray-scaling or binarizing. Gray-scaling is richer than binarizing, as it shows the image as a combination of different intensities of gray, whereas binarizing simply builds a matrix full of 0s and 1s.

Here is how you convert a RGB image to Gray scale:

As you can see, the dimension of the image has been reduced to two in grayscale. However, the features are equally visible in the two images. This is also why a grayscale image takes much less space when stored on disk.

Now let’s try to binarize this Grayscale image. This is done by finding a threshold and flagging the pixels of Grayscale. In this article I have used Otsu’s method to find the threshold. Otsu’s method calculates an “optimal” threshold by maximizing the variance between two classes of pixels, which are separated by the threshold. Equivalently, this threshold minimizes the intra-class variance.
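To make the criterion concrete, the following is a minimal pure-Python sketch of Otsu's method on a flat list of integer intensities. It is an illustration of the idea, not the scikit-image implementation; the function name otsu_threshold and the sample data are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    `pixels` is a flat iterable of integer intensities in [0, levels).
    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


gray = [10] * 50 + [12] * 30 + [200] * 40 + [210] * 20
t = otsu_threshold(gray)
binary = [1 if p > t else 0 for p in gray]
print(t, sum(binary))
```

On this clearly bimodal sample the threshold falls between the dark and the bright group, so exactly the bright pixels are flagged as 1.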

Following is a code to do this transformation:

Blurring an Image

The last part we will cover in this article is more relevant for feature extraction: blurring of images. A grayscale or binary image sometimes captures more detail than required, and blurring comes in very handy in such scenarios. For instance, in this image, if the shoe were of less interest than the railway track, blurring would add a lot of value. A blurring algorithm takes a weighted average of neighbouring pixels to incorporate the surrounding colors into every pixel. Following is an example of blurring:
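The averaging step can be sketched in a few lines. The example below is a box blur, in which all neighbours get the same weight; it stands in for the Gaussian filter used in this article, where nearer neighbours weigh more. The function name box_blur and the 3x3 sample are hypothetical.

```python
def box_blur(img, radius=1):
    """Replace every pixel by the mean of its (2*radius+1)^2 neighbourhood.

    Positions outside the image are ignored, so edge pixels average fewer values.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            values = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(values) / len(values)
    return out


img = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
blurred = box_blur(img)
print(blurred[1][1], blurred[0][0])
```

The single bright pixel is spread over its neighbours, which is how blurring pulls a small, distinct object toward the intensity of its surroundings.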

In the above picture, after blurring we clearly see that the shoe has now gone to the same intensity level as that of rail track. Hence, this technique comes in very handy in many scenarios of image processing.

Let’s take a practical example of such an application in the analytics industry. We wish to count the number of people in a photograph of a town, but the image also contains a few buildings. The intensity of the people behind the buildings will be lower than that of the buildings themselves, which makes it difficult to count these people. In such scenarios, blurring can be used to equalize the intensities of the buildings and the people in the image.

Complete Code

Here is the complete code :


from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu, gaussian

# show_img and show_images are the author's plotting helpers; their definitions are not shown here.
image = imread(r"C:\Users\Tavish\Desktop\7.jpg")
show_img(image)

# Keep only red, or red and green (yellow), by zeroing the other channels.
red, yellow = image.copy(), image.copy()
red[:, :, (1, 2)] = 0
yellow[:, :, 2] = 0
show_images(images=[red, yellow], titles=['Red Intensity', 'Yellow Intensity'])

gray_image = rgb2gray(image)
show_images(images=[image, gray_image], titles=["Color", "Grayscale"])
print("Colored image shape:", image.shape)
print("Grayscale image shape:", gray_image.shape)

# Pixels above Otsu's threshold become True (1), the rest False (0).
thresh = threshold_otsu(gray_image)
binary = gray_image > thresh
show_images(images=[gray_image, binary], titles=["Grayscale", "Otsu Binary"])

# In current scikit-image releases the filter lives in skimage.filters as gaussian.
blurred_image = gaussian(gray_image, sigma=20)
show_images(images=[gray_image, blurred_image], titles=["Gray Image", "20 Sigma Blur"])


End Notes


The world of image processing is already so rich that multi-billion dollar companies rely on these tools for various purposes. Image processing techniques are used heavily in research and in the automation of industrial processes. In a few of the coming articles we will take a deep dive into feature extraction from an image. This will include detecting corners, segmenting the image, separating an object from the background and so on.



Drop Collection If Already Exists In Mongodb Using Python

MongoDB is a widely popular open-source database that stores data in a flexible JSON-like format. It does not use the traditional technique of storing data in rows and columns. Instead, it uses a more flexible approach, which increases its scalability.

This database is designed to handle large volumes of data and is therefore well suited to modern applications. A MongoDB database consists of “collections”, each of which is similar to a table in an RDBMS.

A collection is a group of documents consisting of fields with different types of values. A database can contain numerous collections and each collection can contain multiple documents. The structure of a collection follows from the structure of the documents stored in it. In this article, we will drop a MongoDB collection from Python.

Installing PyMongo

PyMongo is the Python driver through which a programmer interacts with MongoDB databases. It provides an interface for performing operations on MongoDB data from Python. We can install PyMongo with the Python package manager on the command line −

pip install pymongo

Once the PyMongo library is installed, we can import it in our local IDE.

Creating a Database

We need a reference database on which we will operate. Creating a MongoDB database is not a difficult task. We have to download the latest version of MongoDB from the internet and install it on the system. After this we will start the “MongoDB” server. We can use a default server with a default port number and begin with the “connection” process. We can manually create a database by passing the database name and collection name. The data can be imported in the form of a JSON or CSV file format.

Connecting MongoDB to Python through PyMongo

This is the most crucial step as it involves the creation of a connection between the two platforms. We will create a MongoClient object with the help of “pymongo.MongoClient()” function. The connection is established by passing the server address as an argument for this function.

Syntax

Mongo_client = pymongo.MongoClient("Connection address")

Let’s apply this method to establish a connection.

Creating a Connection to Read a Collection in Python

Here we are trying to read a collection that is stored in MongoDB. In the example given below −

We imported the “PyMongo” library and created a “MongoClient” object which allows us to establish a connection and access the database

We passed a server address specifying the address name as “localhost”, which means that the MongoDB server is running on the same machine as the python program. We used the default port number for MongoDB server: “27017”.

After this we specified the database and collection name.

We have created a collection and populated it.

We used the “find()” method to retrieve the documents stored in the collection.

Example

import pymongo

Mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")

# Getting the database instance
database = Mongo_client['mydb']

# Creating a collection
collection = database['example']

# Inserting documents into the collection
data = [{"_id": "101", "name": "Ram", "age": "26", "city": "Hyderabad"},
        {"_id": "102", "name": "Rahim", "age": "27", "city": "Bangalore"},
        {"_id": "103", "name": "Robert", "age": "28", "city": "Mumbai"}]
res = collection.insert_many(data)
print("Data inserted ......")

# Retrieving the data
documents = collection.find()
print("Contents of the collection: ")
for document in documents:
    print(document)

Output

Data inserted ......
Contents of the collection:
{'_id': '101', 'name': 'Ram', 'age': '26', 'city': 'Hyderabad'}
{'_id': '102', 'name': 'Rahim', 'age': '27', 'city': 'Bangalore'}
{'_id': '103', 'name': 'Robert', 'age': '28', 'city': 'Mumbai'}

Now that we have created a database and a collection, let’s look at how to drop a collection from the database.

Dropping the Collection using the Drop() Method

This is a very simple approach of dropping a collection from the database. Let’s understand it.

After establishing the connection, we used the drop() method to drop the targeted collection from the database.

Once the collection is dropped, we can no longer retrieve its documents with the “find()” method.

“None” is returned as the output since the collection has been dropped.

Example

import pymongo

Mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")

# Getting the database instance
database = Mongo_client['mydb']

# Getting the collection
collection = database['example']

documents = collection.find()
print("Contents of the collection: ")
for document in documents:
    print(document)

# Dropping the collection
print(collection.drop())
print("Collection Dropped ......")

Output

Contents of the collection:
{'_id': '101', 'name': 'Ram', 'age': '26', 'city': 'Hyderabad'}
{'_id': '102', 'name': 'Rahim', 'age': '27', 'city': 'Bangalore'}
{'_id': '103', 'name': 'Robert', 'age': '28', 'city': 'Mumbai'}
None
Collection Dropped ......
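To drop a collection only if it already exists, one option is to check the names reported by the database first. The sketch below relies on two PyMongo database methods, list_collection_names() and drop_collection(). The helper drop_collection_if_exists and the InMemoryDatabase class are hypothetical: the class is a stand-in so that the example runs without a MongoDB server.

```python
class InMemoryDatabase:
    """Hypothetical stand-in for a PyMongo database, used only to make this sketch runnable."""

    def __init__(self, names):
        self._names = set(names)

    def list_collection_names(self):
        return sorted(self._names)

    def drop_collection(self, name):
        self._names.discard(name)


def drop_collection_if_exists(database, name):
    """Drop `name` only when the database reports it; return True if it was dropped."""
    if name in database.list_collection_names():
        database.drop_collection(name)
        return True
    return False


db = InMemoryDatabase(["example", "users"])
print(drop_collection_if_exists(db, "example"))
print(drop_collection_if_exists(db, "example"))
print(db.list_collection_names())
```

With a running server, the same helper would be called as drop_collection_if_exists(Mongo_client['mydb'], 'example'). Dropping a collection that does not exist is not an error in MongoDB, so the check mainly serves to report whether anything was removed.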

If you try open the MongoDB database and verify for the collection , you can observe that the coll


This article focuses on a simple “MongoDB” operation of dropping a “collection” that exists in a database with the help of python programming. We used “PyMongo” library to access the MongoDB database. We established a connection and specified the targeted database and collection name. Finally, we used the “drop()” method to drop the collection from the database.

Sentiment Analysis Of Twitter Posts On Chennai Floods Using Python


The best way to learn data science is to do data science. No second thought about it!

One of the ways I do this is to continuously look for interesting work done by other community members. Once I understand a project, I redo or improve it on my own. Honestly, I can’t think of a better way to learn data science.

As part of my search, I came across a study on sentiment analysis of Chennai Floods on Analytics Vidhya. I decided to perform sentiment analysis of the same study using Python and add it here. Well, what can be better than building onto something great.

To get acquainted with the crisis of Chennai Floods, 2023 you can read the complete study here. This study was done on a set of social interactions limited to the first two days of Chennai Floods in December 2023.

The objective of this article is to understand the different subjects of interaction during the floods using Python. Grouping similar messages together, with emphasis on predominant themes (rescue, food, supplies, ambulance calls), can help the government and other authorities act in the right manner during a crisis.

Building Corpus

A typical tweet is mostly a text message within a limit of 140 characters. #hashtags convey the subject of the tweet, whereas @user seeks the attention of that user. Forwarding is denoted by ‘rt’ (retweet) and is a measure of a tweet’s popularity. One can like a tweet by marking it as a ‘favorite’.

About 6000 tweets with the ‘#ChennaiFloods’ hashtag, posted between 1st and 2nd Dec 2023, were collected. Jefferson’s GetOldTweets utility (got) was used in Python 2.7 to collect the older tweets. One can store the tweets either in a CSV file or in a database like MongoDB for further processing.

import got, codecs
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']

tweetCriteria = got.manager.TweetCriteria().setQuerySearch('ChennaiFloods').setSince("2023-12-01").setUntil("2023-12-02").setMaxTweets(6000)

def streamTweets(tweets):
    for t in tweets:
        obj = {"user": t.username, "retweets": t.retweets, "favorites": t.favorites,
               "text": t.text, "geo": t.geo, "mentions": t.mentions, "hashtags": t.hashtags,
               "id": t.id,  # t.id is an assumption; the value is missing in the source
               "permalink": t.permalink}
        tweetind = collection.insert_one(obj).inserted_id

got.manager.TweetManager.getTweets(tweetCriteria, streamTweets)

Tweets stored in MongoDB can be accessed from another Python script. The following example shows how the whole collection was converted to a Pandas dataframe.

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
df = pd.DataFrame(list(collection.find()))

First few records of the dataframe look as below:

Data Exploration

Once the data is in dataframe format, it is easier to explore. Here are a few examples:

As seen in the study the most used tags were “#chennairains”, “#ICanAccommodate”, apart from the original query tag “#ChennaiFloods”.

Top 10 users

from nltk import FreqDist

users = df["user"].tolist()
fdist2 = FreqDist(users)
fdist2.plot(10)

As seen from the plot, most active users were “TMManiac” with about 85 tweets, “Texx_willer” with 60 tweets and so on…

Text Pre-processing

All tweets are processed to remove unnecessary things like links, non-English words, stopwords, punctuation, etc.

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re, string
import nltk

tweets_texts = df["text"].tolist()
stopwords = stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def process_tweet_text(tweet):
    if tweet.startswith('@null'):
        return "[Tweet not available]"
    tweet = re.sub(r'\$\w*', '', tweet)  # Remove tickers
    tweet = re.sub(r'[' + string.punctuation + ']+', ' ', tweet)  # Remove punctuations like 's
    # ... (the tokenizing and filtering steps are missing here; the surviving fragment ends with "i in english_vocab]")
    return tokens

words = []
for tw in tweets_texts:
    words += process_tweet_text(tw)
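The cleaning steps can also be sketched with the standard library alone. This is a minimal illustration, not the article's NLTK pipeline: the function name clean_tweet and the tiny stopword set are hypothetical, and no English-vocabulary filter is applied.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; NLTK's English list is far longer.
STOPWORDS = {"a", "an", "the", "is", "in", "at", "to", "of", "and", "rt"}


def clean_tweet(text):
    """Lowercase a tweet and return its tokens without links, mentions, punctuation or stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove @user mentions
    # Replace every punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return [tok for tok in text.split() if tok not in STOPWORDS and tok.isalpha()]


print(clean_tweet("RT @user: Stay safe in #ChennaiFloods! Help at http://t.co/abc123"))
```

Removing links before punctuation matters here: otherwise the URL would be broken into fragments such as “http” and “co” that survive as tokens.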

The word list generated looks like:

[‘time’, ‘history’, ‘temple’, ‘closed’, ‘due’, ‘pic’, ‘twitter’, ‘havoc’, ‘incessant’, …]

Text Exploration

The words are plotted again to find the most frequently used terms. A few simple words repeat more often than others: ’help’, ‘people’, ‘stay’, ’safe’, etc.

[(‘twitter’, 1026), (‘pic’, 1005), (‘help’, 569), (‘people’, 429), (‘safe’, 274)]

These are immediate reactions and responses to the crisis.

Some infrequent terms are [(‘fit’, 1), (‘bible’, 1), (‘disappear’, 1), (‘regulated’, 1), (‘doom’, 1)].

Collocations are the words that are found together. They can be bi-grams (two words together) or phrases like trigrams (3 words) or n-grams (n words).

from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words, 5)
finder.apply_freq_filter(5)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

Most frequently appearing Bigrams are:

[(‘pic’, ‘twitter’), (‘lady’, ‘labour’), (‘national’, ‘media’), (‘pani’, ‘pani’), (‘team’, ‘along’), (‘stay’, ‘safe’), (‘rescue’, ‘team’), (‘beyond’, ‘along’), (‘team’, ‘beyond’), (‘rescue’, ‘along’)]

These depict the disastrous situation, like “stay safe”, “rescue team”, even a commonly used Hindi phrase “pani pani” (lots of water).
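As a rough illustration of what a bigram count is, adjacent word pairs can be tallied with the standard library. This sketch ranks pairs by raw frequency only, whereas the NLTK code above scores them with a likelihood ratio over a window of five words; the word list is made up for the example.

```python
from collections import Counter

words = ["stay", "safe", "rescue", "team", "stay", "safe",
         "pani", "pani", "rescue", "team", "stay", "safe"]

# zip pairs each word with its successor, giving all adjacent bigrams.
bigrams = Counter(zip(words, words[1:]))

for pair, count in bigrams.most_common(3):
    print(pair, count)
```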


In such crisis situations, lots of similar tweets are generated. They can be grouped together in clusters based on closeness or ‘distance’ amongst them. Artem Lukanin has explained the process in detail here. The TF-IDF method is used to vectorize the tweets, and then cosine distance is measured to assess the similarity.

Each tweet is pre-processed and added to a list. The list is fed to TFIDF Vectorizer to convert each tweet into a vector. Each value in the vector depends on how many times a word or a term appears in the tweet (TF) and on how rare it is amongst all tweets/documents (IDF). Below is a visual representation of TFIDF matrix it generates.

Before using the Vectorizer, the pre-processed tweets are added to the data frame so that each tweet’s association with other parameters, like id and user, is maintained.

Vectorization is done using 1-3 n-grams, meaning phrases with 1,2,3 words are used to compute frequencies, i.e. TF IDF values. One can get cosine similarity amongst tweets/documents as well.
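The two ideas, TF-IDF weighting and cosine similarity, can be sketched with the standard library to show what the numbers mean. This version uses single words and the plain weight tf * log(N / df); scikit-learn's TfidfVectorizer uses a smoothed IDF, n-grams and normalization, so its values differ. The function names and sample tweets are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tweets = [
    "please help people stay safe",
    "rescue team please help",
    "zoo wall broke",
]
vecs = tfidf_vectors(tweets)
print(round(cosine_similarity(vecs[0], vecs[1]), 3))
print(cosine_similarity(vecs[0], vecs[2]))
```

Tweets that share rare terms score close to 1, tweets with no terms in common score 0, and the distance used for clustering is one minus this similarity.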

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Vectorize the tweets with unigrams, bigrams, and trigrams
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_tweets)
# scikit-learn releases before 1.0 use get_feature_names() instead
feature_names = tfidf_vectorizer.get_feature_names_out()  # the phrases (columns)

# Pairwise cosine distance between tweets
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)

num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)  # the model must be fitted before labels_ is available
clusters = km.labels_.tolist()
df['ClusterID'] = clusters
print(df['ClusterID'].value_counts())

The K-means clustering algorithm is used to group the tweets into a chosen number (say, 3) of groups.

The output shows 3 clusters, with the following number of tweets in the respective clusters.

Most of the tweets are clustered in group id 1. The remaining tweets are in group id 2 and group id 0.

The top words used in each cluster can be computed as follows:

# For each cluster, order term indices by their weight in the centroid, highest first
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster {} : Words :".format(i))
    for ind in order_centroids[i, :10]:
        print(' %s' % feature_names[ind])

The result is:

Cluster 0: Words: show mercy please people rain

Cluster 1: Words: pic twitter zoo wall broke ground saving guilty water growing

Cluster 2: Words: help people pic twitter safe open rain share please

Topic Modeling

Topic modeling finds the central subjects in a set of documents, tweets in this case. The following are two ways of detecting topics, i.e. clustering the tweets:

Latent Dirichlet Allocation (LDA)

LDA is commonly used to identify a chosen number (say, 6) of topics. Refer to the tutorial for more details.

dictionary, passes=5)
for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
    print("Topic {}: Words: ".format(topic[0]))
    topicwords = [w for (w, val) in topic[1]]
    print(topicwords)

The output gives us the following set of words for each topic.

It is clear from the words associated with the topics that they represent certain sentiments. Topic 0 is about Caution, Topic 1 is about Help, Topic 2 is about News, etc.

Doc2Vec and K-means

The Doc2Vec method available in the gensim package is used to vectorize the tweets, as follows:

    tag = u'SENT_{:d}'.format(index)
    sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[tag])
    tag2tweetmap[tag] = i
    taggeddocs.append(sentence)

model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(60):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

Once the trained model is ready, the tweet vectors available in the model can be clustered using K-means.

from sklearn.cluster import KMeans

dataSet = model.syn0
kmeansClustering = KMeans(n_clusters=6)
centroidIndx = kmeansClustering.fit_predict(dataSet)

# Collect the words of every tweet under the cluster (topic) it was assigned to
topic2wordsmap = {}
for i, val in enumerate(dataSet):
    tag = model.docvecs.index_to_doctag(i)
    topic = centroidIndx[i]
    # setdefault also keeps the words of the first tweet seen for each topic
    topic2wordsmap.setdefault(topic, []).extend(tag2tweetmap[tag].split())

for i in topic2wordsmap:
    words = topic2wordsmap[i]
    print("Topic {} has words {}".format(i, words[:5]))

The result is the list of topics and commonly used words in each, respectively.

It is clear from the words associated with the topics that they represent certain sentiments. Topic 0 is about Caution, Topic 1 is about Actions, Topic 2 is about Climate, etc.

End Notes

This article shows how to implement the Capstone-Chennai Floods study using Python and its libraries. The tutorial gives an introduction to various Natural Language Processing (NLP) workflows, such as accessing Twitter data, pre-processing text, exploration, clustering, and topic modeling.



Steps For Effective Text Data Cleaning (With Case Study Using Python)


One of the first steps in working with text data is to pre-process it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy; to achieve better insights or build better algorithms, it is necessary to work with clean data. Social media data, for example, is informal communication: typos, bad grammar, slang, and unwanted content such as URLs, stop-words, and expressions are the usual suspects.

In this blog, I therefore discuss these possible noise elements and how you could clean them step by step, using Python.

As a typical business problem, assume you are interested in finding which features of an iPhone are most popular among fans. You have extracted consumer opinions related to the iPhone, and here is a tweet you extracted:


Steps for data cleaning:


Here is what you do:

Escaping HTML characters: Data obtained from the web usually contains HTML entities such as &lt;, &gt;, and &amp; embedded in the original text. It is necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Another is to use an appropriate package or module (for example, Python's HTMLParser), which converts the entities back to the characters they stand for: &lt; becomes “<” and &amp; becomes “&”.
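As a small illustration of the second approach, the sketch below uses html.unescape from the Python 3 standard library. The sample text is made up for the example and is not the tweet from the case study.

```python
import html

# Made-up raw text containing HTML entities
raw = "I luv my &lt;3 iphone &amp; you&#39;re awsm"

# Convert named and numeric entities back to the characters they stand for
unescaped = html.unescape(raw)
print(unescaped)
```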

Decoding data: This is the process of transforming information from complex symbols into simple, easier-to-understand characters. Text data may come in different encodings, such as “Latin” or “UTF-8”. For better analysis, it is therefore necessary to keep all the data in one standard encoding. UTF-8 is widely accepted and is the recommended choice.

tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')



Apostrophe lookup: To avoid ambiguity in the text, it is recommended to maintain proper structure and follow the rules of context-free grammar. When apostrophes are used, the chance of ambiguity increases.

For example, “it’s” is a contraction of “it is” or “it has”.

All contractions should be expanded into their standard forms. One can use a lookup table of all possible keys to remove the ambiguity.

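APOSTROPHES = {"'s": " is", "'re": " are", ...}  ## Need a huge dictionary
words = tweet.split()
reformed = []
for word in words:
    # Expand the first matching contraction suffix, e.g. "it's" becomes "it is"
    for suffix, expansion in APOSTROPHES.items():
        if word.endswith(suffix):
            word = word[:-len(suffix)] + expansion
            break
    reformed.append(word)
reformed = " ".join(reformed)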



Removal of stop-words: When the analysis needs to be data driven at the word level, commonly occurring words (stop-words) should be removed. One can either create a long list of stop-words or use predefined language-specific libraries.
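A minimal sketch of the first option, using a tiny hand-written stop-word set. The set and the function name are assumptions for illustration; a real list, or one taken from a language library, would be much longer.

```python
# Tiny hand-written stop-word set, for illustration only
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "my"}

def remove_stop_words(text):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

cleaned = remove_stop_words("The display of my iPhone is awesome")
print(cleaned)
```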

Removal of punctuation: Punctuation marks should be dealt with according to their priority. For example, “.”, “,”, and “?” are important punctuation marks that should be retained, while others need to be removed.
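One possible way to express this rule is a regular expression that deletes everything except word characters, whitespace, and the marks to keep. The function name and the default set of kept marks are assumptions for this sketch.

```python
import re

def strip_punctuation(text, keep=".,?"):
    """Remove every character that is not a word character, whitespace, or in keep."""
    pattern = r"[^\w\s" + re.escape(keep) + r"]"
    return re.sub(pattern, "", text)

result = strip_punctuation("Wow!!! Is this #iPhone real? Yes, it is.")
print(result)
```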

Removal of expressions: Textual data (usually speech transcripts) may contain human expressions like [laughing], [Crying], or [Audience paused]. These expressions are usually not relevant to the content of the speech and hence need to be removed. A simple regular expression can be useful in this case.
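A sketch of such a regular expression, assuming the expressions always appear inside square brackets as in the examples above. The function name and the sample sentence are made up.

```python
import re

def remove_expressions(text):
    """Delete [bracketed] annotations and collapse the leftover whitespace."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

speech = remove_expressions("Thank you all [laughing] for coming [Audience paused] today")
print(speech)
```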

Split attached words: Text generated on social forums is completely informal in nature. Many tweets contain attached words like RainyDay or PlayingInTheCold. These can be split into their normal forms using simple rules and regular expressions.

import re

# Split at each capital letter; text before the first capital is not matched
cleaned = " ".join(re.findall('[A-Z][^A-Z]*', original_tweet))



Slang lookup: Social media text contains a large share of slang words. These should be transformed into standard words. For example, luv becomes love and Helo becomes Hello. The same approach as the apostrophe lookup can be used to convert slang to standard words. A number of sources on the web provide lists of slang words, and you could use them as lookup dictionaries for the conversion.

tweet = _slang_lookup(tweet)



Standardizing words: Sometimes words are not in proper formats. For example: “I looooveee you” should be “I love you”. Simple rules and regular expressions can help solve these cases.

import itertools

# Cap every run of repeated characters at two
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
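To see what the groupby rule does on the example sentence, here is the same idea wrapped in a function. The function name and the max_run parameter are introduced only for this sketch.

```python
import itertools

def cap_repeats(text, max_run=2):
    """Shorten every run of identical characters to at most max_run."""
    return "".join("".join(run)[:max_run] for _, run in itertools.groupby(text))

capped = cap_repeats("I looooveee you")
print(capped)
```

The rule on its own yields "I loovee you", not "I love you": it only shortens the runs, so a dictionary or spelling-correction step is still needed to reach the standard word.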




Final cleaned tweet:



Advanced data cleaning:

Grammar checking: Grammar checking is mostly learning based: models are trained on a huge amount of well-formed text and then used for grammar correction. Many online tools are available for this purpose.

Spelling correction: Natural language text contains spelling errors. Companies like Google and Microsoft have achieved a decent accuracy level in automated spelling correction. One can use approaches like Levenshtein distance or dictionary lookup, or other modules and packages, to fix these errors.
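The sketch below combines the two ideas mentioned above: it computes the Levenshtein (edit) distance with the standard dynamic-programming recurrence and picks the closest word from a small dictionary. The dictionary and the function names are hypothetical, and production spell checkers are considerably more sophisticated.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

# Hypothetical mini dictionary; a real one would be far larger
DICTIONARY = ["love", "live", "hello", "happy"]

def correct(word):
    """Return the dictionary word closest to word by edit distance."""
    return min(DICTIONARY, key=lambda candidate: levenshtein(word, candidate))

print(levenshtein("kitten", "sitting"))
print(correct("loovee"))
```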

End Notes:

Go Hack 🙂


