Heart Disease Prediction Using Machine Learning


Plotting Libraries

Metrics for Classification technique

Scaler

Model building
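The four headings above refer to the groups of libraries used throughout this walkthrough. A minimal set of imports that the later snippets assume (the exact list in the original notebook may differ) is:

import numpy as np
import pandas as pd

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# metrics for the classification technique
from sklearn.metrics import accuracy_score

# scaler
from sklearn.preprocessing import StandardScaler

# model building
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier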

Here we will use the pandas read_csv function to read the dataset. Specify the location of the dataset and import it.

Importing Data
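The import step itself is not shown in this copy; a minimal sketch, assuming the dataset is saved locally as heart.csv, would be:

data = pd.read_csv('heart.csv')  # assumed path/filename of the dataset
data.head()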

Output:

Now, let’s see the size of the dataset

data.shape

Output:

(303, 14)

Inference: We have a dataset with 303 rows, which is a fairly small amount of data.

Python Code:
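The snippet itself was not preserved in this copy; based on the data-type and missing-value discussion that follows, it was presumably a call such as:

data.info()  # lists each column's dtype and non-null count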



Out of the 14 features, 13 are of int type and only one is of float type.

Woah! Fortunately, this dataset doesn’t hold any missing values.

Now that we have some information about each feature, let's see how the dataset is spread statistically.

data.describe()

Output:

It is always good practice to check the correlation between features so that we can see which are negatively and which are positively correlated. Let's check the correlation between the various features.

plt.figure(figsize=(20, 12))
sns.set_context('notebook', font_scale=1.3)
sns.heatmap(data.corr(), annot=True, linewidth=2)
plt.tight_layout()

Output:

So far we have checked the correlation between the features, but it is also good practice to check the correlation of each feature with the target variable.

So, let’s do this!

sns.set_context('notebook', font_scale=2.3)
data.drop('target', axis=1).corrwith(data.target).plot(kind='bar', grid=True, figsize=(20, 10), title="Correlation with the target feature")
plt.tight_layout()

Output:

Inference: Insights from the above graph are:

Four features ("cp", "restecg", "thalach", "slope") are positively correlated with the target feature.

Other features are negatively correlated with the target feature.

We have done enough collective analysis, so now let's move on to analyzing the individual features, covering both univariate and bivariate analysis.

Age(“age”) Analysis

Here we will check the 10 most frequent ages and their counts.

plt.figure(figsize=(25, 12))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=data.age.value_counts()[:10].index, y=data.age.value_counts()[:10].values)
plt.tight_layout()

Output:

Inference: Here we can see that age 58 has the highest frequency.

Let’s check the range of age in the dataset.

minAge = min(data.age)
maxAge = max(data.age)
meanAge = data.age.mean()
print('Min Age :', minAge)
print('Max Age :', maxAge)
print('Mean Age :', meanAge)

Output:

Min Age : 29 Max Age : 77 Mean Age : 54.366336633663366

We should divide the Age feature into three parts – “Young”, “Middle” and “Elder”
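The Young, Middle and Elder groups used in the next snippet are not defined in this copy; one possible split, with assumed age cutoffs, is:

# assumed cutoffs for illustration; the original article may have used different boundaries
Young = data[(data.age >= 29) & (data.age < 40)]
Middle = data[(data.age >= 40) & (data.age < 55)]
Elder = data[data.age >= 55]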

plt.figure(figsize=(23, 10))
sns.set_context('notebook', font_scale=1.5)
sns.barplot(x=['young ages', 'middle ages', 'elderly ages'], y=[len(Young), len(Middle), len(Elder)])
plt.tight_layout()

Output:

Inference: Here we can see that elder people are the most affected by heart disease and young ones are the least affected.

To confirm the above inference, we will plot a pie chart.

colors = ['blue', 'green', 'yellow']
explode = [0, 0, 0.1]
plt.figure(figsize=(10, 10))
sns.set_context('notebook', font_scale=1.2)
plt.pie([len(Young), len(Middle), len(Elder)], labels=['young ages', 'middle ages', 'elderly ages'], explode=explode, colors=colors, autopct='%1.1f%%')
plt.tight_layout()

Output:

Sex("sex") Feature Analysis

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['sex'])
plt.tight_layout()

Output:

Inference: Here it is clearly visible that the ratio of males to females is approximately 2:1.

Now let’s plot the relation between sex and slope.

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['sex'], hue=data["slope"])
plt.tight_layout()

Output:

Inference: Here it is clearly visible that the slope value is higher for males (1).

Chest Pain Type("cp") Analysis

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['cp'])
plt.tight_layout()

Output:

Inference: As seen, there are 4 types of chest pain:

least pain

slight discomfort

medium problem

severe pain

Analyzing cp vs target column
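The code that produced the graph below is not included in this copy; a minimal countplot sketch of cp split by target would be:

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(x=data['cp'], hue=data['target'])  # chest pain type split by target
plt.tight_layout()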

Inference: From the above graph we can make some inferences,

People having the least chest pain are not likely to have heart disease.

People having severe chest pain are likely to have heart disease.

Elderly people are more likely to have chest pain.

Thal Analysis

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['thal'])
plt.tight_layout()

Output:

Target

plt.figure(figsize=(18, 9))
sns.set_context('notebook', font_scale=1.5)
sns.countplot(data['target'])
plt.tight_layout()

Output:

Inference: The ratio between 1 and 0 is much less than 1.5, which indicates that the target feature is not imbalanced. Since the dataset is balanced, we can use accuracy_score as the evaluation metric for our model.

Feature Engineering

Now we will see the complete description of the continuous as well as the categorical features and separate them.

categorical_val = []
continous_val = []
for column in data.columns:
    print(f"{column} : {data[column].unique()}")
    if len(data[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)

Output:

First we remove the target column from our feature set, then we encode all the categorical variables using the get_dummies method, which creates a separate column for each category. For example, if a variable X contains 2 unique values, get_dummies will create 2 separate columns for X.

categorical_val.remove('target')
dfs = pd.get_dummies(data, columns=categorical_val)
dfs.head(6)

Output:

sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dfs[col_to_scale] = sc.fit_transform(dfs[col_to_scale])
dfs.head(6)

Output:

Modeling

Splitting our Dataset

X = dfs.drop('target', axis=1)
y = dfs.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The KNN Machine Learning Algorithm

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred1 = knn.predict(X_test)
print(accuracy_score(y_test, y_pred1))

Output:

0.8571428571428571
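Accuracy alone can hide which class the model misclassifies. If you want a fuller picture (this is an addition, not part of the original walkthrough), scikit-learn's confusion matrix and classification report can be printed for the same predictions:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred1))       # rows: actual class, columns: predicted class
print(classification_report(y_test, y_pred1))  # precision, recall and F1 per class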

1. We performed data visualization and analysis of the target variable, the age feature, and others, including both univariate and bivariate analysis.

2. We then did feature engineering (encoding the categorical variables and scaling the continuous ones) followed by model building.

3. From the above output, the KNN model gives an accuracy of roughly 86% on the test set.

Endnotes

Read on AV Blog about various predictions using Machine Learning.

About Me

Greetings, everyone! I'm currently working at TCS, and previously I worked as a Data Science Analyst at Zorba Consulting India. Along with full-time work, I have an immense interest in Data Science and the other subsets of Artificial Intelligence, such as Computer Vision, Machine Learning, and Deep Learning; feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).

Hope you liked my article on Heart Disease Prediction. You can access my other articles, published on Analytics Vidhya as part of the Blogathon, via this link.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


Top Motorbikes That Are Using Machine Learning Models


Motorbikes have come on leaps and bounds in the last five years. At this point, motorcycle AI is a high contender for the next big innovation for futuristic motorcycles. Self-learning technology is already a huge part of our lives. Industries such as healthcare and e-commerce greatly benefit from this technology, and the motorcycle industry is no exception. Thanks to machine learning, electric motorcycles can now learn and adapt to each individual rider to improve the riding experience with every journey.

That being said, the most influential way in which self-learning technology has revolutionized riding is perhaps through motorcycle safety. While motorcycling is classed as a more dangerous form of transportation than most, it is among the most common worldwide. So a logical application where technology can help is augmenting rider awareness, resulting in safer motorcycle riders. Damon is one of the motorcycle manufacturers starting to focus more on motorcycle safety, as is evident in its in-house, industry-disrupting software. From the 100% electric powertrain, HyperDrive™, to the award-winning CoPilot™ Advanced Warning System for Motorcycles (AWSM), its technology works toward the goal of no fatal accidents on any of the HyperSport motorcycles by 2030.

Gigi Dall'Igna, Ducati Corse General Manager, who has two World Superbike titles to his name, among others, was given the challenging task of steering the Ducati racing ship back on course after its factory racing efforts in both MotoGP and World Superbike began to founder. As such, he turned to big data, besides turning to Lorenzo (not much of a turning for that matter), and implemented the first IoT and AI technologies into Ducati's bikes for the MotoGP competition.

The purpose of the project is to help the Ducati team simply make better decisions when it comes to bike configuration. Each year, the MotoGP bikes need to be configured for 18 tracks, and each time there are endless possibilities. That is where the machine learning algorithms come in, and according to Ducati's statements, they have made a difference in making the right decision when it comes to bike setup. To go big on big data, Ducati implemented an AI and IoT project so they can simulate the behavior and performance of the bike under various conditions. The sensors on the bike, ranging from 40 to 100, collect data such as speed, engine running parameters, revs, tire and brake temperatures, acceleration, oscillation, vibration, and grip. Once the data is collected, AI is applied to figure out the right configuration. According to their statements, around 4,000 sectors of race tracks and 20 different racing scenarios have been analyzed, with a wider roll-out of the solution expected. Moreover, the machine learning techniques can also predict the performance and behavior of the bike after a setting change. More details on silicon.co.uk.

When it comes to bikes, Ducati is not the only manufacturer turning to big data for insights. Yamaha also goes big on AI and ML and created an updated version of its self-driving motorcycle that, after 3 years of learning, went on a circuit and competed with Valentino Rossi's time. Equipped with a humanoid robot, MOTOBOT managed to do a complete lap of the circuit, but without coming close to Rossi's time. We're still impressed. And a bit freaked out: Yamaha boldly predicts the bot will outperform Rossi within two years, and that freaks us out even more.

However, the purpose of the project is not to build a bike that could compete in MotoGP, but to improve existing street bikes, making them safer for riders.

Roadmap to Study AI, Machine Learning, and Deep Learning

Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) are three fast-emerging and intriguing technological disciplines with a wide range of applications, such as self-driving cars and face recognition systems. Because of their complexity, these topics may appear difficult to understand. Yet success in these domains requires a solid foundation in computer science, mathematics, and statistics, along with familiarity with common libraries and modeling tools.

This article outlines a learning route for AI, ML, and DL, covering key ideas, tools, and methodologies. The roadmap provides a clear path for starting your learning journey and equips you with the abilities needed to flourish in these subjects.

Road Map

Here is a roadmap to help you get started −

1. Understand the Basics

Before delving into the more complicated components of AI, it is critical to grasp the fundamentals. Linear algebra, calculus, statistics, and probability theory are all included. You should also be comfortable with programming languages like Python, Java, and C++. A solid foundation in mathematics and programming can help you understand AI topics more readily.

2. Learn the Foundations of AI

You may begin learning the principles of AI once you have a good foundation in mathematics and programming. Understanding the many forms of learning, such as supervised, unsupervised, and reinforcement learning, is essential. You’ll also need to familiarise yourself with decision trees and clustering methods. On these topics, there are several free online courses and tutorials accessible.

3. Study Machine Learning

When you’ve grasped the fundamentals of AI, you may progress to Machine Learning. You’ll need to understand the methods for regression, classification, and clustering. You’ll also need to understand how to preprocess data, do feature engineering, and choose a model. There are also several online courses and tutorials available on these subjects.

4. Understand Deep Learning

Deep Learning is a major branch of Machine Learning (ML) that learns from data using neural networks inspired by the human brain. Backpropagation, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and autoencoders are all topics that must be understood before diving deeper. TensorFlow and PyTorch are two popular deep-learning libraries. Understanding deep learning is crucial since it is used in many disciplines, including natural language processing, computer vision, and many more.

5. Learn About Natural Language Processing

Natural Language Processing is a branch of AI that can be tackled with the help of ML and deep learning. It deals with computer systems understanding and interpreting human language, i.e. the words and phrases in text. Tokenization (splitting sentences into tokens), stemming (reducing each word to its base form), part-of-speech tagging (assigning a part of speech to each word), and named entity recognition are all abilities you'll need. The NLTK library is a well-known NLP library. Learning NLP may help you design chatbots, sentiment analysis, and other applications.

6. Study Computer Vision

Computer Vision is the study of pictures and movies. You’ll need to learn about picture categorization, feature extraction, and object detection. OpenCV is a well-known computer vision library. Image and video processing has become a crucial ability for AI specialists due to the proliferation of cameras.

7. Practice, Practice, Practice

It is vital to put your newfound knowledge into action. Work on small projects and apply your expertise to real-world problems. Kaggle is an excellent platform for discovering datasets and competing against other data scientists. Participating in hackathons and designing applications might help you enhance your skills.

8. Keep up with the Latest Research

AI is a fast-changing topic, and it is critical to stay up to date on the newest research and breakthroughs in the field. Attend conferences and study research papers to keep current. Keeping up with the newest research might help you develop creative solutions.

9. Build a Portfolio

Creating a portfolio of your work and achievements will help you demonstrate your abilities and stand out to potential employers. You may build a website for your portfolio or upload your creations to GitHub. Possessing a portfolio showcases your practical talents and can help you find a job.

10. Network with Others

Connecting with other practitioners through communities, meetups, and open-source projects helps you learn faster and discover new opportunities.

Conclusion

Learning AI, machine learning, and deep learning can seem overwhelming, but a systematic approach can help. By building a strong foundation in computer science, mathematics, and statistics, and learning to use popular libraries and tools, one can develop the skills needed to excel in these exciting and rapidly evolving fields. Following this roadmap can help you start your learning journey and equip you with the knowledge and expertise to thrive in AI, ML, and DL.

10 Best Machine Learning Start-Ups

This is the list of the 10 most exciting machine learning start-ups you should be following in 2023. Artificial Intelligence has been a hot area of innovation in recent years and ML is one of the major sections of the whole AI arena.

ML is not without its problems. ML frameworks and models require a combination of data science, engineering, and development skills. It is a difficult task to acquire and deal with the data required to prepare and create ML models. Deploying ML innovation in real-world organizational systems is another challenge.

Let’s take a look at ten companies that are working on machine learning. Some have been around for years, others are just starting.

10 Best Machine Learning Start-Ups

1. AI.Reverie

AI.Reverie develops AI and machine-learning technology for data generation and data labeling. The company's simulation platform is used to acquire, organize, and annotate the large amounts of data necessary to develop AI applications and train computer vision algorithms.


2. Anodot

Anodot's Deep 360 autonomous business monitoring platform uses ML to continuously monitor business metrics, detect anomalies, and help assess business performance.

Anodot's algorithms are context-aware and can understand business metrics in a way that helps clients reduce incident costs by up to 80%. Anodot has been granted patents for its technology and algorithms, such as anomaly scoring and anomaly correlation.

3. BigML

BigML is a machine learning platform that can be used to build and maintain data models and to make data-driven, highly automated decisions.

BigML's scalable and programmable machine learning platform automates classification, regression, time series forecasting, cluster analysis, anomaly detection, association discovery, topic modeling, and other related tasks.

BigML's preferred partner program supports referral partners, partners that resell BigML, and those who manage implementation projects.


4. StormForge

StormForge is a cloud-native, machine learning-based application testing tool that helps organizations improve Kubernetes application performance.

This week the company acquired the German firm Stormforger and its performance testing-as-a-platform technology, which has been rebranded and integrated as the StormForge Platform.

5. Comet.ML

Comet.ML is a cloud-based machine learning platform that helps data scientists and AI teams to track datasets, experiment history, and production models.


6. Dataiku

Dataiku's Dataiku DSS platform (Data Science Studio) aims to make AI and ML more widely available in data-driven businesses. Dataiku DSS can be used by data analysts and scientists to perform a variety of data science, AI, and analysis tasks.

Dataiku raised an incredible US$100 million in Series D funding in August, taking its total financing to US$247 million.

Dataiku's partner ecosystem includes service partners, technology partners, and analytics specialists.

7. DotData

DotData claims its DotData Enterprise AI and data science platforms can reduce the time it takes to complete AI and business improvement projects. The company says its framework makes data science processes simple enough for anyone, not just data scientists.


8. Eightfold.AI

In late October, Eightfold.AI announced a funding round of US$125 million, which puts the start-up's value at over US$1 billion.

9. H2O.ai

H2O.ai aims to "democratize" artificial intelligence for a broad range of users.


10. OctoML

OctoML's Octomizer allows businesses and organizations to quickly put deep learning models into production on different CPU and GPU hardware, both at the edge and in the cloud.

Feature Engineering For Machine Learning

Feature engineering is the practice of altering data in order to improve the performance of machine learning models. It is a critical component of the machine learning process because it ensures the quality of the features, which have a significant influence on the model. A machine learning practitioner who is well-versed in feature engineering is more likely to produce superior models. This post will go through several approaches to feature engineering in machine learning.

Feature Engineering Methods

There are many types of data and depending on the type of data, a feature engineering method is chosen. Below is a list of some feature engineering techniques −

1. Feature scaling

This method entails scaling a feature's values into a common range, such as 0 to 1 or -1 to 1, to ensure that each feature has equal weight in the model.

The following techniques are commonly used for feature scaling (a short code sketch follows this list) −

Min-Max scaling entails rescaling the feature's values to a range between 0 and 1, as calculated by the formula: X_scaled = (X - X_min) / (X_max - X_min).

Standardization is the process of scaling the values of a feature to have a mean of 0 and a standard deviation of 1, as computed by the formula: X_scaled = (X - X_mean) / X_std.

Log transformation − This entails employing a logarithmic function to change the values of the feature, which can assist to lessen the influence of outliers and enhance data distribution.
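A brief sketch of these three transformations with scikit-learn and NumPy; the income column is a made-up example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'income': [20000, 35000, 50000, 120000]})  # hypothetical feature

# Min-Max scaling to the range 0-1
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']])

# Standardization to mean 0 and standard deviation 1
df['income_std'] = StandardScaler().fit_transform(df[['income']])

# Log transformation to reduce the influence of large outliers
df['income_log'] = np.log1p(df['income'])
print(df)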

2. Feature Extraction

It is the process of deriving new features from the existing data.

Below are the different methods to extract features from data −

PCA − Principal Component Analysis. It is a process in which we reduce the dimensionality of the data while capturing the important patterns and correlations in it (a short sketch follows this list).

Independent component analysis (ICA) is the process of detecting separate sources of variability in data and dividing them into distinct features that encapsulate different elements of the data.

Wavelet transform − This involves analyzing the data at different scales and frequencies, and extracting new features that capture the patterns and relationships at each scale.

Fourier transform − This involves analyzing the data in the frequency domain and extracting new features that capture the frequency components of the data.

Convolutional neural networks (CNNs) − This involves using deep learning techniques to automatically extract features from high-dimensional and complex data, such as images and audio.
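As one concrete example from the list above, a minimal PCA sketch with scikit-learn, run here on the bundled digits dataset purely for illustration:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 pixel features per image
pca = PCA(n_components=10)                    # keep the 10 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                        # (1797, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained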

3. Feature Selection


This entails picking a subset of the most relevant characteristics in order to minimize data dimensionality and enhance model performance.

There are various methods for selecting features, including −

Filter techniques entail rating the characteristics based on some statistical measure, such as correlation or mutual information, and picking the features with the highest ranking.

Wrapper approaches entail employing a machine learning algorithm to assess the performance of several subsets of features and picking the subset with the greatest performance.

Embedded approaches include picking the most relevant characteristics during the machine learning algorithm's training phase, for example through regularization or decision tree-based algorithms.

Dimensionality reduction approaches entail translating the original characteristics into a lower-dimensional representation, such as principal component analysis (PCA) or singular value decomposition (SVD).

The feature selection approach used is determined by the nature of the data and the model’s needs. In general, filter techniques are quicker and more efficient, but may not capture the entire complexity of the data, whereas wrapper methods and embedding methods are more accurate but can be computationally expensive.
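For instance, a filter-method sketch using mutual information in scikit-learn, on a synthetic dataset generated only for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# hypothetical data: 200 samples, 10 features, only 4 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# filter method: keep the 5 features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the retained features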

4. One-hot encoding

Converting categorical variables into numerical features entails constructing a binary indicator variable for each category.

One hot encoding approach is used to express categorical variables into numerical data that may be fed into machine learning algorithms. Each category is represented in one hot encoding by a binary vector that is as long as the number of categories and has a value of 1 in the position that corresponds to the category and 0s in all other locations.

Because many machine learning algorithms cannot handle categorical data directly, one hot encoding is required. We may utilize categorical variables as input for algorithms by transforming them into numerical data. Because each category is represented by a binary vector of the same length, one hot encoding assures that each category is equally weighted.
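A minimal pandas sketch of one-hot encoding; the color column is a made-up example:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})  # hypothetical categorical column
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)  # one 0/1 indicator column per category: color_blue, color_green, color_red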

5. Binning

This entails categorizing numerical data into discrete bins in order to lessen the influence of outliers and increase model resilience.

Binning can be done in a variety of methods, including −

Equal-width binning is the process of separating a range of values into bins of equal width. For instance, if we have a feature with values ranging from 0 to 100 and wish to generate 5 bins, each bin would have a 20-unit range (0-20, 21-40, 41-60, 61-80, 81-100).

Equal frequency binning involves dividing the data into bins with roughly the same number of data points in each. This method may be useful when the data distribution is skewed.

Custom binning − The boundaries of the bins are manually determined based on domain expertise or other criteria.

Binning may be beneficial when the relationship between the feature and the target variable is not linear, or when a feature has too many unique values to be used efficiently in a machine-learning technique. Nevertheless, it can cause information loss and does not always improve performance, so it is important to assess its influence on model performance before using it.
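A short pandas sketch of the three binning styles; the ages are made up for illustration:

import pandas as pd

ages = pd.Series([23, 31, 38, 45, 52, 67, 74])  # hypothetical numerical feature

# equal-width binning: three bins spanning equal ranges
equal_width = pd.cut(ages, bins=3)

# equal-frequency binning: three bins with roughly the same number of points
equal_freq = pd.qcut(ages, q=3)

# custom binning: boundaries chosen from domain knowledge (assumed cutoffs)
custom = pd.cut(ages, bins=[0, 30, 55, 100], labels=['young', 'middle', 'elder'])
print(custom.tolist())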

6. Text Processing

Text processing is the alteration and analysis of text material, typically with the goal of extracting useful information. This might cover a wide range of tasks, from basic operations like removing punctuation or converting text to lowercase to more challenging tasks like identifying named things or classifying text based on its content.

Text processing methods that are often utilized include −

Tokenization is the process of separating a piece of text into separate words or tokens.

Stopword removal is eliminating frequent terms that aren't beneficial for analysis, such as "the," "and," or "in."

Stemming and lemmatization are strategies that reduce words to their root form (e.g., "running" becomes "run") to improve analysis.

Part-of-speech tagging is marking each word in a document with its grammatical function, such as "noun" or "verb."

Named entity recognition is the process of identifying and classifying entities in a text such as individuals, organizations, and locations.

Sentiment analysis is the process of evaluating text in order to discover the overall sentiment or emotional tone.
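A small NLTK sketch of tokenization, stopword removal, and stemming; the sample sentence is made up, and the two downloads are one-time setup steps:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')       # tokenizer models (one-time download)
nltk.download('stopwords')   # stopword list (one-time download)

text = "The runners were running quickly through the park"  # hypothetical sample
tokens = nltk.word_tokenize(text.lower())
filtered = [t for t in tokens if t not in stopwords.words('english')]
stemmed = [PorterStemmer().stem(t) for t in filtered]
print(stemmed)  # e.g. ['runner', 'run', 'quickli', 'park']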

Conclusion

To summarize, feature engineering is an important phase in machine learning that entails choosing, modifying, and inventing features to improve model performance. Domain expertise, inventiveness, and experimentation are required. While automated feature engineering approaches are being developed, human skill is still required to generate relevant features that capture the underlying patterns in the data.

Datahack Radio #21: Detecting Fake News Using Machine Learning With Mike Tamir, Ph.d.

Introduction

How do you deal with such a sensitive issue? Millions of articles are being churned out every day on the internet – how do you tell real from fake? It’s not as easy as turning to a simple fact checker. They are typically built on a story-by-story basis. Can we turn to machine learning?

It’s a prevalent and pressing issue – and hence we invited Mike Tamir, Ph.D., as our guest on DataHack Radio. Mike has been working on a project called FakerFact that aims to identify and separate truth from fiction. His team’s approach is based on using machine learning algorithms of the Natural Language Processing (NLP) variety.

In this episode, Kunal and Mike discuss several aspects of the FakerFact algorithms, including:

The idea behind FakerFact

How Mike and his team collect data for training the FakerFact NLP algorithms

The importance of updating existing datasets and retraining these algorithms

Dealing with Biases in the data

And much, much more. I would recommend this podcast to EVERY data scientist – it touches on a critical issue plaguing our society.

All our DataHack Radio podcast episodes are available on the below platforms – subscribe today!

I have summarized the episode discussion in this article. Happy listening!

The Idea Behind FakerFact

“The challenge of misinformation has been prevalent for years now and we still haven’t got our arms around it as a society.”

You might know how difficult it is to detect intent in text if you’ve worked on NLP projects. The sheer amount of layers in the human language feels overwhelming! To make a machine understand it – that’s a lot of effort.

Things have been improving however in the last few years. There’s been a huge leap in NLP frameworks. We have demonstrated the ground-breaking developments here. In short, NLP techniques can now parse through the given text and perform all sorts of human-level tasks.

FakerFact is Mike Tamir’s project which he started with a few fellow researchers a couple of years back. Most fact checkers available online tend to be black and white – they attempt to tell you if a piece of given information is real or fake. FakerFact takes a different angle to fact-checking:

“Can we teach machine learning algorithms to tell the difference between bits of text that are just about education, reporting, etc. versus bits of text that are presenting opinions, using satire, are filled with hate speech, have a hidden agenda, etc.?”

You can read more here about how FakerFact works and how you can use it in your browser.

Collecting Data for Training the FakerFact Algorithms and Combating Bias

“That’s one of the hardest challenges in data science.”

Mike and his team start with top-level domains. They use different algorithms for doing a reverse bootstrapping process. This helps the team carry down from the domain level to the individual article level for training.

One of the most important things they have to pay attention to is stratification. This is pretty understandable – you don’t want the model to be biased based on the samples, right? Mike illustrated this point using a brilliant example of right-wing v left-wing articles.

As a data scientist, you are going to love this section of the podcast. It’s really important for us to understand and mitigate bias right at the start of our data collection process. You can imagine how critical that is for a fact-based application like FakerFact.

Most of the fake news datasets we see online are based on certain events, like the 2016 US elections. That's a very specific sample and can lead to serious bias in the model if used exclusively. It's important to diversify using different domains and time periods.

Now, separating the truth from fiction is what FakerFact aims to do. That means it relies on the audience to tell the algorithms whether a particular article is credible or not. But can you rely entirely on your audience to generate that insight? No! The Fakerfact team has several strategies in place to mitigate any bias that might come from user feedback.

Updating the Datasets to Keep Up with the Growing Number of Articles

“We are constantly scraping data. We have millions and millions of articles that are fed into our dataset.”

Of course, this means that with each update, the team needs to run and check their baseline results all over again. Are they performing at the same level? Do they need to change the architecture? Questions like these are essential to keep FakerFact at the top of the game.

Dealing with Unknown Biases in the Data

Anyone who’s worked on even a slightly complicated NLP project knows there’s no smooth sailing to building a model. There will be obstacles along the way. You might miss out on a certain point, or an unknown bias might creep in which no one would have thought of in a million years.

Mike picked up two examples his team encountered when building the FakerFact model. The first was about authors promoting themselves on Twitter.

But it’s the second example that really stood out for me. On a certain website (name mentioned in the podcast), the FakerFact algorithm consistently pulled up the articles. The team couldn’t figure out why – the articles looked like usual journalistic pieces. Can you guess what the issue was?

Mike Tamir’s Industry Experience

Changing gears, Kunal asked Mike to touch on his rich industry experience, especially his previous role at Uber as the Head of Data Science. I have summarized this part of the podcast below:

Creating simulations for autonomous vehicles: At Uber, Mike’s team did open research on creating Q-learning adaptive stress testing. In fact, some of that work will soon be published!

Mike’s other roles involved working on spot pricing, recommendations on the Uber Eats application and various other aspects of autonomous vehicles

The Near Future of Natural Language Processing (NLP)

And finally – where does Mike Tamir see NLP heading in the next few years?

"It's safe to say we'll continue to see dramatic improvements in how we are able to work on text."

2018 was a breakthrough year for NLP. We saw frameworks and libraries like BERT, ULMFiT, and Transformer-XL, among others. But the base for that was built in 2017. Going forward, perhaps in the next 2-3 years, Mike said he could see these techniques merging.

It's already happening in 2019 and should continue to pick up the pace going forward. Really interesting times lie ahead!

End Notes

Fake news is no laughing matter anymore. It has transformed quickly from being a mere nuisance to costing lives around the world. Any step towards dealing with it in the right way is a welcome sight. I personally quite like FakerFact’s approach to this.

I loved Mike’s ability to explain complicated concepts and tie them together into an understandable format. It certainly helps to know how FakerFact functions under the hood and the different ways the team uses to mitigate bias. It’s a goldmine of information for those of us working in NLP.

