Linkedin Web Scraper

Use Selenium & Python to scrape LinkedIn profiles

It was last year that the legal battle HiQ Labs v. LinkedIn first made headlines, in which LinkedIn attempted to block the data analytics company from using its data for commercial benefit.
HiQ Labs used software to extract LinkedIn data in order to build algorithms for products capable of predicting employee behaviours, such as when an employee might quit their job.
This technique, known as web scraping, is the automated process of extracting data from the HTML of a web page.
How hard can it be?
LinkedIn has since made its site more restrictive to web scraping tools. With this in mind, I decided to attempt to extract data from LinkedIn profiles just to see how difficult it would be, especially as I am still in the infancy of learning Python.
Tools Required
For this task I will be using Selenium, which is a tool for writing automated tests for web applications. The number of web pages you can scrape on LinkedIn is limited, which is why I will only be scraping key data points from 10 different user profiles.
Prerequisite Downloads & Installs
Download ChromeDriver, which is a separate executable that WebDriver uses to control Chrome. You will also need the Google Chrome browser installed for this to work.
Open your Terminal and enter the following install commands needed for this task.
pip3 install ipython
pip3 install selenium
pip3 install parsel
The time and csv modules are part of Python's standard library, so they do not need to be installed.
Automate LinkedIn Login
In order to guarantee access to user profiles, we will need to login to a LinkedIn account, so will also automate this process.
Open a new terminal window and type "ipython", which is an interactive shell built with Python. It offers extra features such as proper indentation and syntax highlighting.
We will be using the ipython terminal to execute and test each command as we go, instead of having to execute a file. Within your ipython terminal, execute each line of code listed below, excluding the comments. We will create a variable “driver” which is an instance of Google Chrome, required to perform our commands.
from selenium import webdriver
driver = webdriver.Chrome('/Users/username/bin/chromedriver')
driver.get('https://www.linkedin.com')
The get() method will navigate to the LinkedIn website, and the WebDriver will wait until the page has fully loaded before another command can be executed. If you have installed everything listed and executed the above lines correctly, the Google Chrome application will open and navigate to the LinkedIn website.
A notification banner should be displayed in the browser informing you that WebDriver is controlling it.
To populate the text forms on the LinkedIn homepage with an email address and password, Right Click on the webpage, click Inspect and the Dev Tools window will appear.
By clicking on the Inspect Elements icon, you can hover over any element on the webpage and its HTML markup will appear highlighted. The class and id attributes both have the value "login-email", so we can choose either one to use.
WebDriver offers a number of ways to find an element, each starting with "find_element_by_", and by pressing Tab we can display all of the methods available.
The below lines will find the email element on the page and the send_keys() method contains the email address to be entered, simulating key strokes.
username = driver.find_element_by_class_name('login-email')
username.send_keys('')  # enter your email address between the quotes
Finding the password attribute is the same process as the email attribute, with the values for its class and id being “login-password”.
password = driver.find_element_by_class_name('login-password')
password.send_keys('xxxxxx')
Additionally, we have to locate the submit button in order to log in successfully. Below are three different ways in which we can find this element, but we only require one. The click() method will mimic a button click, which submits our login request.
log_in_button = driver.find_element_by_class_name('login-submit')
log_in_button = driver.find_element_by_id('login submit-button')
log_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
log_in_button.click()
Once all of the commands have been successfully tested in the ipython terminal, copy each line into a new Python file saved on your Desktop. Within a new terminal (not ipython), navigate to the directory that the file is contained in and execute the file using a command similar to the following.
cd Desktop
python your_file_name.py
That was easy!
If your LinkedIn credentials were correct, a new Google Chrome window should have appeared, navigated to the LinkedIn webpage and logged into your account.
Code so far…
""" filename: your main scraper file """
# file_name, linkedin_username and linkedin_password are placeholders; they are moved into a parameters file later on
writer = csv.writer(open(file_name, 'wb'))
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])
username = driver.find_element_by_class_name('login-email')
username.send_keys(linkedin_username)
sleep(0.5)
password = driver.find_element_by_class_name('login-password')
password.send_keys(linkedin_password)
sign_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
sign_in_button.click()
Searching LinkedIn profiles on Google
After successfully logging into your LinkedIn account, we will navigate back to Google to perform a specific search query. Similarly to what we have previously done, we will select an attribute for the main search form on Google.
We will use the “name=’q'” attribute to locate the search form and continuing on from our previous code we will add the following lines below.
search_query = driver.find_element_by_name('q')
search_query.send_keys('site:linkedin.com/in/ AND "python developer" AND "London"')
search_query.send_keys(Keys.RETURN)
The search query site:linkedin.com/in/ AND "python developer" AND "London" will return 10 LinkedIn profiles per page.
Next we will be extracting the green URLs of each LinkedIn user's profile. After inspecting the elements on the page, these URLs are contained within a "cite" class. However, after testing within ipython to return the list length and contents, I saw that some advertisements were also being extracted, each of which also includes a URL within a "cite" class.
Using Inspect Element on the webpage, I checked to see if there was any unique identifier separating LinkedIn URLs from the advertisement URLs.

The class value "iUh30" for LinkedIn URLs is different from the advertisement value "UdQCqe". To avoid extracting unwanted advertisements, we will only specify the "iUh30" class to ensure we only extract LinkedIn profile URLs.
linkedin_urls = driver.find_elements_by_class_name('iUh30')
Next we reassign the "linkedin_urls" variable to a list comprehension, which loops over each element in the list and extracts its text.
linkedin_urls = [url.text for url in linkedin_urls]
Once you have assigned the variable "linkedin_urls", you can use it to return the full list contents or to return specific elements within the list, as seen below.
linkedin_urls
linkedin_urls[0]
linkedin_urls[1]
When tested in the ipython terminal, all 10 profile URLs should be contained within the list.
Next we will create a new Python file called "parameters.py" to contain variables such as the search query, file name, email and password, which will simplify our main scraper file.
search_query = 'site:linkedin.com/in/ AND "python developer" AND "London"'
file_name = ''
linkedin_username = ''
linkedin_password = 'xxxxxx'
As we are storing these variables within a separate file called "parameters.py", we need to import that file in order to reference the variables from within our main scraper file. Ensure both files are in the same folder or directory.
import parameters
As we will be inheriting all the variables defined in "parameters.py" via the import above, we need to make changes within our main scraper file so that it references those values from the parameters module.
search_query.send_keys(parameters.search_query)
password.send_keys(parameters.linkedin_password)
We will also import the sleep method from the time module, using it to add pauses between actions so that each command has time to execute fully without interruption.
sleep(2)
from time import sleep
from selenium.webdriver.common.keys import Keys
driver.get('https://www.google.com')
sleep(3)
search_query.send_keys(parameters.search_query)
The fun part, scraping data
To scrape data points from a web page we will need to make use of Parsel, which is a library for extracting data from HTML and XML using selectors. As we already installed this at the start, we also need to import the module within our main scraper file.
from parsel import Selector
After importing Selector from parsel within your ipython terminal, enter "driver.page_source" to load the full source code of the Google search webpage, which looks like something from the Matrix.
As we want to extract data from a LinkedIn account, we need to navigate to one of the profile URLs returned from our search within the ipython terminal, rather than via the browser.
driver.get(linkedin_urls[0])
driver.page_source
We will create a For Loop in our main scraper file to iterate over each URL in the list. Using the get() method, the driver will load the current LinkedIn profile URL in each iteration.
for linkedin_url in linkedin_urls:
    driver.get(linkedin_url)
    sleep(5)
    sel = Selector(text=driver.page_source)
Lastly, we define a "sel" variable, assigning it the full source code of the LinkedIn user's profile page.
Finding Key Data Points
Using a LinkedIn profile as an example, there are several key data points we can extract: the user's name, job title, company, college and location.
Like we have done previously, we will use the Inspect Element on the webpage to locate the HTML markup we need in order to correctly extract each data point. Below are two possible ways to extract the full name of the user.
name = sel.xpath('//h1/text()').extract_first()
name = sel.xpath('//*[starts-with(@class, "pv-top-card-section__name")]/text()').extract_first()
When running the commands in the ipython terminal, I noticed that the text isn't always formatted correctly; the Job Title, for example, can include newlines and extra white space.
However, by using an IF statement for job_title we can apply the strip() method, which will remove the newline symbol and surrounding white space.
if job_title:
    job_title = job_title.strip()
Continue to locate the attribute and value for each data point you want to extract. I recommend using the class name to locate each data point instead of heading tags, e.g. h1, h2. By adding further IF statements for each data point, we can handle any text that may not be formatted correctly.
An example below of extracting all 5 data points previously highlighted.
name = sel.xpath('//*[starts-with(@class, "pv-top-card-section__name")]/text()').extract_first()
if name:
    name = name.strip()
job_title = sel.xpath('//*[starts-with(@class, "pv-top-card-section__headline")]/text()').extract_first()
if job_title:
    job_title = job_title.strip()
company = sel.xpath('//*[starts-with(@class, "pv-top-card-v2-section__entity-name pv-top-card-v2-section__company-name")]/text()').extract_first()
if company:
    company = company.strip()
college = sel.xpath('//*[starts-with(@class, "pv-top-card-v2-section__entity-name pv-top-card-v2-section__school-name")]/text()').extract_first()
if college:
    college = college.strip()
location = sel.xpath('//*[starts-with(@class, "pv-top-card-section__location")]/text()').extract_first()
if location:
    location = location.strip()
linkedin_url = driver.current_url
Printing to console window
After extracting each data point we will output the results to the terminal window using the print() statement, adding a newline before and after each profile to make it easier to read.
print('\n')
print('Name: ' + name)
print('Job Title: ' + job_title)
print('Company: ' + company)
print('College: ' + college)
print('Location: ' + location)
print('URL: ' + linkedin_url)
At the beginning of our code, below the imports section, we will define a new variable "writer", which will create the csv file and insert the column headers listed below.
The previously defined "file_name" is inherited from the "parameters.py" file, and the second parameter 'wb' is required to open the file for writing. The writerow() method is used to write each column heading to the csv file, matching the order in which we will print them to the terminal console.
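For reference, a minimal sketch of that writer setup, assuming the csv module is imported and file_name comes from the parameters file:

import csv
import parameters

# 'wb' opens the csv file for writing; the headers mirror the order printed to the console
writer = csv.writer(open(parameters.file_name, 'wb'))
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])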
Printing to CSV
As we have printed the output to the console, we need to also print the output to the csv file we have created. Again we are using the writerow() method to pass in each variable to be written to the csv file.
writer.writerow([name.encode('utf-8'),
                 job_title.encode('utf-8'),
                 company.encode('utf-8'),
                 college.encode('utf-8'),
                 location.encode('utf-8'),
                 linkedin_url.encode('utf-8')])
We are encoding with utf-8 to ensure all characters extracted from each profile get loaded correctly.
Fixing things
If we were to execute our current code within a new terminal, we would encounter an error when a profile is missing a value: it fails to concatenate a string to display the college value, because no college is listed on that profile and so the variable contains no value.
To account for profiles that are missing data points we are trying to extract, we can write a function "validate_field" which takes "field" as a parameter. Ensure this function is placed at the start of the application, just under the imports section.
def validate_field(field):
    if not field:
        field = 'No results'
    return field
In order for this function to actually work, we have to add the lines below to our code, which pass each value through the validation. If a field doesn't exist, the text "No results" will be assigned to the variable. Add these lines before printing the values to the console window.
name = validate_field(name)
job_title = validate_field(job_title)
company = validate_field(company)
college = validate_field(college)
location = validate_field(location)
linkedin_url = validate_field(linkedin_url)
Let's run our code…
Finally we can run our code from the terminal, with the output printing to the console window and a new csv file being created with the name defined in parameters.py.
Things you could add..
You could easily amend my code to automate lots of cool things on any website to make your life much easier. For the purposes of demonstration and learning within this application, I have overlooked aspects of this code which could be enhanced, such as error handling.
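As one example of the error handling mentioned above, each profile in the loop could be wrapped in a try/except so that a single badly formatted page does not stop the whole run. This is only a sketch of the idea, not part of the original code:

for linkedin_url in linkedin_urls:
    try:
        driver.get(linkedin_url)
        sleep(5)
        sel = Selector(text=driver.page_source)
        # ... extract, validate and write the data points as before ...
    except Exception as error:
        # log the failure and carry on with the next profile
        print('Skipping ' + linkedin_url + ': ' + str(error))
        continue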
Final code…
It was a long process to follow, but I hope you found it interesting. Ultimately, LinkedIn, like most other sites, is pretty straightforward to scrape data from, especially using Selenium. The full code can be requested by contacting me directly via LinkedIn.
Questions to be answered…
Is LinkedIn right to try to prevent third-party companies from extracting our publicly shared data for commercial purposes, such as HR departments or recruitment agencies?
Is LinkedIn trying to protect our data, or hoard it for itself, holding a monopoly on our lucrative data?
Personally, I think that any software which can be used to help recruiters or companies match skilled candidates to better suited jobs is a good thing.
Scraping LinkedIn in 2021: Is it Legal? | by Jeremiah Tang – Medium

Web scraping is essentially extracting data from certain platforms for further processing and transformation into useful outputs. While data scraping may be a sensitive topic in terms of data privacy and its legality, I will provide a breakdown, as well as conclusions, of a prominent LinkedIn scraping lawsuit.

In May 2017, LinkedIn sent hiQ, a web scraping company, a cease-and-desist letter asserting that hiQ was in violation of LinkedIn's User Agreement. The letter demanded that hiQ stop accessing and copying data from LinkedIn's servers, stating that any future access by hiQ would violate state and federal law, including the Computer Fraud and Abuse Act ("CFAA") and the Digital Millennium Copyright Act ("DMCA").

In response, hiQ demanded that LinkedIn recognise hiQ's right to access public pages on LinkedIn and sought a declaratory judgment, a conclusive decision by the court, that LinkedIn could not invoke, among other laws, the CFAA and DMCA against it. hiQ also requested a preliminary injunction against LinkedIn, seeking to prevent LinkedIn from acting on its cease-and-desist letter. The district court granted the preliminary injunction, ordering LinkedIn to withdraw the letter, remove technical barriers to hiQ's access to public profiles, and refrain from implementing legal or technical measures to block hiQ's access to public profiles until a ruling had been made.

LinkedIn appealed this decision to the US 9th Circuit Court of Appeals. In 2019, the 9th Circuit affirmed the district court's preliminary injunction. Following that decision, LinkedIn has further appealed to the US Supreme Court (SCOTUS), but it is unclear whether the court has agreed to hear the appeal. Until a judgment is released by SCOTUS, however, the decision by the 9th Circuit remains good law.

While some observers have hailed the 9th Circuit decision as a golden ticket permitting all types of web scraping, the issue is far more nuanced than that. In fact, the scope of the issue is extremely narrow, turning on the definition of "without authorization". In hiQ's own words, the question before SCOTUS is:

QUESTION PRESENTED: Whether a professional networking website may rely on the Computer Fraud and Abuse Act's prohibition on "intentionally access[ing] a computer without authorization" to prevent a competitor from accessing information that the website's users have shared on their public profiles and that is available for viewing by anyone with a web browser. (Counsel for hiQ in its brief to SCOTUS)

The narrowness of the issue presented to SCOTUS means that the court only has to decide on this one matter, and will not have to consider other potential issues arising from web scraping such as data privacy concerns, breach of contractual terms, or even violations of other state and federal laws. Optimistically, it can be inferred that because LinkedIn decided to pursue the case on this ground instead of through other causes of action, those other causes are less likely to be issues for web scrapers. But the reality is that, web scraping being a relatively new phenomenon, the law surrounding it remains underdeveloped and there is little legal clarity in the area.

While there is still a grey area regarding the legality of web scraping, we can say for certain that web scraping in itself remains legal. This is big news for both individuals and companies alike.
With the large amount of data presented online, there is a tumultuous amount of information available that is difficult to obtain any useful insights from on its own. Thankfully, there are many web scrapers available that are able to tidy up the necessary data and eliminate the white noise. Scraping popular platforms such as Reddit, Twitter, Facebook and especially LinkedIn can be extremely beneficial to companies, as detailed below.

For individuals: The average Joe interested in exploring web scraping can probably get by with free web scraping APIs that can obtain small amounts of data. Some side projects to consider if you are interested in picking up web scraping would be scraping food review websites such as Yelp or Burpple to find the best fried chicken in your country, or scraping social media platforms such as Reddit and Twitter and conducting the necessary analysis to decide your next investment in the stock market. For large-scale projects that require data on millions of individuals, it is definitely not feasible to rely on these free but slow web scraping APIs and wait weeks, if not months, for the data to be collected (if your computer does not overheat and crash by then).

For companies: Apart from food review sites and social media platforms, LinkedIn seems to be the most relevant platform to scrape from for B2B companies. Depending on the magnitude of data you require, there are many paid LinkedIn scraping services that satisfy different needs. A comprehensive list of the top 5 varying LinkedIn scraping services can be found here. This provides a better understanding of what these different companies offer, so you can find the service best suited to your company's data needs. With the data provided by the scraping services, businesses are able to use it for many functions:

Updating its current database: Enrich the current database with up-to-date data
Leads generation for B2B sales: LinkedIn URL/email discovery
Research: Use company data to predict market and industry trends
Human Resources: Improve hiring for ATS and recruitment platforms
Investment (Venture Capitalists): Chart out company performances and decide which companies are performing well
Alumni (Universities): Find out the distribution of their alumni based on location, industry or company, with further transformation of the data

Here at Mantheos we conduct LinkedIn scraping legally, scraping data that is freely and publicly available on LinkedIn. This means that we collect data that is accessible to the general public. Compared to manually searching LinkedIn for people and company profiles, we automate this process for you and aggregate this information into readable files such as excel and json. By engaging our services, you can rest assured that we will provide data that is both safe and useful to your business.

As the legality of web scraping becomes clearer, we can safely say that many forms of web scraping are not deemed illegal by the courts and are permissible. Web scraping is an integral part of the big data revolution and is empowering millions of businesses around the world to optimise their business strategies. With web scraping becoming ever more ubiquitous, the myriad of privacy and contractual issues surrounding web scraping is growing more complex. This forms a potential stumbling block for both web scraping companies and end users. As laws become more rigid and penalties for violations increase, it is now more important than ever before to ensure that your business is not exposed to unnecessary legal risk by unknowingly flouting data laws.
Mantheos prides itself on ensuring that its business practices are fully compliant with all laws and regulations, regardless of jurisdiction.

References: hiQ Labs, Inc. v. LinkedIn Corp., No. 17-16783 (9th Cir. 2019); LinkedIn's appeal to the US Court of Appeals
The definitive guide to build your own Linkedin Profile Scraper

Having built the early prototype for Proxycurl’s Linkedin API, I know a little bit about scraping Linkedin profiles in scale. In this tutorial, I will share my experience building a Linkedin profile scraper that works in 2021, and I hope you will find it useful.
PS: You can scrape Linkedin profiles in real-time with Proxycurl Linkedin API.
To put this tutorial in context, we will preface it with the problem of:
How to crawl 1 million Linkedin profiles and then scrape the pages to structured data?
Breaking down the problem:
How to crawl a million Linkedin profiles and fetch their on-page HTML content
How to scrape the HTML content from a Linkedin profile to structured data
Part 1: How to crawl 1M Linkedin profiles for HTML code
Before we embark on the quest to crawl a million profiles, let’s start with crawling ten profiles. There are only two ways to crawl ten Linkedin profiles for scraping:
As a user logged into Linkedin. (A “logged in user”)
Or, as a user that is not logged into Linkedin. (An "anonymous user.")
1A: Accessing Linkedin profiles as an anonymous user
It requires luck to access a Linkedin profile without being logged into Linkedin.
In my experience, you might be able to access the first profile as an anonymous user if you have not recently clicked into any Linkedin profiles.
Even if you succeed in viewing a public profile anonymously on your first attempt, more likely than not you will be greeted with the dreaded Authwall on your second profile visit.
What is the Authwall and how do you circumvent it?
The Authwall exists to block web scraping from users who are not logged into Linkedin.
If you visit a public profile from a non-residential IP address, such as from a data center IP address, you will get the Authwall.
If you visit a public profile without any cookies in your browser session (aka incognito mode), you will get the Authwall.
If you are visiting a public profile from a non-major browser, you will get the Authwall.
If you are visiting a public profile multiple times, you will get the Authwall.
There are many reasons that you will be greeted with the Authwall when you are crawling anonymously. But there is one way you can reliably bypass it — crawl Linkedin as Googlebot. If you can access a Linkedin public profile page from an IP address that belongs to Google, you can consistently fetch an available Linkedin profile without the Authwall.
What does an IP address from Google mean?
It is an IP address that reverse-resolves to *.googlebot.com. See this Google support page for a clear definition. And no, IP addresses from Google Cloud instances do not work.
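If you want to check whether an address genuinely belongs to Googlebot, the usual test is a reverse DNS lookup followed by a forward lookup. A rough sketch in Python (the IP below is just an illustration from Google's published crawler range):

import socket

def is_googlebot(ip_address):
    # Reverse lookup: genuine Googlebot addresses resolve to googlebot.com or google.com
    try:
        host, _, _ = socket.gethostbyaddr(ip_address)
    except socket.herror:
        return False
    if not (host.endswith('.googlebot.com') or host.endswith('.google.com')):
        return False
    # Forward-confirm: the hostname should resolve back to the same IP
    return socket.gethostbyname(host) == ip_address

print(is_googlebot('66.249.66.1'))  # illustrative address, not a recommendation to spoof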
But, there is one page on Linkedin that you can crawl without restrictions
Put yourself in the shoes of a Linkedin executive. What makes you money? Profile data. Which is why the Authwall is used to lock up profile data.
What else makes Linkedin money? Jobs! Linkedin makes money when companies list jobs on Linkedin. These companies will return to Linkedin again and again if Linkedin succeeds at matching great candidates to their job postings.
Job listing pages on Linkedin are not blocked by the Authwall, in order to maximize page views.
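As a quick way to see this for yourself, a public job listing page can usually be fetched with a plain HTTP request and no login. A minimal sketch, where the job ID in the URL is a made-up placeholder:

import requests

# Placeholder job ID, purely for illustration
url = 'https://www.linkedin.com/jobs/view/1234567890/'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
print(response.status_code, len(response.text))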
1B: Accessing Linkedin profiles logged into Linkedin
You and I are probably not Googlers, which means we do not have access to the range of addresses belonging to Googlebot. But there is respite.
You can log into Linkedin to reliably access Linkedin profiles. However, as tempting as it may be, I highly recommend that you not use your personal Linkedin profile to perform a bulk profile crawl for scraping purposes. You do not want your personal Linkedin profile to be blocked.
And it will be blocked should you scrape past a certain threshold or when Linkedin detects abnormal (automated) behavior in your account.
But yes, log into your Linkedin profile, and you can crawl ten profiles with no problems. And that brings me to the next section — getting from 10 profiles to 1M profiles.
Can I crawl 1M Linkedin profiles to scrape by creating many Linkedin accounts?
It is only natural to veer towards the belief that you can build a Linkedin scraper if you manage a pool of disposable Linkedin accounts. You are not wrong. Building a pool of workers with disposable Linkedin accounts is indeed a feasible method if and only if humans meticulously manage each Linkedin account.
Once you begin automated crawls on any Linkedin account, you will start encountering random Recaptcha challenges, which will keep the account locked until they are solved.
Each Linkedin account in your scraping pool will also require a unique residential IP address.
The short answer is yes. You can crawl 1M Linkedin profiles with many Linkedin accounts with residential IP addresses.
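As a sketch of what "each account on its own residential IP" might look like in practice, Chrome can be pointed at a proxy when the driver is created. The proxy address below is a placeholder, and authenticated proxies need extra handling:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder residential proxy for one worker account
options.add_argument('--proxy-server=http://proxy.example.com:8080')
driver = webdriver.Chrome('/Users/username/bin/chromedriver', options=options)
driver.get('https://www.linkedin.com')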
Recap: What you need to do to crawl 1M profiles
The first step to scraping is to get the HTML code of profiles at scale. In this article, we put a number to "scale": one million profiles. There are only a few ways to crawl 1M Linkedin profiles, and they are:
Access Linkedin from an IP address that resolves as Googlebot
Manage a large pool of workers logged in as individual Linkedin accounts, with each account sitting on a residential IP address
Use Proxycurl API — see the next section.
Using Proxycurl API to fetch 1M Linkedin profiles
Proxycurl is an offering we built that provides a managed service to scrape Linkedin profiles in real-time.
If you ask me which is the best way to scrape Linkedin profiles, then I will tell you in a very biased way to use Proxycurl’s API. Specifically, the Linkedin Person Profile Endpoint that is available with our Linkedin API. When you make an API request to our Linkedin Person Profile Endpoint, our system performs a live crawl and returns the user profile in structured data back to you.
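A rough sketch of such a request with the requests library; the endpoint path, parameter name and token handling here are assumptions from memory, so check Proxycurl's documentation for the exact interface:

import requests

API_KEY = 'your_proxycurl_api_key'  # placeholder
response = requests.get(
    'https://nubela.co/proxycurl/api/v2/linkedin',                 # assumed endpoint path
    params={'url': 'https://www.linkedin.com/in/some-profile/'},   # assumed parameter name
    headers={'Authorization': 'Bearer ' + API_KEY},
)
print(response.json())  # structured profile data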
Part 2: I have HTML code of a profile page, how do I scrape content off it?
Now that you have 1M profiles, it is time to get the content out of the HTML code into structured data. Converting HTML pages to structured data is what I define as "scraping." Crawling profiles gets you a bunch of pages as HTML code. Scraping turns pages of HTML code into machine-readable structured data, like this:
{
 'accomplishment_courses': [],
 'accomplishment_honors_awards': [{'description': 'Nanyang Scholarship '
                                                  'recognizes students who '
                                                  'excel academically, '
                                                  'demonstrate strong '
                                                  'leadership potential, and '
                                                  'possess outstanding '
                                                  'co-curricular records.\n',
                                   'issued_on': {'day': None,
                                                 'month': None,
                                                 'year': 2015},
                                   'issuer': 'Nanyang Technological University',
                                   'title': 'NANYANG Scholarship'},
                                  {'description': 'Awarded to students with '
                                                  'exceptional results in '
                                                  'Physics and Mathematics',
                                   'issuer': 'Defence Science & Technology '
                                             'Agency',
                                   'title': 'Young Defence Scientist Programme '
                                            '(YDSP) Academic Award'},
                                  {'description': 'An annual competition to '
                                                  'encourage the study and '
                                                  'appreciation of Physics as '
                                                  'well as highlight Physics '
                                                  'talent.',
                                   'issued_on': {'day': None,
                                                 'month': None,
                                                 'year': 2012},
                                   'issuer': 'Institute of Physics Singapore',
                                   'title': 'Singapore Junior Physics Olympiad '
                                            '(Main Category) Honourable '
                                            'Mention'},
                                  {'description': 'Certificate awarded to '
                                                  'student who topped the '
                                                  'cohort in all aspects of '
                                                  'Science.',
                                   'issued_on': {'day': None,
                                                 'month': None,
                                                 'year': 2010},
                                   'issuer': 'Xinmin Secondary School',
                                   'title': 'Certificate of Excellence – Top '
                                            'in Science'},
                                  {'description': None,
                                   'issued_on': {'day': 1,
                                                 'month': 9,
                                                 'year': 2018},
                                   'title': "Dean's List FY17/18"},
                                  ...],
 'volunteer_work': []}
Two ways to scrape content from HTML code
There are two ways to scrape content from the HTML page, and the approach to take depends entirely on how the page is crawled.
Two factors decide which is the best method to use:
Is on-page javascript parsed before the HTML code of the profile page is collected?
Is the profile viewed as an anonymous user or as a user logged into Linkedin?
Method matrix for your reference:

                          | Anonymous user | Logged into Linkedin
Javascript not rendered   | Dom Scraping   | Code Chunk Scraping
Javascript is rendered    | Dom Scraping   | Dom Scraping
Dom scraping is the standard method that most developers use for web scraping. You find the data within fixed HTML tags on a page that has been loaded and rendered, and you can fetch most of the content of a profile page by traversing HTML tags either via selectors or XPath.
The problem is that the layout of HTML pages is updated often, and the layout varies according to locale. A profile loaded in an Arabic locale will differ in layout from a profile loaded in English. Every time something changes, expect your scraper to break. Dom scraping is a high-maintenance method, but it is easy to implement.
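A minimal sketch of Dom scraping with beautifulsoup4, the library recommended later in this article; the file name and class name used here are only assumptions about the saved page and its layout:

from bs4 import BeautifulSoup

# 'profile.html' stands in for a page saved during the crawl step
with open('profile.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Fragile by design: this breaks whenever Linkedin renames the class
name_tag = soup.find(class_='pv-top-card-section__name')
name = name_tag.get_text(strip=True) if name_tag else None
print(name)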
Code Chunk Scraping is a superior method reserved for profile pages fetched as a logged-in user, before javascript is rendered. It is a better method because it does not depend on the HTML dom structure, which means that page layout changes on Linkedin will not break this scraping method. Instead, it looks at blobs of JSON data placed in-page within <code> tags; these blobs are used by Linkedin's javascript code to populate the page's dom elements. With the Code Chunk scraping method, you traverse JSON objects instead of Dom elements.
Because the JSON blob data is already stored in a structured manner, we do not have to tokenize strings to re-structure the data; we can return it as it is. That means you do not need to parse "12th March 2020" into a machine-readable Date object.
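A simplified sketch of the Code Chunk idea, assuming the JSON blobs sit inside <code> elements in the raw HTML; the file name is a placeholder and the exact tag and structure may differ from what Linkedin actually serves:

import json
from bs4 import BeautifulSoup

# 'profile_logged_in.html' stands in for raw HTML fetched while logged in, before JS runs
with open('profile_logged_in.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

blobs = []
for tag in soup.find_all('code'):
    try:
        blobs.append(json.loads(tag.get_text()))  # keep only the chunks that parse as JSON
    except ValueError:
        continue

print(len(blobs), 'JSON chunks found')  # each blob can now be traversed like a normal dict/list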
To recap: the Code Chunk scraping method
is faster to crawl because you can skip Javascript parsing
breaks less due to on-page layout changes
but, requires you to be logged into Linkedin when fetching profiles
Here is an example of data traversal with the Code Chunk Scraping method to return Patent achievements from a user profile:
def get_patents(data):
    patent_lis = []
    # the include-row type string was omitted in the original snippet
    for dic in Person._type_in_include_rows(data, ''):
        description = dic.get('description')
        application_number = dic.get('applicationNumber')
        issuer = dic.get('issuer')
        issued_on = None
        issued_on_dic = dic.get('issuedOn', {})
        if issued_on_dic:
            issued_on = Date(issued_on_dic.get('month'),
                             issued_on_dic.get('day'),
                             issued_on_dic.get('year'))
        patent_number = dic.get('patentNumber')
        title = dic.get('title')
        url = dic.get('url')
        patent_lis += [Patent(description=description,
                              application_number=application_number,
                              issuer=issuer,
                              issued_on=issued_on,
                              patent_number=patent_number,
                              title=title,
                              url=url)]
    return patent_lis
So you want to build your own Linkedin Profile Scraper.
In this article, I explained that scraping Linkedin profiles is a two-step process.
The first step is to crawl Linkedin profiles and save the HTML code for further processing in the second step. The second step is to process the HTML code and turn raw HTML code into structured data that you can use in your application.
There are only two methods to crawl Linkedin profiles in scale — anonymously as Googlebot, or via a pool of workers logged into Linkedin with unique residential IP addresses. It is not trivial, but you can get yourself 1M HTML files if you work around these limitations.
The next step is to process these 1M HTML files and turn them into structured data for your application. If you crawled the page without rendering javascript while logged into Linkedin, you should use the Code Chunk Scraping method, which is superior because it breaks far less often. Otherwise, you can perform regular scraping with your favorite Dom traversal library using the Dom Scraping method. (I recommend beautifulsoup4 if you are using Python.)
Even if you are a well-funded startup, it is not trivial to crawl Linkedin data in scale. You need a secret weapon.
Just like how you have chosen AWS instead of building and colocating your server farms, dataset acquisition is a menial task best left as a managed service. I can only write this article in such detail because of the combined expertise of our entire development team and learned experience over the years.
Why crawl Linkedin, when you can get a Postgresql database preloaded with data of Linkedin profiles in the US for $200/mo?
Why manage a profile scraper when you can use our API and perform a live crawl for $0.01 per profile?
I would love to help your business integrate data at the core of your product. Send an email to [email protected] and let me know how I can help you with your data needs! Let Proxycurl be your secret weapon.
The tutorial is not complete without code samples.
In this article, I shared in high-level how you might be able to scrape Linkedin profiles in scale. But a tutorial is not complete without code samples. In the follow-up article, I will be releasing fully-working code samples to complement this article. Please subscribe to Proxycurl’s mailing list here to be notified of the next article with code samples!

