SSL Site Crawler

SSL Check – scan your website for non-secure content – Jitbit

Who we are
Hi, we are Alex and Max, the makers of Jitbit Helpdesk – an awesome helpdesk ticketing system used by thousands of companies. Give it a try if you need a ticket-tracking app for your IT department or customer service team.
Why we built this
As you might know, Google has announced that going HTTPS will give you a minor ranking boost. Lots of folks rushed into buying SSL certificates and switching to HTTPS.
But after enabling SSL on your web server, remember to test your pages for “absolute” URLs that point to insecure content such as images and scripts.
API
This page accepts hash parameters; feel free to use them.
What is a “mixed content” error?
When an HTML page is loaded over a secure connection but some embedded resources, such as images, scripts and styles, are loaded from non-secure origins, browsers show a “not secure” warning, scaring your users away.
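For illustration, here is a minimal sketch of that kind of check in Python: fetch a page and list embedded resources that are still requested over plain HTTP. This is not Jitbit's actual implementation, and the tag/attribute list is only a small illustrative subset.

```python
# Minimal mixed-content check (illustrative sketch, not Jitbit's implementation).
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip3 install beautifulsoup4

def find_mixed_content(page_url):
    """Return absolute http:// resource URLs embedded in an https:// page."""
    html = urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    insecure = []
    # Only a few common tag/attribute pairs are checked here.
    for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for element in soup.find_all(tag):
            url = element.get(attr, "")
            if url.startswith("http://"):
                insecure.append(url)
    return insecure

if __name__ == "__main__":
    for url in find_mixed_content("https://example.com/"):
        print("insecure resource:", url)
```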
Comments? Suggestions? Tweet us at @jitbit or @maxt3r. Also, please share this page if you like it.
Crawling HTTPS websites – Funnelback Documentation

Introduction
Some websites are set up to be accessed using the Secure Sockets Layer (SSL) over HTTP (HTTPS) rather than plain HTTP. This means that traffic between the client and web server will be encrypted, allowing for the secure transfer of data. However, in order for Funnelback to successfully search sites like these, several steps must be taken.
Crawler HTTPS configuration
A number of configuration parameters permit the crawler (Funnelback) to gather pages via HTTPS. The relevant parameters are:
crawler.protocols
crawler.sslClientStore
crawler.sslClientStorePassword
crawler.sslTrustEveryone
crawler.sslTrustStore
Required parameter settings
crawler.protocols=http,https : Including https in this parameter is essential, otherwise all https URLs will be rejected by the exclusion rules.
Most sites can be crawled satisfactorily with just these parameters set as above.
In addition, the parameter crawler.sslTrustEveryone is set to “true” by default. This setting ignores invalid certificate chains (both client and server) and host name verification. If you are crawling sites which have valid signed certificate chains then you may wish to reset this to “false”.
Note: The crawler.ssl* parameters are supported by the HTTPClient library only.
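As an illustrative sketch only (key names as reconstructed above; the exact configuration file and syntax depend on your Funnelback version), a collection configured to crawl HTTPS sites with valid certificates might contain:

```
crawler.protocols=http,https
crawler.sslTrustEveryone=false
```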
Troubleshooting HTTPClient SSL operations
Any problems with root certificate validation will be reported in the crawler logs, like this:
HTTPClientTimedRequest: Error: peer not authenticated
or this:
Name in certificate ‘’ does not match host name ‘’
The first can occur if there is something wrong with the server certificate chain – missing or unknown authority. The second often occurs when virtual servers are not included on server certificates.
Further details on run-time certificate validation can be obtained by appending the relevant SSL debug option to the java_options parameter, which will show details of the trust store used and any certificate chains presented. To avoid being swamped with output, tackle one failed certificate at a time.
Having identified the problem, if the missing certificate chain is available it can be added to a trust store using Java’s keytool. That trust store can then be used via the crawler.sslTrustStore parameter. Note, however, that it will replace the default Java trust store, so those default certificates will be unavailable. An alternative is to copy the Java trust store and add the new certificate(s) to that copy (all using keytool), then use the updated copy.
The parameters crawler.sslClientStore and crawler.sslClientStorePassword are provided in case client validation is required by a server. Again, a JKS keystore can be built using Java’s keytool. The crawler.sslClientStorePassword may be required for internal validation of the client certificate store (private keys) at crawler start-up.
If you see the following type of error message in your crawler log files:
handshake alert: unrecognized_name
Connection has been shutdown: handshake alert: unrecognized_name
then this may be caused by the web server not handling SSL/TLS extensions correctly, or by it using a type of encryption that is not supported by Java. In this case you can crawl these types of sites by adding -Djsse.enableSNIExtension=false to the java_options setting in your configuration file. For more information on this setting, please see the JSSE Reference Guide.
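Assuming the same key=value configuration style as the crawler settings above, that addition might look like the following (append the flag to any existing java_options value rather than replacing it):

```
java_options=-Djsse.enableSNIExtension=false
```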
See also
Web collections
BramDriesen/ssl-site-crawler – GitHub

# SSL Site Crawler
An ethical crawler bot to create a safer internet for everyone.
As of July 2018, Google is going to start prominently telling its users that the sites they’re visiting aren’t secure. To combat this issue and create a safer internet for everyone, I decided to create a scraper that will try to scrape all government* websites and check whether they are correctly protected with an SSL certificate.
* Or other sites matching the search criteria/force include setup.
All the data will be stored in a Firebase database so it can later be sanitized and turned into some infographics.
Setup
On a clean Ubuntu installation:
apt-get install git nano
git clone
mkdir firebase
touch
nano and add your JSON config.
Copy the JSON config into this file.
sudo apt-get install -y python3-pip
pip3 install --upgrade pip
Install dependencies:
pip3 install pyyaml
If you get the error: unsupported locale setting
Run: export LC_ALL=C
pip3 install google
pip3 install beautifulsoup4
pip3 install firebase-admin
pip3 install grpcio
If needed alter the configuration files.
Let’s run!
python3
If you get the following error:
UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 821: ordinal not in range(128)
Follow this:
To run in the background on a server, install tmux:
apt-get install tmux
Running with tmux
tmux
Detach from the session: ctrl + b, then d
Re-attach to session
tmux a -t 1
Full tmux cheat sheet:
Firebase setup
Source:
If you don’t already have a Firebase project, add one in the Firebase console. The Add project dialog also gives you the option to add Firebase to an existing Google Cloud Platform project.
Navigate to the Service Accounts tab in your project’s settings page.
Click the Generate New Private Key button at the bottom of the Firebase Admin SDK section of the Service Accounts tab.
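Once the generated key is saved (the path and collection name below are assumptions, and the project may use the Realtime Database rather than Firestore), initializing the Admin SDK from Python looks roughly like this:

```python
# Sketch: initialise the Firebase Admin SDK with a service-account key.
import firebase_admin
from firebase_admin import credentials, firestore

# Assumed location of the downloaded service-account JSON.
cred = credentials.Certificate("firebase/serviceAccountKey.json")
firebase_admin.initialize_app(cred)

# Example write; "sites" is a hypothetical collection name.
db = firestore.client()
db.collection("sites").add({"url": "https://example.gov", "https_ok": True})
```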
Q&A
Q: Are Python 1 & 2 supported?
A: Nope
Q: What is the performance like?
A: Not that great imho, especially if you use the Google search. Nevertheless, the performance depends on a lot of different variables, like: how fast Google can return the results, how fast the target website is, how fast my data transfer to Firebase is, how fast my own internet connection is,…
Q: When is a site marked as unsafe?
A: I decided to mark a website as unsafe when any of the following criteria are met (see the sketch after this list):
No HTTPS available
HTTPS enabled websites that don’t automatically redirect HTTP to HTTPS
Invalid certificates or other certificate errors
If Google happens to return a dead URL, the URL is marked as ‘dead’.
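A minimal sketch of those rules, assuming the requests library (this is not necessarily how the repository implements them):

```python
# Sketch of the "unsafe"/"dead" classification described above (not the repo's code).
import requests

def classify(domain):
    try:
        # No HTTPS, or an invalid certificate, means unsafe.
        requests.get(f"https://{domain}", timeout=10)
    except requests.exceptions.SSLError:
        return "unsafe"          # invalid certificate or other certificate error
    except requests.exceptions.RequestException:
        try:
            requests.get(f"http://{domain}", timeout=10)
            return "unsafe"      # reachable over HTTP, but no HTTPS available
        except requests.exceptions.RequestException:
            return "dead"        # the URL no longer responds at all
    try:
        # HTTPS works; plain HTTP must also redirect to HTTPS automatically.
        resp = requests.get(f"http://{domain}", timeout=10, allow_redirects=True)
        if not resp.url.startswith("https://"):
            return "unsafe"
    except requests.exceptions.RequestException:
        pass                     # HTTP not served at all; HTTPS-only is fine
    return "safe"

print(classify("example.com"))
```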
Q: Should all websites have an SSL certificate?
A: In my opinion, YES!
Q: SSL Certificates are expensive, I don’t want to spend X$ for my small website. What should I do?
A: Take a look at Let’s Encrypt; they offer free SSL certificates for everyone. Furthermore, more and more hosting providers include Let’s Encrypt for free in their offerings.
Q: Should I care about SSL certificates?
A: Yes, you should! From the moment your website has a login form, or any other form, hackers can intercept the communication between the user and your back-end system. If your website is not encrypted, it’s like writing your PIN code on your credit card and handing it to a stranger.
Q: Is this the best way to crawl websites based on keywords?
A: Probably not, but it works fine in my use case.

Frequently Asked Questions about SSL site crawlers

How do I crawl my site over HTTPS?

The six steps to crawling a website include:
1. Configuring the URL sources.
2. Understanding the domain structure.
3. Running a test crawl.
4. Adding crawl restrictions.
5. Testing your changes.
6. Running your crawl.

What does crawling a website mean?

Web crawling is the process of indexing data on web pages by using a program or automated script. These automated scripts or programs are known by multiple names, including web crawler, spider, spider bot, and often shortened to crawler.

How does a crawler work?

Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
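To make that concrete, here is a minimal seed-based crawler sketch in Python (the seed URL and page limit are arbitrary):

```python
# Minimal seed-based crawler: start from known URLs, collect links, repeat.
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=20):
    queue, seen = list(seeds), set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            soup = BeautifulSoup(urlopen(url, timeout=10).read(), "html.parser")
        except (OSError, ValueError):
            continue                      # dead, unreachable, or non-HTTP link
        # Hyperlinks found on this page become candidates for the next round.
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return seen

print(crawl(["https://example.com/"]))
```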
