For this week’s article, my idea was to talk about the disciplines people commonly transfer to data science from. All I needed was to collect some data from LinkedIn, analyze it and report what I found. I thought it was going to be an easy feat. I was wrong... It turned out to be a much bigger challenge.
There are some lessons in there that every data scientist should be aware of. So I wanted to share the story of my attempts to collect data from LinkedIn with you.
I can’t just base an article on narrow observations of my own network when talking about where people transfer to data science from. Thus, the first thing I needed was the data. Where can I find people’s background information? LinkedIn of course.
I’ve done some scraping before, so I knew more or less what I needed to do. It comes down to the simple steps of:
Except, it wasn’t simple at all.
Getting the URLs of data scientists all around the world was the easiest part, even though I thought at first that it would be the hardest. The rest didn’t go as smoothly as I had hoped. Normally, it’s very straightforward to get a web page’s source code using Python’s requests library. Except when I tried to get the source of people’s LinkedIn profiles, I was getting a weird-looking JavaScript sort of thing.
After further investigation, I learned that LinkedIn does this on purpose. The only way to scrape its pages is by actually being on the webpage through a browser. Only then can you get the page source code. Someone explained on Stack Overflow that this is because LinkedIn creates these pages dynamically: when you try to get the code using requests, all you’re getting is the code that creates the page, not the dynamically created page itself. Makes sense.
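One rough way to tell which of the two you received is to compare how much of the response sits inside script tags with how much is visible text. The sketch below is only a heuristic built on Python's standard library; the name `script_share` and the idea of using this ratio are my own assumptions for illustration, not something LinkedIn or requests provides.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Counts characters inside <script> tags separately from other text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not count as text.
        n = len(data.strip())
        if self._in_script:
            self.script_chars += n
        else:
            self.text_chars += n


def script_share(html):
    """Return the fraction (0.0 to 1.0) of non-markup characters that are script code."""
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    return parser.script_chars / total if total else 0.0
```

A value close to 1.0 suggests you are looking at the code that builds the page rather than the page itself. You would pass it the text of whatever response you fetched.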
Turns out there is a workaround though. Using a tool (or library) called Selenium and something called a “ChromeDriver”, I was able to get the actual page source.
Selenium works like this: it uses the ChromeDriver to start a Chrome browser window that is controlled by your script. Then it logs you in to the website as if you were typing your username and password yourself. Then it jumps from page to page while your code reads the relevant information from the pages. And you can see what’s happening at all times.
The main caveat of this approach is that you still need to have your script act like a human and not a piece of code and make sure it waits a couple of seconds between jumping from one profile to the next.
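The page-to-page loop with a pause in between can be sketched roughly as follows. This is not my actual script, just a minimal illustration: `make_chrome_driver` and `collect_page_sources` are hypothetical helper names, the 2 to 5 second pause range is an assumed value, and the login step is left out. The Selenium calls used (`webdriver.Chrome()`, `driver.get()`, `driver.page_source`, `driver.quit()`) are part of its standard Python API.

```python
import random
import time


def make_chrome_driver():
    """Start a Chrome window controlled by Selenium (requires selenium and Chrome)."""
    from selenium import webdriver  # imported here so the loop below works without selenium
    return webdriver.Chrome()


def collect_page_sources(driver, urls, min_pause=2.0, max_pause=5.0, sleep=time.sleep):
    """Visit each URL in order and return a dict mapping url to rendered page source.

    Waits a random number of seconds between consecutive pages.
    `driver` can be any object with a get(url) method and a page_source attribute.
    """
    sources = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(random.uniform(min_pause, max_pause))
        driver.get(url)
        sources[url] = driver.page_source
    return sources


# Example usage (placeholder URLs):
# driver = make_chrome_driver()
# try:
#     pages = collect_page_sources(driver, ["https://example.com/a", "https://example.com/b"])
# finally:
#     driver.quit()
```

Passing the `sleep` function in as a parameter is a design choice that makes the loop easy to try out with a fake driver and no real waiting.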
Nevertheless, at this point, I had a script that went through hundreds of data scientists’ profiles and collected data on what titles they held throughout their careers.
Lesson of the first chapter: Data collection is hard. Almost always harder than you imagine it to be.
I was happy. I ran my script and started working on other things while it was running. After half an hour, I got the first results and started reviewing them.
I didn't expect the results to look perfect. I knew the job titles wouldn't be standardized. That's why I was ready to do some processing there to make sure I could group the same or similar titles together. But what I discovered was totally unexpected. There were names of companies instead of job titles for some people.
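The kind of processing I had in mind for grouping similar titles could look something like this. It is a minimal sketch, not the cleaning I actually ran: the abbreviation map and the function names `normalize_title` and `count_titles` are made up for illustration, and a real dataset would need a much longer list of rules.

```python
import re
from collections import Counter

# Hypothetical abbreviation map; extend it based on what shows up in your data.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "eng": "engineer", "dev": "developer"}


def normalize_title(title):
    """Lowercase a title, replace punctuation with spaces and expand known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def count_titles(titles):
    """Count how often each normalized title occurs."""
    return Counter(normalize_title(t) for t in titles)
```

With this, "Sr. Data Scientist" and "Senior data scientist" end up in the same group.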
I figured out a bit too late that LinkedIn does this thing where if you held more than one position in the same company consecutively, that information will be displayed differently than your other experiences. Thus, while collecting data, sometimes my script read the wrong parts of the page and got company names instead of job titles.
It was my mistake for not thinking about this possible case. This is a typical example of what can go wrong during the collection process. Which brings me to my second lesson: Data collection is messy and there will always be incorrectly collected data points. Keep your eyes open for things that don’t make sense.
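A cheap sanity check would have caught this earlier. The sketch below assumes each scraped record is a dict with a "title" and a "company" key, which is a layout I am inventing for the example, and flags records whose title is empty or matches a company name seen anywhere in the data. The function name `flag_suspect_titles` is hypothetical as well.

```python
def flag_suspect_titles(records):
    """Return the records whose 'title' is empty or equals a known company name.

    `records` is a list of dicts with 'title' and 'company' keys (assumed layout).
    """
    known_companies = {
        r["company"].strip().lower() for r in records if r.get("company")
    }
    suspects = []
    for r in records:
        title = (r.get("title") or "").strip().lower()
        if not title or title in known_companies:
            suspects.append(r)
    return suspects
```

It will not catch every bad data point, but reviewing the flagged records by hand is a quick way to spot a systematic parsing problem.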
I fixed my script accordingly and started scraping profiles again. I realized that after scraping around 200 profiles, LinkedIn kept asking me to complete an “I am not a robot” reCAPTCHA test. Of course, I couldn’t do those because I (or my script) was a robot. So I had to restart the script a couple of times. I also realized that LinkedIn logged me out and wanted me to log back in a couple of times.
At first, I thought this was just an inconvenience. Here I am minding my own business, collecting data from LinkedIn to provide information to the next generation of data scientists but LinkedIn keeps kicking me out. After the 3rd time, a question hit me: “Wait, am I doing something wrong here?” I didn’t want to get fined or sued or anything. My main argument was that seemingly everyone was doing it. I found a lot of resources online about how to scrape LinkedIn and no one mentioned it being illegal or wrong.
A quick Google search revealed that it has been decided pretty recently that scraping LinkedIn is legal. [LinkedIn Data Scraping Rules Legal] The court’s decision, in summary, is that people who have public profiles and share information on those public profiles on LinkedIn do not have a reasonable expectation of privacy, and LinkedIn does not own the rights to their information. Thus, LinkedIn cannot prohibit scrapers from scraping, in a legal sense.
This made sense to me. If I’m making anything about me publicly available, it is not illegal for someone to obtain that information. Though at the same time, you could argue that automated software is different from someone reading your profile, because a piece of code can get the info of hundreds of people at once. But would that be any different if someone hired 500 people to go on data scientists’ profiles and collect data? I think it’s sort of a grey area in terms of ethics.
But at the end of the day, a court ruled that it is legal to scrape public information from LinkedIn. So I can go on, right? Well, not really.
Lesson 3: At all times during projects involving personal data, make sure you’re not breaking any laws or acting in an unethical way.
I wanted to read more about what people thought about scraping LinkedIn. I saw that some people online mentioned that if LinkedIn realizes that you are scraping their website they might block the IP you’re doing it from. This is not what I want because I use LinkedIn a lot. Of course, you can work around it by using a VPN. But even then you should use a VPN that makes you look like you are connecting from a residential area and not just a known computer farm (or whatever they’re called). SO. MUCH. WORK.
And even then, there is another problem:
LinkedIn specifically mentions in their user agreement that when you create a profile, you agree that you will not scrape information off their services. Well, the only way you can scrape information off their services is by creating an account and logging in to it. So that’s a dead end.
It is pretty clear that by scraping data off LinkedIn, I would be violating some sort of agreement I have agreed to somehow. Though I’m not sure how legally binding it is.
This was weird to me because there are companies which offer to scrape LinkedIn as a service and make money off of it. But even those companies require you to sign up using your personal LinkedIn account. I am still not 100% sure about what to do there. For now, I decided to put a halt to my data collection because I don't want my account to be banned, deleted or blocked in any way.
Lesson 4: Read the license agreement, user agreement, privacy policy or whatever applies. Using data is getting more and more serious (as it should). You should make sure you have the right to collect and use a piece of data, legally, ethically and contractually.
One additional caution concerns datasets you find online. Always check the terms of use before using them.
-- The End --
So that’s where I’ve left things. I hope I didn't scare you about how complicated it gets sometimes.
For now, I will try to analyse the small amount of data I was able to collect. Who knows, maybe I’ll be feeling brave in the coming days and scrape just a little bit more. To be honest, it's pretty encouraging to know that there is an article hosted on LinkedIn about how to scrape LinkedIn. [Use Selenium & Python to scrape LinkedIn profiles]
If you know more about this topic, please share. I would be happy to learn more. My short adventure taught me to be more careful next time, as I feel like I nearly got caught. I’m not sure whether LinkedIn would have been angry with me or anything, but I would rather not find out the hard way.