To say that pulling data from various internal and external sources is time-consuming is a masterpiece of understatement. Cutting and pasting, homegrown scripts, and applications that record a user’s actions can’t keep up with the pace of business. And over time, demand grows for not only the quantity but also the quality of information. Plenty of information is accessible via public websites, but more data is often hidden behind firewalls and web portals that require login credentials and the ability to navigate the site in order to extract the data. Valuable information is also embedded in PDFs, images, and graphics.

From start-ups to enterprise organizations, and across industries from financial services and transportation to retail and healthcare, acquiring external data is critical. Whether you want to stay in compliance, move ahead of the competition, or reach new markets, it all requires constant monitoring of web data. Data is extracted, transformed, and migrated into various reports, and it becomes the foundation on which business decisions are based.

So a web-scraping tool or a homegrown web-scraping approach seems like a good option, since it looks like a quick and inexpensive way to harvest the data you require. Or is it?

Now comes the uneasy feeling in the back of your mind. Can my homegrown web-scraping approach or a web-scraping tool acquire the information I need? How do I know the data I received is accurate and formatted correctly? And if management wants different reporting data, how is that handled?

The short answer: You don’t know.

The right answer begins with an evaluation of your specific data requirements and business needs.

1. How does web scraping acquire the data?

While product demonstrations can present an initial set of data with colorful dashboards full of charts and reports, you are better off asking for a technology demonstration that relates to your specific data collection needs. Write up a list of the actual websites you gather data from. Your list should include sites built with a variety of technologies: HTML5, Flash, JavaScript, and AJAX. Be sure to include websites behind firewalls and data locked in PDFs. The faster, more reliably, and more scalably the web data extraction process performs across these varied websites, the better.
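As a rough illustration of the kind of extraction logic any such tool must handle, the sketch below pulls table cells out of plain HTML using only Python's standard library. The sample markup is hypothetical, and a stdlib parser like this only covers static HTML; sites rendered with JavaScript or AJAX require a browser engine, which is exactly why demonstrations against your real websites matter.

```python
from html.parser import HTMLParser

# Hypothetical sample of the static HTML a scraper must parse.
# JavaScript/AJAX-rendered pages need a headless browser instead.
SAMPLE = """
<table>
  <tr><td>ACME</td><td>12.50</td></tr>
  <tr><td>Globex</td><td>99.10</td></tr>
</table>
"""

class CellExtractor(HTMLParser):
    """Collect the text content of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(SAMPLE)
# Group the flat cell list into two-column rows.
rows = [parser.cells[i:i + 2] for i in range(0, len(parser.cells), 2)]
print(rows)  # [['ACME', '12.50'], ['Globex', '99.10']]
```

Even this toy example shows why hand-rolled scrapers are brittle: the grouping logic assumes exactly two columns per row, an assumption that silently breaks the moment the site adds a column.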

2. What does the data look like?

You have received some data using a web-scraping tool, but now you spend all your time trying to transform it, and you notice formatting and quality issues. If the extracted data is not accurately transformed into a usable format, such as Microsoft Excel, .csv files, or XML, it becomes unusable by applications with specific integration requirements, and you have lost half the value of your investment. Extracting and auto-correcting specialized data, such as dates, currencies, calculations, and conditional expressions, as well as removing duplicate records, are all important considerations.
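To make those considerations concrete, here is a minimal, stdlib-only sketch of that transformation step: normalizing inconsistent dates and currency strings, dropping a duplicate record, and emitting CSV. The raw rows and field names are hypothetical, standing in for whatever a scraper actually returns.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw rows as a scraper might return them: mixed
# date formats, currency symbols, and a duplicate record.
raw = [
    {"date": "03/15/2024", "amount": "$1,200.50", "vendor": "ACME"},
    {"date": "2024-03-15", "amount": "1200.50", "vendor": "ACME"},  # duplicate
    {"date": "04/01/2024", "amount": "$99.00", "vendor": "Globex"},
]

def normalize(row):
    """Coerce one raw row into canonical types: ISO date, float amount."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            date = datetime.strptime(row["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    amount = float(row["amount"].replace("$", "").replace(",", ""))
    return {"date": date, "amount": amount, "vendor": row["vendor"]}

# Normalize, then drop duplicates by (date, amount, vendor).
seen, clean = set(), []
for row in map(normalize, raw):
    key = (row["date"], row["amount"], row["vendor"])
    if key not in seen:
        seen.add(key)
        clean.append(row)

# Emit a CSV that downstream tools (Excel, BI reports) can consume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "amount", "vendor"])
writer.writeheader()
writer.writerows(clean)
print(buf.getvalue())
```

Note how much code even this trivial cleanup takes; a tool that hands you raw scraped text leaves all of it, for every site and every format quirk, to you.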

3. How difficult is it to make changes?

What happens if a website changes, or if you need to monitor and extract data from new websites? Many web-scraping tools have a high propensity to fail when websites change, which then requires resources, and in some cases a developer, to fix the problem. Unless you have a developer in-house to make these fixes, this adds time and expense, and the problem only grows as you monitor and extract data from hundreds or even hundreds of thousands of websites. If scalability is important to you, be sure to ask how the technology solution monitors and handles changes to a website, especially if you want to expand beyond your immediate data collection needs.
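One common way such monitoring works, sketched here as an assumption rather than any particular product's method, is to fingerprint a page's tag structure and alert when the fingerprint changes: routine content updates (a new price) leave the layout intact, while a redesign that would break a scraper does not. The page snippets below are hypothetical.

```python
import hashlib
import re

def structure_fingerprint(html):
    """Hash only the sequence of tag names in a page, ignoring text,
    so content updates pass but layout changes are flagged."""
    tags = re.findall(r"</?([a-zA-Z0-9]+)", html)
    return hashlib.sha256(" ".join(tags).encode()).hexdigest()

# Hypothetical snapshots of the same page over time.
before = "<div><table><tr><td>12.50</td></tr></table></div>"
updated_price = "<div><table><tr><td>13.75</td></tr></table></div>"
redesigned = "<div><ul><li>13.75</li></ul></div>"

baseline = structure_fingerprint(before)
print(structure_fingerprint(updated_price) == baseline)  # True: data changed, layout did not
print(structure_fingerprint(redesigned) == baseline)     # False: layout changed; scraper may break
```

A real monitoring system has to be far more forgiving than an exact hash, which is precisely the question to put to a vendor: what counts as a change, and who fixes the scraper when one is detected?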

Extracting and transforming web data is about more than just purchasing any web-scraping tool. Think about the data you are collecting and how it’s tied to your business. In all likelihood, there’s a strong set of business drivers behind collecting the data, and taking shortcuts will only compromise your business goals. The information you are collecting should never leave you feeling uneasy.

Look beyond the data that’s being extracted, and think about what you are doing with it in the context of serving your customers, creating a competitive advantage, or streamlining processes that rely on data from websites, portals, and online verification services.
