Earlier this year, The Prolific North compiled a list of the Top 100 eCommerce Companies in the North, ranking the leading e-commerce companies in the region by revenue.

Being a team of e-commerce specialists working across a range of platforms, we immediately wondered which platforms these sites were all running on and whether there were any clear favourites.

Data parsing and analysis is at the heart of a lot of what we do here at IGOO, and as developers we never miss an opportunity to get our hands dirty with a new toolkit or API. After a quick discussion, we decided we would leverage the excellent Wappalyzer API in order to collect some additional data on each of the sites operated by the companies listed in the league table, so we set to work building a rough prototype.

Libraries & Tools Used

This is by no means an exhaustive list and doesn’t include any information about my local development process. Rather, it is a list of build dependencies:

  • Wappalyzer – URL analysis tool and API
  • PapaParse – Fast and feature rich CSV parser
  • lodash – JavaScript utility library (so we can write more succinct and human-friendly JS)
  • axios – Promise based HTTP client
  • nanoJS – Micro library for DOM manipulation (super-lazy selectors and iterators)
  • chartJS – no-nonsense JavaScript/Canvas charting engine
  • palette.js – used for generating colour palettes on the fly

Preparing the Initial Dataset

The first problem we encountered (and it’s a biggie) was that the original table only lists the names of the parent companies or legal entities from which annual revenues could be determined. Our first job was therefore to do some digging to figure out which websites were operated by each of the companies listed. We’re big fans of automation and always strive to “work smarter, not harder”, so the urge to figure out a way to automatically match each of the company names to a set of website URLs was strong. Ultimately, there didn’t seem to be any way to accomplish this practically (quickly) and consistently, so we opted for some good old-fashioned investigative work. This mainly involved visiting a lot of (often aesthetically questionable) corporate holding pages to try to drill down to where the action was taking place.

Before I go any further here, I think it’s important to clarify: we completely understood from the beginning that we would be working with an incomplete set of results. A sizeable portion of these companies would undoubtedly be exclusively B2B, would have entirely custom-built solutions, or would have systems in place which were strictly off-limits to the general public (and away from the prying eyes of the Wappalyzer API). The real goal here was to attempt to shed some light on what the commonly used technologies were and disregard anything else.

Before long, we had a fairly healthy-looking spreadsheet in Google Docs containing all of the information from the original table, along with an extra column of URLs which we had determined were owned and operated by the company listed. There were a few early surprises in there, with the likes of The HUT Group having no fewer than 32 outward-facing brand offerings in their portfolio.

Wappalyzer

For anyone who is not familiar with Wappalyzer, it’s a tool which analyses websites to determine which frameworks, libraries and platforms they use. It uses a combination of sniffing techniques (think browser sniffing and fingerprinting) to build a catalogue of known software against which to perform future lookups. It’s not massively watertight in terms of accuracy and is prone to fail if the site you are attempting to analyse has employed some sort of code caching, “minification” or obfuscation. But it’s better than anything we could build ourselves at short notice.

I signed up for their cheapest tier, which costs $25 a month and allows 1,000 lookups (annoyingly throttled at 1 request per second). This would be more than enough for some basic testing and a full scrape once I was confident everything was working.

Build

Parse

Having all of the data in a Google sheet allowed me to create a shareable URL, which I could easily pull into PapaParse for initial parsing. PapaParse is elegantly written and takes care of all of the heavy lifting for us, so once we feed in a CSV at one end, what we get out is a nicely formatted JavaScript array/object (depending on which options we set) containing all of the original data.
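
In sketch form, the parsing step looks something like this. The sheet URL below is a placeholder and the exact options are illustrative rather than lifted from our build:

// a rough sketch of the PapaParse step – the URL is a placeholder,
// not our real published sheet
var SHEET_CSV_URL = 'https://docs.google.com/spreadsheets/d/e/EXAMPLE/pub?output=csv';

Papa.parse(SHEET_CSV_URL, {
	download: true,       // fetch the remote CSV for us
	header: true,         // use the first row as keys, giving an array of row objects
	skipEmptyLines: true,
	complete: function(results)
	{
		// results.data is now an array like [{ Company: '…', URL: '…' }, …]
		console.log(results.data);
	}
});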

Process

Because our lookup service, Wappalyzer, throttles us to one lookup per second, rather than play around with calculating timeouts (and deferred reattempts in the case of failure), I opted to perform all of the Wappalyzer lookups sequentially, ensuring that each lookup was complete before the next one began. Although this added some additional latency to the overall run, the difference was negligible, and in most cases each lookup took little over a second to complete anyway. For the requests themselves we used the excellent axios library, which made it trivial to submit a URL along with some basic authentication headers to the Wappalyzer API endpoint. For the moment, we are simply storing the API key in plain text within our JavaScript file. This is not ideal, as keys are typically supposed to be kept private, but since we had no intention of publishing this part of our build, we decided that handling it more securely was outside the project scope.
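
In sketch form, the lookup loop looks something like the code below. The endpoint path and the 'x-api-key' header name are assumptions for illustration rather than copied from our build, and rows is assumed to be the parsed spreadsheet data from the previous step:

// a rough sketch of the sequential lookup loop – the endpoint and the
// 'x-api-key' header name are assumptions, and 'rows' is the array of
// row objects produced by PapaParse above
var API_KEY = 'OUR-WAPPALYZER-KEY'; // stored in plain text, as noted above

async function lookupAll(rows)
{
	var results = [];

	for (var i = 0; i < rows.length; i++)
	{
		var url = rows[i].URL;

		try
		{
			// each request must complete before the next one begins,
			// which keeps us safely under the 1 request/second limit
			var response = await axios.get('https://api.wappalyzer.com/lookup/v1/', {
				params: { url: url },
				headers: { 'x-api-key': API_KEY }
			});

			results.push({ url: url, data: response.data });
		}
		catch (error)
		{
			console.log('Lookup failed for ' + url + ': ' + error.message);
		}
	}

	return results;
}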

Cache

The packets for each successful URL lookup are returned from Wappalyzer as well-formatted JSON strings containing all of the data currently known about the URL by the Wappalyzer database. Upon each request, we store these in their own .json file in a local folder, to be retrieved and read later. Because reading from the filesystem in browser JS can be problematic (often outright disallowed), I opted to also save an additional file (creatively named “all.json”) once the entire operation was finished. In the end we had little need for the individual files themselves, but it helped to have them as backups when testing, since it allowed us to write a bit of extra functionality to skip over the ones which were already present.
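
Assuming the lookup script runs under Node (only the rendering needs to happen in the browser), the caching step might look roughly like this; the folder and file names are illustrative:

// a rough sketch of the caching step, assuming a Node environment –
// folder and file names are illustrative
const fs = require('fs');

function cacheResult(result)
{
	// derive a filename from the URL, e.g. 'data/chemist-4-u.com.json'
	var filename = 'data/' + result.url.replace(/https?:\/\//, '').replace(/\//g, '_') + '.json';

	// skip files already cached by a previous run (we used a similar
	// check to skip the lookups themselves while testing)
	if (fs.existsSync(filename))
	{
		return;
	}

	fs.writeFileSync(filename, JSON.stringify(result.data, null, '\t'));
}

function cacheAll(results)
{
	// the single combined file read back later by the front end
	fs.writeFileSync('data/_all.json', JSON.stringify(results, null, '\t'));
}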

Retrieve, Re-parse, Render

Having worked with the excellent chartJS library before, I assumed that this stage would be relatively trivial and, dare I say it, fun! But that’s the problem with assumptions, isn’t it… Focusing first on the Wappalyzer data specifically, it wasn’t long before I encountered the first of several speed bumps. Having fetched the big JSON string containing all of the retrieved data, I needed to parse the whole set again and try to pluck out a list of platforms which would allow me to ultimately count their frequencies. The problem is that, in the JSON data returned by the Wappalyzer API, there is no specific distinction between platforms, apps and libraries. That is to say, they are all treated with an equal weighting from a hierarchical perspective and are only distinguished from one another by a node called ‘categories’, which sits alongside the name of the app/platform/framework/library itself.

Consider this example:


{
	"url":"https://chemist-4-u.com",
	"data":[{
		"monthYear":"08-2018",
		"languages":[],
		"applications":[
			{"name":"Hotjar","categories":["Analytics"],"versions":[],"hits":206},
			{"name":"FlexSlider","categories":["Widgets"],"versions":[],"hits":5},
			{"name":"Google Analytics","categories":["Analytics"],"versions":[],"hits":517},
			{"name":"Prototype","categories":["JavaScript Frameworks"],"versions":["1.7"],"hits":234},
			{"name":"Google Tag Manager","categories":["Tag Managers"],"versions":[],"hits":206},
			{"name":"script.aculo.us","categories":["JavaScript Libraries"],"versions":[],"hits":27},
			{"name":"Font Awesome","categories":["Font Scripts"],"versions":[],"hits":512},
			{"name":"Google Font API","categories":["Font Scripts"],"versions":[],"hits":512},
			{"name":"ExtJS","categories":["JavaScript Frameworks"],"versions":[],"hits":1},
			{"name":"Lightbox","categories":["JavaScript Libraries"],"versions":[],"hits":7},
			{"name":"TinyMCE","categories":["Rich Text Editors"],"versions":["3"],"hits":1},
			{"name":"MailChimp","categories":["Marketing Automation"],"versions":[],"hits":511},
			{"name":"jQuery","categories":["JavaScript Libraries"],"versions":["1.10.2"],"hits":224},
			{"name":"jQuery UI","categories":["JavaScript Libraries"],"versions":["1.12.1"],"hits":170},
			{"name":"Magento","categories":["Ecommerce"],"versions":[],"hits":234},
			{"name":"Modernizr","categories":["JavaScript Libraries"],"versions":["2.8.3"],"hits":207},
			{"name":"PHP","categories":["Programming Languages"],"versions":[],"hits":234},
			{"name":"CloudFlare","categories":["CDN"],"versions":[],"hits":846},
			{"name":"Facebook","categories":["Widgets"],"versions":[],"hits":512},
			{"name":"Google AdSense","categories":["Advertising Networks"],"versions":[],"hits":1}
		]
	}]
}

This is real data (we actually built the chemist-4-u.com website, so hopefully they won’t mind).

After some initial research, I decided to leverage lodash, specifically the _.forEach, _.get and _.find functions, to extract a list of platforms which had been classified as ‘Ecommerce’. Smarter, more pedantic individuals will undoubtedly tell me that I could achieve this with one single function or with vanilla JavaScript. For the record, I both agree with and disregard that information. I only have so many waking hours in the day, and I am constantly torn between my craving for absolutely beautiful code and just wanting to get the thing built before I forget what fresh air tastes like.

Here is a snippet of the final code, which plucks all of the e-commerce platforms from the JSON data:


var data = await loadJSON('data/_all.json');

var platforms = [];

_.forEach(data, function(item)
{
	var apps = _.get(item, 'data[0].applications');

	// find the first app with 'Ecommerce' listed in its categories array
	var platform = _.find(apps, {categories: ['Ecommerce']});

	if(typeof platform != 'undefined')
	{
		platforms.push(platform.name);
	}

});

One other thing you may notice is that the actual data resides within an array called ‘applications’, inside a parent array which is categorised by “monthYear” (there are usually more of these, but I have only included the latest one for the sake of brevity). It seems that Wappalyzer caches results every month, which would allow you to query whether or not a particular stack had changed from month to month. This didn’t prove too much of a problem for this exercise, since I am only really interested in the latest data, which is always at position 0 in the response.

One additional adjustment I wanted to make was to group the common values together, along with a count showing the frequency with which each occurred in the original results. Using lodash, this was achieved very simply.



// get uniques with count/frequency
var frequencies = _.countBy(values);

// pass these sets to chartJS
var labels = Object.keys(frequencies);
var data = Object.values(frequencies);

Once I had the data in the various formats I needed to pass to my charting library, generating the actual charts themselves was just an exercise in reading the chartJS documentation and shamelessly copy/pasting some basic configs from their set of examples to get the visuals I was after. A more extensive explanation of this may end up being included in a future article if one of us is ever seized by the desire to build something like this again.
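
For the curious, the rendering step boils down to something like the snippet below. The canvas ID, chart type and legend options are illustrative choices rather than our exact config, and colours is an array of hex colour strings (more on that in the next section):

// a rough sketch of the chartJS step – the canvas ID, chart type and
// options are illustrative; 'labels' and 'data' come from the frequency
// step above, and 'colours' is covered in the next section
var ctx = document.getElementById('platform-chart');

var chart = new Chart(ctx, {
	type: 'doughnut',
	data: {
		labels: labels,
		datasets: [{
			data: data,
			backgroundColor: colours
		}]
	},
	options: {
		legend: { position: 'bottom' }
	}
});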

Additional Challenges

I hit some snags when attempting to sort datasets in ascending or descending order (by frequency, for example). Often I was dealing with an object of name/frequency pairs and, even with the mighty lodash, had no straightforward way to grab the node containing the frequency values and use it for reordering. I ended up having to rewrite the part which generated the frequency pairs to essentially transform the object into a multidimensional array, sort it and then transform it back into an object.

The final code for this looked like this:



// get uniques with count/frequency
var frequencies = _.countBy(values);

if(order)
{
	// a plain object can't be sorted directly, so convert it to an array of
	// [name, count] pairs, order by the count (index 1), then convert it back
	frequencies = _.toPairs(frequencies);
	frequencies = _.orderBy(frequencies, 1, 'desc');
	frequencies = _.fromPairs(frequencies);
}

// pass these sets to chartJS
var labels = Object.keys(frequencies);
var data = Object.values(frequencies);

Let’s talk about colour palettes… specifically, the process of auto-generating them whilst trying to ensure they aren’t vomit-inducing. I sank far too much time into this, but ultimately settled on the excellent palette.js library maintained by Google. The main feature I was after was being able to reliably generate palettes of varying lengths, depending on the size of the given dataset, whilst being courteous to the aesthetic sensibilities of the viewer. As you might imagine, this becomes increasingly difficult as datasets grow larger. In the end, the “Rainbow Palette” formulated by a guy named Paul Tol seemed to be the only one which covered all bases. If you’re into colours (and let’s face it, who isn’t) I would definitely recommend having a look at his work.
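
For reference, sizing a palette to the dataset looks something like the snippet below, assuming the ‘tol-rainbow’ scheme name; palette.js returns hex values without the leading ‘#’, so we add it back before handing the colours to chartJS:

// a rough sketch of generating a palette sized to the dataset, assuming
// the 'tol-rainbow' scheme – palette.js returns hex strings without a '#'
var colours = palette('tol-rainbow', labels.length).map(function(hex)
{
	return '#' + hex;
});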

In Conclusion

What started off as a straightforward question quickly snowballed into a mini-build with quite a lot of moving parts (sound familiar?). A fair amount of work went into laying the foundations so that we can potentially reuse this mini framework for future infographics which require remote data to be fetched and parsed, although none of this can be readily appreciated in the final front-facing build.

We’re working on getting our final code for the infographic published on GitHub, including the original data we retrieved during our initial analysis. The mechanism we wrote to fetch and parse the initial data isn’t “ready for public consumption”, so to speak, but we will publish it too if there is an appetite for it.

You can find the finished infographic in the lab, and also read some conclusions we drew about the data itself. Enjoy!