Scrape Clutch with Go

Let’s create a new file in our working directory and call it clutch.go. For the scraping job we will use Colly, a well-written and well-documented framework; we recommend reading its documentation. You can install Colly with a single command by copying and pasting it into your terminal or command prompt. It might take a moment to download and install.

  • Install Colly
go get github.com/gocolly/colly/v2
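
Because the code we write below imports Colly’s v2 module path, your project needs a go.mod file. If you haven’t created one yet, run this first (the module name clutch is just an example):

go mod init clutch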

Now, open the file (clutch.go) in your favourite IDE. We begin by specifying the package before writing the main function.

package main

func main() {
}

Don’t forget to run the code (go run clutch.go) and verify everything works as expected.

The first thing we need to declare inside the function is a filename.

package main

func main() {
	fName := "london_digital_agencies.csv"
}

Up next, we will create a file since we have a filename.

package main

import (
	"os"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)
}

It will create a file named london_digital_agencies.csv. Note that if you run the code at this point, the compiler will complain that file and err are declared and not used; that is expected, and we take care of both over the next steps.

How do we catch errors? Well, let’s handle that in our code.

package main

import (
	"log"
	"os"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}
}

Fatalf() prints the message and then exits the program with a non-zero status (it calls os.Exit(1) under the hood), so nothing after it ever runs.
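
To see what that means, here is a tiny standalone program, unrelated to the scraper, that shows the behaviour. It also shows why we check the error before registering any defers: exiting through Fatalf skips deferred calls.

package main

import "log"

func main() {
	// This deferred call never runs: log.Fatalf exits via os.Exit,
	// which skips deferred functions.
	defer log.Println("this line never prints")

	// Fatalf is a formatted print followed by os.Exit(1).
	log.Fatalf("something went wrong: %s", "example error")
}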

The next thing we need to do is close the file.

package main

import (
	"log"
	"os"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()
}

Here’s where defer is very helpful. A deferred call doesn’t execute right away; it runs when the surrounding function returns, and multiple deferred calls run in last-in, first-out order. Amazing, right? It means we don’t have to worry about closing the file manually on every return path.
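
Here is a tiny standalone example of that ordering, unrelated to the scraper:

package main

import "fmt"

func main() {
	defer fmt.Println("deferred first, printed last")
	defer fmt.Println("deferred last, printed first")

	fmt.Println("printed immediately")
	// Output:
	// printed immediately
	// deferred last, printed first
	// deferred first, printed last
}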

Go itself doesn’t add imports for you, but most editors run goimports and insert the necessary packages automatically; if yours doesn’t, add them to the import block by hand.

Let’s progress our code a bit. What you need next is a CSV writer. Why? We need to write the data we fetch from clutch.co to a CSV file.

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()

	writer := csv.NewWriter(file)
}

After adding the writer, your editor (again via goimports) pulls in another package for us, “encoding/csv”. Pretty neat, right?

The CSV writer buffers its output, so after writing our data we need to flush everything left in the buffer down to the file. For this, we use Flush.
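
Here is the same Writer/Flush pattern as a tiny standalone program, writing to standard output instead of a file (the records are made up):

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	// Writing to stdout just to show the pattern.
	w := csv.NewWriter(os.Stdout)

	w.Write([]string{"name", "rating"})
	w.Write([]string{"Example Agency", "4.9"}) // illustrative values

	// Without Flush, the buffered records may never reach the underlying writer.
	w.Flush()

	// csv.Writer reports write errors lazily; check them after flushing.
	if err := w.Error(); err != nil {
		log.Fatalf("Cannot write CSV: %s", err)
	}
}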

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()

	writer := csv.NewWriter(file)

	defer writer.Flush()
}

Because flushing should happen at the very end, after all the writes, we again add the defer keyword. Since deferred calls run in last-in, first-out order, writer.Flush() runs before file.Close(), which is exactly the order we want. Finally, we have a well-structured file and a writer ready to go. It is time to get our hands dirty and start the web scraping job. We need to instantiate a collector to begin.

package main

import (
	"encoding/csv"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()

	writer := csv.NewWriter(file)

	defer writer.Flush()

	// Write CSV header
	writer.Write([]string{"name", "logo", "rating", "tagline", "locality", "clutch_profile"})

	// Instantiate default collector
	c := colly.NewCollector(
		colly.AllowedDomains("clutch.co", "www.clutch.co"), // Allow requests only to clutch.co
	)
}

Your editor will have imported Colly for us (if not, add github.com/gocolly/colly/v2 to the import block). The AllowedDomains option restricts the collector to clutch.co, the domain we will extract data from. We will scrape a list of digital agencies providing services in London, United Kingdom, from Clutch.
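
Nothing in this tutorial requires it, but the collector accepts more options than AllowedDomains. As a rough sketch with illustrative values (treat the exact settings as assumptions to tune for your own project), you could also set a User-Agent, throttle requests, and log failures:

package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("clutch.co", "www.clutch.co"),
		// Illustrative User-Agent string; pick whatever identifies your project.
		colly.UserAgent("Mozilla/5.0 (compatible; clutch-scraper-tutorial)"),
	)

	// Be polite: wait a second between requests to the same domain.
	c.Limit(&colly.LimitRule{
		DomainGlob: "*clutch.co*",
		Delay:      1 * time.Second,
	})

	// Log every request and any failed response while debugging.
	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL)
	})
	c.OnError(func(r *colly.Response, err error) {
		log.Println("Request failed:", r.Request.URL, err)
	})

	c.Visit("https://clutch.co/uk/agencies/digital-marketing/london")
}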

Clutch is the leading ratings and reviews platform for IT, Marketing and Business service providers. Each month, over half a million buyers and sellers of services use the Clutch platform, and the user base is growing over 50% a year.

The next thing we need to do is point the collector at the web page we will fetch the data from. In our case, that is the listing of digital marketing agencies in London: https://clutch.co/uk/agencies/digital-marketing/london.

We are interested in collecting “name”, “logo”, “rating”, “tagline”, “locality”, and “clutch_profile”. After inspecting the page, we found that each listing is wrapped in an element with the provider-info class, so that is our target selector.

package main

import (
	"encoding/csv"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()

	writer := csv.NewWriter(file)

	defer writer.Flush()

	// Write CSV header
	writer.Write([]string{"name", "logo", "rating", "tagline", "locality", "clutch_profile"})

	// Instantiate default collector
	c := colly.NewCollector(
		colly.AllowedDomains("clutch.co", "www.clutch.co"), // Allow requests only to clutch.co
	)

	c.OnHTML(".provider-info", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText(".company_info"),
			e.ChildAttr("a[class='company_logotype'] > img", "data-src"),
			e.ChildText(".sg-rating__number"),
			e.ChildText(".tagline"),
			e.ChildText(".locality"),
			"https://clutch.co" + e.ChildAttr("a", "href"),
		})
	})
}

We have registered an OnHTML callback that fires for every element matching the .provider-info selector; e is a pointer to that HTML element. Inside the callback we write one record, a slice of strings, to our CSV file. ChildText returns the concatenated and stripped text of the child elements matching the given selector, so .company_info gives us the name, .sg-rating__number the rating, and so on. ChildAttr returns an attribute instead of text: we use it to grab the logo URL from the img tag’s data-src attribute, and to grab the href of the first a element, which we prefix with https://clutch.co to build the full Clutch profile URL. The commas and quoting are handled for us, because the writer is producing CSV.

package main

import (
	"encoding/csv"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

func main() {
	fName := "london_digital_agencies.csv"

	file, err := os.Create(fName)

	if err != nil {
		log.Fatalf("Cannot create file %q: %s\n", fName, err)
	}

	defer file.Close()

	writer := csv.NewWriter(file)

	defer writer.Flush()

	// Write CSV header
	writer.Write([]string{"name", "logo", "rating", "tagline", "locality", "clutch_profile"})

	// Instantiate default collector
	c := colly.NewCollector(
		colly.AllowedDomains("clutch.co", "www.clutch.co"), // Allow requests only to clutch.co
	)

	c.OnHTML(".provider-info", func(e *colly.HTMLElement) {
		writer.Write([]string{
			e.ChildText(".company_info"),
			e.ChildAttr("a[class='company_logotype'] > img", "data-src"),
			e.ChildText(".sg-rating__number"),
			e.ChildText(".tagline"),
			e.ChildText(".locality"),
			"https://clutch.co" + e.ChildAttr("a", "href"),
		})
	})

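	// Follow every pagination link so all result pages get scraped, not just the first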
	c.OnHTML("a.page-link", func(e *colly.HTMLElement) {
		nextPage := e.Request.AbsoluteURL(e.Attr("href"))
		c.Visit(nextPage)
	})

	c.Visit("https://clutch.co/uk/agencies/digital-marketing/london")

	log.Printf("Scraping finished, check file %q for results\n", fName)

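	// Print the collector's request and callback statistics (shown in the output below)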
	log.Println(c)

}

We also registered a second OnHTML callback on a.page-link, the pagination links: for each one we resolve the absolute URL and visit it, so the collector walks through every results page rather than just the first. c.Visit then kicks everything off from the first page, and log.Println(c) prints the collector’s statistics once scraping is done. Phew, all done! It is time to build our clutch scraper.

go build

It generates an executable named clutch (after the directory or module name; on Windows it would be clutch.exe). You can run it with the following command in your terminal:

./clutch
  • Scraping Job Finishes
2021/12/20 20:09:18 Scraping finished, check file "london_digital_agencies.csv" for results
2021/12/20 20:09:18 Requests made: 53 (53 responses) | Callbacks: OnRequest: 0, OnHTML: 2, OnResponse: 0, OnError: 0

Finally, you can look inside the london_digital_agencies.csv file our program created and preview the collected data.
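
If you would rather preview it from Go than open it in a spreadsheet, here is a small standalone program that prints the header and the first few records (it only assumes the CSV sits in the current directory):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("london_digital_agencies.csv")
	if err != nil {
		log.Fatalf("Cannot open file: %s", err)
	}
	defer f.Close()

	records, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatalf("Cannot parse CSV: %s", err)
	}

	// Print the header row plus the first five records.
	for i, record := range records {
		if i > 5 {
			break
		}
		fmt.Println(record)
	}
}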

Update

We decided to refine our code a bit. It now also includes each digital agency’s URL. In addition, there is a condition for retrieving agencies’ logos: if an image uses the performance optimization strategy known as Lazy Loading, the real URL lives in its data-src attribute; otherwise we take the plain src attribute. The latest code is available here.
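
The logo condition looks roughly like this: a sketch of a helper you could drop into clutch.go and call from the .provider-info callback in place of the raw ChildAttr line, not the exact updated code (the selector is taken from the snippet above; adjust it if the markup differs):

// logoURL prefers the lazy-loading data-src attribute and falls back to the
// plain src attribute for images that are not lazy-loaded.
func logoURL(e *colly.HTMLElement) string {
	if src := e.ChildAttr("a[class='company_logotype'] > img", "data-src"); src != "" {
		return src
	}
	return e.ChildAttr("a[class='company_logotype'] > img", "src")
}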

Contact us; we can help you create advanced and complex web scrapers that’ll crawl thousands and even millions of pages. We can deliver the data in multiple formats (.csv and .json), send it automatically to cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage, or ingest it into a database of your choice. We’ve got you covered.