Analyze Meta Tags With Elixir

TLDR: We used Elixir to analyze which meta tags are most common among the top 1000 websites.


In the following, we will write Elixir code together to analyze the most popular websites for their meta tags. This was intended to be short, but turned out longer than expected.

Who is this article for?

You should have a basic knowledge of Elixir to follow the code, as well as basic HTML.

Why?

Self-promotion:
I am building OGTester.com to debug social media previews and was wondering which tags are the most common ones on the internet. There are a ton of resources online recommending certain lists of meta tags. Most of them are SEO spam, so I was curious whether those lists made sense.

Dependencies

I mainly rely on Finch for making HTTP requests and Floki for digesting the HTML. You can easily set this up in Livebook:

Mix.install([
  {:finch, "~> 0.14"},
  {:floki, "~> 0.34.0"},
])
# we have to start Finch
Finch.start_link(name: MyFinch)

Get top 1000 websites

Before we start, we need to fetch the top 1000 websites. Google pointed me to the URLchecker repo. The following snippet fetches the text file containing the domain names and then prepends https://www. to each of them.

url = "https://raw.githubusercontent.com/bensooter/URLchecker/master/top-1000-websites.txt"
# Fetch the content of the above link
{:ok, response} = Finch.build(:get, url)
  |> Finch.request(MyFinch)
# turn the body into a list of full URLs like https://www.example.com
links = response.body
  |> String.split("\n", trim: true)
  |> Enum.map(&("https://www.#{&1}"))

What to do?

Before I show you the code, let’s think what we need. For a given link (for example https://ogtester.com/) we should get a list of meta tags.

These are the steps to achieve this:

  1. Fetch the html content of https://ogtester.com/
  2. Parse the html content and extract the <head> tag
  3. Extract the meta tags
  4. Keep only the meta tags relevant to us.

Given the above steps, first we need to fetch the HTML content from the page:

def link_to_meta_tag(link) do
    # First we use Finch to fetch the page
    request = Finch.build(:get, link) |> Finch.request(MyFinch)
    case request do
      {:ok, response} ->
        # The response was successful, so we can parse the content with Floki
        {:ok, document} = Floki.parse_document(response.body)
        # what next?
      _ -> []
    end
end
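One caveat before moving on: Finch does not follow redirects by default, so a site that answers with a 301/302 gives us the redirect page instead of the real HTML. Below is a minimal sketch of a status check we could add; it is a refinement, not part of the original analysis (Finch.Response carries :status and :body):

```elixir
# Sketch: only accept 200 responses; treat everything else as a miss.
case Finch.build(:get, link) |> Finch.request(MyFinch) do
  {:ok, %Finch.Response{status: 200, body: body}} ->
    {:ok, body}

  {:ok, %Finch.Response{status: status}} ->
    # e.g. 301/302 redirects, or 4xx/5xx errors
    {:error, {:http_status, status}}

  {:error, reason} ->
    {:error, reason}
end
```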

Parsed output from Floki

Now we have the content parsed with Floki. The result is a list of tuples: each HTML tag is parsed into a three-tuple. Take, for example, the following meta tag and the corresponding result from Floki.

<head>
  <meta name="og:title" value="Kiru.io is the best page" />
</head>
# Result with Floki
{
  "head",  # first element is the tag name
  [],      # second element is the attribute list
  [        # last element is the list of inner tags
    {
      "meta",
      [{"name", "og:title"}, {"value", "Kiru.io is the best page"}],
      []
    }
  ]
}
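You can verify this shape in Livebook. Note that Floki.parse_document/1 returns a list of top-level nodes, and whitespace may appear as text nodes inside the children, which is why the filtering later on is needed:

```elixir
html = """
<head>
  <meta name="og:title" value="Kiru.io is the best page" />
</head>
"""

{:ok, doc} = Floki.parse_document(html)
# doc is a list of nodes; grab the head three-tuple out of it
{"head", _attrs, children} = Floki.find(doc, "head") |> Enum.at(0)
# children holds the meta three-tuple, possibly next to whitespace text nodes
```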

We now have the result from Floki, next we need to find the <head> tag:

# result from floki
{:ok, document} = Floki.parse_document(response.body)
# head tag
{"head", _, innerHead} = Floki.find(document, "head") |> Enum.at(0)

Given that, we can start looking for meta tags and extracting the meta tag names (what we are actually interested in). Before we start, we have to filter out irrelevant nodes from the <head>.

meta_tags = innerHead
  |> Enum.filter(&is_tuple/1)
  |> Enum.filter(fn each -> tuple_size(each) == 3 end)
  |> Enum.filter(fn {name, _, _} -> String.downcase(name) == "meta" end) # only keep the meta tags

Now we are ready to extract the property names. Many sites use either "property" or "name" as the attribute name; we keep only those values and ignore the rest.

# get the meta tags names out of each tag
meta_tags |> Enum.map(fn {"meta", attributes, _} ->
  attributes |> Enum.map(fn {attr_name, attr_value} ->
      # we are only interested in certain tags
      case String.downcase(attr_name) do
        "property" -> attr_value
        "name"  -> attr_value
        _ -> nil
      end
  end)
  |> Enum.reject(&is_nil/1)
end)
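As an aside, Floki's CSS selectors together with Floki.attribute/2 could do most of this filtering and extraction in one go. This is an alternative sketch, not the approach used in the rest of the article:

```elixir
# Sketch: let Floki select the meta tags and pull the attribute values directly
names = document |> Floki.find("head meta") |> Floki.attribute("name")
properties = document |> Floki.find("head meta") |> Floki.attribute("property")
names ++ properties
```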

Now we are ready to run over our links and fetch all the meta tags. I decided to move all of the above code into a helper function; see the Appendix for the full snippet.

# this will take some time for 1000 sites
meta_tags = Enum.map(links, &(KiruHelper.link_to_meta_tag/1))
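As an optional improvement, Task.async_stream/3 can run the requests concurrently instead of one by one. The concurrency and timeout values below are assumptions; tune them to taste:

```elixir
meta_tags =
  links
  |> Task.async_stream(&KiruHelper.link_to_meta_tag/1,
    max_concurrency: 20,
    timeout: 15_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, tags} -> tags
    {:exit, _reason} -> []
  end)
```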

# count occurrences
meta_tags
|> Enum.flat_map(&(&1))
|> Enum.flat_map(&(&1))
|> Enum.group_by(&(&1))
|> Enum.map(fn {key, value} -> {key, Enum.count(value)} end)
|> Enum.sort_by(fn {_key, value} -> value end, :desc)
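For what it's worth, Elixir 1.10+ ships Enum.frequencies/1, which collapses the flatten/group/count steps into a shorter equivalent:

```elixir
meta_tags
|> List.flatten()
|> Enum.frequencies()
|> Enum.sort_by(fn {_tag, count} -> count end, :desc)
```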

This gives the following result: each tag name and how many times it appears.

[
  {"description", 419},
  {"viewport", 419},
  {"og:title", 303},
  {"og:image", 284},
  {"og:description", 282},
  {"og:url", 272},
  {"og:type", 264},
  {"og:site_name", 262},
  {"twitter:card", 214},
  {"robots", 205},
  {"keywords", 192},
  {"twitter:site", 181},
  {"fb:pages", 177},
  {"fb:app_id", 168},
  {"google-site-verification", 156},
  {"twitter:title", 148},
  {"twitter:description", 139},
  {"theme-color", 133},
  {"twitter:image", 125},
  {"og:locale", 86},
]

Conclusion

Granted, the above code is not of the best quality, but it helps us answer our initial question. The most common tags were:

  1. description
  2. viewport
  3. og:title

To my surprise, only 303/1000 sites (assuming all requests were successful, which I doubt) contained the og:title tag. I will dig deeper in the future to see why so few have it.

Thank you for reading. If you want to see more content like this, follow me on Twitter.

Appendix

This is the full function used to extract the meta tags from a link.

defmodule KiruHelper do
  def link_to_meta_tag(link) do
    IO.puts(link)

    # Do http request
    case Finch.build(:get, link) |> Finch.request(MyFinch) do
      {:ok, response} ->
        # parse HTML
        Floki.parse_document(response.body)
         |> extract_meta_tags
      _ -> []
    end
  end

  def extract_meta_tags({:ok, document}) do
    # try to find a <head> tag
    case Floki.find(document, "head") |> Enum.at(0) do
      # nothing found
      nil -> []
      # found head
      {"head", _, innerHead} ->
        # we are only interested in meta tags
        meta_tags = innerHead
          |> Enum.filter(&is_tuple/1)
          |> Enum.filter(fn each -> tuple_size(each) == 3 end)
          |> Enum.filter(fn {name, _, _} -> String.downcase(name) == "meta" end)

        # get the meta tags names out of each tag
        meta_tags |> Enum.map(fn {"meta", attributes, _} ->
          attributes |> Enum.map(fn {attr_name, attr_value} ->
              # we are only interested in certain tags
              case String.downcase(attr_name) do
                "property" -> attr_value
                "name"  -> attr_value
                _ -> nil
              end
          end)
          |> Enum.reject(&is_nil/1)
        end)
    end
  end

  def extract_meta_tags(_), do: []
end