Analyze Meta Tags With Elixir
TLDR: We used Elixir to analyze which meta tags are most common among the top 1000 websites.
In the following, we will write Elixir code together to analyze the meta tags of the most popular websites. This was intended to be short, but turned out to be longer than expected.
Who is this article for?
You should know basic Elixir and basic HTML to follow the code.
Why?
Dependency
I mainly rely on Finch for making HTTP requests and Floki for parsing the HTML. You can easily install both in Livebook:
Mix.install([
  {:finch, "~> 0.14"},
  {:floki, "~> 0.34.0"}
])
# we have to start Finch
Finch.start_link(name: MyFinch)
Get top 1000 websites
Before we start, we need to fetch the top 1000 websites. Google pointed me to the URLchecker repo.
The following snippet fetches the text file containing the domain names and then prepends https://www. to each of them.
url = "https://raw.githubusercontent.com/bensooter/URLchecker/master/top-1000-websites.txt"
# Get the content of the above link
{:ok, response} =
  Finch.build(:get, url)
  |> Finch.request(MyFinch)
# we want the links to be a list of URLs like https://www.example.com
# (trim: true drops empty lines, such as the one after a trailing newline)
links =
  response.body
  |> String.split("\n", trim: true)
  |> Enum.map(&("https://www.#{&1}"))
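To see what the link-building step produces, here is a minimal sketch that runs the same transformation on a small inline sample instead of the downloaded file. The sample domains and the `LinkBuilder` module name are made up for illustration, and the extra trimming assumes the file holds one bare domain per line, possibly with blank lines or stray whitespace.

```elixir
defmodule LinkBuilder do
  # Hypothetical helper: turns a newline-separated list of bare domains
  # into full https://www. URLs, skipping blank lines.
  def to_links(body) do
    body
    |> String.split("\n", trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.map(&("https://www." <> &1))
  end
end

# Invented sample input with a trailing blank line
sample = "example.com\nexample.org\n\n"
sample_links = LinkBuilder.to_links(sample)
IO.inspect(sample_links)
```

Without the blank-line handling, an empty line would turn into the bare link https://www. and cause a pointless request later on.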
What to do?
Before I show you the code, let’s think about what we need.
For a given link (for example https://ogtester.com/) we should get a list of meta tags.
These are the steps to achieve this:
- Fetch the HTML content of https://ogtester.com/
- Parse the HTML content and extract the <head> tag
- Extract the meta tags
- Keep only the meta tags that are relevant to us
Extract meta tags from link
Given the above steps, first we need to fetch the HTML content from the page:
def link_to_meta_tag(link) do
  # First we use Finch to fetch the page
  request = Finch.build(:get, link) |> Finch.request(MyFinch)
  case request do
    {:ok, response} ->
      # The request was successful, so we can parse the content with Floki
      {:ok, document} = Floki.parse_document(response.body)
      # what next?
    _ -> []
  end
end
Parsed output from Floki
Now we have the content parsed with Floki. The result is a list of tuples. Each HTML tag is parsed into a three-element tuple. Take, for example, the following meta tag and the result from Floki.
<head>
  <meta name="og:title" content="Kiru.io is the best page" />
</head>
# Result with Floki
{
  "head", # first element is the tag name
  [], # second element is the attribute list
  [ # last element is the list of child nodes
    {
      "meta",
      [{"name", "og:title"}, {"content", "Kiru.io is the best page"}],
      []
    }
  ]
}
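Because the parsed tree is plain Elixir data, it can be taken apart with ordinary pattern matching. The sketch below hand-writes a small tree in the same three-tuple shape rather than calling Floki, so it runs without dependencies; the sample tags and variable names are illustrative only.

```elixir
# A hand-written tree in the three-tuple shape described above:
# {tag_name, attributes, children}
head =
  {"head", [],
   [
     {"meta", [{"charset", "utf-8"}], []},
     {"meta", [{"name", "og:title"}], []}
   ]}

# Destructure the outer tuple to reach the child nodes
{"head", _head_attributes, children} = head

# Pick the second child and destructure it the same way
{tag, attributes, _inner} = Enum.at(children, 1)

# Attributes are {key, value} pairs, so they convert directly into a map
attribute_map = Map.new(attributes)

IO.inspect({tag, Map.get(attribute_map, "name")})
```

Converting the attribute list to a map is convenient for single lookups, while keeping it as a list preserves the order in which the attributes appeared.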
We now have the result from Floki. Next, we need to find the <head> tag:
# result from floki
{:ok, document} = Floki.parse_document(response.body)
# head tag
{"head", _, innerHead} = Floki.find(document, "head") |> Enum.at(0)
Given that, we can start looking for meta tags and extracting their names, which is what we are actually interested in.
Before we start, we have to filter out irrelevant nodes from the <head>.
meta_tags =
  innerHead
  # text nodes are plain strings, so keep only tuples
  |> Enum.filter(&is_tuple/1)
  # element nodes are three-element tuples
  |> Enum.filter(fn each -> tuple_size(each) == 3 end)
  # only keep the meta tags
  |> Enum.filter(fn {name, _, _} -> String.downcase(name) == "meta" end)
Now we are ready to extract the property names. Many sites use either “property” or “name” as the attribute name, so we keep only those two and ignore the rest.
# get the meta tag names out of each tag
meta_tags |> Enum.map(fn {"meta", attributes, _} ->
  attributes |> Enum.map(fn {attr_name, attr_value} ->
    # we are only interested in certain attributes
    case String.downcase(attr_name) do
      "property" -> attr_value
      "name" -> attr_value
      _ -> nil
    end
  end)
  |> Enum.reject(&is_nil/1)
end)
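The nested `Enum.map` calls return one list per meta tag. As an alternative sketch (not the version used for the results in this article), a `for` comprehension can filter and flatten in a single pass. The `MetaNames` module name is hypothetical, and the sample input is hand-written in Floki's tuple shape so the example runs without Floki.

```elixir
defmodule MetaNames do
  # Hypothetical helper: collects the values of "name" and "property"
  # attributes from the direct <meta> children of a parsed <head>.
  def from_children(children) do
    for {tag, attributes, _children} <- children,
        # skip three-tuples whose first element is not a tag name string
        is_binary(tag),
        String.downcase(tag) == "meta",
        {key, value} <- attributes,
        String.downcase(key) in ["name", "property"] do
      value
    end
  end
end

# Invented sample: a text node, a comment node, and a few elements
inner_head = [
  "\n",
  {:comment, " a comment node "},
  {"title", [], ["Example"]},
  {"meta", [{"charset", "utf-8"}], []},
  {"meta", [{"name", "description"}, {"content", "An example page"}], []},
  {"meta", [{"property", "og:title"}, {"content", "Example"}], []}
]

IO.inspect(MetaNames.from_children(inner_head))
```

Generator patterns in a comprehension silently skip elements that do not match, which is why the text and comment nodes need no separate filter step here.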
Now we are ready to run this over our links and fetch all meta tags. I decided to move all of the above code into a helper module, KiruHelper. See the appendix for the full snippet.
# this will take some time for 1000 sites
meta_tags = Enum.map(links, &KiruHelper.link_to_meta_tag/1)
# count occurrences
meta_tags
|> List.flatten()
|> Enum.frequencies()
|> Enum.sort_by(fn {_tag, count} -> count end, :desc)
This gives the following result: each tag name with the number of times it appears.
[
{"description", 419},
{"viewport", 419},
{"og:title", 303},
{"og:image", 284},
{"og:description", 282},
{"og:url", 272},
{"og:type", 264},
{"og:site_name", 262},
{"twitter:card", 214},
{"robots", 205},
{"keywords", 192},
{"twitter:site", 181},
{"fb:pages", 177},
{"fb:app_id", 168},
{"google-site-verification", 156},
{"twitter:title", 148},
{"twitter:description", 139},
{"theme-color", 133},
{"twitter:image", 125},
{"og:locale", 86},
]
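To make the counting step concrete, here is a compact sketch on a tiny hand-made input, using `List.flatten/1` and `Enum.frequencies/1`. The data is invented, not taken from the crawl. It also shows a detail worth keeping in mind when reading the numbers: counting occurrences is not the same as counting sites, because a page can repeat a tag.

```elixir
# One entry per site; each site holds one list per <meta> tag (invented data)
sample = [
  # site 1 repeats og:title
  [["description"], ["og:title"], ["og:title"]],
  [["description"], ["viewport"]],
  # a failed request or a page without meta tags
  []
]

# Total occurrences across all sites
occurrences =
  sample
  |> List.flatten()
  |> Enum.frequencies()

# Number of sites that contain each tag at least once
sites_per_tag =
  sample
  |> Enum.flat_map(fn site -> site |> List.flatten() |> Enum.uniq() end)
  |> Enum.frequencies()

IO.inspect(occurrences, label: "occurrences")
IO.inspect(sites_per_tag, label: "sites per tag")
```

In this sample og:title occurs twice but appears on only one site, so the two counts diverge.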
Conclusion
Granted, the above code is not of the best quality, but it helps us to answer our initial question. The most common tags were:
- description
- viewport
- og:title
To my surprise, only 303 of the 1000 sites (assuming all requests were successful, which I doubt) contained the og:title tag.
I will dig deeper in the future to see why so few contained the tag.
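One way to start that investigation is to separate failed or non-200 responses from pages that genuinely lack the tag, since the helper returns `[]` for a failed request and treats every `{:ok, response}` alike, whatever its status code. The sketch below is only an illustration under assumptions: it classifies tuples shaped like the return value of `Finch.request/2`, uses plain maps as stand-ins for responses so it runs without a network, and the `FetchStats` module is hypothetical.

```elixir
defmodule FetchStats do
  # Hypothetical helper: buckets a request result by outcome.
  def classify({:ok, %{status: 200}}), do: :ok
  def classify({:ok, %{status: status}}) when status in 300..399, do: :redirect
  def classify({:ok, %{status: _other}}), do: :http_error
  def classify({:error, _reason}), do: :request_failed

  # Counts how many results fall into each bucket
  def summarize(results) do
    results
    |> Enum.map(&classify/1)
    |> Enum.frequencies()
  end
end

# Invented results standing in for real responses
results = [
  {:ok, %{status: 200, body: "<html></html>"}},
  {:ok, %{status: 301, body: ""}},
  {:ok, %{status: 404, body: ""}},
  {:error, :timeout}
]

IO.inspect(FetchStats.summarize(results))
```

Redirects get their own bucket because a 3xx response carries no page content unless the client follows it, so such a site would contribute no meta tags at all.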
Thank you for reading. If you want to see more content like this, follow me on Twitter.
Appendix
This is the full module used to extract the meta tags from a link.
defmodule KiruHelper do
  def link_to_meta_tag(link) do
    IO.puts(link)
    # Do the HTTP request
    case Finch.build(:get, link) |> Finch.request(MyFinch) do
      {:ok, response} ->
        # parse HTML
        Floki.parse_document(response.body)
        |> extract_meta_tags()
      _ ->
        []
    end
  end

  def extract_meta_tags({:ok, document}) do
    # try to find a <head> tag
    case Floki.find(document, "head") |> Enum.at(0) do
      # nothing found
      nil ->
        []
      # found head
      {"head", _, innerHead} ->
        # we are only interested in meta tags
        meta_tags =
          innerHead
          |> Enum.filter(&is_tuple/1)
          |> Enum.filter(fn each -> tuple_size(each) == 3 end)
          |> Enum.filter(fn {name, _, _} -> String.downcase(name) == "meta" end)
        # get the meta tag names out of each tag
        meta_tags
        |> Enum.map(fn {"meta", attributes, _} ->
          attributes
          |> Enum.map(fn {attr_name, attr_value} ->
            # we are only interested in certain attributes
            case String.downcase(attr_name) do
              "property" -> attr_value
              "name" -> attr_value
              _ -> nil
            end
          end)
          |> Enum.reject(&is_nil/1)
        end)
    end
  end

  def extract_meta_tags(_), do: []
end
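Fetching 1000 sites one after another is slow. A possible speed-up, sketched here under assumptions, is `Task.async_stream/3`, which runs the function for several links concurrently. The `fetch` function below is a stand-in that only simulates work so the example runs offline; in the real script it would be `&KiruHelper.link_to_meta_tag/1`. The concurrency limit and timeout values are arbitrary choices, not measured ones.

```elixir
# Stand-in for KiruHelper.link_to_meta_tag/1 (simulated, no network access)
fetch = fn link ->
  Process.sleep(10)
  [[link <> "#description"]]
end

# Invented sample links
sample_links = ["https://www.example.com", "https://www.example.org"]

meta_tags =
  sample_links
  |> Task.async_stream(fetch,
    max_concurrency: 20,
    timeout: 15_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, tags} -> tags
    # a task that timed out is treated like a failed request
    {:exit, _reason} -> []
  end)

IO.inspect(meta_tags)
```

Results come back in input order by default, so the output has the same shape as the sequential `Enum.map` version and can feed the same counting step.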