Scrub Nonconforming URLs

SUBDOMAINS:

For a multi-language POS, each URL should live only in the GKMS for its language; any URL found in the wrong GKMS should be marked deleted. For example, gkms_th_en (a secondary-language POS) should have ONLY www.expedia.co.th/en/* URLs, and gkms_th_th (the primary-language POS) should NOT have any www.expedia.co.th/en/* URLs.
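
The segregation rule above can be sketched as a small pre-check over an exported URL list. This is only an illustration under assumptions: the function name, the "keep"/"delete"/"review" labels, and the idea of passing the language directory as a path prefix are hypothetical and not part of the existing tooling.

```python
from urllib.parse import urlsplit


def classify_for_feed(url: str, pos_host: str, lang_prefix: str, is_secondary: bool) -> str:
    """Return "keep", "delete", or "review" for one URL in a multi-language POS feed."""
    # Accept URLs with or without a protocol.
    if not url.lower().startswith(("http://", "https://")):
        url = "//" + url
    parts = urlsplit(url)
    if parts.hostname != pos_host:
        # Other hosts are not decided here; they need a manual look.
        return "review"
    in_lang_dir = parts.path.startswith(lang_prefix)
    # The secondary feed keeps only the language directory;
    # the primary feed keeps everything except that directory.
    return "keep" if in_lang_dir == is_secondary else "delete"
```

For instance, classify_for_feed("https://www.expedia.co.th/en/Hotels", "www.expedia.co.th", "/en/", is_secondary=False) returns "delete", since an /en/ URL does not belong in the primary-language feed.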

Also, after STAT and EDW URLs are imported, there will likely be many strange sub-domains, or even full domains, outside of the given POS. These should be marked “deleted” in the ds.SEO_StaticURLs feed, and SMG should then be re-run to get rid of them. Some sub-domains are likely acceptable, so use discretion.
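
One way to see which hosts need that discretion is to tally the hosts in an exported URL list and review the ones that are not the POS host. The sketch below assumes the URLs have already been pulled into a Python list; the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_outside_pos(urls, pos_host):
    """Count URLs per host, skipping the POS host, most frequent first."""
    counts = Counter()
    for url in urls:
        # Accept URLs with or without a protocol.
        if not url.lower().startswith(("http://", "https://")):
            url = "//" + url
        host = urlsplit(url).hostname or ""
        if host != pos_host:
            counts[host] += 1
    return counts.most_common()
```

The output is a list of (host, count) pairs, which makes it easier to decide host by host whether to keep or delete.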

SEARCH PAGES:

Search pages within the POS's domain are likely undesirable; e.g. http://www.expedia.de/Hotel-Search. Remove these the same way, by marking them deleted in ds.SEO_StaticURLs.

An easy way to search for these is with a query like

select * from Site_URL with(nolock) where template_id = 102 and url_no_protocol like 'www.expedia.de/%' order by len(url)

or

select * from Site_URL with(nolock) where template_id = 102 and url_no_protocol like '%search' and url_no_protocol not like '%.packagesearch' order by len(url)
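
The filter in the second query can also be expressed outside the database, for example to spot-check an exported list. This sketch assumes the database collation is case-insensitive (so that '%search' also matches "Hotel-Search"); the function name is hypothetical.

```python
def is_search_page(url_no_protocol: str) -> bool:
    """Mirror: like '%search' and not like '%.packagesearch'."""
    # Lowercase first, on the assumption of a case-insensitive collation.
    u = url_no_protocol.lower()
    return u.endswith("search") and not u.endswith(".packagesearch")
```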

MOBILE PAGES:

Mobile pages are generally undesirable; e.g. https://www.expedia.se/m/trips. Remove these pages by marking them as deleted in ds.SEO_StaticURLs.

An easy way to search for these is with a query like

select * from Site_URL where template_id = 102 and url_no_protocol like 'www.expedia.se/m/%' order by len(url);

CMS URLs:

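CMS URLs canonically end with a slash, but often show up in EDW and STAT without one. These slash-less variants should be marked deleted manually with an update like one of the following, where the names in braces are placeholders to fill in and the trailing [^/] restricts the match to URLs whose last character is not a slash.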

For VC deployments:

update ds.SEO_StaticURLs set deleted = 1 where deleted = 0 and template_id = 102 and url like 'http://{domain}/vc/{cms_top_dir}/%[^/]'

For non-VC deployments:

update ds.SEO_StaticURLs set deleted = 1 where deleted = 0 and template_id = 102 and url like 'http://{lob}.expedia.%[^/]'
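
Before running either update, it can help to preview which URLs the pattern would catch. The sketch below mirrors a LIKE pattern of the form '{prefix}%[^/]' in plain Python. It assumes a case-insensitive collation and a prefix with no LIKE wildcards of its own; the function name and the example.com values are hypothetical.

```python
def lacks_trailing_slash(url: str, prefix: str) -> bool:
    """Mirror: url like '<prefix>%[^/]' (prefix, then anything, then a non-slash)."""
    # Lowercase both sides, on the assumption of a case-insensitive collation.
    u, p = url.lower(), prefix.lower()
    # The pattern needs at least one character after the prefix,
    # and that final character must not be a slash.
    return u.startswith(p) and len(u) > len(p) and not u.endswith("/")
```

For example, with the prefix "http://www.example.com/vc/guides/", the URL "http://www.example.com/vc/guides/paris" is flagged while "http://www.example.com/vc/guides/paris/" is not.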
