Web Health Dashboard

How I built a platform to catalog and monitor the health of thousands of web properties.

About

Back in 2017, one of our clients, who was in charge of marketing properties at a large company, was asked to update the company branding across thousands of URLs. Attempting to handle this manually with a spreadsheet of over 100k rows proved highly impractical and unsustainable. Recognizing the need for a better solution, the client approached us to design a comprehensive platform that could efficiently catalog, monitor, and report on the health of their web ecosystem.

Build

The frontend was built using AngularJS, and the backend was developed with Python, which exposed a REST API to the frontend. The API endpoints were originally implemented with Cloud Endpoints and hosted on Google App Engine. The source of truth data was stored in Cloud Datastore, while page scan data was stored in BigQuery. To run the page scan and site report workflows, we used the App-Engine Pipelines workflow engine.

The technology stack changed when we migrated to Python 3 and moved to a microservices architecture:

  • Python 2.7 was replaced by Python 3
  • Cloud Endpoints was replaced by Flask
  • App-Engine Pipelines was replaced by Apache Airflow (Cloud Composer)
  • App Engine Memcache was replaced by Redis (GCP Memorystore)
  • App Engine Search API was replaced by Elasticsearch (GKE)

Role

  • Lead Backend Developer
  • Responsible for planning and delivering the backend
  • Provided estimates, scoping, and technical oversight
  • Traveled to NYC/SF for client workshops to plan new features
  • Technical Point of Contact

Challenges

There were several challenges, including organizational issues, scalability, performance, and the unpredictable nature of the World Wide Web.

Organizational

  • Limited technology options - For this client, we were limited in the technologies we could use. Selecting a different technology would have added the risk of project delays due to security concerns and reviews.
  • Internal infrastructure - Since this was an internal application, we faced an additional challenge due to their corporate firewall. This added another authentication layer, requiring investigation of internal configurations and learning about internal firewall ACLs.
  • Internal hosting - The most recent GCP services were not available/allowed (e.g., API Gateway), further limiting our technology choices.

Workflows

  • Complex scanning workflows - The App-Engine Pipelines workflow engine was not robust enough.

    App-Engine Pipelines provided the ability to define dynamic workflows and didn’t require any extra service or infrastructure, as it was integrated with App Engine and App Engine Tasks. However, this became problematic when App Engine instances were killed for exceeding the memory limit: the App Engine code responsible for handling pipeline stages would fail or be skipped, and the workflow stages would never complete. To address this, we added healing CRON jobs to detect and fix these hanging workflows (e.g., by retrying them); the pattern is sketched after this list.

  • Limited visibility on workflows - The App-Engine Pipelines UI was minimal and provided limited information. The worst part was that, for large workflows, the UI would break. I submitted a few fixes to the project (GitHub PR 75, GitHub PR 76, GitHub PR 77); however, the project was no longer maintained.

  • Static DAGs in Apache Airflow - When evaluating Apache Airflow as a replacement for the App-Engine Pipelines workflows, we had to adjust to how static Airflow DAGs are; creating dynamic DAGs in Airflow is neither simple nor stable. For instance, splitting 1,000 URLs into batches of 200 and then running 5 sequential stages before the workflow could continue required hacks. We addressed this by creating a queue for each site scan, controlling the task execution rate, and periodically checking for an empty queue to trigger the next DAG step (sketched after this list).
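
To illustrate the healing pattern mentioned above, here is a rough sketch, not the project's actual code: a cron handler queries a workflow-state store for stages stuck in a running state past a threshold and re-enqueues them. The WorkflowStage model and retry_stage helper are hypothetical stand-ins.

```python
# Minimal sketch of a "healing" cron handler, assuming a hypothetical
# WorkflowStage model that tracks the state of each pipeline stage and a
# hypothetical retry_stage() helper that re-enqueues the stage's task.
from datetime import datetime, timedelta

from google.cloud import ndb


class WorkflowStage(ndb.Model):          # hypothetical model
    status = ndb.StringProperty()        # e.g. "RUNNING", "DONE", "FAILED"
    updated_at = ndb.DateTimeProperty(auto_now=True)
    retries = ndb.IntegerProperty(default=0)


def retry_stage(stage: WorkflowStage) -> None:
    """Re-enqueue the task that drives this stage (details omitted)."""
    ...


def heal_hanging_workflows(max_age_minutes: int = 30, max_retries: int = 3) -> None:
    """Cron entry point: find stages stuck in RUNNING and retry them."""
    cutoff = datetime.utcnow() - timedelta(minutes=max_age_minutes)
    with ndb.Client().context():
        stuck = WorkflowStage.query(
            WorkflowStage.status == "RUNNING",
            WorkflowStage.updated_at < cutoff,
        ).fetch()
        for stage in stuck:
            if stage.retries < max_retries:
                stage.retries += 1
                stage.put()
                retry_stage(stage)
            else:
                stage.status = "FAILED"
                stage.put()
```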
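
For the Airflow side, here is a minimal sketch of the queue-based approach, assuming one Cloud Tasks queue per site scan (named scan-<id>) and a hypothetical enqueue_page_scans helper; the real DAGs had more stages and error handling.

```python
# Minimal sketch: enqueue page-scan tasks into a per-scan Cloud Tasks queue,
# then use a sensor to wait until the queue drains before the next DAG step.
# Project/location/queue naming and enqueue_page_scans are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor
from google.cloud import tasks_v2

PROJECT, LOCATION = "my-project", "us-central1"  # illustrative values


def enqueue_page_scans(scan_id: str, **_):
    """Split the site's URLs into batches and push one task per batch (omitted)."""
    ...


def queue_is_empty(scan_id: str, **_) -> bool:
    """Return True once the per-scan Cloud Tasks queue has drained."""
    client = tasks_v2.CloudTasksClient()
    queue = client.queue_path(PROJECT, LOCATION, f"scan-{scan_id}")
    return next(iter(client.list_tasks(parent=queue)), None) is None


with DAG("site_scan", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    enqueue = PythonOperator(
        task_id="enqueue_page_scans",
        python_callable=enqueue_page_scans,
        op_kwargs={"scan_id": "{{ dag_run.conf['scan_id'] }}"},
    )
    wait_for_queue = PythonSensor(
        task_id="wait_for_queue_to_drain",
        python_callable=queue_is_empty,
        op_kwargs={"scan_id": "{{ dag_run.conf['scan_id'] }}"},
        poke_interval=60,
        mode="reschedule",  # free the worker slot between polls
    )
    generate_report = PythonOperator(
        task_id="generate_site_report",
        python_callable=lambda **_: None,  # placeholder for the report step
    )
    enqueue >> wait_for_queue >> generate_report
```

Running the sensor in reschedule mode frees the worker slot between polls, which matters when many site scans are in flight at once.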

Scanning

  • PageSpeed Insights API Quotas - The PageSpeed Insights API had a daily quota that we needed to respect. In some situations, we ended up with bad website reports due to page speed data not being provided, resulting in a poor overall website performance score.
  • Lighthouse crashing - For some URLs, Lighthouse would crash and not provide the audit results. An example of this would be a URL that redirected to install a Chrome extension.
  • Lighthouse unreliable speed scores - For desktop and mobile speed performance, we tried to self-host Lighthouse; however, when issuing multiple requests, Lighthouse produced inconsistent speed scores.
  • Websites applying rate limits - When scanning for broken links or images, the target sites sometimes applied rate limits to our requests, returning HTTP 429.
  • Non-standard HTTP status codes - LinkedIn would return HTTP 999 status codes to block non-human requests.
  • No support for basic HTTP methods - Issuing an HTTP HEAD request on certain URLs would return 404, but issuing an HTTP GET request would return 200 (see the link-check sketch after this list).
  • Poor sitemap.xml files - Many sites had no sitemap.xml, others used localhost in their loc entries, and others returned an HTML page instead of XML.
  • False 404s - Requesting a resource (e.g., an image) would return a 404, but requesting it with a different HTTP client and User-Agent returned 200.
  • Website size - Most websites were small (fewer than 1,000 pages), but there were large sites (more than 100k pages) and a couple with 10k+ pages. This required planning, as we didn’t want a large site to consume the entire daily quota of an external API or slow down the scanning of smaller sites.
  • Sites without a grading category - Some types of sites didn’t have a grading category (e.g., freshness), which required special handling across the frontend and backend when computing the website report grade.
  • Website relationships - A website could have more than 120 locales. Some sites were global only, others regional only. This required us to define what constitutes a global, regional, and localized website.
  • Grade normalization - Websites were graded according to a health profile, meaning a campaign website was graded using the health profile for that type of site. When looking at historical data, we had to normalize all site report scores into a common grading, because a score of 75 might be good for one health profile but poor for another.
  • Wrong report/scoring results - The page scan metrics were streamed to a partitioned BigQuery table. However, in some cases the site report was generated immediately after the last website page was scanned, throwing the scoring off. This turned out to be caused by late-arriving data in the BigQuery tables.
  • Unexpected scanning results - In some cases, the system was scanning when the website developers were running updates or experiencing issues (e.g., returning 502s). We were asked why our system was not scanning their sites correctly.
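
To give a sense of how some of these quirks were handled, here is a simplified link-check sketch (not the production scanner): HEAD with a fallback to GET, a request timeout, special-casing 429 and 999, and re-checking 404s with a browser-like User-Agent. The header value and timeout are illustrative.

```python
# Simplified link check: HEAD first, fall back to GET, tolerate rate limiting
# (429) and LinkedIn's 999, and double-check 404s with a browser-like
# User-Agent. Timeout and header values are illustrative, not the real config.
import requests

BROWSER_UA = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}  # illustrative
TIMEOUT = 30  # seconds before we consider the resource unreachable


def is_broken(url: str) -> bool:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=TIMEOUT)
        if resp.status_code in (404, 405, 501):
            # Some servers mishandle HEAD; retry with GET before judging.
            resp = requests.get(url, allow_redirects=True, timeout=TIMEOUT, stream=True)
    except requests.RequestException:
        return True  # DNS errors, bad SSL, timeouts, etc.

    if resp.status_code in (429, 999):
        # Rate limited or bot-blocked: not a broken link.
        return False
    if resp.status_code == 404:
        # Some servers answer 404 to non-browser clients; re-check with a
        # browser-like User-Agent before flagging the link as broken.
        try:
            resp = requests.get(url, headers=BROWSER_UA, allow_redirects=True,
                                timeout=TIMEOUT, stream=True)
        except requests.RequestException:
            return True
    return resp.status_code >= 400
```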

Learnings

  • World Wild Web - The web is wild, so be prepared to encounter unexpected situations, such as bad markup, servers not adhering to standards, poorly configured sites, bad SSL certificates, bad DNS, etc.
  • Retries all the way - You soon learn to implement retries with exponential backoff (+ jitter) to accommodate transient errors. These can occur because the remote resource is being deployed, experiencing hiccups, or because a service quota was exceeded (e.g., max 100 requests per minute). See the sketch after this list.
  • Retry up to a limit - Exponential retries will help you up to a certain limit. For example, when Lighthouse crashes all the time for a URL, you need to have a maximum number of retries; otherwise, you will be stuck forever.
  • Need to use timeouts - When scanning webpage resources such as link and image URLs, you soon realize that some requests can take 10 minutes or more. To avoid this, set an acceptable timeout before deciding whether the resource is broken.
  • Max concurrent site scans - In order not to overload the system, we had to limit how many sites were scanned concurrently. We also controlled how many webpages could be scanned concurrently, considering service quotas such as BigQuery's maximum of 100 requests per second. Another example of a limit is the maximum of 200 instances in App Engine.
  • Late-arriving data - Some site reports were wrong when checked manually; this turned out to be an issue with late-arriving data, where page scan metrics were streamed to BigQuery and the site report was generated immediately after the last site page was scanned. To fix this, we added a minimum threshold before a site report could be generated after the last page was scanned and processed. We recorded a timestamp when the last page was scanned and, when generating the site report, only retrieved metrics data from the accepted interval; late-arriving data was ignored by the report generation queries (see the query sketch after this list).
  • Bulkhead critical services - We received a complaint that the Admin UI was slow. After investigation, we found that a large number of site scans on the system were making the service that provided the admin endpoints slow. Since this service also handled site scan logic, the best compromise and safe way was to compartmentalize (bulkhead) the service by type of access. This allowed the admin workload to be isolated from the heavy site scanning workload.
  • 999 status codes - LinkedIn returns these nonsense HTTP status codes. The solution was to ignore them.
  • 429 status codes - The site is rate-limiting our requests, which is acceptable, and we don’t want to flag them as broken links, so we kindly ignore them.
  • 404 first and 200 second - Some servers would return 404 for a URL; however, when requesting with a different HTTP client and User-Agent, it would return a 200. To be more accurate, for all 404s, we would check with a different client.
  • Apache Airflow UI - Migrating to Apache Airflow was a blessing in terms of DAG visibility, inspectability, and being able to see what is going on.
  • Bad sites for testing - Unit tests are not enough to test the site scanning and report generation. We created a dedicated website with bad pages, such as invalid HTML elements, invalid aria roles, images that redirect forever, links that sometimes resolve, links to sites with bad DNS, links to expired SSL certs, etc. This was very helpful to ensure the system can handle the most problematic pages and URLs effectively.
  • Big and small sites - We had a combination of small and large sites (100k+ pages). Ensuring that quotas were respected and that smaller sites were not affected by large site scans took a bit of configuration to achieve the desired results.
  • No UI tests - Not having UI tests allowed deployments to introduce bugs on several occasions.
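
As an illustration of the retry pattern (the project used its own helpers; this is a generic sketch), here is a retry wrapper with exponential backoff, full jitter, and a hard cap on attempts.

```python
# Generic retry helper with exponential backoff, full jitter, and a hard cap
# on attempts. The base delay, cap, and retried exception types are illustrative.
import random
import time


def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                       retriable=(Exception,)):
    """Call func(); on a retriable error, sleep up to base_delay * 2^attempt
    (capped, with full jitter) and try again, at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retriable:
            if attempt == max_retries:
                raise  # give up, e.g. Lighthouse crashing on the same URL forever
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter


# Usage (run_lighthouse is hypothetical):
#   retry_with_backoff(lambda: run_lighthouse(url), max_retries=3)
```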
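
And a sketch of the late-arriving-data guard: the report query only considers rows whose scan timestamp falls inside the accepted interval for that scan run. The table and column names are hypothetical.

```python
# Sketch of the report query guard against late-arriving data: only rows whose
# scan timestamp falls inside the accepted interval for this scan run are used.
# Table and column names are hypothetical.
from google.cloud import bigquery


def fetch_scan_metrics(site_id: str, scan_started, scan_finished):
    client = bigquery.Client()
    query = """
        SELECT url, metric_name, metric_value
        FROM `my_project.web_health.page_scan_metrics`
        WHERE site_id = @site_id
          AND scanned_at BETWEEN @scan_started AND @scan_finished
    """
    job = client.query(query, job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("site_id", "STRING", site_id),
            bigquery.ScalarQueryParameter("scan_started", "TIMESTAMP", scan_started),
            bigquery.ScalarQueryParameter("scan_finished", "TIMESTAMP", scan_finished),
        ]
    ))
    return list(job.result())
```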

Results

  • Thousands of web properties
  • Millions of scanned webpages
  • Thousands of reported issues
  • Websites graded on multiple categories (accessibility, discoverability, performance, freshness)
  • The client was able to catalog their websites
  • More accountability for site owners
  • Easier to track site owners
  • Reporting allows slicing and dicing of data
  • Historical view of the site’s health
  • Made it easier to identify issues
  • Allows the client to make high-level decisions (deprecate vs update)
  • Automatic site creation
  • Custom ad-hoc audits

Media

/images/web-health-dashboard-1.png
Site Scan DAG Calendar View
/images/web-health-dashboard-2.png
Site Scan DAG Tree View