Robots.txt Guide for Squarespace
A robots.txt file helps you manage how search engines and other bots interact with your website’s content. Squarespace has traditionally used the same default robots.txt file across all sites. However, following the rise of popular AI platforms, many Squarespace customers requested the ability to block AI crawlers from scanning their sites.
In August 2023, OpenAI announced GPTBot, a crawler used to train its language models, and said the bot would follow robots.txt directives. This prompted many large websites, such as The New York Times, CNN, and Healthline, to block AI training bots from crawling their content. In October 2023, Squarespace added the option to block many known AI crawlers, all search engines, or both.
This is a guide to understanding robots.txt on Squarespace.
What is robots.txt?
Robots.txt is a simple text file that gives instructions to web crawlers (sometimes called robots or spiders) about which pages or files they can or cannot crawl. The file is placed in the root directory of your website: example.com/robots.txt
In general, a robots.txt file is used to manage crawler traffic to a site, not to hide web pages from search results. Pages that are disallowed in a robots.txt file can still be indexed if other pages link to them. So, while robots.txt can indirectly influence the visibility of content, it’s not a foolproof method for keeping pages out of search results.
Note that to fully block content from Google search results, you need to use another method such as password protection, noindex, or Google’s removal tool.
Tip: It can be difficult to identify and block every crawler, and not all bots follow robots.txt rules. Well-behaved bots, however, should respect your preferences.
Robots.txt Syntax & Examples
While your Squarespace robots.txt file can’t be directly edited, understanding the basic syntax will help you know what you’re impacting if you change the default settings on your site’s Crawlers panel (accessed via Settings > Crawlers).
A robots.txt file is made up of User-agents and the directives Disallow and Allow:
User-agent: Specifies which web crawler you’re giving instructions to, such as Googlebot or Bingbot. User-agent: * (with an asterisk) is a wildcard that matches most web crawlers. It does not apply universally, however; for example, Google’s AdsBot crawlers ignore the wildcard and must be named explicitly.
Disallow: Lists the URL paths you don’t want the user-agent to crawl.
Allow: Specifically allows access to a directory or page.
For example, this robots.txt tells Googlebot not to crawl any page or file in the /private/ directory:
User-agent: Googlebot
Disallow: /private/
In this example, all user-agents (indicated by *), including Googlebot, Bingbot, and others, are blocked from crawling URLs under /private/ and /temp/:
User-agent: *
Disallow: /private/
Disallow: /temp/
If a robots.txt file contains multiple user-agent groups that apply to a single crawler, the directives in the most specific matching group take precedence.
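For example, in the file below (using hypothetical paths), most crawlers match the wildcard group and skip /private/ entirely, while Googlebot matches its own, more specific group, so it may crawl /private/ but not /private/archive/:
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /private/
Disallow: /private/archive/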
Squarespace robots.txt
The Squarespace Crawlers panel is a step in the right direction, giving site owners more control over how their content is crawled. The platform’s support documentation explains that one checkbox will “exclude your site” from search engine crawlers and the other from AI crawlers.
How to update Squarespace robots.txt:
Visit Settings > Crawlers, then set the checkboxes to your preferred settings. Changes are immediately reflected in your robots.txt file. Options include:
Default robots.txt: If both Crawler settings are unchecked, your site uses the default Squarespace robots.txt file and remains open to all crawling.
Disallow AI: If you disallow or block artificial intelligence crawlers, the following AI crawlers will not scan your site, provided they choose to follow the “rules” (see the illustrative file after this list):
GPTBot: OpenAI’s training bot.
ChatGPT-User: Used by ChatGPT plugins, primarily to answer a live query from a user. Retrieved content is not used to train OpenAI’s models.
CCBot: The Common Crawl bot crawls the entire internet, and its dataset was used to train GPT-3.
anthropic-ai: Anthropic’s training bot for Claude.
Google-Extended: A token that controls whether your content is used to train Bard, Google’s AI. Note that it does not affect AI-powered answers in Google’s SGE (Search Generative Experience).
FacebookBot: Bot used by Facebook to improve their language models for speech recognition.
Claude-Web: Another Anthropic crawler.
cohere-ai: Unconfirmed crawler used by Cohere AI.
Applebot-Extended: Used to train Apple’s models for generative AI.
PerplexityBot: Gathers information for Perplexity’s search engine.
New crawlers are added periodically, but at this time, not all known bots are listed. Refer to The New York Times robots.txt file for a more up-to-date list.
Disallow Search Engines: Checking “Block Search Engine Crawlers” blocks search engine bots and Google’s AdsBots, excluding your site from major search engines.
Remember: Disallowed pages can still be indexed if linked from elsewhere. To fully prevent indexing, learn more about how to hide pages on Squarespace.
Disallow AI and Search Engines: When both options are checked, the robots.txt file disallows the specific AI bots listed above, Google’s AdsBots, and, via the wildcard, all other bots that respect robots.txt rules.
Your Squarespace robots.txt file also includes the URL of your sitemap. If you’ve verified a site in Google Search Console, you can refer to Google’s robots.txt report.
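As an illustrative sketch only (not a verbatim copy of the file Squarespace generates), a robots.txt with AI crawlers disallowed would contain one group per blocked bot, followed by the sitemap reference:
# Illustrative excerpt; example.com stands in for your domain
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml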
Issue with Squarespace’s Implementation
Resolved. When Squarespace first introduced the option to “disallow search engines,” it didn’t work. The toggle (now a checkbox) only prevented Google’s AdsBots from crawling a site, not the broader range of search engine crawlers you’d expect.
We reported the bug and Squarespace promptly fixed it.
Note that Squarespace often releases new features with insufficient testing. While most Squarespace users were not rushing to block search engines, this type of error illustrates why a basic understanding of how websites and search engines work is important. If you make changes to your robots.txt file, you can verify them by viewing: yoursite.com/robots.txt
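Beyond eyeballing the file, you can programmatically confirm that a directive behaves as expected. Here is a minimal sketch using Python’s standard-library urllib.robotparser; the domain and user-agents are placeholders to swap for your own:
import urllib.robotparser

# Point the parser at your live robots.txt file
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # replace with your domain
parser.read()

# Check whether a given crawler is allowed to fetch a given URL
for agent in ("Googlebot", "GPTBot", "CCBot"):
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")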
What types of sites might block search engines?
The vast majority of sites will not want to block search engine crawlers, as doing so reduces online visibility and discoverability. A site that disallows crawling by traditional search engines forgoes organic search traffic entirely.
But if a website is created for a private audience, allowing search engines to crawl and index it can expose private information or events to anyone using a search engine. In these cases, you can ask bots not to crawl your site. For example:
Private websites that are intended for family and friends. For example, wedding sites often have personal stories, information on the ceremony, travel details, and RSVP forms. To maintain some privacy, couples can block search engines, then share the website URL directly with guests through invitations or private communications.
Internal company sites are often used to share information with employees about upcoming events. Blocking search engines can prevent this information from being publicly indexed. Direct links to the site can be sent to employees via email or through the company’s internal communication systems.
Again, for highly sensitive information, you’ll want to safeguard it with password protection. Learn more about how to hide Squarespace pages.
Should you block AI crawlers?
Should you block AI crawlers? If so, which ones? The decision will remain an ongoing debate, and the right approach varies with each website’s goals and content types. The implications of blocking AI crawlers extend beyond immediate visibility: they involve strategic decisions about content protection and competitive positioning in the world of AI.
Blocking AI bots could keep proprietary content from being reproduced by AI models and might prevent competitors from leveraging your intellectual property to train their AI systems. In the long run, you might prefer to block AI models from some content but not all. For example, a site might want some pages crawled for brand recognition, and other pages blocked to prevent copyrighted work from being used for training purposes.
Final Thoughts
Squarespace will likely move toward allowing a greater degree of control over robots.txt settings. Robots.txt is a simple yet powerful file that supports your small business SEO strategy by guiding bots to the content you want crawled and indexed.
Disallowing crawlers should be done carefully to avoid accidentally blocking important content from search engines or AI functions. Staying informed about how AI platforms use content, and monitoring how crawlers respect robots.txt files, will be an important part of your site’s SEO and content strategy.
Additional Resources
We provide SEO training for do-it-yourself SEO endeavors.
Collaborada’s AI Primer for SMBs.
Google’s introduction to robots.txt.
Pierre Far on crawlers, search engines, and the sleaze of generative AI companies.