Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a set of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.
Organizations’ valuable data is stored in both structured and unstructured repositories. An enterprise search solution should be able to give you a fully managed experience and simplify the process of indexing your content from a variety of data sources across the enterprise.
One such repository of unstructured data is internal and external websites. Websites may need to be crawled to create news feeds, analyze language usage, or build chatbots that answer questions based on website data.
We’re excited to announce that you can now use the new Amazon Kendra web crawler to find answers from content stored on internal and external websites or create chatbots. In this post, we show how to index information stored on websites and use intelligent search in Amazon Kendra to search for answers in content stored on internal and external websites. Additionally, ML-based intelligent search can get accurate answers to your questions from unstructured documents with natural language narrative content, for which keyword searching is not very effective.
Web Crawler offers the following new features:
- Support for Basic authentication, NTLM/Kerberos, form authentication, and SAML
- The ability to specify up to 100 seed URLs and store connection settings in Amazon Simple Storage Service (Amazon S3)
- Support for web and internet proxies, with the ability to provide proxy credentials
- Support for crawling dynamic content, such as a website containing JavaScript
- Regular expression filtering and field mapping capabilities
Solution Overview
With Amazon Kendra, you can set up multiple data sources to provide a central place to search your document repository. For our solution, we demonstrate how to index a crawled website using Amazon Kendra Web Crawler. The solution consists of the following steps:
- Choose an authentication mechanism for the website (if necessary) and store the details in AWS Secrets Manager.
- Create an Amazon Kendra index.
- Create a Web Crawler V2 data source through the Amazon Kendra console.
- Run a sample query to test the solution.
Prerequisites
To try the Amazon Kendra Web Crawler, you need an AWS account with access to the Amazon Kendra console.
Collect authentication details
For protected and secure websites, the following authentication types and standards are supported:
- Basic authentication
- NTLM/Kerberos
- Form authentication
- SAML
You need the authentication information when you configure the data source.
For Basic or NTLM authentication, you must provide a Secrets Manager secret containing your user name and password.
Form and SAML authentication require additional information, as shown in the following screenshot. Some of the fields, such as the XPath of the user name button, are optional and depend on whether the site you are crawling presents a button after you enter the user name. Also note that you need to know how to determine the XPath of the user name and password fields and the submit buttons.
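If you prefer to script this step rather than use the console, the credentials can be stored as a JSON secret with boto3 (the AWS SDK for Python). The following is a minimal sketch; the secret name and credential values are placeholders for your own.

```python
import json

# Placeholder credentials for the site you want to crawl (Basic or NTLM authentication).
credentials = {"userName": "crawler-user", "password": "example-password"}

def store_crawler_secret(secret_name: str, payload: dict) -> str:
    """Store crawler credentials as a JSON secret and return the secret ARN."""
    import boto3  # AWS SDK for Python

    client = boto3.client("secretsmanager")
    response = client.create_secret(
        Name=secret_name,
        SecretString=json.dumps(payload),
    )
    return response["ARN"]

# Example call (requires AWS credentials and permissions):
# secret_arn = store_crawler_secret("AmazonKendra-crawler-credentials", credentials)
```

You reference the resulting secret later when you configure the data source.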
Create an Amazon Kendra index
To create an Amazon Kendra index, complete the following steps:
- In the Amazon Kendra console, choose Create an index.
- For Index name, enter a name for the index (for example, Web Crawler).
- Enter an optional description.
- For Role name, enter an IAM role name.
- Configure optional labels and encryption settings.
- Choose Next.
- In the Set up user access control section, leave the settings at their default values and choose Next.
- For Provisioning editions, select Developer edition and choose Next.
- On the review page, choose Create.
This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
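The same index can be created programmatically. The sketch below uses the boto3 `create_index` API with parameters mirroring the console choices above; the role ARN is a placeholder for the IAM role you created.

```python
# Parameters mirroring the console choices above; the RoleArn is a placeholder.
index_params = {
    "Name": "Web Crawler",
    "Edition": "DEVELOPER_EDITION",
    "RoleArn": "arn:aws:iam::111122223333:role/kendra-index-role",
}

def create_kendra_index(params: dict) -> str:
    """Create the index and return its ID; index creation itself is asynchronous."""
    import boto3  # AWS SDK for Python

    kendra = boto3.client("kendra")
    return kendra.create_index(**params)["Id"]

# After calling create_kendra_index(index_params), you can poll
# kendra.describe_index(Id=index_id) until its Status becomes ACTIVE.
```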
Create an Amazon Kendra web crawler data source
Complete the following steps to create your data source:
- In the Amazon Kendra console, choose Data sources in the navigation panel.
- Locate the WebCrawler Connector V2.0 tile and choose Add connector.
- For Data source name, enter a name (for example, crawl-fda).
- Enter an optional description.
- Choose Next.
- In the Source section, select Source URLs and enter a URL. For this post, we use https://www.fda.gov/ as an example source URL.
- In the Authentication section, choose the appropriate authentication type for the site you want to crawl. For this post, we select No authentication because it's a public site that doesn't require authentication.
- In the Web proxy section, you can specify a Secrets Manager secret (if needed).
- Choose Create and add a new secret.
- Enter the authentication details you collected earlier.
- Choose Save.
- In the IAM role section, choose Create a new role and enter a name (for example, AmazonKendra-WebCrawler-datasource-role).
- Choose Next.
- In the Sync scope section, configure your sync settings based on the site you are crawling. For this post, we keep all the default settings.
- For Sync mode, choose how you want to update your index. For this post, we selected Full sync.
- For the sync run schedule, choose Run on demand.
- Choose Next.
- You can optionally configure field mappings. For this post, we’re keeping the default values for now.
Field mapping is a useful exercise in which you can replace field names with values that are easy to use and fit your organization’s vocabulary.
- Choose Next.
- Choose Add data source.
- To sync the data source, choose Sync now on the data source details page.
- Wait for the synchronization to complete.
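If you trigger syncs from code instead of the console, the following sketch starts an on-demand sync and polls until it finishes, using the boto3 `start_data_source_sync_job` and `list_data_source_sync_jobs` APIs. The index and data source IDs are placeholders for the ones you created above.

```python
import time

# Statuses after which a sync job is no longer running.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "INCOMPLETE"}

def sync_data_source(index_id: str, data_source_id: str, poll_seconds: int = 30) -> str:
    """Start an on-demand sync, wait for it to finish, and return the final status."""
    import boto3  # AWS SDK for Python

    kendra = boto3.client("kendra")
    kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    while True:
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id
        )["History"]
        # The most recent sync job is first in the history list.
        status = history[0]["Status"] if history else "SYNCING"
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
```

Crawl time varies with the sync scope you configured, so polling intervals of 30 seconds or more are reasonable.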
Example of an authenticated website
If you want to crawl a site that requires authentication, you must specify the authentication details in the Authentication section of the previous steps. The following is an example using Form authentication.
- In the Source section, select Source URLs and enter a URL. For this example, we use https://accounts.autodesk.com.
- In the Authentication section, select Form authentication.
- In the Web proxy section, specify your Secrets Manager secret. This is required for any option other than No authentication.
- Choose Create and add a new secret.
- Enter the authentication details you collected earlier.
- Choose Save.
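The form-authentication secret is likewise a JSON secret in Secrets Manager. The key names and XPath values below are illustrative placeholders, not the exact keys Amazon Kendra expects; mirror the field names shown on the console's form-authentication screen when you create the real secret.

```python
import json

# Hypothetical key names and XPaths for illustration only -- use the fields
# shown on the console's form-authentication screen for the real secret.
form_auth_secret = {
    "userName": "crawler-user",                           # placeholder credential
    "password": "example-password",                       # placeholder credential
    "userNameFieldXpath": '//*[@id="userName"]',          # XPath of the user name field
    "passwordFieldXpath": '//*[@id="password"]',          # XPath of the password field
    "userNameButtonXpath": '//*[@id="verify_user_btn"]',  # optional user name button
    "submitButtonXpath": '//*[@id="btnSubmit"]',          # XPath of the submit button
}

secret_string = json.dumps(form_auth_secret)
```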
Try the solution
Now that you’ve ingested the site’s content into your Amazon Kendra index, you can try a few queries.
- Go to your index and choose Search indexed content.
- Enter a sample search query and review your search results (your results will vary depending on the content of the site you crawled and the query you enter).
Congratulations! You have successfully used Amazon Kendra to display responses and information based on the indexed content of the site you crawled.
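You can also query the index programmatically with the boto3 `query` API. The sketch below runs a natural-language question against the index and extracts the result type and excerpt from each result item; the index ID and question are placeholders.

```python
def top_answers(response: dict, limit: int = 3) -> list:
    """Pull (result type, excerpt text) pairs out of a Kendra Query response."""
    pairs = []
    for item in response.get("ResultItems", [])[:limit]:
        excerpt = item.get("DocumentExcerpt") or {}
        pairs.append((item.get("Type", ""), excerpt.get("Text", "")))
    return pairs

def search_index(index_id: str, question: str) -> list:
    """Run a natural-language query against the index (requires AWS credentials)."""
    import boto3  # AWS SDK for Python

    kendra = boto3.client("kendra")
    response = kendra.query(IndexId=index_id, QueryText=question)
    return top_answers(response)

# Example call (requires AWS credentials and an ACTIVE index):
# for result_type, text in search_index("your-index-id", "What does the FDA regulate?"):
#     print(result_type, text)
```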
Clean up
To avoid incurring future costs, clean up the resources that you created as part of this solution. If you created a new Amazon Kendra index while trying this solution, delete it. If you only added a new data source using Amazon Kendra Web Crawler V2, delete that data source.
Conclusion
With the new Amazon Kendra Web Crawler V2, organizations can crawl any website that is public or behind authentication and use it for intelligent searches powered by Amazon Kendra.
To learn about these possibilities and more, see the Amazon Kendra Developer Guide. To learn more about how you can create, modify, or delete metadata and content when ingesting your data, see Enrich your documents during ingestion and Enrich your content and metadata to improve your search experience with custom document enrichment in Amazon Kendra.
About the authors
Jiten Dedhia is a Sr. Solutions Architect with over 20 years of experience in the software industry. He has worked with global financial services clients, advising them on modernization using services provided by AWS.
Gunwant Walbe is a software development engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex enterprise applications, and Java is his primary language of choice.