
Add support for crawling subdomains #27

Open
alexspeller wants to merge 1 commit into chriskite:next from alexspeller:4419464056d3de337162

Conversation

@alexspeller

Merge changes to support subdomain crawling from runa@91559bd

@MaGonglei

This feature is very useful.
I think Anemone should also support printing out external links: just print them, but don't crawl them any deeper.
The link checker tool XENU (http://home.snafu.de/tilman/xenulink.html) has this feature.

@wokkaflokka

MaGonglei: It is very simple to gather external links using Anemone, and comparably simple to actually check these links to verify they are valid, etc. The 'on_every_page' block is very helpful in this regard.

If you'd like some code that does exactly what you are asking, I could send an example your way.
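
A minimal sketch of what such an example might look like, assuming an Anemone version of that era where "page.doc" exposes the parsed Nokogiri document and "page.url" the page's URI; "http://www.example.com" is just a placeholder, and with a plain host comparison like this, subdomains also count as external:

require 'anemone'
require 'set'
require 'uri'

external_links = Set.new

Anemone.crawl("http://www.example.com") do |anemone|
  anemone.on_every_page do |page|
    next unless page.doc                          # skip non-HTML responses
    page.doc.xpath('//a/@href').each do |href|
      # resolve relative hrefs against the page URL; skip anything unparsable
      abs = URI.join(page.url.to_s, href.to_s) rescue next
      # keep links whose host differs from the page being crawled
      external_links << abs.to_s if abs.host && abs.host != page.url.host
    end
  end
end

puts external_links.to_a.sort

The external links are only collected and printed here, not followed, since Anemone itself stays on the crawled site.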

@MaGonglei

Hi wokkaflokka, thanks for your reply.
I think I know what you mean, but I would prefer to have this feature when I initialize the Anemone crawl, like:
Anemone.crawl("http://www.example.com", :external_links => false) do |anemone|
....
end

Because if I use the "on_every_page" block to search for external links (e.g. page.doc.xpath '//a[@href]'), it seems to cost too much CPU and memory.

If I'm wrong, please send me an example.

Thanks.
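
For the checking step mentioned above (verifying that the collected external links actually respond), a rough follow-up sketch using only Ruby's standard library; "external_links" is the set gathered in the earlier sketch, and a plain HEAD request is only one possible notion of "valid" (redirects and retries are ignored here):

require 'net/http'
require 'uri'

external_links.each do |link|
  uri = URI(link)
  next unless uri.is_a?(URI::HTTP)   # only check http/https links (URI::HTTPS inherits from URI::HTTP)
  begin
    response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
    puts "#{response.code} #{link}"
  rescue StandardError => e
    puts "FAILED #{link} (#{e.class})"
  end
end

Since Anemone already parses each fetched page into page.doc in order to find the links it follows, the extra XPath query itself should add relatively little overhead; most of the memory cost depends on how many links are kept.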
