ImportTextFiles.php
Revision as of 02:53, 27 January 2022
Please refer to the original MediaWiki documentation.[1]
A Typical Importation Process
- Extract text content from a source, ideally Wikipedia or another reputable data source with an existing data model.
- In the case of Wikipedia, every page can be treated as a single text file, and each page has a unique ID provided in either the XML dump or the MediaWiki data model.
- Dump all the textual content into a directory. Store each page in its own file, and record the page names in a comma-delimited file (CSV) in which each line associates a unique ID with a page name.
- Move the files to a place the MediaWiki maintenance scripts can easily access, such as /var/www/html/images. For Docker deployments, this directory is usually mapped onto the host machine's drive, which makes it easy to place the text files there, for example in INPUTDATA_DIR under /var/www/html/images.
- Run the importTextFiles.php script as follows:
root@hostname:/var/www/html/# php ./maintenance/importTextFiles.php -s "Loading Textual Content from external sources" --overwrite --use-timestamp ./images/INPUTDATA_DIR/*
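The directory layout and CSV mapping that the steps above assume can be sketched as follows. All IDs and page names here are illustrative, not part of the script's requirements; note that importTextFiles.php derives each page title from the file name, so imported pages initially appear under their numeric IDs and are renamed in the sections below.

```shell
# Hypothetical dump layout: one file per page, named after its unique ID
mkdir -p ./images/INPUTDATA_DIR
printf 'Example page text\n' > ./images/INPUTDATA_DIR/12345
printf 'Another page text\n' > ./images/INPUTDATA_DIR/67890

# The CSV mapping: one "ID,Page name" line per page
printf '12345,Module:Example\n67890,Module:Another\n' > wp_articles_template.csv
```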
Preprocess file names
Given a list of input data files, for example files from Wikipedia's Module collection whose names are the unique IDs mapping to page names, the following command produces a smaller file that contains only the matching replacement lines.
ls ./images/TEMPLATE/ | xargs -n 1 -I {} grep {} wp_articles_template.csv > template_replacement.txt
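A minimal end-to-end check of this filtering step, using made-up IDs and page names:

```shell
# Set up a made-up dump directory and CSV (illustrative names only)
mkdir -p ./images/TEMPLATE
touch ./images/TEMPLATE/12345            # file named after its page ID
printf '12345,Module:Example\n99999,Module:Unused\n' > wp_articles_template.csv

# For each file name (an ID), keep only the matching CSV lines
ls ./images/TEMPLATE/ | xargs -n 1 -I {} grep {} wp_articles_template.csv > template_replacement.txt

cat template_replacement.txt    # 12345,Module:Example
```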
When performing data extraction, one usually uses a comma (,) as the field delimiter. To be compatible with the moveBatch.php syntax, which separates the old and new titles with a pipe (|), we need to perform a substitution:
sed -i "s/,/|/g" template_replacement.txt
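On a single made-up line, the substitution produces the pipe-delimited form that moveBatch.php reads, one `old title|new title` pair per line:

```shell
printf '12345,Module:Example\n' > template_replacement.txt
sed -i "s/,/|/g" template_replacement.txt
cat template_replacement.txt    # 12345|Module:Example
```

Note that `-i` without a suffix argument is GNU sed syntax; BSD/macOS sed requires `-i ''`.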
Move pages to their proper names
After the replacement file is ready, one can simply run the moveBatch.php script to change the page names.
php ../maintenance/moveBatch.php --u=user --r="load template data" --noredirects template_replacement.txt