Package com.optimaize.langdetect.ngram
Class NgramExtractor
java.lang.Object
com.optimaize.langdetect.ngram.NgramExtractor
Class for extracting n-grams out of a text.
-
Field Summary
Fields
Modifier and Type, Field, Description:
private final @Nullable NgramFilter filter
private final @Nullable Character textPadding
-
Constructor Summary
Constructors
Modifier, Constructor, Description:
private NgramExtractor(@NotNull List<Integer> gramLengths, @Nullable NgramFilter filter, @Nullable Character textPadding)
-
Method Summary
Modifier and Type, Method, Description:
private void _extractCounted(CharSequence text, int gramLength, int len, Map<String, Integer> grams)
private CharSequence applyPadding(CharSequence text)
extractCountedGrams(@NotNull CharSequence text)
extractGrams(@NotNull CharSequence text): Creates the n-grams for a given text in the order they occur.
filter(NgramFilter filter)
static NgramExtractor gramLength(int gramLength)
static NgramExtractor gramLengths(Integer... gramLength)
private static int guessNumDistinctiveGrams(int textLength, int gramLength): This is trying to be smart.
textPadding(char textPadding): To ensure having border grams, this character is added to the left and right of the text.
-
Field Details
-
gramLengths
-
filter
-
textPadding
-
-
Constructor Details
-
NgramExtractor
private NgramExtractor(@NotNull List<Integer> gramLengths, @Nullable NgramFilter filter, @Nullable Character textPadding)
-
-
Method Details
-
gramLength
-
gramLengths
-
filter
-
textPadding
To ensure having border grams, this character is added to the left and right of the text.
Example: when textPadding is a space ' ', the text input "foo" becomes " foo ", ensuring that n-grams like " f" are created.
If the text already has that character in that position (e.g. it already starts with it), it is not added there.
- Parameters:
textPadding
- for example a space ' '.
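The padding rule described above can be sketched in a few lines. This is a minimal illustration, not the class's actual applyPadding implementation: the class name PaddingSketch and the two-argument signature are hypothetical, and returning an empty string for empty input is an assumption.

```java
final class PaddingSketch {
    /**
     * Adds the padding character to each side of the text,
     * skipping a side that already starts or ends with it.
     */
    static String applyPadding(CharSequence text, char pad) {
        int len = text.length();
        if (len == 0) {
            return ""; // assumption: empty input stays empty
        }
        StringBuilder sb = new StringBuilder(len + 2);
        if (text.charAt(0) != pad) {
            sb.append(pad); // left border character
        }
        sb.append(text);
        if (text.charAt(len - 1) != pad) {
            sb.append(pad); // right border character
        }
        return sb.toString();
    }
}
```

Under these assumptions, applyPadding("foo", ' ') gives " foo ", and applyPadding(" foo", ' ') also gives " foo " because the leading space is not duplicated.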
-
getGramLengths
-
extractGrams
Creates the n-grams for a given text in the order they occur.
Example: extractSortedGrams("Foo bar", 2) => [Fo,oo,o , b,ba,ar]
- Parameters:
text
- Returns:
- The grams; empty if the input was empty or if no gram of that gramLength fits.
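To make the ordering concrete, here is a minimal standalone sketch of sliding-window extraction for a single gram length. It is not the library's code: the class GramSketch and its static method are hypothetical, and filtering and padding are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class GramSketch {
    /** Returns all substrings of the given length, in order of occurrence. */
    static List<String> extractGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be positive");
        }
        // A text of n characters holds n - gramLength + 1 windows, or none if too short.
        int count = Math.max(text.length() - gramLength + 1, 0);
        List<String> grams = new ArrayList<>(count);
        for (int start = 0; start < count; start++) {
            grams.add(text.subSequence(start, start + gramLength).toString());
        }
        return grams;
    }
}
```

For the input "Foo bar" and gram length 2 this produces the six grams "Fo", "oo", "o ", " b", "ba", "ar". A one-character input with gram length 2 yields an empty list.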
-
extractCountedGrams
@NotNull public Map<String,Integer> extractCountedGrams(@NotNull CharSequence text)
- Returns:
- Key = n-gram, value = count. The entries are ordered by the first appearance of each n-gram in the string.
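A map that counts n-grams while keeping first-appearance order can be sketched with a LinkedHashMap. The following is an illustration under assumptions, not the library's implementation: CountedGramSketch is a hypothetical name, and it handles a single gram length without filter or padding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CountedGramSketch {
    /** Counts each n-gram; iteration order follows first appearance in the text. */
    static Map<String, Integer> extractCountedGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be positive");
        }
        // LinkedHashMap iterates in insertion order, and re-inserting
        // an existing key does not change its position.
        Map<String, Integer> grams = new LinkedHashMap<>();
        for (int start = 0; start + gramLength <= text.length(); start++) {
            String gram = text.subSequence(start, start + gramLength).toString();
            grams.merge(gram, 1, Integer::sum); // insert 1 or add 1 to the count
        }
        return grams;
    }
}
```

For "banana" with gram length 2, the keys come out as "ba", "an", "na" with counts 1, 2 and 2.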
-
_extractCounted
-
guessNumDistinctiveGrams
private static int guessNumDistinctiveGrams(int textLength, int gramLength)
This is trying to be smart. The estimate also depends on the script (alphabetic scripts yield fewer distinct n-grams than ideographic ones), so I'm not sure how good it really is. It is just trying to prevent array copies, and for Latin it seems to work fine.
-
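The heuristic used by guessNumDistinctiveGrams is not documented here, so the sketch below does not reproduce it. It only illustrates the general idea of presizing a collection to avoid internal array copies, using the hard upper bound on the number of grams instead of a guess. The names CapacitySketch, maxGrams and presizedMap are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CapacitySketch {
    /** Upper bound: a text of n characters has at most n - k + 1 grams of length k. */
    static int maxGrams(int textLength, int gramLength) {
        return Math.max(textLength - gramLength + 1, 0);
    }

    /**
     * Creates a map able to hold the expected number of entries without
     * resizing, given the default load factor of 0.75.
     */
    static <K, V> Map<K, V> presizedMap(int expectedEntries) {
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        return new LinkedHashMap<>(initialCapacity);
    }
}
```

A smarter estimate, as the method's description suggests, would come in below this bound for long texts, since many grams repeat. The bound wastes some memory in exchange for never resizing.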
applyPadding
-